Building on the previous lessons, we now have the basic knowledge needed to scrape a website's content. As an example, let's scrape the top 100 ranked movies from the Maoyan movie site at https://maoyan.com/board/4.
Visiting https://maoyan.com/board/4, we find that each page lists 10 movies in ranking order. The second page is https://maoyan.com/board/4?offset=10 and the third is https://maoyan.com/board/4?offset=20, so we can infer that page N is https://maoyan.com/board/4?offset=(N-1)*10 (this holds for N >= 1; page 1 has offset 0). By looping over the offset ten times, we can fetch all top 100 movies.
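The offset pattern above can be sketched as a small helper that generates the ten page URLs (the function name board_urls is illustrative, not from the original):

```python
# Sketch: generate the ten board URLs from the offset pattern above.
BASE = 'https://maoyan.com/board/4?offset={}'

def board_urls(pages=10):
    # Page N (1-based) has offset (N - 1) * 10.
    return [BASE.format((n - 1) * 10) for n in range(1, pages + 1)]

for url in board_urls():
    print(url)
```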
Let's start by fetching the first page:
import requests

def get_one_page(url):
    # Fetch one page and return its HTML, or None on a non-200 response.
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    return None

def main():
    url = 'https://maoyan.com/board/4'
    html = get_one_page(url)
    print(html)

main()
Run it and check the result: we have fetched the first page's content. Of course, besides the top 10 movies, it also contains the rest of the page's markup. We won't filter it for now, since we haven't yet covered parsing HTML in Python, and doing it with regular expressions would be too cumbersome.
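One caveat: sites like Maoyan may reject requests that don't look like they come from a browser, so if the response is empty or a verification page, try sending a User-Agent header. This is a hedged sketch (the exact anti-scraping behavior can change, and the UA string below is just a common browser example):

```python
import requests

# Illustrative browser User-Agent string; any mainstream browser UA should do.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0 Safari/537.36')
}

def get_one_page(url):
    # Pass headers so the request looks like an ordinary browser visit;
    # timeout avoids hanging forever on a slow server.
    response = requests.get(url, headers=HEADERS, timeout=10)
    if response.status_code == 200:
        return response.text
    return None
```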
Now let's put the full code together:
import requests
from requests.exceptions import RequestException
import time

def get_one_page(url):
    # Fetch one page; return its HTML, or None on any error or non-200 status.
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

def write_to_file(content):
    # Append the page content to result.txt.
    with open('result.txt', 'a') as f:
        f.write(content)

def main(offset):
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html:  # skip writing when the fetch failed and returned None
        write_to_file(html)

if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)  # pause between requests to be polite to the server
In this example, the write_to_file function appends each page's content to a file; since the output is large and messy, it is not shown here.
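One detail worth noting when writing to the file: open() without an explicit encoding uses the platform default, which on some systems (e.g. GBK on Windows) can garble or fail on Chinese text. A safer variant of write_to_file might look like this (the path parameter is an addition for illustration):

```python
def write_to_file(content, path='result.txt'):
    # Append with an explicit UTF-8 encoding so Chinese text survives intact.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(content)
```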
In the next section, we will parse this HTML.