# Match by CSS class
month = soup.find_all('li', {"class": "month"})
for m in month:
    print(m.get_text())
一月
二月
三月
四月
五月
jan = soup.find('ul', {'class': 'jan'})
d_jan = jan.find_all('li')
for d in d_jan:
    print(d.get_text())
一月一号
一月二号
一月三号
Parsing web pages with BeautifulSoup: regular expressions
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/table.html"
).read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
img_links = soup.find_all('img', {'src': re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])
Mini exercise: randomly crawl Baidu Baike
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random
# set the start page and keep the /item/... sub-URLs in 'his' as a record of the pages we have visited
base_url = "https://baike.baidu.com"
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]
# select the last sub url in 'his', print the title and url
url = base_url + his[-1]
html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
print(soup.find('h1').get_text(), ' url: ', his[-1])
# find valid urls
sub_urls = soup.find_all('a', {'target': '_blank', 'href': re.compile('/item/(%.{2})+$')})
if len(sub_urls) != 0:
    his.append(random.sample(sub_urls, 1)[0]['href'])
else:
    # no valid sub link found
    his.pop()
print(his)
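The snippet above takes only a single random step. To keep wandering through Baidu Baike, the same logic can be wrapped in a loop; a minimal sketch reusing base_url and his from above (20 iterations is an arbitrary cap, not from the original code):
for i in range(20):
    url = base_url + his[-1]
    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), ' url: ', his[-1])
    # collect candidate /item/... links and hop to a random one, or step back if none exist
    sub_urls = soup.find_all('a', {'target': '_blank', 'href': re.compile('/item/(%.{2})+$')})
    if len(sub_urls) != 0:
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        his.pop()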
Downloading files
import os
os.makedirs('./images/', exist_ok=True)   # create a folder for the downloads
IMAGE_URL='https://morvanzhou.github.io/static/img/description/learning_step_flowchart.png'
Using urlretrieve
from urllib.request import urlretrieve   # urllib provides a download function, urlretrieve
urlretrieve(IMAGE_URL, './images/Image1.png')
('./images/Image1.png', <http.client.HTTPMessage at 0x7fb8c834c0f0>)
Using requests
import requests
r = requests.get(IMAGE_URL)
with open('./images/Image2.png', 'wb') as f:
    f.write(r.content)
r = requests.get(IMAGE_URL, stream=True)   # stream loading
with open('./images/Image3.png', 'wb') as f:
    for chunk in r.iter_content(chunk_size=32):
        f.write(chunk)
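As a side note (not from the original), a streamed response can also be written without an explicit chunk loop by copying the raw file object with shutil; a minimal sketch assuming the same IMAGE_URL:
import shutil
import requests
r = requests.get(IMAGE_URL, stream=True)
with open('./images/Image4.png', 'wb') as f:
    shutil.copyfileobj(r.raw, f)   # stream the undecoded body straight to disk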
Mini exercise: download some nice images
from bs4 import BeautifulSoup
import requests
URL = "http://www.nationalgeographic.com.cn/animals/"
html = requests.get(URL).text
soup = BeautifulSoup(html, 'lxml')
img_ul = soup.find_all('ul', {"class": "img_list"})
for ul in img_ul:
    imgs = ul.find_all('img')
    for img in imgs:
        url = img['src']
        r = requests.get(url, stream=True)
        image_name = url.split('/')[-1]
        with open('./images/%s' % image_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        print('Saved %s' % image_name)
Speeding up the crawler: multiprocessing (distributed) crawling
import multiprocessing as mp
import time
from urllib.request import urlopen, urljoin
from bs4 import BeautifulSoup
import re
base_url = 'https://morvanzhou.github.io/'

def crawl(url):
    response = urlopen(url)
    time.sleep(0.1)   # slight delay for downloading
    return response.read().decode()

def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {'href': re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])
    url = soup.find('meta', {'property': 'og:url'})['content']
    return title, page_urls, url
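Before timing the loops below, a quick single-page smoke test can confirm that crawl and parse work together (a minimal sketch, not part of the original timing runs):
html = crawl(base_url)
title, page_urls, url = parse(html)
print(title, url, len(page_urls))   # page title, canonical url, number of sub-links found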
Testing the plain (sequential) crawl
unseen = set([base_url,])
seen = set()
if base_url != "http://127.0.0.1:4000/":
    restricted_crawl = True
else:
    restricted_crawl = False
count, t1 = 1, time.time()
while len(unseen) != 0:   # still have some urls to visit
    if restricted_crawl and len(seen) > 20:
        break
    print('\nDistributed Crawling...')
    htmls = [crawl(url) for url in unseen]
    print('\nDistributed Parsing...')
    results = [parse(html) for html in htmls]
    print('\nAnalysing...')
    seen.update(unseen)   # mark the crawled urls as seen
    unseen.clear()        # nothing left unseen
    for title, page_urls, url in results:
        print(count, title, url)
        count += 1
        unseen.update(page_urls - seen)   # collect new urls to crawl
print('Total time: %.1f s' % (time.time()-t1, ))   # 53 s
Testing the multiprocessing (distributed) crawl
unseen = set([base_url,])
seen = set()
pool = mp.Pool(4)
count, t1 = 1, time.time()
while len(unseen) != 0:   # still have some urls to visit
    if restricted_crawl and len(seen) > 20:
        break
    print('\nDistributed Crawling...')
    crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
    htmls = [j.get() for j in crawl_jobs]   # request connection
    print('\nDistributed Parsing...')
    parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
    results = [j.get() for j in parse_jobs]   # parse html
    print('\nAnalysing...')
    seen.update(unseen)   # mark the crawled urls as seen
    unseen.clear()        # nothing left unseen
    for title, page_urls, url in results:
        print(count, title, url)
        count += 1
        unseen.update(page_urls - seen)   # collect new urls to crawl
print('Total time: %.1f s' % (time.time()-t1, ))   # 16 s !!!
# asyncio: asynchronous execution in a single thread; downloading a page and processing it
# no longer have to happen back to back, so the time spent waiting for downloads is put to better use
import time

def job(t):
    print('Start job', t)
    time.sleep(t)
    print('Job', t, 'takes', t, 's')

def main():
    [job(t) for t in range(1, 3)]

t1 = time.time()
main()
print('No async total time: ', time.time()-t1)
Start job 1
Job 1 takes 1 s
Start job 2
Job 2 takes 2 s
No async total time: 3.00662899017334
import asyncio

async def job(t):   # the async version of the job
    print('Start job', t)
    await asyncio.sleep(t)   # wait t seconds; switch to other tasks in the meantime
    print('Job', t, 'takes', t, 's')

async def main(loop):   # also an async coroutine
    tasks = [loop.create_task(job(t)) for t in range(1, 3)]   # create the tasks without running them yet
    await asyncio.wait(tasks)   # run them and wait until all tasks finish

t1 = time.time()
loop = asyncio.get_event_loop()      # set up the event loop
loop.run_until_complete(main(loop))  # run main in the loop
print("Async total time:", time.time()-t1)
Start job 1
Start job 2
Job 1 takes 1 s
Job 2 takes 2 s
Async total time: 2.0041818618774414
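For reference (not in the original): on Python 3.7+ the manual loop management can be replaced by asyncio.run and asyncio.gather; a minimal sketch reusing the job coroutine above (inside a notebook, where a loop is already running, this form may not work):
async def main():
    await asyncio.gather(*(job(t) for t in range(1, 3)))   # schedule and await both jobs

t1 = time.time()
asyncio.run(main())   # creates, runs and closes a fresh event loop
print("Async total time:", time.time() - t1)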
aiohttp
$ pip3 install aiohttp
import requests

URL = 'https://morvanzhou.github.io/'

def normal():
    for i in range(2):
        r = requests.get(URL)
        url = r.url
        print(url)

t1 = time.time()
normal()
print('Normal total time:', time.time()-t1)
https://morvanzhou.github.io/
https://morvanzhou.github.io/
Normal total time: 0.6391615867614746
import aiohttp

async def job(session):
    response = await session.get(URL)   # wait here and switch to other tasks in the meantime
    return str(response.url)

async def main(loop):
    async with aiohttp.ClientSession() as session:   # the session form recommended by the official docs
        tasks = [loop.create_task(job(session)) for _ in range(2)]
        finished, unfinished = await asyncio.wait(tasks)
        all_results = [r.result() for r in finished]   # collect all results
        print(all_results)

t1 = time.time()
loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))
loop.close()
print('Async total time:', time.time()-t1)
['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/']
Async total time: 0.30881452560424805
Comparison with the multiprocessing distributed crawler
import aiohttp
import asyncio
import time
from bs4 import BeautifulSoup
from urllib.request import urljoin
import re
import multiprocessing as mp

base_url = "https://morvanzhou.github.io/"
# DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
if base_url != "http://127.0.0.1:4000/":
    restricted_crawl = True
else:
    restricted_crawl = False

seen = set()
unseen = set([base_url])
def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url

async def crawl(url, session):
    r = await session.get(url)
    html = await r.text()
    await asyncio.sleep(0.1)   # slight delay for downloading
    return html
async def main(loop):
    pool = mp.Pool(8)   # the pool size only slightly affects the speed here
    async with aiohttp.ClientSession() as session:
        count = 1
        while len(unseen) != 0:
            if restricted_crawl and len(seen) > 20:   # stop early so we don't over-crawl the site
                break
            print('\nAsync Crawling...')
            tasks = [loop.create_task(crawl(url, session)) for url in unseen]
            finished, unfinished = await asyncio.wait(tasks)
            htmls = [f.result() for f in finished]
            print('\nDistributed Parsing...')
            parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
            results = [j.get() for j in parse_jobs]
            print('\nAnalysing...')
            seen.update(unseen)
            unseen.clear()
            for title, page_urls, url in results:
                # print(count, title, url)
                unseen.update(page_urls - seen)
                count += 1

if __name__ == "__main__":
    t1 = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))
    # loop.close()
    print("Async total time: ", time.time() - t1)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-42-e494cc65d4bf> in <module>()
61 t1 = time.time()
62 loop = asyncio.get_event_loop()
---> 63 loop.run_until_complete(main(loop))
64 loop.close()
65 print("Async total time: ", time.time() - t1)
~/anaconda2/envs/python35/lib/python3.5/asyncio/base_events.py in run_until_complete(self, future)
441 Return the Future's result, or raise its exception.
442 """
--> 443 self._check_closed()
444
445 new_task = not futures.isfuture(future)
~/anaconda2/envs/python35/lib/python3.5/asyncio/base_events.py in _check_closed(self)
355 def _check_closed(self):
356 if self._closed:
--> 357 raise RuntimeError('Event loop is closed')
358
359 def _asyncgen_finalizer_hook(self, agen):
RuntimeError: Event loop is closed
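This RuntimeError typically appears when the cell is re-run after the event loop has already been closed (the earlier aiohttp example called loop.close()), so asyncio.get_event_loop() hands back a closed loop. A minimal sketch of one workaround, reusing the main(loop) coroutine above, is to create and install a fresh loop first:
loop = asyncio.new_event_loop()   # create a fresh loop instead of reusing the closed one
asyncio.set_event_loop(loop)      # make it the current loop for this thread
t1 = time.time()
loop.run_until_complete(main(loop))
print("Async total time: ", time.time() - t1)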
For comparison, here is the multiprocessing-only crawler again:
from urllib.request import urlopen, urljoin
from bs4 import BeautifulSoup
import multiprocessing as mp
import re
import time

def crawl(url):
    response = urlopen(url)
    time.sleep(0.1)   # slight delay for downloading
    return response.read().decode()

def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url
if __name__ == '__main__':
    base_url = 'https://morvanzhou.github.io/'
    # base_url = "http://127.0.0.1:4000/"
    # DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
    if base_url != "http://127.0.0.1:4000/":
        restricted_crawl = True
    else:
        restricted_crawl = False

    unseen = set([base_url,])
    seen = set()
    pool = mp.Pool(8)   # the pool size strongly affects the speed
    count, t1 = 1, time.time()
    while len(unseen) != 0:   # still have some urls to visit
        if restricted_crawl and len(seen) > 20:
            break
        print('\nDistributed Crawling...')
        crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
        htmls = [j.get() for j in crawl_jobs]   # request connection
        htmls = [h for h in htmls if h is not None]   # remove None
        print('\nDistributed Parsing...')
        parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
        results = [j.get() for j in parse_jobs]   # parse html
        print('\nAnalysing...')
        seen.update(unseen)
        unseen.clear()
        for title, page_urls, url in results:
            # print(count, title, url)
            count += 1
            unseen.update(page_urls - seen)
    print('Total time: %.1f s' % (time.time()-t1, ))
Scrapy
import scrapy

class MofanSpider(scrapy.Spider):
    name = "mofan"
    start_urls = [
        'https://morvanzhou.github.io/',
    ]
    # unseen = set()
    # seen = set()
    # we don't need these two, as scrapy deals with them automatically

    def parse(self, response):
        yield {   # return some results
            'title': response.css('h1::text').extract_first(default='Missing').strip().replace('"', ""),
            'url': response.url,
        }
        urls = response.css('a::attr(href)').re(r'^/.+?/$')   # find all sub urls
        for url in urls:
            yield response.follow(url, callback=self.parse)   # scrapy filters out duplicates automatically

# lastly, run this in a terminal:
# scrapy runspider 5-2-scrapy.py -o res.json
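For reference (not in the original), the spider can also be started from a plain Python script instead of the command line, assuming a reasonably recent Scrapy (2.1+ for the FEEDS setting); a minimal sketch that mirrors the -o res.json option above:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEEDS": {"res.json": {"format": "json"}},   # write the yielded items to res.json
})
process.crawl(MofanSpider)
process.start()   # blocks until the crawl finishes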