Python 3 Web Crawler, Part 2: Extracting Page Source and Images

1. Extracting the page source

import urllib.request

def saveFile(data):
    # Save the raw response bytes to a local file
    path = r'G:\douban.out'
    with open(path, 'wb') as f:
        f.write(data)

url = "http://www.douban.com"
# Send a browser-like User-Agent instead of urllib's default one
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/51.0.2704.63 Safari/537.36'}
req = urllib.request.Request(url=url, headers=headers)
res = urllib.request.urlopen(req)
data = res.read()  # bytes, not str

saveFile(data)
# data = data.decode('utf-8')  # uncomment to print text instead of bytes

print(data)

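Result

The script prints the raw bytes of the Douban home page and saves the same bytes to G:\douban.out.

To compare with what a browser receives, view the Douban page source at: view-source:https://www.douban.com/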
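
The commented-out decode line in the script points at the next step: turning the response bytes into text. Below is a minimal sketch, assuming the page is UTF-8 unless the Content-Type header names another charset. The helper name decode_body and the sample bytes are hypothetical; the sample stands in for res.read() so the example runs without a network connection.

```python
import re

def decode_body(data, content_type='', default='utf-8'):
    """Decode response bytes using the charset named in Content-Type, if any."""
    match = re.search(r'charset=([\w-]+)', content_type, re.IGNORECASE)
    charset = match.group(1) if match else default
    # errors='replace' substitutes bad bytes instead of raising UnicodeDecodeError
    return data.decode(charset, errors='replace')

# Sample bytes standing in for res.read(); this is not real Douban output
sample = '<title>豆瓣</title>'.encode('utf-8')
text = decode_body(sample, 'text/html; charset=utf-8')
print(text)
```

With a live response, the header value would come from res.headers.get('Content-Type', '').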

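Key points:
1. urllib.request: Request builds the HTTP request and urlopen sends it and returns the response.
2. Getting the values for Request headers: capture a real browser request with Fiddler and copy its headers.
3. Open question: is the browser's "view source" any different from what the Python crawler does? Both send an HTTP request and show the HTML the server returns. The browser does not use Python to do this, and the two results can differ when the server responds differently to different headers or cookies.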
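
To make point 2 concrete, here is a small sketch of attaching a captured User-Agent to a request and inspecting it before anything is sent. The helper name build_request is hypothetical; the User-Agent string is the one used in this post. Nothing here touches the network.

```python
import urllib.request

# User-Agent string copied from the script in this post
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36')

def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url=url, headers={'User-Agent': user_agent})

req = build_request('https://www.douban.com/')
print(req.get_method(), req.full_url)
# Request stores header names capitalized, so the key reads 'User-agent'
print(req.header_items())
```

Passing req to urllib.request.urlopen would then send the request with that header.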

2. Extracting images from a page

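# Import the required libraries
import os
import re
import urllib.request

# Directory where the images are saved
targetPath = r'G:\pictureout'

def saveFile(path):
    # Create the target directory if it does not exist yet
    if not os.path.isdir(targetPath):
        os.mkdir(targetPath)

    # Build the local path for this image from the last segment of its URL
    pos = path.rindex('/')
    t = os.path.join(targetPath, path[pos+1:])
    return t

# Use if __name__ == '__main__' to check whether this .py file is being run directly
if __name__ == '__main__':
    # URL of the page to scan
    url = "https://www.douban.com/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/51.0.2704.63 Safari/537.36'
    }

    req = urllib.request.Request(url=url, headers=headers)
    res = urllib.request.urlopen(req)
    data = res.read()

    # Decode the bytes so the regex runs on the page text
    html = data.decode('utf-8', errors='ignore')

    # Match https URLs ending in jpg, png or gif; [^\s] means any non-whitespace character.
    # set() removes duplicate links.
    for link, t in set(re.findall(r'(https:[^\s]*?(jpg|png|gif))', html)):
        print(link)
        try:
            urllib.request.urlretrieve(link, saveFile(link))
        except Exception:
            print('Failed')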
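
The link-matching step can be tried on its own, without downloading anything. The sketch below is a slightly stricter variant of the pattern above, not the script's exact regex: it also stops at quotes and angle brackets and requires a dot before the extension. The name find_image_links and the HTML fragment are made up for illustration.

```python
import re

# Match https URLs ending in a common image extension;
# stop at whitespace, quotes, or angle brackets
IMAGE_URL = re.compile(r'https:[^\s"\'<>]*?\.(?:jpg|png|gif)')

def find_image_links(html):
    """Return the unique image URLs found in html, sorted for stable output."""
    return sorted(set(IMAGE_URL.findall(html)))

# Made-up HTML fragment, not real Douban markup
sample = '''
<img src="https://img.example.com/a/cover.jpg" alt="cover">
<img src="https://img.example.com/b/logo.png">
<img src="https://img.example.com/a/cover.jpg">
<a href="https://example.com/page.html">link</a>
'''
for link in find_image_links(sample):
    print(link)
```

The duplicate cover.jpg link is reported once, and the page.html link is skipped because it does not end in an image extension.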

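Key points:
1. os.mkdir(targetPath): mkdir stands for "make directory"; it creates the directory targetPath.
2. Building the local path:
pos = path.rindex('/')
t = os.path.join(targetPath,path[pos+1:])

These two lines build the local path of the image. The first line finds the index of the last / in the URL; the second joins the local directory with the file name. pos is the position of that / in the string, so path[pos+1:] is the file name.

Python's rindex() returns the index of the last occurrence of a substring in a string. For example, with
path = https://wx1.sinaimg.cn/mw600/95e71c7fgy1fecsw86560j20dw0kstbp.jpg
path.rindex('/') is the index of the / just before the file name, so:

path[pos+1:] = 95e71c7fgy1fecsw86560j20dw0kstbp.jpg
targetPath = /Users/wangshengquan/Pictures/PythonImage/

os.path.join(targetPath,path[pos+1:]) = /Users/wangshengquan/Pictures/PythonImage/95e71c7fgy1fecsw86560j20dw0kstbp.jpg
os.path.join(path1[, path2[, ...]])  # joins a directory and a file name into one path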

>>> import os
>>> path=r'https://wx1.sinaimg.cn/mw600/95e71c7fgy1fecsw86560j20dw0kstbp.jpg'
>>> path.rindex('/')
28
>>> os.path.join(r'G:\chengxu',path[29:])
'G:\\chengxu\\95e71c7fgy1fecsw86560j20dw0kstbp.jpg'
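
As an alternative to rindex and slicing, the file name can be taken from the URL with the standard library's URL and path helpers. This is a sketch under stated assumptions, not the method used in the script above: the name local_path_for is hypothetical, and a temporary directory stands in for the real target folder so the example runs on any machine.

```python
import os
import posixpath
import tempfile
from urllib.parse import urlsplit

def local_path_for(url, target_dir):
    """Create target_dir if needed and return target_dir/<file name from url>."""
    # makedirs with exist_ok=True also creates missing parent directories
    # and does not fail when the directory already exists
    os.makedirs(target_dir, exist_ok=True)
    # urlsplit(url).path drops any ?query or #fragment before taking the base name
    name = posixpath.basename(urlsplit(url).path)
    return os.path.join(target_dir, name)

url = 'https://wx1.sinaimg.cn/mw600/95e71c7fgy1fecsw86560j20dw0kstbp.jpg'
# A temporary directory stands in for G:\pictureout
with tempfile.TemporaryDirectory() as tmp:
    print(local_path_for(url, os.path.join(tmp, 'pictureout')))
```

Unlike plain slicing, this keeps the file name clean when the image URL carries a query string such as ?size=large.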
