Python Data Visualization: Who Are the Python Heavyweights?


Author: 法纳斯特 (this article is a reader submission; see the bio at the end)

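Earlier posts covered proxy pools and Cookies. Here I put both into practice by scraping WeChat public account articles through Sogou Search.

In Cui's book, proxy IPs are used to get around Sogou's anti-scraping measures, because an IP that requests pages too often gets redirected to a CAPTCHA page.

Times change, though, and Sogou's anti-scraping has been updated: it now checks both the IP and the Cookies.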

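/ 01 / Page Analysis

The goal is to collect four fields for each WeChat article: title, opening text, account name, and publication time.

The request method is GET. The request URL is the part highlighted by the red box in the screenshot; the parameters after it are not needed.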
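As a minimal sketch of the point above, the search URL can be assembled from just the parameters that the scraper later in this article actually sends. The helper name `build_search_url` is made up for illustration; the parameter names and values are copied from the URL used in the scraping code.

```python
from urllib.parse import urlencode

BASE_URL = "http://weixin.sogou.com/weixin"


def build_search_url(query: str, page: int) -> str:
    """Build a search URL containing only the parameters the scraper uses."""
    params = {
        "query": query,  # search keyword
        "type": 2,       # value taken from the article's own URL
        "page": page,    # 1-based result page
        "ie": "utf8",    # input encoding
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("python", 3))
```

Using `urlencode` also takes care of escaping keywords that contain spaces or non-ASCII characters.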

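/ 02 / Bypassing the Anti-Scraping Measures

When does the CAPTCHA page shown in the screenshot above appear?

In two cases: when the same IP requests pages repeatedly, and when the same Cookies request pages repeatedly.

If both apply, you get blocked even faster! I only managed a complete scrape once...

At first I set nothing at all, and the CAPTCHA page appeared. Then I added proxy IPs and was still redirected to the CAPTCHA page. Only after I also changed the Cookies did the scrape succeed.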
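A scraper needs some way to notice that it has been redirected to the CAPTCHA page. The following is only a heuristic sketch: it assumes the block page carries `antispider` in its URL (an assumption, not something stated in this article) and that a normal result page contains the `news-list` element that the scraping code below parses. The function name `looks_blocked` is hypothetical.

```python
def looks_blocked(final_url: str, html: str) -> bool:
    """Heuristically decide whether a response is a CAPTCHA page.

    Assumptions: a block page has 'antispider' in its URL, and a normal
    result page contains the 'news-list' element.
    """
    if "antispider" in final_url:
        return True
    return "news-list" not in html


print(looks_blocked("http://weixin.sogou.com/antispider/?from=x", ""))
print(looks_blocked("http://weixin.sogou.com/weixin?query=python",
                    '<ul class="news-list"></ul>'))
```

With `requests`, the two arguments would come from `response.url` and `response.text`.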

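01 Proxy IP Setup

import pandas as pd


def get_proxies(i):
    """
    Return the proxy to use for page i, in the dict form requests expects.
    """
    df = pd.read_csv('sg_effective_ip.csv', header=None, names=["proxy_type", "proxy_url"])
    proxy_type = [str(t) for t in df['proxy_type']]
    proxy_url = [str(u) for u in df['proxy_url']]
    # Pick one row per page. The key must be a single string, not the whole list;
    # the modulo wraps around when there are fewer proxies than pages.
    index = (i - 1) % len(df)
    proxies = {proxy_type[index]: proxy_url[index]}
    return proxies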

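I won't repeat how to obtain and use proxies here. Earlier articles covered it, so take a look if you are interested.

After two days of trying, I can say that free proxy IPs are of little use: my real IP was exposed almost immediately.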
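To make the proxy format concrete, here is a small standard-library sketch of loading proxy rows and rotating through them by page number. It assumes each CSV row looks like `http,http://host:port`, which matches the two columns (`proxy_type`, `proxy_url`) read above but is otherwise a guess about the file's contents. The function names and the sample addresses are made up.

```python
import csv
import io


def load_proxies(csv_file):
    """Read (scheme, url) rows, e.g. 'http,http://192.0.2.10:8080'."""
    return [(row[0], row[1]) for row in csv.reader(csv_file) if len(row) >= 2]


def proxy_for_page(proxy_rows, page):
    """Pick a proxy round-robin and return it as a {scheme: url} dict."""
    scheme, url = proxy_rows[(page - 1) % len(proxy_rows)]
    return {scheme: url}


# Made-up addresses from a documentation-only range, for illustration.
sample = io.StringIO("http,http://192.0.2.10:8080\nhttp,http://192.0.2.11:3128\n")
rows = load_proxies(sample)
print(proxy_for_page(rows, 1))
print(proxy_for_page(rows, 3))  # wraps around to the first proxy
```

The resulting dict is the shape that the `proxies=` argument of `requests.get` accepts.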

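02 Cookies Setup

import random
import re
import time

import requests


def get_cookies_snuid():
    """
    Fetch a fresh SNUID value.
    """
    time.sleep(float(random.randint(2, 5)))
    url = "http://weixin.sogou.com/weixin?type=2&s_from=input&query=python&ie=utf8&_sug_=n&_sug_type_="
    # Fill in the values from your own browser session
    headers = {"Cookie": "ABTEST=your_value;IPLOC=CN3301;SUID=your_value;SUIR=your_value"}
    # HEAD request: fetch only the response headers
    response = requests.head(url, headers=headers).headers
    # Allow optional whitespace between ';' and 'expires'
    result = re.findall(r'SNUID=(.*?);\s*expires', response['Set-Cookie'])
    SNUID = result[0]
    return SNUID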

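Overall, getting the Cookies right is the most important part of beating the anti-scraping checks, and the key is to change the SNUID value dynamically.

I won't go into the reasons in detail. I only worked this out from posts by experts online, and my understanding is still shallow.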
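The SNUID extraction can be sketched in isolation, without touching the network. The snippet below pulls the value out of a `Set-Cookie` header string and returns `None` instead of raising when the cookie is missing. The sample header is invented for illustration and the name `extract_snuid` is hypothetical.

```python
import re
from typing import Optional

SNUID_PATTERN = re.compile(r"SNUID=([^;]+)")


def extract_snuid(set_cookie: str) -> Optional[str]:
    """Return the SNUID value from a Set-Cookie header, or None if absent."""
    match = SNUID_PATTERN.search(set_cookie)
    return match.group(1) if match else None


# Invented header, shaped like a typical Set-Cookie value.
sample_header = (
    "SNUID=0123456789ABCDEF; expires=Thu, 01-Jan-2099 00:00:00 GMT; "
    "domain=sogou.com; path=/"
)
print(extract_snuid(sample_header))
print(extract_snuid("ABTEST=1; path=/"))
```

Returning `None` lets the caller decide whether to retry or skip the page when no new SNUID is issued.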

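A full 100-page scrape succeeded only once. Other runs died at 75 pages or 50 pages, and towards the end some were blocked on the very first request...

I have no wish to get stuck in the "scrape, anti-scrape, anti-anti-scrape" quagmire. What comes after scraping is my real goal: data analysis and data visualization.

So I grabbed one big haul and got out quickly. Hats off to the Sogou engineers.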

/ 03 / Data Collection

01 Building the Request Headers

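head = """
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Connection: keep-alive
Host: weixin.sogou.com
Referer: http://weixin.sogou.com/
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
"""

# Does not include the SNUID value. get_message() appends this to head as one
# more header line, so it must have the form 'Cookie: name=value; ...;'
cookie = 'your Cookies'


def str_to_dict(header):
    """
    Build a request-header dict; different functions can build different headers.
    """
    header_dict = {}
    header = header.split('\n')
    for h in header:
        h = h.strip()
        if h:
            # Split on the first colon only, so values such as URLs stay intact
            k, v = h.split(':', 1)
            header_dict[k] = v.strip()
    return header_dict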
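To show what this kind of header parsing produces, here is a self-contained sketch of the same idea with a small sample block. It is a variant rather than the author's function: `parse_header_block` is a hypothetical name, and unlike the original it skips lines without a colon instead of raising. The cookie values are placeholders.

```python
def parse_header_block(block: str) -> dict:
    """Turn 'Name: value' lines into a dict, splitting on the first colon only."""
    headers = {}
    for line in block.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers


raw = """
Host: weixin.sogou.com
Referer: http://weixin.sogou.com/
Cookie: SUID=example; SNUID=example;
"""
parsed = parse_header_block(raw)
print(parsed)
```

Splitting on the first colon only is what keeps the `http://` in the `Referer` value from being cut apart.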

02 Fetching Page Data

import random
import re
import time

import requests
from bs4 import BeautifulSoup


def get_message():
    """
    Scrape the search result pages and append each article to a CSV file.
    """
    failed_list = []
    for i in range(1, 101):
        print('Page ' + str(i))
        # Delay between requests; a Baidu search suggested 15 s or more avoids a ban
        delay = float(random.randint(15, 20))
        print(delay)
        time.sleep(delay)
        # Switch to a new SNUID value every 10 pages
        if (i - 1) % 10 == 0:
            value = get_cookies_snuid()
            snuid = 'SNUID=' + value + ';'
        # Set the Cookies
        cookies = cookie + snuid
        url = 'http://weixin.sogou.com/weixin?query=python&type=2&page=' + str(i) + '&ie=utf8'
        host = cookies + '\n'
        header = head + host
        headers = str_to_dict(header)
        # Set the proxy IP
        proxies = get_proxies(i)
        try:
            response = requests.get(url=url, headers=headers, proxies=proxies)
            html = response.text
            soup = BeautifulSoup(html, 'html.parser')
            data = soup.find_all('ul', {'class': 'news-list'})
            lis = data[0].find_all('li')
            for li in lis:
                # Title and opening text; ASCII commas are replaced so they
                # do not break the CSV columns
                h3 = li.find_all('h3')
                title = h3[0].get_text().replace('\n', '').replace(',', '，')

                p = li.find_all('p')
                article = p[0].get_text().replace(',', '，')

                # Account name
                a = li.find_all('a', {'class': 'account'})
                name = a[0].get_text()

                # Publication date, from the 10-digit Unix timestamp in the span
                span = li.find_all('span', {'class': 's2'})
                cmp = re.findall(r"\d{10}", span[0].get_text())
                date = time.strftime("%Y-%m-%d", time.localtime(int(cmp[0])))

                with open('sg_articles.csv', 'a+', encoding='utf-8-sig') as f:
                    f.write(title + ',' + article + ',' + name + ',' + date + '\n')
            print('Page ' + str(i) + ' succeeded')
        except Exception as e:
            print('Page ' + str(i) + ' failed:', e)
            failed_list.append(i)
            continue
    # Pages that failed
    print(failed_list)


def main():
    get_message()


if __name__ == '__main__':
    main()
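The date handling in the scraper above relies on a 10-digit Unix timestamp embedded in the text of the `s2` span. A minimal sketch of that step follows. It pins the time zone to UTC+8 so the result does not depend on the machine's local settings, which is a swapped-in choice (the original uses `time.localtime`). The sample span text and the name `extract_date` are made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

CHINA_TZ = timezone(timedelta(hours=8))  # fixed UTC+8 offset


def extract_date(span_text: str) -> Optional[str]:
    """Find a 10-digit Unix timestamp and format it as YYYY-MM-DD (UTC+8)."""
    match = re.search(r"\d{10}", span_text)
    if match is None:
        return None
    moment = datetime.fromtimestamp(int(match.group()), tz=CHINA_TZ)
    return moment.strftime("%Y-%m-%d")


# Invented span text containing a timestamp.
sample_span = "document.write(timeConvert('1526012173'))"
print(extract_date(sample_span))
```

Returning `None` for a span without a timestamp avoids the `IndexError` that indexing an empty `findall` result would raise.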

In the end the data was collected successfully.
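The scraper keeps the CSV columns intact by replacing ASCII commas in titles and openings. An alternative, sketched here with the standard `csv` module, is to let the writer quote such fields so the text is stored unchanged. This is a different technique from the one used above, shown for comparison; the sample row and the name `append_rows` are invented.

```python
import csv
import io


def append_rows(csv_file, rows):
    """Write rows with csv.writer, which quotes fields containing commas."""
    writer = csv.writer(csv_file)
    writer.writerows(rows)


# An in-memory buffer stands in for the real output file.
buffer = io.StringIO()
append_rows(buffer, [("Learn Python, fast", "An intro...", "SomeAccount", "2018-05-11")])

# Reading the data back yields the original four fields, comma included.
buffer.seek(0)
restored = next(csv.reader(buffer))
print(restored)
```

With a real file, the `csv` documentation recommends opening it with `newline=''`, for example `open('sg_articles.csv', 'a', newline='', encoding='utf-8-sig')`.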

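/ 04 / Data Visualization

01 Top 10 Accounts by Number of Articles Published

Ranking the accounts by how many of their articles appear in the search results turns up these ten Python heavyweights.

I would really like to know whether they are run by teams or by individuals. Never mind, I'll follow them first.

The result is probably also influenced by my using "Python" as the search keyword: almost every account name contains "Python" (CSDN is the exception).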

from pyecharts import Bar  # pyecharts 0.5.x API
import pandas as pd

df = pd.read_csv('sg_articles.csv', header=None, names=["title", "article", "name", "date"])

# Extract the publication year from each date
df['year'] = [d.split('-')[0] for d in df['date']]

# Keep the articles published in 2018 and count them per account
df = df.loc[df['year'] == '2018']
place_message = df.groupby(['name'])
place_com = place_message['name'].agg(['count'])
place_com.reset_index(inplace=True)
place_com_last = place_com.sort_index()
dom = place_com_last.sort_values('count', ascending=False)[0:10]

attr = dom['name']
v1 = dom['count']
bar = Bar("Top 10 Accounts by Articles Published", title_pos='center', title_top='18', width=800, height=400)
bar.add("", attr, v1, is_convert=True, xaxis_min=10, yaxis_rotate=30,
        yaxis_label_textsize=10, is_yaxis_boundarygap=True, yaxis_interval=0,
        is_label_show=True, is_legend_show=False, label_pos='right',
        is_yaxis_inverse=True, is_splitline_show=False)
bar.render("wechat_articles_top10.html")
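The counting logic behind this chart (filter by year, count per account, take the ten largest) can be illustrated without pandas. The sketch below uses `collections.Counter` on a few invented rows; the account names and the helper `top_accounts` are hypothetical.

```python
from collections import Counter


def top_accounts(rows, year, n=10):
    """Count articles per account for one year and return the n largest."""
    counts = Counter(name for name, date in rows if date.startswith(year + "-"))
    return counts.most_common(n)


# Invented (account, date) pairs for illustration.
sample_rows = [
    ("AccountA", "2018-03-01"),
    ("AccountB", "2018-04-12"),
    ("AccountA", "2018-05-20"),
    ("AccountA", "2017-11-02"),  # filtered out: not 2018
]
print(top_accounts(sample_rows, "2018", n=2))
```

`most_common` returns the pairs already sorted by count in descending order, which is the same ordering the `sort_values('count', ascending=False)` call produces.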

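02 Distribution of Article Publication Times

Some of the articles returned by the search were published before 2018. I drop those here and check when the remaining articles were published.

Information is only useful while it is current. If everything I collected were hopelessly out of date it would be of little interest, all the more so in an industry that changes as fast as the internet.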

import pandas as pd
from pyecharts import Bar  # pyecharts 0.5.x API

df = pd.read_csv('sg_articles.csv', header=None, names=["title", "article", "name", "date"])

# Extract the publication year and month from each date
df['year'] = [d.split('-')[0] for d in df['date']]
df['month'] = [d.split('-')[1] for d in df['date']]

# Keep the articles published in 2018 and count them per month
df = df.loc[df['year'] == '2018']
month_message = df.groupby(['month'])
month_com = month_message['month'].agg(['count'])
month_com.reset_index(inplace=True)
month_com_last = month_com.sort_index()

# Build the labels from the months actually present, so labels and counts
# always line up (the original hard-coded months 1 to 11)
attr = ["Month {}".format(int(m)) for m in month_com_last['month']]
v1 = [int(c) for c in month_com_last['count']]
bar = Bar("Article Publication Time Distribution", title_pos='center', title_top='18', width=800, height=400)
bar.add("", attr, v1, is_stack=True, is_label_show=True)
bar.render("wechat_articles_by_month.html")
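One pitfall with per-month counts is that a month with no articles simply disappears from a group-by result, so labels and values can drift out of step. As a standard-library sketch (not the author's code), the helper below always returns twelve values and fills missing months with zero. The sample dates and the name `monthly_counts` are invented.

```python
from collections import Counter


def monthly_counts(dates, year):
    """Return (months, counts) for one year, with zero for empty months."""
    counts = Counter(int(d[5:7]) for d in dates if d.startswith(year + "-"))
    months = list(range(1, 13))
    return months, [counts.get(m, 0) for m in months]


# Invented dates in YYYY-MM-DD form.
sample_dates = ["2018-01-15", "2018-03-02", "2018-03-30", "2017-12-31"]
months, values = monthly_counts(sample_dates, "2018")
print(list(zip(months, values)))
```

Because both lists always have length twelve, they can be passed to a bar chart as the category axis and the values without any alignment checks.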

03 Word Clouds of Titles and Article Openings

from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import pandas as pd
import jieba

df = pd.read_csv('sg_articles.csv', header=None, names=["title", "article", "name", "date"])

text = ''
# For the article-opening word cloud, loop over df['article'].astype(str) instead
for line in df['title']:
    # Join the segmented words with spaces so WordCloud can tell them apart
    text += ' '.join(jieba.cut(line, cut_all=False)) + ' '
background_image = plt.imread('python_logo.jpg')
wc = WordCloud(
    background_color='white',
    mask=background_image,
    # Raw string, so the backslashes in the Windows path are kept literally
    font_path=r'C:\Windows\Fonts\STZHONGS.TTF',
    max_words=2000,
    max_font_size=150,
    random_state=30
)
wc.generate_from_text(text)
img_colors = ImageColorGenerator(background_image)
wc.recolor(color_func=img_colors)
plt.imshow(wc)
plt.axis('off')
# For the article-opening word cloud, use wc.to_file("articles.jpg") instead
wc.to_file("titles.jpg")
print('Word cloud generated!')
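Why the segmented words need a separator can be shown with a small standard-library sketch. It tokenizes text with the pattern `\w[\w']+`, which is assumed here to be close to what WordCloud uses by default (check the WordCloud documentation for your version). The "segmented" string is hand-written to mimic a word segmenter's output and is not real jieba output.

```python
import re
from collections import Counter

# Assumed to resemble WordCloud's default token pattern.
TOKEN_PATTERN = re.compile(r"\w[\w']+")


def token_counts(text):
    """Count the tokens that the pattern finds in the text."""
    return Counter(TOKEN_PATTERN.findall(text))


unsegmented = "Python爬虫入门Python爬虫实战"
segmented = "Python 爬虫 入门 Python 爬虫 实战"  # hand-written, space-separated

print(token_counts(unsegmented))  # the whole string is one token
print(token_counts(segmented))    # individual words are counted
```

Without spaces, a run of Chinese characters and letters matches as a single long token, so no individual word ever accumulates a meaningful frequency.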

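In the word cloud of article titles, "Python" is bound to feature prominently, since it was the search keyword.

The other words that stand out are web scraping, data analysis, machine learning, and artificial intelligence, which gives a good picture of what Python is mainly used for today.

Python is also used for web development, GUI development, and so on. Those barely show up here, which suggests they are not the mainstream topics among these articles.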

[About the Author]

法纳斯特 is a Python enthusiast who focuses on web scraping, data analysis, and visualization.
