Scraping forum data with Python: titles, authors, click counts, and reply counts

Anonymous technical user   2020-12-27 01:02   725   0

1. Scraping and cleaning the data

(1) Fetching titles and authors and tidying the data

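import requests
from bs4 import BeautifulSoup

data_all = []
# Note: the URL does not depend on i, so this loop downloads the same
# first list page on every pass; later pages have different URLs.
for i in range(0, 10):
    url = 'http://bbs.tianya.cn/list-no02-1.shtml'
    douban_data = requests.get(url)
    soup = BeautifulSoup(douban_data.text, 'lxml')
    titles = soup.select('tr.bg td.td-title a')   # title links
    authors = soup.select('tr.bg td a.author')    # author links

    for title, author in zip(titles, authors):
        # Keep only the first whitespace-separated token of the title text.
        data = {'title': title.get_text().strip().split()[0],
                'author': author.get_text().strip()}
        data_all.append(data)
print(len(data_all))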
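
The cleaning step above keeps only the first whitespace-separated token of each title. Here is a small sketch of what that does to a title string. The sample text and the helper names `first_token` and `collapse_whitespace` are made up for illustration and are not part of the original post.

```python
def first_token(text):
    """Mirror the post's cleaning: strip, split on whitespace, keep the first piece."""
    parts = text.strip().split()
    return parts[0] if parts else ""


def collapse_whitespace(text):
    """Alternative: keep the whole title but collapse runs of whitespace."""
    return " ".join(text.split())


raw_title = "\n\t  Weekend trip   report \r\n"  # hypothetical cell text
print(first_token(raw_title))
print(collapse_whitespace(raw_title))
```

If titles on the forum contain spaces, the second helper may be the safer choice, since the first one drops everything after the first space. It also avoids an IndexError on an empty cell.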

(2) Fetching click and reply counts (this should really be done in a loop, because every page has a different URL)

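import requests
from bs4 import BeautifulSoup

# One list page; the nextid query parameter changes from page to page.
url = 'http://bbs.tianya.cn/list.jsp?item=no02&nextid=1556923587000'
douban_data = requests.get(url)
soup = BeautifulSoup(douban_data.text, 'lxml')
a_all = soup.select('td')  # every table cell on the page, in document order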
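
Since each page has its own URL, one way to loop is to read the next page's address out of the page just downloaded. The sketch below assumes that every list page contains a link whose href carries a `nextid=` parameter and that the first such link points to the next page. This has not been checked against the live site. The names `find_next_url`, `crawl_pages` and `fetch` are hypothetical.

```python
import html
import re
from urllib.parse import urljoin


def find_next_url(page_html, current_url):
    """Return the absolute URL of the next list page, or None if no link is found.

    Assumption: the next-page link is the first href containing 'nextid='.
    """
    match = re.search(r'href="([^"]*nextid=\d+[^"]*)"', page_html)
    if match is None:
        return None
    href = html.unescape(match.group(1))  # turn &amp; back into &
    return urljoin(current_url, href)


def crawl_pages(start_url, fetch, max_pages=10):
    """Collect the HTML of up to max_pages consecutive list pages.

    fetch is any callable that maps a URL to HTML text.
    """
    pages = []
    url = start_url
    while url is not None and len(pages) < max_pages:
        page_html = fetch(url)
        pages.append(page_html)
        url = find_next_url(page_html, url)
    return pages
```

With requests, the call could look like `crawl_pages('http://bbs.tianya.cn/list-no02-1.shtml', lambda u: requests.get(u, timeout=10).text)`. Passing the download function in as an argument keeps the loop testable without network access.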

(3) Tidying the click and reply count data

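import pandas as pd

data_all1 = []
# The code assumes 5 cells per table row, with the click count at offset 2
# and the reply count at offset 3, so it steps through the cells 5 at a time.
last = min(400, len(a_all) - 1)  # stop before running off the end of the list
for j in range(2, last, 5):
    k = j + 1
    a_data = {'click': a_all[j].get_text().strip(),
              'response': a_all[k].get_text().strip()}
    data_all1.append(a_data)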
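
The index arithmetic above can also be written with slice strides. This is a minimal sketch on plain strings, assuming (as the post's offsets 2 and 3 with a step of 5 imply) that each row contributes five cells. The helper name `pair_click_reply` and the sample cell values are made up for illustration.

```python
def pair_click_reply(cell_texts, click_idx=2, reply_idx=3, cells_per_row=5):
    """Pick the click and reply cell out of every group of cells_per_row cells."""
    clicks = cell_texts[click_idx::cells_per_row]
    replies = cell_texts[reply_idx::cells_per_row]
    # zip stops at the shorter list, so a truncated last row is skipped.
    return [{'click': c, 'response': r} for c, r in zip(clicks, replies)]


# Two hypothetical rows: title, author, clicks, replies, date
cells = ["t1", "a1", "120", "4", "05-01",
         "t2", "a2", "87", "0", "05-02"]
rows = pair_click_reply(cells)
print(rows)
```

With the scraped cells, the input would be `[td.get_text().strip() for td in a_all]`.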

(4) Merging the two datasets

① First convert the list data to DataFrames

import pandas as pd
# 1) titles and authors
data_s = pd.DataFrame(data_all, columns=['author', 'title'])
# 2) click and reply counts
data_f = pd.DataFrame(data_all1, columns=['click', 'response'])

② Create a key column for the merge

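# Generate a list of unique random numbers to use as the key when merging the two DataFrames.
# The sample size (400) must equal the number of rows in each DataFrame.
import random
listww = random.sample(range(0, 400), 400)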

③ Add the key column to both DataFrames

# 1) titles and authors
data_s['key'] = listww
data_s
# 2) click and reply counts
data_f = pd.DataFrame(data_f, columns=['click', 'response', 'key'])
data_f['key'] = listww
data_f

④ Merge

result = pd.merge(data_s,data_f,on='key')

⑤ Drop the key column

re = result.drop('key',axis=1)
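
Because both DataFrames receive the same key list in the same row order, the merge ends up pairing row i of one table with row i of the other. If that positional pairing is the intent, a side-by-side `pd.concat` may give the same table without the temporary key. Here is a minimal sketch with made-up sample rows standing in for the scraped lists.

```python
import pandas as pd

# Hypothetical stand-ins for the two scraped lists
titles = [{"title": "post A", "author": "alice"},
          {"title": "post B", "author": "bob"}]
counts = [{"click": "120", "response": "4"},
          {"click": "87", "response": "0"}]

data_s = pd.DataFrame(titles, columns=["author", "title"])
data_f = pd.DataFrame(counts, columns=["click", "response"])

# Positional pairing only makes sense when both tables have the same length.
if len(data_s) != len(data_f):
    raise ValueError("row counts differ; positional alignment would be wrong")

# axis=1 places the columns side by side, aligned on the row index.
combined = pd.concat([data_s, data_f], axis=1)
print(combined)
```

Either approach relies on the two lists being scraped in the same row order. Neither can detect a row that is missing from one of them.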

(5) Save the file in CSV format

re.to_csv('sqy.csv',index=False)
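
Forum titles are usually Chinese text, and some spreadsheet programs misread a plain UTF-8 CSV. One common workaround is to write a byte order mark by passing `encoding='utf-8-sig'`. The sketch below uses a made-up one-row DataFrame and a temporary directory. Whether the BOM is needed depends on the program that opens the file.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"title": ["天涯杂谈"], "author": ["someone"]})  # hypothetical row

out_path = os.path.join(tempfile.mkdtemp(), "sqy.csv")
# utf-8-sig writes the 3-byte UTF-8 BOM at the start of the file.
frame.to_csv(out_path, index=False, encoding="utf-8-sig")

with open(out_path, "rb") as fh:
    first_bytes = fh.read(3)
print(first_bytes)
```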