[Original] [Web Scraping Study · 2] Scraping Nuclide Data from NNDC
Goal of this post: scrape nuclide data (S(n), S(p)) from the NNDC website.
The steps are as follows: 1) First, scrape the names and mass numbers of all nuclides and write them to nucleus.txt;
2) Remove the duplicate lines from nucleus.txt to obtain nucleus_new.txt;
3) Read the nuclide entries from nucleus_new.txt line by line, construct a URL request for each, scrape that nuclide's S(n) and S(p) data from the NNDC website, and write the results to nucleusSnSp.csv.
Step 1
First, let's look at the NNDC search page:
https://www.nndc.bnl.gov/nudat2/indx_sigma.jsp

This brings up the page shown above; click the Search button on the page, which gives:

The number at the upper left of each cell is the mass number. Inspecting the circled elements shows that this information is fairly easy to scrape. The code is as follows:
from selenium import webdriver

co = webdriver.ChromeOptions()
co.headless = False  # whether to run without a visible browser window
chrome_driver = r'D:\anaconda\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe'
browser = webdriver.Chrome(executable_path=chrome_driver, options=co)
url = 'https://www.nndc.bnl.gov/nudat2/indx_sigma.jsp'
browser.get(url)
form = browser.find_element_by_tag_name('form')
p = form.find_element_by_css_selector('p:nth-child(2)')
# simulate a click on the Search button
p.find_element_by_tag_name('input').click()
# wait up to 30 seconds for elements, so the page has time to load
browser.implicitly_wait(30)
tbody = browser.find_element_by_tag_name('tbody')
trs = browser.find_elements_by_tag_name('tr')
with open('nucleus.txt', 'w', encoding='utf-8') as f:
    for i in range(len(trs)):
        if i == 0:        # skip the header row
            continue
        elif i % 2 == 1:  # skip alternating rows, which carry no nuclide data
            continue
        else:
            nuc_td_num = trs[i].find_element_by_css_selector('td:first-child')
            nuc_td_name = trs[i].find_element_by_css_selector('td:nth-child(2)')
            nuc_info = nuc_td_num.text + '\n' + nuc_td_name.text
            nuc_result = nuc_info.split('\n')
            # filter out invalid entries whose mass number contains an 'm'
            if 'm' in nuc_result[0]:
                print('error')
            else:
                # write the nuclide's mass number and name to the txt file
                f.write(nuc_result[0] + nuc_result[2] + '\n')
                print(nuc_result[0] + nuc_result[2])
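The 'm'-filter above deserves a brief note: NNDC marks metastable isomers with an 'm' suffix in the mass-number column (e.g. '26m'), and step 3's nucleus=<A><element> URL scheme cannot query those, so such rows are dropped. Pulled out as a helper (a sketch; this function is not part of the original script), the check is simply:

```python
def keep_ground_state(mass_text):
    """Return True for ground-state rows.

    NNDC flags metastable isomers with an 'm' in the mass-number
    column (e.g. '26m'); those rows are skipped because the later
    per-nuclide URL cannot address them.
    """
    return 'm' not in mass_text
```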
For webdriver installation and an introduction, see the previous post: [Original] [Web Scraping Study · 1] Scraping Fund Return Rankings from Tiantian Fund
After the scrape finishes, nucleus.txt contains the following:

Step 2
As you can see, the txt file contains duplicate lines, which need to be removed. The code is as follows:
read_path = 'nucleus.txt'
write_path = 'nucleus_new.txt'
lines_seen = set()
# 'w' rather than 'a+' so a rerun overwrites instead of appending duplicates
with open(read_path, 'r', encoding='utf-8') as f, \
        open(write_path, 'w', encoding='utf-8') as outfile:
    for line in f:
        if line not in lines_seen:
            outfile.write(line)
            lines_seen.add(line)
With de-duplication done, we obtain nucleus_new.txt.
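The same first-seen de-duplication can also be written more compactly: since Python 3.7, `dict` preserves insertion order, so `dict.fromkeys` does the whole job. A sketch equivalent to the loop above:

```python
def dedupe_lines(lines):
    """Remove duplicate lines while keeping first-seen order.

    Relies on dict preserving insertion order (Python 3.7+).
    """
    return list(dict.fromkeys(lines))

# usage with the same files as above:
#   with open('nucleus.txt', encoding='utf-8') as f:
#       unique = dedupe_lines(f)
#   with open('nucleus_new.txt', 'w', encoding='utf-8') as out:
#       out.writelines(unique)
```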
Step 3
The S(n) and S(p) data for the nuclide 2H are obtained from the following URL:
https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp?nucleus=2H&unc=nds
The S(n) and S(p) data for the nuclide 20N are obtained from the following URL:
https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp?nucleus=20N&unc=nds
Looking at these URLs, only the nucleus parameter changes, and its value is exactly the content of each line of the nucleus_new.txt file from step 2. So by varying this parameter and constructing the URLs one by one, we can reach the detail page for every nuclide. The page for 2H looks like this:
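That observation can be captured in a small helper (a sketch; `build_url` is not part of the original script):

```python
BASE = 'https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp'

def build_url(nuclide):
    """Build the dataset URL for one stripped line of nucleus_new.txt,
    e.g. '2H' or '20N'. Only the nucleus parameter varies."""
    return f'{BASE}?nucleus={nuclide}&unc=nds'
```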

Inspecting the circled elements shows that S(n) and S(p) are also not hard to scrape. The code is as follows:
import csv
from selenium import webdriver

# CSV header row
header = ['Nuclide', 'S(n)(keV)', 'S(p)(keV)']

co = webdriver.ChromeOptions()
co.headless = False  # whether to run without a visible browser window
chrome_driver = r'D:\anaconda\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe'
browser = webdriver.Chrome(executable_path=chrome_driver, options=co)
url = 'https://www.nndc.bnl.gov/nudat2/getdatasetClassic.jsp?unc=nds'
blank = 'toBeDone'
with open('nucleusSnSp.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for nuc in open('nucleus_new.txt', 'r', encoding='utf-8'):
        nuc = nuc.rstrip('\n')
        total_url = url + '&nucleus=' + nuc
        # give each page up to 60 seconds to load
        browser.set_page_load_timeout(60)
        try:
            browser.get(total_url)
        except Exception:
            print('!!!timed out after 60 seconds while loading page')
            writer.writerow([nuc, blank, blank])
            continue
        try:
            body = browser.find_element_by_tag_name('body')
            table = body.find_element_by_tag_name('table')
        except Exception:
            print('empty dataset')
        else:
            tbody = table.find_element_by_tag_name('tbody')
            tr = tbody.find_element_by_tag_name('tr')
            tds = tr.find_elements_by_tag_name('td')
            Sn_result = ''
            Sp_result = ''
            for td in tds:
                if 'S(n)' in td.text:
                    # take the text before 'keV', then drop the 'S(n) ' label
                    Sn_result = td.text.split('keV')[0][5:]
                if 'S(p)' in td.text:
                    # take the text before 'keV', then drop the 'S(p) ' label
                    Sp_result = td.text.split('keV')[0][5:]
            writer.writerow([nuc, Sn_result, Sp_result])
            print(nuc + ',' + Sn_result + ',' + Sp_result)
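One fragile spot: the fixed-offset slicing (`[0][5:]`, which drops the five characters of the 'S(n) ' label) breaks if NNDC ever pads or labels the cell differently. A more robust sketch using a regular expression (the helper name and the assumed cell layout 'S(n) 2224.57 keV' are illustrative, not taken from the site):

```python
import re

def extract_kev(cell_text, label):
    """Return the number (as a string) that follows `label` in a table
    cell, e.g. 'S(n) 2224.57 keV' with label 'S(n)' yields '2224.57'.
    Returns '' when the label or number is absent."""
    m = re.search(re.escape(label) + r'\s*([-+]?\d+(?:\.\d+)?)', cell_text)
    return m.group(1) if m else ''
```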
Because some nuclides on NNDC have no data (e.g. 24P, 27O, 23C), the empty-dataset case needs exception handling. In addition, when a page fails to load for network reasons, the subsequent parsing would also fail, so that case needs exception handling as well.
After the scrape finishes, nucleusSnSp.csv looks like this:

nucleusSnSp.csv download link