Data cleaning methods for Python crawlers: Python crawler basics, data extraction and cleaning with regular expressions

Posted by an anonymous user   2021-5-21 14:05

Regular expression review:

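# A ? placed after a quantifier (as in .*?) matches as little as possible: in the example above, matching stops at the first div. The ? turns off greedy matching.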
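As a minimal sketch of the difference, the snippet below runs a greedy and a non-greedy pattern over the same string. The sample markup in html is invented for illustration and does not come from the original post.

```python
import re

# Hypothetical sample markup, invented for this illustration.
html = '<div>first</div><div>second</div>'

# Greedy: .* runs to the LAST </div> it can reach.
greedy = re.findall(r'<div>(.*)</div>', html)

# Non-greedy: .*? stops at the FIRST </div> it meets.
lazy = re.findall(r'<div>(.*?)</div>', html)

print(greedy)  # ['first</div><div>second']
print(lazy)    # ['first', 'second']
```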

Suppose we have an HTML file:

Title

Email:kefu@CSDN.net

Phone: 400-660-0108

I want to extract the Email. Create a Python file and import re:

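import re

with open('index.html', 'r', encoding='utf_8') as f:
    html = f.read()
    # print(html)

# Extract the Email: kefu@CSDN.net
# Remove the newlines
html = re.sub('\n', '', html)

# Define the extraction pattern
# (the HTML tags that surrounded (.*?) in this pattern are missing from the post as published)
pattern_1 = '(.*?)'

# Run the extraction
re_1 = re.findall(pattern_1, html)

# strip() removes the whitespace on both sides
print(re_1[0].strip())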
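Since the tags of the page and of the pattern are not visible here, the following is only a sketch of the same steps on an inline string. The p tags and the names sample_html, flat_html, and email_pattern are assumptions made for this example, not the author's actual markup.

```python
import re

# Hypothetical stand-in for index.html; the real markup is not shown in the post.
sample_html = """
<html>
<head><title>Title</title></head>
<body>
<p>Email: kefu@CSDN.net </p>
<p>Phone: 400-660-0108</p>
</body>
</html>
"""

# Drop newlines so a match never has to span a line break.
flat_html = re.sub('\n', '', sample_html)

# Capture whatever sits between "Email:" and the closing tag, non-greedily.
email_pattern = r'<p>Email:(.*?)</p>'
emails = re.findall(email_pattern, flat_html)

# strip() removes the whitespace around the captured text.
email = emails[0].strip()
print(email)  # kefu@CSDN.net
```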

Match a password that starts with a letter, followed by 5 to 15 letters, digits, or underscores (6 to 16 characters in total):

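# Define the pattern; the r prefix makes it a raw string, so backslashes are not treated as escapes
password_pattern = r'^[a-zA-Z][a-zA-Z0-9_]{5,15}$'

password1 = '1234567'
password2 = 'a123456'
password3 = 'a123'

print(re.match(password_pattern, password1))  # None: starts with a digit
print(re.match(password_pattern, password2))  # a match object
print(re.match(password_pattern, password3))  # None: too short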
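The three checks above can be wrapped in a small helper. This is only a sketch: the name is_valid_password is hypothetical, and re.fullmatch is swapped in for re.match with ^ and $ anchors.

```python
import re

# One leading letter, then 5 to 15 letters, digits, or underscores.
PASSWORD_PATTERN = re.compile(r'[a-zA-Z][a-zA-Z0-9_]{5,15}')


def is_valid_password(password: str) -> bool:
    """Return True when the whole string fits PASSWORD_PATTERN."""
    # fullmatch requires the pattern to cover the entire string,
    # so no ^ or $ anchors are needed.
    return PASSWORD_PATTERN.fullmatch(password) is not None


results = {p: is_valid_password(p) for p in ('1234567', 'a123456', 'a123')}
print(results)  # {'1234567': False, 'a123456': True, 'a123': False}
```

One difference from the anchored version: $ also matches just before a trailing newline, so 'a123456\n' passes re.match with the original pattern but is rejected by re.fullmatch.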

Extending this a little, extract more data: the category structure of an online store.

import re

with open('static/html/index.html', 'r', encoding='utf-8') as f:
    html = re.sub('\n', '', f.read())

# The HTML tags that surrounded (.*?) in the three patterns below
# are missing from the post as published.
section_pattern = '(.*?)'
section_s = re.findall(section_pattern, html)
print(section_s)
print(len(section_s))

category_pattern = '(.*?)'
# category_s = re.findall(category_pattern, section_s[0])
# print(category_s)
course_pattern = '(.*?)'

# Collect one dict per section: its category name and its list of courses
data_s = []
for section in section_s:
    category = re.findall(category_pattern, section)[0]
    course = re.findall(course_pattern, section)
    print(category)
    data_s.append(
        {
            'category': category,
            'course': course
        }
    )

print(data_s)

# Print each category followed by its courses, indented
for data in data_s:
    print(data.get('category'))
    for d in data['course']:
        print(" " + d)
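Because the tags in the three patterns are not visible, here is a self-contained sketch of the same section, category, and course extraction. All markup, class names, and sample values below are invented for illustration and are not the author's test page.

```python
import re

# Hypothetical markup: tags, class names, and values are invented for this sketch.
sample_html = """
<div class="section">
  <h2>Python</h2>
  <li>Crawler basics</li>
  <li>Regular expressions</li>
</div>
<div class="section">
  <h2>Web</h2>
  <li>HTML</li>
</div>
"""

flat_html = re.sub('\n', '', sample_html)

section_pattern = r'<div class="section">(.*?)</div>'
category_pattern = r'<h2>(.*?)</h2>'
course_pattern = r'<li>(.*?)</li>'

# First split the page into sections, then search inside each section.
catalog = []
for section in re.findall(section_pattern, flat_html):
    catalog.append({
        'category': re.findall(category_pattern, section)[0],
        'course': re.findall(course_pattern, section),
    })

for entry in catalog:
    print(entry['category'])
    for course in entry['course']:
        print(' ' + course)
```

This works only because the hypothetical sections contain no nested div; with nested tags, the non-greedy match would stop at the first closing tag it meets.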

The HTML used for testing
