Regular expression review:
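# A ? after a quantifier matches as few characters as possible (it turns off greedy matching): in the div example, the match ends at the first closing div instead of running on to the last one.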
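The difference between greedy and lazy matching is easiest to see side by side. The snippet below is a minimal sketch; the `html` string is a made-up sample, not the page used in this post.

```python
import re

# Hypothetical sample markup, not taken from the original page.
html = "<div>first</div><div>second</div>"

# Greedy: .* runs to the last </div>, so both blocks end up in one match.
greedy = re.findall(r"<div>(.*)</div>", html)

# Lazy: .*? stops at the first </div> it can reach, giving one match per block.
lazy = re.findall(r"<div>(.*?)</div>", html)

print(greedy)  # ['first</div><div>second']
print(lazy)    # ['first', 'second']
```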
Suppose we have the following HTML file:
Title
Email:kefu@CSDN.net
Phone:400-660-0108
To extract the Email, create a Python file and import re:
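import re

with open('index.html', 'r', encoding='utf-8') as f:
    html = f.read()
# print(html)
# Extract the Email: kefu@CSDN.net
# Remove newlines so that . can match across the original line breaks
html = re.sub('\n', '', html)
# Define the extraction pattern (the tags surrounding the group are not shown in this listing)
pattern_1 = '(.*?)'
# Run the extraction
re_1 = re.findall(pattern_1, html)
# strip() removes whitespace from both ends
print(re_1[0].strip())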
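As an alternative that does not depend on the surrounding tags, the address can be located by its `Email:` label. This is only a sketch: the markup in `html` is an assumed stand-in for index.html, and `EMAIL_RE` and `extract_email` are hypothetical names introduced for illustration.

```python
import re

# Assumed stand-in for index.html; the real file's tags may differ.
html = """
<html>
  <head><title>Title</title></head>
  <body>
    <p>Email:kefu@CSDN.net</p>
    <p>Phone:400-660-0108</p>
  </body>
</html>
"""

# Label, optional whitespace, then a simple local@domain.tld shape.
# This is a pragmatic pattern, not a full validator for email addresses.
EMAIL_RE = re.compile(r"Email:\s*([\w.+-]+@[\w-]+(?:\.[\w-]+)+)")


def extract_email(text):
    """Return the first address that follows an 'Email:' label, or None."""
    match = EMAIL_RE.search(text)
    return match.group(1) if match else None


print(extract_email(html))  # kefu@CSDN.net
```

Using `search` with an explicit `None` check avoids the `IndexError` that indexing an empty `findall` result would raise when nothing matches.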
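Match a password that starts with a letter, contains only letters, digits and underscores, and is 5 to 15 characters long:
# Define the pattern; the r prefix makes it a raw string, so backslashes are not treated as escapes
# The leading letter already counts as one character, so {4,14} gives a total length of 5-15
password_pattern = r'^[a-zA-Z][a-zA-Z0-9_]{4,14}$'
password1 = '1234567'  # starts with a digit: no match
password2 = 'a123456'  # valid: match
password3 = 'a123'     # too short: no match
print(re.match(password_pattern, password1))
print(re.match(password_pattern, password2))
print(re.match(password_pattern, password3))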
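One detail worth checking is how the quantifier relates to the total length: the leading letter uses one character, so the count on the remaining characters is one less than the total at both ends. The sketch below assumes the 5 to 15 range refers to the whole password; `build_password_pattern` and `is_valid_password` are hypothetical helpers, not part of the original code.

```python
import re


def build_password_pattern(min_len=5, max_len=15):
    """Compile a pattern for a letter followed by letters, digits or underscores.

    min_len and max_len are total lengths; the first letter uses one character,
    so the quantifier on the rest is one smaller at both ends.
    """
    return re.compile(r"[a-zA-Z][a-zA-Z0-9_]{%d,%d}" % (min_len - 1, max_len - 1))


PASSWORD_RE = build_password_pattern()


def is_valid_password(candidate):
    # fullmatch requires the whole string to match, so ^ and $ are not needed.
    return PASSWORD_RE.fullmatch(candidate) is not None


for candidate in ["1234567", "a123456", "a123", "a1234", "a" * 15, "a" * 16]:
    print(candidate, is_valid_password(candidate))
```

Unlike `re.match` with `$`, `fullmatch` also rejects a candidate that ends with a trailing newline.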
Extending this a little, we can extract more data: the category structure of the store.
import re

with open('static/html/index.html', 'r', encoding='utf-8') as f:
    html = re.sub('\n', '', f.read())
# The tags surrounding each group below are not shown in this listing
section_pattern = '(.*?)'
section_s = re.findall(section_pattern, html)
print(section_s)
print(len(section_s))
category_pattern = '(.*?)'
# category_s = re.findall(category_pattern, section_s[0])
# print(category_s)
course_pattern = '(.*?)'
data_s = []
# Collect one dict per section: the category name and its list of courses
for section in section_s:
    category = re.findall(category_pattern, section)[0]
    course = re.findall(course_pattern, section)
    print(category)
    data_s.append(
        {
            'category': category,
            'course': course
        }
    )
print(data_s)
# Print each category followed by its indented courses
for data in data_s:
    print(data.get('category'))
    for d in data['course']:
        print(" " + d)
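The same two-level extraction can be tried without the real page. The sketch below is self-contained, but the markup is invented: the `section`, `h3` and `a` tags and their class names are assumptions chosen only to make the example runnable, and `parse_categories` is a hypothetical helper.

```python
import re

# Hypothetical markup: tag names and class names are assumptions,
# not the structure of the original index.html.
html = """
<section class="category">
  <h3>Backend</h3>
  <a class="course">Python</a>
  <a class="course">Go</a>
</section>
<section class="category">
  <h3>Frontend</h3>
  <a class="course">TypeScript</a>
</section>
"""

# re.S lets . match newlines, so the text does not have to be flattened first.
SECTION_RE = re.compile(r'<section class="category">(.*?)</section>', re.S)
TITLE_RE = re.compile(r"<h3>(.*?)</h3>", re.S)
COURSE_RE = re.compile(r'<a class="course">(.*?)</a>', re.S)


def parse_categories(text):
    """Return a list of {'category': str, 'course': [str, ...]} dicts."""
    result = []
    for section in SECTION_RE.findall(text):
        title = TITLE_RE.search(section)
        if title is None:
            # Skip sections without a heading instead of raising IndexError.
            continue
        result.append({
            "category": title.group(1).strip(),
            "course": [c.strip() for c in COURSE_RE.findall(section)],
        })
    return result


for item in parse_categories(html):
    print(item["category"])
    for course in item["course"]:
        print("  " + course)
```

Regular expressions are workable for small, predictable pages like this one; for arbitrary HTML a real parser is usually the more robust choice.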
The HTML used for testing