pandas 利用 正则表达式 从文本中提取数字

论坛 期权论坛 脚本     
匿名技术用户   2020-12-27 04:58   11   0

需要从text特征中提取形如 13.5/10 这样的字符串,再分别提取分子分母。
1)可以利用 str.extract() 方法。
2)利用正则表达式 \d+\.?\d*\/\d+ 进行匹配
3)再利用 .split() 方法提取分子分母

代码:
这里写图片描述

test.text.tolist()

# output
['This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948',
 "This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",
 "This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq",
 'Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD']
test['rating'] = test['text'].str.extract(r'(\d+\.?\d*\/\d+)', expand=False)

# 提取分子
test['rating_numerator'] = test.rating.apply(lambda x: eval(x.split('/')[0]))
# 提取分母
test['rating_denominator_fix'] = test.rating.apply(lambda x: eval(x.split('/')[1]))
# 删除中间量
test.drop(['rating'], axis=1, inplace=True)

这里写图片描述

分享到 :
0 人收藏
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

积分:7942463
帖子:1588486
精华:0
期权论坛 期权论坛
发布
内容

下载期权论坛手机APP