Extracting a string from an HTML string
Question:
I want to extract a number from an HTML string (I usually do not know the number in advance).
The relevant part looks like this:
<test test="3" test="search_summary_figure WHR WVM">TOTAL : 286</test>
<tagend>
I want to extract the "286". I want to do something like: start after "L : " and stop before "<". How can I do this? Thank you very much in advance.
Answer
If the string "TOTAL : number" is unique, use a regular expression to find that substring first, and then extract the number from it.
import re

string = '<test test="3" test="search_summary_figure WHR WVM">TOTAL : 286</test>'
reg__expr = r'TOTAL\s:\s\d+'  # TOTAL<whitespace>:<whitespace><number>

# find the substring
result = re.findall(reg__expr, string)
if result:
    substring = result[0]
    reg__expr = r'\d+'  # <number>
    # extract the digits from the matched substring
    result = re.findall(reg__expr, substring)
    number = int(result[0])
    print(number)
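The two searches can also be collapsed into one by putting the digits in a capture group. The following is only a sketch: the helper name `extract_total` is made up for illustration, and it assumes the label is literally "TOTAL" and that the whitespace around the colon may vary.

```python
import re

# Assumption: the label is always "TOTAL"; \s* tolerates zero or more spaces
# around the colon, and the parentheses capture only the digits.
TOTAL_PATTERN = re.compile(r'TOTAL\s*:\s*(\d+)')


def extract_total(html):
    """Return the integer following 'TOTAL :', or None if it is not found."""
    match = TOTAL_PATTERN.search(html)
    return int(match.group(1)) if match else None


html = '<test test="3" test="search_summary_figure WHR WVM">TOTAL : 286</test>'
print(extract_total(html))
```

Returning None when nothing matches lets the caller decide how to handle a page that lacks the label, instead of failing with an IndexError.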
You can test your own regular expressions here: https://regex101.com/