Extract a string from an HTML string

Question:

I want to extract a number from an HTML string (I usually do not know the number in advance).

The relevant part looks like this:

<test test="3" test="search_summary_figure WHR WVM">TOTAL : 286</test>
<tagend>

I want to extract the "286". I want to do something like: start after "L :" and stop before "<". How can I do this? Thank you very much in advance.

Answer:

If the string "TOTAL : number" occurs only once in the HTML, use a regular expression to first find that substring and then extract the number from it.

import re

string = '<test test="3" test="search_summary_figure WHR WVM">TOTAL : 286</test>'

# Step 1: find the substring TOTAL<whitespace>:<whitespace><number>
reg__expr = r'TOTAL\s:\s\d+'
result = re.findall(reg__expr, string)
if result:
    substring = result[0]

    # Step 2: pull the digits out of that substring
    reg__expr = r'\d+'  # <number>
    result = re.findall(reg__expr, substring)
    number = int(result[0])

    print(number)  # 286
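The two searches can probably be folded into one by using a capture group. This is a minimal sketch, not part of the original answer: it assumes the same sample string, and it loosens the pattern to `\s*` so that "TOTAL:286" and "TOTAL  :  286" would also match, which may or may not be what you want.

```python
import re

html = '<test test="3" test="search_summary_figure WHR WVM">TOTAL : 286</test>'

# The parentheses capture only the digits; \s* allows any amount of
# whitespace (including none) on either side of the colon.
match = re.search(r'TOTAL\s*:\s*(\d+)', html)

# re.search returns None when nothing matches, so guard before converting.
number = int(match.group(1)) if match else None

print(number)
```

`re.search` stops at the first match, which fits the assumption that "TOTAL : number" appears only once.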

You can test your own regular expressions here: https://regex101.com/
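If you would rather follow the question's "start after X, stop before Y" wording literally, plain string methods may be enough. The sketch below is an illustration only: `extract_between` is a hypothetical helper name, and it assumes the start marker appears before the value and that the next "<" closes it.

```python
def extract_between(text, start, end):
    """Return the stripped text between the first `start` and the next `end`.

    Returns None if either marker is missing.
    """
    # partition() splits on the first occurrence; the middle item is
    # empty when the separator was not found.
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return middle.strip()


html = '<test test="3" test="search_summary_figure WHR WVM">TOTAL : 286</test>'
value = extract_between(html, 'TOTAL :', '<')
print(value)
```

The helper returns a string, so call `int(value)` once you have checked that it is not None. Unlike the regex, it does not verify that the extracted text is actually a number.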
