Extracting text from an XML document in Python
Here is the sample XML document:
<bookstore>
<book category="COOKING">
<title lang="english">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>300.00</price>
</book>
<book category="CHILDREN">
<title lang="english">Harry Potter</title>
<author>J K. Rowling </author>
<year>2005</year>
<price>625.00</price>
</book>
</bookstore>
I want to extract the text without specifying the elements. How can I do this? I have 10 such documents. The user enters a word that I don't know in advance, and it has to be searched for in the text portions of all 10 XML documents. For this to work, I need to find where the text lies without knowing the element names. One more thing: all of these documents are different.
Please help!
You could simply strip out all the tags:
>>> import re
>>> txt = """<bookstore>
... <book category="COOKING">
... <title lang="english">Everyday Italian</title>
... <author>Giada De Laurentiis</author>
... <year>2005</year>
... <price>300.00</price>
... </book>
...
... <book category="CHILDREN">
... <title lang="english">Harry Potter</title>
... <author>J K. Rowling </author>
... <year>2005</year>
... <price>625.00</price>
... </book>
... </bookstore>"""
>>> exp = re.compile(r'<.*?>')
>>> text_only = exp.sub('',txt).strip()
>>> text_only
'Everyday Italian\n Giada De Laurentiis\n 2005\n 300.00\n \n\n \n Harry Potter\n J K. Rowling \n 2005\n 625.00'
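If you would rather not rely on a regular expression, a parser can walk every element for you, so you never name a tag. The following is only a sketch using the standard library's xml.etree.ElementTree in place of the regex above; the names SAMPLE and extract_text are made up for illustration and are not from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same bookstore document used in the question.
SAMPLE = """<bookstore>
<book category="COOKING">
<title lang="english">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>300.00</price>
</book>
<book category="CHILDREN">
<title lang="english">Harry Potter</title>
<author>J K. Rowling </author>
<year>2005</year>
<price>625.00</price>
</book>
</bookstore>"""


def extract_text(xml_string):
    """Return every non-blank text chunk in document order, without naming any element."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text and tail of every element under root.
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]


print(extract_text(SAMPLE))
```

Unlike the regex, the parser raises an error on malformed XML and decodes entities such as &amp;amp; for you, which may or may not be what you want for your 10 documents.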
But if you just want to search files for some text on Linux, you can use grep:
burhan@sandbox:~$ grep "Harry Potter" file.xml
<title lang="english">Harry Potter</title>
If you want to search in a file, use the grep command above, or open the file and search for it in Python:
>>> import re
>>> exp = re.compile(r'<.*?>')
>>> with open('file.xml') as f:
...     # Read the whole file and strip every tag, leaving only the text
...     text_only = exp.sub('', f.read()).strip()
...
>>> if 'Harry Potter' in text_only:
...     print('It exists')
... else:
...     print('It does not')
...
It exists
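Since the question mentions 10 different documents and wanting to know where the text lies, here is a rough sketch of one way to report which element holds the word in each file. It assumes the files are well-formed XML; find_word and search_files are hypothetical helper names, and the matching is a simple case-insensitive substring test.

```python
import xml.etree.ElementTree as ET


def find_word(root, word):
    """Return (tag, text) pairs for every element whose own text contains word."""
    needle = word.lower()
    hits = []
    for elem in root.iter():  # visits root and all descendants, whatever their tags
        text = (elem.text or "").strip()
        if needle in text.lower():
            hits.append((elem.tag, text))
    return hits


def search_files(paths, word):
    """Map each file path that contains word to its list of (tag, text) hits."""
    results = {}
    for path in paths:
        hits = find_word(ET.parse(path).getroot(), word)
        if hits:
            results[path] = hits
    return results


# Small in-memory demonstration; real use would call search_files with your file names.
demo_root = ET.fromstring(
    "<bookstore><book><title>Harry Potter</title><year>2005</year></book></bookstore>"
)
print(find_word(demo_root, "harry"))
```

Note that this only looks at elem.text, so text that follows a child element (the "tail") is not searched; whether that matters depends on how your documents are structured.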