How to remove duplicate elements from an XML file
Here is my XML file. It contains a duplicated element, <houseNum>0</houseNum>:
<?xml version="1.0" encoding="utf-8"?>
<ArrayOfHouse>
<XmlForm>
<houseNum>0</houseNum>
<plan1>
<coord>
<X> 1.2 </X>
<Y> 2.1 </Y>
<Z> 3.0 </Z>
</coord>
<color>
<R> 255 </R>
<G> 0 </G>
<B> 0 </B>
</color>
</plan1>
<plan2>
<coord>
<X> 21.2 </X>
<Y> 22.1 </Y>
<Z> 31.0 </Z>
</coord>
<color>
<R> 255 </R>
<G> 0 </G>
<B> 0 </B>
</color>
</plan2>
</XmlForm>
<XmlForm>
<houseNum>0</houseNum>
<plan1>
<coord>
<X> 1.2 </X>
<Y> 2.1 </Y>
<Z> 3.0 </Z>
</coord>
<color>
<R> 255 </R>
<G> 0 </G>
<B> 0 </B>
</color>
</plan1>
<plan2>
<coord>
<X> 21.2 </X>
<Y> 22.1 </Y>
<Z> 31.0 </Z>
</coord>
<color>
<R> 255 </R>
<G> 0 </G>
<B> 0 </B>
</color>
</plan2>
</XmlForm>
<XmlForm>
<houseNum>1</houseNum>
<plan1>
<coord>
<X> 11.2 </X>
<Y> 12.1 </Y>
<Z> 13.0 </Z>
</coord>
<color>
<R> 255 </R>
<G> 255 </G>
<B> 0 </B>
</color>
</plan1>
<plan2>
<coord>
<X> 211.2 </X>
<Y> 212.1 </Y>
<Z> 311.0 </Z>
</coord>
<color>
<R> 255 </R>
<G> 0 </G>
<B> 255 </B>
</color>
</plan2>
</XmlForm>
</ArrayOfHouse>
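In my case, there are two types of duplication:
1) The duplicated elements are consecutive. Here is the code I use to remove them: I simply compare element[i] with element[i + 1], and if element[i].text == element[i + 1].text, I remove element[i + 1].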
import os
import time

from lxml import etree

def Remove_Duplication_XML(xml_file):
    base_name = os.path.basename(xml_file)
    start_time = time.time()
    tree = etree.parse(xml_file)
    # remove duplicate skeletons
    root = tree.getroot()
    elementlist = [e for e in root.iter('houseNum')]
    numframes = [x.text for x in elementlist]
    print(numframes)
    for index_element in range(1, len(elementlist)):
        try:
            # compare each houseNum with the one directly before it
            if elementlist[index_element].text == elementlist[index_element - 1].text:
                # note: this removes only the <houseNum> node; its parent <XmlForm> stays in the tree
                elementlist[index_element].getparent().remove(elementlist[index_element])
                print(elementlist[index_element].text)
        except Exception as exc:
            print(' except ', exc)
    # String xml without duplication
    file = etree.tostring(root).decode("utf-8")
    print(file)
2) The duplicated elements are not consecutive. I am still looking for a way to handle this case. Any help?
Consider XSLT, the special-purpose language designed to transform XML files (analogous to SQL, another special-purpose language, for querying databases). Because you already use Python's lxml, you can run such a script directly, without a single for loop or if statement, to remove duplicates anywhere in the document.
Specifically, use Muenchian grouping, an XSLT 1.0 technique: index the XML document by houseNum with <xsl:key>, then keep only the first XmlForm in each group. As an added bonus, the XSLT below also trims the whitespace inside text nodes and pretty-prints the result with indentation:
XSLT (save as an .xsl file, which is itself a special kind of .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>
  <xsl:key name="id" match="XmlForm" use="houseNum" />
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="XmlForm[generate-id() != generate-id(key('id', houseNum))]"/>
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as et

# LOAD XML AND XSL FILES
xml = et.parse('Source.xml')
xsl = et.parse('XSLTScript.xsl')

# TRANSFORM SOURCE
transform = et.XSLT(xsl)
result = transform(xml)

# PRINT RESULT TO SCREEN
print(result)

# SAVE RESULT TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
Output (note that surrounding whitespace has been trimmed from the text values)
<?xml version="1.0"?>
<ArrayOfHouse>
<XmlForm>
<houseNum>0</houseNum>
<plan1>
<coord>
<X>1.2</X>
<Y>2.1</Y>
<Z>3.0</Z>
</coord>
<color>
<R>255</R>
<G>0</G>
<B>0</B>
</color>
</plan1>
<plan2>
<coord>
<X>21.2</X>
<Y>22.1</Y>
<Z>31.0</Z>
</coord>
<color>
<R>255</R>
<G>0</G>
<B>0</B>
</color>
</plan2>
</XmlForm>
<XmlForm>
<houseNum>1</houseNum>
<plan1>
<coord>
<X>11.2</X>
<Y>12.1</Y>
<Z>13.0</Z>
</coord>
<color>
<R>255</R>
<G>255</G>
<B>0</B>
</color>
</plan1>
<plan2>
<coord>
<X>211.2</X>
<Y>212.1</Y>
<Z>311.0</Z>
</coord>
<color>
<R>255</R>
<G>0</G>
<B>255</B>
</color>
</plan2>
</XmlForm>
</ArrayOfHouse>
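For comparison, the non-consecutive case can also be handled in plain lxml without XSLT. This is a different technique from the Muenchian grouping above, not a replacement for it. It is only a minimal sketch: it assumes each record is an XmlForm whose key is the text of its houseNum child, and the helper name remove_duplicate_forms and the shortened sample document are made up for illustration.

```python
from lxml import etree


def remove_duplicate_forms(root, record_tag="XmlForm", key_tag="houseNum"):
    """Remove every record whose key text was already seen; return how many were removed.

    Hypothetical helper: the tag names are assumptions taken from the sample file.
    """
    seen = set()
    removed = 0
    # Build the list first so that removing nodes does not disturb the iteration.
    for record in list(root.iter(record_tag)):
        key = record.findtext(key_tag)
        if key is None:
            continue  # no key element at all: leave the record untouched
        key = key.strip()
        if key in seen:
            # Drop the whole record, not just its key element.
            record.getparent().remove(record)
            removed += 1
        else:
            seen.add(key)
    return removed


# Shortened stand-in for the real file, with a non-consecutive duplicate of houseNum 0.
SAMPLE = b"""<ArrayOfHouse>
  <XmlForm><houseNum>0</houseNum><plan1/></XmlForm>
  <XmlForm><houseNum>1</houseNum><plan1/></XmlForm>
  <XmlForm><houseNum>0</houseNum><plan1/></XmlForm>
</ArrayOfHouse>"""

root = etree.fromstring(SAMPLE)
removed = remove_duplicate_forms(root)
remaining = [form.findtext("houseNum") for form in root.iter("XmlForm")]
print(removed, remaining)  # prints: 1 ['0', '1']
```

Like the stylesheet, this keeps the first XmlForm for each houseNum and works whether or not the duplicates are adjacent. For a real file, etree.parse(path).getroot() could be passed in instead of the inline sample.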