Parsing XML With Invalid Nodes
I am parsing a very large XML file. When a node fails, I want to keep looping and process the remaining nodes.
Version 1:
for event, element in etree.iterparse(file):
    if element.tag
Solution 1:
I took the lines that seem to be causing trouble and put them into a slightly larger XML file for testing. Here it is.
<whole>
<tag1>
<tag11>one</tag11>
<tag11><![CDATA[ foo bar
]]></tag11>
<tag11>two</tag11>
<tag11>three</tag11>
</tag1>
<tag1><tag11><![CDATA[ foo bar
]]></tag11></tag1>
<tag1><tag11><![CDATA[ foo bar]]></tag11></tag1>
<tag1><tag11><![CDATA[ foo bar]]></tag11></tag1>
<tag1>
<tag11>three</tag11>
<tag11>four</tag11>
<tag11>five</tag11>
<tag11>six</tag11>
</tag1>
</whole>
Then I ran the following code that displayed its results at the end.
>>> import os
>>> os.chdir('c:/scratch')
>>> from lxml import etree
>>> context = etree.iterparse('temp.xml')
>>> for action, elem in context:
...     print(action, elem.tag, elem.sourceline)
...
end tag11 3
end tag11 4
end tag11 6
end tag11 7
end tag1 2
end tag11 9
end tag1 9
end tag11 11
end tag1 11
end tag11 12
end tag1 12
end tag11 14
end tag11 15
end tag11 16
end tag11 17
end tag1 13
end whole 1
In short, there seems to be nothing wrong with those lines.
To find the vicinity of the problem in your XML, you could print the line number at which each tag was found (elem.sourceline, as in the code above). The last line number printed before the failure tells you roughly where to look. (This is an edit based on knowledge that I have newly acquired on SO.)
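A related way to narrow down the trouble spot is to catch the parser's exception and read the position it reports. The sketch below is only an illustration under some assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (so it relies on ParseError.position rather than elem.sourceline), and the function name locate_parse_failure and the two sample documents are made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def locate_parse_failure(source):
    """Return (tags_seen, error_position) for a file name or file-like object.

    error_position is None if the whole document parsed cleanly,
    otherwise the (line, column) pair reported by the parser.
    locate_parse_failure is a hypothetical helper, not a library function.
    """
    tags_seen = []
    try:
        for event, elem in ET.iterparse(source, events=("end",)):
            tags_seen.append(elem.tag)
            elem.clear()  # release the element's content once we are done with it
    except ET.ParseError as err:
        # Everything parsed before the error has already been yielded.
        return tags_seen, err.position
    return tags_seen, None


# Made-up sample documents: the second has a mismatched closing tag on line 3.
good = b"<whole><tag1><tag11>one</tag11></tag1></whole>"
bad = b"<whole>\n<tag1><tag11>one</tag11></tag1>\n<tag1><tag11>two</tag1>\n</whole>"

tags_good, pos_good = locate_parse_failure(io.BytesIO(good))
tags_bad, pos_bad = locate_parse_failure(io.BytesIO(bad))

print(tags_good, pos_good)
print(tags_bad, pos_bad)
```

The tags collected before the exception show how far the parser got, and the reported line number points at the first place where the document stops being well-formed.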
I would also suggest using the looping structure from the documentation (for action, elem in context:) to avoid the infinite loop. That is what I did in the code above.
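If "a node fails" means that your own processing of a node raises an exception (rather than the XML itself being malformed), the same for loop can simply wrap the per-node work in try/except and carry on. This is a minimal sketch under that assumption; handle and process_all are hypothetical names, the int conversion stands in for whatever work you really do, and it uses the standard library's xml.etree.ElementTree, whose iterparse is used in the same for-loop shape as the lxml one shown earlier.

```python
import io
import xml.etree.ElementTree as ET


def handle(elem):
    """Hypothetical per-node work: here, convert the node text to an int."""
    return int(elem.text)


def process_all(source, tag):
    """Process every <tag> node, recording failures instead of stopping."""
    results, failures = [], []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != tag:
            continue
        try:
            results.append(handle(elem))
        except (TypeError, ValueError) as err:
            # Note the failing node and move on to the next one.
            failures.append((elem.text, str(err)))
        finally:
            elem.clear()  # keep memory use flat on large files
    return results, failures


# Made-up sample: the middle node cannot be converted to an int.
doc = b"<whole><tag11>1</tag11><tag11>oops</tag11><tag11>3</tag11></whole>"
results, failures = process_all(io.BytesIO(doc), "tag11")

print(results)
print(failures)
```

Because the for loop is driven by the iterator itself, it ends when the parser runs out of events, so there is no hand-written loop condition that could spin forever.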