BeautifulSoup: Parsing XML to a Table

I'm back with another issue. I'm using BeautifulSoup and am really new to parsing XML, and I have been stuck on this problem for two weeks now. I would appreciate your help. I have this structure:

Solution 1:

The core is a list comprehension over the Bloc elements with an embedded dict comprehension over each Bloc's attributes. The page number is added by merging a second dict into the Bloc attributes; that dict navigates to the parent element and reads the required attribute.

Column order follows the order in which the attributes are first seen.
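The merge step described above can be tried in isolation. This is only a minimal sketch: `bloc_attrs` and `page_attrs` are hypothetical stand-ins for one Bloc element's attributes and its parent page's attributes, not objects from BeautifulSoup.

```python
# Hypothetical stand-ins for one Bloc element's attributes and its parent page
bloc_attrs = {"code": "AF", "a": "000000000002550", "b": "000000000002550"}
page_attrs = {"number": "01"}

# Copy every Bloc attribute, then merge in the page number under a new key
row = {
    **{k: bloc_attrs[k] for k in bloc_attrs},
    **{"page": page_attrs["number"]},
}
print(row)
```

Each Bloc produces one such dict, and the list of those dicts is what gets handed to the DataFrame constructor.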

from bs4 import BeautifulSoup
import pandas as pd

xml = """<detail>
<page number="01"><Bloc code="AF" A="000000000002550" B="000000000002550"/><Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/><Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/></page>
<page number="02"><Bloc code="DA" A="000000000038486" B="000000000038486"/><Bloc code="DD" A="000000000003849" B="000000000003849"/><Bloc code="EA" A="000000000001029"/><Bloc code="EC" A="000000000063797" B="000000000082427"/></page>
<page number="03"><Bloc code="FD" C="000000000574042" D="000000000610740"/><Bloc code="GW" C="000000000052677" D="000000000075362"/></page>
</detail>"""

# html.parser lowercases tag and attribute names, hence "bloc" and columns a, b, c, d
soup = BeautifulSoup(xml, "html.parser")
df = pd.DataFrame([{**{k: b[k] for k in b.attrs.keys()}, **{"page": b.parent["number"]}}
                   for b in soup.find_all("bloc")])


Output

code               a               b page               c               d
  AF 000000000002550 000000000002550   01             NaN             NaN
  AH 000000000035826             NaN   01 000000000035826 000000000035826
  AR 000000000026935 000000000024503   01 000000000002431 000000000001669
  DA 000000000038486 000000000038486   02             NaN             NaN
  DD 000000000003849 000000000003849   02             NaN             NaN
  EA 000000000001029             NaN   02             NaN             NaN
  EC 000000000063797 000000000082427   02             NaN             NaN
  FD             NaN             NaN   03 000000000574042 000000000610740
  GW             NaN             NaN   03 000000000052677 000000000075362
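The column order in the output (page sitting between b and c) comes from the keys being collected in first-seen order. A rough sketch in plain Python that mirrors this effect, without claiming to reproduce what pandas does internally; the `rows` list and the `preferred` order are assumptions made for illustration:

```python
# Hypothetical rows, shaped like the dicts built for the first two Bloc elements
rows = [
    {"code": "AF", "a": "000000000002550", "b": "000000000002550", "page": "01"},
    {"code": "AH", "a": "000000000035826", "c": "000000000035826",
     "d": "000000000035826", "page": "01"},
]

# Union of keys in first-seen order (dicts preserve insertion order)
columns = list(dict.fromkeys(key for row in rows for key in row))
print(columns)

# A preferred order (an assumption, not part of the original answer)
preferred = ["page", "code", "a", "b", "c", "d"]
# Missing attributes become None here, much like the NaN cells in the table
table = [[row.get(col) for col in preferred] for row in rows]
print(table)
```

With the DataFrame from the answer, the same reordering should be possible by selecting columns with a list, for example `df[["page", "code", "a", "b", "c", "d"]]`.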

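ElementTree

The approach is very similar to the BeautifulSoup version. ElementTree is case-sensitive, however, so the tag is matched as "Bloc" and the columns keep their original case (A, B, C, D).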

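import pandas as pd
import xml.etree.ElementTree as ET

# xml is the string defined in the BeautifulSoup example above
root = ET.fromstring(xml)
# One row per Bloc: its own attributes plus the number of the enclosing page
df2 = pd.DataFrame([{**b.attrib, **{"page": p.attrib["number"]}}
                    for p in root.iter("page")
                    for b in p.iter("Bloc")])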
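For anyone who wants to check the ElementTree logic without pandas, here is a standard-library-only sketch of the same nested iteration written as explicit loops. The `sample` string is a hypothetical two-page excerpt in the same shape as the data above, and the integer conversion at the end is an optional extra that the original answer does not do.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-page sample in the same shape as the document above
sample = """<detail>
<page number="01"><Bloc code="AF" A="000000000002550" B="000000000002550"/></page>
<page number="02"><Bloc code="EA" A="000000000001029"/></page>
</detail>"""

root = ET.fromstring(sample)

rows = []
for page in root.iter("page"):
    for bloc in page.iter("Bloc"):
        row = dict(bloc.attrib)              # attribute names keep their case
        row["page"] = page.attrib["number"]  # taken from the enclosing page
        rows.append(row)

# Optional: convert the zero-padded strings to integers, leaving code and page alone
converted = [
    {k: (v if k in ("code", "page") else int(v)) for k, v in row.items()}
    for row in rows
]
print(converted)

# Matching is case-sensitive: a lowercase tag name finds nothing
print(list(root.iter("bloc")))
```

Keeping the values as strings, as the answer does, preserves the leading zeros; converting only makes sense if the fields are really numeric amounts.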
