How To Preserve Links When Scraping A Table With Beautiful Soup And Pandas
Scraping a web to get a table, using Beautiful soup and Pandas. One of the columns got some urls. When I pass html to pandas, href are lost. is there any way of preserving the url
Solution 1:
pd.read_html
assumes the data you are interested in is in the text, not the tag attributes. However, it isn't hard to scrape the table yourself:
import bs4 as bs
import pandas as pd
with open(url,"r") as f:
soup = bs.BeautifulSoup(f, 'lxml')
parsed_table = soup.find_all('table')[1]
data = [[td.a['href'] if td.find('a') else''.join(td.stripped_strings)
for td in row.find_all('td')]
for row in parsed_table.find_all('tr')]
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
yields
customer country area website link
0 IBM USA EMEA http://www.ibm.com
1 CISCO USA EMEA http://www.cisco.com
2 unknown company USA EMEA
Solution 2:
Just check if tag exists this way:
import numpy as np
with open(url,"r") as f:
sp = bs.BeautifulSoup(f, 'lxml')
tb = sp.find_all('table')[56]
df = pd.read_html(str(tb),encoding='utf-8', header=0)[0]
df['href'] = [np.where(tag.has_attr('href'),tag.get('href'),"no link") for tag in tb.find_all('a')]
Solution 3:
Here is another way to do it if you have more than one link to grab from html table. Instead of making list comprehension I would rather go with separeated for loops so code is more readable to those who are new to python and it is easier to adjust code ore handle errors if they emerge. I hope it will help someone.
soup = BeautifulSoup(html, "lxml")
table = table.find('table')
thead = table.find('thead')
column_names = [th.text.strip() for th in thead.find_all('th')]
data = []
for row intable.find_all('tr'):
row_data = []
for td in row.find_all('td'):
td_check = td.find('a')
if td_check is not None:
link = td.a['href']
row_data.append(link)
else:
not_link = ''.join(td.stripped_strings)
if not_link == '':
not_link = None
row_data.append(not_link)
data.append(row_data)
df = pd.DataFrame(data[1:], columns=column_names)
df_dict = df.to_dict('records')
for row in df_dict:
print(row)
Post a Comment for "How To Preserve Links When Scraping A Table With Beautiful Soup And Pandas"