Python Beautifulsoup Selenium Scraper
I'm using the following Python script to scrape information from Amazon pages. At some point, it stopped returning page results. The script is starting, browsing through the keywords/p
Solution 1:
The following shows some changes you could make. I have switched to CSS selectors in several places.
The main result set to loop over is retrieved with soup.select('.s-result-list [data-asin]'). This selects every element that has a data-asin attribute and is a descendant of an element with the class s-result-list. It matches the 60 (current) items on the page.
I swapped the PRIME selection to an attribute = value selector: soup.select_one('[aria-label="Amazon Prime"]').
Headers are now h5, i.e. header = soup.select_one('h5').
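The selectors described above can be tried on a small static snippet without launching a browser. This is only a sketch: the HTML below is made-up, heavily simplified markup (the ASIN values and titles are placeholders), and it uses the built-in html.parser rather than lxml so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; real Amazon result pages are much larger
# and their structure changes over time.
html = """
<div class="s-result-list">
  <div data-asin="B000000001">
    <h5><a href="/Blue-Board/dp/B000000001/ref=sr_1_1">Blue Board</a></h5>
    <i aria-label="Amazon Prime"></i>
  </div>
  <div data-asin="B000000002">
    <h5><a href="/Red-Board/dp/B000000002/ref=sr_1_2">Red Board</a></h5>
  </div>
</div>
<div data-asin="B000000003"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant combinator: elements with a data-asin attribute inside .s-result-list.
# The last div is outside .s-result-list, so it is not matched.
results = soup.select(".s-result-list [data-asin]")
asins = [item["data-asin"] for item in results]

# Attribute = value selector; select_one returns None when nothing matches.
prime_flags = [
    item.select_one('[aria-label="Amazon Prime"]') is not None
    for item in results
]

# Each result's title is read from its h5 header.
titles = [item.select_one("h5").get_text(strip=True) for item in results]

print(asins)
print(prime_flags)
print(titles)
```

Because select_one returns None for a missing element, checking for None (or catching the resulting exception, as the example code does) is what keeps one incomplete result from stopping the whole loop.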
Example code:
import datetime
import re

from bs4 import BeautifulSoup
from selenium import webdriver

keyword = 'blue+skateboard'
driver = webdriver.Chrome()
url = 'https://www.amazon.co.uk/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords={}'
driver.get(url.format(keyword))
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()  # the page source is already captured, so the browser can be closed

# Every element with a data-asin attribute inside .s-result-list is one result
results = soup.select('.s-result-list [data-asin]')

for a, b in enumerate(results):
    soup = b
    header = soup.select_one('h5')
    result = a + 1
    title = header.text.strip() if header else "None"

    # Link to the product page, with the trailing /ref=... part removed
    try:
        link = soup.select_one('h5 > a')
        url = link['href']
        url = re.sub(r'/ref=.*', '', str(url))
    except (AttributeError, KeyError, TypeError):
        url = "None"

    # Skip sponsored redirect links
    if url != '/gp/slredirect/picassoRedirect.html':
        ASIN = re.sub(r'.*/dp/', '', str(url))
        # print(ASIN)

        # Star rating: keep only the text before the first space
        try:
            score = soup.select_one('.a-icon-alt')
            score = score.text
            score = score.strip('\n')
            score = re.sub(r' .*', '', str(score))
        except AttributeError:
            score = "None"

        # Review count, taken from the link to the customer reviews section
        try:
            reviews = soup.select_one("[href*='#customerReviews']")
            reviews = reviews.text.strip()
        except AttributeError:
            reviews = "None"

        # Prime badge, matched with an attribute = value selector
        try:
            PRIME = soup.select_one('[aria-label="Amazon Prime"]')
            PRIME = PRIME['aria-label']
        except (KeyError, TypeError):
            PRIME = "None"

        data = {keyword: [keyword, str(result), title, ASIN, score, reviews, PRIME, datetime.datetime.today().strftime("%B %d, %Y")]}
        print(data)
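The three re.sub calls in the example code can be checked in isolation. The following is a small sketch using only the standard library; the helper names and the sample href and rating strings are invented for illustration and are not taken from a real page.

```python
import re


def clean_result_url(href):
    """Drop the '/ref=...' suffix from a result link."""
    return re.sub(r'/ref=.*', '', href)


def asin_from_url(url):
    """Keep only what follows '/dp/'; the input is returned unchanged if '/dp/' is absent."""
    return re.sub(r'.*/dp/', '', url)


def leading_token(text):
    """Keep the text before the first space, e.g. the number in a rating label."""
    return re.sub(r' .*', '', text.strip('\n'))


# Hypothetical sample values
href = '/Some-Skateboard/dp/B000000001/ref=sr_1_1?keywords=blue+skateboard'
url = clean_result_url(href)

print(url)
print(asin_from_url(url))
print(leading_token('4.5 out of 5 stars'))
```

Note that asin_from_url leaves a string without '/dp/' untouched, so a link in a different format would be passed through as the "ASIN" rather than rejected; a stricter pattern may be preferable if that matters.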