
Python Beautifulsoup Selenium Scraper

I'm using the following Python script to scrape info from Amazon pages. At some point, it stopped returning page results. The script is starting, browsing through the keywords/p

Solution 1:

The following shows some changes you could make. I have switched to CSS selectors in several places.

The main result set to loop over is retrieved with soup.select('.s-result-list [data-asin]'). This selects every element that has a data-asin attribute and sits anywhere inside an element with class s-result-list. At the time of writing, this matches the 60 items on the page.
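To see what that selector picks up, here is a minimal sketch run against a made-up HTML fragment. The markup and the ASIN values are invented for illustration and are not taken from a real Amazon page. The space in the selector is the descendant combinator, so nested elements are matched at any depth, while elements outside .s-result-list are ignored.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a search results page.
html = """
<div class="s-result-list">
  <div data-asin="B000000001"><h5><a href="/x/dp/B000000001">First</a></h5></div>
  <div class="wrapper">
    <div data-asin="B000000002"><h5><a href="/y/dp/B000000002">Second</a></h5></div>
  </div>
</div>
<div data-asin="B000000003">outside the result list</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant combinator + attribute-presence selector.
items = soup.select(".s-result-list [data-asin]")
asins = [item["data-asin"] for item in items]
print(asins)
```

The third element carries data-asin but is not inside .s-result-list, so it is left out of the result.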

Headers are now h5, i.e. header = soup.select_one('h5').

I swapped the PRIME selection to use an attribute = value selector:

soup.select_one('[aria-label="Amazon Prime"]')
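One thing to keep in mind is that select_one returns None when nothing matches, so the result should be checked before indexing into it. A small sketch of the attribute = value lookup, again on invented markup; prime_label is a hypothetical helper name, not part of the original script:

```python
from bs4 import BeautifulSoup

# Hypothetical result items: only the first one has a Prime badge.
html = """
<div data-asin="B000000001">
  <h5><a href="/x/dp/B000000001"> First item </a></h5>
  <i aria-label="Amazon Prime"></i>
</div>
<div data-asin="B000000002">
  <h5><a href="/y/dp/B000000002">Second item</a></h5>
</div>
"""

def prime_label(item):
    """Return the aria-label of the Prime badge, or "None" if it is absent."""
    tag = item.select_one('[aria-label="Amazon Prime"]')
    return tag["aria-label"] if tag is not None else "None"

soup = BeautifulSoup(html, "html.parser")
rows = [
    (item.select_one("h5").get_text(strip=True), prime_label(item))
    for item in soup.select("[data-asin]")
]
print(rows)
```

An explicit None check like this does the same job as the try/except blocks in the script, but it will not hide unrelated errors.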

Example code:

import datetime
from bs4 import BeautifulSoup
import time
from selenium import webdriver
import re

keyword = 'blue+skateboard'
driver = webdriver.Chrome()

url = 'https://www.amazon.co.uk/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords={}'

driver.get(url.format(keyword))
soup = BeautifulSoup(driver.page_source, 'lxml')
results = soup.select('.s-result-list [data-asin]')

for result, item in enumerate(results, start=1):
    header = item.select_one('h5')
    if header is None:
        # No title header in this element, so there is nothing to report
        continue
    title = header.text.strip()

    try:
        link = item.select_one('h5 > a')
        # Strip the trailing /ref=... part from the product link
        product_url = re.sub(r'/ref=.*', '', str(link['href']))
    except (TypeError, KeyError):
        # No link inside the header, or the link has no href
        product_url = "None"

    # Skip redirect links, which carry no product URL to parse
    if product_url != '/gp/slredirect/picassoRedirect.html':
        ASIN = re.sub(r'.*/dp/', '', product_url)

        try:
            # Keep only the text before the first space, e.g. the leading number
            score = item.select_one('.a-icon-alt').text.strip()
            score = re.sub(r' .*', '', score)
        except AttributeError:
            score = "None"

        try:
            reviews = item.select_one("[href*='#customerReviews']").text.strip()
        except AttributeError:
            reviews = "None"

        try:
            PRIME = item.select_one('[aria-label="Amazon Prime"]')['aria-label']
        except TypeError:
            PRIME = "None"

        data = {keyword: [keyword, str(result), title, ASIN, score, reviews, PRIME,
                          datetime.datetime.today().strftime("%B %d, %Y")]}
        print(data)
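The string handling in the loop can be tried out without a browser. The sketch below isolates the three regex steps; the function names and the sample link are made up for illustration. Note that re.sub(r'.*/dp/', '', url) returns the URL unchanged when it contains no /dp/ segment, so this variant uses re.search and returns None in that case, which is a deliberate change from the script above.

```python
import re

def clean_url(href):
    """Drop the trailing '/ref=...' part from a product link."""
    return re.sub(r"/ref=.*", "", href)

def extract_asin(url):
    """Return the path segment after '/dp/', or None if there is no such segment."""
    match = re.search(r"/dp/([^/?]+)", url)
    return match.group(1) if match else None

def rating_value(text):
    """Keep only the text before the first space, as the loop does for the score."""
    return re.sub(r" .*", "", text.strip())

# Hypothetical link in the shape the script expects.
href = "/Example-Skateboard/dp/B000000001/ref=sr_1_1?ie=UTF8&qid=123"
url = clean_url(href)
asin = extract_asin(url)
no_asin = extract_asin("/gp/slredirect/picassoRedirect.html")
score = rating_value("4.5 out of 5 stars")
print(url, asin, no_asin, score)
```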

