Python Beautifulsoup Selenium Scraper

March 31, 2024

I'm using the following python script for scraping info from Amazon pages. At some point, it stopped returning page results. The script is starting, browsing through the keywords/p

Solution 1:

The following shows some changes you could make. I have switched to CSS selectors in several places.

The main result set to loop over is retrieved with soup.select('.s-result-list [data-asin]'). This selects every element that has a data-asin attribute and is a descendant of an element with the class s-result-list. It currently matches the 60 items on the page.
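To see how that selector behaves, here is a minimal sketch run against a small hand-written HTML fragment rather than a live Amazon page. The markup and the ASIN values are made up for illustration, and html.parser is used so that lxml is not required. The space in the selector is a descendant combinator, so nested matches count too, while elements outside .s-result-list are ignored.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a search results page.
html = """
<div class="s-result-list">
  <div data-asin="B000000001"><h5><a href="/x/dp/B000000001/ref=sr_1">Board one</a></h5></div>
  <div class="wrapper">
    <div data-asin="B000000002"><h5><a href="/y/dp/B000000002/ref=sr_2">Board two</a></h5></div>
  </div>
</div>
<div data-asin="B000000003">Outside the result list</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Any element with a data-asin attribute nested anywhere inside .s-result-list.
results = soup.select(".s-result-list [data-asin]")
asins = [tag["data-asin"] for tag in results]
print(asins)
```

The third div has a data-asin attribute but sits outside the result list, so it is not selected.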

I swapped the PRIME selection to an attribute = value selector:

soup.select_one('[aria-label="Amazon Prime"]')

Headers are now h5, i.e. header = soup.select_one('h5').
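As a rough illustration of the attribute = value selector and the h5 lookup, the sketch below runs both against a hand-written fragment. The markup and the prime_label helper are hypothetical and not part of the original script. select_one returns None when nothing matches, so the helper checks for that instead of relying on an exception.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for two result items; only the first carries a Prime badge.
html = """
<div data-asin="B000000001">
  <h5><a href="/dp/B000000001">Board one</a></h5>
  <i aria-label="Amazon Prime"></i>
</div>
<div data-asin="B000000002">
  <h5><a href="/dp/B000000002">Board two</a></h5>
</div>
"""


def prime_label(item):
    """Return the aria-label of the Prime badge, or the string "None" if absent."""
    badge = item.select_one('[aria-label="Amazon Prime"]')
    return badge["aria-label"] if badge is not None else "None"


soup = BeautifulSoup(html, "html.parser")
items = soup.select("[data-asin]")

# Pair each item's h5 title with its Prime status.
rows = [(item.select_one("h5").text.strip(), prime_label(item)) for item in items]
print(rows)
```

An explicit None check keeps the fallback value the same as in the script while making it clear which failure is expected.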

Example code:

import datetime
from bs4 import BeautifulSoup
import time
from selenium import webdriver
import re

keyword = 'blue+skateboard'
driver = webdriver.Chrome()

url = 'https://www.amazon.co.uk/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords={}'

driver.get(url.format(keyword))
soup = BeautifulSoup(driver.page_source, 'lxml')
results = soup.select('.s-result-list [data-asin]')

for a, b in enumerate(results):
    soup = b
    header = soup.select_one('h5')
    result = a + 1
    title = header.text.strip()

    try:
        link = soup.select_one('h5 > a')
        url = link['href']
        url = re.sub(r'/ref=.*', '', str(url))
    except Exception:
        url = "None"

    # Skip the redirect link, which has no /dp/<ASIN> path.
    if url != '/gp/slredirect/picassoRedirect.html':
        ASIN = re.sub(r'.*/dp/', '', str(url))
        #print(ASIN)

        try:
            score = soup.select_one('.a-icon-alt')
            score = score.text
            score = score.strip('\n')
            # Keep only the text before the first space.
            score = re.sub(r' .*', '', str(score))
        except Exception:
            score = "None"

        try:
            reviews = soup.select_one("[href*='#customerReviews']")
            reviews = reviews.text.strip()
        except Exception:
            reviews = "None"

        try:
            PRIME = soup.select_one('[aria-label="Amazon Prime"]')
            PRIME = PRIME['aria-label']
        except Exception:
            PRIME = "None"
        data = {keyword:[keyword,str(result),title,ASIN,score,reviews,PRIME,datetime.datetime.today().strftime("%B %d, %Y")]}
        print(data)

Example output:
