Why Is Scrapy Skipping Some URLs But Not Others?
Solution 1:
First, check whether robots.txt is being ignored; from what you have said, I suppose you already have that.
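In Scrapy that is controlled by a setting in your project's settings.py; a minimal sketch (whether you actually want to ignore robots.txt depends on the site's terms):

```python
# settings.py
# Setting this to False tells Scrapy not to download or obey robots.txt,
# so URLs disallowed there will no longer be skipped for that reason.
ROBOTSTXT_OBEY = False
```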
Sometimes the HTML returned in the response is not the same as what you see in the browser when you look at the product. I don't know exactly what is going on in your case, but you can check what the spider is actually "reading" with:
scrapy shell 'yourURL'
After that, run:
view(response)
This opens the response in your browser, so you can check the code that the spider is actually seeing, provided the request succeeds.
Sometimes the request does not succeed (maybe Amazon is redirecting you to a CAPTCHA or something).
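You can also guard against that inside the spider itself. A minimal sketch, where looks_like_captcha is a hypothetical helper using a crude string heuristic, not anything built into Scrapy:

```python
def looks_like_captcha(body: bytes) -> bool:
    # Crude heuristic: Amazon's interstitial "Robot Check" pages
    # usually mention "captcha" somewhere in the HTML.
    return b"captcha" in body.lower()

# Inside your parse method you could then do something like:
#
#     def parse(self, response):
#         self.logger.info("%s returned %s", response.url, response.status)
#         if response.status != 200 or looks_like_captcha(response.body):
#             self.logger.warning("Skipping %s", response.url)
#             return
#         ...
```

Logging the status and URL for every response makes it much easier to see which requests are the ones being "skipped".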
You can also check the response while scraping with something like this (please check the code below, I'm doing this from memory):
import requests
# inside your parse method
r = requests.get("url")
print(r.content)
If I remember correctly, you can get the URL from Scrapy itself, something along the lines of response.url.
Solution 2:
Try making use of dont_filter=True in your Scrapy requests. I had the same problem; it seemed like the Scrapy crawler was ignoring some URLs because it thought they were duplicates.
dont_filter=True
This makes sure that Scrapy doesn't filter any URLs with its dupefilter.