Why Is Scrapy Skipping Some URLs But Not Others?
Solution 1:
First, check whether robots.txt is being ignored; from what you have said, I suppose you already have that.
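In Scrapy that is controlled by a setting in your project's settings.py; a minimal sketch (whether you actually want to ignore robots.txt depends on the site's terms):

```python
# settings.py
# Setting this to False tells Scrapy not to download or obey robots.txt,
# so URLs disallowed there will no longer be skipped for that reason.
ROBOTSTXT_OBEY = False
```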
Sometimes the HTML returned in the response is not the same as what you see in the browser when you look at the product. I don't know exactly what is going on in your case, but you can check what the spider is actually "reading" with:
scrapy shell 'yourURL'
After that, run:
view(response)
This opens the response in your browser, so you can check the code that the spider is actually seeing, provided the request succeeds.
Sometimes the request does not succeed (maybe Amazon is redirecting you to a CAPTCHA or something).
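You can also guard against that inside the spider itself. A minimal sketch, where looks_like_captcha is a hypothetical helper using a crude string heuristic, not anything built into Scrapy:

```python
def looks_like_captcha(body: bytes) -> bool:
    # Crude heuristic: Amazon's interstitial "Robot Check" pages
    # usually mention "captcha" somewhere in the HTML.
    return b"captcha" in body.lower()

# Inside your parse method you could then do something like:
#
#     def parse(self, response):
#         self.logger.info("%s returned %s", response.url, response.status)
#         if response.status != 200 or looks_like_captcha(response.body):
#             self.logger.warning("Skipping %s", response.url)
#             return
#         ...
```

Logging the status and URL for every response makes it much easier to see which requests are the ones being "skipped".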
You can also check the response while scraping with something like this (please check the code below, I'm doing this from memory):
import requests
# inside your parse method
r = requests.get("url")
print(r.content)
If I remember correctly, you can get the URL from Scrapy itself, something along the lines of response.url.
Solution 2:
Try making use of dont_filter=True in your Scrapy requests. I had the same problem; it seemed like the Scrapy crawler was ignoring some URLs because it thought they were duplicates.
dont_filter=True
This makes sure that Scrapy doesn't filter any URLs with its dupefilter.