
Webpage Access While Using Scrapy

I am new to Python and Scrapy. I followed the tutorial and tried to crawl a few webpages. I used the code from the tutorial and replaced the URLs - http://www.city-data.com/advanced/se

Solution 1:

First of all, this website looks like a JavaScript-heavy one. Scrapy itself only downloads the HTML that the server returns; it does not execute JavaScript.

Second, the URL fragment (i.e. #body and everything after it) is not sent to the server, so only http://www.city-data.com/advanced/search.php is fetched. Scrapy does the same as your browser here; you can confirm that in the network tab of your browser's dev tools.

So for Scrapy, the requests to

http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0

and

http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1

are the same resource, so it is fetched only once. They differ only in their URL fragments.
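To see why the two requests collapse into one, here is a minimal sketch using only the standard library's urllib.parse.urldefrag to split off the fragment. The names BASE, PARAMS, page_urls, fetched and fragments are just illustrative; this only mimics what gets sent over the wire and is not Scrapy's own duplicate-filter code.

```python
from urllib.parse import urldefrag

BASE = "http://www.city-data.com/advanced/search.php"
# The "query" from the question; it sits after '#', so it is part of the fragment.
PARAMS = (
    "fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN"
    "&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20"
)

# The two URLs from the question, differing only in p=0 / p=1.
page_urls = [f"{BASE}#body?{PARAMS}&p={page}" for page in (0, 1)]

# urldefrag returns (url_without_fragment, fragment) for each URL.
results = [urldefrag(url) for url in page_urls]

# What is actually requested from the server: a set with a single entry.
fetched = {result.url for result in results}
# The parts that never leave the client.
fragments = [result.fragment for result in results]

print(fetched)
print(fragments[0] == fragments[1])
```

The set contains a single URL even though the fragments differ, which is why the second request is treated as a duplicate of the first.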

What you need is a JavaScript renderer. You could use Selenium or something like Splash. I recommend using the scrapy-splash plugin which includes a duplicate filter that takes into account URL fragments.
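As a rough starting point, the settings.py below sketches the scrapy-splash setup as I remember it from the plugin's README, so please check it against the version you install. SPLASH_URL is an assumption (a Splash instance running locally on port 8050); the DUPEFILTER_CLASS line is the part that enables the Splash-aware duplicate filter.

```python
# settings.py (sketch; verify the names and priorities against the scrapy-splash README)

# Assumption: a Splash instance is running locally on its default port.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes the Splash request arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

In the spider you would then yield scrapy_splash.SplashRequest objects instead of plain scrapy.Request objects, so that the pages are rendered by Splash before they reach your callback.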
