Webpage Access While Using Scrapy
I am new to Python and Scrapy. I followed the tutorial and tried to crawl a few webpages. I used the code from the tutorial and replaced the URLs - http://www.city-data.com/advanced/se
Solution 1:
First of all, this website looks like a JavaScript-heavy one. Scrapy itself only downloads HTML from servers but does not interpret JavaScript statements.
Second, the URL fragment (i.e. everything from #body onward) is not sent to the server, so only http://www.city-data.com/advanced/search.php is fetched. Scrapy behaves the same way as your browser here; you can confirm that in the network tab of your browser's dev tools.
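If you want to check this without opening the browser, here is a minimal sketch using only Python's standard `urllib.parse`. The query string is shortened from the one in the question purely for readability, and the variable names are my own.

```python
from urllib.parse import urlsplit

# Shortened version of the URL from the question (most parameters trimmed).
url = "http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&ps=20&p=0"

parts = urlsplit(url)

# Everything after the first '#' is the fragment; it stays on the client side.
print("fragment:", parts.fragment)

# The query component is empty because the '?' comes after the '#'.
print("query:", repr(parts.query))

# The URL an HTTP client actually requests once the fragment is dropped.
sent_to_server = parts._replace(fragment="").geturl()
print("requested:", sent_to_server)
```

Note that the `?fips=0&...` part is not a real query string here: because it follows the `#`, it is part of the fragment and is presumably read by the page's JavaScript rather than by the server.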
So for Scrapy, the requests to
http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0
and
http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1
are the same resource, so it is fetched only once. They differ only in their URL fragments.
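The effect can be illustrated with a simplified stand-in for a duplicate filter. This is not Scrapy's actual implementation, just a sketch of the idea of comparing URLs with the fragment stripped; `filter_duplicates` is a hypothetical helper and the URLs are shortened versions of the two above.

```python
from urllib.parse import urldefrag

BASE = "http://www.city-data.com/advanced/search.php"

# Two URLs that differ only in the p= value inside the fragment.
urls = [
    BASE + "#body?fips=0&ps=20&p=0",
    BASE + "#body?fips=0&ps=20&p=1",
]


def filter_duplicates(candidates):
    """Return the URLs that would be fetched when fragments are ignored."""
    seen = set()
    to_fetch = []
    for url in candidates:
        key = urldefrag(url).url  # drop the '#...' part before comparing
        if key not in seen:
            seen.add(key)
            to_fetch.append(url)
    return to_fetch


fetched = filter_duplicates(urls)
print(fetched)  # only the first URL survives
```

Both URLs reduce to the same key, so the second one is treated as already seen and is never requested.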
What you need is a JavaScript renderer. You could use Selenium or something like Splash. I recommend the scrapy-splash plugin, which includes a duplicate filter that takes URL fragments into account.
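For reference, a sketch of the `settings.py` additions that the scrapy-splash README describes, as far as I recall them. The Splash address is an assumption (a Splash instance running locally on port 8050); check the README of the version you install before relying on the exact values.

```python
# settings.py (fragment) -- sketch based on the scrapy-splash README.

# Assumption: a Splash instance is reachable at this address.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware duplicate filter, used in place of Scrapy's default one.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

In the spider you would then yield `scrapy_splash.SplashRequest(url, callback, args={"wait": 0.5})` instead of a plain `scrapy.Request`, so that the page is rendered by Splash before your callback parses it. The `wait` value is only an example.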