Scrapy CrawlSpider + Splash: How to Follow Links Through LinkExtractor?

January 28, 2024

I have the following code that is partially working, class ThreadSpider(CrawlSpider): name = 'thread' allowed_domains = ['bbs.example.com'] start_urls = ['http://bbs.e

Solution 1:

I had a similar issue that seemed specific to integrating Splash with a Scrapy CrawlSpider: the spider would visit only the start URL and then close. The only way I got it to work was to drop the scrapy-splash plugin and instead use the 'process_links' method to prepend the Splash HTTP API URL to every link Scrapy collects. I then made a few other adjustments to compensate for the new issues this approach creates. Here's what I did:

You'll need these two tools to put together the Splash URL, and to take it apart again if you intend to store the original URL somewhere.

from urllib.parse import urlencode, parse_qs

With the Splash URL prepended to every link, Scrapy will filter them all out as off-site requests, so we make 'localhost' the allowed domain.

allowed_domains = ['localhost']
start_urls = ['https://www.example.com/']
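To see why 'localhost' has to be the allowed domain, it can help to look at the host of a wrapped URL. This is a small sketch using only the standard library; the target URL and wait value are the ones used in this answer, and the variable names are made up for illustration.

```python
from urllib.parse import urlencode, urlparse

# The page we actually want to crawl.
target = "https://www.example.com/"

# The same page, wrapped in a Splash render.html request (Splash assumed on localhost:8050).
splash_url = "http://localhost:8050/render.html?&" + urlencode({'url': target, 'wait': 2.0})

# Host of the original link versus host of the request that is actually sent.
target_host = urlparse(target).hostname
request_host = urlparse(splash_url).hostname

print(target_host)
print(request_host)
```

Since every request now goes to the Splash host, that is the host the off-site check sees, not the site being rendered.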

However, this poses a problem because then we may end up endlessly crawling the web when we only want to crawl one site. Let's fix this with the LinkExtractor rules. By only scraping links from our desired domain, we get around the offsite request problem.

rules = (
    Rule(LinkExtractor(allow=r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')),
         process_links='process_links'),
)
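As a quick sanity check of that pattern outside of Scrapy, here is a minimal sketch. It assumes the allow pattern is applied as a regex search over each extracted URL, which is my understanding of how LinkExtractor behaves; the sample URLs and the is_allowed helper are hypothetical.

```python
import re

# The allow pattern, built the same way as in the rule.
ALLOW = re.compile(r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com'))

# Hypothetical links, as they might appear in a rendered page.
candidates = [
    "https://www.example.com/page/2",
    "http://example.com/",
    "https://blog.example.com/post",
    "https://www.python.org/",
]

def is_allowed(url):
    # Assumption: search-style matching, not a full match.
    return ALLOW.search(url) is not None

allowed = [url for url in candidates if is_allowed(url)]
for url in allowed:
    print(url)
```

Because the dot in example.com is unescaped and the match is a search, the pattern is fairly permissive: a URL on another host that merely contains example.com in its query string would also pass. Tighten the pattern if that matters for your crawl.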

Here's the process_links method. The dictionary passed to urlencode is where you put all of your Splash arguments.

def process_links(self, links):
    for link in links:
        # Skip links that have already been wrapped in a Splash URL.
        if "http://localhost:8050/render.html?&" not in link.url:
            link.url = "http://localhost:8050/render.html?&" + urlencode({'url': link.url,
                                                                          'wait': 2.0})
    return links

Finally, to take the original URL back out of the Splash URL, use the parse_qs function.

parse_qs(response.url)['url'][0]

One final note about this approach: you'll notice an '&' right at the beginning of the Splash URL's query string (...render.html?&). It makes pulling the actual URL back out with parse_qs consistent, no matter what order the arguments are in when you call urlencode.
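The wrap and unwrap steps can be tried without running a spider. This is a minimal sketch under the same assumptions as above (Splash on localhost:8050, a 2 second wait); the helper names to_splash_url and from_splash_url and the sample URL are hypothetical.

```python
from urllib.parse import urlencode, parse_qs

SPLASH_PREFIX = "http://localhost:8050/render.html?&"

def to_splash_url(url, wait=2.0):
    """Wrap a URL in a Splash render.html request, unless it is already wrapped."""
    if SPLASH_PREFIX in url:
        return url
    return SPLASH_PREFIX + urlencode({'url': url, 'wait': wait})

def from_splash_url(splash_url):
    """Recover the original URL from a wrapped Splash URL."""
    return parse_qs(splash_url)['url'][0]

original = "https://www.example.com/threads?page=2&sort=new"
wrapped = to_splash_url(original)
recovered = from_splash_url(wrapped)
print(recovered == original)

# Without the leading '&', the first key gets glued to the endpoint,
# so 'url' is not found when it happens to be the first argument.
no_amp = "http://localhost:8050/render.html?" + urlencode({'url': original})
print('url' in parse_qs(no_amp))
```

parse_qs is being given the whole URL here rather than just the query string, which is why the leading '&' matters: it separates the endpoint from the first real argument.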

Solution 2:

Personally, I use dont_process_response=True so that the response is an HtmlResponse (which the code in _requests_to_follow requires).

I also redefine the _build_request method in my spider, like so:

def _build_request(self, rule, link):
    r = SplashRequest(url=link.url, callback=self._response_downloaded, args={'wait': 0.5}, dont_process_response=True)
    r.meta.update(rule=rule, link_text=link.text)
    return r

In the GitHub issues, some users simply redefine the _requests_to_follow method in their class.

Solution 3:

Use the code below (just copy and paste):

restrict_xpaths=('//a[contains(text(), "Next Page")]')

Instead of

restrict_xpaths=("//a[contains(text(), 'Next Page')]")
