Skip to content Skip to sidebar Skip to footer

How To Assign The Url That's Being Scraped From To An Item?

I'm pretty new to Python and Scrapy and this site has been an invaluable resource so far for my project, but now I'm stuck on a problem that seems like it'd be pretty simple. I'm p

Solution 1:

Put the start requests generation not in class body but in start_requests():

class MySpider(CrawlSpider):

    name = "spider"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        # querying the database here...

        #getting the urls from the database and assigning them to the rows list
        rows = cur.fetchall()

        for url, ... in rows:
            yield self.make_requests_from_url(url)


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("a bunch of xpaths here...")

        for site in sites:
            item = SettingsItem()
            # a bunch of items and their xpaths...
            # here is my non-working code
            item['url_item'] = response.url

            yield item

Post a Comment for "How To Assign The Url That's Being Scraped From To An Item?"