How To Assign The Url That's Being Scraped From To An Item?

March 03, 2024

I'm pretty new to Python and Scrapy and this site has been an invaluable resource so far for my project, but now I'm stuck on a problem that seems like it'd be pretty simple. I'm p

Solution 1:

Generate the start requests in the start_requests() method rather than in the class body; each response passed to parse() then carries its own URL as response.url:

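import scrapy

# SettingsItem is the item class from the question; import it from your project's items module.


class MySpider(scrapy.Spider):
    # A plain Spider is used here: parse() is overridden and no crawl rules are defined.
    name = "spider"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        # querying the database here...

        # fetch the URLs from the database into the rows list
        # (cur is the cursor created by the query above)
        rows = cur.fetchall()

        # the URL is the first column; any remaining columns are ignored here
        for url, *_ in rows:
            # Request replaces make_requests_from_url(), deprecated since Scrapy 1.4
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # response.xpath() replaces the older HtmlXPathSelector API
        sites = response.xpath("a bunch of xpaths here...")

        for site in sites:
            item = SettingsItem()
            # a bunch of items and their xpaths...
            # response.url is the URL this response was fetched from
            item['url_item'] = response.url

            yield item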
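
The database step in start_requests() is elided above. As a rough sketch only, assuming a SQLite database with a hypothetical pages table and url column (neither appears in the original, and the real project may use a different database entirely), the URLs could be loaded like this:

```python
import sqlite3


def load_start_urls(conn):
    """Return the URLs stored in the hypothetical `pages` table, in insertion order."""
    cur = conn.cursor()
    cur.execute("SELECT url FROM pages ORDER BY id")
    # fetchall() returns a list of 1-tuples here, so take the first column of each row.
    return [row[0] for row in cur.fetchall()]


def make_demo_db():
    """Build an in-memory database with two sample rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO pages (url) VALUES (?)",
        [("http://www.domain.com/a",), ("http://www.domain.com/b",)],
    )
    conn.commit()
    return conn
```

Inside start_requests(), the spider would then loop over load_start_urls(conn) and yield one request per URL, so that response.url in parse() matches the row it came from (barring redirects).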
