Slicing Url With Python

October 18, 2023 Post a Comment

I am working with a huge list of URL's. Just a quick question I have trying to slice a part of the URL out, see below: http://www.domainname.com/page?CONTENT_ITEM_ID=1234¶m

Solution 1:

Use the urlparse module. Check this function:

import urlparse

defprocess_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    parsed= urlparse.urlsplit(url)
    filtered_query= '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    return urlparse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])

In your example:

>>> process_url(a)
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'

This function has the added bonus that it's easier to use if you decide that you also want some more query parameters, or if the order of the parameters is not fixed, as in:

>>>url='http://www.domainname.com/page?other_value=xx&param3&CONTENT_ITEM_ID=1234&param1'>>>process_url(url, ('CONTENT_ITEM_ID', 'other_value'))
'http://www.domainname.com/page?other_value=xx&CONTENT_ITEM_ID=1234'

Solution 2:

The quick and dirty solution is this:

>>> "http://something.com/page?CONTENT_ITEM_ID=1234&param3".split("&")[0]
'http://something.com/page?CONTENT_ITEM_ID=1234'

Solution 3:

Another option would be to use the split function, with & as a parameter. That way, you'd extract both the base url and both parameters.

   url.split("&")

returns a list with

['http://www.domainname.com/page?CONTENT_ITEM_ID=1234', 'param2', 'param3']

Solution 4:

I figured it out below is what I needed to do:

url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
url = url[: url.find("&")]
print url
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'

Solution 5:

Parsin URL is never as simple I it seems to be, that's why there are the urlparse and urllib modules.

E.G :

import urllib
url ="http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
query = urllib.splitquery(url)
result = "?".join((query[0], query[1].split("&")[0]))
print result
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'

This is still not 100 % reliable, but much more than splitting it yourself because there are a lot of valid url format that you and me don't know and discover one day in error logs.

lacucinadiadine