
Multithread Python Requests

For my bachelor thesis I need to grab some data from about 40,000 websites. I am using python requests for this, but at the moment it is really slow at getting a response from each site.

Solution 1:

Well, you can use threads, since this is an I/O-bound problem. The built-in threading library is a good choice here. I used a Semaphore object to limit how many threads can run at the same time.

import time
import threading

# Number of parallel threads
lock = threading.Semaphore(2)


def parse(url):
    """
    Change this to your own logic; sleep() just mocks an http request.
    """
    print('getting info', url)
    time.sleep(2)

    # Done: release the semaphore so the main loop can start another thread
    lock.release()


def parse_pool():
    # List of all your urls
    list_of_urls = ['website1', 'website2', 'website3', 'website4']

    # List of thread objects so we can join them later
    thread_pool = []

    for url in list_of_urls:
        # Acquire the semaphore; this blocks once 2 threads are already running
        lock.acquire()

        # Create a new thread that calls your function with a url
        thread = threading.Thread(target=parse, args=(url,))
        thread_pool.append(thread)
        thread.start()

    for thread in thread_pool:
        thread.join()

    print('done')


parse_pool()
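The parse function above only mocks the request with sleep. As a minimal sketch of the same thread-based idea with real requests calls, the standard-library concurrent.futures.ThreadPoolExecutor can replace the manual Semaphore bookkeeping; the fetch helper, the URL list and max_workers=20 below are placeholders of my own, not values from the original answer.

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url):
    # A timeout keeps one slow site from blocking a worker forever
    response = requests.get(url, timeout=10)
    return url, response.status_code


# Placeholder URLs; replace with your real list of ~40,000 sites
list_of_urls = ['https://example.com/a', 'https://example.com/b']

# max_workers bounds how many requests run at the same time
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(fetch, url) for url in list_of_urls]
    for future in as_completed(futures):
        url, status = future.result()
        print(url, status)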

Solution 2:

You can use asyncio to run tasks concurrently. You can list the URL responses (both the completed and the pending ones) using the return value of asyncio.wait() and call the coroutines asynchronously. The results will come back in an unpredictable order, but it is a faster approach.

import asyncio


async def parse(url):
    print('in parse for url {}'.format(url))

    # Write the logic for fetching the info here; asyncio.sleep() stands in
    # for waiting on the response from the url.
    info = await asyncio.sleep(1, result='info')
    print('done with url {}'.format(url))
    return 'parse {} result from {}'.format(info, url)


async def main(sites):
    print('starting main')
    parses = [
        asyncio.ensure_future(parse(url))
        for url in sites
    ]
    print('waiting for parses to complete')
    completed, pending = await asyncio.wait(parses)

    results = [t.result() for t in completed]
    print('results: {!r}'.format(results))


event_loop = asyncio.get_event_loop()
try:
    websites = ['site1', 'site2', 'site3']
    event_loop.run_until_complete(main(websites))
finally:
    event_loop.close()
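The info = await ... line above is only a placeholder. To actually download pages concurrently you need an async HTTP client, because plain requests calls would block the event loop. Here is a minimal sketch using the third-party aiohttp package (my own assumption, not part of the original answer; install it with pip install aiohttp); the URL list and the 10-second timeout are placeholders.

import asyncio

import aiohttp


async def fetch(session, url):
    # One GET per url, reusing a single ClientSession for connection pooling
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
        return url, response.status


async def main(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)


# Placeholder URLs; replace with your real list
websites = ['https://example.com/a', 'https://example.com/b']
for url, status in asyncio.run(main(websites)):
    print(url, status)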

Solution 3:

I think it's a good idea to use multiple threads or processes, with threading or multiprocessing, or you can use grequests (asynchronous requests built on gevent), as sketched below.
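For completeness, a minimal grequests sketch, assuming the grequests package is installed (pip install grequests); the URL list and the size value are placeholders I chose.

import grequests

# Placeholder URLs; replace with your real list
urls = ['http://website1', 'http://website2']

# Build unsent requests, then send them concurrently on gevent greenlets;
# size limits how many run at the same time
unsent = (grequests.get(u) for u in urls)
responses = grequests.map(unsent, size=10)

for response in responses:
    # Failed requests come back as None
    if response is not None:
        print(response.url, response.status_code)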
