Multithread Python Requests
For my bachelor thesis I need to grab some data out of about 40000 websites. Therefore I am using python requests, but at the moment it is really slow at getting a response from each site.
Solution 1:
Well, you can use threads, since this is an I/O-bound problem. The built-in threading
library is your best choice. I used a Semaphore
object to limit how many threads can run at the same time.
import time
import threading

# Semaphore limiting the number of threads running at once
lock = threading.Semaphore(2)

def parse(url):
    """
    Change to your logic; I just use sleep to mock an HTTP request.
    """
    print('getting info', url)
    time.sleep(2)
    # When we are done, release the semaphore so another thread can start
    lock.release()

def parse_pool():
    # List of all your urls
    list_of_urls = ['website1', 'website2', 'website3', 'website4']
    # Keep the thread objects so we can join them later
    thread_pool = []
    for url in list_of_urls:
        # Acquire the semaphore; this blocks if two threads are already running
        lock.acquire()
        # Create a new thread that calls your function with a url
        thread = threading.Thread(target=parse, args=(url,))
        thread_pool.append(thread)
        thread.start()
    for thread in thread_pool:
        thread.join()
    print('done')
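The same I/O-bound pattern can also be written with `concurrent.futures.ThreadPoolExecutor`, which manages the worker threads and the concurrency limit for you. This is a minimal sketch: `fetch` here is a stand-in that mocks the network call with a short sleep, where in practice you would call `requests.get(url)`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stands in for requests.get(url); replace with your real request logic
    time.sleep(0.1)
    return 'data from {}'.format(url)

urls = ['website1', 'website2', 'website3', 'website4']

# max_workers plays the same role as the Semaphore above: at most
# two requests are in flight at any moment
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fetch, urls))

print(results)
```

`pool.map` returns results in the same order as the input urls, which is handy when you need to match responses back to sites later.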
Solution 2:
You can use asyncio to run tasks concurrently. The return value of asyncio.wait() lists the url responses (both completed and pending tasks),
and the coroutines run asynchronously. The results come back in an arbitrary order, but it is a faster approach.
import asyncio

async def parse(url):
    print('in parse for url {}'.format(url))
    # Write the logic for fetching the info here; it awaits the
    # response from the url (fetch is a placeholder for your coroutine)
    info = await fetch(url)
    print('done with url {}'.format(url))
    return 'parse {} result from {}'.format(info, url)

async def main(sites):
    print('starting main')
    parses = [
        parse(url)
        for url in sites
    ]
    print('waiting for parses to complete')
    completed, pending = await asyncio.wait(parses)
    results = [t.result() for t in completed]
    print('results: {!r}'.format(results))

event_loop = asyncio.get_event_loop()
try:
    websites = ['site1', 'site2', 'site3']
    event_loop.run_until_complete(main(websites))
finally:
    event_loop.close()
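If you want the results back in the same order as the input urls, a self-contained sketch with asyncio.gather looks like this. `fetch` mocks the network call with asyncio.sleep; in practice you would use an async HTTP client such as aiohttp, since requests itself blocks the event loop. asyncio.run (Python 3.7+) replaces the manual event-loop management shown above.

```python
import asyncio

async def fetch(url):
    # Stands in for an async HTTP request (e.g. aiohttp)
    await asyncio.sleep(0.1)
    return 'data from {}'.format(url)

async def main(sites):
    # gather preserves input order, unlike asyncio.wait
    return await asyncio.gather(*(fetch(url) for url in sites))

results = asyncio.run(main(['site1', 'site2', 'site3']))
print(results)
```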