How To Use Load More Option With A Non Head Web Scraper [instagram]
I am trying to download location details from Instagram by scraping URLs, but I am not able to use the "Load more" option to scrape more locations from those URLs. I would appreciate any suggestions.
Solution 1:
The Instagram "See more" button that I think you are describing adds a page number to the URL you are scraping, like so: https://www.instagram.com/explore/locations/c1027234/hyderabad-india/?page=2
You can add a counter that increments the page number and keep looping for as long as you continue to receive results. I added a try/except to watch for the KeyError thrown when there are no more results, then set a condition to exit the loop and write the dataframe to CSV.
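The paging pattern described above can be shown in isolation. The following is a minimal sketch, not the answer's actual code: `collect_pages` and `fake_fetch` are hypothetical names, and the fetch function is passed in as an argument so the loop can be tried without any network access.

```python
def collect_pages(fetch_page, start=1):
    """Call fetch_page(n) for n = start, start + 1, ... until a page has no results.

    fetch_page is a hypothetical callable that returns a list of items and
    raises KeyError when the page has no results, mirroring the missing key
    described above.
    """
    results = []
    page = start
    while True:
        try:
            items = fetch_page(page)
        except KeyError:
            break  # no more results: stop paging
        if not items:
            break  # an empty page also ends the loop
        results.extend(items)
        page += 1  # move on to the next page number
    return results


# Stand-in for a real request: pages 1 and 2 have data, page 3 does not.
FAKE_PAGES = {1: ["a", "b"], 2: ["c"]}


def fake_fetch(page):
    return FAKE_PAGES[page]  # raises KeyError for page 3


print(collect_pages(fake_fetch))  # ['a', 'b', 'c']
```

Stopping on an empty page as well as on KeyError is an extra safeguard assumed here, so the loop cannot run forever if the site returns an empty list instead of omitting the key.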
Modified code:
import json
import re

import pandas as pd
import requests
from geopy.geocoders import Nominatim


def Location_city(F_name):
    path = "D:\\Everyday_around_world\\instagram\\"
    filename = path + F_name
    url1 = "https://www.instagram.com/explore/locations/c1027234/hyderabad-india/?page="
    pageNumber = 1
    r = requests.get(url1 + str(pageNumber))  # grabs page 1
    rows = []  # one dict per geocoded location
    # Recent geopy versions require a user_agent; the value here is a placeholder.
    geolocator = Nominatim(user_agent="location_scraper")
    searching = True
    while searching:
        match = re.search('window._sharedData = (.*);</script>', r.text)
        try:
            a = json.loads(match.group(1))
            b = a['entry_data']['LocationsDirectoryPage'][0]['location_list']
        except (AttributeError, KeyError, IndexError):
            # No embedded data, or no more locations returned
            b = []  # avoids duplicates from previous results
        if not b:
            searching = False  # will exit while loop
        for z in b:  # skipped when there are no results
            # Keep only names made entirely of ASCII characters
            if all(ord(char) < 128 for char in z['name']):
                x = str(z['name'])
                print(x)
                location = geolocator.geocode(x, timeout=10000)
                if location is not None:
                    rows.append({'name': z['name'], 'id': z['id'],
                                 'latitude': location.latitude,
                                 'longitude': location.longitude})
        if searching:
            pageNumber += 1
            next_url = url1 + str(pageNumber)  # increments url
            r = requests.get(next_url)  # gets results for next url
    # When finished looping through pages, write dataframe to csv
    df3 = pd.DataFrame(rows, columns=['name', 'id', 'latitude', 'longitude'])
    df3.to_csv(filename, header=True, index=False)


Location_city("Hyderabad_locations.csv")
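The parsing step inside the loop can also be checked offline by separating it from the request. This is only a sketch under assumptions: `extract_locations` is a hypothetical helper, the sample HTML is made up for illustration, and the `window._sharedData` layout is taken from the code above; Instagram may change or remove it.

```python
import json
import re

# Non-greedy match so the capture stops at the first ";</script>".
SHARED_DATA_RE = re.compile(r'window\._sharedData = (.*?);</script>', re.DOTALL)


def extract_locations(html):
    """Return the location_list embedded in the page, or [] if it is absent."""
    match = SHARED_DATA_RE.search(html)
    if match is None:
        return []  # the page does not embed the shared-data script
    data = json.loads(match.group(1))
    try:
        return data['entry_data']['LocationsDirectoryPage'][0]['location_list']
    except (KeyError, IndexError):
        return []  # the expected keys are missing: treat as no results


# Made-up sample page, only for illustration.
sample = ('<script>window._sharedData = {"entry_data": {"LocationsDirectoryPage": '
          '[{"location_list": [{"id": "1", "name": "Charminar"}]}]}};</script>')

print(extract_locations(sample))
print(extract_locations("<html></html>"))
```

Returning an empty list in every failure case lets the caller use a single "no results" check to decide when to stop paging.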