How To Get Page Id From Wikipedia Page Title
Solution 1:
import requests

page_titles = ['A', 'B', 'C', 'D']
url = (
    'https://en.wikipedia.org/w/api.php'
    '?action=query'
    '&prop=info'
    '&inprop=subjectid'
    '&titles=' + '|'.join(page_titles) +
    '&format=json')
json_response = requests.get(url).json()

# The keys of "pages" are the page IDs; map each returned title to its ID.
title_to_page_id = {
    page_info['title']: page_id
    for page_id, page_info in json_response['query']['pages'].items()}

print(title_to_page_id)
print([title_to_page_id[title] for title in page_titles])
This will print:
{'A': '290', 'B': '34635826', 'C': '5200013', 'D': '8123'}
['290', '34635826', '5200013', '8123']
If you have many titles, you have to split them across multiple requests, because at most 50 titles (500 for bots) can be queried at once.
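The batching described above could be sketched as follows. The names chunked, fetch_pages and get_page_ids_batched are hypothetical helpers rather than part of any library, and the batch size of 50 assumes a non-bot client. This is a minimal sketch, not a complete client: it does no retrying or rate limiting.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_pages(titles):
    """Query the API for one batch of titles and return its 'pages' dict."""
    # Imported here so the batching logic itself has no network dependency.
    import requests
    params = {
        'action': 'query',
        'prop': 'info',
        'titles': '|'.join(titles),
        'format': 'json',
    }
    response = requests.get('https://en.wikipedia.org/w/api.php', params=params)
    response.raise_for_status()
    return response.json()['query']['pages']


def get_page_ids_batched(page_titles, batch_size=50, fetch=fetch_pages):
    """Return {title as reported by the API: page ID}, one request per batch."""
    title_to_page_id = {}
    for batch in chunked(page_titles, batch_size):
        for page_id, page_info in fetch(batch).items():
            title_to_page_id[page_info['title']] = page_id
    return title_to_page_id
```

Calling get_page_ids_batched(page_titles) with a list of 120 titles would issue three requests (50, 50 and 20 titles). The fetch parameter only exists so the function can be tried with a stand-in instead of the live API.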
Solution 2:
The answer provided by AXO (Solution 1) works as long as you don't pass unnormalized titles, such as the category page "Category:Computer_storage_devices", or titles containing special characters like &.
In those cases you also need to URL-encode the titles and map the normalized titles in the response back to the titles you passed in, as follows:
import requests

def get_page_ids(page_titles):
    # URL-encode each title so characters such as & survive in the query string.
    page_titles_encoded = [requests.utils.quote(x) for x in page_titles]
    url = (
        'https://en.wikipedia.org/w/api.php'
        '?action=query'
        '&prop=info'
        '&inprop=subjectid'
        '&titles=' + '|'.join(page_titles_encoded) +
        '&format=json')
    json_response = requests.get(url).json()
    # Every title initially maps to itself; normalized titles are added below.
    page_normalized_titles = {x: x for x in page_titles}
    result = {}
    if 'normalized' in json_response['query']:
        for mapping in json_response['query']['normalized']:
            # 'to' is the normalized title, 'from' is the title as requested.
            page_normalized_titles[mapping['to']] = mapping['from']
    for page_id, page_info in json_response['query']['pages'].items():
        normalized_title = page_info['title']
        page_title = page_normalized_titles[normalized_title]
        result[page_title] = page_id
    return result
print(get_page_ids(page_titles=['Category:R&J_Records_artists', 'Category:Computer_storage_devices', 'Category:Main_topic_classifications']))
This will print:
{'Category:R&J_Records_artists': '33352333', 'Category:Computer_storage_devices': '895945', 'Category:Main_topic_classifications': '7345184'}
Solution 3:
Querying the Wikipedia API for the mapping can be time-consuming, given that there are some restrictions on its usage.
It can be better to download the Wikipedia dump and use wikiextractor to convert it into JSON format. In that output, the key id refers to the Wikipedia page ID and the key title refers to the Wikipedia page title. So, in one go, you get the mapping for all the pages in Wikipedia!
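As a rough sketch of that last step, the function below reads extracted files and collects title and id into one dictionary. It assumes wikiextractor was run with JSON output and without compression, so that each file is named wiki_NN and holds one JSON object per line with id and title keys. The function name build_title_to_id and the directory name 'extracted' are hypothetical.

```python
import json
from pathlib import Path


def build_title_to_id(extracted_dir):
    """Map page title -> page ID from JSON-lines files under `extracted_dir`."""
    title_to_id = {}
    # Assumption: output files are named wiki_00, wiki_01, ... in subfolders.
    for path in sorted(Path(extracted_dir).rglob('wiki_*')):
        if not path.is_file():
            continue
        with path.open(encoding='utf-8') as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines between records
                page = json.loads(line)
                title_to_id[page['title']] = page['id']
    return title_to_id
```

It could then be used as title_to_id = build_title_to_id('extracted'). For a full dump the dictionary holds millions of entries, so writing it to a database or a key-value store may be preferable to keeping it in memory.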