How To Get Page Id From Wikipedia Page Title

I am trying to find the wiki IDs for a list of Wikipedia pages. The format is: input: a list of Wikipedia page titles; output: a list of Wikipedia page IDs. So far, I've gone throu

Solution 1:

Query basic page information:

import requests

page_titles = ['A', 'B', 'C', 'D']

# Build the query URL; multiple titles are separated by '|'.
url = (
    'https://en.wikipedia.org/w/api.php'
    '?action=query'
    '&prop=info'
    '&inprop=subjectid'
    '&titles=' + '|'.join(page_titles) +
    '&format=json')
json_response = requests.get(url).json()

# The 'pages' object in the response is keyed by page ID (as a string).
title_to_page_id = {
    page_info['title']: page_id
    for page_id, page_info in json_response['query']['pages'].items()}

print(title_to_page_id)
print([title_to_page_id[title] for title in page_titles])

This will print:

{'A': '290', 'B': '34635826', 'C': '5200013', 'D': '8123'}
['290', '34635826', '5200013', '8123']

If you have many titles, you have to split them across multiple requests, because the API accepts at most 50 titles per request (500 for bots).
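The batching described above can be sketched as follows. This is a minimal sketch, not part of the original answer: `chunked`, `collect_page_ids` and the `query_batch` parameter are hypothetical names. `query_batch` stands for any function that sends one request for a list of titles and returns a title-to-ID dict, for example the query code above wrapped in a function.

```python
def chunked(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_page_ids(page_titles, query_batch, batch_size=50):
    """Call `query_batch` once per batch and merge the resulting dicts.

    `query_batch` is an assumed callable: it takes a list of titles and
    returns a {title: page_id} dict for that list.
    """
    title_to_page_id = {}
    for batch in chunked(page_titles, batch_size):
        title_to_page_id.update(query_batch(batch))
    return title_to_page_id
```

With 120 titles and the default batch size, `query_batch` is called three times, with 50, 50 and 20 titles.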

Solution 2:

The answer provided by AXO works as long as the titles are already in normalized form. It fails for unnormalized titles such as the category page "Category:Computer_storage_devices" (underscores instead of spaces) or titles containing special characters like &.

In that case you also need to map the normalized titles in the response back to the titles you passed in, as follows:

import requests


def get_page_ids(page_titles):
    # Percent-encode each title so characters like '&' survive in the URL.
    page_titles_encoded = [requests.utils.quote(x) for x in page_titles]

    url = (
        'https://en.wikipedia.org/w/api.php'
        '?action=query'
        '&prop=info'
        '&inprop=subjectid'
        '&titles=' + '|'.join(page_titles_encoded) +
        '&format=json')
    json_response = requests.get(url).json()

    # Map each normalized title back to the title that was passed in.
    # Titles the API did not normalize map to themselves.
    page_normalized_titles = {x: x for x in page_titles}
    if 'normalized' in json_response['query']:
        for mapping in json_response['query']['normalized']:
            page_normalized_titles[mapping['to']] = mapping['from']

    result = {}
    for page_id, page_info in json_response['query']['pages'].items():
        normalized_title = page_info['title']
        page_title = page_normalized_titles[normalized_title]
        result[page_title] = page_id

    return result


print(get_page_ids(page_titles=['Category:R&J_Records_artists', 'Category:Computer_storage_devices', 'Category:Main_topic_classifications']))

This will print:

{'Category:R&J_Records_artists': '33352333', 'Category:Computer_storage_devices': '895945', 'Category:Main_topic_classifications': '7345184'}

Solution 3:

Querying the Wikipedia API for the mapping can be time-consuming, given the restrictions on its usage.

It may be better to download the Wikipedia dump and use wikiextractor to convert it to JSON format. In that output, the key id holds the Wikipedia page ID and the key title holds the Wikipedia page title. So, in one go, you get the mapping for all the pages it extracts.
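Reading the extracted files back into a title-to-ID mapping could look like the sketch below. It assumes wikiextractor was run with its JSON output option and without compression, so that every file under the output directory holds one JSON object per line with `id` and `title` keys. `load_title_to_page_id` and the directory argument are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path


def load_title_to_page_id(extracted_dir):
    """Build a {title: id} dict from JSON-lines files under `extracted_dir`.

    Assumes each non-empty line is a JSON object with 'id' and 'title' keys.
    """
    title_to_page_id = {}
    for path in sorted(Path(extracted_dir).rglob('*')):
        if not path.is_file():
            continue
        with path.open(encoding='utf-8') as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                page = json.loads(line)
                title_to_page_id[page['title']] = page['id']
    return title_to_page_id
```

The whole mapping is held in memory as a plain dict, which is simple but may be heavy for a full dump; writing the pairs to a database or a key-value store is an alternative if memory is tight.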
