
Match Names In Csv File To Filename In Folder

I have got a list of about 7000 names in a csv file that is arranged by surname, name, date of birth etc. I also have a folder of about 7000+ scanned documents (enrolment forms) w

Solution 1:

The first thing you want to do is read the CSV into memory. You can do this with the csv module. The most useful tool there is csv.DictReader, which takes the first line of the file as keys in a dictionary, and reads the remainder:

import csv
with open('/path/to/yourfile.csv', 'r', newline='') as f:
    rows = list(csv.DictReader(f))

from pprint import pprint
pprint(rows[:100])

On Windows, the path would look different, something like c:/some folder/some other folder/ (note the forward slashes instead of backslashes; Python accepts them on Windows, and they avoid backslash-escape problems in string literals).

The pprint call shows the first 100 rows from the file. For example, if you have columns named "First Name", "Last Name", and "Date of Birth", the output will look like:

[{'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
 {'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
 {'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
 {'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
 {'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
 {'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
 {'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
 {'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
 {'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'},
 {'Date of Birth': 'Jan 1, 1970', 'First Name': 'John', 'Last Name': 'Doe'}
 ...]

Next you want to get a list of all the 7000 files, using os.listdir:

import os
images_directory = '/path/to/images/'
image_paths = [
    os.path.join(images_directory, filename)
    for filename in os.listdir(images_directory)]

Now you'll need some way to extract the names from the files. This depends crucially on the way the files are structured. The tricky-to-use but very powerful tool for this task is called a regular expression, but probably something simple will suffice. For example, if the files are named like "first-name last-name.pdf", you could write a simple parsing method like:

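def parse_filename(filename):
    # splitext splits on the last dot only, so names containing dots still work.
    name, extension = os.path.splitext(filename)
    # Assumes exactly one space between first and last name; raises ValueError otherwise.
    first_name, last_name = name.split(' ')
    return first_name.replace('-', ' '), last_name.replace('-', ' ')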

The exact implementation will depend on how the files are named, but the key things to get you started are str.split, str.strip and a few others in that same class. You might also take a look at the re module for handling regular expressions. As I said, that's a more advanced/powerful technique, so it may not be worth worrying about right now.
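If the filenames turn out to be less regular than a single split can handle, the re module is the next step up. The following is only a sketch: the pattern assumes the same "first-name last-name.pdf" layout as the example above, and `FILENAME_PATTERN` and `parse_filename_re` are hypothetical names chosen for illustration, so adjust both to however your files are really named.

```python
import re

# Assumed layout: "<first-name> <last-name>.<extension>", e.g. "mary-ann smith.pdf".
# Hyphens stand in for spaces inside a name; whitespace separates the two names.
FILENAME_PATTERN = re.compile(r'^(?P<first>[^\s.]+)\s+(?P<last>[^\s.]+)\.\w+$')


def parse_filename_re(filename):
    """Return (first_name, last_name), or None if the filename does not fit."""
    match = FILENAME_PATTERN.match(filename.strip())
    if match is None:
        return None
    first = match.group('first').replace('-', ' ')
    last = match.group('last').replace('-', ' ')
    return first, last
```

Returning None for filenames that do not fit, rather than raising, makes it easy to collect the odd ones in a list and look at them by hand.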

A simple algorithm to do the matching would be something like the following:

# Map the (first name, last name) parsed from each filename to the filename itself.
# Note: this uses the bare filenames from os.listdir, not the joined paths.
name_to_filename = {
    parse_filename(filename.lower()): filename
    for filename in os.listdir(images_directory)}
matched_rows = []
for row in rows:
    name_key = (row['First Name'].lower(), row['Last Name'].lower())
    # dict.get returns the matching filename, or None if there is none.
    matching_file = name_to_filename.get(name_key)
    new_row = row.copy()
    if matching_file:
        new_row['File'] = matching_file
        print('Matched "%s" to %s' % (' '.join(name_key), matching_file))
    else:
        new_row['File'] = ''
        print('No match for "%s"' % (' '.join(name_key)))
    matched_rows.append(new_row)
# newline='' stops the csv module from writing blank lines between rows on Windows.
with open('/path/to/output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, ['First Name', 'Last Name', 'Date of Birth', 'File'])
    writer.writeheader()
    writer.writerows(matched_rows)

This should give you an output spreadsheet in which every row that could be matched automatically has its file filled in, and the remaining rows have a blank File column. Depending on how clean your data is, you might be able to match the remaining few entries by hand. There are only 7000 rows, and the "dumb" heuristic will probably catch most of them. If you need more advanced heuristics, you might look at the Jaccard similarity of the "words" in the name, and the difflib module for approximate string matching.
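As a rough idea of what those fallbacks could look like, here is a sketch using `difflib.get_close_matches` from the standard library together with a small word-level Jaccard helper. The function names `jaccard` and `closest_name`, the 0.8 cutoff, and the sample names are all assumptions made up for illustration; you would need to tune the cutoff against your own data.

```python
import difflib


def jaccard(a, b):
    """Jaccard similarity of the sets of words in two names (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def closest_name(name, candidates, cutoff=0.8):
    """Return the candidate most similar to name, or None below the cutoff."""
    matches = difflib.get_close_matches(name.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical names parsed from filenames, already lower-cased.
file_names = ['john doe', 'jane smith', 'mary ann jones']

print(closest_name('Jon Doe', file_names))       # a near miss in spelling
print(closest_name('Zed Nobody', file_names))    # nothing close enough
print(jaccard('Mary Ann Jones', 'jones mary ann'))  # same words, different order
```

Character-level matching (difflib) tends to help with typos, while the word-level Jaccard score ignores word order, which helps when some files are named "surname firstname". Either way, it is worth reviewing fuzzy matches by hand before trusting them.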

Of course most of this code won't quite work on your problem, but hopefully it's enough to get you started.
