Parsing a GenBank File Format with Biopython's SeqIO

May 24, 2024

I'm trying to parse a file in the protein GenBank format. Here's an example file (example.protein.gpff): LOCUS NP_001346895 208 aa linear PRI 20-JAN-2018 DEF

Solution 1:

Check out the Genebank-parser library. It accepts a GenBank filename and a batch size; next_batch yields as many records as batch_size specifies.
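For comparison, the same batching idea can be sketched with Biopython alone. This is not the Genebank-parser API: the function names iter_batches and iter_genbank_batches are hypothetical, and the sketch assumes Biopython is installed and that the file is readable with SeqIO's "genbank" format name.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield lists of up to batch_size items from any iterable of records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # islice pulls at most batch_size items without loading the whole file.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def iter_genbank_batches(path, batch_size):
    """Hypothetical helper: stream SeqRecord batches from a GenBank/GenPept file."""
    # Imported here so iter_batches stays usable without Biopython installed.
    from Bio import SeqIO

    # SeqIO.parse returns a lazy iterator of SeqRecord objects.
    yield from iter_batches(SeqIO.parse(path, "genbank"), batch_size)
```

Calling iter_genbank_batches("example.protein.gpff", 100) would then yield lists of at most 100 records each, with the final list possibly shorter.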

Solution 2:

The easiest way to deal with this file format seems to be converting it to JSON (for example, using Biopython's Bio package) and then reading it with any JSON parser (such as the rjson package in R, which parses a JSON file into a list of records).
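A minimal sketch of that conversion, assuming Biopython is installed. The JSON layout and the helper names record_to_dict and genbank_to_json are illustrative choices, not a standard schema; only a few SeqRecord fields are kept.

```python
import json


def record_to_dict(record):
    """Flatten a SeqRecord-like object into JSON-serialisable types."""
    return {
        "id": record.id,
        "name": record.name,
        "description": record.description,
        "sequence": str(record.seq),
        "features": [
            {
                "type": feature.type,
                # Locations are objects, so store their string form.
                "location": str(feature.location),
                # Normalise every qualifier value to a list.
                "qualifiers": {
                    key: value if isinstance(value, list) else [value]
                    for key, value in feature.qualifiers.items()
                },
            }
            for feature in record.features
        ],
    }


def genbank_to_json(in_path, out_path):
    """Hypothetical helper: write all records of a GenBank file as a JSON array."""
    # Imported here so record_to_dict can be used without Biopython installed.
    from Bio import SeqIO

    records = [record_to_dict(r) for r in SeqIO.parse(in_path, "genbank")]
    with open(out_path, "w") as handle:
        json.dump(records, handle, indent=2)
    return len(records)
```

After genbank_to_json("example.protein.gpff", "example.json"), the output is a JSON array with one object per record, which a JSON reader such as rjson can load as a list.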