
How To Get Data From Pdf In Cyrillic?

I get an error when I try to read Cyrillic data from a PDF:

import codecs
pdfFileObj = codecs.open('1.pdf', 'rb', 'utf-8')

The error is: 'utf8' codec can't decode byte 0x9c in position 1: inva

Solution 1:

PDF is not a text file

A PDF is not Unicode text. It is a binary format made up of streams (often compressed) that hold text, images and so on, so decoding the raw file as UTF-8 fails.
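To make the point concrete, here is a minimal sketch that uses only the standard library. The `sample` bytes are made up for illustration (they are not taken from the asker's file), and the helper names `looks_like_pdf` and `can_decode_utf8` are hypothetical.

```python
# Hypothetical sample: a PDF header line followed by a few non-text bytes,
# similar in spirit to what sits near the start of many PDF files.
sample = b"%PDF-1.4\n%\x9c\xe2\xe3\xcf\n"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data starts with the PDF magic marker."""
    return data.startswith(b"%PDF-")


def can_decode_utf8(data: bytes) -> bool:
    """Return True only if the whole byte string is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(looks_like_pdf(sample))   # the header is recognisable as PDF
print(can_decode_utf8(sample))  # the raw bytes are not valid UTF-8 text
```

The same check on a real file would read the first bytes with `open(path, 'rb')`. The file has to be opened in binary mode and handed to a PDF parser, not decoded as text.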

Use a PDF library

Take a look at PyPDF2. To get the text from the first page:

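from PyPDF2 import PdfFileReader
# PyPDF2 1.x/2.x API (3.0 renamed these to PdfReader, pages[0], extract_text())
pdf = PdfFileReader(open('/tmp/russian.pdf', 'rb'))
text = pdf.getPage(0).extractText()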

If the extracted text comes out garbled (Windows-1251 bytes read as Latin-1), you might also need to re-decode it as windows-1251:

text = text.encode('latin-1').decode('windows-1251')
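The re-encoding step above can be tried without any PDF at all. The sketch below simulates the garbling on a sample word and then reverses it. The helper name `fix_cp1251_mojibake` is hypothetical, and the fallback that returns already-correct text unchanged is an assumption about how you would want it to behave.

```python
def fix_cp1251_mojibake(text: str) -> str:
    """Reinterpret Latin-1-looking text as Windows-1251.

    If the text contains characters outside Latin-1 (for example real
    Cyrillic letters), it cannot be this kind of mojibake, so it is
    returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("windows-1251")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "Привет"
# Simulate the damage: Windows-1251 bytes mistakenly decoded as Latin-1.
garbled = original.encode("windows-1251").decode("latin-1")
repaired = fix_cp1251_mojibake(garbled)

print(garbled)
print(repaired)
```

Whether this step is needed depends on the fonts and encodings inside the particular PDF, so it is worth checking the extracted text before applying it.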

Solution 2:

This is a solution using pdfminer.six, which supports Cyrillic characters:

from pdfminer import high_level

with open('file.pdf', 'rb') as f:
    text = high_level.extract_text(f)
    print(text)

See also https://stackoverflow.com/a/70501572/3367753
