How To Get Data From Pdf In Cyrillic?

February 28, 2024

I get an error when I try to read Cyrillic data from a PDF:

import codecs
pdfFileObj = codecs.open('1.pdf', 'rb', 'utf-8')

The error is:

'utf8' codec can't decode byte 0x9c in position 1: inva

Solution 1:

PDF is not a text file

A PDF is not Unicode text. It is a binary container full of streams holding text, images and so on, many of them compressed, so decoding the whole file as UTF-8 fails.
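
To see why the decode fails, here is a minimal sketch that does not need a real PDF. It assumes the page content is stored as a zlib-compressed stream, which is common in PDFs; the sample text and variable names are made up for illustration. A zlib stream at the default compression level normally starts with the bytes 0x78 0x9c, which would match the "byte 0x9c in position 1" in the error above.

```python
import zlib

# Hypothetical stand-in for a compressed stream inside a PDF.
page_text = "Привет, мир".encode("utf-8")
stream = zlib.compress(page_text, 6)

try:
    # Treating the compressed bytes as UTF-8 text is what codecs.open does.
    stream.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print(err)

# Decompressing first gives the bytes back, and those decode fine.
recovered = zlib.decompress(stream).decode("utf-8")
print(recovered)
```

A PDF library does this unpacking (and much more, such as font-to-character mapping) for you, which is why opening the file in plain binary mode and handing it to a library is the way to go.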

Use a PDF library

Take a look at PyPDF2. To get the text from the first page:

from PyPDF2 import PdfFileReader  # PyPDF2 < 3.0; later releases renamed this to PdfReader

pdf = PdfFileReader(open('/tmp/russian.pdf', 'rb'))
text = pdf.getPage(0).extractText()

If the extracted text comes out garbled, you might also need to reinterpret it as windows-1251:

text = text.encode('latin-1').decode('windows-1251')
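
As a rough illustration of what that conversion does, the sketch below garbles a Cyrillic string the same way (windows-1251 bytes misread as Latin-1) and then repairs it. The helper name `fix_cp1251_mojibake` and the sample string are hypothetical, and this only helps when the extracted text really was mis-decoded in this particular way.

```python
def fix_cp1251_mojibake(text: str) -> str:
    """Re-read Latin-1-decoded text as windows-1251 (hypothetical helper)."""
    return text.encode("latin-1").decode("windows-1251")


original = "Привет"

# Simulate the damage: windows-1251 bytes interpreted as Latin-1.
garbled = original.encode("windows-1251").decode("latin-1")

repaired = fix_cp1251_mojibake(garbled)
print(garbled)
print(repaired)
```

If the text contains characters outside Latin-1, `encode("latin-1")` raises `UnicodeEncodeError`, which is a sign that the text was not garbled this way and should be left alone.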

Solution 2:

This is a solution with pdfminer.six, which supports Cyrillic characters:

from pdfminer import high_level

with open('file.pdf', 'rb') as f:
    text = high_level.extract_text(f)
    print(text)
