How To Get Data From Pdf In Cyrillic?
I get an error when I try to read Cyrillic data from a PDF:
import codecs
pdfFileObj = codecs.open('1.pdf', 'rb', 'utf-8')
The error is:
'utf8' codec can't decode byte 0x9c in position 1: inva
Solution 1:
PDF is not a text file
A PDF is not a Unicode text file. It is a binary container full of streams (text, images and so on), so decoding the raw file as UTF-8 fails.
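To illustrate why the UTF-8 decode fails, here is a minimal sketch. The `sample` bytes are made up to imitate the start of a PDF file (an ASCII header followed by a line of high bytes), and `try_decode_utf8` is a hypothetical helper, not part of any library.

```python
# Made-up bytes imitating the start of a PDF: an ASCII header line
# followed by a comment line containing non-ASCII binary bytes.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def try_decode_utf8(data: bytes):
    """Return the decoded string, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The whole buffer is not valid UTF-8, so decoding it as text fails.
decoded = try_decode_utf8(sample)

# Only the short ASCII header can be read as text directly.
header = sample[:8].decode("ascii")

print(decoded)
print(header)
```

So the file should be opened in plain binary mode ('rb', with no encoding) and handed to a PDF parser rather than to a text codec.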
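Use a PDF library
Take a look at PyPDF2. To get the text from the first page (these are the PyPDF2 1.x/2.x names; PyPDF2 3.0 renamed them to PdfReader, reader.pages[0] and extract_text()):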
from PyPDF2 import PdfFileReader

pdf = PdfFileReader(open('/tmp/russian.pdf', 'rb'))
text = pdf.getPage(0).extractText()
If the extracted text comes out as garbled Latin characters, you might also need to reinterpret it as windows-1251:
text = text.encode('latin-1').decode('windows-1251')
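Here is a small sketch of what that re-encoding step does, assuming the extractor read windows-1251 bytes as if they were Latin-1. The helper name `fix_cp1251_mojibake` and the sample word are made up for illustration.

```python
def fix_cp1251_mojibake(text: str) -> str:
    """Reinterpret text whose windows-1251 bytes were wrongly decoded as Latin-1.

    If the text cannot be round-tripped (for example, it already contains
    real Cyrillic characters), it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("windows-1251")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "Привет"

# Simulate the problem: Cyrillic bytes in windows-1251 read as Latin-1.
garbled = original.encode("windows-1251").decode("latin-1")

# Undo it: go back to the raw bytes, then decode with the right codec.
repaired = fix_cp1251_mojibake(garbled)

print(garbled)
print(repaired)
```

The try/except is there so that text which is already correct passes through untouched instead of raising an error.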
Solution 2:
This solution uses pdfminer.six, which supports Cyrillic characters:
from pdfminer import high_level

with open('file.pdf', 'rb') as f:
    text = high_level.extract_text(f)
print(text)