Send Request To Page With Windows-1251 Encoding From Python
Solution 1:
Let's create a page with a windows-1251 charset declared in a meta tag and some Russian nonsense text. I saved it in Sublime Text as a windows-1251 file, just to be sure.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
</head>
<body>
<p>Привет, мир!</p>
</body>
</html>
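If you don't want to depend on an editor's "save with encoding" option, the same file can be produced from Python. This is only a sketch: the helper name `write_cp1251_page` and the temporary directory are my own choices for illustration, and the file name `1251.html` is taken from the URL used in the examples.

```python
import tempfile
from pathlib import Path

HTML = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">\n'
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">'
    '</head><body><p>Привет, мир!</p></body></html>\n'
)


def write_cp1251_page(directory):
    """Hypothetical helper: write the test page as windows-1251 bytes."""
    path = Path(directory) / '1251.html'
    # 'cp1251' is Python's codec name for windows-1251.
    path.write_text(HTML, encoding='cp1251')
    return path


with tempfile.TemporaryDirectory() as tmp:
    page = write_cp1251_page(tmp)
    raw = page.read_bytes()  # the bytes a web server would send

# The Cyrillic text is stored one byte per letter, not as UTF-8.
print('Привет'.encode('cp1251') in raw)
```

Serving the directory with something like `python -m http.server 1234` is one way to make the page reachable at the address used in the examples; how the page was actually served is not stated here.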
You can use a little trick in the requests library:
If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text.
So it goes like this:
In [1]: import requests
In [2]: result = requests.get('http://127.0.0.1:1234/1251.html')
In [3]: result.encoding = 'windows-1251'
In [4]: u'Привет' in result.text
Out[4]: True
Voila!
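To see why this works without a running server, here is a minimal sketch that only uses byte strings. It assumes that reading `result.text` after setting `result.encoding` behaves roughly like decoding the raw response bytes with that codec; the variable names are mine.

```python
# Bytes as a server would send them for a windows-1251 page.
raw = 'Привет, мир!'.encode('cp1251')

# What a client sees if it assumes Latin-1 for the body.
wrong = raw.decode('iso-8859-1')

# What it sees once the encoding is overridden to windows-1251.
right = raw.decode('cp1251')

print(repr(wrong))
print(right)
print('Привет' in wrong, 'Привет' in right)
```

The bytes never change; only the codec used to turn them into text does, which is why the override is enough.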
If it doesn't work for you, there's a slightly uglier approach.
Take a look at which encoding requests thinks the web server is sending you.
The response may be labelled as ISO-8859-1 (Latin-1, a close relative of cp1252) or as something else entirely, rather than utf8 or cp1251. It varies from one web server to another!
In [1]: import requests
In [2]: result = requests.get('http://127.0.0.1:1234/1251.html')
In [3]: result.encoding
Out[3]: 'ISO-8859-1'
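A likely reason for the 'ISO-8859-1' above: the encoding is taken from the Content-Type header, not from the meta tag, and Requests documents a fallback to ISO-8859-1 for text content without an explicit charset. The sketch below imitates that rule with the standard library; `guess_text_encoding` is a hypothetical helper of mine, not the code requests actually runs.

```python
from email.message import Message


def guess_text_encoding(content_type):
    """Hypothetical helper: header charset, else a Latin-1 fallback for text/*."""
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_content_charset()
    if charset:
        return charset
    if msg.get_content_maintype() == 'text':
        # No charset in the header: fall back to the HTTP default for text.
        return 'ISO-8859-1'
    return None


print(guess_text_encoding('text/html'))
print(guess_text_encoding('text/html; charset=windows-1251'))
```

So a server that sends a bare `text/html` header gives you Latin-1 text, whatever the page says about itself.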
So we should recode the search string to match.
In [4]: u'Привет'.encode('cp1251').decode('iso-8859-1') in result.text
Out[4]: True
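The same idea can be run in the opposite direction, repairing the garbled text instead of garbling the search string. A minimal sketch, assuming the text was decoded as ISO-8859-1 while the bytes were really windows-1251; `fix_mojibake` is a hypothetical helper name.

```python
def fix_mojibake(text, wrong='iso-8859-1', right='cp1251'):
    """Hypothetical helper: undo a decode that used the wrong codec."""
    # Go back to the original bytes, then decode them properly.
    return text.encode(wrong).decode(right)


# Simulate what a Latin-1 decode does to a windows-1251 body.
garbled = 'Привет, мир!'.encode('cp1251').decode('iso-8859-1')

print(repr(garbled))
print(fix_mojibake(garbled))
```

The round trip is lossless with ISO-8859-1 because that codec maps every byte value to a character. Python's cp1252 codec leaves a few byte values undefined, so it can fail on other input.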
But that just looks ugly to me (also, I suck at encodings, and it's really not the best solution). I'd go with resetting the encoding in requests itself.
Solution 2:
As documented, requests automatically decodes response.text to unicode, so you must either look for a unicode string:
if u'cyrillic symbols' in source.text:
    # ...
or, on Python 2, encode response.text to bytes in the same encoding as your source file:
# -*- coding: utf-8 -*-
# (....)
if 'cyrillic symbols' in source.text.encode("utf-8"):
    # ...
The first option is much simpler and lighter.
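On Python 3 the two options look like the sketch below. The `text` variable is a stand-in for `source.text`, so nothing here touches the network; the point is only that both sides of `in` must be of the same type.

```python
text = '<p>Привет, мир!</p>'  # stand-in for source.text (already decoded)
needle = 'Привет'

# Option 1: compare text with text.
found_as_text = needle in text

# Option 2: compare bytes with bytes, using the same codec on both sides.
found_as_bytes = needle.encode('utf-8') in text.encode('utf-8')

# Mixing the two types is an error on Python 3.
try:
    needle in text.encode('utf-8')
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(found_as_text, found_as_bytes, mixed_ok)
```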