bs4.UnicodeDammit guesses the encoding of a byte stream (assuming the byte stream represents text).

    from bs4 import UnicodeDammit

    # The same text, "René", encoded in two different ways
    b1 = b'Ren\xc3\xa9'
    b2 = b'Ren\xe9'

    d1 = UnicodeDammit(b1)
    d2 = UnicodeDammit(b2)

    # original_encoding is the encoding that UnicodeDammit guessed
    e1 = d1.original_encoding
    e2 = d2.original_encoding

    print(e1) # utf-8
    print(e2) # cp037

    print(b1.decode(encoding = e1))
    print(b2.decode(encoding = e2)) # uh, oh! cp037 is a wrong guess, so the text is garbled
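The following sketch uses only the standard library (no bs4) to illustrate why such a guess can go wrong. The bytes b'Ren\xe9' are invalid as UTF-8 but decode without error in several single-byte encodings, so nothing in the bytes themselves identifies the right one. The helper try_decode and the candidates list are hypothetical names chosen for this illustration. They do not show how UnicodeDammit works internally.

```python
# Two byte strings that both represent "René", as in the snippet above.
utf8_bytes = "René".encode("utf-8")      # b'Ren\xc3\xa9'
latin1_bytes = "René".encode("latin-1")  # b'Ren\xe9'


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Hypothetical candidate list, chosen for illustration only.
candidates = ["utf-8", "latin-1", "cp1252", "cp037"]

# Decode the Latin-1 bytes with every candidate encoding.
results = {enc: try_decode(latin1_bytes, enc) for enc in candidates}

for enc, text in results.items():
    # !a prints an ASCII-escaped representation, independent of the terminal encoding.
    print(f"{enc:8} -> {text!a}")
```

Only utf-8 rejects the bytes. latin-1 and cp1252 both produce "René", and cp037 (an EBCDIC code page) also decodes without error but produces different characters.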
UnicodeDammit guesses more accurately if chardet or cchardet is installed. UnicodeDammit is based on code from Mark Pilgrim's Universal Feed Parser (feedparser).