python2 + urllib2: how do I strip the BOM cleanly?

sputnick · 06/10/2011, 00:09

Hi,

I've been trying out Python these days. As an excuse to learn, I parse an HTML page from cron so that I get notified when the last episode of season 4 of Breaking Bad comes out (next week; the suspense is --UNBEARABLE--).

The problem is that the file starts with a BOM ( http://fr.wikipedia.org/wiki/Byte_Order_Mark ).
The only solution I have found is to cut the page from the 4th character onward (see the code below), but there must be a cleaner way.
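For reference, the UTF-8 BOM is exactly three bytes (EF BB BF in hex, which is the "357 273 277" that od -c prints in octal), so the slice html[4:-1] also drops the fourth byte and the last byte of the page. A minimal sketch of removing the BOM only when it is present, working on the raw bytes before any decoding; strip_utf8_bom is a hypothetical helper name, not something from urllib2 or BeautifulSoup:

```python
import codecs

def strip_utf8_bom(data):
    """Return the byte string without a leading UTF-8 BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        # codecs.BOM_UTF8 is the three bytes EF BB BF
        return data[len(codecs.BOM_UTF8):]
    return data

# Made-up sample standing in for page.read()
sample = codecs.BOM_UTF8 + b"<html></html>"
print(repr(strip_utf8_bom(sample)))
```

Because the check uses startswith, a page served without a BOM is returned unchanged.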

If I use

page.read().decode('utf_8_sig')

(recommended, among other places, at http://stackoverflow.com/questions/7046 … ib-request ), I get:

Traceback (most recent call last):
  File "beautifulsoup_breakingbad.py", line 16, in <module>
    html = page.read().decode('utf_8_sig')
  File "/usr/lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfb in position 2634: invalid start byte
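The error is reported at position 2634 rather than at the start, which suggests the failure comes from a byte further down the page that is not valid UTF-8 (0xfb is "û" in ISO-8859-1), not from the BOM itself. A minimal sketch, assuming the page is mostly UTF-8 with a few stray bytes: the same utf_8_sig codec with the 'replace' error handler still strips the BOM and substitutes U+FFFD for each undecodable byte. decode_lenient is a hypothetical helper name and the sample bytes are made up:

```python
def decode_lenient(raw):
    """Decode bytes as UTF-8, dropping a leading BOM and replacing
    invalid bytes with U+FFFD (hypothetical helper)."""
    return raw.decode('utf_8_sig', 'replace')

# Made-up sample: a BOM, some ASCII, then a lone 0xfb byte as in the traceback
sample = b'\xef\xbb\xbfcaf\xfb'
print(repr(decode_lenient(sample)))
```

If the page turns out to be entirely in another encoding, decoding with that encoding instead would preserve the accented characters; the sketch above only avoids the exception.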

#!/usr/bin/python2
# -*- coding: utf8 -*-
# vim:ts=4:sw=4

# http://www.crummy.com/software/BeautifulSoup/documentation.html
# http://docs.python.org/library/urllib2.html

searchStr = "Episode 13"

import os, cookielib, urllib2
from BeautifulSoup import BeautifulSoup

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# set the User-agent on the opener before opening the URL, otherwise it is never sent
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
page = opener.open("http://moc.nwod-emertxe/series/vostfr/6840-breaking-bad-saison-04-vostfr.html")
#html = page.read().decode('utf_8_sig')
#print(html[0:4])
# prints "357 273 277  \n  \n" when piped into "od -c": that's a BOM
html = page.read()

soup = BeautifulSoup(html[4:-1])
for i in soup.findAll(style=['text-align: center;']):
    if i.text == searchStr:
        outputStr = ("""
        DISPLAY=:0 zenity --info --text "ALERT {0} breakingbad dispo\nhttp://moc.nwod-emertxe/series/vostfr/6840-breaking-bad-saison-04-vostfr.html"
        mail -s "ALERT {0} breakingbad dispo" -- xxxxxxx@gmail.com <<< "http://moc.nwod-emertxe/series/vostfr/6840-breaking-bad-saison-04-vostfr.html"
        """).format(searchStr)
        os.system(outputStr)

Any clue?
