How do I solve a problem parsing an HTML file that contains Cyrillic characters?

I have an HTML file with span elements:

<html> 
<body> 
<span class="one">Text</span>some text</br> 
<span class="two">Привет</span>Текст на русском</br> 
</body> 
</html>

To get "some text":

# -*- coding:cp1251 -*- 
import lxml 
from lxml import html 

filename = "t.html" 
fread = open(filename, 'r') 
source = fread.read() 

tree = html.fromstring(source) 
fread.close() 


tags = tree.xpath('//span[@class="one" and text()="Text"]') #This OK 
print "name: ",tags[0].text 
print "value: ",tags[0].tail 

tags = tree.xpath('//span[@class="two" and text()="Привет"]') #This False 

print "name: ",tags[0].text 
print "value: ",tags[0].tail

This prints:

name: Text 
value: some text 

Traceback: ... in line `tags = tree.xpath('//span[@class="two" and text()="Привет"]')` 
    ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes

How can I solve this problem?


2010-11-15 HammerSpb

lxml

(As observed, this is a bit sketchy between system encodings and apparently doesn't work properly on Windows XP, though it did on Linux.)

I got it to work by decoding the source string: tree = html.fromstring(source.decode('utf-8')).

# -*- coding:cp1251 -*- 
import lxml 
from lxml import html 

filename = "t.html" 
fread = open(filename, 'r') 
source = fread.read() 

tree = html.fromstring(source.decode('utf-8')) 
fread.close() 


tags = tree.xpath('//span[@class="one" and text()="Text"]') #This OK 
print "name: ",tags[0].text 
print "value: ",tags[0].tail 

tags = tree.xpath('//span[@class="two" and text()="Привет"]') #This is now OK too 

print "name: ",tags[0].text 
print "value: ",tags[0].tail

This means the actual tree consists entirely of unicode objects. If you only pass the XPath parameter as unicode, without decoding the source first, it finds 0 matches.
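One possible explanation for the 0 matches, sketched here in Python 3 with only the standard library: if the parser is handed raw UTF-8 bytes and guesses a Latin-1-style encoding (an assumption on my part, not something verified against lxml in this thread), the text it stores is mojibake and can never equal the Cyrillic string in the query. The variable names are just for illustration.

```python
# Python 3 sketch; the Latin-1 fallback is an assumption, not confirmed lxml behaviour.
raw = "Привет".encode("utf-8")      # the bytes as they sit in a UTF-8 file

wrong = raw.decode("latin-1")       # what a parser guessing Latin-1 would store
right = raw.decode("utf-8")         # what you get by decoding explicitly first

print(wrong == "Привет")            # False: the stored text is mojibake
print(right == "Привет")            # True: decoded text matches the query
```

Decoding before parsing, as in the script above, removes the guesswork because the parser receives text rather than bytes.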

BeautifulSoup

I prefer using BeautifulSoup for all of this sort of thing anyway. Here is my interactive session; I saved the file as cp1251.

>>> from BeautifulSoup import BeautifulSoup 
>>> filename = '/tmp/cyrillic' 
>>> fread = open(filename, 'r') 
>>> source = fread.read() 
>>> source # Scary 
'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\xcf\xf0\xe8\xe2\xe5\xf2</span>\xd2\xe5\xea\xf1\xf2 \xed\xe0 \xf0\xf3\xf1\xf1\xea\xee\xec</br>\n</body>\n</html>\n' 
>>> source = source.decode('cp1251') # Let's try getting this right. 
>>> source 
u'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\u041f\u0440\u0438\u0432\u0435\u0442</span>\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c</br>\n</body>\n</html>\n' 
>>> soup = BeautifulSoup(source) 
>>> soup # OK, that's looking right now. Note the </br> was dropped as that's bad HTML with no meaning. 
<html> 
<body> 
<span class="one">Text</span>some text 
<span class="two">Привет</span>Текст на русском 
</body> 
</html> 

>>> soup.find('span', 'one').findNextSibling(text=True) 
u'some text' 
>>> soup.find('span', 'two').findNextSibling(text=True) # This looks a bit daunting ... 
u'\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c' 
>>> print _ # ... but it's not, really. Just Unicode chars. 
Текст на русском 
>>> # Then you may also wish to get things by text: 
>>> print soup.find(text=u'Привет').findParent().findNextSibling(text=True) 
Текст на русском 
>>> # You can't get things by attributes and the contained NavigableString at the same time, though. That may be a limitation.

In light of all this, it may be worth trying source.decode('cp1251') instead of source.decode('utf-8') when you read the file from the filesystem. lxml may actually work then.
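If you don't know in advance which of the two encodings a file uses, a small fallback loop is one way to handle it. This is a minimal Python 3 sketch; decode_with_fallback and its candidate list are hypothetical names of mine, not part of lxml or BeautifulSoup. UTF-8 goes first because cp1251 accepts nearly any byte sequence and would otherwise mask UTF-8 input.

```python
# Python 3 sketch; decode_with_fallback is a hypothetical helper.
def decode_with_fallback(raw, encodings=("utf-8", "cp1251")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")

# Bytes of the second span's text as saved in cp1251 (taken from the session above).
raw = b"\xcf\xf0\xe8\xe2\xe5\xf2"
text, used = decode_with_fallback(raw)
print(used, text)
```

The decoded text can then be passed to html.fromstring() or BeautifulSoup() in place of the raw bytes.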


2010-11-15 03:40:11

That doesn't work either. I tried running it on Windows XP. – HammerSpb

I did it on Linux. Hang on, I'll start up my XP virtual machine and see if I can figure it out on XP. –

Thanks Chris! On XP these are ANSI files. – HammerSpb

I haven't tested this, but wrapping the call to tags[0].tail in the unicode() built-in function should do it: unicode(tags[0].tail)


2010-11-15 02:40:08 jonesy

The problem is in this line: tags = tree.xpath('//span[@class="two" and text()="Привет"]') – HammerSpb

OK, well what about 'text()=u"Привет"', and if that doesn't work, try 'text()=unicode("Привет")' – jonesy

I tried those options. Same result. My HTML file is in ASCII format and so is the Python script. I tried converting it to UTF-8 but it made no difference. (( Maybe I don't understand something about how the codecs are handled? – HammerSpb

I got the same error when generating XML with lxml. I found a solution here: http://lethain.com/stripping-illegal-characters-from-xml-in-python/

I just did:

remove_re = re.compile(u'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]') 
etree_sub_el.text = remove_re.sub('', text)
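For anyone who wants to try that outside of an lxml tree, here is a self-contained Python 3 sketch of the same idea. The function name strip_illegal_xml_chars and the sample string are mine, used only for illustration; the character class is the one from the snippet above.

```python
import re

# Control characters to drop: everything below 0x20 except tab, LF and CR,
# plus DEL (0x7F), which the snippet above also strips.
remove_re = re.compile(r"[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]")

def strip_illegal_xml_chars(text):
    """Return text with the control characters above removed."""
    return remove_re.sub("", text)

# A NUL byte and a vertical tab are removed; Cyrillic text and the newline stay.
cleaned = strip_illegal_xml_chars("Привет\x00, мир\x0b!\n")
print(repr(cleaned))
```

The cleaned string can then be assigned to an element's .text without tripping the "no NULL bytes" check.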


2011-04-18 15:29:05 Alerion

Try this:

tree = html.fromstring(source.decode('utf-8'))

tags = tree.xpath('//span[@class="two" and text()="%s"]' % u'Привет')


2012-10-24 08:59:01 zhmyh
