How to extract links from a web page using lxml, XPath and Python?

I have this XPath query:

/html/body//tbody/tr[*]/td[*]/a[@title]/@href

It matches all links that have a title attribute and returns their href in Firefox's XPath Checker add-on. However, I can't get it to work with lxml.

from lxml import etree
parsedPage = etree.HTML(page)  # Create parse tree from the page source.

# XPath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
for x in hyperlinks:
    print(x)  # Print the href of each <a> tag that has a title attribute

This produces no results from lxml (an empty list).

How do I get the href text (the link) of a hyperlink that has a title attribute, using lxml under Python?

2010-01-18 torger

Does the parsed document have a namespace (xmlns)?
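To illustrate why the namespace question matters, here is a minimal sketch. It assumes the page is XHTML and is parsed with lxml's XML parser (etree.fromstring); the markup and the "x" prefix are made up for the example. With etree.HTML, as in the question, elements are normally not namespaced, so this may not be the cause there.

```python
from lxml import etree

# Hypothetical XHTML snippet that declares a default namespace.
xhtml = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table><tbody><tr>
  <td><a href="http://example.com/a" title="A">A link</a></td>
</tr></tbody></table>
</body></html>"""

root = etree.fromstring(xhtml)  # XML parser: elements keep their namespace

# Unprefixed names only match elements in no namespace, so this finds nothing.
plain = root.xpath("/html/body//a[@title]/@href")

# Bind an arbitrary prefix ("x" here) to the namespace URI and use it in the query.
ns = {"x": "http://www.w3.org/1999/xhtml"}
prefixed = root.xpath("/x:html/x:body//x:a[@title]/@href", namespaces=ns)

print(plain)
print(prefixed)
```

If the unprefixed query comes back empty while the prefixed one finds the links, the namespace is the likely culprit.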

I was able to get it working with the following code:

from lxml import etree
from io import StringIO

html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head/>
<body>
    <table border="1">
     <tbody>
     <tr>
      <td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
     </tr>
     <tr>
      <td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
     </tr>
     </tbody>
    </table>
</body>
</html>'''

# Parse the string with the XML parser; this sample is well-formed and contains <tbody>.
tree = etree.parse(StringIO(html_string))
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))

>>> ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']

2010-01-18 09:03:58 jkp

Firefox adds extra HTML tags to the document when it renders it, which makes the XPath returned by the Firebug tool inconsistent with the actual HTML returned by the server (and with what urllib/urllib2 will return).

Removing the <tbody> step from the XPath generally does the trick.
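A minimal sketch of that fix, assuming the server sends a table without <tbody> (the markup and example.com URLs below are made up for illustration):

```python
from lxml import etree

# Hypothetical server response: rows sit directly under <table>, no <tbody>.
page = """<html><body>
<table>
  <tr><td><a href="http://example.com/a" title="A">A link</a></td></tr>
  <tr><td><a href="http://example.com/b">No title</a></td></tr>
</table>
</body></html>"""

tree = etree.HTML(page)

# Browser-derived query: expects a <tbody> that is not in the raw HTML.
with_tbody = tree.xpath("/html/body//tbody/tr/td/a[@title]/@href")

# Same query without the <tbody> step; //tr matches rows with or without it.
without_tbody = tree.xpath("/html/body//table//tr/td/a[@title]/@href")

print(with_tbody)
print(without_tbody)
```

Using //tr rather than a fixed table/tr path keeps the query working whether or not the source actually contains a <tbody>.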

2011-12-06 01:48:51 mrmagooey
