Parse an HTML page in Java

I am parsing this page segment:

<tr valign="middle"> 
    <td class="inner"><span style=""><span class="" title=""></span> 2 <span class="icon ok" title="Verified"></span> </span><span class="icon cat_tv" title="Video » TV" style="bottom:-2;"></span> <a href="/VALUE.html" style="line-height:1.4em;">VALUE</a> </td> 
    <td width="1%" align="center" nowrap="nowrap" class="small inner" >VALUE</td> 
    <td width="1%" align="right" nowrap="nowrap" class="small inner" >VALUE</td> 
    <td width="1%" align="center" nowrap="nowrap" class="small inner" >VALUE</td> 
</tr>

I have this segment in the variable tv: HtmlElement tv = tr.get(i);

I read the <a href="/VALUE.html" style="line-height:1.4em;">VALUE</a> tag this way:

HtmlElement a = tv.getElementsByTagName("a").get(0);   
object.name.value(a.getTextContent()); 

url = a.getAttribute("href"); 
object.url_detail.value(myBase + url);

How can I read only the VALUE field of the other <td>....</td> sections?


2013-03-12 Emanuele Mazzoni

Which framework are you using for parsing? – Henrik

Maybe use 'tv.getElementsByTagName("td")', loop over the result, and get the text of each cell with 'getTextContent()'? Have you tried that? – A4L
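
The loop suggested in the comment above can be sketched as follows. Since the question does not say which framework provides HtmlElement, this sketch assumes the standard org.w3c.dom API from the JDK instead; the class name TdTextLoop, the method cellTexts, and the simplified row with placeholder values A, B, C are hypothetical and only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TdTextLoop {
    // Simplified, well-formed copy of the row from the question (placeholder values).
    static final String ROW =
        "<tr valign=\"middle\">"
        + "<td class=\"inner\"><a href=\"/VALUE.html\">NAME</a></td>"
        + "<td class=\"small inner\">A</td>"
        + "<td class=\"small inner\">B</td>"
        + "<td class=\"small inner\">C</td>"
        + "</tr>";

    // Returns the trimmed text of every <td> except the first one.
    static List<String> cellTexts(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList cells = doc.getElementsByTagName("td");
        List<String> texts = new ArrayList<>();
        // Start at index 1: cell 0 holds the link that the question already reads.
        for (int i = 1; i < cells.getLength(); i++) {
            texts.add(cells.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(cellTexts(ROW)); // prints [A, B, C]
    }
}
```

With the HtmlElement from the question, the same idea would presumably be a loop over tv.getElementsByTagName("td") starting at index 1, calling getTextContent() on each element.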

I suggest using XPath, a standard way to query parsed XML/HTML documents.

Reference: How to read XML using XPath in Java

Also take a look at this question: RegEx match open tags except XHTML self-contained tags

Update

If I understood correctly, you need the "VALUE" of each td, right? If so, your XPath would be something like this:

//td[@class="small inner"]/text()
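
A minimal sketch of evaluating that expression from Java, assuming the row is available as well-formed XML and using only the javax.xml.xpath API shipped with the JDK. The class name TdTextXPath, the method smallInnerTexts, and the placeholder values A, B, C are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TdTextXPath {
    // Simplified, well-formed copy of the row from the question (placeholder values).
    static final String ROW =
        "<tr valign=\"middle\">"
        + "<td class=\"inner\"><a href=\"/VALUE.html\">NAME</a></td>"
        + "<td class=\"small inner\">A</td>"
        + "<td class=\"small inner\">B</td>"
        + "<td class=\"small inner\">C</td>"
        + "</tr>";

    // Returns the text nodes of every <td> whose class attribute is exactly "small inner".
    static List<String> smallInnerTexts(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//td[@class=\"small inner\"]/text()",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getNodeValue().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(smallInnerTexts(ROW)); // prints [A, B, C]
    }
}
```

Note that @class="small inner" is an exact string comparison, so a cell whose class attribute lists the same names in a different order or with extra names would not match.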


2013-03-12 13:05:19 sfat

You can try jsoup, a wonderful Java package.

UPDATE: using the package, you can solve the problem like this:

String html = "<tr valign=\"middle\">" 
      + " <td class=\"inner\">" 
      + " <span style=\"\"><span class=\"\" title=\"\"></span> 2 <span class=\"icon ok\" title=\"Verified\"></span> </span><span class=\"icon cat_tv\" title=\"Video » TV\" style=\"bottom:-2;\"></span>" 
      + " <a href=\"/VALUE.html\" style=\"line-height:1.4em;\">VALUE</a> " 
      + " </td>" 
      + " <td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>" 
      + " <td width=\"1%\" align=\"right\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>" 
      + " <td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>" 
      + "</tr>"; 
    // Use the XML parser: the HTML parser may discard a <tr> that is not inside a <table>.
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());
    Elements labelPLine = doc.select("a[href]");
    System.out.println("value 1:" + labelPLine.text());

    // Select the three cells that carry width="1%".
    Elements labelPLine2 = doc.select("td[width=1%]");
    Iterator<Element> it = labelPLine2.iterator();
    int n = 2;
    while (it.hasNext()) {
        System.out.println("value " + (n++) + ":" + it.next().text());
    }

The result would be:

 
value 1:VALUE 
value 2:VALUE 
value 3:VALUE 
value 4:VALUE


2014-03-10 04:34:05

You should say how you would solve the problem using jsoup. Otherwise this is a non-answer and should have been a comment. – Bull

@B ... Thanks for your suggestion, I have updated the answer. –
