2010-11-13

I'm using PHP/curl to fetch an HTML page into a string; I then need to extract the data below and plot a chart from it. How do I extract values from an HTML page that is stored as a string after fetching it with curl?
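For context, the fetching step described above might look roughly like the sketch below. The function name `fetchHtml` and the timeout value are illustrative assumptions, not part of the original question; only standard curl extension calls are used.

```php
<?php
// Hypothetical helper (name and options are assumptions for illustration):
// fetch a URL with curl and return the response body as a string.
function fetchHtml(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT        => 15,   // assumed limit: give up after 15 seconds
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_exec() returns false on failure; surface the reason.
        throw new RuntimeException('curl request failed: ' . curl_error($ch));
    }

    // The handle is released automatically when $ch goes out of scope.
    return $body;
}
```

The returned string is what the rest of this thread calls `$content` or `$rawHtmlData`; it would be obtained with a call such as `fetchHtml($url)` for whatever page is being scraped.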

The data I want looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 

<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
    <meta name="generator" content= 
    "HTML Tidy for Linux (vers 25 March 2009), see www.w3.org" /> 

    <title></title> 
</head> 

<body> 
    <table> 
    <tbody> 
     <tr> 
     <td> 
      <h3>Income</h3> 
     </td> 
     </tr> 

     <tr> 
     <td>Operating income</td> 

     <td class="numericalColumn">22,922.00</td> 

     <td class="numericalColumn">21,507.30</td> 

     <td class="numericalColumn">17,492.60</td> 

     <td class="numericalColumn">13,683.90</td> 

     <td class="numericalColumn">10,227.12</td> 
     </tr> 

     <tr> 
     <td> 
      <h3>Expenses</h3> 
     </td> 
     </tr> 

     <tr> 
     <td>Material consumed</td> 

     <td class="numericalColumn">4,029.40</td> 

     <td class="numericalColumn">3,442.60</td> 

     <td class="numericalColumn">2,952.30</td> 

     <td class="numericalColumn">1,889.00</td> 

     <td class="numericalColumn">1,367.67</td> 
     </tr> 

     <tr> 
     <td>Manufacturing expenses&nbsp;</td> 

     <td class="numericalColumn">2,213.20</td> 

     <td class="numericalColumn">1,841.80</td> 

     <td class="numericalColumn">299.80</td> 

     <td class="numericalColumn">120.50</td> 

     <td class="numericalColumn">1,020.70</td> 
     </tr> 

     <tr> 
     <td>Personnel expenses</td> 

     <td class="numericalColumn">9,062.80</td> 

     <td class="numericalColumn">9,249.80</td> 

     <td class="numericalColumn">7,409.10</td> 

     <td class="numericalColumn">5,768.20</td> 

     <td class="numericalColumn">4,279.03</td> 
     </tr> 

     <tr> 
     <td>Selling expenses</td> 

     <td class="numericalColumn">378.10</td> 

     <td class="numericalColumn">308.40</td> 

     <td class="numericalColumn">532.10</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">171.05</td> 
     </tr> 

     <tr> 
     <td>Adminstrative expenses</td> 

     <td class="numericalColumn">1,737.00</td> 

     <td class="numericalColumn">1,906.00</td> 

     <td class="numericalColumn">2,583.70</td> 

     <td class="numericalColumn">2,651.70</td> 

     <td class="numericalColumn">904.78</td> 
     </tr> 

     <tr> 
     <td>Expenses capitalised</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 
     </tr> 

     <tr> 
     <td>Cost of sales</td> 

     <td class="numericalColumn">17,420.50</td> 

     <td class="numericalColumn">16,748.60</td> 

     <td class="numericalColumn">13,777.00</td> 

     <td class="numericalColumn">10,429.40</td> 

     <td class="numericalColumn">7,743.22</td> 
     </tr> 

     <tr> 
     <td>Operating profit</td> 

     <td class="numericalColumn">5,501.50</td> 

     <td class="numericalColumn">4,758.70</td> 

     <td class="numericalColumn">3,715.60</td> 

     <td class="numericalColumn">3,254.50</td> 

     <td class="numericalColumn">2,483.90</td> 
     </tr> 

     <tr> 
     <td>Other recurring income</td> 

     <td class="numericalColumn">434.20</td> 

     <td class="numericalColumn">468.20</td> 

     <td class="numericalColumn">326.90</td> 

     <td class="numericalColumn">288.70</td> 

     <td class="numericalColumn">113.59</td> 
     </tr> 

     <tr> 
     <td>Adjusted PBDIT</td> 

     <td class="numericalColumn">5,935.70</td> 

     <td class="numericalColumn">5,226.90</td> 

     <td class="numericalColumn">4,042.50</td> 

     <td class="numericalColumn">3,543.20</td> 

     <td class="numericalColumn">2,597.49</td> 
     </tr> 

     <tr> 
     <td>Financial expenses</td> 

     <td class="numericalColumn">108.40</td> 

     <td class="numericalColumn">196.80</td> 

     <td class="numericalColumn">116.80</td> 

     <td class="numericalColumn">7.20</td> 

     <td class="numericalColumn">3.13</td> 
     </tr> 

     <tr> 
     <td>Depreciation&nbsp;</td> 

     <td class="numericalColumn">579.60</td> 

     <td class="numericalColumn">533.60</td> 

     <td class="numericalColumn">456.00</td> 

     <td class="numericalColumn">359.80</td> 

     <td class="numericalColumn">292.26</td> 
     </tr> 

     <tr> 
     <td>Other write offs</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 
     </tr> 

     <tr> 
     <td>Adjusted PBT</td> 

     <td class="numericalColumn">5,247.70</td> 

     <td class="numericalColumn">4,496.50</td> 

     <td class="numericalColumn">3,469.70</td> 

     <td class="numericalColumn">3,176.20</td> 

     <td class="numericalColumn">2,302.10</td> 
     </tr> 

     <tr> 
     <td>Tax charges&nbsp;</td> 

     <td class="numericalColumn">790.80</td> 

     <td class="numericalColumn">574.10</td> 

     <td class="numericalColumn">406.40</td> 

     <td class="numericalColumn">334.10</td> 

     <td class="numericalColumn">286.10</td> 
     </tr> 

     <tr> 
     <td>Adjusted PAT</td> 

     <td class="numericalColumn">4,456.90</td> 

     <td class="numericalColumn">3,922.40</td> 

     <td class="numericalColumn">3,063.30</td> 

     <td class="numericalColumn">2,842.10</td> 

     <td class="numericalColumn">2,016.00</td> 
     </tr> 

     <tr> 
     <td>Non recurring items</td> 

     <td class="numericalColumn">441.10</td> 

     <td class="numericalColumn">-948.60</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">38.33</td> 
     </tr> 

     <tr> 
     <td>Other non cash adjustments</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-33.85</td> 
     </tr> 

     <tr> 
     <td>Reported net profit</td> 

     <td class="numericalColumn">4,898.00</td> 

     <td class="numericalColumn">2,973.80</td> 

     <td class="numericalColumn">3,063.30</td> 

     <td class="numericalColumn">2,842.10</td> 

     <td class="numericalColumn">2,020.48</td> 
     </tr> 

     <tr> 
     <td>Earnigs before appropriation</td> 

     <td class="numericalColumn">4,898.00</td> 

     <td class="numericalColumn">2,973.80</td> 

     <td class="numericalColumn">3,063.30</td> 

     <td class="numericalColumn">2,842.10</td> 

     <td class="numericalColumn">2,020.48</td> 
     </tr> 

     <tr> 
     <td>Equity dividend</td> 

     <td class="numericalColumn">880.90</td> 

     <td class="numericalColumn">586.00</td> 

     <td class="numericalColumn">876.50</td> 

     <td class="numericalColumn">873.70</td> 

     <td class="numericalColumn">712.88</td> 
     </tr> 

     <tr> 
     <td>Preference dividend</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 

     <td class="numericalColumn">-</td> 
     </tr> 

     <tr> 
     <td>Dividend tax</td> 

     <td class="numericalColumn">128.30</td> 

     <td class="numericalColumn">99.60</td> 

     <td class="numericalColumn">148.90</td> 

     <td class="numericalColumn">126.80</td> 

     <td class="numericalColumn">99.98</td> 
     </tr> 

     <tr> 
     <td>Retained earnings</td> 

     <td class="numericalColumn">3,888.80</td> 

     <td class="numericalColumn">2,288.20</td> 

     <td class="numericalColumn">2,037.90</td> 

     <td class="numericalColumn">1,841.60</td> 

     <td class="numericalColumn">1,207.62</td> 
     </tr> 
    </tbody> 
    </table> 
</body> 
</html> 

I want to extract each value, for example the "Manufacturing expenses" figures, along with the values for all the years listed in that row. How do I go about this?

I found something like preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match); but it does not return the values I want.
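That pattern cannot match this table: it expects a `<th>` cell and a literal `<b>price</b>` with no whitespace between the tags, and none of that appears in the markup above. A minimal sketch of a regex-free alternative, using PHP's bundled DOM extension, is shown below. The function name `rowValues` and the trimmed sample table are assumptions made for illustration, and the label is assumed not to contain a double quote, since it is interpolated into the XPath expression.

```php
<?php
// Hypothetical helper: return the text of every "numericalColumn" cell
// in the row whose first cell starts with $label.
// Assumption: $label contains no double quote (it is placed inside the XPath string).
function rowValues(string $html, string $label): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = sprintf(
        '//tr[td[1][starts-with(normalize-space(.), "%s")]]/td[@class="numericalColumn"]',
        $label
    );

    $values = [];
    foreach ($xpath->query($query) as $cell) {
        $values[] = trim($cell->textContent);
    }
    return $values;
}

// Trimmed-down sample of the table from the question.
$sample = '<table><tbody>
  <tr><td>Material consumed</td><td class="numericalColumn">4,029.40</td></tr>
  <tr><td>Manufacturing expenses&nbsp;</td>
      <td class="numericalColumn">2,213.20</td>
      <td class="numericalColumn">1,841.80</td></tr>
</tbody></table>';

print_r(rowValues($sample, 'Manufacturing expenses'));
```

Using `starts-with` rather than an exact comparison lets the query tolerate the trailing `&nbsp;` that some labels in the table carry.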

*(related)* [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon

**Don't use regex to parse HTML**; use an [HTML DOM parser instead](http://simplehtmldom.sourceforge.net/)

Answers

If I understand your question correctly, you want something like this. I wrote it myself, so if you need any clarification I'd be happy to help.

Cheers!

You can use a library such as PHP Simple HTML DOM Parser to extract data from HTML/XHTML.
http://simplehtmldom.sourceforge.net/manual.htm

An example:

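// Requires simple_html_dom.php from the PHP Simple HTML DOM Parser project.
// $rawHtmlData is the HTML string returned by curl.
$pageDom = str_get_html($rawHtmlData);
foreach ($pageDom->find('td') as $tblElem)
{
    // Case-insensitive search for the row label inside the cell
    if (FALSE !== stristr($tblElem->innertext, 'Manufacturing expenses'))
    {
        // Do stuff
    }
}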
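
Whichever parser is used, the cell texts in the question's table ("2,213.20", "-948.60", "-") are still strings and need converting before they can be charted. The helper below is a minimal sketch of that step; the name `parseAmount` is hypothetical, and treating a lone "-" as "no value" is an assumption about what the table means.

```php
<?php
// Hypothetical helper: convert a cell such as "2,213.20" into a float.
// Assumption: a lone "-" or an empty cell means "no value" and becomes null.
function parseAmount(string $cell): ?float
{
    // Remove non-breaking spaces (raw entity or decoded UTF-8), then trim whitespace.
    $text = trim(str_replace(['&nbsp;', "\u{00A0}"], '', $cell));
    if ($text === '' || $text === '-') {
        return null;
    }

    // Drop thousands separators, then make sure what is left is numeric.
    $text = str_replace(',', '', $text);
    return is_numeric($text) ? (float) $text : null;
}
```

Returning `null` rather than `0` for missing figures keeps gaps in the data distinguishable from real zero values when the chart is drawn.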
Suggested third-party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of string parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon
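
As a rough sketch of the DOM-based route mentioned in this comment, the native DOM extension alone can turn the whole table into an array keyed by row label. The function name `tableToArray` and the shortened sample markup are assumptions for illustration; the third-party libraries listed above are not used here.

```php
<?php
// Hypothetical helper: build [row label => [cell texts...]] for every row
// that contains at least one "numericalColumn" cell.
function tableToArray(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $rows = [];
    foreach ($xpath->query('//tr[td[@class="numericalColumn"]]') as $tr) {
        // The first cell is the label; strip whitespace and non-breaking spaces.
        $label = preg_replace(
            '/^[\s\x{A0}]+|[\s\x{A0}]+$/u',
            '',
            $xpath->evaluate('string(td[1])', $tr)
        );

        $cells = [];
        foreach ($xpath->query('td[@class="numericalColumn"]', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[$label] = $cells;
    }
    return $rows;
}

// Shortened sample in the same shape as the table from the question.
$sample = '<table><tbody>
  <tr><td><h3>Expenses</h3></td></tr>
  <tr><td>Tax charges&nbsp;</td>
      <td class="numericalColumn">790.80</td>
      <td class="numericalColumn">574.10</td></tr>
  <tr><td>Other write offs</td>
      <td class="numericalColumn">-</td>
      <td class="numericalColumn">-</td></tr>
</tbody></table>';

$table = tableToArray($sample);
print_r($table);
```

Section header rows such as "Expenses" have no numeric cells, so the row filter skips them and only data rows end up in the result.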
