
Keep my PHP web crawler from hanging

I'm using the PHPCrawl class and have added some DOMDocument and DOMXpath code to pull specific data out of the pages, but the script stops long before it gets anywhere near crawling the whole site.

I've set set_time_limit to 100000000, so that shouldn't be the problem.

Any ideas?

Thanks, Nick

<?php 

// It may take a while to crawl a site ... 
set_time_limit(100000000); 

// Include the phpcrawl main class 
include("classes/phpcrawler.class.php"); 

//connect to the database 
mysql_connect('localhost','#####','#####'); 
mysql_select_db('ft2'); 

// Extend the class and override the handlePageData()-method 
class MyCrawler extends PHPCrawler 
{ 
    function handlePageData(&$page_data) 
    { 
    // Here comes your code. 
    // Do whatever you want with the information given in the 
    // array $page_data about a page or file that the crawler actually found. 
    // See a complete list of elements the array will contain in the 
    // class-reference. 
    // This is just a simple example. 

    // Print the URL of the actual requested page or file 
    echo "Page requested: ".$page_data["url"]."<br>"; 

    // Print the first line of the header the server sent (HTTP-status) 
    //echo "Status: ".strtok($page_data["header"], "\n")."<br>"; 

    // Print the referer 
    //echo "Referer-page: ".$page_data["referer_url"]."<br>"; 

    // Print whether the content was received or not 
    /*if ($page_data["received"]==true) 
     echo "Content received: ".$page_data["bytes_received"]." bytes"; 
    else 
     echo "Content not received"; 
    */ 
    // ... 

    // Now you should do something with the content of the received 
    // page or file; it is available in $page_data["source"]. (The code 
    // below re-loads the page by URL instead.) 

    //echo "<br><br>"; 
    echo str_pad(" ", 5000); // "Force flush", workaround 
    flush(); 

//this is where we tear the data apart looking for username and timestamps 
$url = $page_data["url"]; 
$html = new DOMDocument(); 
$html->loadHTMLFile($url); 
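// Note: the loadHTMLFile() call above downloads the page a second time even 
// though the crawler already delivered the HTML in $page_data["source"]. Forum 
// markup is rarely valid, so calling libxml_use_internal_errors(true) before it 
// (not part of the original code) would keep DOM parse warnings out of the output. 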

$xpath = new DOMXpath($html); 

//children of ol id=posts 
$links = $xpath->query("//li[@class='postbit postbitim postcontainer']"); 

$return = array(); // initialise so the output loop below does not warn when no posts match 
foreach ($links as $results) { 
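    // Copy each matched <li> post into its own small DOMDocument so the relative 
    // XPath queries below ("div/div/...") only search inside that single post. 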
    $newDom = new DOMDocument; 
    $newDom->appendChild($newDom->importNode($results,true)); 

    $xpath = new DOMXpath ($newDom); 
    $time_stamp = substr($xpath->query("div/div/span/span")->item(0)->nodeValue,0,10); 
    $user_name = trim($xpath->query("div/div[2]/div/div/div/a/strong/font")->item(0)->nodeValue); 

    $return[] = array(
    'time_stamp' => $time_stamp, 
    'username' => $user_name, 
    ); 
} 

foreach ($return as $output) { 
    echo "<strong>Time posted: " . $output['time_stamp'] . " by " . $output['username'] . "</strong>"; 
    //make your database entry 
    $time_stamp = $output['time_stamp']; 
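    // Convert the scraped date into MySQL's YYYY-MM-DD format; this assumes the 
    // forum displays dates as month/day/year separated by '/', '.' or '-'. 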
    list($month, $day, $year) = split('[/.-]', $time_stamp); 
    $time_stamp = $year."-".$month."-".$day; 
    echo $time_stamp; 

    $username = $output['username']; 
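    // Caution: $username and $url go into the query unescaped; passing them through 
    // mysql_real_escape_string() first would be safer. 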
    $sql="INSERT INTO lovesystems VALUES ('$username','$url','$time_stamp')"; 
    if (mysql_query($sql)) echo "Successfully input user in database!<br/>"; 
    else echo mysql_error(); 
} 
    } 
} 

// Now, create an instance of the class, set the behaviour 
// of the crawler (see class-reference for more methods) 
// and start the crawling-process. 

$crawler = new MyCrawler(); 

// URL to crawl 
$crawler->setURL("http://######.com"); 

// Only receive content of files with content-type "text/html" 
// (regular expression, preg) 
$crawler->addReceiveContentType("/text\/html/"); 

// Ignore links to pictures, don't even request pictures 
// (preg_match) 
$crawler->addNonFollowMatch("/\.(jpg|gif|png)$/i"); 

// Store and send cookie-data like a browser does 
$crawler->setCookieHandling(true); 

// Set the traffic-limit to 1 MB (in bytes; 
// for testing we don't want to "suck" the whole site) 
//$crawler->setTrafficLimit(1000 * 1024); 

// That's enough, now here we go 
$crawler->go(); 




// At the end, after the process is finished, we print a short 
// report (see method getReport() for more information) 

$report = $crawler->getReport(); 

echo "Summary:<br>"; 
if ($report["traffic_limit_reached"]==true) 
    echo "Traffic-limit reached <br>"; 

echo "Links followed: ".$report["links_followed"]."<br>"; 
echo "Files received: ".$report["files_received"]."<br>"; 
echo "Bytes received: ".$report["bytes_received"]."<br>"; 

?> 

Any more detailed information about where it breaks? –


It breaks after about half an hour. Sometimes it runs 27 minutes, sometimes 31, and it never breaks at the same point. I'm running it with wget on a Unix server. Any thoughts? –

Answer


Check your server configuration. I'm fairly sure Apache has a script timeout in its configuration.
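A quick way to see which limits PHP itself reports is a tiny probe script run through the same server, a minimal sketch (the Apache-level Timeout directive lives in httpd.conf and will not show up here):

<?php 
// Probe the effective PHP limits for the SAPI that runs the crawler. 
echo "SAPI: ".php_sapi_name()."<br>"; 
echo "max_execution_time: ".ini_get("max_execution_time")."<br>"; 
echo "memory_limit: ".ini_get("memory_limit")."<br>"; 
?>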


From what I've read online, set_time_limit seems to override the Apache settings. Any other thoughts? –


It overrides the php.ini setting, but the Apache setting stays active. Also, you're better off running a script like this from cron or the command line rather than embedded in Apache. –
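A minimal sketch of that approach (the file names, paths and cron schedule below are made up; adjust them to your setup):

<?php 
// run_crawler.php - hypothetical CLI wrapper around the crawler script above. 
// Run it by hand:      php /path/to/run_crawler.php 
// or from cron, e.g.:  0 3 * * * php /path/to/run_crawler.php >> /var/log/crawler.log 2>&1 

if (php_sapi_name() !== 'cli') { 
    die("Run this script from the command line, not through Apache.\n"); 
} 

set_time_limit(0);             // the CLI has no Apache timeout, and this lifts PHP's own limit 
ini_set('memory_limit', '-1'); // long crawls building DOM trees can use a lot of memory 

require 'crawler.php';         // hypothetical file containing the MyCrawler code from the question 
?>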