2009-03-12 10 views


2

This script is loosely based on the one in the Nutch FAQ, which did not work for me at first:

#!/bin/sh 
# 
# Automate crawling my site 
# 
crawldir=./crawl 
urldir=./urls 
NUTCH_HOME=${NUTCH_HOME:=.} 

nutch=$NUTCH_HOME/bin/nutch 

# Make sure the crawl directories exist 
mkdir -p $crawldir/crawldb $crawldir/segments $crawldir/linkdb 

# Inject the initial urls 
$nutch inject $crawldir/crawldb $urldir 

depth=1 
while true ; do
    echo "beginning crawl at depth $depth" 
    echo "-generate" 
    $nutch generate $crawldir/crawldb $crawldir/segments 
    if [ $? -ne 0 ] ; then
        echo "finishing at depth $depth - no more urls"
        break
    fi

    # Pick the most recently modified segment, i.e. the one generate just created
    segment=$(/bin/ls -rtd "$crawldir"/segments/* | tail -1)

    echo "$nutch fetch $segment" 
    $nutch fetch $segment 
    if [ $? -ne 0 ] ; then
        echo "fetch failed at depth $depth, deleting segment"
        rm -rf "$segment"
        continue
    fi

    echo "$nutch updatedb $crawldir/crawldb $segment" 
    $nutch updatedb $crawldir/crawldb $segment 
    depth=$((depth + 1))
done 

echo "$nutch mergesegs $crawldir/MERGEDsegs $crawldir/segments/*" 
$nutch mergesegs $crawldir/MERGEDsegs $crawldir/segments/* 
if [ $? -eq 0 ] ; then 
    rm -rf $crawldir/segments/* 
    mv $crawldir/MERGEDsegs/* $crawldir/segments 
    rmdir $crawldir/MERGEDsegs 
else
    echo "Something went wrong" >&2
    exit 1
fi

echo "$nutch invertlinks $crawldir/linkdb -dir $crawldir/segments" 
$nutch invertlinks $crawldir/linkdb -dir $crawldir/segments 

echo "$nutch index $crawldir/NEWindexes $crawldir/crawldb $crawldir/linkdb $crawldir/segments/*" 
$nutch index $crawldir/NEWindexes $crawldir/crawldb $crawldir/linkdb \
    $crawldir/segments/*

echo "$nutch dedup $crawldir/NEWindexes" 
$nutch dedup $crawldir/NEWindexes 

echo "$nutch merge $crawldir/MERGEDindexes $crawldir/NEWindexes" 
$nutch merge $crawldir/MERGEDindexes $crawldir/NEWindexes 

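# Rotate the indexes: drop the previous backup, keep the current index as
# the new backup (if there is one), then put the merged index in place
rm -rf $crawldir/OLDindexes
[ -d $crawldir/index ] && mv $crawldir/index $crawldir/OLDindexes
mv $crawldir/MERGEDindexes $crawldir/index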
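
The loop in the script finds the segment that `generate` just wrote by taking the most recently modified entry under the segments directory. A minimal sketch of that one step in isolation, in case it helps to test it separately; the helper name `newest_segment` is hypothetical and not part of Nutch:

```shell
#!/bin/sh
# Hypothetical helper (not a Nutch command): print the most recently
# modified entry inside the directory given as $1.
newest_segment() {
    # ls -t sorts newest first and -r reverses that, so the newest entry
    # is the last line; -d lists the entries themselves, not their contents.
    /bin/ls -rtd "$1"/* 2>/dev/null | tail -1
}
```

Inside the loop this would be used as `segment=$(newest_segment "$crawldir/segments")`. If the directory is empty, the function prints nothing, so the caller can check for an empty result before calling `fetch`.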
0

We use Nutch in combination with Solr. Our Nutch index is approx. 80 MB and covers about 5000 websites. So far, the best way to recrawl has been to delete the index and rebuild it from scratch.
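
A rough sketch of that delete-and-rebuild approach, under stated assumptions: the crawl data lives in a single directory laid out like the one in the script above, and the crawl itself is started by a command passed in by the caller. The function name `rebuild_crawl` is hypothetical, and clearing the Solr side is not covered here.

```shell
#!/bin/sh
# Hypothetical helper: wipe a crawl directory and rebuild it from scratch.
# Usage: rebuild_crawl CRAWLDIR [COMMAND [ARGS...]]
rebuild_crawl() {
    # Refuse a missing or empty path so rm -rf can never run on "".
    if [ $# -lt 1 ] || [ -z "$1" ] ; then
        echo "rebuild_crawl: no crawl directory given" >&2
        return 1
    fi
    crawldir=$1
    shift

    # Throw away the old crawl data and recreate the empty layout
    # (assumed to match the crawldb/segments/linkdb layout used above).
    rm -rf "$crawldir"
    mkdir -p "$crawldir/crawldb" "$crawldir/segments" "$crawldir/linkdb"

    # Run the caller-supplied crawl command, if any (an assumption:
    # for example a crawl script such as the one in the other answer).
    if [ $# -gt 0 ] ; then
        "$@"
    fi
}
```

It could be called as `rebuild_crawl ./crawl ./recrawl.sh`, where `./recrawl.sh` stands in for whatever script runs the crawl.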
