
Scrapy CrawlSpider does not follow links

I am trying to crawl a page that uses "next" buttons to move to new pages with Scrapy. I am using a CrawlSpider instance and have defined a LinkExtractor to pick up the new pages to follow. However, the spider only crawls the start URL and stops there. I have added the spider code and the log below. Does anyone have an idea why the spider is not able to crawl the other pages?

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from realcommercial.items import RealcommercialItem
    from scrapy.selector import Selector
    from scrapy.http import Request

    class RealCommercial(CrawlSpider):
        name = "realcommercial"
        allowed_domains = ["realcommercial.com.au"]
        start_urls = [
            "http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date"
        ]
        rules = [Rule(LinkExtractor(allow=['/for-sale/in-vic/list-\d+?activeSort=list-date']),
                      callback='parse_response',
                      process_links='process_links',
                      follow=True),
                 Rule(LinkExtractor(allow=[]),
                      callback='parse_response',
                      process_links='process_links',
                      follow=True)]

        def parse_response(self, response):
            sel = Selector(response)
            sites = sel.xpath("//a[@class='details']")
            #items = []
            for site in sites:
                item = RealcommercialItem()
                link = site.xpath('@href').extract()
                #print link, '\n\n'
                item['link'] = link
                link = 'http://www.realcommercial.com.au/' + str(link[0])
                #print 'link!!!!!!=', link
                new_request = Request(link, callback=self.parse_file_page)
                new_request.meta['item'] = item
                yield new_request
                #items.append(item)
            yield item
            return

        def process_links(self, links):
            print 'inside process links'
            for i, w in enumerate(links):
                print w.url, '\n\n\n'
                w.url = "http://www.realcommercial.com.au/" + w.url
                print w.url, '\n\n\n'
                links[i] = w

            return links

        def parse_file_page(self, response):
            #item passed from request
            #print 'parse_file_page!!!'
            item = response.meta['item']
            #selector
            sel = Selector(response)
            title = sel.xpath('//*[@id="listing_address"]').extract()
            #print title
            item['title'] = title

            return item

Log

    2015-11-29 15:42:55 [scrapy] INFO: Scrapy 1.0.3 started (bot: realcommercial)
    2015-11-29 15:42:55 [scrapy] INFO: Optional features available: ssl, http11, boto
    2015-11-29 15:42:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'realcommercial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['realcommercial.spiders'], 'FEED_URI': 'aaa.csv', 'BOT_NAME': 'realcommercial'}
    2015-11-29 15:42:56 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
    2015-11-29 15:42:57 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-11-29 15:42:57 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-11-29 15:42:57 [scrapy] INFO: Enabled item pipelines:
    2015-11-29 15:42:57 [scrapy] INFO: Spider opened
    2015-11-29 15:42:57 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-11-29 15:42:57 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2015-11-29 15:42:59 [scrapy] DEBUG: Crawled (200) <GET http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date> (referer: None)
    2015-11-29 15:42:59 [scrapy] INFO: Closing spider (finished)
    2015-11-29 15:42:59 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 303,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 30599,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 11, 29, 10, 12, 59, 418000),
     'log_count/DEBUG': 2,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 11, 29, 10, 12, 57, 780000)}
    2015-11-29 15:42:59 [scrapy] INFO: Spider closed (finished)

Answer


I found the answer myself. There were two problems:

  1. process_links was prepending "http://www.realcommercial.com.au/" to the extracted URLs even though it was already there. I had assumed the extractor would return relative URLs.
  2. The regular expression in the link extractor was not correct.

I made changes to both of these and it worked.
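
For reference, here is a minimal sketch of what the corrected spider could look like. This is my reconstruction rather than the exact code from the answer: it assumes the regex fix was to escape the literal '?' in the pattern (the exact pattern depends on how the site formats its pagination links), and it drops process_links entirely, since LinkExtractor already returns absolute URLs.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class RealCommercialFixed(CrawlSpider):
        # Hypothetical name for this sketch; the original spider is RealCommercial.
        name = "realcommercial_fixed"
        allowed_domains = ["realcommercial.com.au"]
        start_urls = [
            "http://www.realcommercial.com.au/for-sale/in-vic/list-1?nearbySuburb=false&autoSuggest=false&activeSort=list-date"
        ]

        # Escape the '?' so it matches the literal query-string separator.
        # In the original pattern, the '?' in '\d+?' was parsed as a
        # non-greedy quantifier, so the regex required 'activeSort' to
        # follow the digits immediately and never matched a pagination URL.
        rules = [Rule(LinkExtractor(allow=[r'/for-sale/in-vic/list-\d+\?.*activeSort=list-date']),
                      callback='parse_response',
                      follow=True)]
        # No process_links here: the extracted URLs are already absolute,
        # so prepending "http://www.realcommercial.com.au/" again would only
        # produce broken links.

        def parse_response(self, response):
            # listing extraction as in the original spider
            pass

With the prefixing removed, the absolute URLs returned by the LinkExtractor are requested as-is, and the CrawlSpider can follow the pagination pages matched by the rule.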