
I'm learning Scrapy and wanted to scrape a few items from this page: https://www.gumtree.com/search?sort=date&search_category=flats-houses&q=box&search_location=Vale+of+Glamorgan. Instead, Scrapy crawled 0 pages (at 0 pages/min) and scraped 0 items (at 0 items/min).

To get around robots.txt policies and the like, I saved the page to my hard drive and tested my XPaths against it in the Scrapy shell (a sketch of that shell session is included after the logs below). They seem to work as expected. But when I run my spider with the command scrapy crawl basic (as recommended in the book I'm reading), I get the following output:

2017-09-27 12:05:02 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: properties) 
2017-09-27 12:05:02 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozila/5.0', 'SPIDER_MODULES': ['properties.spiders'], 'BOT_NAME': 'properties', 'NEWSPIDER_MODULE': 'properties.spiders'} 
2017-09-27 12:05:03 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.memusage.MemoryUsage', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2017-09-27 12:05:03 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-09-27 12:05:03 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-09-27 12:05:03 [scrapy.middleware] INFO: Enabled item pipelines: 
[] 
2017-09-27 12:05:03 [scrapy.core.engine] INFO: Spider opened 
2017-09-27 12:05:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-09-27 12:05:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026 
2017-09-27 12:05:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///home/albert/Documents/programming/python/scrapy/properties/properties/tests/test_page.html> (referer: None) 
2017-09-27 12:05:04 [basic] DEBUG: title: 
2017-09-27 12:05:04 [basic] DEBUG: price: 
2017-09-27 12:05:04 [basic] DEBUG: description: 
2017-09-27 12:05:04 [basic] DEBUG: address: 
2017-09-27 12:05:04 [basic] DEBUG: image_urls: 
2017-09-27 12:05:04 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-09-27 12:05:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 262, 
'downloader/request_count': 1, 
'downloader/request_method_count/GET': 1, 
'downloader/response_bytes': 270547, 
'downloader/response_count': 1, 
'downloader/response_status_count/200': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 9, 27, 9, 5, 4, 91741), 
'log_count/DEBUG': 7, 
'log_count/INFO': 7, 
'memusage/max': 50790400, 
'memusage/startup': 50790400, 
'response_received_count': 1, 
'scheduler/dequeued': 1, 
'scheduler/dequeued/memory': 1, 
'scheduler/enqueued': 1, 
'scheduler/enqueued/memory': 1, 
'start_time': datetime.datetime(2017, 9, 27, 9, 5, 3, 718976)} 
2017-09-27 12:05:04 [scrapy.core.engine] INFO: Spider closed (finished) 
Running scrapy crawl basic a second time produces exactly the same output; only the timestamps differ.
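
For reference, this is roughly how the XPaths were checked in the Scrapy shell against the saved page (the file path is the one from the logs above; the outputs shown are placeholders, not real data):

$ scrapy shell file:///home/albert/Documents/programming/python/scrapy/properties/properties/tests/test_page.html
>>> # each query returns a list of strings; a non-empty list means the XPath matches
>>> response.xpath("//h2[@class='listing-title' and not(span)]/text()").extract()
['...one title per listing...']
>>> response.xpath("//meta[@itemprop='price']/@content").extract()
['...one price per listing...']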

Here is my items.py:

from scrapy.item import Item, Field 


class PropertiesItem(Item): 
    title = Field() 
    price = Field() 
    description = Field() 
    address = Field() 
    image_urls = Field() 

    images = Field() 
    location = Field() 

    url = Field() 
    project = Field() 
    spider = Field() 
    server = Field() 
    date = Field() 

And here is the spider, basic.py:

import scrapy 


class BasicSpider(scrapy.Spider): 
    name = 'basic' 
    start_urls = ['file:///home/albert/Documents/programming/python/scrapy/properties/properties/site/test_page.html'] 

    def parse(self, response): 
     self.log('title: '.format(response.xpath(
      "//h2[@class='listing-title' and not(span)]/text()").extract())) 
     self.log('price: '.format(response.xpath(
      "//meta[@itemprop='price']/@content").extract())) 
     self.log("description: ".format(response.xpath(
      "//p[@itemprop='description' and not(span)]/text()").extract())) 
     self.log('address: '.format(response.xpath(
      "//span[@class='truncate-line']/text()[2]").re('\|(\s+\w+.+)'))) 
     self.log('image_urls: '.format(response.xpath(
      "//noscript/img/@src").extract())) 

The XPaths are a bit clumsy, but they work. Nevertheless, no items are collected. I'd like to know why.


Add 'print(response.body)' and 'print(type(response))' in the parse function and see if you get an HtmlResponse and the correct body with all the expected HTML? – TarunLalwani


@TarunLalwani Let me check. But I did open this saved page in a scrapy shell and run the xpaths there, and they worked fine, which I concluded was a sign that the html body is correct. 'print(type(response))' returns <class 'scrapy.http.response.html.HtmlResponse'> and 'print(response.body)' prints the body of the html document – Albert


@TarunLalwani At first glance everything looks fine. – Albert

Answers


Your problem is that you never put a {} placeholder into the string, so format() has nowhere to insert its argument. You need to replace 'title: ' with 'title: {}' so that format inserts the values. Also, use extract_first() instead of extract() so that you get a single string instead of a list.
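
A quick way to see the difference in a plain Python session (the sample value here is invented for illustration):

>>> 'title: '.format(['Flat to rent'])    # no {} placeholder: the argument is silently dropped
'title: '
>>> 'title: {}'.format(['Flat to rent'])  # {} receives the value passed to format()
"title: ['Flat to rent']"

With both fixes applied, the spider becomes: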

import scrapy


class BasicSpider(scrapy.Spider):
    name = 'basic'
    start_urls = ['file:///home/albert/Documents/programming/python/scrapy/properties/properties/site/test_page.html']

    def parse(self, response):
        self.log('title: {}'.format(response.xpath(
            "//h2[@class='listing-title' and not(span)]/text()").extract_first()))
        self.log('price: {}'.format(response.xpath(
            "//meta[@itemprop='price']/@content").extract_first()))
        self.log('description: {}'.format(response.xpath(
            "//p[@itemprop='description' and not(span)]/text()").extract_first()))
        self.log('address: {}'.format(response.xpath(
            "//span[@class='truncate-line']/text()[2]").re(r'\|(\s+\w+.+)')))
        self.log('image_urls: {}'.format(response.xpath(
            "//noscript/img/@src").extract_first()))

ooooh my.... I hadn't noticed.... I can't believe I missed that... I feel so stupid... Thanks! Yeah, now it works as it should! – Albert


I haven't used Scrapy on local files, but if you want to scrape something with Scrapy, you first have to initialize an Item, then assign to the Item like a dict in Python, and finally yield the item to the pipeline:

import scrapy
from properties.items import PropertiesItem


class BasicSpider(scrapy.Spider):
    name = 'basic'
    start_urls = ['file:///home/albert/Documents/programming/python/scrapy/properties/properties/site/test_page.html']

    def parse(self, response):
        item = PropertiesItem()  # init the Item
        # assign to it like a dict (XPaths taken from the question)
        item['title'] = response.xpath("//h2[@class='listing-title' and not(span)]/text()").extract()
        item['price'] = response.xpath("//meta[@itemprop='price']/@content").extract()
        item['description'] = response.xpath("//p[@itemprop='description' and not(span)]/text()").extract()
        item['address'] = response.xpath("//span[@class='truncate-line']/text()[2]").re(r'\|(\s+\w+.+)')
        item['image_urls'] = response.xpath("//noscript/img/@src").extract()
        # hand the populated item to the item pipeline
        yield item
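
With the spider yielding items, Scrapy's built-in feed export is an easy way to confirm that items are really being collected (the output filename is arbitrary):

scrapy crawl basic -o items.json

Every yielded item is serialized into items.json when the crawl finishes.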

hmm yeah, this code makes a lot more sense than what's written in the tutorial I'm following. They actually scrape a different site, and I was just reproducing their code with the changes needed to scrape the site I wanted. I don't understand how their example works written that way... Thanks for clearing it up! And by the way, it should be 'from properties.items import PropertiesItem'. – Albert


Yeah, it should be 'from properties.items import PropertiesItem', I've edited it – zhongjiajie