I'm learning Scrapy and wanted to scrape a few listings from this page: https://www.gumtree.com/search?sort=date&search_category=flats-houses&q=box&search_location=Vale+of+Glamorgan
To avoid robots.txt policies and the like, I saved the page to my hard drive and tested my XPaths in the Scrapy shell. They seem to work as expected. But when I run my spider with the command scrapy crawl basic (as recommended in the book I'm reading), I get the following output:
2017-09-27 12:05:02 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: properties)
2017-09-27 12:05:02 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozila/5.0', 'SPIDER_MODULES': ['properties.spiders'], 'BOT_NAME': 'properties', 'NEWSPIDER_MODULE': 'properties.spiders'}
2017-09-27 12:05:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-09-27 12:05:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-27 12:05:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-27 12:05:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-09-27 12:05:03 [scrapy.core.engine] INFO: Spider opened
2017-09-27 12:05:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-27 12:05:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
2017-09-27 12:05:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///home/albert/Documents/programming/python/scrapy/properties/properties/tests/test_page.html> (referer: None)
2017-09-27 12:05:04 [basic] DEBUG: title:
2017-09-27 12:05:04 [basic] DEBUG: price:
2017-09-27 12:05:04 [basic] DEBUG: description:
2017-09-27 12:05:04 [basic] DEBUG: address:
2017-09-27 12:05:04 [basic] DEBUG: image_urls:
2017-09-27 12:05:04 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-27 12:05:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 262,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 270547,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 27, 9, 5, 4, 91741),
'log_count/DEBUG': 7,
'log_count/INFO': 7,
'memusage/max': 50790400,
'memusage/startup': 50790400,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 9, 27, 9, 5, 3, 718976)}
2017-09-27 12:05:04 [scrapy.core.engine] INFO: Spider closed (finished)
Here is my items.py:
from scrapy.item import Item, Field


class PropertiesItem(Item):
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()
    images = Field()
    location = Field()
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
and here is the spider, basic.py:
import scrapy


class BasicSpider(scrapy.Spider):
    name = 'basic'
    start_urls = ['file:///home/albert/Documents/programming/python/scrapy/properties/properties/site/test_page.html']

    def parse(self, response):
        self.log('title: '.format(response.xpath(
            "//h2[@class='listing-title' and not(span)]/text()").extract()))
        self.log('price: '.format(response.xpath(
            "//meta[@itemprop='price']/@content").extract()))
        self.log("description: ".format(response.xpath(
            "//p[@itemprop='description' and not(span)]/text()").extract()))
        self.log('address: '.format(response.xpath(
            "//span[@class='truncate-line']/text()[2]").re('\|(\s+\w+.+)')))
        self.log('image_urls: '.format(response.xpath(
            "//noscript/img/@src").extract()))
The XPaths are a bit clumsy, but they work. Nevertheless, no items are collected, and I'd like to know why.
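One detail worth noting about the spider above (an observation, not necessarily the whole explanation): `str.format` only substitutes its arguments into `{}` replacement fields, so a call like `'title: '.format(values)` silently discards `values`, which would account for the empty `title:` / `price:` lines in the log. A minimal standalone demonstration, using a made-up list of titles:

```python
# str.format substitutes arguments only where the string contains
# replacement fields such as {} or {0}; without them, the arguments
# are silently ignored rather than appended.
titles = ['Flat in Cardiff', 'House in Barry']  # hypothetical extracted values

without_placeholder = 'title: '.format(titles)   # no {} -> titles are dropped
with_placeholder = 'title: {}'.format(titles)    # {} -> titles are included

print(without_placeholder)  # prints "title: "
print(with_placeholder)     # prints "title: ['Flat in Cardiff', 'House in Barry']"
```

Separately, `self.log()` only writes to the log; for Scrapy to count an item as scraped, `parse()` has to build and yield (or return) an item object such as `PropertiesItem`.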
Add `print(response.body)` and `print(type(response))` in the parse function and see if you get an HtmlResponse and the correct body with all the expected HTML? – Tarun Lalwani
@TarunLalwani Let me check. But I did try loading this saved page in a scrapy shell and running the XPaths there, and they worked fine, which I took as a sign that the HTML body is correct. `print(type(response))` returns `` and `print(response.body)` prints the body of the HTML document. – Albert
@TarunLalwani At first glance, everything seems fine. – Albert