
Hi guys, I am getting the following Scrapy pagination error while trying to scrape a site:

2017-07-27 18:30:21 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.pedidosja.com.br/restaurantes/sao-paulo?a=rua%20tenente%20negr%C3%A3o%20200&cep=04530030&doorNumber=200&bairro=Itaim%20Bibi&lat=-23.585202&lng=-46.671715199999994> (referer: None) 
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/root/Documents/Spiders/pedidosYa/pedidosYa/spiders/pedidosya.py", line 35, in parse
    next_page_url = response.urljoin(next_page_url)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/http/response/text.py", line 82, in urljoin
    return urljoin(get_base_url(self), url)
  File "/usr/lib/python3.5/urllib/parse.py", line 416, in urljoin
    base, url, _coerce_result = _coerce_args(base, url)
  File "/usr/lib/python3.5/urllib/parse.py", line 112, in _coerce_args
    raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
2017-07-27 18:30:21 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-07-27 18:30:21 [scrapy.extensions.feedexport] INFO: Stored csv feed (13 items) in: test3.csv 
2017-07-27 18:30:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 653, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 62571, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 7, 27, 23, 30, 21, 221038), 
'item_scraped_count': 13, 
'log_count/DEBUG': 16, 
'log_count/ERROR': 1, 
'log_count/INFO': 8, 
'memusage/max': 49278976, 
'memusage/startup': 49278976, 
'response_received_count': 2, 
'scheduler/dequeued': 1, 
'scheduler/dequeued/memory': 1, 
'scheduler/enqueued': 1, 
'scheduler/enqueued/memory': 1, 
'spider_exceptions/TypeError': 1, 
'start_time': datetime.datetime(2017, 7, 27, 23, 30, 17, 538310)} 
2017-07-27 18:30:21 [scrapy.core.engine] INFO: Spider closed (finished) 

The spider raises a TypeError: "Cannot mix str and non-str arguments". I am not very experienced in Python, so I would also appreciate some resources where I could learn about this kind of error. Below you will find the spider's code.

# -*- coding: utf-8 -*-
import scrapy
from pedidosYa.items import PedidosyaItem
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose


class PedidosyaSpider(scrapy.Spider):
    name = 'pedidosya'
    allowed_domains = ['www.pedidosya.com.br']
    start_urls = [
        'https://www.pedidosja.com.br/restaurantes/sao-paulo?a=rua%20tenente%20negr%C3%A3o%20200&cep=04530030&doorNumber=200&bairro=Itaim%20Bibi&lat=-23.585202&lng=-46.671715199999994']

    def parse(self, response):
        # need to define wrapper
        for wrapper in response.css('.restaurant-wrapper.peyaCard.show.with_tags'):
            l = ItemLoader(item=PedidosyaItem(), selector=wrapper)
            l.add_css('Name', 'a.arrivalName::text')
            l.add_css('Menu1', 'span.categories > span::text', MapCompose(str.strip))
            l.add_css('Menu2', 'span.categories > span + span::text', MapCompose(str.strip))
            l.add_css('Menu3', 'span.categories > span + span + span::text', MapCompose(str.strip))
            l.add_css('Address', 'span.address::text', MapCompose(str.strip))
            l.add_css('DeliveryTime', 'i.delTime::text', MapCompose(str.strip))
            l.add_css('CreditCard', 'ul.content_credit_cards > li > img::attr(alt)', MapCompose(str.strip))
            l.add_css('DeliveryCost', 'div.shipping > i::text', MapCompose(str.strip))
            l.add_css('Rankink', 'span.ranking i::text', MapCompose(str.strip))
            l.add_css('No', 'span.ranking a::text', MapCompose(str.strip))
            l.add_css('Sponsor', 'span.grey_small.not-logged::text', MapCompose(str.strip))
            l.add_css('DeliveryMinimun', 'div.minDelivery::text', MapCompose(str.strip))
            l.add_css('Distance', 'div.distance i::text', MapCompose(str.strip))
            yield l.load_item()

        next_page_url = response.css('li.arrow.next > a ::attr(href)').extract()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(url=next_page_url, callback=self.parse)

Thanks in advance and have a nice day!!


The problem is in this line:

next_page_url = response.css('li.arrow.next > a::attr(href)').extract() 

because the extract() method always returns a list of results, even if it finds only one. Either use the extract_first() method, which will give you just the first result:

next_page_url = response.css('li.arrow.next > a::attr(href)').extract_first() 

or take the first element of the result list yourself:

next_page_url = response.css('li.arrow.next > a::attr(href)').extract()[0] 
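
Putting it together, a minimal sketch of how the end of parse could then look (assuming the same selector as in the question; extract_first() returns None when there is no next-page link, so the request is only yielded when one actually exists):

next_page_url = response.css('li.arrow.next > a ::attr(href)').extract_first()
if next_page_url:
    # next_page_url is now a single string, so urljoin works
    yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)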

Thanks for your quick answer, you are right. – oscarQ

next_page_url = response.css('li.arrow.next > a ::attr(href)').extract()
                                                              ^^^^^^^^^^
if next_page_url:
    next_page_url = response.urljoin(next_page_url)
                             ^^^^^^^

You are calling urljoin on a list: when building next_page_url, the extract() method returns a list of all matching values, even if there is only one.
To fix this, use extract_first() instead:

next_page_url = response.css('li.arrow.next > a ::attr(href)').extract_first() 
                                                               ^^^^^^^^^^^^^^^
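
The underlying error is easy to reproduce outside Scrapy, since response.urljoin() just delegates to urljoin() from the standard library (as the traceback above shows); the URL values below are made up for illustration:

from urllib.parse import urljoin

# Passing a list where a string is expected raises the same error
# as in the traceback:
urljoin('https://www.pedidosja.com.br/', ['/restaurantes/sao-paulo'])
# TypeError: Cannot mix str and non-str arguments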