Make Scrapy follow links in order

I wrote a script that uses Scrapy to find links in a first phase, then follows each link and extracts something from the destination page in a second phase. Scrapy does this, but it follows the links in no particular order. That is, I expect output like this:

link1 | data_extracted_from_link1_destination_page 
link2 | data_extracted_from_link2_destination_page 
link3 | data_extracted_from_link3_destination_page 
. 
. 
.

but I get

link1 | data_extracted_from_link2_destination_page 
link2 | data_extracted_from_link3_destination_page 
link3 | data_extracted_from_link1_destination_page 
. 
. 
.

Here is my code:

import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'

        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())

            yield {"LinkText": LinkText}
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents)

    def parse_contents(self, response):

        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        yield {"LinkContent": sContent}

What is wrong with my code?


2017-05-28 Gmosy Gnaq

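yield is not synchronous: Scrapy schedules requests asynchronously, so responses come back in no guaranteed order, and items yielded separately in parse and parse_contents cannot be matched up afterwards. Use meta to pass the link text along with the request, so that it is available in the same callback as the page content. Docs: https://doc.scrapy.org/en/latest/topics/request-response.html
Code: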

import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'
        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())
            # Attach the link text to the request so it travels with the response.
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents, meta={"LinkText": LinkText})

    def parse_contents(self, response):
        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        # Join the text fragments and strip newlines and tabs.
        sContent = "".join(lContent).replace("\n", "").replace("\t", "")
        linkText = response.meta['LinkText']
        # Emit one item that pairs the link text with its own page content.
        yield {"LinkContent": sContent, "LinkText": linkText}
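
To see why carrying the link text with the request fixes the pairing, here is a minimal sketch that uses only the standard library's asyncio, not Scrapy. The names fetch, crawl, DELAYS and the "Index" key are hypothetical and only simulate downloads that finish out of order. Each simulated request keeps its own metadata, so every item is paired correctly no matter when it completes, and the extra index allows the original link order to be restored afterwards.

```python
import asyncio

# Hypothetical stand-in for page downloads: each "fetch" takes a different
# amount of time, so responses complete in a different order than requested.
DELAYS = {"link1": 0.03, "link2": 0.01, "link3": 0.02}


async def fetch(index, link_text):
    """Simulate a request whose metadata (index, link_text) travels with it."""
    await asyncio.sleep(DELAYS[link_text])
    content = f"data_extracted_from_{link_text}_destination_page"
    # Build one complete item from the "response" plus its own metadata.
    return {"Index": index, "LinkText": link_text, "LinkContent": content}


async def crawl(links):
    tasks = [asyncio.create_task(fetch(i, text)) for i, text in enumerate(links)]
    items = []
    # as_completed yields results in completion order, not request order.
    for finished in asyncio.as_completed(tasks):
        items.append(await finished)
    return items


items = asyncio.run(crawl(["link1", "link2", "link3"]))

# Restore the order in which the links were found by sorting on the index.
ordered = sorted(items, key=lambda item: item["Index"])
for item in ordered:
    print(item["LinkText"], "|", item["LinkContent"])
```

In a Scrapy spider, the same idea would presumably mean adding a position counter to the meta dictionary next to LinkText and sorting the exported items on it afterwards; that step is an assumption and is not part of the answer above.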


2017-05-29 01:12:45
