
Here is the basic structure of my Scrapy spider. How can I wrap the code that builds start_urls inside the spider?

import scrapy
import urllib.request

class TestSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["finance.yahoo.com"]

    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)  # keep scrapy.Spider's own setup
        self.timeout = 10

    # Build start_urls at class-definition time, i.e. when the module is imported.
    url_nasdaq = "ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqlisted.txt"
    s = urllib.request.urlopen(url_nasdaq).read().decode('ascii')
    s1 = s.split('\r\n')[1:-2]  # drop the header row and the trailing footer lines
    namelist = [item for item in s1 if "NASDAQ TEST STOCK" not in item]  # skip test entries
    s2 = [s.split('|')[0] for s in namelist]  # the ticker is the first pipe-delimited field
    s3 = [symbol for symbol in s2 if "." not in symbol]  # drop symbols containing a dot

    start_urls = ["https://finance.yahoo.com/quote/" + s + "/financials?p=" + s for s in s3]

    def parse(self, response):
        content = response.body
        target = response.url
        # do something; code omitted

Save it as test.py and run it with scrapy runspider test.py.
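For reference, the same file can also be launched from a plain Python script instead of the scrapy CLI; a minimal sketch using Scrapy's CrawlerProcess (settings kept to a bare minimum here):

from scrapy.crawler import CrawlerProcess

# assumes TestSpider from above is defined (or imported) in this script
process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(TestSpider)
process.start()  # blocks until the crawl finishes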

Now I want to wrap all of the code that creates start_urls inside the spider itself. Here is my attempt:

class TestSpider(scrapy.Spider):
    def __init__(self, *args, **kw):
        self.timeout = 10
        url_nasdaq = "ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqlisted.txt"
        s = urllib.request.urlopen(url_nasdaq).read().decode('ascii')
        s1 = s.split('\r\n')[1:-2]
        namelist = [item for item in s1 if "NASDAQ TEST STOCK" not in item]
        s2 = [s.split('|')[0] for s in namelist]
        s3 = [symbol for symbol in s2 if "." not in symbol]
        self.start_urls = ["https://finance.yahoo.com/quote/" + s + "/financials?p=" + s for s in s3]

This does not work.

Answer


That is what the spider's start_requests method is for: it creates the initial set of requests. Based on your example, it would read like this:

class TestSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["finance.yahoo.com"]

    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        self.timeout = 10

    def start_requests(self):
        # Same symbol-list construction as before, but run at crawl time
        # instead of at import time.
        url_nasdaq = "ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqlisted.txt"
        s = urllib.request.urlopen(url_nasdaq).read().decode('ascii')
        s1 = s.split('\r\n')[1:-2]
        namelist = [item for item in s1 if "NASDAQ TEST STOCK" not in item]
        s2 = [s.split('|')[0] for s in namelist]
        s3 = [symbol for symbol in s2 if "." not in symbol]
        for s in s3:
            yield scrapy.Request("https://finance.yahoo.com/quote/" + s + "/financials?p=" + s,
                                 callback=self.parse)
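Saved as test.py, this runs with scrapy runspider test.py exactly like the original. As a side note, the __init__ approach from your attempt can also work, provided the base initializer runs before anything else so the spider is fully set up; a minimal, untested sketch (build_start_urls below is a hypothetical helper holding the same symbol-list code):

class TestSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["finance.yahoo.com"]

    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)  # let scrapy.Spider initialize first
        self.timeout = 10
        # build_start_urls() is a hypothetical helper containing the same
        # NASDAQ symbol-list code shown in start_requests above
        self.start_urls = build_start_urls()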