Scrapy SgmlLinkExtractor: add an arbitrary URL

How do I add a URL to SgmlLinkExtractor? That is, how can I add an arbitrary URL so that the callback runs on it?

To elaborate, using dirbot as an example: https://github.com/scrapy/dirbot/blob/master/dirbot/spiders/googledir.py

parse_category is only reached for links that match the SgmlLinkExtractor: SgmlLinkExtractor(allow='directory.google.com/[A-Z][a-zA-Z_/]+$')
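To see why an arbitrary URL never reaches the callback through the rule alone, here is a minimal sketch using only Python's re module. It assumes the extractor treats allow as a regular expression searched against each link's URL, which is my understanding of how it behaves. The helper is_followed and the candidate URLs are hypothetical, made up for illustration.

```python
import re

# The allow pattern from the question (as reconstructed above).
ALLOW = re.compile(r'directory.google.com/[A-Z][a-zA-Z_/]+$')


def is_followed(url):
    """Return True if the allow pattern is found somewhere in the URL."""
    return ALLOW.search(url) is not None


# Hypothetical URLs, for illustration only.
candidates = [
    "http://directory.google.com/Top/Arts",
    "http://directory.google.com/",
    "https://www.example.com/some/page",
]

# Only URLs that match the pattern would be handed to the rule's callback.
matched = [url for url in candidates if is_followed(url)]
print(matched)
```

A URL outside the pattern, such as the example.com one here, is filtered out, so it has to be requested explicitly rather than through the link extractor.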

2011-11-20 Lionel

Use BaseSpider instead of CrawlSpider, then add the URL in start_requests or in start_urls.

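from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider  # import paths as in Scrapy releases of that era


class MySpider(BaseSpider):
    name = "myspider"

    def start_requests(self):
        # Request the arbitrary URL directly and route it to the callback.
        return [Request("https://www.example.com", callback=self.parse)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ...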

2011-11-21 05:06:03 Lionel

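from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule  # import paths as in Scrapy releases of that era


class ThemenHubSpider(CrawlSpider):
    name = 'themenHub'
    allowed_domains = ['themen.t-online.de']
    start_urls = ["http://themen.t-online.de/themen-a-z/a"]
    # Links whose URL matches id_<digits> are passed to parse_news.
    rules = [Rule(SgmlLinkExtractor(allow=[r'id_\d+']), 'parse_news')]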

2013-01-15 16:42:10 Anno2001
