Nokogiri trouver seulement des liens entrants

import re 
import urlparse 

domain = ... 
html = ... 
links = re.findall('href=[\'"](.*?)[\'"]', html) 
links = [urlparse.urljoin(domain, link) for link in links if link]

Source

2010-05-27 01:45:43 hoju

Quelque chose comme

doc.search("a").map do |a| 
    url = a.attribute("href") 
    #this part could be a lot more robust, but you get the idea... 
    full_url = url.match("^http://") ? url : "http://somedomain.com/#{url}" 
end.select{|url| url.match("^http://somedomain.com")}

Source

2010-05-27 02:13:09

Nokogiri trouver seulement des liens entrants

Répondre

Questions connexes