R - Web scraping - Problem getting attribute values using rvest

I'm trying to use rvest to extract ISO country information from Wikipedia (including links to another page). I can't find a way to get the links (the href attribute) correctly without the attribute name being included as well; I tried the XPath string() function, but that causes an error. The code below is fairly easy to run and self-explanatory.

Any help appreciated!

library(rvest) 
library(dplyr) 

searchPage <- read_html("https://en.wikipedia.org/wiki/ISO_3166-2") 
nodes <- html_node(searchPage, xpath = '(//h2[(span/@id = "Current_codes")]/following-sibling::table)[1]') 
codes <- html_nodes(nodes, xpath = 'tr/td[1]/a/text()') 
names <- html_nodes(nodes, xpath = 'tr/td[2]//a[@title]/text()') 
#Following brings back data but attribute name as well 
links <- html_nodes(nodes, xpath = 'tr/td[2]//a[@title]/@href') 
#Following returns nothing 
links2 <- html_nodes(nodes, xpath = 'tr/td[2]//a[@title]/@href/text()') 
#Following Errors 
links3 <- html_nodes(nodes, xpath = 'string(tr/td[2]//a[@title]/@href)') 
#Following Errors 
links4 <- sapply(nodes, function(x) { x %>% read_html() %>% html_nodes("tr/td[2]//a[@title]") %>% html_attr("href") })


2017-10-21 Martin Thompson

You should have included more information in your question. The "self-explanatory" bit almost made me skip the question (hint: consider providing enough verbal detail, alongside the broken code, out of respect for other people's time).

I say that because I have no idea whether this is what you needed, since you didn't really say.

library(rvest) 
library(tibble) 

pg <- read_html("https://en.wikipedia.org/wiki/ISO_3166-2") 

tab <- html_node(pg, xpath=".//table[contains(., 'Zimbabwe')]") 

iso_col <- html_nodes(tab, xpath=".//td[1]/a[contains(@href, 'ISO')]") 
name_col <- html_nodes(tab, xpath=".//td[2]") 

tibble(  # data_frame() is deprecated in current tibble releases in favour of tibble()
    iso2c = html_text(iso_col), 
    iso2c_link = html_attr(iso_col, "href"), 
    country_name = html_text(name_col), 
    country_link = html_nodes(name_col, xpath=".//a[contains(@href, 'wiki')]") %>% html_attr("href") 
) 
## # A tibble: 249 x 4
##    iso2c iso2c_link          country_name         country_link
##    <chr> <chr>               <chr>                <chr>
##  1 AD    /wiki/ISO_3166-2:AD Andorra              /wiki/Andorra
##  2 AE    /wiki/ISO_3166-2:AE United Arab Emirates /wiki/United_Arab_Emirates
##  3 AF    /wiki/ISO_3166-2:AF Afghanistan          /wiki/Afghanistan
##  4 AG    /wiki/ISO_3166-2:AG Antigua and Barbuda  /wiki/Antigua_and_Barbuda
##  5 AI    /wiki/ISO_3166-2:AI Anguilla             /wiki/Anguilla
##  6 AL    /wiki/ISO_3166-2:AL Albania              /wiki/Albania
##  7 AM    /wiki/ISO_3166-2:AM Armenia              /wiki/Armenia
##  8 AO    /wiki/ISO_3166-2:AO Angola               /wiki/Angola
##  9 AQ    /wiki/ISO_3166-2:AQ Antarctica           /wiki/Antarctica
## 10 AR    /wiki/ISO_3166-2:AR Argentina            /wiki/Argentina
## # ... with 239 more rows
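
As for the attempts in the question, here is a minimal sketch of what seems to be going on. It assumes a small hand-written table (the `html` string below is a hypothetical stand-in, not the live Wikipedia markup) so that it runs offline. Selecting `@href` with XPath returns attribute nodes, which print with their name; the bare value can be read either with `html_attr()` on the `<a>` elements or with `html_text()` on the attribute nodes. An XPath `string()` expression yields a single string rather than a node set, so it does not fit `html_nodes()`; `xml2::xml_find_chr()` accepts it, but only the first match is returned.

```r
library(rvest)

# Hypothetical two-row stand-in for the Wikipedia table (an assumption for illustration)
html <- '<table>
  <tr><td><a href="/wiki/ISO_3166-2:AD">AD</a></td>
      <td><a href="/wiki/Andorra" title="Andorra">Andorra</a></td></tr>
  <tr><td><a href="/wiki/ISO_3166-2:AE">AE</a></td>
      <td><a href="/wiki/United_Arab_Emirates" title="United Arab Emirates">United Arab Emirates</a></td></tr>
</table>'
tab <- html_node(read_html(html), "table")

# Option 1: select the <a> elements, then read the attribute value
anchors <- html_nodes(tab, xpath = ".//tr/td[2]//a[@title]")
links_a <- html_attr(anchors, "href")

# Option 2: select the attribute nodes, then take their text value
attr_nodes <- html_nodes(tab, xpath = ".//tr/td[2]//a[@title]/@href")
links_b <- html_text(attr_nodes)

# string() returns one string (the first match only), so use xml_find_chr()
first_link <- xml2::xml_find_chr(tab, "string(.//tr/td[2]//a[@title]/@href)")
```

Option 1 is usually the easier one to read, and it keeps one entry per selected `<a>` element, returning `NA` where the attribute is missing.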


2017-10-21 11:50:21 hrbrmstr

Thanks! Sorry, I thought the comments would be good enough; I'll try to include more information in future!
