2017-03-05 1 views
0

Je veux extraire tous les liens http://example.com/1 et ignorer tous les liens après 2 <br><br> tag avec beautifulsoup.analyser avant 2 tag beautifulsoup python

<div class="compost"> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18" class="select_index"></span>text 2</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b> 
<br> 
<br> 
<b><a target="_blank" href="http://example.com/2"><span id="s_index18" class="select_index"></span>text 2</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index19" class="select_index"></span>text 3</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index20" class="select_index"></span>text 4</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index21" class="select_index"></span>text 5</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index22" class="select_index"></span>text 6</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index23" class="select_index"></span>text 7</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index24" class="select_index"></span>text 8</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index25" class="select_index"></span>text 9</a></b> 
<br> 
<br> 
<b><a target="_blank" href="http://example.com/3"><span id="s_index18" class="select_index"></span>text 2</a></b> 
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index19" class="select_index"></span>text 3</a></b> 
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index20" class="select_index"></span>text 4</a></b> 
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index21" class="select_index"></span>text 5</a></b> 
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index22" class="select_index"></span>text 6</a></b> 
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index23" class="select_index"></span>text 7</a></b> 
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index24" class="select_index"></span>text 8</a></b> 
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index25" class="select_index"></span>text 9</a></b> 
<br> 
<br> 

ici fait partie que je dois analyser:

<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18" class="select_index"></span>text 2</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b> 

ici fait partie de mon code

for links in obja.find_all("div", class_="compost"): 
     if links.has_attr('href'): 
      print links['href'] 
     # 
     aa = links.findAll('a')[0] 
     print aa.attrs['href'] 
     txt = [] 
     for i in links.findAll('br'): 
      txt.append(i.text) 
      print i.nextSibling 
      if i.nextSibling.text != u'br': 
       txt.append(i.nextSibling.text) 

     ''.join(txt) 

mon extrait de script tous les liens et je ne sais pas comment puis-je extraire tout http://example.com/1 et ignorer tous les liens après <br><br>?

+0

Pouvez-vous ignorer tous les autres liens et saisir uniquement 'A' tags avec une' 'href' de http://example.com/ 1'? 'obja.find_all ('a', attrs = {'href': 'http://example.com/1'})' –

+0

@GregReda, je ne peux pas parce que les liens ont le même domaine, il est utilisé pour la redirection. – Kate

Répondre

1

Vous pouvez simplement trouver le premier <br><br> et rechercher des hrefs dans seulement cette sous-chaîne.

Comme ceci:

from bs4 import BeautifulSoup 

example = """ 
<div class="compost"> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18"  class="select_index"></span>text 2</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b> 
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b> 
<br> 
<br> 
<b><a target="_blank" href="http://example.com/2"><span id="s_index18" class="select_index"></span>text 2</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index19" class="select_index"></span>text 3</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index20" class="select_index"></span>text 4</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index21" class="select_index"></span>text 5</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index22" class="select_index"></span>text 6</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index23" class="select_index"></span>text 7</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index24" class="select_index"></span>text 8</a></b> 
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index25" class="select_index"></span>text 9</a></b> 
<br> 
<br> 
....""" 

br_split = example[0: example.index("<br>\n<br>")] 

soup = BeautifulSoup(br_split, "html.parser") 

print (soup.find_all("a")) 

Sorties: Output

+0

merci pour votre exemple instructif, c'est étrange, je n'ai jamais vu un tel index. – Kate