Je veux extraire tous les liens http://example.com/1 et ignorer tous les liens après 2 <br><br>
tag avec beautifulsoup.analyser avant 2 tag beautifulsoup python
<div class="compost">
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
<b><a target="_blank" href="http://example.com/2"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
<b><a target="_blank" href="http://example.com/3"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
ici fait partie que je dois analyser:
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b>
ici fait partie de mon code
for links in obja.find_all("div", class_="compost"):
if links.has_attr('href'):
print links['href']
#
aa = links.findAll('a')[0]
print aa.attrs['href']
txt = []
for i in links.findAll('br'):
txt.append(i.text)
print i.nextSibling
if i.nextSibling.text != u'br':
txt.append(i.nextSibling.text)
''.join(txt)
mon extrait de script tous les liens et je ne sais pas comment puis-je extraire tout http://example.com/1 et ignorer tous les liens après <br><br>
?
Pouvez-vous ignorer tous les autres liens et saisir uniquement 'A' tags avec une' 'href' de http://example.com/ 1'? 'obja.find_all ('a', attrs = {'href': 'http://example.com/1'})' –
@GregReda, je ne peux pas parce que les liens ont le même domaine, il est utilisé pour la redirection. – Kate