2010-09-16

I am trying to use urllib3 with a simple thread pool to fetch several wiki pages.

The script creates one connection per thread (I don't understand why) and then hangs indefinitely. Any tip, advice, or a simple example of urllib3 with threading would be appreciated.

import threadpool 
from urllib3 import connection_from_url 

HTTP_POOL = connection_from_url(url, timeout=10.0, maxsize=10, block=True) 

def fetch(url, fields): 
    kwargs={'retries':6} 
    return HTTP_POOL.get_url(url, fields, **kwargs) 

pool = threadpool.ThreadPool(5) 
requests = threadpool.makeRequests(fetch, iterable) 
[pool.putRequest(req) for req in requests] 

With @Lennart's script I got this error:

http://en.wikipedia.org/wiki/2010-11_Premier_League
http://en.wikipedia.org/wiki/List_of_MythBusters_episodes
http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes
http://en.wikipedia.org/wiki/List_of_Unicode_characters
Traceback (most recent call last): 
    File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run 
    result = request.callable(*request.args, **request.kwds) 
    File "crawler.py", line 9, in fetch 
    print url, conn.get_url(url) 
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url' 
(the same traceback is repeated for each of the other three URLs)

After adding import threadpool, import urllib3 and tpool = threadpool.ThreadPool(4) to @user318904's code, I got this error:

Traceback (most recent call last): 
    File "crawler.py", line 21, in <module> 
    tpool.map_async(fetch, urls) 
AttributeError: ThreadPool instance has no attribute 'map_async' 

Answers


Obviously it will create one connection per thread; how else could each thread fetch a page? And you are trying to use the same connection, created from a single URL, for all the URLs. That can hardly be what you intended.

This code works beautifully:

import threadpool 
from urllib3 import connection_from_url 

def fetch(url): 
    kwargs={'retries':6} 
    conn = connection_from_url(url, timeout=10.0, maxsize=10, block=True) 
    print url, conn.get_url(url) 
    print "Done!" 

pool = threadpool.ThreadPool(4) 
urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League', 
     'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes', 
     'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes', 
     'http://en.wikipedia.org/wiki/List_of_Unicode_characters', 
     ] 
requests = threadpool.makeRequests(fetch, urls) 

[pool.putRequest(req) for req in requests] 
pool.wait() 
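
If your installed urllib3 no longer has get_url (which is exactly the AttributeError reported above), the same fetch loop can be sketched against the request() API instead. This assumes a urllib3 release that provides PoolManager; a single manager can then be shared safely by all worker threads:

import threadpool 
import urllib3 

# One PoolManager shared by every worker thread; connections are reused per host. 
http = urllib3.PoolManager(maxsize=10, block=True) 

urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League', 
        'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes'] 

def fetch(url): 
    # request() replaces the old get_url(); retries and timeout are set per request 
    r = http.request('GET', url, timeout=10.0, retries=6) 
    print url, r.status, len(r.data) 

pool = threadpool.ThreadPool(4) 
[pool.putRequest(req) for req in threadpool.makeRequests(fetch, urls)] 
pool.wait() 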

I use something like this:

#excluding setup for threadpool etc 

upool = urllib3.HTTPConnectionPool('en.wikipedia.org', block=True) 

urls = ['/wiki/2010-11_Premier_League', 
     '/wiki/List_of_MythBusters_episodes', 
     '/wiki/List_of_Top_Gear_episodes', 
     '/wiki/List_of_Unicode_characters', 
     ] 

def fetch(path): 
    # add error checking 
    return upool.get_url(path).data 

tpool = ThreadPool() 

tpool.map_async(fetch, urls) 

# either wait on the result object or give map_async a callback function for the results 
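
The map_async error in the question comes from mixing this approach up with the threadpool package, whose ThreadPool has no map_async. The standard library's multiprocessing.pool.ThreadPool does, so one way to fill in the elided setup (my assumption about which pool class was meant) is:

import urllib3 
from multiprocessing.pool import ThreadPool 

upool = urllib3.HTTPConnectionPool('en.wikipedia.org', block=True) 

urls = ['/wiki/2010-11_Premier_League', 
        '/wiki/List_of_MythBusters_episodes', 
        '/wiki/List_of_Top_Gear_episodes', 
        '/wiki/List_of_Unicode_characters'] 

def fetch(path): 
    # the answer used get_url(); request() is the equivalent on current urllib3 
    return upool.request('GET', path).data 

tpool = ThreadPool(4) 
result = tpool.map_async(fetch, urls) 
pages = result.get()   # blocks until every page has been fetched 
print [len(page) for page in pages] 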

Thread programming is hard, so I wrote workerpool to make exactly what you are doing easier. For more information, see the Mass Downloader example.

To do the same thing with urllib3, it looks like this:

import urllib3 
import workerpool 

http = urllib3.connection_from_url("foo", maxsize=3)  # "foo" stands in for the real base URL 

def download(url): 
    r = http.get_url(url.strip())  # strip the newline that readlines() leaves on each URL 
    # TODO: Do something with r.data 
    print "Downloaded %s" % url 

# Initialize a pool, 5 threads in this case 
pool = workerpool.WorkerPool(size=5) 

# The ``download`` method will be called with a line from the second 
# parameter for each job. 
pool.map(download, open("urls.txt").readlines()) 

# Send shutdown jobs to all threads, and wait until all the jobs have been completed 
pool.shutdown() 
pool.wait() 

For more sophisticated code, take a look at workerpool.EquippedWorker (and the tests here for example usage). You can make the pool the toolbox that you pass around.
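
Without pinning down EquippedWorker's exact API, the underlying idea of handing every worker thread the same shared urllib3 pool can be sketched with just the standard library:

import threading 
import Queue 
import urllib3 

# Shared "toolbox": one connection pool reused by every worker thread. 
http = urllib3.connection_from_url('http://en.wikipedia.org/', maxsize=4, block=True) 
jobs = Queue.Queue() 

def worker(): 
    while True: 
        path = jobs.get() 
        try: 
            r = http.request('GET', path, retries=6) 
            print path, r.status 
        finally: 
            jobs.task_done() 

for _ in range(4): 
    t = threading.Thread(target=worker) 
    t.daemon = True   # so the process can exit once jobs.join() returns 
    t.start() 

for path in ['/wiki/List_of_Top_Gear_episodes', '/wiki/List_of_Unicode_characters']: 
    jobs.put(path) 

jobs.join()   # block until every queued path has been fetched 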