2016-10-29

Simple Python implementation of doc2vec?

I'm trying to implement gensim's doc2vec, but I'm running into errors and there isn't much documentation or help on the web. Here is part of my working code:

from gensim.models import Doc2Vec 
from gensim.models.doc2vec import LabeledSentence 

class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        with open(self.filename, 'r') as f:
            for uid, line in enumerate(f):
                # Debug output: show each document as it is yielded.
                print(LabeledSentence(line.split(), tags=['TXT_%s' % uid]))
                yield LabeledSentence(words=line.split(), tags=['TXT_%s' % uid])

sentences = LabeledLineSentence('myfile.txt') 

What my txt file looks like:

hi how are you
hi how are you
hi how are you
its such a great day
its such a great day
its such a great day
i like dogs
i like cats
i like snakes
the ice cream was yummy
the cake was awesome

Model initialization:

model = Doc2Vec(alpha=0.025, min_alpha=0.025, size=50, window=5, min_count=5,
                dm=1, workers=8, sample=1e-5)

Sample print output:

LabeledSentence(['hi', 'how', 'are', 'you'], ['TXT_0']) 
LabeledSentence(['hi', 'how', 'are', 'you'], ['TXT_1']) 
LabeledSentence(['hi', 'how', 'are', 'you'], ['TXT_2']) 
LabeledSentence(['its', 'such', 'a', 'great', 'day'], ['TXT_3']) 
LabeledSentence(['its', 'such', 'a', 'great', 'day'], ['TXT_4']) 
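The tagging logic behind that output can be checked without gensim. The following is only a sketch: `TaggedLine` is a hypothetical namedtuple standing in for `LabeledSentence`, and `TaggedLineCorpus` and `demo` are illustrative names, not part of any library. It also shows that a class with `__iter__` can be iterated more than once, which matters because vocabulary building and training each make their own pass over the corpus.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for gensim's LabeledSentence, so this runs without gensim.
TaggedLine = namedtuple('TaggedLine', ['words', 'tags'])


class TaggedLineCorpus(object):
    """Re-iterable corpus: every call to __iter__ reopens the file."""

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, 'r') as f:
            for uid, line in enumerate(f):
                yield TaggedLine(words=line.split(), tags=['TXT_%s' % uid])


def demo():
    # Write a three-line sample file, then iterate over it twice.
    lines = ['hi how are you', 'its such a great day', 'i like dogs']
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
        tmp.write('\n'.join(lines) + '\n')
    try:
        corpus = TaggedLineCorpus(tmp.name)
        first_pass = list(corpus)
        second_pass = list(corpus)
    finally:
        os.remove(tmp.name)
    return first_pass, second_pass


first_pass, second_pass = demo()
```

Both passes yield the same documents. A plain generator function would be exhausted after the first pass.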

This is where the error occurs:

for epoch in range(500):
    try:
        print('epoch %d' % epoch)
        model.train(sentences)
        # Decay the learning rate manually after each pass.
        model.alpha *= 0.99
        model.min_alpha = model.alpha
    except (KeyboardInterrupt, SystemExit):
        break

RuntimeError: you must first build vocabulary before training the model 

Any idea why?
