Text processing - Word2Vec training after phrase detection (bigram model)

I want to build a word2vec model with more n-grams than usual. As far as I can tell, the Phrases class in gensim.models.phrases can detect the phrases I want, and it is possible to apply it to a corpus and pass the resulting transformed corpus to word2vec for training.

So first of all, I did something like the code below, following the example code in the gensim documentation.

import os

import gensim
from nltk.tokenize import word_tokenize


class MySentences(object):
    """Stream tokenized lines from every file in a directory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield word_tokenize(line)


sentences = MySentences('sentences_directory')

bigram = gensim.models.Phrases(sentences)

model = gensim.models.Word2Vec(bigram[sentences], size=300, window=5, workers=8)

A model was created, but it gave poor evaluation results and produced a warning:

WARNING : train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable) 

I searched and found https://groups.google.com/forum/#!topic/gensim/XWQ8fPMFSi0, then changed my code:

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield word_tokenize(line)


class PhraseItertor(object):
    def __init__(self, my_phraser, data):
        self.my_phraser, self.data = my_phraser, data

    def __iter__(self):
        yield self.my_phraser[self.data]


sentences = MySentences('sentences_directory')

bigram_transformer = gensim.models.Phrases(sentences)

bigram = gensim.models.phrases.Phraser(bigram_transformer)

corpus = PhraseItertor(bigram, sentences)

model = gensim.models.Word2Vec(corpus, size=300, window=5, workers=8)

Now I get this error:

Traceback (most recent call last): 
    File "/home/fatemeh/Desktop/Thesis/bigramModeler.py", line 36, in <module> 
    model = gensim.models.Word2Vec(corpus, size=300, window=5, workers=8) 
    File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 478, in __init__ 
    self.build_vocab(sentences, trim_rule=trim_rule) 
    File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 553, in build_vocab 
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey 
    File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 575, in scan_vocab 
    vocab[word] += 1 
TypeError: unhashable type: 'list' 

Now I want to know what is wrong with my code.

Answer

I asked my question on the Gensim Google Group, and Mr. Gordon Mohr answered:

You typically wouldn't want an __iter__() method to do a single yield. It should return an iterator object (ready to return multiple objects via next() or a StopIteration exception). One way to effect an iterator is to use yield to have the method treated as a 'generator', but that would typically require the yield to be inside a loop.
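To illustrate the point about a single yield, here is a minimal sketch that is not from the original thread. It uses no gensim at all: `SingleYield`, `LoopingIterable`, and `count_words` are hypothetical names, and `count_words` only imitates a vocabulary scan by counting one token at a time. Under those assumptions, a wrapper whose `__iter__` yields once hands over the whole corpus as if it were one sentence, so each "word" is a list, and counting it fails in the same way as the traceback in the question.

```python
from collections import Counter

# Toy corpus: a list of already-tokenized sentences.
corpus = [["new", "york", "is", "big"], ["i", "like", "new", "york"]]


class SingleYield:
    """Mimics the question's wrapper: __iter__ yields the whole corpus once."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        yield self.data  # a single item: the entire list of sentences


class LoopingIterable:
    """Returns an iterator that produces one sentence per step."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)


def count_words(sentences):
    """Imitation of a vocabulary scan: count each token of each sentence."""
    vocab = Counter()
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1  # requires `word` to be hashable
    return vocab


try:
    count_words(SingleYield(corpus))
    error = None
except TypeError as exc:
    # Each "word" was a whole sentence (a list), which cannot be a dict key.
    error = str(exc)

good_vocab = count_words(LoopingIterable(corpus))
print(error)
print(good_vocab["new"])
```

The first wrapper produces exactly one item per pass, while the second produces one item per sentence, which is the shape a sentence-by-sentence consumer expects.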

But I now see that my example code in the thread you reference does the wrong thing with its __iter__() return line: it should not be returning the raw phrasifier, but one that has already been started-as-an-iterator, by use of the iter() built-in method. That is, the example there should have read:

class PhrasingIterable(object):
    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts

    def __iter__(self):
        # Start a fresh pass over the phrase-transformed texts on every call.
        return iter(self.phrasifier[self.texts])

Making a similar change in your variation may resolve the TypeError: iter() returned non-iterator of type 'TransformedCorpus' error.
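The sketch below shows why a restartable wrapper of this kind matters. It runs without gensim: `ToyPhraser` is a hypothetical stand-in for a trained phrase model that joins a fixed set of bigrams and, by assumption, returns a one-shot generator when indexed with a corpus. The wrapper is written in the style of `PhrasingIterable`, with `self` used throughout. A consumer that makes several passes over its input (for example one pass to build a vocabulary and more to train) finds a bare generator empty on the second pass, which is a plausible explanation for the "empty iterator" warning quoted in the question, while the wrapper restarts each time.

```python
class ToyPhraser:
    """Hypothetical stand-in for a trained phrase model: joins known bigrams."""

    def __init__(self, bigrams):
        self.bigrams = set(bigrams)

    def _join(self, tokens):
        out, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if len(pair) == 2 and pair in self.bigrams:
                out.append(pair[0] + "_" + pair[1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    def __getitem__(self, texts):
        # Assumption for this sketch: indexing returns a one-shot generator.
        return (self._join(tokens) for tokens in texts)


class PhrasingIterable:
    """Restartable view: every __iter__ call begins a new transformed pass."""

    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts

    def __iter__(self):
        return iter(self.phrasifier[self.texts])


texts = [["new", "york", "is", "big"], ["i", "like", "new", "york"]]
phraser = ToyPhraser([("new", "york")])

# A bare generator is exhausted after one pass.
one_shot = phraser[texts]
first_pass = list(one_shot)
second_pass = list(one_shot)

# The wrapper can be iterated as many times as needed.
corpus = PhrasingIterable(phraser, texts)
pass_a = list(corpus)
pass_b = list(corpus)

print(first_pass)
print(second_pass)
print(pass_a == pass_b)
```

With real gensim objects, the same wrapper would be built from the `Phraser` and the `MySentences` instance and passed to `Word2Vec` in place of the raw transformed corpus. Note that gensim 4.x renamed the `size` parameter of `Word2Vec` to `vector_size`, so the call in the question would need that adjustment on newer versions.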