2017-03-09 2 views

Using sklearn tf-idf: how to split tokens on "####" instead of the default separator

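from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?'
]

# The default tokenizer extracts words of two or more alphanumeric characters.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)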

But I want to use input in this form:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This####is####the####first####document.',
    'This####is####the####second####second####document.'
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
word = vectorizer.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
weight = tfidf.toarray()

How can I do this?


Pass your own tokenizer; see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Answer

1

Use a custom tokenizer:

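from sklearn.feature_extraction.text import CountVectorizer

def four_pounds_tokenizer(s):
    # Split each document on the literal "####" separator.
    return s.split('####')

# corpus is the "####"-separated list from the question.
vectorizer = CountVectorizer(tokenizer=four_pounds_tokenizer)
X = vectorizer.fit_transform(corpus)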