How to remove useless information in topic modeling (LDA)
Hello, I would like to build a topic model. My data has this structure:
1. Doesn't taste good to me.
2. Most delicious ramen I have ever had. Spicy and tasty. Great price too.
3. I have this on my subscription, my family loves this version. The taste is great by itself or when we add the vegetables and.or meats.
4. The noodle is ok, but I had better ones.
5. some day's this is lunch and or dinner on second case
6. Really good ramen!
I cleaned the comments and ran the topic modeling, but as you can see, tokens such as "", "26.6564810276031", and "character(0)" appear in the output:
[,1] [,2] [,3] [,4]
[1,] "cabbag" ")." "=" "side"
[2,] "gonna" "26.6564810276031," "" "day,"
[3,] "broth" "figur" "character(0)," "ok."
Originally, these tokens were not visible when I only looked at word frequencies, but they show up when I run the topic modeling. What did I do wrong? How can I fix it?
library(tm)
library(XML)
library(SnowballC)
crudeCorp <- VCorpus(VectorSource(readLines(file.choose())))
crudeCorp <- tm_map(crudeCorp, stripWhitespace)
crudeCorp <- tm_map(crudeCorp, content_transformer(tolower))
# remove stopwords from corpus
crudeCorp<-tm_map(crudeCorp, removeWords, stopwords("english"))
myStopwords <- c(stopwords("english"), "noth", "two", "first", "lot", "because", "can", "will", "go", "also", "get", "since", "way", "even", "just", "now", "give", "gave", "got", "one", "make", "much", "come", "take", "without", "goes", "along", "alot", "alone")
# keep "will" and "can" in the corpus after all
myStopwords <- setdiff(myStopwords, c("will", "can"))
crudeCorp <- tm_map(crudeCorp, removeWords, myStopwords)
crudeCorp <- tm_map(crudeCorp, removeNumbers)
# normalize a few word variants to a common form
crudeCorp <- tm_map(crudeCorp, content_transformer(function(x)
gsub(x, pattern = "bought", replacement = "buy")))
crudeCorp <- tm_map(crudeCorp, content_transformer(function(x)
gsub(x, pattern = "broke", replacement = "break")))
crudeCorp <- tm_map(crudeCorp, content_transformer(function(x)
gsub(x, pattern = "products", replacement = "product")))
crudeCorp <- tm_map(crudeCorp, content_transformer(function(x)
gsub(x, pattern = "made", replacement = "make")))
crudeCorp <- tm_map(crudeCorp, stemDocument)
library(reshape)
library(ScottKnott)
library(lda)
### Faster Way of doing LDA
corpusLDA <- lexicalize(crudeCorp)
## K: number of topics; vocab = corpusLDA$vocab (the vocabulary)
ldaModel <- lda.collapsed.gibbs.sampler(corpusLDA$documents, K = 7,
    vocab = corpusLDA$vocab, burnin = 9999, num.iterations = 1000,
    alpha = 1, eta = 0.1)
top.words <- top.topic.words(ldaModel$topics, 10, by.score=TRUE)
print(top.words)
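Note that the pipeline above never strips punctuation, and reviews that become empty after cleaning are still passed to lexicalize(), which is most likely why tokens such as ")." , "=", "26.6564810276031," and "character(0)," survive into the topic output. A minimal sketch of the two missing steps, using base R only so it is self-contained (in tm the punctuation step would be tm_map(crudeCorp, removePunctuation); the variable names here are just for the example):

```r
# Example review lines, including one that becomes empty after cleaning
lines <- c("Most delicious ramen I have ever had. Spicy and tasty.",
           "26.6564810276031, )... =",
           "Really good ramen!")

# Strip punctuation and digits (tm equivalents: removePunctuation, removeNumbers)
cleaned <- gsub("[[:punct:][:digit:]]+", " ", lines)

# Collapse runs of whitespace and trim (tm equivalent: stripWhitespace)
cleaned <- trimws(gsub("\\s+", " ", cleaned))

# Drop documents that are now empty, so lexicalize() never sees an
# empty document and "character(0)" cannot appear in the topics
cleaned <- cleaned[nzchar(cleaned)]
```

After these steps you could rebuild the corpus with VCorpus(VectorSource(cleaned)) and rerun the rest of the pipeline; the empty-document filter is the part that prevents the "" and "character(0)" entries.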
I am a beginner and did not fully understand your answer. I removed the abbreviations and numbers before creating the model. Do you have to do it after building it? I would like to give more details. – yome