2017-03-18

I have some code that I use for spam classification and it works fine, but whenever I try to stem/lemmatize a word I get this error: ascii codec can't decode byte 0xC2 (Python, NLTK):

Fichier " /Users/Ramit/Desktop/Bayes1/src/filter.py "ligne 16, dans le mot trim_word = ps.stem (mot)

fichier" /Library/Python/2.7/site-packages/nltk/stem /porter.py ", ligne 664, dans la racine stem = self._step1a (racine)

Fichier" /Library/Python/2.7/site-packages/nltk/stem/porter.py ", ligne 289, dans _step1a

if word.endswith('ies') and len(word) == 4: 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128) 

Here is my code:

from word import Word 
from nltk.corpus import stopwords 
from nltk.stem import PorterStemmer 

ps = PorterStemmer() 


class Filter(): 

    def __init__(self): 
        self.words = dict() 

    def trim_word(self, word): 
        # Helper method to trim away some of the non-alphabetic characters 
        # I deliberately do not remove all non-alphabetic characters. 
        word = word.strip(' .:,-!()"?+<>*') 
        word = word.lower() 
        word = ps.stem(word) 
        return word 

    def train(self, train_file): 
        lineNumber = 1 
        ham_words = 0 
        spam_words = 0 
        stop = set(stopwords.words('english')) 

        # Loop through all the lines 
        for line in train_file: 
            if lineNumber % 2 != 0: 
                line = line.split('\t') 
                category = line[0] 
                input_words = line[1].strip().split(' ') 

                # Loop through all the words in the line, remove some characters 
                for input_word in input_words: 
                    input_word = self.trim_word(input_word) 
                    if (input_word != "") and (input_word not in stop): 

                        # Check if word is in dictionary, else add 
                        if input_word in self.words: 
                            word = self.words[input_word] 
                        else: 
                            word = Word(input_word) 
                            self.words[input_word] = word 

                        # Check whether the word is in a ham or spam sentence, increment counters 
                        if category == "ham": 
                            word.increment_ham() 
                            ham_words += 1 
                        elif category == "spam": 
                            word.increment_spam() 
                            spam_words += 1 

                        # Probably bad training file input... 
                        else: 
                            print "Not valid training file format" 

            lineNumber += 1 

        # Compute the probability for each word in the training set 
        for word in self.words: 
            self.words[word].compute_probability(ham_words, spam_words) 

    def get_interesting_words(self, sms): 
        interesting_words = [] 
        stop = set(stopwords.words('english')) 
        # Go through all words in the SMS and append to list. 
        # If we have not seen the word in training, assign probability of 0.4 
        for input_word in sms.split(' '): 
            input_word = self.trim_word(input_word) 
            if (input_word != "") and (input_word not in stop): 
                if input_word in self.words: 
                    word = self.words[input_word] 
                else: 
                    word = Word(input_word) 
                    word.set_probability(0.40) 
                interesting_words.append(word) 

        # Sort the list of interesting words, return top 15 elements if list is longer than 15 
        interesting_words.sort(key=lambda word: word.interesting(), reverse=True) 
        return interesting_words[0:15] 

    def filter(self, input_file, result_file): 
        # Loop through all SMSes and compute total spam probability of the sms-message 
        lineNumber = 0 
        for sms in input_file: 
            lineNumber += 1 
            spam_product = 1.0 
            ham_product = 1.0 
            if lineNumber % 2 != 0: 
                try: 
                    for word in self.get_interesting_words(sms): 
                        spam_product *= word.get_probability() 
                        ham_product *= (1.0 - word.get_probability()) 

                    sms_spam_probability = spam_product/(spam_product + ham_product) 
                except: 
                    result_file.write("error") 

                if sms_spam_probability > 0.8: 
                    result_file.write("SPAM: "+sms) 
                else: 
                    result_file.write("HAM: "+sms) 
            result_file.write("\n") 

I am looking for a solution that would let me lemmatize/stem the words. I have tried looking around the net and found similar problems, but their solutions did not work for me.
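
Judging from the traceback, it seems to come down to the stemmer receiving a raw byte string whose first byte is 0xC2 (e.g. the start of a UTF-8 non-breaking space). A minimal sketch of that situation on Python 2, reconstructed from the traceback rather than my actual data:

from nltk.stem import PorterStemmer 

ps = PorterStemmer() 
print ps.stem(u'caresses')      # unicode input stems fine -> caress 
print ps.stem('\xc2\xa0word')   # byte string starting with 0xc2 -> the UnicodeDecodeError above 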

Suggestions: (1) Convert your tabs to spaces before posting. (2) Create a [minimal example](http://stackoverflow.com/help/mcve). –

Maybe this would help: https://gist.github.com/alvations/07758d02412d928414bb from https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66 – alvas

The problem could be that you're not reading the file correctly. Try 'import io; file_in = io.open('filename.txt', 'r', encoding='utf-8')'. It's a bit hard to tell what's going wrong, but if you could post the data you're trying to process it will be much easier to figure out what went wrong. – alvas
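
Spelled out, that suggestion would look roughly like the sketch below; 'train.txt' is a placeholder file name, and the import path for Filter is assumed from the asker's filter.py:

import io 
from filter import Filter   # the asker's module (assumed import path) 

f = Filter() 
# io.open decodes the file up front, so every line handed to train() 
# is already unicode and ps.stem() never falls back to the ascii codec 
with io.open('train.txt', 'r', encoding='utf-8') as train_file: 
    f.train(train_file) 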

Answer

Use sys:

import sys 
reload(sys) 
sys.setdefaultencoding('utf-8')
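
Note that setdefaultencoding is only available again after reload(sys), and changing the default encoding affects the whole process. A more targeted sketch, assuming the words reaching trim_word are UTF-8 encoded byte strings, is to decode them explicitly before stemming:

# Sketch only: decode incoming byte strings before stemming, 
# assuming the source text is UTF-8 encoded. 
def trim_word(self, word): 
    if isinstance(word, str):                  # Python 2 byte string 
        word = word.decode('utf-8', 'replace') 
    word = word.strip(u' .:,-!()"?+<>*') 
    word = word.lower() 
    word = ps.stem(word) 
    return word 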