2009-07-28 7 views

Answer

20

To do this, you need to write your own analyzer class. It's relatively simple. Here is the one I use. It combines stop word filtering, Porter stemming, and (this may be more than you need) stripping accents from characters.

using System.Collections;
using Lucene.Net.Analysis;

/// <summary> 
/// An analyzer that chains several filters: Porter stemming, 
/// diacritic stripping, and stop word filtering. 
/// </summary> 
public class CustomAnalyzer : Analyzer 
{ 
    /// <summary> 
    /// A rather short list of stop words that is fine for basic search use. 
    /// </summary> 
    private static readonly string[] stopWords = new[] 
    { 
     "0", "1", "2", "3", "4", "5", "6", "7", "8", 
     "9", "000", "$", "£", 
     "about", "after", "all", "also", "an", "and", 
     "another", "any", "are", "as", "at", "be", 
     "because", "been", "before", "being", "between", 
     "both", "but", "by", "came", "can", "come", 
     "could", "did", "do", "does", "each", "else", 
     "for", "from", "get", "got", "has", "had", 
     "he", "have", "her", "here", "him", "himself", 
     "his", "how","if", "in", "into", "is", "it", 
     "its", "just", "like", "make", "many", "me", 
     "might", "more", "most", "much", "must", "my", 
     "never", "now", "of", "on", "only", "or", 
     "other", "our", "out", "over", "re", "said", 
     "same", "see", "should", "since", "so", "some", 
     "still", "such", "take", "than", "that", "the", 
     "their", "them", "then", "there", "these", 
     "they", "this", "those", "through", "to", "too", 
     "under", "up", "use", "very", "want", "was", 
     "way", "we", "well", "were", "what", "when", 
     "where", "which", "while", "who", "will", 
     "with", "would", "you", "your", 
     "a", "b", "c", "d", "e", "f", "g", "h", "i", 
     "j", "k", "l", "m", "n", "o", "p", "q", "r", 
     "s", "t", "u", "v", "w", "x", "y", "z" 
    }; 

    private Hashtable stopTable; 

    /// <summary> 
    /// Creates an analyzer with the default stop word list. 
    /// </summary> 
    public CustomAnalyzer() : this(stopWords) {} 

    /// <summary> 
    /// Creates an analyzer with the passed in stop words list. 
    /// </summary> 
    public CustomAnalyzer(string[] stopWords) 
    { 
     stopTable = StopFilter.MakeStopSet(stopWords);  
    } 

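    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader) 
    { 
     // Chain: lowercase letter tokenizer -> stop filter -> accent stripping -> Porter stemmer. 
     // Use the instance's stopTable so a custom stop word list passed to the constructor is honoured. 
     return new PorterStemFilter(
         new ISOLatin1AccentFilter(
             new StopFilter(new LowerCaseTokenizer(reader), stopTable))); 
    } 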
} 
+1

Thanks, I'll try that. – devson

+1

+1 thanks Jack, just what I was looking for. If I could, I'd mark this as the answer! – andy

+0

I used your example, but I get no results when querying for a number such as '4656' (the standard analyzer works). I replaced the stop words with the built-in `StopAnalyzer.ENGLISH_STOP_WORDS`, which does not include digits. Any idea what is going on here? – Myster

7

You can use Snowball or PorterStemFilter. See the Java Analyzer documentation for guidance on combining different filters, tokenizers, and analyzers. Note that you must use the same analyzer for indexing and for retrieval, so stemming has to be set up at indexing time.
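To illustrate the point about using the same analyzer on both sides, here is a minimal sketch. It assumes a Lucene.Net 2.x-era API (the same generation as `ISOLatin1AccentFilter`); several of these constructors and `Hits` were deprecated or removed in later releases. `StemmingAnalyzer`, `StemmingDemo`, and the field name `body` are hypothetical names made up for the example.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical minimal analyzer: lowercase letter tokenizer followed by a Porter stemmer.
public class StemmingAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}

public static class StemmingDemo
{
    public static void Main()
    {
        // One analyzer instance, shared by the index writer and the query parser.
        Analyzer analyzer = new StemmingAnalyzer();
        RAMDirectory directory = new RAMDirectory();

        // Index time: terms are stemmed before being written to the index.
        // (Assumption: 2.x constructor; Field.Index.TOKENIZED was later renamed ANALYZED.)
        IndexWriter writer = new IndexWriter(directory, analyzer, true);
        Document doc = new Document();
        doc.Add(new Field("body", "She runs every morning",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        // Query time: the parser runs the query text through the same analyzer,
        // so the query term is reduced to a stem in the same way as the indexed text.
        QueryParser parser = new QueryParser("body", analyzer);
        Query query = parser.Parse("running");

        IndexSearcher searcher = new IndexSearcher(directory);
        Hits hits = searcher.Search(query);
        Console.WriteLine("Matches: " + hits.Length());
        searcher.Close();
    }
}
```

If the query were parsed with a non-stemming analyzer instead, the unstemmed query term would be compared against stemmed index terms and would likely miss, which is why the stemming decision has to be made when the index is built.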

+0

Thanks, I'll try that. – devson
