How do I enable stemming when searching with Lucene.NET?

How can I enable stemming when searching with lucene.net?


2009-07-28 devson

Which analyzer are you using? – Kane

I'm using the standard analyzer. – devson

To do this, you need to write your own analyzer class, which is relatively simple. Here is the one I use. It combines stop word filtering, Porter stemming, and (this may be more than you need) stripping accents from characters.

using System.Collections;
using Lucene.Net.Analysis;

/// <summary>
/// An analyzer that applies several filters, including Porter stemming,
/// diacritic stripping, and stop word filtering.
/// </summary>
public class CustomAnalyzer : Analyzer 
{ 
    /// <summary> 
    /// A rather short list of stop words that is fine for basic search use. 
    /// </summary> 
    private static readonly string[] stopWords = new[] 
    { 
        "0", "1", "2", "3", "4", "5", "6", "7", "8",
        "9", "000", "$", "£",
        "about", "after", "all", "also", "an", "and",
        "another", "any", "are", "as", "at", "be",
        "because", "been", "before", "being", "between",
        "both", "but", "by", "came", "can", "come",
        "could", "did", "do", "does", "each", "else",
        "for", "from", "get", "got", "has", "had",
        "he", "have", "her", "here", "him", "himself",
        "his", "how", "if", "in", "into", "is", "it",
        "its", "just", "like", "make", "many", "me",
        "might", "more", "most", "much", "must", "my",
        "never", "now", "of", "on", "only", "or",
        "other", "our", "out", "over", "re", "said",
        "same", "see", "should", "since", "so", "some",
        "still", "such", "take", "than", "that", "the",
        "their", "them", "then", "there", "these",
        "they", "this", "those", "through", "to", "too",
        "under", "up", "use", "very", "want", "was",
        "way", "we", "well", "were", "what", "when",
        "where", "which", "while", "who", "will",
        "with", "would", "you", "your",
        "a", "b", "c", "d", "e", "f", "g", "h", "i",
        "j", "k", "l", "m", "n", "o", "p", "q", "r",
        "s", "t", "u", "v", "w", "x", "y", "z"
    }; 

    private Hashtable stopTable; 

    /// <summary> 
    /// Creates an analyzer with the default stop word list. 
    /// </summary> 
    public CustomAnalyzer() : this(stopWords) {} 

    /// <summary> 
    /// Creates an analyzer with the passed in stop words list. 
    /// </summary> 
    public CustomAnalyzer(string[] stopWords) 
    { 
        stopTable = StopFilter.MakeStopSet(stopWords);
    } 

    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader) 
    { 
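        // Pass the instance stop set (stopTable), not the static default list,
        // so stop words supplied to the constructor are actually honored.
        // Note: LowerCaseTokenizer splits on non-letters, so purely numeric
        // terms such as "4656" never reach the filters.
        return new PorterStemFilter(new ISOLatin1AccentFilter(new StopFilter(new LowerCaseTokenizer(reader), stopTable)));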
    } 
}


2009-07-28 11:04:48

Thanks, I'll try that. – devson

+1 thanks Jack, just what I was looking for. If I could, I'd mark this as the answer! – andy

I used your example, but I get no results for queries on a number such as '4656' (the standard analyzer works). I replaced the stop words with the built-in `StopAnalyzer.ENGLISH_STOP_WORDS`, which doesn't include digits. Any idea what's going on here? – Myster

You can use Snowball or PorterStemFilter. See the Java Analyzer documentation as a guide to combining different Filters/Tokenizers/Analyzers. Note that you must use the same analyzer for indexing and for retrieval, so stemming has to be set up at indexing time.
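To illustrate the point about using the same analyzer on both sides, here is a minimal sketch. It assumes a Lucene.Net 2.x-era API (matching the date of this thread); later versions changed several of these signatures, so treat the constructor overloads and `Field.Index.TOKENIZED` as assumptions to check against your version. `StemmingAnalyzer` and `StemmingDemo` are hypothetical names made up for this example.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical minimal analyzer for illustration: lower-case, then Porter-stem.
public class StemmingAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}

public static class StemmingDemo
{
    public static void Main()
    {
        // One analyzer instance is shared by indexing and querying.
        Analyzer analyzer = new StemmingAnalyzer();
        RAMDirectory directory = new RAMDirectory();

        // Index time: terms are stored in their stemmed form.
        IndexWriter writer = new IndexWriter(directory, analyzer, true);
        Document doc = new Document();
        doc.Add(new Field("content", "She was running quickly",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        // Query time: the parser runs the query text through the same
        // analyzer, so the query term is stemmed the same way as the index.
        IndexSearcher searcher = new IndexSearcher(directory);
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.Parse("runs");
        Hits hits = searcher.Search(query);
        Console.WriteLine("Matches: " + hits.Length());
        searcher.Close();
    }
}
```

Under Porter stemming both "running" and "runs" reduce to "run", so the query is expected to find the document. If the index had been built with a non-stemming analyzer, the stored term would stay "running" and the stemmed query term would not match it.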


2009-07-28 10:57:26

Thanks, I'll try that. – devson
