Whoosh returning empty values

I'm using Whoosh to index and search a variety of texts in various encodings. However, when I run a search over my indexed files, some of the matching results don't appear in the output produced with the "highlight" function. I suspect this is related to encoding errors, but I can't figure out what could be preventing all of the results from being displayed. I would be very grateful for any light others can shed on this mystery.
Here is the script I use to create my index, and here are the files I am indexing:
from whoosh.index import create_in
from whoosh.fields import *
import glob, os, chardet

encodings = ['utf-8', 'ISO-8859-2', 'windows-1250', 'windows-1252', 'latin1', 'ascii']

def determine_string_encoding(string):
    result = chardet.detect(string)
    string_encoding = result['encoding']
    return string_encoding

#specify a list of paths that contain all of the texts we wish to index
#(raw strings keep backslashes from being parsed as escape sequences)
text_dirs = [
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\hume",
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\complete_pope\clean"
]

#establish the schema to be used when storing texts; storing content allows us
#to retrieve highlighted extracts from texts in which matches occur
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))

#check to see if we already have an index directory. If we don't, make it
if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)

#create the writer object we'll use to write each document in text_dirs to the index
writer = ix.writer()

#create a file in which we can log the encoding of each file to disk for review
with open("encodings_log.txt", "w") as encodings_out:
    #for each directory in our list
    for i in text_dirs:
        #for each text file in that directory (j is now the path to the current
        #file within the current directory)
        for j in glob.glob(i + "\\*.txt"):
            #first, let's grab the file's title. If the title is stored in the
            #text file name, we can use this method:
            text_title = j.split("\\")[-1]
            #now let's read the file
            with open(j, "r") as text_file:
                text_content = text_file.read()
            #use the method defined above to determine the encoding of the path
            #and of text_content
            path_encoding = determine_string_encoding(j)
            text_content_encoding = determine_string_encoding(text_content)
            #because we know the encoding of the files in this directory, let's
            #override the detected text_content_encoding value and specify that
            #encoding explicitly
            if "clean" in j:
                text_content_encoding = "iso-8859-1"
            #log the encoding we settled on for this file so we can review it later
            encodings_out.write("%s\t%s\n" % (j, text_content_encoding))
            #decode text_title, path, and text_content to unicode using the
            #encodings we determined for each above
            unicode_text_title = unicode(text_title, path_encoding)
            unicode_text_path = unicode(j, path_encoding)
            unicode_text_content = unicode(text_content, text_content_encoding)
            #use the writer to add the document to the index
            writer.add_document(title=unicode_text_title, path=unicode_text_path,
                                content=unicode_text_content)

#after all documents have been added, commit the changes to the index
writer.commit()
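One thing worth double-checking, since the script builds its paths with backslashes: in a non-raw string literal, sequences like `\t` are parsed as escape characters, so a Windows path can be silently corrupted before `glob` ever sees it. The path fragments below are invented for illustration; this is a minimal, standard-library-only sketch (written for Python 3) of the pitfall, not part of my actual scripts:

```python
# Illustrative paths only: "\t" in a plain literal becomes a TAB character,
# so the string no longer matches the path on disk. A raw string literal
# keeps the backslashes intact.
plain = "C:\temp\texts"    # parsed as "C:<TAB>emp<TAB>exts"
raw = r"C:\temp\texts"     # raw literal: backslashes preserved

assert "\t" in plain       # the plain literal now contains real tab characters
assert "\t" not in raw     # the raw literal does not

print(repr(plain))
print(repr(raw))
```

My actual directory names happen not to contain recognized escape sequences in Python 2, so the scripts above run, but the pattern is fragile; raw strings (or forward slashes) avoid the problem entirely.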
This code seems to index the texts without any problem, but when I use the following script to query the index, I get three empty values in the out.txt output file: the first two lines are empty, and line six is empty, yet I expect all three of those lines to be non-empty. Here is the script I use to query the index:
from whoosh.qparser import QueryParser
from whoosh.qparser import FuzzyTermPlugin
from whoosh.index import open_dir
import codecs

#now that we have an index, we can open it with open_dir
ix = open_dir("index")

with ix.searcher() as searcher:
    parser = QueryParser("content", schema=ix.schema)
    #to enable Levenshtein-based fuzzy matching, add the plugin
    parser.add_plugin(FuzzyTermPlugin())
    #~2/3 means: allow an edit distance of two (insertions, deletions, and
    #substitutions each cost one), but only count matches whose first three
    #letters match the query term. Increasing this denominator greatly
    #increases speed
    query = parser.parse(u"swallow~2/3")
    results = searcher.search(query)
    #see whoosh.query.Phrase, which describes the "slop" parameter (i.e. the
    #number of words that may appear between any two words of the query)
    #write query results to disk
    with codecs.open("out.txt", "w") as out:
        for i in results:
            title = i["title"]
            highlight = i.highlights("content")
            clean_highlight = " ".join(highlight.split())
            out.write(clean_highlight.encode("utf-8") + "\n")
If anyone can suggest reasons why those three lines are empty, I would be eternally grateful.
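For what it's worth, here is a small, standard-library-only sketch (Python 3, independent of Whoosh) of the kind of encoding mismatch I suspect: decoding bytes with the wrong codec either raises an error or silently produces different characters, so the indexed text would no longer match the file on disk. The sample word and codecs are illustrative assumptions, not taken from my actual data:

```python
# Illustrative only: bytes as they might sit on disk in iso-8859-1
# ("très" contains the single byte 0xE8 for the accented character)
raw = u"très".encode("iso-8859-1")

# Correct codec: round-trips cleanly.
assert raw.decode("iso-8859-1") == u"très"

# Wrong codec, loud failure: 0xE8 followed by "s" is not valid UTF-8.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("utf-8 decode failed, as expected")

# Wrong codec, silent failure: iso-8859-2 happily decodes the same bytes
# to a *different* string, which would then be indexed and searched
# without matching the original text.
assert raw.decode("iso-8859-2") != u"très"
```

If chardet misidentifies one of my files this way, the silent case seems like the one that could produce index entries that never line up with the highlighted extracts.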
Is there anything I can add to the description above to help diagnose the situation better? – duhaime