Discovering "templates" in a given text?

If I have a lot of text and I am trying to discover the patterns that occur most frequently, I was thinking of solving it with the N-gram approach, and in fact it was suggested as a solution in this slightly different question. Just to clarify, I have text like this:

I wake up every day morning and read the newspaper and then go to work 
I wake up every day morning and eat my breakfast and then go to work 
I am not sure that this is the solution but I will try 
I am not sure that this is the answer but I will try 
I am not feeling well today but I will get the work done and deliver it tomorrow 
I was not feeling well yesterday but I will get the work done and let you know by tomorrow

and I am trying to extract "templates" like this:

I wake up every day morning and ... and then go to work 
I am not sure that this is the ... but I will try 
I ... not feeling well ... but I will get the work done and ... tomorrow

I am looking for an approach that can scale to millions of lines of text, so I was wondering whether I can adapt the same N-gram approach to solve this problem, or are there alternatives?

2011-06-29 Legend

Millions of lines of text is not a very big number :)

What you are looking for is at least similar to collocation discovery. You could try computing pointwise mutual information on n-grams. See Manning & Schütze (1999) for this and other approaches to the problem.
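
As a rough illustration of the pointwise mutual information idea, here is a minimal Python sketch. It is not taken from Manning & Schütze; the function name `bigram_pmi`, the `min_count` threshold, and the whitespace tokenization are all assumptions made for the example. It scores adjacent word pairs in the sample sentences from the question with PMI = log2(P(x, y) / (P(x) P(y))), estimated from raw counts.

```python
import math
from collections import Counter


def bigram_pmi(lines, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper for illustration: probabilities are plain
    relative frequencies, with no smoothing.
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in lines:
        tokens = line.split()  # assumption: naive whitespace tokenization
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # pairs never cross lines

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs seen too rarely to estimate reliably
        p_xy = count / total_bigrams
        p_x = unigrams[w1] / total_unigrams
        p_y = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


lines = [
    "I wake up every day morning and read the newspaper and then go to work",
    "I wake up every day morning and eat my breakfast and then go to work",
    "I am not sure that this is the solution but I will try",
    "I am not sure that this is the answer but I will try",
    "I am not feeling well today but I will get the work done and deliver it tomorrow",
    "I was not feeling well yesterday but I will get the work done and let you know by tomorrow",
]

scores = bigram_pmi(lines)
# Print the five highest-scoring pairs.
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(pair, round(score, 2))
```

Unsmoothed PMI tends to favor rare pairs, which is why the sketch drops anything seen fewer than `min_count` times. Runs of high-scoring adjacent pairs are candidates for the fixed parts of a template, and the positions where the score drops are candidates for the "..." slots.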

2011-06-29 21:24:11

Thank you for the suggestions. I finally got the book today :) – Legend
