First, let's define what polysemy is.
Polysemy: The coexistence of many possible meanings for a word or phrase.
(Source: https://www.google.com/search?q=polysemy)
From WordNet:
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
And in WordNet there are several terms we should be familiar with:
Synset: a distinct concept/meaning
Lemma: a root form of a word
Part-Of-Speech (POS): the linguistic category of a word
Word: a surface form of a word (surface words are not in WordNet)
(Note: @alexis has a good answer on lemma vs synset: https://stackoverflow.com/a/42050466/610569; see also https://stackoverflow.com/a/23715743/610569 and https://stackoverflow.com/a/29478711/610569)
In code:
from nltk.corpus import wordnet as wn

# Given a word "run"
word = 'run'

# We get multiple meanings (i.e. synsets) for
# the word in WordNet.
for synset in wn.synsets(word):
    # Each synset comes with an ID.
    offset = str(synset.offset()).zfill(8)
    # Each meaning comes with its
    # linguistic category (i.e. POS)
    pos = synset.pos()
    # Usually, offset + POS is the way
    # to index a synset.
    idx = offset + '-' + pos
    # Each meaning also comes with its
    # distinct definition.
    definition = synset.definition()
    # For each meaning, there are multiple
    # root words (i.e. lemmas)
    lemmas = ','.join(synset.lemma_names())
    print('\t'.join([idx, definition, lemmas]))
[out]:
00189565-n a score in baseball made by a runner touching all four bases safely run,tally
00791078-n the act of testing something test,trial,run
07460104-n a race run on foot footrace,foot_race,run
00309011-n a short trip run
01926311-v move fast by using one's feet, with one foot off the ground at any given time run
02075049-v flee; take to one's heels; cut and run scat,run,scarper,turn_tail,lam,run_away,hightail_it,bunk,head_for_the_hills,take_to_the_woods,escape,fly_the_coop,break_away
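Since the offset + POS key comes up again whenever synsets are indexed, here is a minimal sketch of building and parsing that key without touching WordNet at all. The helper names `format_synset_id` and `parse_synset_id` are hypothetical (they are not part of NLTK); the example offset is taken from the first line of the output above.

```python
def format_synset_id(offset, pos):
    """Build an 'offset-pos' key, zero-padding the offset to 8 digits."""
    return f"{offset:08d}-{pos}"


def parse_synset_id(synset_id):
    """Split an 'offset-pos' key back into (offset, pos)."""
    offset, pos = synset_id.split('-')
    return int(offset), pos


# Offset and POS of the first 'run' synset in the output above.
example_id = format_synset_id(189565, 'n')
print(example_id)
print(parse_synset_id(example_id))
```

As far as I know, recent NLTK versions can map such a pair back to a synset with `wn.synset_from_pos_and_offset(pos, offset)`, but check that against the NLTK version you have installed.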
Going back to the question, how do we "compute the average polysemy of nouns, verbs, adjectives and adverbs according to WordNet"?
Since we're working with WordNet, surface words are out of the picture and we're left with only lemmas. First, we have to determine which lemmas belong to the nouns, verbs, adjectives and adverbs.
from nltk.corpus import wordnet as wn
from collections import defaultdict

words_by_pos = defaultdict(set)

for synset in wn.all_synsets():
    pos = synset.pos()
    # Store lemma names (plain strings) so that they can be
    # passed to wn.synsets() later on.
    for lemma_name in synset.lemma_names():
        words_by_pos[pos].add(lemma_name)
But this is a simplistic view of the relationship between lemmas and POS:
# There are 5 POS.
>>> words_by_pos.keys()
dict_keys(['a', 's', 'r', 'n', 'v'])
# Some words have multiple POS tags =(
>>> len(words_by_pos['n'])
119034
>>> len(words_by_pos['v'])
11531
>>> len(words_by_pos['n'].intersection(words_by_pos['v']))
4062
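The check above only compares nouns with verbs. A minimal sketch of the same intersection for every pair of POS categories follows, assuming `words_by_pos` maps each POS tag to a set of lemma-name strings. The helper `pos_overlaps` and the toy data are made up for illustration; swap in the real `words_by_pos` to get the WordNet figures.

```python
from itertools import combinations


def pos_overlaps(words_by_pos):
    """Count the lemma names shared by each pair of POS categories."""
    overlaps = {}
    for pos_a, pos_b in combinations(sorted(words_by_pos), 2):
        shared = words_by_pos[pos_a] & words_by_pos[pos_b]
        overlaps[(pos_a, pos_b)] = len(shared)
    return overlaps


# Toy stand-in for the real `words_by_pos` built from WordNet.
toy_words_by_pos = {
    'n': {'run', 'tally', 'test'},
    'v': {'run', 'test', 'scat'},
    'r': {'fast'},
}

toy_overlaps = pos_overlaps(toy_words_by_pos)
for pair, count in toy_overlaps.items():
    print(pair, count)
```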
Let's see whether we can ignore that and move on:
# Let's look at the verb 'v' category
num_meanings_per_verb = []

for word in words_by_pos['v']:
    # No. of meanings for a word given a POS.
    num_meaning = len(wn.synsets(word, pos='v'))
    num_meanings_per_verb.append(num_meaning)

print(sum(num_meanings_per_verb) / len(num_meanings_per_verb))
[out]:
2.1866273523545225
What does the number mean? (If it means anything at all.)
It means that
- across every verb in WordNet,
- there is an average of about 2 meanings;
- ignoring the fact that some words have more meanings in other POS categories
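The same average can be computed for every POS category in one pass. Below is a minimal sketch, written so that it runs without WordNet: `average_polysemy` is a hypothetical helper that takes any function returning the number of meanings of a word for a given POS, and the toy lexicon is invented for illustration.

```python
def average_polysemy(words_by_pos, count_senses):
    """Return {pos: mean number of meanings per word} for each non-empty POS."""
    averages = {}
    for pos, words in words_by_pos.items():
        if not words:
            continue  # avoid dividing by zero for an empty category
        total = sum(count_senses(word, pos) for word in words)
        averages[pos] = total / len(words)
    return averages


# Toy lexicon: (word, pos) -> number of meanings. Not real WordNet counts.
toy_senses = {('run', 'n'): 4, ('tally', 'n'): 1,
              ('run', 'v'): 2, ('scat', 'v'): 1}
toy_words_by_pos = {'n': {'run', 'tally'}, 'v': {'run', 'scat'}}

toy_averages = average_polysemy(
    toy_words_by_pos,
    lambda word, pos: toy_senses[(word, pos)],
)
print(toy_averages)

# With NLTK (assumption: `wn` and `words_by_pos` are set up as above):
# average_polysemy(words_by_pos, lambda w, p: len(wn.synsets(w, pos=p)))
```

Passing the counting function in as a parameter keeps the averaging logic separate from WordNet, which makes it easy to check on small hand-made data first.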
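Perhaps there is some meaning to it, but look at the counts of the values in num_meanings_per_verb (note that the counts below sum to 119034, the size of words_by_pos['n'] above, so they appear to come from the noun category rather than the verbs):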
Counter({1: 101168,
         2: 11136,
         3: 3384,
         4: 1398,
         5: 747,
         6: 393,
         7: 265,
         8: 139,
         9: 122,
         10: 85,
         11: 74,
         12: 39,
         13: 29,
         14: 10,
         15: 19,
         16: 10,
         17: 6,
         18: 2,
         20: 5,
         26: 1,
         30: 1,
         33: 1})
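To put numbers on how skewed this distribution is, here is a minimal sketch that summarizes the histogram above. The counts are copied verbatim from that output; the variable names (`sense_histogram`, `mean_polysemy`, and so on) are mine and not part of NLTK or WordNet.

```python
from collections import Counter

# Number of meanings -> number of words, copied from the output above.
sense_histogram = Counter({
    1: 101168, 2: 11136, 3: 3384, 4: 1398, 5: 747, 6: 393, 7: 265,
    8: 139, 9: 122, 10: 85, 11: 74, 12: 39, 13: 29, 14: 10, 15: 19,
    16: 10, 17: 6, 18: 2, 20: 5, 26: 1, 30: 1, 33: 1,
})

total_words = sum(sense_histogram.values())
total_senses = sum(n * count for n, count in sense_histogram.items())
monosemous = sense_histogram[1]  # words with exactly one meaning

mean_polysemy = total_senses / total_words
monosemous_share = monosemous / total_words
# Average over the words that have at least two meanings.
mean_excluding_monosemous = (total_senses - monosemous) / (total_words - monosemous)

print(f"words: {total_words}, senses: {total_senses}")
print(f"mean polysemy: {mean_polysemy:.4f}")
print(f"share of monosemous words: {monosemous_share:.4f}")
print(f"mean polysemy excluding monosemous words: {mean_excluding_monosemous:.4f}")
```

About 85% of the entries have a single meaning, so a plain average is dominated by monosemous words; reporting the average with and without them gives a clearer picture.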
Please show the full traceback. – BrenBarn
Looks like 'synset.lemma_names' should be 'synset.lemma_names()'? – BrenBarn
I adjusted that but I still get the same error. – Anna