Get a list of clusters formed from a dendrogram in Python
I have a list of words on which I ran TF-IDF to get the top 100 words, after which I am supposed to cluster them. So far I can do both tasks (I am sharing the relevant part of the code, an excerpt of the input file, and a screenshot of the output).
My question: I want the list of clusters that are formed in the output dendrogram. How can I get it? The dendrogram function returns an object, which I store in ax, that holds some coordinates and a list of nodes. How can I manipulate these to get the complete list of clusters?
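For reference, a minimal sketch of what SciPy's `dendrogram` hands back. The toy `points` and `labels` below are hypothetical stand-ins, not the data from this question; the returned object is a dictionary describing the plot layout, not the clusters themselves.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, ward

# Hypothetical toy data: six 2-D points forming three tight pairs.
points = np.array([[0.0, 0.0], [0.1, 0.0],
                   [5.0, 5.0], [5.1, 5.0],
                   [10.0, 0.0], [10.1, 0.0]])
labels = ["a", "b", "c", "d", "e", "f"]  # illustrative labels only

linkage_matrix = ward(points)

# no_plot=True skips the drawing and only returns the layout data.
result = dendrogram(linkage_matrix, labels=labels, no_plot=True)

print(type(result))       # the return value is a dict
print(sorted(result))     # its keys (icoord, dcoord, ivl, leaves, ...)
print(result["ivl"])      # leaf labels in the order they are drawn
print(result["leaves"])   # original row indices in that same order
```

The `icoord` and `dcoord` entries are the line coordinates of the drawn branches, while `ivl` and `leaves` give the leaf order. None of them is a cluster assignment, so the clusters are probably easier to read from the linkage matrix than from this dictionary.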
The following excerpt is taken from the input file.
"recommended stories dylan scott stat advertisement kate sheridan dylan scott dylan scott",
"email touting former representative mike fergusons genuine connection",
"email touting former representative mike ferguson \u2019",
"facebook donald trump fda hhs privacy policy",
"president trump appoints dr scott gottlieb",
"trade groups including novartis ag",
"bush alumni coalition supporting trump",
"online presidential transition analysis center",
"tennessee republican representative marsha blackburn",
"nonprofit global health care company",
"paula stannard ,\u201d said ladd wiley",
"bremberg returned calls seeking comment",
"0 \u2026. 0 \u2026 1c",
"2016 w ashington \u2014 let",
"take place ,\u201d said dr",
"\u201c selling baby parts .\u201d",
"health care companies whose boards",
"transition ,\u201d said lisa tofil",
Here is the code I am using:
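import os
import re
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

punctuations = '''!()-[]{};:'\<>./?@#$%^&*_~'''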
n_a =fin_a= ""
for file in os.listdir():
    if (file.endswith(".kwp")):
        with open(file) as f:
            #print(f.read())
            a = f.read()
            a = re.sub(r"\\[a-z0-9A-Z]+","",a)
            a = re.sub(r"\"","",a)
            a = re.sub(r"\,","",a)
            #a = re.sub("\\","",a)
            #print(a)
            for ch in a:
                if (ch not in punctuations):
                    n_a = n_a + ch
            n_a = n_a.lower()
            #print(n_a)
            #new_f = open("n")
            fin_a = fin_a + n_a
tfidf_vectorizer = TfidfVectorizer(max_df=1,stop_words='english',use_idf=True)
tfd_mat = tfidf_vectorizer.fit_transform([n_a])
dense = tfd_mat.todense()
#print(len(dense[0].tolist()[0]))
ep = dense[0].tolist()[0]
phrase_scores = [pair for pair in zip(range(0, len(ep)), ep) if pair[1] > 0]
#print(phrase_scores)
#print(len(phrase_scores))
phrase_scores=sorted(phrase_scores, key=lambda t: t[1] * -1)[:100]
#print(tfd_mat)
fin_term = []
terms = tfidf_vectorizer.get_feature_names()
with open("/home/laitkor/Desktop/New_Paul/kwp_top100.txt","w") as fl:
    for t in range(0,100):
        #print(t)
        key,valu = phrase_scores[t]
        #print(key)
        #print(valu)
        fl.write(terms[key]+'\n')
        fin_term.append(terms[key])
#print(fin_term)
#print(phrase_scores[1:100])
dist = 1 - cosine_similarity(phrase_scores[1:100])
#print(dist)
linkage_matrix = ward(dist)
#print(linkage_matrix)
fig, ax = plt.subplots(figsize=(30, 30)) # set size
ax = dendrogram(linkage_matrix, orientation="right", labels=fin_term);
#print(ax)
#print(leaves)
plt.tick_params(
    axis='x',           # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    bottom=False,       # ticks along the bottom edge are off
    top=False,          # ticks along the top edge are off
    labelbottom=False)  # labels along the bottom edge are off
plt.tight_layout() #show plot with tight layout
#uncomment below to save figure
plt.savefig('kwp.png', dpi=200)
The link below shows the resulting dendrogram:
https://www.screencast.com/t/2MEc3ohBe
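One possible way to get explicit cluster membership, sketched under assumptions: cut the linkage matrix into flat clusters with SciPy's `fcluster` rather than parsing the dendrogram output. The toy `points`, the `labels`, and the choice of 3 clusters are all hypothetical; a distance threshold (`criterion="distance"`) is an alternative to a fixed cluster count.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, ward

# Hypothetical toy data: six 2-D points forming three tight pairs.
points = np.array([[0.0, 0.0], [0.1, 0.0],
                   [5.0, 5.0], [5.1, 5.0],
                   [10.0, 0.0], [10.1, 0.0]])
labels = ["a", "b", "c", "d", "e", "f"]  # illustrative labels only

linkage_matrix = ward(points)

# Cut the tree so that at most 3 flat clusters remain (3 is arbitrary here).
cluster_ids = fcluster(linkage_matrix, t=3, criterion="maxclust")

# Group the labels by the cluster id assigned to each observation.
clusters = defaultdict(list)
for label, cluster_id in zip(labels, cluster_ids):
    clusters[int(cluster_id)].append(label)

for cluster_id, members in sorted(clusters.items()):
    print(cluster_id, members)
```

In the code from the question, `fin_term` would play the role of `labels`, provided it has exactly one entry per clustered row. As posted, the snippet clusters `phrase_scores[1:100]` (99 rows) but passes 100 labels, so the two may need aligning first.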