I have heard people say that you can adjust the threshold to tune the trade-off between precision and recall, but I can't find a concrete example of how to do it. How do I change the threshold for precision and recall in Python scikit-learn?
My code:
for i in mass[k]:
    df = df_temp  # reset df before each loop
    #$$
    #$$
    if 1 == 1:
    ###if i == singleEthnic:
        count += 1
        ethnicity_tar = str(i)  # fr, en, ir, sc, others, ab, rus, ch, it, jp
        # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay

        ############################################
        ############################################

        def ethnicity_target(row):
            try:
                if row[ethnicity_var] == ethnicity_tar:
                    return 1
                else:
                    return 0
            except:
                return None

        df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
        print '1=', ethnicity_tar
        print '0=', 'non-' + ethnicity_tar

        # Randomly sample a smaller dataframe for debugging
        rows = df.sample(n=subsample_size, random_state=seed)  # seed gives fixed randomness
        df = DataFrame(rows)
        print 'Class count:'
        print df['ethnicity_scan'].value_counts()

        # Assign X and y variables
        X = df.raw_name.values
        X2 = df.name.values
        X3 = df.gender.values
        X4 = df.location.values
        y = df.ethnicity_scan.values

        # Feature extraction functions
        def feature_full_name(nameString):
            try:
                full_name = nameString
                if len(full_name) > 1:  # reject names with only 1 character
                    return full_name
                else:
                    return '?'
            except:
                return '?'

        def feature_full_last_name(nameString):
            try:
                last_name = nameString.rsplit(None, 1)[-1]
                if len(last_name) > 1:  # reject names with only 1 character
                    return last_name
                else:
                    return '?'
            except:
                return '?'

        def feature_full_first_name(nameString):
            try:
                first_name = nameString.rsplit(' ', 1)[0]
                if len(first_name) > 1:  # reject names with only 1 character
                    return first_name
                else:
                    return '?'
            except:
                return '?'

        # Transform the X variables and produce a numpy array of all features
        my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
        my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]

        all_dict = []
        for i in range(0, len(my_dict)):
            temp_dict = dict(
                my_dict[i].items() + my_dict5[i].items()
            )
            all_dict.append(temp_dict)

        newX = dv.fit_transform(all_dict)

        # Split into training and testing data sets
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

        # Fit the model on the training data
        classifierUsed2.fit(X_train, y_train)

        # Make predictions with the trained model
        y_train_predictions = classifierUsed2.predict(X_train)
        y_test_predictions = classifierUsed2.predict(X_test)
I tried replacing the line "y_test_predictions = classifierUsed2.predict(X_test)" with "y_test_predictions = classifierUsed2.predict(X_test) > 0.8" and with "y_test_predictions = classifierUsed2.predict(X_test) > 0.01", but nothing changes significantly.
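To isolate the problem, here is a minimal standalone version of what I tried. LogisticRegression and the toy data here are stand-ins of my own choosing for my classifierUsed2 and name features, which are not shown above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in data for the extracted name features and binary labels
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

preds = clf.predict(X)            # hard 0/1 class labels
over_08 = clf.predict(X) > 0.8    # compares the labels themselves, not probabilities
over_001 = clf.predict(X) > 0.01  # any cutoff strictly between 0 and 1 gives the same mask
```

Both comparisons produce exactly the same boolean array (True wherever predict returned 1), which matches what I observe: the results are identical for 0.8 and 0.01.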
Thanks DoughnutZombie, could you tell me how to highlight text in grey? – KubiK888
To mark inline code, use the backtick character (`) at the beginning and at the end. See also http://stackoverflow.com/editing-help, e.g. "comment formatting" at the very bottom. –
Regarding your question: which classifier are you using? Does the classifier have `predict_proba` in addition to `predict`? `predict` usually outputs only 1s and 0s, whereas `predict_proba` outputs a float that you can threshold. –