Saving the best params from GridSearch in a pandas DataFrame

I needed to save all the parameter combinations and their corresponding accuracies in some kind of pandas DataFrame.

I hope I am being clear; please point it out if I am making a mistake.

Sample code:

from sklearn.grid_search import GridSearchCV  # deprecated module; kept because grid_scores_ below comes from it
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets

# load the iris dataset (its definition was missing from the original snippet)
iris = datasets.load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt',
                             n_estimators=50, oob_score=True)

param_grid = {
    'n_estimators': [200, 700],
    'max_features': ['auto', 'sqrt', 'log2'],
    'criterion': ['gini', 'entropy']
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(X_train, y_train)
CV_rfc.grid_scores_

I am using grid search CV in sklearn to get the best parameters. My concern is: is there a way to store all the different parameter combinations and their corresponding accuracies in a pandas DataFrame, which I can then save to a CSV file for later use? The grid_scores_ attribute looks like this:

[mean: 0.94286, std: 0.05344, params: {'criterion': 'gini', 'max_features': 'auto', 'n_estimators': 200}, 
mean: 0.94286, std: 0.05344, params: {'criterion': 'gini', 'max_features': 'auto', 'n_estimators': 700}, 
mean: 0.94286, std: 0.05344, params: {'criterion': 'gini', 'max_features': 'sqrt', 'n_estimators': 200}, 
mean: 0.94286, std: 0.05344, params: {'criterion': 'gini', 'max_features': 'sqrt', 'n_estimators': 700}, 
mean: 0.94286, std: 0.05344, params: {'criterion': 'gini', 'max_features': 'log2', 'n_estimators': 200}, 
mean: 0.94286, std: 0.05344, params: {'criterion': 'gini', 'max_features': 'log2', 'n_estimators': 700}, 
mean: 0.94286, std: 0.05344, params: {'criterion': 'entropy', 'max_features': 'auto', 'n_estimators': 200}, 
mean: 0.94286, std: 0.05344, params: {'criterion': 'entropy', 'max_features': 'auto', 'n_estimators': 700}, 
mean: 0.94286, std: 0.05344, params: {'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 200}, 
mean: 0.94286, std: 0.05344, params: {'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 700}, 
mean: 0.94286, std: 0.05344, params: {'criterion': 'entropy', 'max_features': 'log2', 'n_estimators': 200}, 
mean: 0.94286, std: 0.05344, params: {'criterion': 'entropy', 'max_features': 'log2', 'n_estimators': 700}] 

So I have a list of these values, and I want to build a DataFrame from it that I can save to a CSV file.

len(CV_rfc.grid_scores_) 
12 
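
In other words, the goal is roughly the following minimal sketch (my assumption, based on the deprecated sklearn.grid_search API, where each grid_scores_ entry exposes parameters, mean_validation_score and cv_validation_scores; the file name grid_scores.csv is arbitrary):

import pandas as pd

# flatten each (parameters, mean score, per-fold scores) entry into one row
rows = [{**s.parameters,
         'mean_score': s.mean_validation_score,
         'std_score': s.cv_validation_scores.std()}
        for s in CV_rfc.grid_scores_]
scores_df = pd.DataFrame(rows)
scores_df.to_csv('grid_scores.csv', index=False)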

Answer


I found this on the internet; the code was for Python 2, but I fixed it to run on Python 3.

Here is what I found.

import pandas as pd
from sklearn.grid_search import GridSearchCV
import numpy as np

class EstimatorSelectionHelper:
    def __init__(self, models, params):
        # every model must have a matching parameter grid
        if not set(models.keys()).issubset(set(params.keys())):
            missing_params = list(set(models.keys()) - set(params.keys()))
            raise ValueError("Some estimators are missing parameters: %s" % missing_params)
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv=3, n_jobs=1, verbose=1, scoring=None, refit=False):
        # run one GridSearchCV per model and keep the fitted searches
        for key in self.keys:
            print("Running GridSearchCV for %s." % key)
            model = self.models[key]
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring, refit=refit)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        # build one row per parameter combination, summarising the per-fold scores
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})  # dict merge: the Python 3 fix

        rows = [row(k, gsc.cv_validation_scores, gsc.parameters)
                for k in self.keys
                for gsc in self.grid_searches[k].grid_scores_]
        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        # put the summary columns first, then the parameter columns
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns]

from sklearn import datasets

iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.svm import SVC

models = {'RandomForestClassifier': RandomForestClassifier()}

params = {'RandomForestClassifier': {'n_estimators': [16, 32],
                                     'max_features': ['auto', 'sqrt', 'log2'],
                                     'criterion': ['gini', 'entropy']}}

helper = EstimatorSelectionHelper(models, params)
helper.fit(X_iris, y_iris)

helper.score_summary()

OUTPUT:

Running GridSearchCV for RandomForestClassifier. 
Fitting 3 folds for each of 12 candidates, totalling 36 fits 
[Parallel(n_jobs=1)]: Done 36 out of 36 | elapsed: 1.7s finished 
Out[31]: 
                 estimator min_score mean_score max_score std_score criterion max_features n_estimators
1   RandomForestClassifier  0.921569    0.96732         1 0.0333269      gini         auto           32
6   RandomForestClassifier  0.921569    0.96732         1 0.0333269   entropy         auto           16
10  RandomForestClassifier  0.941176   0.966912  0.980392 0.0182045   entropy         log2           16
2   RandomForestClassifier  0.901961   0.960784         1 0.0423578      gini         sqrt           16
4   RandomForestClassifier  0.921569   0.960376  0.980392 0.0274454      gini         log2           16
7   RandomForestClassifier  0.921569   0.960376  0.980392 0.0274454   entropy         auto           32
8   RandomForestClassifier  0.921569   0.960376  0.980392 0.0274454   entropy         sqrt           16
9   RandomForestClassifier  0.921569   0.960376  0.980392 0.0274454   entropy         sqrt           32
3   RandomForestClassifier  0.941176   0.959967  0.980392 0.0160514      gini         sqrt           32
0   RandomForestClassifier  0.901961    0.95384  0.980392 0.0366875      gini         auto           16
11  RandomForestClassifier  0.901961    0.95384  0.980392 0.0366875   entropy         log2           32
5   RandomForestClassifier  0.921569   0.953431  0.980392 0.0242635      gini         log2           32