Yesterday I implemented forward sequential feature selection into the SVM code with the goal of automatically picking the best features to include. However, we haven’t chosen the correct model parameters (e.g., the kernel, , and ) to use and as a result the code fails to learn.

In Fabiana’s paper they suggest starting with a grid search to find the best parameters. The grid search workflow is as follows:

center

Grid search is implemented in Scikit-learn via GridSearchCV which considers all possible parameter combinations. We provide the class from a grid of parameters specified by param_grid.

param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

To perform a grid-search we need:

  • an estimator such as SVM
  • a parameter space
  • a search method
  • a cross-validation schem
  • a score function such as MCC

It is unwise to learn the parameters and test on the same data set. This often results in overfitting where the model works great on the original data, but fails on unseen data. A common approach in supervised learning is to split the data set into a training set and a test set. Scikit-learn has a method called train_test_split to do just that.

When finding parameters, there is still a risk of overfitting. One approach to solve this problem is to split the data into a third “validation set”. But this greatly reduces the total number of samples we are training on. Instead we can use cross-validation (CV).

A -fold CV splits the training set into smaller sets. The model is then trained on of the sets and validated on the remaining set. We then measure the performance of the model by taking the average. This method is slow but uses a larger portion of the data set.

center

We can now perform the grid search.

scorer = make_scorer(matthews_corrcoef)
clf_svm = svm.SVC()
clf_gs = GridSearchCV(
    estimator=clf_svm, 
    param_grid=param_grid, 
    scoring=scorer, 
    refit=True,
    cv=KFOLD_N_SPLITS
)
clf_gs.fit(X_train, y_train)