The following are notes on the LIBSVM (https://github.com/cjlin1/libsvm) library.


The entire library is a single header and a single source file.

An SVM model is represented by the struct svm_model. Models can be saved to and loaded from disk via svm_save_model and svm_load_model. A model is created via svm_train:

struct svm_model *svm_train(const struct svm_problem *prob,
					const struct svm_parameter *param);

svm_train takes two arguments: a pointer to an svm_problem struct and a pointer to an svm_parameter struct. The svm_problem struct represents the training data set.

struct svm_problem {
	int l;
	double *y;
	struct svm_node **x;
};
 
struct svm_node {
	int index;
	double value;
};

Here l is the number of training examples, y holds the target values (i.e., the class label of each data point), and x holds the data itself. Each row of x is an array of svm_node entries; an index of -1 marks the end of a row's data.

For example, consider the following data set:

Class | Attr1 | Attr2 | Attr3 | Attr4 | Attr5
------+-------+-------+-------+-------+------
  1   |   0   |  0.1  |  0.2  |   0   |   0
  2   |   0   |  0.1  |  0.3  | -1.2  |   0
  1   |  0.4  |   0   |   0   |   0   |   0
  2   |   0   |  0.1  |   0   |  1.4  |  0.5
  3   | -0.1  | -0.2  |  0.1  |  1.1  |  0.1

Then the svm_problem struct is as follows:

{
	"l": 5,
	"y": [1,2,1,2,3],
	"x": [
		(2,0.1), (3,0.2), (-1,?),
		(2,0.1), (3,0.3), (4,-1.2), (-1,?),
		(1,0.4), (-1,?),
		(2,0.1), (4,1.4), (5,0.5), (-1,?),
		(1,-0.1), (2,-0.2), (3,0.1), (4,1.1), (5,0.1), (-1,?)
	]
}

Note that only the non-zero values need to be stored.

Note: The indices must be in ASCENDING order.

Warning: The svm_model contains a pointer to svm_problem. Do not free svm_problem until done with svm_model.

The svm_parameter struct is used to configure the SVM model parameters.

struct svm_parameter {
	int svm_type;
	int kernel_type;
	int degree;	/* for poly */
	double gamma;	/* for poly/rbf/sigmoid */
	double coef0;	/* for poly/sigmoid */
 
	/* these are for training only */
	double cache_size; /* in MB */
	double eps;	/* stopping criteria */
	double C;	/* for C_SVC, EPSILON_SVR, and NU_SVR */
	int nr_weight;		/* for C_SVC */
	int *weight_label;	/* for C_SVC */
	double* weight;		/* for C_SVC */
	double nu;	/* for NU_SVC, ONE_CLASS, and NU_SVR */
	double p;	/* for EPSILON_SVR */
	int shrinking;	/* use the shrinking heuristics */
	int probability; /* do probability estimates */
};

The svm_type is one of C_SVC, NU_SVC, ONE_CLASS, EPSILON_SVR, or NU_SVR. To match my code from scikit-learn I should use C_SVC. The authors suggest a stopping tolerance (eps) of 0.001. One can check that the parameters are set up correctly with svm_check_parameter(), which returns NULL if everything is okay; otherwise it returns a const char * describing the error.

const char *svm_check_parameter(const struct svm_problem *prob, 
	const struct svm_parameter *param);

Once an svm_model has been trained, one can call svm_predict to classify a test vector; it returns the predicted class. svm_predict takes two arguments: a pointer to the svm_model obtained via training and an array of svm_node representing the test vector (terminated, as above, by index -1).

double svm_predict(const struct svm_model *model, const struct svm_node *x);
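Putting the pieces together, a minimal train-and-predict flow looks roughly like the sketch below. It is not a complete program: it assumes prob and param have already been populated as described above and that svm.h and the library are available.

```c
#include "svm.h"

/* Sketch: assumes `prob` (svm_problem) and `param` (svm_parameter)
 * have already been populated and checked with svm_check_parameter. */
void train_and_predict(struct svm_problem *prob, struct svm_parameter *param)
{
	struct svm_model *model = svm_train(prob, param);

	/* Test vector in the same sparse format: Attr2 = 0.1, Attr3 = 0.2. */
	struct svm_node test[] = { {2, 0.1}, {3, 0.2}, {-1, 0.0} };
	double predicted_class = svm_predict(model, test);
	(void)predicted_class;

	svm_free_and_destroy_model(&model);
	/* Only free prob after the model is destroyed (see the warning above). */
}
```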

LIBSVM supports cross-validation out of the box. It splits the data into nr_fold folds; each fold in turn is validated with a model trained on the remaining folds. The labels predicted during validation are stored in target.

void svm_cross_validation(const struct svm_problem *prob,
	const struct svm_parameter *param, int nr_fold, double *target);

A quick test of LIBSVM vs scikit-learn resulted in LIBSVM being approximately 530 times faster (for a run this short, much of the Python time is interpreter and import startup):

▶ hyperfine './test' 'python scikit.py' -N
Benchmark 1: ./test
  Time (mean ± σ):       1.2 ms ±   0.1 ms    [User: 0.9 ms, System: 0.2 ms]
  Range (min … max):     1.1 ms …   1.5 ms    2301 runs
 
Benchmark 2: python scikit.py
  Time (mean ± σ):     644.1 ms ±  14.7 ms    [User: 1409.4 ms, System: 66.7 ms]
  Range (min … max):   622.1 ms … 667.4 ms    10 runs
 
Summary
  ./test ran
  530.01 ± 25.38 times faster than python scikit.py