To understand how to apply SVD to the Paragonimiasis data, Bala suggested reading Fabiana’s paper: https://arxiv.org/abs/2404.03701. The following are my notes on the paper.
Data for potatoes: emergence, plant vigor, plant uniformity, stems per plant, vine maturity, total yield, yield for different weight categories, attributes pertaining to external characteristics, specific gravity, internal defects, performance.
Breeders decide which potato clones to keep based on performance. Researchers have used machine learning models to obtain accurate predictions of root yield based on vegetative indices of the different growth stages of cassava.
The data set contains 1086 clones (216 unique), each trialed for 1-3 years, with the majority in year 1 of trialing. Three control varieties were used: Ranger Russet, Russet Burbank, and Russet Norkotah. Shepody (early) and Umatilla (late) were measured for only some years. There are 38 attributes for each clone.
Features with over 400 NA ratings were dropped (flower color, vine size, maturity). The bruise data feature was dropped as it mostly contained text descriptions. The following features were added:
- true_keeps: assigned 1 if it passed three trials
- Ctrl ave: average total yield of control clones
- % CA: total yield as a percentage of the control average (Ctrl ave)
- GDD 1-60: the sum of first 60 growing degree days
- GDD 61-90: the sum of growing degree days 61-90
- GDD 91-end: the sum of growing degree days 91 to vine kill date
The last three features pertain to growing degree days (GDD), which is defined by

$$\mathrm{GDD} = \frac{T_{\max} + T_{\min}}{2} - T_{\text{base}},$$

where $T_{\max}$ and $T_{\min}$ are the day’s maximum and minimum air temperatures and $T_{\text{base}}$ is the base temperature below which the process of interest does not progress.
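A minimal sketch of the GDD computation, assuming the usual convention that negative daily values are floored at zero and a hypothetical base temperature of 7 °C (the paper’s exact value may differ):

```python
import numpy as np

def daily_gdd(t_max, t_min, t_base=7.0):
    """GDD for a single day: daily mean temperature minus the base,
    floored at zero (a common convention; the paper may differ)."""
    return max((t_max + t_min) / 2.0 - t_base, 0.0)

# Hypothetical daily highs and lows (°C) over a three-day window.
highs = np.array([18.0, 22.0, 15.0])
lows = np.array([6.0, 10.0, 4.0])

# A feature like GDD 1-60 sums daily_gdd over the corresponding days.
print(sum(daily_gdd(hi, lo) for hi, lo in zip(highs, lows)))
```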
Imputation
Many variables were missing observations; for example, flower color was missing for 638 of the 1086 clones. Dropping these rows would drastically reduce the sample size, so instead we use imputation to fill in the missing data.
We use Scikit-learn’s IterativeImputer with sample_posterior set to True. This function performs multiple imputation by chained equations (MICE). MICE assumes the data is missing at random. It begins by replacing missing data with each variable’s mean value as a placeholder. Then, for each variable in turn, the placeholder means are removed, the variable is treated as the dependent variable, and a regression on the other variables is used to predict its missing values.
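A small sketch of this imputation step, using a made-up numeric slice of the clone data (the column names are mine, not the paper’s):

```python
import numpy as np
import pandas as pd

# IterativeImputer is still marked experimental, so this import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up numeric slice of the clone data with missing entries.
X = pd.DataFrame({
    "total_yield": [420.0, np.nan, 390.0, 455.0, 401.0],
    "specific_gravity": [1.085, 1.092, np.nan, 1.088, 1.079],
    "stems_per_plant": [3.0, 4.0, 2.0, np.nan, 3.0],
})

# sample_posterior=True draws each imputed value from the fitted model's
# predictive distribution (MICE-style) rather than using a point estimate.
imputer = IterativeImputer(sample_posterior=True, random_state=0)
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
print(X_imputed)
```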
Two data sets were constructed from the original: an imputed set and a non-imputed set with 404 observations. The non-imputed set was constructed by dropping rows so that the maximum number of NA entries per column fell below 400.
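My reading of how that row-dropping could look in pandas (a hedged sketch, not necessarily the authors’ exact procedure), with the 400 threshold as a parameter:

```python
import pandas as pd

def build_non_imputed(df: pd.DataFrame, max_na_per_column: int = 400) -> pd.DataFrame:
    """Drop rows until every column has fewer than `max_na_per_column` NAs."""
    out = df.copy()
    while out.isna().sum().max() >= max_na_per_column:
        # Drop the rows that are missing the currently worst column.
        worst_column = out.isna().sum().idxmax()
        out = out[out[worst_column].notna()]
    return out
```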
For categorical data, we use One-Hot Encoding.
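A quick illustration of one-hot encoding with pandas, using a hypothetical categorical feature (the real data’s categories will differ):

```python
import pandas as pd

# Hypothetical categorical feature; the real data's categories will differ.
df = pd.DataFrame({"tuber_shape": ["oblong", "round", "oblong", "long"]})

# One-hot encoding replaces the single categorical column with one
# binary indicator column per observed category.
encoded = pd.get_dummies(df, columns=["tuber_shape"])
print(encoded)
```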
Support Vector Machines (SVM)
SVM identifies the hyperplane that best separates the data set into two classes. Not every data set has a linear decision boundary, but there may be a non-linear one. Since SVM finds a linear boundary, we transform the feature space using a kernel. A kernel is a function that computes the dot product of two observations in the transformed space; crucially, this means we never need to define the transformation to the higher-dimensional space explicitly (the kernel trick).
Two crucial parameters in a non-linear SVM are the scale parameter $\gamma$, which affects the radius of influence of a single observation, and the regularization parameter $C$, which controls how much observations are allowed to lie on the wrong side of the margin. Tuning these parameters comes down to a trade-off between classification error and overfitting.
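A hedged sketch of an RBF-kernel SVM in scikit-learn on toy two-class data; the $\gamma$ and $C$ values here are arbitrary, not the paper’s tuned values:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy non-linearly-separable data standing in for the clone features.
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF-kernel SVM: gamma sets the radius of influence of one observation,
# C trades margin violations against a wider margin.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=0.5, C=1.0))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```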
Evaluation Metrics
It is important to evaluate the performance of a machine learning model. This requires a metric. The paper uses three: accuracy, F1-score, and Matthews correlation coefficient (MCC). MCC is given by

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. The MCC lies between -1 and 1 and is more informative than accuracy or the F1-score for imbalanced data sets.
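A small check of why MCC is more informative on imbalanced data, using made-up labels (90 negatives, 10 positives) and predictions:

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# 90 "drop" clones (0) and 10 "keep" clones (1); the classifier finds
# only 2 of the 10 keepers and raises 5 false alarms.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 85 + [1] * 5 + [0] * 8 + [1] * 2

print("accuracy:", accuracy_score(y_true, y_pred))     # looks good despite the misses
print("F1-score:", f1_score(y_true, y_pred))
print("MCC:     ", matthews_corrcoef(y_true, y_pred))  # much lower, exposing the weak classifier
```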
Variable Selection
To determine which variables to include in the machine learning model, we use variable selection. We start by performing a “grid search with cross-validation on the hyperparameter space to determine the best scoring model”: every combination of candidate hyperparameter values (e.g., $\gamma$ and $C$ for the SVM) is scored by k-fold cross-validation, and the best-scoring combination is kept.
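A sketch of grid search with cross-validation using scikit-learn’s GridSearchCV; the data set and the $\gamma$/$C$ grid are placeholders, not the paper’s:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for the encoded, imputed clone features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Grid search with 5-fold cross-validation over an assumed gamma/C grid;
# every combination is scored and the best-scoring model is kept.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```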
Next, we use “forward feature selection”: starting with zero features, we iterate through each remaining feature and compute the model’s score with that feature included. The feature with the highest score is added to the model. This is repeated until we reach a desired number of features or adding another feature has negligible impact.
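Forward selection can be sketched with scikit-learn’s SequentialFeatureSelector (whether the paper used this exact implementation, I’m not sure); the stopping point of 5 features and the SVM settings are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Placeholder feature matrix standing in for the clone data.
X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           random_state=0)

# Forward selection: start from zero features and greedily add the feature
# that most improves the cross-validated score, stopping at 5 features here.
selector = SequentialFeatureSelector(
    SVC(kernel="rbf", gamma=0.1, C=1.0),
    n_features_to_select=5,
    direction="forward",
    cv=5,
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```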