Support Vector Machine Background


Support Vector Machine (SVM) is a supervised classification method derived from statistical learning theory that often yields good classification results from complex and noisy data. It separates the classes with a decision surface that maximizes the margin between the classes. The surface is often called the optimal hyperplane, and the data points closest to the hyperplane are called support vectors. The support vectors are the critical elements of the training set.
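In the standard soft-margin formulation used by LIBSVM (cited in the references below), the optimal hyperplane is the solution of:

    min over w, b, ξ:   (1/2) ||w||^2 + C Σ_i ξ_i
    subject to:         y_i (w^T x_i + b) ≥ 1 - ξ_i,   ξ_i ≥ 0

Training points that receive nonzero weight in this solution are the support vectors, and C is the penalty parameter described later in this topic.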

You can adapt SVM to become a nonlinear classifier through the use of nonlinear kernels. While SVM is a binary classifier in its simplest form, it can function as a multiclass classifier by combining several binary SVM classifiers (creating a binary classifier for each possible pair of classes). ENVI’s implementation of SVM uses the pairwise classification strategy for multiclass classification.
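ENVI performs this pairwise combination internally. Purely as an illustration outside ENVI, the sketch below uses Python with scikit-learn's SVC, which wraps the LIBSVM library cited in the references and likewise trains one binary SVM per pair of classes; the synthetic data and parameter values are assumptions for demonstration only.

    # Illustration only: scikit-learn's SVC wraps LIBSVM and, like ENVI's
    # implementation, trains one binary SVM per pair of classes (one-vs-one).
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # Synthetic 3-class data stands in for image spectra.
    X, y = make_blobs(n_samples=300, centers=3, random_state=0)

    clf = SVC(kernel='rbf', decision_function_shape='ovo').fit(X, y)

    # Three classes yield 3*(3-1)/2 = 3 pairwise decision values per sample.
    print(clf.decision_function(X[:1]).shape)  # -> (1, 3)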

SVM includes a number of parameters, which are described below.

References

Chang, C.-C., and C.-J. Lin. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1-27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Hsu, C.-W., C.-C. Chang, and C.-J. Lin. (2010). A practical guide to support vector classification. National Taiwan University. https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

Wu, T.-F., C.-J. Lin, and R. C. Weng. (2004). Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5, 975-1005. https://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf.

Kernel Type

Several options are available for the kernel function, which determines how nearby data points are weighted when estimating target classes. The Radial Basis Function (RBF, default) kernel type works well in most cases. The mathematical definitions of the kernel functions are as follows:

    Linear

    K(x_i, x_j) = x_i^T x_j

    Polynomial

    K(x_i, x_j) = (g x_i^T x_j + r)^d, g > 0

    RBF

    K(x_i, x_j) = exp(-g ||x_i - x_j||^2), g > 0

    Sigmoid

    K(x_i, x_j) = tanh(g x_i^T x_j + r)
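For readers who want to experiment outside ENVI, here is a sketch of how the four kernel types above map to scikit-learn's LIBSVM-backed SVC, where g, r, and d correspond to the gamma, coef0, and degree arguments (the values below are arbitrary, not ENVI defaults):

    from sklearn.svm import SVC

    # g -> gamma, r -> coef0, d -> degree; values here are arbitrary.
    linear_svm  = SVC(kernel='linear')
    poly_svm    = SVC(kernel='poly', gamma=0.5, coef0=1.0, degree=3)
    rbf_svm     = SVC(kernel='rbf', gamma=0.5)      # the default kernel type
    sigmoid_svm = SVC(kernel='sigmoid', gamma=0.5, coef0=1.0)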

Degree of Kernel Polynomial

The d term in the Polynomial kernel function represents the degree of the kernel polynomial. Increasing this value allows the decision boundary to follow the contours between classes more closely. A value of 1 represents a first-degree polynomial, which is essentially a straight line between two classes and works well when you have two very distinctive classes. In most cases, however, you will be working with imagery that has a high degree of variation and mixed pixels, so a higher degree lets the classifier delineate the class boundaries more accurately, at the risk of fitting the classification to noise.
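A quick way to see this risk, sketched here with synthetic, deliberately noisy data (nothing below reflects ENVI defaults), is to compare training and held-out accuracy as the degree grows:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Noisy synthetic data stands in for mixed-pixel imagery.
    X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for d in (1, 2, 5, 10):
        clf = SVC(kernel='poly', degree=d, gamma='scale', coef0=1.0)
        clf.fit(X_tr, y_tr)
        # Training accuracy tends to rise with degree while held-out
        # accuracy can fall: the boundary starts following noise.
        print(d, clf.score(X_tr, y_tr), clf.score(X_te, y_te))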

Bias in Kernel Function

The r term in the Polynomial and Sigmoid kernel functions represents the kernel bias.

Gamma in Kernel Function

The g term in the Polynomial, Radial Basis Function, and Sigmoid kernel functions represents the gamma parameter.
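In LIBSVM-derived interfaces such as scikit-learn's SVC (shown only as an illustration; the values are arbitrary), the bias r and gamma g from the two preceding sections correspond to the coef0 and gamma arguments:

    from sklearn.svm import SVC

    # r (kernel bias) -> coef0; g (gamma) -> gamma. Arbitrary values.
    sigmoid_svm = SVC(kernel='sigmoid', gamma=0.25, coef0=1.0)

    # For the RBF kernel, a larger g narrows each training point's
    # influence and can overfit; a smaller g smooths the boundary.
    rbf_svm = SVC(kernel='rbf', gamma=2.0)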

Penalty

This parameter controls the trade-off between allowing training errors and forcing rigid margins. It permits a certain degree of misclassification, which is particularly important for non-separable training sets. Increasing this value increases the cost of misclassifying points, producing a more accurate model that may not generalize well.
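This penalty is the C parameter in the LIBSVM references above. As a sketch with synthetic data (arbitrary values, not ENVI defaults), varying C shows the same trade-off:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, flip_y=0.2, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel='rbf', C=C).fit(X_tr, y_tr)
        # Small C tolerates training errors (softer margin); large C
        # penalizes them heavily and can overfit.
        print(C, clf.score(X_tr, y_tr), clf.score(X_te, y_te))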