Support Vector Machine
SVM tries to find the separating boundary with the biggest distance to the nearest points, so there is as much separation as possible. Because of that, SVM works well with smaller data sets where the separation is obvious (there are big margins between the data points), and it does not work well for data sets with a lot of points or a lot of noise.
Taken from An Idiot’s guide to Support vector machines (SVMs):
Two independent developments within the last decade:
New efficient separability of non-linear regions that use “kernel functions” : generalization of ‘similarity’ to new kinds of similarity measures based on dot products
Use of quadratic optimization problem to avoid ‘local minimum’ issues with neural nets
The resulting learning algorithm is an optimization algorithm rather than a greedy search
SVMs maximize the margin (Winston terminology: the ‘street’) around the separating hyperplane.
The decision function is fully specified by a (usually very small) subset of training samples, the support vectors.
This becomes a Quadratic programming problem that is easy to solve by standard methods
General input/output for SVMs just like for neural nets, but for one important addition...
Input: set of (input, output) training pair samples; call the input sample features x1, x2...xn, and the output result y. Typically, there can be lots of input features x(i).
Output: set of weights w (or w(i)), one for each feature, whose linear combination predicts the value of y. (So far, just like neural nets...)
Important difference: we use the optimization of maximizing the margin (‘street width’) to reduce the number of weights that are nonzero to just a few that correspond to the important features that ‘matter’ in deciding the separating line (hyperplane)...these nonzero weights correspond to the support vectors (because they ‘support’ the separating hyperplane).
Support vectors are the elements of the training set that would change the position of the dividing hyperplane if removed.
Support vectors are the critical elements of the training set
Moving a support vector moves the decision boundary
Moving the other vectors has no effect (the short sketch below illustrates this)
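A quick way to see this is to fit a classifier, drop a point that is not a support vector, and refit: the separating hyperplane stays the same. The sketch below is only an illustration and assumes scikit-learn's SVC with a linear kernel on made-up data.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two well separated blobs (made up for this sketch).
X = np.array([[1, 1], [2, 1], [1, 2], [2, 2],   # class 0
              [6, 6], [7, 6], [6, 7], [7, 7]])  # class 1
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vector indices:", clf.support_)
print("hyperplane:", clf.coef_, clf.intercept_)

# Drop one point that is NOT a support vector and refit:
# the separating hyperplane stays the same.
not_support = np.setdiff1d(np.arange(len(X)), clf.support_)[0]
keep = np.ones(len(X), dtype=bool)
keep[not_support] = False
clf2 = SVC(kernel="linear", C=1.0).fit(X[keep], y[keep])
print("hyperplane after removal:", clf2.coef_, clf2.intercept_)
```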
A nice video explaining SVM is here.
Here is another picture showing the basic concept in SVM: finding the separating line by finding the biggest margin.
What should SVM do with this data set? Points like that one are called outliers, and SVM should ignore them.
Here is an example from the SVM documentation.
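The original code is not reproduced on this page; assuming the scikit-learn SVM documentation is meant, a minimal sketch in its spirit looks roughly like this:

```python
from sklearn import svm

# Minimal fit/predict example in the spirit of the scikit-learn SVM docs.
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)
print(clf.predict([[2., 2.]]))  # predicts class 1
```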
The result is:
In case the data are not separable at first sight, we can calculate a new feature; with the right feature, a straight (linear) separating line exists again.
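As a hedged illustration (the original data set is not shown here), suppose one class sits near the origin and the other on a ring around it. Neither coordinate separates them linearly, but a computed feature such as the squared distance from the origin does:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Illustrative data: class 0 near the origin, class 1 on a ring around it.
angles = rng.uniform(0, 2 * np.pi, 50)
inner = rng.uniform(0, 1, (50, 2))
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + rng.normal(0, 0.1, (50, 2))
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# New feature: the squared distance from the origin makes the classes
# linearly separable, so a linear SVM can find a separating line.
X_new = np.c_[X, (X ** 2).sum(axis=1)]
clf = SVC(kernel="linear").fit(X_new, y)
print(clf.score(X_new, y))  # should be 1.0 on this toy data
```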
We do not have to calculate new features by hand. We can change the 'kernel' parameter instead; the kernel implicitly maps the data into a space where a linear separator exists, which shows up as a non-linear boundary in the original space. There are other parameters we can configure for SVM; two more important ones are C and gamma.
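A sketch of how these parameters are passed, assuming scikit-learn's SVC (the specific values here are only placeholders):

```python
from sklearn.svm import SVC

# 'kernel' chooses how similarity is measured ('linear', 'rbf', 'poly', ...),
# 'C' trades margin width against misclassified training points,
# 'gamma' controls how far the influence of a single training point reaches
# (used by the non-linear kernels such as 'rbf').
clf = SVC(kernel="rbf", C=10.0, gamma=0.1)
# clf.fit(X_train, y_train)  # X_train / y_train are placeholders
```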
Here is what C does. A bigger C value forces more training points to be classified correctly (misclassification is penalized more heavily), but it also costs more, so there must be a balance between accuracy on the training data and speed.
If the line separating the data points is too complex, it is called overfitting. The black curve on the chart below is an example of overfitting. The green line shows what a good separation could look like. We can change C, gamma, or the kernel to tune the separation and get rid of overfitting (i.e., not take all the training data too seriously).
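A small sketch of how this typically shows up in practice, assuming scikit-learn's SVC and cross_val_score on a synthetic data set: a very large gamma memorizes the training points, so the training score is near perfect while the cross-validated score drops.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Noisy toy data that is not linearly separable.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

for gamma in (1, 1000):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma)
    train_score = clf.fit(X, y).score(X, y)
    cv_score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"gamma={gamma}: train={train_score:.2f}, cross-validation={cv_score:.2f}")
```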
Here is a nice example from the documentation. First we generate 5 points for each label. Then we use a linear SVC and fit it to the created points. Then we see how the separating line is calculated, together with another two lines that make the margin easy to observe. A hyperplane always has one dimension less than the space it lives in (e.g., in a 2D space, a 1D line is the hyperplane).
Here is the chart with the separating line and the two "margin" lines.
Here is the data used in the chart.
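The original code and data are not reproduced on this page; a sketch in the same spirit, with made-up points and scikit-learn's SVC, could look like this:

```python
import numpy as np
from sklearn.svm import SVC

# Five made-up points per label (placeholders, not the original data).
X = np.array([[1, 2], [2, 1], [2, 3], [3, 2], [1, 1],    # label 0
              [6, 6], [7, 7], [6, 8], [8, 6], [7, 5]])   # label 1
y = np.array([0] * 5 + [1] * 5)

clf = SVC(kernel="linear", C=1000).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Separating line: w.x + b = 0; margin lines: w.x + b = +1 and -1.
xs = np.linspace(0, 9, 50)
separating = -(w[0] * xs + b) / w[1]
upper = -(w[0] * xs + b - 1) / w[1]
lower = -(w[0] * xs + b + 1) / w[1]
# These three arrays can be plotted with matplotlib to reproduce the chart.
print("margin width:", 2 / np.linalg.norm(w))
```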
The problem of finding the optimal hyperplane is an optimization problem and can be solved by optimization techniques (we use Lagrange multipliers to get the problem into a form that can be solved analytically).
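The page does not spell the formulation out, but the standard hard-margin problem it refers to can be sketched as follows (textbook form; maximizing the margin is equivalent to minimizing the norm of w):

```latex
% Primal problem: maximize the margin 2/||w|| by minimizing ||w||^2
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad\text{subject to}\quad y_i\,(w \cdot x_i + b) \ge 1,\qquad i = 1,\dots,n

% Lagrangian with multipliers alpha_i >= 0, which leads to the dual problem
L(w, b, \alpha) = \tfrac{1}{2}\lVert w \rVert^{2}
  - \sum_{i=1}^{n} \alpha_i \bigl[\, y_i\,(w \cdot x_i + b) - 1 \,\bigr]
```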