1. Introduction
This blog post is about Support Vector Machines (SVM), but not only about SVMs. SVMs belong to the class of classification algorithms and are used to separate two or more groups. In its pure form an SVM is a linear separator, meaning that SVMs can only separate groups using a straight line. However, any linear classifier can be transformed into a nonlinear classifier, and SVMs are excellent for explaining how this can be done. For a deeper introduction to the topic I recommend Tibshirani (2009), where one can find a more detailed description including a derivation of the complete Lagrangian.
The general idea of the SVM is to separate two (or more) groups using a straight line (see Figure 1). However, in general there exist infinitely many lines that fulfill this task. So which one is the "correct" one? The SVM answers this question by choosing the line (or hyperplane, if there are more than two features) that is farthest away from the nearest points of each group; this distance is denoted by $M$.
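To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the toy data is made up for illustration) that fits a linear SVM and reports the chosen hyperplane, its margin and the support vectors that determine it.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable groups (made-up toy data)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # group -1
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])  # group +1
y = np.array([-1, -1, -1, 1, 1, 1])

# With a very large cost parameter the linear SVM behaves like a hard-margin classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# beta and beta_0 describe the chosen hyperplane x^T beta + beta_0 = 0
beta, beta_0 = clf.coef_[0], clf.intercept_[0]
print("beta:", beta, "beta_0:", beta_0)

# In the canonical scaling the margin is 1 / ||beta||, and only the
# support vectors (the nearest points of each group) determine it
print("margin:", 1.0 / np.linalg.norm(beta))
print("support vectors:\n", clf.support_vectors_)
```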
2. Mathematical Formulation
2.1 Separating Hyperplanes
Suppose multivariate data given by pairs $(x, y)$, where the explanatory variable $x \in \mathbb{R}^p$ and the group coding $y \in \{-1, +1\}$. In the following we assume an i.i.d. sample of size $n$ given by $(x_1, y_1), \dots, (x_n, y_n)$.
Any separating hyperplane (which is a line if $p = 2$) can therefore be described such that there exist some $\beta \in \mathbb{R}^p$ with $\|\beta\| = 1$, some $\beta_0 \in \mathbb{R}$ and a margin $M > 0$ with

$$y_i (x_i^T \beta + \beta_0) \ge M, \qquad i = 1, \dots, n. \qquad\qquad (1)$$

Among all hyperplanes satisfying (1), the SVM chooses the one with the largest margin, i.e. it solves

$$\max_{\beta,\, \beta_0,\, \|\beta\| = 1} \; M \quad \text{subject to} \quad y_i (x_i^T \beta + \beta_0) \ge M, \quad i = 1, \dots, n. \qquad\qquad (2)$$
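How a candidate hyperplane relates to (1) and (2) can be checked numerically. The following NumPy sketch uses made-up data and a hand-picked pair $(\beta, \beta_0)$ (both are illustrative assumptions, not fitted values) and computes the signed margins $y_i(x_i^T \beta + \beta_0)$.

```python
import numpy as np

def signed_margins(X, y, beta, beta_0):
    """Signed margins y_i * (x_i^T beta + beta_0) for every sample."""
    return y * (X @ beta + beta_0)

# Made-up data and a hand-picked candidate hyperplane
X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1, -1, 1, 1])
beta, beta_0 = np.array([1.0, 1.0]), -5.5

# Rescale so that ||beta|| = 1, as required in (1) and (2)
norm = np.linalg.norm(beta)
m = signed_margins(X, y, beta / norm, beta_0 / norm)

print(m)                     # all entries positive -> the hyperplane separates the groups
print("margin M:", m.min())  # the smallest signed margin is the M appearing in (1) and (2)
```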
2.2 Support Vector Machines
In most cases the assumption that there exists a hyperplane that perfectly separates the data points is unrealistic. Usually some points will lie on the wrong side of the hyperplane, and in that case (2) has no solution. The idea of the SVM is now to introduce slack variables $\xi_i \ge 0$ to fix this issue. In particular we modify (1) such that we require

$$y_i (x_i^T \beta + \beta_0) \ge M (1 - \xi_i), \qquad \xi_i \ge 0, \quad i = 1, \dots, n. \qquad\qquad (3)$$

Points with $\xi_i > 0$ are allowed to violate the margin. Dropping the constraint $\|\beta\| = 1$ and penalizing the total violation with a cost constant $C > 0$ leads to the equivalent optimization problem

$$\min_{\beta,\, \beta_0}\ \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad \xi_i \ge 0,\ \ y_i (x_i^T \beta + \beta_0) \ge 1 - \xi_i. \qquad\qquad (4)$$

Solving (4) via its Lagrangian (the full derivation can be found in Tibshirani (2009)) shows that the solution is a linear combination of the data points,

$$\hat{\beta} = \sum_{i=1}^{n} \hat{\alpha}_i y_i x_i, \qquad \text{so that} \qquad \hat{f}(x) = \sum_{i=1}^{n} \hat{\alpha}_i y_i \langle x, x_i \rangle + \hat{\beta}_0, \qquad\qquad (5)$$

where $\hat{\alpha}_i > 0$ only for the support vectors, i.e. the points that lie on or violate the margin.
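The role of the slack variables $\xi_i$ and of the constant $C$ can be illustrated with scikit-learn, whose `C` parameter in `SVC` corresponds to the cost constant in (4). The data below is made up, and the slack values are recovered from the fitted decision function rather than returned by the library.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping groups, so no hyperplane separates them perfectly
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(1.5, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.1, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # xi_i = max(0, 1 - y_i f(x_i)) is the slack of sample i in (3) and (4)
    xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
    print(f"C={C}: total slack = {xi.sum():.2f}, "
          f"support vectors = {len(clf.support_)}")
```

A small `C` tolerates a lot of slack and yields a wide margin with many support vectors; a large `C` penalizes violations heavily and approaches the hard-margin solution.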
2.3 Nonlinear SVMs
The method we described so far can only handle data where the groups can be separated using some hyperplane (line). However, in many cases the data to be considered is not suited to be separated using a linear method. See figure 2 for example: there the groups are arranged in two circles with different radii, so any attempt to separate the groups using a line will fail. However, any linear method can be transformed into a nonlinear method by projecting the data into a higher-dimensional space via a feature map $h$. The transformed data $h(x_1), \dots, h(x_n)$ may then be separable by some linear method. In case of figure 2 a suitable projection is for example given by

$$h(x) = \left(x_1,\ x_2,\ \sqrt{x_1^2 + x_2^2}\right)^T,$$

writing $x = (x_1, x_2)^T$ for the two coordinates. This map lifts the data onto a cone, where the groups can be separated using a hyperplane, as can be seen in figure 3. However, there are some drawbacks to this method. First of all, $h$ will in general be unknown. In our simple example we were lucky to find a suitable $h$; for a more complicated data structure, finding a suitable $h$ turns out to be very hard. Secondly, depending on the dataset, the dimension of $h(x)$ needed to guarantee the existence of a separating hyperplane can become quite large and even infinite. Before we deal with these issues using kernels, we will first have a look at the modified minimization problem (4) using $h(x)$ instead of $x$.
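The projection idea can be reproduced with a few lines of code. The sketch below (not the author's original figures) generates a two-circles situation like figure 2 with scikit-learn's `make_circles`, applies the feature map $h$ from above, and fits a plain linear SVM in the lifted three-dimensional space.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles, as in figure 2
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

def h(X):
    """Feature map h(x) = (x1, x2, sqrt(x1^2 + x2^2)): lifts the points onto a cone."""
    r = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)
    return np.column_stack([X[:, 0], X[:, 1], r])

# In the original 2-D space a linear SVM fails ...
print("linear SVM, raw data:   ", SVC(kernel="linear").fit(X, y).score(X, y))
# ... but after projecting with h the two circles become linearly separable
print("linear SVM, lifted data:", SVC(kernel="linear").fit(h(X), y).score(h(X), y))
```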
Let the nonlinear solution function be given by $f(x) = h(x)^T \beta + \beta_0$. To overcome the need for the constant $C$ we introduce a penalty parameter $\lambda$ and rewrite (4) as

$$\min_{\beta,\, \beta_0}\ \sum_{i=1}^{n} \big[1 - y_i f(x_i)\big]_+ + \frac{\lambda}{2} \|\beta\|^2. \qquad\qquad (6)$$

The notation $[\,\cdot\,]_+$ denotes the positive part, i.e. $[u]_+ = \max(u, 0)$, so the term $[1 - y f(x)]_+$ only contributes for points that lie on the wrong side of their margin. The function $L(y, f) = [1 - y f(x)]_+$ is known as the hinge loss.
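A direct NumPy translation of the hinge loss and of the criterion (6) might look as follows; this is only a sketch for evaluating the objective at given values, not a solver, and the example numbers are made up.

```python
import numpy as np

def hinge_loss(y, f_x):
    """L(y, f) = [1 - y f(x)]_+ , evaluated elementwise."""
    return np.maximum(0.0, 1.0 - y * f_x)

def objective(y, f_x, beta, lam):
    """Criterion (6): sum of hinge losses plus (lambda / 2) * ||beta||^2."""
    return hinge_loss(y, f_x).sum() + 0.5 * lam * np.dot(beta, beta)

# Tiny made-up example: f evaluated at three samples
y   = np.array([-1.0, 1.0, 1.0])
f_x = np.array([-2.0, 0.5, 1.5])

print(hinge_loss(y, f_x))     # -> [0.  0.5 0. ]: only the middle point violates its margin
print(objective(y, f_x, beta=np.array([1.0, -1.0]), lam=0.1))
```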
From (5) we already know that a nonlinear solution function will have the form

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle h(x), h(x_i) \rangle + \beta_0. \qquad\qquad (7)$$

We can verify that knowledge about the inner products $\langle h(x), h(x_i) \rangle$ is all that is required: the feature map $h$ itself never has to be evaluated explicitly, neither for fitting nor for prediction.
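That the fitted classifier really is such a weighted sum over the training points can be checked with scikit-learn, which stores the coefficients $\hat{\alpha}_i y_i$ of the support vectors in `dual_coef_`. The sketch below rebuilds (7) by hand (with $h$ the identity, i.e. a linear kernel) and compares it to the library's decision function; the data is made up.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)

clf = SVC(kernel="linear").fit(X, y)

alpha_y = clf.dual_coef_[0]        # alpha_i * y_i, one entry per support vector
sv = clf.support_vectors_

x_new = np.array([0.3, -0.2])
# Equation (7) with h(x) = x: sum_i alpha_i y_i <x, x_i> + beta_0
f_manual = np.sum(alpha_y * (sv @ x_new)) + clf.intercept_[0]

print(np.isclose(f_manual, clf.decision_function([x_new])[0]))  # -> True
```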
2.3.1 Kernels
We already got in touch with kernels when estimating densities or regression functions. In this application, kernels are a way to reduce an infinite-dimensional problem to a finite-dimensional optimization problem, because the complexity of the optimization problem remains dependent only on the dimensionality of the input space and not on that of the feature space. To see this, instead of looking at some particular feature map $h$, we consider the whole space of functions generated by the linear span of $\{K(\cdot, x') : x' \in \mathbb{R}^p\}$, of which $f$ is just one element. Popular kernels used to model this space in practice are (a short numerical sketch follows below):

- linear kernel: $K(x, x') = \langle x, x' \rangle$
- polynomial kernel: $K(x, x') = (1 + \langle x, x' \rangle)^d$
- radial kernel: $K(x, x') = \exp\!\big(-\gamma \|x - x'\|^2\big)$

Each of these kernels plays the role of the inner product in the (possibly unknown and possibly infinite-dimensional) feature space,

$$K(x, x') = \langle h(x), h(x') \rangle. \qquad\qquad (8)$$
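Here is a minimal sketch of these three kernels, each evaluated for an entire data matrix at once so that the result is the Gram matrix with entries $K(x_i, x_j)$; the parameter values `d=3` and `gamma=1.0` are arbitrary illustrative choices.

```python
import numpy as np

def linear_kernel(X, Z):
    """K(x, x') = <x, x'>"""
    return X @ Z.T

def polynomial_kernel(X, Z, d=3):
    """K(x, x') = (1 + <x, x'>)^d"""
    return (1.0 + X @ Z.T) ** d

def radial_kernel(X, Z, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2)"""
    sq_dist = (np.sum(X ** 2, axis=1)[:, None]
               + np.sum(Z ** 2, axis=1)[None, :]
               - 2.0 * X @ Z.T)
    return np.exp(-gamma * sq_dist)

# Gram matrix K_ij = K(x_i, x_j) for some made-up data
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(radial_kernel(X, X))
```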
Plugging (8) into (7), the solution function can be written entirely in terms of the kernel,

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x, x_i) + \beta_0. \qquad\qquad (9)$$

Thus we can write (6) in its general form as

$$\min_{\beta_0,\, \alpha}\ \sum_{i=1}^{n} \big[1 - y_i f(x_i)\big]_+ + \frac{\lambda}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j). \qquad\qquad (10)$$

According to (9) we can write down (10) using matrix notation. With the Gram matrix $\mathbf{K} = \big(K(x_i, x_j)\big)_{i,j=1}^{n}$, $\mathbf{Y} = \operatorname{diag}(y_1, \dots, y_n)$ and $\tilde{\alpha} = \mathbf{Y} \alpha$, problem (10) becomes

$$\min_{\beta_0,\, \tilde{\alpha}}\ \mathbf{1}^T \big[\mathbf{1} - \mathbf{Y}(\mathbf{K} \tilde{\alpha} + \beta_0 \mathbf{1})\big]_+ + \frac{\lambda}{2} \tilde{\alpha}^T \mathbf{K} \tilde{\alpha}, \qquad\qquad (11)$$

where $[\,\cdot\,]_+$ is applied componentwise.
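To connect (9) through (11) with code, the sketch below evaluates the decision function and the matrix-form objective for given (non-optimized, randomly chosen) coefficients; it uses scikit-learn's `rbf_kernel` for the Gram matrix and does not perform the actual minimization.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def objective_11(alpha_t, beta_0, K, y, lam):
    """Matrix form (11): 1^T [1 - Y(K alpha~ + beta_0 1)]_+ + (lambda/2) alpha~^T K alpha~."""
    f = K @ alpha_t + beta_0                    # f(x_i) for all training points, cf. (9)
    hinge = np.maximum(0.0, 1.0 - y * f).sum()  # componentwise positive part
    return hinge + 0.5 * lam * alpha_t @ K @ alpha_t

# Made-up data and coefficients, just to show the bookkeeping
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

K = rbf_kernel(X, gamma=1.0)              # Gram matrix K_ij = K(x_i, x_j), radial kernel
alpha_t = rng.normal(scale=0.1, size=20)  # alpha~_i = alpha_i * y_i
print(objective_11(alpha_t, beta_0=0.0, K=K, y=y, lam=0.5))
```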