Bounded Online Learning with Kernel-Based Perceptrons

A Cognition Briefing

Contributed by: Francesco Orabona, IDIAP Research Institute, Martigny, Switzerland

Introduction
A key feature of an artificial cognitive system should be the ability to adapt to and learn from new stimuli in a supervised manner. Within the framework of supervised learning, one of the most interesting scenarios is the online setting. In this setting a learning agent observes new inputs in a sequential manner. After each observation, the agent predicts an outcome. This outcome can be as simple as a yes/no decision, as in the case of binary classification problems, or as complex as a string over a large alphabet. Once the agent has made a prediction, it receives feedback indicating the correct outcome. It may then modify its classification rule (hypothesis) on the data, presumably improving the chance of making an accurate prediction on subsequent rounds. Hence it is implicit that, at each moment, the current hypothesis depends only on the previously received samples and the previous hypothesis [1]. Note that the definition of online learning does not require forgetting past samples.

An immediate consequence of this setting is that an online learning system must face a potentially endless flow of training samples, and must therefore limit the number of samples taken into account when training, due to inevitable restrictions on the available resources (e.g., finite computational power). In other words, regardless of the number of samples seen up to a given moment, the algorithm should have a bounded complexity.

Moreover, online algorithms are typically simple to implement and can handle large quantities of data more easily than batch algorithms.

We will briefly review the state-of-the-art algorithms for online learning with kernels, with an emphasis on those with bounded complexity.

Online Learning with Kernels
Online learning algorithms try to find a good approximation to the batch solution, that is, the solution found by an algorithm that can access all the training samples at once. Most of them are linear algorithms, so we will briefly introduce the idea behind linear classifiers.


Figure 1. Maximum-margin separating hyperplane.

Linear algorithms for binary classification find a separating hyperplane in the input space that divides the two classes. If such a hyperplane exists, the problem is said to be linearly separable. In this case it is always possible to find a maximum-margin separating hyperplane, which lies at equal distance from the two classes. The distance between the hyperplane and the classes is thus maximal, and it is called the margin. See Figure 1. The margin is related to the difficulty of the classification task, and it is also linked to the generalization performance of the classifier.

If the problem is not linearly separable, we could use a non-linear separating surface instead of a hyperplane. But to retain the simplicity of linear classifiers, a common trick is to transform the data with a non-linear transformation and then apply the linear algorithm (see for example Slow Feature Analysis [2] for a similar approach and a discussion of the biological plausibility of the method). This results in a non-linear surface in the original space. Of course this comes at the expense of an increase in the input dimensionality and hence in the computational complexity of the algorithm. However, most of the algorithms only need access to the inner products between the samples, so we can use a function that directly returns the inner product in the transformed space, computed from the original input samples. This function is called a kernel function, and it is the core of the Support Vector Machine (SVM) algorithm. We can use kernels in online linear algorithms too, obtaining non-linear versions almost for free. The price to pay is that the size of the solution will be proportional to the number of training samples. Note that the same is true for the SVM. This is of course unacceptable if you consider that the number of training samples is potentially endless in the online setting.
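As an illustration of the kernel trick, the following Python sketch shows a Gaussian kernel (which implicitly computes an inner product in a transformed space) and the prediction of a kernelized classifier as a combination of kernel evaluations on stored examples. The function names and the `gamma` parameter are illustrative choices, not taken from any specific work discussed here.

```python
import numpy as np

def gaussian_kernel(x1, x2, gamma=1.0):
    """Gaussian (RBF) kernel: the inner product of the two samples
    in an implicit feature space, computed from the original inputs."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def kernel_predict(stored_x, stored_y, x, kernel=gaussian_kernel):
    """Prediction of a kernelized online classifier: a combination of
    kernel evaluations on the stored (support) examples."""
    score = sum(y_i * kernel(x_i, x) for x_i, y_i in zip(stored_x, stored_y))
    return 1 if score >= 0 else -1
```

Note how the cost of `kernel_predict` grows linearly with the number of stored examples, which is exactly the problem the bounded algorithms below address.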

There are many online learning algorithms in the scientific literature. The "quality" of an online algorithm is measured by how good its predictions are, and most algorithms provide a bound on the maximum number of mistakes they will commit in the worst-case scenario. These bounds compare the performance of the algorithm to the best possible hypothesis in the same functional space. Often the algorithms themselves are designed to have good bounds, that is, to provide performance that is asymptotically close to that of the best hypothesis (i.e., with small "regret"). These bounds also provide a principled way to compare different algorithms. Note that a better bound does not always imply better performance, but it often implies better "controllability" of the algorithm's behavior. We could draw a parallel between the preference for algorithms with bounds and the preference in robotics for actuators whose mechanical design simplifies the mathematical form of the equations used to control them.

Online learning algorithms include, for example, the Perceptron algorithm [3]. More modern examples include the ROMMA algorithm of Li and Long [4], Gentile's ALMA algorithm [5], and the NORMA algorithm [6]. A framework of online learning for structured problems is given by the Passive-Aggressive algorithms [7]. All these algorithms observe examples in a sequence of rounds, and construct the classification function incrementally by storing a subset of the observed examples in their internal memory. The classification function is then defined by a combination of kernel functions evaluated on the stored examples. This set of stored examples is the online equivalent of the support set in SVMs. Note that all the above algorithms can easily be extended to the multiclass setting; see for example the MIRA algorithm [12]. Other interesting variants include multiclass classification in which the true label is not disclosed after the prediction, and we only receive a binary feedback (right/wrong prediction). This setting is close to the reinforcement learning setting, and even in this case it is possible to design good algorithms with mistake bounds [13]. Given that many of the most important algorithms are based on the Perceptron, we will describe it in more detail.

The Perceptron
We will describe the linear Perceptron, but it is easy to generalize the algorithm to a kernel version. Suppose that the samples are vectors in Rn and the labels are in {-1, 1}. We start with an initial hypothesis w0 equal to the null vector (0,...,0). At each time step the Perceptron receives a sample xt and predicts its label using sign(wt*xt), where '*' indicates the inner product. If the prediction is wrong, the sample multiplied by its label is added to the current hypothesis: wt+1 = wt + yt xt. This very simple algorithm has several properties. The first one is that it will converge in a finite number of steps if the classification problem is linearly separable. In particular, the maximum number of mistakes is inversely proportional to the square of the margin between the two classes. If the problem is not linearly separable, it is still possible to prove a bound on the maximum number of mistakes it will commit, compared to the best hyperplane on the same samples.
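The update rule above can be sketched in a few lines of Python using NumPy. This is a minimal illustration of the linear Perceptron; the `n_epochs` parameter is added only to demonstrate convergence on a separable dataset, since the online setting proper is single-pass.

```python
import numpy as np

def perceptron(samples, labels, n_epochs=1):
    """Linear Perceptron: start from the null vector, predict with the
    sign of the inner product, and add y_t * x_t to w on every mistake."""
    w = np.zeros(samples.shape[1])
    mistakes = 0
    for _ in range(n_epochs):
        for x_t, y_t in zip(samples, labels):
            y_hat = 1 if np.dot(w, x_t) >= 0 else -1
            if y_hat != y_t:          # wrong prediction: update the hypothesis
                w = w + y_t * x_t
                mistakes += 1
    return w, mistakes
```

On a linearly separable dataset, repeated passes stop producing updates after a finite number of mistakes, as stated above.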

Bounded testing time and Kernels
As said above, if we use kernels we enhance the classification power of the classifier, but at the same time the size of the solution grows proportionally to the number of training samples. Given that the testing time is proportional to the size of the solution, sooner or later the system will be too slow to be used in any setting. Several approaches have been proposed to solve this problem by constraining the hypothesis to have a maximum number of elements. Crammer et al. [8] proposed an online algorithm with a fixed budget of supports. The Forgetron algorithm [9] is a variant of the Perceptron algorithm with a limited number of supports. In these approaches the algorithm keeps forgetting the past, to make room for the new samples. A similar approach was followed in the Randomized Budget Perceptron. It is possible to show that these approaches are suitable in the case of "shifting targets" [14], that is, when the classification task is continuously changing. In this case the algorithm is "tracking" the best hypothesis.

A different approach was followed by Orabona et al. [10], using an online variant of the SVM with a limited number of support patterns, called OISVM. Starting from this algorithm, Orabona et al. proposed the Projectron [11]. The Projectron is based on the idea of approximating the new samples with the old ones, in order to keep the size of the solution bounded while retaining most of the accuracy of the solution. In the following we briefly review some budget online algorithms and the Projectron.

The Forgetron
The Forgetron [9] behaves like the Perceptron until the classifier reaches the maximum size allowed a priori. Then, every time it acquires a new sample, it slightly shrinks the coefficients of the old ones and discards the oldest one. The shrinking factor is calculated at each round so as to affect the current solution as little as possible while still yielding a good mistake bound. The basic idea is that, by sufficiently reducing the old coefficients with the shrinking operation, the sample to be removed will have a small weight, so the effect of its removal will be small.

The Randomized Budget Perceptron (RBP)
This algorithm was proposed by Cesa-Bianchi et al. [14] and is extremely simple: once the solution has reached the maximum size, it discards an old sample at random. In this case the algorithm is stochastic, hence only a bound on the expected number of mistakes can be proved. However, in practice it works extremely well, with performance similar to the Forgetron. It is also very simple to implement, and it is interesting due to its use of random discarding: in expectation, the discarding step does not change the direction of the hyperplane, while shrinking its norm.
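One round of an RBP-style update can be sketched as follows, assuming a generic kernel function. The eviction policy (uniform at random over the stored set) follows the description above, while the exact details of the original algorithm [14] may differ from this illustration.

```python
import random

def rbp_update(support_x, support_y, x_t, y_t, budget, kernel):
    """One round of a Randomized-Budget-Perceptron-style update: predict
    with the stored examples; on a mistake, store the new example and,
    if the budget is exceeded, evict a stored example at random."""
    score = sum(y_i * kernel(x_i, x_t)
                for x_i, y_i in zip(support_x, support_y))
    y_hat = 1 if score >= 0 else -1
    if y_hat != y_t:
        support_x.append(x_t)
        support_y.append(y_t)
        if len(support_x) > budget:       # over budget: random discard
            j = random.randrange(len(support_x))
            support_x.pop(j)
            support_y.pop(j)
    return y_hat
```

The stored set never exceeds the budget, so both memory and testing time stay bounded regardless of the length of the input stream.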

The Projectron
The starting point of the algorithm is the standard Perceptron algorithm for binary classification. Again, it simply adds the misclassified sample, multiplied by its label, to the current solution. Now suppose that the current sample lies in the space spanned by the previously added samples, so that it can be written as a linear combination of them. In this case we do not need to add the sample to the solution; it is enough to update the coefficients of the past ones. In general it is not always possible to express a vector as a linear combination of other vectors, so it is useful to consider the error committed by this approximation, that is, by projecting the current vector onto the space spanned by the old vectors. It turns out that it is possible to prove a mistake bound and to bound the maximum size of the solution, depending on the maximum allowed error in the approximation step.
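The projection step can be sketched with basic linear algebra: the coefficients of the projection solve a linear system in the kernel matrix of the stored samples, and the squared projection error follows from them. This is a sketch of the underlying mathematics, not the exact implementation of [11]; in particular, a practical implementation would update the inverse of the kernel matrix incrementally rather than solving the system from scratch.

```python
import numpy as np

def projection_coefficients(K, k, k_tt):
    """Project the new sample's (implicit) feature vector onto the span of
    the stored samples.  K is the kernel matrix of the stored samples,
    k the vector of kernel values between the stored samples and the new
    one, and k_tt = k(x_t, x_t).  Returns the combination coefficients d
    and the squared norm of the projection error."""
    d = np.linalg.solve(K, k)   # minimizes ||phi(x_t) - sum_i d_i phi(x_i)||^2
    delta = k_tt - k @ d        # squared projection error
    return d, delta
```

If `delta` is below the maximum allowed approximation error, the new sample is not stored and only the coefficients of the past samples are updated.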

Results
We have compared some of the previous algorithms on a standard machine learning dataset (Adult9), using a Gaussian kernel. Figure 2 shows the online average number of mistakes and Figure 3 the size of the solution for the different methods, as a function of the number of training samples. The Projectron (and its variant, Projectron++) outperforms the other methods in terms of the trade-off between accuracy and size of the solution. Other results can be found in [11].


Figure 2. Online average number of mistakes vs number of training samples.


Figure 3. Size of the solution vs number of training samples.

Conclusions
The biggest open problem of online Perceptron-based algorithms is their gap in performance compared to batch algorithms. Future work will address this problem.

References
[1] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms (Adaptive Computation and Machine Learning). The MIT Press, December 2001.

[2] L. Wiskott and T.J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715-770, 2002.

[3] F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386-407, 1958.

[4] Y. Li and P. M. Long. The Relaxed Online Maximum Margin Algorithm. Machine Learning, 46(1-3), 361-387, 2002.

[5] C. Gentile. A new approximate maximal margin classification algorithm. J. Mach. Learn. Res., 2, 213-242, 2002.

[6] J. Kivinen, A.J. Smola, and R.C. Williamson. Online learning with kernels. IEEE Trans. on Signal Processing, 52(8), 2165-2176, 2004.

[7] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7, 551-585, 2006.

[8] K. Crammer, J. Kandola, and Y. Singer. Online classification on a budget. Advances in Neural Information Processing Systems 16, 2003.

[9] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The Forgetron: a kernel-based Perceptron on a budget. SIAM Journal on Computing, 37, 1342-1372, 2007

[10] F. Orabona, C. Castellini, B. Caputo, J. Luo, and G. Sandini. Indoor place recognition using online independent support vector machines. In Proc. of the British Machine Vision Conference 2007, 1090-1099, University of Warwick, UK, September 2007

[11] F. Orabona, J. Keshet, and B. Caputo. The Projectron: a bounded kernel-based Perceptron. In Proc. of International Conference on Machine Learning (ICML08), Helsinki, Finland, July 2008

[12] K. Crammer and Y. Singer. Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research, 3(Jan):951-991, 2003.

[13] S. Kakade, S. Shalev-Shwartz, and A. Tewari. Efficient bandit algorithms for online multiclass prediction. In Proc. of the 25th Annual International Conference on Machine Learning (ICML 2008), Helsinki, Finland, July 2008

[14] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Tracking the best hyperplane with a simple budget Perceptron. In Proc. of the 19th Conference on Learning Theory, 483-498, 2006.