Samy Bengio
The machine learning community has lately devoted considerable
attention to the decomposition of large-scale classification
problems into a series of sub-problems and to the recombination
of the learned models into a global model. Two major motivations
underlie these approaches:
1. reducing the complexity of each single task, possibly by
increasing the number of tasks;
2. improving the global accuracy by combining several classifiers.
These motivations are particularly relevant to the research
themes covered by IDIAP (such as speech recognition and computer
vision tasks), since the databases we are typically dealing with are
of large size: the number of attributes can run to several
hundred; the number of data points is on the order of several
thousand; and the number of classes (in classification tasks)
is typically 10 or more (10 digits, 26 characters, 30-60
phonemes, etc.). To handle each of
these scaling problems, a series of subtasks is typically
defined where each subtask focuses either on a subset of the
attributes (feature selection); on a different sample
of the data (resampling, e.g. sub-sampling, bagging,
boosting, etc.); or on a different relabeling of the data
(decomposition of polychotomies into dichotomies).
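As a concrete illustration of the last strategy, the following
minimal sketch (our own, with hypothetical names, not part of the
project) relabels a K-class problem as K one-versus-rest dichotomies:

    import numpy as np

    def one_vs_rest_labels(y, n_classes):
        """Relabel a K-class problem as K binary (one-vs-rest) problems.

        Returns an array of shape (n_classes, len(y)) in which row k
        holds +1 for examples of class k and -1 for all others.
        """
        y = np.asarray(y)
        return np.where(y == np.arange(n_classes)[:, None], 1, -1)

    # Example: a 10-class digit problem reduced to 10 dichotomies.
    y = np.array([0, 3, 3, 7, 0])
    Y_bin = one_vs_rest_labels(y, 10)  # shape (10, 5)
    print(Y_bin[3])                    # [-1  1  1 -1 -1]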
When mixing several basic learners, the global accuracy can
improve beyond that of the best basic learner only if the
errors of the learners are not too positively correlated. This
is ensured either by changing the model used to learn each
sub-problem (e.g., by using models from different families or by
modifying the parameters of the model from one subtask to
another) or by varying the data set used to train each model
(e.g. by feature selection or resampling).
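To make the correlation argument concrete, here is a small numerical
sketch (an illustration of ours, not a result from the project): three
classifiers, each with a 10% error rate, are combined by majority
vote, once with independent errors and once with fully correlated
errors:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100_000, 0.10               # examples, per-classifier error rate

    # Independent errors: each classifier errs on its own random 10%.
    errs = rng.random((3, n)) < p
    vote_err_indep = (errs.sum(axis=0) >= 2).mean()

    # Fully correlated errors: all three err on the same examples.
    shared = rng.random(n) < p
    vote_err_corr = shared.mean()

    print(vote_err_indep)  # ~0.028 = 3*p**2*(1-p) + p**3
    print(vote_err_corr)   # ~0.10: voting gains nothing

With independent errors the vote fails only when at least two of the
three classifiers err (roughly 2.8%), whereas with perfectly
correlated errors the vote simply inherits the full 10%.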
A range of solutions have been proposed in the literature for
the combination of different models into a global system.
In the simplest case, this is done with a majority vote; in
other situations, this combination is taken as a new learning
problem having as inputs the outputs of the basic models
(stacking). Finally, in its most elaborate form, this
combination is dynamic (i.e. varies with each input) and its
parameters are determined simultaneously with the training phase
of each basic model. The latter form is the so-called
mixture of experts (ME) model, which was developed
in a rigorous probabilistic framework in the early nineties and
has been widely studied and extended since then. It was the main
object of study during the last three years of one of our previous
projects.
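Schematically, the ME computes an input-dependent weighted average of
the expert outputs. A minimal sketch (hypothetical names; the
linear-softmax gate shown is one common choice, not necessarily the
one used here):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def mixture_of_experts(x, experts, gate_W, gate_b):
        """Dynamic combination: the gate's weights depend on the input x.

        experts        : list of callables, each mapping x to class posteriors
        gate_W, gate_b : parameters of a linear gating network
        """
        g = softmax(x @ gate_W + gate_b)             # one weight per expert
        outputs = np.stack([e(x) for e in experts])  # (n_experts, n_classes)
        return g @ outputs                           # input-dependent average

What distinguishes this from stacking is that the gate parameters and
the experts themselves are trained simultaneously.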
As stated before, the less correlated the experts
(basic models) are, the better the performance of the ME model.
An ME has an inherent bias towards uncorrelated experts because
its dynamic recombination partitions the input space into
different regions on which the experts specialize. To further
favor this property, a natural approach is to use feature selection
and dimensionality reduction so that each expert
relies on a different set of inputs.
The expert models of an ME used for classification problems can
be based, in principle, on any method that estimates a
posteriori class probabilities (e.g. neural networks).
Introduced in 1995, the support vector machine (SVM)
has been shown to be an extremely powerful learning tool for
2-class problems. Although an SVM does not output probabilities,
its strong performance makes it very appealing to design an ME
model with experts based on SVMs.
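One standard way to bridge this gap is to fit a sigmoid to the SVM's
decision values (Platt scaling) so that they can be read as posterior
probabilities. A minimal sketch, assuming scikit-learn purely for
illustration (the project does not name any particular tool or
calibration method):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # probability=True fits a sigmoid (Platt scaling) to the SVM's
    # decision values via internal cross-validation.
    svm = SVC(kernel='rbf', probability=True).fit(X, y)
    print(svm.predict_proba(X[:3]))  # each row sums to 1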
SVMs were originally designed for binary classification problems.
In our research group, we have acquired some know-how in the
decomposition of multiclass classification problems into 2-class
sub-problems. Beyond exploring different decomposition strategies,
however, we have so far investigated only static combination
techniques (i.e. combinations of the binary classifiers that do not
depend on the inputs). The design of K-class (or multiclass) classification systems
decomposed into binary classifiers and recombined dynamically
(i.e. MEs for classification with binary classifiers as experts)
constitutes a new field of study of great potential in the field
of pattern classification in general and in speech processing
and computer vision tasks in particular.
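To illustrate the contrast between the two regimes (purely an
illustrative sketch of ours, not the method the project sets out to
develop), a static vote picks the best of the K binary scores
regardless of the input, while a dynamic vote reweights those scores
with an input-dependent gate:

    import numpy as np

    def static_vote(scores):
        """Static recombination: pick the highest one-vs-rest score,
        regardless of where the input lies."""
        return int(np.argmax(scores))

    def dynamic_vote(x, scores, gate_W, gate_b):
        """Dynamic recombination: a gating network reweights the K
        binary scores as a function of the input x before voting."""
        g = np.exp(x @ gate_W + gate_b)
        g = g / g.sum()                    # input-dependent weights (softmax)
        return int(np.argmax(g * scores))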
This research project is thus composed of three main parts:
A. exploitation of feature selection in mixture of experts models;
B. elaboration of a mixture of experts based on support vector machines;
C. development of a mixture of binary classifiers for multiclass classification.
Keywords: learning, classification, mixture of experts, support
vector machine, feature selection, dimensionality reduction,
resampling, binary classifiers for multiclass classification.