Available classifiers

Currently, a whole set of one-class classifiers is already implemented and ready to use. Please feel free to extend this list!
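Most of the descriptions below follow the same pattern: fit a model to the target class, compute a per-object score, and set a threshold such that a user-supplied fraction of the target data is rejected. A minimal sketch of that pattern in Python/numpy, using the simple Gaussian (Mahalanobis distance) case; the function names and details here are my own illustration, not the toolbox's MATLAB interface:

```python
import numpy as np

def fit_gauss(X):
    """Fit mean and covariance of the target class (plain sample estimates)."""
    mu = X.mean(axis=0)
    # A small ridge keeps the covariance invertible on small samples.
    S = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    return mu, np.linalg.inv(S)

def mahalanobis(X, mu, S_inv):
    """Squared Mahalanobis distance of each row of X to the target mean."""
    D = X - mu
    return np.einsum('ij,jk,ik->i', D, S_inv, D)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))          # target class only
mu, S_inv = fit_gauss(X_train)
d = mahalanobis(X_train, mu, S_inv)
# Threshold chosen so that roughly 5% of the target data is rejected.
theta = np.quantile(d, 0.95)
accept = mahalanobis(np.array([[0.1, -0.2], [6.0, 6.0]]), mu, S_inv) <= theta
```

A point near the bulk of the target data falls under the threshold, while a remote point is rejected; every classifier below varies only the model and the score.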

`gauss_dd`: a simple Gaussian target distribution, without any robustification. This is the first classifier I would try. The target class is modeled as a Gaussian distribution. To avoid numerical instabilities the density estimate is avoided, and just the Mahalanobis distance is used:

$$ f(x) = (x - \mu)^T \Sigma^{-1} (x - \mu) \tag{4.1} $$

The classifier is defined as:

$$ h(x) = \begin{cases} \text{target} & \text{if } f(x) \le \theta, \\ \text{outlier} & \text{otherwise.} \end{cases} \tag{4.2} $$

The mean $\mu$ and covariance matrix $\Sigma$ are just sample estimates. The threshold $\theta$ is set according to the target error that the user has to supply.

`rob_gauss_dd`: a Gaussian target distribution, but robustified. It can be significantly better for data with long tails. The mathematical description of the method is identical to `gauss_dd`; the difference is in the computation of the mean $\mu$ and covariance matrix $\Sigma$. This procedure reweights the objects in the training set according to their proximity to the (previously estimated) mean. Remote objects (candidate outliers) are downweighted such that a more robust estimate is obtained. It is not strictly robust, because these outliers always retain some influence.

`mcd_gauss_dd`: the Minimum Covariance Determinant Gaussian classifier. The mathematical description is again the same as for `gauss_dd`. For the estimation of the mean and covariance matrix only a fraction of the data is used; this part of the data is selected such that the determinant of the covariance matrix is minimal. The implementation is taken from [RVD99], but unfortunately it only works up to 50-dimensional data.

`mog_dd`: Mixture of Gaussians. Here the target class is modeled using a mixture of Gaussians, to create a more flexible description:

$$ p(x) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x;\, \mu_j, \Sigma_j) \tag{4.3} $$

The classifier is defined as:

$$ h(x) = \begin{cases} \text{target} & \text{if } p(x) \ge \theta, \\ \text{outlier} & \text{otherwise.} \end{cases} \tag{4.4} $$

The means $\mu_j$, covariance matrices $\Sigma_j$, and mixing weights $\pi_j$ are optimized using the EM algorithm. Because for high-dimensional data and a larger number of clusters $K$ the number of free parameters can become huge (in particular in the covariance matrices), the covariance matrices can be constrained. This can improve the performance significantly.

In this version of `mog_dd` it is also possible to use outlier objects in training. Individual mixtures of Gaussians are fitted for both the target and the outlier data (having $K_t$ and $K_o$ Gaussians, respectively). Objects are assigned to the class with the highest density. To ensure that the decision boundary around the target class is closed, one extra outlier cluster is introduced with a very wide covariance matrix. This 'background' outlier cluster is fixed and is not adapted in the EM algorithm (although it is used in the computation of the probability density). This results in the following model for the outlier class:

$$ p(x \mid \text{outlier}) = \pi_0 \, \mathcal{N}(x;\, \mu_0, \Sigma_0) + \sum_{j=1}^{K_o} \pi_j \, \mathcal{N}(x;\, \mu_j, \Sigma_j) \tag{4.5} $$

Classifying is done according to (4.4). Here $\mu_0$ is the mean of the complete dataset, and $\Sigma_0$ is a scaled version of the covariance matrix of the complete dataset. The weight $\pi_0$ is still optimized in the EM procedure, such that $\sum_j \pi_j = 1$.

Finally, it is also possible to *extend* a trained mixture by a new cluster (which can be either a target or an outlier cluster). Have a look at the function `mog_extend`.

`parzen_dd`: the Parzen density estimator. An even more flexible density model, with a Gaussian centered on each of the training objects:

$$ p(x) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}(x;\, x_i, h^2 I) \tag{4.6} $$

The free width parameter $h$ is optimized by maximizing the leave-one-out likelihood on the training data. The classifier becomes as in (4.4). This method often performs pretty well, but it requires a reasonably sized training set. My second choice!

`autoenc_dd`: the auto-encoder neural network. A full explanation of neural networks is outside the scope of this manual; please have a look at [Bis95]. The idea is that a neural network is trained to reconstruct the input pattern at its output $\text{net}(x)$. The difference between the input and the output pattern is used as a characterization of the target class:

$$ f(x) = \| \text{net}(x) - x \|^2 \tag{4.7} $$

The classifier then becomes as in (4.2).

`kmeans_dd`: the k-means data description, where the data is described by $k$ clusters, placed such that the average distance to a cluster center is minimized. The cluster centers $c_k$ are placed using the standard $k$-means clustering procedure ([Bis95]). The target class is then characterized by:

$$ f(x) = \min_k \| x - c_k \|^2 \tag{4.8} $$

The classifier then becomes as in (4.2).

`kcenter_dd`: the k-center data description, where the data is described by $k$ clusters, placed such that the maximal distance to a cluster center is minimized [YD98]. Once the clusters are placed, the mathematical description of the method is similar to `kmeans_dd`.

`pca_dd`: the Principal Component Analysis data description. This method describes the target data by a linear subspace, defined by the eigenvectors of the data covariance matrix $\Sigma$. Only $k$ eigenvectors are used; assume they are stored as the columns of a $d \times k$ matrix $W$ (where $d$ is the dimensionality of the original feature space). To check if a new object fits the target subspace, the reconstruction error is computed. The reconstruction error is the difference between the original object and the projection of that object onto the subspace (expressed in the original feature space). This projection is computed by:

$$ x_{\text{proj}} = W (W^T W)^{-1} W^T x \tag{4.9} $$

For the reconstruction error I decided to use:

$$ f(x) = \| x - x_{\text{proj}} \|^2 \tag{4.10} $$
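As an illustration, the PCA reconstruction error can be computed in a few lines of numpy. This is a sketch under my own naming, not the toolbox code; note that for an orthonormal basis $W$ the projection $W(W^TW)^{-1}W^T$ reduces to $WW^T$:

```python
import numpy as np

def pca_basis(X, k):
    """Columns of W: eigenvectors of the covariance matrix, largest eigenvalues first."""
    mu = X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X - mu, rowvar=False))
    order = np.argsort(evals)[::-1]          # eigh returns ascending order
    return mu, evecs[:, order[:k]]

def reconstruction_error(X, mu, W):
    """Squared distance between each object and its projection onto the subspace."""
    Xc = X - mu
    X_proj = Xc @ W @ W.T                    # orthonormal W, so W(W^T W)^{-1} W^T = W W^T
    return np.sum((Xc - X_proj) ** 2, axis=1)

rng = np.random.default_rng(1)
# Target data lying near a 1-D subspace of a 3-D feature space.
t = rng.normal(size=(300, 1))
X_train = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(300, 3))
mu, W = pca_basis(X_train, k=1)
err_on  = reconstruction_error(X_train, mu, W)
err_off = reconstruction_error(np.array([[3.0, -3.0, 3.0]]), mu, W)
```

Objects near the subspace get a small reconstruction error, while an object off the subspace gets a large one, which is then thresholded as in (4.2).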

The default setting of the method is that the eigenvectors with the largest eigenvalues are used. It appears that this does not always give the best performance [TM03]. It is also possible to choose the eigenvectors with the *smallest* eigenvalues; this is an extra feature in the toolbox.

`som_dd`: the Self-Organizing Map data description. This method uses a Self-Organizing Map (SOM) to describe the data. A SOM is an unsupervised clustering method in which the cluster centers are constrained in their placing. The SOM is constructed such that all objects in the feature space retain, as much as possible, their distance and neighborhood relations in the mapped space. The mapping is performed by a specific type of neural network, equipped with a special learning rule. Assume that we want to map an $n$-dimensional measurement space to a $q$-dimensional feature space, where $q < n$. In the feature space we define a finite orthogonal grid with grid points $g_k$, and at each grid point we place a neuron. Each neuron stores an $n$-dimensional vector that serves as a cluster center. Because the neurons are arranged in a grid, each neuron has neighboring neurons in the measurement space as well as in the grid. During the learning phase, neighboring neurons in the grid are forced to also be neighbors in the measurement space; by doing so, the local topology is preserved. In this implementation the SOM grid is restricted to $q = 1$ or $q = 2$.
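The learning phase described above can be illustrated with a much-simplified sketch: select the best-matching neuron for each object, then pull it and its grid neighbors toward that object. This uses a 1-D grid, a single pass, and constants of my own choosing; it is not the toolbox's SOM implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))          # objects in a 2-D measurement space

# A 1-D grid of 5 neurons; each neuron stores a 2-D prototype vector.
neurons = rng.normal(size=(5, 2))

def best_matching_unit(x, neurons):
    """Index of the neuron whose stored vector is closest to x."""
    return int(np.argmin(np.sum((neurons - x) ** 2, axis=1)))

# One pass of a simplified SOM update: the winner and its grid
# neighbours are pulled towards each presented object.
lr = 0.2
for x in X:
    b = best_matching_unit(x, neurons)
    for j in range(len(neurons)):
        h = np.exp(-((j - b) ** 2) / 2.0)   # neighbourhood function on the grid
        neurons[j] += lr * h * (x - neurons[j])
```

Because neighbors on the grid share each update, neurons that are adjacent in the grid end up adjacent in the measurement space, which is the topology-preservation property described above.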

To evaluate if a new object fits this model, again a reconstruction error is defined. This reconstruction error is the difference between the object and its closest cluster center (neuron) $w_k$ in the SOM:

$$ f(x) = \min_k \| x - w_k \|^2 \tag{4.11} $$

This distance is thresholded in order to get a classification result.

`mst_dd`: the minimum spanning tree data description. A minimum spanning tree is fitted on the training data, and the distance to its *edges* is used as the similarity to the target class. That means that a training target dataset of just two objects will define a sausage-shaped target class.

`nndd`: a simple nearest-neighbor method [Tax01]. Here a new object $z$ is evaluated by computing the distance to its nearest neighbor $NN(z)$ in the training set. This distance is normalized by the nearest-neighbor distance of that training object (that is, the distance between $NN(z)$ and $NN(NN(z))$). It does not work very well for low-dimensional, well-sampled data, but surprisingly well for low-sample, high-dimensional data!

`knndd`: the $k$-nearest-neighbor data description, a much smarter approach than the `nndd` method. In its most simple version just the distance to the $k$-th nearest neighbor is used. Slightly more advanced versions use averaged distances, which works somewhat better. This simple method is often very good in high-dimensional feature spaces.

`svdd`: the support vector data description, inspired by the support vector classifier. For a full discussion of this method, please read [Tax01]. It basically fits a hypersphere around the target class. By introducing kernels, this inflexible model becomes much more powerful, and it can give excellent results when a suitable kernel is used. It is possible to optimize the method to reject a pre-defined fraction of the target data; that means that for different rejection rates, the shape of the boundary changes. Furthermore, it is possible to use example outliers to improve the classification results. The main drawback of the method is that it requires a difficult optimization, for which the `quadprog` algorithm from the optimization toolbox is required. This first implementation has the RBF kernel hard-coded, because it is the most useful kernel and it makes the implementation and evaluation much simpler.
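Once the quadratic program has been solved, evaluating an RBF-kernel SVDD amounts to computing the kernel-space distance to the center of the hypersphere. A hedged numpy sketch, in which the dual weights `alpha` and the support vectors are assumed given (e.g. produced by `quadprog`); the names and toy values are mine:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-d2 / (sigma ** 2))

def svdd_dist2(Z, X_sv, alpha, sigma=1.0):
    """Squared kernel-space distance of each row of Z to the SVDD centre.
    X_sv, alpha: support vectors and their dual weights, assumed already
    obtained from a quadratic programming solver."""
    const = alpha @ rbf(X_sv, X_sv, sigma) @ alpha   # alpha^T K alpha
    # K(z, z) = 1 for the RBF kernel, hence the leading 1.0.
    return 1.0 - 2.0 * rbf(Z, X_sv, sigma) @ alpha + const

# Toy example: uniform weights over a small 'support set'.
X_sv = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
alpha = np.full(3, 1.0 / 3.0)
d2 = svdd_dist2(np.array([[0.3, 0.3], [5.0, 5.0]]), X_sv, alpha)
```

Objects whose distance exceeds the (squared) radius of the sphere are rejected as outliers; a point among the support vectors scores a much smaller distance than a remote one.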
When you want to use other kernels, have a look at one of the next two classifiers:

`ksvdd`: the support vector data description using a general kernel. The choice of kernel is free in this implementation, but this makes the evaluation in general much slower. The implementation is not the most convenient, so maybe I will change this again in the future.

`incsvdd`: the incremental support vector machine, which uses its own optimization routine. This makes it possible to optimize the SVDD without an external quadratic programming optimizer, and to use any kernel. In future versions this will be adapted to cope with dynamically changing data (data distributions which change in time). Currently this is my preferred way to train a SVDD.

`dlpdd`: the linear programming distance-data description [PTD03]. This data descriptor is specifically constructed to describe target objects which are represented in terms of distances to other objects. In some cases it might be much easier to define distances between objects than informative features (for instance, when shapes have to be distinguished). To stress that the classifier operates on distance data, the name starts with a `d`. The classifier basically has the following form:

$$ f(x) = \sum_{i} w_i \, d(x, x_i) \tag{4.12} $$

The weights $w_i$ are optimized such that just a few weights stay non-zero, and the boundary is as tight as possible around the data.

`lpdd`: the linear programming data description. The fact that `dlpdd` uses distance data instead of standard feature data makes it harder to apply it directly to normal feature data. Therefore `lpdd` was created: basically a wrapper combining some high-level algorithms to make the application of `dlpdd` simpler.

`mpm_dd`: the minimax probability machine by Lanckriet [LEGJ03]. It tries to find the linear classifier that separates the data from the origin, rejecting at most a specific fraction of the target data. In the original version an upper bound on this rejection error is used (applying a very general bound that uses only the mean and covariance matrix of the target data). Unfortunately, in practice this bound is so loose that it is not useful; therefore the rejection threshold is re-derived from the target data.
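To make (4.12) concrete: once the linear program has produced its sparse weights, evaluating a `dlpdd`-style classifier is just a weighted sum of distances to the objects with non-zero weight. A small illustrative sketch, with weights and prototype objects invented for the example rather than produced by the actual optimization:

```python
import numpy as np

def dlpdd_score(D, w):
    """Weighted sum of distances, as in (4.12).
    D[i, j]: distance from test object i to prototype object j.
    w: sparse non-negative weights, assumed given by the linear program."""
    return D @ w

def dists(Z, P):
    """Euclidean distance matrix between rows of Z and rows of P."""
    return np.sqrt(np.sum((Z[:, None, :] - P[None, :, :]) ** 2, axis=2))

# Toy example: three prototypes, only two of which carry non-zero weight.
prototypes = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
w = np.array([0.5, 0.5, 0.0])

scores = dlpdd_score(dists(np.array([[1.0, 0.0], [10.0, 10.0]]), prototypes), w)
```

The score is then thresholded: objects close to the weighted prototypes score low and are accepted, remote objects score high and are rejected.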