Currently, a whole set of one-class classifiers is already
implemented, ready to use. Please feel free to extend this list!
- gauss_dd: simple Gaussian target distribution, without
any robustifying. This is the first classifier I would try. The target
class is modeled as a Gaussian distribution. To avoid numerical
instabilities the density estimate is avoided, and just the
Mahalanobis distance is used:
 |
(4.1) |
The classifier is defined as:
 |
(4.2) |
The mean
and covariance matrix
are just
sample estimates. The threshold
is set according to the
target error that the user has to supply.
- rob_gauss_dd: Gaussian target distribution, but
robustified. Can be significantly better for data with long tails.
The mathematical description of the method is identical to
gauss_dd. The difference is in the computation of the
and
. This procedure reweights the objects in the training set
according to their proximity to the (previously estimated) mean.
Remote objects (candidate outliers) will be down weighted such that
a more robust estimate is obtained. It will not be strictly robust,
because these outliers will always have some influence.
- mcd_gauss_dd: Minimum Covariance Determinant Gaussian
classifier. The mathematical description is again the same as of the
gauss_dd. For the estimation of the mean and covariance matrix
just a fraction of the data is used. This part of the data is selected
such that the determinant of the covariance
matrix is minimal. The implementation is taken from [RVD99],
but unfortunately it only works up to 50 dimensional data.
- mog_dd: Mixture of Gaussians. Here the target class is
modeled using a mixture of
Gaussians, to create a more flexible
description. The model looks like:
 |
(4.3) |
The classifier is defined as:
 |
(4.4) |
The parameters
and
are optimized using the
EM algorithm. Because in high dimensional data and larger number of
clusters
, the number of free parameters can become huge (in
particular in the covariance matrices), the covariance matrices can be
constrained. This can improve the performance significantly.
In this version of mog_dd it is also possible to use outlier
objects in training. Individual mixtures of Gaussians are fitted for
both the target and outlier data (having
and
Gaussians
respectively). Objects are assigned to the class with the highest
density. To avoid that the decision boundary around the target class
will not be closed, one extra outlier cluster is introduced with a
very wide covariance matrix. This 'background' outlier cluster is
fixed and will not be adapted in the EM algorithm (although it will be
used in the computation of the probability density). This results in
the following model:
Classifying is done according to (4.4).
Here
is the mean of the complete dataset, and
is
taken as
, where
is covariance matrix of the
complete dataset. The
is still optimized in the EM procedure,
such that
.
Finally, it is also possible to extend a trained mixture by a
new cluster (can be both a target or an outlier cluster). Have a look
at the function mog_extend.
- parzen_dd: Parzen density estimator. An even more flexible
density model with a Gaussian model around each of the training
objects:
 |
(4.6) |
The free parameter
is optimized by maximizing the likelihood on
the training data using leave-one-out. The classifier becomes as in
(4.4). This method often performs pretty well, but it
requires a reasonable training set. My second choice!
- autoenc_dd: the auto-encoder neural network. A full
explanation of neural networks is outside the scope of this manual,
please have a look at [Bis95]. The idea is that a neural
network is trained to reconstruct the input pattern
at the
output
NeurN
of the network.
The difference between the input and output pattern is used as a
characterization of the target class. This results in:
 |
(4.7) |
The classifier then becomes as in (4.2).
- kmeans_dd: the k-means data description, where the data
is described by
clusters, placed such that the average distance
to a cluster center is minimized. The cluster centers
are placed using the
standard
-means clustering procedure ([Bis95]). The target
class is then characterized by:
 |
|
|
(4.8) |
The classifier then becomes as in (4.2).
- kcenter_dd: the k-center data description, where the data
is described by
clusters, placed such that the maximal distance
to a cluster center is minimized [YD98]. When the clusters
are placed, the mathematical description of the method is similar to
kmeans_dd.
- pca_dd: Principal Component Analysis data description.
This method describes the target data by a linear subspace. This
subspace is defined by the eigenvectors of the data covariance
matrix
. Only
eigenvectors are used. Assume they are
stored in a
matrix
(where
is the
dimensionality of the original feature space).
To check if a new object fits the target subspace, the reconstruction
error is computed. The reconstruction error is the difference between
the original object
and the projection of that object onto the
subspace (in the original data). This projection is computed by:
 |
(4.9) |
For the reconstruction error I decided to use:
 |
(4.10) |
The default setting of the method is that the
eigenvectors with
the largest eigenvalues are used. It appears that this does not always
have the best performance [TM03]. It is also possible to
choose the
eigenvectors with the smallest eigenvalues. This
is an extra feature in the toolbox.
- som_dd: Self-Organizing Map data description. This method
uses a Self-Organizing Map to describe the data.
A SOM is an unsupervised clustering
method in which the cluster centers are constrained in their placing.
The construction of the SOM is such that all objects in the feature
space retain as much as possible their distance and neighborhood
relations in the mapped space.
The mapping is performed by a specific type of neural network,
equipped with a special learning rule. Assume that we want to map an
-dimensional measurement space to a
-dimensional feature space,
where
. In the feature space, we define a finite orthogonal grid
with
grid points
. At each grid point we
place a neuron. Each neuron stores an
-dimensional vector that
serves as a cluster center. By defining a grid for the neurons, each
neuron does not only have a neighboring neuron in the measurement
space, it also has a neighboring neuron in the grid. During the
learning phase, neighboring neurons in the grid are enforced to also
be neighbors in the measurement space. By doing so, the local
topology will be preserved. In this implementation the SOM only
or
.
To evaluate if a new object fits this model, again a reconstruction
error is defined. This reconstruction error is the difference between
the object and its closest cluster center (neuron) in the SOM:
 |
(4.11) |
This distance is thresholded in order to get a classification result.
- mst_dd: The minimum spanning tree data description. On the
training data a minimum spanning tree is fitted. The distance to the
edges is used as the similarity to the target class. That means
that a training target dataset of just two objects will define
sausage-shaped target class.
- nndd: A simple nearest neighbor method [Tax01]. Here
a new object is evaluated by computing the distance to its nearest
neighbor
. This distance is normalized by the nearest
neighbor distance of this object (that means the distance between
object
and
. Works not very well for low
dimensional, well-sampled data, but surprisingly well for low-sample,
high-dimensional data!
- knndd: A
-nearest neighbor data description, a much
smarter approach to the nndd method. In its most simple version
just the distance to the
-th nearest neighbor is used. Slightly
advanced methods use averaged distances, which works somewhat better.
This simple method is often very good in high dimensional feature
spaces.
- svdd: The support vector data description, the description
inspired by the support vector classifier. For a full discussion of
this method, please read [Tax01]. It basically fits a
hypersphere around the target class. By introducing kernels, this
inflexible model becomes much more powerful, and can give excellent
results when a suitable kernel is used. It is possible to
optimize the method to reject a pre-defined fraction of the target data.
That means that for different rejection rates, the shape of the
boundary changes. Furthermore it is possible to use example outliers
to improve the classification results. The main drawback of the method
is that it requires a difficult optimization. For this the
quadprog algorithm from the optimization toolbox is required.
This first implementation has the RBF kernel hard-coded. This is
because it is the most useful kernel, and it makes the implementation
and evaluation must simpler. When you want to use other kernels, you
have to have a look at one of the next two classifiers:
- ksvdd: The support vector data description using a general
kernel. The choice of kernel is free in this implementation, but
this makes the evaluation in general much slower. The implementation
is not the most convenient, so maybe I will change this again in the
future.
- incsvdd: The incremental support vector machine which uses
its own optimization routine. This makes it possible to optimize the
SVDD without the use of an external quadratic programming optimizer,
and to use any kernel. In future versions this will be adapted to cope
with dynamically changing data (data distributions which change in
time). Currently this is my preferred way to train a SVDD.
- dlpdd: The linear programming distance-data description
[PTD03]. This data descriptor is specifically constructed
to describe target objects which are represented in terms of distances
to other objects. In some cases it might be much easier to define
distances between objects than informative features (for instance when
shapes have to be distinguished). To stress that the classifier is
operating on distance data, the name starts with a d. The
classifier has basically the following form:
 |
(4.12) |
The weights
are optimized such that just a few weights stay
non-zero, and the boundary is as tight as possible around the data.
- lpdd: The linear programming data description.
The fact that dlpdd is using distance data instead of standard feature
data makes it harder to simply use it on normal feature data.
Therefore lpdd is created, which is basically a wrapper combining
some high level algorithms together to make the application of
dlpdd simpler.
- mpm_dd: The minimax probability machine by Lanckriet
[LEGJ03]. It tries to find the linear classifier that
separates the data from the origin, rejecting maximally a specific
fraction of the target data. In the original version, an upper bound
on this rejection error is used (applying a very general bound using
only the mean and covariance matrix of the target data).
Unfortunately, in practice this bound is so loose that it is not
useful. Therefore the rejection threshold is re-derived from the
target data.
To get more information for each individual classifier, have a look at
their help (for instance