

Cross-validation

The toolbox provides a procedure to facilitate cross-validation. In cross-validation a dataset is split into $B$ batches. Of these, $B-1$ batches are used to train a classifier, and the left-out batch is used to evaluate it. This is repeated $B$ times, so that each batch is left out once, and the performances are averaged. The advantage is that, given a limited training set, it is still possible to obtain a relatively good classifier and to estimate its performance on an independent set.
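As an illustration of the splitting and averaging described above, here is a minimal sketch in plain MATLAB that does not use any toolbox function. The toy data, the variable names (`batch`, `err`, `cv_err`) and the placeholder "model" (the mean of the training values) are assumptions made for this example only; they are not part of the toolbox.

```matlab
% B-fold cross-validation on toy data, without toolbox calls.
x = (1:20)';                 % toy data: 20 scalar objects
B = 5;                       % number of batches (folds)
n = numel(x);

% Assign each object to a batch: labels 1..B repeated, then shuffled.
batch = mod(0:n-1, B)' + 1;
batch = batch(randperm(n));

err = zeros(1,B);            % one performance value per fold
for b = 1:B
    train = x(batch ~= b);   % the B-1 batches used for training
    test  = x(batch == b);   % the left-out batch used for evaluation
    m = mean(train);                 % placeholder "model" (assumption)
    err(b) = mean((test - m).^2);    % placeholder error on the left-out batch
end
cv_err = mean(err);          % average over the B folds
fprintf('CV error (%d-fold): %5.3f (%5.3f)\n', B, cv_err, std(err));
```

Each object ends up in the evaluation set exactly once, so every fold's error is computed on data the placeholder model has not seen.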

In practice, this cross-validation procedure is applied over and over again, not only to evaluate and compare the performance of classifiers, but also to optimize hyperparameters. To keep the procedure as flexible as possible, the cross-validation itself is kept as simple as possible. An index vector is generated that indicates to which batch each object in the training set belongs. By repeatedly applying the procedure, the batches are combined into a training set and an evaluation set, with a different batch left out each time. The following piece of code shows how this is done in practice:

a = oc_set(gendatb,1);  % make or get some data
nrbags = 10;            % we are doing 10-fold cross-validation
I = nrbags;             % initialization: pass the number of folds first
e = zeros(1,nrbags);    % storage for the AUC of each fold
                        % now start the 10 folds:
for i=1:nrbags
    % extract the training (x) and validation (z) sets, and
    % update the index vector I:
    [x,z,I] = dd_crossval(a,I);

    % do something useful with the training and evaluation sets:
    w = gauss_dd(x,0.1);
    e(i) = dd_auc(z*w*dd_roc);
end
% report the mean and standard deviation of the AUC over the folds:
fprintf('AUC (10-fold) %5.3f (%5.3f)\n',mean(e),std(e));

Note that the procedure takes class priors into account: it tries to keep the number of objects per class in each fold in proportion to the class sizes in the total dataset.
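One way to obtain folds that respect the class proportions is to assign batch numbers separately within each class. The sketch below shows this idea in plain MATLAB; it is an assumption about what such a stratified assignment can look like, not the toolbox's actual implementation, and the names `lab`, `batch` and `counts` are made up for the example.

```matlab
% Stratified assignment of objects to B batches (illustrative sketch).
lab = [ones(30,1); 2*ones(10,1)];  % toy labels: 30 objects of class 1, 10 of class 2
B = 5;                             % number of batches (folds)
batch = zeros(size(lab));          % index vector: batch number per object

classes = unique(lab);
for c = 1:numel(classes)
    idx = find(lab == classes(c));             % objects of this class
    idx = idx(randperm(numel(idx)));           % shuffle within the class
    batch(idx) = mod(0:numel(idx)-1, B)' + 1;  % deal them out round-robin
end

% Count the objects per batch (rows) and per class (columns).
counts = accumarray([batch lab], 1);
disp(counts);
```

With these toy sizes every batch receives 6 objects of class 1 and 2 of class 2, so each fold keeps the 3:1 ratio of the full dataset. When a class size is not a multiple of the number of batches, the counts can differ by one between batches.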


David M.J. Tax 2006-07-26