K-Fold cross validation

C++ function to create random balanced indices for use in K-Fold cross validation.

description


When working with AI, most of the time you have to train your implementation on a data set.

This raises a well-known problem: if you train your machine on the whole set, you get a machine that is perfectly trained for that data set, but you have no way of knowing how well it will do on data it has not seen.

For that reason you usually create two groups: the first one is used for training, and the second one for testing.

A first approach to creating these two groups is to take the training data from the beginning of the whole set up to some index, and use everything from there on as testing data.

The problem with this easy technique is that either group can end up as a non-representative sample of the original data (for example, when the set is sorted by class), so both the training and the testing could be wrong.

A better solution is to use, for example, K-fold cross validation, where you randomly divide the data into K balanced boxes.

Once you have the K boxes, you iterate from 1 to K, and on each step you use box(i) for testing while all the other boxes are used for training. This gives you a much better way to train and test on your data.
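The iteration described above can be sketched roughly as follows. This is only an illustration, not part of the original download: `FoldSplit` and `splitFold` are hypothetical names, and the box labels are assumed to be integers in 0..K-1, one per sample (so a caller would loop from 0 to K-1 rather than from 1 to K).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type: sample positions used for training and testing
// in one step of the cross validation.
struct FoldSplit {
    std::vector<int> train;
    std::vector<int> test;
};

// Assumes labels[i] is the box (0..K-1) that sample i was assigned to.
// Samples in box `fold` go to the test group, all others to training.
FoldSplit splitFold(const std::vector<int>& labels, int fold) {
    FoldSplit split;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        if (labels[i] == fold)
            split.test.push_back(static_cast<int>(i));
        else
            split.train.push_back(static_cast<int>(i));
    }
    return split;
}
```

A caller would then run `for (int fold = 0; fold < K; ++fold)`, train on `split.train`, evaluate on `split.test`, and typically average the K scores.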

There are many implementations of K-fold cross validation for languages such as MATLAB, but it was hard for me to find one for C++, so I wrote a simple implementation that I think could be useful for someone else. Here it is.

The following function is just a C++ conversion of the implementation in the Wikipedia article on the Knuth shuffle:

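#include <cstdlib>	// rand

// In-place Knuth (Fisher-Yates) shuffle of array[0..size-1].
// Call srand() once beforehand, otherwise rand() yields the same order on every run.
void shuffleArray(int* array,int size) 
{
	int n = size;
	while (n > 1) 
	{
		// 0 <= k < n.
		int k = rand()%n;		
		
		// n is now the last pertinent index;
		n--;					
		
		// swap array[n] with array[k]
		int temp = array[n];	
		array[n] = array[k];
		array[k] = temp;
	}
}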
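As a side note, `rand()%n` is slightly biased whenever `n` does not evenly divide the number of values `rand()` can return. If a C++11 compiler is available, the same shuffle could be sketched with the standard library instead. The name `shuffleLabels` and the explicit seed parameter are assumptions made for this example; they are not part of the original code.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical C++11 variant of shuffleArray: shuffles the labels in place
// using std::shuffle and a Mersenne Twister engine seeded by the caller.
void shuffleLabels(std::vector<int>& labels, unsigned int seed) {
    std::mt19937 engine(seed);
    std::shuffle(labels.begin(), labels.end(), engine);
}
```

Passing `std::random_device{}()` as the seed gives a different order on each run, while a fixed seed makes the split reproducible.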

And the next one is the simple function that returns the array of indices used to arrange the cross validation:

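#include <cmath>	// ceil

// Returns a new[]-allocated array with one box label (0..k-1) per sample;
// the caller must release it with delete[]. Assumes 1 <= k <= size.
int* kfold(int size,int k)
{
	int* indices=new int[size];
	double inc=(double)k/size;

	// assign the labels in consecutive runs of nearly equal length
	for (int i=0;i<size;i++)
		indices[i]=(int)ceil((i+0.9)*inc)-1;

	// randomize which sample gets which label
	shuffleArray(indices,size);

	return indices;
}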
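The `ceil((i+0.9)*inc)-1` expression relies on floating-point rounding. A possible alternative, sketched here under the assumption that 1 <= k <= size, assigns the labels with integer arithmetic only and returns a `std::vector`, so there is nothing to `delete[]`. `kfoldLabels` and its seed parameter are hypothetical; this is not the function from the download.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical alternative to kfold(): sample i gets label i*k/size (integer
// division), which yields values 0..k-1 in runs whose lengths differ by at
// most one. The labels are then shuffled with the given seed.
// Assumes 1 <= k <= size.
std::vector<int> kfoldLabels(int size, int k, unsigned int seed) {
    std::vector<int> labels(size);
    for (int i = 0; i < size; ++i)
        labels[i] = static_cast<int>(static_cast<long long>(i) * k / size);

    std::mt19937 engine(seed);
    std::shuffle(labels.begin(), labels.end(), engine);
    return labels;
}
```

With size=45 and k=6 this produces only the labels 0 to 5, with 7 or 8 samples in each box.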

I've also included a small VS2003 project that you can run to see the result using size=20 and k=4:

2 2 1 0 1 1 1 3 0 1 0 2 3 2 0 3 2 3 3 0
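A quick way to confirm that such an output is balanced is to count how often each label appears. The helper below is a small sketch for that purpose; `countPerFold` is a hypothetical name and is not included in the download.

```cpp
#include <vector>

// Hypothetical check: counts how many samples carry each label 0..k-1.
// Labels outside that range are tallied in the extra last slot, so a
// non-zero final entry signals an out-of-range label.
std::vector<int> countPerFold(const std::vector<int>& labels, int k) {
    std::vector<int> counts(k + 1, 0);
    for (int label : labels) {
        if (label >= 0 && label < k)
            ++counts[label];
        else
            ++counts[k];
    }
    return counts;
}
```

For the output above (size=20, k=4) it reports 5 samples in each of the four boxes and nothing out of range.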

downloads


kfold implementation with VS2003 project and executable

comments


#1 posted by Raied at 2010-04-13 09:24:17

I ran your program with size=45 and k=6, and the results are:

0 0 1 3 3 4 5 5 3 2 4 0 0 3 4 1 5 5 2 3 2 4 2 5 4 0 2 1 1 1 1 4 2 3 0 2 3 6 1 0 3 5 1 4 5

As you can see, there is a number 6! Would you please send us an explanation? Thanks

#2 posted by kile at 2010-04-13 22:27:04

Hi Raied,
Exactly, it was a bug in the code!
I just corrected it; it was related to the offset used in the loop while calculating the indices.
You can download the new version above.
Thank you very much for the comment ;)

#3 posted by Shamma at 2010-10-25 18:34:59

Hi, I tried your code. The following is part of my code, written in Octave:

index = zeros(1,N);
partition = K/N;
for i=1:N
  index(1,i) = ceil((i+0.9)*partition)-1;
end
end

I put in the values:

octave-3.2.4.exe:125> partitioning_KFold(15,6)
ans = 0 1 1 1 2 2 3 3 3 4 4 5 5 5 6

Any idea why I have 0 at the beginning and 6 at the end? I want 1 through 6, and I need 6 to occur more.

#4 posted by Kartar at 2011-05-13 06:21:53

Hi, now please explain how to use these indices to separate train and test samples.

#5 posted by s at 2011-07-21 10:49:12

Hi, do you have this code in MATLAB? Please send it to me. Thanks
