Gesture Recognition with a Deep Neural Network
Previously on Camshift
In the last TP, we used the Camshift algorithm to track the hand and take pictures of it. Then we resized each picture to a 16×16 matrix and stored it as the data set to train our neural network.
I spent an hour capturing gestures of my hand and only got about 1500 pictures. So, in order to have more data to train the network, I randomly selected 1000 of the pictures I have, rotated each of them by a random angle, and added them to the data (the code for selecting and rotating them is here). So now I have a data set with 2500 pictures.
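For reference, the rotation step can be done with OpenCV's `getRotationMatrix2D` and `warpAffine`; this is a minimal sketch, assuming the pictures are 16×16 grayscale and the angle is drawn uniformly from [0, 360):

```python
import cv2 as cv
import numpy as np

def random_rotate(img):
    # rotate a 16x16 image around its center by a random angle
    angle = np.random.uniform(0, 360)              # assumed angle range
    M = cv.getRotationMatrix2D((8, 8), angle, 1.0)
    return cv.warpAffine(img, M, (16, 16))
```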
We have four letters to predict: C, V, I, and O.
We will use two kinds of networks: an MLP and a CNN.
1. Multilayer Perceptron
A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. For the MLP, we will use an existing Python file from the OpenCV samples, letter_recog.py.
1.1 Load data and preprocessing
First, we define a function to load the data and do some preprocessing.

```python
import numpy as np

def load_base(path):
    # load the CSV file; the first column is the letter, converted to a number
    data = np.loadtxt(path, np.float32, delimiter=',',
                      converters={0: lambda ch: convertFun(ord(ch))})
    # shuffle the rows
    index = [i for i in range(len(data))]
    np.random.shuffle(index)
    data = data[index]
    # split into samples (pixels) and responses (labels)
    samples, responses = data[:, 1:], data[:, 0]
    return samples, responses

def convertFun(letter):
    if chr(letter) == 'C':
        return 0
    elif chr(letter) == 'V':
        return 1
    elif chr(letter) == 'I':
        return 2
    elif chr(letter) == 'O':
        return 3
```
We first load the data from the .txt file and use a convert function to turn letters into numbers. Then we split the data into samples and responses, or you can call them X and Y.
Then, in the class LetterStatModel, I modified the function unroll_responses to create the one-hot array for dimension 4 (the previous code was for 26 letters, but now I only have 4), with the help of np_utils.to_categorical() from keras.utils.

```python
def unroll_responses(self, responses):
    # one-hot encode the labels for the 4 classes
    new_responses = np_utils.to_categorical(responses, 4)
    return new_responses
```
1.2 Structure of the neural network
We can only modify two things for this MLP: the number of hidden layers and the number of neurons in each layer. Since the pictures are simple and we only have four letters, I tried one and two hidden layers with the number of neurons equal to (5, 10, 15, 20, …, 200).
single hidden layer
For the single-layer MLP with 40 neurons, we get the best result: 90.89% accuracy on the validation set. As the number of neurons becomes larger, the accuracy decreases.
two hidden layers
For two hidden layers, the best result showed up at 55 neurons per layer, with an accuracy of 91.29%. We don't see much difference between the two kinds of MLP, but I still chose to use the second one.
The code for training the MLP is here.
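Since letter_recog.py builds its MLP with OpenCV's ml module, the chosen two-hidden-layer configuration would look roughly like this; a minimal sketch, assuming 16×16 = 256 inputs, sigmoid activations, and backprop training (the exact training parameters in my script may differ):

```python
import cv2 as cv
import numpy as np

model = cv.ml.ANN_MLP_create()
# 256 inputs, two hidden layers of 55 neurons, 4 outputs (C, V, I, O)
model.setLayerSizes(np.int32([256, 55, 55, 4]))
model.setActivationFunction(cv.ml.ANN_MLP_SIGMOID_SYM, 2, 1)
model.setTrainMethod(cv.ml.ANN_MLP_BACKPROP, 0.001)
# samples: float32 pixel rows; new_responses: one-hot labels from unroll_responses
model.train(samples, cv.ml.ROW_SAMPLE, new_responses)
```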
2. Convolutional Neural Network
An introduction to CNNs is in my other markdown.
This time I used Keras with the TensorFlow backend to build the CNN.
2.1 Load data and preprocessing
Data loading is the same as for the MLP, but the preprocessing is a little different. This time we have more work to do.
shuffle and split the data
```python
data = np.loadtxt(path, np.float32, delimiter=',',
                  converters={0: lambda ch: convertFun(ord(ch))})
```
The most important part here is to shuffle the data. When I first tried the CNN, the results were not satisfying. I tried modifying everything, but none of it worked. Then I realised that when I took the pictures, I would press the C button about 50 times in a row to store the gesture of C, then 50 times for O, then V, and so on. So the data I fed to the network was ordered, which may influence the performance of the network. Once I shuffled the data first, the accuracy became much better than before.
After shuffling, I split the data so that I have 80% of the data for training and 20% for testing.
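A minimal sketch of the shuffle-then-split step on the `data` array loaded above (np.random.shuffle permutes whole rows, so each sample stays paired with its label):

```python
# shuffle the rows of `data`, then take an 80/20 train/test split
np.random.shuffle(data)
split = int(0.8 * len(data))
x_train, y_train = data[:split, 1:], data[:split, 0]
x_test, y_test = data[split:, 1:], data[split:, 0]
```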
reshape the format of the picture
For a set of pictures, we have two kinds of formats to represent it. One is channels_first, like (100, 3, 16, 16): the first number is the number of samples, the second is the number of channels, and the last two are the height and width of the picture. There is also the channels_last format, like (100, 16, 16, 3), in which the number of channels comes last.
After that, I convert the data into float32 and normalize it in order to speed up the computation.

```python
from keras import backend as K

img_row, img_col = 16, 16
# pick the tensor layout the backend expects
if K.image_data_format() == 'channels_first':
    shape_ord = (1, img_row, img_col)
else:
    shape_ord = (img_row, img_col, 1)

x_train = x_train.reshape((x_train.shape[0],) + shape_ord)
x_test = x_test.reshape((x_test.shape[0],) + shape_ord)

# convert to float32 and scale pixel values to [0, 1]
x_train = x_train.astype(np.float32)
x_test = x_test.astype(np.float32)
x_train /= 255
x_test /= 255
```
Then I convert the Y into one-hot encoding.

```python
nb_class = 4
y_train = np_utils.to_categorical(y_train, nb_class)
y_test = np_utils.to_categorical(y_test, nb_class)
```
2.2 Structure of the neural network
For the CNN I tried two structures: one is quite simple, with only one convolution layer (call it the simple CNN); the other is the classic LeNet.
simple CNN
```python
def simpleCNN(kernel_size=(3, 3), activation='relu'):
```
As we can see, it's quite simple: one Conv2D with 16 kernels, one max-pooling layer, and one Dense layer with 128 units. The kernel size is 3×3.
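Only the signature is shown above, but based on that description the body would look something like this; a minimal sketch (reusing shape_ord and nb_class from the preprocessing) in which the pool size, the softmax output, and the exact layer order are assumptions:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def simpleCNN(kernel_size=(3, 3), activation='relu'):
    model = Sequential()
    # one convolution layer with 16 kernels
    model.add(Conv2D(16, kernel_size, activation=activation,
                     input_shape=shape_ord))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    # one dense layer with 128 units, then the 4-class output
    model.add(Dense(128, activation=activation))
    model.add(Dense(nb_class, activation='softmax'))
    return model
```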
LeNet
```python
def LeNet(kernel_size=(5, 5), activation='relu'):
```
Click here to see an SVG picture of the network.
It has 3 Conv2D layers, 2 average-pooling layers, and 2 dense layers.
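Again only the signature is shown; here is a sketch consistent with the layer counts above and the output-size calculations in section 2.4 (the filter counts 6/16/120 and the 84-unit dense layer are assumptions borrowed from the classic LeNet-5):

```python
from keras.models import Sequential
from keras.layers import Conv2D, AveragePooling2D, Flatten, Dense

def LeNet(kernel_size=(5, 5), activation='relu'):
    model = Sequential()
    model.add(Conv2D(6, kernel_size, activation=activation,
                     input_shape=shape_ord))
    # pool strides (1 then 2) follow the size calculations in section 2.4
    model.add(AveragePooling2D(pool_size=(2, 2), strides=1))
    model.add(Conv2D(16, kernel_size, activation=activation))
    model.add(AveragePooling2D(pool_size=(2, 2), strides=2))
    # last conv kernel shrinks the output to 1x1: 3x3 here for 5x5 kernels,
    # but it must become 5x5 when using 3x3 kernels (see section 2.4)
    model.add(Conv2D(120, (3, 3), activation=activation))
    model.add(Flatten())
    model.add(Dense(84, activation=activation))
    model.add(Dense(nb_class, activation='softmax'))
    return model
```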
2.3 Choice of epochs
In order to prevent them from overfitting, I trained both of them for 100 epochs and observed at which epoch the validation accuracy stops increasing.
LeNet (validation-accuracy curve over 100 epochs)
Simple CNN (validation-accuracy curve over 100 epochs)
For LeNet, the accuracy stopped increasing at around the 30th epoch, so I decided to train both LeNet and the simple CNN for 30 epochs.
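A sketch of how curves like the ones above can be produced, assuming we track validation accuracy through Keras's fit history (the history key is 'val_acc' in the Keras versions of that era, and the batch size here is an assumption):

```python
import matplotlib.pyplot as plt

# train for 100 epochs and record validation accuracy per epoch
model = LeNet()
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=100, batch_size=32,
                    validation_data=(x_test, y_test), verbose=0)

plt.plot(history.history['val_acc'])   # look for where the curve flattens
plt.xlabel('epoch')
plt.ylabel('validation accuracy')
plt.show()
```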
2.4 Choice of parameters
Here I want to compare the performance of different optimizers and different kernel sizes.
Optimizer
For an introduction to the different gradient descent methods, see my other markdown.
Here I tried SGD and Adam.
1. SGD
2. Adam
We can see that SGD converges faster, but Adam is much smoother and more stable.
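The two runs only differ in the optimizer passed to compile(); a sketch, with the SGD learning rate as an assumption since the post doesn't state it:

```python
from keras.optimizers import SGD, Adam

sgd = SGD(lr=0.01)     # assumed learning rate
adam = Adam(lr=0.001)  # Keras default

model.compile(loss='categorical_crossentropy',
              optimizer=adam,  # or optimizer=sgd
              metrics=['accuracy'])
```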
Kernel size
I tried 3×3 and 5×5 kernels for the two structures.
But for LeNet, I have to modify the last convolution layer when the padding option is 'valid'.
The formula to calculate the output size for each layer is:

For 'valid' padding: $O=\left\lceil\frac{W-F+1}{S}\right\rceil$

For 'same' padding: $O=\left\lceil\frac{W}{S}\right\rceil$

where W is the input size, F is the size of the kernel, and S is the stride.
3×3 kernel
Conv1:$\left\lceil\frac{(16-3+1)}{1}\right\rceil=14$
Averagepool1:$\left\lceil\frac{(14-2+1)}{1}\right\rceil=13$
Conv2:$\left\lceil\frac{(13-3+1)}{1}\right\rceil=11$
Averagepool2:$\left\lceil\frac{(11-2+1)}{2}\right\rceil=5$
Conv3:$\left\lceil\frac{(5-5+1)}{1}\right\rceil=1$
5×5 kernel
Conv1:$\left\lceil\frac{(16-5+1)}{1}\right\rceil=12$
Averagepool1:$\left\lceil\frac{(12-2+1)}{1}\right\rceil=11$
Conv2:$\left\lceil\frac{(11-5+1)}{1}\right\rceil=7$
Averagepool2:$\left\lceil\frac{(7-2+1)}{2}\right\rceil=3$
Conv3:$\left\lceil\frac{(3-3+1)}{1}\right\rceil=1$
But I don't see any obvious difference between the 5×5 and 3×3 kernels.
2.5 K-fold Cross-Validation
I used k-fold cross validation to test the models.
K-fold cross validation
- Split the data into k parts (usually 5 or 10)
- For each of the k parts, use that part as the test set and the others as the training set
- Compute the score (e.g. MSE or accuracy) for that iteration
- Average the k scores
I first tried to use the scikit-learn function StratifiedKFold().split() (I should have used KFold() instead, thanks to Melissa) to split the data into k parts, but it turns out that this function can only handle binary or multiclass targets, not one-hot encoded labels, so it didn't work.
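For reference, the suggested KFold() alternative would look something like this untested sketch; it splits index arrays, so one-hot labels are not a problem:

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True)
for train_idx, test_idx in kf.split(X):   # X, Y as in kfoldSplit below
    x_tr, x_te = X[train_idx], X[test_idx]
    y_tr, y_te = Y[train_idx], Y[test_idx]
    # train and evaluate a fresh model on this fold
```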
So I had to write it myself; fortunately, it isn't difficult (if I wrote it right).

```python
def kfoldSplit(X, Y, k=10):
    total_size = X.shape[0]
    percentage = 1 / k
    size = int(total_size * percentage)
    start = 0
    end = size
    x_train, y_train, x_test, y_test = [], [], [], []
    for i in range(k):
        # the i-th slice becomes the test fold
        x_test.append(X[start:end, :, :, :])
        y_test.append(Y[start:end, :])
        # everything outside the slice becomes the training fold
        # (2522 is the hard-coded total number of samples)
        x_train.append(np.concatenate((X[:start, :, :, :],
                                       X[end:2522, :, :, :]), axis=0))
        y_train.append(np.concatenate((Y[:start, :],
                                       Y[end:2522, :]), axis=0))
        start = end
        end += size
    return [x_train, y_train, x_test, y_test]
```
With this function, I can apply the k-fold cross-validation.

```python
cv_scores = []
for i in range(k):
    # build and train a fresh model on each fold
    model = CNN.LeNet()
    model.compile(loss='categorical_crossentropy',
                  optimizer=CNN.adam,
                  metrics=['accuracy'])  # evaluation metric
    model.fit(kfold[0][i], kfold[1][i],
              epochs=e,
              batch_size=size,
              verbose=0)
    score = model.evaluate(kfold[2][i], kfold[3][i], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], score[1] * 100))
    cv_scores.append(score[1] * 100)
```
Evaluating the two CNN models I saved before (k = 5):
Using cross-validation to find hyperparameters
We can also use the k-fold validation to find the best hyperparameters for the model.
I tried to find the best batch size and number of epochs by grid search:
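A sketch of the grid search over the kfoldSplit() folds; the candidate values below are placeholders, not the grid actually used:

```python
batch_sizes = [16, 32, 64]    # placeholder grid
epoch_counts = [10, 20, 30]   # placeholder grid
results = {}
for bs in batch_sizes:
    for e in epoch_counts:
        scores = []
        for i in range(k):
            model = CNN.LeNet()
            model.compile(loss='categorical_crossentropy',
                          optimizer='adam', metrics=['accuracy'])
            model.fit(kfold[0][i], kfold[1][i],
                      epochs=e, batch_size=bs, verbose=0)
            scores.append(model.evaluate(kfold[2][i], kfold[3][i],
                                         verbose=0)[1])
        results[(bs, e)] = np.mean(scores)
print(max(results, key=results.get))  # the best (batch size, epochs) pair
```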
The result is clear. But the question is why none of them can match the two models I built before, even with the same parameters. (I haven't figured it out yet.)
2.6 Visualization
Test Visualization
Randomly choose 5 pictures from the test set and make predictions.
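A sketch of this step, assuming model.predict() on the reshaped test set and matplotlib for display:

```python
import matplotlib.pyplot as plt

letters = ['C', 'V', 'I', 'O']
idx = np.random.choice(len(x_test), 5, replace=False)
preds = model.predict(x_test[idx])
for n, i in enumerate(idx):
    plt.subplot(1, 5, n + 1)
    plt.imshow(x_test[i].reshape(16, 16), cmap='gray')
    plt.title(letters[np.argmax(preds[n])])  # predicted letter on top
    plt.axis('off')
plt.show()
```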
Output of the convolution layers
Output of the first Conv2d
Output of the second Conv2d
Output of the third Conv2d
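These feature maps can be pulled out with a Keras backend function; a minimal sketch, where the layer index is an assumption that depends on the exact model definition:

```python
from keras import backend as K

# map the model input to the first Conv2D's output
get_conv1_out = K.function([model.layers[0].input],
                           [model.layers[0].output])
# shape (1, 14, 14, 16) for the simple CNN with channels_last and 3x3 kernels
conv1_out = get_conv1_out([x_test[:1]])[0]
```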
3. Apply the model to Camshift
After I had the model, it was easy to apply it to the camshift demo.

```python
import letter_recog_NN as lcn

# inside the key-handling code of the camshift demo:
elif ch == ord('p'):
    xs0, ys0, xs1, ys1 = self.track_window
    # crop the tracked hand region and resize it to 16x16
    small_pic = cv.resize(prob[ys0:ys1 + ys0, xs0:xs1 + xs0], dsize=(16, 16))
    small_pic = small_pic.astype(np.float32)
    lcn.showResult(small_pic)
```
The code in letter_recog_NN is here.
So now I can make a gesture and press the p button to get a prediction from the two neural networks, MLP and CNN. The result is the captured picture with the prediction on top.
Problems
Why is the accuracy worse when I use LeNet than when I use the SimpleCNN with only one convolution layer?
It may be because my data set was gathered under the same light condition, or because I have few letters (just four). So the "function" I want to fit is simple (for example, a straight line). With more convolution layers (and other layers), I have more parameters. So what I'm doing is using a complicated function to simulate a simple one, which means many of the parameters are useless and should be set to zero, and I don't have enough data to train them to zero.
So, if I want the LeNet model to match the performance of the SimpleCNN, what I can do is:
- Gather more data under the same light condition, so that I have enough data to force some parameters to zero. But I think this can only make LeNet get closer to the SimpleCNN; it will never perform better than the simple one.
- Make the function more complicated. Since the cause of my problem is that the function is too simple, I can make the function more complex by adding more letters or taking more photos under different light conditions. In this case, LeNet could get better generalization ability than the simple one, which is what I'm looking for.