Gesture Recognition with a Deep Neural Network
Previously on Camshift
In the last TP, we used the Camshift algorithm to track the hand and take pictures of it. Then we resized each picture to a 16×16 matrix and stored it as the data set to train our neural network.
I spent an hour capturing gestures of my hand and only got about 1500 pictures. So, in order to have more data to train the network, I randomly selected 1000 of the pictures I have, rotated each of them by a random angle, and added them to the data (the code for selecting and rotating them is here). So now I have a data set with 2500 pictures.
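For reference, the rotation step can be done with OpenCV's `getRotationMatrix2D` and `warpAffine`; this is a minimal sketch, assuming the pictures are 16×16 grayscale and the angle is drawn uniformly from [0, 360):

```python
import cv2 as cv
import numpy as np

def random_rotate(img):
    # rotate a 16x16 image around its center by a random angle
    angle = np.random.uniform(0, 360)              # assumed angle range
    M = cv.getRotationMatrix2D((8, 8), angle, 1.0)
    return cv.warpAffine(img, M, (16, 16))
```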
We have four letters to predict: C, V, I, and O.
We will use two kinds of networks: an MLP and a CNN.
1. Multilayer Perceptron
A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. For the MLP, we will use an existing Python file from the OpenCV samples, letter_recog.py.
1.1 Load data and preprocessing
First, we define a function to load the data and do some preprocessing.

```python
import numpy as np

def load_base(path):
    # load the CSV file; the first column is the letter, converted to a number
    data = np.loadtxt(path, np.float32, delimiter=',',
                      converters={0: lambda ch: convertFun(ord(ch))})
    # shuffle the rows
    index = [i for i in range(len(data))]
    np.random.shuffle(index)
    data = data[index]
    # split into samples (pixels) and responses (labels)
    samples, responses = data[:, 1:], data[:, 0]
    return samples, responses

def convertFun(letter):
    if chr(letter) == 'C':
        return 0
    elif chr(letter) == 'V':
        return 1
    elif chr(letter) == 'I':
        return 2
    elif chr(letter) == 'O':
        return 3
```
We first load the data from the .txt file and use a convert function to turn letters into numbers. Then we split the data into samples and responses, or you can call them X and Y.
Then, in the class LetterStatModel, I modified the function unroll_responses to create the one-hot array for dimension 4 (the previous code was for 26 letters, but now I only have 4), with the help of np_utils.to_categorical() from keras.utils.

```python
def unroll_responses(self, responses):
    # one-hot encode the labels for the 4 classes
    new_responses = np_utils.to_categorical(responses, 4)
    return new_responses
```
1.2 Structure of the neural network
We can only modify two things for this MLP: the number of hidden layers and the number of neurons in each layer. Since the pictures are simple and we only have four letters, I tried one and two hidden layers with the number of neurons equal to (5, 10, 15, 20, …, 200).
single hidden layer
For the single-layer MLP with 40 neurons, we get the best result: 90.89% accuracy on the validation set. As the number of neurons becomes larger, the accuracy decreases.
two hidden layers
For two hidden layers, the best result showed up at 55 neurons per layer, with an accuracy of 91.29%. We don't see much difference between the two kinds of MLP, but I still chose to use the second one.
The code for training the MLP is here.
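Since letter_recog.py builds its MLP with OpenCV's ml module, the chosen two-hidden-layer configuration would look roughly like this; a minimal sketch, assuming 16×16 = 256 inputs, sigmoid activations, and backprop training (the exact training parameters in my script may differ):

```python
import cv2 as cv
import numpy as np

model = cv.ml.ANN_MLP_create()
# 256 inputs, two hidden layers of 55 neurons, 4 outputs (C, V, I, O)
model.setLayerSizes(np.int32([256, 55, 55, 4]))
model.setActivationFunction(cv.ml.ANN_MLP_SIGMOID_SYM, 2, 1)
model.setTrainMethod(cv.ml.ANN_MLP_BACKPROP, 0.001)
# samples: float32 pixel rows; new_responses: one-hot labels from unroll_responses
model.train(samples, cv.ml.ROW_SAMPLE, new_responses)
```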
2. Convolutional Neural Network
An introduction to CNNs is in my other markdown.
This time I used Keras with the TensorFlow backend to build the CNN.
2.1 Load data and preprocessing
Data loading is the same as for the MLP, but the preprocessing is a little different. This time we have more work to do.
shuffle and split the data
```python
data = np.loadtxt(path, np.float32, delimiter=',',
                  converters={0: lambda ch: convertFun(ord(ch))})
```
The most important part here is to shuffle the data. When I first tried the CNN, the results were not satisfying. I tried modifying everything, but none of it worked. Then I realised that when I took the pictures, I would press the C button about 50 times in a row to store the gesture of C, then 50 times for O, then V, and so on. So the data I fed to the network was ordered, which may influence the performance of the network. Once I shuffled the data first, the accuracy became much better than before.
After shuffling, I split the data so that I have 80% of the data for training and 20% for testing.
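A minimal sketch of the shuffle-then-split step on the `data` array loaded above (np.random.shuffle permutes whole rows, so each sample stays paired with its label):

```python
# shuffle the rows of `data`, then take an 80/20 train/test split
np.random.shuffle(data)
split = int(0.8 * len(data))
x_train, y_train = data[:split, 1:], data[:split, 0]
x_test, y_test = data[split:, 1:], data[split:, 0]
```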
reshape the format of the picture
For a set of pictures, we have two kinds of formats to represent it. One is channels_first, like (100, 3, 16, 16): the first number is the number of samples, the second is the number of channels, and the last two are the height and width of the picture. There is also the channels_last format, like (100, 16, 16, 3), in which the number of channels comes last.
After that, I convert the data into float32 and normalize it in order to speed up the computation.

```python
from keras import backend as K

img_row, img_col = 16, 16
# pick the tensor layout the backend expects
if K.image_data_format() == 'channels_first':
    shape_ord = (1, img_row, img_col)
else:
    shape_ord = (img_row, img_col, 1)

x_train = x_train.reshape((x_train.shape[0],) + shape_ord)
x_test = x_test.reshape((x_test.shape[0],) + shape_ord)

# convert to float32 and scale pixel values to [0, 1]
x_train = x_train.astype(np.float32)
x_test = x_test.astype(np.float32)
x_train /= 255
x_test /= 255
```
Then I convert the Y into one-hot encoding.

```python
nb_class = 4
y_train = np_utils.to_categorical(y_train, nb_class)
y_test = np_utils.to_categorical(y_test, nb_class)
```
2.2 Structure of the neural network
For the CNN I tried two structures: one is quite simple, with only one convolution layer (call it the simple CNN); the other is the classic LeNet.
simple CNN
```python
def simpleCNN(kernel_size=(3, 3), activation='relu'):
```
As we can see, it's quite simple: one Conv2D with 16 kernels, one max-pooling layer, and one Dense layer with 128 units. The kernel size is 3×3.
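Only the signature is shown above, but based on that description the body would look something like this; a minimal sketch (reusing shape_ord and nb_class from the preprocessing) in which the pool size, the softmax output, and the exact layer order are assumptions:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def simpleCNN(kernel_size=(3, 3), activation='relu'):
    model = Sequential()
    # one convolution layer with 16 kernels
    model.add(Conv2D(16, kernel_size, activation=activation,
                     input_shape=shape_ord))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    # one dense layer with 128 units, then the 4-class output
    model.add(Dense(128, activation=activation))
    model.add(Dense(nb_class, activation='softmax'))
    return model
```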
LeNet
```python
def LeNet(kernel_size=(5, 5), activation='relu'):
```
Click here to see an SVG picture of the network.
It has 3 Conv2D layers, 2 average-pooling layers, and 2 dense layers.
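Again only the signature is shown; here is a sketch consistent with the layer counts above and the output-size calculations in section 2.4 (the filter counts 6/16/120 and the 84-unit dense layer are assumptions borrowed from the classic LeNet-5):

```python
from keras.models import Sequential
from keras.layers import Conv2D, AveragePooling2D, Flatten, Dense

def LeNet(kernel_size=(5, 5), activation='relu'):
    model = Sequential()
    model.add(Conv2D(6, kernel_size, activation=activation,
                     input_shape=shape_ord))
    # pool strides (1 then 2) follow the size calculations in section 2.4
    model.add(AveragePooling2D(pool_size=(2, 2), strides=1))
    model.add(Conv2D(16, kernel_size, activation=activation))
    model.add(AveragePooling2D(pool_size=(2, 2), strides=2))
    # last conv kernel shrinks the output to 1x1: 3x3 here for 5x5 kernels,
    # but it must become 5x5 when using 3x3 kernels (see section 2.4)
    model.add(Conv2D(120, (3, 3), activation=activation))
    model.add(Flatten())
    model.add(Dense(84, activation=activation))
    model.add(Dense(nb_class, activation='softmax'))
    return model
```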
2.3 Choice of epochs
In order to prevent them from overfitting, I trained both of them for 100 epochs and observed at which epoch the validation accuracy stops increasing.
LeNet (validation-accuracy curve over 100 epochs)
Simple CNN (validation-accuracy curve over 100 epochs)
For LeNet, the accuracy stopped increasing at around the 30th epoch, so I decided to train both LeNet and the simple CNN for 30 epochs.
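A sketch of how curves like the ones above can be produced, assuming we track validation accuracy through Keras's fit history (the history key is 'val_acc' in the Keras versions of that era, and the batch size here is an assumption):

```python
import matplotlib.pyplot as plt

# train for 100 epochs and record validation accuracy per epoch
model = LeNet()
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=100, batch_size=32,
                    validation_data=(x_test, y_test), verbose=0)

plt.plot(history.history['val_acc'])   # look for where the curve flattens
plt.xlabel('epoch')
plt.ylabel('validation accuracy')
plt.show()
```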
2.4 Choice of parameters
Here I want to compare the performance of different optimizers and different kernel sizes.
Optimizer
For an introduction to the different gradient descent methods, see my other markdown.
Here I tried SGD and Adam.
1. SGD
2. Adam
We can see that SGD converges faster, but Adam is much smoother and more stable.
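The two runs only differ in the optimizer passed to compile(); a sketch, with the SGD learning rate as an assumption since the post doesn't state it:

```python
from keras.optimizers import SGD, Adam

sgd = SGD(lr=0.01)     # assumed learning rate
adam = Adam(lr=0.001)  # Keras default

model.compile(loss='categorical_crossentropy',
              optimizer=adam,  # or optimizer=sgd
              metrics=['accuracy'])
```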
Kernel size
I tried 3×3 and 5×5 kernels for the two structures.
But for LeNet, I have to modify the last convolution layer when the padding option is 'valid'.
The formula to calculate the output size for each layer is:

For 'valid' padding: $O=\left\lceil\frac{W-F+1}{S}\right\rceil$

For 'same' padding: $O=\left\lceil\frac{W}{S}\right\rceil$

where W is the input size, F is the size of the kernel, and S is the stride.
3×3 kernel
Conv1:$\left\lceil\frac{(16-3+1)}{1}\right\rceil=14$
Averagepool1:$\left\lceil\frac{(14-2+1)}{1}\right\rceil=13$
Conv2:$\left\lceil\frac{(13-3+1)}{1}\right\rceil=11$
Averagepool2:$\left\lceil\frac{(11-2+1)}{2}\right\rceil=5$
Conv3:$\left\lceil\frac{(5-5+1)}{1}\right\rceil=1$
5×5 kernel
Conv1:$\left\lceil\frac{(16-5+1)}{1}\right\rceil=12$
Averagepool1:$\left\lceil\frac{(12-2+1)}{1}\right\rceil=11$
Conv2:$\left\lceil\frac{(11-5+1)}{1}\right\rceil=7$
Averagepool2:$\left\lceil\frac{(7-2+1)}{2}\right\rceil=3$
Conv3:$\left\lceil\frac{(3-3+1)}{1}\right\rceil=1$
But I don't see any obvious difference between the 5×5 and 3×3 kernels.
2.5 K-fold Cross-Validation
I used k-fold cross validation to test the models.
K-fold cross validation
- Split the data into k parts (usually 5 or 10)
- For each of the k parts, use that part as the test set and the others as the training set
- Compute the score (e.g. MSE or accuracy) for that iteration
- Average the k scores
I first tried to use the scikit-learn function StratifiedKFold().split() (I should have used KFold() instead, thanks to Melissa) to split the data into k parts, but it turns out that this function can only handle binary or multiclass targets, not one-hot encoded labels, so it didn't work.
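For reference, the suggested KFold() alternative would look something like this untested sketch; it splits index arrays, so one-hot labels are not a problem:

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True)
for train_idx, test_idx in kf.split(X):   # X, Y as in kfoldSplit below
    x_tr, x_te = X[train_idx], X[test_idx]
    y_tr, y_te = Y[train_idx], Y[test_idx]
    # train and evaluate a fresh model on this fold
```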
So I had to write it myself; fortunately, it isn't difficult (if I wrote it right).

```python
def kfoldSplit(X, Y, k=10):
    total_size = X.shape[0]
    percentage = 1 / k
    size = int(total_size * percentage)
    start = 0
    end = size
    x_train, y_train, x_test, y_test = [], [], [], []
    for i in range(k):
        # the i-th slice becomes the test fold
        x_test.append(X[start:end, :, :, :])
        y_test.append(Y[start:end, :])
        # everything outside the slice becomes the training fold
        # (2522 is the hard-coded total number of samples)
        x_train.append(np.concatenate((X[:start, :, :, :],
                                       X[end:2522, :, :, :]), axis=0))
        y_train.append(np.concatenate((Y[:start, :],
                                       Y[end:2522, :]), axis=0))
        start = end
        end += size
    return [x_train, y_train, x_test, y_test]
```
With this function, I can apply the k-fold cross-validation.

```python
cv_scores = []
for i in range(k):
    # build and train a fresh model on each fold
    model = CNN.LeNet()
    model.compile(loss='categorical_crossentropy',
                  optimizer=CNN.adam,
                  metrics=['accuracy'])  # evaluation metric
    model.fit(kfold[0][i], kfold[1][i],
              epochs=e,
              batch_size=size,
              verbose=0)
    score = model.evaluate(kfold[2][i], kfold[3][i], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], score[1] * 100))
    cv_scores.append(score[1] * 100)
```
Evaluating the two CNN models I saved before (k = 5):
Using cross-validation to find hyperparameters
We can also use the k-fold validation to find the best hyperparameters for the model.
I tried to find the best batch size and number of epochs by grid search:
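A sketch of the grid search over the kfoldSplit() folds; the candidate values below are placeholders, not the grid actually used:

```python
batch_sizes = [16, 32, 64]    # placeholder grid
epoch_counts = [10, 20, 30]   # placeholder grid
results = {}
for bs in batch_sizes:
    for e in epoch_counts:
        scores = []
        for i in range(k):
            model = CNN.LeNet()
            model.compile(loss='categorical_crossentropy',
                          optimizer='adam', metrics=['accuracy'])
            model.fit(kfold[0][i], kfold[1][i],
                      epochs=e, batch_size=bs, verbose=0)
            scores.append(model.evaluate(kfold[2][i], kfold[3][i],
                                         verbose=0)[1])
        results[(bs, e)] = np.mean(scores)
print(max(results, key=results.get))  # the best (batch size, epochs) pair
```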
The result is clear. But the question is why none of them can match the two models I built before, even with the same parameters. (I haven't figured it out yet.)
2.6 Visualization
Test Visualization
Randomly choose 5 pictures from the test set and make predictions.
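A sketch of this step, assuming model.predict() on the reshaped test set and matplotlib for display:

```python
import matplotlib.pyplot as plt

letters = ['C', 'V', 'I', 'O']
idx = np.random.choice(len(x_test), 5, replace=False)
preds = model.predict(x_test[idx])
for n, i in enumerate(idx):
    plt.subplot(1, 5, n + 1)
    plt.imshow(x_test[i].reshape(16, 16), cmap='gray')
    plt.title(letters[np.argmax(preds[n])])  # predicted letter on top
    plt.axis('off')
plt.show()
```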
Output of the convolution layers
Output of the first Conv2d
Output of the second Conv2d
Output of the third Conv2d
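These feature maps can be pulled out with a Keras backend function; a minimal sketch, where the layer index is an assumption that depends on the exact model definition:

```python
from keras import backend as K

# map the model input to the first Conv2D's output
get_conv1_out = K.function([model.layers[0].input],
                           [model.layers[0].output])
# shape (1, 14, 14, 16) for the simple CNN with channels_last and 3x3 kernels
conv1_out = get_conv1_out([x_test[:1]])[0]
```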
3. Apply the model to Camshift
After I had the model, it was easy to apply it to the camshift demo.

```python
import letter_recog_NN as lcn

# inside the key-handling code of the camshift demo:
elif ch == ord('p'):
    xs0, ys0, xs1, ys1 = self.track_window
    # crop the tracked hand region and resize it to 16x16
    small_pic = cv.resize(prob[ys0:ys1 + ys0, xs0:xs1 + xs0], dsize=(16, 16))
    small_pic = small_pic.astype(np.float32)
    lcn.showResult(small_pic)
```
The code in letter_recog_NN is here.
So now I can make a gesture and press the p button to get a prediction from the two neural networks, MLP and CNN. The result is the captured picture with the prediction on top.
Problems
Why is the accuracy worse when I use LeNet than when I use the SimpleCNN with only one convolution layer?
It may be because my data set was gathered under the same light condition, or because I have few letters (just four). So the "function" I want to fit is simple (for example, a straight line). With more convolution layers (and other layers), I have more parameters. So what I'm doing is using a complicated function to simulate a simple one, which means many of the parameters are useless and should be set to zero, and I don't have enough data to train them to zero.
So, if I want the LeNet model to match the performance of the SimpleCNN, what I can do is:
- Gather more data under the same light condition, so that I have enough data to force some parameters to zero. But I think this can only make LeNet get closer to the SimpleCNN; it will never perform better than the simple one.
- Make the function more complicated. Since the cause of my problem is that the function is too simple, I can make the function more complex by adding more letters or taking more photos under different light conditions. In this case, LeNet could get better generalization ability than the simple one, which is what I'm looking for.