Quick n’ Dirty Facial Detection

By Tyler Diamond


Machine learning has become a hot subject as of late, thanks to the combination of huge amounts of data and improvements in parallelized hardware such as GPUs. This post is a simple introduction to some machine learning concepts and to using a Python library to train a face detector.

The main focus of machine learning has shifted to the use of neural networks. These are essentially a series of operations that take an input, such as an image or a set of images, cast it to a matrix (in the case of an image, one entry per pixel value), and transform it into an output. As an example, here is how one could construct a neural network representing the bitwise OR function.

import numpy as np

#We define the nonlinear (sigmoid) function 1/(1+e^-x).
#With deriv=True it returns the derivative, assuming x is already a sigmoid output.
def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

#Input array: the four possible combinations of two bits
x = np.array([ [0,0],
               [1,0],
               [0,1],
               [1,1] ])

#We need to add a bias weight that will always be 1; it shifts the curve
x = np.concatenate((np.ones((x.shape[0],1)), x), axis=1)

#Output array: the OR of each input pair
y = np.array([[0,1,1,1]]).T
np.random.seed(1)

#We initialize our weights at random in [-1, 1)
theta1 = 2*np.random.random((3,1)) - 1
print theta1

#We train for 1000 iterations
for iter in xrange(1000):
    layer0 = x
    output = nonlin(np.dot(layer0, theta1))

    error = y - output

    if (iter % 100) == 0:
        print "Error: " + str(np.mean(np.abs(error)))

    #This is the derivative of the activation function, as this will determine how
    #much this error will affect our training
    delta = error * nonlin(output, True)
    theta1 += np.dot(layer0.T, delta)

print theta1
#Print an example: the raw (pre-sigmoid) output for the first input [0, 0]
print np.dot(x[0], theta1)

And we have the output:

    Error: 0.581614356677
    Error: 0.114982506443
    Error: 0.0768737768202
    Error: 0.0607942182133
    Error: 0.0515777261779
    Error: 0.0454666239591
    Error: 0.0410534391916
    Error: 0.037682432675
    Error: 0.0350032016526
    Error: 0.0328098946256
    [[-2.84285071]
     [ 6.1781472 ]
     [ 6.17797823]]
    [-2.84285071]

We use this nonlinear function, also known as the sigmoid function, to compute our output for three main reasons:

1. The derivative declines as we approach the output boundaries (0 and 1) and is highest in the middle (at 0.5).

2. The derivative of this function is easy to compute (see the short derivation after this list).

3. Nonlinear functions allow us to represent complex relations that cannot be captured with linear functions.
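For reference, the derivative works out as follows; this is also why nonlin(x, deriv=True) simply returns x*(1-x), since it expects x to already be a sigmoid output:

\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)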

Illustration 1: Sigmoid Function: https://en.wikipedia.org/wiki/Sigmoid_function

We want either a 0 or a 1 as output, so this function is perfect for us: when we are far from the boundaries (i.e. the neuron is not confident in its answer) the weight is updated more significantly than when it is confident in its answer.

The derivative is important because we rely on two popular machine learning techniques: Gradient Descent and Backpropagation.

[Illustration 2: gradient descent example]

Backpropagation allows us to compute the gradient of each layer of neurons with respect to the output error. In other words, we can compute how significantly each neuron contributed to the error of the output; in terms of math, this is the partial derivative of the error with respect to each weight.

Gradient descent is the method by which we update the weights based upon how far we are from the solution. See the image to the right for an example.
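For the single-layer network above, with a squared-error loss and the sigmoid output, the weight update the code performs each iteration is exactly one gradient descent step (with an implicit learning rate alpha of 1):

E = \tfrac{1}{2}\sum_i (y_i - \hat{y}_i)^2, \qquad
\delta = (y - \hat{y}) \odot \hat{y}\,(1 - \hat{y}), \qquad
\theta \leftarrow \theta + \alpha\, X^{T}\delta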

The input of each example is a 1D vector containing two values. All four possible configurations can then be stacked into a 4x2 matrix: 4 examples, each contributing one row of 2 values. We must add the bias weight, as this allows us to shift the curve of the sigmoid function over in order to fit our network.

We define the output because we are taking a supervised learning approach, which means the network is trained by guessing based on the input and then correcting itself as it is presented with the correct outputs of the training data.

We then have our 3x1 weight vector theta, as each weight from the input layer (1 bias + 2 input values) maps to a single output (either a 0 or 1, hopefully corresponding to the correct OR output).

So before we train, here are the current values of our network:

[Figure: initial weights of the network]

And here is a visual representation of the first example (although we train all 4
examples at the same time, this process is what each example undergoes).

[Figure: forward pass for the first training example]

We would then compute the error for all the examples, and the respective gradient:

[Figure: error and gradient for all examples]

And if we calculate the new weights:

[Figure: weights after the first update]

This is then repeated for 1000 iterations until we get the following network design:

[Figure: final network after 1000 iterations]

Something interesting to note is that the network adjusted itself so that both input weights are very similar, since the OR function does not depend on the order of its inputs (i.e. 0 OR 1 = 1 OR 0).

Convolutional Neural Networks

Without changing too many of the details, you can create networks that stack layers of neurons on top of each other, allowing you to build very complex functions and relations between neurons. Currently the best-performing approaches involve stacking and connecting many of these layers, a method known as deep learning.

Illustration 3: Deep Learning: http://neuralnetworksanddeeplearning.com

Stacking neurons is great for creating complex models and representing sophisticated relationships; however, deep models require large amounts of computational resources. Therefore we need to use these highly connected layers sparingly and find other, more efficient ways to represent the data we care about.

In this post, I will be using the dataset from the Kaggle Facial Keypoints Detection competition: https://www.kaggle.com/c/facial-keypoints-detection/data

The problem of facial detection requires us to scan an image looking for certain elements that indicate we are looking at a face, such as eyes, a mouth and a nose. These features will differ based upon the orientation of a person's face, so we want to be able to detect them anywhere in the image. Convolutional neural networks address this by sliding a small "window" (a filter) over the image.

[Figure: convolution schematic, showing a filter sliding over an image]

As you can see, we can represent a whole square of pixels with a single number. This greatly reduces the number of weights we have to train and therefore also the time it takes to train our network. However, our network now has a somewhat more complex structure. Instead of each window position having its own set of weights, all positions share the same weights, since we want to be able to detect an object at any location in the image. What this means is that we instead train multiple sets of shared weights per layer-to-layer connection. These weight sets are called filters or kernels, and each filter will (hopefully) learn something different about the image. The layers and filters are now 3-dimensional, since we train multiple filters at a time.
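To make the sliding-window idea concrete, here is a minimal numpy sketch of a single shared filter being slid over an image. Strictly speaking it computes the cross-correlation that deep learning libraries usually call "convolution", and the kernel values below are just an illustrative edge detector:

import numpy as np

def convolve2d_valid(image, kernel):
    # Slide the shared kernel over the image and take a dot product at
    # every position ("valid" mode: no padding, so the output shrinks).
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every window position.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=np.float32).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                       # vertical-edge filter
print(convolve2d_valid(image, kernel))                   # a 3x3 feature map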

The input x now has the size Nx1x96x96, where N is the number of examples, 1 is the number of channels (only 1 because we are using grayscale) and the images are 96 by 96 pixels.

 The output is Nx30, where the numbers correspond to the location of the keypoints. Here is a list of the keypoints we are attempting to detect (each has an x and y value):

· left_eye_center, right_eye_center, left_eye_inner_corner, right_eye_inner_corner, left_eye_outer_corner, right_eye_outer_corner

· left_eyebrow_inner_end, right_eyebrow_inner_end, left_eyebrow_outer_end, right_eyebrow_outer_end

· nose_tip

· mouth_left_corner, mouth_right_corner, mouth_center_top_lip, mouth_center_bottom_lip

In addition to convolution layers, pooling layers are used to shrink the representation while still connecting neighboring regions of the network. We will use the common max-pooling operation, which takes the maximum value within each pool window. For example, a (2,2) window containing [1, -2, 3, 2.2] maps to [3].
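A minimal numpy sketch of 2x2 max pooling, reproducing the example above (this is only for illustration; in the network below Lasagne's MaxPool2DLayer does this for us):

import numpy as np

def max_pool_2x2(feature_map):
    # Take the maximum over each non-overlapping 2x2 window.
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1., -2.],
               [3., 2.2]])
print(max_pool_2x2(fm))   # [[ 3.]]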

Building the application

In this section I will show you how you can build a network that finds these keypoints reasonably well. Note that in order to run this you must have the required libraries and a powerful graphics card; I am running it on a GTX 1070 and it takes around half an hour to train if the number of iterations is turned down. Much of this code is inspired by a similar blog post. The machine learning library Lasagne, built upon Theano, is very powerful and lets you easily create neural networks by abstracting away the technical details. Our network will use a repeating series of convolution->pool->dropout blocks, followed by fully connected (dense) layers. This is what our network will look like, with the first convolution block shown:

[Figure: network architecture diagram]

The files are provided as CSV, containing the pixels of each example and either 15 or 4 labeled keypoints for that image.
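The snippets that follow assume roughly the following imports and definitions. Treat this as a sketch: exact module paths depend on your library versions, and trainpath/testpath are placeholders for wherever you saved the Kaggle CSV files.

import os
import pickle
from collections import OrderedDict

import numpy as np
import theano
from pandas import read_csv
from sklearn.base import clone
from sklearn.utils import shuffle

from lasagne import layers
from nolearn.lasagne import BatchIterator, NeuralNet

#Placeholder paths -- point these at your copies of the Kaggle CSV files
trainpath = '~/data/training.csv'
testpath = '~/data/test.csv'

def float32(k):
    #Small helper for handing scalars to Theano shared variables
    return np.cast['float32'](k)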

def load(test=False, cols=None):
    selection = testpath if test else trainpath
    data = read_csv(os.path.expanduser(selection))
    #Each image is stored as a space-separated string of pixel values
    data['Image'] = data['Image'].apply(lambda im: np.fromstring(im, sep=' '))
    if cols:
        data = data[list(cols) + ['Image']]
    data = data.dropna()        #Drop the rows that have missing keypoints
    x = np.vstack(data['Image'].values) / 255   #Scale pixel intensities to [0,1]
    x = x.astype(np.float32)
    if not test:
        y = data[data.columns[:-1]].values
        y = (y-48)/48   #Scale to [-1,1]; the keypoint coordinates range from 0 to 95
        x, y = shuffle(x, y, random_state=42)
        y = y.astype(np.float32)
    else:
        y = None
    x = x.reshape(-1, 1, 96, 96)
    return x, y

As you can see, we load the image data into x, reshape each example to (1x96x96), and then load the labeled keypoints into y. We also have the option to load test data. Keeping test data separate from training data is key to evaluating and improving your neural networks, because neural networks tend to overfit the training data, meaning they fail to generalize the features we are looking for and won't work on new data. We'll also want a function that runs a single round of predictions, as this will allow us to view the results on test cases after training.

def predict(specialists, X):
    y_pred = np.empty((X.shape[0],0))
    for model in specialists.values():
        y_pred1 = model.predict(X)
        y_pred = np.hstack([y_pred, y_pred1])
    return y_pred

In order to create the most accurate model, we train 6 different "specialist" models that each predict a subset of the keypoints. This allows us to have separate weights when we're looking for different kinds of features (for example, the features for the mouth are learned differently from those for the eyes).


SPECIALIST_SETTINGS = [
    dict(
        columns=(
            'left_eye_center_x', 'left_eye_center_y',
            'right_eye_center_x', 'right_eye_center_y',
            ),
        flip_indices=((0, 2), (1, 3)),
        ),

    dict(
        columns=(
            'nose_tip_x', 'nose_tip_y',
            ),
        flip_indices=(),
        ),


    dict(
        columns=(
            'mouth_left_corner_x', 'mouth_left_corner_y',
            'mouth_right_corner_x', 'mouth_right_corner_y',
            'mouth_center_top_lip_x', 'mouth_center_top_lip_y',
            ),
        flip_indices=((0, 2), (1, 3)),
        ),

    dict(
        columns=(
            'mouth_center_bottom_lip_x',
            'mouth_center_bottom_lip_y',
            ),
        flip_indices=(),
        ),

    dict(
        columns=(
            'left_eye_inner_corner_x', 'left_eye_inner_corner_y',
            'right_eye_inner_corner_x', 'right_eye_inner_corner_y',
            'left_eye_outer_corner_x', 'left_eye_outer_corner_y',
            'right_eye_outer_corner_x', 'right_eye_outer_corner_y',
            ),
        flip_indices=((0, 2), (1, 3), (4, 6), (5, 7)),
        ),

    dict(
        columns=(
            'left_eyebrow_inner_end_x', 'left_eyebrow_inner_end_y',
            'right_eyebrow_inner_end_x', 'right_eyebrow_inner_end_y',
            'left_eyebrow_outer_end_x', 'left_eyebrow_outer_end_y',
            'right_eyebrow_outer_end_x', 'right_eyebrow_outer_end_y',
            ),
        flip_indices=((0, 2), (1, 3), (4, 6), (5, 7)),
        ),
    ]

Notice there is a field called "flip_indices". A simple way to generate more data to train on is to horizontally flip the images we already have. When we do this, we also need to swap the left and right labeled keypoints so the labels still match the flipped image. Now we'll code the BatchIterator that will randomly flip images:


class FlipBackIterator(BatchIterator):
    flip_indices = [
        (0, 2), (1, 3),
        (4, 8), (5, 9), (6, 10), (7, 11),
        (12, 16), (13, 17), (14, 18), (15, 19),
        (22, 24), (23, 25),
    ]

    def transform(self, xb, yb):
        xb, yb = super(FlipBackIterator, self).transform(xb, yb)

        #Flip half of the images in this batch horizontally
        batch_size = xb.shape[0]
        indices = np.random.choice(batch_size, batch_size // 2, replace=False)
        xb[indices] = xb[indices, :, :, ::-1]

        if yb is not None:
            #Negate the x coordinates (they are scaled to [-1, 1])
            yb[indices, ::2] = yb[indices, ::2] * -1
            #Swap the left/right keypoint pairs
            for a, b in self.flip_indices:
                yb[indices, a], yb[indices, b] = (
                    yb[indices, b], yb[indices, a])
        return xb, yb
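A quick sanity check of what the ::-1 slice on the last axis does (a throwaway toy array, just for illustration):

import numpy as np

img = np.arange(6, dtype=np.float32).reshape(1, 1, 2, 3)  # (batch, channel, height, width)
print(img[0, 0])                   # rows: [0, 1, 2] and [3, 4, 5]
print(img[:, :, :, ::-1][0, 0])    # rows: [2, 1, 0] and [5, 4, 3] -- mirrored left-to-right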

A common problem with neural networks is that, due to the randomized nature of training, a network can get stuck in a region where it no longer improves. If the validation loss stops improving for long enough, we'll want to stop training, since continuing to train in the wrong direction won't help.

class EarlyStopping(object):
    def __init__(self, patience=100):
        self.patience = patience
        self.best_valid = np.inf
        self.best_valid_epoch = 0
        self.best_weights = None

    def __call__(self, nn, train_history):
        current_valid = train_history[-1]['valid_loss']
        current_epoch = train_history[-1]['epoch']
        if current_valid < self.best_valid:
            self.best_valid = current_valid
            self.best_valid_epoch = current_epoch
            self.best_weights = nn.get_all_params_values()
        elif self.best_valid_epoch + self.patience < current_epoch:
            print("Early stopping")
            print("Best valid loss was {:.6f} at epoch {}.".format(
                self.best_valid, self.best_valid_epoch))
            nn.load_params_from(self.best_weights)
            raise StopIteration()
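The net definition below also references an AdjustVariable callback that decays the learning rate and ramps up the momentum, which isn't shown elsewhere in this post. Here is a minimal sketch, assuming a simple linear schedule over max_epochs, following the same on_epoch_finished pattern as EarlyStopping:

import numpy as np

class AdjustVariable(object):
    def __init__(self, name, start=0.03, stop=0.001):
        self.name = name
        self.start, self.stop = start, stop
        self.ls = None

    def __call__(self, nn, train_history):
        if self.ls is None:
            #Linearly interpolate from start to stop over all epochs
            self.ls = np.linspace(self.start, self.stop, nn.max_epochs)
        epoch = min(train_history[-1]['epoch'], len(self.ls))
        new_value = np.cast['float32'](self.ls[epoch - 1])
        #update_learning_rate / update_momentum are Theano shared variables
        #(see the net definition), so we update them in place
        getattr(nn, self.name).set_value(new_value)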

And finally, we’ll create our neural network:

net = NeuralNet(
    layers=[
        ('input',layers.InputLayer),
        ('conv1', layers.Conv2DLayer),
        ('pool1', layers.MaxPool2DLayer),
        ('dropout1', layers.DropoutLayer),
        ('conv2', layers.Conv2DLayer),
        ('pool2', layers.MaxPool2DLayer),
        ('dropout2', layers.DropoutLayer),
        ('conv3', layers.Conv2DLayer),
        ('pool3', layers.MaxPool2DLayer),
        ('dropout3', layers.DropoutLayer),
        ('conv4', layers.Conv2DLayer),
        ('pool4', layers.MaxPool2DLayer),
        ('hidden4', layers.DenseLayer),
        ('dropout4', layers.DropoutLayer),
        ('hidden5', layers.DenseLayer),
        ('hidden6', layers.DenseLayer),
        ('output', layers.DenseLayer),
    ],


    input_shape=(None,1,96,96),  #4d, none will change based on number of samples
    conv1_num_filters=32, conv1_filter_size=(3,3), pool1_pool_size=(2,2),
    conv2_num_filters=64, conv2_filter_size=(3,1), pool2_pool_size=(2,2),
    conv3_num_filters=128, conv3_filter_size=(1,3), pool3_pool_size=(2,2),
    conv4_num_filters=64, conv4_filter_size=(2,2), pool4_pool_size=(2,2),
    hidden4_num_units=500, hidden5_num_units=500, hidden6_num_units=250,
    output_num_units=30, output_nonlinearity=None,


    #Dropouts
    dropout1_p=0.1,
    dropout2_p=0.2,
    dropout3_p=0.3,
    dropout4_p=0.4,


    update_learning_rate=theano.shared(float32(0.03)),
    update_momentum=theano.shared(float32(0.9)),


    regression=True,
    batch_iterator_train=FlipBackIterator(batch_size=128),
    on_epoch_finished=[
        AdjustVariable('update_learning_rate', start=0.03, stop=0.0001),
        AdjustVariable('update_momentum', start=0.9, stop=0.999),
        EarlyStopping(patience=150),
    ],
    max_epochs=400,
    verbose=1,
)

That’s a lot of information, so we’ll break this down section by section:

net = NeuralNet(
 layers=[
 ('input',layers.InputLayer),
 ('conv1', layers.Conv2DLayer),
 ('pool1', layers.MaxPool2DLayer),
 ('dropout1', layers.DropoutLayer),
 ('conv2', layers.Conv2DLayer),
 ('pool2', layers.MaxPool2DLayer),
 ('dropout2', layers.DropoutLayer),
 ('conv3', layers.Conv2DLayer),
 ('pool3', layers.MaxPool2DLayer),
 ('dropout3', layers.DropoutLayer),
 ('conv4', layers.Conv2DLayer),
 ('pool4', layers.MaxPool2DLayer),
 ('hidden4', layers.DenseLayer),
 ('dropout4', layers.DropoutLayer),
 ('hidden5', layers.DenseLayer),
 ('hidden6', layers.DenseLayer),
 ('output', layers.DenseLayer),
 ],

This creates the network shown in the diagram above.

 input_shape=(None,1,96,96),  #4d, none will change based on number of  samples
 conv1_num_filters=32, conv1_filter_size=(3,3),
 pool1_pool_size=(2,2),
 conv2_num_filters=64, conv2_filter_size=(3,1),
 pool2_pool_size=(2,2),
 conv3_num_filters=128, conv3_filter_size=(1,3),
 pool3_pool_size=(2,2),
 conv4_num_filters=64, conv4_filter_size=(2,2),
 pool4_pool_size=(2,2),
 hidden4_num_units=500, hidden5_num_units=500,
 hidden6_num_units=250,
 output_num_units=30, output_nonlinearity=None,

Here we define all the dimensions of our layers. The input is 4-dimensional; the first dimension (set to None) varies with the number of training samples. We then set the number of filters in each convolution layer, each filter's size, and the size of the max-pooling windows. Finally we define the number of units in the fully connected hidden layers and in the output layer.

 dropout1_p=0.1,
 dropout2_p=0.2,
 dropout3_p=0.3,
 dropout4_p=0.4,

We set the probabilities of our dropout layers. During training, a dropout layer randomly zeroes a fraction of the activations passing through it (e.g. dropout1 drops a random 10% of them). This is done to prevent overfitting of the network.
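A rough sketch of what a dropout layer does at training time (inverted dropout, where the surviving activations are rescaled so their expected value is unchanged; Lasagne's DropoutLayer behaves similarly, and dropout is disabled at test time):

import numpy as np

def dropout(activations, p, rng=np.random):
    #Zero each activation with probability p, then rescale the survivors
    #by 1/(1-p) so the expected value stays the same
    mask = rng.binomial(1, 1.0 - p, size=activations.shape)
    return activations * mask / (1.0 - p)

acts = np.ones((4, 5), dtype=np.float32)
print(dropout(acts, p=0.2))   # roughly 20% zeros, survivors scaled to 1.25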

update_learning_rate=theano.shared(float32(0.03)),
update_momentum=theano.shared(float32(0.9)),

The learning rate of a network determines how big a step we take when correcting our weights, whereas the momentum helps smooth how quickly our neural network converges on a minimum. We'll want to decrease the learning rate and increase the momentum as training progresses.
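For reference, a plain momentum update has roughly the following form (the library's actual update is a momentum-based variant; the point is that a smaller alpha takes smaller steps, while a larger mu keeps more of the previous step's direction):

v \leftarrow \mu\, v - \alpha\, \nabla_{\theta} E(\theta), \qquad
\theta \leftarrow \theta + v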

regression=True,
 batch_iterator_train=FlipBackIterator(batch_size=128),
 on_epoch_finished=[
 AdjustVariable('update_learning_rate', start=0.03, stop=0.0001),
 AdjustVariable('update_momentum', start=0.9, stop=0.999),
 EarlyStopping(patience=150),
 ],
 max_epochs=400,
 verbose=1,

We set the batch iterator to our FlipBackIterator so we can randomly flip images. The NeuralNet wrapper provides an "on_epoch_finished" hook that calls our functions after the weights are updated each epoch. We use it to adjust our learning rate and momentum and to determine whether we should stop early.

Lastly, we'll define a method to train our specialist networks, with the option to load pre-trained weights, which will help us iterate on the network.

pretrain_file = 'face_detect.pickle'

def fit_specialists(pretrain=False, train=True):
    if pretrain:
        with open(pretrain_file, 'rb') as f:
            net_pretrain = pickle.load(f)
    specialists = OrderedDict()
    for setting in SPECIALIST_SETTINGS:
        cols = setting['columns']
        X, y = load(cols=cols)
        model = clone(net)
        model.output_num_units = y.shape[1]
        model.batch_iterator_train.flip_indices = setting['flip_indices']
        # Set the number of epochs relative to the number of training examples,
        # so features with fewer labeled examples are trained for more epochs.
        model.max_epochs = int(3e6 / y.shape[0])
        if 'kwargs' in setting:
            # An optional 'kwargs' entry in the settings list may be used to
            # override any other parameter of the net.
            vars(model).update(setting['kwargs'])
        if pretrain:
            model.load_params_from(net_pretrain[cols])
            print("Loaded pre-trained weights")
        if train:
            print("Training model for columns {} for {} epochs".format(
                cols, model.max_epochs))
            model.fit(X, y)
        specialists[cols] = model
    return specialists

We train for more epochs on features that have fewer labeled examples, as this helps compensate for the lack of data.
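Putting the pieces together might look roughly like the following (a sketch; the columns of y_pred follow the order in which the specialists are defined, and the final line undoes the [-1, 1] scaling applied in load()):

# Train the specialists, then predict keypoints for the test images.
specialists = fit_specialists(pretrain=False, train=True)

X_test, _ = load(test=True)
y_pred = predict(specialists, X_test)

# Convert the predictions back to pixel coordinates in [0, 96).
keypoints = y_pred * 48 + 48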

Putting it all together, I ran the trained network on the provided test data and on some "test" data of my own.

[Figure: predicted keypoints on the provided test images]

[Figure: predicted keypoints on my own test images]

 
As can be seen, the predictions are quite accurate on the provided test data. In addition, the network does reasonably well on Professor Sanders' pictures, especially the brighter ones. Going forward, one could build a facial recognition system by combining this with a triplet loss function (e.g. FaceNet) and manually asking a user whether two images are of the same person when their keypoints are similar to one another.
