Statistics Hands-on session for Machine Learning Lecture 2

Students are required to hand in the exercises and questions below in the form of a SWAN notebook or an HTML or PDF file (with code snippets) by 7 June. Please send the material to Anna Sfyrla (anna.sfyrla@cern.ch) and Stefan Gadatsch (stefan.gadatsch@cern.ch).

For this session you will also need the "data" directory.

Neural networks

In today's exercise you'll implement a neural network from scratch for classifying handwritten digits. Start by loading the dataset.

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
from scipy.io import loadmat
data = loadmat('data/mnist.mat')
X = data['X']
y = data['y']
In [ ]:
print(X.shape)
print(np.unique(y))
In [ ]:
# from https://gist.github.com/soply/f3eec2e79c165e39c9d540e916142ae1
def show_images(images, cols = 1, titles = None):
    """Display a list of images in a single figure with matplotlib.
    
    Parameters
    ---------
    images: List of np.arrays compatible with plt.imshow.
    
    cols (Default = 1): Number of columns in figure (number of rows is 
                        set to np.ceil(n_images/float(cols))).
    
    titles: List of titles corresponding to each image. Must have
            the same length as images.
    """
    assert((titles is None) or (len(images) == len(titles)))
    n_images = len(images)
    if titles is None: titles = ['Image (%d)' % i for i in range(1,n_images + 1)]
    fig = plt.figure()
    for n, (image, title) in enumerate(zip(images, titles)):
        a = fig.add_subplot(int(np.ceil(n_images / float(cols))), cols, n + 1)
        if image.ndim == 2:
            plt.gray()
        plt.imshow(image)
        a.set_title(title)
    fig.set_size_inches(np.array(fig.get_size_inches()) * n_images)
    plt.show()
In [ ]:
indices = np.random.randint(X.shape[0], size=9)
images = [np.reshape(np.array(X[i], dtype='float'), (20, 20)) for i in indices]
show_images(images, cols=3)

We need to one-hot encode the y labels. One-hot encoding turns a class label n (out of k classes) into a vector of length k where index n is "hot" (1) while the rest are zero. sklearn has a built-in utility we can use for this.

In [ ]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse=False)  # for scikit-learn >= 1.2 use OneHotEncoder(sparse_output=False)
y_onehot = encoder.fit_transform(y)
print(y_onehot.shape)
print(y[0], y_onehot[0,:])

The neural network that you will implement has 3 layers, i.e. 1 input layer, 1 hidden layer and 1 output layer. The input layer matches the size of the instance data (20x20 pixels, i.e. 400 features, plus the bias unit), the hidden layer will have 25 units plus the bias unit and the output layer will have 10 units corresponding to the one-hot encoded class labels.

In [ ]:
input_size = 20*20
hidden_size = 25
num_labels = 10
learning_rate = 1  # used as the regularisation strength lambda in the exercises below

Below, a flat array of randomly initialised parameters is created and then reshaped into the weight matrices theta1 and theta2 for each layer. Note that more complicated networks can be implemented, but they will potentially be slow to train.

In [ ]:
params = (np.random.random(size=hidden_size * (input_size + 1) + num_labels * (hidden_size + 1)) - 0.5) * 0.25

m = X.shape[0]
X = np.matrix(X)
y = np.matrix(y)

theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

Example exercise 1:
Implement the feedforward propagation and prediction.

For convenience, the sigmoid function is implemented already.

In [ ]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

`forward_propagate(X, theta1, theta2)` takes as input the data `X` as well as the weights `theta1` and `theta2`. The function should return `a1`, `z2`, `a2`, `z3` and `h`. To compute `a1`, insert a column of ones into `X`. `z2` is computed by applying the weights `theta1.T` to `a1`. To get `a2`, feed `z2` through the non-linearity (`sigmoid`) and add ones again. `z3` is `theta2` applied to `a2`. Finally, to obtain `h`, feed `z3` through `sigmoid`.

In [ ]:
def forward_propagate(X, theta1, theta2):
    ### Define the function here ###
    
    return a1, z2, a2, z3, h
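
If you get stuck, here is a minimal reference sketch of one possible implementation, following the recipe above (bias columns added with np.insert, everything kept as np.matrix like the rest of the notebook). The name `forward_propagate_reference` is only chosen so that it does not overwrite your own version; treat this as a cross-check, not the only way.

In [ ]:
def forward_propagate_reference(X, theta1, theta2):
    m = X.shape[0]
    a1 = np.insert(X, 0, values=np.ones(m), axis=1)             # add bias column: (m, 401)
    z2 = a1 * theta1.T                                          # (m, 401) x (401, 25) -> (m, 25)
    a2 = np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)   # sigmoid + bias column: (m, 26)
    z3 = a2 * theta2.T                                          # (m, 26) x (26, 10) -> (m, 10)
    h = sigmoid(z3)                                             # hypothesis matrix
    return a1, z2, a2, z3, h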

Let's test the implementation.

In [ ]:
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
a1.shape, z2.shape, a2.shape, z3.shape, h.shape

Example exercise 2:
Implement the cost function.

Recall that the cost function for the above neural network without regularisation is $$J = \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[ -y_k^{(i)} \log\left( (h_\theta(x^{(i)}))_k \right) - \left( 1 - y_k^{(i)} \right) \log\left( 1 - (h_\theta(x^{(i)}))_k \right) \right]$$ where h is computed as in the previous exercise and K is the total number of possible labels, i.e. 10. Make sure that the code works for a dataset of any size with any number of labels.

Q: Below you will have to compute the cost

In [ ]:
def cost(params, input_size, hidden_size, num_labels, X, y, learning_rate):
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)
    
    # reshape the parameter array into parameter matrices for each layer
    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
    
    # run the feed-forward pass
    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
    
    ########################################
    # compute the cost
    #...
    ########################################
    
    return J
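
As a cross-check for your implementation, the unregularised cost can also be computed directly from the forward pass with a few vectorised element-wise operations. The sketch below assumes your `forward_propagate` from example exercise 1 is defined; the variable names (`Y`, `J_check`) are only illustrative.

In [ ]:
# cross-check sketch: unregularised cost computed directly from the forward pass
a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
Y = np.matrix(y_onehot)
J_check = np.sum(np.multiply(-Y, np.log(h)) - np.multiply(1 - Y, np.log(1 - h))) / m
print(J_check)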

Let's test the implementation. The cost function, after computing the hypothesis matrix h, applies the cost equation to compute the total error between y and h.

In [ ]:
cost(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)

Example exercise 3:
Add a regularisation term to the cost function.

The regularisation term is an addition to the cost implemented in the previous exercise. It reads $$\frac{\lambda}{2m} \left[ \sum_{j=1}^{\texttt{hidden\_size}} \sum_{k=1}^{\texttt{input\_size}} \left(\Theta_{j,k}^{(1)}\right)^2 + \sum_{j=1}^{\texttt{num\_labels}} \sum_{k=1}^{\texttt{hidden\_size}} \left(\Theta_{j,k}^{(2)}\right)^2 \right]$$ Note that the sums run over the weights only, not over the bias terms, and that $\lambda$ is passed to the cost function as the `learning_rate` argument. Make use of vectorised computations rather than writing out the sums explicitly.

Q: Below you will have to compute the regularised cost

In [ ]:
def cost(params, input_size, hidden_size, num_labels, X, y, learning_rate):
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)
    
    # reshape the parameter array into parameter matrices for each layer
    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
    
    # run the feed-forward pass
    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)

    ########################################
    # compute the cost
    #...
    ########################################
    
    return J
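
As a cross-check, the regularisation term alone can be computed in a vectorised way as sketched below. Note that the bias columns theta1[:, 0] and theta2[:, 0] are excluded, and that `learning_rate` plays the role of $\lambda$; the variable name `reg` is only illustrative.

In [ ]:
# cross-check sketch: the regularisation penalty alone, excluding the bias columns
reg = (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2)))
print(reg)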

Let's test it.

In [ ]:
cost(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)

Example exercise 4:
Implement the backpropagation algorithm.

Backpropagation computes the parameter updates that will reduce the error of the network on the training data.

Start by implementing a function `sigmoid_gradient(z)` that computes the gradient of the sigmoid function we created earlier. It can be computed as $$\frac{\mathrm{d}}{\mathrm{d}z} g(z) = g(z) (1 - g(z))$$

In [ ]:
def sigmoid_gradient(z):
    return np.multiply(sigmoid(z), (1 - sigmoid(z)))

The intuition behind the backpropagation algorithm is as follows. Given a training example $(x^{(t)},y^{(t)})$, we will first run a forward pass to compute all the activations throughout the network, including the output value of the hypothesis $h_\theta(x)$. Then, for each node $j$ in layer l, we would like to compute an error term $\delta_j^{(l)}$ that measures how much that node was responsible for any errors in our output.

For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define $\delta_j^{(3)}$ (since layer 3 is the output layer). For the hidden units, you will compute $\delta_j^{(l)}$ based on a weighted average of the error terms of the nodes in layer $(l + 1)$.

Since the computations required for backpropagation are a superset of those required in the cost function, we're actually going to extend the cost function to also perform backpropagation and return both the cost and the gradients.

Implement steps 1 to 4 below in a for loop that processes one example at a time, with the $\text{t}^\text{th}$ iteration performing the calculation on the $\text{t}^\text{th}$ training example $(x^{(t)},y^{(t)})$. Step 5 will divide the accumulated gradients by m to obtain the gradients for the neural network cost function.

Step 1: Set the input layer’s values $(a^{(1)})$ to the $\text{t}^\text{th}$ training example $x^{(t)}$. Perform a feedforward pass, computing the activations $(z^{(2)}, a^{(2)}, z^{(3)}, a^{(3)})$ for layers 2 and 3. Note that you need to add a +1 term to ensure that the vectors of activations for layers $a^{(1)}$ and $a^{(2)}$ also include the bias unit.

Step 2: For each output unit $k$ in layer 3 (the output layer), set $$\delta_k^{(3)} = (a_k^{(3)} − y_k),$$ where $y_k \in \{0,1\}$ indicates whether the current training example belongs to class $k$ ($y_k = 1$), or if it belongs to a different class ($y_k = 0$).

Step 3: For the hidden layer $l = 2$, set $$\delta^{(2)} = {\left(\Theta^{(2)}\right)}^{T} \, \delta^{(3)} \odot g^\prime(z^{(2)})$$ where $\odot$ denotes the element-wise product.

Step 4: Accumulate the gradient from this example using the following formula. Note that you should skip or remove $\delta_0^{(2)}$, the error term for the bias unit. $$\Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$$

Step 5: Obtain the (unregularized) gradient for the neural network cost function by dividing the accumulated gradients by $m$.

Step 6: To account for regularization, add it as an additional term after computing the gradients using backpropagation. Specifically, the gradients become $$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = \frac{1}{m}\Delta_{ij}^{(l)} \quad \text{for } j=0$$ and $$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} \quad \text{for } j\ge1$$ Do not add regularization to the first column of $\Theta^{(l)}$, which is used for the bias term.

Q: Below you will have to compute the cost and the gradients

In [ ]:
def backprop(params, input_size, hidden_size, num_labels, X, y, learning_rate):
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)
    
    # reshape the parameter array into parameter matrices for each layer
    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))
    
    # run the feed-forward pass
    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
    
    # initializations
    J = 0
    delta1 = np.zeros(theta1.shape)  # (hidden_size, input_size + 1)
    delta2 = np.zeros(theta2.shape)  # (num_labels, hidden_size + 1)
    
    ########################################
    # compute the cost
    #...
    ########################################
    
    # unravel the gradient matrices into a single array
    grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))
    
    
    return J, grad
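
If you want to compare against a worked solution, below is a minimal reference sketch of one possible implementation, written as a separate function (`backprop_reference`) so it does not overwrite yours. It follows steps 1 to 6 literally, loops over the training examples one at a time (so it is not optimised for speed), and assumes your `forward_propagate` and `sigmoid_gradient` are defined.

In [ ]:
# reference sketch: loop-based backpropagation following steps 1-6 above
def backprop_reference(params, input_size, hidden_size, num_labels, X, y, learning_rate):
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)

    # reshape the parameter array into parameter matrices for each layer
    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

    # run the feed-forward pass
    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)

    delta1 = np.zeros(theta1.shape)  # (hidden_size, input_size + 1)
    delta2 = np.zeros(theta2.shape)  # (num_labels, hidden_size + 1)

    # regularised cost from example exercises 2 and 3
    J = np.sum(np.multiply(-y, np.log(h)) - np.multiply(1 - y, np.log(1 - h))) / m
    J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2)))

    # steps 1-4: accumulate the gradients one training example at a time
    for t in range(m):
        a1t = a1[t, :]  # (1, input_size + 1)
        z2t = z2[t, :]  # (1, hidden_size)
        a2t = a2[t, :]  # (1, hidden_size + 1)
        ht = h[t, :]    # (1, num_labels)
        yt = y[t, :]    # (1, num_labels)

        d3t = ht - yt   # step 2: output-layer error

        z2t = np.insert(z2t, 0, values=np.ones(1))  # add the bias term so the shapes match
        d2t = np.multiply((theta2.T * d3t.T).T, sigmoid_gradient(z2t))  # step 3

        delta1 = delta1 + (d2t[:, 1:]).T * a1t  # step 4: drop the bias error term delta_0^(2)
        delta2 = delta2 + d3t.T * a2t

    # step 5: average over the training set
    delta1 = delta1 / m
    delta2 = delta2 / m

    # step 6: add the gradient of the regularisation term (not for the bias column)
    delta1[:, 1:] = delta1[:, 1:] + (theta1[:, 1:] * learning_rate) / m
    delta2[:, 1:] = delta2[:, 1:] + (theta2[:, 1:] * learning_rate) / m

    # unravel the gradient matrices into a single array
    grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))

    return J, grad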

Let's test the implementation.

In [ ]:
J, grad = backprop(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)
print(J, grad.shape)

Now let's train the network and use it to make predictions.

In [ ]:
from scipy.optimize import minimize

# minimize the objective function
fmin = minimize(fun=backprop, x0=params,
                args=(input_size, hidden_size, num_labels, X, y_onehot, learning_rate),
                method='TNC', jac=True, options={'maxiter': 250})
print(fmin)

We put a bound on the number of iterations since the objective function is not likely to converge completely. If the total cost has dropped below roughly 0.5, that is a good indicator that the algorithm is working. Let's take the parameters it found and forward-propagate them through the network to get some predictions.

In [ ]:
X = np.matrix(X)
theta1 = np.matrix(np.reshape(fmin.x[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(fmin.x[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
y_pred = np.array(np.argmax(h, axis=1) + 1)  # class labels in this dataset run from 1 to 10, hence the +1
print(y_pred)

Finally we can compute the accuracy to see how well our trained network is doing.

In [ ]:
correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y)]
accuracy = (sum(map(int, correct)) / float(len(correct)))
print('accuracy = {0}%'.format(accuracy * 100))

Homework exercise 1:
Train a classifier of your choice to detect spam.

Make use of the `sklearn` library -- you don't have to implement all functions from scratch for this project. Explain why you picked your algorithm. Your algorithm will be evaluated on a test dataset.

The headers of the emails are excluded. The body of each email has been converted to a vector with 1899 dimensions, corresponding to the 1899 words in a vocabulary. The values are binary, indicating the presence or absence of each word in the document.

In [ ]:
data = loadmat('data/spam.mat')
X = data['X']  
y = data['y'].ravel()  
print(X.shape, y.shape)
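
Since your algorithm will be judged on unseen data, it is a good idea to hold out part of this dataset yourself to estimate the test performance. Below is a generic evaluation sketch; LogisticRegression is only a placeholder here, substitute whichever classifier you chose and motivate your choice.

In [ ]:
# generic evaluation sketch: hold out part of the data to estimate test performance
# (LogisticRegression is a placeholder -- replace it with your chosen classifier)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print('held-out accuracy = {0:.3f}'.format(accuracy_score(y_test, clf.predict(X_test))))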

Homework exercise 2:
See Hands-On-Lecture6_b

Before checking that other notebook, please make sure you have PIL installed; see below.

In [ ]:
# You will need this if importing PIL does not work; it installs the Pillow package for the current user
import sys
!{sys.executable} -m pip install Pillow --user