Deep Learning

Type of machine learning: supervised vs. unsupervised

Method of machine learning: parametric (trial-and-error) vs. non-parametric (“counting”)

Deep learning is a class of parametric model.

# stare at this
weight = 0.5
goal_pred = 0.8
input = 2
alpha = 0.1
for iteration in range(20):
    pred = input * weight
    error = (pred - goal_pred) ** 2
    derivative = input * (pred - goal_pred)
    weight = weight - (alpha * derivative)
    print("Error:" + str(error) + " Prediction:" + str(pred))

Stochastic Gradient Descent updates the weights after each training example. Batch (mini-batch) Gradient Descent updates the weights after each batch of examples.
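For contrast with the per-example updates in the network code below, here is a minimal batch-update sketch of the same one-weight model from the first snippet; this is my own illustration, and the extra inputs and targets are made up.

import numpy as np

# made-up batch of inputs and targets for the one-weight model above
inputs = np.array([0.5, 1.0, 2.0])
goal_preds = np.array([0.4, 0.8, 1.6])

weight = 0.5
alpha = 0.1

for iteration in range(20):
    preds = inputs * weight
    errors = (preds - goal_preds) ** 2
    # one derivative per example, averaged into a single update per batch
    derivatives = inputs * (preds - goal_preds)
    weight = weight - (alpha * derivatives.mean())
    print("Error:" + str(errors.mean()) + " Weight:" + str(weight))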

# start at this
import numpy as np
np.random.seed(1)
 
def relu(x):
    return (x > 0) * x # returns x if x > 0
                       # return 0 otherwise
 
def relu2deriv(output):
    return output>0 # returns 1 for input > 0
                    # return 0 otherwise
streetlights = np.array( [[ 1, 0, 1 ],
                          [ 0, 1, 1 ],
                          [ 0, 0, 1 ],
                          [ 1, 1, 1 ] ] )
 
walk_vs_stop = np.array([[ 1, 1, 0, 0 ]]).T
 
alpha = 0.2
hidden_size = 4
 
weights_0_1 = 2*np.random.random((3,hidden_size)) - 1
weights_1_2 = 2*np.random.random((hidden_size,1)) - 1
 
for iteration in range(60):
    layer_2_error = 0
    for i in range(len(streetlights)):
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0,weights_0_1))
        layer_2 = np.dot(layer_1,weights_1_2)
 
        layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)
        layer_2_delta = (layer_2 - walk_vs_stop[i:i+1])
        layer_1_delta=layer_2_delta.dot(weights_1_2.T)*relu2deriv(layer_1)
        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)
 
    if(iteration % 10 == 9):
        print("Error:" + str(layer_2_error))

Regularization techniques:

  • minibatch training
  • early stopping
  • dropout (see the sketch after this list)
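A minimal dropout sketch in the style of the network code above (my own illustration, not from the original notes): during training, randomly zero out hidden activations and scale the survivors so the expected activation stays the same.

import numpy as np

np.random.seed(1)
layer_1 = np.random.random((1, 4))                       # stand-in hidden activations

dropout_mask = np.random.randint(2, size=layer_1.shape)  # 0 or 1 per hidden unit
layer_1 = layer_1 * dropout_mask * 2                     # *2 keeps the expected activation the same
# at test time, no mask is applied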

Activation functions:

  • relu is fast
  • sigmoid is often used for output because it squishes the values between 0 and 1
  • tanh is often used for middle layers because it squishes the values between -1 and 1

Output activation functions:

  • predicting raw data values (like temperature) => no activation
  • predicting yes/no probabilities => sigmoid
  • predicting “which one” probabilities => softmax

| function | forward prop | backprop delta |
| -------- | ------------ | -------------- |
| relu | ones_and_zeros = (input > 0); output = input * ones_and_zeros | mask = output > 0; deriv = output * mask |
| sigmoid | output = 1/(1 + np.exp(-input)) | deriv = output * (1 - output) |
| tanh | output = np.tanh(input) | deriv = 1 - (output ** 2) |
| softmax | temp = np.exp(input); output = temp / np.sum(temp) | temp = (output - true); output = temp / len(true) |

A convolutional layer's kernel outputs are aggregated with sum pooling, mean pooling, or max pooling. Max pooling is the most common.
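Here is a hedged 2×2 max-pooling sketch in NumPy (my own example, with a made-up feature map): each 2×2 block of the feature map is reduced to its maximum.

import numpy as np

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [7, 2, 9, 1],
                        [3, 1, 4, 8]])

# reshape into 2x2 blocks, then take the max of each block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 5]
                #  [7 9]]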

When a neural network needs to use the same idea in multiple places, endeavor to use the same weights in all of those places.
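A small illustration of that idea (my own, with made-up numbers): a convolution reuses one shared kernel at every position of the input instead of learning a separate weight for each position.

import numpy as np

signal = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, -0.5])          # the same weights applied at every position

# slide the shared kernel across the signal (a 1D convolution, "valid" mode)
output = np.array([np.dot(signal[i:i+2], kernel) for i in range(len(signal) - 1)])
print(output)   # [-0.5 -0.5 -0.5 -0.5]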

The perceptron step works as follows. For a point with coordinates (p, q), label y, and prediction given by ŷ = step(w₁p + w₂q + b), apply this to every point:

  • If the point is correctly classified, do nothing.
  • If the point is classified positive but has a negative label, subtract αp, αq, and α from w₁, w₂, and b respectively.
  • If the point is classified negative but has a positive label, add αp, αq, and α to w₁, w₂, and b respectively.
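A minimal sketch of that update (my own, with made-up data); note that the two correction branches collapse into w ← w + α(y − ŷ)x when the labels are 0/1.

import numpy as np

def step(t):
    return 1 if t >= 0 else 0

# made-up 2D points with 0/1 labels
X = np.array([[1.0, 1.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])

w = np.array([0.1, -0.2])
b = 0.0
alpha = 0.1

for epoch in range(10):
    for (p, q), label in zip(X, y):
        y_hat = step(w[0] * p + w[1] * q + b)
        # +alpha*coords if a positive point was called negative,
        # -alpha*coords if a negative point was called positive,
        # no change if the point is already correct
        w += alpha * (label - y_hat) * np.array([p, q])
        b += alpha * (label - y_hat)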

By replacing the step function with the sigmoid function, ŷ = σ(w₁x₁ + w₂x₂ + b) becomes the probability that the point is above the line (i.e., that its label is positive) rather than a hard yes/no.

$$\text{Softmax}(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}}$$

$$\text{CrossEntropy}=-\sum_{i=1}^m \left[ y_i \ln(p_i) + (1-y_i)\ln(1-p_i) \right]$$

$$\text{MultiClassCE}=-\sum_{i=1}^n \sum_{j=1}^m y_{ij} \ln(p_{ij})$$

$$\text{Error}=-\frac{1}{m}\sum_{i=1}^m \left[ (1-y_i)\ln(1-\hat{y}_i) + y_i \ln(\hat{y}_i) \right]$$

$$E(W,b)=-\frac{1}{m}\sum_{i=1}^m \left[ (1-y_i)\ln\!\left(1-\sigma(Wx^{(i)}+b)\right) + y_i \ln\!\left(\sigma(Wx^{(i)}+b)\right) \right]$$

$$\text{MultiClassError}=-\sum_{i=1}^n \sum_{j=1}^m y_{ij} \ln(\hat{y}_{ij})$$
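A small numeric sketch of these formulas (my own, with made-up scores and labels):

import numpy as np

z = np.array([2.0, 1.0, 0.1])                  # made-up class scores
p = np.exp(z) / np.sum(np.exp(z))              # Softmax(z)_j = e^z_j / sum_k e^z_k
print(p, p.sum())                              # probabilities, summing to 1

y = np.array([1, 0, 0])                        # one-hot true label
multiclass_ce = -np.sum(y * np.log(p))         # MultiClassCE for a single example
print(multiclass_ce)

# binary cross entropy for made-up predictions p_i and labels y_i
y_bin = np.array([1, 0, 1])
p_bin = np.array([0.9, 0.2, 0.7])
ce = -np.sum(y_bin * np.log(p_bin) + (1 - y_bin) * np.log(1 - p_bin))
print(ce)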

The derivative of the sigmoid function is really simple (here the tick means first-order derivative):

$$\sigma'(x) = \sigma(x)(1-\sigma(x))$$

After applying some calculus, this is the gradient step (here the tick means new value):

$$w_i' \leftarrow w_i + \alpha (y - \hat{y}) x_i \qquad b' \leftarrow b + \alpha (y - \hat{y})$$
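A minimal, vectorized sketch of that gradient step for a single-layer sigmoid classifier (my own example; the data and learning rate are made up):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# made-up data: 4 points, 2 features, 0/1 labels
X = np.array([[0.5, 1.5], [1.0, 2.0], [-1.0, -0.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(2)
b = 0.0
alpha = 0.1

for epoch in range(100):
    y_hat = sigmoid(X.dot(w) + b)        # predictions for all points at once
    w += alpha * X.T.dot(y - y_hat)      # w_i <- w_i + alpha * (y - y_hat) * x_i, summed over points
    b += alpha * np.sum(y - y_hat)       # b  <- b  + alpha * (y - y_hat), summed over points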

Feedforward:

$$\hat{y} = \sigma \circ W^{(n)} \circ \ldots \circ \sigma \circ W^{(2)} \circ \sigma \circ W^{(1)}(x)$$ $$\nabla E = \left(\ldots, \frac{\partial E}{\partial W_{ij}^{(k)}}, \ldots\right)$$

Backpropagation:

$$\forall W_{ij}^{(k)}\text{ in }\nabla E: \quad W_{ij}^{\prime(k)} \leftarrow W_{ij}^{(k)} - \alpha\frac{\partial E}{\partial W_{ij}^{(k)}}$$

If you can’t find the right size of pants, it’s better to go for the slightly larger pair and use a belt: prefer a slightly larger network plus regularization over a network that is too small.

Putting together a [Keras](https://keras.io/getting-started/sequential-model-guide/) network is straightforward:

from keras.models import Sequential

model = Sequential()
model.add(...) # a bunch of layers here
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=20, batch_size=128)
score = model.evaluate(x_test, y_test, batch_size=128)

An epoch is a single forward and backward pass of the whole dataset.

Backpropagation (another notation):

$$\delta^h_j = f'(h_j)\sum_k W_{jk}\,\delta^o_k \qquad \Delta w_{ij} = \eta\, \delta^h_j x_i$$
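A small sketch of this notation for a single hidden layer with a sigmoid activation f (my own example; the shapes, values, and learning rate are made up):

import numpy as np

def f(h):
    return 1 / (1 + np.exp(-h))                    # sigmoid, so f'(h) = f(h) * (1 - f(h))

x = np.array([0.5, 0.1, -0.2])                     # one input with 3 features
y = 0.6                                            # target
eta = 0.5                                          # learning rate

weights_input_hidden = np.random.normal(scale=0.1, size=(3, 4))
weights_hidden_output = np.random.normal(scale=0.1, size=(4,))

# forward pass
h = np.dot(x, weights_input_hidden)                # hidden-layer inputs h_j
a = f(h)                                           # hidden-layer activations f(h_j)
y_hat = f(np.dot(a, weights_hidden_output))

# output error term delta^o (a scalar here, since there is one output unit)
delta_o = (y - y_hat) * y_hat * (1 - y_hat)

# hidden error term: delta^h_j = f'(h_j) * sum_k W_jk * delta^o_k
delta_h = a * (1 - a) * weights_hidden_output * delta_o

# weight steps: Delta w_ij = eta * delta^h_j * x_i (and eta * delta^o * a_j for the output layer)
delta_w_input_hidden = eta * delta_h * x[:, None]
delta_w_hidden_output = eta * delta_o * a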

Limitations of MLPs:

  • use a lot of parameters because they only use fully connected layers
  • only accept vectors as input

Four Cases when Using Transfer Learning

A large data set might have one million images; a small data set might have two thousand. The dividing line between a large data set and a small data set is somewhat subjective. Overfitting is a concern when using transfer learning with a small data set. Images of dogs and images of wolves would be considered similar; the images share common characteristics. A data set of flower images would be different from a data set of dog images. Each of the four transfer learning cases has its own approach. In the following sections, we will look at each case one by one.

Demonstration Network

To explain how each case works, we will start with a generic pre-trained convolutional neural network and explain how to adjust the network for each case. Our example network contains three convolutional layers and three fully connected layers.

General Overview of a Neural Network

Here is a generalized overview of what the convolutional neural network does:

  • the first layer detects edges in the image
  • the second layer detects shapes
  • the third convolutional layer detects higher-level features

Each transfer learning case uses the pre-trained convolutional neural network in a different way.

Case 1: Small Data Set, Similar Data

If the new data set is small and similar to the original training data:

  • slice off the end of the neural network
  • add a new fully connected layer that matches the number of classes in the new data set
  • randomize the weights of the new fully connected layer; freeze all the weights from the pre-trained network
  • train the network to update the weights of the new fully connected layer

To avoid overfitting on the small data set, the weights of the original network are held constant rather than re-trained. Since the data sets are similar, images from each data set will have similar higher-level features, so most or all of the pre-trained layers already contain relevant information about the new data set and should be kept. Here’s how to visualize this approach:

[Figure: Neural Network with Small Data Set, Similar Data]

Case 2: Small Data Set, Different Data

If the new data set is small and different from the original training data:

  • slice off most of the pre-trained layers, keeping only those near the beginning of the network
  • add to the remaining pre-trained layers a new fully connected layer that matches the number of classes in the new data set
  • randomize the weights of the new fully connected layer; freeze all the weights from the pre-trained network
  • train the network to update the weights of the new fully connected layer

Because the data set is small, overfitting is still a concern, so the weights of the original network are held constant, as in the first case. But the original training set and the new data set do not share higher-level features, so the new network only uses the layers containing lower-level features. Here is how to visualize this approach:

[Figure: Neural Network with Small Data Set, Different Data]

Case 3: Large Data Set, Similar Data

If the new data set is large and similar to the original training data:

  • remove the last fully connected layer and replace it with a layer matching the number of classes in the new data set
  • randomly initialize the weights in the new fully connected layer
  • initialize the rest of the weights using the pre-trained weights
  • re-train the entire neural network

Overfitting is not as much of a concern when training on a large data set, so you can re-train all of the weights. Because the original training set and the new data set share higher-level features, the entire neural network is used as well. Here is how to visualize this approach:

[Figure: Neural Network with Large Data Set, Similar Data]

Case 4: Large Data Set, Different Data

If the new data set is large and different from the original training data:

  • remove the last fully connected layer and replace it with a layer matching the number of classes in the new data set
  • retrain the network from scratch with randomly initialized weights
  • alternatively, you could use the same strategy as the large-and-similar-data case

Even though the data set is different from the training data, initializing the weights from the pre-trained network might make training faster, so you can start with the same strategy as the case with a large, similar data set. If using the pre-trained network as a starting point does not produce a successful model, the other option is to randomly initialize the convolutional neural network weights and train the network from scratch. Here is how to visualize this approach:

[Figure: Neural Network with Large Data Set, Different Data]
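As a rough, hypothetical illustration of case 1 (and, with fewer retained layers, case 2), here is a hedged Keras sketch assuming VGG16 as the pre-trained network; the input shape and number of classes are made up, and the details would change with a different base network.

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Dense, Flatten

num_classes = 5                                  # assumed number of classes in the new data set

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

for layer in base.layers:                        # freeze all pre-trained weights
    layer.trainable = False

x = Flatten()(base.output)
predictions = Dense(num_classes, activation='softmax')(x)   # new, randomly initialized layer

model = Model(inputs=base.input, outputs=predictions)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
# model.fit(x_train, y_train, ...) then trains only the new fully connected layer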


References