Deep Learning
Type of machine learning: supervised vs. unsupervised
Method of machine learning: parametric (trial-and-error) vs. non-parametric (“counting”)
Deep learning is a class of parametric models.
```python
# stare at this
weight = 0.5
goal_pred = 0.8
input = 2
alpha = 0.1

for iteration in range(20):
    pred = input * weight
    error = (pred - goal_pred) ** 2
    derivative = input * (pred - goal_pred)
    weight = weight - (alpha * derivative)
    print("Error:" + str(error) + " Prediction:" + str(pred))
```
Stochastic gradient descent updates the weights after each training example; (mini-)batch gradient descent updates the weights after each batch of examples.
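A minimal numpy sketch of the difference on toy data (the inputs, targets, and batch size of 2 are assumptions for illustration):

```python
import numpy as np

inputs = np.array([0.5, 1.0, 1.5, 2.0])   # toy data, assumed for illustration
goals  = np.array([0.4, 0.8, 1.2, 1.6])
alpha, weight = 0.1, 0.5

# Stochastic gradient descent: one weight update per training example
for x, y in zip(inputs, goals):
    pred = x * weight
    weight -= alpha * x * (pred - y)

# (Mini-)batch gradient descent: one weight update per batch of examples
batch_size = 2
for start in range(0, len(inputs), batch_size):
    x, y = inputs[start:start + batch_size], goals[start:start + batch_size]
    pred = x * weight
    weight -= alpha * np.mean(x * (pred - y))   # average the per-example gradients
```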
```python
# stare at this
import numpy as np
np.random.seed(1)

def relu(x):
    return (x > 0) * x  # returns x if x > 0, 0 otherwise

def relu2deriv(output):
    return output > 0  # returns 1 for input > 0, 0 otherwise

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])

walk_vs_stop = np.array([[1, 1, 0, 0]]).T

alpha = 0.2
hidden_size = 4

weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1
weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1

for iteration in range(60):
    layer_2_error = 0
    for i in range(len(streetlights)):
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)

        layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)

        layer_2_delta = layer_2 - walk_vs_stop[i:i+1]
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)

    if iteration % 10 == 9:
        print("Error:" + str(layer_2_error))
```
Regularization techniques (to combat overfitting):
- mini-batching
- early stopping
- dropout (a minimal sketch follows this list)
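A minimal numpy sketch of a dropout mask applied to a hidden layer (the 50% drop probability and the example activations are assumptions for illustration):

```python
import numpy as np

layer_1 = np.array([[0.3, 0.0, 0.9, 0.5]])               # example hidden activations (assumed)

dropout_mask = np.random.randint(2, size=layer_1.shape)   # 0 or 1, each with probability 0.5
layer_1 = layer_1 * dropout_mask * 2                      # *2 keeps the expected magnitude the same

# During backpropagation the same mask is applied to the deltas:
# layer_1_delta *= dropout_mask
```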
Activation functions:
- ReLU is cheap to compute (and so is its derivative)
- sigmoid is often used for the output layer because it squishes the values between 0 and 1
- tanh is often used for middle (hidden) layers because it squishes the values between -1 and 1
Output activation functions:
- predicting raw data values (like temperature) => no activation
- predicting yes/no probabilities => sigmoid
- predicting “which one” probabilities => softmax
| Function | Forward prop | Backprop delta |
|----------|--------------|----------------|
| ReLU | `ones_and_zeros = (input > 0)`; `output = input * ones_and_zeros` | `mask = output > 0`; `deriv = output * mask` |
| Sigmoid | `output = 1/(1 + np.exp(-input))` | `deriv = output * (1 - output)` |
| Tanh | `output = np.tanh(input)` | `deriv = 1 - (output ** 2)` |
| Softmax | `temp = np.exp(input)`; `output = temp / np.sum(temp)` | `temp = (output - true)`; `output = temp / len(true)` |
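A small numpy sketch of the forward/backward pairs in the table (the function names are mine; `true` is assumed to be a one-hot target batch for the softmax delta):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(output):
    return output * (1 - output)

def tanh_deriv(output):
    return 1 - output ** 2

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=-1, keepdims=True)

def softmax_delta(output, true):
    # delta for softmax + cross-entropy, with `true` a batch of one-hot targets
    return (output - true) / len(true)
```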
A pooling layer aggregates the outputs of a convolutional layer's kernels with sum pooling, mean pooling, or max pooling. Max pooling is the most common.
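A minimal numpy sketch of 2×2 max pooling on a single feature map (the window size and the example input are assumptions for illustration):

```python
import numpy as np

feature_map = np.arange(16, dtype=float).reshape(4, 4)   # example 4x4 feature map (assumed)

# 2x2 max pooling with stride 2: split into non-overlapping blocks, take the max of each block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # 2x2 output, each value the max of one 2x2 block
```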
When a neural network needs to use the same idea in multiple places, endeavor to use the same weights in all of those places (weight sharing).
The perceptron step works as follows. For a point with coordinates (p, q), label y, and prediction given by ŷ = step(w₁x₁ + w₂x₂ + b), for every point:
- If the point is correctly classified, do nothing.
- If the point is classified positive but has a negative label, subtract αp, αq, and α from w₁, w₂, and b respectively.
- If the point is classified negative but has a positive label, add αp, αq, and α to w₁, w₂, and b respectively.
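A minimal sketch of one perceptron-step pass over the data (the `step` helper, array names, and default α are assumptions for illustration):

```python
import numpy as np

def step(t):
    return 1 if t >= 0 else 0

def perceptron_step(X, y, w, b, alpha=0.01):
    """One pass over the points; X has one (p, q) point per row, y holds 0/1 labels."""
    for (p, q), label in zip(X, y):
        y_hat = step(w[0] * p + w[1] * q + b)
        if y_hat == 1 and label == 0:       # classified positive, negative label
            w[0] -= alpha * p
            w[1] -= alpha * q
            b -= alpha
        elif y_hat == 0 and label == 1:     # classified negative, positive label
            w[0] += alpha * p
            w[1] += alpha * q
            b += alpha
    return w, b
```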
By replacing the step function with the sigmoid function, ŷ = σ(w₁x₁ + w₂x₂ + b) becomes the probability that the point lies above the line, rather than a hard above/below classification.
$$\mathrm{Softmax}(\mathbf{z})_j = \sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}}$$
$$\mathrm{CrossEntropy}=-\sum_{i=1}^m \left[\, y_i \ln(p_i) + (1-y_i)\ln(1-p_i) \,\right]$$
$$\mathrm{MultiClassCE}=-\sum_{i=1}^n \sum_{j=1}^m y_{ij} \ln(p_{ij})$$
$$\mathrm{Error}=-\frac{1}{m}\sum_{i=1}^m \left[\, (1-y_i)\ln(1-\hat{y}_i) + y_i \ln(\hat{y}_i) \,\right]$$
$$E(W,b)=-\frac{1}{m}\sum_{i=1}^m \left[\, (1-y_i)\ln\bigl(1-\sigma(Wx^{(i)}+b)\bigr) + y_i \ln\bigl(\sigma(Wx^{(i)}+b)\bigr) \,\right]$$
$$\mathrm{MultiClassError}=-\sum_{i=1}^n \sum_{j=1}^m y_{ij} \ln(\hat{y}_{ij})$$
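A small numpy sketch of these formulas (the example logits, labels, and probabilities are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))            # shift for numerical stability
    return e / np.sum(e)

def cross_entropy(y, p):
    """Binary cross-entropy for labels y in {0, 1} and predicted probabilities p."""
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def multiclass_cross_entropy(Y, P):
    """Y and P are (n samples, m classes); Y is one-hot, P holds predicted probabilities."""
    return -np.sum(Y * np.log(P))

print(softmax(np.array([2.0, 1.0, 0.1])))
print(cross_entropy(np.array([1, 0, 1]), np.array([0.8, 0.2, 0.6])))
```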
The derivative of the sigmoid function is really simple (here the tick means first-order derivative):
$$\sigma'(x) = \sigma(x)\,(1-\sigma(x))$$
After applying some calculus, this is the gradient step (here the tick means new value):
$$w_i' \leftarrow w_i + \alpha (y - \hat{y})\, x_i \qquad b' \leftarrow b + \alpha (y - \hat{y})$$
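A minimal sketch of this update rule for a single training example (the learning rate, data, and `sigmoid` helper are assumptions for illustration):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def gradient_step(x, y, w, b, alpha=0.1):
    """One update: w_i <- w_i + alpha*(y - y_hat)*x_i and b <- b + alpha*(y - y_hat)."""
    y_hat = sigmoid(np.dot(w, x) + b)
    w = w + alpha * (y - y_hat) * x
    b = b + alpha * (y - y_hat)
    return w, b

w, b = np.zeros(2), 0.0
w, b = gradient_step(np.array([1.0, 2.0]), 1, w, b)
```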
Feedforward:
$$\hat{y} = \sigma \circ W^{(n)} \circ \ldots \circ \sigma \circ W^{(2)} \circ \sigma \circ W^{(1)}(x)$$ $$\nabla E = \left(\ldots, \frac{\partial E}{\partial W_{ij}^{(k)}}, \ldots\right)$$
Backpropagation:
$$\forall W_{ij}^{(k)}\text{ in }\nabla E: \quad W_{ij}^{\prime\,(k)} \leftarrow W_{ij}^{(k)} - \alpha\frac{\partial E}{\partial W_{ij}^{(k)}}$$
If you can’t find the right size of pants, it’s better to go for the slightly larger pair and use a belt; likewise, when choosing a network size, err on the slightly-too-large side and rein it in with regularization.
Putting together a [Keras](https://keras.io/getting-started/sequential-model-guide/) network is straightforward:

```python
from keras.models import Sequential

model = Sequential()
model.add(...)  # a bunch of layers here

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=20, batch_size=128)
score = model.evaluate(x_test, y_test, batch_size=128)
```
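A fuller, runnable sketch under assumed shapes (random binary-classification data and two Dense layers; the sizes are arbitrary):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

# Fake data just so the script runs end to end (shapes are assumptions)
x_train, y_train = np.random.random((1000, 20)), np.random.randint(2, size=(1000, 1))
x_test, y_test = np.random.random((100, 20)), np.random.randint(2, size=(100, 1))

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))   # sigmoid output for a yes/no probability

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=20, batch_size=128)
score = model.evaluate(x_test, y_test, batch_size=128)
```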
An epoch is one full forward and backward pass over the whole dataset.
Backpropagation (another notation):
$$\delta^h_j = \sum_k W_{jk}\,\delta^o_k\, f'(h_j)$$ $$\Delta w_{ij} = \eta\, \delta^h_j\, x_i$$
Limitations of MLPs:
- use a lot of parameters because they only use fully connected layers
- only accept vectors as input
Four Cases when Using Transfer Learning

A large data set might have one million images, while a small data set might have two thousand images; the dividing line between the two is somewhat subjective. Overfitting is a concern when using transfer learning with a small data set. Images of dogs and images of wolves would be considered similar (the images share common characteristics), whereas a data set of flower images would be different from a data set of dog images. Each of the four transfer learning cases has its own approach, covered one by one below.

To explain how each case works, start with a generic pre-trained convolutional neural network containing three convolutional layers and three fully connected layers. As a general overview: the first convolutional layer detects edges in the image, the second detects shapes, and the third detects higher-level features. Each transfer learning case uses this pre-trained network in a different way.
Case 1: Small Data Set, Similar Data

If the new data set is small and similar to the original training data:
- slice off the end of the neural network
- add a new fully connected layer that matches the number of classes in the new data set
- randomize the weights of the new fully connected layer; freeze all the weights from the pre-trained network
- train the network to update the weights of the new fully connected layer

To avoid overfitting on the small data set, the weights of the original network are held constant rather than re-trained. Since the data sets are similar, images from each data set will have similar higher-level features, so most or all of the pre-trained layers already contain relevant information about the new data set and should be kept.
Case 2: Small Data Set, Different Data

If the new data set is small and different from the original training data:
- slice off most of the pre-trained layers near the beginning of the network
- add to the remaining pre-trained layers a new fully connected layer that matches the number of classes in the new data set
- randomize the weights of the new fully connected layer; freeze all the weights from the pre-trained network
- train the network to update the weights of the new fully connected layer

Because the data set is small, overfitting is still a concern, so the weights of the original network are held constant, as in the first case. But the original training set and the new data set do not share higher-level features, so the new network uses only the layers containing lower-level features.
Case 3: Large Data Set, Similar Data

If the new data set is large and similar to the original training data:
- remove the last fully connected layer and replace it with a layer matching the number of classes in the new data set
- randomly initialize the weights in the new fully connected layer
- initialize the rest of the weights using the pre-trained weights
- re-train the entire neural network

Overfitting is not as much of a concern when training on a large data set, so all of the weights can be re-trained. Because the original training set and the new data set share higher-level features, the entire pre-trained network is used as well.
Case 4: Large Data Set, Different Data

If the new data set is large and different from the original training data:
- remove the last fully connected layer and replace it with a layer matching the number of classes in the new data set
- retrain the network from scratch with randomly initialized weights
- alternatively, use the same strategy as the large-and-similar-data case

Even though the new data set is different from the original training data, initializing the weights from the pre-trained network might make training faster, in which case this case is handled exactly like the large, similar data set. If using the pre-trained network as a starting point does not produce a successful model, the other option is to randomly initialize the convolutional network's weights and train from scratch.
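A hedged Keras sketch of Case 1 (small data set, similar data), assuming a VGG16 base pre-trained on ImageNet and an arbitrary `num_classes`:

```python
from keras.applications import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model

num_classes = 10                      # assumed number of classes in the new data set

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False           # freeze all pre-trained weights

x = Flatten()(base.output)
out = Dense(num_classes, activation='softmax')(x)   # new, randomly initialized layer

model = Model(inputs=base.input, outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
# model.fit(x_train, y_train, ...)  -> only the new layer's weights are updated
```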
References
- German Traffic Sign dataset and project
- Exploring LSTMs
- CIFAR-10 Competition Winners
- A Theoretical and Empirical Analysis of Expected Sarsa
- Issues in Using Function Approximation for Reinforcement Learning (1993)
- Learning Deep Features for Discriminative Localization
- Understanding LSTM Networks
- CS231n: Convolutional Neural Networks for Visual Recognition
- Commonly used activation functions
- Visualizing what ConvNets learn
- Deep Reinforcement Learning: Pong from Pixels
- Linear Combinations
- Neural Networks and Deep Learning (book)
- Why are deep neural networks hard to train?
- ImageNet Classification with Deep Convolutional Neural Networks
- An Empirical Exploration of Recurrent Network Architectures
- Understanding the difficulty of training deep feedforward neural networks
- Inventory management in supply chains: a reinforcement learning approach
- Image Kernels Explained Visually
- common derivatives
- The Street View House Numbers (SVHN) Dataset and project
- Reinforcement Learning with Replacing Eligibility Traces
- LSTM
- WaveNet model that generates songs
- automatic handwriting generation
- Learning Long-Term Dependencies with RNN
- Deep Learning (book)
- ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
- Performance Bounds on Greedy Policies
- CNNs for text classification
- Introduction to Learning to Trade with Reinforcement Learning
- The MNIST Database of handwritten digits
- Efficient BackProp
- Assisting Pathologists in Detecting Cancer with Deep Learning
- A.I. Experiments website
- Practical recommendations for gradient-based training of deep architectures
- On the difficulty of training Recurrent Neural Networks
- Speech Recognition with Deep Recurrent Neural Networks
- Sequence to Sequence Learning with Neural Networks
- Show and Tell: A Neural Image Caption Generator
- Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
- Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- DRAW: A Recurrent Neural Network For Image Generation
- Visualizing and Understanding Recurrent Networks
- How to Generate a Good Word Embedding?
- Deep Recurrent Q-Learning for Partially Observable MDPs
- Deep Reinforcement Learning with Double Q-learning
- Prioritized Experience Replay
- Dueling Network Architectures for Deep Reinforcement Learning
- Systematic evaluation of CNN advances on the ImageNet
- Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition
- Understanding deep learning requires rethinking generalization
- Massive Exploration of Neural Machine Translation Architectures
- Generative Adversarial Nets
- Very Deep Convolutional Networks for Large-Scale Image Recognition
- How transferable are features in deep neural networks?
- LSTM: A Search Space Odyssey
- Session-based Recommendations with Recurrent Neural Networks
- Deep Residual Learning for Image Recognition
- Improved Techniques for Training GANs
- Emergence of Locomotion Behaviours in Rich Environments
- Amazon Lex FAQs
- Building powerful image classification models using very little data
- How convolutional neural networks see the world
- Dota 2 bot by OpenAI
- Reading Barcodes on Hooves: How Deep Learning Is Helping Save Endangered Zebras
- Facebook’s CNN approach for language translation
- Building an efficient neural language model over a billion words
- Depth Map Prediction from a Single Image using a Multi-Scale Deep Network
- Deep Dream Generator
- DeepMind
- AlphaGo Zero: Learning from scratch
- Producing flexible behaviours in simulated environments
- WaveNet
- AlphaGo
- Human-level control through Deep Reinforcement Learning
- Play Atari games with a CNN and reinforcement learning and its source code
- Bias of an estimator
- Conditional probability distribution
- Convergent series
- Divergent series
- Expected value
- Geometric series
- Law of large numbers
- Markov reward model
- Mean Squared Error (MSE) (usually used in regression problems)
- Mean squared error
- Negative binomial distribution
- Elman and Jordan networks
- Time delay neural network
- Vanishing gradient problem
- Word2vec
- What Neural Networks See
- ResNetCAM-keras
- Keras Transfer Learning on CIFAR-10
- Benchmarks for popular CNN models
- OpenAI Gym GitHub
- Learning to trade under the reinforcement learning framework
- Reinforcement Learning Cheat Sheet
- Getting Started with OpenAI Gym
- Deep Q-Learning with Keras and Gym
- How to Check-Point Deep Learning Models in Keras
- How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras
- Image Augmentation for Deep Learning With Keras
- NeurIPS
- Elman network
- Attacking Machine Learning with Adversarial Examples
- OpenFrameworks
- Low Power Wireless Communication via Reinforcement Learning
- Sequence-to-Sequence RNNs for Text Summarization
- Reinforcement Learning for Robots Using Neural Networks
- Reading game frames in Python with OpenCV - Python Plays GTA V
- Reinforcement Learning (DQN) Tutorial
- Play pictionary with a CNN
- Reinforcement Learning (book) and Python implementation
- Keras Cheat Sheet
- MIT 6.S094: Deep Learning for Self-Driving Cars
- Deep Traffic
- A Beginner’s Guide to LSTMs and Recurrent Neural Networks
- Geometric Sequences and Exponential Functions
- Human-level control through deep reinforcement learning
- A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering
- AutoDraw
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting
- FaceApp uses neural networks to change your look, now available on Android
- How GPS Drone Navigation Works
- Deep Learning Newsletter
- cross entropy (usually used in classification problems)
- Popular Datasets Over Time
- Grokking Deep Learning
- Nature publication detailing cancer-detecting CNN
- The Dark Secret at the Heart of AI
- Finding Solace in Defeat by Artificial Intelligence
- Visually-Indicated Sounds
- Intelligent Flying Machines (IFM)
- Visualizing and Understanding Deep Neural Networks by Matt Zeiler