# Deep Learning

Exploring LSTMs

CIFAR-10 Competition Winners

A Theoretical and Empirical Analysis of Expected Sarsa

Issues in Using Function Approximation for Reinforcement Learning (1993)

Learning Deep Features for Discriminative Localization

Understanding LSTM Networks

CS231n: Convolutional Neural Networks for Visual Recognition

Commonly used activation functions

Visualizing what ConvNets learn

Deep Reinforcement Learning: Pong from Pixels

Linear Combinations

Neural Networks and Deep Learning (book)

Why are deep neural networks hard to train?

ImageNet Classification with Deep Convolutional Neural Networks

An Empirical Exploration of Recurrent Network Architectures

Understanding the difficulty of training deep feedforward neural networks

Inventory management in supply chains: a reinforcement learning approach

Image Kernels Explained Visually

common derivatives

Reinforcement Learning with Replacing Eligibility Traces

LSTM

WaveNet model that generates songs

automatic handwriting generation

Learning Long-Term Dependencies with RNN

Deep Learning (book)

IMAGENET Large Scale Visual Recognition Challenge (ILSVRC)

CNNs for text classification

Introduction to Learning to Trade with Reinforcement Learning

THE MNIST DATABASE of handwritten digits

Efficient BackProp

Assisting Pathologists in Detecting Cancer with Deep Learning

A.I. Experiments website

Practical recommendations for gradient-based training of deep architectures

On the difficulty of training Recurrent Neural Networks

Speech Recognition with Deep Recurrent Neural Networks

Sequence to Sequence Learning with Neural Networks

Show and Tell: A Neural Image Caption Generator

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

DRAW: A Recurrent Neural Network For Image Generation

Visualizing and Understanding Recurrent Networks

How to Generate a Good Word Embedding?

Deep Recurrent Q-Learning for Partially Observable MDPs

Deep Reinforcement Learning with Double Q-learning

Prioritized Experience Replay

Dueling Network Architectures for Deep Reinforcement Learning

Systematic evaluation of CNN advances on the ImageNet

Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

Understanding deep learning requires rethinking generalization

Massive Exploration of Neural Machine Translation Architectures

VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION

How transferable are features in deep neural networks?

LSTM: A Search Space Odyssey

SESSION-BASED RECOMMENDATIONS WITH RECURRENT NEURAL NETWORKS

Deep Residual Learning for Image Recognition

Improved Techniques for Training GANs

Emergence of Locomotion Behaviours in Rich Environments

Amazon Lex FAQs

Building powerful image classification models using very little data

How convolutional neural networks see the world

DotA 2 bot by Open AI

Reading Barcodes on Hooves: How Deep Learning Is Helping Save Endangered Zebras

Facebook's CNN approach for language translation

Building an efficient neural language model over a billion words

Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

Deep Dream Generator

DeepMind

AlphaGo Zero: Learning from scratch

Producing flexible behaviours in simulated environments

WaveNet

AlphaGo

Human-level control through Deep Reinforcement Learning

Bias of an estimator

Conditional probability distribution

Convergent series

Divergent series

Expected value

Geometric series

Law of large numbers

Markov reward model

Mean Squared Error (MSE) (usually used in regression problems)

Mean squared error

Negative binomial distribution

Elman and Jordan networks

Time delay neural network

Word2vec

What Neural Networks See

ResNetCAM-keras

Keras Transfer Learning on CIFAR-10

Benchmarks for popular CNN models

OpenAI Gym GitHub

Learning to trade under the reinforcement learning framework

Reinforcement Learning Cheat Sheet

Getting Started with OpenAI Gym

Deep Q-Learning with Keras and Gym

How to Check-Point Deep Learning Models in Keras

How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras

Image Augmentation for Deep Learning With Keras

NeurIPS

Elman network

Attacking Machine Learning with Adversarial Examples

OpenFrameworks

Low Power Wireless Communication via Reinforcement Learning

SEQUENCE-TO-SEQUENCE RNNS FOR TEXT SUMMARIZATION

Reinforcement Learning for Robots Using Neural Networks

Reading game frames in Python with OpenCV - Python Plays GTA V

Reinforcement Learning (DQN) Tutorial

Play pictionary with a CNN

Keras Cheat Sheet

MIT 6.S094: Deep Learning for Self-Driving Cars

Deep Traffic

A Beginner's Guide to LSTMs and Recurrent Neural Networks

Geometric Sequences and Exponential Functions

Human-level control through deep reinforcement learning

A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering

AutoDraw

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

FaceApp uses neural networks to change your look, now available on Android

cross entropy (usually used in classification problems)

Popular Datasets Over Time

Grokking Deep Learning

Nature publication detailing cancer-detecting CNN

The Dark Secret at the Heart of AI

Finding Solace in Defeat by Artificial Intelligence

Visually-Indicated Sounds

Intelligent Flying Machines (IFM)

Visualizing and Understanding Deep Neural Networks by Matt Zeiler

Type of machine learning: supervised vs. unsupervised

Method of machine learning: parametric (trial-and-error) vs. non-parametric (“counting”)

Deep learning is a class of parametric models.

# stare at this
weight = 0.5
goal_pred = 0.8
input = 2
alpha = 0.1
for iteration in range(20):
    pred = input * weight
    error = (pred - goal_pred) ** 2
    derivative = input * (pred - goal_pred)
    weight = weight - (alpha * derivative)
    print("Error:" + str(error) + " Prediction:" + str(pred))
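
The `derivative` line in the loop above can be read off from the squared error with the chain rule. Writing $x$ for `input`, $w$ for `weight`, and $g$ for `goal_pred` (symbols introduced here only for brevity):

$$E(w) = (xw - g)^2 \quad\Rightarrow\quad \frac{dE}{dw} = 2x\,(xw - g)$$

The code uses $x\,(xw - g)$, which drops the constant factor 2; that factor only rescales the effective step size `alpha`. With the starting values, the first iteration gives pred $= 2 \cdot 0.5 = 1.0$, error $= (1.0 - 0.8)^2 = 0.04$, derivative $= 2 \cdot 0.2 = 0.4$, and the new weight is $0.5 - 0.1 \cdot 0.4 = 0.46$.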


# start at this
import numpy as np
np.random.seed(1)

def relu(x):
    return (x > 0) * x  # returns x if x > 0
                        # return 0 otherwise

def relu2deriv(output):
    return output > 0  # returns 1 for input > 0
                       # return 0 otherwise

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])

walk_vs_stop = np.array([[1, 1, 0, 0]]).T

alpha = 0.2
hidden_size = 4

weights_0_1 = 2*np.random.random((3,hidden_size)) - 1
weights_1_2 = 2*np.random.random((hidden_size,1)) - 1

for iteration in range(60):
    layer_2_error = 0
    for i in range(len(streetlights)):
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0,weights_0_1))
        layer_2 = np.dot(layer_1,weights_1_2)

        layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)
        layer_2_delta = (layer_2 - walk_vs_stop[i:i+1])
        layer_1_delta=layer_2_delta.dot(weights_1_2.T)*relu2deriv(layer_1)
        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)

    if(iteration % 10 == 9):
        print("Error:" + str(layer_2_error))


Regularization techniques:

• minibatch
• early stopping
• dropout
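
As a rough illustration of the dropout item above, here is a minimal NumPy sketch. The function name `dropout`, the `keep_prob` parameter, and the rescaling by `1 / keep_prob` (often called inverted dropout) are assumptions made for this example, not something these notes specify.

```python
import numpy as np

def dropout(layer, keep_prob=0.5, rng=None):
    """Randomly zero activations during training (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    # Each unit is kept with probability keep_prob.
    mask = rng.random(layer.shape) < keep_prob
    # Rescale the survivors so the expected activation stays the same.
    return layer * mask / keep_prob, mask

rng = np.random.default_rng(0)
layer_1 = np.ones((1, 8))
dropped, mask = dropout(layer_1, keep_prob=0.5, rng=rng)
print(dropped)
```

During backpropagation the same `mask` would be applied to the layer's delta, so that dropped units receive no weight update for that example.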

Activation functions:

• relu is fast
• sigmoid is often used for output because it squishes the values between 0 and 1
• tanh is often used for middle layers because it squishes the values between -1 and 1

Output activation functions:

• predicting raw data values (like temperature) => no activation
• predicting yes/no probabilities => sigmoid
• predicting “which one” probabilities => softmax

| function | forward prop | back prop delta |
| --- | --- | --- |
| Relu | `ones_and_zeros = (input > 0)`; `output = input*ones_and_zeros` | `mask = output > 0`; `deriv = output * mask` |
| Sigmoid | `output = 1/(1 + np.exp(-input))` | `deriv = output*(1-output)` |
| Tanh | `output = np.tanh(input)` | `deriv = 1 - (output**2)` |
| Softmax | `temp = np.exp(input)`; `output = temp / np.sum(temp)` | `temp = (output - true)`; `output = temp/len(true)` |
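
The sigmoid and tanh derivative expressions listed above can be sanity-checked against a finite-difference estimate. This is only a sketch: `numeric_deriv`, the sample points in `x`, and the tolerance are choices made here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def numeric_deriv(f, x, eps=1e-6):
    # Central difference approximation of f'(x).
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = np.linspace(-3, 3, 13)

# Forward prop outputs.
sig_out = sigmoid(x)
tanh_out = np.tanh(x)

# Derivatives expressed in terms of the outputs.
sig_deriv = sig_out * (1 - sig_out)
tanh_deriv = 1 - tanh_out ** 2

sig_ok = np.allclose(sig_deriv, numeric_deriv(sigmoid, x), atol=1e-6)
tanh_ok = np.allclose(tanh_deriv, numeric_deriv(np.tanh, x), atol=1e-6)
print(sig_ok, tanh_ok)
```

Expressing each derivative through the layer's own output is what makes these activations convenient: backpropagation can reuse the values already computed in the forward pass.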

A convolution layer aggregates the kernels with sum pooling, mean pooling, or max pooling. Max pooling is the most common.
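
One reading of the sentence above, assumed for this sketch, is that the outputs of several kernels at the same positions are combined element-wise. The array `kernel_outputs` and its values are made up for illustration.

```python
import numpy as np

# Hypothetical outputs of 3 kernels, each evaluated on the same 2x2 grid of positions.
kernel_outputs = np.array([
    [[1.0, 0.0], [2.0, 5.0]],
    [[3.0, 1.0], [0.0, 4.0]],
    [[2.0, 2.0], [1.0, 6.0]],
])

# Aggregate across the kernel axis (axis 0), one value per position.
sum_pooled = kernel_outputs.sum(axis=0)
mean_pooled = kernel_outputs.mean(axis=0)
max_pooled = kernel_outputs.max(axis=0)

print(max_pooled)
```

Max pooling keeps, for each position, only the strongest response among the kernels.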

When a neural network needs to use the same idea in multiple places, endeavor to use the same weights in each of those places.

The perceptron step works as follows. For a point with coordinates (p,q), label y, and prediction given by the equation ŷ = step(w₁x₁ + w₂x₂ + b), for every point:

• If the point is correctly classified, do nothing.
• If the point is classified positive, but it has a negative label, subtract αp, αq, and α from w₁, w₂ and b respectively.
• If the point is classified negative, but it has a positive label, add αp, αq, and α to w₁, w₂ and b respectively.
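
The perceptron rules above can be sketched in plain Python. The function names, the example points, and the convention that a score of exactly 0 counts as positive are assumptions for this illustration.

```python
def step(t):
    # Assumed convention: a score of exactly 0 counts as positive.
    return 1 if t >= 0 else 0

def perceptron_step(points, labels, w1, w2, b, alpha=0.1):
    """One pass over all points, nudging the line toward misclassified ones."""
    for (p, q), y in zip(points, labels):
        y_hat = step(w1 * p + w2 * q + b)
        if y_hat == y:
            continue  # correctly classified: do nothing
        if y_hat == 1:
            # classified positive, negative label: subtract
            w1, w2, b = w1 - alpha * p, w2 - alpha * q, b - alpha
        else:
            # classified negative, positive label: add
            w1, w2, b = w1 + alpha * p, w2 + alpha * q, b + alpha
    return w1, w2, b

points = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, 0]
w1, w2, b = perceptron_step(points, labels, w1=-1.0, w2=-1.0, b=0.0)
print(w1, w2, b)
```

Both example points start out misclassified, so each one moves the weights a small step (scaled by α) in the direction that would classify it correctly.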

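By replacing the step function with the sigmoid function, ŷ = σ(w₁x₁ + w₂x₂ + b) becomes a probability: the model's estimate that the point has a positive label, which approaches 1 or 0 the farther the point lies on the positive or negative side of the line.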

$$\mathrm{Softmax}(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}}$$
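
A minimal NumPy sketch of the softmax formula follows. Subtracting the maximum score before exponentiating is an addition made here for numerical stability; it leaves the result unchanged because the common factor cancels in the fraction. The example scores are made up.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    # Shifting by the max avoids overflow in exp and does not change the ratio.
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs, probs.sum())
```

The outputs are positive and sum to 1, so they can be read as "which one" probabilities over the K classes.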

$$\mathrm{CrossEntropy}=-\sum_{i=1}^m \left[ y_i \ln(p_i) + (1-y_i)\ln(1-p_i) \right]$$

$$\mathrm{MultiClassCE}=-\sum_{i=1}^n \sum_{j=1}^m y_{ij} \ln(p_{ij})$$

$$\mathrm{Error}=-\frac{1}{m}\sum_{i=1}^m \left[ (1-y_i)\ln(1-\hat{y}_i) + y_i \ln(\hat{y}_i) \right]$$

$$E(W,b)=-\frac{1}{m}\sum_{i=1}^m \left[ (1-y_i)\ln(1-\sigma(Wx^{(i)}+b)) + y_i \ln(\sigma(Wx^{(i)}+b)) \right]$$

$$\mathrm{MultiClassError}=-\sum_{i=1}^n \sum_{j=1}^m y_{ij} \ln(\hat{y}_{ij})$$
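
The error formulas above translate almost directly into NumPy. This is a sketch with made-up labels and predictions; the function names are chosen here, and the binary version uses the mean (the 1/m form) while the multi-class version uses the plain sum, as in the formulas.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    # Mean over the m points of the two-term log loss.
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multiclass_cross_entropy(Y, P):
    # Y is one-hot (n points x m classes), P holds predicted probabilities.
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P))

bce = binary_cross_entropy([1, 0], [0.9, 0.2])
mce = multiclass_cross_entropy([[1, 0, 0], [0, 1, 0]],
                               [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(bce, mce)
```

Only the probability assigned to the true class contributes to each term, so confident wrong predictions are penalized heavily.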

The derivative of the sigmoid function is really simple (here the tick means first-order derivative):

$$\sigma'(x) = \sigma(x) (1-\sigma(x))$$

After applying some calculus, this is the gradient step (here the tick means new value):

$$w_i' \leftarrow w_i + \alpha (y - \hat{y}) x_i \qquad b' \leftarrow b + \alpha (y - \hat{y})$$
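
A single gradient step for one point can be sketched as follows. The starting weights, the point `x`, its label `y`, and the learning rate are example values chosen here.

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def gradient_step(w, b, x, y, alpha=0.1):
    """Apply the update rule once for a single point (x, y)."""
    y_hat = sigmoid(np.dot(w, x) + b)
    w_new = w + alpha * (y - y_hat) * x
    b_new = b + alpha * (y - y_hat)
    return w_new, b_new

w = np.zeros(2)
b = 0.0
x = np.array([1.0, 2.0])
y = 1

w_new, b_new = gradient_step(w, b, x, y)
before = sigmoid(np.dot(w, x) + b)
after = sigmoid(np.dot(w_new, x) + b_new)
print(w_new, b_new, before, after)
```

With zero initial weights the prediction is 0.5, so the step is α · 0.5 · x for the weights and α · 0.5 for the bias, and the new prediction moves toward the label.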

Feedforward:

$$\hat{y} = \sigma \circ W^{(n)} \circ \ldots \circ \sigma \circ W^{(2)} \circ \sigma \circ W^{(1)}(x)$$

$$\nabla E = \left(\ldots, \frac{\partial E}{\partial W_{ij}^{(k)}}, \ldots\right)$$

Backpropagation:

$$\forall W_{ij}^{(k)}\text{ in }\nabla E: \quad W_{ij}^{'(k)} \leftarrow W_{ij}^{(k)} - \alpha\frac{\partial E}{\partial W_{ij}^{(k)}}$$

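If you can't find the right size of pants, it's better to go for the slightly larger pair and use a belt. In model terms: when the ideal architecture is unknown, prefer a slightly too complex network and rein in overfitting with techniques such as early stopping or dropout.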

Putting together a [Keras](https://keras.io/getting-started/sequential-model-guide/) network is straightforward:

from keras.models import Sequential

model = Sequential()
model.add(...) # a bunch of layers here
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=20, batch_size=128)
score = model.evaluate(x_test, y_test, batch_size=128)


An epoch is a single forward and backward pass of the whole dataset.
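
To make the relation between epochs and mini-batches concrete, here is a small sketch. The data set size of 1000 is a made-up figure; the batch size of 128 and the 20 epochs are taken from the Keras snippet above. Counting the final partial batch as a step is an assumption of this sketch.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    # The last batch may be smaller than batch_size.
    return math.ceil(num_samples / batch_size)

def iterate_minibatches(num_samples, batch_size):
    """Yield (start, end) index pairs covering the data set once."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)

num_samples, batch_size, epochs = 1000, 128, 20
steps = batches_per_epoch(num_samples, batch_size)
total_updates = steps * epochs
slices = list(iterate_minibatches(num_samples, batch_size))
print(steps, total_updates, slices[-1])
```

Each epoch therefore consists of several weight updates, one per mini-batch, rather than a single update.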

Backpropagation (another notation):

$$\delta^h_j = \sum_k W_{jk}\,\delta^o_k\,f'(h_j)$$

$$\Delta w_{ij} = \eta\,\delta^h_j\,x_i$$
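
The two expressions above can be traced through a tiny network with two inputs, two hidden units, and one output. Everything concrete here is an assumption for illustration: the sigmoid as activation f, the squared-error output term, and all numeric values.

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

x = np.array([0.5, -0.2])                 # inputs x_i
W_in_hidden = np.array([[0.1, -0.3],
                        [0.4, 0.2]])      # weights w_ij (input i -> hidden j)
W_hidden_out = np.array([0.3, -0.1])      # weights W_jk with a single output k
target = 1.0
eta = 0.5                                 # learning rate

# Forward pass.
h = x @ W_in_hidden                       # hidden inputs h_j
a = sigmoid(h)                            # hidden activations f(h_j)
y_hat = sigmoid(a @ W_hidden_out)

# Output error term (assumes squared error and a sigmoid output unit).
delta_o = (target - y_hat) * y_hat * (1 - y_hat)

# Hidden error term: W_jk * delta_o * f'(h_j); the sum over k has one term here.
delta_h = W_hidden_out * delta_o * a * (1 - a)

# Weight step: eta * delta_h_j * x_i for every pair (i, j).
delta_w = eta * np.outer(x, delta_h)
print(delta_w)
```

With more than one output unit, `delta_h` would sum the contributions `W_jk * delta_o_k` over all outputs k before multiplying by f'(h_j).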

Limitations of MLPs:

• use a lot of parameters because they only use fully connected layers
• only accept vectors as input
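
A quick parameter count illustrates the first limitation. The layer sizes are hypothetical: a 32×32 RGB image (the CIFAR-10 image size) flattened into a vector and fed to a 512-unit fully connected layer, compared with a convolutional layer of 32 filters of size 3×3.

```python
def dense_params(n_inputs, n_units):
    # One weight per input-unit pair, plus one bias per unit.
    return n_inputs * n_units + n_units

def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # One small kernel per filter, shared across all image positions, plus biases.
    return kernel_h * kernel_w * in_channels * filters + filters

flattened = 32 * 32 * 3            # the image as a vector, as an MLP requires
dense = dense_params(flattened, 512)
conv = conv2d_params(3, 3, 3, 32)
print(flattened, dense, conv)
```

The fully connected layer needs about 1.5 million parameters where the convolutional layer needs under a thousand, because the convolution reuses the same weights at every position and keeps the 2D layout instead of flattening it.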

Four cases when using transfer learning:

A large data set might have one million images. A small data set could have two thousand images. The dividing line between a large data set and a small data set is somewhat subjective. Overfitting is a concern when using transfer learning with a small data set.

Images of dogs and images of wolves would be considered similar; the images would share common characteristics. A data set of flower images would be different from a data set of dog images.

Each of the four transfer learning cases has its own approach. In the following sections, we will look at each case one by one.

Demonstration network:

To explain how each situation works, we will start with a generic pre-trained convolutional neural network and explain how to adjust the network for each case. Our example network contains three convolutional layers and three fully connected layers:

[Figure: General Overview of a Neural Network]

Here is a generalized overview of what the convolutional neural network does:

• the first layer will detect edges in the image
• the second layer will detect shapes
• the third convolutional layer detects higher level features

Each transfer learning case will use the pre-trained convolutional neural network in a different way.

Case 1: Small Data Set, Similar Data

If the new data set is small and similar to the original training data:

• slice off the end of the neural network
• add a new fully connected layer that matches the number of classes in the new data set
• randomize the weights of the new fully connected layer; freeze all the weights from the pre-trained network
• train the network to update the weights of the new fully connected layer

To avoid overfitting on the small data set, the weights of the original network will be held constant rather than re-training the weights.

Since the data sets are similar, images from each data set will have similar higher level features. Therefore most or all of the pre-trained neural network layers already contain relevant information about the new data set and should be kept.

Here's how to visualize this approach:

[Figure: Neural Network with Small Data Set, Similar Data]

Case 2: Small Data Set, Different Data

If the new data set is small and different from the original training data:

• slice off most of the pre-trained layers, keeping only those near the beginning of the network
• add to the remaining pre-trained layers a new fully connected layer that matches the number of classes in the new data set
• randomize the weights of the new fully connected layer; freeze all the weights from the pre-trained network
• train the network to update the weights of the new fully connected layer

Because the data set is small, overfitting is still a concern. To combat overfitting, the weights of the original neural network will be held constant, like in the first case.

But the original training set and the new data set do not share higher level features. In this case, the new network will only use the layers containing lower level features.

Here is how to visualize this approach:

[Figure: Neural Network with Small Data Set, Different Data]

Case 3: Large Data Set, Similar Data

If the new data set is large and similar to the original training data:

• remove the last fully connected layer and replace with a layer matching the number of classes in the new data set
• randomly initialize the weights in the new fully connected layer
• initialize the rest of the weights using the pre-trained weights
• re-train the entire neural network

Overfitting is not as much of a concern when training on a large data set; therefore, you can re-train all of the weights.

Because the original training set and the new data set share higher level features, the entire neural network is used as well.

Here is how to visualize this approach:

[Figure: Neural Network with Large Data Set, Similar Data]

Case 4: Large Data Set, Different Data

If the new data set is large and different from the original training data:

• remove the last fully connected layer and replace with a layer matching the number of classes in the new data set
• retrain the network from scratch with randomly initialized weights
• alternatively, you could just use the same strategy as the “large and similar” data case

Even though the data set is different from the training data, initializing the weights from the pre-trained network might make training faster. So this case is exactly the same as the case with a large, similar data set.

If using the pre-trained network as a starting point does not produce a successful model, another option is to randomly initialize the convolutional neural network weights and train the network from scratch.

Here is how to visualize this approach:

[Figure: Neural Network with Large Data Set, Different Data]
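
The four cases can be condensed into a small lookup, sketched below in plain Python. The function name, the dictionary keys, and the wording of the values are all made up for this summary; it encodes only the decisions described above, not an actual training procedure.

```python
def transfer_learning_strategy(large_dataset, similar_data):
    """Summarize which pre-trained layers to keep and whether to freeze them."""
    if not large_dataset and similar_data:
        # Case 1: keep everything except the end, train only the new layer.
        return {"keep": "all layers except the end",
                "freeze_pretrained": True,
                "fallback": None}
    if not large_dataset and not similar_data:
        # Case 2: keep only the early, low-level layers, still frozen.
        return {"keep": "only layers near the beginning",
                "freeze_pretrained": True,
                "fallback": None}
    if similar_data:
        # Case 3: replace the last layer, re-train the entire network.
        return {"keep": "all layers except the last",
                "freeze_pretrained": False,
                "fallback": None}
    # Case 4: same starting point as case 3, with training from scratch as a fallback.
    return {"keep": "all layers except the last",
            "freeze_pretrained": False,
            "fallback": "train from scratch with random weights"}

for large in (False, True):
    for similar in (True, False):
        print(large, similar, transfer_learning_strategy(large, similar))
```

The pattern is that data set size decides whether the pre-trained weights are frozen, while similarity decides how much of the pre-trained network is worth keeping.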