CSW182 Notes

Milestones
- Digit recognition
- Image classification
- Speech Recognition
- Language Translation
- Deep reinforcement learning
History
- Cybernetics -> Digital Computers -> Connectionism -> Computational learning theory, graphical models -> Deep learning: end to end, large datasets
Logistic Regression
Minibatches
Cross Entropy Loss
Saddle Points
Stochastic Gradient Descent
- Linear complexity
- Not a lot of variance
Newton’s Method is a standard extension of Stochastic Gradient Descent
- Derive from Taylor Series
- Optimal for quadratic functions
- Why do we not use this for deep learning?
  - Will not get you to a global optimum, will get you right to the saddle point
  - LDFDS, computing the inverse of the Hessian will take cubic time.
SGD with momentum and damping (viscosity)
- Momentum is actually the damping ratio
- Convergence ratio of $O(\frac{1}{t})$
Nesterov Momentum Update
- Convergence ratio of $O(\frac{1}{t^2})$
RMSProp (Root mean squared gradient)
- Moving average of the Mean-Squared Gradient at step t.
Adagrad (Uses cumulative sum of squared gradients)
- The exploding gradient problem
RMSProp and Adagrad work great with datasets with wide ranges of gradient magnitudes.
- To an ellipsoidal surface, they scale the surface
- Professor Canny what are you saying? Include a slide or diagram or something!
Momentum + RMSProp $\approx$ ADAM
- Bias correction, Canny says this is wack?
Learning rate decay
- Step decay, exponential decay, $\frac{1}{t}$ decay, $\frac{1}{\sqrt{t}}$ decay (Adagrad).
Higher degree models

Backpropgation and CNNs

Minibatches and SGD, gradients are noisy but do better on average
Momentum based approaches, or coordinate scaling
Loss is of the form $L(f(x,W),y)$
$\frac{dL}{dW}…$
$\frac{\nabla L}{\nabla W_j} = \sum_{i=1}^k\frac{\nabla L}{\nabla f_i}\frac{\nabla f_i}{\nabla W_j}$
For $j=1,…,m$. This can be written as a matrix multiply: $J_L(W) = J_L(f)J_f(W)$
Where $J_f(W)$ is a Jacobian matrix…
Cross-entropy loss $L=-\log{x_y}$.
- $J_L(x_i)$
Two distinctive types of Jacobians:
- Derivatives of loss wrt an input vector (“data” path)
- Derivatives of loss wrt model parameters (“model” path)
Pooling doesn’t get used as much anymore. The idea of pooling is to take in contribution from different localities. But this happens regardless because networks are so deep.
Common to zero pad the border. Aside: Remove the image mean first!

CNNs continued

Subtract the average and the pad with zeroes.
$D_{out} = \frac{N + 2P - F}{\text{stride}}+1$
Pooling layer
- By “pooling” (e.g., taking max) filter responses at different locations, we gain robestness to the exact spatial location.
- Makes the representations…
Case Study: AlexNet.
- first use of ReLU
- used LRNorm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1w-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decal 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%.
“You need a lot of data if you want to train/use CNNs”.
- Transfer Learning with CNNs
  - 1. Train on Imagenet
  - 1. If small dataset: fix all weights (treat CNN as fixed feature extractor), retrain only the classifier. i.e. swap the softmax layer at the end.
  - 1. If you have medium sized dataset, “finetune” instead: use the old weights as initialziation, train the full network or only some of the higher layers. Retrain bigger portion of the network, or even all of it.
Case study: VGGNet, 7.3% top 5 error. How many convolutions can you stack on top of each other? 100 million parameters vs 20 million parameters. Very inefficient in terms of space and speed. Very expensive.
Case study: GoogLeNet, Inception module
- Biological motivation: parallel pooling and convolution layers.
- 6.5% top 5 error.
- Several 1x1 convolutions.
- Helper loss (during training only)
ResNet, what are inception networks.
- 3.6% Top 5 error
- Creates a residual layer.
- Batch normalization after every CONV layer
- SGD + Momentum
- Pooling layer is not necessary anymore
- FC is being replaced by global averaging…
- Why does the Residual layer help so much?
- Each layer can multitask
- What happens when the residual transfer function has a difference in dimensions?
Activation functions: Biological Inspiration
- Nonlinearity, weighted function with bias
- Sigmoid
  - Goes from zero to one. Can kill gradients. Can be used for gating neurons. Differentiable logic functions.
  - Doesn’t work out with deep vision tasks
  - What happens to a neuron when input is all positive.
  - Gradient vectors are either all positive or all negative. In a high dimensional space, this quadrant is very small.
- tanh
  - used for RNNs
  - Strong saturation
- ReLU
  - Doesn’t ever saturate
  - Need to take care that there are no ReLUs outside the data envelope.
- Leaky ReLU
  - Left side is also sloped
- Maxout
- ELU

Lecture 7, Training Deep Neural Networks

AlexNet started it all in 2012 with 7 layers.
Transfer Learning with CNNs.
- 1. Train on ImageNet
- 1. If small dataset: fix all weights (treat CNN as fixed feature extractor), retrain only the classifier.
- i.e. swap the Softmax layer at the end.
- 1. If you hve medium sized dataset, “finetune” instead: use the old weights as initialization, train the full network or only some of the higher layers.
- retrain bigger portion of the network, or even all of it.
Activation Functions
- Nonlinearities is why Deep Learning really works.
Weight Initialization
- First idea: Fixed random initialization
- Gaussian with zero mean and fixed variance
- $W_{ij} \sim \mathcal{0,0.0001}$. Works OK for small networks. Used in Alexnet.
- Problem is that activations are scaled by weights in the forward pass.
- If the initial weights are too small/large, the activations will vanish or explode during the forward pass…
Analysis of activation growth
- Consider a linear layer (affine or convolutional), and define the fan-n F as the number of input neurons that contribute to each output.
- For an affine layer, the fan-n is just the number of neurons in the input layer.
- For a convolutional layer, the fan-in is $F_H x F_W x C$, where $C$ is the depth of the input layer.
- then each output activation $a_{out} = \sum_{i=1}^F W_ia_i$ and assuming the activations and weights are independent, same variance: $Var{a_{out}}=FVar(W)Var(a_{in})$. To get the same variance on input and output, we want $Var(W)=\frac{1}{F}$.
- this is called the Xavier initialziation
- Our analysis assumed that layers were linear. Not accurate with ReLU layers.
- ReLU adjustment to Xavier. Note additional / 2 factor of 2 doesn’t seem like much, but it applies multiplicatively 150 times in a large ResNet.
- Weight initialization is essential for converging to the best model.
Batch Normalization
- “Imagine neural activations were normal random variables”. Consider a batch of activations at some layer. To make each dimension unit normal, apply:
- $\hat{x}^{(k)}=\frac{}{}$
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Reduces need for dropout
- Un-normalization!! Re-computer and apply the optimal scaling and bias for each neuron! Learn $\gamma$ and $\beta$…
- Where to do BatchNorm! BatchNorm is just a linear scale/bias layer in the limit of large batch sizes $(\mu, \sigma^2)$
- In general, try BN before the nonlinearity.
Gradient Magnitudes
- Occasional “monster” gradients.
- Gradient Clipping by Value: Simply limit the magnitude of each gradient. Clip to limit the norm of the gradient.
- One-bit gradients.
- Batch normalization prevents all of this.
Dropout
- Randomly set some neurons to zero in the forward pass.
- Forward pass has the effect of ensembling.
- Create a boolean mask to implement a dropout.
- There is no dropout during test time, but during training, you do dropout.
- You need to renormalize after removing the dropout, by the expectation of the Bernoulli.
- $\gamma$ and $\beta$ allow the model to know that we are doing batch normalization
- Key insight: Higher order effects are removed by using the scaling and shifting.
- Dropout is effectively doing ensembling
Ensembles
- Bagging (Bootstrap Aggregation)
  - Train base models on bootstrap samples of the data. Take majority vote for classification tasks, or average output for regression. Models trained independently
  - Reduces variance in the prediction, not bias.
  - So works best with models that don’t have much bias.
  - Try to make errors in the models independent - e.g. by training them on different data with bootstrap sampling.
  - Models take a long time to train, bagging trivially parallelizes. Model differences seem to be mostly due to variance.
- Boosting
  - Learners are ordered: Each learner tries to reduce error (residual) on “hard” examples, which are those misclassified by earlier learners.
  - Models are dependent, trained sequentially.
  - ADABOOST: weight hard samples more; GRADIENT BOOST: use residual to train later models. Reduces bias and possibly variance compared to base learners.
  - Deep models are very powerful, not obvious that they have much bias, so unclear that boosting will help.
Ensemble Approaches
- True ensemble: train several models independently. Combining them:
  - Prediction averaging: averaged predicted probabilities, or just vote. Always works
  - Parameter averaging: average the parameters of the models. Almost never works (too many different equivalent parametrizations).
Model Snapshots: Just keep training a single base model (with fairly high learning rate), and take periodic snapshots of its parameters.
- Prediction averaging: Always works
- Parameter averaging: Often works! Snapshots are sufficiently close in parameter space!
- Parameter averaging means you only need to keep a single model for future predictions. If you use a moving average of the snapshots, you only keep one extra set of parameters.
Distilling the Knowledge in a Neural Network: Constructs a single (better) model from an ensemble.
Gradient Noise
- If a little noise is good, what about adding noise to gradients?
- A: Works

Lecture 8

Hyperparameter Optimization
- Normal Cross Validation
- Use a validation block within each training block.
- Coarse -> Fine tuning
- It’s best to optimize in log space!
- Random search works better than grid search
Monitor and visualize the loss curve
If the loss drops too quickly, that means that the learning rate is too high
Some networks take longer to get going.
Record as much as you can, capture all of the activations to see how big they are and if they make sense. They should be relatively close to the Xavier activations.
Training accuracy vs validation accuracy. Track ratio of weight updates / weight magnitudes. 0.01 is probably too high.
Localization and detection.
- Classification, Classification and localization, Object detection, instance segmentation
- Semantic segmentation: Label every pixel, don’t differentiate instances, just worry about pixels. Classic computer vision problem.
- Idea 1: classify every pixel
  - Repeats the same computation multiple times
  - Doesn’t make use of global information
- Idea 2: Convolutional Network
  - Design a network as a bunch of convolutional layers to make predictions for pixels all at once.
  - Convolutions at full resolution could be expensive
- Idea 3: Fully connected CNN
  - Design a network as a bunch of convolutional layers with downsampling and upsampling inside the network.
  - Downsampling: Pooling, strided convolution
  - In network upsampling: Striding
    - Nearest neighbor, bed of nails
    - Max pooling, max unpooling
  - Learnable upsampling: Transpose convolution
    - Somewhat intuitive
- Idea 4: UNet
  - Residuals with different resolutions
Still need to localize.
- Evaluation metric: Intersection over union
- Share layers other than the classification and regression layers
- You can have class agnostic bounding boxes and per class bounding boxes.
Aside: Localizing multiple objects.
- Multiple regression heads, multiple regression heads
- mAP “mean average precision”. Computer average precision separately for each class, then average over classes. Precision recall curve and stuff.
Region proposals
- Low precision, but high recall.
R-CNN:
- Input image, extract region proposals, run each proposal through a CNN
- Solution: share computation of convolutional layers between proposals for an image
- Problem #2:

Lecture 10

Visualizing NNs
- First layer, visualize the filter weights
- Doesn’t work for later layers because of nonlinearities
- Occulusion experiments, place grey boxes and block the image
- Saliency approaches (network centric)
  - Generate an image that maximizes the class score

Lecture 11: Attention

Review: T-SNE, Activation Maximization, Generate an image that maximizes a class score, Deconv approaches, Neural Style Transfer
Attention “the regarding of someone or something as interesting or important.” Attention is one of the most important ideas in deep networks in the last decade…Cross-cuts computer vision, NLP, speech, RL
Early attention models
2014, series of breakthroughs
- Visual attention for recognition
- Speech recognition
- Neural Turing machines
- Video description generation
- Conversational Agents
- Image caption generation
- Visual question answering…
Hard attention: attend to a single input location, cant use gradient descent, need reinforcement learning
Soft attention: compute a weighted combination (attention) over some inputs using an attention network. Can use backpropagation to train end-to-end.
Supervised learning: Input samples are independent, each sample x receives a label y. The pair (x,y) is assigned a loss value which is assumed to be differentiable.
Reinforcement learning: Learning visits a sequence of (correlated) states s_t in an epoch t=1,…,T. At time t, learner performs action a_t and receives reward r_t from the environment. Agent tries to maximize the sum of rewards over an epoch.

Lecture 12: Natural Language Processing

Text semantics: the meaning of the text. Text is embedded into a high dimensional space. Used for neural models.
Word similarity: Similar words occur in similar contexts.
Paradigmatic Similarity
Vector embedding, dimension reduction, vector of contexts
LSA, Latent Semantic Analysis: A neural network that could learn something like a SVD
Word2Vec, Center words and context words.

Written on