Introduction to Deep Learning

# Introduction to Deep Learning
## Foundational Computational Biology II
### Kieran Campbell
### Lunenfeld Tanenbaum Research Institute & University of Toronto
### 2021-05-28 (updated: 2022-04-04)

---

# What we'll cover

1. Feed forward neural networks
2. Training neural nets: backpropagation
3. Convolutional neural networks
4. Recurrent neural networks
5. Autoencoders revisited

---

# Image classification

![imagenet-benchmarks](deep-learning-figs/chart.png)

---

# The perceptron

The perceptron [Ros58] early example of "biologically inspired" learning

Predict a binary output given input `$x$` via

$$
f(\mathbf{x})=\left \\{
`\begin{aligned}
1 & \text{ if } \mathbf{w}\cdot \mathbf{x}+b > 0, \\
0 & \text{ otherwise}
\end{aligned}`
\right.
$$

* `$f(x)$` classifies sample as 0 or 1 depending on `$x$`

* Iteratively adjust `$w$`, `$b$` to get `$f(x)$` to match ground truth on a training dataset

---

# Linear decision boundary

.center[
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Perceptron_example.svg/1024px-Perceptron_example.svg.png" width=50%>
]

---

# Multi-layer perceptrons

Previously:

$$
f(\mathbf{x})=\left \\{
`\begin{aligned}
1 & \text{ if } \mathbf{w}\cdot \mathbf{x}+b > 0, \\
0 & \text{ otherwise}
\end{aligned}`
\right.
$$
--

Rather than compare `$f(x)$` to our ground-truth, use it as input to a second function:

$$
g(\mathbf{x})=\left \\{
`\begin{aligned}
1 & \text{ if } v f(\mathbf{x}) + c > 0, \\
0 & \text{ otherwise}
\end{aligned}`
\right.
$$
* Typically use multiple `$f$`s as input

* Now want to adjust `$\mathbf{w}$`, `$\mathbf{b}$`, `$v$`, `$c$` to make `$g(x)$` as close to output as possible

Remember (from ACB-I) want to minimize `$\mathrm{LOSS}(\mathbf{y}, g(\mathbf{x}))$`

---

# Activation functions

In the perceptron the output of each layer is set to 0 or 1:

$$
f(\mathbf{x})=\left \\{
`\begin{aligned}
1 & \text{ if } \mathbf{w}\cdot \mathbf{x}+b > 0, \\
0 & \text{ otherwise}
\end{aligned}`
\right.
$$
Many possible _activation functions_ with appealing properties:

---

# DNNs visualized

Figure: [KG19]

---

background-image: url('intro-ml_figs/mountain.jpg')
background-position: center
background-size: contain
class: inverse

# Gradient descent

You're at the top of a mountain, it's getting dark, and you need to get down
--
 
* Your position `$(x,y)$` is your parameter space `$(w,b)$` to explore

--
 
* Your height is your loss you want to minimize

## Q: What's the strategy?

Take successive little steps downhill until things flatten out

## Local optimality

Note this doesn't guarantee you to get to the _bottom_, only to a much flatter region

---

# What is downhill?

> successive little steps downhill

.pull-right[
<img src="22-deep-learning_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]

Notice:
* When `$x > 1$` we want to go to the _left_ and `$\frac{\mathrm{d}y}{\mathrm{d}x} > 0$`
* When `$x < 1$` we want to go to the _right_ and  `$\frac{\mathrm{d}y}{\mathrm{d}x} < 0$`

The sign of the gradient always points uphill!

---

# Gradient descent

This suggests an iterative scheme:

1. Initialize some values for `$(w,b)$`

2. For a given number of steps:
 * Update `$w \leftarrow w - \epsilon \frac{\partial}{\partial w} \mathrm{LOSS(f(\mathbf{x}), \mathbf{y}; w, b)}$`
 * Update `$b \leftarrow b - \epsilon \frac{\partial}{\partial b} \mathrm{LOSS(f(\mathbf{x}), \mathbf{y}; w, b)}$`

3. Monitor `$\mathrm{LOSS(f(\mathbf{x}), \mathbf{y}; w, b)}$` - if it "levels off" we can end with the optimal values `$(w,b)$`

## Learning rate

`$\epsilon>0$` is known as the _learning rate_ or _step size_.

* Important parameter to tune: too large and you overshoot, too small and it's inefficient

---

# Backpropagation in one slide

To use gradient descent to train our neural network we need to compute

`$$\frac{\partial}{\partial \theta_i} \mathrm{LOSS}(f(\mathbf{x}), \mathbf{y}; \mathbf{\theta})$$`
where `$\mathbf{\theta}$` are all the parameters of the neural network

_Backpropagation_ is an efficient way of computing these derivatives using the chain rule and storing intermediate values

Recall chain rule: if `$y = g(f(x))$` then `$\frac{dy}{dx} = \frac{d g}{d f}\frac{df}{dx}$`

Deep NNs often composition of functions, e.g.

`$$\text{LOSS}(\text{activation}(\text{linear}(\textbf{x})))$$`
]

.pull-right[
![test](https://colah.github.io/posts/2015-08-Backprop/img/tree-backprop.png)
.footnote[
https://colah.github.io/posts/2015-08-Backprop/
]
]

---

# Deep neural networks for imaging data

Images are large!

CIFAR-10 dataset has "small" images (32x32x3) <svg viewBox="0 0 448 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;">  [ comment ]  <path d="M190.5 66.9l22.2-22.2c9.4-9.4 24.6-9.4 33.9 0L441 239c9.4 9.4 9.4 24.6 0 33.9L246.6 467.3c-9.4 9.4-24.6 9.4-33.9 0l-22.2-22.2c-9.5-9.5-9.3-25 .4-34.3L311.4 296H24c-13.3 0-24-10.7-24-24v-32c0-13.3 10.7-24 24-24h287.4L190.9 101.2c-9.8-9.3-10-24.8-.4-34.3z"></path></svg> 32x32x3 = 3072 weights for each neuron in input layer

A biologically-inspired solution to this is **convolutional neural networks** (CNNs)

1. **Convolutional layer**

Scans a set of learnable filters across the image

--
  
2. **Pooling layer**

Spatially downsamples / pools output

3. **Fully connected layer**

Computes output probabilities (similar to feed forward network)

---

# Convolutional layer

.pull-right[
  
  ### Key hyperparameters:
  
  * Width, height
  
  * Stride (how big a step do you take?)
  
  * How many features?
  
]

---

# (Max) pooling layer

.center[
  <img src="deep-learning-figs/maxpooling.png" width=70%>
  ]
  
* Reduces dimensionality via local aggregation

* Multiple variations depending on aggregation operation (max, mean)

---

# What do CNNs learn?

---

# Image augmentation

The meaning of images are subject to a set of _invariances_

**Example**: A photo of a chair upside-down is still a chair

One strategy is to feed augmented training data in to reflect these invariances (rotations, translations, altered colouring, skew,...)

---

# Deep neural networks for temporal data

Often data is has inherent ordering, e.g.:

* Time series data

* Textual data

Can represent this via `$y_t$` for `$t=1,\ldots,T$`

Typical task is to predict future values given past

### Why might a standard (feed forward) net not work so well here?

---

# Recurrent neural networks

Time dependent input-output pairs `$(x_t, y_t)$` for `$t=1,\ldots,T$`, e.g.:

* `$x$` = english words, `$y$` = french words  <svg viewBox="0 0 448 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;">  [ comment ]  <path d="M190.5 66.9l22.2-22.2c9.4-9.4 24.6-9.4 33.9 0L441 239c9.4 9.4 9.4 24.6 0 33.9L246.6 467.3c-9.4 9.4-24.6 9.4-33.9 0l-22.2-22.2c-9.5-9.5-9.3-25 .4-34.3L311.4 296H24c-13.3 0-24-10.7-24-24v-32c0-13.3 10.7-24 24-24h287.4L190.9 101.2c-9.8-9.3-10-24.8-.4-34.3z"></path></svg> translation

* `$x$` = image, `$y$` = caption  <svg viewBox="0 0 448 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;">  [ comment ]  <path d="M190.5 66.9l22.2-22.2c9.4-9.4 24.6-9.4 33.9 0L441 239c9.4 9.4 9.4 24.6 0 33.9L246.6 467.3c-9.4 9.4-24.6 9.4-33.9 0l-22.2-22.2c-9.5-9.5-9.3-25 .4-34.3L311.4 296H24c-13.3 0-24-10.7-24-24v-32c0-13.3 10.7-24 24-24h287.4L190.9 101.2c-9.8-9.3-10-24.8-.4-34.3z"></path></svg> image captioning

RNNs compute a hidden state `$h_t$` that's a function of `$x_t$` and `$h_{t-1}$`

`$h_t = g_1(x_t, h_{t-1})$`

`$y_t = g_2(h_t)$`

---

### Example: predict next character in sequence

---

---

# More recent developments in sequence modelling

## Long short term memory networks (LSTMs, [HS97])

* Allow for long range dependencies

* Fantastic blog post: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

## Transformer models

* Use _attention_, a form of adaptive input weighting

* Modern examples applied to language include GPT, BERT

---

# Deep learning in genomics

Many applications including

1. Predicting sequence specificity of DNA binding proteins

2. Predict methylation based on genome topology, DNA sequence

3. Predict expression from sequence

Good intro reading: [Zou+19]

## In practice...

Deep neural networks are **universal function approximators**

Consequently, they are **data hungry**

Think `$>1000$` samples before reaching for a DNN

---

# References

These slides: [camlab.ca/teaching](https://www.camlab.ca/teaching)

<small>
Fleuren, L. M., T. L. Klausch, C. L. Zwager, et al. (2020). "Machine
learning for the prediction of sepsis: a systematic review and
meta-analysis of diagnostic test accuracy". In: _Intensive care
medicine_ 46.3, pp. 383-400.

Hochreiter, S. and J. Schmidhuber (1997). "Long short-term memory". In:
_Neural computation_ 9.8, pp. 1735-1780.

Kriegeskorte, N. and T. Golan (2019). "Neural network models and deep
learning". In: _Current Biology_ 29.7, pp. R231-R236.

Rosenblatt, F. (1958). "The perceptron: a probabilistic model for
information storage and organization in the brain." In: _Psychological
review_ 65.6, p. 386.

Schmidt, R. M. (2019). "Recurrent neural networks (rnns): A gentle
introduction and overview". In: _arXiv preprint arXiv:1912.05911_.

Yosinski, J., J. Clune, A. Nguyen, et al. (2015). "Understanding neural
networks through deep visualization". In: _arXiv preprint
arXiv:1506.06579_.

Zou, J., M. Huss, A. Abid, et al. (2019). "A primer on deep learning in
genomics". In: _Nature genetics_ 51.1, pp. 12-18.
</small>