class: center, middle, inverse, title-slide

# An introduction to supervised machine learning
## Foundational Computational Biology I
### Kieran Campbell
### Lunenfeld Tanenbaum Research Institute & University of Toronto
### 2021-02-22 (updated: 2022-04-04)

---
class: inverse

# What we'll cover

- What is supervised learning?
- Continuous outputs: linear regression
- Categorical outputs: logistic regression
- A gallery of more sophisticated models
- Controlling model complexity

---

# What is supervised learning?

--

Given some input data, we would like to train a model that _accurately_ predicts a corresponding output

--

## Examples

--

1. Given today's weather (rainfall, hours of sunshine, average windspeed), will it rain tomorrow?
  - **Input**: today's weather, **output**: tomorrow's weather

--

2. Given a patient's laboratory test values (creatinine, glucose), does the patient have a particular disease?
  - **Input**: lab test values, **output**: diagnosis

--

3. Given a cancer biopsy gene expression measurement, what is the tissue of origin?
  - **Input**: gene expression for a given biopsy, **output**: cancer tissue of origin

---

# What's special compared to other prediction types?

--

Supervised machine learning explicitly involves

1. Past data where you _know_ the correct answer (input-output pairs)
2. A model that we train on past data to get _good_ at predicting outputs given inputs

--

Contrast this with:

1. **Rule-based predictions**: e.g. if `blood glucose > 125 mg/dL`, the patient has diabetes
2. **Simulation-based predictions**: e.g. use a physics simulator to forward-simulate tomorrow's weather, given today's

--

## When should we use supervised learning?

If you have sufficient, ★ **unbiased** ★ data, it _may_ be better than other approaches

---
class: top

# What's different from supervised learning?

Unsupervised learning (typically) has no measured outputs<sup>1</sup>

--

Instead we "find structure" in the data

--

## Examples

1. **Netflix**: are there certain types of viewers who prefer particular genres?
2. **Clinical oncology**: are there subtypes of breast cancer that have similar prognosis / respond similarly to treatment?

--

### The line between supervised and unsupervised learning is not always clear-cut!
.footnote[
[1] Very common in biomedical applications
]

<!-- --- -->
<!-- class: center -->
<!-- # Gene expression subtypes of pancreatic cancer -->
<!-- ![PDAC subtypes](intro-ml_figs/pdac-subtypes.png) -->
<!-- --- -->

---

# The Big Picture

Three ingredients of supervised ML

1. Some training data (input-output pairs)

--

2. A _model_ that maps **input data** to **output predictions**

--

3. A _loss function_ that measures the discrepancy between the **output prediction** and the **output data**

--

We aim to **iteratively improve the model to minimize the loss**.

--

## The Bigger Picture

In probabilistic ML → maximize the joint probability of the data given the model<sup>1</sup>

.footnote[
[1] Or thereabouts
]

---
class: center, middle, inverse

# Starting a supervised learning project

Always ask _is my output continuous or discrete?_

---

# Supervised learning

.pull-left[

## Continuous outputs

* Typically known as _regression_
* Loss is often mean squared error
* Example models:
  * Linear regression
  * Polynomial / spline regression
  * Random forests
  * Deep neural networks

![](22-introduction_to_supervised_learning_files/figure-html/unnamed-chunk-2-1.png)<!-- -->
]

--

.pull-right[

## Discrete outputs

* Typically known as _classification_
* Loss is often binary cross entropy
* Example models:
  * Logistic regression
  * Naive Bayes
  * Support vector machines
  * Deep neural networks

![](22-introduction_to_supervised_learning_files/figure-html/unnamed-chunk-3-1.png)<!-- -->
]

---
class: inverse, center, middle

# Regression (continuous outputs)

---

# Supervised ML

Example problem: predict the reduction in tumour volume given the expression of a gene

--

## Notation

`\(N\)` samples indexed by `\(n = 1, \ldots, N\)`, → `\(N\)` patients

--

Input data `\(x_n \in \mathbb{R}\)` for each sample<sup>1</sup>, → expression of a given gene

.footnote[
[1] In general high-dimensional, but we can assume low-dimensional here
]

--

Output data `\(y_n \in \mathbb{R}\)` for each sample, _e.g. change in tumour volume_

--

A _model_ `\(f_\theta\)` parametrized by `\(\theta\)` that maps the inputs to predicted outputs `\(t_n\)`, i.e.

`$$t_n = f_\theta(x_n)$$`

--

- `\(t_n\)` is the predicted output for sample `\(n\)`
- `\(y_n\)` is the actual output for sample `\(n\)`
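--

A minimal sketch of this notation in R, on simulated data (the variable names, the true relationship, and the values of `\(w\)` and `\(b\)` below are made up purely for illustration):

```r
set.seed(1)
N <- 5
x <- rnorm(N)                       # inputs x_n, e.g. expression of a gene
y <- 0.5 * x + rnorm(N, sd = 0.1)   # outputs y_n, e.g. change in tumour volume

f_theta <- function(x, w, b) w * x + b   # a simple model mapping inputs to predictions
t <- f_theta(x, w = 1, b = 0)            # predicted outputs t_n
cbind(x, y, t)
```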
---

# Defining a loss

We aim to adjust `\(\theta\)` to make `\(t_n\)` as _close_ as possible to `\(y_n\)`, averaged over `\(n\)`

--

Define some _distance_ between `\(t_n\)` and `\(y_n\)` and minimize it w.r.t. `\(\theta\)`

--

Common choice: mean squared error

`$$\mathrm{MSE(\mathbf{t}, \mathbf{y})} = \frac{1}{N} \sum_{n=1}^N (y_n-t_n)^2$$`

--

Remember this is still a function of `\(\theta\)`:

`$$\mathrm{MSE(\mathbf{t}, \mathbf{y})} = \frac{1}{N} \sum_{n=1}^N (y_n-f_\theta(x_n))^2$$`

---

# Linear regression

So what is linear regression?

--

`$$t_n = f_\theta(x_n) = wx_n + b$$`

--

* Parameters `\(\theta = \{w,b\}\)`

--

* `\(w\)` is known as the _weight_, `\(b\)` the _bias_

--

* The loss now becomes

`$$\mathrm{MSE(\mathbf{t}, \mathbf{y})} = \frac{1}{N} \sum_{n=1}^N (y_n-wx_n - b)^2$$`

--

Iteratively adjust `\(w\)` and `\(b\)` to minimize `\(\mathrm{MSE(\mathbf{t}, \mathbf{y})}\)`

---

# Minimizing losses

So how do we minimize `\(\mathrm{MSE(\mathbf{t}, \mathbf{y})}\)`? Strategies:

--

1. Randomly guess<sup>1</sup> `\((w,b)\)` and choose the values that result in the lowest `\(\mathrm{MSE(\mathbf{t}, \mathbf{y})}\)`

.footnote[
[1] Not recommended
]

--

2. Differentiate `\(\mathrm{MSE(\mathbf{t}, \mathbf{y})}\)` w.r.t. `\((w,b)\)`, set to 0, solve for `\((w,b)\)` (sketched below)

--

3. ★ Numerical methods ★
  * A giant scientific field
  * _Gradient descent_ increasingly popular → good in high-dimensional problems where gradients of the loss w.r.t. the parameters are available
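--

A minimal sketch of strategy 2 in practice, on simulated data (the variable names and true parameter values are made up for illustration); `lm()` solves the least-squares problem directly, returning the `\((w,b)\)` that minimize the MSE:

```r
set.seed(1)
x <- rnorm(100)                        # e.g. gene expression
y <- 2 * x - 1 + rnorm(100, sd = 0.5)  # e.g. change in tumour volume

fit <- lm(y ~ x)             # closed-form least-squares solution
coef(fit)                    # estimates of b (intercept) and w (slope)
mean((y - predict(fit))^2)   # the MSE at the fitted parameters
```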
---
class: inverse, center, middle

# Logistic regression (discrete outputs)

---

# Logistic regression

* Previously: `\(y_n\)` was continuous → regression
* Now: `\(y_n\)` is binary, i.e. `\(y_n = 0\)` or `\(1\)` → logistic regression

--

Instead of modelling `\(y_n \approx t_n = wx_n + b\)` we're now interested in modelling

`$$p(y_n=1 | x_n)$$`

--

Examples

--

* `\(y_n=1\)` → patient has diabetes, `\(x_n\)` → blood glucose, `\(p(y_n=1 | x_n)\)` → "_what's the probability the patient has diabetes given the blood glucose measurement?_"

--

* `\(y_n=1\)` → CRISPR guide RNA has an off-target effect, `\(x_n\)` → GC content

--

As before, we require `\((y_n, x_n)\)` _training points_ for `\(n=1,\ldots,N\)`
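--

A sketch of what such a training set might look like, on simulated data (the relationship between glucose and diabetes below is made up for illustration; `plogis()` is the logistic function we will meet on the next slides):

```r
set.seed(1)
glucose <- rnorm(200, mean = 100, sd = 25)   # x_n: blood glucose (mg/dL)
p <- plogis(0.1 * (glucose - 125))           # an assumed p(y_n = 1 | x_n)
diabetes <- rbinom(200, size = 1, prob = p)  # y_n: binary outcome
head(data.frame(glucose, diabetes))
```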
0l-22.2-22.2c-9.5-9.5-9.3-25 .4-34.3L311.4 296H24c-13.3 0-24-10.7-24-24v-32c0-13.3 10.7-24 24-24h287.4L190.9 101.2c-9.8-9.3-10-24.8-.4-34.3z"></path></svg> GC content -- As before, require `\((y_n, x_n)\)` _training points_ for `\(n=1,\ldots,N\)` --- # Loss function for logistic regression Let's define `\(\pi_n = p(y_n=1 | x_n)\)`. What is `\(p(y_n)\)` ? -- * If `\(y_n=1\)` then `\(p(y_n) = \pi_n\)` -- * If `\(y_n=0\)` then `\(p(y_n) = 1-\pi_n\)` -- We can write this as `$$p(y_n) = \pi_n^{y_n} (1-\pi_n)^{1-y_n}$$` -- For multiple observations `\(y_1, \ldots, y_N\)` we assume `$$p(y_1, \ldots, y_N) = \prod_{n=1}^N \pi_n^{y_n} (1-\pi_n)^{1-y_n}$$` -- Consider the log and maximize `$$\log p(y_1, \ldots, y_N) = \sum_{n=1}^N \left[ y_n \log(\pi_n) + (1-y_n) \log(1-\pi_n) \right]$$` --- # Logistic regression: modelling probabilities What's left? Need to specify how `\(\pi_n = p(y_n=1|x_n)\)` depends on `\(x_n\)` ! .pull-left[ Common choice: logistic function `$$\pi_n = \frac{1}{1 + \exp\left( -(wx_n + b) \right)}$$` ] .pull-right[ ![PDAC subtypes](intro-ml_figs/1280px-Logistic-curve.svg.png) ] -- Has the additional interpretation that the log-odds for a data point is linear in the covariates: `$$\log \frac{\pi_n}{1-\pi_n} = wx_n + b$$` ??? Many other possibilities for the function --- # Putting it all together Our loss: `$$l(w,b) = - \sum_{n=1}^N \left[ y_n \log(\pi_n) + (1-y_n) \log(1-\pi_n) \right]$$` where `$$\pi_n = \frac{1}{1 + \exp\left( -(wx_n + b) \right)}$$` -- Would like to minimize `\(l(w,b)\)` wrt `\(w,b\)` * Could use gradient descent as before since derivatives of `\(l\)` are computable * Better to use built-in implementations in R/Python/... --- # Logistic regression example .pull-left[ ![](22-introduction_to_supervised_learning_files/figure-html/unnamed-chunk-4-1.png)<!-- --> ] -- .pull-right[ ```r df <- data.frame(y = y, x = x) fit <- glm(y ~ x, data = df, family='binomial') df$predicted <- predict(fit, type='response') ggplot(df, aes(x = x, y = y)) + geom_point(alpha=.5) + geom_line(aes(y = predicted), colour='darkred', size=1.5) ``` ![](22-introduction_to_supervised_learning_files/figure-html/unnamed-chunk-5-1.png)<!-- --> ] --- class: inverse, center, middle # Moving past linear to more complex models --- # Moving past linear models So far we have modelled 1. [Linear regression] The predicted output `\(t_n(x_n)\)` 2. [Logistic regression] The conditional probabilities `\(p(y_n|x_n)\)` _linearly_ as a function of `\(x_n\)` via `\(wx_n + b\)` -- Almost certainly underestimates complexity of true relationship: * Higher order moments, e.g. `\(x^2, x^3\)` * Interactions between inputs, e.g. if `\(\mathbf{x_n}\)` is a vector, `\(x_{1,n} \times x_{2,n}\)` --- # Examples of more complex algorithms 1. Polynomial regression Model `\(y = m_0 + m_1 x + m_2 x^2 + m_3 x^3 + ...\)` -- 2. Decision trees / random forests Ask series of binary questions about input to arrive at output -- 3. Deep neural networks Model output as series of function compositions `\(y_n = f_K(f_{K-1}(...f_1(x_n)...))\)` where `$$f_k(x) = \sigma(wx+b)$$` `\(\sigma\)` often called _nonlinearity_, e.g. 
---

# Controlling model complexity

.pull-left[
Consider

`$$y_n = x_n \log(x_n) + \epsilon_n$$`

`$$\epsilon_n \sim \text{N}(0, 0.1^2)$$`

> `\(\epsilon_n\)` is drawn from a **N**ormal distribution with mean 0 and standard deviation 0.1
]

--

.pull-right[

```r
set.seed(123456)
x <- runif(100)
y <- x * log(x) + rnorm(100, sd = 0.1)
qplot(x, y)
```

![](22-introduction_to_supervised_learning_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
]

---

# Controlling model complexity

.pull-left[

## Linear fit

```r
fit <- lm(y ~ x)
df <- data.frame(x, y, t = predict(fit))

ggplot(df, aes(x = x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = t), colour = 'red')
```

![](22-introduction_to_supervised_learning_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]

--

.pull-right[

## High-degree polynomial fit

```r
fit <- lm(y ~ poly(x, 20))
df <- data.frame(x, y, t = predict(fit))

ggplot(df, aes(x = x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = t), colour = 'red')
```

![](22-introduction_to_supervised_learning_files/figure-html/unnamed-chunk-8-1.png)<!-- -->
]

---

# How do we set the model complexity?

--

Basic idea: tune model complexity by optimizing performance on a held-out dataset

--

![](intro-ml_figs/ML_dataset_training_validation_test_sets.png)

.center[Credit: _Wikipedia_]

---

# Practical example

## Create train and test datasets

`\(y_n = x_n + 5 x_n^2 + \epsilon, \; \; x_n \sim \text{Unif}(0,1), \epsilon \sim \text{N}(0, 0.5^2)\)`

```r
set.seed(123456)
x_train <- runif(100); x_test <- runif(100)
y_train <- x_train + 5 * x_train^2 + rnorm(100, sd = 0.5)
y_test <- x_test + 5 * x_test^2 + rnorm(100, sd = 0.5)
```

--

## Fit models on the training set

```r
models <- lapply(1:20, function(i) {
  lm(y_train ~ poly(x_train, i))
})
```

---

# Practical example

## Compute mean squared error on the test set

```r
mse <- sapply(models, function(fit) {
  # the newdata column name must match the training variable (x_train)
  test_predictions <- predict(fit, newdata = data.frame(x_train = x_test))
  mean((y_test - test_predictions)^2)
})
```

```r
qplot(1:20, mse) +
  labs(x = "Degree of polynomial", y = "Mean squared error (test)")
```

<img src="22-introduction_to_supervised_learning_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

# Overfitting and underfitting

* If your model is too complex, it adapts to the noise in the training set → **overfitting**

--

* If your model is too simple, it can't approximate the true function/decision boundary → **underfitting**

--

.center[
<img src="intro-ml_figs/3x/model-complexity@3x.png" width="50%">
]

--

* Closely related to the bias-variance tradeoff
---

# Controlling model complexity: penalized models

A popular method for controlling model complexity is to penalize coefficient size

--

Consider `\(P\)` input variables → `\(P\)` coefficients

`$$y_n = \sum_{p=1}^P w_p x_{np} + \epsilon$$`

--

We would like to keep the entries of `\(\mathbf{w}\)` _small_ and/or _sparse_

--

Do this by minimizing

`$$\text{MSE}(\mathbf{y}, \mathbf{t}) + \lambda \times \text{size}(\mathbf{w})$$`

--

* How we specify `\(\text{size}(\mathbf{w})\)` gives different properties

--

* `\(\lambda\)` controls the trade-off between predictive accuracy and model complexity

--

* Typically we set `\(\lambda\)` by optimizing accuracy on a test set

---

# Choice of penalties

.pull-left[

## L2 norm (_Ridge_)

Set `\(\text{size}(\mathbf{w}) = \sqrt{\sum_{p=1}^P w_p^2}\)`

Leads to non-sparse solutions with small coefficients
]

--

.pull-right[

## L1 norm (_Lasso_)

Set `\(\text{size}(\mathbf{w}) = \sum_{p=1}^P |w_p|\)`

Leads to sparse solutions

A combination of the L2 + L1 norms gives the _elastic net_
]

--

## Implementations

R: [glmnet](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html)

Python: [sklearn](https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_and_elasticnet.html)
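---

# Penalized regression in R

A minimal lasso sketch with `glmnet`, on simulated data (this assumes the `glmnet` package is installed; here `\(\lambda\)` is chosen by `cv.glmnet`'s built-in cross-validation rather than a single held-out test set):

```r
library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)  # P = 20 input variables
y <- X[, 1] - 2 * X[, 2] + rnorm(100)                # only the first two matter

cv_fit <- cv.glmnet(X, y, alpha = 1)   # alpha = 1 -> L1 (lasso); alpha = 0 -> L2 (ridge)
coef(cv_fit, s = "lambda.min")         # sparse coefficients at the CV-chosen lambda
```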
---

# Wrap up

1. Supervised learning requires existing input-output pair data

--

2. Output is continuous → regression → mean squared error

--

3. Output is binary → classification → binary cross entropy

--

4. We can minimize loss functions in multiple ways, including gradient descent

--

5. Many models go beyond linear ones, including random forests and deep neural networks

--

6. Using too complex a model can lead to _overfitting_

--

7. We can reduce model complexity by penalizing coefficient sizes

--

8. We can select model complexity by optimizing performance on a held-out test set

---

# Resources

* Bishop's _Pattern Recognition and Machine Learning_ is [free online](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf)

--

* [scikit-learn](https://scikit-learn.org/stable/): a one-stop Python shop for everything machine learning
  * Many toy and real datasets readily available
  * `pipeline` functionality very useful

--

* [Kaggle](https://www.kaggle.com/) for competitions

--

Not mentioned here: **visualizing** your data before doing anything else is **key**!

--

These slides: [camlab.ca/teaching](https://www.camlab.ca/teaching)

---
background-image: url('intro-ml_figs/mountain.jpg')
background-position: center
background-size: contain
class: inverse

# Gradient descent

You're at the top of a mountain, it's getting dark, and you need to get down

--

* Your position `\((x,y)\)` is the parameter space `\((w,b)\)` you explore

--

* Your height is the `\(\mathrm{MSE(\mathbf{t}, \mathbf{y})}\)` you want to minimize

--

## Q: What's the strategy?

--

Take successive little steps downhill until things flatten out

--

## Local optimality

Note this doesn't guarantee you get to the _bottom_, only to a much flatter region → a big problem depending on the shape of your mountain / loss function

---

# What is downhill?

> successive little steps downhill

--

.pull-left[
Consider `\(y=(x-1)^2\)`, `\(\frac{\mathrm{d}y}{\mathrm{d}x} = 2(x-1)\)`
]

.pull-right[
<img src="22-introduction_to_supervised_learning_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]

--

Notice:

* When `\(x > 1\)` we want to go to the _left_ and `\(\frac{\mathrm{d}y}{\mathrm{d}x} > 0\)`
* When `\(x < 1\)` we want to go to the _right_ and `\(\frac{\mathrm{d}y}{\mathrm{d}x} < 0\)`

The gradient always points uphill, so we step in the opposite direction!

---

# Gradient descent

This suggests an iterative scheme:

1. Initialize some values for `\((w,b)\)`

--

2. For a given number of steps:
  * Update `\(w \leftarrow w - \epsilon \frac{\partial}{\partial w} \mathrm{MSE(\mathbf{t}, \mathbf{y}; w, b)}\)`
  * Update `\(b \leftarrow b - \epsilon \frac{\partial}{\partial b} \mathrm{MSE(\mathbf{t}, \mathbf{y}; w, b)}\)`

--

3. Monitor `\(\mathrm{MSE(\mathbf{t}, \mathbf{y}; w, b)}\)`: if it "levels off" we can stop and keep the current values of `\((w,b)\)`

--

## Learning rate

`\(\epsilon>0\)` is known as the _learning rate_ or _step size_.

* Important parameter to tune: too large and you overshoot, too small and it's inefficient
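---

# Gradient descent in R

A minimal sketch implementing this scheme for the linear-regression MSE, on simulated data (the true parameters, learning rate, and number of steps are arbitrary choices for illustration):

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x - 1 + rnorm(100, sd = 0.5)   # simulated data with true w = 2, b = -1

w <- 0; b <- 0; eps <- 0.1              # initialization and learning rate
for (step in 1:500) {
  t <- w * x + b                        # current predictions
  dw <- -2 * mean((y - t) * x)          # dMSE/dw
  db <- -2 * mean(y - t)                # dMSE/db
  w <- w - eps * dw
  b <- b - eps * db
}
c(w = w, b = b)                         # should end up close to lm(y ~ x)'s estimates
```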