Neural net basics

Multi-layer perceptrons

<aside>

in math notation, a linear layer takes $X\in\mathbb R^{m\times n_\text{in}}$ and applies $W\in\mathbb R^{n_\text{in}\times n_\text{out}}$ and a bias $b\in\mathbb R^{n_\text{out}}$ as $XW+b$, with $b$ broadcast across the $m$ rows.

in PyTorch, the weight matrix $W$ is actually stored as $n_\text{out}\times n_\text{in}$. the forward pass transposes $W$, computing X @ W.T + b, i.e. $(m, n_\text{in})\times (n_\text{in},n_\text{out})\to(m,n_\text{out})$. the transpose is free because W.T is a view: it only swaps the strides and copies no data. with this layout, the gradient for $W$ comes out as $G^\top X$, where $G=\partial L/\partial Y\in\mathbb R^{m\times n_\text{out}}$, which is $n_\text{out}\times n_\text{in}$ and matches the stored shape of $W$.

</aside>
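The shape and stride claims in the note above can be checked directly. The following is a minimal sketch, assuming a standard `torch.nn.Linear` layer. The sizes `m`, `n_in`, `n_out` are arbitrary illustrative values, and the loss (the sum of all outputs) is chosen only because it makes the expected gradient easy to write down.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
m, n_in, n_out = 4, 3, 2
layer = nn.Linear(n_in, n_out)
X = torch.randn(m, n_in)

# The weight is stored as (n_out, n_in); the bias as (n_out,).
weight_shape = tuple(layer.weight.shape)
bias_shape = tuple(layer.bias.shape)

# The forward pass equals X @ W.T + b.
Y = layer(X)
manual = X @ layer.weight.T + layer.bias
matches = torch.allclose(Y, manual)

# W.T is a view: same underlying memory, swapped strides, no copy.
W = layer.weight
shares_storage = W.T.data_ptr() == W.data_ptr()
strides = (W.stride(), W.T.stride())

# The gradient has the same shape as the stored weight.
# With loss = Y.sum(), dL/dY is all ones, so dL/dW = ones(m, n_out).T @ X.
Y.sum().backward()
grad_shape = tuple(layer.weight.grad.shape)
expected_grad = torch.ones(m, n_out).T @ X
grad_matches = torch.allclose(layer.weight.grad, expected_grad)

print(weight_shape, bias_shape, strides, grad_shape)
```

Comparing `data_ptr()` is one simple way to confirm that the transpose shares memory with the original weight. The stride pair shows that only the indexing order changes.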

Activation functions
