Neural net basics

Multi-layer perceptrons

multi-layer perceptron: fully-connected network with an input layer, at least one hidden layer, and an output layer
- often used synonymously with “feed-forward network”, even though FFN is technically the broader category: any network where information flows in one direction only (no cycles)
a single neuron computes a weighted sum of its inputs, adds a bias, and passes the result through an activation function
- $\mathbf x\in\mathbb R^n$ is the input vector (activations from the previous layer)
- $\mathbf w\in\mathbb R^n$ is the weight vector (edge weights leading into the neuron)
- $b\in\mathbb R$ is the bias
- $f$ is the activation function
$$ y=f\left(\sum_{i=1}^n w_i x_i+b\right)=f(\mathbf w^\top\mathbf x +b) $$
a layer with $n_\text{in}$ inputs and $n_\text{out}$ neurons can be computed through matrix multiplication
- so $\mathbf x\in\mathbb R^{n_\text{in}}$ (column vector)
- stack all the weight vectors into a single weight matrix $W\in\mathbb R^{n_\text{out}\times n_\text{in}}$
  - each row is the weights going into a single neuron
- stack biases into vector $\mathbf b\in\mathbb R^{n_\text{out}}$
- output hidden state will have shape $\mathbf h\in\mathbb R^{n_\text{out}}$
$$ \mathbf h=f(W\mathbf x+\mathbf b) $$
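a minimal NumPy sketch of the single-example layer above. the sizes, the random seed, and the choice of tanh for $f$ are assumptions made only for illustration. it checks that the matrix form gives the same result as computing each neuron separately.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 3, 2                    # illustrative sizes (assumed)

x = rng.normal(size=n_in)             # input vector, shape (n_in,)
W = rng.normal(size=(n_out, n_in))    # row j holds the weights into neuron j
b = rng.normal(size=n_out)            # one bias per neuron
f = np.tanh                           # activation, chosen only for the demo

# whole layer at once: h = f(W x + b)
h = f(W @ x + b)

# same thing neuron by neuron: y_j = f(w_j . x + b_j)
h_per_neuron = np.array([f(W[j] @ x + b[j]) for j in range(n_out)])

print(h.shape)                        # (2,)
print(np.allclose(h, h_per_neuron))   # True
```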
in practice, we process a batch of $m$ inputs at once!
- in this case, we arrange inputs as rows of a matrix $X \in\mathbb R^{m\times n_\text{in}}$
- conventionally change $W$ to have shape $\mathbb R^{n_\text{in}\times n_\text{out}}$
  - each column is the weights going into a single neuron
- the layer then becomes
  
  $$ H=f(XW+\mathbf b) $$
  - where $\mathbf b$ is broadcast to have shape $m\times n_\text{out}$
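the batched form can be sketched the same way. again the sizes, seed, and tanh activation are assumptions. NumPy broadcasts the `(n_out,)` bias across the $m$ rows, and each row of $H$ matches the single-example formula.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in, n_out = 4, 3, 2              # illustrative sizes (assumed)

X = rng.normal(size=(m, n_in))        # one example per row
W = rng.normal(size=(n_in, n_out))    # column j holds the weights into neuron j
b = rng.normal(size=n_out)            # shape (n_out,), broadcast over the m rows
f = np.tanh                           # activation, chosen only for the demo

H = f(X @ W + b)                      # (m, n_in) @ (n_in, n_out) -> (m, n_out)

# row i of H equals the single-example formula f(W^T x_i + b),
# where W.T has the (n_out, n_in) layout from the unbatched case
H_rows = np.stack([f(W.T @ X[i] + b) for i in range(m)])

print(H.shape)                        # (4, 2)
print(np.allclose(H, H_rows))         # True
```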

<aside>
in math notation, a linear layer takes $X\in\mathbb R^{m\times n_\text{in}}$ and applies $W\in\mathbb R^{n_\text{in}\times n_\text{out}}$ as $XW+\mathbf b$.

in PyTorch, the weight matrix $W$ of nn.Linear is actually stored as $n_\text{out}\times n_\text{in}$, the same layout as the single-example form $W\mathbf x+\mathbf b$. the forward pass transposes $W$, computing X @ W.T: $(m, n_\text{in})\times (n_\text{in},n_\text{out})$. the transpose itself is free because it returns a view that only swaps the strides. the gradient for $W$ then naturally comes out as $n_\text{out}\times n_\text{in}$, matching the shape of the stored $W$.

</aside>
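the "transpose only changes the stride" point can be illustrated with NumPy, used here as an analogue rather than PyTorch itself. the sizes and seed are assumptions. the transposed array shares memory with the original and just reports swapped strides.

```python
import numpy as np

m, n_in, n_out = 4, 3, 2                    # illustrative sizes (assumed)
rng = np.random.default_rng(0)

X = rng.normal(size=(m, n_in))
W = rng.normal(size=(n_out, n_in))          # stored as (n_out, n_in)

Wt = W.T                                    # a view: no data is copied
print(W.strides, Wt.strides)                # (24, 8) (8, 24) for float64
print(np.shares_memory(W, Wt))              # True

Z = X @ Wt                                  # (m, n_in) @ (n_in, n_out)
print(Z.shape)                              # (4, 2)
```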

let’s do the backprop for $Z=XW+b$

$$ \frac{\partial L}{\partial X}=\frac{\partial L}{\partial Z}W^\top\quad (m,n_\text{out})\times (n_\text{out},n_\text{in})=(m,n_\text{in})\\[1em] \frac{\partial L}{\partial W}=X^\top\frac{\partial L}{\partial Z}\quad (n_\text{in}, m)\times (m,n_\text{out})=(n_\text{in},n_\text{out}) $$
- the same bias $\mathbf b\in\mathbb R^{n_\text{out}}$ is added to every sample, and each sample produces its own gradient for $\mathbf b$
  - these gradients thus accumulate
  $$ \frac{\partial L}{\partial b_j}=\sum_{i=1}^m\frac{\partial L}{\partial z_{ij}}\cdot\frac{\partial z_{ij}}{\partial b_j}=\sum_{i=1}^m\frac{\partial L}{\partial z_{ij}}\cdot 1=\sum_{i=1}^m\frac{\partial L}{\partial z_{ij}} $$
- the most intuitive way to see the $\partial L/\partial X$ formula
  - for a single example (one row of $X$), we know that $\partial L/\partial X=\partial L/\partial Z \cdot W^\top$, with shape $(n_\text{in})$
  - when $X$ has a batch dimension, we know we are looking for output with shape $(m,n_\text{in})$
  - each row $i$ of $Z$ depends only on row $i$ of $X$ (the batch examples don’t interact)
  - so we can just stack the gradients for each row
- in general, derive Jacobian for a single example (which is clean, 2-dimensional)
  - if the tensor is shared across the batch (like $W$), then the batch dimension is summed out → contract (matmul where the batch dim is the inner dimension)
  - if the tensor is not shared (like $X$, activations), the batch dimension is preserved → stack (matmul with batch dim on the outside)
- note the PyTorch implementation with $Z=XW^\top$ with $W\in\mathbb R^{n_\text{out}\times n_\text{in}}$ looks like this
  
  $$ \frac{\partial L}{\partial X}=\frac{\partial L}{\partial Z} W\quad (m,n_\text{out})\times (n_\text{out}, n_\text{in})=(m,n_\text{in})\\[1em] \frac{\partial L}{\partial W}=\left(\frac{\partial L}{\partial Z}\right)^\top X\quad (n_\text{out}, m)\times (m,n_\text{in})=(n_\text{out},n_\text{in}) $$
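one way to sanity-check these formulas is a finite-difference gradient check. the sketch below assumes a toy loss $L=\tfrac12\sum Z^2$ (so $\partial L/\partial Z=Z$), made-up sizes, and a hypothetical helper `numeric_grad`. none of these come from the notes. it also shows the stack vs. contract view and the PyTorch-style layout.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in, n_out = 4, 3, 2                  # illustrative sizes (assumed)
X = rng.normal(size=(m, n_in))
W = rng.normal(size=(n_in, n_out))
b = rng.normal(size=n_out)

def loss(X, W, b):
    # toy loss chosen for the demo: L = 0.5 * sum(Z^2), so dL/dZ = Z
    Z = X @ W + b
    return 0.5 * np.sum(Z ** 2)

# analytic gradients from the formulas above
dZ = X @ W + b                            # dL/dZ, shape (m, n_out)
dX = dZ @ W.T                             # (m, n_out) @ (n_out, n_in) -> (m, n_in)
dW = X.T @ dZ                             # (n_in, m) @ (m, n_out) -> (n_in, n_out)
db = dZ.sum(axis=0)                       # batch dimension summed out -> (n_out,)

def numeric_grad(fn, A, eps=1e-6):
    # central finite differences, one entry of A at a time
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        A_plus, A_minus = A.copy(), A.copy()
        A_plus[idx] += eps
        A_minus[idx] -= eps
        grad[idx] = (fn(A_plus) - fn(A_minus)) / (2 * eps)
    return grad

ok_X = np.allclose(dX, numeric_grad(lambda A: loss(A, W, b), X), atol=1e-5)
ok_W = np.allclose(dW, numeric_grad(lambda A: loss(X, A, b), W), atol=1e-5)
ok_b = np.allclose(db, numeric_grad(lambda A: loss(X, W, A), b), atol=1e-5)
print(ok_X, ok_W, ok_b)                   # True True True

# not shared (X): stack the per-example gradients W @ dz_i as rows
dX_stacked = np.stack([W @ dZ[i] for i in range(m)])
# shared (W): sum the per-example outer products x_i dz_i^T over the batch
dW_summed = sum(np.outer(X[i], dZ[i]) for i in range(m))
print(np.allclose(dX, dX_stacked), np.allclose(dW, dW_summed))  # True True

# PyTorch-style layout: W_pt has shape (n_out, n_in) and Z = X @ W_pt.T
W_pt = W.T
dX_pt = dZ @ W_pt                         # (m, n_out) @ (n_out, n_in) -> (m, n_in)
dW_pt = dZ.T @ X                          # (n_out, m) @ (m, n_in) -> (n_out, n_in)
print(dW_pt.shape == W_pt.shape)          # True
```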

Activation functions

sigmoid $\sigma(x)\in(0,1)$

$$ \sigma(x)=\frac{1}{1+e^{-x}} $$
- good for interpreting outputs as probabilities
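- rarely used for hidden layers in modern neural nets
  - vanishing gradients: the derivative is $\sigma(x)(1-\sigma(x)) \leq 0.25$, so every sigmoid layer scales the backpropagated gradient by at most 0.25
  - not zero-centered: outputs are always positive, so the gradients on all the weights feeding into a single next-layer neuron share one sign (all positive or all negative, depending on the upstream grad)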
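a small stdlib-only sketch of the vanishing-gradient bound. the function names and the choice of 10 layers are assumptions for illustration.

```python
import math

def sigmoid(x: float) -> float:
    # naive form; math.exp overflows for very negative x (fine for this demo)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))        # 0.25, the maximum
print(sigmoid_grad(10.0))       # about 4.5e-05: saturated, almost no gradient

# backprop through 10 sigmoid layers multiplies 10 such factors,
# so even in the best case the gradient shrinks by 0.25**10
print(0.25 ** 10)               # about 9.5e-07
```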
tanh $\in(-1,1)$

$$ \tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=2\sigma(2x)-1 $$
- derivative peaks at 1.0 (for $x=0$), can still vanish
- $\tanh^\prime$ factors only ever shrink the gradient, since $\tanh^\prime(x)=1-\tanh^2(x)\in(0,1]$
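a quick numerical check of the identity $\tanh(x)=2\sigma(2x)-1$ and of the derivative range, using only the standard library. the test points are arbitrary choices.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2

for x in (-2.0, 0.0, 0.5, 3.0):
    # identity from above: tanh(x) = 2 * sigmoid(2x) - 1
    assert math.isclose(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0, abs_tol=1e-12)

print(tanh_grad(0.0))    # 1.0, the peak
print(tanh_grad(3.0))    # about 0.0099: still vanishes once the unit saturates
```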
softmax → probability distribution

$$ \text{softmax}(\mathbf x)_i=\frac{e^{x_i}}{\sum_j e^{x_j}} $$
- with temperature
  
  $$ \text{softmax}(\mathbf x/T)_i=\frac{e^{x_i/T}}{\sum_j e^{x_j/T}} $$
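a sketch of softmax with temperature in NumPy. the max-subtraction trick is a standard numerical-stability step that the formulas above do not mention, and the logits are made up for the example.

```python
import numpy as np

def softmax(x, T=1.0):
    # subtracting the max does not change the result but avoids overflow in exp
    z = np.asarray(x, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]             # made-up logits for illustration
p_sharp = softmax(logits, T=0.5)     # low T: mass concentrates on the largest logit
p_base = softmax(logits, T=1.0)
p_flat = softmax(logits, T=5.0)      # high T: closer to uniform

print(p_base.round(3))               # [0.659 0.242 0.099]
print(p_sharp[0] > p_base[0] > p_flat[0])   # True
```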
ReLU $\in[0, \infty)$

$$ \operatorname{ReLU}(x)=\max(x,0) $$
- derivative is 1 for $x>0$, 0 for $x<0$
- dying ReLUs: if the pre-activation of a neuron becomes permanently negative (i.e., negative for every input), its output and its gradient are both zero, so its weights stop updating and it never recovers
  - a fraction of the network can go dead during training
Leaky ReLU $\in(-\infty,\infty)$

$$ \text{LeakyReLU}(x)=\begin{cases}x&\text{if }x>0\\\alpha x&\text{if }x\leq0\end{cases} $$
- fixes the dying ReLU problem: with a small slope $\alpha>0$, the gradient for negative inputs is $\alpha$ rather than 0
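a sketch comparing the gradients of ReLU and Leaky ReLU on a unit whose pre-activations are all negative. the value `alpha=0.01` and the sample pre-activations are assumptions, and the derivative at exactly 0 is set to 0 by convention here.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # derivative is 1 for x > 0 and 0 for x < 0; at x == 0 this sketch uses 0
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

# a "dead" unit: its pre-activation is negative for every input in the batch
z = np.array([-3.0, -1.5, -0.2])

print(relu_grad(z))          # [0. 0. 0.]
print(leaky_relu_grad(z))    # [0.01 0.01 0.01]
```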
Swish (smooth, non-monotonic)

$$ \text{Swish}(x)=x\cdot\sigma(x) $$
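a tiny stdlib sketch showing the non-monotonic dip: moving from $x=-5$ to $x=-1$ the output goes down, then it rises again towards 0. the sample points are arbitrary.

```python
import math

def swish(x: float) -> float:
    # x * sigmoid(x), written as x / (1 + e^{-x})
    return x / (1.0 + math.exp(-x))

print(swish(-5.0))   # about -0.033
print(swish(-1.0))   # about -0.269
print(swish(0.0))    # 0.0
```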
GLU uses one linear projection to produce the “content” [left], and another to produce the gate [right]

$$ \text{GLU}(x)=xW_1\odot\sigma(xW_2) $$
SwiGLU plugs Swish in as the activation function inside GLU

$$ \text{SwiGLU}(x)=(x W_1)\odot \text{Swish}(xW_2) $$
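both gated units can be sketched directly from the formulas above, following the convention used here ($W_1$ for the content branch, $W_2$ for the gate). the sizes, seed, and bias-free projections are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    return z * sigmoid(z)

def glu(x, W1, W2):
    # content branch x @ W1, gated elementwise by sigmoid(x @ W2)
    return (x @ W1) * sigmoid(x @ W2)

def swiglu(x, W1, W2):
    # same structure, with Swish replacing the sigmoid on the gate branch
    return (x @ W1) * swish(x @ W2)

rng = np.random.default_rng(0)
m, d_in, d_hidden = 4, 8, 16          # illustrative sizes (assumed)
x = rng.normal(size=(m, d_in))
W1 = rng.normal(size=(d_in, d_hidden))
W2 = rng.normal(size=(d_in, d_hidden))

print(glu(x, W1, W2).shape)           # (4, 16)
print(swiglu(x, W1, W2).shape)        # (4, 16)
```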
without non-linearities, neural nets can’t do anything more than a linear transform
- extra layers can be collapsed into a single linear transform: $W_2(W_1\mathbf x)=(W_2W_1)\mathbf x=W\mathbf x$
- without non-linearities, adding more layers doesn’t give any more representational power
- with non-linearities between the layers, a large enough network can approximate any continuous function on a compact domain arbitrarily well (universal approximation)
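a small NumPy sketch of the collapse argument, with hand-picked matrices (assumed, bias-free). here `W1` is applied first, so the combined matrix is `W2 @ W1`. putting a ReLU between the layers breaks linearity: a linear map $g$ must satisfy $g(-\mathbf x)=-g(\mathbf x)$, and this one does not.

```python
import numpy as np

W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])          # first layer: R^2 -> R^3
W2 = np.array([[1.0, 1.0, 1.0]])     # second layer: R^3 -> R^1
x = np.array([1.0, -2.0])

# two linear layers equal one linear layer with W = W2 @ W1
W = W2 @ W1
print(np.allclose(W2 @ (W1 @ x), W @ x))   # True

# with a ReLU in between, the map is no longer linear
def net(v):
    return W2 @ np.maximum(W1 @ v, 0.0)

print(net(x), net(-x))                     # [1.] [3.]
```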