multi-layer perceptron: fully-connected network with an input layer, at least one hidden layer, and an output layer
a single neuron computes a weighted sum of its inputs, adds a bias, and passes the result through an activation function
$$ y=f\left(\sum_{i=1}^n w_i x_i+b\right)=f(\mathbf w^\top\mathbf x +b) $$
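the formula above can be sketched in a few lines of plain Python. this is only an illustration: the function names, the choice of sigmoid as $f$, and the input values are assumptions, not part of the notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, f=sigmoid):
    """y = f(w . x + b) for a single input vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
    return f(z)

# made-up inputs: the weighted sum is 0.5*1 - 0.25*2 + 0 = 0, and sigmoid(0) = 0.5
y = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
print(y)  # 0.5
```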
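a layer with $n_\text{in}$ inputs and $n_\text{out}$ neurons stacks the neurons' weight vectors as the rows of $W\in\mathbb R^{n_\text{out}\times n_\text{in}}$, so the whole layer is a single matrix-vector product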
$$ \mathbf h=f(W\mathbf x+\mathbf b) $$
in practice, we process a batch of $m$ inputs at once!
in this case, we arrange inputs as rows of a matrix $X \in\mathbb R^{m\times n_\text{in}}$
by convention, $W$ is then transposed to $W\in\mathbb R^{n_\text{in}\times n_\text{out}}$ so that the product $XW$ is well-defined
the layer then becomes
$$ H=f(XW+\mathbf b) $$
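a minimal sketch of the batched layer $H=f(XW+\mathbf b)$ in plain Python, mainly to make the shapes concrete. the helper names (`matmul`, `dense`), the use of ReLU as $f$, and all the numbers are assumptions for illustration.

```python
def matmul(A, B):
    """Plain-Python matrix product of A (p x q) and B (q x r)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(z):
    return max(z, 0.0)

def dense(X, W, b, f=relu):
    """H = f(XW + b); the bias b is broadcast across the m rows of XW."""
    Z = matmul(X, W)
    return [[f(z + bj) for z, bj in zip(row, b)] for row in Z]

X = [[1.0, 2.0, 3.0],
     [0.0, -1.0, 1.0]]   # (m=2, n_in=3), one input per row
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, -1.0]]        # (n_in=3, n_out=2)
b = [0.5, 0.5]           # (n_out=2,)

H = dense(X, W, b)       # (m=2, n_out=2)
print(H)                 # [[4.5, 0.0], [1.5, 0.0]]
```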
<aside>
in math notation, a linear layer takes $X\in\mathbb R^{m\times n_\text{in}}$ and applies $W\in\mathbb R^{n_\text{in}\times n_\text{out}}$ as $XW+\mathbf b$.
in PyTorch, the weight matrix $W$ is actually stored with shape $(n_\text{out}, n_\text{in})$. the forward pass transposes $W$, computing X @ W.T, i.e. $(m, n_\text{in})\times (n_\text{in},n_\text{out})$. the transpose is free because it returns a view that only swaps the strides; no data is copied. with this layout, the gradient for $W$ comes out as $(n_\text{out}, n_\text{in})$, matching the stored shape of $W$.
</aside>
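let’s do the backprop for $Z=XW+\mathbf b$, given the upstream gradient $\frac{\partial L}{\partial Z}\in\mathbb R^{m\times n_\text{out}}$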
$$ \frac{\partial L}{\partial X}=\frac{\partial L}{\partial Z}W^\top\quad (m,n_\text{out})\times (n_\text{out},n_\text{in})=(m,n_\text{in})\\[1em] \frac{\partial L}{\partial W}=X^\top\frac{\partial L}{\partial Z}\quad (n_\text{in}, m)\times (m,n_\text{out})=(n_\text{in},n_\text{out}) $$
the same bias $\mathbf b\in\mathbb R^{n_\text{out}}$ is added to every sample, so each sample contributes its own gradient for $\mathbf b$, and these contributions are summed over the batch
$$ \frac{\partial L}{\partial b_j}=\sum_{i=1}^m\frac{\partial L}{\partial z_{ij}}\cdot\frac{\partial z_{ij}}{\partial b_j}=\sum_{i=1}^m\frac{\partial L}{\partial z_{ij}}\cdot 1=\sum_{i=1}^m\frac{\partial L}{\partial z_{ij}} $$
the most intuitive way to see this (and a good habit in general): derive the Jacobian for a single example, where it is a clean 2-dimensional matrix, then sum the per-example gradients over the batch
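the gradient formulas for $Z=XW+\mathbf b$ can be sanity-checked numerically. the sketch below is plain Python with a toy loss $L=\tfrac12\sum_{ij} z_{ij}^2$ (an assumption, chosen so that $\partial L/\partial Z=Z$); it compares the analytic $\partial L/\partial W$ and $\partial L/\partial \mathbf b$ against central finite differences. helper names and numbers are made up.

```python
def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def forward(X, W, b):
    """Z = XW + b, with b broadcast over the rows."""
    return [[z + bj for z, bj in zip(row, b)] for row in matmul(X, W)]

def loss(X, W, b):
    # toy loss (an assumption for this sketch): L = 0.5 * sum(Z**2), so dL/dZ = Z
    return 0.5 * sum(z * z for row in forward(X, W, b) for z in row)

X = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]    # (m=2, n_in=3)
W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # (n_in=3, n_out=2)
b = [0.5, -0.5]                            # (n_out=2,)

dZ = forward(X, W, b)                      # dL/dZ, (m, n_out)
dX = matmul(dZ, transpose(W))              # (m, n_out) x (n_out, n_in) = (m, n_in)
dW = matmul(transpose(X), dZ)              # (n_in, m) x (m, n_out) = (n_in, n_out)
db = [sum(col) for col in zip(*dZ)]        # each row of dZ is one sample's gradient for b

eps = 1e-6

def num_dW(i, j):
    """Central finite difference of the loss with respect to W[i][j]."""
    Wp = [row[:] for row in W]; Wp[i][j] += eps
    Wm = [row[:] for row in W]; Wm[i][j] -= eps
    return (loss(X, Wp, b) - loss(X, Wm, b)) / (2 * eps)

def num_db(j):
    """Central finite difference of the loss with respect to b[j]."""
    bp = b[:]; bp[j] += eps
    bm = b[:]; bm[j] -= eps
    return (loss(X, W, bp) - loss(X, W, bm)) / (2 * eps)

max_err_W = max(abs(dW[i][j] - num_dW(i, j)) for i in range(3) for j in range(2))
max_err_b = max(abs(db[j] - num_db(j)) for j in range(2))
print(db, max_err_W, max_err_b)
```

only $\partial L/\partial W$ and $\partial L/\partial\mathbf b$ are checked numerically here; $\partial L/\partial X$ could be checked the same way by perturbing entries of $X$.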
note that in the PyTorch layout, where $Z=XW^\top$ (bias omitted) and $W\in\mathbb R^{n_\text{out}\times n_\text{in}}$, the gradients look like this
$$ \frac{\partial L}{\partial X}=\frac{\partial L}{\partial Z} W\quad (m,n_\text{out})\times (n_\text{out}, n_\text{in})=(m,n_\text{in})\\[1em] \frac{\partial L}{\partial W}=\left(\frac{\partial L}{\partial Z}\right)^\top X\quad (n_\text{out}, m)\times (m,n_\text{in})=(n_\text{out},n_\text{in}) $$
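a small shape check for the $(n_\text{out}, n_\text{in})$ layout, again in plain Python rather than PyTorch itself. `W_pt` and the upstream gradient `dZ` are made-up values; the point is only that $\left(\frac{\partial L}{\partial Z}\right)^\top X$ has the same shape as the stored weight and is the transpose of the math-convention result $X^\top\frac{\partial L}{\partial Z}$.

```python
def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def shape(A):
    return (len(A), len(A[0]))

X = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]     # (m=2, n_in=3)
W_pt = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]  # (n_out=2, n_in=3), PyTorch-style layout
dZ = [[1.0, -2.0], [0.5, 3.0]]              # made-up upstream gradient, (m, n_out)

Z = matmul(X, transpose(W_pt))              # forward: Z = X W^T, (m, n_out)
dX = matmul(dZ, W_pt)                       # (m, n_out) x (n_out, n_in) = (m, n_in)
dW_pt = matmul(transpose(dZ), X)            # (n_out, m) x (m, n_in) = (n_out, n_in)

# math-convention gradient for comparison: same numbers, transposed
dW_math = matmul(transpose(X), dZ)          # (n_in, n_out)

print(shape(dX), shape(dW_pt), shape(dW_math))  # (2, 3) (2, 3) (3, 2)
```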
sigmoid $\sigma(x)\in(0,1)$
$$ \sigma(x)=\frac{1}{1+e^{-x}} $$
tanh $\in(-1,1)$
$$ \tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=2\sigma(2x)-1 $$
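the identity $\tanh(x)=2\sigma(2x)-1$ is easy to confirm numerically. a quick sketch using only the standard library; the sample points are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh expressed as a rescaled, shifted sigmoid: 2*sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

xs = [-3.0, -0.5, 0.0, 0.5, 3.0]  # arbitrary sample points
max_gap = max(abs(math.tanh(x) - tanh_via_sigmoid(x)) for x in xs)
print(max_gap)  # tiny, only floating-point rounding
```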
softmax → probability distribution
$$ \text{softmax}(\mathbf x)_i=\frac{e^{x_i}}{\sum_j e^{x_j}} $$
with temperature $T>0$: $T>1$ flattens the distribution, $T<1$ sharpens it toward the largest logit
$$ \text{softmax}(\mathbf x/T)_i=\frac{e^{x_i/T}}{\sum_j e^{x_j/T}} $$
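a sketch of softmax with temperature in plain Python. subtracting the maximum before exponentiating is a standard numerical-stability trick that is not in the notes above; it leaves the result unchanged because the common factor cancels in the ratio. the logits are made up.

```python
import math

def softmax(x, T=1.0):
    """softmax(x / T), computed stably by shifting by the max scaled logit."""
    scaled = [xi / T for xi in x]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]            # made-up logits
p_sharp = softmax(logits, T=0.5)    # low temperature: mass concentrates on the max
p_base = softmax(logits, T=1.0)
p_flat = softmax(logits, T=5.0)     # high temperature: closer to uniform
print(p_sharp, p_base, p_flat)
```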
ReLU $\in[0, \infty)$
$$ \operatorname{ReLU}(x)=\max(x,0) $$
Leaky ReLU $\in(-\infty,\infty)$, with a small fixed slope $\alpha$ (e.g. $0.01$) for negative inputs
$$ \text{LeakyReLU}(x)=\begin{cases}x&\text{if }x>0\\\alpha x&\text{if }x\leq0\end{cases} $$
Swish (smooth, non-monotonic)
$$ \text{Swish}(x)=x\cdot\sigma(x) $$
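the three ReLU-family activations above, sketched as scalar functions in plain Python. the default $\alpha=0.01$ is an assumed, commonly used value, and the probe points are arbitrary. the last lines illustrate why Swish is non-monotonic: it dips below zero for negative inputs and then climbs back toward zero.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(x, 0.0)

def leaky_relu(x, alpha=0.01):  # alpha=0.01 is an assumed default
    return x if x > 0 else alpha * x

def swish(x):
    return x * sigmoid(x)

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), swish(x))

# non-monotonic: swish(-1) is lower than swish(-5), and both are negative
print(swish(-5.0), swish(-1.0))
```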
GLU uses one linear projection to produce the “content” [left], and another to produce the gate [right]
$$ \text{GLU}(x)=xW_1\odot\sigma(xW_2) $$
SwiGLU replaces the sigmoid gate in GLU with Swish
$$ \text{SwiGLU}(x)=(x W_1)\odot \text{Swish}(xW_2) $$
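a minimal sketch of GLU and SwiGLU exactly as written in the two formulas above, in plain Python. the helper names and the tiny weight matrices are assumptions: $W_1$ is the identity and $W_2$ is all zeros, so the gate input is $0$ everywhere and the difference between the two gates is easy to read off.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z):
    return z * sigmoid(z)

def gated(X, W1, W2, gate_fn):
    """(X W1) * gate_fn(X W2), elementwise product of content and gate."""
    content, gate = matmul(X, W1), matmul(X, W2)
    return [[c * gate_fn(g) for c, g in zip(c_row, g_row)]
            for c_row, g_row in zip(content, gate)]

def glu(X, W1, W2):
    return gated(X, W1, W2, sigmoid)

def swiglu(X, W1, W2):
    return gated(X, W1, W2, swish)

X = [[1.0, 2.0]]                # one input row
W1 = [[1.0, 0.0], [0.0, 1.0]]   # identity: content is X itself
W2 = [[0.0, 0.0], [0.0, 0.0]]   # zeros: gate input is 0 everywhere

print(glu(X, W1, W2))     # [[0.5, 1.0]] since sigmoid(0) = 0.5 halves the content
print(swiglu(X, W1, W2))  # [[0.0, 0.0]] since Swish(0) = 0 closes the gate
```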
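without non-linearities, a neural net can’t do anything more than a single affine transform: stacked linear layers collapse into one, since $(XW_1+\mathbf b_1)W_2+\mathbf b_2=X(W_1W_2)+(\mathbf b_1W_2+\mathbf b_2)$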
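to make the collapse concrete, here is a small check in plain Python that two stacked linear layers with no activation equal a single layer with $W=W_1W_2$ and $\mathbf b=\mathbf b_1W_2+\mathbf b_2$. all names and numbers are made up for illustration.

```python
def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def affine(X, W, b):
    """XW + b, with b broadcast over the rows."""
    return [[z + bj for z, bj in zip(row, b)] for row in matmul(X, W)]

X = [[1.0, 2.0], [3.0, -1.0]]             # (m=2, 2)
W1 = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # (2, 3)
b1 = [1.0, 0.0, 2.0]
W2 = [[1.0], [2.0], [3.0]]                # (3, 1)
b2 = [-1.0]

# two linear layers applied in sequence, no activation in between
two_layers = affine(affine(X, W1, b1), W2, b2)

# the equivalent single layer
W = matmul(W1, W2)                                        # (2, 1)
b = [v + c for v, c in zip(matmul([b1], W2)[0], b2)]      # b1 W2 + b2
one_layer = affine(X, W, b)

print(two_layers, one_layer)  # [[9.0], [22.0]] [[9.0], [22.0]]
```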
