
Neural Networks 2: Deep Neural Networks, Linear Separability, Backpropagation

Thanks to Tianyang Liu for help with these notes.

Gradient Descent and Backpropagation

0. Gradient and Vector Calculus for Gradient Descent

Gradients and vector calculus are not part of the ECS170 syllabus. However, since training a neural network is essentially a high-dimensional differentiable optimization problem, some familiarity with gradients, Jacobians, and the chain rule in vector calculus will help you better understand gradient descent.

(1) Gradient of a Scalar Function

For a scalar function $L(\mathbf{x}) : \mathbb{R}^n \to \mathbb{R}$, its gradient is defined as:

$$\nabla_{\mathbf{x}} L = \begin{bmatrix} \frac{\partial L}{\partial x_1} \\ \frac{\partial L}{\partial x_2} \\ \vdots \\ \frac{\partial L}{\partial x_n} \end{bmatrix} \in \mathbb{R}^n$$

Geometric meaning:

  • The gradient points in the direction of steepest ascent.
  • The negative gradient points in the direction of steepest descent.

Thus, the gradient descent update rule:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla_{\mathbf{x}} L$$

means taking a step, scaled by the learning rate $\eta$, in the direction of steepest descent.
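The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the notes: the loss $L(\mathbf{x}) = \|\mathbf{x}\|^2$, its gradient $2\mathbf{x}$, the learning rate, and the iteration count are all toy choices.

```python
import numpy as np

# Toy loss L(x) = ||x||^2 with gradient 2x (illustrative choice).
def grad_L(x):
    return 2.0 * x

x = np.array([3.0, -4.0])   # starting point
eta = 0.1                   # learning rate
for _ in range(100):
    x = x - eta * grad_L(x) # step in the direction of steepest descent

# x shrinks toward the minimizer at the origin
```

Each iteration multiplies $\mathbf{x}$ by $(1 - 2\eta)$, so the iterate contracts toward the minimum as long as $\eta$ is small enough.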


(2) Jacobian Matrix

For a vector-valued function $\mathbf{y} = f(\mathbf{x})$ where $f : \mathbb{R}^n \to \mathbb{R}^m$:

$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = f(\mathbf{x}), \quad \mathbf{x} \in \mathbb{R}^n, \quad \mathbf{y} \in \mathbb{R}^m$$

The Jacobian matrix is defined as:

$$\mathbf{J}_{f}(\mathbf{x}) = \frac{\partial \mathbf{y}}{\partial \mathbf{x}^\top} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}$$

When $m = 1$, the Jacobian reduces to the transpose of the gradient. You can think of it as the matrix version of the derivative.
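To make the Jacobian concrete, here is a small finite-difference sketch (the helper `jacobian` and the toy function are my own illustrative choices, not from the notes): each column $j$ is obtained by perturbing $x_j$ and measuring how every output changes.

```python
import numpy as np

# Finite-difference approximation of the m-by-n Jacobian of f: R^n -> R^m.
def jacobian(f, x, eps=1e-6):
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - y) / eps  # column j: sensitivity to x_j
    return J

# Toy f: R^2 -> R^2 with known analytic Jacobian [[x1, x0], [cos(x0), 0]]
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
J = jacobian(f, np.array([1.0, 2.0]))
```

The numerical result should closely match the analytic Jacobian evaluated at the same point.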


(3) Chain Rule in Vector Form

For a composite function:

$$L = g(f(\mathbf{x}))$$

with $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}$, the chain rule becomes:

$$\nabla_{\mathbf{x}} L = \mathbf{J}_f(\mathbf{x})^\top \nabla_{\mathbf{y}} g$$

That is:

$$\frac{\partial L}{\partial \mathbf{x}} = \left( \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \right)^\top \frac{\partial L}{\partial \mathbf{y}}$$

This is the mathematical foundation of backpropagation — the gradient of each layer is propagated backward through the chain rule.
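The vector chain rule can be verified numerically. In this sketch (the particular $f$ and $g$ are illustrative), we compute $\nabla_{\mathbf{x}} L = \mathbf{J}_f^\top \nabla_{\mathbf{y}} g$ analytically and compare it against a finite-difference gradient of the composite.

```python
import numpy as np

def f(x):                      # f: R^2 -> R^2
    return np.array([x[0]**2, x[0] + x[1]])

def g(y):                      # g: R^2 -> R
    return y[0] * y[1]

x = np.array([1.5, -0.5])
y = f(x)

J_f = np.array([[2 * x[0], 0.0],
                [1.0,      1.0]])   # analytic Jacobian of f at x
grad_g = np.array([y[1], y[0]])     # analytic grad of g(y) = y0 * y1
grad_x = J_f.T @ grad_g             # chain rule in vector form

# Finite-difference gradient of L(x) = g(f(x)) for comparison
eps = 1e-6
num = np.array([(g(f(x + eps * np.eye(2)[i])) - g(y)) / eps
                for i in range(2)])
```

The two gradients agree up to finite-difference error, which is exactly what backpropagation exploits layer by layer.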


(4) Gradient with Respect to Matrices

If $L$ depends on a matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$:

$$\frac{\partial L}{\partial \mathbf{W}} = \begin{bmatrix} \frac{\partial L}{\partial w_{11}} & \cdots & \frac{\partial L}{\partial w_{1n}} \\ \frac{\partial L}{\partial w_{21}} & \cdots & \frac{\partial L}{\partial w_{2n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial w_{m1}} & \cdots & \frac{\partial L}{\partial w_{mn}} \end{bmatrix}$$

A key rule:

$$\frac{\partial (\mathbf{a}^\top \mathbf{W}\mathbf{b})}{\partial \mathbf{W}} = \mathbf{a}\mathbf{b}^\top$$

This appears often in backpropagation, e.g.:

$$\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (\mathbf{h}^{(l-1)})^\top$$
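The rule $\partial(\mathbf{a}^\top \mathbf{W} \mathbf{b})/\partial \mathbf{W} = \mathbf{a}\mathbf{b}^\top$ is easy to sanity-check numerically. A short sketch (the sizes and random vectors are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(3)
b = rng.standard_normal(2)
W = rng.standard_normal((3, 2))

analytic = np.outer(a, b)  # claimed gradient: a b^T

# Entry-by-entry finite-difference gradient of the scalar a^T W b
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(3):
    for j in range(2):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        numeric[i, j] = (a @ (W + dW) @ b - a @ W @ b) / eps
```

Both matrices match up to finite-difference error, which is why outer products of the form $\delta^{(l)} (\mathbf{h}^{(l-1)})^\top$ appear throughout backpropagation.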

1. Optimization Objective

The goal of training is to find parameters

$$\Theta = \{ \mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \dots, \mathbf{W}^{(L)} \}$$

that minimize the loss function:

$$\mathcal{L} = L(\hat{\mathbf{y}}, \mathbf{y})$$

with $\hat{\mathbf{y}} = f(\mathbf{x}; \Theta)$.

Gradient descent updates the weights iteratively:

$$\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$$

2. Forward Computation

At layer $l$:

Note that sometimes there can also be a bias term: $\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}$. In this case, you can apply the augmented-matrix trick: append a constant $1$ to $\mathbf{h}^{(l-1)}$ and treat $\mathbf{b}$ as an extra column of $\mathbf{W}^{(l)}$, so the bias is absorbed into the matrix product.

$$\begin{aligned} \mathbf{z}^{(l)} &= \mathbf{W}^{(l)}\mathbf{h}^{(l-1)} \\ \mathbf{h}^{(l)} &= f^{(l)}(\mathbf{z}^{(l)}) \end{aligned}$$

and the final output:

$$\hat{\mathbf{y}} = f^{(L)}(\mathbf{W}^{(L)}\mathbf{h}^{(L-1)})$$
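The forward recursion can be sketched directly. This is an illustrative toy, not the course's reference implementation: the layer widths, ReLU hidden activations, identity output activation, and random initialization are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # input width, two hidden widths, output width
# W^(l) maps R^{sizes[l-1]} -> R^{sizes[l]}
Ws = [rng.standard_normal((m, n)) * 0.1
      for n, m in zip(sizes[:-1], sizes[1:])]

h = rng.standard_normal(4)  # h^(0) = x
zs, hs = [], [h]            # cache pre-activations and activations for backprop
for l, W in enumerate(Ws):
    z = W @ h                                # z^(l) = W^(l) h^(l-1)
    h = z if l == len(Ws) - 1 else relu(z)   # identity f^(L) on the output here
    zs.append(z)
    hs.append(h)

y_hat = h
```

Caching every $\mathbf{z}^{(l)}$ and $\mathbf{h}^{(l)}$ during the forward pass is what makes the backward pass cheap: each is reused when computing $\delta^{(l)}$.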

3. Backward Propagation Derivation

Meaning of δ(l)\delta^{(l)}

$\delta^{(l)}$ (often called the error term or local gradient) measures how much the loss would change if the pre-activation $\mathbf{z}^{(l)}$ changed slightly. It quantifies each neuron's "responsibility" for the overall error.

In simple terms:

  • The output layer's $\delta^{(L)}$ tells us how wrong the prediction is.
  • Each hidden layer's $\delta^{(l)}$ tells us how much it contributed to that error through the weights.

(1) Output Layer

We define:

$$\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}$$

so that the notation becomes simpler.

Here, $\odot$ denotes elementwise multiplication (the Hadamard product).

Applying the chain rule:

$$\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \odot f^{(L)\prime}(\mathbf{z}^{(L)})$$

Then the gradient with respect to the weights is:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(L)}} = \delta^{(L)} (\mathbf{h}^{(L-1)})^\top$$

(2) Hidden Layers

For each hidden layer $l = L-1, L-2, \dots, 1$:

$$\delta^{(l)} = \left( (\mathbf{W}^{(l+1)})^\top \delta^{(l+1)} \right) \odot f^{(l)\prime}(\mathbf{z}^{(l)})$$

and

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (\mathbf{h}^{(l-1)})^\top$$

Why "Backpropagation"? Notice that $\delta^{(l)}$ depends on $\delta^{(l+1)}$, meaning we compute error terms moving backward from the output layer to the input layer.

Why "Forward Propagation"? In contrast, $\mathbf{z}^{(l)}$ depends on $\mathbf{z}^{(l-1)}$, meaning activations are computed moving forward from the input layer to the output layer.
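Putting the pieces together, here is an end-to-end sketch of both passes on a tiny two-layer network, with a finite-difference check on one weight. Everything here is an illustrative assumption: the sigmoid hidden layer, linear output, squared-error loss, and the random sizes and data.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((3, 2)) * 0.5  # W^(1): R^2 -> R^3
W2 = rng.standard_normal((1, 3)) * 0.5  # W^(2): R^3 -> R^1
x = rng.standard_normal(2)
y = np.array([0.7])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2):
    h1 = sigmoid(W1 @ x)                 # h^(1) = f^(1)(W^(1) x)
    y_hat = W2 @ h1                      # linear output layer
    L = 0.5 * np.sum((y_hat - y) ** 2)   # squared-error loss
    return L, h1, y_hat

L, h1, y_hat = forward(W1, W2)

# Backward pass using the delta recursion
delta2 = y_hat - y                       # δ^(2): output f is the identity
gW2 = np.outer(delta2, h1)               # ∂L/∂W^(2) = δ^(2) (h^(1))^T
delta1 = (W2.T @ delta2) * h1 * (1 - h1) # δ^(1); sigmoid'(z) = h(1 - h)
gW1 = np.outer(delta1, x)                # ∂L/∂W^(1) = δ^(1) x^T

# Finite-difference check on a single entry of W^(1)
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num = (forward(W1p, W2)[0] - L) / eps
```

The backpropagated gradient entry matches the finite-difference estimate, confirming that the delta recursion computes the same derivatives as the chain rule applied directly.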


4. Iterative Weight Update

Once all gradients are computed, we update the weights layer by layer:

$$\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$$

A visualization is available in the further reading section below.


5. Further Reading

Backpropagation - Dive Into Deep Learning

Gradient Descent Visualizer - UCLA ACM

TensorFlow Neural Network Playground