Neural Networks 2: Deep Neural Networks, Linear Separability, Backpropagation
Thanks to Tianyang Liu for help with these notes.
Gradient Descent and Backpropagation
0. Gradient and Vector Calculus for Gradient Descent
Gradient and vector calculus are not part of the ECS170 syllabus. However, since training a neural network is essentially a high-dimensional differentiable optimization problem, some familiarity with gradients, Jacobians, and the vector chain rule will help you better understand gradient descent.
(1) Gradient of a Scalar Function
For a scalar function $L(\mathbf{x}) : \mathbb{R}^n \to \mathbb{R}$, its gradient is defined as:

$$
\nabla_{\mathbf{x}} L =
\begin{bmatrix}
\frac{\partial L}{\partial x_1} \\
\frac{\partial L}{\partial x_2} \\
\vdots \\
\frac{\partial L}{\partial x_n}
\end{bmatrix}
\in \mathbb{R}^n
$$
Geometric meaning:
The gradient points in the direction of steepest ascent.
The negative gradient points in the direction of steepest descent.
Thus, the gradient descent update rule:

$$
\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla_{\mathbf{x}} L
$$

means taking a step of size $\eta$ in the direction of steepest descent.
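The update rule can be sketched in a few lines of NumPy. Here we minimize the example function $L(\mathbf{x}) = \|\mathbf{x}\|^2$, whose gradient is $2\mathbf{x}$; the starting point, step size, and iteration count are illustrative choices, not prescribed values.

```python
import numpy as np

def grad_L(x):
    return 2 * x  # analytic gradient of L(x) = ||x||^2

x = np.array([3.0, -4.0])  # arbitrary starting point
eta = 0.1                  # step size (learning rate)
for _ in range(100):
    x = x - eta * grad_L(x)  # x <- x - eta * grad L

print(x)  # converges toward the minimizer [0, 0]
```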
(2) Jacobian Matrix
For a vector-valued function $\mathbf{y} = f(\mathbf{x})$ where $f : \mathbb{R}^n \to \mathbb{R}^m$:

$$
\mathbf{y} =
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_m
\end{bmatrix}
= f(\mathbf{x}), \quad \mathbf{x} \in \mathbb{R}^n, \quad \mathbf{y} \in \mathbb{R}^m
$$
The Jacobian matrix is defined as:
$$
\mathbf{J}_{f}(\mathbf{x})
=
\frac{\partial \mathbf{y}}{\partial \mathbf{x}^\top}
=
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_2}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}
\in \mathbb{R}^{m \times n}
$$
When $m = 1$, the Jacobian reduces to the transpose of the gradient. You can think of the Jacobian as the matrix generalization of the derivative.
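To make the definition concrete, here is a small sketch that computes the Jacobian of an example function $f(\mathbf{x}) = (x_1 x_2, \sin x_1)$ analytically and checks it against central finite differences. The choice of $f$ and the test point are illustrative.

```python
import numpy as np

def f(x):
    return np.array([x[0] * x[1], np.sin(x[0])])

def jacobian_analytic(x):
    # Row i holds the partial derivatives of y_i with respect to x_1..x_n.
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0]])

def jacobian_numeric(f, x, eps=1e-6):
    m, n = len(f(x)), len(x)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)  # central difference
    return J

x = np.array([1.0, 2.0])
match = np.allclose(jacobian_analytic(x), jacobian_numeric(f, x), atol=1e-5)
print(match)  # True
```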
(3) Chain Rule for Composite Functions
For a composite function:
$$
L = g(f(\mathbf{x}))
$$

with $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}$, the chain rule becomes:

$$
\nabla_{\mathbf{x}} L = \mathbf{J}_f(\mathbf{x})^\top \nabla_{\mathbf{y}} g
$$
That is:
$$
\frac{\partial L}{\partial \mathbf{x}}
=
\left( \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \right)^\top
\frac{\partial L}{\partial \mathbf{y}}
$$
This is the mathematical foundation of backpropagation: the gradient at each layer is propagated backward through the chain rule.
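The identity $\nabla_{\mathbf{x}} L = \mathbf{J}_f(\mathbf{x})^\top \nabla_{\mathbf{y}} g$ can be verified numerically. In this sketch $f(\mathbf{x}) = \mathbf{W}\mathbf{x}$ (so $\mathbf{J}_f = \mathbf{W}$) and $g(\mathbf{y}) = \|\mathbf{y}\|^2$ (so $\nabla_{\mathbf{y}} g = 2\mathbf{y}$); the matrix and input are random example values.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # f: R^4 -> R^3, with Jacobian J_f = W
x = rng.standard_normal(4)

y = W @ x
grad_y = 2 * y                    # gradient of the outer function g at y
grad_x = W.T @ grad_y             # chain rule: J_f^T grad_y g

# Finite-difference check of grad_x, component by component
eps = 1e-6
fd = np.array([
    (np.sum((W @ (x + eps * np.eye(4)[i])) ** 2)
     - np.sum((W @ (x - eps * np.eye(4)[i])) ** 2)) / (2 * eps)
    for i in range(4)
])
match = np.allclose(grad_x, fd, atol=1e-4)
print(match)  # True
```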
(4) Gradient with Respect to Matrices
If $L$ depends on a matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$:

$$
\frac{\partial L}{\partial \mathbf{W}}
=
\begin{bmatrix}
\frac{\partial L}{\partial w_{11}} & \cdots & \frac{\partial L}{\partial w_{1n}} \\
\frac{\partial L}{\partial w_{21}} & \cdots & \frac{\partial L}{\partial w_{2n}} \\
\vdots & \ddots & \vdots \\
\frac{\partial L}{\partial w_{m1}} & \cdots & \frac{\partial L}{\partial w_{mn}}
\end{bmatrix}
$$
A key rule:
$$
\frac{\partial (\mathbf{a}^\top \mathbf{W} \mathbf{b})}{\partial \mathbf{W}} = \mathbf{a}\mathbf{b}^\top
$$
This appears often in backpropagation, e.g.:
$$
\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (\mathbf{h}^{(l-1)})^\top
$$
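The rule $\partial(\mathbf{a}^\top \mathbf{W}\mathbf{b})/\partial\mathbf{W} = \mathbf{a}\mathbf{b}^\top$ is easy to sanity-check numerically with random example vectors and a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal(3)
b = rng.standard_normal(4)
W = rng.standard_normal((3, 4))

analytic = np.outer(a, b)  # a b^T

# Perturb each entry of W and measure the change in a^T W b
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(3):
    for j in range(4):
        E = np.zeros_like(W); E[i, j] = eps
        numeric[i, j] = (a @ (W + E) @ b - a @ (W - E) @ b) / (2 * eps)

match = np.allclose(analytic, numeric, atol=1e-4)
print(match)  # True
```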
1. Optimization Objective
The goal of training is to find parameters
$$
\Theta = \{ \mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \dots, \mathbf{W}^{(L)} \}
$$
that minimize the loss function:
$$
\mathcal{L} = L(\hat{\mathbf{y}}, \mathbf{y})
$$

with $\hat{\mathbf{y}} = f(\mathbf{x}; \Theta)$.
Gradient descent updates the weights iteratively:
$$
\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}
$$
2. Forward Computation
At layer $l$:
Note that sometimes there can also be a bias term: $\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}$. In this case, you can apply the augmented-matrix trick and treat $\mathbf{b}$ as part of $\mathbf{W}\mathbf{h}$.
$$
\begin{aligned}
\mathbf{z}^{(l)} &= \mathbf{W}^{(l)}\mathbf{h}^{(l-1)} \\
\mathbf{h}^{(l)} &= f^{(l)}(\mathbf{z}^{(l)})
\end{aligned}
$$
and the final output:
$$
\hat{\mathbf{y}} = f^{(L)}(\mathbf{W}^{(L)}\mathbf{h}^{(L-1)})
$$
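The layer-by-layer recurrence above can be sketched directly in NumPy. The layer sizes, random weights, and choice of ReLU activations (with a linear output layer) are illustrative assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(2)
sizes = [4, 5, 3, 2]  # input, two hidden layers, output (example sizes)
Ws = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(3)]

h = rng.standard_normal(4)  # h^(0) = x, a random example input
for W in Ws[:-1]:
    z = W @ h           # z^(l) = W^(l) h^(l-1)
    h = relu(z)         # h^(l) = f^(l)(z^(l))
y_hat = Ws[-1] @ h      # final layer with identity f^(L)
print(y_hat.shape)  # (2,)
```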
3. Backward Propagation Derivation
Meaning of $\delta^{(l)}$
$\delta^{(l)}$ (often called the error term or local gradient) measures how much the loss would change if the pre-activation $\mathbf{z}^{(l)}$ changed slightly.
It quantifies each neuron's "responsibility" for the overall error.
In simple terms:
The output layer's $\delta^{(L)}$ tells us how wrong the prediction is.
Each hidden layer's $\delta^{(l)}$ tells us how much it contributed to that error through the weights.
(1) Output Layer
We define:
$$
\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}
$$

so that the notation becomes simpler.
Here, $\odot$ denotes elementwise multiplication (Hadamard product).
Applying the chain rule:

$$
\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \odot f^{(L)\prime}(\mathbf{z}^{(L)})
$$
Then the gradient with respect to the weights is:
$$
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(L)}} = \delta^{(L)} (\mathbf{h}^{(L-1)})^\top
$$
(2) Hidden Layers
For each hidden layer $l = L-1, L-2, \dots, 1$:

$$
\delta^{(l)} = \left( (\mathbf{W}^{(l+1)})^\top \delta^{(l+1)} \right) \odot f^{(l)\prime}(\mathbf{z}^{(l)})
$$
and
$$
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (\mathbf{h}^{(l-1)})^\top
$$
Why "Backpropagation"? Notice that $\delta^{(l)}$ depends on $\delta^{(l+1)}$, meaning we compute error terms moving backward from the output layer to the input layer.
Why "Forward Propagation"? In contrast, $\mathbf{z}^{(l)}$ depends on $\mathbf{z}^{(l-1)}$, meaning activations are computed moving forward from the input layer to the output layer.
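Putting the two formulas together, here is a minimal backprop sketch for a two-layer network with a ReLU hidden layer and squared-error loss $\mathcal{L} = \|\hat{\mathbf{y}} - \mathbf{y}\|^2$, checked against a finite difference; the shapes, loss, and random values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(4)
y = rng.standard_normal(2)
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((2, 3))

def forward(W1, W2):
    z1 = W1 @ x
    h1 = np.maximum(0, z1)     # ReLU hidden layer
    y_hat = W2 @ h1            # identity output activation
    return z1, h1, y_hat

z1, h1, y_hat = forward(W1, W2)

# Backward pass
delta2 = 2 * (y_hat - y)               # delta^(L) = dL/dz^(L) for squared error
gW2 = np.outer(delta2, h1)             # dL/dW2 = delta2 h1^T
delta1 = (W2.T @ delta2) * (z1 > 0)    # delta^(1): backpropagate, then ReLU'
gW1 = np.outer(delta1, x)              # dL/dW1 = delta1 x^T

# Finite-difference check of one entry of dL/dW1
eps = 1e-6
E = np.zeros_like(W1); E[0, 0] = eps
Lp = np.sum((forward(W1 + E, W2)[2] - y) ** 2)
Lm = np.sum((forward(W1 - E, W2)[2] - y) ** 2)
print(np.isclose(gW1[0, 0], (Lp - Lm) / (2 * eps), atol=1e-4))
```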
4. Iterative Weight Update
Once all gradients are computed, we update the weights layer by layer:
$$
\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}
$$
A visualization is available in the further reading section below.
5. Further Reading
Backpropagation - Dive Into Deep Learning
Gradient Descent Visualizer - UCLA ACM
TensorFlow Neural Network Playground