Lecture 7.4: Policy Gradient Methods

Last Time:

Q-Learning with Function Approximation:

  • Learned to approximate `Q(s,a)` using parameters `\theta`
  • Updated parameters using TD error and gradient
  • Still focused on learning value functions

Key Idea: Instead of learning value functions, directly optimize the policy parameters.

Policy Representation:

  • Policy `\pi_\theta(s,a)` gives the probability of taking action `a` in state `s`
  • Parameters `\theta` determine the policy's behavior

Common Policy Forms:

  • Deterministic: `\pi_\theta(s) = \text{argmax}_a Q_\theta(s,a)`

  • Stochastic (Softmax): `\pi_\theta(s,a) = \frac{e^{\beta Q_\theta(s,a)}}{\sum_{a'} e^{\beta Q_\theta(s,a')}}`, where `\beta` controls the exploration-exploitation trade-off
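
The softmax policy form above can be sketched directly in NumPy (the Q-values and `\beta` settings here are illustrative assumptions, not values from the lecture):

```python
import numpy as np

def softmax_policy(q_values, beta):
    """Action probabilities pi_theta(s, .) from Q-values via softmax."""
    prefs = beta * np.asarray(q_values, dtype=float)
    prefs -= prefs.max()          # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

# Low beta spreads probability across actions (exploration);
# high beta concentrates it on the greedy action (exploitation).
probs_explore = softmax_policy([1.0, 2.0, 0.5], beta=0.1)
probs_exploit = softmax_policy([1.0, 2.0, 0.5], beta=10.0)
```

As `\beta \to \infty` the softmax policy approaches the deterministic argmax policy; as `\beta \to 0` it approaches the uniform random policy.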

Policy Value and Gradient

Policy Value:

  • Expected return under policy `\pi_\theta`: `\rho(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^\infty \gamma^t r_t\right]`

Policy Gradient:

  • Goal: Find `\nabla_\theta \rho(\theta)` to improve the policy
  • For the episodic case with immediate rewards `R(s_0,a,s_0')`: `\nabla_\theta \rho(\theta) = \sum_a R(s_0,a,s_0')\,\nabla_\theta \pi_\theta(s_0,a)`
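
For a softmax policy over a single state, this gradient can be computed in closed form. A minimal sketch (the three-action setup and reward values are illustrative assumptions):

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def policy_gradient_exact(theta, rewards):
    """Exact gradient of rho(theta) = sum_a pi_theta(a) R(a)
    for a one-step episode with a softmax policy over one state."""
    pi = softmax(theta)
    # Jacobian of the softmax: d pi_a / d theta_b = pi_a (delta_ab - pi_b)
    jac = np.diag(pi) - np.outer(pi, pi)
    # sum_a R(a) * grad pi(a)
    return jac @ np.asarray(rewards, dtype=float)
```

Working through the algebra, the gradient reduces to `\pi_\theta(a)\,(R(a) - \mathbb{E}_{\pi_\theta}[R])` per component: the policy shifts probability toward actions with above-average reward.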

REINFORCE Algorithm

Key Insight: The gradient can be estimated from experience.

Update Rule: `\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^N \frac{u_j(s)\,\nabla_\theta \pi_\theta(s,a_j)}{\pi_\theta(s,a_j)}`, where:

  • `N` is the number of trials
  • `u_j(s)` is the total reward from state `s` in trial `j`
  • `a_j` is the action taken in trial `j`
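
The estimator can be sketched for a one-state problem with a softmax policy (the bandit-style `sample_return` function and trial count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """grad_theta log pi_theta(a) for a softmax policy; note that
    grad pi / pi = grad log pi (the likelihood-ratio trick)."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

def reinforce_estimate(theta, sample_return, n_trials):
    """Monte Carlo estimate of grad rho(theta):
    (1/N) sum_j u_j * grad pi(s, a_j) / pi(s, a_j)."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_trials):
        a = rng.choice(len(theta), p=pi)  # action taken in trial j
        u = sample_return(a)              # total reward u_j for this trial
        grad += u * grad_log_pi(theta, a)
    return grad / n_trials
```

Dividing by `\pi_\theta(s,a_j)` corrects for the fact that actions the policy already favors are sampled more often; this is what makes the average an unbiased gradient estimate.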

Practical Considerations

Challenges:

  • High variance in gradient estimates
  • Need many samples for reliable estimates

Solutions:

  • Correlated sampling (PEGASUS algorithm)
    • Generate fixed set of random sequences
    • Compare policies on same sequences
    • Reduces variance in policy comparison
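
A toy illustration of the correlated-sampling idea (the simulator, reward, and gain values here are invented for illustration; PEGASUS applies the same trick to general simulators by fixing their random seeds):

```python
import numpy as np

def rollout(gain, noise_seq):
    """Toy simulator: hold state x near 0 with action a = -gain * x;
    transition noise is read from a pre-drawn sequence."""
    x, total = 1.0, 0.0
    for eps in noise_seq:
        x = x - gain * x + 0.1 * eps
        total -= x * x            # reward: negative squared error
    return total

rng = np.random.default_rng(42)
# PEGASUS idea: draw the random sequences once, then reuse them
# for every policy, so the sampling noise cancels in comparisons.
scenarios = [rng.standard_normal(50) for _ in range(20)]

def policy_value(gain):
    return np.mean([rollout(gain, seq) for seq in scenarios])

# Both policies are scored on identical scenarios, so the difference
# reflects the policies themselves, not the sampling noise.
better = max([0.2, 0.8], key=policy_value)
```

Because every policy sees the same noise sequences, `policy_value` is a deterministic function of the policy, and small differences between policies are not swamped by fresh sampling noise.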

Advantages of Policy Gradient Methods

  • Natural with Function Approximation:
    • Policy can be any differentiable function
    • Direct optimization of performance
  • Stochastic Policies:
    • Better exploration properties
    • Smoother optimization landscape
  • Handle Continuous Actions:
    • No need for explicit maximization over actions
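
As an illustration of the continuous-action case, here is a minimal sketch of a Gaussian policy whose mean is linear in the state (the fixed standard deviation and the linear parameterization are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

SIGMA = 0.5  # fixed exploration noise (assumed, not learned here)

def gaussian_policy_sample(state, theta):
    """Continuous-action policy: a ~ Normal(mean = theta . state, SIGMA).
    No maximization over actions is needed; we simply sample."""
    state = np.asarray(state, dtype=float)
    mean = float(np.dot(theta, state))
    a = rng.normal(mean, SIGMA)
    # grad_theta log pi(a | s) for the mean parameters, usable in REINFORCE
    grad_log_pi = (a - mean) / SIGMA**2 * state
    return a, grad_log_pi
```

The same likelihood-ratio update as REINFORCE applies: scale `grad_log_pi` by the observed return and average over trials, with no argmax over a continuous action space.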