Lecture 7.4: Policy Gradient Methods
Last Time:
Q-Learning with Function Approximation:
- Learned to approximate $Q(s, a)$ with a parameterized function $\hat{Q}(s, a; \theta)$
- Updated the parameters $\theta$ using the TD error and the gradient $\nabla_\theta \hat{Q}(s, a; \theta)$ (sketched below)
- Still focused on learning value functions
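A minimal sketch of last lecture's update, assuming a linear approximator $\hat{Q}(s, a; \theta) = \theta^\top \phi(s, a)$; the feature function `phi` and the environment interface are illustrative, not from the lecture:

```python
import numpy as np

def q_learning_step(theta, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning update with a linear approximator Q(s, a; theta) = theta . phi(s, a).
    phi(s, a) is a hypothetical feature function returning a NumPy vector."""
    q_sa = theta @ phi(s, a)
    q_next = max(theta @ phi(s_next, a2) for a2 in actions)   # greedy value of next state
    td_error = r + gamma * q_next - q_sa                      # TD error
    return theta + alpha * td_error * phi(s, a)               # gradient of a linear Q is phi(s, a)
```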
Direct Policy Search
Key Idea: Instead of learning value functions, directly optimize the policy parameters $\theta$.
Policy Representation:
- The policy $\pi_\theta(s, a)$ gives the probability of taking action $a$ in state $s$
- The parameters $\theta$ determine the policy's behavior
Common Policy Forms:
- Deterministic: $\pi_\theta(s) = \arg\max_a \hat{Q}(s, a; \theta)$
- Stochastic (Softmax): $\pi_\theta(s, a) = \dfrac{e^{\hat{Q}(s, a; \theta)/\tau}}{\sum_{a'} e^{\hat{Q}(s, a'; \theta)/\tau}}$, where the temperature $\tau$ controls the exploration-exploitation trade-off
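A minimal sketch of the softmax form above; `q_values` and `tau` are illustrative names, and the max-subtraction is just for numerical stability:

```python
import numpy as np

def softmax_policy(q_values, tau=1.0):
    """Action probabilities from Q estimates; tau is the temperature.
    Small tau -> nearly greedy (exploitation); large tau -> nearly uniform (exploration)."""
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()                 # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# Example: Q estimates for three actions in some state
print(softmax_policy([1.0, 2.0, 0.5], tau=0.5))   # sharply favors action 1
print(softmax_policy([1.0, 2.0, 0.5], tau=5.0))   # close to uniform
```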
Policy Value and Gradient
Policy Value:
- Expected return under policy $\pi_\theta$: $\rho(\theta) = \mathbb{E}\left[\sum_t \gamma^t r_t \mid \pi_\theta\right]$
Policy Gradient:
- Goal: Find the gradient $\nabla_\theta \rho(\theta)$ to improve the policy by gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta \rho(\theta)$
- For the episodic case (a single state $s_0$) with immediate rewards $R(a)$: $\nabla_\theta \rho(\theta) = \nabla_\theta \sum_a \pi_\theta(s_0, a)\, R(a) = \sum_a R(a)\, \nabla_\theta \pi_\theta(s_0, a)$
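To make the policy value concrete, here is a minimal Monte Carlo sketch of estimating $\rho(\theta)$ by averaging returns over rollouts; `env_reset`, `env_step`, and `policy` are hypothetical interfaces, not from the lecture:

```python
import numpy as np

def estimate_policy_value(env_reset, env_step, policy, theta, n_episodes=100, gamma=0.99):
    """Monte Carlo estimate of rho(theta): average discounted return over sampled rollouts."""
    returns = []
    for _ in range(n_episodes):
        s, done, G, discount = env_reset(), False, 0.0, 1.0
        while not done:
            a = policy(s, theta)              # sample a ~ pi_theta(s, .)
            s, r, done = env_step(s, a)       # hypothetical environment step
            G += discount * r
            discount *= gamma
        returns.append(G)
    return np.mean(returns)
```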
REINFORCE Algorithm
Key Insight: The gradient can be estimated from sampled experience, without summing over all actions.
Update Rule:
$\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} \frac{\nabla_\theta \pi_\theta(s_0, a_j)}{\pi_\theta(s_0, a_j)}\, R_j(s_0)$
where:
- $N$ is the number of trials
- $R_j(s_0)$ is the total reward received from state $s_0$ in trial $j$
- $a_j$ is the action taken in trial $j$
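A minimal REINFORCE sketch on a hypothetical 3-armed bandit (single state, immediate reward), using the identity $\nabla_\theta \pi_\theta(a) / \pi_\theta(a) = \nabla_\theta \log \pi_\theta(a)$; the reward means, step size, and trial count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit: one state, immediate reward per action
true_means = np.array([0.2, 0.5, 0.8])

def pi(theta):
    """Softmax policy over actions, parameterized directly by preferences theta."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.zeros(3)
alpha = 0.1

for trial in range(2000):
    probs = pi(theta)
    a = rng.choice(3, p=probs)                  # sample a_j ~ pi_theta
    R = rng.normal(true_means[a], 0.1)          # observed reward R_j
    grad_log_pi = -probs                        # d log pi(a) / d theta_k = 1{k=a} - pi_k
    grad_log_pi[a] += 1.0
    theta += alpha * R * grad_log_pi            # REINFORCE gradient-ascent step

print(pi(theta))   # probability mass typically concentrates on the best arm (index 2)
```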
Practical Considerations
Challenges:
- High variance in gradient estimates
- Need many samples for reliable estimates
Solutions:
- Correlated sampling (PEGASUS algorithm):
  - Pre-generate a fixed set of random sequences (seeds)
  - Evaluate every candidate policy on the same sequences
  - Reduces the variance when comparing policies
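A sketch of the correlated-sampling idea, assuming a hypothetical `run_episode(policy, rng)` rollout function; every candidate policy is scored on the same random seeds:

```python
import numpy as np

def evaluate_with_fixed_seeds(run_episode, policy, seeds):
    """Evaluate a policy on a fixed set of random seeds (correlated sampling).
    run_episode(policy, rng) is a hypothetical rollout returning total reward."""
    return np.mean([run_episode(policy, np.random.default_rng(s)) for s in seeds])

seeds = list(range(30))          # the same sequences are reused for every candidate policy
# score_a = evaluate_with_fixed_seeds(run_episode, policy_a, seeds)
# score_b = evaluate_with_fixed_seeds(run_episode, policy_b, seeds)
# Comparing score_a and score_b on identical randomness removes much of the sampling noise.
```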
Advantages of Policy Gradient Methods
- Natural with Function Approximation:
  - The policy can be any differentiable function of $\theta$
  - Directly optimizes the performance measure $\rho(\theta)$
- Stochastic Policies:
  - Better exploration properties
  - Smoother optimization landscape
- Handle Continuous Actions:
  - No need for explicit maximization over actions (e.g., sample from a Gaussian policy, sketched below)
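A sketch of a continuous-action (Gaussian) policy, assuming linear features; acting is just sampling, with no maximization over actions. All names and constants are illustrative:

```python
import numpy as np

def gaussian_policy_sample(state_features, theta_mean, log_std, rng):
    """Continuous-action policy: a ~ N(theta_mean . phi(s), sigma^2).
    No argmax over actions is needed; acting is just drawing a sample."""
    mean = theta_mean @ state_features
    return rng.normal(mean, np.exp(log_std))

rng = np.random.default_rng(0)
phi_s = np.array([1.0, 0.5])                  # illustrative state features
a = gaussian_policy_sample(phi_s, np.array([0.3, -0.2]), log_std=-1.0, rng=rng)
print(a)                                      # a single continuous action
```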