Lecture 7.4: Policy Gradient Methods
Last Time:
Q-Learning with Function Approximation:
- Learned to approximate $Q(s, a)$ with a parameterized function $\hat{Q}(s, a; \theta)$
- Updated the parameters $\theta$ using the TD error and the gradient $\nabla_\theta \hat{Q}(s, a; \theta)$ (sketched below)
- Still focused on learning value functions
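A minimal sketch of last lecture's update, assuming a linear approximator $\hat{Q}(s, a; \theta) = \theta^\top \phi(s, a)$; the feature function `phi` and the environment interface are illustrative, not from the lecture:

```python
import numpy as np

def q_learning_step(theta, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning update with a linear approximator Q(s, a; theta) = theta . phi(s, a).
    phi(s, a) is a hypothetical feature function returning a NumPy vector."""
    q_sa = theta @ phi(s, a)
    q_next = max(theta @ phi(s_next, a2) for a2 in actions)   # greedy value of next state
    td_error = r + gamma * q_next - q_sa                      # TD error
    return theta + alpha * td_error * phi(s, a)               # gradient of a linear Q is phi(s, a)
```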
Direct Policy Search
Key Idea: Instead of learning value functions, directly optimize the policy parameters $\theta$.
Policy Representation:
- The policy $\pi_\theta(s, a)$ gives the probability of taking action $a$ in state $s$
- The parameters $\theta$ determine the policy's behavior
Common Policy Forms:
- Deterministic: $\pi_\theta(s) = \arg\max_a \hat{Q}(s, a; \theta)$
- Stochastic (Softmax): $\pi_\theta(s, a) = \dfrac{e^{\hat{Q}(s, a; \theta)/\tau}}{\sum_{a'} e^{\hat{Q}(s, a'; \theta)/\tau}}$, where the temperature $\tau$ controls the exploration-exploitation trade-off
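A minimal sketch of the softmax form above; `q_values` and `tau` are illustrative names, and the max-subtraction is just for numerical stability:

```python
import numpy as np

def softmax_policy(q_values, tau=1.0):
    """Action probabilities from Q estimates; tau is the temperature.
    Small tau -> nearly greedy (exploitation); large tau -> nearly uniform (exploration)."""
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()                 # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# Example: Q estimates for three actions in some state
print(softmax_policy([1.0, 2.0, 0.5], tau=0.5))   # sharply favors action 1
print(softmax_policy([1.0, 2.0, 0.5], tau=5.0))   # close to uniform
```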
Policy Value and Gradient
Policy Value:
- Expected return under policy $\pi_\theta$: $\rho(\theta) = \mathbb{E}\left[\sum_t \gamma^t r_t \mid \pi_\theta\right]$
Policy Gradient:
- Goal: Find the gradient $\nabla_\theta \rho(\theta)$ to improve the policy by gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta \rho(\theta)$
- For the episodic case (a single state $s_0$) with immediate rewards $R(a)$: $\nabla_\theta \rho(\theta) = \nabla_\theta \sum_a \pi_\theta(s_0, a)\, R(a) = \sum_a R(a)\, \nabla_\theta \pi_\theta(s_0, a)$
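To make the policy value concrete, here is a minimal Monte Carlo sketch of estimating $\rho(\theta)$ by averaging returns over rollouts; `env_reset`, `env_step`, and `policy` are hypothetical interfaces, not from the lecture:

```python
import numpy as np

def estimate_policy_value(env_reset, env_step, policy, theta, n_episodes=100, gamma=0.99):
    """Monte Carlo estimate of rho(theta): average discounted return over sampled rollouts."""
    returns = []
    for _ in range(n_episodes):
        s, done, G, discount = env_reset(), False, 0.0, 1.0
        while not done:
            a = policy(s, theta)              # sample a ~ pi_theta(s, .)
            s, r, done = env_step(s, a)       # hypothetical environment step
            G += discount * r
            discount *= gamma
        returns.append(G)
    return np.mean(returns)
```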
REINFORCE Algorithm
Key Insight: The gradient can be estimated from sampled experience, without summing over all actions.
Update Rule:
$\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} \frac{\nabla_\theta \pi_\theta(s_0, a_j)}{\pi_\theta(s_0, a_j)}\, R_j(s_0)$
where:
- $N$ is the number of trials
- $R_j(s_0)$ is the total reward received from state $s_0$ in trial $j$
- $a_j$ is the action taken in trial $j$
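A minimal REINFORCE sketch on a hypothetical 3-armed bandit (single state, immediate reward), using the identity $\nabla_\theta \pi_\theta(a) / \pi_\theta(a) = \nabla_\theta \log \pi_\theta(a)$; the reward means, step size, and trial count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit: one state, immediate reward per action
true_means = np.array([0.2, 0.5, 0.8])

def pi(theta):
    """Softmax policy over actions, parameterized directly by preferences theta."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.zeros(3)
alpha = 0.1

for trial in range(2000):
    probs = pi(theta)
    a = rng.choice(3, p=probs)                  # sample a_j ~ pi_theta
    R = rng.normal(true_means[a], 0.1)          # observed reward R_j
    grad_log_pi = -probs                        # d log pi(a) / d theta_k = 1{k=a} - pi_k
    grad_log_pi[a] += 1.0
    theta += alpha * R * grad_log_pi            # REINFORCE gradient-ascent step

print(pi(theta))   # probability mass typically concentrates on the best arm (index 2)
```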
Practical Considerations
Challenges:
- High variance in gradient estimates
- Need many samples for reliable estimates
Solutions:
- Correlated sampling (PEGASUS algorithm):
  - Pre-generate a fixed set of random sequences (seeds)
  - Evaluate every candidate policy on the same sequences
  - Reduces the variance when comparing policies
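A sketch of the correlated-sampling idea, assuming a hypothetical `run_episode(policy, rng)` rollout function; every candidate policy is scored on the same random seeds:

```python
import numpy as np

def evaluate_with_fixed_seeds(run_episode, policy, seeds):
    """Evaluate a policy on a fixed set of random seeds (correlated sampling).
    run_episode(policy, rng) is a hypothetical rollout returning total reward."""
    return np.mean([run_episode(policy, np.random.default_rng(s)) for s in seeds])

seeds = list(range(30))          # the same sequences are reused for every candidate policy
# score_a = evaluate_with_fixed_seeds(run_episode, policy_a, seeds)
# score_b = evaluate_with_fixed_seeds(run_episode, policy_b, seeds)
# Comparing score_a and score_b on identical randomness removes much of the sampling noise.
```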
Advantages of Policy Gradient Methods
- Natural with Function Approximation:
  - The policy can be any differentiable function of $\theta$
  - Directly optimizes the performance measure $\rho(\theta)$
- Stochastic Policies:
  - Better exploration properties
  - Smoother optimization landscape
- Handle Continuous Actions:
  - No need for explicit maximization over actions (e.g., sample from a Gaussian policy, sketched below)
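A sketch of a continuous-action (Gaussian) policy, assuming linear features; acting is just sampling, with no maximization over actions. All names and constants are illustrative:

```python
import numpy as np

def gaussian_policy_sample(state_features, theta_mean, log_std, rng):
    """Continuous-action policy: a ~ N(theta_mean . phi(s), sigma^2).
    No argmax over actions is needed; acting is just drawing a sample."""
    mean = theta_mean @ state_features
    return rng.normal(mean, np.exp(log_std))

rng = np.random.default_rng(0)
phi_s = np.array([1.0, 0.5])                  # illustrative state features
a = gaussian_policy_sample(phi_s, np.array([0.3, -0.2]), log_std=-1.0, rng=rng)
print(a)                                      # a single continuous action
```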