Lecture 7.4: Policy Gradient Methods
Last Time:
Q-Learning with Function Approximation:
- Learned to approximate Q*(s, a) with a parameterized function Q(s, a; θ)
- Updated the parameters θ using the TD error and the gradient ∇_θ Q(s, a; θ)
- Still focused on learning value functions; the policy was only implicit (e.g., ε-greedy)
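For concreteness, the update recapped above can be sketched as follows, assuming a linear approximator Q(s, a; θ) = θᵀφ(s, a); the feature map `phi` and the step sizes are illustrative choices, not fixed by the lecture:

```python
import numpy as np

def q_learning_update(theta, phi, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.99):
    """One TD update for a linear approximator Q(s, a; theta) = theta . phi(s, a).

    phi(s, a) is an illustrative feature map returning a vector the
    same shape as theta.
    """
    q_sa = theta @ phi(s, a)
    # Bootstrap target: r + gamma * max_a' Q(s', a'; theta)
    target = r + gamma * max(theta @ phi(s_next, a2) for a2 in actions)
    td_error = target - q_sa
    # For a linear model, grad_theta Q(s, a; theta) is just phi(s, a)
    return theta + alpha * td_error * phi(s, a)
```

Note that the gradient step is taken only through Q(s, a; θ), not through the bootstrap target (the usual semi-gradient treatment).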
Direct Policy Search
Key Idea: Instead of learning value functions, directly optimize the policy parameters θ.
Policy Representation:
- The policy π_θ(a|s) gives the probability of taking action a in state s
- The parameters θ determine the policy's behavior
Common Policy Forms:
- Deterministic: a = π_θ(s)
- Stochastic (Softmax): π_θ(a|s) = exp(h_θ(s, a)/τ) / Σ_{a'} exp(h_θ(s, a')/τ), where h_θ(s, a) is an action preference and the temperature τ controls the exploration-exploitation trade-off
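The softmax form above can be sketched directly; `prefs` plays the role of the preferences h_θ(s, ·) for one state (the function name and defaults are illustrative):

```python
import numpy as np

def softmax_policy(prefs, temperature=1.0):
    """Action probabilities pi(a|s) from action preferences h(s, a).

    High temperature -> near-uniform (more exploration);
    low temperature -> near-greedy (more exploitation).
    """
    z = np.asarray(prefs, dtype=float) / temperature
    z -= z.max()          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Subtracting the maximum preference before exponentiating leaves the probabilities unchanged but avoids overflow for large preferences.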
Policy Value and Gradient
Policy Value:
- Expected return under policy π_θ: J(θ) = E_{π_θ}[ Σ_{t≥0} γ^t r_t ]
Policy Gradient:
- Goal: Find θ* = argmax_θ J(θ), e.g., by gradient ascent: θ ← θ + α ∇_θ J(θ)
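One standard way to estimate ∇_θ J(θ) is the likelihood-ratio (score function) trick, ∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a|s) · G_t ], which is the basis of REINFORCE. A minimal Monte Carlo sketch, assuming the linear softmax policy form above (the per-state feature matrix `phi_sa` and episode format are illustrative):

```python
import numpy as np

def softmax_probs(theta, phi_sa):
    """Softmax policy over per-action features: row a of phi_sa is phi(s, a)."""
    z = phi_sa @ theta
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def grad_log_pi(theta, phi_sa, a):
    """Score function grad_theta log pi_theta(a|s) for a linear softmax policy:
    phi(s, a) - sum_a' pi(a'|s) phi(s, a')."""
    p = softmax_probs(theta, phi_sa)
    return phi_sa[a] - p @ phi_sa

def reinforce_gradient(theta, episode, gamma=0.99):
    """Monte Carlo estimate of grad J(theta) from one episode,
    given as a list of (phi_sa, action, reward) tuples."""
    g = np.zeros_like(theta)
    ret = 0.0
    # Walk backwards so the discounted return G_t accumulates in O(T)
    for phi_sa, a, r in reversed(episode):
        ret = r + gamma * ret
        g += grad_log_pi(theta, phi_sa, a) * ret
    return g
```

Each step weights the score function by the return that followed it, so actions preceding high returns have their log-probability pushed up.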