Lecture 7.4: Policy Gradient Methods

Last Time:

Q-Learning with Function Approximation:

  • Learned to approximate the action-value function Q(s, a) with a parameterized estimate Q̂_θ(s, a)
  • Updated the parameters θ using the TD error and the gradient of Q̂_θ (the update is recalled below)
  • Still focused on learning value functions
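
For reference, that update took the semi-gradient form below (α is the step size, γ the discount factor; the exact notation used last time may differ):

  θ ← θ + α [ r + γ max_{a'} Q̂_θ(s', a') - Q̂_θ(s, a) ] ∇_θ Q̂_θ(s, a)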

Key Idea: Instead of learning value functions, directly optimize the policy parameters θ.

Policy Representation:

  • The policy π_θ(s, a) gives the probability of taking action a in state s
  • The parameters θ determine the policy's behavior

Common Policy Forms:

  • Deterministic: π(s) = argmax_a Q̂_θ(s, a)

  • Stochastic (Softmax): π_θ(s, a) = e^{Q̂_θ(s, a)/τ} / Σ_{a'} e^{Q̂_θ(s, a')/τ}, where the temperature τ controls the exploration-exploitation trade-off (see the sketch below)
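
A minimal sketch of such a softmax policy in Python; the linear action scores standing in for Q̂_θ(s, a) and all function names here are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def action_scores(theta, state_features):
    """Per-action scores, here a linear function of the state features
    (theta has shape (num_actions, num_features) -- an illustrative choice)."""
    return theta @ state_features                  # shape: (num_actions,)

def softmax_policy(theta, state_features, temperature=1.0):
    """pi_theta(s, .): probability of each action in state s.
    Lower temperature -> greedier; higher temperature -> more exploration."""
    scores = action_scores(theta, state_features) / temperature
    scores = scores - scores.max()                 # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Example: 3 actions, 2 state features
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 2))
s = np.array([1.0, -0.5])
print(softmax_policy(theta, s, temperature=0.5))   # close to greedy
print(softmax_policy(theta, s, temperature=5.0))   # close to uniform
```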

Policy Value and Gradient

Policy Value:

  • ρ(θ): the expected return obtained when the agent executes policy π_θ, e.g. ρ(θ) = E[ Σ_t γ^t r_t | π_θ ]

Policy Gradient:

  • Goal: Compute the policy gradient ∇_θ ρ(θ) and follow it (gradient ascent on ρ) to improve the policy
  • For the episodic case with a single start state s_0 and immediate rewards R(a): ρ(θ) = Σ_a π_θ(s_0, a) R(a), so ∇_θ ρ(θ) = Σ_a ∇_θ π_θ(s_0, a) R(a) (a small numerical check follows below)
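
A small numerical check of this exact gradient, assuming a made-up 3-action problem with a softmax over per-action preferences θ (all names and numbers here are illustrative):

```python
import numpy as np

R = np.array([1.0, 0.0, 2.0])            # immediate reward R(a) for each action (made up)

def pi(theta):
    """Softmax policy over actions; theta holds one preference per action."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def rho(theta):
    """Policy value: expected immediate reward under pi_theta."""
    return pi(theta) @ R

def grad_rho(theta):
    """Exact gradient: sum_a grad_theta pi_theta(a) * R(a).
    For a softmax, d pi_a / d theta_b = pi_a * (1[a == b] - pi_b)."""
    p = pi(theta)
    jacobian = np.diag(p) - np.outer(p, p)
    return jacobian @ R

theta = np.zeros(3)
print(rho(theta), grad_rho(theta))

# Finite-difference check: should closely match grad_rho(theta)
eps = 1e-6
fd = np.array([(rho(theta + eps * np.eye(3)[i]) - rho(theta - eps * np.eye(3)[i])) / (2 * eps)
               for i in range(3)])
print(fd)
```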

REINFORCE Algorithm

Key Insight: The gradient can be estimated from the agent's own sampled experience (a code sketch follows the definitions below).

Update Rule: take a gradient-ascent step θ ← θ + α ∇_θ ρ(θ), estimating the gradient from N trials as

  ∇_θ ρ(θ) ≈ (1/N) Σ_{j=1}^{N} [ ∇_θ π_θ(s_0, a_j) / π_θ(s_0, a_j) ] R_j(s_0)
           = (1/N) Σ_{j=1}^{N} ∇_θ log π_θ(s_0, a_j) R_j(s_0)

where:

  • N is the number of trials
  • R_j(s_0) is the total reward obtained from state s_0 in trial j
  • a_j is the action taken in trial j
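
A minimal sketch of this estimator on a made-up single-state (bandit-style) problem, using the equivalent log-derivative form ∇_θ log π_θ(a_j) R_j; the environment, softmax parameterization, and constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([1.0, 0.0, 2.0])        # unknown to the learner (made up)

def pi(theta):
    """Softmax policy over 3 actions with per-action preferences theta."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    """grad_theta log pi_theta(a) for the softmax above: e_a - pi_theta."""
    g = -pi(theta)
    g[a] += 1.0
    return g

theta = np.zeros(3)
alpha, N = 0.1, 50                             # step size and trials per estimate

for step in range(200):
    grad_estimate = np.zeros(3)
    for _ in range(N):                         # N trials from the start state
        a = rng.choice(3, p=pi(theta))         # action a_j taken in trial j
        r = true_reward[a] + rng.normal(scale=0.5)   # noisy total reward R_j(s_0)
        grad_estimate += grad_log_pi(theta, a) * r
    theta += alpha * grad_estimate / N         # ascend the estimated gradient

print(pi(theta))   # should concentrate on action 2, the highest-reward action
```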

Practical Considerations

Challenges:

  • High variance in gradient estimates
  • Need many samples for reliable estimates

Solutions:

  • Correlated sampling (the PEGASUS algorithm); a minimal sketch follows this list
    • Generate a fixed set of random sequences in advance
    • Evaluate every candidate policy on the same sequences
    • Reduces the variance of policy comparisons
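
A minimal sketch of the correlated-sampling idea: every candidate policy is evaluated on the same pre-generated random sequences, so differences in the estimates reflect the policies rather than the sampling noise. The toy simulator and the two candidate policies are illustrative assumptions, not the actual PEGASUS setup:

```python
import numpy as np

def simulate(policy, noise_seq):
    """One rollout of a toy 1-D control problem, consuming pre-drawn noise.
    Reward encourages staying near the origin. Purely illustrative."""
    x, total = 0.0, 0.0
    for eps in noise_seq:
        a = policy(x)                # action chosen by the candidate policy
        x = x + a + eps              # noisy transition
        total += -abs(x)
    return total

def evaluate(policy, noise_sequences):
    """Average return of a policy over the fixed set of sequences."""
    return np.mean([simulate(policy, seq) for seq in noise_sequences])

rng = np.random.default_rng(0)
# Fixed random sequences, generated once and reused for every policy
noise_sequences = rng.normal(scale=0.1, size=(20, 30))

policy_a = lambda x: -0.5 * x        # two candidate policies (illustrative)
policy_b = lambda x: -0.9 * x

# Both policies see exactly the same noise, so the comparison is low-variance
print(evaluate(policy_a, noise_sequences), evaluate(policy_b, noise_sequences))
```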

Advantages of Policy Gradient Methods

  • Natural Fit with Function Approximation:
    • Policy can be any differentiable function
    • Direct optimization of performance
  • Stochastic Policies:
    • Better exploration properties
    • Smoother optimization landscape
  • Handle Continuous Actions:
    • No need for an explicit maximization over actions; the policy can output or sample a continuous action directly (see the sketch below)
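
One common way to handle continuous actions (an assumption here, not something stated in the lecture) is a Gaussian policy: sample a ~ N(μ_θ(s), σ²) and plug ∇_θ log π_θ(a|s) into the same REINFORCE-style estimate, with no maximization over actions:

```python
import numpy as np

def sample_action(theta, state_features, sigma, rng):
    """Sample a continuous action a ~ N(mu, sigma^2) with linear mean mu = theta . phi(s)."""
    mu = theta @ state_features
    return rng.normal(mu, sigma)

def grad_log_pi(theta, state_features, action, sigma):
    """grad_theta log pi_theta(a | s) for the Gaussian policy above:
    ((a - mu) / sigma^2) * phi(s)."""
    mu = theta @ state_features
    return (action - mu) / sigma**2 * state_features

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # illustrative 2-feature linear mean
s = np.array([1.0, 0.5])
a = sample_action(theta, s, sigma=0.3, rng=rng)
print(a, grad_log_pi(theta, s, a, sigma=0.3))
```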