Lecture 7.4: Policy Gradient Methods

Last Time:

Q-Learning with Function Approximation:

  • Learned to approximate `Q(s,a)` using parameters `\theta`
  • Updated parameters using TD error and gradient
  • Still focused on learning value functions

Key Idea: Instead of learning value functions, directly optimize the policy parameters.

Policy Representation:

  • Policy `\pi_\theta(s,a)` gives the probability of taking action `a` in state `s`
  • Parameters `\theta` determine the policy's behavior

Common Policy Forms:

  • Deterministic: `\pi_\theta(s) = \text{argmax}_a Q_\theta(s,a)`

  • Stochastic (Softmax): `\pi_\theta(s,a) = \frac{e^{\beta Q_\theta(s,a)}}{\sum_{a'} e^{\beta Q_\theta(s,a')}}`, where `\beta` controls the exploration-exploitation trade-off
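
The softmax policy form above can be sketched directly in NumPy (the Q-values and `\beta` settings here are illustrative assumptions, not values from the lecture):

```python
import numpy as np

def softmax_policy(q_values, beta):
    """Action probabilities pi_theta(s, .) from Q-values via softmax."""
    prefs = beta * np.asarray(q_values, dtype=float)
    prefs -= prefs.max()          # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

# Low beta spreads probability across actions (exploration);
# high beta concentrates it on the greedy action (exploitation).
probs_explore = softmax_policy([1.0, 2.0, 0.5], beta=0.1)
probs_exploit = softmax_policy([1.0, 2.0, 0.5], beta=10.0)
```

As `\beta \to \infty` the softmax policy approaches the deterministic argmax policy; as `\beta \to 0` it approaches the uniform random policy.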

Policy Value and Gradient

Policy Value:

  • Expected return under policy `\pi_\theta`: `\rho(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^\infty \gamma^t r_t\right]`

Policy Gradient:

  • Goal: Find `\nabla_\theta \rho(\theta)` to improve the policy
  • For the episodic case with immediate rewards `R(s_0,a,s_0')`: `\nabla_\theta \rho(\theta) = \sum_a R(s_0,a,s_0')\,\nabla_\theta \pi_\theta(s_0,a)`
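
For a softmax policy over a single state, this gradient can be computed in closed form. A minimal sketch (the three-action setup and reward values are illustrative assumptions):

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def policy_gradient_exact(theta, rewards):
    """Exact gradient of rho(theta) = sum_a pi_theta(a) R(a)
    for a one-step episode with a softmax policy over one state."""
    pi = softmax(theta)
    # Jacobian of the softmax: d pi_a / d theta_b = pi_a (delta_ab - pi_b)
    jac = np.diag(pi) - np.outer(pi, pi)
    # sum_a R(a) * grad pi(a)
    return jac @ np.asarray(rewards, dtype=float)
```

Working through the algebra, the gradient reduces to `\pi_\theta(a)\,(R(a) - \mathbb{E}_{\pi_\theta}[R])` per component: the policy shifts probability toward actions with above-average reward.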

REINFORCE Algorithm

Key Insight: The gradient can be estimated from experience.

Update Rule: `\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^N \frac{u_j(s)\,\nabla_\theta \pi_\theta(s,a_j)}{\pi_\theta(s,a_j)}`, where:

  • `N` is the number of trials
  • `u_j(s)` is the total reward from state `s` in trial `j`
  • `a_j` is the action taken in trial `j`
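
The estimator can be sketched for a one-state problem with a softmax policy (the bandit-style `sample_return` function and trial count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """grad_theta log pi_theta(a) for a softmax policy; note that
    grad pi / pi = grad log pi (the likelihood-ratio trick)."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

def reinforce_estimate(theta, sample_return, n_trials):
    """Monte Carlo estimate of grad rho(theta):
    (1/N) sum_j u_j * grad pi(s, a_j) / pi(s, a_j)."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_trials):
        a = rng.choice(len(theta), p=pi)  # action taken in trial j
        u = sample_return(a)              # total reward u_j for this trial
        grad += u * grad_log_pi(theta, a)
    return grad / n_trials
```

Dividing by `\pi_\theta(s,a_j)` corrects for the fact that actions the policy already favors are sampled more often; this is what makes the average an unbiased gradient estimate.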

Practical Considerations

Challenges:

  • High variance in gradient estimates
  • Need many samples for reliable estimates

Solutions:

  • Correlated sampling (PEGASUS algorithm)
    • Generate fixed set of random sequences
    • Compare policies on same sequences
    • Reduces variance in policy comparison
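
A toy illustration of the correlated-sampling idea (the simulator, reward, and gain values here are invented for illustration; PEGASUS applies the same trick to general simulators by fixing their random seeds):

```python
import numpy as np

def rollout(gain, noise_seq):
    """Toy simulator: hold state x near 0 with action a = -gain * x;
    transition noise is read from a pre-drawn sequence."""
    x, total = 1.0, 0.0
    for eps in noise_seq:
        x = x - gain * x + 0.1 * eps
        total -= x * x            # reward: negative squared error
    return total

rng = np.random.default_rng(42)
# PEGASUS idea: draw the random sequences once, then reuse them
# for every policy, so the sampling noise cancels in comparisons.
scenarios = [rng.standard_normal(50) for _ in range(20)]

def policy_value(gain):
    return np.mean([rollout(gain, seq) for seq in scenarios])

# Both policies are scored on identical scenarios, so the difference
# reflects the policies themselves, not the sampling noise.
better = max([0.2, 0.8], key=policy_value)
```

Because every policy sees the same noise sequences, `policy_value` is a deterministic function of the policy, and small differences between policies are not swamped by fresh sampling noise.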

Advantages of Policy Gradient Methods

  • Natural with Function Approximation:
    • Policy can be any differentiable function
    • Direct optimization of performance
  • Stochastic Policies:
    • Better exploration properties
    • Smoother optimization landscape
  • Handle Continuous Actions:
    • No need for explicit maximization over actions
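
As an illustration of the continuous-action case, here is a minimal sketch of a Gaussian policy whose mean is linear in the state (the fixed standard deviation and the linear parameterization are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

SIGMA = 0.5  # fixed exploration noise (assumed, not learned here)

def gaussian_policy_sample(state, theta):
    """Continuous-action policy: a ~ Normal(mean = theta . state, SIGMA).
    No maximization over actions is needed; we simply sample."""
    state = np.asarray(state, dtype=float)
    mean = float(np.dot(theta, state))
    a = rng.normal(mean, SIGMA)
    # grad_theta log pi(a | s) for the mean parameters, usable in REINFORCE
    grad_log_pi = (a - mean) / SIGMA**2 * state
    return a, grad_log_pi
```

The same likelihood-ratio update as REINFORCE applies: scale `grad_log_pi` by the observed return and average over trials, with no argmax over a continuous action space.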