
Lecture 7.2: Solving Markov Decision Processes


Core Definitions

Policy

A policy maps states to actions, either deterministically or stochastically:

Deterministic: $\pi: \mathcal{S} \rightarrow \mathcal{A}$

Stochastic: $\pi: \mathcal{S} \rightarrow P(\mathcal{A})$

State Transitions

The dynamics/transition function specifies the probability of next states:

$$P(s'|s,a): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$$

State Utility Function

The utility function maps states to expected returns (many texts write $V(s)$ for "value"; here we follow Russell & Norvig's $U$):

$$U: \mathcal{S} \rightarrow \mathbb{R}$$

Quality Function (Q-Function)

The quality function mapping from state-action pairs to expected total rewards:

$$Q: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$$

Return (Reward-to-go)

The discounted sum of future rewards from time step t:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
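The backward recursion $G_t = r_t + \gamma G_{t+1}$ gives a simple way to compute returns from a finite episode. A minimal sketch, assuming a hypothetical list of sampled rewards (the function name is ours, not from the lecture):

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return G_t = sum_k gamma^k * r_{t+k} for a finite episode."""
    g = 0.0
    # Accumulate backwards so each step computes g = r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```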

Expected Utility Under a Policy

The expected return when following policy $\pi$:

$$U^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s\right]$$

Quality Function Under a Policy

Expected return starting with action $a$ and thereafter following policy $\pi$:

$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s, a_t = a\right]$$

Advantage Function

The relative advantage of action a compared to the average action in state s:

$$A^{\pi}(s,a) = Q^{\pi}(s,a) - U^{\pi}(s)$$

Bellman Equations

Bellman Equation for Utility

$$U^{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[r(s,a,s') + \gamma U^{\pi}(s')\right]$$
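Applying this equation repeatedly as an update rule yields iterative policy evaluation. A minimal sketch for a tabular MDP, assuming a hypothetical format where `P[s][a]` is a list of `(prob, s_next, reward)` triples and `pi[s][a]` is $\pi(a|s)$:

```python
def policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    """Iterate the Bellman equation for U^pi until values stop changing."""
    U = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman backup: expectation over actions, then over next states.
            u = sum(
                pi[s][a] * sum(p * (r + gamma * U[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(u - U[s]))
            U[s] = u
        if delta < theta:
            return U
```

Updating `U` in place (rather than from a frozen copy) is the Gauss-Seidel variant; both converge for $\gamma < 1$.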

Bellman Equation for Q-Values

$$Q^{\pi}(s,a) = \sum_{s'} P(s'|s,a)\left[r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^{\pi}(s',a')\right]$$

Bellman Optimality Equations

$$U^*(s) = \max_a \sum_{s'} P(s'|s,a)\left[r(s,a,s') + \gamma U^*(s')\right]$$

$$Q^*(s,a) = \sum_{s'} P(s'|s,a)\left[r(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]$$
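Turning the optimality equation for $U^*$ into an update rule gives value iteration: the only change from policy evaluation is replacing the expectation over actions with a max. A sketch under the same assumed `(prob, s_next, reward)` transition format:

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """Iterate the Bellman optimality equation until U converges to U*."""
    U = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Optimality backup: max over actions instead of a policy average.
            u = max(
                sum(p * (r + gamma * U[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(u - U[s]))
            U[s] = u
        if delta < theta:
            return U
```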

Key Relationships

Policy Improvement

The greedy policy with respect to Q:

$$\pi'(s) = \arg\max_a Q^{\pi}(s,a)$$
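In the tabular case the argmax is a one-liner. A sketch assuming a hypothetical dict-of-dicts Q-table `Q[s][a]`:

```python
def greedy_policy(Q):
    """Return the deterministic policy pi'(s) = argmax_a Q(s, a)."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}

print(greedy_policy({'s0': {'left': 0.2, 'right': 0.8}}))  # {'s0': 'right'}
```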

Utility-Quality Relationship

$$U^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[Q^{\pi}(s,a)\right]$$

$$U^*(s) = \max_a Q^*(s,a)$$

Learning Paradigms

Model-Free vs Model-Based

Model-Based

Requires the transition function $P(s'|s,a)$ and reward function $r(s,a,s')$

Model-Free

Works directly with samples $(s, a, r, s')$ without requiring transition/reward models.

Uses sample estimates: $\hat{Q}(s,a) \approx Q^*(s,a)$

Active vs. Passive Learning

Passive Learning

  • Policy $\pi$ is fixed
  • Goal: Learn $U^{\pi}$ or $Q^{\pi}$ for the given policy
  • Example: Policy evaluation

Active Learning

  • Agent can modify policy $\pi$
  • Goal: Find the optimal policy $\pi^*$
  • Examples: Q-learning, policy gradient methods

On-Policy vs. Off-Policy Learning

On-Policy Learning

  • Learn about policy $\pi$ from experience generated by $\pi$
  • Policy being evaluated = policy generating experience
  • Example: SARSA update: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\right]$, where $a_{t+1}$ is chosen according to the current policy $\pi$
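The SARSA update above can be sketched as a single step, assuming a hypothetical dict-of-dicts Q-table and that `a_next` has already been sampled from the current policy:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One on-policy TD update: target uses the action the policy actually chose."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

Because `a_next` comes from the policy being learned, the experience-generating and evaluated policies coincide.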

Off-Policy Learning

  • Learn about a target policy $\pi$ from experience generated by a behavior policy $\mu$
  • Enables learning the optimal policy while following an exploratory policy
  • Example: Q-learning update: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\right]$
  • Important concept: the importance sampling ratio $\rho_t = \frac{\pi(a_t|s_t)}{\mu(a_t|s_t)}$ corrects for the difference between the target and behavior policies
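The Q-learning update differs from SARSA only in the target: it maxes over next actions rather than using the action the behavior policy took, which is what makes it off-policy. A sketch with the same assumed dict-of-dicts Q-table:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One off-policy TD update: target uses max over next actions,
    independent of whichever action the behavior policy will take."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```

Note the sampled `(s, a, r, s')` tuple is all that is needed; no importance weighting is required for this one-step target.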