
Lecture 7.2: Solving Markov Decision Processes


Core Definitions

Policy

A policy maps states to actions, either deterministically or stochastically:

Deterministic: $\pi: \mathcal{S} \rightarrow \mathcal{A}$

Stochastic: $\pi: \mathcal{S} \rightarrow P(\mathcal{A})$

State Transitions

The dynamics/transition function specifies the probability of next states:

$$P(s'|s,a): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$$

State Utility Function

The utility function maps states to expected returns (many texts write $V(s)$ for "value"; here we follow Russell & Norvig's $U$):

$$U: \mathcal{S} \rightarrow \mathbb{R}$$

Quality Function (Q-Function)

The quality function mapping from state-action pairs to expected total rewards:

$$Q: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$$

Return (Reward-to-go)

The discounted sum of future rewards from time step t:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
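The backward recursion $G_t = r_t + \gamma G_{t+1}$ gives a simple way to compute returns from a finite episode. A minimal sketch, assuming a hypothetical list of sampled rewards (the function name is ours, not from the lecture):

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return G_t = sum_k gamma^k * r_{t+k} for a finite episode."""
    g = 0.0
    # Accumulate backwards so each step computes g = r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```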

Expected Utility Under a Policy

The expected return when following policy $\pi$:

$$U^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s\right]$$

Quality Function Under a Policy

Expected return starting with action $a$ and thereafter following policy $\pi$:

$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s, a_t = a\right]$$

Advantage Function

The relative advantage of action a compared to the average action in state s:

$$A^{\pi}(s,a) = Q^{\pi}(s,a) - U^{\pi}(s)$$

Bellman Equations

Bellman Equation for Utility

$$U^{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[r(s,a,s') + \gamma U^{\pi}(s')\right]$$
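Applying this equation repeatedly as an update rule yields iterative policy evaluation. A minimal sketch for a tabular MDP, assuming a hypothetical format where `P[s][a]` is a list of `(prob, s_next, reward)` triples and `pi[s][a]` is $\pi(a|s)$:

```python
def policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    """Iterate the Bellman equation for U^pi until values stop changing."""
    U = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman backup: expectation over actions, then over next states.
            u = sum(
                pi[s][a] * sum(p * (r + gamma * U[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(u - U[s]))
            U[s] = u
        if delta < theta:
            return U
```

Updating `U` in place (rather than from a frozen copy) is the Gauss-Seidel variant; both converge for $\gamma < 1$.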

Bellman Equation for Q-Values

$$Q^{\pi}(s,a) = \sum_{s'} P(s'|s,a)\left[r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^{\pi}(s',a')\right]$$

Bellman Optimality Equations

$$U^*(s) = \max_a \sum_{s'} P(s'|s,a)\left[r(s,a,s') + \gamma U^*(s')\right]$$

$$Q^*(s,a) = \sum_{s'} P(s'|s,a)\left[r(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]$$
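Turning the optimality equation for $U^*$ into an update rule gives value iteration: the only change from policy evaluation is replacing the expectation over actions with a max. A sketch under the same assumed `(prob, s_next, reward)` transition format:

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """Iterate the Bellman optimality equation until U converges to U*."""
    U = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Optimality backup: max over actions instead of a policy average.
            u = max(
                sum(p * (r + gamma * U[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(u - U[s]))
            U[s] = u
        if delta < theta:
            return U
```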

Key Relationships

Policy Improvement

The greedy policy with respect to Q:

$$\pi'(s) = \arg\max_a Q^{\pi}(s,a)$$
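In the tabular case the argmax is a one-liner. A sketch assuming a hypothetical dict-of-dicts Q-table `Q[s][a]`:

```python
def greedy_policy(Q):
    """Return the deterministic policy pi'(s) = argmax_a Q(s, a)."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}

print(greedy_policy({'s0': {'left': 0.2, 'right': 0.8}}))  # {'s0': 'right'}
```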

Utility-Quality Relationship

$$U^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[Q^{\pi}(s,a)\right]$$

$$U^*(s) = \max_a Q^*(s,a)$$

Learning Paradigms

Model-Free vs Model-Based

Model-Based

Requires the transition function $P(s'|s,a)$ and reward function $r(s,a,s')$

Model-Free

Works directly with samples $(s, a, r, s')$ without requiring transition/reward models.

Uses sample estimates: $\hat{Q}(s,a) \approx Q^*(s,a)$

Active vs. Passive Learning

Passive Learning

  • Policy $\pi$ is fixed
  • Goal: Learn $U^{\pi}$ or $Q^{\pi}$ for the given policy
  • Example: Policy evaluation

Active Learning

  • Agent can modify policy $\pi$
  • Goal: Find the optimal policy $\pi^*$
  • Examples: Q-learning, policy gradient methods

On-Policy vs. Off-Policy Learning

On-Policy Learning

  • Learn about policy $\pi$ from experience generated by $\pi$
  • Policy being evaluated = policy generating experience
  • Example: SARSA update: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\right]$, where $a_{t+1}$ is chosen according to the current policy $\pi$
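The SARSA update above can be sketched as a single step, assuming a hypothetical dict-of-dicts Q-table and that `a_next` has already been sampled from the current policy:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One on-policy TD update: target uses the action the policy actually chose."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

Because `a_next` comes from the policy being learned, the experience-generating and evaluated policies coincide.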

Off-Policy Learning

  • Learn about a target policy $\pi$ from experience generated by a behavior policy $\mu$
  • Enables learning the optimal policy while following an exploratory policy
  • Example: Q-learning update: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\right]$
  • Important concept: the importance sampling ratio $\rho_t = \frac{\pi(a_t|s_t)}{\mu(a_t|s_t)}$ corrects for the difference between the target and behavior policies
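The Q-learning update differs from SARSA only in the target: it maxes over next actions rather than using the action the behavior policy took, which is what makes it off-policy. A sketch with the same assumed dict-of-dicts Q-table:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One off-policy TD update: target uses max over next actions,
    independent of whichever action the behavior policy will take."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```

Note the sampled `(s, a, r, s')` tuple is all that is needed; no importance weighting is required for this one-step target.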