Policy

A policy maps states to actions, either deterministically or stochastically:

Deterministic: $\pi(s) = a$

Stochastic: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$
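As a minimal sketch of the distinction (the state names, action names, and dict/NumPy representation are illustrative assumptions, not from the text), a deterministic policy is a plain lookup table, while a stochastic policy stores a distribution over actions per state:

```python
import numpy as np

# Deterministic policy: a plain state -> action lookup table.
deterministic_pi = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a distribution over actions.
actions = ["left", "right"]
stochastic_pi = {
    "s0": np.array([0.9, 0.1]),  # pi(a | s0)
    "s1": np.array([0.3, 0.7]),  # pi(a | s1)
}

rng = np.random.default_rng(0)

def act(state):
    """Sample an action a ~ pi(. | state) from the stochastic policy."""
    return actions[rng.choice(len(actions), p=stochastic_pi[state])]

print(deterministic_pi["s0"])  # always "left"
print(act("s1"))               # "left" with prob. 0.3, "right" with prob. 0.7
```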

State Transitions

The dynamics/transition function specifies the probability of next states:

$$P(s' \mid s, a) = P(S_{t+1} = s' \mid S_t = s, A_t = a)$$
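One minimal way to represent these dynamics in code (a hypothetical two-state, two-action MDP, invented for illustration) is a nested mapping from state and action to a distribution over next states:

```python
# P[s][a][s'] = P(s' | s, a); a hypothetical two-state, two-action MDP.
P = {
    "s0": {"left":  {"s0": 0.8, "s1": 0.2},
           "right": {"s1": 1.0}},
    "s1": {"left":  {"s0": 1.0},
           "right": {"s0": 0.1, "s1": 0.9}},
}

# Each conditional distribution must sum to one.
for s in P:
    for a in P[s]:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```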

State Utility Function

The utility function, mapping from the state space to expected total rewards (also denoted $V(s)$ for "value" in many texts, but here we follow Russell & Norvig and write $U$):

$$U : S \to \mathbb{R}$$

Quality Function (Q-Function)

The quality function, mapping from state-action pairs to expected total rewards:

$$Q : S \times A \to \mathbb{R}$$

Core Definitions

Return (Reward-to-go)

The discounted sum of future rewards from time step $t$:

$$G_t = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
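A short sketch of computing this quantity for a finite reward sequence (the rewards and $\gamma$ below are made up for illustration); folding from the end uses the recursion $G_t = r_t + \gamma G_{t+1}$:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k} for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):  # fold from the end: G = r + gamma * G'
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```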

Expected Utility Under a Policy

The expected return when starting in state $s$ and following policy $\pi$:

$$U^\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]$$

Quality Function Under a Policy

The expected return when starting in state $s$, taking action $a$, and thereafter following policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s, A_t = a \,\right]$$

Advantage Function

The relative advantage of action $a$ compared to the average action in state $s$:

$$A^\pi(s, a) = Q^\pi(s, a) - U^\pi(s)$$

Bellman Equations

Bellman Equation for Utility

$$U^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma\, U^\pi(s') \right]$$

Bellman Equation for Q-Values

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a') \right]$$

Bellman Optimality Equations

$$U^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma\, U^*(s') \right]$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]$$
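The optimality equation suggests value iteration: repeatedly apply the max-backup until $U$ stops changing. A minimal sketch, assuming the dict-of-dicts transition model shown earlier and a caller-supplied reward function (the function names and 0.9/1e-8 defaults are illustrative choices):

```python
def value_iteration(states, actions, P, r, gamma=0.9, tol=1e-8):
    """Iterate U(s) <- max_a sum_{s'} P(s'|s,a) [r(s,a,s') + gamma U(s')]."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = max(
                sum(p * (r(s, a, s2) + gamma * U[s2])
                    for s2, p in P[s][a].items())
                for a in actions
            )
            delta = max(delta, abs(backup - U[s]))
            U[s] = backup
        if delta < tol:
            return U

# Example with the transition model above, rewarding arrival in s1:
# U = value_iteration(["s0", "s1"], ["left", "right"], P,
#                     r=lambda s, a, s2: 1.0 if s2 == "s1" else 0.0)
```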

Key Relationships

Policy Improvement

The greedy policy with respect to $Q$:

$$\pi(s) = \arg\max_a Q(s, a)$$

Utility-Quality Relationship

$$U^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a), \qquad U^*(s) = \max_a Q^*(s, a)$$
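A sketch of both relationships for a tabular $Q$, assuming (as an illustrative convention, not from the text) that Q is stored as a dict keyed by (state, action) pairs:

```python
def greedy_policy(Q, states, actions):
    """pi(s) = argmax_a Q(s, a) for a tabular Q stored as Q[(s, a)]."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

def utility_from_q(Q, states, actions):
    """U*(s) = max_a Q*(s, a)."""
    return {s: max(Q[(s, a)] for a in actions) for s in states}
```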

Learning Paradigms

Model-Based vs. Model-Free

Model-Based

Requires the transition function $P(s' \mid s, a)$ and reward function $r(s, a, s')$.

Model-Free

Works directly with samples $(s, a, r, s')$ without requiring transition or reward models. Uses sample estimates of the Bellman backup, e.g. the one-step target $r + \gamma\, U(s')$ in place of the expectation over $P(s' \mid s, a)$.
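As a sketch of such a sample estimate, one-step TD(0) policy evaluation updates $U$ from a single observed transition instead of the full expectation (function and variable names here are illustrative):

```python
def td0_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    """U(s) <- U(s) + alpha [r + gamma U(s') - U(s)], from one sample (s, r, s')."""
    U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U
```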

Passive vs. Active Learning

Passive Learning

  • Policy is fixed
  • Goal: Learn $U^\pi$ or $Q^\pi$ for the given policy
  • Example: Policy evaluation

Active Learning

  • Agent can modify policy
  • Goal: Find optimal policy
  • Examples: Q-learning, policy gradient methods

On-Policy vs. Off-Policy Learning

On-Policy Learning

  • Learn about policy $\pi$ from experience generated by $\pi$ itself
  • Policy being evaluated = Policy generating experience
  • Example: the SARSA update $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma\, Q(s', a') - Q(s, a) \right]$, where $a'$ is chosen according to the current policy (see the sketch after this list)
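A minimal tabular SARSA step, a sketch assuming the next action $a'$ has already been sampled from the current (e.g. ε-greedy) policy and Q is a dict keyed by (state, action):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target uses the action a' the policy actually took."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q
```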

Off-Policy Learning

  • Learn about target policy $\pi$ from experience generated by behavior policy $b$
  • Enables learning optimal policy while following exploratory policy
  • Example: the Q-learning update $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$ (see the sketch after this list)
  • Important concept: the importance sampling ratio $\rho_t = \pi(a_t \mid s_t) \,/\, b(a_t \mid s_t)$, used to correct for the difference between the target and behavior policies
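For contrast with SARSA, a tabular Q-learning step and the per-step importance ratio, as a sketch (Q is a dict keyed by (state, action); pi and b are assumed to be callables returning action probabilities):

```python
def q_learning_update(Q, actions, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: the target bootstraps from the greedy action, not the one taken."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q

def importance_ratio(pi, b, s, a):
    """rho = pi(a|s) / b(a|s): reweights behavior-policy samples toward the target."""
    return pi(s, a) / b(s, a)
```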