Policy

A policy maps states to actions, either deterministically or stochastically:

Deterministic: $\pi(s) = a$

Stochastic: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$
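As a minimal sketch of the distinction (the state names, action names, and dict/NumPy representation are illustrative assumptions, not from the text), a deterministic policy is a plain lookup table, while a stochastic policy stores a distribution over actions per state:

```python
import numpy as np

# Deterministic policy: a plain state -> action lookup table.
deterministic_pi = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a distribution over actions.
actions = ["left", "right"]
stochastic_pi = {
    "s0": np.array([0.9, 0.1]),  # pi(a | s0)
    "s1": np.array([0.3, 0.7]),  # pi(a | s1)
}

rng = np.random.default_rng(0)

def act(state):
    """Sample an action a ~ pi(. | state) from the stochastic policy."""
    return actions[rng.choice(len(actions), p=stochastic_pi[state])]

print(deterministic_pi["s0"])  # always "left"
print(act("s1"))               # "left" with prob. 0.3, "right" with prob. 0.7
```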

State Transitions

The dynamics/transition function specifies the probability of next states:

$$P(s' \mid s, a) = P(S_{t+1} = s' \mid S_t = s, A_t = a)$$
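One minimal way to represent these dynamics in code (a hypothetical two-state, two-action MDP, invented for illustration) is a nested mapping from state and action to a distribution over next states:

```python
# P[s][a][s'] = P(s' | s, a); a hypothetical two-state, two-action MDP.
P = {
    "s0": {"left":  {"s0": 0.8, "s1": 0.2},
           "right": {"s1": 1.0}},
    "s1": {"left":  {"s0": 1.0},
           "right": {"s0": 0.1, "s1": 0.9}},
}

# Each conditional distribution must sum to one.
for s in P:
    for a in P[s]:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```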

State Utility Function

The utility function, mapping from the state space to expected total rewards (also denoted $V(s)$ for "value" in many texts, but here we follow Russell & Norvig and write $U$):

$$U : S \to \mathbb{R}$$

Quality Function (Q-Function)

The quality function, mapping from state-action pairs to expected total rewards:

$$Q : S \times A \to \mathbb{R}$$

Core Definitions

Return (Reward-to-go)

The discounted sum of future rewards from time step $t$:

$$G_t = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
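A short sketch of computing this quantity for a finite reward sequence (the rewards and $\gamma$ below are made up for illustration); folding from the end uses the recursion $G_t = r_t + \gamma G_{t+1}$:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k} for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):  # fold from the end: G = r + gamma * G'
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```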

Expected Utility Under a Policy

The expected return when starting in state $s$ and following policy $\pi$:

$$U^\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]$$

Quality Function Under a Policy

The expected return when starting in state $s$, taking action $a$, and thereafter following policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s, A_t = a \,\right]$$

Advantage Function

The relative advantage of action $a$ compared to the average action in state $s$:

$$A^\pi(s, a) = Q^\pi(s, a) - U^\pi(s)$$

Bellman Equations

Bellman Equation for Utility

$$U^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma\, U^\pi(s') \right]$$

Bellman Equation for Q-Values

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a') \right]$$

Bellman Optimality Equations

$$U^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma\, U^*(s') \right]$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]$$
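The optimality equation suggests value iteration: repeatedly apply the max-backup until $U$ stops changing. A minimal sketch, assuming the dict-of-dicts transition model shown earlier and a caller-supplied reward function (the function names and 0.9/1e-8 defaults are illustrative choices):

```python
def value_iteration(states, actions, P, r, gamma=0.9, tol=1e-8):
    """Iterate U(s) <- max_a sum_{s'} P(s'|s,a) [r(s,a,s') + gamma U(s')]."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = max(
                sum(p * (r(s, a, s2) + gamma * U[s2])
                    for s2, p in P[s][a].items())
                for a in actions
            )
            delta = max(delta, abs(backup - U[s]))
            U[s] = backup
        if delta < tol:
            return U

# Example with the transition model above, rewarding arrival in s1:
# U = value_iteration(["s0", "s1"], ["left", "right"], P,
#                     r=lambda s, a, s2: 1.0 if s2 == "s1" else 0.0)
```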

Key Relationships

Policy Improvement

The greedy policy with respect to $Q$:

$$\pi(s) = \arg\max_a Q(s, a)$$

Utility-Quality Relationship

$$U^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a), \qquad U^*(s) = \max_a Q^*(s, a)$$
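A sketch of both relationships for a tabular $Q$, assuming (as an illustrative convention, not from the text) that Q is stored as a dict keyed by (state, action) pairs:

```python
def greedy_policy(Q, states, actions):
    """pi(s) = argmax_a Q(s, a) for a tabular Q stored as Q[(s, a)]."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

def utility_from_q(Q, states, actions):
    """U*(s) = max_a Q*(s, a)."""
    return {s: max(Q[(s, a)] for a in actions) for s in states}
```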

Learning Paradigms

Model-Based vs. Model-Free

Model-Based

Requires the transition function $P(s' \mid s, a)$ and reward function $r(s, a, s')$.

Model-Free

Works directly with samples $(s, a, r, s')$ without requiring transition or reward models. Uses sample estimates of the Bellman backup, e.g. the one-step target $r + \gamma\, U(s')$ in place of the expectation over $P(s' \mid s, a)$.
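As a sketch of such a sample estimate, one-step TD(0) policy evaluation updates $U$ from a single observed transition instead of the full expectation (function and variable names here are illustrative):

```python
def td0_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    """U(s) <- U(s) + alpha [r + gamma U(s') - U(s)], from one sample (s, r, s')."""
    U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U
```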

Passive vs. Active Learning

Passive Learning

  • Policy is fixed
  • Goal: Learn $U^\pi$ or $Q^\pi$ for the given policy
  • Example: Policy evaluation

Active Learning

  • Agent can modify policy
  • Goal: Find optimal policy
  • Examples: Q-learning, policy gradient methods

On-Policy vs. Off-Policy Learning

On-Policy Learning

  • Learn about policy $\pi$ from experience generated by $\pi$ itself
  • Policy being evaluated = Policy generating experience
  • Example: the SARSA update $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma\, Q(s', a') - Q(s, a) \right]$, where $a'$ is chosen according to the current policy (see the sketch after this list)
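A minimal tabular SARSA step, a sketch assuming the next action $a'$ has already been sampled from the current (e.g. ε-greedy) policy and Q is a dict keyed by (state, action):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target uses the action a' the policy actually took."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q
```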

Off-Policy Learning

  • Learn about target policy $\pi$ from experience generated by behavior policy $b$
  • Enables learning optimal policy while following exploratory policy
  • Example: the Q-learning update $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$ (see the sketch after this list)
  • Important concept: the importance sampling ratio $\rho_t = \pi(a_t \mid s_t) \,/\, b(a_t \mid s_t)$, used to correct for the difference between the target and behavior policies
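For contrast with SARSA, a tabular Q-learning step and the per-step importance ratio, as a sketch (Q is a dict keyed by (state, action); pi and b are assumed to be callables returning action probabilities):

```python
def q_learning_update(Q, actions, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: the target bootstraps from the greedy action, not the one taken."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q

def importance_ratio(pi, b, s, a):
    """rho = pi(a|s) / b(a|s): reweights behavior-policy samples toward the target."""
    return pi(s, a) / b(s, a)
```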