Core Definitions
Policy
A policy maps states to actions, either deterministically or stochastically:
Deterministic: $a = \pi(s)$
Stochastic: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$
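As a minimal sketch of the two policy types, a deterministic policy can be stored as a state-to-action table and a stochastic policy as a per-state distribution over actions. The state names, action names, and probabilities below are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical two-state, two-action problem (illustrative names only).
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "right", "s1": "left"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}


def act_deterministic(state):
    """Return the single action the deterministic policy prescribes."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Sample an action from the stochastic policy's distribution for `state`."""
    dist = stochastic_policy[state]
    actions = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(actions, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampled actions reproducible.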
State Transitions
The dynamics/transition function specifies the probability of each next state given the current state and action:
$P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$
State Utility Function
The utility function maps each state to its expected total reward (also denoted V(s), for “value”, in many texts; the notation U(s) follows Russell & Norvig):
$U : \mathcal{S} \to \mathbb{R}$
Quality Function (Q-Function)
The quality function maps each state-action pair to its expected total reward:
$Q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$
Return (Reward-to-go)
The discounted sum of future rewards from time step t, with discount factor $\gamma \in [0, 1]$:
$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
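The return can be computed from a finite list of rewards by accumulating backwards, which avoids recomputing powers of the discount factor. This is a sketch under the assumption that `rewards[k]` holds the reward received k+1 steps after time t and that the episode ends after the last entry; the function names are illustrative.

```python
def discounted_return(rewards, gamma):
    """Return rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def rewards_to_go(rewards, gamma):
    """Return the list whose entry t is the discounted return from step t onward."""
    out = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out
```

For example, with rewards `[1, 1, 1]` and a discount factor of 0.5, the return from the first step is 1 + 0.5 + 0.25 = 1.75.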
Expected Utility Under a Policy
The expected return when starting in state s and following policy $\pi$:
$U^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]$
Quality Function Under a Policy
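The expected return when starting in state s with action a and following policy $\pi$ thereafter:
$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$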
Advantage Function
The relative advantage of action a compared to the policy's average action in state s:
$A^\pi(s, a) = Q^\pi(s, a) - U^\pi(s)$
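A small sketch of the advantage computation for a single state, assuming the Q-values and the policy's action probabilities for that state are given as dictionaries (a hypothetical representation chosen for brevity). The state utility is taken as the policy-weighted average of the Q-values, and the advantage is each Q-value minus that average.

```python
def state_utility(q_values, action_probs):
    """Policy-weighted average of Q-values for one state."""
    return sum(action_probs[a] * q_values[a] for a in q_values)


def advantages(q_values, action_probs):
    """Advantage of each action: its Q-value minus the state's utility."""
    u = state_utility(q_values, action_probs)
    return {a: q - u for a, q in q_values.items()}


# Illustrative numbers, not taken from any real task.
q_s = {"left": 1.0, "right": 3.0}
pi_s = {"left": 0.5, "right": 0.5}
adv = advantages(q_s, pi_s)
```

By construction, the advantages average to zero under the policy's own action distribution.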
Bellman Equations
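Bellman Equation for Utility
$U^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma U^\pi(s') \right]$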
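Bellman Equation for Q-Values
$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a') \right]$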
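Bellman Optimality Equations
$U^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma U^*(s') \right]$
$Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]$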
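One way to see the Bellman equations at work is to apply them repeatedly as update rules until the utilities stop changing. The sketch below does this for a hypothetical two-state MDP: `policy_evaluation` iterates the Bellman equation for a fixed policy, and `value_iteration` iterates the optimality equation. The MDP, the discount factor of 0.9, and the tolerance are all assumptions made for illustration.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}


def backup(u, s, a):
    """Expected one-step reward plus discounted utility of the next state."""
    return sum(p * (r + GAMMA * u[s2]) for p, s2, r in P[s][a])


def policy_evaluation(policy, tol=1e-10):
    """Iterate the Bellman equation for a fixed stochastic policy."""
    u = {s: 0.0 for s in P}
    while True:
        new_u = {
            s: sum(policy[s][a] * backup(u, s, a) for a in P[s])
            for s in P
        }
        if max(abs(new_u[s] - u[s]) for s in P) < tol:
            return new_u
        u = new_u


def value_iteration(tol=1e-10):
    """Iterate the Bellman optimality equation (max over actions)."""
    u = {s: 0.0 for s in P}
    while True:
        new_u = {s: max(backup(u, s, a) for a in P[s]) for s in P}
        if max(abs(new_u[s] - u[s]) for s in P) < tol:
            return new_u
        u = new_u


# Uniform random policy over the two actions in each state.
uniform = {s: {a: 0.5 for a in P[s]} for s in P}
u_uniform = policy_evaluation(uniform)
u_star = value_iteration()
```

In this toy problem the optimal utilities work out to about 19 for state 0 and 20 for state 1 (go to state 1, then stay there), while the uniform random policy yields about 7.25 and 7.75.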
Key Relationships
Policy Improvement
The greedy policy with respect to Q selects, in each state, an action with the highest Q-value:
$\pi'(s) = \arg\max_{a} Q(s, a)$
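Extracting a greedy policy from a tabular Q-function takes only a few lines. This sketch assumes Q is stored as a nested dictionary from state to action to value (a hypothetical layout); ties are broken in favor of the first action encountered.

```python
def greedy_policy(q):
    """Map each state to an action with the highest Q-value in that state."""
    return {s: max(action_values, key=action_values.get)
            for s, action_values in q.items()}


# Illustrative Q-table, not taken from any real task.
q_table = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}
policy = greedy_policy(q_table)
```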
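Utility-Quality Relationship
$U^\pi(s) = \sum_{a} \pi(a \mid s) Q^\pi(s, a)$
$U^*(s) = \max_{a} Q^*(s, a)$
$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma U^\pi(s') \right]$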
Learning Paradigms
Model-Free vs Model-Based
Model-Based
Requires the transition function $P(s' \mid s, a)$ and the reward function $r(s, a, s')$
Model-Free
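Works directly with samples $(s, a, r, s')$ without requiring transition or reward models. Uses sample estimates in place of expectations:
$\mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a, s') + \gamma U(s') \right] \approx \frac{1}{N} \sum_{i=1}^{N} \left[ r_i + \gamma U(s'_i) \right]$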
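To make the contrast concrete, the sketch below computes the same one-step backup in two ways for a single state-action pair: once by summing over a known model, and once by averaging over sampled transitions. The model, the utility estimates, and the discount factor are hypothetical; in a real model-free setting the samples would come from the environment, and `MODEL` would not be available to the learner.

```python
import random

GAMMA = 0.9                              # assumed discount factor
U = {0: 0.0, 1: 10.0}                    # hypothetical current utility estimates
# Hypothetical model for one (s, a): (probability, next_state, reward).
MODEL = [(0.3, 0, 1.0), (0.7, 1, 0.0)]


def model_based_backup():
    """Exact expectation, which requires the transition and reward model."""
    return sum(p * (r + GAMMA * U[s2]) for p, s2, r in MODEL)


def sample_transition(rng):
    """Stand-in for the environment: returns one sampled (next_state, reward)."""
    outcomes = [(s2, r) for _, s2, r in MODEL]
    weights = [p for p, _, _ in MODEL]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def model_free_backup(n_samples, rng):
    """Sample average of r + gamma * U(s'), using samples only."""
    total = 0.0
    for _ in range(n_samples):
        s2, r = sample_transition(rng)
        total += r + GAMMA * U[s2]
    return total / n_samples
```

With enough samples the model-free average approaches the model-based value, here 0.3 * 1 + 0.7 * 9 = 6.6.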
Active vs. Passive Learning
Passive Learning
- Policy is fixed
- Goal: Learn $U^\pi$ or $Q^\pi$ for the given policy
- Example: Policy evaluation
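As one possible instance of passive learning, the sketch below evaluates a fixed policy with temporal-difference (TD(0)) updates, a technique not named in the list above and chosen here only as an example. It assumes experience arrives as episodes of `(state, reward, next_state)` transitions already generated by the fixed policy, with `next_state` set to `None` at termination; the step size and discount factor are illustrative defaults.

```python
def td0_policy_evaluation(episodes, alpha=0.1, gamma=0.9):
    """Estimate state utilities for the policy that generated `episodes`."""
    u = {}
    for episode in episodes:
        for s, r, s2 in episode:
            u.setdefault(s, 0.0)
            if s2 is None:
                target = r                      # terminal: no future utility
            else:
                target = r + gamma * u.setdefault(s2, 0.0)
            # Move the estimate a small step toward the one-step target.
            u[s] += alpha * (target - u[s])
    return u


# Hypothetical deterministic episode: A -> B -> terminal, rewards 0 then 1.
episode = [("A", 0.0, "B"), ("B", 1.0, None)]
u_hat = td0_policy_evaluation([episode] * 500)
```

The policy itself never changes here; only the utility estimates do. In this toy chain they approach 1.0 for B and 0.9 for A.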
Active Learning
- Agent can modify policy
- Goal: Find optimal policy
- Examples: Q-learning, policy gradient methods
On-Policy vs. Off-Policy Learning
On-Policy Learning
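- Learn about policy $\pi$ from experience generated by $\pi$
- Policy being evaluated = Policy generating experience
- Example: SARSA update: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$, where $a'$ is chosen according to the current policy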
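A single tabular SARSA step can be sketched as follows. The Q-table is assumed to be a `defaultdict` keyed by `(state, action)` pairs, and the `done` flag for terminal transitions is an addition for practicality; both are illustrative choices rather than part of the update rule itself.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s2, a2, alpha=0.1, gamma=0.9, done=False):
    """Apply one SARSA update in place and return the new Q(s, a).

    a2 must be the action the current policy actually selects in s2,
    which is what makes the update on-policy.
    """
    target = r if done else r + gamma * q[(s2, a2)]
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


# Illustrative values only.
q = defaultdict(float)
q[("s1", "right")] = 2.0
new_value = sarsa_update(q, "s0", "left", 1.0, "s1", "right", alpha=0.5)
```

Here the target is 1 + 0.9 * 2 = 2.8, so with a step size of 0.5 the estimate for `("s0", "left")` moves from 0 to about 1.4.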
Off-Policy Learning
- Learn about target policy $\pi$ from experience generated by behavior policy $b$
- Enables learning the optimal policy while following an exploratory policy
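- Example: Q-learning update: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
- Important concept: Importance Sampling Ratio: $\rho_t = \frac{\pi(a_t \mid s_t)}{b(a_t \mid s_t)}$, where $\pi$ is the target policy and $b$ is the behavior policy. Used to correct for the difference between target and behavior policies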
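The two ideas in this list can be sketched in a few lines each. `q_learning_update` bootstraps from the best next action regardless of which action the behavior policy takes next, and `importance_ratio` computes the per-step ratio of target to behavior action probabilities. The table layout, the `done` flag, and the example policies are assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s2, actions, alpha=0.1, gamma=0.9, done=False):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    # Off-policy: the target uses the max over next actions, not the
    # action the behavior policy will actually take in s2.
    best_next = 0.0 if done else max(q[(s2, a2)] for a2 in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q[(s, a)]


def importance_ratio(target_probs, behavior_probs, s, a):
    """Per-step ratio of target to behavior probability for action a in s."""
    return target_probs[s][a] / behavior_probs[s][a]


# Illustrative values only.
ACTIONS = ["left", "right"]
q = defaultdict(float)
q[("s1", "left")] = 1.0
q[("s1", "right")] = 2.0
new_value = q_learning_update(q, "s0", "left", 1.0, "s1", ACTIONS, alpha=0.5)

target = {"s0": {"left": 0.0, "right": 1.0}}    # greedy target policy
behavior = {"s0": {"left": 0.5, "right": 0.5}}  # uniform exploratory policy
rho_right = importance_ratio(target, behavior, "s0", "right")
rho_left = importance_ratio(target, behavior, "s0", "left")
```

The ratio is only defined when the behavior policy gives nonzero probability to the action taken, so the behavior policy must cover every action the target policy might choose.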