Lecture 7.2: Solving Markov Decision Processes
Core Definitions
Policy
A policy maps states to actions, either deterministically or stochastically:
Deterministic: $\pi(s) = a$, a fixed action for each state. Stochastic: $\pi(a \mid s)$, a probability distribution over actions given the state.
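A minimal sketch of the two policy types, assuming a small discrete MDP (the state and action names below are illustrative, not from the lecture):

```python
import random

# Deterministic policy: a plain mapping pi(s) -> a.
det_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: for each state, a distribution pi(a | s) over actions.
stoch_policy = {
    "s0": {"left": 0.8, "right": 0.2},
}

def sample_action(policy, state):
    """Draw an action according to pi(. | state)."""
    actions = list(policy[state])
    weights = list(policy[state].values())
    return random.choices(actions, weights=weights, k=1)[0]
```

Representing a stochastic policy as a per-state distribution makes the deterministic case a special instance (all mass on one action).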
State Transitions
The dynamics/transition function specifies the probability of each next state given the current state and action:

$P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$
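One common tabular encoding of the transition function, sketched for a hypothetical two-state MDP (names are illustrative):

```python
# P[s][a] maps each next state s' to its probability P(s' | s, a).
P = {
    "s0": {"right": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay":  {"s1": 1.0}},
}

def transition_prob(s, a, s_next):
    """Return P(s' | s, a), defaulting to 0 for unlisted next states."""
    return P[s][a].get(s_next, 0.0)
```

For every state-action pair, the probabilities over next states must sum to 1.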
State Utility Function
The utility function maps each state to the expected discounted sum of rewards obtainable from it (also denoted $V(s)$ for "value" in many texts, but following Russell & Norvig we write $U$):

$U(s) = \max_\pi U^\pi(s)$
Quality Function (Q-Function)
The quality function maps each state-action pair to the expected total discounted reward from taking action $a$ in state $s$ and acting optimally thereafter. It relates to the utility function via:

$U(s) = \max_a Q(s, a)$
Return (Reward-to-go)
The discounted sum of future rewards from time step $t$:

$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$

where $\gamma \in [0, 1)$ is the discount factor.
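The discounted sum can be computed for a finite reward sequence by accumulating from the back, a sketch:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k}, computed right-to-left:
    g <- r + gamma * g at each step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# e.g. rewards [1, 1, 1] with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71
```

The right-to-left recurrence avoids recomputing powers of $\gamma$.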
Expected Utility Under a Policy
The expected return when following policy $\pi$ from state $s$:

$U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right]$
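For a known tabular MDP, $U^\pi$ can be estimated by iterative policy evaluation: repeatedly apply the update $U(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U(s')$ until it converges. A sketch on a tiny hypothetical MDP (all names illustrative):

```python
P = {"s0": {"a": {"s1": 1.0}}, "s1": {"a": {"s1": 1.0}}}  # P[s][a] -> {s': prob}
R = {"s0": 0.0, "s1": 1.0}                                # reward per state
pi = {"s0": "a", "s1": "a"}                               # fixed deterministic policy
gamma = 0.5

U = {s: 0.0 for s in P}
for _ in range(100):  # the update is a contraction, so it converges geometrically
    U = {s: R[s] + gamma * sum(p * U[sp] for sp, p in P[s][pi[s]].items())
         for s in P}
```

Here the fixed point is $U^\pi(s_1) = 1/(1-\gamma) = 2$ and $U^\pi(s_0) = \gamma \cdot U^\pi(s_1) = 1$.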
Quality Function Under a Policy
The expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$:

$Q^\pi(s, a) = E\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$
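Given the utilities $U^\pi$, the quality function follows from a one-step lookahead, $Q^\pi(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, U^\pi(s')$. A sketch, with illustrative data and $U^\pi$ assumed already computed:

```python
P = {"s0": {"a": {"s1": 1.0}, "b": {"s0": 1.0}}}  # P[s][action] -> {s': prob}
R = {"s0": 0.0}                                    # reward per state
U = {"s0": 0.0, "s1": 2.0}                         # assumed precomputed U_pi values
gamma = 0.5

def q_from_u(s, a):
    """One-step lookahead: Q_pi(s, a) = R(s) + gamma * sum_s' P(s'|s,a) * U_pi(s')."""
    return R[s] + gamma * sum(p * U[sp] for sp, p in P[s][a].items())
```

With these numbers, action "a" scores $0 + 0.5 \cdot 2 = 1$ while "b" scores $0$, so a greedy policy at $s_0$ would pick "a".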