Lecture 7.1: Reinforcement Learning Intro

Learning Objectives

  1. Compare and contrast reinforcement learning with other AI paradigms (search, adversarial search, machine learning)
  2. Define the key components of the agent-environment interaction loop
  3. Identify the five components of a Markov Decision Process (MDP), explain each, and describe their relationships
  4. Differentiate between states and observations in reinforcement learning environments
  5. Explain the structure of a “trajectory”. Explain how trajectories are generated using the MDP framework
  6. Define fully observed versus partially observed environments

Artificial Intelligence paradigms so far

Search:

Developer writes heuristics

Imperative specification for how to solve the problem

Adversarial Search:

Developer writes evaluation function

Machine Learning:

Developer collects data

Declarative, implicit specification via examples

Reinforcement Learning:

Developer defines state space, actions, and reward function

Reward Function

The reward function is an abstract definition of good behavior

Different from machine learning… ML relies on concrete examples of good behavior

Different from search in Weeks 1 & 2… Those approaches relied on specifications of how to produce good behavior
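As a small illustration (a hypothetical grid-world task, not one from the lecture), a reward function in code specifies what counts as good behavior without saying how to achieve it:

```python
# Illustrative sketch (hypothetical grid-world task): the reward function says
# *what* is good (reaching the goal), not *how* to reach it.
def reward(state, action, next_state, goal=(3, 4)):
    # +1 when the agent arrives at the goal, a small penalty otherwise
    return 1.0 if next_state == goal else -0.01
```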

Agent-Environment Interaction

RL involves an agent-environment interaction loop: at each time step, the agent selects an action, and the environment responds with the next state (or observation) and a reward.

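A minimal sketch of this loop, assuming a Gymnasium-style environment API (the environment name and the random-action policy below are placeholders):

```python
# Minimal agent-environment interaction loop using the Gymnasium API.
# The environment name and the random policy are placeholders.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()              # environment produces the first observation

episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()                            # agent acts
    obs, reward, terminated, truncated, info = env.step(action)   # environment responds
    episode_return += reward                                      # reward accumulates
    done = terminated or truncated

print(f"Episode return: {episode_return}")
```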

Markov Decision Process

An MDP is a 5-tuple ⟨S, A, R, P, ρ₀⟩, where S is the set of states, A is the set of actions, R is the reward function, P is the transition probability function, and ρ₀ is the start-state distribution.

States and Observations

A state is a complete description of the state of the world. There is no information about the world which is hidden from the state.

An observation is a partial description of a state, which may omit information.

When the agent is able to observe the complete state of the environment, we say that the environment is fully observed.

When the agent can only see a partial observation, we say that the environment is partially observed.
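One way to illustrate the distinction (a hypothetical grid world, not from the lecture): the state contains everything about the world, while the observation may hide part of it:

```python
# Illustrative sketch (hypothetical grid world): the state is a complete
# description of the world; an observation may omit part of it.
from dataclasses import dataclass

@dataclass
class State:
    agent_pos: tuple  # where the agent is
    goal_pos: tuple   # where the goal is -- part of the world, hidden below

def observe(state: State) -> tuple:
    # Partial observation: the agent sees its own position but not the goal,
    # so this environment is partially observed.
    return state.agent_pos

s = State(agent_pos=(0, 0), goal_pos=(3, 4))
print(observe(s))  # (0, 0) -- omits the goal position
```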

Trajectory

A trajectory τ is a sequence of states and actions in the world, τ = (s₀, a₀, s₁, a₁, …).

The very first state of the world, s₀, is randomly sampled from the start-state distribution, denoted by ρ₀: s₀ ~ ρ₀(·).
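A sketch of how a trajectory might be generated under the MDP framework, using a toy two-state MDP with made-up probabilities: the first state is drawn from ρ₀, and each subsequent state is drawn from the transition probabilities given the current state and action:

```python
# Sketch: generating a trajectory tau = (s0, a0, s1, a1, ...) from a toy
# two-state MDP. All states, actions, and probabilities are made up.
import random

actions = ["left", "right"]

rho_0 = {"A": 0.8, "B": 0.2}        # start-state distribution rho_0
P = {                                # transition probabilities P(s' | s, a)
    ("A", "left"):  {"A": 0.9, "B": 0.1},
    ("A", "right"): {"A": 0.2, "B": 0.8},
    ("B", "left"):  {"A": 0.7, "B": 0.3},
    ("B", "right"): {"A": 0.1, "B": 0.9},
}

def sample(dist):
    # Draw one key from a {outcome: probability} dictionary
    return random.choices(list(dist), weights=list(dist.values()))[0]

def sample_trajectory(horizon=5):
    s = sample(rho_0)                # s0 ~ rho_0(.)
    trajectory = [s]
    for _ in range(horizon):
        a = random.choice(actions)   # a_t chosen by a (random) policy
        s = sample(P[(s, a)])        # s_{t+1} ~ P(. | s_t, a_t)
        trajectory += [a, s]
    return trajectory

print(sample_trajectory())           # e.g. ['A', 'right', 'B', 'left', 'A', ...]
```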