Lecture 7.1: Reinforcement Learning Intro

Learning Objectives

  1. Compare and contrast reinforcement learning with other AI paradigms (search, adversarial search, machine learning)
  2. Define the key components of the agent-environment interaction loop
  3. Identify the five components of a Markov Decision Process (MDP), explain each, and describe their relationships
  4. Differentiate between states and observations in reinforcement learning environments
  5. Explain the structure of a “trajectory”. Explain how trajectories are generated using the MDP framework
  6. Define fully observed versus partially observed environments

Artificial Intelligence paradigms so far

Search:

Developer writes heuristics

Imperative specification for how to solve the problem

Adversarial Search:

Developer writes evaluation function

Machine Learning:

Developer collects data

Declarative, implicit specification via examples

Reinforcement Learning:

Developer defines state space, actions, and reward function

Reward Function

The reward function is an abstract definition of good behavior

Different from machine learning… ML relies on concrete examples of good behavior

Different from search in Weeks 1 & 2… Those approaches relied on specifications of how to produce good behavior
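As a small illustration (a hypothetical grid-world task, not one from the lecture), a reward function in code specifies what counts as good behavior without saying how to achieve it:

```python
# Illustrative sketch (hypothetical grid-world task): the reward function says
# *what* is good (reaching the goal), not *how* to reach it.
def reward(state, action, next_state, goal=(3, 4)):
    # +1 when the agent arrives at the goal, a small penalty otherwise
    return 1.0 if next_state == goal else -0.01
```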

Agent-Environment Interaction

RL involves an agent-environment interaction loop: at each time step, the agent selects an action, and the environment responds with the next state (or observation) and a reward.

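A minimal sketch of this loop, assuming a Gymnasium-style environment API (the environment name and the random-action policy below are placeholders):

```python
# Minimal agent-environment interaction loop using the Gymnasium API.
# The environment name and the random policy are placeholders.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()              # environment produces the first observation

episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()                            # agent acts
    obs, reward, terminated, truncated, info = env.step(action)   # environment responds
    episode_return += reward                                      # reward accumulates
    done = terminated or truncated

print(f"Episode return: {episode_return}")
```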

Markov Decision Process

An MDP is a 5-tuple ⟨S, A, R, P, ρ₀⟩, where S is the set of states, A is the set of actions, R is the reward function, P is the transition probability function, and ρ₀ is the start-state distribution.

States and Observations

A state is a complete description of the state of the world. There is no information about the world which is hidden from the state.

An observation is a partial description of a state, which may omit information.

When the agent is able to observe the complete state of the environment, we say that the environment is fully observed.

When the agent can only see a partial observation, we say that the environment is partially observed.
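One way to illustrate the distinction (a hypothetical grid world, not from the lecture): the state contains everything about the world, while the observation may hide part of it:

```python
# Illustrative sketch (hypothetical grid world): the state is a complete
# description of the world; an observation may omit part of it.
from dataclasses import dataclass

@dataclass
class State:
    agent_pos: tuple  # where the agent is
    goal_pos: tuple   # where the goal is -- part of the world, hidden below

def observe(state: State) -> tuple:
    # Partial observation: the agent sees its own position but not the goal,
    # so this environment is partially observed.
    return state.agent_pos

s = State(agent_pos=(0, 0), goal_pos=(3, 4))
print(observe(s))  # (0, 0) -- omits the goal position
```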

Trajectory

A trajectory τ is a sequence of states and actions in the world, τ = (s₀, a₀, s₁, a₁, …).

The very first state of the world, s₀, is randomly sampled from the start-state distribution, denoted by ρ₀: s₀ ~ ρ₀(·).
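A sketch of how a trajectory might be generated under the MDP framework, using a toy two-state MDP with made-up probabilities: the first state is drawn from ρ₀, and each subsequent state is drawn from the transition probabilities given the current state and action:

```python
# Sketch: generating a trajectory tau = (s0, a0, s1, a1, ...) from a toy
# two-state MDP. All states, actions, and probabilities are made up.
import random

actions = ["left", "right"]

rho_0 = {"A": 0.8, "B": 0.2}        # start-state distribution rho_0
P = {                                # transition probabilities P(s' | s, a)
    ("A", "left"):  {"A": 0.9, "B": 0.1},
    ("A", "right"): {"A": 0.2, "B": 0.8},
    ("B", "left"):  {"A": 0.7, "B": 0.3},
    ("B", "right"): {"A": 0.1, "B": 0.9},
}

def sample(dist):
    # Draw one key from a {outcome: probability} dictionary
    return random.choices(list(dist), weights=list(dist.values()))[0]

def sample_trajectory(horizon=5):
    s = sample(rho_0)                # s0 ~ rho_0(.)
    trajectory = [s]
    for _ in range(horizon):
        a = random.choice(actions)   # a_t chosen by a (random) policy
        s = sample(P[(s, a)])        # s_{t+1} ~ P(. | s_t, a_t)
        trajectory += [a, s]
    return trajectory

print(sample_trajectory())           # e.g. ['A', 'right', 'B', 'left', 'A', ...]
```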