Lecture 7.1: Reinforcement Learning Intro
Learning Objectives
- Compare and contrast reinforcement learning with other AI paradigms (search, adversarial search, machine learning)
- Define the key components of the agent-environment interaction loop
- Identify the five components of a Markov Decision Process (MDP), explain each component, and describe how they relate to one another
- Differentiate between states and observations in reinforcement learning environments
- Explain the structure of a “trajectory” and how trajectories are generated under the MDP framework
- Define fully observed versus partially observed environments
Artificial Intelligence paradigms so far
Search:
Developer writes heuristics
Imperative specification for how to solve the problem
Adversarial Search:
Developer writes evaluation function
Machine Learning:
Developer collects data
Declarative, implicit specification via examples
Reinforcement Learning:
Developer defines state space, actions, and reward function
Reward Function
The reward function is an abstract definition of good behavior
Different from machine learning… ML relies on concrete examples of good behavior
Different from search in Weeks 1 & 2… Those approaches relied on specifications of how to produce good behavior
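To make "abstract definition of good behavior" concrete, here is a minimal sketch of a reward function for a hypothetical grid-world; the goal cell and step penalty are illustrative assumptions, not part of the lecture:

```python
# Hypothetical grid-world reward: the developer specifies only what counts as
# good behavior (reaching the goal), never how the agent should get there.
GOAL = (3, 3)  # assumed goal cell, purely for illustration

def reward(state, action, next_state):
    """Return +1 for reaching the goal and a small step cost otherwise."""
    if next_state == GOAL:
        return 1.0
    return -0.04  # small per-step penalty nudges the agent toward short paths
```

Contrast this with supervised machine learning, where we would instead need labeled examples of the correct action in many individual states.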
Agent-Environment Interaction
RL involves an agent-environment interaction loop: at each time step, the agent observes the current state (or observation), chooses an action, and the environment responds with the next state and a reward.
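A minimal sketch of that loop using the Gymnasium API; the CartPole-v1 environment and the random policy are assumptions for illustration, not part of the lecture:

```python
# Sketch of the agent-environment interaction loop (Gymnasium-style API).
import gymnasium as gym

env = gym.make("CartPole-v1")            # environment (assumed for illustration)
obs, info = env.reset()                  # environment emits the first observation

done = False
while not done:
    action = env.action_space.sample()   # "agent": a random policy stand-in
    obs, reward, terminated, truncated, info = env.step(action)  # env responds
    done = terminated or truncated       # episode ends on termination/truncation
env.close()
```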
Markov Decision Process
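The learning objectives above refer to five components; one common formulation (consistent with the ρ₀ notation used under Trajectory below) lists states, actions, transition probabilities, a reward function, and a start-state distribution. A minimal tabular sketch, with all names and numbers invented for illustration:

```python
# Hypothetical two-state MDP, just to show the five components side by side.
states = ["s0", "s1"]                        # S: state space
actions = ["left", "right"]                  # A: action space

# P(s' | s, a): transition probabilities
P = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s1": 1.0},
    ("s1", "left"):  {"s0": 0.8, "s1": 0.2},
    ("s1", "right"): {"s1": 1.0},
}

# R(s, a, s'): reward function
R = {("s0", "right", "s1"): 1.0}             # unlisted transitions give 0 reward

# rho_0: start-state distribution
rho_0 = {"s0": 1.0}
```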
States and Observations
A state is a complete description of the state of the world. There is no information about the world which is hidden from the state.
An observation is a partial description of a state, which may omit information.
When the agent is able to observe the complete state of the environment, we say that the environment is fully observed.
When the agent can only see a partial observation, we say that the environment is partially observed.
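A small sketch of the distinction, assuming a made-up environment whose full state is (position, velocity) but whose observation exposes only the position, making the environment partially observed:

```python
from dataclasses import dataclass

@dataclass
class State:
    position: float   # the full state describes everything about the world
    velocity: float

def observe(state: State) -> float:
    """Partial observation: the agent sees only position, not velocity."""
    return state.position

s = State(position=1.5, velocity=-0.3)
obs = observe(s)   # 1.5 -- velocity is hidden, so the environment is partially observed
```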
Trajectory
A trajectory τ is a sequence of states and actions in the world, τ = (s₀, a₀, s₁, a₁, …).
The very first state of the world, s₀, is randomly sampled from the start-state distribution, denoted by ρ₀: s₀ ~ ρ₀(·).
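A sketch of how a trajectory is generated under this framework: the start state is drawn from ρ₀, then actions and next states are sampled in turn. The tables, action names, and random placeholder policy below are all illustrative assumptions.

```python
import random

# Illustrative start-state distribution and transition table.
rho_0 = {"s0": 0.7, "s1": 0.3}                        # rho_0: start-state distribution
P = {
    ("s0", "a"): {"s0": 0.5, "s1": 0.5},
    ("s0", "b"): {"s1": 1.0},
    ("s1", "a"): {"s0": 1.0},
    ("s1", "b"): {"s1": 1.0},
}
actions = ["a", "b"]

def sample(dist):
    """Sample an outcome from a {outcome: probability} dict."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def rollout(horizon=5):
    """Generate a trajectory tau = (s0, a0, s1, a1, ...) of fixed length."""
    s = sample(rho_0)                                 # s0 ~ rho_0(.)
    tau = [s]
    for _ in range(horizon):
        a = random.choice(actions)                    # placeholder random policy
        tau.append(a)
        s = sample(P[(s, a)])                         # next state ~ P(. | s, a)
        tau.append(s)
    return tau

print(rollout())   # e.g. ['s0', 'b', 's1', 'a', 's0', ...]
```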