Reinforcement learning is a class of machine learning methods that are goal-directed and learn from interaction
RL is about how to map situations to actions so as to maximize a reward signal
1.1.2 Formulation (MDP)
Markov Decision Process
For most RL cases, the state and action spaces are finite, and the MDP is called a finite MDP
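As a sketch in standard textbook notation (these symbols are assumed here, not defined in the notes themselves), a finite MDP can be written as a tuple of states, actions, transition dynamics, a reward function, and a discount factor:

```latex
% Finite MDP as a tuple (standard notation, assumed rather than taken from these notes)
\langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle, \quad
p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\}, \quad
r(s, a) = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a\right]
```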
to be continued
1.2 Characteristics
RL problems are:
closed-loop: the action is decided by the input, and the action then influences the next input
no direct instruction: the learner must try each policy to discover which yields the best reward
delayed reward: actions may affect not only the next reward but also all subsequent rewards
The MDP formulation suggests that the learning machine must (a minimal interaction loop is sketched after this list):
be able to sense the state
be able to take actions that affect the state
have a goal or goals that relate to the state
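A minimal sketch of this sense-act-goal loop (illustrative only; the `Environment`/`Agent` interfaces here are hypothetical placeholders, not from these notes):

```python
# Minimal agent-environment interaction loop (illustrative sketch).
import random

class Environment:
    def reset(self):
        """Return the initial state."""
        return 0

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        next_state = random.randint(0, 4)
        reward = 1.0 if action == next_state % 2 else 0.0
        return next_state, reward, next_state == 4

class Agent:
    def act(self, state):
        """Sense the state and choose an action (random policy here)."""
        return random.choice([0, 1])

env, agent = Environment(), Agent()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = agent.act(state)              # take an action that affects the state
    state, reward, done = env.step(action)
    total_reward += reward                  # the goal: maximize cumulative reward
print(total_reward)
```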
RL differs from:
supervised learning: in supervised learning, each sample (situation) is labeled with a specification, i.e. the correct action to take, while in RL there is no such "correct instruction" for actions
unsupervised learning: RL does not aim to find hidden structure in unlabeled data
These differences make RL a third paradigm of machine learning, rather than a part of supervised or unsupervised learning
1.3 Challenges and Pros
trade-off between exploitation (choosing actions already known to give good reward; greedy) and exploration (trying unknown actions that may turn out better; global); a simple ε-greedy sketch follows this list
consider the whole problem, instead of splitting it apart and studying subproblems in isolation
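One common way to handle the exploitation/exploration trade-off is ε-greedy action selection; a minimal sketch (the value estimates `q` are hypothetical placeholders):

```python
import random

def epsilon_greedy(q, epsilon=0.1):
    """Pick the greedy action most of the time (exploitation), but with
    probability epsilon pick a random action (exploration).
    q: list of estimated action values (placeholder estimates)."""
    if random.random() < epsilon:
        return random.randrange(len(q))                # explore
    return max(range(len(q)), key=lambda a: q[a])      # exploit

# Example: with q = [0.2, 0.5, 0.1], action 1 is chosen about 90% of the time.
print(epsilon_greedy([0.2, 0.5, 0.1]))
```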
1.4 Examples
A master chess player
An adaptive controller
Learning how to run
A robot decides whether to enter a room to collect trash
Preparing breakfast
These cases share the feature that they involve interaction with their environment, in order to achieve a goal despite uncertainty about the environment
1.5 Elements of RL
policy: given the state of the environment, what the agent will do
reward: defines the goal. Each state gives a number, and the total sum of rewards should be maximized
value function: the expected total sum of rewards starting from a state, indicating long-term goodness (see the formula after this list)
(optional) model of the environment: inferring how the environment will behave
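In standard textbook notation (a sketch; the discount factor γ is assumed here, not defined in the notes), the value of a state under a policy π is the expected discounted sum of rewards:

```latex
% State-value function: expected discounted return when starting in s and following policy \pi.
% \gamma \in [0, 1] is the discount factor (standard notation, assumed here).
v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]
```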
2.1 Tabular Solution
Simplest forms of RL: the state and action spaces are small enough for the approximate value functions to be represented as tables (arrays); a tabular TD(0) sketch follows the list of methods below.
Dynamic Programming
Monte Carlo methods
Temporal-difference learning (TD)
Combinations of the above three
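As a concrete illustration of the tabular setting, a sketch of the TD(0) update with the value function stored as an array (sketch only; `env` and `policy` are assumed to follow the interfaces used in the earlier loop, which are hypothetical):

```python
# Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
def td0_episode(env, policy, V, alpha=0.1, gamma=0.9):
    """Run one episode and update the tabular value estimates V (a list indexed by state)."""
    state, done = env.reset(), False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (target - V[state])   # move V(s) toward the TD target
        state = next_state
    return V
```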
2.2 Approximate Solution
When the state or action space is enormous, we cannot expect to find an optimal policy or value function exactly. Instead, we look for a good approximate solution
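One common route to such an approximate solution is to represent the value function with a parameterized function instead of a table, for example a linear function of state features; a minimal sketch (the feature map `phi` is a hypothetical placeholder):

```python
# Semi-gradient TD(0) with a linear value function v(s) ~ w . phi(s).
# Sketch under assumptions: phi(state) returns a fixed-length list of features.
def linear_td0_step(w, phi, state, reward, next_state, done, alpha=0.01, gamma=0.9):
    """Update the weight vector w after one observed transition."""
    v = sum(wi * xi for wi, xi in zip(w, phi(state)))
    v_next = 0.0 if done else sum(wi * xi for wi, xi in zip(w, phi(next_state)))
    td_error = reward + gamma * v_next - v
    return [wi + alpha * td_error * xi for wi, xi in zip(w, phi(state))]
```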