Overall Map

Chapter 1: Basic Concepts

Concepts: state, action, reward, return, episode, policy

Grid-world example

Markov decision process (MDP)
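To make these ingredients concrete, here is a minimal sketch (not from the text) of how a grid-world MDP could be represented; the grid size, target cell, and reward values are illustrative assumptions.

```python
# Hypothetical 3x3 grid world: states are cell indices 0..8,
# actions move the agent; all numbers below are illustrative.
n_rows, n_cols = 3, 3
states = list(range(n_rows * n_cols))
actions = ["up", "down", "left", "right", "stay"]
target = 8        # assumed target cell
gamma = 0.9       # discount rate

def step(s, a):
    """Deterministic transition: return (next_state, reward)."""
    r, c = divmod(s, n_cols)
    if a == "up":    r = max(r - 1, 0)
    if a == "down":  r = min(r + 1, n_rows - 1)
    if a == "left":  c = max(c - 1, 0)
    if a == "right": c = min(c + 1, n_cols - 1)
    s_next = r * n_cols + c
    reward = 1.0 if s_next == target else 0.0
    return s_next, reward

# A policy maps each state to a distribution over actions (uniform here).
policy = {s: {a: 1.0 / len(actions) for a in actions} for s in states}
```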

Chapter 2: Bellman Equation

One concept: state value

\(v_\pi(s)=\mathbb{E}[G_t|S_t=s]\)

One tool: Bellman equation

\(v_\pi = r_\pi + \gamma P_\pi v_\pi\)
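In this matrix-vector form the equation is linear in \(v_\pi\), so it can be solved in closed form or by fixed-point iteration. A minimal sketch, assuming an illustrative 3-state \(r_\pi\) and \(P_\pi\) (not taken from the text):

```python
import numpy as np

gamma = 0.9
# Illustrative 3-state example: r_pi[i] is the expected immediate reward
# under the policy, P_pi[i, j] the state-transition probability.
r_pi = np.array([0.0, 1.0, 0.5])
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])

# Closed-form solution: v_pi = (I - gamma * P_pi)^{-1} r_pi
v_closed = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Iterative solution: v <- r_pi + gamma * P_pi v also converges to v_pi
v = np.zeros(3)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(v_closed, v)  # the two solutions agree
```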

Chapter 3: Bellman Optimality Equation

A special Bellman equation

Two concepts: optimal policy \(\pi^*\) & optimal state value \(v^*\)

One tool: Bellman optimality equation

\(v = \max_\pi (r_\pi+\gamma P_\pi v) = f(v)\)

1) Fixed-point theorem
2) Fundamental problems
3) An algorithm for solving the equation (sketched below)
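A minimal sketch of the fixed-point iteration \(v \leftarrow f(v) = \max_\pi(r_\pi + \gamma P_\pi v)\), which the contraction property guarantees to converge; the 2-state, 2-action model below is an illustrative assumption.

```python
import numpy as np

gamma = 0.9
# Illustrative model: r[s, a] is the reward, P[s, a, s'] the transition probability.
r = np.array([[0.0, 1.0],
              [0.5, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [1.0, 0.0]]])

v = np.zeros(2)
for _ in range(500):
    # q[s, a] = r[s, a] + gamma * sum_{s'} P[s, a, s'] * v[s']
    q = r + gamma * P @ v
    v_new = q.max(axis=1)           # maximize over actions
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

optimal_policy = q.argmax(axis=1)   # greedy policy from the converged values
```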

Chapter 4: Value Iteration & Policy Iteration

First algorithms for optimal policies

Three algorithms:

1) Value iteration (VI)
2) Policy iteration (PI)
3) Truncated policy iteration

Require the environment model (a policy-iteration sketch follows)
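A minimal policy-iteration sketch, reusing the illustrative 2-state, 2-action model from the sketch above (the model is an assumption, not from the text); each round alternates exact policy evaluation with greedy policy improvement.

```python
import numpy as np

gamma = 0.9
r = np.array([[0.0, 1.0],
              [0.5, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
n_states, n_actions = r.shape

policy = np.zeros(n_states, dtype=int)   # arbitrary initial deterministic policy
while True:
    # Policy evaluation: solve v = r_pi + gamma * P_pi v for the current policy
    r_pi = r[np.arange(n_states), policy]
    P_pi = P[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    # Policy improvement: greedy with respect to the action values
    q = r + gamma * P @ v
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
```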

Chapter 5: Monte Carlo Learning

Mean estimation with sampling data

\(\mathbb{E}[X] \approx \bar{x} =\frac{1}{n}\sum_{i=1}^n x_i\)

First model-free RL algorithms

1) MC Basic
2) MC Exploring Starts
3) MC \(\epsilon\)-greedy (sketched below)
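A minimal MC \(\epsilon\)-greedy sketch on an assumed two-state environment (environment and hyperparameters are illustrative): episodes are sampled with the current \(\epsilon\)-greedy policy, and action values are estimated by averaging the sampled returns.

```python
import random
from collections import defaultdict

gamma, epsilon, n_episodes, ep_len = 0.9, 0.1, 2000, 20
actions = [0, 1]

def step(s, a):
    """Illustrative 2-state environment: action 1 in state 0 pays off."""
    if s == 0 and a == 1:
        return 1, 1.0
    return 0, 0.0

q = defaultdict(float)        # action-value estimates
counts = defaultdict(int)     # visit counts for incremental averaging

def eps_greedy(s):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(s, a)])

for _ in range(n_episodes):
    s = random.choice([0, 1])
    episode = []
    for _ in range(ep_len):                   # sample one episode
        a = eps_greedy(s)
        s_next, r = step(s, a)
        episode.append((s, a, r))
        s = s_next
    g = 0.0
    for s, a, r in reversed(episode):         # every-visit MC: average the returns
        g = r + gamma * g
        counts[(s, a)] += 1
        q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]
```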

Chapter 6: Stochastic Approximation

Gap: from non-incremental to incremental

Mean estimation

Algorithms:

1) Robbins-Monro (RM) algorithm (a mean-estimation sketch follows this list)
2) Stochastic gradient descent (SGD)
3) SGD, BGD, MBGD
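A minimal sketch of the incremental (RM-style) update for mean estimation, \(w_{k+1} = w_k - \alpha_k (w_k - x_k)\); the sampling distribution and step sizes are illustrative assumptions.

```python
import random

random.seed(0)
w = 0.0                                 # incremental estimate of E[X]
for k in range(1, 10001):
    x_k = random.gauss(5.0, 2.0)        # illustrative samples with mean 5
    alpha_k = 1.0 / k                   # RM step sizes: sum a_k = inf, sum a_k^2 < inf
    w = w - alpha_k * (w - x_k)         # incremental update; equals the sample mean here
print(w)                                # close to 5.0, without storing past samples
```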

Chapter 7: Temporal-Difference Learning

1) TD learning of state values

2) Sarsa: TD learning of action values

3) Q-learning: TD learning of optimal action values (sketched below)

on-policy & off-policy

4) Unified point of view
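A minimal tabular Q-learning sketch on the same assumed two-state environment used in the Chapter 5 sketch; the TD target \(r + \gamma \max_a q(s',a)\) estimates the optimal action value, which is why the algorithm is off-policy.

```python
import random

gamma, alpha, epsilon, n_steps = 0.9, 0.1, 0.1, 50000
actions = [0, 1]
q = {(s, a): 0.0 for s in (0, 1) for a in actions}

def step(s, a):
    """Illustrative 2-state environment: action 1 in state 0 pays off."""
    if s == 0 and a == 1:
        return 1, 1.0
    return 0, 0.0

s = 0
for _ in range(n_steps):
    # Behavior policy: epsilon-greedy; Q-learning still learns optimal action values
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda a: q[(s, a)])
    s_next, r = step(s, a)
    td_target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    s = s_next
```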

Chapter 8: Value Function Approximation

Gap: from tabular representation to function representation

Algorithms:

1) State value estimation with value function approximation (VFA), sketched after this list:

\(\min_w J(w) = \mathbb{E}[(v_\pi(S)-\hat{v}(S,w))^2]\)

2) Sarsa/Q-learning with VFA

3) Deep Q-learning
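A minimal sketch of state value estimation with a linear approximation \(\hat{v}(s,w)=\phi(s)^\top w\), updated by semi-gradient TD(0); the feature map, environment, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, n_steps = 0.9, 0.01, 50000
n_states = 5

def phi(s):
    """Illustrative feature vector: [1, normalized state index]."""
    return np.array([1.0, s / (n_states - 1)])

def step(s):
    """Illustrative random-walk environment under a fixed policy."""
    s_next = min(max(s + rng.choice([-1, 1]), 0), n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

w = np.zeros(2)               # weights of v_hat(s, w) = phi(s) @ w
s = 0
for _ in range(n_steps):
    s_next, r = step(s)
    td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
    w += alpha * td_error * phi(s)      # semi-gradient TD(0) update
    s = s_next
```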

Chapter 9: Policy Gradient Methods

Gap: from value-based to policy-based

1) Metrics to define optimal policies

\(J(\theta)=\bar{v}_\pi\) or \(\bar{r}_\pi\)

2) Policy gradient:

\(\nabla_\theta J(\theta)=\mathbb{E}[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)]\)

3) Gradient-ascent algorithm (REINFORCE)

\(\theta_{t+1}=\theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t)\)
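A minimal REINFORCE sketch with a softmax (tabular) policy parameterization; \(q_t(s_t,a_t)\) is replaced by the sampled discounted return, and the environment and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, n_episodes, ep_len = 0.9, 0.05, 2000, 20
n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))      # softmax policy parameters

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def step(s, a):
    """Illustrative 2-state environment: action 1 in state 0 pays off."""
    return (1, 1.0) if (s == 0 and a == 1) else (0, 0.0)

for _ in range(n_episodes):
    # Sample one episode with the current policy
    s, episode = 0, []
    for _ in range(ep_len):
        a = rng.choice(n_actions, p=pi(s))
        s_next, r = step(s, a)
        episode.append((s, a, r))
        s = s_next
    # Gradient-ascent updates using the sampled return g_t in place of q_t
    g = 0.0
    for s, a, r in reversed(episode):
        g = r + gamma * g
        grad_log = -pi(s)                    # gradient of ln pi(a|s) w.r.t. theta[s]
        grad_log[a] += 1.0
        theta[s] += alpha * grad_log * g
```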

Chapter 10: Actor-Critic Methods

Gap: policy-based + value-based

Algorithms:

1) The simplest actor-critic (QAC)

2) Advantage actor-critic (A2C), sketched below

3) Off-policy actor-critic

Importance sampling

4) Deterministic actor-critic (DPG)
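A minimal A2C sketch combining a softmax actor with a tabular TD critic (environment and hyperparameters are illustrative assumptions); the TD error serves as the advantage that weights the actor update.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha_actor, alpha_critic, n_steps = 0.9, 0.05, 0.1, 50000
n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax policy parameters
v = np.zeros(n_states)                    # critic: state-value estimates

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def step(s, a):
    """Illustrative 2-state environment: action 1 in state 0 pays off."""
    return (1, 1.0) if (s == 0 and a == 1) else (0, 0.0)

s = 0
for _ in range(n_steps):
    a = rng.choice(n_actions, p=pi(s))
    s_next, r = step(s, a)
    # Critic: the TD error doubles as the advantage estimate
    delta = r + gamma * v[s_next] - v[s]
    v[s] += alpha_critic * delta
    # Actor: policy-gradient step weighted by the advantage
    grad_log = -pi(s)
    grad_log[a] += 1.0
    theta[s] += alpha_actor * grad_log * delta
    s = s_next
```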