Resources:

Lecture notes by S. Mannor, N. Shimkin, A. Tamar

Introduction to Multi Armed Bandits by A. Slivkins

Week | Lecture | Tutorial | Reference |
---|---|---|---|

1 | Overview and examples Dynamic Programming (DP): basic ideas |
DP - knapsack, LCS | Ch. 1, 3.1 in Lecture Notes |

2 | Deterministic Controlled Processes Finite horizon DP (for deterministic process) |
Fixed horizon DP, TSP | Ch. 2 in Lecture Notes |

3 | Shortest-path problems Graph search algorithms: Bellman-Ford, Dijkstra, A* Deterministic continuous control: LQR, iLQR |
Shortest path on graph Dijkstra proof |
Ch. 3.2, 3.3 in Lecture Notes |

4 | Markov Decision Processes (MDP) Finite horizon DP (for MDPs) |
LQR | Ch. 4 in Lecture Notes |

5 | Finite horizon DP (cont'd) Discounted MDPs |
Markov chains Viterbi algorithm |
Ch. 5 in Lecture Notes |

6 | Discounted MDPs (cont'd) Value Iteration |
VI, Bellman operator | Ch. 5 in Lecture Notes |

7 | Policy Iteration Linear Programming formulations |
Robust DP PI vs. VI |
Ch. 5 in Lecture Notes |

8 | Model free learning (small state spaces): TD(0), Q-learning, SARSA | TD(0) TD(lambda) |
Ch. 6 in Lecture Notes |

9 | MDPs with large state spaces - value-based approximations projected Bellman equation Policy evaluation algorithms: regression, TD(0), LSTD Approximate policy iteration: LSPI, SARSA Approximate value iteration: fitted-Q, Q-learning |
LSTD LSPI |
Ch. 9.1, 9.3 in Lecture Notes |

10 | Stochastic approximation Convergence of RL algorithms |
Stochastic approximation | Ch. 7,8 in Lecture Notes |

11 | MDPs with large state spaces - policy-based approximations Policy search Policy gradients Actor-critic |
Policy gradients | Ch. 10 in Lecture Notes |

12 | Multi-armed bandits in the stationary arm model Regret, Hoeffding inequality, uniform exploration, adaptive exploration |
UCB | Ch. 1 in Slivkins' book |

13 | MDPs with large state spaces - Monte Carlo Tree Search (UCT) Deep RL (overview) |
TD(0)+FA convergence | Ch. 9.2 in Lecture Notes |