Teaching

Spring 2021

046203 Planning and Reinforcement Learning

Spring 2019 offering

Resources:

Lecture notes by S. Mannor, N. Shimkin, A. Tamar

Introduction to Multi Armed Bandits by A. Slivkins

Week	Lecture	Tutorial	Reference
1	Overview and examples Dynamic Programming (DP): basic ideas	DP - knapsack, LCS	Ch. 1, 3.1 in Lecture Notes
2	Deterministic Controlled Processes Finite horizon DP (for deterministic process)	Fixed horizon DP, TSP	Ch. 2 in Lecture Notes
3	Shortest-path problems Graph search algorithms: Bellman-Ford, Dijkstra, A* Deterministic continuous control: LQR, iLQR	Shortest path on graph Dijkstra proof	Ch. 3.2, 3.3 in Lecture Notes
4	Markov Decision Processes (MDP) Finite horizon DP (for MDPs)	LQR	Ch. 4 in Lecture Notes
5	Finite horizon DP (cont'd) Discounted MDPs	Markov chains Viterbi algorithm	Ch. 5 in Lecture Notes
6	Discounted MDPs (cont'd) Value Iteration	VI, Bellman operator	Ch. 5 in Lecture Notes
7	Policy Iteration Linear Programming formulations	Robust DP PI vs. VI	Ch. 5 in Lecture Notes
8	Model free learning (small state spaces): TD(0), Q-learning, SARSA	TD(0) TD(lambda)	Ch. 6 in Lecture Notes
9	MDPs with large state spaces - value-based approximations projected Bellman equation Policy evaluation algorithms: regression, TD(0), LSTD Approximate policy iteration: LSPI, SARSA Approximate value iteration: fitted-Q, Q-learning	LSTD LSPI	Ch. 9.1, 9.3 in Lecture Notes
10	Stochastic approximation Convergence of RL algorithms	Stochastic approximation	Ch. 7,8 in Lecture Notes
11	MDPs with large state spaces - policy-based approximations Policy search Policy gradients Actor-critic	Policy gradients	Ch. 10 in Lecture Notes
12	Multi-armed bandits in the stationary arm model Regret, Hoeffding inequality, uniform exploration, adaptive exploration	UCB	Ch. 1 in Slivkins' book
13	MDPs with large state spaces - Monte Carlo Tree Search (UCT) Deep RL (overview)	TD(0)+FA convergence	Ch. 9.2 in Lecture Notes