Spring 2021

046203 Planning and Reinforcement Learning

Previous offering: Spring 2019


Course references:
  Lecture notes by S. Mannor, N. Shimkin, and A. Tamar
  Introduction to Multi-Armed Bandits by A. Slivkins

Schedule (one entry per week: Lecture / Tutorial / Reference)
Week 1
  Lecture:   Overview and examples; dynamic programming (DP): basic ideas
  Tutorial:  DP - knapsack, LCS
  Reference: Ch. 1, 3.1 in Lecture Notes
Week 2
  Lecture:   Deterministic controlled processes; finite-horizon DP (for deterministic processes)
  Tutorial:  Fixed-horizon DP, TSP
  Reference: Ch. 2 in Lecture Notes
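The Week 2 tutorial's knapsack exercise is a classic fixed-horizon DP; a minimal 0/1 knapsack sketch in Python (illustrative only — the instance data below is made up, not from the course material):

```python
def knapsack(values, weights, capacity):
    """0/1 knapsack via DP over (item, remaining capacity)."""
    # best[c] = best value achievable with capacity c using items seen so far
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # iterate capacities downward so each item is used at most once
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # 220
```

Iterating capacities downward is what distinguishes the 0/1 variant from the unbounded one: it prevents the same item from being counted twice within one pass.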
Week 3
  Lecture:   Shortest-path problems; graph search algorithms: Bellman-Ford, Dijkstra, A*; deterministic continuous control: LQR, iLQR
  Tutorial:  Shortest path on a graph; Dijkstra proof
  Reference: Ch. 3.2, 3.3 in Lecture Notes
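As a companion to the graph-search material, a short sketch of Dijkstra's algorithm with a binary heap (the example graph and its weights are made up for illustration):

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source; graph maps node -> [(neighbor, weight)]."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

g = {"s": [("a", 1), ("b", 4)], "a": [("b", 2), ("t", 6)], "b": [("t", 3)]}
print(dijkstra(g, "s"))  # {'s': 0, 'a': 1, 'b': 3, 't': 6}
```

The correctness argument covered in the tutorial rests on non-negative edge weights: when a node is popped with an up-to-date distance, that distance is final.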
Week 4
  Lecture:   Markov Decision Processes (MDPs); finite-horizon DP (for MDPs)
  Tutorial:  LQR
  Reference: Ch. 4 in Lecture Notes
Week 5
  Lecture:   Finite-horizon DP (cont'd); discounted MDPs
  Tutorial:  Markov chains; Viterbi algorithm
  Reference: Ch. 5 in Lecture Notes
Week 6
  Lecture:   Discounted MDPs (cont'd); value iteration
  Tutorial:  VI, Bellman operator
  Reference: Ch. 5 in Lecture Notes
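Value iteration repeatedly applies the Bellman optimality operator, which is a gamma-contraction in the sup-norm. A small numerical sketch on a made-up two-state, two-action discounted MDP (the transition matrices and rewards are illustrative, not from the notes):

```python
import numpy as np

# P[a][s, s'] = transition probability under action a; R[a][s] = expected reward.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # action 0
     np.array([[0.5, 0.5], [0.0, 1.0]])]   # action 1
R = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V <- max_a (R_a + gamma * P_a V) until the sup-norm change is below tol."""
    V = np.zeros(len(R[0]))
    while True:
        Q = np.stack([R[a] + gamma * P[a] @ V for a in range(len(P))])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)  # value estimate and greedy policy
        V = V_new

V_star, policy = value_iteration(P, R, gamma)
```

By the contraction property, the iterates converge geometrically at rate gamma to the unique fixed point V*, and the greedy policy with respect to V* is optimal.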
Week 7
  Lecture:   Policy iteration; linear programming formulations; robust DP
  Tutorial:  PI vs. VI
  Reference: Ch. 5 in Lecture Notes
Week 8
  Lecture:   Model-free learning (small state spaces): TD(0), Q-learning, SARSA
  Tutorial:  TD(0)
  Reference: Ch. 6 in Lecture Notes
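Tabular Q-learning from Week 8 fits in a few lines; the chain environment below is a made-up toy example (not from the course), with a reward of 1 for reaching the rightmost state:

```python
import random

def q_learning(step, n_states, n_actions, episodes=2000, gamma=0.95,
               alpha=0.1, eps=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.
    `step(s, a)` returns (next_state, reward, done)."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if random.random() < eps:
                a = random.randrange(n_actions)            # explore
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])  # exploit
            s2, r, done = step(s, a)
            # off-policy target: bootstrap from the greedy value at s2
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

def chain_step(s, a):
    """Toy 4-state chain: action 1 moves right, action 0 moves left; goal is state 3."""
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

random.seed(0)  # for reproducibility of this sketch
Q = q_learning(chain_step, n_states=4, n_actions=2)
```

After training, the greedy policy moves right in every non-terminal state. SARSA differs only in the target: it bootstraps from the action actually taken next, making it on-policy.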
Week 9
  Lecture:   MDPs with large state spaces - value-based approximations; projected Bellman equation; policy evaluation algorithms: regression, TD(0), LSTD; approximate policy iteration: LSPI, SARSA; approximate value iteration: fitted-Q, Q-learning
  Reference: Ch. 9.1, 9.3 in Lecture Notes
Week 10
  Lecture:   Stochastic approximation; convergence of RL algorithms
  Tutorial:  Stochastic approximation
  Reference: Ch. 7, 8 in Lecture Notes
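TD(0) and Q-learning are both instances of the stochastic approximation iteration x_{k+1} = x_k + a_k (Y_k - x_k). A minimal sketch with step sizes a_k = 1/k, in which case the iteration is exactly the running sample average (the target value 3.0 and noise model are arbitrary choices for illustration):

```python
import random

random.seed(1)

# Robbins-Monro iteration x_{k+1} = x_k + a_k * (Y_k - x_k), a_k = 1/k.
# The step sizes satisfy sum a_k = inf, sum a_k^2 < inf, so x converges to E[Y].
x = 0.0
for k in range(1, 100001):
    y = 3.0 + random.gauss(0.0, 1.0)   # noisy observation of the target 3.0
    x += (y - x) / k

print(round(x, 2))
```

The same step-size conditions (square-summable but not summable) are what the convergence proofs for TD(0) and Q-learning require of the learning rates.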
Week 11
  Lecture:   MDPs with large state spaces - policy-based approximations; policy search; policy gradients
  Tutorial:  Policy gradients
  Reference: Ch. 10 in Lecture Notes
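A minimal REINFORCE-style (score-function) policy-gradient sketch on a made-up two-armed bandit with a softmax policy; the arm means, learning rate, and iteration count are illustrative assumptions, not course material:

```python
import math
import random

random.seed(0)

theta = [0.0, 0.0]          # softmax policy parameters: pi(a) ~ exp(theta[a])
means = [0.2, 0.8]          # assumed expected reward of each arm
lr = 0.1

for _ in range(5000):
    z = [math.exp(t) for t in theta]
    pi = [p / sum(z) for p in z]
    a = 0 if random.random() < pi[0] else 1      # sample an arm from the policy
    r = random.gauss(means[a], 0.1)              # noisy reward
    # score-function gradient: grad_theta log pi(a) = one_hot(a) - pi
    for i in range(2):
        theta[i] += lr * ((1.0 if i == a else 0.0) - pi[i]) * r

z = [math.exp(t) for t in theta]
pi = [p / sum(z) for p in z]                     # final policy favors the better arm
```

This is the stateless special case of the policy gradient theorem; in an MDP the reward factor is replaced by a return (or advantage) estimate, and a baseline is typically subtracted to reduce variance.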
Week 12
  Lecture:   Multi-armed bandits in the stationary arm model; regret, Hoeffding inequality, uniform exploration, adaptive exploration
  Tutorial:  UCB
  Reference: Ch. 1 in Slivkins' book
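The UCB index rule — empirical mean plus a sqrt(2 log t / n) confidence bonus derived from Hoeffding's inequality — can be sketched as follows (the Bernoulli arm means and horizon are made up):

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1: play the arm maximizing empirical mean + sqrt(2 ln t / n)."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1          # pull each arm once to initialize its estimate
        else:
            a = max(range(n_arms),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        sums[a] += pull(a)
        counts[a] += 1
    return counts

random.seed(0)
means = [0.3, 0.5, 0.7]        # assumed Bernoulli arm means
counts = ucb1(lambda a: float(random.random() < means[a]), 3, 5000)
```

Adaptive exploration shows up in the pull counts: suboptimal arms are sampled only O(log T / gap^2) times, giving logarithmic regret, versus the polynomial regret of uniform exploration.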
Week 13
  Lecture:   MDPs with large state spaces - Monte Carlo tree search (UCT); deep RL (overview)
  Tutorial:  TD(0) + FA convergence
  Reference: Ch. 9.2 in Lecture Notes