Teaching

Spring 2019

046194 Planning and Learning in Dynamical Systems

(a.k.a. Reinforcement Learning)

Resources:

Lecture notes by S. Mannor, N. Shimkin, A. Tamar

Introduction to Multi-Armed Bandits by A. Slivkins

Schedule:

Week 1
  Lecture: Overview and examples; Dynamic Programming (DP): basic ideas
  Tutorial: DP - knapsack, LCS
  Reference: Ch. 1, 3.1 in Lecture Notes

Week 2
  Lecture: Deterministic Controlled Processes; finite horizon DP (for deterministic processes)
  Tutorial: Fixed horizon DP, TSP
  Reference: Ch. 2 in Lecture Notes

Week 3
  Lecture: Shortest-path problems; graph search algorithms: Bellman-Ford, Dijkstra, A*; deterministic continuous control: LQR, iLQR
  Tutorial: Shortest path on a graph; Dijkstra proof
  Reference: Ch. 3.2, 3.3 in Lecture Notes

Week 4
  Lecture: Markov Decision Processes (MDPs); finite horizon DP (for MDPs)
  Tutorial: LQR
  Reference: Ch. 4 in Lecture Notes

Week 5
  Lecture: Finite horizon DP (cont'd); discounted MDPs
  Tutorial: Markov chains; Viterbi algorithm
  Reference: Ch. 5 in Lecture Notes

Week 6
  Lecture: Discounted MDPs (cont'd); Value Iteration
  Tutorial: VI, Bellman operator
  Reference: Ch. 5 in Lecture Notes

Week 7
  Lecture: Policy Iteration; Linear Programming formulations; robust DP
  Tutorial: PI vs. VI
  Reference: Ch. 5 in Lecture Notes

Week 8
  Lecture: Model-free learning (small state spaces): TD(0), Q-learning, SARSA
  Tutorial: TD(0), TD(lambda)
  Reference: Ch. 6 in Lecture Notes

Week 9
  Lecture: MDPs with large state spaces - value-based approximations; projected Bellman equation; policy evaluation algorithms: regression, TD(0), LSTD; approximate policy iteration: LSPI, SARSA; approximate value iteration: fitted-Q, Q-learning
  Tutorial: LSTD; LSPI
  Reference: Ch. 9.1, 9.3 in Lecture Notes

Week 10
  Lecture: Stochastic approximation; convergence of RL algorithms
  Tutorial: Stochastic approximation
  Reference: Ch. 7, 8 in Lecture Notes

Week 11
  Lecture: MDPs with large state spaces - policy-based approximations; policy search; policy gradients; actor-critic
  Tutorial: Policy gradients
  Reference: Ch. 10 in Lecture Notes

Week 12
  Lecture: Multi-armed bandits in the stationary arm model; regret, Hoeffding inequality, uniform exploration, adaptive exploration
  Tutorial: UCB
  Reference: Ch. 1 in Slivkins' book

Week 13
  Lecture: MDPs with large state spaces - Monte Carlo Tree Search (UCT); Deep RL (overview)
  Tutorial: TD(0)+FA convergence
  Reference: Ch. 9.2 in Lecture Notes
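To give a flavor of the core material, here is a minimal sketch of value iteration (the Bellman operator topic of weeks 6-7) on a toy two-state, two-action discounted MDP. The transition matrices, rewards, and discount factor below are invented purely for illustration, not taken from the course notes.

```python
import numpy as np

# Toy MDP (illustrative only): P[a, s, s'] = transition probability,
# R[s, a] = expected one-step reward, gamma = discount factor.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions under action 1
])
R = np.array([[1.0, 0.0],       # rewards in state 0 for actions 0, 1
              [0.0, 2.0]])      # rewards in state 1 for actions 0, 1
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality operator
#   Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s'),  V(s) = max_a Q(s, a)
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * np.einsum('asp,p->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop once the update is a contraction fixed point
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy policy w.r.t. the converged Q
```

Since the operator is a gamma-contraction, the loop converges geometrically; here both states prefer action 1, which steers toward the absorbing state with reward 2.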
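The model-free methods of week 8 can likewise be sketched in a few lines. Below is a hedged sketch of tabular Q-learning on a toy five-state deterministic chain; the chain MDP, its reward, and all hyperparameters are made up for illustration.

```python
import numpy as np

# Toy chain: states 0..4, actions 0 = left / 1 = right, reward 1 only on
# reaching the terminal state 4. All constants here are illustrative choices.
n_states = 5
gamma, alpha, eps = 0.95, 0.1, 0.1
rng = np.random.default_rng(1)
Q = np.zeros((n_states, 2))

def step(s, a):
    """Deterministic chain dynamics with walls at both ends."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1), s2 == n_states - 1

for episode in range(500):
    s, done, steps = 0, False, 0
    while not done and steps < 200:          # step cap keeps early episodes short
        # epsilon-greedy behavior policy, with random tie-breaking while Q is flat
        if rng.random() < eps or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(2))
        else:
            a = int(Q[s].argmax())
        s2, r, done = step(s, a)
        # off-policy TD update toward r + gamma * max_a' Q(s', a')
        target = r + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s, steps = s2, steps + 1
```

After training, the greedy policy moves right in every non-terminal state; swapping the target for `Q[s2, a2]` with the behavior action `a2` would turn this into the on-policy SARSA update.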
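Finally, the UCB tutorial of week 12 can be illustrated with the classic UCB1 index for the stationary-arm bandit model. The Bernoulli arm means and horizon below are invented for illustration; the exploration bonus is the standard Hoeffding-based term sqrt(2 ln t / n_a).

```python
import numpy as np

# Stationary Bernoulli bandit (illustrative arm means; the learner sees only samples).
rng = np.random.default_rng(0)
means = np.array([0.2, 0.5, 0.8])
K, T = len(means), 5000

counts = np.zeros(K)   # number of pulls per arm
sums = np.zeros(K)     # cumulative reward per arm

# Initialization: pull each arm once so every UCB index is defined.
for a in range(K):
    counts[a] += 1
    sums[a] += rng.random() < means[a]

for t in range(K, T):
    # UCB1 index: empirical mean + Hoeffding exploration bonus
    ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
    a = int(np.argmax(ucb))
    counts[a] += 1
    sums[a] += rng.random() < means[a]
```

The adaptive bonus shrinks as an arm is sampled, so suboptimal arms receive only logarithmically many pulls, the mechanism behind the logarithmic regret bound covered in Slivkins' Ch. 1.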