Reinforcement Learning: An Introduction (2nd Edition)
Personal study notes
These are personal study notes from Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (2nd Edition, MIT Press, 2018).
The notes cover the core ideas, equations, and algorithms from each chapter, transcribed and formatted for review. Mathematical notation follows the textbook closely. Backup diagrams are shown as figures.
Coverage
| Chapter | Topic | Status |
|---|---|---|
| 1 | Introduction to RL | pending |
| 2 | Multi-Armed Bandits | ✓ |
| 3 | Finite Markov Decision Processes | ✓ |
| 4 | Dynamic Programming | ✓ |
| 5 | Monte Carlo Methods | ✓ |
| 6 | Temporal-Difference Learning | ✓ |
| 7 | n-step Bootstrapping | ✓ |
| 8 | Planning & Learning with Tabular Methods | ✓ |
| 9 | On-policy Prediction with Approximation | ✓ |
| 10 | On-policy Control with Approximation | ✓ |
| 11 | Off-policy Methods with Approximation | ✓ |
| 12 | Eligibility Traces | ✓ |
| 13 | Policy Gradient Methods | ✓ |
How to Read These Notes
These notes are meant to accompany the textbook, not replace it. They are most useful for:
- Quick review of key equations and algorithms before exams or projects.
- Checking notation and definitions across chapters.
- Following the logical progression of ideas from bandits through full RL.
Key Notation
| Symbol | Meaning |
|---|---|
| s, s' | State, next state |
| a, a' | Action, next action |
| r, R | Reward |
| t | Discrete time step |
| T | Terminal time step |
| \pi | Policy |
| \pi(a \mid s) | Probability of taking action a in state s under policy \pi |
| \pi(a \vert s, \boldsymbol{\theta}) | Parametrized policy: probability of action a in state s under \boldsymbol{\theta} |
| \pi_{*} | Optimal policy |
| b(a \mid s) | Behavior policy |
| v_{\pi}(s) | State-value function under policy \pi |
| v_{\pi_{\boldsymbol{\theta}}}(s) | True value function for policy \pi_{\boldsymbol{\theta}} |
| q_{\pi}(s,a) | Action-value function under policy \pi |
| v_{*}(s) | Optimal state-value function |
| q_{*}(s,a) | Optimal action-value function |
| V(s) | Estimated state-value |
| Q(s,a) | Estimated action-value |
| \hat{v}(s, \mathbf{w}) | Approximate state-value function parameterized by \mathbf{w} |
| \hat{q}(s, a, \mathbf{w}) | Approximate action-value function parameterized by \mathbf{w} |
| G_t | Return (cumulative discounted reward) at time t |
| G_{t:h} | Truncated return from t to horizon h |
| G_{t:t+n} | n-step return from t |
| G_t^\lambda | \lambda-return at time t |
| G_t^{\lambda s} | State-based \lambda-return (bootstraps from state values) |
| G_t^{\lambda a} | Action-based \lambda-return (bootstraps from action values) |
| G_{t:h}^\lambda | Truncated \lambda-return from t to horizon h |
| J(\boldsymbol{\theta}) | Scalar performance measure (episodic: value of start state; continuing: average reward) |
| \nabla J(\boldsymbol{\theta}) | Performance gradient w.r.t. \boldsymbol{\theta} |
| \widehat{\nabla J(\boldsymbol{\theta}_t)} | Stochastic estimate of the performance gradient |
| h(s, a, \boldsymbol{\theta}) | Numerical action preference for (s,a) under \boldsymbol{\theta} |
| \lambda | Trace-decay parameter, \lambda \in [0,1] |
| \lambda_t | Variable trace-decay / termination function, \lambda_t \doteq \lambda(S_t, A_t) |
| \lambda^{\boldsymbol{\theta}} | Trace-decay rate for actor eligibility traces |
| \lambda^{\mathbf{w}} | Trace-decay rate for critic eligibility traces |
| \gamma | Discount factor, \gamma \in [0,1] |
| \gamma_t | Variable discount factor, \gamma_t \doteq \gamma(S_t) |
| \alpha | Step size (learning rate) |
| \alpha^{\boldsymbol{\theta}} | Step size for policy parameter updates |
| \alpha^{\mathbf{w}} | Step size for state-value weight updates |
| \beta | Secondary step-size parameter (GTD2/TDC) |
| \varepsilon | Exploration parameter (\varepsilon-greedy) |
| n | Number of steps in n-step methods |
| h | Horizon (truncation point in TTD(\lambda)) |
| d | Dimension of weight vector \mathbf{w} |
| d' | Dimension of policy parameter vector \boldsymbol{\theta} |
| \boldsymbol{\theta} | Policy parameter vector, \boldsymbol{\theta} \in \mathbb{R}^{d'} |
| \boldsymbol{\theta}_\mu,\ \boldsymbol{\theta}_\sigma | Policy parameter subvectors for mean and std dev (continuous actions) |
| \mathbf{w} | Weight vector for function approximation, \mathbf{w} \in \mathbb{R}^d |
| \mathbf{w}_t^h | Weight vector at time t in sequence up to horizon h (online \lambda-return) |
| \mathbf{w}_{\text{TD}} | TD fixed point, \mathbf{w}_{\text{TD}} = A^{-1}\mathbf{b} |
| \mathbf{v} | Secondary weight vector (Gradient-TD cascade) |
| \nabla \hat{v}(s, \mathbf{w}) | Gradient of \hat{v} w.r.t. \mathbf{w} |
| \mathbf{x}(s) | Feature vector representing state s |
| \mathbf{x}(s,a) | Feature vector representing state-action pair (s,a) |
| x_i(s) | i-th component of feature vector, x_i : S \to \mathbb{R} |
| \bar{\mathbf{x}}_t | Average feature vector for S_t under the target policy (GQ(\lambda)) |
| \mathbf{z}_t | Eligibility trace vector at time t, \mathbf{z}_t \in \mathbb{R}^d |
| \mathbf{z}_t^b | Accumulating eligibility trace for behavior policy (HTD(\lambda)) |
| \mathbf{z}^{\boldsymbol{\theta}} | Eligibility trace vector for the actor (d'-component) |
| \mathbf{z}^{\mathbf{w}} | Eligibility trace vector for the critic (d-component) |
| \mathbf{F}_t | Forgetting/fading matrix, \mathbf{F}_t \doteq \mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T |
| \mathbf{a}_t | Auxiliary memory vector in dutch trace MC derivation |
| \delta_t | TD error at time t |
| \delta_t^s | State-based TD error (off-policy traces) |
| \delta_t^a | Action-based / expected TD error (off-policy traces) |
| \bar{\delta}_\mathbf{w}(s) | Bellman error at state s |
| U_t | Update target at time t |
| \rho_t | Per-step importance sampling ratio, \rho_t = \frac{\pi(A_t \vert S_t)}{b(A_t \vert S_t)} |
| \rho_{t:h} | Importance sampling ratio from t to h |
| \mu(s) | On-policy state distribution |
| \mu(s, \boldsymbol{\theta}) | Mean of action distribution for state s (continuous action policy) |
| \mu_\pi(s) | Steady-state distribution under policy \pi |
| \sigma(s, \boldsymbol{\theta}) | Standard deviation of action distribution for state s (continuous action policy) |
| r(\pi) | Average reward under policy \pi |
| \bar{R}_t | Estimate of average reward r(\pi) at time t |
| I | Discount accumulator in actor-critic, initialized to 1, updated I \leftarrow \gamma I |
| \mathcal{I}_t | Interest at time t, \mathcal{I}_t \geq 0 |
| F_t | Followon trace at time t (Emphatic-TD(\lambda)), F_t \geq 0 |
| M_t | Emphasis at time t (Emphatic-TD(\lambda)), M_t \geq 0 |
| \bar{V}_t(s) | Expected approximate value of state s under target policy \pi |
| \overline{\text{VE}}(\mathbf{w}) | Mean Squared Value Error |
| \overline{\text{BE}}(\mathbf{w}) | Mean Squared Bellman Error |
| \overline{\text{PBE}}(\mathbf{w}) | Mean Squared Projected Bellman Error |
| \overline{\text{TDE}}(\mathbf{w}) | Mean Squared TD Error |
| \overline{\text{RE}}(\mathbf{w}) | Mean Squared Return Error |
| \Pi | Projection operator onto the representable subspace |
| B_\pi | Bellman operator, B_\pi : \mathbb{R}^{\vert S \vert} \to \mathbb{R}^{\vert S \vert} |
| \mathbf{B} | Diagonal matrix with \mu(s) on the diagonal, \vert S \vert \times \vert S \vert |
| \mathbf{X} | Feature matrix, \vert S \vert \times d, rows are \mathbf{x}(s)^T |
| A | Matrix \mathbb{E}[\mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^T] \in \mathbb{R}^{d \times d} (linear TD) |
| \mathbf{b} | Vector \mathbb{E}[R_{t+1} \mathbf{x}_t] \in \mathbb{R}^d (linear TD) |
| \eta(s) | Expected number of time steps spent in state s per episode |
| h(s) | Probability that an episode begins in state s |
| \tau | Elapsed time since a state-action pair was last visited (Dyna-Q+) |
| k | Small bonus constant in Dyna-Q+ |
| b | Branching factor |
| N(s,a) | Visit count for state-action pair (s,a) (MCTS) |
| W(s,a) | Total reward accumulated through (s,a) (MCTS) |
| c | Exploration constant in UCT, typically \sqrt{2} |
| p(s', r \mid s, a) | Transition probability |
| \hat{p}(s', r \mid s, a) | Estimated/model transition probability |
| \mathbb{E}_{\pi}[\cdot] | Expectation under policy \pi |
Reference
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Available at: http://incompleteideas.net/book/the-book-2nd.html