Reinforcement Learning: An Introduction (2nd Edition)

Personal study notes

Author: Chizoba Obasi
Published: March 2026

These are personal study notes from Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (2nd Edition, MIT Press, 2018).

The notes cover the core ideas, equations, and algorithms from each chapter, transcribed and formatted for review. Mathematical notation follows the textbook closely, and backup diagrams are reproduced as figures.

Coverage

Chapter  Topic                                      Status
1        Introduction to RL                         pending
2        Multi-Armed Bandits
3        Finite Markov Decision Processes
4        Dynamic Programming
5        Monte Carlo Methods
6        Temporal-Difference Learning
7        n-step Bootstrapping
8        Planning & Learning with Tabular Methods
9        On-policy Prediction with Approximation
10       On-policy Control with Approximation
11       Off-policy Methods with Approximation
12       Eligibility Traces
13       Policy Gradient Methods                    pending

How to Read These Notes

These notes are meant to accompany the textbook, not replace it. They are most useful for:

  • Quick review of key equations and algorithms before exams or projects.
  • Checking notation and definitions across chapters.
  • Following the logical progression of ideas from bandits through full RL.

Key Notation

Symbol                             Meaning
s, s'                              State, next state
a, a'                              Action, next action
r, R                               Reward
t                                  Discrete time step
T                                  Terminal time step
\pi                                Policy
\pi(a \mid s)                      Probability of taking action a in state s under policy \pi
\pi_{*}                            Optimal policy
b(a \mid s)                        Behavior policy
v_{\pi}(s)                         State-value function under policy \pi
q_{\pi}(s,a)                       Action-value function under policy \pi
v_{*}(s)                           Optimal state-value function
q_{*}(s,a)                         Optimal action-value function
V(s)                               Estimated state-value
Q(s,a)                             Estimated action-value
\hat{v}(s, \mathbf{w})             Approximate state-value function parameterized by \mathbf{w}
\hat{q}(s, a, \mathbf{w})          Approximate action-value function parameterized by \mathbf{w}
G_t                                Return (cumulative discounted reward) at time t
G_{t:h}                            Truncated return from t to horizon h
G_{t:t+n}                          n-step return from t
G_t^\lambda                        \lambda-return at time t
G_t^{\lambda s}                    State-based \lambda-return (bootstraps from state values)
G_t^{\lambda a}                    Action-based \lambda-return (bootstraps from action values)
G_{t:h}^\lambda                    Truncated \lambda-return from t to horizon h
\lambda                            Trace-decay parameter, \lambda \in [0,1]
\lambda_t                          Variable trace-decay / termination function, \lambda_t \doteq \lambda(S_t, A_t)
\gamma                             Discount factor, \gamma \in [0,1]
\gamma_t                           Variable discount factor, \gamma_t \doteq \gamma(S_t)
\alpha                             Step size (learning rate)
\beta                              Secondary step-size parameter (GTD2/TDC)
\varepsilon                        Exploration parameter (\varepsilon-greedy)
n                                  Number of steps in n-step methods
h                                  Horizon (truncation point in TTD(\lambda))
d                                  Dimension of weight vector \mathbf{w}
\mathbf{w}                         Weight vector for function approximation, \mathbf{w} \in \mathbb{R}^d
\mathbf{w}_t^h                     Weight vector at time t in sequence up to horizon h (online \lambda-return)
\mathbf{w}_{\text{TD}}             TD fixed point, \mathbf{w}_{\text{TD}} = A^{-1}\mathbf{b}
\mathbf{v}                         Secondary weight vector (Gradient-TD cascade)
\nabla \hat{v}(s, \mathbf{w})      Gradient of \hat{v} w.r.t. \mathbf{w}
\mathbf{x}(s)                      Feature vector representing state s
\mathbf{x}(s,a)                    Feature vector representing state-action pair (s,a)
x_i(s)                             i-th component of feature vector, x_i : S \to \mathbb{R}
\bar{\mathbf{x}}_t                 Average feature vector for S_t under the target policy (GQ(\lambda))
\mathbf{z}_t                       Eligibility trace vector at time t, \mathbf{z}_t \in \mathbb{R}^d
\mathbf{z}_t^b                     Accumulating eligibility trace for behavior policy (HTD(\lambda))
\mathbf{F}_t                       Forgetting/fading matrix, \mathbf{F}_t \doteq \mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T
\mathbf{a}_t                       Auxiliary memory vector in dutch trace MC derivation
\delta_t                           TD error at time t
\delta_t^s                         State-based TD error (off-policy traces)
\delta_t^a                         Action-based / expected TD error (off-policy traces)
\bar{\delta}_\mathbf{w}(s)         Bellman error at state s
U_t                                Update target at time t
\rho_t                             Per-step importance sampling ratio, \rho_t = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}
\rho_{t:h}                         Importance sampling ratio from t to h
\mu(s)                             On-policy state distribution
\mu_\pi(s)                         Steady-state distribution under policy \pi
r(\pi)                             Average reward under policy \pi
\bar{R}_t                          Estimate of average reward r(\pi) at time t
\mathcal{I}_t                      Interest at time t, \mathcal{I}_t \geq 0
F_t                                Followon trace at time t (Emphatic-TD(\lambda)), F_t \geq 0
M_t                                Emphasis at time t (Emphatic-TD(\lambda)), M_t \geq 0
\bar{V}_t(s)                       Expected approximate value of state s under target policy \pi
\overline{\text{VE}}(\mathbf{w})   Mean Squared Value Error
\overline{\text{BE}}(\mathbf{w})   Mean Squared Bellman Error
\overline{\text{PBE}}(\mathbf{w})  Mean Squared Projected Bellman Error
\overline{\text{TDE}}(\mathbf{w})  Mean Squared TD Error
\overline{\text{RE}}(\mathbf{w})   Mean Squared Return Error
\Pi                                Projection operator onto the representable subspace
B_\pi                              Bellman operator, B_\pi : \mathbb{R}^{\vert S \vert} \to \mathbb{R}^{\vert S \vert}
\mathbf{B}                         Diagonal matrix with \mu(s) on the diagonal, \vert S \vert \times \vert S \vert
\mathbf{X}                         Feature matrix, \vert S \vert \times d, rows are \mathbf{x}(s)^T
A                                  Matrix \mathbb{E}[\mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^T] \in \mathbb{R}^{d \times d} (linear TD)
\mathbf{b}                         Vector \mathbb{E}[R_{t+1} \mathbf{x}_t] \in \mathbb{R}^d (linear TD)
\eta(s)                            Expected number of time steps spent in state s per episode
h(s)                               Probability that an episode begins in state s
\tau                               Elapsed time since a state-action pair was last visited (Dyna-Q+)
k                                  Small bonus constant in Dyna-Q+
b                                  Branching factor
N(s,a)                             Visit count for state-action pair (s,a) (MCTS)
W(s,a)                             Total reward accumulated through (s,a) (MCTS)
c                                  Exploration constant in UCT, typically \sqrt{2}
p(s', r \mid s, a)                 Transition probability
\hat{p}(s', r \mid s, a)           Estimated/model transition probability
\mathbb{E}_{\pi}[\cdot]            Expectation under policy \pi
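To see how several of these symbols fit together in practice, here is a minimal tabular TD(0) prediction sketch using V(s), \alpha, \gamma, and the TD error \delta_t. The environment (a small random-walk chain, as in the book's Chapter 6 example), the random policy, and all constants here are illustrative choices, not taken from these notes:

```python
import numpy as np

# Tabular TD(0) prediction on a toy 5-state random-walk chain.
# Episodes start in the center state; stepping off the left end
# terminates with reward 0, off the right end with reward 1.
rng = np.random.default_rng(0)
n_states = 5
V = np.zeros(n_states)      # V(s): estimated state values
alpha, gamma = 0.1, 1.0     # step size alpha, discount factor gamma

for episode in range(1000):
    s = n_states // 2                       # start in the center
    while True:
        s_next = s + rng.choice([-1, 1])    # equiprobable random policy
        if s_next < 0:                      # left terminal
            r, v_next, done = 0.0, 0.0, True
        elif s_next >= n_states:            # right terminal
            r, v_next, done = 1.0, 0.0, True
        else:
            r, v_next, done = 0.0, V[s_next], False
        delta = r + gamma * v_next - V[s]   # TD error delta_t
        V[s] += alpha * delta               # TD(0) update
        if done:
            break
        s = s_next

print(np.round(V, 2))
```

For this chain the true values are v(s) = (s+1)/6, so the printed estimates should land near [0.17, 0.33, 0.50, 0.67, 0.83], up to sampling noise from the constant step size.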

Reference

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Available at: http://incompleteideas.net/book/the-book-2nd.html