Reinforcement Learning: An Introduction (2nd Edition)

Personal study notes

Author

Chizoba Obasi

Published

March 2026

These are personal study notes from Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (2nd Edition, MIT Press, 2018).

The notes cover the core ideas, equations, and algorithms from each chapter, transcribed and formatted for review. Mathematical notation follows the textbook closely, and backup diagrams from the book are reproduced as figures.

Coverage

Chapter   Topic                                       Status
1         Introduction to RL                          pending
2         Multi-Armed Bandits
3         Finite Markov Decision Processes
4         Dynamic Programming
5         Monte Carlo Methods
6         Temporal-Difference Learning
7         n-step Bootstrapping
8         Planning & Learning with Tabular Methods
9         On-policy Prediction with Approximation
10        On-policy Control with Approximation        pending
11        Off-policy Methods with Approximation       pending
12        Eligibility Traces                          pending
13        Policy Gradient Methods                     pending

How to Read These Notes

These notes are meant to accompany the textbook, not replace it. They are most useful for:

  • Quick review of key equations and algorithms before exams or projects.
  • Checking notation and definitions across chapters.
  • Following the logical progression of ideas from bandits through full RL.

Key Notation

Symbol Meaning
s, s' State, next state
a, a' Action, next action
r, R Reward
t Discrete time step
T Terminal time step
\pi Policy
\pi(a \mid s) Probability of taking action a in state s under policy \pi
\pi_{*} Optimal policy
v_{\pi}(s) State-value function under policy \pi
q_{\pi}(s,a) Action-value function under policy \pi
v_{*}(s) Optimal state-value function
q_{*}(s,a) Optimal action-value function
G_t Return (cumulative discounted reward) at time t
G_{t:h} Truncated return from t to horizon h
\gamma Discount factor, \gamma \in [0,1]
\alpha Step size (learning rate)
\varepsilon Exploration parameter (\varepsilon-greedy)
\delta_t TD error at time t
\rho_{t:h} Importance sampling ratio from t to h
b(a \mid s) Behavior policy
n Number of steps in n-step methods
Q(s,a) Estimated action-value
V(s) Estimated state-value
\hat{v}(s, \mathbf{w}) Approximate value function parameterized by \mathbf{w}
\mathbf{w} Weight vector for function approximation, \mathbf{w} \in \mathbb{R}^d
\nabla \hat{v}(s, \mathbf{w}) Gradient of \hat{v} w.r.t. \mathbf{w}
\mathbf{x}(s) Feature vector representing state s
x_i(s) i-th component of feature vector, x_i : S \to \mathbb{R}
\mu(s) On-policy state distribution
\overline{VE}(\mathbf{w}) Mean squared value error
\eta(s) Expected number of time steps spent in state s per episode
U_t Update target at time t
h(s) Probability that an episode begins in state s
b Branching factor
\tau Elapsed time since a state-action pair was last visited (Dyna-Q+)
\kappa Small bonus coefficient in Dyna-Q+ (exploration bonus \kappa\sqrt{\tau})
N(s,a) Visit count for state-action pair (s,a) (MCTS)
W(s,a) Total reward accumulated through (s,a) (MCTS)
c Exploration constant in UCT, typically \sqrt{2}
d Dimension of weight vector \mathbf{w}
\mathbb{E}_{\pi}[\cdot] Expectation under policy \pi
p(s', r \mid s, a) Dynamics function: joint probability of next state s' and reward r, given state s and action a
\hat{p}(s', r \mid s, a) Estimated/model transition probability
A Matrix \mathbb{E}[\mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^T] \in \mathbb{R}^{d \times d} (linear TD)
\mathbf{b} Vector \mathbb{E}[R_{t+1} \mathbf{x}_t] \in \mathbb{R}^d (linear TD)
\mathbf{w}_{\text{TD}} TD fixed point, \mathbf{w}_{\text{TD}} = A^{-1}\mathbf{b}
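As a concrete check on the linear-TD symbols above (A, \mathbf{b}, \mathbf{w}_{\text{TD}}), here is a minimal NumPy sketch that computes the TD fixed point for a small two-state Markov reward process with one-hot features. The MRP itself (transition matrix, rewards, discount) is an invented example for illustration, not one from the book.

```python
import numpy as np

# Hypothetical 2-state Markov reward process (invented example):
# from either state, move to state 0 or 1 with probability 1/2;
# the reward is 1 when the transition lands in state 1, else 0.
gamma = 0.9
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])          # P[s, s'] transition probabilities
r_next = np.array([0.0, 1.0])       # reward as a function of the next state
mu = np.array([0.5, 0.5])           # on-policy (stationary) state distribution
X = np.eye(2)                       # one-hot feature vectors x(s)

# A = E[ x_t (x_t - gamma x_{t+1})^T ],  b = E[ R_{t+1} x_t ],
# with the expectation taken over mu and the transition dynamics.
A = sum(mu[s] * np.outer(X[s], X[s] - gamma * (P[s] @ X)) for s in range(2))
b = sum(mu[s] * (P[s] @ r_next) * X[s] for s in range(2))

w_td = np.linalg.solve(A, b)        # TD fixed point w_TD = A^{-1} b
print(w_td)                         # both components equal 0.5 / (1 - gamma) = 5
```

Because the features are one-hot (tabular), \mathbf{w}_{\text{TD}} coincides with the true value function here: by symmetry, v(s) = 0.5 + \gamma v(s) for both states, so v(s) = 0.5/(1-\gamma) = 5, which makes the result easy to verify by hand.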

Reference

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Available at: http://incompleteideas.net/book/the-book-2nd.html