<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Chizoba Obasi blog</title>
    <description>Exploring Deep Learning.</description>
    <link>https://chizkidd.github.io//</link>
    <atom:link href="https://chizkidd.github.io//feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Fri, 08 May 2026 19:08:40 +0000</pubDate>
    <lastBuildDate>Fri, 08 May 2026 19:08:40 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Sutton &amp; Barto, Ch. 13: Policy Gradient Methods (Personal Notes)</title>
        <description>&lt;ul&gt;
  &lt;li&gt;Almost all the algorithms/methods covered so far have been &lt;strong&gt;action-value methods&lt;/strong&gt; (except gradient-bandit algorithms, &lt;a href=&quot;https://chizkidd.github.io/RL-Sutton-Barto-notes/chapters/ch02-multi-armed-bandits.html#sec-ch02-2-8&quot;&gt;Section 2.8&lt;/a&gt;).&lt;/li&gt;
  &lt;li&gt;Action-value methods learn the values of actions and then select actions based on those estimated action values.&lt;/li&gt;
  &lt;li&gt;Here, we explicitly learn a &lt;strong&gt;parametrized policy&lt;/strong&gt; that can select actions without consulting a value function.&lt;/li&gt;
  &lt;li&gt;A value function is not required for action selection, but still could be used to learn the policy parameters $\boldsymbol{\theta}$.&lt;/li&gt;
  &lt;li&gt;The parametrized policy $\pi(a \vert s, \boldsymbol{\theta})$ is now the probability that action $a$ is taken at time $t$ given that the environment is in state $s$ at time $t$ with parameter $\boldsymbol{\theta} \in \mathbb{R}^{d&apos;}$:&lt;/li&gt;
&lt;/ul&gt;

\[\pi(a \vert s, \boldsymbol{\theta}) = \Pr\{A_t = a \vert S_t = s,\ \boldsymbol{\theta}_t = \boldsymbol{\theta}\}\]

&lt;ul&gt;
  &lt;li&gt;We will consider methods for learning $\boldsymbol{\theta}$ based on the gradient of some scalar performance measure $J(\boldsymbol{\theta})$ w.r.t. $\boldsymbol{\theta}$. The goal is to &lt;strong&gt;maximize&lt;/strong&gt; performance, hence the use of gradient ascent:&lt;/li&gt;
&lt;/ul&gt;

\[\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \widehat{\nabla J(\boldsymbol{\theta}_t)}\]

\[\begin{aligned}
\text{where} \\
\widehat{\nabla J(\boldsymbol{\theta}_t)} &amp;amp;\in \mathbb{R}^{d&apos;} \equiv \text{a stochastic estimate whose expectation approximates} \\
&amp;amp;\phantom{{}\equiv{}} \text{the gradient of the performance measure } J \text{ w.r.t. } \boldsymbol{\theta}_t
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;This general methodology applies to &lt;strong&gt;policy gradient methods&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Methods that learn approximations to both policy &amp;amp; value functions are often called &lt;strong&gt;Actor-Critic methods&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Actor&lt;/strong&gt; refers to the learned policy.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Critic&lt;/strong&gt; refers to the learned value function (state-value function).&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#131-policy-approximation--its-advantages&quot;&gt;13.1 Policy Approximation &amp;amp; Its Advantages&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#132-the-policy-gradient-theorem&quot;&gt;13.2 The Policy Gradient Theorem&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#133-reinforce-monte-carlo-policy-gradient&quot;&gt;13.3 REINFORCE: Monte Carlo Policy Gradient&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#134-reinforce-with-baseline&quot;&gt;13.4 REINFORCE with Baseline&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#135-actor-critic-methods&quot;&gt;13.5 Actor-Critic Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#136-policy-gradient-for-continuing-problems&quot;&gt;13.6 Policy Gradient for Continuing Problems&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#137-policy-parametrization-for-continuous-actions&quot;&gt;13.7 Policy Parametrization for Continuous Actions&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#138-summary&quot;&gt;13.8 Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;131-policy-approximation--its-advantages&quot;&gt;13.1 Policy Approximation &amp;amp; Its Advantages&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;In policy gradient methods, the policy can be parametrized in any way, as long as $\pi(a \vert s, \boldsymbol{\theta})$ is differentiable w.r.t. $\boldsymbol{\theta}$ (essentially the partial derivatives’ column vector exists and is finite):&lt;/li&gt;
&lt;/ul&gt;

\[\nabla \pi(a \vert s, \boldsymbol{\theta}) \text{ exists and is finite for all } s \in S,\ a \in A(s),\ \boldsymbol{\theta} \in \mathbb{R}^{d&apos;}\]

&lt;ul&gt;
  &lt;li&gt;If the action-space is &lt;strong&gt;discrete&lt;/strong&gt; and not too large, then a natural &amp;amp; common kind of parametrization is to form parametrized &lt;strong&gt;numerical preferences&lt;/strong&gt; $h(s, a, \boldsymbol{\theta}) \in \mathbb{R}$ for each $(s, a)$ pair.&lt;/li&gt;
  &lt;li&gt;The actions with the highest preferences in each state are given the highest probabilities of being selected, for example, according to an exponential soft-max distribution:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\pi(a \vert s, \boldsymbol{\theta}) \doteq \frac{e^{h(s,a,\boldsymbol{\theta})}}{\sum_b e^{h(s,b,\boldsymbol{\theta})}}}\]

&lt;ul&gt;
  &lt;li&gt;This kind of policy parametrization is called &lt;strong&gt;softmax in action preferences&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;The action preferences can be parametrized arbitrarily; for example, they can be computed by a deep artificial neural network (ANN) or could simply be &lt;strong&gt;linear&lt;/strong&gt; in features:&lt;/li&gt;
&lt;/ul&gt;

\[h(s, a, \boldsymbol{\theta}) = \boldsymbol{\theta}^T \mathbf{x}(s, a)\]
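
&lt;p&gt;As a quick illustration (my own sketch, not from the book), here is a minimal NumPy implementation of a softmax-in-action-preferences policy with linear preferences; the one-hot feature function &lt;code&gt;x(s, a)&lt;/code&gt; and the tiny problem sizes are made-up placeholders.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def softmax_policy(theta, x, s, actions):
    # h(s, a, theta) = theta^T x(s, a) for each action (linear preferences)
    prefs = np.array([theta @ x(s, a) for a in actions])
    prefs = prefs - prefs.max()                      # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()      # exponential softmax over preferences
    return probs

# hypothetical one-hot (state, action) features for a tiny 3-state, 2-action problem
def x(s, a, n_states=3, n_actions=2):
    feat = np.zeros(n_states * n_actions)
    feat[s * n_actions + a] = 1.0
    return feat

theta = np.zeros(6)
print(softmax_policy(theta, x, s=0, actions=[0, 1]))  # uniform while theta = 0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;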

&lt;p&gt;&lt;strong&gt;Advantages of softmax in action preferences policy parametrization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The approx. policy can approach a &lt;strong&gt;deterministic policy&lt;/strong&gt;, unlike $\varepsilon$-greedy action selection.&lt;/li&gt;
  &lt;li&gt;It enables the selection of actions with &lt;strong&gt;arbitrary probabilities&lt;/strong&gt;, which is useful for problems in which the best approximate policy is stochastic (e.g., under significant function approximation).
    &lt;ul&gt;
      &lt;li&gt;E.g. useful in environments with imperfect information (e.g., card games) where it is optimal to act stochastically, such as when &lt;strong&gt;bluffing in Poker&lt;/strong&gt;; it is important to do so randomly to unnerve/confuse an opponent.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The parametrized policy may be a &lt;strong&gt;simpler function to approximate&lt;/strong&gt; than the action-value function.&lt;/li&gt;
  &lt;li&gt;Often the most important reason for using a policy-based learning method is that it is a good way to &lt;strong&gt;inject prior knowledge&lt;/strong&gt; about the desired form of the policy into the RL system.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;132-the-policy-gradient-theorem&quot;&gt;13.2 The Policy Gradient Theorem&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Policy-gradient methods have stronger convergence guarantees than action-value methods because the action probabilities change &lt;strong&gt;smoothly&lt;/strong&gt; as a function of the learned parameter, which is what gradient ascent relies on.&lt;/li&gt;
  &lt;li&gt;Let’s consider the &lt;strong&gt;episodic&lt;/strong&gt; performance measure, which is the value of the start state of the episode:&lt;/li&gt;
&lt;/ul&gt;

\[J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0)\]

\[\begin{aligned}
\text{where} \\
v_{\pi_{\boldsymbol{\theta}}} &amp;amp;\equiv \text{the true value function for } \pi_{\boldsymbol{\theta}}\text{, the policy determined by } \boldsymbol{\theta} \\
s_0 &amp;amp;\equiv \text{some non-random state}
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;How can we estimate the performance gradient w.r.t. the policy parameter when the gradient depends on the unknown effect of policy changes on the state distribution?
    &lt;ul&gt;
      &lt;li&gt;The theoretical answer is the &lt;strong&gt;policy gradient theorem&lt;/strong&gt;.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The policy gradient theorem provides an analytic expression for the performance gradient w.r.t. the policy parameter; for the episodic case it establishes that:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla\pi(a \vert s, \boldsymbol{\theta})}\]

\[\begin{aligned}
  \text{where} \quad &amp;amp;\\
  \nabla J(\boldsymbol{\theta}) &amp;amp;\equiv \text{column vector of partial derivatives w.r.t. the components of } \boldsymbol{\theta} \\
  \mu &amp;amp;\equiv \text{the on-policy distribution under the policy } \pi \\
  &amp;amp;\phantom{{}\equiv{}} \text{(in the episodic case, proportionality constant is the average length of an episode;} \\
  &amp;amp;\phantom{{}\equiv{}} \text{in the continuing case it is 1)}
  \end{aligned}\]

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Proof: Policy Gradient Theorem (Episodic Case)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;Let’s prove the policy gradient theorem from first principles using elementary calculus.&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; To keep the notation simple, we leave it implicit in all cases that $\pi = f(\boldsymbol{\theta})$ and that all gradients $\nabla[\cdot]$ are taken w.r.t. $\boldsymbol{\theta}$.&lt;/p&gt;

\[\begin{align*}
\nabla v_\pi(s) &amp;amp;= \nabla \!\left[\sum_a \pi(a \vert s)\, q_\pi(s,a)\right], \quad \text{for all } s \in \mathcal{S} \\
&amp;amp;= \sum_a \!\Biggl[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s)\, \nabla q_\pi(s,a)\Biggr] \\
&amp;amp;= \sum_a \!\left[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s)\, \nabla \sum_{s&apos;,r} p(s&apos;,r \vert s,a)\!\left(r + v_\pi(s&apos;)\right)\right] \\
&amp;amp;= \sum_a \!\left[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s) \sum_{s&apos;} p(s&apos; \vert s,a)\, \nabla v_\pi(s&apos;)\right] \\
&amp;amp;= \sum_a \Biggl[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s) \sum_{s&apos;} p(s&apos; \vert s,a) \sum_{a&apos;} \Biggl[\nabla\pi(a&apos; \vert s&apos;)\, q_\pi(s&apos;,a&apos;) \\
&amp;amp;\qquad\qquad + \pi(a&apos; \vert s&apos;) \sum_{s&apos;&apos;} p(s&apos;&apos; \vert s&apos;,a&apos;)\, \nabla v_\pi(s&apos;&apos;)\Biggr]\Biggr]
\end{align*}\]

&lt;p&gt;Unrolling this recursion:&lt;/p&gt;

\[\boxed{\nabla v_\pi(s) = \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla\pi(a \vert x)\, q_\pi(x,a)}\]

\[\begin{aligned}
\Pr(s \to x, k, \pi) &amp;amp;\equiv \text{probability of transitioning from state } s \text{ to state } x \text{ in } k \text{ steps under policy } \pi
\end{aligned}\]

&lt;p&gt;Then:&lt;/p&gt;

\[\begin{align*}
\nabla J(\boldsymbol{\theta}) &amp;amp;= \nabla v_\pi(s_0) \\
&amp;amp;= \sum_s \!\left(\sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi)\right) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a) \\
&amp;amp;= \sum_s \eta(s) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a) \\
&amp;amp;= \sum_{s&apos;} \eta(s&apos;) \sum_s \frac{\eta(s)}{\sum_{s&apos;} \eta(s&apos;)} \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a) \\
&amp;amp;= \sum_{s&apos;} \eta(s&apos;) \sum_s \mu(s) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a)
\end{align*}\]

\[\boxed{\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a)}\]

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;133-reinforce-monte-carlo-policy-gradient&quot;&gt;13.3 REINFORCE: Monte Carlo Policy Gradient&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;REINFORCE is our first policy gradient algorithm.&lt;/li&gt;
  &lt;li&gt;The goal/strategy is to find a way to get samples such that the expectation of the sample gradient is proportional to the actual performance gradient as a function of the parameters.&lt;/li&gt;
  &lt;li&gt;The sample gradients need to only be proportional to the performance gradient because any proportionality constant can be absorbed into the step size $\alpha$.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The policy gradient theorem gives &lt;strong&gt;an exact expression proportional to the gradient&lt;/strong&gt;; so all that is needed is a way of sampling whose expectation equals or approximates this expression.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Recall the RHS of the policy gradient theorem is a sum over states weighted by how often the states occur under the target policy $\pi$:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\nabla J(\boldsymbol{\theta}) &amp;amp;\propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla\pi(a \vert s, \boldsymbol{\theta}) \\
&amp;amp;= \mathbb{E}_\pi \!\left[\sum_a q_\pi(S_t, a)\, \nabla\pi(a \vert S_t, \boldsymbol{\theta})\right]
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;We get REINFORCE by replacing the sum over the random variable’s possible values by an expectation under $\pi$, and then sampling the expectation:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\nabla J(\boldsymbol{\theta}) &amp;amp;\propto \mathbb{E}_\pi \!\left[\sum_a \pi(a \vert S_t, \boldsymbol{\theta})\, q_\pi(S_t, a)\, \frac{\nabla\pi(a \vert S_t, \boldsymbol{\theta})}{\pi(a \vert S_t, \boldsymbol{\theta})}\right] \\
&amp;amp;= \mathbb{E}_\pi \!\left[q_\pi(S_t, A_t)\, \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta})}{\pi(A_t \vert S_t, \boldsymbol{\theta})}\right] \quad \text{(replacing } a \text{ by the sample } A_t \sim \pi\text{)} \\
&amp;amp;= \mathbb{E}_\pi \!\left[G_t\, \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta})}{\pi(A_t \vert S_t, \boldsymbol{\theta})}\right] \quad \text{(because } \mathbb{E}_\pi[G_t \vert S_t, A_t] = q_\pi(S_t, A_t)\text{)}
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The last expression is the required expression; a quantity that can be sampled on each time step whose expectation is proportional to the gradient. This leads to the &lt;strong&gt;REINFORCE update&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha G_t \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}{\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}}\]

&lt;ul&gt;
  &lt;li&gt;This update is intuitively coherent:
    &lt;ul&gt;
      &lt;li&gt;The gradient term is the direction in parameter space that most increases the probability of taking action $A_t$ again on future visits to state $S_t$.&lt;/li&gt;
      &lt;li&gt;The update is proportional to the return $G_t$, so it moves the parameter most in directions that favour actions yielding high returns.&lt;/li&gt;
      &lt;li&gt;The update is inversely proportional to the action probability $\pi(A_t \vert S_t, \boldsymbol{\theta}_t)$, which prevents frequently selected actions from gaining an advantage simply because they are updated more often.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;REINFORCE has good theoretical convergence properties, but as a Monte Carlo method it may suffer from high variance and hence slow learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;reinforce-monte-carlo-policy-gradient-control-episodic-for-pi_&quot;&gt;REINFORCE: Monte Carlo Policy Gradient Control (Episodic) for $\pi_*$&lt;/h4&gt;

\[\boxed{
\begin{aligned}
&amp;amp;\textbf{Input: } \text{a differentiable policy parametrization } \pi(a \vert s, \boldsymbol{\theta}) \\
&amp;amp;\textbf{Algorithm parameter: } \text{step size } \alpha &amp;gt; 0 \\
&amp;amp;\textbf{Initialize } \text{policy parameter } \boldsymbol{\theta} \in \mathbb{R}^{d&apos;} \text{ (e.g., to } \mathbf{0}\text{)} \\
&amp;amp;\textbf{Loop forever } \text{(for each episode):} \\
&amp;amp;\quad \text{Generate an episode } S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T \text{ following } \pi(\cdot \vert \cdot, \boldsymbol{\theta}) \\
&amp;amp;\quad \textbf{Loop for each step of the episode } t = 0, 1, 2, \ldots, T-1\text{:} \\
&amp;amp;\qquad G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k \\
&amp;amp;\qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\gamma^t G\, \nabla \ln \pi(A_t \vert S_t, \boldsymbol{\theta})
\end{aligned}
}\]
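
&lt;p&gt;A minimal NumPy sketch of this update loop (not the book’s code), reusing &lt;code&gt;softmax_policy&lt;/code&gt; and the feature function &lt;code&gt;x&lt;/code&gt; from the earlier sketch; the &lt;code&gt;env&lt;/code&gt; object with &lt;code&gt;reset()&lt;/code&gt; and &lt;code&gt;step(a)&lt;/code&gt; returning &lt;code&gt;(next_state, reward, done)&lt;/code&gt; is an assumed placeholder interface.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def grad_log_pi(theta, x, s, a, actions):
    # grad ln pi(a|s,theta) = x(s,a) - sum_b pi(b|s,theta) x(s,b)   (softmax-linear policy)
    probs = softmax_policy(theta, x, s, actions)
    expected_feat = sum(p * x(s, b) for p, b in zip(probs, actions))
    return x(s, a) - expected_feat

def reinforce_episode(theta, env, x, actions, alpha=1e-3, gamma=0.99):
    # generate an episode S0, A0, R1, ..., R_T following pi(.|., theta)
    states, acts, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = np.random.choice(actions, p=softmax_policy(theta, x, s, actions))
        s_next, r, done = env.step(a)
        states.append(s); acts.append(a); rewards.append(r)
        s = s_next
    # loop over the episode: G is the return from step t; the update uses alpha * gamma^t * G
    T = len(rewards)
    for t in range(T):
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        theta = theta + alpha * gamma ** t * G * grad_log_pi(theta, x, states[t], acts[t], actions)
    return theta
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;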

&lt;hr /&gt;

&lt;h2 id=&quot;134-reinforce-with-baseline&quot;&gt;13.4 REINFORCE with Baseline&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary &lt;strong&gt;baseline&lt;/strong&gt; $b(s)$:&lt;/li&gt;
&lt;/ul&gt;

\[\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a \!\left[q_\pi(s,a) - b(s)\right] \nabla\pi(a \vert s, \boldsymbol{\theta})\]

&lt;ul&gt;
  &lt;li&gt;The baseline can be any function that is not dependent on the action $a$:&lt;/li&gt;
&lt;/ul&gt;

\[\sum_a b(s)\, \nabla\pi(a \vert s, \boldsymbol{\theta}) = b(s)\, \nabla \sum_a \pi(a \vert s, \boldsymbol{\theta}) = b(s)\, \nabla 1 = 0\]

&lt;ul&gt;
  &lt;li&gt;Now the REINFORCE update with baseline is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha \!\left[G_t - b(S_t)\right] \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}{\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}}\]

&lt;ul&gt;
  &lt;li&gt;The baseline leaves the expected value of the update unchanged but can substantially &lt;strong&gt;reduce its variance&lt;/strong&gt;, which speeds up learning.&lt;/li&gt;
  &lt;li&gt;One natural choice for the baseline is an estimate of the state value $\hat{v}(S_t, \mathbf{w})$.&lt;/li&gt;
  &lt;li&gt;Since REINFORCE is a Monte Carlo method for learning the policy parameter $\boldsymbol{\theta}$, it’s natural to also use a Monte Carlo method to learn the state-value weights $\mathbf{w}$.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;reinforce-with-baseline-episodic-for-estimating-pi_boldsymboltheta-approx-pi_&quot;&gt;REINFORCE with Baseline (Episodic), for estimating $\pi_{\boldsymbol{\theta}} \approx \pi_*$&lt;/h4&gt;

\[\boxed{
\begin{aligned}
&amp;amp;\textbf{Input: } \text{a differentiable policy parametrization } \pi(a \vert s, \boldsymbol{\theta}) \\
&amp;amp;\textbf{Input: } \text{a differentiable state-value function parametrization } \hat{v}(s, \mathbf{w}) \\
&amp;amp;\textbf{Algorithm parameters: } \text{step sizes } \alpha^{\boldsymbol{\theta}} &amp;gt; 0,\ \alpha^{\mathbf{w}} &amp;gt; 0 \\
&amp;amp;\textbf{Initialize } \text{policy parameter } \boldsymbol{\theta} \in \mathbb{R}^{d&apos;} \text{ and state-value weights } \mathbf{w} \in \mathbb{R}^d \text{ (e.g., to } \mathbf{0}\text{)} \\
&amp;amp;\textbf{Loop forever } \text{(for each episode):} \\
&amp;amp;\quad \text{Generate an episode } S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T \text{ following } \pi(\cdot \vert \cdot, \boldsymbol{\theta}) \\
&amp;amp;\quad \textbf{Loop for each step of the episode } t = 0, 1, \ldots, T-1\text{:} \\
&amp;amp;\qquad G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k \\
&amp;amp;\qquad \delta \leftarrow G - \hat{v}(S_t, \mathbf{w}) \\
&amp;amp;\qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \nabla \hat{v}(S_t, \mathbf{w}) \\
&amp;amp;\qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}} \gamma^t\, \delta\, \nabla \ln \pi(A_t \vert S_t, \boldsymbol{\theta})
\end{aligned}
}\]
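
&lt;p&gt;A sketch of the corresponding inner-loop updates with a linear state-value baseline $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}_s(s)$, assuming an episode has already been generated as in the REINFORCE sketch above; the state-feature function &lt;code&gt;x_s&lt;/code&gt; is a hypothetical placeholder.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def reinforce_baseline_updates(theta, w, states, acts, rewards, x, x_s, actions,
                               alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    # one pass over a generated episode; v_hat(s, w) = w^T x_s(s) is a linear baseline
    T = len(rewards)
    for t in range(T):
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        delta = G - w @ x_s(states[t])                    # return minus baseline
        w = w + alpha_w * delta * x_s(states[t])          # grad v_hat = x_s(s) in the linear case
        theta = theta + alpha_theta * gamma ** t * delta * grad_log_pi(theta, x, states[t], acts[t], actions)
    return theta, w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;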

&lt;ul&gt;
  &lt;li&gt;This algorithm has 2 step sizes $\alpha^{\boldsymbol{\theta}}$ and $\alpha^{\mathbf{w}}$.&lt;/li&gt;
  &lt;li&gt;Choosing $\alpha^{\mathbf{w}}$ is relatively easy; in the linear case a good rule of thumb is (see &lt;a href=&quot;https://chizkidd.github.io/RL-Sutton-Barto-notes/chapters/ch09-on-policy-prediction-approximation.html#sec-ch09-9-6&quot;&gt;Section 9.6&lt;/a&gt;):&lt;/li&gt;
&lt;/ul&gt;

\[\alpha^{\mathbf{w}} = \frac{0.1}{\mathbb{E}\!\left[\|\nabla \hat{v}(S_t, \mathbf{w})\|_\mu^2\right]}\]

&lt;ul&gt;
  &lt;li&gt;Choosing $\alpha^{\boldsymbol{\theta}}$ is much less clear since its best value depends on the range of variation of the rewards and on the policy parametrization.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;135-actor-critic-methods&quot;&gt;13.5 Actor-Critic Methods&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;REINFORCE with baseline does not evaluate actions because its state-value function &lt;strong&gt;only estimates the value of the 1st state&lt;/strong&gt; of each transition; that estimate is made before the transition’s action and serves only as a baseline for the return that follows.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Actor-critic methods, however, apply the state-value function to the 2nd state&lt;/strong&gt; of the transition thereby estimating its value and thus &lt;strong&gt;evaluating the action&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;The policy is the &lt;strong&gt;actor&lt;/strong&gt; that maps states to actions, while the state-value function used to assess actions in this way is the &lt;strong&gt;critic&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;The estimated value of the 2nd state, when discounted &amp;amp; added to the reward, yields the &lt;strong&gt;one-step return&lt;/strong&gt;, $G_{t:t+1}$.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;1351-one-step-actor-critic-methods&quot;&gt;13.5.1 One-Step Actor-Critic Methods&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;One-step actor-critic methods replace the REINFORCE full return with the one-step return and use a learned state-value function as the baseline as follows:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\boldsymbol{\theta}_{t+1} &amp;amp;\doteq \boldsymbol{\theta}_t + \alpha \!\left[G_{t:t+1} - \hat{v}(S_t, \mathbf{w})\right] \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}{\pi(A_t \vert S_t, \boldsymbol{\theta}_t)} \\
&amp;amp;= \boldsymbol{\theta}_t + \alpha \!\left[R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\right] \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}{\pi(A_t \vert S_t, \boldsymbol{\theta}_t)} \\
&amp;amp;= \boldsymbol{\theta}_t + \alpha\, \delta_t\, \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}{\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The semi-gradient TD(0) could serve as the natural state-value function learning method.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;PROS:&lt;/strong&gt; simple, fully online &amp;amp; incremental.&lt;/li&gt;
  &lt;li&gt;It is analogous to TD(0), Sarsa(0) &amp;amp; Q-learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;one-step-actor-critic-episodic-for-estimating-pi_boldsymboltheta-approx-pi_&quot;&gt;One-Step Actor-Critic (Episodic), for estimating $\pi_{\boldsymbol{\theta}} \approx \pi_*$&lt;/h4&gt;

\[\boxed{
\begin{aligned}
&amp;amp;\textbf{Inputs: } \text{a differentiable policy } \pi(a \vert s, \boldsymbol{\theta}) \text{ and state-value function } \hat{v}(s, \mathbf{w}) \text{ parametrization} \\
&amp;amp;\textbf{Parameters: } \text{step sizes } \alpha^{\boldsymbol{\theta}} &amp;gt; 0,\ \alpha^{\mathbf{w}} &amp;gt; 0 \\
&amp;amp;\textbf{Initialize } \text{policy parameter } \boldsymbol{\theta} \in \mathbb{R}^{d&apos;} \text{ and state-value weights } \mathbf{w} \in \mathbb{R}^d \text{ (e.g., to } \mathbf{0}\text{)} \\
&amp;amp;\textbf{Loop forever } \text{(for each episode):} \\
&amp;amp;\quad \text{Initialize } S \text{ (1st state of episode)} \\
&amp;amp;\quad I \leftarrow 1 \\
&amp;amp;\quad \textbf{Loop while } S \text{ is not terminal (for each time step):} \\
&amp;amp;\qquad A \sim \pi(\cdot \vert S, \boldsymbol{\theta}) \\
&amp;amp;\qquad \text{Take action } A\text{, observe } S&apos;, R \\
&amp;amp;\qquad \delta \leftarrow R + \gamma \hat{v}(S&apos;, \mathbf{w}) - \hat{v}(S, \mathbf{w}) \quad \text{(if } S&apos; \text{ is terminal, then } \hat{v}(S&apos;, \mathbf{w}) \doteq 0\text{)} \\
&amp;amp;\qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \nabla \hat{v}(S, \mathbf{w}) \\
&amp;amp;\qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}} I\, \delta\, \nabla \ln \pi(A \vert S, \boldsymbol{\theta}) \\
&amp;amp;\qquad I \leftarrow \gamma I \\
&amp;amp;\qquad S \leftarrow S&apos;
\end{aligned}
}\]
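
&lt;p&gt;A minimal NumPy sketch of one episode of this loop, with a linear critic and the softmax-linear actor from the earlier sketches; &lt;code&gt;env&lt;/code&gt;, &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;x_s&lt;/code&gt; are the same hypothetical placeholders as before.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def one_step_actor_critic_episode(theta, w, env, x, x_s, actions,
                                  alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    s, done = env.reset(), False
    I = 1.0
    while not done:
        a = np.random.choice(actions, p=softmax_policy(theta, x, s, actions))
        s_next, r, done = env.step(a)
        v_next = 0.0 if done else w @ x_s(s_next)           # v_hat of a terminal state is 0
        delta = r + gamma * v_next - w @ x_s(s)             # one-step TD error
        w = w + alpha_w * delta * x_s(s)                    # critic update
        theta = theta + alpha_theta * I * delta * grad_log_pi(theta, x, s, a, actions)   # actor update
        I = gamma * I
        s = s_next
    return theta, w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;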

&lt;h3 id=&quot;1352-actor-critic-with-eligibility-traces&quot;&gt;13.5.2 Actor-Critic with Eligibility Traces&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;We generalize to the forward view of $n$-step methods and then to a $\lambda$-return algorithm.&lt;/li&gt;
  &lt;li&gt;We replace the one-step return $G_{t:t+1}$ by $G_{t:t+n}$ or $G_t^\lambda$ respectively for either $n$-step or $\lambda$-return.&lt;/li&gt;
  &lt;li&gt;The backward view of the $\lambda$-return algorithm uses &lt;strong&gt;eligibility traces&lt;/strong&gt; for the actor and the critic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;actor-critic-with-eligibility-traces-episodic-for-estimating-pi_boldsymboltheta-approx-pi_&quot;&gt;Actor-Critic with Eligibility Traces (Episodic), for estimating $\pi_{\boldsymbol{\theta}} \approx \pi_*$&lt;/h4&gt;

\[\boxed{
\begin{aligned}
&amp;amp;\textbf{Inputs: } \text{a differentiable policy and state-value function parametrization: } \pi(a \vert s, \boldsymbol{\theta}),\ \hat{v}(s, \mathbf{w}) \\
&amp;amp;\textbf{Parameters: } \text{trace-decay rates } \lambda^{\boldsymbol{\theta}} \in [0,1],\ \lambda^{\mathbf{w}} \in [0,1]\text{; step sizes } \alpha^{\boldsymbol{\theta}} &amp;gt; 0,\ \alpha^{\mathbf{w}} &amp;gt; 0 \\
&amp;amp;\textbf{Initialize } \text{policy parameter } \boldsymbol{\theta} \in \mathbb{R}^{d&apos;} \text{ and state-value weights } \mathbf{w} \in \mathbb{R}^d \text{ (e.g., to } \mathbf{0}\text{)} \\
&amp;amp;\textbf{Loop forever } \text{(for each episode):} \\
&amp;amp;\quad \text{Initialize } S \text{ (1st state of episode)} \\
&amp;amp;\quad \mathbf{z}^{\boldsymbol{\theta}} \leftarrow \mathbf{0} \quad \text{($d&apos;$-component eligibility trace vector)} \\
&amp;amp;\quad \mathbf{z}^{\mathbf{w}} \leftarrow \mathbf{0} \quad \text{($d$-component eligibility trace vector)} \\
&amp;amp;\quad I \leftarrow 1 \\
&amp;amp;\quad \textbf{Loop while } S \text{ is not terminal (for each time step):} \\
&amp;amp;\qquad A \sim \pi(\cdot \vert S, \boldsymbol{\theta}) \\
&amp;amp;\qquad \text{Take action } A\text{, observe } S&apos;, R \quad \text{(if } S&apos; \text{ is terminal, then } \hat{v}(S&apos;, \mathbf{w}) \doteq 0\text{)} \\
&amp;amp;\qquad \delta \leftarrow R + \gamma \hat{v}(S&apos;, \mathbf{w}) - \hat{v}(S, \mathbf{w}) \\
&amp;amp;\qquad \mathbf{z}^{\mathbf{w}} \leftarrow \gamma\lambda^{\mathbf{w}} \mathbf{z}^{\mathbf{w}} + \nabla \hat{v}(S, \mathbf{w}) \\
&amp;amp;\qquad \mathbf{z}^{\boldsymbol{\theta}} \leftarrow \gamma\lambda^{\boldsymbol{\theta}} \mathbf{z}^{\boldsymbol{\theta}} + I\, \nabla \ln \pi(A \vert S, \boldsymbol{\theta}) \\
&amp;amp;\qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \mathbf{z}^{\mathbf{w}} \\
&amp;amp;\qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}}\, \delta\, \mathbf{z}^{\boldsymbol{\theta}} \\
&amp;amp;\qquad I \leftarrow \gamma I \\
&amp;amp;\qquad S \leftarrow S&apos;
\end{aligned}
}\]
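
&lt;p&gt;The same loop with accumulating eligibility traces, again as a rough sketch built on the hypothetical helpers above:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def actor_critic_traces_episode(theta, w, env, x, x_s, actions, lam_theta=0.9, lam_w=0.9,
                                alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    s, done = env.reset(), False
    z_theta, z_w = np.zeros_like(theta), np.zeros_like(w)   # eligibility traces
    I = 1.0
    while not done:
        a = np.random.choice(actions, p=softmax_policy(theta, x, s, actions))
        s_next, r, done = env.step(a)
        v_next = 0.0 if done else w @ x_s(s_next)
        delta = r + gamma * v_next - w @ x_s(s)
        z_w = gamma * lam_w * z_w + x_s(s)                  # accumulate the critic trace
        z_theta = gamma * lam_theta * z_theta + I * grad_log_pi(theta, x, s, a, actions)
        w = w + alpha_w * delta * z_w
        theta = theta + alpha_theta * delta * z_theta
        I = gamma * I
        s = s_next
    return theta, w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;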

&lt;hr /&gt;

&lt;h2 id=&quot;136-policy-gradient-for-continuing-problems&quot;&gt;13.6 Policy Gradient for Continuing Problems&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;From &lt;a href=&quot;https://chizkidd.github.io/RL-Sutton-Barto-notes/chapters/ch10-on-policy-control-approximation.html#sec-ch10-10-3&quot;&gt;Section 10.3&lt;/a&gt; on continuing problems, lack of episode boundaries requires a new performance measure definition in terms of the &lt;strong&gt;average rate of reward per time step&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
J(\boldsymbol{\theta}) \doteq r(\pi) &amp;amp;\doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[R_t \vert S_0, A_{0:t-1} \sim \pi\right] \\
&amp;amp;= \lim_{t \to \infty} \mathbb{E}\!\left[R_t \vert S_0, A_{0:t-1} \sim \pi\right] \\
&amp;amp;= \sum_s \mu(s) \sum_a \pi(a \vert s) \sum_{s&apos;,r} p(s&apos;,r \vert s,a)\, r
\end{align*}\]

\[\begin{aligned}
\text{where} \\
\mu(s) &amp;amp;\doteq \lim_{t \to \infty} \Pr\{S_t = s \vert A_{0:t} \sim \pi\} \equiv \text{steady state distribution under } \pi\text{,} \\
&amp;amp;\phantom{{}\doteq{}} \text{assumed to exist and be independent of } S_0 \textbf{ [ergodicity assumption]}
\end{aligned}\]

\[\sum_s \mu(s) \sum_a \pi(a \vert s, \boldsymbol{\theta})\, p(s&apos; \vert s,a) = \mu(s&apos;) \quad \text{for all } s&apos; \in S \quad \text{(ergodicity)}\]

&lt;ul&gt;
  &lt;li&gt;In the continuing case, we define values $v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \vert S_t = s]$ and $q_{\pi}(s,a) \doteq \mathbb{E}_{\pi}[G_t \vert S_t = s, A_t = a]$ w.r.t. the &lt;strong&gt;differential return&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \ldots\]

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Proof: Policy Gradient Theorem (Continuing Case)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;As before, we leave it implicit that $\pi = f(\boldsymbol{\theta})$ and that all gradients $\nabla[\cdot]$ are taken w.r.t. $\boldsymbol{\theta}$. In the continuing case $J(\boldsymbol{\theta}) = r(\pi)$, and $v_\pi$ &amp;amp; $q_\pi$ denote values w.r.t. the &lt;strong&gt;differential return&lt;/strong&gt;.&lt;/p&gt;

\[\begin{align*}
\nabla v_\pi(s) &amp;amp;= \nabla \!\left[\sum_a \pi(a \vert s)\, q_\pi(s,a)\right], \quad \text{for all } s \in \mathcal{S} \\
&amp;amp;= \sum_a \!\left[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s)\, \nabla q_\pi(s,a)\right] \\
&amp;amp;= \sum_a \!\left[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s)\, \nabla \sum_{s&apos;,r} p(s&apos;,r \vert s,a)\!\left(r - r(\boldsymbol{\theta}) + v_\pi(s&apos;)\right)\right] \\
&amp;amp;= \sum_a \!\left[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s)\!\left[-\nabla r(\boldsymbol{\theta}) + \sum_{s&apos;} p(s&apos; \vert s,a)\, \nabla v_\pi(s&apos;)\right]\right]
\end{align*}\]

\[\nabla r(\boldsymbol{\theta}) = \sum_a \!\left[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s) \sum_{s&apos;} p(s&apos; \vert s,a)\, \nabla v_\pi(s&apos;)\right] - \nabla v_\pi(s)\]

&lt;p&gt;Since $\nabla J(\boldsymbol{\theta})$ does not depend on $s$, we sum over all $s \in \mathcal{S}$ weighted by $\mu(s)$ (because $\sum_s \mu(s) = 1$):&lt;/p&gt;

\[\begin{align*}
\nabla J(\boldsymbol{\theta}) &amp;amp;= \sum_s \mu(s) \Bigl(\sum_a \Bigl[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s) \sum_{s&apos;} p(s&apos; \vert s,a)\, \nabla v_\pi(s&apos;)\Bigr] - \nabla v_\pi(s)\Bigr) \\
&amp;amp;= \sum_s \mu(s) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a) \\
&amp;amp;\quad + \sum_{s&apos;} \underbrace{\sum_s \mu(s) \sum_a \pi(a \vert s)\, p(s&apos; \vert s,a)}_{\mu(s&apos;)} \nabla v_\pi(s&apos;) - \sum_s \mu(s)\, \nabla v_\pi(s) \\
&amp;amp;= \sum_s \mu(s) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a) + \sum_{s&apos;} \mu(s&apos;)\, \nabla v_\pi(s&apos;) - \sum_s \mu(s)\, \nabla v_\pi(s) \\
&amp;amp;= \sum_s \mu(s) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a) \qquad \text{Q.E.D.}
\end{align*}\]

  &lt;/div&gt;
&lt;/div&gt;

&lt;h4 id=&quot;actor-critic-with-eligibility-traces-continuing-for-estimating-pi_boldsymboltheta-approx-pi_&quot;&gt;Actor-Critic with Eligibility Traces (Continuing), for estimating $\pi_{\boldsymbol{\theta}} \approx \pi_*$&lt;/h4&gt;

\[\boxed{
\begin{aligned}
&amp;amp;\textbf{Inputs: } \pi(a \vert s, \boldsymbol{\theta}),\ \hat{v}(s, \mathbf{w}) \\
&amp;amp;\textbf{Parameters: } \lambda^{\mathbf{w}} \in [0,1],\ \lambda^{\boldsymbol{\theta}} \in [0,1],\ \alpha^{\mathbf{w}} &amp;gt; 0,\ \alpha^{\boldsymbol{\theta}} &amp;gt; 0,\ \alpha^{\bar{R}} &amp;gt; 0 \\
&amp;amp;\textbf{Initialize } \bar{R} \in \mathbb{R} \text{ (e.g., to 0)},\ \mathbf{w} \in \mathbb{R}^d\ \&amp;amp;\ \boldsymbol{\theta} \in \mathbb{R}^{d&apos;} \text{ (e.g., to } \mathbf{0}\text{)},\ S \in \mathcal{S} \text{ (e.g., to } s_0\text{)} \\
&amp;amp;\mathbf{z}^{\mathbf{w}} \leftarrow \mathbf{0};\ \mathbf{z}^{\boldsymbol{\theta}} \leftarrow \mathbf{0} \\
&amp;amp;\textbf{Loop forever } \text{(for each time step):} \\
&amp;amp;\quad A \sim \pi(\cdot \vert S, \boldsymbol{\theta}) \\
&amp;amp;\quad \text{Take action } A\text{, observe } S&apos;, R \\
&amp;amp;\quad \delta \leftarrow R - \bar{R} + \hat{v}(S&apos;, \mathbf{w}) - \hat{v}(S, \mathbf{w}) \\
&amp;amp;\quad \bar{R} \leftarrow \bar{R} + \alpha^{\bar{R}}\, \delta \\
&amp;amp;\quad \mathbf{z}^{\mathbf{w}} \leftarrow \lambda^{\mathbf{w}} \mathbf{z}^{\mathbf{w}} + \nabla \hat{v}(S, \mathbf{w}) \\
&amp;amp;\quad \mathbf{z}^{\boldsymbol{\theta}} \leftarrow \lambda^{\boldsymbol{\theta}} \mathbf{z}^{\boldsymbol{\theta}} + \nabla \ln \pi(A \vert S, \boldsymbol{\theta}) \\
&amp;amp;\quad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \mathbf{z}^{\mathbf{w}} \\
&amp;amp;\quad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}}\, \delta\, \mathbf{z}^{\boldsymbol{\theta}} \\
&amp;amp;\quad S \leftarrow S&apos;
\end{aligned}
}\]
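
&lt;p&gt;A rough sketch of a single time step of this continuing (average-reward) loop, reusing the hypothetical helpers from the episodic sketches; note there is no discounting, and the average-reward estimate $\bar{R}$ is updated from the same TD error:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def actor_critic_continuing_step(theta, w, z_theta, z_w, R_bar, s, env, x, x_s, actions,
                                 lam_theta=0.9, lam_w=0.9,
                                 alpha_theta=1e-3, alpha_w=1e-2, alpha_rbar=1e-2):
    # one time step of the continuing (average-reward) actor-critic
    a = np.random.choice(actions, p=softmax_policy(theta, x, s, actions))
    s_next, r = env.step(a)                                 # continuing task: no terminal flag
    delta = r - R_bar + w @ x_s(s_next) - w @ x_s(s)        # differential TD error
    R_bar = R_bar + alpha_rbar * delta
    z_w = lam_w * z_w + x_s(s)                              # no discounting in the traces
    z_theta = lam_theta * z_theta + grad_log_pi(theta, x, s, a, actions)
    w = w + alpha_w * delta * z_w
    theta = theta + alpha_theta * delta * z_theta
    return theta, w, z_theta, z_w, R_bar, s_next
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;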

&lt;hr /&gt;

&lt;h2 id=&quot;137-policy-parametrization-for-continuous-actions&quot;&gt;13.7 Policy Parametrization for Continuous Actions&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Policy gradient methods enable us to handle large (and even continuous) action spaces by learning the statistics of the probability distribution instead of computing learned probabilities for each of the many actions.&lt;/li&gt;
  &lt;li&gt;The probability density function for the normal distribution is conventionally written as:&lt;/li&gt;
&lt;/ul&gt;

\[p(x) \doteq \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]

\[\begin{aligned}
\text{where} \\
\mu &amp;amp;\equiv \text{mean of the normal distribution} \\
\sigma &amp;amp;\equiv \text{standard deviation of the normal distribution}
\end{aligned}\]

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch13-13-7-prob-density-function.png&quot; alt=&quot;Probability Density Function&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Probability Density Function (PDF)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;The probability density functions for several different means and standard deviations are shown above.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;$p(x)$ is the &lt;strong&gt;density&lt;/strong&gt; of the probability at $x$, not the probability itself. It can be greater than 1; it is the total area under the curve that must equal 1.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To get the probability of $x$ falling within a range, take the integral under $p(x)$ for that specific range of $x$ values.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;The policy parametrization, defining the policy as the normal probability density over a real-valued scalar action $a$ with mean &amp;amp; standard deviation given by state-dependent, parametric function approximators, is as follows:&lt;/li&gt;
&lt;/ul&gt;

\[\pi(a \vert s, \boldsymbol{\theta}) \doteq \frac{1}{\sigma(s, \boldsymbol{\theta})\sqrt{2\pi}} \exp\!\left(-\frac{(a - \mu(s, \boldsymbol{\theta}))^2}{2\sigma(s, \boldsymbol{\theta})^2}\right)\]

\[\begin{aligned}
\text{where} \\
\mu &amp;amp;: S \times \mathbb{R}^{d&apos;} \to \mathbb{R} \equiv \text{mean function approximator} \\
\sigma &amp;amp;: S \times \mathbb{R}^{d&apos;} \to \mathbb{R}^{+} \equiv \text{standard deviation function approximator}
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;The approximators need a representation form, so we split the policy’s parameter vector into 2 parts, $\boldsymbol{\theta} = [\boldsymbol{\theta}_{\mu}, \boldsymbol{\theta}_{\sigma}]^T$, one for mean approximation and the other for standard deviation approximation.&lt;/li&gt;
  &lt;li&gt;The mean can be approximated as a linear function while the standard deviation can be approximated as the exponential of a linear function:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
\mu(s, \boldsymbol{\theta}) &amp;amp;\doteq \boldsymbol{\theta}_\mu^T \mathbf{x}_\mu(s) \\
\sigma(s, \boldsymbol{\theta}) &amp;amp;\doteq \exp\!\left(\boldsymbol{\theta}_\sigma^T \mathbf{x}_\sigma(s)\right)
\end{aligned}\]
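
&lt;p&gt;A small NumPy sketch of this Gaussian parametrization together with its log-probability gradients (the eligibility terms used in the policy-gradient updates); the state-feature functions &lt;code&gt;x_mu&lt;/code&gt; and &lt;code&gt;x_sigma&lt;/code&gt; are hypothetical placeholders.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x_mu, x_sigma, s):
    # mu(s, theta) = theta_mu^T x_mu(s);  sigma(s, theta) = exp(theta_sigma^T x_sigma(s))
    mu = theta_mu @ x_mu(s)
    sigma = np.exp(theta_sigma @ x_sigma(s))
    a = np.random.normal(mu, sigma)                 # sample a real-valued scalar action
    return a, mu, sigma

def gaussian_grad_log_pi(a, mu, sigma, x_mu, x_sigma, s):
    # grad wrt theta_mu:    (a - mu) / sigma^2 * x_mu(s)
    # grad wrt theta_sigma: ((a - mu)^2 / sigma^2 - 1) * x_sigma(s)
    g_mu = (a - mu) / sigma ** 2 * x_mu(s)
    g_sigma = ((a - mu) ** 2 / sigma ** 2 - 1.0) * x_sigma(s)
    return g_mu, g_sigma
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;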

&lt;hr /&gt;

&lt;h2 id=&quot;138-summary&quot;&gt;13.8 Summary&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;This chapter moved from action-value methods to &lt;strong&gt;parametrized policy&lt;/strong&gt; methods that select actions without consulting action-value estimates.&lt;/li&gt;
  &lt;li&gt;More specifically, policy gradient methods update the policy parameter on each step in the direction of an estimate of the performance gradient w.r.t. the policy parameter.&lt;/li&gt;
  &lt;li&gt;Advantages of &lt;strong&gt;parametrized policy&lt;/strong&gt; methods over $\varepsilon$-greedy &amp;amp; action-value methods:
    &lt;ul&gt;
      &lt;li&gt;They can learn specific probabilities for taking actions.&lt;/li&gt;
      &lt;li&gt;They can learn appropriate levels of exploration &amp;amp; approach deterministic policies asymptotically.&lt;/li&gt;
      &lt;li&gt;They can naturally handle continuous action spaces.&lt;/li&gt;
      &lt;li&gt;Important theoretical advantage over action-value methods in the form of the &lt;strong&gt;policy gradient theorem&lt;/strong&gt;.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;REINFORCE&lt;/strong&gt; uses the policy gradient theorem with Monte Carlo returns.&lt;/li&gt;
  &lt;li&gt;Addition of a state-value function as a &lt;strong&gt;baseline&lt;/strong&gt; with REINFORCE reduces its variance without introducing bias and speeds up learning.&lt;/li&gt;
  &lt;li&gt;If the state-value function is used to criticize/assess the policy’s action selections, then the value function is called a &lt;strong&gt;critic&lt;/strong&gt; and the policy is called an &lt;strong&gt;actor&lt;/strong&gt;. Overall this is referred to as the &lt;strong&gt;actor-critic method&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;The critic introduces &lt;strong&gt;bias&lt;/strong&gt; into the actor’s gradient estimates, but is still often desirable for the same reason that bootstrapping TD methods are superior to Monte Carlo methods (significant variance reduction).&lt;/li&gt;
  &lt;li&gt;Overall, policy-gradient methods provide a significantly different set of strengths &amp;amp; weaknesses than action-value methods.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026RLsuttonBartoCh13notes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Sutton &amp;amp; Barto, Ch. 13: Policy Gradient Methods (Personal Notes)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;May&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/05/07/rl-sutton-barto-notes-ch013/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;
</description>
        <pubDate>Thu, 07 May 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/05/07/rl-sutton-barto-notes-ch013/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/05/07/rl-sutton-barto-notes-ch013/</guid>
        
        
      </item>
    
      <item>
        <title>Transformers</title>
        <description>&lt;ul&gt;
  &lt;li&gt;Transformers are a &lt;strong&gt;sequence-to-sequence model&lt;/strong&gt;: given an input sequence, produce an output sequence.&lt;/li&gt;
  &lt;li&gt;Architecture: an &lt;strong&gt;Encoder&lt;/strong&gt; processes the input; a &lt;strong&gt;Decoder&lt;/strong&gt; generates the output autoregressively.&lt;/li&gt;
&lt;/ul&gt;

\[\text{(En) &quot;I am sorry&quot;} \xrightarrow{\text{Encoder}} \xrightarrow{\text{Decoder}} \texttt{&amp;lt;start&amp;gt;}\ \text{Je suis désolé}\ \texttt{&amp;lt;end&amp;gt;}\]

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Autoregressive&lt;/strong&gt;: the decoder generates one token at a time, conditioning on all previously generated tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#input-text-sequence-representation&quot;&gt;Input Text Sequence Representation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#encoders&quot;&gt;Encoders&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#from-mlp-to-attention&quot;&gt;From MLP to Attention&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#self-attention--multi-head-attention&quot;&gt;Self-Attention &amp;amp; Multi-Head Attention&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#decoders&quot;&gt;Decoders&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#masked-attention&quot;&gt;Masked Attention&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#encoder-decoder-cross-attention&quot;&gt;Encoder-Decoder Cross Attention&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;input-text-sequence-representation&quot;&gt;Input Text Sequence Representation&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/transformers/enc-dec.png&quot; alt=&quot;encoder-decoder&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;tokenization&quot;&gt;Tokenization&lt;/h3&gt;

&lt;p&gt;Two approaches to representing input text:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. One-hot encoding&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;No semantic similarity or meaning of words encoded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Token embedding&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Encodes semantic similarity between words.&lt;/li&gt;
  &lt;li&gt;Embedding matrix is &lt;strong&gt;learned&lt;/strong&gt; (Lookup Table).&lt;/li&gt;
  &lt;li&gt;Each token embedding is stored as a &lt;strong&gt;column vector&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/transformers/token-embedding.png&quot; alt=&quot;token embedding&quot; /&gt;&lt;/p&gt;

\[\underbrace{\begin{bmatrix} 0.5 \\ 2.7 \\ 1.2 \\ -0.2 \end{bmatrix}}_{d} = W_E \underbrace{\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}}_{\text{\# tokens}}\]

\[\begin{aligned}
\text{where} \\
W_E &amp;amp;\equiv \text{embedding matrix, } d \times \text{\#tokens} \\
d &amp;amp;\equiv \text{embedding dimension}
\end{aligned}\]
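
&lt;p&gt;A tiny NumPy illustration of the lookup: multiplying the learned matrix $W_E$ by a one-hot token vector just selects that token’s column (the sizes below are made up).&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

d, n_tokens = 4, 10                       # embedding dim, vocabulary size (made-up sizes)
W_E = np.random.randn(d, n_tokens)        # learned embedding matrix, one column per token

token_id = 3
one_hot = np.zeros(n_tokens)
one_hot[token_id] = 1.0

# multiplying by a one-hot vector selects column token_id of W_E
embedding = W_E @ one_hot
assert np.allclose(embedding, W_E[:, token_id])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;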

&lt;h3 id=&quot;why-we-need-context&quot;&gt;Why We Need Context&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Many words have different meanings in different contexts:
    &lt;ul&gt;
      &lt;li&gt;“I bought an &lt;strong&gt;apple&lt;/strong&gt; &amp;amp; an orange”&lt;/li&gt;
      &lt;li&gt;“I bought an &lt;strong&gt;apple&lt;/strong&gt; watch”&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;We need to rely on &lt;strong&gt;context&lt;/strong&gt; to resolve the ambiguity.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;encoders&quot;&gt;Encoders&lt;/h2&gt;

&lt;p&gt;The encoder pipeline:&lt;/p&gt;

\[\text{Input} \to \text{token} + \text{POS embed} \to \text{Norm} \to \text{MHA(self)} \to \text{Add} \to \text{Norm} \to \text{FFN} \to \text{Add} \to \text{Output}\]

&lt;ul&gt;
  &lt;li&gt;Input tokens are embedded using $W_E$ and combined with &lt;strong&gt;positional encodings&lt;/strong&gt; to produce the input matrix $X$.&lt;/li&gt;
  &lt;li&gt;The architecture stacks: Multi-Head Self-Attention + residual, then FeedForward Network + residual.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/transformers/encoder.png&quot; alt=&quot;encoder&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;from-mlp-to-attention&quot;&gt;From MLP to Attention&lt;/h2&gt;

&lt;h3 id=&quot;mlp-only&quot;&gt;MLP only&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;MLP stands for Multilayer Perceptron&lt;/li&gt;
  &lt;li&gt;No contextual information; each token is processed independently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;concatenation-of-nearby-token-embeddings-before-mlp&quot;&gt;Concatenation of nearby token embeddings before MLP&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Need a sufficiently large window to cover the entire input sequence.&lt;/li&gt;
  &lt;li&gt;Cannot handle variable sequence lengths.&lt;/li&gt;
  &lt;li&gt;Requires many model parameters.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;attention&quot;&gt;Attention&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Use &lt;strong&gt;token similarity&lt;/strong&gt; to determine the relevance of each token to every other token by performing a dot product.&lt;/li&gt;
  &lt;li&gt;Allows the model to dynamically weight which parts of the input are relevant for each position.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;self-attention--multi-head-attention&quot;&gt;Self-Attention &amp;amp; Multi-Head Attention&lt;/h2&gt;

&lt;h3 id=&quot;self-attention&quot;&gt;Self-Attention&lt;/h3&gt;

\[Q = XW_Q, \quad K = XW_K, \quad V = XW_V\]

\[\text{head}_i = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V = \text{Attention}(Q, K, V)\]

&lt;h3 id=&quot;multi-head-attention-mha&quot;&gt;Multi-Head Attention (MHA)&lt;/h3&gt;

\[\text{MHA}(X) = \text{multi-head}(Q, K, V) = \text{Concat}(h_1, h_2, \ldots, h_H)\, W_O = Z\]

\[\begin{aligned}
\text{where} \\
h_i &amp;amp;\equiv i\text{-th attention head} \\
W_O &amp;amp;\equiv \text{output projection matrix} \\
Z &amp;amp;\equiv \text{final output}
\end{aligned}\]
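
&lt;p&gt;A minimal NumPy sketch of a single self-attention head following these equations; it assumes $X$ holds one row per token, and the sizes and random weights are illustrative only.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def self_attention_head(X, W_Q, W_K, W_V):
    # one head: project the token matrix X (one row per token) into Q, K, V
    return attention(X @ W_Q, X @ W_K, X @ W_V)

n, d, d_k = 5, 8, 4                                   # tokens, model dim, head dim (made up)
X = np.random.randn(n, d)
W_Q, W_K, W_V = (np.random.randn(d, d_k) for _ in range(3))
print(self_attention_head(X, W_Q, W_K, W_V).shape)    # (5, 4)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;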

&lt;hr /&gt;

&lt;h2 id=&quot;decoders&quot;&gt;Decoders&lt;/h2&gt;

&lt;p&gt;The decoder pipeline:&lt;/p&gt;

\[\text{Masked MHA} \to \text{Cross-Attn} \to \text{FFN} \to \text{Linear} \to \text{Softmax} \to \text{Output}\]

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/transformers/decoder.png&quot; alt=&quot;Decoder&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The decoder takes as input the previously generated tokens $(z_i)$ along with their positional encodings $(p_i)$, and at each step attends both to itself (masked) and to the encoder output (cross attention).&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;masked-attention&quot;&gt;Masked Attention&lt;/h2&gt;

&lt;p&gt;In the decoder, we use &lt;strong&gt;masked&lt;/strong&gt; (causal) self-attention to prevent the decoder from attending to future tokens:&lt;/p&gt;

\[\text{Masked Attn}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V\]

\[\begin{aligned}
\text{where} \\
M &amp;amp;\equiv \text{lookahead mask}
\end{aligned}\]

\[M = \begin{bmatrix}
0 &amp;amp; -\infty &amp;amp; -\infty &amp;amp; -\infty \\
0 &amp;amp; 0 &amp;amp; -\infty &amp;amp; -\infty \\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; -\infty \\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0
\end{bmatrix}\]

&lt;p&gt;Adding $-\infty$ to future positions drives their softmax weights to zero, ensuring position $i$ can only attend to positions $\leq i$.&lt;/p&gt;
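
&lt;p&gt;A small NumPy sketch of building this lookahead mask and applying it, reusing the &lt;code&gt;softmax&lt;/code&gt; helper from the attention sketch above:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def lookahead_mask(n):
    # 0 on and below the diagonal, -inf above it (future positions)
    M = np.zeros((n, n))
    M[np.triu_indices(n, k=1)] = -np.inf
    return M

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + lookahead_mask(Q.shape[0])
    return softmax(scores, axis=-1) @ V   # softmax as defined in the attention sketch

print(lookahead_mask(4))                  # matches the 4x4 mask M shown above
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;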

&lt;hr /&gt;

&lt;h2 id=&quot;encoder-decoder-cross-attention&quot;&gt;Encoder-Decoder Cross Attention&lt;/h2&gt;

&lt;p&gt;The decoder queries the encoder output $E$ to incorporate source context:&lt;/p&gt;

\[\begin{aligned}
Y&apos; &amp;amp;= Y + \text{Masked MHA}(\text{Norm}(Y)) \\[6pt]
Q &amp;amp;= Y&apos; W_Q^{\text{dec}} \\
K^{\text{enc}} &amp;amp;= E W_K^{\text{enc}} \\
V^{\text{enc}} &amp;amp;= E W_V^{\text{enc}}
\end{aligned}\]

\[\begin{aligned}
\text{where} \\
E &amp;amp;\equiv \text{encoder output}
\end{aligned}\]

\[\text{Cross Attn}(Y&apos;, E) = \text{Softmax}\!\left(\frac{Q (K^{\text{enc}})^T}{\sqrt{d_k}}\right)V^{\text{enc}}\]

&lt;p&gt;Then the rest of the decoder:&lt;/p&gt;

\[\begin{aligned}
Y&apos;&apos; &amp;amp;= Y&apos; + \text{Cross Attn}(\text{Norm}(Y&apos;),\ E) \\
Y^O &amp;amp;= Y&apos;&apos; + \text{FFN}(\text{Norm}(Y&apos;&apos;)) \\
D &amp;amp;= Y^O \\
\text{logits} &amp;amp;= DW^{\text{out}} + b \\
P(y_t) &amp;amp;= \text{Softmax}(\text{logits}_t)
\end{aligned}\]

\[\begin{aligned}
\text{where} \\
D &amp;amp;\equiv \text{decoder output} \\
W^{\text{out}} &amp;amp;\equiv \text{output projection to vocabulary} \\
P(y_t) &amp;amp;\equiv \text{probability distribution over next token}
\end{aligned}\]
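
&lt;p&gt;Putting these pieces together, a rough NumPy sketch of one pre-norm decoder block; it reuses the &lt;code&gt;attention&lt;/code&gt; and &lt;code&gt;masked_attention&lt;/code&gt; functions from the sketches above, and all projection matrices are assumed square ($d \times d$) so the residual additions line up.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def layer_norm(X, eps=1e-5):
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def ffn(X, W1, b1, W2, b2):
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2           # simple ReLU MLP

def decoder_block(Y, E, Wq, Wk, Wv, Wq_x, Wk_x, Wv_x, W1, b1, W2, b2):
    # masked self-attention sub-layer with residual
    Yn = layer_norm(Y)
    Y1 = Y + masked_attention(Yn @ Wq, Yn @ Wk, Yn @ Wv)
    # cross-attention: queries from the decoder, keys and values from encoder output E
    Y1n = layer_norm(Y1)
    Y2 = Y1 + attention(Y1n @ Wq_x, E @ Wk_x, E @ Wv_x)
    # position-wise feed-forward sub-layer with residual
    return Y2 + ffn(layer_norm(Y2), W1, b1, W2, b2)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;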

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026transformers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Transformers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Apr&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/04/17/transformers/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;
</description>
        <pubDate>Fri, 17 Apr 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/04/17/transformers/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/04/17/transformers/</guid>
        
        
      </item>
    
      <item>
        <title>SAM 2: Segment Anything in Images &amp; Videos</title>
        <description>&lt;ul&gt;
  &lt;li&gt;Meta’s unified model for promptable image and video segmentation.
    &lt;ul&gt;
      &lt;li&gt;A foundation model for solving promptable visual segmentation in images &amp;amp; &lt;strong&gt;videos&lt;/strong&gt;.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Built a data engine to collect the largest video segmentation dataset to date.&lt;/li&gt;
  &lt;li&gt;&lt;u&gt;Model&lt;/u&gt;: Simple transformer architecture with &lt;strong&gt;streaming memory&lt;/strong&gt; for real-time video processing.&lt;/li&gt;
  &lt;li&gt;Trained on a wide range of tasks: video segmentation and image segmentation.&lt;/li&gt;
  &lt;li&gt;The paper can be found &lt;a href=&quot;https://arxiv.org/pdf/2408.00714&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#1-introduction&quot;&gt;1. Introduction&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#2-related-work&quot;&gt;2. Related Work&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#3-task-promptable-visual-segmentation-pvs&quot;&gt;3. Task: Promptable Visual Segmentation (PVS)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#4-model&quot;&gt;4. Model&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#5-data&quot;&gt;5. Data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#6-zero-shot-experiments&quot;&gt;6. Zero-Shot Experiments&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#7-comparison-to-sota-in-semi-supervised-vos&quot;&gt;7. Comparison to SOTA in Semi-Supervised VOS&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#8-conclusion&quot;&gt;8. Conclusion&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#discussion&quot;&gt;9. Discussion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#references&quot;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why video and not image?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Image is only a static snapshot of the real world; lacks motion information (temporal).&lt;/li&gt;
  &lt;li&gt;Video captures temporal information.&lt;/li&gt;
  &lt;li&gt;Many vital applications (robotics, AR/VR, autonomous vehicles) require temporal localization beyond image-level segmentation.&lt;/li&gt;
  &lt;li&gt;A universal visual segmentation system should be applicable to both images &amp;amp; videos.&lt;/li&gt;
  &lt;li&gt;Video segmentation aims to determine the &lt;strong&gt;spatio-temporal extent&lt;/strong&gt; of entities, which presents unique challenges beyond those in images.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Significant changes in appearance&lt;/strong&gt; encountered by entities &amp;amp; &lt;strong&gt;lower quality nature of videos&lt;/strong&gt; than images present challenges for video segmentation.&lt;/li&gt;
  &lt;li&gt;SAM successfully solves image segmentation, but existing video segmentation models &amp;amp; datasets fall short in providing a comparable capability to “segment anything in videos.”&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;SAM 2&lt;/strong&gt;: A unified model for &lt;strong&gt;video &amp;amp; image&lt;/strong&gt; segmentation.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Promptable Visual Segmentation (PVS)&lt;/strong&gt;: Task that generalizes image segmentation to the video domain.&lt;/li&gt;
  &lt;li&gt;A &lt;strong&gt;data engine&lt;/strong&gt; that generates training data via an in-the-loop model with annotators and produces the &lt;strong&gt;Segment Anything Video (SA-V) dataset&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;2-related-work&quot;&gt;2. Related Work&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Video Object Segmentation (VOS)&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Video segmentation datasets&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Interactive Video Object Segmentation (iVOS)&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Image Segmentation&lt;/strong&gt; task, model and dataset
    &lt;ul&gt;
      &lt;li&gt;Research Paper: &lt;a href=&quot;https://arxiv.org/pdf/2304.02643&quot;&gt;Segment Anything (SA)&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Segment Anything (Adapted from the Paper)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;We aim to build a foundation model for segmentation by introducing three interconnected components: a promptable segmentation &lt;strong&gt;task&lt;/strong&gt;, a segmentation &lt;strong&gt;model&lt;/strong&gt; (SAM) that powers data annotation and enables zero-shot transfer to a range of tasks via prompt engineering, and a &lt;strong&gt;data&lt;/strong&gt; engine for collecting SA-1B, our dataset of over 1 billion masks.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;3-task-promptable-visual-segmentation-pvs&quot;&gt;3. Task: Promptable Visual Segmentation (PVS)&lt;/h2&gt;

\[\text{PVS} \longrightarrow \text{SAM 2} \longrightarrow \text{SA-V dataset}\]

&lt;ul&gt;
  &lt;li&gt;PVS task allows providing prompts to the model on &lt;strong&gt;any frame&lt;/strong&gt; of a video.&lt;/li&gt;
  &lt;li&gt;Interactive segmentation with SAM 2 involves the steps below:
    &lt;ul&gt;
      &lt;li&gt;SAM 2 is prompted on a single frame and responds instantly with a valid segmentation mask of the target object on this frame.&lt;/li&gt;
      &lt;li&gt;SAM 2 then propagates the target object’s segment to multiple frames to form a &lt;strong&gt;masklet&lt;/strong&gt;.&lt;/li&gt;
      &lt;li&gt;Multiple initial prompts are received and propagated by the model to obtain the masklet of the object &lt;strong&gt;across the entire video&lt;/strong&gt;, which leads to localization of the segmentation mask of the target on every single video frame.&lt;/li&gt;
      &lt;li&gt;Additional prompts on &lt;strong&gt;any&lt;/strong&gt; frame can be added to SAM 2 for segmentation mask refinement.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;SAM 2 is applied as a data collection tool to the PVS task for building the SA-V dataset.&lt;/li&gt;
  &lt;li&gt;The model is evaluated by simulating interactive video segmentation across multiple frames, in the conventional semi-supervised VOS setting (prompts limited to the first frame), and on the SA benchmarks for image segmentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;4-model&quot;&gt;4. Model&lt;/h2&gt;

&lt;p&gt;SAM 2 is a generalization of SAM to the video (&amp;amp; image) domain. It takes point, box &amp;amp; mask prompts on individual frames to define the &lt;strong&gt;spatial extent&lt;/strong&gt; of the object to be segmented &lt;strong&gt;spatio-temporally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/sam-2/sam-2.png&quot; alt=&quot;SAM 2 Architecture&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;small class=&quot;text-muted d-block text-center&quot;&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; The SAM 2 architecture. For a given frame, the segmentation prediction is conditioned on the current prompt and/or on previously observed memories. Video frames are processed in a streaming fashion by the image encoder, cross-attended to memories of the target object from previous frames stored in the memory bank, and decoded via the mask decoder (optionally prompted by the prompt encoder) to predict the segmentation mask for that frame. Finally, a memory encoder transforms the prediction and image encoder embeddings for use in future frames.&lt;/small&gt;&lt;/p&gt;

&lt;!-- 
&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;SAM 2 Architecture&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;For a given frame, the segmentation prediction is conditioned on the current prompt and/or on previously observed memories. Video frames are processed in a streaming fashion by the image encoder, cross-attended to memories of the target object from previous frames stored in the memory bank, and decoded via the mask decoder (optionally prompted by the prompt encoder) to predict the segmentation mask for that frame. Finally, a memory encoder transforms the prediction and image encoder embeddings for use in future frames.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt; --&gt;

&lt;p&gt;&lt;strong&gt;Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Image encoder&lt;/strong&gt;: For real-time processing of arbitrarily long videos via a streaming, hierarchical approach.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Memory attention&lt;/strong&gt;: &lt;em&gt;Conditions&lt;/em&gt; the current frame features on past frames’ features and predictions, as well as on any new prompts.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Prompt encoder &amp;amp; mask decoder&lt;/strong&gt;: Encode the input prompts and pass them to the mask decoder, which predicts the segmentation mask for the current frame.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Memory encoder&lt;/strong&gt;: Generates a memory by downsampling the output mask and fusing it with the image encoder embeddings via element-wise summation followed by light-weight convolutions.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Memory bank&lt;/strong&gt;: Stores spatial memory information (from prompts) about past predictions for the target object in the video, and high-level semantic information of the object to segment based on each frame’s mask decoder output tokens (a rough sketch of how these pieces fit together follows this list).&lt;/li&gt;
&lt;/ul&gt;
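
&lt;p&gt;Reading the components above as a pipeline, the per-frame flow can be summarized roughly as follows. This is my own conceptual sketch; the callables &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image_encoder&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory_attention&lt;/code&gt;, etc. are placeholders for the real modules, not the actual SAM 2 API:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from collections import deque

def sam2_streaming_inference(frames, prompts, image_encoder, prompt_encoder,
                             memory_attention, mask_decoder, memory_encoder,
                             num_recent=6):
    # Hypothetical skeleton of the per-frame flow; the five callables stand in
    # for the real SAM 2 modules and are assumptions, not the released API.
    recent_memories = deque(maxlen=num_recent)   # FIFO over recent frames
    prompted_memories = []                       # memories from prompted frames
    masks = []
    for t, frame in enumerate(frames):
        feats = image_encoder(frame)             # streaming image encoder
        feats = memory_attention(feats, list(recent_memories) + prompted_memories)
        prompt_emb = prompt_encoder(prompts.get(t))        # None if no prompt on frame t
        mask, obj_ptr = mask_decoder(feats, prompt_emb)    # mask + object pointer
        memory = memory_encoder(mask, feats)               # fuse prediction with features
        target = prompted_memories if t in prompts else recent_memories
        target.append((memory, obj_ptr))
        masks.append(mask)
    return masks
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;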

&lt;p&gt;&lt;strong&gt;Training:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model is trained &lt;strong&gt;jointly&lt;/strong&gt; on image and video data by simulating interactive prompting: sequences of 8 frames are sampled, 2 of them are randomly selected to receive prompts, and the model is trained to sequentially and interactively predict the ground-truth masklet.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;5-data&quot;&gt;5. Data&lt;/h2&gt;

&lt;p&gt;The data engine – an interactive, model-in-the-loop setup with human annotators – was built to collect a large and diverse video segmentation dataset and develop the “segment anything” capability in video. The data engine went through 3 phases, each defined by the level of model assistance provided to annotators.&lt;/p&gt;

&lt;h3 id=&quot;51-data-engine&quot;&gt;5.1 Data Engine&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Phase 1:&lt;/strong&gt; SAM per frame&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Phase 2:&lt;/strong&gt; SAM + SAM 2&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Phase 3:&lt;/strong&gt; SAM 2&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Quality verification&lt;/strong&gt; including a separate set of QA annotators for the masklets (satisfactory or unsatisfactory)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Auto masklet generation&lt;/strong&gt; enables the &lt;em&gt;“anything capability”&lt;/em&gt; of the model by ensuring annotation diversity.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Analysis 1:&lt;/strong&gt; A controlled experiment comparing the data engine phases on the average annotation time per frame, the average percentage of manually edited frames per masklet, and the average number of clicks per clicked frame. For QA, the defined metric is the &lt;strong&gt;Phase 1 Mask Alignment Score&lt;/strong&gt;: the percentage of masks whose IoU with the corresponding masks from Phase 1 (the highest-quality manual annotations) exceeds 0.75 (a tiny sketch of this metric follows the list).
&lt;!-- &gt; Phase 3 is **8.4×** faster than Phase 1, has the lowest edited frame percentage and clicks per frame, and results in better alignment. --&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Analysis 2:&lt;/strong&gt; Performance comparison of SAM 2 trained on the data available at the end of each phase, keeping the number of training iterations fixed, thereby measuring solely the impact of the additional data. Evaluation is done on the SA-V val dataset and 9 zero-shot benchmarks using the standard &lt;em&gt;$J\&amp;amp;F$&lt;/em&gt; accuracy metric. 
&lt;!-- &gt;Consistent segmentation accuracy improvement from iteratively adding data from each data engine phase (1, 2 &amp; 3) for both SA-V val set and 9 zero-shot benchmarks is observed. --&gt;&lt;/li&gt;
&lt;/ul&gt;
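
&lt;p&gt;The Phase 1 Mask Alignment Score above is easy to state in code. A minimal NumPy sketch of my own (only the 0.75 IoU threshold comes from the description above; the function names are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def iou(mask_a, mask_b):
    # Intersection-over-union of two boolean masks.
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union &gt; 0 else 1.0

def phase1_alignment_score(masks, phase1_masks, thresh=0.75):
    # Percentage of masks whose IoU against the Phase 1 reference exceeds the threshold.
    hits = [iou(m, ref) &gt; thresh for m, ref in zip(masks, phase1_masks)]
    return 100.0 * np.mean(hits)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;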

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Analysis Results&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;&lt;strong&gt;Phase 3 is 8.4× faster than Phase 1,&lt;/strong&gt; has the lowest edited frame percentage and clicks per frame, and results in better alignment.&lt;br /&gt; 
&lt;strong&gt;Consistent segmentation accuracy improvement&lt;/strong&gt; from iteratively adding data from each data engine phase (1, 2 &amp;amp; 3) for both SA-V val set and 9 zero-shot benchmarks is observed.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;52-sa-v-dataset&quot;&gt;5.2 SA-V Dataset&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Videos (&lt;strong&gt;50.9K total, 54% indoor + 46% outdoor scenes, average duration of 14 secs&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;Masklets (&lt;strong&gt;190.9K manual + 451.7K automatic, 642.6K total&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;SA-V training, validation (&lt;strong&gt;293 masklets &amp;amp; 155 videos&lt;/strong&gt;) &amp;amp; test splits (&lt;strong&gt;278 masklets &amp;amp; 150 videos&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;Internal dataset (&lt;strong&gt;62.9K videos &amp;amp; 69.6K masklets&lt;/strong&gt; annotated in Phase &lt;em&gt;2 &amp;amp; 3&lt;/em&gt; for training; &lt;strong&gt;96 videos &amp;amp; 189 masklets&lt;/strong&gt; annotated using Phase &lt;em&gt;1&lt;/em&gt; for testing)&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;6-zero-shot-experiments&quot;&gt;6. Zero-Shot Experiments&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Compare SAM 2 with previous work on zero-shot video &amp;amp; image tasks using the &lt;em&gt;$J\&amp;amp;F$&lt;/em&gt; accuracy metric.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;61-promptable-video-segmentation-pvs&quot;&gt;6.1 Promptable Video Segmentation (PVS)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;First evaluate PVS, which involves simulating an interactive setting that is akin to the user experience. Both evaluation settings below are done on 9 densely annotated zero-shot video datasets using $N_{click} = 3$ clicks per frame and are compared to 2 strong baselines based on 2 SOTA VOS models (&lt;strong&gt;XMem++ &amp;amp; Cutie&lt;/strong&gt;)
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Offline evaluation:&lt;/strong&gt; multiple passes are made through the video, each time selecting the frame with the largest model error to prompt next.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Online evaluation:&lt;/strong&gt; a single forward pass through the video, annotating frames in order.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;62-semi-supervised-video-object-segmentation&quot;&gt;6.2 Semi-Supervised Video Object Segmentation&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Evaluate the semi-supervised VOS setting with click, box or mask prompts only on the &lt;strong&gt;1st frame&lt;/strong&gt; of the video.&lt;/li&gt;
  &lt;li&gt;For click prompts, try either 1, 3, or 5 clicks on the 1st video frame (mIoU).&lt;/li&gt;
  &lt;li&gt;Comparison is done via &lt;em&gt;$J\&amp;amp;F$&lt;/em&gt; accuracy between SAM + XMem++, SAM + Cutie and SAM 2.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;63-image-segmentation&quot;&gt;6.3 Image Segmentation&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Evaluate SAM 2 on the Segment Anything task across &lt;strong&gt;37 zero-shot datasets&lt;/strong&gt; using 1-click and 5-click mIoUs.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;7-comparison-to-sota-in-semi-supervised-vos&quot;&gt;7. Comparison to SOTA in Semi-Supervised VOS&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Evaluate 2 versions of SAM 2 with different image encoder sizes, offering different speed-vs-accuracy tradeoffs.&lt;/li&gt;
  &lt;li&gt;Comparison with existing SOTA models/methods via accuracy (&lt;em&gt;$J\&amp;amp;F, G$&lt;/em&gt;) using standard protocols.
&lt;!-- &gt;SAM 2 performs well in accuracy for video segmentation based on first-frame ground-truth mask prompts.  --&gt;&lt;/li&gt;
  &lt;li&gt;Evaluate existing work on the SA-V val &amp;amp; test sets which measure performance for open-world segments of &lt;strong&gt;“any”&lt;/strong&gt; object class via &lt;em&gt;$J\&amp;amp;F$&lt;/em&gt; accuracy metric.
&lt;!-- &gt;SAM 2 performs significantly better on SA-V val/test. --&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Comparison Results&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;SAM 2 &lt;strong&gt;performs well&lt;/strong&gt; in accuracy for video segmentation based on first-frame ground-truth mask prompts.&lt;br /&gt; 
SAM 2 performs &lt;strong&gt;significantly better&lt;/strong&gt; on SA-V val/test.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;8-conclusion&quot;&gt;8. Conclusion&lt;/h2&gt;

&lt;p&gt;Present a natural extension of Segment Anything into the video domain, based on 3 key aspects:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Extending the promptable segmentation task to video.&lt;/li&gt;
  &lt;li&gt;Equipping the SAM architecture to use memory when applied to video.&lt;/li&gt;
  &lt;li&gt;The diverse SA-V dataset for training &amp;amp; benchmarking video segmentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The authors position SAM 2 as a significant advancement in visual perception, framing their contributions as milestones that will propel further research &amp;amp; applications.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;9-discussion&quot;&gt;9. Discussion&lt;/h2&gt;

&lt;p&gt;When I first read about SAM 2’s memory bank component, my immediate thought was: &lt;em&gt;this feels familiar&lt;/em&gt;. Not because it copies prior work, but because it sits at an intersection I’ve been circling for a while: &lt;em&gt;how do neural systems remember what matters, without remembering everything?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;SAM 2 maintains object identity across video frames by storing two kinds of information in its memory bank:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Spatial feature maps&lt;/strong&gt; from up to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; recent frames and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M&lt;/code&gt; prompted frames (stored in FIFO queues)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Object pointers&lt;/strong&gt;: lightweight semantic vectors derived from the mask decoder’s output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Memory attention then cross-attends over &lt;em&gt;both&lt;/em&gt; when predicting the current frame. This design elegantly sidesteps the sequential bottleneck that plagued RNNs. Instead of compressing history into a single fixed-length hidden state, SAM 2 gives the model direct, position-invariant access to past representations. The “distance” between frame 1 and frame 200 is functionally the same as between frame 199 and 200. In that sense, attention acts as a &lt;strong&gt;temporal superhighway&lt;/strong&gt;.&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Distance vs. Capacity&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;&lt;strong&gt;Key distinction&lt;/strong&gt;: RNNs encode history into a single vector (making long-range dependencies hard). SAM 2’s attention mechanism bypasses that bottleneck entirely, so distance becomes irrelevant. But attention solves the &lt;em&gt;access&lt;/em&gt; problem, not the &lt;em&gt;retention&lt;/em&gt; problem.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;And that’s where the FIFO design matters. The paper explicitly states that the memory bank evicts the oldest frame once the queue is full, &lt;em&gt;regardless of semantic importance&lt;/em&gt;.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; This validates a subtle but critical observation: SAM 2’s forgetting mechanism is a &lt;strong&gt;fixed heuristic&lt;/strong&gt;, not a learned one. The model doesn’t decide what to remember based on future tracking utility; it drops frames based on arrival time.&lt;/p&gt;
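
&lt;p&gt;The eviction behavior itself is trivial to write down, which is part of the point. A minimal sketch of my own (SAM 2 stores spatial memories and object pointers; here they are opaque objects in a plain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;deque&lt;/code&gt;):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from collections import deque

class FIFOMemoryBank:
    # Time-based eviction: once the queue is full, the oldest entry is dropped,
    # regardless of how useful it might be for re-identifying the object later.
    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)

    def add(self, memory, object_pointer):
        self.entries.append((memory, object_pointer))  # oldest entry silently evicted when full

    def visible(self):
        return list(self.entries)  # everything memory attention can see for the current frame
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;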

&lt;p&gt;This creates a tangible trade-off between &lt;strong&gt;memory availability&lt;/strong&gt; and &lt;strong&gt;diagnostic utility&lt;/strong&gt;. Consider an object that:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Undergoes dramatic lighting change&lt;/li&gt;
  &lt;li&gt;Is occluded for dozens of frames&lt;/li&gt;
  &lt;li&gt;Reappears with significant appearance deformation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The frames that best capture its initial identity might be evicted long before it reappears. Meanwhile, newer frames with degraded or ambiguous representations linger in the queue simply because they arrived later. The “object pointers” partially mitigate this by storing lightweight semantic summaries, but they’re still bound by the same FIFO eviction policy. If the pointer for frame &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; is overwritten, the model loses not just the spatial map but also its high-level anchor.&lt;/p&gt;

&lt;div class=&quot;mermaid&quot; style=&quot;margin: 2rem 0;&quot;&gt;
flowchart LR
    A[FIFO Queue&lt;br /&gt;&lt;small&gt;time-based eviction&lt;/small&gt;] --&amp;gt; B[Eviction Policy&lt;br /&gt;&lt;small&gt;fixed heuristic&lt;/small&gt;]
    B --&amp;gt; C[Memory Attention&lt;br /&gt;&lt;small&gt;position-invariant access&lt;/small&gt;]
    C --&amp;gt; D{Tracking Outcome}
    D --&amp;gt;|Object reappears&lt;br /&gt;with useful memory| E[✓ Success]
    D --&amp;gt;|Key frame evicted&lt;br /&gt;or pointer overwritten| F[✗ Failure]
    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#e8f5e9
    style E fill:#c8e6c9,stroke:#2e7d32
    style F fill:#ffcdd2,stroke:#c62828
&lt;/div&gt;

&lt;p&gt;&lt;small class=&quot;text-muted d-block text-center&quot;&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; The memory bank’s fixed eviction policy (FIFO) interacts with attention’s position-invariant access. When evicted frames contain critical identity information, tracking fails—even if attention could theoretically retrieve them.&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;The paper’s handling of temporal position encoding reinforces this pragmatic trade-off. Temporal embeddings are injected into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; recent frames to capture short-term motion, but deliberately &lt;em&gt;omitted&lt;/em&gt; from prompted frames due to sparse training signals and inference generalization concerns. This is a sound engineering decision, but it reveals a boundary: SAM 2 optimizes for stable, short-to-medium horizon tracking, not open-ended temporal reasoning.&lt;/p&gt;

&lt;h3 id=&quot;where-this-fits-in-the-broader-literature&quot;&gt;Where This Fits in the Broader Literature&lt;/h3&gt;

&lt;p&gt;SAM 2’s memory bank isn’t operating in a vacuum. It shares conceptual DNA with several prior lines of work:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Neural Turing Machines&lt;/strong&gt; introduced differentiable external memory with read/write heads, allowing networks to learn &lt;em&gt;what&lt;/em&gt; to store and &lt;em&gt;where&lt;/em&gt; to retrieve from &lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. SAM 2’s memory attention is a specialized, non-differentiable cousin: it retrieves, but doesn’t learn the eviction policy.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;RETRO&lt;/strong&gt; demonstrated that retrieval-augmented transformers can scale knowledge without scaling parameters, by querying a frozen corpus at inference &lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. SAM 2 does something analogous for video: query a frozen buffer of past frames. The open question is whether that buffer should be learned, not fixed.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;TimeSformer&lt;/strong&gt; showed that spatiotemporal attention alone can handle video understanding without recurrent components &lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. SAM 2 extends this by adding explicit memory, and also inherits TimeSformer’s assumption that all frames are equally worth attending to.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What SAM 2 adds is a &lt;em&gt;practical&lt;/em&gt; instantiation of these ideas for promptable segmentation. It’s not trying to solve general memory-augmented reasoning; it’s solving “keep this object tracked, please.” That focus is both its strength and its limitation.&lt;/p&gt;

&lt;div class=&quot;callout callout--danger&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Counter Argument to the Paper&apos;s Memory Design&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;About memory design, I’d say: &lt;em&gt;“FIFO is computationally cheap and training-stable, which makes sense for a production model. But from a research standpoint, it hardcodes a failure mode: important frames get evicted by time, not relevance. A differentiable memory controller or retrieval-augmented eviction policy could close that gap.”&lt;/em&gt;&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;why-this-matters&quot;&gt;Why This Matters&lt;/h3&gt;

&lt;p&gt;There’s a tendency in ML to treat architectural choices as purely engineering decisions. But memory management isn’t just about compute budgets; it’s also about what the model &lt;em&gt;values&lt;/em&gt;. When SAM 2 drops a frame because it’s old, not because it’s uninformative, it’s making a silent claim: &lt;em&gt;recency matters more than relevance&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That claim works well for many tracking scenarios. But it breaks down when objects reappear after long occlusions, or when appearance changes dramatically. In those cases, the model isn’t failing because attention is weak; it’s failing because the &lt;em&gt;right information was never kept around to attend to&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This isn’t unique to SAM 2. It’s a fundamental tension in any system that must balance finite memory against infinite context. But because SAM 2 is a foundation model positioned for broad adoption, its design choices will influence how thousands of downstream applications handle temporal reasoning. Getting the memory story right matters.&lt;/p&gt;

&lt;p&gt;So where does this leave us? SAM 2 is undoubtedly a milestone in promptable video segmentation. But its memory bank inadvertently frames a deeper research problem: &lt;strong&gt;attention removes the barrier of temporal distance, but leaves the bottleneck of memory management wide open&lt;/strong&gt;.&lt;/p&gt;

&lt;h3 id=&quot;open-questions-id-want-to-explore&quot;&gt;Open Questions I’d Want to Explore&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Learnable eviction&lt;/strong&gt;: Could we replace FIFO with a lightweight, content-aware, learnable eviction mechanism that predicts which frames to retain based on tracking confidence, appearance stability, or semantic salience? Would SAM 2 then maintain robust identity over arbitrarily long horizons? And what compute trade-off would that entail in moving from an engineered heuristic toward a long-context reasoning model? (A toy sketch of such a scorer follows this list.)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pointer robustness&lt;/strong&gt;: The object pointers are a clever compression trick, but they’re still overwritten by FIFO. Could we decouple pointer retention from spatial memory eviction?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cross-video retrieval&lt;/strong&gt;: RETRO retrieves from a corpus of documents; could SAM 2 retrieve from a corpus of &lt;em&gt;past videos&lt;/em&gt; to bootstrap tracking of familiar objects?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Failure diagnostics&lt;/strong&gt;: Can we design a probe that predicts &lt;em&gt;when&lt;/em&gt; SAM 2 is likely to lose an object, based on memory bank state? That would be valuable for safety-critical applications.&lt;/li&gt;
&lt;/ol&gt;
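
&lt;p&gt;To make the first question concrete, here is a toy sketch (entirely hypothetical, not from the paper) of what a content-aware retention policy could look like: score each stored memory with a small learned network and evict the lowest-scoring entry instead of the oldest one.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch
import torch.nn as nn

class RetentionScorer(nn.Module):
    # Hypothetical: maps a pooled memory feature to a scalar keep-score.
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, pooled_memories):            # (num_entries, dim)
        return self.net(pooled_memories).squeeze(-1)

def evict_least_useful(entries, scorer):
    # entries: list of (pooled_feature, payload); drop the entry the scorer values least,
    # instead of the oldest one as FIFO would.
    feats = torch.stack([feat for feat, _ in entries])
    drop = int(scorer(feats).argmin())
    return [e for i, e in enumerate(entries) if i != drop]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;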

&lt;p&gt;I don’t have answers to these yet. But they feel like the right questions to ask if we want video models that don’t just see, but remember. SAM 2 shows we’ve mastered &lt;em&gt;access&lt;/em&gt; to the past. The next step is mastering &lt;em&gt;retention&lt;/em&gt; of what matters.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026sam2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;SAM 2: Segment Anything in Images &amp;amp; Videos&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Apr&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/04/17/sam-2/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Ravi N, Gabeur V, Hu Y-T, et al. &lt;a href=&quot;https://arxiv.org/pdf/2408.00714&quot;&gt;SAM 2: Segment Anything in Images and Videos&lt;/a&gt;. arXiv. 2024. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Graves A, Wayne G, Danihelka I. &lt;a href=&quot;https://arxiv.org/abs/1410.5401&quot;&gt;Neural Turing Machines&lt;/a&gt;. arXiv. 2014. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Borgeaud S, Mensch A, Hoffmann J, et al. &lt;a href=&quot;https://arxiv.org/abs/2112.04426&quot;&gt;Improving Language Models by Retrieving from Trillions of Tokens&lt;/a&gt;. arXiv. 2022. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Bertasius G, Wang H, Torresani L. &lt;a href=&quot;https://arxiv.org/abs/2102.05095&quot;&gt;Is Space-Time Attention All You Need for Video Understanding?&lt;/a&gt;. arXiv. 2021. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Fri, 17 Apr 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/04/17/sam-2/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/04/17/sam-2/</guid>
        
        
      </item>
    
      <item>
        <title>Muon &amp; MuonClip Optimizers</title>
        <description>&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Muon&lt;/strong&gt; stands for &lt;strong&gt;M&lt;/strong&gt;oment&lt;strong&gt;U&lt;/strong&gt;m &lt;strong&gt;O&lt;/strong&gt;rthogonalized by &lt;strong&gt;N&lt;/strong&gt;ewton-Schulz and was invented by &lt;a href=&quot;https://kellerjordan.github.io/&quot;&gt;Keller Jordan&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;&lt;strong&gt;The key idea:&lt;/strong&gt;&lt;/em&gt; Instead of applying Adam-style per-element adaptive updates to model parameters, Muon orthogonalizes the momentum matrix before using it as the update direction.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#adam-optimizer&quot;&gt;Adam Optimizer&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#matrix-orthogonalization&quot;&gt;Matrix Orthogonalization&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#newton-schulz-5-iteration&quot;&gt;Newton-Schulz 5 Iteration&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#muon&quot;&gt;Muon&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#qk-clip&quot;&gt;QK-Clip&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#multihead-latent-attention-mla&quot;&gt;Multihead Latent Attention (MLA)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#muonclip&quot;&gt;MuonClip&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#references&quot;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;a href=&quot;https://x.com/Kimi_Moonshot&quot;&gt;Moonshot AI&lt;/a&gt;, creator of Kimi, pioneered the improvements to Muon (MuonClip) which I dive into in detail in this post. But they have an article on X with more academic details on why they chose Muon titled &lt;em&gt;“Why We Chose Muon: Our Chain of Thought”&lt;/em&gt;&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; by &lt;a href=&quot;https://x.com/Jianlin_S&quot;&gt;Jianlin Su&lt;/a&gt;, the first author of &lt;a href=&quot;https://arxiv.org/abs/2104.09864&quot;&gt;RoPE&lt;/a&gt; (Rotary Position Embedding).&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;zxx&quot; dir=&quot;ltr&quot;&gt;&lt;a href=&quot;https://t.co/dxZnLxvPae&quot;&gt;https://t.co/dxZnLxvPae&lt;/a&gt;&lt;/p&gt;&amp;mdash; Kimi.ai (@Kimi_Moonshot) &lt;a href=&quot;https://twitter.com/Kimi_Moonshot/status/1897929976948965870?ref_src=twsrc%5Etfw&quot;&gt;March 7, 2025&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;This blog post is directly from my personal handwritten notes on studying Muon &amp;amp; MuonClip. I posted those &lt;a href=&quot;https://x.com/latentchiz/status/2040617828856803690?s=20&quot;&gt;notes on X&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;My notes on the Muon optimizer by &lt;a href=&quot;https://twitter.com/kellerjordan0?ref_src=twsrc%5Etfw&quot;&gt;@kellerjordan0&lt;/a&gt; and MuonClip by &lt;a href=&quot;https://twitter.com/Kimi_Moonshot?ref_src=twsrc%5Etfw&quot;&gt;@Kimi_Moonshot&lt;/a&gt; which integrates Muon with weight decay, RMS matching, &amp;amp; QK-Clip.&lt;br /&gt;&lt;br /&gt;&amp;quot;Using MuonClip, we successfully pre-trained Kimi K2 on 15.5 trillion tokens without a single loss spike.&amp;quot; - &lt;a href=&quot;https://twitter.com/Kimi_Moonshot?ref_src=twsrc%5Etfw&quot;&gt;@Kimi_Moonshot&lt;/a&gt; &lt;a href=&quot;https://t.co/KW84xOi51E&quot;&gt;pic.twitter.com/KW84xOi51E&lt;/a&gt;&lt;/p&gt;&amp;mdash; Chiz (@latentchiz) &lt;a href=&quot;https://twitter.com/latentchiz/status/2040617828856803690?ref_src=twsrc%5Etfw&quot;&gt;April 5, 2026&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;adam-optimizer&quot;&gt;Adam Optimizer&lt;/h2&gt;
&lt;p&gt;Let’s start by looking at Adam optimizer, the most common optimizer for training neural networks, which I cover in &lt;a href=&quot;https://chizkidd.github.io/2026/01/22/neural-net-optimizers/#8-muon-momentum-orthogonalized-by-newton-schulz&quot;&gt;this article guide&lt;/a&gt; on common deep learning optimizers. &lt;strong&gt;Adam&lt;/strong&gt; stands for &lt;strong&gt;Ada&lt;/strong&gt;ptive &lt;strong&gt;M&lt;/strong&gt;oment Estimation.&lt;/p&gt;

&lt;p&gt;It combines momentum and adaptive learning rates so the model not only remembers the direction it has been moving in, but also adjusts how big each step should be for every parameter. More specifically, it combines momentum (first moment) and RMSProp (second moment) with bias corrections to handle noisy gradients and early training instability.&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; The forward pass looks like:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/model-learning.png&quot; alt=&quot;model learning&quot; /&gt;&lt;/p&gt;

&lt;!-- $$x \longrightarrow \boxed{\Theta} \longrightarrow \hat{y} \longrightarrow \boxed{\text{Loss},\ L} \longrightarrow y_{gt}$$

$${\text{input}} \quad {\text{model}} \quad {\text{prediction}} \hspace{2em} \quad  {\text{ground truth}}$$ --&gt;

&lt;h3 id=&quot;gradient-descent&quot;&gt;Gradient Descent&lt;/h3&gt;

&lt;p&gt;Gradient descent updates the model parameters by stepping in the direction that reduces the loss, using the gradient as a guide for how to adjust each parameter. The momentum and velocity terms refine this process by smoothing past gradients and scaling updates adaptively, which helps stabilize training and converge faster, especially in noisy or complex loss landscapes.&lt;/p&gt;

\[\begin{aligned}
\text{simply} \quad &amp;amp; \Theta_i \leftarrow \Theta_{i-1} - \alpha \frac{\partial L}{\partial \Theta_i} \\[6pt]
\text{momentum} \quad &amp;amp; M_i \leftarrow \beta_1 M_{i-1} + (1 - \beta_1) \frac{\partial L}{\partial \Theta_i} \\[6pt]
\text{velocity} \quad &amp;amp; V_i \leftarrow \beta_2 V_{i-1} + (1 - \beta_2) \left(\frac{\partial L}{\partial \Theta_i}\right)^2 \\[6pt]
&amp;amp; \Theta_i \leftarrow \Theta_{i-1} - \alpha \left(\frac{M_i}{\sqrt{V_i} + \varepsilon}\right)
\end{aligned}\]

\[\begin{aligned}
\text{where} \\
M_i &amp;amp;\equiv \text{momentum (1st moment)} \\
V_i &amp;amp;\equiv \text{velocity: gradient squared (2nd moment)} \\
\beta_1, \beta_2 &amp;amp;\equiv \text{decay hyperparameters} \\
\varepsilon &amp;amp;\equiv \text{small constant for numerical stability}
\end{aligned}\]
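
&lt;p&gt;The update above translates almost line-for-line into code. A minimal NumPy sketch of a single Adam step (including the bias corrections mentioned earlier, which the equations above omit):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update, following the momentum / velocity equations above (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad           # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2        # 2nd moment (velocity)
    m_hat = m / (1 - beta1**t)                   # bias corrections for early steps
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;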

&lt;h3 id=&quot;cons-of-adam&quot;&gt;Cons of Adam&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Memory intensive.&lt;/li&gt;
  &lt;li&gt;Challenging hyper-parameter tuning.&lt;/li&gt;
  &lt;li&gt;Each value is updated independently, as if the parameters formed one long vector, without considering any internal (matrix) structure of the model parameters – vector-based optimizer behavior.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Can we explicitly account for the underlying matrix structure of the model parameters?&lt;/p&gt;

&lt;h3 id=&quot;the-linear-layer--matrix-momentum&quot;&gt;The Linear Layer &amp;amp; Matrix Momentum&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;For a linear layer with 3 inputs $(x_1, x_2, x_3)$ and 4 outputs $(z_1, z_2, z_3, z_4)$:&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/linear-layer.png&quot; alt=&quot;Linear layer&quot; /&gt;&lt;/p&gt;

\[z_i = \Theta_{i1} x_1 + \Theta_{i2} x_2 + \Theta_{i3} x_3, \quad \forall\, i = 1, 2, 3, 4\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In matrix form:&lt;/p&gt;

\[\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix}
  =
  \begin{bmatrix}
  \Theta_{11} &amp;amp; \Theta_{12} &amp;amp; \Theta_{13} \\
  \Theta_{21} &amp;amp; \Theta_{22} &amp;amp; \Theta_{23} \\
  \Theta_{31} &amp;amp; \Theta_{32} &amp;amp; \Theta_{33} \\
  \Theta_{41} &amp;amp; \Theta_{42} &amp;amp; \Theta_{43}
  \end{bmatrix}
  \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Since $\Theta$ is now a matrix, we need a &lt;strong&gt;matrix momentum&lt;/strong&gt; $\hat{M}$ for $\hat{\Theta}$:&lt;/p&gt;

\[\hat{M} =
  \begin{bmatrix}
  m_{11} &amp;amp; m_{12} &amp;amp; m_{13} \\
  m_{21} &amp;amp; m_{22} &amp;amp; m_{23} \\
  m_{31} &amp;amp; m_{32} &amp;amp; m_{33} \\
  m_{41} &amp;amp; m_{42} &amp;amp; m_{43}
  \end{bmatrix}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;So the matrix momentum update is:&lt;/p&gt;

\[\hat{M}_i \leftarrow \beta \hat{M}_i + \frac{\partial L}{\partial \hat{\Theta}_i}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;the-problem-with-vector-based-optimizers&quot;&gt;The Problem with Vector-Based Optimizers&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;With vector-based optimizers like Adam, the momentum for a linear layer (a 2D matrix) tends to become &lt;strong&gt;almost low rank&lt;/strong&gt; in practice.&lt;/li&gt;
  &lt;li&gt;Essentially, only a small number of dominant directions really drive the update, while the many remaining directions contribute very little.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; How can we tackle this update direction imbalance and what makes a good optimizer?&lt;/p&gt;

&lt;p&gt;From fundamental first principles, a good optimizer possesses two characteristics: &lt;strong&gt;stability&lt;/strong&gt; and &lt;strong&gt;speed.&lt;/strong&gt; The goal of each update of a good optimizer is to minimize model variance and maximize loss reduction contribution, which correspond to stability and speed respectively.&lt;sup id=&quot;fnref:5:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;matrix-orthogonalization&quot;&gt;Matrix Orthogonalization&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;We can fix the imbalance of update directions via orthogonalization.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Orthogonalize&lt;/strong&gt; the momentum matrix. This is where &lt;strong&gt;Muon&lt;/strong&gt; comes in.&lt;/li&gt;
  &lt;li&gt;It amplifies the effect of &lt;strong&gt;rare&lt;/strong&gt; directions – the directions that typically receive small or infrequent updates.&lt;/li&gt;
  &lt;li&gt;Even though these rare directions seem minor, they are often essential for effective learning and can help capture more nuanced patterns in the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;orthogonalization-via-svd&quot;&gt;Orthogonalization via SVD&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;We want the orthogonal matrix $O$ closest to $M$:&lt;/p&gt;

\[\text{Ortho}(M) = \arg\min_{O} \left\{\| O - M \|_F\right\} \quad \text{subject to } OO^T = I \text{ or } O^TO = I\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Using &lt;strong&gt;SVD&lt;/strong&gt; (Singular Value Decomposition): $M = USV^T$, where:&lt;/p&gt;

\[\begin{aligned}
  UU^T &amp;amp;= U^TU = I \\
  VV^T &amp;amp;= V^TV = I
  \end{aligned}
  \quad \bigg\} \text{ orthonormal matrices}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Since $OO^T = O^TO = I$:&lt;/p&gt;

\[\therefore\quad O = USV^T \quad \text{where } S = \begin{bmatrix} 1 &amp;amp; &amp;amp; 0 \\ &amp;amp; \ddots &amp;amp; \\ 0 &amp;amp; &amp;amp; 1 \end{bmatrix} \equiv \text{unit diagonal matrix}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Issue:&lt;/strong&gt; SVD on a matrix is computationally expensive.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use an odd polynomial matrix:&lt;/p&gt;

\[p(X) = aX + b(XX^T)X\]
  &lt;/li&gt;
&lt;/ul&gt;
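
&lt;p&gt;Before moving on to the polynomial trick, the exact solution is worth writing down: once you have the SVD, the nearest (semi-)orthogonal matrix is just $UV^T$, i.e. all singular values replaced by 1. The expense is in computing the SVD itself. A minimal NumPy sketch:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def ortho_svd(M):
    # Nearest (semi-)orthogonal matrix to M in the Frobenius norm: O = U V^T.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;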

&lt;h3 id=&quot;odd-polynomial-matrix&quot;&gt;Odd Polynomial Matrix&lt;/h3&gt;

\[\begin{align*}
p(M) &amp;amp;= aM + b(MM^T)M \\
&amp;amp;= \left(aI + b(MM^T)\right)M \\
&amp;amp;= \left(aI + b(USV^T VS U^T)\right)USV^T \\
&amp;amp;= \left(aI + b(US^2 U^T)\right)USV^T \\
&amp;amp;= aUSV^T + bUS^2 U^T USV^T \\
p(M) &amp;amp;= aUSV^T + bUS^3V^T \\
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;This applies to any odd polynomial, so in general:&lt;/p&gt;

\[p(M) = U\!\left(aS + bS^3 + cS^5 + \ldots + (\text{const})\, S^{2n+1}\right)V^T, \quad \forall\, n \geq 0,\ n \to \infty\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Sticking to the &lt;strong&gt;5th order polynomial&lt;/strong&gt;:&lt;/p&gt;

\[\boxed{p(M) = U\!\left(aS + bS^3 + cS^5\right)V^T}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;We need to determine the coefficients $(a, b, c)$. The goal is to get $S$ to a unit diagonal matrix – i.e., get the diagonal values as close to 1 as possible.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Plot $y = p(x)$ against $x$. If $(a, b, c) = (1.5,\ -0.5,\ 0)$:&lt;/p&gt;

\[y = 1.5x - 0.5x^3\]

    &lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/newton-schulz5.png&quot; alt=&quot;Newton-Schulz5-1&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Applying this repeatedly via &lt;strong&gt;Newton-Schulz iteration&lt;/strong&gt;:&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
y_1 &amp;amp;= p(x) \\
y_2 &amp;amp;= p(p(x)) \\
y_3 &amp;amp;= p(p(p(x))) \\
y_4 &amp;amp;= p(p(p(p(x)))) \\
y_5 &amp;amp;= p(p(p(p(p(x))))) \\
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;Each $y_k$ represents one more composition:  $y_1 \to y_5$ are multiple iterations aimed at converging the singular values toward 1.&lt;/li&gt;
&lt;/ul&gt;
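
&lt;p&gt;Because the matrix iteration acts on each singular value independently, the composition can be checked on plain scalars. A quick sketch with the $(1.5,\ -0.5,\ 0)$ coefficients used above:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def p(x, a=1.5, b=-0.5, c=0.0):
    return a * x + b * x**3 + c * x**5

x = np.linspace(0.1, 1.2, 12)   # stand-ins for singular values
for k in range(5):              # y_1 through y_5
    x = p(x)
print(x)  # each composition pushes the values toward 1; small inputs move slowly,
          # which is what the tuned coefficients in the next section speed up
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;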

&lt;hr /&gt;

&lt;h2 id=&quot;newton-schulz-5-iteration&quot;&gt;Newton-Schulz-5 Iteration&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;After 5 iterations, almost all input values end up very close to 1.&lt;/li&gt;
  &lt;li&gt;We can change $(a, b, c)$ to see the effect on convergence of $y$ to 1.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;$(a, b, c) = (2,\ -1.5,\ 0.5)$ speeds up the convergence to 1.&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/newton-schulz5-converged.png&quot; alt=&quot;Newton-Schulz5-2&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Empirically, we don’t need the singular values to converge to exactly 1.&lt;/li&gt;
  &lt;li&gt;Let’s set an upper &amp;amp; lower bound, e.g. $(0.7,\ 1.3)$, which is basically $[1 - \varepsilon, 1 + \varepsilon]$.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Tuned coefficients:&lt;/p&gt;

\[\boxed{(a,\ b,\ c) = (3.4445,\ -4.775,\ 2.0315)}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Each Newton-Schulz 5 iteration involves only &lt;strong&gt;matrix multiplication&lt;/strong&gt;:&lt;/p&gt;

\[X \leftarrow aX + b(XX^T)X + c(XX^T)^2 X\]
  &lt;/li&gt;
  &lt;li&gt;With GPUs, no need to use SVD since GPUs can efficiently compute matrix multiplication.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;newtonschulz5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;eps&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1e-7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndim&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;3.4445&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;4.7750&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;2.0315&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bfloat16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;norm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;eps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;B&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Retrieved from https://kellerjordan.github.io/posts/muon/
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
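
&lt;p&gt;A quick way to sanity-check the routine above (assuming PyTorch is available) is to feed it a random matrix and inspect the singular values before and after; they should end up much closer to 1 than they started, without being driven to exactly 1:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

G = torch.randn(256, 128)
O = newtonschulz5(G, steps=5)           # uses the function defined above

# singular values of the raw matrix are spread out; after the iteration they
# should cluster roughly around 1, consistent with the loose bounds discussed earlier
print(torch.linalg.svdvals(G))
print(torch.linalg.svdvals(O.float()))  # cast back from bfloat16 before the SVD
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;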
&lt;hr /&gt;

&lt;h2 id=&quot;muon&quot;&gt;Muon&lt;/h2&gt;

&lt;p&gt;Muon is designed specifically for 2D weight matrices in neural network hidden layers.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; Unlike traditional optimizers that treat each parameter independently, Muon leverages the geometric structure of weight matrices by orthogonalizing the momentum-accumulated gradient using the Newton-Schulz iteration. This focus on linear (matrix-shaped) layers aligns with ongoing research arguing that different layer types require different optimizers due to their varying geometry.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The optimizer formulates weight updates as a constrained optimization problem in the RMS-to-RMS operator norm space:&lt;/p&gt;

\[\text{Ortho}(M) = \arg\min_{O} \left\{\| O - M \|_F\right\} \quad \text{subject to } OO^T = I \text{ or } O^TO = I\]

&lt;p&gt;Where $M$ is the gradient matrix. The solution involves projecting the gradient onto the set of orthogonal matrices, which aims to standardize all singular values to 1 while preserving gradient directions.&lt;/p&gt;

&lt;h3 id=&quot;pseudo-algorithm-for-muon&quot;&gt;Pseudo-Algorithm for Muon&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;for t = 1, 2, ..., do:
    Compute gradient      G_t  ←  ∇L_t(θ_{t-1})
    Compute momentum      M_t  ←  βM_{t-1} + G_t
    Normalize             M&apos;_t ←  M_t / ||M_t||_F
    Orthogonalization     O_t  ←  NewtonSchulz5(M&apos;_t)
    Update parameter      θ_t  ←  θ_{t-1} - α O_t
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;With &lt;strong&gt;weight decay&lt;/strong&gt; $(+)$:&lt;/p&gt;

\[\Theta_t \leftarrow \Theta_{t-1} - \alpha\!\left(O_t + \lambda\Theta_{t-1}\right)\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;With &lt;strong&gt;adjusted learning rate&lt;/strong&gt; $(+)$:&lt;/p&gt;

\[\Theta_t \leftarrow \Theta_{t-1} - \alpha\!\left[\left(0.2\sqrt{\max(n,m)}\right) O_t + \lambda\Theta_{t-1}\right]\]

\[\begin{aligned}
  \text{where} \\
  \lambda &amp;amp;\equiv \text{weight decay coefficient} \\
  n, m &amp;amp;\equiv \text{dimensions of the 2D parameter matrix}
  \end{aligned}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;muon--weight-decay--rms-alignment&quot;&gt;Muon + Weight Decay + RMS Alignment&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Weight decay is used to address the diminished performance gains of Muon over AdamW when scaling up to train a larger model.&lt;/li&gt;
  &lt;li&gt;The learning rate also gets adjusted by taking into account the &lt;strong&gt;size of the 2D matrix&lt;/strong&gt;. This is the underlying principle behind the &lt;strong&gt;RMS (Root Mean Squared) Alignment.&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;To keep the per-matrix update RMS consistent (around 1) across matrices of different shapes, the Muon update for a full-rank weight matrix of shape $[A,\ B]$ is scaled by $\sqrt{\max(A,\ B)}$.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
      &lt;li&gt;The $0.2$ factor is used to match Muon’s update RMS to that of AdamW. From empirical observations, AdamW’s update RMS is usually around $0.2$ to $0.4$.&lt;sup id=&quot;fnref:2:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;, &lt;sup id=&quot;fnref:5:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;These 2 improvements (weight decay &amp;amp; adjusting the per-parameter update scale) help to stabilize the training of large models (a combined update sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
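
&lt;p&gt;Putting the pieces together, a single Muon update for one 2D weight matrix (momentum, Newton-Schulz orthogonalization, the $0.2\sqrt{\max(n,\ m)}$ RMS-matching factor, and decoupled weight decay) looks roughly like this. A sketch assuming PyTorch and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;newtonschulz5&lt;/code&gt; routine above, not the reference implementation:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

def muon_step(theta, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.01):
    # One Muon update for a 2D parameter matrix theta, following the
    # pseudo-algorithm above plus weight decay and RMS alignment.
    momentum = beta * momentum + grad                # M_t = beta * M_{t-1} + G_t
    O = newtonschulz5(momentum).to(theta.dtype)      # normalizes internally, then orthogonalizes
    n, m = theta.shape
    scale = 0.2 * max(n, m) ** 0.5                   # match AdamW-like update RMS
    theta = theta - lr * (scale * O + weight_decay * theta)   # decoupled weight decay
    return theta, momentum
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;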

&lt;h3 id=&quot;the-exploding-attention-logit-crisis&quot;&gt;The Exploding Attention Logit Crisis&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Issue:&lt;/strong&gt; Attention logits can grow larger &amp;amp; larger as training continues, which may cause the training process to become unstable.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Consider a sequence of 4 tokens &amp;amp; assume self-attention for simplicity. Each token is mapped to an embedding vector of dimension $d$. Let the embedding matrix be $X$.&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/self-attention.png&quot; alt=&quot;Self-attention architecture diagram&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Self-Attention&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;$O = \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;In self attention above, $Q = XW^Q$, $K = XW^K$, and $V = XW^V$.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;strong&gt;attention logits&lt;/strong&gt; $S$:&lt;/p&gt;

\[\begin{align*}
  S &amp;amp;= QK^T \\
  &amp;amp;= (XW^Q)(XW^K)^T \\
  &amp;amp;= X\!\left(W^Q W^{K^T}\right)X^T
  \end{align*}\]

    &lt;p&gt;where $X$ and $X^T$ denote the embedding vectors, which are typically normalized to have unit norms.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To prevent the attention logits from becoming excessively large, we must control the scale of $W^Q$ and $W^K$:&lt;/p&gt;

\[S = X\underbrace{\left(W^Q W^{K^T}\right)}_{\text{scale control}}X^T\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;qk-clip&quot;&gt;QK-Clip&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;A common strategy is to apply a scaling factor to these matrices.&lt;/li&gt;
  &lt;li&gt;During training, monitor the maximum value of the attention logits, $S_{\max}$. If it exceeds a certain threshold $\tau$, calculate a scaling ratio $\gamma$:&lt;/li&gt;
&lt;/ul&gt;

\[\text{if } S_{\max} &amp;gt; \tau: \quad \gamma = \frac{\tau}{S_{\max}} \implies \frac{\tau}{S_{\max}} &amp;lt; 1\]

&lt;ul&gt;
  &lt;li&gt;Directly constrains attention logits, ensuring they stay within a safe range by rescaling the query &amp;amp; key projection weights.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; Scale the relevant model parameters by $\gamma$ when the attention logits surpass the threshold. Scale both $W^Q$, $W^K$ by $\sqrt{\gamma}$:&lt;/p&gt;

\[\begin{align*}
S &amp;amp;= X\!\left(\gamma W^Q W^{K^T}\right)X^T \\
&amp;amp;= X\!\left(\sqrt{\gamma}\, W^Q\ \sqrt{\gamma}\, W^{K^T}\right)X^T
\end{align*}\]

&lt;p&gt;&lt;strong&gt;Revised pseudo-algorithm (QK-Clip):&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;θ_t ← MuonOptimizer(θ_{t-1}, G_t)
if S_max &amp;gt; τ:
    γ   = τ / S_max
    W^Q ← √γ W^Q
    W^K ← √γ W^K
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
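
&lt;p&gt;As a sanity check on the idea (a minimal numpy sketch with a hypothetical threshold value, not the authors’ implementation):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def qk_clip(W_Q, W_K, S_max, tau=100.0):
    # Rescale the query/key projections only when the observed max logit exceeds tau.
    if S_max &amp;gt; tau:
        gamma = tau / S_max              # gamma is below 1 by construction
        W_Q = np.sqrt(gamma) * W_Q       # sqrt(gamma) on each factor ...
        W_K = np.sqrt(gamma) * W_K       # ... so the logits shrink by gamma overall
    return W_Q, W_K
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;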

&lt;h3 id=&quot;qk-clip-for-multi-head-attention&quot;&gt;QK-Clip for Multi-Head Attention&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;In practice, self-attention consists of multiple heads ($n_{\text{heads}} = h$).&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;When the maximum attention logits exceed the threshold, instead of rescaling all heads by the same factor (which would needlessly shrink heads whose logits are still well behaved), we introduce an &lt;strong&gt;individual scaling factor for each head&lt;/strong&gt; to control their logits separately.&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/MHA.png&quot; alt=&quot;Multi-head attention architecture diagram&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Multi-Head Attention&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;$O = \text{MultiheadAttention}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)\, W_o$&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Attention logits per head:&lt;/p&gt;

\[S_h = X\!\left(\sqrt{\gamma_h}\, W_h^Q\ \sqrt{\gamma_h}\, W_h^{K^T}\right)X^T\]

\[\text{if } S_{\max}^h &amp;gt; \tau \quad \text{then} \quad \gamma_h = \frac{\tau}{S_{\max}^h}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Algorithm for Muon + QK-Clip for MHA:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;θ_t ← MuonOptimizer(θ_{t-1}, G_t)
if S^h_max &amp;gt; τ:
    γ_h   = τ / S^h_max
    W^Q_h ← √γ_h W^Q_h
    W^K_h ← √γ_h W^K_h
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;multihead-latent-attention-mla&quot;&gt;Multihead Latent Attention (MLA)&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;For Multihead Latent Attention (MLA), introduced by DeepSeek, things get trickier.&lt;/p&gt;

    &lt;p&gt;&lt;!-- $$O = \text{Multihead Latent Attention}(Q, K, V)$$ --&gt;&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/MLA.png&quot; alt=&quot;MLA architecture diagram&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;MLA compresses the $Q, K, V$ representations into a low-rank space to reduce the size of the KV cache, using &lt;strong&gt;down-projection matrices&lt;/strong&gt; that produce latent representations:&lt;/p&gt;

\[C^Q = XW^Q_\downarrow \qquad C^{KV} = XW^{KV}_\downarrow\]
  &lt;/li&gt;
  &lt;li&gt;These compressed latent vectors are then mapped back up for each attention head using the corresponding &lt;strong&gt;up-projection matrices&lt;/strong&gt; $W^Q_\uparrow$, $W^K_\uparrow$, $W^V_\uparrow$.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Issue:&lt;/strong&gt; This low-rank KV compression fails with rotary position embedding (RoPE).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fix:&lt;/strong&gt; A decoupled RoPE technique which introduces extra multi-head queries $W^{QR}$ and a shared key $W^{KR}$ to encode positional information.&lt;/li&gt;
  &lt;li&gt;For MLA, the Query, Key &amp;amp; Values are regrouped for each head:
    &lt;ul&gt;
      &lt;li&gt;The &lt;strong&gt;Query&lt;/strong&gt; is constructed by concatenating the compressed query $Q^C$ with the rotated query $Q^R$.&lt;/li&gt;
      &lt;li&gt;The &lt;strong&gt;Key&lt;/strong&gt; is constructed similarly by concatenating the compressed key $K^C$ with the rotated key $K^R$.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Multihead Latent Attention (MLA)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;MLA compresses $Q, K, V$ representations into a low-rank latent space via down-projection matrices $W^Q_\downarrow$ and $W^{KV}_\downarrow$, then maps back up via up-projection matrices. A decoupled RoPE technique adds rotary queries $W^{QR}$ and a shared rotary key $W^{KR}$.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;muonclip&quot;&gt;MuonClip&lt;/h2&gt;

&lt;p&gt;In MLA, it is important to carefully decide how to rescale these 4 matrices: $W^Q_\uparrow,\ W^{QR},\ W^{KR},\ W^K_\uparrow$.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;For the &lt;strong&gt;up-projection matrices&lt;/strong&gt; $W^Q_\uparrow$, $W^K_\uparrow$: rescale these parameters for each head individually.&lt;/li&gt;
  &lt;li&gt;Rescaling the &lt;strong&gt;RoPE components&lt;/strong&gt; $W^{QR}$, $W^{KR}$ is trickier:
    &lt;ul&gt;
      &lt;li&gt;Each head has its &lt;strong&gt;own&lt;/strong&gt; rotary query $W^{QR}$.&lt;/li&gt;
      &lt;li&gt;But all heads &lt;strong&gt;share&lt;/strong&gt; a single rotary key matrix $W^{KR}$.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Issue:&lt;/strong&gt; Applying the same per-head scaling for both RoPE components leads to the shared $W^{KR}$ being rescaled multiple times, which is undesirable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Rescale only the head-specific rotary query $W^{QR}$ by their respective $\gamma^h$, while leaving the shared rotary key matrix $W^{KR}$ unchanged. This technique is called &lt;strong&gt;MuonClip&lt;/strong&gt;. MuonClip improves upon Muon with the QK-Clip technique to handle training instability while benefiting from Muon’s advanced token efficiency.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MuonClip Algorithm:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;θ_t ← MuonOptimizer(θ_{t-1}, G_t)
if S^h_max &amp;gt; τ:
    γ_h = τ / S^h_max
    W^Q_{h,↑}  ← √γ_h  W^Q_{h,↑}
    W^K_{h,↑}  ← √γ_h  W^K_{h,↑}
    W^{QR}_h   ← γ_h   W^{QR}_h
    W^{KR}     ← W^{KR}   (unchanged)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In math:&lt;/p&gt;

\[\begin{aligned}
\gamma_h &amp;amp;= \frac{\tau}{S_{\max}^h} \\[6pt]
W_{h,\uparrow}^Q &amp;amp;\leftarrow \sqrt{\gamma_h}\, W_{h,\uparrow}^Q \\
W_{h,\uparrow}^K &amp;amp;\leftarrow \sqrt{\gamma_h}\, W_{h,\uparrow}^K \\
W_h^{QR} &amp;amp;\leftarrow \gamma_h\, W_h^{QR} \\
W^{KR} &amp;amp;\leftarrow W^{KR} \quad \text{(unchanged)}
\end{aligned}\]
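
&lt;p&gt;A minimal numpy sketch of this per-head rescaling (my own illustrative layout: per-head matrices stored in lists, a hypothetical threshold, and the Muon update itself elided):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def muonclip_rescale(W_Q_up, W_K_up, W_QR, S_max_per_head, tau=100.0):
    # W_Q_up, W_K_up, W_QR are lists of per-head matrices.
    # The shared rotary key W^{KR} is deliberately left untouched.
    for h, S_max in enumerate(S_max_per_head):
        if S_max &amp;gt; tau:
            gamma = tau / S_max
            W_Q_up[h] = np.sqrt(gamma) * W_Q_up[h]   # up-projection query: sqrt(gamma)
            W_K_up[h] = np.sqrt(gamma) * W_K_up[h]   # up-projection key:   sqrt(gamma)
            W_QR[h] = gamma * W_QR[h]                # per-head rotary query: full gamma
    return W_Q_up, W_K_up, W_QR
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;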

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026muonmuonclip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Muon &amp;amp; MuonClip Optimizers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Apr&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/04/04/muon-muonclip/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Kimi (Moonshot AI). &lt;a href=&quot;https://x.com/Kimi_Moonshot/status/1897929976948965870&quot;&gt;Why We Chose Muon: Our Chain of Thought&lt;/a&gt;. X (Twitter). 2025. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:5:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:5:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Chizoba Obasi. &lt;a href=&quot;https://chizkidd.github.io/2026/01/22/neural-net-optimizers/&quot;&gt;A Complete Guide to Neural Network Optimizers&lt;/a&gt;. chizkidd.github.io. 2026. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Jia-Bin Huang. &lt;a href=&quot;https://youtu.be/bO5nvE289ec&quot;&gt;This Simple Optimizer Is Revolutionizing How We Train AI [Muon] (YouTube Video)&lt;/a&gt;. YouTube. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Keller Jordan. &lt;a href=&quot;https://kellerjordan.github.io/posts/muon/&quot;&gt;Muon: An optimizer for hidden layers in neural networks&lt;/a&gt;. kellerjordan.github.io. 2024. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Jeremy Bernstein. &lt;a href=&quot;https://jeremybernste.in/writing/deriving-muon&quot;&gt;Deriving Muon&lt;/a&gt;. jeremybernste.in. 2025. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2502.16982&quot;&gt;Muon is Scalable for LLM Training&lt;/a&gt;. arXiv. 2025. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:2:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2507.20534&quot;&gt;Kimi K2: Open Agentic Intelligence&lt;/a&gt;. arXiv. 2025. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/04/04/muon-muonclip/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/04/04/muon-muonclip/</guid>
        
        
      </item>
    
      <item>
        <title>Inkcast: A Free, Browser-Based Audiobook Player</title>
        <description>&lt;p&gt;Earlier this year, I decided to force myself to read more. Not a New Year’s resolution, because those never last. The reason is that growing up as a child and young teenager, reading often felt like punishment. My mum required my siblings and me to read a certain number of pages from a designated book every day throughout elementary school. Missing a day meant mandatory punishment. In boarding secondary school, this eventually led to a stubborn, subconscious resistance to non-essential reading. Over the six years I spent there, I probably read only five to ten non-academic fiction books (though &lt;em&gt;Artemis Fowl&lt;/em&gt; was a delight). So it is not hard to see where my indifference to reading came from.&lt;/p&gt;

&lt;p&gt;During the COVID-19 pandemic, however, I fell deep into podcasts. As an avid sports fan and TV show buff, I listened to everything: sports recaps, tech podcasts, expert interviews, the works (meeting Walter White at Anfield would be the ultimate dream). So when I decided to read more this year, audiobooks felt like the natural bridge. I already had a few EPUBs in the Apple Books app on my reading list and wondered: &lt;em&gt;Can I listen to these EPUBs using Apple Dictation’s two-finger swipe-down feature?&lt;/em&gt; Unfortunately, it only works for the current page. It is quite janky, not very user-friendly, and frankly does not work well for my use case.&lt;/p&gt;

&lt;p&gt;Recently, I worked on &lt;a href=&quot;https://chizkidd.github.io/2026/03/01/tonal-fidelity-multilingual-asr/&quot;&gt;evaluating how a Facebook state-of-the-art (SOTA) automatic speech recognition (ASR) model handles Igbo tones&lt;/a&gt;, trying to see whether it actually “listens” properly. So I have been dabbling with audio quite a bit this year. You could say I have been thinking about listening a lot. In the past, I also experimented with WaveNet (a generative model for raw audio) and its fundamental building block, &lt;a href=&quot;https://chizkidd.github.io/Karpathy-Neural-Networks-Zero-to-Hero/006_makemore_WaveNet/makemore_WaveNet.html&quot;&gt;the dilated causal convolution&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With these experiences in mind, I wondered: &lt;em&gt;Can I build an iPhone Shortcut that lets me listen to EPUBs properly?&lt;/em&gt; That question eventually led to &lt;a href=&quot;https://chizkidd.github.io/inkcast/&quot;&gt;&lt;strong&gt;Inkcast&lt;/strong&gt;&lt;/a&gt;. The goal was not to build a Speechify competitor. I simply wanted to solve a personal problem. My aim was to create a low-effort, frictionless tool for personal use, so I made a GitHub repository and started building. Within a few days, I had a working website that could take EPUBs and PDFs and let users listen to the content organized by chapters in a sidebar with one-tap navigation. It included basic controls such as play/pause, rewind (15 seconds), forward (30 seconds), playback speed control (0.75× to 2×), and voice selection.&lt;/p&gt;

&lt;p&gt;It worked well on desktop, so I used the URL to create a Shortcut on my iPhone. On mobile it also worked, but there was one problem: the reader voices sounded robotic and monotone, which is not ideal for long-form listening. The irony was that only weeks earlier I had been evaluating how machines handle speech. So I went back to the drawing board to figure out how to get more natural-sounding reader voices on Inkcast.&lt;/p&gt;

&lt;p&gt;While researching audiobook-quality text-to-speech, I came across several APIs (OpenAI, ElevenLabs, Google Cloud). None were free, and for a personal project I wanted something that required no subscriptions or API keys. Most resources suggested that human-quality narration requires a dedicated TTS service. Eventually I discovered that the Web Speech API can access premium voices already installed on the device. These voices are free, require no API keys, and remain available offline. They are not state of the art, but they are surprisingly good. Many people do not realize that higher-quality Siri voices can be downloaded. The voice quality improved, but the project also started evolving in another direction.&lt;/p&gt;

&lt;p&gt;I have always wanted to work through Paul Graham’s essays properly. There are 229 of them, and they read almost like long-form podcasts. But they live on webpages, which raised another question: Why limit the input to EPUBs and PDFs? So I added URL support. I pasted Paul Graham’s archive page into Inkcast, and it automatically pulled all 200+ essays into the sidebar. That was the moment I realized the idea actually worked.&lt;/p&gt;

&lt;p&gt;The entire project lives in a single HTML file. There are no accounts, no installations, and files never leave the user’s device. Because the app has no server dependencies, it ended up functioning as a privacy-preserving tool by default. In a way, I started the year studying whether machines listen well. Along the way, I realized that humans do not have many good free tools for listening either. Speechify costs $139 per year and Audible requires a subscription, so I built something that worked for me.&lt;/p&gt;

&lt;p&gt;If you find it useful, please &lt;a href=&quot;https://chizkidd.github.io/inkcast/&quot;&gt;try it&lt;/a&gt;, &lt;a href=&quot;https://github.com/chizkidd/inkcast&quot;&gt;star it&lt;/a&gt;, or &lt;a href=&quot;https://buymeacoffee.com/cobasi&quot;&gt;buy me a coffee&lt;/a&gt; if it saves you a Speechify subscription.&lt;/p&gt;
</description>
        <pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/03/16/inkcast/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/03/16/inkcast/</guid>
        
        
      </item>
    
      <item>
        <title>Sutton &amp; Barto, Ch. 12: Eligibility Traces (Personal Notes)</title>
        <description>&lt;ul&gt;
  &lt;li&gt;Eligibility traces are one of the basic mechanisms of RL that unify and generalize TD and Monte Carlo (MC) methods.&lt;/li&gt;
  &lt;li&gt;TD methods augmented with eligibility traces produce a family of methods spanning a range from MC methods at one end ($\lambda = 1$) to one-step TD (TD(0)) methods at the other end ($\lambda = 0$).&lt;/li&gt;
  &lt;li&gt;With eligibility traces, MC methods can be implemented online and on continuing problems.&lt;/li&gt;
  &lt;li&gt;$n$-step methods also unify TD and MC methods but are not as elegant algorithmically as eligibility traces (ET).&lt;/li&gt;
  &lt;li&gt;The eligibility traces algorithm entails:
    &lt;ul&gt;
      &lt;li&gt;First, we have a short-term memory vector, the &lt;strong&gt;eligibility trace&lt;/strong&gt; $\mathbf{z}_t \in \mathbb{R}^d$, that parallels the long-term weight vector $\mathbf{w}_t \in \mathbb{R}^d$.&lt;/li&gt;
      &lt;li&gt;Then when a component of $\mathbf{w}_t$ participates in producing an estimated value, the corresponding component of $\mathbf{z}_t$ is bumped up and then begins to fade away.&lt;/li&gt;
      &lt;li&gt;Learning occurs in that component of $\mathbf{w}_t$ if a non-zero TD error occurs before the trace falls back to zero (fades away).&lt;/li&gt;
      &lt;li&gt;The trace-decay parameter $\lambda \in [0,1]$ determines the rate at which the trace falls.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Advantages of ET over $n$-step methods:
    &lt;ul&gt;
      &lt;li&gt;Requires only a single trace vector $\mathbf{z}_t$ rather than storing the last $n$ feature vectors.&lt;/li&gt;
      &lt;li&gt;Learning occurs continually and uniformly in time rather than being delayed and playing “catch up” at episode end, so learning can affect behavior immediately.&lt;/li&gt;
      &lt;li&gt;ET methods are &lt;strong&gt;backward view&lt;/strong&gt; algorithms, which are less complex to implement than the &lt;strong&gt;forward view&lt;/strong&gt; $n$-step methods.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Forward view&lt;/strong&gt; algorithms are based on looking forward from the updated state, and the updated state depends on all the future rewards.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Backward view&lt;/strong&gt; algorithms use the current TD error, looking backward to recently visited states, to achieve nearly the same updates as forward view.&lt;/li&gt;
  &lt;li&gt;We start with ideas for state values and prediction, then extend them to action values and control, first on-policy and then off-policy. The focus is on linear function approximation (which covers the tabular and state-aggregation cases).&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#121-the--return&quot;&gt;12.1 The $\lambda$-return&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#122-td&quot;&gt;12.2 TD($\lambda$)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#123--step-truncated--return-methods&quot;&gt;12.3 $n$-step Truncated $\lambda$-return Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#124-redoing-updates-online--return-algorithm&quot;&gt;12.4 Redoing Updates: Online $\lambda$-return Algorithm&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#125-true-online-td&quot;&gt;12.5 True Online TD($\lambda$)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#126-dutch-traces-in-monte-carlo-learning&quot;&gt;12.6 Dutch Traces in Monte Carlo Learning&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#127-sarsa&quot;&gt;12.7 Sarsa($\lambda$)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#128-variable--and&quot;&gt;12.8 Variable $\lambda$ and $\gamma$&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#129-off-policy-traces-with-control-variates&quot;&gt;12.9 Off-Policy Traces with Control Variates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#1210-watkinss-q-to-tree-backup&quot;&gt;12.10 Watkins’s Q($\lambda$) to Tree-Backup($\lambda$)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#1211-stable-off-policy-methods-with-traces&quot;&gt;12.11 Stable Off-Policy Methods with Traces&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#1212-implementation-issues&quot;&gt;12.12 Implementation Issues&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#1213-conclusions&quot;&gt;12.13 Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;121-the-lambda-return-&quot;&gt;12.1 The $\lambda$-return &lt;a name=&quot;121-the--return&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Recall in Chapter 7 we defined an $n$-step return as the sum of the first $n$ rewards plus the estimated value of the state reached in $n$ steps, each appropriately discounted:&lt;/li&gt;
&lt;/ul&gt;

\[G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad\quad 0 \leq t \leq T - n\]

&lt;ul&gt;
  &lt;li&gt;A valid update can be done not just towards any $n$-step return, but also towards any average of $n$-step returns.
    &lt;ul&gt;
      &lt;li&gt;E.g. average the 2-step and 4-step return: $\frac{1}{2} G_{t:t+2} + \frac{1}{2} G_{t:t+4}$&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-1-the-2-and-4-step-returns.png&quot; alt=&quot;compound update of 2-step and 4-step&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backup Diagram for Compound Update&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;The compound update mixing half of a two-step return and half of a four-step return.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;Any set of $n$-step returns can be averaged, even an infinite set, as long as the weights on the component returns are positive and sum to $1$.&lt;/li&gt;
  &lt;li&gt;What if instead of using one $n$-step return, we use a weighted average of all $n$-step returns? Such averaging produces a substantial new range of algorithms. E.g.,
    &lt;ol&gt;
      &lt;li&gt;Averaging one-step and infinite-step returns to interrelate TD and MC methods.&lt;/li&gt;
      &lt;li&gt;Averaging experience-based updates with Dynamic Programming (DP) updates to obtain a single combination of experience-based and model-based methods.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;An update that averages simpler component updates is called a &lt;strong&gt;compound update&lt;/strong&gt;; the &lt;strong&gt;$\lambda$-return&lt;/strong&gt; is one particular compound update.&lt;/li&gt;
  &lt;li&gt;The TD($\lambda$) algorithm is one way of averaging $n$-step updates, each weighted proportionally by $\lambda^{n-1}$ (where $\lambda \in [0,1]$) and normalized by a factor of $(1-\lambda)$ to ensure that the weights sum to $1$.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-1-td-lambda.png&quot; alt=&quot;TD(lambda)&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backup Diagram for TD($\lambda$)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;If $\lambda = 0$, then the overall update reduces to its first component, the &lt;strong&gt;TD(0)&lt;/strong&gt; update, whereas if $\lambda = 1$, then the overall update reduces to its last component, the &lt;strong&gt;MC&lt;/strong&gt; update.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;Essentially, $\lambda$-return, $G_t^\lambda$, combines all $n$-step returns $G_{t:t+n}$ in a weighted average manner, $(1-\lambda)\lambda^{n-1}$, and is defined in its state-based form by:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_t^\lambda \doteq (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}}\]
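
&lt;p&gt;A small Python sketch of this definition (my own toy check, assuming an episodic task so that every $n$-step return with $t+n \geq T$ equals the conventional return $G_t$, as spelled out in the decomposition below; the rewards list holds $R_1, \ldots, R_T$ and the values list holds $\hat{v}(S_0), \ldots, \hat{v}(S_{T-1})$):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def n_step_return(rewards, values, t, n, gamma):
    # G_{t:t+n}: the first n rewards plus a discounted bootstrap, clipped at episode end.
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma**(k - t) * rewards[k] for k in range(t, end))
    if t + n &amp;lt; T:
        G += gamma**n * values[t + n]
    return G

def lambda_return(rewards, values, t, gamma, lam):
    # G_t^lambda: (1-lam) * sum_n lam^(n-1) * G_{t:t+n}, with the residual weight
    # lam^(T-t-1) going to the conventional (full) return.
    T = len(rewards)
    G = sum((1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
            for n in range(1, T - t))
    G += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;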

&lt;h3 id=&quot;tdlambda-weighting&quot;&gt;TD($\lambda$) Weighting&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;The TD($\lambda$) weighting function diagram illustrates the weighting on the sequence of $n$-step returns in the $\lambda$-return:
    &lt;ul&gt;
      &lt;li&gt;$1$-step return gets the largest weight, $1-\lambda$&lt;/li&gt;
      &lt;li&gt;$2$-step return gets the next (2nd) largest weight, $(1-\lambda)\lambda$&lt;/li&gt;
      &lt;li&gt;$3$-step return gets the 3rd largest weight, $(1-\lambda)\lambda^2$&lt;/li&gt;
      &lt;li&gt;$n$-step return gets the $n$-th largest weight, $(1-\lambda)\lambda^{n-1}$, which shrinks as $n$ grows&lt;/li&gt;
      &lt;li&gt;The weight fades by $\lambda$ with each additional step.&lt;/li&gt;
      &lt;li&gt;After a terminal state has been reached, all subsequent $n$-step returns are equal to the conventional return $G_t$.&lt;/li&gt;
      &lt;li&gt;So essentially, we can decompose $G_t^\lambda$ based on the TD($\lambda$) weighting function diagram into the main sum and post-termination terms:&lt;/li&gt;
    &lt;/ul&gt;

\[\begin{array}{l}
G_t^\lambda = (1-\lambda) \sum\nolimits_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t \\
\hspace{3em} \underbrace{\hspace{11em}}_{\text{pre-termination}} \kern{0.5em}\underbrace{\hspace{4em}}_{\text{post-termination}}
\end{array}\]

    &lt;ul&gt;
      &lt;li&gt;So now we can see the impact of $\lambda$ more clearly:&lt;/li&gt;
    &lt;/ul&gt;

\[\begin{aligned}
\text{if } \lambda = 1: \quad &amp;amp; G_t^\lambda = G_t \hspace{18em} \text{(MC)} \\[6pt]
\text{if } \lambda = 0: \quad &amp;amp; G_t^\lambda = G_{t:t+1} \hspace{16em} \text{(TD(0))} \\
&amp;amp; \quad \text{since the weight } (1-\lambda)\lambda^{n-1} \text{ equals } 1 \text{ for } n=1 \text{ and } 0 \text{ for } n &amp;gt; 1
\end{aligned}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-1-td-lambda-weighting-function.png&quot; alt=&quot;TD(lambda) weighting&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;TD($\lambda$) Weighting&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;Weighting given in the $\lambda$-return to each of the $n$-step returns.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;Our first learning algorithm based on the $\lambda$-return is the &lt;strong&gt;off-line $\lambda$-return algorithm&lt;/strong&gt;, which waits until the end of an episode to make updates. Its semi-gradient update toward the $\lambda$-return target, for $t = 0, 1, 2, \ldots, T-1$, is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \!\left[G_t^\lambda - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t)}\]

&lt;ul&gt;
  &lt;li&gt;The $\lambda$-return allows us to move smoothly between MC and TD(0) methods, much as $n$-step returns do.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;forward-view&quot;&gt;Forward View&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;This approach is called the theoretical, forward view of a learning algorithm:
    &lt;ul&gt;
      &lt;li&gt;Update value function towards the $\lambda$-return.&lt;/li&gt;
      &lt;li&gt;Look forward in time to all the future rewards to compute $G_t^\lambda$.&lt;/li&gt;
      &lt;li&gt;Like MC, can only be computed from complete return.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-1-forward-view-td-lambda.png&quot; alt=&quot;Forward view&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Forward View&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;We decide how to update each state by looking forward to future rewards and states.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;122-tdlambda-&quot;&gt;12.2 TD($\lambda$) &lt;a name=&quot;122-td&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;TD($\lambda$) was the first algorithm that showed a formal relationship between a forward view and backward view using eligibility traces.&lt;/li&gt;
  &lt;li&gt;TD($\lambda$) improves over the off-line $\lambda$-return algorithm in 3 ways:
    &lt;ul&gt;
      &lt;li&gt;It updates the weight vector on every step of an episode rather than only at the end.&lt;/li&gt;
      &lt;li&gt;Its computations are distributed evenly in time rather than concentrated at the episode’s end.&lt;/li&gt;
      &lt;li&gt;It can be applied to continuing problems, not just episodic ones.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Let’s focus on the &lt;strong&gt;semi-gradient version of TD($\lambda$)&lt;/strong&gt; with function approximation:
    &lt;ul&gt;
      &lt;li&gt;The &lt;strong&gt;eligibility trace $\mathbf{z}_t$&lt;/strong&gt; has the same number of components as $\mathbf{w}_t$.&lt;/li&gt;
      &lt;li&gt;$\mathbf{z}$ is initialized to $\mathbf{0}$, incremented on each time step by the value gradient, and then fades away by $\gamma\lambda$:&lt;/li&gt;
    &lt;/ul&gt;

\[\begin{align*}
\mathbf{z}_{-1} &amp;amp;\doteq \mathbf{0} \\
\mathbf{z}_t &amp;amp;\doteq \gamma\lambda \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t), \quad 0 \leq t \leq T
\end{align*}\]

\[\text{where } \lambda \equiv \text{trace decay parameter and }  \gamma \equiv \text{discount rate}\]
  &lt;/li&gt;
  &lt;li&gt;The eligibility trace keeps track of which $\mathbf{w}_t$ components have contributed, positively or negatively, to recent state valuations.&lt;/li&gt;
  &lt;li&gt;This is the &lt;strong&gt;recency heuristic&lt;/strong&gt; used for &lt;strong&gt;credit assignment,&lt;/strong&gt; where more credit is assigned to the most recent states. &lt;strong&gt;Recent&lt;/strong&gt; is defined in terms of $\gamma\lambda$.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The TD error for state-value prediction is:&lt;/p&gt;

\[\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\]

    &lt;p&gt;and the weight vector update in TD($\lambda$) is proportional to the scalar TD error and the vector eligibility trace:&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t}\]
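
&lt;p&gt;Putting the trace update, TD error, and weight update together, here is a minimal sketch of semi-gradient TD($\lambda$) for the linear case $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$ (the environment interface and the feature map are assumed, illustrative names):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def td_lambda_episode(env, x, w, alpha, gamma, lam):
    # One episode of semi-gradient TD(lambda) with linear function approximation.
    # env.reset() / env.step() (following the policy being evaluated) and the
    # feature map x(s) are assumed, illustrative interfaces.
    z = np.zeros_like(w)                         # eligibility trace, z_{-1} = 0
    s, done = env.reset(), False
    while not done:
        s_next, r, done = env.step()             # next state and reward under pi
        z = gamma * lam * z + x(s)               # gradient of w.x(s) w.r.t. w is x(s)
        v_next = 0.0 if done else w @ x(s_next)
        delta = r + gamma * v_next - w @ x(s)    # TD error
        w = w + alpha * delta * z                # update proportional to delta and trace
        s = s_next
    return w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;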

&lt;h3 id=&quot;backward-view&quot;&gt;Backward View&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;The forward view provides the theory; the backward view provides the practical mechanism, letting us update online, at every step, from incomplete sequences.&lt;/li&gt;
  &lt;li&gt;Keep an eligibility trace for every state $s$.&lt;/li&gt;
  &lt;li&gt;Update value $V(s)$ for every state $s$ in proportion to TD-error $\delta_t$ and eligibility trace $\mathbf{z}_t$:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
\delta_t &amp;amp;= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\
V(s) &amp;amp;\leftarrow V(s) + \alpha\, \delta_t\, z_t(s)
\end{aligned}\]

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-2-backward-view-td-lambda.png&quot; alt=&quot;Backward TD(lambda)&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backward View of TD($\lambda$)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;In the backward or mechanistic view of TD($\lambda$), each update depends on the 
current TD error combined with the current eligibility traces of past events.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s look at the effect of $\lambda$ to understand the backward view of TD($\lambda$):&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\text{if } \lambda = 0: \quad &amp;amp; \mathbf{z}_t = \nabla \hat{v}(S_t, \mathbf{w}_t) \\
&amp;amp; \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t) \quad \longrightarrow \quad \text{TD(0)} \\[6pt]
\text{if } 0 &amp;lt; \lambda &amp;lt; 1: \quad &amp;amp; \text{earlier states are given less credit for the TD error} \\[6pt]
\text{if } \lambda = 1: \quad &amp;amp; \mathbf{z}_t = \gamma \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t) \quad \longrightarrow \quad \text{credit for earlier states falls by } \gamma \text{ per step} \\[6pt]
\text{if } \lambda = 1,\ \gamma = 1: \quad &amp;amp; \mathbf{z}_t = \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t) \quad \longrightarrow \quad \text{MC-like behavior (no time decay for ET)} \\[6pt]
\text{if } \lambda = 1: \quad &amp;amp; \text{we get TD(1)}
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;In summary, $\lambda = 1$ yields TD(1).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;TD(1) implements MC algorithms in a more general way and for a wider range of applicability:
    &lt;ul&gt;
      &lt;li&gt;Not limited to episodic tasks, can be applied to discounted continuing tasks.&lt;/li&gt;
      &lt;li&gt;Can be performed &lt;strong&gt;incrementally and online.&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;Learns &lt;strong&gt;immediately&lt;/strong&gt; and alters behavior during an episode if something good or bad happens, for control methods.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Linear TD($\lambda$) converges in the on-policy case if the step-size parameter $\alpha$ is reduced over time according to stochastic approximation theory conditions.&lt;/li&gt;
  &lt;li&gt;The convergence of linear TD($\lambda$) is not to the minimum-error weight vector but to a nearby weight vector that depends on $\lambda$.&lt;/li&gt;
  &lt;li&gt;The bound on solution quality generalized for any $\lambda$, for the continuing, discounted case is:&lt;/li&gt;
&lt;/ul&gt;

\[\overline{\text{VE}}(\mathbf{w}_\infty) \leq \frac{1 - \gamma\lambda}{1 - \gamma} \min_\mathbf{w} \overline{\text{VE}}(\mathbf{w})\]

&lt;ul&gt;
  &lt;li&gt;That is, the asymptotic error is no more than $\dfrac{1-\gamma\lambda}{1-\gamma}$ times the smallest possible error for TD($\lambda$):&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\text{as } \lambda \to 1: \quad &amp;amp; \text{the bound} \to \min_\mathbf{w} \overline{\text{VE}}(\mathbf{w}) \\[6pt]
\text{as } \lambda \to 0: \quad &amp;amp; \text{the bound} \to \frac{1}{1-\gamma} \min_\mathbf{w} \overline{\text{VE}}(\mathbf{w}), \text{ with } \mathbf{w}_\infty = \mathbf{w}_\text{TD} \quad \text{(the TD(0) bound)}
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;However, $\lambda = 1$ is often the poorest choice.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;123-n-step-truncated-lambda-return-methods-&quot;&gt;12.3 $n$-step Truncated $\lambda$-return Methods &lt;a name=&quot;123--step-truncated--return-methods&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The off-line $\lambda$-return algorithm is of limited use because the $\lambda$-return is not known until the episode ends.&lt;/li&gt;
  &lt;li&gt;In the continuing case it is never known exactly, since it depends on $n$-step returns for arbitrarily large $n$.&lt;/li&gt;
  &lt;li&gt;Hence, a natural approximation is to truncate the sequence of $n$-step returns after a &lt;strong&gt;fixed&lt;/strong&gt; number of steps, which also handles the continuing case.&lt;/li&gt;
  &lt;li&gt;The truncated $\lambda$-return for time $t$, given data only up to some later horizon $h$, is:&lt;/li&gt;
&lt;/ul&gt;

\[G_{t:h}^\lambda \doteq (1-\lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{h-t-1} G_{t:h}, \quad 0 \leq t \leq h \leq T\]

\[\begin{aligned}
\text{where } h &amp;amp;\equiv \text{horizon (plays same role as time of termination } T\text{)}
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;Here the &lt;strong&gt;residual weighting&lt;/strong&gt; is given to the longest available $n$-step return $G_{t:h}$.&lt;/li&gt;
  &lt;li&gt;The truncated $\lambda$-return gives rise to a family of $n$-step $\lambda$-return algorithms, known in the state-value case as &lt;strong&gt;Truncated TD($\lambda$)&lt;/strong&gt; or &lt;strong&gt;TTD($\lambda$)&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;TTD($\lambda$) is defined for $0 \leq t &amp;lt; T$ by:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \!\left[G_{t:t+n}^\lambda - \hat{v}(S_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1})}\]

&lt;ul&gt;
  &lt;li&gt;Efficient implementation of TTD($\lambda$) relies on the $k$-step $\lambda$-return:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_{t:t+k}^\lambda = \hat{v}(S_t, \mathbf{w}_{t-1}) + \sum_{i=t}^{t+k-1} (\gamma\lambda)^{i-t} \delta_i&apos;}\]

\[\begin{aligned}
\text{where } \delta_i&apos; &amp;amp;\equiv R_{i+1} + \gamma \hat{v}(S_{i+1}, \mathbf{w}_i) - \hat{v}(S_i, \mathbf{w}_{i-1})
\end{aligned}\]

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-3-truncated-td-lambda.png&quot; alt=&quot;TTD(lambda)&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backup Diagram for Truncated TD($\lambda$)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;The truncated $\lambda$-return gives rise to a family of $n$-step $\lambda$-return algorithms called &lt;strong&gt;TTD($\lambda$)&lt;/strong&gt;.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;124-redoing-updates-online-lambda-return-algorithm-&quot;&gt;12.4 Redoing Updates: Online $\lambda$-return Algorithm &lt;a name=&quot;124-redoing-updates-online--return-algorithm&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;How do we choose the truncation parameter $n$ in TTD($\lambda$)?&lt;/li&gt;
  &lt;li&gt;It involves a tradeoff:
    &lt;ul&gt;
      &lt;li&gt;$n$ should be large so that TTD($\lambda$) closely approximates off-line $\lambda$-return, but&lt;/li&gt;
      &lt;li&gt;$n$ should also be small so that the updates can be made sooner and can influence behavior sooner.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;In principle, we can achieve both cases via the &lt;strong&gt;online $\lambda$-return algorithm&lt;/strong&gt;, but at the cost of computational complexity.&lt;/li&gt;
  &lt;li&gt;Essentially, at each time step we go back and redo all the updates since the beginning of the episode as we gather a new increment of data:
    &lt;ul&gt;
      &lt;li&gt;The new updates are better than the old ones because now they account for the time step’s new data.&lt;/li&gt;
      &lt;li&gt;Basically this conceptual algorithm involves multiple passes over the episode, one at each horizon, each generating a different sequence of weight vectors.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Let’s distinguish between the weight vectors computed at the different horizons by writing out the first 3 sequences:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
h=1: \quad &amp;amp; \mathbf{w}_1^1 \doteq \mathbf{w}_0^1 + \alpha \!\left[G_{0:1}^\lambda - \hat{v}(S_0, \mathbf{w}_0^1)\right] \nabla \hat{v}(S_0, \mathbf{w}_0^1) \\[6pt]
h=2: \quad &amp;amp; \mathbf{w}_1^2 \doteq \mathbf{w}_0^2 + \alpha \!\left[G_{0:2}^\lambda - \hat{v}(S_0, \mathbf{w}_0^2)\right] \nabla \hat{v}(S_0, \mathbf{w}_0^2) \\
&amp;amp; \mathbf{w}_2^2 \doteq \mathbf{w}_1^2 + \alpha \!\left[G_{1:2}^\lambda - \hat{v}(S_1, \mathbf{w}_1^2)\right] \nabla \hat{v}(S_1, \mathbf{w}_1^2) \\[6pt]
h=3: \quad &amp;amp; \mathbf{w}_1^3 \doteq \mathbf{w}_0^3 + \alpha \!\left[G_{0:3}^\lambda - \hat{v}(S_0, \mathbf{w}_0^3)\right] \nabla \hat{v}(S_0, \mathbf{w}_0^3) \\
&amp;amp; \mathbf{w}_2^3 \doteq \mathbf{w}_1^3 + \alpha \!\left[G_{1:3}^\lambda - \hat{v}(S_1, \mathbf{w}_1^3)\right] \nabla \hat{v}(S_1, \mathbf{w}_1^3) \\
&amp;amp; \mathbf{w}_3^3 \doteq \mathbf{w}_2^3 + \alpha \!\left[G_{2:3}^\lambda - \hat{v}(S_2, \mathbf{w}_2^3)\right] \nabla \hat{v}(S_2, \mathbf{w}_2^3)
\end{align*}\]

\[\begin{aligned}
\text{where} \\
\mathbf{w}_t^h &amp;amp;\equiv \text{weights used to generate the value at time } t \text{ in the sequence up to horizon } h \\
\mathbf{w}_0^h &amp;amp;\equiv \text{1st weight vector in each sequence that is inherited from the previous episode} \\
\mathbf{w}_h^h &amp;amp;\equiv \text{last weight vector in each sequence; these define the ultimate weight-vector sequence } \mathbf{w}_t \doteq \mathbf{w}_t^t
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;The general form of the &lt;strong&gt;online $\lambda$-return update&lt;/strong&gt; for $0 \leq t &amp;lt; h \leq T$ is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1}^h \doteq \mathbf{w}_t^h + \alpha \!\left[G_{t:h}^\lambda - \hat{v}(S_t, \mathbf{w}_t^h)\right] \nabla \hat{v}(S_t, \mathbf{w}_t^h)}\]

\[\mathbf{w}_t \doteq \mathbf{w}_t^t\]
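
&lt;p&gt;Conceptually, the algorithm looks like the following rough sketch (my own rendering; it is quadratic in the episode length, and the truncated return $G_{t:h}^\lambda$ is treated as a given helper):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def online_lambda_return_episode(states, rewards, x, w0, alpha, gamma, lam,
                                 truncated_lambda_return):
    # At every horizon h, redo all updates t = 0 .. h-1 from the start of the
    # episode using the truncated returns G^lambda_{t:h} (helper assumed given).
    w = w0
    for h in range(1, len(states) + 1):
        w_h = w0                                     # each pass restarts from w_0^h = w_0
        for t in range(h):
            G = truncated_lambda_return(states, rewards, t, h, w_h, gamma, lam)
            v = w_h @ x(states[t])
            w_h = w_h + alpha * (G - v) * x(states[t])
        w = w_h                                      # keep only the last vector, w_t = w_t^t
    return w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;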

&lt;hr /&gt;

&lt;h2 id=&quot;125-true-online-tdlambda-&quot;&gt;12.5 True Online TD($\lambda$) &lt;a name=&quot;125-true-online-td&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The ideal online $\lambda$-return algorithm of Section 12.4 performs well but is computationally very expensive.&lt;/li&gt;
  &lt;li&gt;We can use eligibility traces to invert this forward-view algorithm into an efficient backward-view algorithm; for the linear case the inversion is exact, and the result is called &lt;strong&gt;True Online TD($\lambda$)&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;It rests on a simple trick with the weight matrix: of all the weight vectors produced by the online $\lambda$-return algorithm, we only need the last one from each time step (the diagonal of the weight matrix).&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
\begin{bmatrix}
\mathbf{w}_0^0 &amp;amp; &amp;amp; &amp;amp; &amp;amp; \\
\mathbf{w}_0^1 &amp;amp; \mathbf{w}_1^1 &amp;amp; &amp;amp; &amp;amp; \\
\mathbf{w}_0^2 &amp;amp; \mathbf{w}_1^2 &amp;amp; \mathbf{w}_2^2 &amp;amp; &amp;amp; \\
\mathbf{w}_0^3 &amp;amp; \mathbf{w}_1^3 &amp;amp; \mathbf{w}_2^3 &amp;amp; \mathbf{w}_3^3 &amp;amp; \\
\vdots &amp;amp; \vdots &amp;amp; \vdots &amp;amp; \vdots &amp;amp; \ddots \\
\mathbf{w}_0^T &amp;amp; \mathbf{w}_1^T &amp;amp; \mathbf{w}_2^T &amp;amp; \mathbf{w}_3^T &amp;amp; \cdots &amp;amp; \mathbf{w}_T^T
\end{bmatrix}
&amp;amp;\longrightarrow
\begin{bmatrix}
\mathbf{w}_0^0 \\
&amp;amp; \mathbf{w}_1^1 \\
&amp;amp; &amp;amp; \mathbf{w}_2^2 \\
&amp;amp; &amp;amp; &amp;amp; \mathbf{w}_3^3 \\
&amp;amp; &amp;amp; &amp;amp; &amp;amp; \ddots \\
&amp;amp; &amp;amp; &amp;amp; &amp;amp; &amp;amp; \mathbf{w}_T^T
\end{bmatrix} \\
\end{aligned}\]

\[\text{Online } \lambda\text{-return} \hspace{8em} \text{True Online TD}(\lambda)\]

&lt;ul&gt;
  &lt;li&gt;For the linear case in which $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$, the true online TD($\lambda$) algorithm is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t + \alpha \!\left(\mathbf{w}_t^T \mathbf{x}_t - \mathbf{w}_{t-1}^T \mathbf{x}_t\right)\!\left(\mathbf{z}_t - \mathbf{x}_t\right)}\]

\[\begin{aligned}
\text{where} \\
\mathbf{w}_t &amp;amp;\doteq \mathbf{w}_t^t \\
\mathbf{x}_t &amp;amp;\doteq \mathbf{x}(S_t) \\
\mathbf{z}_t &amp;amp;\doteq \gamma\lambda \mathbf{z}_{t-1} + (1 - \alpha\gamma\lambda\, \mathbf{z}_{t-1}^T \mathbf{x}_t)\, \mathbf{x}_t
\end{aligned}\]
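
&lt;p&gt;Here is a minimal sketch of true online TD($\lambda$) for the linear case, written in the common pseudocode form that carries a $V_{old}$ variable (equivalent to the boxed update above; the environment interface and feature map are assumed, illustrative names):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def true_online_td_lambda_episode(env, x, w, alpha, gamma, lam):
    # One episode of true online TD(lambda), linear case, with a dutch trace.
    # env.reset() / env.step() follow the policy being evaluated (assumed interface).
    s, done = env.reset(), False
    xt = x(s)
    z = np.zeros_like(w)
    v_old = 0.0
    while not done:
        s_next, r, done = env.step()
        x_next = np.zeros_like(xt) if done else x(s_next)
        v = w @ xt
        v_next = w @ x_next
        delta = r + gamma * v_next - v
        z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ xt)) * xt   # dutch trace
        w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * xt
        v_old = v_next
        xt = x_next
    return w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;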

&lt;ul&gt;
  &lt;li&gt;The per-step computational complexity of true online TD($\lambda$) is the same as TD($\lambda$), $O(d)$.&lt;/li&gt;
  &lt;li&gt;$\mathbf{z}_t$ used in true online TD($\lambda$) is called a &lt;strong&gt;dutch trace&lt;/strong&gt;, unlike that of TD($\lambda$) which is called an &lt;strong&gt;accumulating trace&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Earlier work used a 3rd kind of trace called the &lt;strong&gt;replacing trace&lt;/strong&gt;, defined only for the tabular case or for binary feature vectors (tile coding). It is defined:&lt;/li&gt;
&lt;/ul&gt;

\[\tilde{z}_{i,t} \doteq \left\{ \begin{array}{ll} 1 &amp;amp; \text{if } x_{i,t} = 1 \\ \gamma\lambda\, z_{i,t-1} &amp;amp; \text{otherwise} \end{array} \right\}\]

&lt;ul&gt;
  &lt;li&gt;Nowadays, dutch traces usually perform better than replacing traces and have a clearer theoretical basis.&lt;/li&gt;
  &lt;li&gt;Accumulating traces remain of interest for nonlinear function approximations where dutch traces are unavailable.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;126-dutch-traces-in-monte-carlo-learning&quot;&gt;12.6 Dutch Traces in Monte Carlo Learning&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Eligibility traces have nothing to do with TD learning despite their close historical association.&lt;/li&gt;
  &lt;li&gt;Eligibility traces arise even in Monte Carlo learning.&lt;/li&gt;
  &lt;li&gt;Using dutch traces, we can invert the forward view MC algorithm to an equivalent, yet computationally cheaper backward view algorithm.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;This is the only equivalence of forward and backward view that is explicitly demonstrated in this book.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;The linear, gradient MC prediction algorithm makes the following sequence of updates, one for each time step of the episode:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \!\left[G - \mathbf{w}_t^T \mathbf{x}_t\right] \mathbf{x}_t, \quad 0 \leq t &amp;lt; T\]

&lt;ul&gt;
  &lt;li&gt;For simplicity, assume that the return $G$ is a single reward at the end of the episode (hence no subscript by time) and that there is no discounting.&lt;/li&gt;
  &lt;li&gt;This is known as the &lt;strong&gt;Least Mean Square (LMS)&lt;/strong&gt; rule.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;We introduce an additional vector memory, the &lt;strong&gt;eligibility trace,&lt;/strong&gt; that keeps a summary of all the feature vectors seen so far. Its overall effect is the same as the sequence of MC updates shown above:&lt;/p&gt;

\[\begin{align*}
\mathbf{w}_T &amp;amp;= \mathbf{w}_{T-1} + \alpha \!\left(G - \mathbf{w}_{T-1}^T \mathbf{x}_{T-1}\right) \mathbf{x}_{T-1} \\
&amp;amp;= \mathbf{w}_{T-1} + \alpha \mathbf{x}_{T-1}\!\left(-\mathbf{x}_{T-1}^T \mathbf{w}_{T-1}\right) + \alpha G \mathbf{x}_{T-1} \\
&amp;amp;= \!\left(\mathbf{I} - \alpha \mathbf{x}_{T-1} \mathbf{x}_{T-1}^T\right) \mathbf{w}_{T-1} + \alpha G \mathbf{x}_{T-1} \\
&amp;amp;= \mathbf{F}_{T-1}\, \mathbf{w}_{T-1} + \alpha G \mathbf{x}_{T-1}
\end{align*}\]

\[\begin{aligned}
\text{where} \\
\mathbf{F}_t &amp;amp;\doteq \mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T \equiv \text{a forgetting or fading matrix}
\end{aligned}\]

\[\therefore\quad \mathbf{w}_{T-1} = \mathbf{F}_{T-2}\, \mathbf{w}_{T-2} + \alpha G \mathbf{x}_{T-2}\]

    &lt;p&gt;Now recursing:&lt;/p&gt;

\[\begin{align*}
\mathbf{w}_T &amp;amp;= \mathbf{F}_{T-1}\, \mathbf{w}_{T-1} + \alpha G \mathbf{x}_{T-1} \\
&amp;amp;= \mathbf{F}_{T-1}\!\left(\mathbf{F}_{T-2}\, \mathbf{w}_{T-2} + \alpha G \mathbf{x}_{T-2}\right) + \alpha G \mathbf{x}_{T-1} \\
&amp;amp;= \mathbf{F}_{T-1} \mathbf{F}_{T-2}\, \mathbf{w}_{T-2} + \alpha G\!\left(\mathbf{F}_{T-1} \mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\
&amp;amp;= \mathbf{F}_{T-1} \mathbf{F}_{T-2}\!\left(\mathbf{F}_{T-3}\, \mathbf{w}_{T-3} + \alpha G\, \mathbf{x}_{T-3}\right) + \alpha G\!\left(\mathbf{F}_{T-1}\, \mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\
&amp;amp;= \mathbf{F}_{T-1} \mathbf{F}_{T-2} \mathbf{F}_{T-3}\, \mathbf{w}_{T-3} + \alpha G\!\left(\mathbf{F}_{T-1} \mathbf{F}_{T-2}\, \mathbf{x}_{T-3} + \mathbf{F}_{T-1}\, \mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\
&amp;amp;\quad \vdots \\
&amp;amp;= \underbrace{\mathbf{F}_{T-1} \mathbf{F}_{T-2} \cdots \mathbf{F}_0\, \mathbf{w}_0}_{\mathbf{a}_{T-1}} + \alpha G \underbrace{\sum\nolimits_{k=0}^{T-1} \mathbf{F}_{T-1} \mathbf{F}_{T-2} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k}_{\mathbf{z}_{T-1}} \\
&amp;amp;= \mathbf{a}_{T-1} + \alpha G\, \mathbf{z}_{T-1}
\end{align*}\]

\[\begin{aligned}
\text{where} \\
\mathbf{a}_{T-1}\ \&amp;amp;\ \mathbf{z}_{T-1} &amp;amp;\equiv \text{values at time } T-1 \text{ of 2 auxiliary memory vectors that can be updated} \\
&amp;amp;\phantom{{}\equiv{}} \text{incrementally w/o knowledge of } G \text{ and with } O(d) \text{ complexity per time step} \\
\mathbf{z}_t &amp;amp;\equiv \text{dutch-style eligibility trace, initialized to } \mathbf{z}_0 = \mathbf{x}_0
\end{aligned}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The $\mathbf{z}_t$ vector is in fact a dutch-style eligibility trace, initialized to $\mathbf{z}_0 = \mathbf{x}_0$, that can be updated according to:&lt;/p&gt;

\[\begin{align*}
\mathbf{z}_t &amp;amp;= \sum\nolimits_{k=0}^{t} \mathbf{F}_t \mathbf{F}_{t-1} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k, \quad 1 \leq t &amp;lt; T \\
&amp;amp;= \sum\nolimits_{k=0}^{t-1} \mathbf{F}_t \mathbf{F}_{t-1} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k + \mathbf{x}_t \\
&amp;amp;= \mathbf{F}_t \sum\nolimits_{k=0}^{t-1} \mathbf{F}_{t-1} \mathbf{F}_{t-2} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k + \mathbf{x}_t \\
&amp;amp;= \mathbf{F}_t\, \mathbf{z}_{t-1} + \mathbf{x}_t \\
&amp;amp;= \!\left(\mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T\right) \mathbf{z}_{t-1} + \mathbf{x}_t \\
&amp;amp;= \mathbf{z}_{t-1} - \alpha\!\left(\mathbf{z}_{t-1}^T \mathbf{x}_t\right) \mathbf{x}_t + \mathbf{x}_t \\
&amp;amp;\boxed{= \mathbf{z}_{t-1} + \!\left(1 - \alpha\, \mathbf{z}_{t-1}^T \mathbf{x}_t\right) \mathbf{x}_t}
\end{align*}\]

    &lt;p&gt;which is the dutch trace for $\gamma\lambda = 1$.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The $\mathbf{a}_t$ auxiliary vector is initialized to $\mathbf{a}_0 = \mathbf{w}_0$ and then updated according to:&lt;/p&gt;

\[\begin{align*}
\mathbf{a}_t &amp;amp;= \mathbf{F}_t \mathbf{F}_{t-1} \cdots \mathbf{F}_0\, \mathbf{w}_0, \quad 1 \leq t &amp;lt; T \\
&amp;amp;= \mathbf{F}_t\, \mathbf{a}_{t-1} \\
&amp;amp;= \left(\mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T\right) \mathbf{a}_{t-1} \\
&amp;amp;\boxed{= \mathbf{a}_{t-1} - \alpha\, \mathbf{x}_t \mathbf{x}_t^T\, \mathbf{a}_{t-1}}
\end{align*}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;The auxiliary vectors, $\mathbf{a}_t$ and $\mathbf{z}_t$, are updated on each time step $t &amp;lt; T$ and then, at time $T$ when $G$ is observed, they are used to compute:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_T = \mathbf{a}_{T-1} + \alpha G\, \mathbf{z}_{T-1}}\]

&lt;ul&gt;
  &lt;li&gt;The time and memory complexity per step is $O(d)$.&lt;/li&gt;
  &lt;li&gt;This is surprising and intriguing since ET is working in a non-TD setting (eligibility traces arise whenever long-term predictions need to be learned efficiently).&lt;/li&gt;
&lt;/ul&gt;
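
&lt;p&gt;A toy numpy check (my own construction) that the incremental $\mathbf{a}$/$\mathbf{z}$ computation above reproduces the forward LMS pass exactly; note that the fading matrix $\mathbf{F}_0$ is applied to $\mathbf{a}$ at $t = 0$ as well, so that $\mathbf{a}_{T-1} = \mathbf{F}_{T-1}\cdots\mathbf{F}_0\, \mathbf{w}_0$:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
T, d, alpha = 10, 5, 0.1
X = rng.standard_normal((T, d))   # feature vectors x_0 .. x_{T-1}
G = 1.7                           # single return, observed only at the end
w0 = rng.standard_normal(d)

# Forward view: one LMS update per step, every one of them needing G.
w = w0.copy()
for t in range(T):
    w = w + alpha * (G - w @ X[t]) * X[t]

# Backward view: maintain a and z incrementally, touch G only once at the end.
a = w0 - alpha * X[0] * (X[0] @ w0)               # apply F_0 to a
z = X[0].copy()                                   # z_0 = x_0
for t in range(1, T):
    a = a - alpha * X[t] * (X[t] @ a)             # a_t = F_t a_{t-1}
    z = z + (1.0 - alpha * (z @ X[t])) * X[t]     # dutch trace with gamma * lambda = 1
w_backward = a + alpha * G * z

assert np.allclose(w, w_backward)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;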

&lt;hr /&gt;

&lt;h2 id=&quot;127-sarsalambda-&quot;&gt;12.7 Sarsa($\lambda$) &lt;a name=&quot;127-sarsa&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Now let’s extend eligibility traces to action-value methods.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;First, let’s recall the action-value form of the &lt;strong&gt;$n$-step&lt;/strong&gt; return:&lt;/p&gt;

\[G_{t:t+n} \doteq R_{t+1} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \quad t+n &amp;lt; T\]

\[\text{with} \quad G_{t:t+n} = G_t \quad \text{ if } t+n \geq T.\]
  &lt;/li&gt;
  &lt;li&gt;With this, for $t = 0, \ldots, T-1,$ let’s form the action-value form of the &lt;strong&gt;off-line $\lambda$-return&lt;/strong&gt; algorithm which uses $\hat{q}$ rather than $\hat{v}$:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \!\left[G_t^\lambda - \hat{q}(S_t, A_t, \mathbf{w}_t)\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\]

\[\begin{aligned}
\text{where} \quad G_t^\lambda &amp;amp;\doteq G_{t:\infty}^\lambda
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;For the forward view shown in the figure below, which is similar to TD($\lambda$), the updates are:
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;1st update:&lt;/strong&gt; one full-step lookahead&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;2nd update:&lt;/strong&gt; two-step lookahead&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Final update:&lt;/strong&gt; complete return.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-7-sarsa-lambda.png&quot; alt=&quot;Sarsa(lambda)&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backup Diagram for Sarsa($\lambda$)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;The first update looks ahead one full step, to the next state–action pair, the second looks
ahead two steps, to the second state–action pair, and so on. A final update is based on
the complete return.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;The weighting of each $n$-step update in the $\lambda$-return is the same as for TD($\lambda$) and the $\lambda$-return.&lt;/li&gt;
  &lt;li&gt;The temporal-difference method for action values, &lt;strong&gt;Sarsa($\lambda$)&lt;/strong&gt;, approximates this forward view and has the same update rule as TD($\lambda$):&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t}\]

&lt;ul&gt;
  &lt;li&gt;The action-value form of the TD error is used:&lt;/li&gt;
&lt;/ul&gt;

\[\delta_t \doteq R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)\]

&lt;ul&gt;
  &lt;li&gt;The action-value form of the eligibility trace:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\mathbf{z}_{-1} &amp;amp;\doteq \mathbf{0} \\
\mathbf{z}_t &amp;amp;\doteq \gamma\lambda \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t), \quad 0 \leq t \leq T
\end{align*}\]
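
&lt;p&gt;Combining the trace, TD error, and weight update gives the following minimal Sarsa($\lambda$) sketch for linear $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s, a)$ (the environment, the feature map, and the $\varepsilon$-greedy action selection are my own illustrative choices):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def sarsa_lambda_episode(env, x, w, actions, alpha, gamma, lam, epsilon=0.1):
    # One episode of Sarsa(lambda) with an accumulating trace and linear q-values.
    def q(s, a):
        return w @ x(s, a)

    def epsilon_greedy(s):
        if np.random.rand() &amp;lt; epsilon:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: q(s, a))

    z = np.zeros_like(w)                      # eligibility trace z_{-1} = 0
    s = env.reset()
    a = epsilon_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        z = gamma * lam * z + x(s, a)         # gradient of q(s, a, w) w.r.t. w is x(s, a)
        if done:
            delta = r - q(s, a)
        else:
            a_next = epsilon_greedy(s_next)
            delta = r + gamma * q(s_next, a_next) - q(s, a)
        w = w + alpha * delta * z
        if not done:
            s, a = s_next, a_next
    return w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;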

&lt;ul&gt;
  &lt;li&gt;There are action-value versions of our ideal TD method, the online $\lambda$-return algorithm, and of its efficient implementation, true online TD($\lambda$):
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Section &lt;a href=&quot;#124-redoing-updates-online--return-algorithm&quot;&gt;12.4&lt;/a&gt;:&lt;/strong&gt; Everything there holds here except for using the action-value form of the $n$-step return, $G_{t:t+n}$.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Sections &lt;a href=&quot;#125-true-online-td&quot;&gt;12.5&lt;/a&gt; &amp;amp; &lt;a href=&quot;#126-dutch-traces-in-monte-carlo-learning&quot;&gt;12.6&lt;/a&gt;:&lt;/strong&gt; Everything holds here except for using state-action feature vectors $\mathbf{x}_t = \mathbf{x}(S_t, A_t)$ instead of state feature vectors $\mathbf{x}_t = \mathbf{x}(S_t)$.&lt;/li&gt;
      &lt;li&gt;The resulting efficient backward algorithm obtained from using the eligibility trace to invert the action-value form of the forward view, online $\lambda$-return is called the &lt;strong&gt;True Online Sarsa($\lambda$)&lt;/strong&gt;.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;There is also a truncated version of Sarsa($\lambda$), called &lt;strong&gt;Forward Sarsa($\lambda$)&lt;/strong&gt;, a model-free control method well suited for use in conjunction with multi-layer ANNs.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;128-variable-lambda-and-gamma-&quot;&gt;12.8 Variable $\lambda$ and $\gamma$ &lt;a name=&quot;128-variable--and&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;To get the most general forms of the final TD algorithms, it is vital to generalize the degree of bootstrapping and discounting beyond constant parameters to functions dependent on the state and action:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\lambda_t &amp;amp;\doteq \lambda(S_t, A_t), \quad &amp;amp; \lambda &amp;amp;: S \times A \to [0,1] \\
\gamma_t &amp;amp;\doteq \gamma(S_t), \quad &amp;amp; \gamma &amp;amp;: S \to [0,1]
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;$\gamma_t$ is now called the termination function and is significant because it changes the return $G_t$, which is now more generally defined as:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
G_t &amp;amp;\doteq R_{t+1} + \gamma_{t+1} G_{t+1} \\
&amp;amp;= R_{t+1} + \gamma_{t+1} R_{t+2} + \gamma_{t+1} \gamma_{t+2} R_{t+3} + \gamma_{t+1} \gamma_{t+2} \gamma_{t+3} R_{t+4} + \ldots \\
&amp;amp;= \sum_{k=t}^{\infty} \left(\prod_{i=t+1}^{k} \gamma_i\right) R_{k+1}
\end{align*}\]

\[\begin{aligned}
\text{where } \prod_{k=t}^{\infty} \gamma_k &amp;amp;= 0 \text{ with probability 1 for all } t, \text{ to assure the sums are finite}
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;This general definition of the return $G_t$ allows episodic settings to be treated as a single stream of experience, without a special terminal state, start distribution, or termination times
    &lt;ul&gt;
      &lt;li&gt;A terminal state just becomes a state with $\gamma(s) = 0$ that transitions to the start distribution.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Generalization to variable bootstrapping yields a new state-based $\lambda$-return:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_t^{\lambda s} \doteq R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \hat{v}(S_{t+1}, \mathbf{w}_t) + \lambda_{t+1}\, G_{t+1}^{\lambda s}\right]}\]
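
&lt;p&gt;As a hedged illustration of the recursion above, the following sketch (my own, not from the book) computes the state-based $\lambda$-return backward over a recorded episode, with state-dependent $\lambda_t$ and $\gamma_t$ and the terminal state encoded by $\gamma = 0$. The array layout is an assumption made for the example.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# Assumed layout for an episode of length T:
#   rewards[t]            = R_{t+1}                     (length T)
#   values[t]             = v_hat(S_t, w)               (length T)
#   gammas[k], lambdas[k] = gamma_k, lambda_k, k = 0..T, gammas[T] = 0
def state_lambda_returns(rewards, values, lambdas, gammas):
    T = len(rewards)
    G = np.zeros(T)
    G_next = 0.0                                  # G_T = 0
    for t in reversed(range(T)):
        v_next = values[t + 1] if t + 1 &amp;lt; T else 0.0
        G[t] = rewards[t] + gammas[t + 1] * (
            (1 - lambdas[t + 1]) * v_next + lambdas[t + 1] * G_next
        )
        G_next = G[t]
    return G
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;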

&lt;ul&gt;
  &lt;li&gt;Action-based $\lambda$-return is either the &lt;strong&gt;Sarsa form&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_t^{\lambda a} \doteq R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) + \lambda_{t+1}\, G_{t+1}^{\lambda a}\right]}\]

&lt;ul&gt;
  &lt;li&gt;or the &lt;strong&gt;Expected Sarsa form&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_t^{\lambda a} \doteq R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \bar{V}_t(S_{t+1}) + \lambda_{t+1}\, G_{t+1}^{\lambda a}\right]}\]

\[\begin{aligned}
\text{where } \bar{V}_t(s) \doteq \sum_a \pi(a \vert s)\, \hat{q}(s, a, \mathbf{w}_t) \\
\end{aligned}\]

&lt;h3 id=&quot;superscripts-notation-for-i-in-g_tlambda-i&quot;&gt;Superscripts notation for $i$ in $G_t^{\lambda i}$&lt;/h3&gt;

\[\begin{aligned}
\text{&quot;s&quot;} &amp;amp;: \text{bootstraps from state values} \\
\text{&quot;a&quot;} &amp;amp;: \text{bootstraps from action values}
\end{aligned}\]

&lt;hr /&gt;

&lt;h2 id=&quot;129-off-policy-traces-with-control-variates&quot;&gt;12.9 Off-Policy Traces with Control Variates&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;To generalize to off-policy, we need to incorporate importance sampling using eligibility traces.&lt;/li&gt;
  &lt;li&gt;Let’s focus on the bootstrapping generalization of per-decision importance sampling with control variates &lt;strong&gt;(Section 7.4).&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;The new state-based $\lambda$-return in &lt;strong&gt;Section &lt;a href=&quot;#128-variable--and&quot;&gt;12.8&lt;/a&gt;&lt;/strong&gt; generalizes to the off-policy case, following the model of the off-policy, control-variate, $n$-step return (ending at horizon $h$):&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_t^{\lambda s} \doteq \rho_t \!\left(R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \hat{v}(S_{t+1}, \mathbf{w}_t) + \lambda_{t+1}\, G_{t+1}^{\lambda s}\right]\right) + (1 - \rho_t)\, \hat{v}(S_t, \mathbf{w}_t)}\]

\[\begin{aligned}
\text{where } \rho_t &amp;amp;= \frac{\pi(A_t \vert S_t)}{b(A_t \vert S_t)}
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;The final $\lambda$-return can be approximated in terms of sums of the state-based TD error $\delta_t^s$, with the approximation becoming exact if the approximate value function does not change:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
G_t^{\lambda s} &amp;amp;\approx \hat{v}(S_t, \mathbf{w}_t) + \rho_t \sum_{k=t}^{\infty} \delta_k^s \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\
\delta_t^s &amp;amp;\doteq R_{t+1} + \gamma_{t+1}\, \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The forward view update of the approximate $\lambda$-return is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\mathbf{w}_{t+1} &amp;amp;= \mathbf{w}_t + \alpha \!\left[G_t^{\lambda s} - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t) \\
&amp;amp;\boxed{\approx \mathbf{w}_t + \alpha \rho_t \!\left(\sum_{k=t}^{\infty} \delta_k^s \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i\right) \nabla \hat{v}(S_t, \mathbf{w}_t)}
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;We’re interested in the equivalence (approximately) between the forward-view update summed over time and a backward-view update summed over time. The equivalence is approximate because we ignore changes in the value function.&lt;/li&gt;
  &lt;li&gt;The sum of the forward-view update over time is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\sum_{t=0}^{\infty} \!\left(\mathbf{w}_{t+1} - \mathbf{w}_t\right) &amp;amp;\approx \sum_{t=0}^{\infty} \sum_{k=t}^{\infty} \alpha \rho_t\, \delta_k^s \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\
&amp;amp;= \sum_{k=0}^{\infty} \sum_{t=0}^{k} \alpha \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t)\, \delta_k^s \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\
&amp;amp;\quad \left(\text{using the summation rule: } \sum_{t=x}^{y} \sum_{k=t}^{y} = \sum_{k=x}^{y} \sum_{t=x}^{k}\right) \\
&amp;amp;= \sum_{k=0}^{\infty} \alpha\, \delta_k^s \sum_{t=0}^{k} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;If the entire expression from the 2nd sum on could be written and updated incrementally as an eligibility trace, then the sum of the forward-view update over time would be in the form of the sum of a backward-view TD update.
    &lt;ul&gt;
      &lt;li&gt;That is, if this expression was the trace at time $k$, then we could update it from its value at time $k-1$ by:&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\mathbf{z}_k &amp;amp;= \sum_{t=0}^{k} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\
&amp;amp;= \sum_{t=0}^{k-1} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i + \rho_k \nabla \hat{v}(S_k, \mathbf{w}_k) \\
&amp;amp;= \gamma_k \lambda_k \rho_k \underbrace{\sum_{t=0}^{k-1} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k-1} \gamma_i \lambda_i \rho_i}_{\mathbf{z}_{k-1}} + \rho_k \nabla \hat{v}(S_k, \mathbf{w}_k)
\end{align*}\]

\[\boxed{\mathbf{z}_k = \rho_k \!\left[\gamma_k \lambda_k\, \mathbf{z}_{k-1} + \nabla \hat{v}(S_k, \mathbf{w}_k)\right]}\]

&lt;ul&gt;
  &lt;li&gt;If we change the index from $k$ to $t$ of the $\mathbf{z}_k$ equation above, we get the &lt;strong&gt;general accumulating trace&lt;/strong&gt; update for state values:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{z}_t \doteq \rho_t \!\left[\gamma_t \lambda_t\, \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t)\right]}\]

&lt;ul&gt;
  &lt;li&gt;This eligibility trace combined with the usual semi-gradient TD($\lambda$) parameter-update rule &lt;strong&gt;(Section &lt;a href=&quot;#122-td&quot;&gt;12.2&lt;/a&gt;)&lt;/strong&gt; forms a &lt;strong&gt;general TD($\lambda$)&lt;/strong&gt; algorithm that can be applied to either on-policy or off-policy data:
    &lt;ul&gt;
      &lt;li&gt;In on-policy, the algorithm is exactly TD($\lambda$) because $\rho_t = 1$ always and the ET above becomes the usual accumulating trace for variable $\lambda$ and $\gamma$:&lt;/li&gt;
    &lt;/ul&gt;

\[\mathbf{z}_t \doteq \gamma_t \lambda_t\, \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t)\]

    &lt;ul&gt;
      &lt;li&gt;In off-policy, the algorithm stays as it is, although not guaranteed to be stable as a semi-gradient method.&lt;/li&gt;
      &lt;li&gt;For off-policy, we’ll consider extensions that guarantee stability in the next few sections.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Let’s derive the off-policy ET for &lt;strong&gt;action-value&lt;/strong&gt; methods and corresponding general Sarsa($\lambda$) algorithms.
    &lt;ul&gt;
      &lt;li&gt;Starting with either recursive general action-based $\lambda$-return of Sarsa or Expected Sarsa, $G_t^{\lambda a}$, in &lt;strong&gt;Section &lt;a href=&quot;#128-variable--and&quot;&gt;12.8&lt;/a&gt;&lt;/strong&gt; (Expected Sarsa works out to be simpler), we can extend the Expected Sarsa $G_t^{\lambda a}$ to the off-policy case after the off-policy model of action-based, off-policy, control variate, $n$-step return:&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\begin{align*}
G_t^{\lambda a} &amp;amp;\doteq R_{t+1} + \gamma_{t+1}\!\left(\!\left[1 - \lambda_{t+1}\right] \bar{V}_t(S_{t+1}) + \lambda_{t+1}\!\left[\rho_{t+1} G_{t+1}^{\lambda a} + \bar{V}_t(S_{t+1}) - \rho_{t+1}\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)\right]\right) \\
&amp;amp;= R_{t+1} + \gamma_{t+1}\!\left(\bar{V}_t(S_{t+1}) + \lambda_{t+1} \rho_{t+1} \!\left[G_{t+1}^{\lambda a} - \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)\right]\right)
\end{align*}}\]

\[\begin{aligned}
\text{where } \bar{V}_t(S_{t+1}) &amp;amp;= \sum_a \pi(a \vert S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t)
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;The $\lambda$-return, approximately as the sum of TD errors, is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
G_t^{\lambda a} &amp;amp;\approx \hat{q}(S_t, A_t, \mathbf{w}_t) + \sum_{k=t}^{\infty} \delta_k^a \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\
\delta_t^a &amp;amp;= R_{t+1} + \gamma_{t+1} \bar{V}(S_{t+1}) - \hat{q}(S_t, A_t, \mathbf{w}_t)
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;Using steps analogous to those for the state case earlier in this section, write a forward-view update based on action-based $\lambda$-return $G_t^{\lambda a}$ above, then transform the sum of the updates using the summation rule and finally derive the eligibility trace for action values:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{z}_t \doteq \gamma_t \lambda_t \rho_t\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\]

&lt;ul&gt;
  &lt;li&gt;This ET, combined with the action-based expected TD error $\delta_t^a$ and the usual semi-gradient TD($\lambda$) parameter-update rule &lt;strong&gt;(Section &lt;a href=&quot;#122-td&quot;&gt;12.2&lt;/a&gt;)&lt;/strong&gt;, forms an elegant, efficient &lt;strong&gt;Expected Sarsa($\lambda$)&lt;/strong&gt; algorithm that can be applied to either on-policy or off-policy data (a minimal per-step sketch follows this list):
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;&lt;u&gt;On-policy case&lt;/u&gt;:&lt;/strong&gt; The algorithm becomes the Sarsa($\lambda$) algorithm given constant $\lambda$ and $\gamma$, and the usual state-action TD error:&lt;/li&gt;
    &lt;/ul&gt;

\[\begin{aligned}
&amp;amp;\quad \rho_t = 1, \quad \lambda_t = \lambda, \quad \gamma_t = \gamma \quad \text{(constants)} \\
&amp;amp;\boxed{\mathbf{z}_t \doteq \gamma\lambda\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}
\end{aligned}\]
  &lt;/li&gt;
  &lt;li&gt;At $\lambda = 1$, these algorithms become closely related to corresponding Monte Carlo algorithms.&lt;/li&gt;
  &lt;li&gt;No episode-by-episode equivalence of updates exists, only of their expectations, even under the most favorable conditions.
    &lt;ul&gt;
      &lt;li&gt;Methods have been proposed recently &lt;strong&gt;[Sutton, Mahmood, Precup &amp;amp; van Hasselt, 2014]&lt;/strong&gt; that do achieve an exact equivalence.&lt;/li&gt;
      &lt;li&gt;These methods require an additional vector of &lt;strong&gt;“provisional weights”&lt;/strong&gt; that keep track of executed updates but may need to be retracted/emphasized depending on future actions taken.&lt;/li&gt;
      &lt;li&gt;The state and state-action versions of these methods are called &lt;strong&gt;PTD($\lambda$) and PQ($\lambda$)&lt;/strong&gt; respectively, where the ‘P’ stands for Provisional.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If $\lambda &amp;lt; 1$, then all these off-policy algorithms involve bootstrapping and &lt;strong&gt;the deadly triad&lt;/strong&gt; applies, meaning that they can be guaranteed stable only for the tabular case, state aggregation and other limited forms of function approximation.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Recall the challenge of off-policy learning has 2 parts. Off-policy eligibility traces deal effectively with the 1st part, correcting for the expected value of the targets, but not with the 2nd part that has to do with distribution of updates (matching off-policy to on-policy).&lt;/li&gt;
  &lt;li&gt;Algorithmic strategies for handling the 2nd part of the off-policy learning challenge with eligibility traces are summarized in &lt;strong&gt;Section &lt;a href=&quot;#1211-stable-off-policy-methods-with-traces&quot;&gt;12.11&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
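
&lt;p&gt;Below is the minimal per-step sketch (my own, not from the book) of the Expected Sarsa($\lambda$) update described above, for a linear $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s, a)$. The functions x, target_probs, and behavior_probs are assumed interfaces.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# One step of off-policy Expected Sarsa(lambda) for linear q_hat(s, a, w).
# Assumed interfaces: x(s, a) gives the feature vector,
# target_probs(s)[a] = pi(a|s), behavior_probs(s)[a] = b(a|s).
def expected_sarsa_lambda_step(w, z, s, a, r, s_next, x, num_actions,
                               target_probs, behavior_probs,
                               alpha, gam_t, gam_next, lam_t):
    rho = target_probs(s)[a] / behavior_probs(s)[a]
    v_bar = sum(target_probs(s_next)[ap] * (w @ x(s_next, ap))
                for ap in range(num_actions))      # V_bar(S_{t+1})
    delta = r + gam_next * v_bar - w @ x(s, a)     # expected TD error
    z = gam_t * lam_t * rho * z + x(s, a)          # off-policy action-value trace
    w = w + alpha * delta * z                      # semi-gradient update
    return w, z
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;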

&lt;hr /&gt;

&lt;h2 id=&quot;1210-watkinss-qlambda-to-tree-backuplambda-&quot;&gt;12.10 Watkins’s Q($\lambda$) to Tree-Backup($\lambda$) &lt;a name=&quot;1210-watkinss-q-to-tree-backup&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;h3 id=&quot;watkinss-qlambda&quot;&gt;Watkins’s Q($\lambda$)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Watkins’s Q($\lambda$) is the original method for extending Q-learning to eligibility traces.&lt;/li&gt;
  &lt;li&gt;It involves decaying the ET in the usual way as long as a greedy action was taken, then cuts the traces to 0 after the 1st non-greedy action.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-10-watkins-q-lambda.png&quot; alt=&quot;Watkins&apos;s Q(lambda)&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backup Diagram for Watkins&apos;s Q($\lambda$)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;The series of component updates ends
either with the end of the episode or with the first nongreedy action, whichever comes first.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;tree-backuplambda&quot;&gt;Tree-Backup($\lambda$)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s look at the eligibility trace version of Tree Backup, which is called &lt;strong&gt;Tree-Backup($\lambda$)&lt;/strong&gt; or &lt;strong&gt;TB($\lambda$)&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;TB($\lambda$) is the &lt;strong&gt;true successor&lt;/strong&gt; to Q-learning because it has no importance sampling.&lt;/li&gt;
  &lt;li&gt;TB($\lambda$) concept is straightforward:
    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;The tree-backup updates of each length (Section 7.5) are weighted dependent on the bootstrapping parameter $\lambda$.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Using the recursive form of the action-based $\lambda$-return for Expected Sarsa and then expanding the bootstrapping target case after the model of tree-backup $n$-step return (Section 7.5):&lt;/p&gt;

\[\boxed{\begin{align*}
G_t^{\lambda a} &amp;amp;\doteq R_{t+1} + \gamma_{t+1}\!\left(\!\left[1 - \lambda_{t+1}\right] \bar{V}_t(S_{t+1}) + \lambda_{t+1}\!\left[\sum_{a \neq A_{t+1}} \pi(a \vert S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t) + \pi(A_{t+1} \vert S_{t+1}) G_{t+1}^{\lambda a}\right]\right) \\
&amp;amp;= R_{t+1} + \gamma_{t+1}\!\left(\bar{V}_t(S_{t+1}) + \lambda_{t+1} \pi(A_{t+1} \vert S_{t+1}) \!\left[G_{t+1}^{\lambda a} - \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)\right]\right)
\end{align*}}\]
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;$G_t^{\lambda a}$ can be approximated (ignoring changes in approx. value function) as a sum of TD errors:&lt;/p&gt;

\[\begin{align*}
G_t^{\lambda a} &amp;amp;\approx \hat{q}(S_t, A_t, \mathbf{w}_t) + \sum_{k=t}^{\infty} \delta_k^a \prod_{i=t+1}^{k} \gamma_i \lambda_i \pi(A_i \vert S_i) \\
\delta_t^a &amp;amp;= R_{t+1} + \gamma_{t+1} \bar{V}_t(S_{t+1}) - \hat{q}(S_t, A_t, \mathbf{w}_t)
\end{align*}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;As always, using same steps as in the previous section, we get a special eligibility trace update involving the target-policy probabilities of the selected actions:&lt;/p&gt;

\[\boxed{\mathbf{z}_t \doteq \gamma_t \lambda_t \pi(A_t \vert S_t)\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-10-tree-backup-q-lambda.png&quot; alt=&quot;Tree Backup (lambda)&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backup Diagram for Tree Backup($\lambda$)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;The tree-backup updates of each length are weighted in the
usual way dependent on the bootstrapping parameter $\lambda$&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;The ET above combined with the usual semi-gradient TD($\lambda$) parameter-update rule defines the TB($\lambda$) algorithm (a brief per-step sketch follows this list).&lt;/li&gt;
  &lt;li&gt;Like all semi-gradient algorithms, TB($\lambda$) is not guaranteed to be stable when used with off-policy data and a powerful function approximator &lt;strong&gt;(the deadly triad).&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
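
&lt;p&gt;For comparison with the Expected Sarsa($\lambda$) sketch in Section 12.9, here is a brief per-step sketch (my own, not from the book) of TB($\lambda$) for a linear $\hat{q}$; the only structural change is that the trace decays by the target-policy probability $\pi(A_t \vert S_t)$ rather than by an importance-sampling ratio.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# One TB(lambda) step for linear q_hat(s, a, w) = w @ x(s, a).
# Assumed interface: target_probs(s)[a] = pi(a|s).
def tree_backup_lambda_step(w, z, s, a, r, s_next, x, num_actions,
                            target_probs, alpha, gam_t, gam_next, lam_t):
    v_bar = sum(target_probs(s_next)[ap] * (w @ x(s_next, ap))
                for ap in range(num_actions))             # V_bar(S_{t+1})
    delta = r + gam_next * v_bar - w @ x(s, a)            # expected TD error
    z = gam_t * lam_t * target_probs(s)[a] * z + x(s, a)  # TB(lambda) trace
    w = w + alpha * delta * z
    return w, z
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;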

&lt;hr /&gt;

&lt;h2 id=&quot;1211-stable-off-policy-methods-with-traces&quot;&gt;12.11 Stable Off-Policy Methods with Traces&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s look at 4 of the most important methods that achieve stable off-policy training using eligibility traces.&lt;/li&gt;
  &lt;li&gt;All 4 are based on either &lt;strong&gt;Gradient-TD or Emphatic-TD&lt;/strong&gt; methods and use linear function approximation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;gtdlambda&quot;&gt;GTD($\lambda$)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Analogous to TDC, GTD($\lambda$) aims to learn a parameter $\mathbf{w}_t$ such that $\hat{v}(s, \mathbf{w}_t) \doteq \mathbf{w}_t^T \mathbf{x}(s) \approx v_{\pi}(s)$, even from data generated by following another policy $b$. Its update is given below (a brief per-step sketch follows this list):&lt;/p&gt;

\[\begin{aligned}
\mathbf{w}_{t+1} &amp;amp;\doteq \mathbf{w}_t + \alpha\, \delta_t^s\, \mathbf{z}_t - \alpha \gamma_{t+1}(1 - \lambda_{t+1})\!\left(\mathbf{z}_t^T \mathbf{v}_t\right) \mathbf{x}_{t+1} \\
\mathbf{v}_{t+1} &amp;amp;\doteq \mathbf{v}_t + \beta\, \delta_t^s\, \mathbf{z}_t - \beta\!\left(\mathbf{v}_t^T \mathbf{x}_t\right) \mathbf{x}_t
\end{aligned}\]

\[\begin{aligned}
\text{where} \\
\mathbf{v} &amp;amp;\in \mathbb{R}^d \equiv \text{a vector of the same dimension as } \mathbf{w}, \text{ initialized to } \mathbf{v}_0 = \mathbf{0} \\
\beta &amp;amp;&amp;gt; 0 \equiv \text{a 2nd step-size parameter}
\end{aligned}\]
  &lt;/li&gt;
&lt;/ul&gt;
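
&lt;p&gt;A hedged per-step sketch (my own, not from the book) of the GTD($\lambda$) update pair above, for linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$, using the general accumulating trace from Section 12.9; the argument names are assumptions.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# One GTD(lambda) step.  x_t, x_next are feature vectors; rho = pi/b ratio.
def gtd_lambda_step(w, v, z, x_t, x_next, r, rho,
                    alpha, beta, gam_t, gam_next, lam_t, lam_next):
    z = rho * (gam_t * lam_t * z + x_t)                  # off-policy trace
    delta = r + gam_next * (w @ x_next) - w @ x_t        # state-value TD error
    w = (w + alpha * delta * z
           - alpha * gam_next * (1 - lam_next) * (z @ v) * x_next)
    v = v + beta * delta * z - beta * (v @ x_t) * x_t    # secondary weights
    return w, v, z
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;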

&lt;h3 id=&quot;gqlambda&quot;&gt;GQ($\lambda$)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Gradient-TD algorithm for action values with eligibility traces.&lt;/li&gt;
  &lt;li&gt;GQ($\lambda$) aims to learn $\mathbf{w}_{t}$ such that $\hat{q}(s, a, \mathbf{w}_{t}) \doteq \mathbf{w}_{t}^T \mathbf{x}(s,a) \approx q_{\pi}(s,a)$ from off-policy data.&lt;/li&gt;
  &lt;li&gt;If the target policy is $\varepsilon$-greedy, or otherwise biased towards the greedy policy for $\hat{q}$, then GQ($\lambda$) can be used as a control algorithm.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;GQ($\lambda$) update is:&lt;/p&gt;

\[\begin{aligned}
\mathbf{w}_{t+1} &amp;amp;\doteq \mathbf{w}_t + \alpha\, \delta_t^a\, \mathbf{z}_t - \alpha \gamma_{t+1}(1 - \lambda_{t+1})\!\left(\mathbf{z}_t^T \mathbf{v}_t\right) \bar{\mathbf{x}}_{t+1} \\
\bar{\mathbf{x}}_t &amp;amp;\doteq \sum_a \pi(a \vert S_t)\, \mathbf{x}(S_t, a) \\
\delta_t^a &amp;amp;\doteq R_{t+1} + \gamma_{t+1}\, \mathbf{w}_t^T \bar{\mathbf{x}}_{t+1} - \mathbf{w}_t^T \mathbf{x}_t \\
\mathbf{z}_t &amp;amp;\doteq \gamma_t \lambda_t \rho_t\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)
\end{aligned}\]

\[\begin{aligned}
\text{where} \\
\bar{\mathbf{x}}_t &amp;amp;\equiv \text{average feature vector for } S_t \text{ under the target policy} \\
\delta_t^a &amp;amp;\equiv \text{expectation form of the TD error}
\end{aligned}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;htdlambda&quot;&gt;HTD($\lambda$)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Hybrid TD($\lambda$) state-value algorithm combines aspects of GTD($\lambda$) and TD($\lambda$).&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;HTD($\lambda$) is a strict generalization of TD($\lambda$) to the off-policy setting, meaning it reduces exactly to TD($\lambda$) when the behavior and target policies coincide, a property GTD($\lambda$) does not share:&lt;/p&gt;

\[b(A_t \vert S_t) = \pi(A_t \vert S_t), \quad \rho_t = 1 \implies \text{HTD}(\lambda) = \text{TD}(\lambda)\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;HTD($\lambda$) is defined by:&lt;/p&gt;

\[\begin{aligned}
\mathbf{w}_{t+1} &amp;amp;\doteq \mathbf{w}_t + \alpha\, \delta_t^s\, \mathbf{z}_t + \alpha\!\left[\!\left(\mathbf{z}_t - \mathbf{z}_t^b\right)^T \mathbf{v}_t\right]\!\left(\mathbf{x}_t - \gamma_{t+1} \mathbf{x}_{t+1}\right) \\
\mathbf{v}_{t+1} &amp;amp;\doteq \mathbf{v}_t + \beta\, \delta_t^s\, \mathbf{z}_t - \beta\!\left(\mathbf{z}_t^T \mathbf{v}_t\right)\!\left(\mathbf{x}_t - \gamma_{t+1} \mathbf{x}_{t+1}\right), \quad &amp;amp; \mathbf{v}_0 \doteq \mathbf{0} \\
\mathbf{z}_t &amp;amp;\doteq \rho_t \!\left(\gamma_t \lambda_t\, \mathbf{z}_{t-1} + \mathbf{x}_t\right), \quad &amp;amp; \mathbf{z}_{-1} \doteq \mathbf{0} \\
\mathbf{z}_t^b &amp;amp;\doteq \gamma_t \lambda_t\, \mathbf{z}_{t-1}^b + \mathbf{x}_t, \quad &amp;amp; \mathbf{z}_{-1}^b \doteq \mathbf{0}
\end{aligned}\]
  &lt;/li&gt;
  &lt;li&gt;We get
    &lt;ul&gt;
      &lt;li&gt;a 2nd set of weights, $\mathbf{v}_t$.&lt;/li&gt;
      &lt;li&gt;a 2nd set of eligibility traces, $\mathbf{z}_t^b$, &lt;strong&gt;conventional accumulating traces&lt;/strong&gt; for the behavior policy.&lt;/li&gt;
    &lt;/ul&gt;

\[\begin{aligned}
\mathbf{z}_t^b = \mathbf{z}_t \text{ if all } \rho_t = 1 &amp;amp;\implies \left(\mathbf{z}_t - \mathbf{z}_t^b\right)^T = \mathbf{0} \\
&amp;amp;\implies \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t^s\, \mathbf{z}_t \quad \text{(TD(}\lambda\text{))}
\end{aligned}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;emphatic-tdlambda&quot;&gt;Emphatic TD($\lambda$)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Extension of one-step Emphatic TD (Sections 9.11 &amp;amp; 11.8) to eligibility traces.&lt;/li&gt;
  &lt;li&gt;The resulting algorithm:
    &lt;ul&gt;
      &lt;li&gt;(+) retains strong off-policy convergence guarantees&lt;/li&gt;
      &lt;li&gt;(+) enables any degree of bootstrapping&lt;/li&gt;
      &lt;li&gt;(-) has high variance&lt;/li&gt;
      &lt;li&gt;(-) potentially slow convergence.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Emphatic TD($\lambda$) is defined by the following (a brief per-step sketch follows this list):&lt;/p&gt;

\[\begin{aligned}
\mathbf{w}_{t+1} &amp;amp;\doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t \\
\delta_t &amp;amp;\doteq R_{t+1} + \gamma_{t+1}\, \mathbf{w}_t^T \mathbf{x}_{t+1} - \mathbf{w}_t^T \mathbf{x}_t \\
\mathbf{z}_t &amp;amp;\doteq \rho_t \!\left(\gamma_t \lambda_t\, \mathbf{z}_{t-1} + M_t \mathbf{x}_t\right), \quad &amp;amp; \mathbf{z}_{-1} \doteq \mathbf{0} \\
M_t &amp;amp;\doteq \lambda_t \mathcal{I}_t + (1 - \lambda_t) F_t \\
F_t &amp;amp;\doteq \rho_{t-1} \gamma_t F_{t-1} + \mathcal{I}_t, \quad &amp;amp; F_0 \doteq \mathcal{I}_0
\end{aligned}\]

\[\begin{aligned}
\text{where} \\
M_t &amp;amp;\geq 0 \equiv \text{emphasis} \\
F_t &amp;amp;\geq 0 \equiv \text{followon trace} \\
\mathcal{I}_t &amp;amp;\geq 0 \equiv \text{interest}
\end{aligned}\]
  &lt;/li&gt;
  &lt;li&gt;In the on-policy case ($\rho_t = 1$ for all $t$), Emphatic TD($\lambda$) is similar to conventional TD($\lambda$), but still significantly different:
    &lt;ul&gt;
      &lt;li&gt;Emphatic TD($\lambda$) is guaranteed to converge for all state-dependent $\lambda$ functions; TD($\lambda$) is not (TD($\lambda$) is guaranteed only for constant $\lambda$).&lt;/li&gt;
      &lt;li&gt;See Yu’s counterexample &lt;strong&gt;[Ghiassian, Rafiee &amp;amp; Sutton, 2016].&lt;/strong&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
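
&lt;p&gt;A brief per-step sketch (my own, not from the book) of the Emphatic TD($\lambda$) equations above, for linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$; the argument names and calling convention are assumptions.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# One Emphatic TD(lambda) step.  interest_t is the interest I_t,
# rho_prev is rho_{t-1}, F_prev is the previous followon trace.
def emphatic_td_lambda_step(w, z, F_prev, x_t, x_next, r, rho, rho_prev,
                            interest_t, alpha, gam_t, gam_next, lam_t):
    F = rho_prev * gam_t * F_prev + interest_t        # followon trace
    M = lam_t * interest_t + (1 - lam_t) * F          # emphasis
    z = rho * (gam_t * lam_t * z + M * x_t)           # emphatic trace
    delta = r + gam_next * (w @ x_next) - w @ x_t     # TD error
    w = w + alpha * delta * z
    return w, z, F
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;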

&lt;hr /&gt;

&lt;h2 id=&quot;1212-implementation-issues&quot;&gt;12.12 Implementation Issues&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Naive implementation seems expensive:&lt;/strong&gt; Updating eligibility traces for every state at every time step appears computationally costly on serial computers.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Practical optimization:&lt;/strong&gt; Most ET are nearly 0; only recently visited states have significant traces, so implementations can track and update only these few states.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Computational cost:&lt;/strong&gt; With this optimization, tabular methods with traces are only a few times more expensive than one-step methods.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Function approximation reduces overhead:&lt;/strong&gt; When using neural networks, ET typically only double memory and computation per step (much less overhead than in tabular case).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Tabular is the worst case:&lt;/strong&gt; The tabular setting represents the highest computational complexity for ET relative to simpler methods.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;1213-conclusions&quot;&gt;12.13 Conclusions&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Eligibility traces&lt;/strong&gt; provide an efficient, incremental way to interpolate between TD and MC methods.&lt;/li&gt;
  &lt;li&gt;ET offer advantages over $n$-step methods in terms of &lt;strong&gt;generality and computational trade-offs.&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Empirically, &lt;strong&gt;an intermediate mix works best:&lt;/strong&gt; ET should move towards MC but not all the way since pure MC performance degrades sharply.&lt;/li&gt;
  &lt;li&gt;ET are the &lt;strong&gt;first line of defense against long-delayed rewards and non-Markov tasks,&lt;/strong&gt; used with TD methods to make them behave more like MC methods while retaining some bootstrapping.&lt;/li&gt;
  &lt;li&gt;Use traces &lt;strong&gt;when data is scarce and online learning is required,&lt;/strong&gt; as they provide faster learning per sample despite higher computational cost per step.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Avoid traces in offline settings&lt;/strong&gt; with cheap abundant data (maximum data processing speed matters more than learning efficiency per sample).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;True online methods&lt;/strong&gt; achieve ideal $\lambda$-return performance while maintaining $O(d)$ computational efficiency.&lt;/li&gt;
  &lt;li&gt;Forward-to-backward view derivations provide &lt;strong&gt;computationally efficient, mechanistic,&lt;/strong&gt; practical implementations of theory.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026RLsuttonBartoCh12notes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Sutton &amp;amp; Barto, Ch. 12: Eligibility Traces (Personal Notes)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Mar&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/03/13/rl-sutton-barto-notes-ch012/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;
</description>
        <pubDate>Fri, 13 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/03/13/rl-sutton-barto-notes-ch012/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/03/13/rl-sutton-barto-notes-ch012/</guid>
        
        
      </item>
    
      <item>
        <title>Sutton &amp; Barto, Ch. 11: Off-Policy Methods with Approximation (Personal Notes)</title>
        <description>&lt;ul&gt;
  &lt;li&gt;Let’s discuss the extension of off-policy methods from the tabular case (Ch. 6 &amp;amp; 7) to function approximation.&lt;/li&gt;
  &lt;li&gt;We’ll explore the convergence problems, the theory of linear function approximation, the notion of learnability, and off-policy algorithms with stronger convergence guarantees.&lt;/li&gt;
  &lt;li&gt;Off-policy learning with function approximation has 2 challenges:
    &lt;ol&gt;
      &lt;li&gt;Finding the target of the update.&lt;/li&gt;
      &lt;li&gt;The off-policy distribution of updates does not match that of the on-policy distribution.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#111-semi-gradient-methods&quot;&gt;11.1 Semi-gradient Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#112-examples-of-off-policy-divergence&quot;&gt;11.2 Examples of Off-Policy Divergence&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#113-the-deadly-triad&quot;&gt;11.3 The Deadly Triad&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#114-linear-value-function-geometry&quot;&gt;11.4 Linear Value-Function Geometry&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#115-gradient-descent-in-the-bellman-error&quot;&gt;11.5 Gradient Descent in the Bellman Error&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#116-the-bellman-error-is-not-learnable&quot;&gt;11.6 The Bellman Error is Not Learnable&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#117-gradient-td-methods&quot;&gt;11.7 Gradient-TD Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#118-emphatic-td-methods&quot;&gt;11.8 Emphatic-TD Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#119-reducing-variance&quot;&gt;11.9 Reducing Variance&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#1110-summary&quot;&gt;11.10 Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;111-semi-gradient-methods&quot;&gt;11.1 Semi-gradient Methods&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s discuss the extension of previous off-policy methods to function approximation as semi-gradient methods.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;This is how we find the update target (or change it) to address the first challenge.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Recall the semi-gradient update:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\!\left[U_t - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t)\]

\[U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)\]

&lt;ul&gt;
  &lt;li&gt;In the tabular case, we update the array ($V$ or $Q$), but now we update the weight vector $\mathbf{w}$.&lt;/li&gt;
  &lt;li&gt;Many off-policy algorithms use the per-step importance sampling ratio:&lt;/li&gt;
&lt;/ul&gt;

\[\rho_t \doteq \rho_{t:t} = \frac{\pi(A_t \vert S_t)}{b(A_t \vert S_t)}\]

&lt;ul&gt;
  &lt;li&gt;The off-policy, semi-gradient &lt;strong&gt;TD(0)&lt;/strong&gt; update is the same as the on-policy TD(0) update except for the addition of the $\rho_t$ factor:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t\, \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)}\]

\[\begin{align*}
\text{(episodic)} \quad \delta_t &amp;amp;= R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\
\text{(continuing)} \quad \delta_t &amp;amp;= R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)
\end{align*}\]
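
&lt;p&gt;As a hedged illustration, here is a minimal sketch (my own, not from the book) of a single off-policy semi-gradient TD(0) update with the episodic TD error, for linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$; the policy-probability arrays are assumed inputs.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# pi_probs[a] = pi(a|S_t), b_probs[a] = b(a|S_t) for the taken action a.
def off_policy_td0_step(w, x_t, x_next, r, a, pi_probs, b_probs,
                        alpha, gamma, terminal=False):
    rho = pi_probs[a] / b_probs[a]              # importance-sampling ratio
    v_next = 0.0 if terminal else w @ x_next
    delta = r + gamma * v_next - w @ x_t        # TD error
    return w + alpha * rho * delta * x_t        # grad v_hat = x_t (linear)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;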

&lt;ul&gt;
  &lt;li&gt;For action values, the off-policy, semi-gradient &lt;strong&gt;Expected Sarsa&lt;/strong&gt; update rule is (no importance sampling):&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\]

\[\begin{align*}
\text{(episodic)} \quad \delta_t &amp;amp;= R_{t+1} + \gamma \sum_a \pi(a \vert S_{t+1}) \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \\
\text{(continuing)} \quad \delta_t &amp;amp;= R_{t+1} - \bar{R}_t + \sum_a \pi(a \vert S_{t+1}) \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The absence of importance sampling in the one-step Expected Sarsa update is less clearly justified under function approximation: different state-action pairs no longer have separate estimates but all contribute to the same overall approximation, so we might want to weight them &lt;strong&gt;differently&lt;/strong&gt;. Resolving this issue properly requires a more thorough understanding of the &lt;strong&gt;theory of function approximation&lt;/strong&gt; in RL.&lt;/li&gt;
  &lt;li&gt;In the multi-step generalizations of the algorithms, both the state-value and action-value algorithms involve importance sampling. For example, the off-policy, semi-gradient $\mathbf{n}$&lt;strong&gt;-step Sarsa&lt;/strong&gt; update is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \rho_{t+1} \cdots \rho_{t+n}\!\left[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})}\]

\[\begin{align*}
\text{(episodic)} \quad G_{t:t+n} &amp;amp;= R_{t+1} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}) \\
\text{(continuing)} \quad G_{t:t+n} &amp;amp;= R_{t+1} - \bar{R}_t + \ldots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})
\end{align*}\]

\[\text{where } \rho_k = 1 \hspace{0.5em} \text{ for } k \geq T \quad \text{and} \quad G_{t:t+n} = G_t \hspace{0.5em} \text{ if } t+n \geq T\]

&lt;ul&gt;
  &lt;li&gt;The off-policy, semi-gradient $\mathbf{n}$&lt;strong&gt;-step tree-backup&lt;/strong&gt; (no importance sampling) algorithm is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\!\left[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})}\]

\[G_{t:t+n} \doteq \hat{q}(S_t, A_t, \mathbf{w}_{t+n}) + \sum_{k=t}^{t+n-1} \delta_k \prod_{i=t+1}^{k} \gamma \pi(A_i \vert S_i)\]

\[\text{where } \delta_t \text{ is the Expected Sarsa TD error defined earlier in this section.}\]

&lt;hr /&gt;

&lt;h2 id=&quot;112-examples-of-off-policy-divergence&quot;&gt;11.2 Examples of Off-Policy Divergence&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Now let’s discuss the 2nd off-policy function approximation challenge.&lt;/li&gt;
  &lt;li&gt;We’ll look at some instructive counterexamples where the semi-gradient algorithm diverges.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example-1&quot;&gt;Example 1&lt;/h3&gt;

&lt;p&gt;Consider part of a larger MDP with 2 states whose estimated values are $w$ and $2w$:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-2-example1-DOWNSPACED.png&quot; alt=&quot;Counterexample&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Simple Counterexample:&lt;/strong&gt; 2-state part of an MDP.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
  &lt;li&gt;$w$ updates will diverge to infinity, since the transition will always look good (higher next-state estimated value than current state estimated value).&lt;/li&gt;
  &lt;li&gt;The TD error on a transition between the 2 states is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\delta_t &amp;amp;= R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\
&amp;amp;= 0 + \gamma \cdot 2w_t - w_t \\
&amp;amp;= (2\gamma - 1)\, w_t
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The off-policy, semi-gradient TD(0) update is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
w_{t+1} &amp;amp;= w_t + \alpha \rho_t\, \delta_t \nabla \hat{v}(S_t, w_t) \\
&amp;amp;= w_t + (\alpha)(1)\!\left[(2\gamma - 1) w_t\right](1) \\
&amp;amp;= w_t\!\left[1 + \alpha(2\gamma - 1)\right]
\end{align*}\]

\[\begin{aligned}
\Rightarrow \quad &amp;amp; 1 + \alpha(2\gamma - 1) &amp;gt; 1 \\
&amp;amp; \alpha(2\gamma - 1) &amp;gt; 0 \\
&amp;amp; 2\gamma - 1 &amp;gt; 0 \\
&amp;amp; \gamma &amp;gt; \tfrac{1}{2} \quad \longrightarrow \quad w \to \pm\infty
\end{aligned}\]
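
&lt;p&gt;A tiny numerical sketch (my own, not from the book) of the multiplicative update derived above: with $\alpha = 0.1$ and $\gamma = 0.9$ the multiplier is $1.08$, so $w$ grows without bound.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Repeatedly apply w &amp;lt;- w * (1 + alpha * (2 * gamma - 1)); it diverges
# whenever gamma &amp;gt; 1/2 (here the factor is 1.08 per step).
alpha, gamma, w = 0.1, 0.9, 1.0
for t in range(100):
    w *= 1 + alpha * (2 * gamma - 1)
print(w)   # about 2.2e3 after 100 steps, and still growing
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;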

&lt;h3 id=&quot;example-2-bairds-counterexample&quot;&gt;Example 2 (Baird’s Counterexample)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Now let’s look at an entire complete system with instability (divergence).&lt;/li&gt;
  &lt;li&gt;Consider the episodic 7-state, 2-action MDP shown below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-2-bairds-counterexample.png&quot; alt=&quot;Baird&apos;s Counterexample&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Baird’s Counterexample:&lt;/strong&gt; Episodic 7-state, 2-action MDP.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Assumptions/knowns:&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;$b(\text{dashed}\,\vert\,\cdot) = 6/7$&lt;/li&gt;
      &lt;li&gt;$b(\text{solid}\,\vert\,\cdot) = 1/7$&lt;/li&gt;
      &lt;li&gt;$\pi(a\,\vert\,\cdot) = \pi(\text{solid}\,\vert\,\cdot) = 1$&lt;/li&gt;
      &lt;li&gt;$R = 0$ (on all transitions)&lt;/li&gt;
      &lt;li&gt;$\gamma = 0.99$&lt;/li&gt;
      &lt;li&gt;The state values are estimated via linear parametrization.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The estimated value of the leftmost state is $2w_1 + w_8$, which corresponds to a feature vector for the 1st state being:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{x}(1) = (2, 0, 0, 0, 0, 0, 0, 1)^T\]

\[R = 0 \quad \therefore\quad v_\pi(s) = 0 \; \forall s, \text{ which can be exactly approximated if } \mathbf{w} = \mathbf{0}\]

&lt;ul&gt;
  &lt;li&gt;Since there are 8 components of the weight vector (more than the 7 non-terminal states), there exist many solutions.&lt;/li&gt;
  &lt;li&gt;Applying semi-gradient TD(0) to this problem will cause the weights to diverge to infinity. This also applies for the dynamic programming (DP) case.&lt;/li&gt;
  &lt;li&gt;The semi-gradient DP update is:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{w}_{k+1} \doteq \mathbf{w}_k + \frac{\alpha}{\vert S \vert} \sum_s \left(\mathbb{E}_\pi\!\left[R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_k) \mid S_t = s\right] - \hat{v}(s, \mathbf{w}_k)\right) \nabla \hat{v}(s, \mathbf{w}_k)\]

&lt;ul&gt;
  &lt;li&gt;This example shows that even the simplest combination of bootstrapping and function approximation can be unstable in the off-policy case.
    &lt;ul&gt;
      &lt;li&gt;&lt;u&gt;Simplest bootstrapping&lt;/u&gt;: DP and TD.&lt;/li&gt;
      &lt;li&gt;&lt;u&gt;Simplest function approximation&lt;/u&gt;: linear, semi-gradient descent method.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example-3-tsitsiklis--van-roys-counterexample&quot;&gt;Example 3 (Tsitsiklis &amp;amp; Van Roy’s Counterexample)&lt;/h3&gt;

&lt;p&gt;This extends Example 1 with a terminal state and $R = 0$:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-2-tsitsiklis-van-roy-counterexample.png&quot; alt=&quot;Tsitsiklis &amp;amp; Van Roy&apos;s Counterexample&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Tsitsiklis &amp;amp; Van Roy’s Counterexample:&lt;/strong&gt; Extension of Example 1 with probability $\varepsilon$ of transitioning to the terminal state (shaded).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s find $w_{k+1}$ at each step that minimizes the $\overline{\text{VE}}$ between the estimated value and the expected one-step return:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
w_{k+1} &amp;amp;= \arg\min_{w \in \mathbb{R}} \sum_{s \in S} \left(\hat{v}(s, w) - \mathbb{E}_\pi\!\left[R_{t+1} + \gamma \hat{v}(S_{t+1}, w_k) \mid S_t = s\right]\right)^2 \\[6pt]
&amp;amp;= \arg\min_{w \in \mathbb{R}} \left(w - \gamma \cdot 2w_k\right)^2 + \left(2w - (1 - \varepsilon)\gamma \cdot 2w_k\right)^2 \\[6pt]
&amp;amp;= \left(\frac{6 - 4\varepsilon}{5}\right) \gamma w_k
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The sequence $\{w_k\}$ diverges when $\gamma &amp;gt; \dfrac{5}{6 - 4\varepsilon}$ and $w_0 \neq 0$.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Instability can be prevented by using special methods for function approximation.&lt;/li&gt;
  &lt;li&gt;These special methods guarantee stability because they do not extrapolate from the observed targets. They are called &lt;strong&gt;averagers&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Averagers include nearest neighbor methods and locally weighted regression, but &lt;strong&gt;not&lt;/strong&gt; popular methods such as tile coding and artificial neural networks (ANNs).&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;113-the-deadly-triad&quot;&gt;11.3 The Deadly Triad&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The danger of instability and divergence arises when we combine these 3 elements, which make up the &lt;strong&gt;deadly triad&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Function approximation&lt;/li&gt;
      &lt;li&gt;Bootstrapping&lt;/li&gt;
      &lt;li&gt;Off-policy training&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Instability can be avoided if one of the elements is absent:
    &lt;ul&gt;
      &lt;li&gt;Function approximation cannot be given up (needed for large-scale problems).&lt;/li&gt;
      &lt;li&gt;Bootstrapping can be given up but at the cost of computational and data efficiency.&lt;/li&gt;
      &lt;li&gt;Off-policy can be given up (replace Q-learning with Sarsa).&lt;/li&gt;
      &lt;li&gt;There is no perfect solution as we still need off-policy for planning and parallel learning.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;114-linear-value-function-geometry&quot;&gt;11.4 Linear Value-Function Geometry&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;To better understand the stability challenge of off-policy learning, let’s think about value-function approximation &lt;strong&gt;more abstractly and independently&lt;/strong&gt; of how learning is done.&lt;/li&gt;
  &lt;li&gt;Let’s consider the case with 3 states $S = \{s_1, s_2, s_3\}$ and 2 parameters $\mathbf{w} = (w_1, w_2)^T$.
    &lt;ul&gt;
      &lt;li&gt;All value functions exist in a 3-D space, however the parameters provide a 2-D subspace.&lt;/li&gt;
      &lt;li&gt;Any weight vector $\mathbf{w} = (w_1, w_2)^T$ is a point in the 2-D subspace and thus also a complete value function $v_\mathbf{w}$ that assigns values to all 3 states.&lt;/li&gt;
      &lt;li&gt;In linear value-function approximation, the subspace is a simple plane.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;How do we represent $v_\pi$ in the $d$-dimensional space?
    &lt;ul&gt;
      &lt;li&gt;We need to perform a projection operation.&lt;/li&gt;
      &lt;li&gt;TD methods present other solutions.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-4-linear-value-func-approx-geometry-DOWNSPACED.png&quot; alt=&quot;Linear value-func. approx. geometry&quot; /&gt;&lt;/p&gt;


&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;The Geometry of Linear Value-Function Approximation&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;Shown is the 3D space of all value functions over three states, while shown as a plane is the subspace of all value functions representable by a linear function approximator with parameter $\mathbf{w} = (w_1, w_2)^T$. The true value function $v_\pi$ is in the larger space and can be projected down (into the subspace, using a projection operator $\Pi$) to its best approximation in the value error ($\text{VE}$) sense. The best approximators in the Bellman error ($\text{BE}$), projected Bellman error ($\text{PBE}$), and temporal difference error ($\text{TDE}$) senses are all potentially different and are shown in the lower right.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;projection-operation&quot;&gt;Projection Operation&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;For the projection operation, the distance between value functions using the norm is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\lVert v \rVert_\mu^2 &amp;amp;\doteq \sum_{s \in S} \mu(s)\, v(s)^2 \\[6pt]
\overline{\text{VE}}(\mathbf{w}) &amp;amp;= \lVert v_\mathbf{w} - v_\pi \rVert_\mu^2 \\[6pt]
\Pi\, v &amp;amp;\doteq v_\mathbf{w} \quad \\[6pt]
\text{where } \mathbf{w} = \arg\min_{\mathbf{w} \in \mathbb{R}^d} \lVert v - v_\mathbf{w} \rVert_\mu^2 &amp;amp;  \hspace{0.8em} \text{and} \hspace{0.5em}  \Pi \equiv \text{projection operator}
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The representable value function that is closest to the true value function $v_\pi$ is its projection $\Pi v_\pi$ (the asymptotic solution found by MC methods).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Projection matrix&lt;/strong&gt;: with $\mathbf{D} \equiv \vert S \vert \times \vert S \vert$ diagonal matrix with $\mu(s)$ on the diagonal and $\mathbf{X} \equiv \vert S \vert \times d$ matrix whose rows are the feature vectors $\mathbf{x}(s)^T$ (a small numerical sketch follows this list):&lt;/p&gt;

\[\Pi \doteq \mathbf{X}\!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D}\]

    &lt;p&gt;If the inverse does not exist, the pseudoinverse is substituted. Using these matrices, the squared norm of a vector can be written as:&lt;/p&gt;

\[\lVert v \rVert_\mu^2 = v^T \mathbf{D}\, v\]

    &lt;p&gt;and the approximate linear value function written as:&lt;/p&gt;

\[v_\mathbf{w} = \mathbf{X}\mathbf{w}\]
  &lt;/li&gt;
&lt;/ul&gt;
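
&lt;p&gt;A small numerical sketch (my own, not from the book) of the projection operator for a toy 3-state, 2-feature example; the feature matrix and the distribution $\mu$ are made-up values for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])           # feature matrix, one row per state
mu = np.array([0.4, 0.4, 0.2])       # state weighting mu(s)
D = np.diag(mu)

# Projection matrix (pseudoinverse used in case the inverse does not exist)
Pi = X @ np.linalg.pinv(X.T @ D @ X) @ X.T @ D

v = np.array([1.0, 2.0, 4.0])        # some value function, e.g. v_pi
v_proj = Pi @ v                      # closest representable value function
ve = (v - v_proj) @ D @ (v - v_proj) # squared mu-norm of the residual
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;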

&lt;h3 id=&quot;td-solutions&quot;&gt;TD Solutions&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Bellman Error&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The true value function $v_\pi$ solves the Bellman equation exactly.&lt;/li&gt;
  &lt;li&gt;The &lt;strong&gt;Bellman error&lt;/strong&gt; shows how far off $v_\mathbf{w}$ is from $v_\pi$. The Bellman error at state $s$ is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\bar{\delta}_\mathbf{w}(s) &amp;amp;\doteq \left(\sum_a \pi(a \vert s) \sum_{s&apos;, r} p(s&apos;, r \vert s, a)\!\left[r + \gamma v_\mathbf{w}(s&apos;)\right]\right) - v_\mathbf{w}(s) \\
&amp;amp;= \mathbb{E}_\pi\!\left[R_{t+1} + \gamma v_\mathbf{w}(S_{t+1}) - v_\mathbf{w}(S_t) \mid S_t = s, A_t \sim \pi\right]
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The Bellman error is the expectation of the TD error.&lt;/li&gt;
  &lt;li&gt;The vector of all the Bellman errors, at all states, $\bar{\delta}_\mathbf{w} \in \mathbb{R}^{\vert S \vert}$, is called the &lt;strong&gt;Bellman error vector&lt;/strong&gt; ($\text{BE}$).&lt;/li&gt;
  &lt;li&gt;The overall size of $\text{BE}$ is the &lt;strong&gt;Mean Squared Bellman Error&lt;/strong&gt;, $\overline{\text{BE}}$:&lt;/li&gt;
&lt;/ul&gt;

\[\overline{\text{BE}}(\mathbf{w}) = \lVert \bar{\delta}_\mathbf{w} \rVert_\mu^2\]

&lt;ul&gt;
  &lt;li&gt;The &lt;strong&gt;Bellman operator&lt;/strong&gt; $B_\pi : \mathbb{R}^{\vert S \vert} \to \mathbb{R}^{\vert S \vert}$ is defined by:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
(B_\pi v)(s) &amp;amp;\doteq \sum_a \pi(a \vert s) \sum_{s&apos;, r} p(s&apos;, r \vert s, a)\!\left[r + \gamma v(s&apos;)\right], \quad \forall s \in S \text{ and } v : S \to \mathbb{R} \\[6pt]
\bar{\delta}_\mathbf{w} &amp;amp;= B_\pi v_\mathbf{w} - v_\mathbf{w} \\[6pt]
v_\pi &amp;amp;= B_\pi v_\pi
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The projection of the Bellman error vector back into the representable space creates the &lt;strong&gt;Projected Bellman Error $(\text{PBE})$&lt;/strong&gt; vector:&lt;/li&gt;
&lt;/ul&gt;

\[\text{PBE} = \Pi\, \bar{\delta}_\mathbf{w}\]

&lt;ul&gt;
  &lt;li&gt;The size of $\text{PBE}$, in the norm, is another measure of error in the approximate value function, called the &lt;strong&gt;Mean Square Projected Bellman Error&lt;/strong&gt;, $\overline{\text{PBE}}$:&lt;/li&gt;
&lt;/ul&gt;

\[\overline{\text{PBE}}(\mathbf{w}) = \lVert \Pi\, \bar{\delta}_\mathbf{w} \rVert_\mu^2\]

&lt;ul&gt;
  &lt;li&gt;With linear function approximation, there always exists an approximate value function (within the subspace) with zero $\overline{\text{PBE}}$; this is the TD fixed point $\mathbf{w}_\text{TD}$.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;115-gradient-descent-in-the-bellman-error&quot;&gt;11.5 Gradient Descent in the Bellman Error&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s apply the approach of SGD in dealing with the challenge of stability in off-policy learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;td-error-naive-residual-gradient-algorithm&quot;&gt;TD Error (Naive Residual-Gradient Algorithm)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s take the minimization of the expected square of the TD error as the objective, TD(0):&lt;/li&gt;
&lt;/ul&gt;

\[\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\]

&lt;ul&gt;
  &lt;li&gt;Using the TD error, we can find the Mean Squared TD error, the objective function $\overline{\text{TDE}}$:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\overline{\text{TDE}}(\mathbf{w}) &amp;amp;= \sum_{s \in S} \mu(s)\, \mathbb{E}\!\left[\delta_t^2 \mid S_t = s, A_t \sim \pi\right] \\
&amp;amp;= \sum_{s \in S} \mu(s)\, \mathbb{E}\!\left[\rho_t\, \delta_t^2 \mid S_t = s, A_t \sim b\right] \\
&amp;amp;= \mathbb{E}_b\!\left[\rho_t\, \delta_t^2\right]
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;Following the standard SGD approach, the per-step update based on a sample of this expected value:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\mathbf{w}_{t+1} &amp;amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla\!\left(\rho_t\, \delta_t^2\right) \\
&amp;amp;= \mathbf{w}_t - \alpha \rho_t\, \delta_t \nabla \delta_t \\
&amp;amp;= \mathbf{w}_t + \alpha \rho_t\, \delta_t\!\left(\nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)\right)
\end{align*}\]
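
&lt;p&gt;A minimal per-step sketch (my own, not from the book) of the naive residual-gradient update above, for linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$, so that $\nabla \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)$.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# rho = pi(A_t|S_t) / b(A_t|S_t); x_t, x_next are feature vectors.
def naive_residual_gradient_step(w, x_t, x_next, r, rho, alpha, gamma):
    delta = r + gamma * (w @ x_next) - w @ x_t
    return w + alpha * rho * delta * (x_t - gamma * x_next)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;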

&lt;ul&gt;
  &lt;li&gt;This is the same as the semi-gradient TD algorithm except for the additional final term.&lt;/li&gt;
  &lt;li&gt;This method is called &lt;strong&gt;naive&lt;/strong&gt; because, by penalizing all TD errors, minimizing the $\overline{\text{TDE}}$ tends to produce temporal smoothing rather than accurate prediction.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;bellman-error-residual-gradient-algorithm&quot;&gt;Bellman Error (Residual-Gradient Algorithm)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Consider the minimization of the Bellman error (if the exact values are learned, the Bellman error is zero everywhere).&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;This yields the &lt;strong&gt;residual gradient algorithm&lt;/strong&gt;:&lt;/p&gt;

\[\begin{align*}
  \mathbf{w}_{t+1} &amp;amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla\!\left(\mathbb{E}_\pi\!\left[\delta_t\right]^2\right) \\
  &amp;amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla\!\left(\mathbb{E}_b\!\left[\rho_t\, \delta_t\right]^2\right) \\
  &amp;amp;= \mathbf{w}_t - \alpha\, \mathbb{E}_b\!\left[\rho_t\, \delta_t\right] \nabla \mathbb{E}_b\!\left[\rho_t\, \delta_t\right] \\
  &amp;amp;= \mathbf{w}_t - \alpha\, \mathbb{E}_b\!\left[\rho_t\!\left(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\right)\right] \mathbb{E}_b\!\left[\rho_t \nabla \delta_t\right] \\
  &amp;amp;= \mathbf{w}_t + \alpha\!\left[\mathbb{E}_b\!\left[\rho_t\!\left(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})\right)\right] - \hat{v}(S_t, \mathbf{w})\right]\!\left[\nabla \hat{v}(S_t, \mathbf{w}) - \gamma\, \mathbb{E}_b\!\left[\rho_t \nabla \hat{v}(S_{t+1}, \mathbf{w})\right]\right]
  \end{align*}\]
  &lt;/li&gt;
  &lt;li&gt;Two ways to make the residual-gradient algorithm work:
    &lt;ul&gt;
      &lt;li&gt;If the environment is deterministic, the two expectations are identical, so sample values can be used directly.&lt;/li&gt;
      &lt;li&gt;Otherwise, obtain 2 independent samples of the next state $S_{t+1}$ from $S_t$ (possible in simulated environments).&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In both ways above, the algorithm is guaranteed to converge to a minimum of the $\overline{\text{BE}}$ under the usual conditions on the step-size parameter.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;However, there are at least 3 ways in which the convergence of the residual-gradient algorithm is unsatisfactory:
    &lt;ul&gt;
      &lt;li&gt;It is very slow (much slower than semi-gradient methods).&lt;/li&gt;
      &lt;li&gt;It still seems to converge to the wrong values.&lt;/li&gt;
      &lt;li&gt;There is a deeper problem with the $\overline{\text{BE}}$ objective itself, covered in the next section.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;116-the-bellman-error-is-not-learnable&quot;&gt;11.6 The Bellman Error is Not Learnable&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The Bellman error is not learnable from the observed sequence of feature vectors, actions, and rewards.&lt;/li&gt;
  &lt;li&gt;The fact that the Bellman error objective cannot be determined from observable data is the strongest reason not to pursue it.&lt;/li&gt;
  &lt;li&gt;Examples of non-learnable Markov Reward Processes (MRPs):&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example-1-1&quot;&gt;Example 1&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-6-example1.png&quot; alt=&quot;VE learnability Counterexample&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Value Error (VE) Learnability Counterexample:&lt;/strong&gt; Deterministic MRP pair with an endless stream of $0$s and $2$s&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
  &lt;li&gt;These MRPs have a deterministic reward with observable data of an endless stream of 0s and 2s.&lt;/li&gt;
  &lt;li&gt;We cannot learn if the MRP has one state or two, or is stochastic or deterministic.&lt;/li&gt;
  &lt;li&gt;The pair of MRPs shows that the $\overline{\text{VE}}$ objective is not learnable:&lt;/li&gt;
&lt;/ul&gt;

\[\overline{\text{VE}}(\mathbf{w}) \doteq \sum_{s \in S} \mu(s)\!\left[v_\pi(s) - \hat{v}(s, \mathbf{w})\right]^2\]

&lt;ul&gt;
  &lt;li&gt;The $\overline{\text{VE}}$ is not learnable, but the parameter that optimizes it is!&lt;/li&gt;
  &lt;li&gt;We introduce a learnable natural objective function that is always observable. This is the error between the value estimate at each time and the return from that time, called the &lt;strong&gt;return error&lt;/strong&gt;. The &lt;strong&gt;Mean Square Return Error&lt;/strong&gt; $(\overline{\text{RE}})$ is the expectation, under $\mu$, of the square of this return error.&lt;/li&gt;
  &lt;li&gt;$\overline{\text{RE}}$ in the on-policy case is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\overline{\text{RE}}(\mathbf{w}) &amp;amp;= \mathbb{E}\!\left[\left(G_t - \hat{v}(S_t, \mathbf{w})\right)^2\right] \\
&amp;amp;= \overline{\text{VE}}(\mathbf{w}) + \mathbb{E}\!\left[\left(G_t - v_\pi(S_t)\right)^2\right]
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The $\overline{\text{BE}}$ can be computed from knowledge of the MDP but is not learnable from data, and its minimum solution is not learnable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example-2&quot;&gt;Example 2&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-6-example2.png&quot; alt=&quot;BE learnability Counterexample&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Bellman Error (BE) Learnability Counterexample:&lt;/strong&gt; Complex Deterministic MRP pair with same distribution but different minimizing parameter vector&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
  &lt;li&gt;The example above serves as a counterexample to the learnability of the Bellman error.&lt;/li&gt;
  &lt;li&gt;The 2 MRPs generate the same data distribution but have different minimizing parameter vectors, proving that the optimal parameter vector is not a function of the data and thus cannot be learned from it.&lt;/li&gt;
  &lt;li&gt;Other bootstrapping objectives, like $\overline{\text{PBE}}$ and $\overline{\text{TDE}}$, are learnable from data and yield optimal solutions different from each other and that of $\overline{\text{BE}}$.&lt;/li&gt;
  &lt;li&gt;$\overline{\text{BE}}$ is limited to model-based settings, therefore $\overline{\text{PBE}}$ is preferred.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-6-causal-relationships-mdps-datadistr-errors.png&quot; alt=&quot;MDPs-data distribution-objectives causal relationships&quot; /&gt;&lt;/p&gt;


&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Causal Relationships among the data distribution, MDPs &amp;amp; various objectives&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;&lt;strong&gt;Left, Monte Carlo objectives:&lt;/strong&gt; Two different MDPs can produce the same data distribution yet also produce different $\overline{\text{VE}}$s, proving that the $\overline{\text{VE}}$ objective cannot be determined from data and is not learnable. However, all such $\overline{\text{VE}}$s must have the same optimal parameter vector, $\mathbf{w}^{*}$! Moreover, this same $\mathbf{w}^{*}$ can be determined from another objective, the $\overline{\text{RE}}$, which is uniquely determined from the data distribution. Thus $\mathbf{w}^{*}$ and the $\overline{\text{RE}}$ are learnable even though the $\overline{\text{VE}}$s are not. &lt;br /&gt;
&lt;strong&gt;Right, Bootstrapping objectives:&lt;/strong&gt; Two different MDPs can produce the same data distribution yet also produce different $\overline{\text{BE}}$s &lt;em&gt;and&lt;/em&gt; have different minimizing parameter vectors; these are not learnable from the data distribution. The $\overline{\text{PBE}}$ and $\overline{\text{TDE}}$ objectives and their (different) minima can be directly determined from data and thus are learnable.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;117-gradient-td-methods&quot;&gt;11.7 Gradient-TD Methods&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s consider SGD methods for minimizing the $\overline{\text{PBE}}$.&lt;/li&gt;
  &lt;li&gt;True SGD methods, &lt;strong&gt;Gradient-TD methods&lt;/strong&gt;, have robust convergence properties even under off-policy training and nonlinear function approximation.&lt;/li&gt;
  &lt;li&gt;In the linear case, there exists an exact solution, the TD fixed point $\mathbf{w}_\text{TD}$, at which the $\overline{\text{PBE}}$ is zero.&lt;/li&gt;
  &lt;li&gt;Computing this solution directly with &lt;strong&gt;least-squares&lt;/strong&gt; methods costs $O(d^2)$ per step; instead, we want an SGD method with $O(d)$ complexity that converges robustly.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Let’s derive an SGD method for the $\overline{\text{PBE}}$ assuming linear function approximation:&lt;/p&gt;

\[\begin{align*}
\overline{\text{PBE}}(\mathbf{w}) &amp;amp;= \lVert \Pi\, \bar{\delta}_\mathbf{w} \rVert_\mu^2 \\
&amp;amp;= \left(\Pi\, \bar{\delta}_\mathbf{w}\right)^T \mathbf{D}\, \Pi\, \bar{\delta}_\mathbf{w} \\
&amp;amp;= \bar{\delta}_\mathbf{w}^T \Pi^T \mathbf{D}\, \Pi\, \bar{\delta}_\mathbf{w} \\
&amp;amp;= \bar{\delta}_\mathbf{w}^T \mathbf{D} \mathbf{X}\!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w} \\
&amp;amp;= \left(\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right)^T \!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \!\left(\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right)
\end{align*}\]

    &lt;p&gt;$\quad \left(\text{using } \Pi = \mathbf{X}\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D} \text{ and the identity } \Pi^T \mathbf{D}\, \Pi = \mathbf{D} \mathbf{X}\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D}\right)$&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;The gradient of the $\overline{\text{PBE}}$ w.r.t $\mathbf{w}$ is:&lt;/li&gt;
&lt;/ul&gt;

\[\nabla \overline{\text{PBE}}(\mathbf{w}) = 2\, \nabla\!\left[\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right]^T \!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \!\left(\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right)\]

&lt;ul&gt;
  &lt;li&gt;Let’s turn this into an SGD method by converting the 3 factors above into &lt;strong&gt;expectations&lt;/strong&gt; under the state distribution $\mu$ (the distribution of states visited under the behavior policy):&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w} &amp;amp;= \sum_s \mu(s)\, \mathbf{x}(s)\, \bar{\delta}_\mathbf{w}(s) = \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\[6pt]
\nabla\!\left[\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right] &amp;amp;= \nabla \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\
&amp;amp;= \mathbb{E}\!\left[\rho_t \nabla \delta_t^T\, \mathbf{x}_t^T\right] \\
&amp;amp;= \mathbb{E}\!\left[\rho_t \nabla\!\left(R_{t+1} + \gamma \mathbf{w}^T \mathbf{x}_{t+1} - \mathbf{w}^T \mathbf{x}_t\right) \mathbf{x}_t^T\right] \\
&amp;amp;= \mathbb{E}\!\left[\rho_t\!\left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^T\right] \\[6pt]
\mathbf{X}^T \mathbf{D} \mathbf{X} &amp;amp;= \sum_s \mu(s)\, \mathbf{x}(s)\, \mathbf{x}(s)^T = \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]
\end{aligned}\]

&lt;p&gt;Substituting these expectations for the three factors into $\nabla \overline{\text{PBE}}$:&lt;/p&gt;

\[\nabla \overline{\text{PBE}}(\mathbf{w}) = 2\, \mathbb{E}\!\left[\rho_t\!\left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\]

&lt;ul&gt;
  &lt;li&gt;The 1st and last terms are not independent (&lt;strong&gt;biased gradient estimate&lt;/strong&gt;).&lt;/li&gt;
  &lt;li&gt;Could estimate all 3 terms separately and combine (&lt;strong&gt;unbiased gradient estimate&lt;/strong&gt;) but too computationally expensive.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;gradient-td&quot;&gt;Gradient-TD&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Estimate and store the product of the last 2 terms of $\nabla \overline{\text{PBE}}(\mathbf{w})$ (product of a $d \times d$ matrix and a $d$-vector yields a $d$-vector like $\mathbf{w}$ itself):&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{v} \approx \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\]

&lt;ul&gt;
  &lt;li&gt;In linear supervised learning, this is the solution to the linear least-squares problem of approximating $\rho_t\, \delta_t$ from the features.&lt;/li&gt;
  &lt;li&gt;The standard SGD method for incrementally finding $\mathbf{v}$ that minimizes the expected squared error $\left(\mathbf{v}^T \mathbf{x}_t - \rho_t\, \delta_t\right)^2$ is known as the &lt;strong&gt;Least Mean Square (LMS)&lt;/strong&gt; rule:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{v}_{t+1} \doteq \mathbf{v}_t + \beta\, \rho_t\!\left(\delta_t - \mathbf{v}_t^T \mathbf{x}_t\right) \mathbf{x}_t\]

\[\begin{aligned}
\text{where} \\
\beta &amp;amp;&amp;gt; 0 \equiv \text{another step-size parameter} \\
\rho_t &amp;amp;\equiv \text{importance sampling ratio} \\
O(d) &amp;amp;\equiv \text{storage \&amp;amp; per-step computational complexity}
\end{aligned}\]

&lt;h3 id=&quot;gtd2&quot;&gt;GTD2&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;With $\mathbf{v}_t$, we can update $\mathbf{w}_t$ using SGD:&lt;/p&gt;

\[\begin{align*}
\mathbf{w}_{t+1} &amp;amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla \overline{\text{PBE}}(\mathbf{w}_t) \\
&amp;amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha\!\left(2\, \mathbb{E}\!\left[\rho_t\!\left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\right) \\
&amp;amp;= \mathbf{w}_t + \alpha\, \mathbb{E}\!\left[\rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\
&amp;amp;\approx \mathbf{w}_t + \alpha\, \mathbb{E}\!\left[\rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T\right] \mathbf{v}_t \\
&amp;amp;\approx \mathbf{w}_t + \alpha \rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T \mathbf{v}_t
\end{align*}\]

    &lt;p&gt;with $O(d)$ per-step computational complexity if the inner product $(\mathbf{x}_t^T \mathbf{v}_t)$ is computed first.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
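
&lt;p&gt;A minimal sketch of the resulting GTD2 step with linear features is below; the secondary step size $\beta$ and the array shapes are assumptions for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def gtd2_step(w, v, x_t, x_tp1, reward, rho, alpha=0.01, beta=0.02, gamma=0.99):
    # TD error under the current weights
    delta = reward + gamma * (w @ x_tp1) - w @ x_t
    # Secondary (LMS-style) update: v tracks E[x x^T]^{-1} E[rho delta x]
    v = v + beta * rho * (delta - v @ x_t) * x_t
    # Primary GTD2 update, O(d) per step if the inner product x_t @ v is formed first
    w = w + alpha * rho * (x_t - gamma * x_tp1) * (x_t @ v)
    return w, v
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;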

&lt;h3 id=&quot;td0-with-gradient-correction-gtd0-or-tdc&quot;&gt;TD(0) with Gradient Correction (GTD(0) or TDC)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Let’s look at another analytical algorithm called TD(0) with gradient correction, &lt;strong&gt;TDC&lt;/strong&gt;:&lt;/p&gt;

\[\begin{align*}
\mathbf{w}_{t+1} &amp;amp;= \mathbf{w}_t + \alpha\, \mathbb{E}\!\left[\rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\
&amp;amp;= \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\rho_t\, \mathbf{x}_t\, \mathbf{x}_t^T\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right]\right) \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\
&amp;amp;= \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right]\right) \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\
&amp;amp;= \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\mathbf{x}_t\, \rho_t\, \delta_t\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\right) \\
&amp;amp;\approx \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\mathbf{x}_t\, \rho_t\, \delta_t\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right] \mathbf{v}_t\right) \\
&amp;amp;\approx \mathbf{w}_t + \alpha \rho_t\!\left(\delta_t\, \mathbf{x}_t - \gamma \mathbf{x}_{t+1}\, \mathbf{x}_t^T \mathbf{v}_t\right)
\end{align*}\]

    &lt;p&gt;with $O(d)$ complexity if the final product $(\mathbf{x}_t^T \mathbf{v}_t)$ is done first.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
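
&lt;p&gt;The corresponding sketch for TDC differs from GTD2 only in the primary update (same assumptions as above):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def tdc_step(w, v, x_t, x_tp1, reward, rho, alpha=0.01, beta=0.02, gamma=0.99):
    delta = reward + gamma * (w @ x_tp1) - w @ x_t
    # Same secondary (LMS-style) update as in GTD2
    v_new = v + beta * rho * (delta - v @ x_t) * x_t
    # TDC primary update: the semi-gradient TD(0) term plus a gradient correction
    w_new = w + alpha * rho * (delta * x_t - gamma * (x_t @ v) * x_tp1)
    return w_new, v_new
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;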

&lt;h3 id=&quot;takeaways-1&quot;&gt;Takeaways&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;GTD2 and TDC both involve 2 learning processes: a primary one for $\mathbf{w}$ and a secondary one for $\mathbf{v}$.&lt;/li&gt;
  &lt;li&gt;Asymmetrical dependence ($\mathbf{w}$ depends on $\mathbf{v}$ but $\mathbf{v}$ does not depend on $\mathbf{w}$) is referred to as a &lt;strong&gt;cascade&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Gradient-TD methods are the most well-understood and widely used stable off-policy methods.&lt;/li&gt;
  &lt;li&gt;Extensions of GTD methods include to:
    &lt;ol&gt;
      &lt;li&gt;Action values and control: &lt;strong&gt;GQ [Maei et al., 2010]&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;Eligibility traces: &lt;strong&gt;GTD($\lambda$), GQ($\lambda$) [Maei, 2011; Maei &amp;amp; Sutton, 2010]&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;Nonlinear function approximation &lt;strong&gt;[Maei et al., 2009]&lt;/strong&gt;&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Hybrid algorithms include:
    &lt;ol&gt;
      &lt;li&gt;Midway between semi-gradient TD and gradient TD &lt;strong&gt;[Hackman, 2012; White &amp;amp; White, 2016]&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;GTD + proximal methods &amp;amp; control variates &lt;strong&gt;[Mahadevan et al., 2014; Du et al., 2017]&lt;/strong&gt;&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;118-emphatic-td-methods&quot;&gt;11.8 Emphatic-TD Methods&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s explore a major strategy for obtaining a cheap and efficient off-policy learning method with function approximation.&lt;/li&gt;
  &lt;li&gt;Recall that linear semi-gradient TD methods are stable when trained under the on-policy distribution.&lt;/li&gt;
  &lt;li&gt;The match between the on-policy state distribution $\mu_\pi$ and the state-transition probabilities $p(s’ \vert s, a)$ under the target policy does not exist in off-policy learning.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Mismatch Fix&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Re-weight the states, emphasizing some and de-emphasizing others, so as to return the distribution of the updates to the on-policy distribution.&lt;/li&gt;
      &lt;li&gt;Then there would be a match, and convergence and stability would be achieved. This is the idea of &lt;strong&gt;Emphatic-TD methods&lt;/strong&gt;.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The &lt;strong&gt;one-step Emphatic-TD algorithm&lt;/strong&gt; for learning episodic state values is defined by:&lt;/li&gt;
&lt;/ul&gt;

\[\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\]

\[\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha M_t \rho_t\, \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)\]

\[M_t = \gamma \rho_{t-1} M_{t-1} + \mathcal{I}_t\]

\[\begin{aligned}
\text{where} \\
\mathcal{I}_t &amp;amp;\equiv \text{the interest} \\
M_t &amp;amp;\equiv \text{the emphasis} \quad (M_{-1} = 0)
\end{aligned}\]
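
&lt;p&gt;A minimal sketch of this one-step Emphatic-TD update with linear features is below; the interest is supplied by the user (often simply 1 for every state), and the function signature is an assumption for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def emphatic_td_step(w, M_prev, rho_prev, rho_t, x_t, x_tp1, reward,
                     interest=1.0, alpha=0.01, gamma=0.99):
    # Emphasis: a discounted, importance-corrected trace of the interest
    M = gamma * rho_prev * M_prev + interest
    delta = reward + gamma * (w @ x_tp1) - w @ x_t
    w = w + alpha * M * rho_t * delta * x_t
    return w, M
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;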

&lt;ul&gt;
  &lt;li&gt;Applying Emphatic TD to Baird’s counterexample yields very high variance in practice, making it nearly impossible to get consistent results in experiments.&lt;/li&gt;
  &lt;li&gt;We focus on how we reduce the variance in all these algorithms in the next section.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;119-reducing-variance&quot;&gt;11.9 Reducing Variance&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Off-policy learning is inherently of greater variance than on-policy learning.&lt;/li&gt;
  &lt;li&gt;The raison d’être of off-policy learning is to enable generalization to the vast number of related-but-not-identical policies.&lt;/li&gt;
  &lt;li&gt;Why is variance control critical in off-policy learning based on importance sampling?
    &lt;ul&gt;
      &lt;li&gt;Recall importance sampling involves products of policy ratios:&lt;/li&gt;
    &lt;/ul&gt;

\[\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \vert S_k)}{b(A_k \vert S_k)}\]

    &lt;ul&gt;
      &lt;li&gt;Each policy ratio has an expected value of 1, but its actual value may be very large or as small as 0:&lt;/li&gt;
    &lt;/ul&gt;

\[\mathbb{E}\!\left[\frac{\pi(A_k \vert S_k)}{b(A_k \vert S_k)}\right] \doteq \sum_a b(a \vert S_k) \frac{\pi(a \vert S_k)}{b(a \vert S_k)} = \sum_a \pi(a \vert S_k) = 1\]

    &lt;ul&gt;
      &lt;li&gt;Successive ratios are uncorrelated, so the expected value of their product is also 1, but the product itself can have very high variance (see the simulation after this list).&lt;/li&gt;
      &lt;li&gt;These ratios multiply the step size in SGD methods, so their high variance is problematic: it produces occasional huge steps.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;How can we alleviate the effects of this high variance, beyond simply setting the step size small enough that the expected step taken by SGD stays small? Some approaches:
    &lt;ul&gt;
      &lt;li&gt;Momentum&lt;/li&gt;
      &lt;li&gt;Polyak-Ruppert averaging&lt;/li&gt;
      &lt;li&gt;Methods for adaptively setting separate step sizes for different components of the parameter vector&lt;/li&gt;
      &lt;li&gt;“Importance weight aware” updates of &lt;strong&gt;Karampatziakis &amp;amp; Langford (2015)&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;Weighted importance sampling, which is well-behaved with lower variance updates than ordinary importance sampling, but adapting it to function approximation is challenging &lt;strong&gt;[Mahmood &amp;amp; Sutton, 2015]&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;Tree backup algorithm (off-policy, without importance sampling)&lt;/li&gt;
      &lt;li&gt;Allow the target policy $\pi$ to be determined partly by the behavior policy $b$ to limit creating large importance sampling ratios&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
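
&lt;p&gt;A quick simulation makes the variance problem tangible: the product of per-step ratios has an expected value of 1, yet most sample values are near zero with occasional huge spikes. The two-action policies below are an illustrative assumption, not an example from the book.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.9, 0.1])   # target policy over 2 actions (assumed)
b = np.array([0.5, 0.5])    # behavior policy (assumed)

T = 10                                       # length of each ratio product
actions = rng.choice(2, size=(100000, T), p=b)
ratios = (pi / b)[actions]                   # per-step ratios pi(a)/b(a)
products = ratios.prod(axis=1)               # the product rho over T steps, per trajectory

print(products.mean())      # close to 1, as expected
print(products.std())       # an order of magnitude larger than the mean
print(np.median(products))  # far below 1: most of the mass is near zero
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;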

&lt;hr /&gt;

&lt;h2 id=&quot;1110-summary&quot;&gt;11.10 Summary&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Off-policy learning poses a challenge that requires creating stable and efficient learning algorithms.&lt;/li&gt;
  &lt;li&gt;Tabular Q-learning makes off-policy learning seem easy, as do its generalizations to Expected Sarsa and the tree-backup algorithm.&lt;/li&gt;
  &lt;li&gt;Extension further to function approximation (even linear) is challenging.&lt;/li&gt;
  &lt;li&gt;The challenge of off-policy learning is divided into two parts:
    &lt;ul&gt;
      &lt;li&gt;Correcting the targets of learning for the behavior policy.&lt;/li&gt;
      &lt;li&gt;Dealing with the instability of bootstrapping (mismatch between off-policy and on-policy distribution of updates).&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The &lt;strong&gt;deadly triad&lt;/strong&gt; arises when we try to combine these 3 elements: &lt;strong&gt;function approximation, off-policy learning, and bootstrapping,&lt;/strong&gt; thereby causing instability and divergence.&lt;/li&gt;
  &lt;li&gt;The Bellman error $\overline{\text{BE}}$ is not learnable from data, so performing SGD on it does not work.&lt;/li&gt;
  &lt;li&gt;Gradient-TD methods perform SGD on the projected Bellman error $\overline{\text{PBE}}$, which is learnable, with $O(d)$ per-step computational complexity.&lt;/li&gt;
  &lt;li&gt;Emphatic-TD methods re-weight updates, emphasizing some and de-emphasizing others, to get the off-policy distribution of the updates to match that of on-policy.&lt;/li&gt;
  &lt;li&gt;There are many ways of reducing high variance in off-policy learning that are centered on minimizing the step taken by SGD by using small step-size parameters to counter the multiplicative effect from the successive policy ratios.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026RLsuttonBartoCh11notes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Sutton &amp;amp; Barto, Ch. 11: Off-Policy Methods with Approximation (Personal Notes)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Mar&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/03/09/rl-sutton-barto-notes-ch011/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;
</description>
        <pubDate>Mon, 09 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch011/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch011/</guid>
        
        
      </item>
    
      <item>
        <title>Sutton &amp; Barto, Ch. 10: On-Policy Control with Approximation (Personal Notes)</title>
        <description>&lt;ul&gt;
  &lt;li&gt;Let’s dive into the control problem now with parametric approximation of the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_{*}(s, a)$, where $\mathbf{w} \in \mathbb{R}^d$ is a &lt;strong&gt;finite-dimensional weight vector.&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;We’ll focus on &lt;strong&gt;semi-gradient Sarsa&lt;/strong&gt;, the natural extension of semi-gradient TD(0) to action values and to on-policy control.&lt;/li&gt;
  &lt;li&gt;We’ll look at this extension in both the episodic and continuing case.&lt;/li&gt;
  &lt;li&gt;We’ll look at $n$-step linear Sarsa.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#101-episodic-semi-gradient-control&quot;&gt;10.1 Episodic Semi-gradient Control&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#102-semi-gradient-n-step-sarsa&quot;&gt;10.2 Semi-gradient $n$-step Sarsa&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#103-average-reward-a-new-problem-setting-for-continuing-tasks&quot;&gt;10.3 Average Reward: A New Problem Setting for Continuing Tasks&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#104-deprecating-the-discounted-setting&quot;&gt;10.4 Deprecating the Discounted Setting&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#105-differential-semi-gradient-n-step-sarsa&quot;&gt;10.5 Differential Semi-gradient $n$-step Sarsa&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#106-summary&quot;&gt;10.6 Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;101-episodic-semi-gradient-control&quot;&gt;10.1 Episodic Semi-gradient Control&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The extension of the semi-gradient prediction methods of Chapter 9 to action values is straightforward.&lt;/li&gt;
  &lt;li&gt;It is the approximate action-value function, $\hat{q} \approx q_\pi$, that is represented as a parametrized functional form with weight vector $\mathbf{w}$.&lt;/li&gt;
  &lt;li&gt;Before, the training examples had the form $S_t \mapsto U_t$; now the examples have the form $S_t, A_t \mapsto U_t$.&lt;/li&gt;
  &lt;li&gt;The update target $U_t$ can be any approximation of $q_\pi(S_t, A_t)$, including the usual backed-up values such as the full Monte Carlo (MC) return $G_t$ or any $n$-step Sarsa return $G_{t:t+n}$.&lt;/li&gt;
  &lt;li&gt;The general gradient-descent update for action-value prediction is:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left[U_t - \hat{q}(S_t, A_t, \mathbf{w}_t)\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)\]

&lt;ul&gt;
  &lt;li&gt;The update for the one-step Sarsa method is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left[R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\]

&lt;ul&gt;
  &lt;li&gt;This method is called &lt;strong&gt;episodic semi-gradient one-step Sarsa&lt;/strong&gt;. For a constant policy, this method converges in the same way that TD(0) does with the same kind of error bound.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Control&lt;/strong&gt; = action-value prediction + policy improvement &amp;amp; action selection:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{a, S_{t+1} \longrightarrow \hat{q}(S_{t+1}, a, \mathbf{w}_t) \longrightarrow A^*_{t+1} = \arg\max_a \hat{q}(S_{t+1}, a, \mathbf{w}_t) \longrightarrow \varepsilon\text{-greedy policy improvement} \longrightarrow \varepsilon\text{-greedy action selection}}\]

&lt;ul&gt;
  &lt;li&gt;Linear function approximation for the action-value function is:&lt;/li&gt;
&lt;/ul&gt;

\[\hat{q}(s, a, \mathbf{w}) \doteq \mathbf{w}^T \mathbf{x}(s, a) = \sum_{i=1}^{d} w_i \cdot x_i(s, a)\]
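
&lt;p&gt;A minimal sketch of episodic semi-gradient one-step Sarsa with linear action-value features and $\varepsilon$-greedy action selection is below; the minimal environment interface (reset/step) and the feature function $\mathbf{x}(s, a)$ are assumptions for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def epsilon_greedy(w, x, s, n_actions, eps, rng):
    if rng.random() &lt; eps:
        return int(rng.integers(n_actions))
    q = np.array([w @ x(s, a) for a in range(n_actions)])
    return int(np.argmax(q))

def semi_gradient_sarsa_episode(w, x, env, n_actions, rng,
                                alpha=0.1, gamma=1.0, eps=0.1):
    s = env.reset()                            # assumed minimal environment interface
    a = epsilon_greedy(w, x, s, n_actions, eps, rng)
    done = False
    while not done:
        s_next, r, done = env.step(a)          # assumed minimal environment interface
        if done:
            target = r                         # no bootstrapping at termination
        else:
            a_next = epsilon_greedy(w, x, s_next, n_actions, eps, rng)
            target = r + gamma * (w @ x(s_next, a_next))
        # Semi-gradient update: differentiate q-hat only, hold the target fixed
        w = w + alpha * (target - w @ x(s, a)) * x(s, a)
        if not done:
            s, a = s_next, a_next
    return w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;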

&lt;hr /&gt;

&lt;h2 id=&quot;102-semi-gradient-n-step-sarsa&quot;&gt;10.2 Semi-gradient $n$-step Sarsa&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;We use an $n$-step return as the update target for episodic semi-gradient $n$-step Sarsa. The $n$-step return generalizes from its tabular form to a function approximation form:&lt;/li&gt;
&lt;/ul&gt;

\[G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \quad t+n &amp;lt; T\]

\[\text{with } G_{t:t+n} \doteq G_t \text{ if } t+n \geq T\]

&lt;ul&gt;
  &lt;li&gt;The $n$-step update equation is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\!\left[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t &amp;lt; T}\]

&lt;ul&gt;
  &lt;li&gt;Performance is best if an intermediate level of bootstrapping is used ($n &amp;gt; 1$).&lt;/li&gt;
&lt;/ul&gt;
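
&lt;p&gt;A small helper sketch for computing this $n$-step return with function approximation is below; the reward and state-action buffers (and their indexing convention) are assumptions for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def n_step_return(rewards, states, actions, w, x, t, n, T, gamma=1.0):
    # Assumed buffers: rewards[k] holds R_{k+1}; states[k], actions[k] hold S_k, A_k
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n &lt; T:
        # Bootstrap from the approximate value of (S_{t+n}, A_{t+n})
        G += gamma ** n * (w @ x(states[t + n], actions[t + n]))
    return G
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;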

&lt;hr /&gt;

&lt;h2 id=&quot;103-average-reward-a-new-problem-setting-for-continuing-tasks&quot;&gt;10.3 Average Reward: A New Problem Setting for Continuing Tasks&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Average reward applies to continuing problems for goal formulation in MDPs.&lt;/li&gt;
  &lt;li&gt;Average reward uses &lt;strong&gt;no discounting&lt;/strong&gt;; the agent has the same level of care for immediate and delayed rewards.&lt;/li&gt;
  &lt;li&gt;Average reward setting is more commonly considered in dynamic programming and less commonly in reinforcement learning (RL).&lt;/li&gt;
  &lt;li&gt;The discounted setting is problematic with function approximation, hence the need for average reward to replace it.&lt;/li&gt;
  &lt;li&gt;In the average-reward setting, the quality of a policy $\pi$ is defined as the average rate of reward, or simply &lt;strong&gt;average reward&lt;/strong&gt;, while following that policy, denoted as $r(\pi)$:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
r(\pi) &amp;amp;\doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] \\
&amp;amp;= \lim_{t \to \infty} \mathbb{E}\!\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] \\
&amp;amp;= \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s&apos;, r} p(s&apos;, r \vert s, a)\, r
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The expectations in the above equations are conditioned on the initial state $S_0$, and on the subsequent actions $A_0, A_1, \ldots, A_{t-1}$, being taken according to $\pi$.&lt;/li&gt;
  &lt;li&gt;The 2nd and 3rd equations above hold if the MDP is &lt;strong&gt;ergodic&lt;/strong&gt;, i.e., if the steady-state distribution exists and is independent of the starting state $S_0$:&lt;/li&gt;
&lt;/ul&gt;

\[\mu_\pi(s) \doteq \lim_{t \to \infty} \Pr\!\left\{S_t = s \mid A_{0:t-1} \sim \pi\right\}\]

&lt;ul&gt;
  &lt;li&gt;In an ergodic MDP, the starting state can have only a temporary effect, but in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities.&lt;/li&gt;
  &lt;li&gt;Ergodicity is sufficient but not necessary to guarantee the existence of the limit in the $r(\pi)$ equation above.&lt;/li&gt;
  &lt;li&gt;It may be adequate in practice to simply order policies according to their average reward per time step, also called the &lt;strong&gt;reward rate&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;All policies that reach the maximal value of $r(\pi)$ are optimal.&lt;/li&gt;
  &lt;li&gt;The steady-state distribution $\mu_\pi$ is the special distribution under which, if you select actions according to $\pi$, you remain in the same distribution, i.e., for which:&lt;/li&gt;
&lt;/ul&gt;

\[\sum_s \mu_\pi(s) \sum_a \pi(a \vert s)\, p(s&apos; \vert s, a) = \mu_\pi(s&apos;)\]

&lt;ul&gt;
  &lt;li&gt;In the average-reward setting, returns are defined in terms of differences between rewards and the average reward; this is called the &lt;strong&gt;differential return&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \ldots\]

&lt;ul&gt;
  &lt;li&gt;The corresponding value functions for the differential return are known as &lt;strong&gt;differential value functions&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
v_\pi(s) &amp;amp;\doteq \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] \\
q_\pi(s, a) &amp;amp;\doteq \mathbb{E}_\pi\!\left[G_t \mid S_t = s, A_t = a\right]
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;Differential value functions also have Bellman equations:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
v_\pi(s) &amp;amp;= \sum_a \pi(a \vert s) \sum_{r, s&apos;} p(s&apos;, r \vert s, a)\!\left[r - r(\pi) + v_\pi(s&apos;)\right] \\[6pt]
q_\pi(s, a) &amp;amp;= \sum_{r, s&apos;} p(s&apos;, r \vert s, a)\!\left[r - r(\pi) + \sum_{a&apos;} \pi(a&apos; \vert s&apos;)\, q_\pi(s&apos;, a&apos;)\right] \\[6pt]
v_{*}(s) &amp;amp;= \max_a \sum_{r, s&apos;} p(s&apos;, r \vert s, a)\!\left[r - \max_\pi r(\pi) + v_{*}(s&apos;)\right] \\[6pt]
q_{*}(s, a) &amp;amp;= \sum_{r, s&apos;} p(s&apos;, r \vert s, a)\!\left[r - \max_\pi r(\pi) + \max_{a&apos;} q_{*}(s&apos;, a&apos;)\right]
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;The differential form of the 2 TD errors:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
\delta_t &amp;amp;\doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\
\delta_t &amp;amp;\doteq R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)
\end{aligned}\]

\[\begin{aligned}
\text{where} \quad \bar{R}_t &amp;amp;= \text{average reward } r(\pi) \text{ estimate at time } t
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;Most of the algorithms covered so far don’t change for the average-reward setting. For example, the semi-gradient Sarsa average-reward version is the same as the regular version except with the differential version of the TD error:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)\]
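
&lt;p&gt;A minimal sketch of this differential semi-gradient Sarsa step, which maintains a running estimate $\bar{R}$ of the average reward alongside the weights, is below; the step size for $\bar{R}$ and the array shapes are assumptions for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def differential_sarsa_step(w, r_bar, x_sa, x_sa_next, reward, alpha=0.1, beta=0.01):
    # Differential TD error: reward measured relative to the average-reward estimate
    delta = reward - r_bar + w @ x_sa_next - w @ x_sa
    r_bar = r_bar + beta * delta     # update the average-reward estimate
    w = w + alpha * delta * x_sa     # semi-gradient weight update
    return w, r_bar
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;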

&lt;hr /&gt;

&lt;h2 id=&quot;104-deprecating-the-discounted-setting&quot;&gt;10.4 Deprecating the Discounted Setting&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;For the tabular case, the continuing, discounted problem formulation is useful, but in the approximate case, is this problem formulation necessary?&lt;/li&gt;
  &lt;li&gt;Should we use the discounted reward or average reward in continuing tasks?&lt;/li&gt;
  &lt;li&gt;It turns out that the average of the discounted return is proportional to the average reward.&lt;/li&gt;
  &lt;li&gt;The ordering of all policies in the average discounted return setting would be exactly the same as in the average-reward setting.&lt;/li&gt;
  &lt;li&gt;This idea of the &lt;strong&gt;futility of discounting in continuing problems&lt;/strong&gt; can be proven by the &lt;strong&gt;symmetry argument.&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;Let’s choose an objective that saves discounting by summing discounted values over the distribution with which states occur under the policy (where $v^\gamma_\pi \equiv$ discounted value function):&lt;/p&gt;

\[\begin{align*}
J(\pi) &amp;amp;= \sum_s \mu_\pi(s)\, v^\gamma_\pi(s) \\
&amp;amp;= \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s&apos;} \sum_r p(s&apos;, r \vert s, a)\!\left[r + \gamma v^\gamma_\pi(s&apos;)\right] \\
&amp;amp;= r(\pi) + \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s&apos;} \sum_r p(s&apos;, r \vert s, a)\, \gamma v^\gamma_\pi(s&apos;) \\
&amp;amp;= r(\pi) + \gamma \sum_{s&apos;} v^\gamma_\pi(s&apos;) \sum_s \mu_\pi(s) \sum_a \pi(a \vert s)\, p(s&apos; \vert s, a) \\
&amp;amp;= r(\pi) + \gamma \sum_{s&apos;} v^\gamma_\pi(s&apos;)\, \mu_\pi(s&apos;) \\
&amp;amp;= r(\pi) + \gamma J(\pi) \\
&amp;amp;= r(\pi) + \gamma\!\left(r(\pi) + \gamma J(\pi)\right) \\
&amp;amp;= r(\pi) + \gamma r(\pi) + \gamma^2 J(\pi) \\
&amp;amp;= r(\pi) + \gamma r(\pi) + \gamma^2 r(\pi) + \gamma^3 r(\pi) + \gamma^4 r(\pi) + \ldots \\
&amp;amp;= r(\pi)\!\left[1 + \gamma + \gamma^2 + \gamma^3 + \ldots\right]
\end{align*}\]

\[\hspace{-6cm} \boxed{J(\pi) = \left(\frac{1}{1-\gamma}\right) r(\pi)}\]
      &lt;/li&gt;
      &lt;li&gt;&lt;em&gt;The proposed discounted objective orders policies identically to the undiscounted (average reward) objective.&lt;/em&gt;&lt;/li&gt;
      &lt;li&gt;&lt;em&gt;The discount rate $\gamma$ does not influence the ordering.&lt;/em&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The root cause of the difficulties with the discounted control setting is that with function approximation we have lost the policy improvement theorem.&lt;/li&gt;
  &lt;li&gt;Now if we change the policy to improve the discounted value of one state, we are no longer guaranteed to have improved the overall policy.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;105-differential-semi-gradient-n-step-sarsa&quot;&gt;10.5 Differential Semi-gradient $n$-step Sarsa&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;We need an $n$-step version of the TD error in order to generalize to $n$-step bootstrapping.&lt;/li&gt;
  &lt;li&gt;Let’s generalize the $n$-step return to its differential form, with function approximation:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_{t:t+n} \doteq R_{t+1} - \bar{R}_{t+n-1} + \ldots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})}\]

\[\begin{aligned}
\text{where} \quad \bar{R} &amp;amp;\equiv \text{an estimate of } r(\pi),\quad n \geq 1\ \&amp;amp;\ t+n &amp;lt; T \\
G_{t:t+n} &amp;amp;\doteq G_t \quad \text{ if } t+n \geq T
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;The $n$-step TD error is:&lt;/li&gt;
&lt;/ul&gt;

\[\delta_t \doteq G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w})\]

&lt;hr /&gt;

&lt;h2 id=&quot;106-summary&quot;&gt;10.6 Summary&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Extended parametrized function approximation &amp;amp; semi-gradient descent to control.&lt;/li&gt;
  &lt;li&gt;The extension is immediate for the episodic case; for the continuing case, it depends on a new problem formulation based on maximizing the &lt;strong&gt;average reward&lt;/strong&gt; per time step.&lt;/li&gt;
  &lt;li&gt;The discounted formulation cannot be carried over to control in the presence of approximations.&lt;/li&gt;
  &lt;li&gt;Most policies cannot be represented by a value function in the approximate case.&lt;/li&gt;
  &lt;li&gt;The scalar average reward $r(\pi)$ provides an effective way of ranking the remaining arbitrary policies.&lt;/li&gt;
  &lt;li&gt;The average reward formulation involves new &lt;strong&gt;differential&lt;/strong&gt; versions of value functions, Bellman equations, and TD errors, but all of these parallel the old ones and the conceptual changes are small.&lt;/li&gt;
  &lt;li&gt;The average reward setting has a new parallel set of differential algorithms.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026RLsuttonBartoCh10notes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Sutton &amp;amp; Barto, Ch. 10: On-Policy Control with Approximation (Personal Notes)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Mar&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/03/09/rl-sutton-barto-notes-ch010/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;
</description>
        <pubDate>Mon, 09 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch010/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch010/</guid>
        
        
      </item>
    
      <item>
        <title>When Your Voice Assistant Can&apos;t Hear Tones: Evaluating ASR Bias in Igbo</title>
        <description>&lt;p&gt;I grew up in an Igbo household in Northern Nigeria, that code-switched between English, Igbo, and Hausa almost unconsciously. Like many bilingual Nigerians, I’ve watched voice assistants and ASR systems get better and better at English while struggling with our languages. When Meta released omniASR claiming support for over 1,600 languages including Igbo, I was curious. Does “supported” mean it actually works?&lt;/p&gt;

&lt;p&gt;Turns out, the answer is more complicated than I expected.&lt;/p&gt;

&lt;h2 id=&quot;the-problem-what-does-language-support-really-mean&quot;&gt;The Problem: What Does “Language Support” Really Mean?&lt;/h2&gt;

&lt;p&gt;Here’s the thing about Igbo: tone changes word meaning. The difference between “akwa” (crying), “akwà” (cloth), “àkwà” (egg), and “ákwá” (bridge) isn’t just decorative accent marks. These are completely different words that happen to have the same consonants and vowels. The tone is the difference.&lt;/p&gt;

&lt;p&gt;So when I saw that omniASR listed Igbo among its supported languages, I wanted to know: does it actually preserve these tonal distinctions? Or does “support” just mean “we trained on some Igbo data and hope for the best”?&lt;/p&gt;

&lt;h2 id=&quot;the-experiment-21-audio-samples&quot;&gt;The Experiment: 21 Audio Samples&lt;/h2&gt;

&lt;p&gt;I designed a simple test. Using my iPhone Voice Memos app, I recorded 21 short audio clips in different categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tonal minimal pairs&lt;/strong&gt;: I said “akwa, akwa, akwa” three times with no tone, then “akwà, akwà, akwà” three times with low tone, then “àkwà, àkwà, àkwà” with low-low tone, and finally “ákwá, ákwá, ákwá” with high-high tone. Four distinct words, each repeated three times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code-switching&lt;/strong&gt;: Phrases like “The ụlọ is beautiful” where I mix English and Igbo naturally, the way we actually speak.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Place names and cultural terms&lt;/strong&gt;: Nigerian cities, Igbo food words, proverbs. The stuff that’s probably not in training data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The smoking gun test&lt;/strong&gt;: I spoke a sentence with deliberately flat intonation, no tonal variation at all. If the model is actually listening to tone in the audio, it shouldn’t add tone marks to monotone speech.&lt;/p&gt;

&lt;p&gt;Then I ran everything through omniASR and compared what I actually said to what it transcribed.&lt;/p&gt;

&lt;h2 id=&quot;the-results-75-tone-loss&quot;&gt;The Results: 75% Tone Loss&lt;/h2&gt;

&lt;p&gt;The numbers were worse than I expected.&lt;/p&gt;

&lt;p&gt;For the tonal samples, the bootstrapped estimate showed the model dropping 75.5% of the tone marks. Not just a few mistakes here and there. Three out of every four tone marks, gone.&lt;/p&gt;

&lt;p&gt;When I said the four different “akwa” words, the model output was: “akua akua akua akua akwa akwa akwa akua akwa ọkua ọkua ọkua”. Random variations. The semantic distinctions completely lost.&lt;/p&gt;

&lt;p&gt;But here’s what really convinced me the model isn’t actually listening to tones: the monotone test. I spoke “O na-eri oji n’ututu” (He eats kolanut in the morning) with flat intonation, like a robot. The model transcribed it as “ọne rị ọjí nụ tútú” and added tone marks that I never spoke.&lt;/p&gt;

&lt;p&gt;If the model were using acoustic information to place diacritics, it shouldn’t be adding tones to flat speech. This suggests it’s doing something else: probably using statistical patterns from training data to guess where diacritics should go, rather than actually hearing them.&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Key Diagnostic: The Monotone Test&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;&lt;strong&gt;File 09:&lt;/strong&gt; Spoke “O na-eri oji n’ututu” with FLAT intonation&lt;br /&gt;
&lt;strong&gt;Expected:&lt;/strong&gt; 0 diacritics (no tonal variation in audio)&lt;br /&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Model added 7 tone marks that weren’t spoken&lt;br /&gt;
&lt;br /&gt;
This is evidence of &lt;strong&gt;orthographic bias,&lt;/strong&gt; not acoustic perception.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;what-the-data-shows&quot;&gt;What the Data Shows&lt;/h3&gt;

&lt;p&gt;I created three visualizations to make the patterns clear.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/omniASR/fig1_loss_by_category.png&quot; alt=&quot;loss by category&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt; shows diacritic loss by category. The tonal category (in red) jumps out immediately at 61.2% raw count loss. For comparison, the domain-specific category had only 6.3% loss. But look at the cross-lingual interference category: it’s at -38.9%, which means the model was adding diacritics that don’t exist. It’s not just dropping tones, it’s hallucinating them in the wrong places.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/omniASR/fig2_cer_vs_diacritic_loss.png&quot; alt=&quot;char error rate vs diacritic loss&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2&lt;/strong&gt; plots character error rate against diacritic loss for each sample. What’s interesting here is that the tonal samples (red dots) show high diacritic loss even when the overall character error rate is moderate (20-40%). This means tone errors aren’t just a consequence of the model doing poorly in general. The model can get most of the characters right while still completely failing on tones specifically.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/omniASR/fig3_bootstrap_ci.png&quot; alt=&quot;boostrap confidence interval&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt; shows the bootstrap confidence intervals. Even with only 21 samples, the error bars don’t overlap between categories. The tonal category’s worst-case lower bound is 57.1%, which is still terrible. This confirms that what I’m seeing isn’t just noise from a small sample size.&lt;/p&gt;

&lt;h2 id=&quot;the-statistical-story&quot;&gt;The Statistical Story&lt;/h2&gt;

&lt;p&gt;I’m not a statistician, but I know enough to be careful with small sample sizes. Twenty-one samples isn’t huge. So I used bootstrap resampling (basically, randomly resampling my data 10,000 times to get confidence intervals) to make sure these effects weren’t just random noise.&lt;/p&gt;
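
&lt;p&gt;For anyone curious what that looks like in code, here is a minimal sketch of a percentile bootstrap confidence interval over per-utterance loss values; the numbers in the example call are placeholders, not my actual data.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def bootstrap_ci(values, n_boot=10000, ci=95, seed=0):
    # values: per-utterance diacritic-loss rates
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    boot_means = np.empty(n_boot)
    for i in range(n_boot):
        resample = rng.choice(values, size=len(values), replace=True)
        boot_means[i] = resample.mean()
    lo, hi = np.percentile(boot_means, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return values.mean(), lo, hi

# Placeholder per-utterance loss rates, not the real results
mean, lo, hi = bootstrap_ci([0.9, 0.8, 0.6, 0.7, 1.0, 0.5])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;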

&lt;p&gt;Even under the most conservative estimate (the lower bound of the 95% confidence interval), tonal diacritic loss was still 57.1%. The worst-case scenario is still terrible.&lt;/p&gt;

&lt;p&gt;I also created a custom metric called Diacritic Error Rate (DER) because standard Character Error Rate treats tone marks the same as spacing errors. DER specifically tracks dropped tone marks versus hallucinated tone marks. Turns out the model isn’t just dropping tones. It’s also adding tones that don’t exist, which is a whole different kind of problem.&lt;/p&gt;
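
&lt;p&gt;And here is a minimal sketch of how a diacritic error rate along these lines can be computed by comparing the combining tone marks present in the reference and the hypothesis; it is an illustrative reimplementation, not necessarily the exact metric in the repository.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import unicodedata

def diacritics(text):
    # Decompose to NFD so tone marks become separate combining characters
    return [c for c in unicodedata.normalize(&quot;NFD&quot;, text) if unicodedata.combining(c)]

def diacritic_error_rate(reference, hypothesis):
    ref, hyp = diacritics(reference), diacritics(hypothesis)
    dropped = max(len(ref) - len(hyp), 0)        # tone marks missing from the output
    hallucinated = max(len(hyp) - len(ref), 0)   # tone marks the model invented
    total = max(len(ref), 1)
    return {&quot;dropped&quot;: dropped / total, &quot;hallucinated&quot;: hallucinated / total}

print(diacritic_error_rate(&quot;àkwà&quot;, &quot;akwa&quot;))  # every tone mark dropped
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;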

&lt;h2 id=&quot;the-categories&quot;&gt;The Categories&lt;/h2&gt;

&lt;p&gt;Breaking down the errors helped me understand what’s going wrong:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-lingual interference&lt;/strong&gt;: When I spoke phrases with no tone marks at all (like names), the model added incorrect diacritics 38.9% of the time. It’s probably applying orthographic patterns from other languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code-switching boundary effects&lt;/strong&gt;: The English portions of code-switched sentences were transcribed perfectly. The Igbo portions immediately adjacent to English lost their tones. Something about language boundaries is disrupting processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain coverage&lt;/strong&gt;: Culturally specific terms (place names, food words) had the best diacritic preservation at only 6.3% loss, but terrible overall accuracy. The model knows the orthography but doesn’t know the words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tonal collapse&lt;/strong&gt;: 75.5% loss. This is the big one.&lt;/p&gt;

&lt;h2 id=&quot;why-this-matters&quot;&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;I keep coming back to the monotone hallucination test. If I were building a voice assistant for Igbo speakers and it’s adding tones I didn’t speak, that’s not just an accuracy problem. It’s an epistemological problem. The system is presenting confident outputs that have no acoustic basis.&lt;/p&gt;

&lt;p&gt;Imagine you’re dictating a text message in Igbo and the system confidently transcribes “crying” when you said “cloth.” Not just a typo you can spot and fix. A completely different word that makes semantic nonsense but looks plausible.&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;What 75% Tonal Loss Means&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;75.5% bootstrap diacritic loss means:&lt;br /&gt;
&lt;strong&gt;3 out of 4&lt;/strong&gt; tone marks disappear&lt;br /&gt;
&lt;strong&gt;“cloth”&lt;/strong&gt; → could mean “crying”&lt;br /&gt;
&lt;strong&gt;“egg”&lt;/strong&gt; → meaning lost entirely&lt;br /&gt;
&lt;strong&gt;“bridge”&lt;/strong&gt; → wrong word
&lt;br /&gt;&lt;br /&gt;
In English, this would be like dropping 75% of consonants.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;This isn’t just about transcription accuracy. It’s about whether “supporting 1,600+ languages” means anything more than “we trained on data from 1,600+ languages and didn’t check if it actually works for tonal distinctions.”&lt;/p&gt;

&lt;h2 id=&quot;the-bigger-picture-zenos-paradox-of-low-resource-languages&quot;&gt;The Bigger Picture: Zeno’s Paradox of Low-Resource Languages&lt;/h2&gt;

&lt;p&gt;There’s a paper from EMNLP 2024 that talks about “The Zeno’s Paradox of Low-Resource Languages.” The basic idea: models keep claiming to support more and more languages, but the quality asymptote never actually reaches parity with high-resource languages. We get closer and closer, but never quite there.&lt;/p&gt;

&lt;p&gt;Igbo is interesting because by speaker population (45 million people), it’s not low-resource. But by model performance, it clearly behaves like one. The gap between coverage (we trained on Igbo data) and competence (the model preserves linguistically meaningful distinctions) is huge.&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;&apos;Supported&apos; ≠ Works Well&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;omniASR claims support for 1,600+ languages. Igbo has 45 million speakers, but its tonal accuracy is 24.5% (only 1 in 4 tone marks preserved).&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;Coverage&lt;/strong&gt; (in training data) ≠ &lt;strong&gt;Competence&lt;/strong&gt; (preserves meaning)&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;This makes me think about all the other languages in that 1,600+ list. How many of them have this same gap? How many communities are using systems that confidently produce nonsense because nobody with native speaker expertise has stress-tested them?&lt;/p&gt;

&lt;h2 id=&quot;what-i-learned&quot;&gt;What I Learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Small, targeted datasets can reveal problems big datasets hide.&lt;/strong&gt; I didn’t need thousands of hours of audio. Twenty-one carefully designed samples were enough to show systematic failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native speaker expertise matters.&lt;/strong&gt; Automated metrics can’t catch when “crying” is transcribed as “cloth” because the character error rate looks fine. You need someone who speaks the language to know that the semantic content is destroyed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bootstrap resampling is powerful for small samples.&lt;/strong&gt; I was worried 21 samples was too few, but bootstrap confidence intervals let me quantify uncertainty rigorously. Even the pessimistic lower bounds showed substantial effects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The monotone test is a better diagnostic than I expected.&lt;/strong&gt; If diacritics are added to flat speech, that’s clear evidence of orthographic bias over acoustic conditioning. One simple test that revealed the core mechanism.&lt;/p&gt;

&lt;h2 id=&quot;the-technical-details&quot;&gt;The Technical Details&lt;/h2&gt;

&lt;p&gt;For anyone interested in replicating this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;I used my iPhone for recording (Voice Memos app, M4A format)&lt;/li&gt;
  &lt;li&gt;Ran inference through Google Colab with omniASR’s official pipeline&lt;/li&gt;
  &lt;li&gt;Computed bootstrap CIs with 10,000 iterations at the utterance level&lt;/li&gt;
  &lt;li&gt;Created a custom DER metric to separate tonal errors from general transcription errors&lt;/li&gt;
  &lt;li&gt;All code, data, and analysis is on GitHub and HuggingFace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole analysis took about half a week of evening work. Most of that was iterating on the sample design and figuring out the right statistical approach. The actual recording and inference was maybe a day.&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s Next&lt;/h2&gt;

&lt;p&gt;This is really just a proof of concept. To make stronger claims, I’d need:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Multi-speaker evaluation (10+ speakers across different Igbo dialects)&lt;/li&gt;
  &lt;li&gt;Acoustic analysis (F0 contour tracking to verify what’s actually in the audio)&lt;/li&gt;
  &lt;li&gt;Comparative evaluation (does Whisper do better? What about Google’s USM?)&lt;/li&gt;
  &lt;li&gt;Fine-tuning experiments (can we fix this with targeted training data?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have ideas for all of these, but they’re bigger projects. For now, I’m focused on documenting the blind spot and making the methodology replicable.&lt;/p&gt;

&lt;h2 id=&quot;why-im-sharing-this&quot;&gt;Why I’m Sharing This&lt;/h2&gt;

&lt;p&gt;This started as curiosity about whether “multilingual” ASR systems actually work for the languages I grew up speaking. But it turned into something bigger.&lt;/p&gt;

&lt;p&gt;There’s a tendency in ML to treat “supporting” a language as a checkbox. Train on some data, add it to the model card, ship it. But languages aren’t just data. They’re how people communicate, how they think, how they preserve culture.&lt;/p&gt;

&lt;p&gt;When voice assistants strip tone marks from Igbo, they’re not just making transcription errors. They’re normalizing a version of the language that doesn’t preserve meaning. If every voice interface does this, what happens to how people write Igbo? Do they start thinking tone marks are optional because the AI doesn’t use them?&lt;/p&gt;

&lt;p&gt;I don’t know the answers to these questions. But I think they’re worth asking before we claim to “support” 1,600+ languages.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;p&gt;If you want to explore the data or replicate the analysis:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots&quot;&gt;HuggingFace&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href=&quot;https://github.com/chizkidd/igbo-asr-tonal-evaluation&quot;&gt;GitHub&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Audio samples:&lt;/strong&gt; You can actually listen to the 21 clips and see the transcription failures yourself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dataset is CC-BY-4.0 licensed, while the code is MIT licensed. If this is useful for your work, feel free to use it, cite it, and build on it.&lt;/p&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;This project taught me something important: you don’t need massive compute or huge datasets to find meaningful problems in ML systems. You just need to know where to look and what questions to ask.&lt;/p&gt;

&lt;p&gt;As a native Igbo speaker, I knew what questions to ask. As someone learning ML, I knew how to design tests and interpret results. That combination turned out to be more valuable than I expected.&lt;/p&gt;

&lt;p&gt;If you speak a language that’s “supported” by these big multilingual models, I encourage you to test them. Record some minimal pairs. Try code-switching. See if the system actually works the way you use the language, not just the way it appears in training data.&lt;/p&gt;

&lt;p&gt;You might be surprised what you find.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this work helpful, please consider citing it:&lt;/p&gt;
&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026igboasr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;When Your Voice Assistant Can&apos;t Hear Tones: Evaluating ASR Bias in Igbo&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Mar&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/03/04/igbo-asr-tonal-evaluation/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

</description>
        <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/03/04/igbo-asr-tonal-eval/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/03/04/igbo-asr-tonal-eval/</guid>
        
        
      </item>
    
      <item>
        <title>Tonal Fidelity in Multilingual ASR: A Diagnostic Evaluation</title>
<description>&lt;p&gt;This is a brief guide to my evaluation of tonal preservation in Facebook’s omniASR-CTC-1B Automatic Speech Recognition (ASR) model for Igbo, a tonal Niger-Congo language with 45 million speakers. The model claims support for 1,600+ languages including Igbo, but what does “support” mean when tone changes word meaning? I created 21 systematically designed audio samples, ran them through the model, and measured a 75.5% bootstrapped diacritic loss rate on tonal markers. The core finding: the model appears to generate tone marks probabilistically based on orthographic priors rather than acoustic conditioning. I cannot simplify this investigation any further.&lt;/p&gt;

&lt;p&gt;Where to find it: The dataset with audio is on &lt;a href=&quot;https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots&quot;&gt;HuggingFace&lt;/a&gt;. The code and analysis are on &lt;a href=&quot;https://github.com/chizkidd/igbo-asr-tonal-evaluation&quot;&gt;GitHub&lt;/a&gt;. The full analysis notebook is available at &lt;a href=&quot;https://github.com/chizkidd/igbo-asr-tonal-evaluation/blob/main/analysis.ipynb&quot;&gt;analysis.ipynb&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The following is my guide to stepping through the evaluation methodology.&lt;/p&gt;

&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;

&lt;p&gt;In Igbo, tone is phonemic. This means tone changes word meaning, not just prosody. The difference between:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;akwa (crying)&lt;/li&gt;
  &lt;li&gt;akwà (cloth)&lt;/li&gt;
  &lt;li&gt;àkwà (egg)&lt;/li&gt;
  &lt;li&gt;ákwá (bridge)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…isn’t decorative. These are four completely different words that happen to share consonants and vowels. The tone marks (diacritics) are the only thing distinguishing them. When omniASR lists Igbo as “supported,” does it preserve these tonal distinctions? Or does “support” just mean “we trained on some Igbo data”?&lt;/p&gt;

&lt;h2 id=&quot;dataset-design&quot;&gt;Dataset Design&lt;/h2&gt;

&lt;p&gt;I recorded 21 audio samples using my iPhone SE Voice Memos app. Each sample targets a specific failure mode across four categories.&lt;/p&gt;

&lt;p&gt;The first category tests cross-lingual orthographic interference. My hypothesis was that the model applies incorrect orthographic conventions from other languages to Igbo text. I recorded five samples: personal names without tone marks, formal greetings, numbers in Igbo, well-known proverbs, and a slow prosody test. I expected 0% diacritic loss since there was nothing to lose, but observed -38.9%, meaning the model added diacritics that don’t exist.&lt;/p&gt;

&lt;p&gt;The second category tests phonemic tone sensitivity. The hypothesis here is that the model cannot distinguish phonemically contrastive tones. I recorded six samples including minimal pairs like akwa/akwà/àkwà/ákwá and oke/òkè/ọkè, dense tone marking, a monotone control (the key diagnostic), and two Yoruba controls. I expected low loss if the model uses acoustic information, but observed 75.5% loss with a bootstrap 95% confidence interval of [57.1%, 89.7%].&lt;/p&gt;

&lt;p&gt;The smoking gun is file 09. I spoke “O na-eri oji n’ututu” with deliberately flat intonation and no tonal variation at all. The model transcribed it as “ọne rị ọjí nụ tútú” and ADDED tone marks I never spoke. If the model were using acoustics, it shouldn’t hallucinate tones on monotone speech.&lt;/p&gt;

&lt;p&gt;The third category tests language boundary effects from code-switching. I hypothesized that switching between English and Igbo disrupts language-specific processing. Five samples test different patterns: English embedding into Igbo, Igbo embedding into English, sentence-level alternation, diacritics in English context, and Nigerian Pidgin as a control. The result was 14.3% diacritic loss, with English portions transcribed perfectly while adjacent Igbo lost tone marks.&lt;/p&gt;

&lt;p&gt;The fourth category tests domain-specific lexical coverage. The hypothesis is that the model would struggle with culturally specific terms outside the training distribution. I recorded Nigerian place names, Igbo food terms, long proverbs, French as a high-resource control, and background noise robustness. This category showed the best diacritic preservation at only 6.3% loss, but terrible overall accuracy, with a 30% character error rate indicating word-level errors.&lt;/p&gt;

&lt;p&gt;The data looks like this (metadata.csv):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-csv&quot;&gt;file_name,ground_truth,model_output,category,character_error_rate,diacritics_expected,diacritics_produced
06_tonal_akwa.m4a,&quot;akwa, akwa, akwa. Akwà, akwà, akwà...&quot;,&quot;akua akua akua akua akwa akwa...&quot;,tonal_diacritics,0.583,12,3
09_tonal_flat.m4a,&quot;O na-eri oji n&apos;ututu&quot;,&quot;ọne rị ọjí nụ tútú&quot;,tonal_diacritics,0.744,0,7
...
&lt;/code&gt;&lt;/pre&gt;
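
&lt;p&gt;If you just want to poke at the numbers, a few lines of pandas are enough. This is a minimal sketch rather than code from the repo, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data/metadata.csv&lt;/code&gt; path is an assumption about the layout; the column names are the ones shown above:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pandas as pd

# Load the evaluation metadata (path assumed; point it at wherever metadata.csv lives)
df = pd.read_csv(&quot;data/metadata.csv&quot;)

# Per-category averages and diacritic totals, using the columns shown above
print(df.groupby(&quot;category&quot;)[&quot;character_error_rate&quot;].mean())
print(df.groupby(&quot;category&quot;)[[&quot;diacritics_expected&quot;, &quot;diacritics_produced&quot;]].sum())
&lt;/code&gt;&lt;/pre&gt;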

&lt;h2 id=&quot;model-inference&quot;&gt;Model Inference&lt;/h2&gt;

&lt;p&gt;I used omniASR’s official inference pipeline:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;omnilingual_asr.models.inference.pipeline&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ASRInferencePipeline&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ASRInferencePipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_card&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;omniASR_CTC_1B&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;transcription&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transcribe&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;inp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;data/audio/06_tonal_akwa.m4a&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;lang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ibo_Latn&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The model has 975 million parameters and uses a CTC-based architecture: a wav2vec2-style encoder followed by a CTC head. It was trained on multilingual data covering over 1,600 languages and released on November 14, 2025.&lt;/p&gt;

&lt;p&gt;For each audio file, I extracted:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;ground_truth&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;akwa, akwa, akwa. Akwà, akwà, akwà. Àkwà, àkwà, àkwà. Ákwá, ákwá, ákwá.&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model_output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transcription&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;transcription&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Compare and compute metrics
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;metrics&quot;&gt;Metrics&lt;/h2&gt;

&lt;p&gt;Standard Character Error Rate (CER) conflates spacing errors with tonal errors. I defined a custom metric:&lt;/p&gt;

&lt;h3 id=&quot;diacritic-error-rate-der&quot;&gt;Diacritic Error Rate (DER)&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;diacritic_error_rate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ground_truth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;E&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count_diacritics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ground_truth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# expected
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;P&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count_diacritics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# produced
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;E&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# dropped
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;H&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;E&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# hallucinated
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;H&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;E&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;E&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;count_diacritics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;diacritics&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;ụọịàèìòùáéíóúẹṣ&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;diacritics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
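
&lt;p&gt;To make the metric concrete, here is a tiny usage example on hypothetical strings (not drawn from the dataset):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Hypothetical strings, just to exercise the metric
gt  = &quot;ákwá&quot;   # 2 expected diacritics
out = &quot;akwa&quot;   # 0 produced: 2 dropped, 0 hallucinated
print(diacritic_error_rate(gt, out))  # (2 + 0) / 2 = 1.0
&lt;/code&gt;&lt;/pre&gt;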

&lt;p&gt;DER isolates tone-related failures:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Metric&lt;/th&gt;
      &lt;th&gt;Formula&lt;/th&gt;
      &lt;th&gt;What it captures&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;CER&lt;/td&gt;
      &lt;td&gt;Levenshtein distance / length&lt;/td&gt;
      &lt;td&gt;All character errors&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;RDD (Raw Drop Rate)&lt;/td&gt;
      &lt;td&gt;dropped / expected&lt;/td&gt;
      &lt;td&gt;Only missing tone marks&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DER&lt;/td&gt;
      &lt;td&gt;(dropped + hallucinated) / expected&lt;/td&gt;
      &lt;td&gt;Total tonal deviation&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Note that DER can exceed 100% when hallucinations are substantial, because the denominator reflects ground truth expectations, not produced output.&lt;/p&gt;
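
&lt;p&gt;A quick hypothetical illustration of that edge case:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Hypothetical counts: 2 expected, 5 produced
# dropped = max(0, 2 - 5) = 0, hallucinated = max(0, 5 - 2) = 3, so DER = 3 / 2 = 150%
print(diacritic_error_rate(&quot;àà&quot;, &quot;ááááá&quot;))  # 1.5
&lt;/code&gt;&lt;/pre&gt;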

&lt;h2 id=&quot;bootstrap-uncertainty&quot;&gt;Bootstrap Uncertainty&lt;/h2&gt;

&lt;p&gt;With N=21 samples, I needed to quantify uncertainty. I used bootstrap resampling:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;bootstrap_ci&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stat_fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_boot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ci&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.95&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seed&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;42&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;rng&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;default_rng&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Point estimate
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;point&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stat_fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Bootstrap resampling
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;boots&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;empty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_boot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_boot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rng&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;integers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;boots&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stat_fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Percentile CI
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ci&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;lo&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quantile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;boots&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;hi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quantile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;boots&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;point&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Bootstrap resampling occurs at the &lt;strong&gt;utterance level&lt;/strong&gt;, not event level. This matters because diacritic distribution is uneven across samples. Some files have 0 expected tone marks, others have 12. Resampling utterances captures this variability.&lt;/p&gt;
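
&lt;p&gt;As a sketch of how this gets called on the tonal category (the exact statistic function in the notebook may differ; the column names and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tonal_diacritics&lt;/code&gt; label come from metadata.csv, while the path is assumed):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import numpy as np   # used by bootstrap_ci above
import pandas as pd

df = pd.read_csv(&quot;data/metadata.csv&quot;)  # path assumed
tonal = df[df[&quot;category&quot;] == &quot;tonal_diacritics&quot;].reset_index(drop=True)

def mean_drop_rate(rows):
    # Mean of per-utterance drop rates; utterances with 0 expected diacritics are skipped
    rows = rows[rows[&quot;diacritics_expected&quot;] &gt; 0]
    dropped = (rows[&quot;diacritics_expected&quot;] - rows[&quot;diacritics_produced&quot;]).clip(lower=0)
    return (dropped / rows[&quot;diacritics_expected&quot;]).mean()

point, lo, hi = bootstrap_ci(tonal, mean_drop_rate)
print(f&quot;{point:.1%} (95% CI: [{lo:.1%}, {hi:.1%}])&quot;)
&lt;/code&gt;&lt;/pre&gt;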

&lt;p&gt;Example result:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Raw count: 30/49 = 61.2% drop rate&lt;/li&gt;
  &lt;li&gt;Bootstrap mean: 75.5%&lt;/li&gt;
  &lt;li&gt;95% CI: [57.1%, 89.7%]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bootstrap mean exceeds the raw percentage because resampling at utterance level gives more weight to samples with extreme loss rates. Both values are reported for transparency.&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Why Bootstrap Matters&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;With only 21 samples, we need uncertainty quantification. Bootstrap resampling (10,000 iterations) shows:
&lt;strong&gt;Worst-case lower bound:&lt;/strong&gt; 57.1%&lt;br /&gt;
&lt;strong&gt;Even pessimistically,&lt;/strong&gt; loss is still &amp;gt;50%&lt;br /&gt;
&lt;strong&gt;Unlikely&lt;/strong&gt; to be a small-sample fluke&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;h3 id=&quot;quantitative-summary&quot;&gt;Quantitative Summary&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Category&lt;/th&gt;
      &lt;th&gt;Samples&lt;/th&gt;
      &lt;th&gt;Diacritic Loss&lt;/th&gt;
      &lt;th&gt;Avg CER&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Phonemic Tone Sensitivity&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;75.5%&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;50.6%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Cross-lingual Interference&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;-38.9%&lt;/td&gt;
      &lt;td&gt;28.8%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Domain-Specific Coverage&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;6.3%&lt;/td&gt;
      &lt;td&gt;30.1%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Language Boundary Effects&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;14.3%&lt;/td&gt;
      &lt;td&gt;20.0%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Overall&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;21&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;26.8%&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;32.5%&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;bootstrap-confidence-intervals&quot;&gt;Bootstrap Confidence Intervals&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Tonal category:  75.5% (95% CI: [57.1%, 89.7%])
Overall:         52.6% (95% CI: [30.3%, 69.7%])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Even under the worst-case lower bound (57.1%), tonal diacritic loss remains severe.&lt;/p&gt;

&lt;h3 id=&quot;visualizations&quot;&gt;Visualizations&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/omniASR/fig1_loss_by_category.png&quot; alt=&quot;loss by category&quot; /&gt;
Bar chart showing 61.2% raw count loss for tonal category (red), with negative values indicating diacritic hallucination (script interference).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/omniASR/fig2_cer_vs_diacritic_loss.png&quot; alt=&quot;char error rate vs diacritic loss&quot; /&gt;
Scatter plot showing tonal samples (red) have high diacritic loss even when CER is moderate.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/omniASR/fig3_bootstrap_ci.png&quot; alt=&quot;bootstrap confidence interval&quot; /&gt;
Forest plot showing 95% CIs for each category, with 50% threshold line.&lt;/p&gt;

&lt;h2 id=&quot;example-tonal-minimal-pairs&quot;&gt;Example: Tonal Minimal Pairs&lt;/h2&gt;

&lt;p&gt;File 06 is the clearest demonstration:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Input (what I said):
&quot;akwa, akwa, akwa. Akwà, akwà, akwà. Àkwà, àkwà, àkwà. Ákwá, ákwá, ákwá.&quot;

Model output:
&quot;akua akua akua akua akwa akwa akwa akua akwa ọkua ọkua ọkua&quot;

Expected diacritics: 12
Produced diacritics: 3
Loss rate: 75%
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The four distinct words collapsed into random variations. From a linguistic perspective, this is catastrophic. The word akwà meaning cloth got transcribed as akwa, which could mean crying instead. The word àkwà meaning egg got transcribed as akwa, and the meaning is completely lost. The word ákwá meaning bridge got transcribed as akua, which is wrong in both spelling and tone.&lt;/p&gt;

&lt;h2 id=&quot;the-monotone-test&quot;&gt;The Monotone Test&lt;/h2&gt;

&lt;p&gt;File 09 is my favorite diagnostic. Setup:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Spoke “O na-eri oji n’ututu” (He eats kolanut in the morning)&lt;/li&gt;
  &lt;li&gt;Deliberately flat intonation, like a robot&lt;/li&gt;
  &lt;li&gt;Zero tonal variation in the audio&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the model uses acoustic information to place diacritics, it should produce few or no tone marks on flat speech. Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Ground truth: &quot;O na-eri oji n&apos;ututu&quot;  (0 diacritics)
Model output: &quot;ọne rị ọjí nụ tútú&quot;    (7 diacritics)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The model ADDED tone marks I never spoke. This is clear evidence of orthographic bias over acoustic conditioning. The model is using statistical patterns from training data to guess where diacritics should go, not listening to the audio.&lt;/p&gt;

&lt;h2 id=&quot;statistical-analysis&quot;&gt;Statistical Analysis&lt;/h2&gt;

&lt;h3 id=&quot;hypothesis-testing&quot;&gt;Hypothesis Testing&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Null hypothesis (H0): Diacritic loss in tonal category ≤ other categories  
Alternative (H1): Tonal category shows higher loss

Test: Bootstrap confidence intervals (10,000 iterations, 95% CI)

Result: Tonal bootstrap mean (75.5%) substantially exceeds all other categories (highest alternative: 38.9%, for script hallucination); the point estimate is nearly 2x the next closest category.

Conclusion: Tonal degradation shows the highest loss rate of any category. The confidence intervals overlap somewhat with script hallucination because of the small sample size (N=21), but the effect size is large and consistent across resamples.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;robustness-check&quot;&gt;Robustness Check&lt;/h3&gt;

&lt;p&gt;Even under worst-case assumptions using the lower bound of the confidence interval, tonal loss remains at 57.1%, which is still greater than 50%. Overall loss stays at 30.3%, which is still substantial. This suggests the observed tonal degradation is unlikely to be driven solely by sampling variability.&lt;/p&gt;

&lt;h2 id=&quot;code&quot;&gt;Code&lt;/h2&gt;

&lt;p&gt;The full analysis is in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;analysis.ipynb&lt;/code&gt;. The core evaluation functions handle diacritic counting, character error rate calculation, and bootstrap resampling. Diacritic counting uses a set of Igbo tone mark characters and counts occurrences in the text. Character error rate is computed using Python’s SequenceMatcher for character-level similarity. Bootstrap resampling runs 10,000 iterations on the tonal diacritics category to compute confidence intervals.&lt;/p&gt;
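
&lt;p&gt;For reference, a character error rate built on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SequenceMatcher&lt;/code&gt; can be as small as this sketch (the exact function in the repo may differ):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from difflib import SequenceMatcher

def character_error_rate(ground_truth, model_output):
    # 1 minus the similarity ratio, as a character-level error proxy
    return 1.0 - SequenceMatcher(None, ground_truth, model_output).ratio()

print(character_error_rate(&quot;akwà&quot;, &quot;akwa&quot;))  # 0.25: one character out of four differs
&lt;/code&gt;&lt;/pre&gt;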

&lt;p&gt;All evaluation code is organized in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/&lt;/code&gt; directory. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;evaluate.py&lt;/code&gt; module contains metrics like DER and bootstrap confidence intervals. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;visualize.py&lt;/code&gt; module has plotting functions for all three figures. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;utils.py&lt;/code&gt; module handles data loading and validation.&lt;/p&gt;

&lt;h2 id=&quot;run-it&quot;&gt;Run it&lt;/h2&gt;

&lt;p&gt;Clone the repository and reproduce:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git clone https://github.com/chizkidd/igbo-asr-tonal-evaluation.git
&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;igbo-asr-tonal-evaluation
pip &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; requirements.txt
jupyter notebook analysis.ipynb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Or run in Google Colab: 
&lt;a href=&quot;https://colab.research.google.com/github/chizkidd/igbo-asr-tonal-evaluation/blob/main/analysis.ipynb&quot;&gt;&lt;img src=&quot;https://colab.research.google.com/assets/colab-badge.svg&quot; alt=&quot;Open In Colab&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The notebook takes about 5-10 minutes to run on Colab with a T4 GPU. You’ll see the analysis output:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Loading metadata...
Total samples: 21
Categories: 4

Computing metrics...
Overall DER: 26.8%
  Tonal category: 75.5%
  Script interference: -38.9%
  Code-switching: 14.3%
  Domain-specific: 6.3%

Bootstrap resampling (10,000 iterations)...
Tonal diacritics: 75.5% [57.1%, 89.7%]
Overall: 52.6% [30.3%, 69.7%]

Generating visualizations...
Saved: results/visualizations/fig1_loss_by_category.png
Saved: results/visualizations/fig2_cer_vs_loss.png
Saved: results/visualizations/fig3_bootstrap_ci.png
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Reproducibility&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;&lt;strong&gt;Model:&lt;/strong&gt; omniASR-CTC-1B (975M params)&lt;br /&gt;
&lt;strong&gt;Data:&lt;/strong&gt; 21 samples, 4 categories&lt;br /&gt;
&lt;strong&gt;Metrics:&lt;/strong&gt; Custom DER (Diacritic Error Rate)&lt;br /&gt;
&lt;strong&gt;Stats:&lt;/strong&gt; Bootstrap with utterance-level resampling&lt;br /&gt;
&lt;strong&gt;Code:&lt;/strong&gt; github.com/chizkidd/igbo-asr-tonal-evaluation&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;scope-and-limitations&quot;&gt;Scope and Limitations&lt;/h2&gt;

&lt;p&gt;This study demonstrates three things. First, systematic diacritic loss in omniASR on Igbo across 21 controlled samples. Second, failure to preserve tonal minimal pairs in this evaluation setup. Third, diacritic hallucination on monotone speech, which is evidence of orthographic bias.&lt;/p&gt;

&lt;p&gt;This study does not claim four things. It doesn’t claim universal failure on all Igbo speech. It doesn’t claim that tone modeling is architecturally absent from the model. It doesn’t claim that Igbo is uniquely disadvantaged compared to all other low-resource languages. And it doesn’t claim that the observed error rates generalize to all dialects or all speakers.&lt;/p&gt;

&lt;p&gt;What would strengthen these claims? Multi-speaker evaluation with 10+ speakers across different dialects. Acoustic analysis with F0 contour extraction and pitch tracking validation. Comparative evaluation on other models like Whisper, MMS, USM, and Azure Speech. And controlled resynthesis experiments that isolate acoustic factors from lexical priors.&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Future Work&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;&lt;strong&gt;Current:&lt;/strong&gt; Single speaker, 21 samples (proof-of-concept)&lt;br /&gt;
&lt;strong&gt;Next:&lt;/strong&gt; 200 samples, 10+ speakers, 5 dialects&lt;br /&gt;
&lt;strong&gt;Then:&lt;/strong&gt; Comparative evaluation (Whisper, MMS, Azure)&lt;br /&gt;
&lt;strong&gt;Finally:&lt;/strong&gt; Fine-tuning intervention with tone-annotated data&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;real-production-systems&quot;&gt;Real Production Systems&lt;/h2&gt;

&lt;p&gt;Between this evaluation and a production-grade ASR fairness audit, there is a long list of things that change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data.&lt;/strong&gt; Instead of 21 samples, production evaluations use thousands of hours across multiple speakers, dialects, ages, and recording conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speakers.&lt;/strong&gt; Instead of single-speaker, you need balanced sampling across: dialects (Owerri, Onitsha, Enugu, Nsukka, Afikpo), gender, age ranges, native vs. L2 speakers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acoustic analysis.&lt;/strong&gt; Instead of just comparing transcriptions, you need F0 (fundamental frequency) tracking to verify what’s actually in the audio. Praat or similar tools extract pitch contours frame-by-frame.&lt;/p&gt;
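
&lt;p&gt;If you want to sanity-check the acoustics yourself, frame-level pitch tracking is only a few lines. This sketch uses librosa’s pyin rather than Praat and is not part of this evaluation; the file path comes from the dataset, and loading .m4a may require ffmpeg:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import numpy as np
import librosa

# The monotone-test recording from the dataset
y, sr = librosa.load(&quot;data/audio/09_tonal_flat.m4a&quot;, sr=16000)

# Frame-level fundamental frequency (F0) with pyin
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz(&quot;C2&quot;), fmax=librosa.note_to_hz(&quot;C7&quot;), sr=sr
)

# Deliberately flat speech should show a narrow F0 range over the voiced frames
voiced_f0 = f0[voiced_flag]
print(f&quot;F0 range: {np.nanmin(voiced_f0):.1f}-{np.nanmax(voiced_f0):.1f} Hz&quot;)
&lt;/code&gt;&lt;/pre&gt;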

&lt;p&gt;&lt;strong&gt;Comparative evaluation.&lt;/strong&gt; Instead of one model, you audit multiple: Whisper (OpenAI), MMS (Meta), USM (Google), Azure Speech (Microsoft). This isolates whether the problem is specific to omniASR or universal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning experiments.&lt;/strong&gt; You collect tone-annotated Igbo data (50-100 hours), fine-tune the model, and measure pre/post accuracy. This tests whether the problem is architectural or just data scarcity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world deployment.&lt;/strong&gt; You partner with Nigerian developers building voice assistants and measure downstream impact: do users trust ASR that strips tones? Does it affect adoption?&lt;/p&gt;

&lt;p&gt;All of these are important, but if you understand this 21-sample evaluation, you understand the diagnostic methodology.&lt;/p&gt;

&lt;h2 id=&quot;faq&quot;&gt;FAQ&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why only 21 samples?&lt;/strong&gt; This is a proof-of-concept for blind spot discovery. Large datasets measure prevalence; small targeted datasets reveal failure modes. I prioritized depth (systematic coverage of error types) over breadth (statistical power).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is 75.5% loss generalizable?&lt;/strong&gt; Not necessarily. This is the loss rate on my voice, my dialect, my recording setup, for these specific test cases. Multi-speaker evaluation would give population estimates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not use Word Error Rate?&lt;/strong&gt; WER measures whole-word accuracy. In Igbo, “akwa” vs “akwà” counts as correct by WER (same word, different tone), but semantically these are different words. Diacritic-specific metrics capture what WER misses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does the model “understand” Igbo?&lt;/strong&gt; That’s philosophical. Mechanically: it learned statistical patterns from training data. Whether assigning probability distributions to tokens constitutes “understanding” is up to you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does the bootstrap mean exceed the raw percentage?&lt;/strong&gt; Bootstrap resamples at utterance level. Samples with extreme loss rates (e.g., file 09 with 0 expected, 7 hallucinated) get resampled more in some iterations, pulling the mean up. This reflects uncertainty about which utterances are “typical.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s next?&lt;/strong&gt; Collect a 200-sample multi-speaker dataset across 5 Igbo dialects. After that: comparative model evaluation (Whisper vs MMS vs omniASR) and fine-tuning experiments with tone-annotated data.&lt;/p&gt;

&lt;h2 id=&quot;why-this-matters&quot;&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;There’s a tendency in ML to treat “supporting” a language as a checkbox. Add it to the model card, ship it. But Igbo has 45 million speakers. When ASR systems strip tone marks, they normalize a version of the language that doesn’t preserve meaning.&lt;/p&gt;

&lt;p&gt;If every voice interface does this, what happens to how people write Igbo? Do they internalize that tone marks are optional because the AI doesn’t use them? I don’t know, but these are questions worth asking before claiming to “support” 1,600+ languages.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;p&gt;The dataset is available on &lt;a href=&quot;https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots&quot;&gt;HuggingFace&lt;/a&gt;. The code is on &lt;a href=&quot;https://github.com/chizkidd/igbo-asr-tonal-evaluation&quot;&gt;GitHub&lt;/a&gt;. The model evaluated is &lt;a href=&quot;https://huggingface.co/facebook/omniASR-CTC-1B&quot;&gt;facebook/omniASR-CTC-1B&lt;/a&gt; on HuggingFace. The dataset is licensed under CC-BY-4.0 and the code under MIT. Feel free to use it, cite it, and build on it.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this evaluation helpful, please consider citing it:&lt;/p&gt;
&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026tonalevaluation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Tonal Fidelity in Multilingual ASR: A Diagnostic Evaluation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Mar&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/03/01/tonal-fidelity-diagnostic-evaluation/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the dataset:&lt;/p&gt;
&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@misc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026igbodataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Igbo Blind Spot Dataset for omniASR-CTC-1B: Systematic Evaluation of Tonal Diacritic Loss}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Obasi, Chizoba}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{2026}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;publisher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{HuggingFace}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;howpublished&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{\url{https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots}}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;note&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Model evaluated: facebook/omniASR-CTC-1B (975M parameters)}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Sun, 01 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/03/01/tonal-fidelity-multilingual-asr/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/03/01/tonal-fidelity-multilingual-asr/</guid>
        
        
      </item>
    
  </channel>
</rss>
