<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Chizoba Obasi blog</title>
    <description>Exploring Deep Learning.</description>
    <link>https://chizkidd.github.io//</link>
    <atom:link href="https://chizkidd.github.io//feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Fri, 08 May 2026 19:08:40 +0000</pubDate>
    <lastBuildDate>Fri, 08 May 2026 19:08:40 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Sutton &amp; Barto, Ch. 13: Policy Gradient Methods (Personal Notes)</title>
        <description>&lt;ul&gt;
  &lt;li&gt;Almost all the algorithms/methods covered so far have been &lt;strong&gt;action-value methods&lt;/strong&gt; (except gradient-bandit algorithms, &lt;a href=&quot;https://chizkidd.github.io/RL-Sutton-Barto-notes/chapters/ch02-multi-armed-bandits.html#sec-ch02-2-8&quot;&gt;Section 2.8&lt;/a&gt;).&lt;/li&gt;
  &lt;li&gt;Action-value methods learn the values of actions and then select actions based on those estimated action values.&lt;/li&gt;
  &lt;li&gt;Here, we explicitly learn a &lt;strong&gt;parametrized policy&lt;/strong&gt; that can select actions without consulting a value function.&lt;/li&gt;
  &lt;li&gt;A value function is not required for action selection, but still could be used to learn the policy parameters $\boldsymbol{\theta}$.&lt;/li&gt;
  &lt;li&gt;The parametrized policy $\pi(a \vert s, \boldsymbol{\theta})$ is now the probability that action $a$ is taken at time $t$ given that the environment is in state $s$ at time $t$ with parameter $\boldsymbol{\theta} \in \mathbb{R}^{d&apos;}$:&lt;/li&gt;
&lt;/ul&gt;

\[\pi(a \vert s, \boldsymbol{\theta}) = \Pr\{A_t = a \vert S_t = s,\ \boldsymbol{\theta}_t = \boldsymbol{\theta}\}\]

&lt;ul&gt;
  &lt;li&gt;We will consider methods for learning $\boldsymbol{\theta}$ based on the gradient of some scalar performance measure $J(\boldsymbol{\theta})$ w.r.t. $\boldsymbol{\theta}$. The goal is to &lt;strong&gt;maximize&lt;/strong&gt; performance, hence the use of gradient ascent:&lt;/li&gt;
&lt;/ul&gt;

\[\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \widehat{\nabla J(\boldsymbol{\theta}_t)}\]

\[\begin{aligned}
\text{where} \\
\widehat{\nabla J(\boldsymbol{\theta}_t)} &amp;amp;\in \mathbb{R}^{d&apos;} \equiv \text{a stochastic estimate whose expectation approximates} \\
&amp;amp;\phantom{{}\equiv{}} \text{the gradient of the performance measure } J \text{ w.r.t. } \boldsymbol{\theta}_t
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;This general methodology applies to &lt;strong&gt;policy gradient methods&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Methods that learn approximations to both policy &amp;amp; value functions are often called &lt;strong&gt;Actor-Critic methods&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Actor&lt;/strong&gt; refers to the learned policy.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Critic&lt;/strong&gt; refers to the learned value function (state-value function).&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#131-policy-approximation--its-advantages&quot;&gt;13.1 Policy Approximation &amp;amp; Its Advantages&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#132-the-policy-gradient-theorem&quot;&gt;13.2 The Policy Gradient Theorem&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#133-reinforce-monte-carlo-policy-gradient&quot;&gt;13.3 REINFORCE: Monte Carlo Policy Gradient&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#134-reinforce-with-baseline&quot;&gt;13.4 REINFORCE with Baseline&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#135-actor-critic-methods&quot;&gt;13.5 Actor-Critic Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#136-policy-gradient-for-continuing-problems&quot;&gt;13.6 Policy Gradient for Continuing Problems&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#137-policy-parametrization-for-continuous-actions&quot;&gt;13.7 Policy Parametrization for Continuous Actions&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#138-summary&quot;&gt;13.8 Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;131-policy-approximation--its-advantages&quot;&gt;13.1 Policy Approximation &amp;amp; Its Advantages&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;In policy gradient methods, the policy can be parametrized in any way, as long as $\pi(a \vert s, \boldsymbol{\theta})$ is differentiable w.r.t. $\boldsymbol{\theta}$ (essentially the partial derivatives’ column vector exists and is finite):&lt;/li&gt;
&lt;/ul&gt;

\[\nabla \pi(a \vert s, \boldsymbol{\theta}) \text{ exists and is finite for all } s \in S,\ a \in A(s),\ \boldsymbol{\theta} \in \mathbb{R}^{d&apos;}\]

&lt;ul&gt;
  &lt;li&gt;If the action-space is &lt;strong&gt;discrete&lt;/strong&gt; and not too large, then a natural &amp;amp; common kind of parametrization is to form parametrized &lt;strong&gt;numerical preferences&lt;/strong&gt; $h(s, a, \boldsymbol{\theta}) \in \mathbb{R}$ for each $(s, a)$ pair.&lt;/li&gt;
  &lt;li&gt;The actions with the highest preferences in each state are given the highest probabilities of being selected, for example, according to an exponential soft-max distribution:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\pi(a \vert s, \boldsymbol{\theta}) \doteq \frac{e^{h(s,a,\boldsymbol{\theta})}}{\sum_b e^{h(s,b,\boldsymbol{\theta})}}}\]

&lt;ul&gt;
  &lt;li&gt;This kind of policy parametrization is called &lt;strong&gt;softmax in action preferences&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;The action preferences can be parametrized arbitrarily; for example, they can be computed by a deep artificial neural network (ANN) or could simply be &lt;strong&gt;linear&lt;/strong&gt; in features:&lt;/li&gt;
&lt;/ul&gt;

\[h(s, a, \boldsymbol{\theta}) = \boldsymbol{\theta}^T \mathbf{x}(s, a)\]
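
&lt;p&gt;As a quick illustration (my own sketch, not from the book), here is a minimal NumPy implementation of a softmax-in-action-preferences policy with linear preferences; the one-hot feature function &lt;code&gt;x(s, a)&lt;/code&gt; and the tiny problem sizes are made-up placeholders.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def softmax_policy(theta, x, s, actions):
    # h(s, a, theta) = theta^T x(s, a) for each action (linear preferences)
    prefs = np.array([theta @ x(s, a) for a in actions])
    prefs = prefs - prefs.max()                      # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()      # exponential softmax over preferences
    return probs

# hypothetical one-hot (state, action) features for a tiny 3-state, 2-action problem
def x(s, a, n_states=3, n_actions=2):
    feat = np.zeros(n_states * n_actions)
    feat[s * n_actions + a] = 1.0
    return feat

theta = np.zeros(6)
print(softmax_policy(theta, x, s=0, actions=[0, 1]))  # uniform while theta = 0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;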

&lt;p&gt;&lt;strong&gt;Advantages of softmax in action preferences policy parametrization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The approx. policy can approach a &lt;strong&gt;deterministic policy&lt;/strong&gt;, unlike $\varepsilon$-greedy action selection.&lt;/li&gt;
  &lt;li&gt;It enables the selection of actions with &lt;strong&gt;arbitrary probabilities&lt;/strong&gt;, which is useful for problems in which the best approximate policy is stochastic (e.g., under significant function approximation).
    &lt;ul&gt;
      &lt;li&gt;E.g. useful in environments with imperfect information (e.g., card games) where it is optimal to act stochastically, such as when &lt;strong&gt;bluffing in Poker&lt;/strong&gt;; it is important to do so randomly to unnerve/confuse an opponent.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The parametrized policy may be a &lt;strong&gt;simpler function to approximate&lt;/strong&gt; than the action-value function.&lt;/li&gt;
  &lt;li&gt;Often the most important reason for using a policy-based learning method is that it is a good way to &lt;strong&gt;inject prior knowledge&lt;/strong&gt; about the desired form of the policy into the RL system.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;132-the-policy-gradient-theorem&quot;&gt;13.2 The Policy Gradient Theorem&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Policy-gradient methods have stronger convergence guarantees than action-value methods because the action probabilities change &lt;strong&gt;smoothly&lt;/strong&gt; as a function of the learned parameter, which is what gradient ascent relies on.&lt;/li&gt;
  &lt;li&gt;Let’s consider the &lt;strong&gt;episodic&lt;/strong&gt; performance measure, which is the value of the start state of the episode:&lt;/li&gt;
&lt;/ul&gt;

\[J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0)\]

\[\begin{aligned}
\text{where} \\
v_{\pi_{\boldsymbol{\theta}}} &amp;amp;\equiv \text{the true value function for } \pi_{\boldsymbol{\theta}}\text{, the policy determined by } \boldsymbol{\theta} \\
s_0 &amp;amp;\equiv \text{some non-random state}
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;How can we estimate the performance gradient w.r.t. the policy parameter when the gradient depends on the unknown effect of policy changes on the state distribution?
    &lt;ul&gt;
      &lt;li&gt;The theoretical answer is the &lt;strong&gt;policy gradient theorem&lt;/strong&gt;.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The policy gradient theorem provides an analytic expression for the performance gradient w.r.t. the policy parameter; for the episodic case it establishes that:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla\pi(a \vert s, \boldsymbol{\theta})}\]

\[\begin{aligned}
  \text{where} \quad &amp;amp;\\
  \nabla J(\boldsymbol{\theta}) &amp;amp;\equiv \text{column vector of partial derivatives w.r.t. the components of } \boldsymbol{\theta} \\
  \mu &amp;amp;\equiv \text{the on-policy distribution under the policy } \pi \\
  &amp;amp;\phantom{{}\equiv{}} \text{(in the episodic case, proportionality constant is the average length of an episode;} \\
  &amp;amp;\phantom{{}\equiv{}} \text{in the continuing case it is 1)}
  \end{aligned}\]

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Proof: Policy Gradient Theorem (Episodic Case)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;Let’s prove the policy gradient theorem from first principles using elementary calculus.&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; To keep the notation simple, we leave it implicit in all cases that $\pi = f(\boldsymbol{\theta})$ and that all gradients $\nabla[\cdot]$ are taken w.r.t. $\boldsymbol{\theta}$.&lt;/p&gt;

\[\begin{align*}
\nabla v_\pi(s) &amp;amp;= \nabla \!\left[\sum_a \pi(a \vert s)\, q_\pi(s,a)\right], \quad \text{for all } s \in \mathcal{S} \\
&amp;amp;= \sum_a \!\Biggl[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s)\, \nabla q_\pi(s,a)\Biggr] \\
&amp;amp;= \sum_a \!\left[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s)\, \nabla \sum_{s&apos;,r} p(s&apos;,r \vert s,a)\!\left(r + v_\pi(s&apos;)\right)\right] \\
&amp;amp;= \sum_a \!\left[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s) \sum_{s&apos;} p(s&apos; \vert s,a)\, \nabla v_\pi(s&apos;)\right] \\
&amp;amp;= \sum_a \Biggl[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s) \sum_{s&apos;} p(s&apos; \vert s,a) \sum_{a&apos;} \Biggl[\nabla\pi(a&apos; \vert s&apos;)\, q_\pi(s&apos;,a&apos;) \\
&amp;amp;\qquad\qquad + \pi(a&apos; \vert s&apos;) \sum_{s&apos;&apos;} p(s&apos;&apos; \vert s&apos;,a&apos;)\, \nabla v_\pi(s&apos;&apos;)\Biggr]\Biggr]
\end{align*}\]

&lt;p&gt;Unrolling this recursion:&lt;/p&gt;

\[\boxed{\nabla v_\pi(s) = \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla\pi(a \vert x)\, q_\pi(x,a)}\]

\[\begin{aligned}
\Pr(s \to x, k, \pi) &amp;amp;\equiv \text{probability of transitioning from state } s \text{ to state } x \text{ in } k \text{ steps under policy } \pi
\end{aligned}\]

&lt;p&gt;Then:&lt;/p&gt;

\[\begin{align*}
\nabla J(\boldsymbol{\theta}) &amp;amp;= \nabla v_\pi(s_0) \\
&amp;amp;= \sum_s \!\left(\sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi)\right) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a) \\
&amp;amp;= \sum_s \eta(s) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a) \\
&amp;amp;= \sum_{s&apos;} \eta(s&apos;) \sum_s \frac{\eta(s)}{\sum_{s&apos;} \eta(s&apos;)} \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a) \\
&amp;amp;= \sum_{s&apos;} \eta(s&apos;) \sum_s \mu(s) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a)
\end{align*}\]

\[\boxed{\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a)}\]

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;133-reinforce-monte-carlo-policy-gradient&quot;&gt;13.3 REINFORCE: Monte Carlo Policy Gradient&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;REINFORCE is our first policy gradient algorithm.&lt;/li&gt;
  &lt;li&gt;The goal/strategy is to find a way to get samples such that the expectation of the sample gradient is proportional to the actual performance gradient as a function of the parameters.&lt;/li&gt;
  &lt;li&gt;The sample gradients need to only be proportional to the performance gradient because any proportionality constant can be absorbed into the step size $\alpha$.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The policy gradient theorem gives &lt;strong&gt;an exact expression proportional to the gradient&lt;/strong&gt;; so all that is needed is a way of sampling whose expectation equals or approximates this expression.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Recall the RHS of the policy gradient theorem is a sum over states weighted by how often the states occur under the target policy $\pi$:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\nabla J(\boldsymbol{\theta}) &amp;amp;\propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla\pi(a \vert s, \boldsymbol{\theta}) \\
&amp;amp;= \mathbb{E}_\pi \!\left[\sum_a q_\pi(S_t, a)\, \nabla\pi(a \vert S_t, \boldsymbol{\theta})\right]
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;We get REINFORCE by replacing the sum over the random variable’s possible values by an expectation under $\pi$, and then sampling the expectation:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\nabla J(\boldsymbol{\theta}) &amp;amp;\propto \mathbb{E}_\pi \!\left[\sum_a \pi(a \vert S_t, \boldsymbol{\theta})\, q_\pi(S_t, a)\, \frac{\nabla\pi(a \vert S_t, \boldsymbol{\theta})}{\pi(a \vert S_t, \boldsymbol{\theta})}\right] \\
&amp;amp;= \mathbb{E}_\pi \!\left[q_\pi(S_t, A_t)\, \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta})}{\pi(A_t \vert S_t, \boldsymbol{\theta})}\right] \quad \text{(replacing } a \text{ by the sample } A_t \sim \pi\text{)} \\
&amp;amp;= \mathbb{E}_\pi \!\left[G_t\, \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta})}{\pi(A_t \vert S_t, \boldsymbol{\theta})}\right] \quad \text{(because } \mathbb{E}_\pi[G_t \vert S_t, A_t] = q_\pi(S_t, A_t)\text{)}
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The last expression is the required expression; a quantity that can be sampled on each time step whose expectation is proportional to the gradient. This leads to the &lt;strong&gt;REINFORCE update&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha G_t \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}{\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}}\]

&lt;ul&gt;
  &lt;li&gt;This update is intuitively coherent:
    &lt;ul&gt;
      &lt;li&gt;The gradient term is the direction in parameter space that most increases the probability of taking action $A_t$ again on future visits to state $S_t$.&lt;/li&gt;
      &lt;li&gt;The update is proportional to the return $G_t$, so it moves the parameter most in directions that favour actions yielding high returns.&lt;/li&gt;
      &lt;li&gt;The update is inversely proportional to the action probability $\pi(A_t \vert S_t, \boldsymbol{\theta}_t)$, which prevents frequently selected actions from gaining an advantage simply because they are updated more often.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;REINFORCE has good theoretical convergence properties, but as a Monte Carlo method it may suffer from high variance and hence slow learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;reinforce-monte-carlo-policy-gradient-control-episodic-for-pi_&quot;&gt;REINFORCE: Monte Carlo Policy Gradient Control (Episodic) for $\pi_*$&lt;/h4&gt;

\[\boxed{
\begin{aligned}
&amp;amp;\textbf{Input: } \text{a differentiable policy parametrization } \pi(a \vert s, \boldsymbol{\theta}) \\
&amp;amp;\textbf{Algorithm parameter: } \text{step size } \alpha &amp;gt; 0 \\
&amp;amp;\textbf{Initialize } \text{policy parameter } \boldsymbol{\theta} \in \mathbb{R}^{d&apos;} \text{ (e.g., to } \mathbf{0}\text{)} \\
&amp;amp;\textbf{Loop forever } \text{(for each episode):} \\
&amp;amp;\quad \text{Generate an episode } S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T \text{ following } \pi(\cdot \vert \cdot, \boldsymbol{\theta}) \\
&amp;amp;\quad \textbf{Loop for each step of the episode } t = 0, 1, 2, \ldots, T-1\text{:} \\
&amp;amp;\qquad G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k \\
&amp;amp;\qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\gamma^t G\, \nabla \ln \pi(A_t \vert S_t, \boldsymbol{\theta})
\end{aligned}
}\]
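
&lt;p&gt;A minimal NumPy sketch of this update loop (not the book’s code), reusing &lt;code&gt;softmax_policy&lt;/code&gt; and the feature function &lt;code&gt;x&lt;/code&gt; from the earlier sketch; the &lt;code&gt;env&lt;/code&gt; object with &lt;code&gt;reset()&lt;/code&gt; and &lt;code&gt;step(a)&lt;/code&gt; returning &lt;code&gt;(next_state, reward, done)&lt;/code&gt; is an assumed placeholder interface.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def grad_log_pi(theta, x, s, a, actions):
    # grad ln pi(a|s,theta) = x(s,a) - sum_b pi(b|s,theta) x(s,b)   (softmax-linear policy)
    probs = softmax_policy(theta, x, s, actions)
    expected_feat = sum(p * x(s, b) for p, b in zip(probs, actions))
    return x(s, a) - expected_feat

def reinforce_episode(theta, env, x, actions, alpha=1e-3, gamma=0.99):
    # generate an episode S0, A0, R1, ..., R_T following pi(.|., theta)
    states, acts, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = np.random.choice(actions, p=softmax_policy(theta, x, s, actions))
        s_next, r, done = env.step(a)
        states.append(s); acts.append(a); rewards.append(r)
        s = s_next
    # loop over the episode: G is the return from step t; the update uses alpha * gamma^t * G
    T = len(rewards)
    for t in range(T):
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        theta = theta + alpha * gamma ** t * G * grad_log_pi(theta, x, states[t], acts[t], actions)
    return theta
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;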

&lt;hr /&gt;

&lt;h2 id=&quot;134-reinforce-with-baseline&quot;&gt;13.4 REINFORCE with Baseline&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary &lt;strong&gt;baseline&lt;/strong&gt; $b(s)$:&lt;/li&gt;
&lt;/ul&gt;

\[\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a \!\left[q_\pi(s,a) - b(s)\right] \nabla\pi(a \vert s, \boldsymbol{\theta})\]

&lt;ul&gt;
  &lt;li&gt;The baseline can be any function that is not dependent on the action $a$:&lt;/li&gt;
&lt;/ul&gt;

\[\sum_a b(s)\, \nabla\pi(a \vert s, \boldsymbol{\theta}) = b(s)\, \nabla \sum_a \pi(a \vert s, \boldsymbol{\theta}) = b(s)\, \nabla 1 = 0\]

&lt;ul&gt;
  &lt;li&gt;Now the REINFORCE update with baseline is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha \!\left[G_t - b(S_t)\right] \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}{\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}}\]

&lt;ul&gt;
  &lt;li&gt;The baseline leaves the expected value of the update unchanged but can substantially &lt;strong&gt;reduce its variance&lt;/strong&gt;, which speeds up learning.&lt;/li&gt;
  &lt;li&gt;One natural choice for the baseline is an estimate of the state value $\hat{v}(S_t, \mathbf{w})$.&lt;/li&gt;
  &lt;li&gt;Since REINFORCE is a Monte Carlo method for learning the policy parameter $\boldsymbol{\theta}$, it’s natural to also use a Monte Carlo method to learn the state-value weights $\mathbf{w}$.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;reinforce-with-baseline-episodic-for-estimating-pi_boldsymboltheta-approx-pi_&quot;&gt;REINFORCE with Baseline (Episodic), for estimating $\pi_{\boldsymbol{\theta}} \approx \pi_*$&lt;/h4&gt;

\[\boxed{
\begin{aligned}
&amp;amp;\textbf{Input: } \text{a differentiable policy parametrization } \pi(a \vert s, \boldsymbol{\theta}) \\
&amp;amp;\textbf{Input: } \text{a differentiable state-value function parametrization } \hat{v}(s, \mathbf{w}) \\
&amp;amp;\textbf{Algorithm parameters: } \text{step sizes } \alpha^{\boldsymbol{\theta}} &amp;gt; 0,\ \alpha^{\mathbf{w}} &amp;gt; 0 \\
&amp;amp;\textbf{Initialize } \text{policy parameter } \boldsymbol{\theta} \in \mathbb{R}^{d&apos;} \text{ and state-value weights } \mathbf{w} \in \mathbb{R}^d \text{ (e.g., to } \mathbf{0}\text{)} \\
&amp;amp;\textbf{Loop forever } \text{(for each episode):} \\
&amp;amp;\quad \text{Generate an episode } S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T \text{ following } \pi(\cdot \vert \cdot, \boldsymbol{\theta}) \\
&amp;amp;\quad \textbf{Loop for each step of the episode } t = 0, 1, \ldots, T-1\text{:} \\
&amp;amp;\qquad G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k \\
&amp;amp;\qquad \delta \leftarrow G - \hat{v}(S_t, \mathbf{w}) \\
&amp;amp;\qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \nabla \hat{v}(S_t, \mathbf{w}) \\
&amp;amp;\qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}} \gamma^t\, \delta\, \nabla \ln \pi(A_t \vert S_t, \boldsymbol{\theta})
\end{aligned}
}\]
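
&lt;p&gt;A sketch of the corresponding inner-loop updates with a linear state-value baseline $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}_s(s)$, assuming an episode has already been generated as in the REINFORCE sketch above; the state-feature function &lt;code&gt;x_s&lt;/code&gt; is a hypothetical placeholder.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def reinforce_baseline_updates(theta, w, states, acts, rewards, x, x_s, actions,
                               alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    # one pass over a generated episode; v_hat(s, w) = w^T x_s(s) is a linear baseline
    T = len(rewards)
    for t in range(T):
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        delta = G - w @ x_s(states[t])                    # return minus baseline
        w = w + alpha_w * delta * x_s(states[t])          # grad v_hat = x_s(s) in the linear case
        theta = theta + alpha_theta * gamma ** t * delta * grad_log_pi(theta, x, states[t], acts[t], actions)
    return theta, w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;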

&lt;ul&gt;
  &lt;li&gt;This algorithm has 2 step sizes $\alpha^{\boldsymbol{\theta}}$ and $\alpha^{\mathbf{w}}$.&lt;/li&gt;
  &lt;li&gt;Choosing $\alpha^{\mathbf{w}}$ is relatively easy; in the linear case a good rule of thumb is (see &lt;a href=&quot;https://chizkidd.github.io/RL-Sutton-Barto-notes/chapters/ch09-on-policy-prediction-approximation.html#sec-ch09-9-6&quot;&gt;Section 9.6&lt;/a&gt;):&lt;/li&gt;
&lt;/ul&gt;

\[\alpha^{\mathbf{w}} = \frac{0.1}{\mathbb{E}\!\left[\|\nabla \hat{v}(S_t, \mathbf{w})\|_\mu^2\right]}\]

&lt;ul&gt;
  &lt;li&gt;Choosing $\alpha^{\boldsymbol{\theta}}$ is much less clear since its best value depends on the range of variation of the rewards and on the policy parametrization.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;135-actor-critic-methods&quot;&gt;13.5 Actor-Critic Methods&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;REINFORCE with baseline does not evaluate actions because its state-value function &lt;strong&gt;only estimates the value of the 1st state&lt;/strong&gt; of each transition; that estimate is made before the transition’s action and serves only as a baseline for the return that follows.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Actor-critic methods, however, apply the state-value function to the 2nd state&lt;/strong&gt; of the transition thereby estimating its value and thus &lt;strong&gt;evaluating the action&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;The policy is the &lt;strong&gt;actor&lt;/strong&gt; that maps states to actions, while the state-value function used to assess actions in this way is the &lt;strong&gt;critic&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;The estimated value of the 2nd state, when discounted &amp;amp; added to the reward, yields the &lt;strong&gt;one-step return&lt;/strong&gt;, $G_{t:t+1}$.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;1351-one-step-actor-critic-methods&quot;&gt;13.5.1 One-Step Actor-Critic Methods&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;One-step actor-critic methods replace the REINFORCE full return with the one-step return and use a learned state-value function as the baseline as follows:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\boldsymbol{\theta}_{t+1} &amp;amp;\doteq \boldsymbol{\theta}_t + \alpha \!\left[G_{t:t+1} - \hat{v}(S_t, \mathbf{w})\right] \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}{\pi(A_t \vert S_t, \boldsymbol{\theta}_t)} \\
&amp;amp;= \boldsymbol{\theta}_t + \alpha \!\left[R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\right] \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}{\pi(A_t \vert S_t, \boldsymbol{\theta}_t)} \\
&amp;amp;= \boldsymbol{\theta}_t + \alpha\, \delta_t\, \frac{\nabla\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}{\pi(A_t \vert S_t, \boldsymbol{\theta}_t)}
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The semi-gradient TD(0) could serve as the natural state-value function learning method.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;PROS:&lt;/strong&gt; simple, fully online &amp;amp; incremental.&lt;/li&gt;
  &lt;li&gt;It is analogous to TD(0), Sarsa(0) &amp;amp; Q-learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;one-step-actor-critic-episodic-for-estimating-pi_boldsymboltheta-approx-pi_&quot;&gt;One-Step Actor-Critic (Episodic), for estimating $\pi_{\boldsymbol{\theta}} \approx \pi_*$&lt;/h4&gt;

\[\boxed{
\begin{aligned}
&amp;amp;\textbf{Inputs: } \text{a differentiable policy } \pi(a \vert s, \boldsymbol{\theta}) \text{ and state-value function } \hat{v}(s, \mathbf{w}) \text{ parametrization} \\
&amp;amp;\textbf{Parameters: } \text{step sizes } \alpha^{\boldsymbol{\theta}} &amp;gt; 0,\ \alpha^{\mathbf{w}} &amp;gt; 0 \\
&amp;amp;\textbf{Initialize } \text{policy parameter } \boldsymbol{\theta} \in \mathbb{R}^{d&apos;} \text{ and state-value weights } \mathbf{w} \in \mathbb{R}^d \text{ (e.g., to } \mathbf{0}\text{)} \\
&amp;amp;\textbf{Loop forever } \text{(for each episode):} \\
&amp;amp;\quad \text{Initialize } S \text{ (1st state of episode)} \\
&amp;amp;\quad I \leftarrow 1 \\
&amp;amp;\quad \textbf{Loop while } S \text{ is not terminal (for each time step):} \\
&amp;amp;\qquad A \sim \pi(\cdot \vert S, \boldsymbol{\theta}) \\
&amp;amp;\qquad \text{Take action } A\text{, observe } S&apos;, R \\
&amp;amp;\qquad \delta \leftarrow R + \gamma \hat{v}(S&apos;, \mathbf{w}) - \hat{v}(S, \mathbf{w}) \quad \text{(if } S&apos; \text{ is terminal, then } \hat{v}(S&apos;, \mathbf{w}) \doteq 0\text{)} \\
&amp;amp;\qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \nabla \hat{v}(S, \mathbf{w}) \\
&amp;amp;\qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}} I\, \delta\, \nabla \ln \pi(A \vert S, \boldsymbol{\theta}) \\
&amp;amp;\qquad I \leftarrow \gamma I \\
&amp;amp;\qquad S \leftarrow S&apos;
\end{aligned}
}\]
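
&lt;p&gt;A minimal NumPy sketch of one episode of this loop, with a linear critic and the softmax-linear actor from the earlier sketches; &lt;code&gt;env&lt;/code&gt;, &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;x_s&lt;/code&gt; are the same hypothetical placeholders as before.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def one_step_actor_critic_episode(theta, w, env, x, x_s, actions,
                                  alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    s, done = env.reset(), False
    I = 1.0
    while not done:
        a = np.random.choice(actions, p=softmax_policy(theta, x, s, actions))
        s_next, r, done = env.step(a)
        v_next = 0.0 if done else w @ x_s(s_next)           # v_hat of a terminal state is 0
        delta = r + gamma * v_next - w @ x_s(s)             # one-step TD error
        w = w + alpha_w * delta * x_s(s)                    # critic update
        theta = theta + alpha_theta * I * delta * grad_log_pi(theta, x, s, a, actions)   # actor update
        I = gamma * I
        s = s_next
    return theta, w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;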

&lt;h3 id=&quot;1352-actor-critic-with-eligibility-traces&quot;&gt;13.5.2 Actor-Critic with Eligibility Traces&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;We generalize to the forward view of $n$-step methods and then to a $\lambda$-return algorithm.&lt;/li&gt;
  &lt;li&gt;We replace the one-step return $G_{t:t+1}$ by $G_{t:t+n}$ or $G_t^\lambda$ respectively for either $n$-step or $\lambda$-return.&lt;/li&gt;
  &lt;li&gt;The backward view of the $\lambda$-return algorithm uses &lt;strong&gt;eligibility traces&lt;/strong&gt; for the actor and the critic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;actor-critic-with-eligibility-traces-episodic-for-estimating-pi_boldsymboltheta-approx-pi_&quot;&gt;Actor-Critic with Eligibility Traces (Episodic), for estimating $\pi_{\boldsymbol{\theta}} \approx \pi_*$&lt;/h4&gt;

\[\boxed{
\begin{aligned}
&amp;amp;\textbf{Inputs: } \text{a differentiable policy and state-value function parametrization: } \pi(a \vert s, \boldsymbol{\theta}),\ \hat{v}(s, \mathbf{w}) \\
&amp;amp;\textbf{Parameters: } \text{trace-decay rates } \lambda^{\boldsymbol{\theta}} \in [0,1],\ \lambda^{\mathbf{w}} \in [0,1]\text{; step sizes } \alpha^{\boldsymbol{\theta}} &amp;gt; 0,\ \alpha^{\mathbf{w}} &amp;gt; 0 \\
&amp;amp;\textbf{Initialize } \text{policy parameter } \boldsymbol{\theta} \in \mathbb{R}^{d&apos;} \text{ and state-value weights } \mathbf{w} \in \mathbb{R}^d \text{ (e.g., to } \mathbf{0}\text{)} \\
&amp;amp;\textbf{Loop forever } \text{(for each episode):} \\
&amp;amp;\quad \text{Initialize } S \text{ (1st state of episode)} \\
&amp;amp;\quad \mathbf{z}^{\boldsymbol{\theta}} \leftarrow \mathbf{0} \quad \text{($d&apos;$-component eligibility trace vector)} \\
&amp;amp;\quad \mathbf{z}^{\mathbf{w}} \leftarrow \mathbf{0} \quad \text{($d$-component eligibility trace vector)} \\
&amp;amp;\quad I \leftarrow 1 \\
&amp;amp;\quad \textbf{Loop while } S \text{ is not terminal (for each time step):} \\
&amp;amp;\qquad A \sim \pi(\cdot \vert S, \boldsymbol{\theta}) \\
&amp;amp;\qquad \text{Take action } A\text{, observe } S&apos;, R \quad \text{(if } S&apos; \text{ is terminal, then } \hat{v}(S&apos;, \mathbf{w}) \doteq 0\text{)} \\
&amp;amp;\qquad \delta \leftarrow R + \gamma \hat{v}(S&apos;, \mathbf{w}) - \hat{v}(S, \mathbf{w}) \\
&amp;amp;\qquad \mathbf{z}^{\mathbf{w}} \leftarrow \gamma\lambda^{\mathbf{w}} \mathbf{z}^{\mathbf{w}} + \nabla \hat{v}(S, \mathbf{w}) \\
&amp;amp;\qquad \mathbf{z}^{\boldsymbol{\theta}} \leftarrow \gamma\lambda^{\boldsymbol{\theta}} \mathbf{z}^{\boldsymbol{\theta}} + I\, \nabla \ln \pi(A \vert S, \boldsymbol{\theta}) \\
&amp;amp;\qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \mathbf{z}^{\mathbf{w}} \\
&amp;amp;\qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}}\, \delta\, \mathbf{z}^{\boldsymbol{\theta}} \\
&amp;amp;\qquad I \leftarrow \gamma I \\
&amp;amp;\qquad S \leftarrow S&apos;
\end{aligned}
}\]
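
&lt;p&gt;The same loop with accumulating eligibility traces, again as a rough sketch built on the hypothetical helpers above:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def actor_critic_traces_episode(theta, w, env, x, x_s, actions, lam_theta=0.9, lam_w=0.9,
                                alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    s, done = env.reset(), False
    z_theta, z_w = np.zeros_like(theta), np.zeros_like(w)   # eligibility traces
    I = 1.0
    while not done:
        a = np.random.choice(actions, p=softmax_policy(theta, x, s, actions))
        s_next, r, done = env.step(a)
        v_next = 0.0 if done else w @ x_s(s_next)
        delta = r + gamma * v_next - w @ x_s(s)
        z_w = gamma * lam_w * z_w + x_s(s)                  # accumulate the critic trace
        z_theta = gamma * lam_theta * z_theta + I * grad_log_pi(theta, x, s, a, actions)
        w = w + alpha_w * delta * z_w
        theta = theta + alpha_theta * delta * z_theta
        I = gamma * I
        s = s_next
    return theta, w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;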

&lt;hr /&gt;

&lt;h2 id=&quot;136-policy-gradient-for-continuing-problems&quot;&gt;13.6 Policy Gradient for Continuing Problems&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;From &lt;a href=&quot;https://chizkidd.github.io/RL-Sutton-Barto-notes/chapters/ch10-on-policy-control-approximation.html#sec-ch10-10-3&quot;&gt;Section 10.3&lt;/a&gt; on continuing problems, lack of episode boundaries requires a new performance measure definition in terms of the &lt;strong&gt;average rate of reward per time step&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
J(\boldsymbol{\theta}) \doteq r(\pi) &amp;amp;\doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[R_t \vert S_0, A_{0:t-1} \sim \pi\right] \\
&amp;amp;= \lim_{t \to \infty} \mathbb{E}\!\left[R_t \vert S_0, A_{0:t-1} \sim \pi\right] \\
&amp;amp;= \sum_s \mu(s) \sum_a \pi(a \vert s) \sum_{s&apos;,r} p(s&apos;,r \vert s,a)\, r
\end{align*}\]

\[\begin{aligned}
\text{where} \\
\mu(s) &amp;amp;\doteq \lim_{t \to \infty} \Pr\{S_t = s \vert A_{0:t} \sim \pi\} \equiv \text{steady state distribution under } \pi\text{,} \\
&amp;amp;\phantom{{}\doteq{}} \text{assumed to exist and be independent of } S_0 \textbf{ [ergodicity assumption]}
\end{aligned}\]

\[\sum_s \mu(s) \sum_a \pi(a \vert s, \boldsymbol{\theta})\, p(s&apos; \vert s,a) = \mu(s&apos;) \quad \text{for all } s&apos; \in S \quad \text{(ergodicity)}\]

&lt;ul&gt;
  &lt;li&gt;In the continuing case, we define values $v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \vert S_t = s]$ and $q_{\pi}(s,a) \doteq \mathbb{E}_{\pi}[G_t \vert S_t = s, A_t = a]$ w.r.t. the &lt;strong&gt;differential return&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \ldots\]

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Proof: Policy Gradient Theorem (Continuing Case)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;As before, we leave it implicit that $\pi = f(\boldsymbol{\theta})$ and that all gradients $\nabla[\cdot]$ are taken w.r.t. $\boldsymbol{\theta}$. In the continuing case $J(\boldsymbol{\theta}) = r(\pi)$, and $v_\pi$ &amp;amp; $q_\pi$ denote values w.r.t. the &lt;strong&gt;differential return&lt;/strong&gt;.&lt;/p&gt;

\[\begin{align*}
\nabla v_\pi(s) &amp;amp;= \nabla \!\left[\sum_a \pi(a \vert s)\, q_\pi(s,a)\right], \quad \text{for all } s \in \mathcal{S} \\
&amp;amp;= \sum_a \!\left[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s)\, \nabla q_\pi(s,a)\right] \\
&amp;amp;= \sum_a \!\left[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s)\, \nabla \sum_{s&apos;,r} p(s&apos;,r \vert s,a)\!\left(r - r(\boldsymbol{\theta}) + v_\pi(s&apos;)\right)\right] \\
&amp;amp;= \sum_a \!\left[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s)\!\left[-\nabla r(\boldsymbol{\theta}) + \sum_{s&apos;} p(s&apos; \vert s,a)\, \nabla v_\pi(s&apos;)\right]\right]
\end{align*}\]

\[\nabla r(\boldsymbol{\theta}) = \sum_a \!\left[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s) \sum_{s&apos;} p(s&apos; \vert s,a)\, \nabla v_\pi(s&apos;)\right] - \nabla v_\pi(s)\]

&lt;p&gt;Since $\nabla J(\boldsymbol{\theta})$ does not depend on $s$, we sum over all $s \in \mathcal{S}$ weighted by $\mu(s)$ (because $\sum_s \mu(s) = 1$):&lt;/p&gt;

\[\begin{align*}
\nabla J(\boldsymbol{\theta}) &amp;amp;= \sum_s \mu(s) \Bigl(\sum_a \Bigl[\nabla\pi(a \vert s)\, q_\pi(s,a) + \pi(a \vert s) \sum_{s&apos;} p(s&apos; \vert s,a)\, \nabla v_\pi(s&apos;)\Bigr] - \nabla v_\pi(s)\Bigr) \\
&amp;amp;= \sum_s \mu(s) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a) \\
&amp;amp;\quad + \sum_{s&apos;} \underbrace{\sum_s \mu(s) \sum_a \pi(a \vert s)\, p(s&apos; \vert s,a)}_{\mu(s&apos;)} \nabla v_\pi(s&apos;) - \sum_s \mu(s)\, \nabla v_\pi(s) \\
&amp;amp;= \sum_s \mu(s) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a) + \sum_{s&apos;} \mu(s&apos;)\, \nabla v_\pi(s&apos;) - \sum_s \mu(s)\, \nabla v_\pi(s) \\
&amp;amp;= \sum_s \mu(s) \sum_a \nabla\pi(a \vert s)\, q_\pi(s,a) \qquad \text{Q.E.D.}
\end{align*}\]

  &lt;/div&gt;
&lt;/div&gt;

&lt;h4 id=&quot;actor-critic-with-eligibility-traces-continuing-for-estimating-pi_boldsymboltheta-approx-pi_&quot;&gt;Actor-Critic with Eligibility Traces (Continuing), for estimating $\pi_{\boldsymbol{\theta}} \approx \pi_*$&lt;/h4&gt;

\[\boxed{
\begin{aligned}
&amp;amp;\textbf{Inputs: } \pi(a \vert s, \boldsymbol{\theta}),\ \hat{v}(s, \mathbf{w}) \\
&amp;amp;\textbf{Parameters: } \lambda^{\mathbf{w}} \in [0,1],\ \lambda^{\boldsymbol{\theta}} \in [0,1],\ \alpha^{\mathbf{w}} &amp;gt; 0,\ \alpha^{\boldsymbol{\theta}} &amp;gt; 0,\ \alpha^{\bar{R}} &amp;gt; 0 \\
&amp;amp;\textbf{Initialize } \bar{R} \in \mathbb{R} \text{ (e.g., to 0)},\ \mathbf{w} \in \mathbb{R}^d\ \&amp;amp;\ \boldsymbol{\theta} \in \mathbb{R}^{d&apos;} \text{ (e.g., to } \mathbf{0}\text{)},\ S \in \mathcal{S} \text{ (e.g., to } s_0\text{)} \\
&amp;amp;\mathbf{z}^{\mathbf{w}} \leftarrow \mathbf{0};\ \mathbf{z}^{\boldsymbol{\theta}} \leftarrow \mathbf{0} \\
&amp;amp;\textbf{Loop forever } \text{(for each time step):} \\
&amp;amp;\quad A \sim \pi(\cdot \vert S, \boldsymbol{\theta}) \\
&amp;amp;\quad \text{Take action } A\text{, observe } S&apos;, R \\
&amp;amp;\quad \delta \leftarrow R - \bar{R} + \hat{v}(S&apos;, \mathbf{w}) - \hat{v}(S, \mathbf{w}) \\
&amp;amp;\quad \bar{R} \leftarrow \bar{R} + \alpha^{\bar{R}}\, \delta \\
&amp;amp;\quad \mathbf{z}^{\mathbf{w}} \leftarrow \lambda^{\mathbf{w}} \mathbf{z}^{\mathbf{w}} + \nabla \hat{v}(S, \mathbf{w}) \\
&amp;amp;\quad \mathbf{z}^{\boldsymbol{\theta}} \leftarrow \lambda^{\boldsymbol{\theta}} \mathbf{z}^{\boldsymbol{\theta}} + \nabla \ln \pi(A \vert S, \boldsymbol{\theta}) \\
&amp;amp;\quad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \mathbf{z}^{\mathbf{w}} \\
&amp;amp;\quad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}}\, \delta\, \mathbf{z}^{\boldsymbol{\theta}} \\
&amp;amp;\quad S \leftarrow S&apos;
\end{aligned}
}\]
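
&lt;p&gt;A rough sketch of a single time step of this continuing (average-reward) loop, reusing the hypothetical helpers from the episodic sketches; note there is no discounting, and the average-reward estimate $\bar{R}$ is updated from the same TD error:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def actor_critic_continuing_step(theta, w, z_theta, z_w, R_bar, s, env, x, x_s, actions,
                                 lam_theta=0.9, lam_w=0.9,
                                 alpha_theta=1e-3, alpha_w=1e-2, alpha_rbar=1e-2):
    # one time step of the continuing (average-reward) actor-critic
    a = np.random.choice(actions, p=softmax_policy(theta, x, s, actions))
    s_next, r = env.step(a)                                 # continuing task: no terminal flag
    delta = r - R_bar + w @ x_s(s_next) - w @ x_s(s)        # differential TD error
    R_bar = R_bar + alpha_rbar * delta
    z_w = lam_w * z_w + x_s(s)                              # no discounting in the traces
    z_theta = lam_theta * z_theta + grad_log_pi(theta, x, s, a, actions)
    w = w + alpha_w * delta * z_w
    theta = theta + alpha_theta * delta * z_theta
    return theta, w, z_theta, z_w, R_bar, s_next
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;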

&lt;hr /&gt;

&lt;h2 id=&quot;137-policy-parametrization-for-continuous-actions&quot;&gt;13.7 Policy Parametrization for Continuous Actions&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Policy gradient methods enable us to handle large (and even continuous) action spaces by learning the statistics of the probability distribution instead of computing learned probabilities for each of the many actions.&lt;/li&gt;
  &lt;li&gt;The probability density function for the normal distribution is conventionally written as:&lt;/li&gt;
&lt;/ul&gt;

\[p(x) \doteq \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]

\[\begin{aligned}
\text{where} \\
\mu &amp;amp;\equiv \text{mean of the normal distribution} \\
\sigma &amp;amp;\equiv \text{standard deviation of the normal distribution}
\end{aligned}\]

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch13-13-7-prob-density-function.png&quot; alt=&quot;Probability Density Function&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Probability Density Function (PDF)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;The probability density functions for several different means and standard deviations are shown above.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;$p(x)$ is the &lt;strong&gt;density&lt;/strong&gt; of the probability at $x$, not the probability itself. It can be greater than 1; it is the total area under the curve that must equal 1.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To get the probability of $x$ falling within a range, take the integral under $p(x)$ for that specific range of $x$ values.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;The policy parametrization, defining the policy as the normal probability density over a real-valued scalar action $a$ with mean &amp;amp; standard deviation given by state-dependent, parametric function approximators, is as follows:&lt;/li&gt;
&lt;/ul&gt;

\[\pi(a \vert s, \boldsymbol{\theta}) \doteq \frac{1}{\sigma(s, \boldsymbol{\theta})\sqrt{2\pi}} \exp\!\left(-\frac{(a - \mu(s, \boldsymbol{\theta}))^2}{2\sigma(s, \boldsymbol{\theta})^2}\right)\]

\[\begin{aligned}
\text{where} \\
\mu &amp;amp;: S \times \mathbb{R}^{d&apos;} \to \mathbb{R} \equiv \text{mean function approximator} \\
\sigma &amp;amp;: S \times \mathbb{R}^{d&apos;} \to \mathbb{R}^{+} \equiv \text{standard deviation function approximator}
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;The approximators need a representation form, so we split the policy’s parameter vector into 2 parts, $\boldsymbol{\theta} = [\boldsymbol{\theta}_{\mu}, \boldsymbol{\theta}_{\sigma}]^T$, one for mean approximation and the other for standard deviation approximation.&lt;/li&gt;
  &lt;li&gt;The mean can be approximated as a linear function while the standard deviation can be approximated as the exponential of a linear function:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
\mu(s, \boldsymbol{\theta}) &amp;amp;\doteq \boldsymbol{\theta}_\mu^T \mathbf{x}_\mu(s) \\
\sigma(s, \boldsymbol{\theta}) &amp;amp;\doteq \exp\!\left(\boldsymbol{\theta}_\sigma^T \mathbf{x}_\sigma(s)\right)
\end{aligned}\]
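
&lt;p&gt;A small NumPy sketch of this Gaussian parametrization together with its log-probability gradients (the eligibility terms used in the policy-gradient updates); the state-feature functions &lt;code&gt;x_mu&lt;/code&gt; and &lt;code&gt;x_sigma&lt;/code&gt; are hypothetical placeholders.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x_mu, x_sigma, s):
    # mu(s, theta) = theta_mu^T x_mu(s);  sigma(s, theta) = exp(theta_sigma^T x_sigma(s))
    mu = theta_mu @ x_mu(s)
    sigma = np.exp(theta_sigma @ x_sigma(s))
    a = np.random.normal(mu, sigma)                 # sample a real-valued scalar action
    return a, mu, sigma

def gaussian_grad_log_pi(a, mu, sigma, x_mu, x_sigma, s):
    # grad wrt theta_mu:    (a - mu) / sigma^2 * x_mu(s)
    # grad wrt theta_sigma: ((a - mu)^2 / sigma^2 - 1) * x_sigma(s)
    g_mu = (a - mu) / sigma ** 2 * x_mu(s)
    g_sigma = ((a - mu) ** 2 / sigma ** 2 - 1.0) * x_sigma(s)
    return g_mu, g_sigma
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;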

&lt;hr /&gt;

&lt;h2 id=&quot;138-summary&quot;&gt;13.8 Summary&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;This chapter moved from action-value methods to &lt;strong&gt;parametrized policy&lt;/strong&gt; methods that select actions without consulting action-value estimates.&lt;/li&gt;
  &lt;li&gt;More specifically, policy gradient methods update the policy parameter on each step in the direction of an estimate of the performance gradient w.r.t. the policy parameter.&lt;/li&gt;
  &lt;li&gt;Advantages of &lt;strong&gt;parametrized policy&lt;/strong&gt; methods over $\varepsilon$-greedy &amp;amp; action-value methods:
    &lt;ul&gt;
      &lt;li&gt;They can learn specific probabilities for taking actions.&lt;/li&gt;
      &lt;li&gt;They can learn appropriate levels of exploration &amp;amp; approach deterministic policies asymptotically.&lt;/li&gt;
      &lt;li&gt;They can naturally handle continuous action spaces.&lt;/li&gt;
      &lt;li&gt;Important theoretical advantage over action-value methods in the form of the &lt;strong&gt;policy gradient theorem&lt;/strong&gt;.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;REINFORCE&lt;/strong&gt; uses the policy gradient theorem with Monte Carlo returns.&lt;/li&gt;
  &lt;li&gt;Addition of a state-value function as a &lt;strong&gt;baseline&lt;/strong&gt; with REINFORCE reduces its variance without introducing bias and speeds up learning.&lt;/li&gt;
  &lt;li&gt;If the state-value function is used to criticize/assess the policy’s action selections, then the value function is called a &lt;strong&gt;critic&lt;/strong&gt; and the policy is called an &lt;strong&gt;actor&lt;/strong&gt;. Overall this is referred to as the &lt;strong&gt;actor-critic method&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;The critic introduces &lt;strong&gt;bias&lt;/strong&gt; into the actor’s gradient estimates, but is still often desirable for the same reason that bootstrapping TD methods are superior to Monte Carlo methods (significant variance reduction).&lt;/li&gt;
  &lt;li&gt;Overall, policy-gradient methods provide a significantly different set of strengths &amp;amp; weaknesses than action-value methods.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026RLsuttonBartoCh13notes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Sutton &amp;amp; Barto, Ch. 13: Policy Gradient Methods (Personal Notes)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;May&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/05/07/rl-sutton-barto-notes-ch013/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;
</description>
        <pubDate>Thu, 07 May 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/05/07/rl-sutton-barto-notes-ch013/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/05/07/rl-sutton-barto-notes-ch013/</guid>
        
        
      </item>
    
      <item>
        <title>Transformers</title>
        <description>&lt;ul&gt;
  &lt;li&gt;Transformers are a &lt;strong&gt;sequence-to-sequence model&lt;/strong&gt;: given an input sequence, produce an output sequence.&lt;/li&gt;
  &lt;li&gt;Architecture: an &lt;strong&gt;Encoder&lt;/strong&gt; processes the input; a &lt;strong&gt;Decoder&lt;/strong&gt; generates the output autoregressively.&lt;/li&gt;
&lt;/ul&gt;

\[\text{(En) &quot;I am sorry&quot;} \xrightarrow{\text{Encoder}} \xrightarrow{\text{Decoder}} \texttt{&amp;lt;start&amp;gt;}\ \text{Je suis désolé}\ \texttt{&amp;lt;end&amp;gt;}\]

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Autoregressive&lt;/strong&gt;: the decoder generates one token at a time, conditioning on all previously generated tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#input-text-sequence-representation&quot;&gt;Input Text Sequence Representation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#encoders&quot;&gt;Encoders&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#from-mlp-to-attention&quot;&gt;From MLP to Attention&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#self-attention--multi-head-attention&quot;&gt;Self-Attention &amp;amp; Multi-Head Attention&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#decoders&quot;&gt;Decoders&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#masked-attention&quot;&gt;Masked Attention&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#encoder-decoder-cross-attention&quot;&gt;Encoder-Decoder Cross Attention&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;input-text-sequence-representation&quot;&gt;Input Text Sequence Representation&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/transformers/enc-dec.png&quot; alt=&quot;encoder-decoder&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;tokenization&quot;&gt;Tokenization&lt;/h3&gt;

&lt;p&gt;Two approaches to representing input text:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. One-hot encoding&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;No semantic similarity or meaning of words encoded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Token embedding&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Encodes semantic similarity between words.&lt;/li&gt;
  &lt;li&gt;Embedding matrix is &lt;strong&gt;learned&lt;/strong&gt; (Lookup Table).&lt;/li&gt;
  &lt;li&gt;Each token embedding is stored as a &lt;strong&gt;column vector&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/transformers/token-embedding.png&quot; alt=&quot;token embedding&quot; /&gt;&lt;/p&gt;

\[\underbrace{\begin{bmatrix} 0.5 \\ 2.7 \\ 1.2 \\ -0.2 \end{bmatrix}}_{d} = W_E \underbrace{\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}}_{\text{\# tokens}}\]

\[\begin{aligned}
\text{where} \\
W_E &amp;amp;\equiv \text{embedding matrix, } d \times \text{\#tokens} \\
d &amp;amp;\equiv \text{embedding dimension}
\end{aligned}\]
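
&lt;p&gt;A tiny NumPy illustration of the lookup: multiplying the learned matrix $W_E$ by a one-hot token vector just selects that token’s column (the sizes below are made up).&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

d, n_tokens = 4, 10                       # embedding dim, vocabulary size (made-up sizes)
W_E = np.random.randn(d, n_tokens)        # learned embedding matrix, one column per token

token_id = 3
one_hot = np.zeros(n_tokens)
one_hot[token_id] = 1.0

# multiplying by a one-hot vector selects column token_id of W_E
embedding = W_E @ one_hot
assert np.allclose(embedding, W_E[:, token_id])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;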

&lt;h3 id=&quot;why-we-need-context&quot;&gt;Why We Need Context&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Many words have different meanings in different contexts:
    &lt;ul&gt;
      &lt;li&gt;“I bought an &lt;strong&gt;apple&lt;/strong&gt; &amp;amp; an orange”&lt;/li&gt;
      &lt;li&gt;“I bought an &lt;strong&gt;apple&lt;/strong&gt; watch”&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;We need to rely on &lt;strong&gt;context&lt;/strong&gt; to resolve the ambiguity.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;encoders&quot;&gt;Encoders&lt;/h2&gt;

&lt;p&gt;The encoder pipeline:&lt;/p&gt;

\[\text{Input} \to \text{token} + \text{POS embed} \to \text{Norm} \to \text{MHA(self)} \to \text{Add} \to \text{Norm} \to \text{FFN} \to \text{Add} \to \text{Output}\]

&lt;ul&gt;
  &lt;li&gt;Input tokens are embedded using $W_E$ and combined with &lt;strong&gt;positional encodings&lt;/strong&gt; to produce the input matrix $X$.&lt;/li&gt;
  &lt;li&gt;The architecture stacks: Multi-Head Self-Attention + residual, then FeedForward Network + residual.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/transformers/encoder.png&quot; alt=&quot;encoder&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;from-mlp-to-attention&quot;&gt;From MLP to Attention&lt;/h2&gt;

&lt;h3 id=&quot;mlp-only&quot;&gt;MLP only&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;MLP stands for Multilayer Perceptron&lt;/li&gt;
  &lt;li&gt;No contextual information; each token is processed independently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;concatenation-of-nearby-token-embeddings-before-mlp&quot;&gt;Concatenation of nearby token embeddings before MLP&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Need a sufficiently large window to cover the entire input sequence.&lt;/li&gt;
  &lt;li&gt;Cannot handle variable sequence lengths.&lt;/li&gt;
  &lt;li&gt;Requires many model parameters.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;attention&quot;&gt;Attention&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Use &lt;strong&gt;token similarity&lt;/strong&gt; to determine the relevance of each token to every other token by performing a dot product.&lt;/li&gt;
  &lt;li&gt;Allows the model to dynamically weight which parts of the input are relevant for each position.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;self-attention--multi-head-attention&quot;&gt;Self-Attention &amp;amp; Multi-Head Attention&lt;/h2&gt;

&lt;h3 id=&quot;self-attention&quot;&gt;Self-Attention&lt;/h3&gt;

\[Q = XW_Q, \quad K = XW_K, \quad V = XW_V\]

\[\text{head}_i = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V = \text{Attention}(Q, K, V)\]

&lt;h3 id=&quot;multi-head-attention-mha&quot;&gt;Multi-Head Attention (MHA)&lt;/h3&gt;

\[\text{MHA}(X) = \text{multi-head}(Q, K, V) = \text{Concat}(h_1, h_2, \ldots, h_H)\, W_O = Z\]

\[\begin{aligned}
\text{where} \\
h_i &amp;amp;\equiv i\text{-th attention head} \\
W_O &amp;amp;\equiv \text{output projection matrix} \\
Z &amp;amp;\equiv \text{final output}
\end{aligned}\]
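
&lt;p&gt;A minimal NumPy sketch of a single self-attention head following these equations; it assumes $X$ holds one row per token, and the sizes and random weights are illustrative only.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def self_attention_head(X, W_Q, W_K, W_V):
    # one head: project the token matrix X (one row per token) into Q, K, V
    return attention(X @ W_Q, X @ W_K, X @ W_V)

n, d, d_k = 5, 8, 4                                   # tokens, model dim, head dim (made up)
X = np.random.randn(n, d)
W_Q, W_K, W_V = (np.random.randn(d, d_k) for _ in range(3))
print(self_attention_head(X, W_Q, W_K, W_V).shape)    # (5, 4)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;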

&lt;hr /&gt;

&lt;h2 id=&quot;decoders&quot;&gt;Decoders&lt;/h2&gt;

&lt;p&gt;The decoder pipeline:&lt;/p&gt;

\[\text{Masked MHA} \to \text{Cross-Attn} \to \text{FFN} \to \text{Linear} \to \text{Softmax} \to \text{Output}\]

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/transformers/decoder.png&quot; alt=&quot;Decoder&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The decoder takes as input the previously generated tokens $(z_i)$ along with their positional encodings $(p_i)$, and at each step attends both to itself (masked) and to the encoder output (cross attention).&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;masked-attention&quot;&gt;Masked Attention&lt;/h2&gt;

&lt;p&gt;In the decoder, we use &lt;strong&gt;masked&lt;/strong&gt; (causal) self-attention to prevent the decoder from attending to future tokens:&lt;/p&gt;

\[\text{Masked Attn}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V\]

\[\begin{aligned}
\text{where} \\
M &amp;amp;\equiv \text{lookahead mask}
\end{aligned}\]

\[M = \begin{bmatrix}
0 &amp;amp; -\infty &amp;amp; -\infty &amp;amp; -\infty \\
0 &amp;amp; 0 &amp;amp; -\infty &amp;amp; -\infty \\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; -\infty \\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0
\end{bmatrix}\]

&lt;p&gt;Adding $-\infty$ to future positions drives their softmax weights to zero, ensuring position $i$ can only attend to positions $\leq i$.&lt;/p&gt;
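
&lt;p&gt;A small NumPy sketch of building this lookahead mask and applying it, reusing the &lt;code&gt;softmax&lt;/code&gt; helper from the attention sketch above:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def lookahead_mask(n):
    # 0 on and below the diagonal, -inf above it (future positions)
    M = np.zeros((n, n))
    M[np.triu_indices(n, k=1)] = -np.inf
    return M

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + lookahead_mask(Q.shape[0])
    return softmax(scores, axis=-1) @ V   # softmax as defined in the attention sketch

print(lookahead_mask(4))                  # matches the 4x4 mask M shown above
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;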

&lt;hr /&gt;

&lt;h2 id=&quot;encoder-decoder-cross-attention&quot;&gt;Encoder-Decoder Cross Attention&lt;/h2&gt;

&lt;p&gt;The decoder queries the encoder output $E$ to incorporate source context:&lt;/p&gt;

\[\begin{aligned}
Y&apos; &amp;amp;= Y + \text{Masked MHA}(\text{Norm}(Y)) \\[6pt]
Q &amp;amp;= Y&apos; W_Q^{\text{dec}} \\
K^{\text{enc}} &amp;amp;= E W_K^{\text{enc}} \\
V^{\text{enc}} &amp;amp;= E W_V^{\text{enc}}
\end{aligned}\]

\[\begin{aligned}
\text{where} \\
E &amp;amp;\equiv \text{encoder output}
\end{aligned}\]

\[\text{Cross Attn}(Y&apos;, E) = \text{Softmax}\!\left(\frac{Q (K^{\text{enc}})^T}{\sqrt{d_k}}\right)V^{\text{enc}}\]

&lt;p&gt;Then the rest of the decoder:&lt;/p&gt;

\[\begin{aligned}
Y&apos;&apos; &amp;amp;= Y&apos; + \text{Cross Attn}(\text{Norm}(Y&apos;),\ E) \\
Y^O &amp;amp;= Y&apos;&apos; + \text{FFN}(\text{Norm}(Y&apos;&apos;)) \\
D &amp;amp;= Y^O \\
\text{logits} &amp;amp;= DW^{\text{out}} + b \\
P(y_t) &amp;amp;= \text{Softmax}(\text{logits}_t)
\end{aligned}\]

\[\begin{aligned}
\text{where} \\
D &amp;amp;\equiv \text{decoder output} \\
W^{\text{out}} &amp;amp;\equiv \text{output projection to vocabulary} \\
P(y_t) &amp;amp;\equiv \text{probability distribution over next token}
\end{aligned}\]
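
&lt;p&gt;Putting these pieces together, a rough NumPy sketch of one pre-norm decoder block; it reuses the &lt;code&gt;attention&lt;/code&gt; and &lt;code&gt;masked_attention&lt;/code&gt; functions from the sketches above, and all projection matrices are assumed square ($d \times d$) so the residual additions line up.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def layer_norm(X, eps=1e-5):
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def ffn(X, W1, b1, W2, b2):
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2           # simple ReLU MLP

def decoder_block(Y, E, Wq, Wk, Wv, Wq_x, Wk_x, Wv_x, W1, b1, W2, b2):
    # masked self-attention sub-layer with residual
    Yn = layer_norm(Y)
    Y1 = Y + masked_attention(Yn @ Wq, Yn @ Wk, Yn @ Wv)
    # cross-attention: queries from the decoder, keys and values from encoder output E
    Y1n = layer_norm(Y1)
    Y2 = Y1 + attention(Y1n @ Wq_x, E @ Wk_x, E @ Wv_x)
    # position-wise feed-forward sub-layer with residual
    return Y2 + ffn(layer_norm(Y2), W1, b1, W2, b2)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;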

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026transformers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Transformers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Apr&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/04/17/transformers/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;
</description>
        <pubDate>Fri, 17 Apr 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/04/17/transformers/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/04/17/transformers/</guid>
        
        
      </item>
    
      <item>
        <title>SAM 2: Segment Anything in Images &amp; Videos</title>
        <description>&lt;ul&gt;
  &lt;li&gt;Meta’s unified model for promptable image and video segmentation.
    &lt;ul&gt;
      &lt;li&gt;A foundation model for solving promptable visual segmentation in images &amp;amp; &lt;strong&gt;videos&lt;/strong&gt;.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Built a data engine to collect the largest video segmentation dataset to date.&lt;/li&gt;
  &lt;li&gt;&lt;u&gt;Model&lt;/u&gt;: Simple transformer architecture with &lt;strong&gt;streaming memory&lt;/strong&gt; for real-time video processing.&lt;/li&gt;
  &lt;li&gt;Trained on a wide range of tasks: video segmentation and image segmentation.&lt;/li&gt;
  &lt;li&gt;The paper can be found &lt;a href=&quot;https://arxiv.org/pdf/2408.00714&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#1-introduction&quot;&gt;1. Introduction&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#2-related-work&quot;&gt;2. Related Work&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#3-task-promptable-visual-segmentation-pvs&quot;&gt;3. Task: Promptable Visual Segmentation (PVS)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#4-model&quot;&gt;4. Model&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#5-data&quot;&gt;5. Data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#6-zero-shot-experiments&quot;&gt;6. Zero-Shot Experiments&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#7-comparison-to-sota-in-semi-supervised-vos&quot;&gt;7. Comparison to SOTA in Semi-Supervised VOS&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#8-conclusion&quot;&gt;8. Conclusion&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#discussion&quot;&gt;9. Discussion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#references&quot;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;1-introduction&quot;&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why video and not image?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Image is only a static snapshot of the real world; lacks motion information (temporal).&lt;/li&gt;
  &lt;li&gt;Video captures temporal information.&lt;/li&gt;
  &lt;li&gt;Many vital applications (robotics, AR/VR, autonomous vehicles) require temporal localization beyond image-level segmentation.&lt;/li&gt;
  &lt;li&gt;A universal visual segmentation system should be applicable to both images &amp;amp; videos.&lt;/li&gt;
  &lt;li&gt;Video segmentation aims to determine the &lt;strong&gt;spatio-temporal extent&lt;/strong&gt; of entities, which presents unique challenges beyond those in images.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Significant changes in appearance&lt;/strong&gt; encountered by entities &amp;amp; &lt;strong&gt;lower quality nature of videos&lt;/strong&gt; than images present challenges for video segmentation.&lt;/li&gt;
  &lt;li&gt;SAM successfully solves image segmentation, but existing video segmentation models &amp;amp; datasets fall short in providing a comparable capability to “segment anything in videos.”&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;SAM 2&lt;/strong&gt;: A unified model for &lt;strong&gt;video &amp;amp; image&lt;/strong&gt; segmentation.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Promptable Visual Segmentation (PVS)&lt;/strong&gt;: Task that generalizes image segmentation to the video domain.&lt;/li&gt;
  &lt;li&gt;A &lt;strong&gt;data engine&lt;/strong&gt; that generates training data via an in-the-loop model with annotators and produces the &lt;strong&gt;Segment Anything Video (SA-V) dataset&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;2-related-work&quot;&gt;2. Related Work&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Video Object Segmentation (VOS)&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Video segmentation datasets&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Interactive Video Object Segmentation (iVOS)&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Image Segmentation&lt;/strong&gt; task, model and dataset
    &lt;ul&gt;
      &lt;li&gt;Research Paper: &lt;a href=&quot;https://arxiv.org/pdf/2304.02643&quot;&gt;Segment Anything (SA)&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Segment Anything (Adapted from the Paper)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;We aim to build a foundation model for segmentation by introducing three interconnected components: a promptable segmentation &lt;strong&gt;task&lt;/strong&gt;, a segmentation &lt;strong&gt;model&lt;/strong&gt; (SAM) that powers data annotation and enables zero-shot transfer to a range of tasks via prompt engineering, and a &lt;strong&gt;data&lt;/strong&gt; engine for collecting SA-1B, our dataset of over 1 billion masks.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;3-task-promptable-visual-segmentation-pvs&quot;&gt;3. Task: Promptable Visual Segmentation (PVS)&lt;/h2&gt;

\[\text{PVS} \longrightarrow \text{SAM 2} \longrightarrow \text{SA-V dataset}\]

&lt;ul&gt;
  &lt;li&gt;PVS task allows providing prompts to the model on &lt;strong&gt;any frame&lt;/strong&gt; of a video.&lt;/li&gt;
  &lt;li&gt;Interactive segmentation with SAM 2 involves the steps below:
    &lt;ul&gt;
      &lt;li&gt;SAM 2 is prompted on a single frame and responds instantly with a valid segmentation mask of the target object on this frame.&lt;/li&gt;
      &lt;li&gt;SAM 2 then propagates the target object’s segment to multiple frames to form a &lt;strong&gt;masklet&lt;/strong&gt;.&lt;/li&gt;
      &lt;li&gt;Multiple initial prompts are received and propagated by the model to obtain the masklet of the object &lt;strong&gt;across the entire video&lt;/strong&gt;, which leads to localization of the segmentation mask of the target on every single video frame.&lt;/li&gt;
      &lt;li&gt;Additional prompts on &lt;strong&gt;any&lt;/strong&gt; frame can be added to SAM 2 for segmentation mask refinement.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;SAM 2 is applied as a data collection tool to the PVS task for building the SA-V dataset.&lt;/li&gt;
  &lt;li&gt;The model is evaluated by simulating interactive video segmentation across multiple frames, in the conventional semi-supervised VOS setting (prompts limited to the first frame), and on the SA benchmarks for image segmentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;4-model&quot;&gt;4. Model&lt;/h2&gt;

&lt;p&gt;SAM 2 is a generalization of SAM to the video (&amp;amp; image) domain. It takes point, box &amp;amp; mask prompts on individual frames to define the &lt;strong&gt;spatial extent&lt;/strong&gt; of the object to be segmented &lt;strong&gt;spatio-temporally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/sam-2/sam-2.png&quot; alt=&quot;SAM 2 Architecture&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;small class=&quot;text-muted d-block text-center&quot;&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; The SAM 2 architecture. For a given frame, the segmentation prediction is conditioned on the current prompt and/or on previously observed memories. Video frames are processed in a streaming fashion by the image encoder, cross-attended to memories of the target object from previous frames stored in the memory bank, and decoded via the mask decoder (optionally prompted by the prompt encoder) to predict the segmentation mask for that frame. Finally, a memory encoder transforms the prediction and image encoder embeddings for use in future frames.&lt;/small&gt;&lt;/p&gt;

&lt;!-- 
&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;SAM 2 Architecture&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;For a given frame, the segmentation prediction is conditioned on the current prompt and/or on previously observed memories. Video frames are processed in a streaming fashion by the image encoder, cross-attended to memories of the target object from previous frames stored in the memory bank, and decoded via the mask decoder (optionally prompted by the prompt encoder) to predict the segmentation mask for that frame. Finally, a memory encoder transforms the prediction and image encoder embeddings for use in future frames.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt; --&gt;

&lt;p&gt;&lt;strong&gt;Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Image encoder&lt;/strong&gt;: For real-time processing of arbitrarily long videos via a streaming, hierarchical approach.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Memory attention&lt;/strong&gt;: &lt;em&gt;Conditions&lt;/em&gt; the current frame features on past frames’ features and predictions, as well as on any new prompts.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Prompt encoder &amp;amp; mask decoder&lt;/strong&gt;: Encode the input prompts and pass them to the mask decoder, which predicts the segmentation mask for the current frame.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Memory encoder&lt;/strong&gt;: Generates a memory by downsampling the output mask and fusing it with the image encoder embeddings via element-wise summation followed by light-weight convolutions.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Memory bank&lt;/strong&gt;: Stores spatial memory information (from prompts) about past predictions for the target object in the video, and high-level semantic information of the object to segment based on each frame’s mask decoder output tokens (a rough sketch of how these pieces fit together follows this list).&lt;/li&gt;
&lt;/ul&gt;
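
&lt;p&gt;Reading the components above as a pipeline, the per-frame flow can be summarized roughly as follows. This is my own conceptual sketch; the callables &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image_encoder&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory_attention&lt;/code&gt;, etc. are placeholders for the real modules, not the actual SAM 2 API:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from collections import deque

def sam2_streaming_inference(frames, prompts, image_encoder, prompt_encoder,
                             memory_attention, mask_decoder, memory_encoder,
                             num_recent=6):
    # Hypothetical skeleton of the per-frame flow; the five callables stand in
    # for the real SAM 2 modules and are assumptions, not the released API.
    recent_memories = deque(maxlen=num_recent)   # FIFO over recent frames
    prompted_memories = []                       # memories from prompted frames
    masks = []
    for t, frame in enumerate(frames):
        feats = image_encoder(frame)             # streaming image encoder
        feats = memory_attention(feats, list(recent_memories) + prompted_memories)
        prompt_emb = prompt_encoder(prompts.get(t))        # None if no prompt on frame t
        mask, obj_ptr = mask_decoder(feats, prompt_emb)    # mask + object pointer
        memory = memory_encoder(mask, feats)               # fuse prediction with features
        target = prompted_memories if t in prompts else recent_memories
        target.append((memory, obj_ptr))
        masks.append(mask)
    return masks
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;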

&lt;p&gt;&lt;strong&gt;Training:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model is trained &lt;strong&gt;jointly&lt;/strong&gt; on image and video data by simulating interactive prompting: sequences of 8 frames are sampled, 2 of them are randomly selected to receive prompts, and the model is trained to sequentially and interactively predict the ground-truth masklet.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;5-data&quot;&gt;5. Data&lt;/h2&gt;

&lt;p&gt;The data engine – an interactive, model-in-the-loop setup with human annotators – was built to collect a large and diverse video segmentation dataset and develop the “segment anything” capability in video. The data engine went through 3 phases, each defined by the level of model assistance provided to annotators.&lt;/p&gt;

&lt;h3 id=&quot;51-data-engine&quot;&gt;5.1 Data Engine&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Phase 1:&lt;/strong&gt; SAM per frame&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Phase 2:&lt;/strong&gt; SAM + SAM 2&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Phase 3:&lt;/strong&gt; SAM 2&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Quality verification&lt;/strong&gt; including a separate set of QA annotators for the masklets (satisfactory or unsatisfactory)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Auto masklet generation&lt;/strong&gt; enables the &lt;em&gt;“anything capability”&lt;/em&gt; of the model by ensuring annotation diversity.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Analysis 1:&lt;/strong&gt; A controlled experiment comparing the data engine phases on the average annotation time per frame, the average percentage of manually edited frames per masklet, and the average number of clicks per clicked frame. For QA, the defined metric is the &lt;strong&gt;Phase 1 Mask Alignment Score&lt;/strong&gt;: the percentage of masks whose IoU with the corresponding masks from Phase 1 (the highest-quality manual annotations) exceeds 0.75 (a tiny sketch of this metric follows the list).
&lt;!-- &gt; Phase 3 is **8.4×** faster than Phase 1, has the lowest edited frame percentage and clicks per frame, and results in better alignment. --&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Analysis 2:&lt;/strong&gt; Performance comparison of SAM 2 trained on the data available at the end of each phase, keeping the number of training iterations fixed, thereby measuring solely the impact of the additional data. Evaluation is done on the SA-V val dataset and 9 zero-shot benchmarks using the standard &lt;em&gt;$J\&amp;amp;F$&lt;/em&gt; accuracy metric. 
&lt;!-- &gt;Consistent segmentation accuracy improvement from iteratively adding data from each data engine phase (1, 2 &amp; 3) for both SA-V val set and 9 zero-shot benchmarks is observed. --&gt;&lt;/li&gt;
&lt;/ul&gt;
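
&lt;p&gt;The Phase 1 Mask Alignment Score above is easy to state in code. A minimal NumPy sketch of my own (only the 0.75 IoU threshold comes from the description above; the function names are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def iou(mask_a, mask_b):
    # Intersection-over-union of two boolean masks.
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union &gt; 0 else 1.0

def phase1_alignment_score(masks, phase1_masks, thresh=0.75):
    # Percentage of masks whose IoU against the Phase 1 reference exceeds the threshold.
    hits = [iou(m, ref) &gt; thresh for m, ref in zip(masks, phase1_masks)]
    return 100.0 * np.mean(hits)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;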

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Analysis Results&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;&lt;strong&gt;Phase 3 is 8.4× faster than Phase 1,&lt;/strong&gt; has the lowest edited frame percentage and clicks per frame, and results in better alignment.&lt;br /&gt; 
&lt;strong&gt;Consistent segmentation accuracy improvement&lt;/strong&gt; from iteratively adding data from each data engine phase (1, 2 &amp;amp; 3) for both SA-V val set and 9 zero-shot benchmarks is observed.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;52-sa-v-dataset&quot;&gt;5.2 SA-V Dataset&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Videos (&lt;strong&gt;50.9K total, 54% indoor + 46% outdoor scenes, average duration of 14 secs&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;Masklets (&lt;strong&gt;190.9K manual + 451.7K automatic, 642.6K total&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;SA-V training, validation (&lt;strong&gt;293 masklets &amp;amp; 155 videos&lt;/strong&gt;) &amp;amp; test splits (&lt;strong&gt;278 masklets &amp;amp; 150 videos&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;Internal dataset (&lt;strong&gt;62.9K videos &amp;amp; 69.6K masklets&lt;/strong&gt; annotated in Phase &lt;em&gt;2 &amp;amp; 3&lt;/em&gt; for training; &lt;strong&gt;96 videos &amp;amp; 189 masklets&lt;/strong&gt; annotated using Phase &lt;em&gt;1&lt;/em&gt; for testing)&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;6-zero-shot-experiments&quot;&gt;6. Zero-Shot Experiments&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Compare SAM 2 with previous work on zero-shot video &amp;amp; image tasks using the &lt;em&gt;$J\&amp;amp;F$&lt;/em&gt; accuracy metric.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;61-promptable-video-segmentation-pvs&quot;&gt;6.1 Promptable Video Segmentation (PVS)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;First evaluate PVS, which involves simulating an interactive setting that is akin to the user experience. Both evaluation settings below are done on 9 densely annotated zero-shot video datasets using $N_{click} = 3$ clicks per frame and are compared to 2 strong baselines based on 2 SOTA VOS models (&lt;strong&gt;XMem++ &amp;amp; Cutie&lt;/strong&gt;)
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Offline evaluation:&lt;/strong&gt; multiple passes are made through the video, each time selecting the frame with the largest model error to prompt next.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Online evaluation:&lt;/strong&gt; a single forward pass through the video, annotating frames in order.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;62-semi-supervised-video-object-segmentation&quot;&gt;6.2 Semi-Supervised Video Object Segmentation&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Evaluate the semi-supervised VOS setting with click, box or mask prompts only on the &lt;strong&gt;1st frame&lt;/strong&gt; of the video.&lt;/li&gt;
  &lt;li&gt;For click prompts, try either 1, 3, or 5 clicks on the 1st video frame (mIoU).&lt;/li&gt;
  &lt;li&gt;Comparison is done via &lt;em&gt;$J\&amp;amp;F$&lt;/em&gt; accuracy between SAM + XMem++, SAM + Cutie and SAM 2.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;63-image-segmentation&quot;&gt;6.3 Image Segmentation&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Evaluate SAM 2 on the Segment Anything task across &lt;strong&gt;37 zero-shot datasets&lt;/strong&gt; using 1-click and 5-click mIoUs.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;7-comparison-to-sota-in-semi-supervised-vos&quot;&gt;7. Comparison to SOTA in Semi-Supervised VOS&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Evaluate 2 versions of SAM 2 with different image encoder sizes, offering different speed-vs-accuracy tradeoffs.&lt;/li&gt;
  &lt;li&gt;Comparison with existing SOTA models/methods via accuracy (&lt;em&gt;$J\&amp;amp;F, G$&lt;/em&gt;) using standard protocols.
&lt;!-- &gt;SAM 2 performs well in accuracy for video segmentation based on first-frame ground-truth mask prompts.  --&gt;&lt;/li&gt;
  &lt;li&gt;Evaluate existing work on the SA-V val &amp;amp; test sets which measure performance for open-world segments of &lt;strong&gt;“any”&lt;/strong&gt; object class via &lt;em&gt;$J\&amp;amp;F$&lt;/em&gt; accuracy metric.
&lt;!-- &gt;SAM 2 performs significantly better on SA-V val/test. --&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Comparison Results&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;SAM 2 &lt;strong&gt;performs well&lt;/strong&gt; in accuracy for video segmentation based on first-frame ground-truth mask prompts.&lt;br /&gt; 
SAM 2 performs &lt;strong&gt;significantly better&lt;/strong&gt; on SA-V val/test.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;8-conclusion&quot;&gt;8. Conclusion&lt;/h2&gt;

&lt;p&gt;Present a natural extension of Segment Anything into the video domain, based on 3 key aspects:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Extending the promptable segmentation task to video.&lt;/li&gt;
  &lt;li&gt;Equipping the SAM architecture to use memory when applied to video.&lt;/li&gt;
  &lt;li&gt;The diverse SA-V dataset for training &amp;amp; benchmarking video segmentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The authors position SAM 2 as a significant advancement in visual perception, framing their contributions as milestones that will propel further research &amp;amp; applications.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;9-discussion&quot;&gt;9. Discussion&lt;/h2&gt;

&lt;p&gt;When I first read about SAM 2’s memory bank component, my immediate thought was: &lt;em&gt;this feels familiar&lt;/em&gt;. Not because it copies prior work, but because it sits at an intersection I’ve been circling for a while: &lt;em&gt;how do neural systems remember what matters, without remembering everything?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;SAM 2 maintains object identity across video frames by storing two kinds of information in its memory bank:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Spatial feature maps&lt;/strong&gt; from up to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; recent frames and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M&lt;/code&gt; prompted frames (stored in FIFO queues)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Object pointers&lt;/strong&gt;: lightweight semantic vectors derived from the mask decoder’s output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Memory attention then cross-attends over &lt;em&gt;both&lt;/em&gt; when predicting the current frame. This design elegantly sidesteps the sequential bottleneck that plagued RNNs. Instead of compressing history into a single fixed-length hidden state, SAM 2 gives the model direct, position-invariant access to past representations. The “distance” between frame 1 and frame 200 is functionally the same as between frame 199 and 200. In that sense, attention acts as a &lt;strong&gt;temporal superhighway&lt;/strong&gt;.&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Distance vs. Capacity&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;&lt;strong&gt;Key distinction&lt;/strong&gt;: RNNs encode history into a single vector (making long-range dependencies hard). SAM 2’s attention mechanism bypasses that bottleneck entirely, so distance becomes irrelevant. But attention solves the &lt;em&gt;access&lt;/em&gt; problem, not the &lt;em&gt;retention&lt;/em&gt; problem.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;And that’s where the FIFO design matters. The paper explicitly states that the memory bank evicts the oldest frame once the queue is full, &lt;em&gt;regardless of semantic importance&lt;/em&gt;.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; This validates a subtle but critical observation: SAM 2’s forgetting mechanism is a &lt;strong&gt;fixed heuristic&lt;/strong&gt;, not a learned one. The model doesn’t decide what to remember based on future tracking utility; it drops frames based on arrival time.&lt;/p&gt;
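
&lt;p&gt;The eviction behavior itself is trivial to write down, which is part of the point. A minimal sketch of my own (SAM 2 stores spatial memories and object pointers; here they are opaque objects in a plain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;deque&lt;/code&gt;):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from collections import deque

class FIFOMemoryBank:
    # Time-based eviction: once the queue is full, the oldest entry is dropped,
    # regardless of how useful it might be for re-identifying the object later.
    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)

    def add(self, memory, object_pointer):
        self.entries.append((memory, object_pointer))  # oldest entry silently evicted when full

    def visible(self):
        return list(self.entries)  # everything memory attention can see for the current frame
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;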

&lt;p&gt;This creates a tangible trade-off between &lt;strong&gt;memory availability&lt;/strong&gt; and &lt;strong&gt;diagnostic utility&lt;/strong&gt;. Consider an object that:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Undergoes dramatic lighting change&lt;/li&gt;
  &lt;li&gt;Is occluded for dozens of frames&lt;/li&gt;
  &lt;li&gt;Reappears with significant appearance deformation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The frames that best capture its initial identity might be evicted long before it reappears. Meanwhile, newer frames with degraded or ambiguous representations linger in the queue simply because they arrived later. The “object pointers” partially mitigate this by storing lightweight semantic summaries, but they’re still bound by the same FIFO eviction policy. If the pointer for frame &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; is overwritten, the model loses not just the spatial map but also its high-level anchor.&lt;/p&gt;

&lt;div class=&quot;mermaid&quot; style=&quot;margin: 2rem 0;&quot;&gt;
flowchart LR
    A[FIFO Queue&lt;br /&gt;&lt;small&gt;time-based eviction&lt;/small&gt;] --&amp;gt; B[Eviction Policy&lt;br /&gt;&lt;small&gt;fixed heuristic&lt;/small&gt;]
    B --&amp;gt; C[Memory Attention&lt;br /&gt;&lt;small&gt;position-invariant access&lt;/small&gt;]
    C --&amp;gt; D{Tracking Outcome}
    D --&amp;gt;|Object reappears&lt;br /&gt;with useful memory| E[✓ Success]
    D --&amp;gt;|Key frame evicted&lt;br /&gt;or pointer overwritten| F[✗ Failure]
    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#e8f5e9
    style E fill:#c8e6c9,stroke:#2e7d32
    style F fill:#ffcdd2,stroke:#c62828
&lt;/div&gt;

&lt;p&gt;&lt;small class=&quot;text-muted d-block text-center&quot;&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; The memory bank’s fixed eviction policy (FIFO) interacts with attention’s position-invariant access. When evicted frames contain critical identity information, tracking fails—even if attention could theoretically retrieve them.&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;The paper’s handling of temporal position encoding reinforces this pragmatic trade-off. Temporal embeddings are injected into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; recent frames to capture short-term motion, but deliberately &lt;em&gt;omitted&lt;/em&gt; from prompted frames due to sparse training signals and inference generalization concerns. This is a sound engineering decision, but it reveals a boundary: SAM 2 optimizes for stable, short-to-medium horizon tracking, not open-ended temporal reasoning.&lt;/p&gt;

&lt;h3 id=&quot;where-this-fits-in-the-broader-literature&quot;&gt;Where This Fits in the Broader Literature&lt;/h3&gt;

&lt;p&gt;SAM 2’s memory bank isn’t operating in a vacuum. It shares conceptual DNA with several prior lines of work:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Neural Turing Machines&lt;/strong&gt; introduced differentiable external memory with read/write heads, allowing networks to learn &lt;em&gt;what&lt;/em&gt; to store and &lt;em&gt;where&lt;/em&gt; to retrieve from &lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. SAM 2’s memory attention is a specialized, non-differentiable cousin: it retrieves, but doesn’t learn the eviction policy.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;RETRO&lt;/strong&gt; demonstrated that retrieval-augmented transformers can scale knowledge without scaling parameters, by querying a frozen corpus at inference &lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. SAM 2 does something analogous for video: query a frozen buffer of past frames. The open question is whether that buffer should be learned, not fixed.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;TimeSformer&lt;/strong&gt; showed that spatiotemporal attention alone can handle video understanding without recurrent components &lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. SAM 2 extends this by adding explicit memory, and also inherits TimeSformer’s assumption that all frames are equally worth attending to.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What SAM 2 adds is a &lt;em&gt;practical&lt;/em&gt; instantiation of these ideas for promptable segmentation. It’s not trying to solve general memory-augmented reasoning; it’s solving “keep this object tracked, please.” That focus is both its strength and its limitation.&lt;/p&gt;

&lt;div class=&quot;callout callout--danger&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Counter Argument to the Paper&apos;s Memory Design&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;About memory design, I’d say: &lt;em&gt;“FIFO is computationally cheap and training-stable, which makes sense for a production model. But from a research standpoint, it hardcodes a failure mode: important frames get evicted by time, not relevance. A differentiable memory controller or retrieval-augmented eviction policy could close that gap.”&lt;/em&gt;&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;why-this-matters&quot;&gt;Why This Matters&lt;/h3&gt;

&lt;p&gt;There’s a tendency in ML to treat architectural choices as purely engineering decisions. But memory management isn’t just about compute budgets; it’s also about what the model &lt;em&gt;values&lt;/em&gt;. When SAM 2 drops a frame because it’s old, not because it’s uninformative, it’s making a silent claim: &lt;em&gt;recency matters more than relevance&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That claim works well for many tracking scenarios. But it breaks down when objects reappear after long occlusions, or when appearance changes dramatically. In those cases, the model isn’t failing because attention is weak; it’s failing because the &lt;em&gt;right information was never kept around to attend to&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This isn’t unique to SAM 2. It’s a fundamental tension in any system that must balance finite memory against infinite context. But because SAM 2 is a foundation model positioned for broad adoption, its design choices will influence how thousands of downstream applications handle temporal reasoning. Getting the memory story right matters.&lt;/p&gt;

&lt;p&gt;So where does this leave us? SAM 2 is undoubtedly a milestone in promptable video segmentation. But its memory bank inadvertently frames a deeper research problem: &lt;strong&gt;attention removes the barrier of temporal distance, but leaves the bottleneck of memory management wide open&lt;/strong&gt;.&lt;/p&gt;

&lt;h3 id=&quot;open-questions-id-want-to-explore&quot;&gt;Open Questions I’d Want to Explore&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Learnable eviction&lt;/strong&gt;: Could we replace FIFO with a lightweight, content-aware, learnable eviction mechanism that predicts which frames to retain based on tracking confidence, appearance stability, or semantic salience? Would SAM 2 then maintain robust identity over arbitrarily long horizons? And what compute trade-off would that entail in moving from an engineered heuristic toward a long-context reasoning model? (A toy sketch of such a scorer follows this list.)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pointer robustness&lt;/strong&gt;: The object pointers are a clever compression trick, but they’re still overwritten by FIFO. Could we decouple pointer retention from spatial memory eviction?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cross-video retrieval&lt;/strong&gt;: RETRO retrieves from a corpus of documents; could SAM 2 retrieve from a corpus of &lt;em&gt;past videos&lt;/em&gt; to bootstrap tracking of familiar objects?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Failure diagnostics&lt;/strong&gt;: Can we design a probe that predicts &lt;em&gt;when&lt;/em&gt; SAM 2 is likely to lose an object, based on memory bank state? That would be valuable for safety-critical applications.&lt;/li&gt;
&lt;/ol&gt;
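
&lt;p&gt;To make the first question concrete, here is a toy sketch (entirely hypothetical, not from the paper) of what a content-aware retention policy could look like: score each stored memory with a small learned network and evict the lowest-scoring entry instead of the oldest one.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch
import torch.nn as nn

class RetentionScorer(nn.Module):
    # Hypothetical: maps a pooled memory feature to a scalar keep-score.
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, pooled_memories):            # (num_entries, dim)
        return self.net(pooled_memories).squeeze(-1)

def evict_least_useful(entries, scorer):
    # entries: list of (pooled_feature, payload); drop the entry the scorer values least,
    # instead of the oldest one as FIFO would.
    feats = torch.stack([feat for feat, _ in entries])
    drop = int(scorer(feats).argmin())
    return [e for i, e in enumerate(entries) if i != drop]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;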

&lt;p&gt;I don’t have answers to these yet. But they feel like the right questions to ask if we want video models that don’t just see, but remember. SAM 2 shows we’ve mastered &lt;em&gt;access&lt;/em&gt; to the past. The next step is mastering &lt;em&gt;retention&lt;/em&gt; of what matters.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026sam2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;SAM 2: Segment Anything in Images &amp;amp; Videos&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Apr&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/04/17/sam-2/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Ravi N, Gabeur V, Hu Y-T, et al. &lt;a href=&quot;https://arxiv.org/pdf/2408.00714&quot;&gt;SAM 2: Segment Anything in Images and Videos&lt;/a&gt;. arXiv. 2024. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Graves A, Wayne G, Danihelka I. &lt;a href=&quot;https://arxiv.org/abs/1410.5401&quot;&gt;Neural Turing Machines&lt;/a&gt;. arXiv. 2014. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Borgeaud S, Mensch A, Hoffmann J, et al. &lt;a href=&quot;https://arxiv.org/abs/2112.04426&quot;&gt;Improving Language Models by Retrieving from Trillions of Tokens&lt;/a&gt;. arXiv. 2022. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Bertasius G, Wang H, Torresani L. &lt;a href=&quot;https://arxiv.org/abs/2102.05095&quot;&gt;Is Space-Time Attention All You Need for Video Understanding?&lt;/a&gt;. arXiv. 2021. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Fri, 17 Apr 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/04/17/sam-2/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/04/17/sam-2/</guid>
        
        
      </item>
    
      <item>
        <title>Muon &amp; MuonClip Optimizers</title>
        <description>&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Muon&lt;/strong&gt; stands for &lt;strong&gt;M&lt;/strong&gt;oment&lt;strong&gt;U&lt;/strong&gt;m &lt;strong&gt;O&lt;/strong&gt;rthogonalized by &lt;strong&gt;N&lt;/strong&gt;ewton-Schulz and was invented by &lt;a href=&quot;https://kellerjordan.github.io/&quot;&gt;Keller Jordan&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;&lt;strong&gt;The key idea:&lt;/strong&gt;&lt;/em&gt; Instead of applying Adam-style per-element adaptive updates to model parameters, Muon orthogonalizes the momentum matrix before using it as the update direction.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#adam-optimizer&quot;&gt;Adam Optimizer&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#matrix-orthogonalization&quot;&gt;Matrix Orthogonalization&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#newton-schulz-5-iteration&quot;&gt;Newton-Schulz 5 Iteration&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#muon&quot;&gt;Muon&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#qk-clip&quot;&gt;QK-Clip&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#multihead-latent-attention-mla&quot;&gt;Multihead Latent Attention (MLA)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#muonclip&quot;&gt;MuonClip&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#references&quot;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;a href=&quot;https://x.com/Kimi_Moonshot&quot;&gt;Moonshot AI&lt;/a&gt;, creator of Kimi, pioneered the improvements to Muon (MuonClip) which I dive into in detail in this post. But they have an article on X with more academic details on why they chose Muon titled &lt;em&gt;“Why We Chose Muon: Our Chain of Thought”&lt;/em&gt;&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; by &lt;a href=&quot;https://x.com/Jianlin_S&quot;&gt;Jianlin Su&lt;/a&gt;, the first author of &lt;a href=&quot;https://arxiv.org/abs/2104.09864&quot;&gt;RoPE&lt;/a&gt; (Rotary Position Embedding).&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;zxx&quot; dir=&quot;ltr&quot;&gt;&lt;a href=&quot;https://t.co/dxZnLxvPae&quot;&gt;https://t.co/dxZnLxvPae&lt;/a&gt;&lt;/p&gt;&amp;mdash; Kimi.ai (@Kimi_Moonshot) &lt;a href=&quot;https://twitter.com/Kimi_Moonshot/status/1897929976948965870?ref_src=twsrc%5Etfw&quot;&gt;March 7, 2025&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;This blog post is directly from my personal handwritten notes on studying Muon &amp;amp; MuonClip. I posted those &lt;a href=&quot;https://x.com/latentchiz/status/2040617828856803690?s=20&quot;&gt;notes on X&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;My notes on the Muon optimizer by &lt;a href=&quot;https://twitter.com/kellerjordan0?ref_src=twsrc%5Etfw&quot;&gt;@kellerjordan0&lt;/a&gt; and MuonClip by &lt;a href=&quot;https://twitter.com/Kimi_Moonshot?ref_src=twsrc%5Etfw&quot;&gt;@Kimi_Moonshot&lt;/a&gt; which integrates Muon with weight decay, RMS matching, &amp;amp; QK-Clip.&lt;br /&gt;&lt;br /&gt;&amp;quot;Using MuonClip, we successfully pre-trained Kimi K2 on 15.5 trillion tokens without a single loss spike.&amp;quot; - &lt;a href=&quot;https://twitter.com/Kimi_Moonshot?ref_src=twsrc%5Etfw&quot;&gt;@Kimi_Moonshot&lt;/a&gt; &lt;a href=&quot;https://t.co/KW84xOi51E&quot;&gt;pic.twitter.com/KW84xOi51E&lt;/a&gt;&lt;/p&gt;&amp;mdash; Chiz (@latentchiz) &lt;a href=&quot;https://twitter.com/latentchiz/status/2040617828856803690?ref_src=twsrc%5Etfw&quot;&gt;April 5, 2026&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;adam-optimizer&quot;&gt;Adam Optimizer&lt;/h2&gt;
&lt;p&gt;Let’s start by looking at Adam optimizer, the most common optimizer for training neural networks, which I cover in &lt;a href=&quot;https://chizkidd.github.io/2026/01/22/neural-net-optimizers/#8-muon-momentum-orthogonalized-by-newton-schulz&quot;&gt;this article guide&lt;/a&gt; on common deep learning optimizers. &lt;strong&gt;Adam&lt;/strong&gt; stands for &lt;strong&gt;Ada&lt;/strong&gt;ptive &lt;strong&gt;M&lt;/strong&gt;oment Estimation.&lt;/p&gt;

&lt;p&gt;It combines momentum and adaptive learning rates so the model not only remembers the direction it has been moving in, but also adjusts how big each step should be for every parameter. More specifically, it combines momentum (first moment) and RMSProp (second moment) with bias corrections to handle noisy gradients and early training instability.&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; The forward pass looks like:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/model-learning.png&quot; alt=&quot;model learning&quot; /&gt;&lt;/p&gt;

&lt;!-- $$x \longrightarrow \boxed{\Theta} \longrightarrow \hat{y} \longrightarrow \boxed{\text{Loss},\ L} \longrightarrow y_{gt}$$

$${\text{input}} \quad {\text{model}} \quad {\text{prediction}} \hspace{2em} \quad  {\text{ground truth}}$$ --&gt;

&lt;h3 id=&quot;gradient-descent&quot;&gt;Gradient Descent&lt;/h3&gt;

&lt;p&gt;Gradient descent updates the model parameters by stepping in the direction that reduces the loss, using the gradient as a guide for how to adjust each parameter. The momentum and velocity terms refine this process by smoothing past gradients and scaling updates adaptively, which helps stabilize training and converge faster, especially in noisy or complex loss landscapes.&lt;/p&gt;

\[\begin{aligned}
\text{simply} \quad &amp;amp; \Theta_i \leftarrow \Theta_{i-1} - \alpha \frac{\partial L}{\partial \Theta_i} \\[6pt]
\text{momentum} \quad &amp;amp; M_i \leftarrow \beta_1 M_{i-1} + (1 - \beta_1) \frac{\partial L}{\partial \Theta_i} \\[6pt]
\text{velocity} \quad &amp;amp; V_i \leftarrow \beta_2 V_{i-1} + (1 - \beta_2) \left(\frac{\partial L}{\partial \Theta_i}\right)^2 \\[6pt]
&amp;amp; \Theta_i \leftarrow \Theta_{i-1} - \alpha \left(\frac{M_i}{\sqrt{V_i} + \varepsilon}\right)
\end{aligned}\]

\[\begin{aligned}
\text{where} \\
M_i &amp;amp;\equiv \text{momentum (1st moment)} \\
V_i &amp;amp;\equiv \text{velocity: gradient squared (2nd moment)} \\
\beta_1, \beta_2 &amp;amp;\equiv \text{decay hyperparameters} \\
\varepsilon &amp;amp;\equiv \text{small constant for numerical stability}
\end{aligned}\]
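
&lt;p&gt;The update above translates almost line-for-line into code. A minimal NumPy sketch of a single Adam step (including the bias corrections mentioned earlier, which the equations above omit):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update, following the momentum / velocity equations above (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad           # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2        # 2nd moment (velocity)
    m_hat = m / (1 - beta1**t)                   # bias corrections for early steps
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;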

&lt;h3 id=&quot;cons-of-adam&quot;&gt;Cons of Adam&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Memory intensive.&lt;/li&gt;
  &lt;li&gt;Challenging hyper-parameter tuning.&lt;/li&gt;
  &lt;li&gt;Each value is updated independently, as if the parameters formed one long vector, without considering any internal (matrix) structure of the model parameters – vector-based optimizer behavior.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Can we explicitly account for the underlying matrix structure of the model parameters?&lt;/p&gt;

&lt;h3 id=&quot;the-linear-layer--matrix-momentum&quot;&gt;The Linear Layer &amp;amp; Matrix Momentum&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;For a linear layer with 3 inputs $(x_1, x_2, x_3)$ and 4 outputs $(z_1, z_2, z_3, z_4)$:&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/linear-layer.png&quot; alt=&quot;Linear layer&quot; /&gt;&lt;/p&gt;

\[z_i = \Theta_{i1} x_1 + \Theta_{i2} x_2 + \Theta_{i3} x_3, \quad \forall\, i = 1, 2, 3, 4\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In matrix form:&lt;/p&gt;

\[\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix}
  =
  \begin{bmatrix}
  \Theta_{11} &amp;amp; \Theta_{12} &amp;amp; \Theta_{13} \\
  \Theta_{21} &amp;amp; \Theta_{22} &amp;amp; \Theta_{23} \\
  \Theta_{31} &amp;amp; \Theta_{32} &amp;amp; \Theta_{33} \\
  \Theta_{41} &amp;amp; \Theta_{42} &amp;amp; \Theta_{43}
  \end{bmatrix}
  \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Since $\Theta$ is now a matrix, we need a &lt;strong&gt;matrix momentum&lt;/strong&gt; $\hat{M}$ for $\hat{\Theta}$:&lt;/p&gt;

\[\hat{M} =
  \begin{bmatrix}
  m_{11} &amp;amp; m_{12} &amp;amp; m_{13} \\
  m_{21} &amp;amp; m_{22} &amp;amp; m_{23} \\
  m_{31} &amp;amp; m_{32} &amp;amp; m_{33} \\
  m_{41} &amp;amp; m_{42} &amp;amp; m_{43}
  \end{bmatrix}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;So the matrix momentum update is:&lt;/p&gt;

\[\hat{M}_i \leftarrow \beta \hat{M}_i + \frac{\partial L}{\partial \hat{\Theta}_i}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;the-problem-with-vector-based-optimizers&quot;&gt;The Problem with Vector-Based Optimizers&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;With vector-based optimizers like Adam, the momentum for a linear layer (a 2D matrix) tends to become &lt;strong&gt;almost low rank&lt;/strong&gt; in practice.&lt;/li&gt;
  &lt;li&gt;Essentially, only a small number of dominant directions really drive the update, while the many remaining directions contribute very little.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; How can we tackle this update direction imbalance and what makes a good optimizer?&lt;/p&gt;

&lt;p&gt;From fundamental first principles, a good optimizer possesses two characteristics: &lt;strong&gt;stability&lt;/strong&gt; and &lt;strong&gt;speed.&lt;/strong&gt; The goal of each update of a good optimizer is to minimize model variance and maximize loss reduction contribution, which correspond to stability and speed respectively.&lt;sup id=&quot;fnref:5:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;matrix-orthogonalization&quot;&gt;Matrix Orthogonalization&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;We can fix the imbalance of update directions via orthogonalization.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Orthogonalize&lt;/strong&gt; the momentum matrix. This is where &lt;strong&gt;Muon&lt;/strong&gt; comes in.&lt;/li&gt;
  &lt;li&gt;It amplifies the effect of &lt;strong&gt;rare&lt;/strong&gt; directions – the directions that typically receive small or infrequent updates.&lt;/li&gt;
  &lt;li&gt;Even though these rare directions seem minor, they are often essential for effective learning and can help capture more nuanced patterns in the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;orthogonalization-via-svd&quot;&gt;Orthogonalization via SVD&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;We want the orthogonal matrix $O$ closest to $M$:&lt;/p&gt;

\[\text{Ortho}(M) = \arg\min_{O} \left\{\| O - M \|_F\right\} \quad \text{subject to } OO^T = I \text{ or } O^TO = I\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Using &lt;strong&gt;SVD&lt;/strong&gt; (Singular Value Decomposition): $M = USV^T$, where:&lt;/p&gt;

\[\begin{aligned}
  UU^T &amp;amp;= U^TU = I \\
  VV^T &amp;amp;= V^TV = I
  \end{aligned}
  \quad \bigg\} \text{ orthonormal matrices}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Since $OO^T = O^TO = I$:&lt;/p&gt;

\[\therefore\quad O = USV^T \quad \text{where } S = \begin{bmatrix} 1 &amp;amp; &amp;amp; 0 \\ &amp;amp; \ddots &amp;amp; \\ 0 &amp;amp; &amp;amp; 1 \end{bmatrix} \equiv \text{unit diagonal matrix}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Issue:&lt;/strong&gt; SVD on a matrix is computationally expensive.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use an odd polynomial matrix:&lt;/p&gt;

\[p(X) = aX + b(XX^T)X\]
  &lt;/li&gt;
&lt;/ul&gt;
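
&lt;p&gt;Before moving on to the polynomial trick, the exact solution is worth writing down: once you have the SVD, the nearest (semi-)orthogonal matrix is just $UV^T$, i.e. all singular values replaced by 1. The expense is in computing the SVD itself. A minimal NumPy sketch:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def ortho_svd(M):
    # Nearest (semi-)orthogonal matrix to M in the Frobenius norm: O = U V^T.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;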

&lt;h3 id=&quot;odd-polynomial-matrix&quot;&gt;Odd Polynomial Matrix&lt;/h3&gt;

\[\begin{align*}
p(M) &amp;amp;= aM + b(MM^T)M \\
&amp;amp;= \left(aI + b(MM^T)\right)M \\
&amp;amp;= \left(aI + b(USV^T VS U^T)\right)USV^T \\
&amp;amp;= \left(aI + b(US^2 U^T)\right)USV^T \\
&amp;amp;= aUSV^T + bUS^2 U^T USV^T \\
p(M) &amp;amp;= aUSV^T + bUS^3V^T \\
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;This applies to any odd polynomial, so in general:&lt;/p&gt;

\[p(M) = U\!\left(aS + bS^3 + cS^5 + \ldots + (\text{const})\, S^{2n+1}\right)V^T, \quad \forall\, n \geq 0,\ n \to \infty\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Sticking to the &lt;strong&gt;5th order polynomial&lt;/strong&gt;:&lt;/p&gt;

\[\boxed{p(M) = U\!\left(aS + bS^3 + cS^5\right)V^T}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;We need to determine the coefficients $(a, b, c)$. The goal is to get $S$ to a unit diagonal matrix – i.e., get the diagonal values as close to 1 as possible.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Plot $y = p(x)$ against $x$. If $(a, b, c) = (1.5,\ -0.5,\ 0)$:&lt;/p&gt;

\[y = 1.5x - 0.5x^3\]

    &lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/newton-schulz5.png&quot; alt=&quot;Newton-Schulz5-1&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Applying this repeatedly via &lt;strong&gt;Newton-Schulz iteration&lt;/strong&gt;:&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
y_1 &amp;amp;= p(x) \\
y_2 &amp;amp;= p(p(x)) \\
y_3 &amp;amp;= p(p(p(x))) \\
y_4 &amp;amp;= p(p(p(p(x)))) \\
y_5 &amp;amp;= p(p(p(p(p(x))))) \\
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;Each $y_k$ represents one more composition:  $y_1 \to y_5$ are multiple iterations aimed at converging the singular values toward 1.&lt;/li&gt;
&lt;/ul&gt;
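
&lt;p&gt;Because the matrix iteration acts on each singular value independently, the composition can be checked on plain scalars. A quick sketch with the $(1.5,\ -0.5,\ 0)$ coefficients used above:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def p(x, a=1.5, b=-0.5, c=0.0):
    return a * x + b * x**3 + c * x**5

x = np.linspace(0.1, 1.2, 12)   # stand-ins for singular values
for k in range(5):              # y_1 through y_5
    x = p(x)
print(x)  # each composition pushes the values toward 1; small inputs move slowly,
          # which is what the tuned coefficients in the next section speed up
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;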

&lt;hr /&gt;

&lt;h2 id=&quot;newton-schulz-5-iteration&quot;&gt;Newton-Schulz-5 Iteration&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;After 5 iterations, almost all input values end up very close to 1.&lt;/li&gt;
  &lt;li&gt;We can change $(a, b, c)$ to see the effect on convergence of $y$ to 1.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;$(a, b, c) = (2,\ -1.5,\ 0.5)$ speeds up the convergence to 1.&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/newton-schulz5-converged.png&quot; alt=&quot;Newton-Schulz5-2&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Empirically, we don’t need the singular values to converge to exactly 1.&lt;/li&gt;
  &lt;li&gt;Let’s set an upper &amp;amp; lower bound, e.g. $(0.7,\ 1.3)$, which is basically $[1 - \varepsilon, 1 + \varepsilon]$.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Tuned coefficients:&lt;/p&gt;

\[\boxed{(a,\ b,\ c) = (3.4445,\ -4.775,\ 2.0315)}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Each Newton-Schulz 5 iteration involves only &lt;strong&gt;matrix multiplication&lt;/strong&gt;:&lt;/p&gt;

\[X \leftarrow aX + b(XX^T)X + c(XX^T)^2 X\]
  &lt;/li&gt;
  &lt;li&gt;With GPUs, no need to use SVD since GPUs can efficiently compute matrix multiplication.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;newtonschulz5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;eps&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1e-7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndim&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;3.4445&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;4.7750&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;2.0315&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bfloat16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;norm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;eps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;B&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;G&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Retrieved from https://kellerjordan.github.io/posts/muon/
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
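
&lt;p&gt;A quick way to sanity-check the routine above (assuming PyTorch is available) is to feed it a random matrix and inspect the singular values before and after; they should end up much closer to 1 than they started, without being driven to exactly 1:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

G = torch.randn(256, 128)
O = newtonschulz5(G, steps=5)           # uses the function defined above

# singular values of the raw matrix are spread out; after the iteration they
# should cluster roughly around 1, consistent with the loose bounds discussed earlier
print(torch.linalg.svdvals(G))
print(torch.linalg.svdvals(O.float()))  # cast back from bfloat16 before the SVD
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;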
&lt;hr /&gt;

&lt;h2 id=&quot;muon&quot;&gt;Muon&lt;/h2&gt;

&lt;p&gt;Muon is designed specifically for 2D weight matrices in neural network hidden layers.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; Unlike traditional optimizers that treat each parameter independently, Muon leverages the geometric structure of weight matrices by orthogonalizing the momentum-accumulated gradient using the Newton-Schulz iteration. This focus on linear (matrix-shaped) layers aligns with ongoing research arguing that different layer types require different optimizers due to their varying geometry.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The optimizer formulates weight updates as a constrained optimization problem in the RMS-to-RMS operator norm space:&lt;/p&gt;

\[\text{Ortho}(M) = \arg\min_{O} \left\{\| O - M \|_F\right\} \quad \text{subject to } OO^T = I \text{ or } O^TO = I\]

&lt;p&gt;Where $M$ is the gradient matrix. The solution involves projecting the gradient onto the set of orthogonal matrices, which aims to standardize all singular values to 1 while preserving gradient directions.&lt;/p&gt;

&lt;h3 id=&quot;pseudo-algorithm-for-muon&quot;&gt;Pseudo-Algorithm for Muon&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;for t = 1, 2, ..., do:
    Compute gradient      G_t  ←  ∇L_t(θ_{t-1})
    Compute momentum      M_t  ←  βM_{t-1} + G_t
    Normalize             M&apos;_t ←  M_t / ||M_t||_F
    Orthogonalization     O_t  ←  NewtonSchulz5(M&apos;_t)
    Update parameter      θ_t  ←  θ_{t-1} - α O_t
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;With &lt;strong&gt;weight decay&lt;/strong&gt; $(+)$:&lt;/p&gt;

\[\Theta_t \leftarrow \Theta_{t-1} - \alpha\!\left(O_t + \lambda\Theta_{t-1}\right)\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;With &lt;strong&gt;adjusted learning rate&lt;/strong&gt; $(+)$:&lt;/p&gt;

\[\Theta_t \leftarrow \Theta_{t-1} - \alpha\!\left[\left(0.2\sqrt{\max(n,m)}\right) O_t + \lambda\Theta_{t-1}\right]\]

\[\begin{aligned}
  \text{where} \\
  \lambda &amp;amp;\equiv \text{weight decay coefficient} \\
  n, m &amp;amp;\equiv \text{dimensions of the 2D parameter matrix}
  \end{aligned}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;muon--weight-decay--rms-alignment&quot;&gt;Muon + Weight Decay + RMS Alignment&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Weight decay is used to address the diminished performance gains of Muon over AdamW when scaling up to train a larger model.&lt;/li&gt;
  &lt;li&gt;The learning rate also gets adjusted by taking into account the &lt;strong&gt;size of the 2D matrix&lt;/strong&gt;. This is the underlying principle behind the &lt;strong&gt;RMS (Root Mean Squared) Alignment.&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;To keep the per-matrix update RMS consistent (around 1) across matrices of different shapes, the Muon update for a full-rank weight matrix of shape $[A,\ B]$ is scaled by $\sqrt{\max(A,\ B)}$.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
      &lt;li&gt;The $0.2$ factor is used to match Muon’s update RMS to that of AdamW. From empirical observations, AdamW’s update RMS is usually around $0.2$ to $0.4$.&lt;sup id=&quot;fnref:2:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;, &lt;sup id=&quot;fnref:5:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;These 2 improvements (weight decay &amp;amp; adjusting the per-parameter update scale) help to stabilize the training of large models (a combined update sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
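
&lt;p&gt;Putting the pieces together, a single Muon update for one 2D weight matrix (momentum, Newton-Schulz orthogonalization, the $0.2\sqrt{\max(n,\ m)}$ RMS-matching factor, and decoupled weight decay) looks roughly like this. A sketch assuming PyTorch and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;newtonschulz5&lt;/code&gt; routine above, not the reference implementation:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch

def muon_step(theta, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.01):
    # One Muon update for a 2D parameter matrix theta, following the
    # pseudo-algorithm above plus weight decay and RMS alignment.
    momentum = beta * momentum + grad                # M_t = beta * M_{t-1} + G_t
    O = newtonschulz5(momentum).to(theta.dtype)      # normalizes internally, then orthogonalizes
    n, m = theta.shape
    scale = 0.2 * max(n, m) ** 0.5                   # match AdamW-like update RMS
    theta = theta - lr * (scale * O + weight_decay * theta)   # decoupled weight decay
    return theta, momentum
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;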

&lt;h3 id=&quot;the-exploding-attention-logit-crisis&quot;&gt;The Exploding Attention Logit Crisis&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Issue:&lt;/strong&gt; Attention logits can grow larger &amp;amp; larger as training continues, which may cause the training process to become unstable.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Consider a sequence of 4 tokens &amp;amp; assume self-attention for simplicity. Each token is mapped to an embedding vector of dimension $d$. Let the embedding matrix be $X$.&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/self-attention.png&quot; alt=&quot;Self-attention architecture diagram&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Self-Attention&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;$O = \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;In self attention above, $Q = XW^Q$, $K = XW^K$, and $V = XW^V$.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;strong&gt;attention logits&lt;/strong&gt; $S$:&lt;/p&gt;

\[\begin{align*}
  S &amp;amp;= QK^T \\
  &amp;amp;= (XW^Q)(XW^K)^T \\
  &amp;amp;= X\!\left(W^Q W^{K^T}\right)X^T
  \end{align*}\]

    &lt;p&gt;where $X$ and $X^T$ denote the embedding vectors, which are typically normalized to have unit norms.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;To prevent the attention logits from becoming excessively large, we must control the scale of $W^Q$ and $W^K$:&lt;/p&gt;

\[S = X\underbrace{\left(W^Q W^{K^T}\right)}_{\text{scale control}}X^T\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;qk-clip&quot;&gt;QK-Clip&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;A common strategy is to apply a scaling factor to these matrices.&lt;/li&gt;
  &lt;li&gt;During training, monitor the maximum value of the attention logits, $S_{\max}$. If it exceeds a certain threshold $\tau$, calculate a scaling ratio $\gamma$:&lt;/li&gt;
&lt;/ul&gt;

\[\text{if } S_{\max} &amp;gt; \tau: \quad \gamma = \frac{\tau}{S_{\max}} \implies \frac{\tau}{S_{\max}} &amp;lt; 1\]

&lt;ul&gt;
  &lt;li&gt;Directly constrains attention logits, ensuring they stay within a safe range by rescaling the query &amp;amp; key projection weights.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; Scale the relevant model parameters by $\gamma$ when the attention logits surpass the threshold. Scale both $W^Q$, $W^K$ by $\sqrt{\gamma}$:&lt;/p&gt;

\[\begin{align*}
S &amp;amp;= X\!\left(\gamma W^Q W^{K^T}\right)X^T \\
&amp;amp;= X\!\left(\sqrt{\gamma}\, W^Q\ \sqrt{\gamma}\, W^{K^T}\right)X^T
\end{align*}\]

&lt;p&gt;&lt;strong&gt;Revised pseudo-algorithm (QK-Clip):&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;θ_t ← MuonOptimizer(θ_{t-1}, G_t)
if S_max &amp;gt; τ:
    γ   = τ / S_max
    W^Q ← √γ W^Q
    W^K ← √γ W^K
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
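
&lt;p&gt;As a sanity check on the idea (a minimal numpy sketch with a hypothetical threshold value, not the authors’ implementation):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def qk_clip(W_Q, W_K, S_max, tau=100.0):
    # Rescale the query/key projections only when the observed max logit exceeds tau.
    if S_max &amp;gt; tau:
        gamma = tau / S_max              # gamma is below 1 by construction
        W_Q = np.sqrt(gamma) * W_Q       # sqrt(gamma) on each factor ...
        W_K = np.sqrt(gamma) * W_K       # ... so the logits shrink by gamma overall
    return W_Q, W_K
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;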

&lt;h3 id=&quot;qk-clip-for-multi-head-attention&quot;&gt;QK-Clip for Multi-Head Attention&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;In practice, self-attention consists of multiple heads ($n_{\text{heads}} = h$).&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;When the maximum attention logits exceed the threshold, instead of rescaling all heads by the same factor (which would needlessly shrink heads whose logits are still well behaved), we introduce an &lt;strong&gt;individual scaling factor for each head&lt;/strong&gt; to control their logits separately.&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/MHA.png&quot; alt=&quot;Multi-head attention architecture diagram&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Multi-Head Attention&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;$O = \text{MultiheadAttention}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)\, W_o$&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Attention logits per head:&lt;/p&gt;

\[S_h = X\!\left(\sqrt{\gamma_h}\, W_h^Q\ \sqrt{\gamma_h}\, W_h^{K^T}\right)X^T\]

\[\text{if } S_{\max}^h &amp;gt; \tau \quad \text{then} \quad \gamma_h = \frac{\tau}{S_{\max}^h}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Algorithm for Muon + QK-Clip for MHA:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;θ_t ← MuonOptimizer(θ_{t-1}, G_t)
if S^h_max &amp;gt; τ:
    γ_h   = τ / S^h_max
    W^Q_h ← √γ_h W^Q_h
    W^K_h ← √γ_h W^K_h
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;multihead-latent-attention-mla&quot;&gt;Multihead Latent Attention (MLA)&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;For Multihead Latent Attention (MLA), introduced by DeepSeek, things get trickier.&lt;/p&gt;

    &lt;p&gt;&lt;!-- $$O = \text{Multihead Latent Attention}(Q, K, V)$$ --&gt;&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/assets/images/2026/muon/MLA.png&quot; alt=&quot;MLA architecture diagram&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;MLA compresses the $Q, K, V$ representations into a low-rank space to reduce the size of the KV cache, using &lt;strong&gt;down-projection matrices&lt;/strong&gt; that produce latent representations:&lt;/p&gt;

\[C^Q = XW^Q_\downarrow \qquad C^{KV} = XW^{KV}_\downarrow\]
  &lt;/li&gt;
  &lt;li&gt;These compressed latent vectors are then mapped back up for each attention head using the corresponding &lt;strong&gt;up-projection matrices&lt;/strong&gt; $W^Q_\uparrow$, $W^K_\uparrow$, $W^V_\uparrow$.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Issue:&lt;/strong&gt; This low-rank KV compression fails with rotary position embedding (RoPE).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fix:&lt;/strong&gt; A decoupled RoPE technique which introduces extra multi-head queries $W^{QR}$ and a shared key $W^{KR}$ to encode positional information.&lt;/li&gt;
  &lt;li&gt;For MLA, the Query, Key &amp;amp; Values are regrouped for each head:
    &lt;ul&gt;
      &lt;li&gt;The &lt;strong&gt;Query&lt;/strong&gt; is constructed by concatenating the compressed query $Q^C$ with the rotated query $Q^R$.&lt;/li&gt;
      &lt;li&gt;The &lt;strong&gt;Key&lt;/strong&gt; is constructed similarly by concatenating the compressed key $K^C$ with the rotated key $K^R$.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Multihead Latent Attention (MLA)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;MLA compresses $Q, K, V$ representations into a low-rank latent space via down-projection matrices $W^Q_\downarrow$ and $W^{KV}_\downarrow$, then maps back up via up-projection matrices. A decoupled RoPE technique adds rotary queries $W^{QR}$ and a shared rotary key $W^{KR}$.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;muonclip&quot;&gt;MuonClip&lt;/h2&gt;

&lt;p&gt;In MLA, it is important to carefully decide how to rescale these 4 matrices: $W^Q_\uparrow,\ W^{QR},\ W^{KR},\ W^K_\uparrow$.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;For the &lt;strong&gt;up-projection matrices&lt;/strong&gt; $W^Q_\uparrow$, $W^K_\uparrow$: rescale these parameters for each head individually.&lt;/li&gt;
  &lt;li&gt;Rescaling the &lt;strong&gt;RoPE components&lt;/strong&gt; $W^{QR}$, $W^{KR}$ is trickier:
    &lt;ul&gt;
      &lt;li&gt;Each head has its &lt;strong&gt;own&lt;/strong&gt; rotary query $W^{QR}$.&lt;/li&gt;
      &lt;li&gt;But all heads &lt;strong&gt;share&lt;/strong&gt; a single rotary key matrix $W^{KR}$.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Issue:&lt;/strong&gt; Applying the same per-head scaling for both RoPE components leads to the shared $W^{KR}$ being rescaled multiple times, which is undesirable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Rescale only the head-specific rotary query $W^{QR}$ by their respective $\gamma^h$, while leaving the shared rotary key matrix $W^{KR}$ unchanged. This technique is called &lt;strong&gt;MuonClip&lt;/strong&gt;. MuonClip improves upon Muon with the QK-Clip technique to handle training instability while benefiting from Muon’s advanced token efficiency.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MuonClip Algorithm:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;θ_t ← MuonOptimizer(θ_{t-1}, G_t)
if S^h_max &amp;gt; τ:
    γ_h = τ / S^h_max
    W^Q_{h,↑}  ← √γ_h  W^Q_{h,↑}
    W^K_{h,↑}  ← √γ_h  W^K_{h,↑}
    W^{QR}_h   ← γ_h   W^{QR}_h
    W^{KR}     ← W^{KR}   (unchanged)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In math:&lt;/p&gt;

\[\begin{aligned}
\gamma_h &amp;amp;= \frac{\tau}{S_{\max}^h} \\[6pt]
W_{h,\uparrow}^Q &amp;amp;\leftarrow \sqrt{\gamma_h}\, W_{h,\uparrow}^Q \\
W_{h,\uparrow}^K &amp;amp;\leftarrow \sqrt{\gamma_h}\, W_{h,\uparrow}^K \\
W_h^{QR} &amp;amp;\leftarrow \gamma_h\, W_h^{QR} \\
W^{KR} &amp;amp;\leftarrow W^{KR} \quad \text{(unchanged)}
\end{aligned}\]
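
&lt;p&gt;A minimal numpy sketch of this per-head rescaling (my own illustrative layout: per-head matrices stored in lists, a hypothetical threshold, and the Muon update itself elided):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def muonclip_rescale(W_Q_up, W_K_up, W_QR, S_max_per_head, tau=100.0):
    # W_Q_up, W_K_up, W_QR are lists of per-head matrices.
    # The shared rotary key W^{KR} is deliberately left untouched.
    for h, S_max in enumerate(S_max_per_head):
        if S_max &amp;gt; tau:
            gamma = tau / S_max
            W_Q_up[h] = np.sqrt(gamma) * W_Q_up[h]   # up-projection query: sqrt(gamma)
            W_K_up[h] = np.sqrt(gamma) * W_K_up[h]   # up-projection key:   sqrt(gamma)
            W_QR[h] = gamma * W_QR[h]                # per-head rotary query: full gamma
    return W_Q_up, W_K_up, W_QR
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;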

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026muonmuonclip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Muon &amp;amp; MuonClip Optimizers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Apr&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/04/04/muon-muonclip/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Kimi (Moonshot AI). &lt;a href=&quot;https://x.com/Kimi_Moonshot/status/1897929976948965870&quot;&gt;Why We Chose Muon: Our Chain of Thought&lt;/a&gt;. X (Twitter). 2025. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:5:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;a href=&quot;#fnref:5:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Chizoba Obasi. &lt;a href=&quot;https://chizkidd.github.io/2026/01/22/neural-net-optimizers/&quot;&gt;A Complete Guide to Neural Network Optimizers&lt;/a&gt;. chizkidd.github.io. 2026. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Jia-Bin Huang. &lt;a href=&quot;https://youtu.be/bO5nvE289ec&quot;&gt;This Simple Optimizer Is Revolutionizing How We Train AI [Muon] (YouTube Video)&lt;/a&gt;. YouTube. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Keller Jordan. &lt;a href=&quot;https://kellerjordan.github.io/posts/muon/&quot;&gt;Muon: An optimizer for hidden layers in neural networks&lt;/a&gt;. kellerjordan.github.io. 2024. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Jeremy Bernstein. &lt;a href=&quot;https://jeremybernste.in/writing/deriving-muon&quot;&gt;Deriving Muon&lt;/a&gt;. jeremybernste.in. 2025. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2502.16982&quot;&gt;Muon is Scalable for LLM Training&lt;/a&gt;. arXiv. 2025. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:2:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2507.20534&quot;&gt;Kimi K2: Open Agentic Intelligence&lt;/a&gt;. arXiv. 2025. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/04/04/muon-muonclip/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/04/04/muon-muonclip/</guid>
        
        
      </item>
    
      <item>
        <title>Inkcast: A Free, Browser-Based Audiobook Player</title>
        <description>&lt;p&gt;Earlier this year, I decided to force myself to read more. Not a New Year’s resolution, because those never last. The reason is that growing up as a child and young teenager, reading often felt like punishment. My mum required my siblings and me to read a certain number of pages from a designated book every day throughout elementary school. Missing a day meant mandatory punishment. In boarding secondary school, this eventually led to a stubborn, subconscious resistance to non-essential reading. Over the six years I spent there, I probably read only five to ten non-academic fiction books (though &lt;em&gt;Artemis Fowl&lt;/em&gt; was a delight). So it is not hard to see where my indifference to reading came from.&lt;/p&gt;

&lt;p&gt;During the COVID-19 pandemic, however, I fell deep into podcasts. As an avid sports fan and TV show buff, I listened to everything: sports recaps, tech podcasts, expert interviews, the works (meeting Walter White at Anfield would be the ultimate dream). So when I decided to read more this year, audiobooks felt like the natural bridge. I already had a few EPUBs in the Apple Books app on my reading list and wondered: &lt;em&gt;Can I listen to these EPUBs using Apple Dictation’s two-finger swipe-down feature?&lt;/em&gt; Unfortunately, it only works for the current page. It is quite janky, not very user-friendly, and frankly does not work well for my use case.&lt;/p&gt;

&lt;p&gt;Recently, I worked on &lt;a href=&quot;https://chizkidd.github.io/2026/03/01/tonal-fidelity-multilingual-asr/&quot;&gt;evaluating how a Facebook state-of-the-art (SOTA) automatic speech recognition (ASR) model handles Igbo tones&lt;/a&gt;, trying to see whether it actually “listens” properly. So I have been dabbling with audio quite a bit this year. You could say I have been thinking about listening a lot. In the past, I also experimented with WaveNet (a generative model for raw audio) and its fundamental building block, &lt;a href=&quot;https://chizkidd.github.io/Karpathy-Neural-Networks-Zero-to-Hero/006_makemore_WaveNet/makemore_WaveNet.html&quot;&gt;the dilated causal convolution&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With these experiences in mind, I wondered: &lt;em&gt;Can I build an iPhone Shortcut that lets me listen to EPUBs properly?&lt;/em&gt; That question eventually led to &lt;a href=&quot;https://chizkidd.github.io/inkcast/&quot;&gt;&lt;strong&gt;Inkcast&lt;/strong&gt;&lt;/a&gt;. The goal was not to build a Speechify competitor. I simply wanted to solve a personal problem. My aim was to create a low-effort, frictionless tool for personal use, so I made a GitHub repository and started building. Within a few days, I had a working website that could take EPUBs and PDFs and let users listen to the content organized by chapters in a sidebar with one-tap navigation. It included basic controls such as play/pause, rewind (15 seconds), forward (30 seconds), playback speed control (0.75× to 2×), and voice selection.&lt;/p&gt;

&lt;p&gt;It worked well on desktop, so I used the URL to create a Shortcut on my iPhone. On mobile it also worked, but there was one problem: the reader voices sounded robotic and monotone, which is not ideal for long-form listening. The irony was that only weeks earlier I had been evaluating how machines handle speech. So I went back to the drawing board to figure out how to get more natural-sounding reader voices on Inkcast.&lt;/p&gt;

&lt;p&gt;While researching audiobook-quality text-to-speech, I came across several APIs (OpenAI, ElevenLabs, Google Cloud). None were free, and for a personal project I wanted something that required no subscriptions or API keys. Most resources suggested that human-quality narration requires a dedicated TTS service. Eventually I discovered that the Web Speech API can access premium voices already installed on the device. These voices are free, require no API keys, and remain available offline. They are not state of the art, but they are surprisingly good. Many people do not realize that higher-quality Siri voices can be downloaded. The voice quality improved, but the project also started evolving in another direction.&lt;/p&gt;

&lt;p&gt;I have always wanted to work through Paul Graham’s essays properly. There are 229 of them, and they read almost like long-form podcasts. But they live on webpages, which raised another question: Why limit the input to EPUBs and PDFs? So I added URL support. I pasted Paul Graham’s archive page into Inkcast, and it automatically pulled all 200+ essays into the sidebar. That was the moment I realized the idea actually worked.&lt;/p&gt;

&lt;p&gt;The entire project lives in a single HTML file. There are no accounts, no installations, and files never leave the user’s device. Because the app has no server dependencies, it ended up functioning as a privacy-preserving tool by default. In a way, I started the year studying whether machines listen well. Along the way, I realized that humans do not have many good free tools for listening either. Speechify costs $139 per year and Audible requires a subscription, so I built something that worked for me.&lt;/p&gt;

&lt;p&gt;If you find it useful, please &lt;a href=&quot;https://chizkidd.github.io/inkcast/&quot;&gt;try it&lt;/a&gt;, &lt;a href=&quot;https://github.com/chizkidd/inkcast&quot;&gt;star it&lt;/a&gt;, or &lt;a href=&quot;https://buymeacoffee.com/cobasi&quot;&gt;buy me a coffee&lt;/a&gt; if it saves you a Speechify subscription.&lt;/p&gt;
</description>
        <pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/03/16/inkcast/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/03/16/inkcast/</guid>
        
        
      </item>
    
      <item>
        <title>Sutton &amp; Barto, Ch. 12: Eligibility Traces (Personal Notes)</title>
        <description>&lt;ul&gt;
  &lt;li&gt;Eligibility traces are one of the basic mechanisms of RL that unify and generalize TD and Monte Carlo (MC) methods.&lt;/li&gt;
  &lt;li&gt;TD methods augmented with eligibility traces produce a family of methods spanning a range from MC methods at one end ($\lambda = 1$) to one-step TD (TD(0)) methods at the other end ($\lambda = 0$).&lt;/li&gt;
  &lt;li&gt;With eligibility traces, MC methods can be implemented online and on continuing problems.&lt;/li&gt;
  &lt;li&gt;$n$-step methods also unify TD and MC methods but are not as elegant algorithmically as eligibility traces (ET).&lt;/li&gt;
  &lt;li&gt;The eligibility traces algorithm entails:
    &lt;ul&gt;
      &lt;li&gt;First, we have a short-term memory vector, the &lt;strong&gt;eligibility trace&lt;/strong&gt; $\mathbf{z}_t \in \mathbb{R}^d$, that parallels the long-term weight vector $\mathbf{w}_t \in \mathbb{R}^d$.&lt;/li&gt;
      &lt;li&gt;Then when a component of $\mathbf{w}_t$ participates in producing an estimated value, the corresponding component of $\mathbf{z}_t$ is bumped up and then begins to fade away.&lt;/li&gt;
      &lt;li&gt;Learning occurs in that component of $\mathbf{w}_t$ if a non-zero TD error occurs before the trace falls back to zero (fades away).&lt;/li&gt;
      &lt;li&gt;The trace-decay parameter $\lambda \in [0,1]$ determines the rate at which the trace falls.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Advantages of ET over $n$-step methods:
    &lt;ul&gt;
      &lt;li&gt;Requires only a single trace vector $\mathbf{z}_t$ rather than storing the last $n$ feature vectors.&lt;/li&gt;
      &lt;li&gt;Learning occurs continually and uniformly in time rather than being delayed and playing “catch up” at episode end, so learning can affect behavior immediately.&lt;/li&gt;
      &lt;li&gt;ET methods are &lt;strong&gt;backward view&lt;/strong&gt; algorithms, which are less complex to implement than the &lt;strong&gt;forward view&lt;/strong&gt; $n$-step methods.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Forward view&lt;/strong&gt; algorithms are based on looking forward from the updated state, and the updated state depends on all the future rewards.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Backward view&lt;/strong&gt; algorithms use the current TD error, looking backward to recently visited states, to achieve nearly the same updates as forward view.&lt;/li&gt;
  &lt;li&gt;We start with ideas for state values and prediction, then extend them to action values and control, first on-policy and then off-policy. The focus is on linear function approximation (which covers the tabular and state-aggregation cases).&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#121-the--return&quot;&gt;12.1 The $\lambda$-return&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#122-td&quot;&gt;12.2 TD($\lambda$)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#123--step-truncated--return-methods&quot;&gt;12.3 $n$-step Truncated $\lambda$-return Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#124-redoing-updates-online--return-algorithm&quot;&gt;12.4 Redoing Updates: Online $\lambda$-return Algorithm&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#125-true-online-td&quot;&gt;12.5 True Online TD($\lambda$)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#126-dutch-traces-in-monte-carlo-learning&quot;&gt;12.6 Dutch Traces in Monte Carlo Learning&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#127-sarsa&quot;&gt;12.7 Sarsa($\lambda$)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#128-variable--and&quot;&gt;12.8 Variable $\lambda$ and $\gamma$&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#129-off-policy-traces-with-control-variates&quot;&gt;12.9 Off-Policy Traces with Control Variates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#1210-watkinss-q-to-tree-backup&quot;&gt;12.10 Watkins’s Q($\lambda$) to Tree-Backup($\lambda$)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#1211-stable-off-policy-methods-with-traces&quot;&gt;12.11 Stable Off-Policy Methods with Traces&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#1212-implementation-issues&quot;&gt;12.12 Implementation Issues&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#1213-conclusions&quot;&gt;12.13 Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;121-the-lambda-return-&quot;&gt;12.1 The $\lambda$-return &lt;a name=&quot;121-the--return&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Recall in Chapter 7 we defined an $n$-step return as the sum of the first $n$ rewards plus the estimated value of the state reached in $n$ steps, each appropriately discounted:&lt;/li&gt;
&lt;/ul&gt;

\[G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad\quad 0 \leq t \leq T - n\]

&lt;ul&gt;
  &lt;li&gt;A valid update can be done not just towards any $n$-step return, but also towards any average of $n$-step returns.
    &lt;ul&gt;
      &lt;li&gt;E.g. average the 2-step and 4-step return: $\frac{1}{2} G_{t:t+2} + \frac{1}{2} G_{t:t+4}$&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-1-the-2-and-4-step-returns.png&quot; alt=&quot;compound update of 2-step and 4-step&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backup Diagram for Compound Update&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;The compound update mixing half of a two-step return and half of a four-step return.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;Any set of $n$-step returns can be averaged, even an infinite set, as long as the weights on the component returns are positive and sum to $1$.&lt;/li&gt;
  &lt;li&gt;What if instead of using one $n$-step return, we use a weighted average of all $n$-step returns? Such averaging produces a substantial new range of algorithms. E.g.,
    &lt;ol&gt;
      &lt;li&gt;Averaging one-step and infinite-step returns to interrelate TD and MC methods.&lt;/li&gt;
      &lt;li&gt;Averaging experience-based updates with Dynamic Programming (DP) updates to obtain a single combination of experience-based and model-based methods.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;An update that averages simpler component updates is called a &lt;strong&gt;compound update&lt;/strong&gt;; the &lt;strong&gt;$\lambda$-return&lt;/strong&gt; is one particular compound update.&lt;/li&gt;
  &lt;li&gt;The TD($\lambda$) algorithm is one way of averaging $n$-step updates, each weighted proportionally by $\lambda^{n-1}$ (where $\lambda \in [0,1]$) and normalized by a factor of $(1-\lambda)$ to ensure that the weights sum to $1$.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-1-td-lambda.png&quot; alt=&quot;TD(lambda)&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backup Diagram for TD($\lambda$)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;If $\lambda = 0$, then the overall update reduces to its first component, the &lt;strong&gt;TD(0)&lt;/strong&gt; update, whereas if $\lambda = 1$, then the overall update reduces to its last component, the &lt;strong&gt;MC&lt;/strong&gt; update.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;Essentially, $\lambda$-return, $G_t^\lambda$, combines all $n$-step returns $G_{t:t+n}$ in a weighted average manner, $(1-\lambda)\lambda^{n-1}$, and is defined in its state-based form by:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_t^\lambda \doteq (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}}\]
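
&lt;p&gt;A small Python sketch of this definition (my own toy check, assuming an episodic task so that every $n$-step return with $t+n \geq T$ equals the conventional return $G_t$, as spelled out in the decomposition below; the rewards list holds $R_1, \ldots, R_T$ and the values list holds $\hat{v}(S_0), \ldots, \hat{v}(S_{T-1})$):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def n_step_return(rewards, values, t, n, gamma):
    # G_{t:t+n}: the first n rewards plus a discounted bootstrap, clipped at episode end.
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma**(k - t) * rewards[k] for k in range(t, end))
    if t + n &amp;lt; T:
        G += gamma**n * values[t + n]
    return G

def lambda_return(rewards, values, t, gamma, lam):
    # G_t^lambda: (1-lam) * sum_n lam^(n-1) * G_{t:t+n}, with the residual weight
    # lam^(T-t-1) going to the conventional (full) return.
    T = len(rewards)
    G = sum((1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
            for n in range(1, T - t))
    G += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;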

&lt;h3 id=&quot;tdlambda-weighting&quot;&gt;TD($\lambda$) Weighting&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;The TD($\lambda$) weighting function diagram illustrates the weighting on the sequence of $n$-step returns in the $\lambda$-return:
    &lt;ul&gt;
      &lt;li&gt;$1$-step return gets the largest weight, $1-\lambda$&lt;/li&gt;
      &lt;li&gt;$2$-step return gets the next (2nd) largest weight, $(1-\lambda)\lambda$&lt;/li&gt;
      &lt;li&gt;$3$-step return gets the 3rd largest weight, $(1-\lambda)\lambda^2$&lt;/li&gt;
      &lt;li&gt;$n$-step return gets the $n$-th largest weight, $(1-\lambda)\lambda^{n-1}$, which shrinks as $n$ grows&lt;/li&gt;
      &lt;li&gt;The weight fades by $\lambda$ with each additional step.&lt;/li&gt;
      &lt;li&gt;After a terminal state has been reached, all subsequent $n$-step returns are equal to the conventional return $G_t$.&lt;/li&gt;
      &lt;li&gt;So essentially, we can decompose $G_t^\lambda$ based on the TD($\lambda$) weighting function diagram into the main sum and post-termination terms:&lt;/li&gt;
    &lt;/ul&gt;

\[\begin{array}{l}
G_t^\lambda = (1-\lambda) \sum\nolimits_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t \\
\hspace{3em} \underbrace{\hspace{11em}}_{\text{pre-termination}} \kern{0.5em}\underbrace{\hspace{4em}}_{\text{post-termination}}
\end{array}\]

    &lt;ul&gt;
      &lt;li&gt;So now we can see the impact of $\lambda$ more clearly:&lt;/li&gt;
    &lt;/ul&gt;

\[\begin{aligned}
\text{if } \lambda = 1: \quad &amp;amp; G_t^\lambda = G_t \hspace{18em} \text{(MC)} \\[6pt]
\text{if } \lambda = 0: \quad &amp;amp; G_t^\lambda = G_{t:t+1} \hspace{16em} \text{(TD(0))} \\
&amp;amp; \quad \text{since the weight } (1-\lambda)\lambda^{n-1} \text{ equals } 1 \text{ for } n=1 \text{ and } 0 \text{ for } n &amp;gt; 1
\end{aligned}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-1-td-lambda-weighting-function.png&quot; alt=&quot;TD(lambda) weighting&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;TD($\lambda$) Weighting&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;Weighting given in the $\lambda$-return to each of the $n$-step returns.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;Our first learning algorithm based on the $\lambda$-return is the &lt;strong&gt;off-line $\lambda$-return algorithm&lt;/strong&gt;, which waits until the end of an episode to make updates. Its semi-gradient update toward the $\lambda$-return target, for $t = 0, 1, 2, \ldots, T-1$, is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \!\left[G_t^\lambda - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t)}\]

&lt;ul&gt;
  &lt;li&gt;The $\lambda$-return allows us to move smoothly between MC and TD(0) methods, much as $n$-step returns do.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;forward-view&quot;&gt;Forward View&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;This approach is called the theoretical, forward view of a learning algorithm:
    &lt;ul&gt;
      &lt;li&gt;Update value function towards the $\lambda$-return.&lt;/li&gt;
      &lt;li&gt;Look forward in time to all the future rewards to compute $G_t^\lambda$.&lt;/li&gt;
      &lt;li&gt;Like MC, can only be computed from complete return.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-1-forward-view-td-lambda.png&quot; alt=&quot;Forward view&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Forward View&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;We decide how to update each state by looking forward to future rewards and states.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;122-tdlambda-&quot;&gt;12.2 TD($\lambda$) &lt;a name=&quot;122-td&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;TD($\lambda$) was the first algorithm that showed a formal relationship between a forward view and backward view using eligibility traces.&lt;/li&gt;
  &lt;li&gt;TD($\lambda$) improves over the off-line $\lambda$-return algorithm in 3 ways:
    &lt;ul&gt;
      &lt;li&gt;It updates the weight vector on every step of an episode rather than only at the end.&lt;/li&gt;
      &lt;li&gt;Its computations are distributed evenly in time rather than concentrated at the episode’s end.&lt;/li&gt;
      &lt;li&gt;It can be applied to continuing problems, not just episodic ones.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Let’s focus on the &lt;strong&gt;semi-gradient version of TD($\lambda$)&lt;/strong&gt; with function approximation:
    &lt;ul&gt;
      &lt;li&gt;The &lt;strong&gt;eligibility trace $\mathbf{z}_t$&lt;/strong&gt; has the same number of components as $\mathbf{w}_t$.&lt;/li&gt;
      &lt;li&gt;$\mathbf{z}$ is initialized to $\mathbf{0}$, incremented on each time step by the value gradient, and then fades away by $\gamma\lambda$:&lt;/li&gt;
    &lt;/ul&gt;

\[\begin{align*}
\mathbf{z}_{-1} &amp;amp;\doteq \mathbf{0} \\
\mathbf{z}_t &amp;amp;\doteq \gamma\lambda \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t), \quad 0 \leq t \leq T
\end{align*}\]

\[\text{where } \lambda \equiv \text{trace decay parameter and }  \gamma \equiv \text{discount rate}\]
  &lt;/li&gt;
  &lt;li&gt;The eligibility trace keeps track of which $\mathbf{w}_t$ components have contributed, positively or negatively, to recent state valuations.&lt;/li&gt;
  &lt;li&gt;This is the &lt;strong&gt;recency heuristic&lt;/strong&gt; used for &lt;strong&gt;credit assignment,&lt;/strong&gt; where more credit is assigned to the most recent states. &lt;strong&gt;Recent&lt;/strong&gt; is defined in terms of $\gamma\lambda$.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The TD error for state-value prediction is:&lt;/p&gt;

\[\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\]

    &lt;p&gt;and the weight vector update in TD($\lambda$) is proportional to the scalar TD error and the vector eligibility trace:&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t}\]
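
&lt;p&gt;Putting the trace update, TD error, and weight update together, here is a minimal sketch of semi-gradient TD($\lambda$) for the linear case $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$ (the environment interface and the feature map are assumed, illustrative names):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def td_lambda_episode(env, x, w, alpha, gamma, lam):
    # One episode of semi-gradient TD(lambda) with linear function approximation.
    # env.reset() / env.step() (following the policy being evaluated) and the
    # feature map x(s) are assumed, illustrative interfaces.
    z = np.zeros_like(w)                         # eligibility trace, z_{-1} = 0
    s, done = env.reset(), False
    while not done:
        s_next, r, done = env.step()             # next state and reward under pi
        z = gamma * lam * z + x(s)               # gradient of w.x(s) w.r.t. w is x(s)
        v_next = 0.0 if done else w @ x(s_next)
        delta = r + gamma * v_next - w @ x(s)    # TD error
        w = w + alpha * delta * z                # update proportional to delta and trace
        s = s_next
    return w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;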

&lt;h3 id=&quot;backward-view&quot;&gt;Backward View&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;The forward view provides the theory; the backward view provides the practical mechanism, letting us update online, at every step, from incomplete sequences.&lt;/li&gt;
  &lt;li&gt;Keep an eligibility trace for every state $s$.&lt;/li&gt;
  &lt;li&gt;Update value $V(s)$ for every state $s$ in proportion to TD-error $\delta_t$ and eligibility trace $\mathbf{z}_t$:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
\delta_t &amp;amp;= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\
V(s) &amp;amp;\leftarrow V(s) + \alpha\, \delta_t\, z_t(s)
\end{aligned}\]

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-2-backward-view-td-lambda.png&quot; alt=&quot;Backward TD(lambda)&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backward View of TD($\lambda$)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;In the backward or mechanistic view of TD($\lambda$), each update depends on the 
current TD error combined with the current eligibility traces of past events.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s look at the effect of $\lambda$ to understand the backward view of TD($\lambda$):&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\text{if } \lambda = 0: \quad &amp;amp; \mathbf{z}_t = \nabla \hat{v}(S_t, \mathbf{w}_t) \\
&amp;amp; \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t) \quad \longrightarrow \quad \text{TD(0)} \\[6pt]
\text{if } 0 &amp;lt; \lambda &amp;lt; 1: \quad &amp;amp; \text{earlier states are given less credit for the TD error} \\[6pt]
\text{if } \lambda = 1: \quad &amp;amp; \mathbf{z}_t = \gamma \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t) \quad \longrightarrow \quad \text{credit for earlier states falls by } \gamma \text{ per step} \\[6pt]
\text{if } \lambda = 1,\ \gamma = 1: \quad &amp;amp; \mathbf{z}_t = \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t) \quad \longrightarrow \quad \text{MC-like behavior (no time decay for ET)} \\[6pt]
\text{if } \lambda = 1: \quad &amp;amp; \text{we get TD(1)}
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;In summary, $\lambda = 1$ yields TD(1).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;TD(1) implements MC algorithms in a more general way and for a wider range of applicability:
    &lt;ul&gt;
      &lt;li&gt;Not limited to episodic tasks, can be applied to discounted continuing tasks.&lt;/li&gt;
      &lt;li&gt;Can be performed &lt;strong&gt;incrementally and online.&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;Learns &lt;strong&gt;immediately&lt;/strong&gt; and alters behavior during an episode if something good or bad happens, for control methods.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Linear TD($\lambda$) converges in the on-policy case if the step-size parameter $\alpha$ is reduced over time according to stochastic approximation theory conditions.&lt;/li&gt;
  &lt;li&gt;The convergence of linear TD($\lambda$) is not to the minimum-error weight vector but to a nearby weight vector that depends on $\lambda$.&lt;/li&gt;
  &lt;li&gt;The bound on solution quality generalized for any $\lambda$, for the continuing, discounted case is:&lt;/li&gt;
&lt;/ul&gt;

\[\overline{\text{VE}}(\mathbf{w}_\infty) \leq \frac{1 - \gamma\lambda}{1 - \gamma} \min_\mathbf{w} \overline{\text{VE}}(\mathbf{w})\]

&lt;ul&gt;
  &lt;li&gt;That is, the asymptotic error is no more than $\dfrac{1-\gamma\lambda}{1-\gamma}$ times the smallest possible error for TD($\lambda$):&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\text{as } \lambda \to 1: \quad &amp;amp; \text{the bound} \to \min_\mathbf{w} \overline{\text{VE}}(\mathbf{w}) \\[6pt]
\text{as } \lambda \to 0: \quad &amp;amp; \text{the bound} \to \frac{1}{1-\gamma} \min_\mathbf{w} \overline{\text{VE}}(\mathbf{w}), \text{ with } \mathbf{w}_\infty = \mathbf{w}_\text{TD} \quad \text{(the TD(0) bound)}
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;However, $\lambda = 1$ is often the poorest choice.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;123-n-step-truncated-lambda-return-methods-&quot;&gt;12.3 $n$-step Truncated $\lambda$-return Methods &lt;a name=&quot;123--step-truncated--return-methods&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The off-line $\lambda$-return algorithm is of limited use because the $\lambda$-return is not known until the episode ends.&lt;/li&gt;
  &lt;li&gt;In the continuing case it is never known exactly, since it depends on $n$-step returns for arbitrarily large $n$.&lt;/li&gt;
  &lt;li&gt;Hence, a natural approximation is to truncate the sequence of $n$-step returns after a &lt;strong&gt;fixed&lt;/strong&gt; number of steps, which also handles the continuing case.&lt;/li&gt;
  &lt;li&gt;The truncated $\lambda$-return for time $t$, given data only up to some later horizon $h$, is:&lt;/li&gt;
&lt;/ul&gt;

\[G_{t:h}^\lambda \doteq (1-\lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{h-t-1} G_{t:h}, \quad 0 \leq t \leq h \leq T\]

\[\begin{aligned}
\text{where } h &amp;amp;\equiv \text{horizon (plays same role as time of termination } T\text{)}
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;Here the &lt;strong&gt;residual weighting&lt;/strong&gt; is given to the longest available $n$-step return $G_{t:h}$.&lt;/li&gt;
  &lt;li&gt;The truncated $\lambda$-return gives rise to a family of $n$-step $\lambda$-return algorithms, known in the state-value case as &lt;strong&gt;Truncated TD($\lambda$)&lt;/strong&gt; or &lt;strong&gt;TTD($\lambda$)&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;TTD($\lambda$) is defined for $0 \leq t &amp;lt; T$ by:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \!\left[G_{t:t+n}^\lambda - \hat{v}(S_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1})}\]

&lt;ul&gt;
  &lt;li&gt;Efficient implementation of TTD($\lambda$) relies on the $k$-step $\lambda$-return:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_{t:t+k}^\lambda = \hat{v}(S_t, \mathbf{w}_{t-1}) + \sum_{i=t}^{t+k-1} (\gamma\lambda)^{i-t} \delta_i&apos;}\]

\[\begin{aligned}
\text{where } \delta_i&apos; &amp;amp;\equiv R_{i+1} + \gamma \hat{v}(S_{i+1}, \mathbf{w}_i) - \hat{v}(S_i, \mathbf{w}_{i-1})
\end{aligned}\]

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-3-truncated-td-lambda.png&quot; alt=&quot;TTD(lambda)&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backup Diagram for Truncated TD($\lambda$)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;The truncated $\lambda$-return gives rise to a family of $n$-step $\lambda$-return algorithms called &lt;strong&gt;TTD($\lambda$)&lt;/strong&gt;.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;124-redoing-updates-online-lambda-return-algorithm-&quot;&gt;12.4 Redoing Updates: Online $\lambda$-return Algorithm &lt;a name=&quot;124-redoing-updates-online--return-algorithm&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;How do we choose the truncation parameter $n$ in TTD($\lambda$)?&lt;/li&gt;
  &lt;li&gt;It involves a tradeoff:
    &lt;ul&gt;
      &lt;li&gt;$n$ should be large so that TTD($\lambda$) closely approximates off-line $\lambda$-return, but&lt;/li&gt;
      &lt;li&gt;$n$ should also be small so that the updates can be made sooner and can influence behavior sooner.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;In principle, we can achieve both cases via the &lt;strong&gt;online $\lambda$-return algorithm&lt;/strong&gt;, but at the cost of computational complexity.&lt;/li&gt;
  &lt;li&gt;Essentially, at each time step we go back and redo all the updates since the beginning of the episode as we gather a new increment of data:
    &lt;ul&gt;
      &lt;li&gt;The new updates are better than the old ones because now they account for the time step’s new data.&lt;/li&gt;
      &lt;li&gt;Basically this conceptual algorithm involves multiple passes over the episode, one at each horizon, each generating a different sequence of weight vectors.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Let’s distinguish between the weight vectors computed at the different horizons by writing out the first 3 sequences:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
h=1: \quad &amp;amp; \mathbf{w}_1^1 \doteq \mathbf{w}_0^1 + \alpha \!\left[G_{0:1}^\lambda - \hat{v}(S_0, \mathbf{w}_0^1)\right] \nabla \hat{v}(S_0, \mathbf{w}_0^1) \\[6pt]
h=2: \quad &amp;amp; \mathbf{w}_1^2 \doteq \mathbf{w}_0^2 + \alpha \!\left[G_{0:2}^\lambda - \hat{v}(S_0, \mathbf{w}_0^2)\right] \nabla \hat{v}(S_0, \mathbf{w}_0^2) \\
&amp;amp; \mathbf{w}_2^2 \doteq \mathbf{w}_1^2 + \alpha \!\left[G_{1:2}^\lambda - \hat{v}(S_1, \mathbf{w}_1^2)\right] \nabla \hat{v}(S_1, \mathbf{w}_1^2) \\[6pt]
h=3: \quad &amp;amp; \mathbf{w}_1^3 \doteq \mathbf{w}_0^3 + \alpha \!\left[G_{0:3}^\lambda - \hat{v}(S_0, \mathbf{w}_0^3)\right] \nabla \hat{v}(S_0, \mathbf{w}_0^3) \\
&amp;amp; \mathbf{w}_2^3 \doteq \mathbf{w}_1^3 + \alpha \!\left[G_{1:3}^\lambda - \hat{v}(S_1, \mathbf{w}_1^3)\right] \nabla \hat{v}(S_1, \mathbf{w}_1^3) \\
&amp;amp; \mathbf{w}_3^3 \doteq \mathbf{w}_2^3 + \alpha \!\left[G_{2:3}^\lambda - \hat{v}(S_2, \mathbf{w}_2^3)\right] \nabla \hat{v}(S_2, \mathbf{w}_2^3)
\end{align*}\]

\[\begin{aligned}
\text{where} \\
\mathbf{w}_t^h &amp;amp;\equiv \text{weights used to generate the value at time } t \text{ in the sequence up to horizon } h \\
\mathbf{w}_0^h &amp;amp;\equiv \text{1st weight vector in each sequence that is inherited from the previous episode} \\
\mathbf{w}_h^h &amp;amp;\equiv \text{last weight vector in each sequence; these define the ultimate weight-vector sequence } \mathbf{w}_t \doteq \mathbf{w}_t^t
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;The general form of the &lt;strong&gt;online $\lambda$-return update&lt;/strong&gt; for $0 \leq t &amp;lt; h \leq T$ is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1}^h \doteq \mathbf{w}_t^h + \alpha \!\left[G_{t:h}^\lambda - \hat{v}(S_t, \mathbf{w}_t^h)\right] \nabla \hat{v}(S_t, \mathbf{w}_t^h)}\]

\[\mathbf{w}_t \doteq \mathbf{w}_t^t\]
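
&lt;p&gt;Conceptually, the algorithm looks like the following rough sketch (my own rendering; it is quadratic in the episode length, and the truncated return $G_{t:h}^\lambda$ is treated as a given helper):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def online_lambda_return_episode(states, rewards, x, w0, alpha, gamma, lam,
                                 truncated_lambda_return):
    # At every horizon h, redo all updates t = 0 .. h-1 from the start of the
    # episode using the truncated returns G^lambda_{t:h} (helper assumed given).
    w = w0
    for h in range(1, len(states) + 1):
        w_h = w0                                     # each pass restarts from w_0^h = w_0
        for t in range(h):
            G = truncated_lambda_return(states, rewards, t, h, w_h, gamma, lam)
            v = w_h @ x(states[t])
            w_h = w_h + alpha * (G - v) * x(states[t])
        w = w_h                                      # keep only the last vector, w_t = w_t^t
    return w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;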

&lt;hr /&gt;

&lt;h2 id=&quot;125-true-online-tdlambda-&quot;&gt;12.5 True Online TD($\lambda$) &lt;a name=&quot;125-true-online-td&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The ideal online $\lambda$-return algorithm of Section 12.4 performs well but is computationally very expensive.&lt;/li&gt;
  &lt;li&gt;We can use eligibility traces to invert this forward-view algorithm into an efficient backward-view algorithm; for the linear case the inversion is exact, and the result is called &lt;strong&gt;True Online TD($\lambda$)&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;It rests on a simple trick with the weight matrix: of all the weight vectors produced by the online $\lambda$-return algorithm, we only need the last one from each time step (the diagonal of the weight matrix).&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
\begin{bmatrix}
\mathbf{w}_0^0 &amp;amp; &amp;amp; &amp;amp; &amp;amp; \\
\mathbf{w}_0^1 &amp;amp; \mathbf{w}_1^1 &amp;amp; &amp;amp; &amp;amp; \\
\mathbf{w}_0^2 &amp;amp; \mathbf{w}_1^2 &amp;amp; \mathbf{w}_2^2 &amp;amp; &amp;amp; \\
\mathbf{w}_0^3 &amp;amp; \mathbf{w}_1^3 &amp;amp; \mathbf{w}_2^3 &amp;amp; \mathbf{w}_3^3 &amp;amp; \\
\vdots &amp;amp; \vdots &amp;amp; \vdots &amp;amp; \vdots &amp;amp; \ddots \\
\mathbf{w}_0^T &amp;amp; \mathbf{w}_1^T &amp;amp; \mathbf{w}_2^T &amp;amp; \mathbf{w}_3^T &amp;amp; \cdots &amp;amp; \mathbf{w}_T^T
\end{bmatrix}
&amp;amp;\longrightarrow
\begin{bmatrix}
\mathbf{w}_0^0 \\
&amp;amp; \mathbf{w}_1^1 \\
&amp;amp; &amp;amp; \mathbf{w}_2^2 \\
&amp;amp; &amp;amp; &amp;amp; \mathbf{w}_3^3 \\
&amp;amp; &amp;amp; &amp;amp; &amp;amp; \ddots \\
&amp;amp; &amp;amp; &amp;amp; &amp;amp; &amp;amp; \mathbf{w}_T^T
\end{bmatrix} \\
\end{aligned}\]

\[\text{Online } \lambda\text{-return} \hspace{8em} \text{True Online TD}(\lambda)\]

&lt;ul&gt;
  &lt;li&gt;For the linear case in which $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$, the true online TD($\lambda$) algorithm is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t + \alpha \!\left(\mathbf{w}_t^T \mathbf{x}_t - \mathbf{w}_{t-1}^T \mathbf{x}_t\right)\!\left(\mathbf{z}_t - \mathbf{x}_t\right)}\]

\[\begin{aligned}
\text{where} \\
\mathbf{w}_t &amp;amp;\doteq \mathbf{w}_t^t \\
\mathbf{x}_t &amp;amp;\doteq \mathbf{x}(S_t) \\
\mathbf{z}_t &amp;amp;\doteq \gamma\lambda \mathbf{z}_{t-1} + (1 - \alpha\gamma\lambda\, \mathbf{z}_{t-1}^T \mathbf{x}_t)\, \mathbf{x}_t
\end{aligned}\]
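
&lt;p&gt;Here is a minimal sketch of true online TD($\lambda$) for the linear case, written in the common pseudocode form that carries a $V_{old}$ variable (equivalent to the boxed update above; the environment interface and feature map are assumed, illustrative names):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def true_online_td_lambda_episode(env, x, w, alpha, gamma, lam):
    # One episode of true online TD(lambda), linear case, with a dutch trace.
    # env.reset() / env.step() follow the policy being evaluated (assumed interface).
    s, done = env.reset(), False
    xt = x(s)
    z = np.zeros_like(w)
    v_old = 0.0
    while not done:
        s_next, r, done = env.step()
        x_next = np.zeros_like(xt) if done else x(s_next)
        v = w @ xt
        v_next = w @ x_next
        delta = r + gamma * v_next - v
        z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ xt)) * xt   # dutch trace
        w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * xt
        v_old = v_next
        xt = x_next
    return w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;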

&lt;ul&gt;
  &lt;li&gt;The per-step computational complexity of true online TD($\lambda$) is the same as TD($\lambda$), $O(d)$.&lt;/li&gt;
  &lt;li&gt;$\mathbf{z}_t$ used in true online TD($\lambda$) is called a &lt;strong&gt;dutch trace&lt;/strong&gt;, unlike that of TD($\lambda$) which is called an &lt;strong&gt;accumulating trace&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Earlier work used a 3rd kind of trace called the &lt;strong&gt;replacing trace&lt;/strong&gt;, defined only for the tabular case or for binary feature vectors (tile coding). It is defined:&lt;/li&gt;
&lt;/ul&gt;

\[\tilde{z}_{i,t} \doteq \left\{ \begin{array}{ll} 1 &amp;amp; \text{if } x_{i,t} = 1 \\ \gamma\lambda\, z_{i,t-1} &amp;amp; \text{otherwise} \end{array} \right\}\]

&lt;ul&gt;
  &lt;li&gt;Nowadays, dutch traces usually perform better than replacing traces and have a clearer theoretical basis.&lt;/li&gt;
  &lt;li&gt;Accumulating traces remain of interest for nonlinear function approximations where dutch traces are unavailable.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;126-dutch-traces-in-monte-carlo-learning&quot;&gt;12.6 Dutch Traces in Monte Carlo Learning&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Eligibility traces have nothing to do with TD learning despite their close historical association.&lt;/li&gt;
  &lt;li&gt;Eligibility traces arise even in Monte Carlo learning.&lt;/li&gt;
  &lt;li&gt;Using dutch traces, we can invert the forward view MC algorithm to an equivalent, yet computationally cheaper backward view algorithm.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;This is the only equivalence of forward and backward view that is explicitly demonstrated in this book.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;The linear, gradient MC prediction algorithm makes the following sequence of updates, one for each time step of the episode:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \!\left[G - \mathbf{w}_t^T \mathbf{x}_t\right] \mathbf{x}_t, \quad 0 \leq t &amp;lt; T\]

&lt;ul&gt;
  &lt;li&gt;For simplicity, assume that the return $G$ is a single reward at the end of the episode (hence no subscript by time) and that there is no discounting.&lt;/li&gt;
  &lt;li&gt;This is known as the &lt;strong&gt;Least Mean Square (LMS)&lt;/strong&gt; rule.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;We introduce an additional vector memory, the &lt;strong&gt;eligibility trace,&lt;/strong&gt; that keeps a summary of all the feature vectors seen so far. Its overall effect is the same as the sequence of MC updates shown above:&lt;/p&gt;

\[\begin{align*}
\mathbf{w}_T &amp;amp;= \mathbf{w}_{T-1} + \alpha \!\left(G - \mathbf{w}_{T-1}^T \mathbf{x}_{T-1}\right) \mathbf{x}_{T-1} \\
&amp;amp;= \mathbf{w}_{T-1} + \alpha \mathbf{x}_{T-1}\!\left(-\mathbf{x}_{T-1}^T \mathbf{w}_{T-1}\right) + \alpha G \mathbf{x}_{T-1} \\
&amp;amp;= \!\left(\mathbf{I} - \alpha \mathbf{x}_{T-1} \mathbf{x}_{T-1}^T\right) \mathbf{w}_{T-1} + \alpha G \mathbf{x}_{T-1} \\
&amp;amp;= \mathbf{F}_{T-1}\, \mathbf{w}_{T-1} + \alpha G \mathbf{x}_{T-1}
\end{align*}\]

\[\begin{aligned}
\text{where} \\
\mathbf{F}_t &amp;amp;\doteq \mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T \equiv \text{a forgetting or fading matrix}
\end{aligned}\]

\[\therefore\quad \mathbf{w}_{T-1} = \mathbf{F}_{T-2}\, \mathbf{w}_{T-2} + \alpha G \mathbf{x}_{T-2}\]

    &lt;p&gt;Now recursing:&lt;/p&gt;

\[\begin{align*}
\mathbf{w}_T &amp;amp;= \mathbf{F}_{T-1}\, \mathbf{w}_{T-1} + \alpha G \mathbf{x}_{T-1} \\
&amp;amp;= \mathbf{F}_{T-1}\!\left(\mathbf{F}_{T-2}\, \mathbf{w}_{T-2} + \alpha G \mathbf{x}_{T-2}\right) + \alpha G \mathbf{x}_{T-1} \\
&amp;amp;= \mathbf{F}_{T-1} \mathbf{F}_{T-2}\, \mathbf{w}_{T-2} + \alpha G\!\left(\mathbf{F}_{T-1} \mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\
&amp;amp;= \mathbf{F}_{T-1} \mathbf{F}_{T-2}\!\left(\mathbf{F}_{T-3}\, \mathbf{w}_{T-3} + \alpha G\, \mathbf{x}_{T-3}\right) + \alpha G\!\left(\mathbf{F}_{T-1}\, \mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\
&amp;amp;= \mathbf{F}_{T-1} \mathbf{F}_{T-2} \mathbf{F}_{T-3}\, \mathbf{w}_{T-3} + \alpha G\!\left(\mathbf{F}_{T-1} \mathbf{F}_{T-2}\, \mathbf{x}_{T-3} + \mathbf{F}_{T-1}\, \mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\
&amp;amp;\quad \vdots \\
&amp;amp;= \underbrace{\mathbf{F}_{T-1} \mathbf{F}_{T-2} \cdots \mathbf{F}_0\, \mathbf{w}_0}_{\mathbf{a}_{T-1}} + \alpha G \underbrace{\sum\nolimits_{k=0}^{T-1} \mathbf{F}_{T-1} \mathbf{F}_{T-2} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k}_{\mathbf{z}_{T-1}} \\
&amp;amp;= \mathbf{a}_{T-1} + \alpha G\, \mathbf{z}_{T-1}
\end{align*}\]

\[\begin{aligned}
\text{where} \\
\mathbf{a}_{T-1}\ \&amp;amp;\ \mathbf{z}_{T-1} &amp;amp;\equiv \text{values at time } T-1 \text{ of 2 auxiliary memory vectors that can be updated} \\
&amp;amp;\phantom{{}\equiv{}} \text{incrementally w/o knowledge of } G \text{ and with } O(d) \text{ complexity per time step} \\
\mathbf{z}_t &amp;amp;\equiv \text{dutch-style eligibility trace, initialized to } \mathbf{z}_0 = \mathbf{x}_0
\end{aligned}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The $\mathbf{z}_t$ vector is in fact a dutch-style eligibility trace, initialized to $\mathbf{z}_0 = \mathbf{x}_0$, that can be updated according to:&lt;/p&gt;

\[\begin{align*}
\mathbf{z}_t &amp;amp;= \sum\nolimits_{k=0}^{t} \mathbf{F}_t \mathbf{F}_{t-1} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k, \quad 1 \leq t &amp;lt; T \\
&amp;amp;= \sum\nolimits_{k=0}^{t-1} \mathbf{F}_t \mathbf{F}_{t-1} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k + \mathbf{x}_t \\
&amp;amp;= \mathbf{F}_t \sum\nolimits_{k=0}^{t-1} \mathbf{F}_{t-1} \mathbf{F}_{t-2} \cdots \mathbf{F}_{k+1}\, \mathbf{x}_k + \mathbf{x}_t \\
&amp;amp;= \mathbf{F}_t\, \mathbf{z}_{t-1} + \mathbf{x}_t \\
&amp;amp;= \!\left(\mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T\right) \mathbf{z}_{t-1} + \mathbf{x}_t \\
&amp;amp;= \mathbf{z}_{t-1} - \alpha\!\left(\mathbf{z}_{t-1}^T \mathbf{x}_t\right) \mathbf{x}_t + \mathbf{x}_t \\
&amp;amp;\boxed{= \mathbf{z}_{t-1} + \!\left(1 - \alpha\, \mathbf{z}_{t-1}^T \mathbf{x}_t\right) \mathbf{x}_t}
\end{align*}\]

    &lt;p&gt;which is the dutch trace for $\gamma\lambda = 1$.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The $\mathbf{a}_t$ auxiliary vector is initialized to $\mathbf{a}_0 = \mathbf{w}_0$ and then updated according to:&lt;/p&gt;

\[\begin{align*}
\mathbf{a}_t &amp;amp;= \mathbf{F}_t \mathbf{F}_{t-1} \cdots \mathbf{F}_0\, \mathbf{w}_0, \quad 1 \leq t &amp;lt; T \\
&amp;amp;= \mathbf{F}_t\, \mathbf{a}_{t-1} \\
&amp;amp;= \left(\mathbf{I} - \alpha \mathbf{x}_t \mathbf{x}_t^T\right) \mathbf{a}_{t-1} \\
&amp;amp;\boxed{= \mathbf{a}_{t-1} - \alpha\, \mathbf{x}_t \mathbf{x}_t^T\, \mathbf{a}_{t-1}}
\end{align*}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;The auxiliary vectors, $\mathbf{a}_t$ and $\mathbf{z}_t$, are updated on each time step $t &amp;lt; T$ and then, at time $T$ when $G$ is observed, they are used to compute:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_T = \mathbf{a}_{T-1} + \alpha G\, \mathbf{z}_{T-1}}\]

&lt;ul&gt;
  &lt;li&gt;The time and memory complexity per step is $O(d)$.&lt;/li&gt;
  &lt;li&gt;This is surprising and intriguing since ET is working in a non-TD setting (eligibility traces arise whenever long-term predictions need to be learned efficiently).&lt;/li&gt;
&lt;/ul&gt;
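
&lt;p&gt;A toy numpy check (my own construction) that the incremental $\mathbf{a}$/$\mathbf{z}$ computation above reproduces the forward LMS pass exactly; note that the fading matrix $\mathbf{F}_0$ is applied to $\mathbf{a}$ at $t = 0$ as well, so that $\mathbf{a}_{T-1} = \mathbf{F}_{T-1}\cdots\mathbf{F}_0\, \mathbf{w}_0$:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
T, d, alpha = 10, 5, 0.1
X = rng.standard_normal((T, d))   # feature vectors x_0 .. x_{T-1}
G = 1.7                           # single return, observed only at the end
w0 = rng.standard_normal(d)

# Forward view: one LMS update per step, every one of them needing G.
w = w0.copy()
for t in range(T):
    w = w + alpha * (G - w @ X[t]) * X[t]

# Backward view: maintain a and z incrementally, touch G only once at the end.
a = w0 - alpha * X[0] * (X[0] @ w0)               # apply F_0 to a
z = X[0].copy()                                   # z_0 = x_0
for t in range(1, T):
    a = a - alpha * X[t] * (X[t] @ a)             # a_t = F_t a_{t-1}
    z = z + (1.0 - alpha * (z @ X[t])) * X[t]     # dutch trace with gamma * lambda = 1
w_backward = a + alpha * G * z

assert np.allclose(w, w_backward)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;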

&lt;hr /&gt;

&lt;h2 id=&quot;127-sarsalambda-&quot;&gt;12.7 Sarsa($\lambda$) &lt;a name=&quot;127-sarsa&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Now let’s extend eligibility traces to action-value methods.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;First, let’s recall the action-value form of the &lt;strong&gt;$n$-step&lt;/strong&gt; return:&lt;/p&gt;

\[G_{t:t+n} \doteq R_{t+1} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \quad t+n &amp;lt; T\]

\[\text{with} \quad G_{t:t+n} = G_t \quad \text{ if } t+n \geq T.\]
  &lt;/li&gt;
  &lt;li&gt;With this, for $t = 0, \ldots, T-1,$ let’s form the action-value form of the &lt;strong&gt;off-line $\lambda$-return&lt;/strong&gt; algorithm which uses $\hat{q}$ rather than $\hat{v}$:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \!\left[G_t^\lambda - \hat{q}(S_t, A_t, \mathbf{w}_t)\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\]

\[\begin{aligned}
\text{where} \quad G_t^\lambda &amp;amp;\doteq G_{t:\infty}^\lambda
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;For the forward view shown in the figure below, which is similar to TD($\lambda$), the updates are:
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;1st update:&lt;/strong&gt; one full-step lookahead&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;2nd update:&lt;/strong&gt; two-step lookahead&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Final update:&lt;/strong&gt; complete return.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-7-sarsa-lambda.png&quot; alt=&quot;Sarsa(lambda)&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backup Diagram for Sarsa($\lambda$)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;The first update looks ahead one full step, to the next state–action pair, the second looks
ahead two steps, to the second state–action pair, and so on. A final update is based on
the complete return.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;The weighting of each $n$-step update in the $\lambda$-return is the same as for TD($\lambda$) and the $\lambda$-return.&lt;/li&gt;
  &lt;li&gt;The temporal-difference method for action values, &lt;strong&gt;Sarsa($\lambda$)&lt;/strong&gt;, approximates this forward view and has the same update rule as TD($\lambda$):&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t}\]

&lt;ul&gt;
  &lt;li&gt;The action-value form of the TD error is used:&lt;/li&gt;
&lt;/ul&gt;

\[\delta_t \doteq R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)\]

&lt;ul&gt;
  &lt;li&gt;The action-value form of the eligibility trace:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\mathbf{z}_{-1} &amp;amp;\doteq \mathbf{0} \\
\mathbf{z}_t &amp;amp;\doteq \gamma\lambda \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t), \quad 0 \leq t \leq T
\end{align*}\]
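
&lt;p&gt;Combining the trace, TD error, and weight update gives the following minimal Sarsa($\lambda$) sketch for linear $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s, a)$ (the environment, the feature map, and the $\varepsilon$-greedy action selection are my own illustrative choices):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def sarsa_lambda_episode(env, x, w, actions, alpha, gamma, lam, epsilon=0.1):
    # One episode of Sarsa(lambda) with an accumulating trace and linear q-values.
    def q(s, a):
        return w @ x(s, a)

    def epsilon_greedy(s):
        if np.random.rand() &amp;lt; epsilon:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: q(s, a))

    z = np.zeros_like(w)                      # eligibility trace z_{-1} = 0
    s = env.reset()
    a = epsilon_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        z = gamma * lam * z + x(s, a)         # gradient of q(s, a, w) w.r.t. w is x(s, a)
        if done:
            delta = r - q(s, a)
        else:
            a_next = epsilon_greedy(s_next)
            delta = r + gamma * q(s_next, a_next) - q(s, a)
        w = w + alpha * delta * z
        if not done:
            s, a = s_next, a_next
    return w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;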

&lt;ul&gt;
  &lt;li&gt;There are action-value versions of our ideal TD method, the online $\lambda$-return algorithm, and of its efficient implementation, true online TD($\lambda$):
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Section &lt;a href=&quot;#124-redoing-updates-online--return-algorithm&quot;&gt;12.4&lt;/a&gt;:&lt;/strong&gt; Everything there holds here except for using the action-value form of the $n$-step return, $G_{t:t+n}$.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Sections &lt;a href=&quot;#125-true-online-td&quot;&gt;12.5&lt;/a&gt; &amp;amp; &lt;a href=&quot;#126-dutch-traces-in-monte-carlo-learning&quot;&gt;12.6&lt;/a&gt;:&lt;/strong&gt; Everything holds here except for using state-action feature vectors $\mathbf{x}_t = \mathbf{x}(S_t, A_t)$ instead of state feature vectors $\mathbf{x}_t = \mathbf{x}(S_t)$.&lt;/li&gt;
      &lt;li&gt;The resulting efficient backward algorithm obtained from using the eligibility trace to invert the action-value form of the forward view, online $\lambda$-return is called the &lt;strong&gt;True Online Sarsa($\lambda$)&lt;/strong&gt;.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;There is also a truncated version of Sarsa($\lambda$), called &lt;strong&gt;Forward Sarsa($\lambda$)&lt;/strong&gt;, a model-free control method well suited for use in conjunction with multi-layer ANNs.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;128-variable-lambda-and-gamma-&quot;&gt;12.8 Variable $\lambda$ and $\gamma$ &lt;a name=&quot;128-variable--and&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;To get the most general forms of the final TD algorithms, it is vital to generalize the degree of bootstrapping and discounting beyond constant parameters to functions dependent on the state and action:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\lambda_t &amp;amp;\doteq \lambda(S_t, A_t), \quad &amp;amp; \lambda &amp;amp;: S \times A \to [0,1] \\
\gamma_t &amp;amp;\doteq \gamma(S_t), \quad &amp;amp; \gamma &amp;amp;: S \to [0,1]
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;$\gamma_t$ is now called the termination function and is significant because it changes the return $G_t$, which is now more generally defined as:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
G_t &amp;amp;\doteq R_{t+1} + \gamma_{t+1} G_{t+1} \\
&amp;amp;= R_{t+1} + \gamma_{t+1} R_{t+2} + \gamma_{t+1} \gamma_{t+2} R_{t+3} + \gamma_{t+1} \gamma_{t+2} \gamma_{t+3} R_{t+4} + \ldots \\
&amp;amp;= \sum_{k=t}^{\infty} \left(\prod_{i=t+1}^{k} \gamma_i\right) R_{k+1}
\end{align*}\]

\[\begin{aligned}
\text{where } \prod_{k=t}^{\infty} \gamma_k &amp;amp;= 0 \text{ with probability 1 for all } t, \text{ to assure the sums are finite}
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;This general definition of the return $G_t$ allows episodic settings to be treated as a single stream of experience, without a special terminal state, start distribution, or termination times
    &lt;ul&gt;
      &lt;li&gt;A terminal state just becomes a state with $\gamma(s) = 0$ that transitions to the start distribution.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Generalization to variable bootstrapping yields a new state-based $\lambda$-return:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_t^{\lambda s} \doteq R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \hat{v}(S_{t+1}, \mathbf{w}_t) + \lambda_{t+1}\, G_{t+1}^{\lambda s}\right]}\]
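
&lt;p&gt;As a hedged illustration of the recursion above, the following sketch (my own, not from the book) computes the state-based $\lambda$-return backward over a recorded episode, with state-dependent $\lambda_t$ and $\gamma_t$ and the terminal state encoded by $\gamma = 0$. The array layout is an assumption made for the example.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# Assumed layout for an episode of length T:
#   rewards[t]            = R_{t+1}                     (length T)
#   values[t]             = v_hat(S_t, w)               (length T)
#   gammas[k], lambdas[k] = gamma_k, lambda_k, k = 0..T, gammas[T] = 0
def state_lambda_returns(rewards, values, lambdas, gammas):
    T = len(rewards)
    G = np.zeros(T)
    G_next = 0.0                                  # G_T = 0
    for t in reversed(range(T)):
        v_next = values[t + 1] if t + 1 &amp;lt; T else 0.0
        G[t] = rewards[t] + gammas[t + 1] * (
            (1 - lambdas[t + 1]) * v_next + lambdas[t + 1] * G_next
        )
        G_next = G[t]
    return G
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;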

&lt;ul&gt;
  &lt;li&gt;Action-based $\lambda$-return is either the &lt;strong&gt;Sarsa form&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_t^{\lambda a} \doteq R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) + \lambda_{t+1}\, G_{t+1}^{\lambda a}\right]}\]

&lt;ul&gt;
  &lt;li&gt;or the &lt;strong&gt;Expected Sarsa form&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_t^{\lambda a} \doteq R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \bar{V}_t(S_{t+1}) + \lambda_{t+1}\, G_{t+1}^{\lambda a}\right]}\]

\[\begin{aligned}
\text{where } \bar{V}_t(s) \doteq \sum_a \pi(a \vert s)\, \hat{q}(s, a, \mathbf{w}_t) \\
\end{aligned}\]

&lt;h3 id=&quot;superscripts-notation-for-i-in-g_tlambda-i&quot;&gt;Superscripts notation for $i$ in $G_t^{\lambda i}$&lt;/h3&gt;

\[\begin{aligned}
\text{&quot;s&quot;} &amp;amp;: \text{bootstraps from state values} \\
\text{&quot;a&quot;} &amp;amp;: \text{bootstraps from action values}
\end{aligned}\]

&lt;hr /&gt;

&lt;h2 id=&quot;129-off-policy-traces-with-control-variates&quot;&gt;12.9 Off-Policy Traces with Control Variates&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;To generalize to off-policy, we need to incorporate importance sampling using eligibility traces.&lt;/li&gt;
  &lt;li&gt;Let’s focus on the bootstrapping generalization of per-decision importance sampling with control variates &lt;strong&gt;(Section 7.4).&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;The new state-based $\lambda$-return in &lt;strong&gt;Section &lt;a href=&quot;#128-variable--and&quot;&gt;12.8&lt;/a&gt;&lt;/strong&gt; generalizes to the off-policy case, following the model of the off-policy, control-variate, $n$-step return (ending at horizon $h$):&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_t^{\lambda s} \doteq \rho_t \!\left(R_{t+1} + \gamma_{t+1}\!\left[(1 - \lambda_{t+1})\, \hat{v}(S_{t+1}, \mathbf{w}_t) + \lambda_{t+1}\, G_{t+1}^{\lambda s}\right]\right) + (1 - \rho_t)\, \hat{v}(S_t, \mathbf{w}_t)}\]

\[\begin{aligned}
\text{where } \rho_t &amp;amp;= \frac{\pi(A_t \vert S_t)}{b(A_t \vert S_t)}
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;The final $\lambda$-return can be approximated in terms of sums of the state-based TD error $\delta_t^s$, with the approximation becoming exact if the approximate value function does not change:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
G_t^{\lambda s} &amp;amp;\approx \hat{v}(S_t, \mathbf{w}_t) + \rho_t \sum_{k=t}^{\infty} \delta_k^s \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\
\delta_t^s &amp;amp;\doteq R_{t+1} + \gamma_{t+1}\, \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The forward view update of the approximate $\lambda$-return is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\mathbf{w}_{t+1} &amp;amp;= \mathbf{w}_t + \alpha \!\left[G_t^{\lambda s} - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t) \\
&amp;amp;\boxed{\approx \mathbf{w}_t + \alpha \rho_t \!\left(\sum_{k=t}^{\infty} \delta_k^s \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i\right) \nabla \hat{v}(S_t, \mathbf{w}_t)}
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;We’re interested in the equivalence (approximately) between the forward-view update summed over time and a backward-view update summed over time. The equivalence is approximate because we ignore changes in the value function.&lt;/li&gt;
  &lt;li&gt;The sum of the forward-view update over time is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\sum_{t=0}^{\infty} \!\left(\mathbf{w}_{t+1} - \mathbf{w}_t\right) &amp;amp;\approx \sum_{t=0}^{\infty} \sum_{k=t}^{\infty} \alpha \rho_t\, \delta_k^s \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\
&amp;amp;= \sum_{k=0}^{\infty} \sum_{t=0}^{k} \alpha \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t)\, \delta_k^s \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\
&amp;amp;\quad \left(\text{using the summation rule: } \sum_{t=x}^{y} \sum_{k=t}^{y} = \sum_{k=x}^{y} \sum_{t=x}^{k}\right) \\
&amp;amp;= \sum_{k=0}^{\infty} \alpha\, \delta_k^s \sum_{t=0}^{k} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;If the entire expression from the 2nd sum on could be written and updated incrementally as an eligibility trace, then the sum of the forward-view update over time would be in the form of the sum of a backward-view TD update.
    &lt;ul&gt;
      &lt;li&gt;That is, if this expression was the trace at time $k$, then we could update it from its value at time $k-1$ by:&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\mathbf{z}_k &amp;amp;= \sum_{t=0}^{k} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\
&amp;amp;= \sum_{t=0}^{k-1} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i + \rho_k \nabla \hat{v}(S_k, \mathbf{w}_k) \\
&amp;amp;= \gamma_k \lambda_k \rho_k \underbrace{\sum_{t=0}^{k-1} \rho_t \nabla \hat{v}(S_t, \mathbf{w}_t) \prod_{i=t+1}^{k-1} \gamma_i \lambda_i \rho_i}_{\mathbf{z}_{k-1}} + \rho_k \nabla \hat{v}(S_k, \mathbf{w}_k)
\end{align*}\]

\[\boxed{\mathbf{z}_k = \rho_k \!\left[\gamma_k \lambda_k\, \mathbf{z}_{k-1} + \nabla \hat{v}(S_k, \mathbf{w}_k)\right]}\]

&lt;ul&gt;
  &lt;li&gt;If we change the index from $k$ to $t$ of the $\mathbf{z}_k$ equation above, we get the &lt;strong&gt;general accumulating trace&lt;/strong&gt; update for state values:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{z}_t \doteq \rho_t \!\left[\gamma_t \lambda_t\, \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t)\right]}\]

&lt;ul&gt;
  &lt;li&gt;This eligibility trace combined with the usual semi-gradient TD($\lambda$) parameter-update rule &lt;strong&gt;(Section &lt;a href=&quot;#122-td&quot;&gt;12.2&lt;/a&gt;)&lt;/strong&gt; forms a &lt;strong&gt;general TD($\lambda$)&lt;/strong&gt; algorithm that can be applied to either on-policy or off-policy data:
    &lt;ul&gt;
      &lt;li&gt;In on-policy, the algorithm is exactly TD($\lambda$) because $\rho_t = 1$ always and the ET above becomes the usual accumulating trace for variable $\lambda$ and $\gamma$:&lt;/li&gt;
    &lt;/ul&gt;

\[\mathbf{z}_t \doteq \gamma_t \lambda_t\, \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t)\]

    &lt;ul&gt;
      &lt;li&gt;In off-policy, the algorithm stays as it is, although not guaranteed to be stable as a semi-gradient method.&lt;/li&gt;
      &lt;li&gt;For off-policy, we’ll consider extensions that guarantee stability in the next few sections.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Let’s derive the off-policy ET for &lt;strong&gt;action-value&lt;/strong&gt; methods and corresponding general Sarsa($\lambda$) algorithms.
    &lt;ul&gt;
      &lt;li&gt;Starting with either recursive general action-based $\lambda$-return of Sarsa or Expected Sarsa, $G_t^{\lambda a}$, in &lt;strong&gt;Section &lt;a href=&quot;#128-variable--and&quot;&gt;12.8&lt;/a&gt;&lt;/strong&gt; (Expected Sarsa works out to be simpler), we can extend the Expected Sarsa $G_t^{\lambda a}$ to the off-policy case after the off-policy model of action-based, off-policy, control variate, $n$-step return:&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\begin{align*}
G_t^{\lambda a} &amp;amp;\doteq R_{t+1} + \gamma_{t+1}\!\left(\!\left[1 - \lambda_{t+1}\right] \bar{V}_t(S_{t+1}) + \lambda_{t+1}\!\left[\rho_{t+1} G_{t+1}^{\lambda a} + \bar{V}_t(S_{t+1}) - \rho_{t+1}\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)\right]\right) \\
&amp;amp;= R_{t+1} + \gamma_{t+1}\!\left(\bar{V}_t(S_{t+1}) + \lambda_{t+1} \rho_{t+1} \!\left[G_{t+1}^{\lambda a} - \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)\right]\right)
\end{align*}}\]

\[\begin{aligned}
\text{where } \bar{V}_t(S_{t+1}) &amp;amp;= \sum_a \pi(a \vert S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t)
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;The $\lambda$-return, approximately as the sum of TD errors, is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
G_t^{\lambda a} &amp;amp;\approx \hat{q}(S_t, A_t, \mathbf{w}_t) + \sum_{k=t}^{\infty} \delta_k^a \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \\
\delta_t^a &amp;amp;= R_{t+1} + \gamma_{t+1} \bar{V}(S_{t+1}) - \hat{q}(S_t, A_t, \mathbf{w}_t)
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;Using steps analogous to those for the state case earlier in this section, write a forward-view update based on action-based $\lambda$-return $G_t^{\lambda a}$ above, then transform the sum of the updates using the summation rule and finally derive the eligibility trace for action values:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{z}_t \doteq \gamma_t \lambda_t \rho_t\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\]

&lt;ul&gt;
  &lt;li&gt;This ET, combined with the action-based expected TD error $\delta_t^a$ and the usual semi-gradient TD($\lambda$) parameter-update rule &lt;strong&gt;(Section &lt;a href=&quot;#122-td&quot;&gt;12.2&lt;/a&gt;)&lt;/strong&gt;, forms an elegant, efficient &lt;strong&gt;Expected Sarsa($\lambda$)&lt;/strong&gt; algorithm that can be applied to either on-policy or off-policy data (a minimal per-step sketch follows this list):
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;&lt;u&gt;On-policy case&lt;/u&gt;:&lt;/strong&gt; The algorithm becomes the Sarsa($\lambda$) algorithm given constant $\lambda$ and $\gamma$, and the usual state-action TD error:&lt;/li&gt;
    &lt;/ul&gt;

\[\begin{aligned}
&amp;amp;\quad \rho_t = 1, \quad \lambda_t = \lambda, \quad \gamma_t = \gamma \quad \text{(constants)} \\
&amp;amp;\boxed{\mathbf{z}_t \doteq \gamma\lambda\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}
\end{aligned}\]
  &lt;/li&gt;
  &lt;li&gt;At $\lambda = 1$, these algorithms become closely related to corresponding Monte Carlo algorithms.&lt;/li&gt;
  &lt;li&gt;No episode-by-episode equivalence of updates exists, only of their expectations, even under the most favorable conditions.
    &lt;ul&gt;
      &lt;li&gt;Methods have been proposed recently &lt;strong&gt;[Sutton, Mahmood, Precup &amp;amp; van Hasselt, 2014]&lt;/strong&gt; that do achieve an exact equivalence.&lt;/li&gt;
      &lt;li&gt;These methods require an additional vector of &lt;strong&gt;“provisional weights”&lt;/strong&gt; that keep track of executed updates but may need to be retracted/emphasized depending on future actions taken.&lt;/li&gt;
      &lt;li&gt;The state and state-action versions of these methods are called &lt;strong&gt;PTD($\lambda$) and PQ($\lambda$)&lt;/strong&gt; respectively, where the ‘P’ stands for Provisional.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If $\lambda &amp;lt; 1$, then all these off-policy algorithms involve bootstrapping and &lt;strong&gt;the deadly triad&lt;/strong&gt; applies, meaning that they can be guaranteed stable only for the tabular case, state aggregation and other limited forms of function approximation.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Recall the challenge of off-policy learning has 2 parts. Off-policy eligibility traces deal effectively with the 1st part, correcting for the expected value of the targets, but not with the 2nd part that has to do with distribution of updates (matching off-policy to on-policy).&lt;/li&gt;
  &lt;li&gt;Algorithmic strategies for handling the 2nd part of the off-policy learning challenge with eligibility traces are summarized in &lt;strong&gt;Section &lt;a href=&quot;#1211-stable-off-policy-methods-with-traces&quot;&gt;12.11&lt;/a&gt;.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
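
&lt;p&gt;Below is the minimal per-step sketch (my own, not from the book) of the Expected Sarsa($\lambda$) update described above, for a linear $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s, a)$. The functions x, target_probs, and behavior_probs are assumed interfaces.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# One step of off-policy Expected Sarsa(lambda) for linear q_hat(s, a, w).
# Assumed interfaces: x(s, a) gives the feature vector,
# target_probs(s)[a] = pi(a|s), behavior_probs(s)[a] = b(a|s).
def expected_sarsa_lambda_step(w, z, s, a, r, s_next, x, num_actions,
                               target_probs, behavior_probs,
                               alpha, gam_t, gam_next, lam_t):
    rho = target_probs(s)[a] / behavior_probs(s)[a]
    v_bar = sum(target_probs(s_next)[ap] * (w @ x(s_next, ap))
                for ap in range(num_actions))      # V_bar(S_{t+1})
    delta = r + gam_next * v_bar - w @ x(s, a)     # expected TD error
    z = gam_t * lam_t * rho * z + x(s, a)          # off-policy action-value trace
    w = w + alpha * delta * z                      # semi-gradient update
    return w, z
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;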

&lt;hr /&gt;

&lt;h2 id=&quot;1210-watkinss-qlambda-to-tree-backuplambda-&quot;&gt;12.10 Watkins’s Q($\lambda$) to Tree-Backup($\lambda$) &lt;a name=&quot;1210-watkinss-q-to-tree-backup&quot;&gt;&lt;/a&gt;&lt;/h2&gt;

&lt;h3 id=&quot;watkinss-qlambda&quot;&gt;Watkins’s Q($\lambda$)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Watkins’s Q($\lambda$) is the original method for extending Q-learning to eligibility traces.&lt;/li&gt;
  &lt;li&gt;It involves decaying the ET in the usual way as long as a greedy action was taken, then cuts the traces to 0 after the 1st non-greedy action.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-10-watkins-q-lambda.png&quot; alt=&quot;Watkins&apos;s Q(lambda)&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backup Diagram for Watkins&apos;s Q($\lambda$)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;The series of component updates ends
either with the end of the episode or with the first nongreedy action, whichever comes first.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;tree-backuplambda&quot;&gt;Tree-Backup($\lambda$)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s look at the eligibility trace version of Tree Backup, which is called &lt;strong&gt;Tree-Backup($\lambda$)&lt;/strong&gt; or &lt;strong&gt;TB($\lambda$)&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;TB($\lambda$) is the &lt;strong&gt;true successor&lt;/strong&gt; to Q-learning because it has no importance sampling.&lt;/li&gt;
  &lt;li&gt;TB($\lambda$) concept is straightforward:
    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;The tree-backup updates of each length (Section 7.5) are weighted dependent on the bootstrapping parameter $\lambda$.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Using the recursive form of the action-based $\lambda$-return for Expected Sarsa and then expanding the bootstrapping target case after the model of tree-backup $n$-step return (Section 7.5):&lt;/p&gt;

\[\boxed{\begin{align*}
G_t^{\lambda a} &amp;amp;\doteq R_{t+1} + \gamma_{t+1}\!\left(\!\left[1 - \lambda_{t+1}\right] \bar{V}_t(S_{t+1}) + \lambda_{t+1}\!\left[\sum_{a \neq A_{t+1}} \pi(a \vert S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t) + \pi(A_{t+1} \vert S_{t+1}) G_{t+1}^{\lambda a}\right]\right) \\
&amp;amp;= R_{t+1} + \gamma_{t+1}\!\left(\bar{V}_t(S_{t+1}) + \lambda_{t+1} \pi(A_{t+1} \vert S_{t+1}) \!\left[G_{t+1}^{\lambda a} - \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)\right]\right)
\end{align*}}\]
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;$G_t^{\lambda a}$ can be approximated (ignoring changes in approx. value function) as a sum of TD errors:&lt;/p&gt;

\[\begin{align*}
G_t^{\lambda a} &amp;amp;\approx \hat{q}(S_t, A_t, \mathbf{w}_t) + \sum_{k=t}^{\infty} \delta_k^a \prod_{i=t+1}^{k} \gamma_i \lambda_i \pi(A_i \vert S_i) \\
\delta_t^a &amp;amp;= R_{t+1} + \gamma_{t+1} \bar{V}_t(S_{t+1}) - \hat{q}(S_t, A_t, \mathbf{w}_t)
\end{align*}\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;As always, using same steps as in the previous section, we get a special eligibility trace update involving the target-policy probabilities of the selected actions:&lt;/p&gt;

\[\boxed{\mathbf{z}_t \doteq \gamma_t \lambda_t \pi(A_t \vert S_t)\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch12-12-10-tree-backup-q-lambda.png&quot; alt=&quot;Tree Backup (lambda)&quot; /&gt;&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Backup Diagram for Tree Backup($\lambda$)&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;The tree-backup updates of each length are weighted in the
usual way dependent on the bootstrapping parameter $\lambda$&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;The ET above combined with the usual semi-gradient TD($\lambda$) parameter-update rule defines the TB($\lambda$) algorithm (a brief per-step sketch follows this list).&lt;/li&gt;
  &lt;li&gt;Like all semi-gradient algorithms, TB($\lambda$) is not guaranteed to be stable when used with off-policy data and a powerful function approximator &lt;strong&gt;(the deadly triad).&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
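
&lt;p&gt;For comparison with the Expected Sarsa($\lambda$) sketch in Section 12.9, here is a brief per-step sketch (my own, not from the book) of TB($\lambda$) for a linear $\hat{q}$; the only structural change is that the trace decays by the target-policy probability $\pi(A_t \vert S_t)$ rather than by an importance-sampling ratio.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# One TB(lambda) step for linear q_hat(s, a, w) = w @ x(s, a).
# Assumed interface: target_probs(s)[a] = pi(a|s).
def tree_backup_lambda_step(w, z, s, a, r, s_next, x, num_actions,
                            target_probs, alpha, gam_t, gam_next, lam_t):
    v_bar = sum(target_probs(s_next)[ap] * (w @ x(s_next, ap))
                for ap in range(num_actions))             # V_bar(S_{t+1})
    delta = r + gam_next * v_bar - w @ x(s, a)            # expected TD error
    z = gam_t * lam_t * target_probs(s)[a] * z + x(s, a)  # TB(lambda) trace
    w = w + alpha * delta * z
    return w, z
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;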

&lt;hr /&gt;

&lt;h2 id=&quot;1211-stable-off-policy-methods-with-traces&quot;&gt;12.11 Stable Off-Policy Methods with Traces&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s look at 4 of the most important methods that achieve stable off-policy training using eligibility traces.&lt;/li&gt;
  &lt;li&gt;All 4 are based on either &lt;strong&gt;Gradient-TD or Emphatic-TD&lt;/strong&gt; methods and use linear function approximation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;gtdlambda&quot;&gt;GTD($\lambda$)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Analogous to TDC, GTD($\lambda$) aims to learn a parameter $\mathbf{w}_t$ such that $\hat{v}(s, \mathbf{w}_t) \doteq \mathbf{w}_t^T \mathbf{x}(s) \approx v_{\pi}(s)$, even from data generated by following another policy $b$. Its update is given below (a brief per-step sketch follows this list):&lt;/p&gt;

\[\begin{aligned}
\mathbf{w}_{t+1} &amp;amp;\doteq \mathbf{w}_t + \alpha\, \delta_t^s\, \mathbf{z}_t - \alpha \gamma_{t+1}(1 - \lambda_{t+1})\!\left(\mathbf{z}_t^T \mathbf{v}_t\right) \mathbf{x}_{t+1} \\
\mathbf{v}_{t+1} &amp;amp;\doteq \mathbf{v}_t + \beta\, \delta_t^s\, \mathbf{z}_t - \beta\!\left(\mathbf{v}_t^T \mathbf{x}_t\right) \mathbf{x}_t
\end{aligned}\]

\[\begin{aligned}
\text{where} \\
\mathbf{v} &amp;amp;\in \mathbb{R}^d \equiv \text{a vector of the same dimension as } \mathbf{w}, \text{ initialized to } \mathbf{v}_0 = \mathbf{0} \\
\beta &amp;amp;&amp;gt; 0 \equiv \text{a 2nd step-size parameter}
\end{aligned}\]
  &lt;/li&gt;
&lt;/ul&gt;
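
&lt;p&gt;A hedged per-step sketch (my own, not from the book) of the GTD($\lambda$) update pair above, for linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$, using the general accumulating trace from Section 12.9; the argument names are assumptions.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# One GTD(lambda) step.  x_t, x_next are feature vectors; rho = pi/b ratio.
def gtd_lambda_step(w, v, z, x_t, x_next, r, rho,
                    alpha, beta, gam_t, gam_next, lam_t, lam_next):
    z = rho * (gam_t * lam_t * z + x_t)                  # off-policy trace
    delta = r + gam_next * (w @ x_next) - w @ x_t        # state-value TD error
    w = (w + alpha * delta * z
           - alpha * gam_next * (1 - lam_next) * (z @ v) * x_next)
    v = v + beta * delta * z - beta * (v @ x_t) * x_t    # secondary weights
    return w, v, z
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;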

&lt;h3 id=&quot;gqlambda&quot;&gt;GQ($\lambda$)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Gradient-TD algorithm for action values with eligibility traces.&lt;/li&gt;
  &lt;li&gt;GQ($\lambda$) aims to learn $\mathbf{w}_{t}$ such that $\hat{q}(s, a, \mathbf{w}_{t}) \doteq \mathbf{w}_{t}^T \mathbf{x}(s,a) \approx q_{\pi}(s,a)$ from off-policy data.&lt;/li&gt;
  &lt;li&gt;If the target policy is $\varepsilon$-greedy, or otherwise biased towards the greedy policy for $\hat{q}$, then GQ($\lambda$) can be used as a control algorithm.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;GQ($\lambda$) update is:&lt;/p&gt;

\[\begin{aligned}
\mathbf{w}_{t+1} &amp;amp;\doteq \mathbf{w}_t + \alpha\, \delta_t^a\, \mathbf{z}_t - \alpha \gamma_{t+1}(1 - \lambda_{t+1})\!\left(\mathbf{z}_t^T \mathbf{v}_t\right) \bar{\mathbf{x}}_{t+1} \\
\bar{\mathbf{x}}_t &amp;amp;\doteq \sum_a \pi(a \vert S_t)\, \mathbf{x}(S_t, a) \\
\delta_t^a &amp;amp;\doteq R_{t+1} + \gamma_{t+1}\, \mathbf{w}_t^T \bar{\mathbf{x}}_{t+1} - \mathbf{w}_t^T \mathbf{x}_t \\
\mathbf{z}_t &amp;amp;\doteq \gamma_t \lambda_t \rho_t\, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)
\end{aligned}\]

\[\begin{aligned}
\text{where} \\
\bar{\mathbf{x}}_t &amp;amp;\equiv \text{average feature vector for } S_t \text{ under the target policy} \\
\delta_t^a &amp;amp;\equiv \text{expectation form of the TD error}
\end{aligned}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;htdlambda&quot;&gt;HTD($\lambda$)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Hybrid TD($\lambda$) state-value algorithm combines aspects of GTD($\lambda$) and TD($\lambda$).&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;HTD($\lambda$) is a strict generalization of TD($\lambda$) to the off-policy setting, meaning it reduces exactly to TD($\lambda$) when the behavior and target policies coincide, a property GTD($\lambda$) does not share:&lt;/p&gt;

\[b(A_t \vert S_t) = \pi(A_t \vert S_t), \quad \rho_t = 1 \implies \text{HTD}(\lambda) = \text{TD}(\lambda)\]
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;HTD($\lambda$) is defined by:&lt;/p&gt;

\[\begin{aligned}
\mathbf{w}_{t+1} &amp;amp;\doteq \mathbf{w}_t + \alpha\, \delta_t^s\, \mathbf{z}_t + \alpha\!\left[\!\left(\mathbf{z}_t - \mathbf{z}_t^b\right)^T \mathbf{v}_t\right]\!\left(\mathbf{x}_t - \gamma_{t+1} \mathbf{x}_{t+1}\right) \\
\mathbf{v}_{t+1} &amp;amp;\doteq \mathbf{v}_t + \beta\, \delta_t^s\, \mathbf{z}_t - \beta\!\left(\mathbf{z}_t^T \mathbf{v}_t\right)\!\left(\mathbf{x}_t - \gamma_{t+1} \mathbf{x}_{t+1}\right), \quad &amp;amp; \mathbf{v}_0 \doteq \mathbf{0} \\
\mathbf{z}_t &amp;amp;\doteq \rho_t \!\left(\gamma_t \lambda_t\, \mathbf{z}_{t-1} + \mathbf{x}_t\right), \quad &amp;amp; \mathbf{z}_{-1} \doteq \mathbf{0} \\
\mathbf{z}_t^b &amp;amp;\doteq \gamma_t \lambda_t\, \mathbf{z}_{t-1}^b + \mathbf{x}_t, \quad &amp;amp; \mathbf{z}_{-1}^b \doteq \mathbf{0}
\end{aligned}\]
  &lt;/li&gt;
  &lt;li&gt;We get
    &lt;ul&gt;
      &lt;li&gt;a 2nd set of weights, $\mathbf{v}_t$.&lt;/li&gt;
      &lt;li&gt;a 2nd set of eligibility traces, $\mathbf{z}_t^b$, &lt;strong&gt;conventional accumulating traces&lt;/strong&gt; for the behavior policy.&lt;/li&gt;
    &lt;/ul&gt;

\[\begin{aligned}
\mathbf{z}_t^b = \mathbf{z}_t \text{ if all } \rho_t = 1 &amp;amp;\implies \left(\mathbf{z}_t - \mathbf{z}_t^b\right)^T = \mathbf{0} \\
&amp;amp;\implies \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t^s\, \mathbf{z}_t \quad \text{(TD(}\lambda\text{))}
\end{aligned}\]
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;emphatic-tdlambda&quot;&gt;Emphatic TD($\lambda$)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Extension of one-step Emphatic TD (Sections 9.11 &amp;amp; 11.8) to eligibility traces.&lt;/li&gt;
  &lt;li&gt;The resulting algorithm:
    &lt;ul&gt;
      &lt;li&gt;(+) retains strong off-policy convergence guarantees&lt;/li&gt;
      &lt;li&gt;(+) enables any degree of bootstrapping&lt;/li&gt;
      &lt;li&gt;(-) has high variance&lt;/li&gt;
      &lt;li&gt;(-) potentially slow convergence.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Emphatic TD($\lambda$) is defined by the following (a brief per-step sketch follows this list):&lt;/p&gt;

\[\begin{aligned}
\mathbf{w}_{t+1} &amp;amp;\doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t \\
\delta_t &amp;amp;\doteq R_{t+1} + \gamma_{t+1}\, \mathbf{w}_t^T \mathbf{x}_{t+1} - \mathbf{w}_t^T \mathbf{x}_t \\
\mathbf{z}_t &amp;amp;\doteq \rho_t \!\left(\gamma_t \lambda_t\, \mathbf{z}_{t-1} + M_t \mathbf{x}_t\right), \quad &amp;amp; \mathbf{z}_{-1} \doteq \mathbf{0} \\
M_t &amp;amp;\doteq \lambda_t \mathcal{I}_t + (1 - \lambda_t) F_t \\
F_t &amp;amp;\doteq \rho_{t-1} \gamma_t F_{t-1} + \mathcal{I}_t, \quad &amp;amp; F_0 \doteq \mathcal{I}_0
\end{aligned}\]

\[\begin{aligned}
\text{where} \\
M_t &amp;amp;\geq 0 \equiv \text{emphasis} \\
F_t &amp;amp;\geq 0 \equiv \text{followon trace} \\
\mathcal{I}_t &amp;amp;\geq 0 \equiv \text{interest}
\end{aligned}\]
  &lt;/li&gt;
  &lt;li&gt;In the on-policy case ($\rho_t = 1$ for all $t$), Emphatic TD($\lambda$) is similar to conventional TD($\lambda$), but still significantly different:
    &lt;ul&gt;
      &lt;li&gt;Emphatic TD($\lambda$) is guaranteed to converge for all state-dependent $\lambda$ functions; TD($\lambda$) is not (TD($\lambda$) is guaranteed only for constant $\lambda$).&lt;/li&gt;
      &lt;li&gt;See Yu’s counterexample &lt;strong&gt;[Ghiassian, Rafiee &amp;amp; Sutton, 2016].&lt;/strong&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
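
&lt;p&gt;A brief per-step sketch (my own, not from the book) of the Emphatic TD($\lambda$) equations above, for linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$; the argument names and calling convention are assumptions.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# One Emphatic TD(lambda) step.  interest_t is the interest I_t,
# rho_prev is rho_{t-1}, F_prev is the previous followon trace.
def emphatic_td_lambda_step(w, z, F_prev, x_t, x_next, r, rho, rho_prev,
                            interest_t, alpha, gam_t, gam_next, lam_t):
    F = rho_prev * gam_t * F_prev + interest_t        # followon trace
    M = lam_t * interest_t + (1 - lam_t) * F          # emphasis
    z = rho * (gam_t * lam_t * z + M * x_t)           # emphatic trace
    delta = r + gam_next * (w @ x_next) - w @ x_t     # TD error
    w = w + alpha * delta * z
    return w, z, F
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;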

&lt;hr /&gt;

&lt;h2 id=&quot;1212-implementation-issues&quot;&gt;12.12 Implementation Issues&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Naive implementation seems expensive:&lt;/strong&gt; Updating eligibility traces for every state at every time step appears computationally costly on serial computers.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Practical optimization:&lt;/strong&gt; Most ET are nearly 0; only recently visited states have significant traces, so implementations can track and update only these few states.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Computational cost:&lt;/strong&gt; With this optimization, tabular methods with traces are only a few times more expensive than one-step methods.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Function approximation reduces overhead:&lt;/strong&gt; When using neural networks, ET typically only double memory and computation per step (much less overhead than in tabular case).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Tabular is the worst case:&lt;/strong&gt; The tabular setting represents the highest computational complexity for ET relative to simpler methods.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;1213-conclusions&quot;&gt;12.13 Conclusions&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Eligibility traces&lt;/strong&gt; provide an efficient, incremental way to interpolate between TD and MC methods.&lt;/li&gt;
  &lt;li&gt;ET offer advantages over $n$-step methods in terms of &lt;strong&gt;generality and computational trade-offs.&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Empirically, &lt;strong&gt;an intermediate mix works best:&lt;/strong&gt; ET should move towards MC but not all the way since pure MC performance degrades sharply.&lt;/li&gt;
  &lt;li&gt;ET are the &lt;strong&gt;first line of defense against long-delayed rewards and non-Markov tasks,&lt;/strong&gt; used with TD methods to make them behave more like MC methods while retaining some bootstrapping.&lt;/li&gt;
  &lt;li&gt;Use traces &lt;strong&gt;when data is scarce and online learning is required,&lt;/strong&gt; as they provide faster learning per sample despite higher computational cost per step.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Avoid traces in offline settings&lt;/strong&gt; with cheap abundant data (maximum data processing speed matters more than learning efficiency per sample).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;True online methods&lt;/strong&gt; achieve ideal $\lambda$-return performance while maintaining $O(d)$ computational efficiency.&lt;/li&gt;
  &lt;li&gt;Forward-to-backward view derivations provide &lt;strong&gt;computationally efficient, mechanistic,&lt;/strong&gt; practical implementations of theory.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026RLsuttonBartoCh12notes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Sutton &amp;amp; Barto, Ch. 12: Eligibility Traces (Personal Notes)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Mar&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/03/13/rl-sutton-barto-notes-ch012/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;
</description>
        <pubDate>Fri, 13 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/03/13/rl-sutton-barto-notes-ch012/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/03/13/rl-sutton-barto-notes-ch012/</guid>
        
        
      </item>
    
      <item>
        <title>Sutton &amp; Barto, Ch. 11: Off-Policy Methods with Approximation (Personal Notes)</title>
        <description>&lt;ul&gt;
  &lt;li&gt;Let’s discuss the extension of off-policy methods from the tabular case (Ch. 6 &amp;amp; 7) to function approximation.&lt;/li&gt;
  &lt;li&gt;We’ll explore the convergence problems, the theory of linear function approximation, the notion of learnability, and off-policy algorithms with stronger convergence guarantees.&lt;/li&gt;
  &lt;li&gt;Off-policy learning with function approximation has 2 challenges:
    &lt;ol&gt;
      &lt;li&gt;Finding the target of the update.&lt;/li&gt;
      &lt;li&gt;The off-policy distribution of updates does not match that of the on-policy distribution.&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#111-semi-gradient-methods&quot;&gt;11.1 Semi-gradient Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#112-examples-of-off-policy-divergence&quot;&gt;11.2 Examples of Off-Policy Divergence&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#113-the-deadly-triad&quot;&gt;11.3 The Deadly Triad&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#114-linear-value-function-geometry&quot;&gt;11.4 Linear Value-Function Geometry&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#115-gradient-descent-in-the-bellman-error&quot;&gt;11.5 Gradient Descent in the Bellman Error&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#116-the-bellman-error-is-not-learnable&quot;&gt;11.6 The Bellman Error is Not Learnable&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#117-gradient-td-methods&quot;&gt;11.7 Gradient-TD Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#118-emphatic-td-methods&quot;&gt;11.8 Emphatic-TD Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#119-reducing-variance&quot;&gt;11.9 Reducing Variance&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#1110-summary&quot;&gt;11.10 Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;111-semi-gradient-methods&quot;&gt;11.1 Semi-gradient Methods&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s discuss the extension of previous off-policy methods to function approximation as semi-gradient methods.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;This is how we find the update target (or change it) to address the first challenge.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Recall the semi-gradient update:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\!\left[U_t - \hat{v}(S_t, \mathbf{w}_t)\right] \nabla \hat{v}(S_t, \mathbf{w}_t)\]

\[U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)\]

&lt;ul&gt;
  &lt;li&gt;In the tabular case, we update the array ($V$ or $Q$), but now we update the weight vector $\mathbf{w}$.&lt;/li&gt;
  &lt;li&gt;Many off-policy algorithms use the per-step importance sampling ratio:&lt;/li&gt;
&lt;/ul&gt;

\[\rho_t \doteq \rho_{t:t} = \frac{\pi(A_t \vert S_t)}{b(A_t \vert S_t)}\]

&lt;ul&gt;
  &lt;li&gt;The off-policy, semi-gradient &lt;strong&gt;TD(0)&lt;/strong&gt; update is the same as the on-policy TD(0) update except for the addition of the $\rho_t$ factor:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \rho_t\, \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)}\]

\[\begin{align*}
\text{(episodic)} \quad \delta_t &amp;amp;= R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\
\text{(continuing)} \quad \delta_t &amp;amp;= R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)
\end{align*}\]
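
&lt;p&gt;As a hedged illustration, here is a minimal sketch (my own, not from the book) of a single off-policy semi-gradient TD(0) update with the episodic TD error, for linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$; the policy-probability arrays are assumed inputs.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# pi_probs[a] = pi(a|S_t), b_probs[a] = b(a|S_t) for the taken action a.
def off_policy_td0_step(w, x_t, x_next, r, a, pi_probs, b_probs,
                        alpha, gamma, terminal=False):
    rho = pi_probs[a] / b_probs[a]              # importance-sampling ratio
    v_next = 0.0 if terminal else w @ x_next
    delta = r + gamma * v_next - w @ x_t        # TD error
    return w + alpha * rho * delta * x_t        # grad v_hat = x_t (linear)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;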

&lt;ul&gt;
  &lt;li&gt;For action values, the off-policy, semi-gradient &lt;strong&gt;Expected Sarsa&lt;/strong&gt; update rule is (no importance sampling):&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\]

\[\begin{align*}
\text{(episodic)} \quad \delta_t &amp;amp;= R_{t+1} + \gamma \sum_a \pi(a \vert S_{t+1}) \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \\
\text{(continuing)} \quad \delta_t &amp;amp;= R_{t+1} - \bar{R}_t + \sum_a \pi(a \vert S_{t+1}) \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The absence of importance sampling in the one-step Expected Sarsa update is less clearly justified under function approximation: different state-action pairs no longer have separate estimates but all contribute to the same overall approximation, so we might want to weight them &lt;strong&gt;differently&lt;/strong&gt;. Resolving this issue properly requires a more thorough understanding of the &lt;strong&gt;theory of function approximation&lt;/strong&gt; in RL.&lt;/li&gt;
  &lt;li&gt;In the multi-step generalizations of the algorithms, both the state-value and action-value algorithms involve importance sampling. For example, the off-policy, semi-gradient $\mathbf{n}$&lt;strong&gt;-step Sarsa&lt;/strong&gt; update is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \rho_{t+1} \cdots \rho_{t+n}\!\left[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})}\]

\[\begin{align*}
\text{(episodic)} \quad G_{t:t+n} &amp;amp;= R_{t+1} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}) \\
\text{(continuing)} \quad G_{t:t+n} &amp;amp;= R_{t+1} - \bar{R}_t + \ldots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})
\end{align*}\]

\[\text{where } \rho_k = 1 \hspace{0.5em} \text{ for } k \geq T \quad \text{and} \quad G_{t:t+n} = G_t \hspace{0.5em} \text{ if } t+n \geq T\]

&lt;ul&gt;
  &lt;li&gt;The off-policy, semi-gradient $\mathbf{n}$&lt;strong&gt;-step tree-backup&lt;/strong&gt; (no importance sampling) algorithm is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\!\left[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})}\]

\[G_{t:t+n} \doteq \hat{q}(S_t, A_t, \mathbf{w}_{t+n}) + \sum_{k=t}^{t+n-1} \delta_k \prod_{i=t+1}^{k} \gamma \pi(A_i \vert S_i)\]

\[\text{where } \delta_t \text{ is the Expected Sarsa TD error defined earlier in this section.}\]

&lt;hr /&gt;

&lt;h2 id=&quot;112-examples-of-off-policy-divergence&quot;&gt;11.2 Examples of Off-Policy Divergence&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Now let’s discuss the 2nd off-policy function approximation challenge.&lt;/li&gt;
  &lt;li&gt;We’ll look at some instructive counterexamples where the semi-gradient algorithm diverges.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example-1&quot;&gt;Example 1&lt;/h3&gt;

&lt;p&gt;Consider part of a larger MDP with 2 states whose estimated values are $w$ and $2w$:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-2-example1-DOWNSPACED.png&quot; alt=&quot;Counterexample&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Simple Counterexample:&lt;/strong&gt; 2-state part of an MDP.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
  &lt;li&gt;$w$ updates will diverge to infinity, since the transition will always look good (higher next-state estimated value than current state estimated value).&lt;/li&gt;
  &lt;li&gt;The TD error on a transition between the 2 states is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\delta_t &amp;amp;= R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\
&amp;amp;= 0 + \gamma \cdot 2w_t - w_t \\
&amp;amp;= (2\gamma - 1)\, w_t
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The off-policy, semi-gradient TD(0) update is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
w_{t+1} &amp;amp;= w_t + \alpha \rho_t\, \delta_t \nabla \hat{v}(S_t, w_t) \\
&amp;amp;= w_t + (\alpha)(1)\!\left[(2\gamma - 1) w_t\right](1) \\
&amp;amp;= w_t\!\left[1 + \alpha(2\gamma - 1)\right]
\end{align*}\]

\[\begin{aligned}
\Rightarrow \quad &amp;amp; 1 + \alpha(2\gamma - 1) &amp;gt; 1 \\
&amp;amp; \alpha(2\gamma - 1) &amp;gt; 0 \\
&amp;amp; 2\gamma - 1 &amp;gt; 0 \\
&amp;amp; \gamma &amp;gt; \tfrac{1}{2} \quad \longrightarrow \quad w \to \pm\infty
\end{aligned}\]
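
&lt;p&gt;A tiny numerical sketch (my own, not from the book) of the multiplicative update derived above: with $\alpha = 0.1$ and $\gamma = 0.9$ the multiplier is $1.08$, so $w$ grows without bound.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Repeatedly apply w &amp;lt;- w * (1 + alpha * (2 * gamma - 1)); it diverges
# whenever gamma &amp;gt; 1/2 (here the factor is 1.08 per step).
alpha, gamma, w = 0.1, 0.9, 1.0
for t in range(100):
    w *= 1 + alpha * (2 * gamma - 1)
print(w)   # about 2.2e3 after 100 steps, and still growing
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;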

&lt;h3 id=&quot;example-2-bairds-counterexample&quot;&gt;Example 2 (Baird’s Counterexample)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Now let’s look at an entire complete system with instability (divergence).&lt;/li&gt;
  &lt;li&gt;Consider the episodic 7-state, 2-action MDP shown below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-2-bairds-counterexample.png&quot; alt=&quot;Baird&apos;s Counterexample&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Baird’s Counterexample:&lt;/strong&gt; Episodic 7-state, 2-action MDP.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Assumptions/knowns:&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;$b(\text{dashed}\,\vert\,\cdot) = 6/7$&lt;/li&gt;
      &lt;li&gt;$b(\text{solid}\,\vert\,\cdot) = 1/7$&lt;/li&gt;
      &lt;li&gt;$\pi(a\,\vert\,\cdot) = \pi(\text{solid}\,\vert\,\cdot) = 1$&lt;/li&gt;
      &lt;li&gt;$R = 0$ (on all transitions)&lt;/li&gt;
      &lt;li&gt;$\gamma = 0.99$&lt;/li&gt;
      &lt;li&gt;The state values are estimated via linear parametrization.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The estimated value of the leftmost state is $2w_1 + w_8$, which corresponds to a feature vector for the 1st state being:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{x}(1) = (2, 0, 0, 0, 0, 0, 0, 1)^T\]

\[R = 0 \quad \therefore\quad v_\pi(s) = 0 \; \forall s, \text{ which can be exactly approximated if } \mathbf{w} = \mathbf{0}\]

&lt;ul&gt;
  &lt;li&gt;Since there are 8 components of the weight vector (more than the 7 non-terminal states), there exist many solutions.&lt;/li&gt;
  &lt;li&gt;Applying semi-gradient TD(0) to this problem will cause the weights to diverge to infinity. This also applies for the dynamic programming (DP) case.&lt;/li&gt;
  &lt;li&gt;The semi-gradient DP update is:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{w}_{k+1} \doteq \mathbf{w}_k + \frac{\alpha}{\vert S \vert} \sum_s \left(\mathbb{E}_\pi\!\left[R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_k) \mid S_t = s\right] - \hat{v}(s, \mathbf{w}_k)\right) \nabla \hat{v}(s, \mathbf{w}_k)\]

&lt;ul&gt;
  &lt;li&gt;This example shows that even the simplest combination of bootstrapping and function approximation can be unstable in the off-policy case.
    &lt;ul&gt;
      &lt;li&gt;&lt;u&gt;Simplest bootstrapping&lt;/u&gt;: DP and TD.&lt;/li&gt;
      &lt;li&gt;&lt;u&gt;Simplest function approximation&lt;/u&gt;: linear, semi-gradient descent method.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example-3-tsitsiklis--van-roys-counterexample&quot;&gt;Example 3 (Tsitsiklis &amp;amp; Van Roy’s Counterexample)&lt;/h3&gt;

&lt;p&gt;This extends Example 1 with a terminal state and $R = 0$:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-2-tsitsiklis-van-roy-counterexample.png&quot; alt=&quot;Tsitsiklis &amp;amp; Van Roy&apos;s Counterexample&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Tsitsiklis &amp;amp; Van Roy’s Counterexample:&lt;/strong&gt; Extension of Example 1 with probability $\varepsilon$ of transitioning to the terminal state (shaded).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s find $w_{k+1}$ at each step that minimizes the $\overline{\text{VE}}$ between the estimated value and the expected one-step return:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
w_{k+1} &amp;amp;= \arg\min_{w \in \mathbb{R}} \sum_{s \in S} \left(\hat{v}(s, w) - \mathbb{E}_\pi\!\left[R_{t+1} + \gamma \hat{v}(S_{t+1}, w_k) \mid S_t = s\right]\right)^2 \\[6pt]
&amp;amp;= \arg\min_{w \in \mathbb{R}} \left(w - \gamma \cdot 2w_k\right)^2 + \left(2w - (1 - \varepsilon)\gamma \cdot 2w_k\right)^2 \\[6pt]
&amp;amp;= \left(\frac{6 - 4\varepsilon}{5}\right) \gamma w_k
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The sequence $\{w_k\}$ diverges when $\gamma &amp;gt; \dfrac{5}{6 - 4\varepsilon}$ and $w_0 \neq 0$.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Instability can be prevented by using special methods for function approximation.&lt;/li&gt;
  &lt;li&gt;These special methods guarantee stability because they do not extrapolate from the observed targets. They are called &lt;strong&gt;averagers&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Averagers include nearest neighbor methods and locally weighted regression, but &lt;strong&gt;not&lt;/strong&gt; popular methods such as tile coding and artificial neural networks (ANNs).&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;113-the-deadly-triad&quot;&gt;11.3 The Deadly Triad&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The danger of instability and divergence arises when we combine these 3 elements, which make up the &lt;strong&gt;deadly triad&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Function approximation&lt;/li&gt;
      &lt;li&gt;Bootstrapping&lt;/li&gt;
      &lt;li&gt;Off-policy training&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Instability can be avoided if one of the elements is absent:
    &lt;ul&gt;
      &lt;li&gt;Function approximation cannot be given up (needed for large-scale problems).&lt;/li&gt;
      &lt;li&gt;Bootstrapping can be given up but at the cost of computational and data efficiency.&lt;/li&gt;
      &lt;li&gt;Off-policy can be given up (replace Q-learning with Sarsa).&lt;/li&gt;
      &lt;li&gt;There is no perfect solution as we still need off-policy for planning and parallel learning.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;114-linear-value-function-geometry&quot;&gt;11.4 Linear Value-Function Geometry&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;To better understand the stability challenge of off-policy learning, let’s think about value-function approximation &lt;strong&gt;more abstractly and independently&lt;/strong&gt; of how learning is done.&lt;/li&gt;
  &lt;li&gt;Let’s consider the case with 3 states $S = \{s_1, s_2, s_3\}$ and 2 parameters $\mathbf{w} = (w_1, w_2)^T$.
    &lt;ul&gt;
      &lt;li&gt;All value functions exist in a 3-D space, however the parameters provide a 2-D subspace.&lt;/li&gt;
      &lt;li&gt;Any weight vector $\mathbf{w} = (w_1, w_2)^T$ is a point in the 2-D subspace and thus also a complete value function $v_\mathbf{w}$ that assigns values to all 3 states.&lt;/li&gt;
      &lt;li&gt;In linear value-function approximation, the subspace is a simple plane.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;How do we represent $v_\pi$ in the $d$-dimensional space?
    &lt;ul&gt;
      &lt;li&gt;We need to perform a projection operation.&lt;/li&gt;
      &lt;li&gt;TD methods present other solutions.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-4-linear-value-func-approx-geometry-DOWNSPACED.png&quot; alt=&quot;Linear value-func. approx. geometry&quot; /&gt;&lt;/p&gt;


&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;The Geometry of Linear Value-Function Approximation&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;Shown is the 3D space of all value functions over three states, while shown as a plane is the subspace of all value functions representable by a linear function approximator with parameter $\mathbf{w} = (w_1, w_2)^T$. The true value function $v_\pi$ is in the larger space and can be projected down (into the subspace, using a projection operator $\Pi$) to its best approximation in the value error ($\text{VE}$) sense. The best approximators in the Bellman error ($\text{BE}$), projected Bellman error ($\text{PBE}$), and temporal difference error ($\text{TDE}$) senses are all potentially different and are shown in the lower right.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;projection-operation&quot;&gt;Projection Operation&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;For the projection operation, the distance between value functions using the norm is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\lVert v \rVert_\mu^2 &amp;amp;\doteq \sum_{s \in S} \mu(s)\, v(s)^2 \\[6pt]
\overline{\text{VE}}(\mathbf{w}) &amp;amp;= \lVert v_\mathbf{w} - v_\pi \rVert_\mu^2 \\[6pt]
\Pi\, v &amp;amp;\doteq v_\mathbf{w} \quad \\[6pt]
\text{where } \mathbf{w} = \arg\min_{\mathbf{w} \in \mathbb{R}^d} \lVert v - v_\mathbf{w} \rVert_\mu^2 &amp;amp;  \hspace{0.8em} \text{and} \hspace{0.5em}  \Pi \equiv \text{projection operator}
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The representable value function that is closest to the true value function $v_\pi$ is its projection $\Pi v_\pi$ (the asymptotic solution found by MC methods).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Projection matrix&lt;/strong&gt;: with $\mathbf{D} \equiv \vert S \vert \times \vert S \vert$ diagonal matrix with $\mu(s)$ on the diagonal and $\mathbf{X} \equiv \vert S \vert \times d$ matrix whose rows are the feature vectors $\mathbf{x}(s)^T$ (a small numerical sketch follows this list):&lt;/p&gt;

\[\Pi \doteq \mathbf{X}\!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D}\]

    &lt;p&gt;If the inverse does not exist, the pseudoinverse is substituted. Using these matrices, the squared norm of a vector can be written as:&lt;/p&gt;

\[\lVert v \rVert_\mu^2 = v^T \mathbf{D}\, v\]

    &lt;p&gt;and the approximate linear value function written as:&lt;/p&gt;

\[v_\mathbf{w} = \mathbf{X}\mathbf{w}\]
  &lt;/li&gt;
&lt;/ul&gt;
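
&lt;p&gt;A small numerical sketch (my own, not from the book) of the projection operator for a toy 3-state, 2-feature example; the feature matrix and the distribution $\mu$ are made-up values for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])           # feature matrix, one row per state
mu = np.array([0.4, 0.4, 0.2])       # state weighting mu(s)
D = np.diag(mu)

# Projection matrix (pseudoinverse used in case the inverse does not exist)
Pi = X @ np.linalg.pinv(X.T @ D @ X) @ X.T @ D

v = np.array([1.0, 2.0, 4.0])        # some value function, e.g. v_pi
v_proj = Pi @ v                      # closest representable value function
ve = (v - v_proj) @ D @ (v - v_proj) # squared mu-norm of the residual
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;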

&lt;h3 id=&quot;td-solutions&quot;&gt;TD Solutions&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Bellman Error&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The true value function $v_\pi$ solves the Bellman equation exactly.&lt;/li&gt;
  &lt;li&gt;The &lt;strong&gt;Bellman error&lt;/strong&gt; shows how far off $v_\mathbf{w}$ is from $v_\pi$. The Bellman error at state $s$ is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\bar{\delta}_\mathbf{w}(s) &amp;amp;\doteq \left(\sum_a \pi(a \vert s) \sum_{s&apos;, r} p(s&apos;, r \vert s, a)\!\left[r + \gamma v_\mathbf{w}(s&apos;)\right]\right) - v_\mathbf{w}(s) \\
&amp;amp;= \mathbb{E}_\pi\!\left[R_{t+1} + \gamma v_\mathbf{w}(S_{t+1}) - v_\mathbf{w}(S_t) \mid S_t = s, A_t \sim \pi\right]
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The Bellman error is the expectation of the TD error.&lt;/li&gt;
  &lt;li&gt;The vector of all the Bellman errors, at all states, $\bar{\delta}_\mathbf{w} \in \mathbb{R}^{\vert S \vert}$, is called the &lt;strong&gt;Bellman error vector&lt;/strong&gt; ($\text{BE}$).&lt;/li&gt;
  &lt;li&gt;The overall size of $\text{BE}$ is the &lt;strong&gt;Mean Squared Bellman Error&lt;/strong&gt;, $\overline{\text{BE}}$:&lt;/li&gt;
&lt;/ul&gt;

\[\overline{\text{BE}}(\mathbf{w}) = \lVert \bar{\delta}_\mathbf{w} \rVert_\mu^2\]

&lt;ul&gt;
  &lt;li&gt;The &lt;strong&gt;Bellman operator&lt;/strong&gt; $B_\pi : \mathbb{R}^{\vert S \vert} \to \mathbb{R}^{\vert S \vert}$ is defined by:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
(B_\pi v)(s) &amp;amp;\doteq \sum_a \pi(a \vert s) \sum_{s&apos;, r} p(s&apos;, r \vert s, a)\!\left[r + \gamma v(s&apos;)\right], \quad \forall s \in S \text{ and } v : S \to \mathbb{R} \\[6pt]
\bar{\delta}_\mathbf{w} &amp;amp;= B_\pi v_\mathbf{w} - v_\mathbf{w} \\[6pt]
v_\pi &amp;amp;= B_\pi v_\pi
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The projection of the Bellman error vector back into the representable space creates the &lt;strong&gt;Projected Bellman Error $(\text{PBE})$&lt;/strong&gt; vector:&lt;/li&gt;
&lt;/ul&gt;

\[\text{PBE} = \Pi\, \bar{\delta}_\mathbf{w}\]

&lt;ul&gt;
  &lt;li&gt;The size of $\text{PBE}$, in the norm, is another measure of error in the approximate value function, called the &lt;strong&gt;Mean Square Projected Bellman Error&lt;/strong&gt;, $\overline{\text{PBE}}$:&lt;/li&gt;
&lt;/ul&gt;

\[\overline{\text{PBE}}(\mathbf{w}) = \lVert \Pi\, \bar{\delta}_\mathbf{w} \rVert_\mu^2\]

&lt;ul&gt;
  &lt;li&gt;With linear function approximation, there always exists an approximate value function (within the subspace) with zero $\overline{\text{PBE}}$; this is the TD fixed point $\mathbf{w}_\text{TD}$.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;115-gradient-descent-in-the-bellman-error&quot;&gt;11.5 Gradient Descent in the Bellman Error&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s apply the approach of SGD in dealing with the challenge of stability in off-policy learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;td-error-naive-residual-gradient-algorithm&quot;&gt;TD Error (Naive Residual-Gradient Algorithm)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s take the minimization of the expected square of the TD error as the objective, TD(0):&lt;/li&gt;
&lt;/ul&gt;

\[\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\]

&lt;ul&gt;
  &lt;li&gt;Using the TD error, we can find the Mean Squared TD error, the objective function $\overline{\text{TDE}}$:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\overline{\text{TDE}}(\mathbf{w}) &amp;amp;= \sum_{s \in S} \mu(s)\, \mathbb{E}\!\left[\delta_t^2 \mid S_t = s, A_t \sim \pi\right] \\
&amp;amp;= \sum_{s \in S} \mu(s)\, \mathbb{E}\!\left[\rho_t\, \delta_t^2 \mid S_t = s, A_t \sim b\right] \\
&amp;amp;= \mathbb{E}_b\!\left[\rho_t\, \delta_t^2\right]
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;Following the standard SGD approach, the per-step update based on a sample of this expected value:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\mathbf{w}_{t+1} &amp;amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla\!\left(\rho_t\, \delta_t^2\right) \\
&amp;amp;= \mathbf{w}_t - \alpha \rho_t\, \delta_t \nabla \delta_t \\
&amp;amp;= \mathbf{w}_t + \alpha \rho_t\, \delta_t\!\left(\nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)\right)
\end{align*}\]
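
&lt;p&gt;A minimal per-step sketch (my own, not from the book) of the naive residual-gradient update above, for linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s)$, so that $\nabla \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)$.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

# rho = pi(A_t|S_t) / b(A_t|S_t); x_t, x_next are feature vectors.
def naive_residual_gradient_step(w, x_t, x_next, r, rho, alpha, gamma):
    delta = r + gamma * (w @ x_next) - w @ x_t
    return w + alpha * rho * delta * (x_t - gamma * x_next)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;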

&lt;ul&gt;
  &lt;li&gt;This is the same as the semi-gradient TD algorithm except for the additional final term.&lt;/li&gt;
  &lt;li&gt;This method is called &lt;strong&gt;naive&lt;/strong&gt; because, by penalizing all TD errors, minimizing the $\overline{\text{TDE}}$ tends to produce temporal smoothing rather than accurate prediction.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;bellman-error-residual-gradient-algorithm&quot;&gt;Bellman Error (Residual-Gradient Algorithm)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Consider the minimization of the Bellman error (if the exact values are learned, the Bellman error is zero everywhere).&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;This yields the &lt;strong&gt;residual gradient algorithm&lt;/strong&gt;:&lt;/p&gt;

\[\begin{align*}
  \mathbf{w}_{t+1} &amp;amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla\!\left(\mathbb{E}_\pi\!\left[\delta_t\right]^2\right) \\
  &amp;amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla\!\left(\mathbb{E}_b\!\left[\rho_t\, \delta_t\right]^2\right) \\
  &amp;amp;= \mathbf{w}_t - \alpha\, \mathbb{E}_b\!\left[\rho_t\, \delta_t\right] \nabla \mathbb{E}_b\!\left[\rho_t\, \delta_t\right] \\
  &amp;amp;= \mathbf{w}_t - \alpha\, \mathbb{E}_b\!\left[\rho_t\!\left(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\right)\right] \mathbb{E}_b\!\left[\rho_t \nabla \delta_t\right] \\
  &amp;amp;= \mathbf{w}_t + \alpha\!\left[\mathbb{E}_b\!\left[\rho_t\!\left(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})\right)\right] - \hat{v}(S_t, \mathbf{w})\right]\!\left[\nabla \hat{v}(S_t, \mathbf{w}) - \gamma\, \mathbb{E}_b\!\left[\rho_t \nabla \hat{v}(S_{t+1}, \mathbf{w})\right]\right]
  \end{align*}\]
  &lt;/li&gt;
  &lt;li&gt;Two ways to make the residual-gradient algorithm work:
    &lt;ul&gt;
      &lt;li&gt;If the environment is deterministic, the two expectations are identical, so sample values can be used directly.&lt;/li&gt;
      &lt;li&gt;Otherwise, obtain 2 independent samples of the next state $S_{t+1}$ from $S_t$ (possible in simulated environments).&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In both ways above, the algorithm is guaranteed to converge to a minimum of the $\overline{\text{BE}}$ under the usual conditions on the step-size parameter.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;However, there are at least 3 ways in which the convergence of the residual-gradient algorithm is unsatisfactory:
    &lt;ul&gt;
      &lt;li&gt;It is very slow (much slower than semi-gradient methods).&lt;/li&gt;
      &lt;li&gt;It still seems to converge to the wrong values.&lt;/li&gt;
      &lt;li&gt;There is a deeper problem with the $\overline{\text{BE}}$ objective itself, covered in the next section.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;116-the-bellman-error-is-not-learnable&quot;&gt;11.6 The Bellman Error is Not Learnable&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The Bellman error is not learnable from the observed sequence of feature vectors, actions, and rewards.&lt;/li&gt;
  &lt;li&gt;The fact that the Bellman error objective cannot be determined from observable data is the strongest reason not to pursue it.&lt;/li&gt;
  &lt;li&gt;Examples of non-learnable Markov Reward Processes (MRPs):&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example-1-1&quot;&gt;Example 1&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-6-example1.png&quot; alt=&quot;VE learnability Counterexample&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Value Error (VE) Learnability Counterexample:&lt;/strong&gt; Deterministic MRP pair with an endless stream of $0$s and $2$s&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
  &lt;li&gt;These MRPs have a deterministic reward with observable data of an endless stream of 0s and 2s.&lt;/li&gt;
  &lt;li&gt;We cannot learn if the MRP has one state or two, or is stochastic or deterministic.&lt;/li&gt;
  &lt;li&gt;The pair of MRPs shows that the $\overline{\text{VE}}$ objective is not learnable:&lt;/li&gt;
&lt;/ul&gt;

\[\overline{\text{VE}}(\mathbf{w}) \doteq \sum_{s \in S} \mu(s)\!\left[v_\pi(s) - \hat{v}(s, \mathbf{w})\right]^2\]

&lt;ul&gt;
  &lt;li&gt;The $\overline{\text{VE}}$ is not learnable, but the parameter that optimizes it is!&lt;/li&gt;
  &lt;li&gt;We introduce a learnable natural objective function that is always observable. This is the error between the value estimate at each time and the return from that time, called the &lt;strong&gt;return error&lt;/strong&gt;. The &lt;strong&gt;Mean Square Return Error&lt;/strong&gt; $(\overline{\text{RE}})$ is the expectation, under $\mu$, of the square of this return error.&lt;/li&gt;
  &lt;li&gt;$\overline{\text{RE}}$ in the on-policy case is:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
\overline{\text{RE}}(\mathbf{w}) &amp;amp;= \mathbb{E}\!\left[\left(G_t - \hat{v}(S_t, \mathbf{w})\right)^2\right] \\
&amp;amp;= \overline{\text{VE}}(\mathbf{w}) + \mathbb{E}\!\left[\left(G_t - v_\pi(S_t)\right)^2\right]
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The $\overline{\text{BE}}$ can be computed from knowledge of the MDP but is not learnable from data, and its minimum solution is not learnable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example-2&quot;&gt;Example 2&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-6-example2.png&quot; alt=&quot;BE learnability Counterexample&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Bellman Error (BE) Learnability Counterexample:&lt;/strong&gt; Complex Deterministic MRP pair with same distribution but different minimizing parameter vector&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
  &lt;li&gt;The example above serves as a counterexample to the learnability of the Bellman error.&lt;/li&gt;
  &lt;li&gt;The 2 MRPs generate the same data distribution but have different minimizing parameter vectors, proving that the optimal parameter vector is not a function of the data and thus cannot be learned from it.&lt;/li&gt;
  &lt;li&gt;Other bootstrapping objectives, like $\overline{\text{PBE}}$ and $\overline{\text{TDE}}$, are learnable from data and yield optimal solutions different from each other and that of $\overline{\text{BE}}$.&lt;/li&gt;
  &lt;li&gt;$\overline{\text{BE}}$ is limited to model-based settings, therefore $\overline{\text{PBE}}$ is preferred.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/rl-sutton-barto/ch11-11-6-causal-relationships-mdps-datadistr-errors.png&quot; alt=&quot;MDPs-data distribution-objectives causal relationships&quot; /&gt;&lt;/p&gt;


&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Causal Relationships among the data distribution, MDPs &amp;amp; various objectives&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;&lt;strong&gt;Left, Monte Carlo objectives:&lt;/strong&gt; Two different MDPs can produce the same data distribution yet also produce different $\overline{\text{VE}}$s, proving that the $\overline{\text{VE}}$ objective cannot be determined from data and is not learnable. However, all such $\overline{\text{VE}}$s must have the same optimal parameter vector, $\mathbf{w}^{*}$! Moreover, this same $\mathbf{w}^{*}$ can be determined from another objective, the $\overline{\text{RE}}$, which is uniquely determined from the data distribution. Thus $\mathbf{w}^{*}$ and the $\overline{\text{RE}}$ are learnable even though the $\overline{\text{VE}}$s are not. &lt;br /&gt;
&lt;strong&gt;Right, Bootstrapping objectives:&lt;/strong&gt; Two different MDPs can produce the same data distribution yet also produce different $\overline{\text{BE}}$s &lt;em&gt;and&lt;/em&gt; have different minimizing parameter vectors; these are not learnable from the data distribution. The $\overline{\text{PBE}}$ and $\overline{\text{TDE}}$ objectives and their (different) minima can be directly determined from data and thus are learnable.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;117-gradient-td-methods&quot;&gt;11.7 Gradient-TD Methods&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s consider SGD methods for minimizing the $\overline{\text{PBE}}$.&lt;/li&gt;
  &lt;li&gt;True SGD methods, &lt;strong&gt;Gradient-TD methods&lt;/strong&gt;, have robust convergence properties even under off-policy training and nonlinear function approximation.&lt;/li&gt;
  &lt;li&gt;In the linear case, there exists an exact solution, the TD fixed point $\mathbf{w}_\text{TD}$, at which the $\overline{\text{PBE}}$ is zero.&lt;/li&gt;
  &lt;li&gt;Computing this solution directly with &lt;strong&gt;least-squares&lt;/strong&gt; methods costs $O(d^2)$ per step; instead, we want an SGD method with $O(d)$ complexity that converges robustly.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Let’s derive an SGD method for the $\overline{\text{PBE}}$ assuming linear function approximation:&lt;/p&gt;

\[\begin{align*}
\overline{\text{PBE}}(\mathbf{w}) &amp;amp;= \lVert \Pi\, \bar{\delta}_\mathbf{w} \rVert_\mu^2 \\
&amp;amp;= \left(\Pi\, \bar{\delta}_\mathbf{w}\right)^T \mathbf{D}\, \Pi\, \bar{\delta}_\mathbf{w} \\
&amp;amp;= \bar{\delta}_\mathbf{w}^T \Pi^T \mathbf{D}\, \Pi\, \bar{\delta}_\mathbf{w} \\
&amp;amp;= \bar{\delta}_\mathbf{w}^T \mathbf{D} \mathbf{X}\!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w} \\
&amp;amp;= \left(\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right)^T \!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \!\left(\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right)
\end{align*}\]

    &lt;p&gt;$\quad \left(\text{using } \Pi = \mathbf{X}\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D} \text{ and the identity } \Pi^T \mathbf{D}\, \Pi = \mathbf{D} \mathbf{X}\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D}\right)$&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;The gradient of the $\overline{\text{PBE}}$ w.r.t $\mathbf{w}$ is:&lt;/li&gt;
&lt;/ul&gt;

\[\nabla \overline{\text{PBE}}(\mathbf{w}) = 2\, \nabla\!\left[\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right]^T \!\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \!\left(\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right)\]

&lt;ul&gt;
  &lt;li&gt;Let’s turn this into an SGD method by converting the 3 factors above into &lt;strong&gt;expectations&lt;/strong&gt; under the state distribution $\mu$ (the distribution of states visited under the behavior policy):&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w} &amp;amp;= \sum_s \mu(s)\, \mathbf{x}(s)\, \bar{\delta}_\mathbf{w}(s) = \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\[6pt]
\nabla\!\left[\mathbf{X}^T \mathbf{D}\, \bar{\delta}_\mathbf{w}\right] &amp;amp;= \nabla \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\
&amp;amp;= \mathbb{E}\!\left[\rho_t \nabla \delta_t^T\, \mathbf{x}_t^T\right] \\
&amp;amp;= \mathbb{E}\!\left[\rho_t \nabla\!\left(R_{t+1} + \gamma \mathbf{w}^T \mathbf{x}_{t+1} - \mathbf{w}^T \mathbf{x}_t\right) \mathbf{x}_t^T\right] \\
&amp;amp;= \mathbb{E}\!\left[\rho_t\!\left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^T\right] \\[6pt]
\mathbf{X}^T \mathbf{D} \mathbf{X} &amp;amp;= \sum_s \mu(s)\, \mathbf{x}(s)\, \mathbf{x}(s)^T = \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]
\end{aligned}\]

&lt;p&gt;Substituting these expectations for the three factors into $\nabla \overline{\text{PBE}}$:&lt;/p&gt;

\[\nabla \overline{\text{PBE}}(\mathbf{w}) = 2\, \mathbb{E}\!\left[\rho_t\!\left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\]

&lt;ul&gt;
  &lt;li&gt;The 1st and last terms are not independent (&lt;strong&gt;biased gradient estimate&lt;/strong&gt;).&lt;/li&gt;
  &lt;li&gt;Could estimate all 3 terms separately and combine (&lt;strong&gt;unbiased gradient estimate&lt;/strong&gt;) but too computationally expensive.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;gradient-td&quot;&gt;Gradient-TD&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Estimate and store the product of the last 2 terms of $\nabla \overline{\text{PBE}}(\mathbf{w})$ (product of a $d \times d$ matrix and a $d$-vector yields a $d$-vector like $\mathbf{w}$ itself):&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{v} \approx \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\]

&lt;ul&gt;
  &lt;li&gt;In linear supervised learning, this is the solution to the linear least-squares problem of approximating $\rho_t\, \delta_t$ from the features.&lt;/li&gt;
  &lt;li&gt;The standard SGD method for incrementally finding $\mathbf{v}$ that minimizes the expected squared error $\left(\mathbf{v}^T \mathbf{x}_t - \rho_t\, \delta_t\right)^2$ is known as the &lt;strong&gt;Least Mean Square (LMS)&lt;/strong&gt; rule:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{v}_{t+1} \doteq \mathbf{v}_t + \beta\, \rho_t\!\left(\delta_t - \mathbf{v}_t^T \mathbf{x}_t\right) \mathbf{x}_t\]

\[\begin{aligned}
\text{where} \\
\beta &amp;amp;&amp;gt; 0 \equiv \text{another step-size parameter} \\
\rho_t &amp;amp;\equiv \text{importance sampling ratio} \\
O(d) &amp;amp;\equiv \text{storage \&amp;amp; per-step computational complexity}
\end{aligned}\]

&lt;h3 id=&quot;gtd2&quot;&gt;GTD2&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;With $\mathbf{v}_t$, we can update $\mathbf{w}_t$ using SGD:&lt;/p&gt;

\[\begin{align*}
\mathbf{w}_{t+1} &amp;amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla \overline{\text{PBE}}(\mathbf{w}_t) \\
&amp;amp;= \mathbf{w}_t - \tfrac{1}{2}\alpha\!\left(2\, \mathbb{E}\!\left[\rho_t\!\left(\gamma \mathbf{x}_{t+1} - \mathbf{x}_t\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\right) \\
&amp;amp;= \mathbf{w}_t + \alpha\, \mathbb{E}\!\left[\rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\
&amp;amp;\approx \mathbf{w}_t + \alpha\, \mathbb{E}\!\left[\rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T\right] \mathbf{v}_t \\
&amp;amp;\approx \mathbf{w}_t + \alpha \rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T \mathbf{v}_t
\end{align*}\]

    &lt;p&gt;with $O(d)$ per-step computational complexity if the inner product $(\mathbf{x}_t^T \mathbf{v}_t)$ is computed first.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
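
&lt;p&gt;A minimal sketch of the resulting GTD2 step with linear features is below; the secondary step size $\beta$ and the array shapes are assumptions for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def gtd2_step(w, v, x_t, x_tp1, reward, rho, alpha=0.01, beta=0.02, gamma=0.99):
    # TD error under the current weights
    delta = reward + gamma * (w @ x_tp1) - w @ x_t
    # Secondary (LMS-style) update: v tracks E[x x^T]^{-1} E[rho delta x]
    v = v + beta * rho * (delta - v @ x_t) * x_t
    # Primary GTD2 update, O(d) per step if the inner product x_t @ v is formed first
    w = w + alpha * rho * (x_t - gamma * x_tp1) * (x_t @ v)
    return w, v
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;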

&lt;h3 id=&quot;td0-with-gradient-correction-gtd0-or-tdc&quot;&gt;TD(0) with Gradient Correction (GTD(0) or TDC)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Let’s look at another analytical algorithm called TD(0) with gradient correction, &lt;strong&gt;TDC&lt;/strong&gt;:&lt;/p&gt;

\[\begin{align*}
\mathbf{w}_{t+1} &amp;amp;= \mathbf{w}_t + \alpha\, \mathbb{E}\!\left[\rho_t\!\left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right) \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\
&amp;amp;= \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\rho_t\, \mathbf{x}_t\, \mathbf{x}_t^T\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right]\right) \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\
&amp;amp;= \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right]\right) \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right] \\
&amp;amp;= \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\mathbf{x}_t\, \rho_t\, \delta_t\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right] \mathbb{E}\!\left[\mathbf{x}_t\, \mathbf{x}_t^T\right]^{-1} \mathbb{E}\!\left[\rho_t\, \delta_t\, \mathbf{x}_t\right]\right) \\
&amp;amp;\approx \mathbf{w}_t + \alpha\!\left(\mathbb{E}\!\left[\mathbf{x}_t\, \rho_t\, \delta_t\right] - \gamma\, \mathbb{E}\!\left[\rho_t\, \mathbf{x}_{t+1}\, \mathbf{x}_t^T\right] \mathbf{v}_t\right) \\
&amp;amp;\approx \mathbf{w}_t + \alpha \rho_t\!\left(\delta_t\, \mathbf{x}_t - \gamma \mathbf{x}_{t+1}\, \mathbf{x}_t^T \mathbf{v}_t\right)
\end{align*}\]

    &lt;p&gt;with $O(d)$ complexity if the final product $(\mathbf{x}_t^T \mathbf{v}_t)$ is done first.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
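
&lt;p&gt;The corresponding sketch for TDC differs from GTD2 only in the primary update (same assumptions as above):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def tdc_step(w, v, x_t, x_tp1, reward, rho, alpha=0.01, beta=0.02, gamma=0.99):
    delta = reward + gamma * (w @ x_tp1) - w @ x_t
    # Same secondary (LMS-style) update as in GTD2
    v_new = v + beta * rho * (delta - v @ x_t) * x_t
    # TDC primary update: the semi-gradient TD(0) term plus a gradient correction
    w_new = w + alpha * rho * (delta * x_t - gamma * (x_t @ v) * x_tp1)
    return w_new, v_new
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;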

&lt;h3 id=&quot;takeaways-1&quot;&gt;Takeaways&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;GTD2 and TDC both involve 2 learning processes: a primary one for $\mathbf{w}$ and a secondary one for $\mathbf{v}$.&lt;/li&gt;
  &lt;li&gt;Asymmetrical dependence ($\mathbf{w}$ depends on $\mathbf{v}$ but $\mathbf{v}$ does not depend on $\mathbf{w}$) is referred to as a &lt;strong&gt;cascade&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Gradient-TD methods are the most well-understood and widely used stable off-policy methods.&lt;/li&gt;
  &lt;li&gt;Extensions of GTD methods include to:
    &lt;ol&gt;
      &lt;li&gt;Action values and control: &lt;strong&gt;GQ [Maei et al., 2010]&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;Eligibility traces: &lt;strong&gt;GTD($\lambda$), GQ($\lambda$) [Maei, 2011; Maei &amp;amp; Sutton, 2010]&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;Nonlinear function approximation &lt;strong&gt;[Maei et al., 2009]&lt;/strong&gt;&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;Hybrid algorithms include:
    &lt;ol&gt;
      &lt;li&gt;Midway between semi-gradient TD and gradient TD &lt;strong&gt;[Hackman, 2012; White &amp;amp; White, 2016]&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;GTD + proximal methods &amp;amp; control variates &lt;strong&gt;[Mahadevan et al., 2014; Du et al., 2017]&lt;/strong&gt;&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;118-emphatic-td-methods&quot;&gt;11.8 Emphatic-TD Methods&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Let’s explore a major strategy for obtaining a cheap and efficient off-policy learning method with function approximation.&lt;/li&gt;
  &lt;li&gt;Recall that linear semi-gradient TD methods are stable when trained under the on-policy distribution.&lt;/li&gt;
  &lt;li&gt;The match between the on-policy state distribution $\mu_\pi$ and the state-transition probabilities $p(s’ \vert s, a)$ under the target policy does not exist in off-policy learning.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Mismatch Fix&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Re-weight the states, emphasizing some and de-emphasizing others, so as to return the distribution of the updates to the on-policy distribution.&lt;/li&gt;
      &lt;li&gt;Then there would be a match, and convergence and stability would be achieved. This is the idea of &lt;strong&gt;Emphatic-TD methods&lt;/strong&gt;.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The &lt;strong&gt;one-step Emphatic-TD algorithm&lt;/strong&gt; for learning episodic state values is defined by:&lt;/li&gt;
&lt;/ul&gt;

\[\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\]

\[\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha M_t \rho_t\, \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t)\]

\[M_t = \gamma \rho_{t-1} M_{t-1} + \mathcal{I}_t\]

\[\begin{aligned}
\text{where} \\
\mathcal{I}_t &amp;amp;\equiv \text{the interest} \\
M_t &amp;amp;\equiv \text{the emphasis} \quad (M_{-1} = 0)
\end{aligned}\]
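
&lt;p&gt;A minimal sketch of this one-step Emphatic-TD update with linear features is below; the interest is supplied by the user (often simply 1 for every state), and the function signature is an assumption for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def emphatic_td_step(w, M_prev, rho_prev, rho_t, x_t, x_tp1, reward,
                     interest=1.0, alpha=0.01, gamma=0.99):
    # Emphasis: a discounted, importance-corrected trace of the interest
    M = gamma * rho_prev * M_prev + interest
    delta = reward + gamma * (w @ x_tp1) - w @ x_t
    w = w + alpha * M * rho_t * delta * x_t
    return w, M
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;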

&lt;ul&gt;
  &lt;li&gt;Applying Emphatic TD to Baird’s counterexample yields very high variance in practice, making it nearly impossible to get consistent results in experiments.&lt;/li&gt;
  &lt;li&gt;We focus on how we reduce the variance in all these algorithms in the next section.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;119-reducing-variance&quot;&gt;11.9 Reducing Variance&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Off-policy learning is inherently of greater variance than on-policy learning.&lt;/li&gt;
  &lt;li&gt;The raison d’être of off-policy learning is to enable generalization to the vast number of related-but-not-identical policies.&lt;/li&gt;
  &lt;li&gt;Why is variance control critical in off-policy learning based on importance sampling?
    &lt;ul&gt;
      &lt;li&gt;Recall importance sampling involves products of policy ratios:&lt;/li&gt;
    &lt;/ul&gt;

\[\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \vert S_k)}{b(A_k \vert S_k)}\]

    &lt;ul&gt;
      &lt;li&gt;Each policy ratio has an expected value of 1, but its actual value may be very large or as small as 0:&lt;/li&gt;
    &lt;/ul&gt;

\[\mathbb{E}\!\left[\frac{\pi(A_k \vert S_k)}{b(A_k \vert S_k)}\right] \doteq \sum_a b(a \vert S_k) \frac{\pi(a \vert S_k)}{b(a \vert S_k)} = \sum_a \pi(a \vert S_k) = 1\]

    &lt;ul&gt;
      &lt;li&gt;Successive ratios are uncorrelated, so the expected value of their product is also 1, but the product itself can have very high variance (see the simulation after this list).&lt;/li&gt;
      &lt;li&gt;These ratios multiply the step size in SGD methods, so their high variance is problematic: it produces occasional huge steps.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;How can we alleviate the effects of this high variance, beyond simply setting the step size small enough that the expected step taken by SGD stays small? Some approaches:
    &lt;ul&gt;
      &lt;li&gt;Momentum&lt;/li&gt;
      &lt;li&gt;Polyak-Ruppert averaging&lt;/li&gt;
      &lt;li&gt;Methods for adaptively setting separate step sizes for different components of the parameter vector&lt;/li&gt;
      &lt;li&gt;“Importance weight aware” updates of &lt;strong&gt;Karampatziakis &amp;amp; Langford (2015)&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;Weighted importance sampling, which is well-behaved with lower variance updates than ordinary importance sampling, but adapting it to function approximation is challenging &lt;strong&gt;[Mahmood &amp;amp; Sutton, 2015]&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;Tree backup algorithm (off-policy, without importance sampling)&lt;/li&gt;
      &lt;li&gt;Allow the target policy $\pi$ to be determined partly by the behavior policy $b$ to limit creating large importance sampling ratios&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
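
&lt;p&gt;A quick simulation makes the variance problem tangible: the product of per-step ratios has an expected value of 1, yet most sample values are near zero with occasional huge spikes. The two-action policies below are an illustrative assumption, not an example from the book.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.9, 0.1])   # target policy over 2 actions (assumed)
b = np.array([0.5, 0.5])    # behavior policy (assumed)

T = 10                                       # length of each ratio product
actions = rng.choice(2, size=(100000, T), p=b)
ratios = (pi / b)[actions]                   # per-step ratios pi(a)/b(a)
products = ratios.prod(axis=1)               # the product rho over T steps, per trajectory

print(products.mean())      # close to 1, as expected
print(products.std())       # an order of magnitude larger than the mean
print(np.median(products))  # far below 1: most of the mass is near zero
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;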

&lt;hr /&gt;

&lt;h2 id=&quot;1110-summary&quot;&gt;11.10 Summary&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Off-policy learning poses a challenge that requires creating stable and efficient learning algorithms.&lt;/li&gt;
  &lt;li&gt;Tabular Q-learning makes off-policy learning seem easy, as do its generalizations to Expected Sarsa and the tree-backup algorithm.&lt;/li&gt;
  &lt;li&gt;Extension further to function approximation (even linear) is challenging.&lt;/li&gt;
  &lt;li&gt;The challenge of off-policy learning is divided into two parts:
    &lt;ul&gt;
      &lt;li&gt;Correcting the targets of learning for the behavior policy.&lt;/li&gt;
      &lt;li&gt;Dealing with the instability of bootstrapping (mismatch between off-policy and on-policy distribution of updates).&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The &lt;strong&gt;deadly triad&lt;/strong&gt; arises when we try to combine these 3 elements: &lt;strong&gt;function approximation, off-policy learning, and bootstrapping,&lt;/strong&gt; thereby causing instability and divergence.&lt;/li&gt;
  &lt;li&gt;The Bellman error $\overline{\text{BE}}$ is not learnable from data, so performing SGD on it does not work.&lt;/li&gt;
  &lt;li&gt;Gradient-TD methods perform SGD on the projected Bellman error $\overline{\text{PBE}}$, which is learnable, with $O(d)$ per-step computational complexity.&lt;/li&gt;
  &lt;li&gt;Emphatic-TD methods re-weight updates, emphasizing some and de-emphasizing others, to get the off-policy distribution of the updates to match that of on-policy.&lt;/li&gt;
  &lt;li&gt;There are many ways of reducing high variance in off-policy learning that are centered on minimizing the step taken by SGD by using small step-size parameters to counter the multiplicative effect from the successive policy ratios.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026RLsuttonBartoCh11notes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Sutton &amp;amp; Barto, Ch. 11: Off-Policy Methods with Approximation (Personal Notes)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Mar&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/03/09/rl-sutton-barto-notes-ch011/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;
</description>
        <pubDate>Mon, 09 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch011/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch011/</guid>
        
        
      </item>
    
      <item>
        <title>Sutton &amp; Barto, Ch. 10: On-Policy Control with Approximation (Personal Notes)</title>
        <description>&lt;ul&gt;
  &lt;li&gt;Let’s dive into the control problem now with parametric approximation of the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_{*}(s, a)$, where $\mathbf{w} \in \mathbb{R}^d$ is a &lt;strong&gt;finite-dimensional weight vector.&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;We’ll focus on &lt;strong&gt;semi-gradient Sarsa&lt;/strong&gt;, the natural extension of semi-gradient TD(0) to action values and to on-policy control.&lt;/li&gt;
  &lt;li&gt;We’ll look at this extension in both the episodic and continuing case.&lt;/li&gt;
  &lt;li&gt;We’ll look at $n$-step linear Sarsa.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#101-episodic-semi-gradient-control&quot;&gt;10.1 Episodic Semi-gradient Control&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#102-semi-gradient-n-step-sarsa&quot;&gt;10.2 Semi-gradient $n$-step Sarsa&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#103-average-reward-a-new-problem-setting-for-continuing-tasks&quot;&gt;10.3 Average Reward: A New Problem Setting for Continuing Tasks&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#104-deprecating-the-discounted-setting&quot;&gt;10.4 Deprecating the Discounted Setting&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#105-differential-semi-gradient-n-step-sarsa&quot;&gt;10.5 Differential Semi-gradient $n$-step Sarsa&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#106-summary&quot;&gt;10.6 Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#citation&quot;&gt;Citation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;101-episodic-semi-gradient-control&quot;&gt;10.1 Episodic Semi-gradient Control&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The extension of the semi-gradient prediction methods of Chapter 9 to action values is straightforward.&lt;/li&gt;
  &lt;li&gt;It is the approximate action-value function, $\hat{q} \approx q_\pi$, that is represented as a parametrized functional form with weight vector $\mathbf{w}$.&lt;/li&gt;
  &lt;li&gt;Before, the training examples had the form $S_t \mapsto U_t$; now the examples have the form $S_t, A_t \mapsto U_t$.&lt;/li&gt;
  &lt;li&gt;The update target $U_t$ can be any approximation of $q_\pi(S_t, A_t)$, including the usual backed-up values such as the full Monte Carlo (MC) return $G_t$ or any $n$-step Sarsa return $G_{t:t+n}$.&lt;/li&gt;
  &lt;li&gt;The general gradient-descent update for action-value prediction is:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left[U_t - \hat{q}(S_t, A_t, \mathbf{w}_t)\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)\]

&lt;ul&gt;
  &lt;li&gt;The update for the one-step Sarsa method is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\!\left[R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)}\]

&lt;ul&gt;
  &lt;li&gt;This method is called &lt;strong&gt;episodic semi-gradient one-step Sarsa&lt;/strong&gt;. For a constant policy, this method converges in the same way that TD(0) does with the same kind of error bound.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Control&lt;/strong&gt; = action-value prediction + policy improvement &amp;amp; action selection:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{a, S_{t+1} \longrightarrow \hat{q}(S_{t+1}, a, \mathbf{w}_t) \longrightarrow A^*_{t+1} = \arg\max_a \hat{q}(S_{t+1}, a, \mathbf{w}_t) \longrightarrow \varepsilon\text{-greedy policy improvement} \longrightarrow \varepsilon\text{-greedy action selection}}\]

&lt;ul&gt;
  &lt;li&gt;Linear function approximation for the action-value function is:&lt;/li&gt;
&lt;/ul&gt;

\[\hat{q}(s, a, \mathbf{w}) \doteq \mathbf{w}^T \mathbf{x}(s, a) = \sum_{i=1}^{d} w_i \cdot x_i(s, a)\]
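
&lt;p&gt;A minimal sketch of episodic semi-gradient one-step Sarsa with linear action-value features and $\varepsilon$-greedy action selection is below; the minimal environment interface (reset/step) and the feature function $\mathbf{x}(s, a)$ are assumptions for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def epsilon_greedy(w, x, s, n_actions, eps, rng):
    if rng.random() &lt; eps:
        return int(rng.integers(n_actions))
    q = np.array([w @ x(s, a) for a in range(n_actions)])
    return int(np.argmax(q))

def semi_gradient_sarsa_episode(w, x, env, n_actions, rng,
                                alpha=0.1, gamma=1.0, eps=0.1):
    s = env.reset()                            # assumed minimal environment interface
    a = epsilon_greedy(w, x, s, n_actions, eps, rng)
    done = False
    while not done:
        s_next, r, done = env.step(a)          # assumed minimal environment interface
        if done:
            target = r                         # no bootstrapping at termination
        else:
            a_next = epsilon_greedy(w, x, s_next, n_actions, eps, rng)
            target = r + gamma * (w @ x(s_next, a_next))
        # Semi-gradient update: differentiate q-hat only, hold the target fixed
        w = w + alpha * (target - w @ x(s, a)) * x(s, a)
        if not done:
            s, a = s_next, a_next
    return w
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;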

&lt;hr /&gt;

&lt;h2 id=&quot;102-semi-gradient-n-step-sarsa&quot;&gt;10.2 Semi-gradient $n$-step Sarsa&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;We use an $n$-step return as the update target for episodic semi-gradient $n$-step Sarsa. The $n$-step return generalizes from its tabular form to a function approximation form:&lt;/li&gt;
&lt;/ul&gt;

\[G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \quad t+n &amp;lt; T\]

\[\text{with } G_{t:t+n} \doteq G_t \text{ if } t+n \geq T\]

&lt;ul&gt;
  &lt;li&gt;The $n$-step update equation is:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\!\left[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t &amp;lt; T}\]

&lt;ul&gt;
  &lt;li&gt;Performance is best if an intermediate level of bootstrapping is used ($n &amp;gt; 1$).&lt;/li&gt;
&lt;/ul&gt;
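
&lt;p&gt;A small helper sketch for computing this $n$-step return with function approximation is below; the reward and state-action buffers (and their indexing convention) are assumptions for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def n_step_return(rewards, states, actions, w, x, t, n, T, gamma=1.0):
    # Assumed buffers: rewards[k] holds R_{k+1}; states[k], actions[k] hold S_k, A_k
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n &lt; T:
        # Bootstrap from the approximate value of (S_{t+n}, A_{t+n})
        G += gamma ** n * (w @ x(states[t + n], actions[t + n]))
    return G
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;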

&lt;hr /&gt;

&lt;h2 id=&quot;103-average-reward-a-new-problem-setting-for-continuing-tasks&quot;&gt;10.3 Average Reward: A New Problem Setting for Continuing Tasks&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Average reward applies to continuing problems for goal formulation in MDPs.&lt;/li&gt;
  &lt;li&gt;Average reward uses &lt;strong&gt;no discounting&lt;/strong&gt;; the agent has the same level of care for immediate and delayed rewards.&lt;/li&gt;
  &lt;li&gt;Average reward setting is more commonly considered in dynamic programming and less commonly in reinforcement learning (RL).&lt;/li&gt;
  &lt;li&gt;The discounted setting is problematic with function approximation, hence the need for average reward to replace it.&lt;/li&gt;
  &lt;li&gt;In the average-reward setting, the quality of a policy $\pi$ is defined as the average rate of reward, or simply &lt;strong&gt;average reward&lt;/strong&gt;, while following that policy, denoted as $r(\pi)$:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{align*}
r(\pi) &amp;amp;\doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] \\
&amp;amp;= \lim_{t \to \infty} \mathbb{E}\!\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] \\
&amp;amp;= \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s&apos;, r} p(s&apos;, r \vert s, a)\, r
\end{align*}\]

&lt;ul&gt;
  &lt;li&gt;The expectations in the above equations are conditioned on the initial state $S_0$, and on the subsequent actions $A_0, A_1, \ldots, A_{t-1}$, being taken according to $\pi$.&lt;/li&gt;
  &lt;li&gt;The 2nd and 3rd equations above hold if the MDP is &lt;strong&gt;ergodic&lt;/strong&gt;, i.e., if the steady-state distribution exists and is independent of the starting state $S_0$:&lt;/li&gt;
&lt;/ul&gt;

\[\mu_\pi(s) \doteq \lim_{t \to \infty} \Pr\!\left\{S_t = s \mid A_{0:t-1} \sim \pi\right\}\]

&lt;ul&gt;
  &lt;li&gt;In an ergodic MDP, the starting state can have only a temporary effect, but in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities.&lt;/li&gt;
  &lt;li&gt;Ergodicity is sufficient but not necessary to guarantee the existence of the limit in the $r(\pi)$ equation above.&lt;/li&gt;
  &lt;li&gt;It may be adequate in practice to simply order policies according to their average reward per time step, also called the &lt;strong&gt;reward rate&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;All policies that reach the maximal value of $r(\pi)$ are optimal.&lt;/li&gt;
  &lt;li&gt;The steady-state distribution $\mu_\pi$ is the special distribution under which, if you select actions according to $\pi$, you remain in the same distribution, i.e., for which:&lt;/li&gt;
&lt;/ul&gt;

\[\sum_s \mu_\pi(s) \sum_a \pi(a \vert s)\, p(s&apos; \vert s, a) = \mu_\pi(s&apos;)\]

&lt;ul&gt;
  &lt;li&gt;In the average-reward setting, returns are defined in terms of differences between rewards and the average reward; this is called the &lt;strong&gt;differential return&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \ldots\]

&lt;ul&gt;
  &lt;li&gt;The corresponding value functions for the differential return are known as &lt;strong&gt;differential value functions&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
v_\pi(s) &amp;amp;\doteq \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] \\
q_\pi(s, a) &amp;amp;\doteq \mathbb{E}_\pi\!\left[G_t \mid S_t = s, A_t = a\right]
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;Differential value functions also have Bellman equations:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
v_\pi(s) &amp;amp;= \sum_a \pi(a \vert s) \sum_{r, s&apos;} p(s&apos;, r \vert s, a)\!\left[r - r(\pi) + v_\pi(s&apos;)\right] \\[6pt]
q_\pi(s, a) &amp;amp;= \sum_{r, s&apos;} p(s&apos;, r \vert s, a)\!\left[r - r(\pi) + \sum_{a&apos;} \pi(a&apos; \vert s&apos;)\, q_\pi(s&apos;, a&apos;)\right] \\[6pt]
v_{*}(s) &amp;amp;= \max_a \sum_{r, s&apos;} p(s&apos;, r \vert s, a)\!\left[r - \max_\pi r(\pi) + v_{*}(s&apos;)\right] \\[6pt]
q_{*}(s, a) &amp;amp;= \sum_{r, s&apos;} p(s&apos;, r \vert s, a)\!\left[r - \max_\pi r(\pi) + \max_{a&apos;} q_{*}(s&apos;, a&apos;)\right]
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;The differential form of the 2 TD errors:&lt;/li&gt;
&lt;/ul&gt;

\[\begin{aligned}
\delta_t &amp;amp;\doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\
\delta_t &amp;amp;\doteq R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)
\end{aligned}\]

\[\begin{aligned}
\text{where} \quad \bar{R}_t &amp;amp;= \text{average reward } r(\pi) \text{ estimate at time } t
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;Most of the algorithms covered so far don’t change for the average-reward setting. For example, the semi-gradient Sarsa average-reward version is the same as the regular version except with the differential version of the TD error:&lt;/li&gt;
&lt;/ul&gt;

\[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)\]
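
&lt;p&gt;A minimal sketch of this differential semi-gradient Sarsa step, which maintains a running estimate $\bar{R}$ of the average reward alongside the weights, is below; the step size for $\bar{R}$ and the array shapes are assumptions for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def differential_sarsa_step(w, r_bar, x_sa, x_sa_next, reward, alpha=0.1, beta=0.01):
    # Differential TD error: reward measured relative to the average-reward estimate
    delta = reward - r_bar + w @ x_sa_next - w @ x_sa
    r_bar = r_bar + beta * delta     # update the average-reward estimate
    w = w + alpha * delta * x_sa     # semi-gradient weight update
    return w, r_bar
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;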

&lt;hr /&gt;

&lt;h2 id=&quot;104-deprecating-the-discounted-setting&quot;&gt;10.4 Deprecating the Discounted Setting&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;For the tabular case, the continuing, discounted problem formulation is useful, but in the approximate case, is this problem formulation necessary?&lt;/li&gt;
  &lt;li&gt;Should we use the discounted reward or average reward in continuing tasks?&lt;/li&gt;
  &lt;li&gt;It turns out that the average of the discounted return is proportional to the average reward.&lt;/li&gt;
  &lt;li&gt;The ordering of all policies in the average discounted return setting would be exactly the same as in the average-reward setting.&lt;/li&gt;
  &lt;li&gt;This idea of the &lt;strong&gt;futility of discounting in continuing problems&lt;/strong&gt; can be proven by the &lt;strong&gt;symmetry argument.&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;Let’s choose an objective that saves discounting by summing discounted values over the distribution with which states occur under the policy (where $v^\gamma_\pi \equiv$ discounted value function):&lt;/p&gt;

\[\begin{align*}
J(\pi) &amp;amp;= \sum_s \mu_\pi(s)\, v^\gamma_\pi(s) \\
&amp;amp;= \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s&apos;} \sum_r p(s&apos;, r \vert s, a)\!\left[r + \gamma v^\gamma_\pi(s&apos;)\right] \\
&amp;amp;= r(\pi) + \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s&apos;} \sum_r p(s&apos;, r \vert s, a)\, \gamma v^\gamma_\pi(s&apos;) \\
&amp;amp;= r(\pi) + \gamma \sum_{s&apos;} v^\gamma_\pi(s&apos;) \sum_s \mu_\pi(s) \sum_a \pi(a \vert s)\, p(s&apos; \vert s, a) \\
&amp;amp;= r(\pi) + \gamma \sum_{s&apos;} v^\gamma_\pi(s&apos;)\, \mu_\pi(s&apos;) \\
&amp;amp;= r(\pi) + \gamma J(\pi) \\
&amp;amp;= r(\pi) + \gamma\!\left(r(\pi) + \gamma J(\pi)\right) \\
&amp;amp;= r(\pi) + \gamma r(\pi) + \gamma^2 J(\pi) \\
&amp;amp;= r(\pi) + \gamma r(\pi) + \gamma^2 r(\pi) + \gamma^3 r(\pi) + \gamma^4 r(\pi) + \ldots \\
&amp;amp;= r(\pi)\!\left[1 + \gamma + \gamma^2 + \gamma^3 + \ldots\right]
\end{align*}\]

\[\hspace{-6cm} \boxed{J(\pi) = \left(\frac{1}{1-\gamma}\right) r(\pi)}\]
      &lt;/li&gt;
      &lt;li&gt;&lt;em&gt;The proposed discounted objective orders policies identically to the undiscounted (average reward) objective.&lt;/em&gt;&lt;/li&gt;
      &lt;li&gt;&lt;em&gt;The discount rate $\gamma$ does not influence the ordering.&lt;/em&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The root cause of the difficulties with the discounted control setting is that with function approximation we have lost the policy improvement theorem.&lt;/li&gt;
  &lt;li&gt;Now if we change the policy to improve the discounted value of one state, we are no longer guaranteed to have improved the overall policy.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;105-differential-semi-gradient-n-step-sarsa&quot;&gt;10.5 Differential Semi-gradient $n$-step Sarsa&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;We need an $n$-step version of the TD error in order to generalize to $n$-step bootstrapping.&lt;/li&gt;
  &lt;li&gt;Let’s generalize the $n$-step return to its differential form, with function approximation:&lt;/li&gt;
&lt;/ul&gt;

\[\boxed{G_{t:t+n} \doteq R_{t+1} - \bar{R}_{t+n-1} + \ldots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})}\]

\[\begin{aligned}
\text{where} \quad \bar{R} &amp;amp;\equiv \text{an estimate of } r(\pi),\quad n \geq 1\ \&amp;amp;\ t+n &amp;lt; T \\
G_{t:t+n} &amp;amp;\doteq G_t \quad \text{ if } t+n \geq T
\end{aligned}\]

&lt;ul&gt;
  &lt;li&gt;The $n$-step TD error is:&lt;/li&gt;
&lt;/ul&gt;

\[\delta_t \doteq G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w})\]

&lt;hr /&gt;

&lt;h2 id=&quot;106-summary&quot;&gt;10.6 Summary&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Extended parametrized function approximation &amp;amp; semi-gradient descent to control.&lt;/li&gt;
  &lt;li&gt;The extension is immediate for the episodic case; for the continuing case, it depends on a new problem formulation based on maximizing the &lt;strong&gt;average reward&lt;/strong&gt; per time step.&lt;/li&gt;
  &lt;li&gt;The discounted formulation cannot be carried over to control in the presence of approximations.&lt;/li&gt;
  &lt;li&gt;Most policies cannot be represented by a value function in the approximate case.&lt;/li&gt;
  &lt;li&gt;The scalar average reward $r(\pi)$ provides an effective way of ranking the remaining arbitrary policies.&lt;/li&gt;
  &lt;li&gt;The average reward formulation involves new &lt;strong&gt;differential&lt;/strong&gt; versions of value functions, Bellman equations, and TD errors, but all of these parallel the old ones and the conceptual changes are small.&lt;/li&gt;
  &lt;li&gt;The average reward setting has a new parallel set of differential algorithms.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this blog post helpful, please consider citing it:&lt;/p&gt;

&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026RLsuttonBartoCh10notes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Sutton &amp;amp; Barto, Ch. 10: On-Policy Control with Approximation (Personal Notes)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Mar&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/03/09/rl-sutton-barto-notes-ch010/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;
</description>
        <pubDate>Mon, 09 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch010/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/03/09/rl-sutton-barto-notes-ch010/</guid>
        
        
      </item>
    
      <item>
        <title>When Your Voice Assistant Can&apos;t Hear Tones: Evaluating ASR Bias in Igbo</title>
        <description>&lt;p&gt;I grew up in an Igbo household in Northern Nigeria, that code-switched between English, Igbo, and Hausa almost unconsciously. Like many bilingual Nigerians, I’ve watched voice assistants and ASR systems get better and better at English while struggling with our languages. When Meta released omniASR claiming support for over 1,600 languages including Igbo, I was curious. Does “supported” mean it actually works?&lt;/p&gt;

&lt;p&gt;Turns out, the answer is more complicated than I expected.&lt;/p&gt;

&lt;h2 id=&quot;the-problem-what-does-language-support-really-mean&quot;&gt;The Problem: What Does “Language Support” Really Mean?&lt;/h2&gt;

&lt;p&gt;Here’s the thing about Igbo: tone changes word meaning. The difference between “akwa” (crying), “akwà” (cloth), “àkwà” (egg), and “ákwá” (bridge) isn’t just decorative accent marks. These are completely different words that happen to have the same consonants and vowels. The tone is the difference.&lt;/p&gt;

&lt;p&gt;So when I saw that omniASR listed Igbo among its supported languages, I wanted to know: does it actually preserve these tonal distinctions? Or does “support” just mean “we trained on some Igbo data and hope for the best”?&lt;/p&gt;

&lt;h2 id=&quot;the-experiment-21-audio-samples&quot;&gt;The Experiment: 21 Audio Samples&lt;/h2&gt;

&lt;p&gt;I designed a simple test. Using my iPhone Voice Memos app, I recorded 21 short audio clips in different categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tonal minimal pairs&lt;/strong&gt;: I said “akwa, akwa, akwa” three times with no tone, then “akwà, akwà, akwà” three times with low tone, then “àkwà, àkwà, àkwà” with low-low tone, and finally “ákwá, ákwá, ákwá” with high-high tone. Four distinct words, each repeated three times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code-switching&lt;/strong&gt;: Phrases like “The ụlọ is beautiful” where I mix English and Igbo naturally, the way we actually speak.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Place names and cultural terms&lt;/strong&gt;: Nigerian cities, Igbo food words, proverbs. The stuff that’s probably not in training data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The smoking gun test&lt;/strong&gt;: I spoke a sentence with deliberately flat intonation, no tonal variation at all. If the model is actually listening to tone in the audio, it shouldn’t add tone marks to monotone speech.&lt;/p&gt;

&lt;p&gt;Then I ran everything through omniASR and compared what I actually said to what it transcribed.&lt;/p&gt;

&lt;h2 id=&quot;the-results-75-tone-loss&quot;&gt;The Results: 75% Tone Loss&lt;/h2&gt;

&lt;p&gt;The numbers were worse than I expected.&lt;/p&gt;

&lt;p&gt;For the tonal samples, the bootstrapped estimate showed the model dropping 75.5% of the tone marks. Not just a few mistakes here and there. Three out of every four tone marks, gone.&lt;/p&gt;

&lt;p&gt;When I said the four different “akwa” words, the model output was: “akua akua akua akua akwa akwa akwa akua akwa ọkua ọkua ọkua”. Random variations. The semantic distinctions completely lost.&lt;/p&gt;

&lt;p&gt;But here’s what really convinced me the model isn’t actually listening to tones: the monotone test. I spoke “O na-eri oji n’ututu” (He eats kolanut in the morning) with flat intonation, like a robot. The model transcribed it as “ọne rị ọjí nụ tútú” and added tone marks that I never spoke.&lt;/p&gt;

&lt;p&gt;If the model were using acoustic information to place diacritics, it shouldn’t be adding tones to flat speech. This suggests it’s doing something else: probably using statistical patterns from training data to guess where diacritics should go, rather than actually hearing them.&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Key Diagnostic: The Monotone Test&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;&lt;strong&gt;File 09:&lt;/strong&gt; Spoke “O na-eri oji n’ututu” with FLAT intonation&lt;br /&gt;
&lt;strong&gt;Expected:&lt;/strong&gt; 0 diacritics (no tonal variation in audio)&lt;br /&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Model added 7 tone marks that weren’t spoken&lt;br /&gt;
&lt;br /&gt;
This is evidence of &lt;strong&gt;orthographic bias,&lt;/strong&gt; not acoustic perception.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;what-the-data-shows&quot;&gt;What the Data Shows&lt;/h3&gt;

&lt;p&gt;I created three visualizations to make the patterns clear.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/omniASR/fig1_loss_by_category.png&quot; alt=&quot;loss by category&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt; shows diacritic loss by category. The tonal category (in red) jumps out immediately at 61.2% raw count loss. For comparison, the domain-specific category had only 6.3% loss. But look at the cross-lingual interference category: it’s at -38.9%, which means the model was adding diacritics that don’t exist. It’s not just dropping tones, it’s hallucinating them in the wrong places.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/omniASR/fig2_cer_vs_diacritic_loss.png&quot; alt=&quot;char error rate vs diacritic loss&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2&lt;/strong&gt; plots character error rate against diacritic loss for each sample. What’s interesting here is that the tonal samples (red dots) show high diacritic loss even when the overall character error rate is moderate (20-40%). This means tone errors aren’t just a consequence of the model doing poorly in general. The model can get most of the characters right while still completely failing on tones specifically.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/omniASR/fig3_bootstrap_ci.png&quot; alt=&quot;boostrap confidence interval&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt; shows the bootstrap confidence intervals. Even with only 21 samples, the error bars don’t overlap between categories. The tonal category’s worst-case lower bound is 57.1%, which is still terrible. This confirms that what I’m seeing isn’t just noise from a small sample size.&lt;/p&gt;

&lt;h2 id=&quot;the-statistical-story&quot;&gt;The Statistical Story&lt;/h2&gt;

&lt;p&gt;I’m not a statistician, but I know enough to be careful with small sample sizes. Twenty-one samples isn’t huge. So I used bootstrap resampling (basically, randomly resampling my data 10,000 times to get confidence intervals) to make sure these effects weren’t just random noise.&lt;/p&gt;
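
&lt;p&gt;For anyone curious what that looks like in code, here is a minimal sketch of a percentile bootstrap confidence interval over per-utterance loss values; the numbers in the example call are placeholders, not my actual data.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

def bootstrap_ci(values, n_boot=10000, ci=95, seed=0):
    # values: per-utterance diacritic-loss rates
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    boot_means = np.empty(n_boot)
    for i in range(n_boot):
        resample = rng.choice(values, size=len(values), replace=True)
        boot_means[i] = resample.mean()
    lo, hi = np.percentile(boot_means, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return values.mean(), lo, hi

# Placeholder per-utterance loss rates, not the real results
mean, lo, hi = bootstrap_ci([0.9, 0.8, 0.6, 0.7, 1.0, 0.5])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;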

&lt;p&gt;Even under the most conservative estimate (the lower bound of the 95% confidence interval), tonal diacritic loss was still 57.1%. The worst-case scenario is still terrible.&lt;/p&gt;

&lt;p&gt;I also created a custom metric called Diacritic Error Rate (DER) because standard Character Error Rate treats tone marks the same as spacing errors. DER specifically tracks dropped tone marks versus hallucinated tone marks. Turns out the model isn’t just dropping tones. It’s also adding tones that don’t exist, which is a whole different kind of problem.&lt;/p&gt;
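
&lt;p&gt;And here is a minimal sketch of how a diacritic error rate along these lines can be computed by comparing the combining tone marks present in the reference and the hypothesis; it is an illustrative reimplementation, not necessarily the exact metric in the repository.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import unicodedata

def diacritics(text):
    # Decompose to NFD so tone marks become separate combining characters
    return [c for c in unicodedata.normalize(&quot;NFD&quot;, text) if unicodedata.combining(c)]

def diacritic_error_rate(reference, hypothesis):
    ref, hyp = diacritics(reference), diacritics(hypothesis)
    dropped = max(len(ref) - len(hyp), 0)        # tone marks missing from the output
    hallucinated = max(len(hyp) - len(ref), 0)   # tone marks the model invented
    total = max(len(ref), 1)
    return {&quot;dropped&quot;: dropped / total, &quot;hallucinated&quot;: hallucinated / total}

print(diacritic_error_rate(&quot;àkwà&quot;, &quot;akwa&quot;))  # every tone mark dropped
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;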

&lt;h2 id=&quot;the-categories&quot;&gt;The Categories&lt;/h2&gt;

&lt;p&gt;Breaking down the errors helped me understand what’s going wrong:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-lingual interference&lt;/strong&gt;: When I spoke phrases with no tone marks at all (like names), the model added incorrect diacritics 38.9% of the time. It’s probably applying orthographic patterns from other languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code-switching boundary effects&lt;/strong&gt;: The English portions of code-switched sentences were transcribed perfectly. The Igbo portions immediately adjacent to English lost their tones. Something about language boundaries is disrupting processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain coverage&lt;/strong&gt;: Culturally specific terms (place names, food words) had the best diacritic preservation at only 6.3% loss, but terrible overall accuracy. The model knows the orthography but doesn’t know the words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tonal collapse&lt;/strong&gt;: 75.5% loss. This is the big one.&lt;/p&gt;

&lt;h2 id=&quot;why-this-matters&quot;&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;I keep coming back to the monotone hallucination test. If I were building a voice assistant for Igbo speakers and it’s adding tones I didn’t speak, that’s not just an accuracy problem. It’s an epistemological problem. The system is presenting confident outputs that have no acoustic basis.&lt;/p&gt;

&lt;p&gt;Imagine you’re dictating a text message in Igbo and the system confidently transcribes “crying” when you said “cloth.” Not just a typo you can spot and fix. A completely different word that makes semantic nonsense but looks plausible.&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;What 75% Tonal Loss Means&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;75.5% bootstrap diacritic loss means:&lt;br /&gt;
&lt;strong&gt;3 out of 4&lt;/strong&gt; tone marks disappear&lt;br /&gt;
&lt;strong&gt;“cloth”&lt;/strong&gt; → could mean “crying”&lt;br /&gt;
&lt;strong&gt;“egg”&lt;/strong&gt; → meaning lost entirely&lt;br /&gt;
&lt;strong&gt;“bridge”&lt;/strong&gt; → wrong word
&lt;br /&gt;&lt;br /&gt;
In English, this would be like dropping 75% of consonants.&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;This isn’t just about transcription accuracy. It’s about whether “supporting 1,600+ languages” means anything more than “we trained on data from 1,600+ languages and didn’t check if it actually works for tonal distinctions.”&lt;/p&gt;

&lt;h2 id=&quot;the-bigger-picture-zenos-paradox-of-low-resource-languages&quot;&gt;The Bigger Picture: Zeno’s Paradox of Low-Resource Languages&lt;/h2&gt;

&lt;p&gt;There’s a paper from EMNLP 2024 that talks about “The Zeno’s Paradox of Low-Resource Languages.” The basic idea: models keep claiming to support more and more languages, but the quality asymptote never actually reaches parity with high-resource languages. We get closer and closer, but never quite there.&lt;/p&gt;

&lt;p&gt;Igbo is interesting because by speaker population (45 million people), it’s not low-resource. But by model performance, it clearly behaves like one. The gap between coverage (we trained on Igbo data) and competence (the model preserves linguistically meaningful distinctions) is huge.&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;&apos;Supported&apos; ≠ Works Well&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;omniASR claims support for 1,600+ languages. Igbo has 45 million speakers, but its tonal accuracy is 24.5% (only 1 in 4 tone marks preserved).&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;Coverage&lt;/strong&gt; (in training data) ≠ &lt;strong&gt;Competence&lt;/strong&gt; (preserves meaning)&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;This makes me think about all the other languages in that 1,600+ list. How many of them have this same gap? How many communities are using systems that confidently produce nonsense because nobody with native speaker expertise has stress-tested them?&lt;/p&gt;

&lt;h2 id=&quot;what-i-learned&quot;&gt;What I Learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Small, targeted datasets can reveal problems big datasets hide.&lt;/strong&gt; I didn’t need thousands of hours of audio. Twenty-one carefully designed samples were enough to show systematic failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native speaker expertise matters.&lt;/strong&gt; Automated metrics can’t catch when “crying” is transcribed as “cloth” because the character error rate looks fine. You need someone who speaks the language to know that the semantic content is destroyed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bootstrap resampling is powerful for small samples.&lt;/strong&gt; I was worried 21 samples was too few, but bootstrap confidence intervals let me quantify uncertainty rigorously. Even the pessimistic lower bounds showed substantial effects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The monotone test is a better diagnostic than I expected.&lt;/strong&gt; If diacritics are added to flat speech, that’s clear evidence of orthographic bias over acoustic conditioning. One simple test that revealed the core mechanism.&lt;/p&gt;

&lt;h2 id=&quot;the-technical-details&quot;&gt;The Technical Details&lt;/h2&gt;

&lt;p&gt;For anyone interested in replicating this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;I used my iPhone for recording (Voice Memos app, M4A format)&lt;/li&gt;
  &lt;li&gt;Ran inference through Google Colab with omniASR’s official pipeline&lt;/li&gt;
  &lt;li&gt;Computed bootstrap CIs with 10,000 iterations at the utterance level&lt;/li&gt;
  &lt;li&gt;Created a custom DER metric to separate tonal errors from general transcription errors&lt;/li&gt;
  &lt;li&gt;All code, data, and analysis is on GitHub and HuggingFace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole analysis took about half a week of evening work. Most of that was iterating on the sample design and figuring out the right statistical approach. The actual recording and inference was maybe a day.&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s Next&lt;/h2&gt;

&lt;p&gt;This is really just a proof of concept. To make stronger claims, I’d need:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Multi-speaker evaluation (10+ speakers across different Igbo dialects)&lt;/li&gt;
  &lt;li&gt;Acoustic analysis (F0 contour tracking to verify what’s actually in the audio)&lt;/li&gt;
  &lt;li&gt;Comparative evaluation (does Whisper do better? What about Google’s USM?)&lt;/li&gt;
  &lt;li&gt;Fine-tuning experiments (can we fix this with targeted training data?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have ideas for all of these, but they’re bigger projects. For now, I’m focused on documenting the blind spot and making the methodology replicable.&lt;/p&gt;

&lt;h2 id=&quot;why-im-sharing-this&quot;&gt;Why I’m Sharing This&lt;/h2&gt;

&lt;p&gt;This started as curiosity about whether “multilingual” ASR systems actually work for the languages I grew up speaking. But it turned into something bigger.&lt;/p&gt;

&lt;p&gt;There’s a tendency in ML to treat “supporting” a language as a checkbox. Train on some data, add it to the model card, ship it. But languages aren’t just data. They’re how people communicate, how they think, how they preserve culture.&lt;/p&gt;

&lt;p&gt;When voice assistants strip tone marks from Igbo, they’re not just making transcription errors. They’re normalizing a version of the language that doesn’t preserve meaning. If every voice interface does this, what happens to how people write Igbo? Do they start thinking tone marks are optional because the AI doesn’t use them?&lt;/p&gt;

&lt;p&gt;I don’t know the answers to these questions. But I think they’re worth asking before we claim to “support” 1,600+ languages.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;p&gt;If you want to explore the data or replicate the analysis:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots&quot;&gt;HuggingFace&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href=&quot;https://github.com/chizkidd/igbo-asr-tonal-evaluation&quot;&gt;GitHub&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Audio samples:&lt;/strong&gt; You can actually listen to the 21 clips and see the transcription failures yourself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dataset is CC-BY-4.0 licensed, while the code is MIT licensed. If this is useful for your work, feel free to use it, cite it, and build on it.&lt;/p&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;This project taught me something important: you don’t need massive compute or huge datasets to find meaningful problems in ML systems. You just need to know where to look and what questions to ask.&lt;/p&gt;

&lt;p&gt;As a native Igbo speaker, I knew what questions to ask. As someone learning ML, I knew how to design tests and interpret results. That combination turned out to be more valuable than I expected.&lt;/p&gt;

&lt;p&gt;If you speak a language that’s “supported” by these big multilingual models, I encourage you to test them. Record some minimal pairs. Try code-switching. See if the system actually works the way you use the language, not just the way it appears in training data.&lt;/p&gt;

&lt;p&gt;You might be surprised what you find.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this work helpful, please consider citing it:&lt;/p&gt;
&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026igboasr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;When Your Voice Assistant Can&apos;t Hear Tones: Evaluating ASR Bias in Igbo&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Mar&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/03/04/igbo-asr-tonal-evaluation/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

</description>
        <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/03/04/igbo-asr-tonal-eval/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/03/04/igbo-asr-tonal-eval/</guid>
        
        
      </item>
    
      <item>
        <title>Tonal Fidelity in Multilingual ASR: A Diagnostic Evaluation</title>
<description>&lt;p&gt;This is a brief guide to my evaluation of tonal preservation in Facebook’s omniASR-CTC-1B Automatic Speech Recognition (ASR) model for Igbo, a tonal Niger-Congo language with 45 million speakers. The model claims support for 1,600+ languages including Igbo, but what does “support” mean when tone changes word meaning? I created 21 systematically designed audio samples, ran them through the model, and measured a 75.5% bootstrapped diacritic loss rate on tonal markers. The core finding: the model appears to generate tone marks probabilistically based on orthographic priors rather than acoustic conditioning. I cannot simplify this investigation any further.&lt;/p&gt;

&lt;p&gt;Where to find it: The dataset with audio is on &lt;a href=&quot;https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots&quot;&gt;HuggingFace&lt;/a&gt;. The code and analysis are on &lt;a href=&quot;https://github.com/chizkidd/igbo-asr-tonal-evaluation&quot;&gt;GitHub&lt;/a&gt;. The full analysis notebook is available at &lt;a href=&quot;https://github.com/chizkidd/igbo-asr-tonal-evaluation/blob/main/analysis.ipynb&quot;&gt;analysis.ipynb&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The following is my guide to stepping through the evaluation methodology.&lt;/p&gt;

&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;

&lt;p&gt;In Igbo, tone is phonemic. This means tone changes word meaning, not just prosody. The difference between:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;akwa (crying)&lt;/li&gt;
  &lt;li&gt;akwà (cloth)&lt;/li&gt;
  &lt;li&gt;àkwà (egg)&lt;/li&gt;
  &lt;li&gt;ákwá (bridge)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…isn’t decorative. These are four completely different words that happen to share consonants and vowels. The tone marks (diacritics) are the only thing distinguishing them. When omniASR lists Igbo as “supported,” does it preserve these tonal distinctions? Or does “support” just mean “we trained on some Igbo data”?&lt;/p&gt;

&lt;h2 id=&quot;dataset-design&quot;&gt;Dataset Design&lt;/h2&gt;

&lt;p&gt;I recorded 21 audio samples using my iPhone SE Voice Memos app. Each sample targets a specific failure mode across four categories.&lt;/p&gt;

&lt;p&gt;The first category tests cross-lingual orthographic interference. My hypothesis was that the model applies incorrect orthographic conventions from other languages to Igbo text. I recorded five samples: personal names without tone marks, formal greetings, numbers in Igbo, well-known proverbs, and a slow prosody test. I expected 0% diacritic loss since there was nothing to lose, but observed -38.9%, meaning the model added diacritics that don’t exist.&lt;/p&gt;

&lt;p&gt;The second category tests phonemic tone sensitivity. The hypothesis here is that the model cannot distinguish phonemically contrastive tones. I recorded six samples including minimal pairs like akwa/akwà/àkwà/ákwá and oke/òkè/ọkè, dense tone marking, a monotone control (the key diagnostic), and two Yoruba controls. I expected low loss if the model uses acoustic information, but observed 75.5% loss with a bootstrap 95% confidence interval of [57.1%, 89.7%].&lt;/p&gt;

&lt;p&gt;The smoking gun is file 09. I spoke “O na-eri oji n’ututu” with deliberately flat intonation and no tonal variation at all. The model transcribed it as “ọne rị ọjí nụ tútú” and ADDED tone marks I never spoke. If the model were using acoustics, it shouldn’t hallucinate tones on monotone speech.&lt;/p&gt;

&lt;p&gt;The third category tests language boundary effects from code-switching. I hypothesized that switching between English and Igbo disrupts language-specific processing. Five samples test different patterns: English embedding into Igbo, Igbo embedding into English, sentence-level alternation, diacritics in English context, and Nigerian Pidgin as a control. The result was 14.3% diacritic loss, with English portions transcribed perfectly while adjacent Igbo lost tone marks.&lt;/p&gt;

&lt;p&gt;The fourth category tests domain-specific lexical coverage. The hypothesis is that the model would struggle with culturally specific terms outside the training distribution. I recorded Nigerian place names, Igbo food terms, long proverbs, French as a high-resource control, and background noise robustness. This category showed the best diacritic preservation at only 6.3% loss, but terrible overall accuracy, with a 30% character error rate indicating word-level errors.&lt;/p&gt;

&lt;p&gt;The data looks like this (metadata.csv):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-csv&quot;&gt;file_name,ground_truth,model_output,category,character_error_rate,diacritics_expected,diacritics_produced
06_tonal_akwa.m4a,&quot;akwa, akwa, akwa. Akwà, akwà, akwà...&quot;,&quot;akua akua akua akua akwa akwa...&quot;,tonal_diacritics,0.583,12,3
09_tonal_flat.m4a,&quot;O na-eri oji n&apos;ututu&quot;,&quot;ọne rị ọjí nụ tútú&quot;,tonal_diacritics,0.744,0,7
...
&lt;/code&gt;&lt;/pre&gt;
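
&lt;p&gt;If you just want to poke at the numbers, a few lines of pandas are enough. This is a minimal sketch rather than code from the repo, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data/metadata.csv&lt;/code&gt; path is an assumption about the layout; the column names are the ones shown above:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pandas as pd

# Load the evaluation metadata (path assumed; point it at wherever metadata.csv lives)
df = pd.read_csv(&quot;data/metadata.csv&quot;)

# Per-category averages and diacritic totals, using the columns shown above
print(df.groupby(&quot;category&quot;)[&quot;character_error_rate&quot;].mean())
print(df.groupby(&quot;category&quot;)[[&quot;diacritics_expected&quot;, &quot;diacritics_produced&quot;]].sum())
&lt;/code&gt;&lt;/pre&gt;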

&lt;h2 id=&quot;model-inference&quot;&gt;Model Inference&lt;/h2&gt;

&lt;p&gt;I used omniASR’s official inference pipeline:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;omnilingual_asr.models.inference.pipeline&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ASRInferencePipeline&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ASRInferencePipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_card&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;omniASR_CTC_1B&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;transcription&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transcribe&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;inp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;data/audio/06_tonal_akwa.m4a&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;lang&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ibo_Latn&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The model has 975 million parameters and uses a CTC-based architecture: a wav2vec2-style encoder followed by a CTC head. It was trained on multilingual data covering over 1,600 languages and released on November 14, 2025.&lt;/p&gt;

&lt;p&gt;For each audio file, I extracted:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;ground_truth&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;akwa, akwa, akwa. Akwà, akwà, akwà. Àkwà, àkwà, àkwà. Ákwá, ákwá, ákwá.&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model_output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transcription&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;transcription&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Compare and compute metrics
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;metrics&quot;&gt;Metrics&lt;/h2&gt;

&lt;p&gt;Standard Character Error Rate (CER) conflates spacing errors with tonal errors. I defined a custom metric:&lt;/p&gt;

&lt;h3 id=&quot;diacritic-error-rate-der&quot;&gt;Diacritic Error Rate (DER)&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;diacritic_error_rate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ground_truth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;E&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count_diacritics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ground_truth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# expected
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;P&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count_diacritics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# produced
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;E&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# dropped
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;H&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;P&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;E&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# hallucinated
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;H&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;E&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;E&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;count_diacritics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;diacritics&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;ụọịàèìòùáéíóúẹṣ&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;diacritics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
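
&lt;p&gt;To make the metric concrete, here is a tiny usage example on hypothetical strings (not drawn from the dataset):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Hypothetical strings, just to exercise the metric
gt  = &quot;ákwá&quot;   # 2 expected diacritics
out = &quot;akwa&quot;   # 0 produced: 2 dropped, 0 hallucinated
print(diacritic_error_rate(gt, out))  # (2 + 0) / 2 = 1.0
&lt;/code&gt;&lt;/pre&gt;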

&lt;p&gt;DER isolates tone-related failures:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Metric&lt;/th&gt;
      &lt;th&gt;Formula&lt;/th&gt;
      &lt;th&gt;What it captures&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;CER&lt;/td&gt;
      &lt;td&gt;Levenshtein distance / length&lt;/td&gt;
      &lt;td&gt;All character errors&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;RDD (Raw Drop Rate)&lt;/td&gt;
      &lt;td&gt;dropped / expected&lt;/td&gt;
      &lt;td&gt;Only missing tone marks&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DER&lt;/td&gt;
      &lt;td&gt;(dropped + hallucinated) / expected&lt;/td&gt;
      &lt;td&gt;Total tonal deviation&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Note that DER can exceed 100% when hallucinations are substantial, because the denominator reflects ground truth expectations, not produced output.&lt;/p&gt;
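
&lt;p&gt;A quick hypothetical illustration of that edge case:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Hypothetical counts: 2 expected, 5 produced
# dropped = max(0, 2 - 5) = 0, hallucinated = max(0, 5 - 2) = 3, so DER = 3 / 2 = 150%
print(diacritic_error_rate(&quot;àà&quot;, &quot;ááááá&quot;))  # 1.5
&lt;/code&gt;&lt;/pre&gt;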

&lt;h2 id=&quot;bootstrap-uncertainty&quot;&gt;Bootstrap Uncertainty&lt;/h2&gt;

&lt;p&gt;With N=21 samples, I needed to quantify uncertainty. I used bootstrap resampling:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;bootstrap_ci&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stat_fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_boot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ci&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.95&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seed&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;42&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;rng&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;default_rng&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Point estimate
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;point&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stat_fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Bootstrap resampling
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;boots&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;empty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_boot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_boot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rng&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;integers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;boots&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stat_fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Percentile CI
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ci&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;lo&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quantile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;boots&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;hi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quantile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;boots&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;point&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Bootstrap resampling occurs at the &lt;strong&gt;utterance level&lt;/strong&gt;, not event level. This matters because diacritic distribution is uneven across samples. Some files have 0 expected tone marks, others have 12. Resampling utterances captures this variability.&lt;/p&gt;
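
&lt;p&gt;As a sketch of how this gets called on the tonal category (the exact statistic function in the notebook may differ; the column names and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tonal_diacritics&lt;/code&gt; label come from metadata.csv, while the path is assumed):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import numpy as np   # used by bootstrap_ci above
import pandas as pd

df = pd.read_csv(&quot;data/metadata.csv&quot;)  # path assumed
tonal = df[df[&quot;category&quot;] == &quot;tonal_diacritics&quot;].reset_index(drop=True)

def mean_drop_rate(rows):
    # Mean of per-utterance drop rates; utterances with 0 expected diacritics are skipped
    rows = rows[rows[&quot;diacritics_expected&quot;] &gt; 0]
    dropped = (rows[&quot;diacritics_expected&quot;] - rows[&quot;diacritics_produced&quot;]).clip(lower=0)
    return (dropped / rows[&quot;diacritics_expected&quot;]).mean()

point, lo, hi = bootstrap_ci(tonal, mean_drop_rate)
print(f&quot;{point:.1%} (95% CI: [{lo:.1%}, {hi:.1%}])&quot;)
&lt;/code&gt;&lt;/pre&gt;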

&lt;p&gt;Example result:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Raw count: 30/49 = 61.2% drop rate&lt;/li&gt;
  &lt;li&gt;Bootstrap mean: 75.5%&lt;/li&gt;
  &lt;li&gt;95% CI: [57.1%, 89.7%]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bootstrap mean exceeds the raw percentage because resampling at utterance level gives more weight to samples with extreme loss rates. Both values are reported for transparency.&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Why Bootstrap Matters&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;With only 21 samples, we need uncertainty quantification. Bootstrap resampling (10,000 iterations) shows:
&lt;strong&gt;Worst-case lower bound:&lt;/strong&gt; 57.1%&lt;br /&gt;
&lt;strong&gt;Even pessimistically,&lt;/strong&gt; loss is still &amp;gt;50%&lt;br /&gt;
&lt;strong&gt;Unlikely&lt;/strong&gt; to be a small-sample fluke&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;h3 id=&quot;quantitative-summary&quot;&gt;Quantitative Summary&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Category&lt;/th&gt;
      &lt;th&gt;Samples&lt;/th&gt;
      &lt;th&gt;Diacritic Loss&lt;/th&gt;
      &lt;th&gt;Avg CER&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Phonemic Tone Sensitivity&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;75.5%&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;50.6%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Cross-lingual Interference&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;-38.9%&lt;/td&gt;
      &lt;td&gt;28.8%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Domain-Specific Coverage&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;6.3%&lt;/td&gt;
      &lt;td&gt;30.1%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Language Boundary Effects&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;14.3%&lt;/td&gt;
      &lt;td&gt;20.0%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Overall&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;21&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;26.8%&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;32.5%&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;bootstrap-confidence-intervals&quot;&gt;Bootstrap Confidence Intervals&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Tonal category:  75.5% (95% CI: [57.1%, 89.7%])
Overall:         52.6% (95% CI: [30.3%, 69.7%])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Even under the worst-case lower bound (57.1%), tonal diacritic loss remains severe.&lt;/p&gt;

&lt;h3 id=&quot;visualizations&quot;&gt;Visualizations&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/omniASR/fig1_loss_by_category.png&quot; alt=&quot;loss by category&quot; /&gt;
Bar chart showing 61.2% raw count loss for tonal category (red), with negative values indicating diacritic hallucination (script interference).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/omniASR/fig2_cer_vs_diacritic_loss.png&quot; alt=&quot;char error rate vs diacritic loss&quot; /&gt;
Scatter plot showing tonal samples (red) have high diacritic loss even when CER is moderate.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2026/omniASR/fig3_bootstrap_ci.png&quot; alt=&quot;bootstrap confidence interval&quot; /&gt;
Forest plot showing 95% CIs for each category, with 50% threshold line.&lt;/p&gt;

&lt;h2 id=&quot;example-tonal-minimal-pairs&quot;&gt;Example: Tonal Minimal Pairs&lt;/h2&gt;

&lt;p&gt;File 06 is the clearest demonstration:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Input (what I said):
&quot;akwa, akwa, akwa. Akwà, akwà, akwà. Àkwà, àkwà, àkwà. Ákwá, ákwá, ákwá.&quot;

Model output:
&quot;akua akua akua akua akwa akwa akwa akua akwa ọkua ọkua ọkua&quot;

Expected diacritics: 12
Produced diacritics: 3
Loss rate: 75%
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The four distinct words collapsed into random variations. From a linguistic perspective, this is catastrophic. The word akwà meaning cloth got transcribed as akwa, which could mean crying instead. The word àkwà meaning egg got transcribed as akwa, and the meaning is completely lost. The word ákwá meaning bridge got transcribed as akua, which is wrong in both spelling and tone.&lt;/p&gt;

&lt;h2 id=&quot;the-monotone-test&quot;&gt;The Monotone Test&lt;/h2&gt;

&lt;p&gt;File 09 is my favorite diagnostic. Setup:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Spoke “O na-eri oji n’ututu” (He eats kolanut in the morning)&lt;/li&gt;
  &lt;li&gt;Deliberately flat intonation, like a robot&lt;/li&gt;
  &lt;li&gt;Zero tonal variation in the audio&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the model uses acoustic information to place diacritics, it should produce few or no tone marks on flat speech. Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Ground truth: &quot;O na-eri oji n&apos;ututu&quot;  (0 diacritics)
Model output: &quot;ọne rị ọjí nụ tútú&quot;    (7 diacritics)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The model ADDED tone marks I never spoke. This is clear evidence of orthographic bias over acoustic conditioning. The model is using statistical patterns from training data to guess where diacritics should go, not listening to the audio.&lt;/p&gt;

&lt;h2 id=&quot;statistical-analysis&quot;&gt;Statistical Analysis&lt;/h2&gt;

&lt;h3 id=&quot;hypothesis-testing&quot;&gt;Hypothesis Testing&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Null hypothesis (H0): Diacritic loss in tonal category ≤ other categories  
Alternative (H1): Tonal category shows higher loss

Test: Bootstrap confidence intervals (10,000 iterations, 95% CI)

Result: Tonal bootstrap mean (75.5%) substantially exceeds all other categories (highest alternative: 38.9%, for script hallucination); the point estimate is nearly 2x the next closest category.

Conclusion: Tonal degradation shows the highest loss rate of any category. The confidence intervals overlap somewhat with script hallucination because of the small sample size (N=21), but the effect size is large and consistent across resamples.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;robustness-check&quot;&gt;Robustness Check&lt;/h3&gt;

&lt;p&gt;Even under worst-case assumptions using the lower bound of the confidence interval, tonal loss remains at 57.1%, which is still greater than 50%. Overall loss stays at 30.3%, which is still substantial. This suggests the observed tonal degradation is unlikely to be driven solely by sampling variability.&lt;/p&gt;

&lt;h2 id=&quot;code&quot;&gt;Code&lt;/h2&gt;

&lt;p&gt;The full analysis is in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;analysis.ipynb&lt;/code&gt;. The core evaluation functions handle diacritic counting, character error rate calculation, and bootstrap resampling. Diacritic counting uses a set of Igbo tone mark characters and counts occurrences in the text. Character error rate is computed using Python’s SequenceMatcher for character-level similarity. Bootstrap resampling runs 10,000 iterations on the tonal diacritics category to compute confidence intervals.&lt;/p&gt;
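
&lt;p&gt;For reference, a character error rate built on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SequenceMatcher&lt;/code&gt; can be as small as this sketch (the exact function in the repo may differ):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from difflib import SequenceMatcher

def character_error_rate(ground_truth, model_output):
    # 1 minus the similarity ratio, as a character-level error proxy
    return 1.0 - SequenceMatcher(None, ground_truth, model_output).ratio()

print(character_error_rate(&quot;akwà&quot;, &quot;akwa&quot;))  # 0.25: one character out of four differs
&lt;/code&gt;&lt;/pre&gt;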

&lt;p&gt;All evaluation code is organized in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/&lt;/code&gt; directory. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;evaluate.py&lt;/code&gt; module contains metrics like DER and bootstrap confidence intervals. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;visualize.py&lt;/code&gt; module has plotting functions for all three figures. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;utils.py&lt;/code&gt; module handles data loading and validation.&lt;/p&gt;

&lt;h2 id=&quot;run-it&quot;&gt;Run it&lt;/h2&gt;

&lt;p&gt;Clone the repository and reproduce:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git clone https://github.com/chizkidd/igbo-asr-tonal-evaluation.git
&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;igbo-asr-tonal-evaluation
pip &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; requirements.txt
jupyter notebook analysis.ipynb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Or run in Google Colab: 
&lt;a href=&quot;https://colab.research.google.com/github/chizkidd/igbo-asr-tonal-evaluation/blob/main/analysis.ipynb&quot;&gt;&lt;img src=&quot;https://colab.research.google.com/assets/colab-badge.svg&quot; alt=&quot;Open In Colab&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The notebook takes about 5-10 minutes to run on Colab with a T4 GPU. You’ll see the analysis output:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Loading metadata...
Total samples: 21
Categories: 4

Computing metrics...
Overall DER: 26.8%
  Tonal category: 75.5%
  Script interference: -38.9%
  Code-switching: 14.3%
  Domain-specific: 6.3%

Bootstrap resampling (10,000 iterations)...
Tonal diacritics: 75.5% [57.1%, 89.7%]
Overall: 52.6% [30.3%, 69.7%]

Generating visualizations...
Saved: results/visualizations/fig1_loss_by_category.png
Saved: results/visualizations/fig2_cer_vs_loss.png
Saved: results/visualizations/fig3_bootstrap_ci.png
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Reproducibility&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;&lt;strong&gt;Model:&lt;/strong&gt; omniASR-CTC-1B (975M params)&lt;br /&gt;
&lt;strong&gt;Data:&lt;/strong&gt; 21 samples, 4 categories&lt;br /&gt;
&lt;strong&gt;Metrics:&lt;/strong&gt; Custom DER (Diacritic Error Rate)&lt;br /&gt;
&lt;strong&gt;Stats:&lt;/strong&gt; Bootstrap with utterance-level resampling&lt;br /&gt;
&lt;strong&gt;Code:&lt;/strong&gt; github.com/chizkidd/igbo-asr-tonal-evaluation&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;scope-and-limitations&quot;&gt;Scope and Limitations&lt;/h2&gt;

&lt;p&gt;This study demonstrates three things. First, systematic diacritic loss in omniASR on Igbo across 21 controlled samples. Second, failure to preserve tonal minimal pairs in this evaluation setup. Third, diacritic hallucination on monotone speech, which is evidence of orthographic bias.&lt;/p&gt;

&lt;p&gt;This study does not claim four things. It doesn’t claim universal failure on all Igbo speech. It doesn’t claim that tone modeling is architecturally absent from the model. It doesn’t claim that Igbo is uniquely disadvantaged compared to all other low-resource languages. And it doesn’t claim that the observed error rates generalize to all dialects or all speakers.&lt;/p&gt;

&lt;p&gt;What would strengthen these claims? Multi-speaker evaluation with 10+ speakers across different dialects. Acoustic analysis with F0 contour extraction and pitch tracking validation. Comparative evaluation on other models like Whisper, MMS, USM, and Azure Speech. And controlled resynthesis experiments that isolate acoustic factors from lexical priors.&lt;/p&gt;

&lt;div class=&quot;callout callout--note&quot;&gt;
  &lt;div class=&quot;callout__title&quot;&gt;
    &lt;strong&gt;Future Work&lt;/strong&gt;
  &lt;/div&gt;
  &lt;div class=&quot;callout__body&quot;&gt;
    
&lt;p&gt;&lt;strong&gt;Current:&lt;/strong&gt; Single speaker, 21 samples (proof-of-concept)&lt;br /&gt;
&lt;strong&gt;Next:&lt;/strong&gt; 200 samples, 10+ speakers, 5 dialects&lt;br /&gt;
&lt;strong&gt;Then:&lt;/strong&gt; Comparative evaluation (Whisper, MMS, Azure)&lt;br /&gt;
&lt;strong&gt;Finally:&lt;/strong&gt; Fine-tuning intervention with tone-annotated data&lt;/p&gt;

  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;real-production-systems&quot;&gt;Real Production Systems&lt;/h2&gt;

&lt;p&gt;Between this evaluation and a production-grade ASR fairness audit, there is a long list of things that change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data.&lt;/strong&gt; Instead of 21 samples, production evaluations use thousands of hours across multiple speakers, dialects, ages, and recording conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speakers.&lt;/strong&gt; Instead of single-speaker, you need balanced sampling across: dialects (Owerri, Onitsha, Enugu, Nsukka, Afikpo), gender, age ranges, native vs. L2 speakers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acoustic analysis.&lt;/strong&gt; Instead of just comparing transcriptions, you need F0 (fundamental frequency) tracking to verify what’s actually in the audio. Praat or similar tools extract pitch contours frame-by-frame.&lt;/p&gt;
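
&lt;p&gt;If you want to sanity-check the acoustics yourself, frame-level pitch tracking is only a few lines. This sketch uses librosa’s pyin rather than Praat and is not part of this evaluation; the file path comes from the dataset, and loading .m4a may require ffmpeg:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import numpy as np
import librosa

# The monotone-test recording from the dataset
y, sr = librosa.load(&quot;data/audio/09_tonal_flat.m4a&quot;, sr=16000)

# Frame-level fundamental frequency (F0) with pyin
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz(&quot;C2&quot;), fmax=librosa.note_to_hz(&quot;C7&quot;), sr=sr
)

# Deliberately flat speech should show a narrow F0 range over the voiced frames
voiced_f0 = f0[voiced_flag]
print(f&quot;F0 range: {np.nanmin(voiced_f0):.1f}-{np.nanmax(voiced_f0):.1f} Hz&quot;)
&lt;/code&gt;&lt;/pre&gt;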

&lt;p&gt;&lt;strong&gt;Comparative evaluation.&lt;/strong&gt; Instead of one model, you audit multiple: Whisper (OpenAI), MMS (Meta), USM (Google), Azure Speech (Microsoft). This isolates whether the problem is specific to omniASR or universal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning experiments.&lt;/strong&gt; You collect tone-annotated Igbo data (50-100 hours), fine-tune the model, and measure pre/post accuracy. This tests whether the problem is architectural or just data scarcity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world deployment.&lt;/strong&gt; You partner with Nigerian developers building voice assistants and measure downstream impact: do users trust ASR that strips tones? Does it affect adoption?&lt;/p&gt;

&lt;p&gt;All of these are important, but if you understand this 21-sample evaluation, you understand the diagnostic methodology.&lt;/p&gt;

&lt;h2 id=&quot;faq&quot;&gt;FAQ&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why only 21 samples?&lt;/strong&gt; This is a proof-of-concept for blind spot discovery. Large datasets measure prevalence; small targeted datasets reveal failure modes. I prioritized depth (systematic coverage of error types) over breadth (statistical power).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is 75.5% loss generalizable?&lt;/strong&gt; Not necessarily. This is the loss rate on my voice, my dialect, my recording setup, for these specific test cases. Multi-speaker evaluation would give population estimates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not use Word Error Rate?&lt;/strong&gt; WER measures whole-word accuracy. In Igbo, “akwa” vs “akwà” counts as correct by WER (same word, different tone), but semantically these are different words. Diacritic-specific metrics capture what WER misses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does the model “understand” Igbo?&lt;/strong&gt; That’s philosophical. Mechanically: it learned statistical patterns from training data. Whether assigning probability distributions to tokens constitutes “understanding” is up to you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does the bootstrap mean exceed the raw percentage?&lt;/strong&gt; Bootstrap resamples at utterance level. Samples with extreme loss rates (e.g., file 09 with 0 expected, 7 hallucinated) get resampled more in some iterations, pulling the mean up. This reflects uncertainty about which utterances are “typical.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s next?&lt;/strong&gt; Collect a 200-sample multi-speaker dataset across 5 Igbo dialects. After that: comparative model evaluation (Whisper vs MMS vs omniASR) and fine-tuning experiments with tone-annotated data.&lt;/p&gt;

&lt;h2 id=&quot;why-this-matters&quot;&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;There’s a tendency in ML to treat “supporting” a language as a checkbox. Add it to the model card, ship it. But Igbo has 45 million speakers. When ASR systems strip tone marks, they normalize a version of the language that doesn’t preserve meaning.&lt;/p&gt;

&lt;p&gt;If every voice interface does this, what happens to how people write Igbo? Do they internalize that tone marks are optional because the AI doesn’t use them? I don’t know, but these are questions worth asking before claiming to “support” 1,600+ languages.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;p&gt;The dataset is available on &lt;a href=&quot;https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots&quot;&gt;HuggingFace&lt;/a&gt;. The code is on &lt;a href=&quot;https://github.com/chizkidd/igbo-asr-tonal-evaluation&quot;&gt;GitHub&lt;/a&gt;. The model evaluated is &lt;a href=&quot;https://huggingface.co/facebook/omniASR-CTC-1B&quot;&gt;facebook/omniASR-CTC-1B&lt;/a&gt; on HuggingFace. The dataset is licensed under CC-BY-4.0 and the code under MIT. Feel free to use it, cite it, and build on it.&lt;/p&gt;

&lt;h2 id=&quot;citation&quot;&gt;Citation&lt;/h2&gt;

&lt;p&gt;If you found this evaluation helpful, please consider citing it:&lt;/p&gt;
&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@article&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026tonalevaluation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Tonal Fidelity in Multilingual ASR: A Diagnostic Evaluation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Obasi, Chizoba&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;journal&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;chizkidd.github.io&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2026&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;month&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Mar&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;url&lt;/span&gt;     &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;https://chizkidd.github.io/2026/03/01/tonal-fidelity-diagnostic-evaluation/&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the dataset:&lt;/p&gt;
&lt;div class=&quot;language-bibtex highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;@misc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;obasi2026igbodataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Igbo Blind Spot Dataset for omniASR-CTC-1B: Systematic Evaluation of Tonal Diacritic Loss}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;author&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Obasi, Chizoba}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{2026}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;publisher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{HuggingFace}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;howpublished&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{\url{https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots}}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;note&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{Model evaluated: facebook/omniASR-CTC-1B (975M parameters)}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Sun, 01 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://chizkidd.github.io//2026/03/01/tonal-fidelity-multilingual-asr/</link>
        <guid isPermaLink="true">https://chizkidd.github.io//2026/03/01/tonal-fidelity-multilingual-asr/</guid>
        
        
      </item>
    
  </channel>
</rss>
