Sutton & Barto, Ch. 10: On-Policy Control with Approximation (Personal Notes)
- Let’s dive into the control problem now with parametric approximation of the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_{*}(s, a)$, where $\mathbf{w} \in \mathbb{R}^d$ is a finite-dimensional weight vector.
- We’ll focus on semi-gradient Sarsa, the natural extension of semi-gradient TD(0) to action values and to on-policy control.
- We’ll look at this extension in both the episodic and continuing cases.
- We’ll look at $n$-step linear Sarsa.
Table of Contents
- 10.1 Episodic Semi-gradient Control
- 10.2 Semi-gradient $n$-step Sarsa
- 10.3 Average Reward: A New Problem Setting for Continuing Tasks
- 10.4 Deprecating the Discounted Setting
- 10.5 Differential Semi-gradient $n$-step Sarsa
- 10.6 Summary
Appendix
10.1 Episodic Semi-gradient Control
- The extension of the semi-gradient prediction methods of Chapter 9 to action values is straightforward.
- It is the approximate action-value function, $\hat{q} \approx q_\pi$, that is represented as a parametrized functional form with weight vector $\mathbf{w}$.
- Before, the training examples had the form $S_t \mapsto U_t$; now the examples have the form $S_t, A_t \mapsto U_t$.
- The update target $U_t$ can be any approximation of $q_\pi(S_t, A_t)$, including the usual backed-up values such as the full Monte Carlo (MC) return $G_t$ or any $n$-step Sarsa return $G_{t:t+n}$.
- The general gradient-descent update for action-value prediction is:
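\[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \left[ U_t - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)\]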
- The update for the one-step Sarsa method is:
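\[\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \left[ R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)\]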
- This method is called episodic semi-gradient one-step Sarsa. For a fixed policy, it converges in the same way that TD(0) does, with the same kind of error bound.
- Control = action-value prediction + policy improvement & action selection (e.g., act $\varepsilon$-greedily with respect to $\hat{q}$ and update toward the one-step Sarsa target).
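As a concrete sketch, here is episodic semi-gradient one-step Sarsa with $\varepsilon$-greedy action selection on a tiny made-up corridor task; the environment, the one-hot feature map, and all names are illustrative assumptions, not from the book:

```python
# Minimal sketch of episodic semi-gradient one-step Sarsa.
# Hypothetical corridor: states 0..4, terminal at 4, reward -1 per step.
import numpy as np

N_STATES = 5
ACTIONS = (-1, +1)  # left, right

def features(s, a_idx):
    """One-hot state-action features, so linear q-hat reduces to a table."""
    x = np.zeros(N_STATES * len(ACTIONS))
    x[s * len(ACTIONS) + a_idx] = 1.0
    return x

def q_hat(w, s, a_idx):
    return w @ features(s, a_idx)

def eps_greedy(w, s, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax([q_hat(w, s, i) for i in range(len(ACTIONS))]))

def semi_gradient_sarsa(episodes=200, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * len(ACTIONS))
    for _ in range(episodes):
        s = 0
        a = eps_greedy(w, s, eps, rng)
        while True:
            s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
            r = -1.0
            if s2 == N_STATES - 1:
                # Terminal: target is just the reward (no bootstrap term).
                w += alpha * (r - q_hat(w, s, a)) * features(s, a)
                break
            a2 = eps_greedy(w, s2, eps, rng)
            target = r + gamma * q_hat(w, s2, a2)  # one-step Sarsa target
            w += alpha * (target - q_hat(w, s, a)) * features(s, a)
            s, a = s2, a2
    return w

w = semi_gradient_sarsa()
```

With one-hot features the linear update reduces to the tabular Sarsa update, which makes the sketch easy to check: after learning, the greedy action in the start state should be "right" and $\hat{q}(0, \text{right})$ should be near the optimal return of $-4$.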
- Linear function approximation for the action-value function is:
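\[\hat{q}(s, a, \mathbf{w}) \doteq \mathbf{w}^\top \mathbf{x}(s, a) = \sum_{i=1}^{d} w_i\, x_i(s, a)\]
where $\mathbf{x}(s, a)$ is a feature vector for the state–action pair.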
10.2 Semi-gradient $n$-step Sarsa
- We use an $n$-step return as the update target for episodic semi-gradient $n$-step Sarsa. The $n$-step return generalizes from its tabular form to a function approximation form:
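\[G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n\, \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \qquad n \ge 1,\ 0 \le t < T - n,\]
with $G_{t:t+n} \doteq G_t$ if $t + n \ge T$.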
- The $n$-step update equation is:
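\[\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \left[ G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}), \qquad 0 \le t < T\]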
- Performance is best if an intermediate level of bootstrapping is used ($n > 1$).
10.3 Average Reward: A New Problem Setting for Continuing Tasks
- The average-reward setting is an alternative way of formulating the goal for continuing MDPs, i.e., problems in which the agent–environment interaction goes on forever without termination.
- The average-reward setting uses no discounting; the agent cares equally about immediate and delayed rewards.
- The average-reward setting is commonly considered in the classical theory of dynamic programming, but less commonly in reinforcement learning (RL).
- The discounted setting is problematic with function approximation, hence the need for average reward to replace it.
- In the average-reward setting, the quality of a policy $\pi$ is defined as the average rate of reward, or simply average reward, while following that policy, denoted as $r(\pi)$:
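\[\begin{align*} r(\pi) &\doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] \\ &= \lim_{t \to \infty} \mathbb{E}\!\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] \\ &= \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a)\, r \end{align*}\]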
- The expectations in the above equations are conditioned on the initial state $S_0$, and on the subsequent actions $A_0, A_1, \ldots, A_{t-1}$, being taken according to $\pi$.
- The 2nd and 3rd equations above hold if the MDP is ergodic, i.e., if the steady-state distribution exists and is independent of the starting state $S_0$:
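\[\mu_\pi(s) \doteq \lim_{t \to \infty} \Pr\{S_t = s \mid A_{0:t-1} \sim \pi\}\]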
- In an ergodic MDP, the starting state can have only a temporary effect, but in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities.
- Ergodicity is sufficient but not necessary to guarantee the existence of the limit in the $r(\pi)$ equation above.
- In practice it is adequate to simply order policies according to their average reward per time step, $r(\pi)$, also called the reward rate.
- All policies that reach the maximal value of $r(\pi)$ are optimal.
- The steady-state distribution $\mu_\pi$ is the special distribution under which, if you select actions according to $\pi$, you remain in the same distribution, i.e., for which:
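\[\sum_s \mu_\pi(s) \sum_a \pi(a \vert s)\, p(s' \vert s, a) = \mu_\pi(s')\]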
- In the average-reward setting, returns are defined in terms of differences between rewards and the average reward; this is called the differential return:
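\[G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots\]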
- The corresponding value functions for the differential return are known as differential value functions:
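\[v_\pi(s) \doteq \mathbb{E}\!\left[G_t \mid S_t = s\right], \qquad q_\pi(s, a) \doteq \mathbb{E}\!\left[G_t \mid S_t = s, A_t = a\right]\]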
- Differential value functions also have Bellman equations:
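\[\begin{align*} v_\pi(s) &= \sum_a \pi(a \vert s) \sum_{s', r} p(s', r \vert s, a) \left[ r - r(\pi) + v_\pi(s') \right] \\ q_\pi(s, a) &= \sum_{s', r} p(s', r \vert s, a) \left[ r - r(\pi) + \sum_{a'} \pi(a' \vert s')\, q_\pi(s', a') \right] \end{align*}\]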
- The differential forms of the two TD errors (state-value and action-value) are:
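\[\begin{align*} \delta_t &\doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \\ \delta_t &\doteq R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \end{align*}\]
where $\bar{R}_t$ is an estimate at time $t$ of the average reward $r(\pi)$.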
- Most of the algorithms covered so far carry over to the average-reward setting without change. For example, the average-reward version of semi-gradient Sarsa is the same as the regular version except that it uses the differential version of the TD error:
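The weight update is $\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$, with $\bar{R}$ itself updated by $\bar{R} \leftarrow \bar{R} + \beta\, \delta_t$. A runnable sketch on a hypothetical one-state continuing task (the environment and all names are illustrative assumptions, not from the book):

```python
# Minimal sketch of differential semi-gradient Sarsa.
# Hypothetical continuing task: one state, two actions with rewards 0 and 1.
import numpy as np

REWARDS = (0.0, 1.0)  # reward for each action

def differential_sarsa(steps=5000, alpha=0.1, beta=0.01, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(2)   # one weight per action (one-hot features, single state)
    r_bar = 0.0       # running estimate of the average reward r(pi)
    a = 0
    for _ in range(steps):
        r = REWARDS[a]
        # epsilon-greedy next action
        a2 = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(w))
        delta = r - r_bar + w[a2] - w[a]  # differential TD error
        r_bar += beta * delta             # update average-reward estimate
        w[a] += alpha * delta             # semi-gradient weight update
        a = a2
    return w, r_bar

w, r_bar = differential_sarsa()
```

Under the $\varepsilon$-greedy policy the agent should settle on the reward-1 action, so the learned $\bar{R}$ should come out a little below 1 (the $\varepsilon$ exploration occasionally picks the reward-0 action).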
10.4 Deprecating the Discounted Setting
- For the tabular case, the continuing, discounted problem formulation is useful; but in the approximate case, is this formulation still needed?
- Should we use the discounted reward or average reward in continuing tasks?
- It turns out that the average of the discounted return over the on-policy state distribution is proportional to the average reward.
- The ordering of all policies in the average discounted return setting would be exactly the same as in the average-reward setting.
- The futility of discounting in continuing problems can be shown by a symmetry argument.
- Let’s choose an objective that might save discounting by summing discounted values over the distribution with which states occur under the policy (where $v^\gamma_\pi \equiv$ the discounted value function):
\[\begin{align*} J(\pi) &= \sum_s \mu_\pi(s)\, v^\gamma_\pi(s) \\ &= \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s'} \sum_r p(s', r \vert s, a)\!\left[r + \gamma v^\gamma_\pi(s')\right] \\ &= r(\pi) + \sum_s \mu_\pi(s) \sum_a \pi(a \vert s) \sum_{s'} \sum_r p(s', r \vert s, a)\, \gamma v^\gamma_\pi(s') \\ &= r(\pi) + \gamma \sum_{s'} v^\gamma_\pi(s') \sum_s \mu_\pi(s) \sum_a \pi(a \vert s)\, p(s' \vert s, a) \\ &= r(\pi) + \gamma \sum_{s'} v^\gamma_\pi(s')\, \mu_\pi(s') \\ &= r(\pi) + \gamma J(\pi) \\ &= r(\pi) + \gamma\!\left(r(\pi) + \gamma J(\pi)\right) \\ &= r(\pi) + \gamma r(\pi) + \gamma^2 J(\pi) \\ &= r(\pi) + \gamma r(\pi) + \gamma^2 r(\pi) + \gamma^3 r(\pi) + \gamma^4 r(\pi) + \ldots \\ &= r(\pi)\!\left[1 + \gamma + \gamma^2 + \gamma^3 + \ldots\right] \end{align*}\] \[\boxed{J(\pi) = \left(\frac{1}{1-\gamma}\right) r(\pi)}\]
- The proposed discounted objective orders policies identically to the undiscounted (average-reward) objective.
- The discount rate $\gamma$ does not influence the ordering.
- The root cause of the difficulties with the discounted control setting is that with function approximation we have lost the policy improvement theorem.
- Now if we change the policy to improve the discounted value of one state, we are no longer guaranteed to have improved the overall policy.
10.5 Differential Semi-gradient $n$-step Sarsa
- We need an $n$-step version of the TD error in order to generalize to $n$-step bootstrapping.
- Let’s generalize the $n$-step return to its differential form, with function approximation:
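\[G_{t:t+n} \doteq R_{t+1} - \bar{R}_{t+n-1} + R_{t+2} - \bar{R}_{t+n-1} + \cdots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}),\]
where $\bar{R}_{t+n-1}$ is the average-reward estimate.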
- The $n$-step TD error is:
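\[\delta_t \doteq G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_t)\]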
10.6 Summary
- Extended parametrized function approximation & semi-gradient descent to control.
- The extension is immediate in the episodic case; in the continuing case it depends on a new problem formulation based on maximizing the average reward per time step.
- The discounted formulation cannot be carried over to control in the presence of approximations.
- Most policies cannot be represented by a value function in the approximate case.
- The scalar average reward $r(\pi)$ provides an effective way of ranking policies in this setting.
- The average reward formulation involves new differential versions of value functions, Bellman equations, and TD errors, but all of these parallel the old ones and the conceptual changes are small.
- The average reward setting has a new parallel set of differential algorithms.
Citation
If you found this blog post helpful, please consider citing it:
@article{obasi2026RLsuttonBartoCh10notes,
title = "Sutton & Barto, Ch. 10: On-Policy Control with Approximation (Personal Notes)",
author = "Obasi, Chizoba",
journal = "chizkidd.github.io",
year = "2026",
month = "Mar",
url = "https://chizkidd.github.io/2026/03/09/rl-sutton-barto-notes-ch010/"
}