2 Multi-Armed Bandits
2.1 The k-armed Bandit Problem
- This particular nonassociative, purely evaluative feedback problem is a simple version of the k-armed bandit problem.
q_*(a) \doteq \mathbb{E}\left[R_t \mid A_t = a\right] \quad \text{action-value function}
Q_t(a) \to q_*(a)
\begin{aligned} \text{where} \\ Q_t(a) &= \text{estimated action-value of action } a \text{ at time } t \end{aligned}
- Exploitation: always selecting the greedy actions (actions with highest estimated value)
- Exploration: selecting one of the nongreedy actions
- There exists an Exploration-Exploitation tradeoff
2.2 Action-value Methods
- Let’s go through how we can estimate our action-value, Q_t(a). One method is by averaging the rewards that have been received (Sample average)
Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}
\begin{aligned} \text{where} \\ a &= \text{action} \\ t &= \text{timestep} \\ \mathbb{1}_{\text{predicate}} &\equiv \text{random variable that is } 1 \text{ if predicate is true, } 0 \text{ if not} \end{aligned}
- As \displaystyle\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a} \to \infty, by the law of large numbers, Q_t(a) converges to q_*(a).
- Now that we can compute Q_t(a), how do we use it to select actions?
Greedy Method
- Simplest action selection rule for no exploration.
A_t \doteq \arg\max_a Q_t(a)
\varepsilon-greedy method
- Simplest action selection rule for ensuring continual exploration.
- For m total actions:
- With probability 1-\varepsilon: choose the greedy action
- With probability \varepsilon: choose a random action
\pi(a|s) = \left\{ \begin{array}{ll} 1 - \varepsilon + \varepsilon/m & \text{if } a = \arg\max_{a' \in A} Q(s,a') \\ \varepsilon/m & \text{otherwise} \end{array} \right.
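The \varepsilon-greedy rule can be sketched in a few lines (a minimal sketch; the `epsilon_greedy` helper name is my own, not from the text). Note that with this common implementation the random branch may itself pick the greedy action, which is why the greedy action's total probability is 1 - \varepsilon + \varepsilon/m in the policy expression above.

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With prob. epsilon pick a uniformly random action; else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))  # explore: any of the m actions
    return int(np.argmax(Q))              # exploit: highest estimated value

rng = np.random.default_rng(0)
Q = np.array([0.1, 0.9, 0.2])
a = epsilon_greedy(Q, 0.1, rng)
```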
2.3 The 10-armed Testbed
- Stationarity & determinism vs. nonstationarity & stochasticity
- Compare the relative effectiveness of greedy & \varepsilon-greedy action-value methods
2.4 Incremental Implementation
- Maintaining sample averages naively requires storing all past rewards, causing memory and computation to grow unboundedly over time.
- For a single action selected n-1 times, its value estimate Q_n is:
Q_n \doteq \frac{R_1 + R_2 + \ldots + R_{n-1}}{n-1}
How can we compute Q_t(a) estimates in a computationally and memory efficient manner?
Instead of storing all the rewards and recomputing the sum at every time step, we can derive an incremental update rule: the estimate Q_n can be updated efficiently without storing the full history.
Q_{n+1} = \frac{1}{n} \sum_{i=1}^{n} R_i
= \frac{1}{n}\left(R_n + \sum_{i=1}^{n-1} R_i\right)
= \frac{1}{n}\left(R_n + (n-1)\frac{1}{n-1}\sum_{i=1}^{n-1} R_i\right)
= \frac{1}{n}\left(R_n + (n-1)Q_n\right)
= \frac{1}{n}\left(R_n + nQ_n - Q_n\right)
\boxed{Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]}
Throughout the book, the step-size parameter is denoted \alpha (or more generally \alpha_t(a)); here \alpha = \frac{1}{n}.
In general:
\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\underbrace{\left[\text{Target} - \text{OldEstimate}\right]}_{\text{error}}
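The boxed incremental rule above can be sketched as follows (an illustrative sketch; the `update` helper and array layout are my own, not from the text):

```python
import numpy as np

def update(Q, N, a, reward):
    """Incremental sample-average update: Q_{n+1} = Q_n + (1/n)[R_n - Q_n].

    StepSize = 1/n, error = Target - OldEstimate. No reward history is stored.
    """
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]

# Averaging rewards 1.0 and 3.0 for action 0 incrementally gives Q[0] = 2.0.
Q = np.zeros(3)
N = np.zeros(3, dtype=int)
update(Q, N, 0, 1.0)
update(Q, N, 0, 3.0)
```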
2.5 Tracking a Nonstationary Problem
- Sample-average methods assume a stationary reward distribution, so they perform poorly on non-stationary bandits where the true action values drift/change over time.
- A simple fix is to replace the \frac{1}{n} step size with a constant \alpha \in (0, 1], which gives more weight to recent rewards via an exponentially decaying average over past rewards.
Q_{n+1} \doteq Q_n + \alpha\left[R_n - Q_n\right], \quad \alpha \in (0,1] \text{ is constant}
- Expanding the recurrence:
Q_{n+1} = Q_n + \alpha[R_n - Q_n]
= \alpha R_n + (1-\alpha) Q_n
= \alpha R_n + (1-\alpha)\left[\alpha R_{n-1} + (1-\alpha) Q_{n-1}\right]
= \alpha R_n + (1-\alpha)\alpha R_{n-1} + (1-\alpha)^2 Q_{n-1}
= \alpha R_n + (1-\alpha)\alpha R_{n-1} + (1-\alpha)^2\left[\alpha R_{n-2} + (1-\alpha) Q_{n-2}\right]
= \alpha R_n + (1-\alpha)\alpha R_{n-1} + (1-\alpha)^2 \alpha R_{n-2} + (1-\alpha)^3 Q_{n-2}
= \alpha R_n + (1-\alpha)\alpha R_{n-1} + (1-\alpha)^2 \alpha R_{n-2} + (1-\alpha)^3 \alpha R_{n-3} + \ldots + (1-\alpha)^{n-1}\alpha R_1 + (1-\alpha)^n Q_1
\vdots
\boxed{Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i}
Exponential recency-weighted average
- The weight given to R_i depends on how many rewards ago, (n - i), it was observed. The more recent the reward, the higher the weight assigned to it.
- This is a weighted average, which gives higher weight to more recently received rewards.
- It’s called a weighted average because the sum of the weights is 1.
\left(1-\alpha\right)^n + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} = 1
- The weighted average is more specifically called an exponential recency-weighted average because the weight decays exponentially into the past according to the exponent on (1 - \alpha), (n - i).
- Q_{n+1} becomes a weighted average of the previous rewards and the initial estimate Q_1.
\text{If } \alpha = 1 \ (\text{so } 1-\alpha = 0), \text{ then, using the convention } 0^0 = 1:
Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i
= 0^n Q_1 + \sum_{i=1}^{n} 0^{n-i} R_i
= R_n
- i.e., all of the weight goes to the most recent reward: the estimate simply tracks the last reward received.
Step-size Parameter
\begin{align*} \alpha_n(a) = \frac{1}{n} \quad &\text{(sample-average)} \Rightarrow \textbf{unbiased} \\ \alpha_n(a) = \alpha \quad &\text{(constant)} \Rightarrow \textbf{biased} \end{align*}
- Stochastic approximation theory (Robbins-Monro sequence) requires:
\begin{align*} \text{(I)} \quad & \sum_{n=1}^{\infty} \alpha_n(a) = \infty \quad \text{[overcome random fluctuations]} \\ \text{(II)} \quad & \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty \quad \text{[convergence]} \end{align*}
- Conditions (I) and (II) are both met for \alpha_n(a) = \frac{1}{n} (sample-average method).
- Condition (II) is not met for \alpha_n(a) = \alpha (constant step size): the estimates never completely converge, but continue to vary in response to the most recent rewards, which is desirable in nonstationary problems.
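The exponential recency-weighted average can be checked numerically: the recursive constant-\alpha update and the boxed closed form give the same estimate. (A sketch with illustrative helper names, not canonical code.)

```python
def recency_weighted(rewards, alpha, Q1=0.0):
    """Recursive form: Q_{n+1} = Q_n + alpha (R_n - Q_n)."""
    Q = Q1
    for r in rewards:
        Q += alpha * (r - Q)
    return Q

def closed_form(rewards, alpha, Q1=0.0):
    """Closed form: Q_{n+1} = (1-a)^n Q_1 + sum_i a (1-a)^{n-i} R_i."""
    n = len(rewards)
    total = (1 - alpha) ** n * Q1
    for i, r in enumerate(rewards, start=1):
        total += alpha * (1 - alpha) ** (n - i) * r
    return total

rewards = [1.0, 2.0, 0.5]
a_rec = recency_weighted(rewards, 0.3)
a_closed = closed_form(rewards, 0.3)
```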
2.6 Optimistic Initial Values
- The methods discussed so far are dependent to some extent on the initial action-value estimates i.e. they are biased by their initial estimates.
- Essentially, the initial action-value estimates become model parameters that must be chosen by the user.
- For example, in the 10-armed bandit testbed, if we set Q_{1}(a) = +5 instead of 0, we encourage exploration even in the greedy case.
- The agent will almost always be disappointed with its sampled rewards, since they fall below the initial estimate of +5, so it continues to explore until the estimates converge.
- This technique for encouraging exploration is called optimistic initial values (focuses on initial conditions)
Pros & Cons
- (+) It is effective on stationary problems, but not a general approach
- (-) It is not well suited to nonstationary problems because its drive for exploration is inherently temporary.
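The effect is easy to demonstrate: a purely greedy agent with optimistic estimates still tries every arm. (An illustrative simulation under my own assumptions: a 10-armed bandit with Gaussian true values and rewards, Q_1 = +5, sample-average updates.)

```python
import numpy as np

rng = np.random.default_rng(1)
k = 10
q_star = rng.normal(0.0, 1.0, k)   # hidden true action values
Q = np.full(k, 5.0)                # optimistic initial estimates
N = np.zeros(k, dtype=int)

for t in range(200):
    a = int(np.argmax(Q))          # purely greedy selection, no epsilon
    r = rng.normal(q_star[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]      # sample-average update

# Each sampled reward falls below +5, "disappointing" the agent and pushing
# it on to the next still-optimistic arm, so every arm gets tried.
```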
2.7 Upper-Confidence-Bound (UCB) Action Selection
- \varepsilon-greedy action selection explores indiscriminately, giving equal probability to all non-greedy actions regardless of their potential.
- A smarter approach is Upper Confidence Bound (UCB) action selection, which favors non-greedy actions based on both how close their value estimates are to the maximum and how uncertain those estimates are.
- Actions can be selected by:
A_t \doteq \arg\max_a \left[Q_t(a) + c\sqrt{\frac{\ln(t)}{N_t(a)}}\right]
\begin{aligned} \text{where} \\ t &\equiv \text{number of timesteps that have passed} \\ N_t(a) &\equiv \text{number of times action } a \text{ has been selected prior to time } t \\ c &> 0 \ \text{controls the degree of exploration} \ (c \equiv \text{confidence level}) \\ \sqrt{\dfrac{\ln(t)}{N_t(a)}} &\equiv \text{measure of the uncertainty/variance in the estimate of a's value} \end{aligned}
- If N_t(a) = 0, then a is considered to be a maximizing action.
- Introduces an exploration bonus term c\sqrt{\ln(t)/N_t(a)}, which measures the uncertainty in our estimate
- It grows with t (logarithmically) and shrinks as N_t(a) grows.
- The more time has passed, and the less we have sampled an action, the higher our upper-confidence-bound.
t \uparrow,\ N_t(a) \downarrow \ \Rightarrow \ \sqrt{\frac{\ln t}{N_t(a)}} \uparrow \ \Rightarrow \ \text{UCB} \uparrow
- The idea of the UCB action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of a’s value.
- The quantity being maximized over is a sort of upper bound on the possible true value of action a, with c determining the confidence level.
- Each time a is selected, N_t(a) increments and, since it appears in the denominator, the uncertainty term decreases.
- Conversely, each time any other action is selected the uncertainty increases (t increases while N_t(a) does not).
- Using \ln t means the uncertainty bonus grows without bound but at a diminishing rate (increases get smaller over time), ensuring all actions are eventually selected while those with lower value estimates or higher selection counts are chosen with decreasing frequency over time.
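The selection rule, including the convention that untried actions (N_t(a) = 0) are treated as maximizing, can be sketched as (illustrative helper name, not from the text):

```python
import numpy as np

def ucb_action(Q, N, t, c):
    """UCB: argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; untried arms come first."""
    untried = np.where(N == 0)[0]
    if untried.size > 0:                 # N_t(a) = 0 => a is a maximizing action
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

# The rarely tried arm wins despite a lower value estimate:
Q = np.array([1.0, 0.0])
N = np.array([100, 1])
choice = ucb_action(Q, N, t=101, c=2.0)
```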
Pros & Cons
- (+) UCB usually performs better than \varepsilon-greedy
- (-) UCB is more difficult than \varepsilon-greedy to extend beyond bandits to the more general RL settings
- (-) Difficulty in dealing with nonstationary problems
- (-) Difficulty in dealing with large state spaces (especially function approximation)
2.8 Gradient Bandit Algorithms
- Another action selection method, which isn’t based on action-value estimation, entails learning a numerical preference for each action a, which is denoted as H_t(a).
- The larger the preference (H_t(a)), the more often the action is taken.
- The relative preference of one action over another is more important, hence why we define the action probabilities by a soft-max distribution (i.e., Gibbs or Boltzmann):
\mathbb{P}\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} \doteq \pi_t(a) \quad \text{(softmax)}
\begin{aligned} \text{where} \\ H_t(a) &\in \mathbb{R} \equiv \text{numerical preference for action } a \text{ at time } t \\ \pi_t(a) &= \text{probability of taking action } a \text{ at time } t \end{aligned}
Initially all action preferences are the same (e.g., H_1(a) = 0 \forall a) so that all actions have an equal probability of being selected.
The action preferences are updated via stochastic gradient ascent: \begin{align*} H_{t+1}(A_t) &\doteq H_t(A_t) + \alpha(R_t - \bar{R}_t)(1 - \pi_t(A_t)), \quad \text{and} \\ H_{t+1}(a) &\doteq H_t(a) - \alpha(R_t - \bar{R}_t)\pi_t(a), \quad \forall a \neq A_t \end{align*}
\begin{aligned} \text{where} \\ \alpha &> 0 \equiv \text{step-size parameter} \\ \bar{R}_t &\in \mathbb{R} \equiv \text{average of rewards up to but not including time } t \ (\bar{R}_1 \doteq R_1) \\ \bar{R}_t &\equiv \text{baseline against which the reward is compared} \end{aligned}
- If R_t > \bar{R}_t \Rightarrow \pi_{t+1}(A_t)\uparrow \quad (selected action becomes more probable)
- If R_t < \bar{R}_t \Rightarrow \pi_{t+1}(A_t)\downarrow
- If R_t > \bar{R}_t \Rightarrow \pi_{t+1}(a)\downarrow \quad \forall a \neq A_t
- If R_t < \bar{R}_t \Rightarrow \pi_{t+1}(a)\uparrow \quad \forall a \neq A_t
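The two update cases above collapse into a single vectorized step (a sketch; `softmax` and `gradient_bandit_update` are my own helper names):

```python
import numpy as np

def softmax(H):
    """pi_t(a) = e^{H(a)} / sum_b e^{H(b)}; subtract max(H) for numerical stability."""
    z = np.exp(H - H.max())
    return z / z.sum()

def gradient_bandit_update(H, a, reward, baseline, alpha):
    """One stochastic gradient ascent step on the preferences:
    H_{t+1}(a') = H_t(a') + alpha (R_t - Rbar_t)(1{a'=A_t} - pi_t(a'))."""
    pi = softmax(H)
    one_hot = np.zeros_like(H)
    one_hot[a] = 1.0
    return H + alpha * (reward - baseline) * (one_hot - pi)

# A reward above the baseline raises H for the taken action, lowers the rest.
H = np.zeros(3)
H2 = gradient_bandit_update(H, a=1, reward=1.0, baseline=0.0, alpha=0.1)
```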
Proof: The Bandit Gradient Algorithm as Stochastic Gradient Ascent
Gradient ascent update:
H_{t+1}(a) \doteq H_t(a) + \alpha \frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}
\begin{aligned} \text{where} \quad \mathbb{E}[R_t] &= \sum_x \pi_t(x) q_{*}(x) \end{aligned}
Exact performance gradient:
\begin{align*} \frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} &= \frac{\partial}{\partial H_t(a)}\left[\sum_x \pi_t(x) q_{*}(x)\right] \\ &= \sum_x q_{*}(x) \frac{\partial \pi_t(x)}{\partial H_t(a)} \\ &= \sum_x \left(q_{*}(x) - B_t\right) \frac{\partial \pi_t(x)}{\partial H_t(a)} \\ &= \sum_x \pi_t(x)\left[q_{*}(x) - B_t\right] \frac{\partial \pi_t(x)}{\partial H_t(a)} \frac{1}{\pi_t(x)} \\ &= \mathbb{E}\left[\left(q_{*}(A_t) - B_t\right) \frac{\partial \pi_t(A_t)}{\partial H_t(a)} \frac{1}{\pi_t(A_t)}\right] \quad (B_t = \bar{R}_t) \\ &= \mathbb{E}\left[\left(R_t - \bar{R}_t\right) \frac{\partial \pi_t(A_t)}{\partial H_t(a)} \frac{1}{\pi_t(A_t)}\right] \end{align*}
- Subtracting the baseline B_t leaves the gradient unchanged because \sum_x \frac{\partial \pi_t(x)}{\partial H_t(a)} = 0 (the probabilities always sum to 1, so their derivatives sum to 0).
We may replace q_*(A_t) by the sample R_t because \mathbb{E}[R_t \mid A_t] = q_*(A_t), so the expectation is unchanged.
Aside:
\frac{\partial \pi_t(x)}{\partial H_t(a)} = \pi_t(x)\left[\mathbb{1}_{a=x} - \pi_t(a)\right]
\text{where } \mathbb{1}_{a=x} = \left\{ \begin{array}{ll} 1 & \text{if } a = x \\ 0 & \text{otherwise} \end{array} \right\}
Therefore:
\begin{align*} \frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} &= \mathbb{E}\left[(R_t - \bar{R}_t)\,\pi_t(A_t)\left(\mathbb{1}_{a=A_t} - \pi_t(a)\right)\frac{1}{\pi_t(A_t)}\right] \\ &= \mathbb{E}\left[(R_t - \bar{R}_t)\left(\mathbb{1}_{a=A_t} - \pi_t(a)\right)\right] \end{align*}
\Rightarrow \boxed{H_{t+1}(a) = H_t(a) + \alpha(R_t - \bar{R}_t)\left(\mathbb{1}_{a=A_t} - \pi_t(a)\right)} \quad \text{for all } a
Prove \dfrac{\partial \pi_t(x)}{\partial H_t(a)} = \pi_t(x)\left[\mathbb{1}_{a=x} - \pi_t(a)\right]:
- Using the quotient rule: \dfrac{\partial}{\partial z}\left[\dfrac{f(z)}{g(z)}\right] = \dfrac{f'(z)g(z) - f(z)g'(z)}{g(z)^2}:
\begin{align*} \frac{\partial \pi_t(x)}{\partial H_t(a)} &= \frac{\partial}{\partial H_t(a)}\left[\frac{e^{H_t(x)}}{\sum_{y=1}^{k} e^{H_t(y)}}\right] \\ &= \frac{\frac{\partial e^{H_t(x)}}{\partial H_t(a)} \sum_{y=1}^{k} e^{H_t(y)} - e^{H_t(x)} \frac{\partial \sum_{y=1}^{k} e^{H_t(y)}}{\partial H_t(a)}}{\left(\sum_{y=1}^{k} e^{H_t(y)}\right)^2} \\ &= \frac{\mathbb{1}_{a=x}\, e^{H_t(x)} \sum_{y=1}^{k} e^{H_t(y)} - e^{H_t(x)} e^{H_t(a)}}{\left(\sum_{y=1}^{k} e^{H_t(y)}\right)^2} \\ &= \frac{\mathbb{1}_{a=x}\, e^{H_t(x)}}{\sum_{y=1}^{k} e^{H_t(y)}} - \frac{e^{H_t(x)} e^{H_t(a)}}{\left(\sum_{y=1}^{k} e^{H_t(y)}\right)^2} \\ &= \mathbb{1}_{a=x}\, \pi_t(x) - \pi_t(x)\pi_t(a) \\ &= \pi_t(x)\left(\mathbb{1}_{a=x} - \pi_t(a)\right) \end{align*}
\boxed{\frac{\partial \pi_t(x)}{\partial H_t(a)} = \pi_t(x)\left[\mathbb{1}_{a=x} - \pi_t(a)\right]}
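The boxed derivative can be sanity-checked against a central finite difference (a numerical check under my own setup: an arbitrary preference vector H and a small perturbation \varepsilon):

```python
import numpy as np

def softmax(H):
    z = np.exp(H - H.max())  # shift by max(H) for numerical stability
    return z / z.sum()

# Compare d pi(x) / d H(a) = pi(x) (1{a=x} - pi(a)) with a finite difference.
H = np.array([0.2, -0.5, 1.0])
a, x, eps = 0, 2, 1e-6

Hp, Hm = H.copy(), H.copy()
Hp[a] += eps
Hm[a] -= eps
numeric = (softmax(Hp)[x] - softmax(Hm)[x]) / (2 * eps)

pi = softmax(H)
analytic = pi[x] * ((1.0 if a == x else 0.0) - pi[a])
```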
2.9 Associative Search (Contextual Bandits)
- Nonassociative tasks: tasks where there is no need to associate different actions with different situations.
- Associative Search: extend nonassociative tasks to the associative setting.
- Consider a setting where the bandit task changes from step to step, but the value distributions of the arms within each task remain fixed.
- These are called associative search tasks, or contextual bandits, where the agent must learn to associate different policies with different task contexts.
- These tasks are intermediate between the k-armed bandit stationary problem & the full reinforcement learning (RL) problem
- Like the full RL problem: they learn a policy
- Like k-armed bandit: each action affects only the immediate reward
- These tasks involve both trial & error learning to search for the best actions & association of these actions with the situations in which they are best
- Full RL problem: Actions are allowed to affect the next situation as well as the reward.
2.10 Summary
The value of an action can be estimated by Q_t(a), the sample average of the rewards received when action a was taken prior to time t.
Understand the need for incremental action-value estimation in stationary bandit problems to minimize memory usage and computational complexity.
Introduced to weighted averages, the step-size parameter \alpha, and stochastic approximation conditions in non-stationary bandit problems.
Understand the distinction between stationary/deterministic and nonstationary/stochastic bandit problems.
Understand the transition from nonassociative to associative search tasks as a stepping stone toward the full RL problem.
Presented several simple ways of balancing exploration & exploitation during action selection:
- \varepsilon-greedy: choose randomly a small fraction of the time
- UCB: choose deterministically but achieve exploration by subtly favoring at each step the actions that have received fewer samples
- Gradient Bandit algorithms: estimate action preferences, not action values, and favor the more preferred actions in a graded, probabilistic manner using a soft-max distribution
- Optimistic Initialization: initializing estimates optimistically causes even greedy methods to explore significantly
How do we know which of these methods is best?
- We can run them all on the 10-armed testbed (k = 10) and compare their performance, but each method has a parameter of its own.
- To get a meaningful comparative study we have to consider their performance as a function of their parameter.
- Simple learning curves would be too complex and crowded in a graph to make clear comparisons.
- Instead we use a different kind of graph, called a parameter study:
- We can estimate each point as the average reward obtained over 1000 steps with a particular algorithm at a particular parameter setting.
- From the inverted U-shape of each algorithm’s performance curve, we can deduce that the algorithms all perform best at an intermediate value of their parameter, neither too large nor too small.
- We can study each algorithm's sensitivity to its parameter by plotting its performance (average reward) over a range of parameter values spanning roughly an order of magnitude.
These bandit methods studied in this chapter are considered state of the art despite their simplicity.
Gittins index is a well-studied Bayesian approach to balancing exploration and exploitation in k-armed bandit problems.
- Posterior sampling or Thompson sampling is used for action selection.
In the Bayesian setting one could in principle compute the optimal balance of exploration and exploitation exactly by planning over the tree of possible information states, but for any long horizon this computation is outrageously large; it can, however, be approximated efficiently.
- This approximation turns the bandit problem into an instance of the full RL problem, hence the need for approximate RL methods.