Bayesian Bandits: Thompson Sampling with Implementation and Regret Analysis

2022, Sep 01    

Thompson Sampling: At every round of Thompson Sampling, we sample a bandit environment from the space of possible environments that is likely to be close to the one we are actually facing, given the history of the previous rounds. In other words, we maintain a posterior distribution $Q_{t}(v | X_{t-1}, A_{t-1},\cdots, X_{1}, A_{1})$, where $v$ is a bandit environment, $A_{t}$ is the action taken at round $t$, and $X_{t}$ is the reward observed after action $A_{t}$. For brevity, let’s define $H_{t} = (X_{t}, A_{t},\cdots, X_{1}, A_{1})$ to be the history of actions and rewards. This approach returns a policy $\pi = (\pi_{t})_{t=1}^{\infty}$ such that

$\pi_{t}(a|H_{t-1}) = Q_{t}(B_{a}|H_{t-1})$

where,

$B_{a} = \{v \in \mathcal{E}: \mu_{a}(v) = \max_{b} \mu_{b}(v)\}$

is the set of environments, within the class of possible environments $\mathcal{E}$, in which arm $a$ is optimal. Ties are broken arbitrarily.
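For example, for a two-armed Bernoulli bandit, $B_{1} = \{(\mu_{1}, \mu_{2}) \in [0,1]^{2}: \mu_{1} \geq \mu_{2}\}$, so $\pi_{t}(1|H_{t-1})$ is simply the posterior probability that arm 1 has the higher mean.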

Pseudocode:
for $t = 1, 2, \cdots , n$ do
$\quad\quad$sample $v_{t}$ from $Q_{t}(.|H_{t-1})$
$\quad\quad$choose $A_{t} = \arg \max_{1 \leq i \leq k} \mu_{i}(v_{t})$
$\quad\quad$observe reward $X_{t}$
$\quad\quad$update $H_{t} = (X_{t}, A_{t}) \oplus H_{t-1}$
end for

Here, $\oplus$ means concatenation.
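As a minimal sketch, the loop above can be written generically in Python; the callables `sample_env` (playing the role of sampling $v_{t}$ from $Q_{t}(\cdot|H_{t-1})$) and `pull` (returning the reward $X_{t}$) are illustrative names, and the Gaussian usage example with a $\mathcal{N}(0,1)$ prior on each arm's mean is just one convenient conjugate choice.

```python
import numpy as np

def thompson_sampling(sample_env, pull, n_rounds):
    """Generic Thompson Sampling loop mirroring the pseudocode above.

    sample_env(history) -> array of sampled mean rewards, i.e. a plausible
                           bandit v_t drawn from the posterior Q_t(.|H_{t-1})
    pull(arm)           -> the observed reward X_t for the chosen arm
    Returns the history H_n as a list of (A_t, X_t) pairs.
    """
    history = []
    for _ in range(n_rounds):
        v_t = sample_env(history)      # sample v_t from Q_t(.|H_{t-1})
        a_t = int(np.argmax(v_t))      # A_t = argmax_i mu_i(v_t)
        x_t = pull(a_t)                # observe reward X_t
        history.append((a_t, x_t))     # H_t = (X_t, A_t) + H_{t-1}
    return history

# Usage sketch: a Gaussian bandit with unit-variance rewards and a N(0, 1)
# prior on each arm's mean, so the posterior of each mean is also Gaussian.
rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])  # hypothetical arm means

def sample_env(history):
    k = len(true_means)
    counts, sums = np.zeros(k), np.zeros(k)
    for a, x in history:
        counts[a] += 1
        sums[a] += x
    # Posterior of arm i's mean: N(sums[i] / (counts[i] + 1), 1 / (counts[i] + 1)).
    return rng.normal(sums / (counts + 1), 1.0 / np.sqrt(counts + 1))

def pull(arm):
    return rng.normal(true_means[arm], 1.0)

history = thompson_sampling(sample_env, pull, n_rounds=2000)
print("pull counts:", np.bincount([a for a, _ in history], minlength=len(true_means)))
```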

A very simple example for Bernoulli bandits: for every arm, we keep two counters $(\alpha, \beta)$, where $\alpha$ is the number of times the arm yielded a reward and $\beta$ is the number of times it did not. We sample a mean reward for the arm from a Beta distribution, and this sample is likely to be close to $\frac{\alpha}{\alpha + \beta}$. In this way, we get an estimate of the mean reward of each arm, and these sampled means together form the bandit $v_{t}$ mentioned in the pseudocode. We then simply play the best arm of $v_{t}$.

Suppose we are doing ad recommendation: we have $k$ ads we can render to a user. For each ad, we keep track of $(\alpha, \beta)$, starting from $(\alpha = 0, \beta = 0)$ and sampling from $\mathrm{Beta}(\alpha + 1, \beta + 1)$, which corresponds to a uniform $\mathrm{Beta}(1, 1)$ prior. A minimal Python sketch of Thompson Sampling in this scenario is shown below:
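In the sketch, the click-through rates in `true_ctrs` and the simulated click model are hypothetical, chosen only to make the example runnable end to end; each ad's mean reward is sampled from $\mathrm{Beta}(\alpha + 1, \beta + 1)$, as described above.

```python
import numpy as np

def thompson_sampling_ads(true_ctrs, n_rounds=10_000, seed=0):
    """Beta-Bernoulli Thompson Sampling for the ad-recommendation example.

    true_ctrs : hypothetical click probabilities of the k ads; known only to
                the simulator, never to the algorithm.
    Keeps (alpha, beta) = (#clicks, #non-clicks) per ad, starting at (0, 0),
    and samples each ad's mean reward from Beta(alpha + 1, beta + 1).
    """
    rng = np.random.default_rng(seed)
    k = len(true_ctrs)
    alpha = np.zeros(k)   # clicks observed for each ad
    beta = np.zeros(k)    # impressions without a click for each ad
    clicks = 0

    for _ in range(n_rounds):
        # Sample a plausible bandit v_t: one mean reward per ad.
        sampled_means = rng.beta(alpha + 1, beta + 1)
        # Show the ad that is best in the sampled bandit.
        a = int(np.argmax(sampled_means))
        # Observe the Bernoulli reward (click / no click).
        reward = rng.random() < true_ctrs[a]
        alpha[a] += reward
        beta[a] += 1 - reward
        clicks += reward

    return alpha, beta, clicks

alpha, beta, clicks = thompson_sampling_ads([0.04, 0.05, 0.07])
print("posterior mean CTR per ad:", (alpha + 1) / (alpha + beta + 2))
print("total clicks:", clicks)
```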

Bayesian Regret Analysis: We will derive the worst-case Bayesian regret of Thompson Sampling for Bernoulli bandits with 0-1 rewards and for Gaussian bandits with 1-sub-Gaussian arms.
If the bandit has 0-1 rewards, then for a particular arm $i$ that has been played $T_{i}(t-1)$ times before round $t$, with empirical mean reward $\hat\mu_{i}(t-1)$, Hoeffding's inequality tells us that

$\mathbb{P}\left(|\hat\mu_{i}(t-1) - \mu_{i}| \geq \sqrt{\frac{1}{2 \max(1,T_{i}(t-1))}\log{\frac{2}{\delta}}} \right) \leq \delta$

If the arms are 1-sub-Gaussian, then similarly,

$\mathbb{P}\left(|\hat\mu_{i}(t-1) - \mu_{i}| \geq \sqrt{\frac{2}{\max(1,T_{i}(t-1))}\log{\frac{1}{\delta}}} \right) \leq 2\delta$

So, in both cases, for every arm $i$ and every round $t$,

$|\hat\mu_{i}(t-1) - \mu_{i}| \leq \sqrt{\frac{2}{\max(1,T_{i}(t-1))}\log{\frac{1}{\delta}}}$

holds with probability at least $1-2\delta$ for each $i$ and $t$, with $\delta$ taken arbitrarily close to 0, say $\delta = \frac{1}{n^{2}}$. Let's define the good event $G$ to be the event that this inequality holds simultaneously for all arms and all rounds. The bad event $G^{c}$ is then the event that some arm violates this bound at some round. So,

$\mathbb{P}(G^{c}) = \mathbb{P}\left(\bigcup_{i = 1}^{k} \bigcup_{t = 1}^{n} \left\{|\hat\mu_{i}(t-1) - \mu_{i}| > \sqrt{\frac{2}{\max(1,T_{i}(t-1))}\log{\frac{1}{\delta}}}\right\}\right)$
Using the union bound,

$\mathbb{P}(G^{c}) \leq \sum_{i = 1}^{k} \sum_{t = 1}^{n} \mathbb{P}\left(|\hat\mu_{i}(t-1) - \mu_{i}| > \sqrt{\frac{2}{\max(1,T_{i}(t-1))}\log{\frac{1}{\delta}}}\right) \leq 2nk\delta = \frac{2k}{n}$
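As a quick numerical sanity check of the per-arm, per-round concentration bound, here is a small sketch; the values $p = 0.3$, $m = 100$, $\delta = 0.05$ and the number of trials are arbitrary illustrative choices.

```python
import numpy as np

# Empirically check P(|mean_hat - p| >= sqrt(2 log(1/delta) / m)) <= 2*delta
# for Bernoulli rewards with a hypothetical mean p and m pulls of one arm.
rng = np.random.default_rng(0)
p, m, delta, trials = 0.3, 100, 0.05, 100_000

radius = np.sqrt(2.0 * np.log(1.0 / delta) / m)
samples = rng.random((trials, m)) < p           # trials x m Bernoulli draws
deviations = np.abs(samples.mean(axis=1) - p)   # |mean_hat - p| per trial
violation_rate = (deviations >= radius).mean()

print(f"confidence radius = {radius:.3f}")
print(f"empirical violation rate = {violation_rate:.5f}  (bound: {2 * delta})")
```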

Let’s define $U_{t}(i) = \hat\mu_{i}(t-1) + \sqrt{\frac{2}{\max(1,T_{i}(t-1))}\log{\frac{1}{\delta}}}$, which is exactly the upper confidence bound used by UCB. The Bayesian regret can be expressed using the tower rule of conditional expectation with respect to the history $H_{t-1}$. So,

$R_{n} = \mathbb{E}[ \sum_{t=1}^{n} (\mu_{A^{*}} - \mu_{A_{t}}) ] = \mathbb{E}[ \sum_{t=1}^{n} \mathbb{E}[\mu_{A^{*}} - \mu_{A_{t}}| H_{t-1}] ]$

Let’s focus on the inner conditional expectation,

$ \mathbb{E}[\mu_{A^{*}} - \mu_{A_{t}}| H_{t-1}] = \mathbb{E}[\mu_{A^{*}} - U_{t}(A_{t}) + U_{t}(A_{t}) - \mu_{A_{t}}| H_{t-1}]$
$ \mathbb{E}[\mu_{A^{*}} - \mu_{A_{t}}| H_{t-1}] = \mathbb{E}[\mu_{A^{*}} - U_{t}(A^{*}) + U_{t}(A_{t}) - \mu_{A_{t}}| H_{t-1}]$
$ \mathbb{E}[\mu_{A^{*}} - \mu_{A_{t}}| H_{t-1}] = \mathbb{E}[\mu_{A^{*}} - U_{t}(A^{*})| H_{t-1}] + \mathbb{E}[U_{t}(A_{t}) - \mu_{A_{t}}| H_{t-1}]$

The second line follows from the fact that,

$\pi_{t}(a|H_{t-1}) = Q_{t}(B_{a}|H_{t-1})$

Given the history, the arm $A_{t}$ sampled by Thompson Sampling has exactly the same conditional distribution as the optimal arm $A^{*}$:

$\mathbb{P}(A^{*} = a| H_{t-1}) = \mathbb{P}(A_{t} = a| H_{t-1})$
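Since $U_{t}(\cdot)$ is a deterministic function of the history $H_{t-1}$, this implies

$\mathbb{E}[U_{t}(A_{t})| H_{t-1}] = \sum_{a} U_{t}(a)\, \mathbb{P}(A_{t} = a| H_{t-1}) = \sum_{a} U_{t}(a)\, \mathbb{P}(A^{*} = a| H_{t-1}) = \mathbb{E}[U_{t}(A^{*})| H_{t-1}]$

which is what lets us replace $U_{t}(A_{t})$ with $U_{t}(A^{*})$ in the second line above.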

So,

$R_{n} = \mathbb{E}[\sum_{t=1}^{n} (\mu_{A^{*}} - U_{t}(A^{*})) + \sum_{t=1}^{n} (U_{t}(A_{t}) - \mu_{A_{t}})]$

Then,

$\mathbb{E}[R_{n}] = \mathbb{P}(G^{c})\, \mathbb{E}[R_{n}|G^{c}] + \mathbb{P}(G)\, \mathbb{E}[R_{n}|G]$
$\mathbb{E}[R_{n}] \leq \frac{2k}{n}\, 2n + \left(1-\frac{2k}{n}\right) \mathbb{E}[R_{n}|G]$

Here we bound the regret on the bad event by $2n$, since each per-round gap $\mu_{A^{*}} - \mu_{A_{t}}$ is at most $2$ when the mean rewards lie in $[-1, 1]$ (for Bernoulli rewards it is at most $1$).

Let’s focus on $\mathbb{E}[R_{n} | G]$.

$\mathbb{E}[R_{n} | G] = \mathbb{E}[\sum_{t=1}^{n} (\mu_{A^{*}} - U_{t}(A^{*})) + \sum_{t=1}^{n} (U_{t}(A_{t}) - \mu_{A_{t}}) | G] \leq \mathbb{E}[\sum_{t=1}^{n} (U_{t}(A_{t}) - \mu_{A_{t}}) | G]$

This is because, under the good event $G$, $\mu_{A^{*}} \leq U_{t}(A^{*})$, so the first sum is non-positive. So,

$\mathbb{E}[R_{n} | G] \leq \mathbb{E}[\sum_{t=1}^{n} (U_{t}(A_{t}) - \mu_{A_{t}})| G]$
$\mathbb{E}[R_{n} | G] \leq \mathbb{E}[\sum_{t=1}^{n} \sum_{i=1}^{k} \mathbb{I}(A_{t} = i)(U_{t}(i) - \mu_{i})| G]$
$\mathbb{E}[R_{n} | G] \leq \mathbb{E}[\sum_{i=1}^{k} \sum_{t=1}^{n} \mathbb{I}(A_{t} = i)\, 2\sqrt{\frac{2}{\max(1,T_{i}(t-1))}\log{\frac{1}{\delta}}}\, | G]$
$\mathbb{E}[R_{n} | G] \leq 2\sum_{i=1}^{k} \int_{s = 0}^{T_{i}(n)} \sqrt{\frac{2}{s}\log{\frac{1}{\delta}}}\, ds$
$\mathbb{E}[R_{n} | G] \leq 2\sqrt{2\log{\frac{1}{\delta}}} \sum_{i=1}^{k} 2\sqrt{T_{i}(n)} = 4\sqrt{2\log{\frac{1}{\delta}}} \sum_{i=1}^{k} \sqrt{T_{i}(n)}$

The third line uses the fact that, on $G$, $U_{t}(i) - \mu_{i} \leq 2\sqrt{\frac{2}{\max(1,T_{i}(t-1))}\log{\frac{1}{\delta}}}$ (the empirical mean is within one confidence radius of $\mu_{i}$, and $U_{t}(i)$ adds one more radius on top of it). The fourth line uses the fact that $T_{i}(t-1)$ increments each time arm $i$ is played, so the sum over rounds is bounded by the corresponding integral.

We know that the arithmetic mean is at most the quadratic mean. So,

$\frac{\sum_{i=1}^{k} \sqrt{T_{i}(n)}}{k} \leq \sqrt{ \frac{\sum_{i=1}^{k} T_{i}(n)}{k} } = \sqrt{ \frac{n}{k} }$
$\mathbb{E}[R_{n} | G] \leq 4\sqrt{2\log{\frac{1}{\delta}}} \sum_{i=1}^{k} \sqrt{T_{i}(n)} \leq 4\sqrt{2\log{\frac{1}{\delta}}}\, \sqrt{nk} = 4\sqrt{2nk\log{\frac{1}{\delta}}}$

So,

$\mathbb{E}[R_{n}] \leq \frac{2k}{n}\, 2n + \left(1-\frac{2k}{n}\right) 4\sqrt{2nk\log{\frac{1}{\delta}}} \leq 4k + 4\sqrt{2nk\log{\frac{1}{\delta}}}$

Since typically $n$ is substantially larger than $k$, and since $\delta = \frac{1}{n^{2}}$ gives $\log{\frac{1}{\delta}} = 2\log{n}$,

$\mathbb{E}[R_{n}] = \mathcal{O}(\sqrt{nk\log{n}})$
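The bound above is on the Bayesian regret (averaged over the prior), but a quick simulation of Beta-Bernoulli Thompson Sampling on one fixed instance gives a feel for the $\sqrt{nk\log{n}}$ scale; the arm means, horizon, and number of runs below are arbitrary illustrative choices, and the comparison is only about order of magnitude.

```python
import numpy as np

def ts_pseudo_regret(true_means, n_rounds, seed):
    """One run of Beta-Bernoulli Thompson Sampling; returns the pseudo-regret
    sum_t (mu_star - mu_{A_t}) on a fixed instance with the given means."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha, beta = np.zeros(k), np.zeros(k)
    mu_star = max(true_means)
    regret = 0.0
    for _ in range(n_rounds):
        a = int(np.argmax(rng.beta(alpha + 1, beta + 1)))
        reward = rng.random() < true_means[a]
        alpha[a] += reward
        beta[a] += 1 - reward
        regret += mu_star - true_means[a]
    return regret

n, true_means = 20_000, [0.10, 0.12, 0.15, 0.18, 0.25]  # hypothetical instance
k = len(true_means)
runs = [ts_pseudo_regret(true_means, n, seed=s) for s in range(10)]
print("mean pseudo-regret over 10 runs:", round(float(np.mean(runs)), 1))
print("sqrt(n * k * log n) for comparison:", round(float(np.sqrt(n * k * np.log(n))), 1))
```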