
Stochastic Bandits: The Epsilon Greedy Algorithm
Previously, we saw that the ETC algorithm first explores every arm $m$ times for some $m$, then commits to the empirically best arm for all subsequent rounds. But we could severely underperform during that initial exploration phase. So, instead of exploring blindly for the first $km$ rounds, we can spread the uniform exploration over all of the rounds using a coin-toss model. In each round $t$, with probability $\epsilon_{t}$ we explore, picking an arm uniformly at random (each arm with probability $\frac{1}{k}$); otherwise, we exploit the arm that looks best given the observations so far. This way, we do not risk incurring heavy regret in a dedicated exploration phase, and eventually the algorithm still identifies the truly optimal arm.
$\textbf{for}$ round $t = 1, 2, \cdots , n$ $\textbf{do}$
$\quad$ Toss a coin with success probability $\epsilon_{t}$
$\quad\quad$ $\textbf{if success:}$
$\quad\quad\quad$ Explore: pull an arm chosen uniformly at random (each with probability $\frac{1}{k}$)
$\quad\quad$ $\textbf{if failure:}$
$\quad\quad\quad$ Exploit: pull the arm with the highest sample mean reward so far
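To make the pseudocode concrete, here is a minimal Python sketch of the loop (an illustration, not taken from the original text; the environment function `pull`, the horizon `n`, the number of arms `k`, and the schedule `epsilons` are hypothetical placeholders):

```python
import random

def epsilon_greedy(pull, k, n, epsilons):
    """Run epsilon-greedy for n rounds on a k-armed bandit.

    pull(i)     -- returns a stochastic reward for arm i (assumed to lie in [0, 1])
    epsilons[t] -- exploration probability used in round t (0-indexed)
    """
    counts = [0] * k       # n_i(t): number of times arm i has been pulled so far
    means = [0.0] * k      # sample mean reward of each arm
    total_reward = 0.0

    for t in range(n):
        if random.random() < epsilons[t]:
            # Explore: pick an arm uniformly at random (probability 1/k each)
            arm = random.randrange(k)
        else:
            # Exploit: pick the arm with the highest sample mean so far
            arm = max(range(k), key=lambda i: means[i])
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
        total_reward += reward

    return total_reward, means
```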
Let’s look at the simple regret of a particular round $t$. If $n$ is sufficiently large, then eventually the exploitation rounds will almost always pick the truly optimal arm, so it is unlikely that a suboptimal arm gets exploited; suboptimal arms are mostly pulled in the random exploration rounds. Suppose that in an exploitation round a suboptimal arm $a_{i}$ is chosen. Then, by the definition of exploitation, $\bar \mu_{i}(t-1) \geq \max_{j = 1}^{k} \bar \mu_{j}(t-1)$.
Let’s define the good event $G$. In the good event, for every arm, the true mean reward stays within the confidence bound around the sample mean reward. That is, for all rounds $t$ and for all arms $i$, $|\bar \mu_{i}(t-1) - \mu_{i}| \leq \sqrt{\frac{2}{n_{i}(t-1)} \log(n)}$, where $n_{i}(t-1)$ denotes the number of times arm $i$ has been pulled in the first $t-1$ rounds.
We know that $\Pr(G) \geq 1 - \frac{1}{n^{2}}$.
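As a side note (a sketch of where this bound comes from, not part of the original derivation): assuming the rewards lie in $[0, 1]$ and $2k \leq n$, Hoeffding’s inequality gives, for a fixed arm $i$ and a fixed number of pulls,
$$\Pr\left(|\bar \mu_{i}(t-1) - \mu_{i}| > \sqrt{\tfrac{2}{n_{i}(t-1)} \log(n)}\right) \leq 2\exp\left(-2\, n_{i}(t-1) \cdot \tfrac{2\log(n)}{n_{i}(t-1)}\right) = \frac{2}{n^{4}},$$
and a union bound over the $k$ arms and the $n$ possible values of $n_{i}(t-1)$ yields $\Pr(\bar G) \leq \frac{2k}{n^{3}} \leq \frac{1}{n^{2}}$.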
In the good event, let’s say at round $t$, arm $i$ is the apparently best arm and arm $j$ is the truly optimal arm. We know for sure that
$$\Delta_{i} = \mu_{j} - \mu_{i} \leq \left(\bar \mu_{j}(t-1) + \sqrt{\tfrac{2}{n_{j}(t-1)} \log(n)}\right) - \left(\bar \mu_{i}(t-1) - \sqrt{\tfrac{2}{n_{i}(t-1)} \log(n)}\right).$$
Since $i$ is the chosen arm at round $t$, $\bar \mu_{j}(t-1) - \bar \mu_{i}(t-1) \leq 0$, so we have
$$\Delta_{i} \leq \sqrt{\frac{2}{n_{i}(t-1)} \log(n)} + \sqrt{\frac{2}{n_{j}(t-1)} \log(n)} \leq 2\sqrt{\frac{2 \log(n)}{\min\{n_{i}(t-1),\, n_{j}(t-1)\}}}.$$
Now, let’s try to find an upper bound for the expected value of this $\Delta_{i}$. Each arm is explored with probability at least $\frac{\epsilon_{t}}{k}$ in every one of the first $t$ rounds (the $\epsilon_{t}$’s are non-increasing), so roughly $\mathbb{E}[n_{i}(t-1)] \geq \frac{t \epsilon_{t}}{k}$, and therefore, approximately,
$$\mathbb{E}[\Delta_{i}] \leq 2\sqrt{\frac{2 k \log(n)}{t \epsilon_{t}}}.$$
So, the expected regret per round in the good event is
$$\mathbb{E}[r_{t}] \leq \epsilon_{t} \cdot 1 + (1 - \epsilon_{t}) \cdot \mathbb{E}[\Delta_{i}] \leq \epsilon_{t} + 2\sqrt{\frac{2 k \log(n)}{t \epsilon_{t}}},$$
where the first term uses the fact that an exploration round incurs regret at most $1$ (the rewards lie in $[0, 1]$).
The above bound is minimized by the choice $\epsilon_{t} = 2^{\frac{1}{3}} k^{\frac{1}{3}} t^{-\frac{1}{3}} \log(n)^{\frac{1}{3}}$.
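As a quick sanity check (a worked calculation, not part of the original text), setting the derivative of the per-round bound $f(\epsilon_{t}) = \epsilon_{t} + 2\sqrt{\frac{2k\log(n)}{t\epsilon_{t}}}$ to zero recovers exactly this schedule:
$$f'(\epsilon_{t}) = 1 - \sqrt{\frac{2k\log(n)}{t}}\, \epsilon_{t}^{-\frac{3}{2}} = 0 \;\Longrightarrow\; \epsilon_{t}^{3} = \frac{2k\log(n)}{t} \;\Longrightarrow\; \epsilon_{t} = 2^{\frac{1}{3}} k^{\frac{1}{3}} t^{-\frac{1}{3}} \log(n)^{\frac{1}{3}}.$$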
Then we get
$$\mathbb{E}[r_{t}] \leq 3 \cdot 2^{\frac{1}{3}} k^{\frac{1}{3}} t^{-\frac{1}{3}} \log(n)^{\frac{1}{3}}.$$
So, the expected simple regret over both good and bad events is
$$\mathbb{E}[r_{t}] \leq \Pr(G) \cdot 3 \cdot 2^{\frac{1}{3}} k^{\frac{1}{3}} t^{-\frac{1}{3}} \log(n)^{\frac{1}{3}} + \Pr(\bar G) \cdot 1 \leq 3 \cdot 2^{\frac{1}{3}} k^{\frac{1}{3}} t^{-\frac{1}{3}} \log(n)^{\frac{1}{3}} + \frac{1}{n^{2}}.$$
So, now we find the expected regret over all rounds:
$$\mathbb{E}[R_{n}] = \sum_{t=1}^{n} \mathbb{E}[r_{t}] \leq 3 \cdot 2^{\frac{1}{3}} k^{\frac{1}{3}} \log(n)^{\frac{1}{3}} \sum_{t=1}^{n} t^{-\frac{1}{3}} + \frac{1}{n} \leq \frac{9}{2} \cdot 2^{\frac{1}{3}} k^{\frac{1}{3}} n^{\frac{2}{3}} \log(n)^{\frac{1}{3}} + \frac{1}{n}.$$
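The last inequality uses the standard integral comparison for the sum (a worked step added for completeness):
$$\sum_{t=1}^{n} t^{-\frac{1}{3}} \leq 1 + \int_{1}^{n} x^{-\frac{1}{3}}\, dx = 1 + \frac{3}{2}\left(n^{\frac{2}{3}} - 1\right) \leq \frac{3}{2} n^{\frac{2}{3}}, \qquad \text{and} \quad n \cdot \frac{1}{n^{2}} = \frac{1}{n}.$$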
So, with this choice of $\epsilon_{t}$, the epsilon-greedy algorithm achieves an expected regret of $\mathcal{O}(k^{\frac{1}{3}} n^{\frac{2}{3}} \log(n)^{\frac{1}{3}})$.
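To close, here is a hypothetical usage of the `epsilon_greedy` sketch above with the schedule derived in the analysis; the Bernoulli arm means and horizon are made-up values for illustration:

```python
import math
import random

# Hypothetical end-to-end run of the epsilon_greedy sketch above:
# three Bernoulli arms with made-up means, horizon n = 10_000.
true_means = [0.3, 0.5, 0.7]
k, n = len(true_means), 10_000

def pull(i):
    return 1.0 if random.random() < true_means[i] else 0.0

# Schedule from the analysis: epsilon_t = (2 k log(n) / t)^(1/3), capped at 1.
epsilons = [min(1.0, (2 * k * math.log(n) / (t + 1)) ** (1 / 3)) for t in range(n)]

total_reward, means = epsilon_greedy(pull, k, n, epsilons)
print("sample means:", [round(m, 3) for m in means])
print("empirical regret:", n * max(true_means) - total_reward)
```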