In the classic Multi-Armed Bandit (MAB) problem, a player is given a set of arms with unknown reward distributions. At each time step, the player selects one arm to play, aiming to maximize the total expected reward over a horizon of length T. It is known that the minimum growth rate of regret (defined as the total expected loss relative to the ideal scenario in which the reward models of all arms are known) is logarithmic in T. In other words, mistakes in selecting suboptimal arms occur infinitely often, and the player never converges to the arm with the largest reward mean. In this paper, we ask whether side information on the reward model can lead to bounded regret, and hence complete learning, and what the minimum side information is that achieves complete learning. We show that knowledge of a value η lying between the largest and the second-largest reward means (among all arms) suffices for complete learning: we construct an online learning policy whose regret is bounded. This result applies to both light-tailed and heavy-tailed reward distributions.
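To illustrate the role of η, below is a minimal Python sketch of one natural threshold rule, not necessarily the policy constructed in the paper: exploit the empirically best arm whenever its sample mean clears η, and otherwise explore. Since only the best arm's mean exceeds η, exploration rounds become rare once estimates concentrate, which is the intuition behind bounded regret. The function names, the round-robin exploration rule, and the Gaussian arms in the usage example are illustrative assumptions.

```python
import random

def eta_threshold_policy(arms, eta, horizon, seed=0):
    """Illustrative sketch only; the paper's actual policy may differ.

    arms: list of callables, each taking a random.Random and returning a reward.
    eta:  a value strictly between the largest and second-largest reward means.
    """
    rng = random.Random(seed)
    counts = [0] * len(arms)
    sums = [0.0] * len(arms)
    total = 0.0
    for t in range(horizon):
        # Empirical means; arms never played are forced into exploration.
        means = [sums[i] / counts[i] if counts[i] else float("-inf")
                 for i in range(len(arms))]
        best = max(range(len(arms)), key=lambda i: means[i])
        if means[best] > eta:
            i = best              # exploit: estimate clears the threshold eta
        else:
            i = t % len(arms)     # explore: round-robin over all arms
        reward = arms[i](rng)     # pull arm i, observe a random reward
        counts[i] += 1
        sums[i] += reward
        total += reward
    return total

# Hypothetical usage: two Gaussian arms with means 0.9 and 0.5, and eta = 0.7
# chosen between them.
arms = [lambda rng: rng.gauss(0.9, 1.0), lambda rng: rng.gauss(0.5, 1.0)]
print(eta_threshold_policy(arms, eta=0.7, horizon=10000))
```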