Multi-armed bandit

The multi-armed bandit problem is a classic reinforcement learning problem in which a fixed set of resources must be allocated across a collection of reward-generating options so as to maximize the expected cumulative reward. The name “multi-armed bandit” comes from the phrase “one-armed bandit”, a nickname for slot machines: one can liken the problem to a gambler facing a row of slot machines who must decide how to play them in order to maximize the return on a fixed sum of money. The problem exemplifies the exploration-exploitation tradeoff, a core theme in reinforcement learning, where an agent must choose between exploiting a known, successful strategy and exploring other, possibly more lucrative options.
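
As a concrete illustration, the sketch below runs an epsilon-greedy agent against a row of simulated Bernoulli slot machines: with probability epsilon it explores a random arm, otherwise it exploits the arm with the highest estimated payout. The payout probabilities, the number of rounds, and the value of epsilon are illustrative assumptions, not taken from the text above.

    import random

    def epsilon_greedy_bandit(true_probs, n_rounds=1000, epsilon=0.1):
        n_arms = len(true_probs)
        counts = [0] * n_arms      # number of pulls per arm
        values = [0.0] * n_arms    # running mean reward per arm
        total_reward = 0.0
        for _ in range(n_rounds):
            if random.random() < epsilon:
                arm = random.randrange(n_arms)  # explore: pick a random arm
            else:
                arm = max(range(n_arms), key=lambda a: values[a])  # exploit: best estimate so far
            reward = 1.0 if random.random() < true_probs[arm] else 0.0
            counts[arm] += 1
            values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update
            total_reward += reward
        return total_reward, values

    # Example: three machines with unknown payout rates 0.2, 0.5, 0.7 (assumed for the demo).
    reward, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.7])
    print(reward, estimates)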

Contextual bandits

Contextual bandits are a variation of the multi-armed bandit in which a context (feature) vector is provided at each iteration, i.e. at each timestep at which the agent chooses between arms. The goal is typically to learn how to decide which arm to select using both previously observed rewards and the current context vector. Note that this formulation is very similar to the standard reinforcement learning formulation: the arms are the actions the agent must select from at a given timestep, the context is the state information, and a reward is received from the environment. The key difference, however, is the frequency of reward: a bandit receives a reward at every iteration, whereas in RL the rewards may be sparse and not directly attributable to any specific iteration or to the action taken at that iteration (a code sketch of the contextual bandit loop follows the diagrams below). In the form of a flow diagram, multi-armed bandits look like

    action → reward

whereas RL looks like

    state → action → reward
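
As referenced above, here is a minimal sketch of the contextual bandit interaction loop, using an epsilon-greedy rule on top of a per-arm linear (ridge-regression-style) reward estimate. The linear reward model, the simulated environment, and all dimensions and constants are illustrative assumptions; the point is that a context vector arrives at every iteration and a reward is observed immediately after each arm selection.

    import numpy as np

    rng = np.random.default_rng(0)
    n_arms, dim, epsilon = 3, 5, 0.1

    # Per-arm ridge-regression statistics: A accumulates context outer products,
    # b accumulates reward-weighted contexts.
    A = [np.eye(dim) for _ in range(n_arms)]
    b = [np.zeros(dim) for _ in range(n_arms)]

    # Hidden per-arm weights used only to simulate the environment (an assumption for the demo).
    true_w = rng.normal(size=(n_arms, dim))

    for t in range(1000):
        x = rng.normal(size=dim)                                      # context for this iteration
        theta = [np.linalg.solve(A[a], b[a]) for a in range(n_arms)]  # per-arm weight estimates
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))                           # explore: random arm
        else:
            arm = int(np.argmax([theta[a] @ x for a in range(n_arms)]))  # exploit: best predicted reward
        reward = true_w[arm] @ x + rng.normal(scale=0.1)              # reward arrives at every iteration
        A[arm] += np.outer(x, x)                                      # update only the chosen arm's model
        b[arm] += reward * x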
