The problem of matching ads to interests is in some ways a natural machine learning problem, since there is much information in who clicks on what. A fundamental difficulty with this information is that it is not supervised – in particular, a click-or-not on one ad generally doesn’t tell you whether a different ad would have been clicked on. This implies we have an inherent exploration problem. A standard mathematical setting for this situation is “k-Armed Bandits”, often with various relevant embellishments. The k-Armed Bandit setting works on a round-by-round basis. On each round:
1. A policy chooses an arm a out of the k arms (i.e. one of k ads).
2. The world reveals the reward r_a of the chosen arm only (i.e. whether the ad is clicked on); the rewards of the unchosen arms stay hidden. A minimal simulation of this loop is sketched below.
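As a rough illustration, here is a minimal Python sketch of the protocol with an epsilon-greedy policy. The click probabilities, the number of rounds, and the epsilon value are made up for the example and are not part of the setting itself.

```python
import random

def run_bandit(true_click_rates, rounds=10000, epsilon=0.1):
    """Simulate the k-armed bandit protocol with an epsilon-greedy policy.

    true_click_rates: hypothetical per-ad click probabilities, unknown to the policy.
    """
    k = len(true_click_rates)
    counts = [0] * k          # times each arm (ad) has been shown
    reward_sums = [0.0] * k   # total observed reward per arm

    for _ in range(rounds):
        # 1. The policy chooses an arm: explore uniformly with probability epsilon,
        #    otherwise exploit the arm with the best observed click rate so far.
        if random.random() < epsilon:
            a = random.randrange(k)
        else:
            # Unseen arms get an optimistic estimate so each is tried at least once.
            a = max(range(k), key=lambda i: reward_sums[i] / counts[i]
                    if counts[i] else float("inf"))

        # 2. The world reveals the reward of the chosen arm only
        #    (a click or not); rewards of the other arms remain hidden.
        r = 1.0 if random.random() < true_click_rates[a] else 0.0

        counts[a] += 1
        reward_sums[a] += r

    return counts, reward_sums

if __name__ == "__main__":
    # Made-up click rates for 3 ads; the policy must discover the best one by exploring.
    counts, rewards = run_bandit([0.02, 0.05, 0.03])
    print(counts, rewards)
```

The key point the sketch makes concrete is the partial feedback: on each round only r_a for the chosen arm is observed, so without the exploration step the policy could lock onto a suboptimal ad and never learn better.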