Multi-armed bandits
Action-value estimation
The action-value function can be updated incrementally according to the following equation:

$$Q_{n+1} = Q_n + \alpha_n \left( R_n - Q_n \right)$$

The learning rate $\alpha_n$ here is treated as a function of the number of updates $n$.
$Q_n$ converges with probability 1 when the following two conditions are satisfied:

$$\sum_{n=1}^{\infty} \alpha_n = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2 < \infty$$

The first condition guarantees that the steps are large enough to eventually overcome any initial conditions or random fluctuations. The second condition guarantees that the steps eventually become small enough to assure convergence.
Note that $\alpha_n$ is usually set to a constant in many practical RL problems, which violates the second condition, so $Q_n$ will not converge. However, this can actually be a desirable property in a highly nonstationary environment (i.e., where the true action values can change after each new sample), since a constant step size keeps tracking recent rewards.
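As a minimal sketch of this trade-off (the names, drift model, and step sizes below are illustrative assumptions, not part of the original notes), the following compares a sample-average step size $\alpha_n = 1/n$ with a constant $\alpha = 0.1$ on a drifting reward stream:

```python
import numpy as np

def update(q, reward, alpha):
    """Incremental update: Q_{n+1} = Q_n + alpha_n * (R_n - Q_n)."""
    return q + alpha * (reward - q)

rng = np.random.default_rng(0)
q_sample_avg, q_constant = 0.0, 0.0
true_value = 1.0

for n in range(1, 10001):
    true_value += rng.normal(0.0, 0.01)         # nonstationary: the target drifts
    reward = true_value + rng.normal(0.0, 1.0)  # noisy reward sample

    q_sample_avg = update(q_sample_avg, reward, 1.0 / n)  # alpha_n = 1/n meets both conditions
    q_constant = update(q_constant, reward, 0.1)          # constant alpha violates the second one

print(f"true value:  {true_value:.2f}")
print(f"alpha = 1/n: {q_sample_avg:.2f}")  # lags behind the drifting target
print(f"alpha = 0.1: {q_constant:.2f}")    # tracks recent rewards more closely
```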
Several exploration strategies
- $\epsilon$-greedy
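$\epsilon$-greedy selects a uniformly random action with probability $\epsilon$ and the greedy action $\arg\max_a Q_t(a)$ otherwise. A minimal sketch, assuming a table of current estimates `Q` (names are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With probability epsilon explore uniformly at random; otherwise exploit argmax Q."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform random action
    return int(np.argmax(Q))               # exploit: current greedy action

rng = np.random.default_rng(0)
Q = np.array([0.2, 0.5, 0.1])              # current action-value estimates
action = epsilon_greedy(Q, epsilon=0.1, rng=rng)
```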
- Upper confidence bound (UCB)
A variant of this method is used in the Monte Carlo tree search of recent AlphaGo versions.
The action is selected according to the following rule:

$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where $Q_t(a)$ is the action-value estimate at time step $t$, $N_t(a)$ denotes the number of times that action $a$ has been selected prior to time $t$, and $c > 0$ controls the degree of exploration.
The idea behind this selection rule is that each time $a$ is selected, the uncertainty about its value is presumably reduced ($N_t(a)$ grows, so the square-root term shrinks), and thus $a$ becomes less likely to be selected again for exploration alone.
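A small sketch of this rule, assuming running arrays `Q` and `N` for the estimates and selection counts (the names and the handling of untried actions are illustrative choices):

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ].

    Actions with N_t(a) == 0 are treated as maximally uncertain and tried first.
    """
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

Q = np.array([0.4, 0.6, 0.5])   # action-value estimates Q_t(a)
N = np.array([10, 50, 2])       # selection counts N_t(a)
action = ucb_select(Q, N, t=int(N.sum()) + 1)
```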
- Optimistic initial values
Setting larger initial action-value estimates can automatically encourage exploration in some special cases; the effect is temporary and works best on stationary problems, since the influence of the initial values fades as more samples arrive.
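A small sketch of the effect, assuming true action values drawn from a standard normal so that an initial estimate of 5.0 is wildly optimistic (all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 10
true_values = rng.normal(0.0, 1.0, n_actions)   # true action values

Q = np.full(n_actions, 5.0)   # optimistic initial estimates, above any plausible reward
N = np.zeros(n_actions)

for _ in range(100):
    # Purely greedy selection still explores early: every untried action keeps its
    # optimistic estimate, so it looks better than actions whose estimates have
    # already been pulled down toward their true values.
    a = int(np.argmax(Q))
    r = true_values[a] + rng.normal()
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]   # sample-average update

print("times each action was tried:", N)
```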
- Gradient bandit algorithm
Directly define a numerical preference $H_t(a)$ for each action $a$, select actions according to a softmax distribution over these preferences, and update the preferences by stochastic gradient ascent on the expected reward.
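A compact sketch of a gradient bandit update with a softmax policy over preferences and an average-reward baseline (the step size, horizon, and names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 5
true_values = rng.normal(0.0, 1.0, n_actions)

H = np.zeros(n_actions)      # numerical preferences H_t(a)
baseline, alpha = 0.0, 0.1   # average-reward baseline and step size

for t in range(1, 1001):
    pi = np.exp(H - H.max())
    pi /= pi.sum()                       # softmax policy pi_t(a) over preferences
    a = rng.choice(n_actions, p=pi)      # sample an action from the policy
    r = true_values[a] + rng.normal()

    # Stochastic gradient ascent on expected reward:
    # H(a) += alpha * (R - baseline) * (1{a selected} - pi(a))
    one_hot = np.zeros(n_actions)
    one_hot[a] = 1.0
    H += alpha * (r - baseline) * (one_hot - pi)
    baseline += (r - baseline) / t       # incremental average reward

print("most preferred action:", int(np.argmax(H)),
      "| best true action:", int(np.argmax(true_values)))
```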