Policy¶
coleman.policy ¶
Policies for multi-armed bandit and contextual bandit action selection.
This module provides a collection of policies that are designed to operate with multi-armed bandits and contextual bandits. Each policy dictates how an agent will select its actions based on prior knowledge, current context, or exploration strategies.
Classes:
| Name | Description |
|---|---|
Policy |
Basic policy class that prescribes actions based on the memory of an agent. |
EpsilonGreedyPolicy |
Chooses either the best apparent action or a random one based on a probability epsilon. |
GreedyPolicy |
Always chooses the best apparent action. |
RandomPolicy |
Always chooses a random action. |
UCBPolicyBase |
Base class for Upper Confidence Bound policies. |
UCB1Policy |
Implementation of the UCB1 algorithm. |
UCBPolicy |
A variation of the UCB algorithm with a scaling factor. |
FRRMABPolicy |
Fitness-Rate-Rank based Multi-Armed Bandit policy. |
SlMABPolicy |
Sliding window-based Multi-Armed Bandit policy. |
LinUCBPolicy |
Contextual bandit policy using linear upper confidence bounds. |
SWLinUCBPolicy |
Variation of LinUCBPolicy using a sliding window approach. |
CombinatorialUCBPolicy |
Combinatorial policy that prioritizes a top-k subset via UCB scores. |
CombinatorialThompsonPolicy |
Combinatorial policy that prioritizes a top-k subset via Thompson sampling. |
DuelingUCBPolicy |
Dueling/ranking policy based on pairwise UCB-style preference estimates. |
PairwiseThompsonRankingPolicy |
Dueling/ranking policy based on pairwise Thompson preference sampling. |
PortfolioUCBPolicy |
Meta-policy that selects among candidate policies online using UCB. |
Notes
- UCB (Upper Confidence Bound) policies are designed to balance exploration and exploitation by considering both the estimated reward of an action and the uncertainty around that reward.
- EpsilonGreedy and its variations (Greedy, Random) are simpler strategies that either exploit the best-known action or explore random actions based on a fixed probability.
- LinUCB and SWLinUCB are contextual bandits. They choose actions not just based on past rewards, but also considering the current context. SWLinUCB adds a sliding window mechanism to LinUCB, giving more weight to recent actions.
- Combinatorial policies prioritize a strong subset first (top-k) while preserving compatibility with full-ranking execution.
- Dueling/ranking policies learn pairwise preferences and derive a global priority order.
- Portfolio meta-policies select which policy to use online based on recent performance.
References
.. [1] Lihong Li, et al. "A Contextual-Bandit Approach to Personalized News Article Recommendation." In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010. .. [2] Nicolas Gutowski, Tassadit Amghar, Olivier Camp, and Fabien Chhel. "Global Versus Individual Accuracy in Contextual Multi-Armed Bandit." In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC '19), April 8-12, 2019, Limassol, Cyprus. .. [3] Chen, W.; Wang, Y.; Yuan, Y. "Combinatorial Multi-Armed Bandit: General Framework and Applications." ICML, 2013. .. [4] Yue, Y.; Joachims, T. "Beat the Mean Bandit." ICML, 2011. .. [5] Zoghi, M.; Karnin, Z.; Whiteson, S.; de Rijke, M.; Munos, R. "Copeland Dueling Bandits." NeurIPS, 2015. .. [6] Auer, P.; Cesa-Bianchi, N.; Fischer, P. "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning, 2002.
BayesianUCBPolicy ¶
Bases: ThompsonSamplingPolicy
Bayesian UCB with Beta posterior mean and standard deviation bonus.
References
.. [1] Kaufmann, E.; Korda, N.; Munos, R. "Thompson Sampling: an asymptotically optimal finite-time analysis." ALT, 2012.
BootstrappedThompsonPolicy ¶
Bases: Policy, DeltaRewardMixin
Bootstrapped Thompson Sampling using Poisson-resampled heads.
References
.. [1] Osband, I.; Van Roy, B.; Wen, Z. "Generalization and Exploration via Randomized Value Functions." ICML, 2016.
ChangeDetectionUCBPolicy ¶
Bases: Policy, DeltaRewardMixin
UCB with simple change detection and per-arm reset.
References
.. [1] Liu, F.; Lee, J.; Shroff, N. "A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem." AAAI, 2018. .. [2] Garivier, A.; Moulines, E. "On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems." ALT, 2011.
__init__ ¶
Initialize UCB constant, change-detection window and threshold.
CombinatorialThompsonPolicy ¶
Bases: Policy, DeltaRewardMixin
Subset-first Thompson ranking using independent Beta posteriors.
What it is
A combinatorial bandit policy that samples action quality from per-action posteriors and prioritizes a top-k subset before the remaining actions. This preserves exploration while emphasizing a smaller candidate set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
subset_size
|
int
|
Number of actions to place at the front of the ranking. |
5
|
alpha_prior
|
float
|
Prior alpha parameter for each action. |
1.0
|
beta_prior
|
float
|
Prior beta parameter for each action. |
1.0
|
References
.. [1] Agrawal, S.; Goyal, N. "Analysis of Thompson Sampling for the Multi-armed Bandit Problem." COLT, 2012. .. [2] Wang, S.; Chen, W. "Thompson Sampling for Combinatorial Semi- Bandits." ICML, 2018.
CombinatorialUCBPolicy ¶
Bases: Policy
Subset-first UCB ranking for combinatorial action selection.
What it is
A combinatorial bandit policy that focuses on selecting a strong subset of actions first (top-k), instead of assuming all actions are equally relevant at every step. It still returns a full ordering for compatibility with the current agent contract.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
subset_size
|
int
|
Number of actions to prioritize as the subset head. |
5
|
c
|
float
|
Exploration strength for the UCB bonus. |
1.0
|
References
.. [1] Chen, W.; Wang, Y.; Yuan, Y. "Combinatorial Multi-Armed Bandit: General Framework and Applications." ICML, 2013. .. [2] Kveton, B.; Wen, Z.; Ashkan, A.; Szepesvári, C. "Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits." AISTATS, 2015.
credit_assignment ¶
Apply the default reward update for compatibility with base policy behavior.
ContextualEpsilonGreedyPolicy ¶
Bases: LinUCBPolicy
Contextual epsilon-greedy based on linear reward estimates.
References
.. [1] Langford, J.; Zhang, T. "The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information." NIPS, 2007.
__init__ ¶
Initialize contextual epsilon-greedy policy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
epsilon
|
float
|
Exploration probability. |
0.1
|
DecayEpsilonGreedyPolicy ¶
Bases: Policy
Epsilon-greedy with polynomial decay schedule.
References
.. [1] Sutton, R. S.; Barto, A. G. "Reinforcement Learning: An Introduction." MIT Press, 2018.
DiscountedUCBPolicy ¶
Bases: Policy, DeltaRewardMixin
Discounted-UCB for non-stationary rewards.
References
.. [1] Garivier, A.; Moulines, E. "On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems." ALT, 2011.
DuelingUCBPolicy ¶
Bases: Policy
Pairwise-UCB ranking with Copeland-style scores.
What it is
A dueling/ranking bandit policy that learns preferences through pairwise comparisons between actions, then induces a global ranking from these preferences using Copeland-like aggregation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
c
|
float
|
Exploration parameter for optimistic pairwise preference estimates. |
1.0
|
References
.. [1] Yue, Y.; Joachims, T. "Beat the Mean Bandit." ICML, 2011. .. [2] Zoghi, M.; Karnin, Z.; Whiteson, S.; de Rijke, M.; Munos, R. "Copeland Dueling Bandits." NeurIPS, 2015.
EXP3IXPolicy ¶
Bases: EXP3Policy
EXP3-IX with implicit exploration.
References
.. [1] Neu, G. "First-order regret bounds for combinatorial semi-bandits." COLT, 2015.
EXP3Policy ¶
Bases: Policy, DeltaRewardMixin
EXP3 for adversarial bandits.
References
.. [1] Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R. "The Nonstochastic Multiarmed Bandit Problem." SIAM J. Computing, 2002.
EpsilonDecreasingPolicy ¶
Bases: Policy
Epsilon-greedy with decreasing exploration schedule.
References
.. [1] Sutton, R. S.; Barto, A. G. "Reinforcement Learning: An Introduction." MIT Press, 2018.
EpsilonGreedyPolicy ¶
Bases: Policy
Epsilon-Greedy policy for action selection.
Chooses a random action with probability epsilon and takes the best apparent approach with probability 1-epsilon. If multiple actions are tied for best choice, then a random action from that subset is selected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
epsilon
|
float
|
Probability of choosing a random action. |
required |
Attributes:
| Name | Type | Description |
|---|---|---|
epsilon |
float
|
Probability of choosing a random action. |
__init__ ¶
Initialize the EpsilonGreedyPolicy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
epsilon
|
float
|
Probability of choosing a random action. |
required |
__str__ ¶
Return a string representation of the policy.
Returns:
| Type | Description |
|---|---|
str
|
The policy name with epsilon value. |
choose_all ¶
Choose all actions using the epsilon-greedy strategy.
Each action is independently flagged for random exploration with probability epsilon. Exploration actions are placed first in a random order; the remaining actions are sorted by Q value descending (exploitation).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
agent
|
Agent
|
The agent whose actions are to be prioritized. |
required |
Returns:
| Type | Description |
|---|---|
list of str
|
List of action names ordered by the epsilon-greedy strategy. |
FRRMABPolicy ¶
Bases: Policy
Fitness-Rate-Rank based Multi-Armed Bandit (FRRMAB) policy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
c
|
float
|
Exploration parameter. |
required |
decayed_factor
|
float
|
Decay factor for ranking. Default is 1. |
1
|
Attributes:
| Name | Type | Description |
|---|---|---|
c |
float
|
Exploration parameter. |
decayed_factor |
float
|
Decay factor for ranking. |
history |
DataFrame
|
History of actions and their outcomes. |
__init__ ¶
Initialize the FRRMABPolicy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
c
|
float
|
Exploration parameter. |
required |
decayed_factor
|
float
|
Decay factor for ranking. Default is 1. |
1
|
__str__ ¶
Return a string representation of the policy.
Returns:
| Type | Description |
|---|---|
str
|
The policy name with C and D values. |
choose_all ¶
Choose all actions based on Q values from the FRRMAB history.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
agent
|
Agent
|
The agent whose actions are to be prioritized. |
required |
Returns:
| Type | Description |
|---|---|
list of str
|
List of action names sorted by Q value. |
credit_assignment ¶
Assign credit using the Fitness-Rate-Rank method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
agent
|
RewardSlidingWindowAgent
|
The agent for which credit assignment is to be performed. |
required |
GreedyPolicy ¶
Bases: EpsilonGreedyPolicy
Greedy policy that always takes the best apparent action.
Ties are broken by random selection. This is a special case of EpsilonGreedy where epsilon = 0 (always exploit).
KLUCBPolicy ¶
Bases: Policy
KL-UCB for Bernoulli rewards.
References
.. [1] Garivier, A.; Cappé, O. "The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond." COLT, 2011.
LinTSPolicy ¶
Bases: LinUCBPolicy
Linear Thompson Sampling for contextual bandits.
References
.. [1] Agrawal, S.; Goyal, N. "Thompson Sampling for Contextual Bandits with Linear Payoffs." ICML, 2013.
LinUCBPolicy ¶
Bases: Policy
LinUCB with Disjoint Linear Models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
alpha
|
float
|
The constant that determines the width of the upper confidence bound. Default is 0.5. |
0.5
|
Attributes:
| Name | Type | Description |
|---|---|---|
alpha |
float
|
The exploration parameter. |
context |
dict
|
Dictionary containing A matrices, their inverses, and b vectors for each action. |
context_features |
object or None
|
Current context features. |
features |
object or None
|
Feature names. |
References
.. [1] Lihong Li, et al. "A Contextual-Bandit Approach to Personalized News Article Recommendation." In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010.
__init__ ¶
Initialize LinUCBPolicy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
alpha
|
float
|
The constant that determines the width of the upper confidence bound. Default is 0.5. |
0.5
|
__str__ ¶
Return a string representation of the policy.
Returns:
| Type | Description |
|---|---|
str
|
The policy name with alpha value. |
add_action ¶
Add an action to the policy's context.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
action_id
|
str
|
The identifier of the action to add. |
required |
choose_all ¶
Choose all actions based on the LinUCB policy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
agent
|
Agent
|
The agent whose actions are to be prioritized. |
required |
Returns:
| Type | Description |
|---|---|
list of str
|
List of action names sorted by Q value in descending order. |
Raises:
| Type | Description |
|---|---|
QException
|
If Q computation results in unexpected shape. |
credit_assignment ¶
Assign credit based on the agent's actions and rewards.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
agent
|
Agent
|
The agent for which credit assignment is to be performed. |
required |
update_actions ¶
Update actions based on the agent's context.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
agent
|
ContextualAgent
|
The contextual agent providing context information. |
required |
new_actions
|
list of str
|
List of new action identifiers to add. |
required |
MOSSUCBPolicy ¶
Bases: Policy
Minimax Optimal Strategy in the Stochastic case (MOSS).
References
.. [1] Audibert, J.-Y.; Bubeck, S. "Minimax Policies for Adversarial and Stochastic Bandits." COLT, 2009.
OptimisticGreedyPolicy ¶
Bases: GreedyPolicy
Greedy with optimistic initialization for unseen actions.
References
.. [1] Sutton, R. S.; Barto, A. G. "Reinforcement Learning: An Introduction." MIT Press, 2018.
PHEPolicy ¶
Bases: Policy, DeltaRewardMixin
Perturbed-History Exploration (PHE).
References
.. [1] Kveton, B.; et al. "Perturbed-History Exploration in Stochastic Multi-Armed Bandits." IJCAI, 2019.
PairwiseThompsonRankingPolicy ¶
Bases: Policy
Pairwise Thompson Sampling over duels for ranking.
What it is
A ranking-oriented bandit policy that models pairwise preferences with Beta posteriors and samples them online to produce a total action ordering.
Uses a Beta posterior per unordered action pair and samples preferences to build a ranking at each decision step.
References
.. [1] Thompson, W. R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples." Biometrika, 1933. .. [2] Wu, H.; Liu, X.; Yue, Y. "Double Thompson Sampling for Dueling Bandits." NeurIPS, 2016.
__init__ ¶
Initialize Beta priors for each pairwise action combination.
Policy ¶
A policy prescribes an action to be taken based on the memory of an agent.
__str__ ¶
Return a string representation of the policy.
Returns:
| Type | Description |
|---|---|
str
|
The policy name. |
choose_all ¶
Return all actions in their default (untreated) order.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
agent
|
Agent
|
The agent whose actions are to be returned. |
required |
Returns:
| Type | Description |
|---|---|
list of str
|
List of action names. |
credit_assignment ¶
Assign credit to actions based on their outcomes.
The credit assignment method calculates the value estimates for each action based on the rewards observed. The specific implementation of how credit is assigned depends on the policy in use.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
agent
|
Agent
|
The agent for which credit assignment is to be performed. |
required |
Notes
This is a base method and should be overridden in derived classes to provide specific credit assignment logic. The method modifies the agent's state, updating the value estimates for each action based on the outcomes observed.
PortfolioUCBPolicy ¶
Bases: Policy
UCB portfolio selector over a set of candidate policies.
What it is
A meta-policy that treats each candidate policy as an arm in a portfolio bandit. It selects which policy to run online based on recent performance, balancing exploitation of strong policies and exploration of alternatives.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
policies
|
list[Policy]
|
Candidate policies that can be selected online. |
required |
c
|
float
|
Exploration strength for portfolio-level UCB. |
1.0
|
window_size
|
int
|
Number of recent rewards used to score each candidate policy. |
20
|
References
.. [1] Auer, P.; Cesa-Bianchi, N.; Fischer, P. "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning, 2002. .. [2] Fialho, Á.; Da Costa, L.; Schoenauer, M.; Sebag, M. "Analyzing Bandit-Based Adaptive Operator Selection Mechanisms." Annals of Mathematics and Artificial Intelligence, 2010.
__init__ ¶
Initialize the portfolio with candidate policies and UCB parameters.
PursuitPolicy ¶
Bases: Policy
Pursuit method that tracks best arm and shifts selection probabilities.
References
.. [1] Thathachar, M. A. L.; Sastry, P. S. "A New Approach to the Design of Reinforcement Schemes for Learning Automata." IEEE Trans. SMC, 1985.
RandomPolicy ¶
SWContextualEpsilonGreedyPolicy ¶
Bases: ContextualEpsilonGreedyPolicy
Sliding-window contextual epsilon-greedy policy.
References
.. [1] Besbes, O.; Gur, Y.; Zeevi, A. "Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards." NeurIPS, 2014.
SWLinTSPolicy ¶
Bases: LinTSPolicy
Sliding-window Linear Thompson Sampling.
References
.. [1] Russac, Y.; et al. "Weighted Linear Bandits for Non-Stationary Environments." NeurIPS, 2019.
SWLinUCBPolicy ¶
Bases: LinUCBPolicy
LinUCB with Disjoint Linear Models and Sliding Window.
References
.. [1] Nicolas Gutowski, Tassadit Amghar, Olivier Camp, and Fabien Chhel. "Global Versus Individual Accuracy in Contextual Multi-Armed Bandit." In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC '19), April 8-12, 2019, Limassol, Cyprus. ACM, 8 pages.
__str__ ¶
Return a string representation of the policy.
Returns:
| Type | Description |
|---|---|
str
|
The policy name with alpha value. |
choose_all ¶
Choose all actions based on the sliding window policy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
agent
|
Agent
|
The agent whose actions are to be prioritized. |
required |
Returns:
| Type | Description |
|---|---|
list of str
|
List of action names sorted by Q value in descending order. |
Raises:
| Type | Description |
|---|---|
QException
|
If Q computation results in unexpected shape. |
SlMABPolicy ¶
Bases: Policy
Sliding Multi-Armed Bandit policy.
Attributes:
| Name | Type | Description |
|---|---|---|
history |
DataFrame
|
History of actions and their outcomes. |
__str__ ¶
Return a string representation of the policy.
The closing parenthesis is intentionally omitted so the agent can append the window size when constructing the full label.
Returns:
| Type | Description |
|---|---|
str
|
The policy name without closing parenthesis. |
choose_all ¶
Choose all actions based on Q values from the SlMAB history.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
agent
|
Agent
|
The agent whose actions are to be prioritized. |
required |
Returns:
| Type | Description |
|---|---|
list of str
|
List of action names sorted by Q value. |
credit_assignment ¶
Assign credit using the SlMAB method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
agent
|
RewardSlidingWindowAgent
|
The agent for which credit assignment is to be performed. |
required |
SlidingWindowUCBPolicy ¶
Bases: UCBPolicyBase
Sliding-window UCB policy.
References
.. [1] Garivier, A.; Moulines, E. "On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems." ALT, 2011.
SoftmaxPolicy ¶
Bases: Policy
Boltzmann/Softmax exploration policy.
References
.. [1] Sutton, R. S.; Barto, A. G. "Reinforcement Learning: An Introduction." MIT Press, 2018.
ThompsonSamplingPolicy ¶
Bases: Policy, DeltaRewardMixin
Thompson Sampling with Beta-Bernoulli posterior.
References
.. [1] Thompson, W. R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples." Biometrika, 1933. .. [2] Agrawal, S.; Goyal, N. "Analysis of Thompson Sampling for the Multi-armed Bandit Problem." COLT, 2012.
UCB1Policy ¶
Bases: UCBPolicyBase
Upper Confidence Bound algorithm (UCB1).
Applies an exploration factor to the expected value of each arm which can influence a greedy selection strategy to more intelligently explore less confident options.
__str__ ¶
Return a string representation of the policy.
Returns:
| Type | Description |
|---|---|
str
|
The policy name with C value. |
UCB2Policy ¶
Bases: UCBPolicyBase
UCB2 policy.
References
.. [1] Auer, P.; Cesa-Bianchi, N.; Fischer, P. "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning, 2002.
UCBPolicy ¶
Bases: UCBPolicyBase
Upper Confidence Bound algorithm (UCB) with scaling factor.
__str__ ¶
Return a string representation of the policy.
Returns:
| Type | Description |
|---|---|
str
|
The policy name with C value. |
UCBPolicyBase ¶
Bases: Policy
Base class for Upper Confidence Bound (UCB) policies.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
c
|
float
|
Exploration parameter controlling the width of the confidence bound. |
required |
Attributes:
| Name | Type | Description |
|---|---|---|
c |
float
|
Exploration parameter. |
__init__ ¶
Initialize the UCBPolicyBase.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
c
|
float
|
Exploration parameter controlling the width of the confidence bound. |
required |
UCBTunedPolicy ¶
UCBVPolicy ¶
Bases: UCBTunedPolicy
UCB-V variance-aware confidence bound.
References
.. [1] Audibert, J.-Y.; Munos, R.; Szepesvári, C. "Exploration- Exploitation Tradeoff using Variance Estimates in Multi-Armed Bandits." Theoretical Computer Science, 2009.