Skip to content

Policy

coleman.policy

Policies for multi-armed bandit and contextual bandit action selection.

This module provides a collection of policies that are designed to operate with multi-armed bandits and contextual bandits. Each policy dictates how an agent will select its actions based on prior knowledge, current context, or exploration strategies.

Classes:

Name Description
Policy

Basic policy class that prescribes actions based on the memory of an agent.

EpsilonGreedyPolicy

Chooses either the best apparent action or a random one based on a probability epsilon.

GreedyPolicy

Always chooses the best apparent action.

RandomPolicy

Always chooses a random action.

UCBPolicyBase

Base class for Upper Confidence Bound policies.

UCB1Policy

Implementation of the UCB1 algorithm.

UCBPolicy

A variation of the UCB algorithm with a scaling factor.

FRRMABPolicy

Fitness-Rate-Rank based Multi-Armed Bandit policy.

SlMABPolicy

Sliding window-based Multi-Armed Bandit policy.

LinUCBPolicy

Contextual bandit policy using linear upper confidence bounds.

SWLinUCBPolicy

Variation of LinUCBPolicy using a sliding window approach.

CombinatorialUCBPolicy

Combinatorial policy that prioritizes a top-k subset via UCB scores.

CombinatorialThompsonPolicy

Combinatorial policy that prioritizes a top-k subset via Thompson sampling.

DuelingUCBPolicy

Dueling/ranking policy based on pairwise UCB-style preference estimates.

PairwiseThompsonRankingPolicy

Dueling/ranking policy based on pairwise Thompson preference sampling.

PortfolioUCBPolicy

Meta-policy that selects among candidate policies online using UCB.

Notes
  • UCB (Upper Confidence Bound) policies are designed to balance exploration and exploitation by considering both the estimated reward of an action and the uncertainty around that reward.
  • EpsilonGreedy and its variations (Greedy, Random) are simpler strategies that either exploit the best-known action or explore random actions based on a fixed probability.
  • LinUCB and SWLinUCB are contextual bandits. They choose actions not just based on past rewards, but also considering the current context. SWLinUCB adds a sliding window mechanism to LinUCB, giving more weight to recent actions.
  • Combinatorial policies prioritize a strong subset first (top-k) while preserving compatibility with full-ranking execution.
  • Dueling/ranking policies learn pairwise preferences and derive a global priority order.
  • Portfolio meta-policies select which policy to use online based on recent performance.
References

.. [1] Lihong Li, et al. "A Contextual-Bandit Approach to Personalized News Article Recommendation." In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010. .. [2] Nicolas Gutowski, Tassadit Amghar, Olivier Camp, and Fabien Chhel. "Global Versus Individual Accuracy in Contextual Multi-Armed Bandit." In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC '19), April 8-12, 2019, Limassol, Cyprus. .. [3] Chen, W.; Wang, Y.; Yuan, Y. "Combinatorial Multi-Armed Bandit: General Framework and Applications." ICML, 2013. .. [4] Yue, Y.; Joachims, T. "Beat the Mean Bandit." ICML, 2011. .. [5] Zoghi, M.; Karnin, Z.; Whiteson, S.; de Rijke, M.; Munos, R. "Copeland Dueling Bandits." NeurIPS, 2015. .. [6] Auer, P.; Cesa-Bianchi, N.; Fischer, P. "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning, 2002.

BayesianUCBPolicy

Bases: ThompsonSamplingPolicy

Bayesian UCB with Beta posterior mean and standard deviation bonus.

References

.. [1] Kaufmann, E.; Korda, N.; Munos, R. "Thompson Sampling: an asymptotically optimal finite-time analysis." ALT, 2012.

__init__

__init__(c=2.0, alpha_prior=1.0, beta_prior=1.0)

Initialize Bayesian UCB confidence scaling and Beta priors.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select actions using Bayesian UCB with Beta posterior statistics.

BootstrappedThompsonPolicy

Bases: Policy, DeltaRewardMixin

Bootstrapped Thompson Sampling using Poisson-resampled heads.

References

.. [1] Osband, I.; Van Roy, B.; Wen, Z. "Generalization and Exploration via Randomized Value Functions." ICML, 2016.

__init__

__init__(n_bootstrap=8)

Initialize the number of bootstrap heads.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select actions using bootstrap heads.

credit_assignment

credit_assignment(agent)

Update bootstrap sums and counts.

ChangeDetectionUCBPolicy

Bases: Policy, DeltaRewardMixin

UCB with simple change detection and per-arm reset.

References

.. [1] Liu, F.; Lee, J.; Shroff, N. "A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem." AAAI, 2018. .. [2] Garivier, A.; Moulines, E. "On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems." ALT, 2011.

__init__

__init__(c=1.0, window=20, threshold=0.25)

Initialize UCB constant, change-detection window and threshold.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select actions using UCB with change detection.

credit_assignment

credit_assignment(agent)

Update statistics with change detection.

CombinatorialThompsonPolicy

Bases: Policy, DeltaRewardMixin

Subset-first Thompson ranking using independent Beta posteriors.

What it is

A combinatorial bandit policy that samples action quality from per-action posteriors and prioritizes a top-k subset before the remaining actions. This preserves exploration while emphasizing a smaller candidate set.

Parameters:

Name Type Description Default
subset_size int

Number of actions to place at the front of the ranking.

5
alpha_prior float

Prior alpha parameter for each action.

1.0
beta_prior float

Prior beta parameter for each action.

1.0
References

.. [1] Agrawal, S.; Goyal, N. "Analysis of Thompson Sampling for the Multi-armed Bandit Problem." COLT, 2012. .. [2] Wang, S.; Chen, W. "Thompson Sampling for Combinatorial Semi- Bandits." ICML, 2018.

__init__

__init__(subset_size=5, alpha_prior=1.0, beta_prior=1.0)

Initialize combinatorial Thompson Sampling parameters.

__str__

__str__()

Return a concise human-readable policy description.

choose_all

choose_all(agent)

Sample Thompson scores for every action and return subset-first ordering.

credit_assignment

credit_assignment(agent)

Update Beta posterior parameters from bounded step rewards.

CombinatorialUCBPolicy

Bases: Policy

Subset-first UCB ranking for combinatorial action selection.

What it is

A combinatorial bandit policy that focuses on selecting a strong subset of actions first (top-k), instead of assuming all actions are equally relevant at every step. It still returns a full ordering for compatibility with the current agent contract.

Parameters:

Name Type Description Default
subset_size int

Number of actions to prioritize as the subset head.

5
c float

Exploration strength for the UCB bonus.

1.0
References

.. [1] Chen, W.; Wang, Y.; Yuan, Y. "Combinatorial Multi-Armed Bandit: General Framework and Applications." ICML, 2013. .. [2] Kveton, B.; Wen, Z.; Ashkan, A.; Szepesvári, C. "Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits." AISTATS, 2015.

__init__

__init__(subset_size=5, c=1.0)

Initialize combinatorial UCB parameters.

__str__

__str__()

Return a concise human-readable policy description.

choose_all

choose_all(agent)

Rank all actions using UCB and keep the best subset at the head.

credit_assignment

credit_assignment(agent)

Apply the default reward update for compatibility with base policy behavior.

ContextualEpsilonGreedyPolicy

Bases: LinUCBPolicy

Contextual epsilon-greedy based on linear reward estimates.

References

.. [1] Langford, J.; Zhang, T. "The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information." NIPS, 2007.

__init__

__init__(epsilon=0.1)

Initialize contextual epsilon-greedy policy.

Parameters:

Name Type Description Default
epsilon float

Exploration probability.

0.1

__str__

__str__()

Return a string representation of the policy.

choose_all

choose_all(agent)

Choose all actions with epsilon-randomized contextual scoring.

DecayEpsilonGreedyPolicy

Bases: Policy

Epsilon-greedy with polynomial decay schedule.

References

.. [1] Sutton, R. S.; Barto, A. G. "Reinforcement Learning: An Introduction." MIT Press, 2018.

__init__

__init__(epsilon0=1.0, decay=0.5, min_epsilon=0.01)

Initialize decay schedule hyperparameters.

__str__

__str__()

Return a string representation of the policy.

choose_all

choose_all(agent)

Choose actions by decaying epsilon-greedy ordering.

DiscountedUCBPolicy

Bases: Policy, DeltaRewardMixin

Discounted-UCB for non-stationary rewards.

References

.. [1] Garivier, A.; Moulines, E. "On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems." ALT, 2011.

__init__

__init__(gamma=0.95, c=1.0)

Initialize discount factor and exploration constant.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select actions using discounted UCB scores.

credit_assignment

credit_assignment(agent)

Update discounted statistics.

DuelingUCBPolicy

Bases: Policy

Pairwise-UCB ranking with Copeland-style scores.

What it is

A dueling/ranking bandit policy that learns preferences through pairwise comparisons between actions, then induces a global ranking from these preferences using Copeland-like aggregation.

Parameters:

Name Type Description Default
c float

Exploration parameter for optimistic pairwise preference estimates.

1.0
References

.. [1] Yue, Y.; Joachims, T. "Beat the Mean Bandit." ICML, 2011. .. [2] Zoghi, M.; Karnin, Z.; Whiteson, S.; de Rijke, M.; Munos, R. "Copeland Dueling Bandits." NeurIPS, 2015.

__init__

__init__(c=1.0)

Initialize DuelingUCB exploration parameter and pairwise counters.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select and rank all actions using Copeland scores.

credit_assignment

credit_assignment(agent)

Update preference estimates from the ranking.

EXP3IXPolicy

Bases: EXP3Policy

EXP3-IX with implicit exploration.

References

.. [1] Neu, G. "First-order regret bounds for combinatorial semi-bandits." COLT, 2015.

__init__

__init__(eta=0.1, gamma=0.01)

Initialize EXP3-IX learning rate and exploration parameters.

__str__

__str__()

Return string representation of the policy.

credit_assignment

credit_assignment(agent)

Update weights using implicit exploration.

EXP3Policy

Bases: Policy, DeltaRewardMixin

EXP3 for adversarial bandits.

References

.. [1] Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R. "The Nonstochastic Multiarmed Bandit Problem." SIAM J. Computing, 2002.

__init__

__init__(gamma=0.07)

Initialize EXP3 exploration weight and gamma parameter.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select and rank actions by EXP3 probabilities.

credit_assignment

credit_assignment(agent)

Update weights based on adversarial rewards.

EpsilonDecreasingPolicy

Bases: Policy

Epsilon-greedy with decreasing exploration schedule.

References

.. [1] Sutton, R. S.; Barto, A. G. "Reinforcement Learning: An Introduction." MIT Press, 2018.

__init__

__init__(epsilon0=1.0, decay=0.5)

Initialize epsilon schedule parameters.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select actions using epsilon-greedy with decreasing exploration.

credit_assignment

credit_assignment(agent)

Update value estimates through parent policy.

EpsilonGreedyPolicy

Bases: Policy

Epsilon-Greedy policy for action selection.

Chooses a random action with probability epsilon and takes the best apparent approach with probability 1-epsilon. If multiple actions are tied for best choice, then a random action from that subset is selected.

Parameters:

Name Type Description Default
epsilon float

Probability of choosing a random action.

required

Attributes:

Name Type Description
epsilon float

Probability of choosing a random action.

__init__

__init__(epsilon)

Initialize the EpsilonGreedyPolicy.

Parameters:

Name Type Description Default
epsilon float

Probability of choosing a random action.

required

__str__

__str__()

Return a string representation of the policy.

Returns:

Type Description
str

The policy name with epsilon value.

choose_all

choose_all(agent)

Choose all actions using the epsilon-greedy strategy.

Each action is independently flagged for random exploration with probability epsilon. Exploration actions are placed first in a random order; the remaining actions are sorted by Q value descending (exploitation).

Parameters:

Name Type Description Default
agent Agent

The agent whose actions are to be prioritized.

required

Returns:

Type Description
list of str

List of action names ordered by the epsilon-greedy strategy.

FRRMABPolicy

Bases: Policy

Fitness-Rate-Rank based Multi-Armed Bandit (FRRMAB) policy.

Parameters:

Name Type Description Default
c float

Exploration parameter.

required
decayed_factor float

Decay factor for ranking. Default is 1.

1

Attributes:

Name Type Description
c float

Exploration parameter.

decayed_factor float

Decay factor for ranking.

history DataFrame

History of actions and their outcomes.

__init__

__init__(c, decayed_factor=1)

Initialize the FRRMABPolicy.

Parameters:

Name Type Description Default
c float

Exploration parameter.

required
decayed_factor float

Decay factor for ranking. Default is 1.

1

__str__

__str__()

Return a string representation of the policy.

Returns:

Type Description
str

The policy name with C and D values.

choose_all

choose_all(agent)

Choose all actions based on Q values from the FRRMAB history.

Parameters:

Name Type Description Default
agent Agent

The agent whose actions are to be prioritized.

required

Returns:

Type Description
list of str

List of action names sorted by Q value.

credit_assignment

credit_assignment(agent)

Assign credit using the Fitness-Rate-Rank method.

Parameters:

Name Type Description Default
agent RewardSlidingWindowAgent

The agent for which credit assignment is to be performed.

required

GreedyPolicy

Bases: EpsilonGreedyPolicy

Greedy policy that always takes the best apparent action.

Ties are broken by random selection. This is a special case of EpsilonGreedy where epsilon = 0 (always exploit).

__init__

__init__()

Initialize the GreedyPolicy with epsilon = 0.

__str__

__str__()

Return a string representation of the policy.

Returns:

Type Description
str

The policy name.

KLUCBPolicy

Bases: Policy

KL-UCB for Bernoulli rewards.

References

.. [1] Garivier, A.; Cappé, O. "The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond." COLT, 2011.

__init__

__init__(c=3.0)

Initialize the KL-UCB exploration constant.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select actions using KL-UCB index.

credit_assignment

credit_assignment(agent)

Update value estimates through parent policy.

LinTSPolicy

Bases: LinUCBPolicy

Linear Thompson Sampling for contextual bandits.

References

.. [1] Agrawal, S.; Goyal, N. "Thompson Sampling for Contextual Bandits with Linear Payoffs." ICML, 2013.

__init__

__init__(alpha=0.5)

Initialize the LinTS exploration scale.

__str__

__str__()

Return a string representation of the policy.

choose_all

choose_all(agent)

Choose all actions by Thompson samples from linear posteriors.

LinUCBPolicy

Bases: Policy

LinUCB with Disjoint Linear Models.

Parameters:

Name Type Description Default
alpha float

The constant that determines the width of the upper confidence bound. Default is 0.5.

0.5

Attributes:

Name Type Description
alpha float

The exploration parameter.

context dict

Dictionary containing A matrices, their inverses, and b vectors for each action.

context_features object or None

Current context features.

features object or None

Feature names.

References

.. [1] Lihong Li, et al. "A Contextual-Bandit Approach to Personalized News Article Recommendation." In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010.

__init__

__init__(alpha=0.5)

Initialize LinUCBPolicy.

Parameters:

Name Type Description Default
alpha float

The constant that determines the width of the upper confidence bound. Default is 0.5.

0.5

__str__

__str__()

Return a string representation of the policy.

Returns:

Type Description
str

The policy name with alpha value.

add_action

add_action(action_id)

Add an action to the policy's context.

Parameters:

Name Type Description Default
action_id str

The identifier of the action to add.

required

choose_all

choose_all(agent)

Choose all actions based on the LinUCB policy.

Parameters:

Name Type Description Default
agent Agent

The agent whose actions are to be prioritized.

required

Returns:

Type Description
list of str

List of action names sorted by Q value in descending order.

Raises:

Type Description
QException

If Q computation results in unexpected shape.

credit_assignment

credit_assignment(agent)

Assign credit based on the agent's actions and rewards.

Parameters:

Name Type Description Default
agent Agent

The agent for which credit assignment is to be performed.

required

update_actions

update_actions(agent, new_actions)

Update actions based on the agent's context.

Parameters:

Name Type Description Default
agent ContextualAgent

The contextual agent providing context information.

required
new_actions list of str

List of new action identifiers to add.

required

MOSSUCBPolicy

Bases: Policy

Minimax Optimal Strategy in the Stochastic case (MOSS).

References

.. [1] Audibert, J.-Y.; Bubeck, S. "Minimax Policies for Adversarial and Stochastic Bandits." COLT, 2009.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select actions using MOSS minimax optimal strategy.

credit_assignment

credit_assignment(agent)

Update value estimates through parent policy.

OptimisticGreedyPolicy

Bases: GreedyPolicy

Greedy with optimistic initialization for unseen actions.

References

.. [1] Sutton, R. S.; Barto, A. G. "Reinforcement Learning: An Introduction." MIT Press, 2018.

__init__

__init__(optimistic_q=1.0)

Initialize optimistic Q value for never-attempted actions.

__str__

__str__()

Return a string representation of the policy.

credit_assignment

credit_assignment(agent)

Assign empirical means and keep optimistic scores for unseen actions.

PHEPolicy

Bases: Policy, DeltaRewardMixin

Perturbed-History Exploration (PHE).

References

.. [1] Kveton, B.; et al. "Perturbed-History Exploration in Stochastic Multi-Armed Bandits." IJCAI, 2019.

__init__

__init__(a=1.0)

Initialize PHE perturbation multiplier.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select actions using perturbed history exploration.

credit_assignment

credit_assignment(agent)

Update success and count statistics.

PairwiseThompsonRankingPolicy

Bases: Policy

Pairwise Thompson Sampling over duels for ranking.

What it is

A ranking-oriented bandit policy that models pairwise preferences with Beta posteriors and samples them online to produce a total action ordering.

Uses a Beta posterior per unordered action pair and samples preferences to build a ranking at each decision step.

References

.. [1] Thompson, W. R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples." Biometrika, 1933. .. [2] Wu, H.; Liu, X.; Yue, Y. "Double Thompson Sampling for Dueling Bandits." NeurIPS, 2016.

__init__

__init__(alpha_prior=1.0, beta_prior=1.0)

Initialize Beta priors for each pairwise action combination.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Sample and rank all actions using Thompson sampling.

credit_assignment

credit_assignment(agent)

Update Beta posteriors from the ranking.

Policy

A policy prescribes an action to be taken based on the memory of an agent.

__str__

__str__()

Return a string representation of the policy.

Returns:

Type Description
str

The policy name.

choose_all

choose_all(agent)

Return all actions in their default (untreated) order.

Parameters:

Name Type Description Default
agent Agent

The agent whose actions are to be returned.

required

Returns:

Type Description
list of str

List of action names.

credit_assignment

credit_assignment(agent)

Assign credit to actions based on their outcomes.

The credit assignment method calculates the value estimates for each action based on the rewards observed. The specific implementation of how credit is assigned depends on the policy in use.

Parameters:

Name Type Description Default
agent Agent

The agent for which credit assignment is to be performed.

required
Notes

This is a base method and should be overridden in derived classes to provide specific credit assignment logic. The method modifies the agent's state, updating the value estimates for each action based on the outcomes observed.

PortfolioUCBPolicy

Bases: Policy

UCB portfolio selector over a set of candidate policies.

What it is

A meta-policy that treats each candidate policy as an arm in a portfolio bandit. It selects which policy to run online based on recent performance, balancing exploitation of strong policies and exploration of alternatives.

Parameters:

Name Type Description Default
policies list[Policy]

Candidate policies that can be selected online.

required
c float

Exploration strength for portfolio-level UCB.

1.0
window_size int

Number of recent rewards used to score each candidate policy.

20
References

.. [1] Auer, P.; Cesa-Bianchi, N.; Fischer, P. "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning, 2002. .. [2] Fialho, Á.; Da Costa, L.; Schoenauer, M.; Sebag, M. "Analyzing Bandit-Based Adaptive Operator Selection Mechanisms." Annals of Mathematics and Artificial Intelligence, 2010.

active_policy property

active_policy

Return the currently selected policy.

__init__

__init__(policies, c=1.0, window_size=20)

Initialize the portfolio with candidate policies and UCB parameters.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select a policy and delegate action selection to it.

credit_assignment

credit_assignment(agent)

Update the active policy and track portfolio reward.

PursuitPolicy

Bases: Policy

Pursuit method that tracks best arm and shifts selection probabilities.

References

.. [1] Thathachar, M. A. L.; Sastry, P. S. "A New Approach to the Design of Reinforcement Schemes for Learning Automata." IEEE Trans. SMC, 1985.

__init__

__init__(beta=0.1)

Initialize Pursuit beta update rate.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select and rank actions by pursuit probabilities.

credit_assignment

credit_assignment(agent)

Update pursuit probabilities based on best action.

RandomPolicy

Bases: Policy

Random policy that randomly selects from all available actions.

No consideration is given to which action is apparently best.

__str__

__str__()

Return a string representation of the policy.

Returns:

Type Description
str

The policy name.

choose_all

choose_all(agent)

Choose all actions randomly.

Parameters:

Name Type Description Default
agent Agent

The agent whose actions are to be shuffled.

required

Returns:

Type Description
list of str

Randomly ordered list of action names.

SWContextualEpsilonGreedyPolicy

Bases: ContextualEpsilonGreedyPolicy

Sliding-window contextual epsilon-greedy policy.

References

.. [1] Besbes, O.; Gur, Y.; Zeevi, A. "Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards." NeurIPS, 2014.

__str__

__str__()

Return a string representation of the policy.

choose_all

choose_all(agent)

Choose all actions with contextual epsilon-greedy and window discount.

SWLinTSPolicy

Bases: LinTSPolicy

Sliding-window Linear Thompson Sampling.

References

.. [1] Russac, Y.; et al. "Weighted Linear Bandits for Non-Stationary Environments." NeurIPS, 2019.

__str__

__str__()

Return a string representation of the policy.

choose_all

choose_all(agent)

Choose all actions with LinTS and sliding-window frequency discount.

SWLinUCBPolicy

Bases: LinUCBPolicy

LinUCB with Disjoint Linear Models and Sliding Window.

References

.. [1] Nicolas Gutowski, Tassadit Amghar, Olivier Camp, and Fabien Chhel. "Global Versus Individual Accuracy in Contextual Multi-Armed Bandit." In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC '19), April 8-12, 2019, Limassol, Cyprus. ACM, 8 pages.

__str__

__str__()

Return a string representation of the policy.

Returns:

Type Description
str

The policy name with alpha value.

choose_all

choose_all(agent)

Choose all actions based on the sliding window policy.

Parameters:

Name Type Description Default
agent Agent

The agent whose actions are to be prioritized.

required

Returns:

Type Description
list of str

List of action names sorted by Q value in descending order.

Raises:

Type Description
QException

If Q computation results in unexpected shape.

SlMABPolicy

Bases: Policy

Sliding Multi-Armed Bandit policy.

Attributes:

Name Type Description
history DataFrame

History of actions and their outcomes.

__init__

__init__()

Initialize the SlMABPolicy.

__str__

__str__()

Return a string representation of the policy.

The closing parenthesis is intentionally omitted so the agent can append the window size when constructing the full label.

Returns:

Type Description
str

The policy name without closing parenthesis.

choose_all

choose_all(agent)

Choose all actions based on Q values from the SlMAB history.

Parameters:

Name Type Description Default
agent Agent

The agent whose actions are to be prioritized.

required

Returns:

Type Description
list of str

List of action names sorted by Q value.

credit_assignment

credit_assignment(agent)

Assign credit using the SlMAB method.

Parameters:

Name Type Description Default
agent RewardSlidingWindowAgent

The agent for which credit assignment is to be performed.

required

SlidingWindowUCBPolicy

Bases: UCBPolicyBase

Sliding-window UCB policy.

References

.. [1] Garivier, A.; Moulines, E. "On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems." ALT, 2011.

__str__

__str__()

Return a string representation of the policy.

credit_assignment

credit_assignment(agent)

Assign credit using only the most recent sliding-window data.

SoftmaxPolicy

Bases: Policy

Boltzmann/Softmax exploration policy.

References

.. [1] Sutton, R. S.; Barto, A. G. "Reinforcement Learning: An Introduction." MIT Press, 2018.

__init__

__init__(tau=0.2)

Initialize Softmax temperature parameter.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select and rank actions using softmax probabilities.

credit_assignment

credit_assignment(agent)

Update value estimates through parent policy.

ThompsonSamplingPolicy

Bases: Policy, DeltaRewardMixin

Thompson Sampling with Beta-Bernoulli posterior.

References

.. [1] Thompson, W. R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples." Biometrika, 1933. .. [2] Agrawal, S.; Goyal, N. "Analysis of Thompson Sampling for the Multi-armed Bandit Problem." COLT, 2012.

__init__

__init__(alpha_prior=1.0, beta_prior=1.0)

Initialize Beta priors for each arm.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Sample from Beta posteriors and rank actions.

credit_assignment

credit_assignment(agent)

Update Beta priors based on observed rewards.

UCB1Policy

Bases: UCBPolicyBase

Upper Confidence Bound algorithm (UCB1).

Applies an exploration factor to the expected value of each arm which can influence a greedy selection strategy to more intelligently explore less confident options.

__str__

__str__()

Return a string representation of the policy.

Returns:

Type Description
str

The policy name with C value.

credit_assignment

credit_assignment(agent)

Assign credit using the UCB1 formula.

Parameters:

Name Type Description Default
agent Agent

The agent for which credit assignment is to be performed.

required

UCB2Policy

Bases: UCBPolicyBase

UCB2 policy.

References

.. [1] Auer, P.; Cesa-Bianchi, N.; Fischer, P. "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning, 2002.

__init__

__init__(c=1.0, alpha=0.2)

Initialize UCB2 hyperparameters.

__str__

__str__()

Return a string representation of the policy.

credit_assignment

credit_assignment(agent)

Assign credit using UCB2 confidence schedule.

UCBPolicy

Bases: UCBPolicyBase

Upper Confidence Bound algorithm (UCB) with scaling factor.

__str__

__str__()

Return a string representation of the policy.

Returns:

Type Description
str

The policy name with C value.

credit_assignment

credit_assignment(agent)

Assign credit using the UCB formula with scaling factor.

Parameters:

Name Type Description Default
agent Agent

The agent for which credit assignment is to be performed.

required

UCBPolicyBase

Bases: Policy

Base class for Upper Confidence Bound (UCB) policies.

Parameters:

Name Type Description Default
c float

Exploration parameter controlling the width of the confidence bound.

required

Attributes:

Name Type Description
c float

Exploration parameter.

__init__

__init__(c)

Initialize the UCBPolicyBase.

Parameters:

Name Type Description Default
c float

Exploration parameter controlling the width of the confidence bound.

required

choose_all

choose_all(agent)

Choose all actions sorted by Q value in descending order.

Parameters:

Name Type Description Default
agent Agent

The agent whose actions are to be prioritized.

required

Returns:

Type Description
list of str

List of action names sorted by Q value.

UCBTunedPolicy

Bases: Policy, DeltaRewardMixin

UCB-Tuned with empirical variance correction.

References

.. [1] Auer, P.; Cesa-Bianchi, N.; Fischer, P. "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning, 2002.

__init__

__init__(c=1.0)

Initialize UCB-Tuned exploration constant and Welford accumulators.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select actions using UCB-Tuned with variance correction.

credit_assignment

credit_assignment(agent)

Update means and variance estimates using Welford's algorithm.

UCBVPolicy

Bases: UCBTunedPolicy

UCB-V variance-aware confidence bound.

References

.. [1] Audibert, J.-Y.; Munos, R.; Szepesvári, C. "Exploration- Exploitation Tradeoff using Variance Estimates in Multi-Armed Bandits." Theoretical Computer Science, 2009.

__init__

__init__(c=1.0, b=1.0)

Initialize UCB-V confidence and range parameters.

__str__

__str__()

Return string representation of the policy.

choose_all

choose_all(agent)

Select actions using UCB-V variance-aware bounds.