
Policy

coleman4hcs.policy

Policies for multi-armed bandit and contextual bandit action selection.

This module provides a collection of policies that are designed to operate with multi-armed bandits and contextual bandits. Each policy dictates how an agent will select its actions based on prior knowledge, current context, or exploration strategies.

Classes:

    Policy
        Basic policy class that prescribes actions based on the memory of an agent.
    EpsilonGreedyPolicy
        Chooses either the best apparent action or a random one based on a probability epsilon.
    GreedyPolicy
        Always chooses the best apparent action.
    RandomPolicy
        Always chooses a random action.
    UCBPolicyBase
        Base class for Upper Confidence Bound policies.
    UCB1Policy
        Implementation of the UCB1 algorithm.
    UCBPolicy
        A variation of the UCB algorithm with a scaling factor.
    FRRMABPolicy
        Fitness-Rate-Rank based Multi-Armed Bandit policy.
    SlMABPolicy
        Sliding window-based Multi-Armed Bandit policy.
    LinUCBPolicy
        Contextual bandit policy using linear upper confidence bounds.
    SWLinUCBPolicy
        Variation of LinUCBPolicy using a sliding window approach.

Notes
  • UCB (Upper Confidence Bound) policies are designed to balance exploration and exploitation by considering both the estimated reward of an action and the uncertainty around that reward.
  • EpsilonGreedy and its variations (Greedy, Random) are simpler strategies that either exploit the best-known action or explore random actions based on a fixed probability.
  • LinUCB and SWLinUCB are contextual bandits. They choose actions not just based on past rewards, but also considering the current context. SWLinUCB adds a sliding window mechanism to LinUCB, giving more weight to recent actions.
References

.. [1] Lihong Li, et al. "A Contextual-Bandit Approach to Personalized News Article Recommendation." In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010.
.. [2] Nicolas Gutowski, Tassadit Amghar, Olivier Camp, and Fabien Chhel. "Global Versus Individual Accuracy in Contextual Multi-Armed Bandit." In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC '19), April 8-12, 2019, Limassol, Cyprus.

EpsilonGreedyPolicy

Bases: Policy

Epsilon-Greedy policy for action selection.

Chooses a random action with probability epsilon and takes the best apparent action with probability 1 - epsilon. If multiple actions are tied for best, one of them is selected at random.

Parameters:

    epsilon : float
        Probability of choosing a random action. Required.

Attributes:

    epsilon : float
        Probability of choosing a random action.

__init__

__init__(epsilon)

Initialize the EpsilonGreedyPolicy.

Parameters:

    epsilon : float
        Probability of choosing a random action. Required.

__str__

__str__()

Return a string representation of the policy.

Returns:

    str
        The policy name with epsilon value.

choose_all

choose_all(agent)

Choose all actions using the epsilon-greedy strategy.

Parameters:

    agent : Agent
        The agent whose actions are to be prioritized. Required.

Returns:

    list of str
        List of action names ordered by the epsilon-greedy strategy.
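The ordering strategy described above can be sketched as follows. This is an illustrative standalone function, not the module's actual implementation; the names `epsilon_greedy_order`, `actions`, and `q_values` are stand-ins for whatever the real `Agent` exposes:

```python
import random

def epsilon_greedy_order(actions, q_values, epsilon, rng=random):
    """Return action names ordered by an epsilon-greedy strategy.

    With probability `epsilon` the ordering is random (exploration);
    otherwise actions are sorted by estimated value, with random
    jitter breaking ties among equally good actions.
    """
    names = list(actions)
    if rng.random() < epsilon:
        rng.shuffle(names)  # explore: random priority order
        return names
    # Exploit: sort by Q value, descending; rng.random() breaks ties
    return sorted(names, key=lambda a: (-q_values[a], rng.random()))

order = epsilon_greedy_order(["t1", "t2", "t3"],
                             {"t1": 0.2, "t2": 0.9, "t3": 0.2},
                             epsilon=0.0)
# With epsilon = 0 this is pure exploitation, so "t2" comes first;
# "t1" and "t3" are tied and ordered randomly.
```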

FRRMABPolicy

Bases: Policy

Fitness-Rate-Rank based Multi-Armed Bandit (FRRMAB) policy.

Parameters:

    c : float
        Exploration parameter. Required.
    decayed_factor : float
        Decay factor for ranking. Default is 1.

Attributes:

    c : float
        Exploration parameter.
    decayed_factor : float
        Decay factor for ranking.
    history : DataFrame
        History of actions and their outcomes.

__init__

__init__(c, decayed_factor=1)

Initialize the FRRMABPolicy.

Parameters:

    c : float
        Exploration parameter. Required.
    decayed_factor : float
        Decay factor for ranking. Default is 1.

__str__

__str__()

Return a string representation of the policy.

Returns:

    str
        The policy name with C and D values.

choose_all

choose_all(agent)

Choose all actions based on Q values from the FRRMAB history.

Parameters:

    agent : Agent
        The agent whose actions are to be prioritized. Required.

Returns:

    list of str
        List of action names sorted by Q value.

credit_assignment

credit_assignment(agent)

Assign credit using the Fitness-Rate-Rank method.

Parameters:

    agent : RewardSlidingWindowAgent
        The agent for which credit assignment is to be performed. Required.
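A minimal sketch of Fitness-Rate-Rank-style credit assignment, assuming the usual FRRMAB recipe: rank actions by windowed reward, decay by rank, normalize to get FRR values, then add a UCB-style exploration term. The function and argument names are hypothetical; the actual class works on its `history` DataFrame:

```python
import math

def frrmab_scores(rewards, counts, c, decay=1.0):
    """Fitness-Rate-Rank credit assignment sketch.

    rewards: {action: summed reward over the sliding window}
    counts:  {action: times the action appeared in the window}
    """
    # Rank actions by windowed reward (rank 0 = best), decay by rank
    ranked = sorted(rewards, key=rewards.get, reverse=True)
    decayed = {a: (decay ** rank) * rewards[a]
               for rank, a in enumerate(ranked)}
    total = sum(decayed.values()) or 1.0
    frr = {a: decayed[a] / total for a in rewards}  # normalized credit
    # Add a UCB-style exploration bonus favoring rarely used actions
    n_total = sum(counts.values())
    return {a: frr[a] + c * math.sqrt(2 * math.log(n_total) / counts[a])
            for a in rewards}

scores = frrmab_scores({"a": 3.0, "b": 1.0}, {"a": 2, "b": 2}, c=0.5)
# "a" has the higher windowed reward and the same count, so it scores higher.
```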

GreedyPolicy

Bases: EpsilonGreedyPolicy

Greedy policy that always takes the best apparent action.

Ties are broken by random selection. This is a special case of EpsilonGreedy where epsilon = 0 (always exploit).

__init__

__init__()

Initialize the GreedyPolicy with epsilon = 0.

__str__

__str__()

Return a string representation of the policy.

Returns:

    str
        The policy name.

LinUCBPolicy

Bases: Policy

LinUCB with Disjoint Linear Models.

Parameters:

    alpha : float
        The constant that determines the width of the upper confidence bound. Default is 0.5.

Attributes:

    alpha : float
        The exploration parameter.
    context : dict
        Dictionary containing A matrices, their inverses, and b vectors for each action.
    context_features : object or None
        Current context features.
    features : object or None
        Feature names.

References

.. [1] Lihong Li, et al. "A Contextual-Bandit Approach to Personalized News Article Recommendation." In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010.

__init__

__init__(alpha=0.5)

Initialize LinUCBPolicy.

Parameters:

    alpha : float
        The constant that determines the width of the upper confidence bound. Default is 0.5.

__str__

__str__()

Return a string representation of the policy.

Returns:

    str
        The policy name with alpha value.

add_action

add_action(action_id)

Add an action to the policy's context.

Parameters:

    action_id : str
        The identifier of the action to add. Required.

choose_all

choose_all(agent)

Choose all actions based on the LinUCB policy.

Parameters:

    agent : Agent
        The agent whose actions are to be prioritized. Required.

Returns:

    list of str
        List of action names sorted by Q value in descending order.

Raises:

    QException
        If the Q computation produces a result of unexpected shape.

credit_assignment

credit_assignment(agent)

Assign credit based on the agent's actions and rewards.

Parameters:

    agent : Agent
        The agent for which credit assignment is to be performed. Required.

update_actions

update_actions(agent, new_actions)

Update actions based on the agent's context.

Parameters:

    agent : ContextualAgent
        The contextual agent providing context information. Required.
    new_actions : list of str
        List of new action identifiers to add. Required.
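The disjoint-linear-model scoring and update described in [1] can be sketched per arm as below. Each arm starts with A = I and b = 0; the score is theta^T x + alpha * sqrt(x^T A^{-1} x), with theta = A^{-1} b. These helper names are illustrative, not the class's actual methods:

```python
import numpy as np

def linucb_score(x, A, b, alpha=0.5):
    """Score one arm under disjoint LinUCB (Li et al., 2010):
    p = theta^T x + alpha * sqrt(x^T A^{-1} x), theta = A^{-1} b."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b  # ridge-regression estimate of the arm's weights
    return float(theta @ x + alpha * np.sqrt(x @ A_inv @ x))

def linucb_update(x, A, b, reward):
    """Rank-one update after observing `reward` for the chosen arm."""
    A += np.outer(x, x)
    b += reward * x
    return A, b

d = 3
A, b = np.eye(d), np.zeros(d)      # fresh arm: A = I, b = 0
x = np.array([1.0, 0.0, 0.5])      # current context features
score0 = linucb_score(x, A, b)     # theta = 0, so pure exploration term
A, b = linucb_update(x, A, b, reward=1.0)
score1 = linucb_score(x, A, b)     # reward pulls the estimate upward
```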

Policy

A policy prescribes an action to be taken based on the memory of an agent.

__str__

__str__()

Return a string representation of the policy.

Returns:

    str
        The policy name.

choose_all

choose_all(agent)

Return all actions in their default (untreated) order.

Parameters:

    agent : Agent
        The agent whose actions are to be returned. Required.

Returns:

    list of str
        List of action names.

credit_assignment

credit_assignment(agent)

Assign credit to actions based on their outcomes.

The credit assignment method calculates the value estimates for each action based on the rewards observed. The specific implementation of how credit is assigned depends on the policy in use.

Parameters:

    agent : Agent
        The agent for which credit assignment is to be performed. Required.
Notes

This is a base method and should be overridden in derived classes to provide specific credit assignment logic. The method modifies the agent's state, updating the value estimates for each action based on the outcomes observed.
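The override pattern can be sketched as follows with a simple incremental-mean update (Q += (r - Q) / n). The attribute names `last_rewards`, `counts`, and `value_estimates` are stand-ins for whatever the real `Agent` exposes, and `SimpleNamespace` merely fakes an agent for illustration:

```python
from types import SimpleNamespace

class MeanRewardPolicy:
    """Illustrative Policy subclass: credit_assignment updates the
    agent's value estimates with an incremental mean."""

    def credit_assignment(self, agent):
        for action, reward in agent.last_rewards.items():
            agent.counts[action] = agent.counts.get(action, 0) + 1
            q = agent.value_estimates.get(action, 0.0)
            # Incremental mean: Q_new = Q + (r - Q) / n
            agent.value_estimates[action] = (
                q + (reward - q) / agent.counts[action]
            )

agent = SimpleNamespace(last_rewards={"t1": 1.0},
                        counts={}, value_estimates={})
MeanRewardPolicy().credit_assignment(agent)
# agent.value_estimates["t1"] is now 1.0 after one observation
```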

RandomPolicy

Bases: Policy

Random policy that randomly selects from all available actions.

No consideration is given to which action is apparently best. This is a special case of EpsilonGreedy where epsilon = 1 (always explore).

__str__

__str__()

Return a string representation of the policy.

Returns:

    str
        The policy name.

choose_all

choose_all(agent)

Choose all actions randomly.

Parameters:

    agent : Agent
        The agent whose actions are to be shuffled. Required.

Returns:

    list of str
        Randomly ordered list of action names.

SWLinUCBPolicy

Bases: LinUCBPolicy

LinUCB with Disjoint Linear Models and Sliding Window.

References

.. [1] Nicolas Gutowski, Tassadit Amghar, Olivier Camp, and Fabien Chhel. "Global Versus Individual Accuracy in Contextual Multi-Armed Bandit." In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC '19), April 8-12, 2019, Limassol, Cyprus. ACM, 8 pages.

__str__

__str__()

Return a string representation of the policy.

Returns:

    str
        The policy name with alpha value.

choose_all

choose_all(agent)

Choose all actions based on the sliding window policy.

Parameters:

    agent : Agent
        The agent whose actions are to be prioritized. Required.

Returns:

    list of str
        List of action names sorted by Q value in descending order.

Raises:

    QException
        If the Q computation produces a result of unexpected shape.

SlMABPolicy

Bases: Policy

Sliding-window Multi-Armed Bandit (SlMAB) policy.

Attributes:

    history : DataFrame
        History of actions and their outcomes.

__init__

__init__()

Initialize the SlMABPolicy.

__str__

__str__()

Return a string representation of the policy.

The closing parenthesis is intentionally omitted so the agent can append the window size when constructing the full label.

Returns:

    str
        The policy name without closing parenthesis.

choose_all

choose_all(agent)

Choose all actions based on Q values from the SlMAB history.

Parameters:

    agent : Agent
        The agent whose actions are to be prioritized. Required.

Returns:

    list of str
        List of action names sorted by Q value.

credit_assignment

credit_assignment(agent)

Assign credit using the SlMAB method.

Parameters:

    agent : RewardSlidingWindowAgent
        The agent for which credit assignment is to be performed. Required.
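A sliding-window bandit can be sketched as below; this is an assumed form, since the exact SlMAB formula is not spelled out on this page. Only the last `window` observations contribute to each arm's mean and pull count, so stale rewards age out. The function and argument names are illustrative:

```python
import math

def sliding_window_q(history, window, c=0.3):
    """Score arms from the last `window` (action, reward) observations:
    windowed mean reward plus a UCB-style exploration bonus."""
    recent = list(history)[-window:]
    totals, counts = {}, {}
    for action, reward in recent:
        totals[action] = totals.get(action, 0.0) + reward
        counts[action] = counts.get(action, 0) + 1
    n = len(recent)
    # Arms absent from the window get no score: they aged out entirely
    return {a: totals[a] / counts[a]
               + c * math.sqrt(math.log(n) / counts[a])
            for a in totals}

# "a" was only played before the window; "b" fills the last 5 slots
q = sliding_window_q([("a", 0.0)] * 5 + [("b", 1.0)] * 5, window=5)
```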

UCB1Policy

Bases: UCBPolicyBase

Upper Confidence Bound algorithm (UCB1).

Adds an exploration term to the expected value of each arm, nudging an otherwise greedy selection strategy to explore less-sampled, less-confident options.

__str__

__str__()

Return a string representation of the policy.

Returns:

    str
        The policy name with C value.

credit_assignment

credit_assignment(agent)

Assign credit using the UCB1 formula.

Parameters:

    agent : Agent
        The agent for which credit assignment is to be performed. Required.
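The classic UCB1 estimate is the mean reward plus an exploration bonus sqrt(2 ln N / n_i), which shrinks as an arm is pulled more often; since this implementation exposes a C parameter, the sketch below scales the bonus by `c`. The function name and arguments are illustrative:

```python
import math

def ucb1_scores(totals, counts, c=1.0):
    """UCB1 value estimates: mean_i + c * sqrt(2 ln N / n_i).

    totals: {action: summed reward}, counts: {action: times pulled}.
    """
    n_total = sum(counts.values())
    return {a: totals[a] / counts[a]
               + c * math.sqrt(2 * math.log(n_total) / counts[a])
            for a in totals}

# "b" has a higher mean (0.8 vs 0.5) and fewer pulls, so it wins on
# both the exploitation and exploration terms.
scores = ucb1_scores({"a": 4.0, "b": 1.6}, {"a": 8, "b": 2})
```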

UCBPolicy

Bases: UCBPolicyBase

Upper Confidence Bound algorithm (UCB) with scaling factor.

__str__

__str__()

Return a string representation of the policy.

Returns:

    str
        The policy name with C value.

credit_assignment

credit_assignment(agent)

Assign credit using the UCB formula with scaling factor.

Parameters:

    agent : Agent
        The agent for which credit assignment is to be performed. Required.

UCBPolicyBase

Bases: Policy

Base class for Upper Confidence Bound (UCB) policies.

Parameters:

    c : float
        Exploration parameter controlling the width of the confidence bound. Required.

Attributes:

    c : float
        Exploration parameter.

__init__

__init__(c)

Initialize the UCBPolicyBase.

Parameters:

    c : float
        Exploration parameter controlling the width of the confidence bound. Required.

choose_all

choose_all(agent)

Choose all actions sorted by Q value in descending order.

Parameters:

    agent : Agent
        The agent whose actions are to be prioritized. Required.

Returns:

    list of str
        List of action names sorted by Q value.