API Reference¶
GSP¶
gsppy.gsp.GSP ¶
Generalized Sequential Pattern (GSP) Algorithm.
The GSP algorithm is used to find frequent sequential patterns in transactional datasets based on a user-defined minimum support threshold. This implementation is optimized for efficiency with candidate generation, batch processing, and multiprocessing.
Attributes: freq_patterns (List[Dict[Tuple, int]]): Stores discovered frequent sequential patterns at each k-sequence level as dictionaries mapping patterns to their support counts. transactions (List[Tuple]): Preprocessed dataset where each transaction is represented as a tuple of items. unique_candidates (List[Tuple]): List of initial singleton candidates (1-item sequences). max_size (int): Length of the longest transaction in the dataset, used to set the maximum k-sequence for pattern generation.
__init__ ¶
__init__(raw_transactions, mingap=None, maxgap=None, maxspan=None, verbose=False, pruning_strategy=None)
Initialize the GSP algorithm with raw transactional data.
Parameters: raw_transactions (Union[List[List[str]], List[List[Tuple[str, float]]]]): Input transaction dataset where each transaction is either: - A list of items (e.g., [['A', 'B'], ['B', 'C', 'D']]) - A list of (item, timestamp) tuples (e.g., [[('A', 1.0), ('B', 2.0)]]) mingap (Optional[float]): Minimum time gap required between consecutive items in patterns. maxgap (Optional[float]): Maximum time gap allowed between consecutive items in patterns. maxspan (Optional[float]): Maximum time span from first to last item in patterns. verbose (bool): Enable verbose logging output with detailed progress information. Default is False (minimal output). pruning_strategy (Optional[PruningStrategy]): Custom pruning strategy for candidate filtering. If None, a default strategy is created based on temporal constraints.
Attributes Initialized:
- Processes the input raw transaction dataset.
- Computes unique singleton candidates (unique_candidates).
- Extracts the maximum transaction size (max_size) from the dataset for limiting
the search space.
- Stores temporal constraints for use during pattern mining.
Raises: ValueError: If the input transaction dataset is empty, contains fewer than two transactions, or is not properly formatted. Also raised if temporal constraints are invalid.
search ¶
search(min_support=0.2, max_k=None, backend=None, verbose=None)
Execute the Generalized Sequential Pattern (GSP) mining algorithm.
This method facilitates the discovery of frequent sequential patterns in the input transaction dataset. Patterns are extracted iteratively at each k-sequence level, starting from singleton sequences, until no further frequent patterns can be found.
When temporal constraints (mingap, maxgap, maxspan) are specified during initialization, the algorithm enforces these constraints during pattern matching, allowing for time-aware sequential pattern mining.
Parameters:
min_support (float): Minimum support threshold as a fraction of total transactions.
For example, 0.3 means that a sequence is frequent if it
appears in at least 30% of all transactions.
max_k (Optional[int]): Maximum length of patterns to mine. If None, mines up to max transaction length.
backend (Optional[str]): Backend to use for support counting ('auto', 'python', 'rust', 'gpu').
Note: temporal constraints always use Python backend.
verbose (Optional[bool]): Override instance verbosity setting for this search.
If None, uses the instance's verbose setting.
Returns: List[Dict[Tuple[str, ...], int]]: A list of dictionaries containing frequent patterns at each k-sequence level, with patterns as keys and their support counts as values.
Raises:
ValueError: If the minimum support threshold is not in the range (0.0, 1.0].
Logs: - Information about the algorithm's start, intermediate progress (candidates filtered), and completion. - Status updates for each iteration until the algorithm terminates.
Examples: Basic usage without temporal constraints:
```python
from gsppy.gsp import GSP
transactions = [
["Bread", "Milk"],
["Bread", "Diaper", "Beer", "Eggs"],
["Milk", "Diaper", "Beer", "Coke"],
]
gsp = GSP(transactions)
patterns = gsp.search(min_support=0.3)
```
Usage with temporal constraints (requires timestamped transactions):
```python
from gsppy.gsp import GSP
# Transactions with timestamps (item, timestamp) pairs
# where timestamps can be in any unit (seconds, minutes, hours, days, etc.)
timestamped_transactions = [
[("A", 1), ("B", 3), ("C", 5)], # timestamps: 1, 3, 5
[("A", 2), ("B", 10), ("C", 12)], # timestamps: 2, 10, 12
[("A", 1), ("C", 4)], # timestamps: 1, 4
]
# Find patterns with maxgap of 5 time units between consecutive items
gsp = GSP(timestamped_transactions, maxgap=5)
patterns = gsp.search(min_support=0.5)
# Pattern ("A", "B", "C") won't be found in transaction 2
# because gap between A and B is 8 (exceeds maxgap=5)
```
Acceleration utilities¶
gsppy.accelerate.support_counts ¶
support_counts(transactions, candidates, min_support_abs, batch_size=100, backend=None)
Choose the best available backend for support counting.
Backend selection is controlled by the backend argument when provided,
otherwise by the env var GSPPY_BACKEND:
- "rust": require Rust extension (raise if missing)
- "gpu": try GPU path when available (currently singletons optimized),
fall back to CPU for the rest
- "python": force pure-Python fallback
- otherwise: try Rust first and fall back to Python
Example: Running a search with an explicit backend:
```python
from gsppy.accelerate import support_counts
transactions = [("A", "B"), ("A", "C")]
candidates = [("A",), ("B",), ("A", "B")]
counts = support_counts(transactions, candidates, min_support_abs=1, backend="python")
```