API Reference¶
GSP¶
gsppy.gsp.GSP ¶
Generalized Sequential Pattern (GSP) Algorithm.
The GSP algorithm finds frequent sequential patterns in transactional datasets, based on a user-defined minimum support threshold. This implementation relies on candidate generation, batch processing, and multiprocessing for efficiency.
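The level-wise idea behind GSP can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper names (`is_subsequence`, `count_support`, `gsp_sketch`) are hypothetical, items are assumed to match in order with gaps allowed, and a pattern is assumed to be frequent when its count is at least `min_support * len(transactions)`. The library's matching and threshold rules may differ.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def is_subsequence(pattern: Pattern, transaction: List[str]) -> bool:
    """Return True if the pattern's items occur in the transaction in order (gaps allowed)."""
    remaining = iter(transaction)
    # `item in remaining` consumes the iterator, which enforces the ordering.
    return all(item in remaining for item in pattern)


def count_support(candidates: List[Pattern], transactions: List[List[str]]) -> Dict[Pattern, int]:
    """Count, for each candidate, how many transactions contain it."""
    return {c: sum(is_subsequence(c, t) for t in transactions) for c in candidates}


def gsp_sketch(transactions: List[List[str]], min_support: float) -> List[Dict[Pattern, int]]:
    """Return one dict of frequent patterns per level k = 1, 2, ..."""
    if not 0 < min_support <= 1:
        raise ValueError("min_support must be in the interval (0, 1]")
    min_count = min_support * len(transactions)  # assumed threshold rule
    items = sorted({item for t in transactions for item in t})
    candidates: List[Pattern] = [(item,) for item in items]
    levels: List[Dict[Pattern, int]] = []
    while candidates:
        counts = count_support(candidates, transactions)
        frequent = {p: c for p, c in counts.items() if c >= min_count}
        if not frequent:
            break
        levels.append(frequent)
        prev = list(frequent)
        # Join step: p and q combine when p without its first item equals
        # q without its last item, producing a pattern one item longer.
        candidates = [p + (q[-1],) for p in prev for q in prev if p[1:] == q[:-1]]
    return levels


levels = gsp_sketch([["A", "B", "C"], ["A", "C"], ["B", "C"]], min_support=0.6)
```

Each entry of `levels` has the same shape as one element of `freq_patterns`: a dictionary mapping a pattern tuple to its support count.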
Attributes:
freq_patterns (List[Dict[Tuple, int]]): Stores discovered frequent sequential patterns at each k-sequence level as dictionaries mapping patterns to their support counts.
transactions (List[Tuple]): Preprocessed dataset where each transaction is represented as a tuple of items.
unique_candidates (List[Tuple]): List of initial singleton candidates (1-item sequences).
max_size (int): Length of the longest transaction in the dataset, used to set the maximum k-sequence for pattern generation.
__init__ ¶
__init__(
raw_transactions,
mingap=None,
maxgap=None,
maxspan=None,
verbose=False,
pruning_strategy=None,
transaction_col=None,
item_col=None,
timestamp_col=None,
sequence_col=None,
)
Initialize the GSP algorithm with raw transactional data.
Parameters:
raw_transactions (Union[List[List[str]], List[List[Tuple[str, float]]], DataFrame]): Input transaction dataset. Accepts:
- A list of transactions where each transaction is a list of items (e.g., [['A', 'B'], ['B', 'C', 'D']])
- A list of transactions with timestamps (e.g., [[('A', 1.0), ('B', 2.0)]])
- A Polars or Pandas DataFrame (requires 'gsppy[dataframe]' installation)
When using DataFrames, you must specify either:
- `sequence_col`: Column containing complete sequences (list format)
- `transaction_col` and `item_col`: Columns for grouped format
mingap (Optional[float]): Minimum time gap required between consecutive items in patterns.
maxgap (Optional[float]): Maximum time gap allowed between consecutive items in patterns.
maxspan (Optional[float]): Maximum time span from first to last item in patterns.
verbose (bool): Enable verbose logging output with detailed progress information.
Default is False (minimal output).
pruning_strategy (Optional[PruningStrategy]): Custom pruning strategy for candidate filtering.
If None, a default strategy is created based on
temporal constraints.
transaction_col (Optional[str]): DataFrame only - column name for transaction IDs (grouped format).
item_col (Optional[str]): DataFrame only - column name for items (grouped format).
timestamp_col (Optional[str]): DataFrame only - column name for timestamps.
sequence_col (Optional[str]): DataFrame only - column name containing sequences (sequence format).
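As an illustration of what mingap, maxgap, and maxspan constrain, the sketch below checks a single matched occurrence given the timestamps of its items in order. The function name `satisfies_constraints` is hypothetical, and treating the bounds as inclusive is an assumption; the library may handle boundaries differently.

```python
from typing import List, Optional


def satisfies_constraints(
    timestamps: List[float],
    mingap: Optional[float] = None,
    maxgap: Optional[float] = None,
    maxspan: Optional[float] = None,
) -> bool:
    """Check one occurrence of a pattern against optional temporal constraints."""
    # Time differences between consecutive matched items.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if mingap is not None and any(gap < mingap for gap in gaps):
        return False
    if maxgap is not None and any(gap > maxgap for gap in gaps):
        return False
    # Span from the first matched item to the last one.
    if maxspan is not None and timestamps and timestamps[-1] - timestamps[0] > maxspan:
        return False
    return True


# Items matched at t = 1.0, 2.0, and 5.0: gaps are 1.0 and 3.0, span is 4.0.
ok_without_limits = satisfies_constraints([1.0, 2.0, 5.0])
ok_with_maxgap = satisfies_constraints([1.0, 2.0, 5.0], maxgap=2.0)
```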
Attributes Initialized:
- Processes the input raw transaction dataset.
- Computes unique singleton candidates (unique_candidates).
- Extracts the maximum transaction size (max_size) from the dataset for limiting
the search space.
- Stores temporal constraints for use during pattern mining.
Raises: ValueError: If the input transaction dataset is empty, contains fewer than two transactions, or is not properly formatted. Also raised if temporal constraints are invalid.
Examples:
Basic usage with lists:
```python
from gsppy.gsp import GSP
transactions = [["A", "B"], ["B", "C", "D"]]
gsp = GSP(transactions)
patterns = gsp.search(min_support=0.5)
```
Using Polars DataFrame (grouped format):
```python
import polars as pl
from gsppy.gsp import GSP
df = pl.DataFrame(
{
"transaction_id": [1, 1, 2, 2, 2],
"item": ["A", "B", "A", "C", "D"],
}
)
gsp = GSP(df, transaction_col="transaction_id", item_col="item")
patterns = gsp.search(min_support=0.5)
```
Using Pandas DataFrame (sequence format):
```python
import pandas as pd
from gsppy.gsp import GSP
df = pd.DataFrame({"sequence": [["A", "B"], ["A", "C", "D"]]})
gsp = GSP(df, sequence_col="sequence")
patterns = gsp.search(min_support=0.5)
```
search ¶
search(
min_support: float = 0.2,
max_k: Optional[int] = None,
backend: Optional[str] = None,
verbose: Optional[bool] = None,
*,
return_sequences: Literal[False] = False,
preprocess_fn: Optional[Callable[[Any], Any]] = None,
postprocess_fn: Optional[Callable[[Any], Any]] = None,
candidate_filter_fn: Optional[
Callable[
[Tuple[str, ...], int, Dict[str, Any]], bool
]
] = None,
) -> List[Dict[Tuple[str, ...], int]]
search(
min_support: float = 0.2,
max_k: Optional[int] = None,
backend: Optional[str] = None,
verbose: Optional[bool] = None,
*,
return_sequences: Literal[True],
preprocess_fn: Optional[Callable[[Any], Any]] = None,
postprocess_fn: Optional[Callable[[Any], Any]] = None,
candidate_filter_fn: Optional[
Callable[
[Tuple[str, ...], int, Dict[str, Any]], bool
]
] = None,
) -> List[List[Sequence]]
search(
min_support=0.2,
max_k=None,
backend=None,
verbose=None,
*,
return_sequences=False,
preprocess_fn=None,
postprocess_fn=None,
candidate_filter_fn=None,
)
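The `candidate_filter_fn` type in the signatures above is `Callable[[Tuple[str, ...], int, Dict[str, Any]], bool]`. A callable of that shape might look like the sketch below. The argument names and their meaning (the pattern, its support count, and a context dictionary) are assumptions inferred from the type alone, and `keep_short_patterns` is a hypothetical name.

```python
from typing import Any, Dict, Tuple


def keep_short_patterns(pattern: Tuple[str, ...], support: int, context: Dict[str, Any]) -> bool:
    """Keep a candidate only if it has at most three items and a nonzero support count."""
    # `context` is accepted to match the expected signature; it is unused here
    # because its contents are not documented in this reference.
    return len(pattern) <= 3 and support > 0


kept = keep_short_patterns(("A", "B"), 2, {})
dropped = keep_short_patterns(("A", "B", "C", "D"), 2, {})
```

Such a function would be passed by keyword, for example `gsp.search(min_support=0.5, candidate_filter_fn=keep_short_patterns)`.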
Sequence¶
gsppy.sequence.Sequence dataclass ¶
Represents a sequential pattern with associated metadata.
This class encapsulates a pattern (a sequence of items) along with its support count, transaction indices, and optional temporal metadata. The class is immutable and hashable, so instances can be used as dictionary keys while providing a richer interface than bare tuples.
Attributes:
items (Tuple[str, ...]): The pattern elements as an immutable tuple.
support (int): The support count (number of transactions containing this pattern). Defaults to 0 for candidate sequences not yet evaluated.
transaction_indices (Optional[Tuple[int, ...]]): Indices of transactions that contain this pattern. Optional as it may not always be tracked to save memory.
metadata (Optional[dict]): Additional metadata such as timestamps, confidence, lift, or other pattern-specific information.
Examples:
Create a simple sequence:
>>> seq = Sequence(items=("A", "B", "C"), support=5)
>>> seq.length
3
>>> seq.items
('A', 'B', 'C')
Create from tuple for backward compatibility:
>>> seq = Sequence.from_tuple(("A", "B"))
>>> seq.items
('A', 'B')
Use as dictionary key:
>>> patterns = {seq: 10}
>>> seq in patterns
True
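The combination of immutability and hashability described above can be sketched with a frozen dataclass. `PatternSketch` is a hypothetical stand-in written for this example, not the library's class; excluding `metadata` from comparison and hashing is an assumption made here because dictionaries are unhashable.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class PatternSketch:
    """Hypothetical stand-in for Sequence, used only to illustrate the design."""

    items: Tuple[str, ...]
    support: int = 0
    transaction_indices: Optional[Tuple[int, ...]] = None
    # compare=False keeps the dict out of __eq__ and __hash__.
    metadata: Optional[dict] = field(default=None, compare=False)

    @property
    def length(self) -> int:
        return len(self.items)


seq = PatternSketch(items=("A", "B"), support=5)
lookup = {seq: "seen"}  # hashable, so usable as a dictionary key

try:
    seq.support = 9  # frozen: attribute assignment is rejected
except FrozenInstanceError:
    mutated = False
else:
    mutated = True
```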
from_tuple classmethod ¶
from_tuple(
items,
support=0,
transaction_indices=None,
metadata=None,
)
Create a Sequence from a tuple of items.
This is a convenience method for backward compatibility with code that uses plain tuples to represent patterns.
Parameters:
items (Tuple[str, ...]): The pattern elements.
support (int): The support count. Defaults to 0.
transaction_indices (Optional[Tuple[int, ...]]): Transaction indices.
metadata (Optional[dict]): Additional metadata.
Returns: Sequence: A new Sequence instance.
Examples:
>>> seq = Sequence.from_tuple(("A", "B", "C"), support=5)
>>> seq.items
('A', 'B', 'C')
>>> seq.support
5
from_item classmethod ¶
from_item(item, support=0)
Create a singleton Sequence from a single item.
Parameters:
item (str): The single item.
support (int): The support count. Defaults to 0.
Returns: Sequence: A new Sequence instance containing only the item.
Examples:
>>> seq = Sequence.from_item("A", support=10)
>>> seq.items
('A',)
>>> seq.length
1
extend ¶
extend(item, support=0)
Create a new Sequence by extending this one with an additional item.
This is used during candidate generation to create k+1 sequences from k sequences.
Parameters:
item (str): The item to append.
support (int): The support count for the new sequence. Defaults to 0.
Returns: Sequence: A new Sequence with the item appended.
Examples:
>>> seq = Sequence.from_tuple(("A", "B"))
>>> new_seq = seq.extend("C")
>>> new_seq.items
('A', 'B', 'C')
with_support ¶
with_support(support, transaction_indices=None)
Create a new Sequence with updated support information.
This is used after calculating support to update the sequence with its actual support count and optionally transaction indices.
Parameters:
support (int): The new support count.
transaction_indices (Optional[Tuple[int, ...]]): Transaction indices.
Returns: Sequence: A new Sequence with updated support information.
Examples:
>>> seq = Sequence.from_tuple(("A", "B"))
>>> supported_seq = seq.with_support(5, (0, 2, 4))
>>> supported_seq.support
5
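Because instances are immutable, extend and with_support return new objects instead of modifying the original. One way to get that behaviour is `dataclasses.replace`, sketched below on a hypothetical `PatternSketch` class. This is an illustration of the pattern under that assumption, not the library's code.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class PatternSketch:
    """Hypothetical stand-in for Sequence with two copy-on-update methods."""

    items: Tuple[str, ...]
    support: int = 0
    transaction_indices: Optional[Tuple[int, ...]] = None

    def extend(self, item: str, support: int = 0) -> "PatternSketch":
        # A longer candidate starts fresh: its support has not been counted yet.
        return PatternSketch(self.items + (item,), support)

    def with_support(
        self, support: int, transaction_indices: Optional[Tuple[int, ...]] = None
    ) -> "PatternSketch":
        # replace() copies the instance and overrides only the named fields.
        return replace(self, support=support, transaction_indices=transaction_indices)


base = PatternSketch(("A", "B"))
longer = base.extend("C")                # candidate of length k + 1
counted = longer.with_support(2, (0, 3))  # same items, support filled in
```

The original `base` and `longer` objects are left unchanged, so they remain safe to use as dictionary keys while counting proceeds.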
with_metadata ¶
with_metadata(**kwargs)
Create a new Sequence with additional or updated metadata.
Parameters: **kwargs: Metadata key-value pairs to add or update.
Returns: Sequence: A new Sequence with updated metadata.
Examples:
>>> seq = Sequence.from_tuple(("A", "B"), support=5)
>>> seq_with_meta = seq.with_metadata(confidence=0.75, lift=1.2)
>>> seq_with_meta.metadata
as_tuple ¶
as_tuple()
Return the pattern as a plain tuple for backward compatibility.
Returns: Tuple[str, ...]: The sequence items as a tuple.
Sequence Utilities¶
gsppy.sequence.sequences_to_dict ¶
sequences_to_dict(sequences)
Convert a list of Sequence objects to a dictionary mapping tuples to support counts.
This function provides backward compatibility with code expecting the traditional Dict[Tuple[str, ...], int] format.
Parameters: sequences (List[Sequence]): List of Sequence objects.
Returns: dict[Tuple[str, ...], int]: Dictionary mapping pattern tuples to support counts.
Examples:
>>> seqs = [Sequence(("A",), 5), Sequence(("B",), 3)]
>>> sequences_to_dict(seqs)
gsppy.sequence.dict_to_sequences ¶
dict_to_sequences(pattern_dict)
Convert a dictionary of patterns to a list of Sequence objects.
This function converts the traditional Dict[Tuple[str, ...], int] format to Sequence objects.
Parameters: pattern_dict (dict[Tuple[str, ...], int]): Dictionary mapping tuples to support.
Returns: List[Sequence]: List of Sequence objects.
Examples:
>>> patterns = {("A",): 5, ("B",): 3}
>>> seqs = dict_to_sequences(patterns)
>>> len(seqs)
2
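The two conversions are inverses of each other as far as items and support counts are concerned. The sketch below shows that round trip with hypothetical stand-ins (`PatternSketch`, `to_dict`, `from_dict`) so that it runs without the library installed; it assumes that only items and support take part in the conversion.

```python
from typing import Dict, List, NamedTuple, Tuple


class PatternSketch(NamedTuple):
    """Hypothetical stand-in for Sequence, holding only items and support."""

    items: Tuple[str, ...]
    support: int = 0


def to_dict(sequences: List[PatternSketch]) -> Dict[Tuple[str, ...], int]:
    """Map each pattern tuple to its support count."""
    return {s.items: s.support for s in sequences}


def from_dict(pattern_dict: Dict[Tuple[str, ...], int]) -> List[PatternSketch]:
    """Build one object per (pattern, support) entry."""
    return [PatternSketch(items, support) for items, support in pattern_dict.items()]


patterns = {("A",): 5, ("A", "B"): 3}
round_tripped = to_dict(from_dict(patterns))
```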
gsppy.sequence.to_sequence ¶
to_sequence(obj, support=0)
Convert various input types to a Sequence object.
Parameters:
obj: Input object (Sequence, tuple, or string).
support: Support count to use if creating a new Sequence.
Returns: Sequence: A Sequence object.
Examples:
>>> to_sequence(("A", "B"), support=5)
Sequence(items=('A', 'B'), support=5)
>>> seq = Sequence(("X",), 3)
>>> to_sequence(seq)
Sequence(items=('X',), support=3)
Acceleration utilities¶
gsppy.accelerate.support_counts ¶
support_counts(
transactions,
candidates,
min_support_abs,
batch_size=100,
backend=None,
)
Choose the best available backend for support counting.
Backend selection is controlled by the backend argument when provided,
otherwise by the environment variable GSPPY_BACKEND:
- "rust": require the Rust extension (raises if it is missing)
- "gpu": try the GPU path when available (currently optimized for singletons only),
and fall back to CPU for the rest
- "python": force the pure-Python fallback
- otherwise: try Rust first and fall back to Python
Example: Counting support with an explicit backend:
```python
from gsppy.accelerate import support_counts
transactions = [("A", "B"), ("A", "C")]
candidates = [("A",), ("B",), ("A", "B")]
counts = support_counts(transactions, candidates, min_support_abs=1, backend="python")
```
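The precedence described above (explicit argument first, then the GSPPY_BACKEND environment variable, then automatic selection) can be sketched as a small resolver. The function name `resolve_backend`, the `"auto"` label for the default case, and the case-insensitive normalization are assumptions made for illustration; the library's internal logic is not shown in this reference.

```python
import os
from typing import Optional


def resolve_backend(backend: Optional[str] = None) -> str:
    """Return the backend name to use, following argument > env var > default."""
    # "auto" is a placeholder for the "try Rust first, fall back to Python" case.
    choice = backend or os.environ.get("GSPPY_BACKEND") or "auto"
    return choice.strip().lower()


os.environ["GSPPY_BACKEND"] = "rust"
from_env = resolve_backend()               # the environment variable is used
from_argument = resolve_backend("python")  # the explicit argument wins
os.environ.pop("GSPPY_BACKEND", None)
default_choice = resolve_backend()         # neither is set
```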