Evaluation¶

Coleman includes classic metrics from test case prioritization (TCP) and information retrieval (IR):

NAPFD and NAPFD-Verdict
APFDc
Precision@k
Recall@k
AveragePrecision@k (AP@k)
ReciprocalRank@k (RR@k, equivalent to MRR for a single ranked suite)
NDCG@k (normalized discounted cumulative gain at top-k)

Literature references:

Rothermel, G.; Untch, R. H.; Chu, C.; Harrold, M. J. (2001). Prioritizing test cases for regression testing. IEEE TSE.
Elbaum, S.; Malishevsky, A. G.; Rothermel, G. (2002). Test case prioritization: a family of empirical studies. IEEE TSE.
Catal, C.; Mishra, D.; Tufekci, S.; Cagiltay, N. E. (2011). Mapping study of software test case prioritization. Software Quality Journal.
Manning, C. D.; Raghavan, P.; Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Baeza-Yates, R.; Ribeiro-Neto, B. (2011). Modern Information Retrieval (2nd ed.). Addison-Wesley.
Thakur, N.; Reimers, N.; Rücklé, A.; et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS Datasets and Benchmarks.
Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. (2023). MTEB: Massive Text Embedding Benchmark. EACL.

coleman.evaluation ¶

coleman.evaluation - Evaluation Metrics for Coleman.

This module provides classes and methods for evaluating how well Coleman prioritizes test cases. Metrics include NAPFD (Normalized Average Percentage of Faults Detected), computed from either error counts or test verdicts.

Classes:

Name	Description
`EvaluationMetric`	Base class for all evaluation metrics. Defines basic attributes and methods used across all metrics.
`NAPFDMetric`	Implements the NAPFD metric based on error counts.
`NAPFDVerdictMetric`	Implements the NAPFD metric based on test verdicts (e.g., pass/fail).

Notes

The evaluate method in EvaluationMetric is abstract and must be overridden in child classes. Call the reset method at the beginning of each evaluation so that metric values from a previous evaluation do not carry over.

APFDcMetric ¶

Bases: EvaluationMetric

APFDc (cost-aware Average Percentage of Faults Detected) Metric.

Builds on the NAPFD computation and exposes the cost-aware fault detection score as a metric in its own right.
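For orientation, the cost-aware APFD as commonly defined in the test case prioritization literature (with uniform fault severities) can be sketched as follows. The function name `apfdc` and its inputs are illustrative assumptions, not Coleman's API, and Coleman's internal computation may differ in detail.

```python
def apfdc(costs, first_detection_ranks):
    """Cost-aware APFD with uniform fault severities (illustrative sketch).

    costs: execution cost of each test, in prioritized order.
    first_detection_ranks: for each fault, the 1-based rank of the first
        test that reveals it.
    """
    total_cost = sum(costs)
    fault_count = len(first_detection_ranks)
    if fault_count == 0 or total_cost == 0:
        # Assumption: this sketch leaves the degenerate cases undefined.
        raise ValueError("APFDc needs at least one fault and a non-zero total cost")

    numerator = 0.0
    for rank in first_detection_ranks:
        # Cost from the detecting test to the end of the suite,
        # counting only half of the detecting test's own cost.
        numerator += sum(costs[rank - 1:]) - 0.5 * costs[rank - 1]
    return numerator / (total_cost * fault_count)
```

With four tests of equal cost, a single fault found by the first test scores 0.875, while the same fault found by the last test scores 0.125.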

__init__ ¶

__init__()

Initialize the APFDcMetric.

__str__ ¶

__str__()

Return a string representation of the metric.

Returns:

Type	Description
`str`	The metric name.

compute_metrics ¶

compute_metrics(
    costs,
    total_failure_count,
    total_failed_tests,
    no_testcases,
)

Compute APFDc metric (cost-aware faults detected).

Parameters:

Name	Type	Description	Default
`costs`	`list`	A list containing the costs (e.g., execution time) for each test case.	required
`total_failure_count`	`int`	Total number of failures detected across all test cases.	required
`total_failed_tests`	`int`	Total number of test cases that failed.	required
`no_testcases`	`int`	Total number of test cases in the test suite.	required

Notes

This method updates the instance's attributes directly and does not return any value.

evaluate ¶

evaluate(test_suite)

Evaluate the test suite using the APFDc metric.

Parameters:

Name	Type	Description	Default
`test_suite`	`list of dict`	Test suite to evaluate.	required

AveragePrecisionAtKMetric ¶

Bases: _TopKVerdictMetric

Average Precision at k (AP@k) for failure detection.

AP@k averages precision values observed at each failing rank up to k:

.. math:: AP@k = \frac{1}{\min(F, k)} \sum_{i=1}^{k} P(i) \cdot rel(i)

where P(i) is the precision at rank i, rel(i) is 1 when the test at rank i fails and 0 otherwise, and F is the total number of failing tests.
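A minimal sketch of this formula, assuming the ranking is given as a list of binary verdicts (1 = fail, 0 = pass). The function name `average_precision_at_k` and the choice to return 0 when there are no failures are assumptions of this example, not confirmed details of Coleman's implementation.

```python
def average_precision_at_k(verdicts, k):
    """AP@k for a ranked list of binary verdicts (1 = fail, 0 = pass)."""
    total_failures = sum(verdicts)  # F, counted over the whole ranking
    if total_failures == 0 or k <= 0:
        return 0.0  # assumption: no failures (or an empty prefix) scores 0

    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(verdicts[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # P(i), taken only at failing ranks
    return precision_sum / min(total_failures, k)
```

For the ranking `[1, 0, 1, 0, 0]` with k=3, the failing ranks are 1 and 3, so the result is (1/1 + 2/3) / 2, roughly 0.833.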

EvaluationMetric ¶

Base class for evaluation metrics.

Attributes:

Name	Type	Description
`available_time`	`float`	The time available for test execution.
`scheduled_testcases`	`list`	Test cases that were scheduled for execution.
`unscheduled_testcases`	`list`	Test cases that were not scheduled.
`detection_ranks`	`list`	Ranks at which failures were detected.
`detection_ranks_time`	`list`	Durations of failure-detecting test cases.
`detection_ranks_failures`	`list`	Failure counts at each detection rank.
`ttf`	`int`	Time to fail: rank of the first failing test case.
`ttf_duration`	`float`	Time spent until the first test case fails.
`fitness`	`float`	APFD or NAPFD value.
`cost`	`float`	APFDc value.
`detected_failures`	`int`	Number of detected failures.
`undetected_failures`	`int`	Number of undetected failures.
`recall`	`float`	Recall metric value.
`avg_precision`	`float`	Average precision metric value.

__init__ ¶

__init__()

Initialize the EvaluationMetric.

evaluate ¶

evaluate(test_suite)

Evaluate the test suite.

This is an abstract method and must be implemented in child classes.

Parameters:

Name	Type	Description	Default
`test_suite`	`list of dict`	Test suite to evaluate.	required

Raises:

Type	Description
`NotImplementedError`	If not implemented in a child class.

process_test_suite ¶

process_test_suite(test_suite, error_key)

Process the test suite and return the costs and total failure count.

Parameters:

Name	Type	Description	Default
`test_suite`	`list of dict`	Test suite to process.	required
`error_key`	`str`	Key to determine the error in the test suite.	required

Returns:

Name	Type	Description
`costs`	`list`	List of durations for each test case.
`total_failure_count`	`int`	Total number of failures detected.
`total_failed_tests`	`int`	Total number of test cases that failed.
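The three return values can be illustrated with a standalone sketch. Everything named here is hypothetical: the helper `summarize_suite`, the `"Duration"` and `"NumErrors"` keys, and the assumption that the value under `error_key` is a non-negative count. The real method may also update instance attributes, which this sketch does not attempt.

```python
def summarize_suite(test_suite, error_key, duration_key="Duration"):
    """Collect per-test costs and failure totals (hypothetical helper)."""
    costs = []
    total_failure_count = 0
    total_failed_tests = 0
    for test in test_suite:
        costs.append(test[duration_key])
        errors = test[error_key]
        if errors > 0:
            total_failure_count += errors  # every failure counts
            total_failed_tests += 1        # each failing test counts once
    return costs, total_failure_count, total_failed_tests


# Hypothetical suite: the second and third tests fail.
suite = [
    {"Duration": 2.0, "NumErrors": 0},
    {"Duration": 1.5, "NumErrors": 3},
    {"Duration": 0.5, "NumErrors": 1},
]
summary = summarize_suite(suite, "NumErrors")
```

Here `summary` is `([2.0, 1.5, 0.5], 4, 2)`: four failures spread across two failing test cases.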

reset ¶

reset()

Reset all attributes to their default values.

set_default_metrics ¶

set_default_metrics()

Set the default values for NAPFD and APFDc metrics.

This method is called when there are no detected failures in the test suite, ensuring that the metric attributes are appropriately initialized.

Notes

This method updates the instance's attributes directly and does not return any value.

update_available_time ¶

update_available_time(available_time)

Update the available time for the metric.

Parameters:

Name	Type	Description	Default
`available_time`	`float`	Time available for test execution.	required

update_budget ¶

update_budget(budget_mode, budget_value)

Update budget semantics used while selecting scheduled testcases.

NAPFDMetric ¶

Bases: EvaluationMetric

Normalized Average Percentage of Faults Detected (NAPFD) Metric.

Based on error counts.
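As a point of reference, NAPFD as usually defined in the literature is p - (sum of detection ranks) / (M * N) + p / (2N), where p is the fraction of failures detected, M the total number of failures, and N the number of test cases. The sketch below follows that definition and counts each detected failure once. The function name `napfd` is an assumption of this example, and Coleman's handling of error counts and of suites without failures may differ.

```python
def napfd(detection_ranks, total_failures, n_tests):
    """NAPFD from the literature definition (illustrative sketch).

    detection_ranks: 1-based rank at which each detected failure was found.
    total_failures: all failures, detected or not.
    n_tests: number of test cases in the suite.
    """
    if total_failures == 0 or n_tests == 0:
        # Assumption: this sketch leaves the degenerate cases undefined.
        raise ValueError("NAPFD needs at least one failure and one test case")

    p = len(detection_ranks) / total_failures  # fraction of failures detected
    rank_term = sum(detection_ranks) / (total_failures * n_tests)
    return p - rank_term + p / (2 * n_tests)
```

With four tests and two failures found at ranks 1 and 2, the result is 0.75. If only the failure at rank 1 is detected, p drops to 0.5 and the result is 0.4375.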

__str__ ¶

__str__()

Return a string representation of the metric.

Returns:

Type	Description
`str`	The metric name.

compute_metrics ¶

compute_metrics(
    costs,
    total_failure_count,
    total_failed_tests,
    no_testcases,
)

Compute NAPFD and APFDc metrics.

Parameters:

Name	Type	Description	Default
`costs`	`list`	A list containing the costs (e.g., execution time) for each test case.	required
`total_failure_count`	`int`	Total number of failures detected across all test cases.	required
`total_failed_tests`	`int`	Total number of test cases that failed.	required
`no_testcases`	`int`	Total number of test cases in the test suite.	required

Notes

This method updates the instance's attributes directly and does not return any value.

evaluate ¶

evaluate(test_suite)

Evaluate the test suite using the NAPFD metric.

Parameters:

Name	Type	Description	Default
`test_suite`	`list of dict`	Test suite to evaluate.	required

NAPFDVerdictMetric ¶

Bases: EvaluationMetric

Normalized Average Percentage of Faults Detected (NAPFD) Metric based on Verdict.

__str__ ¶

__str__()

Return a string representation of the metric.

Returns:

Type	Description
`str`	The metric name.

compute_metrics ¶

compute_metrics(costs, total_failure_count, no_testcases)

Compute NAPFD and APFDc metrics based on test verdicts.

Parameters:

Name	Type	Description	Default
`costs`	`list`	A list containing the costs (e.g., execution time) for each test case.	required
`total_failure_count`	`int`	Total number of test cases that failed.	required
`no_testcases`	`int`	Total number of test cases in the test suite.	required

Notes

This method updates the instance's attributes directly and does not return any value.

evaluate ¶

evaluate(test_suite)

Evaluate the test suite using the NAPFD Verdict metric.

Parameters:

Name	Type	Description	Default
`test_suite`	`list of dict`	Test suite to evaluate.	required

NDCGAtKMetric ¶

Bases: _TopKVerdictMetric

Normalized Discounted Cumulative Gain at k (nDCG@k).

With binary relevance (fail = 1, pass = 0), nDCG@k is:

.. math:: nDCG@k = \frac{\sum_{i=1}^{k} rel(i)/\log_2(i+1)}{\sum_{i=1}^{m} 1/\log_2(i+1)}

where m = min(F, k) and F is the total number of failing tests.
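The formula can be sketched directly for a list of binary verdicts in ranked order. The function name `ndcg_at_k` and the choice to return 0 when there are no failures are assumptions of this example.

```python
import math


def ndcg_at_k(verdicts, k):
    """nDCG@k with binary relevance (1 = fail, 0 = pass)."""
    total_failures = sum(verdicts)  # F
    m = min(total_failures, k)
    if m <= 0:
        return 0.0  # assumption: no failures (or an empty prefix) scores 0

    # Discounted gain of the actual ranking over the top-k prefix.
    dcg = sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(verdicts[:k], start=1)
    )
    # Ideal ranking: all m reachable failures placed at the top.
    ideal_dcg = sum(1 / math.log2(rank + 1) for rank in range(1, m + 1))
    return dcg / ideal_dcg
```

A ranking that places every failure first, such as `[1, 1, 0]` with k=2, scores 1.0. Moving a single failure from rank 1 to rank 2 lowers the score to 1/log2(3), roughly 0.631.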

PrecisionAtKMetric ¶

Bases: _TopKVerdictMetric

Precision@k for test-failure detection.

Defined as the fraction of failing tests among the first k selected tests. Example: with k=6 and 6 failures in the first 6 positions, Precision@k is 1.0 (100%).
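A minimal sketch of this definition, assuming binary verdicts (1 = fail, 0 = pass) in ranked order and a suite with at least k tests. The function name `precision_at_k` is an assumption of this example.

```python
def precision_at_k(verdicts, k):
    """Fraction of failing tests among the first k ranked tests."""
    if k <= 0:
        return 0.0  # assumption: an empty prefix scores 0
    # Assumes len(verdicts) >= k, so the prefix really holds k tests.
    return sum(verdicts[:k]) / k
```

Six failures in the first six positions give `precision_at_k([1, 1, 1, 1, 1, 1, 0, 0], 6) == 1.0`, matching the example above. Three failures among the first six give 0.5.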

RecallAtKMetric ¶

Bases: _TopKVerdictMetric

Recall@k for test-failure detection.

Defined as the fraction of all failing tests that were found within the first k selected tests.
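A minimal sketch of this definition, assuming binary verdicts (1 = fail, 0 = pass) in ranked order. The function name `recall_at_k` and the choice to return 0 when the suite has no failures are assumptions of this example.

```python
def recall_at_k(verdicts, k):
    """Fraction of all failing tests found within the first k ranked tests."""
    total_failures = sum(verdicts)
    if total_failures == 0:
        return 0.0  # assumption: a suite without failures scores 0
    found = sum(verdicts[:max(k, 0)])  # failures inside the top-k prefix
    return found / total_failures
```

For the ranking `[1, 0, 1, 1]`, only one of the three failures sits in the first two positions, so Recall@2 is 1/3. Widening the prefix to k=4 recovers all of them and gives 1.0.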

ReciprocalRankAtKMetric ¶

Bases: _TopKVerdictMetric

Reciprocal Rank at k (RR@k).

Returns the reciprocal of the first failing rank within the top-k prefix:

.. math:: RR@k = \begin{cases} 1/r_1, & \text{if a failure appears in top-k at rank } r_1 \\ 0, & \text{otherwise} \end{cases}

For one prioritized suite, RR@k is equivalent to MRR@k.
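The piecewise definition can be sketched for a list of binary verdicts (1 = fail, 0 = pass) in ranked order. The function name `reciprocal_rank_at_k` is an assumption of this example.

```python
def reciprocal_rank_at_k(verdicts, k):
    """Reciprocal of the first failing rank within the top-k prefix."""
    for rank, rel in enumerate(verdicts[:max(k, 0)], start=1):
        if rel:
            return 1.0 / rank  # first failure found at this rank
    return 0.0  # no failure inside the top-k prefix
```

For the ranking `[0, 0, 1, 0]`, the first failure is at rank 3, so RR@3 is 1/3, while RR@2 is 0 because the prefix contains no failure.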
