Evaluation
Coleman includes classic metrics from test case prioritization (TCP) and information retrieval (IR):
- NAPFD and NAPFD-Verdict
- APFDc
- Precision@k
- Recall@k
- AveragePrecision@k (AP@k)
- ReciprocalRank@k (RR@k, equivalent to MRR for a single ranked suite)
- NDCG@k (normalized discounted cumulative gain at top-k)
Literature references:
- Rothermel, G.; Untch, R. H.; Chu, C.; Harrold, M. J. (2001). Prioritizing test cases for regression testing. IEEE TSE.
- Elbaum, S.; Malishevsky, A. G.; Rothermel, G. (2002). Test case prioritization: a family of empirical studies. IEEE TSE.
- Catal, C.; Mishra, D.; Tufekci, S.; Cagiltay, N. E. (2011). Mapping study of software test case prioritization. Software Quality Journal.
- Manning, C. D.; Raghavan, P.; Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
- Baeza-Yates, R.; Ribeiro-Neto, B. (2011). Modern Information Retrieval (2nd ed.). Addison-Wesley.
- Thakur, N.; Reimers, N.; Rücklé, A.; et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS Datasets and Benchmarks.
- Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. (2023). MTEB: Massive Text Embedding Benchmark. EACL.
coleman.evaluation
coleman.evaluation - Evaluation Metrics for Coleman.
This module provides classes and methods for evaluating the Coleman framework in the context of test case prioritization. Metrics such as NAPFD (Normalized Average Percentage of Faults Detected), computed from either error counts or test verdicts, measure how effective a prioritization is.
Classes:
| Name | Description |
|---|---|
| EvaluationMetric | Base class for all evaluation metrics. Defines basic attributes and methods used across all metrics. |
| NAPFDMetric | Implements the NAPFD metric based on error counts. |
| NAPFDVerdictMetric | Implements the NAPFD metric based on test verdicts (e.g., pass/fail). |
Notes
The evaluate method in EvaluationMetric is abstract and should be overridden in
child classes. Ensure that the reset method is called at the beginning of each
evaluation to reset metric values.
APFDcMetric
Bases: EvaluationMetric
APFDc (cost-aware Average Percentage of Faults Detected) metric.
Extends NAPFD with explicit exposure of cost-aware fault detection scoring.
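To illustrate what cost-aware scoring means, the sketch below computes APFDc with the standard cost-cognizant formula under two simplifying assumptions: every fault has equal severity, and each failing test reveals exactly one fault. The function name `apfdc` and its arguments are hypothetical and are not part of Coleman's API; the library's own computation may differ in detail.

```python
from typing import Sequence


def apfdc(costs: Sequence[float], detection_ranks: Sequence[int]) -> float:
    """Illustrative APFDc with unit fault severities (not Coleman's code).

    costs[i] is the cost (e.g., duration) of the test at rank i + 1.
    detection_ranks holds the 1-based ranks of the failing tests.
    """
    total_cost = sum(costs)
    n_faults = len(detection_ranks)
    if n_faults == 0 or total_cost <= 0:
        raise ValueError("APFDc needs at least one fault and a positive total cost")

    acc = 0.0
    for rank in detection_ranks:
        # Cost from the detecting test to the end of the suite,
        # minus half the cost of the detecting test itself.
        remaining = sum(costs[rank - 1:])
        acc += remaining - 0.5 * costs[rank - 1]
    return acc / (total_cost * n_faults)
```

With four tests of equal cost, a single failure at rank 1 scores much higher than the same failure at rank 4, which is the behavior a cost-aware metric is meant to reward.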
__str__
Return a string representation of the metric.
Returns:
| Type | Description |
|---|---|
| str | The metric name. |
compute_metrics
Compute APFDc metric (cost-aware faults detected).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| costs | list | A list containing the costs (e.g., execution time) for each test case. | required |
| total_failure_count | int | Total number of failures detected across all test cases. | required |
| total_failed_tests | int | Total number of test cases that failed. | required |
| no_testcases | int | Total number of test cases in the test suite. | required |
Notes
This method updates the instance's attributes directly and does not return any value.
evaluate
Evaluate the test suite using the APFDc metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| test_suite | list of dict | Test suite to evaluate. | required |
AveragePrecisionAtKMetric
Bases: _TopKVerdictMetric
Average Precision at k (AP@k) for failure detection.
AP@k averages precision values observed at each failing rank up to k:
.. math:: AP@k = \frac{1}{\min(F, k)} \sum_{i=1}^{k} P(i) \cdot rel(i)
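where P(i) is the precision over the first i ranks, rel(i) is 1 when the test at rank i fails and 0 otherwise, and F is the total number of failing tests.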
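The formula above can be sketched as a small standalone function. This is a minimal illustration, assuming the ranked suite is given as a list of binary verdicts (1 = fail, 0 = pass); the name `average_precision_at_k` is hypothetical and is not Coleman's API. Returning 0.0 when there are no failures is also an assumption.

```python
from typing import Sequence


def average_precision_at_k(verdicts: Sequence[int], k: int) -> float:
    """AP@k over a ranked list of binary verdicts (1 = fail, 0 = pass)."""
    total_failures = sum(verdicts)
    if total_failures == 0 or k <= 0:
        return 0.0  # assumption: no failures (or empty prefix) scores zero

    hits = 0
    score = 0.0
    for rank, rel in enumerate(verdicts[:k], start=1):
        if rel:
            hits += 1
            score += hits / rank  # P(rank), counted only at failing ranks
    return score / min(total_failures, k)
```

For the ranking `[1, 0, 1, 0]` with k = 3, precision is 1/1 at rank 1 and 2/3 at rank 3, so AP@3 = (1 + 2/3) / min(2, 3) = 5/6.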
EvaluationMetric
Base class for evaluation metrics.
Attributes:
| Name | Type | Description |
|---|---|---|
| available_time | float | The time available for test execution. |
| scheduled_testcases | list | Test cases that were scheduled for execution. |
| unscheduled_testcases | list | Test cases that were not scheduled. |
| detection_ranks | list | Ranks at which failures were detected. |
| detection_ranks_time | list | Durations of failure-detecting test cases. |
| detection_ranks_failures | list | Failure counts at each detection rank. |
| ttf | int | Time to Fail (rank value). |
| ttf_duration | float | Time spent until the first test case fails. |
| fitness | float | APFD or NAPFD value. |
| cost | float | APFDc value. |
| detected_failures | int | Number of detected failures. |
| undetected_failures | int | Number of undetected failures. |
| recall | float | Recall metric value. |
| avg_precision | float | Average precision metric value. |
evaluate
Evaluate the test suite.
This is an abstract method and must be implemented in child classes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| test_suite | list of dict | Test suite to evaluate. | required |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If not implemented in a child class. |
process_test_suite
Process the test suite and return the costs, the total failure count, and the number of failed test cases.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| test_suite | list of dict | Test suite to process. | required |
| error_key | str | Key to determine the error in the test suite. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| costs | list | List of durations for each test case. |
| total_failure_count | int | Total number of failures detected. |
| total_failed_tests | int | Total number of test cases that failed. |
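The three return values can be pictured with a standalone sketch. It is not Coleman's implementation: the function name `summarize_suite`, the `"duration"` cost key, and the `"errors"` key used in the example are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def summarize_suite(
    test_suite: Sequence[Dict[str, float]],
    error_key: str,
    cost_key: str = "duration",  # hypothetical key name for the test cost
) -> Tuple[List[float], int, int]:
    """Return (costs, total_failure_count, total_failed_tests)."""
    costs = [float(tc[cost_key]) for tc in test_suite]
    # Sum of the error values across all test cases.
    total_failure_count = sum(int(tc[error_key]) for tc in test_suite)
    # Number of test cases with at least one error.
    total_failed_tests = sum(1 for tc in test_suite if tc[error_key] > 0)
    return costs, total_failure_count, total_failed_tests


suite = [
    {"duration": 2.0, "errors": 0},
    {"duration": 1.5, "errors": 3},
    {"duration": 0.5, "errors": 1},
]
costs, failures, failed_tests = summarize_suite(suite, error_key="errors")
```

Here two test cases fail but together report four errors, which is why the failure count and the failed-test count are tracked separately.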
set_default_metrics
Set the default values for NAPFD and APFDc metrics.
This method is called when there are no detected failures in the test suite, ensuring that the metric attributes are appropriately initialized.
Notes
This method updates the instance's attributes directly and does not return any value.
update_available_time
Update the available time for the metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| available_time | float | Time available for the metric. | required |
update_budget
Update the budget semantics used while selecting scheduled test cases.
NAPFDMetric
Bases: EvaluationMetric
Normalized Average Percentage of Faults Detected (NAPFD) Metric.
Based on error counts.
__str__
Return a string representation of the metric.
Returns:
| Type | Description |
|---|---|
| str | The metric name. |
compute_metrics
Compute NAPFD and APFDc metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| costs | list | A list containing the costs (e.g., execution time) for each test case. | required |
| total_failure_count | int | Total number of failures detected across all test cases. | required |
| total_failed_tests | int | Total number of test cases that failed. | required |
| no_testcases | int | Total number of test cases in the test suite. | required |
Notes
This method updates the instance's attributes directly and does not return any value.
evaluate
Evaluate the test suite using the NAPFD metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| test_suite | list of dict | Test suite to evaluate. | required |
NAPFDVerdictMetric
Bases: EvaluationMetric
Normalized Average Percentage of Faults Detected (NAPFD) Metric based on Verdict.
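For orientation, the sketch below applies the commonly used NAPFD definition to verdict data, where each failing test counts as one fault: NAPFD = p - (sum of detection ranks) / (m * n) + p / (2n), with p the fraction of faults detected, m the total number of faults, and n the number of test cases. The function name `napfd` and its arguments are hypothetical; Coleman's own computation, including its handling of the no-failure case, may differ.

```python
from typing import Sequence


def napfd(detection_ranks: Sequence[int], total_failures: int, n_tests: int) -> float:
    """Illustrative verdict-based NAPFD (not Coleman's code).

    detection_ranks: 1-based ranks of the failing tests that were scheduled.
    total_failures:  number of failing tests in the whole suite.
    n_tests:         number of test cases in the suite.
    """
    if total_failures <= 0 or n_tests <= 0:
        raise ValueError("NAPFD needs at least one failure and one test case")

    # Fraction of failures actually detected by the scheduled prefix.
    p = len(detection_ranks) / total_failures
    return p - sum(detection_ranks) / (total_failures * n_tests) + p / (2 * n_tests)
```

With five tests and two failures found at ranks 1 and 2, the score is 1 - 3/10 + 1/10 = 0.8. If only the failure at rank 1 is scheduled, p drops to 0.5 and the score falls to 0.45.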
__str__
Return a string representation of the metric.
Returns:
| Type | Description |
|---|---|
| str | The metric name. |
compute_metrics
Compute NAPFD and APFDc metrics based on test verdicts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| costs | list | A list containing the costs (e.g., execution time) for each test case. | required |
| total_failure_count | int | Total number of test cases that failed. | required |
| no_testcases | int | Total number of test cases in the test suite. | required |
Notes
This method updates the instance's attributes directly and does not return any value.
evaluate
Evaluate the test suite using the NAPFD Verdict metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| test_suite | list of dict | Test suite to evaluate. | required |
NDCGAtKMetric
Bases: _TopKVerdictMetric
Normalized Discounted Cumulative Gain at k (nDCG@k).
With binary relevance (fail = 1, pass = 0), nDCG@k is:
.. math:: nDCG@k = \frac{\sum_{i=1}^{k} rel(i)/\log_2(i+1)}{\sum_{i=1}^{m} 1/\log_2(i+1)}
where m = min(F, k) and F is the total number of failing tests.
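A minimal standalone sketch of this formula, assuming the ranked suite is a list of binary verdicts (1 = fail, 0 = pass). The name `ndcg_at_k` is hypothetical and is not Coleman's API, and returning 0.0 when no test fails is an assumption.

```python
import math
from typing import Sequence


def ndcg_at_k(verdicts: Sequence[int], k: int) -> float:
    """nDCG@k with binary relevance over a ranked list of verdicts."""
    total_failures = sum(verdicts)
    if total_failures == 0 or k <= 0:
        return 0.0  # assumption: undefined cases score zero

    # Discounted gain of the actual ranking, truncated at k.
    dcg = sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(verdicts[:k], start=1)
    )
    # Ideal ranking places all failures first: m = min(F, k) relevant ranks.
    m = min(total_failures, k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, m + 1))
    return dcg / idcg
```

A ranking that puts every failure at the top reaches 1.0; pushing the only failure from rank 1 to rank 2 lowers the score to 1/log2(3).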
PrecisionAtKMetric
Bases: _TopKVerdictMetric
Precision@k for test-failure detection.
Defined as the fraction of failing tests among the first k selected
tests. Example: with k=6 and 6 failures in the first 6 positions,
Precision@k is 1.0 (100%).
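The definition can be sketched as follows, assuming a ranked list of binary verdicts (1 = fail, 0 = pass). The name `precision_at_k` is hypothetical, and dividing by k even when the suite has fewer than k tests is an assumption rather than documented Coleman behavior.

```python
from typing import Sequence


def precision_at_k(verdicts: Sequence[int], k: int) -> float:
    """Fraction of failing tests among the first k ranked tests."""
    if k <= 0:
        return 0.0
    # Assumption: the denominator is k, not the length of the prefix.
    return sum(verdicts[:k]) / k
```

For example, `precision_at_k([1, 1, 1, 1, 1, 1, 0, 0], 6)` covers six failures in the first six positions, and `precision_at_k([1, 0, 0, 1], 2)` covers one failure in two.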
RecallAtKMetric
Bases: _TopKVerdictMetric
Recall@k for test-failure detection.
Defined as the fraction of all failing tests that were found within the
first k selected tests.
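As a minimal sketch of this definition, assuming a ranked list of binary verdicts (1 = fail, 0 = pass): the name `recall_at_k` is hypothetical, and returning 0.0 when the suite contains no failures is an assumption.

```python
from typing import Sequence


def recall_at_k(verdicts: Sequence[int], k: int) -> float:
    """Fraction of all failing tests found within the first k ranked tests."""
    total_failures = sum(verdicts)
    if total_failures == 0 or k <= 0:
        return 0.0  # assumption: no failures (or empty prefix) scores zero
    return sum(verdicts[:k]) / total_failures
```

With verdicts `[1, 0, 1, 0, 1]` and k = 3, two of the three failures fall inside the prefix, so Recall@3 is 2/3.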
ReciprocalRankAtKMetric
Bases: _TopKVerdictMetric
Reciprocal Rank at k (RR@k).
Returns the reciprocal of the first failing rank within the top-k prefix:
.. math:: RR@k = \begin{cases} 1/r_1, & \text{if a failure appears in top-k at rank } r_1 \\ 0, & \text{otherwise} \end{cases}
For one prioritized suite, RR@k is equivalent to MRR@k.
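A short sketch of the piecewise definition, assuming a ranked list of binary verdicts (1 = fail, 0 = pass). The name `reciprocal_rank_at_k` is hypothetical and is not Coleman's API.

```python
from typing import Sequence


def reciprocal_rank_at_k(verdicts: Sequence[int], k: int) -> float:
    """Reciprocal of the first failing rank within the top-k prefix."""
    for rank, rel in enumerate(verdicts[:max(k, 0)], start=1):
        if rel:
            return 1.0 / rank  # first failure found at this rank
    return 0.0  # no failure inside the top-k prefix
```

With verdicts `[0, 0, 1, 0]`, the first failure sits at rank 3, so RR@3 is 1/3 while RR@2 is 0.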