Evaluation

Coleman includes classic metrics from test case prioritization (TCP) and information retrieval (IR):

  • NAPFD and NAPFD-Verdict
  • APFDc
  • Precision@k
  • Recall@k
  • AveragePrecision@k (AP@k)
  • ReciprocalRank@k (RR@k, equivalent to MRR for a single ranked suite)
  • NDCG@k (normalized discounted cumulative gain at top-k)

Literature references:

  • Rothermel, G.; Untch, R. H.; Chu, C.; Harrold, M. J. (2001). Prioritizing test cases for regression testing. IEEE TSE.
  • Elbaum, S.; Malishevsky, A. G.; Rothermel, G. (2002). Test case prioritization: a family of empirical studies. IEEE TSE.
  • Catal, C.; Mishra, D.; Tufekci, S.; Cagiltay, N. E. (2011). Mapping study of software test case prioritization. Software Quality Journal.
  • Manning, C. D.; Raghavan, P.; Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
  • Baeza-Yates, R.; Ribeiro-Neto, B. (2011). Modern Information Retrieval (2nd ed.). Addison-Wesley.
  • Thakur, N.; Reimers, N.; Rücklé, A.; et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS Datasets and Benchmarks.
  • Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. (2023). MTEB: Massive Text Embedding Benchmark. EACL.

coleman.evaluation

coleman.evaluation - Evaluation Metrics for Coleman.

This module provides classes and methods for evaluating the Coleman framework on test case prioritization. Effectiveness can be measured with metrics such as NAPFD (Normalized Average Percentage of Faults Detected), computed either from error counts or from test verdicts.

Classes:

Name Description
EvaluationMetric

Base class for all evaluation metrics. Defines basic attributes and methods used across all metrics.

NAPFDMetric

Implements the NAPFD metric based on error counts.

NAPFDVerdictMetric

Implements the NAPFD metric based on test verdicts (e.g., pass/fail).

Notes

The evaluate method in EvaluationMetric is abstract and should be overridden in child classes. Ensure that the reset method is called at the beginning of each evaluation to reset metric values.
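The override-and-reset contract described in the note can be sketched as follows. This is a minimal illustration, not Coleman's code: `EvaluationMetricSketch`, `FirstFailureMetricSketch`, and the `"failed"` key are hypothetical stand-ins, and only two of the documented attributes are modeled.

```python
class EvaluationMetricSketch:
    """Stand-in for the base class (hypothetical, not Coleman's implementation)."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Only two of the documented attributes are modeled in this sketch.
        self.fitness = 0.0
        self.detection_ranks = []

    def evaluate(self, test_suite):
        raise NotImplementedError("evaluate must be implemented in a child class")


class FirstFailureMetricSketch(EvaluationMetricSketch):
    """Hypothetical child metric: reciprocal rank of the first failing test."""

    def evaluate(self, test_suite):
        self.reset()  # clear values left over from a previous evaluation
        for rank, test_case in enumerate(test_suite, start=1):
            if test_case["failed"]:  # "failed" is an assumed key
                self.detection_ranks.append(rank)
        if self.detection_ranks:
            self.fitness = 1.0 / self.detection_ranks[0]


metric = FirstFailureMetricSketch()
metric.evaluate([{"failed": False}, {"failed": True}])
first_fitness = metric.fitness
metric.evaluate([{"failed": False}, {"failed": False}])
second_fitness = metric.fitness
```

Without the `reset` call at the top of `evaluate`, the second evaluation would still report the detection rank and fitness of the first one.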

APFDcMetric

Bases: EvaluationMetric

APFDc (cost-cognizant Average Percentage of Faults Detected) metric.

Builds on the NAPFD computation and explicitly exposes the cost-aware fault detection score.

__init__

__init__()

Initialize the APFDcMetric.

__str__

__str__()

Return a string representation of the metric.

Returns:

Type Description
str

The metric name.

compute_metrics

compute_metrics(
    costs,
    total_failure_count,
    total_failed_tests,
    no_testcases,
)

Compute APFDc metric (cost-aware faults detected).

Parameters:

Name Type Description Default
costs list

A list containing the costs (e.g., execution time) for each test case.

required
total_failure_count int

Total number of failures detected across all test cases.

required
total_failed_tests int

Total number of test cases that failed.

required
no_testcases int

Total number of test cases in the test suite.

required
Notes

This method updates the instance's attributes directly and does not return any value.
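For orientation, the textbook cost-cognizant APFD with unit fault severities can be sketched as below. The function name `apfdc_sketch`, its argument layout, and the choice to return 0.0 when there is nothing to score are assumptions for illustration; the method documented above stores its result on the instance instead of returning it, and may treat edge cases differently.

```python
def apfdc_sketch(costs, detection_ranks):
    """Cost-cognizant APFD, assuming every fault has the same severity.

    costs: execution cost of each test case, in prioritized order.
    detection_ranks: 1-based rank of the test that first reveals each fault.
    """
    total_cost = sum(costs)
    if not detection_ranks or total_cost == 0:
        return 0.0  # assumed default when there is nothing to score
    score = 0.0
    for rank in detection_ranks:
        # Cost from the detecting test to the end of the suite,
        # minus half of the detecting test's own cost.
        score += sum(costs[rank - 1:]) - 0.5 * costs[rank - 1]
    return score / (total_cost * len(detection_ranks))


early = apfdc_sketch([2.0, 1.0, 1.0], [1])  # fault revealed by the first test
late = apfdc_sketch([2.0, 1.0, 1.0], [3])   # fault revealed by the last test
```

In this example `early` is 0.75 and `late` is 0.125: finding the same fault after spending less of the total cost yields a higher score.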

evaluate

evaluate(test_suite)

Evaluate the test suite using the APFDc metric.

Parameters:

Name Type Description Default
test_suite list of dict

Test suite to evaluate.

required

AveragePrecisionAtKMetric

Bases: _TopKVerdictMetric

Average Precision at k (AP@k) for failure detection.

AP@k averages precision values observed at each failing rank up to k:

.. math:: AP@k = \frac{1}{\min(F, k)} \sum_{i=1}^{k} P(i) \cdot rel(i)

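where rel(i) is 1 when the test at rank i fails and 0 otherwise, P(i) is the precision over the first i ranks, and F is the total number of failing tests.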
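The formula can be traced with a small sketch. It assumes the ranked suite is given as a list of 0/1 verdicts (1 = failing) and that F is counted over the whole suite; `average_precision_at_k` is an illustrative helper, not the class's API.

```python
def average_precision_at_k(verdicts, k):
    """AP@k over a ranked list of binary verdicts (1 = failing test)."""
    total_failures = sum(verdicts)  # F, counted over the whole suite (assumption)
    if total_failures == 0 or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for i, rel in enumerate(verdicts[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / i  # P(i) at a failing rank
    return precision_sum / min(total_failures, k)


ap = average_precision_at_k([1, 0, 1, 0, 1], k=3)
```

Here failures sit at ranks 1 and 3 within the top 3, so the sum is 1 + 2/3, and dividing by min(3, 3) gives 5/9.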

EvaluationMetric

Base class for evaluation metrics.

Attributes:

Name Type Description
available_time float

The time available for test execution.

scheduled_testcases list

Test cases that were scheduled for execution.

unscheduled_testcases list

Test cases that were not scheduled.

detection_ranks list

Ranks at which failures were detected.

detection_ranks_time list

Durations of failure-detecting test cases.

detection_ranks_failures list

Failure counts at each detection rank.

ttf int

Time to first failure, expressed as the rank of the first failing test case.

ttf_duration float

Time spent until the first test case fails.

fitness float

APFD or NAPFD value.

cost float

APFDc value.

detected_failures int

Number of detected failures.

undetected_failures int

Number of undetected failures.

recall float

Recall metric value.

avg_precision float

Average precision metric value.

__init__

__init__()

Initialize the EvaluationMetric.

evaluate

evaluate(test_suite)

Evaluate the test suite.

This is an abstract method and must be implemented in child classes.

Parameters:

Name Type Description Default
test_suite list of dict

Test suite to evaluate.

required

Raises:

Type Description
NotImplementedError

If not implemented in a child class.

process_test_suite

process_test_suite(test_suite, error_key)

Process the test suite and return the costs, the total failure count, and the number of failed test cases.

Parameters:

Name Type Description Default
test_suite list of dict

Test suite to process.

required
error_key str

Key in each test case dict that holds the error information.

required

Returns:

Name Type Description
costs list

List of durations for each test case.

total_failure_count int

Total number of failures detected.

total_failed_tests int

Total number of test cases that failed.
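One way to picture the three return values is the sketch below. The dict layout is an assumption: `duration_key` and the sample keys `"duration"` and `"errors"` are hypothetical, and the real method may read different fields or apply budget handling that is not shown here.

```python
def process_test_suite_sketch(test_suite, error_key, duration_key="duration"):
    """Collect per-test costs and failure totals from a list of dicts."""
    costs = []
    total_failure_count = 0
    total_failed_tests = 0
    for test_case in test_suite:
        costs.append(test_case[duration_key])  # duration_key is an assumed field
        errors = int(test_case[error_key])
        if errors > 0:
            total_failure_count += errors  # all failures revealed by this test
            total_failed_tests += 1        # the test itself counts once
    return costs, total_failure_count, total_failed_tests


suite = [
    {"duration": 1.5, "errors": 0},
    {"duration": 0.5, "errors": 3},
    {"duration": 2.0, "errors": 1},
]
costs, failures, failed_tests = process_test_suite_sketch(suite, "errors")
```

The distinction between the last two values matters for error-count metrics: a single failing test can contribute several failures.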

reset

reset()

Reset all attributes to their default values.

set_default_metrics

set_default_metrics()

Set the default values for NAPFD and APFDc metrics.

This method is called when there are no detected failures in the test suite, ensuring that the metric attributes are appropriately initialized.

Notes

This method updates the instance's attributes directly and does not return any value.

update_available_time

update_available_time(available_time)

Update the available time for the metric.

Parameters:

Name Type Description Default
available_time float

Time available for test execution.

required

update_budget

update_budget(budget_mode, budget_value)

Update the budget semantics used when selecting which test cases are scheduled.

NAPFDMetric

Bases: EvaluationMetric

Normalized Average Percentage of Faults Detected (NAPFD) Metric.

Based on error counts.

__str__

__str__()

Return a string representation of the metric.

Returns:

Type Description
str

The metric name.

compute_metrics

compute_metrics(
    costs,
    total_failure_count,
    total_failed_tests,
    no_testcases,
)

Compute NAPFD and APFDc metrics.

Parameters:

Name Type Description Default
costs list

A list containing the costs (e.g., execution time) for each test case.

required
total_failure_count int

Total number of failures detected across all test cases.

required
total_failed_tests int

Total number of test cases that failed.

required
no_testcases int

Total number of test cases in the test suite.

required
Notes

This method updates the instance's attributes directly and does not return any value.
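As a reference point, the standard NAPFD definition is p - (sum of detection ranks) / (m * n) + p / (2n), where m is the total number of failures, n the number of test cases, and p the fraction of failures detected. The sketch below implements that definition only; `napfd_sketch`, its arguments, and the 0.0 default are assumptions, and the documented method writes to instance attributes instead of returning a value.

```python
def napfd_sketch(detection_ranks, total_failure_count, no_testcases):
    """Standard NAPFD; detection_ranks holds one 1-based rank per detected failure."""
    if total_failure_count == 0 or no_testcases == 0:
        return 0.0  # assumed default when there is nothing to score
    p = len(detection_ranks) / total_failure_count  # fraction of failures detected
    return (
        p
        - sum(detection_ranks) / (total_failure_count * no_testcases)
        + p / (2 * no_testcases)
    )


all_detected = napfd_sketch([1, 3], total_failure_count=2, no_testcases=5)
half_detected = napfd_sketch([1], total_failure_count=2, no_testcases=5)
```

With both failures detected at ranks 1 and 3, the score is 1 - 4/10 + 1/10 = 0.7. If the time budget cuts the schedule before the second failure, p drops to 0.5 and the score to 0.45.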

evaluate

evaluate(test_suite)

Evaluate the test suite using the NAPFD metric.

Parameters:

Name Type Description Default
test_suite list of dict

Test suite to evaluate.

required

NAPFDVerdictMetric

Bases: EvaluationMetric

Normalized Average Percentage of Faults Detected (NAPFD) Metric based on Verdict.

__str__

__str__()

Return a string representation of the metric.

Returns:

Type Description
str

The metric name.

compute_metrics

compute_metrics(costs, total_failure_count, no_testcases)

Compute NAPFD and APFDc metrics based on test verdicts.

Parameters:

Name Type Description Default
costs list

A list containing the costs (e.g., execution time) for each test case.

required
total_failure_count int

Total number of test cases that failed.

required
no_testcases int

Total number of test cases in the test suite.

required
Notes

This method updates the instance's attributes directly and does not return any value.

evaluate

evaluate(test_suite)

Evaluate the test suite using the NAPFD Verdict metric.

Parameters:

Name Type Description Default
test_suite list of dict

Test suite to evaluate.

required

NDCGAtKMetric

Bases: _TopKVerdictMetric

Normalized Discounted Cumulative Gain at k (nDCG@k).

With binary relevance (fail = 1, pass = 0), nDCG@k is:

.. math:: nDCG@k = \frac{\sum_{i=1}^{k} rel(i)/\log_2(i+1)}{\sum_{i=1}^{m} 1/\log_2(i+1)}

where m = min(F, k) and F is the total number of failing tests.
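A direct transcription of this formula, assuming the ranked suite is given as a list of 0/1 verdicts, might look like the following. `ndcg_at_k` is an illustrative helper and not the class's API.

```python
import math


def ndcg_at_k(verdicts, k):
    """nDCG@k with binary relevance (1 = failing test, 0 = passing test)."""
    total_failures = sum(verdicts)  # F
    m = min(total_failures, k)
    if m <= 0:
        return 0.0
    # DCG: discounted gain of the failures actually found in the top k.
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(verdicts[:k], start=1))
    # Ideal DCG: all m failures placed at ranks 1..m.
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, m + 1))
    return dcg / idcg


ideal = ndcg_at_k([1, 1, 0], k=3)    # failures already at the top
shifted = ndcg_at_k([0, 1, 1], k=3)  # same failures, one rank later
```

The ideal ordering scores 1.0, and pushing both failures down by one rank lowers the score because each gain is divided by a larger logarithm.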

PrecisionAtKMetric

Bases: _TopKVerdictMetric

Precision@k for test-failure detection.

Defined as the fraction of failing tests among the first k selected tests. Example: with k=6 and 6 failures in the first 6 positions, Precision@k is 1.0 (100%).
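The definition reduces to a single division, as in the sketch below. It assumes a list of 0/1 verdicts in ranked order and divides by k even when fewer than k tests were selected; the actual class may handle short prefixes differently.

```python
def precision_at_k(verdicts, k):
    """Fraction of the first k selected tests that fail (1 = failing)."""
    if k <= 0:
        return 0.0
    return sum(verdicts[:k]) / k  # denominator is k by assumption


perfect = precision_at_k([1, 1, 1, 1, 1, 1, 0, 0], k=6)
mixed = precision_at_k([1, 0, 1, 0], k=2)
```

`perfect` reproduces the example in the text (six failures in the first six positions), while `mixed` has one failure in the top two and evaluates to 0.5.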

RecallAtKMetric

Bases: _TopKVerdictMetric

Recall@k for test-failure detection.

Defined as the fraction of all failing tests that were found within the first k selected tests.
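A minimal sketch of this definition, assuming a list of 0/1 verdicts in ranked order and a score of 0.0 for a suite without failures (both are illustrative choices, not the class's documented behavior):

```python
def recall_at_k(verdicts, k):
    """Fraction of all failing tests that appear in the first k positions."""
    total_failures = sum(verdicts)
    if total_failures == 0:
        return 0.0  # assumed default when the suite has no failures
    return sum(verdicts[:max(k, 0)]) / total_failures


recall = recall_at_k([1, 0, 1, 0, 1], k=3)
```

Two of the three failing tests fall inside the top 3 here, so `recall` is 2/3. Unlike Precision@k, the denominator does not depend on k.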

ReciprocalRankAtKMetric

Bases: _TopKVerdictMetric

Reciprocal Rank at k (RR@k).

Returns the reciprocal of the first failing rank within the top-k prefix:

.. math:: RR@k = \begin{cases} 1/r_1, & \text{if a failure appears in top-k at rank } r_1 \\ 0, & \text{otherwise} \end{cases}

For one prioritized suite, RR@k is equivalent to MRR@k.
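The case distinction can be sketched as a single scan over the top-k prefix. The helper below assumes a list of 0/1 verdicts in ranked order and is illustrative only.

```python
def reciprocal_rank_at_k(verdicts, k):
    """1 / rank of the first failing test within the top k, else 0."""
    for rank, rel in enumerate(verdicts[:max(k, 0)], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


found = reciprocal_rank_at_k([0, 0, 1, 1], k=3)   # first failure at rank 3
missed = reciprocal_rank_at_k([0, 0, 1, 1], k=2)  # no failure in the top 2
```

Only the first failure matters: `found` is 1/3 even though a second failure follows, and `missed` is 0.0 because the prefix ends before any failure appears.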