CalibrationCurveDrift
Evaluates changes in probability calibration between reference and monitoring datasets.
Purpose
The Calibration Curve Drift test is designed to assess changes in the model’s probability calibration over time. By comparing calibration curves between reference and monitoring datasets, this test helps identify whether the model’s probability estimates remain reliable in production. This is crucial for understanding if the model’s risk predictions maintain their intended interpretation and whether recalibration might be necessary.
Test Mechanism
This test proceeds by generating calibration curves for both reference and monitoring datasets. For each dataset, it bins the predicted probabilities and calculates the actual fraction of positives within each bin. It then compares these values between datasets to identify significant shifts in calibration. The test quantifies drift as percentage changes in both mean predicted probabilities and actual fractions of positives per bin, providing both visual and numerical assessments of calibration stability.
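Below is a minimal sketch of this mechanism, not the test's actual implementation. It uses scikit-learn's `calibration_curve` to bin predicted probabilities for each dataset and then expresses per-bin drift as a percentage change relative to the reference values; the function name, bin count, and drift threshold are illustrative assumptions.

```python
# Illustrative sketch of calibration drift between reference and monitoring data.
import numpy as np
from sklearn.calibration import calibration_curve

def calibration_drift(y_ref, p_ref, y_mon, p_mon, n_bins=10, threshold_pct=20.0):
    """Compare per-bin calibration and flag bins whose drift exceeds the threshold."""
    # Fraction of positives and mean predicted probability per bin, per dataset.
    frac_ref, mean_ref = calibration_curve(y_ref, p_ref, n_bins=n_bins, strategy="uniform")
    frac_mon, mean_mon = calibration_curve(y_mon, p_mon, n_bins=n_bins, strategy="uniform")

    # calibration_curve drops empty bins, so the two curves can differ in length;
    # a robust implementation would align bins explicitly before comparing.
    n = min(len(frac_ref), len(frac_mon))
    results = []
    for i in range(n):
        # Drift expressed as percentage change relative to the reference value.
        drift_frac = abs(frac_mon[i] - frac_ref[i]) / max(abs(frac_ref[i]), 1e-8) * 100
        drift_mean = abs(mean_mon[i] - mean_ref[i]) / max(abs(mean_ref[i]), 1e-8) * 100
        results.append({
            "bin": i,
            "fraction_positives_drift_pct": drift_frac,
            "mean_predicted_drift_pct": drift_mean,
            "drifted": drift_frac > threshold_pct or drift_mean > threshold_pct,
        })
    return results
```

In practice, the per-bin records would feed both a table of drift percentages and an overlaid plot of the two calibration curves, giving the numerical and visual assessments described above.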
Signs of High Risk
- Large differences between reference and monitoring calibration curves
- Systematic overestimation or underestimation in the monitoring dataset
- Significant drift percentages exceeding the threshold in multiple bins
- Changes in calibration concentrated in specific probability ranges
- Inconsistent drift patterns across the probability spectrum
- Empty or sparse bins indicating insufficient data for reliable comparison
Strengths
- Provides visual and quantitative assessment of calibration changes
- Identifies specific probability ranges where calibration has shifted
- Enables early detection of systematic prediction biases
- Includes detailed bin-by-bin comparison of calibration metrics
- Handles edge cases with insufficient data in certain bins
- Supports both binary and probabilistic interpretation of results
Limitations
- Requires sufficient data in each probability bin for reliable comparison
- Sensitive to choice of number of bins and binning strategy
- May not capture complex changes in probability distributions
- Cannot directly suggest recalibration parameters
- Limited to assessing probability calibration aspects
- Results may be affected by class imbalance changes