ValidMind for model validation 3 — Developing a potential challenger model

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this third notebook, develop a potential challenger model and then pass your model and its predictions to ValidMind.

A challenger model is an alternate model that attempts to outperform the champion model, ensuring that the best performing fit-for-purpose model is always considered for deployment. Challenger models also help avoid over-reliance on a single model, and allow testing of new features, algorithms, or data sources without disrupting the production lifecycle.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals.

Prerequisites

In order to develop potential challenger models with this notebook, you'll need to first have:

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be quite familiar to you, as we performed the same actions in the previous notebook, 2 — Start the model validation process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In the left sidebar that appears for your model, select Getting Started, then select Validation from the DOCUMENT drop-down menu.
  2. Click Copy snippet to clipboard.
  3. Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:
# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    document="validation-report",
)
Note: you may need to restart the kernel to use updated packages.
2026-04-07 23:09:11,469 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load the sample Bank Customer Churn Prediction dataset that was used to develop the champion model, so that we can independently preprocess it:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

Preprocess the dataset

We’ll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As you know, before we can run tests we first need to initialize a ValidMind dataset object with the init_dataset function:

# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and utilize the output to help us identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity. The results table presents the top ten strongest absolute correlations, listing the feature pairs, their Pearson correlation coefficients, and a Pass/Fail status based on a threshold of 0.3. Only one feature pair exceeds the threshold, while the remaining pairs show lower correlation values and pass the test criteria.

Key insights:

  • One feature pair exceeds correlation threshold: The pair (Age, Exited) has a correlation coefficient of 0.3361, resulting in a Fail status as it surpasses the 0.3 threshold.
  • All other feature pairs pass threshold: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.1983 to 0.0351, all below the 0.3 threshold and marked as Pass.
  • No evidence of widespread multicollinearity: Only a single pair demonstrates a correlation above the threshold, with no clusters of high correlations among other features.

The results indicate that the dataset exhibits generally low linear correlations among most feature pairs, with only the (Age, Exited) pair exceeding the specified threshold. This suggests limited risk of feature redundancy or multicollinearity, as the majority of features maintain sufficient independence. The isolated higher correlation between Age and Exited warrants awareness but does not indicate systemic correlation issues within the dataset.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3361 Fail
(Balance, NumOfProducts) -0.1983 Pass
(IsActiveMember, Exited) -0.1966 Pass
(Balance, Exited) 0.1445 Pass
(NumOfProducts, Exited) -0.0542 Pass
(NumOfProducts, IsActiveMember) 0.0418 Pass
(Age, NumOfProducts) -0.0398 Pass
(Age, Balance) 0.0396 Pass
(Balance, HasCrCard) -0.0379 Pass
(CreditScore, Exited) -0.0351 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3361 Fail
1 (Balance, NumOfProducts) -0.1983 Pass
2 (IsActiveMember, Exited) -0.1966 Pass
3 (Balance, Exited) 0.1445 Pass
4 (NumOfProducts, Exited) -0.0542 Pass
5 (NumOfProducts, IsActiveMember) 0.0418 Pass
6 (Age, NumOfProducts) -0.0398 Pass
7 (Age, Balance) 0.0396 Pass
8 (Balance, HasCrCard) -0.0379 Pass
9 (CreditScore, Exited) -0.0351 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
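
The one-liner above assumes each failing pair contributes a single feature name. If several pairs were to fail, a slightly more defensive sketch could extract the first feature of each pair while dropping duplicates. The pair strings below are hypothetical examples, not output from our actual test run:

```python
# Hypothetical failing pairs -- illustrative only, not from our actual test run
failing_pairs = ["(Age, Exited)", "(Age, Balance)", "(Balance, NumOfProducts)"]

# Keep the first feature of each pair, dropping duplicates while preserving order
features_to_drop = list(dict.fromkeys(
    pair.strip("()").split(",")[0].strip() for pair in failing_pairs
))

print(features_to_drop)  # ['Age', 'Balance']
```

Using `dict.fromkeys()` keeps the extraction order stable, so the resulting drop list is deterministic.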

We can then re-initialize the dataset with a different input_id and the highly correlated features removed and re-run the test for confirmation:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity within the dataset. The results table presents the top ten absolute Pearson correlation coefficients, along with the corresponding feature pairs and Pass/Fail status based on a threshold of 0.3. All reported coefficients are below the threshold, and each feature pair is marked as Pass.

Key insights:

  • No feature pairs exceed correlation threshold: All absolute Pearson correlation coefficients are below the 0.3 threshold, with the highest magnitude observed at 0.1983 between Balance and NumOfProducts.
  • Consistently low linear relationships: The top ten feature pairs display coefficients ranging from 0.0322 to 0.1983, indicating weak linear associations across the dataset.
  • Uniform Pass status across all pairs: Every evaluated feature pair is marked as Pass, reflecting the absence of high linear correlations among the top relationships.

The results indicate that the dataset does not exhibit strong linear dependencies between any of the evaluated feature pairs. The absence of high Pearson correlation coefficients suggests minimal risk of feature redundancy or multicollinearity based on linear relationships, supporting the interpretability and stability of subsequent model development.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Balance, NumOfProducts) -0.1983 Pass
(IsActiveMember, Exited) -0.1966 Pass
(Balance, Exited) 0.1445 Pass
(NumOfProducts, Exited) -0.0542 Pass
(NumOfProducts, IsActiveMember) 0.0418 Pass
(Balance, HasCrCard) -0.0379 Pass
(CreditScore, Exited) -0.0351 Pass
(Tenure, IsActiveMember) -0.0350 Pass
(HasCrCard, Exited) -0.0348 Pass
(Tenure, EstimatedSalary) 0.0322 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and the highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
2333 588 8 0.00 1 1 0 61931.21 0 False True True
6464 693 4 130661.96 1 1 1 101918.96 0 False False True
3282 577 10 125389.70 2 1 1 178616.73 0 False False True
2115 742 2 191864.51 1 1 0 108457.99 1 False False False
120 691 5 40915.55 1 1 0 126213.84 1 False False False
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's go ahead and import the champion model submitted by the model development team in the format of a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(

Training a potential challenger model

We're curious how an alternate model compares to our champion model, so let's train a challenger model as a basis for our testing.

Our champion logistic regression model is a simpler, parametric model that assumes a linear relationship between the independent variables and the log-odds of the outcome. While logistic regression may not capture complex patterns as effectively, it offers a high degree of interpretability and is easier to explain to stakeholders. However, model risk is not assessed on the basis of a single factor in isolation, but rather in consideration of trade-offs between predictive performance, ease of interpretability, and overall alignment with business objectives.
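
To make the "linear in the log-odds" assumption concrete, here is a minimal sketch with illustrative coefficients (not taken from the champion model): the log-odds change by a constant amount per unit of the feature, and the sigmoid maps them back to probabilities.

```python
import numpy as np

# Illustrative coefficients -- not from the champion model
beta0, beta1 = -1.0, 0.8

x = np.array([0.0, 1.0, 2.0])
log_odds = beta0 + beta1 * x          # linear in the feature
p = 1.0 / (1.0 + np.exp(-log_odds))   # sigmoid maps log-odds to probability

print(np.round(p, 4).tolist())  # [0.2689, 0.4502, 0.6457]
```

Because each coefficient has this direct log-odds interpretation, stakeholders can read the effect of a feature straight off the fitted model, which is the interpretability advantage referenced above.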

Random forest classification model

A random forest classification model is an ensemble machine learning algorithm that uses multiple decision trees to classify data. In ensemble learning, multiple models are combined to improve prediction accuracy and robustness.

Random forest classification models generally have higher accuracy because they capture complex, non-linear relationships, but as a result they lack transparency in their predictions.

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data for each of our two models.

You simply initialize this model object with vm.init_model():

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

Assign predictions

With our models registered, we'll move on to assigning both the predictive probabilities coming directly from each model, and the binary class predictions obtained by applying a cutoff threshold to those probabilities.

  • The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
  • This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

# Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-04-07 23:09:19,754 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-04-07 23:09:19,757 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-04-07 23:09:19,757 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-04-07 23:09:19,759 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-04-07 23:09:19,761 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-04-07 23:09:19,762 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-04-07 23:09:19,762 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-04-07 23:09:19,763 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-04-07 23:09:19,765 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-04-07 23:09:19,787 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-04-07 23:09:19,788 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-04-07 23:09:19,809 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-04-07 23:09:19,811 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-04-07 23:09:19,822 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-04-07 23:09:19,822 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-04-07 23:09:19,832 - INFO(validmind.vm_models.dataset.utils): Done running predict()
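
The binary predictions computed above follow the standard pattern of thresholding the positive-class probability. A minimal sketch, assuming a 0.5 cutoff and using illustrative probabilities rather than output from our models:

```python
import numpy as np

# Illustrative positive-class probabilities -- not from our actual models
proba = np.array([0.12, 0.55, 0.49, 0.91])

cutoff = 0.5  # assumed cutoff threshold
binary_preds = (proba >= cutoff).astype(int)

print(binary_preds.tolist())  # [0, 1, 0, 1]
```

Keeping both the probabilities and the thresholded predictions linked to the datasets lets later tests evaluate threshold-dependent metrics (accuracy, confusion matrix) alongside threshold-free ones (ROC AUC).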

Running model evaluation tests

With our setup complete, let's run the rest of our validation tests. Since we have already verified the data quality of the dataset used to train our champion model, we will now focus on comprehensive performance evaluations of both the champion and challenger models.

Run model performance tests

Let's run some performance tests, beginning with independent testing of our champion logistic regression model, then moving on to our potential challenger model.

Use vm.tests.list_tests() to identify all the model performance tests for classification:


vm.tests.list_tests(tags=["model_performance"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.CalibrationCurve Calibration Curve Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... True False ['model', 'dataset'] {'n_bins': {'type': 'int', 'default': 10}} ['sklearn', 'model_performance', 'classification'] ['classification']
validmind.model_validation.sklearn.ClassifierPerformance Classifier Performance Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... False True ['dataset', 'model'] {'average': {'type': 'str', 'default': 'macro'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ConfusionMatrix Confusion Matrix Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... True False ['dataset', 'model'] {'threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.HyperParametersTuning Hyper Parameters Tuning Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... False True ['model', 'dataset'] {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': 'Union', 'default': None}, 'thresholds': {'type': 'Union', 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} ['sklearn', 'model_performance'] ['clustering', 'classification']
validmind.model_validation.sklearn.MinimumAccuracy Minimum Accuracy Checks if the model's prediction accuracy meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.7}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1Score Minimum F1 Score Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScore Minimum ROCAUC Score Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ModelsPerformanceComparison Models Performance Comparison Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... False True ['dataset', 'models'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndex Population Stability Index Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... True True ['datasets', 'model'] {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurve Precision Recall Curve Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurve ROC Curve Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrors Regression Errors Assesses the performance and error distribution of a regression model using various error metrics.... False True ['model', 'dataset'] {} ['sklearn', 'model_performance'] ['regression', 'classification']
validmind.model_validation.sklearn.TrainingTestDegradation Training Test Degradation Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... False True ['datasets', 'model'] {'max_threshold': {'type': 'float', 'default': 0.1}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.statsmodels.GINITable GINI Table Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets.... False True ['dataset', 'model'] {} ['model_performance'] ['classification']
validmind.ongoing_monitoring.CalibrationCurveDrift Calibration Curve Drift Evaluates changes in probability calibration between reference and monitoring datasets.... True True ['datasets', 'model'] {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDrift Class Discrimination Drift Compares classification discrimination metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassificationAccuracyDrift Classification Accuracy Drift Compares classification accuracy metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDrift Confusion Matrix Drift Compares confusion matrix metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDrift ROC Curve Drift Compares ROC curves between reference and monitoring datasets.... True False ['datasets', 'model'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']

We'll isolate the specific tests we want to run in a list called mpt:

As we learned in the previous notebook 2 — Start the model validation process, you can use a custom result_id to tag the individual result with a unique identifier by appending this result_id to the test_id with a : separator. We'll append an identifier for our champion model here:

mpt = [
    "validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion",
    "validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion",
    "validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion",
    "validmind.model_validation.sklearn.MinimumF1Score:logreg_champion",
    "validmind.model_validation.sklearn.ROCCurve:logreg_champion"
]

Evaluate performance of the champion model

Now, let's run and log our batch of model performance tests using our testing dataset (vm_test_ds) for our champion model:

  • The test set serves as a proxy for real-world data, providing an unbiased estimate of model performance since it was not used during training or tuning.
  • The test set also acts as protection against selection bias and model tweaking, giving a final, more unbiased checkpoint.
for test in mpt:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_test_ds,
            "model": vm_log_model,
        },
    ).log()

Classifier Performance Logreg Champion

The Classifier Performance test evaluates the predictive performance of the classification model by reporting precision, recall, F1-score, accuracy, and ROC AUC metrics. The results are presented in two tables: one detailing class-wise and aggregate precision, recall, and F1-scores, and another summarizing overall accuracy and ROC AUC. These metrics provide a quantitative assessment of the model's ability to distinguish between classes and its general classification effectiveness.

Key insights:

  • Balanced class-wise performance: Precision, recall, and F1-scores are similar across both classes, with class 0 showing F1 = 0.6586 and class 1 showing F1 = 0.6424, indicating no substantial performance disparity between classes.
  • Consistent aggregate metrics: Weighted and macro averages for precision, recall, and F1-score are closely aligned (all approximately 0.65), reflecting uniform model behavior across the dataset.
  • Moderate overall accuracy: The model achieves an accuracy of 0.6507, indicating that approximately 65% of predictions match the true class labels.
  • ROC AUC indicates moderate separability: The ROC AUC score of 0.7037 suggests the model has moderate ability to distinguish between the two classes.

The results indicate that the model demonstrates consistent and balanced classification performance across both classes, with moderate accuracy and ROC AUC values. The close alignment of class-wise and aggregate metrics suggests uniform predictive behavior, and the ROC AUC score reflects moderate discriminative capability. No significant imbalance or extreme performance deficiencies are observed in the reported metrics.

Tables

Precision, Recall, and F1

Class Precision Recall F1
0 0.6356 0.6834 0.6586
1 0.6678 0.6189 0.6424
Weighted Average 0.6519 0.6507 0.6504
Macro Average 0.6517 0.6511 0.6505

Accuracy and ROC AUC

Metric Value
Accuracy 0.6507
ROC AUC 0.7037
2026-04-07 23:09:26,876 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion does not exist in model's document

Confusion Matrix Logreg Champion

The Confusion Matrix test evaluates the classification performance of the model by comparing predicted and actual class labels, providing a breakdown of True Positives, True Negatives, False Positives, and False Negatives. The resulting matrix visually displays the distribution of correct and incorrect predictions, allowing for assessment of the model's ability to distinguish between classes. The matrix includes counts for each outcome, facilitating identification of areas where the model performs well or may require further analysis.

Key insights:

  • Higher count of correct classifications: The model produced 203 True Positives and 218 True Negatives, indicating a greater number of correct predictions relative to incorrect ones.
  • Notable presence of misclassifications: There are 125 False Negatives and 101 False Positives, reflecting a measurable rate of both types of classification errors.
  • False Negatives exceed False Positives: The number of False Negatives (125) is higher than the number of False Positives (101), suggesting a greater tendency to miss positive cases than to incorrectly flag negatives as positives.

The confusion matrix reveals that the model demonstrates a higher frequency of correct classifications, with True Positives and True Negatives outnumbering misclassifications. However, the presence of both False Negatives and False Positives at notable levels indicates areas where the model's predictive accuracy could be further examined, particularly regarding the higher rate of missed positive cases. The distribution of outcomes provides a clear basis for evaluating model strengths and identifying potential areas for targeted improvement.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion:b6f2
2026-04-07 23:09:33,412 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion does not exist in model's document

❌ Minimum Accuracy Logreg Champion

The Minimum Accuracy test evaluates whether the model's prediction accuracy meets or exceeds a specified threshold, providing a direct measure of overall predictive correctness. The results table presents the model's observed accuracy score, the minimum threshold applied, and the corresponding pass/fail outcome. The model's accuracy score is compared against the threshold to determine if the model satisfies the minimum performance requirement.

Key insights:

  • Accuracy score below threshold: The model achieved an accuracy score of 0.6507, which is below the specified threshold of 0.7.
  • Test outcome is Fail: The test result is marked as "Fail," indicating the model did not meet the minimum accuracy requirement set for this evaluation.

The results indicate that the model's predictive accuracy did not reach the predefined minimum threshold, as evidenced by the observed score of 0.6507 against the 0.7 benchmark. This outcome highlights a gap between current model performance and the established accuracy criterion.

Tables

Score Threshold Pass/Fail
0.6507 0.7 Fail
2026-04-07 23:09:37,379 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion does not exist in model's document

✅ Minimum F1 Score Logreg Champion

The MinimumF1Score:logreg_champion test evaluates whether the model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring balanced performance between precision and recall. The results table presents the observed F1 score, the minimum threshold for passing, and the pass/fail outcome. The observed F1 score is 0.6424, with a threshold set at 0.5, and the test outcome is marked as "Pass".

Key insights:

  • F1 score exceeds minimum threshold: The model achieved an F1 score of 0.6424, which is above the required threshold of 0.5.
  • Test outcome is Pass: The model met the minimum performance standard for F1 score on the validation set, as indicated by the "Pass" result.

The results indicate that the model demonstrates balanced precision and recall performance on the validation set, surpassing the established minimum F1 score requirement. The observed F1 score provides evidence of effective classification capability under the current validation conditions.

Tables

Score Threshold Pass/Fail
0.6424 0.5 Pass
2026-04-07 23:09:40,186 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:logreg_champion does not exist in model's document

ROC Curve Logreg Champion

The ROC Curve test evaluates the binary classification performance of the log_model_champion by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under the Curve (AUC) on the test_dataset_final. The resulting plot displays the trade-off between the true positive rate and false positive rate across all classification thresholds, with the ROC curve compared against a baseline representing random performance. The AUC value is provided as a summary metric of the model's discriminative ability.

Key insights:

  • AUC indicates moderate discriminative power: The ROC curve yields an AUC of 0.70, reflecting the model's ability to distinguish between the two classes on the test dataset.
  • ROC curve consistently above random baseline: The plotted ROC curve remains above the diagonal line representing random classification (AUC = 0.5) across all thresholds, indicating performance superior to chance.
  • No evidence of threshold instability: The ROC curve appears smooth without abrupt changes, suggesting stable model behavior across varying thresholds.

The test results demonstrate that the log_model_champion achieves moderate classification performance, with an AUC of 0.70 indicating reliable but not exceptional discriminative capability. The ROC curve's consistent position above the random baseline confirms that the model provides meaningful separation between classes throughout the evaluated threshold range. No signs of instability or erratic threshold behavior are observed in the ROC profile.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:logreg_champion:b5b2
2026-04-07 23:09:47,135 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:logreg_champion does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.

Log an artifact

As we can observe from the output above, our champion model doesn't pass the MinimumAccuracy test at the default threshold of the out-of-the-box test, so let's log an artifact (finding) in the ValidMind Platform (Need more help?):

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Validation under Documents.

  3. Locate the Data Preparation section and click on 2.2.2. Model Performance to expand that section.

  4. Under the Model Performance Metrics section, locate Artifacts then click Link Artifact to Report:

    Screenshot showing the validation report with the link artifact option highlighted

  5. Select Validation Issue as the type of artifact.

  6. Click + Add Validation Issue to add a validation issue type artifact.

  7. Enter in the details for your validation issue, for example:

    • TITLE — Champion Logistic Regression Model Fails Minimum Accuracy Threshold
    • RISK AREA — Model Performance
    • DOCUMENTATION SECTION — 3.2. Model Evaluation
    • DESCRIPTION — The logistic regression champion model was subjected to a Minimum Accuracy test to determine whether its predictive accuracy meets the predefined performance threshold of 0.7. The model achieved an accuracy score of 0.6507, which falls below the required minimum. As a result, the test produced a Fail outcome.
  8. Click Save.

  9. Select the validation issue you just added to link to your validation report and click Update Linked Artifacts to insert your validation issue.

  10. Click on the validation issue to expand the issue, where you can adjust details such as severity, owner, due date, status, etc. as well as include proposed remediation plans or supporting documentation as attachments.

Evaluate performance of challenger model

We've now conducted the same kinds of tests on our champion model that the model development team ran, with the aim of verifying their test results.

Next, let's see how our challenger models compare. We'll use the same batch of tests here as we did in mpt, but append a different result_id to indicate that these results should be associated with our challenger model:

mpt_chall = [
    "validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger",
    "validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger",
    "validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger"
]

We'll run each test once for each model with the same vm_test_ds dataset to compare them:

for test in mpt_chall:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Classifier Performance Champion Vs Challenger

The Classifier Performance: champion_vs_challenger test evaluates the predictive performance of classification models by reporting precision, recall, F1-score, accuracy, and ROC AUC metrics for each model and class. The results table presents these metrics for both the logistic regression (log_model_champion) and random forest (rf_model) models, including per-class, macro, and weighted averages, as well as overall accuracy and ROC AUC values. This allows for a direct comparison of model effectiveness across key performance dimensions.

Key insights:

  • Random forest outperforms logistic regression across all metrics: The rf_model achieves higher precision, recall, F1-score, accuracy (0.7311), and ROC AUC (0.8122) compared to the log_model_champion, which records accuracy of 0.6507 and ROC AUC of 0.7037.
  • Consistent class-level performance within each model: Both models display similar precision, recall, and F1-scores across classes 0 and 1, with no substantial imbalance between classes.
  • Macro and weighted averages closely aligned: For both models, macro and weighted averages for precision, recall, and F1-score are nearly identical, indicating balanced class distribution and uniform model performance across classes.

The results indicate that the random forest model demonstrates superior classification performance relative to the logistic regression model, as evidenced by higher scores across all evaluated metrics. Both models exhibit balanced predictive behavior between classes, with minimal disparity in class-level metrics. The alignment of macro and weighted averages further supports the absence of class imbalance effects in model performance.

Tables

model Class Precision Recall F1
log_model_champion 0 0.6356 0.6834 0.6586
log_model_champion 1 0.6678 0.6189 0.6424
log_model_champion Weighted Average 0.6519 0.6507 0.6504
log_model_champion Macro Average 0.6517 0.6511 0.6505
rf_model 0 0.7177 0.7492 0.7331
rf_model 1 0.7452 0.7134 0.7290
rf_model Weighted Average 0.7317 0.7311 0.7310
rf_model Macro Average 0.7315 0.7313 0.7311
model Metric Value
log_model_champion Accuracy 0.6507
log_model_champion ROC AUC 0.7037
rf_model Accuracy 0.7311
rf_model ROC AUC 0.8122
2026-04-07 23:09:51,536 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger does not exist in model's document
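The per-class and aggregate metrics in the tables above can be reproduced with scikit-learn's classification_report and roc_auc_score. A minimal sketch with hypothetical labels, predictions, and probabilities:

```python
from sklearn.metrics import classification_report, roc_auc_score

# Hypothetical labels, hard predictions, and positive-class probabilities
y_true = [0, 1, 1, 0, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.2, 0.8, 0.4, 0.3, 0.9, 0.6, 0.7, 0.1, 0.85, 0.45]

# Per-class precision/recall/F1 plus macro and weighted averages
report = classification_report(y_true, y_pred, output_dict=True)
auc = roc_auc_score(y_true, y_prob)  # computed from probabilities, not hard labels
print(f"accuracy={report['accuracy']:.4f}, "
      f"macro F1={report['macro avg']['f1-score']:.4f}, ROC AUC={auc:.4f}")
```

Note that accuracy and F1 use the thresholded predictions, while ROC AUC is threshold-free, which is why the two can diverge in the tables above.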

Confusion Matrix Champion Vs Challenger

The Confusion Matrix: champion_vs_challenger test evaluates the predictive performance of classification models by comparing predicted and actual class labels, providing a breakdown of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) for both the champion (log_model_champion) and challenger (rf_model) models. The results are presented as annotated heatmaps, allowing for direct comparison of classification outcomes between the two models. Each cell in the matrix quantifies the number of instances for each outcome type, facilitating assessment of model strengths and error patterns.

Key insights:

  • Challenger model reduces classification errors: The rf_model (challenger) exhibits lower counts of both False Positives (80 vs. 101) and False Negatives (94 vs. 125) compared to the log_model_champion.
  • Higher correct classification rates in challenger: The rf_model achieves higher True Positives (234 vs. 203) and True Negatives (239 vs. 218) than the champion model, indicating improved identification of both classes.
  • Error distribution shifts favor challenger: The reduction in both FP and FN for the rf_model suggests a more balanced and effective classification performance relative to the champion model.

The confusion matrix comparison demonstrates that the challenger model (rf_model) outperforms the champion model (log_model_champion) across all key confusion matrix categories, with higher correct classifications and fewer errors. This indicates a more effective overall classification capability for the challenger, as evidenced by the observed distribution of TP, TN, FP, and FN counts.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:def2
ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:f65b
2026-04-07 23:09:57,108 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger does not exist in model's document
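The TP/TN/FP/FN counts plotted in the heatmaps can be derived with scikit-learn's confusion_matrix. The labels and predictions below are hypothetical:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical test labels and model predictions
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]

# ravel() flattens the 2x2 matrix (rows = actual, columns = predicted)
# into TN, FP, FN, TP for a binary classifier
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Comparing these four counts between two models, as the heatmaps above do, shows exactly where each model's errors concentrate.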

❌ Minimum Accuracy Champion Vs Challenger

The Minimum Accuracy test evaluates whether each model's prediction accuracy meets or exceeds a specified threshold, providing a direct measure of overall classification correctness. The results table presents accuracy scores for two models—log_model_champion and rf_model—alongside the threshold value of 0.7 and the corresponding pass/fail outcome. Each model's score is compared to the threshold to determine if the minimum accuracy requirement is satisfied.

Key insights:

  • rf_model surpasses accuracy threshold: rf_model achieved an accuracy score of 0.7311, exceeding the 0.7 threshold and resulting in a passing outcome.
  • log_model_champion falls below threshold: log_model_champion recorded an accuracy score of 0.6507, which is below the 0.7 threshold, resulting in a failing outcome.
  • Clear differentiation in model performance: The two models display a notable difference in accuracy, with rf_model outperforming log_model_champion by approximately 8 percentage points (0.7311 vs. 0.6507).

The results indicate that rf_model meets the minimum accuracy requirement, while log_model_champion does not. This differentiation highlights a material performance gap between the two models, with only rf_model demonstrating sufficient predictive accuracy as defined by the test threshold.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6507 0.7 Fail
rf_model 0.7311 0.7 Pass
2026-04-07 23:10:00,894 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger does not exist in model's document

✅ Minimum F1 Score Champion Vs Challenger

The MinimumF1Score:champion_vs_challenger test evaluates whether each model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring balanced performance between precision and recall. The results table presents F1 scores for both the champion (log_model_champion) and challenger (rf_model) models, alongside the minimum threshold and pass/fail status. Both models are assessed independently against the threshold value of 0.5, with their respective F1 scores and outcomes displayed.

Key insights:

  • Both models exceed the minimum F1 threshold: log_model_champion achieved an F1 score of 0.6424 and rf_model achieved 0.729, both surpassing the threshold of 0.5.
  • Challenger model demonstrates higher F1 performance: rf_model outperforms log_model_champion by 0.0866 in F1 score, indicating stronger balance between precision and recall on the validation set.
  • Consistent pass status across models: Both models are marked as "Pass," confirming that each meets the minimum performance criterion established for this test.

Both the champion and challenger models satisfy the minimum F1 score requirement, with the challenger model exhibiting a higher F1 score on the validation set. The results indicate that both models maintain balanced classification performance above the established threshold, with the challenger model providing comparatively stronger results in this metric.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6424 0.5 Pass
rf_model 0.7290 0.5 Pass
2026-04-07 23:10:04,877 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger does not exist in model's document

ROC Curve Champion Vs Challenger

The ROC Curve: champion_vs_challenger test evaluates the binary classification performance of two models by plotting their Receiver Operating Characteristic (ROC) curves and calculating the Area Under the Curve (AUC) scores. The results display ROC curves for both the log_model_champion and rf_model on the test_dataset_final, with each curve compared against a baseline representing random classification (AUC = 0.5). The AUC values are presented in the plot legends, providing a quantitative measure of each model's ability to distinguish between the positive and negative classes.

Key insights:

  • rf_model demonstrates higher discriminative power: The rf_model achieves an AUC of 0.81, indicating stronger separation between classes compared to the log_model_champion.
  • log_model_champion shows moderate performance: The log_model_champion records an AUC of 0.70, reflecting moderate discriminative ability above random chance.
  • Both models outperform random classification: Both ROC curves are consistently above the diagonal line representing random performance (AUC = 0.5), confirming meaningful predictive capability.

The comparative ROC analysis reveals that both models provide effective discrimination between classes, with the rf_model exhibiting notably higher classification performance than the log_model_champion. The observed AUC values indicate that both models are suitable for binary classification tasks in this context, with the rf_model offering a stronger overall predictive capability.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:6d52
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:0915
2026-04-07 23:10:11,336 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger does not exist in model's document
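The AUC values shown in the plot legends come from each model's predicted positive-class probabilities. A minimal sketch of the comparison, using hypothetical scores for the two models:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical test labels and positive-class probabilities for two models
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
champion_probs = [0.2, 0.6, 0.7, 0.4, 0.3, 0.8, 0.5, 0.45]
challenger_probs = [0.1, 0.3, 0.8, 0.7, 0.2, 0.9, 0.6, 0.4]

for name, probs in [("champion", champion_probs), ("challenger", challenger_probs)]:
    auc = roc_auc_score(y_true, probs)       # area under the ROC curve
    fpr, tpr, _ = roc_curve(y_true, probs)   # points used to draw the curve
    print(f"{name}: AUC = {auc:.4f}")
```

Plotting tpr against fpr for each model, with the diagonal fpr = tpr as the random baseline, reproduces the comparison shown in the figures.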
Based on the performance metrics, our challenger random forest classification model passes the MinimumAccuracy where our champion did not.

In your validation report, support the recommendation in your validation issue's Proposed Remediation Plan to investigate using our challenger model by inserting the performance tests we logged with this notebook into the appropriate section.

Run diagnostic tests

Next, we want to compare the robustness and stability of our champion and challenger models.

Use list_tests() to list all available diagnostic tests applicable to classification tasks:

vm.tests.list_tests(tags=["model_diagnosis"], task="classification")
  • validmind.model_validation.sklearn.OverfitDiagnosis (Overfit Diagnosis): Assesses potential overfitting in a model's predictions, identifying regions where performance between training and... Required inputs: model, datasets. Params: metric (str, default: None), cut_off_threshold (float, default: 0.04). Tasks: classification, regression.
  • validmind.model_validation.sklearn.RobustnessDiagnosis (Robustness Diagnosis): Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.... Required inputs: datasets, model. Params: metric (str, default: None), scaling_factor_std_dev_list (List, default: [0.1, 0.2, 0.3, 0.4, 0.5]), performance_decay_threshold (float, default: 0.05). Tasks: classification, regression.
  • validmind.model_validation.sklearn.WeakspotsDiagnosis (Weakspots Diagnosis): Identifies and visualizes weak spots in a machine learning model's performance across various sections of the... Required inputs: datasets, model. Params: features_columns, metrics, thresholds (all Optional, default: None). Tasks: classification, text_classification.

Let’s now assess the models for potential signs of overfitting, and identify any sub-segments where performance may be inconsistent, with the OverfitDiagnosis test.

Overfitting occurs when a model learns the training data too well, capturing not only the true underlying pattern but also noise and random fluctuations. The result is excellent performance on the training dataset but poor generalization to new, unseen data:

  • Since the training dataset (vm_train_ds) was used to fit the model, we use this set to establish a baseline performance for how well the model performs on data it has already seen.
  • The testing dataset (vm_test_ds) was never seen during training, and here simulates real-world generalization, or how well the model performs on new, unseen data.
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    }
).log()

Overfit Diagnosis Champion Vs Challenger

The Overfit Diagnosis: champion vs challenger test evaluates the extent to which model performance differs between training and test sets across feature segments, using AUC as the performance metric for classification. The results are presented as AUC gaps for each feature bin, with a threshold of 0.04 used to flag regions of potential overfitting. Both the logistic regression (log_model_champion) and random forest (rf_model) models are assessed, with visualizations and tabular summaries highlighting the magnitude and distribution of AUC gaps across key features.

Key insights:

  • Localized overfitting in low-sample Balance segments: Both models exhibit pronounced AUC gaps in Balance segments with very few records (e.g., (25089.809, 50179.618] and (200718.472, 225808.281]), with gaps reaching 0.5–1.0, indicating substantial overfitting in these regions.
  • Widespread overfitting in random forest model: The random forest model shows consistently high AUC gaps across most feature bins, with gaps exceeding 0.1–0.3 for CreditScore, Tenure, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Geography, and Gender, indicating pervasive overfitting across the feature space.
  • Moderate, segment-specific overfitting in logistic regression: The logistic regression model displays moderate AUC gaps in specific segments, notably for Balance (gaps up to 0.7692), EstimatedSalary (up to 0.1574), and select bins of CreditScore and Tenure (gaps up to 0.1157 and 0.084, respectively), while other features show minimal or no overfitting.
  • Minimal overfitting in categorical features for logistic regression: For categorical features such as HasCrCard, IsActiveMember, Geography, and Gender, the logistic regression model shows AUC gaps well below the threshold, indicating stable generalization in these dimensions.

The results indicate that the random forest model is highly susceptible to overfitting across nearly all feature segments, with AUC gaps consistently exceeding the diagnostic threshold. In contrast, the logistic regression model demonstrates more localized overfitting, primarily in low-sample or high-value Balance and EstimatedSalary segments, while maintaining stable performance in categorical features. These findings highlight the importance of monitoring segment-level performance, particularly in regions with limited data, and underscore the differing generalization characteristics between model types.

Tables

model Feature Slice Number of Training Records Number of Test Records Training AUC Test AUC Gap
log_model_champion CreditScore (400.0, 450.0] 46 17 0.6990 0.5833 0.1157
log_model_champion CreditScore (800.0, 850.0] 160 32 0.7086 0.6510 0.0576
log_model_champion Tenure (3.0, 4.0] 252 60 0.6804 0.6194 0.0610
log_model_champion Tenure (5.0, 6.0] 242 74 0.7251 0.6411 0.0840
log_model_champion Balance (25089.809, 50179.618] 15 3 0.5000 0.0000 0.5000
log_model_champion Balance (150538.854, 175628.663] 196 38 0.6210 0.4846 0.1364
log_model_champion Balance (200718.472, 225808.281] 14 3 0.7692 0.0000 0.7692
log_model_champion EstimatedSalary (79943.476, 99926.45] 263 63 0.6997 0.5423 0.1574
log_model_champion EstimatedSalary (159875.372, 179858.346] 283 68 0.6599 0.6116 0.0483
rf_model CreditScore (400.0, 450.0] 46 17 1.0000 0.5278 0.4722
rf_model CreditScore (500.0, 550.0] 255 71 1.0000 0.7833 0.2167
rf_model CreditScore (550.0, 600.0] 354 106 1.0000 0.8257 0.1743
rf_model CreditScore (600.0, 650.0] 488 106 1.0000 0.8061 0.1939
rf_model CreditScore (650.0, 700.0] 473 140 1.0000 0.8341 0.1659
rf_model CreditScore (700.0, 750.0] 410 88 1.0000 0.7581 0.2419
rf_model CreditScore (750.0, 800.0] 267 54 1.0000 0.8780 0.1220
rf_model CreditScore (800.0, 850.0] 160 32 1.0000 0.7922 0.2078
rf_model Tenure (-0.01, 1.0] 396 81 1.0000 0.8095 0.1905
rf_model Tenure (1.0, 2.0] 271 61 1.0000 0.7716 0.2284
rf_model Tenure (2.0, 3.0] 265 76 1.0000 0.7528 0.2472
rf_model Tenure (3.0, 4.0] 252 60 1.0000 0.8661 0.1339
rf_model Tenure (4.0, 5.0] 257 69 1.0000 0.8920 0.1080
rf_model Tenure (5.0, 6.0] 242 74 1.0000 0.7551 0.2449
rf_model Tenure (6.0, 7.0] 251 62 1.0000 0.8566 0.1434
rf_model Tenure (7.0, 8.0] 248 69 1.0000 0.8253 0.1747
rf_model Tenure (8.0, 9.0] 263 64 1.0000 0.7292 0.2708
rf_model Tenure (9.0, 10.0] 140 31 1.0000 0.9292 0.0708
rf_model Balance (-250.898, 25089.809] 813 216 1.0000 0.8706 0.1294
rf_model Balance (25089.809, 50179.618] 15 3 1.0000 0.0000 1.0000
rf_model Balance (50179.618, 75269.427] 104 20 1.0000 0.8800 0.1200
rf_model Balance (75269.427, 100359.236] 289 70 1.0000 0.7353 0.2647
rf_model Balance (100359.236, 125449.045] 609 158 1.0000 0.7923 0.2077
rf_model Balance (125449.045, 150538.854] 498 128 1.0000 0.7296 0.2704
rf_model Balance (150538.854, 175628.663] 196 38 1.0000 0.6779 0.3221
rf_model Balance (200718.472, 225808.281] 14 3 1.0000 0.0000 1.0000
rf_model NumOfProducts (0.997, 1.3] 1498 375 1.0000 0.7187 0.2813
rf_model NumOfProducts (1.9, 2.2] 898 220 1.0000 0.7447 0.2553
rf_model NumOfProducts (2.8, 3.1] 151 46 1.0000 0.7146 0.2854
rf_model HasCrCard (-0.001, 0.1] 751 186 1.0000 0.8148 0.1852
rf_model HasCrCard (0.9, 1.0] 1834 461 1.0000 0.8129 0.1871
rf_model IsActiveMember (-0.001, 0.1] 1371 350 1.0000 0.7985 0.2015
rf_model IsActiveMember (0.9, 1.0] 1214 297 1.0000 0.7910 0.2090
rf_model EstimatedSalary (-188.25, 19994.554] 258 50 1.0000 0.8733 0.1267
rf_model EstimatedSalary (19994.554, 39977.528] 268 62 1.0000 0.8275 0.1725
rf_model EstimatedSalary (39977.528, 59960.502] 264 64 1.0000 0.8424 0.1576
rf_model EstimatedSalary (59960.502, 79943.476] 265 79 1.0000 0.7925 0.2075
rf_model EstimatedSalary (79943.476, 99926.45] 263 63 1.0000 0.7026 0.2974
rf_model EstimatedSalary (99926.45, 119909.424] 243 76 1.0000 0.7738 0.2262
rf_model EstimatedSalary (119909.424, 139892.398] 244 54 1.0000 0.9167 0.0833
rf_model EstimatedSalary (139892.398, 159875.372] 251 51 1.0000 0.7742 0.2258
rf_model EstimatedSalary (159875.372, 179858.346] 283 68 1.0000 0.7556 0.2444
rf_model EstimatedSalary (179858.346, 199841.32] 246 80 1.0000 0.8682 0.1318
rf_model Geography_Germany (-0.001, 0.1] 1810 456 1.0000 0.8026 0.1974
rf_model Geography_Germany (0.9, 1.0] 775 191 1.0000 0.7897 0.2103
rf_model Geography_Spain (-0.001, 0.1] 1975 498 1.0000 0.8200 0.1800
rf_model Geography_Spain (0.9, 1.0] 610 149 1.0000 0.7905 0.2095
rf_model Gender_Male (-0.001, 0.1] 1287 311 1.0000 0.7877 0.2123
rf_model Gender_Male (0.9, 1.0] 1298 336 1.0000 0.8253 0.1747

Figures

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:a94b
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:5ad5
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:536a
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:9800
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:95f3
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:09d2
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:1392
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:5462
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:4708
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2a8b
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:4ff9
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:e312
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6c22
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:b89e
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0ee3
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:985d
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:9725
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:5727
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:445e
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:3777
2026-04-07 23:10:38,895 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger does not exist in model's document
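The per-slice AUC gap in the table above can be approximated outside ValidMind by binning a feature and comparing train vs. test AUC within each bin. This is a minimal sketch under stated assumptions, not ValidMind's implementation: the prob and target column names are hypothetical, and the leftmost bin is handled only approximately (pd.cut extends its edge slightly).

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_gap_by_bins(train_df, test_df, feature, score_col="prob", target_col="target",
                    n_bins=10, cut_off=0.04):
    """Flag feature bins where train AUC exceeds test AUC by more than cut_off."""
    # Shared bin edges across both splits so train and test use the same slices
    all_values = pd.concat([train_df[feature], test_df[feature]])
    intervals = pd.cut(all_values, bins=n_bins).cat.categories
    rows = []
    for iv in intervals:
        tr = train_df[train_df[feature].between(iv.left, iv.right, inclusive="right")]
        te = test_df[test_df[feature].between(iv.left, iv.right, inclusive="right")]
        # AUC is undefined when a slice contains fewer than two classes
        if tr[target_col].nunique() < 2 or te[target_col].nunique() < 2:
            continue
        gap = (roc_auc_score(tr[target_col], tr[score_col])
               - roc_auc_score(te[target_col], te[score_col]))
        rows.append({"slice": str(iv), "gap": round(gap, 4), "flagged": gap > cut_off})
    return pd.DataFrame(rows)
```

Running this over scored train and test frames for a feature like Balance yields a table analogous to the one above, and makes clear why thin slices (a handful of records) can show extreme, unreliable gaps.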

Let's also conduct robustness and stability testing of the two models with the RobustnessDiagnosis test. Robustness refers to a model's ability to maintain consistent performance, and stability refers to a model's ability to produce consistent outputs over time across different data subsets.

Again, we'll use both the training and testing datasets to establish baseline performance and to simulate real-world generalization:

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

❌ Robustness Diagnosis Champion Vs Log Regression

The Robustness Diagnosis: Champion vs. LogRegression test evaluates the resilience of the log_model_champion and rf_model to input perturbations by introducing Gaussian noise to numeric features and measuring AUC decay across increasing noise levels. The results are presented in tabular and graphical formats, showing AUC and performance decay for both models on train and test datasets at perturbation sizes ranging from 0.0 to 0.5. The test highlights how each model's predictive performance changes as the input data becomes progressively noisier.

Key insights:

  • Logistic regression model demonstrates gradual AUC decay: The log_model_champion exhibits a steady but moderate decline in AUC as perturbation size increases, with test set AUC decreasing from 0.7037 at baseline to 0.6772 at the highest noise level (0.5), and performance decay remaining below 0.03 throughout.
  • Random forest model shows pronounced sensitivity in training: The rf_model displays a sharp drop in training set AUC, falling from 1.0 at baseline to 0.7934 at perturbation size 0.5, with performance decay exceeding 0.2 and failing the test at perturbation sizes ≥0.2.
  • Test set robustness higher for random forest than training: On the test set, the rf_model maintains higher AUC values under noise (0.8122 at baseline to 0.7555 at 0.5 perturbation), with performance decay remaining below 0.06, but fails the test at perturbation sizes ≥0.4.
  • All models pass at low perturbation, but random forest fails at moderate noise: Both models pass the robustness test at low perturbation levels (≤0.1), but the rf_model fails at moderate to high noise on both train and test datasets, while the log_model_champion passes at all tested levels.

The results indicate that the log_model_champion maintains stable predictive performance under increasing Gaussian noise, with only moderate AUC decay and consistent test passing across all perturbation sizes. In contrast, the rf_model is highly sensitive to input noise in the training set, with substantial performance decay and multiple test failures at moderate and higher perturbation levels, though its test set robustness is comparatively higher. These findings highlight a trade-off between model complexity and robustness to noisy inputs, with the logistic regression model demonstrating greater resilience under the tested conditions.

Tables

Model               Perturbation Size  Dataset              Row Count  AUC     Performance Decay  Passed
log_model_champion  Baseline (0.0)     train_dataset_final  2585       0.6829  0.0000             True
log_model_champion  Baseline (0.0)     test_dataset_final   647        0.7037  0.0000             True
log_model_champion  0.1                train_dataset_final  2585       0.6809  0.0021             True
log_model_champion  0.1                test_dataset_final   647        0.7030  0.0008             True
log_model_champion  0.2                train_dataset_final  2585       0.6814  0.0015             True
log_model_champion  0.2                test_dataset_final   647        0.6987  0.0050             True
log_model_champion  0.3                train_dataset_final  2585       0.6719  0.0111             True
log_model_champion  0.3                test_dataset_final   647        0.6985  0.0053             True
log_model_champion  0.4                train_dataset_final  2585       0.6727  0.0102             True
log_model_champion  0.4                test_dataset_final   647        0.6877  0.0160             True
log_model_champion  0.5                train_dataset_final  2585       0.6594  0.0236             True
log_model_champion  0.5                test_dataset_final   647        0.6772  0.0266             True
rf_model            Baseline (0.0)     train_dataset_final  2585       1.0000  0.0000             True
rf_model            Baseline (0.0)     test_dataset_final   647        0.8122  0.0000             True
rf_model            0.1                train_dataset_final  2585       0.9838  0.0162             True
rf_model            0.1                test_dataset_final   647        0.8034  0.0088             True
rf_model            0.2                train_dataset_final  2585       0.9385  0.0615             False
rf_model            0.2                test_dataset_final   647        0.8069  0.0053             True
rf_model            0.3                train_dataset_final  2585       0.8977  0.1023             False
rf_model            0.3                test_dataset_final   647        0.7825  0.0297             True
rf_model            0.4                train_dataset_final  2585       0.8439  0.1561             False
rf_model            0.4                test_dataset_final   647        0.7600  0.0522             False
rf_model            0.5                train_dataset_final  2585       0.7934  0.2066             False
rf_model            0.5                test_dataset_final   647        0.7555  0.0567             False

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:b730
ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:0c5c
2026-04-07 23:10:51,365 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression does not exist in model's document

Run feature importance tests

We also want to verify the relative influence of different input features on our models' predictions, and to compare our champion and challenger models to see whether one offers more understandable or logical feature importance scores.

Use list_tests() to identify all the feature importance tests for classification:

# Store the feature importance tests
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI
['validmind.model_validation.FeaturesAUC',
 'validmind.model_validation.sklearn.PermutationFeatureImportance',
 'validmind.model_validation.sklearn.SHAPGlobalImportance']

We'll only use our testing dataset (vm_test_ds) here to provide a realistic, unseen sample that mimics future or production data, as the training dataset has already influenced our model during learning:

# Run and log our feature importance tests for both models for the testing dataset
for test in FI:
    vm.tests.run_test(
        f"{test}:champion_vs_challenger",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Features Champion Vs Challenger

The FeaturesAUC:champion_vs_challenger test evaluates the univariate discriminatory power of each feature by calculating the Area Under the Curve (AUC) for each feature against the binary target. The resulting bar chart displays the AUC values for all features in the test dataset, with higher AUC values indicating stronger individual ability to distinguish between classes. The features are ranked from highest to lowest AUC, providing a clear view of which variables are most and least informative on their own.

Key insights:

  • Geography_Germany and Balance show highest univariate discrimination: Geography_Germany and Balance have the highest AUC values, both approaching 0.6, indicating these features possess the strongest individual ability to separate the two classes.
  • Tenure, EstimatedSalary, and CreditScore provide moderate discrimination: These features exhibit AUC values slightly above 0.5, reflecting moderate univariate predictive strength.
  • IsActiveMember and NumOfProducts display lowest AUC values: These features have AUC values near 0.4, suggesting limited individual discriminatory power in the context of this dataset.

The results indicate that Geography_Germany and Balance are the most individually informative features for binary classification in this dataset, while IsActiveMember and NumOfProducts contribute less to univariate class separation. The observed spread in AUC values highlights varying degrees of univariate predictive strength across features, supporting targeted feature analysis and interpretation.
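The univariate AUC calculation underlying this test can be sketched in a few lines — a simplified standalone example (using synthetic data and generic feature names, not the bank churn dataset) that scores each raw feature directly against the binary target:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

# Univariate AUC: treat each raw feature as a standalone score for the target
feature_auc = {col: roc_auc_score(y, df[col]) for col in df.columns}
ranked = pd.Series(feature_auc).sort_values(ascending=False)
print(ranked)
```

An AUC near 0.5 means the feature alone barely separates the classes; values well below 0.5 (as with IsActiveMember above) indicate a feature that is discriminative but inversely related to the positive class.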

Figures

ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:5d30
ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:c8c6
2026-04-07 23:10:57,999 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.FeaturesAUC:champion_vs_challenger does not exist in model's document

Permutation Feature Importance Champion Vs Challenger

The Permutation Feature Importance (PFI) test evaluates the relative importance of each input feature by measuring the decrease in model performance when feature values are randomly permuted. The results are presented as bar plots for both the logistic regression (log_model_champion) and random forest (rf_model) models, with each bar representing the importance score for a given feature. The magnitude of each bar indicates the extent to which permuting that feature reduces model performance, thereby quantifying its contribution to the model's predictive capability.

Key insights:

  • Distinct feature reliance across models: The logistic regression model (log_model_champion) assigns highest importance to IsActiveMember, Geography_Germany, and Gender_Male, while the random forest model (rf_model) is most influenced by NumOfProducts, Geography_Germany, and Balance.
  • IsActiveMember and NumOfProducts show model-specific dominance: IsActiveMember is the most influential feature for the logistic regression model, whereas NumOfProducts is the most influential for the random forest model, with a substantially higher importance score than any other feature in either model.
  • Geography_Germany consistently important: Geography_Germany ranks among the top two features for both models, indicating a stable and significant contribution to predictive performance across model types.
  • Low importance for several features: Features such as HasCrCard, Geography_Spain, and Tenure exhibit low or near-zero importance in both models, suggesting minimal impact on model predictions.

The PFI results demonstrate that feature importance profiles differ substantially between the logistic regression and random forest models, with each model relying on a distinct set of primary predictors. Geography_Germany emerges as a consistently important feature across both models, while other features such as IsActiveMember and NumOfProducts display model-specific dominance. Several features contribute minimally to predictive performance, indicating concentrated reliance on a subset of input variables.
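The mechanics of permutation feature importance can be reproduced with scikit-learn's built-in implementation — a minimal sketch on synthetic data (not the models above), which shuffles one feature at a time on the held-out set and records the drop in AUC:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permute each feature on the held-out set and measure the AUC drop
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```

Scoring on the test split rather than the training split matters for the same reason discussed above: importances measured on data the model has memorized (as the rf_model's perfect training AUC suggests) can be misleading.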

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:1b9c
ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:c246
2026-04-07 23:11:09,272 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger does not exist in model's document

SHAP Global Importance Champion Vs Challenger

The SHAPGlobalImportance:champion_vs_challenger test evaluates and visualizes the global feature importance for both the champion (log_model_champion) and challenger (rf_model) models using SHAP values. The results include normalized mean importance plots and SHAP summary plots, which display the relative contribution of each feature to model predictions and the distribution of SHAP values across instances. These visualizations provide insight into which features most strongly influence model outputs and how feature values relate to prediction impact.

Key insights:

  • Distinct feature dominance in champion model: The log_model_champion model assigns the highest importance to IsActiveMember, Geography_Germany, and Gender_Male, with IsActiveMember showing the largest normalized SHAP value.
  • Broader feature utilization in champion model: The champion model distributes importance across a wider set of features, including Balance, CreditScore, Tenure, NumOfProducts, EstimatedSalary, HasCrCard, and Geography_Spain, though with lower relative importance.
  • Challenger model focuses on fewer features: The rf_model challenger model attributes importance almost exclusively to CreditScore and Tenure, with no other features showing material contribution in the normalized SHAP value plot.
  • SHAP value distributions indicate model behavior: The summary plots for the champion model show a range of SHAP value impacts for top features, with visible variation and both positive and negative contributions, while the challenger model’s SHAP interaction plots are concentrated on CreditScore and Tenure, indicating limited feature interaction.

The SHAP global importance analysis reveals that the champion model leverages a broader set of features with varying degrees of influence, while the challenger model relies primarily on CreditScore and Tenure. The distribution and magnitude of SHAP values suggest that the champion model incorporates more complex feature relationships, whereas the challenger model’s predictions are driven by a narrower feature set. This differentiation in feature importance profiles provides transparency into the decision logic of each model and highlights the relative complexity and focus of their predictive mechanisms.
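For a linear model such as the champion, SHAP values have a closed form under an assumption of feature independence: the contribution of feature j on row i is coef_j * (x_ij - mean_j) in log-odds space. A minimal sketch (synthetic data, no `shap` package required) that reproduces the normalized mean-importance ranking this way:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=5, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Linear-model SHAP values (independent features): coef_j * (x_ij - mean_j)
shap_values = model.coef_[0] * (X - X.mean(axis=0))

# Global importance: mean absolute SHAP value per feature, normalized to sum to 1
importance = np.abs(shap_values).mean(axis=0)
importance /= importance.sum()
for i in np.argsort(importance)[::-1]:
    print(f"feature_{i}: {importance[i]:.3f}")
```

A useful sanity check on this formulation: each row's SHAP values sum to the model's log-odds output minus the log-odds at the mean feature vector, which is the additivity property SHAP guarantees. For tree models like the rf_model, the `shap` package's TreeExplainer is needed instead.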

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:c4dc
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:d8e1
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:a1f5
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:9f81
2026-04-07 23:11:19,962 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger does not exist in model's document

In summary

In this third notebook, you learned how to:

Next steps

Finalize validation and reporting

Now that you're familiar with the basics of using the ValidMind Library to run and log validation tests, let's learn how to implement some custom tests and wrap up our validation: 4 — Finalize validation and reporting


Copyright © 2023-2026 ValidMind Inc. All rights reserved.
Refer to LICENSE for details.
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial