ValidMind for model validation 3 — Developing a potential challenger model

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this third notebook, develop a potential challenger model and then pass your model and its predictions to ValidMind.

A challenger model is an alternate model that attempts to outperform the champion model, ensuring that the best performing fit-for-purpose model is always considered for deployment. Challenger models also help avoid over-reliance on a single model, and allow testing of new features, algorithms, or data sources without disrupting the production lifecycle.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals.

Prerequisites

In order to develop potential challenger models with this notebook, you'll need to first have:

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be quite familiar to you, as we performed the same actions in the previous notebook, 2 — Start the model validation process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model validation" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2026-01-30 23:20:18,928 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load in the sample Bank Customer Churn Prediction dataset used to develop the champion model, which we will independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

Preprocess the dataset

We’ll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As you know, before we can run tests, we'll need to initialize a ValidMind dataset object with the init_dataset function:

# Register the balanced data; 'balanced_raw_dataset' is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and utilize the output to help us identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity. The results table presents the top ten strongest absolute correlations, listing the feature pairs, their Pearson correlation coefficients, and a Pass/Fail status based on a threshold of 0.3. Only one feature pair exceeds the threshold, while the remaining pairs show lower correlation values and pass the test criteria.

Key insights:

  • One feature pair exceeds correlation threshold: The pair (Age, Exited) has a correlation coefficient of 0.3594, surpassing the 0.3 threshold and resulting in a Fail status.
  • All other feature pairs below threshold: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.1911 to 0.0373, all below the 0.3 threshold and marked as Pass.
  • Predominantly weak linear relationships: Most feature pairs exhibit weak linear associations, with coefficients clustered well below the threshold.

The results indicate that the dataset contains predominantly low linear correlations among features, with only the (Age, Exited) pair displaying a moderate correlation above the specified threshold. The overall correlation structure suggests limited risk of feature redundancy or multicollinearity, aside from the identified exception.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3594 Fail
(IsActiveMember, Exited) -0.1911 Pass
(Balance, NumOfProducts) -0.1636 Pass
(Balance, Exited) 0.1494 Pass
(NumOfProducts, Exited) -0.0582 Pass
(Tenure, Balance) -0.0521 Pass
(NumOfProducts, IsActiveMember) 0.0456 Pass
(HasCrCard, IsActiveMember) -0.0386 Pass
(Tenure, IsActiveMember) -0.0385 Pass
(Age, NumOfProducts) -0.0373 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3594 Fail
1 (IsActiveMember, Exited) -0.1911 Pass
2 (Balance, NumOfProducts) -0.1636 Pass
3 (Balance, Exited) 0.1494 Pass
4 (NumOfProducts, Exited) -0.0582 Pass
5 (Tenure, Balance) -0.0521 Pass
6 (NumOfProducts, IsActiveMember) 0.0456 Pass
7 (HasCrCard, IsActiveMember) -0.0386 Pass
8 (Tenure, IsActiveMember) -0.0385 Pass
9 (Age, NumOfProducts) -0.0373 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']

We can then re-initialize the dataset, with a different input_id and the highly correlated features removed, and re-run the test for confirmation:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity. The results table presents the top ten absolute Pearson correlation coefficients among feature pairs, along with their Pass/Fail status relative to the threshold of 0.3. All reported coefficients are below the threshold, and each feature pair is marked as Pass.

Key insights:

  • No high correlations detected: All absolute Pearson correlation coefficients are below the 0.3 threshold, with the highest magnitude observed at 0.1911 between IsActiveMember and Exited.
  • Consistent Pass status across all pairs: Every feature pair in the top ten list is marked as Pass, indicating no evidence of strong linear relationships among the evaluated features.
  • Low to moderate negative and positive associations: The coefficients range from -0.1911 to 0.0456, reflecting only weak linear associations between the examined feature pairs.

The test results indicate an absence of strong linear relationships or multicollinearity among the top feature pairs in the dataset. All observed correlations are weak, supporting the independence of features and reducing concerns regarding feature redundancy or interpretability issues arising from linear dependencies.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.1911 Pass
(Balance, NumOfProducts) -0.1636 Pass
(Balance, Exited) 0.1494 Pass
(NumOfProducts, Exited) -0.0582 Pass
(Tenure, Balance) -0.0521 Pass
(NumOfProducts, IsActiveMember) 0.0456 Pass
(HasCrCard, IsActiveMember) -0.0386 Pass
(Tenure, IsActiveMember) -0.0385 Pass
(CreditScore, IsActiveMember) 0.0366 Pass
(CreditScore, EstimatedSalary) -0.0359 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and the highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
6653 707 9 0.00 2 1 1 70403.65 0 False False False
2457 627 5 100880.76 1 0 1 134665.25 0 False True True
4226 721 7 0.00 2 1 1 122580.48 0 False False False
1770 648 6 157559.59 2 1 0 140991.23 1 False True False
1223 515 2 90432.92 1 1 1 188366.04 1 True False True
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's go ahead and import the champion model submitted by the model development team as a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning:

Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations

Training a potential challenger model

We're curious how an alternate model compares to our champion model, so let's train a challenger model as a basis for our testing.

Our champion logistic regression model is a simpler, parametric model that assumes a linear relationship between the independent variables and the log-odds of the outcome. While logistic regression may not capture complex patterns as effectively, it offers a high degree of interpretability and is easier to explain to stakeholders. However, model risk is not assessed on any single factor in isolation, but rather by weighing the trade-offs between predictive performance, ease of interpretability, and overall alignment with business objectives.
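To see that linearity concretely, you can inspect the champion's fitted coefficients, which define the log-odds as a linear combination of the inputs. This is only a minimal sketch, assuming the unpickled log_reg is a fitted scikit-learn LogisticRegression; we print raw arrays only, since the champion's training feature names belong to the development team's pipeline:

import numpy as np

# Log-odds are linear in the features: log(p / (1 - p)) = intercept + coef @ x
print("Intercept:", log_reg.intercept_)
print("Coefficients:", log_reg.coef_)

# Odds ratio per unit increase in each feature
print("Odds ratios:", np.exp(log_reg.coef_))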

Random forest classification model

A random forest classification model is an ensemble machine learning algorithm that uses multiple decision trees to classify data. In ensemble learning, multiple models are combined to improve prediction accuracy and robustness.

Random forest classification models generally have higher accuracy because they capture complex, non-linear relationships, but as a result they lack transparency in their predictions.

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)
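While the random forest doesn't expose coefficients the way the logistic regression does, its impurity-based feature importances give a partial view of what drives its predictions. A quick illustrative sketch, not a substitute for formal explainability testing:

import pandas as pd

# Impurity-based feature importances from the fitted random forest
importances = pd.Series(rf_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))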

Initializing the model objects

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, which can then be passed to other functions for analysis and tests on the data.

You simply initialize this model object with vm.init_model():

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

Assign predictions

With our models registered, we'll move on to assigning both the predictive probabilities coming directly from each model, and the binary class predictions obtained by applying a cutoff threshold to those probabilities.

  • The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
  • This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

# Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-01-30 23:20:28,202 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-30 23:20:28,204 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-30 23:20:28,205 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-30 23:20:28,207 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-30 23:20:28,210 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-30 23:20:28,212 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-30 23:20:28,213 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-30 23:20:28,215 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-30 23:20:28,217 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-30 23:20:28,239 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-30 23:20:28,241 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-30 23:20:28,262 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-30 23:20:28,264 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-30 23:20:28,276 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-30 23:20:28,277 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-30 23:20:28,289 - INFO(validmind.vm_models.dataset.utils): Done running predict()
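As the log output above shows, when no precomputed values are supplied, assign_predictions() falls back to the model's own predict() and predict_proba() methods. Purely for illustration, here is roughly what that automatic path computes for the challenger on the test set (a sketch using scikit-learn directly rather than the ValidMind API):

# Illustrative only: roughly what the automatic path computes under the hood
manual_probs = rf_model.predict_proba(X_test)[:, 1]  # probability of class 1 ("Exited")
manual_preds = rf_model.predict(X_test)              # binary class predictions

print(manual_probs[:5])
print(manual_preds[:5])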

Running model evaluation tests

With our setup complete, let's run the rest of our validation tests. Since we have already verified the data quality of the dataset used to train our champion model, we will now focus on comprehensive performance evaluations of both the champion and challenger models.

Run model performance tests

Let's run some performance tests, beginning with independent testing of our champion logistic regression model, then moving on to our potential challenger model.

Use vm.tests.list_tests() to identify all the model performance tests for classification:


vm.tests.list_tests(tags=["model_performance"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.CalibrationCurve Calibration Curve Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... True False ['model', 'dataset'] {'n_bins': {'type': 'int', 'default': 10}} ['sklearn', 'model_performance', 'classification'] ['classification']
validmind.model_validation.sklearn.ClassifierPerformance Classifier Performance Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... False True ['dataset', 'model'] {'average': {'type': 'str', 'default': 'macro'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ConfusionMatrix Confusion Matrix Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... True False ['dataset', 'model'] {'threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.HyperParametersTuning Hyper Parameters Tuning Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... False True ['model', 'dataset'] {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': 'Union', 'default': None}, 'thresholds': {'type': 'Union', 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} ['sklearn', 'model_performance'] ['clustering', 'classification']
validmind.model_validation.sklearn.MinimumAccuracy Minimum Accuracy Checks if the model's prediction accuracy meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.7}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1Score Minimum F1 Score Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScore Minimum ROCAUC Score Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ModelsPerformanceComparison Models Performance Comparison Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... False True ['dataset', 'models'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndex Population Stability Index Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... True True ['datasets', 'model'] {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurve Precision Recall Curve Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurve ROC Curve Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrors Regression Errors Assesses the performance and error distribution of a regression model using various error metrics.... False True ['model', 'dataset'] {} ['sklearn', 'model_performance'] ['regression', 'classification']
validmind.model_validation.sklearn.TrainingTestDegradation Training Test Degradation Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... False True ['datasets', 'model'] {'max_threshold': {'type': 'float', 'default': 0.1}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.statsmodels.GINITable GINI Table Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets.... False True ['dataset', 'model'] {} ['model_performance'] ['classification']
validmind.ongoing_monitoring.CalibrationCurveDrift Calibration Curve Drift Evaluates changes in probability calibration between reference and monitoring datasets.... True True ['datasets', 'model'] {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDrift Class Discrimination Drift Compares classification discrimination metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassificationAccuracyDrift Classification Accuracy Drift Compares classification accuracy metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDrift Confusion Matrix Drift Compares confusion matrix metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDrift ROC Curve Drift Compares ROC curves between reference and monitoring datasets.... True False ['datasets', 'model'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']

We'll isolate the specific tests we want to run in mpt:

As we learned in the previous notebook, 2 — Start the model validation process, you can tag an individual result with a unique identifier by appending a custom result_id to the test_id with a : separator. We'll append an identifier for our champion model here:

mpt = [
    "validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion",
    "validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion",
    "validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion",
    "validmind.model_validation.sklearn.MinimumF1Score:logreg_champion",
    "validmind.model_validation.sklearn.ROCCurve:logreg_champion"
]

Evaluate performance of the champion model

Now, let's run and log our batch of model performance tests using our testing dataset (vm_test_ds) for our champion model:

  • The test set serves as a proxy for real-world data, providing an unbiased estimate of model performance since it was not used during training or tuning.
  • The test set also acts as protection against selection bias and model tweaking, serving as a final, independent checkpoint.
for test in mpt:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_test_ds, "model" : vm_log_model,
        },
    ).log()

Classifier Performance Logreg Champion

The Classifier Performance test evaluates the predictive effectiveness of the classification model by reporting precision, recall, F1-Score, accuracy, and ROC AUC metrics. The results are presented for each class, as well as macro and weighted averages, providing a comprehensive view of model performance across all classes. The summary table includes class-specific and aggregate metrics, while overall accuracy and ROC AUC are reported separately to capture both threshold-dependent and threshold-independent performance.

Key insights:

  • Balanced class performance: Precision, recall, and F1-Score are similar across both classes, with values ranging from 0.6254 to 0.6582 for precision, 0.6265 to 0.6571 for recall, and 0.6409 to 0.6420 for F1-Score, indicating no substantial performance disparity between classes.
  • Consistent aggregate metrics: Macro and weighted averages for precision, recall, and F1-Score are closely aligned (all approximately 0.6414–0.6422), reflecting uniform model behavior across the dataset.
  • Moderate overall accuracy: The model achieves an accuracy of 0.6414, indicating that approximately 64% of predictions match the true class labels.
  • ROC AUC indicates moderate separability: The ROC AUC score of 0.6903 suggests the model has moderate ability to distinguish between classes.

The results indicate that the classification model demonstrates consistent and balanced performance across both classes, with moderate accuracy and ROC AUC values. The close alignment of macro and weighted averages with class-specific metrics suggests uniform predictive behavior without significant class imbalance effects. The ROC AUC score further supports moderate discriminative capability, providing a comprehensive assessment of model effectiveness.

Tables

Precision, Recall, and F1

Class Precision Recall F1
0 0.6254 0.6571 0.6409
1 0.6582 0.6265 0.6420
Weighted Average 0.6422 0.6414 0.6414
Macro Average 0.6418 0.6418 0.6414

Accuracy and ROC AUC

Metric Value
Accuracy 0.6414
ROC AUC 0.6903
2026-01-30 23:20:36,727 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion does not exist in model's document

Confusion Matrix Logreg Champion

The Confusion Matrix test evaluates the classification performance of the logistic regression model by comparing predicted and actual class labels, providing a breakdown of true positives, true negatives, false positives, and false negatives. The resulting matrix visually displays the distribution of correct and incorrect predictions across both classes. The matrix quantifies the model's ability to distinguish between the two classes and highlights the types and frequencies of classification errors.

Key insights:

  • Comparable true positive and true negative counts: The model correctly identified 208 true positives and 207 true negatives, indicating balanced detection capability across both classes.
  • Substantial false negative and false positive rates: There are 124 false negatives and 108 false positives, reflecting a notable proportion of misclassifications in both directions.
  • Error rates are non-negligible: The combined total of false negatives and false positives (232) is significant relative to the total number of predictions, suggesting room for improvement in overall classification accuracy.

The confusion matrix reveals that the logistic regression model demonstrates similar effectiveness in identifying both positive and negative cases, with true positive and true negative counts closely matched. However, the presence of substantial false negative and false positive counts indicates that misclassification rates are material and may impact downstream decision processes. The results highlight the importance of further analysis to understand the sources of error and potential model refinements.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion:e0c3
2026-01-30 23:20:46,033 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion does not exist in model's document

❌ Minimum Accuracy Logreg Champion

The Minimum Accuracy test evaluates whether the model's prediction accuracy meets or exceeds a specified threshold, providing a direct measure of overall classification correctness. The results table presents the model's achieved accuracy score, the minimum threshold set for the test, and the corresponding pass/fail outcome. The model's accuracy score is reported as 0.6414, with a threshold of 0.7, and the test outcome is marked as "Fail."

Key insights:

  • Accuracy below threshold: The model achieved an accuracy score of 0.6414, which is below the specified minimum threshold of 0.7.
  • Test outcome is Fail: The test result is recorded as "Fail," indicating the model did not meet the minimum accuracy requirement.

The results indicate that the model's overall prediction accuracy does not satisfy the predefined minimum standard set by the test. The observed accuracy shortfall is material, as the score falls below the threshold by approximately 0.0586. This outcome highlights a gap in the model's current predictive performance relative to the established benchmark.

Tables

Score Threshold Pass/Fail
0.6414 0.7 Fail
2026-01-30 23:20:49,275 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion does not exist in model's document

✅ Minimum F1 Score Logreg Champion

The MinimumF1Score:logreg_champion test evaluates whether the model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring balanced performance between precision and recall. The results table presents the observed F1 score, the minimum threshold for passing, and the pass/fail outcome. The model's F1 score is reported as 0.642, with a threshold of 0.5, and the test outcome is marked as "Pass".

Key insights:

  • F1 score exceeds minimum threshold: The model achieved an F1 score of 0.642, which is above the required threshold of 0.5.
  • Test outcome is Pass: The model met the minimum performance standard for balanced precision and recall as defined by the test criteria.

The results indicate that the model demonstrates balanced classification performance on the validation set, with the F1 score surpassing the established minimum requirement. The test outcome confirms that the model satisfies the predefined standard for this metric.

Tables

Score Threshold Pass/Fail
0.642 0.5 Pass
2026-01-30 23:20:52,019 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:logreg_champion does not exist in model's document

ROC Curve Logreg Champion

The ROC Curve test evaluates the binary classification performance of the log_model_champion by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under the Curve (AUC) on the test_dataset_final. The resulting plot displays the model's true positive rate against the false positive rate across all classification thresholds, with a reference line indicating random performance (AUC = 0.5). The AUC value is reported directly on the plot, providing a summary measure of the model's discriminative ability.

Key insights:

  • AUC indicates moderate discriminative power: The model achieves an AUC of 0.69, which is above the random baseline of 0.5, indicating the model can distinguish between the two classes with moderate effectiveness.
  • ROC curve consistently above random line: The ROC curve remains above the diagonal reference line across most thresholds, confirming the model's ability to achieve higher true positive rates than would be expected by chance.
  • No evidence of near-random performance: The observed AUC and ROC curve position do not approach the high-risk threshold of 0.5, suggesting the model maintains a meaningful level of classification skill.

The test results demonstrate that the log_model_champion exhibits moderate classification performance on the test dataset, as evidenced by an AUC of 0.69 and a ROC curve that consistently outperforms random guessing. The model's ability to separate positive and negative classes is present but not strong, indicating room for further improvement in discriminative capability. No indications of high-risk or near-random performance are observed in this evaluation.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:logreg_champion:996f
2026-01-30 23:21:01,759 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:logreg_champion does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.

Log an artifact

As we can observe from the output above, our champion model doesn't pass the MinimumAccuracy test at the out-of-the-box default threshold of 0.7, so let's log an artifact (finding) in the ValidMind Platform (Need more help?):

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Validation Report under Documents.

  3. Locate the Data Preparation section and click on 2.2.2. Model Performance to expand that section.

  4. Under the Model Performance Metrics section, locate Artifacts then click Link Artifact to Report:

    Screenshot showing the validation report with the link artifact option highlighted

  5. Select Validation Issue as the type of artifact.

  6. Click + Add Validation Issue to add a validation issue type artifact.

  7. Enter in the details for your validation issue, for example:

    • TITLE — Champion Logistic Regression Model Fails Minimum Accuracy Threshold
    • RISK AREA — Model Performance
    • DOCUMENTATION SECTION — 3.2. Model Evaluation
    • DESCRIPTION — The logistic regression champion model was subjected to a Minimum Accuracy test to determine whether its predictive accuracy meets the predefined performance threshold of 0.7. The model achieved an accuracy score of 0.6414, which falls below the required minimum. As a result, the test produced a Fail outcome.
  8. Click Save.

  9. Select the validation issue you just added to link to your validation report and click Update Linked Artifacts to insert your validation issue.

  10. Click on the validation issue to expand the issue, where you can adjust details such as severity, owner, due date, status, etc. as well as include proposed remediation plans or supporting documentation as attachments.
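If your validation standards call for a different accuracy bar than the out-of-the-box default of 0.7, the same test can also be re-run with a custom min_threshold via params, just as we passed max_threshold to the correlation test earlier. A minimal sketch (the 0.65 value is purely illustrative, not a recommended threshold):

# Re-run MinimumAccuracy with a custom threshold (illustrative value only)
vm.tests.run_test(
    "validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion_custom_threshold",
    params={"min_threshold": 0.65},
    inputs={"dataset": vm_test_ds, "model": vm_log_model},
)

Call .log() on the result if you want to make it available to link into your report.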

Evaluate performance of challenger model

We've now conducted tests on our champion model similar to those run by the model development team, with the aim of verifying their test results.

Next, let's see how our challenger models compare. We'll use the same batch of tests here as we did in mpt, but append a different result_id to indicate that these results should be associated with our challenger model:

mpt_chall = [
    "validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger",
    "validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger",
    "validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger"
]

We'll use input_grid to run each test once for each model against the same vm_test_ds dataset, so that the results for both models are logged together for comparison:

for test in mpt_chall:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds], "model" : [vm_log_model,vm_rf_model]
        }
    ).log()

Classifier Performance Champion Vs Challenger

The Classifier Performance test evaluates the predictive effectiveness of classification models by reporting precision, recall, F1-Score, accuracy, and ROC AUC metrics. The results compare two models, "log_model_champion" and "rf_model," across these metrics for both classes, as well as macro and weighted averages. The tables present detailed class-level and aggregate performance scores, enabling direct comparison of model discrimination and overall accuracy.

Key insights:

  • rf_model outperforms log_model_champion across all metrics: rf_model achieves higher precision, recall, and F1-Score for both classes, as well as higher macro and weighted averages.
  • Higher accuracy and ROC AUC for rf_model: rf_model records an accuracy of 0.6955 and ROC AUC of 0.7577, compared to log_model_champion's accuracy of 0.6414 and ROC AUC of 0.6903.
  • Consistent class-level performance: Both models show similar precision and recall values across classes 0 and 1, with no substantial imbalance between classes.
  • rf_model demonstrates stronger class discrimination: The higher ROC AUC for rf_model indicates improved ability to distinguish between classes relative to log_model_champion.

The results indicate that rf_model provides superior classification performance compared to log_model_champion, as evidenced by higher scores across all evaluated metrics. Both models maintain balanced performance between classes, but rf_model demonstrates enhanced accuracy and discrimination capability, as reflected in its higher ROC AUC and F1-Score values.

Tables

model Class Precision Recall F1
log_model_champion 0 0.6254 0.6571 0.6409
log_model_champion 1 0.6582 0.6265 0.6420
log_model_champion Weighted Average 0.6422 0.6414 0.6414
log_model_champion Macro Average 0.6418 0.6418 0.6414
rf_model 0 0.6844 0.6952 0.6898
rf_model 1 0.7064 0.6958 0.7011
rf_model Weighted Average 0.6957 0.6955 0.6956
rf_model Macro Average 0.6954 0.6955 0.6954
model Metric Value
log_model_champion Accuracy 0.6414
log_model_champion ROC AUC 0.6903
rf_model Accuracy 0.6955
rf_model ROC AUC 0.7577
2026-01-30 23:21:08,316 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger does not exist in model's document

Confusion Matrix Champion Vs Challenger

The Confusion Matrix: champion_vs_challenger test evaluates the predictive performance of classification models by quantifying the counts of true positives, true negatives, false positives, and false negatives. The results are presented as annotated heatmaps for both the champion (log_model_champion) and challenger (rf_model) models, allowing for direct comparison of classification outcomes across both models. Each matrix cell displays the count of predictions for each outcome type, providing a detailed breakdown of model performance on the test set.

Key insights:

  • Challenger model achieves higher true positive and true negative counts: The rf_model records 231 true positives and 219 true negatives, compared to 208 true positives and 207 true negatives for the log_model_champion.
  • Challenger model reduces both false positives and false negatives: The rf_model shows 96 false positives and 101 false negatives, while the log_model_champion has 108 false positives and 124 false negatives.
  • Overall error reduction in challenger model: The total number of misclassifications (false positives plus false negatives) is lower for the rf_model (197) than for the log_model_champion (232).

The confusion matrix results indicate that the challenger model (rf_model) demonstrates improved classification performance relative to the champion model (log_model_champion), with higher counts of correct predictions and lower counts of both types of misclassifications. This reflects a more effective balance between sensitivity and specificity in the challenger model, as evidenced by the reduction in both false positives and false negatives. The observed differences highlight the challenger model's enhanced ability to correctly identify both positive and negative cases within the evaluated dataset.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:11bf
ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:9d61
2026-01-30 23:21:19,109 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger does not exist in model's document

❌ Minimum Accuracy Champion Vs Challenger

The Minimum Accuracy test evaluates whether each model's prediction accuracy meets or exceeds a specified threshold, with results presented for both the log_model_champion and rf_model. The table displays the accuracy scores, the threshold applied (0.7), and the pass/fail outcome for each model. Both models' accuracy scores are compared directly to the threshold to determine if the minimum performance criterion is satisfied.

Key insights:

  • Both models fall below the accuracy threshold: log_model_champion achieved an accuracy score of 0.6414 and rf_model achieved 0.6955, both below the 0.7 threshold.
  • Test outcome is fail for all models evaluated: Both models received a "Fail" result, indicating neither met the minimum accuracy requirement specified by the test.

Both evaluated models did not achieve the minimum accuracy threshold of 0.7, as indicated by their respective scores of 0.6414 and 0.6955. The test results reflect that, under the current configuration and dataset, neither model satisfies the predefined accuracy criterion. This outcome highlights a gap between observed model performance and the minimum standard set for accuracy.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6414 0.7 Fail
rf_model 0.6955 0.7 Fail
2026-01-30 23:21:25,390 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger does not exist in model's document

✅ Minimum F1 Score Champion Vs Challenger

The MinimumF1Score:champion_vs_challenger test evaluates whether each model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring balanced performance between precision and recall. The results table presents F1 scores for two models—log_model_champion and rf_model—alongside the minimum threshold and pass/fail status. Both models are assessed against a threshold of 0.5, with their respective F1 scores and outcomes displayed.

Key insights:

  • Both models exceed the minimum F1 threshold: log_model_champion achieved an F1 score of 0.642 and rf_model achieved 0.7011, both surpassing the 0.5 threshold.
  • Consistent pass status across models: Both models are marked as "Pass," indicating that neither model falls below the required F1 score for validation set performance.
  • rf_model demonstrates higher F1 performance: rf_model outperforms log_model_champion by approximately 0.0591 in F1 score, indicating stronger balance between precision and recall.

Both evaluated models meet the minimum F1 score requirement on the validation set, indicating balanced classification performance above the established threshold. The rf_model demonstrates the highest F1 score among the tested models, while log_model_champion also maintains performance above the minimum standard. No models in this test instance exhibit F1 scores indicative of high risk.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6420 0.5 Pass
rf_model 0.7011 0.5 Pass
2026-01-30 23:21:32,078 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger does not exist in model's document

ROC Curve Champion Vs Challenger

The ROC Curve test evaluates the discrimination ability of binary classification models by plotting the True Positive Rate against the False Positive Rate at various thresholds and calculating the Area Under the Curve (AUC) score. The results present ROC curves and AUC values for two models—log_model_champion and rf_model—on the test_dataset_final, with each curve compared against a random classifier baseline (AUC = 0.5). The ROC curves and corresponding AUC scores provide a visual and quantitative assessment of each model's ability to distinguish between the two classes.

Key insights:

  • rf_model demonstrates higher discrimination: The rf_model achieves an AUC of 0.76, indicating stronger separation between classes compared to the log_model_champion.
  • log_model_champion shows moderate performance: The log_model_champion records an AUC of 0.69, reflecting moderate discriminative ability above the random baseline.
  • Both models outperform random classification: Both ROC curves are consistently above the diagonal line representing random performance (AUC = 0.5), confirming meaningful predictive power in both models.

The results indicate that both models provide measurable discrimination between classes, with the rf_model exhibiting superior performance as reflected by a higher AUC score. The log_model_champion also demonstrates moderate effectiveness, with both models maintaining ROC curves above the random baseline throughout the threshold range.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:78cc
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:c1e8
2026-01-30 23:21:42,870 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger does not exist in model's document
Based on the performance metrics, our challenger random forest classification model outperforms our champion on every reported metric, although in this run it still falls just short of the default MinimumAccuracy threshold of 0.7.

In your validation report, support the recommendation in your validation issue's Proposed Remediation Plan to investigate the use of our challenger model by inserting the performance tests we logged with this notebook into the appropriate section.

Run diagnostic tests

Next, we want to compare the robustness and stability of our champion and challenger models.

Use list_tests() to list all available model diagnosis tests applicable to classification tasks:

vm.tests.list_tests(tags=["model_diagnosis"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.OverfitDiagnosis Overfit Diagnosis Assesses potential overfitting in a model's predictions, identifying regions where performance between training and... True True ['model', 'datasets'] {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}} ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis'] ['classification', 'regression']
validmind.model_validation.sklearn.RobustnessDiagnosis Robustness Diagnosis Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.... True True ['datasets', 'model'] {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': 'List', 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}} ['sklearn', 'model_diagnosis', 'visualization'] ['classification', 'regression']
validmind.model_validation.sklearn.WeakspotsDiagnosis Weakspots Diagnosis Identifies and visualizes weak spots in a machine learning model's performance across various sections of the... True True ['datasets', 'model'] {'features_columns': {'type': 'Optional', 'default': None}, 'metrics': {'type': 'Optional', 'default': None}, 'thresholds': {'type': 'Optional', 'default': None}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization'] ['classification', 'text_classification']

Let's now use the OverfitDiagnosis test to assess the models for potential signs of overfitting and identify any sub-segments where performance may be inconsistent.

Overfitting occurs when a model learns the training data too well, capturing not only the true pattern but also noise and random fluctuations, resulting in excellent performance on the training dataset but poor generalization to new, unseen data:

  • Since the training dataset (vm_train_ds) was used to fit the model, we use this set to establish a baseline performance for how well the model performs on data it has already seen.
  • The testing dataset (vm_test_ds) was never seen during training, and here simulates real-world generalization, or how well the model performs on new, unseen data.
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger",
    input_grid={
        "datasets": [[vm_train_ds,vm_test_ds]],
        "model" : [vm_log_model,vm_rf_model]
    }
).log()

Overfit Diagnosis Champion Vs Challenger

The Overfit Diagnosis test evaluates the extent to which model performance on the training set diverges from performance on the test set across feature segments, using AUC as the metric for classification models. The results are presented for both the logistic regression (log_model_champion) and random forest (rf_model) models, with AUC gaps calculated for binned regions of key features. Visualizations and tabular data highlight regions where the absolute difference in AUC between training and test sets exceeds the default threshold of 0.04, indicating potential overfitting.

Key insights:

  • Localized overfitting in log_model_champion: For the logistic regression model, the largest AUC gaps are observed in specific regions, such as CreditScore (400–450] (gap = 0.2784), Balance (25089.8–50179.6] (gap = 0.4667), and EstimatedSalary (119889.5–139869.1] (gap = 0.131). Most other feature segments show AUC gaps below the threshold.
  • Widespread and severe overfitting in rf_model: The random forest model exhibits consistently high AUC gaps across nearly all feature segments, with many regions showing gaps well above 0.2 and some as high as 1.0 (e.g., Balance (25089.8–50179.6] and (200718.5–225808.3]). All major features, including CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Geography, and Gender, display multiple segments with substantial overfitting.
  • Feature segments with small sample sizes show extreme gaps: The most pronounced AUC gaps for both models often occur in bins with low numbers of test records, such as Balance (25089.8–50179.6] and (200718.5–225808.3], indicating instability in these regions.
  • Logistic regression model demonstrates more stable generalization: Compared to the random forest, the logistic regression model shows fewer and less severe overfit regions, with most feature segments remaining below the 0.04 threshold.

The results indicate that the random forest model is highly prone to overfitting across a broad range of feature segments, with AUC gaps frequently exceeding the diagnostic threshold and reaching extreme values in regions with limited data. In contrast, the logistic regression model demonstrates more stable generalization, with overfitting largely confined to isolated regions with small sample sizes. These findings highlight the importance of monitoring segment-level performance, particularly in low-sample regions, to ensure robust model behavior and mitigate overfitting risk.

Tables

| Model | Feature | Slice | Training Records | Test Records | Training AUC | Test AUC | Gap |
|---|---|---|---|---|---|---|---|
| log_model_champion | CreditScore | (400.0, 450.0] | 55 | 12 | 0.6499 | 0.3714 | 0.2784 |
| log_model_champion | Tenure | (-0.01, 1.0] | 360 | 106 | 0.6876 | 0.6383 | 0.0493 |
| log_model_champion | Tenure | (1.0, 2.0] | 280 | 55 | 0.6377 | 0.5760 | 0.0617 |
| log_model_champion | Tenure | (7.0, 8.0] | 265 | 69 | 0.7331 | 0.6777 | 0.0554 |
| log_model_champion | Balance | (25089.809, 50179.618] | 21 | 6 | 0.6667 | 0.2000 | 0.4667 |
| log_model_champion | Balance | (50179.618, 75269.427] | 88 | 34 | 0.6258 | 0.4607 | 0.1651 |
| log_model_champion | Balance | (200718.472, 225808.281] | 19 | 2 | 0.1000 | 0.0000 | 0.1000 |
| log_model_champion | NumOfProducts | (2.8, 3.1] | 146 | 43 | 0.8222 | 0.7805 | 0.0417 |
| log_model_champion | EstimatedSalary | (119889.492, 139869.144] | 259 | 57 | 0.6689 | 0.5379 | 0.1310 |
| log_model_champion | EstimatedSalary | (159848.796, 179828.448] | 264 | 69 | 0.6900 | 0.6316 | 0.0584 |
| rf_model | CreditScore | (400.0, 450.0] | 55 | 12 | 1.0000 | 0.6000 | 0.4000 |
| rf_model | CreditScore | (450.0, 500.0] | 111 | 38 | 1.0000 | 0.6528 | 0.3472 |
| rf_model | CreditScore | (500.0, 550.0] | 284 | 63 | 1.0000 | 0.8427 | 0.1573 |
| rf_model | CreditScore | (550.0, 600.0] | 367 | 90 | 1.0000 | 0.7528 | 0.2472 |
| rf_model | CreditScore | (600.0, 650.0] | 465 | 122 | 1.0000 | 0.7170 | 0.2830 |
| rf_model | CreditScore | (650.0, 700.0] | 491 | 113 | 1.0000 | 0.8160 | 0.1840 |
| rf_model | CreditScore | (700.0, 750.0] | 396 | 100 | 1.0000 | 0.7438 | 0.2562 |
| rf_model | CreditScore | (750.0, 800.0] | 238 | 67 | 1.0000 | 0.7037 | 0.2963 |
| rf_model | CreditScore | (800.0, 850.0] | 165 | 41 | 1.0000 | 0.7643 | 0.2357 |
| rf_model | Tenure | (-0.01, 1.0] | 360 | 106 | 1.0000 | 0.6726 | 0.3274 |
| rf_model | Tenure | (1.0, 2.0] | 280 | 55 | 1.0000 | 0.7080 | 0.2920 |
| rf_model | Tenure | (2.0, 3.0] | 272 | 58 | 1.0000 | 0.7298 | 0.2702 |
| rf_model | Tenure | (3.0, 4.0] | 261 | 58 | 1.0000 | 0.8226 | 0.1774 |
| rf_model | Tenure | (4.0, 5.0] | 245 | 60 | 1.0000 | 0.7636 | 0.2364 |
| rf_model | Tenure | (5.0, 6.0] | 252 | 67 | 1.0000 | 0.7996 | 0.2004 |
| rf_model | Tenure | (6.0, 7.0] | 261 | 68 | 1.0000 | 0.8054 | 0.1946 |
| rf_model | Tenure | (7.0, 8.0] | 265 | 69 | 1.0000 | 0.7565 | 0.2435 |
| rf_model | Tenure | (8.0, 9.0] | 254 | 69 | 1.0000 | 0.7581 | 0.2419 |
| rf_model | Tenure | (9.0, 10.0] | 135 | 37 | 1.0000 | 0.7794 | 0.2206 |
| rf_model | Balance | (-250.898, 25089.809] | 814 | 222 | 1.0000 | 0.7789 | 0.2211 |
| rf_model | Balance | (25089.809, 50179.618] | 21 | 6 | 1.0000 | 0.0000 | 1.0000 |
| rf_model | Balance | (50179.618, 75269.427] | 88 | 34 | 1.0000 | 0.5964 | 0.4036 |
| rf_model | Balance | (75269.427, 100359.236] | 305 | 72 | 1.0000 | 0.7193 | 0.2807 |
| rf_model | Balance | (100359.236, 125449.045] | 594 | 155 | 1.0000 | 0.7466 | 0.2534 |
| rf_model | Balance | (125449.045, 150538.854] | 477 | 111 | 1.0000 | 0.6857 | 0.3143 |
| rf_model | Balance | (150538.854, 175628.663] | 211 | 29 | 1.0000 | 0.7813 | 0.2187 |
| rf_model | Balance | (175628.663, 200718.472] | 54 | 16 | 1.0000 | 0.8016 | 0.1984 |
| rf_model | Balance | (200718.472, 225808.281] | 19 | 2 | 1.0000 | 0.0000 | 1.0000 |
| rf_model | NumOfProducts | (0.997, 1.3] | 1494 | 363 | 1.0000 | 0.6720 | 0.3280 |
| rf_model | NumOfProducts | (1.9, 2.2] | 908 | 234 | 1.0000 | 0.6035 | 0.3965 |
| rf_model | NumOfProducts | (2.8, 3.1] | 146 | 43 | 1.0000 | 0.2622 | 0.7378 |
| rf_model | HasCrCard | (-0.001, 0.1] | 786 | 196 | 1.0000 | 0.7901 | 0.2099 |
| rf_model | HasCrCard | (0.9, 1.0] | 1799 | 451 | 1.0000 | 0.7452 | 0.2548 |
| rf_model | IsActiveMember | (-0.001, 0.1] | 1360 | 370 | 1.0000 | 0.7370 | 0.2630 |
| rf_model | IsActiveMember | (0.9, 1.0] | 1225 | 277 | 1.0000 | 0.7588 | 0.2412 |
| rf_model | EstimatedSalary | (-188.217, 19991.232] | 251 | 67 | 1.0000 | 0.8802 | 0.1198 |
| rf_model | EstimatedSalary | (19991.232, 39970.884] | 259 | 54 | 1.0000 | 0.7514 | 0.2486 |
| rf_model | EstimatedSalary | (39970.884, 59950.536] | 250 | 73 | 1.0000 | 0.7879 | 0.2121 |
| rf_model | EstimatedSalary | (59950.536, 79930.188] | 250 | 72 | 1.0000 | 0.7526 | 0.2474 |
| rf_model | EstimatedSalary | (79930.188, 99909.84] | 255 | 66 | 1.0000 | 0.7398 | 0.2602 |
| rf_model | EstimatedSalary | (99909.84, 119889.492] | 286 | 57 | 1.0000 | 0.7716 | 0.2284 |
| rf_model | EstimatedSalary | (119889.492, 139869.144] | 259 | 57 | 1.0000 | 0.7014 | 0.2986 |
| rf_model | EstimatedSalary | (139869.144, 159848.796] | 253 | 68 | 1.0000 | 0.7934 | 0.2066 |
| rf_model | EstimatedSalary | (159848.796, 179828.448] | 264 | 69 | 1.0000 | 0.6936 | 0.3064 |
| rf_model | EstimatedSalary | (179828.448, 199808.1] | 258 | 64 | 1.0000 | 0.7144 | 0.2856 |
| rf_model | Geography_Germany | (-0.001, 0.1] | 1770 | 453 | 1.0000 | 0.7308 | 0.2692 |
| rf_model | Geography_Germany | (0.9, 1.0] | 815 | 194 | 1.0000 | 0.7310 | 0.2690 |
| rf_model | Geography_Spain | (-0.001, 0.1] | 1989 | 490 | 1.0000 | 0.7532 | 0.2468 |
| rf_model | Geography_Spain | (0.9, 1.0] | 596 | 157 | 1.0000 | 0.7664 | 0.2336 |
| rf_model | Gender_Male | (-0.001, 0.1] | 1255 | 321 | 1.0000 | 0.7580 | 0.2420 |
| rf_model | Gender_Male | (0.9, 1.0] | 1330 | 326 | 1.0000 | 0.7485 | 0.2515 |

Figures

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:1f02
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:a60f
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:25c7
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:795e
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:38c1
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:b9ba
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:c3b0
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:e77b
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:59bb
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:e2c6
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:413a
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:aa64
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:3ea5
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:4efa
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:40d7
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:eac3
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2eb9
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:c30e
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6566
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:7efa
2026-01-30 23:22:12,536 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger does not exist in model's document
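
If you want to work with per-segment results like these outside of the test output, the per-slice table can be loaded into pandas and screened for risky segments. The snippet below is a minimal sketch rather than part of the ValidMind API: `overfit_df`, the 0.15 gap threshold, and the 30-record minimum are illustrative assumptions, and only a few rows from the table above are reproduced.

import pandas as pd

# Illustrative copy of a few rows from the per-slice table above;
# in practice you would load the full results (e.g. from a CSV export)
overfit_df = pd.DataFrame(
    [
        ["rf_model", "Balance", "(25089.809, 50179.618]", 21, 6, 1.0000, 0.0000, 1.0000],
        ["rf_model", "NumOfProducts", "(2.8, 3.1]", 146, 43, 1.0000, 0.2622, 0.7378],
        ["log_model_champion", "Tenure", "(-0.01, 1.0]", 360, 106, 0.6876, 0.6383, 0.0493],
    ],
    columns=["model", "feature", "slice", "n_train", "n_test", "train_auc", "test_auc", "gap"],
)

# Flag slices where the train/test AUC gap is large and the test sample is small,
# i.e. the overfitting-prone, low-sample regions called out in the narrative above
GAP_THRESHOLD = 0.15  # illustrative cut-off
MIN_TEST_ROWS = 30    # illustrative minimum sample size

risky_slices = overfit_df[
    (overfit_df["gap"] > GAP_THRESHOLD) & (overfit_df["n_test"] < MIN_TEST_ROWS)
]
print(risky_slices[["model", "feature", "slice", "n_test", "gap"]])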

Let's also conduct robustness and stability testing of the two models with the RobustnessDiagnosis test. Robustness refers to a model's ability to maintain consistent performance when its inputs are perturbed or noisy, and stability refers to its ability to produce consistent outputs across different data subsets over time.

Again, we'll use both the training and testing datasets to establish baseline performance and to simulate real-world generalization:

# Run and log the robustness diagnosis for both models across the training and testing datasets
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

❌ Robustness Diagnosis Champion Vs Log Regression

The Robustness Diagnosis test evaluates the resilience of machine learning models to input perturbations by measuring AUC decay under increasing levels of Gaussian noise. The results compare the performance of a logistic regression model (log_model_champion) and a random forest model (rf_model) across both training and test datasets, with AUC and performance decay tracked at multiple perturbation scales. Plots visualize the relationship between perturbation size and AUC, highlighting the models' sensitivity to noisy input features.

Key insights:

  • Logistic regression model demonstrates stable robustness: Across all perturbation sizes (0.1 to 0.5), the log_model_champion maintains AUC values between 0.6658 and 0.6947 on both train and test datasets, with performance decay remaining minimal (maximum observed decay of 0.0244 on test data at perturbation size 0.5).
  • Random forest model exhibits pronounced performance decay on training data: The rf_model shows a marked decrease in training AUC from 1.0 at baseline to 0.7936 at perturbation size 0.5, with performance decay increasing steadily and exceeding passing thresholds from perturbation size 0.2 onward.
  • Random forest test performance remains relatively stable: Despite significant training AUC decay, the rf_model test AUC remains between 0.7111 and 0.7594 across all perturbation sizes, with performance decay on test data not exceeding 0.0466.
  • Passing criteria not met for random forest on training data at higher noise levels: The rf_model fails the robustness test on training data for perturbation sizes 0.2 and above, as indicated by the "Passed: false" status, while all other scenarios pass.

The results indicate that the logistic regression model maintains consistent robustness to Gaussian noise, with negligible AUC decay across both training and test datasets. In contrast, the random forest model is highly sensitive to input perturbations on training data, exhibiting substantial performance decay and failing robustness criteria at moderate to high noise levels, though its test set performance remains comparatively stable. These findings highlight a divergence in robustness characteristics between the two model types, with logistic regression demonstrating greater resilience to input noise under the tested conditions.

Tables

| Model | Perturbation Size | Dataset | Row Count | AUC | Performance Decay | Passed |
|---|---|---|---|---|---|---|
| log_model_champion | Baseline (0.0) | train_dataset_final | 2585 | 0.6767 | 0.0000 | True |
| log_model_champion | Baseline (0.0) | test_dataset_final | 647 | 0.6903 | 0.0000 | True |
| log_model_champion | 0.1 | train_dataset_final | 2585 | 0.6755 | 0.0012 | True |
| log_model_champion | 0.1 | test_dataset_final | 647 | 0.6932 | -0.0028 | True |
| log_model_champion | 0.2 | train_dataset_final | 2585 | 0.6769 | -0.0002 | True |
| log_model_champion | 0.2 | test_dataset_final | 647 | 0.6947 | -0.0044 | True |
| log_model_champion | 0.3 | train_dataset_final | 2585 | 0.6695 | 0.0072 | True |
| log_model_champion | 0.3 | test_dataset_final | 647 | 0.6857 | 0.0046 | True |
| log_model_champion | 0.4 | train_dataset_final | 2585 | 0.6700 | 0.0067 | True |
| log_model_champion | 0.4 | test_dataset_final | 647 | 0.6946 | -0.0043 | True |
| log_model_champion | 0.5 | train_dataset_final | 2585 | 0.6658 | 0.0109 | True |
| log_model_champion | 0.5 | test_dataset_final | 647 | 0.6659 | 0.0244 | True |
| rf_model | Baseline (0.0) | train_dataset_final | 2585 | 1.0000 | 0.0000 | True |
| rf_model | Baseline (0.0) | test_dataset_final | 647 | 0.7577 | 0.0000 | True |
| rf_model | 0.1 | train_dataset_final | 2585 | 0.9849 | 0.0151 | True |
| rf_model | 0.1 | test_dataset_final | 647 | 0.7594 | -0.0017 | True |
| rf_model | 0.2 | train_dataset_final | 2585 | 0.9469 | 0.0531 | False |
| rf_model | 0.2 | test_dataset_final | 647 | 0.7594 | -0.0017 | True |
| rf_model | 0.3 | train_dataset_final | 2585 | 0.9012 | 0.0988 | False |
| rf_model | 0.3 | test_dataset_final | 647 | 0.7395 | 0.0181 | True |
| rf_model | 0.4 | train_dataset_final | 2585 | 0.8432 | 0.1568 | False |
| rf_model | 0.4 | test_dataset_final | 647 | 0.7372 | 0.0205 | True |
| rf_model | 0.5 | train_dataset_final | 2585 | 0.7936 | 0.2064 | False |
| rf_model | 0.5 | test_dataset_final | 647 | 0.7111 | 0.0466 | True |

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:cedd
ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:5c46
2026-01-30 23:22:33,255 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression does not exist in model's document
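
To build intuition for what RobustnessDiagnosis measures, the sketch below reproduces the basic perturb-and-rescore idea by hand: add Gaussian noise to the numeric features, rescore the model, and track the AUC decay. This is an illustration only, not ValidMind's implementation, and it assumes `log_model` is the fitted scikit-learn champion estimator while `x_test` and `y_test` hold the preprocessed test features and labels (these names are illustrative and may differ from the ones used earlier in this notebook).

import numpy as np
from sklearn.metrics import roc_auc_score

def auc_under_noise(model, X, y, scale, seed=42):
    """Score AUC after adding Gaussian noise (std = scale * column std) to numeric features."""
    rng = np.random.default_rng(seed)
    X_noisy = X.copy()
    for col in X_noisy.select_dtypes(include="number").columns:
        X_noisy[col] = X_noisy[col] + rng.normal(0, scale * X_noisy[col].std(), size=len(X_noisy))
    return roc_auc_score(y, model.predict_proba(X_noisy)[:, 1])

baseline_auc = auc_under_noise(log_model, x_test, y_test, scale=0.0)
for scale in [0.1, 0.2, 0.3, 0.4, 0.5]:
    auc = auc_under_noise(log_model, x_test, y_test, scale)
    print(f"scale={scale:.1f}  AUC={auc:.4f}  decay={baseline_auc - auc:.4f}")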

Run feature importance tests

We also want to verify the relative influence of different input features on our models' predictions, and compare the champion and challenger models to see whether either one offers more interpretable or intuitively sensible importance scores for its features.

Use list_tests() to identify all the feature importance tests for classification:

# Store the feature importance tests
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI
['validmind.model_validation.FeaturesAUC',
 'validmind.model_validation.sklearn.PermutationFeatureImportance',
 'validmind.model_validation.sklearn.SHAPGlobalImportance']

We'll only use our testing dataset (vm_test_ds) here, to provide a realistic, unseen sample that mimics future or production data, since the training dataset has already influenced our models during learning:

# Run and log our feature importance tests for both models for the testing dataset
for test in FI:
    vm.tests.run_test(
        f"{test}:champion_vs_challenger",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Features Champion Vs Challenger

The FeaturesAUC:champion_vs_challenger test evaluates the discriminatory power of each individual feature in a binary classification context by calculating the Area Under the Curve (AUC) for each feature independently. The resulting bar chart displays the AUC values for all features in the test_dataset_final dataset, with higher AUC values indicating stronger univariate separation between the two classes. The features are ranked from highest to lowest AUC, providing a clear view of which variables are most and least effective at distinguishing between classes on their own.

Key insights:

  • Geography_Germany exhibits highest univariate discrimination: Geography_Germany achieves the highest AUC, exceeding 0.6, indicating the strongest individual class separation among all features.
  • Balance and EstimatedSalary show moderate discriminatory power: Both Balance and EstimatedSalary have AUC values above 0.5, suggesting moderate ability to distinguish between classes independently.
  • Several features display limited univariate separation: Features such as IsActiveMember, NumOfProducts, and Gender_Male have AUC values close to or below 0.45, indicating limited standalone discriminatory power.

The results indicate that Geography_Germany is the most individually informative feature for class separation in this dataset, with Balance and EstimatedSalary also contributing moderate univariate discrimination. Other features demonstrate lower AUC values, reflecting limited ability to differentiate between classes when considered in isolation. This distribution of AUC scores provides insight into the relative univariate predictive strength of each feature within the dataset.

Figures

ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:d127
ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:cc21
2026-01-30 23:22:50,772 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.FeaturesAUC:champion_vs_challenger does not exist in model's document
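
The per-feature AUC that FeaturesAUC reports can be reproduced with plain scikit-learn by scoring each column on its own. A minimal sketch, assuming `x_test` and `y_test` are the preprocessed test features and labels (illustrative names):

from sklearn.metrics import roc_auc_score

# Univariate AUC: how well does each feature, on its own, rank the positive class?
feature_auc = {col: roc_auc_score(y_test, x_test[col]) for col in x_test.columns}

for col, auc in sorted(feature_auc.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<20} AUC = {auc:.3f}")

Note that a value below 0.5 simply means the feature ranks the classes in the inverse direction, while 0.5 indicates no univariate separation.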

Permutation Feature Importance Champion Vs Challenger

The Permutation Feature Importance (PFI) test evaluates the relative importance of each input feature by measuring the decrease in model performance when the feature's values are randomly permuted. The results are presented as bar plots for both the champion (logistic regression) and challenger (random forest) models, with each bar representing the impact of permuting a specific feature on model performance. The magnitude of each bar indicates the degree to which the model relies on that feature for prediction.

Key insights:

  • Distinct feature reliance between models: The champion model (logistic regression) assigns highest importance to Geography_Germany, IsActiveMember, and Gender_Male, while the challenger model (random forest) prioritizes NumOfProducts, Balance, and Geography_Germany.
  • Concentration of importance in top features: In both models, a small subset of features accounts for the majority of total importance, with the top three features in each model showing substantially higher importance values than the remaining features.
  • Low importance for several features: Features such as EstimatedSalary, Tenure, Geography_Spain, and CreditScore exhibit minimal impact on model performance in both models, as indicated by their low permutation importance scores.
  • Model-specific feature differentiation: The challenger model attributes the highest importance to NumOfProducts and Balance, whereas these features are less influential in the champion model, highlighting differences in feature utilization between model architectures.

The PFI results demonstrate that both the champion and challenger models rely heavily on a limited set of features, though the specific features and their relative importances differ between models. The champion model emphasizes categorical and membership-related variables, while the challenger model places greater weight on product and balance-related features. Several features consistently show low importance across both models, indicating limited contribution to predictive performance. These findings provide a clear view of model-specific feature dependencies and support further analysis of model behavior and risk.

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:e2e6
ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:8970
2026-01-30 23:23:10,744 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger does not exist in model's document
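
Permutation importance can also be cross-checked directly with scikit-learn's permutation_importance utility. The sketch below is illustrative: the estimator and data names are assumptions, and the scoring metric may differ from the one used by the ValidMind test.

from sklearn.inspection import permutation_importance

# Shuffle each feature on the test split and measure the drop in ROC AUC
pfi = permutation_importance(
    rf_model, x_test, y_test, scoring="roc_auc", n_repeats=10, random_state=42
)

for idx in pfi.importances_mean.argsort()[::-1]:
    print(
        f"{x_test.columns[idx]:<20} "
        f"mean drop = {pfi.importances_mean[idx]:.4f} (+/- {pfi.importances_std[idx]:.4f})"
    )

The mean drop over repeated shuffles is conceptually similar to what the ValidMind bar plots visualize, so large discrepancies between the two are worth investigating.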

SHAP Global Importance Champion Vs Challenger

The SHAPGlobalImportance:champion_vs_challenger test evaluates and visualizes the global feature importance for both the champion (log_model_champion) and challenger (rf_model) models using SHAP values. The results include mean importance plots and summary plots, which display the normalized SHAP values for each feature and illustrate the distribution and impact of feature values on model output. These visualizations enable a comparative assessment of how each model attributes importance to its input features.

Key insights:

  • Champion model dominated by IsActiveMember, Geography_Germany, and Gender_Male: The log_model_champion assigns the highest normalized SHAP importance to IsActiveMember, followed by Geography_Germany and Gender_Male, with these three features collectively accounting for the majority of the model's global importance.
  • Challenger model focuses on CreditScore and Tenure: The rf_model attributes nearly all of its normalized SHAP importance to CreditScore and Tenure, with no other features contributing materially to global importance.
  • Distinct feature attribution patterns between models: The champion model distributes importance across a broader set of features, while the challenger model concentrates importance on only two features, indicating divergent model reasoning.
  • SHAP summary plots show limited feature interaction in challenger: The summary plot for the rf_model reveals that only CreditScore and Tenure have non-trivial SHAP value distributions, with no evidence of significant interaction or contribution from other features.

The SHAP global importance analysis reveals a clear contrast in feature attribution between the champion and challenger models. The champion model leverages a wider array of features, with IsActiveMember, Geography_Germany, and Gender_Male being most influential, while the challenger model relies almost exclusively on CreditScore and Tenure. This divergence in feature importance profiles highlights fundamental differences in model logic and may have implications for model robustness and interpretability. No evidence of anomalous or illogical feature importance is observed in either model based on the SHAP plots.

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:fd4d
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:9c93
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:b295
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:5fec
2026-01-30 23:23:36,206 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger does not exist in model's document
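
If you want to inspect SHAP attributions outside of the ValidMind test, the shap package can be used directly. A minimal sketch, assuming the shap library is installed and that `rf_model` and `x_test` follow the (illustrative) names used in the sketches above; older and newer shap versions return slightly different shapes for binary classifiers, which the snippet accounts for.

import shap

# TreeExplainer supports tree ensembles such as the random forest challenger
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(x_test)

# Older shap returns one array per class; newer versions return (rows, features, classes)
if isinstance(shap_values, list):
    values = shap_values[1]
elif getattr(shap_values, "ndim", 2) == 3:
    values = shap_values[:, :, 1]
else:
    values = shap_values

# Global view: mean |SHAP| per feature, plus the beeswarm summary plot
shap.summary_plot(values, x_test, plot_type="bar")
shap.summary_plot(values, x_test)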

In summary

In this third notebook, you learned how to:

  • Train a potential challenger model to compare against your champion model
  • Pass the champion and challenger models, along with their predictions, to ValidMind
  • Run and log validation tests that evaluate the champion model against a potential challenger

Next steps

Finalize validation and reporting

Now that you're familiar with the basics of using the ValidMind Library to run and log validation tests, let's learn how to implement some custom tests and wrap up our validation: 4 — Finalize validation and reporting


Copyright © 2023-2026 ValidMind Inc. All rights reserved.
Refer to LICENSE for details.
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial