ValidMind for model validation 3 — Developing a potential challenger model

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this third notebook, develop a potential challenger model and then pass your model and its predictions to ValidMind.

A challenger model is an alternate model that attempts to outperform the champion model, ensuring that the best performing fit-for-purpose model is always considered for deployment. Challenger models also help avoid over-reliance on a single model, and allow testing of new features, algorithms, or data sources without disrupting the production lifecycle.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals.

Prerequisites

To develop potential challenger models with this notebook, you'll first need to have:

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be quite familiar to you — as we performed the same actions in the previous notebook, 2 — Start the model validation process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model validation" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    # document="validation-report",
)
Note: you may need to restart the kernel to use updated packages.
2026-03-12 20:44:24,699 - ERROR(validmind.api_client): Future releases will require `document` as one of the options you must provide to `vm.init()`. To learn more, refer to https://docs.validmind.ai/developer/validmind-library.html
2026-03-12 20:44:24,789 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load in the sample Bank Customer Churn Prediction dataset used to develop the champion model, which we will independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

Preprocess the dataset

We’ll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
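To confirm the rebalancing worked as intended, it's worth checking the class counts directly. A minimal sketch of the same technique, where a small synthetic frame stands in for `balanced_raw_df` (which only exists in the notebook session):

```python
import pandas as pd

# Synthetic stand-in for raw_copy_df: 3 exited, 7 retained customers
demo = pd.DataFrame({"Exited": [1] * 3 + [0] * 7})

exited = demo[demo["Exited"] == 1]
not_exited = demo[demo["Exited"] == 0].sample(n=len(exited), random_state=42)
balanced = pd.concat([exited, not_exited]).sample(frac=1, random_state=42)

# Both classes should now appear the same number of times
print(balanced["Exited"].value_counts())
```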

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As covered in the previous notebooks, before we can run tests we'll need to initialize a ValidMind dataset object with the init_dataset function:

# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and utilize the output to help us identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity. The results table presents the top ten strongest absolute correlations, listing the feature pairs, their Pearson correlation coefficients, and a Pass/Fail status based on a threshold of 0.3. Only one feature pair exceeds the threshold, while the remaining pairs show lower correlation magnitudes and pass the test criteria.

Key insights:

  • One feature pair exceeds correlation threshold: The pair (Age, Exited) has a correlation coefficient of 0.3674, surpassing the 0.3 threshold and receiving a Fail status.
  • All other feature pairs below threshold: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.0441 to 0.1874, all below the threshold and marked as Pass.
  • Predominantly weak linear relationships: Most feature pairs exhibit weak linear associations, with coefficients close to zero, indicating limited direct linear dependency among these features.

The results indicate that, with the exception of the (Age, Exited) pair, the dataset does not display strong linear relationships among the top correlated feature pairs. The overall correlation structure suggests low risk of widespread multicollinearity, with only isolated moderate correlation observed.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3674 Fail
(Balance, NumOfProducts) -0.1874 Pass
(IsActiveMember, Exited) -0.1856 Pass
(Balance, Exited) 0.1565 Pass
(Age, Balance) 0.0594 Pass
(NumOfProducts, Exited) -0.0554 Pass
(Tenure, IsActiveMember) -0.0523 Pass
(Age, NumOfProducts) -0.0507 Pass
(HasCrCard, IsActiveMember) -0.0442 Pass
(NumOfProducts, IsActiveMember) 0.0441 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3674 Fail
1 (Balance, NumOfProducts) -0.1874 Pass
2 (IsActiveMember, Exited) -0.1856 Pass
3 (Balance, Exited) 0.1565 Pass
4 (Age, Balance) 0.0594 Pass
5 (NumOfProducts, Exited) -0.0554 Pass
6 (Tenure, IsActiveMember) -0.0523 Pass
7 (Age, NumOfProducts) -0.0507 Pass
8 (HasCrCard, IsActiveMember) -0.0442 Pass
9 (NumOfProducts, IsActiveMember) 0.0441 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
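As an independent cross-check of the ValidMind test output, the same Pearson screen can be reproduced with plain pandas. A sketch only, using a tiny synthetic frame with a deliberately strong Age/Exited relationship in place of `balanced_raw_df`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: Age rises with Exited, so that pair should be flagged
df = pd.DataFrame(
    {
        "Age": [25, 32, 47, 51, 62],
        "Balance": [0.0, 100.0, 50.0, 80.0, 120.0],
        "Exited": [0, 0, 1, 1, 1],
    }
)

corr = df.corr().abs()
# Keep only the upper triangle so each pair is reported once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
flagged = corr.where(mask).stack()
print(flagged[flagged > 0.3])  # pairs that would fail a 0.3 threshold
```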

We can then re-initialize the dataset with a different input_id and the highly correlated features removed and re-run the test for confirmation:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity. The results table presents the top ten absolute Pearson correlation coefficients among feature pairs, along with their corresponding Pass/Fail status based on a threshold of 0.3. All reported coefficients are below the threshold, and each feature pair is marked as Pass.

Key insights:

  • No feature pairs exceed correlation threshold: All absolute Pearson correlation coefficients are below the 0.3 threshold, with the highest magnitude observed at 0.1874 between Balance and NumOfProducts.
  • Weak linear relationships dominate: The strongest observed correlations, both positive and negative, remain in the weak range, with coefficients ranging from -0.1874 to 0.1565.
  • Consistent Pass status across all pairs: Every feature pair in the top ten list is marked as Pass, indicating no detected risk of linear redundancy or multicollinearity among these features.

The results indicate that the dataset does not exhibit strong linear dependencies among the top correlated feature pairs. All observed relationships fall well below the specified threshold, suggesting low risk of feature redundancy or multicollinearity based on linear correlation. The feature set maintains independence suitable for reliable model estimation and interpretability.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Balance, NumOfProducts) -0.1874 Pass
(IsActiveMember, Exited) -0.1856 Pass
(Balance, Exited) 0.1565 Pass
(NumOfProducts, Exited) -0.0554 Pass
(Tenure, IsActiveMember) -0.0523 Pass
(HasCrCard, IsActiveMember) -0.0442 Pass
(NumOfProducts, IsActiveMember) 0.0441 Pass
(CreditScore, EstimatedSalary) -0.0397 Pass
(CreditScore, Exited) -0.0349 Pass
(CreditScore, IsActiveMember) 0.0311 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
5683 850 3 100476.46 2 1 1 136539.13 0 False True True
3153 712 2 182888.08 1 1 0 3061.00 0 False False True
1497 570 8 0.00 1 1 1 124641.42 0 False False False
4646 676 1 0.00 1 1 0 79342.31 1 False False False
6458 609 1 108019.27 3 1 1 184524.65 1 False False True
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
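One caveat worth noting: the split above is neither stratified nor seeded, so class proportions and downstream test results can vary between runs. A hedged sketch of a reproducible alternative, with synthetic data standing in for `balanced_raw_no_age_df`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic balanced frame: 10 records of each class
demo = pd.DataFrame({"x": range(20), "Exited": [0, 1] * 10})

# stratify preserves the 50/50 class ratio in both partitions;
# random_state makes the split reproducible
train, test = train_test_split(
    demo, test_size=0.20, stratify=demo["Exited"], random_state=42
)
print(test["Exited"].value_counts())
```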
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's go ahead and import the champion model submitted by the model development team in the format of a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
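The InconsistentVersionWarning above is worth taking seriously during validation: a model unpickled under a different scikit-learn version may behave subtly differently. A hedged sketch of a loader that surfaces the mismatch explicitly — the helper name and version strings are illustrative assumptions, not part of the ValidMind API:

```python
import pickle
import warnings


def load_model_checked(path, pickled_version, runtime_version):
    # Hypothetical helper: warn loudly when the runtime library version
    # differs from the version the model was serialized under
    if runtime_version != pickled_version:
        warnings.warn(
            f"Model pickled under scikit-learn {pickled_version}, "
            f"running {runtime_version}; verify predictions before relying on them."
        )
    with open(path, "rb") as f:
        return pickle.load(f)


# Usage in the notebook might look like:
# import sklearn
# log_reg = load_model_checked("lr_model_champion.pkl", "1.3.2", sklearn.__version__)
```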

Training a potential challenger model

We're curious how an alternate model compares to our champion model, so let's train a challenger model as a basis for our testing.

Our champion logistic regression model is a simpler, parametric model that assumes a linear relationship between the independent variables and the log-odds of the outcome. While logistic regression may not capture complex patterns as effectively, it offers a high degree of interpretability and is easier to explain to stakeholders. However, model risk is not calculated in isolation from a single factor, but rather in consideration with trade-offs in predictive performance, ease of interpretability, and overall alignment with business objectives.

Random forest classification model

A random forest classification model is an ensemble machine learning algorithm that uses multiple decision trees to classify data. In ensemble learning, multiple models are combined to improve prediction accuracy and robustness.

Random forest classification models generally have higher accuracy because they capture complex, non-linear relationships, but as a result they lack transparency in their predictions.

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)
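Although less transparent than logistic regression, a random forest does expose `feature_importances_`, which offers a coarse view of which inputs drive its predictions. A minimal sketch on synthetic data (the notebook's `X_train`/`y_train` are stood in for here):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the churn features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Importances are non-negative and sum to 1 across features
importances = pd.Series(
    clf.feature_importances_, index=[f"feature_{i}" for i in range(5)]
)
print(importances.sort_values(ascending=False))
```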

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data for each of our two models.

You simply initialize this model object with vm.init_model():

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

Assign predictions

With our models registered, we'll move on to assigning both the predictive probabilities coming directly from each model and the binary class predictions obtained by applying a cutoff threshold to those probabilities.

  • The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
  • This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

# Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-03-12 20:44:34,414 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-03-12 20:44:34,416 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-03-12 20:44:34,416 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-03-12 20:44:34,419 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-03-12 20:44:34,421 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-03-12 20:44:34,422 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-03-12 20:44:34,423 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-03-12 20:44:34,424 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-03-12 20:44:34,426 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-03-12 20:44:34,451 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-03-12 20:44:34,452 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-03-12 20:44:34,476 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-03-12 20:44:34,478 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-03-12 20:44:34,491 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-03-12 20:44:34,492 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-03-12 20:44:34,505 - INFO(validmind.vm_models.dataset.utils): Done running predict()

Running model evaluation tests

With our setup complete, let's run the rest of our validation tests. Since we have already verified the data quality of the dataset used to train our champion model, we will now focus on comprehensive performance evaluations of both the champion and challenger models.

Run model performance tests

Let's run some performance tests, beginning with independent testing of our champion logistic regression model, then moving on to our potential challenger model.

Use vm.tests.list_tests() to identify all the model performance tests for classification:


vm.tests.list_tests(tags=["model_performance"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.CalibrationCurve Calibration Curve Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... True False ['model', 'dataset'] {'n_bins': {'type': 'int', 'default': 10}} ['sklearn', 'model_performance', 'classification'] ['classification']
validmind.model_validation.sklearn.ClassifierPerformance Classifier Performance Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... False True ['dataset', 'model'] {'average': {'type': 'str', 'default': 'macro'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ConfusionMatrix Confusion Matrix Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... True False ['dataset', 'model'] {'threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.HyperParametersTuning Hyper Parameters Tuning Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... False True ['model', 'dataset'] {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': 'Union', 'default': None}, 'thresholds': {'type': 'Union', 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} ['sklearn', 'model_performance'] ['clustering', 'classification']
validmind.model_validation.sklearn.MinimumAccuracy Minimum Accuracy Checks if the model's prediction accuracy meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.7}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1Score Minimum F1 Score Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScore Minimum ROCAUC Score Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ModelsPerformanceComparison Models Performance Comparison Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... False True ['dataset', 'models'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndex Population Stability Index Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... True True ['datasets', 'model'] {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurve Precision Recall Curve Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurve ROC Curve Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrors Regression Errors Assesses the performance and error distribution of a regression model using various error metrics.... False True ['model', 'dataset'] {} ['sklearn', 'model_performance'] ['regression', 'classification']
validmind.model_validation.sklearn.TrainingTestDegradation Training Test Degradation Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... False True ['datasets', 'model'] {'max_threshold': {'type': 'float', 'default': 0.1}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.statsmodels.GINITable GINI Table Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets.... False True ['dataset', 'model'] {} ['model_performance'] ['classification']
validmind.ongoing_monitoring.CalibrationCurveDrift Calibration Curve Drift Evaluates changes in probability calibration between reference and monitoring datasets.... True True ['datasets', 'model'] {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDrift Class Discrimination Drift Compares classification discrimination metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassificationAccuracyDrift Classification Accuracy Drift Compares classification accuracy metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDrift Confusion Matrix Drift Compares confusion matrix metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDrift ROC Curve Drift Compares ROC curves between reference and monitoring datasets.... True False ['datasets', 'model'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']

We'll isolate the specific tests we want to run in a list called mpt:

As we learned in the previous notebook 2 — Start the model validation process, you can use a custom result_id to tag the individual result with a unique identifier by appending this result_id to the test_id with a : separator. We'll append an identifier for our champion model here:

mpt = [
    "validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion",
    "validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion",
    "validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion",
    "validmind.model_validation.sklearn.MinimumF1Score:logreg_champion",
    "validmind.model_validation.sklearn.ROCCurve:logreg_champion"
]

Evaluate performance of the champion model

Now, let's run and log our batch of model performance tests using our testing dataset (vm_test_ds) for our champion model:

  • The test set serves as a proxy for real-world data, providing an unbiased estimate of model performance since it was not used during training or tuning.
  • The test set also acts as protection against selection bias and model tweaking, giving a final, more unbiased checkpoint.
for test in mpt:
    vm.tests.run_test(
        test,
        inputs={"dataset": vm_test_ds, "model": vm_log_model},
    ).log()

Classifier Performance Logreg Champion

The Classifier Performance test evaluates the predictive effectiveness of classification models by reporting precision, recall, F1-score, accuracy, and ROC AUC metrics. The results table presents these metrics for each class, as well as macro and weighted averages, alongside overall accuracy and ROC AUC values. The reported values provide a quantitative summary of the model's ability to correctly classify instances and distinguish between classes.

Key insights:

  • Balanced class-wise performance: Precision, recall, and F1-scores are similar across both classes, with precision ranging from 0.6382 to 0.6414 and recall from 0.6120 to 0.6667, indicating no substantial disparity in model performance between classes.
  • Consistent macro and weighted averages: Macro and weighted averages for precision, recall, and F1-score are closely aligned (all approximately 0.639), reflecting uniformity in class performance and absence of class imbalance effects in these metrics.
  • Moderate overall accuracy: The model achieves an accuracy of 0.6399, indicating that approximately 64% of predictions match the true class labels.
  • ROC AUC indicates moderate separability: The ROC AUC score of 0.6901 suggests the model has moderate ability to distinguish between the two classes.

The results indicate that the model demonstrates consistent and balanced predictive performance across both classes, with moderate accuracy and ROC AUC values. The close alignment of macro and weighted averages further supports the absence of significant class imbalance effects. Overall, the model exhibits moderate classification effectiveness, with no pronounced weaknesses in class-specific performance metrics.

Tables

Precision, Recall, and F1

Class Precision Recall F1
0 0.6414 0.6667 0.6538
1 0.6382 0.6120 0.6248
Weighted Average 0.6398 0.6399 0.6396
Macro Average 0.6398 0.6393 0.6393

Accuracy and ROC AUC

Metric Value
Accuracy 0.6399
ROC AUC 0.6901
2026-03-12 20:44:44,385 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion does not exist in model's document

Confusion Matrix Logreg Champion

The Confusion Matrix test evaluates the classification performance of the logistic regression model by comparing predicted and actual class labels, providing a breakdown of true positives, true negatives, false positives, and false negatives. The resulting matrix visually displays the distribution of correct and incorrect predictions, enabling assessment of the model’s ability to distinguish between the two classes. The matrix quantifies each outcome, supporting detailed analysis of model strengths and error patterns.

Key insights:

  • True Negatives exceed other outcomes: The model correctly identified 220 true negatives, representing the highest count among all matrix categories.
  • True Positives are substantial: There are 194 true positives, indicating a strong ability to correctly classify positive cases.
  • False Negatives outnumber False Positives: The model produced 123 false negatives compared to 110 false positives, highlighting a greater tendency to miss positive cases than to incorrectly flag negatives as positives.
  • Non-trivial error rates in both classes: Both false positive and false negative counts are material, indicating that misclassification occurs in both directions.

The confusion matrix reveals that the model demonstrates a higher rate of correct classification for negative cases, with true negatives being the most frequent outcome. While true positives are also substantial, the presence of notable false negative and false positive counts indicates that classification errors are distributed across both classes. This distribution provides a clear view of the model’s predictive strengths and areas where misclassification risk is present.
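The asymmetry between missed positives and false alarms noted above can be quantified directly from the reported counts — a quick arithmetic check, not a ValidMind test:

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 220, 110, 123, 194

fnr = fn / (fn + tp)  # share of actual churners the model missed
fpr = fp / (fp + tn)  # share of actual non-churners flagged as churners

print(f"False negative rate: {fnr:.3f}")  # ~0.388
print(f"False positive rate: {fpr:.3f}")  # ~0.333
```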

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion:cad1
2026-03-12 20:44:53,158 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion does not exist in model's document

❌ Minimum Accuracy Logreg Champion

The Minimum Accuracy test evaluates whether the model's prediction accuracy meets or exceeds a specified threshold, providing a direct measure of overall model correctness. The results table presents the model's achieved accuracy score, the minimum threshold set for the test, and the corresponding pass/fail outcome. The model's accuracy score is compared against the threshold to determine if the model satisfies the minimum performance requirement.

Key insights:

  • Accuracy below threshold: The model achieved an accuracy score of 0.6399, which is below the specified threshold of 0.7.
  • Test outcome is Fail: The test result is marked as "Fail," indicating the model did not meet the minimum accuracy requirement.

The results indicate that the model's predictive accuracy falls short of the established minimum threshold, as evidenced by the accuracy score of 0.6399 against a requirement of 0.7. This outcome highlights a gap in overall model correctness relative to the defined performance criterion.

Tables

Score Threshold Pass/Fail
0.6399 0.7 Fail
2026-03-12 20:44:58,742 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion does not exist in model's document

✅ Minimum F1 Score Logreg Champion

The MinimumF1Score:logreg_champion test evaluates whether the model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring balanced performance between precision and recall. The results table presents the observed F1 score, the minimum threshold for passing, and the pass/fail outcome. The model achieved an F1 score of 0.6248, compared against a threshold of 0.5, with the test outcome marked as "Pass".

Key insights:

  • F1 score exceeds minimum threshold: The model's F1 score of 0.6248 is above the required threshold of 0.5, indicating balanced performance between precision and recall on the validation set.
  • Test outcome is Pass: The model satisfies the minimum F1 score requirement, as reflected by the "Pass" result in the test output.

The results indicate that the model demonstrates balanced classification performance on the validation set, with the F1 score surpassing the established minimum threshold. The test outcome confirms that the model meets the predefined standard for F1-based performance.

Tables

Score Threshold Pass/Fail
0.6248 0.5 Pass
2026-03-12 20:45:02,324 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:logreg_champion does not exist in model's document
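Both of these minimum-threshold tests reduce to the same comparison: the achieved score is checked against the configured minimum. A minimal sketch of that logic using the scores reported above (illustrative only, not the ValidMind implementation):

```python
# Illustrative minimum-threshold check (not the ValidMind implementation)
def minimum_threshold_check(score: float, threshold: float) -> str:
    """Return "Pass" when the score meets or exceeds the threshold."""
    return "Pass" if score >= threshold else "Fail"

# Scores reported for the champion model above
print(minimum_threshold_check(0.6399, 0.7))  # Minimum Accuracy: Fail
print(minimum_threshold_check(0.6248, 0.5))  # Minimum F1 Score: Pass
```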

ROC Curve Logreg Champion

The ROC Curve test evaluates the binary classification performance of the logreg_champion model by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under the Curve (AUC) on the test_dataset_final. The ROC curve visualizes the trade-off between the true positive rate and false positive rate across all possible classification thresholds, while the AUC quantifies the model's overall discriminative ability. The test result presents the ROC curve for the model alongside a reference line representing random classification (AUC = 0.5), with the model's AUC score displayed in the legend.

Key insights:

  • AUC indicates moderate discriminative ability: The model achieves an AUC of 0.69, reflecting moderate capability to distinguish between the two classes.
  • ROC curve consistently above random baseline: The ROC curve remains above the diagonal line representing random performance, indicating the model provides meaningful separation between positive and negative classes.
  • No evidence of near-random classification: The ROC curve does not approach the random baseline, and the AUC is well above 0.5, suggesting the model avoids high-risk performance zones.

The ROC analysis demonstrates that the logreg_champion model exhibits moderate discriminative power on the test dataset, with an AUC of 0.69. The model's ROC curve consistently outperforms random classification, indicating reliable, though not exceptional, separation between classes. No indications of high-risk or near-random classification behavior are present in the observed results.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:logreg_champion:44c8
2026-03-12 20:45:08,707 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:logreg_champion does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.

Log an artifact

As we can observe from the output above, our champion model doesn't pass the MinimumAccuracy test at the default threshold of the out-of-the-box test, so let's log an artifact (finding) in the ValidMind Platform (Need more help?):

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Validation Report under Documents.

  3. Locate the Data Preparation section and click on 2.2.2. Model Performance to expand that section.

  4. Under the Model Performance Metrics section, locate Artifacts then click Link Artifact to Report:

    Screenshot showing the validation report with the link artifact option highlighted

  5. Select Validation Issue as the type of artifact.

  6. Click + Add Validation Issue to add a validation issue type artifact.

  7. Enter in the details for your validation issue, for example:

    • TITLE — Champion Logistic Regression Model Fails Minimum Accuracy Threshold
    • RISK AREA — Model Performance
    • DOCUMENTATION SECTION — 3.2. Model Evaluation
    • DESCRIPTION — The logistic regression champion model was subjected to a Minimum Accuracy test to determine whether its predictive accuracy meets the predefined performance threshold of 0.7. The model achieved an accuracy score of 0.6399, which falls below the required minimum. As a result, the test produced a Fail outcome.
  8. Click Save.

  9. Select the validation issue you just added to link to your validation report and click Update Linked Artifacts to insert your validation issue.

  10. Click on the validation issue to expand the issue, where you can adjust details such as severity, owner, due date, status, etc. as well as include proposed remediation plans or supporting documentation as attachments.

Evaluate performance of challenger model

We've now run tests similar to those the model development team ran for our champion model, with the aim of verifying their test results.

Next, let's see how our challenger model compares. We'll use the same batch of tests here as we stored in mpt, but append a different result_id to indicate that these results should be associated with our challenger model:

mpt_chall = [
    "validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger",
    "validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger",
    "validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger"
]

We'll run each test once for each model with the same vm_test_ds dataset to compare them:

for test in mpt_chall:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Classifier Performance Champion Vs Challenger

The Classifier Performance: champion_vs_challenger test evaluates the predictive performance of classification models using precision, recall, F1-Score, accuracy, and ROC AUC metrics. The results present a comparative analysis between two models, "log_model_champion" and "rf_model," with detailed class-level and aggregate performance statistics. Metrics are reported for each class, as well as macro and weighted averages, alongside overall accuracy and ROC AUC values for both models.

Key insights:

  • rf_model outperforms log_model_champion across all metrics: rf_model achieves higher precision, recall, F1-Score, accuracy (0.7156), and ROC AUC (0.7962) compared to log_model_champion, which records accuracy of 0.6399 and ROC AUC of 0.6901.
  • Consistent class-level performance within each model: Both models display similar precision and recall values across classes 0 and 1, with no substantial imbalance between classes.
  • Macro and weighted averages align closely: For both models, macro and weighted averages for precision, recall, and F1-Score are nearly identical, indicating balanced class distribution and uniform model behavior across classes.

The comparative results indicate that rf_model demonstrates superior classification performance relative to log_model_champion, as evidenced by higher scores across all evaluated metrics. Both models exhibit balanced predictive behavior across classes, with minimal disparity between class-specific and aggregate performance measures. The observed differences in accuracy and ROC AUC highlight a clear performance advantage for rf_model in this evaluation.

Tables

model Class Precision Recall F1
log_model_champion 0 0.6414 0.6667 0.6538
log_model_champion 1 0.6382 0.6120 0.6248
log_model_champion Weighted Average 0.6398 0.6399 0.6396
log_model_champion Macro Average 0.6398 0.6393 0.6393
rf_model 0 0.7160 0.7333 0.7246
rf_model 1 0.7152 0.6972 0.7061
rf_model Weighted Average 0.7156 0.7156 0.7155
rf_model Macro Average 0.7156 0.7152 0.7153
model Metric Value
log_model_champion Accuracy 0.6399
log_model_champion ROC AUC 0.6901
rf_model Accuracy 0.7156
rf_model ROC AUC 0.7962
2026-03-12 20:45:16,670 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger does not exist in model's document

Confusion Matrix Champion Vs Challenger

The Confusion Matrix: champion_vs_challenger test evaluates the predictive performance of two classification models by comparing their predicted and actual class labels, visualized through annotated heatmaps. The confusion matrices display the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) for both the champion (log_model_champion) and challenger (rf_model) models, enabling direct assessment of classification accuracy and error types.

Key insights:

  • Challenger model reduces both FP and FN: The rf_model records 88 False Positives and 96 False Negatives, compared to 110 False Positives and 123 False Negatives for the log_model_champion, indicating improved error control.
  • Higher correct classification in challenger model: The rf_model achieves 221 True Positives and 242 True Negatives, exceeding the log_model_champion’s 194 True Positives and 220 True Negatives.
  • Overall error reduction in challenger: The total number of misclassifications (FP + FN) is lower for the rf_model (184) than for the log_model_champion (233), reflecting a net improvement in predictive accuracy.

The confusion matrix results demonstrate that the challenger model (rf_model) outperforms the champion model (log_model_champion) across all key confusion matrix categories. The challenger achieves higher counts of correct classifications (TP and TN) and lower counts of both types of errors (FP and FN), resulting in a lower overall misclassification rate. This indicates a clear improvement in classification performance for the challenger model based on the observed test data.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:8184
ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:68ad
2026-03-12 20:45:22,597 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger does not exist in model's document
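As a quick sanity check, the accuracy scores reported earlier can be reproduced directly from these confusion matrix counts:

```python
# Reproduce accuracy from the confusion matrix counts reported above
def accuracy_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = correct predictions / all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

print(round(accuracy_from_counts(tp=194, tn=220, fp=110, fn=123), 4))  # champion: 0.6399
print(round(accuracy_from_counts(tp=221, tn=242, fp=88, fn=96), 4))    # challenger: 0.7156
```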

❌ Minimum Accuracy Champion Vs Challenger

The Minimum Accuracy test evaluates whether each model's prediction accuracy meets or exceeds a specified threshold, with results presented for both the log_model_champion and rf_model. The table displays each model's accuracy score, the threshold applied (0.7), and the corresponding pass/fail outcome. The log_model_champion achieved an accuracy of 0.6399, while the rf_model achieved an accuracy of 0.7156, allowing for direct comparison of model performance relative to the threshold.

Key insights:

  • rf_model surpasses accuracy threshold: The rf_model achieved an accuracy score of 0.7156, exceeding the minimum threshold of 0.7 and resulting in a passing outcome.
  • log_model_champion falls below threshold: The log_model_champion recorded an accuracy of 0.6399, which is below the threshold, resulting in a failing outcome.
  • Clear performance differentiation: The two models display a marked difference in accuracy, with the rf_model outperforming the log_model_champion by approximately 7.6 percentage points.

The results indicate that the rf_model meets the minimum accuracy requirement, while the log_model_champion does not. This differentiation highlights a substantial performance gap between the two models under the specified evaluation criteria. The observed outcomes provide a clear basis for model selection based on accuracy performance relative to the defined threshold.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6399 0.7 Fail
rf_model 0.7156 0.7 Pass
2026-03-12 20:45:28,782 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger does not exist in model's document

✅ Minimum F1 Score Champion Vs Challenger

The MinimumF1Score:champion_vs_challenger test evaluates whether each model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring balanced performance between precision and recall. The results table presents F1 scores for both the champion (log_model_champion) and challenger (rf_model) models, alongside the minimum threshold and pass/fail status. Both models are assessed independently against the threshold value of 0.5, with their respective F1 scores and outcomes displayed.

Key insights:

  • Both models exceed minimum F1 threshold: log_model_champion achieved an F1 score of 0.6248 and rf_model achieved 0.7061, both surpassing the threshold of 0.5.
  • Challenger model demonstrates higher F1 performance: rf_model outperforms log_model_champion by 0.0813 in F1 score, indicating stronger balance between precision and recall on the validation set.
  • Both models pass the test criteria: Both models are marked as "Pass," confirming that each meets the minimum F1 score requirement.

Both the champion and challenger models satisfy the minimum F1 score criterion, with the challenger model (rf_model) exhibiting a higher F1 score than the champion. The results indicate that both models maintain balanced classification performance on the validation set, with the challenger model providing a measurable improvement in F1 score relative to the champion.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6248 0.5 Pass
rf_model 0.7061 0.5 Pass
2026-03-12 20:45:32,959 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger does not exist in model's document
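The F1 scores in this table follow from the class-1 precision and recall reported in the Classifier Performance table earlier; F1 is their harmonic mean:

```python
# F1 as the harmonic mean of precision and recall
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Class-1 precision/recall from the Classifier Performance table above
print(round(f1(0.6382, 0.6120), 4))  # log_model_champion: 0.6248
print(round(f1(0.7152, 0.6972), 4))  # rf_model: 0.7061
```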

ROC Curve Champion Vs Challenger

The ROC Curve test evaluates the discrimination ability of binary classification models by plotting the trade-off between true positive rate and false positive rate across thresholds, and by calculating the Area Under the Curve (AUC) as a summary metric. The results present ROC curves and AUC values for two models—log_model_champion and rf_model—on the test_dataset_final, with each curve compared against a random classifier baseline (AUC = 0.5). The ROC curves and corresponding AUC scores provide a visual and quantitative assessment of each model’s ability to distinguish between the positive and negative classes.

Key insights:

  • rf_model demonstrates higher discrimination: The rf_model achieves an AUC of 0.80, indicating stronger separation between classes compared to the log_model_champion.
  • log_model_champion shows moderate performance: The log_model_champion records an AUC of 0.69, reflecting moderate discriminative ability above random chance but below that of the rf_model.
  • Both models outperform random baseline: Both ROC curves are consistently above the random classifier line (AUC = 0.5), confirming that each model provides meaningful predictive power on the test dataset.

The comparative ROC analysis reveals that the rf_model exhibits superior classification performance relative to the log_model_champion, as evidenced by a higher AUC and a more pronounced ROC curve. Both models demonstrate the ability to distinguish between classes beyond random chance, with the rf_model providing a notably stronger level of discrimination on the evaluated dataset.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:589c
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:3083
2026-03-12 20:45:41,867 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger does not exist in model's document
Based on the performance metrics, our challenger random forest classification model passes the MinimumAccuracy where our champion did not.

In your validation report, support the recommendation in your validation issue's Proposed Remediation Plan to investigate the use of our challenger model by inserting the performance tests we logged with this notebook into the appropriate section.

Run diagnostic tests

Next, we want to compare the robustness and stability of our champion and challenger models.

Use list_tests() to list all available diagnostic tests applicable to classification tasks:

vm.tests.list_tests(tags=["model_diagnosis"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.OverfitDiagnosis Overfit Diagnosis Assesses potential overfitting in a model's predictions, identifying regions where performance between training and... True True ['model', 'datasets'] {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}} ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis'] ['classification', 'regression']
validmind.model_validation.sklearn.RobustnessDiagnosis Robustness Diagnosis Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.... True True ['datasets', 'model'] {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': 'List', 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}} ['sklearn', 'model_diagnosis', 'visualization'] ['classification', 'regression']
validmind.model_validation.sklearn.WeakspotsDiagnosis Weakspots Diagnosis Identifies and visualizes weak spots in a machine learning model's performance across various sections of the... True True ['datasets', 'model'] {'features_columns': {'type': 'Optional', 'default': None}, 'metrics': {'type': 'Optional', 'default': None}, 'thresholds': {'type': 'Optional', 'default': None}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization'] ['classification', 'text_classification']

Let’s now use the OverfitDiagnosis test to assess the models for potential signs of overfitting and to identify any sub-segments where performance may be inconsistent.

Overfitting occurs when a model learns the training data too well, capturing not only the true pattern but also noise and random fluctuations. This results in excellent performance on the training dataset but poor generalization to new, unseen data:

  • Since the training dataset (vm_train_ds) was used to fit the model, we use this set to establish a baseline performance for how well the model performs on data it has already seen.
  • The testing dataset (vm_test_ds) was never seen during training, and here simulates real-world generalization, or how well the model performs on new, unseen data.
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

Overfit Diagnosis Champion Vs Challenger

The Overfit Diagnosis: champion_vs_challenger test evaluates the extent of overfitting by comparing model performance between training and test sets across feature segments. The results present AUC gaps for both the logistic regression (log_model_champion) and random forest (rf_model) models, highlighting regions where the difference in AUC between training and test data exceeds the default threshold of 0.04. Bar plots visualize these gaps for each feature, enabling identification of segments with significant overfitting.

Key insights:

  • Random forest model exhibits widespread overfitting: The rf_model shows consistently large AUC gaps across nearly all feature segments, with gaps frequently exceeding 0.2 and reaching as high as 1.0 in certain Balance segments and 0.74 in NumOfProducts.
  • Logistic regression model shows localized overfitting: The log_model_champion displays moderate AUC gaps, with most segments below the threshold, but notable exceptions include CreditScore (gap = 0.1074 for 450–500), Balance (gap = 0.25 for 200,718–225,808), and EstimatedSalary (gap = 0.1277 for 179,981–199,953).
  • Overfitting concentrated in specific feature bins: For both models, the largest AUC gaps are observed in extreme or sparsely populated bins, such as high Balance and high NumOfProducts segments.
  • Minimal overfitting in binary categorical features for logistic regression: HasCrCard, IsActiveMember, Geography, and Gender segments in log_model_champion show AUC gaps well below the threshold, indicating stable generalization in these features.

The results indicate that the random forest model demonstrates extensive overfitting across all examined features, with AUC gaps substantially exceeding the threshold in most segments. In contrast, the logistic regression model exhibits overfitting primarily in specific, often low-sample, feature bins, while maintaining stable performance in the majority of segments. Overfitting is most pronounced in regions with limited data, particularly for the random forest model, underscoring the importance of segment-level evaluation in model validation.

Tables

model Feature Slice Number of Training Records Number of Test Records Training AUC Test AUC Gap
log_model_champion CreditScore (450.0, 500.0] 117 32 0.7026 0.5951 0.1074
log_model_champion Tenure (2.0, 3.0] 272 70 0.6377 0.5479 0.0898
log_model_champion Balance (50179.618, 75269.427] 93 29 0.5764 0.4697 0.1067
log_model_champion Balance (100359.236, 125449.045] 587 142 0.7366 0.6883 0.0483
log_model_champion Balance (200718.472, 225808.281] 13 4 0.2500 0.0000 0.2500
log_model_champion NumOfProducts (2.8, 3.1] 150 40 0.7478 0.6154 0.1324
log_model_champion EstimatedSalary (60151.514, 80123.202] 274 62 0.7389 0.6855 0.0534
log_model_champion EstimatedSalary (80123.202, 100094.89] 275 59 0.6703 0.5644 0.1060
log_model_champion EstimatedSalary (179981.642, 199953.33] 231 71 0.7050 0.5774 0.1277
rf_model CreditScore (400.0, 450.0] 47 12 1.0000 0.7222 0.2778
rf_model CreditScore (450.0, 500.0] 117 32 1.0000 0.6741 0.3259
rf_model CreditScore (500.0, 550.0] 263 59 1.0000 0.7442 0.2558
rf_model CreditScore (550.0, 600.0] 340 106 1.0000 0.7753 0.2247
rf_model CreditScore (600.0, 650.0] 528 112 1.0000 0.8188 0.1812
rf_model CreditScore (650.0, 700.0] 495 124 1.0000 0.7960 0.2040
rf_model CreditScore (700.0, 750.0] 381 95 1.0000 0.8627 0.1373
rf_model CreditScore (750.0, 800.0] 239 70 1.0000 0.7382 0.2618
rf_model CreditScore (800.0, 850.0] 166 32 1.0000 0.9042 0.0958
rf_model Tenure (-0.01, 1.0] 385 92 1.0000 0.6982 0.3018
rf_model Tenure (1.0, 2.0] 249 65 1.0000 0.8471 0.1529
rf_model Tenure (2.0, 3.0] 272 70 1.0000 0.8026 0.1974
rf_model Tenure (3.0, 4.0] 264 64 1.0000 0.7791 0.2209
rf_model Tenure (4.0, 5.0] 264 63 1.0000 0.8657 0.1343
rf_model Tenure (5.0, 6.0] 227 57 1.0000 0.7654 0.2346
rf_model Tenure (6.0, 7.0] 258 66 1.0000 0.7973 0.2027
rf_model Tenure (7.0, 8.0] 266 71 1.0000 0.8201 0.1799
rf_model Tenure (8.0, 9.0] 258 68 1.0000 0.8219 0.1781
rf_model Tenure (9.0, 10.0] 142 31 1.0000 0.8632 0.1368
rf_model Balance (-250.898, 25089.809] 846 212 1.0000 0.8571 0.1429
rf_model Balance (25089.809, 50179.618] 16 6 1.0000 0.7500 0.2500
rf_model Balance (50179.618, 75269.427] 93 29 1.0000 0.7828 0.2172
rf_model Balance (75269.427, 100359.236] 289 67 1.0000 0.6326 0.3674
rf_model Balance (100359.236, 125449.045] 587 142 1.0000 0.7910 0.2090
rf_model Balance (125449.045, 150538.854] 499 127 1.0000 0.7308 0.2692
rf_model Balance (150538.854, 175628.663] 191 50 1.0000 0.7250 0.2750
rf_model Balance (200718.472, 225808.281] 13 4 1.0000 0.0000 1.0000
rf_model NumOfProducts (0.997, 1.3] 1474 390 1.0000 0.6905 0.3095
rf_model NumOfProducts (1.9, 2.2] 926 208 1.0000 0.6903 0.3097
rf_model NumOfProducts (2.8, 3.1] 150 40 1.0000 0.2564 0.7436
rf_model HasCrCard (-0.001, 0.1] 788 191 1.0000 0.7809 0.2191
rf_model HasCrCard (0.9, 1.0] 1797 456 1.0000 0.8034 0.1966
rf_model IsActiveMember (-0.001, 0.1] 1399 340 1.0000 0.7656 0.2344
rf_model IsActiveMember (0.9, 1.0] 1186 307 1.0000 0.8036 0.1964
rf_model EstimatedSalary (36.733, 20208.138] 265 64 1.0000 0.7926 0.2074
rf_model EstimatedSalary (20208.138, 40179.826] 233 67 1.0000 0.8155 0.1845
rf_model EstimatedSalary (40179.826, 60151.514] 239 61 1.0000 0.8761 0.1239
rf_model EstimatedSalary (60151.514, 80123.202] 274 62 1.0000 0.7388 0.2612
rf_model EstimatedSalary (80123.202, 100094.89] 275 59 1.0000 0.7741 0.2259
rf_model EstimatedSalary (100094.89, 120066.578] 259 75 1.0000 0.7911 0.2089
rf_model EstimatedSalary (120066.578, 140038.266] 274 65 1.0000 0.8400 0.1600
rf_model EstimatedSalary (140038.266, 160009.954] 256 70 1.0000 0.7697 0.2303
rf_model EstimatedSalary (160009.954, 179981.642] 279 52 1.0000 0.7585 0.2415
rf_model EstimatedSalary (179981.642, 199953.33] 231 71 1.0000 0.7887 0.2113
rf_model Geography_Germany (-0.001, 0.1] 1792 446 1.0000 0.7890 0.2110
rf_model Geography_Germany (0.9, 1.0] 793 201 1.0000 0.7404 0.2596
rf_model Geography_Spain (-0.001, 0.1] 1983 506 1.0000 0.8016 0.1984
rf_model Geography_Spain (0.9, 1.0] 602 141 1.0000 0.7722 0.2278
rf_model Gender_Male (-0.001, 0.1] 1293 304 1.0000 0.8059 0.1941
rf_model Gender_Male (0.9, 1.0] 1292 343 1.0000 0.7772 0.2228

Figures

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:01b6
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2a55
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:4473
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:54b2
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:de2d
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:caa1
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:151d
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6a29
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:ee1b
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0a05
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:d1b2
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:686e
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0092
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:e8c8
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0ef8
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:7004
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:d7d6
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:778c
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:97a8
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:1a85
2026-03-12 20:46:04,690 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger does not exist in model's document
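The slice-level flagging that drives this test can be sketched as follows: within each feature bin, compute the train/test metric gap and flag bins whose gap exceeds the cut-off threshold (0.04 by default, per the test parameters listed earlier). The sketch below is illustrative, reusing a few rows from the table above; it is not the library's implementation:

```python
# Illustrative slice-level overfit check (not the ValidMind implementation)
CUT_OFF = 0.04  # default cut_off_threshold for OverfitDiagnosis

# (feature, bin, train AUC, test AUC) -- sample rows from the table above
slices = [
    ("CreditScore", "(450.0, 500.0]", 0.7026, 0.5951),
    ("Balance", "(100359.236, 125449.045]", 0.7366, 0.6883),
    ("NumOfProducts", "(2.8, 3.1]", 0.7478, 0.6154),
]

flagged = [
    (feature, bin_, round(train_auc - test_auc, 4))
    for feature, bin_, train_auc, test_auc in slices
    if train_auc - test_auc > CUT_OFF
]
for feature, bin_, gap in flagged:
    print(f"{feature} {bin_}: gap={gap} exceeds {CUT_OFF}")
```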

Let's also conduct robustness and stability testing of the two models with the RobustnessDiagnosis test. Robustness refers to a model's ability to maintain consistent performance when inputs are noisy or perturbed, and stability refers to a model's ability to produce consistent outputs across different data subsets over time.

Again, we'll use both the training and testing datasets to establish baseline performance and to simulate real-world generalization:

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

❌ Robustness Diagnosis Champion Vs Log Regression

The Robustness Diagnosis test evaluates the resilience of the log_model_champion and rf_model models by measuring AUC performance decay under increasing levels of Gaussian noise applied to numeric input features. Results are presented for both train and test datasets across perturbation sizes ranging from 0.0 to 0.5 standard deviations. The tables and plots display AUC values, performance decay, and pass/fail status at each noise level, enabling direct comparison of robustness characteristics between the two models.

Key insights:

  • Logistic regression model exhibits gradual, low-magnitude decay: The log_model_champion shows a steady but modest decline in AUC as perturbation size increases, with test set AUC decreasing from 0.6901 (baseline) to 0.6664 (0.5 SD), and performance decay remaining below 0.024 across all noise levels.
  • Random forest model displays pronounced train-test divergence: The rf_model achieves perfect AUC (1.0) on the train set at baseline but experiences rapid performance decay under noise, with train AUC dropping to 0.7834 (0.5 SD) and performance decay exceeding 0.21. In contrast, test set AUC declines more gradually, from 0.7962 to 0.7255.
  • Threshold failures concentrated in random forest train and test sets: The rf_model fails the robustness threshold on the train set at perturbation sizes ≥0.2 and on the test set at 0.5, while the log_model_champion passes all thresholds across both datasets and all perturbation levels.
  • Test set robustness superior to train set for both models: Both models demonstrate lower performance decay and higher AUC retention on the test set compared to the train set as noise increases, particularly evident in the rf_model.

The results indicate that the log_model_champion maintains stable performance under increasing input noise, with minimal AUC decay and consistent threshold passing across all tested perturbation sizes. The rf_model, while initially achieving higher baseline AUC, is more sensitive to input noise, particularly on the train set, where performance decay is substantial and threshold failures occur at moderate noise levels. Test set robustness is consistently higher than train set robustness for both models, with the logistic regression model demonstrating greater resilience to noisy input features overall.

Tables

| Model | Perturbation Size | Dataset | Row Count | AUC | Performance Decay | Passed |
|---|---|---|---|---|---|---|
| log_model_champion | Baseline (0.0) | train_dataset_final | 2585 | 0.6768 | 0.0000 | True |
| log_model_champion | Baseline (0.0) | test_dataset_final | 647 | 0.6901 | 0.0000 | True |
| log_model_champion | 0.1 | train_dataset_final | 2585 | 0.6765 | 0.0004 | True |
| log_model_champion | 0.1 | test_dataset_final | 647 | 0.6882 | 0.0019 | True |
| log_model_champion | 0.2 | train_dataset_final | 2585 | 0.6727 | 0.0041 | True |
| log_model_champion | 0.2 | test_dataset_final | 647 | 0.6801 | 0.0100 | True |
| log_model_champion | 0.3 | train_dataset_final | 2585 | 0.6708 | 0.0061 | True |
| log_model_champion | 0.3 | test_dataset_final | 647 | 0.6828 | 0.0073 | True |
| log_model_champion | 0.4 | train_dataset_final | 2585 | 0.6640 | 0.0128 | True |
| log_model_champion | 0.4 | test_dataset_final | 647 | 0.6747 | 0.0155 | True |
| log_model_champion | 0.5 | train_dataset_final | 2585 | 0.6547 | 0.0221 | True |
| log_model_champion | 0.5 | test_dataset_final | 647 | 0.6664 | 0.0237 | True |
| rf_model | Baseline (0.0) | train_dataset_final | 2585 | 1.0000 | 0.0000 | True |
| rf_model | Baseline (0.0) | test_dataset_final | 647 | 0.7962 | 0.0000 | True |
| rf_model | 0.1 | train_dataset_final | 2585 | 0.9860 | 0.0140 | True |
| rf_model | 0.1 | test_dataset_final | 647 | 0.7959 | 0.0003 | True |
| rf_model | 0.2 | train_dataset_final | 2585 | 0.9400 | 0.0600 | False |
| rf_model | 0.2 | test_dataset_final | 647 | 0.7850 | 0.0112 | True |
| rf_model | 0.3 | train_dataset_final | 2585 | 0.8836 | 0.1164 | False |
| rf_model | 0.3 | test_dataset_final | 647 | 0.7814 | 0.0148 | True |
| rf_model | 0.4 | train_dataset_final | 2585 | 0.8354 | 0.1646 | False |
| rf_model | 0.4 | test_dataset_final | 647 | 0.7706 | 0.0256 | True |
| rf_model | 0.5 | train_dataset_final | 2585 | 0.7834 | 0.2166 | False |
| rf_model | 0.5 | test_dataset_final | 647 | 0.7255 | 0.0707 | False |

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:94d1
ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:da9e
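The decay and pass/fail logic behind the table above can be sketched in a few lines of plain Python. Note that the 0.05 decay threshold is an assumption inferred from the tabulated results (rf_model fails at a decay of 0.0600 but passes at 0.0256), not a confirmed library default:

```python
def performance_decay(baseline_auc: float, perturbed_auc: float) -> float:
    """Drop in AUC relative to the unperturbed baseline."""
    return baseline_auc - perturbed_auc

def passes(decay: float, threshold: float = 0.05) -> bool:
    """A perturbation passes when the AUC decay stays below the threshold.

    The 0.05 value is an assumption inferred from the results table.
    """
    return decay < threshold

# rf_model on the train set at perturbation size 0.2 (values from the table)
decay = performance_decay(1.0000, 0.9400)
print(round(decay, 4), passes(decay))  # 0.06 False
```

This makes the asymmetry in the table concrete: the random forest's large train-set decay at moderate noise levels is what trips the threshold, while the logistic regression's decay never exceeds 0.0237 on either dataset.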

Run feature importance tests

We also want to verify the relative influence of different input features on our models' predictions, and to compare the champion and challenger models to see whether one offers more interpretable or logically consistent feature importance scores.

Use list_tests() to identify all the feature importance tests for classification:

# Store the feature importance tests
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI
['validmind.model_validation.FeaturesAUC',
 'validmind.model_validation.sklearn.PermutationFeatureImportance',
 'validmind.model_validation.sklearn.SHAPGlobalImportance']

We'll only use our testing dataset (vm_test_ds) here, to provide a realistic, unseen sample that mimics future or production data, since the training dataset has already influenced our model during learning:

# Run and log our feature importance tests for both models for the testing dataset
for test in FI:
    vm.tests.run_test(
        f"{test}:champion_vs_challenger",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Features Champion Vs Challenger

The FeaturesAUC:champion_vs_challenger test evaluates the discriminatory power of each individual feature by calculating the Area Under the Curve (AUC) for each feature in isolation against the binary target. The resulting plot displays AUC values for all features in the test_dataset_final dataset, with higher AUC values indicating stronger univariate separation between classes. The features are ranked by their AUC scores, providing a direct comparison of their individual classification strength.

Key insights:

  • Balance exhibits highest univariate discriminatory power: The Balance feature achieves the highest AUC, exceeding 0.6, indicating the strongest individual ability to distinguish between classes among all features evaluated.
  • Geography_Germany and CreditScore show moderate separation: Both Geography_Germany and CreditScore display AUC values above 0.5, suggesting moderate univariate predictive strength.
  • Several features cluster at lower AUC values: Features such as NumOfProducts and IsActiveMember have AUC values near 0.4, reflecting limited individual discriminatory capability in the univariate context.

The results indicate that Balance is the most individually informative feature for class separation in this dataset, with Geography_Germany and CreditScore also contributing moderate univariate predictive value. The remaining features demonstrate lower AUC scores, suggesting weaker standalone classification strength when evaluated independently. This distribution of AUC values provides insight into the relative univariate importance of each feature within the binary classification task.
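The univariate AUC described above can be reproduced with scikit-learn by scoring each feature column on its own against the binary target. The column names and values below are illustrative, not the actual dataset schema:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def feature_auc(X: pd.DataFrame, y) -> pd.Series:
    """AUC of each feature used alone as a score for the positive class."""
    aucs = {col: roc_auc_score(y, X[col]) for col in X.columns}
    return pd.Series(aucs).sort_values(ascending=False)

# Toy example with two features; names and values are illustrative
X = pd.DataFrame({
    "Balance": [100, 250, 80, 300, 50, 220],
    "HasCrCard": [1, 0, 1, 1, 0, 1],
})
y = [0, 1, 0, 1, 0, 1]
print(feature_auc(X, y))  # Balance 1.0, HasCrCard 0.5
```

A univariate AUC of 0.5 means the feature alone carries no ranking information; values below 0.5, like those near 0.4 noted above, indicate an inverse relationship with the positive class rather than no relationship at all.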

Figures

ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:e4af
ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:c6f8

Permutation Feature Importance Champion Vs Challenger

The Permutation Feature Importance (PFI) test evaluates the relative importance of each input feature by measuring the decrease in model performance when feature values are randomly permuted. The results are presented as bar plots for both the logistic regression (log_model_champion) and random forest (rf_model) models, with each bar representing the magnitude of performance reduction attributable to permuting a specific feature. The plots enable direct comparison of feature importance rankings between the two models, highlighting which features most strongly influence predictions in each case.

Key insights:

  • Distinct top features by model type: The logistic regression model assigns highest importance to IsActiveMember and Geography_Germany, while the random forest model ranks NumOfProducts as the most influential feature.
  • Geography_Germany consistently important: Geography_Germany is among the top two features for both models, indicating a strong and consistent impact on model predictions.
  • Model-specific feature reliance: IsActiveMember is highly important for the logistic regression model but less so for the random forest, whereas Balance is a key driver for the random forest but not for the logistic regression model.
  • Low importance for several features: Features such as EstimatedSalary, Gender_Male, and HasCrCard exhibit low permutation importance in both models, suggesting minimal influence on predictive performance.

The PFI results reveal that feature importance rankings differ substantially between the logistic regression and random forest models, with each model relying on a distinct subset of features for prediction. Geography_Germany emerges as a consistently important variable across both models, while other features such as IsActiveMember and NumOfProducts show model-specific prominence. Several features contribute minimally to predictive accuracy in both models, indicating limited relevance within the current modeling context.
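The permutation mechanics described above can be sketched with scikit-learn's `permutation_importance`; the model and data here are a toy stand-in for the actual champion and challenger:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Toy binary classification problem standing in for the churn dataset
X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Shuffle each feature in turn and measure the drop in ROC AUC
result = permutation_importance(model, X, y, scoring="roc_auc",
                                n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f}")
```

Because the importance is measured as a performance drop under shuffling, it is model-specific by construction, which is why the logistic regression and random forest can legitimately rank the same features differently.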

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:5e7e
ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:3925

SHAP Global Importance Champion Vs Challenger

The SHAPGlobalImportance:champion_vs_challenger test evaluates and visualizes the global feature importance for both the champion (log_model_champion) and challenger (rf_model) models using SHAP values. The results include normalized mean importance plots and SHAP summary plots, which display the relative contribution of each feature to model predictions and the distribution of SHAP values across the dataset. These visualizations facilitate comparison of feature influence and model reasoning between the two models.

Key insights:

  • Champion model dominated by few features: For log_model_champion, IsActiveMember, Geography_Germany, and Gender_Male exhibit the highest normalized SHAP values, with IsActiveMember showing the greatest influence on model output.
  • Challenger model relies on limited features: The rf_model challenger assigns importance almost exclusively to CreditScore and Tenure, with other features barely represented in the importance plots.
  • Distinct feature utilization between models: The champion model distributes importance across a broader set of features, while the challenger model's importance is concentrated on two variables.
  • SHAP value distributions are compact: Both models display relatively tight SHAP value distributions for their most important features, with no evidence of extreme outliers or high variability in the summary plots.

The results indicate that the champion and challenger models differ substantially in their feature utilization, with the champion model leveraging a wider range of predictors and the challenger model focusing on a narrow subset. The absence of high variability or scattered SHAP values suggests stable model behavior in both cases. No anomalies or illogical feature importances are observed in the visualizations.
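The normalized mean importance shown in the SHAP plots is an aggregation of per-sample SHAP values. Given a SHAP value matrix (samples × features), the aggregation can be sketched with NumPy alone, without the `shap` package; the matrix below is illustrative:

```python
import numpy as np

def normalized_mean_shap(shap_values: np.ndarray) -> np.ndarray:
    """Mean absolute SHAP value per feature, scaled so the top feature is 1."""
    mean_abs = np.abs(shap_values).mean(axis=0)
    return mean_abs / mean_abs.max()

# Illustrative SHAP matrix: 4 samples x 3 features
shap_values = np.array([
    [ 0.2, -0.1, 0.05],
    [-0.4,  0.0, 0.10],
    [ 0.3,  0.1, 0.00],
    [-0.1, -0.2, 0.05],
])
print(normalized_mean_shap(shap_values))  # [1.0, 0.4, 0.2]
```

Taking absolute values before averaging matters: a feature that pushes predictions strongly in both directions still registers as important, even though its signed contributions would cancel out.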

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:f5c1
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:9d50
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:8542
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:9d6b

In summary

In this third notebook, you learned how to:

  • Develop a potential challenger model
  • Pass your models and their predictions to the ValidMind Library
  • Run and log comparative robustness and feature importance tests for your champion and challenger models

Next steps

Finalize validation and reporting

Now that you're familiar with the basics of using the ValidMind Library to run and log validation tests, let's learn how to implement some custom tests and wrap up our validation: 4 — Finalize validation and reporting


Copyright © 2023-2026 ValidMind Inc. All rights reserved.
Refer to LICENSE for details.
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial