ValidMind for model validation 3 — Developing a potential challenger model

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this third notebook, develop a potential challenger model and then pass your model and its predictions to ValidMind.

A challenger model is an alternate model that attempts to outperform the champion model, ensuring that the best performing fit-for-purpose model is always considered for deployment. Challenger models also help avoid over-reliance on a single model, and allow testing of new features, algorithms, or data sources without disrupting the production lifecycle.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals.

Prerequisites

In order to develop potential challenger models with this notebook, you'll need to first have:

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be quite familiar to you — as we performed the same actions in the previous notebook, 2 — Start the model validation process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model validation" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2026-01-10 02:26:37,518 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load in the sample Bank Customer Churn Prediction dataset that was used to develop the champion model, which we'll then independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

Preprocess the dataset

We’ll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
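
As a quick sanity check, you can confirm that the rebalanced dataset now contains an equal number of exited and not-exited customers. This is just an optional sketch using plain pandas:

# Confirm that the rebalanced dataset has equal class counts
print(balanced_raw_df["Exited"].value_counts())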

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As you know, before you can run tests you'll need to initialize a ValidMind dataset object with the init_dataset function:

# Register the balanced data; 'balanced_raw_dataset' is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and utilize the output to help us identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

High Pearson Correlation is designed to identify pairs of features within a dataset that exhibit strong linear relationships, with the primary purpose of detecting potential feature redundancy or multicollinearity. This is crucial for ensuring that the predictive model remains interpretable and robust, as high correlations between features can obscure the true impact of individual variables and may lead to overfitting or instability in model estimates.

The test operates by calculating the Pearson correlation coefficient for every possible pair of features in the dataset. The Pearson correlation coefficient is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables, ranging from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear relationship. The test systematically computes these coefficients for all feature pairs, removes self-correlations and duplicate pairs, and then compares the absolute value of each coefficient to a predefined threshold, which in this case is set at 0.3. Any pair with an absolute correlation exceeding this threshold is flagged as a potential risk for multicollinearity. The test then presents the top n strongest correlations, regardless of whether they pass or fail the threshold, providing a transparent view of the most significant linear relationships in the data. The output includes the feature pair, the calculated coefficient, and a Pass or Fail status based on the threshold, allowing users to quickly assess which relationships may warrant further investigation.

The primary advantages of this test include its efficiency and clarity in surfacing linear dependencies between features, which is particularly valuable in the early stages of model development and risk assessment. By highlighting pairs of variables with strong linear associations, the test enables practitioners to proactively address multicollinearity, which can otherwise compromise model interpretability and predictive stability. The transparent tabular output makes it easy to identify and communicate which feature pairs are most strongly related, supporting informed decisions about feature selection, engineering, or regularization. This approach is especially useful in regulated environments or high-stakes applications where model transparency and explainability are paramount, as it provides a straightforward mechanism for documenting and managing potential sources of redundancy.

It should be noted that the test is limited to detecting linear relationships and does not capture more complex, nonlinear dependencies that may exist between features. The Pearson correlation coefficient is also sensitive to outliers, which can distort the measured strength of association and potentially lead to misleading conclusions. Additionally, the test only evaluates pairwise relationships, meaning that it may not identify more intricate forms of multicollinearity involving three or more variables. High correlation coefficients, particularly those exceeding the set threshold, are indicative of potential risk, as they suggest that the features involved may be redundant or could introduce instability into the model. However, the presence of a high correlation does not automatically imply a problem; further analysis is often required to determine the practical impact on model performance and interpretability.

This test shows its results in a tabular format, where each row represents a unique pair of features from the dataset. The columns include the feature pair, the Pearson correlation coefficient (labeled as "Coefficient"), and a Pass or Fail status indicating whether the absolute value of the coefficient exceeds the threshold of 0.3. The coefficients are presented as decimal values, typically ranging from -1 to 1, with positive values indicating direct relationships and negative values indicating inverse relationships. The table is sorted by the absolute value of the coefficient, with the strongest correlations at the top. Notably, only one feature pair, (Age, Exited), has a coefficient (0.339) that exceeds the threshold and is marked as "Fail," while all other pairs have coefficients below the threshold and are marked as "Pass." The remaining coefficients range from -0.1917 to 0.0367, indicating generally weak linear relationships among the other feature pairs. The table provides a clear and concise summary of the linear dependencies present in the dataset, making it straightforward to identify which pairs may require further scrutiny.

The test results reveal the following key insights:

  • Single Feature Pair Exceeds Correlation Threshold: Only the pair (Age, Exited) has a Pearson correlation coefficient of 0.339, surpassing the threshold of 0.3 and resulting in a "Fail" status, indicating a moderate linear relationship between these two features.
  • All Other Feature Pairs Show Weak Linear Relationships: The remaining nine feature pairs have coefficients ranging from -0.1917 to 0.0367, all below the threshold, and are marked as "Pass," suggesting minimal risk of multicollinearity among these pairs.
  • Distribution of Correlation Coefficients Is Centered Near Zero: Most coefficients are close to zero, indicating that the majority of feature pairs do not exhibit strong linear associations, which supports the overall independence of features in the dataset.
  • Negative and Positive Correlations Are Both Present: The coefficients include both positive and negative values, with the strongest negative correlation observed between (IsActiveMember, Exited) at -0.1917 and (Balance, NumOfProducts) at -0.171, though these remain below the risk threshold.
  • No Evidence of Widespread Redundancy: The absence of multiple pairs exceeding the threshold suggests that the dataset does not suffer from pervasive feature redundancy or multicollinearity, aside from the single flagged pair.

Based on these results, the dataset demonstrates a generally low level of linear dependency among its features, with only one pair, (Age, Exited), exhibiting a moderate correlation that exceeds the predefined threshold. This observation indicates that, with the exception of this pair, the features are largely independent in terms of linear relationships, reducing the likelihood of multicollinearity adversely affecting model interpretability or stability. The presence of both positive and negative coefficients, all within a narrow range, further supports the conclusion that the dataset is well-structured with respect to linear feature interactions. The clear separation between the single "Fail" and the multiple "Pass" results provides a straightforward narrative about the dataset's structure, highlighting the isolated nature of the moderate correlation and reinforcing the overall robustness of the feature set in terms of linear independence. This pattern suggests that, aside from the specific relationship between Age and Exited, the model is unlikely to be compromised by feature redundancy or instability arising from strong linear associations among input variables.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3390 Fail
(IsActiveMember, Exited) -0.1917 Pass
(Balance, NumOfProducts) -0.1710 Pass
(Balance, Exited) 0.1570 Pass
(NumOfProducts, Exited) -0.0609 Pass
(Age, Balance) 0.0508 Pass
(NumOfProducts, IsActiveMember) 0.0499 Pass
(Tenure, IsActiveMember) -0.0465 Pass
(Age, NumOfProducts) -0.0462 Pass
(Tenure, EstimatedSalary) 0.0367 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3390 Fail
1 (IsActiveMember, Exited) -0.1917 Pass
2 (Balance, NumOfProducts) -0.1710 Pass
3 (Balance, Exited) 0.1570 Pass
4 (NumOfProducts, Exited) -0.0609 Pass
5 (Age, Balance) 0.0508 Pass
6 (NumOfProducts, IsActiveMember) 0.0499 Pass
7 (Tenure, IsActiveMember) -0.0465 Pass
8 (Age, NumOfProducts) -0.0462 Pass
9 (Tenure, EstimatedSalary) 0.0367 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']

We can then re-initialize the dataset with the highly correlated features removed and a different input_id, then re-run the test for confirmation:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

High Pearson Correlation is designed to identify pairs of features within a dataset that exhibit strong linear relationships, with the primary purpose of detecting potential feature redundancy or multicollinearity. This is crucial for ensuring that the predictive model remains interpretable and robust, as high correlations between features can obscure the true impact of individual variables and may lead to overfitting or instability in model estimates.

The test operates by calculating the Pearson correlation coefficient for every possible pair of features in the dataset. The Pearson correlation coefficient is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables, ranging from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear relationship. The test systematically computes these coefficients for all feature pairs, removes self-correlations and duplicate pairs, and then sorts the results by the absolute value of the coefficient. A pre-defined threshold, set at 0.3 in this case, is used to determine whether a pair is considered highly correlated. Each pair is assigned a "Pass" if the absolute value of the coefficient is below the threshold, or a "Fail" if it exceeds the threshold. The test then returns the top n pairs with the strongest correlations, providing a clear view of the most significant linear relationships present in the data.

The primary advantages of this test include its efficiency and transparency in highlighting linear dependencies between features. By systematically surfacing the strongest correlations, it enables data scientists and risk managers to quickly identify and address potential sources of multicollinearity, which can compromise model interpretability and predictive stability. The test’s output is straightforward, presenting clear pairs of features, their correlation coefficients, and pass/fail status, which aids in early detection of problematic relationships before model training. This proactive approach supports the development of more robust and interpretable models, especially in regulated environments where transparency and explainability are paramount.

It should be noted that the test is limited to detecting only linear relationships, as measured by the Pearson correlation coefficient, and does not capture nonlinear dependencies that may also impact model performance. The metric is sensitive to outliers, which can disproportionately influence the calculated coefficients and potentially mask or exaggerate true relationships. Additionally, the test focuses exclusively on pairwise relationships, meaning it may overlook more complex interactions involving three or more features. High correlation coefficients, particularly those exceeding the threshold, are indicative of potential multicollinearity, which can undermine the reliability of model parameter estimates and complicate the interpretation of individual feature effects.

This test shows its results in the form of a table, where each row represents a unique pair of features from the dataset. The columns include the feature pair, the Pearson correlation coefficient (labeled as "Coefficient"), and a "Pass/Fail" status indicating whether the absolute value of the coefficient is below the threshold of 0.3. The coefficients are presented as decimal values, typically ranging from -1 to 1, with negative values indicating inverse relationships and positive values indicating direct relationships. In this particular output, all coefficients fall within the range of approximately -0.19 to 0.03, and every pair is marked as "Pass," signifying that none of the feature pairs exceed the specified threshold. The table is sorted by the absolute value of the coefficient, with the strongest correlations listed first. Notable observations include the absence of any pairs with coefficients near the threshold, suggesting a lack of strong linear dependencies among the top feature pairs. The results provide a clear and interpretable summary of the linear relationships present in the dataset, facilitating straightforward assessment of potential multicollinearity risks.

The test results reveal the following key insights:

  • No Feature Pairs Exceed Correlation Threshold: All feature pairs have absolute Pearson correlation coefficients below the 0.3 threshold, with the highest observed value being -0.1917 for the pair (IsActiveMember, Exited).
  • Low to Moderate Linear Relationships Across Features: The coefficients for the top ten pairs range from -0.1917 to 0.0273, indicating only weak to very weak linear associations between features.
  • Balanced Distribution of Positive and Negative Correlations: Both positive and negative coefficients are present, with no clear dominance of one direction, suggesting that the relationships between features are not systematically aligned.
  • No Evidence of Multicollinearity Among Top Pairs: The absence of high correlation values implies that the dataset does not exhibit significant multicollinearity among the most strongly related feature pairs.
  • Consistent Pass Status Across All Pairs: Every feature pair in the output is marked as "Pass," reinforcing the observation that the dataset is free from problematic linear dependencies within the evaluated pairs.

Based on these results, the dataset demonstrates a stable and well-structured feature space with respect to linear relationships, as none of the evaluated feature pairs approach the threshold for high correlation. The observed coefficients are uniformly low, indicating that the features are largely independent in terms of linear association, which supports the interpretability and reliability of subsequent modeling efforts. The balanced mix of positive and negative correlations further suggests that there are no systematic patterns of redundancy or inverse relationships that could complicate model estimation. The consistent "Pass" status across all pairs provides additional assurance that multicollinearity is not a concern within the top correlated features, allowing for greater confidence in the distinct contribution of each variable to the model. These characteristics collectively indicate that the dataset is well-suited for modeling applications where feature independence is desirable, and the risk of inflated variance or unstable parameter estimates due to linear dependencies is minimal.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.1917 Pass
(Balance, NumOfProducts) -0.1710 Pass
(Balance, Exited) 0.1570 Pass
(NumOfProducts, Exited) -0.0609 Pass
(NumOfProducts, IsActiveMember) 0.0499 Pass
(Tenure, IsActiveMember) -0.0465 Pass
(Tenure, EstimatedSalary) 0.0367 Pass
(HasCrCard, IsActiveMember) -0.0329 Pass
(IsActiveMember, EstimatedSalary) 0.0303 Pass
(CreditScore, IsActiveMember) 0.0273 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and the highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
5974 626 7 113014.70 2 1 1 56646.28 0 False False True
2566 646 6 124445.52 1 1 0 88481.32 0 False False True
3828 621 8 0.00 2 1 0 36122.96 0 False True True
1134 555 4 120392.99 1 1 0 177719.88 1 True False False
616 747 7 116313.57 1 1 1 190696.35 1 True False True
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's import the champion model submitted by the model development team as a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning:

Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations

Training a potential challenger model

We're curious how an alternate model compares to our champion model, so let's train a challenger model as a basis for our testing.

Our champion logistic regression model is a simpler, parametric model that assumes a linear relationship between the independent variables and the log-odds of the outcome. While logistic regression may not capture complex patterns as effectively, it offers a high degree of interpretability and is easier to explain to stakeholders. However, model risk is not assessed on any single factor in isolation, but rather by weighing trade-offs between predictive performance, ease of interpretability, and overall alignment with business objectives.
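
If you'd like to see that interpretability first-hand, a minimal optional sketch like the following inspects the champion's learned coefficients. It assumes the unpickled object is a fitted scikit-learn LogisticRegression, and that feature names were captured at training time:

# Optional: inspect the champion's learned coefficients (interpretability check)
# Assumes the unpickled champion is a fitted scikit-learn LogisticRegression;
# feature_names_in_ is only present if it was fit on a DataFrame
import numpy as np

feature_names = getattr(log_reg, "feature_names_in_", np.arange(log_reg.coef_.shape[1]))
for name, coef in sorted(zip(feature_names, log_reg.coef_[0]), key=lambda pair: abs(pair[1]), reverse=True):
    print(f"{name}: {coef:+.4f}")
print(f"Intercept: {log_reg.intercept_[0]:+.4f}")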

Random forest classification model

A random forest classification model is an ensemble machine learning algorithm that uses multiple decision trees to classify data. In ensemble learning, multiple models are combined to improve prediction accuracy and robustness.

Random forest classification models generally have higher accuracy because they capture complex, non-linear relationships, but as a result they lack transparency in their predictions.

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)
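
Although the ensemble itself is harder to explain than the logistic regression, you can still get a rough sense of which inputs drive its predictions. Here is a small optional sketch using scikit-learn's impurity-based feature importances from the fitted random forest:

# Optional: rank the random forest's impurity-based feature importances
import pandas as pd

importances = pd.Series(rf_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))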

Initializing the model objects

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, which can then be passed to other functions for analysis and tests.

You simply initialize this model object with vm.init_model():

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

Assign predictions

With our models registered, we'll move on to assigning both the predictive probabilities coming directly from each model's predictions, and the binary class predictions obtained by applying a cutoff threshold to those probabilities.

  • The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
  • This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

# Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-01-10 02:27:15,421 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 02:27:15,423 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 02:27:15,423 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 02:27:15,425 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-10 02:27:15,427 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 02:27:15,428 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 02:27:15,428 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 02:27:15,430 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-10 02:27:15,432 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 02:27:15,452 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 02:27:15,452 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 02:27:15,473 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-10 02:27:15,475 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 02:27:15,486 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 02:27:15,486 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 02:27:15,497 - INFO(validmind.vm_models.dataset.utils): Done running predict()
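
For reference, the class predictions assigned above correspond to thresholding each model's predicted probability of the positive class. The following is a conceptual sketch in plain scikit-learn (not the ValidMind API), assuming the default 0.5 cutoff:

# Conceptual equivalent of the assigned class predictions:
# threshold the positive-class probability at an assumed 0.5 cutoff
proba_positive = log_reg.predict_proba(X_test)[:, 1]
manual_classes = (proba_positive >= 0.5).astype(int)
print(manual_classes[:10])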

Running model evaluation tests

With our setup complete, let's run the rest of our validation tests. Since we have already verified the data quality of the dataset used to train our champion model, we will now focus on comprehensive performance evaluations of both the champion and challenger models.

Run model performance tests

Let's run some performance tests, beginning with independent testing of our champion logistic regression model, then moving on to our potential challenger model.

Use vm.tests.list_tests() to identify all the model performance tests for classification:


vm.tests.list_tests(tags=["model_performance"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.CalibrationCurve Calibration Curve Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... True False ['model', 'dataset'] {'n_bins': {'type': 'int', 'default': 10}} ['sklearn', 'model_performance', 'classification'] ['classification']
validmind.model_validation.sklearn.ClassifierPerformance Classifier Performance Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... False True ['dataset', 'model'] {'average': {'type': 'str', 'default': 'macro'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ConfusionMatrix Confusion Matrix Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... True False ['dataset', 'model'] {'threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.HyperParametersTuning Hyper Parameters Tuning Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... False True ['model', 'dataset'] {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': 'Union', 'default': None}, 'thresholds': {'type': 'Union', 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} ['sklearn', 'model_performance'] ['clustering', 'classification']
validmind.model_validation.sklearn.MinimumAccuracy Minimum Accuracy Checks if the model's prediction accuracy meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.7}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1Score Minimum F1 Score Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScore Minimum ROCAUC Score Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ModelsPerformanceComparison Models Performance Comparison Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... False True ['dataset', 'models'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndex Population Stability Index Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... True True ['datasets', 'model'] {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurve Precision Recall Curve Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurve ROC Curve Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrors Regression Errors Assesses the performance and error distribution of a regression model using various error metrics.... False True ['model', 'dataset'] {} ['sklearn', 'model_performance'] ['regression', 'classification']
validmind.model_validation.sklearn.TrainingTestDegradation Training Test Degradation Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... False True ['datasets', 'model'] {'max_threshold': {'type': 'float', 'default': 0.1}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.statsmodels.GINITable GINI Table Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets.... False True ['dataset', 'model'] {} ['model_performance'] ['classification']
validmind.ongoing_monitoring.CalibrationCurveDrift Calibration Curve Drift Evaluates changes in probability calibration between reference and monitoring datasets.... True True ['datasets', 'model'] {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDrift Class Discrimination Drift Compares classification discrimination metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassificationAccuracyDrift Classification Accuracy Drift Compares classification accuracy metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDrift Confusion Matrix Drift Compares confusion matrix metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDrift ROC Curve Drift Compares ROC curves between reference and monitoring datasets.... True False ['datasets', 'model'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']

We'll isolate the specific tests we want to run in a list called mpt:

As we learned in the previous notebook 2 — Start the model validation process, you can use a custom result_id to tag the individual result with a unique identifier by appending this result_id to the test_id with a : separator. We'll append an identifier for our champion model here:

mpt = [
    "validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion",
    "validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion",
    "validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion",
    "validmind.model_validation.sklearn.MinimumF1Score:logreg_champion",
    "validmind.model_validation.sklearn.ROCCurve:logreg_champion"
]

Evaluate performance of the champion model

Now, let's run and log our batch of model performance tests using our testing dataset (vm_test_ds) for our champion model:

  • The test set serves as a proxy for real-world data, providing an unbiased estimate of model performance since it was not used during training or tuning.
  • The test set also acts as protection against selection bias and model tweaking, giving a final, more unbiased checkpoint.
for test in mpt:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_test_ds,
            "model": vm_log_model,
        },
    ).log()

Classifier Performance Logreg Champion

Classifier Performance: logreg champion is designed to provide a comprehensive evaluation of classification models by quantifying their ability to correctly identify and distinguish between different classes. The primary purpose of this test is to assess the effectiveness of a model in making accurate predictions, using a suite of standard performance metrics that capture various aspects of classification quality, including precision, recall, F1-Score, accuracy, and the area under the receiver operating characteristic curve (ROC AUC). This enables a thorough understanding of the model’s strengths and weaknesses in both binary and multiclass settings.

The test operates by generating a detailed report that includes precision, recall, and F1-Score for each class, as well as macro and weighted averages of these metrics to provide an overall assessment. Precision measures the proportion of positive identifications that are actually correct, reflecting the model’s ability to avoid false positives. Recall quantifies the proportion of actual positives that are correctly identified, indicating the model’s sensitivity to true cases. The F1-Score harmonizes precision and recall into a single metric, balancing the trade-off between them. Accuracy represents the overall proportion of correct predictions out of all predictions made, offering a general sense of model correctness. The ROC AUC score evaluates the model’s ability to distinguish between classes across all possible classification thresholds, with values ranging from 0.5 (no discrimination) to 1.0 (perfect discrimination). These metrics are calculated using the model’s predictions and the true class labels, and are interpreted in the context of the problem domain, with higher values generally indicating better performance.

The primary advantages of this test include its versatility and comprehensiveness, as it is capable of evaluating both binary and multiclass classification models using a range of widely recognized metrics. By incorporating precision, recall, F1-Score, and accuracy, the test provides a multi-faceted view of model performance, capturing both the ability to correctly identify positive cases and to avoid false alarms. The inclusion of macro and weighted averages ensures that the evaluation remains robust even in the presence of class imbalance, while the ROC AUC metric offers valuable insight into the model’s discriminatory power, particularly for unbalanced datasets. This makes the test especially useful for model comparison, selection, and monitoring in diverse real-world scenarios.

It should be noted that the test has certain limitations and potential risks. The accuracy and interpretability of the results depend on the representativeness of the test dataset; if the data does not reflect real-world distributions, the metrics may not generalize. The test assumes that class labels are correctly identified and that the classification task is well-defined, which may not always hold in practice. Additionally, while the test provides a broad overview of performance, it may not capture nuanced behaviors such as model calibration or the impact of rare classes. Signs of high risk include low values for precision, recall, F1-Score, accuracy, or ROC AUC, as well as significant imbalances between precision and recall, which may indicate poor or unstable model performance.

This test shows the results in the form of two tables: one summarizing precision, recall, and F1-Score for each class, along with macro and weighted averages, and another presenting the overall accuracy and ROC AUC scores. The first table lists each class in the model, with columns for precision, recall, and F1-Score, allowing for direct comparison of performance across classes. The macro and weighted averages provide aggregate measures that account for class distribution and balance. The second table displays the overall accuracy, representing the proportion of correct predictions, and the ROC AUC, indicating the model’s ability to distinguish between classes. All metrics are presented as decimal values between 0 and 1, where higher values denote better performance. Notably, the precision, recall, and F1-Score for both classes are closely aligned, with values around 0.67, and the macro and weighted averages are nearly identical, suggesting balanced performance. The accuracy is 0.6754, and the ROC AUC is 0.7051, indicating moderate discriminatory power. These results suggest that the model performs consistently across classes, with no significant disparities or outliers in the reported metrics.

The test results reveal the following key insights:

  • Balanced Class Performance: Both classes exhibit similar precision, recall, and F1-Score values, with class 0 showing precision of 0.6726, recall of 0.6933, and F1-Score of 0.6828, while class 1 has precision of 0.6785, recall of 0.6573, and F1-Score of 0.6677, indicating no substantial performance gap between classes.
  • Consistent Aggregate Metrics: The macro and weighted averages for precision, recall, and F1-Score are all approximately 0.6753 to 0.6755, reflecting uniform model behavior across the dataset and suggesting that class imbalance does not significantly affect overall performance.
  • Moderate Overall Accuracy: The model achieves an accuracy of 0.6754, meaning that approximately 67.5% of predictions are correct, which is indicative of moderate predictive capability in the context of the evaluated dataset.
  • Acceptable Discriminatory Power: The ROC AUC score of 0.7051 demonstrates that the model has a reasonable ability to distinguish between the two classes, with performance above the random baseline of 0.5 but not approaching the ideal of 1.0.
  • Absence of Extreme Values: No metric falls below 0.65 or exceeds 0.71, indicating stable and consistent performance without significant outliers or areas of pronounced weakness.

Based on these results, the model demonstrates a stable and balanced classification performance across both classes, with precision, recall, and F1-Score values closely aligned and aggregate metrics reinforcing this consistency. The accuracy of 0.6754 suggests that the model correctly predicts the class in roughly two-thirds of cases, while the ROC AUC of 0.7051 indicates moderate but not exceptional discriminatory power. The absence of large disparities between class-specific metrics and the similarity between macro and weighted averages imply that the model does not favor one class over the other and is not unduly affected by class imbalance. The results collectively characterize the model as reliable and consistent within the evaluated dataset, with no evidence of severe misclassification or instability. The observed performance metrics provide a clear and objective profile of the model’s behavior, supporting its use in scenarios where moderate accuracy and balanced class treatment are acceptable.

Tables

Precision, Recall, and F1

Class Precision Recall F1
0 0.6726 0.6933 0.6828
1 0.6785 0.6573 0.6677
Weighted Average 0.6755 0.6754 0.6753
Macro Average 0.6755 0.6753 0.6753

Accuracy and ROC AUC

Metric Value
Accuracy 0.6754
ROC AUC 0.7051
2026-01-10 02:27:40,401 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion does not exist in model's document

Confusion Matrix Logreg Champion

Confusion Matrix:logreg_champion is designed to evaluate and visually represent the predictive performance of a classification machine learning model by quantifying the counts of correct and incorrect predictions across all possible classes. Its primary purpose is to provide a clear breakdown of the model’s ability to correctly identify true positives, true negatives, false positives, and false negatives, which are fundamental to understanding model accuracy and error types.

The test operates by comparing the predicted class labels generated by the model to the actual class labels from the test dataset. This comparison is structured into a confusion matrix, which is a two-dimensional table where each row represents the actual class and each column represents the predicted class. The matrix is populated with counts of predictions falling into each category: true positives (correctly predicted positive cases), true negatives (correctly predicted negative cases), false positives (incorrectly predicted positive cases), and false negatives (incorrectly predicted negative cases). The confusion matrix is then visualized as a heatmap using Plotly’s annotated heatmap functionality, which enhances interpretability by providing both color intensity and numerical annotation for each cell. The values in the matrix are non-negative integers, with higher values along the diagonal (true positives and true negatives) generally indicating better model performance, while higher off-diagonal values (false positives and false negatives) suggest areas where the model is making errors. This approach does not directly provide summary statistics like accuracy, precision, or recall, but it forms the basis for calculating these metrics.

The primary advantages of this test include its ability to deliver a comprehensive and intuitive visual summary of a classification model’s performance. By explicitly displaying the counts of each prediction type, the confusion matrix allows practitioners to quickly identify the types and frequencies of errors the model is making. This is particularly valuable in multi-class settings or when the costs of different error types vary, as it enables targeted analysis of specific misclassification patterns. The heatmap visualization further aids in rapid assessment by highlighting areas of strength and weakness through color gradients and annotations. This test is especially useful for diagnosing model behavior in complex or imbalanced datasets, as it provides granular insight into how the model handles each class, supporting more informed model evaluation and refinement.

It should be noted that the confusion matrix has several limitations and potential risks. In datasets with significant class imbalance, the matrix may give a misleading impression of model performance, as high counts in the majority class can mask poor performance in minority classes. The confusion matrix itself does not provide a single summary metric, requiring additional calculations to derive measures such as precision, recall, or F1-score for a more holistic assessment. Interpretation can be challenging without these derived metrics, particularly for non-technical stakeholders. Furthermore, the matrix is descriptive rather than inferential, offering no statistical hypothesis testing or confidence intervals. High values of false positives or false negatives, as indicated in the matrix, are signs of increased risk, as they reflect the model’s inability to correctly classify certain cases, which may have significant operational or regulatory implications depending on the application.

This test shows a confusion matrix presented as a color-annotated heatmap, where the x-axis represents the predicted class labels (0 and 1) and the y-axis represents the true class labels (0 and 1). Each cell in the matrix contains both a numerical count and a descriptive label: True Negatives (TN) in the bottom-left, False Positives (FP) in the bottom-right, False Negatives (FN) in the top-left, and True Positives (TP) in the top-right. The color intensity of each cell corresponds to the magnitude of the count, with darker shades indicating higher values. The matrix displays the following counts: 226 true negatives, 100 false positives, 110 false negatives, and 211 true positives. To interpret the matrix, one reads across each row to see how actual cases of each class are distributed among the predicted classes. The diagonal cells (TN and TP) represent correct classifications, while the off-diagonal cells (FP and FN) represent misclassifications. The range of values in this matrix spans from 100 to 226, with the highest count in the true negative cell and the lowest in the false positive cell. Notably, the number of false negatives (110) and false positives (100) are substantial, indicating that the model is making a significant number of both types of errors. The matrix provides a clear, immediate visual and quantitative summary of the model’s classification behavior on the test set.

The test results reveal the following key insights:

  • Balanced Distribution of Correct and Incorrect Classifications: The confusion matrix shows that the model achieves 226 true negatives and 211 true positives, indicating a relatively balanced ability to correctly classify both classes, but with a notable presence of errors.
  • Substantial False Negative and False Positive Rates: There are 110 false negatives and 100 false positives, which are significant relative to the true positive and true negative counts, suggesting that the model is prone to both types of misclassification.
  • True Negatives Slightly Outnumber True Positives: The model correctly identifies more negative cases (226) than positive cases (211), which may reflect underlying class distributions or model bias.
  • False Negatives Exceed False Positives: The count of false negatives (110) is slightly higher than that of false positives (100), indicating that the model is more likely to miss positive cases than to incorrectly label negatives as positives.
  • Diagonal Dominance with Noticeable Off-Diagonal Values: While the diagonal cells (correct classifications) have the highest counts, the off-diagonal cells (errors) are not negligible, highlighting areas where the model’s predictive power could be improved.

Based on these results, the confusion matrix for the logreg_champion model demonstrates that the model is capable of correctly classifying a substantial number of both positive and negative cases, as evidenced by the high counts of true positives and true negatives. However, the presence of considerable false negatives and false positives indicates that the model’s predictions are not consistently reliable, with a meaningful proportion of both types of errors. The slightly higher number of true negatives compared to true positives suggests a marginally better performance in identifying negative cases, while the higher false negative count relative to false positives points to a tendency to under-predict the positive class. The overall distribution of values in the matrix reflects a model that is neither highly conservative nor highly aggressive in its predictions, but rather one that exhibits a moderate balance between sensitivity and specificity. These observations provide a detailed characterization of the model’s classification behavior, highlighting both its strengths in correct identification and its limitations in error rates, which are critical for understanding its suitability for deployment in contexts where the costs of misclassification are significant.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion:4e92
2026-01-10 02:28:08,595 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion does not exist in model's document

❌ Minimum Accuracy Logreg Champion

Minimum Accuracy:logreg_champion is designed to assess whether the model’s prediction accuracy meets or exceeds a specified minimum threshold, serving as a fundamental check on the model’s ability to correctly classify instances within a given dataset. The primary purpose of this test is to ensure that the model achieves a baseline level of predictive performance, which is critical for establishing the model’s suitability for deployment in production or regulatory environments.

The test operates by calculating the model’s accuracy score, which is the proportion of correct predictions out of the total number of predictions made. This is achieved by comparing the true class labels from the dataset with the predicted class labels generated by the model. The accuracy metric is computed using a standard method, such as the one provided by the scikit-learn library, which counts the number of exact matches between the true and predicted labels and divides this by the total number of samples. The resulting score ranges from 0 to 1, where 1 indicates perfect accuracy and 0 indicates no correct predictions. The computed accuracy is then compared to a predefined threshold, commonly set at 0.7, to determine if the model’s performance is acceptable. If the accuracy meets or surpasses this threshold, the test is marked as passed; otherwise, it is marked as failed. This approach provides a clear, quantitative benchmark for model performance, making it straightforward to interpret and communicate.

The primary advantages of this test include its simplicity and directness, offering a holistic measure of model performance that is easy to understand and communicate to both technical and non-technical stakeholders. Because accuracy is a single, aggregate metric, it provides a quick snapshot of how well the model is performing across all classes, making it particularly useful in scenarios where class distributions are balanced. The test’s versatility allows it to be applied to both binary and multiclass classification problems, and its reliance on a well-established metric ensures consistency and comparability across different models and datasets. This makes the Minimum Accuracy test an effective initial screening tool for model validation and monitoring.

It should be noted that the Minimum Accuracy test has several limitations and potential risks. One key limitation is that accuracy can be misleading in situations where the dataset is imbalanced, as the metric may be disproportionately influenced by the majority class, masking poor performance on minority classes. The test does not provide any information about the types of errors the model is making, such as false positives or false negatives, nor does it capture more nuanced aspects of model performance like precision, recall, or the ability to handle specific subpopulations. Persistent failure to meet the threshold is a sign of high risk, indicating that the model may not be reliable for its intended use. Additionally, the test’s focus on overall correctness may not be sufficient for applications where the cost of different types of errors varies significantly.

This test shows the results in a tabular format, presenting three columns: Score, Threshold, and Pass/Fail. The Score column displays the model’s computed accuracy, which in this case is 0.6754, representing the proportion of correct predictions out of all predictions made. The Threshold column indicates the minimum acceptable accuracy, set at 0.7 for this test. The Pass/Fail column provides a categorical outcome based on whether the Score meets or exceeds the Threshold. In this instance, the model’s accuracy falls below the required threshold, resulting in a “Fail” outcome. The table is straightforward to interpret: each row corresponds to a single test run, and the values are presented as decimals for accuracy and threshold, with the pass/fail status clearly indicated. The range for the Score is from 0 to 1, and the threshold is similarly bounded. Notably, the model’s accuracy is only slightly below the threshold, suggesting that while the model is performing close to the required standard, it does not meet the minimum criterion for acceptance. There are no additional breakdowns or subgroup analyses in this output, and the result is presented as a single, aggregate measure.

The test results reveal the following key insights:

  • Model Accuracy Falls Short of Threshold: The model achieves an accuracy score of 0.6754, which is below the specified minimum threshold of 0.7, resulting in a fail outcome for this test.
  • Threshold Provides Clear Benchmark: The threshold value of 0.7 serves as a definitive benchmark for acceptable performance, and the model’s score is close but insufficient to meet this requirement.
  • Binary Pass/Fail Outcome Simplifies Interpretation: The Pass/Fail column provides an unambiguous assessment of whether the model’s accuracy is adequate, with the current result indicating that the model does not satisfy the minimum standard.
  • No Evidence of Severe Underperformance: While the model does not pass, the accuracy score is not drastically below the threshold, suggesting that the model is not severely underperforming but requires improvement to meet the acceptance criteria.

Based on these results, the model demonstrates an accuracy that is marginally below the established minimum threshold, indicating that its overall predictive performance is close to, but not sufficient for, the required standard. The test provides a clear and objective assessment of the model’s ability to correctly classify instances, with the accuracy score serving as a direct measure of performance relative to a predefined benchmark. The binary pass/fail outcome facilitates straightforward interpretation and decision-making, highlighting that the model does not currently meet the acceptance criteria. The proximity of the score to the threshold suggests that the model is not fundamentally flawed but may require further refinement or adjustment to achieve the desired level of accuracy. The results underscore the importance of considering both the absolute accuracy and the context of the threshold when evaluating model suitability, as well as the need to complement this test with additional metrics for a more comprehensive assessment of model performance.

Tables

Score Threshold Pass/Fail
0.6754 0.7 Fail
2026-01-10 02:28:31,990 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion does not exist in model's document

✅ Minimum F1 Score Logreg Champion

Minimum F1 Score: logreg_champion is designed to assess whether the model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring that the model achieves a balanced performance between precision and recall. This test is particularly important in classification tasks where the distribution of classes may be imbalanced, as it provides a more informative measure of model effectiveness than accuracy alone.

The test operates by calculating the F1 score on the validation dataset using scikit-learn's metrics in Python. For binary classification problems, the standard F1 score is computed, which represents the harmonic mean of precision and recall. For multi-class problems, macro averaging is used, which calculates the F1 score independently for each class and then averages the results, treating all classes equally regardless of their frequency. The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0 indicates the worst possible balance. The computed F1 score is then compared to a predefined threshold, which serves as the minimum acceptable performance standard. If the model's F1 score falls below this threshold, it is flagged as not meeting the required performance, indicating potential issues with the model's ability to balance false positives and false negatives.
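
As a rough illustration of that calculation, the sketch below uses scikit-learn's f1_score on hypothetical labels and predictions; the 0.5 threshold matches the value used in this run, and the average="macro" option noted in the comment is what the multi-class case would use.

from sklearn.metrics import f1_score

# Hypothetical labels and predictions for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

threshold = 0.5  # minimum acceptable F1 score used in this run

# Binary case: the standard F1 score (harmonic mean of precision and recall);
# for multi-class targets, pass average="macro" to weight every class equally.
score = f1_score(y_true, y_pred)
print(f"F1: {score:.4f} | {'Pass' if score >= threshold else 'Fail'}")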

The primary advantages of this test include its ability to provide a balanced assessment of model performance by considering both false positives and false negatives, which is especially valuable in situations with imbalanced class distributions. The F1 score is less susceptible to being skewed by the majority class, making it a more reliable indicator of model effectiveness in such scenarios. Additionally, the flexibility to set a custom threshold allows organizations to define performance standards that align with their specific risk tolerance and business objectives. This adaptability ensures that the test remains relevant across a wide range of applications and model types.

It should be noted that the F1 score assumes equal importance for precision and recall, which may not always align with real-world business requirements where the costs of false positives and false negatives differ. The test may not be suitable for all types of models or tasks, particularly those where other metrics such as precision, recall, or ROC-AUC are more appropriate. Furthermore, a model that passes the F1 threshold may still exhibit weaknesses in other areas not captured by this metric. The test also identifies high risk if the F1 score is below the established threshold, signaling that the model may not be effectively distinguishing between classes or may be biased toward one class.

This test shows the results in a tabular format, presenting three columns: "Score," "Threshold," and "Pass/Fail." The "Score" column displays the F1 score achieved by the model on the validation set, which in this case is 0.6677. The "Threshold" column indicates the minimum acceptable F1 score, set at 0.5 for this test. The "Pass/Fail" column communicates whether the model's performance meets the required standard, with a "Pass" indicating that the F1 score is above the threshold. The table is straightforward to interpret: if the "Score" is greater than or equal to the "Threshold," the model passes the test; otherwise, it fails. The F1 score of 0.6677 falls within the typical range for this metric and is notably above the threshold, suggesting that the model achieves a reasonable balance between precision and recall. There are no additional breakdowns or subgroup analyses in this result, as the test focuses solely on the overall F1 score for the validation set.

The test results reveal the following key insights:

  • Model Achieves Required F1 Score: The model's F1 score on the validation set is 0.6677, which exceeds the predefined threshold of 0.5, indicating that the model meets the minimum performance standard for balanced precision and recall.
  • Clear Pass Outcome: The "Pass/Fail" column explicitly shows a "Pass," confirming that the model's performance is satisfactory according to the established criteria.
  • Score Significantly Above Threshold: The F1 score is not only above the threshold but exceeds it by a margin of 0.1677, suggesting a comfortable buffer and reducing the likelihood of borderline performance.
  • Single Metric Focus: The test result is based solely on the overall F1 score, with no additional class-level or subgroup breakdowns, emphasizing the aggregate performance of the model.

Based on these results, the model demonstrates a balanced performance between precision and recall on the validation set, as evidenced by an F1 score of 0.6677 that comfortably surpasses the minimum threshold of 0.5. The clear "Pass" outcome indicates that the model is effective at managing the trade-off between false positives and false negatives in this context. The margin by which the score exceeds the threshold suggests that the model's performance is not marginal but rather solidly within acceptable bounds. The focus on a single, aggregate F1 score provides a straightforward assessment of overall model effectiveness, though it does not offer insights into class-specific performance or potential disparities across different segments. Overall, the results indicate that the model is well-calibrated for balanced classification tasks and is likely to perform reliably in scenarios where both precision and recall are important.

Tables

Score Threshold Pass/Fail
0.6677 0.5 Pass
2026-01-10 02:28:51,423 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:logreg_champion does not exist in model's document

ROC Curve Logreg Champion

ROC Curve: logreg_champion is designed to evaluate the performance of a binary classification model by visualizing its ability to distinguish between two classes and quantifying this ability using the Area Under the Curve (AUC) metric. The primary purpose of this test is to provide a comprehensive assessment of the model’s discriminative power across all possible classification thresholds, enabling a robust understanding of how well the model separates positive and negative cases.

The test operates by first generating predicted probabilities for each instance in the test dataset using the selected binary classification model. These probabilities, along with the true class labels, are used to calculate the True Positive Rate (TPR) and False Positive Rate (FPR) at various threshold levels. The ROC curve is then plotted with the FPR on the x-axis and the TPR on the y-axis, illustrating the trade-off between sensitivity and specificity as the threshold changes. A reference line representing random classification (AUC of 0.5) is included for context. The AUC score, which ranges from 0 to 1, is computed as a summary statistic of the ROC curve; a value closer to 1 indicates strong discriminative ability, while a value near 0.5 suggests performance no better than random guessing. The test also ensures that any infinite values in the threshold calculations are removed to maintain result integrity. The resulting ROC curve, AUC score, and associated thresholds are saved for documentation and future analysis.
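
The same quantities can be approximated outside ValidMind with scikit-learn and matplotlib. The sketch below is illustrative only: the labels and probabilities are hypothetical, and the removal of the infinite threshold mirrors the clean-up step described above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Hypothetical true labels and predicted positive-class probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.20, 0.40, 0.35, 0.80, 0.10, 0.65, 0.55, 0.70])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Drop the infinite threshold that scikit-learn prepends
# (its point adds no area, so the AUC is unaffected)
finite = np.isfinite(thresholds)
fpr, tpr, thresholds = fpr[finite], tpr[finite], thresholds[finite]

roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"model (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random (AUC = 0.50)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()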

The primary advantages of this test include its ability to provide a holistic, threshold-independent view of model performance, which is particularly valuable in scenarios where the optimal classification threshold is not predetermined or may vary depending on operational requirements. The ROC curve visually demonstrates the model’s performance across the entire range of possible thresholds, while the AUC condenses this information into a single, interpretable metric that remains consistent regardless of class distribution. This makes the test especially useful for comparing models or monitoring performance over time, even when the underlying data distribution changes. Additionally, the ROC-AUC framework is robust to imbalanced datasets, as it focuses on ranking predictions rather than absolute classification accuracy.

It should be noted that this test is specifically tailored for binary classification models and does not extend to multi-class or regression tasks. The ROC curve and AUC metric may be less informative when model outputs are highly skewed toward one class, as this can mask poor absolute classification performance. Furthermore, the ROC curve can sometimes present an overly optimistic view of model performance in the presence of severe class imbalance, as it evaluates ranking rather than actual prediction correctness. A key sign of high risk is an AUC score near or below 0.5, which indicates that the model lacks meaningful discriminative power and may be performing no better than random chance. Additionally, if the ROC curve closely follows the diagonal line of randomness, this is a clear indication that the model is not effectively distinguishing between the two classes.

This test shows a single ROC curve plot for the logistic regression champion model evaluated on the final test dataset. The plot displays the True Positive Rate (vertical axis) against the False Positive Rate (horizontal axis) for a range of classification thresholds, with both axes spanning from 0 to 1. The magenta line represents the model’s ROC curve, while the dashed gray line indicates the performance of a random classifier (AUC = 0.5). The legend in the upper right corner provides the AUC value for the model, which is 0.71, and reiterates the baseline for random performance. To interpret the plot, one should observe how far the ROC curve lies above the diagonal; the greater the area between the curve and the diagonal, the better the model’s discriminative ability. The curve’s shape reveals how the model balances sensitivity and specificity at different thresholds, with the upper left corner representing ideal performance (high TPR, low FPR). The AUC value of 0.71 quantifies the overall performance, indicating that the model has a moderate ability to distinguish between the two classes. There are no abrupt drops or irregularities in the curve, suggesting stable performance across thresholds. The plot does not display individual threshold values, but the smoothness of the curve implies consistent probability outputs from the model.

The test results reveal the following key insights:

  • Model demonstrates moderate discriminative power: The AUC score of 0.71 indicates that the model is able to distinguish between positive and negative classes with reasonable effectiveness, performing substantially better than random guessing.
  • ROC curve consistently outperforms random baseline: The magenta ROC curve remains above the diagonal line throughout the entire range of false positive rates, confirming that the model maintains discriminative ability across all thresholds.
  • Stable performance across thresholds: The ROC curve is smooth and does not exhibit sharp fluctuations, suggesting that the model’s probability outputs are well-calibrated and that performance does not degrade at specific threshold regions.
  • No evidence of high-risk behavior: The AUC value is well above the 0.5 threshold, and the ROC curve does not approach the line of randomness, indicating that the model is not at risk of failing to discriminate between classes.

Based on these results, the logistic regression champion model exhibits a moderate level of discriminative ability on the final test dataset, as evidenced by an AUC score of 0.71 and a consistently elevated ROC curve above the random baseline. The model’s performance is stable across the full spectrum of classification thresholds, with no signs of erratic behavior or threshold-specific weaknesses. The ROC curve’s shape and the AUC value together suggest that the model is reliably ranking positive cases higher than negative ones, which is a desirable characteristic for binary classification tasks where threshold selection may vary depending on operational needs. The absence of any regions where the ROC curve approaches the diagonal line further supports the conclusion that the model is not exhibiting high-risk or random-like behavior. Overall, the test results provide a clear and objective characterization of the model’s ability to separate the two classes, supporting its use in scenarios where moderate discriminative performance is acceptable.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:logreg_champion:61ca
2026-01-10 02:29:17,814 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:logreg_champion does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be added manually to your report as part of your compliance assessment process within the ValidMind Platform.

Log an artifact

As we can observe from the output above, our champion model doesn't pass the MinimumAccuracy test at its default out-of-the-box threshold, so let's log an artifact (finding) in the ValidMind Platform (Need more help?):

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Validation Report under Documents.

  3. Locate the Data Preparation section and click on 2.2.2. Model Performance to expand that section.

  4. Under the Model Performance Metrics section, locate Artifacts then click Link Artifact to Report:

    Screenshot showing the validation report with the link artifact option highlighted

  5. Select Validation Issue as the type of artifact.

  6. Click + Add Validation Issue to add a validation issue type artifact.

  7. Enter the details for your validation issue, for example:

    • TITLE — Champion Logistic Regression Model Fails Minimum Accuracy Threshold
    • RISK AREA — Model Performance
    • DOCUMENTATION SECTION — 3.2. Model Evaluation
    • DESCRIPTION — The logistic regression champion model was subjected to a Minimum Accuracy test to determine whether its predictive accuracy meets the predefined performance threshold of 0.7. The model achieved an accuracy score of 0.6754, which falls below the required minimum. As a result, the test produced a Fail outcome.
  8. Click Save.

  9. Select the validation issue you just added to link to your validation report and click Update Linked Artifacts to insert your validation issue.

  10. Click on the validation issue to expand the issue, where you can adjust details such as severity, owner, due date, status, etc. as well as include proposed remediation plans or supporting documentation as attachments.

Evaluate performance of challenger model

We've now run the same kinds of performance tests on our champion model as the model development team did, with the aim of verifying their test results.

Next, let's see how our challenger model compares. We'll use the same batch of tests here as we did with mpt, but append a different result_id to indicate that these results should be associated with our champion versus challenger comparison:

mpt_chall = [
    "validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger",
    "validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger",
    "validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger"
]

We'll run each test once for each model with the same vm_test_ds dataset to compare them:

for test in mpt_chall:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Classifier Performance Champion Vs Challenger

Classifier Performance: Champion vs Challenger is designed to evaluate and compare the predictive effectiveness of classification models by quantifying their ability to correctly identify and distinguish between classes. The primary purpose of this test is to provide a comprehensive assessment of model performance using a suite of standard metrics, including precision, recall, F1-Score, accuracy, and ROC AUC, thereby enabling objective comparison between a champion model and one or more challenger models.

The test operates by generating a detailed report of classification metrics for each model under evaluation. It utilizes the classification report from the scikit-learn library to compute precision, recall, and F1-Score for each class, as well as macro and weighted averages to summarize overall model performance. Precision measures the proportion of positive identifications that are actually correct, while recall quantifies the proportion of actual positives that are correctly identified. The F1-Score provides a harmonic mean of precision and recall, balancing the trade-off between the two. Accuracy reflects the overall proportion of correct predictions out of all predictions made. For a more nuanced view, especially in the presence of class imbalance, the test also calculates the ROC AUC score, which measures the model’s ability to discriminate between classes across all possible classification thresholds. ROC AUC values range from 0 to 1, with values closer to 1 indicating strong discriminatory power and values near 0.5 suggesting performance no better than random guessing. The test requires as input the predicted and true class labels, and, for ROC AUC, the predicted probabilities. The output is a set of tables summarizing these metrics for each model and class, allowing for direct comparison.
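
A minimal sketch of these calculations, assuming scikit-learn and hypothetical labels, hard predictions, and positive-class probabilities, looks roughly like this:

from sklearn.metrics import classification_report, roc_auc_score

# Hypothetical labels, predictions, and positive-class probabilities
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.80, 0.45, 0.30, 0.90, 0.60, 0.20, 0.70]

# Per-class precision, recall, and F1, plus macro and weighted averages
print(classification_report(y_true, y_pred, digits=4))

# Threshold-independent measure of discriminatory power
print(f"ROC AUC: {roc_auc_score(y_true, y_prob):.4f}")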

The primary advantages of this test include its versatility in handling both binary and multiclass classification problems and its comprehensive coverage of key performance metrics. By reporting precision, recall, F1-Score, and accuracy, the test provides a multi-faceted view of model behavior, capturing both the ability to correctly identify positive cases and the tendency to avoid false positives. The inclusion of macro and weighted averages ensures that the results are interpretable even in the presence of class imbalance, while the ROC AUC metric offers a robust measure of overall discriminatory power. This makes the test particularly valuable for model selection and benchmarking, as it enables stakeholders to assess not only overall accuracy but also the balance between sensitivity and specificity, and the model’s robustness to varying decision thresholds.

It should be noted that the test is subject to several limitations and interpretation challenges. The accuracy and utility of the results depend on the representativeness of the test dataset; if the data does not reflect real-world distributions, the reported metrics may not generalize. The test assumes that class labels are correctly specified and that the classification task is well-defined, which may not hold in all scenarios. While the test provides a broad overview of performance, it does not diagnose the underlying causes of poor results or suggest specific areas for model improvement. Low values for precision, recall, F1-Score, accuracy, or ROC AUC are indicative of suboptimal model performance, and significant imbalances between precision and recall may signal issues such as overfitting or underfitting. Additionally, ROC AUC values close to 0.5 suggest that the model lacks discriminatory power, which is a sign of high risk in production settings.

This test shows the results in the form of two tables. The first table presents precision, recall, and F1-Score for each class (0 and 1) for both the champion (log_model_champion) and challenger (rf_model) models, along with their weighted and macro averages. Each row corresponds to a specific model and class, and each column displays the respective metric values, which range from 0 to 1. The second table summarizes the overall accuracy and ROC AUC for each model, with accuracy representing the proportion of correct predictions and ROC AUC indicating the model’s ability to distinguish between classes. Notable observations include the generally higher metric values for the rf_model compared to the log_model_champion across all reported metrics. For example, the rf_model achieves a weighted average F1-Score of 0.6939 and a ROC AUC of 0.7625, both higher than the corresponding values for the log_model_champion (0.6753 and 0.7051, respectively). The tables are read by identifying the model and class of interest, then examining the associated metric values to assess performance. The range of values observed suggests moderate to good performance, with all metrics falling between approximately 0.65 and 0.76, and no values indicating severe underperformance.

The test results reveal the following key insights:

  • rf_model Consistently Outperforms log_model_champion: Across all reported metrics, the rf_model demonstrates higher precision, recall, and F1-Score for both classes, as well as higher weighted and macro averages, indicating superior overall performance.
  • Higher Discriminatory Power in rf_model: The ROC AUC for rf_model is 0.7625, substantially higher than the 0.7051 observed for log_model_champion, suggesting that rf_model is more effective at distinguishing between the two classes.
  • Balanced Performance Across Classes: Both models exhibit relatively balanced precision and recall between classes 0 and 1, with no extreme disparities, though rf_model maintains a slight edge in both metrics for each class.
  • Moderate to Good Accuracy Levels: The accuracy for log_model_champion is 0.6754, while rf_model achieves 0.694, indicating that both models correctly classify a substantial proportion of instances, with rf_model again showing a modest improvement.
  • No Evidence of Severe Class Imbalance Effects: The similarity between macro and weighted averages for both models suggests that class imbalance does not significantly distort the reported metrics, and both models maintain stable performance across classes.

Based on these results, the rf_model demonstrates a clear advantage over the log_model_champion in terms of both overall and class-specific performance metrics. The higher precision, recall, and F1-Score values for rf_model indicate that it is more effective at correctly identifying both positive and negative cases, while its superior ROC AUC reflects a stronger ability to discriminate between classes across varying thresholds. The balanced performance across classes and the close alignment between macro and weighted averages suggest that neither model is unduly affected by class imbalance, and both maintain consistent behavior across the dataset. The observed accuracy levels confirm that both models are capable of making correct predictions at a moderate to good rate, with rf_model providing a modest but consistent improvement. Collectively, these insights indicate that rf_model offers more robust and reliable classification performance in this context, with no evidence of severe underperformance or instability in either model.

Tables

model               Class             Precision  Recall  F1
log_model_champion  0                 0.6726     0.6933  0.6828
log_model_champion  1                 0.6785     0.6573  0.6677
log_model_champion  Weighted Average  0.6755     0.6754  0.6753
log_model_champion  Macro Average     0.6755     0.6753  0.6753
rf_model            0                 0.6905     0.7117  0.7009
rf_model            1                 0.6977     0.6760  0.6867
rf_model            Weighted Average  0.6941     0.6940  0.6939
rf_model            Macro Average     0.6941     0.6938  0.6938

model               Metric    Value
log_model_champion  Accuracy  0.6754
log_model_champion  ROC AUC   0.7051
rf_model            Accuracy  0.6940
rf_model            ROC AUC   0.7625
2026-01-10 02:29:41,807 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger does not exist in model's document

Confusion Matrix Champion Vs Challenger

Confusion Matrix: champion vs challenger is designed to evaluate and visually represent the predictive performance of classification machine learning models by quantifying the counts of true positives, true negatives, false positives, and false negatives. The primary purpose of this test is to provide a clear and interpretable summary of how well each model distinguishes between the positive and negative classes, highlighting both correct and incorrect predictions in a structured format.

The test operates by comparing the predicted class labels generated by each model against the actual observed class labels from the test dataset. For each model, a confusion matrix is constructed, where the rows represent the true class labels and the columns represent the predicted class labels. The matrix is populated with counts for each combination: true positives (cases where the model correctly predicts the positive class), true negatives (correctly predicts the negative class), false positives (incorrectly predicts positive when the true class is negative), and false negatives (incorrectly predicts negative when the true class is positive). These counts are then visualized using a heatmap, which provides an immediate graphical representation of the model’s classification behavior. The values in the matrix are non-negative integers, and higher values along the diagonal (true positives and true negatives) generally indicate better model performance, while higher off-diagonal values (false positives and false negatives) suggest areas where the model is making errors. The confusion matrix does not aggregate these results into a single performance metric but instead allows for a granular examination of the types of errors made by the model.
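
For intuition, the matrix and its heatmap can be reproduced with scikit-learn and matplotlib; the labels and predictions below are hypothetical and only illustrate the mechanics.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Hypothetical labels and predictions for illustration only
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]

# Rows are true labels, columns are predicted labels:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

ConfusionMatrixDisplay(cm, display_labels=[0, 1]).plot(cmap="Blues")
plt.show()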

The primary advantages of this test include its ability to deliver a comprehensive and easily interpretable visual summary of model performance, making it straightforward to identify strengths and weaknesses in classification. The confusion matrix is particularly valuable in scenarios where understanding the balance between different types of errors is critical, such as in medical diagnosis or fraud detection, where the costs of false positives and false negatives may differ significantly. By explicitly displaying the counts of each outcome, the test enables users to assess not only overall accuracy but also the distribution of errors, which can inform further model tuning or selection. Additionally, the confusion matrix is well-suited for both binary and multi-class classification problems, providing a scalable approach to performance evaluation across a range of use cases.

It should be noted that the confusion matrix has several limitations and potential risks. In datasets with imbalanced class distributions, the matrix may give a misleading impression of model performance, as high counts in the majority class can mask poor performance in the minority class. The confusion matrix does not provide summary statistics such as precision, recall, or F1-score, which are often necessary for a more nuanced understanding of model effectiveness, especially in imbalanced settings. Users must compute these metrics separately to gain a complete picture. Furthermore, the matrix is descriptive rather than inferential, offering no statistical hypothesis testing or confidence intervals. Interpretation challenges may arise if users focus solely on overall accuracy without considering the specific costs or implications of different error types. High numbers of false positives or false negatives, as highlighted in the test description, are signs of increased risk and should be carefully examined in the context of the application.

This test shows the results in the form of annotated heatmaps, each representing the confusion matrix for a specific model: the champion logistic regression model and the challenger random forest model. Each heatmap is a 2x2 grid, with the axes labeled as true and predicted class labels. The top left cell shows the count of true negatives, the top right shows false positives, the bottom left shows false negatives, and the bottom right shows true positives. The color intensity of each cell corresponds to the magnitude of the count, with darker shades indicating higher values. For the logistic regression model, the matrix displays 226 true negatives, 100 false positives, 110 false negatives, and 211 true positives. For the random forest model, the matrix shows 232 true negatives, 94 false positives, 104 false negatives, and 217 true positives. These values provide a direct comparison of the two models’ abilities to correctly and incorrectly classify each class. The heatmaps allow users to quickly assess where each model excels or struggles, with particular attention to the off-diagonal cells that represent misclassifications. The range of values in each cell is determined by the size of the test set and the distribution of the true labels. Notable observations include the relatively balanced distribution of errors between false positives and false negatives for both models, as well as the slightly higher true positive and true negative counts for the random forest model compared to the logistic regression model.

The test results reveal the following key insights:

  • Random Forest Model Achieves Higher Correct Classification Counts: The random forest model records 217 true positives and 232 true negatives, both higher than the logistic regression model’s 211 true positives and 226 true negatives, indicating a marginally better ability to correctly identify both classes.
  • Random Forest Model Reduces Misclassification Rates: The random forest model produces fewer false positives (94) and false negatives (104) compared to the logistic regression model, which has 100 false positives and 110 false negatives, suggesting improved error control.
  • Error Distribution Remains Balanced Across Models: Both models exhibit a similar pattern in the distribution of errors, with false positives and false negatives occurring at comparable rates, reflecting consistent model behavior across the two approaches.
  • Magnitude of Classifications Reflects Test Set Composition: The total counts in each matrix cell are closely aligned, indicating that both models are evaluated on the same test set and that the class distribution is relatively stable.
  • Visual Representation Highlights Areas for Further Analysis: The heatmaps make it easy to identify that the majority of predictions fall along the diagonal, but the presence of non-negligible off-diagonal values underscores the importance of further investigation into the causes of misclassification.

Based on these results, both the logistic regression and random forest models demonstrate a similar overall pattern in their classification performance, with the random forest model showing a slight advantage in both true positive and true negative counts. The reduction in false positives and false negatives for the random forest model suggests a more effective balance between sensitivity and specificity, which may be beneficial depending on the application’s requirements. The close alignment in the total number of predictions across both models indicates that the evaluation is consistent and that observed differences are attributable to model behavior rather than data artifacts. The heatmaps provide a clear visual summary that facilitates direct comparison, making it straightforward to identify the random forest model’s incremental improvements in correct classification and error reduction. The balanced distribution of errors across both models suggests that neither model is disproportionately favoring one class over the other, and the observed error rates highlight the need for further analysis if minimizing specific types of misclassification is critical. Overall, the confusion matrix results offer a transparent and interpretable basis for understanding the comparative strengths and weaknesses of the champion and challenger models in this classification task.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:a4c9
ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:2abc
2026-01-10 02:30:09,487 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger does not exist in model's document

❌ Minimum Accuracy Champion Vs Challenger

Minimum Accuracy: Champion vs Challenger is designed to assess whether a model’s prediction accuracy meets or exceeds a specified minimum threshold, ensuring that the model’s overall correctness in classifying instances is sufficient for deployment or further consideration. The primary purpose of this test is to provide a straightforward, quantitative check on the model’s ability to make correct predictions, serving as a baseline measure of performance for both binary and multiclass classification tasks.

The test operates by calculating the accuracy score for each model under evaluation, which is the proportion of correct predictions out of the total number of predictions made. This is achieved by comparing the true labels from the dataset to the predicted labels generated by the model, using a standard method such as sklearn’s accuracy_score. The resulting accuracy value ranges from 0 to 1, where 1 indicates perfect prediction and 0 indicates no correct predictions. The test then compares this score to a predetermined threshold, commonly set at 0.7, to determine if the model’s performance is acceptable. If the accuracy score meets or exceeds the threshold, the model passes the test; otherwise, it fails. This mechanism provides a clear, interpretable metric for evaluating model performance, with higher values indicating better overall correctness and lower values signaling potential inadequacy for the intended application.
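
The champion-versus-challenger comparison amounts to applying that same check to each model against a shared threshold. The sketch below is a self-contained illustration on synthetic stand-in data and models (not the notebook's vm_log_model and vm_rf_model), showing the per-model pass/fail logic the test applies.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data and models purely for illustration
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "log_model_champion": LogisticRegression(max_iter=1000).fit(X_train, y_train),
    "rf_model": RandomForestClassifier(random_state=42).fit(X_train, y_train),
}

threshold = 0.7  # default minimum accuracy
for name, clf in models.items():
    score = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: Score={score:.4f} Threshold={threshold} "
          f"{'Pass' if score >= threshold else 'Fail'}")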

The primary advantages of this test include its simplicity and directness, making it an effective initial screening tool for model performance. Because accuracy is easy to interpret and calculate, it allows for rapid comparison across different models or iterations. This test is particularly useful when the dataset has balanced classes, as it reflects the model’s ability to correctly classify all categories without bias. Additionally, its applicability to both binary and multiclass problems makes it a versatile component of model evaluation pipelines, providing a consistent benchmark for minimum acceptable performance.

It should be noted that the Minimum Accuracy test has several limitations and potential risks. Accuracy can be misleading in situations where the dataset is imbalanced, as a model may achieve a high accuracy score simply by predicting the majority class most of the time, without truly learning to distinguish between classes. This test does not account for the types of errors made, such as false positives or false negatives, nor does it provide insight into the model’s precision or recall. Persistent failure to meet the threshold is a sign of high risk, indicating that the model may not be suitable for production use. Furthermore, relying solely on accuracy may obscure important nuances in model behavior, especially in domains where certain types of errors carry greater consequences.

This test shows the results in a tabular format, presenting each model evaluated alongside its calculated accuracy score, the threshold used for evaluation, and the resulting pass or fail status. The table includes columns for the model name, the accuracy score (expressed as a decimal between 0 and 1), the threshold value, and a categorical indicator of whether the model passed or failed the test. For example, the "log_model_champion" achieved an accuracy score of 0.6754, while the "rf_model" achieved 0.694, both compared against a threshold of 0.7. The "Pass/Fail" column clearly indicates that both models failed to meet the minimum accuracy requirement. The table format allows for straightforward comparison between models, highlighting not only the absolute performance but also the margin by which each model falls short of the threshold. The values are precise to four decimal places, enabling detailed scrutiny of model performance relative to the set standard.

The test results reveal the following key insights:

  • Both Models Fall Short of Minimum Accuracy: Neither the "log_model_champion" nor the "rf_model" achieves the required accuracy threshold of 0.7, with scores of 0.6754 and 0.694 respectively, resulting in a fail status for both.
  • RF Model Marginally Outperforms Champion: The "rf_model" demonstrates a slightly higher accuracy than the "log_model_champion," outperforming it by approximately 0.0186, yet still does not meet the threshold.
  • Consistent Threshold Application: The threshold of 0.7 is uniformly applied to both models, ensuring a fair and direct comparison of their performance.
  • Clear Pass/Fail Delineation: The "Pass/Fail" column provides an unambiguous assessment of each model’s status relative to the minimum accuracy requirement, facilitating rapid identification of models that do not meet baseline standards.

Based on these results, both evaluated models do not achieve the minimum required accuracy, as indicated by their respective scores of 0.6754 for the "log_model_champion" and 0.694 for the "rf_model," both falling below the 0.7 threshold. The "rf_model" shows a marginally better performance compared to the "log_model_champion," but the difference is not sufficient to alter the overall outcome. The uniform application of the threshold across models ensures that the comparison is equitable and that the results are directly interpretable. The clear pass/fail status in the results table highlights that neither model currently meets the baseline standard for accuracy, suggesting that further evaluation or model refinement may be necessary before deployment. The observed accuracy values, while close to the threshold, indicate that the models are not yet achieving the level of correctness required for reliable operation in their intended context. The results provide a transparent and objective assessment of model performance against a predefined standard, supporting informed decision-making regarding model selection and further development.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6754 0.7 Fail
rf_model 0.6940 0.7 Fail
2026-01-10 02:30:28,513 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger does not exist in model's document

✅ Minimum F1 Score Champion Vs Challenger

Minimum F1 Score: champion vs challenger is designed to evaluate whether the F1 score of a model on the validation dataset meets or exceeds a predefined minimum threshold, ensuring that the model achieves a balanced trade-off between precision and recall. This test is particularly important in classification tasks where the distribution of classes may be imbalanced, as it provides a more informative measure of model performance than accuracy alone.

The test operates by calculating the F1 score for each model using the validation dataset. The F1 score is a metric that combines both precision, which measures the proportion of true positive predictions among all positive predictions, and recall, which measures the proportion of true positive predictions among all actual positive cases. For binary classification problems, the standard F1 score calculation is used, while for multi-class problems, macro averaging is applied to ensure that each class contributes equally to the final score. The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0 indicates the worst possible balance. The computed F1 score for each model is then compared to a predefined threshold, typically set based on business or regulatory requirements. If the model's F1 score meets or exceeds this threshold, it is considered to have passed the test; otherwise, it is flagged as not meeting the minimum performance standard.

The primary advantages of this test include its ability to provide a balanced assessment of model performance by accounting for both false positives and false negatives, which is especially valuable in situations with class imbalance. The F1 score is less susceptible to being skewed by the majority class, making it a more reliable indicator of a model's ability to correctly identify minority class instances. Additionally, the flexibility to set a minimum threshold allows organizations to define clear, context-specific performance standards that models must meet before deployment. This ensures that models are not only accurate but also robust in their ability to generalize to new data, particularly in high-stakes or regulated environments.

It should be noted that the F1 score assumes equal importance for precision and recall, which may not align with all business objectives or regulatory requirements, especially in cases where the cost of false positives and false negatives differs significantly. The test may not be suitable for all types of models or tasks, such as those where other metrics like precision, recall, or ROC-AUC are more relevant. Additionally, a model passing the F1 threshold does not guarantee optimal performance across all relevant metrics, and reliance solely on the F1 score may overlook important nuances in model behavior. High risk is indicated if a model's F1 score falls below the established threshold, suggesting inadequate balance between precision and recall and potential failure to effectively identify positive cases while minimizing false positives.

This test shows the results in a tabular format, presenting each model evaluated, its corresponding F1 score, the minimum threshold required, and a pass/fail indicator. The table includes two models: "log_model_champion" and "rf_model." The "Score" column displays the F1 score achieved by each model on the validation set, with values of 0.6677 for "log_model_champion" and 0.6867 for "rf_model." The "Threshold" column shows the minimum acceptable F1 score, set at 0.5 for both models. The "Pass/Fail" column indicates whether each model's F1 score meets or exceeds the threshold, with both models marked as "Pass." The F1 scores are presented as decimal values between 0 and 1, allowing for straightforward comparison against the threshold. Notably, both models achieve F1 scores well above the minimum requirement, indicating balanced performance in terms of precision and recall on the validation dataset. The table format enables easy identification of which models satisfy the minimum performance criteria and highlights the relative performance of each model.

The test results reveal the following key insights:

  • All models exceed the minimum F1 threshold: Both "log_model_champion" and "rf_model" achieve F1 scores above the required threshold of 0.5, with scores of 0.6677 and 0.6867, respectively.
  • rf_model demonstrates the highest F1 score: Among the models evaluated, "rf_model" attains the highest F1 score at 0.6867, indicating a slightly better balance between precision and recall compared to "log_model_champion."
  • Consistent threshold application across models: The minimum F1 score threshold is uniformly set at 0.5 for both models, ensuring a fair and consistent evaluation standard.
  • Clear pass/fail outcomes facilitate interpretation: The inclusion of a "Pass/Fail" column provides immediate clarity on which models meet the minimum performance requirement, with both models passing the test.
  • F1 scores indicate robust validation performance: The observed F1 scores, both significantly above the threshold, suggest that the models maintain strong performance on the validation dataset, with no immediate signs of underperformance in terms of the balance between precision and recall.

Based on these results, both "log_model_champion" and "rf_model" demonstrate F1 scores that comfortably exceed the predefined minimum threshold of 0.5, indicating that each model achieves a satisfactory balance between precision and recall on the validation dataset. The "rf_model" shows a marginally higher F1 score than the "log_model_champion," suggesting a slight advantage in its ability to correctly identify positive cases while minimizing false positives and false negatives. The uniform application of the threshold across models ensures that the evaluation is consistent and unbiased. The clear pass/fail outcomes in the results table make it straightforward to determine which models meet the required performance standard. The F1 scores observed are well within the upper range of the metric, reflecting robust model behavior in the context of the validation data. These observations collectively indicate that both models are performing reliably with respect to the balanced metric of F1 score, and there are no indications of performance issues related to the trade-off between precision and recall within the scope of this test.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6677 0.5 Pass
rf_model 0.6867 0.5 Pass
2026-01-10 02:30:48,983 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger does not exist in model's document

ROC Curve Champion Vs Challenger

ROC Curve: Champion vs Challenger is designed to evaluate the performance of binary classification models by visualizing their ability to distinguish between two classes and quantifying this capability using the Area Under the Curve (AUC) metric. The primary purpose of this test is to provide a comprehensive assessment of how well each model can separate positive and negative cases across all possible classification thresholds, offering a robust measure of model discrimination that is not tied to any single decision boundary.

The test operates by first selecting the relevant binary classification models and applying them to a designated test dataset. For each model, the predicted probabilities for the positive class are computed for all test samples. These probabilities, along with the true class labels, are used to calculate the True Positive Rate (TPR) and False Positive Rate (FPR) at various threshold levels, which are then plotted to form the Receiver Operating Characteristic (ROC) curve. The ROC curve visually represents the trade-off between sensitivity (TPR) and the rate of false alarms (FPR) as the classification threshold varies. The Area Under the Curve (AUC) is then calculated, summarizing the overall performance of the model into a single value ranging from 0 to 1, where 1 indicates perfect discrimination and 0.5 corresponds to random guessing. The test also includes a reference line representing random performance (AUC = 0.5) for direct comparison. Any infinite values in the threshold calculations are removed to ensure the integrity of the results. The ROC curves, AUC scores, and associated thresholds are saved for documentation and further analysis.
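
To eyeball this comparison outside ValidMind, both curves can also be overlaid on a single set of axes (the ValidMind test produces one figure per model). The sketch below assumes two fitted classifiers that expose predict_proba and a shared X_test / y_test split; champion and challenger are placeholder names.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Placeholder names: champion and challenger are fitted classifiers,
# X_test / y_test the shared test features and labels
models = {"log_model_champion": champion, "rf_model": challenger}

for name, clf in models.items():
    probs = clf.predict_proba(X_test)[:, 1]  # positive-class probabilities
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")

plt.plot([0, 1], [0, 1], linestyle="--", label="random (AUC = 0.50)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()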

The primary advantages of this test include its ability to provide a holistic and threshold-independent evaluation of model discrimination, making it particularly valuable in scenarios where the optimal classification threshold is not predetermined or may vary over time. The ROC curve offers a visual summary of model performance across all possible thresholds, allowing stakeholders to assess the model's behavior under different operating conditions. The AUC metric, being invariant to class distribution, ensures that the evaluation remains consistent even when the dataset is imbalanced, which is a common challenge in many real-world applications. This makes the ROC-AUC framework especially useful for comparing multiple models or monitoring model performance over time, as it distills complex classification behavior into interpretable and actionable insights.

It should be noted that this test is specifically designed for binary classification tasks and does not extend to multi-class or regression models. Additionally, the ROC-AUC metric may not fully capture model performance in cases where predicted probabilities are highly skewed toward the extremes, potentially masking issues with calibration or class imbalance. In situations where the majority of predictions are incorrect but the ranking of probabilities is preserved, the AUC can still appear artificially high, which may lead to overestimation of model effectiveness. A key sign of elevated risk is an AUC score approaching 0.5, indicating that the model's predictions are no better than random chance. Furthermore, if the ROC curve closely follows the diagonal line of randomness, it signals a lack of discriminative power. These limitations highlight the importance of interpreting ROC-AUC results in conjunction with other performance metrics and domain knowledge.

This test shows the results in the form of ROC curve plots for two models: a logistic regression model (log_model_champion) and a random forest model (rf_model), both evaluated on the same test dataset. Each plot displays the ROC curve, which traces the relationship between the True Positive Rate (vertical axis) and the False Positive Rate (horizontal axis) as the classification threshold is varied from 0 to 1. The solid colored line represents the model's performance, while the dashed diagonal line indicates the performance of a random classifier (AUC = 0.5). The AUC value is prominently displayed in the legend for each model, providing a quantitative summary of the model's ability to distinguish between the two classes. For the logistic regression model, the AUC is 0.71, and for the random forest model, the AUC is 0.76. The curves for both models consistently lie above the random line, indicating meaningful discriminative power. The plots allow for direct visual comparison of the two models, with the random forest model's curve generally staying further from the diagonal, especially at lower false positive rates, suggesting stronger performance. The axes range from 0 to 1, and the curves are smooth, indicating stable probability estimates across thresholds. No abrupt changes or irregularities are observed, and both models achieve their highest true positive rates at the upper end of the threshold spectrum.

The test results reveal the following key insights:

  • Random Forest Model Demonstrates Superior Discrimination: The random forest model achieves an AUC of 0.76, outperforming the logistic regression model, which has an AUC of 0.71, indicating stronger overall ability to distinguish between positive and negative cases.
  • Both Models Exceed Random Performance: Both ROC curves consistently lie above the diagonal line representing random guessing (AUC = 0.5), confirming that each model provides meaningful predictive value on the test dataset.
  • Stable Probability Estimates Across Thresholds: The ROC curves for both models are smooth and continuous, with no abrupt jumps or irregularities, suggesting that the models produce stable probability estimates as the threshold varies.
  • Greater Separation at Lower False Positive Rates: The random forest model's ROC curve maintains a higher true positive rate than the logistic regression model, particularly at lower false positive rates, which is advantageous in applications where minimizing false alarms is critical.
  • No Evidence of Discriminative Failure: Neither model's ROC curve approaches the line of randomness, and both AUC values are well above the 0.5 threshold, indicating that there is no sign of model collapse or loss of discriminative power in this evaluation.

Based on these results, the random forest model demonstrates a stronger ability to separate positive and negative cases compared to the logistic regression model, as evidenced by its higher AUC score and more favorable ROC curve positioning across the full range of thresholds. Both models provide predictive value that is clearly superior to random guessing, with stable and consistent probability estimates reflected in the smoothness of their ROC curves. The random forest model's advantage is most pronounced at lower false positive rates, which may be particularly relevant in operational contexts where the cost of false positives is high. The absence of any ROC curve segments near the line of randomness and the lack of abrupt changes in curve shape further support the reliability of these models' probability outputs on the test dataset. Overall, the comparative analysis of ROC curves and AUC scores provides clear evidence of the relative strengths of the two models in binary classification tasks, with the random forest model emerging as the more effective discriminator under the conditions evaluated.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:d7b5
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:d2be
2026-01-10 02:31:24,114 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger does not exist in model's document
Based on the performance metrics, our challenger random forest classification model outperforms our champion across the board, although in this run it still falls marginally short of the MinimumAccuracy threshold.

In your validation report, support the recommendation in your validation issue's Proposed Remediation Plan to investigate the use of our challenger model by inserting the performance tests logged with this notebook into the appropriate section.

Run diagnostic tests

Next, we want to compare the robustness and stability of our champion and challenger models.

Use list_tests() to list all available diagnosis tests applicable to classification tasks:

vm.tests.list_tests(tags=["model_diagnosis"], task="classification")
ID: validmind.model_validation.sklearn.OverfitDiagnosis
  Name: Overfit Diagnosis
  Description: Assesses potential overfitting in a model's predictions, identifying regions where performance between training and...
  Has Figure: True | Has Table: True
  Required Inputs: ['model', 'datasets']
  Params: {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}}
  Tags: ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis']
  Tasks: ['classification', 'regression']

ID: validmind.model_validation.sklearn.RobustnessDiagnosis
  Name: Robustness Diagnosis
  Description: Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions....
  Has Figure: True | Has Table: True
  Required Inputs: ['datasets', 'model']
  Params: {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': 'List', 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}}
  Tags: ['sklearn', 'model_diagnosis', 'visualization']
  Tasks: ['classification', 'regression']

ID: validmind.model_validation.sklearn.WeakspotsDiagnosis
  Name: Weakspots Diagnosis
  Description: Identifies and visualizes weak spots in a machine learning model's performance across various sections of the...
  Has Figure: True | Has Table: True
  Required Inputs: ['datasets', 'model']
  Params: {'features_columns': {'type': 'Optional', 'default': None}, 'metrics': {'type': 'Optional', 'default': None}, 'thresholds': {'type': 'Optional', 'default': None}}
  Tags: ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization']
  Tasks: ['classification', 'text_classification']

Let’s now use the OverfitDiagnosis test to assess the models for potential signs of overfitting and to identify any sub-segments where performance may be inconsistent.

Overfitting occurs when a model learns the training data too well, capturing not only the true underlying pattern but also noise and random fluctuations. This results in excellent performance on the training dataset but poor generalization to new, unseen data:

  • Since the training dataset (vm_train_ds) was used to fit the model, we use this set to establish a baseline performance for how well the model performs on data it has already seen.
  • The testing dataset (vm_test_ds) was never seen during training, and here simulates real-world generalization, or how well the model performs on new, unseen data.
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

Overfit Diagnosis Champion Vs Challenger

Overfit Diagnosis: champion vs challenger is designed to assess potential overfitting in a model’s predictions by identifying regions where the performance between training and testing sets deviates significantly. The primary purpose of this test is to pinpoint specific feature segments or regions where the model may be overfitting, thereby providing a detailed view of model generalization across different data partitions.

The test operates by comparing the model’s performance on training and test datasets, grouped by feature columns. For each feature, the data is binned into segments, and the model’s performance is evaluated within each segment using a relevant metric—AUC for classification models and MSE for regression models. The difference between the training and test performance metrics is calculated for each segment, resulting in a “gap” value. If this gap exceeds a predefined threshold (default 0.04), the segment is flagged as a potential overfitting region. The methodology relies on the principle that a well-generalized model should exhibit similar performance on both training and test data across all feature segments. The AUC metric, which ranges from 0 to 1, measures the model’s ability to discriminate between classes, with higher values indicating better performance. A large positive or negative gap suggests that the model’s predictive power is not consistent between training and test data, which is indicative of overfitting or underfitting in those regions. The results are visualized as bar plots, where the y-axis represents the AUC gap and the x-axis represents feature bins, with a horizontal line marking the overfitting threshold.

The primary advantages of this test include its ability to localize overfitting to specific feature regions, rather than providing only a global assessment. This granularity enables targeted analysis and debugging, as practitioners can identify exactly where the model’s generalization breaks down. The test’s flexibility in supporting both classification and regression models, as well as its compatibility with multiple performance metrics, makes it broadly applicable across different modeling scenarios. The visualizations produced by the test facilitate intuitive interpretation, allowing users to quickly spot problematic regions. By surfacing overfitting at the segment level, the test supports more informed model refinement and risk management, especially in regulated environments where transparency and explainability are critical.

It should be noted that the test’s effectiveness depends on the appropriateness of the chosen threshold, which may require tuning for different datasets or business contexts. The default threshold of 0.04 may not capture more subtle forms of overfitting that fall below this value, potentially missing nuanced generalization issues. Additionally, the test assumes that the binning of features adequately represents meaningful data segments; poor binning choices can obscure or exaggerate overfitting signals. Interpretation challenges may arise in regions with small sample sizes, where performance metrics can be unstable. High-risk signs include significant gaps between training and test performance for specific segments, multiple regions exceeding the threshold, and larger-than-expected differences in predicted versus actual values on the test set.

This test shows the results in both tabular and graphical formats. The tables present, for each model and feature, the feature segment (bin), the number of training and test records in that segment, the training and test AUC values, and the calculated gap. The bar plots visualize the AUC gap for each feature segment, with the overfitting threshold marked as a horizontal line. For the “log_model_champion,” the AUC gaps across features such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Geography_Germany are displayed. Most segments show moderate gaps, with a few exceeding the 0.04 threshold, indicating localized overfitting. For the “rf_model,” the gaps are substantially larger and more widespread, with many segments showing gaps well above the threshold, often exceeding 0.2 and in some cases reaching as high as 1.0. The plots make it easy to identify which feature bins are most affected, as bars crossing the threshold line are visually prominent. The range of AUC gaps varies by model and feature, with the random forest model exhibiting consistently higher gaps across nearly all features and bins. Notable data points include extreme gaps in the rf_model for low-frequency bins, such as Balance and CreditScore, where the training AUC is perfect (1.0) but the test AUC drops sharply, resulting in large positive gaps. In contrast, the log_model_champion shows more moderate and isolated overfitting, with most gaps remaining below 0.1 except for a few segments.

The test results reveal the following key insights:

  • Random Forest Model Exhibits Widespread Overfitting: The rf_model shows large AUC gaps across nearly all feature segments, with gaps frequently exceeding 0.2 and reaching up to 1.0 in some Balance bins, indicating severe overfitting throughout the feature space.
  • Logistic Regression Model Shows Localized Overfitting: The log_model_champion demonstrates more moderate and isolated overfitting, with most AUC gaps below 0.1 and only a few segments, such as Tenure (2.0, 3.0] and Balance (150538.854, 175628.663], exceeding the 0.04 threshold.
  • Feature-Specific Patterns in Overfitting: For both models, certain features such as Balance, CreditScore, and Tenure are more prone to overfitting, with specific bins consistently showing higher gaps, while other features like Geography_Spain and Gender_Male remain relatively stable.
  • Sample Size Effects on Gap Stability: Segments with fewer records, particularly in the rf_model, display the most extreme AUC gaps, suggesting that low sample sizes contribute to instability and exaggerated overfitting signals.
  • Threshold Exceedance is Model-Dependent: The overfitting threshold of 0.04 is exceeded in nearly every segment for the rf_model, while for the log_model_champion, only select bins cross this line, highlighting the difference in generalization between the two models.
  • Consistent Training AUC of 1.0 in Random Forest: The rf_model achieves perfect training AUC in all segments, while test AUC varies widely, reinforcing the observation of overfitting due to model complexity and lack of regularization.

Based on these results, the Overfit Diagnosis test provides a clear comparative view of overfitting behavior between the champion logistic regression model and the challenger random forest model. The logistic regression model maintains relatively stable generalization across most feature segments, with only a few localized regions where the training and test AUC diverge beyond the threshold, indicating isolated overfitting. In contrast, the random forest model displays pervasive overfitting, as evidenced by consistently large AUC gaps across nearly all features and bins, with the most pronounced effects in segments with limited data. The visualizations and tabular data together highlight that the random forest’s complexity leads to memorization of the training data, resulting in poor generalization to unseen data, especially in regions with sparse representation. The logistic regression model, while not immune to overfitting, demonstrates a more controlled and interpretable pattern, with overfitting confined to specific, identifiable regions. These observations underscore the importance of model selection and complexity management in achieving robust generalization, as well as the value of segment-level diagnostics in uncovering nuanced model behaviors that may not be apparent from aggregate performance metrics alone.

Tables

model Feature Slice Number of Training Records Number of Test Records Training AUC Test AUC Gap
log_model_champion CreditScore (600.0, 650.0] 476 127 0.6809 0.6079 0.0730
log_model_champion CreditScore (750.0, 800.0] 235 54 0.6974 0.6542 0.0432
log_model_champion Tenure (2.0, 3.0] 261 63 0.6758 0.5152 0.1606
log_model_champion Tenure (4.0, 5.0] 252 65 0.7236 0.6419 0.0817
log_model_champion Tenure (7.0, 8.0] 260 64 0.7457 0.6402 0.1055
log_model_champion Balance (150538.854, 175628.663] 181 58 0.6573 0.5821 0.0751
log_model_champion Balance (200718.472, 225808.281] 16 2 0.0714 0.0000 0.0714
log_model_champion NumOfProducts (2.8, 3.1] 156 35 0.7274 0.6176 0.1098
log_model_champion HasCrCard (-0.001, 0.1] 766 199 0.6733 0.6223 0.0510
log_model_champion EstimatedSalary (60005.85, 80003.94] 275 76 0.6808 0.6140 0.0668
log_model_champion EstimatedSalary (80003.94, 100002.03] 255 61 0.6837 0.6204 0.0633
log_model_champion EstimatedSalary (100002.03, 120000.12] 271 57 0.6651 0.5601 0.1050
log_model_champion Geography_Germany (0.9, 1.0] 803 187 0.6409 0.5629 0.0780
rf_model CreditScore (400.0, 450.0] 39 15 1.0000 0.5800 0.4200
rf_model CreditScore (450.0, 500.0] 121 31 1.0000 0.6333 0.3667
rf_model CreditScore (500.0, 550.0] 284 78 1.0000 0.8056 0.1944
rf_model CreditScore (550.0, 600.0] 389 89 1.0000 0.7579 0.2421
rf_model CreditScore (600.0, 650.0] 476 127 1.0000 0.7181 0.2819
rf_model CreditScore (650.0, 700.0] 484 109 1.0000 0.8104 0.1896
rf_model CreditScore (700.0, 750.0] 384 105 1.0000 0.7308 0.2692
rf_model CreditScore (750.0, 800.0] 235 54 1.0000 0.8000 0.2000
rf_model CreditScore (800.0, 850.0] 162 36 1.0000 0.7711 0.2289
rf_model Tenure (-0.01, 1.0] 368 95 1.0000 0.6676 0.3324
rf_model Tenure (1.0, 2.0] 281 62 1.0000 0.7990 0.2010
rf_model Tenure (2.0, 3.0] 261 63 1.0000 0.7312 0.2688
rf_model Tenure (3.0, 4.0] 258 74 1.0000 0.6630 0.3370
rf_model Tenure (4.0, 5.0] 252 65 1.0000 0.8110 0.1890
rf_model Tenure (5.0, 6.0] 222 77 1.0000 0.8323 0.1677
rf_model Tenure (6.0, 7.0] 283 56 1.0000 0.8764 0.1236
rf_model Tenure (7.0, 8.0] 260 64 1.0000 0.6172 0.3828
rf_model Tenure (8.0, 9.0] 268 60 1.0000 0.8420 0.1580
rf_model Tenure (9.0, 10.0] 132 31 1.0000 0.9118 0.0882
rf_model Balance (-250.898, 25089.809] 845 211 1.0000 0.8328 0.1672
rf_model Balance (50179.618, 75269.427] 98 23 1.0000 0.5231 0.4769
rf_model Balance (75269.427, 100359.236] 273 80 1.0000 0.7074 0.2926
rf_model Balance (100359.236, 125449.045] 599 145 1.0000 0.7727 0.2273
rf_model Balance (125449.045, 150538.854] 497 114 1.0000 0.6901 0.3099
rf_model Balance (150538.854, 175628.663] 181 58 1.0000 0.6375 0.3625
rf_model Balance (175628.663, 200718.472] 53 10 1.0000 0.8333 0.1667
rf_model Balance (200718.472, 225808.281] 16 2 1.0000 0.0000 1.0000
rf_model NumOfProducts (0.997, 1.3] 1475 378 1.0000 0.6608 0.3392
rf_model NumOfProducts (1.9, 2.2] 917 227 1.0000 0.6803 0.3197
rf_model NumOfProducts (2.8, 3.1] 156 35 1.0000 0.7500 0.2500
rf_model HasCrCard (-0.001, 0.1] 766 199 1.0000 0.7504 0.2496
rf_model HasCrCard (0.9, 1.0] 1819 448 1.0000 0.7678 0.2322
rf_model IsActiveMember (-0.001, 0.1] 1378 351 1.0000 0.7350 0.2650
rf_model IsActiveMember (0.9, 1.0] 1207 296 1.0000 0.7573 0.2427
rf_model EstimatedSalary (-188.401, 20009.67] 255 73 1.0000 0.7870 0.2130
rf_model EstimatedSalary (20009.67, 40007.76] 234 66 1.0000 0.8540 0.1460
rf_model EstimatedSalary (40007.76, 60005.85] 245 68 1.0000 0.7299 0.2701
rf_model EstimatedSalary (60005.85, 80003.94] 275 76 1.0000 0.6989 0.3011
rf_model EstimatedSalary (80003.94, 100002.03] 255 61 1.0000 0.7161 0.2839
rf_model EstimatedSalary (100002.03, 120000.12] 271 57 1.0000 0.6074 0.3926
rf_model EstimatedSalary (120000.12, 139998.21] 259 68 1.0000 0.7896 0.2104
rf_model EstimatedSalary (139998.21, 159996.3] 260 60 1.0000 0.8131 0.1869
rf_model EstimatedSalary (159996.3, 179994.39] 283 59 1.0000 0.7563 0.2437
rf_model EstimatedSalary (179994.39, 199992.48] 248 59 1.0000 0.8444 0.1556
rf_model Geography_Germany (-0.001, 0.1] 1782 460 1.0000 0.7443 0.2557
rf_model Geography_Germany (0.9, 1.0] 803 187 1.0000 0.7298 0.2702
rf_model Geography_Spain (-0.001, 0.1] 2006 481 1.0000 0.7558 0.2442
rf_model Geography_Spain (0.9, 1.0] 579 166 1.0000 0.7775 0.2225
rf_model Gender_Male (-0.001, 0.1] 1242 313 1.0000 0.7671 0.2329
rf_model Gender_Male (0.9, 1.0] 1343 334 1.0000 0.7439 0.2561

Figures

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6d31
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:8638
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:72f0
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:85b0
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:5335
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:e03b
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:44d4
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:87d9
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:a4ec
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:f347
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:ca0b
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:035a
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:562f
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:b6a4
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:f2ed
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:b2e1
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:bc34
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:f946
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0082
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0f4e
2026-01-10 02:32:26,233 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger does not exist in model's document
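
To make the gap calculation described above concrete, here is a minimal, standalone sketch of the segment-level train-versus-test AUC comparison. This is not the ValidMind implementation: the DataFrame names (train_df, test_df), the Exited target column, and the use of the underlying (unwrapped) sklearn estimator are assumptions for illustration, and the 0.04 flagging threshold simply mirrors the test's cut_off_threshold parameter shown in the list_tests() output above.

import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_gap_by_segment(train_df, test_df, estimator, feature, target="Exited", n_bins=10, threshold=0.04):
    # Derive bin edges from the training data and reuse them for the test split
    _, edges = pd.cut(train_df[feature], bins=n_bins, retbins=True)
    feature_cols = [c for c in train_df.columns if c != target]
    rows = []
    for split, df in [("train", train_df), ("test", test_df)]:
        binned = df.assign(_bin=pd.cut(df[feature], bins=edges))
        for segment_label, segment in binned.groupby("_bin", observed=True):
            if segment[target].nunique() < 2:
                continue  # AUC is undefined when a segment contains a single class
            scores = estimator.predict_proba(segment[feature_cols])[:, 1]
            rows.append({"slice": str(segment_label), "split": split, "auc": roc_auc_score(segment[target], scores)})
    gaps = pd.DataFrame(rows).pivot(index="slice", columns="split", values="auc").dropna()
    gaps["gap"] = gaps["train"] - gaps["test"]
    gaps["flagged"] = gaps["gap"].abs() > threshold
    return gaps

Applied to the challenger's underlying random forest with a feature such as Balance, this kind of calculation would be expected to surface the same pattern of large gaps reported in the tables above.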

Let's also conduct robustness and stability testing of the two models with the RobustnessDiagnosis test. Robustness refers to a model's ability to maintain consistent performance when its input data is perturbed or noisy, and stability refers to a model's ability to produce consistent outputs over time and across different data subsets.

Again, we'll use both the training and testing datasets to establish baseline performance and to simulate real-world generalization:

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

❌ Robustness Diagnosis Champion Vs Log Regression

Robustness Diagnosis: Champion vs Log Regression is designed to assess the resilience of machine learning models by quantifying how their predictive performance degrades when input data is subjected to controlled perturbations. The primary purpose of this test is to evaluate the robustness of models in the presence of noise, simulating real-world scenarios where data may be imperfect, incomplete, or corrupted, and to identify the extent to which model predictions remain reliable under such conditions.

The test operates by systematically introducing Gaussian noise to the numerical input features of the dataset at varying levels of standard deviation, referred to as perturbation sizes. For each perturbation level, the model’s performance is measured using the Area Under the Receiver Operating Characteristic Curve (AUC), a metric that quantifies the model’s ability to distinguish between classes. AUC values range from 0 to 1, where 1 indicates perfect discrimination and 0.5 suggests no discriminative power. The test calculates the performance decay, defined as the reduction in AUC relative to the baseline (no noise) scenario, for both training and test datasets. Results are aggregated and visualized, with plots showing AUC as a function of perturbation size, and tables providing detailed breakdowns by model, dataset, and noise level. The “Passed” indicator flags whether the performance decay remains within acceptable thresholds, highlighting any instances where robustness standards are not met.

The primary advantages of this test include its ability to provide a clear, quantitative assessment of model robustness across a spectrum of noise intensities, offering valuable insights into how models might behave in less-than-ideal data environments. By leveraging the AUC metric, the test ensures that the evaluation is both interpretable and relevant for classification tasks, capturing the model’s discriminative power under stress. The use of both tabular and graphical outputs facilitates comprehensive analysis, enabling users to quickly identify patterns, thresholds, and points of concern. This approach is particularly useful for comparing different models or configurations, as it exposes differences in sensitivity to input perturbations that may not be apparent under standard validation procedures.

It should be noted that the test’s reliance on Gaussian noise as the sole perturbation mechanism may not fully capture the diversity of real-world data corruptions, such as outliers, missing values, or adversarial manipulations. The thresholds for acceptable performance decay are somewhat arbitrary and may require adjustment to align with specific business or regulatory requirements. Additionally, the test focuses on numeric features and may not account for the impact of noise on categorical or unstructured data. Interpretation challenges may arise if performance decay is observed at low noise levels, as this could indicate model fragility or overfitting. Signs of high risk include significant drops in AUC with minimal noise, performance decay exceeding thresholds, or consistent failure to meet standards across multiple perturbation scales, all of which warrant further investigation.

This test shows results in both tabular and graphical formats. The tables present detailed results for each model (“log_model_champion” and “rf_model”), dataset (“train_dataset_final” and “test_dataset_final”), and perturbation size (ranging from baseline to 0.5 standard deviations). Each row includes the AUC, performance decay relative to baseline, and a pass/fail indicator based on predefined thresholds. The plots visualize AUC as a function of perturbation size, with separate lines for training and test datasets, and horizontal dashed lines indicating threshold values. For the “log_model_champion,” AUC values remain relatively stable across increasing noise levels, with only minor declines observed. In contrast, the “rf_model” exhibits a pronounced decrease in training AUC as noise increases, while test AUC remains more stable but eventually drops below the threshold at the highest perturbation level. Notable observations include the “rf_model” failing the robustness threshold on the training set at perturbation sizes of 0.2 and above, and on the test set at 0.5, while the “log_model_champion” consistently passes across all conditions. The range of AUC values for the “log_model_champion” spans from 0.6842 to 0.6663 on training and 0.7051 to 0.6927 on test, whereas the “rf_model” ranges from 1.0 to 0.7975 on training and 0.7625 to 0.6886 on test.

The test results reveal the following key insights:

  • Logistic Regression Model Maintains Robustness Across Noise Levels: The “log_model_champion” demonstrates minimal performance decay, with AUC values on both training and test datasets remaining above 0.66 even at the highest perturbation size, and all results passing the robustness threshold.
  • Random Forest Model Exhibits High Sensitivity to Noise in Training Data: The “rf_model” shows a steep decline in training AUC as perturbation size increases, dropping from 1.0 at baseline to 0.7975 at 0.5 standard deviations, with performance decay exceeding the threshold from 0.2 onwards, resulting in failed robustness checks.
  • Test Set Performance for Random Forest Remains Stable Until High Perturbation: On the test dataset, the “rf_model” maintains relatively stable AUC values up to a perturbation size of 0.4, but drops below the threshold at 0.5, indicating a late but significant loss of robustness.
  • Performance Decay Patterns Differ Between Models: The logistic regression model’s performance decay is gradual and minor, while the random forest model’s decay is abrupt and pronounced, particularly on the training set, suggesting differences in model complexity and overfitting behavior.
  • Threshold Exceedance Highlights Model Fragility: The “rf_model” fails the robustness threshold on the training set at perturbation sizes of 0.2 and above, and on the test set at 0.5, whereas the “log_model_champion” passes all robustness checks, indicating greater resilience to input noise.

Based on these results, the logistic regression model (“log_model_champion”) demonstrates consistent and stable performance under increasing levels of Gaussian noise, with only minor reductions in AUC and no instances of performance decay exceeding the predefined thresholds. This indicates a high degree of robustness and suggests that the model’s predictions are likely to remain reliable even when input data is subject to moderate perturbations. In contrast, the random forest model (“rf_model”) displays marked sensitivity to noise, particularly in the training data, where AUC declines rapidly and robustness thresholds are breached at relatively low perturbation sizes. The test set performance for the random forest model remains stable up to a point but ultimately fails at the highest noise level, highlighting a potential vulnerability to data corruption. The observed patterns suggest that model complexity and overfitting may contribute to the random forest’s fragility, while the logistic regression model’s simpler structure confers greater resilience. These insights provide a clear characterization of each model’s behavior under noisy conditions, with the logistic regression model exhibiting superior robustness and the random forest model showing susceptibility to performance degradation as input noise increases.

Tables

model Perturbation Size Dataset Row Count AUC Performance Decay Passed
log_model_champion Baseline (0.0) train_dataset_final 2585 0.6842 0.0000 True
log_model_champion Baseline (0.0) test_dataset_final 647 0.7051 0.0000 True
log_model_champion 0.1 train_dataset_final 2585 0.6841 0.0001 True
log_model_champion 0.1 test_dataset_final 647 0.7066 -0.0015 True
log_model_champion 0.2 train_dataset_final 2585 0.6827 0.0015 True
log_model_champion 0.2 test_dataset_final 647 0.7039 0.0012 True
log_model_champion 0.3 train_dataset_final 2585 0.6717 0.0126 True
log_model_champion 0.3 test_dataset_final 647 0.6950 0.0101 True
log_model_champion 0.4 train_dataset_final 2585 0.6736 0.0106 True
log_model_champion 0.4 test_dataset_final 647 0.6945 0.0106 True
log_model_champion 0.5 train_dataset_final 2585 0.6663 0.0179 True
log_model_champion 0.5 test_dataset_final 647 0.6927 0.0124 True
rf_model Baseline (0.0) train_dataset_final 2585 1.0000 0.0000 True
rf_model Baseline (0.0) test_dataset_final 647 0.7625 0.0000 True
rf_model 0.1 train_dataset_final 2585 0.9826 0.0174 True
rf_model 0.1 test_dataset_final 647 0.7766 -0.0141 True
rf_model 0.2 train_dataset_final 2585 0.9470 0.0530 False
rf_model 0.2 test_dataset_final 647 0.7639 -0.0014 True
rf_model 0.3 train_dataset_final 2585 0.8997 0.1003 False
rf_model 0.3 test_dataset_final 647 0.7601 0.0024 True
rf_model 0.4 train_dataset_final 2585 0.8543 0.1457 False
rf_model 0.4 test_dataset_final 647 0.7439 0.0187 True
rf_model 0.5 train_dataset_final 2585 0.7975 0.2025 False
rf_model 0.5 test_dataset_final 647 0.6886 0.0739 False

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:f5f0
ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:1756
2026-01-10 02:32:58,661 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression does not exist in model's document
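
If the default noise levels or the 0.05 performance decay threshold don't suit your use case, the RobustnessDiagnosis parameters listed in the list_tests() output above (scaling_factor_std_dev_list and performance_decay_threshold) can be overridden when the test is run. The snippet below is illustrative only: it assumes your installed version of the ValidMind Library accepts a params argument alongside input_grid, and it uses a distinct result ID suffix so it doesn't overwrite the result we just logged.

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:champion_vs_challenger_custom",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
    params={
        # Probe smaller perturbations and relax the pass/fail cutoff
        "scaling_factor_std_dev_list": [0.05, 0.1, 0.15, 0.2],
        "performance_decay_threshold": 0.1,
    },
).log()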

Run feature importance tests

We also want to verify the relative influence of different input features on our models' predictions, and to compare our champion and challenger models to see whether one offers more understandable or logical feature importance scores.

Use list_tests() to identify all the feature importance tests for classification:

# Store the feature importance tests
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI
['validmind.model_validation.FeaturesAUC',
 'validmind.model_validation.sklearn.PermutationFeatureImportance',
 'validmind.model_validation.sklearn.SHAPGlobalImportance']

We'll only use our testing dataset (vm_test_ds) here to provide a realistic, unseen sample that mimics future or production data, as the training dataset has already influenced our models during learning:

# Run and log our feature importance tests for both models for the testing dataset
for test in FI:
    vm.tests.run_test(
        f"{test}:champion_vs_challenger",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Features AUC Champion Vs Challenger

Features AUC: Champion vs Challenger is designed to evaluate the discriminatory power of each individual feature within a binary classification model by calculating the Area Under the Curve (AUC) for each feature separately. The primary purpose of this test is to quantify how well each feature, on its own, can distinguish between the two classes in a binary classification problem, providing a univariate perspective on feature effectiveness.

The test operates by treating the values of each feature as raw scores and computing the AUC for each feature against the actual binary outcomes. For every feature, the test calculates how well the distribution of feature values separates the two classes, using the AUC metric as a measure of this separation. The AUC, or Area Under the Receiver Operating Characteristic (ROC) Curve, is a widely used metric in binary classification that quantifies the probability that a randomly chosen positive instance will have a higher score than a randomly chosen negative instance. The AUC value ranges from 0 to 1, where 0.5 indicates no discriminatory power (equivalent to random guessing), values closer to 1 indicate strong positive discrimination, and values closer to 0 indicate strong negative discrimination. The test requires only the feature values and the binary target labels, and it does not consider any interactions or combined effects between features. The resulting AUC scores for each feature provide a direct, interpretable measure of univariate classification strength, with higher values indicating greater individual predictive power.

The primary advantages of this test include its ability to isolate and highlight the individual contribution of each feature to the classification task, independent of other variables. This makes it particularly useful for initial feature screening, where the goal is to identify features with strong univariate predictive power before model development. Additionally, after model training, the test can provide insights into which features the model may be relying on most heavily, supporting interpretability and transparency. The simplicity and directness of the AUC metric make the results easy to communicate to both technical and non-technical stakeholders. The test is also robust to class imbalance, as the AUC is not affected by the proportion of positive and negative cases, and it can help detect potential data leakage if a feature exhibits unexpectedly high discriminatory power.

It should be noted that this test has several limitations and potential risks. Since it evaluates each feature in isolation, it does not capture any interactions or combined effects between features, which can be critical in many real-world models. Features that are weak individually may still be highly informative when combined with others, and this test would not identify such cases. The AUC values are calculated without reference to how the model actually uses the features, so the results may differ from model-based feature importance measures. There is also a risk of misinterpretation if a feature with a low AUC is expected to be predictive, or if a feature with a high AUC is not believed to be informative, which could indicate data leakage or other data quality issues. The test is applicable only to binary classification problems and cannot be directly extended to multiclass or regression tasks without modification.

This test shows the results in the form of horizontal bar plots, where each bar represents a feature and its corresponding AUC score on the test dataset. The x-axis displays the AUC values, ranging from 0 to 1, while the y-axis lists the features evaluated. The length of each bar indicates the univariate discriminatory power of the feature, with longer bars corresponding to higher AUC scores. The plots are titled "Feature AUC Scores (for dataset=test_dataset_final)" and present the features in descending order of AUC, making it easy to identify the most and least discriminative features at a glance. The key measurement displayed is the AUC score for each feature, which quantifies the probability that the feature can correctly distinguish between the two classes. Notable observations from the plots include the range of AUC values across features, the relative ranking of features, and any features that stand out as particularly strong or weak. For example, features such as "Geography_Germany" and "Balance" have the highest AUC scores, both exceeding 0.6, while features like "NumOfProducts" and "IsActiveMember" have lower AUC scores, closer to 0.4. The visualizations provide a clear, interpretable summary of the univariate discriminatory power of each feature, allowing for straightforward comparison and identification of patterns.

The test results reveal the following key insights:

  • Geography_Germany and Balance are the most discriminative features: Both "Geography_Germany" and "Balance" achieve the highest AUC scores, each exceeding 0.6, indicating that these features have the strongest univariate ability to separate the two classes in the test dataset.
  • CreditScore and EstimatedSalary show moderate discriminatory power: These features have AUC scores slightly above 0.5, suggesting they provide some univariate predictive value but are less powerful than the top features.
  • HasCrCard and Geography_Spain offer limited separation: With AUC values just below 0.5, these features contribute less to class differentiation on their own, though they may still be useful in combination with others.
  • Tenure, Gender_Male, IsActiveMember, and NumOfProducts have the lowest AUC scores: These features all have AUC values around 0.4 to 0.45, indicating weak univariate discriminatory power and suggesting limited individual predictive value in this context.
  • AUC values span a moderate range: The observed AUC scores range from approximately 0.4 to just above 0.6, with no features exhibiting extremely high or low values, which may indicate a lack of strong univariate predictors or the need for feature interactions to achieve higher performance.
  • Feature ranking is consistent across repeated plots: The order and relative magnitude of AUC scores are stable, reinforcing the reliability of the observed feature contributions.

Based on these results, the test demonstrates that "Geography_Germany" and "Balance" are the most effective individual features for distinguishing between the two classes in the test dataset, as evidenced by their AUC scores above 0.6. Features such as "CreditScore" and "EstimatedSalary" provide moderate univariate discrimination, while others like "HasCrCard," "Geography_Spain," "Tenure," "Gender_Male," "IsActiveMember," and "NumOfProducts" show weaker individual performance, with AUC values closer to random chance. The overall distribution of AUC scores suggests that while some features have meaningful univariate predictive power, the majority do not strongly separate the classes on their own. The consistency of feature rankings across repeated visualizations supports the robustness of these observations. These results highlight the importance of considering both individual and combined feature effects in model development, as features with low univariate AUC may still contribute significantly in multivariate models. The absence of extremely high AUC values also suggests that there is no immediate evidence of data leakage or overly dominant features, and the model's performance is likely to depend on the interplay of multiple features rather than reliance on a single variable.

Figures

ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:2e3c
ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:664c
2026-01-10 02:33:32,740 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.FeaturesAUC:champion_vs_challenger does not exist in model's document
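
The per-feature AUC calculation described above is simple enough to reproduce outside the test if you want to spot-check an individual score. A minimal sketch (not the ValidMind implementation), assuming a pandas DataFrame test_df that contains the encoded features and the Exited target:

from sklearn.metrics import roc_auc_score

def feature_auc_scores(df, target="Exited"):
    scores = {}
    for column in df.columns.drop(target):
        # Treat the raw feature values as ranking scores against the binary target;
        # values below 0.5 mean the feature separates the classes in the opposite direction
        scores[column] = roc_auc_score(df[target], df[column])
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))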

Permutation Feature Importance Champion Vs Challenger

Permutation Feature Importance: Champion vs Challenger is designed to assess the significance of each feature in a machine learning model by quantifying the impact on model performance when the values of individual features are randomly permuted. The primary purpose of this test is to identify which features the model relies on most for its predictions, thereby providing transparency into the model’s decision-making process and highlighting potential dependencies or vulnerabilities.

The test operates by systematically shuffling the values of each feature in the dataset, one at a time, and measuring the resulting change in the model’s predictive performance. This is achieved using the permutation_importance method from the sklearn.inspection module, which evaluates the decrease in a chosen performance metric—such as accuracy or area under the curve—after each permutation. The underlying logic is that if permuting a feature’s values leads to a significant drop in performance, that feature is important for the model’s predictions. Conversely, if the performance remains largely unchanged, the feature is likely not influential. The output is typically a set of importance scores for each feature, which are non-negative and often normalized to sum to one or to reflect the absolute change in performance. Higher values indicate greater importance, while values near zero suggest minimal impact. This approach is model-agnostic and can be applied to any predictive model that supports performance evaluation.

The primary advantages of this test include its ability to provide clear, interpretable insights into feature importance across a wide range of model types. By directly measuring the effect of each feature on predictive accuracy, the test helps uncover which variables drive model behavior and can reveal unexpected dependencies or redundancies in the data. It is particularly useful for identifying overfitting, as features with disproportionately high importance may indicate that the model is relying too heavily on specific data characteristics. The method’s model-agnostic nature allows for consistent comparison across different algorithms, making it valuable for model selection and validation. Additionally, the visual output facilitates communication of results to both technical and non-technical stakeholders, supporting transparency and regulatory compliance.

It should be noted that permutation feature importance does not imply causality; it only measures the extent to which a feature contributes to the model’s predictive power within the context of the data and model structure. The method does not account for interactions between correlated features, which can result in the importance being attributed to one feature while underestimating the role of others. This limitation is particularly relevant in datasets with multicollinearity, where the true influence of individual features may be obscured. Furthermore, the test may highlight instability if the model relies heavily on features with high variance or those that are easily permuted, raising concerns about robustness. The approach is also limited by its inability to interact with certain modeling libraries, restricting its applicability in some environments. Interpretation challenges may arise if domain knowledge suggests that a feature should be important but the model assigns it low importance, potentially indicating issues with data quality or model specification.

This test shows the permutation feature importance results for two models: a logistic regression model (log_model_champion) and a random forest model (rf_model). The results are presented as horizontal bar plots, with each bar representing a feature and its corresponding importance score. The x-axis quantifies the importance, reflecting the decrease in model performance when the feature is permuted, while the y-axis lists the features in descending order of importance. For the logistic regression model, the most important features are Geography_Germany, IsActiveMember, Gender_Male, and Balance, with importance scores ranging from approximately 0.07 down to near zero. The random forest model, in contrast, assigns the highest importance to NumOfProducts, followed by Balance and Geography_Germany, with scores reaching up to 0.14. Features such as EstimatedSalary, HasCrCard, Tenure, and CreditScore exhibit low importance in both models, with values close to zero. The plots allow for direct comparison of feature importance across models, highlighting both shared and divergent patterns in feature reliance. Notably, the range of importance values is broader in the random forest model, indicating a more pronounced differentiation between key and peripheral features.

The test results reveal the following key insights:

  • Distinct Feature Reliance Across Models: The logistic regression model (log_model_champion) and the random forest model (rf_model) display markedly different patterns of feature importance, with each model prioritizing different variables for prediction.
  • Logistic Regression Emphasizes Geography and Membership: In the log_model_champion, Geography_Germany (importance ≈ 0.073), IsActiveMember (≈ 0.062), and Gender_Male (≈ 0.058) are the most influential features, collectively accounting for the majority of the model’s predictive power.
  • Random Forest Prioritizes Product Count and Balance: The rf_model assigns the highest importance to NumOfProducts (≈ 0.14) and Balance (≈ 0.05), with Geography_Germany (≈ 0.04) also contributing significantly, indicating a different set of primary drivers compared to the logistic regression model.
  • Low Impact Features Consistent Across Models: Features such as EstimatedSalary, HasCrCard, Tenure, and CreditScore consistently show low importance in both models, with scores near or below 0.01, suggesting limited influence on predictions regardless of model type.
  • Broader Importance Range in Random Forest: The rf_model demonstrates a wider spread of importance values, with a sharper distinction between highly influential and minimally impactful features, whereas the log_model_champion exhibits a more gradual decline in importance across features.
  • Potential Redundancy and Correlation Effects: The low importance of certain features, despite domain expectations, may indicate redundancy or the presence of correlated variables, particularly in the random forest model where feature interactions are more complex.

Based on these results, the permutation feature importance analysis reveals that the two models under consideration leverage different subsets of features to drive their predictions, with the logistic regression model relying more on demographic and membership-related variables, while the random forest model emphasizes transactional attributes such as product count and account balance. The consistent identification of low-importance features across both models suggests that certain variables contribute little to predictive accuracy in this context. The broader range of importance values in the random forest model highlights its capacity to differentiate sharply between key and peripheral features, potentially reflecting its ability to capture nonlinear relationships and interactions. The observed patterns also suggest that feature redundancy and correlation may influence the allocation of importance, particularly in models capable of modeling complex dependencies. Overall, the results provide a clear, quantitative basis for understanding how each model interprets the available data and which features are most critical to their respective predictive strategies.

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:4755
ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:a411
2026-01-10 02:34:05,255 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger does not exist in model's document
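
Because the test is built on sklearn.inspection.permutation_importance, a comparable calculation can be run directly against the underlying estimators to sanity-check a score. A rough sketch, where estimator, X_test, and y_test are placeholder names for the unwrapped sklearn model and a held-out feature matrix and target (not ValidMind objects):

import pandas as pd
from sklearn.inspection import permutation_importance

def permutation_scores(estimator, X_test, y_test, n_repeats=10, seed=42):
    # Shuffle each feature n_repeats times and record the mean drop in ROC AUC
    result = permutation_importance(
        estimator, X_test, y_test, scoring="roc_auc", n_repeats=n_repeats, random_state=seed
    )
    return pd.Series(result.importances_mean, index=X_test.columns).sort_values(ascending=False)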

SHAP Global Importance Champion Vs Challenger

SHAP Global Importance: Champion vs Challenger is designed to evaluate and visualize the global feature importance of machine learning models using SHAP (SHapley Additive exPlanations) values. The primary purpose of this test is to provide a transparent and quantitative understanding of how individual features contribute to model predictions, supporting model risk management by identifying which features most influence model outcomes and highlighting potential areas of risk or overfitting.

The test operates by first selecting an appropriate SHAP explainer based on the model type—TreeExplainer for tree-based models and LinearExplainer for linear models. The explainer computes Shapley values for each feature across all instances in the dataset, quantifying the marginal contribution of each feature to the model’s output. These values are then aggregated to produce two main visualizations: the mean importance plot and the summary plot. The mean importance plot displays the average absolute Shapley value for each feature, normalized as a percentage, to indicate global importance. The summary plot presents the distribution of Shapley values for each feature, with each point representing a single instance and colored by the feature’s value, allowing for the assessment of both the magnitude and direction of feature effects. The SHAP value itself measures the change in the model’s prediction when a feature is included versus excluded, with values typically ranging from negative to positive, where higher absolute values indicate greater influence. In practice, features with high mean SHAP values are considered more influential, while a concentration of importance in a few features or unexpected patterns may signal overfitting or model reliance on spurious relationships.

The primary advantages of this test include its ability to provide both global and local interpretability of model behavior, making it possible to understand not only which features are most important overall but also how they affect individual predictions. SHAP values are grounded in cooperative game theory, ensuring a fair and consistent allocation of importance among features. The visualizations produced by this test facilitate the identification of dominant features, the detection of potential biases, and the assessment of model robustness. This level of transparency is particularly valuable in regulated environments or high-stakes applications, where understanding the rationale behind model decisions is critical for compliance and trust. Additionally, the test supports comparative analysis between different models, enabling stakeholders to evaluate changes in feature importance and their implications for model performance and risk.

It should be noted that the test has several limitations and potential risks. In high-dimensional datasets, the interpretation of SHAP values can become complex, as the number of features may obscure meaningful patterns or dilute the importance of truly influential variables. The assignment of importance does not always translate directly to real-world impact, as the context and domain knowledge are required to interpret the results appropriately. Signs of high risk include an overemphasis on a small subset of features, which may indicate overfitting, and the presence of unexpected or illogical features with high importance, suggesting that the model may be capturing spurious correlations. Additionally, high variability or scatter in the summary plot can signal instability in feature effects, warranting further investigation. Users should exercise caution in drawing conclusions solely from SHAP values and consider complementary analyses to validate model behavior.

This test shows the results through a series of SHAP visualizations for both the champion (logistic regression) and challenger (random forest) models. The first plot is a normalized mean importance bar chart for the champion model, displaying the top features ranked by their average absolute SHAP value as a percentage. The horizontal axis represents normalized SHAP value (percentage), while the vertical axis lists the features. The most influential features for the champion model are "IsActiveMember," "Geography_Germany," and "Gender_Male," each contributing significantly more than the remaining features, with "IsActiveMember" reaching nearly 100% normalized importance. The second plot is a SHAP summary plot for the champion model, where each dot represents a single instance’s SHAP value for a feature, colored by the feature’s value from low (blue) to high (red). The horizontal axis shows the SHAP value (impact on model output), and the vertical axis lists the features in order of importance. This plot reveals the direction and spread of feature effects, with "IsActiveMember" and "Geography_Germany" showing the widest range of SHAP values. For the challenger random forest model, the third and fourth plots focus on "CreditScore" and "Tenure," showing both normalized SHAP value distributions and SHAP interaction values. The axes are similar, with the horizontal axis representing either normalized SHAP value or SHAP interaction value, and the vertical axis listing the features. The random forest model’s plots indicate a much narrower focus, with only two features showing significant importance and interaction effects, and a wider spread of SHAP values, including both positive and negative contributions. The color gradient in all summary plots provides additional context on how feature values relate to their impact on the model’s output.

The test results reveal the following key insights:

  • Champion Model Relies Heavily on a Few Features: The logistic regression champion model assigns the highest normalized SHAP importance to "IsActiveMember" (close to 100%), followed by "Geography_Germany" and "Gender_Male" (both above 70%), indicating a strong reliance on these features for its predictions.
  • Challenger Model Focuses on CreditScore and Tenure: The random forest challenger model’s SHAP plots show that only "CreditScore" and "Tenure" have substantial normalized SHAP values, with all other features contributing negligibly, suggesting a much narrower feature utilization.
  • Feature Effect Directions and Variability Differ by Model: The champion model’s summary plot displays a broad range of SHAP values for its top features, with both positive and negative impacts, while the challenger model’s plots show more symmetric and concentrated distributions, indicating different patterns of feature influence.
  • Potential Overemphasis and Risk of Overfitting in Both Models: The concentration of importance in a small number of features for both models, especially the near-exclusive reliance on "IsActiveMember" in the champion and on "CreditScore" and "Tenure" in the challenger, may signal a risk of overfitting or model instability.
  • Distinct Feature Interactions in Challenger Model: The SHAP interaction plot for the random forest model reveals notable interaction effects between "CreditScore" and "Tenure," with both positive and negative interaction values, highlighting complex dependencies not present in the champion model.

Based on these results, the SHAP global importance analysis demonstrates that the champion and challenger models exhibit markedly different patterns of feature reliance and interaction. The champion model’s predictions are driven primarily by a small set of categorical features, with "IsActiveMember" dominating the importance landscape, while the challenger model’s output is almost entirely determined by two numerical features, "CreditScore" and "Tenure." The summary plots further reveal that the direction and magnitude of feature effects vary substantially between models, with the champion model showing a wider range of SHAP values and the challenger model displaying more concentrated, symmetric distributions. The presence of strong feature interactions in the challenger model, as indicated by the SHAP interaction plot, suggests that it captures more complex relationships between variables. However, the high concentration of importance in a few features for both models raises the possibility of overfitting or excessive model dependence on specific variables, which may impact model robustness and generalizability. These observations provide a clear, quantitative basis for understanding how each model processes input features and where potential risks may arise in their decision-making logic.

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:4b5c
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:9695
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:3046
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:159f
2026-01-10 02:34:44,031 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger does not exist in model's document
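
The explainer selection and aggregation described above can also be reproduced with the shap package directly when you want to inspect the raw Shapley values. The sketch below is a rough approximation, not the ValidMind implementation: it assumes shap is installed and that you have the unwrapped estimators and a feature DataFrame (placeholder names), and the shape of the returned SHAP values varies across shap versions and model types, so verify against your environment.

import numpy as np
import shap

def mean_abs_shap(estimator, X, model_type="tree"):
    # Use the explainers named in the test description: TreeExplainer for tree-based
    # models, LinearExplainer for linear models
    explainer = shap.TreeExplainer(estimator) if model_type == "tree" else shap.LinearExplainer(estimator, X)
    shap_values = explainer.shap_values(X)
    if isinstance(shap_values, list):            # some shap versions: one array per class
        shap_values = shap_values[1]
    elif getattr(shap_values, "ndim", 2) == 3:   # others: (rows, features, classes)
        shap_values = shap_values[:, :, 1]
    importance = np.abs(shap_values).mean(axis=0)
    # Normalize so the most important feature reads as 100%, as in the mean importance plots
    return dict(zip(X.columns, 100 * importance / importance.max()))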

In summary

In this third notebook, you learned how to:

Next steps

Finalize validation and reporting

Now that you're familiar with the basics of using the ValidMind Library to run and log validation tests, let's learn how to implement some custom tests and wrap up our validation: 4 — Finalize validation and reporting