ValidMind for model validation 4 — Finalize testing and reporting

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this last notebook, finalize the compliance assessment process and have a complete validation report ready for review.

This notebook will walk you through how to supplement ValidMind tests with your own custom tests and include them as additional evidence in your validation report. A custom test is any function that takes a set of inputs and parameters as arguments and returns one or more outputs, such as a table or a plot.
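For example, a minimal custom test might look like the sketch below. The test ID my_custom_tests.PositiveRate and the metric it computes are illustrative assumptions only; later in this notebook you'll implement a full custom test of your own.

# A minimal sketch of a custom test (illustrative only)
import pandas as pd

import validmind as vm


@vm.test("my_custom_tests.PositiveRate")
def positive_rate(dataset):
    """Returns the proportion of positive labels in the dataset's target column."""
    # `dataset` is a VMDataset; `.y` exposes the true target values
    return pd.DataFrame({"Positive rate": [dataset.y.mean()]})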

For a more in-depth introduction to custom tests, refer to our Implement custom tests notebook.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals

Prerequisites

To finalize validation and reporting, you'll first need to have:

Setting up

This section should be very familiar to you now — as we performed the same actions in the previous two notebooks in this series.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model validation" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2026-01-30 23:15:29,905 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load the same sample Bank Customer Churn Prediction dataset that was used to develop the champion model, which we'll then independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}
# Initialize the raw dataset for use in ValidMind tests
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)
Let's also balance our raw dataset so that it contains an equal number of exited and not exited customers:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test:

# Register the balanced data; 'balanced_raw_dataset' is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)
# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity. The results table lists the top ten strongest absolute correlations, displaying the feature pairs, their Pearson correlation coefficients, and a Pass/Fail status based on a threshold of 0.3. Only one feature pair exceeds the threshold, while the remaining pairs show lower correlation magnitudes and pass the test criteria.

Key insights:

  • One feature pair exceeds correlation threshold: The pair (Age, Exited) has a correlation coefficient of 0.3288, surpassing the 0.3 threshold and receiving a Fail status.
  • All other feature pairs below threshold: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.0492 to 0.2021, all below the threshold and marked as Pass.
  • No evidence of widespread multicollinearity: Only a single pair demonstrates a correlation above the threshold, with no clusters of high correlations among other features.

The results indicate that the dataset contains minimal evidence of high linear relationships between most feature pairs, with only the (Age, Exited) pair exceeding the specified threshold. The overall correlation structure suggests low risk of feature redundancy or multicollinearity based on the tested pairs.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3288 Fail
(IsActiveMember, Exited) -0.2021 Pass
(Balance, NumOfProducts) -0.1822 Pass
(Balance, Exited) 0.1376 Pass
(NumOfProducts, Exited) -0.0531 Pass
(NumOfProducts, IsActiveMember) 0.0492 Pass
(CreditScore, HasCrCard) -0.0448 Pass
(Age, Balance) 0.0444 Pass
(Tenure, IsActiveMember) -0.0365 Pass
(Age, NumOfProducts) -0.0360 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3288 Fail
1 (IsActiveMember, Exited) -0.2021 Pass
2 (Balance, NumOfProducts) -0.1822 Pass
3 (Balance, Exited) 0.1376 Pass
4 (NumOfProducts, Exited) -0.0531 Pass
5 (NumOfProducts, IsActiveMember) 0.0492 Pass
6 (CreditScore, HasCrCard) -0.0448 Pass
7 (Age, Balance) 0.0444 Pass
8 (Tenure, IsActiveMember) -0.0365 Pass
9 (Age, NumOfProducts) -0.0360 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity within the dataset. The results table presents the top ten absolute Pearson correlation coefficients, along with the corresponding feature pairs and their Pass/Fail status relative to the specified threshold of 0.3. All reported coefficients are below the threshold, and each feature pair is marked as Pass.

Key insights:

  • No feature pairs exceed correlation threshold: All absolute Pearson correlation coefficients are below the 0.3 threshold, with the highest observed value being -0.2021 between IsActiveMember and Exited.
  • Low to moderate linear relationships: The top ten feature pair correlations range from -0.2021 to -0.0305, indicating only weak to very weak linear associations among the evaluated features.
  • Consistent Pass status across all pairs: Every feature pair in the results is marked as Pass, confirming the absence of high linear correlations within the top-ranked pairs.

The test results indicate that the dataset does not exhibit strong linear relationships or multicollinearity among the evaluated feature pairs. All observed correlations are well below the specified threshold, supporting the independence of features for subsequent modeling and analysis.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.2021 Pass
(Balance, NumOfProducts) -0.1822 Pass
(Balance, Exited) 0.1376 Pass
(NumOfProducts, Exited) -0.0531 Pass
(NumOfProducts, IsActiveMember) 0.0492 Pass
(CreditScore, HasCrCard) -0.0448 Pass
(Tenure, IsActiveMember) -0.0365 Pass
(Tenure, HasCrCard) 0.0351 Pass
(CreditScore, Exited) -0.0332 Pass
(CreditScore, EstimatedSalary) -0.0305 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
1427 779 0 0.00 2 0 1 111906.00 0 False False False
13 635 7 0.00 2 1 1 65951.65 0 False True False
7243 435 3 151739.65 1 1 0 167461.50 0 True False True
6101 661 0 109493.62 1 0 0 188324.01 1 True False False
191 763 6 100160.75 1 1 0 33462.94 1 True False True
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's go ahead and import the champion model submitted by the model development team in the format of a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning:

Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations

Train potential challenger model

We'll also train our random forest classification challenger model to see how it compares:

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, which can then be passed to other functions for analysis and tests on the data:

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)
# Assign predictions to Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Assign predictions to Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-01-30 23:15:37,881 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-30 23:15:37,882 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-30 23:15:37,883 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-30 23:15:37,885 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-30 23:15:37,887 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-30 23:15:37,890 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-30 23:15:37,891 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-30 23:15:37,892 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-30 23:15:37,895 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-30 23:15:37,917 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-30 23:15:37,919 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-30 23:15:37,941 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-30 23:15:37,944 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-30 23:15:37,956 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-30 23:15:37,957 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-30 23:15:37,969 - INFO(validmind.vm_models.dataset.utils): Done running predict()

Implementing custom tests

Thanks to the model documentation, we know that the model development team implemented a custom test to further evaluate the performance of the champion model.

In a typical model validation situation, you would load a saved custom test provided by the model development team. In the following section, we'll have you implement the same custom test and make it available for reuse, to familiarize you with the process.

Want to learn more about custom tests?

Refer to our in-depth introduction to custom tests: Implement custom tests

Implement a custom inline test

Let's implement the same custom inline test that the model development team used in their performance evaluations: a test that calculates the confusion matrix for a binary classification model.

  • An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
  • You'll note that the custom test function is just a regular Python function that can include and require any Python library as you see fit.

Create a confusion matrix plot

Let's first create a confusion matrix plot using the confusion_matrix function from the sklearn.metrics module:

import matplotlib.pyplot as plt
from sklearn import metrics

# Get the predicted classes
y_pred = log_reg.predict(vm_test_ds.x)

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[False, True]
)
cm_display.plot()

Next, create a @vm.test wrapper that will allow you to create a reusable test. Note the following changes in the code below:

  • The function confusion_matrix takes two arguments, dataset and model. These are a VMDataset and a VMModel object, respectively.
    • VMDataset objects allow you to access the dataset's true (target) values through the .y attribute.
    • VMDataset objects allow you to access the predictions for a given model by calling the .y_pred() method.
  • The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.
  • The function body calculates the confusion matrix using the sklearn.metrics.confusion_matrix function as we just did above.
  • The function then returns the ConfusionMatrixDisplay.figure_ object — this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.
  • The @vm.test decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID my_custom_tests.ConfusionMatrix.
@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself

You can now run the newly created custom test on both the training and test datasets for both models using the run_test() function:

# Champion train and test
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_train_ds,vm_test_ds],
        "model" : [vm_log_model]
    }
).log()

Confusion Matrix Champion

The Confusion Matrix test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The resulting matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of model accuracy and error distribution. The first matrix corresponds to the training dataset, while the second matrix summarizes results for the test dataset.

Key insights:

  • High correct classification rates in both datasets: The training dataset shows 822 true negatives and 818 true positives, while the test dataset shows 203 true negatives and 191 true positives, indicating strong correct classification performance.
  • Moderate false positive and false negative rates: The training dataset records 486 false positives and 419 false negatives; the test dataset records 115 false positives and 118 false negatives, reflecting a balanced distribution of misclassifications across both classes.
  • Consistent error patterns across train and test: The relative proportions of false positives and false negatives are similar between the training and test datasets, suggesting stable model behavior and no evidence of overfitting or underfitting based on confusion matrix structure.

The confusion matrix results indicate that the model achieves a high number of correct classifications for both positive and negative classes in both training and test datasets. The distribution of false positives and false negatives is balanced and consistent across datasets, supporting the conclusion that the model maintains stable classification performance without significant shifts in error patterns between training and test samples.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:champion:4e57
ValidMind Figure my_custom_tests.ConfusionMatrix:champion:856d
2026-01-30 23:15:47,843 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:champion does not exist in model's document
# Challenger train and test
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_train_ds,vm_test_ds],
        "model" : [vm_rf_model]
    }
).log()

Confusion Matrix Challenger

The Confusion Matrix:challenger test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The resulting confusion matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of model accuracy and error types. The matrices are presented separately for the train and test datasets, allowing for assessment of model fit and generalization.

Key insights:

  • Perfect classification on training data: The training dataset confusion matrix shows 1,287 true negatives, 1,297 true positives, and zero false positives or false negatives, indicating no misclassifications on the training set.
  • High but imperfect accuracy on test data: The test dataset confusion matrix records 240 true negatives, 229 true positives, 88 false positives, and 90 false negatives, indicating the presence of both types of misclassification.
  • Balanced error distribution on test set: The number of false positives (88) and false negatives (90) on the test set are nearly equal, suggesting no strong bias toward one error type over the other.

The confusion matrix results indicate that the model achieves perfect separation on the training data, with no observed misclassifications. On the test data, the model maintains high accuracy but exhibits both false positives and false negatives in nearly equal measure, reflecting a balanced error profile. This pattern suggests strong model fit to the training data and reasonable generalization to unseen data, with no evidence of systematic bias toward either class in the test set.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:challenger:aacf
ValidMind Figure my_custom_tests.ConfusionMatrix:challenger:e44a
2026-01-30 23:15:58,129 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:challenger does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected, as when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.

Add parameters to custom tests

Custom tests can take parameters just like any other function. To demonstrate, let's modify the confusion_matrix function to take an additional parameter normalize that will allow you to normalize the confusion matrix:

@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model, normalize=False):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    if normalize:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred, normalize="all")
    else:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself

Pass parameters to custom tests

You can pass parameters to custom tests by providing a dictionary of parameters to the run_test() function.

  • The parameters will override any default parameters set in the custom test definition. Note that dataset and model are still passed as inputs.
  • Because dataset and model are VMDataset and VMModel objects, they are treated as inputs rather than parameters.

Re-running and logging the custom confusion matrix with normalize=True for both models and our testing dataset looks like this:

# Champion with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_log_model]
    },
    params={"normalize": True}
).log()

Confusion Matrix Test Normalized Champion

The Confusion Matrix test evaluates the classification performance of the log_model_champion on the test_dataset_final by displaying the normalized proportions of true positives, true negatives, false positives, and false negatives. The matrix presents the fraction of predictions in each category, with values normalized such that the sum of all cells equals 1. The results provide a visual summary of the model's ability to correctly and incorrectly classify both positive and negative cases.

Key insights:

  • Balanced correct classification rates: The model correctly classifies negative cases (True Negatives) at 0.31 and positive cases (True Positives) at 0.30, indicating similar accuracy for both classes.
  • Comparable error rates for both classes: The proportion of false positives (0.19) and false negatives (0.20) are closely matched, suggesting that misclassification is distributed evenly between the two error types.
  • No class dominates prediction outcomes: All four matrix cells are within the range of 0.19 to 0.31, indicating that neither class is disproportionately favored or neglected by the model.

The confusion matrix reveals that the model demonstrates balanced performance across both positive and negative classes, with similar rates of correct and incorrect predictions. The distribution of outcomes suggests that the model does not exhibit a strong bias toward either class, and error rates are evenly split between false positives and false negatives. This balance indicates consistent classification behavior across the evaluated dataset.

Parameters:

{
  "normalize": true
}
            

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_normalized_champion:cec3
2026-01-30 23:16:07,407 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_champion does not exist in model's document
# Challenger with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_rf_model]
    },
    params={"normalize": True}
).log()

Confusion Matrix Test Normalized Challenger

The ConfusionMatrix:test_normalized_challenger test evaluates the classification performance of the rf_model on the test_dataset_final by presenting a normalized confusion matrix. The matrix displays the proportion of true and false predictions for each class, with values normalized such that each cell represents the fraction of total predictions. The results are visualized as a heatmap, with the diagonal cells indicating correct classifications and the off-diagonal cells representing misclassifications.

Key insights:

  • Balanced correct classification rates: The model correctly classifies 0.37 of the total samples as True Negatives and 0.35 as True Positives, indicating similar accuracy for both classes.
  • Moderate misclassification rates: False Positives and False Negatives are observed at 0.14 each, reflecting a moderate level of misclassification for both classes.
  • No class dominance in errors: The distribution of errors is symmetric, with no single class exhibiting a disproportionately higher misclassification rate.

The confusion matrix reveals that the model demonstrates balanced performance across both classes, with correct and incorrect predictions distributed evenly. The normalized values indicate that the model does not favor one class over the other in its predictions, and both types of errors (False Positives and False Negatives) occur at comparable rates. This suggests consistent classification behavior without significant bias toward either class.

Parameters:

{
  "normalize": true
}
            

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_normalized_challenger:880b
2026-01-30 23:16:18,775 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_challenger does not exist in model's document

Use external test providers

Sometimes you may want to reuse the same set of custom tests across multiple models and share them with others in your organization, as the model development team would have done with you in the example workflow featured in this series of notebooks. In this case, you can create an external custom test provider that allows you to load custom tests from a local folder or a Git repository.

In this section, you'll learn how to declare a local filesystem test provider that loads tests from a local folder by following these high-level steps:

  1. Create a folder of custom tests from existing inline tests (tests that exist in your active Jupyter Notebook)
  2. Save an inline test to a file
  3. Define and register a LocalTestProvider that points to that folder
  4. Run test provider tests
  5. Add the test results to your documentation

Create custom tests folder

Let's start by creating a new folder that will contain reusable custom tests from your existing inline tests.

The following code snippet will create a new my_tests directory in the current working directory if it doesn't exist:

tests_folder = "my_tests"

import os

# create tests folder
os.makedirs(tests_folder, exist_ok=True)

# remove existing tests
for f in os.listdir(tests_folder):
    # remove files and pycache
    if f.endswith(".py") or f == "__pycache__":
        os.system(f"rm -rf {tests_folder}/{f}")

After running the command above, confirm that a new my_tests directory was created successfully. For example:

~/notebooks/tutorials/model_validation/my_tests/

Save an inline test

The @vm.test decorator we used in Implement a custom inline test above to register one-off custom tests also includes a convenience method on the function object that allows you to simply call <func_name>.save() to save the test to a Python file at a specified path.

While save() will get you started by creating the file and saving the function code with the correct name, it won't automatically include any imports, or any other functions or variables defined outside of the function, that the test needs to run. To solve this, pass in the optional imports argument to ensure the necessary imports are added to the file.

The confusion_matrix test requires the following additional imports:

import matplotlib.pyplot as plt
from sklearn import metrics

Let's pass these imports to the save() method to ensure they are included in the file with the following command:

confusion_matrix.save(
    # Save it to the custom tests folder we created
    tests_folder,
    imports=["import matplotlib.pyplot as plt", "from sklearn import metrics"],
)
2026-01-30 23:16:19,492 - INFO(validmind.tests.decorator): Saved to /home/runner/work/documentation/documentation/site/notebooks/EXECUTED/model_validation/my_tests/ConfusionMatrix.py!Be sure to add any necessary imports to the top of the file.
2026-01-30 23:16:19,492 - INFO(validmind.tests.decorator): This metric can be run with the ID: <test_provider_namespace>.ConfusionMatrix
  • # Saved from __main__.confusion_matrix
    # Original Test ID: my_custom_tests.ConfusionMatrix
    # New Test ID: <test_provider_namespace>.ConfusionMatrix
  • def ConfusionMatrix(dataset, model, normalize=False):

Register a local test provider

Now that your my_tests folder has a sample custom test, let's initialize a test provider that will tell the ValidMind Library where to find your custom tests:

  • ValidMind offers out-of-the-box test providers for local tests (tests in a folder) or a GitHub provider for tests in a GitHub repository.
  • You can also create your own test provider by creating a class that has a load_test method that takes a test ID and returns the test function matching that ID, as in the sketch below.
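As a rough sketch, a do-it-yourself provider only needs to expose that load_test method. The class below is an illustrative assumption of how one could load tests from a folder of .py files; it is not ValidMind's implementation, and for local folders the built-in LocalTestProvider shown next is the simpler choice.

import importlib.util
import os


class FolderTestProvider:
    """Illustrative test provider that loads test functions from .py files in a folder."""

    def __init__(self, root_folder):
        self.root_folder = root_folder

    def load_test(self, test_id):
        # e.g. "classification.ConfusionMatrix" -> <root_folder>/classification/ConfusionMatrix.py
        path = os.path.join(self.root_folder, *test_id.split(".")) + ".py"
        spec = importlib.util.spec_from_file_location(test_id, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        # By convention in this sketch, the test function shares its file's base name
        return getattr(module, test_id.split(".")[-1])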
Want to learn more about test providers?

An extended introduction to test providers can be found in: Integrate external test providers

Initialize a local test provider

For most use cases, using a LocalTestProvider that allows you to load custom tests from a designated directory should be sufficient.

The most important attribute for a test provider is its namespace. This is a string that will be used to prefix test IDs in model documentation. This allows you to have multiple test providers with tests that can even share the same ID, but are distinguished by their namespace.

Let's go ahead and load the custom tests from our my_tests directory:

from validmind.tests import LocalTestProvider

# initialize the test provider with the tests folder we created earlier
my_test_provider = LocalTestProvider(tests_folder)

vm.tests.register_test_provider(
    namespace="my_test_provider",
    test_provider=my_test_provider,
)
# `my_test_provider.load_test()` will be called for any test ID that starts with `my_test_provider`
# e.g. `my_test_provider.ConfusionMatrix` will look for a function named `ConfusionMatrix` in the `my_tests/ConfusionMatrix.py` file

Run test provider tests

Now that we've set up the test provider, we can run any test that's located in the tests folder by using the run_test() method as with any other test:

  • For tests that reside in a test provider directory, the test ID will be the namespace specified when registering the provider, followed by the path to the test file relative to the tests folder.
  • For example, the Confusion Matrix test we created earlier will have the test ID my_test_provider.ConfusionMatrix. You could organize the tests in subfolders, say classification and regression, and the test ID for the Confusion Matrix test would then be my_test_provider.classification.ConfusionMatrix, as in the hypothetical example below.
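For instance, a hypothetical call for that subfolder layout might look like the following. The classification subfolder is assumed for illustration and doesn't exist in this notebook:

# Hypothetical: run a test saved under my_tests/classification/ConfusionMatrix.py
vm.tests.run_test(
    test_id="my_test_provider.classification.ConfusionMatrix",
    inputs={"dataset": vm_test_ds, "model": vm_log_model},
)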

Let's go ahead and re-run the confusion matrix test with our testing dataset for our two models by using the test ID my_test_provider.ConfusionMatrix. This should load the test from the test provider and run it as before.

# Champion with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_log_model]
    }
).log()

Confusion Matrix Champion

The Confusion Matrix test evaluates the classification performance of the log_model_champion on the test_dataset_final by comparing predicted and true class labels. The resulting 2x2 matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error distribution. The matrix shows the number of correct and incorrect predictions for both positive and negative classes, enabling assessment of the model's strengths and weaknesses in distinguishing between classes.

Key insights:

  • Higher true negative and true positive counts: The model correctly classified 203 negative cases (true negatives) and 191 positive cases (true positives), indicating balanced correct predictions across both classes.
  • Comparable false positive and false negative rates: There are 115 false positives and 118 false negatives, showing that misclassification rates are similar for both types of errors.
  • No class dominance in misclassification: The distribution of errors does not indicate a strong bias toward either false positives or false negatives, suggesting the model does not disproportionately favor one class over the other.

The confusion matrix reveals that the model achieves a balanced performance, with similar rates of correct and incorrect predictions for both positive and negative classes. The absence of a dominant error type indicates that the model's classification errors are evenly distributed, supporting a consistent approach to both classes in the test dataset.

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:champion:2ea8
2026-01-30 23:16:27,536 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:champion does not exist in model's document
# Challenger with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_rf_model]
    }
).log()

Confusion Matrix Challenger

The Confusion Matrix test evaluates the classification performance of the rf_model on the test_dataset_final by comparing predicted and true labels. The resulting 2x2 matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error types. The matrix shows the distribution of correct and incorrect predictions for both positive and negative classes.

Key insights:

  • High true positive and true negative counts: The model correctly classified 229 positive cases (true positives) and 240 negative cases (true negatives), indicating strong performance in both classes.
  • Balanced error distribution: The number of false positives (88) and false negatives (90) are similar, suggesting that the model's misclassification rates are comparable for both classes.
  • Low overall misclassification: The sum of false positives and false negatives (178) is substantially lower than the sum of correct predictions (469), reflecting a high overall accuracy.

The confusion matrix indicates that the rf_model demonstrates robust classification performance on the test dataset, with high counts of correct predictions for both positive and negative classes. The distribution of errors is balanced, and the overall misclassification rate remains low relative to the total number of predictions. This result reflects effective model discrimination between classes.

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:challenger:95f4
2026-01-30 23:16:36,554 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:challenger does not exist in model's document

Verify test runs

Our final task is to verify that all the tests provided by the model development team were run and reported accurately. Note the appended result_ids that delineate which dataset we ran each test with, where relevant.

Here, we'll specify all the tests we'd like to independently rerun in a dictionary called test_config. Note that inputs and input_grid expect the input_id of the dataset or model as the value, rather than the variable name we assigned:

test_config = {
    # Run with the raw dataset
    'validmind.data_validation.DatasetDescription:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.DescriptiveStatistics:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.MissingValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.ClassImbalance:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.Duplicates:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.HighCardinality:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {
            'num_threshold': 100,
            'percent_threshold': 0.1,
            'threshold_type': 'percent'
        }
    },
    'validmind.data_validation.Skewness:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_threshold': 1}
    },
    'validmind.data_validation.UniqueRows:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TooManyZeroValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_percent_threshold': 0.03}
    },
    'validmind.data_validation.IQROutliersTable:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'threshold': 5}
    },
    # Run with the preprocessed dataset
    'validmind.data_validation.DescriptiveStatistics:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TabularDescriptionTables:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.MissingValues:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TargetRateBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'},
        'params': {'default_column': 'loan_status'}
    },
    # Run with the training and test datasets
    'validmind.data_validation.DescriptiveStatistics:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.TabularDescriptionTables:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.ClassImbalance:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.UniqueRows:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.MutualInformation:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_threshold': 0.01}
    },
    'validmind.data_validation.PearsonCorrelationMatrix:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.HighPearsonCorrelation:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'max_threshold': 0.3, 'top_n_correlations': 10}
    },
    'validmind.model_validation.ModelMetadata': {
        'input_grid': {'model': ['log_model_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ModelParameters': {
        'input_grid': {'model': ['log_model_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ROCCurve': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final'], 'model': ['log_model_champion']}
    },
    'validmind.model_validation.sklearn.MinimumROCAUCScore': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final'], 'model': ['log_model_champion']},
        'params': {'min_threshold': 0.5}
    }
}

Then batch run and log our tests in test_config:

for t in test_config:
    print(t)
    try:
        # Check if test has input_grid
        if 'input_grid' in test_config[t]:
            # For tests with input_grid, pass the input_grid configuration
            if 'params' in test_config[t]:
                vm.tests.run_test(t, input_grid=test_config[t]['input_grid'], params=test_config[t]['params']).log()
            else:
                vm.tests.run_test(t, input_grid=test_config[t]['input_grid']).log()
        else:
            # Original logic for regular inputs
            if 'params' in test_config[t]:
                vm.tests.run_test(t, inputs=test_config[t]['inputs'], params=test_config[t]['params']).log()
            else:
                vm.tests.run_test(t, inputs=test_config[t]['inputs']).log()
    except Exception as e:
        print(f"Error running test {t}: {str(e)}")
validmind.data_validation.DatasetDescription:raw_data

Dataset Description Raw Data

The DatasetDescription test provides a comprehensive summary of the dataset's structure and content, detailing the type, count, missingness, and distinct value statistics for each column. The results table enumerates all columns, specifying their data types (numeric or categorical), the total number of records, the proportion and count of missing values, and the number and proportion of unique values per column. This overview enables a clear understanding of the dataset's completeness, feature diversity, and potential data quality considerations.

Key insights:

  • No missing values detected: All columns report 0 missing values, with both the count and percentage of missing entries at 0.0%.
  • High cardinality in select numeric features: The Balance and EstimatedSalary columns exhibit high distinct value counts (5088 and 8000 respectively), with EstimatedSalary showing a distinct percentage of 1.0, indicating all values are unique.
  • Low cardinality in categorical features: Categorical columns such as Geography, Gender, HasCrCard, IsActiveMember, and Exited have between 2 and 3 distinct values, representing a small set of categories.
  • Consistent record count across features: All columns have a count of 8000, indicating uniform data availability across the dataset.
  • No unsupported data types present: All columns are classified as either numeric or categorical, with no unsupported types identified.

The dataset is fully populated with no missing values and consistent record counts across all features. Numeric columns display a range of cardinalities, with some features (such as EstimatedSalary) containing exclusively unique values, while categorical features maintain low cardinality. The absence of unsupported data types and missing values indicates a high level of data completeness and structural integrity.

Tables

Dataset Description

Name Type Count Missing Missing % Distinct Distinct %
CreditScore Numeric 8000.0 0 0.0 452 0.0565
Geography Categorical 8000.0 0 0.0 3 0.0004
Gender Categorical 8000.0 0 0.0 2 0.0002
Age Numeric 8000.0 0 0.0 69 0.0086
Tenure Numeric 8000.0 0 0.0 11 0.0014
Balance Numeric 8000.0 0 0.0 5088 0.6360
NumOfProducts Numeric 8000.0 0 0.0 4 0.0005
HasCrCard Categorical 8000.0 0 0.0 2 0.0002
IsActiveMember Categorical 8000.0 0 0.0 2 0.0002
EstimatedSalary Numeric 8000.0 0 0.0 8000 1.0000
Exited Categorical 8000.0 0 0.0 2 0.0002
2026-01-30 23:16:41,512 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DatasetDescription:raw_data does not exist in model's document
validmind.data_validation.DescriptiveStatistics:raw_data

Descriptive Statistics Raw Data

The Descriptive Statistics test evaluates the distributional characteristics and diversity of both numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables, including measures of central tendency, dispersion, and range, as well as frequency-based summaries for two categorical variables. The tables provide a comprehensive overview of the dataset’s structure, highlighting the spread, central values, and category distributions for each variable.

Key insights:

  • Wide range and skewness in Balance: The Balance variable exhibits a minimum of 0.0, a median of 97,264.0, and a maximum of 250,898.0, with a mean (76,434.1) substantially below the median, indicating a right-skewed distribution and a large proportion of zero balances.
  • CreditScore and Age distributions are symmetric: CreditScore and Age have means (650.2 and 38.9, respectively) closely aligned with their medians (652.0 and 37.0), and standard deviations (96.8 and 10.5) consistent with their ranges, suggesting relatively symmetric distributions without pronounced skewness.
  • Categorical dominance in Geography and Gender: The Geography variable is dominated by "France" (50.12% of records), and Gender is dominated by "Male" (54.95%), indicating moderate category concentration but not extreme overdominance.
  • Binary variables show balanced representation: HasCrCard and IsActiveMember are binary variables with means of 0.70 and 0.52, respectively, indicating a relatively balanced split between categories.

The dataset demonstrates generally well-behaved distributions for most numerical variables, with the exception of Balance, which is highly skewed and contains a substantial proportion of zero values. Categorical variables show moderate concentration in the top categories but retain diversity, with no single category exceeding 55% representation. Overall, the data structure supports robust analysis, with the primary distributional risk arising from the skewness and zero-inflation in the Balance variable.

Tables

Numerical Variables

Name Count Mean Std Min 25% 50% 75% 90% 95% Max
CreditScore 8000.0 650.1596 96.8462 350.0 583.0 652.0 717.0 778.0 813.0 850.0
Age 8000.0 38.9489 10.4590 18.0 32.0 37.0 44.0 53.0 60.0 92.0
Tenure 8000.0 5.0339 2.8853 0.0 3.0 5.0 8.0 9.0 9.0 10.0
Balance 8000.0 76434.0965 62612.2513 0.0 0.0 97264.0 128045.0 149545.0 162488.0 250898.0
NumOfProducts 8000.0 1.5325 0.5805 1.0 1.0 1.0 2.0 2.0 2.0 4.0
HasCrCard 8000.0 0.7026 0.4571 0.0 0.0 1.0 1.0 1.0 1.0 1.0
IsActiveMember 8000.0 0.5199 0.4996 0.0 0.0 1.0 1.0 1.0 1.0 1.0
EstimatedSalary 8000.0 99790.1880 57520.5089 12.0 50857.0 99505.0 149216.0 179486.0 189997.0 199992.0

Categorical Variables

Name Count Number of Unique Values Top Value Top Value Frequency Top Value Frequency %
Geography 8000.0 3.0 France 4010.0 50.12
Gender 8000.0 2.0 Male 4396.0 54.95
2026-01-30 23:16:47,359 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:raw_data does not exist in model's document
validmind.data_validation.MissingValues:raw_data

✅ Missing Values Raw Data

The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the threshold. All features in the dataset are listed with their corresponding missing value statistics and test outcomes.

Key insights:

  • No missing values detected: All features report zero missing values, with both the number and percentage of missing values recorded as 0.0%.
  • Universal test pass across features: Every feature meets the missing value threshold criterion, resulting in a "Pass" status for all columns.

The dataset demonstrates complete data integrity with respect to missing values, as all features contain fully populated records and satisfy the established threshold. This indicates a high level of data completeness, supporting reliable downstream modeling and analysis.

Parameters:

{
  "min_threshold": 1
}
            

Tables

Column Number of Missing Values Percentage of Missing Values (%) Pass/Fail
CreditScore 0 0.0 Pass
Geography 0 0.0 Pass
Gender 0 0.0 Pass
Age 0 0.0 Pass
Tenure 0 0.0 Pass
Balance 0 0.0 Pass
NumOfProducts 0 0.0 Pass
HasCrCard 0 0.0 Pass
IsActiveMember 0 0.0 Pass
EstimatedSalary 0 0.0 Pass
Exited 0 0.0 Pass
2026-01-30 23:16:49,888 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:raw_data does not exist in model's document
validmind.data_validation.ClassImbalance:raw_data

✅ Class Imbalance Raw Data

The Class Imbalance test evaluates the distribution of target classes within the dataset to identify potential imbalances that could impact model performance. The results table presents the percentage of records for each class in the "Exited" target variable, alongside a pass/fail assessment based on a minimum threshold of 10%. The accompanying bar plot visually depicts the proportion of each class, with class 0 and class 1 shown as distinct bars representing their respective frequencies.

Key insights:

  • Both classes exceed minimum threshold: Class 0 constitutes 79.80% and class 1 constitutes 20.20% of the dataset, with both surpassing the 10% minimum threshold.
  • No classes flagged for imbalance: The pass/fail assessment indicates that neither class is under-represented according to the defined threshold.
  • Class distribution is asymmetric: The majority class (0) is nearly four times as prevalent as the minority class (1), as visualized in the bar plot.

The class distribution in the "Exited" variable demonstrates that both classes meet the minimum representation criteria, with no classes identified as imbalanced by the test parameters. While the distribution is asymmetric, the absence of any class below the threshold indicates that the dataset passes the class imbalance assessment under the current configuration.

Parameters:

{
  "min_percent_threshold": 10
}
            

Tables

Exited Class Imbalance

Exited Percentage of Rows (%) Pass/Fail
0 79.80% Pass
1 20.20% Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:raw_data:9943
2026-01-30 23:16:57,417 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:raw_data does not exist in model's document
validmind.data_validation.Duplicates:raw_data

✅ Duplicates Raw Data

The Duplicates test evaluates the presence of duplicate rows within the dataset to assess data quality and mitigate risks associated with redundant information. The results table presents the absolute number and percentage of duplicate rows detected in the dataset, with the test configured to flag results only if the number of duplicates meets or exceeds a minimum threshold of 1. The table indicates both the count and proportion of duplicate entries relative to the total dataset size.

Key insights:

  • No duplicate rows detected: The dataset contains zero duplicate rows, as indicated by a "Number of Duplicates" value of 0.
  • Zero percent duplication rate: The "Percentage of Rows (%)" is 0.0%, confirming the absence of redundant entries in the dataset.

The results demonstrate that the dataset is free from duplicate rows, indicating a high level of data quality with respect to redundancy. The absence of duplicates reduces the risk of model overfitting due to repeated information and supports the reliability of subsequent modeling processes.

Parameters:

{
  "min_threshold": 1
}
            

Tables

Duplicate Rows Results for Dataset

Number of Duplicates Percentage of Rows (%)
0 0.0
2026-01-30 23:17:00,554 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Duplicates:raw_data does not exist in model's document
validmind.data_validation.HighCardinality:raw_data

✅ High Cardinality Raw Data

The High Cardinality test evaluates the number of unique values in categorical columns to identify potential risks of overfitting and data noise. The results table presents the number and percentage of distinct values for each categorical column, along with a pass/fail status based on a threshold of 10% distinct values. Both "Geography" and "Gender" columns are assessed, with their respective distinct value counts and percentages reported.

Key insights:

  • All categorical columns pass cardinality threshold: Both "Geography" (3 distinct values, 0.0375%) and "Gender" (2 distinct values, 0.025%) are well below the 10% threshold, resulting in a "Pass" status for each.
  • Low cardinality observed across features: The number of unique values in both columns is minimal relative to the dataset size, indicating limited risk of overfitting due to high cardinality.

The results indicate that all evaluated categorical columns exhibit low cardinality, with distinct value counts and percentages substantially below the defined threshold. No evidence of high cardinality risk is present in the assessed features.

Parameters:

{
  "num_threshold": 100,
  "percent_threshold": 0.1,
  "threshold_type": "percent"
}
            

Tables

| Column    | Number of Distinct Values | Percentage of Distinct Values (%) | Pass/Fail |
|-----------|---------------------------|-----------------------------------|-----------|
| Geography | 3                         | 0.0375                            | Pass      |
| Gender    | 2                         | 0.0250                            | Pass      |
2026-01-30 23:17:04,012 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighCardinality:raw_data does not exist in model's document
validmind.data_validation.Skewness:raw_data

❌ Skewness Raw Data

The Skewness:raw_data test evaluates the asymmetry of numerical data distributions by calculating skewness for each numeric column and comparing the results to a maximum threshold of 1. The test results table presents skewness values for nine columns, along with pass/fail status based on the threshold. Most columns exhibit skewness values close to zero, while two columns exceed the threshold and are marked as failed.

Key insights:

  • Majority of columns within skewness threshold: Seven out of nine columns have skewness values between -0.89 and 0.72, all passing the test and indicating near-symmetric distributions.
  • Two columns exceed skewness threshold: Age (skewness = 1.0245) and Exited (skewness = 1.4847) both fail the test, reflecting higher levels of distributional asymmetry.
  • Minimal skewness in core financial variables: CreditScore, Balance, and EstimatedSalary all show skewness values near zero, indicating well-balanced distributions in these features.

The results indicate that most numerical features in the dataset exhibit low skewness and pass the defined threshold, supporting data quality for model development. However, Age and Exited display elevated skewness, highlighting notable asymmetry in these distributions relative to the threshold. The overall distributional profile suggests that, aside from these exceptions, the dataset maintains a high degree of symmetry across its numeric variables.

Parameters:

{
  "max_threshold": 1
}
            

Tables

Skewness Results for Dataset

| Column          | Skewness | Pass/Fail |
|-----------------|----------|-----------|
| CreditScore     | -0.0620  | Pass      |
| Age             | 1.0245   | Fail      |
| Tenure          | 0.0077   | Pass      |
| Balance         | -0.1353  | Pass      |
| NumOfProducts   | 0.7172   | Pass      |
| HasCrCard       | -0.8867  | Pass      |
| IsActiveMember  | -0.0796  | Pass      |
| EstimatedSalary | 0.0095   | Pass      |
| Exited          | 1.4847   | Fail      |
2026-01-30 23:17:09,099 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Skewness:raw_data does not exist in model's document
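Because Age and Exited fail the default threshold, a quick independent check of the flagged values, and of whether a simple transformation would reduce the asymmetry, is straightforward with pandas. A sketch, assuming the `raw_df` DataFrame loaded earlier; different skewness estimators can differ slightly from the values in the table above:

```python
import numpy as np

# Spot-check the flagged skewness values directly on the raw DataFrame
print(raw_df["Age"].skew())            # should be close to the 1.02 reported above
print(np.log1p(raw_df["Age"]).skew())  # a log transform typically reduces right skew
```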
validmind.data_validation.UniqueRows:raw_data

❌ Unique Rows Raw Data

The UniqueRows test evaluates the diversity of the dataset by measuring the proportion of unique values in each column and comparing it to a minimum percentage threshold. The results table presents, for each column, the number and percentage of unique values, along with a pass/fail outcome based on whether the percentage exceeds the 1% threshold. Columns such as CreditScore, Balance, and EstimatedSalary show high uniqueness and pass the test, while most categorical and low-cardinality columns do not meet the threshold and fail.

Key insights:

  • High uniqueness in continuous variables: EstimatedSalary (100%), Balance (63.6%), and CreditScore (5.65%) exceed the 1% uniqueness threshold, indicating substantial diversity in these columns.
  • Low uniqueness in categorical variables: Columns such as Geography (0.0375%), Gender (0.025%), HasCrCard (0.025%), IsActiveMember (0.025%), and Exited (0.025%) have very low percentages of unique values and fail the test.
  • Limited diversity in ordinal and discrete features: Age (0.8625%), Tenure (0.1375%), and NumOfProducts (0.05%) also fall below the threshold, reflecting limited row-level uniqueness in these variables.
  • Majority of columns fail uniqueness threshold: Only 3 out of 11 columns pass the test, with the remaining 8 columns failing to meet the minimum uniqueness requirement.

The results indicate that while continuous variables in the dataset exhibit high row-level diversity, most categorical and discrete columns have low uniqueness and do not meet the prescribed threshold. This pattern reflects the inherent characteristics of categorical variables and highlights a concentration of diversity in a subset of features. The overall dataset structure is characterized by a mix of highly unique continuous variables and low-uniqueness categorical or discrete variables.

Parameters:

{
  "min_percent_threshold": 1
}
            

Tables

| Column          | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|-----------------|-------------------------|---------------------------------|-----------|
| CreditScore     | 452                     | 5.6500                          | Pass      |
| Geography       | 3                       | 0.0375                          | Fail      |
| Gender          | 2                       | 0.0250                          | Fail      |
| Age             | 69                      | 0.8625                          | Fail      |
| Tenure          | 11                      | 0.1375                          | Fail      |
| Balance         | 5088                    | 63.6000                         | Pass      |
| NumOfProducts   | 4                       | 0.0500                          | Fail      |
| HasCrCard       | 2                       | 0.0250                          | Fail      |
| IsActiveMember  | 2                       | 0.0250                          | Fail      |
| EstimatedSalary | 8000                    | 100.0000                        | Pass      |
| Exited          | 2                       | 0.0250                          | Fail      |
2026-01-30 23:17:13,633 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:raw_data does not exist in model's document
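The uniqueness figures above are easy to reproduce outside the test harness, which can be a quick sanity check when reviewing developer-supplied data quality evidence. A sketch using the `raw_df` DataFrame from earlier:

```python
# Per-column uniqueness, mirroring the UniqueRows table above
uniqueness = raw_df.nunique().to_frame("Number of Unique Values")
uniqueness["Percentage of Unique Values (%)"] = (
    100 * uniqueness["Number of Unique Values"] / len(raw_df)
)
print(uniqueness.sort_values("Percentage of Unique Values (%)", ascending=False))
```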
validmind.data_validation.TooManyZeroValues:raw_data

❌ Too Many Zero Values Raw Data

The TooManyZeroValues test identifies numerical columns with a proportion of zero values exceeding a defined threshold, set here at 0.03%. The results table summarizes the number and percentage of zero values for each numerical column, along with a pass/fail status based on the threshold. All four evaluated columns—Tenure, Balance, HasCrCard, and IsActiveMember—are reported with their respective zero value counts and fail the test due to exceeding the threshold.

Key insights:

  • All evaluated columns exceed zero value threshold: Each of the four numerical columns has a percentage of zero values significantly above the 0.03% threshold, resulting in a fail status for all.
  • High concentration of zeros in Balance and IsActiveMember: Balance contains 36.4% zero values, and IsActiveMember contains 48.01%, indicating substantial sparsity in these features.
  • Substantial zero values in binary indicator columns: HasCrCard and IsActiveMember, likely representing binary indicators, show 29.74% and 48.01% zero values respectively, reflecting a high proportion of one class.
  • Tenure column also affected: Tenure registers 4.04% zero values, which, while lower than other columns, still exceeds the threshold and fails the test.

All assessed numerical columns display a proportion of zero values well above the defined threshold, with Balance and IsActiveMember exhibiting particularly high sparsity. The presence of substantial zero values across both continuous and likely binary indicator columns is consistently observed, resulting in a fail status for each. This pattern indicates that zero values are a prominent characteristic in the dataset's numerical features.

Parameters:

{
  "max_percent_threshold": 0.03
}
            

Tables

| Variable       | Row Count | Number of Zero Values | Percentage of Zero Values (%) | Pass/Fail |
|----------------|-----------|-----------------------|-------------------------------|-----------|
| Tenure         | 8000      | 323                   | 4.0375                        | Fail      |
| Balance        | 8000      | 2912                  | 36.4000                       | Fail      |
| HasCrCard      | 8000      | 2379                  | 29.7375                       | Fail      |
| IsActiveMember | 8000      | 3841                  | 48.0125                       | Fail      |
2026-01-30 23:17:17,440 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TooManyZeroValues:raw_data does not exist in model's document
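The reported zero-value rates are simple to confirm directly, which helps distinguish genuinely sparse features (Balance) from binary flags where zero merely encodes one class (HasCrCard, IsActiveMember). A sketch using `raw_df`:

```python
# Percentage of zero values per flagged column
cols = ["Tenure", "Balance", "HasCrCard", "IsActiveMember"]
zero_pct = (raw_df[cols] == 0).mean() * 100
print(zero_pct.round(4))  # should line up with the table above
```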
validmind.data_validation.IQROutliersTable:raw_data

IQR Outliers Table Raw Data

The Interquartile Range Outliers Table (IQROutliersTable) test identifies and summarizes outliers in numerical features using the IQR method, with the threshold parameter set to 5 for this analysis. The results table presents the count and summary statistics of outliers for each numerical feature, highlighting the extent and distribution of extreme values in the dataset. In this instance, the results table is empty, indicating the absence of detected outliers across all evaluated numerical features.

Key insights:

  • No outliers detected in any feature: The test did not identify any data points as outliers in any numerical feature using the specified IQR threshold.
  • Uniform distribution within IQR bounds: All numerical feature values fall within the calculated IQR-based outlier boundaries, with no extreme deviations observed.

The absence of detected outliers indicates that all numerical features conform to the expected value ranges defined by the IQR method with the applied threshold. This suggests a high degree of data consistency and no evidence of extreme or anomalous values in the evaluated dataset.

Parameters:

{
  "threshold": 5
}
            

Tables

Summary of Outliers Detected by IQR Method

2026-01-30 23:17:20,435 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.IQROutliersTable:raw_data does not exist in model's document
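The threshold of 5 used here is fairly permissive; the conventional Tukey fence is 1.5 times the IQR. Re-running the test with the tighter multiplier shows whether the "no outliers" conclusion holds under a stricter definition. A sketch, again assuming `vm_raw_dataset` and the library's `run_test` entry point:

```python
# Re-run the IQR outlier scan with the conventional 1.5 * IQR fence
result = vm.tests.run_test(
    "validmind.data_validation.IQROutliersTable",
    inputs={"dataset": vm_raw_dataset},
    params={"threshold": 1.5},
)
result.log()
```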
validmind.data_validation.DescriptiveStatistics:preprocessed_data

Descriptive Statistics Preprocessed Data

The Descriptive Statistics test evaluates the distributional characteristics and summary statistics of both numerical and categorical variables in the preprocessed dataset. The results are presented in two tables: one summarizing key statistics for numerical variables such as mean, standard deviation, and percentiles, and another detailing counts, unique values, and frequency distributions for categorical variables. These tables provide a comprehensive overview of the dataset’s structure, central tendencies, and variability, supporting further analysis of model input data.

Key insights:

  • Wide range and zero-inflation in balance values: The Balance variable exhibits a minimum of 0 and a maximum of 250,898, with a mean of 82,618 below the median (50th percentile) of 103,721, reflecting the substantial proportion of zero balances that pulls the mean down.
  • CreditScore distribution is broad but centered: CreditScore values range from 350 to 850, with a mean of 648.5 and a median of 650, suggesting a relatively symmetric distribution around the center.
  • Categorical dominance in Geography and Gender: The Geography variable is dominated by 'France' (46.04%), and Gender is nearly evenly split, with 'Male' at 50.31%, indicating moderate diversity in categorical distributions.
  • Binary variables are populated in both classes: HasCrCard and IsActiveMember have means of 0.697 and 0.470, respectively, indicating that roughly 70% of customers hold a credit card while active membership is close to an even split.

The dataset displays a mix of symmetric and skewed distributions among numerical variables, with Balance showing pronounced zero-inflation and a significant proportion of zero values. Categorical variables demonstrate moderate diversity, with no single category overwhelmingly dominant. Both classes of each binary variable are well represented, supporting robust modeling without dominance by a single class. Overall, the descriptive statistics indicate a dataset with varied distributions and no extreme concentration in categorical features.

Tables

Numerical Variables

| Name            | Count  | Mean       | Std        | Min   | 25%     | 50%      | 75%      | 90%      | 95%      | Max      |
|-----------------|--------|------------|------------|-------|---------|----------|----------|----------|----------|----------|
| CreditScore     | 3232.0 | 648.5235   | 98.8175    | 350.0 | 581.0   | 650.0    | 717.0    | 778.0    | 815.0    | 850.0    |
| Tenure          | 3232.0 | 4.9421     | 2.9066     | 0.0   | 2.0     | 5.0      | 7.0      | 9.0      | 9.0      | 10.0     |
| Balance         | 3232.0 | 82618.4058 | 61264.1919 | 0.0   | 0.0     | 103721.0 | 129526.0 | 150522.0 | 164057.0 | 250898.0 |
| NumOfProducts   | 3232.0 | 1.5077     | 0.6703     | 1.0   | 1.0     | 1.0      | 2.0      | 2.0      | 3.0      | 4.0      |
| HasCrCard       | 3232.0 | 0.6971     | 0.4596     | 0.0   | 0.0     | 1.0      | 1.0      | 1.0      | 1.0      | 1.0      |
| IsActiveMember  | 3232.0 | 0.4703     | 0.4992     | 0.0   | 0.0     | 0.0      | 1.0      | 1.0      | 1.0      | 1.0      |
| EstimatedSalary | 3232.0 | 99262.6157 | 57316.2405 | 12.0  | 50354.0 | 98904.0  | 147841.0 | 179256.0 | 189412.0 | 199808.0 |

Categorical Variables

| Name      | Count  | Number of Unique Values | Top Value | Top Value Frequency | Top Value Frequency % |
|-----------|--------|-------------------------|-----------|---------------------|-----------------------|
| Geography | 3232.0 | 3.0                     | France    | 1488.0              | 46.04                 |
| Gender    | 3232.0 | 2.0                     | Male      | 1626.0              | 50.31                 |
2026-01-30 23:17:25,898 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:preprocessed_data does not exist in model's document
validmind.data_validation.TabularDescriptionTables:preprocessed_data

Tabular Description Tables Preprocessed Data

The Descriptive Statistics test evaluates the distributional characteristics, completeness, and data types of numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables and two categorical variables, including measures such as mean, minimum, maximum, missingness, and unique value counts. All variables are reported with their respective data types and observed value ranges, providing a comprehensive overview of the dataset’s structure and integrity.

Key insights:

  • No missing values detected: All numerical and categorical variables report 0.0% missing values, indicating complete data coverage across all fields.
  • Consistent data types across variables: Numerical variables are represented as int64 or float64, while categorical variables are of object type, aligning with their respective value formats.
  • Balanced binary and categorical distributions: Binary variables such as HasCrCard, IsActiveMember, and Exited have means near 0.5 or 0.7, and categorical variables Geography and Gender display three and two unique values, respectively, with no missingness.
  • Wide range in numerical variables: Variables such as CreditScore, Balance, and EstimatedSalary exhibit broad value ranges, with CreditScore spanning from 350 to 850 and Balance from 0 to 250,898.09.

The dataset demonstrates high data integrity, with complete observation counts and appropriate data types for all variables. The absence of missing values and the presence of well-defined value ranges across both numerical and categorical fields indicate a robust foundation for subsequent modeling or analysis. The structure and completeness of the data support reliable downstream processing and model development.

Tables

| Numerical Variable | Num of Obs | Mean       | Min    | Max       | Missing Values (%) | Data Type |
|--------------------|------------|------------|--------|-----------|--------------------|-----------|
| CreditScore        | 3232       | 648.5235   | 350.00 | 850.00    | 0.0                | int64     |
| Tenure             | 3232       | 4.9421     | 0.00   | 10.00     | 0.0                | int64     |
| Balance            | 3232       | 82618.4058 | 0.00   | 250898.09 | 0.0                | float64   |
| NumOfProducts      | 3232       | 1.5077     | 1.00   | 4.00      | 0.0                | int64     |
| HasCrCard          | 3232       | 0.6971     | 0.00   | 1.00      | 0.0                | int64     |
| IsActiveMember     | 3232       | 0.4703     | 0.00   | 1.00      | 0.0                | int64     |
| EstimatedSalary    | 3232       | 99262.6157 | 11.58  | 199808.10 | 0.0                | float64   |
| Exited             | 3232       | 0.5000     | 0.00   | 1.00      | 0.0                | int64     |

| Categorical Variable | Num of Obs | Num of Unique Values | Unique Values                | Missing Values (%) | Data Type |
|----------------------|------------|----------------------|------------------------------|--------------------|-----------|
| Geography            | 3232.0     | 3.0                  | ['France' 'Spain' 'Germany'] | 0.0                | object    |
| Gender               | 3232.0     | 2.0                  | ['Female' 'Male']            | 0.0                | object    |
2026-01-30 23:17:30,064 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:preprocessed_data does not exist in model's document
validmind.data_validation.MissingValues:preprocessed_data

✅ Missing Values Preprocessed Data

The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the threshold. In this test, the threshold was set to 1, and all features were assessed for missing data presence.

Key insights:

  • No missing values detected: All features, including CreditScore, Geography, Gender, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited, have zero missing values.
  • Universal pass across features: Every column meets the missing value threshold criterion, with all Pass/Fail statuses marked as "Pass."
  • Consistent data completeness: The percentage of missing values is 0.0% for all features, indicating uniform data integrity across the dataset.

The results demonstrate complete data integrity with no missing values present in any feature. All columns satisfy the missing value threshold, supporting a high standard of dataset quality for subsequent modeling or analysis.

Parameters:

{
  "min_threshold": 1
}
            

Tables

| Column          | Number of Missing Values | Percentage of Missing Values (%) | Pass/Fail |
|-----------------|--------------------------|----------------------------------|-----------|
| CreditScore     | 0                        | 0.0                              | Pass      |
| Geography       | 0                        | 0.0                              | Pass      |
| Gender          | 0                        | 0.0                              | Pass      |
| Tenure          | 0                        | 0.0                              | Pass      |
| Balance         | 0                        | 0.0                              | Pass      |
| NumOfProducts   | 0                        | 0.0                              | Pass      |
| HasCrCard       | 0                        | 0.0                              | Pass      |
| IsActiveMember  | 0                        | 0.0                              | Pass      |
| EstimatedSalary | 0                        | 0.0                              | Pass      |
| Exited          | 0                        | 0.0                              | Pass      |
2026-01-30 23:17:32,995 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:preprocessed_data does not exist in model's document
validmind.data_validation.TabularNumericalHistograms:preprocessed_data

Tabular Numerical Histograms Preprocessed Data

The TabularNumericalHistograms:preprocessed_data test provides a visual summary of the distribution of each numerical feature in the dataset using histograms. The resulting plots display the frequency distribution for each variable, enabling identification of distributional characteristics, skewness, and potential outliers. These visualizations facilitate an understanding of the underlying data structure prior to model development or further analysis.

Key insights:

  • CreditScore distribution is unimodal and right-skewed: The CreditScore histogram shows a single peak between 600 and 700, with a longer tail extending toward higher values, indicating right skewness and a concentration of scores in the mid-to-high range.
  • Tenure is uniformly distributed: The Tenure variable displays a nearly flat distribution across its range, with similar frequencies for each tenure value from 1 to 10, except for slightly lower counts at the endpoints.
  • Balance exhibits a strong zero-inflation: The Balance histogram reveals a pronounced spike at zero, indicating a substantial proportion of accounts with no balance, while the remainder of the distribution is approximately bell-shaped and centered around 120,000.
  • NumOfProducts is highly concentrated at lower values: The majority of observations are at 1 or 2 products, with very few customers having 3 or 4 products, indicating a strong right (positive) skew.
  • HasCrCard and IsActiveMember are binary and imbalanced: Both variables are binary, with HasCrCard showing a higher frequency for the value 1, and IsActiveMember displaying a moderate imbalance between the two categories.
  • EstimatedSalary is approximately uniform: The EstimatedSalary histogram is relatively flat across its range, suggesting a uniform distribution of salary values in the dataset.

The histograms indicate a range of distributional patterns across the numerical features, including right skewness, zero-inflation, and categorical imbalances. These characteristics highlight the presence of non-normality and concentration effects in several variables, which may influence model behavior and warrant consideration in subsequent modeling steps. The visualizations provide a clear overview of input data structure, supporting further analysis and feature engineering.

Figures

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:9e7f
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:b11c
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:f686
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:f88b
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:9e58
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:d90c
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:37f4
2026-01-30 23:17:48,296 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:preprocessed_data does not exist in model's document
validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data

Tabular Categorical Bar Plots Preprocessed Data

The TabularCategoricalBarPlots test evaluates the distribution of categorical variables by generating bar plots for each category within the dataset. The resulting plots display the frequency counts for each category in the 'Geography' and 'Gender' features, providing a visual summary of the dataset's categorical composition. These visualizations facilitate the identification of category balance and potential representation issues within the data.

Key insights:

  • Balanced gender distribution: The 'Gender' feature shows nearly equal representation between 'Male' and 'Female' categories, with both categories having similar counts.
  • Geography category imbalance: The 'Geography' feature displays a notable imbalance, with 'France' having the highest count, followed by 'Germany', and 'Spain' having the lowest representation among the three categories.

The categorical composition of the dataset reveals a well-balanced gender distribution, minimizing the risk of gender-based underrepresentation. However, the 'Geography' feature exhibits a pronounced imbalance, with 'France' being the most frequent category and 'Spain' the least. This distribution may influence model behavior, particularly in scenarios where geographic representation is relevant to model outcomes.

Figures

ValidMind Figure validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data:644d
ValidMind Figure validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data:174c
2026-01-30 23:17:57,901 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data does not exist in model's document
validmind.data_validation.TargetRateBarPlots:preprocessed_data

Target Rate Bar Plots Preprocessed Data

The Target Rate Bar Plots test visualizes the distribution and target rates of categorical features to provide insight into model decision patterns. The results are presented as paired bar plots for each categorical variable, with the left plot showing the frequency of each category and the right plot displaying the mean target rate (proportion of positive class) for each category. The features analyzed include Geography and Gender, with each category’s count and corresponding target rate depicted side by side.

Key insights:

  • Geography exhibits target rate variation: The target rate for Germany is notably higher than for France and Spain, with Germany exceeding 0.6 while France and Spain are both below 0.45.
  • Balanced category representation in Gender: Male and Female categories have nearly identical counts, each around 1600, indicating balanced representation in the dataset.
  • Gender target rates differ: The target rate for Female is higher than for Male, with Female above 0.5 and Male below 0.45.
  • Geography category counts are uneven: France has the highest count (close to 1,500), followed by Germany (just above 1,000), and Spain (below 800), indicating an imbalance in category frequencies.

The results reveal distinct differences in target rates across both Geography and Gender categories, with Germany and Female categories showing higher proportions of positive class outcomes. Category frequencies are balanced for Gender but show notable disparities for Geography. These patterns provide a clear, visual summary of how categorical features relate to model outcomes and highlight areas where model behavior varies across groups.

Figures

ValidMind Figure validmind.data_validation.TargetRateBarPlots:preprocessed_data:df83
ValidMind Figure validmind.data_validation.TargetRateBarPlots:preprocessed_data:6cf4
2026-01-30 23:18:09,266 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TargetRateBarPlots:preprocessed_data does not exist in model's document
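Plot readings such as the elevated churn rate for Germany can be confirmed with a one-line aggregation on the underlying data. A sketch using the `balanced_raw_df` DataFrame created earlier in this notebook as a stand-in for the preprocessed sample behind these plots:

```python
# Mean churn rate and counts per category, mirroring the target rate bar plots
print(balanced_raw_df.groupby("Geography")["Exited"].agg(["count", "mean"]))
print(balanced_raw_df.groupby("Gender")["Exited"].agg(["count", "mean"]))
```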
validmind.data_validation.DescriptiveStatistics:development_data

Descriptive Statistics Development Data

The Descriptive Statistics test evaluates the distributional characteristics of numerical variables in both the training and test datasets. The results present summary statistics including count, mean, standard deviation, minimum, percentiles, and maximum for each variable. These statistics provide a comprehensive overview of the central tendency, dispersion, and range for each feature, enabling assessment of data quality and potential risk factors such as skewness or outliers.

Key insights:

  • Consistent central tendencies across datasets: Means and medians (50th percentiles) for all variables are closely aligned between the training and test datasets, indicating stable distributions.
  • Low to moderate standard deviations relative to means: Standard deviations for variables such as CreditScore, Tenure, NumOfProducts, HasCrCard, and IsActiveMember are moderate, while Balance and EstimatedSalary exhibit higher absolute variability, reflecting broader value ranges.
  • No evidence of extreme outliers: Maximum and minimum values for all variables fall within plausible ranges, with no extreme deviations observed in either dataset.
  • Binary and categorical variables show expected distributions: HasCrCard and IsActiveMember display means and percentiles consistent with binary encoding, with no missing or anomalous values.
  • Balance variable exhibits a high proportion of zero values: The 25th percentile for Balance is 0.0 in both datasets, indicating a substantial subset of records with zero balance.

The descriptive statistics indicate that the training and test datasets are well-aligned in terms of central tendency and dispersion for all monitored variables. No significant outliers or distributional anomalies are present, and binary/categorical variables maintain expected value ranges. The presence of a substantial proportion of zero values in the Balance variable is notable and may influence downstream modeling or interpretation. Overall, the data exhibits stable and well-controlled distributional properties across both datasets.

Tables

| dataset             | Name            | Count  | Mean       | Std        | Min   | 25%     | 50%      | 75%      | 90%      | 95%      | Max      |
|---------------------|-----------------|--------|------------|------------|-------|---------|----------|----------|----------|----------|----------|
| train_dataset_final | CreditScore     | 2585.0 | 648.7253   | 98.9165    | 350.0 | 581.0   | 650.0    | 717.0    | 779.0    | 817.0    | 850.0    |
| train_dataset_final | Tenure          | 2585.0 | 4.9408     | 2.9040     | 0.0   | 2.0     | 5.0      | 7.0      | 9.0      | 9.0      | 10.0     |
| train_dataset_final | Balance         | 2585.0 | 82355.5406 | 61065.6729 | 0.0   | 0.0     | 103515.0 | 129336.0 | 149402.0 | 163472.0 | 250898.0 |
| train_dataset_final | NumOfProducts   | 2585.0 | 1.5072     | 0.6683     | 1.0   | 1.0     | 1.0      | 2.0      | 2.0      | 3.0      | 4.0      |
| train_dataset_final | HasCrCard       | 2585.0 | 0.6905     | 0.4624     | 0.0   | 0.0     | 1.0      | 1.0      | 1.0      | 1.0      | 1.0      |
| train_dataset_final | IsActiveMember  | 2585.0 | 0.4708     | 0.4992     | 0.0   | 0.0     | 0.0      | 1.0      | 1.0      | 1.0      | 1.0      |
| train_dataset_final | EstimatedSalary | 2585.0 | 99766.9098 | 57379.6432 | 12.0  | 51309.0 | 99450.0  | 149044.0 | 179453.0 | 189783.0 | 199808.0 |
| test_dataset_final  | CreditScore     | 647.0  | 647.7172   | 98.4932    | 350.0 | 582.0   | 651.0    | 718.0    | 771.0    | 811.0    | 850.0    |
| test_dataset_final  | Tenure          | 647.0  | 4.9474     | 2.9190     | 0.0   | 2.0     | 5.0      | 8.0      | 9.0      | 9.0      | 10.0     |
| test_dataset_final  | Balance         | 647.0  | 83668.6477 | 62087.5936 | 0.0   | 0.0     | 105000.0 | 130371.0 | 152808.0 | 168150.0 | 211774.0 |
| test_dataset_final  | NumOfProducts   | 647.0  | 1.5100     | 0.6788     | 1.0   | 1.0     | 1.0      | 2.0      | 2.0      | 3.0      | 4.0      |
| test_dataset_final  | HasCrCard       | 647.0  | 0.7233     | 0.4477     | 0.0   | 0.0     | 1.0      | 1.0      | 1.0      | 1.0      | 1.0      |
| test_dataset_final  | IsActiveMember  | 647.0  | 0.4683     | 0.4994     | 0.0   | 0.0     | 0.0      | 1.0      | 1.0      | 1.0      | 1.0      |
| test_dataset_final  | EstimatedSalary | 647.0  | 97247.7776 | 57061.9413 | 288.0 | 47487.0 | 97734.0  | 142972.0 | 177109.0 | 188052.0 | 199293.0 |
2026-01-30 23:18:14,230 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:development_data does not exist in model's document
validmind.data_validation.TabularDescriptionTables:development_data

Tabular Description Tables Development Data

The Descriptive Statistics test evaluates the distributional characteristics and completeness of numerical variables in the train and test datasets. The results present summary statistics, including mean, minimum, maximum, and missingness percentage, for each numerical variable across both datasets. All variables are reported with their respective data types and observation counts, providing a comprehensive overview of the dataset structure and integrity.

Key insights:

  • No missing values across all variables: All numerical variables in both train and test datasets report 0.0% missing values, indicating complete data coverage for these fields.
  • Consistent variable ranges between datasets: Minimum and maximum values for variables such as CreditScore, Tenure, NumOfProducts, HasCrCard, IsActiveMember, and Exited are identical across train and test datasets, supporting dataset alignment.
  • Stable means for core features: Mean values for CreditScore, Tenure, NumOfProducts, HasCrCard, IsActiveMember, and Exited are closely matched between train and test datasets, with relative differences of less than roughly 5% of the training means.
  • Slight variation in financial variables: Balance and EstimatedSalary show minor differences in mean values between train and test datasets, with Balance means differing by approximately 1.6% and EstimatedSalary by approximately 2.5%.

The descriptive statistics indicate high data completeness and strong consistency between train and test datasets for all numerical variables. Variable distributions, as reflected by means and ranges, are closely aligned, with only minor differences observed in financial variables. The absence of missing values and consistent data types across datasets support robust data integrity for subsequent modeling activities.

Tables

| dataset             | Numerical Variable | Num of Obs | Mean       | Min    | Max       | Missing Values (%) | Data Type |
|---------------------|--------------------|------------|------------|--------|-----------|--------------------|-----------|
| train_dataset_final | CreditScore        | 2585       | 648.7253   | 350.00 | 850.00    | 0.0                | int64     |
| train_dataset_final | Tenure             | 2585       | 4.9408     | 0.00   | 10.00     | 0.0                | int64     |
| train_dataset_final | Balance            | 2585       | 82355.5406 | 0.00   | 250898.09 | 0.0                | float64   |
| train_dataset_final | NumOfProducts      | 2585       | 1.5072     | 1.00   | 4.00      | 0.0                | int64     |
| train_dataset_final | HasCrCard          | 2585       | 0.6905     | 0.00   | 1.00      | 0.0                | int64     |
| train_dataset_final | IsActiveMember     | 2585       | 0.4708     | 0.00   | 1.00      | 0.0                | int64     |
| train_dataset_final | EstimatedSalary    | 2585       | 99766.9098 | 11.58  | 199808.10 | 0.0                | float64   |
| train_dataset_final | Exited             | 2585       | 0.5017     | 0.00   | 1.00      | 0.0                | int64     |
| test_dataset_final  | CreditScore        | 647        | 647.7172   | 350.00 | 850.00    | 0.0                | int64     |
| test_dataset_final  | Tenure             | 647        | 4.9474     | 0.00   | 10.00     | 0.0                | int64     |
| test_dataset_final  | Balance            | 647        | 83668.6477 | 0.00   | 211774.31 | 0.0                | float64   |
| test_dataset_final  | NumOfProducts      | 647        | 1.5100     | 1.00   | 4.00      | 0.0                | int64     |
| test_dataset_final  | HasCrCard          | 647        | 0.7233     | 0.00   | 1.00      | 0.0                | int64     |
| test_dataset_final  | IsActiveMember     | 647        | 0.4683     | 0.00   | 1.00      | 0.0                | int64     |
| test_dataset_final  | EstimatedSalary    | 647        | 97247.7776 | 287.99 | 199293.01 | 0.0                | float64   |
| test_dataset_final  | Exited             | 647        | 0.4930     | 0.00   | 1.00      | 0.0                | int64     |
2026-01-30 23:18:18,578 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:development_data does not exist in model's document
validmind.data_validation.ClassImbalance:development_data

✅ Class Imbalance Development Data

The Class Imbalance test evaluates the distribution of target classes within the training and test datasets to identify potential imbalances that could impact model performance. The results present the percentage representation of each class in both datasets, alongside a pass/fail assessment based on a minimum threshold of 10%. Bar plots visualize the class proportions for both the training and test datasets, facilitating interpretation of class distribution.

Key insights:

  • Balanced class distribution in training data: Both classes in the training dataset are nearly equally represented, with 50.17% for class 1 and 49.83% for class 0, each passing the 10% minimum threshold.
  • Balanced class distribution in test data: The test dataset also demonstrates near-equal representation, with 50.70% for class 0 and 49.30% for class 1, both exceeding the threshold.
  • No classes flagged for imbalance: All classes in both datasets pass the class imbalance test, indicating no under-represented classes according to the defined threshold.
  • Visual confirmation of balance: Bar plots for both datasets show visually similar heights for each class, supporting the tabular findings of balanced class proportions.

The results indicate that both the training and test datasets exhibit balanced class distributions, with all classes substantially exceeding the minimum percentage threshold. No evidence of class imbalance is observed, and the visualizations corroborate the quantitative findings. This distribution supports unbiased model training and evaluation with respect to class representation.

Parameters:

{
  "min_percent_threshold": 10
}
            

Tables

| dataset             | Exited | Percentage of Rows (%) | Pass/Fail |
|---------------------|--------|------------------------|-----------|
| train_dataset_final | 1      | 50.17%                 | Pass      |
| train_dataset_final | 0      | 49.83%                 | Pass      |
| test_dataset_final  | 0      | 50.70%                 | Pass      |
| test_dataset_final  | 1      | 49.30%                 | Pass      |

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:development_data:1e57
ValidMind Figure validmind.data_validation.ClassImbalance:development_data:b967
2026-01-30 23:18:27,740 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:development_data does not exist in model's document
validmind.data_validation.UniqueRows:development_data

❌ Unique Rows Development Data

The UniqueRows test evaluates the diversity of the dataset by measuring the proportion of unique values in each column and comparing it to a minimum percentage threshold. The results table presents, for both the training and test datasets, the number and percentage of unique values per column, along with a pass/fail outcome based on whether the percentage exceeds the 1% threshold. Columns with a percentage of unique values below this threshold are marked as "Fail," while those above are marked as "Pass."

Key insights:

  • High uniqueness in continuous variables: Columns such as EstimatedSalary and Balance in both datasets, as well as CreditScore, exhibit high percentages of unique values (ranging from 16.7% to 100%), resulting in a "Pass" outcome.
  • Low uniqueness in categorical variables: Columns representing categorical or binary features (e.g., HasCrCard, IsActiveMember, Geography_Germany, Geography_Spain, Gender_Male, Exited) consistently show very low percentages of unique values (0.0774% to 0.3091%) and fail the uniqueness threshold.
  • Mixed results for Tenure: The Tenure column fails the threshold in the training dataset (0.4255%) but passes in the test dataset (1.7002%), indicating a higher diversity in the test sample.
  • NumOfProducts below threshold in both datasets: The NumOfProducts column remains below the uniqueness threshold in both training (0.1547%) and test (0.6182%) datasets, resulting in a "Fail" outcome.

The results indicate that continuous variables in both datasets demonstrate substantial diversity, consistently exceeding the minimum uniqueness threshold. In contrast, categorical and binary variables, as well as NumOfProducts, exhibit low uniqueness percentages and do not meet the threshold, reflecting their limited set of possible values. The Tenure column shows increased diversity in the test dataset compared to the training dataset. Overall, the dataset contains a mix of highly unique continuous features and low-uniqueness categorical features, with the latter group not passing the uniqueness criterion as defined by the test parameters.

Parameters:

{
  "min_percent_threshold": 1
}
            

Tables

| dataset             | Column            | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---------------------|-------------------|-------------------------|---------------------------------|-----------|
| train_dataset_final | CreditScore       | 432                     | 16.7118                         | Pass      |
| train_dataset_final | Tenure            | 11                      | 0.4255                          | Fail      |
| train_dataset_final | Balance           | 1775                    | 68.6654                         | Pass      |
| train_dataset_final | NumOfProducts     | 4                       | 0.1547                          | Fail      |
| train_dataset_final | HasCrCard         | 2                       | 0.0774                          | Fail      |
| train_dataset_final | IsActiveMember    | 2                       | 0.0774                          | Fail      |
| train_dataset_final | EstimatedSalary   | 2585                    | 100.0000                        | Pass      |
| train_dataset_final | Geography_Germany | 2                       | 0.0774                          | Fail      |
| train_dataset_final | Geography_Spain   | 2                       | 0.0774                          | Fail      |
| train_dataset_final | Gender_Male       | 2                       | 0.0774                          | Fail      |
| train_dataset_final | Exited            | 2                       | 0.0774                          | Fail      |
| test_dataset_final  | CreditScore       | 311                     | 48.0680                         | Pass      |
| test_dataset_final  | Tenure            | 11                      | 1.7002                          | Pass      |
| test_dataset_final  | Balance           | 443                     | 68.4699                         | Pass      |
| test_dataset_final  | NumOfProducts     | 4                       | 0.6182                          | Fail      |
| test_dataset_final  | HasCrCard         | 2                       | 0.3091                          | Fail      |
| test_dataset_final  | IsActiveMember    | 2                       | 0.3091                          | Fail      |
| test_dataset_final  | EstimatedSalary   | 647                     | 100.0000                        | Pass      |
| test_dataset_final  | Geography_Germany | 2                       | 0.3091                          | Fail      |
| test_dataset_final  | Geography_Spain   | 2                       | 0.3091                          | Fail      |
| test_dataset_final  | Gender_Male       | 2                       | 0.3091                          | Fail      |
| test_dataset_final  | Exited            | 2                       | 0.3091                          | Fail      |
2026-01-30 23:18:38,460 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:development_data does not exist in model's document
validmind.data_validation.TabularNumericalHistograms:development_data

Tabular Numerical Histograms Development Data

The TabularNumericalHistograms test provides a visual summary of the distribution of each numerical feature in the dataset, supporting the identification of distributional characteristics, skewness, and outliers. The results include histograms for both the training and test datasets, covering features such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and several binary-encoded categorical variables. Each histogram displays the frequency distribution of values, enabling assessment of central tendency, spread, and the presence of any unusual patterns or concentrations.

Key insights:

  • CreditScore is unimodal with mild right skew: Both training and test datasets show a unimodal CreditScore distribution, peaking between 600 and 700, with a gradual decline toward higher values and a small right tail.
  • Balance exhibits a strong zero-inflation: A substantial proportion of records have a zero balance, with the remainder forming a bell-shaped distribution centered around 120,000–130,000.
  • NumOfProducts is highly concentrated at lower values: The majority of records have one or two products, with very few instances at three or four products.
  • Binary features show class imbalance: HasCrCard and IsActiveMember are both skewed, with HasCrCard predominantly true and IsActiveMember showing a moderate split favoring false in the training set.
  • EstimatedSalary is uniformly distributed: The EstimatedSalary feature displays a near-uniform distribution across its range in both datasets, with no pronounced peaks or gaps.
  • Tenure is evenly distributed: Tenure values are distributed relatively evenly across their range, with slight dips at the endpoints.
  • Categorical encodings reflect population splits: Geography and Gender binary encodings show clear splits, with some categories (e.g., Geography_Spain=false) being more prevalent.

The histograms reveal that most numerical features are either unimodal or uniformly distributed, with some features (notably Balance and NumOfProducts) exhibiting strong concentration at specific values. The presence of zero-inflation in Balance and class imbalance in certain binary features are notable characteristics. Overall, the distributions are stable between training and test datasets, with no evidence of extreme outliers or abrupt distributional shifts. These patterns provide a clear view of the input data structure and support further analysis of model input integrity.

Figures

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:c418
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:723d
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:542f
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:1c28
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:b83c
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:4ab3
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:985d
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:5f45
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:4894
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:18ae
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:6cef
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:4215
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:1764
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:2780
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:de8d
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:792e
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:77be
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:a341
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:9ea9
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:48a6
2026-01-30 23:18:56,504 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:development_data does not exist in model's document
validmind.data_validation.MutualInformation:development_data

Mutual Information Development Data

The Mutual Information test evaluates the statistical dependency between each feature and the target variable to quantify feature relevance for model training. The results are presented as normalized mutual information scores (ranging from 0 to 1) for both the development and test datasets, with a threshold of 0.01 used to highlight features with minimal information content. Bar plots display the relative importance of each feature, with features above the threshold shown in blue and those below in red, enabling visual assessment of feature relevance and consistency across datasets.

Key insights:

  • NumOfProducts consistently highest information score: NumOfProducts exhibits the highest mutual information score in both development (≈0.095) and test (≈0.105) datasets, indicating strong statistical dependency with the target.
  • Limited number of features above threshold: Only a subset of features surpass the 0.01 threshold in both datasets, with most features registering low or near-zero mutual information scores.
  • Variation in feature ranking between datasets: Geography_Germany and IsActiveMember are above threshold in the development dataset but show reduced scores in the test dataset, while HasCrCard is above threshold only in the test dataset.
  • Several features consistently low or zero: CreditScore, Tenure, and Geography_Spain display mutual information scores at or near zero in both datasets, indicating minimal relevance to the target variable.

The mutual information analysis reveals that only a small number of features demonstrate substantial statistical dependency with the target variable, with NumOfProducts consistently providing the highest information content across both datasets. Most features exhibit low or negligible mutual information scores, and several features remain below the relevance threshold in both development and test samples. The observed variation in feature rankings between datasets highlights potential sensitivity to sample composition, while the consistently low scores for certain features indicate limited predictive value within the current modeling context.

Parameters:

{
  "min_threshold": 0.01
}
            

Figures

ValidMind Figure validmind.data_validation.MutualInformation:development_data:8f38
ValidMind Figure validmind.data_validation.MutualInformation:development_data:9869
2026-01-30 23:19:15,372 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MutualInformation:development_data does not exist in model's document
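Mutual information can also be cross-checked with scikit-learn directly. The sketch below uses `train_df` as a hypothetical stand-in for the DataFrame behind `train_dataset_final` (the variable name is not shown in this section); note that `mutual_info_classif` returns unnormalized scores, so compare rankings rather than the 0 to 1 values in the plots above:

```python
from sklearn.feature_selection import mutual_info_classif

# `train_df` is an assumed name for the encoded training DataFrame
X = train_df.drop(columns=["Exited"])
y = train_df["Exited"]

scores = mutual_info_classif(X, y, random_state=42)
for name, score in sorted(zip(X.columns, scores), key=lambda item: -item[1]):
    print(f"{name:20s} {score:.4f}")
```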
validmind.data_validation.PearsonCorrelationMatrix:development_data

Pearson Correlation Matrix Development Data

The Pearson Correlation Matrix test evaluates the linear relationships between all pairs of numerical variables in the dataset, providing insight into potential redundancy and multicollinearity. The results are presented as heatmaps for both the development (train) and test datasets, with correlation coefficients ranging from -1 to 1, and high correlations (|r| > 0.7) highlighted. The matrices display the strength and direction of pairwise correlations among variables such as CreditScore, Balance, NumOfProducts, and categorical encodings.

Key insights:

  • No high correlations detected: All pairwise correlation coefficients in both development and test datasets are below the 0.7 threshold, indicating an absence of strong linear dependencies among variables.
  • Moderate correlation between Balance and Geography_Germany: The highest observed correlation is 0.43 between Balance and Geography_Germany in the development dataset, and 0.39 in the test dataset, both below the high-correlation threshold.
  • Consistent correlation structure across datasets: The correlation patterns and magnitudes are similar between the development and test datasets, indicating stability in variable relationships.

The correlation analysis reveals that the dataset does not exhibit strong linear dependencies among its numerical variables, with all pairwise correlations remaining well below the high-correlation threshold. The observed correlation structure is stable across both development and test datasets, supporting the independence of input features and reducing concerns regarding redundancy or multicollinearity.

Figures

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:development_data:66ca
ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:development_data:c3a4
2026-01-30 23:19:28,205 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.PearsonCorrelationMatrix:development_data does not exist in model's document
validmind.data_validation.HighPearsonCorrelation:development_data

❌ High Pearson Correlation Development Data

The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, with the aim of detecting potential feature redundancy or multicollinearity. The results present the top ten absolute Pearson correlation coefficients for both the training and test datasets, along with a Pass or Fail status based on a threshold of 0.3. Correlation coefficients and their corresponding feature pairs are listed, highlighting those that exceed the threshold.

Key insights:

  • Multiple feature pairs exceed correlation threshold: In both the training and test datasets, the pairs (Balance, Geography_Germany) and (Geography_Germany, Geography_Spain) display absolute correlation coefficients above the 0.3 threshold, resulting in a Fail status for these pairs.
  • Consistent correlation structure across datasets: The same feature pairs—(Balance, Geography_Germany) and (Geography_Germany, Geography_Spain)—exhibit high correlations in both the training and test datasets, with coefficients ranging from 0.362 to 0.428.
  • All other feature pairs below threshold: The remaining top correlations in both datasets have absolute values below 0.3, resulting in a Pass status and indicating lower risk of linear redundancy among these pairs.

The results indicate that a limited number of feature pairs demonstrate moderate linear relationships exceeding the defined threshold, specifically involving Balance and Geography_Germany, as well as Geography_Germany and Geography_Spain. This pattern is consistent across both training and test datasets, while all other examined feature pairs remain below the threshold, suggesting that the majority of features do not exhibit strong linear dependencies.

Parameters:

{
  "max_threshold": 0.3,
  "top_n_correlations": 10
}
            

Tables

| dataset             | Columns                               | Coefficient | Pass/Fail |
|---------------------|---------------------------------------|-------------|-----------|
| train_dataset_final | (Balance, Geography_Germany)          | 0.4280      | Fail      |
| train_dataset_final | (Geography_Germany, Geography_Spain)  | -0.3620     | Fail      |
| train_dataset_final | (IsActiveMember, Exited)              | -0.2102     | Pass      |
| train_dataset_final | (Balance, NumOfProducts)              | -0.1889     | Pass      |
| train_dataset_final | (Geography_Germany, Exited)           | 0.1863      | Pass      |
| train_dataset_final | (Balance, Geography_Spain)            | -0.1763     | Pass      |
| train_dataset_final | (Balance, Exited)                     | 0.1420      | Pass      |
| train_dataset_final | (Gender_Male, Exited)                 | -0.1157     | Pass      |
| train_dataset_final | (NumOfProducts, IsActiveMember)       | 0.0543      | Pass      |
| train_dataset_final | (NumOfProducts, Exited)               | -0.0530     | Pass      |
| test_dataset_final  | (Balance, Geography_Germany)          | 0.3908      | Fail      |
| test_dataset_final  | (Geography_Germany, Geography_Spain)  | -0.3765     | Fail      |
| test_dataset_final  | (Geography_Germany, Exited)           | 0.2281      | Pass      |
| test_dataset_final  | (IsActiveMember, Exited)              | -0.1697     | Pass      |
| test_dataset_final  | (Balance, NumOfProducts)              | -0.1566     | Pass      |
| test_dataset_final  | (Balance, Geography_Spain)            | -0.1550     | Pass      |
| test_dataset_final  | (Balance, Exited)                     | 0.1206      | Pass      |
| test_dataset_final  | (Gender_Male, Exited)                 | -0.1125     | Pass      |
| test_dataset_final  | (Geography_Spain, Exited)             | -0.0923     | Pass      |
| test_dataset_final  | (Balance, Gender_Male)                | 0.0871      | Pass      |
2026-01-30 23:19:35,879 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:development_data does not exist in model's document
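The flagged pairs can be reproduced with a plain pandas correlation matrix, which is also a convenient way to confirm that no other pair sits just under the 0.3 threshold. A sketch, again using `train_df` as an assumed name for the training DataFrame:

```python
import numpy as np

corr = train_df.corr(numeric_only=True)
# Keep the strict upper triangle so each pair appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.abs().stack().sort_values(ascending=False).head(10))
```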
validmind.model_validation.ModelMetadata

Model Metadata

The ModelMetadata test compares the metadata of different models to assess consistency in architecture, framework, framework version, and programming language. The summary table presents side-by-side metadata for each model, including modeling technique, framework, version, and programming language. Both models, log_model_champion and rf_model, are included in the comparison, with all relevant metadata fields populated.

Key insights:

  • Consistent modeling technique: Both models use the SKlearnModel technique, indicating uniformity in modeling approach.
  • Identical framework and version: Both models are built using the sklearn framework, version 1.8.0, ensuring compatibility in deployment environments.
  • Uniform programming language: Python is the programming language for both models, supporting integration and maintenance consistency.

The metadata comparison reveals complete alignment across all key fields for the two models evaluated. No inconsistencies or missing metadata are present, and the use of the same framework, version, and programming language indicates a standardized modeling environment.

Tables

| model              | Modeling Technique | Modeling Framework | Framework Version | Programming Language |
|--------------------|--------------------|--------------------|-------------------|----------------------|
| log_model_champion | SKlearnModel       | sklearn            | 1.8.0             | Python               |
| rf_model           | SKlearnModel       | sklearn            | 1.8.0             | Python               |
2026-01-30 23:19:39,739 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.ModelMetadata does not exist in model's document
validmind.model_validation.sklearn.ModelParameters

Model Parameters

The Model Parameters test extracts and displays the configuration parameters for each model to support transparency and reproducibility. The results present a structured table listing parameter names and their corresponding values for two models: a logistic regression model (log_model_champion) and a random forest model (rf_model). Each parameter is shown alongside its assigned value, providing a comprehensive view of the model setup at the time of testing.

Key insights:

  • Distinct parameterization for each model: The logistic regression model uses L1 penalty with the liblinear solver, while the random forest model is configured with 50 estimators, Gini criterion, and sqrt for max_features.
  • Explicit control of regularization and complexity: The logistic regression model sets C to 1 and penalty to L1, indicating explicit regularization choices. The random forest model specifies min_samples_leaf as 1 and min_samples_split as 2, reflecting default complexity controls.
  • Reproducibility supported by fixed random state: The random forest model includes a fixed random_state value of 42, supporting reproducibility of results.
  • No extreme or missing parameter values observed: All parameters are explicitly set or defaulted to standard values, with no indications of extreme or omitted settings.

The extracted parameter set provides a transparent and reproducible record of model configuration for both the logistic regression and random forest models. All critical parameters are explicitly captured, with no evidence of missing or extreme values. The configuration supports systematic auditing and facilitates consistent model retraining or validation.

Tables

| model              | Parameter                | Value     |
|--------------------|--------------------------|-----------|
| log_model_champion | C                        | 1         |
| log_model_champion | dual                     | False     |
| log_model_champion | fit_intercept            | True      |
| log_model_champion | intercept_scaling        | 1         |
| log_model_champion | max_iter                 | 100       |
| log_model_champion | penalty                  | l1        |
| log_model_champion | solver                   | liblinear |
| log_model_champion | tol                      | 0.0001    |
| log_model_champion | verbose                  | 0         |
| log_model_champion | warm_start               | False     |
| rf_model           | bootstrap                | True      |
| rf_model           | ccp_alpha                | 0.0       |
| rf_model           | criterion                | gini      |
| rf_model           | max_features             | sqrt      |
| rf_model           | min_impurity_decrease    | 0.0       |
| rf_model           | min_samples_leaf         | 1         |
| rf_model           | min_samples_split        | 2         |
| rf_model           | min_weight_fraction_leaf | 0.0       |
| rf_model           | n_estimators             | 50        |
| rf_model           | oob_score                | False     |
| rf_model           | random_state             | 42        |
| rf_model           | verbose                  | 0         |
| rf_model           | warm_start               | False     |
2026-01-30 23:19:45,048 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ModelParameters does not exist in model's document
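When the fitted estimators are available in your session, the reported configuration can be spot-checked straight from scikit-learn. A sketch, assuming `log_model_champion` and `rf_model` refer to the underlying fitted estimator objects (the table above uses the same names as input IDs):

```python
# Compare the live estimator configuration against the logged parameter table
for name, estimator in [("log_model_champion", log_model_champion), ("rf_model", rf_model)]:
    print(name)
    for param, value in sorted(estimator.get_params().items()):
        print(f"  {param} = {value}")
```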
validmind.model_validation.sklearn.ROCCurve

ROC Curve

The ROC Curve test evaluates the binary classification model's ability to distinguish between positive and negative classes by plotting the True Positive Rate against the False Positive Rate at various thresholds and calculating the Area Under the Curve (AUC) score. The results include ROC curves and AUC values for both the training and test datasets, with a reference line indicating random performance (AUC = 0.5). The ROC curves for both datasets are positioned above the random line, and the AUC scores are reported in the plot legends.

Key insights:

  • AUC indicates moderate discriminative power: The AUC is 0.68 on the training dataset and 0.66 on the test dataset, reflecting moderate ability to distinguish between classes.
  • Consistent performance across datasets: The similarity between training and test AUC values suggests stable model behavior and limited overfitting.
  • ROC curves consistently above random: Both ROC curves remain above the random classifier line, indicating the model provides meaningful separation between classes.

The ROC Curve test results demonstrate that the model achieves moderate discriminative performance, with AUC values consistently above the random baseline on both training and test datasets. The close alignment of AUC scores across datasets indicates stable generalization, and the ROC curves confirm the model's ability to provide class separation beyond random chance.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:6713
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:6811
2026-01-30 23:19:54,792 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve does not exist in model's document
validmind.model_validation.sklearn.MinimumROCAUCScore

✅ Minimum ROCAUC Score

The Minimum ROC AUC Score test evaluates whether the model's multiclass ROC AUC score meets or exceeds a specified minimum threshold, providing an assessment of the model's ability to distinguish between classes. The results table presents ROC AUC scores for both the training and test datasets, alongside the threshold value and the corresponding pass/fail status. Both datasets are evaluated against a minimum threshold of 0.5, with the observed scores and outcomes reported for each.

Key insights:

  • ROC AUC scores exceed threshold: Both the training (0.6763) and test (0.6637) datasets register ROC AUC scores above the minimum threshold of 0.5.
  • Consistent pass status across datasets: The test is marked as "Pass" for both datasets, indicating that the model's class discrimination performance meets the predefined criterion in both cases.
  • Comparable performance between train and test: The ROC AUC scores for the training and test datasets are closely aligned, with a difference of 0.0126, suggesting stable model performance across data splits.

The results indicate that the model demonstrates adequate class separation capability on both the training and test datasets, as measured by the ROC AUC metric. The close alignment of scores across datasets reflects consistent model behavior, with both results surpassing the established minimum threshold. No evidence of underperformance or significant divergence between training and test results is observed in this evaluation.

Parameters:

{
  "min_threshold": 0.5
}

Tables

dataset Score Threshold Pass/Fail
train_dataset_final 0.6763 0.5 Pass
test_dataset_final 0.6637 0.5 Pass
2026-01-30 23:19:59,868 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumROCAUCScore does not exist in model's document
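If your validation standards call for a stricter cut-off than the default of 0.5, the same test can be re-run with a different `min_threshold` and logged as additional evidence. The sketch below is illustrative only: `vm_test_ds` and `vm_champion_model` are placeholder names for the ValidMind dataset and model objects you initialized earlier in this series.

# Re-run the minimum ROC AUC check with a stricter threshold and log the result
# `vm_test_ds` and `vm_champion_model` are placeholders for your initialized objects
result = vm.tests.run_test(
    "validmind.model_validation.sklearn.MinimumROCAUCScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_champion_model,
    },
    params={"min_threshold": 0.6},
)
result.log()
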

In summary

In this final notebook, you ran the remaining validation tests and logged their results as additional evidence for your validation report.

With our ValidMind for model validation series of notebooks, you learned how to validate a model end-to-end with the ValidMind Library by running through some common scenarios in a typical model validation setting:

  • Verifying the data quality steps performed by the model development team
  • Independently replicating the champion model's results and conducting additional tests to assess performance, stability, and robustness
  • Setting up test inputs and a challenger model for comparative analysis
  • Running validation tests, analyzing results, and logging artifacts to ValidMind

Next steps

Work with your validation report

Now that you've logged all your test results and verified the work done by the model development team, head to the ValidMind Platform to wrap up your validation report. Continue working on your report by:

  • Inserting additional test results: Click Link Evidence to Report under any section of 2. Validation in your validation report. (Learn more: Link evidence to reports)

  • Making qualitative edits to your test descriptions: Expand any linked evidence under Validator Evidence and click See evidence details to review and edit the ValidMind-generated test descriptions for quality and accuracy. (Learn more: Preparing validation reports)

  • Adding more findings: Click Link Finding to Report in any validation report section, then click + Create New Finding. (Learn more: Add and manage model findings)

  • Adding risk assessment notes: Click under Risk Assessment Notes in any validation report section to access the text editor and content editing toolbar, including an option to generate a draft with AI. Once generated, edit the draft to adhere to your organization's requirements. (Learn more: Work with content blocks)

  • Assessing compliance: Under the Guideline for any validation report section, click ASSESSMENT and select the compliance status from the drop-down menu. (Learn more: Provide compliance assessments)

  • Collaborating with other stakeholders: Use the ValidMind Platform's real-time collaborative features to work seamlessly with the rest of your organization, including model developers. Propose suggested changes to the model documentation, work with versioned history, and use comments to discuss specific portions of the documentation. (Learn more: Collaborate with others)

When your validation report is complete and ready for review, submit it for approval in the same ValidMind Platform where you made your edits and collaborated with the rest of your organization, ensuring transparency and a thorough model validation history. (Learn more: Submit for approval)

Learn more

Now that you're familiar with the basics, you can explore the following notebooks to get a deeper understanding of how the ValidMind Library assists you in streamlining model validation:

More how-to guides and code samples

Discover more learning resources

All notebook samples can be found in the following directories of the ValidMind Library GitHub repository:

Or, visit our documentation to learn more about ValidMind.


Copyright © 2023-2026 ValidMind Inc. All rights reserved.
Refer to LICENSE for details.
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial