ValidMind for model validation 4 — Finalize testing and reporting

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this last notebook, finalize the compliance assessment process and have a complete validation report ready for review.

This notebook will walk you through how to supplement ValidMind tests with your own custom tests and include them as additional evidence in your validation report. A custom test is any function that takes a set of inputs and parameters as arguments and returns one or more outputs.

For a more in-depth introduction to custom tests, refer to our Implement custom tests notebook.
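A custom test, at its simplest, is just a function mapping inputs and parameters to an output such as a table or plot. As a minimal, library-free sketch of the concept (the function and column names here are illustrative, not part of the ValidMind API):

```python
# Minimal sketch of the custom test concept: a plain Python function that
# takes an input (a dataset) and a parameter (a threshold), and returns an
# output (a summary table as a list of rows). Illustrative names only.
def missing_values_test(dataset, max_missing_ratio=0.1):
    """Flag columns whose share of missing values exceeds a threshold."""
    rows = []
    for column, values in dataset.items():
        ratio = sum(v is None for v in values) / len(values)
        rows.append({
            "column": column,
            "missing_ratio": round(ratio, 2),
            "passed": ratio <= max_missing_ratio,
        })
    return rows

# 'age' is 50% missing and fails the 10% threshold; 'balance' passes
result = missing_values_test({"age": [35, None], "balance": [100.0, 250.0]})
```

ValidMind's custom tests follow this same shape, with VMDataset and VMModel objects as inputs and a plot or table as the returned output.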

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals

Prerequisites

To finalize validation and reporting, you'll first need to have:

Setting up

This section should be very familiar to you by now, as we performed the same actions in the previous two notebooks in this series.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model validation" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    # document="validation-report",
)
Note: you may need to restart the kernel to use updated packages.
2026-03-12 20:48:06,626 - ERROR(validmind.api_client): Future releases will require `document` as one of the options you must provide to `vm.init()`. To learn more, refer to https://docs.validmind.ai/developer/validmind-library.html
2026-03-12 20:48:06,760 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load in the same sample Bank Customer Churn Prediction dataset used to develop the champion model, which we'll independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}
# Initialize the raw dataset for use in ValidMind tests
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)
import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test:

# Register the balanced dataset; 'balanced_raw_dataset' is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)
# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, with the aim of detecting potential feature redundancy or multicollinearity. The results table lists the top ten feature pairs ranked by the absolute value of their Pearson correlation coefficients, along with a Pass or Fail status based on a threshold of 0.3. Only one feature pair exceeds the threshold, while the remaining pairs display lower correlation values and pass the test criteria.

Key insights:

  • Single feature pair exceeds correlation threshold: The pair (Age, Exited) shows a Pearson correlation coefficient of 0.3245, surpassing the 0.3 threshold and receiving a Fail status.
  • All other feature pairs below threshold: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.0348 to 0.2064, all below the threshold and marked as Pass.
  • Predominantly weak linear relationships: Most feature pairs demonstrate weak linear associations, with coefficients clustered near zero.

The test results indicate that the dataset contains minimal evidence of strong linear relationships among most feature pairs, with only the (Age, Exited) pair exhibiting a moderate correlation above the specified threshold. The overall correlation structure suggests low risk of widespread multicollinearity or feature redundancy based on linear associations.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3245 Fail
(IsActiveMember, Exited) -0.2064 Pass
(Balance, NumOfProducts) -0.1749 Pass
(Balance, Exited) 0.1349 Pass
(NumOfProducts, Exited) -0.0550 Pass
(Age, NumOfProducts) -0.0444 Pass
(Age, Balance) 0.0409 Pass
(NumOfProducts, IsActiveMember) 0.0387 Pass
(HasCrCard, IsActiveMember) -0.0360 Pass
(Age, Tenure) -0.0348 Pass
# From the result object, extract the table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3245 Fail
1 (IsActiveMember, Exited) -0.2064 Pass
2 (Balance, NumOfProducts) -0.1749 Pass
3 (Balance, Exited) 0.1349 Pass
4 (NumOfProducts, Exited) -0.0550 Pass
5 (Age, NumOfProducts) -0.0444 Pass
6 (Age, Balance) 0.0409 Pass
7 (NumOfProducts, IsActiveMember) 0.0387 Pass
8 (HasCrCard, IsActiveMember) -0.0360 Pass
9 (Age, Tenure) -0.0348 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity within the dataset. The results table presents the top ten absolute Pearson correlation coefficients, each associated with a feature pair, the coefficient value, and a Pass/Fail status based on a threshold of 0.3. All observed coefficients are below the threshold, and each feature pair is marked as Pass.

Key insights:

  • No feature pairs exceed correlation threshold: All absolute Pearson correlation coefficients are below the 0.3 threshold, with the highest observed value being -0.2064 between IsActiveMember and Exited.
  • Low to moderate linear relationships: The strongest correlations, such as between Balance and NumOfProducts (-0.1749) and Balance and Exited (0.1349), remain well below levels typically associated with multicollinearity.
  • Consistent Pass status across all pairs: Every feature pair in the top ten list is marked as Pass, indicating no detected high-risk linear dependencies among the evaluated features.

The results indicate that the dataset does not exhibit high linear correlations among the top feature pairs, suggesting a low risk of feature redundancy or multicollinearity based on the tested threshold. The observed correlation structure supports the interpretability and stability of subsequent modeling efforts.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.2064 Pass
(Balance, NumOfProducts) -0.1749 Pass
(Balance, Exited) 0.1349 Pass
(NumOfProducts, Exited) -0.0550 Pass
(NumOfProducts, IsActiveMember) 0.0387 Pass
(HasCrCard, IsActiveMember) -0.0360 Pass
(CreditScore, Exited) -0.0303 Pass
(Tenure, Exited) -0.0246 Pass
(Tenure, HasCrCard) 0.0239 Pass
(Tenure, EstimatedSalary) 0.0224 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
4938 850 3 51293.47 1 0 0 35534.68 0 True False False
775 610 9 0.00 3 0 1 83912.24 0 False True True
693 733 3 106545.53 1 1 1 134589.58 0 True False True
2545 515 9 113715.36 1 1 0 18424.24 1 True False True
5198 651 1 163700.78 3 1 1 29583.48 1 True False False
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's go ahead and import the champion model submitted by the model development team as a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(

Train potential challenger model

We'll also train our random forest classification challenger model to see how it compares:

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, which can then be passed to other functions for analysis and tests on the data:

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)
# Assign predictions to Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Assign predictions to Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-03-12 20:48:20,826 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-03-12 20:48:20,828 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-03-12 20:48:20,828 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-03-12 20:48:20,832 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-03-12 20:48:20,833 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-03-12 20:48:20,836 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-03-12 20:48:20,837 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-03-12 20:48:20,838 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-03-12 20:48:20,841 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-03-12 20:48:20,867 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-03-12 20:48:20,867 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-03-12 20:48:20,892 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-03-12 20:48:20,895 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-03-12 20:48:20,908 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-03-12 20:48:20,909 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-03-12 20:48:20,922 - INFO(validmind.vm_models.dataset.utils): Done running predict()

Implementing custom tests

Thanks to the model documentation (Learn more ...), we know that the model development team implemented a custom test to further evaluate the performance of the champion model.

In a usual model validation situation, you would load a saved custom test provided by the model development team. In the following section, we'll have you implement the same custom test and make it available for reuse, to familiarize you with the process.

Want to learn more about custom tests?

Refer to our in-depth introduction to custom tests: Implement custom tests

Implement a custom inline test

Let's implement the same custom inline test that the model development team used in their performance evaluations: a test that calculates the confusion matrix for a binary classification model.

  • An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
  • You'll note that the custom test function is just a regular Python function that can include and require any Python library as you see fit.

Create a confusion matrix plot

Let's first create a confusion matrix plot using the confusion_matrix function from the sklearn.metrics module:

import matplotlib.pyplot as plt
from sklearn import metrics

# Get the predicted classes
y_pred = log_reg.predict(vm_test_ds.x)

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[False, True]
)
cm_display.plot()

Next, create a @vm.test wrapper that will allow you to create a reusable test. Note the following changes in the code below:

  • The function confusion_matrix takes two arguments, dataset and model: a VMDataset and a VMModel object, respectively.
    • VMDataset objects allow you to access the dataset's true (target) values by accessing the .y attribute.
    • VMDataset objects allow you to access the predictions for a given model by accessing the .y_pred() method.
  • The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.
  • The function body calculates the confusion matrix using the sklearn.metrics.confusion_matrix function as we just did above.
  • The function then returns the ConfusionMatrixDisplay.figure_ object — this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.
  • The @vm.test decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID my_custom_tests.ConfusionMatrix.
@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself

You can now run the newly created custom test on both the training and test datasets for both models using the run_test() function:

# Champion train and test
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_train_ds, vm_test_ds],
        "model": [vm_log_model]
    }
).log()

Confusion Matrix Champion

The Confusion Matrix test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The resulting matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of model accuracy and error distribution. The results are presented separately for the training dataset (train_dataset_final) and the test dataset (test_dataset_final), allowing for assessment of model generalization and potential overfitting.

Key insights:

  • Balanced classification performance across datasets: Both training and test confusion matrices show substantial counts in the true positive and true negative cells, indicating the model is able to correctly identify both classes in each dataset.
  • False positive and false negative rates are comparable: The number of false positives (446 in training, 116 in test) and false negatives (419 in training, 118 in test) are similar within each dataset, suggesting no strong bias toward one type of misclassification.
  • Consistent error distribution between train and test: The relative proportions of correct and incorrect predictions are similar between the training and test datasets, indicating stable model behavior and no evidence of significant overfitting.

The confusion matrix results demonstrate that the model maintains consistent classification performance across both training and test datasets, with balanced rates of true and false predictions. The error distribution does not indicate a dominant misclassification type, and the similarity between datasets suggests the model generalizes well to unseen data.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:champion:13a6
ValidMind Figure my_custom_tests.ConfusionMatrix:champion:d3ad
2026-03-12 20:48:26,995 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:champion does not exist in model's document
# Challenger train and test
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_train_ds, vm_test_ds],
        "model": [vm_rf_model]
    }
).log()

Confusion Matrix Challenger

The Confusion Matrix test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The resulting matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of model accuracy and error distribution. The first matrix corresponds to the training dataset, while the second matrix summarizes results for the test dataset.

Key insights:

  • Perfect classification on training data: The training confusion matrix shows 1,304 true negatives and 1,281 true positives, with zero false positives and zero false negatives, indicating no misclassifications on the training set.
  • Presence of misclassifications on test data: The test confusion matrix records 225 true negatives, 242 true positives, 87 false positives, and 93 false negatives, indicating both types of classification errors are present in the test set.
  • Balanced error distribution in test set: The number of false positives (87) and false negatives (93) are of similar magnitude, suggesting no strong bias toward one type of error in the test predictions.

The confusion matrices indicate that the model achieves perfect separation on the training data, with no observed misclassifications. On the test data, the model exhibits both false positives and false negatives, with error counts that are balanced between the two classes. This pattern suggests strong model fit to the training data and a moderate level of generalization error on unseen data, with no evidence of systematic bias toward either class in the test predictions.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:challenger:8186
ValidMind Figure my_custom_tests.ConfusionMatrix:challenger:303d
2026-03-12 20:48:34,121 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:challenger does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.

Add parameters to custom tests

Custom tests can take parameters just like any other function. To demonstrate, let's modify the confusion_matrix function to take an additional parameter normalize that will allow you to normalize the confusion matrix:

@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model, normalize=False):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    if normalize:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred, normalize="all")
    else:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself

Pass parameters to custom tests

You can pass parameters to custom tests by providing a dictionary of parameters to the run_test() function.

  • The parameters will override any default parameters set in the custom test definition. Note that dataset and model are still passed as inputs.
  • Because dataset and model are VMDataset and VMModel objects, they are treated as inputs rather than parameters.

Re-running and logging the custom confusion matrix with normalize=True for both models and our testing dataset looks like this:

# Champion with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model]
    },
    params={"normalize": True}
).log()

Confusion Matrix Test Normalized Champion

The ConfusionMatrix:test_normalized_champion test evaluates the classification performance of the log_model_champion model on the test_dataset_final dataset by displaying the normalized confusion matrix. The matrix presents the proportion of true positives, true negatives, false positives, and false negatives, with each cell value representing the fraction of total predictions for each outcome. The normalization enables direct comparison of prediction accuracy across both classes.

Key insights:

  • Balanced correct classification rates: The model correctly classifies 0.30 of negative cases (true negatives) and 0.32 of positive cases (true positives), indicating similar accuracy for both classes.
  • Moderate misclassification rates: False positives and false negatives are observed at 0.18 and 0.20, respectively, reflecting moderate levels of misclassification for each class.
  • No extreme class imbalance in predictions: The normalized values are distributed without extreme skew, suggesting the model does not disproportionately favor one class over the other.

The normalized confusion matrix indicates that the model achieves comparable accuracy in identifying both positive and negative cases, with moderate and relatively balanced misclassification rates. The absence of pronounced class bias in predictions suggests stable model behavior across the evaluated dataset.

Parameters:

{
  "normalize": true
}
            

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_normalized_champion:70f2
2026-03-12 20:48:41,510 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_champion does not exist in model's document
# Challenger with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model]
    },
    params={"normalize": True}
).log()

Confusion Matrix Test Normalized Challenger

The ConfusionMatrix:test_normalized_challenger test evaluates the classification performance of the rf_model on the test_dataset_final by presenting a normalized confusion matrix. The matrix displays the proportion of true and false predictions for each class, with values normalized to sum to 1 across all entries. The plot provides a visual summary of the model's ability to correctly and incorrectly classify both positive and negative cases.

Key insights:

  • Balanced correct classification rates: The model correctly classifies 0.35 of all samples as true negatives and 0.37 as true positives, indicating similar accuracy for both classes.
  • Moderate false prediction rates: False positives and false negatives are observed at 0.13 and 0.14, respectively, reflecting moderate misclassification rates for both classes.
  • No class dominance in errors: The distribution of errors is relatively even between false positives and false negatives, with no single error type disproportionately represented.

The confusion matrix reveals that the model demonstrates balanced performance across both classes, with correct classification rates for true positives and true negatives closely aligned. Misclassification rates are moderate and evenly distributed, indicating that the model does not exhibit a strong bias toward either class in its prediction errors. This balanced error profile suggests consistent model behavior across the evaluated dataset.

Parameters:

{
  "normalize": true
}
            

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_normalized_challenger:3ba1
2026-03-12 20:48:51,702 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_challenger does not exist in model's document

Use external test providers

Sometimes you may want to reuse the same set of custom tests across multiple models and share them with others in your organization, as the model development team would have done with you in the workflow featured in this series of notebooks. In this case, you can create an external custom test provider that allows you to load custom tests from a local folder or a Git repository.

In this section, you'll learn how to declare a local filesystem test provider that loads tests from a local folder, following these high-level steps:

  1. Create a folder of custom tests from existing inline tests (tests that exist in your active Jupyter Notebook)
  2. Save an inline test to a file
  3. Define and register a LocalTestProvider that points to that folder
  4. Run test provider tests
  5. Add the test results to your documentation

Create custom tests folder

Let's start by creating a new folder that will contain reusable custom tests from your existing inline tests.

The following code snippet will create a new my_tests directory in the current working directory if it doesn't exist:

tests_folder = "my_tests"

import os

# create tests folder
os.makedirs(tests_folder, exist_ok=True)

# remove existing tests
for f in os.listdir(tests_folder):
    # remove files and pycache
    if f.endswith(".py") or f == "__pycache__":
        os.system(f"rm -rf {tests_folder}/{f}")

After running the command above, confirm that a new my_tests directory was created successfully. For example:

~/notebooks/tutorials/model_validation/my_tests/

Save an inline test

The @vm.test decorator we used in Implement a custom inline test above to register one-off custom tests also includes a convenience method on the function object that allows you to simply call <func_name>.save() to save the test to a Python file at a specified path.

While save() will get you started by creating the file and saving the function code under the correct name, it won't automatically include any imports, or other functions and variables defined outside the test function, that the test needs to run. To solve this, pass in the optional imports argument to ensure the necessary imports are added to the file.

The confusion_matrix test requires the following additional imports:

import matplotlib.pyplot as plt
from sklearn import metrics

Let's pass these imports to the save() method to ensure they are included in the file with the following command:

confusion_matrix.save(
    # Save it to the custom tests folder we created
    tests_folder,
    imports=["import matplotlib.pyplot as plt", "from sklearn import metrics"],
)
2026-03-12 20:48:52,174 - INFO(validmind.tests.decorator): Saved to /home/runner/work/documentation/documentation/site/notebooks/EXECUTED/model_validation/my_tests/ConfusionMatrix.py! Be sure to add any necessary imports to the top of the file.
2026-03-12 20:48:52,175 - INFO(validmind.tests.decorator): This metric can be run with the ID: <test_provider_namespace>.ConfusionMatrix

The saved file begins with a header recording its origin and new test ID:

# Saved from __main__.confusion_matrix
# Original Test ID: my_custom_tests.ConfusionMatrix
# New Test ID: <test_provider_namespace>.ConfusionMatrix

def ConfusionMatrix(dataset, model, normalize=False):

Register a local test provider

Now that your my_tests folder has a sample custom test, let's initialize a test provider that will tell the ValidMind Library where to find your custom tests:

  • ValidMind offers out-of-the-box test providers for local tests (tests in a folder) or a GitHub provider for tests in a GitHub repository.
  • You can also create your own test provider by creating a class that has a load_test method that takes a test ID and returns the test function matching that ID.
Want to learn more about test providers?

An extended introduction to test providers can be found in: Integrate external test providers
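To make the load_test contract from the bullet above concrete, here's a minimal sketch of a hand-rolled provider. This class is hypothetical and not part of the ValidMind API; it simply serves test functions from an in-memory dictionary:

```python
class DictTestProvider:
    """Minimal sketch of a custom test provider: any object exposing a
    `load_test(test_id)` method that returns the matching test function."""

    def __init__(self, tests):
        # `tests` maps the test ID (the part after the namespace prefix)
        # to a test function
        self._tests = tests

    def load_test(self, test_id):
        return self._tests[test_id]


def my_metric(dataset):  # a stand-in test function for illustration
    return len(dataset)

provider = DictTestProvider({"MyMetric": my_metric})
print(provider.load_test("MyMetric") is my_metric)  # True
```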
Initialize a local test provider

For most use cases, a LocalTestProvider that loads custom tests from a designated directory should be sufficient.

The most important attribute for a test provider is its namespace. This is a string that will be used to prefix test IDs in model documentation. This allows you to have multiple test providers with tests that can even share the same ID, but are distinguished by their namespace.

Let's go ahead and load the custom tests from our my_tests directory:

from validmind.tests import LocalTestProvider

# initialize the test provider with the tests folder we created earlier
my_test_provider = LocalTestProvider(tests_folder)

vm.tests.register_test_provider(
    namespace="my_test_provider",
    test_provider=my_test_provider,
)
# `my_test_provider.load_test()` will be called for any test ID that starts with `my_test_provider`
# e.g. `my_test_provider.ConfusionMatrix` will look for a function named `ConfusionMatrix` in `my_tests/ConfusionMatrix.py` file
Run test provider tests

Now that we've set up the test provider, we can run any test that's located in the tests folder by using the run_test() method as with any other test:

  • For tests that reside in a test provider directory, the test ID will be the namespace specified when registering the provider, followed by the path to the test file relative to the tests folder.
  • For example, the Confusion Matrix test we created earlier will have the test ID my_test_provider.ConfusionMatrix. You could organize the tests in subfolders, say classification and regression, and the test ID for the Confusion Matrix test would then be my_test_provider.classification.ConfusionMatrix.

Let's go ahead and re-run the confusion matrix test with our testing dataset for our two models by using the test ID my_test_provider.ConfusionMatrix. This should load the test from the test provider and run it as before.

# Champion with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model]
    }
).log()

Confusion Matrix Champion

The Confusion Matrix test evaluates the classification performance of the log_model_champion on the test_dataset_final by comparing predicted and true labels. The resulting matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error distribution. The matrix is structured with true labels on the vertical axis and predicted labels on the horizontal axis, with each cell indicating the number of instances for each outcome.

Key insights:

  • Balanced true positive and true negative counts: The model correctly classified 207 true positives and 196 true negatives, indicating similar effectiveness in identifying both classes.
  • Comparable false positive and false negative rates: There are 116 false positives and 118 false negatives, suggesting that misclassification rates are nearly equivalent for both types of errors.
  • No evidence of class prediction bias: The distribution of correct and incorrect predictions does not indicate a strong bias toward either class, as both positive and negative classes are represented similarly in both correct and incorrect predictions.

The confusion matrix reveals that the log_model_champion demonstrates balanced performance across both classes, with similar rates of correct and incorrect predictions for positive and negative outcomes. The absence of pronounced class bias and the close alignment of false positive and false negative counts indicate that the model maintains consistent classification behavior across the test dataset.

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:champion:2614
2026-03-12 20:48:59,823 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:champion does not exist in model's document
# Challenger with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model]
    }
).log()

Confusion Matrix Challenger

The Confusion Matrix test evaluates the classification performance of the rf_model on the test_dataset_final by comparing predicted and true labels. The resulting matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error distribution. The matrix is structured with actual class labels on the vertical axis and predicted class labels on the horizontal axis, with color intensity reflecting the count magnitude.

Key insights:

  • Balanced detection of both classes: The model correctly classified 225 negative cases (true negatives) and 242 positive cases (true positives), indicating effective identification of both classes.
  • Moderate false positive and false negative rates: There are 87 false positives and 93 false negatives, reflecting a moderate level of misclassification for both types of errors.
  • Comparable error distribution: The counts of false positives and false negatives are similar in magnitude, suggesting no substantial bias toward over- or under-predicting either class.

The confusion matrix reveals that the rf_model demonstrates balanced performance in identifying both positive and negative cases, with true positive and true negative counts closely matched. The rates of false positives and false negatives are moderate and similar in scale, indicating that misclassification is distributed relatively evenly across both classes. This pattern suggests the model does not exhibit a strong bias toward either class, and overall classification performance is consistent across the test dataset.

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:challenger:99d1
2026-03-12 20:49:07,228 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:challenger does not exist in model's document

Verify test runs

Our final task is to verify that all the tests provided by the model development team were run and reported accurately. Note the result_ids appended to the relevant test IDs, which indicate the dataset each test was run against.

Here, we'll specify all the tests we'd like to independently rerun in a dictionary called test_config. Note that inputs and input_grid expect the input_id of the dataset or model as the value, rather than the variable name we assigned it to:

test_config = {
    # Run with the raw dataset
    'validmind.data_validation.DatasetDescription:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.DescriptiveStatistics:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.MissingValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percentage_threshold': 1}
    },
    'validmind.data_validation.ClassImbalance:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.Duplicates:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.HighCardinality:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {
            'num_threshold': 100,
            'percent_threshold': 0.1,
            'threshold_type': 'percent'
        }
    },
    'validmind.data_validation.Skewness:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_threshold': 1}
    },
    'validmind.data_validation.UniqueRows:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TooManyZeroValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_percent_threshold': 0.03}
    },
    'validmind.data_validation.IQROutliersTable:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'threshold': 5}
    },
    # Run with the preprocessed dataset
    'validmind.data_validation.DescriptiveStatistics:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TabularDescriptionTables:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.MissingValues:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'},
        'params': {'min_percentage_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TargetRateBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'},
        'params': {'default_column': 'loan_status'}
    },
    # Run with the training and test datasets
    'validmind.data_validation.DescriptiveStatistics:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.TabularDescriptionTables:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.ClassImbalance:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.UniqueRows:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.MutualInformation:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_threshold': 0.01}
    },
    'validmind.data_validation.PearsonCorrelationMatrix:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.HighPearsonCorrelation:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'max_threshold': 0.3, 'top_n_correlations': 10}
    },
    'validmind.model_validation.ModelMetadata': {
        'input_grid': {'model': ['log_model_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ModelParameters': {
        'input_grid': {'model': ['log_model_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ROCCurve': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final'], 'model': ['log_model_champion']}
    },
    'validmind.model_validation.sklearn.MinimumROCAUCScore': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final'], 'model': ['log_model_champion']},
        'params': {'min_threshold': 0.5}
    }
}

Then batch run and log our tests in test_config:

for test_id, config in test_config.items():
    print(test_id)
    try:
        # `config` holds `inputs` or `input_grid`, plus optional `params`,
        # so it can be unpacked directly into `run_test()`
        vm.tests.run_test(test_id, **config).log()
    except Exception as e:
        print(f"Error running test {test_id}: {e}")
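If you prefer not to intermix successes and failures in the console output, a small helper (a sketch, not a ValidMind API) can collect failures for review after the batch completes. With ValidMind, `run_fn` could be `lambda t, **cfg: vm.tests.run_test(t, **cfg).log()`:

```python
def run_all(test_config, run_fn):
    """Run every configured test via `run_fn`, collecting failures
    instead of stopping; returns a {test_id: error message} dict."""
    failures = {}
    for test_id, config in test_config.items():
        try:
            run_fn(test_id, **config)
        except Exception as e:
            failures[test_id] = str(e)
    return failures


# Demonstration with a stub runner that fails for one test ID
def stub_runner(test_id, **config):
    if "Broken" in test_id:
        raise ValueError("missing input")

failures = run_all(
    {"ok.Test": {"inputs": {}}, "ok.BrokenTest": {"inputs": {}}},
    stub_runner,
)
print(failures)  # {'ok.BrokenTest': 'missing input'}
```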
validmind.data_validation.DatasetDescription:raw_data

Dataset Description Raw Data

The Dataset Description test provides a comprehensive summary of the dataset's structure, completeness, and feature characteristics. The results table details each column's data type, count, missingness, and the number of distinct values, offering a clear overview of the dataset composition. All columns are fully populated with no missing values, and the distinct value counts highlight the diversity and granularity of each feature. This summary enables a thorough understanding of the dataset's readiness for modeling and potential areas of complexity.

Key insights:

  • No missing values across all columns: All 11 columns have 8,000 non-missing entries, with 0% missingness observed throughout the dataset.
  • High cardinality in key numeric features: The Balance and EstimatedSalary columns exhibit high distinct value counts (5,088 and 8,000 respectively), indicating continuous or near-continuous distributions.
  • Low cardinality in categorical features: Categorical columns such as Geography, Gender, HasCrCard, IsActiveMember, and Exited have between 2 and 3 distinct values, reflecting well-defined categorical groupings.
  • Moderate diversity in demographic and behavioral features: Age and CreditScore show moderate distinct counts (69 and 452 respectively), while Tenure and NumOfProducts have lower diversity (11 and 4 distinct values).

The dataset is fully complete with no missing data, supporting robust downstream analysis. Numeric features display a range of cardinalities, from highly granular (EstimatedSalary, Balance) to more discretized (Tenure, NumOfProducts), while categorical features are well-structured with limited unique values. The observed structure indicates a dataset suitable for a variety of modeling approaches, with no immediate data quality concerns evident from the summary statistics.

Tables

Dataset Description

Name Type Count Missing Missing % Distinct Distinct %
CreditScore Numeric 8000.0 0 0.0 452 0.0565
Geography Categorical 8000.0 0 0.0 3 0.0004
Gender Categorical 8000.0 0 0.0 2 0.0002
Age Numeric 8000.0 0 0.0 69 0.0086
Tenure Numeric 8000.0 0 0.0 11 0.0014
Balance Numeric 8000.0 0 0.0 5088 0.6360
NumOfProducts Numeric 8000.0 0 0.0 4 0.0005
HasCrCard Categorical 8000.0 0 0.0 2 0.0002
IsActiveMember Categorical 8000.0 0 0.0 2 0.0002
EstimatedSalary Numeric 8000.0 0 0.0 8000 1.0000
Exited Categorical 8000.0 0 0.0 2 0.0002
2026-03-12 20:49:16,161 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DatasetDescription:raw_data does not exist in model's document
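Column summaries like this are straightforward to spot-check independently; a pandas sketch on illustrative data (not the actual raw_dataset):

```python
import pandas as pd

# Illustrative data, not the actual raw_dataset
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None],
    "Age": [30, 42, 30, 55],
})

summary = pd.DataFrame({
    "Count": df.count(),               # non-missing entries per column
    "Missing": df.isna().sum(),
    "Missing %": df.isna().mean() * 100,
    "Distinct": df.nunique(),          # excludes missing values
})
print(summary)
```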
validmind.data_validation.DescriptiveStatistics:raw_data

Descriptive Statistics Raw Data

The Descriptive Statistics test evaluates the distributional characteristics and diversity of both numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables, including measures of central tendency, dispersion, and range, as well as frequency-based summaries for two categorical variables. The numerical table details counts, means, standard deviations, and percentiles, while the categorical table reports unique value counts and the dominance of the most frequent category. These results provide a comprehensive overview of the dataset's structure and highlight key aspects of variable distributions.

Key insights:

  • Wide range and skewness in balance values: The Balance variable exhibits a minimum of 0 and a maximum of 250,898, with a mean (76,434) substantially lower than the median (97,264), indicating a right-skewed distribution and the presence of a significant proportion of zero balances.
  • CreditScore and Age distributions are symmetric: CreditScore and Age show close alignment between mean and median (CreditScore mean: 650.16, median: 652; Age mean: 38.95, median: 37), suggesting relatively symmetric distributions without pronounced skewness.
  • Limited diversity in categorical variables: Geography is dominated by France (50.12% of records), and Gender is split between two categories, with Male comprising 54.95% of the dataset, indicating moderate imbalance but not extreme concentration.
  • Binary variables with balanced representation: HasCrCard and IsActiveMember are binary variables with means of 0.70 and 0.52, respectively, reflecting a moderate split between categories and no evidence of extreme imbalance.
  • NumOfProducts concentrated at lower values: The NumOfProducts variable has a mean of 1.53 and a median of 1, with 75% of values at or below 2, indicating most customers hold one or two products.

The dataset displays a mix of symmetric and skewed distributions among numerical variables, with Balance notably right-skewed and containing a substantial proportion of zero values. Categorical variables show moderate dominance by single categories but retain some diversity. Binary and count variables are distributed without extreme imbalance, supporting a representative sample across key dimensions. Overall, the data structure is well-characterized, with some variables warranting attention due to skewness or concentration.

Tables

Numerical Variables

Name Count Mean Std Min 25% 50% 75% 90% 95% Max
CreditScore 8000.0 650.1596 96.8462 350.0 583.0 652.0 717.0 778.0 813.0 850.0
Age 8000.0 38.9489 10.4590 18.0 32.0 37.0 44.0 53.0 60.0 92.0
Tenure 8000.0 5.0339 2.8853 0.0 3.0 5.0 8.0 9.0 9.0 10.0
Balance 8000.0 76434.0965 62612.2513 0.0 0.0 97264.0 128045.0 149545.0 162488.0 250898.0
NumOfProducts 8000.0 1.5325 0.5805 1.0 1.0 1.0 2.0 2.0 2.0 4.0
HasCrCard 8000.0 0.7026 0.4571 0.0 0.0 1.0 1.0 1.0 1.0 1.0
IsActiveMember 8000.0 0.5199 0.4996 0.0 0.0 1.0 1.0 1.0 1.0 1.0
EstimatedSalary 8000.0 99790.1880 57520.5089 12.0 50857.0 99505.0 149216.0 179486.0 189997.0 199992.0

Categorical Variables

Name Count Number of Unique Values Top Value Top Value Frequency Top Value Frequency %
Geography 8000.0 3.0 France 4010.0 50.12
Gender 8000.0 2.0 Male 4396.0 54.95
2026-03-12 20:49:23,997 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:raw_data does not exist in model's document
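The numerical table above corresponds closely to what pandas' describe produces with matching percentiles; for example, on illustrative values (not the actual Balance column):

```python
import pandas as pd

# Illustrative values, not the actual Balance column
s = pd.Series([0.0, 0.0, 97264.0, 128045.0, 250898.0], name="Balance")
print(s.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95]))
```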
validmind.data_validation.MissingValues:raw_data

✅ Missing Values Raw Data

The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the 1.0% threshold. All features in the dataset are listed with their respective missing value statistics and test outcomes.

Key insights:

  • No missing values detected: All features report 0 missing values, corresponding to 0.0% missingness for each column.
  • Universal test pass across features: Every feature meets the missing value threshold, with all columns marked as "Pass" in the results.

The dataset demonstrates complete data integrity with respect to missing values, as no feature contains any missing entries. All columns satisfy the established threshold, indicating a high level of data completeness for subsequent modeling or analysis.

Parameters:

{
  "min_percentage_threshold": 1
}
            

Tables

Column Number of Missing Values Percentage of Missing Values (%) Pass/Fail
CreditScore 0 0.0 Pass
Geography 0 0.0 Pass
Gender 0 0.0 Pass
Age 0 0.0 Pass
Tenure 0 0.0 Pass
Balance 0 0.0 Pass
NumOfProducts 0 0.0 Pass
HasCrCard 0 0.0 Pass
IsActiveMember 0 0.0 Pass
EstimatedSalary 0 0.0 Pass
Exited 0 0.0 Pass
2026-03-12 20:49:28,021 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:raw_data does not exist in model's document
validmind.data_validation.ClassImbalance:raw_data

✅ Class Imbalance Raw Data

The Class Imbalance test evaluates the distribution of target classes within the dataset to identify potential imbalances that could impact model performance. The results table presents the percentage representation of each class in the target variable "Exited," alongside a pass/fail assessment based on a minimum threshold of 10%. The accompanying bar plot visually depicts the proportion of each class, providing a clear overview of class distribution.

Key insights:

  • Both classes exceed minimum threshold: Class 0 constitutes 79.80% and class 1 constitutes 20.20% of the dataset, with both surpassing the 10% minimum threshold.
  • No classes flagged for imbalance: The pass/fail assessment indicates that neither class is under-represented according to the defined criterion.
  • Class distribution is asymmetric: The majority class (0) is nearly four times as prevalent as the minority class (1), as shown in both the table and the bar plot.

The results indicate that, while the dataset exhibits an asymmetric class distribution with a dominant majority class, both classes meet the minimum representation threshold set by the test. No classes are flagged for high imbalance risk under the current parameters, and the class proportions are visually confirmed by the bar plot. This distribution provides a basis for further model development without immediate concerns regarding under-representation of any class.

Parameters:

{
  "min_percent_threshold": 10
}
            

Tables

Exited Class Imbalance

Exited Percentage of Rows (%) Pass/Fail
0 79.80% Pass
1 20.20% Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:raw_data:660c
2026-03-12 20:49:34,907 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:raw_data does not exist in model's document
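The class percentages and threshold check above reduce to a value_counts comparison; a sketch on an illustrative target column:

```python
import pandas as pd

# Illustrative target column with a 79.8% / 20.2% split
exited = pd.Series([0] * 798 + [1] * 202, name="Exited")

pct = exited.value_counts(normalize=True) * 100
passes = pct >= 10  # min_percent_threshold of 10%
print(pct.round(2))
print(bool(passes.all()))  # True
```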
validmind.data_validation.Duplicates:raw_data

✅ Duplicates Raw Data

The Duplicates:raw_data test evaluates the presence of duplicate rows within the dataset to ensure data quality and reduce the risk of model overfitting due to redundant information. The results table summarizes the absolute number and percentage of duplicate rows detected in the dataset, with the test configured to flag results only if the count exceeds a minimum threshold of 1. The table indicates both the total number of duplicate rows and their proportion relative to the dataset size.

Key insights:

  • No duplicate rows detected: The dataset contains 0 duplicate rows, as indicated by the "Number of Duplicates" value.
  • Zero percent duplication rate: The "Percentage of Rows (%)" is 0.0%, confirming the absence of redundancy in the dataset.

The results demonstrate that the dataset is free from duplicate entries, indicating a high level of data integrity with respect to row uniqueness. The absence of duplicates reduces the risk of model bias due to repeated information and supports reliable model training and evaluation.

Parameters:

{
  "min_threshold": 1
}
            

Tables

Duplicate Rows Results for Dataset

Number of Duplicates Percentage of Rows (%)
0 0.0
2026-03-12 20:49:38,152 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Duplicates:raw_data does not exist in model's document
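A duplicate-row count like the one above can be reproduced directly with pandas' duplicated; on an illustrative frame:

```python
import pandas as pd

# Illustrative frame: the last row duplicates the second
df = pd.DataFrame({"a": [1, 2, 2], "b": ["x", "y", "y"]})

n_dupes = int(df.duplicated().sum())
pct = n_dupes / len(df) * 100
print(n_dupes, round(pct, 2))  # 1 33.33
```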
validmind.data_validation.HighCardinality:raw_data

✅ High Cardinality Raw Data

The High Cardinality test evaluates the number of unique values in categorical columns to identify potential risks associated with high cardinality, such as overfitting or data noise. The results table presents the number and percentage of distinct values for each categorical column, along with a pass/fail status based on a threshold of 10% distinct values. Both "Geography" and "Gender" columns are assessed, with their respective distinct value counts and percentages reported.

Key insights:

  • All categorical columns pass cardinality threshold: Both "Geography" (3 distinct values, 0.0375%) and "Gender" (2 distinct values, 0.025%) are well below the 10% threshold, resulting in a "Pass" status for each.
  • Low cardinality observed across features: The number of unique values in both columns is minimal relative to the dataset size, indicating low cardinality in all assessed categorical features.

The results indicate that all evaluated categorical columns exhibit low cardinality, with distinct value percentages substantially below the defined threshold. No evidence of high cardinality risk is present in the assessed features, supporting data quality and reducing the likelihood of overfitting related to categorical variable granularity.

Parameters:

{
  "num_threshold": 100,
  "percent_threshold": 0.1,
  "threshold_type": "percent"
}
            

Tables

Column Number of Distinct Values Percentage of Distinct Values (%) Pass/Fail
Geography 3 0.0375 Pass
Gender 2 0.0250 Pass
2026-03-12 20:49:41,574 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighCardinality:raw_data does not exist in model's document
validmind.data_validation.Skewness:raw_data

❌ Skewness Raw Data

The Skewness test evaluates the asymmetry of numerical data distributions to identify deviations from normality that may impact model performance. The results table presents skewness values for each numeric column, indicating whether each value falls below the maximum threshold of 1. Columns with skewness values below this threshold are marked as "Pass," while those exceeding it are marked as "Fail." The table enables assessment of distributional symmetry across all monitored features.

Key insights:

  • Most features exhibit low skewness: The majority of columns, including CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary, have skewness values well within the threshold, indicating near-symmetric distributions.
  • Age and Exited exceed skewness threshold: Age (skewness = 1.0245) and Exited (skewness = 1.4847) both exceed the maximum threshold, resulting in a "Fail" status for these columns.
  • Highest skewness observed in Exited: The Exited column displays the highest skewness (1.4847), indicating a pronounced asymmetry in its distribution relative to other features.
  • Negative skewness present but within limits: Features such as HasCrCard (-0.8867), Balance (-0.1353), and CreditScore (-0.062) show negative skewness, but all remain within the acceptable range.

The results indicate that most numeric features in the dataset maintain distributional symmetry within the defined threshold, supporting data quality for model development. However, Age and Exited display elevated skewness, with Exited showing the most pronounced asymmetry. These findings highlight localized distributional imbalances that may warrant further examination depending on model requirements and use case.

Parameters:

{
  "max_threshold": 1
}
            

Tables

Skewness Results for Dataset

Column Skewness Pass/Fail
CreditScore -0.0620 Pass
Age 1.0245 Fail
Tenure 0.0077 Pass
Balance -0.1353 Pass
NumOfProducts 0.7172 Pass
HasCrCard -0.8867 Pass
IsActiveMember -0.0796 Pass
EstimatedSalary 0.0095 Pass
Exited 1.4847 Fail
2026-03-12 20:49:46,228 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Skewness:raw_data does not exist in model's document
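A skewness check like the one above can be reproduced with pandas' bias-corrected skew; on illustrative data:

```python
import pandas as pd

# Illustrative: Age has a long right tail; Tenure is symmetric
df = pd.DataFrame({
    "Age": [18, 25, 30, 32, 35, 60, 92],
    "Tenure": [1, 2, 3, 4, 5, 6, 7],
})

skew = df.skew()
passes = skew.abs() <= 1  # max_threshold from the test parameters
print(skew.round(4))
print(passes)
```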
validmind.data_validation.UniqueRows:raw_data

❌ Unique Rows Raw Data

The UniqueRows test evaluates the diversity of the dataset by measuring the proportion of unique values in each column relative to the total row count, with a minimum threshold set at 1%. The results table presents, for each column, the number and percentage of unique values, along with a pass/fail outcome based on whether the uniqueness percentage meets or exceeds the threshold. Columns such as EstimatedSalary, Balance, and CreditScore exhibit high uniqueness percentages and pass the test, while most categorical and low-cardinality columns fall below the threshold and fail.

Key insights:

  • High uniqueness in continuous variables: EstimatedSalary (100%), Balance (63.6%), and CreditScore (5.65%) exceed the 1% uniqueness threshold, indicating substantial diversity in these columns.
  • Low uniqueness in categorical variables: Columns such as Geography (0.0375%), Gender (0.025%), HasCrCard (0.025%), IsActiveMember (0.025%), and Exited (0.025%) have very low uniqueness percentages and fail the test.
  • Limited diversity in Age and Tenure: Age (0.8625%) and Tenure (0.1375%) do not meet the uniqueness threshold, reflecting limited distinct values relative to the dataset size.
  • Majority of columns fail uniqueness threshold: Only 3 out of 11 columns pass the test, with the remaining 8 columns failing to meet the minimum uniqueness requirement.

The results indicate that while continuous variables such as EstimatedSalary, Balance, and CreditScore provide substantial row-level diversity, the majority of columns—particularly those representing categorical or low-cardinality features—exhibit low uniqueness and do not meet the prescribed threshold. This distribution reflects a dataset structure where diversity is concentrated in a subset of variables, with most categorical features contributing limited unique information at the row level.

Parameters:

{
  "min_percent_threshold": 1
}
            

Tables

Column Number of Unique Values Percentage of Unique Values (%) Pass/Fail
CreditScore 452 5.6500 Pass
Geography 3 0.0375 Fail
Gender 2 0.0250 Fail
Age 69 0.8625 Fail
Tenure 11 0.1375 Fail
Balance 5088 63.6000 Pass
NumOfProducts 4 0.0500 Fail
HasCrCard 2 0.0250 Fail
IsActiveMember 2 0.0250 Fail
EstimatedSalary 8000 100.0000 Pass
Exited 2 0.0250 Fail
2026-03-12 20:49:51,268 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:raw_data does not exist in model's document
validmind.data_validation.TooManyZeroValues:raw_data

❌ Too Many Zero Values Raw Data

The TooManyZeroValues test identifies numerical columns with a proportion of zero values exceeding a defined threshold, set here at 0.03%. The results table summarizes the number and percentage of zero values for each numerical column, along with a pass/fail status based on the threshold. All four evaluated columns—Tenure, Balance, HasCrCard, and IsActiveMember—are reported with their respective row counts, zero value counts, and calculated percentages.

Key insights:

  • All evaluated columns exceed zero value threshold: Each of the four numerical columns has a percentage of zero values significantly above the 0.03% threshold, resulting in a fail status for all.
  • High concentration of zeros in Balance and IsActiveMember: Balance contains 36.4% zero values, and IsActiveMember contains 48.01%, indicating substantial sparsity in these features.
  • Substantial zero values in binary indicator columns: HasCrCard and IsActiveMember, likely representing binary indicators, show 29.74% and 48.01% zero values respectively, reflecting a large proportion of one class.
  • Tenure column also affected: Tenure registers 4.04% zero values, which, while lower than other columns, still exceeds the threshold and results in a fail.

All tested numerical columns display zero value proportions well above the defined threshold, with Balance and IsActiveMember exhibiting particularly high sparsity. The prevalence of zeros across these features is consistent and systematic, as indicated by the fail status for each column. This pattern highlights a notable concentration of zero values in both continuous and binary-type variables within the dataset.
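The zero-value check can be sketched in pandas as below. Note this sketch treats `max_percent_threshold` as a percentage (matching the narrative's reading of 0.03 as 0.03%); whether the library interprets the parameter as a percent or a fraction is an assumption here, and `zero_value_report` is an illustrative name.

```python
import pandas as pd

def zero_value_report(df: pd.DataFrame, max_percent_threshold: float = 0.03) -> pd.DataFrame:
    """Approximate the TooManyZeroValues check for numerical columns."""
    rows = []
    for col in df.select_dtypes("number").columns:
        n_zero = int((df[col] == 0).sum())
        pct = n_zero / len(df) * 100
        rows.append({
            "Variable": col,
            "Number of Zero Values": n_zero,
            "Percentage of Zero Values (%)": round(pct, 4),
            "Pass/Fail": "Pass" if pct <= max_percent_threshold else "Fail",
        })
    return pd.DataFrame(rows)

# Toy data: half of "balance" is zero, "score" has no zeros
demo = pd.DataFrame({"balance": [0, 0, 100, 200], "score": [1, 2, 3, 4]})
print(zero_value_report(demo))
```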

Parameters:

{
  "max_percent_threshold": 0.03
}
            

Tables

Variable Row Count Number of Zero Values Percentage of Zero Values (%) Pass/Fail
Tenure 8000 323 4.0375 Fail
Balance 8000 2912 36.4000 Fail
HasCrCard 8000 2379 29.7375 Fail
IsActiveMember 8000 3841 48.0125 Fail
2026-03-12 20:49:58,666 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TooManyZeroValues:raw_data does not exist in model's document
validmind.data_validation.IQROutliersTable:raw_data

IQR Outliers Table Raw Data

The Interquartile Range Outliers Table (IQROutliersTable) test identifies and summarizes outliers in numerical features using the IQR method, with the threshold parameter set to 5 for this analysis. The results table presents the count and summary statistics of outliers detected for each numerical feature in the dataset. In this instance, the table is empty, indicating no outliers were detected under the specified threshold.

Key insights:

  • No outliers detected in any feature: The test did not identify any data points as outliers across all numerical features at the threshold of 5.
  • Dataset exhibits high conformity to IQR bounds: All numerical feature values fall within the calculated IQR-based outlier limits, indicating absence of extreme deviations.

The absence of detected outliers at the specified threshold suggests that the dataset's numerical features are well-contained within the expected value ranges. This result indicates a high degree of distributional regularity and minimal presence of extreme values under the applied IQR criteria.
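The IQR rule the test applies can be sketched as follows. With `threshold=5` (as in this run) the fences are far wider than the conventional 1.5×IQR rule, so only very extreme values are flagged; `iqr_outliers` is an illustrative helper, not ValidMind's implementation.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, threshold: float = 5) -> pd.Series:
    """Return values outside [Q1 - threshold*IQR, Q3 + threshold*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return series[(series < lower) | (series > upper)]

# Toy data: only the extreme value escapes the wide fences
print(iqr_outliers(pd.Series([1, 2, 3, 4, 5, 1000]), threshold=5))
```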

Parameters:

{
  "threshold": 5
}
            

Tables

Summary of Outliers Detected by IQR Method (table is empty: no outliers were detected at the threshold of 5)

2026-03-12 20:50:01,612 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.IQROutliersTable:raw_data does not exist in model's document
validmind.data_validation.DescriptiveStatistics:preprocessed_data

Descriptive Statistics Preprocessed Data

The Descriptive Statistics test evaluates the distributional characteristics and diversity of both numerical and categorical variables in the preprocessed dataset. The results are presented in two summary tables: one for numerical variables, detailing central tendency, dispersion, and range; and one for categorical variables, summarizing value counts, unique value diversity, and the dominance of top categories. These tables provide a comprehensive overview of the dataset’s structure, supporting assessment of data quality and potential risk factors.

Key insights:

  • Zero-inflation and left shift in Balance: The Balance variable exhibits a minimum of 0.0, a median of 103,828.0, and a maximum of 250,898.0, with a mean (82,744.6) substantially below the median. A mean below the median reflects the large mass of zero balances pulling the average down, indicating left skew by the mean-median heuristic rather than right-skewness.
  • CreditScore distribution is symmetric and complete: CreditScore shows a mean (648.2) closely aligned with the median (650.0), and a full range from 350.0 to 850.0, suggesting a well-populated and symmetric distribution.
  • Binary variables show moderate class balance: HasCrCard and IsActiveMember are both binary, with HasCrCard having 70.1% of entries as 1 and IsActiveMember at 47.3% as 1, indicating moderate class balance without extreme dominance.
  • Categorical variables have limited diversity: Geography has three unique values, with France as the top value at 46.47% frequency. Gender is evenly split, with Male at 50.25%, indicating no single category is overwhelmingly dominant.

The dataset demonstrates generally balanced distributions across both numerical and categorical variables, with the exception of Balance, which is notably right-skewed and contains a substantial proportion of zero values. Categorical variables display limited but sufficient diversity, and binary variables do not exhibit extreme class imbalance. These characteristics provide a stable foundation for subsequent modeling, with the primary distributional risk concentrated in the Balance variable.
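The mean-versus-median comparison used to flag the Balance distribution can be reproduced with a `describe()` call. The mini-sample below is hypothetical, standing in for the preprocessed Balance column with its spike at zero.

```python
import pandas as pd

# Hypothetical mini-sample mimicking a zero-inflated balance column
balance = pd.Series([0, 0, 0, 95_000, 104_000, 110_000, 130_000, 250_000])
print(balance.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95]))

# The zero spike drags the mean below the median
print("mean < median:", balance.mean() < balance.median())
```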

Tables

Numerical Variables

Name Count Mean Std Min 25% 50% 75% 90% 95% Max
CreditScore 3232.0 648.1894 97.2398 350.0 582.0 650.0 715.0 776.0 812.0 850.0
Tenure 3232.0 5.0226 2.9093 0.0 3.0 5.0 8.0 9.0 10.0 10.0
Balance 3232.0 82744.5585 61546.8678 0.0 0.0 103828.0 129848.0 151020.0 165337.0 250898.0
NumOfProducts 3232.0 1.5090 0.6694 1.0 1.0 1.0 2.0 2.0 3.0 4.0
HasCrCard 3232.0 0.7011 0.4578 0.0 0.0 1.0 1.0 1.0 1.0 1.0
IsActiveMember 3232.0 0.4725 0.4993 0.0 0.0 0.0 1.0 1.0 1.0 1.0
EstimatedSalary 3232.0 99725.4095 57416.6108 12.0 50950.0 98820.0 149928.0 179481.0 189189.0 199909.0

Categorical Variables

Name Count Number of Unique Values Top Value Top Value Frequency Top Value Frequency %
Geography 3232.0 3.0 France 1502.0 46.47
Gender 3232.0 2.0 Male 1624.0 50.25
2026-03-12 20:50:07,365 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:preprocessed_data does not exist in model's document
validmind.data_validation.TabularDescriptionTables:preprocessed_data

Tabular Description Tables Preprocessed Data

The TabularDescriptionTables:preprocessed_data test evaluates the descriptive statistics and data completeness of numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables and two categorical variables, including measures of central tendency, range, missingness, and data type. All variables are reported with their observed value ranges, means, and unique value counts, providing a comprehensive overview of the dataset's structure and integrity.

Key insights:

  • No missing values detected: All numerical and categorical variables report 0.0% missing values, indicating complete data coverage across all fields.
  • Numerical variables span expected ranges: Variables such as CreditScore (350.0–850.0), Balance (0.0–250,898.09), and EstimatedSalary (11.58–199,909.32) display wide but bounded ranges, with means consistent with their respective domains.
  • Categorical variables are low cardinality: Geography contains three unique values (Germany, Spain, France), and Gender contains two (Female, Male), both with 0.0% missingness.
  • Binary indicators are well-formed: HasCrCard, IsActiveMember, and Exited are encoded as int64 with minimum and maximum values of 0 and 1, confirming binary structure.

The dataset exhibits complete data integrity with no missing values across all variables. Numerical and categorical fields are well-structured, with value ranges and cardinalities consistent with their intended use. The absence of missingness and the presence of clearly defined variable types support robust downstream modeling and analysis.

Tables

Numerical Variable Num of Obs Mean Min Max Missing Values (%) Data Type
CreditScore 3232 648.1894 350.00 850.00 0.0 int64
Tenure 3232 5.0226 0.00 10.00 0.0 int64
Balance 3232 82744.5585 0.00 250898.09 0.0 float64
NumOfProducts 3232 1.5090 1.00 4.00 0.0 int64
HasCrCard 3232 0.7011 0.00 1.00 0.0 int64
IsActiveMember 3232 0.4725 0.00 1.00 0.0 int64
EstimatedSalary 3232 99725.4095 11.58 199909.32 0.0 float64
Exited 3232 0.5000 0.00 1.00 0.0 int64
Categorical Variable Num of Obs Num of Unique Values Unique Values Missing Values (%) Data Type
Geography 3232.0 3.0 ['Germany' 'Spain' 'France'] 0.0 object
Gender 3232.0 2.0 ['Female' 'Male'] 0.0 object
2026-03-12 20:50:11,733 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:preprocessed_data does not exist in model's document
validmind.data_validation.MissingValues:preprocessed_data

✅ Missing Values Preprocessed Data

The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the 1.0% threshold. All features in the dataset are listed with their respective missing value statistics and test outcomes.

Key insights:

  • No missing values detected: All features report 0 missing values, corresponding to 0.0% missingness for each column.
  • Universal pass across features: Every feature meets the missing value threshold, with all columns marked as "Pass" in the results.

The dataset demonstrates complete data integrity with respect to missing values, as no feature contains any missing entries. All columns satisfy the established missingness threshold, indicating a high level of data completeness for subsequent modeling or analysis.
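The missingness check can be approximated as below; `missing_value_report` is an illustrative name, and the pass condition (missing percentage below the threshold) follows the 1.0% threshold described above rather than ValidMind's internal code.

```python
import numpy as np
import pandas as pd

def missing_value_report(df: pd.DataFrame, min_percentage_threshold: float = 1.0) -> pd.DataFrame:
    """Approximate the MissingValues check: percent of NaNs per column."""
    pct = df.isna().mean() * 100
    return pd.DataFrame({
        "Column": df.columns,
        "Number of Missing Values": df.isna().sum().values,
        "Percentage of Missing Values (%)": pct.values,
        "Pass/Fail": np.where(pct.values < min_percentage_threshold, "Pass", "Fail"),
    })

# Toy data: "b" is 75% missing and fails, "a" is complete and passes
demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, None, None, None]})
print(missing_value_report(demo))
```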

Parameters:

{
  "min_percentage_threshold": 1
}
            

Tables

Column Number of Missing Values Percentage of Missing Values (%) Pass/Fail
CreditScore 0 0.0 Pass
Geography 0 0.0 Pass
Gender 0 0.0 Pass
Tenure 0 0.0 Pass
Balance 0 0.0 Pass
NumOfProducts 0 0.0 Pass
HasCrCard 0 0.0 Pass
IsActiveMember 0 0.0 Pass
EstimatedSalary 0 0.0 Pass
Exited 0 0.0 Pass
2026-03-12 20:50:15,144 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:preprocessed_data does not exist in model's document
validmind.data_validation.TabularNumericalHistograms:preprocessed_data

Tabular Numerical Histograms Preprocessed Data

The TabularNumericalHistograms:preprocessed_data test provides visualizations of the distribution of each numerical feature in the dataset using histograms. These plots enable assessment of central tendency, spread, skewness, and the presence of outliers for each variable. The results display the frequency distribution for CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary, allowing for identification of distributional characteristics and potential data quality issues.

Key insights:

  • CreditScore is unimodal with a longer lower tail: The CreditScore histogram shows a unimodal distribution concentrated between 550 and 750, with fewer observations at the extremes; given the observed range (350 to 850) around a median of 650, the distribution extends farther toward lower scores than higher ones.
  • Tenure is nearly uniform with edge effects: The Tenure variable is distributed almost uniformly across its range, except for lower frequencies at the minimum (0) and maximum (10) values.
  • Balance is bimodal with a spike at zero: The Balance histogram reveals a pronounced spike at zero, followed by a bell-shaped distribution for nonzero values, indicating a substantial subset of accounts with zero balance.
  • NumOfProducts is highly concentrated at lower values: Most observations are at 1 or 2 products, with a steep drop-off for 3 and 4 products, indicating limited product diversification among customers.
  • HasCrCard and IsActiveMember are binary with class imbalance: Both variables are binary, with HasCrCard skewed toward 1 (majority have a credit card) and IsActiveMember showing a slight majority for 0 (not active).
  • EstimatedSalary is approximately uniform: The EstimatedSalary histogram is relatively flat across its range, indicating an even distribution of salary values without pronounced skew or clustering.

The histograms collectively indicate that most numerical features exhibit either uniform or moderately skewed distributions, with notable concentration effects in Balance (at zero) and NumOfProducts (at lower values). Binary features display class imbalance, and no extreme outliers are visually apparent in the continuous variables. These distributional characteristics provide a clear overview of the input data structure and highlight areas of concentration and potential segmentation within the dataset.
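Histograms of this kind can be generated with a few lines of matplotlib. This is a minimal sketch under assumed synthetic data; the column names and generators below only mimic the shapes described above (a roughly bell-shaped CreditScore and a zero-inflated Balance) and are not the model's data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two of the preprocessed numerical columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CreditScore": rng.normal(650, 97, 500).clip(350, 850),
    # ~36% zero balances, remainder bell-shaped around 120k
    "Balance": np.where(rng.random(500) < 0.36, 0.0, rng.normal(120_000, 30_000, 500)),
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, df.columns):
    ax.hist(df[col], bins=30)
    ax.set_title(col)
fig.tight_layout()
fig.savefig("histograms.png")
```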

Figures

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:0d78
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:f897
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:6dcd
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:055a
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:ad98
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:3147
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:97ee
2026-03-12 20:50:24,322 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:preprocessed_data does not exist in model's document
validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data

Tabular Categorical Bar Plots Preprocessed Data

The TabularCategoricalBarPlots test evaluates the distribution of categorical variables by generating bar plots that display the frequency of each category within the dataset. The resulting plots provide a visual summary of the counts for each category in the "Geography" and "Gender" features. These visualizations enable assessment of the dataset's composition and highlight the relative representation of each category.

Key insights:

  • Balanced gender distribution: The "Gender" feature shows nearly equal counts for "Male" and "Female" categories, indicating no significant imbalance.
  • Geography category imbalance observed: The "Geography" feature displays higher representation for "France" compared to "Germany" and "Spain," with "Spain" having the lowest count among the three categories.

The categorical composition of the dataset is characterized by a balanced gender split and a notable imbalance in the "Geography" feature, where "France" is the most represented category. These patterns provide clarity on the underlying distribution of categorical variables and may inform further analysis of model input representativeness.

Figures

ValidMind Figure validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data:6843
ValidMind Figure validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data:595a
2026-03-12 20:50:29,791 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data does not exist in model's document
validmind.data_validation.TargetRateBarPlots:preprocessed_data

Target Rate Bar Plots Preprocessed Data

The TargetRateBarPlots test visualizes the distribution and target rates of categorical features to provide insight into model decision patterns. The results display paired bar plots for each categorical variable, showing both the frequency of each category and the corresponding mean target (churn) rate. This enables a direct comparison of how target rates vary across different groups within each feature.

Key insights:

  • Geography exhibits target rate variation: The target rate for Germany is notably higher than for France and Spain, with Germany exceeding 0.6 while France and Spain are closer to 0.4.
  • Balanced category representation in Gender: Male and Female categories have nearly identical counts, indicating balanced representation in the dataset.
  • Gender target rates differ: The target rate for Female is higher than for Male, with Female above 0.5 and Male below 0.5.
  • Uneven category counts in Geography: France has the highest count, followed by Germany and then Spain, indicating some imbalance in category frequencies.

The results reveal distinct differences in target rates across both Geography and Gender features, with Germany and Female categories exhibiting higher churn rates relative to their counterparts. Category representation is balanced for Gender but shows moderate imbalance for Geography. These patterns highlight areas where model outcomes differ by group, providing a basis for further analysis of model behavior and potential risk segmentation.
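The per-category counts and target rates behind the paired bar plots reduce to a single groupby. The sample frame below is hypothetical and only illustrates the computation.

```python
import pandas as pd

# Hypothetical sample mirroring the Geography/Exited relationship described above
df = pd.DataFrame({
    "Geography": ["Germany", "Germany", "France", "France", "Spain", "Spain"],
    "Exited":    [1,          1,         0,        1,        0,       0],
})

# Count and mean target rate per category: the two quantities the bar plots show
summary = df.groupby("Geography")["Exited"].agg(count="count", target_rate="mean")
print(summary)
```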

Figures

ValidMind Figure validmind.data_validation.TargetRateBarPlots:preprocessed_data:07e8
ValidMind Figure validmind.data_validation.TargetRateBarPlots:preprocessed_data:fb7d
2026-03-12 20:50:35,724 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TargetRateBarPlots:preprocessed_data does not exist in model's document
validmind.data_validation.DescriptiveStatistics:development_data

Descriptive Statistics Development Data

The Descriptive Statistics test evaluates the distributional characteristics of numerical variables in both the training and test datasets. The results present summary statistics—including mean, standard deviation, minimum, maximum, and key percentiles—for each variable, enabling assessment of central tendency, dispersion, and potential outliers. The statistics are reported separately for the train and test datasets, allowing for direct comparison of data consistency and distributional alignment across development splits.

Key insights:

  • Consistent central tendencies across splits: Mean and median values for key variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary are closely aligned between the training and test datasets, indicating stable distributions.
  • Comparable dispersion and range: Standard deviations and value ranges for all variables are similar between datasets, with no evidence of significant shifts or anomalies in spread.
  • No extreme outliers detected: Maximum and minimum values for all variables fall within expected operational ranges, with no evidence of extreme or implausible values in either dataset.
  • Balanced categorical encodings: Binary variables (HasCrCard, IsActiveMember) display mean values near 0.5–0.7, with standard deviations consistent with balanced categorical distributions.

The descriptive statistics indicate strong alignment between the training and test datasets, with stable central tendencies and dispersion across all monitored variables. No material outliers or distributional anomalies are observed, supporting the representativeness and integrity of the development data. The observed consistency provides a sound basis for subsequent modeling and validation activities.
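A side-by-side train/test comparison like the table below can be assembled by stacking `describe()` output for each split; `compare_splits` is an illustrative helper, not part of ValidMind.

```python
import pandas as pd

def compare_splits(train: pd.DataFrame, test: pd.DataFrame,
                   percentiles=(0.25, 0.5, 0.75, 0.9, 0.95)) -> pd.DataFrame:
    """Stack summary statistics for two splits for side-by-side comparison."""
    stats = {
        "train": train.describe(percentiles=list(percentiles)),
        "test": test.describe(percentiles=list(percentiles)),
    }
    # Dict keys become the outer index level ("train" / "test")
    return pd.concat(stats)

# Toy splits with a single numerical column
train = pd.DataFrame({"x": [1, 2, 3, 4]})
test = pd.DataFrame({"x": [2, 3]})
print(compare_splits(train, test))
```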

Tables

dataset Name Count Mean Std Min 25% 50% 75% 90% 95% Max
train_dataset_final CreditScore 2585.0 648.0870 97.1601 350.0 581.0 650.0 717.0 775.0 811.0 850.0
train_dataset_final Tenure 2585.0 5.0456 2.9270 0.0 3.0 5.0 8.0 9.0 10.0 10.0
train_dataset_final Balance 2585.0 82364.0648 61815.3725 0.0 0.0 103549.0 129935.0 151069.0 165346.0 250898.0
train_dataset_final NumOfProducts 2585.0 1.5014 0.6614 1.0 1.0 1.0 2.0 2.0 3.0 4.0
train_dataset_final HasCrCard 2585.0 0.7029 0.4571 0.0 0.0 1.0 1.0 1.0 1.0 1.0
train_dataset_final IsActiveMember 2585.0 0.4716 0.4993 0.0 0.0 0.0 1.0 1.0 1.0 1.0
train_dataset_final EstimatedSalary 2585.0 100001.3237 57409.5810 12.0 51553.0 99476.0 150228.0 179692.0 190140.0 199909.0
test_dataset_final CreditScore 647.0 648.5981 97.6318 350.0 584.0 649.0 712.0 780.0 816.0 850.0
test_dataset_final Tenure 647.0 4.9304 2.8377 0.0 3.0 5.0 7.0 9.0 10.0 10.0
test_dataset_final Balance 647.0 84264.7690 60485.4814 0.0 0.0 104478.0 129385.0 150914.0 164660.0 210433.0
test_dataset_final NumOfProducts 647.0 1.5394 0.7002 1.0 1.0 1.0 2.0 2.0 3.0 4.0
test_dataset_final HasCrCard 647.0 0.6940 0.4612 0.0 0.0 1.0 1.0 1.0 1.0 1.0
test_dataset_final IsActiveMember 647.0 0.4760 0.4998 0.0 0.0 0.0 1.0 1.0 1.0 1.0
test_dataset_final EstimatedSalary 647.0 98623.0319 57475.8860 599.0 49779.0 95393.0 149421.0 178705.0 187201.0 199662.0
2026-03-12 20:50:40,830 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:development_data does not exist in model's document
validmind.data_validation.TabularDescriptionTables:development_data

Tabular Description Tables Development Data

The TabularDescriptionTables test evaluates the distributional characteristics and completeness of numerical variables in both the training and test datasets. The results present summary statistics including count, mean, minimum, maximum, missing value percentage, and data type for each numerical variable. All variables are reported for both datasets, with no missing values observed and consistent data types across variables.

Key insights:

  • No missing values detected: All numerical variables in both training and test datasets have 0.0% missing values, indicating complete data coverage for these fields.
  • Consistent data types across datasets: Data types for all variables are stable between training and test sets, with integer types for discrete variables and float types for continuous variables.
  • Stable central tendencies between datasets: Means for key variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited are closely aligned between training and test datasets, with differences generally within a small margin.
  • Full observed range maintained: Minimum and maximum values for variables such as CreditScore (350.0 to 850.0), Tenure (0.0 to 10.0), and NumOfProducts (1.0 to 4.0) are consistent with expected value ranges, with no evidence of out-of-range or anomalous values.

The descriptive statistics indicate that the numerical variables in both the training and test datasets are complete, with no missing values and consistent data types. Central tendencies and value ranges are stable across datasets, supporting data integrity and comparability for subsequent modeling steps. No data quality issues or distributional anomalies are observed in the reported statistics.

Tables

dataset Numerical Variable Num of Obs Mean Min Max Missing Values (%) Data Type
train_dataset_final CreditScore 2585 648.0870 350.00 850.00 0.0 int64
train_dataset_final Tenure 2585 5.0456 0.00 10.00 0.0 int64
train_dataset_final Balance 2585 82364.0648 0.00 250898.09 0.0 float64
train_dataset_final NumOfProducts 2585 1.5014 1.00 4.00 0.0 int64
train_dataset_final HasCrCard 2585 0.7029 0.00 1.00 0.0 int64
train_dataset_final IsActiveMember 2585 0.4716 0.00 1.00 0.0 int64
train_dataset_final EstimatedSalary 2585 100001.3237 11.58 199909.32 0.0 float64
train_dataset_final Exited 2585 0.4956 0.00 1.00 0.0 int64
test_dataset_final CreditScore 647 648.5981 350.00 850.00 0.0 int64
test_dataset_final Tenure 647 4.9304 0.00 10.00 0.0 int64
test_dataset_final Balance 647 84264.7690 0.00 210433.08 0.0 float64
test_dataset_final NumOfProducts 647 1.5394 1.00 4.00 0.0 int64
test_dataset_final HasCrCard 647 0.6940 0.00 1.00 0.0 int64
test_dataset_final IsActiveMember 647 0.4760 0.00 1.00 0.0 int64
test_dataset_final EstimatedSalary 647 98623.0319 598.80 199661.50 0.0 float64
test_dataset_final Exited 647 0.5178 0.00 1.00 0.0 int64
2026-03-12 20:50:45,885 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:development_data does not exist in model's document
validmind.data_validation.ClassImbalance:development_data

✅ Class Imbalance Development Data

The Class Imbalance test evaluates the distribution of target classes within the training and test datasets to identify potential imbalances that could affect model performance. The results present the percentage representation of each class in both datasets, benchmarked against a minimum threshold of 10%. Visualizations display the proportion of each class, supporting interpretation of class balance.

Key insights:

  • Both classes exceed the minimum threshold: In both the training and test datasets, each class (Exited = 0 and Exited = 1) represents more than 10% of the total records, with all values above 48%.
  • Near-equal class distribution in training data: The training dataset shows a balanced split, with Exited = 0 at 50.44% and Exited = 1 at 49.56%.
  • Slight variation in test data proportions: The test dataset displays Exited = 1 at 51.78% and Exited = 0 at 48.22%, indicating a minor shift but maintaining overall balance.
  • All classes pass the imbalance criterion: No class in either dataset is flagged for imbalance, as all pass the 10% minimum threshold.

The class distribution in both the training and test datasets is balanced, with each class comprising nearly half of the records. No evidence of class imbalance is observed, and all classes meet the predefined minimum representation criterion. This distribution supports unbiased model training and evaluation with respect to the target variable.
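The class-balance criterion reduces to comparing each class's share of rows against the 10% floor. The sketch below approximates the check in pandas; `class_balance_report` is an illustrative name.

```python
import pandas as pd

def class_balance_report(target: pd.Series, min_percent_threshold: float = 10.0) -> pd.DataFrame:
    """Approximate the ClassImbalance check: each class's row share vs. a floor."""
    pct = target.value_counts(normalize=True) * 100
    return pd.DataFrame({
        "Exited": pct.index,
        "Percentage of Rows (%)": pct.round(2).values,
        "Pass/Fail": ["Pass" if p >= min_percent_threshold else "Fail" for p in pct.values],
    })

# Toy imbalanced target: the 5% minority class fails the 10% floor
y = pd.Series([0] * 95 + [1] * 5)
print(class_balance_report(y))
```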

Parameters:

{
  "min_percent_threshold": 10
}
            

Tables

dataset Exited Percentage of Rows (%) Pass/Fail
train_dataset_final 0 50.44% Pass
train_dataset_final 1 49.56% Pass
test_dataset_final 1 51.78% Pass
test_dataset_final 0 48.22% Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:development_data:b9fc
ValidMind Figure validmind.data_validation.ClassImbalance:development_data:52e9
2026-03-12 20:50:51,718 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:development_data does not exist in model's document
validmind.data_validation.UniqueRows:development_data

❌ Unique Rows Development Data

The UniqueRows test evaluates the diversity of each column in the dataset by measuring the proportion of unique values relative to the total row count, with a minimum threshold set at 1%. The results table presents, for both the training and test datasets, the number and percentage of unique values per column, along with a pass/fail outcome based on the threshold. Columns with a percentage of unique values below 1% are marked as "Fail," while those meeting or exceeding the threshold are marked as "Pass." This assessment provides a column-level view of data uniqueness and highlights areas of limited diversity.

Key insights:

  • High uniqueness in continuous variables: Columns such as EstimatedSalary and Balance exhibit high percentages of unique values (100% and 68%+ respectively) in both training and test datasets, consistently passing the uniqueness threshold.
  • Low uniqueness in categorical and binary variables: Columns representing categorical or binary features (e.g., HasCrCard, IsActiveMember, Geography_Germany, Gender_Male, Exited) show very low percentages of unique values (all below 1%), resulting in a fail outcome for these columns across both datasets.
  • Mixed results for ordinal variables: CreditScore demonstrates moderate to high uniqueness (16.3% in training, 45.4% in test), passing the threshold, while Tenure passes in the test set (1.7%) but fails in the training set (0.43%), indicating variability in uniqueness across splits.
  • Consistent patterns across datasets: The observed patterns of high uniqueness in continuous variables and low uniqueness in categorical variables are consistent between the training and test datasets.

The results indicate that continuous variables in both datasets provide substantial diversity, as reflected by high percentages of unique values and consistent pass outcomes. In contrast, categorical and binary variables uniformly fall below the uniqueness threshold, resulting in fail outcomes for these columns. This pattern reflects the inherent limitations of the UniqueRows test when applied to categorical features, as their value ranges are naturally constrained. The overall uniqueness profile is stable across both training and test datasets, with no evidence of data duplication or lack of diversity in continuous features.

Parameters:

{
  "min_percent_threshold": 1
}
            

Tables

dataset Column Number of Unique Values Percentage of Unique Values (%) Pass/Fail
train_dataset_final CreditScore 422 16.3250 Pass
train_dataset_final Tenure 11 0.4255 Fail
train_dataset_final Balance 1759 68.0464 Pass
train_dataset_final NumOfProducts 4 0.1547 Fail
train_dataset_final HasCrCard 2 0.0774 Fail
train_dataset_final IsActiveMember 2 0.0774 Fail
train_dataset_final EstimatedSalary 2585 100.0000 Pass
train_dataset_final Geography_Germany 2 0.0774 Fail
train_dataset_final Geography_Spain 2 0.0774 Fail
train_dataset_final Gender_Male 2 0.0774 Fail
train_dataset_final Exited 2 0.0774 Fail
test_dataset_final CreditScore 294 45.4405 Pass
test_dataset_final Tenure 11 1.7002 Pass
test_dataset_final Balance 455 70.3246 Pass
test_dataset_final NumOfProducts 4 0.6182 Fail
test_dataset_final HasCrCard 2 0.3091 Fail
test_dataset_final IsActiveMember 2 0.3091 Fail
test_dataset_final EstimatedSalary 647 100.0000 Pass
test_dataset_final Geography_Germany 2 0.3091 Fail
test_dataset_final Geography_Spain 2 0.3091 Fail
test_dataset_final Gender_Male 2 0.3091 Fail
test_dataset_final Exited 2 0.3091 Fail
2026-03-12 20:50:58,597 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:development_data does not exist in model's document
validmind.data_validation.TabularNumericalHistograms:development_data

Tabular Numerical Histograms Development Data

The TabularNumericalHistograms test provides a visual assessment of the distribution of each numerical feature in both the training and test datasets. The resulting histograms display the frequency distribution for each variable, enabling identification of distributional characteristics, skewness, and potential outliers. These visualizations facilitate an understanding of the underlying data structure and highlight any notable deviations or concentration patterns across features.

Key insights:

  • CreditScore distributions are unimodal: Both training and test datasets show unimodal distributions for CreditScore, with a concentration between 600 and 750 and a tail extending toward lower values, consistent with the observed minimum of 350 against a maximum of 850.
  • Tenure is approximately uniform with edge effects: Tenure displays a near-uniform distribution across most values, with slightly lower frequencies at the minimum and maximum bins in both datasets.
  • Balance exhibits a strong zero-inflation: A substantial proportion of records have a zero balance, with the remainder forming a bell-shaped distribution centered around 120,000–140,000.
  • NumOfProducts is highly concentrated at lower values: The majority of records have one or two products, with very few instances at three or four products.
  • Binary features show class imbalance: HasCrCard and IsActiveMember are both skewed, with HasCrCard dominated by the '1' class and IsActiveMember showing a moderate split but with more '0' values in the training set.
  • EstimatedSalary is uniformly distributed: EstimatedSalary displays a flat distribution across its range in both datasets, indicating no significant skew or concentration.
  • One-hot Geography indicators are imbalanced while Gender is balanced: Geography_Germany and Geography_Spain show more records in the 'false' category, while Gender_Male is nearly balanced between true and false.

The histograms reveal that most numerical features exhibit stable and consistent distributional patterns between training and test datasets, with no evidence of extreme outliers or abrupt distributional shifts. Notable characteristics include strong zero-inflation in Balance, class imbalance in several binary features, and a uniform distribution for EstimatedSalary. These patterns provide a clear view of the data landscape and support further analysis of model input integrity.
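
A comparable per-feature distribution check can be reproduced outside ValidMind. The sketch below is illustrative only, using synthetic data; `train_df`, `test_df`, and `histogram_summary` are hypothetical names, not part of the ValidMind API. It tabulates bin frequencies of one feature across the two splits using shared bin edges so the distributions are directly comparable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic stand-ins for the development and test splits (illustrative only)
train_df = pd.DataFrame({"CreditScore": rng.normal(650, 60, 1000).clip(350, 850)})
test_df = pd.DataFrame({"CreditScore": rng.normal(650, 60, 500).clip(350, 850)})

def histogram_summary(train, test, column, bins=20):
    """Tabulate per-bin frequencies of a numerical feature for both splits."""
    # Shared bin edges make the two distributions directly comparable
    edges = np.histogram_bin_edges(
        np.concatenate([train[column], test[column]]), bins=bins
    )
    train_counts, _ = np.histogram(train[column], bins=edges)
    test_counts, _ = np.histogram(test[column], bins=edges)
    return pd.DataFrame(
        {"bin_left": edges[:-1], "train": train_counts, "test": test_counts}
    )

summary = histogram_summary(train_df, test_df, "CreditScore")
```

Comparing the `train` and `test` columns row by row gives a quick numerical view of the same stability the histograms show visually.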

Figures

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:a05f
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:1a01
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:c0d9
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:ac1b
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:1b31
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:883f
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:6d69
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:fdf7
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:695a
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:4f8d
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:9c92
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:685b
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:b7db
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:cf19
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:be81
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:2586
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:80e1
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:db72
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:7f8d
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:6384
2026-03-12 20:51:09,506 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:development_data does not exist in model's document
validmind.data_validation.MutualInformation:development_data

Mutual Information Development Data

The Mutual Information test evaluates the statistical dependency between each feature and the target variable to quantify feature relevance for model training. The results are presented as normalized mutual information scores (ranging from 0 to 1) for both the development and test datasets, with a threshold of 0.01 indicated for interpretability. Bar plots display the relative importance of each feature, highlighting the distribution and magnitude of information content across variables.

Key insights:

  • NumOfProducts consistently dominates feature relevance: NumOfProducts exhibits the highest mutual information score in both development (≈0.105) and test (≈0.127) datasets, substantially exceeding all other features.
  • Majority of features show low information content: Most features register mutual information scores near or below the 0.01 threshold, particularly in the test dataset, where several features (Tenure, HasCrCard, EstimatedSalary, Geography_Spain) display scores at or near zero.
  • Score distribution is highly skewed: The mutual information scores are concentrated in a small subset of features, with a steep drop-off after the top one or two variables, indicating a non-uniform distribution of predictive power.
  • Notable variation in feature ranking across datasets: Some features, such as Balance and CreditScore, show increased mutual information in the test dataset compared to development, while others (IsActiveMember, HasCrCard) decrease or fall below the threshold.

The mutual information analysis reveals that predictive power is concentrated in a limited number of features, with NumOfProducts consistently providing the highest information content across both datasets. The majority of features contribute minimal or negligible information, as indicated by their low or near-zero scores. The distribution of mutual information is highly skewed, and there are observable shifts in feature relevance between development and test datasets, reflecting potential changes in feature-target relationships or sample composition.
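
A similar mutual information ranking can be sketched with scikit-learn. This is an illustrative reconstruction on synthetic data, not the ValidMind implementation; the feature names mirror the report but the values are made up, and scaling scores relative to the strongest feature is one possible normalization to [0, 1]:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "NumOfProducts": rng.integers(1, 5, 2000),
        "EstimatedSalary": rng.uniform(0, 200_000, 2000),
    }
)
# Synthetic target driven by NumOfProducts only, so it should dominate the scores
y = (X["NumOfProducts"] >= 3).astype(int)

scores = mutual_info_classif(X, y, random_state=0)
# Scale to [0, 1] relative to the strongest feature (one possible normalization)
normalized = scores / scores.max()
ranked = pd.Series(normalized, index=X.columns).sort_values(ascending=False)
```

Features whose normalized score falls below a small threshold (the report uses 0.01) are candidates for exclusion or further review.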

Parameters:

{
  "min_threshold": 0.01
}

Figures

ValidMind Figure validmind.data_validation.MutualInformation:development_data:a34d
ValidMind Figure validmind.data_validation.MutualInformation:development_data:c98d
2026-03-12 20:51:21,677 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MutualInformation:development_data does not exist in model's document
validmind.data_validation.PearsonCorrelationMatrix:development_data

Pearson Correlation Matrix Development Data

The Pearson Correlation Matrix test evaluates the linear dependency between all pairs of numerical variables in the dataset by calculating Pearson correlation coefficients and visualizing them in a heat map. The resulting matrices for both the development (train) and test datasets display the magnitude and direction of correlations, with coefficients ranging from -1 to 1. Correlation values above 0.7 (absolute) are highlighted to indicate high linear dependency, while the color scale provides an at-a-glance overview of the correlation structure across variables.

Key insights:

  • No high correlations detected: All off-diagonal correlation coefficients in both development and test datasets are below the 0.7 threshold, indicating an absence of strong linear relationships between variable pairs.
  • Consistent correlation structure across splits: The correlation patterns and magnitudes are stable between the development and test datasets, with the highest observed correlations (e.g., Balance and Geography_Germany at 0.41) remaining moderate and consistent.
  • Low risk of multicollinearity: The lack of high-magnitude correlations suggests minimal redundancy among input variables, reducing the risk of multicollinearity affecting model estimation or interpretability.

The correlation analysis demonstrates that the dataset's numerical variables are largely independent, with no evidence of strong linear dependencies or redundancy. The observed correlation structure is stable across both development and test datasets, supporting the integrity of the feature set for modeling purposes.
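
The same style of check can be sketched directly with pandas. This is an illustrative example on synthetic data (the column names echo the report but the values and the 0.4 coefficient are fabricated), showing how a correlation matrix is computed and screened against the 0.7 cut-off:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
balance = rng.normal(size=1000)
# Synthetic features with one moderate linear relationship built in
df = pd.DataFrame(
    {
        "Balance": balance,
        "Geography_Germany": 0.4 * balance + rng.normal(size=1000),
        "EstimatedSalary": rng.normal(size=1000),
    }
)

corr = df.corr(method="pearson")
# Keep only the upper triangle, then flag pairs above the 0.7 cut-off
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
high_pairs = corr.where(upper).abs().stack()
high_pairs = high_pairs[high_pairs > 0.7]
```

An empty `high_pairs` result corresponds to the "no high correlations detected" outcome reported above.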

Figures

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:development_data:a46c
ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:development_data:f598
2026-03-12 20:51:30,012 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.PearsonCorrelationMatrix:development_data does not exist in model's document
validmind.data_validation.HighPearsonCorrelation:development_data

❌ High Pearson Correlation Development Data

The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, which may indicate redundancy or multicollinearity. The results table lists the top ten strongest correlations for both the training and test datasets, reporting the Pearson correlation coefficient and a Pass/Fail status based on a threshold of 0.3. Correlation coefficients above this threshold are marked as "Fail," signaling higher-than-acceptable linear association between the respective feature pairs.

Key insights:

  • Two feature pairs exceed correlation threshold: In both the training and test datasets, the pairs (Balance, Geography_Germany) and (Geography_Germany, Geography_Spain) display absolute correlation coefficients above the 0.3 threshold, with values ranging from 0.3601 to 0.4144, resulting in a "Fail" status for these pairs.
  • All other feature pairs below threshold: The remaining feature pairs in both datasets have absolute correlation coefficients below 0.3, receiving a "Pass" status and indicating no further high linear associations among the top correlations.
  • Consistency across datasets: The same feature pairs exceed the threshold in both the training and test datasets, with similar coefficient magnitudes, indicating stable correlation structure between these variables across data splits.

The results indicate that the majority of feature pairs exhibit low to moderate linear relationships, with only two pairs consistently surpassing the defined correlation threshold in both datasets. The observed high correlations are limited to specific geography-related and balance features, while all other top feature pairs remain below the threshold, suggesting limited risk of widespread multicollinearity within the evaluated features.
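
The top-n ranking with pass/fail status can be reconstructed along these lines. This is a sketch on synthetic data, not the ValidMind test itself; `top_correlations` is a hypothetical helper, and the 0.45 coefficient is fabricated to mimic the flagged pair:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
balance = rng.normal(size=800)
df = pd.DataFrame(
    {
        "Balance": balance,
        "Geography_Germany": 0.45 * balance + rng.normal(size=800),
        "Tenure": rng.normal(size=800),
    }
)

def top_correlations(data, max_threshold=0.3, top_n=10):
    """Rank absolute pairwise Pearson correlations and mark threshold breaches."""
    corr = data.corr(method="pearson")
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().rename("coefficient").reset_index()
    pairs["pass_fail"] = np.where(
        pairs["coefficient"].abs() > max_threshold, "Fail", "Pass"
    )
    order = pairs["coefficient"].abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(top_n)

flagged = top_correlations(df)
```

Running this on each split separately, as the test does, makes it easy to confirm that the same pairs breach the threshold in both datasets.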

Parameters:

{
  "max_threshold": 0.3,
  "top_n_correlations": 10
}

Tables

dataset Columns Coefficient Pass/Fail
train_dataset_final (Balance, Geography_Germany) 0.4144 Fail
train_dataset_final (Geography_Germany, Geography_Spain) -0.3601 Fail
train_dataset_final (IsActiveMember, Exited) -0.2171 Pass
train_dataset_final (Geography_Germany, Exited) 0.2143 Pass
train_dataset_final (Balance, NumOfProducts) -0.1763 Pass
train_dataset_final (Balance, Geography_Spain) -0.1667 Pass
train_dataset_final (Balance, Exited) 0.1315 Pass
train_dataset_final (Gender_Male, Exited) -0.1125 Pass
train_dataset_final (NumOfProducts, Exited) -0.0646 Pass
train_dataset_final (Geography_Spain, Exited) -0.0542 Pass
test_dataset_final (Balance, Geography_Germany) 0.4067 Fail
test_dataset_final (Geography_Germany, Geography_Spain) -0.3672 Fail
test_dataset_final (Geography_Germany, Exited) 0.1765 Pass
test_dataset_final (Balance, NumOfProducts) -0.1718 Pass
test_dataset_final (IsActiveMember, Exited) -0.1640 Pass
test_dataset_final (Balance, Exited) 0.1479 Pass
test_dataset_final (Gender_Male, Exited) -0.1188 Pass
test_dataset_final (CreditScore, Exited) -0.0996 Pass
test_dataset_final (Balance, Geography_Spain) -0.0934 Pass
test_dataset_final (NumOfProducts, Gender_Male) -0.0846 Pass
2026-03-12 20:51:36,452 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:development_data does not exist in model's document
validmind.model_validation.ModelMetadata

Model Metadata

The ModelMetadata test compares the metadata of different models to assess consistency in architecture, framework, framework version, and programming language. The summary table presents metadata for two models, including their modeling technique, framework, framework version, and programming language. Both models are identified as using the SKlearnModel technique, the sklearn framework, version 1.8.0, and Python as the programming language.

Key insights:

  • Consistent modeling technique across models: Both models are classified as SKlearnModel, indicating uniformity in modeling approach.
  • Identical framework and version: Both models utilize the sklearn framework, version 1.8.0, ensuring compatibility in software dependencies.
  • Uniform programming language: Python is used for both models, supporting consistency in codebase and deployment environment.

The metadata comparison reveals complete alignment across all evaluated fields for the two models. No discrepancies or inconsistencies are observed in modeling technique, framework, framework version, or programming language. This uniformity supports streamlined model management and integration.
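
A rough equivalent of this metadata extraction can be sketched as follows. Note this is illustrative, not the ValidMind implementation: the report's SKlearnModel label comes from ValidMind's model wrapper, whereas this sketch records the raw estimator class, and the dictionary keys are hypothetical:

```python
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

models = {
    "log_model_champion": LogisticRegression(),
    "rf_model": RandomForestClassifier(),
}
# Collect one metadata row per model, mirroring the fields in the table above
metadata = [
    {
        "model": name,
        "technique": type(m).__name__,
        "framework": type(m).__module__.split(".")[0],
        "framework_version": sklearn.__version__,
        "language": "Python",
    }
    for name, m in models.items()
]
```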

Tables

model Modeling Technique Modeling Framework Framework Version Programming Language
log_model_champion SKlearnModel sklearn 1.8.0 Python
rf_model SKlearnModel sklearn 1.8.0 Python
2026-03-12 20:51:40,454 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.ModelMetadata does not exist in model's document
validmind.model_validation.sklearn.ModelParameters

Model Parameters

The Model Parameters test extracts and displays all configuration parameters for each model to support transparency and reproducibility. The results present a structured table listing parameter names and their corresponding values for both the logistic regression model (log_model_champion) and the random forest model (rf_model). Each parameter is shown alongside its assigned value, providing a comprehensive snapshot of the model configurations at the time of testing.

Key insights:

  • Explicit parameterization for both models: All parameters for log_model_champion and rf_model are explicitly listed, including regularization, solver, and iteration settings for the logistic regression model, and tree construction, sampling, and splitting criteria for the random forest model.
  • Non-default penalty and solver in logistic regression: The logistic regression model uses an l1 penalty with the liblinear solver, indicating a configuration that supports feature selection through regularization.
  • Random forest uses 50 estimators and fixed random state: The random forest model is configured with 50 trees (n_estimators=50) and a fixed random seed (random_state=42), supporting reproducibility and controlled variance.
  • Standard splitting and impurity settings in random forest: The random forest model applies the gini criterion, sqrt for max_features, and default values for minimum samples and impurity thresholds, reflecting standard tree growth parameters.

The extracted parameter set provides a transparent and reproducible record of model configurations for both the logistic regression and random forest models. The use of explicit regularization and solver choices in the logistic regression model, along with reproducibility controls and standard tree settings in the random forest model, collectively document the operational setup and support systematic auditing of model behavior.
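
For scikit-learn estimators, this kind of parameter snapshot rests on `get_params()`. The sketch below builds freshly constructed models with the key settings from the table above (a minimal illustration, not the notebook's actual trained models):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Configurations matching the key parameter values reported above
champion = LogisticRegression(penalty="l1", solver="liblinear")
challenger = RandomForestClassifier(n_estimators=50, random_state=42)

# get_params() returns every constructor argument, defaults included
rows = [
    {"model": name, "parameter": key, "value": value}
    for name, model in [("log_model_champion", champion), ("rf_model", challenger)]
    for key, value in sorted(model.get_params().items())
]
```

Flattening the parameters into one row per (model, parameter) pair reproduces the shape of the table above and makes configurations easy to diff across models.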

Tables

model Parameter Value
log_model_champion C 1
log_model_champion dual False
log_model_champion fit_intercept True
log_model_champion intercept_scaling 1
log_model_champion max_iter 100
log_model_champion penalty l1
log_model_champion solver liblinear
log_model_champion tol 0.0001
log_model_champion verbose 0
log_model_champion warm_start False
rf_model bootstrap True
rf_model ccp_alpha 0.0
rf_model criterion gini
rf_model max_features sqrt
rf_model min_impurity_decrease 0.0
rf_model min_samples_leaf 1
rf_model min_samples_split 2
rf_model min_weight_fraction_leaf 0.0
rf_model n_estimators 50
rf_model oob_score False
rf_model random_state 42
rf_model verbose 0
rf_model warm_start False
2026-03-12 20:51:46,956 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ModelParameters does not exist in model's document
validmind.model_validation.sklearn.ROCCurve

ROC Curve

The ROC Curve test evaluates the binary classification performance of the log_model_champion by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under the Curve (AUC) for both the training and test datasets. The resulting plots display the trade-off between the true positive rate and false positive rate across all classification thresholds, with the AUC providing a summary measure of the model's discriminative ability. The ROC curves for both datasets are compared against a baseline representing random classification (AUC = 0.5).

Key insights:

  • AUC indicates moderate discriminative power: The AUC is 0.69 on the training dataset and 0.66 on the test dataset, both above the random baseline of 0.5, indicating the model has moderate ability to distinguish between classes.
  • Consistent performance across datasets: The small difference in AUC between training and test datasets suggests stable model behavior and limited overfitting.
  • ROC curves consistently above random line: Both ROC curves remain above the diagonal line representing random classification, confirming the model's predictive value across thresholds.

The ROC Curve test results demonstrate that log_model_champion achieves moderate classification performance, with AUC values consistently above the random baseline on both training and test datasets. The close alignment of AUC scores across datasets indicates stable generalization, and the ROC curves confirm the model's ability to provide meaningful discrimination between classes throughout the range of possible thresholds.
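
The underlying computation can be reproduced with scikit-learn's `roc_curve` and `auc`. This sketch uses a synthetic classification task in place of the churn data, so the AUC values will differ from those reported; `auc_for` is a hypothetical helper:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification task standing in for the churn data
X, y = make_classification(n_samples=2000, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

def auc_for(features, labels):
    """AUC from the full ROC curve, using predicted positive-class probabilities."""
    fpr, tpr, _ = roc_curve(labels, model.predict_proba(features)[:, 1])
    return auc(fpr, tpr)

train_auc, test_auc = auc_for(X_tr, y_tr), auc_for(X_te, y_te)
```

Comparing `train_auc` and `test_auc` directly is the same stability check discussed above: a small gap suggests limited overfitting.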

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:1bc8
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:7ea4
2026-03-12 20:51:53,678 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve does not exist in model's document
validmind.model_validation.sklearn.MinimumROCAUCScore

✅ Minimum ROC AUC Score

The Minimum ROC AUC Score test evaluates whether the model's ROC AUC score meets or exceeds a specified minimum threshold, providing an assessment of the model's ability to distinguish between classes. The results table presents ROC AUC scores for both the training and test datasets, alongside the threshold value and pass/fail status for each dataset. Both datasets are evaluated against a threshold of 0.5, with the observed scores and outcomes reported for each.

Key insights:

  • ROC AUC scores exceed minimum threshold: Both the training (0.6867) and test (0.6634) datasets register ROC AUC scores above the 0.5 threshold.
  • Consistent pass status across datasets: The test is marked as "Pass" for both the train and test datasets, indicating consistent model performance relative to the defined criterion.
  • Moderate separation between classes: ROC AUC values in the range of 0.66–0.69 indicate moderate ability of the model to distinguish between classes on both datasets.

The results demonstrate that the model achieves ROC AUC scores above the specified minimum threshold on both training and test datasets, indicating moderate discriminatory power. The consistent pass status across datasets reflects stable model performance with respect to this metric.
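
The pass/fail logic of a minimum-score gate can be sketched in a few lines. This is an illustrative stand-in, not the ValidMind test; `minimum_roc_auc_check` is a hypothetical helper, and the toy labels and scores are made up:

```python
from sklearn.metrics import roc_auc_score

def minimum_roc_auc_check(y_true, y_score, min_threshold=0.5):
    """Score a prediction set and compare it against a minimum ROC AUC threshold."""
    score = roc_auc_score(y_true, y_score)
    return {
        "score": round(score, 4),
        "threshold": min_threshold,
        "pass_fail": "Pass" if score >= min_threshold else "Fail",
    }

# Toy example: three of the four orderable pairs are ranked correctly -> AUC 0.75
result = minimum_roc_auc_check([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Applied per dataset split with the model's predicted probabilities, this reproduces the Score/Threshold/Pass-Fail rows of the table above.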

Parameters:

{
  "min_threshold": 0.5
}

Tables

dataset Score Threshold Pass/Fail
train_dataset_final 0.6867 0.5 Pass
test_dataset_final 0.6634 0.5 Pass
2026-03-12 20:51:58,512 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumROCAUCScore does not exist in model's document

In summary

In this final notebook, you learned how to supplement ValidMind tests with your own custom tests and include them as additional evidence in your validation report.

With our ValidMind for model validation series of notebooks, you learned how to validate a model end-to-end with the ValidMind Library by running through some common scenarios in a typical model validation setting:

  • Verifying the data quality steps performed by the model development team
  • Independently replicating the champion model's results and conducting additional tests to assess performance, stability, and robustness
  • Setting up test inputs and a challenger model for comparative analysis
  • Running validation tests, analyzing results, and logging artifacts to ValidMind

Next steps

Work with your validation report

Now that you've logged all your test results and verified the work done by the model development team, head to the ValidMind Platform to wrap up your validation report. Continue to work on your validation report by:

  • Inserting additional test results: Click Link Evidence to Report under any section of 2. Validation in your validation report. (Learn more: Link evidence to reports)

  • Making qualitative edits to your test descriptions: Expand any linked evidence under Validator Evidence and click See evidence details to review and edit the ValidMind-generated test descriptions for quality and accuracy. (Learn more: Preparing validation reports)

  • Adding more findings: Click Link Finding to Report in any validation report section, then click + Create New Finding. (Learn more: Add and manage model findings)

  • Adding risk assessment notes: Click under Risk Assessment Notes in any validation report section to access the text editor and content editing toolbar, including an option to generate a draft with AI. Once generated, edit the draft to adhere to your organization's requirements. (Learn more: Work with content blocks)

  • Assessing compliance: Under the Guideline for any validation report section, click ASSESSMENT and select the compliance status from the drop-down menu. (Learn more: Provide compliance assessments)

  • Collaborating with other stakeholders: Use the ValidMind Platform's real-time collaborative features to work seamlessly with the rest of your organization, including model developers. Propose suggested changes to the model documentation, work with versioned history, and use comments to discuss specific portions of the model documentation. (Learn more: Collaborate with others)

When your validation report is complete and ready for review, submit it for approval from the same ValidMind Platform where you made your edits and collaborated with the rest of your organization, ensuring transparency and a thorough model validation history. (Learn more: Submit for approval)

Learn more

Now that you're familiar with the basics, you can explore the following notebooks to get a deeper understanding of how the ValidMind Library assists you in streamlining model validation:

Use cases

Discover more learning resources

Learn more about the ValidMind Library tools we used in this notebook:

We offer many interactive notebooks to help you automate testing, documenting, validating, and more:

Or, visit our documentation to learn more about ValidMind.


Copyright © 2023-2026 ValidMind Inc. All rights reserved.
Refer to LICENSE for details.
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial