ValidMind for model validation 4 — Finalize testing and reporting
Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this last notebook, finalize the compliance assessment process and have a complete validation report ready for review.
This notebook will walk you through how to supplement ValidMind tests with your own custom tests and include them as additional evidence in your validation report. A custom test is any function that takes a set of inputs and parameters as arguments and returns one or more outputs:
The function can be as simple or as complex as you need it to be — it can use external libraries, make API calls, or do anything else that you can do in Python.
The only requirement is that the function signature and return values can be "understood" and handled by the ValidMind Library. As such, custom tests offer added flexibility by extending the default tests provided by ValidMind, enabling you to document any type of model or use case.
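To make that shape concrete, here's a minimal sketch in plain Python. The function name, inputs, and threshold are invented for illustration; registering a function with ValidMind via the `@vm.test` decorator is covered later in this notebook:

```python
# A hypothetical custom test: inputs and parameters in, a table-like output out.
# Here "dataset" is just a dict of column -> precomputed missing-value ratio.
def missing_values_ratio(dataset, max_ratio=0.1):
    """Flag columns whose missing-value ratio exceeds a threshold."""
    rows = [
        {
            "Column": column,
            "Missing Ratio": ratio,
            "Pass/Fail": "Pass" if ratio <= max_ratio else "Fail",
        }
        for column, ratio in dataset.items()
    ]
    return rows  # a table-like output; a custom test could also return a figure

result = missing_values_ratio({"Age": 0.0, "Balance": 0.25}, max_ratio=0.1)
print(result[1]["Pass/Fail"])  # Fail
```

The body can do anything Python can do; what matters is that the inputs, parameters, and return values follow a shape the ValidMind Library can handle.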
For a more in-depth introduction to custom tests, refer to our Implement custom tests notebook.
Learn by doing
Our course, tailor-made for validators new to ValidMind, combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals
Prerequisites
To finalize validation and reporting, you'll first need to have:
Need help with the above steps?
Refer to the first three notebooks in this series:
# Make sure the ValidMind Library is installed
%pip install -q validmind

# Load your model identifier credentials from an `.env` file
%load_ext dotenv
%dotenv .env

# Or replace with your code snippet
import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    # document="validation-report",
)
Note: you may need to restart the kernel to use updated packages.
2026-03-12 20:48:06,626 - ERROR(validmind.api_client): Future releases will require `document` as one of the options you must provide to `vm.init()`. To learn more, refer to https://docs.validmind.ai/developer/validmind-library.html
2026-03-12 20:48:06,760 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report
Import the sample dataset
Next, we'll load in the same sample Bank Customer Churn Prediction dataset that was used to develop the champion model, which we'll then independently preprocess:
# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with:
• Target column: 'Exited'
• Class labels: {'0': 'Did not exit', '1': 'Exited'}
# Initialize the raw dataset for use in ValidMind tests
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)
import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test:
# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)
# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)
❌ High Pearson Correlation
The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, with the aim of detecting potential feature redundancy or multicollinearity. The results table lists the top ten feature pairs ranked by the absolute value of their Pearson correlation coefficients, along with a Pass or Fail status based on a threshold of 0.3. Only one feature pair exceeds the threshold, while the remaining pairs display lower correlation values and pass the test criteria.
Key insights:
Single feature pair exceeds correlation threshold: The pair (Age, Exited) shows a Pearson correlation coefficient of 0.3245, surpassing the 0.3 threshold and receiving a Fail status.
All other feature pairs below threshold: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.0348 to 0.2064, all below the threshold and marked as Pass.
Predominantly weak linear relationships: Most feature pairs demonstrate weak linear associations, with coefficients clustered near zero.
The test results indicate that the dataset contains minimal evidence of strong linear relationships among most feature pairs, with only the (Age, Exited) pair exhibiting a moderate correlation above the specified threshold. The overall correlation structure suggests low risk of widespread multicollinearity or feature redundancy based on linear associations.
Parameters:
{
"max_threshold": 0.3
}
Tables
| Columns | Coefficient | Pass/Fail |
|---|---|---|
| (Age, Exited) | 0.3245 | Fail |
| (IsActiveMember, Exited) | -0.2064 | Pass |
| (Balance, NumOfProducts) | -0.1749 | Pass |
| (Balance, Exited) | 0.1349 | Pass |
| (NumOfProducts, Exited) | -0.0550 | Pass |
| (Age, NumOfProducts) | -0.0444 | Pass |
| (Age, Balance) | 0.0409 | Pass |
| (NumOfProducts, IsActiveMember) | 0.0387 | Pass |
| (HasCrCard, IsActiveMember) | -0.0360 | Pass |
| (Age, Tenure) | -0.0348 | Pass |
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
|   | Columns | Coefficient | Pass/Fail |
|---|---|---|---|
| 0 | (Age, Exited) | 0.3245 | Fail |
| 1 | (IsActiveMember, Exited) | -0.2064 | Pass |
| 2 | (Balance, NumOfProducts) | -0.1749 | Pass |
| 3 | (Balance, Exited) | 0.1349 | Pass |
| 4 | (NumOfProducts, Exited) | -0.0550 | Pass |
| 5 | (Age, NumOfProducts) | -0.0444 | Pass |
| 6 | (Age, Balance) | 0.0409 | Pass |
| 7 | (NumOfProducts, IsActiveMember) | 0.0387 | Pass |
| 8 | (HasCrCard, IsActiveMember) | -0.0360 | Pass |
| 9 | (Age, Tenure) | -0.0348 | Pass |
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)
✅ High Pearson Correlation
The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity within the dataset. The results table presents the top ten absolute Pearson correlation coefficients, each associated with a feature pair, the coefficient value, and a Pass/Fail status based on a threshold of 0.3. All observed coefficients are below the threshold, and each feature pair is marked as Pass.
Key insights:
No feature pairs exceed correlation threshold: All absolute Pearson correlation coefficients are below the 0.3 threshold, with the highest observed value being -0.2064 between IsActiveMember and Exited.
Low to moderate linear relationships: The strongest correlations, such as between Balance and NumOfProducts (-0.1749) and Balance and Exited (0.1349), remain well below levels typically associated with multicollinearity.
Consistent Pass status across all pairs: Every feature pair in the top ten list is marked as Pass, indicating no detected high-risk linear dependencies among the evaluated features.
The results indicate that the dataset does not exhibit high linear correlations among the top feature pairs, suggesting a low risk of feature redundancy or multicollinearity based on the tested threshold. The observed correlation structure supports the interpretability and stability of subsequent modeling efforts.
Parameters:
{
"max_threshold": 0.3
}
Tables
| Columns | Coefficient | Pass/Fail |
|---|---|---|
| (IsActiveMember, Exited) | -0.2064 | Pass |
| (Balance, NumOfProducts) | -0.1749 | Pass |
| (Balance, Exited) | 0.1349 | Pass |
| (NumOfProducts, Exited) | -0.0550 | Pass |
| (NumOfProducts, IsActiveMember) | 0.0387 | Pass |
| (HasCrCard, IsActiveMember) | -0.0360 | Pass |
| (CreditScore, Exited) | -0.0303 | Pass |
| (Tenure, Exited) | -0.0246 | Pass |
| (Tenure, HasCrCard) | 0.0239 | Pass |
| (Tenure, EstimatedSalary) | 0.0224 | Pass |
Split the preprocessed dataset
With our raw dataset rebalanced and the highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:
# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
|   | CreditScore | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4938 | 850 | 3 | 51293.47 | 1 | 0 | 0 | 35534.68 | 0 | True | False | False |
| 775 | 610 | 9 | 0.00 | 3 | 0 | 1 | 83912.24 | 0 | False | True | True |
| 693 | 733 | 3 | 106545.53 | 1 | 1 | 1 | 134589.58 | 0 | True | False | True |
| 2545 | 515 | 9 | 113715.36 | 1 | 1 | 0 | 18424.24 | 1 | True | False | True |
| 5198 | 651 | 1 | 163700.78 | 3 | 1 | 1 | 29583.48 | 1 | True | False | False |
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
With our raw dataset assessed and preprocessed, let's go ahead and import the champion model submitted by the model development team, provided as a .pkl file: lr_model_champion.pkl
# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
Train potential challenger model
We'll also train our random forest classification challenger model to see how it compares:
# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data for each of our two models:
# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)
# Assign predictions to Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Assign predictions to Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-03-12 20:48:20,826 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-03-12 20:48:20,828 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-03-12 20:48:20,828 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-03-12 20:48:20,832 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-03-12 20:48:20,833 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-03-12 20:48:20,836 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-03-12 20:48:20,837 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-03-12 20:48:20,838 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-03-12 20:48:20,841 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-03-12 20:48:20,867 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-03-12 20:48:20,867 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-03-12 20:48:20,892 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-03-12 20:48:20,895 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-03-12 20:48:20,908 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-03-12 20:48:20,909 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-03-12 20:48:20,922 - INFO(validmind.vm_models.dataset.utils): Done running predict()
Implementing custom tests
Thanks to the model documentation, we know that the model development team implemented a custom test to further evaluate the performance of the champion model.
In a usual model validation situation, you would load a saved custom test provided by the model development team. In the following section, we'll have you implement the same custom test and make it available for reuse, to familiarize you with the processes.
Let's implement the same custom inline test that the model development team used in their performance evaluations: a test that calculates the confusion matrix for a binary classification model.
An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
You'll note that the custom test function is just a regular Python function that can include and require any Python library as you see fit.
Create a confusion matrix plot
Let's first create a confusion matrix plot using the confusion_matrix function from the sklearn.metrics module:
import matplotlib.pyplot as plt
from sklearn import metrics

# Get the predicted classes
y_pred = log_reg.predict(vm_test_ds.x)

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[False, True]
)
cm_display.plot()
Next, create a @vm.test wrapper that will allow you to create a reusable test. Note the following changes in the code below:
The function confusion_matrix takes two arguments, dataset and model. These are a VMDataset and a VMModel object, respectively.
VMDataset objects allow you to access the dataset's true (target) values by accessing the .y attribute.
VMDataset objects allow you to access the predictions for a given model by accessing the .y_pred() method.
The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.
The function body calculates the confusion matrix using the sklearn.metrics.confusion_matrix function as we just did above.
The function then returns the ConfusionMatrixDisplay.figure_ object — this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.
The @vm.test decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID my_custom_tests.ConfusionMatrix.
@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself
You can now run the newly created custom test on both the training and test datasets for both models using the run_test() function:
The Confusion Matrix test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The resulting matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of model accuracy and error distribution. The results are presented separately for the training dataset (train_dataset_final) and the test dataset (test_dataset_final), allowing for assessment of model generalization and potential overfitting.
Key insights:
Balanced classification performance across datasets: Both training and test confusion matrices show substantial counts in the true positive and true negative cells, indicating the model is able to correctly identify both classes in each dataset.
False positive and false negative rates are comparable: The number of false positives (446 in training, 116 in test) and false negatives (419 in training, 118 in test) are similar within each dataset, suggesting no strong bias toward one type of misclassification.
Consistent error distribution between train and test: The relative proportions of correct and incorrect predictions are similar between the training and test datasets, indicating stable model behavior and no evidence of significant overfitting.
The confusion matrix results demonstrate that the model maintains consistent classification performance across both training and test datasets, with balanced rates of true and false predictions. The error distribution does not indicate a dominant misclassification type, and the similarity between datasets suggests the model generalizes well to unseen data.
Figures
2026-03-12 20:48:26,995 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:champion does not exist in model's document
The Confusion Matrix test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The resulting matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of model accuracy and error distribution. The first matrix corresponds to the training dataset, while the second matrix summarizes results for the test dataset.
Key insights:
Perfect classification on training data: The training confusion matrix shows 1,304 true negatives and 1,281 true positives, with zero false positives and zero false negatives, indicating no misclassifications on the training set.
Presence of misclassifications on test data: The test confusion matrix records 225 true negatives, 242 true positives, 87 false positives, and 93 false negatives, indicating both types of classification errors are present in the test set.
Balanced error distribution in test set: The number of false positives (87) and false negatives (93) are of similar magnitude, suggesting no strong bias toward one type of error in the test predictions.
The confusion matrices indicate that the model achieves perfect separation on the training data, with no observed misclassifications. On the test data, the model exhibits both false positives and false negatives, with error counts that are balanced between the two classes. This pattern suggests strong model fit to the training data and a moderate level of generalization error on unseen data, with no evidence of systematic bias toward either class in the test predictions.
Figures
2026-03-12 20:48:34,121 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:challenger does not exist in model's document
Note the output indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your validation report as part of your compliance assessment process within the ValidMind Platform.
Add parameters to custom tests
Custom tests can take parameters just like any other function. To demonstrate, let's modify the confusion_matrix function to take an additional parameter normalize that will allow you to normalize the confusion matrix:
@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model, normalize=False):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    if normalize:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred, normalize="all")
    else:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself
Pass parameters to custom tests
You can pass parameters to custom tests by providing a dictionary of parameters to the run_test() function.
The parameters will override any default parameters set in the custom test definition. Note that dataset and model are still passed as inputs.
Since these are VMDataset or VMModel objects, they have a special meaning to the ValidMind Library and are passed via inputs rather than params.
Re-running and logging the custom confusion matrix with normalize=True for both models and our testing dataset looks like this:
# Champion with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model],
    },
    params={"normalize": True},
).log()
Confusion Matrix Test Normalized Champion
The ConfusionMatrix:test_normalized_champion test evaluates the classification performance of the log_model_champion model on the test_dataset_final dataset by displaying the normalized confusion matrix. The matrix presents the proportion of true positives, true negatives, false positives, and false negatives, with each cell value representing the fraction of total predictions for each outcome. The normalization enables direct comparison of prediction accuracy across both classes.
Key insights:
Balanced correct classification rates: The model correctly classifies 0.30 of negative cases (true negatives) and 0.32 of positive cases (true positives), indicating similar accuracy for both classes.
Moderate misclassification rates: False positives and false negatives are observed at 0.18 and 0.20, respectively, reflecting moderate levels of misclassification for each class.
No extreme class imbalance in predictions: The normalized values are distributed without extreme skew, suggesting the model does not disproportionately favor one class over the other.
The normalized confusion matrix indicates that the model achieves comparable accuracy in identifying both positive and negative cases, with moderate and relatively balanced misclassification rates. The absence of pronounced class bias in predictions suggests stable model behavior across the evaluated dataset.
Parameters:
{
"normalize": true
}
Figures
2026-03-12 20:48:41,510 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_champion does not exist in model's document
# Challenger with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model],
    },
    params={"normalize": True},
).log()
Confusion Matrix Test Normalized Challenger
The ConfusionMatrix:test_normalized_challenger test evaluates the classification performance of the rf_model on the test_dataset_final by presenting a normalized confusion matrix. The matrix displays the proportion of true and false predictions for each class, with values normalized to sum to 1 across all entries. The plot provides a visual summary of the model's ability to correctly and incorrectly classify both positive and negative cases.
Key insights:
Balanced correct classification rates: The model correctly classifies 0.35 of all samples as true negatives and 0.37 as true positives, indicating similar accuracy for both classes.
Moderate false prediction rates: False positives and false negatives are observed at 0.13 and 0.14, respectively, reflecting moderate misclassification rates for both classes.
No class dominance in errors: The distribution of errors is relatively even between false positives and false negatives, with no single error type disproportionately represented.
The confusion matrix reveals that the model demonstrates balanced performance across both classes, with correct classification rates for true positives and true negatives closely aligned. Misclassification rates are moderate and evenly distributed, indicating that the model does not exhibit a strong bias toward either class in its prediction errors. This balanced error profile suggests consistent model behavior across the evaluated dataset.
Parameters:
{
"normalize": true
}
Figures
2026-03-12 20:48:51,702 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_challenger does not exist in model's document
Use external test providers
Sometimes you may want to reuse the same set of custom tests across multiple models and share them with others in your organization, as the model development team has done with you in the example workflow featured in this series of notebooks. In that case, you can create an external custom test provider that loads custom tests from a local folder or a Git repository.
In this section, you'll learn how to declare a local filesystem test provider that loads tests from a local folder, following these high-level steps:
Create a folder of custom tests from existing inline tests (tests that exist in your active Jupyter Notebook)
Let's start by creating a new folder that will contain reusable custom tests from your existing inline tests.
The following code snippet will create a new my_tests directory in the current working directory if it doesn't exist:
tests_folder = "my_tests"

import os

# create tests folder
os.makedirs(tests_folder, exist_ok=True)

# remove existing tests
for f in os.listdir(tests_folder):
    # remove files and pycache
    if f.endswith(".py") or f == "__pycache__":
        os.system(f"rm -rf {tests_folder}/{f}")
After running the command above, confirm that a new my_tests directory was created successfully. For example:
~/notebooks/tutorials/model_validation/my_tests/
Save an inline test
The @vm.test decorator we used in Implement a custom inline test above to register one-off custom tests also includes a convenience method on the function object that allows you to simply call <func_name>.save() to save the test to a Python file at a specified path.
While save() will get you started by creating the file and saving the function code under the correct name, it won't automatically include any imports, or any other functions or variables defined outside of the function, that are needed for the test to run. To solve this, pass in the optional imports argument to ensure the necessary imports are added to the file.
The confusion_matrix test requires the following additional imports:
import matplotlib.pyplot as plt
from sklearn import metrics
Let's pass these imports to the save() method to ensure they are included in the file with the following command:
confusion_matrix.save(
    # Save it to the custom tests folder we created
    tests_folder,
    imports=["import matplotlib.pyplot as plt", "from sklearn import metrics"],
)
2026-03-12 20:48:52,174 - INFO(validmind.tests.decorator): Saved to /home/runner/work/documentation/documentation/site/notebooks/EXECUTED/model_validation/my_tests/ConfusionMatrix.py! Be sure to add any necessary imports to the top of the file.
2026-03-12 20:48:52,175 - INFO(validmind.tests.decorator): This metric can be run with the ID: <test_provider_namespace>.ConfusionMatrix
# Saved from __main__.confusion_matrix
# Original Test ID: my_custom_tests.ConfusionMatrix
# New Test ID: <test_provider_namespace>.ConfusionMatrix
Now that your my_tests folder has a sample custom test, let's initialize a test provider that will tell the ValidMind Library where to find your custom tests:
ValidMind offers out-of-the-box test providers for local tests (tests in a folder) or a Github provider for tests in a Github repository.
You can also create your own test provider by creating a class that has a load_test method that takes a test ID and returns the test function matching that ID.
For most use cases, using a LocalTestProvider that allows you to load custom tests from a designated directory should be sufficient.
The most important attribute for a test provider is its namespace. This is a string that will be used to prefix test IDs in model documentation. This allows you to have multiple test providers with tests that can even share the same ID, but are distinguished by their namespace.
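To make the load_test contract concrete, here's a hypothetical minimal provider. This is an illustrative sketch only, not ValidMind's actual LocalTestProvider implementation; the class and file names are invented:

```python
import importlib.util
import os
import tempfile

# Hypothetical minimal test provider: any object exposing a `load_test` method
# that maps a test ID to a test function satisfies the provider contract.
class FolderTestProvider:
    def __init__(self, root):
        self.root = root

    def load_test(self, test_id):
        # "classification.ConfusionMatrix" -> "<root>/classification/ConfusionMatrix.py"
        *subfolders, name = test_id.split(".")
        path = os.path.join(self.root, *subfolders, f"{name}.py")
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # by convention, the test function shares the file's name
        return getattr(module, name)

# Demo: write a tiny test file into a temporary folder, then load it
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "Always.py"), "w") as f:
        f.write("def Always():\n    return 'ok'\n")
    loaded = FolderTestProvider(root).load_test("Always")
    loaded_result = loaded()

print(loaded_result)  # ok
```

The namespace prefix itself is handled at registration time, so the provider only ever sees the ID relative to its own root.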
Let's go ahead and load the custom tests from our my_tests directory:
from validmind.tests import LocalTestProvider

# initialize the test provider with the tests folder we created earlier
my_test_provider = LocalTestProvider(tests_folder)

vm.tests.register_test_provider(
    namespace="my_test_provider",
    test_provider=my_test_provider,
)
# `my_test_provider.load_test()` will be called for any test ID that starts with `my_test_provider`
# e.g. `my_test_provider.ConfusionMatrix` will look for a function named `ConfusionMatrix` in `my_tests/ConfusionMatrix.py` file
Run test provider tests
Now that we've set up the test provider, we can run any test that's located in the tests folder by using the run_test() method as with any other test:
For tests that reside in a test provider directory, the test ID will be the namespace specified when registering the provider, followed by the path to the test file relative to the tests folder.
For example, the Confusion Matrix test we created earlier will have the test ID my_test_provider.ConfusionMatrix. You could organize the tests in subfolders, say classification and regression, and the test ID for the Confusion Matrix test would then be my_test_provider.classification.ConfusionMatrix.
Let's go ahead and re-run the confusion matrix test with our testing dataset for our two models by using the test ID my_test_provider.ConfusionMatrix. This should load the test from the test provider and run it as before.
# Champion with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model],
    },
).log()
Confusion Matrix Champion
The Confusion Matrix test evaluates the classification performance of the log_model_champion on the test_dataset_final by comparing predicted and true labels. The resulting matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error distribution. The matrix is structured with true labels on the vertical axis and predicted labels on the horizontal axis, with each cell indicating the number of instances for each outcome.
Key insights:
Balanced true positive and true negative counts: The model correctly classified 207 true positives and 196 true negatives, indicating similar effectiveness in identifying both classes.
Comparable false positive and false negative rates: There are 116 false positives and 118 false negatives, suggesting that misclassification rates are nearly equivalent for both types of errors.
No evidence of class prediction bias: The distribution of correct and incorrect predictions does not indicate a strong bias toward either class, as both positive and negative classes are represented similarly in both correct and incorrect predictions.
The confusion matrix reveals that the log_model_champion demonstrates balanced performance across both classes, with similar rates of correct and incorrect predictions for positive and negative outcomes. The absence of pronounced class bias and the close alignment of false positive and false negative counts indicate that the model maintains consistent classification behavior across the test dataset.
Figures
2026-03-12 20:48:59,823 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:champion does not exist in model's document
# Challenger with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model],
    },
).log()
Confusion Matrix Challenger
The Confusion Matrix test evaluates the classification performance of the rf_model on the test_dataset_final by comparing predicted and true labels. The resulting matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error distribution. The matrix is structured with actual class labels on the vertical axis and predicted class labels on the horizontal axis, with color intensity reflecting the count magnitude.
Key insights:
Balanced detection of both classes: The model correctly classified 225 negative cases (true negatives) and 242 positive cases (true positives), indicating effective identification of both classes.
Moderate false positive and false negative rates: There are 87 false positives and 93 false negatives, reflecting a moderate level of misclassification for both types of errors.
Comparable error distribution: The counts of false positives and false negatives are similar in magnitude, suggesting no substantial bias toward over- or under-predicting either class.
The confusion matrix reveals that the rf_model demonstrates balanced performance in identifying both positive and negative cases, with true positive and true negative counts closely matched. The rates of false positives and false negatives are moderate and similar in scale, indicating that misclassification is distributed relatively evenly across both classes. This pattern suggests the model does not exhibit a strong bias toward either class, and overall classification performance is consistent across the test dataset.
Figures
2026-03-12 20:49:07,228 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:challenger does not exist in model's document
Verify test runs
Our final task is to verify that all the tests provided by the model development team were run and reported accurately. Note the appended result_ids, which indicate which dataset each of the relevant tests was run with.
Here, we'll specify all the tests we'd like to independently rerun in a dictionary called test_config. Note here that inputs and input_grid expect the input_id of the dataset or model as the value rather than the variable name we specified:
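For illustration, a test_config entry might look like the following. The test IDs, input_id values, and parameters shown here are hypothetical examples patterned on the tests run in this notebook, not the actual configuration:

```python
# Hypothetical test_config excerpt. Values under "inputs"/"input_grid" are
# input_id strings registered with ValidMind, not Python variable names.
test_config = {
    "validmind.data_validation.ClassImbalance:raw_data": {
        "inputs": {"dataset": "raw_dataset"},  # assumed input_id
        "params": {"min_percent_threshold": 10},
    },
    "my_test_provider.ConfusionMatrix:champion": {
        "input_grid": {
            "dataset": ["test_dataset_final"],
            "model": ["log_model_champion"],
        },
    },
}
```

The loop below then dispatches each entry to run_test() with inputs or input_grid as appropriate, passing params when present.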
for t in test_config:
    print(t)
    try:
        # Check if test has input_grid
        if "input_grid" in test_config[t]:
            # For tests with input_grid, pass the input_grid configuration
            if "params" in test_config[t]:
                vm.tests.run_test(
                    t,
                    input_grid=test_config[t]["input_grid"],
                    params=test_config[t]["params"],
                ).log()
            else:
                vm.tests.run_test(t, input_grid=test_config[t]["input_grid"]).log()
        else:
            # Original logic for regular inputs
            if "params" in test_config[t]:
                vm.tests.run_test(
                    t,
                    inputs=test_config[t]["inputs"],
                    params=test_config[t]["params"],
                ).log()
            else:
                vm.tests.run_test(t, inputs=test_config[t]["inputs"]).log()
    except Exception as e:
        print(f"Error running test {t}: {str(e)}")
The Dataset Description test provides a comprehensive summary of the dataset's structure, completeness, and feature characteristics. The results table details each column's data type, count, missingness, and the number of distinct values, offering a clear overview of the dataset composition. All columns are fully populated with no missing values, and the distinct value counts highlight the diversity and granularity of each feature. This summary enables a thorough understanding of the dataset's readiness for modeling and potential areas of complexity.
Key insights:
No missing values across all columns: All 11 columns have 8,000 non-missing entries, with 0% missingness observed throughout the dataset.
High cardinality in key numeric features: The Balance and EstimatedSalary columns exhibit high distinct value counts (5,088 and 8,000 respectively), indicating continuous or near-continuous distributions.
Low cardinality in categorical features: Categorical columns such as Geography, Gender, HasCrCard, IsActiveMember, and Exited have between 2 and 3 distinct values, reflecting well-defined categorical groupings.
Moderate diversity in demographic and behavioral features: Age and CreditScore show moderate distinct counts (69 and 452 respectively), while Tenure and NumOfProducts have lower diversity (11 and 4 distinct values).
The dataset is fully complete with no missing data, supporting robust downstream analysis. Numeric features display a range of cardinalities, from highly granular (EstimatedSalary, Balance) to more discretized (Tenure, NumOfProducts), while categorical features are well-structured with limited unique values. The observed structure indicates a dataset suitable for a variety of modeling approaches, with no immediate data quality concerns evident from the summary statistics.
Tables
Dataset Description
| Name | Type | Count | Missing | Missing % | Distinct | Distinct % |
|---|---|---|---|---|---|---|
| CreditScore | Numeric | 8000.0 | 0 | 0.0 | 452 | 0.0565 |
| Geography | Categorical | 8000.0 | 0 | 0.0 | 3 | 0.0004 |
| Gender | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
| Age | Numeric | 8000.0 | 0 | 0.0 | 69 | 0.0086 |
| Tenure | Numeric | 8000.0 | 0 | 0.0 | 11 | 0.0014 |
| Balance | Numeric | 8000.0 | 0 | 0.0 | 5088 | 0.6360 |
| NumOfProducts | Numeric | 8000.0 | 0 | 0.0 | 4 | 0.0005 |
| HasCrCard | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
| IsActiveMember | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
| EstimatedSalary | Numeric | 8000.0 | 0 | 0.0 | 8000 | 1.0000 |
| Exited | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
2026-03-12 20:49:16,161 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DatasetDescription:raw_data does not exist in model's document
The Descriptive Statistics test evaluates the distributional characteristics and diversity of both numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables, including measures of central tendency, dispersion, and range, as well as frequency-based summaries for two categorical variables. The numerical table details counts, means, standard deviations, and percentiles, while the categorical table reports unique value counts and the dominance of the most frequent category. These results provide a comprehensive overview of the dataset's structure and highlight key aspects of variable distributions.
Key insights:
Wide range and skewness in balance values: The Balance variable exhibits a minimum of 0 and a maximum of 250,898, with a mean (76,434) substantially lower than the median (97,264), indicating a right-skewed distribution and the presence of a significant proportion of zero balances.
CreditScore and Age distributions are symmetric: CreditScore and Age show close alignment between mean and median (CreditScore mean: 650.16, median: 652; Age mean: 38.95, median: 37), suggesting relatively symmetric distributions without pronounced skewness.
Limited diversity in categorical variables: Geography is dominated by France (50.12% of records), and Gender is split between two categories, with Male comprising 54.95% of the dataset, indicating moderate imbalance but not extreme concentration.
Binary variables with balanced representation: HasCrCard and IsActiveMember are binary variables with means of 0.70 and 0.52, respectively, reflecting a moderate split between categories and no evidence of extreme imbalance.
NumOfProducts concentrated at lower values: The NumOfProducts variable has a mean of 1.53 and a median of 1, with 75% of values at or below 2, indicating most customers hold one or two products.
The dataset displays a mix of symmetric and skewed distributions among numerical variables, with Balance notably right-skewed and containing a substantial proportion of zero values. Categorical variables show moderate dominance by single categories but retain some diversity. Binary and count variables are distributed without extreme imbalance, supporting a representative sample across key dimensions. Overall, the data structure is well-characterized, with some variables warranting attention due to skewness or concentration.
Tables
Numerical Variables

| Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 8000.0 | 650.1596 | 96.8462 | 350.0 | 583.0 | 652.0 | 717.0 | 778.0 | 813.0 | 850.0 |
| Age | 8000.0 | 38.9489 | 10.4590 | 18.0 | 32.0 | 37.0 | 44.0 | 53.0 | 60.0 | 92.0 |
| Tenure | 8000.0 | 5.0339 | 2.8853 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 9.0 | 10.0 |
| Balance | 8000.0 | 76434.0965 | 62612.2513 | 0.0 | 0.0 | 97264.0 | 128045.0 | 149545.0 | 162488.0 | 250898.0 |
| NumOfProducts | 8000.0 | 1.5325 | 0.5805 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 4.0 |
| HasCrCard | 8000.0 | 0.7026 | 0.4571 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 8000.0 | 0.5199 | 0.4996 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 8000.0 | 99790.1880 | 57520.5089 | 12.0 | 50857.0 | 99505.0 | 149216.0 | 179486.0 | 189997.0 | 199992.0 |

Categorical Variables

| Name | Count | Number of Unique Values | Top Value | Top Value Frequency | Top Value Frequency % |
|---|---|---|---|---|---|
| Geography | 8000.0 | 3.0 | France | 4010.0 | 50.12 |
| Gender | 8000.0 | 2.0 | Male | 4396.0 | 54.95 |
2026-03-12 20:49:23,997 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:raw_data does not exist in model's document
validmind.data_validation.MissingValues:raw_data
✅ Missing Values Raw Data
The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the 1.0% threshold. All features in the dataset are listed with their respective missing value statistics and test outcomes.
Key insights:
No missing values detected: All features report 0 missing values, corresponding to 0.0% missingness for each column.
Universal test pass across features: Every feature meets the missing value threshold, with all columns marked as "Pass" in the results.
The dataset demonstrates complete data integrity with respect to missing values, as no feature contains any missing entries. All columns satisfy the established threshold, indicating a high level of data completeness for subsequent modeling or analysis.
Parameters:
{
"min_percentage_threshold": 1
}
Tables
| Column | Number of Missing Values | Percentage of Missing Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 0 | 0.0 | Pass |
| Geography | 0 | 0.0 | Pass |
| Gender | 0 | 0.0 | Pass |
| Age | 0 | 0.0 | Pass |
| Tenure | 0 | 0.0 | Pass |
| Balance | 0 | 0.0 | Pass |
| NumOfProducts | 0 | 0.0 | Pass |
| HasCrCard | 0 | 0.0 | Pass |
| IsActiveMember | 0 | 0.0 | Pass |
| EstimatedSalary | 0 | 0.0 | Pass |
| Exited | 0 | 0.0 | Pass |
2026-03-12 20:49:28,021 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:raw_data does not exist in model's document
validmind.data_validation.ClassImbalance:raw_data
✅ Class Imbalance Raw Data
The Class Imbalance test evaluates the distribution of target classes within the dataset to identify potential imbalances that could impact model performance. The results table presents the percentage representation of each class in the target variable "Exited," alongside a pass/fail assessment based on a minimum threshold of 10%. The accompanying bar plot visually depicts the proportion of each class, providing a clear overview of class distribution.
Key insights:
Both classes exceed minimum threshold: Class 0 constitutes 79.80% and class 1 constitutes 20.20% of the dataset, with both surpassing the 10% minimum threshold.
No classes flagged for imbalance: The pass/fail assessment indicates that neither class is under-represented according to the defined criterion.
Class distribution is asymmetric: The majority class (0) is nearly four times as prevalent as the minority class (1), as shown in both the table and the bar plot.
The results indicate that, while the dataset exhibits an asymmetric class distribution with a dominant majority class, both classes meet the minimum representation threshold set by the test. No classes are flagged for high imbalance risk under the current parameters, and the class proportions are visually confirmed by the bar plot. This distribution provides a basis for further model development without immediate concerns regarding under-representation of any class.
Parameters:
{
"min_percent_threshold": 10
}
Tables
Exited Class Imbalance

| Exited | Percentage of Rows (%) | Pass/Fail |
|---|---|---|
| 0 | 79.80% | Pass |
| 1 | 20.20% | Pass |
Figures
2026-03-12 20:49:34,907 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:raw_data does not exist in model's document
validmind.data_validation.Duplicates:raw_data
✅ Duplicates Raw Data
The Duplicates:raw_data test evaluates the presence of duplicate rows within the dataset to ensure data quality and reduce the risk of model overfitting due to redundant information. The results table summarizes the absolute number and percentage of duplicate rows detected in the dataset, with the test configured to flag results only if the count exceeds a minimum threshold of 1. The table indicates both the total number of duplicate rows and their proportion relative to the dataset size.
Key insights:
No duplicate rows detected: The dataset contains 0 duplicate rows, as indicated by the "Number of Duplicates" value.
Zero percent duplication rate: The "Percentage of Rows (%)" is 0.0%, confirming the absence of redundancy in the dataset.
The results demonstrate that the dataset is free from duplicate entries, indicating a high level of data integrity with respect to row uniqueness. The absence of duplicates reduces the risk of model bias due to repeated information and supports reliable model training and evaluation.
Parameters:
{
"min_threshold": 1
}
Tables
Duplicate Rows Results for Dataset

| Number of Duplicates | Percentage of Rows (%) |
|---|---|
| 0 | 0.0 |
2026-03-12 20:49:38,152 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Duplicates:raw_data does not exist in model's document
The High Cardinality test evaluates the number of unique values in categorical columns to identify potential risks associated with high cardinality, such as overfitting or data noise. The results table presents the number and percentage of distinct values for each categorical column, along with a pass/fail status based on a threshold of 10% distinct values. Both "Geography" and "Gender" columns are assessed, with their respective distinct value counts and percentages reported.
Key insights:
All categorical columns pass cardinality threshold: Both "Geography" (3 distinct values, 0.0375%) and "Gender" (2 distinct values, 0.025%) are well below the 10% threshold, resulting in a "Pass" status for each.
Low cardinality observed across features: The number of unique values in both columns is minimal relative to the dataset size, indicating low cardinality in all assessed categorical features.
The results indicate that all evaluated categorical columns exhibit low cardinality, with distinct value percentages substantially below the defined threshold. No evidence of high cardinality risk is present in the assessed features, supporting data quality and reducing the likelihood of overfitting related to categorical variable granularity.
2026-03-12 20:49:41,574 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighCardinality:raw_data does not exist in model's document
validmind.data_validation.Skewness:raw_data
❌ Skewness Raw Data
The Skewness test evaluates the asymmetry of numerical data distributions to identify deviations from normality that may impact model performance. The results table presents skewness values for each numeric column, indicating whether each value falls below the maximum threshold of 1. Columns with skewness values below this threshold are marked as "Pass," while those exceeding it are marked as "Fail." The table enables assessment of distributional symmetry across all monitored features.
Key insights:
Most features exhibit low skewness: The majority of columns, including CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary, have skewness values well within the threshold, indicating near-symmetric distributions.
Age and Exited exceed skewness threshold: Age (skewness = 1.0245) and Exited (skewness = 1.4847) both exceed the maximum threshold, resulting in a "Fail" status for these columns.
Highest skewness observed in Exited: The Exited column displays the highest skewness (1.4847), indicating a pronounced asymmetry in its distribution relative to other features.
Negative skewness present but within limits: Features such as HasCrCard (-0.8867), Balance (-0.1353), and CreditScore (-0.062) show negative skewness, but all remain within the acceptable range.
The results indicate that most numeric features in the dataset maintain distributional symmetry within the defined threshold, supporting data quality for model development. However, Age and Exited display elevated skewness, with Exited showing the most pronounced asymmetry. These findings highlight localized distributional imbalances that may warrant further examination depending on model requirements and use case.
Parameters:
{
"max_threshold": 1
}
Tables
Skewness Results for Dataset

| Column | Skewness | Pass/Fail |
|---|---|---|
| CreditScore | -0.0620 | Pass |
| Age | 1.0245 | Fail |
| Tenure | 0.0077 | Pass |
| Balance | -0.1353 | Pass |
| NumOfProducts | 0.7172 | Pass |
| HasCrCard | -0.8867 | Pass |
| IsActiveMember | -0.0796 | Pass |
| EstimatedSalary | 0.0095 | Pass |
| Exited | 1.4847 | Fail |
2026-03-12 20:49:46,228 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Skewness:raw_data does not exist in model's document
validmind.data_validation.UniqueRows:raw_data
❌ Unique Rows Raw Data
The UniqueRows test evaluates the diversity of the dataset by measuring the proportion of unique values in each column relative to the total row count, with a minimum threshold set at 1%. The results table presents, for each column, the number and percentage of unique values, along with a pass/fail outcome based on whether the uniqueness percentage meets or exceeds the threshold. Columns such as EstimatedSalary, Balance, and CreditScore exhibit high uniqueness percentages and pass the test, while most categorical and low-cardinality columns fall below the threshold and fail.
Key insights:
High uniqueness in continuous variables: EstimatedSalary (100%), Balance (63.6%), and CreditScore (5.65%) exceed the 1% uniqueness threshold, indicating substantial diversity in these columns.
Low uniqueness in categorical variables: Columns such as Geography (0.0375%), Gender (0.025%), HasCrCard (0.025%), IsActiveMember (0.025%), and Exited (0.025%) have very low uniqueness percentages and fail the test.
Limited diversity in Age and Tenure: Age (0.8625%) and Tenure (0.1375%) do not meet the uniqueness threshold, reflecting limited distinct values relative to the dataset size.
Majority of columns fail uniqueness threshold: Only 3 out of 11 columns pass the test, with the remaining 8 columns failing to meet the minimum uniqueness requirement.
The results indicate that while continuous variables such as EstimatedSalary, Balance, and CreditScore provide substantial row-level diversity, the majority of columns—particularly those representing categorical or low-cardinality features—exhibit low uniqueness and do not meet the prescribed threshold. This distribution reflects a dataset structure where diversity is concentrated in a subset of variables, with most categorical features contributing limited unique information at the row level.
Parameters:
{
"min_percent_threshold": 1
}
Tables
| Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 452 | 5.6500 | Pass |
| Geography | 3 | 0.0375 | Fail |
| Gender | 2 | 0.0250 | Fail |
| Age | 69 | 0.8625 | Fail |
| Tenure | 11 | 0.1375 | Fail |
| Balance | 5088 | 63.6000 | Pass |
| NumOfProducts | 4 | 0.0500 | Fail |
| HasCrCard | 2 | 0.0250 | Fail |
| IsActiveMember | 2 | 0.0250 | Fail |
| EstimatedSalary | 8000 | 100.0000 | Pass |
| Exited | 2 | 0.0250 | Fail |
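The pass/fail arithmetic behind these rows can be sketched directly. This is an illustrative reconstruction, assuming the test simply compares the percentage of distinct values against the threshold:

```python
def unique_rows_check(n_unique, n_rows, min_percent_threshold=1.0):
    """Illustrative check: percentage of unique values vs. a minimum threshold."""
    pct = 100.0 * n_unique / n_rows
    return round(pct, 4), "Pass" if pct >= min_percent_threshold else "Fail"

# Reproducing two rows from the table above:
print(unique_rows_check(452, 8000))  # CreditScore
print(unique_rows_check(69, 8000))   # Age
```

With 8,000 rows, CreditScore's 452 distinct values yield 5.65% (Pass), while Age's 69 distinct values yield 0.8625% (Fail), matching the reported results.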
2026-03-12 20:49:51,268 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:raw_data does not exist in model's document
The TooManyZeroValues test identifies numerical columns with a proportion of zero values exceeding a defined threshold, set here at 0.03%. The results table summarizes the number and percentage of zero values for each numerical column, along with a pass/fail status based on the threshold. All four evaluated columns—Tenure, Balance, HasCrCard, and IsActiveMember—are reported with their respective row counts, zero value counts, and calculated percentages.
Key insights:
All evaluated columns exceed zero value threshold: Each of the four numerical columns has a percentage of zero values significantly above the 0.03% threshold, resulting in a fail status for all.
High concentration of zeros in Balance and IsActiveMember: Balance contains 36.4% zero values, and IsActiveMember contains 48.01%, indicating substantial sparsity in these features.
Substantial zero values in binary indicator columns: HasCrCard and IsActiveMember, likely representing binary indicators, show 29.74% and 48.01% zero values respectively, reflecting a large proportion of one class.
Tenure column also affected: Tenure registers 4.04% zero values, which, while lower than other columns, still exceeds the threshold and results in a fail.
All tested numerical columns display zero value proportions well above the defined threshold, with Balance and IsActiveMember exhibiting particularly high sparsity. The prevalence of zeros across these features is consistent and systematic, as indicated by the fail status for each column. This pattern highlights a notable concentration of zero values in both continuous and binary-type variables within the dataset.
Parameters:
{
"max_percent_threshold": 0.03
}
Tables
| Variable | Row Count | Number of Zero Values | Percentage of Zero Values (%) | Pass/Fail |
|---|---|---|---|---|
| Tenure | 8000 | 323 | 4.0375 | Fail |
| Balance | 8000 | 2912 | 36.4000 | Fail |
| HasCrCard | 8000 | 2379 | 29.7375 | Fail |
| IsActiveMember | 8000 | 3841 | 48.0125 | Fail |
2026-03-12 20:49:58,666 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TooManyZeroValues:raw_data does not exist in model's document
The Interquartile Range Outliers Table (IQROutliersTable) test identifies and summarizes outliers in numerical features using the IQR method, with the threshold parameter set to 5 for this analysis. The results table presents the count and summary statistics of outliers detected for each numerical feature in the dataset. In this instance, the table is empty, indicating no outliers were detected under the specified threshold.
Key insights:
No outliers detected in any feature: The test did not identify any data points as outliers across all numerical features at the threshold of 5.
Dataset exhibits high conformity to IQR bounds: All numerical feature values fall within the calculated IQR-based outlier limits, indicating absence of extreme deviations.
The absence of detected outliers at the specified threshold suggests that the dataset's numerical features are well-contained within the expected value ranges. This result indicates a high degree of distributional regularity and minimal presence of extreme values under the applied IQR criteria.
Parameters:
{
"threshold": 5
}
Tables
Summary of Outliers Detected by IQR Method
2026-03-12 20:50:01,612 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.IQROutliersTable:raw_data does not exist in model's document
The Descriptive Statistics test evaluates the distributional characteristics and diversity of both numerical and categorical variables in the preprocessed dataset. The results are presented in two summary tables: one for numerical variables, detailing central tendency, dispersion, and range; and one for categorical variables, summarizing value counts, unique value diversity, and the dominance of top categories. These tables provide a comprehensive overview of the dataset’s structure, supporting assessment of data quality and potential risk factors.
Key insights:
Wide range and skewness in Balance: The Balance variable exhibits a minimum of 0.0, a median of 103,828.0, and a maximum of 250,898.0, with a mean (82,744.6) substantially below the median, indicating right-skewness and a concentration of lower values.
CreditScore distribution is symmetric and complete: CreditScore shows a mean (648.2) closely aligned with the median (650.0), and a full range from 350.0 to 850.0, suggesting a well-populated and symmetric distribution.
Binary variables show moderate class balance: HasCrCard and IsActiveMember are both binary, with HasCrCard having 70.1% of entries as 1 and IsActiveMember at 47.3% as 1, indicating moderate class balance without extreme dominance.
Categorical variables have limited diversity: Geography has three unique values, with France as the top value at 46.47% frequency. Gender is evenly split, with Male at 50.25%, indicating no single category is overwhelmingly dominant.
The dataset demonstrates generally balanced distributions across both numerical and categorical variables, with the exception of Balance, which is notably right-skewed and contains a substantial proportion of zero values. Categorical variables display limited but sufficient diversity, and binary variables do not exhibit extreme class imbalance. These characteristics provide a stable foundation for subsequent modeling, with the primary distributional risk concentrated in the Balance variable.
Tables
Numerical Variables

| Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 3232.0 | 648.1894 | 97.2398 | 350.0 | 582.0 | 650.0 | 715.0 | 776.0 | 812.0 | 850.0 |
| Tenure | 3232.0 | 5.0226 | 2.9093 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| Balance | 3232.0 | 82744.5585 | 61546.8678 | 0.0 | 0.0 | 103828.0 | 129848.0 | 151020.0 | 165337.0 | 250898.0 |
| NumOfProducts | 3232.0 | 1.5090 | 0.6694 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| HasCrCard | 3232.0 | 0.7011 | 0.4578 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 3232.0 | 0.4725 | 0.4993 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 3232.0 | 99725.4095 | 57416.6108 | 12.0 | 50950.0 | 98820.0 | 149928.0 | 179481.0 | 189189.0 | 199909.0 |

Categorical Variables

| Name | Count | Number of Unique Values | Top Value | Top Value Frequency | Top Value Frequency % |
|---|---|---|---|---|---|
| Geography | 3232.0 | 3.0 | France | 1502.0 | 46.47 |
| Gender | 3232.0 | 2.0 | Male | 1624.0 | 50.25 |
2026-03-12 20:50:07,365 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:preprocessed_data does not exist in model's document
The TabularDescriptionTables:preprocessed_data test evaluates the descriptive statistics and data completeness of numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables and two categorical variables, including measures of central tendency, range, missingness, and data type. All variables are reported with their observed value ranges, means, and unique value counts, providing a comprehensive overview of the dataset's structure and integrity.
Key insights:
No missing values detected: All numerical and categorical variables report 0.0% missing values, indicating complete data coverage across all fields.
Numerical variables span expected ranges: Variables such as CreditScore (350.0–850.0), Balance (0.0–250,898.09), and EstimatedSalary (11.58–199,909.32) display wide but bounded ranges, with means consistent with their respective domains.
Categorical variables are low cardinality: Geography contains three unique values (Germany, Spain, France), and Gender contains two (Female, Male), both with 0.0% missingness.
Binary indicators are well-formed: HasCrCard, IsActiveMember, and Exited are encoded as int64 with minimum and maximum values of 0 and 1, confirming binary structure.
The dataset exhibits complete data integrity with no missing values across all variables. Numerical and categorical fields are well-structured, with value ranges and cardinalities consistent with their intended use. The absence of missingness and the presence of clearly defined variable types support robust downstream modeling and analysis.
Tables
| Numerical Variable | Num of Obs | Mean | Min | Max | Missing Values (%) | Data Type |
|---|---|---|---|---|---|---|
| CreditScore | 3232 | 648.1894 | 350.00 | 850.00 | 0.0 | int64 |
| Tenure | 3232 | 5.0226 | 0.00 | 10.00 | 0.0 | int64 |
| Balance | 3232 | 82744.5585 | 0.00 | 250898.09 | 0.0 | float64 |
| NumOfProducts | 3232 | 1.5090 | 1.00 | 4.00 | 0.0 | int64 |
| HasCrCard | 3232 | 0.7011 | 0.00 | 1.00 | 0.0 | int64 |
| IsActiveMember | 3232 | 0.4725 | 0.00 | 1.00 | 0.0 | int64 |
| EstimatedSalary | 3232 | 99725.4095 | 11.58 | 199909.32 | 0.0 | float64 |
| Exited | 3232 | 0.5000 | 0.00 | 1.00 | 0.0 | int64 |

| Categorical Variable | Num of Obs | Num of Unique Values | Unique Values | Missing Values (%) | Data Type |
|---|---|---|---|---|---|
| Geography | 3232.0 | 3.0 | ['Germany' 'Spain' 'France'] | 0.0 | object |
| Gender | 3232.0 | 2.0 | ['Female' 'Male'] | 0.0 | object |
2026-03-12 20:50:11,733 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:preprocessed_data does not exist in model's document
The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the 1.0% threshold. All features in the dataset are listed with their respective missing value statistics and test outcomes.
Key insights:
No missing values detected: All features report 0 missing values, corresponding to 0.0% missingness for each column.
Universal pass across features: Every feature meets the missing value threshold, with all columns marked as "Pass" in the results.
The dataset demonstrates complete data integrity with respect to missing values, as no feature contains any missing entries. All columns satisfy the established missingness threshold, indicating a high level of data completeness for subsequent modeling or analysis.
Parameters:
{
"min_percentage_threshold": 1
}
Tables

| Column | Number of Missing Values | Percentage of Missing Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 0 | 0.0 | Pass |
| Geography | 0 | 0.0 | Pass |
| Gender | 0 | 0.0 | Pass |
| Tenure | 0 | 0.0 | Pass |
| Balance | 0 | 0.0 | Pass |
| NumOfProducts | 0 | 0.0 | Pass |
| HasCrCard | 0 | 0.0 | Pass |
| IsActiveMember | 0 | 0.0 | Pass |
| EstimatedSalary | 0 | 0.0 | Pass |
| Exited | 0 | 0.0 | Pass |
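The missingness check itself is straightforward to reproduce with pandas. The sketch below is an illustrative approximation (synthetic data; ValidMind's implementation may differ in detail) of applying a `min_percentage_threshold` per column.

```python
import pandas as pd

# Hypothetical frame with one partially missing column.
df = pd.DataFrame({
    "CreditScore": [619, None, 502, 699],
    "Tenure": [2, 1, 8, 5],
})

min_percentage_threshold = 1  # same parameter as the test above

missing_pct = df.isna().mean() * 100
result = pd.DataFrame({
    "Number of Missing Values": df.isna().sum(),
    "Percentage of Missing Values (%)": missing_pct,
    "Pass/Fail": (missing_pct < min_percentage_threshold).map(
        {True: "Pass", False: "Fail"}
    ),
})
print(result)
```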
The TabularNumericalHistograms:preprocessed_data test provides visualizations of the distribution of each numerical feature in the dataset using histograms. These plots enable assessment of central tendency, spread, skewness, and the presence of outliers for each variable. The results display the frequency distribution for CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary, allowing for identification of distributional characteristics and potential data quality issues.
Key insights:
CreditScore displays moderate right skew: The CreditScore histogram shows a unimodal distribution with a longer right tail, indicating a concentration of values between 550 and 750, with fewer observations at the lower and higher extremes.
Tenure is nearly uniform with edge effects: The Tenure variable is distributed almost uniformly across its range, except for lower frequencies at the minimum (0) and maximum (10) values.
Balance is bimodal with a spike at zero: The Balance histogram reveals a pronounced spike at zero, followed by a bell-shaped distribution for nonzero values, indicating a substantial subset of accounts with zero balance.
NumOfProducts is highly concentrated at lower values: Most observations are at 1 or 2 products, with a steep drop-off for 3 and 4 products, indicating limited product diversification among customers.
HasCrCard and IsActiveMember are binary with class imbalance: Both variables are binary, with HasCrCard skewed toward 1 (majority have a credit card) and IsActiveMember showing a slight majority for 0 (not active).
EstimatedSalary is approximately uniform: The EstimatedSalary histogram is relatively flat across its range, indicating an even distribution of salary values without pronounced skew or clustering.
The histograms collectively indicate that most numerical features exhibit either uniform or moderately skewed distributions, with notable concentration effects in Balance (at zero) and NumOfProducts (at lower values). Binary features display class imbalance, and no extreme outliers are visually apparent in the continuous variables. These distributional characteristics provide a clear overview of the input data structure and highlight areas of concentration and potential segmentation within the dataset.
Figures
The TabularCategoricalBarPlots test evaluates the distribution of categorical variables by generating bar plots that display the frequency of each category within the dataset. The resulting plots provide a visual summary of the counts for each category in the "Geography" and "Gender" features. These visualizations enable assessment of the dataset's composition and highlight the relative representation of each category.
Key insights:
Balanced gender distribution: The "Gender" feature shows nearly equal counts for "Male" and "Female" categories, indicating no significant imbalance.
Geography category imbalance observed: The "Geography" feature displays higher representation for "France" compared to "Germany" and "Spain," with "Spain" having the lowest count among the three categories.
The categorical composition of the dataset is characterized by a balanced gender split and a notable imbalance in the "Geography" feature, where "France" is the most represented category. These patterns provide clarity on the underlying distribution of categorical variables and may inform further analysis of model input representativeness.
Figures
The TargetRateBarPlots test visualizes the distribution and target rates of categorical features to provide insight into model decision patterns. The results display paired bar plots for each categorical variable, showing both the frequency of each category and the corresponding mean target (default) rate. This enables a direct comparison of how target rates vary across different groups within each feature.
Key insights:
Geography exhibits target rate variation: The target rate for Germany is notably higher than for France and Spain, with Germany exceeding 0.6 while France and Spain are closer to 0.4.
Balanced category representation in Gender: Male and Female categories have nearly identical counts, indicating balanced representation in the dataset.
Gender target rates differ: The target rate for Female is higher than for Male, with Female above 0.5 and Male below 0.5.
Uneven category counts in Geography: France has the highest count, followed by Germany and then Spain, indicating some imbalance in category frequencies.
The results reveal distinct differences in target rates across both Geography and Gender features, with Germany and Female categories exhibiting higher default rates relative to their counterparts. Category representation is balanced for Gender but shows moderate imbalance for Geography. These patterns highlight areas where model outcomes differ by group, providing a basis for further analysis of model behavior and potential risk segmentation.
Figures
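The per-category counts and mean target rates behind these paired bar plots can be computed with a single groupby. The sketch below uses a small hypothetical sample (not the actual dataset) to show the aggregation.

```python
import pandas as pd

# Hypothetical sample; Exited is the binary target.
df = pd.DataFrame({
    "Geography": ["Germany", "Germany", "France", "Spain", "France"],
    "Exited": [1, 1, 0, 0, 1],
})

# Count and mean target (default) rate per category, as in the bar plots.
target_rates = df.groupby("Geography")["Exited"].agg(
    count="size", target_rate="mean"
)
print(target_rates)
```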
The Descriptive Statistics test evaluates the distributional characteristics of numerical variables in both the training and test datasets. The results present summary statistics—including mean, standard deviation, minimum, maximum, and key percentiles—for each variable, enabling assessment of central tendency, dispersion, and potential outliers. The statistics are reported separately for the train and test datasets, allowing for direct comparison of data consistency and distributional alignment across development splits.
Key insights:
Consistent central tendencies across splits: Mean and median values for key variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary are closely aligned between the training and test datasets, indicating stable distributions.
Comparable dispersion and range: Standard deviations and value ranges for all variables are similar between datasets, with no evidence of significant shifts or anomalies in spread.
No extreme outliers detected: Maximum and minimum values for all variables fall within expected operational ranges, with no evidence of extreme or implausible values in either dataset.
Balanced categorical encodings: Binary variables (HasCrCard, IsActiveMember) display mean values near 0.5–0.7, with standard deviations consistent with balanced categorical distributions.
The descriptive statistics indicate strong alignment between the training and test datasets, with stable central tendencies and dispersion across all monitored variables. No material outliers or distributional anomalies are observed, supporting the representativeness and integrity of the development data. The observed consistency provides a sound basis for subsequent modeling and validation activities.
Tables

| dataset | Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| train_dataset_final | CreditScore | 2585.0 | 648.0870 | 97.1601 | 350.0 | 581.0 | 650.0 | 717.0 | 775.0 | 811.0 | 850.0 |
| train_dataset_final | Tenure | 2585.0 | 5.0456 | 2.9270 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| train_dataset_final | Balance | 2585.0 | 82364.0648 | 61815.3725 | 0.0 | 0.0 | 103549.0 | 129935.0 | 151069.0 | 165346.0 | 250898.0 |
| train_dataset_final | NumOfProducts | 2585.0 | 1.5014 | 0.6614 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| train_dataset_final | HasCrCard | 2585.0 | 0.7029 | 0.4571 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| train_dataset_final | IsActiveMember | 2585.0 | 0.4716 | 0.4993 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| train_dataset_final | EstimatedSalary | 2585.0 | 100001.3237 | 57409.5810 | 12.0 | 51553.0 | 99476.0 | 150228.0 | 179692.0 | 190140.0 | 199909.0 |
| test_dataset_final | CreditScore | 647.0 | 648.5981 | 97.6318 | 350.0 | 584.0 | 649.0 | 712.0 | 780.0 | 816.0 | 850.0 |
| test_dataset_final | Tenure | 647.0 | 4.9304 | 2.8377 | 0.0 | 3.0 | 5.0 | 7.0 | 9.0 | 10.0 | 10.0 |
| test_dataset_final | Balance | 647.0 | 84264.7690 | 60485.4814 | 0.0 | 0.0 | 104478.0 | 129385.0 | 150914.0 | 164660.0 | 210433.0 |
| test_dataset_final | NumOfProducts | 647.0 | 1.5394 | 0.7002 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| test_dataset_final | HasCrCard | 647.0 | 0.6940 | 0.4612 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| test_dataset_final | IsActiveMember | 647.0 | 0.4760 | 0.4998 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| test_dataset_final | EstimatedSalary | 647.0 | 98623.0319 | 57475.8860 | 599.0 | 49779.0 | 95393.0 | 149421.0 | 178705.0 | 187201.0 | 199662.0 |
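Summary statistics with these exact percentile cuts come straight out of pandas. The sketch below (hypothetical values, not the dataset above) shows `describe()` with the same percentile list reported in the table.

```python
import pandas as pd

# Hypothetical balance values for illustration.
df = pd.DataFrame({"Balance": [0.0, 0.0, 103549.0, 129935.0, 250898.0]})

# describe() with the same percentiles reported in the table above.
stats = df["Balance"].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95])
print(stats)
```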
The Descriptive Statistics test evaluates the distributional characteristics and completeness of numerical variables in both the training and test datasets. The results present summary statistics including count, mean, minimum, maximum, missing value percentage, and data type for each numerical variable. All variables are reported for both datasets, with no missing values observed and consistent data types across variables.
Key insights:
No missing values detected: All numerical variables in both training and test datasets have 0.0% missing values, indicating complete data coverage for these fields.
Consistent data types across datasets: Data types for all variables are stable between training and test sets, with integer types for discrete variables and float types for continuous variables.
Stable central tendencies between datasets: Means for key variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited are closely aligned between training and test datasets, with differences generally within a small margin.
Full observed range maintained: Minimum and maximum values for variables such as CreditScore (350.0 to 850.0), Tenure (0.0 to 10.0), and NumOfProducts (1.0 to 4.0) are consistent with expected value ranges, with no evidence of out-of-range or anomalous values.
The descriptive statistics indicate that the numerical variables in both the training and test datasets are complete, with no missing values and consistent data types. Central tendencies and value ranges are stable across datasets, supporting data integrity and comparability for subsequent modeling steps. No data quality issues or distributional anomalies are observed in the reported statistics.
Tables

| dataset | Numerical Variable | Num of Obs | Mean | Min | Max | Missing Values (%) | Data Type |
|---|---|---|---|---|---|---|---|
| train_dataset_final | CreditScore | 2585 | 648.0870 | 350.00 | 850.00 | 0.0 | int64 |
| train_dataset_final | Tenure | 2585 | 5.0456 | 0.00 | 10.00 | 0.0 | int64 |
| train_dataset_final | Balance | 2585 | 82364.0648 | 0.00 | 250898.09 | 0.0 | float64 |
| train_dataset_final | NumOfProducts | 2585 | 1.5014 | 1.00 | 4.00 | 0.0 | int64 |
| train_dataset_final | HasCrCard | 2585 | 0.7029 | 0.00 | 1.00 | 0.0 | int64 |
| train_dataset_final | IsActiveMember | 2585 | 0.4716 | 0.00 | 1.00 | 0.0 | int64 |
| train_dataset_final | EstimatedSalary | 2585 | 100001.3237 | 11.58 | 199909.32 | 0.0 | float64 |
| train_dataset_final | Exited | 2585 | 0.4956 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | CreditScore | 647 | 648.5981 | 350.00 | 850.00 | 0.0 | int64 |
| test_dataset_final | Tenure | 647 | 4.9304 | 0.00 | 10.00 | 0.0 | int64 |
| test_dataset_final | Balance | 647 | 84264.7690 | 0.00 | 210433.08 | 0.0 | float64 |
| test_dataset_final | NumOfProducts | 647 | 1.5394 | 1.00 | 4.00 | 0.0 | int64 |
| test_dataset_final | HasCrCard | 647 | 0.6940 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | IsActiveMember | 647 | 0.4760 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | EstimatedSalary | 647 | 98623.0319 | 598.80 | 199661.50 | 0.0 | float64 |
| test_dataset_final | Exited | 647 | 0.5178 | 0.00 | 1.00 | 0.0 | int64 |
The Class Imbalance test evaluates the distribution of target classes within the training and test datasets to identify potential imbalances that could affect model performance. The results present the percentage representation of each class in both datasets, benchmarked against a minimum threshold of 10%. Visualizations display the proportion of each class, supporting interpretation of class balance.
Key insights:
Both classes exceed the minimum threshold: In both the training and test datasets, each class (Exited = 0 and Exited = 1) represents more than 10% of the total records, with all values above 48%.
Near-equal class distribution in training data: The training dataset shows a balanced split, with Exited = 0 at 50.44% and Exited = 1 at 49.56%.
Slight variation in test data proportions: The test dataset displays Exited = 1 at 51.78% and Exited = 0 at 48.22%, indicating a minor shift but maintaining overall balance.
All classes pass the imbalance criterion: No class in either dataset is flagged for imbalance, as all pass the 10% minimum threshold.
The class distribution in both the training and test datasets is balanced, with each class comprising nearly half of the records. No evidence of class imbalance is observed, and all classes meet the predefined minimum representation criterion. This distribution supports unbiased model training and evaluation with respect to the target variable.
Parameters:
{
"min_percent_threshold": 10
}
Tables

| dataset | Exited | Percentage of Rows (%) | Pass/Fail |
|---|---|---|---|
| train_dataset_final | 0 | 50.44% | Pass |
| train_dataset_final | 1 | 49.56% | Pass |
| test_dataset_final | 1 | 51.78% | Pass |
| test_dataset_final | 0 | 48.22% | Pass |
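The class-balance check reduces to a normalized `value_counts` compared against the threshold. The sketch below is a hedged approximation on a hypothetical balanced target (ValidMind's internal logic may differ in detail).

```python
import pandas as pd

# Hypothetical target column; threshold mirrors min_percent_threshold above.
exited = pd.Series([0, 1, 0, 1, 0, 1, 1, 0, 0, 1])
min_percent_threshold = 10

pct = exited.value_counts(normalize=True) * 100
result = pd.DataFrame({
    "Percentage of Rows (%)": pct,
    "Pass/Fail": (pct > min_percent_threshold).map({True: "Pass", False: "Fail"}),
})
print(result)
```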
Figures
The UniqueRows test evaluates the diversity of each column in the dataset by measuring the proportion of unique values relative to the total row count, with a minimum threshold set at 1%. The results table presents, for both the training and test datasets, the number and percentage of unique values per column, along with a pass/fail outcome based on the threshold. Columns with a percentage of unique values below 1% are marked as "Fail," while those meeting or exceeding the threshold are marked as "Pass." This assessment provides a column-level view of data uniqueness and highlights areas of limited diversity.
Key insights:
High uniqueness in continuous variables: Columns such as EstimatedSalary and Balance exhibit high percentages of unique values (100% and above 68%, respectively) in both training and test datasets, consistently passing the uniqueness threshold.
Low uniqueness in categorical and binary variables: Columns representing categorical or binary features (e.g., HasCrCard, IsActiveMember, Geography_Germany, Gender_Male, Exited) show very low percentages of unique values (all below 1%), resulting in a fail outcome for these columns across both datasets.
Mixed results for ordinal variables: CreditScore demonstrates moderate to high uniqueness (16.3% in training, 45.4% in test), passing the threshold, while Tenure passes in the test set (1.7%) but fails in the training set (0.43%), indicating variability in uniqueness across splits.
Consistent patterns across datasets: The observed patterns of high uniqueness in continuous variables and low uniqueness in categorical variables are consistent between the training and test datasets.
The results indicate that continuous variables in both datasets provide substantial diversity, as reflected by high percentages of unique values and consistent pass outcomes. In contrast, categorical and binary variables uniformly fall below the uniqueness threshold, resulting in fail outcomes for these columns. This pattern reflects the inherent limitations of the UniqueRows test when applied to categorical features, as their value ranges are naturally constrained. The overall uniqueness profile is stable across both training and test datasets, with no evidence of data duplication or lack of diversity in continuous features.
Parameters:
{
"min_percent_threshold": 1
}
Tables

| dataset | Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---|---|---|---|---|
| train_dataset_final | CreditScore | 422 | 16.3250 | Pass |
| train_dataset_final | Tenure | 11 | 0.4255 | Fail |
| train_dataset_final | Balance | 1759 | 68.0464 | Pass |
| train_dataset_final | NumOfProducts | 4 | 0.1547 | Fail |
| train_dataset_final | HasCrCard | 2 | 0.0774 | Fail |
| train_dataset_final | IsActiveMember | 2 | 0.0774 | Fail |
| train_dataset_final | EstimatedSalary | 2585 | 100.0000 | Pass |
| train_dataset_final | Geography_Germany | 2 | 0.0774 | Fail |
| train_dataset_final | Geography_Spain | 2 | 0.0774 | Fail |
| train_dataset_final | Gender_Male | 2 | 0.0774 | Fail |
| train_dataset_final | Exited | 2 | 0.0774 | Fail |
| test_dataset_final | CreditScore | 294 | 45.4405 | Pass |
| test_dataset_final | Tenure | 11 | 1.7002 | Pass |
| test_dataset_final | Balance | 455 | 70.3246 | Pass |
| test_dataset_final | NumOfProducts | 4 | 0.6182 | Fail |
| test_dataset_final | HasCrCard | 2 | 0.3091 | Fail |
| test_dataset_final | IsActiveMember | 2 | 0.3091 | Fail |
| test_dataset_final | EstimatedSalary | 647 | 100.0000 | Pass |
| test_dataset_final | Geography_Germany | 2 | 0.3091 | Fail |
| test_dataset_final | Geography_Spain | 2 | 0.3091 | Fail |
| test_dataset_final | Gender_Male | 2 | 0.3091 | Fail |
| test_dataset_final | Exited | 2 | 0.3091 | Fail |
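The per-column uniqueness ratio behind these outcomes is `nunique() / len(df)`. The sketch below uses synthetic data (one continuous column, one binary column) to show why binary features fail a 1% threshold once the dataset is reasonably large; it is an approximation, not ValidMind's implementation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "EstimatedSalary": rng.uniform(11.58, 199909.32, size=300),  # continuous
    "HasCrCard": rng.integers(0, 2, size=300),                   # binary
})
min_percent_threshold = 1

unique_pct = df.nunique() / len(df) * 100
result = pd.DataFrame({
    "Number of Unique Values": df.nunique(),
    "Percentage of Unique Values (%)": unique_pct,
    "Pass/Fail": (unique_pct >= min_percent_threshold).map(
        {True: "Pass", False: "Fail"}
    ),
})
print(result)
```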
The TabularNumericalHistograms test provides a visual assessment of the distribution of each numerical feature in both the training and test datasets. The resulting histograms display the frequency distribution for each variable, enabling identification of distributional characteristics, skewness, and potential outliers. These visualizations facilitate an understanding of the underlying data structure and highlight any notable deviations or concentration patterns across features.
Key insights:
CreditScore distributions are unimodal and right-skewed: Both training and test datasets show unimodal distributions for CreditScore, with a concentration between 600 and 750 and a right-skewed tail extending toward higher values.
Tenure is approximately uniform with edge effects: Tenure displays a near-uniform distribution across most values, with slightly lower frequencies at the minimum and maximum bins in both datasets.
Balance exhibits a strong zero-inflation: A substantial proportion of records have a zero balance, with the remainder forming a bell-shaped distribution centered around 120,000–140,000.
NumOfProducts is highly concentrated at lower values: The majority of records have one or two products, with very few instances at three or four products.
Binary features show class imbalance: HasCrCard and IsActiveMember are both skewed, with HasCrCard dominated by the '1' class and IsActiveMember showing a moderate split but with more '0' values in the training set.
EstimatedSalary is uniformly distributed: EstimatedSalary displays a flat distribution across its range in both datasets, indicating no significant skew or concentration.
Geography and Gender features are imbalanced: Geography_Germany and Geography_Spain show more records in the 'false' category, while Gender_Male is nearly balanced between true and false.
The histograms reveal that most numerical features exhibit stable and consistent distributional patterns between training and test datasets, with no evidence of extreme outliers or abrupt distributional shifts. Notable characteristics include strong zero-inflation in Balance, class imbalance in several binary features, and a uniform distribution for EstimatedSalary. These patterns provide a clear view of the data landscape and support further analysis of model input integrity.
Figures
The Mutual Information test evaluates the statistical dependency between each feature and the target variable to quantify feature relevance for model training. The results are presented as normalized mutual information scores (ranging from 0 to 1) for both the development and test datasets, with a threshold of 0.01 indicated for interpretability. Bar plots display the relative importance of each feature, highlighting the distribution and magnitude of information content across variables.
Key insights:
NumOfProducts consistently dominates feature relevance: NumOfProducts exhibits the highest mutual information score in both development (≈0.105) and test (≈0.127) datasets, substantially exceeding all other features.
Majority of features show low information content: Most features register mutual information scores near or below the 0.01 threshold, particularly in the test dataset, where several features (Tenure, HasCrCard, EstimatedSalary, Geography_Spain) display scores at or near zero.
Score distribution is highly skewed: The mutual information scores are concentrated in a small subset of features, with a steep drop-off after the top one or two variables, indicating a non-uniform distribution of predictive power.
Notable variation in feature ranking across datasets: Some features, such as Balance and CreditScore, show increased mutual information in the test dataset compared to development, while others (IsActiveMember, HasCrCard) decrease or fall below the threshold.
The mutual information analysis reveals that predictive power is concentrated in a limited number of features, with NumOfProducts consistently providing the highest information content across both datasets. The majority of features contribute minimal or negligible information, as indicated by their low or near-zero scores. The distribution of mutual information is highly skewed, and there are observable shifts in feature relevance between development and test datasets, reflecting potential changes in feature-target relationships or sample composition.
Parameters:
{
"min_threshold": 0.01
}
Figures
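Feature-target dependency of this kind can be estimated with scikit-learn's `mutual_info_classif`. The sketch below is a hedged illustration on synthetic data, where a hypothetical `NumOfProducts` feature fully determines the target and a noise feature carries no information; the scores here are raw nats, not the normalized 0–1 scale used in the plots above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
n = 500
# Hypothetical features: NumOfProducts drives the target, Noise does not.
num_of_products = rng.integers(1, 5, size=n)
noise = rng.normal(size=n)
y = (num_of_products >= 3).astype(int)

X = pd.DataFrame({"NumOfProducts": num_of_products, "Noise": noise})
scores = mutual_info_classif(X, y, discrete_features=[True, False],
                             random_state=0)
scores = pd.Series(scores, index=X.columns)
print(scores)
```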
The Pearson Correlation Matrix test evaluates the linear dependency between all pairs of numerical variables in the dataset by calculating Pearson correlation coefficients and visualizing them in a heat map. The resulting matrices for both the development (train) and test datasets display the magnitude and direction of correlations, with coefficients ranging from -1 to 1. Correlation values above 0.7 (absolute) are highlighted to indicate high linear dependency, while the color scale provides an at-a-glance overview of the correlation structure across variables.
Key insights:
No high correlations detected: All off-diagonal correlation coefficients in both development and test datasets are below the 0.7 threshold, indicating an absence of strong linear relationships between variable pairs.
Consistent correlation structure across splits: The correlation patterns and magnitudes are stable between the development and test datasets, with the highest observed correlations (e.g., Balance and Geography_Germany at 0.41) remaining moderate and consistent.
Low risk of multicollinearity: The lack of high-magnitude correlations suggests minimal redundancy among input variables, reducing the risk of multicollinearity affecting model estimation or interpretability.
The correlation analysis demonstrates that the dataset's numerical variables are largely independent, with no evidence of strong linear dependencies or redundancy. The observed correlation structure is stable across both development and test datasets, supporting the integrity of the feature set for modeling purposes.
Figures
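The correlation matrix and the 0.7 highlight rule can be approximated with `DataFrame.corr`. The sketch below uses a small hypothetical frame (values chosen so one pair is strongly correlated, unlike the actual results above) to show the flagging mechanics.

```python
import pandas as pd

# Hypothetical numerical frame; Geography_Germany is a one-hot column.
df = pd.DataFrame({
    "Balance": [0.0, 120000.0, 80000.0, 0.0, 150000.0],
    "Geography_Germany": [0, 1, 1, 0, 1],
    "CreditScore": [650, 700, 580, 620, 710],
})

corr = df.corr(method="pearson")
# Flag off-diagonal coefficients with |r| above the 0.7 threshold
# (the diagonal is excluded by the < 1.0 check; pairs with |r| exactly
# 1.0 would need explicit handling in a real implementation).
high = (corr.abs() > 0.7) & (corr.abs() < 1.0)
print(corr.round(2))
```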
The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, which may indicate redundancy or multicollinearity. The results table lists the top ten strongest correlations for both the training and test datasets, reporting the Pearson correlation coefficient and a Pass/Fail status based on a threshold of 0.3. Correlation coefficients above this threshold are marked as "Fail," signaling higher-than-acceptable linear association between the respective feature pairs.
Key insights:
Two feature pairs exceed correlation threshold: In both the training and test datasets, the pairs (Balance, Geography_Germany) and (Geography_Germany, Geography_Spain) display absolute correlation coefficients above the 0.3 threshold, with values ranging from 0.3601 to 0.4144, resulting in a "Fail" status for these pairs.
All other feature pairs below threshold: The remaining feature pairs in both datasets have absolute correlation coefficients below 0.3, receiving a "Pass" status and indicating no further high linear associations among the top correlations.
Consistency across datasets: The same feature pairs exceed the threshold in both the training and test datasets, with similar coefficient magnitudes, indicating stable correlation structure between these variables across data splits.
The results indicate that the majority of feature pairs exhibit low to moderate linear relationships, with only two pairs consistently surpassing the defined correlation threshold in both datasets. The observed high correlations are limited to specific geography-related and balance features, while all other top feature pairs remain below the threshold, suggesting limited risk of widespread multicollinearity within the evaluated features.
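Ranking the strongest pairs, as in the results table, amounts to unstacking the upper triangle of the correlation matrix and sorting by absolute coefficient. The sketch below is an approximation on hypothetical data (the one-hot pair is perfectly anti-correlated by construction), not ValidMind's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Balance": [0.0, 120000.0, 80000.0, 0.0, 150000.0, 60000.0],
    "Geography_Germany": [0, 1, 1, 0, 1, 0],
    "Geography_Spain": [1, 0, 0, 1, 0, 1],
    "Tenure": [3, 6, 2, 7, 5, 6],
})
max_threshold = 0.3

corr = df.corr().abs()
# Keep the upper triangle only so each pair appears exactly once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
report = upper.stack().sort_values(ascending=False).to_frame("abs_corr")
report["Pass/Fail"] = np.where(report["abs_corr"] > max_threshold,
                               "Fail", "Pass")
print(report)
```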
validmind.model_validation.ModelMetadata
Model Metadata
The ModelMetadata test compares the metadata of different models to assess consistency in architecture, framework, framework version, and programming language. The summary table presents metadata for two models, including their modeling technique, framework, framework version, and programming language. Both models are identified as using the SKlearnModel technique, the sklearn framework, version 1.8.0, and Python as the programming language.
Key insights:
Consistent modeling technique across models: Both models are classified as SKlearnModel, indicating uniformity in modeling approach.
Identical framework and version: Both models utilize the sklearn framework, version 1.8.0, ensuring compatibility in software dependencies.
Uniform programming language: Python is used for both models, supporting consistency in codebase and deployment environment.
The metadata comparison reveals complete alignment across all evaluated fields for the two models. No discrepancies or inconsistencies are observed in modeling technique, framework, framework version, or programming language. This uniformity supports streamlined model management and integration.
Tables

| model | Modeling Technique | Modeling Framework | Framework Version | Programming Language |
|---|---|---|---|---|
| log_model_champion | SKlearnModel | sklearn | 1.8.0 | Python |
| rf_model | SKlearnModel | sklearn | 1.8.0 | Python |
The Model Parameters test extracts and displays all configuration parameters for each model to support transparency and reproducibility. The results present a structured table listing parameter names and their corresponding values for both the logistic regression model (log_model_champion) and the random forest model (rf_model). Each parameter is shown alongside its assigned value, providing a comprehensive snapshot of the model configurations at the time of testing.
Key insights:
Explicit parameterization for both models: All parameters for log_model_champion and rf_model are explicitly listed, including regularization, solver, and iteration settings for the logistic regression model, and tree construction, sampling, and splitting criteria for the random forest model.
Non-default penalty and solver in logistic regression: The logistic regression model uses an l1 penalty with the liblinear solver, indicating a configuration that supports feature selection through regularization.
Random forest uses 50 estimators and fixed random state: The random forest model is configured with 50 trees (n_estimators=50) and a fixed random seed (random_state=42), supporting reproducibility and controlled variance.
Standard splitting and impurity settings in random forest: The random forest model applies the gini criterion, sqrt for max_features, and default values for minimum samples and impurity thresholds, reflecting standard tree growth parameters.
The extracted parameter set provides a transparent and reproducible record of model configurations for both the logistic regression and random forest models. The use of explicit regularization and solver choices in the logistic regression model, along with reproducibility controls and standard tree settings in the random forest model, collectively document the operational setup and support systematic auditing of model behavior.
Tables

| model | Parameter | Value |
|---|---|---|
| log_model_champion | C | 1 |
| log_model_champion | dual | False |
| log_model_champion | fit_intercept | True |
| log_model_champion | intercept_scaling | 1 |
| log_model_champion | max_iter | 100 |
| log_model_champion | penalty | l1 |
| log_model_champion | solver | liblinear |
| log_model_champion | tol | 0.0001 |
| log_model_champion | verbose | 0 |
| log_model_champion | warm_start | False |
| rf_model | bootstrap | True |
| rf_model | ccp_alpha | 0.0 |
| rf_model | criterion | gini |
| rf_model | max_features | sqrt |
| rf_model | min_impurity_decrease | 0.0 |
| rf_model | min_samples_leaf | 1 |
| rf_model | min_samples_split | 2 |
| rf_model | min_weight_fraction_leaf | 0.0 |
| rf_model | n_estimators | 50 |
| rf_model | oob_score | False |
| rf_model | random_state | 42 |
| rf_model | verbose | 0 |
| rf_model | warm_start | False |
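For scikit-learn models, a parameter table like this comes from `get_params()`. The sketch below re-creates the two configurations listed above (as hypothetical stand-ins for the actual fitted models) and extracts their parameters.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical re-creations of the two configurations listed above.
log_model = LogisticRegression(penalty="l1", solver="liblinear", C=1,
                               max_iter=100)
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)

# get_params() is the source of the parameter table.
log_params = log_model.get_params()
rf_params = rf_model.get_params()
print(log_params["penalty"], log_params["solver"], rf_params["n_estimators"])
```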
validmind.model_validation.sklearn.ROCCurve
ROC Curve
The ROC Curve test evaluates the binary classification performance of the log_model_champion by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under the Curve (AUC) for both the training and test datasets. The resulting plots display the trade-off between the true positive rate and false positive rate across all classification thresholds, with the AUC providing a summary measure of the model's discriminative ability. The ROC curves for both datasets are compared against a baseline representing random classification (AUC = 0.5).
Key insights:
AUC indicates moderate discriminative power: The AUC is 0.69 on the training dataset and 0.66 on the test dataset, both above the random baseline of 0.5, indicating the model has moderate ability to distinguish between classes.
Consistent performance across datasets: The small difference in AUC between training and test datasets suggests stable model behavior and limited overfitting.
ROC curves consistently above random line: Both ROC curves remain above the diagonal line representing random classification, confirming the model's predictive value across thresholds.
The ROC Curve test results demonstrate that log_model_champion achieves moderate classification performance, with AUC values consistently above the random baseline on both training and test datasets. The close alignment of AUC scores across datasets indicates stable generalization, and the ROC curves confirm the model's ability to provide meaningful discrimination between classes throughout the range of possible thresholds.
Figures
2026-03-12 20:51:53,678 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve does not exist in model's document
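The AUC values reported above summarize ranking quality: AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A dependency-free sketch of that equivalence, using synthetic scores rather than the model's actual outputs:

```python
# Compute ROC AUC as the Mann-Whitney U statistic:
# the fraction of (positive, negative) pairs where the positive
# example receives the higher score (ties count as half a win).
def roc_auc(y_true, y_score):
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Synthetic example: 3 of the 4 positive/negative pairs are ranked correctly
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
```

An AUC of 0.5 corresponds to random ranking (the diagonal baseline in the plots), and 1.0 to perfect separation, which is why the reported 0.66–0.69 range reads as moderate discriminative power.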
validmind.model_validation.sklearn.MinimumROCAUCScore
Minimum ROC AUC Score
The Minimum ROC AUC Score test evaluates whether the model's multiclass ROC AUC score meets or exceeds a specified minimum threshold, providing an assessment of the model's ability to distinguish between classes. The results table presents ROC AUC scores for both the training and test datasets, alongside the threshold value and pass/fail status for each dataset. Both datasets are evaluated against a threshold of 0.5, with the observed scores and outcomes reported for each.
Key insights:
ROC AUC scores exceed minimum threshold: Both the training (0.6867) and test (0.6634) datasets register ROC AUC scores above the 0.5 threshold.
Consistent pass status across datasets: The test is marked as "Pass" for both the train and test datasets, indicating consistent model performance relative to the defined criterion.
Moderate separation between classes: ROC AUC values in the range of 0.66–0.69 indicate moderate ability of the model to distinguish between classes on both datasets.
The results demonstrate that the model achieves ROC AUC scores above the specified minimum threshold on both training and test datasets, indicating moderate discriminatory power. The consistent pass status across datasets reflects stable model performance with respect to this metric.
Parameters:
{
"min_threshold": 0.5
}
Tables
| Dataset | Score | Threshold | Pass/Fail |
|---|---|---|---|
| train_dataset_final | 0.6867 | 0.5 | Pass |
| test_dataset_final | 0.6634 | 0.5 | Pass |
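The pass/fail logic behind a minimum-score test like this one is a simple threshold comparison per dataset. A minimal sketch (function name is hypothetical; the scores and threshold are the values from the table above):

```python
# Compare each dataset's score against a minimum threshold and
# return one result row per dataset, mirroring the table above.
def minimum_score_check(scores, min_threshold=0.5):
    return [
        {
            "Dataset": name,
            "Score": score,
            "Threshold": min_threshold,
            "Pass/Fail": "Pass" if score >= min_threshold else "Fail",
        }
        for name, score in scores.items()
    ]

rows = minimum_score_check(
    {"train_dataset_final": 0.6867, "test_dataset_final": 0.6634},
    min_threshold=0.5,
)
for row in rows:
    print(row)
```

The `min_threshold` parameter corresponds to the `{"min_threshold": 0.5}` shown in the Parameters block, so tightening the criterion is a one-argument change.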
2026-03-12 20:51:58,512 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumROCAUCScore does not exist in model's document
In summary
In this final notebook, you learned how to:
Implement your own custom tests as plain Python functions
Run custom tests with the ValidMind Library and log their results
Include custom test results as additional evidence in your validation report
With our ValidMind for model validation series of notebooks, you learned how to validate a model end-to-end with the ValidMind Library by running through some common scenarios in a typical model validation setting:
Verifying the data quality steps performed by the model development team
Independently replicating the champion model's results and conducting additional tests to assess performance, stability, and robustness
Setting up test inputs and a challenger model for comparative analysis
Running validation tests, analyzing results, and logging artifacts to ValidMind
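As covered in this series, a custom test is just a Python function whose inputs, parameters, and return values the ValidMind Library can interpret. The sketch below shows only the function shape using hypothetical names; registering it with the library (the `@vm.test` decorator) is covered in the Implement custom tests notebook:

```python
# Hypothetical custom test: takes predictions and labels, returns a
# simple table (list of dicts) that the library can render as evidence.
def confusion_counts(y_true, y_pred):
    """Return binary confusion-matrix counts as a table."""
    pairs = list(zip(y_true, y_pred))
    return [
        {"Metric": "True positives",  "Count": sum(t == 1 and p == 1 for t, p in pairs)},
        {"Metric": "False positives", "Count": sum(t == 0 and p == 1 for t, p in pairs)},
        {"Metric": "True negatives",  "Count": sum(t == 0 and p == 0 for t, p in pairs)},
        {"Metric": "False negatives", "Count": sum(t == 1 and p == 0 for t, p in pairs)},
    ]

table = confusion_counts([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
for row in table:
    print(row)
```

Because the function is ordinary Python, it can call external libraries or APIs as needed; the library only requires that it can understand the signature and handle the return values.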
Next steps
Work with your validation report
Now that you've logged all your test results and verified the work done by the model development team, head to the ValidMind Platform to wrap up your validation report. Continue to work on your validation report by:
Inserting additional test results: Click Link Evidence to Report under any section of 2. Validation in your validation report. (Learn more: Link evidence to reports)
Making qualitative edits to your test descriptions: Expand any linked evidence under Validator Evidence and click See evidence details to review and edit the ValidMind-generated test descriptions for quality and accuracy. (Learn more: Preparing validation reports)
Adding more findings: Click Link Finding to Report in any validation report section, then click + Create New Finding. (Learn more: Add and manage model findings)
Adding risk assessment notes: Click under Risk Assessment Notes in any validation report section to access the text editor and content editing toolbar, including an option to generate a draft with AI. Once generated, edit the draft so it adheres to your organization's requirements. (Learn more: Work with content blocks)
Assessing compliance: Under the Guideline for any validation report section, click ASSESSMENT and select the compliance status from the drop-down menu. (Learn more: Provide compliance assessments)
Collaborating with other stakeholders: Use the ValidMind Platform's real-time collaboration features to work seamlessly with the rest of your organization, including model developers. Propose suggested changes to the model documentation, work with versioned history, and use comments to discuss specific portions of the model documentation. (Learn more: Collaborate with others)
When your validation report is complete and ready for review, submit it for approval from the same ValidMind Platform where you made your edits and collaborated with the rest of your organization, ensuring transparency and a thorough model validation history. (Learn more: Submit for approval)
Learn more
Now that you're familiar with the basics, you can explore the following notebooks to deepen your understanding of how the ValidMind Library assists you in streamlining model validation: