ValidMind for model validation 4 — Finalize testing and reporting

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this last notebook, finalize the compliance assessment process and have a complete validation report ready for review.

This notebook will walk you through how to supplement ValidMind tests with your own custom tests and include them as additional evidence in your validation report. A custom test is any function that takes a set of inputs and parameters as arguments and returns one or more outputs, such as a table or a plot.

For a more in-depth introduction to custom tests, refer to our Implement custom tests notebook.
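To make this concrete, here's a minimal sketch of the pattern, shown as a hypothetical example for illustration only: the test ID my_custom_tests.ClassBalance and the class_balance function below are not part of this notebook's flow, and the sketch assumes the ValidMind Library is installed as shown in the setup section. The real custom test we register later in this notebook follows the same structure:

import pandas as pd
import validmind as vm

# Hypothetical custom test: a plain Python function registered with @vm.test
# that returns a supported output such as a table (DataFrame) or a plot
@vm.test("my_custom_tests.ClassBalance")
def class_balance(dataset):
    """Counts of each target class in the dataset, returned as a table."""
    counts = pd.Series(dataset.y).value_counts()  # dataset.y holds the true target values
    return pd.DataFrame({"class": counts.index, "count": counts.values})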

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals

Prerequisites

In order to finalize validation and reporting, you'll first need to have completed the previous notebooks in this series.

Setting up

This section should be very familiar to you now — as we performed the same actions in the previous two notebooks in this series.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model validation" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2026-01-10 02:09:24,992 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load the same sample Bank Customer Churn Prediction dataset that was used to develop the champion model, which we'll independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}
# Initialize the raw dataset for use in ValidMind tests
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)
import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
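
As a quick optional sanity check (not required for the rest of this notebook), you can confirm that the rebalancing worked and both classes now have the same number of rows:

# Optional sanity check: both classes should now have equal counts
print(balanced_raw_df["Exited"].value_counts())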

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test:

# Register the balanced data; 'balanced_raw_dataset' is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)
# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

High Pearson Correlation is designed to identify pairs of features within a dataset that exhibit strong linear relationships, with the primary purpose of detecting potential feature redundancy or multicollinearity. This is crucial for ensuring that the predictive model remains interpretable and robust, as highly correlated features can obscure the true impact of individual variables and may lead to overfitting or instability in model coefficients.

The test operates by calculating the Pearson correlation coefficient for every possible pair of features in the dataset. The Pearson correlation coefficient is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables, ranging from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 indicating no linear relationship. The test systematically computes these coefficients for all feature pairs, removes self-correlations and duplicate pairs, and then sorts the results by the absolute value of the coefficient. A pre-defined threshold, set at 0.3 in this case, is used to determine whether a pair is considered highly correlated. Each pair is assigned a "Pass" if the absolute value of the coefficient is below the threshold, or a "Fail" if it exceeds the threshold. The test then returns the top n pairs with the strongest correlations, providing a clear view of the most significant linear relationships present in the data.

The primary advantages of this test include its efficiency and transparency in highlighting linear dependencies between features. By surfacing the most strongly correlated pairs, the test enables data scientists and risk managers to quickly identify areas where feature redundancy or multicollinearity may be present, which is particularly valuable during the early stages of model development and feature selection. The clear tabular output, which includes the feature pairs, their correlation coefficients, and pass/fail status, supports straightforward interpretation and documentation. This approach helps ensure that the model remains interpretable and that the influence of each feature can be reliably assessed, which is especially important in regulated environments or when model explainability is a priority.

It should be noted that the test is limited to detecting linear relationships and does not capture more complex, nonlinear dependencies that may exist between features. The Pearson correlation coefficient is also sensitive to outliers, which can distort the measure and potentially exaggerate or mask true relationships. Additionally, the test only evaluates pairwise relationships and may not identify multicollinearity that arises from interactions among three or more features. High correlation coefficients, particularly those exceeding the threshold, signal a risk of redundancy or multicollinearity, which can undermine the stability and interpretability of the model. Care must be taken in interpreting these results, as the presence of high correlations does not necessarily imply causation or guarantee negative impacts on model performance, but it does warrant further investigation.

This test shows its results in a tabular format, where each row represents a unique pair of features from the dataset. The columns include the feature pair, the Pearson correlation coefficient (rounded to four decimal places), and a pass/fail status based on whether the absolute value of the coefficient exceeds the threshold of 0.3. The coefficient values range from -1 to 1, with positive values indicating direct relationships and negative values indicating inverse relationships. The table is sorted by the absolute value of the coefficient, so the strongest relationships appear at the top. For example, the pair (Age, Exited) has a coefficient of 0.3623 and is marked as "Fail," indicating that this pair exceeds the threshold and may contribute to multicollinearity. All other pairs have coefficients below the threshold and are marked as "Pass." The table provides a clear and concise view of the linear relationships present in the data, allowing users to quickly identify which pairs may require further attention. Notably, the coefficients in this result set range from 0.3623 to 0.0323, with only one pair exceeding the threshold, suggesting that most feature pairs do not exhibit strong linear relationships.

The test results reveal the following key insights:

  • Only One Feature Pair Exceeds Correlation Threshold: The pair (Age, Exited) has a Pearson correlation coefficient of 0.3623, surpassing the threshold of 0.3 and resulting in a "Fail" status, indicating a notable linear relationship between these features.
  • All Other Feature Pairs Pass the Correlation Test: The remaining nine feature pairs have coefficients ranging from -0.2009 to 0.0323, all below the 0.3 threshold, and are marked as "Pass," suggesting limited linear dependency among these pairs.
  • Correlation Coefficients Show Both Positive and Negative Relationships: The coefficients include both positive and negative values, such as (IsActiveMember, Exited) at -0.2009 and (Balance, Exited) at 0.1445, indicating the presence of both direct and inverse linear relationships, though none are particularly strong.
  • Distribution of Correlation Strengths is Narrow: Aside from the (Age, Exited) pair, all other coefficients are relatively close to zero, with the majority falling between -0.2 and 0.05, reflecting a generally low level of linear association across most feature pairs.
  • No Evidence of Widespread Multicollinearity: The limited number of pairs exceeding the threshold and the low magnitude of most coefficients suggest that the dataset does not exhibit widespread multicollinearity among its features.

Based on these results, the dataset demonstrates a generally low level of linear correlation among its features, with only the (Age, Exited) pair exceeding the specified threshold for high correlation. This observation indicates that, with the exception of this single pair, the features are largely independent in terms of linear relationships, reducing the risk of multicollinearity affecting model interpretability or stability. The presence of both positive and negative coefficients further suggests a balanced mix of direct and inverse associations, but none are sufficiently strong to raise immediate concerns about redundancy or overfitting, aside from the one identified pair. The narrow distribution of coefficient values reinforces the overall independence of the features, supporting the suitability of the dataset for modeling purposes where feature independence is desirable. The results provide a clear and objective characterization of the linear relationships present, enabling informed decisions about feature selection and model design.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3623 Fail
(IsActiveMember, Exited) -0.2009 Pass
(Balance, NumOfProducts) -0.1709 Pass
(Balance, Exited) 0.1445 Pass
(Age, Balance) 0.0546 Pass
(NumOfProducts, Exited) -0.0541 Pass
(Age, NumOfProducts) -0.0503 Pass
(NumOfProducts, IsActiveMember) 0.0494 Pass
(CreditScore, IsActiveMember) 0.0345 Pass
(Tenure, EstimatedSalary) 0.0323 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3623 Fail
1 (IsActiveMember, Exited) -0.2009 Pass
2 (Balance, NumOfProducts) -0.1709 Pass
3 (Balance, Exited) 0.1445 Pass
4 (Age, Balance) 0.0546 Pass
5 (NumOfProducts, Exited) -0.0541 Pass
6 (Age, NumOfProducts) -0.0503 Pass
7 (NumOfProducts, IsActiveMember) 0.0494 Pass
8 (CreditScore, IsActiveMember) 0.0345 Pass
9 (Tenure, EstimatedSalary) 0.0323 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

High Pearson Correlation is designed to identify pairs of features within a dataset that exhibit strong linear relationships, with the primary purpose of detecting potential feature redundancy or multicollinearity. This is crucial for ensuring that the predictive model remains interpretable and robust, as highly correlated features can obscure the true impact of individual variables and may lead to overfitting or instability in model coefficients.

The test operates by calculating the Pearson correlation coefficient for every possible pair of features in the dataset. The Pearson correlation coefficient is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables, ranging from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear relationship. The test systematically computes these coefficients for all feature pairs, removes self-correlations and duplicate pairs, and then compares the absolute value of each coefficient to a predefined threshold, which in this case is set at 0.3. If the absolute value of the coefficient exceeds this threshold, the pair is flagged as potentially problematic due to high correlation. The test then presents the top n strongest correlations, regardless of whether they pass or fail the threshold, providing a transparent view of the most significant linear relationships in the data. This approach allows for early detection of multicollinearity, which can compromise model interpretability and performance.

The primary advantages of this test include its efficiency and transparency in surfacing linear dependencies between features. By providing a clear, ranked list of the strongest correlations, the test enables model developers and risk managers to quickly assess the extent of feature redundancy and make informed decisions about feature selection or engineering. The straightforward output format, which includes the feature pairs, their correlation coefficients, and pass/fail status, facilitates rapid review and documentation. This is particularly valuable in regulated environments or in early-stage model development, where understanding the structure of the data and potential sources of multicollinearity is essential for building robust, interpretable models.

It should be noted that the test is limited to detecting only linear relationships, as the Pearson correlation coefficient does not capture nonlinear dependencies or interactions among three or more variables. This means that important nonlinear associations or higher-order multicollinearity may go undetected. Additionally, the metric is sensitive to outliers, which can disproportionately influence the calculated coefficients and potentially misrepresent the true relationships between features. The test also relies on a fixed threshold, which, while configurable, may not be optimal for all datasets or modeling contexts. High correlation coefficients above the threshold are indicative of increased risk for multicollinearity, which can undermine the stability and interpretability of model estimates, but the absence of such coefficients does not guarantee the absence of all forms of redundancy or dependency.

This test shows its results in the form of a table, where each row represents a unique pair of features from the dataset. The columns include the feature pair (labeled as "Columns"), the Pearson correlation coefficient (labeled as "Coefficient"), and the pass/fail status based on whether the absolute value of the coefficient exceeds the threshold of 0.3. The coefficients are presented as decimal values, typically ranging from -1 to 1, with negative values indicating inverse relationships and positive values indicating direct relationships. In this particular output, all coefficients are well below the threshold, with the highest absolute value being -0.2009 for the pair (IsActiveMember, Exited). The table is sorted by the absolute value of the coefficient, displaying the ten strongest correlations in the dataset. Each pair is marked as "Pass," indicating that none of the observed correlations surpass the risk threshold. The range of coefficients spans from -0.2009 to -0.0243, suggesting generally weak linear relationships among the top feature pairs. Notably, the results do not show any pairs with coefficients close to the threshold, and both positive and negative relationships are present, though all are of low magnitude.

The test results reveal the following key insights:

  • No Feature Pairs Exceed Correlation Threshold: All observed Pearson correlation coefficients are below the threshold of 0.3, with the highest absolute value being -0.2009, indicating an absence of strong linear relationships among the top feature pairs.
  • Weak Linear Relationships Dominate: The coefficients for the top ten feature pairs range from -0.2009 to -0.0243, reflecting generally weak associations and suggesting low risk of linear multicollinearity in the dataset.
  • Balanced Distribution of Positive and Negative Correlations: Both positive and negative coefficients are present, with the most negative being (IsActiveMember, Exited) at -0.2009 and the most positive being (Tenure, EstimatedSalary) at 0.0323, indicating that neither direction of association is dominant.
  • No Redundant Feature Pairs Identified: Since all pairs pass the threshold, there is no evidence of feature redundancy based on linear correlation, supporting the interpretability and stability of subsequent modeling efforts.
  • Consistent Pass Status Across All Top Pairs: Every feature pair in the top ten is marked as "Pass," reinforcing the observation that the dataset does not exhibit problematic linear dependencies among its most strongly correlated features.

Based on these results, the dataset demonstrates a low degree of linear association among its features, as evidenced by the uniformly low Pearson correlation coefficients and the absence of any pairs exceeding the predefined threshold of 0.3. This pattern suggests that the risk of multicollinearity affecting model performance or interpretability is minimal within the scope of linear relationships. The distribution of coefficients, encompassing both weak positive and negative values, indicates that no single direction of association predominates, and the lack of any "Fail" status across the top pairs further supports the conclusion that the dataset is structurally sound with respect to linear feature redundancy. These observations collectively imply that the features are sufficiently independent in a linear sense, which is favorable for downstream modeling processes that assume low multicollinearity. The results provide a clear and objective characterization of the dataset's internal structure, supporting confidence in the reliability and interpretability of models developed using these features.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.2009 Pass
(Balance, NumOfProducts) -0.1709 Pass
(Balance, Exited) 0.1445 Pass
(NumOfProducts, Exited) -0.0541 Pass
(NumOfProducts, IsActiveMember) 0.0494 Pass
(CreditScore, IsActiveMember) 0.0345 Pass
(Tenure, EstimatedSalary) 0.0323 Pass
(CreditScore, Exited) -0.0256 Pass
(Balance, IsActiveMember) -0.0254 Pass
(CreditScore, HasCrCard) -0.0243 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
4312 644 8 106022.73 2 0 0 148727.42 0 True False False
2774 543 3 0.00 2 1 1 78915.68 0 False True True
2086 568 2 129177.01 2 0 1 104617.99 0 True False False
847 756 3 100717.85 3 1 1 73406.04 1 True False False
7316 507 9 118214.32 3 1 0 119110.03 1 True False True
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
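
If you'd like to confirm the 80/20 split before initializing the ValidMind datasets, here's a quick optional check:

# Optional: confirm the 80/20 split
print(f"Training rows: {train_df.shape[0]}, Test rows: {test_df.shape[0]}")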
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's import the champion model submitted by the model development team as a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning:

Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
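
The InconsistentVersionWarning above appears because the champion model was pickled with an older scikit-learn release (1.3.2, per the warning) than the one installed in this environment. If you prefer to avoid the warning, one option is to install the matching scikit-learn release before unpickling; this is only a suggestion, with the version taken from the warning message, and you may need to restart the kernel afterwards:

# Optional: install the scikit-learn version the champion model was pickled with
%pip install -q "scikit-learn==1.3.2"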

Train potential challenger model

We'll also train our random forest classification challenger model to see how it compares:

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, which can then be passed to other functions for analysis and tests on the data:

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)
# Assign predictions to Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Assign predictions to Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-01-10 02:10:25,942 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 02:10:25,944 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 02:10:25,944 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 02:10:25,946 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-10 02:10:25,948 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 02:10:25,948 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 02:10:25,949 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 02:10:25,950 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-10 02:10:25,952 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 02:10:25,973 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 02:10:25,973 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 02:10:25,994 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-10 02:10:25,996 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 02:10:26,007 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 02:10:26,007 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 02:10:26,018 - INFO(validmind.vm_models.dataset.utils): Done running predict()
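
With predictions assigned, you can retrieve them directly from the dataset objects. For example, here's a quick optional peek at the champion model's predicted classes on the test dataset, using the same .y_pred() accessor our custom test will rely on later:

# Optional: inspect the first few champion predictions on the test dataset
vm_test_ds.y_pred(model=vm_log_model)[:10]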

Implementing custom tests

Thanks to the model documentation, we know that the model development team implemented a custom test to further evaluate the performance of the champion model.

In a typical model validation situation, you would load a saved custom test provided by the model development team. To familiarize you with the process, in the following section we'll have you implement the same custom test yourself and make it available for reuse.

Want to learn more about custom tests?

Refer to our in-depth introduction to custom tests: Implement custom tests

Implement a custom inline test

Let's implement the same custom inline test that the model development team used in their performance evaluations: a test that calculates the confusion matrix for a binary classification model.

  • An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
  • You'll note that the custom test function is just a regular Python function that can include and require any Python library as you see fit.

Create a confusion matrix plot

Let's first create a confusion matrix plot using the confusion_matrix function from the sklearn.metrics module:

import matplotlib.pyplot as plt
from sklearn import metrics

# Get the predicted classes
y_pred = log_reg.predict(vm_test_ds.x)

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[False, True]
)
cm_display.plot()

Next, create a @vm.test wrapper that will allow you to create a reusable test. Note the following changes in the code below:

  • The function confusion_matrix takes two arguments, dataset and model, which are a VMDataset and a VMModel object, respectively.
    • VMDataset objects expose the dataset's true (target) values through the .y attribute.
    • VMDataset objects expose the predictions for a given model through the .y_pred() method.
  • The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.
  • The function body calculates the confusion matrix using the sklearn.metrics.confusion_matrix function as we just did above.
  • The function then returns the ConfusionMatrixDisplay.figure_ object — this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.
  • The @vm.test decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID my_custom_tests.ConfusionMatrix.
@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself

You can now run the newly created custom test on both the training and test datasets for both models using the run_test() function:

# Champion train and test
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_train_ds,vm_test_ds],
        "model" : [vm_log_model]
    }
).log()

Confusion Matrix Champion

Confusion Matrix: Champion is designed to provide a comprehensive summary of a classification model’s predictive performance by displaying the counts of true positives, true negatives, false positives, and false negatives in a tabular format. This test is primarily used to evaluate how well the model distinguishes between the two classes by comparing the predicted labels against the actual labels for a given dataset.

The test operates by constructing a two-by-two matrix where each cell represents a specific combination of predicted and actual class labels. The top-left cell shows the number of true negatives, indicating cases where the model correctly predicted the negative class. The top-right cell displays false positives, representing instances where the model incorrectly predicted the positive class. The bottom-left cell contains false negatives, which are cases where the model failed to identify the positive class. The bottom-right cell shows true positives, where the model correctly identified the positive class. From these four values, several key performance metrics can be derived: accuracy (the proportion of correct predictions out of all predictions), precision (the proportion of positive predictions that are actually correct), recall (the proportion of actual positives that are correctly identified), and the F1 score (the harmonic mean of precision and recall). These metrics typically range from 0 to 1, with higher values indicating better performance. The confusion matrix requires a labeled dataset with known true values and is applicable to binary classification tasks.

The primary advantages of this test include its ability to provide a detailed breakdown of model performance across all possible prediction outcomes, enabling a nuanced understanding of where the model excels and where it may misclassify. By presenting both the correct and incorrect predictions, the confusion matrix allows practitioners to identify specific types of errors, such as whether the model is more prone to false positives or false negatives. This level of granularity is particularly valuable in domains where the costs of different types of errors are not equal, such as in medical diagnosis or fraud detection. Additionally, the derived metrics offer a standardized way to compare models and track improvements over time, making the confusion matrix a foundational tool in model evaluation and selection.

It should be noted that the confusion matrix, while informative, has several limitations. It is inherently limited to binary or multiclass classification tasks and does not provide insight into the reasons behind misclassifications. The matrix does not account for class imbalance, which can lead to misleadingly high accuracy if one class dominates the dataset. Interpretation of the derived metrics can also be challenging in cases where the costs of false positives and false negatives differ significantly, as a single summary statistic may obscure important trade-offs. Furthermore, the confusion matrix does not capture the confidence of predictions or the calibration of the model, and it requires a sufficiently large and representative test set to yield reliable results. Care must be taken to contextualize the results within the broader modeling and business objectives.

This test shows the results in the form of two heatmap-style confusion matrix plots, one for the training dataset and one for the test dataset. Each plot is a 2x2 grid where the x-axis represents the predicted label (False or True) and the y-axis represents the true label (False or True). The color intensity of each cell corresponds to the count of observations, with a color bar indicating the scale. The training dataset confusion matrix shows 797 true negatives, 474 false positives, 434 false negatives, and 830 true positives. The test dataset confusion matrix displays 221 true negatives, 114 false positives, 113 false negatives, and 179 true positives. To interpret these plots, one reads across each row to see how the model’s predictions align with the actual class labels. The diagonal cells (top-left and bottom-right) represent correct predictions, while the off-diagonal cells (top-right and bottom-left) represent misclassifications. The values in each cell are absolute counts, and the color bar provides a visual cue for the magnitude of each count. Notably, the training set has higher absolute counts due to its larger size, while the test set provides a more realistic assessment of generalization. The distribution of values across the cells reveals the balance between correct and incorrect predictions for each class, and the relative proportions can be used to infer the model’s strengths and weaknesses.

The test results reveal the following key insights:

  • Model achieves balanced correct predictions across classes in both datasets: The confusion matrices for both training and test datasets show that the model produces a substantial number of correct predictions for both the negative and positive classes, with 797 true negatives and 830 true positives in training, and 221 true negatives and 179 true positives in testing.
  • False positive and false negative rates are comparable within each dataset: In the training set, the model records 474 false positives and 434 false negatives, while in the test set, there are 114 false positives and 113 false negatives, indicating a similar tendency to misclassify both classes.
  • Absolute counts decrease proportionally from training to test set: The reduction in all confusion matrix cell values from training to test reflects the smaller size of the test set, with the ratios between correct and incorrect predictions remaining consistent, suggesting stable model behavior across datasets.
  • No evidence of severe class imbalance in predictions: The counts for true negatives and true positives are of similar magnitude within each dataset, and the off-diagonal counts for false positives and false negatives are also closely matched, indicating that the model does not disproportionately favor one class over the other.
  • Generalization performance is consistent with training performance: The relative proportions of correct and incorrect predictions in the test set closely mirror those in the training set, suggesting that the model’s predictive behavior is stable and not subject to significant overfitting or underfitting.

Based on these results, the confusion matrix analysis demonstrates that the model maintains a consistent pattern of predictions across both the training and test datasets, with similar rates of correct and incorrect classifications for both the positive and negative classes. The close alignment of true positive and true negative counts, as well as the comparable false positive and false negative rates, indicates that the model does not exhibit a strong bias toward either class and is equally likely to misclassify in both directions. The proportional decrease in counts from training to test set, without a marked change in the distribution of errors, suggests that the model generalizes well and does not suffer from overfitting. The absence of pronounced class imbalance in the confusion matrices further supports the conclusion that the model’s predictions are balanced and reliable across the evaluated datasets. These observations collectively provide a clear and objective characterization of the model’s classification behavior, highlighting its stability and balanced error distribution in both development and evaluation contexts.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:champion:8d4e
ValidMind Figure my_custom_tests.ConfusionMatrix:champion:7462
2026-01-10 02:10:56,692 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:champion does not exist in model's document
# Challenger train and test
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_train_ds,vm_test_ds],
        "model" : [vm_rf_model]
    }
).log()

Confusion Matrix Challenger

Confusion Matrix: Challenger is designed to provide a comprehensive summary of a classification model’s predictive performance by displaying the counts of true positives, true negatives, false positives, and false negatives. This test is primarily used to evaluate how well a model distinguishes between two classes by comparing the predicted labels against the actual labels in a structured tabular format. The confusion matrix enables practitioners to assess not only overall accuracy but also the types and frequencies of classification errors, which is critical for understanding model behavior in binary classification tasks.

The test operates by constructing a two-by-two matrix where each cell represents a specific outcome of the model’s predictions: true positives (correctly predicted positives), true negatives (correctly predicted negatives), false positives (incorrectly predicted positives), and false negatives (incorrectly predicted negatives). The matrix is populated by comparing the model’s predicted labels to the true labels for each instance in the dataset. From these counts, several key performance metrics can be derived: accuracy (the proportion of correct predictions out of all predictions), precision (the proportion of true positives among all positive predictions), recall (the proportion of true positives among all actual positives), and the F1 score (the harmonic mean of precision and recall). These metrics typically range from 0 to 1, where values closer to 1 indicate better performance. High accuracy suggests the model is generally correct, while high precision and recall indicate effective identification of the positive class with minimal false alarms and missed detections, respectively.

The primary advantages of this test include its ability to provide a detailed breakdown of model performance across all possible prediction outcomes, making it especially useful for identifying specific types of errors. The confusion matrix is straightforward to interpret and serves as the foundation for calculating a range of diagnostic metrics, enabling practitioners to tailor their evaluation to the specific requirements of the application domain. For example, in domains where false positives are more costly than false negatives, the confusion matrix allows for targeted analysis of these error types. Additionally, the test is applicable to both balanced and imbalanced datasets, offering a clear visual and quantitative summary that supports informed decision-making regarding model selection and tuning.

It should be noted that the confusion matrix, while informative, has certain limitations. It provides a static snapshot of model performance on a specific dataset and does not account for the underlying distribution of the data or the costs associated with different types of errors. In cases of class imbalance, accuracy may be misleading, as a model could achieve high accuracy by simply predicting the majority class. The matrix also does not capture the confidence of predictions or the model’s calibration, which can be important in risk-sensitive applications. Furthermore, interpretation can become challenging when the matrix is extended to multiclass problems, as the number of cells increases and the relationships between error types become more complex. Care must be taken to contextualize the results within the broader modeling and business objectives to avoid over- or underestimating model performance.

This test shows the results in the form of two heatmap-style confusion matrices, one for the training dataset and one for the test dataset. Each matrix displays the counts of true negatives, false positives, false negatives, and true positives, with the axes labeled as “True label” and “Predicted label.” The color intensity in each cell corresponds to the count, with a color bar indicating the scale. In the training dataset matrix, the top-left cell (true negatives) contains 1271, the top-right cell (false positives) contains 0, the bottom-left cell (false negatives) contains 0, and the bottom-right cell (true positives) contains 1314. In the test dataset matrix, the top-left cell (true negatives) contains 254, the top-right cell (false positives) contains 91, the bottom-left cell (false negatives) contains 86, and the bottom-right cell (true positives) contains 216. These matrices allow for immediate visual assessment of the distribution of correct and incorrect predictions. The training matrix shows perfect separation with no misclassifications, while the test matrix reveals the presence of both false positives and false negatives. The color bar provides a reference for interpreting the magnitude of each cell, and the axes clarify the mapping between predicted and actual classes. The range of values in the training matrix extends up to 1314, while in the test matrix it reaches 254 for true negatives and 216 for true positives, with error counts in the double digits for misclassifications. Notable observations include the absence of errors in the training set and a non-negligible number of errors in the test set, suggesting a difference in model performance between the two datasets.

The test results reveal the following key insights:

  • Perfect Classification on Training Data: The model achieves flawless performance on the training dataset, with 1271 true negatives and 1314 true positives, and zero false positives or false negatives, indicating no misclassifications during training.
  • Generalization Gap Evident on Test Data: On the test dataset, the model records 254 true negatives and 216 true positives, but also 91 false positives and 86 false negatives, highlighting a reduction in predictive accuracy when applied to unseen data.
  • Balanced Error Distribution in Test Set: The test set errors are distributed relatively evenly between false positives (91) and false negatives (86), suggesting that the model does not exhibit a strong bias toward over- or under-predicting either class.
  • Magnitude of Misclassifications: The number of misclassifications in the test set is substantial relative to the number of correct predictions, with errors accounting for approximately 27% of the total test samples, which may impact the reliability of the model in practical applications.
  • Contrast Between Training and Test Performance: The stark difference between perfect training performance and notable test errors suggests potential overfitting, where the model has learned the training data too precisely and does not generalize as well to new data.

Based on these results, the model demonstrates a clear ability to perfectly classify the training data, as evidenced by the absence of any misclassifications in the training confusion matrix. However, the test confusion matrix reveals a significant generalization gap, with both false positives and false negatives present in similar proportions. This pattern indicates that while the model has learned the training data in detail, it does not maintain the same level of accuracy on unseen data, which is a characteristic behavior of overfitting. The balanced distribution of errors in the test set suggests that the model does not systematically favor one class over the other, but the overall error rate is notable and may affect the model’s practical utility. The results highlight the importance of evaluating model performance on independent test data to obtain a realistic assessment of predictive capability. The observed discrepancy between training and test results underscores the need to consider both sets of metrics when interpreting model behavior, as reliance on training performance alone would provide an overly optimistic view of the model’s effectiveness.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:challenger:b700
ValidMind Figure my_custom_tests.ConfusionMatrix:challenger:00a4
2026-01-10 02:11:24,020 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:challenger does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your validation report as part of your compliance assessment process within the ValidMind Platform.

Add parameters to custom tests

Custom tests can take parameters just like any other function. To demonstrate, let's modify the confusion_matrix function to take an additional parameter normalize that will allow you to normalize the confusion matrix:

@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model, normalize=False):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    if normalize:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred, normalize="all")
    else:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself

Pass parameters to custom tests

You can pass parameters to custom tests by providing a dictionary of parameters to the run_test() function.

  • The parameters will override any default parameters set in the custom test definition. Note that dataset and model are still passed as inputs.
  • Because dataset and model are VMDataset and VMModel objects, they are treated as inputs rather than parameters and are passed through the inputs argument.

Re-running and logging the custom confusion matrix with normalize=True for both models and our testing dataset looks like this:

# Champion with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_log_model]
    },
    params={"normalize": True}
).log()

Confusion Matrix Test Normalized Champion

Confusion Matrix: Test Normalized Champion is designed to provide a comprehensive overview of a classification model’s predictive performance by summarizing the relationship between actual and predicted class labels. The primary purpose of this test is to quantify the model’s ability to correctly identify positive and negative cases, as well as to highlight the types and frequencies of misclassifications. This is achieved by organizing the results into a matrix that displays the counts or proportions of true positives, true negatives, false positives, and false negatives, thereby enabling a holistic assessment of the model’s classification behavior.

The test operates by comparing the predicted labels generated by the model against the true labels from a labeled dataset. For each instance, the test records whether the prediction matches the actual class, and aggregates these outcomes into four categories: true positives (correctly predicted positives), true negatives (correctly predicted negatives), false positives (incorrectly predicted positives), and false negatives (incorrectly predicted negatives). In this instance, the confusion matrix is normalized, meaning each cell value represents the proportion of total predictions falling into that category, rather than raw counts. The normalization process allows for direct comparison across classes, regardless of class imbalance, and the resulting values range from 0 to 1, where higher values along the diagonal indicate better predictive accuracy. The matrix is typically visualized as a heatmap, with color intensity reflecting the magnitude of each cell, and is often accompanied by a color bar for reference.

The primary advantages of this test include its ability to provide a detailed, interpretable summary of model performance across all possible prediction outcomes. By breaking down the results into true and false positives and negatives, the confusion matrix enables practitioners to identify specific strengths and weaknesses in the model’s predictions, such as tendencies toward certain types of errors. The normalized format further enhances interpretability by allowing for meaningful comparisons between classes, even in the presence of class imbalance. This makes the confusion matrix particularly useful for evaluating models in domains where the costs of different types of errors vary, or where understanding the distribution of errors is critical for downstream decision-making.

It should be noted that the confusion matrix, while informative, has several limitations. First, it provides only a snapshot of model performance on a specific dataset and may not generalize to other data distributions. The matrix does not account for the relative costs or impacts of different types of errors, which may be important in certain applications. Additionally, the normalized values can obscure the absolute frequency of errors, potentially masking issues in cases of severe class imbalance. Interpretation can also be challenging when the matrix is not accompanied by additional metrics such as precision, recall, or F1 score, which provide more granular insights into model behavior. Finally, the confusion matrix does not capture the confidence of predictions or the underlying probability estimates, limiting its utility for models that output probabilistic scores.

This test shows a normalized confusion matrix presented as a heatmap, with the true labels on the vertical axis and the predicted labels on the horizontal axis. The matrix is divided into four cells: the top-left cell represents the proportion of true negatives (actual False, predicted False), the top-right cell shows the proportion of false positives (actual False, predicted True), the bottom-left cell indicates the proportion of false negatives (actual True, predicted False), and the bottom-right cell displays the proportion of true positives (actual True, predicted True). The color intensity of each cell corresponds to the proportion value, as indicated by the color bar on the right, which ranges from 0.19 to 0.34. The specific values in the matrix are 0.34 for true negatives, 0.19 for false positives, 0.19 for false negatives, and 0.28 for true positives. These values sum to 1, reflecting the normalized format. The heatmap allows for quick visual identification of where the model performs well (higher values on the diagonal) and where it makes errors (off-diagonal values). Notably, the true negative rate is the highest, followed by the true positive rate, while both types of errors are equally represented at 0.19. The visualization provides a clear, at-a-glance summary of the model’s classification tendencies and error distribution.

The test results reveal the following key insights:

  • True negatives are the most frequent outcome: The model correctly predicts negative cases 34% of the time, which is the highest proportion among all categories.
  • True positives are the second most common correct prediction: The model achieves a 28% rate of correctly identifying positive cases, indicating moderate sensitivity.
  • False positives and false negatives occur at equal rates: Both types of misclassification are observed at 19%, suggesting the model does not favor one type of error over the other.
  • Diagonal dominance is present but not overwhelming: The sum of correct predictions (true positives and true negatives) is 62%, while incorrect predictions (false positives and false negatives) account for 38%, indicating room for improvement in overall accuracy.
  • Normalized values facilitate class comparison: The use of normalized proportions allows for direct comparison between the model’s performance on positive and negative classes, independent of class distribution in the dataset.

Based on these results, the model demonstrates a moderate ability to distinguish between positive and negative cases, with correct predictions (true positives and true negatives) comprising just over three-fifths of all outcomes. The equal rates of false positives and false negatives indicate that the model’s errors are balanced between the two classes, rather than being skewed toward one type of misclassification. The higher proportion of true negatives suggests that the model is somewhat more effective at identifying negative cases than positive ones, but the difference is not substantial. The normalized confusion matrix format enables a clear assessment of these patterns, highlighting both the strengths and limitations of the model’s predictive behavior. The results suggest that while the model is capable of making correct predictions in a majority of cases, there is a significant proportion of misclassifications that may warrant further investigation, particularly if the costs of false positives and false negatives differ in the application context. The visualization provides a transparent and interpretable summary of the model’s performance characteristics on the test dataset.

Parameters:

{
  "normalize": true
}
            

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_normalized_champion:c120
2026-01-10 02:11:52,807 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_champion does not exist in model's document
# Challenger with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_rf_model]
    },
    params={"normalize": True}
).log()

Confusion Matrix Test Normalized Challenger

Confusion Matrix: Test Normalized Challenger is designed to provide a comprehensive overview of a classification model’s predictive performance by summarizing the relationship between actual and predicted class labels. The primary purpose of this test is to quantify the model’s ability to correctly identify both positive and negative cases, as well as to highlight the types and frequencies of misclassifications. This is achieved by organizing the results into a matrix that displays the counts or proportions of true positives, true negatives, false positives, and false negatives, thereby enabling a holistic assessment of the model’s classification behavior.

The test operates by comparing the predicted labels generated by the model against the true labels from a labeled dataset. Each prediction is categorized into one of four groups: true positive (correctly predicted positive), true negative (correctly predicted negative), false positive (incorrectly predicted positive), and false negative (incorrectly predicted negative). In this instance, the confusion matrix is normalized, meaning each cell value represents the proportion of the total predictions rather than raw counts, with values ranging from 0 to 1. This normalization facilitates direct comparison across different datasets or models, regardless of sample size. The matrix is typically visualized as a heatmap, where the axes correspond to the true and predicted labels, and the color intensity reflects the magnitude of each cell. The test also enables the calculation of key performance metrics such as accuracy (overall proportion of correct predictions), precision (proportion of positive predictions that are correct), recall (proportion of actual positives correctly identified), and F1 score (harmonic mean of precision and recall). Higher values along the diagonal indicate better model performance, while higher off-diagonal values suggest more frequent misclassifications.

The primary advantages of this test include its ability to provide a detailed and interpretable summary of model performance across all possible prediction outcomes. By presenting both correct and incorrect classifications, the confusion matrix allows for the identification of specific error types, such as whether the model is more prone to false positives or false negatives. The normalized format further enhances interpretability by expressing results as proportions, making it easier to compare performance across different datasets or models. This test is particularly useful in scenarios where the costs of different types of errors are not equal, as it enables targeted analysis of model behavior in relation to business or regulatory requirements. Additionally, the confusion matrix serves as a foundation for calculating a range of secondary metrics, supporting a nuanced understanding of model strengths and weaknesses.

It should be noted that the confusion matrix, while informative, has several limitations. The test does not provide insight into the underlying causes of misclassifications or the reasons for model errors, nor does it account for class imbalance unless explicitly normalized or supplemented with additional metrics. Interpretation can be challenging in cases where the dataset is highly imbalanced, as high accuracy may mask poor performance on minority classes. The matrix also does not capture the confidence of predictions or the model’s calibration, and it is limited to binary or multiclass classification tasks. Furthermore, the visualization may become less interpretable as the number of classes increases, and the test does not inherently address the impact of misclassification costs unless these are explicitly incorporated into the analysis. Care must be taken to contextualize the results within the broader modeling and business environment to avoid over- or underestimating model performance.

This test shows a normalized confusion matrix presented as a heatmap, with the true labels on the vertical axis and the predicted labels on the horizontal axis. The color intensity of each cell corresponds to the proportion of predictions falling into that category, as indicated by the accompanying color bar, which ranges from approximately 0.13 to 0.39. The matrix is divided into four cells: the top-left cell (0.39) represents the proportion of true negatives, the top-right cell (0.14) represents false positives, the bottom-left cell (0.13) represents false negatives, and the bottom-right cell (0.33) represents true positives. The sum of all cell values is 1, reflecting the normalized format. The diagonal cells (true negatives and true positives) are more prominent, indicating a higher proportion of correct predictions, while the off-diagonal cells (false positives and false negatives) are less intense, indicating fewer misclassifications. The heatmap provides a visual summary of the model’s classification behavior, with the color bar facilitating interpretation of the relative magnitude of each cell. Notable observations include the relatively balanced distribution between true negatives and true positives, and the lower but non-negligible rates of both types of misclassifications.

The test results reveal the following key insights:

  • Correct classifications dominate the results: The model achieves a true negative rate of 0.39 and a true positive rate of 0.33, indicating that the majority of predictions are correct, with correct classifications accounting for 72% of all predictions.
  • Misclassifications are present but limited: The false positive rate is 0.14 and the false negative rate is 0.13, showing that misclassifications occur in approximately 27% of cases, with both types of errors occurring at similar rates.
  • Balanced error distribution: The rates of false positives and false negatives are closely matched, suggesting that the model does not exhibit a strong bias toward over- or under-predicting the positive class.
  • Normalized format enables direct comparison: The use of normalized proportions allows for straightforward interpretation and comparison across different datasets or models, independent of sample size.
  • Diagonal dominance indicates effective discrimination: The higher values along the diagonal cells relative to the off-diagonal cells demonstrate that the model is able to effectively distinguish between the two classes in most cases.

Based on these results, the confusion matrix reveals that the model demonstrates a clear ability to correctly classify both positive and negative cases, with correct predictions accounting for nearly three-quarters of all outcomes. The similar rates of false positives and false negatives indicate that the model’s errors are distributed evenly across both classes, suggesting a balanced approach to classification without a pronounced tendency to favor one class over the other. The normalized presentation of results facilitates objective assessment and comparison, highlighting the model’s overall effectiveness while also drawing attention to the non-negligible proportion of misclassifications. The visual dominance of the diagonal cells underscores the model’s capacity for accurate discrimination, while the presence of off-diagonal values points to areas where further refinement may be possible. Collectively, these insights provide a detailed and nuanced understanding of the model’s classification behavior, supporting informed evaluation of its performance characteristics.

Parameters:

{
  "normalize": true
}
            

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_normalized_challenger:3041
2026-01-10 02:12:20,451 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_challenger does not exist in model's document

Use external test providers

Sometimes you may want to reuse the same set of custom tests across multiple models and share them with others in your organization, just as the model development team shared their tests with you in the example workflow featured in this series of notebooks. In this case, you can create an external custom test provider that loads custom tests from a local folder or a Git repository.

In this section, you will learn how to declare a local filesystem test provider that loads tests from a local folder by following these high-level steps:

  1. Create a folder of custom tests from existing inline tests (tests that exist in your active Jupyter Notebook)
  2. Save an inline test to a file
  3. Define and register a LocalTestProvider that points to that folder
  4. Run test provider tests
  5. Add the test results to your documentation

Create custom tests folder

Let's start by creating a new folder that will contain reusable custom tests from your existing inline tests.

The following code snippet will create a new my_tests directory in the current working directory if it doesn't exist, and clear out any previously saved tests:

tests_folder = "my_tests"

import os

# create tests folder
os.makedirs(tests_folder, exist_ok=True)

# remove existing tests
for f in os.listdir(tests_folder):
    # remove files and pycache
    if f.endswith(".py") or f == "__pycache__":
        os.system(f"rm -rf {tests_folder}/{f}")

After running the command above, confirm that a new my_tests directory was created successfully. For example:

~/notebooks/tutorials/model_validation/my_tests/

Save an inline test

The @vm.test decorator we used above in Implement a custom inline test to register one-off custom tests also adds a convenience method to the decorated function, so you can simply call <func_name>.save() to save the test to a Python file at a specified path.

While save() will get you started by creating the file and saving the function code with the correct name, it won't automatically include any imports, helper functions, or variables defined outside of the function that the test needs to run. To address this, pass in the optional imports argument to ensure the necessary imports are added to the saved file.

The confusion_matrix test requires the following additional imports:

import matplotlib.pyplot as plt
from sklearn import metrics

Let's pass these imports to the save() method to ensure they are included in the file with the following command:

confusion_matrix.save(
    # Save it to the custom tests folder we created
    tests_folder,
    imports=["import matplotlib.pyplot as plt", "from sklearn import metrics"],
)
2026-01-10 02:12:21,084 - INFO(validmind.tests.decorator): Saved to /home/runner/work/documentation/documentation/site/notebooks/EXECUTED/model_validation/my_tests/ConfusionMatrix.py! Be sure to add any necessary imports to the top of the file.
2026-01-10 02:12:21,085 - INFO(validmind.tests.decorator): This metric can be run with the ID: <test_provider_namespace>.ConfusionMatrix
# Saved from __main__.confusion_matrix
# Original Test ID: my_custom_tests.ConfusionMatrix
# New Test ID: <test_provider_namespace>.ConfusionMatrix
def ConfusionMatrix(dataset, model, normalize=False):

Register a local test provider

Now that your my_tests folder has a sample custom test, let's initialize a test provider that will tell the ValidMind Library where to find your custom tests:

  • ValidMind offers out-of-the-box test providers for local tests (tests in a folder) or a GitHub provider for tests in a GitHub repository.
  • You can also create your own test provider by creating a class that has a load_test method that takes a test ID and returns the test function matching that ID.
Want to learn more about test providers?

An extended introduction to test providers can be found in: Integrate external test providers
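For illustration, a provider only needs to satisfy that load_test contract. The sketch below is a hypothetical, simplified provider; the class name, folder layout, and loading convention are assumptions for the sake of the example, not the ValidMind implementation:

# Hypothetical sketch of a custom test provider implementing `load_test`.
# Assumes test IDs map to files by convention, e.g.
# "classification.ConfusionMatrix" -> <root>/classification/ConfusionMatrix.py
import importlib.util
import os


class MyFolderTestProvider:
    """Illustrative provider that loads a test function from a folder."""

    def __init__(self, root_folder):
        self.root_folder = root_folder

    def load_test(self, test_id):
        parts = test_id.split(".")
        path = os.path.join(self.root_folder, *parts) + ".py"

        # Import the module from the file and return the function whose
        # name matches the last part of the test ID
        spec = importlib.util.spec_from_file_location(test_id, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return getattr(module, parts[-1])

In practice, the built-in LocalTestProvider used below already covers this folder-based behavior, so a custom class like this sketch is only needed for non-standard sources, such as tests stored in a database or fetched over HTTP.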
Initialize a local test provider

For most use cases, using a LocalTestProvider that allows you to load custom tests from a designated directory should be sufficient.

The most important attribute for a test provider is its namespace. This is a string that will be used to prefix test IDs in model documentation. This allows you to have multiple test providers with tests that can even share the same ID, but are distinguished by their namespace.

Let's go ahead and load the custom tests from our my_tests directory:

from validmind.tests import LocalTestProvider

# initialize the test provider with the tests folder we created earlier
my_test_provider = LocalTestProvider(tests_folder)

vm.tests.register_test_provider(
    namespace="my_test_provider",
    test_provider=my_test_provider,
)
# `my_test_provider.load_test()` will be called for any test ID that starts with `my_test_provider`
# e.g. `my_test_provider.ConfusionMatrix` will look for a function named `ConfusionMatrix` in `my_tests/ConfusionMatrix.py` file
Run test provider tests

Now that we've set up the test provider, we can run any test that's located in the tests folder by using the run_test() method as with any other test:

  • For tests that reside in a test provider directory, the test ID will be the namespace specified when registering the provider, followed by the path to the test file relative to the tests folder.
  • For example, the Confusion Matrix test we created earlier will have the test ID my_test_provider.ConfusionMatrix. You could organize the tests in subfolders, say classification and regression, and the test ID for the Confusion Matrix test would then be my_test_provider.classification.ConfusionMatrix.

Let's go ahead and re-run the confusion matrix test with our testing dataset for our two models by using the test ID my_test_provider.ConfusionMatrix. This should load the test from the test provider and run it as before.

# Champion with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_log_model]
    }
).log()

Confusion Matrix Champion

Confusion Matrix: Champion is designed to provide a comprehensive summary of a classification model’s predictive performance by displaying the counts of true positives, true negatives, false positives, and false negatives in a tabular format. This test is primarily used to evaluate how well the model distinguishes between the positive and negative classes, offering a direct and interpretable view of the model’s strengths and weaknesses in classification tasks.

The test operates by comparing the predicted class labels generated by the model against the actual, or true, class labels from a labeled dataset. The confusion matrix is structured as a 2x2 table, where each cell represents a specific combination of predicted and actual outcomes: true positives (correctly predicted positives), true negatives (correctly predicted negatives), false positives (incorrectly predicted positives), and false negatives (incorrectly predicted negatives). From these four values, several key performance metrics can be derived, including accuracy (the proportion of total correct predictions), precision (the proportion of positive predictions that are correct), recall (the proportion of actual positives that are correctly identified), and the F1 score (the harmonic mean of precision and recall). Each of these metrics ranges from 0 to 1, where values closer to 1 indicate better performance. The confusion matrix thus provides a granular breakdown of model performance, allowing for the identification of specific types of errors and the calculation of multiple evaluation metrics.

The primary advantages of this test include its ability to present a holistic and interpretable summary of model performance in a single visualization, making it easy to identify not only the overall accuracy but also the specific types of errors the model is making. The confusion matrix is particularly useful in scenarios where the costs of different types of misclassifications are not equal, as it allows practitioners to assess the trade-offs between false positives and false negatives. Additionally, the derived metrics such as precision, recall, and F1 score provide deeper insights into the model’s behavior, especially in imbalanced datasets where accuracy alone may be misleading. The test’s visual format also facilitates communication of results to both technical and non-technical stakeholders.

It should be noted that the confusion matrix, while informative, has limitations. It is inherently tied to the threshold used for classification, meaning that different thresholds can yield different confusion matrices and associated metrics. This can make interpretation challenging if the optimal threshold is not well defined. Additionally, the confusion matrix does not provide information about the model’s calibration or the confidence of its predictions, nor does it account for the relative costs or risks associated with different types of errors unless explicitly incorporated into the analysis. In cases of highly imbalanced datasets, the confusion matrix may also obscure poor performance on the minority class if not interpreted alongside class distribution information. Finally, the test is limited to binary or multiclass classification tasks and is not applicable to regression or ranking problems.

This test shows a heatmap-style confusion matrix for the model labeled "log_model_champion" evaluated on the "test_dataset_final" dataset. The matrix is presented as a 2x2 grid, with the true class labels on the vertical axis and the predicted class labels on the horizontal axis. The four cells of the matrix display the counts of each outcome: the top-left cell (True Negative) contains 221 instances where the model correctly predicted the negative class, the top-right cell (False Positive) contains 114 instances where the model incorrectly predicted positive for a negative case, the bottom-left cell (False Negative) contains 113 instances where the model incorrectly predicted negative for a positive case, and the bottom-right cell (True Positive) contains 179 instances where the model correctly predicted the positive class. The color intensity of each cell corresponds to the count, with a color bar on the right indicating the scale, ranging from the lowest count (113) to the highest (221). This visualization allows for immediate assessment of the distribution of correct and incorrect predictions. The matrix provides the raw counts necessary to compute accuracy, precision, recall, and F1 score, and highlights the balance between the two types of errors (false positives and false negatives). Notably, the counts of false positives and false negatives are similar (114 and 113, respectively), suggesting a relatively balanced error profile, while the true negative count is the highest, indicating the model is more likely to correctly identify negative cases.

The test results reveal the following key insights:

  • Balanced Error Distribution Between False Positives and False Negatives: The model produces 114 false positives and 113 false negatives, indicating that it is equally likely to misclassify positive and negative cases, which suggests a balanced approach to error but also highlights that both types of misclassification are present in similar proportions.
  • Higher True Negative Rate: With 221 true negatives, the model demonstrates a stronger ability to correctly identify negative cases compared to positive cases, as the true positive count is lower at 179, which may reflect either class imbalance or a model bias toward the negative class.
  • Moderate True Positive Rate: The true positive count of 179, while substantial, is notably lower than the true negative count, suggesting that the model may be less effective at identifying positive cases, which could impact recall and overall sensitivity.
  • Overall Distribution Reflects Model’s Predictive Focus: The distribution of counts across the matrix cells, with the highest value in true negatives and similar values for false positives and false negatives, provides a clear picture of the model’s predictive tendencies and the types of errors it is most likely to make.

Based on these results, the confusion matrix for the "log_model_champion" model on the "test_dataset_final" dataset demonstrates that the model achieves a relatively balanced error profile, with nearly equal numbers of false positives and false negatives. The model is more effective at correctly identifying negative cases, as evidenced by the higher true negative count, while its ability to correctly identify positive cases is somewhat lower. This pattern suggests that the model may be slightly conservative in predicting the positive class, potentially prioritizing specificity over sensitivity. The similar counts of false positives and false negatives indicate that the model does not disproportionately favor one type of error over the other, which may be desirable in applications where the costs of both error types are comparable. The raw counts provided by the confusion matrix enable the calculation of key performance metrics and facilitate a nuanced understanding of the model’s classification behavior, supporting further analysis of trade-offs and model calibration as needed.

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:champion:84d5
2026-01-10 02:12:48,141 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:champion does not exist in model's document
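As a quick sanity check on the champion figures above, the derived metrics the description refers to can be computed directly from the four reported counts (a minimal sketch; the counts are taken from the matrix above):

# Derive the standard metrics from the champion confusion matrix counts above
tp, tn, fp, fn = 179, 221, 114, 113

accuracy = (tp + tn) / (tp + tn + fp + fn)          # ~0.64
precision = tp / (tp + fp)                          # ~0.61
recall = tp / (tp + fn)                             # ~0.61
f1 = 2 * precision * recall / (precision + recall)  # ~0.61

print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")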
# Challenger with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_rf_model]
    }
).log()

Confusion Matrix Challenger

Confusion Matrix: Challenger is designed to provide a comprehensive summary of a classification model’s predictive performance by displaying the counts of true positives, true negatives, false positives, and false negatives in a tabular format. This test is primarily used to evaluate how well a model distinguishes between two classes by comparing the predicted labels against the actual, true labels for a given dataset. The confusion matrix is a foundational diagnostic tool in classification tasks, offering a direct and interpretable view of the model’s strengths and weaknesses in terms of correct and incorrect predictions.

The test operates by constructing a two-by-two matrix where each cell represents a specific outcome of the model’s predictions: true positives (correctly predicted positives), true negatives (correctly predicted negatives), false positives (incorrectly predicted positives), and false negatives (incorrectly predicted negatives). The matrix is populated by comparing each prediction made by the model to the actual label in the test dataset. From these four values, several key performance metrics can be derived: accuracy (the proportion of total correct predictions), precision (the proportion of positive predictions that are actually correct), recall (the proportion of actual positives that are correctly identified), and the F1 score (the harmonic mean of precision and recall). Each metric provides a different perspective on model performance: accuracy ranges from 0 to 1 and reflects overall correctness, precision and recall also range from 0 to 1 and highlight the model’s ability to avoid false positives and false negatives, respectively, while the F1 score balances these two aspects. High values in these metrics generally indicate strong model performance, while lower values may signal issues with misclassification or class imbalance.

The primary advantages of this test include its ability to provide a granular and interpretable breakdown of model predictions, making it easy to identify specific types of errors and their frequencies. The confusion matrix is particularly useful in scenarios where the costs of different types of misclassification are not equal, as it allows practitioners to directly observe the trade-offs between false positives and false negatives. This level of detail supports informed decision-making regarding model selection, threshold tuning, and post-processing strategies. Additionally, the derived metrics offer a standardized way to compare models across different datasets or configurations, facilitating robust model evaluation and benchmarking. The visual representation of the confusion matrix further enhances interpretability, enabling stakeholders to quickly grasp the model’s predictive behavior.

It should be noted that the confusion matrix, while informative, has several limitations. It is inherently limited to binary or multiclass classification tasks and does not provide insight into the underlying reasons for misclassification. The test’s effectiveness can be diminished in the presence of significant class imbalance, where high accuracy may mask poor performance on minority classes. Additionally, the confusion matrix does not account for the probability or confidence of predictions, focusing solely on hard classification outcomes. Interpretation challenges may arise when comparing models across datasets with different class distributions, as the same confusion matrix values can have different implications depending on the context. Furthermore, the derived metrics, while useful, may not fully capture the operational impact of misclassifications in real-world applications, necessitating supplementary analyses for comprehensive model assessment.

This test shows a heatmap-style confusion matrix for the random forest model evaluated on the test_dataset_final dataset. The matrix is presented as a color-coded grid, with the true labels on the vertical axis and the predicted labels on the horizontal axis. The four cells of the matrix display the counts of each prediction outcome: the top-left cell (254) represents true negatives, the top-right cell (91) represents false positives, the bottom-left cell (86) represents false negatives, and the bottom-right cell (216) represents true positives. The color intensity corresponds to the magnitude of the counts, with a color bar on the right providing a reference scale. To interpret the matrix, one reads across each row to see how the model’s predictions align with the actual class: for example, among all true negatives, 254 were correctly predicted as negative, while 91 were incorrectly predicted as positive. Similarly, among all true positives, 216 were correctly identified, while 86 were missed. The matrix provides a direct visualization of the model’s classification behavior, highlighting both correct and incorrect predictions. The range of values spans from 86 to 254, with the highest count observed in the true negative cell. Notable observations include a relatively balanced distribution between true positives and true negatives, and a moderate number of both false positives and false negatives, suggesting that the model does not exhibit extreme bias toward either class.

The test results reveal the following key insights:

  • Balanced Correct Classification Across Classes: The model achieves 254 true negatives and 216 true positives, indicating that it is able to correctly identify both negative and positive cases with similar frequency, reflecting balanced performance across classes.
  • Moderate False Positive and False Negative Rates: There are 91 false positives and 86 false negatives, showing that the model makes a comparable number of errors in both directions, with neither type of misclassification dominating the results.
  • Distribution of Prediction Outcomes: The total number of negative cases (true negatives plus false positives) is 345, while the total number of positive cases (true positives plus false negatives) is 302, suggesting a slightly higher prevalence of negative cases in the test dataset.
  • Color Intensity Reflects Count Magnitude: The heatmap’s color gradient visually emphasizes the higher counts in the true negative and true positive cells, making it easy to identify where the model performs best and where errors are concentrated.
  • No Extreme Class Imbalance in Errors: The similar magnitudes of false positives and false negatives indicate that the model does not disproportionately misclassify one class over the other, supporting the observation of balanced predictive behavior.

Based on these results, the confusion matrix demonstrates that the random forest model exhibits balanced classification performance on the test dataset, with similar rates of correct identification for both positive and negative classes. The counts of true positives and true negatives are closely matched, and the numbers of false positives and false negatives are also similar, indicating that the model does not favor one class at the expense of the other. The distribution of prediction outcomes aligns with the underlying class distribution in the dataset, and the visual representation confirms that the majority of predictions are correct. The absence of extreme disparities in misclassification rates suggests that the model maintains consistent behavior across classes, without introducing significant bias. These observations collectively indicate that the model is capable of generalizing to both classes in the test data, providing a reliable basis for further evaluation using additional performance metrics or domain-specific criteria.

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:challenger:2bda
2026-01-10 02:13:14,537 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:challenger does not exist in model's document

Verify test runs

Our final task is to verify that all the tests provided by the model development team were run and reported accurately. Note the result_ids appended to the relevant test IDs, which indicate the dataset each test was run against.

Here, we'll specify all the tests we'd like to independently rerun in a dictionary called test_config. Note that inputs and input_grid expect the input_id of the dataset or model as the value, rather than the Python variable name we assigned earlier:

test_config = {
    # Run with the raw dataset
    'validmind.data_validation.DatasetDescription:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.DescriptiveStatistics:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.MissingValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.ClassImbalance:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.Duplicates:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.HighCardinality:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {
            'num_threshold': 100,
            'percent_threshold': 0.1,
            'threshold_type': 'percent'
        }
    },
    'validmind.data_validation.Skewness:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_threshold': 1}
    },
    'validmind.data_validation.UniqueRows:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TooManyZeroValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_percent_threshold': 0.03}
    },
    'validmind.data_validation.IQROutliersTable:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'threshold': 5}
    },
    # Run with the preprocessed dataset
    'validmind.data_validation.DescriptiveStatistics:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TabularDescriptionTables:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.MissingValues:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TargetRateBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'},
        'params': {'default_column': 'Exited'}
    },
    # Run with the training and test datasets
    'validmind.data_validation.DescriptiveStatistics:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.TabularDescriptionTables:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.ClassImbalance:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.UniqueRows:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.MutualInformation:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_threshold': 0.01}
    },
    'validmind.data_validation.PearsonCorrelationMatrix:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.HighPearsonCorrelation:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'max_threshold': 0.3, 'top_n_correlations': 10}
    },
    'validmind.model_validation.ModelMetadata': {
        'input_grid': {'model': ['log_model_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ModelParameters': {
        'input_grid': {'model': ['log_model_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ROCCurve': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final'], 'model': ['log_model_champion']}
    },
    'validmind.model_validation.sklearn.MinimumROCAUCScore': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final'], 'model': ['log_model_champion']},
        'params': {'min_threshold': 0.5}
    }
}

Then batch run and log our tests in test_config:

for test_id, config in test_config.items():
    print(test_id)
    try:
        # Each entry provides either `inputs` or `input_grid`, plus optional `params`;
        # pass through whichever keys are present for this test
        kwargs = {k: v for k, v in config.items() if k in ("inputs", "input_grid", "params")}
        vm.tests.run_test(test_id, **kwargs).log()
    except Exception as e:
        print(f"Error running test {test_id}: {e}")
validmind.data_validation.DatasetDescription:raw_data

Dataset Description Raw Data

Dataset Description: Raw Data is designed to provide a comprehensive analysis and statistical summary of each column in a machine learning model’s dataset. The primary purpose of this test is to deliver a detailed overview of the dataset’s structure, including the distribution, completeness, and uniqueness of data across all features, which is essential for understanding the data characteristics that underpin model development and evaluation.

The test operates by systematically examining each column in the dataset to infer its data type—such as numeric, categorical, or boolean—and then computing a suite of descriptive statistics for each. For every column, the test calculates the total number of entries, the count and proportion of missing values, and the number and proportion of unique values. For numeric columns, the test generates histograms to visualize the distribution of values, while for categorical and boolean columns, it computes frequency counts for each unique value. The methodology ensures that all relevant data types are supported, and any unsupported types are flagged. The metrics produced—such as count, missing percentage, and distinct value percentage—are designed to quantify data completeness, variability, and potential data quality issues. For example, a high missing percentage indicates a large proportion of absent data, which may affect model reliability, while a high distinct percentage in a categorical column may suggest high cardinality, potentially complicating pattern recognition. These metrics typically range from 0 to 1 (or 0% to 100%), with lower missing percentages and appropriate levels of distinctness generally considered favorable for modeling. The test aggregates all these insights into a summary table, providing a holistic view of the dataset’s readiness for modeling.

The primary advantages of this test include its ability to deliver a thorough and versatile summary of the dataset, encompassing both quantitative and qualitative aspects of each feature. By capturing key statistics such as counts, missing values, and unique value distributions, the test enables practitioners to quickly identify potential data quality issues, such as incomplete or highly variable columns, before model training begins. Its flexibility in handling various data types—numeric, categorical, boolean, and text—ensures broad applicability across different datasets and modeling scenarios. The inclusion of histograms and frequency counts further aids in visualizing data distributions, making it easier to detect irregularities, outliers, or skewness that could impact model performance. This comprehensive approach supports informed decision-making regarding data preprocessing, feature engineering, and model selection, ultimately contributing to more robust and reliable machine learning solutions.

It should be noted that the test has several limitations and potential risks. The computational cost can be significant, especially for large datasets with many columns, as generating detailed statistics and histograms for each feature requires substantial processing resources. The choice of histogram binning is arbitrary and may not always capture the true underlying distribution, potentially leading to misinterpretation of data patterns. Columns with unsupported data types are excluded from analysis, which may result in incomplete dataset characterization. Additionally, columns with all missing values are omitted from histogram computation, potentially masking data quality issues. Interpretation challenges may arise when columns exhibit high missing value ratios, extreme cardinality, or irregular distributions, as these characteristics can complicate downstream modeling. High missing percentages, unsupported types, and excessive unique values are all signs of elevated risk that may warrant further investigation.

This test shows the results in the form of a summary table, where each row represents a dataset column and each column in the table provides a specific metric: the feature name, its inferred type (numeric or categorical), the total count of non-missing entries, the absolute and percentage of missing values, the count of distinct values, and the percentage of distinct values relative to the total. The table allows users to quickly scan for completeness (via the missing value columns), uniqueness (via the distinct value columns), and data type distribution. For example, the “CreditScore” column is numeric with 8000 entries, no missing values, and 452 unique values, representing 5.65% of the total. The “Geography” column is categorical with three unique values, while “EstimatedSalary” is numeric with all 8000 values unique, indicating a continuous variable. The “Balance” column also shows high uniqueness (5088 distinct values, 63.6%), suggesting a wide range of values. All columns have zero missing values, as indicated by the missing count and percentage columns. The distinct percentage column helps identify columns with high cardinality, such as “EstimatedSalary,” which may require special handling in modeling. The table format is straightforward: each metric is clearly labeled, and the values are presented in absolute numbers and percentages, making it easy to interpret the scope and distribution of each feature. Notably, there are no columns with missing data, and the dataset includes a mix of numeric and categorical features, with varying degrees of uniqueness.

The test results reveal the following key insights:

  • Dataset Completeness Is High: All columns have 8000 entries with zero missing values, indicating full data availability across all features.
  • Feature Types Are Well-Defined: The dataset comprises both numeric and categorical columns, with clear type assignments for each feature.
  • Distinct Value Distribution Varies Widely: Numeric columns such as “EstimatedSalary” and “Balance” exhibit high uniqueness, with 100% and 63.6% distinct values respectively, while categorical columns like “Gender” and “HasCrCard” have only two unique values each.
  • Categorical Columns Show Low Cardinality: Features such as “Geography,” “Gender,” “HasCrCard,” “IsActiveMember,” and “Exited” have between two and three unique values, suggesting manageable levels of categorical diversity.
  • Numeric Columns Exhibit Diverse Ranges: Columns like “CreditScore” and “Age” have moderate numbers of unique values (452 and 69, respectively), while “Tenure” and “NumOfProducts” have lower uniqueness, reflecting their likely role as discrete or ordinal variables.
  • No Unsupported or Problematic Data Types Detected: All columns are successfully classified as either numeric or categorical, with no unsupported types present in the dataset.

Based on these results, the dataset demonstrates a high degree of completeness and well-structured feature types, with no missing values or unsupported data types detected. The distribution of unique values across columns reveals a mix of continuous, discrete, and categorical variables, each with appropriate levels of cardinality for their respective roles. Numeric features such as “EstimatedSalary” and “Balance” provide a broad range of values, supporting nuanced modeling, while categorical features maintain low cardinality, facilitating straightforward encoding and interpretation. The absence of missing data reduces the risk of data quality issues affecting model performance, and the clear delineation of feature types supports robust preprocessing and feature engineering. The observed patterns in distinct value percentages highlight the need for tailored handling of high-cardinality numeric features, but overall, the dataset’s structure and distribution are conducive to effective machine learning model development and evaluation.

Tables

Dataset Description

Name Type Count Missing Missing % Distinct Distinct %
CreditScore Numeric 8000.0 0 0.0 452 5.65
Geography Categorical 8000.0 0 0.0 3 0.04
Gender Categorical 8000.0 0 0.0 2 0.02
Age Numeric 8000.0 0 0.0 69 0.86
Tenure Numeric 8000.0 0 0.0 11 0.14
Balance Numeric 8000.0 0 0.0 5088 63.60
NumOfProducts Numeric 8000.0 0 0.0 4 0.05
HasCrCard Categorical 8000.0 0 0.0 2 0.02
IsActiveMember Categorical 8000.0 0 0.0 2 0.02
EstimatedSalary Numeric 8000.0 0 0.0 8000 100.00
Exited Categorical 8000.0 0 0.0 2 0.02
2026-01-10 02:13:41,535 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DatasetDescription:raw_data does not exist in model's document
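If you want to spot-check the reported column types and distinct counts independently, a couple of pandas calls on the raw dataframe should line up with the table above (a quick sketch, assuming raw_df still holds the raw dataset loaded earlier in this notebook):

# Independent spot-check of the column types and distinct counts reported above
print(raw_df.dtypes)     # inferred data type per column
print(raw_df.nunique())  # distinct values per column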
validmind.data_validation.DescriptiveStatistics:raw_data

Descriptive Statistics Raw Data

Descriptive Statistics: Raw Data is designed to provide a comprehensive summary of both numerical and categorical variables within a dataset, offering a detailed overview of the data’s distribution, central tendency, and variability. The primary purpose of this test is to facilitate an understanding of the dataset’s structure and characteristics, which is essential for interpreting model behavior and anticipating performance outcomes.

The test operates by applying established statistical functions to the dataset. For numerical variables, it uses a summary statistics approach, calculating the count of observations, mean (average value), standard deviation (a measure of spread or variability), minimum and maximum values, and key percentiles (25th, 50th, 75th, 90th, and 95th). These metrics collectively describe the central tendency, dispersion, and range of the data. The mean provides an average, while the median (50th percentile) offers a robust measure of central location, less sensitive to outliers. The standard deviation quantifies how much values deviate from the mean, with higher values indicating greater spread. Percentiles help identify the distribution of values across the dataset, highlighting skewness or concentration in certain ranges. For categorical variables, the test counts the total number of entries, the number of unique categories, the most frequent category (top value), its frequency, and the proportion this frequency represents relative to the total. This approach reveals the diversity and dominance of categories, which is crucial for understanding potential biases or imbalances. The typical range for these metrics is determined by the data itself, with counts ranging from zero to the dataset size, proportions from 0% to 100%, and numerical values spanning the observed data range. High dominance of a single category or significant differences between mean and median can indicate skewness or lack of diversity, which may impact model performance.

The primary advantages of this test include its ability to quickly and effectively summarize large and complex datasets, making it easier to identify patterns, anomalies, and potential data quality issues. By providing both central tendency and dispersion measures for numerical variables, the test enables users to detect outliers, skewness, and unusual distributions that could affect model training and inference. For categorical variables, the test highlights the presence of dominant categories or limited diversity, which are important for assessing the risk of bias or overfitting. This comprehensive overview is particularly useful in the early stages of model development, data validation, and regulatory review, as it ensures that all relevant aspects of the data are considered before proceeding to more advanced analyses. The test’s versatility allows it to be applied across a wide range of domains and data types, supporting robust and transparent model documentation.

It should be noted that while this test provides valuable high-level insights, it does not capture relationships or dependencies between variables, nor does it detect subtle patterns or correlations that may be critical for model performance. The test is limited to univariate analysis, meaning it examines each variable independently without considering interactions. As a result, it cannot identify multicollinearity, confounding factors, or complex data structures. Additionally, the test may not detect rare but important categories in categorical variables if they are overshadowed by dominant classes. Interpretation challenges may arise if the data contains significant outliers or is heavily skewed, as these can distort summary statistics such as the mean and standard deviation. Signs of high risk include large discrepancies between mean and median, high standard deviation relative to the mean, or a single category accounting for a large proportion of the data. These characteristics may indicate potential issues with data representativeness or suitability for modeling, and should prompt further investigation using complementary statistical tests.

This test shows the results in two tabular formats: one for numerical variables and one for categorical variables. The numerical variables table lists each variable alongside its count, mean, standard deviation, minimum, several percentiles (25th, 50th, 75th, 90th, 95th), and maximum values, providing a detailed snapshot of the distribution and spread for each feature. For example, the "CreditScore" variable has a mean of 650.16, a standard deviation of 96.85, and ranges from 350 to 850, with percentiles indicating the distribution across the population. The "Balance" variable shows a mean of 76,434.10 and a wide standard deviation of 62,612.25, with a minimum of 0 and a maximum of 250,898, suggesting a highly variable distribution. The categorical variables table presents each variable with its total count, number of unique values, the most frequent category, its frequency, and the percentage this represents. For instance, "Geography" has three unique values, with "France" being the most common at 50.12% of the data, while "Gender" is split between two categories, with "Male" comprising 54.95%. These tables allow for straightforward identification of central tendencies, variability, and category dominance, and can be read by examining each row for the variable of interest and interpreting the corresponding summary statistics. Notable observations include the presence of variables with high standard deviations, potential skewness in distributions, and categorical variables with dominant classes.

The test results reveal the following key insights:

  • Numerical variables exhibit wide ranges and varying degrees of dispersion: Variables such as "CreditScore" and "Balance" display substantial spreads, with "Balance" showing a particularly high standard deviation (62,612.25) relative to its mean (76,434.10), indicating significant variability and potential outliers.
  • Central tendency and skewness are evident in several variables: The "CreditScore" mean (650.16) is close to the median (652.0), suggesting a relatively symmetric distribution, while "Balance" has a median (97,264.0) notably higher than the mean, indicating right-skewness with a concentration of lower values and a long tail of higher balances.
  • Categorical variables show limited diversity and dominance of specific categories: "Geography" is dominated by "France" (50.12%), and "Gender" by "Male" (54.95%), highlighting potential imbalances that could influence model outcomes if not addressed.
  • Binary variables are well represented and balanced: Variables such as "HasCrCard" and "IsActiveMember" are binary, with means of 0.70 and 0.52, respectively, indicating a moderate split between categories and reducing the risk of extreme imbalance.
  • Percentile analysis reveals concentration and outlier presence: For "Age," the 95th percentile is 60, while the maximum is 92, suggesting a small number of older individuals that may act as outliers. Similarly, "Balance" and "EstimatedSalary" show large gaps between the 95th percentile and maximum values, further indicating the presence of extreme values.

Based on these results, the dataset demonstrates a mix of well-behaved and highly variable features, with numerical variables such as "CreditScore" and "Age" showing relatively symmetric distributions and moderate dispersion, while "Balance" and "EstimatedSalary" exhibit significant variability and right-skewness, as evidenced by high standard deviations and large differences between percentiles and maximum values. The categorical variables are characterized by limited diversity, with a single category accounting for over half of the observations in both "Geography" and "Gender," which may introduce bias or reduce the model’s ability to generalize across less-represented groups. Binary variables are reasonably balanced, minimizing the risk of model bias due to class imbalance. The presence of outliers in variables like "Balance" and "Age" is apparent from the percentile and maximum value comparisons, suggesting that further investigation or preprocessing may be warranted to mitigate their impact. Overall, the descriptive statistics provide a clear and detailed overview of the dataset’s structure, highlighting areas of stability, variability, and potential risk that are critical for understanding model behavior and informing subsequent modeling decisions.

Tables

Numerical Variables

Name Count Mean Std Min 25% 50% 75% 90% 95% Max
CreditScore 8000.0 650.1596 96.8462 350.0 583.0 652.0 717.0 778.0 813.0 850.0
Age 8000.0 38.9489 10.4590 18.0 32.0 37.0 44.0 53.0 60.0 92.0
Tenure 8000.0 5.0339 2.8853 0.0 3.0 5.0 8.0 9.0 9.0 10.0
Balance 8000.0 76434.0965 62612.2513 0.0 0.0 97264.0 128045.0 149545.0 162488.0 250898.0
NumOfProducts 8000.0 1.5325 0.5805 1.0 1.0 1.0 2.0 2.0 2.0 4.0
HasCrCard 8000.0 0.7026 0.4571 0.0 0.0 1.0 1.0 1.0 1.0 1.0
IsActiveMember 8000.0 0.5199 0.4996 0.0 0.0 1.0 1.0 1.0 1.0 1.0
EstimatedSalary 8000.0 99790.1880 57520.5089 12.0 50857.0 99505.0 149216.0 179486.0 189997.0 199992.0

Categorical Variables

Name Count Number of Unique Values Top Value Top Value Frequency Top Value Frequency %
Geography 8000.0 3.0 France 4010.0 50.12
Gender 8000.0 2.0 Male 4396.0 54.95
2026-01-10 02:14:20,305 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:raw_data does not exist in model's document
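The numerical summary above can likewise be reproduced with a single pandas call using the same percentiles (a quick sketch against raw_df; the results should closely match the table above):

# Reproduce the numerical summary with the same percentiles as the test output
raw_df.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95])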
validmind.data_validation.MissingValues:raw_data

✅ Missing Values Raw Data

Missing Values is designed to assess the quality of a dataset by quantifying the proportion of missing values present in each feature, with the primary purpose of ensuring that the missing value ratio for any column does not exceed a specified threshold. This test is essential for maintaining the integrity and reliability of data used in machine learning models, as excessive missing data can compromise predictive performance and model validity.

The test operates by systematically examining each column in the dataset, counting the number of missing entries—typically represented as "NaN" (Not a Number)—and calculating the percentage of missing values relative to the total number of records for that feature. For each column, the test compares this percentage to a predefined minimum threshold, which in this case is set to 1%. The methodology involves iterating through all features, aggregating missing value counts, and expressing these as a percentage of the total row count. The resulting metric for each column ranges from 0% (no missing values) to 100% (all values missing). A lower percentage indicates higher data completeness, while a higher percentage signals potential data quality issues. The test then assigns a "Pass" or "Fail" status to each feature based on whether the missing value percentage is below the threshold, providing a clear and interpretable measure of data quality for each variable.

The primary advantages of this test include its ability to quickly and precisely identify the presence and extent of missing data across all features in a dataset. By providing a granular, column-level breakdown, the test enables data scientists and model risk managers to pinpoint specific variables that may require further attention or remediation. This level of detail is particularly valuable in regulated environments or high-stakes modeling scenarios, where data quality is paramount. The test's straightforward approach and clear pass/fail criteria make it an effective first-line diagnostic tool for ensuring that datasets meet minimum standards for completeness before proceeding to more advanced modeling or analysis stages.

It should be noted that the test is limited in several respects. It does not diagnose the underlying causes of missing data, nor does it offer guidance on how to address or impute missing values. The test also does not account for non-standard representations of missingness, such as placeholder values like "-999" or "None," which may not be technically classified as missing but can have similar implications for model performance. Additionally, features with missing value percentages just below the threshold may still pose risks to model reliability, but these are not flagged by the test. High risk is indicated when any column exceeds the threshold or when multiple columns approach the threshold, potentially undermining the overall quality and robustness of the dataset.

This test shows the results in a tabular format, where each row corresponds to a feature in the dataset and columns display the feature name, the number of missing values, the percentage of missing values, and the pass/fail status based on the 1% threshold. The table provides a comprehensive overview of missing data distribution across all features, with the "Number of Missing Values" column indicating the absolute count, and the "Percentage of Missing Values (%)" column expressing this as a proportion of the total dataset size. The "Pass/Fail" column summarizes whether each feature meets the data quality criterion. In this particular result, all features—including "CreditScore," "Geography," "Gender," "Age," "Tenure," "Balance," "NumOfProducts," "HasCrCard," "IsActiveMember," "EstimatedSalary," and "Exited"—show zero missing values, corresponding to 0.0% missingness for each. Every feature receives a "Pass" status, indicating full data completeness. The table is straightforward to interpret, with all values falling at the lower bound of the possible range (0%), and no features approaching or exceeding the threshold.

The test results reveal the following key insights:

  • All features exhibit complete data: Every feature in the dataset has zero missing values, as indicated by both the absolute count and the percentage columns.
  • Uniformity in missing value distribution: There is no variation across features; all columns report 0.0% missingness, demonstrating consistent data quality throughout the dataset.
  • All features meet the threshold criteria: Each feature passes the test, with the "Pass/Fail" column showing "Pass" for all, confirming that the dataset satisfies the minimum completeness requirement.
  • No features require further missing value analysis: The absence of missing data across all features eliminates the need for additional investigation or remediation related to missingness.

Based on these results, the dataset demonstrates a high standard of data completeness, with no missing values detected in any feature. This uniformity across all columns ensures that the dataset is well-suited for downstream modeling activities, as there are no gaps that could introduce bias or reduce predictive accuracy. The consistent "Pass" status for every feature confirms that the dataset meets the predefined threshold for missing values, supporting the reliability and robustness of any subsequent analyses or model development. The absence of missing data also simplifies preprocessing requirements, as there is no need for imputation or special handling of incomplete records. Overall, the results indicate that the dataset is fully compliant with data quality expectations regarding missing values, providing a solid foundation for further modeling and risk assessment activities.

Parameters:

{
  "min_threshold": 1
}
            

Tables

| Column | Number of Missing Values | Percentage of Missing Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 0 | 0.0 | Pass |
| Geography | 0 | 0.0 | Pass |
| Gender | 0 | 0.0 | Pass |
| Age | 0 | 0.0 | Pass |
| Tenure | 0 | 0.0 | Pass |
| Balance | 0 | 0.0 | Pass |
| NumOfProducts | 0 | 0.0 | Pass |
| HasCrCard | 0 | 0.0 | Pass |
| IsActiveMember | 0 | 0.0 | Pass |
| EstimatedSalary | 0 | 0.0 | Pass |
| Exited | 0 | 0.0 | Pass |
2026-01-10 02:14:36,169 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:raw_data does not exist in model's document
validmind.data_validation.ClassImbalance:raw_data

✅ Class Imbalance Raw Data

Class Imbalance is designed to evaluate and quantify the distribution of target classes within a dataset used by a machine learning model, with the primary purpose of identifying whether any class is under-represented to a degree that could introduce bias into the model’s predictions. By systematically assessing the proportion of each class, the test helps ensure that the dataset is sufficiently balanced to support robust and fair model training, thereby reducing the risk of the model favoring the majority class at the expense of the minority class.

The test operates by calculating the frequency of each class in the target column, expressing these frequencies as percentages of the total dataset. It then compares each class’s percentage to a predefined minimum threshold, which is set to 10% by default but can be adjusted to suit specific use cases. The methodology involves counting the number of records for each class, dividing by the total number of records, and multiplying by 100 to obtain a percentage. Each class is then evaluated against the threshold: if a class’s percentage falls below the threshold, it is flagged as not meeting the balance criterion. The test outputs both a pass/fail status for each class and a visual representation of the class proportions. The percentage metric ranges from 0% to 100%, where higher values indicate greater representation. A class passing the threshold is generally considered adequately represented, while failing indicates potential imbalance that could affect model performance.
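
As a rough illustration (not the library's own code), the class proportions and threshold check can be approximated with pandas, assuming raw_df and the target column Exited:

min_percent_threshold = 10  # minimum acceptable share per class, in percent

class_pct = raw_df["Exited"].value_counts(normalize=True) * 100
imbalance = class_pct.to_frame("Percentage of Rows (%)")
imbalance["Pass/Fail"] = (
    imbalance["Percentage of Rows (%)"] >= min_percent_threshold
).map({True: "Pass", False: "Fail"})
print(imbalance)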

The primary advantages of this test include its ability to quickly and clearly identify under-represented classes, which is critical for preventing model bias and ensuring equitable performance across all classes. The test’s straightforward calculation and visual output make it accessible and easy to interpret, even for non-technical stakeholders. Its flexibility in allowing the minimum percentage threshold to be adjusted means it can be tailored to different domains and business requirements. The visual plot enhances interpretability by providing an immediate, intuitive understanding of class proportions, which is particularly useful in early-stage data exploration and model risk management. By quantifying the degree of imbalance, the test supports informed decision-making regarding data collection, preprocessing, and model selection strategies.

It should be noted that the test has several limitations. It may be less informative for datasets with a large number of classes, where some degree of imbalance is expected due to the natural distribution of the data. The sensitivity of the test to the chosen threshold means that setting the threshold too high could result in false positives for imbalance, while setting it too low might overlook meaningful disparities. The test does not account for the varying costs or impacts of misclassifying different classes, which can be significant in certain applications. Additionally, while the test identifies imbalances, it does not provide direct solutions for addressing them, such as resampling or reweighting techniques. The test is also limited to classification tasks and is not applicable to regression or clustering problems. High risk is indicated when any class falls below the threshold, signaling a need for further investigation into potential model bias.

This test shows the results in both tabular and graphical formats. The table presents each class in the target variable “Exited,” displaying the percentage of rows corresponding to each class and a pass/fail status based on the 10% minimum threshold. Specifically, the table lists two classes: class 0, which comprises 79.80% of the dataset, and class 1, which comprises 20.20%. Both classes are marked as “Pass,” indicating that each exceeds the minimum threshold. The accompanying bar plot visually depicts the proportion of each class, with the x-axis representing the class labels (0 and 1) and the y-axis showing the percentage of the dataset each class occupies. The height of each bar corresponds to the class’s relative frequency, making it easy to compare the representation of each class at a glance. The scale of the y-axis ranges from 0 to 1 (or 0% to 100%), and the plot title “Exited Class Imbalance” provides context. Notably, the plot reveals a clear majority for class 0, with class 1 being substantially less frequent but still above the threshold. There are no classes with extremely low representation, and the distribution, while imbalanced, does not trigger a fail status for either class.

The test results reveal the following key insights:

  • Both classes exceed the minimum threshold: The table shows that class 0 constitutes 79.80% and class 1 constitutes 20.20% of the dataset, with both surpassing the 10% minimum threshold and receiving a “Pass” status.
  • Class 0 is the majority class: The graphical plot and tabular data both indicate that class 0 is the dominant class, representing nearly four times the proportion of class 1.
  • Class 1 is under-represented but not critically so: Although class 1 is less frequent, its 20.20% share is well above the threshold, suggesting that while the dataset is imbalanced, it is not severely so according to the test’s criteria.
  • No classes are flagged as high risk: The absence of any “Fail” status in the results indicates that, by the test’s definition, there are no classes at immediate risk of being too rare to support reliable model training.
  • Visual representation confirms quantitative results: The bar plot provides a clear visual confirmation of the numerical data, with the disparity between classes easily observable but not extreme enough to warrant a fail.

Based on these results, the dataset exhibits a moderate class imbalance, with class 0 being the majority and class 1 the minority, but both classes are sufficiently represented to pass the test’s minimum threshold criterion. The quantitative table and visual plot together provide a comprehensive view of the class distribution, confirming that while the dataset is not perfectly balanced, it does not present an immediate risk of class under-representation according to the defined threshold. The results suggest that the model trained on this data is unlikely to be severely biased due to class imbalance alone, as both classes have adequate representation for the model to learn meaningful patterns. The clear majority-minority relationship between the classes is evident, but the absence of any class below the threshold indicates that the dataset meets the basic requirements for balanced class representation as defined by the test parameters. This supports the reliability of subsequent model evaluation and performance metrics, as both classes are present in sufficient numbers to allow for robust assessment of predictive accuracy and fairness.

Parameters:

{
  "min_percent_threshold": 10
}
            

Tables

Exited Class Imbalance

| Exited | Percentage of Rows (%) | Pass/Fail |
|---|---|---|
| 0 | 79.80% | Pass |
| 1 | 20.20% | Pass |

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:raw_data:4ecb
2026-01-10 02:15:00,123 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:raw_data does not exist in model's document
validmind.data_validation.Duplicates:raw_data

✅ Duplicates Raw Data

Duplicates:raw_data is designed to identify and quantify duplicate rows within a dataset, with the primary purpose of ensuring data quality and supporting model reliability by preventing the influence of redundant or repeated information. This test is a critical component of data preprocessing, as it helps to mitigate the risk of overfitting and ensures that the model is trained on unique, representative data rather than memorizing repeated entries.

The test operates by systematically scanning each row of the dataset to detect exact duplicates. If a specific text column is designated, the test focuses on that column; otherwise, it evaluates all feature columns collectively. The process involves comparing each row to all others to determine if any are identical across the relevant columns. The test then calculates two key metrics: the absolute number of duplicate rows and the percentage of duplicates relative to the total number of rows. These metrics provide a quantitative assessment of data redundancy. The number of duplicates is a straightforward count, while the percentage contextualizes this count within the size of the dataset, typically ranging from 0% (no duplicates) to 100% (all rows are duplicates). A low value is generally desirable, indicating high data uniqueness, whereas higher values may signal data collection or processing issues. The test is considered passed if the number of duplicates falls below a user-defined minimum threshold, which in this case is set to 1.
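
A minimal pandas sketch of the same idea, assuming raw_df and evaluating all feature columns collectively, might look like this (an approximation rather than ValidMind's implementation):

min_threshold = 1  # the test passes while the duplicate count stays below this value

n_duplicates = int(raw_df.duplicated().sum())  # exact duplicates across all columns
pct_duplicates = n_duplicates / len(raw_df) * 100

print(f"Number of Duplicates: {n_duplicates}")
print(f"Percentage of Rows (%): {pct_duplicates:.4f}")
print("Pass" if n_duplicates < min_threshold else "Fail")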

The primary advantages of this test include its ability to provide a clear, quantitative overview of data redundancy, which is essential for maintaining the integrity of the model training process. By reporting both the absolute and relative frequency of duplicate rows, the test enables practitioners to quickly assess whether the dataset is at risk of overfitting due to repeated entries. This is particularly valuable in scenarios where data is aggregated from multiple sources or where manual data entry may introduce unintentional repetition. The customizable threshold feature allows users to tailor the test to the specific requirements and risk tolerances of their modeling context, making it adaptable to a wide range of applications. Additionally, the test’s straightforward methodology ensures that results are easy to interpret and communicate to both technical and non-technical stakeholders.

It should be noted that the test is limited to detecting exact duplicates and does not account for semantically similar but non-identical entries, which may still introduce bias or redundancy into the model. The test also does not differentiate between benign duplicates—such as legitimate repeated observations—and problematic ones arising from data processing errors. As the dataset size increases, the computational cost of the test may become significant, potentially impacting performance for very large datasets. Furthermore, a high number or percentage of duplicates, as indicated by the test, may signal underlying issues with data collection or integration processes, which require further investigation beyond the scope of this test. Interpretation of the results should therefore consider the broader data context and the potential for both technical and process-driven sources of duplication.

This test shows the results in a tabular format, specifically presenting a table titled "Duplicate Rows Results for Dataset." The table contains two columns: "Number of Duplicates" and "Percentage of Rows (%)". The "Number of Duplicates" column reports the absolute count of duplicate rows detected in the dataset, while the "Percentage of Rows (%)" column expresses this count as a proportion of the total dataset size, scaled to a percentage. In this instance, the table displays a value of 0 for the number of duplicates and 0.0% for the percentage of rows, indicating that no duplicate rows were found in the dataset. This result is straightforward to interpret: the absence of duplicates means that every row in the dataset is unique with respect to the columns evaluated. The scale for both metrics starts at zero, with higher values indicating increasing levels of duplication. The table provides a clear, at-a-glance summary of the dataset’s redundancy status, with no notable outliers or regions of concern, as both metrics are at their minimum possible values.

The test results reveal the following key insights:

  • No Duplicate Rows Detected: The dataset contains zero duplicate rows, as indicated by the "Number of Duplicates" value of 0.
  • Complete Data Uniqueness: The "Percentage of Rows (%)" is 0.0%, confirming that every row in the dataset is unique and there is no redundancy.
  • Threshold Criteria Met: The result satisfies the minimum threshold parameter of 1, meaning the dataset passes the test for duplicate detection.
  • Consistent Data Quality: The absence of duplicates suggests that data collection and processing have maintained high standards of quality and integrity.

Based on these results, the dataset demonstrates a high degree of uniqueness and integrity, with no evidence of redundant or repeated entries that could compromise model training or evaluation. The zero values for both the absolute and relative measures of duplication indicate that the data preprocessing and collection processes have effectively prevented the introduction of duplicate rows. This supports the reliability of subsequent modeling efforts, as the risk of overfitting due to repeated data is minimized. The results also suggest that the dataset is well-suited for use in both classification and regression tasks, as the absence of duplicates ensures that the model will be exposed to a diverse set of examples. The clear and unambiguous outcome of the test provides confidence in the dataset’s suitability for further analysis and modeling, with no immediate need for additional duplicate remediation or investigation.

Parameters:

{
  "min_threshold": 1
}
            

Tables

Duplicate Rows Results for Dataset

| Number of Duplicates | Percentage of Rows (%) |
|---|---|
| 0 | 0.0 |
2026-01-10 02:15:24,405 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Duplicates:raw_data does not exist in model's document
validmind.data_validation.HighCardinality:raw_data

✅ High Cardinality Raw Data

High Cardinality is designed to assess the number of unique values present in categorical columns of a dataset, with the primary purpose of detecting high cardinality that may indicate potential overfitting or unwanted noise in the data. By quantifying the uniqueness within categorical features, the test helps to identify columns that could introduce complexity or instability into downstream modeling processes.

The test operates by first identifying all columns in the dataset that are classified as categorical. For each of these columns, it calculates two key metrics: the number of distinct values (n_distinct) and the percentage of distinct values relative to the total number of records (p_distinct). The number of distinct values provides a direct count of unique categories, while the percentage contextualizes this count as a proportion of the dataset size, typically ranging from 0% to 100%. The test then compares these metrics against a predefined threshold, which in this case is expressed as a fraction of the total row count (0.1, i.e. 10%). If the number of distinct values in a column is less than this calculated threshold, the column passes the test; otherwise, it fails. This approach allows for a standardized assessment of cardinality risk across different datasets and use cases, with the pass/fail status serving as an immediate indicator of whether a column's cardinality is within acceptable bounds.
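
The check can be approximated in pandas as sketched below. The conversion of the percent threshold into a count of distinct values is one plausible reading of the description above, and the use of raw_df with object/category dtype selection is an assumption for illustration:

percent_threshold = 0.1  # fraction of rows; 0.1 corresponds to the 10% described above
n_rows = len(raw_df)
max_distinct = percent_threshold * n_rows  # threshold expressed as a count of distinct values

categorical_cols = raw_df.select_dtypes(include=["object", "category"]).columns
for col in categorical_cols:
    n_distinct = raw_df[col].nunique()
    p_distinct = n_distinct / n_rows * 100
    status = "Pass" if n_distinct < max_distinct else "Fail"
    print(f"{col}: n_distinct={n_distinct}, p_distinct={p_distinct:.4f}%, {status}")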

The primary advantages of this test include its effectiveness in early detection of potential overfitting risks and data quality issues associated with high-cardinality categorical features. By systematically quantifying the uniqueness of categorical columns, the test enables practitioners to proactively identify features that may introduce excessive model complexity or capture spurious patterns. This is particularly valuable in scenarios where categorical variables are numerous or derived from external sources, as it helps to ensure that only features with manageable cardinality are included in model development. Additionally, the test's applicability to both classification and regression tasks enhances its versatility, making it a useful tool for a wide range of modeling applications. Its straightforward output and clear pass/fail criteria further facilitate rapid interpretation and integration into automated data quality pipelines.

It should be noted that the test is limited to categorical data types and does not evaluate numerical or continuous features, which may also exhibit problematic distributions. The static nature of the threshold, whether defined as a number or percentage, may not be optimal for all datasets or business contexts, potentially leading to false positives or negatives in certain cases. Furthermore, the test does not account for the semantic importance or predictive value of unique categories, which means that columns with high cardinality but critical business relevance may be flagged unnecessarily. Interpretation challenges may arise when the test is applied to datasets with inherently high diversity or when rare categories are meaningful. High risk is indicated by columns that fail the test, specifically those with a number of distinct values at or above the threshold, as these may contribute to overfitting or instability in model performance.

This test shows the results in a tabular format, where each row corresponds to a categorical column in the dataset. The table includes the column name, the number of distinct values, the percentage of distinct values relative to the dataset size, and the pass/fail status based on the defined threshold. For the columns "Geography" and "Gender," the number of distinct values is 3 and 2, respectively, with corresponding percentages of 0.0375% and 0.025%. Both columns are marked as "Pass," indicating that their cardinality is well below the 10% threshold. The table is straightforward to interpret: lower values in the "Number of Distinct Values" and "Percentage of Distinct Values (%)" columns suggest lower cardinality and, consequently, a reduced risk of overfitting. The "Pass/Fail" column provides an immediate summary of whether each feature meets the cardinality criteria. Notably, all evaluated columns in this test have a very low proportion of unique values, and none approach the threshold that would trigger a fail status. This suggests a stable and manageable level of categorical diversity within the dataset.

The test results reveal the following key insights:

  • All Categorical Columns Exhibit Low Cardinality: Both "Geography" and "Gender" have a small number of unique values, with 3 and 2 distinct categories, respectively, indicating limited diversity within these features.
  • Percentage of Distinct Values Is Substantially Below Threshold: The percentage of distinct values for "Geography" (0.0375%) and "Gender" (0.025%) is significantly lower than the 10% threshold, demonstrating that these columns are far from the high-cardinality risk zone.
  • Uniform Pass Status Across Evaluated Columns: Both columns pass the test, confirming that no categorical feature in the current dataset exceeds the cardinality threshold or poses an immediate risk of overfitting due to excessive uniqueness.
  • No Evidence of Outlier or Anomalous Categories: The absence of columns with high numbers or percentages of distinct values suggests that the dataset does not contain categorical features with outlier or anomalous category distributions.

Based on these results, the dataset's categorical features demonstrate a stable and controlled level of cardinality, with all evaluated columns comfortably passing the high cardinality test. The low number and percentage of distinct values in both "Geography" and "Gender" indicate that these features are unlikely to introduce overfitting or excessive model complexity. The uniform pass status across all tested columns suggests a consistent approach to categorical feature engineering and data preparation, with no evidence of problematic or anomalous category distributions. This pattern supports the conclusion that the dataset is well-suited for modeling applications that require categorical stability and interpretability. The absence of high-cardinality features reduces the risk of spurious relationships and enhances the reliability of downstream model performance, providing a solid foundation for further analysis and model development.

Parameters:

{
  "num_threshold": 100,
  "percent_threshold": 0.1,
  "threshold_type": "percent"
}
            

Tables

| Column | Number of Distinct Values | Percentage of Distinct Values (%) | Pass/Fail |
|---|---|---|---|
| Geography | 3 | 0.0375 | Pass |
| Gender | 2 | 0.0250 | Pass |
2026-01-10 02:15:48,399 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighCardinality:raw_data does not exist in model's document
validmind.data_validation.Skewness:raw_data

❌ Skewness Raw Data

Skewness:raw_data is designed to evaluate the asymmetry in the distribution of numerical data within a dataset, with the primary purpose of identifying deviations from normality that may impact the quality and performance of predictive machine learning models. By quantifying the degree of skewness in each numeric column, this test helps to detect potential data quality issues that could influence model accuracy, fairness, and interpretability.

The test operates by calculating the skewness statistic for each numerical column in the dataset. Skewness measures the extent to which the distribution of data values deviates from a symmetrical, bell-shaped (normal) distribution. A skewness value close to zero indicates a nearly symmetric distribution, while positive values suggest a longer or fatter right tail, and negative values indicate a longer or fatter left tail. The test compares each calculated skewness value to a predefined maximum threshold, set at 1 for this analysis. If the absolute value of skewness for a column is less than the threshold, the column passes; otherwise, it fails. The test outputs a table listing each column, its skewness value, and a pass/fail status, providing a clear and quantitative assessment of distributional asymmetry. Skewness itself is unbounded, but values between -1 and 1 are generally considered acceptable for most modeling purposes, with values outside this range indicating substantial skewness that may warrant further attention.
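
For intuition, the per-column skewness check can be approximated with pandas as follows (exact values may differ slightly from the library's output, and raw_df is assumed):

max_threshold = 1  # maximum acceptable absolute skewness

numeric_skew = raw_df.select_dtypes(include="number").skew()
for col, value in numeric_skew.items():
    status = "Pass" if abs(value) < max_threshold else "Fail"
    print(f"{col}: skewness={value:.4f}, {status}")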

The primary advantages of this test include its efficiency and clarity in identifying unequal data distributions that could undermine model assumptions or performance. The test is computationally lightweight, making it suitable for large datasets and routine data quality checks. Its adjustable threshold allows practitioners to tailor the sensitivity of the test to specific modeling requirements or regulatory standards. By providing a direct, quantitative measure of skewness, the test enables rapid detection of problematic distributions, supporting proactive risk management and model optimization. This is particularly valuable in scenarios where model performance is sensitive to input data distributions, such as in credit scoring, fraud detection, or other regulated domains.

It should be noted that the test is limited to numeric columns and does not capture skewness or bias in categorical or text data, which may also influence model outcomes. The assumption that data should approximate a normal distribution may not always hold, especially in real-world applications where certain variables are naturally skewed. The threshold for acceptable skewness is subjective and may require expert judgment or iterative refinement to align with business objectives and regulatory expectations. High skewness values, especially those that persist across multiple columns, may signal underlying data quality issues or violations of model assumptions, potentially leading to suboptimal predictions or biased inferences.

This test shows a tabular output titled "Skewness Results for Dataset," presenting each numerical column alongside its calculated skewness value and a pass/fail status based on the threshold of 1. The table includes columns such as "CreditScore," "Age," "Tenure," "Balance," "NumOfProducts," "HasCrCard," "IsActiveMember," "EstimatedSalary," and "Exited." The skewness values range from -0.8867 to 1.4847, with most columns exhibiting values close to zero, indicating near-symmetric distributions. Notably, "Age" and "Exited" have skewness values of 1.0245 and 1.4847, respectively, both exceeding the threshold and marked as "Fail." The remaining columns, including "CreditScore" (-0.062), "Tenure" (0.0077), "Balance" (-0.1353), "NumOfProducts" (0.7172), "HasCrCard" (-0.8867), "IsActiveMember" (-0.0796), and "EstimatedSalary" (0.0095), all fall within the acceptable range and are marked as "Pass." The table is read by examining each row to determine which columns meet the skewness criterion and which do not, with the "Pass/Fail" status providing an immediate visual cue. The scale of skewness values allows for direct comparison across columns, highlighting those with distributions that may require further investigation.

The test results reveal the following key insights:

  • Most columns exhibit low skewness: The majority of numerical columns, including "CreditScore," "Tenure," "Balance," "NumOfProducts," "HasCrCard," "IsActiveMember," and "EstimatedSalary," have skewness values between -0.8867 and 0.7172, all within the acceptable threshold.
  • Age and Exited display substantial skewness: The "Age" column has a skewness of 1.0245, and the "Exited" column has a skewness of 1.4847, both exceeding the threshold of 1 and resulting in a "Fail" status.
  • Distribution symmetry is prevalent: Columns such as "Tenure" (0.0077) and "EstimatedSalary" (0.0095) are nearly perfectly symmetric, indicating well-balanced distributions.
  • Negative skewness observed in some columns: "CreditScore," "Balance," "HasCrCard," and "IsActiveMember" all have negative skewness values, suggesting a slight leftward tail, but these remain within the acceptable range.
  • No extreme outliers in skewness except for Exited: The "Exited" column stands out with the highest skewness (1.4847), indicating a pronounced right tail and a potential imbalance in the target variable.

Based on these results, the dataset demonstrates generally well-balanced distributions across most numerical columns, with skewness values falling within the acceptable range for the majority of features. The exceptions are "Age" and "Exited," both of which exceed the skewness threshold, indicating notable asymmetry in their distributions. The "Age" column's skewness suggests a concentration of values toward the lower end with a longer right tail, while the "Exited" column's high skewness points to a significant imbalance in the target variable, which may reflect class imbalance or a rare event scenario. The presence of negative skewness in several columns is mild and does not breach the threshold, indicating only minor deviations from symmetry. The overall pattern suggests that, aside from the two flagged columns, the dataset is unlikely to introduce substantial distributional bias into downstream modeling. The clear pass/fail status for each column facilitates targeted review and supports transparent documentation of data quality characteristics relevant to model development and risk management.

Parameters:

{
  "max_threshold": 1
}
            

Tables

Skewness Results for Dataset

| Column | Skewness | Pass/Fail |
|---|---|---|
| CreditScore | -0.0620 | Pass |
| Age | 1.0245 | Fail |
| Tenure | 0.0077 | Pass |
| Balance | -0.1353 | Pass |
| NumOfProducts | 0.7172 | Pass |
| HasCrCard | -0.8867 | Pass |
| IsActiveMember | -0.0796 | Pass |
| EstimatedSalary | 0.0095 | Pass |
| Exited | 1.4847 | Fail |
2026-01-10 02:16:19,577 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Skewness:raw_data does not exist in model's document
validmind.data_validation.UniqueRows:raw_data

❌ Unique Rows Raw Data

Unique Rows is designed to assess the diversity of the dataset by verifying that the number of unique rows in each column exceeds a specified minimum percentage threshold. The primary purpose of this test is to ensure that the data used for model development is sufficiently varied, which is essential for building robust and unbiased machine learning models capable of generalizing to new, unseen data.

The test operates by first determining the total number of rows in the dataset and then calculating the number of unique values present in each column. For each column, the percentage of unique values is computed by dividing the count of unique values by the total row count and multiplying by 100 to express the result as a percentage. This percentage is then compared to a predefined minimum threshold, which in this case is set at 1%. If the percentage of unique values in a column meets or exceeds this threshold, the column is marked as having passed the test; otherwise, it is marked as having failed. The test provides a pass/fail verdict for each column, and the overall assessment is based on whether all columns meet the threshold. The typical range for the percentage of unique values is from 0% (no diversity) to 100% (all values unique), with higher percentages generally indicating greater data diversity. Columns with low percentages may signal limited variability, which can impact the model’s ability to generalize.
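
A short pandas sketch of this uniqueness check, assuming raw_df, could look like the following (illustrative only, not the library's implementation):

min_percent_threshold = 1  # minimum acceptable percentage of unique values per column

n_rows = len(raw_df)
for col in raw_df.columns:
    n_unique = raw_df[col].nunique()
    pct_unique = n_unique / n_rows * 100
    status = "Pass" if pct_unique >= min_percent_threshold else "Fail"
    print(f"{col}: {n_unique} unique values ({pct_unique:.4f}%), {status}")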

The primary advantages of this test include its efficiency and systematic approach to evaluating data diversity across all columns in a dataset. By quantifying the proportion of unique values, the test provides a clear and interpretable metric for assessing data quality. This is particularly useful in scenarios where the risk of overfitting is a concern, as high data diversity is often associated with improved model generalization. The test is also straightforward to implement and interpret, making it a practical tool for routine data quality checks during model development. Its ability to quickly highlight columns with limited variability allows data scientists to focus their attention on potential areas of concern, thereby supporting the development of more robust and reliable models.

It should be noted that the Unique Rows test has several limitations. It assumes that higher uniqueness directly correlates with better data quality, which may not always be the case, especially in domains where certain repeated values are meaningful or necessary. The test treats all columns equally, without considering their relative importance or predictive power in the model, which can lead to misleading interpretations if some columns are inherently less variable by design (such as binary or categorical features). Additionally, the test may not be suitable for columns with a naturally limited set of possible values, as these will almost always fail the uniqueness threshold regardless of their relevance. A lack of diversity, as indicated by a low percentage of unique values, is considered a sign of high risk, as it may lead to overfitting and poor model generalization. However, the test does not account for the context or intended use of each column, which can limit its effectiveness as a standalone measure of data quality.

This test shows the results in a tabular format, where each row corresponds to a column in the dataset and includes the column name, the number of unique values, the percentage of unique values relative to the total row count, and a pass/fail indicator based on the 1% threshold. The table allows for straightforward interpretation: columns with a percentage of unique values above 1% are marked as "Pass," while those below are marked as "Fail." For example, the "CreditScore" column has 452 unique values, representing 5.65% of the total, and passes the test. In contrast, columns such as "Geography," "Gender," "NumOfProducts," "HasCrCard," "IsActiveMember," and "Exited" have very low percentages of unique values (ranging from 0.025% to 0.05%) and fail the test. The "Balance" and "EstimatedSalary" columns exhibit high uniqueness, with 63.6% and 100% respectively, and both pass. The "Age" and "Tenure" columns, despite having more unique values than some others, still fall below the threshold and fail. The range of unique value percentages spans from as low as 0.025% to as high as 100%, highlighting significant variability in data diversity across columns. Notably, only three columns ("CreditScore," "Balance," and "EstimatedSalary") meet the minimum uniqueness requirement, while the remaining eight do not.

The test results reveal the following key insights:

  • Most Columns Exhibit Low Uniqueness: The majority of columns, including "Geography," "Gender," "Age," "Tenure," "NumOfProducts," "HasCrCard," "IsActiveMember," and "Exited," have a percentage of unique values well below the 1% threshold, indicating limited diversity in these features.
  • High Uniqueness in Select Columns: "EstimatedSalary" achieves 100% uniqueness, and "Balance" and "CreditScore" also display high uniqueness at 63.6% and 5.65% respectively, suggesting these columns contain highly individualized data points.
  • Categorical Features Consistently Fail: All categorical columns, such as "Geography," "Gender," "NumOfProducts," "HasCrCard," "IsActiveMember," and "Exited," fail the uniqueness threshold, reflecting the inherent limitation of this test for features with a small set of possible values.
  • Continuous Features Show Greater Diversity: Columns representing continuous variables, such as "EstimatedSalary," "Balance," and "CreditScore," are more likely to pass the uniqueness test, highlighting a clear distinction between variable types in terms of data diversity.
  • Threshold Sensitivity Evident in Marginal Columns: Columns like "Age" and "Tenure," despite having a moderate number of unique values (69 and 11, respectively), still fall short of the 1% threshold, underscoring the sensitivity of the test to the chosen cutoff and the distribution of values within each feature.

Based on these results, the dataset demonstrates a pronounced disparity in data diversity across its columns, with only a subset of features—primarily those representing continuous variables—exhibiting sufficient uniqueness to pass the test. The categorical features, by contrast, consistently display low percentages of unique values, which is expected given their limited set of possible categories. This pattern suggests that while certain columns provide a high degree of individualized information, much of the dataset is composed of features with restricted variability. The observed distribution of uniqueness percentages highlights the importance of considering feature type and context when interpreting the results, as the test’s uniform threshold does not account for the natural constraints of categorical variables. The results indicate that the dataset’s overall diversity is driven primarily by a few highly unique columns, while the majority of features contribute limited new information per row. This characteristic may influence the model’s ability to generalize, particularly if the predictive power is concentrated in the more diverse features. The test thus provides a clear, quantitative snapshot of data diversity, revealing both the strengths and limitations of the dataset in supporting robust model development.

Parameters:

{
  "min_percent_threshold": 1
}
            

Tables

| Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 452 | 5.6500 | Pass |
| Geography | 3 | 0.0375 | Fail |
| Gender | 2 | 0.0250 | Fail |
| Age | 69 | 0.8625 | Fail |
| Tenure | 11 | 0.1375 | Fail |
| Balance | 5088 | 63.6000 | Pass |
| NumOfProducts | 4 | 0.0500 | Fail |
| HasCrCard | 2 | 0.0250 | Fail |
| IsActiveMember | 2 | 0.0250 | Fail |
| EstimatedSalary | 8000 | 100.0000 | Pass |
| Exited | 2 | 0.0250 | Fail |
2026-01-10 02:16:54,081 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:raw_data does not exist in model's document
validmind.data_validation.TooManyZeroValues:raw_data

❌ Too Many Zero Values Raw Data

Too Many Zero Values is designed to identify numerical columns within a dataset that contain an excessive proportion of zero values, as defined by a configurable threshold percentage. The primary purpose of this test is to highlight columns where the prevalence of zero values may indicate data sparsity or a lack of variation, which could limit the effectiveness of these features in downstream machine learning models.

The test operates by systematically evaluating each numerical column in the dataset. For every such column, it counts the zero values and expresses them as a percentage of the total number of rows. This percentage is then compared to a predefined threshold, which in this case is set at 0.03%. If the proportion of zero values in a column exceeds this threshold, the column is flagged as having too many zeros and is marked as having failed the test. The test outputs a summary table for each numerical column, displaying the variable name, total row count, number of zero values, percentage of zero values, and a pass/fail status. The percentage metric ranges from 0% to 100%, where higher values indicate a greater concentration of zeros. A low percentage is generally desirable, as it suggests more data variation, while a high percentage may signal potential issues with feature informativeness or data quality.
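
The following pandas sketch approximates this logic under the percentage interpretation given above; raw_df and the decision to report only columns that actually contain zeros are assumptions made for illustration:

max_percent_threshold = 0.03  # compared against the percentage of zeros, per the description above

n_rows = len(raw_df)
for col in raw_df.select_dtypes(include="number").columns:
    n_zeros = int((raw_df[col] == 0).sum())
    if n_zeros == 0:
        continue  # only columns that contain zeros appear in the summary table
    pct_zeros = n_zeros / n_rows * 100
    status = "Pass" if pct_zeros <= max_percent_threshold else "Fail"
    print(f"{col}: {n_zeros} zeros ({pct_zeros:.4f}%), {status}")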

The primary advantages of this test include its ability to efficiently surface columns with unexpectedly high concentrations of zero values, which might otherwise be overlooked in large datasets. By providing both the absolute count and the percentage of zero values, the test enables a nuanced understanding of the distribution of zeros within each column. The configurable threshold allows users to tailor the sensitivity of the test to the specific requirements of their analysis or modeling context. Additionally, by focusing exclusively on numerical columns, the test avoids misapplication to categorical or text data, thereby reducing the risk of irrelevant or misleading results. This targeted approach is particularly useful in scenarios where data sparsity or lack of feature variation could adversely impact model performance or interpretability.

It should be noted that this test is limited to detecting zero values and does not account for other potentially problematic data characteristics, such as missing values, extreme outliers, or non-zero constants. The test does not consider the contextual meaning of zeros, which in some domains may be entirely appropriate or even expected. As a result, columns flagged as having too many zeros may not necessarily be problematic from a modeling perspective. The test also does not identify patterns or sequences of zeros, which could be relevant in time-series or longitudinal data. Furthermore, the binary pass/fail outcome does not provide guidance on the practical implications of the observed zero concentrations, and the test does not extend to non-numerical columns, which may harbor other types of data quality issues. High risk is signaled when a column exhibits a zero value ratio that significantly exceeds the threshold, especially if the column is entirely or predominantly zeros, as this suggests a lack of data variation.

This test shows the results in the form of a structured table, where each row corresponds to a numerical variable in the dataset. The columns of the table include the variable name, the total number of rows, the number of zero values, the percentage of zero values, and a pass/fail indicator based on the specified threshold. To interpret the table, one should examine the percentage of zero values for each variable and compare it to the threshold of 0.03%. All four variables listed—Tenure, Balance, HasCrCard, and IsActiveMember—have percentages of zero values that far exceed the threshold, with values ranging from 4.04% to 48.01%. The pass/fail column clearly indicates that all variables have failed the test. The scale of the percentage column allows for quick identification of variables with particularly high concentrations of zeros, and the absolute counts provide additional context regarding the magnitude of zeros present. Notably, IsActiveMember has the highest proportion of zeros at 48.01%, followed by Balance at 36.4%, HasCrCard at 29.74%, and Tenure at 4.04%. These results suggest that a substantial portion of the data in these columns consists of zero values, which may have implications for their utility in modeling.

The test results reveal the following key insights:

  • All Numerical Columns Exceed Zero Threshold: Every numerical variable assessed—Tenure, Balance, HasCrCard, and IsActiveMember—has a percentage of zero values that surpasses the 0.03% threshold, resulting in a fail status for each column.
  • High Concentration of Zeros in IsActiveMember and Balance: The IsActiveMember variable exhibits the highest proportion of zero values at 48.01%, closely followed by Balance at 36.4%, indicating that nearly half and over a third of the entries in these columns are zeros, respectively.
  • Substantial Zero Presence in HasCrCard: The HasCrCard variable also demonstrates a significant zero concentration, with 29.74% of its values being zero, which is markedly above the threshold and suggests limited variation in this feature.
  • Tenure Shows Lower but Still Excessive Zero Proportion: While Tenure has a lower percentage of zero values at 4.04% compared to the other variables, this is still well above the threshold and represents a non-trivial portion of the data.
  • Consistent Pattern of Failing Across Variables: The uniform fail status across all variables highlights a consistent pattern of high zero prevalence in the dataset’s numerical features, suggesting a broader characteristic of the data rather than isolated occurrences.

Based on these results, the dataset’s numerical columns are characterized by a substantial presence of zero values, with all assessed variables exceeding the predefined threshold by significant margins. The IsActiveMember and Balance columns, in particular, display extremely high proportions of zeros, indicating that these features may offer limited variation and could potentially act as quasi-constant variables in modeling applications. The HasCrCard and Tenure variables, while exhibiting lower percentages, still present a notable concentration of zeros relative to the threshold. The consistent pattern of failing the test across all variables suggests that the dataset as a whole may be subject to data sparsity or structural characteristics that result in frequent zero entries. These observations provide a clear quantitative profile of the zero value distribution within the dataset’s numerical features, which is essential for understanding the potential impact on model training, feature selection, and interpretability. The results underscore the importance of considering the role and meaning of zero values in the context of the specific modeling task, as their prevalence may influence both the statistical properties of the data and the behavior of machine learning algorithms.

Parameters:

{
  "max_percent_threshold": 0.03
}
            

Tables

| Variable | Row Count | Number of Zero Values | Percentage of Zero Values (%) | Pass/Fail |
|---|---|---|---|---|
| Tenure | 8000 | 323 | 4.0375 | Fail |
| Balance | 8000 | 2912 | 36.4000 | Fail |
| HasCrCard | 8000 | 2379 | 29.7375 | Fail |
| IsActiveMember | 8000 | 3841 | 48.0125 | Fail |
2026-01-10 02:17:18,964 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TooManyZeroValues:raw_data does not exist in model's document
validmind.data_validation.IQROutliersTable:raw_data

IQR Outliers Table Raw Data

IQROutliersTable is designed to determine and summarize outliers in numerical features using the Interquartile Range (IQR) method. The primary purpose of this test is to identify data points that deviate significantly from the central tendency of each numerical variable, as such outliers can distort statistical analysis and adversely affect the performance and reliability of machine learning models. By systematically flagging these extreme values, the test supports robust data pre-processing and quality assurance.

The test operates by calculating the IQR for each numerical feature in the dataset, which is the difference between the 75th percentile (third quartile) and the 25th percentile (first quartile) of the data. Outliers are defined as data points that fall below the lower bound (first quartile minus a multiple of the IQR) or above the upper bound (third quartile plus the same multiple of the IQR). The threshold multiplier, which determines the sensitivity of outlier detection, is set to 5 in this test, making the criteria for outlier identification more stringent than the conventional value of 1.5. For each feature, the test summarizes the number of outliers and provides descriptive statistics—minimum, 25th percentile, median, 75th percentile, and maximum—of the detected outlier values. The output is typically presented in a tabular format, with each row corresponding to a numerical feature and columns detailing the outlier count and summary statistics. The values in these columns represent the actual data values of the outliers, and the counts indicate the prevalence of extreme values in each feature. A high outlier count or extreme summary statistics may indicate data quality issues or the presence of rare but influential observations.
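
A rough pandas sketch of the IQR calculation is shown below; the summary column names are illustrative rather than the exact ones ValidMind produces, and raw_df is assumed:

import pandas as pd

threshold = 5  # IQR multiplier; larger values make the outlier criterion more conservative

rows = []
for col in raw_df.select_dtypes(include="number").columns:
    q1, q3 = raw_df[col].quantile([0.25, 0.75])
    lower = q1 - threshold * (q3 - q1)
    upper = q3 + threshold * (q3 - q1)
    outliers = raw_df.loc[(raw_df[col] < lower) | (raw_df[col] > upper), col]
    if not outliers.empty:
        rows.append({
            "Variable": col,
            "Count": len(outliers),
            "Min": outliers.min(),
            "25%": outliers.quantile(0.25),
            "Median": outliers.median(),
            "75%": outliers.quantile(0.75),
            "Max": outliers.max(),
        })

# With threshold=5 on this dataset, no column produces outliers, so the frame is empty
print(pd.DataFrame(rows))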

The primary advantages of this test include its robustness to extreme values, as the IQR method is less sensitive to the presence of outliers than methods based on the mean and standard deviation. This makes it particularly effective for datasets that may contain heavy-tailed distributions or sporadic anomalies. The test provides a comprehensive summary for each numerical feature, enabling users to quickly identify which variables may require further scrutiny or special handling. Its flexibility allows users to adjust the outlier threshold and select specific features for analysis, making it adaptable to a wide range of data quality assessment scenarios. By focusing on quartile-based boundaries, the test remains effective even when the data distribution is not perfectly normal, and it is well-suited for initial exploratory data analysis and ongoing data monitoring.

It should be noted that the test is limited to numerical features and does not address outliers in categorical data. The reliance on quartile-based thresholds means that the test may produce false positives in highly skewed distributions or in data with naturally heavy tails, where extreme values are expected and not necessarily indicative of data quality problems. The default or user-specified threshold may not be optimal for all datasets, especially those that have undergone significant pre-processing or transformation. Additionally, the test does not provide guidance on how to handle detected outliers, leaving interpretation and remediation to the user. Signs of high risk include a large number of outliers across multiple features or outlier values that are substantially distant from the central tendency, which may suggest data entry errors or other quality concerns. Interpretation challenges may arise when distinguishing between genuine rare events and erroneous data points.

This test shows the results in the form of a table titled "Summary of Outliers Detected by IQR Method." The table is structured to display, for each numerical feature, the count of detected outliers and summary statistics (minimum, 25th percentile, median, 75th percentile, and maximum) of the outlier values. Each column represents a specific metric, and each row corresponds to a different numerical feature. The units of measurement for the summary statistics are the same as those of the original data features. The table is designed to be read by scanning across each row to assess the prevalence and distribution of outliers for each feature. In this particular test run, the table is empty, indicating that no outliers were detected in any of the numerical features when applying the IQR method with a threshold of 5. This means that, under the current criteria, all data points for all numerical features fall within the expected range defined by the quartile boundaries, and there are no extreme values that meet the outlier definition. Because there are no outlier values to summarize, the table contains no counts or summary statistics, and no value scale is displayed.

The test results reveal the following key insights:

  • No Outliers Detected Across All Features: The table contains no entries, indicating that none of the numerical features in the dataset have data points classified as outliers under the IQR method with a threshold of 5.
  • Stringent Threshold Yields Conservative Detection: The use of a threshold value of 5, which is higher than the conventional 1.5, results in a more conservative approach to outlier detection, further reducing the likelihood of flagging data points as outliers.
  • Uniform Data Distribution Within Expected Ranges: The absence of outliers suggests that the numerical features are distributed within the expected quartile-based boundaries, with no extreme deviations present in the dataset.
  • No Evidence of Data Quality Anomalies: The lack of detected outliers implies that, at the current threshold, there are no apparent data entry errors or unusual values that would warrant further investigation for the numerical features analyzed.

Based on these results, the dataset demonstrates a high degree of conformity to the IQR-based outlier criteria, with all numerical features exhibiting values that fall within the expected range defined by the quartile boundaries and the specified threshold. The absence of detected outliers suggests that the data is free from extreme values that could potentially distort statistical analysis or model performance, at least under the current, relatively stringent detection parameters. This observation indicates a stable and well-behaved distribution of numerical features, with no immediate evidence of data quality anomalies or rare, influential observations. The results provide confidence in the integrity of the numerical data for subsequent modeling or analysis steps, as there are no outlier-driven distortions present. However, it is important to recognize that the conservative threshold may mask less extreme but still influential values, and the results are specific to the current configuration of the test. The overall pattern observed is one of uniformity and stability across all numerical features, with no notable deviations or dependencies detected in the outlier analysis.

Parameters:

{
  "threshold": 5
}
            

Tables

Summary of Outliers Detected by IQR Method

2026-01-10 02:17:44,181 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.IQROutliersTable:raw_data does not exist in model's document
validmind.data_validation.DescriptiveStatistics:preprocessed_data

Descriptive Statistics Preprocessed Data

Descriptive Statistics: Preprocessed Data is designed to provide a comprehensive summary of both numerical and categorical variables within a dataset, enabling a clear understanding of the data’s distribution, central tendency, variability, and categorical composition. The primary purpose of this test is to facilitate an initial assessment of the dataset’s structure and characteristics, which is essential for evaluating the suitability of the data for modeling and for anticipating potential challenges related to data quality or representativeness.

The test operates by leveraging established statistical functions to generate detailed summaries for both numerical and categorical variables. For numerical data, the test uses a descriptive statistics function that calculates key metrics such as count, mean, standard deviation, minimum, maximum, and several percentiles (including the 25th, 50th, 75th, 90th, and 95th). These metrics collectively describe the central tendency, spread, and range of the data, allowing for the identification of skewness, outliers, and overall variability. The mean provides an average value, while the standard deviation quantifies the typical deviation from the mean. Percentiles offer insight into the distribution, highlighting where most data points lie and how extreme values compare to the bulk of the data. For categorical variables, the test applies a value counting function to determine the total count, the number of unique categories, the most frequent category (top value), its frequency, and its proportion relative to the total. This approach reveals the diversity and dominance of categories, which is critical for understanding potential class imbalances or lack of representativeness. The metrics for categorical data typically range from 0% to 100% for proportions, with higher values for the top category indicating less diversity. The test’s outputs are structured into separate tables for numerical and categorical variables, each providing a clear, quantitative overview of the dataset’s composition.
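
The same kind of summary can be approximated directly with pandas. In this sketch, balanced_raw_df (created earlier) stands in for the preprocessed dataset; this is an assumption, since the actual preprocessed data may differ after correlated features are removed:

# Numerical summary: count, mean, std, min, max, and the percentiles described above
numeric_summary = balanced_raw_df.select_dtypes(include="number").describe(
    percentiles=[0.25, 0.5, 0.75, 0.9, 0.95]
)
print(numeric_summary)

# Categorical summary: count, number of unique categories, top value, its frequency and share
for col in balanced_raw_df.select_dtypes(include=["object", "category"]).columns:
    counts = balanced_raw_df[col].value_counts()
    top_value = counts.index[0]
    top_freq = int(counts.iloc[0])
    share = top_freq / counts.sum() * 100
    print(f"{col}: count={int(counts.sum())}, unique={counts.size}, "
          f"top={top_value}, freq={top_freq} ({share:.2f}%)")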

The primary advantages of this test include its ability to quickly and effectively summarize the main characteristics of a dataset, making it an indispensable tool for initial data exploration and quality assessment. By presenting a broad array of statistical measures, the test enables users to detect anomalies such as outliers, extreme values, or skewed distributions that could impact model performance. The inclusion of both numerical and categorical summaries ensures that all variable types are considered, supporting a holistic understanding of the data. This comprehensive approach is particularly useful in scenarios where data quality, representativeness, or potential biases must be evaluated before proceeding with more complex analyses or model development. The test’s reliance on well-established statistical methods ensures robustness and interpretability, making it suitable for regulatory documentation and transparent reporting.

It should be noted that while this test provides a thorough overview of individual variable distributions, it does not capture relationships or dependencies between variables, nor does it detect subtle patterns or correlations that may exist within the data. The test is limited to univariate analysis, meaning it cannot identify multivariate anomalies or interactions that could influence model outcomes. Additionally, the presence of significant differences between the mean and median, high standard deviations, or a dominant category in categorical variables may signal potential risks such as skewness, outliers, or lack of diversity. These characteristics can affect model stability and generalizability, especially if the dataset is not representative of the broader population. Interpretation challenges may arise if the data contains hidden biases or if the summary statistics mask important sub-group variations. Therefore, the results of this test should be considered as part of a broader suite of analyses to ensure comprehensive data understanding.

This test shows the results in two tabular formats: one for numerical variables and one for categorical variables. The numerical variables table lists each variable alongside its count (number of non-missing observations), mean (average value), standard deviation (measure of spread), minimum and maximum values, and several percentiles (25th, 50th, 75th, 90th, and 95th), which indicate the values below which a certain percentage of the data falls. For example, the 50th percentile represents the median. The categorical variables table presents each variable with its total count, the number of unique categories, the most frequent category (top value), the frequency of this top value, and its proportion as a percentage of the total. This format allows for straightforward comparison across variables and easy identification of dominant categories or potential imbalances. Notable observations include the wide range in numerical variables such as Balance and EstimatedSalary, the high frequency of certain categories in categorical variables, and the presence of variables with binary outcomes. The tables provide a clear, quantitative snapshot of the dataset, highlighting both the central tendencies and the variability present in the data.

The test results reveal the following key insights:

  • Numerical variables exhibit diverse distributions and ranges: Variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary display a wide range of values, with counts consistently at 3,232 for all variables, indicating no missing data. CreditScore ranges from 350 to 850 with a mean of 647.77 and a standard deviation of 98.77, suggesting moderate variability. Balance and EstimatedSalary show particularly high standard deviations (61,414.14 and 58,310.67, respectively), indicating substantial spread and potential for outliers.
  • Percentile analysis highlights skewness and concentration: For Balance, the 25th percentile is 0, the median is 103,253, and the 75th percentile is 129,344, indicating a significant portion of the data has zero or low balances, with a sharp increase in higher percentiles. EstimatedSalary also shows a broad distribution, with the 25th percentile at 48,713 and the 95th percentile at 190,094, reflecting a wide income range.
  • Binary and categorical variables show varying levels of diversity: HasCrCard and IsActiveMember are binary, with means of 0.699 and 0.470, respectively, indicating that approximately 70% have a credit card and 47% are active members. NumOfProducts ranges from 1 to 4, with a mean of 1.51, suggesting most customers have one or two products.
  • Categorical variables reveal dominant categories: Geography has three unique values, with France as the top value at 1,480 occurrences (45.79%). Gender is nearly balanced, with Male as the top value at 1,634 occurrences (50.56%), indicating no significant gender imbalance.
  • No evidence of missing data or extreme outliers in counts: All variables report a count of 3,232, confirming complete data coverage for the analyzed fields.

Based on these results, the dataset demonstrates a high degree of completeness, with no missing values across the analyzed variables. The numerical variables show a mix of distributions, with some (such as Balance and EstimatedSalary) exhibiting substantial variability and potential skewness, as indicated by large standard deviations and differences between percentiles. The presence of a significant proportion of zero balances and the wide range in estimated salaries suggest that the dataset includes both low- and high-value customers, which may influence model behavior and risk segmentation. The categorical variables display moderate diversity, with Geography showing a dominant category (France) but still maintaining representation from other regions, and Gender being nearly evenly split. Binary variables such as HasCrCard and IsActiveMember provide clear segmentation points, with the majority of customers holding credit cards but less than half being active members. Overall, the data structure supports robust modeling, but the observed variability and category dominance should be considered when interpreting model outputs, as they may impact the model’s ability to generalize across different customer segments. The descriptive statistics provide a solid foundation for further analysis, highlighting areas where additional scrutiny or targeted preprocessing may be warranted to ensure balanced and representative model development.

Tables

Numerical Variables

Name Count Mean Std Min 25% 50% 75% 90% 95% Max
CreditScore 3232.0 647.7707 98.7717 350.0 580.0 651.0 715.0 778.0 815.0 850.0
Tenure 3232.0 5.0149 2.9001 0.0 3.0 5.0 8.0 9.0 9.0 10.0
Balance 3232.0 82173.3282 61414.1405 0.0 0.0 103253.0 129344.0 150500.0 165296.0 250898.0
NumOfProducts 3232.0 1.5084 0.6694 1.0 1.0 1.0 2.0 2.0 3.0 4.0
HasCrCard 3232.0 0.6993 0.4587 0.0 0.0 1.0 1.0 1.0 1.0 1.0
IsActiveMember 3232.0 0.4697 0.4992 0.0 0.0 0.0 1.0 1.0 1.0 1.0
EstimatedSalary 3232.0 99459.2961 58310.6660 12.0 48713.0 99490.0 149771.0 179488.0 190094.0 199992.0

Categorical Variables

Name Count Number of Unique Values Top Value Top Value Frequency Top Value Frequency %
Geography 3232.0 3.0 France 1480.0 45.79
Gender 3232.0 2.0 Male 1634.0 50.56
2026-01-10 02:18:20,510 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:preprocessed_data does not exist in model's document
validmind.data_validation.TabularDescriptionTables:preprocessed_data

Tabular Description Tables Preprocessed Data

Tabular Description Tables is designed to provide a comprehensive summary of the descriptive statistics for all numerical, categorical, and datetime variables within a dataset. Its primary purpose is to offer a clear and structured overview of the data’s key characteristics, including central tendencies, distributions, data types, and the presence of missing values, which are essential for understanding the dataset’s structure and readiness for further modeling or analysis.

The test operates by first segregating the dataset’s variables according to their data types: numerical, categorical, or datetime. For numerical variables, it calculates the number of observations, mean, minimum and maximum values, the percentage of missing values, and the data type. These metrics collectively describe the central tendency, spread, and completeness of each variable. For categorical variables, the test reports the number of observations, the count of unique values, a list of those unique values, the percentage of missing values, and the data type, which together provide insight into the diversity and completeness of categorical data. For datetime variables, the test would typically summarize the number of unique values, the earliest and latest dates, missing value percentages, and data type, though none are present in this result. The metrics are derived directly from the raw data, with missing value percentages ranging from 0% (no missing data) to 100% (all data missing), and unique value counts reflecting the level of categorical diversity. High missing value percentages or unexpected data types can indicate data quality issues, while the range and mean values help identify potential anomalies or out-of-range values.
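
As a rough illustration of what such a description table captures, the following sketch builds comparable numerical and categorical tables with pandas. It is not the library's implementation, and the helper names are hypothetical:

import pandas as pd

def numerical_description_table(df):
    """Per-variable observations, mean, range, missing percentage, and dtype."""
    records = []
    for col in df.select_dtypes(include="number").columns:
        records.append({
            "Numerical Variable": col,
            "Num of Obs": int(df[col].count()),
            "Mean": df[col].mean(),
            "Min": df[col].min(),
            "Max": df[col].max(),
            "Missing Values (%)": 100 * df[col].isna().mean(),
            "Data Type": str(df[col].dtype),
        })
    return pd.DataFrame(records)

def categorical_description_table(df):
    """Per-variable observations, unique values, missing percentage, and dtype."""
    records = []
    for col in df.select_dtypes(include=["object", "category"]).columns:
        records.append({
            "Categorical Variable": col,
            "Num of Obs": int(df[col].count()),
            "Num of Unique Values": int(df[col].nunique()),
            "Unique Values": df[col].unique().tolist(),
            "Missing Values (%)": 100 * df[col].isna().mean(),
            "Data Type": str(df[col].dtype),
        })
    return pd.DataFrame(records)

# Usage (preprocessed_df is a placeholder name):
# numerical_description_table(preprocessed_df)
# categorical_description_table(preprocessed_df)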

The primary advantages of this test include its ability to quickly and systematically surface the essential characteristics of a dataset, making it particularly valuable in the early stages of data exploration and quality assessment. By providing a detailed snapshot of each variable’s distribution, completeness, and type, the test enables data scientists and analysts to identify potential data quality issues, such as missing values or inappropriate data types, before proceeding to more complex modeling tasks. The inclusion of both summary statistics and metadata ensures that users have a holistic view of the dataset, supporting informed decisions about preprocessing, feature engineering, and model selection. This comprehensive overview is especially useful in regulated environments or high-stakes applications, where data integrity and transparency are paramount.

It should be noted that this test is limited to descriptive statistics and does not extend to deeper statistical analysis, such as detecting outliers, assessing relationships between variables, or evaluating the impact of missing data on model performance. The test does not provide information about potential correlations, interactions, or the need for data transformations, which may be critical for certain modeling approaches. Additionally, while the test highlights the presence of missing values and data type inconsistencies, it does not diagnose their causes or suggest remediation strategies. High percentages of missing values or inappropriate data types are flagged as potential risks, as they may indicate underlying data collection or integrity issues that could compromise model reliability. Interpretation of the results requires domain knowledge to assess whether observed distributions and data types are appropriate for the intended application.

This test shows the results in the form of two structured tables: one summarizing numerical variables and the other summarizing categorical variables. The numerical variables table lists each variable alongside the number of observations, mean, minimum and maximum values, percentage of missing values, and data type. For example, the “CreditScore” variable has 3,232 observations, a mean of 647.77, a minimum of 350, and a maximum of 850, with no missing values and an integer data type. The categorical variables table presents each variable with the number of observations, the count and list of unique values, percentage of missing values, and data type. For instance, “Geography” has three unique values (“Germany”, “Spain”, “France”) and no missing data. All variables in both tables report 0% missing values, indicating complete data coverage. The range and mean values for numerical variables are within expected bounds for their respective domains, and categorical variables display a manageable number of unique categories. The data types are consistent with the expected nature of each variable, with numerical variables represented as integers or floats and categorical variables as objects. The tables are straightforward to interpret, with each column clearly labeled and values presented in standard units, such as raw counts or percentages.

The test results reveal the following key insights:

  • Complete Data Coverage Across All Variables: All numerical and categorical variables report 0% missing values, indicating that the dataset is fully populated and free from missing data, which supports robust downstream analysis.
  • Numerical Variables Exhibit Plausible Ranges and Central Tendencies: Variables such as “CreditScore” (mean: 647.77, min: 350, max: 850) and “EstimatedSalary” (mean: 99,459.30, min: 11.58, max: 199,992.48) display ranges and averages that are consistent with typical financial datasets, suggesting no apparent data entry errors or out-of-range values.
  • Categorical Variables Have Manageable and Interpretable Cardinality: “Geography” contains three unique values (“Germany”, “Spain”, “France”), and “Gender” contains two (“Female”, “Male”), both of which are expected and manageable for modeling purposes.
  • Data Types Are Consistent With Variable Roles: All numerical variables are appropriately typed as integers or floats, and categorical variables are stored as objects, reducing the risk of type-related processing errors.
  • Binary Variables Are Clearly Defined: Variables such as “HasCrCard”, “IsActiveMember”, and “Exited” are binary (min: 0, max: 1), with means of approximately 0.70, 0.47, and 0.50 respectively; “Exited” is evenly balanced and “IsActiveMember” is close to balanced, while “HasCrCard” skews toward cardholders.

Based on these results, the dataset demonstrates strong data integrity, with complete coverage and appropriate data types across all variables. The numerical variables show plausible ranges and central tendencies, supporting the assumption that the data is representative of typical financial or customer datasets. The categorical variables are limited to a small number of interpretable categories, which facilitates straightforward encoding and analysis. The absence of missing values and the consistency in data types suggest that the dataset is well-prepared for subsequent modeling steps, with minimal need for additional preprocessing related to data completeness or type conversion. The balanced distribution of binary variables further indicates that the dataset is unlikely to introduce bias due to class imbalance at the variable level. Collectively, these characteristics provide a solid foundation for reliable model development and evaluation, as the data’s structure and quality align with best practices for quantitative analysis and regulatory documentation.

Tables

Numerical Variable Num of Obs Mean Min Max Missing Values (%) Data Type
CreditScore 3232 647.7707 350.00 850.00 0.0 int64
Tenure 3232 5.0149 0.00 10.00 0.0 int64
Balance 3232 82173.3282 0.00 250898.09 0.0 float64
NumOfProducts 3232 1.5084 1.00 4.00 0.0 int64
HasCrCard 3232 0.6993 0.00 1.00 0.0 int64
IsActiveMember 3232 0.4697 0.00 1.00 0.0 int64
EstimatedSalary 3232 99459.2961 11.58 199992.48 0.0 float64
Exited 3232 0.5000 0.00 1.00 0.0 int64

Categorical Variable Num of Obs Num of Unique Values Unique Values Missing Values (%) Data Type
Geography 3232.0 3.0 ['Germany' 'Spain' 'France'] 0.0 object
Gender 3232.0 2.0 ['Female' 'Male'] 0.0 object
2026-01-10 02:18:42,314 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:preprocessed_data does not exist in model's document
validmind.data_validation.MissingValues:preprocessed_data

✅ Missing Values Preprocessed Data

Missing Values: Preprocessed Data is designed to assess the quality of a dataset by quantifying the proportion of missing values present in each feature. The primary purpose of this test is to ensure that the ratio of missing data to total data for every column remains below a specified threshold, thereby supporting the reliability and predictive strength of downstream machine learning models.

The test operates by systematically examining each column in the dataset and counting the number of missing entries, typically represented as NaN values. For each feature, it calculates the percentage of missing values relative to the total number of records, providing a clear metric for data completeness. The test then compares these percentages to a predefined threshold, in this case set to 1%. If the percentage of missing values in a column is less than the threshold, the column is marked as "Pass"; otherwise, it is marked as "Fail." The results are presented in a tabular format, with columns for the feature name, the count of missing values, the percentage of missing values, and the pass/fail status. This approach allows for a granular assessment of data quality, with the percentage metric ranging from 0% (no missing data) to 100% (all data missing), where lower percentages are indicative of higher data quality.
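
A minimal sketch of this kind of check is shown below. It treats min_threshold as a percentage, matching the description above; the library's own handling of the parameter may differ, and the helper name is hypothetical:

import pandas as pd

def missing_values_check(df, min_threshold=1.0):
    """Report missing-value counts per column; Pass if below the threshold (%), else Fail."""
    pct_missing = 100 * df.isna().mean()
    return pd.DataFrame({
        "Column": df.columns,
        "Number of Missing Values": df.isna().sum().values,
        "Percentage of Missing Values (%)": pct_missing.values,
        "Pass/Fail": ["Pass" if p < min_threshold else "Fail" for p in pct_missing],
    })

# Usage (preprocessed_df is a placeholder name):
# missing_values_check(preprocessed_df, min_threshold=1)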

The primary advantages of this test include its ability to quickly and transparently identify the presence and extent of missing data across all features in a dataset. By providing a detailed breakdown for each column, the test enables data scientists and model risk managers to pinpoint specific areas where data quality may be compromised. This level of granularity is particularly valuable in regulated environments or high-stakes modeling scenarios, where even small amounts of missing data can have significant impacts on model performance and interpretability. The test's straightforward methodology and clear output make it an effective tool for routine data quality checks, supporting robust model development and validation processes.

It should be noted that the test is limited in several respects. It does not diagnose the underlying causes of missing data, nor does it offer guidance on how to address or impute missing values. The test also does not account for non-standard representations of missingness, such as placeholder values like "-999" or "None," which may not be technically classified as missing but can have similar effects on model outcomes. Additionally, features with missing value percentages just below the threshold may still pose risks to model reliability, yet will not be flagged by this test. High risk is indicated when any column exceeds the threshold or when missing values are distributed across multiple features, potentially undermining the overall integrity of the dataset.

This test shows the results in a structured table, where each row corresponds to a feature in the dataset and the columns display the feature name, the number of missing values, the percentage of missing values, and the pass/fail status based on the 1% threshold. To interpret the table, users should look for columns where the percentage of missing values approaches or exceeds the threshold, as these are flagged as "Fail." In this particular result, all features—such as CreditScore, Geography, Gender, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited—have zero missing values, corresponding to 0.0% missingness for each. Every feature is marked as "Pass," indicating full data completeness across the entire dataset. The scale for the percentage of missing values ranges from 0% to 100%, but in this case, all values are at the lower bound. There are no features with any degree of missingness, and no columns are close to the threshold, suggesting a high level of data integrity. The table provides a clear, at-a-glance summary of the missing value status for each feature, with no notable outliers or areas of concern.

The test results reveal the following key insights:

  • All features exhibit complete data: Every column in the dataset has zero missing values, resulting in a 0.0% missing value percentage for all features.
  • Uniform pass status across all features: Each feature meets the threshold requirement, with all columns marked as "Pass" and none approaching the 1% threshold.
  • No evidence of missing data risk: The absence of missing values across all features indicates that the dataset is free from missing data-related risks, supporting robust downstream modeling.
  • Consistent data quality across feature types: Both categorical and numerical features, such as Geography, Gender, Balance, and EstimatedSalary, show identical completeness, with no variation in missingness patterns.

Based on these results, the dataset demonstrates a high degree of completeness, with no missing values detected in any feature. This uniformity in data quality across all columns suggests that the preprocessing steps applied to the data have been effective in eliminating missing entries, thereby reducing the risk of data-driven biases or instability in subsequent modeling efforts. The consistent "Pass" status for every feature, regardless of data type, further indicates that the dataset is well-suited for use in machine learning applications, as there are no gaps that could compromise model training, validation, or interpretability. The absence of missing data also simplifies downstream processes, such as feature engineering and model selection, by removing the need for imputation or special handling of incomplete records. Overall, the results provide strong evidence that the dataset meets the necessary standards for data quality, supporting reliable and transparent model development.

Parameters:

{
  "min_threshold": 1
}
            

Tables

Column Number of Missing Values Percentage of Missing Values (%) Pass/Fail
CreditScore 0 0.0 Pass
Geography 0 0.0 Pass
Gender 0 0.0 Pass
Tenure 0 0.0 Pass
Balance 0 0.0 Pass
NumOfProducts 0 0.0 Pass
HasCrCard 0 0.0 Pass
IsActiveMember 0 0.0 Pass
EstimatedSalary 0 0.0 Pass
Exited 0 0.0 Pass
2026-01-10 02:19:01,022 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:preprocessed_data does not exist in model's document
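
Results like the one above are generated by running the corresponding ValidMind test and logging it as additional evidence. The snippet below sketches that pattern under the assumption that vm.tests.run_test and result.log() behave as in the earlier notebooks of this series; vm_preprocessed_ds is a placeholder for whichever ValidMind dataset object was initialized with the preprocessed_data input ID:

# Sketch only: run the missing values test against the preprocessed dataset
# and log the result to the validation report (assumes `vm` was initialized above)
result = vm.tests.run_test(
    "validmind.data_validation.MissingValues",
    inputs={"dataset": vm_preprocessed_ds},  # placeholder dataset object
    params={"min_threshold": 1},
)
result.log()
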
validmind.data_validation.TabularNumericalHistograms:preprocessed_data

Tabular Numerical Histograms Preprocessed Data

Tabular Numerical Histograms: Preprocessed Data is designed to provide a visual overview of the distribution of each numerical feature in a dataset, supporting the identification of distributional characteristics, skewness, and potential outliers. The primary purpose of this test is to facilitate exploratory data analysis by enabling a clear understanding of how each numerical input variable is distributed, which is essential for assessing the suitability of the data for downstream modeling and for detecting any irregularities that may impact model performance.

The test operates by systematically extracting all numerical columns from the provided dataset and generating a histogram for each feature using 50 bins. Each histogram visually represents the frequency distribution of values within the corresponding feature, allowing for the detection of patterns such as normality, skewness, and the presence of outliers. The methodology leverages the principle that the shape and spread of a feature’s distribution can reveal important information about the data’s structure and potential preprocessing needs. The histograms are constructed using plotly, which ensures interactive and high-resolution visualizations. The x-axis of each histogram corresponds to the range of values for the feature, while the y-axis indicates the count of observations within each bin. This approach does not require any assumptions about the underlying distribution and is applicable to any numerical variable, making it a robust tool for univariate analysis.
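
A comparable set of figures can be produced directly with Plotly Express. The sketch below is a hypothetical helper, not the library's own plotting code; it builds one 50-bin histogram per numerical column:

import plotly.express as px

def plot_numerical_histograms(df, nbins=50):
    """Draw one histogram per numerical column, similar to the figures below."""
    figures = []
    for col in df.select_dtypes(include="number").columns:
        fig = px.histogram(df, x=col, nbins=nbins, title=f"Distribution of {col}")
        fig.update_layout(yaxis_title="Count")
        figures.append(fig)
    return figures

# Usage (preprocessed_df is a placeholder name):
# for fig in plot_numerical_histograms(preprocessed_df):
#     fig.show()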

The primary advantages of this test include its ability to quickly and intuitively highlight the distributional properties of each numerical feature, making it particularly useful for large datasets with multiple variables. By visualizing the data, users can easily spot skewed distributions, heavy tails, or clusters of outliers that may not be apparent from summary statistics alone. This test is especially valuable in scenarios where model performance is sensitive to input distributions, such as when algorithms assume normality or when outliers can disproportionately influence results. The visual nature of the histograms also aids in communicating data characteristics to both technical and non-technical stakeholders, supporting transparent and informed decision-making during data preprocessing and model development.

It should be noted that this test is limited to univariate analysis, focusing solely on the individual distributions of numerical features without considering relationships or interactions between variables. As a result, it may miss multivariate patterns or dependencies that could be relevant for modeling. Additionally, the test does not provide any direct insight into how the observed distributions affect model outcomes, nor does it address categorical or non-numerical data. Interpretation challenges may arise if the expected distribution for a feature is not well-defined, or if the presence of outliers is not clearly linked to data quality or business context. High-risk indicators, such as pronounced skewness, unexpected distribution shapes, or extreme outliers, should be interpreted with caution, as they may signal underlying data issues that require further investigation.

This test shows a series of histograms, each corresponding to a numerical feature in the dataset, including CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary. Each plot displays the frequency of observations (y-axis) across the range of feature values (x-axis), with bin widths chosen to provide a detailed view of the distribution. For example, the CreditScore histogram reveals a roughly bell-shaped distribution with a peak between 600 and 700, and a range extending from below 400 to above 800. The Tenure histogram shows a relatively uniform distribution across most values, with lower counts at the endpoints. The Balance histogram is notable for a large spike at zero, indicating a substantial number of accounts with no balance, followed by a normal-like distribution for nonzero balances. NumOfProducts, HasCrCard, and IsActiveMember are all discrete or binary features, with histograms showing clear peaks at specific values, reflecting the categorical nature of these variables. EstimatedSalary appears uniformly distributed across its range, with no pronounced skewness or clustering. The scale of each axis is tailored to the feature, with counts ranging from tens to over a thousand, and value ranges spanning from binary (0/1) to continuous values up to 250,000. Notable observations include the concentration of zero balances, the dominance of one or two products per customer, and the high prevalence of credit card holders and active members.

The test results reveal the following key insights:

  • CreditScore Distribution Centers Around 600–700: The CreditScore histogram displays a unimodal, approximately normal distribution, with the majority of values concentrated between 600 and 700, and fewer observations at the lower and upper extremes, indicating a typical credit score range for the dataset.
  • Tenure Exhibits Uniform Distribution Except at Endpoints: The Tenure feature is distributed nearly evenly across values from 0 to 10, with slightly lower frequencies at the minimum and maximum, suggesting no strong bias toward short or long tenure among customers.
  • Balance Shows High Proportion of Zero Values: The Balance histogram reveals a significant spike at zero, indicating a large subset of customers with no account balance, while the remaining values form a bell-shaped distribution centered around 120,000, highlighting a bimodal pattern.
  • NumOfProducts Dominated by One or Two Products: The NumOfProducts feature is heavily concentrated at values 1 and 2, with very few customers holding three or four products, reflecting limited product diversification within the customer base.
  • HasCrCard Is Imbalanced While IsActiveMember Is Roughly Balanced: Both are binary features; the clear majority of customers hold a credit card (roughly 70%), while slightly under half are active members, as shown by the tall bar at 1 for HasCrCard and the similarly sized bars at 0 and 1 for IsActiveMember.
  • EstimatedSalary Is Uniformly Distributed: The EstimatedSalary histogram is relatively flat across its range, indicating that salaries are evenly distributed without significant skewness or clustering, which may reflect the sampling or data generation process.

Based on these results, the dataset’s numerical features display a range of distributional characteristics, from approximately normal (CreditScore) to uniform (EstimatedSalary and Tenure), as well as discrete and binary patterns (NumOfProducts, HasCrCard, IsActiveMember). The presence of a large number of zero balances and the dominance of one or two products per customer suggest specific customer behaviors or business rules that may influence model training and interpretation. The uniformity in EstimatedSalary and Tenure indicates a lack of strong segmentation in these features, while the imbalances in binary features highlight potential areas for further analysis regarding their impact on model predictions. Overall, the histograms provide a comprehensive view of the input data’s structure, supporting the identification of typical value ranges, the detection of outliers, and the assessment of feature distributions relative to modeling assumptions. These observations form a foundational understanding of the dataset’s numerical landscape, informing subsequent steps in data preprocessing and model development.

Figures

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:1670
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:8868
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:7007
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:ad69
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:4856
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:f0ae
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:1e97
2026-01-10 02:19:33,474 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:preprocessed_data does not exist in model's document
validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data

Tabular Categorical Bar Plots Preprocessed Data

Tabular Categorical Bar Plots: Preprocessed Data is designed to provide a visual assessment of the distribution of categorical variables within a dataset, with the primary purpose of evaluating the dataset’s composition and identifying any imbalances or irregularities in category representation.

The test operates by first scanning the dataset to identify all categorical variables. For each detected categorical feature, it calculates the frequency of each unique category and generates a corresponding bar plot. Each bar plot displays the count of records for every category within the feature, allowing for a direct visual comparison of category prevalence. The methodology is straightforward: categorical columns are isolated, and the number of occurrences for each category is tallied. The resulting bar plots use the category names as the x-axis and the count of records as the y-axis, with each bar’s height representing the frequency of that category. This approach is particularly effective for quickly spotting imbalances, such as categories with very few or very many instances, and for detecting features with an excessive number of categories, which could complicate model training. The values on the y-axis are non-negative integers, and the interpretation is direct—higher bars indicate more frequent categories, while shorter bars highlight underrepresented groups. The test does not provide a quantitative threshold for what constitutes a “good” or “poor” distribution, but extreme imbalances or a proliferation of categories are generally considered less desirable for robust model performance.
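
The sketch below illustrates the same idea with Plotly Express: tally each category with value_counts() and draw one bar plot per categorical column. The helper name is hypothetical and this is not the library's own implementation:

import plotly.express as px

def plot_categorical_bars(df):
    """Draw one bar plot of category counts per categorical column."""
    figures = []
    for col in df.select_dtypes(include=["object", "category"]).columns:
        counts = df[col].value_counts().reset_index()
        counts.columns = [col, "count"]  # normalize column names across pandas versions
        figures.append(px.bar(counts, x=col, y="count", title=f"Category counts for {col}"))
    return figures

# Usage (preprocessed_df is a placeholder name):
# for fig in plot_categorical_bars(preprocessed_df):
#     fig.show()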

The primary advantages of this test include its ability to deliver an immediate, intuitive understanding of categorical data distributions, which is essential for both data exploration and model risk assessment. By visualizing the frequency of each category, the test enables practitioners to quickly identify potential sources of bias or underrepresentation that could impact model fairness or generalizability. This visual approach is especially useful in scenarios where categorical imbalances could lead to model overfitting or poor predictive performance on minority classes. The test’s simplicity and clarity make it accessible to a wide range of stakeholders, from data scientists to business analysts, and its graphical output facilitates communication of data characteristics without requiring advanced statistical knowledge. Additionally, the test is scalable to datasets with multiple categorical features, providing a comprehensive overview of the categorical landscape in a single pass.

It should be noted that this test is limited to categorical variables and does not provide any insights into the distribution or characteristics of numerical features. When categorical features contain a large number of unique values, the resulting bar plots can become cluttered and difficult to interpret, reducing their effectiveness. The test does not directly assess model performance or predictive accuracy; instead, it offers a descriptive summary of the input data. Interpretation challenges may arise if the dataset contains rare categories with very low counts, as these may be visually minimized or overlooked in the plots. High risk is indicated by extreme category imbalances or an excessive number of categories within a single feature, both of which can negatively affect model training and generalization. Users should be cautious not to overinterpret the visualizations, as they do not account for downstream modeling effects or interactions between features.

This test shows the results in the form of bar plots, with each plot corresponding to a single categorical feature. The x-axis of each plot lists the unique categories within the feature, while the y-axis represents the count of records for each category. For example, the “Geography” plot displays three categories: France, Germany, and Spain, with France having the highest count at approximately 1,500, Germany around 1,000, and Spain about 750. The “Gender” plot shows two categories, Male and Female, with both categories having similar counts, each slightly above 1,600. The bar heights provide a direct visual cue to the relative prevalence of each category, making it easy to spot imbalances or dominant groups. The scale of the y-axis is consistent across plots, allowing for straightforward comparison of category sizes within each feature. Notable observations include the clear dominance of France in the Geography feature and the near parity between Male and Female in the Gender feature. The plots are uncluttered, with a manageable number of categories, ensuring that the visualizations remain interpretable and actionable.

The test results reveal the following key insights:

  • Geography Distribution Is Uneven: The Geography feature shows a marked imbalance, with France representing the largest group at approximately 1,500 records, Germany at about 1,000, and Spain at roughly 750, indicating that France is overrepresented relative to the other categories.
  • Gender Distribution Is Nearly Balanced: The Gender feature demonstrates near parity, with counts of roughly 1,600 for both Male and Female, suggesting that there is no significant gender imbalance in the dataset.
  • No Excessive Category Proliferation: Both categorical features contain a manageable number of categories—three for Geography and two for Gender—ensuring that the plots remain interpretable and that the risk of overfitting due to high cardinality is minimal.
  • Absence of Rare Categories: All categories in both features have substantial representation, with no category falling below 750 records, reducing the risk of underrepresentation or insufficient data for any single group.

Based on these results, the dataset’s categorical composition is characterized by a significant imbalance in the Geography feature, where France is notably overrepresented compared to Germany and Spain, while the Gender feature maintains a nearly equal distribution between Male and Female. The manageable number of categories in both features ensures that the risk of model overfitting due to high cardinality is low, and the absence of rare categories suggests that all groups are sufficiently represented for modeling purposes. These patterns indicate that, while the dataset is generally well-structured in terms of categorical diversity, attention should be paid to the potential impact of the Geography imbalance on model behavior, as models trained on this data may be more attuned to the characteristics of the French subgroup. The visualizations provide a clear and accessible summary of the categorical landscape, supporting further analysis and model development by highlighting areas of potential risk and stability within the dataset.

Figures

ValidMind Figure validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data:af59
ValidMind Figure validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data:c02a
2026-01-10 02:20:07,574 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data does not exist in model's document
validmind.data_validation.TargetRateBarPlots:preprocessed_data

Target Rate Bar Plots Preprocessed Data

Target Rate Bar Plots: Preprocessed Data is designed to provide a visual summary of the default rates (here, the churn rate, since the positive class in this dataset is Exited) associated with categorical features in a classification machine learning model. The primary purpose of this test is to enable rapid, intuitive assessment of how the model's predictions or target outcomes are distributed across different categories, making it easier to identify patterns, irregularities, or potential misclassifications within the data.

The test operates by generating paired bar plots for each categorical feature in the dataset. For each feature, the first plot displays the frequency count of each category, showing how many instances fall into each group. The second plot presents the mean target rate for each category, which is the proportion of positive class outcomes (defaults) within that group, as derived from the "default_column." The target rate is calculated by dividing the number of positive outcomes by the total number of instances for each category, resulting in a value between 0 and 1, where higher values indicate a greater proportion of positive outcomes. These plots are created using the Plotly library, which ensures clear visual distinction between categories and metrics. The frequency plot helps assess the representation of each category, while the target rate plot reveals how the likelihood of the positive class varies across categories. This dual-plot approach allows for both quantitative and qualitative evaluation of the model's behavior with respect to categorical variables.
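
The underlying computation is a simple group-by: count the records in each category and average the binary target within each group. The following sketch, a hypothetical helper that assumes the target column is Exited, produces comparable paired plots with Plotly Express:

import plotly.express as px

def plot_target_rates(df, target_column="Exited"):
    """For each categorical column, plot category counts next to mean target rates."""
    figures = []
    for col in df.select_dtypes(include=["object", "category"]).columns:
        summary = (
            df.groupby(col)[target_column]
            .agg(count="size", target_rate="mean")  # records and positive-class rate per category
            .reset_index()
        )
        figures.append(px.bar(summary, x=col, y="count", title=f"{col}: record counts"))
        figures.append(px.bar(summary, x=col, y="target_rate", title=f"{col}: mean target rate"))
    return figures

# Usage (preprocessed_df is a placeholder name):
# for fig in plot_target_rates(preprocessed_df):
#     fig.show()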

The primary advantages of this test include its ability to transform complex categorical data into easily interpretable visualizations, making it straightforward to spot outliers, inconsistencies, or unexpected patterns in model predictions. By providing both frequency and target rate information side by side, the test enables users to quickly assess whether certain categories are over- or under-represented and whether the model's predictions align with expectations for each group. This is particularly useful in domains where categorical features play a significant role in decision-making, as it allows for targeted investigation of model fairness, bias, or performance issues. The flexibility to apply the test to any or all categorical columns further enhances its utility, supporting both broad and focused analyses.

It should be noted that the test's effectiveness can diminish as the number of distinct categories increases, potentially leading to crowded or less interpretable plots. Additionally, the test assumes that the "default_column" contains consistent, binary values; any deviation from this format can complicate or invalidate the calculation of target rates. Interpretation challenges may arise if certain categories have very low or high target rates, as this could indicate model misclassification or underlying data imbalances. Users should also be cautious when drawing conclusions from categories with low sample counts, as these may not provide reliable estimates of target rates. The visual nature of the test, while intuitive, may obscure subtle statistical nuances, so results should be considered alongside other quantitative metrics.

This test shows the results in the form of paired bar plots for each categorical feature, specifically "Geography" and "Gender." For each feature, the left plot displays the count of instances per category, while the right plot shows the mean target rate for each category, represented as a proportion between 0 and 1. In the "Geography" plots, France, Germany, and Spain are compared, with France having the highest count and Spain the lowest. The target rate plot reveals that Germany has the highest mean target rate, followed by Spain and France. In the "Gender" plots, the counts for Male and Female are nearly equal, but the target rate for Females is notably higher than for Males. The axes are clearly labeled, with counts on the left and target rates on the right, and each category is color-coded for clarity. The range of target rates spans from approximately 0.4 to over 0.6, indicating substantial variation between categories. Notable observations include the elevated target rate for Germany in the Geography feature and for Females in the Gender feature, suggesting that these groups experience higher rates of the positive class outcome.

The test results reveal the following key insights:

  • Germany Exhibits Highest Default Rate Among Geographies: The target rate for Germany is approximately 0.63, which is significantly higher than France (about 0.43) and Spain (about 0.45), indicating that instances from Germany are more likely to be classified as positive by the model.
  • France Has the Largest Representation but Lower Target Rate: France has the highest count of instances (around 1500), yet its target rate is lower than Germany, suggesting that the model's positive class predictions are not simply a function of category size.
  • Gender Distribution Is Balanced but Target Rates Differ: The counts for Male and Female are nearly identical (both around 1600), but the target rate for Females is higher (approximately 0.55) compared to Males (approximately 0.45), indicating a gender-based difference in model outcomes.
  • Spain Shows Lower Representation and Moderate Target Rate: Spain has the lowest count among the geographies (about 800) and a target rate similar to France, suggesting that the model's behavior for this group is more aligned with France than with Germany.
  • Distinct Patterns Across Features: Both categorical features display clear differences in target rates across their categories, with no evidence of uniformity, highlighting the model's varying behavior depending on the input group.

Based on these results, the model demonstrates distinct and measurable differences in target rates across both Geography and Gender features. The elevated target rate for Germany, despite its moderate representation, suggests that the model is more likely to predict the positive class for this group, which may reflect underlying data characteristics or model sensitivity to this category. Similarly, the higher target rate for Females, in the context of balanced gender representation, points to a consistent pattern in the model's predictions that warrants further examination. The relatively lower and similar target rates for France and Spain, despite differences in sample size, indicate that the model's behavior is not solely driven by category frequency. These observations collectively suggest that the model's decision patterns are influenced by categorical groupings, with certain categories experiencing higher rates of positive predictions. This highlights the importance of ongoing monitoring and analysis to ensure that such patterns align with business objectives and do not inadvertently introduce bias or misclassification for specific groups.

Figures

ValidMind Figure validmind.data_validation.TargetRateBarPlots:preprocessed_data:48c4
ValidMind Figure validmind.data_validation.TargetRateBarPlots:preprocessed_data:9940
2026-01-10 02:20:39,738 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TargetRateBarPlots:preprocessed_data does not exist in model's document
validmind.data_validation.DescriptiveStatistics:development_data

Descriptive Statistics Development Data

Descriptive Statistics: Development Data is designed to provide a comprehensive summary of both numerical and categorical variables within a dataset, with the primary purpose of visualizing the overall distribution and characteristics of the data. This test enables a detailed understanding of the dataset’s structure, which is essential for interpreting model behavior and anticipating performance outcomes.

The test operates by applying established statistical functions to the dataset, specifically leveraging the describe() function for numerical variables and value_counts() for categorical variables. For numerical data, the test calculates key metrics such as count, mean, standard deviation, minimum, maximum, and several percentiles (including the 25th, 50th, 75th, 90th, and 95th). These metrics collectively describe the central tendency, spread, and range of the data. The mean provides an average value, while the standard deviation quantifies variability. Percentiles offer insight into the distribution, highlighting where most data points lie and identifying potential outliers. For categorical variables, the test determines the count of observations, the number of unique categories, the most frequent category, its frequency, and the proportion of this top category relative to the total. This approach allows for the identification of dominant categories and the assessment of diversity within categorical fields. The values for these metrics are typically non-negative, with means and percentiles reflecting the scale of the underlying data, and proportions for categorical dominance ranging from 0 to 1, where higher values indicate less diversity.
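
To compare the development splits in the same layout as the table reported later in this section, the per-split describe() outputs can be stacked with a dataset label. This is a sketch only; train_df and test_df are placeholder names for the final training and test DataFrames:

import pandas as pd

def compare_splits(train_df, test_df):
    """Stack per-split descriptive statistics so train/test distributions can be compared."""
    percentiles = [0.25, 0.5, 0.75, 0.9, 0.95]
    frames = []
    for name, df in [("train_dataset_final", train_df), ("test_dataset_final", test_df)]:
        stats = (
            df.select_dtypes(include="number")
            .describe(percentiles=percentiles)
            .T.rename_axis("Name")
            .reset_index()
        )
        stats.insert(0, "dataset", name)  # label each row with its split
        frames.append(stats)
    return pd.concat(frames, ignore_index=True)

# Usage:
# compare_splits(train_df, test_df)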

The primary advantages of this test include its ability to deliver a clear, high-level overview of the dataset, making it easier to detect anomalies, outliers, and patterns that could influence model performance. By summarizing both numerical and categorical data, the test supports a holistic understanding of the dataset, which is particularly useful during initial data exploration, model development, and validation phases. The inclusion of multiple percentiles and measures of spread enables the identification of skewness and potential data quality issues, while the categorical analysis highlights any lack of diversity or overrepresentation of specific categories. This comprehensive approach ensures that users can quickly assess the suitability of the data for modeling and identify areas that may require further investigation.

It should be noted that while this test provides valuable summary statistics, it does not capture relationships or dependencies between variables, nor does it detect subtle or complex patterns that may exist within the data. The test is limited to univariate analysis, meaning it examines each variable independently. As a result, it cannot identify multicollinearity, interactions, or other forms of association that could impact model performance. Additionally, the test may not fully reveal the presence of high risk unless there are clear signs such as significant differences between the mean and median (indicating skewness) or a dominant category in categorical variables (suggesting low diversity). Interpretation challenges may arise if the data contains hidden biases or if the summary statistics mask important subpopulations. Therefore, these results should be considered as part of a broader suite of analyses to ensure a comprehensive understanding of the dataset.

This test shows the results in the form of structured tables, with each row representing a variable and each column displaying a specific statistical metric. For numerical variables, the columns include count, mean, standard deviation, minimum, several percentiles (25th, 50th, 75th, 90th, 95th), and maximum. These values are presented for both the training and test datasets, allowing for direct comparison. The units of measurement correspond to the original data fields, such as credit score points, years of tenure, or currency for balance and estimated salary. The tables reveal the range and distribution of each variable, with the percentiles providing a granular view of how values are spread across the dataset. Notable observations include the presence of variables with a wide range (e.g., balance and estimated salary), as well as those with more limited variability (e.g., number of products, which ranges from 1 to 4). The categorical variables are summarized by their mean and standard deviation, reflecting the proportion of each category (e.g., the proportion of customers with a credit card or active membership). The results also highlight the consistency between the training and test datasets, with similar means and standard deviations observed across most variables. This consistency suggests that the data splitting process has preserved the underlying distribution, which is important for model generalizability. The tables also make it possible to identify any potential outliers or skewness, such as the minimum and maximum values for balance and estimated salary, which span a broad range.

The test results reveal the following key insights:

  • Numerical Variable Distributions Are Consistent Across Datasets: The mean and standard deviation for key numerical variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary are closely aligned between the training and test datasets, indicating that the data split maintains the original distribution and supports model generalizability.
  • Wide Range and Skewness in Balance and EstimatedSalary: Both Balance and EstimatedSalary exhibit large ranges and high standard deviations relative to their means, with minimum values near zero and maximum values exceeding 200,000, suggesting the presence of significant skewness and potential outliers in these variables.
  • Categorical Variables Show High Proportion of Dominant Categories: Variables such as HasCrCard and IsActiveMember display mean values near 0.7 and 0.5, respectively, indicating that a majority of customers have a credit card, while active membership is more evenly split, reflecting moderate diversity in these binary fields.
  • Percentile Analysis Highlights Data Concentration: The 25th, 50th, and 75th percentiles for variables like CreditScore and Tenure show that most data points are concentrated within a relatively narrow band, while the 90th and 95th percentiles reveal the presence of higher-value outliers.
  • Limited Diversity in NumOfProducts: The NumOfProducts variable has a mean around 1.5 and a maximum of 4, with the majority of observations falling at 1 or 2, indicating limited diversity and a strong concentration in lower product counts.

Based on these results, the dataset used for model development and testing demonstrates a high degree of consistency between the training and test splits, with similar distributions observed across all key numerical and categorical variables. The presence of wide ranges and high standard deviations in variables such as Balance and EstimatedSalary suggests that these fields may contain outliers or exhibit skewed distributions, which could influence model behavior, particularly in terms of sensitivity to extreme values. The analysis of categorical variables indicates that while some fields, such as HasCrCard, are dominated by a single category, others like IsActiveMember are more evenly distributed, providing a balanced representation of customer activity. The percentile breakdowns further reveal that most data points are concentrated within central ranges, with a smaller proportion of extreme values. The limited diversity observed in NumOfProducts suggests that the majority of customers hold only one or two products, which may impact the model’s ability to differentiate between customer segments. Overall, the descriptive statistics provide a clear and objective overview of the dataset’s structure, highlighting both the stability of the data split and the presence of specific patterns and characteristics that may inform subsequent modeling and analysis efforts.

Tables

dataset Name Count Mean Std Min 25% 50% 75% 90% 95% Max
train_dataset_final CreditScore 2585.0 647.4054 99.1170 350.0 580.0 650.0 715.0 778.0 815.0 850.0
train_dataset_final Tenure 2585.0 4.9729 2.8716 0.0 2.0 5.0 7.0 9.0 9.0 10.0
train_dataset_final Balance 2585.0 82542.3928 61253.7555 0.0 0.0 103549.0 129835.0 150756.0 164951.0 238388.0
train_dataset_final NumOfProducts 2585.0 1.5106 0.6752 1.0 1.0 1.0 2.0 2.0 3.0 4.0
train_dataset_final HasCrCard 2585.0 0.6967 0.4598 0.0 0.0 1.0 1.0 1.0 1.0 1.0
train_dataset_final IsActiveMember 2585.0 0.4642 0.4988 0.0 0.0 0.0 1.0 1.0 1.0 1.0
train_dataset_final EstimatedSalary 2585.0 99075.6264 58315.3840 12.0 49324.0 98293.0 149594.0 179533.0 189669.0 199992.0
test_dataset_final CreditScore 647.0 649.2303 97.4424 350.0 582.0 652.0 716.0 776.0 802.0 850.0
test_dataset_final Tenure 647.0 5.1824 3.0079 0.0 3.0 5.0 8.0 9.0 10.0 10.0
test_dataset_final Balance 647.0 80698.7810 62076.6372 0.0 0.0 102882.0 127599.0 149838.0 166830.0 250898.0
test_dataset_final NumOfProducts 647.0 1.4992 0.6462 1.0 1.0 1.0 2.0 2.0 3.0 4.0
test_dataset_final HasCrCard 647.0 0.7094 0.4544 0.0 0.0 1.0 1.0 1.0 1.0 1.0
test_dataset_final IsActiveMember 647.0 0.4915 0.5003 0.0 0.0 0.0 1.0 1.0 1.0 1.0
test_dataset_final EstimatedSalary 647.0 100992.1958 58311.6945 469.0 48169.0 106024.0 150093.0 179127.0 191304.0 199808.0
2026-01-10 02:21:16,124 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:development_data does not exist in model's document
validmind.data_validation.TabularDescriptionTables:development_data

Tabular Description Tables Development Data

Tabular Description Tables is designed to provide a comprehensive summary of the descriptive statistics for numerical, categorical, and datetime variables within a dataset. Its primary purpose is to offer a clear overview of the data’s structure, distribution, and quality, enabling users to quickly assess the characteristics and integrity of the dataset before proceeding with further analysis or modeling.

The test operates by first categorizing each variable in the dataset according to its data type—numerical, categorical, or datetime. For numerical variables, it calculates key statistics such as the number of observations, mean, minimum and maximum values, the percentage of missing values, and the data type. For categorical variables, it determines the number of unique values, lists those unique values, counts missing values, and identifies the data type. For datetime variables, it reports the number of unique values, the earliest and latest dates, missing value counts, and data type. These metrics are derived by systematically scanning each column, applying type-specific aggregation functions, and summarizing the results in tabular form. The values for metrics such as mean, minimum, and maximum are interpreted within the context of the variable’s expected range, with missing value percentages ranging from 0% (no missing data) to 100% (all data missing). High missing value percentages or unexpected data types can indicate potential data quality issues, while the distribution of values provides insight into the dataset’s suitability for modeling.

The primary advantages of this test include its ability to deliver a rapid, holistic snapshot of the dataset’s structure and content, which is essential for data scientists and analysts at the initial stages of data exploration. By presenting detailed statistics for each variable, the test helps identify potential data quality issues, such as missing values or incorrect data types, that could impact downstream modeling. The inclusion of both summary statistics and metadata, such as data types and unique value counts, ensures that users have all the necessary information to make informed decisions about preprocessing, feature engineering, and model selection. This comprehensive overview is particularly valuable in regulated environments, where transparency and data integrity are paramount, and where early detection of anomalies or inconsistencies can prevent costly errors later in the modeling process.

It should be noted that this test is limited to descriptive statistics and does not perform deeper statistical analyses, such as outlier detection, correlation assessment, or evaluation of variable relationships. It does not provide insights into the potential impact of missing values on model performance, nor does it suggest or apply data transformations that may be necessary for optimal modeling. The test also does not address the presence of outliers or the appropriateness of variable distributions beyond basic range checks. High percentages of missing values, inappropriate data types, or unexpected value ranges are signs of potential risk, as they may indicate data collection or integrity issues that require further investigation. Interpretation of the results should be done with caution, as the test does not account for the broader context or intended use of the data.

This test shows the results in tabular format, with each row representing a numerical variable from either the training or test dataset, and columns detailing the dataset name, variable name, number of observations, mean, minimum and maximum values, percentage of missing values, and data type. The tables are straightforward to read: each variable’s statistics are presented side by side for both datasets, allowing for direct comparison. The key measurements include the mean, which provides the central tendency; the minimum and maximum, which define the observed range; and the missing values percentage, which indicates data completeness. All variables in both datasets have 0% missing values, suggesting complete data. The numerical variables include CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited. The ranges for these variables are consistent with their expected domains, such as CreditScore ranging from 350 to 850 and Tenure from 0 to 10. The mean values for most variables are similar between the training and test datasets, with only minor variations. Notably, variables like Balance and EstimatedSalary have wide ranges, reflecting the diversity of financial data. The data types are consistent across both datasets, with integer and float types appropriately assigned. No categorical or datetime variables are present in the provided results, so the focus remains on numerical data.

The test results reveal the following key insights:

  • Complete Data Coverage Across All Variables: Both the training and test datasets exhibit 0% missing values for all numerical variables, indicating full data availability and no immediate concerns regarding data completeness.
  • Consistent Variable Ranges and Data Types: The minimum and maximum values for each variable align with expected business rules, such as CreditScore ranging from 350 to 850 and Tenure from 0 to 10, with all data types correctly assigned as integer or float.
  • Stable Central Tendencies Between Datasets: The mean values for key variables, including CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited, are closely matched between the training and test datasets, suggesting that the data distributions are stable and comparable.
  • Wide Financial Value Distributions: Variables such as Balance and EstimatedSalary display substantial ranges, with Balance spanning from 0 to over 238,000 in the training set and up to 250,898 in the test set, and EstimatedSalary ranging from as low as 11.58 to nearly 200,000, reflecting significant variability in customer financial profiles.
  • Binary and Discrete Variable Integrity: Variables like HasCrCard, IsActiveMember, NumOfProducts, and Exited maintain appropriate minimum and maximum values (0/1 for binary, 1–4 for NumOfProducts), confirming correct encoding and absence of out-of-domain values.

Based on these results, the datasets demonstrate strong data integrity, with complete coverage and consistent statistical properties across both training and test sets. The absence of missing values and the alignment of variable ranges and means indicate that the data is well-prepared for subsequent modeling steps, with no immediate signs of data quality risks such as mass missingness or inappropriate data types. The stability of central tendencies between datasets suggests that the training and test sets are drawn from similar distributions, supporting reliable model evaluation and generalization. The observed variability in financial variables is expected given the domain and does not present any anomalies. Overall, the descriptive statistics confirm that the numerical variables are correctly structured and encoded, providing a solid foundation for further analysis and model development.

Tables

| dataset | Numerical Variable | Num of Obs | Mean | Min | Max | Missing Values (%) | Data Type |
|---|---|---|---|---|---|---|---|
| train_dataset_final | CreditScore | 2585 | 647.4054 | 350.00 | 850.00 | 0.0 | int64 |
| train_dataset_final | Tenure | 2585 | 4.9729 | 0.00 | 10.00 | 0.0 | int64 |
| train_dataset_final | Balance | 2585 | 82542.3928 | 0.00 | 238387.56 | 0.0 | float64 |
| train_dataset_final | NumOfProducts | 2585 | 1.5106 | 1.00 | 4.00 | 0.0 | int64 |
| train_dataset_final | HasCrCard | 2585 | 0.6967 | 0.00 | 1.00 | 0.0 | int64 |
| train_dataset_final | IsActiveMember | 2585 | 0.4642 | 0.00 | 1.00 | 0.0 | int64 |
| train_dataset_final | EstimatedSalary | 2585 | 99075.6264 | 11.58 | 199992.48 | 0.0 | float64 |
| train_dataset_final | Exited | 2585 | 0.5083 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | CreditScore | 647 | 649.2303 | 350.00 | 850.00 | 0.0 | int64 |
| test_dataset_final | Tenure | 647 | 5.1824 | 0.00 | 10.00 | 0.0 | int64 |
| test_dataset_final | Balance | 647 | 80698.7810 | 0.00 | 250898.09 | 0.0 | float64 |
| test_dataset_final | NumOfProducts | 647 | 1.4992 | 1.00 | 4.00 | 0.0 | int64 |
| test_dataset_final | HasCrCard | 647 | 0.7094 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | IsActiveMember | 647 | 0.4915 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | EstimatedSalary | 647 | 100992.1958 | 468.94 | 199808.10 | 0.0 | float64 |
| test_dataset_final | Exited | 647 | 0.4668 | 0.00 | 1.00 | 0.0 | int64 |
2026-01-10 02:21:40,705 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:development_data does not exist in model's document
validmind.data_validation.ClassImbalance:development_data

✅ Class Imbalance Development Data

Class Imbalance is designed to evaluate and quantify the distribution of target classes within a dataset used by a machine learning model, with the primary purpose of identifying whether any class is under-represented to a degree that could introduce bias into the model’s predictions. By ensuring that each class meets a minimum representation threshold, the test helps safeguard against the risk of the model favoring the majority class and underperforming on the minority class, which is critical for maintaining fairness and predictive reliability in classification tasks.

The test operates by calculating the frequency of each class in the target column of the dataset and expressing these frequencies as percentages of the total number of records. For each class, the test compares its percentage representation to a configurable minimum threshold, which is set to 10% by default. If any class falls below this threshold, it is flagged as high risk for class imbalance. The test outputs both a pass/fail status for each class and a quantitative breakdown of class proportions. The methodology is straightforward: it counts the number of records for each class, divides by the total number of records, and multiplies by 100 to obtain a percentage. The resulting values range from 0% to 100%, where higher values indicate greater representation. A class is considered adequately represented if its percentage meets or exceeds the threshold, while values below the threshold signal potential imbalance. The test also generates visualizations to aid in interpreting the class distribution.
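As a concrete illustration of the calculation above, here is a minimal pandas sketch of the class-proportion check; it is not ValidMind's internal implementation, and the `class_imbalance_check` helper name and the example usage are assumptions for demonstration.

import pandas as pd

def class_imbalance_check(df: pd.DataFrame, target: str, min_percent_threshold: float = 10.0) -> pd.DataFrame:
    # Percentage of rows per class, compared against the minimum representation threshold
    percentages = df[target].value_counts(normalize=True) * 100
    return pd.DataFrame({
        target: percentages.index,
        "Percentage of Rows (%)": percentages.round(2).values,
        "Pass/Fail": ["Pass" if p >= min_percent_threshold else "Fail" for p in percentages.values],
    })

# Example usage (assumes a DataFrame train_df with an 'Exited' target column):
# class_imbalance_check(train_df, "Exited", min_percent_threshold=10)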

The primary advantages of this test include its ability to quickly and clearly identify under-represented classes that could compromise model performance. The calculation is computationally efficient and easy to interpret, making it suitable for rapid diagnostics during data preparation. The test’s quantitative output not only highlights the presence of imbalance but also provides a precise measure of its extent, which is valuable for both technical and non-technical stakeholders. The adjustable threshold parameter allows the test to be tailored to different domains and risk tolerances, enhancing its flexibility. Additionally, the inclusion of visual plots of class proportions supports intuitive understanding and communication of the results, which is particularly useful in collaborative or regulatory environments.

It should be noted that the test has several limitations. It may be less informative for datasets with a large number of classes, where some degree of imbalance is often unavoidable due to the natural distribution of the data. The sensitivity of the test to the chosen threshold means that inappropriate threshold settings could either mask genuine imbalance or overstate minor deviations. The test does not account for the varying costs or consequences of misclassifying different classes, which can be significant in certain applications. Furthermore, while the test can detect and quantify imbalance, it does not provide direct solutions for addressing it, such as resampling or reweighting strategies. The test is also limited to classification problems and is not applicable to regression or clustering tasks. High risk is specifically indicated when any class falls below the minimum percentage threshold, which should prompt further investigation.

This test shows the results in both tabular and graphical formats. The tables present, for each dataset (train and test), the class label, the percentage of rows corresponding to each class, and a pass/fail status based on the 10% minimum threshold. The columns are clearly labeled: “dataset” identifies the data split, “Exited” indicates the class label, “Percentage of Rows (%)” provides the class proportion as a percentage, and “Pass/Fail” shows whether the class meets the threshold. The bar plots visually depict the proportion of each class within the train and test datasets, with the x-axis representing the class label and the y-axis showing the percentage. The bars are colored for clarity, and the plots are titled to indicate the dataset being visualized. In the train dataset, the class proportions are nearly equal, with 1 (Exited) at 50.83% and 0 (Not Exited) at 49.17%. In the test dataset, class 0 is slightly more prevalent at 53.32%, while class 1 is at 46.68%. All classes in both datasets exceed the 10% threshold, and all receive a “Pass” status. The visualizations confirm the numerical results, showing balanced distributions with no class falling below the threshold. The range of values is close to 50% for both classes in both datasets, indicating a well-balanced dataset with no significant skew.

The test results reveal the following key insights:

  • Both datasets exhibit balanced class distributions: The train and test datasets show class proportions that are close to equal, with no class falling below the 10% threshold.
  • All classes pass the minimum representation threshold: Each class in both datasets exceeds the 10% minimum, with the lowest proportion being 46.68% for class 1 in the test dataset.
  • Visualizations confirm numerical balance: The bar plots for both datasets display nearly equal heights for both classes, visually reinforcing the balanced nature of the data.
  • Minor variation between train and test splits: While both datasets are balanced, the test dataset has a slightly higher proportion of class 0 (53.32%) compared to the train dataset (49.17%), indicating a small but non-critical shift in class distribution.
  • No evidence of high-risk class imbalance: All pass/fail indicators are positive, and no class is flagged as under-represented, suggesting that the risk of model bias due to class imbalance is minimal.

Based on these results, the class distribution in both the train and test datasets is well-balanced, with each class represented at nearly equal proportions and comfortably above the 10% minimum threshold. The numerical and visual outputs consistently indicate that neither class is under-represented, and the minor differences between the train and test splits do not approach the threshold for concern. This balanced distribution supports the expectation that the model will not be unduly biased toward either class due to data imbalance, and that both classes should be learnable by the model with comparable accuracy. The absence of any class falling below the threshold, as well as the close alignment between the train and test distributions, suggests that the dataset is suitable for training and evaluating classification models without the need for additional balancing interventions. The results provide clear evidence that the risk of class imbalance affecting model performance is low in this context, and the dataset structure is robust with respect to this aspect of data quality.

Parameters:

{
  "min_percent_threshold": 10
}

Tables

| dataset | Exited | Percentage of Rows (%) | Pass/Fail |
|---|---|---|---|
| train_dataset_final | 1 | 50.83% | Pass |
| train_dataset_final | 0 | 49.17% | Pass |
| test_dataset_final | 0 | 53.32% | Pass |
| test_dataset_final | 1 | 46.68% | Pass |

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:development_data:314d
ValidMind Figure validmind.data_validation.ClassImbalance:development_data:9048
2026-01-10 02:22:07,944 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:development_data does not exist in model's document
validmind.data_validation.UniqueRows:development_data

❌ Unique Rows Development Data

Unique Rows: Development Data is designed to assess the diversity of the dataset by verifying that the number of unique rows in each column exceeds a specified minimum percentage threshold. The primary purpose of this test is to ensure that the data used for model development is sufficiently varied, which is critical for training robust and unbiased machine learning models capable of generalizing well to new, unseen data.

The test operates by first determining the total number of rows in the dataset and then calculating the number of unique values present in each column. For each column, the percentage of unique values is computed as the ratio of unique values to the total row count, expressed as a percentage. This percentage is then compared against a predefined minimum threshold—in this case, 1%. If the percentage of unique values in a column meets or exceeds the threshold, the column is marked as "Pass"; otherwise, it is marked as "Fail." The test is applied independently to each column, and the results are reported for both the training and test datasets. The typical range for the percentage of unique values is from 0% (no diversity) to 100% (all values unique), with higher percentages generally indicating greater data diversity. Columns with low percentages may indicate limited variability, which can be problematic for model training, especially if such columns are expected to capture important distinctions in the data.
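The uniqueness calculation described above can be sketched in a few lines of pandas; this is an illustrative re-implementation, not the library's own code, and the `unique_rows_check` helper name is an assumption.

import pandas as pd

def unique_rows_check(df: pd.DataFrame, min_percent_threshold: float = 1.0) -> pd.DataFrame:
    # Percentage of unique values per column relative to the total row count
    n_rows = len(df)
    records = []
    for col in df.columns:
        n_unique = df[col].nunique()
        pct_unique = n_unique / n_rows * 100
        records.append({
            "Column": col,
            "Number of Unique Values": n_unique,
            "Percentage of Unique Values (%)": round(pct_unique, 4),
            "Pass/Fail": "Pass" if pct_unique >= min_percent_threshold else "Fail",
        })
    return pd.DataFrame(records)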

The primary advantages of this test include its efficiency and systematic approach to evaluating data diversity across all columns in a dataset. By providing a clear, quantitative measure of uniqueness for each column, the test enables rapid identification of columns that may lack sufficient variability. This is particularly useful in scenarios where data quality and representativeness are critical, such as in regulated environments or when developing models intended for deployment in dynamic, real-world settings. The test's straightforward methodology makes it easy to interpret and communicate results, supporting transparent model documentation and facilitating early detection of potential data quality issues that could impact model performance.

It should be noted that the Unique Rows test has several limitations. It assumes that higher uniqueness directly correlates with better data quality, which may not always be the case, especially for categorical variables where a limited set of categories is expected and appropriate. The test does not account for the predictive importance of each column, treating all columns equally regardless of their relevance to the model's target variable. Additionally, the test may flag columns with inherently low cardinality, such as binary or categorical features, as failing, even when this is not indicative of a data quality problem. A lack of diversity in certain columns may be entirely appropriate depending on the domain context, and the test does not distinguish between such cases and genuine data quality concerns. Furthermore, the test does not consider interactions between columns or the overall structure of the dataset, focusing solely on univariate uniqueness.

This test shows the results in tabular format, with each row representing a column from either the training or test dataset. The table includes the dataset name, column name, number of unique values, percentage of unique values, and a pass/fail indicator based on the 1% threshold. To interpret the table, one should look at the "Percentage of Unique Values (%)" column to assess the diversity of each feature and the "Pass/Fail" column to quickly identify which columns meet the minimum uniqueness requirement. For example, in the training dataset, "CreditScore" has 423 unique values (16.36%), "Balance" has 1,773 unique values (68.59%), and "EstimatedSalary" has 2,585 unique values (100%), all of which pass the threshold. In contrast, columns such as "Tenure," "NumOfProducts," "HasCrCard," "IsActiveMember," and several one-hot encoded categorical features have very low percentages (all below 1%) and are marked as failing. The test dataset shows similar patterns, with continuous variables generally passing and categorical or binary variables failing. The range of unique value percentages spans from as low as 0.08% for binary columns to 100% for "EstimatedSalary," highlighting significant variability in data diversity across features.

The test results reveal the following key insights:

  • Continuous Features Exhibit High Uniqueness: Columns such as "CreditScore," "Balance," and "EstimatedSalary" in both the training and test datasets have high percentages of unique values, ranging from approximately 16% to 100%, and consistently pass the uniqueness threshold.
  • Categorical and Binary Features Show Low Diversity: Columns representing categorical or binary variables, including "NumOfProducts," "HasCrCard," "IsActiveMember," "Geography_Germany," "Geography_Spain," "Gender_Male," and "Exited," have very low percentages of unique values (all below 1%) and fail the test in both datasets.
  • Largely Consistent Patterns Across Datasets: The pattern of passing and failing columns is nearly identical between the training and test datasets, indicating stable data structure and encoding practices across splits; the one exception is "Tenure," which fails in the training set (0.43%) but passes in the smaller test set (1.70%), where its 11 unique values account for a larger share of the rows.
  • Threshold Sensitivity for Low-Cardinality Columns: The 1% uniqueness threshold is not met by any of the binary or low-cardinality categorical columns, which is expected given their limited possible values, but these columns are still flagged as failing by the test.
  • Distinctiveness of Salary Feature: "EstimatedSalary" stands out with 100% unique values in both datasets, indicating that every record has a distinct salary value, which may reflect either true data diversity or the presence of a continuous, unrounded variable.

Based on these results, the dataset demonstrates a clear distinction between continuous and categorical features in terms of uniqueness, with continuous variables such as "CreditScore," "Balance," and "EstimatedSalary" providing substantial diversity and easily surpassing the minimum uniqueness threshold. In contrast, categorical and binary features, by their nature, exhibit low uniqueness and do not meet the threshold, resulting in a fail status for these columns. This pattern is consistent across both the training and test datasets, suggesting that the data preparation and encoding processes are applied uniformly. The high uniqueness observed in continuous features supports the potential for these variables to contribute to model generalization, while the low uniqueness in categorical features is an expected characteristic given their design. The results highlight the importance of interpreting uniqueness metrics in the context of feature types, as low-cardinality columns will inherently fail this test despite being appropriate for their intended use. Overall, the test provides a comprehensive view of data diversity, confirming that the dataset contains a mix of highly unique continuous features and appropriately encoded categorical variables, each exhibiting expected patterns of uniqueness given their respective data types.

Parameters:

{
  "min_percent_threshold": 1
}

Tables

| dataset | Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---|---|---|---|---|
| train_dataset_final | CreditScore | 423 | 16.3636 | Pass |
| train_dataset_final | Tenure | 11 | 0.4255 | Fail |
| train_dataset_final | Balance | 1773 | 68.5880 | Pass |
| train_dataset_final | NumOfProducts | 4 | 0.1547 | Fail |
| train_dataset_final | HasCrCard | 2 | 0.0774 | Fail |
| train_dataset_final | IsActiveMember | 2 | 0.0774 | Fail |
| train_dataset_final | EstimatedSalary | 2585 | 100.0000 | Pass |
| train_dataset_final | Geography_Germany | 2 | 0.0774 | Fail |
| train_dataset_final | Geography_Spain | 2 | 0.0774 | Fail |
| train_dataset_final | Gender_Male | 2 | 0.0774 | Fail |
| train_dataset_final | Exited | 2 | 0.0774 | Fail |
| test_dataset_final | CreditScore | 294 | 45.4405 | Pass |
| test_dataset_final | Tenure | 11 | 1.7002 | Pass |
| test_dataset_final | Balance | 433 | 66.9243 | Pass |
| test_dataset_final | NumOfProducts | 4 | 0.6182 | Fail |
| test_dataset_final | HasCrCard | 2 | 0.3091 | Fail |
| test_dataset_final | IsActiveMember | 2 | 0.3091 | Fail |
| test_dataset_final | EstimatedSalary | 647 | 100.0000 | Pass |
| test_dataset_final | Geography_Germany | 2 | 0.3091 | Fail |
| test_dataset_final | Geography_Spain | 2 | 0.3091 | Fail |
| test_dataset_final | Gender_Male | 2 | 0.3091 | Fail |
| test_dataset_final | Exited | 2 | 0.3091 | Fail |
2026-01-10 02:22:33,078 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:development_data does not exist in model's document
validmind.data_validation.TabularNumericalHistograms:development_data

Tabular Numerical Histograms Development Data

Tabular Numerical Histograms is designed to provide a visual overview of the distribution of each numerical feature in a dataset, supporting the identification of distributional characteristics, skewness, and potential outliers. The primary purpose of this test is to facilitate exploratory data analysis by enabling a clear understanding of how each input variable is distributed, which is essential for assessing the suitability of data for downstream modeling and for detecting any irregularities that may impact model performance.

The test operates by systematically extracting all numerical columns from the provided dataset and generating histograms for each feature using a consistent binning strategy, typically with 50 bins. Each histogram visually represents the frequency distribution of values for a single feature, with the x-axis denoting the range of observed values and the y-axis indicating the count of records within each bin. This approach allows for the detection of central tendencies, spread, skewness, and the presence of outliers. The methodology is purely univariate, focusing on one feature at a time, and does not incorporate any relationships between features. The histograms are generated for both training and test datasets, enabling a direct visual comparison of feature distributions across data splits. The typical range of values for each feature is determined by the data itself, and the interpretation of the histograms relies on visual assessment of the shape, spread, and any anomalies such as spikes or long tails.
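A minimal matplotlib sketch of this binning approach is shown below; ValidMind renders its own figures, so this is only an illustration of the underlying idea, with the `plot_numerical_histograms` helper name assumed and the 50-bin default taken from the description above.

import matplotlib.pyplot as plt

def plot_numerical_histograms(df, bins: int = 50):
    # One univariate histogram per numerical column, using a fixed number of bins
    for col in df.select_dtypes(include="number").columns:
        fig, ax = plt.subplots()
        ax.hist(df[col].dropna(), bins=bins)
        ax.set_title(f"Histogram of {col}")
        ax.set_xlabel(col)
        ax.set_ylabel("Count")
        plt.show()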

The primary advantages of this test include its simplicity and effectiveness in providing immediate, interpretable visual feedback on the distributional properties of each numerical feature. By presenting the data in histogram form, the test makes it straightforward to spot skewed distributions, heavy tails, or clusters of outliers that may otherwise go unnoticed in summary statistics. This is particularly useful in scenarios where model assumptions require normally distributed inputs or where the presence of extreme values could unduly influence model training. The test is scalable to large datasets and can be applied to any number of numerical variables, making it a versatile tool for initial data quality assessment and for ongoing monitoring of data integrity in production environments.

It should be noted that this test is limited to univariate analysis of numerical features and does not consider interactions or dependencies between variables. As a result, it may miss multivariate patterns or anomalies that only become apparent when examining combinations of features. The test also does not provide any direct insight into how the observed distributions affect model outputs or performance, nor does it address categorical or text-based features. Interpretation challenges may arise if the expected distribution for a feature is not well defined, or if the presence of outliers is contextually appropriate but visually prominent. High-risk indicators, such as pronounced skewness, unexpected distributional shapes, or extreme outliers, should be interpreted with caution, as they may signal data quality issues or the need for further investigation.

This test shows a series of histograms for each numerical feature in both the training and test datasets, with each plot displaying the frequency of observations across the range of possible values for that feature. The x-axis of each histogram represents the feature values, while the y-axis shows the count of records in each bin. For example, the "CreditScore" histogram reveals the distribution of credit scores, with most values concentrated between 500 and 800, and a visible right tail. The "Balance" histogram displays a large spike at zero, indicating a substantial proportion of records with no balance, followed by a roughly normal distribution for nonzero balances. "NumOfProducts," "HasCrCard," and "IsActiveMember" are shown as bar plots with discrete values, reflecting the count of customers in each category. "EstimatedSalary" appears uniformly distributed across its range. Categorical features encoded as binary indicators, such as "Geography_Germany," "Geography_Spain," and "Gender_Male," are also presented as bar plots, showing the proportion of records in each group. The test results are presented separately for the training and test datasets, allowing for direct comparison of feature distributions and assessment of data consistency across splits. Notable observations include the presence of strong class imbalances in some binary features, a high frequency of zero balances, and generally consistent distributions between training and test sets.

The test results reveal the following key insights:

  • Feature Distributions Are Consistent Across Splits: The histograms for both training and test datasets show similar shapes and ranges for all features, indicating that the data splits are representative and that there is no evidence of distributional drift between training and test sets.
  • CreditScore Exhibits Mild Right Skew: The "CreditScore" feature is concentrated between 500 and 800, with a peak around 650–700 and a gradual decline toward higher values, suggesting a mild right skew in both datasets.
  • Balance Has a Large Proportion of Zero Values: The "Balance" feature displays a pronounced spike at zero, with over 800 records in the training set and over 200 in the test set, followed by a bell-shaped distribution for nonzero balances, indicating a significant segment of customers with no account balance.
  • NumOfProducts and Binary Features Are Highly Imbalanced: The majority of records have one or two products, with very few having three or four. Similarly, "HasCrCard" and "IsActiveMember" show strong imbalances, with most customers having a credit card and a near-even split for active membership.
  • EstimatedSalary Is Uniformly Distributed: The "EstimatedSalary" feature is spread evenly across its range from 0 to 200,000, with no visible skew or clustering, suggesting a uniform sampling or synthetic generation.
  • Geographical and Gender Features Show Clear Groupings: The binary-encoded geography and gender features reveal distinct group sizes, with "Geography_Germany" and "Geography_Spain" showing fewer records in the "true" category, and "Gender_Male" being nearly balanced.

Based on these results, the distributions of numerical and binary features in both the training and test datasets are broadly consistent, supporting the validity of the data split and reducing the likelihood of sampling bias. The presence of a large number of zero balances and the strong class imbalances in certain binary features are notable characteristics that may influence model behavior, particularly in terms of feature importance and the handling of rare categories. The mild right skew in "CreditScore" and the uniform distribution of "EstimatedSalary" suggest that these features may require different preprocessing or modeling strategies depending on the assumptions of downstream algorithms. The clear groupings in geographical and gender features provide a basis for subgroup analysis or fairness assessments. Overall, the visualizations confirm that the data is well-structured and that the key distributional properties are stable across datasets, providing a reliable foundation for subsequent modeling and analysis.

Figures

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:2922
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:76fa
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:fbe9
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:7991
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:d049
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:24cc
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:3d50
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:f6c7
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:a97d
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:9bf5
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:43e4
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:aa38
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:c48d
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:3d91
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:5531
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:6fe6
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:e151
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:9f4b
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:d965
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:f89c
2026-01-10 02:23:07,361 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:development_data does not exist in model's document
validmind.data_validation.MutualInformation:development_data

Mutual Information Development Data

Mutual Information: development data is designed to quantify the statistical dependency between each feature and the target variable, providing a measure of how much information each feature contributes to predicting the target. The primary purpose of this test is to support feature selection and model interpretability by identifying which features are most relevant for model training and which may be redundant or irrelevant.

The test operates by calculating mutual information scores for each feature with respect to the target variable using established methods from the scikit-learn library. Specifically, it applies either a classification or regression approach depending on the nature of the target, and computes a normalized score for each feature that ranges from 0 to 1. Mutual information measures the reduction in uncertainty about the target variable given knowledge of a particular feature, capturing both linear and non-linear relationships. The calculation requires the full set of feature values and the corresponding target values, and the resulting score reflects the strength of association: a score of 0 indicates no dependency, while a score closer to 1 indicates a strong relationship. The test presents these results in both tabular and graphical formats, with a configurable threshold line to highlight features that may fall below a minimum relevance level. Interpretation of the scores should consider that higher values suggest greater predictive power, while low or near-zero values may indicate irrelevance or redundancy.
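For intuition, the scikit-learn call underlying this kind of score can be sketched as follows; the exact normalization and plotting used by the ValidMind test are not reproduced here, and the `mutual_information_scores` helper, the `random_state` value, and the assumption that all features are numeric are illustrative choices.

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def mutual_information_scores(df: pd.DataFrame, target: str, min_threshold: float = 0.01) -> pd.DataFrame:
    # Mutual information between each (numeric) feature and a classification target
    X = df.drop(columns=[target])
    y = df[target]
    scores = mutual_info_classif(X, y, random_state=42)
    result = pd.DataFrame({"Feature": X.columns, "Mutual Information": scores})
    result["Above Threshold"] = result["Mutual Information"] >= min_threshold
    return result.sort_values("Mutual Information", ascending=False)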

The primary advantages of this test include its ability to detect both linear and non-linear associations between features and the target, making it robust to a wide range of data types and relationships. It is scale-invariant, meaning that the measurement is not affected by the units or scaling of the features, and it can be applied to both numerical and categorical variables. The mutual information score is interpretable and bounded, facilitating straightforward comparison across features. This test is computationally efficient for most practical dataset sizes and does not require assumptions about the underlying data distribution. Its flexibility and interpretability make it particularly useful for automated feature selection processes and for gaining insights into which features are most influential in a predictive modeling context.

It should be noted that the test has several limitations and potential risks. Reliable estimation of mutual information requires a sufficient amount of data, and the method may become computationally intensive for very large datasets. The test only measures pairwise relationships between individual features and the target, so it cannot detect redundancy or interactions among features. For continuous variables, the results can be sensitive to how the data is discretized, and rare but important events may be underrepresented. The method does not handle missing values directly and may be affected by extreme class imbalance, potentially leading to underestimated scores for minority classes. Signs of high risk include many features with very low scores, key business features with unexpectedly low scores, a highly skewed distribution of scores, or critical features falling below the minimum threshold. Inconsistent results across different data samples may also indicate instability or unreliability in the feature relevance assessment.

This test shows the mutual information scores for each feature in both the training and test datasets, presented as bar plots with a dashed horizontal line indicating the minimum threshold of 0.01. Each bar represents a feature, with the height corresponding to its mutual information score, and the color indicating whether the score is above (blue) or below (red) the threshold. The x-axis lists the feature names, while the y-axis shows the mutual information score on a scale from 0 to 0.1. In the training dataset, "NumOfProducts" stands out with the highest score, exceeding 0.09, followed by "Balance" at approximately 0.035. Other features such as "Geography_Germany," "IsActiveMember," "Tenure," "Gender_Male," and "Geography_Spain" have scores ranging from just above 0.01 to 0.02, all above the threshold. "EstimatedSalary," "HasCrCard," and "CreditScore" fall below the threshold, with scores near or below 0.005. In the test dataset, "NumOfProducts" again has the highest score, close to 0.1, with "Geography_Spain" and "Balance" following at around 0.03 and 0.015, respectively. "Gender_Male" is just above the threshold, while the remaining features, including "CreditScore," "Tenure," "HasCrCard," "IsActiveMember," "EstimatedSalary," and "Geography_Germany," have scores at or near zero. The plots allow for easy comparison of feature relevance and highlight which features consistently contribute information about the target across both datasets.

The test results reveal the following key insights:

  • NumOfProducts consistently dominates feature relevance: In both training and test datasets, "NumOfProducts" achieves the highest mutual information score, with values above 0.09 and 0.1, respectively, indicating it is the most informative feature for predicting the target.
  • Secondary features show moderate relevance with some shift between splits: "Balance" and "Geography_Spain" stay above the threshold in both datasets, but their scores move noticeably, with "Balance" at 0.035 in training and 0.015 in test, and "Geography_Spain" at 0.011 in training and 0.03 in test, indicating moderate predictive power that remains well below the top feature.
  • Several features fall below the minimum threshold: "EstimatedSalary," "HasCrCard," and "CreditScore" in the training set, and "CreditScore," "Tenure," "HasCrCard," "IsActiveMember," "EstimatedSalary," and "Geography_Germany" in the test set, all have mutual information scores below the 0.01 threshold, indicating limited or negligible relevance.
  • Feature importance distribution is highly skewed: The majority of the predictive power is concentrated in a small subset of features, with most features contributing little to no information, as evidenced by the sharp drop-off in scores after the top two or three features.
  • Consistency and variation across datasets: While the top feature remains the same across both datasets, there are shifts in the relative importance of secondary features, such as "Geography_Germany" being more relevant in training but not in test, and "Geography_Spain" increasing in importance in the test set.

Based on these results, the mutual information analysis demonstrates that the predictive information in the dataset is heavily concentrated in a small number of features, with "NumOfProducts" consistently providing the most substantial contribution to the target variable in both training and test datasets. Secondary features such as "Balance" and "Geography_Spain" also show moderate and relatively stable relevance, though their scores are significantly lower than the top feature. The majority of features, including "EstimatedSalary," "HasCrCard," and "CreditScore," exhibit very low or near-zero mutual information scores, suggesting they add little value for prediction in this context. The distribution of scores is highly skewed, with a clear separation between the most and least informative features. There is general consistency in the ranking of feature importance across datasets, though some variation in the relative scores of secondary features is observed. These patterns indicate that the model's predictive capacity is likely driven by a small subset of features, and that most features do not provide substantial additional information about the target. This concentration of information may have implications for model complexity, interpretability, and the potential for overfitting or underfitting, depending on how these features are utilized in subsequent modeling steps.

Parameters:

{
  "min_threshold": 0.01
}

Figures

ValidMind Figure validmind.data_validation.MutualInformation:development_data:825e
ValidMind Figure validmind.data_validation.MutualInformation:development_data:042c
2026-01-10 02:23:47,241 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MutualInformation:development_data does not exist in model's document
validmind.data_validation.PearsonCorrelationMatrix:development_data

Pearson Correlation Matrix Development Data

Pearson Correlation Matrix: Development Data is designed to evaluate the extent of linear dependency between all pairs of numerical variables in a dataset. Its primary purpose is to identify potential redundancy among variables by quantifying the strength and direction of their linear relationships, thereby supporting dimensionality reduction and improving model interpretability.

The test operates by calculating the Pearson correlation coefficient for every pair of numerical variables in the dataset. This coefficient measures the degree to which two variables move together in a linear fashion, with values ranging from -1 to 1. A value of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. The test visualizes these coefficients in a heat map, where the color intensity and hue represent the magnitude and direction of the correlation. The heat map provides a matrix view, with each cell corresponding to the correlation between a pair of variables. Correlations with an absolute value greater than 0.7 are highlighted, signaling a high degree of linear dependency. The test requires all numerical variables as input and systematically compares each pair, making it possible to quickly identify clusters of highly correlated features that may be redundant.
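The pairwise computation and heat-map rendering can be approximated with pandas and matplotlib as below; this is a rough sketch of the idea rather than the test's actual plotting code, and the figure size, colormap, and `plot_correlation_heatmap` name are assumptions.

import matplotlib.pyplot as plt

def plot_correlation_heatmap(df, highlight_threshold: float = 0.7):
    # Pairwise Pearson correlations for numerical columns, rendered as an annotated heat map
    corr = df.select_dtypes(include="number").corr(method="pearson")
    fig, ax = plt.subplots(figsize=(8, 6))
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap="RdBu")  # red = negative, blue = positive
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(im, ax=ax)
    for i in range(len(corr.columns)):
        for j in range(len(corr.columns)):
            value = corr.iloc[i, j]
            # Emphasize off-diagonal cells whose absolute correlation exceeds the threshold
            weight = "bold" if (i != j and abs(value) > highlight_threshold) else "normal"
            ax.text(j, i, f"{value:.2f}", ha="center", va="center", fontweight=weight)
    plt.tight_layout()
    plt.show()
    return corr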

The primary advantages of this test include its ability to efficiently detect and quantify linear relationships between variables, which is essential for identifying redundant features that could be removed to simplify the model without significant loss of information. The heat map visualization offers an intuitive and accessible overview of the correlation structure, making it easier for both technical and non-technical stakeholders to interpret the results. This approach supports informed decisions about feature selection and engineering, potentially leading to more robust and generalizable models. Additionally, by highlighting highly correlated variables, the test helps mitigate risks associated with multicollinearity, which can adversely affect model stability and interpretability.

It should be noted that the test is limited to detecting linear relationships and may not capture more complex, non-linear dependencies between variables. As a result, some important associations could be overlooked if they do not manifest as linear correlations. The test also does not measure the causal influence or predictive power of one variable over another, focusing solely on the degree of co-movement. The threshold of 0.7 for high correlation is somewhat arbitrary and may not be appropriate for all datasets or modeling contexts. Furthermore, a large number of highly correlated variables can indicate redundancy and increase the risk of overfitting, which may compromise the model’s generalizability. Interpretation challenges may arise if users rely solely on the heat map without considering the broader context of the data and modeling objectives.

This test shows the results in the form of two heat maps, one for the training dataset and one for the test dataset. Each heat map displays a matrix where both the rows and columns represent the numerical variables under consideration. The color of each cell indicates the Pearson correlation coefficient between the corresponding pair of variables, with the color bar on the right providing a reference scale from -1 (strong negative correlation, shown in red) to 1 (strong positive correlation, shown in blue), and values near zero represented by white or light shades. The diagonal cells, which compare each variable with itself, always show a value of 1. The off-diagonal cells reveal the pairwise correlations, with specific values annotated within each cell for precise interpretation. Notably, no cells carry the high-correlation highlight, indicating that none of the variable pairs exceed the 0.7 absolute correlation threshold. The range of observed correlations is generally between -0.38 and 0.43, with most values clustering near zero, suggesting weak or negligible linear relationships among most variable pairs. The most prominent correlations are observed between "Balance" and "Geography_Germany" (0.43 in train, 0.42 in test), and between "Geography_Spain" and "Geography_Germany" (-0.36 in train, -0.38 in test), reflecting expected relationships due to the encoding of categorical variables. The heat maps are consistent across both datasets, indicating stability in the correlation structure.

The test results reveal the following key insights:

  • Correlation Structure Is Stable Across Datasets: The heat maps for both the training and test datasets display highly similar patterns, with correlation coefficients for corresponding variable pairs remaining consistent in magnitude and direction.
  • No Variable Pairs Exceed High Correlation Threshold: All observed correlation coefficients fall below the 0.7 absolute value threshold, indicating that there are no pairs of variables with a high degree of linear dependency that would suggest redundancy.
  • Most Variable Pairs Exhibit Weak or Negligible Correlation: The majority of correlation coefficients are close to zero, with values typically ranging from -0.19 to 0.20 for most pairs, suggesting that the variables are largely independent in their linear relationships.
  • Notable Moderate Correlations Among Encoded Categorical Variables: The strongest correlations are observed between "Balance" and "Geography_Germany" (0.43 in train, 0.42 in test), and between "Geography_Spain" and "Geography_Germany" (-0.36 in train, -0.38 in test), which are likely due to the one-hot encoding of categorical features rather than substantive relationships.
  • No Evidence of Multicollinearity Risk: The absence of high correlation values across all variable pairs suggests that the dataset does not exhibit multicollinearity, reducing the risk of instability in downstream modeling.

Based on these results, the dataset demonstrates a stable and consistent correlation structure between the training and test splits, with no evidence of high linear dependency among the numerical variables. The observed correlations are generally weak, with only a few moderate relationships present, primarily among variables derived from categorical encodings. This pattern indicates that the variables are largely independent in their linear associations, minimizing the risk of redundancy and multicollinearity in subsequent modeling. The lack of strong correlations supports the suitability of the dataset for use in predictive modeling, as it suggests that each variable is likely to contribute unique information. The consistency of the correlation structure across both datasets further reinforces the reliability of these observations, indicating that the relationships among variables are stable and not subject to significant variation between data splits. Overall, the test results provide a clear and comprehensive overview of the linear relationships within the dataset, supporting informed decisions regarding feature selection and model development.

Figures

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:development_data:e2d3
ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:development_data:5a65
2026-01-10 02:24:16,575 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.PearsonCorrelationMatrix:development_data does not exist in model's document
validmind.data_validation.HighPearsonCorrelation:development_data

❌ High Pearson Correlation Development Data

High Pearson Correlation: development data is designed to identify highly correlated feature pairs within a dataset, with the primary purpose of detecting potential feature redundancy or multicollinearity. This process is essential for ensuring that the features used in a machine learning model do not exhibit strong linear relationships that could compromise model interpretability or performance.

The test operates by calculating the Pearson correlation coefficient for every possible pair of features in the dataset. The Pearson correlation coefficient is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables, producing values that range from -1 to 1. A value of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. The test systematically computes these coefficients for all feature pairs, removes self-correlations and duplicate pairs, and then sorts the results by the absolute value of the coefficient. A pre-defined threshold, set at 0.3 in this case, is used to determine whether a pair is considered highly correlated. Pairs exceeding this threshold are flagged, and the test outputs the top ten strongest correlations, regardless of whether they pass or fail the threshold. Each result includes the dataset name, the feature pair, the correlation coefficient, and a pass/fail status based on the threshold.
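A minimal pandas sketch of this ranking logic is shown below; it mirrors the described steps (drop self-correlations and duplicate pairs, sort by absolute value, flag pairs above the threshold) but is an illustration rather than the library's own code, and the `high_pearson_correlations` helper name is an assumption.

import numpy as np
import pandas as pd

def high_pearson_correlations(df: pd.DataFrame, max_threshold: float = 0.3, top_n: int = 10) -> pd.DataFrame:
    # Pairwise Pearson correlations with self-correlations and duplicate pairs removed
    corr = df.select_dtypes(include="number").corr(method="pearson")
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep the upper triangle only
    pairs = upper.stack().reset_index()
    pairs.columns = ["Feature 1", "Feature 2", "Coefficient"]
    pairs["Pass/Fail"] = np.where(pairs["Coefficient"].abs() > max_threshold, "Fail", "Pass")
    order = pairs["Coefficient"].abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(top_n)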

The primary advantages of this test include its ability to quickly and transparently highlight linear dependencies between features, which is particularly useful in the early stages of model development and risk assessment. By surfacing the most strongly correlated feature pairs, the test enables practitioners to proactively address multicollinearity, which can otherwise lead to unstable model coefficients, inflated variance, and reduced interpretability. The clear tabular output, which lists the most significant correlations along with their magnitudes and pass/fail status, supports efficient review and documentation. This approach is especially valuable in regulated environments where model transparency and explainability are critical, as it provides a straightforward mechanism for identifying and documenting potential sources of redundancy.

It should be noted that the test is limited to detecting linear relationships and does not capture nonlinear dependencies, which may also be relevant in some modeling contexts. The Pearson correlation coefficient is sensitive to outliers, meaning that a small number of extreme values can disproportionately influence the results. Additionally, the test only examines pairwise relationships and may not detect more complex interactions involving three or more features. High correlation coefficients, particularly those exceeding the threshold, are indicative of potential multicollinearity, which can undermine the reliability of model estimates and obscure the true predictive value of individual features. Care must be taken in interpreting these results, as the presence of high correlations does not necessarily imply causation or guarantee adverse model behavior, but it does warrant further investigation.

This test shows the results in tabular format, with each row representing a unique feature pair from either the training or test dataset. The columns include the dataset name, the feature pair, the Pearson correlation coefficient (rounded to four decimal places), and a pass/fail status based on whether the absolute value of the coefficient exceeds the 0.3 threshold. The coefficients range from approximately -0.38 to 0.43, with both positive and negative values indicating the direction of the linear relationship. Notably, the strongest correlations are observed between "Balance" and "Geography_Germany" (0.4273 in training, 0.4177 in test) and between "Geography_Germany" and "Geography_Spain" (-0.3625 in training, -0.383 in test), both of which exceed the threshold and are marked as "Fail." The remaining pairs exhibit lower correlations, with coefficients generally below 0.21 in magnitude and marked as "Pass." The table allows for direct comparison between the training and test datasets, revealing consistency in the most highly correlated pairs. The pass/fail status provides a clear indication of which pairs may require further scrutiny due to potential multicollinearity.

The test results reveal the following key insights:

  • Strongest Correlations Consistent Across Datasets: The feature pairs "Balance" with "Geography_Germany" and "Geography_Germany" with "Geography_Spain" exhibit the highest absolute correlation coefficients in both the training (0.4273 and -0.3625) and test (0.4177 and -0.383) datasets, consistently exceeding the 0.3 threshold and marked as "Fail."
  • Majority of Feature Pairs Below Threshold: Most feature pairs, including "IsActiveMember" with "Exited" and "Balance" with "NumOfProducts," have correlation coefficients well below the 0.3 threshold, with values ranging from approximately -0.21 to 0.20, and are marked as "Pass."
  • Directionality of Correlations Varies: The observed correlations include both positive and negative values, indicating that some feature pairs move together while others move in opposite directions, such as the negative correlation between "Geography_Germany" and "Geography_Spain."
  • Stability Between Training and Test Sets: The correlation structure is stable across the training and test datasets, with the same feature pairs appearing among the top correlations and similar coefficient magnitudes, suggesting consistent relationships in the data.
  • No Extreme Multicollinearity Detected Beyond Top Pairs: Apart from the two pairs exceeding the threshold, all other feature pairs show moderate to low correlations, indicating that widespread multicollinearity is not present in the dataset.

Based on these results, the dataset exhibits a generally low level of linear dependency among most feature pairs, with only two pairs in both the training and test datasets surpassing the 0.3 correlation threshold. The consistency of these results across both datasets suggests that the observed relationships are stable and not an artifact of data partitioning. The presence of moderate positive and negative correlations among the remaining pairs indicates a diverse set of feature interactions without pervasive redundancy. The clear identification of the most highly correlated pairs provides transparency into the dataset's structure and supports further analysis of potential multicollinearity. Overall, the results demonstrate that, aside from the two identified pairs, the features are largely independent in a linear sense, supporting the interpretability and reliability of subsequent modeling efforts.

Parameters:

{
  "max_threshold": 0.3,
  "top_n_correlations": 10
}

Tables

| dataset | Columns | Coefficient | Pass/Fail |
|---|---|---|---|
| train_dataset_final | (Balance, Geography_Germany) | 0.4273 | Fail |
| train_dataset_final | (Geography_Germany, Geography_Spain) | -0.3625 | Fail |
| train_dataset_final | (IsActiveMember, Exited) | -0.1986 | Pass |
| train_dataset_final | (Geography_Germany, Exited) | 0.1822 | Pass |
| train_dataset_final | (Balance, NumOfProducts) | -0.1673 | Pass |
| train_dataset_final | (Balance, Exited) | 0.1451 | Pass |
| train_dataset_final | (Balance, Geography_Spain) | -0.1343 | Pass |
| train_dataset_final | (Gender_Male, Exited) | -0.1314 | Pass |
| train_dataset_final | (NumOfProducts, IsActiveMember) | 0.0531 | Pass |
| train_dataset_final | (NumOfProducts, Exited) | -0.0516 | Pass |
| test_dataset_final | (Balance, Geography_Germany) | 0.4177 | Fail |
| test_dataset_final | (Geography_Germany, Geography_Spain) | -0.3830 | Fail |
| test_dataset_final | (IsActiveMember, Exited) | -0.2072 | Pass |
| test_dataset_final | (Geography_Germany, Exited) | 0.2026 | Pass |
| test_dataset_final | (Balance, NumOfProducts) | -0.1863 | Pass |
| test_dataset_final | (Balance, Geography_Spain) | -0.1468 | Pass |
| test_dataset_final | (Balance, Exited) | 0.1409 | Pass |
| test_dataset_final | (HasCrCard, Gender_Male) | 0.1167 | Pass |
| test_dataset_final | (Tenure, Geography_Germany) | 0.1021 | Pass |
| test_dataset_final | (Geography_Spain, Exited) | -0.0956 | Pass |
2026-01-10 02:24:38,734 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:development_data does not exist in model's document
validmind.model_validation.ModelMetadata

Model Metadata

Model Metadata is designed to compare the metadata of different models and generate a summary table with the results. The primary purpose of this test is to provide a clear, standardized comparison of essential model metadata, including architecture, framework, framework version, and programming language, to support model documentation and management processes.

The test operates by retrieving metadata for each model using a dedicated function that extracts key information such as the modeling technique, framework, framework version, and programming language. This information is then standardized by renaming columns according to a predefined set of labels, ensuring consistency across models. The standardized metadata is compiled into a summary table, which allows for direct comparison of the models’ technical characteristics. The test does not compute statistical or quantitative metrics but instead focuses on categorical and versioning information. The summary table format enables users to quickly identify similarities and differences in model construction, which is particularly important for integration, deployment, and ongoing model governance. The typical values in the table are categorical (e.g., framework names, programming languages) or version numbers, and the interpretation centers on consistency and compatibility rather than performance.
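As a simple illustration of how such a summary table can be assembled, the sketch below compiles per-model metadata records into a single DataFrame; the records are written out by hand here using the values reported further down, whereas the actual test obtains them from a metadata-extraction helper.

import pandas as pd

# Hand-written metadata records standing in for what a metadata-extraction helper would return
models_metadata = [
    {"Model": "log_model_champion", "Modeling Technique": "SKlearnModel",
     "Modeling Framework": "sklearn", "Framework Version": "1.8.0", "Programming Language": "Python"},
    {"Model": "rf_model", "Modeling Technique": "SKlearnModel",
     "Modeling Framework": "sklearn", "Framework Version": "1.8.0", "Programming Language": "Python"},
]

summary_table = pd.DataFrame(models_metadata)  # one row per model, one column per metadata attribute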

The primary advantages of this test include its ability to provide a clear and concise overview of the technical landscape of multiple models. By standardizing metadata labels and presenting them in a unified table, the test facilitates rapid identification of potential compatibility or integration challenges, such as mismatched framework versions or differing programming languages. This is especially useful in environments where multiple models are developed and maintained by different teams or over extended periods. The test also supports regulatory and audit requirements by ensuring that essential model metadata is consistently documented and easily accessible. Its focus on high-level metadata makes it particularly effective for initial model inventory, portfolio reviews, and technical due diligence.

It should be noted that the test is limited by its reliance on the completeness and accuracy of the metadata provided by each model. If the underlying function used to extract metadata does not return all necessary fields, or if the metadata is outdated or inconsistent, the resulting summary table may not fully reflect the true technical characteristics of the models. The test does not capture detailed parameter settings or hyperparameters, focusing instead on broader architectural and environmental attributes. Additionally, the test does not assess the functional performance or predictive quality of the models. High risk is indicated by inconsistent or missing metadata, as well as significant differences in framework versions or programming languages, which could complicate model integration, deployment, or maintenance. Interpretation challenges may arise if the metadata standards are not uniformly enforced across all models.

This test shows a summary table that presents the metadata for two models: "log_model_champion" and "rf_model." The table includes columns for the model name, modeling technique, modeling framework, framework version, and programming language. Each row corresponds to a specific model, allowing for direct comparison across these key attributes. The modeling technique for both models is listed as "SKlearnModel," indicating a shared approach to model construction. The modeling framework for both is "sklearn," with a framework version of "1.8.0," and both are implemented in the "Python" programming language. The table is straightforward to read: each column header identifies a specific metadata attribute, and each cell contains the corresponding value for that model. There are no missing values or inconsistencies in the presented data. The range of values is categorical, with version numbers following standard software versioning conventions. Notably, both models share identical metadata across all reported fields, suggesting a high degree of technical alignment and compatibility. No anomalies, outliers, or discrepancies are observed in the table, and the uniformity of the metadata supports straightforward integration and management.

The test results reveal the following key insights:

  • Complete Metadata Consistency Across Models: Both "log_model_champion" and "rf_model" display identical values for all metadata fields, including modeling technique, framework, framework version, and programming language.
  • Uniform Use of SKlearnModel Technique: The modeling technique for both models is "SKlearnModel," indicating a consistent approach to model development within the portfolio.
  • Identical Framework and Versioning: Both models utilize the "sklearn" framework at version "1.8.0," eliminating potential compatibility issues related to framework updates or deprecations.
  • Standardized Programming Language: Both models are implemented in "Python," ensuring that language-specific dependencies and integration requirements are uniform across the models.
  • Absence of Missing or Inconsistent Metadata: All required metadata fields are present and populated for both models, with no discrepancies or omissions observed.

Based on these results, the metadata comparison demonstrates a high degree of technical uniformity between the "log_model_champion" and "rf_model." The models share the same modeling technique, framework, framework version, and programming language, which supports seamless integration, deployment, and maintenance within a shared technical environment. The absence of missing or inconsistent metadata indicates robust model documentation practices and effective metadata management. The uniformity across all reported fields suggests that the models are likely to be compatible from a technical perspective, reducing the risk of integration challenges or version conflicts. This level of consistency is advantageous for model governance, auditability, and operational efficiency, as it simplifies the management of dependencies and technical support requirements. The results provide a clear and objective snapshot of the current model portfolio’s technical landscape, supporting informed decision-making for future model development and deployment activities.

Tables

model | Modeling Technique | Modeling Framework | Framework Version | Programming Language
log_model_champion | SKlearnModel | sklearn | 1.8.0 | Python
rf_model | SKlearnModel | sklearn | 1.8.0 | Python
2026-01-10 02:25:06,743 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.ModelMetadata does not exist in model's document
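As a quick sanity check on the metadata reported above, the same high-level attributes can be pulled straight from the fitted estimators. This is a sketch only — log_model and rf_model are placeholder names for the champion and challenger estimators used earlier in this series.

# Collect high-level metadata for each estimator into a comparison table
# `log_model` and `rf_model` are placeholders for the fitted scikit-learn
# estimators behind "log_model_champion" and "rf_model"
import pandas as pd
import sklearn

def describe(name, estimator):
    return {
        "model": name,
        "Modeling Framework": type(estimator).__module__.split(".")[0],
        "Framework Version": sklearn.__version__,
        "Programming Language": "Python",
    }

pd.DataFrame([describe("log_model_champion", log_model), describe("rf_model", rf_model)])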
validmind.model_validation.sklearn.ModelParameters

Model Parameters

Model Parameters is designed to provide a transparent and structured overview of all configuration parameters that define a machine learning model’s behavior, supporting both transparency and reproducibility. The primary purpose of this test is to ensure that all relevant model parameters are explicitly documented, enabling effective auditing, validation, and the ability to reproduce model training and results.

The test operates by systematically extracting model parameters using the standardized get_params() method, which is a convention followed by scikit-learn and compatible estimators such as XGBoost and RandomForest. This method retrieves all parameters that were set during model instantiation, as well as any default values that remain unchanged. The extracted parameters are then organized into a structured DataFrame, with each row representing a parameter and its corresponding value for a specific model. This approach ensures that every aspect of the model’s configuration is captured, including hyperparameters that control regularization, optimization, tree structure, and other model-specific behaviors. The resulting table provides a comprehensive snapshot of the model’s setup, which is essential for both internal and external review processes. The values presented are typically categorical (such as solver type or penalty) or numerical (such as regularization strength or number of estimators), and their interpretation depends on the context of the model and the parameter’s intended effect. The test does not evaluate the appropriateness of parameter values but ensures their visibility for further analysis.

The primary advantages of this test include its universal applicability to any model that adheres to the scikit-learn API, making it a versatile tool for model documentation across a wide range of machine learning algorithms. By providing a complete and explicit record of all model parameters, the test greatly enhances transparency, which is critical for regulatory compliance, model risk management, and reproducibility. This systematic approach also facilitates version control, as changes in parameter settings can be easily tracked over time. Furthermore, the structured output supports efficient auditing and review, enabling stakeholders to quickly identify whether critical parameters have been set appropriately or if defaults have been used where tuning may be necessary. The test’s compatibility with both classification and regression models further broadens its utility, making it a foundational component of robust model governance frameworks.

It should be noted that the test is limited to models that implement the get_params() method, which means it cannot be applied to custom models or those outside the scikit-learn ecosystem. Additionally, the test only captures static parameters set prior to or during model instantiation and does not account for dynamic parameters that may be adjusted during training, such as those influenced by early stopping or adaptive learning rates. The test does not assess the suitability or impact of parameter values, nor does it detect complex interactions between parameters that could affect model performance or stability. Interpretation challenges may arise when parameter meanings differ across model types, and the test cannot identify indirect effects or risks associated with certain parameter combinations. High-risk scenarios include missing critical parameters, extreme or default values for key settings, and inconsistencies across model versions, all of which require further expert review beyond the scope of this test.

This test shows the extracted model parameters for two models, presented in a tabular format. Each row of the table corresponds to a specific parameter for either the logistic regression model (log_model_champion) or the random forest model (rf_model), with columns indicating the model name, parameter name, and parameter value. For the logistic regression model, parameters such as regularization strength (C), penalty type (l1), solver (liblinear), and maximum iterations (max_iter) are displayed, with values like C set to 1 and penalty set to l1, indicating the use of L1 regularization. For the random forest model, parameters include the number of estimators (n_estimators), criterion for split quality (gini), maximum features considered at each split (sqrt), and random state (42), among others. The values are a mix of booleans (e.g., bootstrap: True), numerics (e.g., n_estimators: 50), and categorical strings (e.g., criterion: gini). The table allows for straightforward comparison of parameter settings across models and highlights which parameters have been explicitly set versus those left at default values. Notable observations include the use of L1 regularization in the logistic regression model and a relatively small number of estimators (50) in the random forest model, as well as the explicit setting of random_state for reproducibility. The scale and range of parameter values are consistent with typical configurations for these model types, and no extreme or missing values are immediately apparent from the table.

The test results reveal the following key insights:

  • Comprehensive Parameter Coverage Across Models: Both the logistic regression and random forest models have all relevant parameters extracted and displayed, covering regularization, optimization, and tree construction settings.
  • Explicit Regularization and Solver Choices in Logistic Regression: The logistic regression model uses L1 regularization (penalty: l1) with the liblinear solver, and regularization strength (C) is set to 1, indicating a balanced approach to penalization.
  • Random Forest Configured for Reproducibility and Simplicity: The random forest model specifies a random_state of 42 for reproducibility, uses 50 estimators, and applies the gini criterion for split quality, with bootstrap sampling enabled.
  • Default Values Retained for Non-Critical Parameters: Several parameters, such as min_samples_leaf (1), min_samples_split (2), and ccp_alpha (0.0), remain at their default values, suggesting standard model configurations without aggressive tuning.
  • No Extreme or Missing Parameter Values Detected: All parameter values fall within expected ranges for their respective model types, and no critical parameters appear to be omitted or set to extreme values that could indicate overfitting or instability.

Based on these results, the parameter extraction provides a clear and detailed snapshot of the configuration for both the logistic regression and random forest models. The logistic regression model is configured with L1 regularization and the liblinear solver, which is suitable for smaller datasets and supports sparse solutions, while the regularization strength is set to a moderate value. The random forest model is set up with a standard number of estimators and default settings for most tree construction parameters, with explicit control over randomness to ensure reproducibility. The use of default values for several parameters suggests a reliance on standard model behavior rather than extensive hyperparameter tuning. The absence of extreme or missing values indicates that both models are configured within typical operational ranges, reducing the likelihood of inadvertent overfitting or instability due to parameter choices. The structured presentation of parameters enables straightforward auditing and comparison, supporting transparency and facilitating further review if needed. Overall, the results demonstrate that both models are configured in a manner consistent with standard practices, with explicit documentation of all key parameters and no immediate indications of high-risk configurations.

Tables

model | Parameter | Value
log_model_champion | C | 1
log_model_champion | dual | False
log_model_champion | fit_intercept | True
log_model_champion | intercept_scaling | 1
log_model_champion | max_iter | 100
log_model_champion | penalty | l1
log_model_champion | solver | liblinear
log_model_champion | tol | 0.0001
log_model_champion | verbose | 0
log_model_champion | warm_start | False
rf_model | bootstrap | True
rf_model | ccp_alpha | 0.0
rf_model | criterion | gini
rf_model | max_features | sqrt
rf_model | min_impurity_decrease | 0.0
rf_model | min_samples_leaf | 1
rf_model | min_samples_split | 2
rf_model | min_weight_fraction_leaf | 0.0
rf_model | n_estimators | 50
rf_model | oob_score | False
rf_model | random_state | 42
rf_model | verbose | 0
rf_model | warm_start | False
2026-01-10 02:25:33,663 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ModelParameters does not exist in model's document
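The table above is essentially a flattened view of each estimator's get_params() output. A minimal sketch of that extraction follows, again using log_model and rf_model as placeholder names for the underlying estimators.

# Flatten get_params() for each estimator into a model/Parameter/Value table
# `log_model` and `rf_model` are placeholders for the fitted estimators
import pandas as pd

rows = [
    {"model": name, "Parameter": param, "Value": value}
    for name, estimator in [("log_model_champion", log_model), ("rf_model", rf_model)]
    for param, value in estimator.get_params().items()
]
pd.DataFrame(rows)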
validmind.model_validation.sklearn.ROCCurve

ROC Curve

ROC Curve is designed to evaluate the performance of binary classification models by illustrating the trade-off between the true positive rate and the false positive rate across a range of classification thresholds. Its primary purpose is to measure the model’s ability to distinguish between two classes, providing a visual and quantitative assessment of discriminative power through the ROC curve and the associated Area Under the Curve (AUC) score.

The test operates by first generating predicted probabilities for each observation in the dataset using the binary classification model under evaluation. These probabilities, along with the true class labels, are used to compute the true positive rate (the proportion of actual positives correctly identified) and the false positive rate (the proportion of actual negatives incorrectly identified as positives) at various threshold levels. The ROC curve is then plotted with the false positive rate on the x-axis and the true positive rate on the y-axis, providing a comprehensive view of model performance across all possible thresholds. The AUC score, which ranges from 0 to 1, is calculated as the area under the ROC curve; a value of 0.5 indicates no discriminative ability (equivalent to random guessing), while a value closer to 1.0 indicates strong discrimination between classes. The test also includes a reference line representing random performance for comparison. Infinite values in the threshold set are removed to ensure the integrity of the curve and the AUC calculation. The resulting ROC plots and AUC scores are saved for documentation and further analysis.

The primary advantages of this test include its ability to provide a holistic and threshold-independent evaluation of model discrimination, making it particularly valuable in scenarios where the optimal classification threshold is not predetermined or may vary depending on operational requirements. The ROC curve visually summarizes the model’s performance across all thresholds, allowing stakeholders to assess the trade-offs between sensitivity and specificity. The AUC score condenses this information into a single, interpretable metric that remains robust even when class distributions are imbalanced, ensuring that the evaluation is not unduly influenced by the prevalence of one class over another. This makes the ROC-AUC approach especially useful for comparing models or monitoring performance over time in dynamic environments.

It should be noted that the ROC Curve test is limited to binary classification tasks and does not extend to multi-class or regression models. Additionally, the test may be less informative when the model outputs probabilities that are highly concentrated near 0 or 1, as this can distort the shape of the ROC curve. In cases of severe class imbalance, the ROC curve may still appear favorable even if the model performs poorly on the minority class, since the metric focuses on ranking rather than absolute accuracy. AUC values near or below 0.5 are indicative of a model with little or no discriminative power, and a ROC curve that closely follows the diagonal line of randomness signals that the model is not effectively distinguishing between classes. Interpretation should therefore consider both the AUC score and the visual characteristics of the ROC curve, especially in the context of the underlying data distribution and business requirements.

This test shows the results in the form of two ROC curve plots, one for the training dataset and one for the test dataset, each displaying the relationship between the false positive rate (x-axis) and the true positive rate (y-axis) for the model’s predicted probabilities. Both plots include a solid line representing the model’s ROC curve and a dashed diagonal line indicating random performance (AUC = 0.5). The AUC score is annotated in the legend for each plot, with both the training and test datasets achieving an AUC of 0.67. The axes range from 0 to 1, allowing for direct comparison of model performance against the random baseline. The ROC curves for both datasets consistently lie above the diagonal, indicating that the model performs better than random guessing. The curves are relatively smooth, suggesting stable probability estimates across thresholds. There are no abrupt changes or regions where the curve dips below the random line, which would indicate problematic model behavior. The similarity in AUC scores and curve shapes between the training and test datasets suggests that the model’s discriminative ability generalizes reasonably well and is not confined to the training data alone.

The test results reveal the following key insights:

  • Consistent Model Discrimination Across Datasets: Both the training and test datasets yield an identical AUC score of 0.67, indicating that the model maintains a stable level of discriminative power when applied to unseen data.
  • Performance Exceeds Random Baseline: The ROC curves for both datasets consistently lie above the diagonal line representing random classification, confirming that the model is able to distinguish between the two classes better than chance.
  • Moderate Discriminative Ability: An AUC of 0.67 suggests that the model has moderate ability to separate positive and negative classes, with a meaningful but not strong distinction between them.
  • Smooth and Stable ROC Curves: The absence of sharp inflections or irregularities in the ROC curves indicates that the model’s probability outputs are well-calibrated and do not exhibit erratic behavior across thresholds.
  • No Evidence of Overfitting: The close alignment of the ROC curves and AUC scores between the training and test datasets suggests that the model’s performance is not artificially inflated on the training data and generalizes appropriately to new data.

Based on these results, the model demonstrates a moderate and consistent ability to discriminate between the two classes in both the training and test datasets, as evidenced by identical AUC scores of 0.67 and similarly shaped ROC curves. The model’s performance is clearly superior to random guessing, with the ROC curves remaining above the diagonal reference line throughout the range of false positive rates. The stability of the curves and the absence of overfitting indicate that the model’s probability estimates are reliable and generalize well to unseen data. However, the AUC value of 0.67, while above the threshold for random performance, suggests that there is room for improvement in the model’s discriminative power. The results provide a clear and objective assessment of the model’s current capabilities, highlighting both its strengths in generalization and its limitations in achieving higher levels of class separation.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:bda0
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:4b81
2026-01-10 02:25:58,885 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve does not exist in model's document
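To cross-check the reported AUC of 0.67 outside of ValidMind, the curve can be reproduced directly with scikit-learn. The snippet below is a sketch that assumes champion_model is the fitted estimator and x_test/y_test are the final test features and labels (placeholder names).

# Cross-check the ROC curve and AUC with scikit-learn
# `champion_model`, `x_test`, and `y_test` are placeholders for the fitted
# estimator and the final test split used above
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_prob = champion_model.predict_proba(x_test)[:, 1]  # positive-class probabilities
fpr, tpr, _ = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random (AUC = 0.50)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()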
validmind.model_validation.sklearn.MinimumROCAUCScore

✅ Minimum ROCAUC Score

Minimum ROC AUC Score is designed to validate the model’s ability to distinguish between classes by ensuring that the Receiver Operating Characteristic Area Under the Curve (ROC AUC) score meets or exceeds a specified minimum threshold. This test is essential for assessing the model’s discriminatory power in both binary and multiclass classification tasks, providing a quantitative measure of how well the model separates different classes based on its predictions.

The test operates by calculating the multiclass ROC AUC score using the true target values and the model’s predicted probabilities on the provided datasets. To accommodate multiclass scenarios, the test first transforms the categorical target variables into a binary format using a label binarization process. It then computes the ROC AUC score, which quantifies the model’s ability to rank positive instances higher than negative ones across all possible classification thresholds. The ROC AUC score ranges from 0 to 1, where a value of 0.5 indicates random performance and a value closer to 1.0 reflects near-perfect discrimination. The test compares the calculated ROC AUC score against a predefined threshold (in this case, 0.5). If the score meets or exceeds this threshold, the test is marked as passed; otherwise, it is marked as failed. The results, including the score, threshold, and pass/fail status, are systematically recorded for each dataset evaluated.

The primary advantages of this test include its ability to provide a comprehensive and threshold-independent assessment of model performance, as the ROC AUC score evaluates the model’s quality across all possible classification thresholds. This makes it particularly robust for both binary and multiclass problems, as it does not rely on a single decision boundary. Additionally, the test’s use of macro-averaging ensures that each class is given equal consideration, which is valuable when class importance is balanced. The ROC AUC metric is widely recognized and interpretable, making it a standard choice for evaluating classification models in regulated environments where transparency and comparability are important.

It should be noted that the test has certain limitations, particularly in scenarios with highly imbalanced class distributions. In such cases, the ROC AUC score may remain high even if the model performs poorly on minority classes, potentially masking deficiencies in class-specific performance. The macro-averaging approach, while ensuring equal weight for each class, may not reflect the true impact of misclassifications in imbalanced datasets. Furthermore, the test does not provide diagnostic information about the sources of poor performance if the ROC AUC score is unsatisfactory, nor does it offer guidance on how to address such issues. A low ROC AUC score, especially one below the minimum threshold, signals that the model is not effectively distinguishing between classes, which could indicate a high risk in operational deployment.

This test shows the results in a tabular format, presenting the ROC AUC scores for both the training and test datasets alongside the minimum threshold and the corresponding pass/fail status. The table includes columns for the dataset name, the calculated ROC AUC score (rounded to four decimal places), the threshold used for evaluation, and whether the model passed or failed the test. The scores for the training and test datasets are 0.6738 and 0.6696, respectively, both of which exceed the minimum threshold of 0.5. The pass/fail column indicates that the model passes the test on both datasets. The values are presented on a scale from 0 to 1, with higher values indicating better discriminatory performance. The close alignment of the scores between the training and test datasets suggests consistent model behavior across different data splits, and the absence of extreme values or large discrepancies indicates stability in the model’s classification ability.

The test results reveal the following key insights:

  • Model Consistently Exceeds Minimum Threshold: Both the training and test datasets achieve ROC AUC scores well above the minimum threshold of 0.5, with values of 0.6738 and 0.6696, respectively, indicating reliable discriminatory power.
  • Stable Performance Across Datasets: The similarity between the training and test ROC AUC scores demonstrates that the model maintains consistent performance and does not exhibit significant overfitting or underfitting.
  • Pass Status on All Evaluated Splits: The model passes the minimum ROC AUC score requirement on both the training and test datasets, confirming that it meets the predefined standard for classification quality in all evaluated scenarios.
  • Moderate Discriminatory Power Observed: While the scores are above the threshold, they are not close to 1.0, suggesting that the model’s ability to distinguish between classes is moderate rather than exceptional.

Based on these results, the model demonstrates a stable and consistent ability to distinguish between classes on both the training and test datasets, as evidenced by ROC AUC scores of 0.6738 and 0.6696, both comfortably above the minimum threshold of 0.5. The close alignment of these scores across datasets indicates that the model’s performance generalizes well and is not limited to the training data, reducing the likelihood of overfitting. The pass status on both splits confirms that the model meets the required standard for discriminatory power as defined by the test parameters. However, the moderate level of the ROC AUC scores suggests that while the model is effective at class separation, there remains room for improvement in its classification capabilities. The results provide a clear and objective assessment of the model’s current performance, highlighting its reliability and stability in distinguishing between classes under the specified evaluation criteria.

Parameters:

{
  "min_threshold": 0.5
}

Tables

dataset | Score | Threshold | Pass/Fail
train_dataset_final | 0.6738 | 0.5 | Pass
test_dataset_final | 0.6696 | 0.5 | Pass
2026-01-10 02:26:18,901 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumROCAUCScore does not exist in model's document
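The pass/fail decision above reduces to comparing the computed AUC against min_threshold. A sketch using the same placeholder names as the ROC example:

# Reproduce the threshold check: pass if the AUC meets or exceeds min_threshold
from sklearn.metrics import roc_auc_score

min_threshold = 0.5
score = roc_auc_score(y_test, champion_model.predict_proba(x_test)[:, 1])
print(f"ROC AUC: {score:.4f} -> {'Pass' if score >= min_threshold else 'Fail'}")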

In summary

In this final notebook, you learned how to:

With our ValidMind for model validation series of notebooks, you learned how to validate a model end-to-end with the ValidMind Library by running through some common scenarios in a typical model validation setting:

  • Verifying the data quality steps performed by the model development team
  • Independently replicating the champion model's results and conducting additional tests to assess performance, stability, and robustness
  • Setting up test inputs and a challenger model for comparative analysis
  • Running validation tests, analyzing results, and logging artifacts to ValidMind

Next steps

Work with your validation report

Now that you've logged all your test results and verified the work done by the model development team, head to the ValidMind Platform to wrap up your validation report. Continue working on the report by:

  • Inserting additional test results: Click Link Evidence to Report under any section of 2. Validation in your validation report. (Learn more: Link evidence to reports)

  • Making qualitative edits to your test descriptions: Expand any linked evidence under Validator Evidence and click See evidence details to review and edit the ValidMind-generated test descriptions for quality and accuracy. (Learn more: Preparing validation reports)

  • Adding more findings: Click Link Finding to Report in any validation report section, then click + Create New Finding. (Learn more: Add and manage model findings)

  • Adding risk assessment notes: Click under Risk Assessment Notes in any validation report section to access the text editor and content editing toolbar, including an option to generate a draft with AI. Once generated, edit the draft so it adheres to your organization's requirements. (Learn more: Work with content blocks)

  • Assessing compliance: Under the Guideline for any validation report section, click ASSESSMENT and select the compliance status from the drop-down menu. (Learn more: Provide compliance assessments)

  • Collaborating with other stakeholders: Use the ValidMind Platform's real-time collaborative features to work seamlessly with the rest of your organization, including model developers. Propose changes to the model documentation, work with versioned history, and use comments to discuss specific portions of it. (Learn more: Collaborate with others)

When your validation report is complete and ready for review, submit it for approval from the same ValidMind Platform where you made your edits and collaborated with the rest of your organization, ensuring transparency and a thorough model validation history. (Learn more: Submit for approval)

Learn more

Now that you're familiar with the basics, you can explore the following notebooks to get a deeper understanding of how the ValidMind Library assists you in streamlining model validation:

More how-to guides and code samples

Discover more learning resources

All notebook samples can be found in the following directories of the ValidMind Library GitHub repository:

Or, visit our documentation to learn more about ValidMind.