ValidMind for model development 3 — Integrate custom tests

Learn how to use ValidMind for your end-to-end model documentation process with our series of four introductory notebooks. In this third notebook, supplement ValidMind tests with your own and include them as additional evidence in your documentation.

This notebook assumes that you already have a repository of custom-made tests that you consider critical to include in your documentation. A custom test is any function that takes a set of inputs and parameters as arguments and returns one or more outputs.
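To make that definition concrete, here is a minimal sketch of what such a function might look like. The test ID my_custom_tests.ClassImbalanceRatio, the function, and its max_ratio parameter are hypothetical examples used only to illustrate the pattern of inputs, parameters, and outputs:

import pandas as pd
import validmind as vm

# Hypothetical custom test: takes a dataset (input) and a threshold (parameter),
# and returns a table (output) flagging whether the target classes are roughly balanced
@vm.test("my_custom_tests.ClassImbalanceRatio")
def class_imbalance_ratio(dataset, max_ratio: float = 3.0):
    """Checks that the ratio between the majority and minority target classes stays below a threshold."""
    counts = dataset.df[dataset.target_column].value_counts()
    ratio = counts.max() / counts.min()
    return pd.DataFrame(
        {
            "Majority/Minority Ratio": [round(ratio, 4)],
            "Pass/Fail": ["Pass" if ratio <= max_ratio else "Fail"],
        }
    )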

For a more in-depth introduction to custom tests, refer to our Implement custom tests notebook.

Learn by doing

Our course tailor-made for developers new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform: Developer Fundamentals

Prerequisites

To integrate custom tests into your model documentation with this notebook, you'll first need to have:

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be quite familiar to you, as we performed the same actions in the previous notebook, 2 — Start the model development process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model development" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2026-01-10 01:59:00,548 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model development (ID: cmalgf3qi02ce199qm3rdkl46)
📁 Document Type: model_documentation

Import sample dataset

Next, we'll import the same public Bank Customer Churn Prediction dataset from Kaggle we used in the last notebook so that we have something to work with:

from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

We'll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)

Remove highly correlated features

Let's also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As you learned previously, before you can run tests, you'll need to initialize a ValidMind dataset object:

# Register the balanced data; 'balanced_raw_dataset' is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and use the output to identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

High Pearson Correlation is designed to identify pairs of features within a dataset that exhibit strong linear relationships, with the primary purpose of detecting potential feature redundancy or multicollinearity. This is crucial for ensuring that the predictive model remains interpretable and robust, as highly correlated features can obscure the true impact of individual variables and may lead to overfitting or instability in model estimates.

The test operates by calculating the Pearson correlation coefficient for every possible pair of features in the dataset. The Pearson correlation coefficient is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables, ranging from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear relationship. The test systematically computes these coefficients for all feature pairs, removes self-correlations and duplicate pairs, and then sorts the results by the absolute value of the coefficient. A pre-defined threshold, set at 0.3 in this case, is used to determine whether a pair is considered highly correlated. Any pair with an absolute coefficient exceeding this threshold is flagged as a potential risk for multicollinearity. The test outputs the top n strongest correlations, providing a clear view of the most significant relationships in the data.

The primary advantages of this test include its efficiency and transparency in surfacing linear dependencies between features, which is particularly valuable during the early stages of model development and risk assessment. By highlighting pairs of features with strong linear associations, the test enables practitioners to proactively address multicollinearity, which can otherwise compromise model interpretability and predictive stability. The clear tabular output, which lists feature pairs, their correlation coefficients, and pass/fail status relative to the threshold, supports straightforward communication of results to both technical and non-technical stakeholders. This makes the test especially useful for regulatory documentation and for guiding feature selection or engineering decisions in environments where model transparency and reliability are paramount.

It should be noted that the test is limited to detecting linear relationships and does not capture more complex, nonlinear dependencies that may exist among features. The Pearson correlation coefficient is also sensitive to outliers, which can distort the measured strength of relationships and potentially lead to misleading conclusions. Additionally, the test only evaluates pairwise relationships and may not identify higher-order interactions involving three or more features. A high correlation coefficient, particularly one exceeding the threshold, signals a potential risk of multicollinearity, which can undermine the interpretability and stability of model coefficients. However, the test does not provide direct guidance on how to address such risks, nor does it assess the impact of correlated features on model performance in practice.

This test shows its results in a tabular format, where each row represents a unique pair of features from the dataset. The columns include the feature pair, the calculated Pearson correlation coefficient, and a pass/fail indicator based on whether the absolute value of the coefficient exceeds the threshold of 0.3. The coefficients are presented as decimal values, typically ranging from -1 to 1, with positive values indicating direct relationships and negative values indicating inverse relationships. The table is sorted by the absolute value of the coefficient, so the strongest correlations appear at the top. In this particular output, the pair (Age, Exited) has the highest absolute correlation coefficient at 0.3417 and is the only pair marked as "Fail," indicating it exceeds the threshold. All other pairs have coefficients below the threshold, with values ranging from -0.1825 to -0.0364, and are marked as "Pass." The table provides a clear and immediate view of which feature pairs may warrant further investigation due to their linear association, and the pass/fail status simplifies the identification of pairs that may pose a risk for multicollinearity.

The test results reveal the following key insights:

  • Only Age and Exited Exceed Correlation Threshold: The feature pair (Age, Exited) has a Pearson correlation coefficient of 0.3417, surpassing the threshold of 0.3 and resulting in a "Fail" status, indicating a notable linear relationship between customer age and the likelihood of exit.
  • All Other Feature Pairs Remain Below Threshold: The remaining nine feature pairs exhibit absolute correlation coefficients ranging from 0.1825 to 0.0364, all of which are below the 0.3 threshold and are marked as "Pass," suggesting no other strong linear dependencies among these features.
  • Distribution of Correlation Coefficients Is Centered Near Zero: Most coefficients are relatively close to zero, indicating weak or negligible linear relationships between the majority of feature pairs, with both positive and negative associations present.
  • Negative and Positive Relationships Are Both Represented: The table includes both positive and negative coefficients, such as (IsActiveMember, Exited) at -0.1825 and (Balance, Exited) at 0.1496, reflecting a mix of direct and inverse linear associations across the dataset.
  • No Evidence of Widespread Multicollinearity: With only one pair exceeding the threshold and the rest showing low correlations, the dataset does not display pervasive multicollinearity among its features.

Based on these results, the dataset demonstrates a generally low level of linear association between most feature pairs, with only the (Age, Exited) pair exhibiting a correlation coefficient above the specified threshold of 0.3. This suggests that, aside from the relationship between age and exit status, the features are largely independent in terms of linear relationships, reducing the risk of multicollinearity affecting model interpretability or stability. The presence of both positive and negative coefficients further indicates a balanced distribution of relationships, with no single direction dominating the dataset. The clear separation between the one "Fail" and the remaining "Pass" pairs provides a straightforward view of where potential redundancy may exist, while the overall low magnitude of most coefficients supports the conclusion that the feature set is not heavily burdened by linear dependencies. This pattern is consistent with a dataset that is well-suited for modeling, with minimal risk of confounding effects from highly correlated predictors, except for the specific case of age and exit status, which may warrant closer examination in subsequent modeling steps.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3417 Fail
(IsActiveMember, Exited) -0.1825 Pass
(Balance, NumOfProducts) -0.1802 Pass
(Balance, Exited) 0.1496 Pass
(NumOfProducts, Exited) -0.0481 Pass
(HasCrCard, IsActiveMember) -0.0444 Pass
(NumOfProducts, IsActiveMember) 0.0414 Pass
(Age, HasCrCard) -0.0371 Pass
(Age, Balance) 0.0369 Pass
(CreditScore, Exited) -0.0364 Pass
# From the result object, extract the table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3417 Fail
1 (IsActiveMember, Exited) -0.1825 Pass
2 (Balance, NumOfProducts) -0.1802 Pass
3 (Balance, Exited) 0.1496 Pass
4 (NumOfProducts, Exited) -0.0481 Pass
5 (HasCrCard, IsActiveMember) -0.0444 Pass
6 (NumOfProducts, IsActiveMember) 0.0414 Pass
7 (Age, HasCrCard) -0.0371 Pass
8 (Age, Balance) 0.0369 Pass
9 (CreditScore, Exited) -0.0364 Pass
# Extract the list of feature pairs that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract the first feature name from each failing pair
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']

We can then re-initialize the dataset with a different input_id and with the highly correlated features removed, and re-run the test to confirm:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

High Pearson Correlation is designed to identify pairs of features within a dataset that exhibit strong linear relationships, with the primary purpose of detecting potential feature redundancy or multicollinearity. This is crucial for ensuring that the predictive model remains interpretable and robust, as high correlations between features can obscure the true impact of individual variables and may lead to overfitting or instability in model estimates.

The test operates by calculating the Pearson correlation coefficient for every possible pair of features in the dataset. The Pearson correlation coefficient is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables, ranging from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear relationship. The test systematically computes these coefficients for all feature pairs, removes self-correlations and duplicate pairs, and then compares the absolute value of each coefficient to a predefined threshold, which in this case is set at 0.3. Any pair with an absolute correlation exceeding this threshold is flagged as potentially problematic, while those below the threshold are considered to pass. The test then returns the top n strongest correlations, regardless of their pass or fail status, providing a transparent view of the most significant linear relationships present in the data.

The primary advantages of this test include its efficiency and clarity in highlighting linear dependencies between features, which is particularly valuable during the early stages of model development and risk assessment. By surfacing the most strongly correlated feature pairs, the test enables data scientists and risk managers to quickly identify and address potential sources of multicollinearity, which can otherwise compromise model interpretability and predictive stability. The transparent tabular output, which lists feature pairs alongside their correlation coefficients and pass/fail status, facilitates straightforward communication of results to both technical and non-technical stakeholders. This makes the test especially useful in regulated environments where model transparency and documentation are paramount.

It should be noted that the test is limited to detecting linear relationships and does not capture more complex, nonlinear dependencies that may exist between features. Additionally, the Pearson correlation coefficient is sensitive to outliers, which can distort the true strength of the relationship between variables. The test only examines pairwise relationships, potentially missing higher-order interactions involving three or more features. Furthermore, the presence of high correlation coefficients is a sign of potential risk, as it may indicate redundancy or multicollinearity, but the absence of high correlations does not guarantee that the dataset is free from all forms of dependency or redundancy. Interpretation of the results should therefore be contextualized within the broader modeling and business environment.

This test shows its results in a tabular format, where each row represents a unique pair of features from the dataset. The columns include the feature pair, the calculated Pearson correlation coefficient, and a pass/fail status based on whether the absolute value of the coefficient exceeds the threshold of 0.3. The coefficients are presented as decimal values, typically ranging from -1 to 1, with negative values indicating inverse relationships and positive values indicating direct relationships. In this particular output, all listed feature pairs have coefficients well below the threshold, with the strongest observed correlation being -0.1825 between "IsActiveMember" and "Exited." The table is sorted by the absolute value of the coefficient, allowing for quick identification of the most significant relationships. All pairs in the table are marked as "Pass," indicating that none of the observed correlations exceed the risk threshold. The range of coefficients spans from -0.1825 to -0.0295, suggesting generally weak linear relationships among the top pairs. Notably, the relationships include both positive and negative associations, but none approach the level that would typically raise concerns about redundancy or multicollinearity.

The test results reveal the following key insights:

  • No Feature Pairs Exceed Correlation Threshold: All feature pairs analyzed have absolute Pearson correlation coefficients below the threshold of 0.3, with the highest magnitude observed at -0.1825, indicating an absence of strong linear relationships among the top pairs.
  • Weak Negative and Positive Associations Present: The strongest correlations are negative, such as between "IsActiveMember" and "Exited" (-0.1825) and "Balance" and "NumOfProducts" (-0.1802), while the highest positive correlation is between "NumOfProducts" and "IsActiveMember" (0.0414), all of which are weak in magnitude.
  • Distribution of Correlation Coefficients is Narrow: The coefficients for the top ten pairs range from -0.1825 to -0.0295, reflecting a narrow distribution and suggesting that no feature pair dominates in terms of linear association.
  • Pass Status Uniform Across All Pairs: Every feature pair in the output is marked as "Pass," confirming that none of the relationships approach the threshold that would indicate potential multicollinearity or redundancy.
  • Variety of Feature Types Represented: The pairs include combinations of binary, categorical, and continuous features, demonstrating that the weak correlations are consistent across different types of variables in the dataset.

Based on these results, the dataset exhibits a generally low degree of linear association among its features, as evidenced by the uniformly weak Pearson correlation coefficients and the absence of any pairs exceeding the predefined threshold of 0.3. The narrow range of coefficients and the consistent pass status across all top pairs suggest that the risk of feature redundancy or multicollinearity is minimal within the scope of linear relationships. This pattern holds across various types of features, indicating that the dataset's structure does not favor strong linear dependencies between any particular pair of variables. As a result, the model built on this dataset is unlikely to encounter interpretability or stability challenges arising from linear multicollinearity among the examined features. The observed relationships, while present, are not of sufficient magnitude to materially impact the model's ability to distinguish the individual contributions of each feature. This supports the conclusion that, from a linear correlation perspective, the dataset is well-suited for modeling without immediate need for feature reduction or transformation to address multicollinearity.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.1825 Pass
(Balance, NumOfProducts) -0.1802 Pass
(Balance, Exited) 0.1496 Pass
(NumOfProducts, Exited) -0.0481 Pass
(HasCrCard, IsActiveMember) -0.0444 Pass
(NumOfProducts, IsActiveMember) 0.0414 Pass
(CreditScore, Exited) -0.0364 Pass
(CreditScore, EstimatedSalary) -0.0338 Pass
(CreditScore, IsActiveMember) 0.0299 Pass
(Tenure, IsActiveMember) -0.0295 Pass

Train the model

We'll then train a simple logistic regression model on our prepared dataset:

# First encode the categorical features in our dataset with the highly correlated features removed
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
1990 619 0 0.00 2 0 0 113645.40 0 False True False
5851 797 4 129321.44 1 1 1 93624.55 0 False True True
4545 641 5 102145.13 1 1 1 100637.07 0 False True True
4741 579 0 144386.32 1 1 1 22497.10 1 True False True
4618 684 2 116563.58 1 1 0 120257.70 1 False False True
# Split the processed dataset into train and test
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
from sklearn.linear_model import LogisticRegression

# Logistic Regression grid params
log_reg_params = {
    "penalty": ["l1", "l2"],
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "solver": ["liblinear"],
}

# Grid search for Logistic Regression
from sklearn.model_selection import GridSearchCV

grid_log_reg = GridSearchCV(LogisticRegression(), log_reg_params)
grid_log_reg.fit(X_train, y_train)

# Logistic Regression best estimator
log_reg = grid_log_reg.best_estimator_
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1135: FutureWarning:

'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', and C=np.inf instead of penalty=None.

/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1160: UserWarning:

Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.

Initialize the ValidMind objects

Let's initialize the ValidMind Dataset and Model objects in preparation for assigning model predictions to each dataset:

# Initialize the datasets into their own dataset objects
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

# Initialize a model object
vm_model = vm.init_model(log_reg, input_id="log_reg_model_v1")

Assign predictions

Once the model is registered, we'll assign predictions to the training and test datasets:

vm_train_ds.assign_predictions(model=vm_model)
vm_test_ds.assign_predictions(model=vm_model)
2026-01-10 01:59:44,454 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 01:59:44,456 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 01:59:44,457 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 01:59:44,459 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-10 01:59:44,461 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 01:59:44,462 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 01:59:44,462 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 01:59:44,463 - INFO(validmind.vm_models.dataset.utils): Done running predict()
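If you want to sanity-check that the predictions were attached, you can preview a few of them with the dataset's y_pred() method, the same accessor our custom test relies on below. This is optional and purely illustrative:

# Optional check: preview the first few predicted labels assigned above
print(vm_test_ds.y_pred(model=vm_model)[:5])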

Implementing a custom inline test

With the setup out of the way, let's implement a custom inline test that calculates the confusion matrix for a binary classification model.

  • An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
  • You'll note that the custom test function is just a regular Python function that can include and require any Python library as you see fit.

Create a confusion matrix plot

Let's first create a confusion matrix plot using the confusion_matrix function from the sklearn.metrics module:

import matplotlib.pyplot as plt
from sklearn import metrics

# Get the predicted classes
y_pred = log_reg.predict(vm_test_ds.x)

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[False, True]
)
cm_display.plot()

Next, wrap the logic in a @vm.test decorator to create a reusable test. Note the following changes in the code below:

  • The function confusion_matrix takes two arguments, dataset and model. These are a VMDataset and a VMModel object, respectively.
    • VMDataset objects allow you to access the dataset's true (target) values by accessing the .y attribute.
    • VMDataset objects allow you to access the predictions for a given model by accessing the .y_pred() method.
  • The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.
  • The function body calculates the confusion matrix using the sklearn.metrics.confusion_matrix function as we just did above.
  • The function then returns the ConfusionMatrixDisplay.figure_ object — this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.
  • The @vm.test decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID my_custom_tests.ConfusionMatrix.
@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself

You can now run the newly created custom test on both the training and test datasets using the run_test() function:

# Training dataset
result = vm.tests.run_test(
    "my_custom_tests.ConfusionMatrix:training_dataset",
    inputs={"model": vm_model, "dataset": vm_train_ds},
)

Confusion Matrix Training Dataset

Confusion Matrix: Training Dataset is designed to provide a comprehensive summary of a classification model’s predictive performance by displaying the counts of true positives, true negatives, false positives, and false negatives. This test is primarily used to evaluate how well the model distinguishes between the two classes in a binary classification setting, offering a direct and interpretable view of the model’s strengths and weaknesses in terms of correct and incorrect predictions.

The test operates by comparing the predicted class labels generated by the model against the actual, or true, class labels in the training dataset. The confusion matrix is structured as a 2x2 table, where each cell represents a specific combination of predicted and actual outcomes: true positives (correctly predicted positives), true negatives (correctly predicted negatives), false positives (incorrectly predicted positives), and false negatives (incorrectly predicted negatives). From these four values, several key performance metrics can be derived, including accuracy (the proportion of total correct predictions), precision (the proportion of positive predictions that are correct), recall (the proportion of actual positives that are correctly identified), and the F1 score (the harmonic mean of precision and recall). Each of these metrics ranges from 0 to 1, where higher values indicate better performance. The confusion matrix thus provides a granular breakdown of model performance, allowing for the identification of specific types of errors and the calculation of multiple evaluation metrics from a single table.

The primary advantages of this test include its ability to present a holistic and interpretable summary of model performance, making it easy to identify not only the overall accuracy but also the specific types of errors the model is making. By breaking down predictions into true and false positives and negatives, the confusion matrix enables practitioners to assess the balance between sensitivity and specificity, which is particularly important in domains where the costs of different types of errors are not equal. The test is also highly flexible, supporting the calculation of a wide range of derived metrics that can be tailored to the specific needs of the application. Its visual representation further aids in quickly communicating model behavior to both technical and non-technical stakeholders.

It should be noted that the confusion matrix, while informative, has several limitations. It is inherently tied to the class distribution in the dataset, which means that imbalanced datasets can lead to misleading impressions of model performance if not interpreted carefully. The matrix itself does not account for the relative costs or risks associated with different types of errors, which may be critical in certain applications. Additionally, the confusion matrix provides a static snapshot based on a specific threshold for classification; it does not capture the model’s performance across varying thresholds or provide insight into probabilistic outputs. Interpretation can also become challenging when the matrix is used for multiclass problems, as the number of cells increases and the relationships between errors become more complex. Finally, the confusion matrix does not provide information about the underlying causes of errors or the model’s calibration.

This test shows a heatmap-style confusion matrix for the training dataset, with the axes labeled as “True label” and “Predicted label,” and the two possible classes denoted as “False” and “True.” The matrix is color-coded according to the count in each cell, with a color bar indicating the scale from lower to higher values. The top-left cell (True Negative) contains 841 instances where the model correctly predicted the negative class, while the top-right cell (False Positive) contains 436 instances where the model incorrectly predicted positive for a true negative. The bottom-left cell (False Negative) shows 448 instances where the model incorrectly predicted negative for a true positive, and the bottom-right cell (True Positive) contains 840 correct positive predictions. The values in each cell are clearly labeled, and the color intensity reflects the magnitude of the counts, with higher values appearing brighter. The matrix is symmetric in terms of the number of correct predictions for each class (841 TN and 840 TP), and the number of incorrect predictions is also similar (436 FP and 448 FN). The total number of samples represented is 2,565, and the distribution of errors and correct predictions can be visually assessed by comparing the color intensities and the numerical values in each cell.

The test results reveal the following key insights:

  • Balanced Correct Predictions Across Classes: The model achieves nearly equal numbers of true positives (840) and true negatives (841), indicating that it is equally effective at identifying both classes in the training dataset.
  • Comparable Rates of False Positives and False Negatives: The counts of false positives (436) and false negatives (448) are similar, suggesting that the model’s tendency to misclassify one class as the other is balanced and does not favor one type of error over the other.
  • Moderate Error Rates Evident in Both Classes: The presence of 436 false positives and 448 false negatives, relative to the correct predictions, indicates that the model makes a substantial number of errors in both directions, which may impact derived metrics such as precision and recall.
  • Symmetry in Model Performance: The close alignment of values across the diagonal (TP and TN) and off-diagonal (FP and FN) cells suggests that the model does not exhibit a strong bias toward either class in its predictions on the training data.
  • Visual Clarity of Distribution: The heatmap representation, with its color gradient and clear labeling, allows for immediate visual assessment of the distribution of correct and incorrect predictions, highlighting the areas of model strength and weakness.

Based on these results, the confusion matrix for the training dataset demonstrates that the model exhibits a balanced performance in terms of both correct and incorrect predictions for each class. The nearly equal numbers of true positives and true negatives, as well as the similar counts of false positives and false negatives, indicate that the model does not systematically favor one class over the other in its predictions. However, the moderate rates of both types of errors suggest that there is room for improvement in the model’s ability to distinguish between the two classes, as a significant proportion of predictions are incorrect. The visual symmetry and distribution of values in the matrix provide a clear and interpretable summary of the model’s behavior, supporting further analysis of derived metrics such as accuracy, precision, recall, and F1 score. These results collectively characterize the model as having balanced but moderate discriminative power on the training data, with no pronounced bias toward either class and a consistent pattern of errors across both positive and negative predictions.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:training_dataset:eca6
# Test dataset
result = vm.tests.run_test(
    "my_custom_tests.ConfusionMatrix:test_dataset",
    inputs={"model": vm_model, "dataset": vm_test_ds},
)

Confusion Matrix Test Dataset

Confusion Matrix: Test Dataset is designed to provide a comprehensive summary of a classification model’s predictive performance by comparing the model’s predicted labels against the true labels for a given dataset. The primary purpose of this test is to quantify the model’s ability to correctly identify positive and negative cases, as well as to highlight the types and frequencies of misclassifications. This enables a detailed understanding of the model’s strengths and weaknesses in distinguishing between the two classes under evaluation.

The test operates by constructing a two-by-two matrix, where each cell represents a count of instances corresponding to a specific combination of predicted and actual class labels. The four key components are: true positives (correctly predicted positives), true negatives (correctly predicted negatives), false positives (incorrectly predicted positives), and false negatives (incorrectly predicted negatives). These counts are then used to derive several important performance metrics. Accuracy measures the proportion of total correct predictions and ranges from 0 to 1, with higher values indicating better overall performance. Precision quantifies the proportion of positive predictions that are actually correct, reflecting the model’s reliability when it predicts a positive outcome. Recall, or sensitivity, measures the proportion of actual positives that are correctly identified, indicating the model’s ability to capture true cases. The F1 score is the harmonic mean of precision and recall, providing a balanced metric that accounts for both false positives and false negatives. Each metric is calculated using the counts from the confusion matrix, and their values typically range from 0 (worst) to 1 (best), with higher values signifying stronger model performance.

The primary advantages of this test include its ability to provide a granular and interpretable breakdown of model predictions, making it possible to identify specific types of errors and their frequencies. The confusion matrix is particularly useful in scenarios where the costs of different types of misclassification are not equal, as it allows practitioners to assess the trade-offs between false positives and false negatives. Additionally, the derived metrics—accuracy, precision, recall, and F1 score—offer a multi-faceted view of model performance, supporting informed decision-making regarding model selection, threshold tuning, and deployment. The visual representation of the confusion matrix further enhances interpretability, enabling stakeholders to quickly grasp the distribution of prediction outcomes and to communicate results effectively across technical and non-technical audiences.

It should be noted that the confusion matrix and its derived metrics have certain limitations and potential risks. The test is inherently dependent on the class distribution within the dataset, which can lead to misleading impressions of performance in the presence of class imbalance. For example, high accuracy may be achieved simply by favoring the majority class, even if the model performs poorly on the minority class. Additionally, the confusion matrix does not account for the relative costs or consequences of different types of errors, which may be critical in certain applications. Interpretation challenges may arise when comparing models across datasets with differing class proportions or when the underlying data distribution shifts over time. Furthermore, the test provides a static snapshot of performance and does not capture model calibration or the confidence of predictions, which may be important for risk-sensitive domains.

This test shows a confusion matrix presented as a color-coded heatmap, with the true labels on the vertical axis and the predicted labels on the horizontal axis. The matrix is divided into four cells, each annotated with the corresponding count: 197 true negatives (top-left), 142 false positives (top-right), 112 false negatives (bottom-left), and 196 true positives (bottom-right). The color intensity of each cell reflects the magnitude of the count, as indicated by the accompanying color bar, which ranges from approximately 110 to just above 190. The matrix provides a direct visual summary of the model’s classification outcomes, allowing for immediate identification of the most and least frequent prediction types. The true negative and true positive cells are the most prominent, indicating that the model correctly classifies a substantial number of both negative and positive cases. The false positive and false negative cells, while less intense, are still significant, suggesting that misclassifications are non-negligible. The overall distribution of values can be interpreted by comparing the relative sizes of each cell, with the diagonal cells representing correct predictions and the off-diagonal cells representing errors. The matrix does not display derived metrics directly, but these can be calculated from the provided counts to further quantify performance.

The test results reveal the following key insights:

  • Balanced Correct Classification Across Classes: The model demonstrates a relatively even distribution of correct predictions, with 197 true negatives and 196 true positives, indicating that it is capable of identifying both classes with similar effectiveness.
  • Substantial Misclassification Rates: There are 142 false positives and 112 false negatives, highlighting that the model makes a notable number of errors in both directions, which may impact its reliability in critical applications.
  • Diagonal Dominance with Significant Off-Diagonal Values: While the diagonal cells (correct predictions) are the most populated, the off-diagonal cells (errors) are not negligible, suggesting that the model’s discriminative power is moderate rather than strong.
  • Color Intensity Reflects Distribution: The heatmap’s color gradient visually emphasizes the higher counts in the true negative and true positive cells, while still drawing attention to the non-trivial error rates in the false positive and false negative cells.
  • Potential Class Imbalance Not Evident: The similar counts for true positives and true negatives, as well as for false positives and false negatives, suggest that the dataset may be relatively balanced between the two classes, or that the model’s performance is not heavily skewed toward one class.

Based on these results, the confusion matrix indicates that the model achieves a comparable level of accuracy in predicting both positive and negative cases, as evidenced by the nearly equal counts of true positives and true negatives. However, the presence of substantial false positives and false negatives reveals that the model’s predictions are subject to a moderate degree of error, which may affect its suitability for applications where misclassification costs are high. The visual representation underscores the importance of considering both correct and incorrect predictions when evaluating model performance, as reliance on a single metric such as accuracy could obscure the true nature of the model’s behavior. The observed distribution suggests that the model does not exhibit a strong bias toward either class, but the error rates indicate room for improvement in discriminative capability. These insights collectively provide a nuanced understanding of the model’s predictive characteristics, supporting further analysis and potential refinement based on the specific requirements of the intended application.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_dataset:5f4a
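As an aside, the derived metrics described in the result above (accuracy, precision, recall, and F1 score) can be reproduced directly from the reported counts. Here is a minimal sketch using the test-dataset cell values (197 TN, 142 FP, 112 FN, 196 TP); the variable names are illustrative and this is not part of the ValidMind output:

# Derive the standard metrics from the confusion matrix counts reported above
tn, fp, fn, tp = 197, 142, 112, 196

accuracy = (tp + tn) / (tp + tn + fp + fn)  # proportion of all correct predictions
precision = tp / (tp + fp)                  # correct positives / all predicted positives
recall = tp / (tp + fn)                     # correct positives / all actual positives
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")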

Add parameters to custom tests

Custom tests can take parameters just like any other function. To demonstrate, let's modify the confusion_matrix function to take an additional parameter normalize that will allow you to normalize the confusion matrix:

@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model, normalize=False):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    if normalize:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred, normalize="all")
    else:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself
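As a side note, sklearn's confusion_matrix supports more normalization modes than the boolean parameter above exposes: normalize can be "true" (proportions per actual class), "pred" (proportions per predicted class), or "all" (over all samples). A small illustrative variation, reusing the y_test and y_pred arrays from the plain sklearn example earlier:

# Illustrative only: other normalization modes supported by sklearn
row_normalized = metrics.confusion_matrix(y_test, y_pred, normalize="true")  # per true class
col_normalized = metrics.confusion_matrix(y_test, y_pred, normalize="pred")  # per predicted class

print(row_normalized)
print(col_normalized)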

Pass parameters to custom tests

You can pass parameters to custom tests by providing a dictionary of parameters to the run_test() function.

  • The parameters will override any default parameters set in the custom test definition. Note that dataset and model are still passed as inputs.
  • Since these are VMDataset or VMModel inputs, they have a special meaning: when you declare a dataset, model, datasets, or models argument in a custom test function, the ValidMind Library expects them to be passed as inputs to run_test() or run_documentation_tests().

Re-running the confusion matrix with normalize=True and our testing dataset looks like this:

# Test dataset with normalize=True
result = vm.tests.run_test(
    "my_custom_tests.ConfusionMatrix:test_dataset_normalized",
    inputs={"model": vm_model, "dataset": vm_test_ds},
    params={"normalize": True}
)

Confusion Matrix Test Dataset Normalized

Confusion Matrix: Test Dataset Normalized is designed to provide a comprehensive summary of a classification model’s predictive performance by displaying the counts of true positives, true negatives, false positives, and false negatives in a structured matrix format. The primary purpose of this test is to enable a clear and immediate understanding of how well the model distinguishes between the two classes, offering a direct visualization of both correct and incorrect predictions relative to the actual outcomes.

The test operates by comparing the predicted class labels generated by the model against the actual, or true, class labels in the test dataset. Each prediction is categorized into one of four groups: true positive (correctly predicted positive), true negative (correctly predicted negative), false positive (incorrectly predicted positive), and false negative (incorrectly predicted negative). These values are then arranged in a 2x2 matrix, with the axes representing the true and predicted labels. In this instance, the matrix is normalized, meaning each cell value is expressed as a proportion of the total, rather than as raw counts. This normalization allows for direct comparison across datasets of different sizes and helps to mitigate the effects of class imbalance. The confusion matrix serves as the basis for calculating several key performance metrics: accuracy (the proportion of all correct predictions), precision (the proportion of positive predictions that are correct), recall (the proportion of actual positives that are correctly identified), and the F1 score (the harmonic mean of precision and recall). Each of these metrics ranges from 0 to 1, where values closer to 1 indicate better performance. High values for true positives and true negatives, and low values for false positives and false negatives, are generally desirable, as they indicate the model is making accurate predictions.

The primary advantages of this test include its ability to provide a holistic and interpretable summary of model performance in a single visualization, making it easy to identify both strengths and weaknesses in classification. The normalized confusion matrix is particularly useful when comparing models or datasets with differing class distributions, as it presents results in a standardized format. This approach enables practitioners to quickly assess not only overall accuracy but also the balance between different types of errors, which is critical in domains where the cost of false positives and false negatives may differ significantly. The test’s visual format facilitates rapid identification of systematic prediction errors, such as a tendency to over-predict one class, and supports the calculation of a range of secondary metrics that inform model selection and tuning.

It should be noted that the confusion matrix, while informative, has several limitations. It provides only a snapshot of model performance on a specific dataset and does not account for the underlying probability estimates or the confidence of predictions. The matrix does not capture the relative costs or impacts of different types of errors, which may be important in certain applications. Additionally, the normalized format, while useful for comparison, can obscure the absolute scale of errors, making it harder to assess the practical significance of the results without additional context. Interpretation can also be challenging in cases of severe class imbalance, as high accuracy may be achieved by simply predicting the majority class. Finally, the confusion matrix does not provide insight into the reasons behind misclassifications or the model’s behavior on individual instances, necessitating further analysis for a complete understanding.

This test shows a normalized confusion matrix presented as a color-coded heatmap, with the true labels on the vertical axis and the predicted labels on the horizontal axis. Each cell in the matrix contains a value representing the proportion of total predictions falling into that category, with the color intensity corresponding to the magnitude of the value, as indicated by the accompanying color bar. The matrix is divided into four quadrants: the top-left cell (0.30) represents the proportion of true negatives, the top-right cell (0.22) represents false positives, the bottom-left cell (0.17) represents false negatives, and the bottom-right cell (0.30) represents true positives. The values range from 0.17 to 0.30, indicating the distribution of predictions across the four possible outcomes. The color bar to the right of the matrix provides a visual reference for interpreting the color scale, which spans from approximately 0.17 (darkest) to 0.30 (brightest). This format allows for immediate visual assessment of where the model is performing well and where errors are concentrated. Notably, the true positive and true negative rates are equal at 0.30, while the false positive and false negative rates are somewhat lower, at 0.22 and 0.17, respectively. This balance suggests a relatively even distribution of correct predictions across both classes, with a moderate level of misclassification present in both directions.

The test results reveal the following key insights:

  • Balanced Correct Prediction Rates Across Classes: The model achieves equal normalized rates of 0.30 for both true positives and true negatives, indicating that it is equally effective at correctly identifying both positive and negative cases in the test dataset.
  • Moderate False Positive and False Negative Rates: The normalized rate for false positives is 0.22, while the rate for false negatives is 0.17, showing that the model makes a moderate number of errors in both directions, with slightly fewer false negatives than false positives.
  • Proportional Distribution of Predictions: The sum of all matrix cells is 1.0, confirming that the normalization is correctly applied and that the proportions reflect the complete distribution of model predictions.
  • Visual Clarity of Error Patterns: The heatmap format, with its distinct color gradations, makes it easy to visually distinguish between areas of high and low predictive accuracy, supporting rapid identification of the model’s strengths and areas for improvement.

Based on these results, the model demonstrates a balanced ability to correctly classify both positive and negative cases, as evidenced by the equal normalized rates for true positives and true negatives. The presence of moderate false positive and false negative rates suggests that while the model is not perfect, it does not exhibit a strong bias toward over-predicting or under-predicting either class. The normalized confusion matrix provides a clear and interpretable summary of the model’s predictive behavior, with the color-coded heatmap facilitating quick assessment of performance across all outcome categories. The proportional distribution of predictions confirms that the model’s outputs are well-calibrated in terms of class representation. Overall, the results indicate that the model maintains a reasonable balance between sensitivity and specificity, with no extreme imbalances or systematic errors apparent in the test dataset. This balanced performance profile is particularly important in applications where both types of errors carry significant consequences, and it provides a solid foundation for further evaluation and potential refinement of the model.

Parameters:

{
  "normalize": true
}
            

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_dataset_normalized:5ea1
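These normalized cells line up with the raw counts reported for the un-normalized test-dataset run above (197, 142, 112, and 196 out of 647 predictions). A quick, purely illustrative check:

# Each normalized cell is the raw count divided by the total number of predictions
counts = {"tn": 197, "fp": 142, "fn": 112, "tp": 196}
total = sum(counts.values())  # 647

print({name: round(count / total, 2) for name, count in counts.items()})
# Expected: roughly {'tn': 0.3, 'fp': 0.22, 'fn': 0.17, 'tp': 0.3}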

Log the confusion matrix results

As we learned in 2 — Start the model development process under Documenting results > Run and log an individual test, you can log any result to the ValidMind Platform with the .log() method of the result object, allowing you to then add the result to the documentation.

You can now do the same for the confusion matrix results:

result.log()
2026-01-10 02:01:19,409 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_dataset_normalized does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for this particular test ID.

That's expected: when we run individual tests, the logged results need to be manually added to your documentation within the ValidMind Platform.

Using external test providers

Creating inline custom tests with a function is a great way to customize your model documentation. However, sometimes you may want to reuse the same set of tests across multiple models and share them with others in your organization. In this case, you can create an external custom test provider that will allow you to load custom tests from a local folder or a Git repository.

In this section you will learn how to declare a local filesystem test provider that allows loading tests from a local folder following these high level steps:

  1. Create a folder of custom tests from existing inline tests (tests that exist in your active Jupyter Notebook)
  2. Save an inline test to a file
  3. Define and register a LocalTestProvider that points to that folder
  4. Run test provider tests
  5. Add the test results to your documentation

Create custom tests folder

Let's start by creating a new folder that will contain reusable custom tests from your existing inline tests.

The following code snippet will create a new my_tests directory in the current working directory if it doesn't exist:

tests_folder = "my_tests"

import os

# create tests folder
os.makedirs(tests_folder, exist_ok=True)

# remove existing tests
for f in os.listdir(tests_folder):
    # remove files and pycache
    if f.endswith(".py") or f == "__pycache__":
        os.system(f"rm -rf {tests_folder}/{f}")

After running the command above, confirm that a new my_tests directory was created successfully. For example:

~/notebooks/tutorials/model_development/my_tests/
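If you'd rather verify this programmatically than in your file browser, a quick optional check (purely illustrative):

import os

# Confirm the tests folder exists and is currently empty
print(os.path.abspath(tests_folder))
print(os.listdir(tests_folder))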

Save an inline test

The @vm.test decorator we used in Implementing a custom inline test above to register one-off custom tests also includes a convenience method on the function object that allows you to simply call <func_name>.save() to save the test to a Python file at a specified path.

While save() will get you started by creating the file and saving the function code with the correct name, it won't automatically include any imports, or other functions or variables defined outside the function, that are needed for the test to run. To solve this, pass an optional imports argument to ensure the necessary imports are added to the file.

The confusion_matrix test requires the following additional imports:

import matplotlib.pyplot as plt
from sklearn import metrics

Let's pass these imports to the save() method to ensure they are included in the file with the following command:

confusion_matrix.save(
    # Save it to the custom tests folder we created
    tests_folder,
    imports=["import matplotlib.pyplot as plt", "from sklearn import metrics"],
)
2026-01-10 02:01:20,022 - INFO(validmind.tests.decorator): Saved to /home/runner/work/documentation/documentation/site/notebooks/EXECUTED/model_development/my_tests/ConfusionMatrix.py! Be sure to add any necessary imports to the top of the file.
2026-01-10 02:01:20,023 - INFO(validmind.tests.decorator): This metric can be run with the ID: <test_provider_namespace>.ConfusionMatrix
# Saved from __main__.confusion_matrix
# Original Test ID: my_custom_tests.ConfusionMatrix
# New Test ID: <test_provider_namespace>.ConfusionMatrix
def ConfusionMatrix(dataset, model, normalize=False):
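To double-check that the imports made it into the saved file, you can print its contents (optional):

# Optional: inspect the generated test file
with open(os.path.join(tests_folder, "ConfusionMatrix.py")) as f:
    print(f.read())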

Register a local test provider

Now that your my_tests folder has a sample custom test, let's initialize a test provider that will tell the ValidMind Library where to find your custom tests:

  • ValidMind offers out-of-the-box test providers for local tests (tests in a folder) or a GitHub provider for tests in a GitHub repository.
  • You can also create your own test provider by creating a class that has a load_test method that takes a test ID and returns the test function matching that ID. A minimal sketch of such a class follows the note below.
Want to learn more about test providers?

An extended introduction to test providers can be found in: Integrate external test providers
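For reference, here is a minimal sketch of what a hand-rolled test provider could look like. It only assumes the load_test contract described above; the class name, file layout, and the assumption that the namespace prefix is stripped before load_test is called are all illustrative, not part of the ValidMind API:

import importlib.util
import os


class MyFolderTestProvider:
    """Loads a test function from `<folder>/<TestName>.py`, where the file
    defines a function with the same name as the file (illustrative only)."""

    def __init__(self, folder):
        self.folder = folder

    def load_test(self, test_id):
        # Assumes test_id is the part after the namespace, e.g. "ConfusionMatrix";
        # adjust if your version passes the full, namespaced ID
        path = os.path.join(self.folder, f"{test_id}.py")
        spec = importlib.util.spec_from_file_location(test_id, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return getattr(module, test_id)

In practice, the built-in LocalTestProvider shown next covers this use case, so a custom class like this is only needed when your tests live somewhere a folder or Git repository can't reach.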

Initialize a local test provider

For most use cases, using a LocalTestProvider that allows you to load custom tests from a designated directory should be sufficient.

The most important attribute for a test provider is its namespace. This is a string that will be used to prefix test IDs in model documentation. This allows you to have multiple test providers with tests that can even share the same ID, but are distinguished by their namespace.

Let's go ahead and load the custom tests from our my_tests directory:

from validmind.tests import LocalTestProvider

# initialize the test provider with the tests folder we created earlier
my_test_provider = LocalTestProvider(tests_folder)

vm.tests.register_test_provider(
    namespace="my_test_provider",
    test_provider=my_test_provider,
)
# `my_test_provider.load_test()` will be called for any test ID that starts with `my_test_provider`
# e.g. `my_test_provider.ConfusionMatrix` will look for a function named `ConfusionMatrix` in `my_tests/ConfusionMatrix.py` file

Run test provider tests

Now that we've set up the test provider, we can run any test that's located in the tests folder by using the run_test() method as with any other test:

  • For tests that reside in a test provider directory, the test ID will be the namespace specified when registering the provider, followed by the path to the test file relative to the tests folder.
  • For example, the Confusion Matrix test we created earlier will have the test ID my_test_provider.ConfusionMatrix. You could organize the tests in subfolders, say classification and regression, and the test ID for the Confusion Matrix test would then be my_test_provider.classification.ConfusionMatrix.
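To make the mapping concrete, here is a hypothetical layout and the test IDs it would produce (illustrative only; the extra files don't exist in this notebook):

# my_tests/ConfusionMatrix.py                 -> my_test_provider.ConfusionMatrix
# my_tests/classification/ConfusionMatrix.py  -> my_test_provider.classification.ConfusionMatrix
# my_tests/regression/ErrorsPlot.py           -> my_test_provider.regression.ErrorsPlot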

Let's go ahead and re-run the confusion matrix test with our testing dataset by using the test ID my_test_provider.ConfusionMatrix. This should load the test from the test provider and run it as before.

result = vm.tests.run_test(
    "my_test_provider.ConfusionMatrix",
    inputs={"model": vm_model, "dataset": vm_test_ds},
    params={"normalize": True},
)

result.log()

Confusion Matrix

Confusion Matrix is designed to provide a comprehensive summary of a classification model’s predictive performance by displaying the counts of true positives, true negatives, false positives, and false negatives in a structured table. This test is primarily used to evaluate how well a model distinguishes between two classes, offering a direct view of both correct and incorrect predictions relative to the actual outcomes.

The test operates by comparing the predicted class labels generated by the model against the actual, true class labels for each observation in the dataset. The results are organized into a 2x2 matrix, where each cell represents a specific combination of predicted and actual class outcomes: true positives (correctly predicted positives), true negatives (correctly predicted negatives), false positives (incorrectly predicted positives), and false negatives (incorrectly predicted negatives). In this instance, the matrix is normalized, meaning each cell value is expressed as a proportion of the total, rather than as a raw count. This normalization allows for easier comparison across datasets of different sizes and provides a clearer understanding of the model’s relative performance. The confusion matrix serves as the foundation for calculating several key performance metrics, including accuracy (the proportion of all correct predictions), precision (the proportion of positive predictions that are correct), recall (the proportion of actual positives that are correctly identified), and the F1 score (the harmonic mean of precision and recall). These metrics typically range from 0 to 1, where values closer to 1 indicate better performance. High values for true positives and true negatives, and low values for false positives and false negatives, are generally desirable, as they indicate the model is making accurate predictions.

The primary advantages of this test include its ability to provide a detailed and interpretable breakdown of model performance across all possible prediction outcomes. By presenting both the correct and incorrect predictions, the confusion matrix enables practitioners to identify specific types of errors, such as whether the model is more prone to false positives or false negatives. This level of granularity is particularly valuable in domains where the costs of different types of errors are not equal, such as in medical diagnosis or fraud detection. The normalized format further enhances interpretability by allowing for direct comparison of performance across different datasets or models, regardless of sample size. Additionally, the confusion matrix forms the basis for a suite of widely used performance metrics, making it a central tool in model evaluation and selection.

It should be noted that the confusion matrix, while informative, has several limitations. It is inherently limited to classification tasks with discrete, mutually exclusive classes and does not generalize to regression or multi-label problems without modification. The matrix provides no information about the underlying probability estimates or confidence levels of the model’s predictions, which can be important for risk-sensitive applications. Interpretation can also be challenging in cases of class imbalance, where high accuracy may mask poor performance on minority classes. Furthermore, the confusion matrix does not account for the relative costs or consequences of different types of errors unless explicitly incorporated into the analysis. Care must be taken to ensure that the normalization method matches the intended interpretation, as row-wise, column-wise, or overall normalization can yield different perspectives on model performance.

This test shows a normalized confusion matrix presented as a color-coded heatmap, with the true class labels on the vertical axis and the predicted class labels on the horizontal axis. Each cell in the matrix contains a value representing the proportion of total predictions falling into that category, with the color intensity corresponding to the magnitude of the value, as indicated by the accompanying color bar. The top-left cell (False, False) shows a value of 0.30, representing the proportion of true negatives, while the top-right cell (False, True) shows 0.22, indicating the proportion of false positives. The bottom-left cell (True, False) has a value of 0.17, corresponding to false negatives, and the bottom-right cell (True, True) shows 0.30, representing true positives. The values range from 0.17 to 0.30, and the color bar provides a visual reference for interpreting the relative magnitude of each cell. This layout allows for immediate visual assessment of the model’s strengths and weaknesses in distinguishing between the two classes. The matrix is balanced in terms of true positives and true negatives, both at 0.30, while the error rates for false positives and false negatives are 0.22 and 0.17, respectively. The normalized format ensures that these proportions are directly comparable and sum to 1 across the entire matrix.

The test results reveal the following key insights:

  • Balanced Correct Prediction Rates: The model achieves equal proportions of true positives and true negatives, each at 0.30, indicating similar effectiveness in correctly identifying both classes.
  • Moderate False Positive Rate: The proportion of false positives is 0.22, suggesting that the model incorrectly predicts the positive class in approximately one-fifth of all cases.
  • Lower False Negative Rate: The false negative rate is 0.17, which is lower than the false positive rate, indicating that the model is somewhat less likely to miss true positive cases than to incorrectly flag negatives as positives.
  • Normalized Distribution Across All Outcomes: The sum of all matrix cells equals 1, confirming that the normalization parameter has been applied correctly and that the results represent relative frequencies rather than raw counts.
  • Visual Clarity of Error Patterns: The heatmap format, with its color gradient, makes it easy to visually distinguish between high and low frequency cells, highlighting the areas where the model performs well and where errors are more prevalent.

Based on these results, the confusion matrix demonstrates that the model exhibits a balanced ability to correctly classify both positive and negative cases, as evidenced by the equal proportions of true positives and true negatives. The error rates, while present, are moderate, with the model more likely to produce false positives than false negatives. This pattern suggests that the model may be slightly more conservative in its positive predictions, favoring recall over precision. The normalized presentation of the results allows for straightforward interpretation and comparison, making it clear that the model’s predictive behavior is not skewed toward one class at the expense of the other. The visual representation further aids in identifying the relative frequency of each outcome, supporting a nuanced understanding of the model’s strengths and areas where misclassifications occur. Overall, the results indicate a model that is neither overly biased toward positive nor negative predictions, with error rates that are distributed in a manner consistent with balanced classification performance.

Parameters:

{
  "normalize": true
}
            

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:3a37
2026-01-10 02:01:53,786 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix does not exist in model's document
Again, note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for this particular test ID.

That's expected: when we run individual tests, the logged results need to be manually added to your documentation within the ValidMind Platform.

Add test results to documentation

With our custom tests run and results logged to the ValidMind Platform, let's head to the model we connected to at the beginning of this notebook and insert our test results into the documentation (Need more help?):

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Documentation under Documents.

  3. Locate the Model Development section and click on 3.2. Model Evaluation to expand that section.

  4. Hover under the Pearson Correlation Matrix content block until a horizontal dashed line with a + button appears, indicating that you can insert a new block.

    Screenshot showing insert block button in model documentation

  5. Click + and then select Test-Driven Block under FROM LIBRARY:

    • Click on Custom under TEST-DRIVEN in the left sidebar.
    • Select the two custom ConfusionMatrix tests you logged above:

    Screenshot showing the ConfusionMatrix tests selected

  6. Finally, click Insert 2 Test Results to Document to add the test results to the documentation.

    Confirm that the two individual results for the confusion matrix tests have been correctly inserted into section 3.2. Model Evaluation of the documentation.

In summary

In this third notebook, you learned how to:

  • Implement a custom inline test and run it with run_test()
  • Add parameters to custom tests and pass them in at run time
  • Log custom test results to the ValidMind Platform
  • Save an inline test to a file and register a local test provider
  • Run test provider tests and add the results to your model documentation

Next steps

Finalize testing and documentation

Now that you're proficient at using the ValidMind Library to run and log tests, let's put the last pieces in place to prepare our fully documented sample model for review: 4 — Finalize testing and documentation