ValidMind for model validation 3 — Developing a potential challenger model

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this third notebook, develop a potential challenger model and then pass your model and its predictions to ValidMind.

A challenger model is an alternate model that attempts to outperform the champion model, ensuring that the best performing fit-for-purpose model is always considered for deployment. Challenger models also help avoid over-reliance on a single model, and allow testing of new features, algorithms, or data sources without disrupting the production lifecycle.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals.

Prerequisites

In order to develop potential challenger models with this notebook, you'll need to first have:

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be quite familiar to you — as we performed the same actions in the previous notebook, 2 — Start the model validation process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model validation" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2026-01-10 02:26:37,518 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load in the sample Bank Customer Churn Prediction dataset that was used to develop the champion model, which we'll then independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

Preprocess the dataset

We’ll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
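
As a quick sanity check, you can confirm that the rebalanced dataset now contains an equal number of exited and not-exited customers. This is just an optional sketch using plain pandas:

# Confirm that the rebalanced dataset has equal class counts
print(balanced_raw_df["Exited"].value_counts())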

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As you know, before you can run tests you'll need to initialize a ValidMind dataset object with the init_dataset function:

# Register the balanced data; 'balanced_raw_dataset' is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and utilize the output to help us identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

High Pearson Correlation is designed to identify pairs of features within a dataset that exhibit strong linear relationships, with the primary purpose of detecting potential feature redundancy or multicollinearity. This is crucial for ensuring that the predictive model remains interpretable and robust, as high correlations between features can obscure the true impact of individual variables and may lead to overfitting or instability in model estimates.

The test operates by calculating the Pearson correlation coefficient for every possible pair of features in the dataset. The Pearson correlation coefficient is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables, ranging from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear relationship. The test systematically computes these coefficients for all feature pairs, removes self-correlations and duplicate pairs, and then compares the absolute value of each coefficient to a predefined threshold, which in this case is set at 0.3. Any pair with an absolute correlation exceeding this threshold is flagged as a potential risk for multicollinearity. The test then presents the top n strongest correlations, regardless of whether they pass or fail the threshold, providing a transparent view of the most significant linear relationships in the data. The output includes the feature pair, the calculated coefficient, and a Pass or Fail status based on the threshold, allowing users to quickly assess which relationships may warrant further investigation.

The primary advantages of this test include its efficiency and clarity in surfacing linear dependencies between features, which is particularly valuable in the early stages of model development and risk assessment. By highlighting pairs of variables with strong linear associations, the test enables practitioners to proactively address multicollinearity, which can otherwise compromise model interpretability and predictive stability. The transparent tabular output makes it easy to identify and communicate which feature pairs are most strongly related, supporting informed decisions about feature selection, engineering, or regularization. This approach is especially useful in regulated environments or high-stakes applications where model transparency and explainability are paramount, as it provides a straightforward mechanism for documenting and managing potential sources of redundancy.

It should be noted that the test is limited to detecting linear relationships and does not capture more complex, nonlinear dependencies that may exist between features. The Pearson correlation coefficient is also sensitive to outliers, which can distort the measured strength of association and potentially lead to misleading conclusions. Additionally, the test only evaluates pairwise relationships, meaning that it may not identify more intricate forms of multicollinearity involving three or more variables. High correlation coefficients, particularly those exceeding the set threshold, are indicative of potential risk, as they suggest that the features involved may be redundant or could introduce instability into the model. However, the presence of a high correlation does not automatically imply a problem; further analysis is often required to determine the practical impact on model performance and interpretability.

This test shows its results in a tabular format, where each row represents a unique pair of features from the dataset. The columns include the feature pair, the Pearson correlation coefficient (labeled as "Coefficient"), and a Pass or Fail status indicating whether the absolute value of the coefficient exceeds the threshold of 0.3. The coefficients are presented as decimal values, typically ranging from -1 to 1, with positive values indicating direct relationships and negative values indicating inverse relationships. The table is sorted by the absolute value of the coefficient, with the strongest correlations at the top. Notably, only one feature pair, (Age, Exited), has a coefficient (0.339) that exceeds the threshold and is marked as "Fail," while all other pairs have coefficients below the threshold and are marked as "Pass." The remaining coefficients range from -0.1917 to 0.0367, indicating generally weak linear relationships among the other feature pairs. The table provides a clear and concise summary of the linear dependencies present in the dataset, making it straightforward to identify which pairs may require further scrutiny.

The test results reveal the following key insights:

  • Single Feature Pair Exceeds Correlation Threshold: Only the pair (Age, Exited) has a Pearson correlation coefficient of 0.339, surpassing the threshold of 0.3 and resulting in a "Fail" status, indicating a moderate linear relationship between these two features.
  • All Other Feature Pairs Show Weak Linear Relationships: The remaining nine feature pairs have coefficients ranging from -0.1917 to 0.0367, all below the threshold, and are marked as "Pass," suggesting minimal risk of multicollinearity among these pairs.
  • Distribution of Correlation Coefficients Is Centered Near Zero: Most coefficients are close to zero, indicating that the majority of feature pairs do not exhibit strong linear associations, which supports the overall independence of features in the dataset.
  • Negative and Positive Correlations Are Both Present: The coefficients include both positive and negative values, with the strongest negative correlation observed between (IsActiveMember, Exited) at -0.1917 and (Balance, NumOfProducts) at -0.171, though these remain below the risk threshold.
  • No Evidence of Widespread Redundancy: The absence of multiple pairs exceeding the threshold suggests that the dataset does not suffer from pervasive feature redundancy or multicollinearity, aside from the single flagged pair.

Based on these results, the dataset demonstrates a generally low level of linear dependency among its features, with only one pair, (Age, Exited), exhibiting a moderate correlation that exceeds the predefined threshold. This observation indicates that, with the exception of this pair, the features are largely independent in terms of linear relationships, reducing the likelihood of multicollinearity adversely affecting model interpretability or stability. The presence of both positive and negative coefficients, all within a narrow range, further supports the conclusion that the dataset is well-structured with respect to linear feature interactions. The clear separation between the single "Fail" and the multiple "Pass" results provides a straightforward narrative about the dataset's structure, highlighting the isolated nature of the moderate correlation and reinforcing the overall robustness of the feature set in terms of linear independence. This pattern suggests that, aside from the specific relationship between Age and Exited, the model is unlikely to be compromised by feature redundancy or instability arising from strong linear associations among input variables.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3390 Fail
(IsActiveMember, Exited) -0.1917 Pass
(Balance, NumOfProducts) -0.1710 Pass
(Balance, Exited) 0.1570 Pass
(NumOfProducts, Exited) -0.0609 Pass
(Age, Balance) 0.0508 Pass
(NumOfProducts, IsActiveMember) 0.0499 Pass
(Tenure, IsActiveMember) -0.0465 Pass
(Age, NumOfProducts) -0.0462 Pass
(Tenure, EstimatedSalary) 0.0367 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3390 Fail
1 (IsActiveMember, Exited) -0.1917 Pass
2 (Balance, NumOfProducts) -0.1710 Pass
3 (Balance, Exited) 0.1570 Pass
4 (NumOfProducts, Exited) -0.0609 Pass
5 (Age, Balance) 0.0508 Pass
6 (NumOfProducts, IsActiveMember) 0.0499 Pass
7 (Tenure, IsActiveMember) -0.0465 Pass
8 (Age, NumOfProducts) -0.0462 Pass
9 (Tenure, EstimatedSalary) 0.0367 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']

We can then re-initialize the dataset with the highly correlated features removed and a different input_id, then re-run the test for confirmation:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

High Pearson Correlation is designed to identify pairs of features within a dataset that exhibit strong linear relationships, with the primary purpose of detecting potential feature redundancy or multicollinearity. This is crucial for ensuring that the predictive model remains interpretable and robust, as high correlations between features can obscure the true impact of individual variables and may lead to overfitting or instability in model estimates.

The test operates by calculating the Pearson correlation coefficient for every possible pair of features in the dataset. The Pearson correlation coefficient is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables, ranging from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear relationship. The test systematically computes these coefficients for all feature pairs, removes self-correlations and duplicate pairs, and then sorts the results by the absolute value of the coefficient. A pre-defined threshold, set at 0.3 in this case, is used to determine whether a pair is considered highly correlated. Each pair is assigned a "Pass" if the absolute value of the coefficient is below the threshold, or a "Fail" if it exceeds the threshold. The test then returns the top n pairs with the strongest correlations, providing a clear view of the most significant linear relationships present in the data.

The primary advantages of this test include its efficiency and transparency in highlighting linear dependencies between features. By systematically surfacing the strongest correlations, it enables data scientists and risk managers to quickly identify and address potential sources of multicollinearity, which can compromise model interpretability and predictive stability. The test’s output is straightforward, presenting clear pairs of features, their correlation coefficients, and pass/fail status, which aids in early detection of problematic relationships before model training. This proactive approach supports the development of more robust and interpretable models, especially in regulated environments where transparency and explainability are paramount.

It should be noted that the test is limited to detecting only linear relationships, as measured by the Pearson correlation coefficient, and does not capture nonlinear dependencies that may also impact model performance. The metric is sensitive to outliers, which can disproportionately influence the calculated coefficients and potentially mask or exaggerate true relationships. Additionally, the test focuses exclusively on pairwise relationships, meaning it may overlook more complex interactions involving three or more features. High correlation coefficients, particularly those exceeding the threshold, are indicative of potential multicollinearity, which can undermine the reliability of model parameter estimates and complicate the interpretation of individual feature effects.

This test shows its results in the form of a table, where each row represents a unique pair of features from the dataset. The columns include the feature pair, the Pearson correlation coefficient (labeled as "Coefficient"), and a "Pass/Fail" status indicating whether the absolute value of the coefficient is below the threshold of 0.3. The coefficients are presented as decimal values, typically ranging from -1 to 1, with negative values indicating inverse relationships and positive values indicating direct relationships. In this particular output, all coefficients fall within the range of approximately -0.19 to 0.03, and every pair is marked as "Pass," signifying that none of the feature pairs exceed the specified threshold. The table is sorted by the absolute value of the coefficient, with the strongest correlations listed first. Notable observations include the absence of any pairs with coefficients near the threshold, suggesting a lack of strong linear dependencies among the top feature pairs. The results provide a clear and interpretable summary of the linear relationships present in the dataset, facilitating straightforward assessment of potential multicollinearity risks.

The test results reveal the following key insights:

  • No Feature Pairs Exceed Correlation Threshold: All feature pairs have absolute Pearson correlation coefficients below the 0.3 threshold, with the highest observed value being -0.1917 for the pair (IsActiveMember, Exited).
  • Low to Moderate Linear Relationships Across Features: The coefficients for the top ten pairs range from -0.1917 to 0.0273, indicating only weak to very weak linear associations between features.
  • Balanced Distribution of Positive and Negative Correlations: Both positive and negative coefficients are present, with no clear dominance of one direction, suggesting that the relationships between features are not systematically aligned.
  • No Evidence of Multicollinearity Among Top Pairs: The absence of high correlation values implies that the dataset does not exhibit significant multicollinearity among the most strongly related feature pairs.
  • Consistent Pass Status Across All Pairs: Every feature pair in the output is marked as "Pass," reinforcing the observation that the dataset is free from problematic linear dependencies within the evaluated pairs.

Based on these results, the dataset demonstrates a stable and well-structured feature space with respect to linear relationships, as none of the evaluated feature pairs approach the threshold for high correlation. The observed coefficients are uniformly low, indicating that the features are largely independent in terms of linear association, which supports the interpretability and reliability of subsequent modeling efforts. The balanced mix of positive and negative correlations further suggests that there are no systematic patterns of redundancy or inverse relationships that could complicate model estimation. The consistent "Pass" status across all pairs provides additional assurance that multicollinearity is not a concern within the top correlated features, allowing for greater confidence in the distinct contribution of each variable to the model. These characteristics collectively indicate that the dataset is well-suited for modeling applications where feature independence is desirable, and the risk of inflated variance or unstable parameter estimates due to linear dependencies is minimal.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.1917 Pass
(Balance, NumOfProducts) -0.1710 Pass
(Balance, Exited) 0.1570 Pass
(NumOfProducts, Exited) -0.0609 Pass
(NumOfProducts, IsActiveMember) 0.0499 Pass
(Tenure, IsActiveMember) -0.0465 Pass
(Tenure, EstimatedSalary) 0.0367 Pass
(HasCrCard, IsActiveMember) -0.0329 Pass
(IsActiveMember, EstimatedSalary) 0.0303 Pass
(CreditScore, IsActiveMember) 0.0273 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and the highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
5974 626 7 113014.70 2 1 1 56646.28 0 False False True
2566 646 6 124445.52 1 1 0 88481.32 0 False False True
3828 621 8 0.00 2 1 0 36122.96 0 False True True
1134 555 4 120392.99 1 1 0 177719.88 1 True False False
616 747 7 116313.57 1 1 1 190696.35 1 True False True
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's import the champion model submitted by the model development team as a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning:

Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations

Training a potential challenger model

We're curious how an alternate model compares to our champion model, so let's train a challenger model as a basis for our testing.

Our champion logistic regression model is a simpler, parametric model that assumes a linear relationship between the independent variables and the log-odds of the outcome. While logistic regression may not capture complex patterns as effectively, it offers a high degree of interpretability and is easier to explain to stakeholders. However, model risk is not assessed on any single factor in isolation, but rather by weighing trade-offs between predictive performance, ease of interpretability, and overall alignment with business objectives.
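
If you'd like to see that interpretability first-hand, a minimal optional sketch like the following inspects the champion's learned coefficients. It assumes the unpickled object is a fitted scikit-learn LogisticRegression, and that feature names were captured at training time:

# Optional: inspect the champion's learned coefficients (interpretability check)
# Assumes the unpickled champion is a fitted scikit-learn LogisticRegression;
# feature_names_in_ is only present if it was fit on a DataFrame
import numpy as np

feature_names = getattr(log_reg, "feature_names_in_", np.arange(log_reg.coef_.shape[1]))
for name, coef in sorted(zip(feature_names, log_reg.coef_[0]), key=lambda pair: abs(pair[1]), reverse=True):
    print(f"{name}: {coef:+.4f}")
print(f"Intercept: {log_reg.intercept_[0]:+.4f}")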

Random forest classification model

A random forest classification model is an ensemble machine learning algorithm that uses multiple decision trees to classify data. In ensemble learning, multiple models are combined to improve prediction accuracy and robustness.

Random forest classification models generally have higher accuracy because they capture complex, non-linear relationships, but as a result they lack transparency in their predictions.

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)
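
Although the ensemble itself is harder to explain than the logistic regression, you can still get a rough sense of which inputs drive its predictions. Here is a small optional sketch using scikit-learn's impurity-based feature importances from the fitted random forest:

# Optional: rank the random forest's impurity-based feature importances
import pandas as pd

importances = pd.Series(rf_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))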

Initializing the model objects

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, which can then be passed to other functions for analysis and tests.

You simply initialize this model object with vm.init_model():

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

Assign predictions

With our models registered, we'll move on to assigning both the predictive probabilities coming directly from each model's predictions, and the binary class predictions obtained by applying a cutoff threshold to those probabilities.

  • The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
  • This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

# Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-01-10 02:27:15,421 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 02:27:15,423 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 02:27:15,423 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 02:27:15,425 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-10 02:27:15,427 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 02:27:15,428 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 02:27:15,428 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 02:27:15,430 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-10 02:27:15,432 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 02:27:15,452 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 02:27:15,452 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 02:27:15,473 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-01-10 02:27:15,475 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-01-10 02:27:15,486 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-01-10 02:27:15,486 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-01-10 02:27:15,497 - INFO(validmind.vm_models.dataset.utils): Done running predict()
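
For reference, the class predictions assigned above correspond to thresholding each model's predicted probability of the positive class. The following is a conceptual sketch in plain scikit-learn (not the ValidMind API), assuming the default 0.5 cutoff:

# Conceptual equivalent of the assigned class predictions:
# threshold the positive-class probability at an assumed 0.5 cutoff
proba_positive = log_reg.predict_proba(X_test)[:, 1]
manual_classes = (proba_positive >= 0.5).astype(int)
print(manual_classes[:10])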

Running model evaluation tests

With our setup complete, let's run the rest of our validation tests. Since we have already verified the data quality of the dataset used to train our champion model, we will now focus on comprehensive performance evaluations of both the champion and challenger models.

Run model performance tests

Let's run some performance tests, beginning with independent testing of our champion logistic regression model, then moving on to our potential challenger model.

Use vm.tests.list_tests() to identify all the model performance tests for classification:


vm.tests.list_tests(tags=["model_performance"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.CalibrationCurve Calibration Curve Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... True False ['model', 'dataset'] {'n_bins': {'type': 'int', 'default': 10}} ['sklearn', 'model_performance', 'classification'] ['classification']
validmind.model_validation.sklearn.ClassifierPerformance Classifier Performance Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... False True ['dataset', 'model'] {'average': {'type': 'str', 'default': 'macro'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ConfusionMatrix Confusion Matrix Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... True False ['dataset', 'model'] {'threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.HyperParametersTuning Hyper Parameters Tuning Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... False True ['model', 'dataset'] {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': 'Union', 'default': None}, 'thresholds': {'type': 'Union', 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} ['sklearn', 'model_performance'] ['clustering', 'classification']
validmind.model_validation.sklearn.MinimumAccuracy Minimum Accuracy Checks if the model's prediction accuracy meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.7}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1Score Minimum F1 Score Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScore Minimum ROCAUC Score Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ModelsPerformanceComparison Models Performance Comparison Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... False True ['dataset', 'models'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndex Population Stability Index Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... True True ['datasets', 'model'] {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurve Precision Recall Curve Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurve ROC Curve Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrors Regression Errors Assesses the performance and error distribution of a regression model using various error metrics.... False True ['model', 'dataset'] {} ['sklearn', 'model_performance'] ['regression', 'classification']
validmind.model_validation.sklearn.TrainingTestDegradation Training Test Degradation Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... False True ['datasets', 'model'] {'max_threshold': {'type': 'float', 'default': 0.1}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.statsmodels.GINITable GINI Table Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets.... False True ['dataset', 'model'] {} ['model_performance'] ['classification']
validmind.ongoing_monitoring.CalibrationCurveDrift Calibration Curve Drift Evaluates changes in probability calibration between reference and monitoring datasets.... True True ['datasets', 'model'] {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDrift Class Discrimination Drift Compares classification discrimination metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassificationAccuracyDrift Classification Accuracy Drift Compares classification accuracy metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDrift Confusion Matrix Drift Compares confusion matrix metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDrift ROC Curve Drift Compares ROC curves between reference and monitoring datasets.... True False ['datasets', 'model'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']

We'll isolate the specific tests we want to run in a list called mpt:

As we learned in the previous notebook 2 — Start the model validation process, you can use a custom result_id to tag the individual result with a unique identifier by appending this result_id to the test_id with a : separator. We'll append an identifier for our champion model here:

mpt = [
    "validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion",
    "validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion",
    "validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion",
    "validmind.model_validation.sklearn.MinimumF1Score:logreg_champion",
    "validmind.model_validation.sklearn.ROCCurve:logreg_champion"
]

Evaluate performance of the champion model

Now, let's run and log our batch of model performance tests using our testing dataset (vm_test_ds) for our champion model:

  • The test set serves as a proxy for real-world data, providing an unbiased estimate of model performance since it was not used during training or tuning.
  • The test set also acts as protection against selection bias and model tweaking, giving a final, more unbiased checkpoint.
for test in mpt:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_test_ds,
            "model": vm_log_model,
        },
    ).log()

Classifier Performance Logreg Champion

Classifier Performance: logreg champion is designed to provide a comprehensive evaluation of classification models by quantifying their ability to correctly identify and distinguish between different classes. The primary purpose of this test is to assess the effectiveness of a model in making accurate predictions, using a suite of standard performance metrics that capture various aspects of classification quality, including precision, recall, F1-Score, accuracy, and the area under the receiver operating characteristic curve (ROC AUC). This enables a thorough understanding of the model’s strengths and weaknesses in both binary and multiclass settings.

The test operates by generating a detailed report that includes precision, recall, and F1-Score for each class, as well as macro and weighted averages of these metrics to provide an overall assessment. Precision measures the proportion of positive identifications that are actually correct, reflecting the model’s ability to avoid false positives. Recall quantifies the proportion of actual positives that are correctly identified, indicating the model’s sensitivity to true cases. The F1-Score harmonizes precision and recall into a single metric, balancing the trade-off between them. Accuracy represents the overall proportion of correct predictions out of all predictions made, offering a general sense of model correctness. The ROC AUC score evaluates the model’s ability to distinguish between classes across all possible classification thresholds, with values ranging from 0.5 (no discrimination) to 1.0 (perfect discrimination). These metrics are calculated using the model’s predictions and the true class labels, and are interpreted in the context of the problem domain, with higher values generally indicating better performance.

The primary advantages of this test include its versatility and comprehensiveness, as it is capable of evaluating both binary and multiclass classification models using a range of widely recognized metrics. By incorporating precision, recall, F1-Score, and accuracy, the test provides a multi-faceted view of model performance, capturing both the ability to correctly identify positive cases and to avoid false alarms. The inclusion of macro and weighted averages ensures that the evaluation remains robust even in the presence of class imbalance, while the ROC AUC metric offers valuable insight into the model’s discriminatory power, particularly for unbalanced datasets. This makes the test especially useful for model comparison, selection, and monitoring in diverse real-world scenarios.

It should be noted that the test has certain limitations and potential risks. The accuracy and interpretability of the results depend on the representativeness of the test dataset; if the data does not reflect real-world distributions, the metrics may not generalize. The test assumes that class labels are correctly identified and that the classification task is well-defined, which may not always hold in practice. Additionally, while the test provides a broad overview of performance, it may not capture nuanced behaviors such as model calibration or the impact of rare classes. Signs of high risk include low values for precision, recall, F1-Score, accuracy, or ROC AUC, as well as significant imbalances between precision and recall, which may indicate poor or unstable model performance.

This test shows the results in the form of two tables: one summarizing precision, recall, and F1-Score for each class, along with macro and weighted averages, and another presenting the overall accuracy and ROC AUC scores. The first table lists each class in the model, with columns for precision, recall, and F1-Score, allowing for direct comparison of performance across classes. The macro and weighted averages provide aggregate measures that account for class distribution and balance. The second table displays the overall accuracy, representing the proportion of correct predictions, and the ROC AUC, indicating the model’s ability to distinguish between classes. All metrics are presented as decimal values between 0 and 1, where higher values denote better performance. Notably, the precision, recall, and F1-Score for both classes are closely aligned, with values around 0.67, and the macro and weighted averages are nearly identical, suggesting balanced performance. The accuracy is 0.6754, and the ROC AUC is 0.7051, indicating moderate discriminatory power. These results suggest that the model performs consistently across classes, with no significant disparities or outliers in the reported metrics.

The test results reveal the following key insights:

  • Balanced Class Performance: Both classes exhibit similar precision, recall, and F1-Score values, with class 0 showing precision of 0.6726, recall of 0.6933, and F1-Score of 0.6828, while class 1 has precision of 0.6785, recall of 0.6573, and F1-Score of 0.6677, indicating no substantial performance gap between classes.
  • Consistent Aggregate Metrics: The macro and weighted averages for precision, recall, and F1-Score are all approximately 0.6753 to 0.6755, reflecting uniform model behavior across the dataset and suggesting that class imbalance does not significantly affect overall performance.
  • Moderate Overall Accuracy: The model achieves an accuracy of 0.6754, meaning that approximately 67.5% of predictions are correct, which is indicative of moderate predictive capability in the context of the evaluated dataset.
  • Acceptable Discriminatory Power: The ROC AUC score of 0.7051 demonstrates that the model has a reasonable ability to distinguish between the two classes, with performance above the random baseline of 0.5 but not approaching the ideal of 1.0.
  • Absence of Extreme Values: No metric falls below 0.65 or exceeds 0.71, indicating stable and consistent performance without significant outliers or areas of pronounced weakness.

Based on these results, the model demonstrates a stable and balanced classification performance across both classes, with precision, recall, and F1-Score values closely aligned and aggregate metrics reinforcing this consistency. The accuracy of 0.6754 suggests that the model correctly predicts the class in roughly two-thirds of cases, while the ROC AUC of 0.7051 indicates moderate but not exceptional discriminatory power. The absence of large disparities between class-specific metrics and the similarity between macro and weighted averages imply that the model does not favor one class over the other and is not unduly affected by class imbalance. The results collectively characterize the model as reliable and consistent within the evaluated dataset, with no evidence of severe misclassification or instability. The observed performance metrics provide a clear and objective profile of the model’s behavior, supporting its use in scenarios where moderate accuracy and balanced class treatment are acceptable.

Tables

Precision, Recall, and F1

Class Precision Recall F1
0 0.6726 0.6933 0.6828
1 0.6785 0.6573 0.6677
Weighted Average 0.6755 0.6754 0.6753
Macro Average 0.6755 0.6753 0.6753

Accuracy and ROC AUC

Metric Value
Accuracy 0.6754
ROC AUC 0.7051
2026-01-10 02:27:40,401 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion does not exist in model's document

Confusion Matrix Logreg Champion

Confusion Matrix:logreg_champion is designed to evaluate and visually represent the predictive performance of a classification machine learning model by quantifying the counts of correct and incorrect predictions across all possible classes. Its primary purpose is to provide a clear breakdown of the model’s ability to correctly identify true positives, true negatives, false positives, and false negatives, which are fundamental to understanding model accuracy and error types.

The test operates by comparing the predicted class labels generated by the model to the actual class labels from the test dataset. This comparison is structured into a confusion matrix, which is a two-dimensional table where each row represents the actual class and each column represents the predicted class. The matrix is populated with counts of predictions falling into each category: true positives (correctly predicted positive cases), true negatives (correctly predicted negative cases), false positives (incorrectly predicted positive cases), and false negatives (incorrectly predicted negative cases). The confusion matrix is then visualized as a heatmap using Plotly’s annotated heatmap functionality, which enhances interpretability by providing both color intensity and numerical annotation for each cell. The values in the matrix are non-negative integers, with higher values along the diagonal (true positives and true negatives) generally indicating better model performance, while higher off-diagonal values (false positives and false negatives) suggest areas where the model is making errors. This approach does not directly provide summary statistics like accuracy, precision, or recall, but it forms the basis for calculating these metrics.

The primary advantages of this test include its ability to deliver a comprehensive and intuitive visual summary of a classification model’s performance. By explicitly displaying the counts of each prediction type, the confusion matrix allows practitioners to quickly identify the types and frequencies of errors the model is making. This is particularly valuable in multi-class settings or when the costs of different error types vary, as it enables targeted analysis of specific misclassification patterns. The heatmap visualization further aids in rapid assessment by highlighting areas of strength and weakness through color gradients and annotations. This test is especially useful for diagnosing model behavior in complex or imbalanced datasets, as it provides granular insight into how the model handles each class, supporting more informed model evaluation and refinement.

It should be noted that the confusion matrix has several limitations and potential risks. In datasets with significant class imbalance, the matrix may give a misleading impression of model performance, as high counts in the majority class can mask poor performance in minority classes. The confusion matrix itself does not provide a single summary metric, requiring additional calculations to derive measures such as precision, recall, or F1-score for a more holistic assessment. Interpretation can be challenging without these derived metrics, particularly for non-technical stakeholders. Furthermore, the matrix is descriptive rather than inferential, offering no statistical hypothesis testing or confidence intervals. High values of false positives or false negatives, as indicated in the matrix, are signs of increased risk, as they reflect the model’s inability to correctly classify certain cases, which may have significant operational or regulatory implications depending on the application.

This test shows a confusion matrix presented as a color-annotated heatmap, where the x-axis represents the predicted class labels (0 and 1) and the y-axis represents the true class labels (0 and 1). Each cell in the matrix contains both a numerical count and a descriptive label: True Negatives (TN) in the bottom-left, False Positives (FP) in the bottom-right, False Negatives (FN) in the top-left, and True Positives (TP) in the top-right. The color intensity of each cell corresponds to the magnitude of the count, with darker shades indicating higher values. The matrix displays the following counts: 226 true negatives, 100 false positives, 110 false negatives, and 211 true positives. To interpret the matrix, one reads across each row to see how actual cases of each class are distributed among the predicted classes. The diagonal cells (TN and TP) represent correct classifications, while the off-diagonal cells (FP and FN) represent misclassifications. The range of values in this matrix spans from 100 to 226, with the highest count in the true negative cell and the lowest in the false positive cell. Notably, the number of false negatives (110) and false positives (100) are substantial, indicating that the model is making a significant number of both types of errors. The matrix provides a clear, immediate visual and quantitative summary of the model’s classification behavior on the test set.

The test results reveal the following key insights:

  • Balanced Distribution of Correct and Incorrect Classifications: The confusion matrix shows that the model achieves 226 true negatives and 211 true positives, indicating a relatively balanced ability to correctly classify both classes, but with a notable presence of errors.
  • Substantial False Negative and False Positive Rates: There are 110 false negatives and 100 false positives, which are significant relative to the true positive and true negative counts, suggesting that the model is prone to both types of misclassification.
  • True Negatives Slightly Outnumber True Positives: The model correctly identifies more negative cases (226) than positive cases (211), which may reflect underlying class distributions or model bias.
  • False Negatives Exceed False Positives: The count of false negatives (110) is slightly higher than that of false positives (100), indicating that the model is more likely to miss positive cases than to incorrectly label negatives as positives.
  • Diagonal Dominance with Noticeable Off-Diagonal Values: While the diagonal cells (correct classifications) have the highest counts, the off-diagonal cells (errors) are not negligible, highlighting areas where the model’s predictive power could be improved.

Based on these results, the confusion matrix for the logreg_champion model demonstrates that the model is capable of correctly classifying a substantial number of both positive and negative cases, as evidenced by the high counts of true positives and true negatives. However, the presence of considerable false negatives and false positives indicates that the model’s predictions are not consistently reliable, with a meaningful proportion of both types of errors. The slightly higher number of true negatives compared to true positives suggests a marginally better performance in identifying negative cases, while the higher false negative count relative to false positives points to a tendency to under-predict the positive class. The overall distribution of values in the matrix reflects a model that is neither highly conservative nor highly aggressive in its predictions, but rather one that exhibits a moderate balance between sensitivity and specificity. These observations provide a detailed characterization of the model’s classification behavior, highlighting both its strengths in correct identification and its limitations in error rates, which are critical for understanding its suitability for deployment in contexts where the costs of misclassification are significant.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion:4e92
2026-01-10 02:28:08,595 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion does not exist in model's document

❌ Minimum Accuracy Logreg Champion

Minimum Accuracy:logreg_champion is designed to assess whether the model’s prediction accuracy meets or exceeds a specified minimum threshold, serving as a fundamental check on the model’s ability to correctly classify instances within a given dataset. The primary purpose of this test is to ensure that the model achieves a baseline level of predictive performance, which is critical for establishing the model’s suitability for deployment in production or regulatory environments.

The test operates by calculating the model’s accuracy score, which is the proportion of correct predictions out of the total number of predictions made. This is achieved by comparing the true class labels from the dataset with the predicted class labels generated by the model. The accuracy metric is computed using a standard method, such as the one provided by the scikit-learn library, which counts the number of exact matches between the true and predicted labels and divides this by the total number of samples. The resulting score ranges from 0 to 1, where 1 indicates perfect accuracy and 0 indicates no correct predictions. The computed accuracy is then compared to a predefined threshold, commonly set at 0.7, to determine if the model’s performance is acceptable. If the accuracy meets or surpasses this threshold, the test is marked as passed; otherwise, it is marked as failed. This approach provides a clear, quantitative benchmark for model performance, making it straightforward to interpret and communicate.

The primary advantages of this test include its simplicity and directness, offering a holistic measure of model performance that is easy to understand and communicate to both technical and non-technical stakeholders. Because accuracy is a single, aggregate metric, it provides a quick snapshot of how well the model is performing across all classes, making it particularly useful in scenarios where class distributions are balanced. The test’s versatility allows it to be applied to both binary and multiclass classification problems, and its reliance on a well-established metric ensures consistency and comparability across different models and datasets. This makes the Minimum Accuracy test an effective initial screening tool for model validation and monitoring.

It should be noted that the Minimum Accuracy test has several limitations and potential risks. One key limitation is that accuracy can be misleading in situations where the dataset is imbalanced, as the metric may be disproportionately influenced by the majority class, masking poor performance on minority classes. The test does not provide any information about the types of errors the model is making, such as false positives or false negatives, nor does it capture more nuanced aspects of model performance like precision, recall, or the ability to handle specific subpopulations. Persistent failure to meet the threshold is a sign of high risk, indicating that the model may not be reliable for its intended use. Additionally, the test’s focus on overall correctness may not be sufficient for applications where the cost of different types of errors varies significantly.

This test shows the results in a tabular format, presenting three columns: Score, Threshold, and Pass/Fail. The Score column displays the model’s computed accuracy, which in this case is 0.6754, representing the proportion of correct predictions out of all predictions made. The Threshold column indicates the minimum acceptable accuracy, set at 0.7 for this test. The Pass/Fail column provides a categorical outcome based on whether the Score meets or exceeds the Threshold. In this instance, the model’s accuracy falls below the required threshold, resulting in a “Fail” outcome. The table is straightforward to interpret: each row corresponds to a single test run, and the values are presented as decimals for accuracy and threshold, with the pass/fail status clearly indicated. The range for the Score is from 0 to 1, and the threshold is similarly bounded. Notably, the model’s accuracy is only slightly below the threshold, suggesting that while the model is performing close to the required standard, it does not meet the minimum criterion for acceptance. There are no additional breakdowns or subgroup analyses in this output, and the result is presented as a single, aggregate measure.

The test results reveal the following key insights:

  • Model Accuracy Falls Short of Threshold: The model achieves an accuracy score of 0.6754, which is below the specified minimum threshold of 0.7, resulting in a fail outcome for this test.
  • Threshold Provides Clear Benchmark: The threshold value of 0.7 serves as a definitive benchmark for acceptable performance, and the model’s score is close but insufficient to meet this requirement.
  • Binary Pass/Fail Outcome Simplifies Interpretation: The Pass/Fail column provides an unambiguous assessment of whether the model’s accuracy is adequate, with the current result indicating that the model does not satisfy the minimum standard.
  • No Evidence of Severe Underperformance: While the model does not pass, the accuracy score is not drastically below the threshold, suggesting that the model is not severely underperforming but requires improvement to meet the acceptance criteria.

Based on these results, the model demonstrates an accuracy that is marginally below the established minimum threshold, indicating that its overall predictive performance is close to, but not sufficient for, the required standard. The test provides a clear and objective assessment of the model’s ability to correctly classify instances, with the accuracy score serving as a direct measure of performance relative to a predefined benchmark. The binary pass/fail outcome facilitates straightforward interpretation and decision-making, highlighting that the model does not currently meet the acceptance criteria. The proximity of the score to the threshold suggests that the model is not fundamentally flawed but may require further refinement or adjustment to achieve the desired level of accuracy. The results underscore the importance of considering both the absolute accuracy and the context of the threshold when evaluating model suitability, as well as the need to complement this test with additional metrics for a more comprehensive assessment of model performance.

Tables

Score Threshold Pass/Fail
0.6754 0.7 Fail
2026-01-10 02:28:31,990 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion does not exist in model's document

✅ Minimum F1 Score Logreg Champion

Minimum F1 Score: logreg_champion is designed to assess whether the model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring that the model achieves a balanced performance between precision and recall. This test is particularly important in classification tasks where the distribution of classes may be imbalanced, as it provides a more informative measure of model effectiveness than accuracy alone.

The test operates by calculating the F1 score on the validation dataset using scikit-learn's metrics in Python. For binary classification problems, the standard F1 score is computed, which represents the harmonic mean of precision and recall. For multi-class problems, macro averaging is used, which calculates the F1 score independently for each class and then averages the results, treating all classes equally regardless of their frequency. The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0 indicates the worst possible balance. The computed F1 score is then compared to a predefined threshold, which serves as the minimum acceptable performance standard. If the model's F1 score falls below this threshold, it is flagged as not meeting the required performance, indicating potential issues with the model's ability to balance false positives and false negatives.
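
As a rough illustration of that calculation, the sketch below uses scikit-learn's f1_score on hypothetical labels and predictions; the 0.5 threshold matches the value used in this run, and the average="macro" option noted in the comment is what the multi-class case would use.

from sklearn.metrics import f1_score

# Hypothetical labels and predictions for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

threshold = 0.5  # minimum acceptable F1 score used in this run

# Binary case: the standard F1 score (harmonic mean of precision and recall);
# for multi-class targets, pass average="macro" to weight every class equally.
score = f1_score(y_true, y_pred)
print(f"F1: {score:.4f} | {'Pass' if score >= threshold else 'Fail'}")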

The primary advantages of this test include its ability to provide a balanced assessment of model performance by considering both false positives and false negatives, which is especially valuable in situations with imbalanced class distributions. The F1 score is less susceptible to being skewed by the majority class, making it a more reliable indicator of model effectiveness in such scenarios. Additionally, the flexibility to set a custom threshold allows organizations to define performance standards that align with their specific risk tolerance and business objectives. This adaptability ensures that the test remains relevant across a wide range of applications and model types.

It should be noted that the F1 score assumes equal importance for precision and recall, which may not always align with real-world business requirements where the costs of false positives and false negatives differ. The test may not be suitable for all types of models or tasks, particularly those where other metrics such as precision, recall, or ROC-AUC are more appropriate. Furthermore, a model that passes the F1 threshold may still exhibit weaknesses in other areas not captured by this metric. The test also identifies high risk if the F1 score is below the established threshold, signaling that the model may not be effectively distinguishing between classes or may be biased toward one class.

This test shows the results in a tabular format, presenting three columns: "Score," "Threshold," and "Pass/Fail." The "Score" column displays the F1 score achieved by the model on the validation set, which in this case is 0.6677. The "Threshold" column indicates the minimum acceptable F1 score, set at 0.5 for this test. The "Pass/Fail" column communicates whether the model's performance meets the required standard, with a "Pass" indicating that the F1 score is above the threshold. The table is straightforward to interpret: if the "Score" is greater than or equal to the "Threshold," the model passes the test; otherwise, it fails. The F1 score of 0.6677 falls within the typical range for this metric and is notably above the threshold, suggesting that the model achieves a reasonable balance between precision and recall. There are no additional breakdowns or subgroup analyses in this result, as the test focuses solely on the overall F1 score for the validation set.

The test results reveal the following key insights:

  • Model Achieves Required F1 Score: The model's F1 score on the validation set is 0.6677, which exceeds the predefined threshold of 0.5, indicating that the model meets the minimum performance standard for balanced precision and recall.
  • Clear Pass Outcome: The "Pass/Fail" column explicitly shows a "Pass," confirming that the model's performance is satisfactory according to the established criteria.
  • Score Significantly Above Threshold: The F1 score is not only above the threshold but exceeds it by a margin of 0.1677, suggesting a comfortable buffer and reducing the likelihood of borderline performance.
  • Single Metric Focus: The test result is based solely on the overall F1 score, with no additional class-level or subgroup breakdowns, emphasizing the aggregate performance of the model.

Based on these results, the model demonstrates a balanced performance between precision and recall on the validation set, as evidenced by an F1 score of 0.6677 that comfortably surpasses the minimum threshold of 0.5. The clear "Pass" outcome indicates that the model is effective at managing the trade-off between false positives and false negatives in this context. The margin by which the score exceeds the threshold suggests that the model's performance is not marginal but rather solidly within acceptable bounds. The focus on a single, aggregate F1 score provides a straightforward assessment of overall model effectiveness, though it does not offer insights into class-specific performance or potential disparities across different segments. Overall, the results indicate that the model is well-calibrated for balanced classification tasks and is likely to perform reliably in scenarios where both precision and recall are important.

Tables

Score Threshold Pass/Fail
0.6677 0.5 Pass
2026-01-10 02:28:51,423 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:logreg_champion does not exist in model's document

ROC Curve Logreg Champion

ROC Curve: logreg_champion is designed to evaluate the performance of a binary classification model by visualizing its ability to distinguish between two classes and quantifying this ability using the Area Under the Curve (AUC) metric. The primary purpose of this test is to provide a comprehensive assessment of the model’s discriminative power across all possible classification thresholds, enabling a robust understanding of how well the model separates positive and negative cases.

The test operates by first generating predicted probabilities for each instance in the test dataset using the selected binary classification model. These probabilities, along with the true class labels, are used to calculate the True Positive Rate (TPR) and False Positive Rate (FPR) at various threshold levels. The ROC curve is then plotted with the FPR on the x-axis and the TPR on the y-axis, illustrating the trade-off between sensitivity and specificity as the threshold changes. A reference line representing random classification (AUC of 0.5) is included for context. The AUC score, which ranges from 0 to 1, is computed as a summary statistic of the ROC curve; a value closer to 1 indicates strong discriminative ability, while a value near 0.5 suggests performance no better than random guessing. The test also ensures that any infinite values in the threshold calculations are removed to maintain result integrity. The resulting ROC curve, AUC score, and associated thresholds are saved for documentation and future analysis.
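
The same quantities can be approximated outside ValidMind with scikit-learn and matplotlib. The sketch below is illustrative only: the labels and probabilities are hypothetical, and the removal of the infinite threshold mirrors the clean-up step described above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Hypothetical true labels and predicted positive-class probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.20, 0.40, 0.35, 0.80, 0.10, 0.65, 0.55, 0.70])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Drop the infinite threshold that scikit-learn prepends
# (its point adds no area, so the AUC is unaffected)
finite = np.isfinite(thresholds)
fpr, tpr, thresholds = fpr[finite], tpr[finite], thresholds[finite]

roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"model (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random (AUC = 0.50)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()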

The primary advantages of this test include its ability to provide a holistic, threshold-independent view of model performance, which is particularly valuable in scenarios where the optimal classification threshold is not predetermined or may vary depending on operational requirements. The ROC curve visually demonstrates the model’s performance across the entire range of possible thresholds, while the AUC condenses this information into a single, interpretable metric that remains consistent regardless of class distribution. This makes the test especially useful for comparing models or monitoring performance over time, even when the underlying data distribution changes. Additionally, the ROC-AUC framework is robust to imbalanced datasets, as it focuses on ranking predictions rather than absolute classification accuracy.

It should be noted that this test is specifically tailored for binary classification models and does not extend to multi-class or regression tasks. The ROC curve and AUC metric may be less informative when model outputs are highly skewed toward one class, as this can mask poor absolute classification performance. Furthermore, the ROC curve can sometimes present an overly optimistic view of model performance in the presence of severe class imbalance, as it evaluates ranking rather than actual prediction correctness. A key sign of high risk is an AUC score near or below 0.5, which indicates that the model lacks meaningful discriminative power and may be performing no better than random chance. Additionally, if the ROC curve closely follows the diagonal line of randomness, this is a clear indication that the model is not effectively distinguishing between the two classes.

This test shows a single ROC curve plot for the logistic regression champion model evaluated on the final test dataset. The plot displays the True Positive Rate (vertical axis) against the False Positive Rate (horizontal axis) for a range of classification thresholds, with both axes spanning from 0 to 1. The magenta line represents the model’s ROC curve, while the dashed gray line indicates the performance of a random classifier (AUC = 0.5). The legend in the upper right corner provides the AUC value for the model, which is 0.71, and reiterates the baseline for random performance. To interpret the plot, one should observe how far the ROC curve lies above the diagonal; the greater the area between the curve and the diagonal, the better the model’s discriminative ability. The curve’s shape reveals how the model balances sensitivity and specificity at different thresholds, with the upper left corner representing ideal performance (high TPR, low FPR). The AUC value of 0.71 quantifies the overall performance, indicating that the model has a moderate ability to distinguish between the two classes. There are no abrupt drops or irregularities in the curve, suggesting stable performance across thresholds. The plot does not display individual threshold values, but the smoothness of the curve implies consistent probability outputs from the model.

The test results reveal the following key insights:

  • Model demonstrates moderate discriminative power: The AUC score of 0.71 indicates that the model is able to distinguish between positive and negative classes with reasonable effectiveness, performing substantially better than random guessing.
  • ROC curve consistently outperforms random baseline: The magenta ROC curve remains above the diagonal line throughout the entire range of false positive rates, confirming that the model maintains discriminative ability across all thresholds.
  • Stable performance across thresholds: The ROC curve is smooth and does not exhibit sharp fluctuations, suggesting that the model’s probability outputs are well-calibrated and that performance does not degrade at specific threshold regions.
  • No evidence of high-risk behavior: The AUC value is well above the 0.5 threshold, and the ROC curve does not approach the line of randomness, indicating that the model is not at risk of failing to discriminate between classes.

Based on these results, the logistic regression champion model exhibits a moderate level of discriminative ability on the final test dataset, as evidenced by an AUC score of 0.71 and a consistently elevated ROC curve above the random baseline. The model’s performance is stable across the full spectrum of classification thresholds, with no signs of erratic behavior or threshold-specific weaknesses. The ROC curve’s shape and the AUC value together suggest that the model is reliably ranking positive cases higher than negative ones, which is a desirable characteristic for binary classification tasks where threshold selection may vary depending on operational needs. The absence of any regions where the ROC curve approaches the diagonal line further supports the conclusion that the model is not exhibiting high-risk or random-like behavior. Overall, the test results provide a clear and objective characterization of the model’s ability to separate the two classes, supporting its use in scenarios where moderate discriminative performance is acceptable.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:logreg_champion:61ca
2026-01-10 02:29:17,814 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:logreg_champion does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be added manually to your report as part of your compliance assessment process within the ValidMind Platform.

Log an artifact

As we can observe from the output above, our champion model doesn't pass the MinimumAccuracy test at its default out-of-the-box threshold, so let's log an artifact (finding) in the ValidMind Platform (Need more help?):

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Validation Report under Documents.

  3. Locate the Data Preparation section and click on 2.2.2. Model Performance to expand that section.

  4. Under the Model Performance Metrics section, locate Artifacts then click Link Artifact to Report:

    Screenshot showing the validation report with the link artifact option highlighted

  5. Select Validation Issue as the type of artifact.

  6. Click + Add Validation Issue to add a validation issue type artifact.

  7. Enter the details for your validation issue, for example:

    • TITLE — Champion Logistic Regression Model Fails Minimum Accuracy Threshold
    • RISK AREA — Model Performance
    • DOCUMENTATION SECTION — 3.2. Model Evaluation
    • DESCRIPTION — The logistic regression champion model was subjected to a Minimum Accuracy test to determine whether its predictive accuracy meets the predefined performance threshold of 0.7. The model achieved an accuracy score of 0.6754, which falls below the required minimum. As a result, the test produced a Fail outcome.
  8. Click Save.

  9. Select the validation issue you just added to link to your validation report and click Update Linked Artifacts to insert your validation issue.

  10. Click on the validation issue to expand the issue, where you can adjust details such as severity, owner, due date, status, etc. as well as include proposed remediation plans or supporting documentation as attachments.

Evaluate performance of challenger model

We've now run the same kinds of performance tests on our champion model as the model development team did, with the aim of verifying their test results.

Next, let's see how our challenger model compares. We'll use the same batch of tests here as we did with mpt, but append a different result_id to indicate that these results should be associated with our champion versus challenger comparison:

mpt_chall = [
    "validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger",
    "validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger",
    "validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger"
]

We'll run each test once for each model with the same vm_test_ds dataset to compare them:

for test in mpt_chall:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Classifier Performance Champion Vs Challenger

Classifier Performance: Champion vs Challenger is designed to evaluate and compare the predictive effectiveness of classification models by quantifying their ability to correctly identify and distinguish between classes. The primary purpose of this test is to provide a comprehensive assessment of model performance using a suite of standard metrics, including precision, recall, F1-Score, accuracy, and ROC AUC, thereby enabling objective comparison between a champion model and one or more challenger models.

The test operates by generating a detailed report of classification metrics for each model under evaluation. It utilizes the classification report from the scikit-learn library to compute precision, recall, and F1-Score for each class, as well as macro and weighted averages to summarize overall model performance. Precision measures the proportion of positive identifications that are actually correct, while recall quantifies the proportion of actual positives that are correctly identified. The F1-Score provides a harmonic mean of precision and recall, balancing the trade-off between the two. Accuracy reflects the overall proportion of correct predictions out of all predictions made. For a more nuanced view, especially in the presence of class imbalance, the test also calculates the ROC AUC score, which measures the model’s ability to discriminate between classes across all possible classification thresholds. ROC AUC values range from 0 to 1, with values closer to 1 indicating strong discriminatory power and values near 0.5 suggesting performance no better than random guessing. The test requires as input the predicted and true class labels, and, for ROC AUC, the predicted probabilities. The output is a set of tables summarizing these metrics for each model and class, allowing for direct comparison.
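
A minimal sketch of these calculations, assuming scikit-learn and hypothetical labels, hard predictions, and positive-class probabilities, looks roughly like this:

from sklearn.metrics import classification_report, roc_auc_score

# Hypothetical labels, predictions, and positive-class probabilities
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.80, 0.45, 0.30, 0.90, 0.60, 0.20, 0.70]

# Per-class precision, recall, and F1, plus macro and weighted averages
print(classification_report(y_true, y_pred, digits=4))

# Threshold-independent measure of discriminatory power
print(f"ROC AUC: {roc_auc_score(y_true, y_prob):.4f}")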

The primary advantages of this test include its versatility in handling both binary and multiclass classification problems and its comprehensive coverage of key performance metrics. By reporting precision, recall, F1-Score, and accuracy, the test provides a multi-faceted view of model behavior, capturing both the ability to correctly identify positive cases and the tendency to avoid false positives. The inclusion of macro and weighted averages ensures that the results are interpretable even in the presence of class imbalance, while the ROC AUC metric offers a robust measure of overall discriminatory power. This makes the test particularly valuable for model selection and benchmarking, as it enables stakeholders to assess not only overall accuracy but also the balance between sensitivity and specificity, and the model’s robustness to varying decision thresholds.

It should be noted that the test is subject to several limitations and interpretation challenges. The accuracy and utility of the results depend on the representativeness of the test dataset; if the data does not reflect real-world distributions, the reported metrics may not generalize. The test assumes that class labels are correctly specified and that the classification task is well-defined, which may not hold in all scenarios. While the test provides a broad overview of performance, it does not diagnose the underlying causes of poor results or suggest specific areas for model improvement. Low values for precision, recall, F1-Score, accuracy, or ROC AUC are indicative of suboptimal model performance, and significant imbalances between precision and recall may signal issues such as overfitting or underfitting. Additionally, ROC AUC values close to 0.5 suggest that the model lacks discriminatory power, which is a sign of high risk in production settings.

This test shows the results in the form of two tables. The first table presents precision, recall, and F1-Score for each class (0 and 1) for both the champion (log_model_champion) and challenger (rf_model) models, along with their weighted and macro averages. Each row corresponds to a specific model and class, and each column displays the respective metric values, which range from 0 to 1. The second table summarizes the overall accuracy and ROC AUC for each model, with accuracy representing the proportion of correct predictions and ROC AUC indicating the model’s ability to distinguish between classes. Notable observations include the generally higher metric values for the rf_model compared to the log_model_champion across all reported metrics. For example, the rf_model achieves a weighted average F1-Score of 0.6939 and a ROC AUC of 0.7625, both higher than the corresponding values for the log_model_champion (0.6753 and 0.7051, respectively). The tables are read by identifying the model and class of interest, then examining the associated metric values to assess performance. The range of values observed suggests moderate to good performance, with all metrics falling between approximately 0.65 and 0.76, and no values indicating severe underperformance.

The test results reveal the following key insights:

  • rf_model Consistently Outperforms log_model_champion: Across all reported metrics, the rf_model demonstrates higher precision, recall, and F1-Score for both classes, as well as higher weighted and macro averages, indicating superior overall performance.
  • Higher Discriminatory Power in rf_model: The ROC AUC for rf_model is 0.7625, substantially higher than the 0.7051 observed for log_model_champion, suggesting that rf_model is more effective at distinguishing between the two classes.
  • Balanced Performance Across Classes: Both models exhibit relatively balanced precision and recall between classes 0 and 1, with no extreme disparities, though rf_model maintains a slight edge in both metrics for each class.
  • Moderate to Good Accuracy Levels: The accuracy for log_model_champion is 0.6754, while rf_model achieves 0.694, indicating that both models correctly classify a substantial proportion of instances, with rf_model again showing a modest improvement.
  • No Evidence of Severe Class Imbalance Effects: The similarity between macro and weighted averages for both models suggests that class imbalance does not significantly distort the reported metrics, and both models maintain stable performance across classes.

Based on these results, the rf_model demonstrates a clear advantage over the log_model_champion in terms of both overall and class-specific performance metrics. The higher precision, recall, and F1-Score values for rf_model indicate that it is more effective at correctly identifying both positive and negative cases, while its superior ROC AUC reflects a stronger ability to discriminate between classes across varying thresholds. The balanced performance across classes and the close alignment between macro and weighted averages suggest that neither model is unduly affected by class imbalance, and both maintain consistent behavior across the dataset. The observed accuracy levels confirm that both models are capable of making correct predictions at a moderate to good rate, with rf_model providing a modest but consistent improvement. Collectively, these insights indicate that rf_model offers more robust and reliable classification performance in this context, with no evidence of severe underperformance or instability in either model.

Tables

model               Class             Precision  Recall  F1
log_model_champion  0                 0.6726     0.6933  0.6828
log_model_champion  1                 0.6785     0.6573  0.6677
log_model_champion  Weighted Average  0.6755     0.6754  0.6753
log_model_champion  Macro Average     0.6755     0.6753  0.6753
rf_model            0                 0.6905     0.7117  0.7009
rf_model            1                 0.6977     0.6760  0.6867
rf_model            Weighted Average  0.6941     0.6940  0.6939
rf_model            Macro Average     0.6941     0.6938  0.6938

model               Metric    Value
log_model_champion  Accuracy  0.6754
log_model_champion  ROC AUC   0.7051
rf_model            Accuracy  0.6940
rf_model            ROC AUC   0.7625
2026-01-10 02:29:41,807 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger does not exist in model's document

Confusion Matrix Champion Vs Challenger

Confusion Matrix: champion vs challenger is designed to evaluate and visually represent the predictive performance of classification machine learning models by quantifying the counts of true positives, true negatives, false positives, and false negatives. The primary purpose of this test is to provide a clear and interpretable summary of how well each model distinguishes between the positive and negative classes, highlighting both correct and incorrect predictions in a structured format.

The test operates by comparing the predicted class labels generated by each model against the actual observed class labels from the test dataset. For each model, a confusion matrix is constructed, where the rows represent the true class labels and the columns represent the predicted class labels. The matrix is populated with counts for each combination: true positives (cases where the model correctly predicts the positive class), true negatives (correctly predicts the negative class), false positives (incorrectly predicts positive when the true class is negative), and false negatives (incorrectly predicts negative when the true class is positive). These counts are then visualized using a heatmap, which provides an immediate graphical representation of the model’s classification behavior. The values in the matrix are non-negative integers, and higher values along the diagonal (true positives and true negatives) generally indicate better model performance, while higher off-diagonal values (false positives and false negatives) suggest areas where the model is making errors. The confusion matrix does not aggregate these results into a single performance metric but instead allows for a granular examination of the types of errors made by the model.
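
For intuition, the matrix and its heatmap can be reproduced with scikit-learn and matplotlib; the labels and predictions below are hypothetical and only illustrate the mechanics.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Hypothetical labels and predictions for illustration only
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]

# Rows are true labels, columns are predicted labels:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

ConfusionMatrixDisplay(cm, display_labels=[0, 1]).plot(cmap="Blues")
plt.show()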

The primary advantages of this test include its ability to deliver a comprehensive and easily interpretable visual summary of model performance, making it straightforward to identify strengths and weaknesses in classification. The confusion matrix is particularly valuable in scenarios where understanding the balance between different types of errors is critical, such as in medical diagnosis or fraud detection, where the costs of false positives and false negatives may differ significantly. By explicitly displaying the counts of each outcome, the test enables users to assess not only overall accuracy but also the distribution of errors, which can inform further model tuning or selection. Additionally, the confusion matrix is well-suited for both binary and multi-class classification problems, providing a scalable approach to performance evaluation across a range of use cases.

It should be noted that the confusion matrix has several limitations and potential risks. In datasets with imbalanced class distributions, the matrix may give a misleading impression of model performance, as high counts in the majority class can mask poor performance in the minority class. The confusion matrix does not provide summary statistics such as precision, recall, or F1-score, which are often necessary for a more nuanced understanding of model effectiveness, especially in imbalanced settings. Users must compute these metrics separately to gain a complete picture. Furthermore, the matrix is descriptive rather than inferential, offering no statistical hypothesis testing or confidence intervals. Interpretation challenges may arise if users focus solely on overall accuracy without considering the specific costs or implications of different error types. High numbers of false positives or false negatives, as highlighted in the test description, are signs of increased risk and should be carefully examined in the context of the application.

This test shows the results in the form of annotated heatmaps, each representing the confusion matrix for a specific model: the champion logistic regression model and the challenger random forest model. Each heatmap is a 2x2 grid, with the axes labeled as true and predicted class labels. The top left cell shows the count of true negatives, the top right shows false positives, the bottom left shows false negatives, and the bottom right shows true positives. The color intensity of each cell corresponds to the magnitude of the count, with darker shades indicating higher values. For the logistic regression model, the matrix displays 226 true negatives, 100 false positives, 110 false negatives, and 211 true positives. For the random forest model, the matrix shows 232 true negatives, 94 false positives, 104 false negatives, and 217 true positives. These values provide a direct comparison of the two models’ abilities to correctly and incorrectly classify each class. The heatmaps allow users to quickly assess where each model excels or struggles, with particular attention to the off-diagonal cells that represent misclassifications. The range of values in each cell is determined by the size of the test set and the distribution of the true labels. Notable observations include the relatively balanced distribution of errors between false positives and false negatives for both models, as well as the slightly higher true positive and true negative counts for the random forest model compared to the logistic regression model.

The test results reveal the following key insights:

  • Random Forest Model Achieves Higher Correct Classification Counts: The random forest model records 217 true positives and 232 true negatives, both higher than the logistic regression model’s 211 true positives and 226 true negatives, indicating a marginally better ability to correctly identify both classes.
  • Random Forest Model Reduces Misclassification Rates: The random forest model produces fewer false positives (94) and false negatives (104) compared to the logistic regression model, which has 100 false positives and 110 false negatives, suggesting improved error control.
  • Error Distribution Remains Balanced Across Models: Both models exhibit a similar pattern in the distribution of errors, with false positives and false negatives occurring at comparable rates, reflecting consistent model behavior across the two approaches.
  • Magnitude of Classifications Reflects Test Set Composition: The total counts in each matrix cell are closely aligned, indicating that both models are evaluated on the same test set and that the class distribution is relatively stable.
  • Visual Representation Highlights Areas for Further Analysis: The heatmaps make it easy to identify that the majority of predictions fall along the diagonal, but the presence of non-negligible off-diagonal values underscores the importance of further investigation into the causes of misclassification.

Based on these results, both the logistic regression and random forest models demonstrate a similar overall pattern in their classification performance, with the random forest model showing a slight advantage in both true positive and true negative counts. The reduction in false positives and false negatives for the random forest model suggests a more effective balance between sensitivity and specificity, which may be beneficial depending on the application’s requirements. The close alignment in the total number of predictions across both models indicates that the evaluation is consistent and that observed differences are attributable to model behavior rather than data artifacts. The heatmaps provide a clear visual summary that facilitates direct comparison, making it straightforward to identify the random forest model’s incremental improvements in correct classification and error reduction. The balanced distribution of errors across both models suggests that neither model is disproportionately favoring one class over the other, and the observed error rates highlight the need for further analysis if minimizing specific types of misclassification is critical. Overall, the confusion matrix results offer a transparent and interpretable basis for understanding the comparative strengths and weaknesses of the champion and challenger models in this classification task.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:a4c9
ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:2abc
2026-01-10 02:30:09,487 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger does not exist in model's document

❌ Minimum Accuracy Champion Vs Challenger

Minimum Accuracy: Champion vs Challenger is designed to assess whether a model’s prediction accuracy meets or exceeds a specified minimum threshold, ensuring that the model’s overall correctness in classifying instances is sufficient for deployment or further consideration. The primary purpose of this test is to provide a straightforward, quantitative check on the model’s ability to make correct predictions, serving as a baseline measure of performance for both binary and multiclass classification tasks.

The test operates by calculating the accuracy score for each model under evaluation, which is the proportion of correct predictions out of the total number of predictions made. This is achieved by comparing the true labels from the dataset to the predicted labels generated by the model, using a standard method such as sklearn’s accuracy_score. The resulting accuracy value ranges from 0 to 1, where 1 indicates perfect prediction and 0 indicates no correct predictions. The test then compares this score to a predetermined threshold, commonly set at 0.7, to determine if the model’s performance is acceptable. If the accuracy score meets or exceeds the threshold, the model passes the test; otherwise, it fails. This mechanism provides a clear, interpretable metric for evaluating model performance, with higher values indicating better overall correctness and lower values signaling potential inadequacy for the intended application.
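
The champion-versus-challenger comparison amounts to applying that same check to each model against a shared threshold. The sketch below is a self-contained illustration on synthetic stand-in data and models (not the notebook's vm_log_model and vm_rf_model), showing the per-model pass/fail logic the test applies.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data and models purely for illustration
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "log_model_champion": LogisticRegression(max_iter=1000).fit(X_train, y_train),
    "rf_model": RandomForestClassifier(random_state=42).fit(X_train, y_train),
}

threshold = 0.7  # default minimum accuracy
for name, clf in models.items():
    score = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: Score={score:.4f} Threshold={threshold} "
          f"{'Pass' if score >= threshold else 'Fail'}")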

The primary advantages of this test include its simplicity and directness, making it an effective initial screening tool for model performance. Because accuracy is easy to interpret and calculate, it allows for rapid comparison across different models or iterations. This test is particularly useful when the dataset has balanced classes, as it reflects the model’s ability to correctly classify all categories without bias. Additionally, its applicability to both binary and multiclass problems makes it a versatile component of model evaluation pipelines, providing a consistent benchmark for minimum acceptable performance.

It should be noted that the Minimum Accuracy test has several limitations and potential risks. Accuracy can be misleading in situations where the dataset is imbalanced, as a model may achieve a high accuracy score simply by predicting the majority class most of the time, without truly learning to distinguish between classes. This test does not account for the types of errors made, such as false positives or false negatives, nor does it provide insight into the model’s precision or recall. Persistent failure to meet the threshold is a sign of high risk, indicating that the model may not be suitable for production use. Furthermore, relying solely on accuracy may obscure important nuances in model behavior, especially in domains where certain types of errors carry greater consequences.

This test shows the results in a tabular format, presenting each model evaluated alongside its calculated accuracy score, the threshold used for evaluation, and the resulting pass or fail status. The table includes columns for the model name, the accuracy score (expressed as a decimal between 0 and 1), the threshold value, and a categorical indicator of whether the model passed or failed the test. For example, the "log_model_champion" achieved an accuracy score of 0.6754, while the "rf_model" achieved 0.694, both compared against a threshold of 0.7. The "Pass/Fail" column clearly indicates that both models failed to meet the minimum accuracy requirement. The table format allows for straightforward comparison between models, highlighting not only the absolute performance but also the margin by which each model falls short of the threshold. The values are precise to four decimal places, enabling detailed scrutiny of model performance relative to the set standard.

The test results reveal the following key insights:

  • Both Models Fall Short of Minimum Accuracy: Neither the "log_model_champion" nor the "rf_model" achieves the required accuracy threshold of 0.7, with scores of 0.6754 and 0.694 respectively, resulting in a fail status for both.
  • RF Model Marginally Outperforms Champion: The "rf_model" demonstrates a slightly higher accuracy than the "log_model_champion," outperforming it by approximately 0.0186, yet still does not meet the threshold.
  • Consistent Threshold Application: The threshold of 0.7 is uniformly applied to both models, ensuring a fair and direct comparison of their performance.
  • Clear Pass/Fail Delineation: The "Pass/Fail" column provides an unambiguous assessment of each model’s status relative to the minimum accuracy requirement, facilitating rapid identification of models that do not meet baseline standards.

Based on these results, both evaluated models do not achieve the minimum required accuracy, as indicated by their respective scores of 0.6754 for the "log_model_champion" and 0.694 for the "rf_model," both falling below the 0.7 threshold. The "rf_model" shows a marginally better performance compared to the "log_model_champion," but the difference is not sufficient to alter the overall outcome. The uniform application of the threshold across models ensures that the comparison is equitable and that the results are directly interpretable. The clear pass/fail status in the results table highlights that neither model currently meets the baseline standard for accuracy, suggesting that further evaluation or model refinement may be necessary before deployment. The observed accuracy values, while close to the threshold, indicate that the models are not yet achieving the level of correctness required for reliable operation in their intended context. The results provide a transparent and objective assessment of model performance against a predefined standard, supporting informed decision-making regarding model selection and further development.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6754 0.7 Fail
rf_model 0.6940 0.7 Fail
2026-01-10 02:30:28,513 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger does not exist in model's document

✅ Minimum F1 Score Champion Vs Challenger

Minimum F1 Score: champion vs challenger is designed to evaluate whether the F1 score of a model on the validation dataset meets or exceeds a predefined minimum threshold, ensuring that the model achieves a balanced trade-off between precision and recall. This test is particularly important in classification tasks where the distribution of classes may be imbalanced, as it provides a more informative measure of model performance than accuracy alone.

The test operates by calculating the F1 score for each model using the validation dataset. The F1 score is a metric that combines both precision, which measures the proportion of true positive predictions among all positive predictions, and recall, which measures the proportion of true positive predictions among all actual positive cases. For binary classification problems, the standard F1 score calculation is used, while for multi-class problems, macro averaging is applied to ensure that each class contributes equally to the final score. The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0 indicates the worst possible balance. The computed F1 score for each model is then compared to a predefined threshold, typically set based on business or regulatory requirements. If the model's F1 score meets or exceeds this threshold, it is considered to have passed the test; otherwise, it is flagged as not meeting the minimum performance standard.

The primary advantages of this test include its ability to provide a balanced assessment of model performance by accounting for both false positives and false negatives, which is especially valuable in situations with class imbalance. The F1 score is less susceptible to being skewed by the majority class, making it a more reliable indicator of a model's ability to correctly identify minority class instances. Additionally, the flexibility to set a minimum threshold allows organizations to define clear, context-specific performance standards that models must meet before deployment. This ensures that models are not only accurate but also robust in their ability to generalize to new data, particularly in high-stakes or regulated environments.

It should be noted that the F1 score assumes equal importance for precision and recall, which may not align with all business objectives or regulatory requirements, especially in cases where the cost of false positives and false negatives differs significantly. The test may not be suitable for all types of models or tasks, such as those where other metrics like precision, recall, or ROC-AUC are more relevant. Additionally, a model passing the F1 threshold does not guarantee optimal performance across all relevant metrics, and reliance solely on the F1 score may overlook important nuances in model behavior. High risk is indicated if a model's F1 score falls below the established threshold, suggesting inadequate balance between precision and recall and potential failure to effectively identify positive cases while minimizing false positives.

This test shows the results in a tabular format, presenting each model evaluated, its corresponding F1 score, the minimum threshold required, and a pass/fail indicator. The table includes two models: "log_model_champion" and "rf_model." The "Score" column displays the F1 score achieved by each model on the validation set, with values of 0.6677 for "log_model_champion" and 0.6867 for "rf_model." The "Threshold" column shows the minimum acceptable F1 score, set at 0.5 for both models. The "Pass/Fail" column indicates whether each model's F1 score meets or exceeds the threshold, with both models marked as "Pass." The F1 scores are presented as decimal values between 0 and 1, allowing for straightforward comparison against the threshold. Notably, both models achieve F1 scores well above the minimum requirement, indicating balanced performance in terms of precision and recall on the validation dataset. The table format enables easy identification of which models satisfy the minimum performance criteria and highlights the relative performance of each model.

The test results reveal the following key insights:

  • All models exceed the minimum F1 threshold: Both "log_model_champion" and "rf_model" achieve F1 scores above the required threshold of 0.5, with scores of 0.6677 and 0.6867, respectively.
  • rf_model demonstrates the highest F1 score: Among the models evaluated, "rf_model" attains the highest F1 score at 0.6867, indicating a slightly better balance between precision and recall compared to "log_model_champion."
  • Consistent threshold application across models: The minimum F1 score threshold is uniformly set at 0.5 for both models, ensuring a fair and consistent evaluation standard.
  • Clear pass/fail outcomes facilitate interpretation: The inclusion of a "Pass/Fail" column provides immediate clarity on which models meet the minimum performance requirement, with both models passing the test.
  • F1 scores indicate robust validation performance: The observed F1 scores, both significantly above the threshold, suggest that the models maintain strong performance on the validation dataset, with no immediate signs of underperformance in terms of the balance between precision and recall.

Based on these results, both "log_model_champion" and "rf_model" demonstrate F1 scores that comfortably exceed the predefined minimum threshold of 0.5, indicating that each model achieves a satisfactory balance between precision and recall on the validation dataset. The "rf_model" shows a marginally higher F1 score than the "log_model_champion," suggesting a slight advantage in its ability to correctly identify positive cases while minimizing false positives and false negatives. The uniform application of the threshold across models ensures that the evaluation is consistent and unbiased. The clear pass/fail outcomes in the results table make it straightforward to determine which models meet the required performance standard. The F1 scores observed are well within the upper range of the metric, reflecting robust model behavior in the context of the validation data. These observations collectively indicate that both models are performing reliably with respect to the balanced metric of F1 score, and there are no indications of performance issues related to the trade-off between precision and recall within the scope of this test.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6677 0.5 Pass
rf_model 0.6867 0.5 Pass
2026-01-10 02:30:48,983 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger does not exist in model's document

ROC Curve Champion Vs Challenger

ROC Curve: Champion vs Challenger is designed to evaluate the performance of binary classification models by visualizing their ability to distinguish between two classes and quantifying this capability using the Area Under the Curve (AUC) metric. The primary purpose of this test is to provide a comprehensive assessment of how well each model can separate positive and negative cases across all possible classification thresholds, offering a robust measure of model discrimination that is not tied to any single decision boundary.

The test operates by first selecting the relevant binary classification models and applying them to a designated test dataset. For each model, the predicted probabilities for the positive class are computed for all test samples. These probabilities, along with the true class labels, are used to calculate the True Positive Rate (TPR) and False Positive Rate (FPR) at various threshold levels, which are then plotted to form the Receiver Operating Characteristic (ROC) curve. The ROC curve visually represents the trade-off between sensitivity (TPR) and the rate of false alarms (FPR) as the classification threshold varies. The Area Under the Curve (AUC) is then calculated, summarizing the overall performance of the model into a single value ranging from 0 to 1, where 1 indicates perfect discrimination and 0.5 corresponds to random guessing. The test also includes a reference line representing random performance (AUC = 0.5) for direct comparison. Any infinite values in the threshold calculations are removed to ensure the integrity of the results. The ROC curves, AUC scores, and associated thresholds are saved for documentation and further analysis.
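
To eyeball this comparison outside ValidMind, both curves can also be overlaid on a single set of axes (the ValidMind test produces one figure per model). The sketch below assumes two fitted classifiers that expose predict_proba and a shared X_test / y_test split; champion and challenger are placeholder names.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Placeholder names: champion and challenger are fitted classifiers,
# X_test / y_test the shared test features and labels
models = {"log_model_champion": champion, "rf_model": challenger}

for name, clf in models.items():
    probs = clf.predict_proba(X_test)[:, 1]  # positive-class probabilities
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")

plt.plot([0, 1], [0, 1], linestyle="--", label="random (AUC = 0.50)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()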

The primary advantages of this test include its ability to provide a holistic and threshold-independent evaluation of model discrimination, making it particularly valuable in scenarios where the optimal classification threshold is not predetermined or may vary over time. The ROC curve offers a visual summary of model performance across all possible thresholds, allowing stakeholders to assess the model's behavior under different operating conditions. The AUC metric, being invariant to class distribution, ensures that the evaluation remains consistent even when the dataset is imbalanced, which is a common challenge in many real-world applications. This makes the ROC-AUC framework especially useful for comparing multiple models or monitoring model performance over time, as it distills complex classification behavior into interpretable and actionable insights.

It should be noted that this test is specifically designed for binary classification tasks and does not extend to multi-class or regression models. Additionally, the ROC-AUC metric may not fully capture model performance in cases where predicted probabilities are highly skewed toward the extremes, potentially masking issues with calibration or class imbalance. In situations where the majority of predictions are incorrect but the ranking of probabilities is preserved, the AUC can still appear artificially high, which may lead to overestimation of model effectiveness. A key sign of elevated risk is an AUC score approaching 0.5, indicating that the model's predictions are no better than random chance. Furthermore, if the ROC curve closely follows the diagonal line of randomness, it signals a lack of discriminative power. These limitations highlight the importance of interpreting ROC-AUC results in conjunction with other performance metrics and domain knowledge.

This test shows the results in the form of ROC curve plots for two models: a logistic regression model (log_model_champion) and a random forest model (rf_model), both evaluated on the same test dataset. Each plot displays the ROC curve, which traces the relationship between the True Positive Rate (vertical axis) and the False Positive Rate (horizontal axis) as the classification threshold is varied from 0 to 1. The solid colored line represents the model's performance, while the dashed diagonal line indicates the performance of a random classifier (AUC = 0.5). The AUC value is prominently displayed in the legend for each model, providing a quantitative summary of the model's ability to distinguish between the two classes. For the logistic regression model, the AUC is 0.71, and for the random forest model, the AUC is 0.76. The curves for both models consistently lie above the random line, indicating meaningful discriminative power. The plots allow for direct visual comparison of the two models, with the random forest model's curve generally staying further from the diagonal, especially at lower false positive rates, suggesting stronger performance. The axes range from 0 to 1, and the curves are smooth, indicating stable probability estimates across thresholds. No abrupt changes or irregularities are observed, and both models achieve their highest true positive rates at the upper end of the threshold spectrum.

The test results reveal the following key insights:

  • Random Forest Model Demonstrates Superior Discrimination: The random forest model achieves an AUC of 0.76, outperforming the logistic regression model, which has an AUC of 0.71, indicating stronger overall ability to distinguish between positive and negative cases.
  • Both Models Exceed Random Performance: Both ROC curves consistently lie above the diagonal line representing random guessing (AUC = 0.5), confirming that each model provides meaningful predictive value on the test dataset.
  • Stable Probability Estimates Across Thresholds: The ROC curves for both models are smooth and continuous, with no abrupt jumps or irregularities, suggesting that the models produce stable probability estimates as the threshold varies.
  • Greater Separation at Lower False Positive Rates: The random forest model's ROC curve maintains a higher true positive rate than the logistic regression model, particularly at lower false positive rates, which is advantageous in applications where minimizing false alarms is critical.
  • No Evidence of Discriminative Failure: Neither model's ROC curve approaches the line of randomness, and both AUC values are well above the 0.5 threshold, indicating that there is no sign of model collapse or loss of discriminative power in this evaluation.

Based on these results, the random forest model demonstrates a stronger ability to separate positive and negative cases compared to the logistic regression model, as evidenced by its higher AUC score and more favorable ROC curve positioning across the full range of thresholds. Both models provide predictive value that is clearly superior to random guessing, with stable and consistent probability estimates reflected in the smoothness of their ROC curves. The random forest model's advantage is most pronounced at lower false positive rates, which may be particularly relevant in operational contexts where the cost of false positives is high. The absence of any ROC curve segments near the line of randomness and the lack of abrupt changes in curve shape further support the reliability of these models' probability outputs on the test dataset. Overall, the comparative analysis of ROC curves and AUC scores provides clear evidence of the relative strengths of the two models in binary classification tasks, with the random forest model emerging as the more effective discriminator under the conditions evaluated.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:d7b5
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:d2be
2026-01-10 02:31:24,114 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger does not exist in model's document
Based on the performance metrics, our challenger random forest classification model outperforms our champion across the board, although in this run it still falls marginally short of the MinimumAccuracy threshold.

In your validation report, support the recommendation in your validation issue's Proposed Remediation Plan to investigate the use of our challenger model by inserting the performance tests logged with this notebook into the appropriate section.

Run diagnostic tests

Next, we want to compare the robustness and stability of our champion and challenger models.

Use list_tests() to list all available diagnosis tests applicable to classification tasks:

vm.tests.list_tests(tags=["model_diagnosis"], task="classification")
ID: validmind.model_validation.sklearn.OverfitDiagnosis
  Name: Overfit Diagnosis
  Description: Assesses potential overfitting in a model's predictions, identifying regions where performance between training and...
  Has Figure: True | Has Table: True
  Required Inputs: ['model', 'datasets']
  Params: {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}}
  Tags: ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis']
  Tasks: ['classification', 'regression']

ID: validmind.model_validation.sklearn.RobustnessDiagnosis
  Name: Robustness Diagnosis
  Description: Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions....
  Has Figure: True | Has Table: True
  Required Inputs: ['datasets', 'model']
  Params: {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': 'List', 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}}
  Tags: ['sklearn', 'model_diagnosis', 'visualization']
  Tasks: ['classification', 'regression']

ID: validmind.model_validation.sklearn.WeakspotsDiagnosis
  Name: Weakspots Diagnosis
  Description: Identifies and visualizes weak spots in a machine learning model's performance across various sections of the...
  Has Figure: True | Has Table: True
  Required Inputs: ['datasets', 'model']
  Params: {'features_columns': {'type': 'Optional', 'default': None}, 'metrics': {'type': 'Optional', 'default': None}, 'thresholds': {'type': 'Optional', 'default': None}}
  Tags: ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization']
  Tasks: ['classification', 'text_classification']

Let’s now use the OverfitDiagnosis test to assess the models for potential signs of overfitting and to identify any sub-segments where performance may be inconsistent.

Overfitting occurs when a model learns the training data too well, capturing not only the true underlying pattern but also noise and random fluctuations. This results in excellent performance on the training dataset but poor generalization to new, unseen data:

  • Since the training dataset (vm_train_ds) was used to fit the model, we use this set to establish a baseline performance for how well the model performs on data it has already seen.
  • The testing dataset (vm_test_ds) was never seen during training, and here simulates real-world generalization, or how well the model performs on new, unseen data.
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

Overfit Diagnosis Champion Vs Challenger

Overfit Diagnosis: champion vs challenger is designed to assess potential overfitting in a model’s predictions by identifying regions where the performance between training and testing sets deviates significantly. The primary purpose of this test is to pinpoint specific feature segments or regions where the model may be overfitting, thereby providing a detailed view of model generalization across different data partitions.

The test operates by comparing the model’s performance on training and test datasets, grouped by feature columns. For each feature, the data is binned into segments, and the model’s performance is evaluated within each segment using a relevant metric—AUC for classification models and MSE for regression models. The difference between the training and test performance metrics is calculated for each segment, resulting in a “gap” value. If this gap exceeds a predefined threshold (default 0.04), the segment is flagged as a potential overfitting region. The methodology relies on the principle that a well-generalized model should exhibit similar performance on both training and test data across all feature segments. The AUC metric, which ranges from 0 to 1, measures the model’s ability to discriminate between classes, with higher values indicating better performance. A large positive or negative gap suggests that the model’s predictive power is not consistent between training and test data, which is indicative of overfitting or underfitting in those regions. The results are visualized as bar plots, where the y-axis represents the AUC gap and the x-axis represents feature bins, with a horizontal line marking the overfitting threshold.

The primary advantages of this test include its ability to localize overfitting to specific feature regions, rather than providing only a global assessment. This granularity enables targeted analysis and debugging, as practitioners can identify exactly where the model’s generalization breaks down. The test’s flexibility in supporting both classification and regression models, as well as its compatibility with multiple performance metrics, makes it broadly applicable across different modeling scenarios. The visualizations produced by the test facilitate intuitive interpretation, allowing users to quickly spot problematic regions. By surfacing overfitting at the segment level, the test supports more informed model refinement and risk management, especially in regulated environments where transparency and explainability are critical.

It should be noted that the test’s effectiveness depends on the appropriateness of the chosen threshold, which may require tuning for different datasets or business contexts. The default threshold of 0.04 may not capture more subtle forms of overfitting that fall below this value, potentially missing nuanced generalization issues. Additionally, the test assumes that the binning of features adequately represents meaningful data segments; poor binning choices can obscure or exaggerate overfitting signals. Interpretation challenges may arise in regions with small sample sizes, where performance metrics can be unstable. High-risk signs include significant gaps between training and test performance for specific segments, multiple regions exceeding the threshold, and larger-than-expected differences in predicted versus actual values on the test set.

This test shows the results in both tabular and graphical formats. The tables present, for each model and feature, the feature segment (bin), the number of training and test records in that segment, the training and test AUC values, and the calculated gap. The bar plots visualize the AUC gap for each feature segment, with the overfitting threshold marked as a horizontal line. For the “log_model_champion,” the AUC gaps across features such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Geography_Germany are displayed. Most segments show moderate gaps, with a few exceeding the 0.04 threshold, indicating localized overfitting. For the “rf_model,” the gaps are substantially larger and more widespread, with many segments showing gaps well above the threshold, often exceeding 0.2 and in some cases reaching as high as 1.0. The plots make it easy to identify which feature bins are most affected, as bars crossing the threshold line are visually prominent. The range of AUC gaps varies by model and feature, with the random forest model exhibiting consistently higher gaps across nearly all features and bins. Notable data points include extreme gaps in the rf_model for low-frequency bins, such as Balance and CreditScore, where the training AUC is perfect (1.0) but the test AUC drops sharply, resulting in large positive gaps. In contrast, the log_model_champion shows more moderate and isolated overfitting, with most gaps remaining below 0.1 except for a few segments.

The test results reveal the following key insights:

  • Random Forest Model Exhibits Widespread Overfitting: The rf_model shows large AUC gaps across nearly all feature segments, with gaps frequently exceeding 0.2 and reaching up to 1.0 in some Balance bins, indicating severe overfitting throughout the feature space.
  • Logistic Regression Model Shows Localized Overfitting: The log_model_champion demonstrates more moderate and isolated overfitting, with most AUC gaps below 0.1 and only a few segments, such as Tenure (2.0, 3.0] and Balance (150538.854, 175628.663], exceeding the 0.04 threshold.
  • Feature-Specific Patterns in Overfitting: For both models, certain features such as Balance, CreditScore, and Tenure are more prone to overfitting, with specific bins consistently showing higher gaps, while other features like Geography_Spain and Gender_Male remain relatively stable.
  • Sample Size Effects on Gap Stability: Segments with fewer records, particularly in the rf_model, display the most extreme AUC gaps, suggesting that low sample sizes contribute to instability and exaggerated overfitting signals.
  • Threshold Exceedance is Model-Dependent: The overfitting threshold of 0.04 is exceeded in nearly every segment for the rf_model, while for the log_model_champion, only select bins cross this line, highlighting the difference in generalization between the two models.
  • Consistent Training AUC of 1.0 in Random Forest: The rf_model achieves perfect training AUC in all segments, while test AUC varies widely, reinforcing the observation of overfitting due to model complexity and lack of regularization.

Based on these results, the Overfit Diagnosis test provides a clear comparative view of overfitting behavior between the champion logistic regression model and the challenger random forest model. The logistic regression model maintains relatively stable generalization across most feature segments, with only a few localized regions where the training and test AUC diverge beyond the threshold, indicating isolated overfitting. In contrast, the random forest model displays pervasive overfitting, as evidenced by consistently large AUC gaps across nearly all features and bins, with the most pronounced effects in segments with limited data. The visualizations and tabular data together highlight that the random forest’s complexity leads to memorization of the training data, resulting in poor generalization to unseen data, especially in regions with sparse representation. The logistic regression model, while not immune to overfitting, demonstrates a more controlled and interpretable pattern, with overfitting confined to specific, identifiable regions. These observations underscore the importance of model selection and complexity management in achieving robust generalization, as well as the value of segment-level diagnostics in uncovering nuanced model behaviors that may not be apparent from aggregate performance metrics alone.

Tables

model Feature Slice Number of Training Records Number of Test Records Training AUC Test AUC Gap
log_model_champion CreditScore (600.0, 650.0] 476 127 0.6809 0.6079 0.0730
log_model_champion CreditScore (750.0, 800.0] 235 54 0.6974 0.6542 0.0432
log_model_champion Tenure (2.0, 3.0] 261 63 0.6758 0.5152 0.1606
log_model_champion Tenure (4.0, 5.0] 252 65 0.7236 0.6419 0.0817
log_model_champion Tenure (7.0, 8.0] 260 64 0.7457 0.6402 0.1055
log_model_champion Balance (150538.854, 175628.663] 181 58 0.6573 0.5821 0.0751
log_model_champion Balance (200718.472, 225808.281] 16 2 0.0714 0.0000 0.0714
log_model_champion NumOfProducts (2.8, 3.1] 156 35 0.7274 0.6176 0.1098
log_model_champion HasCrCard (-0.001, 0.1] 766 199 0.6733 0.6223 0.0510
log_model_champion EstimatedSalary (60005.85, 80003.94] 275 76 0.6808 0.6140 0.0668
log_model_champion EstimatedSalary (80003.94, 100002.03] 255 61 0.6837 0.6204 0.0633
log_model_champion EstimatedSalary (100002.03, 120000.12] 271 57 0.6651 0.5601 0.1050
log_model_champion Geography_Germany (0.9, 1.0] 803 187 0.6409 0.5629 0.0780
rf_model CreditScore (400.0, 450.0] 39 15 1.0000 0.5800 0.4200
rf_model CreditScore (450.0, 500.0] 121 31 1.0000 0.6333 0.3667
rf_model CreditScore (500.0, 550.0] 284 78 1.0000 0.8056 0.1944
rf_model CreditScore (550.0, 600.0] 389 89 1.0000 0.7579 0.2421
rf_model CreditScore (600.0, 650.0] 476 127 1.0000 0.7181 0.2819
rf_model CreditScore (650.0, 700.0] 484 109 1.0000 0.8104 0.1896
rf_model CreditScore (700.0, 750.0] 384 105 1.0000 0.7308 0.2692
rf_model CreditScore (750.0, 800.0] 235 54 1.0000 0.8000 0.2000
rf_model CreditScore (800.0, 850.0] 162 36 1.0000 0.7711 0.2289
rf_model Tenure (-0.01, 1.0] 368 95 1.0000 0.6676 0.3324
rf_model Tenure (1.0, 2.0] 281 62 1.0000 0.7990 0.2010
rf_model Tenure (2.0, 3.0] 261 63 1.0000 0.7312 0.2688
rf_model Tenure (3.0, 4.0] 258 74 1.0000 0.6630 0.3370
rf_model Tenure (4.0, 5.0] 252 65 1.0000 0.8110 0.1890
rf_model Tenure (5.0, 6.0] 222 77 1.0000 0.8323 0.1677
rf_model Tenure (6.0, 7.0] 283 56 1.0000 0.8764 0.1236
rf_model Tenure (7.0, 8.0] 260 64 1.0000 0.6172 0.3828
rf_model Tenure (8.0, 9.0] 268 60 1.0000 0.8420 0.1580
rf_model Tenure (9.0, 10.0] 132 31 1.0000 0.9118 0.0882
rf_model Balance (-250.898, 25089.809] 845 211 1.0000 0.8328 0.1672
rf_model Balance (50179.618, 75269.427] 98 23 1.0000 0.5231 0.4769
rf_model Balance (75269.427, 100359.236] 273 80 1.0000 0.7074 0.2926
rf_model Balance (100359.236, 125449.045] 599 145 1.0000 0.7727 0.2273
rf_model Balance (125449.045, 150538.854] 497 114 1.0000 0.6901 0.3099
rf_model Balance (150538.854, 175628.663] 181 58 1.0000 0.6375 0.3625
rf_model Balance (175628.663, 200718.472] 53 10 1.0000 0.8333 0.1667
rf_model Balance (200718.472, 225808.281] 16 2 1.0000 0.0000 1.0000
rf_model NumOfProducts (0.997, 1.3] 1475 378 1.0000 0.6608 0.3392
rf_model NumOfProducts (1.9, 2.2] 917 227 1.0000 0.6803 0.3197
rf_model NumOfProducts (2.8, 3.1] 156 35 1.0000 0.7500 0.2500
rf_model HasCrCard (-0.001, 0.1] 766 199 1.0000 0.7504 0.2496
rf_model HasCrCard (0.9, 1.0] 1819 448 1.0000 0.7678 0.2322
rf_model IsActiveMember (-0.001, 0.1] 1378 351 1.0000 0.7350 0.2650
rf_model IsActiveMember (0.9, 1.0] 1207 296 1.0000 0.7573 0.2427
rf_model EstimatedSalary (-188.401, 20009.67] 255 73 1.0000 0.7870 0.2130
rf_model EstimatedSalary (20009.67, 40007.76] 234 66 1.0000 0.8540 0.1460
rf_model EstimatedSalary (40007.76, 60005.85] 245 68 1.0000 0.7299 0.2701
rf_model EstimatedSalary (60005.85, 80003.94] 275 76 1.0000 0.6989 0.3011
rf_model EstimatedSalary (80003.94, 100002.03] 255 61 1.0000 0.7161 0.2839
rf_model EstimatedSalary (100002.03, 120000.12] 271 57 1.0000 0.6074 0.3926
rf_model EstimatedSalary (120000.12, 139998.21] 259 68 1.0000 0.7896 0.2104
rf_model EstimatedSalary (139998.21, 159996.3] 260 60 1.0000 0.8131 0.1869
rf_model EstimatedSalary (159996.3, 179994.39] 283 59 1.0000 0.7563 0.2437
rf_model EstimatedSalary (179994.39, 199992.48] 248 59 1.0000 0.8444 0.1556
rf_model Geography_Germany (-0.001, 0.1] 1782 460 1.0000 0.7443 0.2557
rf_model Geography_Germany (0.9, 1.0] 803 187 1.0000 0.7298 0.2702
rf_model Geography_Spain (-0.001, 0.1] 2006 481 1.0000 0.7558 0.2442
rf_model Geography_Spain (0.9, 1.0] 579 166 1.0000 0.7775 0.2225
rf_model Gender_Male (-0.001, 0.1] 1242 313 1.0000 0.7671 0.2329
rf_model Gender_Male (0.9, 1.0] 1343 334 1.0000 0.7439 0.2561

Figures

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6d31
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:8638
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:72f0
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:85b0
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:5335
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:e03b
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:44d4
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:87d9
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:a4ec
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:f347
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:ca0b
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:035a
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:562f
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:b6a4
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:f2ed
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:b2e1
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:bc34
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:f946
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0082
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0f4e
2026-01-10 02:32:26,233 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger does not exist in model's document
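
To make the gap calculation described above concrete, here is a minimal, standalone sketch of the segment-level train-versus-test AUC comparison. This is not the ValidMind implementation: the DataFrame names (train_df, test_df), the Exited target column, and the use of the underlying (unwrapped) sklearn estimator are assumptions for illustration, and the 0.04 flagging threshold simply mirrors the test's cut_off_threshold parameter shown in the list_tests() output above.

import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_gap_by_segment(train_df, test_df, estimator, feature, target="Exited", n_bins=10, threshold=0.04):
    # Derive bin edges from the training data and reuse them for the test split
    _, edges = pd.cut(train_df[feature], bins=n_bins, retbins=True)
    feature_cols = [c for c in train_df.columns if c != target]
    rows = []
    for split, df in [("train", train_df), ("test", test_df)]:
        binned = df.assign(_bin=pd.cut(df[feature], bins=edges))
        for segment_label, segment in binned.groupby("_bin", observed=True):
            if segment[target].nunique() < 2:
                continue  # AUC is undefined when a segment contains a single class
            scores = estimator.predict_proba(segment[feature_cols])[:, 1]
            rows.append({"slice": str(segment_label), "split": split, "auc": roc_auc_score(segment[target], scores)})
    gaps = pd.DataFrame(rows).pivot(index="slice", columns="split", values="auc").dropna()
    gaps["gap"] = gaps["train"] - gaps["test"]
    gaps["flagged"] = gaps["gap"].abs() > threshold
    return gaps

Applied to the challenger's underlying random forest with a feature such as Balance, this kind of calculation would be expected to surface the same pattern of large gaps reported in the tables above.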

Let's also conduct robustness and stability testing of the two models with the RobustnessDiagnosis test. Robustness refers to a model's ability to maintain consistent performance when its input data is perturbed or noisy, and stability refers to a model's ability to produce consistent outputs over time and across different data subsets.

Again, we'll use both the training and testing datasets to establish baseline performance and to simulate real-world generalization:

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

❌ Robustness Diagnosis Champion Vs Log Regression

Robustness Diagnosis: Champion vs Log Regression is designed to assess the resilience of machine learning models by quantifying how their predictive performance degrades when input data is subjected to controlled perturbations. The primary purpose of this test is to evaluate the robustness of models in the presence of noise, simulating real-world scenarios where data may be imperfect, incomplete, or corrupted, and to identify the extent to which model predictions remain reliable under such conditions.

The test operates by systematically introducing Gaussian noise to the numerical input features of the dataset at varying levels of standard deviation, referred to as perturbation sizes. For each perturbation level, the model’s performance is measured using the Area Under the Receiver Operating Characteristic Curve (AUC), a metric that quantifies the model’s ability to distinguish between classes. AUC values range from 0 to 1, where 1 indicates perfect discrimination and 0.5 suggests no discriminative power. The test calculates the performance decay, defined as the reduction in AUC relative to the baseline (no noise) scenario, for both training and test datasets. Results are aggregated and visualized, with plots showing AUC as a function of perturbation size, and tables providing detailed breakdowns by model, dataset, and noise level. The “Passed” indicator flags whether the performance decay remains within acceptable thresholds, highlighting any instances where robustness standards are not met.

The primary advantages of this test include its ability to provide a clear, quantitative assessment of model robustness across a spectrum of noise intensities, offering valuable insights into how models might behave in less-than-ideal data environments. By leveraging the AUC metric, the test ensures that the evaluation is both interpretable and relevant for classification tasks, capturing the model’s discriminative power under stress. The use of both tabular and graphical outputs facilitates comprehensive analysis, enabling users to quickly identify patterns, thresholds, and points of concern. This approach is particularly useful for comparing different models or configurations, as it exposes differences in sensitivity to input perturbations that may not be apparent under standard validation procedures.

It should be noted that the test’s reliance on Gaussian noise as the sole perturbation mechanism may not fully capture the diversity of real-world data corruptions, such as outliers, missing values, or adversarial manipulations. The thresholds for acceptable performance decay are somewhat arbitrary and may require adjustment to align with specific business or regulatory requirements. Additionally, the test focuses on numeric features and may not account for the impact of noise on categorical or unstructured data. Interpretation challenges may arise if performance decay is observed at low noise levels, as this could indicate model fragility or overfitting. Signs of high risk include significant drops in AUC with minimal noise, performance decay exceeding thresholds, or consistent failure to meet standards across multiple perturbation scales, all of which warrant further investigation.

This test shows results in both tabular and graphical formats. The tables present detailed results for each model (“log_model_champion” and “rf_model”), dataset (“train_dataset_final” and “test_dataset_final”), and perturbation size (ranging from baseline to 0.5 standard deviations). Each row includes the AUC, performance decay relative to baseline, and a pass/fail indicator based on predefined thresholds. The plots visualize AUC as a function of perturbation size, with separate lines for training and test datasets, and horizontal dashed lines indicating threshold values. For the “log_model_champion,” AUC values remain relatively stable across increasing noise levels, with only minor declines observed. In contrast, the “rf_model” exhibits a pronounced decrease in training AUC as noise increases, while test AUC remains more stable but eventually drops below the threshold at the highest perturbation level. Notable observations include the “rf_model” failing the robustness threshold on the training set at perturbation sizes of 0.2 and above, and on the test set at 0.5, while the “log_model_champion” consistently passes across all conditions. The range of AUC values for the “log_model_champion” spans from 0.6842 to 0.6663 on training and 0.7051 to 0.6927 on test, whereas the “rf_model” ranges from 1.0 to 0.7975 on training and 0.7625 to 0.6886 on test.

The test results reveal the following key insights:

  • Logistic Regression Model Maintains Robustness Across Noise Levels: The “log_model_champion” demonstrates minimal performance decay, with AUC values on both training and test datasets remaining above 0.66 even at the highest perturbation size, and all results passing the robustness threshold.
  • Random Forest Model Exhibits High Sensitivity to Noise in Training Data: The “rf_model” shows a steep decline in training AUC as perturbation size increases, dropping from 1.0 at baseline to 0.7975 at 0.5 standard deviations, with performance decay exceeding the threshold from 0.2 onwards, resulting in failed robustness checks.
  • Test Set Performance for Random Forest Remains Stable Until High Perturbation: On the test dataset, the “rf_model” maintains relatively stable AUC values up to a perturbation size of 0.4, but drops below the threshold at 0.5, indicating a late but significant loss of robustness.
  • Performance Decay Patterns Differ Between Models: The logistic regression model’s performance decay is gradual and minor, while the random forest model’s decay is abrupt and pronounced, particularly on the training set, suggesting differences in model complexity and overfitting behavior.
  • Threshold Exceedance Highlights Model Fragility: The “rf_model” fails the robustness threshold on the training set at perturbation sizes of 0.2 and above, and on the test set at 0.5, whereas the “log_model_champion” passes all robustness checks, indicating greater resilience to input noise.

Based on these results, the logistic regression model (“log_model_champion”) demonstrates consistent and stable performance under increasing levels of Gaussian noise, with only minor reductions in AUC and no instances of performance decay exceeding the predefined thresholds. This indicates a high degree of robustness and suggests that the model’s predictions are likely to remain reliable even when input data is subject to moderate perturbations. In contrast, the random forest model (“rf_model”) displays marked sensitivity to noise, particularly in the training data, where AUC declines rapidly and robustness thresholds are breached at relatively low perturbation sizes. The test set performance for the random forest model remains stable up to a point but ultimately fails at the highest noise level, highlighting a potential vulnerability to data corruption. The observed patterns suggest that model complexity and overfitting may contribute to the random forest’s fragility, while the logistic regression model’s simpler structure confers greater resilience. These insights provide a clear characterization of each model’s behavior under noisy conditions, with the logistic regression model exhibiting superior robustness and the random forest model showing susceptibility to performance degradation as input noise increases.

Tables

model Perturbation Size Dataset Row Count AUC Performance Decay Passed
log_model_champion Baseline (0.0) train_dataset_final 2585 0.6842 0.0000 True
log_model_champion Baseline (0.0) test_dataset_final 647 0.7051 0.0000 True
log_model_champion 0.1 train_dataset_final 2585 0.6841 0.0001 True
log_model_champion 0.1 test_dataset_final 647 0.7066 -0.0015 True
log_model_champion 0.2 train_dataset_final 2585 0.6827 0.0015 True
log_model_champion 0.2 test_dataset_final 647 0.7039 0.0012 True
log_model_champion 0.3 train_dataset_final 2585 0.6717 0.0126 True
log_model_champion 0.3 test_dataset_final 647 0.6950 0.0101 True
log_model_champion 0.4 train_dataset_final 2585 0.6736 0.0106 True
log_model_champion 0.4 test_dataset_final 647 0.6945 0.0106 True
log_model_champion 0.5 train_dataset_final 2585 0.6663 0.0179 True
log_model_champion 0.5 test_dataset_final 647 0.6927 0.0124 True
rf_model Baseline (0.0) train_dataset_final 2585 1.0000 0.0000 True
rf_model Baseline (0.0) test_dataset_final 647 0.7625 0.0000 True
rf_model 0.1 train_dataset_final 2585 0.9826 0.0174 True
rf_model 0.1 test_dataset_final 647 0.7766 -0.0141 True
rf_model 0.2 train_dataset_final 2585 0.9470 0.0530 False
rf_model 0.2 test_dataset_final 647 0.7639 -0.0014 True
rf_model 0.3 train_dataset_final 2585 0.8997 0.1003 False
rf_model 0.3 test_dataset_final 647 0.7601 0.0024 True
rf_model 0.4 train_dataset_final 2585 0.8543 0.1457 False
rf_model 0.4 test_dataset_final 647 0.7439 0.0187 True
rf_model 0.5 train_dataset_final 2585 0.7975 0.2025 False
rf_model 0.5 test_dataset_final 647 0.6886 0.0739 False

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:f5f0
ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:1756
2026-01-10 02:32:58,661 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression does not exist in model's document
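
If the default noise levels or the 0.05 performance decay threshold don't suit your use case, the RobustnessDiagnosis parameters listed in the list_tests() output above (scaling_factor_std_dev_list and performance_decay_threshold) can be overridden when the test is run. The snippet below is illustrative only: it assumes your installed version of the ValidMind Library accepts a params argument alongside input_grid, and it uses a distinct result ID suffix so it doesn't overwrite the result we just logged.

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:champion_vs_challenger_custom",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
    params={
        # Probe smaller perturbations and relax the pass/fail cutoff
        "scaling_factor_std_dev_list": [0.05, 0.1, 0.15, 0.2],
        "performance_decay_threshold": 0.1,
    },
).log()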

Run feature importance tests

We also want to verify the relative influence of different input features on our models' predictions, and to compare our champion and challenger models to see whether one offers more understandable or logical feature importance scores.

Use list_tests() to identify all the feature importance tests for classification:

# Store the feature importance tests
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI
['validmind.model_validation.FeaturesAUC',
 'validmind.model_validation.sklearn.PermutationFeatureImportance',
 'validmind.model_validation.sklearn.SHAPGlobalImportance']

We'll only use our testing dataset (vm_test_ds) here to provide a realistic, unseen sample that mimics future or production data, as the training dataset has already influenced our models during learning:

# Run and log our feature importance tests for both models for the testing dataset
for test in FI:
    vm.tests.run_test(
        f"{test}:champion_vs_challenger",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Features AUC Champion Vs Challenger

Features AUC: Champion vs Challenger is designed to evaluate the discriminatory power of each individual feature within a binary classification model by calculating the Area Under the Curve (AUC) for each feature separately. The primary purpose of this test is to quantify how well each feature, on its own, can distinguish between the two classes in a binary classification problem, providing a univariate perspective on feature effectiveness.

The test operates by treating the values of each feature as raw scores and computing the AUC for each feature against the actual binary outcomes. For every feature, the test calculates how well the distribution of feature values separates the two classes, using the AUC metric as a measure of this separation. The AUC, or Area Under the Receiver Operating Characteristic (ROC) Curve, is a widely used metric in binary classification that quantifies the probability that a randomly chosen positive instance will have a higher score than a randomly chosen negative instance. The AUC value ranges from 0 to 1, where 0.5 indicates no discriminatory power (equivalent to random guessing), values closer to 1 indicate strong positive discrimination, and values closer to 0 indicate strong negative discrimination. The test requires only the feature values and the binary target labels, and it does not consider any interactions or combined effects between features. The resulting AUC scores for each feature provide a direct, interpretable measure of univariate classification strength, with higher values indicating greater individual predictive power.

The primary advantages of this test include its ability to isolate and highlight the individual contribution of each feature to the classification task, independent of other variables. This makes it particularly useful for initial feature screening, where the goal is to identify features with strong univariate predictive power before model development. Additionally, after model training, the test can provide insights into which features the model may be relying on most heavily, supporting interpretability and transparency. The simplicity and directness of the AUC metric make the results easy to communicate to both technical and non-technical stakeholders. The test is also robust to class imbalance, as the AUC is not affected by the proportion of positive and negative cases, and it can help detect potential data leakage if a feature exhibits unexpectedly high discriminatory power.

It should be noted that this test has several limitations and potential risks. Since it evaluates each feature in isolation, it does not capture any interactions or combined effects between features, which can be critical in many real-world models. Features that are weak individually may still be highly informative when combined with others, and this test would not identify such cases. The AUC values are calculated without reference to how the model actually uses the features, so the results may differ from model-based feature importance measures. There is also a risk of misinterpretation if a feature with a low AUC is expected to be predictive, or if a feature with a high AUC is not believed to be informative, which could indicate data leakage or other data quality issues. The test is applicable only to binary classification problems and cannot be directly extended to multiclass or regression tasks without modification.

This test shows the results in the form of horizontal bar plots, where each bar represents a feature and its corresponding AUC score on the test dataset. The x-axis displays the AUC values, ranging from 0 to 1, while the y-axis lists the features evaluated. The length of each bar indicates the univariate discriminatory power of the feature, with longer bars corresponding to higher AUC scores. The plots are titled "Feature AUC Scores (for dataset=test_dataset_final)" and present the features in descending order of AUC, making it easy to identify the most and least discriminative features at a glance. The key measurement displayed is the AUC score for each feature, which quantifies the probability that the feature can correctly distinguish between the two classes. Notable observations from the plots include the range of AUC values across features, the relative ranking of features, and any features that stand out as particularly strong or weak. For example, features such as "Geography_Germany" and "Balance" have the highest AUC scores, both exceeding 0.6, while features like "NumOfProducts" and "IsActiveMember" have lower AUC scores, closer to 0.4. The visualizations provide a clear, interpretable summary of the univariate discriminatory power of each feature, allowing for straightforward comparison and identification of patterns.

The test results reveal the following key insights:

  • Geography_Germany and Balance are the most discriminative features: Both "Geography_Germany" and "Balance" achieve the highest AUC scores, each exceeding 0.6, indicating that these features have the strongest univariate ability to separate the two classes in the test dataset.
  • CreditScore and EstimatedSalary show moderate discriminatory power: These features have AUC scores slightly above 0.5, suggesting they provide some univariate predictive value but are less powerful than the top features.
  • HasCrCard and Geography_Spain offer limited separation: With AUC values just below 0.5, these features contribute less to class differentiation on their own, though they may still be useful in combination with others.
  • Tenure, Gender_Male, IsActiveMember, and NumOfProducts have the lowest AUC scores: These features all have AUC values around 0.4 to 0.45, indicating weak univariate discriminatory power and suggesting limited individual predictive value in this context.
  • AUC values span a moderate range: The observed AUC scores range from approximately 0.4 to just above 0.6, with no features exhibiting extremely high or low values, which may indicate a lack of strong univariate predictors or the need for feature interactions to achieve higher performance.
  • Feature ranking is consistent across repeated plots: The order and relative magnitude of AUC scores are stable, reinforcing the reliability of the observed feature contributions.

Based on these results, the test demonstrates that "Geography_Germany" and "Balance" are the most effective individual features for distinguishing between the two classes in the test dataset, as evidenced by their AUC scores above 0.6. Features such as "CreditScore" and "EstimatedSalary" provide moderate univariate discrimination, while others like "HasCrCard," "Geography_Spain," "Tenure," "Gender_Male," "IsActiveMember," and "NumOfProducts" show weaker individual performance, with AUC values closer to random chance. The overall distribution of AUC scores suggests that while some features have meaningful univariate predictive power, the majority do not strongly separate the classes on their own. The consistency of feature rankings across repeated visualizations supports the robustness of these observations. These results highlight the importance of considering both individual and combined feature effects in model development, as features with low univariate AUC may still contribute significantly in multivariate models. The absence of extremely high AUC values also suggests that there is no immediate evidence of data leakage or overly dominant features, and the model's performance is likely to depend on the interplay of multiple features rather than reliance on a single variable.

Figures

ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:2e3c
ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:664c
2026-01-10 02:33:32,740 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.FeaturesAUC:champion_vs_challenger does not exist in model's document
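
The per-feature AUC calculation described above is simple enough to reproduce outside the test if you want to spot-check an individual score. A minimal sketch (not the ValidMind implementation), assuming a pandas DataFrame test_df that contains the encoded features and the Exited target:

from sklearn.metrics import roc_auc_score

def feature_auc_scores(df, target="Exited"):
    scores = {}
    for column in df.columns.drop(target):
        # Treat the raw feature values as ranking scores against the binary target;
        # values below 0.5 mean the feature separates the classes in the opposite direction
        scores[column] = roc_auc_score(df[target], df[column])
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))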

Permutation Feature Importance Champion Vs Challenger

Permutation Feature Importance: Champion vs Challenger is designed to assess the significance of each feature in a machine learning model by quantifying the impact on model performance when the values of individual features are randomly permuted. The primary purpose of this test is to identify which features the model relies on most for its predictions, thereby providing transparency into the model’s decision-making process and highlighting potential dependencies or vulnerabilities.

The test operates by systematically shuffling the values of each feature in the dataset, one at a time, and measuring the resulting change in the model’s predictive performance. This is achieved using the permutation_importance method from the sklearn.inspection module, which evaluates the decrease in a chosen performance metric—such as accuracy or area under the curve—after each permutation. The underlying logic is that if permuting a feature’s values leads to a significant drop in performance, that feature is important for the model’s predictions. Conversely, if the performance remains largely unchanged, the feature is likely not influential. The output is typically a set of importance scores for each feature, which are non-negative and often normalized to sum to one or to reflect the absolute change in performance. Higher values indicate greater importance, while values near zero suggest minimal impact. This approach is model-agnostic and can be applied to any predictive model that supports performance evaluation.

The primary advantages of this test include its ability to provide clear, interpretable insights into feature importance across a wide range of model types. By directly measuring the effect of each feature on predictive accuracy, the test helps uncover which variables drive model behavior and can reveal unexpected dependencies or redundancies in the data. It is particularly useful for identifying overfitting, as features with disproportionately high importance may indicate that the model is relying too heavily on specific data characteristics. The method’s model-agnostic nature allows for consistent comparison across different algorithms, making it valuable for model selection and validation. Additionally, the visual output facilitates communication of results to both technical and non-technical stakeholders, supporting transparency and regulatory compliance.

It should be noted that permutation feature importance does not imply causality; it only measures the extent to which a feature contributes to the model’s predictive power within the context of the data and model structure. The method does not account for interactions between correlated features, which can result in the importance being attributed to one feature while underestimating the role of others. This limitation is particularly relevant in datasets with multicollinearity, where the true influence of individual features may be obscured. Furthermore, the test may highlight instability if the model relies heavily on features with high variance or those that are easily permuted, raising concerns about robustness. The approach is also limited by its inability to interact with certain modeling libraries, restricting its applicability in some environments. Interpretation challenges may arise if domain knowledge suggests that a feature should be important but the model assigns it low importance, potentially indicating issues with data quality or model specification.

This test shows the permutation feature importance results for two models: a logistic regression model (log_model_champion) and a random forest model (rf_model). The results are presented as horizontal bar plots, with each bar representing a feature and its corresponding importance score. The x-axis quantifies the importance, reflecting the decrease in model performance when the feature is permuted, while the y-axis lists the features in descending order of importance. For the logistic regression model, the most important features are Geography_Germany, IsActiveMember, Gender_Male, and Balance, with importance scores ranging from approximately 0.07 down to near zero. The random forest model, in contrast, assigns the highest importance to NumOfProducts, followed by Balance and Geography_Germany, with scores reaching up to 0.14. Features such as EstimatedSalary, HasCrCard, Tenure, and CreditScore exhibit low importance in both models, with values close to zero. The plots allow for direct comparison of feature importance across models, highlighting both shared and divergent patterns in feature reliance. Notably, the range of importance values is broader in the random forest model, indicating a more pronounced differentiation between key and peripheral features.

The test results reveal the following key insights:

  • Distinct Feature Reliance Across Models: The logistic regression model (log_model_champion) and the random forest model (rf_model) display markedly different patterns of feature importance, with each model prioritizing different variables for prediction.
  • Logistic Regression Emphasizes Geography and Membership: In the log_model_champion, Geography_Germany (importance ≈ 0.073), IsActiveMember (≈ 0.062), and Gender_Male (≈ 0.058) are the most influential features, collectively accounting for the majority of the model’s predictive power.
  • Random Forest Prioritizes Product Count and Balance: The rf_model assigns the highest importance to NumOfProducts (≈ 0.14) and Balance (≈ 0.05), with Geography_Germany (≈ 0.04) also contributing significantly, indicating a different set of primary drivers compared to the logistic regression model.
  • Low Impact Features Consistent Across Models: Features such as EstimatedSalary, HasCrCard, Tenure, and CreditScore consistently show low importance in both models, with scores near or below 0.01, suggesting limited influence on predictions regardless of model type.
  • Broader Importance Range in Random Forest: The rf_model demonstrates a wider spread of importance values, with a sharper distinction between highly influential and minimally impactful features, whereas the log_model_champion exhibits a more gradual decline in importance across features.
  • Potential Redundancy and Correlation Effects: The low importance of certain features, despite domain expectations, may indicate redundancy or the presence of correlated variables, particularly in the random forest model where feature interactions are more complex.

Based on these results, the permutation feature importance analysis reveals that the two models under consideration leverage different subsets of features to drive their predictions, with the logistic regression model relying more on demographic and membership-related variables, while the random forest model emphasizes transactional attributes such as product count and account balance. The consistent identification of low-importance features across both models suggests that certain variables contribute little to predictive accuracy in this context. The broader range of importance values in the random forest model highlights its capacity to differentiate sharply between key and peripheral features, potentially reflecting its ability to capture nonlinear relationships and interactions. The observed patterns also suggest that feature redundancy and correlation may influence the allocation of importance, particularly in models capable of modeling complex dependencies. Overall, the results provide a clear, quantitative basis for understanding how each model interprets the available data and which features are most critical to their respective predictive strategies.

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:4755
ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:a411
2026-01-10 02:34:05,255 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger does not exist in model's document
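
Because the test is built on sklearn.inspection.permutation_importance, a comparable calculation can be run directly against the underlying estimators to sanity-check a score. A rough sketch, where estimator, X_test, and y_test are placeholder names for the unwrapped sklearn model and a held-out feature matrix and target (not ValidMind objects):

import pandas as pd
from sklearn.inspection import permutation_importance

def permutation_scores(estimator, X_test, y_test, n_repeats=10, seed=42):
    # Shuffle each feature n_repeats times and record the mean drop in ROC AUC
    result = permutation_importance(
        estimator, X_test, y_test, scoring="roc_auc", n_repeats=n_repeats, random_state=seed
    )
    return pd.Series(result.importances_mean, index=X_test.columns).sort_values(ascending=False)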

SHAP Global Importance Champion Vs Challenger

SHAP Global Importance: Champion vs Challenger is designed to evaluate and visualize the global feature importance of machine learning models using SHAP (SHapley Additive exPlanations) values. The primary purpose of this test is to provide a transparent and quantitative understanding of how individual features contribute to model predictions, supporting model risk management by identifying which features most influence model outcomes and highlighting potential areas of risk or overfitting.

The test operates by first selecting an appropriate SHAP explainer based on the model type—TreeExplainer for tree-based models and LinearExplainer for linear models. The explainer computes Shapley values for each feature across all instances in the dataset, quantifying the marginal contribution of each feature to the model’s output. These values are then aggregated to produce two main visualizations: the mean importance plot and the summary plot. The mean importance plot displays the average absolute Shapley value for each feature, normalized as a percentage, to indicate global importance. The summary plot presents the distribution of Shapley values for each feature, with each point representing a single instance and colored by the feature’s value, allowing for the assessment of both the magnitude and direction of feature effects. The SHAP value itself measures the change in the model’s prediction when a feature is included versus excluded, with values typically ranging from negative to positive, where higher absolute values indicate greater influence. In practice, features with high mean SHAP values are considered more influential, while a concentration of importance in a few features or unexpected patterns may signal overfitting or model reliance on spurious relationships.

The primary advantages of this test include its ability to provide both global and local interpretability of model behavior, making it possible to understand not only which features are most important overall but also how they affect individual predictions. SHAP values are grounded in cooperative game theory, ensuring a fair and consistent allocation of importance among features. The visualizations produced by this test facilitate the identification of dominant features, the detection of potential biases, and the assessment of model robustness. This level of transparency is particularly valuable in regulated environments or high-stakes applications, where understanding the rationale behind model decisions is critical for compliance and trust. Additionally, the test supports comparative analysis between different models, enabling stakeholders to evaluate changes in feature importance and their implications for model performance and risk.

It should be noted that the test has several limitations and potential risks. In high-dimensional datasets, the interpretation of SHAP values can become complex, as the number of features may obscure meaningful patterns or dilute the importance of truly influential variables. The assignment of importance does not always translate directly to real-world impact, as the context and domain knowledge are required to interpret the results appropriately. Signs of high risk include an overemphasis on a small subset of features, which may indicate overfitting, and the presence of unexpected or illogical features with high importance, suggesting that the model may be capturing spurious correlations. Additionally, high variability or scatter in the summary plot can signal instability in feature effects, warranting further investigation. Users should exercise caution in drawing conclusions solely from SHAP values and consider complementary analyses to validate model behavior.

This test shows the results through a series of SHAP visualizations for both the champion (logistic regression) and challenger (random forest) models. The first plot is a normalized mean importance bar chart for the champion model, displaying the top features ranked by their average absolute SHAP value as a percentage. The horizontal axis represents normalized SHAP value (percentage), while the vertical axis lists the features. The most influential features for the champion model are "IsActiveMember," "Geography_Germany," and "Gender_Male," each contributing significantly more than the remaining features, with "IsActiveMember" reaching nearly 100% normalized importance. The second plot is a SHAP summary plot for the champion model, where each dot represents a single instance’s SHAP value for a feature, colored by the feature’s value from low (blue) to high (red). The horizontal axis shows the SHAP value (impact on model output), and the vertical axis lists the features in order of importance. This plot reveals the direction and spread of feature effects, with "IsActiveMember" and "Geography_Germany" showing the widest range of SHAP values. For the challenger random forest model, the third and fourth plots focus on "CreditScore" and "Tenure," showing both normalized SHAP value distributions and SHAP interaction values. The axes are similar, with the horizontal axis representing either normalized SHAP value or SHAP interaction value, and the vertical axis listing the features. The random forest model’s plots indicate a much narrower focus, with only two features showing significant importance and interaction effects, and a wider spread of SHAP values, including both positive and negative contributions. The color gradient in all summary plots provides additional context on how feature values relate to their impact on the model’s output.

The test results reveal the following key insights:

  • Champion Model Relies Heavily on a Few Features: The logistic regression champion model assigns the highest normalized SHAP importance to "IsActiveMember" (close to 100%), followed by "Geography_Germany" and "Gender_Male" (both above 70%), indicating a strong reliance on these features for its predictions.
  • Challenger Model Focuses on CreditScore and Tenure: The random forest challenger model’s SHAP plots show that only "CreditScore" and "Tenure" have substantial normalized SHAP values, with all other features contributing negligibly, suggesting a much narrower feature utilization.
  • Feature Effect Directions and Variability Differ by Model: The champion model’s summary plot displays a broad range of SHAP values for its top features, with both positive and negative impacts, while the challenger model’s plots show more symmetric and concentrated distributions, indicating different patterns of feature influence.
  • Potential Overemphasis and Risk of Overfitting in Both Models: The concentration of importance in a small number of features for both models, especially the near-exclusive reliance on "IsActiveMember" in the champion and on "CreditScore" and "Tenure" in the challenger, may signal a risk of overfitting or model instability.
  • Distinct Feature Interactions in Challenger Model: The SHAP interaction plot for the random forest model reveals notable interaction effects between "CreditScore" and "Tenure," with both positive and negative interaction values, highlighting complex dependencies not present in the champion model.

Based on these results, the SHAP global importance analysis demonstrates that the champion and challenger models exhibit markedly different patterns of feature reliance and interaction. The champion model’s predictions are driven primarily by a small set of categorical features, with "IsActiveMember" dominating the importance landscape, while the challenger model’s output is almost entirely determined by two numerical features, "CreditScore" and "Tenure." The summary plots further reveal that the direction and magnitude of feature effects vary substantially between models, with the champion model showing a wider range of SHAP values and the challenger model displaying more concentrated, symmetric distributions. The presence of strong feature interactions in the challenger model, as indicated by the SHAP interaction plot, suggests that it captures more complex relationships between variables. However, the high concentration of importance in a few features for both models raises the possibility of overfitting or excessive model dependence on specific variables, which may impact model robustness and generalizability. These observations provide a clear, quantitative basis for understanding how each model processes input features and where potential risks may arise in their decision-making logic.

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:4b5c
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:9695
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:3046
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:159f
2026-01-10 02:34:44,031 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger does not exist in model's document
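
The explainer selection and aggregation described above can also be reproduced with the shap package directly when you want to inspect the raw Shapley values. The sketch below is a rough approximation, not the ValidMind implementation: it assumes shap is installed and that you have the unwrapped estimators and a feature DataFrame (placeholder names), and the shape of the returned SHAP values varies across shap versions and model types, so verify against your environment.

import numpy as np
import shap

def mean_abs_shap(estimator, X, model_type="tree"):
    # Use the explainers named in the test description: TreeExplainer for tree-based
    # models, LinearExplainer for linear models
    explainer = shap.TreeExplainer(estimator) if model_type == "tree" else shap.LinearExplainer(estimator, X)
    shap_values = explainer.shap_values(X)
    if isinstance(shap_values, list):            # some shap versions: one array per class
        shap_values = shap_values[1]
    elif getattr(shap_values, "ndim", 2) == 3:   # others: (rows, features, classes)
        shap_values = shap_values[:, :, 1]
    importance = np.abs(shap_values).mean(axis=0)
    # Normalize so the most important feature reads as 100%, as in the mean importance plots
    return dict(zip(X.columns, 100 * importance / importance.max()))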

In summary

In this third notebook, you learned how to:

Next steps

Finalize validation and reporting

Now that you're familiar with the basics of using the ValidMind Library to run and log validation tests, let's learn how to implement some custom tests and wrap up our validation: 4 — Finalize validation and reporting