Add context to LLM-generated test descriptions
When you run ValidMind tests, test descriptions are automatically generated by an LLM using the test results, the test name, and the static test definitions provided in the test's docstring. While this metadata offers valuable high-level overviews of tests, the LLM-generated descriptions may not always align with your specific use cases or incorporate your organization's policy requirements.
In this notebook, you'll learn how to add context to the generated descriptions by providing additional information about the test or the use case. Including custom use case context is useful when you want to highlight information about the intended use and technique of the model, or the institution policies and standards specific to your use case.
Install the ValidMind Library
To install the library:

%pip install -q validmind
Initialize the ValidMind Library
ValidMind generates a unique code snippet for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.
Get your code snippet
In a browser, log in to ValidMind.
In the left sidebar, navigate to Model Inventory and click + Register Model.
Enter the model details and click Continue. (Need more help?)
For example, to register a model for use with this notebook, select:
- Documentation template: Binary classification
- Use case: Marketing/Sales - Attrition/Churn Management

You can fill in other options according to your preference.
Go to Getting Started and click Copy snippet to clipboard.
Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:
# Load your model identifier credentials from an `.env` file
%load_ext dotenv
%dotenv .env
# Or replace with your code snippet
import validmind as vm
vm.init(
    # api_host = "https://api.prod.validmind.ai/api/v1/tracking",
    # api_key = "...",
    # api_secret = "...",
    # model = "..."
)
Initialize the Python environment
After you've connected to your registered model in the ValidMind Platform, let's import the necessary libraries and set up your Python environment for data analysis:
import xgboost as xgb
import os
%matplotlib inline
Load the sample dataset
First, we’ll import a sample ValidMind dataset and load it into a pandas DataFrame, a two-dimensional tabular data structure that makes use of rows and columns:
# Import the sample dataset from the library
from validmind.datasets.classification import customer_churn
print(
f"Loaded demo dataset with: \n\n\t• Target column: '{customer_churn.target_column}' \n\t• Class labels: {customer_churn.class_labels}"
)
raw_df = customer_churn.load_data()
raw_df.head()
Preprocess the raw dataset
Then, we’ll perform a number of operations to get ready for the subsequent steps:
- Preprocess the data: Splits the DataFrame (raw_df) into multiple datasets (train_df, validation_df, and test_df) using customer_churn.preprocess to simplify preprocessing.
- Separate features and targets: Drops the target column to create feature sets (x_train, x_val) and target sets (y_train, y_val).
- Initialize XGBoost classifier: Creates an XGBClassifier object with early stopping rounds set to 10.
- Set evaluation metrics: Specifies metrics for model evaluation as error, logloss, and auc.
- Fit the model: Trains the model on x_train and y_train using the validation set (x_val, y_val). Verbose output is disabled.
train_df, validation_df, test_df = customer_churn.preprocess(raw_df)

x_train = train_df.drop(customer_churn.target_column, axis=1)
y_train = train_df[customer_churn.target_column]
x_val = validation_df.drop(customer_churn.target_column, axis=1)
y_val = validation_df[customer_churn.target_column]

model = xgb.XGBClassifier(early_stopping_rounds=10)
model.set_params(
    eval_metric=["error", "logloss", "auc"],
)
model.fit(
    x_train,
    y_train,
    eval_set=[(x_val, y_val)],
    verbose=False,
)
Initialize the ValidMind objects
Initialize the datasets
Before you can run tests, you'll need to initialize a ValidMind dataset object using the init_dataset function from the ValidMind (vm) module.
We’ll include the following arguments:
- dataset — the raw dataset that you want to provide as input to tests
- input_id — a unique identifier that allows tracking what inputs are used when running each individual test
- target_column — a required argument if tests require access to true values. This is the name of the target column in the dataset
- class_labels — an optional value to map predicted classes to class labels
With all datasets ready, you can now initialize the raw, training, and test datasets (raw_df, train_df, and test_df) created earlier into their own dataset objects using vm.init_dataset():
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column=customer_churn.target_column,
    class_labels=customer_churn.class_labels,
)

vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=customer_churn.target_column,
)

vm_test_ds = vm.init_dataset(
    dataset=test_df, input_id="test_dataset", target_column=customer_churn.target_column
)
Initialize a model object
Additionally, you'll need to initialize a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data.
Simply initialize this model object with vm.init_model():
vm_model = vm.init_model(
    model,
    input_id="model",
)
Assign predictions to the datasets
We can now use the assign_predictions() method from the Dataset object to link existing predictions to any model.
If no prediction values are passed, the method will compute predictions automatically:
vm_train_ds.assign_predictions(
    model=vm_model,
)

vm_test_ds.assign_predictions(
    model=vm_model,
)
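If you've already computed predictions elsewhere, you can link them directly instead of letting the method recompute them. Here's a minimal sketch, assuming assign_predictions() accepts a prediction_values argument holding an array aligned with the dataset rows:

# Optional: link precomputed predictions instead of recomputing them
# (assumes `prediction_values` accepts a row-aligned array of predictions)
x_test = test_df.drop(customer_churn.target_column, axis=1)
y_test_pred = model.predict(x_test)

vm_test_ds.assign_predictions(
    model=vm_model,
    prediction_values=y_test_pred,
)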
Set custom context for test descriptions
Review default LLM-generated descriptions
By default, custom context for LLM-generated descriptions is disabled, meaning that the output will not include any additional context.
Let's generate an initial test description for the DatasetDescription test for comparison with later iterations:
vm.tests.run_test(
    "validmind.data_validation.DatasetDescription",
    inputs={
        "dataset": vm_raw_dataset,
    },
)
Enable use case context
To enable custom use case context, set the VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED environment variable to 1.
This is a global setting that will affect all tests for your linked model:
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "1" os.environ[
Enabling use case context allows you to pass in additional context, such as information about your model, relevant regulatory requirements, or model validation targets, to the LLM-generated text descriptions within use_case_context:
= """
use_case_context
This is a customer churn prediction model for a banking loan application system using XGBoost classifier.
Key Model Information:
- Use Case: Predict customer churn risk during loan application process
- Model Type: Binary classification using XGBoost
- Critical Decision Point: Used in loan approval workflow
Regulatory Requirements:
- Subject to model risk management review and validation
- Results require validation review for regulatory compliance
- Model decisions directly impact loan approval process
- Does this result raise any regulatory concerns?
Validation Focus:
- Explain strengths and weaknesses of the test and the context of whether the result is acceptable.
- What does the result indicate about model reliability?
- Is the result within acceptable thresholds for loan decisioning?
- What are the implications for customer impact?
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = use_case_context os.environ[
With the use case context set, generate an updated test description for the DatasetDescription test for comparison with the default output, logging the result to your model in the ValidMind Platform with .log():
vm.tests.run_test(
    "validmind.data_validation.DatasetDescription",
    inputs={
        "dataset": vm_raw_dataset,
    },
).log()
Disable use case context
To disable custom use case context, set the VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED environment variable to 0.
This is a global setting that will affect all tests for your linked model:
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "0" os.environ[
With the use case context disabled again, generate another test description for the DatasetDescription test for comparison with the previous custom output:
vm.tests.run_test(
    "validmind.data_validation.DatasetDescription",
    inputs={
        "dataset": vm_raw_dataset,
    },
).log()
Add test-specific context
In addition to the model-level use_case_context, you can add test-specific context to your LLM-generated descriptions, allowing you to provide validation criteria specific to the test being run.
We'll re-enable use case context by setting the VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED environment variable to 1, then join the test-specific context to the use case context using the VALIDMIND_LLM_DESCRIPTIONS_CONTEXT environment variable.
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "1" os.environ[
Dataset Description
Rather than relying on generic dataset result descriptions in isolation, we'll use the context to specify precise thresholds for missing values, appropriate data types for banking variables (like CreditScore and Balance), and valid value ranges based on particular business rules:
= """
test_context
Acceptance Criteria:
- Missing Values: All critical features must have less than 5% missing values (including CreditScore, Balance, Age)
- Data Types: All columns must have appropriate data types (numeric for CreditScore/Balance/Age, categorical for Geography/Gender)
- Cardinality: Categorical variables must have fewer than 50 unique values, while continuous variables should show appropriate distinct value counts (e.g., high for EstimatedSalary, exactly 2 for Boolean fields)
- Value Ranges: Numeric fields must fall within business-valid ranges (CreditScore: 300-850, Age: ≥18, Balance: ≥0)
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate an updated test description for the DatasetDescription test again:

vm.tests.run_test(
    "validmind.data_validation.DatasetDescription",
    inputs={
        "dataset": vm_raw_dataset,
    },
)
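Since the pattern of joining use_case_context with a per-test test_context repeats for each of the tests below, you could optionally wrap it in a small convenience helper. This is only a sketch; set_test_context is not part of the ValidMind API:

def set_test_context(test_context: str) -> None:
    # Hypothetical helper: combines the shared use case context with
    # test-specific context and stores the result in the environment
    # variable that ValidMind reads when generating descriptions
    os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = f"""
{use_case_context}
{test_context}
""".strip()

The sections below keep the explicit pattern so that each cell remains self-contained.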
Class Imbalance
The following test-specific context example adds value to the LLM-generated description by providing defined risk levels to assess class representation:
- By categorizing classes into Low, Medium, and High Risk, the LLM can generate more nuanced and actionable insights, ensuring that the analysis aligns with business requirements for balanced datasets.
- This approach not only highlights potential issues but also guides necessary documentation and mitigation strategies for high-risk classes.
= """
test_context
Acceptance Criteria:
• Risk Levels for Class Representation:
- Low Risk: Each class represents 20% or more of the total dataset
- Medium Risk: Each class represents between 10% and 19.9% of the total dataset
- High Risk: Any class represents less than 10% of the total dataset
• Overall Requirement:
- All classes must achieve at least Medium Risk status to pass
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the ClassImbalance test for review:

vm.tests.run_test(
    "validmind.data_validation.ClassImbalance",
    inputs={
        "dataset": vm_raw_dataset,
    },
    params={
        "min_percent_threshold": 10,
    },
)
High Cardinality
In the case below, the context specifies risk-based criteria for the number of distinct values in categorical features.
This helps the LLM generate more nuanced and actionable insights, ensuring the descriptions are more relevant to your organization's policies.
= """
test_context
Acceptance Criteria:
• Risk Levels for Distinct Values in Categorical Features:
- Low Risk: Each categorical column has fewer than 50 distinct values or less than 5% unique values relative to the total dataset size
- Medium Risk: Each categorical column has between 50 and 100 distinct values or between 5% and 10% unique values
- High Risk: Any categorical column has more than 100 distinct values or more than 10% unique values
• Overall Requirement:
- All categorical columns must achieve at least Medium Risk status to pass
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the HighCardinality test for review:

vm.tests.run_test(
    "validmind.data_validation.HighCardinality",
    inputs={
        "dataset": vm_raw_dataset,
    },
    params={
        "num_threshold": 100,
        "percent_threshold": 0.1,
        "threshold_type": "percent",
    },
)
Missing Values
Here, we use the test-specific context to establish differentiated risk thresholds across features.
Rather than applying uniform criteria, the context allows for specific requirements for critical financial features (CreditScore, Balance, Age).
= """
test_context Test-Specific Context for Missing Values Analysis:
Acceptance Criteria:
• Risk Levels for Missing Values:
- Low Risk: Less than 1% missing values in any column
- Medium Risk: Between 1% and 5% missing values
- High Risk: More than 5% missing values
• Feature-Specific Requirements:
- Critical Features (CreditScore, Balance, Age):
* Must maintain Low Risk status
* No missing values allowed
- Secondary Features (Tenure, NumOfProducts, EstimatedSalary):
* Must achieve at least Medium Risk status
* Up to 3% missing values acceptable
- Categorical Features (Geography, Gender):
* Must achieve at least Medium Risk status
* Up to 5% missing values acceptable
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the MissingValues test for review:

vm.tests.run_test(
    "validmind.data_validation.MissingValues",
    inputs={
        "dataset": vm_raw_dataset,
    },
    params={
        "min_threshold": 1,
    },
)
Unique Rows
This example context establishes variable-specific thresholds based on business expectations.
Rather than applying uniform criteria, it recognizes that high variability is expected in features like EstimatedSalary (>90%) and Balance (>50%), while enforcing strict limits on categorical features like Geography (<5 values), ensuring meaningful validation aligned with banking data characteristics.
= """
test_context
Acceptance Criteria:
• High-Variability Expected Features:
- EstimatedSalary: Must have >90% unique values
- Balance: Must have >50% unique values
- CreditScore: Must have between 5-10% unique values
• Medium-Variability Features:
- Age: Should have between 0.5-2% unique values
- Tenure: Should have between 0.1-0.5% unique values
• Low-Variability Features:
- Binary Features (HasCrCard, IsActiveMember, Gender, Exited): Must have exactly 2 unique values
- Geography: Must have fewer than 5 unique values
- NumOfProducts: Must have fewer than 10 unique values
• Overall Requirements:
- Features must fall within their specified ranges to pass
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the UniqueRows test for review:

vm.tests.run_test(
    "validmind.data_validation.UniqueRows",
    inputs={
        "dataset": vm_raw_dataset,
    },
    params={
        "min_percent_threshold": 1,
    },
)
Too Many Zero Values
Here, test-specific context is used to provide meaning and expectations for different variables:
- For instance, zero values in Balance and Tenure indicate risk, whereas zeros in binary variables like HasCrCard or IsActiveMember are expected.
- This tailored context ensures that the analysis accurately reflects the business significance of zero values across different features.
= """
test_context
Acceptance Criteria:
- Numerical Features Only: Test evaluates only continuous numeric columns (Balance, Tenure),
excluding binary columns (HasCrCard, IsActiveMember)
- Risk Level Thresholds for Balance and Tenure:
- High Risk: More than 5% zero values
- Medium Risk: Between 3% and 5% zero values
- Low Risk: Less than 3% zero values
- Individual Column Requirements:
- Balance: Must be Low Risk (banking context requires accurate balance tracking)
- Tenure: Must be Low or Medium Risk (some zero values acceptable for new customers)
• Overall Test Result: Test must achieve "Pass" status (Low Risk) for Balance, and at least Medium Risk for Tenure
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the TooManyZeroValues test for review:

vm.tests.run_test(
    "validmind.data_validation.TooManyZeroValues",
    inputs={
        "dataset": vm_raw_dataset,
    },
    params={
        "max_percent_threshold": 0.03,
    },
)
IQR Outliers Table
In this case, we use test-specific context to incorporate risk levels tailored to key variables like CreditScore, Age, and NumOfProducts. Without this context, the test would evaluate all variables uniformly, with no business criteria applied to the outlier analysis.
= """
test_context
Acceptance Criteria:
- Risk Levels for Outliers:
- Low Risk: 0-50 outliers
- Medium Risk: 51-300 outliers
- High Risk: More than 300 outliers
- Feature-Specific Requirements:
- CreditScore, Age, NumOfProducts: Must maintain Low Risk status to ensure data quality and model reliability
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the IQROutliersTable test for review:

vm.tests.run_test(
    "validmind.data_validation.IQROutliersTable",
    inputs={
        "dataset": vm_raw_dataset,
    },
    params={
        "threshold": 1.5,
    },
)
Descriptive Statistics
Test-specific context is used in this case to provide risk-based thresholds aligned with the bank’s policy.
For instance, CreditScore ranges of 550-850 are considered low risk based on standard credit assessment practices, while Balance thresholds reflect typical retail banking ranges.
= """
test_context
Acceptance Criteria:
• CreditScore:
- Low Risk: 550-850
- Medium Risk: 450-549
- High Risk: <450 or missing
- Justification: Banking standards require reliable credit assessment
• Age:
- Low Risk: 18-75
- Medium Risk: 76-85
- High Risk: >85 or <18
- Justification: Core banking demographic with age-appropriate products
• Balance:
- Low Risk: 0-200,000
- Medium Risk: 200,001-250,000
- High Risk: >250,000
- Justification: Typical retail banking balance ranges
• Tenure:
- Low Risk: 1-10 years
- Medium Risk: <1 year
- High Risk: 0 or >10 years
- Justification: Expected customer relationship duration
• EstimatedSalary:
- Low Risk: 25,000-150,000
- Medium Risk: 150,001-200,000
- High Risk: <25,000 or >200,000
- Justification: Typical income ranges for retail banking customers
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the DescriptiveStatistics test for review:

vm.tests.run_test(
    "validmind.data_validation.DescriptiveStatistics",
    inputs={
        "dataset": vm_raw_dataset,
    },
)
Pearson Correlation Matrix
For this test, the context provides meaningful correlation ranges between specific variable pairs based on business criteria.
For example, while a general correlation analysis might flag any correlation above 0.7 as concerning, the test-specific context specifies that Balance and NumOfProducts should maintain a negative correlation between -0.4 and 0, reflecting expected banking relationships.
= """
test_context
Acceptance Criteria:
• Target Variable Correlations (Exited):
- Must show correlation coefficients between ±0.1 and ±0.3 with Age, CreditScore, and Balance
- Should not exceed ±0.2 correlation with other features
- Justification: Ensures predictive power while avoiding target leakage
• Feature Correlations:
- Balance & NumOfProducts: Must maintain correlation between -0.4 and 0
- Age & Tenure: Should show positive correlation between 0.1 and 0.3
- CreditScore & Balance: Should maintain correlation between 0.1 and 0.3
• Binary Feature Correlations:
- HasCreditCard & IsActiveMember: Must not exceed ±0.15 correlation
- Binary features should not show strong correlations (>±0.2) with continuous features
• Overall Requirement:
- No feature pair should exceed ±0.7 correlation to avoid multicollinearity
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the PearsonCorrelationMatrix test for review:

vm.tests.run_test(
    "validmind.data_validation.PearsonCorrelationMatrix",
    inputs={
        "dataset": vm_raw_dataset,
    },
)
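As shown earlier, once you've finished reviewing the context-enriched descriptions, you can disable custom context again so subsequent tests revert to the default output:

os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "0"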