Add context to LLM-generated test descriptions
When you run ValidMind tests, test descriptions are automatically generated by an LLM using the test results, the test name, and the static test definitions provided in the test's docstring. While this metadata offers valuable high-level overviews of tests, the LLM-generated descriptions may not always align with your specific use cases or incorporate your organization's policy requirements.
In this notebook, you'll learn how to add context to the generated descriptions by providing additional information about the test or the use case. Including custom use case context is useful when you want to highlight information about the intended use and technique of the model, or the institution policies and standards specific to your use case.
Install the ValidMind Library
To install the library:

%pip install -q validmind
Initialize the ValidMind Library
ValidMind generates a unique code snippet for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.
Get your code snippet
In a browser, log in to ValidMind.
In the left sidebar, navigate to Model Inventory and click + Register Model.
Enter the model details and click Continue. (Need more help?)
For example, to register a model for use with this notebook, select:
- Documentation template: Binary classification
- Use case: Marketing/Sales - Attrition/Churn Management

You can fill in other options according to your preference.
Go to Getting Started and click Copy snippet to clipboard.
Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:
# Load your model identifier credentials from an `.env` file
%load_ext dotenv
%dotenv .env
# Or replace with your code snippet
import validmind as vm
vm.init(
    # api_host = "https://api.prod.validmind.ai/api/v1/tracking",
    # api_key = "...",
    # api_secret = "...",
    # model = "..."
)
Initialize the Python environment
After you've connected to your registered model in the ValidMind Platform, let's import the necessary libraries and set up your Python environment for data analysis:
import xgboost as xgb
import os
%matplotlib inline
Load the sample dataset
First, we’ll import a sample ValidMind dataset and load it into a pandas DataFrame, a two-dimensional tabular data structure that makes use of rows and columns:
# Import the sample dataset from the library
from validmind.datasets.classification import customer_churn
print(
f"Loaded demo dataset with: \n\n\t• Target column: '{customer_churn.target_column}' \n\t• Class labels: {customer_churn.class_labels}"
)
raw_df = customer_churn.load_data()
raw_df.head()
Preprocess the raw dataset
Then, we’ll perform a number of operations to get ready for the subsequent steps:
- Preprocess the data: Splits the DataFrame (raw_df) into multiple datasets (train_df, validation_df, and test_df) using customer_churn.preprocess to simplify preprocessing.
- Separate features and targets: Drops the target column to create feature sets (x_train, x_val) and target sets (y_train, y_val).
- Initialize XGBoost classifier: Creates an XGBClassifier object with early stopping rounds set to 10.
- Set evaluation metrics: Specifies metrics for model evaluation as error, logloss, and auc.
- Fit the model: Trains the model on x_train and y_train using the validation set (x_val, y_val). Verbose output is disabled.
train_df, validation_df, test_df = customer_churn.preprocess(raw_df)

x_train = train_df.drop(customer_churn.target_column, axis=1)
y_train = train_df[customer_churn.target_column]
x_val = validation_df.drop(customer_churn.target_column, axis=1)
y_val = validation_df[customer_churn.target_column]

model = xgb.XGBClassifier(early_stopping_rounds=10)
model.set_params(
    eval_metric=["error", "logloss", "auc"],
)
model.fit(
    x_train,
    y_train,
    eval_set=[(x_val, y_val)],
    verbose=False,
)
Initialize the ValidMind objects
Initialize the datasets
Before you can run tests, you'll need to initialize a ValidMind dataset object using the init_dataset function from the ValidMind (vm) module.
We’ll include the following arguments:
- dataset — the raw dataset that you want to provide as input to tests
- input_id — a unique identifier that allows tracking what inputs are used when running each individual test
- target_column — a required argument if tests require access to true values. This is the name of the target column in the dataset
- class_labels — an optional value to map predicted classes to class labels
With all datasets ready, you can now initialize the raw, training, and test datasets (raw_df, train_df, and test_df) created earlier into their own dataset objects using vm.init_dataset():
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column=customer_churn.target_column,
    class_labels=customer_churn.class_labels,
)

vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=customer_churn.target_column,
)

vm_test_ds = vm.init_dataset(
    dataset=test_df, input_id="test_dataset", target_column=customer_churn.target_column
)
Initialize a model object
Additionally, you'll need to initialize a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data.
Simply initialize this model object with vm.init_model():
vm_model = vm.init_model(
    model,
    input_id="model",
)
Assign predictions to the datasets
We can now use the assign_predictions() method from the Dataset object to link existing predictions to any model.
If no prediction values are passed, the method will compute predictions automatically:
vm_train_ds.assign_predictions(
    model=vm_model,
)

vm_test_ds.assign_predictions(
    model=vm_model,
)
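If you've already computed predictions elsewhere, you can link them directly instead of letting the method recompute them. Here's a minimal sketch, assuming assign_predictions() accepts a prediction_values argument holding an array aligned with the dataset rows:

# Optional: link precomputed predictions instead of recomputing them
# (assumes `prediction_values` accepts a row-aligned array of predictions)
x_test = test_df.drop(customer_churn.target_column, axis=1)
y_test_pred = model.predict(x_test)

vm_test_ds.assign_predictions(
    model=vm_model,
    prediction_values=y_test_pred,
)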
Set custom context for test descriptions
Review default LLM-generated descriptions
By default, custom context for LLM-generated descriptions is disabled, meaning that the output will not include any additional context.
Let's generate an initial test description for the DatasetDescription test for comparison with later iterations:
vm.tests.run_test(
    "validmind.data_validation.DatasetDescription",
    inputs={
        "dataset": vm_raw_dataset,
    },
)
Enable use case context
To enable custom use case context, set the VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED environment variable to 1.
This is a global setting that will affect all tests for your linked model:
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "1" os.environ[
Enabling use case context allows you to pass in additional context, such as information about your model, relevant regulatory requirements, or model validation targets, to the LLM-generated text descriptions within use_case_context:
= """
use_case_context
This is a customer churn prediction model for a banking loan application system using XGBoost classifier.
Key Model Information:
- Use Case: Predict customer churn risk during loan application process
- Model Type: Binary classification using XGBoost
- Critical Decision Point: Used in loan approval workflow
Regulatory Requirements:
- Subject to model risk management review and validation
- Results require validation review for regulatory compliance
- Model decisions directly impact loan approval process
- Does this result raise any regulatory concerns?
Validation Focus:
- Explain strengths and weaknesses of the test and the context of whether the result is acceptable.
- What does the result indicate about model reliability?
- Is the result within acceptable thresholds for loan decisioning?
- What are the implications for customer impact?
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = use_case_context os.environ[
With the use case context set, generate an updated test description for the DatasetDescription test for comparison with the default output, logging the result to your model in the ValidMind Platform with .log():
vm.tests.run_test(
    "validmind.data_validation.DatasetDescription",
    inputs={
        "dataset": vm_raw_dataset,
    },
).log()
Disable use case context
To disable custom use case context, set the VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED environment variable to 0.
This is a global setting that will affect all tests for your linked model:
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "0" os.environ[
With the use case context disabled again, generate another test description for the DatasetDescription test for comparison with the previous custom output:
vm.tests.run_test(
    "validmind.data_validation.DatasetDescription",
    inputs={
        "dataset": vm_raw_dataset,
    },
).log()
Add test-specific context
In addition to the model-level use_case_context, you can add test-specific context to your LLM-generated descriptions, allowing you to provide validation criteria specific to the test being run.
We'll re-enable use case context by setting the VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED environment variable to 1, then join the test-specific context to the use case context using the VALIDMIND_LLM_DESCRIPTIONS_CONTEXT environment variable.
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "1" os.environ[
Dataset Description
Rather than relying on generic dataset result descriptions in isolation, we'll use the context to specify precise thresholds for missing values, appropriate data types for banking variables (like CreditScore and Balance), and valid value ranges based on particular business rules:
= """
test_context
Acceptance Criteria:
- Missing Values: All critical features must have less than 5% missing values (including CreditScore, Balance, Age)
- Data Types: All columns must have appropriate data types (numeric for CreditScore/Balance/Age, categorical for Geography/Gender)
- Cardinality: Categorical variables must have fewer than 50 unique values, while continuous variables should show appropriate distinct value counts (e.g., high for EstimatedSalary, exactly 2 for Boolean fields)
- Value Ranges: Numeric fields must fall within business-valid ranges (CreditScore: 300-850, Age: ≥18, Balance: ≥0)
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate an updated test description for the DatasetDescription test again:

vm.tests.run_test(
    "validmind.data_validation.DatasetDescription",
    inputs={
        "dataset": vm_raw_dataset,
    },
)
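Since the pattern of joining use_case_context with a per-test test_context repeats for each of the tests below, you could optionally wrap it in a small convenience helper. This is only a sketch; set_test_context is not part of the ValidMind API:

def set_test_context(test_context: str) -> None:
    # Hypothetical helper: combines the shared use case context with
    # test-specific context and stores the result in the environment
    # variable that ValidMind reads when generating descriptions
    os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = f"""
{use_case_context}
{test_context}
""".strip()

The sections below keep the explicit pattern so that each cell remains self-contained.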
Class Imbalance
The following test-specific context example adds value to the LLM-generated description by providing defined risk levels to assess class representation:
- By categorizing classes into Low, Medium, and High Risk, the LLM can generate more nuanced and actionable insights, ensuring that the analysis aligns with business requirements for balanced datasets.
- This approach not only highlights potential issues but also guides necessary documentation and mitigation strategies for high-risk classes.
= """
test_context
Acceptance Criteria:
• Risk Levels for Class Representation:
- Low Risk: Each class represents 20% or more of the total dataset
- Medium Risk: Each class represents between 10% and 19.9% of the total dataset
- High Risk: Any class represents less than 10% of the total dataset
• Overall Requirement:
- All classes must achieve at least Medium Risk status to pass
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the ClassImbalance test for review:

vm.tests.run_test(
    "validmind.data_validation.ClassImbalance",
    inputs={
        "dataset": vm_raw_dataset,
    },
    params={
        "min_percent_threshold": 10,
    },
)
High Cardinality
In the case below, the context specifies risk-based criteria for the number of distinct values in categorical features.
This helps the LLM generate more nuanced and actionable insights, ensuring the descriptions are more relevant to your organization's policies.
= """
test_context
Acceptance Criteria:
• Risk Levels for Distinct Values in Categorical Features:
- Low Risk: Each categorical column has fewer than 50 distinct values or less than 5% unique values relative to the total dataset size
- Medium Risk: Each categorical column has between 50 and 100 distinct values or between 5% and 10% unique values
- High Risk: Any categorical column has more than 100 distinct values or more than 10% unique values
• Overall Requirement:
- All categorical columns must achieve at least Medium Risk status to pass
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the HighCardinality test for review:

vm.tests.run_test(
    "validmind.data_validation.HighCardinality",
    inputs={
        "dataset": vm_raw_dataset,
    },
    params={
        "num_threshold": 100,
        "percent_threshold": 0.1,
        "threshold_type": "percent",
    },
)
Missing Values
Here, we use the test-specific context to establish differentiated risk thresholds across features.
Rather than applying uniform criteria, the context allows for specific requirements for critical financial features (CreditScore, Balance, Age).
= """
test_context Test-Specific Context for Missing Values Analysis:
Acceptance Criteria:
• Risk Levels for Missing Values:
- Low Risk: Less than 1% missing values in any column
- Medium Risk: Between 1% and 5% missing values
- High Risk: More than 5% missing values
• Feature-Specific Requirements:
- Critical Features (CreditScore, Balance, Age):
* Must maintain Low Risk status
* No missing values allowed
- Secondary Features (Tenure, NumOfProducts, EstimatedSalary):
* Must achieve at least Medium Risk status
* Up to 3% missing values acceptable
- Categorical Features (Geography, Gender):
* Must achieve at least Medium Risk status
* Up to 5% missing values acceptable
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the MissingValues test for review:

vm.tests.run_test(
    "validmind.data_validation.MissingValues",
    inputs={
        "dataset": vm_raw_dataset,
    },
    params={
        "min_threshold": 1,
    },
)
Unique Rows
This example context establishes variable-specific thresholds based on business expectations.
Rather than applying uniform criteria, it recognizes that high variability is expected in features like EstimatedSalary (>90%) and Balance (>50%), while enforcing strict limits on categorical features like Geography (<5 values), ensuring meaningful validation aligned with banking data characteristics.
= """
test_context
Acceptance Criteria:
• High-Variability Expected Features:
- EstimatedSalary: Must have >90% unique values
- Balance: Must have >50% unique values
- CreditScore: Must have between 5-10% unique values
• Medium-Variability Features:
- Age: Should have between 0.5-2% unique values
- Tenure: Should have between 0.1-0.5% unique values
• Low-Variability Features:
- Binary Features (HasCrCard, IsActiveMember, Gender, Exited): Must have exactly 2 unique values
- Geography: Must have fewer than 5 unique values
- NumOfProducts: Must have fewer than 10 unique values
• Overall Requirements:
- Features must fall within their specified ranges to pass
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the UniqueRows test for review:

vm.tests.run_test(
    "validmind.data_validation.UniqueRows",
    inputs={
        "dataset": vm_raw_dataset,
    },
    params={
        "min_percent_threshold": 1,
    },
)
Too Many Zero Values
Here, test-specific context is used to provide meaning and expectations for different variables:
- For instance, zero values in Balance and Tenure indicate risk, whereas zeros in binary variables like HasCrCard or IsActiveMember are expected.
- This tailored context ensures that the analysis accurately reflects the business significance of zero values across different features.
= """
test_context
Acceptance Criteria:
- Numerical Features Only: Test evaluates only continuous numeric columns (Balance, Tenure),
excluding binary columns (HasCrCard, IsActiveMember)
- Risk Level Thresholds for Balance and Tenure:
- High Risk: More than 5% zero values
- Medium Risk: Between 3% and 5% zero values
- Low Risk: Less than 3% zero values
- Individual Column Requirements:
- Balance: Must be Low Risk (banking context requires accurate balance tracking)
- Tenure: Must be Low or Medium Risk (some zero values acceptable for new customers)
• Overall Test Result: Test must achieve "Pass" status (Low Risk) for Balance, and at least Medium Risk for Tenure
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the TooManyZeroValues test for review:

vm.tests.run_test(
    "validmind.data_validation.TooManyZeroValues",
    inputs={
        "dataset": vm_raw_dataset,
    },
    params={
        "max_percent_threshold": 0.03,
    },
)
IQR Outliers Table
In this case, we use test-specific context to incorporate risk levels tailored to key variables like CreditScore, Age, and NumOfProducts. Without this context, the test would evaluate all variables uniformly, with no business criteria applied to the outlier analysis.
= """
test_context
Acceptance Criteria:
- Risk Levels for Outliers:
- Low Risk: 0-50 outliers
- Medium Risk: 51-300 outliers
- High Risk: More than 300 outliers
- Feature-Specific Requirements:
- CreditScore, Age, NumOfProducts: Must maintain Low Risk status to ensure data quality and model reliability
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the IQROutliersTable test for review:

vm.tests.run_test(
    "validmind.data_validation.IQROutliersTable",
    inputs={
        "dataset": vm_raw_dataset,
    },
    params={
        "threshold": 1.5,
    },
)
Descriptive Statistics
Test-specific context is used in this case to provide risk-based thresholds aligned with the bank’s policy.
For instance, CreditScore ranges of 550-850 are considered low risk based on standard credit assessment practices, while Balance thresholds reflect typical retail banking ranges.
= """
test_context
Acceptance Criteria:
• CreditScore:
- Low Risk: 550-850
- Medium Risk: 450-549
- High Risk: <450 or missing
- Justification: Banking standards require reliable credit assessment
• Age:
- Low Risk: 18-75
- Medium Risk: 76-85
- High Risk: >85 or <18
- Justification: Core banking demographic with age-appropriate products
• Balance:
- Low Risk: 0-200,000
- Medium Risk: 200,001-250,000
- High Risk: >250,000
- Justification: Typical retail banking balance ranges
• Tenure:
- Low Risk: 1-10 years
- Medium Risk: <1 year
- High Risk: 0 or >10 years
- Justification: Expected customer relationship duration
• EstimatedSalary:
- Low Risk: 25,000-150,000
- Medium Risk: 150,001-200,000
- High Risk: <25,000 or >200,000
- Justification: Typical income ranges for retail banking customers
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the DescriptiveStatistics test for review:

vm.tests.run_test(
    "validmind.data_validation.DescriptiveStatistics",
    inputs={
        "dataset": vm_raw_dataset,
    },
)
Pearson Correlation Matrix
For this test, the context provides meaningful correlation ranges between specific variable pairs based on business criteria.
For example, while a general correlation analysis might flag any correlation above 0.7 as concerning, the test-specific context specifies that Balance and NumOfProducts should maintain a negative correlation between -0.4 and 0, reflecting expected banking relationships.
= """
test_context
Acceptance Criteria:
• Target Variable Correlations (Exited):
- Must show correlation coefficients between ±0.1 and ±0.3 with Age, CreditScore, and Balance
- Should not exceed ±0.2 correlation with other features
- Justification: Ensures predictive power while avoiding target leakage
• Feature Correlations:
- Balance & NumOfProducts: Must maintain correlation between -0.4 and 0
- Age & Tenure: Should show positive correlation between 0.1 and 0.3
- CreditScore & Balance: Should maintain correlation between 0.1 and 0.3
• Binary Feature Correlations:
- HasCreditCard & IsActiveMember: Must not exceed ±0.15 correlation
- Binary features should not show strong correlations (>±0.2) with continuous features
• Overall Requirement:
- No feature pair should exceed ±0.7 correlation to avoid multicollinearity
""".strip()
= f"""
context {use_case_context}
{test_context}
""".strip()
"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context os.environ[
With the test-specific context set, generate a test description for the PearsonCorrelationMatrix test for review:

vm.tests.run_test(
    "validmind.data_validation.PearsonCorrelationMatrix",
    inputs={
        "dataset": vm_raw_dataset,
    },
)
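As shown earlier, once you've finished reviewing the context-enriched descriptions, you can disable custom context again so subsequent tests revert to the default output:

os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "0"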