Customize test result descriptions

When you run ValidMind tests, test descriptions are automatically generated by an LLM using the test results, the test name, and the static test definitions provided in the test's docstring. While these generated descriptions offer valuable high-level overviews of tests, they may not always align with your specific use cases or incorporate organizational policy requirements.

In this notebook, you'll learn how to take complete control over the context that drives test description generation. ValidMind provides a context parameter in run_test that accepts a dictionary with three complementary keys for comprehensive context management:

  • instructions: Overwrites ValidMind’s default result description structure. If you provide custom instructions, they take full priority over the built-in ones. This parameter controls how the final description is structured and presented. Use this to specify formatting requirements, target different audiences (executives vs. technical teams), or ensure consistent report styles across your organization.

  • test_description: Overwrites the test’s built-in docstring if provided. This parameter contains the technical mechanics of how the test works. However, for generic tests where the methodology isn't the focus, you may use this to describe what's actually being analyzed—the specific variables, features, or metrics being plotted and their business meaning rather than the statistical mechanics. You can also override ValidMind's built-in test documentation if you prefer different structure or language.

  • additional_context: Does not overwrite the instructions or test descriptions, but instead adds to them. This parameter provides any background information you want the LLM to consider when analyzing results. It could include business priorities, acceptance thresholds, regulatory requirements, domain expertise, use case details, model purpose, or stakeholder concerns—any information that helps the LLM better understand and interpret your specific situation.

Together, these context parameters allow you to manage every aspect of how the LLM interprets and presents your test results. Whether you need to align descriptions with regulatory requirements, target specific audiences, incorporate organizational policies, or ensure consistent reporting standards, this context management approach gives you the flexibility to generate descriptions that perfectly match your needs while still leveraging the analytical power of AI-generated insights.
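
As a quick orientation, here is a minimal sketch of how the three keys fit into a single run_test call. All keys are optional and can be combined; the placeholder strings below are illustrative only, and the vm_test_ds and vm_model inputs are created in the Setup and Model development sections that follow.

# Sketch only: each context key is optional and is explored in detail later in this notebook
vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
    },
    context={
        "instructions": "How to structure and format the generated description",
        "additional_context": "Business background the LLM should consider",
        "test_description": "Replacement text for the test's built-in docstring",
    },
)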

Setup

This section covers the basic setup required to run the examples in this notebook. We'll install ValidMind, connect to the platform, and create a customer churn model that we'll use to demonstrate the instructions, additional_context, and test_description parameters throughout the examples.

Install the ValidMind Library

To install the library:

%pip install -q validmind

Initialize the ValidMind Library

ValidMind generates a unique code snippet for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

Get your code snippet

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Model Inventory and click + Register Model.

  3. Enter the model details and click Continue. (Need more help?)

    For example, to register a model for use with this notebook, select:

    • Documentation template: Binary classification
    • Use case: Marketing/Sales - Attrition/Churn Management

    You can fill in other options according to your preference.

  4. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)

Initialize the Python environment

After you've connected to your registered model in the ValidMind Platform, let's import the necessary libraries and set up your Python environment for data analysis:

import xgboost as xgb
import os

%matplotlib inline

Model development

Now we'll build the customer churn model using XGBoost and ValidMind's sample dataset. This trained model will generate the test results we'll use to demonstrate the instructions, additional_context, and test_description parameters.

Load data

First, we'll import a sample ValidMind dataset and load it into a pandas dataframe:

# Import the sample dataset from the library

from validmind.datasets.classification import customer_churn

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{customer_churn.target_column}' \n\t• Class labels: {customer_churn.class_labels}"
)

raw_df = customer_churn.load_data()
raw_df.head()

Fit the model

Then, we prepare the data and model by first splitting the DataFrame into training, validation, and test sets, then separating features from targets. An XGBoost classifier is initialized with early stopping, evaluation metrics (error, logloss, and auc) are defined, and the model is trained on the training data with validation monitoring.

train_df, validation_df, test_df = customer_churn.preprocess(raw_df)

x_train = train_df.drop(customer_churn.target_column, axis=1)
y_train = train_df[customer_churn.target_column]
x_val = validation_df.drop(customer_churn.target_column, axis=1)
y_val = validation_df[customer_churn.target_column]

model = xgb.XGBClassifier(early_stopping_rounds=10)
model.set_params(
    eval_metric=["error", "logloss", "auc"],
)
model.fit(
    x_train,
    y_train,
    eval_set=[(x_val, y_val)],
    verbose=False,
)

Initialize the ValidMind objects

Before you can run tests, you'll need to initialize a ValidMind dataset object using the init_dataset function from the ValidMind (vm) module.

We'll include the following arguments:

  • dataset — the raw dataset that you want to provide as input to tests
  • input_id — a unique identifier that allows tracking what inputs are used when running each individual test
  • target_column — a required argument if tests require access to true values. This is the name of the target column in the dataset
  • class_labels — an optional value to map predicted classes to class labels

With all datasets ready, you can now initialize the raw, training, and test datasets (raw_df, train_df and test_df) created earlier into their own dataset objects using vm.init_dataset():

vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column=customer_churn.target_column,
    class_labels=customer_churn.class_labels,
)

vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=customer_churn.target_column,
)

vm_test_ds = vm.init_dataset(
    dataset=test_df, input_id="test_dataset", target_column=customer_churn.target_column
)

Additionally, you'll need to initialize a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data.

Simply initialize this model object with vm.init_model():

vm_model = vm.init_model(
    model,
    input_id="model",
)

We can now use the assign_predictions() method from the Dataset object to link existing predictions to any model.

If no prediction values are passed, the method will compute predictions automatically:

vm_train_ds.assign_predictions(
    model=vm_model,
)

vm_test_ds.assign_predictions(
    model=vm_model,
)

Understanding test result descriptions

Before diving into custom instructions, let's understand how ValidMind generates test descriptions by default.

Default LLM-generated descriptions

When you run a test without custom instructions, ValidMind's LLM analyzes:

  • The test results (tables, figures)
  • The test's built-in documentation (docstring)

When ValidMind generates test descriptions automatically (without custom instructions), the LLM follows a series of standardized sections designed to provide comprehensive, objective analysis of test results:

  • Test purpose: This section opens with a clear explanation of what the test does and why it exists. It draws from the test’s documentation and presents the purpose in accessible, straightforward language.

  • Test mechanism: Here the description outlines how the test works, including its methodology, what it measures, and how those measurements are derived. For statistical tests, it also explains the meaning of each metric, how values are typically interpreted, and what ranges are expected.

  • Test strengths: This part highlights the value of the test by pointing out its key strengths and the scenarios where it is most useful. It also notes the kinds of insights it can provide that other tests may not capture.

  • Test limitations: Limitations focus on both technical constraints and interpretation challenges. The text notes when results should be treated with caution and highlights specific risk indicators tied to the test type.

  • Results interpretation: The results section explains how to read the outputs, whether tables or figures, and clarifies what each column, axis, or metric means. It also points out key data points, units of measurement, and any notable observations that help frame interpretation.

  • Key insights: Insights are listed in bullet points, moving from broad to specific. Each one has a clear title, includes relevant numbers or ranges, and ensures that all important aspects of the results are addressed.

  • Conclusions: The conclusion ties the insights together into a coherent narrative. It synthesizes the findings into objective technical takeaways and emphasizes what the results reveal about the model or data.

Let's see a default description:

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
    },
)

Customizing results structure with instructions

While the default descriptions are designed to be comprehensive, there are many cases where you might want to tailor them for your specific context. Customizing test results allows you to shape descriptions to fit your organization’s standards and practical needs. This can involve adjusting report formats, applying specific risk rating scales, adding mandatory disclaimer text, or emphasizing particular metrics.

The instructions parameter is what enables this flexibility by adapting the generated descriptions to different audiences and test types. Executives often need concise summaries that emphasize overall risk, data scientists look for detailed explanations of the methodology behind tests, and compliance teams require precise language that aligns with regulatory expectations. Different test types also demand different emphases: performance metrics may benefit from technical breakdowns, while validation checks might require risk-focused narratives.

Simple instruction example

Let's start with simple examples of the instructions parameter. Here's how to provide basic guidance to the LLM-generated descriptions:

simple_instructions = """
Please focus on business impact and provide a concise summary. 
Include specific actionable recommendations.
"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
    },
    context={
        "instructions": simple_instructions,
    },
)

Structured format instructions

You can request specific formatting and structure:

structured_instructions = """
Please structure your analysis using the following format:

### Executive Summary
- One sentence overview of the test results

### Key Findings
- Bullet points with the most important insights
- Include specific percentages and thresholds

### Risk Assessment
- Classify risk level as Low/Medium/High
- Explain reasoning for the risk classification

### Recommendations
- Specific actionable next steps
- Priority level for each recommendation

"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
    },
    context={
        "instructions": structured_instructions,
    },
)

Template with LLM fill-ins

One of the most powerful features is combining hardcoded text with LLM-generated content using placeholders. This allows you to ensure specific information is always included while still getting intelligent analysis of the results.

Create a template where specific sections are filled by the LLM:

template_instructions = """
Please generate the description using this exact template. 
Fill in the [PLACEHOLDER] sections with your analysis:

---
**VALIDATION REPORT: CLASSIFIER PERFORMANCE ASSESSMENT**

**Dataset ID:** test_dataset
**Validation Type:** Classification Performance Analysis
**Reviewer:** ValidMind AI Analysis

**EXECUTIVE SUMMARY:**
[PROVIDE_2_SENTENCE_SUMMARY_OF_RESULTS]

**KEY FINDINGS:**
[ANALYZE_AND_LIST_TOP_3_MOST_IMPORTANT_FINDINGS_WITH_VALUES]

**CLASSIFICATION PERFORMANCE ASSESSMENT:**
[DETAILED_ANALYSIS_OF_CLASSIFICATION_PERFORMANCE_PATTERNS_AND_IMPACT]

**RISK RATING:** [ASSIGN_LOW_MEDIUM_HIGH_RISK_WITH_JUSTIFICATION]

**RECOMMENDATIONS:**
[PROVIDE_SPECIFIC_ACTIONABLE_RECOMMENDATIONS_NUMBERED_LIST]

**VALIDATION STATUS:** [PASS_CONDITIONAL_PASS_OR_FAIL_WITH_REASONING]

---
*This report was generated using ValidMind's automated validation platform.*
*For questions about this analysis, contact the Data Science team.*
---

Important: Use the exact template structure above and fill in each [PLACEHOLDER] section.
"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
    },
    context={
        "instructions": template_instructions,
    },
)

Mixed static and dynamic content

Combine mandatory text with intelligent analysis:

# Mixed static and dynamic content
mixed_content_instructions = """
Return ONLY the assembled content in plain Markdown paragraphs and lists.
Do NOT include any headings or titles (no lines starting with '#'), labels,
XML-like tags (<MANDATORY>, <PLACEHOLDER>), variable names, or code fences.
Do NOT repeat or paraphrase these instructions. Start the first line with the
first mandatory sentence below—no preface.

You MUST include all MANDATORY blocks verbatim (exact characters, spacing, and punctuation).
You MUST replace PLACEHOLDER blocks with the requested content.
Between blocks, include exactly ONE blank line.

MANDATORY BLOCK A (include verbatim):
This data validation assessment was conducted in accordance with the 
XYZ Bank Model Risk Management Policy (Document ID: MRM-2024-001). 
All findings must be reviewed by the Model Validation Team before 
model deployment.

PLACEHOLDER BLOCK B (replace with prose paragraphs; no headings):
[Provide detailed analysis of the test results, including specific values, 
interpretations, and implications for model quality. Focus on classification performance quality 
aspects and potential issues that could affect model performance.]

MANDATORY BLOCK C (include verbatim):
IMPORTANT: This automated analysis is supplementary to human expert review. 
All high-risk findings require immediate escalation to the Chief Risk Officer. 
Model deployment is prohibited until all Medium and High risk items are resolved.

PLACEHOLDER BLOCK D (replace with a numbered list only):
[Create a numbered list of specific action items with responsible parties 
and suggested timelines for resolution.]

MANDATORY BLOCK E (include verbatim):
Validation performed using ValidMind Platform v2.0 | 
Next review required: [30 days from test date] | 
Contact: model-risk@xyzbank.com

Compliance checks BEFORE you finalize your answer:
- No headings or titles present (no '#' anywhere).
- No tags (<MANDATORY>, <PLACEHOLDER>) or labels (e.g., "BLOCK A") in the output.
- All three MANDATORY blocks included exactly as written.
- PLACEHOLDER B replaced with prose; PLACEHOLDER D replaced with a numbered list.
- Exactly one blank line between each block.
"""


vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
    },
    context={
        "instructions": mixed_content_instructions,
    },
)

Enriching results with additional context

While the instructions parameter controls how your test descriptions are formatted and structured, the additional_context parameter provides background information about what the results mean for your specific business situation. Think of instructions as the "presentation guide" and additional_context as the "business background" that helps the LLM understand what matters most in your organization and how to interpret the results in your specific context.

Understanding the additional context parameter

The additional_context parameter can be used to add any background information that helps put the test results into context. For example, you might include business priorities and constraints that shape how results are interpreted, risk tolerance levels or acceptance criteria specific to your organization, regulatory requirements that influence what counts as acceptable performance, or details about the intended use case of the model in production. These are just examples—the parameter is flexible and can capture whatever context is most relevant to your needs.

Key difference:

  • instructions: "Write a 3-paragraph executive summary"

  • additional_context: "If Accuracy is above 0.85 but Class 1 Recall falls below 0.60, the model should be considered high risk"

When used together, these parameters create descriptions that don’t just report the Recall or Accuracy measures for Class 1, but explain that because Accuracy is above 0.85 while Recall falls below 0.60, the model should be treated as high risk for your business.

Basic additional context usage

Here's how business context transforms the interpretation of our classifier results:

simple_context = """
MODEL CONTEXT:
- Class 0 = Customer stays (retains banking relationship)
- Class 1 = Customer churns (closes accounts, leaves bank)

DECISION RULES:
- ROC AUC >0.9: APPROVE deployment
- ROC AUC <0.9: REJECT model

CHURN DETECTION RULES:
- Recall >50% for churning customers: Good - use high-touch retention  
- Recall <50% for churning customers: Poor - retention program will fail
"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
    },
    context={
        "additional_context": simple_context,
    },
)

Combining instructions and additional context

Here's how combining both parameters creates targeted analysis of our churn model performance, using additional_context to pass both static business rules and dynamic real-time information like analysis dates:

from datetime import datetime

# Get today's date
today = datetime.now().strftime("%B %d, %Y")

# Executive decision instructions with date placeholder
executive_instructions = """
Create a GO/NO-GO decision memo following this template:

<TEMPLATE>
**DATE:** [Use analysis date from context]
**THRESHOLD ANALYSIS:** [Pass/Fail against specific thresholds]
**BUSINESS IMPACT:** [Revenue impact of current performance]  
**DEPLOYMENT DECISION:** [APPROVE/CONDITIONAL/REJECT]
**REQUIRED ACTIONS:** [Specific next steps with timelines]
</TEMPLATE>

Be definitive - use the thresholds to make clear recommendations.
"""

# Retail banking with hard thresholds including date
retail_thresholds = f"""
RETAIL BANKING CONTEXT (Analysis Date: {today}):
- Class 0 = Customer retention (keeps checking/savings accounts)
- Class 1 = Customer churn (closes accounts, switches banks)

REGULATORY THRESHOLDS:
- AUC >0.80: Meets regulatory model standards
- Churn Recall >55%: Adequate churn detection 
- Churn Precision >65%: Cost-effective targeting 

DEPLOYMENT CRITERIA:
- All 3 Pass: FULL DEPLOYMENT
- 2 Pass: CONDITIONAL DEPLOYMENT
- <2 Pass: REJECT MODEL

"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
    },
    context={
        "instructions": executive_instructions,
        "additional_context": retail_thresholds,
    },
)

Overriding test documentation with test description parameter

Each test, whether built-in or custom, includes a docstring that serves as its default documentation. This docstring usually explains what the test does and what it outputs. In many cases, especially for specialized tests with well-defined purposes, the default docstring is already useful and sufficient.

Structure of ValidMind built-in test docstrings

Every ValidMind built-in test includes a docstring that serves as its default documentation. This docstring follows a consistent structure so that both users and the LLM can rely on a predictable format. While the content varies depending on the type of test—for example, highly specific tests like SHAP values or PSI provide technical detail, whereas generic tests like descriptive statistics or histograms are more general—the overall layout remains the same.

A typical docstring contains the following sections:

  • Overview: A short description of what the test does and what kind of output it generates.

  • Purpose: Explains why the test exists and what it is designed to evaluate. This section provides the context for the test’s role in model documentation, often describing the intended use cases or the kind of insights it supports.

  • Test mechanism: Describes how the test works internally. This includes the approach or methodology, what inputs are used, how results are calculated or visualized, and the logic behind the test’s implementation.

  • Signs of high risk: Outlines risk indicators that are specific to the test. These highlight situations where results should be interpreted with caution—for example, imbalances in distributions or errors in processing steps.

  • Strengths: Highlights the capabilities and benefits of the test, explaining what makes it particularly useful and what kinds of insights it provides that may not be captured elsewhere.

  • Limitations: Discusses the constraints of the test, including technical shortcomings, interpretive challenges, and situations where the results might be misleading or incomplete.

This structure ensures that all built-in tests provide a comprehensive explanation of their purpose, mechanics, strengths, and limitations. For more generic tests, the docstring may read as boilerplate information about the test's mechanics. In these cases, the test_description parameter can be used to override the docstring with context that is more relevant to the dataset, feature, or business use case under analysis.

Understanding the test description parameter

Overriding the docstring with the test_description parameter is particularly valuable for more generic tests, where the default text often focuses on the mechanics of producing an output rather than the data or variable being analyzed. For example, instead of documenting the details of the methodology used to compute a histogram, you may want to document the business meaning of the feature being visualized, its expected distribution, or what to pay attention to. Similarly, when generating a descriptive statistics table, you may prefer documentation that describes the dataset under review.

Customizing the test description allows you to shift the focus of the explanation from the test machinery to the aspects of the data that matter most for your audience, while still relying on the built-in docstring for cases where the default detail is already fit for purpose.

When to override

For tests like histograms or descriptive statistics where the statistical methodology is standard and uninteresting, replace the generic documentation with meaningful descriptions of the variables being analyzed. Also use this to customize ValidMind's built-in test documentation when you want different terminology, structure, or emphasis than what's provided by default.
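
For instance, a feature-focused override for a generic plotting test might look like the sketch below. The TabularNumericalHistograms test ID and its single dataset input are assumptions based on the test listing; substitute whichever generic test you are documenting.

# Hypothetical sketch: describe the features being plotted instead of the
# histogram mechanics covered by the default docstring
feature_description = """
Histograms of the numerical features in the raw customer churn dataset.
The focus of this analysis is the shape of each feature's distribution and any
concentrations or outliers that could affect the churn model, rather than the
mechanics of constructing a histogram.
"""

vm.tests.run_test(
    "validmind.data_validation.TabularNumericalHistograms",  # assumed test ID
    inputs={"dataset": vm_raw_dataset},
    context={
        "test_description": feature_description,
    },
)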

Basic test description usage

custom_description = """
This test evaluates customer churn prediction model performance specifically 
for retail banking applications. The analysis focuses on classification 
metrics relevant to customer retention programs and regulatory compliance 
requirements under our internal Model Risk Management framework.

Key metrics analyzed:
- Precision: Accuracy of churn predictions to minimize wasted retention costs
- Recall: Coverage of actual churners to maximize retention program effectiveness  
- F1-Score: Balanced measure considering both precision and recall
- ROC AUC: Overall discriminatory power for regulatory model approval

Results inform deployment decisions for automated retention campaigns.
"""

result = vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "model": vm_model,
        "dataset": vm_test_ds
    },
    context={
        "test_description": custom_description,
    },
)

Combining test description with instructions and additional context

# All three parameters working together
banking_test_description = """
Customer Churn Risk Assessment Test for Retail Banking.
Evaluates model's ability to identify customers likely to close accounts 
and switch to competitor banks within 12 months.
- Class 0 = Customer retention (maintains banking relationship)
- Class 1 = Customer churn (closes primary accounts)
"""

executive_instructions = """
Format as a risk committee briefing:
**TEST DESCRIPTION:** [Test description]
**RISK ASSESSMENT:** [Model risk level]
**REGULATORY STATUS:** [Compliance with banking regulations]
**BUSINESS RECOMMENDATION:** [Deploy/Hold/Reject with rationale]
"""

banking_context = """
REGULATORY CONTEXT:
- OCC guidance requires AUC >0.80 for model approval
- Our threshold: Churn recall >50% for retention program viability
"""

result = vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "model": vm_model,
        "dataset": vm_test_ds
    },
    context={
        "test_description": banking_test_description,
        "instructions": executive_instructions,
        "additional_context": banking_contetx,
    },
)

Best practices for managing context

When using instructions, additional_context, and test_description parameters together, follow these guidelines to create effective, consistent, and maintainable test descriptions.

Choose the right parameter for each need:

  • Use test_description for technical corrections when you need to fix or clarify test methodology, override ValidMind's built-in documentation with your preferred structure or terminology, replace generic test mechanics with meaningful descriptions of variables and features being analyzed, or provide domain-specific context for regulatory compliance.

  • Apply additional_context for business rules like performance thresholds and decision criteria, business context such as customer economics and operational constraints, threshold-driven decision logic, regulatory requirements, real-time information like dates or risk indicators, stakeholder priorities, or any background information that helps the LLM interpret results in your specific context.

  • Leverage instructions for audience targeting and presentation: control format and style, create structured templates with specific sections and placeholders for LLM fill-ins, combine hardcoded mandatory text with dynamic analysis, and ensure consistent organizational reporting standards across different stakeholder groups.

Avoid redundancy:

Don't repeat the same information across multiple parameters, as each parameter should add unique value to the description generation. If content overlaps, choose the most appropriate parameter for that information to maintain clarity and prevent conflicting or duplicate guidance in your test descriptions.

Increasing consistency and grounding:

Since LLMs can produce variable responses, use hardcoded sections in your instructions for content that requires no variability, combined with specific placeholders for data you trust the LLM to generate. For example, include mandatory disclaimers, policy references, and fixed formatting exactly as written, while using placeholders like [ANALYZE_PERFORMANCE_METRICS] for dynamic content. This approach ensures critical information appears consistently while still leveraging the LLM's analytical capabilities.

Use test_description and additional_context parameters to anchor test results descriptions in your specific domain and business context, preventing the LLM from generating generic or inappropriate interpretations. Then use instructions to explicitly direct the LLM to ground its analysis in this provided context, such as "Base all recommendations on the thresholds specified in the additional context section" or "Interpret all metrics according to the test description provided."
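
As a minimal sketch of this grounding pattern, reusing the classifier test and the thresholds quoted in the earlier examples:

# Sketch: instructions that explicitly direct the LLM to ground its analysis
# in the additional context supplied alongside them
grounded_instructions = """
Base all recommendations on the thresholds specified in the additional context
section, and interpret all metrics according to the test description provided.
"""

grounding_context = """
ACCEPTANCE THRESHOLDS:
- ROC AUC >0.80: acceptable discriminatory power
- Churn recall >50%: retention program viability
"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
    },
    context={
        "instructions": grounded_instructions,
        "additional_context": grounding_context,
    },
)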
