Configure judge LLM and judge embeddings

This notebook shows how to configure and validate the default judge LLM and judge embeddings used by the ValidMind Library for LLM-focused tests.

It exercises three important paths:

1. Prompt-validation tests, which depend on the default judge LLM.
2. RAGAS-based tests, which depend on both the default judge LLM and the default judge embeddings model.
3. DeepEval scorers, which use the judge LLM configured via set_judge_config().

The notebook automatically selects the available provider from your environment, with OpenAI taking precedence when both OpenAI and Gemini keys are set, to match the library's default-provider logic.

Introduction

This notebook walks through the provider configuration used by three evaluation paths:

- prompt-validation tests
- RAGAS-based tests
- DeepEval scorers

Along the way, you will initialize ValidMind model and dataset objects, inspect the resolved judge configuration, run representative tests, and optionally log the results to the ValidMind Platform. By the end of the notebook, you will have a practical reference for configuring judge models and understanding how those settings affect different LLM evaluation workflows.

About ValidMind

ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.

You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.

Before you begin

Before running this notebook, make sure you have:

- a Python environment with the ValidMind Library and its LLM dependencies installed
- access to a ValidMind account if you want to log results to the ValidMind Platform
- credentials for one supported judge provider in your environment

This notebook supports:

- OpenAI via OPENAI_API_KEY, with optional OPENAI_MODEL and OPENAI_EMBEDDINGS_MODEL overrides. The current default judge model is gpt-4.1 and the default embeddings model is text-embedding-3-small.
- Gemini via GOOGLE_API_KEY or GEMINI_API_KEY, with optional GEMINI_MODEL and GEMINI_EMBEDDINGS_MODEL overrides. The current defaults are gemini-2.5-pro and models/text-embedding-004.
- Azure OpenAI via AZURE_OPENAI_KEY, AZURE_OPENAI_ENDPOINT, and AZURE_OPENAI_MODEL. The current default embeddings model is text-embedding-3-small.
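To make the override behavior concrete, here is a minimal sketch of how an override variable could fall back to a documented default. The default names are the ones quoted above; the helper `effective_models` and the `DEFAULTS` table are hypothetical and only illustrate the idea, they are not the library's implementation.

```python
# Defaults quoted from the list above. The lookup logic is an illustrative
# assumption, not the ValidMind Library's own code.
DEFAULTS = {
    "openai": {
        "model": ("OPENAI_MODEL", "gpt-4.1"),
        "embeddings": ("OPENAI_EMBEDDINGS_MODEL", "text-embedding-3-small"),
    },
    "gemini": {
        "model": ("GEMINI_MODEL", "gemini-2.5-pro"),
        "embeddings": ("GEMINI_EMBEDDINGS_MODEL", "models/text-embedding-004"),
    },
}


def effective_models(provider, env):
    """Return the judge model and embeddings names for a provider (hypothetical helper)."""
    resolved = {}
    for role, (var_name, default) in DEFAULTS[provider].items():
        # An unset or empty override falls back to the documented default.
        resolved[role] = env.get(var_name) or default
    return resolved


# Example with a plain dict standing in for os.environ.
print(effective_models("openai", {"OPENAI_MODEL": "gpt-4.1-mini"}))
```

In a notebook you could pass `os.environ` instead of the sample dict to see which names your session would likely use.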

You can still run the notebook locally without connecting to the ValidMind Platform, but connecting a model document makes it easier to review and share results after the tests complete.

New to ValidMind?

If you are new to the ValidMind Library, start with the ValidMind Library overview. It introduces the core workflow for initializing models and datasets, running tests, and logging outputs back to the ValidMind Platform.

You only need a ValidMind account if you want to log results to the ValidMind Platform.

Register with ValidMind

### Key concepts

- Judge LLM: The language model used by ValidMind to evaluate prompts, answers, contexts, and other LLM outputs.
- Judge embeddings: The embeddings model used when a test requires semantic similarity or retrieval-based comparison.
- Provider credentials: Environment variables that tell ValidMind which provider to use for judge evaluation. In this notebook, the provider is resolved automatically from the credentials available in your environment.
- ValidMind dataset: A dataset initialized with vm.init_dataset(). Wrapping a pandas DataFrame this way lets you pass the dataset into ValidMind tests with the metadata those tests expect.
- ValidMind model: A model initialized with vm.init_model(). In this notebook, we use a lightweight model object to run prompt-validation tests against a prompt template.
- Prompt-validation tests: Tests that evaluate prompt quality and instructions, such as clarity or bias, using a judge LLM.
- RAGAS tests: Retrieval-augmented generation tests that can rely on both a judge LLM and judge embeddings.
- DeepEval scorers: LLM-based scorers used for tasks such as answer relevancy and hallucination detection. These use the judge LLM configured via set_judge_config() and do not require judge embeddings.

Setting up

Install the ValidMind Library

Recommended Python versions

Python 3.8 <= x <= 3.14

Install the ValidMind Library with the optional LLM dependencies so the notebook can run prompt-validation tests, RAGAS tests, and DeepEval scorers:

%pip install -q "validmind[llm]"

Connect to the ValidMind Platform

If you want to log notebook outputs to the ValidMind Platform, start by selecting an existing model in your inventory or registering a new one. This notebook can run without platform connectivity, but linking it to a model document gives you a place to review the results after the examples finish.

Register or select a model

1. In a browser, log in to ValidMind.
2. In the left sidebar, select Inventory.
3. Select Model by clicking on {Record} Inventory, where {Record} is the currently active type of record.
4. Either select an existing model from the list, or click + Register Model to register a new one.
5. Complete the model details and stakeholder assignments if you are registering a new model.
6. Open the document where you want notebook results to be logged.

Using a real model document is especially helpful in this notebook because it lets you compare the locally executed tests with the sections available in your template.

Choose a documentation template

If you plan to log results from this notebook, make sure your model document uses a template that includes sections for the LLM evaluation results you want to capture.

This is important because tests that are not included in the selected template will not appear automatically in the Platform document, even if you run and log them successfully from the notebook. If you want to document those results as well, you can add the relevant sections or tests manually in the Platform.

Before running the notebook, preview the template structure and confirm that the document has the sections you expect for your workflow.

Get your code snippet

Initialize the ValidMind Library with the code snippet associated with your model document so that test results are uploaded to the correct destination in the ValidMind Platform.

1. In the model sidebar, open Getting Started.
2. Select the document you want to update.
3. Copy the generated code snippet.
4. Load the values from an .env file or replace the placeholders in the example below with your own values.

Using environment variables is usually the easiest way to keep the notebook portable across environments and avoid hard-coding connection details in the notebook itself.

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    api_host="https://app.prod.validmind.ai/api/v1/tracking",
    api_key="...",
    api_secret="...",
    document="documentation", # requires library >=2.12.0
    model="...",
)
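If you prefer to pass the connection values explicitly rather than pasting them into the cell, one option is to read them from the environment and fail early when something is missing. The sketch below assumes the variable names VM_API_HOST, VM_API_KEY, VM_API_SECRET, and VM_API_MODEL; check your own .env file for the names you actually use. The helper `connection_settings` is hypothetical.

```python
# Assumed variable names; adjust them to match your own .env file.
REQUIRED_VARS = ("VM_API_HOST", "VM_API_KEY", "VM_API_SECRET", "VM_API_MODEL")


def connection_settings(env):
    """Build keyword arguments for vm.init() from a mapping of environment variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        # Report only the variable names, never the secret values.
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "api_host": env["VM_API_HOST"],
        "api_key": env["VM_API_KEY"],
        "api_secret": env["VM_API_SECRET"],
        "model": env["VM_API_MODEL"],
    }


# Example with placeholder values standing in for os.environ.
sample_env = {
    "VM_API_HOST": "https://app.prod.validmind.ai/api/v1/tracking",
    "VM_API_KEY": "...",
    "VM_API_SECRET": "...",
    "VM_API_MODEL": "...",
}
print(sorted(connection_settings(sample_env)))
```

With real values loaded, the result could be unpacked as `vm.init(**connection_settings(os.environ), document="documentation")`.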

Initialize the notebook environment

Load environment variables and prepare the notebook session. In the execution cells that follow, you will import the libraries needed for this walkthrough, inspect the configured judge provider, and create the ValidMind objects used by the example tests.

This section is also where the notebook becomes reproducible: once your credentials and dependencies are in place, the remaining sections can be run top to bottom.

import os

import pandas as pd

from validmind.ai import utils as ai_utils
from validmind.models import Prompt
from validmind.tests import run_test

Getting to know ValidMind

Preview the documentation template

If you have already connected this notebook to a model document, you can preview the active template structure directly from the library.

This is useful for confirming where logged results will appear before you run the prompt-validation, RAGAS, and DeepEval examples below. It also helps you spot gaps early if a test you plan to run is not represented in the current template:

vm.preview_template()

View model documentation in the ValidMind Platform

After you run the notebook and log results, open your model document in the ValidMind Platform to review how the test outputs were added.

Comparing the template preview with the rendered document is a good way to confirm that your notebook is writing results to the expected sections. If a result does not appear automatically, check whether the corresponding test is part of the selected template before troubleshooting the notebook run itself.
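One lightweight way to spot template gaps is to compare the test IDs you plan to run against the test IDs you see in the template preview. The sketch below is only an illustration: the helper `missing_from_template` is hypothetical, and the template list is typed in by hand rather than read from the library.

```python
def missing_from_template(planned_tests, template_tests):
    """Return planned test IDs that the template does not list, in original order."""
    known = set(template_tests)
    return [test_id for test_id in planned_tests if test_id not in known]


# Test IDs used later in this notebook.
planned = [
    "validmind.prompt_validation.Clarity",
    "validmind.prompt_validation.Bias",
    "validmind.model_validation.ragas.Faithfulness",
]

# Hypothetical template contents, copied by hand after reading the preview.
template = [
    "validmind.prompt_validation.Clarity",
    "validmind.prompt_validation.Bias",
]

print(missing_from_template(planned, template))
```

Any ID printed here is a candidate for a section or test you may want to add manually in the Platform.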

Configure the judge provider

The next cells load your environment variables, resolve the judge provider from the credentials available in your session, and initialize the ValidMind Library for result logging.

This notebook uses the same provider resolution logic as the library itself:

- OpenAI is selected when OPENAI_API_KEY is available, with OPENAI_MODEL as an optional override. The current default judge model is gpt-4.1.
- Azure OpenAI is selected when Azure credentials are available, using AZURE_OPENAI_MODEL for the judge model.
- Gemini is selected when GOOGLE_API_KEY or GEMINI_API_KEY is available, with optional GEMINI_MODEL and GEMINI_EMBEDDINGS_MODEL overrides. The current defaults are gemini-2.5-pro and models/text-embedding-004.

If more than one provider is configured, OpenAI takes precedence to match the library default.
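The precedence rule can be pictured as a short chain of checks. This is a rough sketch, not the library's code: `resolve_provider` is a hypothetical helper, and only "OpenAI first" is stated in this notebook. The relative order of Azure OpenAI and Gemini shown here is an assumption.

```python
def resolve_provider(env):
    """Pick a judge provider name from a mapping of environment variables.

    Hypothetical helper: the OpenAI-first rule comes from this notebook, while
    the ordering of Azure OpenAI before Gemini is an assumption.
    """
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("AZURE_OPENAI_KEY") and env.get("AZURE_OPENAI_ENDPOINT"):
        return "azure_openai"
    if env.get("GOOGLE_API_KEY") or env.get("GEMINI_API_KEY"):
        return "gemini"
    # No supported credentials were found.
    return None


# Both OpenAI and Gemini keys are present, so OpenAI is chosen.
print(resolve_provider({"OPENAI_API_KEY": "sk-...", "GEMINI_API_KEY": "..."}))
```

Passing `os.environ` instead of the sample dict gives a quick guess at which provider your session is likely to resolve to; the printed judge configuration later in the notebook remains the authoritative check.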

This matters because the same default judge configuration is reused across multiple evaluation paths, so checking it once here makes the later test results easier to interpret.

# Optional: override the default judge models for this notebook session.
# os.environ["OPENAI_MODEL"] = "gpt-4.1"
# os.environ["GEMINI_MODEL"] = "gemini-2.5-pro"
# os.environ["GEMINI_EMBEDDINGS_MODEL"] = "models/text-embedding-004"

# Optional: explicitly set the judge LLM and embeddings using set_judge_config().
#
# This overrides automatic provider detection and applies to all three evaluation
# paths in this notebook: prompt-validation tests, RAGAS tests, and DeepEval scorers.
#
# from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
# from validmind.ai.utils import set_judge_config
# set_judge_config(
#     judge_llm=ChatGoogleGenerativeAI(model="gemini-2.0-flash"),
#     judge_embeddings=GoogleGenerativeAIEmbeddings(model="models/text-embedding-004"),
# )

The next cells import the required libraries, inspect the resolved provider configuration, and connect the notebook to the ValidMind Platform. Reading the printed provider and class names is a quick sanity check that your environment is using the judge setup you expect before any tests are executed.

Load credentials and resolve the provider

Run the next cells to:

- import the libraries used in this notebook
- inspect the provider selected from your environment
- inspect the resolved judge LLM and judge embeddings classes
- initialize the ValidMind Library with your platform credentials

If both OpenAI and Gemini credentials are available, OpenAI will be selected to match the default provider precedence used by the library.

This section gives you a concrete view of the effective configuration that the later prompt-validation, RAGAS, and DeepEval examples will use.

from validmind.ai.utils import get_client_and_model, get_judge_config

client, model = get_client_and_model()
judge_llm, judge_embeddings = get_judge_config()

print("resolved_model:", model)
print("judge_llm_type:", type(judge_llm).__name__)
print("judge_embeddings_type:", type(judge_embeddings).__name__)

# Useful for Gemini/OpenAI/Azure debugging
print("judge_llm:", judge_llm)
print("judge_embeddings:", judge_embeddings)
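Reading class names by eye works, but the same sanity check can be written as a small heuristic. The sketch below assumes the judge objects are LangChain-style classes whose names contain "OpenAI", "Azure", "Google", or "Gemini"; that naming is an assumption, and `guess_provider` is a hypothetical helper. The stand-in classes exist only so the example runs on its own.

```python
def guess_provider(obj):
    """Guess a provider from an object's class name (heuristic, assumed naming)."""
    class_name = type(obj).__name__
    # Check Azure first because Azure class names may also contain "OpenAI".
    if "Azure" in class_name:
        return "azure_openai"
    if "OpenAI" in class_name:
        return "openai"
    if "Google" in class_name or "Gemini" in class_name:
        return "gemini"
    return "unknown"


# Stand-in classes so the sketch is self-contained; they are not real clients.
class ChatOpenAI:
    pass


class GoogleGenerativeAIEmbeddings:
    pass


print(guess_provider(ChatOpenAI()))
print(guess_provider(GoogleGenerativeAIEmbeddings()))
```

In the notebook, calling `guess_provider(judge_llm)` and `guess_provider(judge_embeddings)` and comparing the two results is a quick way to notice a mixed-provider setup.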

Prompt-validation tests

This section validates the default judge LLM path with two representative prompt-validation tests. For this smoke test, we use a simple prompt-only model because these tests evaluate the prompt template itself and do not require model predictions.

The example below creates a ValidMind model with vm.init_model() and attaches a prompt template to it. That gives the tests a standard object to inspect, even though there is no real predictive model behind the example.

- Clarity checks whether the prompt instructions are clear and well-scoped.
- Bias checks whether the prompt structure or examples could induce biased behavior.

system_prompt = """
You are an AI assistant specialized in sentiment analysis for financial news.
You will classify each sentence as positive, negative, or neutral.
Respond only with the sentiment label.
""".strip()


def noop_predict(_):
    return "dummy"


vm_prompt_model = vm.init_model(
    input_id="judge_prompt_model",
    predict_fn=noop_predict,
    prompt=Prompt(template=system_prompt, variables=[]),
)

vm_prompt_model.prompt.template

run_test(
    test_id="validmind.prompt_validation.Clarity",
    inputs={"model": vm_prompt_model},
).log()

run_test(
    test_id="validmind.prompt_validation.Bias",
    inputs={"model": vm_prompt_model},
).log()

RAGAS tests

This section validates the default judge LLM plus default judge embeddings path. The selected tests are useful because they exercise the RAGAS integration that historically depended on the default OpenAI setup.

The example data is wrapped with vm.init_dataset(), which turns the pandas DataFrame into a ValidMind dataset object that can be passed directly into these tests.

- ResponseRelevancy exercises the judge LLM and embeddings path.
- AnswerCorrectness exercises semantic and factual comparison with judge embeddings.
- Faithfulness is a companion smoke test for the judge LLM path on RAG data.

These tests produce Plotly figures, so this notebook focuses on running and logging the results rather than comparing visual output in detail.

rag_df = pd.DataFrame(
    {
        "user_input": [
            "What happened to the company's revenue guidance?",
            "Why did the bank's stock decline?",
            "What was the announced dividend decision?",
        ],
        "retrieved_contexts": [
            [
                "The company raised its full-year revenue guidance after reporting strong demand in the enterprise segment.",
                "Management said the improved forecast was driven by larger-than-expected renewals.",
            ],
            [
                "The bank's stock declined after it reported higher-than-expected credit losses in its consumer portfolio.",
                "Executives also warned that provisions may remain elevated next quarter.",
            ],
            [
                "The board announced that it would keep the quarterly dividend unchanged.",
                "Management said capital return policy remains the same for now.",
            ],
        ],
        "response": [
            "The company increased its full-year revenue guidance after stronger enterprise demand.",
            "The bank's stock fell because it disclosed higher-than-expected credit losses.",
            "The company kept its dividend unchanged.",
        ],
        "reference": [
            "The company raised its full-year revenue guidance because demand in the enterprise segment was strong.",
            "The bank's shares dropped after it reported higher-than-expected credit losses.",
            "The board decided to leave the quarterly dividend unchanged.",
        ],
    }
)

vm_rag_ds = vm.init_dataset(
    dataset=rag_df,
    input_id="judge_rag_dataset",
    text_column="user_input",
    target_column="reference",
)
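Because judge calls cost time and tokens, it can help to check the shape of the rows before running the tests. The sketch below checks the four columns used in this example: three text columns and a list of context strings per row. The helper `find_row_problems` is hypothetical, and the column requirements are inferred from the sample data above rather than from the library.

```python
# Column names taken from the sample DataFrame above.
TEXT_COLUMNS = ("user_input", "response", "reference")
CONTEXT_COLUMN = "retrieved_contexts"


def find_row_problems(rows):
    """Return human-readable problems found in a list of row dictionaries."""
    problems = []
    for index, row in enumerate(rows):
        for column in TEXT_COLUMNS:
            value = row.get(column)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {index}: '{column}' should be a non-empty string")
        contexts = row.get(CONTEXT_COLUMN)
        is_valid_list = (
            isinstance(contexts, list)
            and len(contexts) > 0
            and all(isinstance(item, str) for item in contexts)
        )
        if not is_valid_list:
            problems.append(
                f"row {index}: '{CONTEXT_COLUMN}' should be a non-empty list of strings"
            )
    return problems


sample_rows = [
    {
        "user_input": "What was the announced dividend decision?",
        "retrieved_contexts": ["The board kept the quarterly dividend unchanged."],
        "response": "The company kept its dividend unchanged.",
        "reference": "The board decided to leave the quarterly dividend unchanged.",
    },
    # This row stores the contexts as a single string instead of a list.
    {
        "user_input": "Why did the bank's stock decline?",
        "retrieved_contexts": "Higher-than-expected credit losses.",
        "response": "The stock fell on credit losses.",
        "reference": "Shares dropped after higher-than-expected credit losses.",
    },
]

for problem in find_row_problems(sample_rows):
    print(problem)
```

With pandas, the same check can be applied to the example data by calling `find_row_problems(rag_df.to_dict("records"))`.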

run_test(
    test_id="validmind.model_validation.ragas.ResponseRelevancy",
    inputs={"dataset": vm_rag_ds},
).log()

run_test(
    test_id="validmind.model_validation.ragas.AnswerCorrectness",
    inputs={"dataset": vm_rag_ds},
).log()

run_test(
    test_id="validmind.model_validation.ragas.Faithfulness",
    inputs={"dataset": vm_rag_ds},
).log()

## DeepEval scorers

This section validates the DeepEval scorer path in validmind.scorers.llm.deepeval.

DeepEval scorers use the judge LLM configured via set_judge_config(), the same model used by prompt-validation and RAGAS tests. This means any provider you configure above (OpenAI, Gemini, Azure, or Vertex AI via langchain-google-vertexai) is automatically picked up by scorers without any additional setup.

As in the RAGAS example, we create a ValidMind dataset with vm.init_dataset() so the scorer workflow runs against the same kind of object customers would use in their own notebooks.

These scorers do not use the judge embeddings object. For this notebook, we use two representative examples:

- AnswerRelevancy
- Hallucination

They are included here so the notebook covers all three LLM evaluation surfaces:

- prompt-validation
- RAGAS
- DeepEval scorers

deepeval_df = pd.DataFrame(
    {
        "input": [
            "What is the capital of France?",
            "Why did the company raise its full-year guidance?",
            "What did the board decide about the quarterly dividend?",
        ],
        "actual_output": [
            "The capital of France is Paris.",
            "The company raised guidance because enterprise demand was stronger than expected.",
            "The board kept the quarterly dividend unchanged.",
        ],
        "context": [
            ["France's capital city is Paris."],
            [
                "Management raised its full-year guidance after reporting stronger-than-expected demand in the enterprise segment."
            ],
            [
                "The board announced that the quarterly dividend would remain unchanged."
            ],
        ],
    }
)

vm_deepeval_ds = vm.init_dataset(
    dataset=deepeval_df,
    input_id="judge_deepeval_dataset",
    text_column="input",
    target_column="actual_output",
)

deepeval_df

vm_deepeval_ds.assign_scores(metrics=[
    "validmind.scorers.llm.deepeval.Hallucination",
    "validmind.scorers.llm.deepeval.AnswerRelevancy"
])

## In summary

In this notebook, you learned how to:

- [x] configure the judge provider from environment credentials
- [x] override the default judge LLM and judge embeddings models
- [x] use set_judge_config() to explicitly set the judge for any provider, including Vertex AI
- [x] initialize ValidMind model and dataset objects for LLM evaluation workflows
- [x] run prompt-validation tests that use the judge LLM
- [x] run RAGAS tests that use the judge LLM and judge embeddings
- [x] run DeepEval scorers that use the configured judge LLM

## Next steps

You can use this notebook as a starting point for your own LLM evaluation workflows. A few practical follow-ups are:

- replace the sample prompt and datasets with your own evaluation inputs
- set OPENAI_MODEL / OPENAI_EMBEDDINGS_MODEL when you want to override the OpenAI judge pair, or GEMINI_MODEL / GEMINI_EMBEDDINGS_MODEL when you want to standardize the Gemini judge pair used across notebooks or environments
- use set_judge_config() to explicitly wire a specific judge LLM and embeddings model; this applies to all evaluation paths, including DeepEval scorers
- expand the set of tests and scorers based on your use case

Discover more learning resources

To continue learning about testing and evaluation with the ValidMind Library, explore:

You can also visit the ValidMind documentation for broader guidance on configuration, testing workflows, and model documentation.

Upgrade ValidMind

After installing ValidMind, periodically check that you are using a recent version so you can access the latest provider integrations, tests, and product improvements.

Retrieve the information for the currently installed version of ValidMind:

%pip show validmind

If the version returned is lower than the version indicated in our production open-source code, restart your notebook and run:

%pip install --upgrade validmind

You may need to restart your kernel after upgrading the package for the changes to take effect.
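If you would rather check the version in code, the standard library can report what is installed. The sketch below compares the installed version against 2.12.0, the minimum noted earlier for the document argument of vm.init(). The helpers are hypothetical, and the comparison is deliberately simple: it looks only at the first three numeric components and ignores pre-release tags.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Convert '2.12.0' to (2, 12, 0); missing components are treated as 0."""
    parts = [int(part) for part in re.findall(r"\d+", version)[:3]]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(version, minimum):
    """Return True when version is at least minimum (simplified comparison)."""
    return version_tuple(version) >= version_tuple(minimum)


current = installed_version("validmind")
if current is None:
    print("validmind is not installed in this environment")
else:
    print(current, "meets 2.12.0:", meets_minimum(current, "2.12.0"))
```

For stricter comparisons that understand pre-release and post-release tags, a dedicated version-parsing package is a better fit than this sketch.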