%pip install -q validmind
RAG Model Documentation Demo
In this notebook, we implement a simple RAG model that automates answering RFP questions using GenAI. We will see how to initialize an embedding model, a retrieval model, and a generator model with LangChain components and use them within the ValidMind Library to run tests against them. Finally, we will put them together in a pipeline, run it to get end-to-end results, and run tests against those results.
About ValidMind
ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.
You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.
Before you begin
This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.
If you encounter errors due to missing modules in your Python environment, install the modules with pip install, and then re-run the notebook. For more help, refer to Installing Python Modules.
New to ValidMind?
If you haven’t already seen our Get started with the ValidMind Library, we recommend you explore the available resources for developers at some point. There, you can learn more about documenting models, find code samples, or read our developer reference.
Signing up is FREE — Register with ValidMind
Key concepts
- FunctionModels: ValidMind offers support for creating VMModel instances from Python functions. This enables us to support any “model” by simply using the provided function as the model’s predict method.
- PipelineModels: ValidMind models (VMModel instances) of any type can be piped together to create a model pipeline. This allows model components to be created and tested/documented independently, and then combined into a single model for end-to-end testing and documentation. We use the | operator to pipe models together.
- RAG: RAG stands for Retrieval Augmented Generation and refers to a wide range of GenAI applications where some form of retrieval is used to add context to the prompt so that the LLM that generates content can refer to it when creating its output. In this notebook, we are going to implement a simple RAG setup using LangChain components.
Prerequisites
Let’s go ahead and install the validmind library if it’s not already installed. Then we can install the qdrant-client library for our vector store and langchain for everything else:
%pip install -q qdrant-client langchain langchain-openai sentencepiece
Initialize the ValidMind Library
ValidMind generates a unique code snippet for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.
Get your code snippet
In a browser, log in to ValidMind.
In the left sidebar, navigate to Model Inventory and click + Register Model.
Enter the model details and click Continue. (Need more help?)
For example, to register a model for use with this notebook, select:
- Documentation template: Gen AI RAG Template
- Use case: Marketing/Sales - Analytics
You can fill in other options according to your preference.
Go to Getting Started and click Copy snippet to clipboard.
Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:
# Load your model identifier credentials from an `.env` file
%load_ext dotenv
%dotenv .env
# Or replace with your code snippet
import validmind as vm
vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Read Open AI API Key
We will need an OpenAI API key to use their text-embedding-3-small model for our embeddings, gpt-3.5-turbo model for our generator, and gpt-4o model for our LLM-as-Judge tests. If you don’t have an OpenAI API key, you can get one by signing up at OpenAI. Then you can create a .env file in the root of your project and the following cell will load it from there. Alternatively, you can just uncomment the line below to directly set the key (not recommended for security reasons).
# load openai api key
import os
import dotenv
import nltk
dotenv.load_dotenv()
nltk.download('stopwords')
nltk.download('punkt_tab')
# os.environ["OPENAI_API_KEY"] = "sk-..."
if "OPENAI_API_KEY" not in os.environ:
    raise ValueError("OPENAI_API_KEY is not set")
Dataset Loader
Great, now that we have all of our dependencies installed, the ValidMind Library initialized and connected to our model, and our OpenAI API key set up, we can go ahead and load our datasets. We will use the synthetic RFP dataset included with ValidMind for this notebook. This dataset contains a variety of RFP questions and ground truth answers that serve two purposes: they are the source our Retriever searches for similar question-answer pairs, and they are the test set for evaluating the performance of our RAG model. To do this, we just have to load the data and call the preprocess function to split it into train and test sets.
# Import the sample dataset from the library
from validmind.datasets.llm.rag import rfp
raw_df = rfp.load_data()
train_df, test_df = rfp.preprocess(raw_df)
vm_train_ds = vm.init_dataset(
    train_df,
    text_column="question",
    target_column="ground_truth",
)
vm_test_ds = vm.init_dataset(
    test_df,
    text_column="question",
    target_column="ground_truth",
)
vm_test_ds.df.head()
Data validation
Now that we have loaded our dataset, we can go ahead and run some data validation tests right away to start assessing and documenting the quality of our data. Since we are using a text dataset, we can use ValidMind’s built-in array of text data quality tests to check that things like number of duplicates, missing values, and other common text data issues are not present in our dataset. We can also run some tests to check the sentiment and toxicity of our data.
Duplicates
First, let’s check for duplicates in our dataset. We can use the validmind.data_validation.Duplicates test and pass our dataset:
from validmind.tests import run_test
run_test(
    test_id="validmind.data_validation.Duplicates",
    inputs={"dataset": vm_train_ds},
).log()
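To make the idea of a duplicate check concrete, here is a minimal standard-library sketch that counts exact duplicate strings. It is not the ValidMind implementation; the function name count_duplicates and the sample questions are hypothetical and used only for illustration.

```python
from collections import Counter


def count_duplicates(texts):
    """Return {text: occurrences} for every text that appears more than once."""
    counts = Counter(texts)
    return {text: n for text, n in counts.items() if n > 1}


# Hypothetical sample data, not taken from the RFP dataset
questions = [
    "Do you support SSO?",
    "What is your uptime SLA?",
    "Do you support SSO?",
]
duplicates = count_duplicates(questions)
print(duplicates)
```

An empty result means no exact duplicates were found; near-duplicates (different casing or whitespace) would need normalization first.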
Stop Words
Next, let’s check for stop words in our dataset. We can use the validmind.data_validation.nlp.StopWords test and pass our dataset:
run_test(
    test_id="validmind.data_validation.nlp.StopWords",
    inputs={
        "dataset": vm_train_ds,
    },
).log()
Punctuations
Next, let’s check for punctuation in our dataset. We can use the validmind.data_validation.nlp.Punctuations test:
run_test(
    test_id="validmind.data_validation.nlp.Punctuations",
    inputs={
        "dataset": vm_train_ds,
    },
).log()
Common Words
Next, let’s check for common words in our dataset. We can use the validmind.data_validation.nlp.CommonWords test:
run_test(
    test_id="validmind.data_validation.nlp.CommonWords",
    inputs={
        "dataset": vm_train_ds,
    },
).log()
Language Detection
For documentation purposes, we can detect and log the languages used in the dataset with the validmind.data_validation.nlp.LanguageDetection test:
run_test(
    test_id="validmind.data_validation.nlp.LanguageDetection",
    inputs={
        "dataset": vm_train_ds,
    },
).log()
Toxicity Score
Now, let’s go ahead and run the validmind.data_validation.nlp.Toxicity test to compute a toxicity score for our dataset:
run_test(
    "validmind.data_validation.nlp.Toxicity",
    inputs={
        "dataset": vm_train_ds,
    },
).log()
Polarity and Subjectivity
We can also run the validmind.data_validation.nlp.PolarityAndSubjectivity test to compute the polarity and subjectivity of our dataset:
run_test(
    "validmind.data_validation.nlp.PolarityAndSubjectivity",
    inputs={
        "dataset": vm_train_ds,
    },
).log()
Sentiment
Finally, we can run the validmind.data_validation.nlp.Sentiment test to plot the sentiment of our dataset:
run_test(
    "validmind.data_validation.nlp.Sentiment",
    inputs={
        "dataset": vm_train_ds,
    },
).log()
Embedding Model
Now that we have our dataset loaded and have run some data validation tests to assess and document the quality of our data, we can go ahead and initialize our embedding model. We will use the text-embedding-3-small model from OpenAI for this purpose, wrapped in the OpenAIEmbeddings class from LangChain. This model will be used to “embed” our questions both for inserting the question-answer pairs from the “train” set into the vector store and for embedding the question from inputs when making predictions with our RAG model.
from langchain_openai import OpenAIEmbeddings
embedding_client = OpenAIEmbeddings(model="text-embedding-3-small")
def embed(input):
    """Returns a text embedding for the given text"""
    return embedding_client.embed_query(input["question"])
vm_embedder = vm.init_model(input_id="embedding_model", predict_fn=embed)
What we have done here is to initialize the OpenAIEmbeddings class so it uses OpenAI’s text-embedding-3-small model. We then created an embed function that takes in an input dictionary and uses the embed_query method of the embedding client to compute the embeddings of the question. We use an embed function since that is how ValidMind supports any custom model. We will use this strategy for the retrieval and generator models as well, but you could also use, say, a HuggingFace model directly. See the ValidMind Documentation for more information on which model types are directly supported. Finally, we use the init_model function from the ValidMind Library to create a VMModel object that can be used in ValidMind tests. This also logs the model to our model documentation and any test that uses the model will be linked to the logged model and its metadata.
Assign Predictions
To precompute the embeddings for our test set, we can call the assign_predictions method of the vm_test_ds object we created above. This will compute the embeddings for each question in the test set and store them in a special prediction column of the test set that’s linked to our vm_embedder model. This will allow us to use these embeddings later when we run tests against our embedding model.
vm_test_ds.assign_predictions(vm_embedder)
print(vm_test_ds)
Run tests
Now that everything is set up for the embedding model, we can go ahead and run some tests to assess and document the quality of our embeddings. We will use the validmind.model_validation.embeddings.* tests to compute a variety of metrics against our model.
run_test(
    "validmind.model_validation.embeddings.StabilityAnalysisRandomNoise",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={"probability": 0.3},
).log()
run_test(
    "validmind.model_validation.embeddings.StabilityAnalysisSynonyms",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={"probability": 0.3},
).log()
run_test(
    "validmind.model_validation.embeddings.StabilityAnalysisTranslation",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={
        "source_lang": "en",
        "target_lang": "fr",
    },
).log()
run_test(
    "validmind.model_validation.embeddings.CosineSimilarityHeatmap",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
).log()
run_test(
    "validmind.model_validation.embeddings.CosineSimilarityDistribution",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
).log()
run_test(
    "validmind.model_validation.embeddings.EuclideanDistanceHeatmap",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
).log()
run_test(
    "validmind.model_validation.embeddings.PCAComponentsPairwisePlots",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={"n_components": 3},
).log()
run_test(
    "validmind.model_validation.embeddings.TSNEComponentsPairwisePlots",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={"n_components": 3, "perplexity": 20},
).log()
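Several of the embedding tests above are built on cosine similarity and Euclidean distance between embedding vectors. The sketch below shows both measures on tiny hand-made vectors, using only the standard library. The three-dimensional vectors are hypothetical stand-ins for real embeddings, and this is not how the ValidMind tests are implemented internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 3-dimensional "embeddings"
e1 = [1.0, 0.0, 0.0]
e2 = [0.0, 1.0, 0.0]
e3 = [2.0, 0.0, 0.0]

print(cosine_similarity(e1, e3))   # same direction, different length
print(cosine_similarity(e1, e2))   # orthogonal vectors
print(euclidean_distance(e1, e3))  # distance is sensitive to length
```

Cosine similarity ignores vector length while Euclidean distance does not, which is one reason it can be useful to look at both heatmaps.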
Setup Vector Store
Great, so now that we have assessed our embedding model and verified that it is performing well, we can go ahead and use it to compute embeddings for our question-answer pairs in the “train” set. We will then use these embeddings to insert the question-answer pairs into a vector store. We will use an in-memory qdrant vector database for demo purposes but any option would work just as well here. We will use the Qdrant vector store class from LangChain (imported from langchain_community.vectorstores) to interact with the vector store. This class will allow us to insert and search for embeddings in the vector store.
Generate embeddings for the Train Set
We can use the same assign_predictions method from earlier, except this time we will use the vm_train_ds object to compute the embeddings for the question-answer pairs in the “train” set.
vm_train_ds.assign_predictions(vm_embedder)
print(vm_train_ds)
Insert embeddings and questions into Vector DB
Now that we have computed the embeddings for our question-answer pairs in the “train” set, we can go ahead and insert them into the vector store:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import DataFrameLoader
# load documents from dataframe
loader = DataFrameLoader(train_df, page_content_column="question")
docs = loader.load()
# choose model using embedding client
embedding_client = OpenAIEmbeddings(model="text-embedding-3-small")
# setup vector datastore
qdrant = Qdrant.from_documents(
    docs,
    embedding_client,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="rfp_rag_collection",
)
Retrieval Model
Now that we have an embedding model and a vector database set up and loaded with our data, we need a Retrieval model that can search for similar question-answer pairs for a given input question. Once created, we can initialize this as a ValidMind model and assign_predictions to it just like our embedding model.
def retrieve(input):
    contexts = []
    for result in qdrant.similarity_search_with_score(input["question"]):
        document, score = result
        context = f"Q: {document.page_content}\n"
        context += f"A: {document.metadata['ground_truth']}\n"
        contexts.append(context)
    return contexts
vm_retriever = vm.init_model(input_id="retrieval_model", predict_fn=retrieve)
vm_test_ds.assign_predictions(model=vm_retriever)
print(vm_test_ds)
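Conceptually, the retrieval step ranks stored question-answer pairs by how similar their embeddings are to the query embedding and returns the top matches as context strings. The sketch below illustrates that idea with a plain Python list in place of a vector database. The top_k helper, the two-dimensional vectors, and the sample question-answer pairs are all hypothetical, and a real vector store such as Qdrant uses far more efficient indexing.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, store, k=2):
    """Rank (vector, question, answer) entries by similarity to the query
    and format the k best as "Q: ... A: ..." context strings."""
    ranked = sorted(store, key=lambda item: _cosine(query_vec, item[0]), reverse=True)
    return [f"Q: {q}\nA: {a}\n" for _, q, a in ranked[:k]]


# Hypothetical in-memory "vector store"
store = [
    ([1.0, 0.0], "Do you support SSO?", "Yes, via SAML."),
    ([0.0, 1.0], "What is your uptime SLA?", "99.9%."),
    ([0.9, 0.1], "Is SAML supported?", "Yes."),
]
contexts = top_k([1.0, 0.0], store, k=2)
print(contexts)
```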
Generation Model
As the final piece of this simple RAG pipeline, we can create and initialize a generation model that will use the retrieved context to generate an answer to the input question. We will use the gpt-3.5-turbo model from OpenAI.
from openai import OpenAI
from validmind.models import Prompt
system_prompt = """
You are an expert RFP AI assistant.
You are tasked with answering new RFP questions based on existing RFP questions and answers.
You will be provided with the existing RFP questions and answer pairs that are the most relevant to the new RFP question.
After that you will be provided with a new RFP question.
You will generate an answer and respond only with the answer.
Ignore your pre-existing knowledge and answer the question based on the provided context.
""".strip()
openai_client = OpenAI()
def generate(input):
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "\n\n".join(input["retrieval_model"])},
            {"role": "user", "content": input["question"]},
        ],
    )
    return response.choices[0].message.content
vm_generator = vm.init_model(
    input_id="generation_model",
    predict_fn=generate,
    prompt=Prompt(template=system_prompt),
)
Let’s test it out real quick:
import pandas as pd
vm_generator.predict(
    pd.DataFrame(
        {"retrieval_model": [["My name is anil"]], "question": ["what is my name"]}
    )
)
Prompt Evaluation
Now that we have our generator model initialized, we can run some LLM-as-Judge tests to evaluate the system prompt. This will allow us to get an initial sense of how well the prompt meets a few best practices for prompt engineering. These tests use an LLM to rate the prompt on a scale of 1-10 against the following criteria:
- Exemplar Bias: When using multi-shot prompting, does the prompt contain an unbiased distribution of examples?
- Delimitation: When using complex prompts containing examples, contextual information, or other elements, is the prompt formatted in such a way that each element is clearly separated?
- Clarity: How clearly the prompt states the task.
- Conciseness: How succinctly the prompt states the task.
- Instruction Framing: Whether the prompt contains negative instructions.
- Specificity: How specifically the prompt defines the task.
run_test(
    "validmind.prompt_validation.Bias",
    inputs={
        "model": vm_generator,
    },
).log()
run_test(
    "validmind.prompt_validation.Clarity",
    inputs={
        "model": vm_generator,
    },
).log()
run_test(
    "validmind.prompt_validation.Conciseness",
    inputs={
        "model": vm_generator,
    },
).log()
run_test(
    "validmind.prompt_validation.Delimitation",
    inputs={
        "model": vm_generator,
    },
).log()
run_test(
    "validmind.prompt_validation.NegativeInstruction",
    inputs={
        "model": vm_generator,
    },
).log()
run_test(
    "validmind.prompt_validation.Specificity",
    inputs={
        "model": vm_generator,
    },
).log()
Setup RAG Pipeline Model
Now that we have all of our individual “component” models set up and initialized, we need some way to put them all together in a single “pipeline”. We can use the PipelineModel class to do this. This ValidMind model type simply wraps any number of other ValidMind models and runs them in sequence. We can use the pipe (|) operator to chain together our models; in Python this is normally the bitwise or operator, but we have overloaded it for easy pipeline creation. We can then initialize this pipeline model and assign predictions to it just like any other model.
vm_rag_model = vm.init_model(vm_retriever | vm_generator, input_id="rag_model")
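For readers curious how a pipe operator can chain models, the sketch below shows one way to overload | in plain Python via __or__. The FunctionModel and Pipeline classes here are hypothetical stand-ins written for this illustration; they are not ValidMind's classes and make no claim about how the library implements PipelineModel. Each model's output is stored under its input_id so the next model can read it, mirroring how generate reads input["retrieval_model"] above.

```python
class FunctionModel:
    """Hypothetical stand-in: wraps a function mapping an input dict to an output."""

    def __init__(self, input_id, predict_fn):
        self.input_id = input_id
        self.predict_fn = predict_fn

    def __or__(self, other):
        # model_a | model_b builds a two-step pipeline
        return Pipeline([self, other])


class Pipeline:
    """Hypothetical stand-in: runs wrapped models in sequence."""

    def __init__(self, models):
        self.models = models

    def __or__(self, other):
        # pipeline | model_c appends another step
        return Pipeline(self.models + [other])

    def predict(self, row):
        row = dict(row)  # copy so the caller's dict is not mutated
        output = None
        for model in self.models:
            output = model.predict_fn(row)
            row[model.input_id] = output  # expose output to downstream models
        return output


retriever = FunctionModel(
    "retrieval_model", lambda row: [f"context for: {row['question']}"]
)
generator = FunctionModel(
    "generation_model",
    lambda row: f"answer using {len(row['retrieval_model'])} context(s)",
)
rag = retriever | generator
result = rag.predict({"question": "Do you support SSO?"})
print(result)
```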
We can assign_predictions to the pipeline model just like we did with the individual models. This will run the pipeline on the test set and store the results in the test set for later use.
vm_test_ds.assign_predictions(model=vm_rag_model)
print(vm_test_ds)
vm_test_ds._df.head(5)
Run tests
RAGAS evaluation
Let’s go ahead and run some of our new RAG tests against our model…
Note: these tests are still being developed and are not yet in a stable state. We are using advanced tests here that use LLM-as-Judge and other strategies to assess things like the relevancy of the retrieved context to the input question and the correctness of the generated answer when compared to the ground truth. There is more to come in this area so stay tuned!
import warnings
warnings.filterwarnings("ignore")
Semantic Similarity
The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.
Measuring the semantic similarity between answers can offer valuable insights into the quality of the generated response. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score.
run_test(
    "validmind.model_validation.ragas.SemanticSimilarity",
    inputs={"dataset": vm_test_ds},
    params={
        "response_column": "rag_model_prediction",
        "reference_column": "ground_truth",
    },
).log()
Context Entity Recall
This test gives the measure of recall of the retrieved context, based on the number of entities present in both ground_truths and contexts relative to the number of entities present in the ground_truths alone. Simply put, it is a measure of what fraction of entities are recalled from ground_truths. This test is useful in fact-based use cases like tourism help desk, historical QA, etc. This test can help evaluate the retrieval mechanism for entities, based on comparison with entities present in ground_truths, because in cases where entities matter, we need the contexts which cover them.
run_test(
    "validmind.model_validation.ragas.ContextEntityRecall",
    inputs={"dataset": vm_test_ds},
    params={
        "reference_column": "ground_truth",
        "retrieved_contexts_column": "retrieval_model_prediction",
    },
).log()
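The recall described above can be written as the number of entities found in both the ground truth and the contexts, divided by the number of entities in the ground truth. The sketch below computes that ratio on hand-written entity sets. It assumes entity extraction has already happened (in practice an LLM or NER step does this), and the function name and sample entities are hypothetical.

```python
def context_entity_recall(reference_entities, context_entities):
    """Fraction of ground-truth entities that also appear in the retrieved contexts."""
    reference = set(reference_entities)
    if not reference:
        return 0.0
    return len(reference & set(context_entities)) / len(reference)


# Hypothetical, already-extracted entities
ground_truth_entities = {"Taj Mahal", "Agra", "Shah Jahan", "1631"}
retrieved_entities = {"Taj Mahal", "Agra", "India"}
score = context_entity_recall(ground_truth_entities, retrieved_entities)
print(score)
```

Entities that appear only in the contexts ("India" here) do not affect the score, since this is a recall measure.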
Context Precision
Context Precision is a test that evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Ideally all the relevant chunks must appear at the top ranks. This test is computed using the question, ground_truth and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
run_test(
    "validmind.model_validation.ragas.ContextPrecision",
    inputs={"dataset": vm_test_ds},
    params={
        "user_input_column": "question",
        "retrieved_contexts_column": "retrieval_model_prediction",
        "reference_column": "ground_truth",
    },
).log()
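One common way to reward relevant chunks that appear at the top ranks is a rank-weighted average of precision@k, taken over the positions where a relevant chunk appears. The sketch below assumes that formulation and assumes the per-chunk relevance flags (1 = relevant, 0 = not) have already been judged; in the actual test an LLM makes those judgments. The function name is hypothetical.

```python
def context_precision(relevance):
    """relevance: 0/1 flags for retrieved chunks, in rank order.
    Returns the mean of precision@k over the ranks k that hold a relevant chunk."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / total_relevant


print(context_precision([1, 1, 0]))  # relevant chunks ranked first
print(context_precision([0, 1, 1]))  # same chunks, ranked lower
```

The same set of relevant chunks scores lower when an irrelevant chunk is ranked ahead of them, which is the behavior the description above asks for.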
Context Precision Without Reference
This test evaluates whether retrieved contexts align well with the expected response for a given user input, without requiring a ground-truth reference. This test assesses the relevance of each retrieved context chunk by comparing it directly to the response.
run_test(
    "validmind.model_validation.ragas.ContextPrecisionWithoutReference",
    inputs={"dataset": vm_test_ds},
    params={
        "user_input_column": "question",
        "retrieved_contexts_column": "retrieval_model_prediction",
        "response_column": "rag_model_prediction",
    },
).log()
Faithfulness
This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.
The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims is first identified from the generated answer. Then each of these claims is cross-checked against the given context to determine whether it can be inferred from that context.
run_test(
    "validmind.model_validation.ragas.Faithfulness",
    inputs={"dataset": vm_test_ds},
    params={
        "user_input_column": "question",
        "response_column": "rag_model_prediction",
        "retrieved_contexts_column": "retrieval_model_prediction",
    },
).log()
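Once the claims have been extracted and each one judged as supported or not, the score reduces to a simple ratio: supported claims divided by total claims. The sketch below shows only that final step. The claim texts and verdicts are hypothetical, and in the actual test an LLM performs both the claim extraction and the verification.

```python
def faithfulness(claim_verdicts):
    """claim_verdicts: {claim text: True if inferable from the context}.
    Returns the fraction of claims supported by the context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts.values()) / len(claim_verdicts)


# Hypothetical claims with hand-assigned verdicts
verdicts = {
    "The product supports SSO.": True,
    "SSO is provided through SAML.": True,
    "SSO is free on all plans.": False,  # not stated in the retrieved context
}
score = faithfulness(verdicts)
print(round(score, 3))
```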
Response Relevancy
The Response Relevancy test focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. This test is computed using the question, the context and the answer.
The Response Relevancy is defined as the mean cosine similarity of the original question to a number of artificial questions, which were generated (reverse engineered) based on the answer.
Please note that even though in practice the score will range between 0 and 1 most of the time, this is not mathematically guaranteed, because cosine similarity ranges from -1 to 1.
Note: This is a reference free test. If you’re looking to compare ground truth answer with generated answer refer to Answer Correctness.
An answer is deemed relevant when it directly and appropriately addresses the original question. Importantly, our assessment of answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.
run_test(
    "validmind.model_validation.ragas.ResponseRelevancy",
    inputs={"dataset": vm_test_ds},
    params={
        "user_input_column": "question",
        "response_column": "rag_model_prediction",
        "retrieved_contexts_column": "retrieval_model_prediction",
    },
).log()
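The scoring step described above, the mean cosine similarity between the original question and the questions regenerated from the answer, can be sketched as follows. The two-dimensional vectors are hypothetical stand-ins for real question embeddings, and generating the artificial questions (done by an LLM in the actual test) is outside the scope of this sketch.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def response_relevancy(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question embedding and
    the embeddings of questions regenerated from the answer."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Hypothetical embeddings
original = [1.0, 0.0]
regenerated = [[1.0, 0.0], [0.0, 1.0]]  # one aligned, one unrelated
score = response_relevancy(original, regenerated)
print(score)

# Cosine similarity can be negative, so the mean is not bounded below by 0
opposite = response_relevancy(original, [[-1.0, 0.0]])
print(opposite)
```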
Context Recall
Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.
To estimate context recall from the ground truth answer, each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all sentences in the ground truth answer should be attributable to the retrieved context.
run_test(
    "validmind.model_validation.ragas.ContextRecall",
    inputs={"dataset": vm_test_ds},
    params={
        "user_input_column": "question",
        "retrieved_contexts_column": "retrieval_model_prediction",
        "reference_column": "ground_truth",
    },
).log()
Answer Correctness
The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.
Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score.
Factual correctness quantifies the factual overlap between the generated answer and the ground truth answer. This is done using the concepts of:
- TP (True Positive): Facts or statements that are present in both the ground truth and the generated answer.
- FP (False Positive): Facts or statements that are present in the generated answer but not in the ground truth.
- FN (False Negative): Facts or statements that are present in the ground truth but not in the generated answer.
run_test(
    "validmind.model_validation.ragas.AnswerCorrectness",
    inputs={"dataset": vm_test_ds},
    params={
        "user_input_column": "question",
        "response_column": "rag_model_prediction",
        "reference_column": "ground_truth",
    },
).log()
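The TP, FP and FN counts above can be combined into an F1-style factual score, which is then blended with semantic similarity using weights. The sketch below assumes the F1 form TP / (TP + 0.5 × (FP + FN)) and assumes weights of 0.75 for factual correctness and 0.25 for semantic similarity; both the weights and the function names are assumptions made for illustration, so check the test's parameters for the values actually used.

```python
def factual_f1(tp, fp, fn):
    """F1-style overlap between statements in the answer and the ground truth."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def answer_correctness(tp, fp, fn, semantic_similarity, weights=(0.75, 0.25)):
    """Weighted blend of factual F1 and semantic similarity.
    The default weights are an assumption for this sketch."""
    w_factual, w_semantic = weights
    return w_factual * factual_f1(tp, fp, fn) + w_semantic * semantic_similarity


# Hypothetical counts: 2 shared statements, 1 extra in the answer, 1 missing
score = answer_correctness(tp=2, fp=1, fn=1, semantic_similarity=0.8)
print(round(score, 3))
```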
Aspect Critic
This is designed to assess submissions based on predefined aspects such as harmlessness and correctness. Additionally, users have the flexibility to define their own aspects for evaluating submissions according to their specific criteria. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not. This evaluation is performed using the ‘answer’ as input.
Critiques within the LLM evaluators evaluate submissions based on the provided aspect. Ragas Critiques offers a range of predefined aspects like correctness, harmfulness, etc.
run_test(
    "validmind.model_validation.ragas.AspectCritic",
    inputs={"dataset": vm_test_ds},
    params={
        "user_input_column": "question",
        "response_column": "rag_model_prediction",
        "retrieved_contexts_column": "retrieval_model_prediction",
    },
).log()
Noise Sensitivity
This test is designed to evaluate the robustness of the RAG pipeline model against noise in the retrieved context. It works by checking how well the “claims” in the generated answer match up with the “claims” in the ground truth answer. If the generated answer contains “claims” from the contexts that the ground truth answer does not contain, those claims are considered incorrect. The score for each answer is the number of incorrect claims divided by the total number of claims. This can be interpreted as a measure of how sensitive the LLM is to “noise” in the context where “noise” is information that is relevant but should not be included in the answer since the ground truth answer does not contain it.
run_test(
    "validmind.model_validation.ragas.NoiseSensitivity",
    inputs={"dataset": vm_test_ds},
    params={
        "user_input_column": "question",
        "response_column": "rag_model_prediction",
        "reference_column": "ground_truth",
        "retrieved_contexts_column": "retrieval_model_prediction",
    },
).log()
Generation quality
In this section, we evaluate the alignment and relevance of generated responses to reference outputs within our retrieval-augmented generation (RAG) application. We use metrics that assess various quality dimensions of the generated responses, including semantic similarity, structural alignment, and phrasing overlap. Semantic similarity metrics compare embeddings of generated and reference text to capture deeper contextual alignment, while overlap and alignment measures quantify how well the phrasing and structure of generated responses match the intended outputs.
Token Disparity
This test assesses the difference in token counts between the reference texts (ground truth) and the answers generated by the RAG model. It helps evaluate how well the model’s outputs align with the expected length and level of detail in the reference texts. A significant disparity in token counts could signal issues with generation quality, such as excessive verbosity or insufficient detail. Consistently low token counts in generated answers compared to references might suggest that the model’s outputs are incomplete or overly concise, missing important contextual information.
run_test(
    "validmind.model_validation.TokenDisparity",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
).log()
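As a rough illustration of the comparison, the sketch below counts whitespace-separated tokens in a reference and a generated answer and reports the difference. Whitespace splitting is an assumption made for simplicity; the actual test may tokenize differently. The function name and sample sentences are hypothetical.

```python
def token_disparity(reference, generated):
    """Return (reference tokens, generated tokens, generated minus reference)
    using naive whitespace tokenization."""
    ref_tokens = len(reference.split())
    gen_tokens = len(generated.split())
    return ref_tokens, gen_tokens, gen_tokens - ref_tokens


# Hypothetical reference and generated answers
reference = "We support single sign-on through SAML 2.0 and OpenID Connect."
generated = "Yes, SSO is supported."
ref_count, gen_count, difference = token_disparity(reference, generated)
print(ref_count, gen_count, difference)
```

A consistently negative difference across the test set would point to answers that are shorter than the references.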
ROUGE Score
This test evaluates the quality of answers generated by the RAG model by measuring overlaps in n-grams, word sequences, and word pairs between the model output and the reference (ground truth) text. ROUGE, short for Recall-Oriented Understudy for Gisting Evaluation, assesses both precision and recall, providing a balanced view of how well the generated response captures the reference content. ROUGE precision measures the proportion of n-grams in the generated text that match the reference, highlighting relevance and conciseness, while ROUGE recall assesses the proportion of reference n-grams present in the generated text, indicating completeness and thoroughness.
Low precision scores might reveal that the generated text includes redundant or irrelevant information, while low recall scores suggest omissions of essential details from the reference. Consistently low ROUGE scores could indicate poor overall alignment with the ground truth, suggesting the model may be missing key content or failing to capture the intended meaning.
run_test(
    "validmind.model_validation.RougeScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
    params={
        "metric": "rouge-1",
    },
).log()
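To make ROUGE precision and recall concrete, here is a minimal sketch of ROUGE-1 (unigram overlap) using only the standard library. It lowercases and splits on whitespace, which is a simplification; real ROUGE implementations typically add stemming and other normalization. The function name and sample sentences are hypothetical.

```python
from collections import Counter


def rouge_1(reference, generated):
    """Unigram-overlap precision, recall and F1 between two texts."""
    ref = Counter(reference.lower().split())
    gen = Counter(generated.lower().split())
    overlap = sum((ref & gen).values())  # clipped count of shared unigrams
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


precision, recall, f1 = rouge_1("the cat sat on the mat", "the cat sat")
print(precision, recall, round(f1, 3))
```

In this example every generated word appears in the reference (high precision), but half of the reference is missing from the answer (lower recall).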
BLEU Score
The BLEU Score test evaluates the quality of answers generated by the RAG model by measuring n-gram overlap between the generated text and the reference (ground truth) text, with a specific focus on exact precision in phrasing. While ROUGE precision also assesses overlap, BLEU differs in two main ways: first, it applies a geometric average across multiple n-gram levels, capturing precise phrase alignment, and second, it includes a brevity penalty to prevent overly short outputs from inflating scores artificially. This added precision focus is valuable in RAG applications where strict adherence to reference language is essential, as BLEU emphasizes the match to exact phrasing. In contrast, ROUGE precision evaluates general content overlap without penalizing brevity, offering a broader sense of content alignment.
run_test(
    "validmind.model_validation.BleuScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
).log()
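The two BLEU ingredients described above, a geometric mean over n-gram precisions and a brevity penalty, can be sketched as follows. This simplified version uses a single reference, n-grams up to length 2, whitespace tokenization and no smoothing; production BLEU implementations usually go up to 4-grams and handle zero counts more gracefully. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, generated, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    multiplied by a brevity penalty. No smoothing is applied."""
    ref, gen = reference.split(), generated.split()
    if not gen:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        gen_ngrams = ngrams(gen, n)
        clipped = sum((gen_ngrams & ngrams(ref, n)).values())
        total = sum(gen_ngrams.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(clipped / total))
    # Penalize outputs shorter than the reference
    if len(gen) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(gen))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
identical = simple_bleu(reference, reference)
short = simple_bleu(reference, "the cat sat")
print(identical, round(short, 3))
```

The short answer has perfect n-gram precision but is still penalized for being half the length of the reference, which is the brevity effect that distinguishes BLEU from ROUGE precision.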
BERT Score
This test evaluates the quality of the RAG generated answers using BERT embeddings to measure precision, recall, and F1 scores based on semantic similarity, rather than exact n-gram matches as in BLEU and ROUGE. This approach captures contextual meaning, making it valuable when wording differs but the intended message closely aligns with the reference. In RAG applications, the BERT score is especially useful for ensuring that generated answers convey the reference text’s meaning, even if phrasing varies. Consistently low scores indicate a lack of semantic alignment, suggesting the model may miss or misrepresent key content. Low precision may reflect irrelevant or redundant details, while low recall can indicate omissions.
run_test(
    "validmind.model_validation.BertScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
).log()
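The idea behind embedding-based precision and recall can be sketched with greedy cosine matching over token vectors. The sketch below is only an illustration: the 3-dimensional vectors in `toy_embeddings` are invented, whereas the real metric uses contextual embeddings from a BERT model, and the `greedy_match_scores` helper is a hypothetical name, not part of ValidMind.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_match_scores(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall and F1 over token embeddings."""
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy vectors: near-synonyms point in similar directions
toy_embeddings = {
    "encrypts": (1.0, 0.0, 0.0),
    "secures": (0.9, 0.1, 0.0),
    "data": (0.0, 1.0, 0.0),
    "information": (0.1, 0.9, 0.0),
    "coffee": (0.0, 0.0, 1.0),
}


def embed(tokens):
    return [toy_embeddings[t] for t in tokens]


reference = embed(["encrypts", "data"])
paraphrase = embed(["secures", "information"])
off_topic = embed(["secures", "information", "coffee"])

para_p, para_r, para_f1 = greedy_match_scores(paraphrase, reference)
off_p, off_r, off_f1 = greedy_match_scores(off_topic, reference)
print(para_p, para_r, para_f1)
print(off_p, off_r, off_f1)
```

The paraphrase shares no exact words with the reference yet scores highly on both measures, while the extra unrelated token lowers precision and leaves recall unchanged.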
METEOR Score
This test evaluates the quality of the generated answers by measuring alignment with the ground truth, emphasizing both accuracy and fluency. Unlike BLEU and ROUGE, which focus on n-gram matches, METEOR combines precision, recall, synonym matching, and word order, focusing on how well the generated text conveys meaning and reads naturally. This metric is especially useful for RAG applications where sentence structure and natural flow are crucial for clear communication. Lower scores may suggest alignment issues, indicating that the answers may lack fluency or key content. Discrepancies in word order or high fragmentation penalties can reveal problems with how the model constructs sentences, potentially affecting readability.
run_test(
    "validmind.model_validation.MeteorScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
).log()
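To illustrate how word order feeds into the fragmentation penalty, here is a simplified sketch that follows the original METEOR formulation (a recall-weighted harmonic mean multiplied by one minus a chunk-based penalty). It is not the ValidMind implementation: the `simple_meteor` helper is hypothetical, it matches exact tokens only (no stemming or synonyms), and its left-to-right alignment is only adequate for examples without repeated words.

```python
def simple_meteor(candidate: str, reference: str) -> float:
    """Simplified METEOR with exact matches only (illustrative helper)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()

    # Align each candidate token with the first unused identical reference token
    used = set()
    alignment = []
    for i, tok in enumerate(cand):
        for j, ref_tok in enumerate(ref):
            if j not in used and tok == ref_tok:
                used.add(j)
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(cand)
    recall = matches / len(ref)
    # Harmonic mean weighted heavily toward recall
    f_mean = 10 * precision * recall / (recall + 9 * precision)

    # A chunk is a run of matches that is contiguous in both texts
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if not (i1 == i0 + 1 and j1 == j0 + 1):
            chunks += 1

    # More chunks per match means more fragmentation and a larger penalty
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)


# Made-up example sentences containing the same words in different orders
reference = "the vendor encrypts all data at rest"
scrambled = "rest at data all encrypts vendor the"

in_order_score = simple_meteor(reference, reference)
scrambled_score = simple_meteor(scrambled, reference)
print(in_order_score, scrambled_score)
```

Both candidates have perfect unigram precision and recall, so the gap between the two scores comes entirely from the fragmentation penalty applied to the scrambled word order.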
Bias and Toxicity
In this section, we use metrics like Toxicity Score and Regard Score to evaluate both the generated responses and the ground truth. These tests help us detect harmful, offensive, or inappropriate language and evaluate the level of bias and neutrality, enabling us to assess and mitigate potential biases in both the model’s responses and the original dataset.
Toxicity Score
This test measures the level of harmful or offensive content in the generated answers. The test uses a preloaded toxicity detection tool from Hugging Face, which identifies language that may be inappropriate, aggressive, or derogatory. High toxicity scores indicate potentially toxic content, while consistently elevated scores across multiple outputs may signal underlying issues in the model’s generation process that require attention to prevent the spread of harmful language.
run_test(
    "validmind.model_validation.ToxicityScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
).log()
Regard Score
This test evaluates the sentiment and perceived regard (categorized as positive, negative, neutral, or other) in answers generated by the RAG model. This is important for identifying any biases or sentiment tendencies in responses, ensuring that generated answers are balanced and appropriate for the context. The test uses a preloaded regard evaluation tool from Hugging Face to compute scores for each response. High skewness in regard scores, especially if the generated responses consistently diverge from expected sentiments in the reference texts, may reveal biases in the model’s generation, such as overly positive or negative tones where neutrality is expected.
run_test(
    "validmind.model_validation.RegardScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
).log()
Conclusion
In this notebook, we have seen how we can use LangChain and ValidMind together to build, evaluate, and document a simple RAG Model as it's developed. This is a great example of the interactive development experience that ValidMind is designed to support: we can quickly iterate on our model and document as we go. We have also seen how ValidMind supports non-traditional “models” using a functional interface and how we can build pipelines of many models to support complex GenAI workflows.
This is still a work in progress and we are actively developing new tests to support more advanced GenAI workflows. We are also keeping an eye on the most popular GenAI models and libraries to explore direct integrations. Stay tuned for more updates and new features in this area!
Upgrade ValidMind
Retrieve the information for the currently installed version of ValidMind:
%pip show validmind
If the version returned is lower than the version indicated in our production open-source code, restart your notebook and run:
%pip install --upgrade validmind
You may need to restart your kernel after upgrading the package for the changes to be applied.