AI Agent Validation with ValidMind - Banking Demo

This notebook shows how to document and evaluate an agentic AI system with the ValidMind Library. Using a small banking agent built in LangGraph as an example, you will run ValidMind’s built-in and custom tests and produce the artifacts needed to create evidence-backed documentation.

An AI agent is an autonomous system that interprets inputs, selects from available tools or actions, and carries out multi-step behaviors to achieve user goals. In this example, our agent acts as a professional banking assistant that analyzes user requests and automatically selects and invokes the most appropriate specialized banking tool (credit, account, or fraud) to deliver accurate, compliant, and actionable responses.

However, agentic capabilities bring concrete risks. The agent may misinterpret user inputs or fail to extract required parameters, producing incorrect credit assessments or inappropriate account actions; it can select the wrong tool (for example, invoking account management instead of fraud detection), which may cause unsafe, non-compliant, or customer-impacting behavior.

This interactive notebook guides you step-by-step through building a demo LangGraph banking agent, preparing an evaluation dataset, initializing the ValidMind Library and required objects, writing custom tests for response accuracy and tool-selection accuracy, running ValidMind’s built-in and custom tests, and logging documentation artifacts to ValidMind.

Table of Contents

  • About ValidMind
    • Before you begin
    • New to ValidMind?
    • Key concepts
  • Install the ValidMind Library
  • Initialize the ValidMind Library
    • Get your code snippet
    • Initialize the Python environment
  • Banking Tools
    • Tool Overview
    • Test Banking Tools Individually
  • Complete LangGraph Banking Agent
  • ValidMind Model Integration
  • Prompt Validation
  • Banking Test Dataset
    • Initialize ValidMind Dataset
    • Run the agent and capture results through assign_predictions
  • Banking Accuracy Test
  • Banking Tool Call Accuracy Test
  • RAGAS Tests for an Agent Evaluation
    • Faithfulness
    • Response Relevancy
    • Context Recall
  • Safety
    • AspectCritic
    • Prompt bias
    • Toxicity
  • Demo Summary and Next Steps

About ValidMind

ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.

You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.

Before you begin

This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.

If you encounter errors due to missing modules in your Python environment, install the modules with pip install, and then re-run the notebook. For more help, refer to Installing Python Modules.

New to ValidMind?

If you haven't already seen our documentation on the ValidMind Library, we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.

To access all features available in this notebook, you'll need a ValidMind account.

Register with ValidMind

Key concepts

Model documentation: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.

Documentation template: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.

Tests: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.

Custom tests: Custom tests are functions that you define to evaluate your model or dataset. These functions can be registered via the ValidMind Library to be used with the ValidMind Platform.

Inputs: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:

  • model: A single model that has been initialized in ValidMind with vm.init_model().
  • dataset: Single dataset that has been initialized in ValidMind with vm.init_dataset().
  • models: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom test.
  • datasets: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom test. See this example for more information.

Parameters: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a test, customize its behavior, or provide additional context.

Outputs: Custom tests can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.
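As a minimal illustration of inputs, parameters, and outputs together (the test name and threshold parameter below are purely illustrative and not part of this demo):

import pandas as pd
import validmind as vm

@vm.test("my_custom_tests.example_row_count")
def example_row_count(dataset, threshold: int = 100):
    """Report how many rows the dataset has and whether it meets a minimum size."""
    df = dataset._df  # the underlying pandas DataFrame of a ValidMind dataset input
    return pd.DataFrame([{"rows": len(df), "meets_minimum": len(df) >= threshold}])

Such a test could then be run with run_test("my_custom_tests.example_row_count", inputs={"dataset": ...}, params={"threshold": 50}), exactly as the custom tests later in this notebook are.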

Test suites: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.
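If you prefer to run the tests defined by your documentation template as a suite rather than individually (this notebook runs them individually), a minimal sketch looks like this; these helper calls are an assumption about the library API and may differ by version:

vm.preview_template()         # inspect the sections and tests defined by the template
vm.run_documentation_tests()  # run the template's tests end-to-end as a suite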

Install the ValidMind Library

To install the library:

%pip install -q "validmind[all]" langgraph

Initialize the ValidMind Library

ValidMind generates a unique code snippet for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

Get your code snippet

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Model Inventory and click + Register Model.

  3. Enter the model details and click Continue. (Need more help?)

    For example, to register a model for use with this notebook, select:

    • Documentation template: Agentic AI System

    You can fill in other options according to your preference.

  4. Go to Getting Started and click Copy snippet to clipboard.

Next, replace the placeholder with your own code snippet:

import validmind as vm

vm.init(
    api_host="...",
    api_key="...",
    api_secret="...",
    model="...",
)

Initialize the Python environment

First, let's import all the necessary libraries for building our banking LangGraph agent system:

  • LangChain components for LLM integration and tool management
  • LangGraph for building stateful, multi-step agent workflows
  • ValidMind for model validation and testing
  • Banking tools for specialized financial services
  • Standard libraries for data handling and environment management

The setup includes loading environment variables (like OpenAI API keys) needed for the LLM components to function properly.

# Standard library imports
from typing import TypedDict, Annotated, Sequence

# Third party imports
import pandas as pd
from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, END, START
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

# Local imports
from banking_tools import AVAILABLE_TOOLS
from validmind.tests import run_test


# Load environment variables if using .env file
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    print("dotenv not installed. Make sure OPENAI_API_KEY is set in your environment.")

Banking Tools

Now let's look at the banking demo tools, which cover common financial-services use cases:

Tool Overview

  1. Credit Risk Analyzer - Loan applications and credit decisions
  2. Customer Account Manager - Account services and customer support
  3. Fraud Detection System - Security and fraud prevention
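The banking_tools module ships with this demo and isn't reproduced here. As a rough sketch only (the logic below is illustrative, not the actual implementation), a tool of this kind could be defined with LangChain's @tool decorator, using the same parameters the tests below pass in:

from langchain_core.tools import tool

@tool
def credit_risk_analyzer(customer_income: float, customer_debt: float,
                         credit_score: int, loan_amount: float,
                         loan_type: str) -> str:
    """Analyze credit risk for a loan application."""
    # Illustrative placeholder logic only
    monthly_income = customer_income / 12
    dti = customer_debt / monthly_income if monthly_income else 1.0
    risk = "LOW" if credit_score >= 700 and dti < 0.35 else "ELEVATED"
    return f"Credit risk: {risk} (credit score {credit_score}, DTI {dti:.2f}, {loan_type} loan of ${loan_amount:,.0f})"

The cell below simply lists the tools that are actually available: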
print(f"Available tools: {len(AVAILABLE_TOOLS)}")
print("\nTool Details:")
for i, tool in enumerate(AVAILABLE_TOOLS, 1):
    print(f"   - {tool.name}")   

Test Banking Tools Individually

Let's test each banking tool individually to ensure they're working correctly before integrating them into our agent.

print("Testing Individual Banking Tools")
print("=" * 60)

# Test 1: Credit Risk Analyzer
print("TEST 1: Credit Risk Analyzer")
print("-" * 40)
try:
    # Access the underlying function using .func
    credit_result = AVAILABLE_TOOLS[0].func(
        customer_income=75000,
        customer_debt=1200,
        credit_score=720,
        loan_amount=50000,
        loan_type="personal"
    )
    print(credit_result)
    print("Credit Risk Analyzer test PASSED")
except Exception as e:
    print(f"Credit Risk Analyzer test FAILED: {e}")

print("" + "=" * 60)

# Test 2: Customer Account Manager
print("TEST 2: Customer Account Manager")
print("-" * 40)
try:
    # Test checking balance
    account_result = AVAILABLE_TOOLS[1].func(
        account_type="checking",
        customer_id="12345",
        action="check_balance"
    )
    print(account_result)
    
    # Test getting account info
    info_result = AVAILABLE_TOOLS[1].func(
        account_type="all",
        customer_id="12345", 
        action="get_info"
    )
    print(info_result)
    print("Customer Account Manager test PASSED")
except Exception as e:
    print(f"Customer Account Manager test FAILED: {e}")

print("" + "=" * 60)

# Test 3: Fraud Detection System
print("TEST 3: Fraud Detection System")
print("-" * 40)
try:
    fraud_result = AVAILABLE_TOOLS[2].func(
        transaction_id="TX123",
        customer_id="12345",
        transaction_amount=500.00,
        transaction_type="withdrawal",
        location="Miami, FL",
        device_id="DEVICE_001"
    )
    print(fraud_result)
    print("Fraud Detection System test PASSED")
except Exception as e:
    print(f"Fraud Detection System test FAILED: {e}")

print("" + "=" * 60)

Complete LangGraph Banking Agent

Now we'll create our intelligent banking agent with LangGraph that can automatically select and use the appropriate banking tools based on user requests.


# Enhanced banking system prompt with tool selection guidance
system_context = """You are a professional banking AI assistant with access to specialized banking tools.
            Analyze the user's banking request and directly use the most appropriate tools to help them.
            
            AVAILABLE BANKING TOOLS:
            
            credit_risk_analyzer - Analyze credit risk for loan applications and credit decisions
            - Use for: loan applications, credit assessments, risk analysis, mortgage eligibility
            - Examples: "Analyze credit risk for $50k personal loan", "Assess mortgage eligibility for $300k home purchase"
            - Parameters: customer_income, customer_debt, credit_score, loan_amount, loan_type

            customer_account_manager - Manage customer accounts and provide banking services
            - Use for: account information, transaction processing, product recommendations, customer service
            - Examples: "Check balance for checking account 12345", "Recommend products for customer with high balance"
            - Parameters: account_type, customer_id, action, amount, account_details

            fraud_detection_system - Analyze transactions for potential fraud and security risks
            - Use for: transaction monitoring, fraud prevention, risk assessment, security alerts
            - Examples: "Analyze fraud risk for $500 ATM withdrawal in Miami", "Check security for $2000 online purchase"
            - Parameters: transaction_id, customer_id, transaction_amount, transaction_type, location, device_id

            BANKING INSTRUCTIONS:
            - Analyze the user's banking request carefully and identify the primary need
            - If they need credit analysis → use credit_risk_analyzer
            - If they need account services → use customer_account_manager
            - If they need security analysis → use fraud_detection_system
            - Extract relevant parameters from the user's request
            - Provide helpful, accurate banking responses based on tool outputs
            - Always consider banking regulations, risk management, and best practices
            - Be professional and thorough in your analysis

            Choose and use tools wisely to provide the most helpful banking assistance.
        """
# Initialize the main LLM for banking responses
main_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
# Bind all banking tools to the main LLM
llm_with_tools = main_llm.bind_tools(AVAILABLE_TOOLS)

# Banking Agent State Definition
class BankingAgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]
    user_input: str
    session_id: str
    context: dict

def create_banking_langgraph_agent():
    """Create a comprehensive LangGraph banking agent with intelligent tool selection."""
    def llm_node(state: BankingAgentState) -> BankingAgentState:
        """Main LLM node that processes banking requests and selects appropriate tools."""
        messages = state["messages"]
        # Add system context to messages
        enhanced_messages = [SystemMessage(content=system_context)] + list(messages)
        # Get LLM response with tool selection
        response = llm_with_tools.invoke(enhanced_messages)
        return {
            **state,
            "messages": messages + [response]
        }
    
    def should_continue(state: BankingAgentState) -> str:
        """Decide whether to use tools or end the conversation."""
        last_message = state["messages"][-1]
        # Check if the LLM wants to use tools
        if hasattr(last_message, 'tool_calls') and last_message.tool_calls:
            return "tools"
        return END
        
    # Create the banking state graph
    workflow = StateGraph(BankingAgentState)
    # Add nodes
    workflow.add_node("llm", llm_node)
    workflow.add_node("tools", ToolNode(AVAILABLE_TOOLS))
    # Simplified entry point - go directly to LLM
    workflow.add_edge(START, "llm")
    # From LLM, decide whether to use tools or end
    workflow.add_conditional_edges(
        "llm",
        should_continue,
        {"tools": "tools", END: END}
    )
    # Tool execution flows back to LLM for final response
    workflow.add_edge("tools", "llm")
    # Set up memory
    memory = MemorySaver()
    # Compile the graph
    agent = workflow.compile(checkpointer=memory)
    return agent

# Create the banking intelligent agent
banking_agent = create_banking_langgraph_agent()

print("Banking LangGraph Agent Created Successfully!")
print("\nFeatures:")
print("   - Intelligent banking tool selection")
print("   - Comprehensive banking system prompt")
print("   - Streamlined workflow: LLM → Tools → Response")
print("   - Automatic tool parameter extraction")
print("   - Professional banking assistance")

ValidMind Model Integration

Now we'll integrate our banking LangGraph agent with ValidMind for comprehensive testing and validation.

from validmind.models import Prompt

def banking_agent_fn(input):
    """
    Invoke the banking agent with the given input.
    """
    try:
        # Initial state for banking agent
        initial_state = {
            "user_input": input["input"],
            "messages": [HumanMessage(content=input["input"])],
            "session_id": input["session_id"],
            "context": {}
        }
        session_config = {"configurable": {"thread_id": input["session_id"]}}
        result = banking_agent.invoke(initial_state, config=session_config)

        from utils import capture_tool_output_messages

        # Capture all tool outputs and metadata
        captured_data = capture_tool_output_messages(result)
    
        # Access specific tool outputs, this will be used for RAGAS tests
        tool_message = ""
        for output in captured_data["tool_outputs"]:
            tool_message += output['content']

        return {"prediction": result['messages'][-1].content, "output": result, "tool_messages": [tool_message]}
    except Exception as e:
        # Return a fallback response if the agent fails
        error_message = f"""I apologize, but I encountered an error while processing your banking request: {str(e)}.
        Please try rephrasing your question or contact support if the issue persists."""
        return {
            "prediction": error_message, 
            "output": {
                "messages": [HumanMessage(content=input["input"]), SystemMessage(content=error_message)],
                "error": str(e)
            }
        }

## Initialize the model
vm_banking_model = vm.init_model(
    input_id="banking_agent_model",
    predict_fn=banking_agent_fn,
    prompt=Prompt(template=system_context)
)

# Add the banking agent to the vm model
vm_banking_model.model = banking_agent

print("Banking Agent Successfully Integrated with ValidMind!")
print(f"Model ID: {vm_banking_model.input_id}")

Prompt Validation

Let's get an initial sense of how well the prompt meets a few best practices for prompt engineering. These tests use an LLM to rate the prompt on a scale of 1-10 against the following criteria:

  • Clarity: How clearly the prompt states the task.
  • Conciseness: How succinctly the prompt states the task.
  • Delimitation: Whether complex prompts (containing examples, contextual information, or other elements) are formatted so that each element is clearly separated.
  • NegativeInstruction: Whether the prompt contains negative instructions.
  • Specificity: How specific the prompt defines the task.

run_test(
    "validmind.prompt_validation.Clarity",
    inputs={
        "model": vm_banking_model,
    },
).log()

run_test(
    "validmind.prompt_validation.Conciseness",
    inputs={
        "model": vm_banking_model,
    },
).log()

run_test(
    "validmind.prompt_validation.Delimitation",
    inputs={
        "model": vm_banking_model,
    },
).log()

run_test(
    "validmind.prompt_validation.NegativeInstruction",
    inputs={
        "model": vm_banking_model,
    },
).log()

run_test(
    "validmind.prompt_validation.Specificity",
    inputs={
        "model": vm_banking_model,
    },
).log()

Banking Test Dataset

We'll use our comprehensive banking test dataset to evaluate our agent's performance across different banking scenarios.
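The banking_test_dataset module ships with this demo and isn't reproduced here. Based on how it is used below (an input text column, a possible_outputs target of expected keywords, a session_id passed to the agent, and an expected_tools column checked by the tool-call accuracy test), each row is assumed to look roughly like this; the values are illustrative:

import pandas as pd

banking_test_dataset = pd.DataFrame([
    {
        "input": "Analyze credit risk for a $50k personal loan; income $75k, credit score 720",
        "session_id": "session_001",
        "possible_outputs": ["credit risk", "loan"],   # keywords an accurate answer should contain
        "expected_tools": ["credit_risk_analyzer"],    # tool(s) the agent is expected to call
    },
])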

Initialize ValidMind Dataset

Before we can run tests and evaluations, we need to initialize our banking test dataset as a ValidMind dataset object.

# Import our banking-specific test dataset
from banking_test_dataset import banking_test_dataset

vm_test_dataset = vm.init_dataset(
    input_id="banking_test_dataset",
    dataset=banking_test_dataset,
    text_column="input",
    target_column="possible_outputs",
)

print("Banking Test Dataset Initialized in ValidMind!")
print(f"Dataset ID: {vm_test_dataset.input_id}")
print(f"Dataset columns: {vm_test_dataset._df.columns}")

Run the agent and capture results through assign_predictions

Now we'll execute our banking agent on the test dataset and capture its responses for evaluation. Because banking_agent_fn returns a dictionary with prediction, output, and tool_messages keys, assign_predictions stores each one in its own column, prefixed with the model's input_id (banking_agent_model_prediction, banking_agent_model_output, and banking_agent_model_tool_messages); the tests below reference these columns.

vm_test_dataset.assign_predictions(vm_banking_model)

print("Banking Agent Predictions Generated Successfully!")
print(f"Predictions assigned to {len(vm_test_dataset._df)} test cases")

Dataframe Display Settings

# Show full cell contents when displaying the dataset
pd.set_option('display.width', 120)
pd.set_option('display.max_colwidth', None)
print("Banking Test Dataset with Predictions:")
vm_test_dataset._df.head()

Banking Accuracy Test

This test evaluates the banking agent's ability to provide accurate responses by:

  • Testing against a dataset of predefined banking questions and expected answers
  • Checking if responses contain expected keywords and banking terminology
  • Providing detailed test results including pass/fail status
  • Helping identify any gaps in the agent's banking knowledge or response quality


@vm.test("my_custom_tests.banking_accuracy_test")
def banking_accuracy_test(model, dataset, list_of_columns):
    """
    Run tests on a dataset of banking questions and expected responses.
    Each response passes if it contains at least one of the expected keywords.
    """
    df = dataset._df
    
    # Pre-compute responses for all tests
    y_true = dataset.y.tolist()
    y_pred = dataset.y_pred(model).tolist()

    # Check each response for at least one expected keyword
    test_results = []
    for response, keywords in zip(y_pred, y_true):
        # Convert keywords to list if not already a list
        if not isinstance(keywords, list):
            keywords = [keywords]
        test_results.append(any(str(keyword).lower() in str(response).lower() for keyword in keywords))
        
    results = pd.DataFrame()
    column_names = [col + "_details" for col in list_of_columns]
    results[column_names] = df[list_of_columns]
    results["actual"] = y_pred
    results["expected"] = y_true
    results["passed"] = test_results
    results["error"] = None if test_results else f'Response did not contain any expected keywords: {y_true}'
    
    return results
   
result = run_test(
    "my_custom_tests.banking_accuracy_test",
    inputs={
        "dataset": vm_test_dataset,
        "model": vm_banking_model
    },
    params={
        "list_of_columns": ["input"]
    }
)
result.log()

Banking Tool Call Accuracy Test

This test evaluates how accurately our intelligent banking router selects the correct tools for different banking requests. It provides quantitative feedback on the agent's core intelligence: its ability to understand what users need and select the right banking tools to help them.

@vm.test("my_custom_tests.BankingToolCallAccuracy")
def BankingToolCallAccuracy(dataset, agent_output_column, expected_tools_column):
    """Test validation using actual LangGraph banking agent results."""
    def validate_tool_calls_simple(messages, expected_tools):
        """Simple validation of tool calls without RAGAS dependency issues."""
        
        tool_calls_found = []
        
        for message in messages:
            if hasattr(message, 'tool_calls') and message.tool_calls:
                for tool_call in message.tool_calls:
                    # Handle both dictionary and object formats
                    if isinstance(tool_call, dict):
                        tool_calls_found.append(tool_call['name'])
                    else:
                        # ToolCall object - use attribute access
                        tool_calls_found.append(tool_call.name)
        
        # Check if expected tools were called
        accuracy = 0.0
        matches = 0
        if expected_tools:
            matches = sum(1 for tool in expected_tools if tool in tool_calls_found)
            accuracy = matches / len(expected_tools)
        
        return {
            'expected_tools': expected_tools,
            'found_tools': tool_calls_found,
            'matches': matches,
            'total_expected': len(expected_tools) if expected_tools else 0,
            'accuracy': accuracy,
        }

    df = dataset._df
    
    results = []
    for i, row in df.iterrows():
        result = validate_tool_calls_simple(row[agent_output_column]['messages'], row[expected_tools_column])
        results.append(result)
         
    return results

run_test(
    "my_custom_tests.BankingToolCallAccuracy",
    inputs={
        "dataset": vm_test_dataset,
    },
    params={
        "agent_output_column": "banking_agent_model_output",
        "expected_tools_column": "expected_tools"
    }
).log()

RAGAS Tests for an Agent Evaluation

RAGAS (Retrieval-Augmented Generation Assessment) provides specialized metrics for evaluating conversational AI systems like our banking agent. These tests analyze different aspects of agent performance:

Our banking agent uses tools to retrieve information and generates responses based on that context, making it similar to a RAG system. RAGAS metrics help evaluate:

  • Response Quality: How well the agent uses retrieved tool outputs to generate helpful banking responses
  • Information Faithfulness: Whether agent responses accurately reflect tool outputs
  • Relevance Assessment: How well responses address the original banking query
  • Context Utilization: How effectively the agent incorporates tool results into final answers

These tests provide insights into how well our banking agent integrates tool usage with conversational abilities, ensuring it provides accurate, relevant, and helpful responses to banking users.

Faithfulness

Faithfulness measures how accurately the banking agent's responses reflect the information retrieved from tools. This metric evaluates:

Information Accuracy: Whether the agent correctly uses tool outputs in its responses

  • Fact Preservation: Ensuring credit scores, loan calculations, and compliance results are accurately reported
  • No Hallucination: Verifying the agent doesn't invent banking information not provided by tools
  • Source Attribution: Checking that responses align with actual tool outputs

Critical for Banking Trust: Faithfulness is essential for banking agent reliability because users need to trust that:

  • Credit analysis results are reported correctly
  • Financial calculations are accurate
  • Compliance checks return real information
  • Risk assessments are properly communicated

run_test(
    "validmind.model_validation.ragas.Faithfulness",
    inputs={"dataset": vm_test_dataset},
    param_grid={
        "user_input_column": ["input"],
        "response_column": ["banking_agent_model_prediction"],
        "retrieved_contexts_column": ["banking_agent_model_tool_messages"],
    },
).log()

Response Relevancy

Response Relevancy evaluates how well the banking agent's answers address the user's original banking question or request. This metric assesses:

Query Alignment: Whether responses directly answer what users asked for

  • Intent Fulfillment: Checking if the agent understood and addressed the user's actual banking need
  • Completeness: Ensuring responses provide sufficient information to satisfy the banking query
  • Focus: Avoiding irrelevant information that doesn't help the banking user

Banking Quality: Measures the agent's ability to maintain relevant, helpful banking dialogue

  • Context Awareness: Responses should be appropriate for the banking conversation context
  • User Satisfaction: Answers should be useful and actionable for banking users
  • Clarity: Banking information should be presented in a way that directly helps the user

High relevancy indicates the banking agent successfully understands user needs and provides targeted, helpful banking responses.

run_test(
    "validmind.model_validation.ragas.ResponseRelevancy",
    inputs={"dataset": vm_test_dataset},
    params={
        "user_input_column": "input",
        "response_column": "banking_agent_model_prediction",
        "retrieved_contexts_column": "banking_agent_model_tool_messages",
    }
).log()

Context Recall

Context Recall measures how well the banking agent utilizes the information retrieved from tools when generating its responses. This metric evaluates:

Information Utilization: Whether the agent effectively incorporates tool outputs into its responses

  • Coverage: How much of the available tool information is used in the response
  • Integration: How well tool outputs are woven into coherent, natural banking responses
  • Completeness: Whether all relevant information from tools is considered

Tool Effectiveness: Assesses whether selected banking tools provide useful context for responses

  • Relevance: Whether tool outputs actually help answer the user's banking question
  • Sufficiency: Whether enough information was retrieved to generate good banking responses
  • Quality: Whether the tools provided accurate, helpful banking information

High context recall indicates the banking agent not only selects the right tools but also effectively uses their outputs to create comprehensive, well-informed banking responses.

run_test(
    "validmind.model_validation.ragas.ContextRecall",
    inputs={"dataset": vm_test_dataset},
    param_grid={
        "user_input_column": ["input"],
        "retrieved_contexts_column": ["banking_agent_model_tool_messages"],
        "reference_column": ["banking_agent_model_prediction"],
    },
).log()

Safety

Safety testing is critical for banking AI agents to ensure they operate reliably and securely. These tests help validate that our banking agent maintains high standards of fairness and professionalism.

AspectCritic

AspectCritic provides comprehensive evaluation across multiple dimensions of banking agent performance. This metric analyzes various aspects of response quality:

Multi-Dimensional Assessment: Evaluates responses across different quality criteria:

  • Conciseness: Whether responses are clear and to-the-point without unnecessary details
  • Coherence: Whether responses are logically structured and easy to follow
  • Correctness: Accuracy of banking information and appropriateness of recommendations
  • Harmfulness: Whether responses could cause harm or damage to users or systems
  • Maliciousness: Whether responses contain malicious content or intent

Holistic Quality Scoring: Provides an overall assessment that considers:

  • User Experience: How satisfying and useful the banking interaction would be for real users
  • Professional Standards: Whether responses meet quality expectations for production banking systems
  • Consistency: Whether the banking agent maintains quality across different types of requests

AspectCritic helps identify specific areas where the banking agent excels or needs improvement, providing actionable insights for enhancing overall performance and user satisfaction in banking scenarios.

run_test(
    "validmind.model_validation.ragas.AspectCritic",
    inputs={"dataset": vm_test_dataset},
    param_grid={
        "user_input_column": ["input"],
        "response_column": ["banking_agent_model_prediction"],
        "retrieved_contexts_column": ["banking_agent_model_tool_messages"],
    },
).log()

Prompt bias

Let's check if the agent's prompts contain unintended biases that could affect banking decisions.

run_test(
    "validmind.prompt_validation.Bias",
    inputs={
        "model": vm_banking_model,
    },
).log()

Toxicity

Let's ensure responses are professional and appropriate for banking contexts.

run_test(
    "validmind.data_validation.nlp.Toxicity",
    inputs={
        "dataset": vm_test_dataset,
    },
).log()

Demo Summary and Next Steps

We have successfully built and tested a comprehensive Banking AI Agent using LangGraph and ValidMind. Here's what we've accomplished:

What We Built

  1. 3 Specialized Banking Tools
    • Credit Risk Analyzer for loan assessments
    • Customer Account Manager for account services
    • Fraud Detection System for security monitoring
  2. Intelligent LangGraph Agent
    • Automatic tool selection based on user requests
    • Banking-specific system prompts and guidance
    • Professional banking assistance and responses
  3. Comprehensive Testing Framework
    • Banking-specific test cases
    • ValidMind integration for validation
    • Performance analysis across banking domains

Next Steps

  1. Customize Tools: Adapt the banking tools to your specific banking requirements
  2. Expand Test Cases: Add more banking scenarios and edge cases
  3. Integrate with Real Data: Connect to actual banking systems and databases
  4. Add More Tools: Implement additional banking-specific functionality
  5. Production Deployment: Deploy the agent in a production banking environment

Key Benefits

  • Industry-Specific: Designed specifically for banking operations
  • Regulatory Compliance: Built-in SR 11-7 and SS 1-23 compliance checks
  • Risk Management: Comprehensive credit and fraud risk assessment
  • Customer Focus: Tools for both retail and commercial banking needs
  • Real-World Applicability: Addresses actual banking use cases and challenges

Your banking AI agent is now ready to handle real-world banking scenarios while maintaining regulatory compliance and risk management best practices!
