Evaluations
Transform your AI agents from unpredictable to production-ready! Test, validate, and perfect your agents with automated quality assurance.
Why Testing Matters
Just like you wouldn't ship code without tests, don't deploy AI agents without evaluations! Evaluations give you confidence that your agents will handle real-world scenarios gracefully, consistently, and professionally.
What Are Evaluations?
Think of evaluations as unit tests for your AI agents! They help you:
CSV-Based Testing
Define test cases in simple CSV files - no complex setup required!
Automated Validation
Run hundreds of tests automatically with built-in assertions
LLM-as-Judge
Use AI to evaluate subjective qualities like helpfulness
Result Tracking
Export results to CSV for analysis and CI/CD integration
Creating Your First Evaluation
Step 1: Generate the Evaluation Class
Let's create an evaluation to test a customer support agent! Run this magical command:
php artisan vizra:make:evaluation CustomerSupportEvaluation
Boom! This creates a new evaluation in app/Evaluations/CustomerSupportEvaluation.php:
<?php

namespace App\Evaluations;

use Vizra\VizraADK\Evaluations\BaseEvaluation;

class CustomerSupportEvaluation extends BaseEvaluation
{
    public string $name = 'customer_support_eval';
    public string $description = 'Evaluate customer support agent responses';
    public string $agentName = 'customer_support'; // Agent alias
    public string $csvPath = 'app/Evaluations/data/customer_support_tests.csv';

    public function preparePrompt(array $csvRowData): string
    {
        // Use the 'prompt' column from CSV by default
        return $csvRowData[$this->getPromptCsvColumn()] ?? '';
    }

    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        // Reset assertions for this row
        $this->resetAssertionResults();

        // Run assertions based on test type
        if ($csvRowData['test_type'] === 'greeting') {
            $this->assertResponseContains($llmResponse, 'help');
            $this->assertResponseHasPositiveSentiment($llmResponse);
        }

        // Return evaluation results
        $allPassed = collect($this->assertionResults)
            ->every(fn ($r) => $r['status'] === 'pass');

        return [
            'row_data' => $csvRowData,
            'llm_response' => $llmResponse,
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}
Step 2: Create Your Test Data
Now for the fun part - creating test scenarios! Let's make a CSV file with different customer interactions:
prompt,test_type,expected_contains,expected_sentiment
"Hello, I need help",greeting,help,positive
"Where is my order #12345?",order_inquiry,order,neutral
"I want to return this product",return_request,return,positive
"This is terrible service!",complaint,sorry,empathetic
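Each CSV row reaches your evaluation as an associative array keyed by the header row. A hypothetical sketch of that mapping in plain PHP (not ADK internals), using the first row of the CSV above:

```php
<?php

// Sketch: how one CSV row becomes the $csvRowData array passed to
// preparePrompt() and evaluateRow(). Keys come from the header row.
$header = str_getcsv('prompt,test_type,expected_contains,expected_sentiment');
$row    = str_getcsv('"Hello, I need help",greeting,help,positive');

// Combine header names with the row's values
$csvRowData = array_combine($header, $row);

echo $csvRowData['prompt'];    // Hello, I need help
echo $csvRowData['test_type']; // greeting
```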
Pro Tip: Structure Your CSV Wisely!
Each column in your CSV can be used for different purposes:
- prompt - The input to send to your agent
- test_type - Categorize tests for different assertion logic
- expected_* - What you expect in the response
- Add any custom columns you need!
Your Assertion Toolbox
Vizra ADK provides a rich collection of assertions to validate every aspect of your agent's responses!
Content Assertions
// Check if response contains text
$this->assertResponseContains($response, 'expected text');
$this->assertResponseDoesNotContain($response, 'unwanted');
// Pattern matching
$this->assertResponseMatchesRegex($response, '/pattern/');
// Position checks
$this->assertResponseStartsWith($response, 'Hello');
$this->assertResponseEndsWith($response, '.');
// Multiple checks
$this->assertContainsAnyOf($response, ['yes', 'sure', 'okay']);
$this->assertContainsAllOf($response, ['thank', 'you']);
Length & Structure
// Response length
$this->assertResponseLengthBetween($response, 50, 500);
// Word count
$this->assertWordCountBetween($response, 10, 100);
// Format validation
$this->assertResponseIsValidJson($response);
$this->assertJsonHasKey($response, 'result');
$this->assertResponseIsValidXml($response);
Quality Checks
// Sentiment analysis
$this->assertResponseHasPositiveSentiment($response);
// Writing quality
$this->assertGrammarCorrect($response);
$this->assertReadabilityLevel($response, 12);
$this->assertNoRepetition($response, 0.3);
Safety & Security
// Content safety
$this->assertNotToxic($response);
// Privacy protection
$this->assertNoPII($response);
// General safety
$this->assertResponseIsNotEmpty($response);
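Under the hood, each assert* call records a structured pass/fail entry that evaluateRow() later aggregates into final_status. The real internals belong to the ADK; this is an illustrative sketch in plain PHP, and the field names are assumptions based on the examples in this guide:

```php
<?php

// Illustrative only (not the ADK's actual implementation): an assertion
// produces a result entry, and final_status is "pass" only when every
// entry passed.
function checkContains(string $response, string $needle): array
{
    return [
        'assertion' => 'assertResponseContains',
        'expected'  => $needle,
        'status'    => str_contains($response, $needle) ? 'pass' : 'fail',
    ];
}

$response = "Hello! I'd be happy to help with your order.";

$assertionResults = [
    checkContains($response, 'help'),   // passes
    checkContains($response, 'refund'), // fails: 'refund' never appears
];

// The same every-assertion-passed aggregation used for final_status
$allPassed = array_reduce(
    $assertionResults,
    fn ($carry, $r) => $carry && $r['status'] === 'pass',
    true
);

$finalStatus = $allPassed ? 'pass' : 'fail';
echo $finalStatus; // fail
```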
LLM as Judge - The Ultimate Quality Check
Sometimes you need another AI to evaluate subjective qualities. That's where LLM-as-Judge comes in!
When to Use LLM Judge?
Perfect for evaluating:
- Helpfulness and professionalism
- Empathy and emotional intelligence
- Creativity and originality
- Accuracy of complex responses
- Overall response quality
Using LLM Judge Assertions
public function evaluateRow(array $csvRowData, string $llmResponse): array
{
    $this->resetAssertionResults();

    // Use LLM judge for subjective evaluation
    $this->assertLlmJudge(
        $llmResponse,
        'Is this response helpful and professional?',
        'llm_judge', // agent name
        'pass',      // expected outcome
        'Response should be helpful and professional'
    );

    // Quality scoring
    $this->assertLlmJudgeQuality(
        $llmResponse,
        'Rate the clarity and completeness of this response',
        7, // minimum score out of 10
        'llm_judge'
    );

    // Compare responses
    $referenceResponse = $csvRowData['reference_response'] ?? '';
    if ($referenceResponse) {
        $this->assertLlmJudgeComparison(
            $llmResponse,
            $referenceResponse,
            'Which response is more helpful?',
            'actual' // expect actual to win
        );
    }

    // Return results...
}
Setting Up Your Judge Agent
The LLM judge needs to be registered first. Here's how to create your own expert evaluator!
// Register the LLM judge agent
public function boot(): void
{
    // Option 1: Use a dedicated judge agent class
    Agent::build(LlmJudgeAgent::class)->register();

    // Option 2: Create an ad-hoc judge agent (easier!)
    Agent::define('llm_judge')
        ->description('Expert evaluator for judging responses')
        ->instructions('You are an expert evaluator...')
        ->model('gpt-4')
        ->temperature(0.3) // Lower temperature for consistency
        ->register();
}
Running Your Evaluations
Time to put your agent to the test! Let's see how it performs!
Running from CLI
# Run evaluation by class name
php artisan vizra:run:eval CustomerSupportEvaluation
# Save results to CSV for analysis
php artisan vizra:run:eval CustomerSupportEvaluation --output=results.csv
# Results are saved to storage/app/evaluations/ by default
What You'll See
Watch the magic happen with a beautiful progress bar and detailed results!
Running evaluation: customer_support_eval
Description: Evaluate customer support agent responses
Processing 4 rows from CSV using agent 'customer_support'...
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 4/4
Evaluation processing complete.
┌─────┬──────────────┬───────────────────────────┬──────────────────┬───────┐
│ Row │ Final Status │ LLM Response Summary      │ Assertions Count │ Error │
├─────┼──────────────┼───────────────────────────┼──────────────────┼───────┤
│ 1   │ ✅ pass      │ Hello! I'd be happy to... │ 2                │       │
│ 2   │ ✅ pass      │ I can help you track...   │ 1                │       │
│ 3   │ ❌ fail      │ Sure, let me assist...    │ 2                │       │
│ 4   │ ✅ pass      │ I understand your...      │ 3                │       │
└─────┴──────────────┴───────────────────────────┴──────────────────┴───────┘
Summary: Total Rows: 4, Passed: 3 (75%), Failed: 1 (25%), Errors: 0
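The summary line is simple arithmetic over each row's final_status. A quick sketch of that aggregation in plain PHP:

```php
<?php

// Sketch of the summary arithmetic shown above, starting from each
// row's final_status value.
$rowStatuses = ['pass', 'pass', 'fail', 'pass'];

$total  = count($rowStatuses);
$passed = count(array_filter($rowStatuses, fn ($s) => $s === 'pass'));
$failed = $total - $passed;

printf(
    "Summary: Total Rows: %d, Passed: %d (%d%%), Failed: %d (%d%%), Errors: 0\n",
    $total, $passed, 100 * $passed / $total, $failed, 100 * $failed / $total
);
// Summary: Total Rows: 4, Passed: 3 (75%), Failed: 1 (25%), Errors: 0
```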
Advanced Example - Putting It All Together
Ready for the full experience? Here's a complete evaluation implementation that showcases all the techniques!
<?php

namespace App\Evaluations;

use Vizra\VizraADK\Evaluations\BaseEvaluation;

class CustomerSupportEvaluation extends BaseEvaluation
{
    public string $name = 'customer_support_eval';
    public string $description = 'Comprehensive customer support evaluation';
    public string $agentName = 'customer_support';
    public string $csvPath = 'app/Evaluations/data/customer_support_tests.csv';

    public function preparePrompt(array $csvRowData): string
    {
        // Get the base prompt
        $prompt = $csvRowData[$this->getPromptCsvColumn()] ?? '';

        // Add context if available
        if (isset($csvRowData['context'])) {
            $prompt = "Context: " . $csvRowData['context'] . "\n\n" . $prompt;
        }

        return $prompt;
    }

    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        $this->resetAssertionResults();

        // Basic content checks
        if (isset($csvRowData['expected_contains'])) {
            $this->assertResponseContains(
                $llmResponse,
                $csvRowData['expected_contains']
            );
        }

        // Test type specific assertions
        switch ($csvRowData['test_type'] ?? '') {
            case 'greeting':
                $this->assertResponseHasPositiveSentiment($llmResponse);
                $this->assertWordCountBetween($llmResponse, 10, 50);
                break;

            case 'complaint':
                $this->assertResponseContains($llmResponse, 'sorry');
                $this->assertNotToxic($llmResponse);
                $this->assertLlmJudge(
                    $llmResponse,
                    'Is this response empathetic and de-escalating?',
                    'llm_judge',
                    'pass'
                );
                break;

            case 'technical':
                $this->assertReadabilityLevel($llmResponse, 12);
                $this->assertGrammarCorrect($llmResponse);
                break;
        }

        // General quality checks
        $this->assertResponseIsNotEmpty($llmResponse);
        $this->assertNoPII($llmResponse);

        // Determine final status
        $allPassed = collect($this->assertionResults)
            ->every(fn ($r) => $r['status'] === 'pass');

        return [
            'row_data' => $csvRowData,
            'llm_response' => $llmResponse,
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}
Analyzing Your Results
CSV Output Structure
When you export results with --output, you get a comprehensive CSV report!
CSV Columns Explained:
- Evaluation Name - The name of your evaluation
- Row Index - Which test case from your CSV
- Final Status - pass ✅, fail ❌, or error ⚠️
- LLM Response - What your agent actually said
- Assertions (JSON) - Detailed results of each check
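Because the Assertions column is JSON, exported results are easy to post-process. A short sketch, where the field names are assumptions that mirror the evaluateRow() return shape shown earlier:

```php
<?php

// Sketch: decode the Assertions (JSON) column of one exported row and
// pull out only the failing checks. Field names are assumptions.
$assertionsJson = '[{"assertion":"assertResponseContains","status":"pass"},'
    . '{"assertion":"assertNoPII","status":"fail"}]';

$assertions = json_decode($assertionsJson, true);

// Keep only checks that did not pass
$failing = array_filter($assertions, fn ($a) => $a['status'] !== 'pass');

foreach ($failing as $a) {
    echo "Failed: {$a['assertion']}\n"; // Failed: assertNoPII
}
```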
CI/CD Integration
Make testing automatic! Here's how to add evaluations to your CI/CD pipeline!
# Evaluate agents on every push
name: Evaluate Agents

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Setup PHP & Dependencies
        uses: shivammathur/setup-php@v2
        with:
          php-version: '8.2'

      - name: Install Dependencies
        run: composer install

      - name: Run Evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          php artisan vizra:run:eval CustomerSupportEvaluation --output=results.csv

      - name: Check Results
        run: |
          # Add your own pass/fail logic based on CSV results
          php artisan app:check-eval-results storage/app/evaluations/results.csv

      - name: Upload Results
        uses: actions/upload-artifact@v2
        with:
          name: evaluation-results
          path: storage/app/evaluations/
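Note that app:check-eval-results is your own command - the "Check Results" step just references it. Its core logic can be as simple as parsing the results CSV and failing the build when any row didn't pass. A minimal sketch, assuming the status column is named 'Final Status' as described earlier:

```php
<?php

// Hypothetical gate logic for a command like app:check-eval-results:
// return false as soon as any row's Final Status is not "pass".
function evalResultsPass(string $csv): bool
{
    $lines  = array_filter(explode("\n", trim($csv)));
    $header = str_getcsv(array_shift($lines));
    $col    = array_search('Final Status', $header);

    foreach ($lines as $line) {
        $row = str_getcsv($line);
        if ($row[$col] !== 'pass') {
            return false; // any non-passing row fails the build
        }
    }
    return true;
}

$csv = "Evaluation Name,Row Index,Final Status\n"
    . "customer_support_eval,1,pass\n"
    . "customer_support_eval,2,fail\n";

$ok = evalResultsPass($csv);
echo $ok ? 'pass' : 'fail'; // fail
// In a real command you would exit($ok ? 0 : 1) so CI fails the job.
```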
Best Practices for Awesome Evaluations
CSV Organization - Use clear test types and descriptive columns
Thorough Testing - Combine multiple assertion types
LLM Judge - Use for subjective quality checks
CI/CD Integration - Run evaluations on every push
Track Progress - Monitor performance over time
Real Data - Include actual user queries
Edge Cases - Test error scenarios too
Consistency - Use the same criteria across agents
You're Ready to Test Like a Pro!
With evaluations, you can ship AI agents with confidence! Your agents will be tested, validated, and ready for real-world challenges. Happy testing!