Evaluations
Transform your AI agents from unpredictable to reliable! Test, validate, and perfect your agents with automated quality assurance.
Why Testing Matters
Just like you wouldn't ship code without tests, don't deploy AI agents without evaluations! Evaluations give you confidence that your agents will handle real-world scenarios gracefully, consistently, and professionally.
What Are Evaluations?
Think of evaluations as unit tests for your AI agents! They help you:
CSV-Based Testing
Define test cases in simple CSV files - no complex setup required!
Automated Validation
Run hundreds of tests automatically with built-in assertions
LLM-as-Judge
Use AI to evaluate subjective qualities like helpfulness
Result Tracking
Export results to CSV for analysis and CI/CD integration
Creating Your First Evaluation
Step 1: Generate the Evaluation Class
Let's create an evaluation to test a customer support agent! Run this magical command:
php artisan vizra:make:eval CustomerSupportEvaluation
Double Magic! What Gets Created
This single command creates two files for you:
app/Evaluations/CustomerSupportEvaluation.php - Your evaluation class
app/Evaluations/data/customer_support_evaluation.csv - Empty CSV with headers ready for test data
No need to manually create the CSV file - it's all set up and ready for you to add test cases!
Boom! This creates your evaluation class in app/Evaluations/CustomerSupportEvaluation.php:
<?php

namespace App\Evaluations;

use Vizra\VizraADK\Evaluations\BaseEvaluation;

class CustomerSupportEvaluation extends BaseEvaluation
{
    public string $name = 'customer_support_eval';
    public string $description = 'Evaluate customer support agent responses';
    public string $agentName = 'customer_support'; // Agent alias
    public string $csvPath = 'app/Evaluations/data/customer_support_tests.csv';

    public function preparePrompt(array $csvRowData): string
    {
        // Use the 'prompt' column from CSV by default
        return $csvRowData[$this->getPromptCsvColumn()] ?? '';
    }

    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        // Reset assertions for this row
        $this->resetAssertionResults();

        // Run assertions based on test type (the column may be absent, so default to '')
        if (($csvRowData['test_type'] ?? '') === 'greeting') {
            $this->assertResponseContains($llmResponse, 'help');
            $this->assertResponseHasPositiveSentiment($llmResponse);
        }

        // Return evaluation results
        $allPassed = collect($this->assertionResults)
            ->every(fn ($r) => $r['status'] === 'pass');

        return [
            'row_data' => $csvRowData,
            'llm_response' => $llmResponse,
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}
Step 2: Add Your Test Data
Now for the fun part - adding test scenarios! The CSV file was automatically created with standard headers. Let's populate it with different customer interactions:
prompt,expected_response,description
"Hello, I need help",help,"Greeting test - should offer assistance"
"Where is my order #12345?",order,"Order inquiry - should help track order"
"I want to return this product",return,"Return request - should explain process"
"This is terrible service!",sorry,"Complaint - should be empathetic"
Pro Tip: Customize Your CSV Structure!
The auto-generated CSV starts with standard headers, but you can customize it for your needs:
prompt - The input to send to your agent (required)
expected_response - What you expect in the response
description - Human-readable test description
test_type - Add this to categorize tests for different assertion logic
context - Add background information for the test
You can also add any custom columns you need!
The command creates the basic structure - feel free to add more columns as your evaluation needs grow!
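For example, a customized CSV feeding the advanced evaluation later in this guide might look like this - the columns beyond prompt are entirely your choice; here we use expected_contains, test_type, and context to match that example:
prompt,expected_contains,test_type,context,description
"Hello, I need help",help,greeting,"New customer browsing the pricing page","Greeting - should offer assistance"
"This is terrible service!",sorry,complaint,"Order arrived two weeks late","Complaint - should be empathetic and de-escalate"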
Your Assertion Toolbox
Vizra ADK provides a rich collection of assertions to validate every aspect of your agent's responses!
Content Assertions
// Check if response contains text
$this->assertResponseContains($llmResponse, 'expected text');
$this->assertResponseDoesNotContain($llmResponse, 'unwanted');
// Pattern matching
$this->assertResponseMatchesRegex($llmResponse, '/pattern/');
// Position checks
$this->assertResponseStartsWith($llmResponse, 'Hello');
$this->assertResponseEndsWith($llmResponse, '.');
// Multiple checks
$this->assertContainsAnyOf($llmResponse, ['yes', 'sure', 'okay']);
$this->assertContainsAllOf($llmResponse, ['thank', 'you']);
Length & Structure
// Response length
$this->assertResponseLengthBetween($llmResponse, 50, 500);
// Word count
$this->assertWordCountBetween($llmResponse, 10, 100);
// Format validation
$this->assertResponseIsValidJson($llmResponse);
$this->assertJsonHasKey($llmResponse, 'result');
$this->assertResponseIsValidXml($llmResponse);
Quality Checks
// Sentiment analysis
$this->assertResponseHasPositiveSentiment($llmResponse);
// Writing quality
$this->assertGrammarCorrect($llmResponse);
$this->assertReadabilityLevel($llmResponse, 12);
$this->assertNoRepetition($llmResponse, 0.3);
Safety & Security
// Content safety
$this->assertNotToxic($llmResponse);
// Privacy protection
$this->assertNoPII($llmResponse);
// General safety
$this->assertResponseIsNotEmpty($llmResponse);
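Here's how a few of these can be combined inside evaluateRow for an agent that is expected to reply with structured JSON - the 'result' key is just a hypothetical field you might require:
// Inside evaluateRow(), for a test where the agent should return JSON
$this->assertResponseIsNotEmpty($llmResponse);
$this->assertResponseIsValidJson($llmResponse);
$this->assertJsonHasKey($llmResponse, 'result'); // hypothetical key name
$this->assertResponseLengthBetween($llmResponse, 20, 1000);
$this->assertNoPII($llmResponse);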
LLM as Judge - The Ultimate Quality Check
Sometimes you need another AI to evaluate subjective qualities. That's where LLM-as-Judge comes in!
When to Use LLM Judge?
Perfect for evaluating:
- Helpfulness and professionalism
- Empathy and emotional intelligence
- Creativity and originality
- Accuracy of complex responses
- Overall response quality
Using LLM Judge Assertions
New Fluent Judge Interface!
We've introduced a cleaner, more intuitive syntax for judge assertions:
public function evaluateRow(array $csvRowData, string $llmResponse): array
{
    $this->resetAssertionResults();

    // Simple pass/fail evaluation
    $this->judge($llmResponse)
        ->using(PassFailJudgeAgent::class)
        ->expectPass();

    // Quality score evaluation
    $this->judge($llmResponse)
        ->using(QualityJudgeAgent::class)
        ->expectMinimumScore(7.5);

    // Multi-dimensional evaluation
    $this->judge($llmResponse)
        ->using(ComprehensiveJudgeAgent::class)
        ->expectMinimumScore([
            'accuracy' => 8,
            'helpfulness' => 7,
            'clarity' => 7,
        ]);

    // Return results...
}
Three Judge Patterns
1. Pass/Fail Judge
For binary decisions - returns {"pass": true/false, "reasoning": "..."}
$this->judge($response)
    ->using(PassFailJudgeAgent::class)
    ->expectPass();
2. Quality Score Judge
For numeric ratings - returns {"score": 8.5, "reasoning": "..."}
$this->judge($response)
    ->using(QualityJudgeAgent::class)
    ->expectMinimumScore(7.0);
3. Comprehensive Judge
For multi-dimensional evaluation - returns {"scores": {...}, "reasoning": "..."}
$this->judge($response)
    ->using(ComprehensiveJudgeAgent::class)
    ->expectMinimumScore([
        'accuracy' => 8,
        'helpfulness' => 7,
        'clarity' => 7,
    ]);
Running Your Evaluations
Time to put your agent to the test! Let's see how it performs!
Running from CLI
# Run evaluation by class name
php artisan vizra:run:eval CustomerSupportEvaluation
# Save results to CSV for analysis
php artisan vizra:run:eval CustomerSupportEvaluation --output=results.csv
# Results are saved to storage/app/evaluations/ by default
What You'll See
Watch the magic happen with a progress bar and detailed results!
Running evaluation: customer_support_eval
Description: Evaluate customer support agent responses
Processing 4 rows from CSV using agent 'customer_support'...
4/4 [============================] 100%

Evaluation processing complete.

+-----+--------------+----------------------------+------------------+-------+
| Row | Final Status | LLM Response Summary       | Assertions Count | Error |
+-----+--------------+----------------------------+------------------+-------+
| 1   | ✓ pass       | Hello! I'd be happy to...  | 2                |       |
| 2   | ✓ pass       | I can help you track...    | 1                |       |
| 3   | ✗ fail       | Sure, let me assist...     | 2                |       |
| 4   | ✓ pass       | I understand your...       | 3                |       |
+-----+--------------+----------------------------+------------------+-------+
Summary: Total Rows: 4, Passed: 3 (75%), Failed: 1 (25%), Errors: 0
Advanced Example - Putting It All Together
Ready for the full experience? Here's a complete evaluation implementation that showcases all the techniques!
<?php

namespace App\Evaluations;

use Vizra\VizraADK\Evaluations\BaseEvaluation;

class CustomerSupportEvaluation extends BaseEvaluation
{
    public string $name = 'customer_support_eval';
    public string $description = 'Comprehensive customer support evaluation';
    public string $agentName = 'customer_support';
    public string $csvPath = 'app/Evaluations/data/customer_support_tests.csv';

    public function preparePrompt(array $csvRowData): string
    {
        // Get the base prompt
        $prompt = $csvRowData[$this->getPromptCsvColumn()] ?? '';

        // Add context if available
        if (isset($csvRowData['context'])) {
            $prompt = "Context: " . $csvRowData['context'] . "\n\n" . $prompt;
        }

        return $prompt;
    }

    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        $this->resetAssertionResults();

        // Basic content checks
        if (isset($csvRowData['expected_contains'])) {
            $this->assertResponseContains(
                $llmResponse,
                $csvRowData['expected_contains']
            );
        }

        // Test type specific assertions
        switch ($csvRowData['test_type'] ?? '') {
            case 'greeting':
                $this->assertResponseHasPositiveSentiment($llmResponse);
                $this->assertWordCountBetween($llmResponse, 10, 50);
                break;

            case 'complaint':
                $this->assertResponseContains($llmResponse, 'sorry');
                $this->assertNotToxic($llmResponse);
                $this->assertLlmJudge(
                    $llmResponse,
                    'Is this response empathetic and de-escalating?',
                    'llm_judge',
                    'pass'
                );
                break;

            case 'technical':
                $this->assertReadabilityLevel($llmResponse, 12);
                $this->assertGrammarCorrect($llmResponse);
                break;
        }

        // General quality checks
        $this->assertResponseIsNotEmpty($llmResponse);
        $this->assertNoPII($llmResponse);

        // Determine final status
        $allPassed = collect($this->assertionResults)
            ->every(fn ($r) => $r['status'] === 'pass');

        return [
            'row_data' => $csvRowData,
            'llm_response' => $llmResponse,
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}
Analyzing Your Results
CSV Output Structure
When you export results with --output, you get a comprehensive CSV report!
CSV Columns Explained:
- Evaluation Name - The name of your evaluation
- Row Index - Which test case from your CSV
- Final Status - pass, fail, or error
- LLM Response - What your agent actually said
- Assertions (JSON) - Detailed results of each check
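Based on the columns above, a results file looks roughly like this - treat the exact header names and casing as an illustrative sketch, since they come from the exporter:
evaluation_name,row_index,final_status,llm_response,assertions
customer_support_eval,1,pass,"Hello! I'd be happy to help...","[{""assertion"":""assertResponseContains"",""status"":""pass""}]"
customer_support_eval,3,fail,"Sure, let me assist...","[{""assertion"":""assertResponseContains"",""status"":""fail""}]"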
Creating Custom Assertions
Need something specific? Create your own reusable assertion classes!
Simple Example: Product Name Assertion
Let's create a simple assertion that checks if a product name is mentioned:
<?php

namespace App\Evaluations\Assertions;

use Vizra\VizraADK\Evaluations\Assertions\BaseAssertion;

class ContainsProductAssertion extends BaseAssertion
{
    public function assert(string $response, ...$params): array
    {
        $productName = $params[0] ?? '';

        if (empty($productName)) {
            return $this->result(false, 'Product name parameter is required');
        }

        $contains = stripos($response, $productName) !== false;

        return $this->result(
            $contains,
            "Response should mention the product '{$productName}'",
            "contains '{$productName}'",
            $contains ? "found '{$productName}'" : "product not mentioned"
        );
    }
}
Using Your Custom Assertion
use App\Evaluations\Assertions\ContainsProductAssertion;

class ProductReviewEvaluation extends BaseEvaluation
{
    private ContainsProductAssertion $productAssertion;

    public function __construct()
    {
        parent::__construct();
        $this->productAssertion = new ContainsProductAssertion();
    }

    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        $this->resetAssertionResults();

        // Use your custom assertion
        $this->assertCustom(ContainsProductAssertion::class, $llmResponse, 'MacBook Pro');

        // Mix with built-in assertions
        $this->assertWordCountBetween($llmResponse, 50, 200);

        // Determine final status
        $allPassed = collect($this->assertionResults)
            ->every(fn ($r) => $r['status'] === 'pass');

        return [
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}
Pro Tip: CSV-Driven Custom Assertions!
You can even specify custom assertions in your CSV files:
prompt,assertion_class,assertion_params
"Tell me about the new iPhone",ContainsProductAssertion,"[\"iPhone\"]"
"Describe the MacBook features",ContainsProductAssertion,"[\"MacBook\"]"
Then use them dynamically in your evaluation:
if (isset($csvRowData['assertion_class'])) {
    $params = json_decode($csvRowData['assertion_params'] ?? '[]', true);
    $this->assertCustom($csvRowData['assertion_class'], $llmResponse, ...$params);
}
Generate Assertion Classes with Artisan
Creating new assertions is super easy with our generator command!
php artisan vizra:make:assertion EmailValidationAssertion
This creates a ready-to-use assertion class with helpful boilerplate!
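The exact boilerplate comes from the generator, but assuming it follows the BaseAssertion pattern shown earlier, the class you fill in will look roughly like this (the email regex is just an illustrative implementation, not part of the generated stub):
<?php

namespace App\Evaluations\Assertions;

use Vizra\VizraADK\Evaluations\Assertions\BaseAssertion;

class EmailValidationAssertion extends BaseAssertion
{
    public function assert(string $response, ...$params): array
    {
        // Illustrative logic: require at least one plausible email address in the response
        $hasEmail = preg_match('/[\w.+-]+@[\w-]+\.[\w.-]+/', $response) === 1;

        return $this->result(
            $hasEmail,
            'Response should contain a valid email address',
            'an email address',
            $hasEmail ? 'email found' : 'no email found'
        );
    }
}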
Built-in Custom Assertions
Vizra ADK comes with several ready-to-use custom assertions:
ContainsProductAssertion - Check if a product name is mentioned
JsonSchemaAssertion - Validate JSON structure against a schema
PriceFormatAssertion - Verify price formatting in any currency
EmailFormatAssertion - Check for valid email addresses
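They plug into evaluateRow through assertCustom just like your own classes; the parameters shown here are assumptions for illustration - check each class for its real signature and namespace:
// Inside evaluateRow() - parameter lists below are illustrative assumptions
$this->assertCustom(ContainsProductAssertion::class, $llmResponse, 'iPhone');
$this->assertCustom(EmailFormatAssertion::class, $llmResponse);
$schemaJson = '{"type":"object","required":["result"]}'; // hypothetical schema
$this->assertCustom(JsonSchemaAssertion::class, $llmResponse, $schemaJson);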
CI/CD Integration
Make testing automatic! Here's how to add evaluations to your CI/CD pipeline!
# Evaluate agents on every push
name: Evaluate Agents

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Setup PHP & Dependencies
        uses: shivammathur/setup-php@v2
        with:
          php-version: '8.2'

      - name: Install Dependencies
        run: composer install

      - name: Run Evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          php artisan vizra:run:eval CustomerSupportEvaluation --output=results.csv

      - name: Check Results
        run: |
          # Add your own pass/fail logic based on CSV results
          php artisan app:check-eval-results storage/app/evaluations/results.csv

      - name: Upload Results
        uses: actions/upload-artifact@v2
        with:
          name: evaluation-results
          path: storage/app/evaluations/
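The app:check-eval-results command used in the Check Results step isn't shipped with the ADK - it's your own pass/fail gate. A minimal sketch, assuming the exported CSV contains a final_status column:
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;

class CheckEvalResults extends Command
{
    protected $signature = 'app:check-eval-results {path : Path to the evaluation results CSV}';

    protected $description = 'Fail the build if any evaluation row did not pass';

    public function handle(): int
    {
        $rows = array_map('str_getcsv', file($this->argument('path')));
        $header = array_map('strtolower', array_shift($rows));

        // Assumes the exporter writes a "final_status" column
        $statusIndex = array_search('final_status', $header);

        if ($statusIndex === false) {
            $this->error('No final_status column found in the results file.');
            return self::FAILURE;
        }

        $failed = collect($rows)
            ->filter(fn ($row) => strtolower($row[$statusIndex] ?? '') !== 'pass')
            ->count();

        $this->info(sprintf('%d of %d rows failed.', $failed, count($rows)));

        return $failed === 0 ? self::SUCCESS : self::FAILURE;
    }
}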
Best Practices for Awesome Evaluations
- CSV Organization - Use clear test types and descriptive columns
- Thorough Testing - Combine multiple assertion types
- LLM Judge - Use for subjective quality checks
- CI/CD Integration - Run evaluations on every push
- Track Progress - Monitor performance over time
- Real Data - Include actual user queries
- Edge Cases - Test error scenarios too
- Consistency - Use the same criteria across agents
You're Ready to Test Like a Pro!
With evaluations, you can ship AI agents with confidence! Your agents will be tested, validated, and ready for real-world challenges. Happy testing!