🧪 Evaluations

Transform your AI agents from unpredictable to production-ready! 🎯 Test, validate, and perfect your agents with automated quality assurance.

✨ Why Testing Matters

Just like you wouldn't ship code without tests, don't deploy AI agents without evaluations! Evaluations give you confidence that your agents will handle real-world scenarios gracefully, consistently, and professionally.

🎯 What Are Evaluations?

Think of evaluations as unit tests for your AI agents! They help you:

📊 CSV-Based Testing

Define test cases in simple CSV files - no complex setup required!

🤖 Automated Validation

Run hundreds of tests automatically with built-in assertions

🧑‍⚖️ LLM-as-Judge

Use AI to evaluate subjective qualities like helpfulness

📈 Result Tracking

Export results to CSV for analysis and CI/CD integration

🚀 Creating Your First Evaluation

Step 1: Generate the Evaluation Class

Let's create an evaluation to test a customer support agent! Run this magical command:

Terminal
php artisan vizra:make:evaluation CustomerSupportEvaluation

Boom! 💥 This creates a new evaluation in app/Evaluations/CustomerSupportEvaluation.php:

app/Evaluations/CustomerSupportEvaluation.php
<?php

namespace App\Evaluations;

use Vizra\VizraADK\Evaluations\BaseEvaluation;

class CustomerSupportEvaluation extends BaseEvaluation
{
    public string $name = 'customer_support_eval';

    public string $description = 'Evaluate customer support agent responses';

    public string $agentName = 'customer_support'; // Agent alias

    public string $csvPath = 'app/Evaluations/data/customer_support_tests.csv';

    public function preparePrompt(array $csvRowData): string
    {
        // Use the 'prompt' column from CSV by default
        return $csvRowData[$this->getPromptCsvColumn()] ?? '';
    }

    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        // Reset assertions for this row
        $this->resetAssertionResults();

        // Run assertions based on test type
        if (($csvRowData['test_type'] ?? '') === 'greeting') {
            $this->assertResponseContains($llmResponse, 'help');
            $this->assertResponseHasPositiveSentiment($llmResponse);
        }

        // Return evaluation results
        $allPassed = collect($this->assertionResults)
            ->every(fn($r) => $r['status'] === 'pass');

        return [
            'row_data' => $csvRowData,
            'llm_response' => $llmResponse,
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}

Step 2: Create Your Test Data

Now for the fun part - creating test scenarios! 🎨 Let's make a CSV file with different customer interactions:

app/Evaluations/data/customer_support_tests.csv
prompt,test_type,expected_contains,expected_sentiment
"Hello, I need help",greeting,help,positive
"Where is my order #12345?",order_inquiry,order,neutral
"I want to return this product",return_request,return,positive
"This is terrible service!",complaint,sorry,empathetic

💡 Pro Tip: Structure Your CSV Wisely!

Each column in your CSV can be used for different purposes:

  • prompt - The input to send to your agent
  • test_type - Categorize tests for different assertion logic
  • expected_* - What you expect in the response
  • Add any custom columns you need - see the sketch just below for one way to consume them!
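
For instance, a custom column can drive its own assertion logic inside evaluateRow(). Here's a minimal sketch, assuming the expected_sentiment column from the CSV above (the mapping shown is an illustration, not a built-in feature):

app/Evaluations/CustomerSupportEvaluation.php
public function evaluateRow(array $csvRowData, string $llmResponse): array
{
    $this->resetAssertionResults();

    // Illustrative mapping from the custom 'expected_sentiment' column
    // to a built-in assertion - adapt this to your own columns.
    if (($csvRowData['expected_sentiment'] ?? '') === 'positive') {
        $this->assertResponseHasPositiveSentiment($llmResponse);
    }

    // ... run any remaining assertions, then build the result array
    // exactly as in the generated class above.
    $allPassed = collect($this->assertionResults)
        ->every(fn($r) => $r['status'] === 'pass');

    return [
        'row_data' => $csvRowData,
        'llm_response' => $llmResponse,
        'assertions' => $this->assertionResults,
        'final_status' => $allPassed ? 'pass' : 'fail',
    ];
}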

🧰 Your Assertion Toolbox

Vizra ADK provides a rich collection of assertions to validate every aspect of your agent's responses!

๐Ÿ“ Content Assertions

// Check if response contains text
$this->assertResponseContains($response, 'expected text');
$this->assertResponseDoesNotContain($response, 'unwanted');

// Pattern matching
$this->assertResponseMatchesRegex($response, '/pattern/');

// Position checks
$this->assertResponseStartsWith($response, 'Hello');
$this->assertResponseEndsWith($response, '.');

// Multiple checks
$this->assertContainsAnyOf($response, ['yes', 'sure', 'okay']);
$this->assertContainsAllOf($response, ['thank', 'you']);

๐Ÿ“ Length & Structure

// Response length
$this->assertResponseLengthBetween($response, 50, 500);

// Word count
$this->assertWordCountBetween($response, 10, 100);

// Format validation
$this->assertResponseIsValidJson($response);
$this->assertJsonHasKey($response, 'result');
$this->assertResponseIsValidXml($response);

✨ Quality Checks

// Sentiment analysis
$this->assertResponseHasPositiveSentiment($response);

// Writing quality
$this->assertGrammarCorrect($response);
$this->assertReadabilityLevel($response, 12);
$this->assertNoRepetition($response, 0.3);

🛡️ Safety & Security

// Content safety
$this->assertNotToxic($response);

// Privacy protection
$this->assertNoPII($response);

// General safety
$this->assertResponseIsNotEmpty($response);

🧑‍⚖️ LLM as Judge - The Ultimate Quality Check

Sometimes you need another AI to evaluate subjective qualities. That's where LLM-as-Judge comes in! 🎭

🤔 When to Use an LLM Judge?

Perfect for evaluating:

  • Helpfulness and professionalism
  • Empathy and emotional intelligence
  • Creativity and originality
  • Accuracy of complex responses
  • Overall response quality

Using LLM Judge Assertions

app/Evaluations/CustomerSupportEvaluation.php
public function evaluateRow(array $csvRowData, string $llmResponse): array
{
    $this->resetAssertionResults();

    // Use LLM judge for subjective evaluation
    $this->assertLlmJudge(
        $llmResponse,
        'Is this response helpful and professional?',
        'llm_judge', // agent name
        'pass',       // expected outcome
        'Response should be helpful and professional'
    );

    // Quality scoring
    $this->assertLlmJudgeQuality(
        $llmResponse,
        'Rate the clarity and completeness of this response',
        7, // minimum score out of 10
        'llm_judge'
    );

    // Compare responses
    $referenceResponse = $csvRowData['reference_response'] ?? '';
    if ($referenceResponse) {
        $this->assertLlmJudgeComparison(
            $llmResponse,
            $referenceResponse,
            'Which response is more helpful?',
            'actual' // expect actual to win
        );
    }

    // Return results...
}

Setting Up Your Judge Agent

The LLM judge needs to be registered first. Here's how to create your own expert evaluator! 👨‍⚖️

app/Providers/AppServiceProvider.php
// Register the LLM judge agent
// (assumes the ADK's Agent facade is imported at the top of the file)
public function boot(): void
{
    // Option 1: Use a dedicated judge agent class
    Agent::build(LlmJudgeAgent::class)->register();

    // Option 2: Create an ad-hoc judge agent (easier!)
    Agent::define('llm_judge')
        ->description('Expert evaluator for judging responses')
        ->instructions('You are an expert evaluator...')
        ->model('gpt-4')
        ->temperature(0.3) // Lower temperature for consistency
        ->register();
}

๐Ÿƒโ€โ™‚๏ธ Running Your Evaluations

Time to put your agent to the test! Let's see how it performs! 🎬

Running from CLI

Terminal
# Run evaluation by class name
php artisan vizra:run:eval CustomerSupportEvaluation

# Save results to CSV for analysis
php artisan vizra:run:eval CustomerSupportEvaluation --output=results.csv

# Results are saved to storage/app/evaluations/ by default

What You'll See

Watch the magic happen with a beautiful progress bar and detailed results! ✨

Console Output
Running evaluation: customer_support_eval
Description: Evaluate customer support agent responses
Processing 4 rows from CSV using agent 'customer_support'...
████████████████████████████████████████ 4/4
Evaluation processing complete.

┌─────┬──────────────┬───────────────────────────┬──────────────────┬───────┐
│ Row │ Final Status │ LLM Response Summary      │ Assertions Count │ Error │
├─────┼──────────────┼───────────────────────────┼──────────────────┼───────┤
│ 1   │ ✅ pass      │ Hello! I'd be happy to... │ 2                │       │
│ 2   │ ✅ pass      │ I can help you track...   │ 1                │       │
│ 3   │ ❌ fail      │ Sure, let me assist...    │ 2                │       │
│ 4   │ ✅ pass      │ I understand your...      │ 3                │       │
└─────┴──────────────┴───────────────────────────┴──────────────────┴───────┘

Summary: Total Rows: 4, Passed: 3 (75%), Failed: 1 (25%), Errors: 0

🎓 Advanced Example - Putting It All Together

Ready for the full experience? Here's a complete evaluation implementation that showcases all the techniques! 💪

app/Evaluations/CustomerSupportEvaluation.php
<?php

namespace App\Evaluations;

use Vizra\VizraADK\Evaluations\BaseEvaluation;

class CustomerSupportEvaluation extends BaseEvaluation
{
    public string $name = 'customer_support_eval';
    public string $description = 'Comprehensive customer support evaluation';
    public string $agentName = 'customer_support';
    public string $csvPath = 'app/Evaluations/data/customer_support_tests.csv';

    public function preparePrompt(array $csvRowData): string
    {
        // Get the base prompt
        $prompt = $csvRowData[$this->getPromptCsvColumn()] ?? '';

        // Add context if available
        if (isset($csvRowData['context'])) {
            $prompt = "Context: " . $csvRowData['context'] . "\n\n" . $prompt;
        }

        return $prompt;
    }

    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        $this->resetAssertionResults();

        // Basic content checks
        if (isset($csvRowData['expected_contains'])) {
            $this->assertResponseContains(
                $llmResponse,
                $csvRowData['expected_contains']
            );
        }

        // Test type specific assertions
        switch ($csvRowData['test_type'] ?? '') {
            case 'greeting':
                $this->assertResponseHasPositiveSentiment($llmResponse);
                $this->assertWordCountBetween($llmResponse, 10, 50);
                break;

            case 'complaint':
                $this->assertResponseContains($llmResponse, 'sorry');
                $this->assertNotToxic($llmResponse);
                $this->assertLlmJudge(
                    $llmResponse,
                    'Is this response empathetic and de-escalating?',
                    'llm_judge',
                    'pass'
                );
                break;

            case 'technical':
                $this->assertReadabilityLevel($llmResponse, 12);
                $this->assertGrammarCorrect($llmResponse);
                break;
        }

        // General quality checks
        $this->assertResponseIsNotEmpty($llmResponse);
        $this->assertNoPII($llmResponse);

        // Determine final status
        $allPassed = collect($this->assertionResults)
            ->every(fn($r) => $r['status'] === 'pass');

        return [
            'row_data' => $csvRowData,
            'llm_response' => $llmResponse,
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}

📊 Analyzing Your Results

CSV Output Structure

When you export results with --output, you get a comprehensive CSV report! 📈

CSV Columns Explained:

  • 📌 Evaluation Name - The name of your evaluation
  • 📌 Row Index - Which test case from your CSV
  • 📌 Final Status - pass ✅, fail ❌, or error ⚠️
  • 📌 LLM Response - What your agent actually said
  • 📌 Assertions (JSON) - Detailed results of each check (see the parsing sketch below)
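
Because the Assertions column stores JSON, you can drill into individual checks with a few lines of plain PHP. Here's a minimal sketch in a hypothetical standalone script, assuming the column names listed above match your export (adjust them if they differ):

analyze_results.php
<?php

// Hypothetical post-processing script: list every failed assertion
// across all rows of an exported results CSV.
$handle = fopen('storage/app/evaluations/results.csv', 'r');
$header = fgetcsv($handle);

// Column names are assumptions based on the layout described above.
$rowIndex = array_search('Row Index', $header);
$assertionsIndex = array_search('Assertions (JSON)', $header);

while (($row = fgetcsv($handle)) !== false) {
    $assertions = json_decode($row[$assertionsIndex] ?? '[]', true) ?: [];
    foreach ($assertions as $assertion) {
        // Each assertion result carries a 'status' of 'pass' or 'fail',
        // matching the structure built in evaluateRow() above.
        if (($assertion['status'] ?? '') !== 'pass') {
            echo "Row {$row[$rowIndex]}: " . json_encode($assertion) . PHP_EOL;
        }
    }
}

fclose($handle);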

🚀 CI/CD Integration

Make testing automatic! Here's how to add evaluations to your CI/CD pipeline! 🔄

.github/workflows/evaluate.yml
# Evaluate agents on every push
name: Evaluate Agents
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup PHP & Dependencies
        uses: shivammathur/setup-php@v2
        with:
          php-version: '8.2'

      - name: Install Dependencies
        run: composer install

      - name: Run Evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          php artisan vizra:run:eval CustomerSupportEvaluation --output=results.csv

      - name: Check Results
        run: |
          # Add your own pass/fail logic based on CSV results (see the sketch below)
          php artisan app:check-eval-results storage/app/evaluations/results.csv

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: storage/app/evaluations/
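
Note that app:check-eval-results is your own command, not something the ADK ships. Here's a minimal sketch of what it might look like (the 'Final Status' column name is an assumption - match it to your actual export):

app/Console/Commands/CheckEvalResults.php
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;

// Hypothetical CI gate: exit non-zero when any evaluation row failed,
// so the workflow step above fails the build.
class CheckEvalResults extends Command
{
    protected $signature = 'app:check-eval-results {path : Path to the results CSV}';

    protected $description = 'Fail when an evaluation results CSV contains non-passing rows';

    public function handle(): int
    {
        $handle = fopen($this->argument('path'), 'r');
        $header = fgetcsv($handle);

        // 'Final Status' is an assumed column name from the export format above.
        $statusIndex = array_search('Final Status', $header);

        $failures = 0;
        while (($row = fgetcsv($handle)) !== false) {
            if (strtolower($row[$statusIndex] ?? '') !== 'pass') {
                $failures++;
            }
        }
        fclose($handle);

        if ($failures > 0) {
            $this->error("{$failures} evaluation row(s) did not pass.");
            return self::FAILURE;
        }

        $this->info('All evaluation rows passed.');
        return self::SUCCESS;
    }
}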

๐Ÿ† Best Practices for Awesome Evaluations

  • 📋 CSV Organization - Use clear test types and descriptive columns
  • 🔍 Thorough Testing - Combine multiple assertion types
  • 🤖 LLM Judge - Use for subjective quality checks
  • 🔄 CI/CD Integration - Run evaluations on every push
  • 📈 Track Progress - Monitor performance over time
  • 👥 Real Data - Include actual user queries
  • ⚠️ Edge Cases - Test error scenarios too
  • 🎯 Consistency - Use the same criteria across agents

🎉 You're Ready to Test Like a Pro!

With evaluations, you can ship AI agents with confidence! Your agents will be tested, validated, and ready for real-world challenges. Happy testing! 🚀