Vizra.ai Documentation

🧪 Evaluations

Transform your AI agents from unpredictable to reliable! 🎯 Test, validate, and perfect your agents with automated quality assurance.

✨ Why Testing Matters

Just like you wouldn't ship code without tests, don't deploy AI agents without evaluations! Evaluations give you confidence that your agents will handle real-world scenarios gracefully, consistently, and professionally.

🎯 What Are Evaluations?

Think of evaluations as unit tests for your AI agents! They help you:

📊 CSV-Based Testing

Define test cases in simple CSV files - no complex setup required

🤖 Automated Validation

Run hundreds of tests automatically with built-in assertions

🧑‍⚖️ LLM-as-Judge

Use AI to evaluate subjective qualities like helpfulness

📈 Result Tracking

Export results to CSV for analysis and CI/CD integration

🚀 Creating Your First Evaluation

Step 1: Generate the Evaluation Class

Let's create an evaluation to test a customer support agent! Run this magical command:

Terminal
php artisan vizra:make:eval CustomerSupportEvaluation

✨ Double Magic! What Gets Created

This single command creates two files for you:

  • app/Evaluations/CustomerSupportEvaluation.php - Your evaluation class
  • app/Evaluations/data/customer_support_evaluation.csv - Empty CSV with headers ready for test data

No need to manually create the CSV file - it's all set up and ready for you to add test cases! 🎉

Boom! 💥 This creates your evaluation class in app/Evaluations/CustomerSupportEvaluation.php:

app/Evaluations/CustomerSupportEvaluation.php
<?php

namespace App\Evaluations;

use Vizra\VizraADK\Evaluations\BaseEvaluation;

class CustomerSupportEvaluation extends BaseEvaluation
{
    public string $name = 'customer_support_eval';

    public string $description = 'Evaluate customer support agent responses';

    public string $agentName = 'customer_support'; // Agent alias

    public string $csvPath = 'app/Evaluations/data/customer_support_evaluation.csv';

    public function preparePrompt(array $csvRowData): string
    {
        // Use the 'prompt' column from CSV by default
        return $csvRowData[$this->getPromptCsvColumn()] ?? '';
    }

    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        // Reset assertions for this row
        $this->resetAssertionResults();

        // Run assertions based on test type
        if (($csvRowData['test_type'] ?? '') === 'greeting') {
            $this->assertResponseContains($llmResponse, 'help');
            $this->assertResponseHasPositiveSentiment($llmResponse);
        }

        // Return evaluation results
        $allPassed = collect($this->assertionResults)
            ->every(fn($r) => $r['status'] === 'pass');

        return [
            'row_data' => $csvRowData,
            'llm_response' => $llmResponse,
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}

Step 2: Add Your Test Data

Now for the fun part - adding test scenarios! 🎨 The CSV file was automatically created with standard headers. Let's populate it with different customer interactions:

app/Evaluations/data/customer_support_evaluation.csv
prompt,expected_response,description
"Hello, I need help",help,"Greeting test - should offer assistance"
"Where is my order #12345?",order,"Order inquiry - should help track order"
"I want to return this product",return,"Return request - should explain process"
"This is terrible service!",sorry,"Complaint - should be empathetic"

💡 Pro Tip: Customize Your CSV Structure!

The auto-generated CSV starts with standard headers, but you can customize it for your needs:

  • prompt - The input to send to your agent (required)
  • expected_response - What you expect in the response
  • description - Human-readable test description
  • test_type - Add this to categorize tests for different assertion logic
  • context - Add background information for the test
  • Add any custom columns you need!

The command creates the basic structure - feel free to add more columns as your evaluation needs grow!
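As an illustration, here's a sketch of how the same CSV might look once you add the test_type, context, and expected_contains columns used by the advanced example later on this page - the column names beyond prompt and description are your own choice, not anything the generator requires:

```csv
prompt,expected_contains,test_type,context,description
"Hello, I need help",help,greeting,,"Greeting - should offer assistance"
"This is terrible service!",sorry,complaint,"Customer received a damaged item","Complaint - should apologize and de-escalate"
"How do I reset my password?",reset,technical,,"Technical question - should be clear and accurate"
```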

🧰 Your Assertion Toolbox

Vizra ADK provides a rich collection of assertions to validate every aspect of your agent's responses!

๐Ÿ“ Content Assertions

// Check if response contains text
$this->assertResponseContains($llmResponse, 'expected text');
$this->assertResponseDoesNotContain($llmResponse, 'unwanted');

// Pattern matching
$this->assertResponseMatchesRegex($llmResponse, '/pattern/');

// Position checks
$this->assertResponseStartsWith($llmResponse, 'Hello');
$this->assertResponseEndsWith($llmResponse, '.');

// Multiple checks
$this->assertContainsAnyOf($llmResponse, ['yes', 'sure', 'okay']);
$this->assertContainsAllOf($llmResponse, ['thank', 'you']);

๐Ÿ“ Length & Structure

// Response length
$this->assertResponseLengthBetween($llmResponse, 50, 500);

// Word count
$this->assertWordCountBetween($llmResponse, 10, 100);

// Format validation
$this->assertResponseIsValidJson($llmResponse);
$this->assertJsonHasKey($llmResponse, 'result');
$this->assertResponseIsValidXml($llmResponse);

✨ Quality Checks

// Sentiment analysis
$this->assertResponseHasPositiveSentiment($llmResponse);

// Writing quality
$this->assertGrammarCorrect($llmResponse);
$this->assertReadabilityLevel($llmResponse, 12);
$this->assertNoRepetition($llmResponse, 0.3);

🛡️ Safety & Security

// Content safety
$this->assertNotToxic($llmResponse);

// Privacy protection
$this->assertNoPII($llmResponse);

// General safety
$this->assertResponseIsNotEmpty($llmResponse);

🧑‍⚖️ LLM as Judge - The Ultimate Quality Check

Sometimes you need another AI to evaluate subjective qualities. That's where LLM-as-Judge comes in! 🎭

🤔 When to Use LLM Judge?

Perfect for evaluating:

  • Helpfulness and professionalism
  • Empathy and emotional intelligence
  • Creativity and originality
  • Accuracy of complex responses
  • Overall response quality

Using LLM Judge Assertions

✨ New Fluent Judge Interface!

We've introduced a cleaner, more intuitive syntax for judge assertions:

app/Evaluations/CustomerSupportEvaluation.php
public function evaluateRow(array $csvRowData, string $llmResponse): array
{
    $this->resetAssertionResults();

    // Simple pass/fail evaluation
    $this->judge($llmResponse)
        ->using(PassFailJudgeAgent::class)
        ->expectPass();

    // Quality score evaluation
    $this->judge($llmResponse)
        ->using(QualityJudgeAgent::class)
        ->expectMinimumScore(7.5);

    // Multi-dimensional evaluation
    $this->judge($llmResponse)
        ->using(ComprehensiveJudgeAgent::class)
        ->expectMinimumScore([
            'accuracy' => 8,
            'helpfulness' => 7,
            'clarity' => 7
        ]);

    // Return results...
}

🎯 Three Judge Patterns

1. Pass/Fail Judge

For binary decisions - returns {"pass": true/false, "reasoning": "..."}

$this->judge($response)
    ->using(PassFailJudgeAgent::class)
    ->expectPass();

2. Quality Score Judge

For numeric ratings - returns {"score": 8.5, "reasoning": "..."}

$this->judge($response)
    ->using(QualityJudgeAgent::class)
    ->expectMinimumScore(7.0);

3. Comprehensive Judge

For multi-dimensional evaluation - returns {"scores": {...}, "reasoning": "..."}

$this->judge($response)
    ->using(ComprehensiveJudgeAgent::class)
    ->expectMinimumScore([
        'accuracy' => 8,
        'helpfulness' => 7,
        'clarity' => 7
    ]);

๐Ÿƒโ€โ™‚๏ธ Running Your Evaluations

Time to put your agent to the test! Let's see how it performs! 🎬

Running from CLI

Terminal
# Run evaluation by class name
php artisan vizra:run:eval CustomerSupportEvaluation

# Save results to CSV for analysis
php artisan vizra:run:eval CustomerSupportEvaluation --output=results.csv

# Results are saved to storage/app/evaluations/ by default

What You'll See

Watch the magic happen with a beautiful progress bar and detailed results! ✨

Console Output
Running evaluation: customer_support_eval
Description: Evaluate customer support agent responses
Processing 4 rows from CSV using agent 'customer_support'...
████████████████████████████████████████ 4/4
Evaluation processing complete.

┌─────┬──────────────┬──────────────────────────┬─────────────────┬───────┐
│ Row │ Final Status │ LLM Response Summary     │ Assertions Count│ Error │
├─────┼──────────────┼──────────────────────────┼─────────────────┼───────┤
│ 1   │ ✅ pass      │ Hello! I'd be happy to...│ 2               │       │
│ 2   │ ✅ pass      │ I can help you track...  │ 1               │       │
│ 3   │ ❌ fail      │ Sure, let me assist...   │ 2               │       │
│ 4   │ ✅ pass      │ I understand your...     │ 3               │       │
└─────┴──────────────┴──────────────────────────┴─────────────────┴───────┘

Summary: Total Rows: 4, Passed: 3 (75%), Failed: 1 (25%), Errors: 0

🎓 Advanced Example - Putting It All Together

Ready for the full experience? Here's a complete evaluation implementation that showcases all the techniques! 💪

app/Evaluations/CustomerSupportEvaluation.php
<?php

namespace App\Evaluations;

use Vizra\VizraADK\Evaluations\BaseEvaluation;

class CustomerSupportEvaluation extends BaseEvaluation
{
    public string $name = 'customer_support_eval';
    public string $description = 'Comprehensive customer support evaluation';
    public string $agentName = 'customer_support';
    public string $csvPath = 'app/Evaluations/data/customer_support_evaluation.csv';

    public function preparePrompt(array $csvRowData): string
    {
        // Get the base prompt
        $prompt = $csvRowData[$this->getPromptCsvColumn()] ?? '';

        // Add context if available
        if (isset($csvRowData['context'])) {
            $prompt = "Context: " . $csvRowData['context'] . "\n\n" . $prompt;
        }

        return $prompt;
    }

    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        $this->resetAssertionResults();

        // Basic content checks
        if (isset($csvRowData['expected_contains'])) {
            $this->assertResponseContains(
                $llmResponse,
                $csvRowData['expected_contains']
            );
        }

        // Test type specific assertions
        switch ($csvRowData['test_type'] ?? '') {
            case 'greeting':
                $this->assertResponseHasPositiveSentiment($llmResponse);
                $this->assertWordCountBetween($llmResponse, 10, 50);
                break;

            case 'complaint':
                $this->assertResponseContains($llmResponse, 'sorry');
                $this->assertNotToxic($llmResponse);
                $this->assertLlmJudge(
                    $llmResponse,
                    'Is this response empathetic and de-escalating?',
                    'llm_judge',
                    'pass'
                );
                break;

            case 'technical':
                $this->assertReadabilityLevel($llmResponse, 12);
                $this->assertGrammarCorrect($llmResponse);
                break;
        }

        // General quality checks
        $this->assertResponseIsNotEmpty($llmResponse);
        $this->assertNoPII($llmResponse);

        // Determine final status
        $allPassed = collect($this->assertionResults)
            ->every(fn($r) => $r['status'] === 'pass');

        return [
            'row_data' => $csvRowData,
            'llm_response' => $llmResponse,
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}

📊 Analyzing Your Results

CSV Output Structure

When you export results with --output, you get a comprehensive CSV report! 📈

CSV Columns Explained:

  • 📌 Evaluation Name - The name of your evaluation
  • 📌 Row Index - Which test case from your CSV
  • 📌 Final Status - pass ✅, fail ❌, or error ⚠️
  • 📌 LLM Response - What your agent actually said
  • 📌 Assertions (JSON) - Detailed results of each check
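Because the export is plain CSV, a quick shell check can turn it into a CI gate - the kind of pass/fail logic a custom check command would run. This is a sketch only: the sample rows are made up, and the column order is assumed to follow the list above (final status in the third column).

```shell
# Sample export mirroring the columns above (made-up data, no embedded commas)
cat > /tmp/results.csv <<'CSV'
evaluation_name,row_index,final_status,llm_response,assertions
customer_support_eval,1,pass,Hello! I would be happy to help,[]
customer_support_eval,2,fail,Sure let me assist,[]
customer_support_eval,3,pass,I can help you track that,[]
CSV

# Count failed rows (third column) and the total number of data rows
fails=$(awk -F',' 'NR > 1 && $3 == "fail" { n++ } END { print n + 0 }' /tmp/results.csv)
total=$(awk 'END { print NR - 1 }' /tmp/results.csv)
echo "Failed: ${fails} of ${total}"

# In a pipeline you would then gate the build: [ "${fails}" -eq 0 ] || exit 1
```

For real exports with quoted, comma-containing responses, parse with a proper CSV reader instead of awk's naive comma split.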

🎨 Creating Custom Assertions

Need something specific? Create your own reusable assertion classes! 🚀

Simple Example: Product Name Assertion

Let's create a simple assertion that checks if a product name is mentioned:

app/Evaluations/Assertions/ContainsProductAssertion.php
<?php

namespace App\Evaluations\Assertions;

use Vizra\VizraADK\Evaluations\Assertions\BaseAssertion;

class ContainsProductAssertion extends BaseAssertion
{
    public function assert(string $response, ...$params): array
    {
        $productName = $params[0] ?? '';

        if (empty($productName)) {
            return $this->result(false, 'Product name parameter is required');
        }

        $contains = stripos($response, $productName) !== false;

        return $this->result(
            $contains,
            "Response should mention the product '{$productName}'",
            "contains '{$productName}'",
            $contains ? "found '{$productName}'" : "product not mentioned"
        );
    }
}

Using Your Custom Assertion

app/Evaluations/ProductReviewEvaluation.php
use App\Evaluations\Assertions\ContainsProductAssertion;

class ProductReviewEvaluation extends BaseEvaluation
{
    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        $this->resetAssertionResults();

        // Use your custom assertion
        $this->assertCustom(ContainsProductAssertion::class, $llmResponse, 'MacBook Pro');

        // Mix with built-in assertions
        $this->assertWordCountBetween($llmResponse, 50, 200);

        // Determine final status
        $allPassed = collect($this->assertionResults)
            ->every(fn($r) => $r['status'] === 'pass');

        return [
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}

💡 Pro Tip: CSV-Driven Custom Assertions!

You can even specify custom assertions in your CSV files:

prompt,assertion_class,assertion_params
"Tell me about the new iPhone",ContainsProductAssertion,"[\"iPhone\"]"
"Describe the MacBook features",ContainsProductAssertion,"[\"MacBook\"]"

Then use them dynamically in your evaluation:

if (isset($csvRowData['assertion_class'])) {
    $params = json_decode($csvRowData['assertion_params'] ?? '[]', true);
    $this->assertCustom($csvRowData['assertion_class'], $llmResponse, ...$params);
}

Generate Assertion Classes with Artisan

Creating new assertions is super easy with our generator command! ⚡

Terminal
php artisan vizra:make:assertion EmailValidationAssertion

This creates a ready-to-use assertion class with helpful boilerplate!

Built-in Custom Assertions

Vizra ADK comes with several ready-to-use custom assertions:

📦 ContainsProductAssertion

Check if a product name is mentioned

📄 JsonSchemaAssertion

Validate JSON structure against a schema

💰 PriceFormatAssertion

Verify price formatting in any currency

📧 EmailFormatAssertion

Check for valid email addresses

🚀 CI/CD Integration

Make testing automatic! Here's how to add evaluations to your CI/CD pipeline! 🔄

.github/workflows/evaluate.yml
# Evaluate agents on every push
name: Evaluate Agents
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup PHP & Dependencies
        uses: shivammathur/setup-php@v2
        with:
          php-version: '8.2'

      - name: Install Dependencies
        run: composer install

      - name: Run Evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          php artisan vizra:run:eval CustomerSupportEvaluation --output=results.csv

      - name: Check Results
        run: |
          # Add your own pass/fail logic based on CSV results
          php artisan app:check-eval-results storage/app/evaluations/results.csv

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: storage/app/evaluations/

๐Ÿ† Best Practices for Awesome Evaluations

📋 CSV Organization - Use clear test types and descriptive columns

🔍 Thorough Testing - Combine multiple assertion types

🤖 LLM Judge - Use for subjective quality checks

🔄 CI/CD Integration - Run evaluations on every push

📈 Track Progress - Monitor performance over time

👥 Real Data - Include actual user queries

⚠️ Edge Cases - Test error scenarios too

🎯 Consistency - Use the same criteria across agents

🎉 You're Ready to Test Like a Pro!

With evaluations, you can ship AI agents with confidence! Your agents will be tested, validated, and ready for real-world challenges. Happy testing! 🚀
