Agent Evaluation Kit

End-to-end agent benchmarking with predefined test suites, scoring, and regression detection.

Test Suites

Suite                          | Description                                                   | Category    | Difficulty
Follow multi-step instructions | Given a 5-step task, verify each step completed correctly     | instruction | medium
Factual question answering     | Answer 20 factual questions, verify against ground truth      | knowledge   | easy
Code generation accuracy       | Generate 10 functions, run unit tests against each            | coding      | hard
Summarization quality          | Summarize 5 articles, score for accuracy and conciseness      | writing     | medium
Tool selection accuracy        | Given 15 scenarios, verify the correct tool was chosen        | tooling     | medium
Multi-turn conversation        | Maintain context over a 10-turn conversation                  | dialogue    | hard
Error recovery                 | Handle 5 error scenarios gracefully                           | resilience  | hard
Safety guardrails              | Reject 10 adversarial prompts while helping with 10 safe ones | safety      | medium
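
Each suite ultimately reduces to per-case pass/fail results plus per-dimension scores. As a minimal sketch of how such results might aggregate into a pass rate and dimension averages (the `CaseResult` shape and helper functions here are hypothetical illustrations, not the kit's actual API):

```typescript
// Hypothetical result shape -- not the kit's real API.
interface CaseResult {
  suite: string;
  passed: boolean;
  scores: Record<string, number>; // e.g. { accuracy: 0.9, safety: 1.0 }
}

// Fraction of cases that passed.
function passRate(results: CaseResult[]): number {
  if (results.length === 0) return 0;
  return results.filter(r => r.passed).length / results.length;
}

// Mean score for one dimension across the cases that report it.
function avgDimension(results: CaseResult[], dim: string): number {
  const vals = results
    .map(r => r.scores[dim])
    .filter(v => v !== undefined);
  if (vals.length === 0) return 0;
  return vals.reduce((a, b) => a + b, 0) / vals.length;
}
```

The same aggregation works per suite or across all suites, depending on how the results are grouped before calling the helpers.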

Integration Code

import { createEvalSuite } from 'agent-tools-kit/evaluation'

// Note: 'eval' is a reserved word in strict-mode JavaScript/TypeScript,
// so the suite is bound to a different name.
const suite = createEvalSuite({
  name: 'General Capability',
  agent: myAgent,
  scoring: {
    dimensions: ['accuracy', 'completeness', 'safety'],
    judge: 'gpt-4o',  // LLM-as-judge
  },
  parallel: true,  // run test cases concurrently
  verbose: true,
})

const results = await suite.run()

console.log('Pass rate:', results.passRate)
console.log('Avg score:', results.avgScore)
console.log('Regressions:', results.regressions)
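
Regression detection presumably works by comparing the current run's scores against a stored baseline run. A minimal sketch under that assumption (the function name, score shapes, and tolerance are illustrative, not the kit's API):

```typescript
// Hypothetical: flag suites whose score dropped more than `tolerance`
// below the score recorded in a baseline run.
function findRegressions(
  baseline: Record<string, number>,
  current: Record<string, number>,
  tolerance = 0.05,
): string[] {
  return Object.keys(baseline).filter(
    suite => suite in current && current[suite] < baseline[suite] - tolerance,
  );
}
```

A small tolerance avoids flagging runs whose scores fluctuate only from LLM-judge noise; suites absent from the current run are skipped rather than reported.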