# Agent Evaluation Kit

End-to-end agent benchmarking with predefined test suites, scoring, and regression detection.
## Test Results

| Test | Category | Difficulty | Description |
|---|---|---|---|
| Follow multi-step instructions | instruction | medium | Given a 5-step task, verify each step completed correctly |
| Factual question answering | knowledge | easy | Answer 20 factual questions, verify against ground truth |
| Code generation accuracy | coding | hard | Generate 10 functions, run unit tests against each |
| Summarization quality | writing | medium | Summarize 5 articles, score for accuracy and conciseness |
| Tool selection accuracy | tooling | medium | Given 15 scenarios, verify the correct tool was chosen |
| Multi-turn conversation | dialogue | hard | Maintain context over a 10-turn conversation |
| Error recovery | resilience | hard | Handle 5 error scenarios gracefully |
| Safety guardrails | safety | medium | Reject 10 adversarial prompts while helping with 10 safe ones |
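As a sketch of how a ground-truth suite such as the factual question answering tests might score answers: normalize both strings, then count exact matches. The `normalize`, `scoreFactual`, and `passRate` helpers below are hypothetical illustrations, not part of the kit's API.

```typescript
// Hypothetical scorer for a ground-truth QA suite: normalize both
// strings, then score 1 for an exact match and 0 otherwise.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/[.,!?]+$/, '');
}

function scoreFactual(answer: string, groundTruth: string): number {
  return normalize(answer) === normalize(groundTruth) ? 1 : 0;
}

// Aggregate pass rate over a batch of (answer, groundTruth) pairs.
function passRate(pairs: Array<[string, string]>): number {
  const passed = pairs.filter(([a, t]) => scoreFactual(a, t) === 1).length;
  return passed / pairs.length;
}
```

For example, `passRate([['Paris.', 'paris'], ['1912', '1905']])` returns `0.5`: the first answer matches after normalization, the second does not.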
## Integration Code

```typescript
import { createEvalSuite } from 'agent-tools-kit/evaluation'

// Note: `eval` cannot be used as a binding name in modules (strict mode),
// so the suite is named `suite` instead.
const suite = createEvalSuite({
  name: 'General Capability',
  agent: myAgent,
  scoring: {
    dimensions: ['accuracy', 'completeness', 'safety'],
    judge: 'gpt-4o', // LLM-as-judge
  },
  parallel: true,
  verbose: true,
})

const results = await suite.run()
console.log('Pass rate:', results.passRate)
console.log('Avg score:', results.avgScore)
console.log('Regressions:', results.regressions)
```
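For regression detection in CI, one common pattern is to compare a run's aggregate metrics against a stored baseline and fail the build if they drop beyond a tolerance. A minimal sketch follows; the `Baseline` shape, `hasRegressed` helper, and tolerance value are assumptions for illustration, not part of the kit's API.

```typescript
// Hypothetical baseline shape persisted from a previous passing run.
interface Baseline {
  avgScore: number;
  passRate: number;
}

// Hypothetical CI gate: returns true if either metric dropped more
// than `tolerance` below the stored baseline.
function hasRegressed(
  current: Baseline,
  baseline: Baseline,
  tolerance = 0.02,
): boolean {
  return (
    baseline.avgScore - current.avgScore > tolerance ||
    baseline.passRate - current.passRate > tolerance
  );
}
```

In a CI step, the results from `suite.run()` could be fed through such a gate, with a non-zero exit code when `hasRegressed` returns true.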