Regression Testing Kit
Compare agent performance against baselines to catch quality regressions before deployment.
Settings
Regression threshold: 5%
Run History
Version  Date    Score  Regressions
v1.0.0   Mar 1   92%
v1.1.0   Mar 8   94%
v1.2.0   Mar 15  91%    2
v1.3.0   Mar 20  95%
v1.4.0   Mar 25  93%    1
Test Results vs v1.3.0
Run a regression test to compare against baseline
Integration Code
import { createRegressionSuite } from 'agent-tools-kit/evaluation'

const regression = createRegressionSuite({
  baseline: 'v1.3.0',
  threshold: 5,        // flag if score drops > 5%
  autoBlock: true,     // block deployment on regression
  storage: 'postgres', // persist historical results
  notifications: {
    // `slack` is assumed to be a notification client defined elsewhere
    onRegression: (tests) => slack.alert(`Regressions: ${tests.map(t => t.name).join(', ')}`),
  },
})
// Run in CI/CD
// `currentAgent` is the agent build under test, defined elsewhere
const report = await regression.run(currentAgent)
if (report.hasRegressions) {
  process.exit(1) // fail the CI pipeline
}
// Compare any two versions
const diff = await regression.compare('v1.2.0', 'v1.3.0')
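
To make the threshold setting concrete, here is a minimal, self-contained sketch of how a per-test regression check could work. It is not the agent-tools-kit implementation: the `Scores` and `Regression` types, the `findRegressions` helper, and the sample test names and scores are all hypothetical. It also assumes the threshold is measured in percentage points dropped relative to the baseline, which the snippet above does not confirm.

```typescript
// Hypothetical types for illustration; not part of agent-tools-kit.
type Scores = Record<string, number>; // test name -> score in percent (0-100)

interface Regression {
  name: string;
  baseline: number;
  current: number;
  drop: number; // percentage points lost relative to the baseline
}

// Return every test whose score fell by more than `threshold` percentage points.
function findRegressions(baseline: Scores, current: Scores, threshold: number): Regression[] {
  const regressions: Regression[] = [];
  for (const [name, baseScore] of Object.entries(baseline)) {
    const currentScore = current[name];
    if (currentScore === undefined) continue; // test absent from the current run: skipped in this sketch
    const drop = baseScore - currentScore;
    if (drop > threshold) {
      regressions.push({ name, baseline: baseScore, current: currentScore, drop });
    }
  }
  return regressions;
}

// Made-up sample scores, used only to exercise the function.
const baselineScores: Scores = { "tool-selection": 96, "multi-step-plan": 90, "refusal": 99 };
const currentScores: Scores = { "tool-selection": 95, "multi-step-plan": 82, "refusal": 99 };

const found = findRegressions(baselineScores, currentScores, 5);
console.log(found.map((r) => `${r.name}: -${r.drop}`)); // [ 'multi-step-plan: -8' ]
```

Comparing per test rather than on the overall score lets a run be flagged even when the aggregate moves by less than the threshold, which would be consistent with the run history listing regressions for versions whose overall score dropped by fewer than 5 points.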