Regression Testing Kit

Compare agent performance against baselines to catch quality regressions before deployment.

Settings

Regression threshold: 5%

Run History

Version   Date     Score   Regressions
v1.0.0    Mar 1    92%     -
v1.1.0    Mar 8    94%     -
v1.2.0    Mar 15   91%     2
v1.3.0    Mar 20   95%     -
v1.4.0    Mar 25   93%     1

Test Results vs v1.3.0
Run a regression test to compare against baseline
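
As a rough sketch of the comparison this panel describes, the TypeScript below flags any test whose score falls more than a threshold below its baseline score. The TestScore shape, the findRegressions name, and the reading of the threshold as percentage points are assumptions made for illustration; none of them are taken from agent-tools-kit.

```typescript
// Hypothetical shape for one test's score (0-100); not part of agent-tools-kit.
interface TestScore {
  name: string;
  score: number;
}

interface Regression {
  name: string;
  baseline: number;
  current: number;
  drop: number;
}

// Flag tests whose score dropped by more than `threshold` percentage points
// relative to the baseline run. Tests absent from the baseline are skipped.
function findRegressions(
  baseline: TestScore[],
  current: TestScore[],
  threshold: number,
): Regression[] {
  const baselineByName = new Map<string, number>();
  for (const t of baseline) baselineByName.set(t.name, t.score);

  const regressions: Regression[] = [];
  for (const t of current) {
    const base = baselineByName.get(t.name);
    if (base === undefined) continue;
    const drop = base - t.score;
    if (drop > threshold) {
      regressions.push({ name: t.name, baseline: base, current: t.score, drop });
    }
  }
  return regressions;
}

// Illustrative data only; these are not results from the run history.
const baselineRun: TestScore[] = [
  { name: "tool-selection", score: 96 },
  { name: "refusal-handling", score: 94 },
];
const currentRun: TestScore[] = [
  { name: "tool-selection", score: 88 },
  { name: "refusal-handling", score: 93 },
];

const flagged = findRegressions(baselineRun, currentRun, 5);
console.log(flagged.map((r) => r.name));
```

Comparing per test rather than on the aggregate score is one way a run could report regressions even when its overall score moves by less than the threshold.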

Integration Code

import { createRegressionSuite } from 'agent-tools-kit/evaluation'

const regression = createRegressionSuite({
  baseline: 'v1.3.0',
  threshold: 5,  // flag if score drops > 5%
  autoBlock: true,  // block deployment on regression
  storage: 'postgres',  // persist historical results
  notifications: {
    // `slack` is assumed to be a notification client defined elsewhere
    onRegression: (tests) => slack.alert(`Regressions: ${tests.map(t => t.name).join(', ')}`),
  },
})

// Run in CI/CD
const report = await regression.run(currentAgent)

if (report.hasRegressions) {
  process.exit(1)  // fail the CI pipeline
}

// Compare any two versions
const diff = await regression.compare('v1.2.0', 'v1.3.0')
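
The shape of the value returned by regression.compare is not shown here. As a minimal sketch of what a version-to-version comparison might compute, the TypeScript below derives the overall score change between two recorded runs. The RunSummary and VersionDiff types and the compareVersions helper are hypothetical; the two scores are the v1.2.0 and v1.3.0 entries from the Run History.

```typescript
// Hypothetical summary of one recorded run; not the library's own type.
interface RunSummary {
  version: string;
  score: number; // overall score in percent
}

interface VersionDiff {
  from: string;
  to: string;
  scoreDelta: number; // positive means `to` scored higher than `from`
}

// Look up two versions in the recorded history and report the score change.
function compareVersions(
  runs: RunSummary[],
  from: string,
  to: string,
): VersionDiff {
  const findRun = (version: string): RunSummary => {
    const run = runs.find((r) => r.version === version);
    if (!run) throw new Error(`No run recorded for ${version}`);
    return run;
  };
  return { from, to, scoreDelta: findRun(to).score - findRun(from).score };
}

// Scores for these two versions as listed in the Run History.
const runHistory: RunSummary[] = [
  { version: "v1.2.0", score: 91 },
  { version: "v1.3.0", score: 95 },
];

const versionDiff = compareVersions(runHistory, "v1.2.0", "v1.3.0");
console.log(versionDiff.scoreDelta); // 4
```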