Testing Methodology

How to design effective chaos tests and read results.

Start simple

Begin with one fault at a time. A single http_error fault with status code 500 on an always schedule tells you whether the agent retries, crashes, or gives up.

scenarios:
  - name: llm-always-500
    summary: Every LLM call fails
    target: llm_chat
    fault:
      kind: http_error
      status_code: 500
    schedule:
      mode: always

Add realism with schedules

Real failures are intermittent. Use random with a probability, or periodic for burst patterns.

schedule:
  mode: random
  probability: 0.3     # 30% of requests fail

schedule:
  mode: periodic
  every: 5             # every 5th request...
  length: 2            # ...2 consecutive requests fail
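To get a feel for how often each schedule fires, the two modes can be simulated in a few lines of Python. This is only a sketch: the function names are hypothetical, and the periodic rule assumes requests are grouped into cycles of `every` with the last `length` requests of each cycle faulted. AgentBreak's actual positioning within a cycle may differ.

```python
import random


def random_faults(n, probability, seed=0):
    """Fault each of n requests independently with the given probability."""
    rng = random.Random(seed)  # seeded so repeated runs give the same sequence
    return [rng.random() < probability for _ in range(n)]


def periodic_faults(n, every, length):
    """ASSUMPTION: requests form cycles of `every`, and the last `length`
    requests in each cycle are faulted."""
    return [(i % every) >= every - length for i in range(n)]


def fault_rate(faults):
    """Fraction of requests that were faulted."""
    return sum(faults) / len(faults)


if __name__ == "__main__":
    print(fault_rate(random_faults(10_000, probability=0.3)))
    print(periodic_faults(10, every=5, length=2))
```

Under that assumption, `every: 5` with `length: 2` faults 2 of every 5 requests, in bursts, while `probability: 0.3` spreads a similar share of faults unpredictably.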

Combine faults

Layer multiple scenarios. A brownout might mean slow responses and occasional errors:

scenarios:
  - name: slow-llm
    summary: LLM latency spikes
    target: llm_chat
    fault:
      kind: latency
      min_ms: 3000
      max_ms: 8000
    schedule:
      mode: random
      probability: 0.4

  - name: llm-rate-limit
    summary: Occasional rate limits
    target: llm_chat
    fault:
      kind: http_error
      status_code: 429
    schedule:
      mode: random
      probability: 0.2
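When layering scenarios it helps to estimate how many requests will be disturbed at all. The sketch below assumes each scenario is evaluated independently for every request, which this page does not state; the helper names are hypothetical.

```python
def p_any(probabilities):
    """Probability that at least one of several independent faults fires."""
    p_none = 1.0
    for p in probabilities:
        p_none *= 1.0 - p  # chance that this fault does NOT fire
    return 1.0 - p_none


def p_all(probabilities):
    """Probability that every one of several independent faults fires."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


if __name__ == "__main__":
    # slow-llm (0.4) layered with llm-rate-limit (0.2)
    print(round(p_any([0.4, 0.2]), 2))
    print(round(p_all([0.4, 0.2]), 2))
```

With the two scenarios above, roughly half of all requests would see at least one fault under the independence assumption, so the agent still gets a fair share of clean calls.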

Target specific tools or models

Use the match field to limit a fault to specific models or tools:

match:
  model: gpt-4o          # only affect requests to this model

match:
  tool_name: search_docs  # only affect this MCP tool

match:
  tool_name_pattern: "search_*"  # wildcard matching
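Before running a test, a pattern can be checked against your tool names locally. The sketch assumes `tool_name_pattern` follows shell-style glob rules, as implemented by Python's `fnmatch`; AgentBreak's own matcher may behave differently, and `tool_matches` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def tool_matches(tool_name, pattern):
    """Case-sensitive glob match, where * matches any run of characters."""
    return fnmatchcase(tool_name, pattern)


if __name__ == "__main__":
    for name in ("search_docs", "search_web", "web_search"):
        print(name, tool_matches(name, "search_*"))
```

Note that a leading-anchored pattern such as `search_*` does not match `web_search`; a pattern like `*search*` would be needed for that.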

Reading results

Scorecard

After a test run, check the scorecard:

curl localhost:5005/_agentbreak/scorecard

Key metrics:

Metric	Meaning
`requests_seen`	Total requests the agent made
`injected_faults`	How many faults AgentBreak applied
`upstream_successes`	Requests that got a valid response
`upstream_failures`	Failed requests (injected + real)
`duplicate_requests`	Same payload sent twice
`suspected_loops`	Same payload sent 3+ times
`resilience_score`	0-100 composite score
`run_outcome`	`PASS`, `DEGRADED`, or `FAIL`
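The same scorecard can be pulled from a script, for example in CI. This sketch assumes the endpoint returns a JSON object with the metrics above as top-level keys; the response shape is not confirmed here, and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

SCORECARD_URL = "http://localhost:5005/_agentbreak/scorecard"

KEY_METRICS = (
    "requests_seen",
    "injected_faults",
    "upstream_successes",
    "upstream_failures",
    "duplicate_requests",
    "suspected_loops",
    "resilience_score",
    "run_outcome",
)


def fetch_scorecard(url=SCORECARD_URL, timeout=5.0):
    """Fetch the scorecard. ASSUMPTION: the body is a JSON object."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize(scorecard):
    """Format the key metrics one per line; missing keys show as n/a."""
    return "\n".join(
        f"{name}: {scorecard.get(name, 'n/a')}" for name in KEY_METRICS
    )


if __name__ == "__main__":
    print(summarize(fetch_scorecard()))
```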

Run outcome

PASS — no failures, no loops. The agent handled everything cleanly.
DEGRADED — some failures occurred, but some requests succeeded. The agent partially recovered.
FAIL — all requests failed, or the agent got stuck in a loop.
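Read literally, the three rules amount to a small decision procedure. The function below is a sketch of that reading, not AgentBreak's implementation; `classify_run` and its parameters are hypothetical names, and treating a run with zero requests as PASS is an assumption.

```python
def classify_run(successes, failures, suspected_loops):
    """Map request counts to PASS, DEGRADED, or FAIL per the rules above."""
    if suspected_loops > 0:
        return "FAIL"      # the agent got stuck in a loop
    if failures == 0:
        return "PASS"      # no failures, no loops
    if successes == 0:
        return "FAIL"      # every request failed
    return "DEGRADED"      # some failures, some successes


if __name__ == "__main__":
    print(classify_run(successes=8, failures=2, suspected_loops=0))
```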

Presets

Built-in scenario bundles for common patterns:

Preset	What it does
`brownout`	Random LLM latency + rate limits
`mcp-slow-tools`	90% of MCP tool calls are slow
`mcp-tool-failures`	30% of MCP tool calls return 503
`mcp-mixed-transient`	Light MCP latency + errors

preset: brownout

Workflow

1. agentbreak init: generate config files
2. Edit scenarios: define what faults to inject
3. agentbreak serve: start the proxy
4. Run your agent: point it at localhost:5005
5. Check scorecard: review results at /_agentbreak/scorecard
6. Iterate: adjust scenarios, re-run, compare scores
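For the compare step, two saved scorecards can be diffed with a short script. This is a sketch under the assumption that each scorecard is a dict carrying `resilience_score` and `run_outcome`; the function names and the tolerance parameter are hypothetical, and the sample values are made up for illustration.

```python
def score_delta(baseline, candidate):
    """Change in resilience_score from the baseline run to the candidate run."""
    return candidate["resilience_score"] - baseline["resilience_score"]


def regressed(baseline, candidate, tolerance=0):
    """True if the score dropped by more than `tolerance`, or the run newly FAILs."""
    newly_failing = (
        candidate["run_outcome"] == "FAIL" and baseline["run_outcome"] != "FAIL"
    )
    return newly_failing or score_delta(baseline, candidate) < -tolerance


if __name__ == "__main__":
    # Made-up values, for illustration only.
    before = {"resilience_score": 70, "run_outcome": "DEGRADED"}
    after = {"resilience_score": 85, "run_outcome": "PASS"}
    print(score_delta(before, after), regressed(before, after))
```

Comparing only runs that used the same scenarios keeps the delta meaningful, since a harsher scenario set will lower the score regardless of agent changes.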