Writing Performance Acceptance Criteria

Daniel Marsh · Spec-first engineering notes

"The API should be fast" is not an acceptance criterion — it is a hope. Performance requirements only become enforceable when they specify a metric, a threshold, a load condition, and a measurement method. This article shows how to write performance acceptance criteria that a load test can either pass or fail, with no room for interpretation.

Published on 2026-02-26 · Updated 2026-03-20 · 6 min read · Daniel Marsh

The problem with "the system should be fast"

Every engineer has read a requirement that says "the API should respond quickly" or "the page load should feel snappy." These phrases feel like performance criteria because they mention performance. They aren't. They're wishes. Nobody can pass or fail a wish.

Writing performance acceptance criteria that are actually testable means committing to three things before implementation begins: a specific metric, a numeric threshold, and a load condition under which the threshold must hold. Without all three, QA cannot run a test, and you cannot declare done.

Vague Criteria

  • "Page should load fast"
  • "API should handle high traffic"
  • "Search should return results quickly"

Measurable Criteria

  • "LCP < 2.5s on 4G, p95"
  • "API sustains 500 req/s with p99 < 800ms"
  • "Search returns results in < 200ms for queries up to 10K rows"

The three-part formula for measurable criteria

Every performance criterion in a spec should answer: what are we measuring, what number must it hit, and under what conditions? The formula is: metric + threshold + load condition. Anything missing one of those parts is incomplete.

Examples of complete criteria:

  • "Search API p99 latency < 200ms under 500 concurrent users" (metric: p99 latency; threshold: 200ms; load: 500 concurrent users)
  • "API sustains 500 req/s with p99 < 800ms" (metric: p99 latency; threshold: 800ms; load: 500 req/s sustained)
  • "LCP < 2.5s at p95 on a throttled 4G connection" (metric: LCP at p95; threshold: 2.5s; load: 4G network conditions)

Notice what none of these say: "reasonable," "acceptable," "fast enough." Those words mean the spec writer has not yet made a decision.
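The formula can also be enforced mechanically at spec-review time. A minimal sketch in Python (illustrative only, not from any real spec tool): a criterion record that refuses to exist unless all three parts are present.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PerfCriterion:
    """A performance acceptance criterion: metric + threshold + load condition."""
    metric: str          # e.g. "p99 latency"
    threshold_ms: float  # the numeric pass/fail line
    load_condition: str  # e.g. "500 concurrent users, sustained 5 min"

    def __post_init__(self):
        # A missing part means the spec writer has not yet made a decision.
        if not self.metric or not self.load_condition or self.threshold_ms <= 0:
            raise ValueError("criterion needs a metric, a positive threshold, and a load condition")

# Complete criterion: constructs fine.
ok = PerfCriterion("p99 latency", 200, "500 concurrent users, sustained 5 min")
```

A criterion like "should be fast" simply cannot be written in this form, which is the point.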

Choosing the right percentile

Average latency lies. A p50 of 80ms looks healthy while your p99 is 4 seconds and your largest customers are leaving. Production traffic is never a normal distribution. Tails matter.

The standard practice for most API or page performance specs is to define criteria at p95 and p99. Use p50 only as a baseline sanity check. For anything user-facing with SLA exposure, p99 is the number that actually gets you paged at 2am, so that is the number that should appear in the spec.
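The gap between the average and the tail is easy to demonstrate. A self-contained sketch (the nearest-rank percentile here is illustrative; load tools compute their own):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ranked = sorted(samples)
    k = math.ceil(p / 100 * len(ranked)) - 1
    return ranked[max(k, 0)]

# 100 simulated request latencies (ms): a healthy body and a pathological tail.
latencies = [80] * 95 + [4000] * 5

print(percentile(latencies, 50))  # 80   -- the median looks fine
print(percentile(latencies, 99))  # 4000 -- the number that pages you at 2am
```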

Performance criteria — Search API:
  p50 latency: < 50ms
  p95 latency: < 120ms
  p99 latency: < 200ms
  Conditions: 500 concurrent users, 10-second sustained load test
  Dataset: production-scale index (≥ 2M documents)
  Acceptable error rate: < 0.05%
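Written down this way, the block above evaluates mechanically. A sketch of the pass/fail check (thresholds copied from the criteria above; the field names are hypothetical, not from any particular tool):

```python
# Thresholds copied from the Search API criteria block.
SEARCH_CRITERIA = {"p50": 50, "p95": 120, "p99": 200}   # ms
MAX_ERROR_RATE = 0.0005                                  # 0.05%

def passes(measured_latencies_ms, error_rate):
    """True only if every percentile is under its threshold and errors are in budget."""
    latency_ok = all(measured_latencies_ms[p] < limit
                     for p, limit in SEARCH_CRITERIA.items())
    return latency_ok and error_rate < MAX_ERROR_RATE

print(passes({"p50": 42, "p95": 110, "p99": 180}, 0.0001))  # True
print(passes({"p50": 42, "p95": 110, "p99": 250}, 0.0001))  # False -- p99 over budget
```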

Defining the load condition

A threshold without a load condition is a lab result. The spec needs to specify what "normal load" means in measurable terms. This forces the team to think about production scale before writing a single line of code.

Useful parameters to specify in the load condition:

  • Concurrent users or virtual users (VUs)
  • Sustained request rate (req/s)
  • Test duration and ramp profile
  • Dataset size (production-scale, not fixture-scale)
  • Acceptable error rate during the run

Tying criteria to specific test tools

If the spec does not say how the criterion will be measured, different team members will measure it differently and get different results. Specify the tool, the scenario, and where the results need to land. This is especially important for anything that becomes a CI gate or a release blocker.

Acceptance test:
  Tool: k6 (load test script in /tests/perf/search-load.js)
  Scenario: ramp from 0 to 500 VUs over 30s, hold for 5 min
  Pass condition: p99 < 200ms AND error rate < 0.05%
  Failure action: block release, file perf regression ticket
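The pass condition above can run as a CI gate that reads the test tool's exported results. A hedged sketch, assuming k6 was run with `--summary-export` and configured (via `summaryTrendStats` or a threshold) to include a `p(99)` entry; verify the exact JSON layout against your k6 version:

```python
import json

def gate(summary_path, p99_limit_ms=200, max_error_rate=0.0005):
    """Apply the spec's pass condition to an exported k6 summary file."""
    with open(summary_path) as f:
        summary = json.load(f)
    # Assumed layout: metrics.http_req_duration carries percentile stats,
    # metrics.http_req_failed carries the failure rate as "value".
    p99 = summary["metrics"]["http_req_duration"]["p(99)"]
    error_rate = summary["metrics"]["http_req_failed"]["value"]
    return p99 < p99_limit_ms and error_rate < max_error_rate
```

A `False` return blocks the release and files the regression ticket, exactly as the failure action requires.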

Naming the tool and the script path removes the last ambiguity. There is exactly one way to evaluate this criterion, so there is no argument about whether QA and engineering are testing the same thing.

When performance criteria must be tiered

Not every user or operation needs the same SLA. A read API serving cached list data has different requirements than a write operation that hits the database and sends an email. Holding every endpoint in a service to a single performance spec is a category error.

For complex features, define performance tiers by criticality:

  • Critical path (checkout, auth, payment): the tightest thresholds. p99 under 150ms, hard SLA, blocks release if missed.
  • Standard interactive (list pages, search): p99 under 500ms, soft SLA, tracked as a regression.
  • Async or background (exports, report generation): measured by a wall-clock deadline (say, under 30 seconds) rather than a latency percentile.
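The tiers can be written down as data so every new endpoint gets classified explicitly. The tier names and numbers come from this section; the endpoint names and the structure itself are just an illustration:

```python
# Tier definitions from this section. "deadline_s" replaces a latency
# percentile for async work, which is measured by wall-clock completion time.
TIERS = {
    "critical":    {"p99_ms": 150, "sla": "hard", "on_miss": "block release"},
    "interactive": {"p99_ms": 500, "sla": "soft", "on_miss": "track regression"},
    "async":       {"deadline_s": 30, "sla": "soft", "on_miss": "track regression"},
}

ENDPOINT_TIER = {            # every endpoint must appear here -- no default tier
    "POST /checkout": "critical",
    "GET /search":    "interactive",
    "POST /exports":  "async",
}

def criteria_for(endpoint):
    # A KeyError here means an unclassified endpoint, which is itself a spec bug.
    return TIERS[ENDPOINT_TIER[endpoint]]
```

Making the mapping explicit forces the "which tier is this?" conversation to happen during spec review rather than during an incident.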

Degradation thresholds and alerting

A spec that only covers green-path performance is half a spec. The document should also say what happens when the system slows down: at what point it alerts, at what point it rejects requests, and at what point it stops accepting new work to protect requests already in flight.

Add a degradation policy to any performance criterion that has SLA implications:

Degradation policy — Order API:
  Alert: p99 > 300ms for 2 consecutive minutes → PagerDuty P2
  Shed load: if p99 > 800ms, return HTTP 503 with Retry-After: 30
  Circuit open: if error rate > 5% for 60s, open circuit to inventory service
  Rollback trigger: if p99 stays > 500ms after deploy for 5 min → auto-rollback
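The policy above reads directly as a decision function, evaluated most severe condition first. A sketch with thresholds copied from the Order API block; the action names are hypothetical, and the deploy-scoped auto-rollback trigger is left to the deploy pipeline rather than this runtime check:

```python
def degradation_action(p99_ms, error_rate, minutes_elevated):
    """Map current health to the Order API degradation policy, most severe first."""
    if error_rate > 0.05 and minutes_elevated >= 1:
        return "open-circuit"   # error rate > 5% for 60s: open circuit to inventory
    if p99_ms > 800:
        return "shed-load"      # return 503 with Retry-After: 30
    if p99_ms > 300 and minutes_elevated >= 2:
        return "page"           # PagerDuty P2
    return "ok"
```

Encoding the ladder in one place keeps the alerting rules, the load shedder, and the spec from drifting apart.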

What not to include in performance criteria

Performance sections often accumulate aspirational numbers that nobody intends to enforce. Keep the list short. If a threshold is not worth blocking a release over, it does not belong in the acceptance criteria — put it in a monitoring dashboard instead.

Also avoid specifying implementation approach in a performance criterion. The spec should say "p99 under 200ms at 1,000 RPS." It should not say "use Redis to cache the result." Implementation is engineering's domain. The criterion tells them what outcome is required; they decide how to reach it.

The readiness test

Before signing off on a performance section of any spec, ask one question: could QA run this test today without asking me a single clarifying question? If the answer is no, the spec is not finished. The missing information is almost always the load condition, the measurement tool, or the exact pass/fail number.

Writing performance acceptance criteria well takes an extra hour during spec review. That hour prevents days of back-and-forth during load testing, and eliminates the awkward post-release conversation about whether the system is "fast enough."
