Spec-First Error Handling Patterns for APIs

Daniel Marsh · Spec-first engineering notes

Error handling is the part of an API contract that gets specified last and breaks clients first. If your spec says "returns 400 on bad input" and nothing else, clients are guessing. I've debugged enough production incidents caused by vague error contracts — including one where a billing API returned three different error shapes from three different endpoints — to know this section of the spec deserves more attention than it usually gets. This guide covers how to define a complete error taxonomy in the spec, design retry-safe error payloads, and write acceptance criteria that make error behavior reviewable.

Published on 2026-02-21 · Updated 2026-03-20

Why error contracts fail after implementation starts

When engineers spec a new endpoint, they tend to document the happy path thoroughly and sketch the error paths in a single bullet: "returns appropriate error codes." That phrase does no work. It tells QA nothing, tells client developers nothing, and tells operations nothing about what to monitor.

The result: each engineer invents the error response shape independently. One endpoint returns {"error": "not_found"}, another returns {"message": "User does not exist", "code": 404}, and a third wraps everything in a data envelope. Clients that need to handle errors uniformly can't, because there's no uniform contract to handle.

Error Type     | HTTP Status | Retryable?       | User Action
Validation     | 400 / 422   | No               | Fix input and resubmit
Authentication | 401         | No               | Re-login
Authorization  | 403         | No               | Request access
Not Found      | 404         | No               | Check resource ID
Conflict       | 409         | Maybe            | Refresh and retry
Rate Limit     | 429         | Yes, after delay | Wait and retry
Server Error   | 500         | Maybe            | Report issue
Unavailable    | 503         | Yes, after delay | Wait and retry

Defining the error taxonomy before writing code

The error taxonomy belongs in the spec as a shared component, not scattered endpoint by endpoint. Define the categories first:

The 4xx range covers client errors: the client sent something invalid or unauthorized, and, apart from 429 and some 409 responses, retrying the same request without changing it will not help. The 5xx range covers server errors: the server failed, and the request may or may not be safe to retry depending on the operation. Within each HTTP status category, specific error codes (machine-readable strings) identify the exact failure type so clients can branch on them programmatically.

A shared error schema defined in OpenAPI looks like this:

components:
  schemas:
    ErrorResponse:
      type: object
      required: [error, message, request_id]
      properties:
        error:
          type: string
          description: Machine-readable error code. Stable across versions.
          example: VALIDATION_FAILED
        message:
          type: string
          description: Human-readable explanation. May change between versions.
          example: "Field 'email' must be a valid email address."
        request_id:
          type: string
          description: Correlates this response to server logs.
          example: "req_8f3a2b1c"
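        retry_after:
          type: integer
          description: Present on 429 and certain 503 responses. Seconds to wait.
          example: 30
        details:
          type: array
          description: Field-level validation errors. Present on 422 responses.
          items:
            type: object
            required: [field, code, message]
            properties:
              field:
                type: string
              code:
                type: string
              message:
                type: string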

Every error response in every endpoint references this schema. When clients write error-handling code, they target one shape, not dozens of improvised variations.
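
On the client side, "one shape" can be enforced with a single parser that every error path goes through. The following is a minimal sketch, assuming the body has already been decoded from JSON; the names `ApiError` and `parse_error_response` are illustrative and not part of the spec.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Required fields, taken from the ErrorResponse schema.
REQUIRED_FIELDS = ("error", "message", "request_id")


@dataclass(frozen=True)
class ApiError:
    error: str
    message: str
    request_id: str
    retry_after: Optional[int] = None


def parse_error_response(body: Mapping[str, Any]) -> ApiError:
    """Check a decoded error body against the shared shape and wrap it."""
    missing = [name for name in REQUIRED_FIELDS if name not in body]
    if missing:
        raise ValueError(f"error body missing required fields: {missing}")
    for name in REQUIRED_FIELDS:
        if not isinstance(body[name], str):
            raise ValueError(f"field {name!r} must be a string")
    retry_after = body.get("retry_after")
    # bool is a subclass of int in Python, so reject it explicitly.
    if retry_after is not None and (
        isinstance(retry_after, bool) or not isinstance(retry_after, int)
    ):
        raise ValueError("retry_after must be an integer number of seconds")
    return ApiError(body["error"], body["message"], body["request_id"], retry_after)
```

A body such as `{"error": "not_found"}` fails this check immediately, which is the point: a response that drifts from the contract surfaces as a contract violation instead of a silent mis-parse.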

Mapping HTTP status codes to behavior

The spec should explicitly state which HTTP status code maps to which client behavior. This is not implied by the HTTP standard — many teams use 400, 422, and 409 interchangeably until an argument forces a decision. Make the decision in the spec:

## Error status code usage

400 Bad Request   - malformed JSON, type mismatch, or a body that cannot be parsed.
                    Client must fix the request before retrying.

401 Unauthorized  - missing or invalid authentication token.
                    Client must re-authenticate before retrying.

403 Forbidden     - authenticated but not authorized for this resource.
                    Retrying will not help. Contact support.

404 Not Found     - resource does not exist or caller cannot see it.
                    Do not reveal whether the resource exists to unauthorized callers.

409 Conflict      - request conflicts with current state (e.g., duplicate key).
                    Client must resolve the conflict (e.g., use a different ID).

422 Unprocessable - request was well-formed JSON but failed validation
                    (missing required field, invalid format, domain rule).
                    details will contain field-level validation messages.

429 Too Many Reqs - rate limit exceeded. Retry after retry_after seconds.

500 Internal      - unexpected server error. Retry with backoff only if the
                    request is idempotent or carries an Idempotency-Key.
503 Unavailable   - server temporarily unavailable. Retry with backoff.
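
Once the decision is made, it can live in one lookup that client code and tests share. The sketch below encodes the retryability column of the table near the top of this article; the `Retry` enum, the category strings, and the `classify` function are illustrative names, not something the spec defines.

```python
from enum import Enum
from typing import Dict, Tuple


class Retry(Enum):
    NO = "no"
    MAYBE = "maybe"              # depends on the error code or on idempotency
    AFTER_DELAY = "after_delay"


# Status code -> (error category, retryability), following the table above.
STATUS_POLICY: Dict[int, Tuple[str, Retry]] = {
    400: ("validation", Retry.NO),
    422: ("validation", Retry.NO),
    401: ("authentication", Retry.NO),
    403: ("authorization", Retry.NO),
    404: ("not_found", Retry.NO),
    409: ("conflict", Retry.MAYBE),
    429: ("rate_limit", Retry.AFTER_DELAY),
    500: ("server_error", Retry.MAYBE),
    503: ("unavailable", Retry.AFTER_DELAY),
}


def classify(status: int) -> Tuple[str, Retry]:
    """Look up the documented category; unknown statuses are a contract violation."""
    try:
        return STATUS_POLICY[status]
    except KeyError:
        raise ValueError(
            f"status {status} is not part of the documented error contract"
        ) from None
```

Raising on an undocumented status is a design choice for this sketch: it makes an endpoint that returns, say, 418 show up in tests rather than being handled by a generic fallback.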

Idempotency requirements and retry safety

For any operation with side effects, the spec must state whether it is retry-safe and, if so, how. This is not optional for payment endpoints, order creation, or anything that writes state. A 500 response on a POST leaves the client in an ambiguous state: did the operation complete before the server crashed, or not?

The spec-first solution is to require an idempotency key for non-idempotent operations and document the deduplication window:

POST /v1/charges
Headers:
  Idempotency-Key: <client-generated UUID> (required)

Behavior:
- If a request with the same Idempotency-Key is received within 24 hours
  of a successful charge, return the original response with status 200.
- If a request with the same key is received while the original is still
  processing, return 409 Conflict with error code IDEMPOTENCY_IN_PROGRESS.
- If a request with the same key has a different request body, return 422
  with error code IDEMPOTENCY_KEY_REUSE.
- Keys older than 24 hours are expired. A new request will be processed.

With this in the spec, every engineer who implements or reviews the endpoint knows exactly what retry-safe behavior looks like. QA can write tests without guessing. No post-incident surprises.
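
To make the four rules concrete, here is a minimal in-memory sketch of the deduplication logic. It assumes a single process, measures the 24-hour window from the moment the charge completes, and abbreviates error bodies to the `error` code only. `IdempotencyStore`, `begin`, and `complete` are hypothetical names; a real service would back this with a shared datastore and locking.

```python
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple

WINDOW_SECONDS = 24 * 60 * 60  # deduplication window from the spec text

Response = Tuple[int, Dict[str, Any]]


def _fingerprint(body: Dict[str, Any]) -> str:
    # Canonical JSON so that key order does not change the hash.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


@dataclass
class _Entry:
    fingerprint: str
    response_body: Optional[Dict[str, Any]] = None  # None while still processing
    completed_at: Optional[float] = None


class IdempotencyStore:
    """In-memory, single-process sketch; not thread-safe."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._entries: Dict[str, _Entry] = {}
        self._clock = clock

    def begin(self, key: str, body: Dict[str, Any]) -> Optional[Response]:
        """Return a response to send right away, or None if the charge should be processed."""
        entry = self._entries.get(key)
        if entry is not None and entry.completed_at is not None:
            if self._clock() - entry.completed_at >= WINDOW_SECONDS:
                entry = None  # key expired: treat as a new request
        if entry is None:
            self._entries[key] = _Entry(_fingerprint(body))
            return None
        if entry.fingerprint != _fingerprint(body):
            return 422, {"error": "IDEMPOTENCY_KEY_REUSE"}
        if entry.response_body is None:
            return 409, {"error": "IDEMPOTENCY_IN_PROGRESS"}
        return 200, entry.response_body

    def complete(self, key: str, response_body: Dict[str, Any]) -> None:
        """Record the successful response so later retries can replay it."""
        entry = self._entries[key]
        entry.response_body = response_body
        entry.completed_at = self._clock()
```

The sketch checks for a body mismatch before checking whether the original is still in progress; the spec text does not say which rule wins when both apply, so that ordering is an assumption worth settling in the spec itself. It also leaves out what happens when processing fails, which the behavior list above does not cover.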

Backward compatibility of error contracts

Error response shapes are part of the API contract and must be versioned alongside success responses. Teams often break clients not by changing success payloads but by quietly changing error codes, renaming error fields, or switching from 400 to 422 for a validation scenario.

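The classification rules from the versioning spec apply equally to errors: changing or removing an error code, renaming or removing a field in the error body, or switching the status code returned for an existing scenario is a breaking change and is versioned as one. Rewording the human-readable message is not.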

How clients should handle errors per spec

The spec should include a section explicitly for API consumers, not just for implementors. This section tells clients which error codes are permanent (stop retrying) versus transient (retry with backoff), and which fields are stable enough to switch on in client code.

## Client error handling guidance

Stable fields safe to parse in client code:
- error (machine-readable code) — stable across minor versions
- request_id — always present, use for support requests
- retry_after — present on 429, safe to use for backoff timing

Do NOT branch on `message` in client code — it is for humans and may
change between releases without a version bump.

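Retry policy:
- 500, 503: exponential backoff, max 3 retries. Retry a 500 on a write only
  when the request carries an Idempotency-Key.
- 429: wait retry_after seconds, then retry once
- 409 with IDEMPOTENCY_IN_PROGRESS: the original request is still running;
  retry with backoff using the same key
- 400, 401, 403, 404, 422, and any other 409: do not retry; fix the request first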

This section belongs in the spec, not in a separate consumer guide that nobody reads before integrating.
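
The retry policy above is small enough to express as one pure function, which also makes it easy to unit-test. This is a sketch under stated assumptions: the base delay of 0.5 seconds, the `idempotent` flag, and the fallback when a 429 arrives without `retry_after` are my choices, not part of the guidance. The name `next_delay` is illustrative.

```python
from typing import Optional

MAX_SERVER_RETRIES = 3  # "max 3 retries" for 500 and 503


def next_delay(
    status: int,
    attempt: int,
    retry_after: Optional[int] = None,
    idempotent: bool = True,
    base: float = 0.5,
) -> Optional[float]:
    """Return seconds to wait before the next attempt, or None to stop.

    attempt is the number of retries already made (0 before the first retry).
    """
    if status in (500, 503):
        # A 500 on a non-idempotent write is ambiguous, so do not retry blindly.
        if status == 500 and not idempotent:
            return None
        if attempt >= MAX_SERVER_RETRIES:
            return None
        return base * (2 ** attempt)  # exponential backoff: base, 2*base, 4*base
    if status == 429:
        if attempt >= 1:  # "retry once"
            return None
        # Assumption: fall back to the base delay if the server omitted retry_after.
        return float(retry_after) if retry_after is not None else base
    # 400, 401, 403, 404, 409, 422: the request must change before it can succeed.
    return None
```

Production clients commonly add random jitter to the backoff delay so that many clients do not retry in lockstep; it is left out here to keep the function deterministic.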

Specifying error detail for validation failures

A 422 response that says only "validation failed" forces the client to make a follow-up request or open a support ticket to understand which field was wrong. The spec should define a details array for multi-field validation errors:

ErrorResponse (422):
  error: "VALIDATION_FAILED"
  message: "One or more fields failed validation."
  details:
    - field: "email"
      code: "INVALID_FORMAT"
      message: "Must be a valid email address."
    - field: "date_of_birth"
      code: "FUTURE_DATE"
      message: "Date of birth cannot be in the future."

Form UIs, mobile clients, and integration test suites all benefit from field-level error codes. Without them, every 422 requires human interpretation.
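
As an illustration of what field-level codes buy a client, the sketch below groups the `details` entries by field so a form can attach each code to the right input. The function name `errors_by_field` is hypothetical; the body layout follows the 422 example above.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping


def errors_by_field(body: Mapping[str, Any]) -> Dict[str, List[str]]:
    """Map each field name to the list of machine-readable codes reported for it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for item in body.get("details", []):
        grouped[item["field"]].append(item["code"])
    return dict(grouped)


example_body = {
    "error": "VALIDATION_FAILED",
    "message": "One or more fields failed validation.",
    "details": [
        {"field": "email", "code": "INVALID_FORMAT",
         "message": "Must be a valid email address."},
        {"field": "date_of_birth", "code": "FUTURE_DATE",
         "message": "Date of birth cannot be in the future."},
    ],
}

field_errors = errors_by_field(example_body)
# field_errors == {"email": ["INVALID_FORMAT"], "date_of_birth": ["FUTURE_DATE"]}
```

The client branches on `code`, never on the per-field `message`, for the same reason it does not branch on the top-level `message`.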

Specifying error handling in acceptance criteria

Acceptance criteria for error behavior are written the same way as success-path criteria. They are specific, testable, and written before implementation:

- Given a POST /v1/orders request with a missing `items` field
  When the server processes the request
  Then the response status is 422
   And the response body matches ErrorResponse schema
   And the body's error field equals "VALIDATION_FAILED"
   And the body's details array contains an entry with field="items" and code="REQUIRED"

- Given a POST /v1/charges with a valid Idempotency-Key that was used for a successful charge 1 hour ago
  When the server receives the same request body
  Then the response status is 200
   And the response body is identical to the original charge response

- Given a POST /v1/charges where the server crashes after committing the charge
  When the client retries with the same Idempotency-Key
  Then the response status is 200
   And no duplicate charge is created
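
Criteria written this way translate almost line for line into an automated test. Below is a sketch of the first criterion, reading `error` and `details` from the top level of the body as the shared ErrorResponse schema defines them. `create_order` is a hypothetical stand-in for the real handler, included only so the example runs on its own.

```python
from typing import Any, Dict, Tuple


def create_order(payload: Dict[str, Any]) -> Tuple[int, Dict[str, Any]]:
    """Hypothetical handler stub: rejects orders that have no `items` field."""
    if "items" not in payload:
        return 422, {
            "error": "VALIDATION_FAILED",
            "message": "One or more fields failed validation.",
            "request_id": "req_test",
            "details": [
                {"field": "items", "code": "REQUIRED",
                 "message": "Field 'items' is required."},
            ],
        }
    return 201, {"id": "ord_1"}


def test_missing_items_returns_422() -> None:
    # Given a POST /v1/orders request with a missing `items` field
    status, body = create_order({"customer_id": "c_1"})
    # Then the response status is 422
    assert status == 422
    # And the response body carries the required ErrorResponse fields
    assert {"error", "message", "request_id"} <= body.keys()
    # And the error code is VALIDATION_FAILED
    assert body["error"] == "VALIDATION_FAILED"
    # And details contains an entry with field="items" and code="REQUIRED"
    assert any(
        d["field"] == "items" and d["code"] == "REQUIRED" for d in body["details"]
    )
```

In a real suite the stub would be replaced by a call to the service under test; the assertions stay the same.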

Error observability requirements in the spec

The spec should state which errors must be monitored in production and at what thresholds. An elevated 500 rate is a deployment signal. A spike in 422s might indicate a client using a deprecated field. A sudden 401 spike might indicate a token rotation gone wrong.

The monitoring requirements have to be in the spec, not invented service by service. If they're not named before implementation starts, they won't be consistent, and they won't be there when you need them at 2am.
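
One way to make "at what thresholds" reviewable is to put the thresholds in a table that sits next to the spec and to check it with a small function. The sketch below is illustrative only: the threshold values are placeholders I made up for the example, and `breached` is a hypothetical name.

```python
from typing import Dict, List

# Placeholder thresholds: maximum share of all requests per status code.
# Real values belong in the spec and depend on the service.
THRESHOLDS: Dict[int, float] = {500: 0.01, 422: 0.05, 401: 0.02}


def breached(counts: Dict[int, int], total: int, thresholds: Dict[int, float]) -> List[int]:
    """Return the status codes whose share of total requests exceeds its threshold."""
    if total <= 0:
        return []
    return sorted(
        status
        for status, limit in thresholds.items()
        if counts.get(status, 0) / total > limit
    )


# 12 of 1000 requests returned 500 (1.2%), which is above the 1% placeholder.
alerts = breached({500: 12, 422: 40, 401: 1}, 1000, THRESHOLDS)
# alerts == [500]
```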
