Postmortem: Preventing a Billing Incident with Spec-First
I was on the team when 847 customers got charged twice. Support tickets started coming in on a Tuesday afternoon, and by the time we had a full picture it was $127,000 in duplicate charges across a 6-hour window. Three days to fully resolve. The root cause wasn't a bug in the traditional sense — it was a question nobody had answered in writing: what should happen when the payment provider times out after a charge is already in flight?
How it happened
Our checkout service called POST /v1/charges to process a payment. The payment provider occasionally took longer than our 30-second timeout. When the timeout fired, checkout returned "Payment failed. Please try again." The user — reasonably — clicked retry. A second charge went through.
Except the first charge had also completed, just after the timeout, while the retry was already in flight. Two charges. One purchase. And we had no mechanism to detect or prevent it.
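The race is easier to see on a timeline. Below is a minimal, deterministic sketch of it, not our actual checkout code: `FakeProvider`, `naive_checkout`, the customer ID, the amount, and the 32-second and 2-second latencies are all hypothetical. Only the 30-second timeout and the error message come from the incident.

```python
from dataclasses import dataclass, field

CLIENT_TIMEOUT_S = 30  # checkout's timeout, as described above


@dataclass
class FakeProvider:
    """Hypothetical stand-in for the payment provider; records settled charges."""
    settled: list = field(default_factory=list)

    def submit(self, customer_id, amount_cents, latency_s, sent_at):
        """Settle the charge and return the time at which the provider finished."""
        done_at = sent_at + latency_s
        self.settled.append((done_at, customer_id, amount_cents))
        return done_at


def naive_checkout(provider, customer_id, amount_cents, latency_s, sent_at):
    """Checkout without idempotency: it reports failure on timeout, but the
    provider still completes the charge it already received."""
    done_at = provider.submit(customer_id, amount_cents, latency_s, sent_at)
    if done_at - sent_at > CLIENT_TIMEOUT_S:
        return "Payment failed. Please try again."
    return "ok"


provider = FakeProvider()
# First attempt: the provider needs 32s, so the 30s timeout fires first.
first = naive_checkout(provider, "cus_123", 4999, latency_s=32, sent_at=0)
# The user retries at t=31s; this attempt is fast and succeeds at t=33s.
second = naive_checkout(provider, "cus_123", 4999, latency_s=2, sent_at=31)

print(first)                  # Payment failed. Please try again.
print(second)                 # ok
print(len(provider.settled))  # 2
```

The first charge settles at t=32s, inside the window where the retry (t=31s to t=33s) is already in flight, so the provider ends up holding two charges for one purchase.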
I remember the Slack thread where someone asked "wait, does our charges endpoint even support idempotency keys?" The answer was no. We had never discussed it. It just... wasn't in the spec.
What the spec said (and didn't)
The spec for the POST /v1/charges endpoint, as it existed before the incident, contained the following:
POST /v1/charges
Creates a new charge against the customer's payment method.
Request body:
customer_id: string (required)
amount_cents: integer (required)
currency: string (required, ISO 4217)
Response:
201: Charge created successfully.
400: Invalid request parameters.
402: Payment declined by provider.
500: Internal server error.
Notice what's absent: any mention of retry behavior, idempotency, what a client should do when it receives a 500, or what happens when the payment provider times out after a charge has already been submitted. The spec describes the happy path and says nothing about the failure modes that actually matter for a billing system.
The acceptance criteria we never wrote
Good acceptance criteria would have forced the team to decide the retry question before implementation. Here is what was missing:
MISSING - Acceptance criteria for retry behavior:
- Given POST /v1/charges is submitted with Idempotency-Key: "key-xyz"
When the payment provider succeeds after the client's timeout fires
And the client retries with the same Idempotency-Key
Then the endpoint returns 200 with the original charge response
And no duplicate charge is created
- Given POST /v1/charges returns 500
When the client has no Idempotency-Key
Then the spec must state: "Do not retry. Contact support with your
request ID. Retrying may result in a duplicate charge."
- Given the payment provider takes longer than 25 seconds
When the endpoint has not yet received a response
Then the endpoint must NOT return a 500 before the payment settles.
The endpoint must wait for the provider response or time out at 45s.
None of these criteria existed. The team had never made an explicit decision about what a 500 meant in terms of retry safety. Each engineer on the team, if asked, would have given a different answer about whether it was safe to retry after a 500. The incident resolved that question in the most expensive way possible.
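The first criterion above can be written as an executable check. What follows is a minimal in-memory sketch under stated assumptions, not the implementation we eventually shipped: `ChargeService`, its fields, and the `ch_` ID format are hypothetical, and the sketch ignores timing, the in-progress case, and key expiry.

```python
from dataclasses import dataclass, field


@dataclass
class ChargeService:
    """In-memory sketch; a real store would need persistence and key expiry."""
    results: dict = field(default_factory=dict)  # idempotency key -> stored response
    provider_calls: int = 0

    def create_charge(self, idempotency_key, customer_id, amount_cents):
        if idempotency_key in self.results:
            # Replay: return the original outcome and never call the provider again.
            return 200, self.results[idempotency_key]
        self.provider_calls += 1  # stands in for the real provider call
        body = {
            "charge_id": f"ch_{self.provider_calls}",
            "customer_id": customer_id,
            "amount_cents": amount_cents,
            "status": "success",
        }
        self.results[idempotency_key] = body
        return 201, body


service = ChargeService()
# Given a charge submitted with Idempotency-Key "key-xyz" ...
first_status, first_body = service.create_charge("key-xyz", "cus_123", 4999)
# ... when the client retries with the same key ...
retry_status, retry_body = service.create_charge("key-xyz", "cus_123", 4999)
# ... then it gets 200 with the original response and no duplicate charge.
assert retry_status == 200
assert retry_body == first_body
assert service.provider_calls == 1
```

Writing the criterion this way forces the decision the team never made: the retry has to return something specific, and the provider call count has to stay at one.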
What the spec should have said
A spec that follows spec-first principles for billing endpoints would have included an explicit idempotency policy before implementation started:
## POST /v1/charges — Idempotency requirements
This endpoint MUST support idempotency keys. Clients MUST send an
Idempotency-Key header on every request.
Idempotency-Key:
Format: UUID v4, client-generated
Required: yes — return 400 MISSING_IDEMPOTENCY_KEY if absent
Deduplication window: 24 hours
Behavior on duplicate key within window:
If original charge succeeded: return 200 with original charge body.
If original charge is still processing: return 409 IDEMPOTENCY_IN_PROGRESS.
If original charge failed (declined/error): return 200 with the original
failure response. The client should not retry without user action.
Provider timeout handling:
The endpoint's timeout to the payment provider is 45 seconds.
If the provider does not respond within 45 seconds, the endpoint returns
503 SERVICE_UNAVAILABLE — not 500.
503 means: the charge status is unknown. Do not retry. Use
GET /v1/charges/status/{idempotency_key} to check the outcome.
Retry guidance for clients:
500 — do not retry. The charge state is unknown. Use the status endpoint.
503 — do not retry immediately. Check status endpoint first.
429 — retry after retry_after seconds.
402 — do not retry. Payment was declined. Prompt the user for a new method.
The missing status endpoint
A missing piece of the original design was a status endpoint for charges. When a client receives a 500 or a timeout, it has no safe way to determine whether the charge completed without risking a duplicate by retrying. A status endpoint resolves this:
GET /v1/charges/status/{idempotency_key}
Returns the outcome of a charge submitted with the given key.
Response:
200 with status: "success" — charge completed, body contains charge object
200 with status: "failed" — charge failed, body contains failure reason
200 with status: "pending" — charge is still processing, poll this endpoint again in 5s
404 — no charge found for this key (key expired or
key never used)
With this endpoint specced before implementation, the checkout service could poll for status after a timeout rather than blindly retrying the charge. That closes off the double-charge path from the incident: check status, find the charge succeeded, and return success to the frontend without issuing a second charge.
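A rough sketch of that polling flow on the client side, assuming a hypothetical `get_status` callable that wraps GET /v1/charges/status/{idempotency_key} and returns an HTTP status plus a parsed body. The function name, the return labels, and the six-poll limit are illustrative choices; only the 5-second interval and the status values come from the spec above.

```python
import time
from typing import Callable, Tuple


def resolve_unknown_charge(
    get_status: Callable[[str], Tuple[int, dict]],
    idempotency_key: str,
    max_polls: int = 6,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the status endpoint after a 500/503 instead of re-submitting the charge."""
    for attempt in range(max_polls):
        http_status, body = get_status(idempotency_key)
        if http_status == 404:
            return "not_found"  # no charge recorded under this key
        if body["status"] in ("success", "failed"):
            return body["status"]
        if attempt < max_polls - 1:
            sleep(poll_interval_s)  # "pending": wait, then check again
    return "still_pending"  # give up polling; do NOT fall back to a new charge


# Usage with canned responses: one "pending", then "success".
responses = iter([
    (200, {"status": "pending"}),
    (200, {"status": "success", "charge": {"id": "ch_1"}}),
])
outcome = resolve_unknown_charge(
    lambda key: next(responses), "key-xyz", sleep=lambda seconds: None
)
print(outcome)  # success
```

What the caller should do on `not_found` or `still_pending` is itself a decision the spec has to make. The sketch only guarantees that the client never issues a second charge while the first one is unresolved.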
The review that never happened
The PR that introduced POST /v1/charges was reviewed and approved in two hours. I was actually one of the reviewers. I checked the implementation against the spec, confirmed it matched, and approved. The implementation was correct — the spec was the problem. That's the part that still bothers me: I reviewed code against a spec and the spec was wrong, and I didn't notice because I was reviewing the wrong thing.
A stronger spec review would have asked these questions before the PR was even opened:
- What does a client do when this endpoint returns 500?
- Is it safe to retry this endpoint without risk of duplicate side effects?
- What is the payment provider timeout, and what status does the endpoint return if the provider times out mid-processing?
- How does a client determine whether a timed-out charge completed?
None of these questions appear in the original spec. A spec review checklist that explicitly asks "is this endpoint retry-safe, and is that documented?" would have surfaced the gap at the cheapest possible moment — before any code was written.
The real cost
| Impact | Detail |
|---|---|
| Duplicate charges | 847 customers, $127,000 total |
| Incident response | 3 days, 2 engineers full-time on manual refunds |
| Coordination | 1 EM on calls with payment provider + support escalations |
| Change freeze | 2 weeks — no payment endpoint changes while retrofitting |
| Retrofit work | 8 engineer-days: Redis idempotency store, status endpoint, SDK docs |
| Prevention cost | ~2 hours of spec writing |
That last row is the one that still gets me. The retrofit took 8 engineer-days. Writing the spec section that would have prevented the incident would have taken 2 hours.
Retrofitting specs after the fire
The postmortem action items for the team included writing the spec sections that were missing. This is common — teams frequently write specs after incidents that reveal what should have been documented. The problem is that the spec cannot prevent the incident it is documenting. It can only prevent the next one.
The more durable lesson is to build spec review checklists that ask the hard questions proactively. For payment endpoints, that checklist must include:
- Does the endpoint require an idempotency key?
- Is the retry safety policy documented, including what each error code means for retries?
- Is there a status endpoint for operations where the client may lose track of the outcome?
- Is the provider timeout documented, and does the endpoint behavior on timeout match the documented behavior?
- Are the acceptance criteria written for the timeout and retry scenarios, not just the happy path?
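One way to make a checklist like this hard to skip is to encode it so a review tool can report which questions a spec leaves open. This is only a sketch of the idea: the field names and the `unanswered_questions` helper are hypothetical, and how a team records the answers (front matter, a review form, a YAML file) is left open.

```python
# Checklist questions keyed by a hypothetical field name for each answer.
PAYMENT_SPEC_CHECKLIST = {
    "idempotency_key_required": "Does the endpoint require an idempotency key?",
    "retry_policy_per_error_code": "Is the retry safety policy documented for each error code?",
    "status_endpoint": "Is there a status endpoint for operations whose outcome the client may lose track of?",
    "provider_timeout_behavior": "Is the provider timeout documented, with matching endpoint behavior?",
    "failure_acceptance_criteria": "Are acceptance criteria written for the timeout and retry scenarios?",
}


def unanswered_questions(spec_answers: dict) -> list:
    """Return the checklist questions the spec does not yet answer."""
    return [
        question
        for field_name, question in PAYMENT_SPEC_CHECKLIST.items()
        if not spec_answers.get(field_name)
    ]


# The pre-incident spec for POST /v1/charges answered none of these.
original_spec = {}
gaps = unanswered_questions(original_spec)
print(len(gaps))  # 5
```

A spec with a non-empty `gaps` list would not be considered done, which is the same rule as the checklist, just enforced mechanically.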
The one question that would have changed everything
If we had reviewed the spec before writing code (really reviewed it, not just skimmed it), someone would have asked: "what does the client do when this returns 500?" That question starts a 30-minute conversation that ends with an explicit idempotency policy. At spec review time, that conversation costs half an hour.
At incident time, it cost 847 customers their trust and three days of engineering bandwidth. I'm not sure we ever fully recovered the trust part.
This is now the first question on my personal spec review checklist for any endpoint that touches payments or external state: is this endpoint retry-safe, and is that documented? If the spec doesn't answer it, the spec isn't done.
Editorial note
This article covers Postmortem: Preventing a Billing Incident with Spec-First for software delivery teams. Examples are illustrative engineering scenarios, not legal, tax, or investment advice.
- Author details: Daniel Marsh