Postmortem: How a Missing Contract Test Let a Breaking Change Reach Production
A provider team renamed a single JSON field in a cleanup PR. The OpenAPI spec was updated. All provider-side tests passed. No contract test existed between the two services. The consumer silently started receiving null values for a required field, and 340 orders were created with missing SKU data before a warehouse operator noticed blank pick lists four hours later. This is the reconstructed postmortem.
What happened
The Catalog service team was running a routine cleanup sprint. One of the PRs renamed a JSON response field from product_sku to item_sku across the GET /v2/catalog/products/{id} endpoint. The rename aligned the field name with the team's internal domain model, where "item" had replaced "product" in the bounded context six months earlier. The OpenAPI spec was updated in the same PR. The change looked clean: one field rename, spec updated, all Catalog service tests passing.
The Order service consumed this endpoint. It parsed the product_sku field from the Catalog response to populate the SKU column on every new order. After the Catalog deploy, the Order service started receiving responses where product_sku no longer existed. The Order service's deserialization logic did not fail on the missing field. It silently assigned null to the SKU column and continued processing the order.
Here is what the Catalog response looked like before and after the rename:
Before the rename:

    {
      "id": "cat-9821",
      "product_sku": "WH-4420-BLU",
      "name": "Widget Housing — Blue",
      "price_cents": 1499,
      "currency": "USD"
    }

After the rename:

    {
      "id": "cat-9821",
      "item_sku": "WH-4420-BLU",
      "name": "Widget Housing — Blue",
      "price_cents": 1499,
      "currency": "USD"
    }

The Order service continued parsing product_sku. The field was gone. The value resolved to null. Orders kept flowing, now with blank SKU data. No error was thrown, no alert fired, no test failed. The issue was discovered four hours later when a warehouse operator reported that pick lists were printing with blank item codes, making it impossible to locate products on the shelves.
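The silent failure is easy to reproduce. The following is a minimal Python sketch, not the Order service's actual code (its language and deserialization library are not stated in this postmortem). The function names `parse_lenient` and `parse_strict` are hypothetical, and the payload is a trimmed copy of the post-rename response. It contrasts a mapping that treats product_sku as optional with one that treats it as required.

```python
from typing import Optional

# Trimmed copy of the Catalog response after the rename.
response_after = {
    "id": "cat-9821",
    "item_sku": "WH-4420-BLU",
    "price_cents": 1499,
}


def parse_lenient(payload: dict) -> Optional[str]:
    """Treat product_sku as optional: a missing key becomes None."""
    return payload.get("product_sku")


def parse_strict(payload: dict) -> str:
    """Treat product_sku as required: a missing or null value is an error."""
    sku = payload.get("product_sku")
    if sku is None:
        raise ValueError("Catalog response is missing required field 'product_sku'")
    return sku


# Lenient parsing yields None, so the order would still be saved with a blank SKU.
print(parse_lenient(response_after))

# Strict parsing fails loudly at the point where the contract is broken.
try:
    parse_strict(response_after)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Strict parsing would have turned four hours of silent data loss into an immediate, visible error, at the cost of rejecting orders until the mismatch was fixed.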
Timeline
Monday
| Offset | Time | Event |
|---|---|---|
| T-0 | 10:14 AM | PR #3847 merged: renames product_sku → item_sku in Catalog service. OpenAPI spec updated in same commit. All provider-side CI checks pass. |
| T+15m | 10:29 AM | Catalog service deployed to production via automated pipeline. No deployment gate checks consumer compatibility. |
| T+45m | 10:59 AM | Order service begins receiving null for product_sku on all new orders. No error is logged because the field is mapped as optional in the Order service's deserialization model. |
| T+2h | 12:14 PM | 340 orders created with null SKU data. Order processing continues normally. No alerting rule monitors null rate on the SKU column. |
| T+4h | 02:14 PM | Warehouse operator reports blank pick lists to the operations Slack channel. Pick lists show order line items with no item code. Warehouse team cannot fulfill orders without manually looking up each product. |
| T+4h 12m | 02:26 PM | On-call engineer begins investigation. Traces blank SKU to null values in the orders table. Checks Catalog service response and finds product_sku field is gone, replaced by item_sku. Identifies PR #3847 as the cause. |
| T+4h 30m | 02:44 PM | Catalog service rolled back to the previous version. product_sku field restored in production responses. |
| T+5h | 03:14 PM | Data backfill script executed. Queries Catalog service for the 340 affected order line items, retrieves the correct SKU values, and patches the orders table. |
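The backfill in the final timeline step can be sketched as follows. This is a hedged illustration, not the script that was actually run: the in-memory `orders` rows, the `catalog_id` and `order_id` columns, and the `fetch_sku` callable are assumptions standing in for the real orders table and Catalog client.

```python
from typing import Callable, Optional


def backfill_skus(
    orders: list[dict], fetch_sku: Callable[[str], Optional[str]]
) -> list[str]:
    """Fill in missing SKUs in place and return the IDs of the patched orders."""
    patched = []
    for order in orders:
        if order["sku"] is not None:
            continue  # unaffected order: leave it untouched
        sku = fetch_sku(order["catalog_id"])
        if sku is None:
            continue  # still unresolved: leave for manual follow-up
        order["sku"] = sku
        patched.append(order["order_id"])
    return patched


# Hypothetical data: a stand-in Catalog lookup and three order rows.
catalog = {"cat-9821": "WH-4420-BLU"}
orders = [
    {"order_id": "o-1", "catalog_id": "cat-9821", "sku": None},
    {"order_id": "o-2", "catalog_id": "cat-9821", "sku": "WH-4420-BLU"},
    {"order_id": "o-3", "catalog_id": "cat-0000", "sku": None},
]

patched_ids = backfill_skus(orders, catalog.get)
print(patched_ids)
```

Because the function only touches rows whose SKU is null, it is safe to run more than once: a second pass patches nothing that the first pass already fixed.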
Root cause analysis
The immediate cause was a breaking field rename deployed without consumer notification. But the root cause was structural: no mechanism existed to detect that a field rename would break a consumer, and no process required checking.
Four specific gaps contributed to the incident:
1. The OpenAPI spec existed but nothing enforced it against consumers. The Catalog team maintained an OpenAPI spec and updated it in the same PR as the field rename. The spec was accurate. But accuracy is not the same as enforcement. No tool compared the new spec against the previous version to flag breaking changes. A tool like oasdiff would have detected the removed field and blocked the merge.
2. No consumer-driven contract test existed. The Order service had no Pact contract or equivalent test that declared: "the Order service expects a field called product_sku in the Catalog response." A consumer-driven contract test would have run in the Catalog service's CI pipeline and failed when the field was renamed, before the PR could merge.
3. The PR review caught the spec update but not the consumer impact. Two engineers reviewed PR #3847. Both confirmed that the spec matched the code change. Neither asked: "who consumes this field?" The spec review checklist in use at the time did not include a consumer impact section for field-level changes. The review verified internal consistency (code matches spec) but not external compatibility (spec change is safe for consumers).
4. Provider-side tests passed because they tested the new field name. The Catalog service's integration tests were updated in the same PR to assert item_sku instead of product_sku. From the provider's perspective, every test was correct. The tests verified what the provider produced, not what consumers expected to receive. This is the fundamental limitation of provider-only testing: it cannot detect consumer breakage because it does not know what consumers depend on.
Impact
The direct impact was contained to a single business function (order fulfillment), but the downstream effects touched multiple teams and required manual intervention at every step.
| Category | Detail |
|---|---|
| Orders affected | 340 orders with null SKU data over a 4-hour window |
| Order processing | 4 hours of degraded fulfillment — orders created but not fulfillable |
| Warehouse impact | 87 orders already sent to pick lists; warehouse team manually cross-referenced each one against the product catalog |
| Engineering time | ~6 person-hours: investigation (30 min), rollback (20 min), backfill script (1h), verification (30 min), postmortem (3.5h) |
| Warehouse operations time | ~3 person-hours: manual SKU lookup for 87 orders across 4 warehouse staff |
| Deployment freeze | Catalog service deployments paused for 24 hours pending review |
| Total estimated cost | 9 person-hours of unplanned work + 4 hours of degraded order fulfillment |
No customer-facing impact occurred in terms of order delivery — the backfill corrected all 340 orders before any shipped. But the 87 orders that had already reached the warehouse floor required manual intervention by the picking team, slowing fulfillment throughput for the remainder of the shift.
Five action items
The postmortem produced five concrete action items, each with a deliverable, owner, and deadline. These are ordered by blast radius: the first two prevent recurrence of this specific incident; the remaining three address the systemic gaps that allowed it.
| # | Action | Owner | Deadline | Deliverable |
|---|---|---|---|---|
| 1 | Add oasdiff to Catalog service CI to detect breaking changes before merge | Catalog team tech lead | Sprint +1 | CI step that fails on removed/renamed fields. See: Contract Testing Plan from OpenAPI to CI |
| 2 | Implement Pact consumer-driven contract tests for Order → Catalog dependency | Order team + Catalog team (joint) | Sprint +2 | Pact contract published by Order service; verified in Catalog CI. See: Building a Test Harness for API Services |
| 3 | Add spec review gate requiring consumer team sign-off on field changes | Engineering manager | Sprint +1 | Updated PR template with "Consumer Impact" section. See: Spec Review Checklist Before Coding |
| 4 | Create alerting rule for null-rate spike on required order fields | Order team on-call | Sprint +1 | Datadog monitor: alert if null rate on sku column exceeds 1% over a 15-minute window |
| 5 | Document field rename policy in API governance guide | Platform team | Sprint +2 | Written policy: field renames require deprecation period + consumer notification. See: Versioning Strategies for API Contracts |
Action 1: Breaking change detection with oasdiff
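The highest-leverage fix is automated. oasdiff compares two versions of an OpenAPI spec and reports breaking changes such as removed fields, changed types, and narrowed enums. A rename is reported as the removal of the old field plus the addition of the new one. Adding it as a CI step in the Catalog service pipeline means that PR #3847 would have failed CI with a message along the following lines. The spec file names are placeholders, `--fail-on WARN` makes the step exit non-zero on warnings as well as errors, and the output is an illustrative summary rather than verbatim oasdiff output:

    oasdiff breaking openapi-base.yaml openapi-revision.yaml --fail-on WARN

    BREAKING CHANGES DETECTED:
    GET /v2/catalog/products/{id}
    Response 200:
    - Field 'product_sku' was removed (breaking)
    - Field 'item_sku' was added (non-breaking)
    1 breaking change(s) found. Pipeline blocked.
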
This check runs in seconds, requires no consumer-side setup, and catches the most common category of accidental breaking change: field removal or rename.
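Conceptually, the check compares the response properties declared in two versions of the spec. The sketch below is not how oasdiff is implemented, and it ignores types, nesting, and `$ref` resolution. It only illustrates why a rename surfaces as a removal plus an addition. The schema fragments and the `diff_properties` helper are hypothetical.

```python
def diff_properties(base: dict, revision: dict) -> dict:
    """Compare the top-level property names of two response schemas."""
    base_props = set(base.get("properties", {}))
    rev_props = set(revision.get("properties", {}))
    return {
        "removed": sorted(base_props - rev_props),  # breaking for consumers
        "added": sorted(rev_props - base_props),    # non-breaking
    }


# Hypothetical schema fragments for the 200 response, before and after the PR.
base_schema = {
    "properties": {"id": {"type": "string"}, "product_sku": {"type": "string"}}
}
revision_schema = {
    "properties": {"id": {"type": "string"}, "item_sku": {"type": "string"}}
}

result = diff_properties(base_schema, revision_schema)
print(result)

# A CI step would exit non-zero at this point so the merge is blocked.
if result["removed"]:
    print("breaking change: removed", result["removed"])
```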
Action 2: Consumer-driven contract tests with Pact
oasdiff catches spec-level changes but cannot verify that consumers actually parse specific fields correctly. A Pact contract test closes this gap. The Order service publishes a contract declaring: "when the Order service calls GET /v2/catalog/products/{id}, the response must contain a field called product_sku of type string." This contract is verified in the Catalog service's CI pipeline. If the Catalog service renames the field, the Pact verification fails before merge.
The combination of oasdiff (spec-level) and Pact (runtime-level) creates two independent layers of protection. Either one alone would have prevented this incident. Together, they catch breaking changes that one layer might miss — for example, a response that matches the spec but returns unexpected values due to a logic change.
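The idea behind a consumer-driven contract can be shown without the Pact library. The sketch below is a plain-Python stand-in, not Pact's API: the `ORDER_SERVICE_CONTRACT` structure and the `verify_contract` helper are hypothetical. The consumer declares the fields and types it actually reads, and the provider's CI checks a response against that declaration.

```python
# Published by the consumer (Order service): only the fields it reads.
ORDER_SERVICE_CONTRACT = {
    "request": "GET /v2/catalog/products/{id}",
    "response_fields": {"id": str, "product_sku": str},
}


def verify_contract(contract: dict, response: dict) -> list[str]:
    """Return a list of violations; an empty list means the provider complies."""
    violations = []
    for field, expected_type in contract["response_fields"].items():
        if field not in response:
            violations.append(f"missing field '{field}'")
        elif not isinstance(response[field], expected_type):
            violations.append(f"field '{field}' is not {expected_type.__name__}")
    return violations


# Provider responses before and after the rename (trimmed).
before = {"id": "cat-9821", "product_sku": "WH-4420-BLU", "price_cents": 1499}
after = {"id": "cat-9821", "item_sku": "WH-4420-BLU", "price_cents": 1499}

print(verify_contract(ORDER_SERVICE_CONTRACT, before))
print(verify_contract(ORDER_SERVICE_CONTRACT, after))
```

Note that the contract lists only what the consumer depends on. Fields the Order service never reads, such as price_cents here, can change without failing verification.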
Action 3: Consumer impact review gate
Tooling catches known patterns. A spec review gate catches the patterns that tooling does not yet cover. The updated PR template now includes a "Consumer Impact" section for any PR that modifies an API response schema:

    ## Consumer Impact
    Fields modified: product_sku → item_sku (renamed)
    Known consumers: Order service, Analytics pipeline, Partner API
    Consumer teams notified: [ ] Order team [ ] Data team [ ] Partner team
    Breaking change: Yes / No
    Migration plan: [link to deprecation timeline or N/A]

The PR cannot be merged until the "Consumer teams notified" checkboxes are checked. This is a process gate, not a technical one, and it requires the PR author to answer a question that PR #3847 never asked: who reads this field?
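The gate itself is a process control, but a small lint can at least confirm that the section was filled in. The following is a hedged sketch under stated assumptions: the template puts one field per line, a checked box is written `[x]`, and the `unchecked_consumer_teams` helper is hypothetical rather than part of any existing tooling.

```python
import re


def unchecked_consumer_teams(pr_body: str) -> list[str]:
    """Return the consumer teams whose notification checkbox is still empty."""
    for line in pr_body.splitlines():
        if line.startswith("Consumer teams notified:"):
            # "[ ] Team" is unchecked; "[x] Team" is checked and is skipped.
            return [name.strip() for name in re.findall(r"\[ \]\s*([^\[]+)", line)]
    raise ValueError("PR description has no 'Consumer teams notified' line")


# Hypothetical PR description with one of three teams notified.
pr_body = """## Consumer Impact
Fields modified: product_sku → item_sku (renamed)
Known consumers: Order service, Analytics pipeline, Partner API
Consumer teams notified: [x] Order team [ ] Data team [ ] Partner team
Breaking change: Yes
"""

remaining = unchecked_consumer_teams(pr_body)
print(remaining)
```

A lint like this verifies only that boxes were ticked, not that the teams were actually told, which is why the review step still matters.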
Action 4: Null-rate alerting on required fields
Even with detection and review gates, defense in depth requires runtime monitoring. The Order service now has a Datadog monitor that alerts when the null rate on the sku column exceeds 1% over any 15-minute window. Under normal operation, the null rate is 0% — every order has a SKU. A spike to 100% (as happened during this incident) would trigger a PagerDuty alert within 15 minutes of the first affected order, reducing the detection window from 4 hours to under 20 minutes.
This alert would not have prevented the incident, but it would have reduced the blast radius from 340 orders to approximately 30.
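The production monitor would be expressed as a Datadog query, which is not shown here. The Python sketch below only illustrates the calculation behind it, plus the rough arithmetic for the reduced blast radius. The `null_rate` helper, the sample orders, and the timestamp are hypothetical, and the estimate assumes orders arrive at a steady rate.

```python
from datetime import datetime, timedelta
from typing import Optional

WINDOW = timedelta(minutes=15)
THRESHOLD = 0.01  # 1% null rate, as in the action item


def null_rate(orders: list[tuple[datetime, Optional[str]]], now: datetime) -> float:
    """Fraction of orders created within the last WINDOW whose SKU is null."""
    recent = [sku for created_at, sku in orders if now - WINDOW <= created_at <= now]
    if not recent:
        return 0.0  # no traffic in the window: nothing to alert on
    return sum(sku is None for sku in recent) / len(recent)


now = datetime(2024, 1, 1, 11, 14)  # arbitrary timestamp for the example
orders = [
    (now - timedelta(minutes=40), "WH-4420-BLU"),  # outside the window
    (now - timedelta(minutes=10), None),
    (now - timedelta(minutes=5), None),
    (now - timedelta(minutes=1), "WH-4420-BLU"),
]

rate = null_rate(orders, now)
print(rate, rate > THRESHOLD)

# Rough estimate: 340 orders over about 4 hours, detected after 20 minutes.
orders_per_minute = 340 / (4 * 60)
print(round(orders_per_minute * 20))
```

At roughly 1.4 orders per minute, a 20-minute detection window corresponds to about 28 affected orders, consistent with the estimate of approximately 30.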
Action 5: Field rename policy in API governance
The final action item addresses the organizational gap. The API governance guide now includes a field rename policy:
- Field renames are treated as breaking changes, regardless of whether the old field name was "incorrect."
- The migration path is: add the new field alongside the old field, notify consumers, set a deprecation deadline (minimum 2 sprints), remove the old field only after all consumers have migrated.
- Direct renames (remove old field, add new field in the same PR) are prohibited for any field that appears in a versioned API response.
Under this policy, PR #3847 would have added item_sku alongside product_sku, with a deprecation notice in the OpenAPI spec's description field. The Order service team would have migrated at their own pace, and the old field would be removed only after the Order service confirmed it was no longer in use.
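On the provider side, the migration path amounts to emitting both names for the duration of the deprecation period. The sketch below is a minimal illustration, not the Catalog service's code: the `serialize_product` function, the internal `sku` attribute, and the `include_deprecated` switch are assumptions.

```python
def serialize_product(product: dict, include_deprecated: bool = True) -> dict:
    """Build the Catalog response, emitting both names during deprecation."""
    body = {
        "id": product["id"],
        "item_sku": product["sku"],  # new field name
        "price_cents": product["price_cents"],
    }
    if include_deprecated:
        # Deprecated alias: dropped only after all consumers have migrated.
        body["product_sku"] = product["sku"]
    return body


# Hypothetical internal record for the product used in the earlier examples.
product = {"id": "cat-9821", "sku": "WH-4420-BLU", "price_cents": 1499}

during_migration = serialize_product(product)
after_migration = serialize_product(product, include_deprecated=False)
print(during_migration)
print(after_migration)
```

Both fields are populated from the same source value, so consumers can switch to item_sku at any point in the deprecation period without seeing a different result.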
What the spec should have caught
The OpenAPI spec for the Catalog service was accurate at every point during this incident. Before the rename, it documented product_sku. After the rename, it documented item_sku. The spec was never wrong. It was just never checked against the people who depended on it.
This is the gap that separates a spec from a contract. A spec describes what a service produces. A contract describes what a service produces and what its consumers expect to receive. The spec was updated. The contract did not exist. The distance between those two facts is exactly 340 orders with missing SKU data.
If the spec review had included a consumer impact check — a single question, "who reads this field?" — the rename would have been flagged before the PR was approved. The Catalog team would have discovered that the Order service, the Analytics pipeline, and the Partner API all consumed product_sku. The rename would have become a migration, not a cleanup.
The fix is not only tooling. Tools like oasdiff and Pact automate detection, but they operate on patterns that have already been encoded. The deeper fix is adding "consumer impact" as a required section in the spec template for any field-level change. This forces the question before the tooling runs, at the point where the answer is cheapest to act on.
The billing incident postmortem on this site reached a similar conclusion from a different angle: the spec existed, the spec was accurate, and the spec still failed to prevent the incident because it did not address the question that mattered. In that case, the missing question was "is this endpoint retry-safe?" In this case, the missing question was "who depends on this field?" Both questions cost minutes to ask during spec review and hours to answer after production breaks.
Every field in an API response is a promise to someone. The spec documents the promise. The contract test verifies it. The review gate ensures someone asks whether the promise is about to be broken. All three layers are necessary because each one catches failures that the other two miss. The field rename that caused this incident would have been stopped by any one of them.
Editorial note
This article, written for software delivery teams, presents a reconstructed postmortem of a breaking API change that reached production because no contract test existed. Examples are illustrative engineering scenarios, not legal, tax, or investment advice.
- Author details: Daniel Marsh