SDD Evidence Log Template

Use the evidence log when review approval is not enough. It records how each acceptance criterion was proven, where the proof lives, and what signal should stop a rollout.

evidence.md

# Evidence Log

Spec:
Release:
Owner:
Date:

## Acceptance Evidence
| Criterion | Evidence | Link | Result |
| --- | --- | --- | --- |
| AC-1 | Test | | Pass/Fail |
| AC-2 | Screenshot | | Pass/Fail |

## Operational Evidence
- Log query:
- Metric:
- Alert:
- Stop signal:

## Manual Checks
- [ ] ...

## Known Gaps
- Gap:
- Risk:
- Owner:
- Follow-up date:

When to use this template

The release touches money, data integrity, permissions, or customer-visible state.
A reviewer needs to verify proof instead of trusting a summary.
A team wants stable release notes and rollback context.
AI-generated code must provide concrete evidence before merge.

What a filled version looks like

The template becomes useful once it carries a real decision, a named owner, and concrete evidence. Aim for this level of specificity:

| Criterion | Evidence | Link | Result |
| --- | --- | --- | --- |
| AC-2 duplicate replay | Integration test | refund_timeout_replay | Pass |
| AC-3 support block | Screenshot | support-refund-pending.png | Pass |

Stop signal: duplicate_refund_attempts > 0.5% for 15 minutes.
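A stop signal written this way can be evaluated mechanically. The sketch below is one possible reading, not part of the template: it assumes duplicate_refund_attempts is reported as a ratio (0.005 means 0.5%), sampled once per minute, and that "for 15 minutes" means every one of the last 15 samples is above the threshold. The function name `should_stop_rollout` is hypothetical.

```python
from typing import Sequence

THRESHOLD = 0.005     # 0.5% expressed as a ratio (assumption about the metric's unit)
WINDOW_MINUTES = 15   # sustained duration named in the stop signal


def should_stop_rollout(samples: Sequence[float],
                        threshold: float = THRESHOLD,
                        window: int = WINDOW_MINUTES) -> bool:
    """Return True when the last `window` per-minute samples all exceed `threshold`."""
    if len(samples) < window:
        return False  # not enough data yet to claim a sustained breach
    return all(value > threshold for value in samples[-window:])
```

Requiring every sample in the window to breach the threshold avoids stopping a rollout on a single spike. A team that prefers an averaged window should state that in the stop signal itself, so the log and the alert agree.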

Field note: proof for a risky release

A payment change passes unit tests, but the release owner still needs proof that replay, support blocking, and rollout monitoring work together. The evidence log turns those claims into artifacts.

Common failure: If evidence is reduced to "QA checked it", the next reviewer cannot reproduce the result or know which signal should stop rollout.

Reviewer action: Ask reviewers to inspect each acceptance criterion and reject any row that does not name a test, screenshot, log query, metric, or owner.
Evidence bar: The best filled version includes at least one automated test, one human-inspectable artifact, and one production-facing stop signal.
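The reviewer action above can be partly automated before a human looks at the log. The following is a minimal sketch, assuming the acceptance table keeps the four column names from the template (Criterion, Evidence, Link, Result). The helpers `parse_rows` and `rejected_rows` are hypothetical, not part of any existing tool.

```python
def parse_rows(markdown: str) -> list[dict[str, str]]:
    """Collect table rows as dicts keyed by the header cells of each pipe table."""
    rows: list[dict[str, str]] = []
    header = None
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            header = None  # left the table; the next pipe line starts a new one
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
            continue
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # separator row such as | --- | --- |
        rows.append(dict(zip(header, cells)))
    return rows


def rejected_rows(markdown: str) -> list[str]:
    """List rows with an empty Evidence or Link cell, or an unresolved Result."""
    problems = []
    for row in parse_rows(markdown):
        missing = [name for name in ("Evidence", "Link") if not row.get(name)]
        if row.get("Result") not in ("Pass", "Fail"):
            missing.append("Result")  # e.g. the "Pass/Fail" placeholder was left in
        if missing:
            problems.append(f"{row.get('Criterion', '?')}: fill in {', '.join(missing)}")
    return problems
```

Run against the empty template, this flags both rows. Run against the filled example, it returns an empty list. It only checks that each cell is populated; whether the linked artifact actually proves the criterion is still a human judgment.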

How to adapt this template without making it generic

Do not only replace the title and date. A useful version turns every placeholder into a reviewable decision: who owns the change, which behavior must be true, which scope is explicitly excluded, and what evidence must exist before merge. If a field cannot be filled yet, keep it as an open question instead of burying the uncertainty in prose.

When you use this evidence.md, start with the part most likely to cause rework. For many teams that is not the implementation step; it is the boundary, exception, compatibility rule, or release evidence. The earlier the template exposes those decisions, the less room an AI coding tool or rushed engineer has to broaden the change silently.

Review for: Each acceptance criterion has a proof type and result.
Strong wording target: AC-2 is covered by refund_timeout_replay, AC-3 is verified by a support UI screenshot, and rollout stops if duplicate_refund_attempts exceeds 0.5% for 15 minutes.

Suggested review path

Use the first pass to review scope: the goal should be singular, the non-goals should block common expansions, and the affected systems should be named. Use the second pass to review testability: acceptance criteria should describe state, trigger, and observable result, not a vague wish that the product feels better. Use the third pass to review evidence: tests, screenshots, logs, metrics, or manual checks should prove each criterion.

Before giving this template to an AI coding tool, ask a human reviewer to confirm allowed files, interfaces that must not change, migration order, and stop signals. The AI should receive an executable spec, not a prompt that looks complete while still leaving the risky decisions implicit.

Before implementation: confirm open questions do not block behavior decisions.
During implementation: map every task back to a criterion or constraint in this file.
Before merge: prove the result with evidence, not only with a "tests passed" sentence.
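The "map every task back to a criterion" rule is easier to enforce when the mapping is written down. Here is a minimal sketch, assuming tasks carry hypothetical IDs such as T1 and each task lists the criteria it serves. Neither the ID scheme nor the `trace_gaps` function comes from the template.

```python
def trace_gaps(criteria: set[str],
               task_links: dict[str, set[str]]) -> tuple[list[str], list[str]]:
    """Return (tasks that serve no known criterion, criteria that no task serves)."""
    orphan_tasks = sorted(task for task, refs in task_links.items()
                          if not refs & criteria)
    covered = set().union(*task_links.values())
    uncovered = sorted(criteria - covered)
    return orphan_tasks, uncovered


if __name__ == "__main__":
    criteria = {"AC-1", "AC-2", "AC-3"}
    tasks = {"T1": {"AC-1"}, "T2": {"AC-2"}, "T3": set()}
    print(trace_gaps(criteria, tasks))  # (['T3'], ['AC-3'])
```

An orphan task is a hint that scope has broadened silently. An uncovered criterion is a hint that something will reach merge without evidence.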

Review before implementation

Each acceptance criterion has a proof type and result.
Operational signals are concrete queries, dashboards, metrics, or alerts.
Known gaps are owned and dated.
Rollback or stop signal is visible to release reviewers.
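The "owned and dated" check on known gaps lends itself to a small lint. The sketch below assumes the Known Gaps section keeps the four field labels from the template, holds a single gap entry, and records the follow-up date in ISO format (YYYY-MM-DD). The date format and the `known_gap_problems` function are assumptions, not requirements of the template.

```python
import re
from datetime import date

REQUIRED_FIELDS = ("Gap", "Risk", "Owner", "Follow-up date")


def known_gap_problems(markdown: str) -> list[str]:
    """Report empty fields and unparseable dates in the Known Gaps section."""
    section = re.search(r"^## Known Gaps\n(.*?)(?=^## |\Z)", markdown, re.M | re.S)
    if section is None:
        return ["Known Gaps section is missing"]
    # Each field is a bullet of the form "- Label: value".
    fields = dict(re.findall(r"^- ([^:\n]+):[ \t]*(.*)$", section.group(1), re.M))
    problems = [f"{name} is empty" for name in REQUIRED_FIELDS
                if not fields.get(name, "").strip()]
    due = fields.get("Follow-up date", "").strip()
    if due:
        try:
            date.fromisoformat(due)
        except ValueError:
            problems.append("Follow-up date is not an ISO date (YYYY-MM-DD)")
    return problems
```

A log with several gaps would need one block per gap. The dictionary here keeps only the last value seen for each label.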

Weak vs strong wording

Weak

Tests pass and QA checked it.

Strong

AC-2 is covered by refund_timeout_replay, AC-3 is verified by a support UI screenshot, and rollout stops if duplicate_refund_attempts exceeds 0.5% for 15 minutes.

When the template stops being empty

The easiest way for a template page to become thin is to provide a clean skeleton without showing how to judge the filled result. A useful version answers three questions: why this change is worth doing now, which scope is explicitly excluded, and which evidence proves the implementation did not drift.

When you use the template for real work, attach the final file to the pull request and mark any section that changed during implementation. A spec is not a one-time document; it should move with the implementation evidence. Readers copying this template should also copy that habit: every sentence that sounds like a decision should be reviewable and traceable.

Minimum evidence: at least one automated test or contract fixture.
Higher-risk evidence: add screenshots, log queries, metrics, or rollback signals.
Follow-up evidence: give known gaps an owner and review date.
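The first two tiers above can be expressed as a check over the Evidence column. This is a rough sketch: the label sets below are assumptions about how a team names its evidence types, and `missing_tiers` is a hypothetical helper.

```python
# Assumed label vocabularies; adjust to the names your team actually uses.
MINIMUM = {"test", "integration test", "contract fixture"}
HIGHER_RISK = {"screenshot", "log query", "metric", "rollback signal"}


def missing_tiers(evidence_types: list[str], high_risk: bool) -> list[str]:
    """Return the evidence tiers that the listed evidence types do not satisfy."""
    kinds = {kind.strip().lower() for kind in evidence_types}
    missing = []
    if not kinds & MINIMUM:
        missing.append("minimum")
    if high_risk and not kinds & HIGHER_RISK:
        missing.append("higher-risk")
    return missing
```

For the filled example earlier on this page, `missing_tiers(["Integration test", "Screenshot"], high_risk=True)` returns an empty list, because both tiers are represented.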

Where it fits in a complete SDD packet

Do not push every decision into the same file. evidence.md should own the layer it is best at: making one category of decision reviewable, linkable, and updateable. Scope, design, tasks, and evidence should connect to each other, but they should not swallow each other. When implementation reveals new facts, the team should know exactly which artifact needs to change.

In practice, use this template as one step in a short chain: write the spec or proposal, add design or tasks only when the work needs them, then feed evidence back into the pull request. Readers copying the template should copy that chain as well. A polished standalone template does not improve delivery by itself; a traceable set of artifacts does.

If the template becomes a team standard, keep one filled example in the repository instead of only publishing an empty skeleton. The example teaches new contributors what "specific enough" looks like and gives AI coding tools a better pattern to follow.

Upstream input: a concrete user problem, system constraint, and known failure mode.
Downstream output: executable tasks, review questions, test evidence, or release gates.
Maintenance habit: update the matching spec file whenever implementation changes a decision.

FAQ

Is an evidence log required for every change?

No. Use it when correctness needs proof: high-risk releases, API changes, migrations, payment flows, permissions, or AI-generated diffs.

What counts as evidence?

A test, fixture, screenshot, log query, metric, dashboard, alert, manual check, or release gate that proves a specific criterion.

How does this help SEO and content quality?

For readers of this site, it shows a real operational artifact rather than abstract advice. For teams, it turns review into a reproducible practice.

Editorial note

This template is written for spec-driven development workflows. The example is illustrative and should be adapted to your domain.

Author: Spec Coding Editorial Team

Tip: keep evidence.md under /docs/specs/ or /.specs/, and update it in the same pull request as the implementation changes.

Last updated: May 19, 2026.