CDAD User Guide¶
Purpose¶
This guide turns Context-Disciplined Agent Development (CDAD) from a conceptual framework into an operational workflow for teams building software with AI coding agents. It is intended for engineering leads, developers using coding agents day to day, and tool builders who may eventually implement CDAD support. The guide is also a pressure test: wherever the framework remains ambiguous, the guide calls out a Gap so the method can be refined rather than treated as complete.
The guide is based on the earlier CDAD framework brief plus additional evidence on long-running agent harnesses, context engineering, approval boundaries, and resumable agent loops (framework brief; Effective context engineering for AI agents; Effective harnesses for long-running agents; Running agents; Guardrails and human review; Agent approvals & security).
What CDAD Is For¶
CDAD is meant for work where ad hoc agent prompting becomes unreliable because the task spans many turns or sessions, the codebase is too large for naive context stuffing, requirements and tests need to stay aligned over time, or you need traceability from intent to code to verification. This orientation follows both the framework brief and external evidence showing that long-running agent performance depends heavily on harness design, context management, and explicit progress artifacts rather than on model capability alone (framework brief; Effective harnesses for long-running agents; Building Effective AI Coding Agents for the Terminal).
CDAD is not the default for every task. For tiny disposable tasks, direct prompting may still be faster. This is consistent with the earlier brief’s recommendation not to over-govern narrow work and with current agent documentation that warns against bloated workflows and excessive context (framework brief; Claude Code best practices).
Core Mental Model¶
CDAD separates software development into four layers of information:
Durable project memory: Stable rules, conventions, architecture notes, and constraints.
Source-of-truth artifacts: Product specs, requirements, decision records, interfaces, examples.
Task packet: The small runtime bundle an agent receives for one unit of work.
Execution trace: Temporary commands, outputs, tool results, and reasoning residue.
This layering is an operationalized restatement of the framework brief’s core claim that durable knowledge should remain outside the active prompt, while only the context needed for the next verified step should be loaded at runtime (framework brief). It is also consistent with Anthropic’s distinction between prompt engineering and context engineering, where the main design problem is curating the smallest useful set of high-signal tokens rather than maximizing context size (Effective context engineering for AI agents).
Operational rule: Keep durable knowledge outside the active prompt. Load only what is needed for the current verified step.
Roles In CDAD¶
Human Lead¶
Responsible for:
- framing the goal;
- approving risky or ambiguous changes;
- deciding non-goals and trade-offs;
- accepting or rejecting delivered work.
This role exists because current agent systems still need explicit approval paths and guardrails around risky side effects, contract changes, and ambiguous intent (Guardrails and human review; Agent approvals & security).
Lead Agent¶
Responsible for:
- reading the task packet;
- retrieving missing local context;
- planning and executing the next verified increment;
- updating progress artifacts;
- stopping at approval boundaries.
This role reflects the standard agent loop documented by OpenAI and Anthropic: read input, call tools, continue until a real stopping point, and resume rather than restart when a run is paused for approvals or tool work (Running agents; How the agent loop works).
Research / Scout Subagent¶
Optional. Used when breadth is needed for codebase reconnaissance, source gathering, or parallel exploration. This matches both Anthropic’s discussion of sub-agent architectures for long-horizon tasks and the framework brief’s limited endorsement of subagents where decomposition clearly helps (Effective context engineering for AI agents; framework brief).
Verifier¶
Responsible for:
- checking claims against sources or artifacts;
- checking that completion claims match evidence;
- validating citations and provenance for research-style outputs.
Reviewer¶
Responsible for:
- flagging unsupported claims, logical gaps, missing controls, edge cases, or risky design choices;
- stress-testing the artifact after it exists.
These verification-oriented roles are extensions of the framework brief’s distinction between synthesis, verification, and review.
The CDAD Lifecycle¶
flowchart TD
A[Frame Goal] --> B[Create or Update Durable Artifact]
B --> C[Distill Next Task Packet]
C --> D[Retrieve Local Context]
D --> E[Execute One Verified Increment]
E --> F{Verification Passed?}
F -- No --> G[Diagnose Failure]
G --> C
F -- Yes --> H[Compress Progress and Update State]
H --> I{More Work?}
I -- Yes --> C
I -- No --> J[Human Acceptance / Release Boundary]

This lifecycle translates the framework brief into a repeatable control loop. It also mirrors current evidence about long-running agents: they perform better when they make incremental progress, maintain resumable artifacts, and treat context as a constrained resource rather than a permanent buffer (framework brief; Effective harnesses for long-running agents; Building Effective AI Coding Agents for the Terminal).
Lifecycle Meaning¶
Frame Goal: decide what the task is actually trying to achieve.
Durable Artifact: record that intent somewhere outside chat.
Task Packet: reduce the next step to a compact executable unit.
Retrieve Context: gather only the files, tests, and notes needed now.
Execute One Verified Increment: do one bounded piece of useful work.
Compress Progress: leave the next session with enough state to continue safely.
This ordering is evidence-backed rather than arbitrary. Anthropic’s long-running harness article found that agents fail when they try to do too much in one session or leave partial undocumented work behind, while official agent-loop docs emphasize that paused or interrupted runs should resume from state rather than restart as fresh turns (Effective harnesses for long-running agents; Running agents).
CDAD Operating Loop For One Task¶
Decide Whether CDAD Is Warranted¶
Use CDAD if at least two of the following are true:
- the task will likely exceed one focused session;
- the agent needs more than a few files to reason correctly;
- correctness depends on explicit constraints or non-obvious rules;
- you need auditability or reusability;
- the task has meaningful downside if the agent drifts.
This threshold is inferential, but it follows from the framework brief’s distinction between narrow tasks and broader governed workflows, and from vendor guidance that long sessions, broad context, and tool-heavy loops all require stricter context management (framework brief; Claude Code best practices; OpenAI GPT-5.5 guide).
Create Or Update The Durable Goal Record¶
At minimum, record:
- objective;
- scope in;
- scope out;
- quality bar;
- risks;
- known constraints;
- expected verification.
This step preserves the strengths of spec-driven development, SPDD, and BDD: intent is made durable and reviewable rather than left inside a transient chat history (Spec Kit docs; SPDD; Cucumber BDD docs).
Template - Goal Record¶
# Goal Record
## Objective
[What outcome is required?]
## Scope In
- ...
## Scope Out
- ...
## Constraints
- ...
## Verification
- test command(s)
- manual check(s)
- acceptance scenario(s)
## Risks / Approval Boundaries
- ...
Illustrative Example - Goal Record¶
# Goal Record
## Objective
Add passwordless email login to the customer portal.
## Scope In
- backend endpoint for issuing magic-link tokens
- email template for login link
- frontend login screen and token consumption flow
- audit log entry for login success/failure
## Scope Out
- social login
- redesign of the whole auth system
- mobile app support
## Constraints
- must reuse existing email provider
- token lifetime max 15 minutes
- no new external auth vendor
- preserve current admin login flow
## Verification
- `npm test -- auth.magic-link`
- `npm run test:e2e -- login-magic-link`
- manual scenario: request link -> receive email -> log in -> expired link rejected
## Risks / Approval Boundaries
- any schema change to the users table requires approval
- any new third-party dependency requires approval
Why this example matters: it shows the difference between a vague request like “add passwordless login” and a usable durable goal record with scope, constraints, and verification.
Distill The Next Task Packet¶
A task packet is the unit the agent should actually consume.
Required Fields¶
- task ID;
- objective;
- why now;
- relevant files/modules;
- interfaces/contracts touched;
- constraints;
- verification commands;
- escalation conditions;
- references to deeper docs;
- short progress note.
This packet concept is proposed by the framework brief, but it is strongly supported by external evidence. Anthropic’s context-engineering article recommends high-signal, minimal context with just-in-time retrieval, while the terminal-agent paper emphasizes context as a budget, dynamic prompt construction, adaptive compaction, and tool-output management rather than indiscriminate context accumulation (framework brief; Effective context engineering for AI agents; Building Effective AI Coding Agents for the Terminal).
Template - Task Packet¶
# Task Packet - [ID]
## Objective
[One concrete thing to accomplish]
## Why This Step
[Why this increment matters now]
## Relevant Context
- path/to/fileA
- path/to/fileB
- decision record X
- interface contract Y
## Constraints
- do not modify ...
- preserve ...
- keep within ...
## Verification
- `command 1`
- `command 2`
- scenario: ...
## Escalate If
- schema change required
- new dependency required
- unclear requirement
- verification contradicts spec
## Progress Snapshot
[short state carried from previous session]
Illustrative Example - Task Packet¶
# Task Packet - AUTH-ML-02
## Objective
Implement backend endpoint `POST /auth/magic-link/request`.
## Why This Step
The frontend flow cannot be tested until link issuance exists.
## Relevant Context
- `src/auth/routes.ts`
- `src/auth/magicLinkService.ts`
- `tests/auth/magicLink.request.test.ts`
- decision record `docs/decisions/2026-05-passwordless-login.md`
## Constraints
- do not change existing admin auth endpoints
- reuse current email sender abstraction
- rate limit by email and IP
- token expiry must remain 15 minutes
## Verification
- `npm test -- tests/auth/magicLink.request.test.ts`
- `npm run lint`
- scenario: valid email -> 200 response -> token stored -> email job queued
## Escalate If
- users table needs a new column
- email provider cannot support the template variables
- rate limiting requires infra changes
## Progress Snapshot
Goal record approved. Frontend screen exists as stub. No backend endpoint yet.
Why this example matters: it shows how a packet narrows a large feature into one executable increment.
Retrieve Just-in-time Context¶
The agent should retrieve:
- the nearest implementation surface;
- the most relevant tests;
- the nearest architectural rule;
- the prior progress note;
- any acceptance examples needed for this increment.
The agent should not preload broad unrelated docs or large command histories by default. This is directly supported by Anthropic’s “just in time” context strategy and by recent coding-agent benchmarks such as ContextBench and SWE-ContextBench, which show that retrieval quality and summarized context matter materially to cost and performance (Effective context engineering for AI agents; ContextBench; SWE-ContextBench).
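To make "context as a budget" concrete, here is a minimal TypeScript sketch of a packet-driven retrieval step. The `TaskPacket` shape, the four-characters-per-token heuristic, and the function name are illustrative assumptions, not part of the framework.

import { readFileSync } from "node:fs";

// Hypothetical packet shape: only what this sketch needs.
interface TaskPacket {
  id: string;
  objective: string;
  relevantContext: string[]; // file paths and doc references
}

// Rough heuristic: roughly four characters per token for English and code.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Assemble runtime context for one packet, stopping at a token budget
// instead of loading everything the repository contains.
function buildPacketContext(packet: TaskPacket, tokenBudget: number): string {
  const sections: string[] = [`# Task ${packet.id}: ${packet.objective}`];
  let spent = approxTokens(sections[0]);
  for (const path of packet.relevantContext) {
    const content = readFileSync(path, "utf8");
    const cost = approxTokens(content);
    if (spent + cost > tokenBudget) {
      // Over budget: reference the file rather than inlining it, so the
      // agent can still retrieve it just in time if it turns out to matter.
      sections.push(`[not inlined, retrieve on demand: ${path}]`);
      continue;
    }
    sections.push(`--- ${path} ---\n${content}`);
    spent += cost;
  }
  return sections.join("\n\n");
}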
Execute One Verified Increment¶
One increment means:
- plan briefly;
- edit locally;
- run verification;
- inspect the failure if verification fails;
- either fix or escalate.
This reflects both TDD’s divide-and-conquer strength and long-running harness evidence that one-shotting a large task leads to partial implementations, state confusion, and false completion claims (Martin Fowler Testing Guide; TDD with GitHub Copilot; Effective harnesses for long-running agents).
A single increment should end in one of four states:
- Passed
- Blocked
- Needs approval
- Ambiguous / clarification needed
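One lightweight way to make these end states machine-checkable is a discriminated union. The sketch below is illustrative only; the framework has no official state model yet (see the Gaps section).

// Illustrative sketch of the four increment end states; field names
// are assumptions, not an official CDAD state model.
type IncrementResult =
  | { state: "passed"; verificationLog: string }
  | { state: "blocked"; reason: string }
  | { state: "needs-approval"; proposedAction: string }
  | { state: "ambiguous"; question: string };

// Each state maps to a different control response, which is the point:
// these are control states, not degrees of "doneness".
function nextStep(result: IncrementResult): string {
  switch (result.state) {
    case "passed":
      return "compress progress and select the next packet";
    case "blocked":
      return "record the blocker; do not layer new work on top";
    case "needs-approval":
      return "pause and surface the proposed action to the human lead";
    case "ambiguous":
      return "pause and ask the clarification question";
  }
}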
Illustrative Examples Of End States¶
Passed: the endpoint was implemented, tests passed, and the packet can be marked complete.
Blocked: the endpoint needs a database column that does not exist, and the packet forbids schema changes without a prior migration decision.
Needs approval: the cleanest implementation requires adding a new queueing dependency.
Ambiguous / clarification needed: the product spec does not say whether magic links should invalidate previous outstanding links.
These examples matter because teams often confuse “blocked” with “needs approval” or “ambiguous”; CDAD treats them as different control states.
Compress Progress¶
After each increment, write a compact progress artifact.
Template - Progress Entry¶
## [timestamp] [task ID]
- Goal worked on:
- Files changed:
- Verification run:
- Result:
- Open issues:
- Next recommended step:
This pattern is consistent with Anthropic’s long-running harness guidance, which explicitly recommends progress files, feature lists, and git history as the mechanism for quickly reestablishing state in a fresh context window (Effective harnesses for long-running agents).
Govern Intent Drift¶
If the code change alters observable behaviour, update the relevant durable artifact or requirement note so the source of truth stays aligned. This preserves SPDD’s strongest governance rule: if reality diverges, fix the prompt or specification rather than allowing code and intent to drift apart silently (SPDD).
Approval Boundaries¶
CDAD treats approval boundaries as part of method design, not as an afterthought.
Require Explicit Approval When¶
- a new dependency is introduced;
- an external network call is added;
- a schema or contract changes;
- a security boundary changes;
- destructive edits or deletes are proposed;
- the agent proposes widening scope beyond the packet.
Allow Autonomous Continuation When¶
- the change stays inside packet scope;
- verification is defined and runnable;
- no new side-effect surface is introduced;
- the change is reversible and localized.
flowchart TD
A[Agent proposes action] --> B{Inside packet scope?}
B -- No --> H[Pause for human approval]
B -- Yes --> C{Side-effecting or risky?}
C -- Yes --> H
C -- No --> D{Verification available?}
D -- No --> I[Pause: define oracle first]
D -- Yes --> E[Execute]
E --> F{Verification passes?}
F -- No --> G[Diagnose / retry / escalate]
F -- Yes --> J[Record progress and continue]

This section is strongly supported by current agent documentation. OpenAI’s guardrail guidance explicitly distinguishes automatic guardrails from human review and treats approvals as paused runs that should resume from the same state rather than start as new turns. Codex documentation similarly distinguishes sandbox mode from approval policy and recommends approval on actions that leave the trusted workspace or touch networked or side-effecting operations (Guardrails and human review; Agent approvals & security).
Illustrative Examples - Autonomous Vs Approval-required Actions¶
Autonomous: rename a local helper function, update one test file, and rerun the affected test command.
Autonomous: fix a lint error in a file already named in the packet.
Approval required: add Redis because the agent believes rate limiting will be easier.
Approval required: modify API response shape used by external clients.
Approval required: enable live web access or call a new external service.
Pause for oracle-first redesign: the packet has no meaningful verification step, so the right move is to define a check before implementation continues.
A practical decision rule is: if the action changes dependencies, contracts, permissions, network surfaces, or data shape, assume approval is required unless the packet explicitly pre-authorises it.
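Expressed as code, that rule is a simple predicate over a proposed action. The `ProposedAction` fields below are hypothetical; a real harness would have to derive them from diffs and tool calls.

// Hypothetical description of what an agent wants to do next.
interface ProposedAction {
  changesDependencies: boolean;
  changesContracts: boolean; // schemas, API shapes, permissions
  changesNetworkSurface: boolean;
  changesDataShape: boolean;
  preAuthorizedByPacket: boolean; // the packet explicitly allows it
}

// True when the action must pause for human approval.
function requiresApproval(action: ProposedAction): boolean {
  const risky =
    action.changesDependencies ||
    action.changesContracts ||
    action.changesNetworkSurface ||
    action.changesDataShape;
  return risky && !action.preAuthorizedByPacket;
}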
Resume / Recovery Protocol¶
CDAD assumes long tasks may span multiple sessions, context windows, or even different agents.
At Session Start¶
The agent should perform the same recovery steps, in order:
- identify working directory / workspace boundary;
- read progress notes;
- read the most recent task packet;
- inspect latest verification state;
- inspect recent git history or equivalent change log;
- run one basic sanity verification before touching new code.
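A minimal TypeScript sketch of that startup sequence follows. The progress-file path and the sanity command are assumptions that match the suggested project layout later in this guide, not fixed CDAD requirements.

import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

// Illustrative recovery routine; paths and commands are assumptions.
function recoverSession(sanityCommand = "npm test"): void {
  // 1. Identify the workspace boundary.
  const root = execSync("git rev-parse --show-toplevel").toString().trim();
  console.log(`workspace: ${root}`);

  // 2. Read progress notes and the latest packet before touching code.
  console.log(readFileSync("agent/progress/latest.md", "utf8"));

  // 3. Inspect recent change history.
  console.log(execSync("git log --oneline -10").toString());

  // 4. Run one sanity verification; failure means repair comes first.
  try {
    execSync(sanityCommand, { stdio: "inherit" });
  } catch {
    throw new Error("Baseline is broken: repair before starting new packet work.");
  }
}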
At Session End¶
The agent should leave:
- updated progress notes;
- changed file list;
- current packet status;
- verification result;
- next recommended action.
flowchart LR
A[Fresh session starts] --> B[Read progress log]
B --> C[Read latest task packet]
C --> D[Inspect recent changes]
D --> E[Run sanity verification]
E --> F{Healthy baseline?}
F -- No --> G[Repair before new work]
F -- Yes --> H[Select next packet increment]

This protocol directly follows evidence from Anthropic’s long-running harness work. That article found that agents perform better when they always read progress state, check the working baseline, and refuse to layer new work on top of a broken state. OpenAI’s runtime docs add that paused runs should resume from preserved state rather than being re-created as new independent turns (Effective harnesses for long-running agents; Running agents).
Recovery Rule¶
Never start a new feature increment if the recovered baseline is already broken.
Illustrative Example - Good Recovery Note¶
## 2026-05-01 AUTH-ML-02
- Goal worked on: backend magic-link request endpoint
- Files changed: `src/auth/routes.ts`, `src/auth/magicLinkService.ts`
- Verification run: `npm test -- tests/auth/magicLink.request.test.ts`, `npm run lint`
- Result: packet implementation passes unit tests, but end-to-end flow still blocked
- Open issues: frontend still assumes token is returned directly instead of emailed
- Next recommended step: create packet AUTH-ML-03 to align frontend request flow with email-based login
Illustrative Example - Bad Recovery Note¶
## 2026-05-01 AUTH-ML-02
- did some auth work, mostly done I think
- a test is failing, will look at it next time

The difference matters because CDAD relies on resumable state. The bad note forces the next session to reconstruct intent from scratch: no changed files, no verification commands, no packet status, and no recommended next step.
Artifact Set For A CDAD Project¶
A minimally operational CDAD project should have these artifact classes:
Project Memory¶
- conventions / rules file;
- architecture notes;
- environment commands.
Intent Artifacts¶
- goal records;
- specs or stories;
- decision records;
- acceptance scenarios.
Runtime Artifacts¶
- task packets;
- progress log;
- verification records.
Evidence Artifacts¶
- tests;
- screenshots when relevant;
- logs;
- benchmark outputs;
- citation/source notes for research-backed claims.
This classification is an operational extension of the framework brief’s layered information model and is consistent with both harness engineering and long-running agent research, where feedforward controls, feedback sensors, and resumable artifacts all play different roles in the larger control system (framework brief; Harness engineering for coding agent users; Effective harnesses for long-running agents).
Illustrative Mapping - Where Common Project Files Belong¶
- `CLAUDE.md` or equivalent team rules file -> Project memory
- `docs/specs/passwordless-login.md` -> Intent artifact
- `agent/packets/AUTH-ML-02.md` -> Runtime artifact
- `agent/verification/AUTH-ML-02-test.log` -> Evidence artifact
- `tests/auth/magicLink.request.test.ts` -> both implementation support and verification evidence
This mapping helps because some teams otherwise mix long-term requirements, temporary task packets, and transient logs in one folder or one prompt.
Diagram Set Recommended For Every Serious CDAD Rollout¶
At minimum, the guide recommends four diagram types.
Lifecycle Diagram¶
Use the lifecycle flow already shown above.
Approval Boundary Diagram¶
Use the approval flow above.
Resume / Recovery Diagram¶
Use the resume flow above.
Actor-responsibility Diagram¶
flowchart TB
H[Human Lead] -->|frames goal| D[Durable Artifacts]
H -->|approves risky actions| A[Lead Agent]
D -->|source of truth| P[Task Packet]
P -->|runtime input| A
A -->|delegates exploration when needed| S[Scout/Research Subagent]
A -->|runs checks / produces evidence| V[Verification Artifacts]
V --> R[Verifier / Reviewer]
R --> H

This actor diagram is useful because it makes explicit one of CDAD’s most important distinctions: the packet is not the long-term source of truth; it is a derivative runtime object.
Worked Example - One Feature Through CDAD¶
This is an illustrative walk-through, not a claim that every team must use this exact decomposition.
Feature¶
“Add passwordless email login.”
CDAD Decomposition¶
Goal record: defines scope, constraints, and verification.
Packet AUTH-ML-01: create backend token model and service.
Packet AUTH-ML-02: expose request endpoint.
Packet AUTH-ML-03: align frontend request form.
Packet AUTH-ML-04: consume token and start session.
Packet AUTH-ML-05: audit logging and expiry edge cases.
What The Lead Agent Actually Sees For AUTH-ML-02¶
- the packet objective;
- the auth route file;
- the service file;
- one relevant decision record;
- one targeted test command;
- one escalation rule for schema changes.
What The Lead Agent Does Not Need In Active Context¶
- the whole repository;
- unrelated dashboard code;
- old chat history about UI color choices;
- previous raw command logs if the progress note already captured outcomes.
Example Execution¶
- Agent reads packet AUTH-ML-02.
- Agent retrieves `routes.ts`, `magicLinkService.ts`, and the targeted test.
- Agent implements the endpoint.
- Unit test fails because rate limiting middleware expects a different request field.
- Agent fixes middleware integration.
- Tests pass.
- Agent writes progress note.
- Agent does not mark the full passwordless-login feature complete; only the packet is complete.
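For concreteness, the increment delivered by AUTH-ML-02 might look roughly like the sketch below. It assumes an Express-style router and a `magicLinkService` exposing `issueToken` and `queueEmail`; those names are illustrative, not mandated by the packet.

import { Router, Request, Response } from "express";
// Hypothetical service module; the packet names only the file,
// not these function signatures.
import { issueToken, queueEmail } from "./magicLinkService";

export const authRouter = Router();

// One verified increment: the request endpoint only. Token consumption
// and the frontend flow belong to later packets.
authRouter.post(
  "/auth/magic-link/request",
  async (req: Request, res: Response) => {
    const { email } = req.body as { email?: string };
    if (!email) {
      return res.status(400).json({ error: "email required" });
    }
    // Packet constraint: token expiry stays at 15 minutes.
    const token = await issueToken(email, { ttlMinutes: 15 });
    // Packet constraint: reuse the existing email sender abstraction.
    await queueEmail(email, token);
    return res.status(200).json({ status: "link queued" });
  }
);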
Example Escalation¶
If the agent discovers the existing user model cannot store token metadata without a schema migration:
- packet status becomes Needs approval or Blocked, depending on policy;
- the progress note records the reason;
- the next artifact is not “more code,” but either an approval decision or a schema-design packet.
This worked example makes one of CDAD’s key ideas concrete: the framework is not trying to make the agent “understand the whole feature at once.” It is trying to keep each increment small, verified, and resumable.
Anti-patterns CDAD Tries To Prevent¶
Giant Evergreen Prompt¶
Symptoms: - context cost keeps rising; - the agent starts forgetting or misweighting old instructions; - changes become less predictable.
CDAD response: - move stable rules into durable memory; - move task-local execution into packets; - compact aggressively.
This follows directly from vendor warnings about context saturation and context rot (Anthropic context windows; Effective context engineering for AI agents).
Chat-only Development¶
Symptoms: - decisions disappear into conversation history; - next sessions restart from scratch; - human review becomes guesswork.
CDAD response: - write durable goal and progress artifacts.
Verification-free Generation¶
Symptoms: - “looks right” replaces “works right”; - the agent marks tasks done prematurely.
CDAD response: - require verification in every packet.
This anti-pattern is strongly supported by TDD, BDD, Anthropic’s verification guidance, and long-running harness evidence on premature task completion (Martin Fowler Testing Guide; Cucumber BDD docs; Claude Code best practices; Effective harnesses for long-running agents).
Scope Drift Under Apparent Productivity¶
Symptoms: - agent adds nice-to-have work not requested; - diffs broaden beyond the stated objective.
CDAD response: - explicit packet scope + approval boundary.
Resume Amnesia¶
Symptoms: - fresh sessions guess what happened; - broken states get layered over rather than repaired.
CDAD response: - mandatory recovery protocol.
Gaps Exposed While Writing This Guide¶
The act of operationalizing CDAD reveals genuine missing detail.
Packet Schema Is Conceptually Defined But Not Yet Formally Typed¶
The framework says what a packet should contain, but not yet whether packets are Markdown, JSON, hybrid, or tool-native structured objects.
Needed: one canonical schema and one lightweight human-editable rendering.
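As one possible direction, the packet could be typed roughly as below, with the Markdown template earlier in this guide generated from the same structure. This is a sketch of what a canonical schema could look like, not one the framework has adopted.

// One possible typed packet shape; field names mirror the Markdown
// template above but are not an official CDAD schema.
interface TaskPacket {
  id: string;
  objective: string;
  whyThisStep: string;
  relevantContext: string[]; // file paths, decision records, contracts
  constraints: string[];
  verification: string[]; // commands and acceptance scenarios
  escalateIf: string[];
  progressSnapshot: string;
}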
State Machine Vocabulary Is Incomplete¶
The guide had to invent practical statuses such as Passed, Blocked, Needs approval, and Ambiguous.
Needed: an official CDAD state model.
No Default Priority Rule For Choosing The Next Increment¶
The framework says to work incrementally, but not how to choose the next packet when multiple are available.
Needed: a prioritization heuristic, such as risk-first, dependency-first, or value-first.
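To show what such a heuristic could look like, here is a minimal risk-first scoring sketch. The fields and weights are placeholders, not framework recommendations.

// Hypothetical packet metadata for prioritization; none of these
// fields exist in the current framework.
interface PacketCandidate {
  id: string;
  risk: number; // 0..1, estimated downside if deferred
  unblocksCount: number; // how many other packets depend on this one
  userValue: number; // 0..1, direct value of the increment
}

// Risk-first, with dependency unblocking as a tie-breaker; weights are arbitrary.
const score = (p: PacketCandidate): number =>
  2 * p.risk + 1 * p.unblocksCount + 0.5 * p.userValue;

// Assumes a non-empty candidate list.
const pickNext = (candidates: PacketCandidate[]): PacketCandidate =>
  [...candidates].sort((a, b) => score(b) - score(a))[0];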
Human Vs Agent Ownership Is Still Too Implicit¶
The guide can infer responsibilities, but the framework should define exact responsibility boundaries for requirement clarification, architecture changes, dependency changes, and release readiness.
Verification Taxonomy Is Underspecified¶
The framework needs a more explicit split between unit verification, integration verification, visual verification, policy/security verification, and evidence verification for research artifacts.
Multi-agent Use Is Not Yet Governed Tightly Enough¶
The framework endorses subagents when useful, but does not yet define when subagent use is justified versus wasteful. This gap matters because recent guidance notes that subagents are powerful but can also cause extra cost and context complexity when overused (Effective context engineering for AI agents; Prompting Claude Opus 4.7).
Tool Design Guidance Is Still Broad¶
The framework recommends a future tool, but the guide reveals a need for sharper requirements around packet generation, retrieval ranking, token budgeting, verification routing, and resume packet generation.
These gaps are not defects in the guide; they are useful outputs of the guide-writing exercise.
Suggested Default CDAD Project Layout¶
project/
docs/
architecture/
decisions/
specs/
agent/
memory/
packets/
progress/
verification/
src/
tests/
Suggested meaning:
- docs/ = durable, human-reviewed source-of-truth material;
- agent/packets/ = runtime task packets;
- agent/progress/ = resumable state;
- agent/verification/ = evidence of completion.
This layout is an implementation suggestion rather than a source-mandated standard.
How To Pilot CDAD In A Real Team¶
Week 1¶
- choose one medium-complexity task;
- define one goal record;
- create 3-5 packets manually;
- enforce explicit verification for each packet.
Week 2¶
- add progress logging and recovery protocol;
- add approval boundaries;
- compare token spend and rework against ad hoc prompting.
Week 3+¶
- introduce lightweight packet tooling;
- measure time-to-verified-change and reopen rate;
- refine schema and prioritization rules.
This staged rollout is inferential, but it aligns with Martin Fowler’s harness-engineering advice: start by building the outer control system that reduces repeated failure modes and shift quality checks left rather than trying to automate everything at once (Harness engineering for coding agent users).
Minimal Success Criteria For CDAD Adoption¶
CDAD is helping if:
- sessions recover faster after resets;
- fewer tasks are falsely marked complete;
- review focuses more on decisions and less on reconstructing intent;
- token usage drops or becomes more predictable;
- verification quality improves;
- humans intervene at sharper, more valuable boundaries.
These criteria are derived from the framework brief’s metric proposals and from newer work on context retrieval, context reuse, and token consumption in coding agents (framework brief; ContextBench; SWE-ContextBench; How Do AI Agents Spend Your Money?; Tokenomics).
If none of those improve, the framework is adding ceremony without enough control benefit.
Final Recommendation¶
Treat CDAD first as an operating discipline, not as a tool purchase.
Start with:
- durable artifacts;
- task packets;
- mandatory verification;
- resumable progress notes;
- explicit approval boundaries.
Only then decide what tooling is justified.
Sources¶
Framework Source¶
- Existing CDAD framework brief:
outputs/ai-agent-dev-framework.md
Methods And Practices¶
- GitHub Spec Kit docs: https://github.github.com/spec-kit/
- Martin Fowler, Structured-Prompt-Driven Development: https://martinfowler.com/articles/structured-prompt-driven/
- Cucumber BDD docs: https://cucumber.io/docs/bdd
- Martin Fowler Testing Guide: https://www.martinfowler.com/testing/
- Martin Fowler, Harness engineering for coding agent users: https://martinfowler.com/articles/harness-engineering.html
Agent Loops, Approvals, And Context Engineering¶
- Anthropic, Effective context engineering for AI agents: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- Anthropic, Effective harnesses for long-running agents: https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
- Anthropic, Claude Code best practices: https://docs.anthropic.com/en/docs/claude-code/best-practices
- Anthropic, Context windows: https://docs.anthropic.com/en/docs/build-with-claude/context-windows
- OpenAI, Running agents: https://developers.openai.com/api/docs/guides/agents/running-agents
- OpenAI, Guardrails and human review: https://developers.openai.com/api/docs/guides/agents/guardrails-approvals
- OpenAI, Agent approvals & security (Codex): https://developers.openai.com/codex/agent-approvals-security
Research Papers¶
- ContextBench: A Benchmark for Context Retrieval in Coding Agents (2026), arXiv:2602.05892: https://arxiv.org/abs/2602.05892
- SWE Context Bench: A Benchmark for Context Learning in Coding (2026), arXiv:2602.08316: https://arxiv.org/abs/2602.08316
- SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents (2026), arXiv:2601.16746: https://arxiv.org/abs/2601.16746
- Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned (2026), arXiv:2603.05344: https://arxiv.org/abs/2603.05344
- How Do AI Agents Spend Your Money? Analysing and Predicting Token Consumption in Agentic Coding Tasks (2026), arXiv:2604.22750: https://arxiv.org/abs/2604.22750v1
- Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering (2026): https://arxiv.org/html/2601.14470v1