Toward An AI-native Software Development Framework¶
Executive Summary¶
This brief investigates four families of development practice—spec-driven development, structured prompt-driven development, behaviour-driven development (BDD), and test-driven development (TDD)—through the lens of modern AI coding agents. The main conclusion is that none of these methods should be adopted unchanged as the default operating system for agentic software development. Each contributes something important, but each also assumes a human developer with relatively stable working memory and low cost for carrying broad context mentally across a task. Spec-driven and structured-prompt methods externalize intent well; BDD externalizes examples and shared language; TDD externalizes correctness checks. Recent vendor guidance and benchmark work suggest that AI-agent performance depends not only on model quality but on how context is selected, structured, compacted, and verified during execution, rather than simply on how much context can fit in the window (Spec Kit docs; SPDD; Cucumber BDD docs; Martin Fowler Testing Guide; Anthropic context windows; OpenAI GPT-5.5 guide; Gemini long context; ContextBench).
Observation: Existing methods are strongest when they externalize intent into durable artifacts, constrain scope, and create fast feedback loops (Spec Kit docs; SPDD; Cucumber BDD docs; Martin Fowler Testing Guide).
Observation: Current model and agent documentation converges on a warning that large context alone is insufficient. Anthropic explicitly warns about “context rot” as token counts grow, OpenAI recommends starting with the smallest prompt that preserves the product contract, and Google documents both the benefits and limitations of long context together with techniques such as caching and careful prompt placement (Anthropic context windows; OpenAI GPT-5.5 guide; Gemini long context).
Inference: A better framework for AI-agent software development should preserve the discipline of specs, examples, and tests, but reorganize them around context economy and verification. The runtime unit of work should be a compact, machine-usable task packet with explicit goals, constraints, acceptance checks, and retrieval pointers rather than a large free-form spec or an ever-growing chat transcript.
This report therefore proposes an initial framework, Context-Disciplined Agent Development (CDAD), with five core ideas:
- durable intent artifacts instead of chat-only instructions;
- small, typed task packets instead of full-repository dumps;
- retrieval-first context assembly instead of always-on prompt stuffing;
- verification-first execution loops instead of generation-first loops;
- explicit measurement of quality and token efficiency.
The practical implication is that the future tool, if built, should probably not be a full IDE replacement. It should be a lightweight orchestration and context-packaging layer that helps models assemble only the information needed for the next verified step.
What Existing Approaches Do Well¶
Spec-driven Development¶
Spec-driven approaches treat natural-language requirements as durable engineering artifacts rather than disposable chat. GitHub’s Spec Kit docs say that spec-driven development “flips the script” so that specifications become executable; the docs emphasize “intent-driven development” and prefer “multi-step refinement rather than one-shot code generation from prompts” (Spec Kit docs). GitHub’s blog framing is similar: the method is presented as a response to AI coding agents losing track of app purpose or prior decisions, with Markdown artifacts acting as a durable source of truth (GitHub Blog: Using Markdown as a programming language; GitHub Blog: Spec Kit toolkit).
Structured Prompt-Driven Development (SPDD) pushes governance further. Martin Fowler’s article describes SPDD as an engineering method that treats prompts as “first-class delivery artifacts” that can be version controlled, reviewed, reused, and improved, and it states a core workflow rule: when reality diverges, fix the prompt first, then update the code (SPDD).
What these approaches do well:
- preserve intent outside the model’s ephemeral context;
- make review possible at the requirements/prompt layer, not only in code diffs;
- support repeatability across developers and sessions;
- reduce purely ad hoc prompting;
- create a basis for auditability and reuse (Spec Kit docs; SPDD).
BDD¶
BDD is strongest where multiple stakeholders need a shared understanding of desired behaviour. Cucumber’s official documentation says BDD aims to close the gap between business and technical people through collaboration around concrete examples, then turns those examples into automated checks and living documentation (Cucumber BDD docs). Cucumber also stresses that BDD is not just Given/When/Then formatting but a process of discovery, formulation, and automation (Intro to TDD and BDD).
What BDD contributes well:
- concrete behavioural examples;
- a shared language between business and implementation;
- executable documentation;
- small, feedback-rich iterations (Cucumber BDD docs).
BDD is especially relevant for AI-agent development because agent failures are often failures of interpreted intent rather than syntax. Concrete examples reduce ambiguity better than abstract prose.
TDD¶
TDD remains one of the strongest control systems for AI-generated code. Martin Fowler’s testing guide defines TDD as guiding development by writing tests first and emphasizes a balanced portfolio of automated tests (Martin Fowler Testing Guide). Thoughtworks’ “TDD with GitHub Copilot” argues that TDD is more important with coding assistants because tests provide fast, accurate feedback on AI-written code and force divide-and-conquer problem solving when one-shot generation is unreliable (TDD with GitHub Copilot). The UK Government Digital Service guidance likewise frames TDD as a way to improve design and confidence while iterating in small steps (GDS TDD guidance).
What TDD contributes well:
- explicit local correctness checks;
- task decomposition into small verifiable steps;
- fast feedback loops;
- resistance to hallucinated success (Martin Fowler Testing Guide; TDD with GitHub Copilot).
Prompt-driven Methods¶
Prompt-driven methods correctly recognize that the model’s output quality depends heavily on how intent, constraints, and context are expressed. Structured versions of this idea, such as SPDD, also recognize that prompts are operational artifacts that need governance rather than one-off messages (SPDD). Martin Fowler’s comparison of spec-driven tools highlights that current systems vary in how much they anchor work in ongoing specifications versus lightweight task-local prompting, which is useful because it exposes a real design space rather than a single canonical method (Understanding Spec-Driven Development: Kiro, spec-kit, and Tessl).
What Do These Approaches Get Wrong For AI Coding Agents¶
They Are Often Too Large-grained¶
Many spec-first or prompt-first approaches produce large monolithic artifacts. Those are useful for humans, but they are often suboptimal as repeated runtime inputs to coding agents. Anthropic’s Claude Code guidance warns that every message, file read, and command output consumes context and that performance degrades as the context fills; OpenAI’s GPT-5.5 guide recommends starting from the smallest prompt that preserves the product contract; Google’s long-context guide says long context unlocks new workflows but still describes optimizations such as caching and careful query placement (Claude Code best practices; OpenAI GPT-5.5 guide; Gemini long context).
Inference: In an AI-agent setting, a 20-page spec may still be useful as a source of truth, but it is often the wrong runtime object. The agent should usually consume a smaller derivative packet.
They Assume Stable Human Memory¶
Traditional methodologies assume the implementer can keep broad architecture, edge cases, and prior decisions in working memory. AI agents cannot be treated that way. Their working memory is the active context window plus any external memory, retrieval, or compaction system provided by the harness. Anthropic explicitly describes the context window as the model’s working memory and documents compaction and context editing as core strategies for long-running agent workflows (Anthropic context windows).
They Under-specify Context Assembly¶
Most methodologies say which artifacts should exist, but not how a runtime should choose which parts to inject for each step. Recent benchmark work suggests this omission matters. ContextBench evaluates coding agents on context recall, precision, and efficiency during issue resolution rather than only on final task success, and SWE-ContextBench specifically studies context reuse, summarized context, and cost/time efficiency across related software tasks (ContextBench; SWE-ContextBench). A newer paper on terminal-based coding agents also treats context engineering and harness design as central system problems rather than incidental implementation details, reinforcing the claim that artifact design alone is insufficient without a runtime context policy (Building Effective AI Coding Agents for the Terminal).
They Can Confuse Governance With Verbosity¶
A common failure mode is to respond to unreliability by adding more instructions everywhere. Vendor guidance leans the other way. Anthropic says broad standing instructions should stay concise and warns that bloated CLAUDE.md files reduce usefulness; OpenAI recommends removing unnecessary process scaffolding and using structured outputs, tool descriptions, caching, and state management rather than over-describing everything in the prompt (Claude Code best practices; OpenAI GPT-5.5 guide; Codex Prompting Guide).
Inference: Better agent process does not mean longer prompts. It means better separation between stable rules, retrievable domain knowledge, and task-local evidence.
They Do Not Fully Price Token Economics¶
Human-centered methods rarely treat context as a first-class cost center. Agentic workflows must. A recent token-usage study on SWE-bench Verified reports that agentic coding tasks are far more expensive than code chat or code reasoning tasks and that input tokens, not output tokens, dominate overall cost; it also reports that higher token use does not reliably imply higher accuracy (How Do AI Agents Spend Your Money?). The 2026 “Tokenomics” study of ChatDev similarly finds that the code-review stage dominates token consumption and that input tokens form the largest share of consumption on average, which it interprets as a communication tax in multi-agent software engineering (Tokenomics).
What Information Do Modern Coding Agents Appear To Need¶
Across vendor guidance, benchmark design, and coding-agent prompting materials, several information classes recur.
Goal And Stopping Condition¶
OpenAI’s GPT-5.5 documentation says outcome-first prompts with explicit success criteria and stopping rules are important, especially in long-running, tool-heavy workflows (OpenAI GPT-5.5 guide). Anthropic’s Claude Code guidance similarly argues that the model performs dramatically better when it can verify its own work against explicit criteria (Claude Code best practices).
Scope Boundary¶
Coding-agent docs repeatedly emphasize file boundaries, constraints, and explicit references to relevant project locations. Anthropic recommends referencing specific files, constraints, and patterns; OpenAI’s Codex guide emphasizes autonomy plus explicit constraints and project conventions (Claude Code best practices; Codex Prompting Guide).
Verification Oracle¶
Tests, scripts, screenshots, and contract checks are central. Anthropic’s guidance says “give Claude a way to verify its work,” including tests and screenshots; Martin Fowler’s TDD guidance and the Thoughtworks Copilot article make the same point from a software-method perspective (Claude Code best practices; Martin Fowler Testing Guide; TDD with GitHub Copilot).
Local Code Context¶
The agent usually needs only a small subset of repository files for the current step. ContextBench is directly relevant here because it evaluates whether agents retrieve and use the right code context during problem solving, not just whether they eventually resolve the task (ContextBench).
Durable Project Memory¶
Anthropic’s CLAUDE.md guidance is effectively a statement that some context should live outside transient sessions, but only in concise form; OpenAI similarly documents state handling, compaction, and prompt caching as core parts of reliable reasoning-model systems (Claude Code best practices; OpenAI GPT-5.5 guide).
Compressed Prior Progress¶
Anthropic documents compaction, summarization, and checkpointing explicitly for long-running agent sessions, and OpenAI’s Codex guidance describes first-class compaction support for long-running coding work (Anthropic context windows; Codex Prompting Guide).
How Should That Information Be Delivered Efficiently¶
The reviewed evidence points toward a layered approach rather than a single prompt artifact.
Layer A - Stable Project Memory¶
This contains persistent instructions and conventions that apply broadly across many tasks. It should be short. Anthropic explicitly warns that broad standing instruction files should be concise and pruned regularly (Claude Code best practices).
Layer B - Source-of-truth Artifacts¶
These include product specs, architecture notes, API contracts, decision records, and behavioural examples. Spec Kit, SPDD, BDD, and TDD all supply useful artifact forms here, even if they differ in emphasis (Spec Kit docs; SPDD; Cucumber BDD docs; Martin Fowler Testing Guide).
Layer C - Task Packet¶
This is the runtime object the agent actually consumes for the next unit of work. A good packet should contain the exact objective, relevant files, constraints, acceptance checks, and references to deeper docs. This is an inference from the evidence rather than a source-defined standard, but it is strongly consistent with outcome-first prompting guidance, file-scoped prompting advice, and benchmark pressure toward better context precision (OpenAI GPT-5.5 guide; Claude Code best practices; ContextBench).
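A minimal sketch of what such a packet could look like as a typed record, assuming a Python harness; the field names and example values are illustrative, not a source-defined schema.

```python
from dataclasses import dataclass, field


@dataclass
class TaskPacket:
    """One unit of agent work: the smallest context needed for the next verified step."""

    objective: str                                                # the exact outcome to achieve
    relevant_files: list[str] = field(default_factory=list)      # files the agent may read or edit
    constraints: list[str] = field(default_factory=list)         # scope boundaries, non-goals, conventions
    acceptance_checks: list[str] = field(default_factory=list)   # commands or scenarios that must pass
    references: list[str] = field(default_factory=list)          # pointers to deeper docs, not inlined text


packet = TaskPacket(
    objective="Return HTTP 429 with a Retry-After header when the rate limit is exceeded",
    relevant_files=["src/api/ratelimit.py", "tests/api/test_ratelimit.py"],
    constraints=["Do not change the public handler signature", "No new dependencies"],
    acceptance_checks=["pytest tests/api/test_ratelimit.py -q"],
    references=["docs/adr/0007-rate-limiting.md"],
)
```

The point of the shape is that the deeper documents stay behind references rather than being inlined into every prompt.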
Layer D - Ephemeral Execution Trace¶
Commands run, tool outputs, and transient reasoning should not all remain in active context forever. Anthropic recommends compaction and context clearing for long sessions; OpenAI documents compaction and state handling for reasoning agents; Google recommends context caching where repeated large inputs would otherwise be expensive (Anthropic context windows; Claude Code best practices; OpenAI GPT-5.5 guide; Gemini long context).
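A minimal sketch of one way a harness might compact an ephemeral trace, assuming a crude four-characters-per-token estimate and simple truncation; real harnesses use their own compaction and summarization machinery, so this only illustrates the shape of the operation.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly four characters per token for English prose and code.
    return max(1, len(text) // 4)


def compact_trace(trace: list[dict], budget_tokens: int) -> list[dict]:
    """Collapse older entries to one-line stubs until the trace fits the budget; keep the newest intact."""
    total = sum(estimate_tokens(entry["content"]) for entry in trace)
    compacted = list(trace)
    i = 0
    while total > budget_tokens and i < len(compacted) - 1:
        entry = compacted[i]
        stub = f"[{entry['kind']}] {entry['content'][:80]} ... (truncated)"
        total -= estimate_tokens(entry["content"]) - estimate_tokens(stub)
        compacted[i] = {"kind": entry["kind"], "content": stub}
        i += 1
    return compacted


trace = [
    {"kind": "tool_output", "content": "pytest session output " * 500},   # large, stale
    {"kind": "plan", "content": "Edit ratelimit.py, then rerun the failing test."},
    {"kind": "tool_output", "content": "2 passed in 0.41s"},              # recent, kept verbatim
]
print(compact_trace(trace, budget_tokens=200))
```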
Inference: The right design pattern is not “put the repo into the prompt.” It is “keep durable memory outside the prompt, retrieve only relevant slices, then compact aggressively.” This inference is strengthened by recent coding-agent research emphasizing adaptive context compaction, dual-memory architectures, and dynamic prompt assembly in terminal agents, as well as by pruning work that reports substantial token savings from structure-preserving context reduction (Building Effective AI Coding Agents for the Terminal; SWE-Pruner).
Proposed Framework - Context-disciplined Agent Development (CDAD)¶
This section is a proposal, not an established standard.
Principle 1 - Source Of Truth Is Durable, Not Conversational¶
Requirements, constraints, and architectural decisions must exist as files or structured records outside transient chats. This preserves the key strength of spec-driven and SPDD approaches (Spec Kit docs; SPDD).
Principle 2 - Runtime Context Must Be Packetized¶
Agents should consume compact task packets rather than raw full specs or broad chat histories. This follows from vendor guidance on context budgeting and from benchmark work showing that context retrieval quality matters materially (Anthropic context windows; OpenAI GPT-5.5 guide; ContextBench).
Principle 3 - Every Packet Must Carry Verification¶
No implementation packet is complete without a test, oracle, scenario, or verification command. This preserves the core strengths of BDD and TDD and matches current agentic-coding guidance (Cucumber BDD docs; Martin Fowler Testing Guide; Claude Code best practices).
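A minimal sketch of enforcing this principle in a harness, assuming the acceptance checks are shell commands attached to the packet; the packet shape reuses the illustrative schema from the task-packet sketch above.

```python
import subprocess


def require_oracle(acceptance_checks: list[str]) -> None:
    # Reject the packet before any generation happens if it carries no way to verify the result.
    if not acceptance_checks:
        raise ValueError("Packet rejected: no test, scenario, or verification command attached")


def run_checks(acceptance_checks: list[str], timeout_s: int = 600) -> dict:
    """Run each check and capture pass/fail plus a trimmed output tail as evidence."""
    results = []
    for cmd in acceptance_checks:
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout_s)
        results.append({
            "command": cmd,
            "passed": proc.returncode == 0,
            "evidence": (proc.stdout + proc.stderr)[-2000:],  # keep only the tail to stay compact
        })
    return {"all_passed": all(r["passed"] for r in results), "checks": results}


require_oracle(["pytest tests/api/test_ratelimit.py -q"])
report = run_checks(["pytest tests/api/test_ratelimit.py -q"])
```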
Principle 4 - Retrieval Beats Stuffing¶
The harness should retrieve relevant files, tests, prior decisions, and examples on demand rather than preloading everything. SWE-ContextBench is especially relevant because it finds that correctly selected summarized context can improve accuracy while substantially reducing runtime and token cost, whereas unfiltered or incorrectly selected context can produce limited or negative benefits (SWE-ContextBench). SWE-Pruner points in the same direction from a different angle: it proposes self-adaptive pruning with a lightweight skimmer and reports meaningful token reductions on coding-agent tasks while aiming to preserve code structure, suggesting that runtime context reduction can be a first-class optimization target rather than only a fallback when the window is full (SWE-Pruner).
Principle 5 - Progress Must Survive Context Resets¶
Long-running work should leave resumable artifacts: checkpoint summaries, changed-file lists, open questions, and verification state. This is aligned with Anthropic’s multi-session state guidance and OpenAI’s compaction/state recommendations (Anthropic context windows; Claude Code best practices; OpenAI GPT-5.5 guide).
Principle 6 - Human Review Should Happen At Intent Boundaries And Risk Boundaries¶
Humans are most valuable when clarifying intent, approving major design choices, and reviewing high-risk changes—not micromanaging every generation step. This is an inference supported indirectly by the evidence that durable artifacts and explicit verification reduce the need for continuous conversational steering.
CDAD Workflow¶
Frame¶
Create or update a durable goal record with:
- user need;
- scope;
- non-goals;
- quality bar;
- risks.
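A minimal illustration of such a goal record as a durable file, assuming JSON checked into the repository; the keys mirror the list above and the paths and values are hypothetical.

```python
import json
from pathlib import Path

goal_record = {
    "user_need": "API consumers need predictable behaviour when they exceed the rate limit",
    "scope": ["429 responses and Retry-After headers for the public REST API"],
    "non_goals": ["Changing the rate-limiting algorithm itself"],
    "quality_bar": ["Existing API tests keep passing", "New behaviour is covered by tests"],
    "risks": ["Clients that ignore 429 responses may retry aggressively"],
}

# Durable: the record lives in the repository, not in a chat transcript.
record_path = Path("docs/goals/rate-limit-429.json")
record_path.parent.mkdir(parents=True, exist_ok=True)
record_path.write_text(json.dumps(goal_record, indent=2))
```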
Distill¶
Produce a task packet for the next unit of work with:
- exact objective;
- relevant files;
- interfaces/contracts;
- examples/acceptance criteria;
- verification commands;
- escalation conditions.
Retrieve¶
Assemble only the context needed for this packet:
- nearest code patterns;
- related tests;
- relevant architectural notes;
- prior decisions touching the same module.
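A minimal sketch of the retrieval pass, assuming plain keyword overlap as the relevance signal; a production harness would more likely use embeddings or code-aware search, so this only illustrates the "small slice, on demand" idea.

```python
from pathlib import Path


def retrieve_context(objective: str, repo_root: str, top_k: int = 5) -> list[str]:
    """Rank repository files by naive keyword overlap with the objective; return only the top slice."""
    keywords = {word.lower().strip(".,") for word in objective.split() if len(word) > 3}
    scored = []
    for path in Path(repo_root).rglob("*.py"):
        try:
            text = path.read_text(errors="ignore").lower()
        except OSError:
            continue
        score = sum(text.count(keyword) for keyword in keywords)
        if score > 0:
            scored.append((score, str(path)))
    return [path for _, path in sorted(scored, reverse=True)[:top_k]]


# Only the selected files go into the packet; everything else stays retrievable on demand.
files = retrieve_context("Return HTTP 429 when the rate limit is exceeded", repo_root=".")
```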
Execute¶
The agent plans, edits, runs verification, and captures evidence.
Compress¶
Write back:
- what changed;
- what passed/failed;
- unresolved issues;
- what should be loaded next time.
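A minimal sketch of the write-back record, assuming a JSON checkpoint file per task; the field names mirror the list above and the values are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

checkpoint = {
    "task": "rate-limit-429",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "changed_files": ["src/api/ratelimit.py", "tests/api/test_ratelimit.py"],
    "verification": {"pytest tests/api/test_ratelimit.py -q": "passed"},
    "open_issues": ["Retry-After value is hard-coded; confirm the desired backoff policy"],
    "load_next_time": ["docs/adr/0007-rate-limiting.md", "src/api/middleware.py"],
}

# A fresh session resumes from this compact record instead of replaying the full transcript.
checkpoint_path = Path("checkpoints/rate-limit-429.json")
checkpoint_path.parent.mkdir(parents=True, exist_ok=True)
checkpoint_path.write_text(json.dumps(checkpoint, indent=2))
```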
Govern¶
If behaviour or scope changed materially, update the durable artifact first or immediately after, depending on the change. This preserves the SPDD insight that prompt/spec drift is a real engineering problem (SPDD).
What CDAD Keeps From Existing Methods¶
From spec-driven development:
- intent-first sequencing;
- durable natural-language artifacts;
- implementation traceability (Spec Kit docs).

From SPDD:
- versioned prompts/specs;
- governance of prompt artifacts;
- prompt/code synchronization (SPDD).

From BDD:
- examples as alignment tools;
- shared language with non-engineers;
- executable behaviour descriptions (Cucumber BDD docs).

From TDD:
- feedback discipline;
- decomposition into verifiable increments;
- tests as anti-hallucination control (Martin Fowler Testing Guide; TDD with GitHub Copilot).

From prompt-driven development:
- explicitness about constraints and output expectations;
- AI-native ergonomics (SPDD; Structured prompt-driven development overview).
What CDAD Rejects¶
- giant evergreen prompts as the main runtime object;
- chat-only decision making;
- code generation without local verification;
- full-context repository stuffing by default;
- success metrics based only on lines of code or raw speed.
These rejections are inferential, but they are grounded in the repeated source-level warnings about context saturation, cost, retrieval precision, and the need for explicit verification (Anthropic context windows; Claude Code best practices; OpenAI GPT-5.5 guide; ContextBench; How Do AI Agents Spend Your Money?).
Tool Concept - What Should Be Built If A Tool Is Needed¶
A useful tool should be thin and infrastructural, not another opaque agent shell.
Core Capabilities¶
- Task packet builder: derive compact packets from larger specs and issue descriptions.
- Context retriever: gather relevant files, tests, prior decisions, and examples.
- Context budgeter: estimate token cost before execution and downselect low-value context.
- Verification router: associate each packet with tests, scripts, or visual checks.
- Progress compressor: summarize trajectories into resumable state.
- Traceability map: connect requirement -> packet -> code diff -> verification evidence.
This proposed tool shape is an inference from the reviewed sources. It combines the artifact discipline of spec-driven and SPDD approaches with the retrieval, compaction, and verification concerns emphasized by current model documentation and recent coding-agent benchmarks. The shape is also consistent with recent coding-agent systems work that highlights adaptive context compaction, dual-memory architectures, lazy tool discovery, and defense-in-depth safety as practical harness concerns (Spec Kit docs; SPDD; Anthropic context windows; OpenAI GPT-5.5 guide; SWE-ContextBench; Building Effective AI Coding Agents for the Terminal).
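As a rough illustration of the task packet builder and context budgeter capabilities listed above, the sketch below downselects candidate context slices under a token budget; the four-characters-per-token estimate and the relevance scores are assumptions, not measured values.

```python
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic, sufficient for budgeting decisions


def budget_context(candidates: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Keep the highest-value context slices that fit within the budget; drop the rest."""
    selected, used = [], 0
    for value, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget_tokens:
            selected.append(text)
            used += cost
    return selected


candidates = [
    (0.9, "def check_rate_limit(request): ..."),               # nearest code pattern
    (0.8, "def test_returns_429_when_limited(): ..."),         # related test
    (0.3, "ADR-0007: rate limiting uses a token bucket"),       # architectural note
    (0.1, "CHANGELOG entries for the last six releases ..."),   # likely low value for this step
]
packet_context = budget_context(candidates, budget_tokens=30)  # drops the low-value slice
```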
Metrics To Measure Efficiency¶
A framework like this should be evaluated on both delivery quality and resource efficiency.
Outcome Metrics¶
- Task success rate: proportion of tasks completed correctly.
- Verification pass rate: proportion of tasks whose required checks pass.
- Defect escape rate: defects found after the agent’s declared completion.
- Rework rate: follow-up fixes or reopenings per completed task.
These metrics are standard software-quality style measures, but they should be paired with agent-specific cost and retrieval metrics.
Efficiency Metrics¶
- Time to verified change: wall-clock time from task start to passing required checks.
- Tokens per verified change: total tokens consumed until verification passes.
- Input-token share: proportion of spend consumed by context/tool/history input.
- Iterations to success: number of plan-edit-verify loops.
- Context retrieval precision/recall: whether retrieved context matches the actually necessary context.
This metric family is motivated by ContextBench and SWE-ContextBench, which introduce retrieval precision/recall and context-reuse efficiency, and by token-efficiency studies that show input-token dominance and weak correlation between raw spend and quality (ContextBench; SWE-ContextBench; How Do AI Agents Spend Your Money?; Tokenomics).
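A minimal sketch of computing two of these metrics from logged runs, assuming each run records its token counts and verification outcome; the record format and numbers are illustrative.

```python
def tokens_per_verified_change(runs: list[dict]) -> float:
    """Total tokens spent across all attempts, divided by the number of runs that ended verified."""
    verified = [r for r in runs if r["verification_passed"]]
    total_tokens = sum(r["input_tokens"] + r["output_tokens"] for r in runs)
    return total_tokens / len(verified) if verified else float("inf")


def retrieval_precision_recall(retrieved: set[str], necessary: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved files that were needed. Recall: share of needed files retrieved."""
    if not retrieved or not necessary:
        return 0.0, 0.0
    hits = len(retrieved & necessary)
    return hits / len(retrieved), hits / len(necessary)


runs = [
    {"input_tokens": 52_000, "output_tokens": 3_100, "verification_passed": True},
    {"input_tokens": 48_000, "output_tokens": 2_700, "verification_passed": False},
]
print(tokens_per_verified_change(runs))                        # 105800.0
print(retrieval_precision_recall({"a.py", "b.py"}, {"a.py"}))  # (0.5, 1.0)
```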
Process Quality Metrics¶
- Spec-to-implementation drift: how often code behaviour diverges from durable artifacts.
- Resume efficiency: time/tokens needed to recover a paused task.
- Single-source critical claim rate: how often major implementation decisions rely on one weak source.
These are proposed metrics rather than source-defined benchmark standards.
Suggested Benchmark Design¶
A fair comparison should evaluate at least four workflow conditions:
- ad hoc prompting;
- spec-driven only;
- spec + TDD/BDD checks;
- CDAD-style packetized retrieval + verification.

And it should measure:
- resolution accuracy;
- verification success;
- total tokens;
- input/output/reasoning token split;
- latency;
- human interventions;
- reopenings or regressions.
SWE-bench Verified is a practical foundation because it already standardizes software issue-resolution evaluation, and the newer ContextBench / SWE-ContextBench benchmarks suggest how to extend evaluation beyond final task success into retrieval and reuse behaviour (SWE-bench Verified; OpenAI SWE-bench Verified post; ContextBench; SWE-ContextBench).
Model-specific Considerations¶
The framework should not be tied to one model family, but some model differences matter.
Observation: Anthropic, OpenAI, Google, Moonshot/Kimi, and NVIDIA all now emphasize long context, tool use, and agentic coding support in their public documentation, though the exact terminology and controls differ (Anthropic context windows; OpenAI GPT-5.5 guide; Codex Prompting Guide; Gemini 2.5 Pro model page; Gemini long context; Kimi models; NVIDIA Nemotron; Nemotron coding guide).
Observation: The vendor guidance also converges on a warning: more context is not automatically better; context management, compaction, caching, and explicit success criteria matter (Anthropic context windows; OpenAI GPT-5.5 guide; Gemini long context).
Inference: The framework should be mostly model-agnostic at the methodology layer, while allowing model-specific tuning at the runtime layer:
- effort/reasoning settings;
- compaction/caching usage;
- tool descriptions;
- packet size thresholds;
- retrieval strategies.
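A minimal sketch of keeping the methodology layer model-agnostic while isolating per-model tuning in a runtime profile; the model identifiers and numbers below are placeholders, not vendor recommendations.

```python
# The methodology (packets, retrieval, verification, compression) is shared; only these knobs vary.
RUNTIME_PROFILES = {
    "model-a": {"reasoning_effort": "high", "use_compaction": True,
                "max_packet_tokens": 8_000, "retrieval": "embeddings"},
    "model-b": {"reasoning_effort": "medium", "use_compaction": True,
                "max_packet_tokens": 4_000, "retrieval": "keyword"},
}

DEFAULT_PROFILE = {"reasoning_effort": "medium", "use_compaction": True,
                   "max_packet_tokens": 4_000, "retrieval": "keyword"}


def runtime_settings(model_name: str) -> dict:
    # Fall back to conservative defaults for models without a tuned profile.
    return RUNTIME_PROFILES.get(model_name, DEFAULT_PROFILE)
```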
Strongest Claims This Research Can Support¶
- Existing methods already contain most of the discipline primitives needed for AI-native development: specs, examples, tests, prompts, and reviews (Spec Kit docs; SPDD; Cucumber BDD docs; Martin Fowler Testing Guide).
- What is still missing is a first-class method for runtime context management. Recent benchmark and vendor material repeatedly points to retrieval quality, compaction, and context efficiency as material variables in coding-agent performance (ContextBench; SWE-ContextBench; Anthropic context windows; OpenAI GPT-5.5 guide).
- The next useful innovation is therefore likely not “more prompting,” but better packetization, retrieval, verification, and compression.
Open Questions¶
- How much framework complexity is justified before process overhead outweighs token savings?
- Which context-packing strategies generalize across model families, and which require per-model tuning?
- Can context retrieval quality be improved enough that smaller or cheaper models become economically superior on many software tasks?
- Should the proposed tool be integrated into coding agents directly or remain an external orchestration layer?
- How should UI-heavy and visually exploratory software work be packetized, given that logic-centric methods transfer less cleanly there?
Recommended Next Steps¶
- Turn CDAD into a short whitepaper or paper-style memo with explicit comparison tables.
- Build a minimum viable tool that generates task packets from issue/spec files and links them to verification commands.
- Evaluate the tool on a small benchmark set using token, time, and correctness metrics.
- Compare against at least one baseline spec-driven workflow and one ad hoc prompting workflow.
- Refine the framework around measured failure modes rather than aesthetic process preferences.
Sources¶
Methods And Frameworks¶
- GitHub Spec Kit docs: https://github.github.com/spec-kit/
- GitHub Spec Kit repo: https://github.com/github/spec-kit/
- GitHub Blog, “Spec-driven development: Using Markdown as a programming language when building with AI”: https://github.blog/ai-and-ml/generative-ai/spec-driven-development-using-markdown-as-a-programming-language-when-building-with-ai/
- GitHub / resources article, “Spec-driven development with AI: Get started with a new open source toolkit”: https://resources.github.com/increasing-collaborative-development-with-ai/
- Martin Fowler, “Structured-Prompt-Driven Development (SPDD)”: https://martinfowler.com/articles/structured-prompt-driven/
- Martin Fowler, “Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl”: https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html
- Cucumber BDD docs: https://cucumber.io/docs/bdd
- Cucumber, “Introduction to TDD and BDD”: https://cucumber.io/blog/bdd/intro-to-bdd-and-tdd
- Martin Fowler Testing Guide: https://www.martinfowler.com/testing/
- Martin Fowler / Thoughtworks, “TDD with GitHub Copilot”: https://www.martinfowler.com/articles/exploring-gen-ai/06-tdd-with-coding-assistance.html
- UK Government Digital Service TDD guidance: https://gds-way.digital.cabinet-office.gov.uk/standards/test-driven-development.html#getting-better
Model And Agent Guidance¶
- Anthropic, context windows: https://docs.anthropic.com/en/docs/build-with-claude/context-windows
- Anthropic, Claude Code best practices: https://docs.anthropic.com/en/docs/claude-code/best-practices
- Anthropic, long-context prompting / prompting guide: https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/long-context-tips
- OpenAI, GPT-5.5 guide: https://developers.openai.com/api/docs/guides/gpt-5
- OpenAI, code generation guide: https://platform.openai.com/docs/guides/code-generation
- OpenAI, Codex Prompting Guide: https://developers.openai.com/cookbook/examples/gpt-5/codex_prompting_guide/
- Google, Gemini long context: https://ai.google.dev/gemini-api/docs/long-context
- Google, Gemini 2.5 Pro: https://ai.google.dev/gemini-api/docs/models/gemini-2.5-pro
- Kimi model list: https://platform.kimi.ai/docs/models
- NVIDIA Nemotron models: https://developer.nvidia.com/nemotron
- NVIDIA Nemotron coding guide: https://docs.nvidia.com/nemotron/latest/usage-cookbook/Nemotron-3-Super/OpenScaffoldingResources/README.html
Benchmarks And Research¶
- ContextBench: A Benchmark for Context Retrieval in Coding Agents (2026), arXiv:2602.05892: https://arxiv.org/abs/2602.05892
- SWE Context Bench: A Benchmark for Context Learning in Coding (2026), arXiv:2602.08316: https://arxiv.org/abs/2602.08316
- Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned (2026), arXiv:2603.05344: https://arxiv.org/abs/2603.05344
- SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents (2026), arXiv:2601.16746: https://arxiv.org/abs/2601.16746
- How Do AI Agents Spend Your Money? Analysing and Predicting Token Consumption in Agentic Coding Tasks (2026), arXiv:2604.22750: https://arxiv.org/abs/2604.22750v1
- Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering (2026): https://arxiv.org/html/2601.14470v1
- SWE-bench Verified: https://www.swebench.com/verified.html
- OpenAI, “Introducing SWE-bench Verified”: https://openai.com/index/introducing-swe-bench-verified/