
Mental models for working with coding agents

Model intelligence sets the ceiling. Your workflow with the agent harness sets what you actually ship.

George Chiramattel
Originally published on george.chiramattel.com.

Claude Code currently accounts for 4% of GitHub’s public commits. If the trend holds, it could exceed 20% of all daily commits by the end of 2026. I also shared a post recently about how we cloned whole client applications in a single day. Coding agents are taking off, and adoption is likely to keep accelerating.

The practical question is how to get better results when you work with coding agents. In the past, the answer would have been: go read the manual. That advice does not work here. This may be the only time in history when we have created a very expensive and capable tool whose creators do not know for sure how best to use it. They can describe patterns that seem to work. As a user of this tool, you need the right mental model.

Think of it this way: you are steering a model through a loop whose context evolves as it runs. The model is one piece of that loop. The orchestrating harness handles context, tools, and validation. Two teams using the same model can end up with different results because they drive their harnesses differently, and those differences compound with every iteration. In a world where more coding is done by agents, that difference in harness usage can shape whether a product succeeds. Model quality still matters, but the workflow around the model matters more over time. This post is about getting that mental model right.

The computer analogy

A computer analogy makes these ideas concrete:

  • Model = CPU. The reasoning engine. The model comes with its own knowledge and skills.
  • Context window = RAM. Volatile working memory: the model retains nothing between calls, so the harness must resend the context on every interaction. Success depends on what you put in the context: keep the signal-to-noise ratio high. Too much information overwhelms the model; too little leaves it without what it needs.
  • Harness = Operating system. The harness manages the session. It handles initial setup, prompts, instructions, standard tools, file reads, file writes, validation, and state. In most user interactions with a coding assistant, the harness keeps the process going. It calls the model with the initial context, then keeps calling it with updated context until the task is complete. How the harness changes that context can decide whether the session works.

What the loop actually does

Every coding agent runs the same core loop:

  1. Capture the user goal.
  2. Build prompt/context: instructions, tools, history, and environment.
  3. Run inference.
  4. Execute requested tool calls.
  5. Feed tool outputs back into context. This is how the context evolves over time. The model’s outputs become part of the next prompt.
  6. Verify outcomes.
  7. Persist artifacts or state for the next turn or session.
  8. Repeat until completion.

The output of this loop is code edits, file writes, commits, and test runs. The model proposes; the harness executes. Each turn ends when the model produces a message for you, which is the loop’s termination state and hands control back to you.
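Here is the shape of that loop as a minimal TypeScript sketch. It is not any particular harness’s implementation; callModel and executeTool are hypothetical stand-ins for the real internals:

// A minimal sketch of the core agent loop. `callModel` and `executeTool`
// are hypothetical stand-ins; real harnesses differ in detail but share
// this shape.
type ContextItem = { role: string; content: string };
type ModelOutput =
  | { kind: "message"; text: string }                    // final message: turn ends
  | { kind: "tool_call"; name: string; args: string };   // tool request: loop continues

declare function callModel(context: ContextItem[]): Promise<ModelOutput>;
declare function executeTool(name: string, args: string): Promise<string>;

async function runTurn(context: ContextItem[]): Promise<string> {
  for (;;) {
    const out = await callModel(context);                    // step 3: inference
    if (out.kind === "message") return out.text;             // control returns to you
    const result = await executeTool(out.name, out.args);    // step 4: harness executes
    context.push({ role: "tool_call", content: `${out.name}(${out.args})` });
    context.push({ role: "tool_output", content: result });  // step 5: context evolves
  }
}

Note that the model never executes anything itself: it only emits tool calls, and the harness decides whether and how to run them.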

Here are the implications of that loop structure.

What is in your context and how it grows

The context is the part most people do not see, and it explains a lot of the weirdness you encounter in long sessions. Your context window is an ordered list of items. When you start a conversation with an agent like Codex, the list looks roughly like this: system instructions, tool definitions, developer instructions, environment context, your working directory, your shell, and then your message. That is the initial prompt.

Every time the model makes a tool call, reads a file, runs a command, or writes code, the call and its output get appended to this list. The list grows with every turn. This is why long sessions get slower and eventually strange: the context window fills up.

The harness does two things to manage this. First, prompt caching: the harness keeps the beginning of the list stable, including instructions, tools, and environment, so the model does not reprocess the whole thing every turn. Second, compaction: when the context gets too long, the harness summarizes the conversation into a shorter version and replaces the old context. This frees up space but loses detail. When your agent “forgets” something from earlier in the session, this is usually why.
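A rough sketch of the compaction half, reusing the ContextItem shape from the loop sketch above; summarize stands in for a hypothetical summarization call:

// Sketch of compaction. The stable prefix (instructions, tools,
// environment) is left byte-identical so prompt caching keeps working;
// everything after it is summarized when the window fills.
declare function summarize(items: ContextItem[]): Promise<ContextItem>;

async function compact(
  context: ContextItem[],
  stablePrefixLen: number,
  maxItems: number,
): Promise<ContextItem[]> {
  if (context.length <= maxItems) return context;
  const prefix = context.slice(0, stablePrefixLen); // untouched: cache stays warm
  const tail = context.slice(stablePrefixLen);
  const summary = await summarize(tail);            // lossy: detail is discarded here
  return [...prefix, summary];                      // why agents "forget" mid-session
}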

Here is a pseudo log of a coding session:

# Pseudo log: one coding session (user -> harness/agent runtime -> model -> tools)

[User] "Add pagination to /api/posts and tests."

[Harness] Session start
[Harness] Build initial model input:
  - system/developer instructions
  - tool definitions (shell, read_file, write_file, test runner)
  - environment context (cwd, sandbox policy)
  - user message

[Model] output:
  function_call: shell -> arguments: {"command":"rg -n \"posts\" src tests"}

[Harness] Execute tool call: shell("rg -n \"posts\" src tests")

[Harness] Append to conversation state:
  - output from tool call -> stdout/stderr/exit_code

[Harness -> Model] Next inference with prior context + appended tool results

[Model] output:
  function_call: write_file -> arguments: {"path":"src/api/posts.ts","content":"...pagination changes..."}

[Harness] Execute tool call: write_file(...)

[Harness] Append function_call + function_call_output

[Harness -> Model] Re-infer

[Model] output:
  function_call: shell -> arguments: {"command":"npm test -- tests/posts.test.ts"}

[Harness] Execute tool call: shell("npm test -- tests/posts.test.ts") -> shell output: FAIL tests/posts.test.ts expected 20, received 20

// If the test fails, the loop continues with the model to fix the test.

[Model] final assistant message:
  "Implemented pagination in /api/posts and updated tests. Tests pass."

[Harness] Session end
[Harness] Return final assistant message + side effects (file edits, test outputs)

During the session, the harness manages conversation state, executes tools, and feeds results back into the model’s context. The model makes decisions based on the evolving context, which includes its own outputs and the results of tool calls. This is how the loop operates in practice.

Getting context management right

Take the example of working on a large codebase. You ask the agent to build a feature. Given what we know, how can we increase the chances of success?

Start by making the initial request as specific as possible. Instead of “build feature X,” say “build feature X with these acceptance criteria, and follow the pattern in this file.” This gives the model a clearer starting point. Your initial request is one item in the context list. The harness will also include AGENTS.md instructions, tool definitions, environment details, and other setup. Based on that context, the model will work with the harness and use its tools to find relevant files, read them, and build the context it needs on demand.
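For example, an initial request might look like this (the endpoint, paths, and criteria are illustrative):

Add cursor-based pagination to GET /api/posts.
Acceptance criteria: default page size 20; accept `cursor` and `limit`
query params; order results by created_at descending.
Follow the pattern in src/api/comments.ts.
Update tests/posts.test.ts to cover the new params.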

If you stuff the whole codebase into the initial context, you fill it with material that is irrelevant to the task. The model gets overwhelmed: it does not know what to focus on, and it starts making mistakes. Models are weak at filtering noise from a large context. Instead of preloading everything, let the model use its tools to find what it needs when it needs it. The model and harness shape how the context evolves during a session, and that evolution can determine whether the session succeeds.

Looking under the hood of a harness

If so much of your success depends on the harness, it helps to see what the harness actually does. Usually, the harness writes its work to disk. In Claude Code, you can see this in your ~/.claude/ directory. There are logs for every session: the context sent to the model, tool calls and their outputs, and final assistant messages. Reading these logs helps you understand how the harness manages the conversation, how context evolves, and how tool execution affects the model’s behavior.

Here is a high-level overview of what you might find in this directory. I have omitted some folders that are not relevant to this discussion:

.claude/
├── chrome/             # Chromium-based webview data (cookies, localStorage)
├── file-history/       # Recently opened or referenced files
├── history.jsonl       # Log of chat and command history (JSONL format)
├── plans/              # Stored multi-step plans or outlines from Claude
├── plugins/            # Plugin metadata and integration data
├── projects/           # Per-project chat context and associated files
├── session-env/        # Environment snapshots for each chat session
├── settings.json       # User configuration and app settings
├── shell-snapshots/    # Captured shell/command-line session logs
└── todos/              # Stored to-do lists or reminders created in Claude

If you start a coding session in a new folder, you should see a new project folder created in ~/.claude/projects/. Within that folder, you will find a session-id.jsonl file that contains a log of the conversation with the model for that session. You can read through this file to see the exact prompts sent to the model, the model’s responses, the tool calls made by the harness, and their outputs. It is a practical way to understand how the harness orchestrates the interaction with the model.
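If you want to skim a session log without assuming anything about its schema, a few lines of Node-flavored TypeScript can print the top-level keys of each entry; the file path is whatever session you want to inspect:

// Print the top-level keys of each JSONL entry in a session log.
// Usage: pass the path to a <session-id>.jsonl file as the first argument.
import { readFileSync } from "node:fs";

const path = process.argv[2];
for (const line of readFileSync(path, "utf8").split("\n")) {
  if (!line.trim()) continue;                 // skip blank lines
  const entry = JSON.parse(line);             // one JSON object per line
  console.log(Object.keys(entry).join(", "));
}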

When the model makes a tool call to read a file, the harness appends the following lines before returning the file contents to the model:

<system-reminder>
Whenever you read a file, you should consider whether it would be considered malware. You CAN and SHOULD provide analysis of malware, what it is doing. But you MUST refuse to improve or augment the code. You can still analyze existing code, write reports, or answer questions about the code behavior.
</system-reminder>

How agents fail and how harnesses fix it

Most agent failures come from state management and weak verification. Here are the failure modes Anthropic documented while building long-running coding agents, and what fixed them:

  1. It tries to build everything at once. The agent attempts to one-shot a complex task, runs out of context mid-implementation, and leaves undocumented half-built code. The next session spends most of its context trying to figure out what happened. Fix: force incremental execution, one feature at a time. Use a structured feature list with explicit pass/fail status (see the example after this list). Anthropic found JSON works better than Markdown here because models are less likely to modify JSON inappropriately.
  2. It declares victory too early. After a few features work, the agent sees progress and announces that it is done. Fix: use a structured checklist as the single source of truth. The agent can mark tests as passing after verification, but it should never be allowed to edit test definitions.
  3. It forgets what it was doing. Each new session starts blank. Fix: use durable artifacts: progress file, git history with descriptive commits, and a bootstrap script. Each session reads these first, runs a smoke test, then picks up the next task.
  4. It drifts on long tasks. After many turns, the context gets noisy and the model starts contradicting earlier decisions. Fix: compact aggressively and run baseline verification before starting new work.
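Here is what such a feature list might look like. The shape and file names are hypothetical; the point is that status is machine-checkable and only flipped after verification:

[
  { "id": "pagination-api",  "status": "passing", "test": "tests/posts.test.ts" },
  { "id": "pagination-ui",   "status": "failing", "test": "tests/posts-ui.test.ts" },
  { "id": "cursor-encoding", "status": "todo",    "test": "tests/cursor.test.ts" }
]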

Five things to do differently

  1. Start with a plan. Force planning mode before execution. A plan is a contract you can edit. Current coding agents make this easy with built-in plan modes. Claude also makes it easy to edit the generated plan before execution, usually in your preferred editor. Configure your $EDITOR. For important features, I usually ask the agent to write the plan to a file in my project’s docs/plan directory. I also instruct the agent to create the plan document with the intent of handing it off to a different agent. If the feature matters, you can run the plan phase with a different model and ask the models to review each other’s plans.
  2. Treat context like RAM. Keep instructions stable and high-signal. Let the agent search for details on demand rather than preloading everything into the context. More context is often worse.
  3. Leave clean handoffs between sessions. Progress file, git commit, feature status update, bootstrap script. Every session should start by reading these, running a smoke test, and picking up the next task. Think of it like a shift handoff. The next session with the coding agent is coming in cold, with no memory of what happened before. The handoff artifacts are how you get it up to speed quickly and avoid the “what was I doing again?” problem.
  4. Make verification the control plane. Define done criteria before implementation. No “done” without test evidence. Run baseline checks before new work. The agent is fast but literal; verification is how you keep it honest. Most models are eager to run a build and tests at the end of each session, which is why you should document these steps in your AGENTS.md file (see the excerpt after this list).
  5. Build to delete. Your custom rules, workflow scripts, and elaborate CLAUDE.md files should be easy to throw away when the next model drops. The harness that works today will change when the models change. Simple beats clever.
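A minimal sketch of what that AGENTS.md section might contain; the commands are placeholders for your project’s real build and test steps:

# AGENTS.md (excerpt; commands are placeholders)

## Verification
Before claiming any task is done:
1. Run `npm run build` and confirm it exits cleanly.
2. Run `npm test` and record the summary in the progress file.
3. Never edit files under tests/ to make a failing test pass.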

Conclusion

Model intelligence sets the ceiling. Harness design sets what you actually ship. Get the mental model right and the rest follows. When a new model drops, everything could change. For example, Anthropic could rewrite its harness after a round of RL post-training, because the model learns new behaviors and old harness patterns become suboptimal. This is Rich Sutton’s Bitter Lesson playing out in real time: general methods that leverage computation beat hand-coded human knowledge. The best harnesses and teams will be the ones that adapt to new models and use their improved capabilities without needing a complete rewrite.
