How an LLM Coding Agent Actually Builds Software
A practical breakdown of how coding agents work: model, tool loop, context management, patching, and verification.
The first time I tried to wire a local coding agent around Gemma, I thought the hard part would be the model.
It wasn’t.
The model looked flaky because my agent loop was flaky.
One of the first tasks I gave it was boring on purpose: find a file, make a small code change, then run the relevant test. The model did the first part correctly. It asked for the file. It asked for the test. Then it drifted. Instead of taking the next tool step, it started explaining what should happen next like a consultant with a checklist.
At first glance that looked like a model failure. It wasn’t. I was parsing the response stream too early and mishandling the turn after the tool result. The model never really got a clean chance to continue.
That changed how I think about coding agents.
A coding agent is not just a model with a bigger prompt. It is a small software system wrapped around a model. The model does the reasoning. The runtime does the state management, tool execution, file edits, and validation.
That distinction matters because people blame or praise “the model” for behavior the surrounding harness is actually causing.
After spending time with Gemma and OpenCode-style local workflows, I keep coming back to the same conclusion: the model is only one part of the system. The loop around it is what turns text generation into software work.
If I had to reduce the whole thing to one line, it would be this:
Most of what feels magical in a coding agent is just a model sitting inside a well-built loop.
What the system actually is
At a high level, a coding agent has five moving parts:
- The model that interprets the request and decides what to do next
- The prompt and context builder that prepares instructions and relevant repository state
- The tool runtime that executes shell commands, file reads, searches, and patches
- The agent loop that keeps calling the model after each tool result
- The verification layer that runs tests, linters, or builds before returning control
A normal chatbot answers once. An agent reads, acts, checks what happened, then goes again.
Here is the simplest version of the flow:
```
User request
  -> context builder
  -> model
  -> tool call
  -> runtime executes tool
  -> tool result goes back to model
  -> patch / command / follow-up tool call
  -> tests or lint
  -> final answer
```
Not glamorous, but more useful than model marketing.
Step 1: build the working context
Before the model can do useful work, the agent has to decide what context to provide.
That usually includes:
- system instructions
- the user’s request
- recent conversation history
- tool definitions
- repository structure or symbol summaries
- relevant file contents or search results
This part gets hand-waved a lot. The agent is not dumping an entire repository into the model. It is assembling a useful working set.
In practice, good agents do some combination of:
- repository mapping: build a high-level view of files, symbols, or modules
- targeted retrieval: read only the files that look relevant
- context trimming: keep the active state small enough to fit the model’s window
- caching: avoid re-reading the same large context on every turn
The important point is simple: context is assembled. It is not magically remembered.
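To make "assembled" concrete, here is a minimal sketch of a context builder that fills a token budget from a pre-ranked list of candidate files. All names (`build_context`, `estimate_tokens`) and the 4-characters-per-token heuristic are illustrative assumptions, not any real agent's API.

```python
# Minimal sketch of a context builder: keep the instructions and the
# request, then add candidate files until a token budget runs out.

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)

def build_context(system: str, request: str,
                  candidates: list[tuple[str, str]],
                  budget: int = 2000) -> list[dict]:
    """Assemble a working set that fits the model's window."""
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": request},
    ]
    used = sum(estimate_tokens(m["content"]) for m in messages)
    for path, content in candidates:  # assumed pre-ranked by relevance
        cost = estimate_tokens(content)
        if used + cost > budget:
            break  # trim instead of overflowing the context window
        messages.append({"role": "user", "content": f"File {path}:\n{content}"})
        used += cost
    return messages

ctx = build_context(
    "You are a coding agent.",
    "Fix the failing login test.",
    [("app/services/auth.rb", "class Auth\nend\n" * 50),
     ("spec/requests/login_spec.rb", "describe 'login'\n" * 50)],
    budget=400,
)
```

With a 400-token budget, only the first candidate file fits; the second is dropped rather than truncating the model's window.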
Step 2: let the model decide the next move
Once the context is ready, the model gets a turn.
For a coding task such as “fix the failing login flow,” the model is not supposed to immediately output code. A good model will first decide what information it still needs:
- inspect the auth code
- search for the failing controller or service
- read the relevant test
- run the test suite or a narrowed test target
Reasoning helps, but reasoning on its own is not enough. If the runtime does not support tools and iteration, the model can only describe a plan. It cannot carry it out.
That is the gap between:
- “You should inspect auth.rb and run the tests”
- actually reading auth.rb, running the tests, seeing the failure, and proposing a patch
If you have built one of these loops badly, you can feel the difference immediately. The model sounds smart. Nothing gets done.
Step 3: turn intent into tool calls
The model does not directly touch your filesystem. It emits structured intent.
Depending on the runtime, that may look like a function call, JSON object, or tool invocation block. The meaning is the same:
“Run this shell command.”
“Read this file.”
“Apply this patch.”
This is one of the most useful mental models in the whole setup:
The model does not execute tools. The runtime executes tools.
That separation is what makes the system manageable. The runtime can validate arguments, reject unsafe actions, log what happened, and feed the results back into the conversation.
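A sketch of what that runtime side can look like, assuming a simple dict-shaped tool call. The tool names, call shape, and stubbed results are all hypothetical; the point is that validation and rejection happen before anything touches the filesystem.

```python
# Sketch of a tool dispatcher: the model emits structured intent, and
# the runtime validates it before executing anything. Failures are
# returned as tool results so the model can react to them.

ALLOWED_TOOLS = {
    "read_file": {"path"},
    "run_command": {"cmd"},
}

def execute_tool(call: dict) -> dict:
    name, args = call.get("name"), call.get("arguments", {})
    # Reject unknown tools instead of guessing.
    if name not in ALLOWED_TOOLS:
        return {"role": "tool", "name": name,
                "content": f"error: unknown tool {name!r}"}
    # Reject malformed arguments instead of half-executing.
    missing = ALLOWED_TOOLS[name] - set(args)
    if missing:
        return {"role": "tool", "name": name,
                "content": f"error: missing args {sorted(missing)}"}
    # Dispatch. A real runtime would sandbox, log, and time out here.
    if name == "read_file":
        result = f"(contents of {args['path']})"  # stubbed for the sketch
    else:
        result = f"(output of {args['cmd']})"
    return {"role": "tool", "name": name, "content": result}

ok = execute_tool({"name": "read_file",
                   "arguments": {"path": "app/services/auth.rb"}})
bad = execute_tool({"name": "delete_repo", "arguments": {}})
```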
It also means a lot of agent bugs are not really model bugs. They are runtime bugs:
- tool calls parsed incorrectly
- partial streaming output handled too early
- malformed tool results appended to history
- message roles mismatched for the target model
- missing loop after a tool result
I ran into exactly this in my own local setup. The model looked flaky until I realized the harness was the flaky part.
One subtle point here: the runtime is not just a dumb pipe. It defines the contract. It decides which tools exist, what arguments are allowed, how results are formatted, and what the model gets back when something fails, times out, or succeeds. The model can only operate inside that contract.
Message formatting is part of the system
This sounds boring until it breaks.
Different models and runtimes expect different message shapes, role names, and tool-call formats. Some want explicit tool messages with IDs. Some expect tool results folded back into a user turn. Some tolerate loose formatting. Some absolutely do not.
If you get this wrong, the failure mode is annoying because it does not always look like a protocol error. It just looks like the model got weird. It ignores a tool result. It repeats itself. It forgets what just happened. It starts narrating instead of acting.
That is another reason I hesitate when people talk about agent quality as if it were just a model ranking problem. A surprising amount of the real work is message plumbing.
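As a small illustration of the same tool result taking two different shapes, here are two hypothetical target formats: one runtime that wants an explicit tool message with a call ID, and one that folds the result back into a user turn. Neither shape is a specific vendor's API; they stand in for the kind of divergence you have to handle.

```python
# The same tool output, formatted for two hypothetical message protocols.

def as_tool_message(call_id: str, output: str) -> dict:
    # Style A: explicit tool role, keyed back to the call that caused it.
    return {"role": "tool", "tool_call_id": call_id, "content": output}

def as_user_turn(tool_name: str, output: str) -> dict:
    # Style B: the result is folded into an ordinary user turn.
    return {"role": "user", "content": f"[{tool_name} result]\n{output}"}

a = as_tool_message("call_1", "1 example, 0 failures")
b = as_user_turn("run_test", "1 example, 0 failures")
```

Send shape A to a model that expects shape B and nothing errors. The model just "gets weird."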
Step 4: execute, observe, loop
This is the part many first-time agent builders miss.
The basic loop looks like this:
```python
messages = initial_context
while True:
    response = llm(messages, tools=tools)
    if response.tool_calls:
        messages.append(response)
        for call in response.tool_calls:
            result = execute_tool(call)
            messages.append(result)
        continue
    return response.final_text
```
The crucial line is continue.
After each tool result, the model needs another turn. That is how it moves from:
- reading files
- forming a hypothesis
- patching code
- running tests
- adjusting the patch if the tests still fail
Without that loop, you do not have much of an agent. You have a one-shot assistant that knows how to talk about tool syntax.
The runtime also needs clear stopping conditions. A good agent should stop when the checks pass, when it is genuinely blocked and needs user input, or when another retry is just burning tokens without improving anything. Otherwise you get the other classic failure mode: the agent that keeps “working” long after it should have stopped.
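Those stopping conditions can be layered on the basic loop in a few lines. This is a sketch under obvious simplifications: `step` stands in for one full model turn plus tool execution, and the three status strings are invented for the example.

```python
# Sketch of stopping conditions: stop when the work is done, stop when
# the agent is blocked on the user, and stop after a retry budget so a
# stuck loop does not burn tokens forever.

def run_agent(step, max_turns: int = 8):
    for turn in range(1, max_turns + 1):
        status = step(turn)  # returns "done", "blocked", or "continue"
        if status == "done":
            return ("done", turn)
        if status == "blocked":
            return ("needs_user_input", turn)
    # Past this point, retries are just burning tokens.
    return ("gave_up", max_turns)

# Simulated run: two failed attempts, then the checks pass on turn 3.
outcome = run_agent(lambda t: "done" if t >= 3 else "continue")
```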
A tiny end-to-end example
This is what a healthy loop looks like in practice.
Imagine the user asks:
“Fix the failing login test.”
What happens next is usually something like this:
- The agent searches for the failing test or runs a narrowed test command.
- The runtime sends the failure output back to the model.
- The model asks to read auth.rb and the matching test file.
- The runtime returns both file contents.
- The model proposes a small patch.
- The runtime applies the patch.
- The model asks to rerun the test.
- The runtime returns either a pass or a new failure.
- If it still fails, the loop continues.
In rough pseudo-transcript form:
```
user: Fix the failing login test.
assistant -> tool: run_test("bundle exec rspec spec/requests/login_spec.rb")
tool -> assistant: failure in "returns 401 for expired token"
assistant -> tool: read_file("app/services/auth.rb")
assistant -> tool: read_file("spec/requests/login_spec.rb")
tool -> assistant: [file contents]
assistant -> tool: apply_patch(...)
tool -> assistant: patch applied
assistant -> tool: run_test("bundle exec rspec spec/requests/login_spec.rb")
tool -> assistant: 1 example, 0 failures
assistant: Fixed. The token expiry check was comparing strings instead of timestamps.
```
That is the job. Not one huge leap of intelligence. A sequence of small moves grounded in feedback.
Parallel work helps, but only when the dependency graph is real
A naive agent does everything in sequence. Better agents can overlap independent work.
Reading three files at once is usually fine. Searching two directories in parallel is usually fine. Running a linter and a type check at the same time is often fine.
But the runtime has to know where parallelism stops being safe. Reading a file and patching it at the same time is a bug. Running a test against code that another step is still modifying is a bug. Parallelism is useful, but only when the operations are actually independent.
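The safe case, fanning out independent reads, is easy to sketch with a thread pool. The `read_file` function is stubbed here; in a real runtime it would be an actual filesystem read. The key structural point is that any write happens after the pool drains, never inside it.

```python
# Independent reads can overlap; reads and writes cannot. Fan out the
# read side only, then do any patching after the reads complete.

from concurrent.futures import ThreadPoolExecutor

def read_file(path: str) -> str:
    return f"(contents of {path})"  # stand-in for a real filesystem read

paths = [
    "app/services/auth.rb",
    "spec/requests/login_spec.rb",
    "config/routes.rb",
]

# These reads have no dependencies on each other, so run them together.
with ThreadPoolExecutor(max_workers=3) as pool:
    contents = dict(zip(paths, pool.map(read_file, paths)))

# Patching stays outside the pool: it depends on every read being done.
```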
Step 5: make precise edits instead of rewriting everything
When the agent decides to change code, the safest path is usually not “rewrite the whole file.”
Better runtimes prefer targeted edits such as:
- search-and-replace for a unique block
- line-oriented patching
- unified diff application
That approach helps for two reasons:
- It reduces accidental damage to unrelated code.
- It gives the model a more stable editing primitive for iterative fixes.
This is one reason patch-based workflows feel noticeably more reliable than naive full-file rewrites.
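The search-and-replace primitive is worth seeing in miniature, because its safety comes from one check: the edit only applies when the search block matches exactly once. This is a sketch, not any particular runtime's implementation; the Ruby snippet being patched echoes the expired-token bug from the transcript above.

```python
# Sketch of a search-and-replace edit primitive. Refusing zero matches
# catches stale context; refusing multiple matches catches ambiguity.

def apply_edit(source: str, search: str, replace: str) -> str:
    count = source.count(search)
    if count == 0:
        raise ValueError("search block not found; file may have changed")
    if count > 1:
        raise ValueError("search block not unique; refusing ambiguous edit")
    return source.replace(search, replace)

code = "def expired?(token)\n  token.expiry < Time.now.to_s\nend\n"
patched = apply_edit(
    code,
    "token.expiry < Time.now.to_s",  # comparing strings: the bug
    "token.expiry < Time.now",       # comparing timestamps: the fix
)
```

Everything outside the matched block is untouched, which is exactly the property a full-file rewrite cannot guarantee.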
Step 6: check the work against reality
An agent is only useful if it can compare its changes against reality.
For software tasks, “reality” usually means one or more of:
- tests
- linters
- type checks
- builds
- runtime output
The model proposes a change. The runtime runs the relevant checks. The model then sees the result and decides whether the job is actually done.
That is the difference between a flashy demo and a tool you might actually trust. The demo stops when the code looks plausible. The useful tool stops when the environment says the change holds up.
This is also where models run into a hard limit. They are good at predicting plausible next steps. They are much worse at knowing, from their own internal confidence alone, whether those steps actually worked. That is why verification is not optional. The model’s guess is not the ground truth.
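A minimal version of that verification gate: parse the check's output and let the environment, not the model, decide whether the task is done. The summary format here mirrors the rspec-style "N examples, M failures" line from the transcript; treating unparseable output as "not verified" is a deliberate conservative choice.

```python
# Sketch of a verification gate: done-ness comes from the test runner's
# output, never from the model's own confidence.

import re

def checks_pass(test_output: str) -> bool:
    m = re.search(r"(\d+) examples?, (\d+) failures?", test_output)
    if not m:
        return False  # unparseable output counts as unverified
    return int(m.group(2)) == 0

done = checks_pass("1 example, 0 failures")
not_done = checks_pass("3 examples, 1 failure")
```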
Where agents usually break
When people say a coding agent “just kind of fell apart,” the failure is often boring:
- the model emitted a tool call across multiple stream chunks and the runtime acted too early
- the tool result got appended in the wrong format
- the agent lost the thread after a long wall of shell output
- the patch applied, but the model never saw the real post-patch state
- the patch failed to apply cleanly and the retry logic made things worse
- a command timed out and the runtime treated that like useful output
- the system skipped verification and returned confident nonsense
This is why I am suspicious of sweeping claims about model quality without any discussion of runtime quality. A fragile harness can make a good model look bad. A disciplined harness can make a merely decent model feel much better than expected.
Context management is where things quietly break
As the session gets longer, the agent’s job gets harder.
Every tool result, file read, and patch explanation consumes context window space. If you keep everything, the model eventually drowns in stale logs and low-value history.
So real agents need compaction strategies:
- keep recent turns verbatim
- summarize older work
- drop noisy command output
- retain the current plan and latest repository state
- preserve durable instructions while discarding dead ends
This is not the glamorous part of agent design, but it matters more than people think. A lot of agent failures are really context failures wearing a fake mustache.
There is also a tradeoff here that people skip past too quickly: compaction is lossy. Summaries are useful, but sometimes the exact detail you threw away is the detail you needed three turns later. Long-running agents are always balancing recall against context budget.
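The simplest compaction strategy from the list above, keep recent turns verbatim and collapse the rest, fits in a few lines. The summarizer is stubbed as a placeholder string; a real agent would typically ask the model itself to write the summary, which is exactly where the lossiness comes from.

```python
# Sketch of lossy compaction: the last few turns survive verbatim,
# everything older collapses into a single summary message.

def compact(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {
        "role": "system",
        "content": f"[summary of {len(old)} earlier turns elided]",
    }
    return [summary] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact(history)
```

Ten turns become five messages. Whatever was only in the six elided turns is now gone, which is the recall-versus-budget tradeoff in its rawest form.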
The model matters, but the harness matters more
Different models are better or worse at planning, tool use, structured output, and code generation. That absolutely affects the experience.
But once you start building agents, you realize something uncomfortable:
A strong model in a weak harness is frustrating. A decent model in a strong harness is often more useful than it has any right to be.
The harness determines whether the model can:
- find the right file
- survive long sessions
- recover from failed commands
- apply surgical edits
- prove that the task is complete
That is why two products using similarly capable models can feel wildly different in practice.
Why this gets harder locally
This point matters even more for local agents.
Cloud coding products usually have polished runtimes, mature prompt formatting, and enough infrastructure around the model to hide a lot of rough edges. Local setups are less forgiving.
You run into issues like:
- tighter memory limits
- smaller practical context windows
- worse latency when you overfeed the model
- more brittle tool calling
- more prompt-format sensitivity
- less guardrail infrastructure around long sessions
That does not make local agents pointless. I still like them. It just means the boring systems work matters even more. If your local agent feels unstable, it may not need a smarter model first. It may need a better loop, cleaner context, and stricter verification.
Permissions, sandboxing, and safety are part of the design
Another missing piece in a lot of simplified agent diagrams is the operating envelope.
Real coding agents usually do not have unlimited power. Some tools are read-only. Some filesystem paths are writable and others are not. Some commands require explicit approval. Network access may be blocked. Destructive operations may be denied or wrapped in extra checks.
That is not an annoying implementation detail. It is part of how the system works. The runtime is not just giving the model hands. It is also deciding what the hands are allowed to touch.
The same goes for observability. If the agent cannot show you what tool it called, what came back, what got truncated, and why it stopped, debugging turns into superstition.
A better way to think about coding agents
The mental model I keep settling on is this:
- the model is the reasoning engine
- the agent runtime is the operating system around that engine
The runtime gives the model senses, memory, and hands:
- senses through file reads, search, test output, and external tools
- memory through conversation state, summaries, and cached repository context
- hands through patches, shell commands, and API calls
Once you see the system that way, a lot of confusing behavior stops being confusing. You stop asking, “Why didn’t the model just do it?” and start asking the more useful question:
“What part of the agent loop failed?”
What I am leaving out on purpose
There are more advanced pieces beyond the basic loop:
- planner/executor splits
- long-term memory systems
- background agents
- richer approval flows
- evaluation harnesses
- multi-agent coordination
Those matter, but they come later.
The first-order problem is still the same boring one: can the model ask for a tool, can the runtime execute it, can the result get fed back correctly, and can the system verify the change before it stops?
Final takeaway
An LLM coding agent does not build software by generating one brilliant answer.
It builds software by repeatedly doing four things well:
- gathering the right context
- choosing the next action
- executing that action through tools
- checking the result against reality
If you want to build a better local agent, spend less time imagining a magical autonomous coder and more time improving those four steps.
What looks like intelligence is often just good plumbing.