How an LLM Coding Agent Actually Builds Software
A practical breakdown of how coding agents work: model, tool loop, context management, patching, and verification.
The first time I tried to wire a local coding agent around Gemma, I thought the hard part would be the model.
It wasn’t.
The model looked flaky because my agent loop was flaky.
One of the first tasks I gave it was boring on purpose: find a file, make a small code change, then run the relevant test. The model did the first part correctly. It asked for the file. It asked for the test. Then it drifted. Instead of taking the next tool step, it started explaining what should happen next like a consultant with a checklist.
At first glance that looked like a model failure. It wasn’t. I was parsing the response stream too early and mishandling the turn after the tool result. The model never really got a clean chance to continue.
That changed how I think about coding agents.
A coding agent is not just a model with a bigger prompt. It is a small software system wrapped around a model. The model does the reasoning. The runtime does the state management, tool execution, file edits, and validation.
That distinction matters because people blame or praise “the model” for behavior the surrounding harness is actually causing.
After spending time with Gemma and OpenCode-style local workflows, I keep coming back to the same conclusion: the model is only one part of the system. The loop around it is what turns text generation into software work.
If I had to reduce the whole thing to one line, it would be this:
Most of what feels magical in a coding agent is just a model sitting inside a well-built loop.
What the system actually is
At a high level, a coding agent has five moving parts:
- The model that interprets the request and decides what to do next
- The prompt and context builder that prepares instructions and relevant repository state
- The tool runtime that executes shell commands, file reads, searches, and patches
- The agent loop that keeps calling the model after each tool result
- The verification layer that runs tests, linters, or builds before returning control
A normal chatbot answers once. An agent reads, acts, checks what happened, then goes again.
Here is the simplest version of the flow:
```
User request
  -> context builder
  -> model
  -> tool call
  -> runtime executes tool
  -> tool result goes back to model
  -> patch / command / follow-up tool call
  -> tests or lint
  -> final answer
```
Not glamorous, but more useful than model marketing.
Step 1: build the working context
Before the model can do useful work, the agent has to decide what context to provide.
That usually includes:
- system instructions
- the user’s request
- recent conversation history
- tool definitions
- repository structure or symbol summaries
- relevant file contents or search results
This part gets hand-waved a lot. The agent is not dumping an entire repository into the model. It is assembling a useful working set.
In practice, good agents do some combination of:
- repository mapping: build a high-level view of files, symbols, or modules
- targeted retrieval: read only the files that look relevant
- context trimming: keep the active state small enough to fit the model’s window
- caching: avoid re-reading the same large context on every turn
The important point is simple: context is assembled. It is not magically remembered.
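To make "assembled" concrete, here is a minimal sketch of a context builder that fills a token budget from a pre-ranked list of candidate files. All names (`build_context`, `estimate_tokens`) and the 4-characters-per-token heuristic are illustrative assumptions, not any real agent's API.

```python
# Minimal sketch of a context builder: keep the instructions and the
# request, then add candidate files until a token budget runs out.

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)

def build_context(system: str, request: str,
                  candidates: list[tuple[str, str]],
                  budget: int = 2000) -> list[dict]:
    """Assemble a working set that fits the model's window."""
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": request},
    ]
    used = sum(estimate_tokens(m["content"]) for m in messages)
    for path, content in candidates:  # assumed pre-ranked by relevance
        cost = estimate_tokens(content)
        if used + cost > budget:
            break  # trim instead of overflowing the context window
        messages.append({"role": "user", "content": f"File {path}:\n{content}"})
        used += cost
    return messages

ctx = build_context(
    "You are a coding agent.",
    "Fix the failing login test.",
    [("app/services/auth.rb", "class Auth\nend\n" * 50),
     ("spec/requests/login_spec.rb", "describe 'login'\n" * 50)],
    budget=400,
)
```

With a 400-token budget, only the first candidate file fits; the second is dropped rather than truncating the model's window.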
Step 2: let the model decide the next move
Once the context is ready, the model gets a turn.
For a coding task such as “fix the failing login flow,” the model is not supposed to immediately output code. A good model will first decide what information it still needs:
- inspect the auth code
- search for the failing controller or service
- read the relevant test
- run the test suite or a narrowed test target
Reasoning helps, but reasoning on its own is not enough. If the runtime does not support tools and iteration, the model can only describe a plan. It cannot carry it out.
That is the gap between:
- “You should inspect auth.rb and run the tests”
- actually reading auth.rb, running the tests, seeing the failure, and proposing a patch
If you have built one of these loops badly, you can feel the difference immediately. The model sounds smart. Nothing gets done.
Step 3: turn intent into tool calls
The model does not directly touch your filesystem. It emits structured intent.
Depending on the runtime, that may look like a function call, JSON object, or tool invocation block. The meaning is the same:
“Run this shell command.”
“Read this file.”
“Apply this patch.”
This is one of the most useful mental models in the whole setup:
The model does not execute tools. The runtime executes tools.
That separation is what makes the system manageable. The runtime can validate arguments, reject unsafe actions, log what happened, and feed the results back into the conversation.
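A sketch of what that runtime side can look like, assuming a simple dict-shaped tool call. The tool names, call shape, and stubbed results are all hypothetical; the point is that validation and rejection happen before anything touches the filesystem.

```python
# Sketch of a tool dispatcher: the model emits structured intent, and
# the runtime validates it before executing anything. Failures are
# returned as tool results so the model can react to them.

ALLOWED_TOOLS = {
    "read_file": {"path"},
    "run_command": {"cmd"},
}

def execute_tool(call: dict) -> dict:
    name, args = call.get("name"), call.get("arguments", {})
    # Reject unknown tools instead of guessing.
    if name not in ALLOWED_TOOLS:
        return {"role": "tool", "name": name,
                "content": f"error: unknown tool {name!r}"}
    # Reject malformed arguments instead of half-executing.
    missing = ALLOWED_TOOLS[name] - set(args)
    if missing:
        return {"role": "tool", "name": name,
                "content": f"error: missing args {sorted(missing)}"}
    # Dispatch. A real runtime would sandbox, log, and time out here.
    if name == "read_file":
        result = f"(contents of {args['path']})"  # stubbed for the sketch
    else:
        result = f"(output of {args['cmd']})"
    return {"role": "tool", "name": name, "content": result}

ok = execute_tool({"name": "read_file",
                   "arguments": {"path": "app/services/auth.rb"}})
bad = execute_tool({"name": "delete_repo", "arguments": {}})
```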
It also means a lot of agent bugs are not really model bugs. They are runtime bugs:
- tool calls parsed incorrectly
- partial streaming output handled too early
- malformed tool results appended to history
- message roles mismatched for the target model
- missing loop after a tool result
I ran into exactly this in my own local setup. The model looked flaky until I realized the harness was the flaky part.
One subtle point here: the runtime is not just a dumb pipe. It defines the contract. It decides which tools exist, what arguments are allowed, how results are formatted, and what the model gets back when something fails, times out, or succeeds. The model can only operate inside that contract.
Message formatting is part of the system
This sounds boring until it breaks.
Different models and runtimes expect different message shapes, role names, and tool-call formats. Some want explicit tool messages with IDs. Some expect tool results folded back into a user turn. Some tolerate loose formatting. Some absolutely do not.
If you get this wrong, the failure mode is annoying because it does not always look like a protocol error. It just looks like the model got weird. It ignores a tool result. It repeats itself. It forgets what just happened. It starts narrating instead of acting.
That is another reason I hesitate when people talk about agent quality as if it were just a model ranking problem. A surprising amount of the real work is message plumbing.
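As a small illustration of the same tool result taking two different shapes, here are two hypothetical target formats: one runtime that wants an explicit tool message with a call ID, and one that folds the result back into a user turn. Neither shape is a specific vendor's API; they stand in for the kind of divergence you have to handle.

```python
# The same tool output, formatted for two hypothetical message protocols.

def as_tool_message(call_id: str, output: str) -> dict:
    # Style A: explicit tool role, keyed back to the call that caused it.
    return {"role": "tool", "tool_call_id": call_id, "content": output}

def as_user_turn(tool_name: str, output: str) -> dict:
    # Style B: the result is folded into an ordinary user turn.
    return {"role": "user", "content": f"[{tool_name} result]\n{output}"}

a = as_tool_message("call_1", "1 example, 0 failures")
b = as_user_turn("run_test", "1 example, 0 failures")
```

Send shape A to a model that expects shape B and nothing errors. The model just "gets weird."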
Step 4: execute, observe, loop
This is the part many first-time agent builders miss.
The basic loop looks like this:
```python
messages = initial_context
while True:
    response = llm(messages, tools=tools)
    if response.tool_calls:
        messages.append(response)
        for call in response.tool_calls:
            result = execute_tool(call)
            messages.append(result)
        continue
    return response.final_text
```
The crucial line is continue.
After each tool result, the model needs another turn. That is how it moves from:
- reading files
- forming a hypothesis
- patching code
- running tests
- adjusting the patch if the tests still fail
Without that loop, you do not have much of an agent. You have a one-shot assistant that knows how to talk about tool syntax.
The runtime also needs clear stopping conditions. A good agent should stop when the checks pass, when it is genuinely blocked and needs user input, or when another retry is just burning tokens without improving anything. Otherwise you get the other classic failure mode: the agent that keeps “working” long after it should have stopped.
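Those stopping conditions can be layered on the basic loop in a few lines. This is a sketch under obvious simplifications: `step` stands in for one full model turn plus tool execution, and the three status strings are invented for the example.

```python
# Sketch of stopping conditions: stop when the work is done, stop when
# the agent is blocked on the user, and stop after a retry budget so a
# stuck loop does not burn tokens forever.

def run_agent(step, max_turns: int = 8):
    for turn in range(1, max_turns + 1):
        status = step(turn)  # returns "done", "blocked", or "continue"
        if status == "done":
            return ("done", turn)
        if status == "blocked":
            return ("needs_user_input", turn)
    # Past this point, retries are just burning tokens.
    return ("gave_up", max_turns)

# Simulated run: two failed attempts, then the checks pass on turn 3.
outcome = run_agent(lambda t: "done" if t >= 3 else "continue")
```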
A tiny end-to-end example
This is what a healthy loop looks like in practice.
Imagine the user asks:
“Fix the failing login test.”
What happens next is usually something like this:
- The agent searches for the failing test or runs a narrowed test command.
- The runtime sends the failure output back to the model.
- The model asks to read auth.rb and the matching test file.
- The runtime returns both file contents.
- The model proposes a small patch.
- The runtime applies the patch.
- The model asks to rerun the test.
- The runtime returns either a pass or a new failure.
- If it still fails, the loop continues.
In rough pseudo-transcript form:
```
user: Fix the failing login test.
assistant -> tool: run_test("bundle exec rspec spec/requests/login_spec.rb")
tool -> assistant: failure in "returns 401 for expired token"
assistant -> tool: read_file("app/services/auth.rb")
assistant -> tool: read_file("spec/requests/login_spec.rb")
tool -> assistant: [file contents]
assistant -> tool: apply_patch(...)
tool -> assistant: patch applied
assistant -> tool: run_test("bundle exec rspec spec/requests/login_spec.rb")
tool -> assistant: 1 example, 0 failures
assistant: Fixed. The token expiry check was comparing strings instead of timestamps.
```
That is the job. Not one huge leap of intelligence. A sequence of small moves grounded in feedback.
Parallel work helps, but only when the dependency graph is real
A naive agent does everything in sequence. Better agents can overlap independent work.
Reading three files at once is usually fine. Searching two directories in parallel is usually fine. Running a linter and a type check at the same time is often fine.
But the runtime has to know where parallelism stops being safe. Reading a file and patching it at the same time is a bug. Running a test against code that another step is still modifying is a bug. Parallelism is useful, but only when the operations are actually independent.
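The safe case, fanning out independent reads, is easy to sketch with a thread pool. The `read_file` function is stubbed here; in a real runtime it would be an actual filesystem read. The key structural point is that any write happens after the pool drains, never inside it.

```python
# Independent reads can overlap; reads and writes cannot. Fan out the
# read side only, then do any patching after the reads complete.

from concurrent.futures import ThreadPoolExecutor

def read_file(path: str) -> str:
    return f"(contents of {path})"  # stand-in for a real filesystem read

paths = [
    "app/services/auth.rb",
    "spec/requests/login_spec.rb",
    "config/routes.rb",
]

# These reads have no dependencies on each other, so run them together.
with ThreadPoolExecutor(max_workers=3) as pool:
    contents = dict(zip(paths, pool.map(read_file, paths)))

# Patching stays outside the pool: it depends on every read being done.
```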
Step 5: make precise edits instead of rewriting everything
When the agent decides to change code, the safest path is usually not “rewrite the whole file.”
Better runtimes prefer targeted edits such as:
- search-and-replace for a unique block
- line-oriented patching
- unified diff application
That approach helps for two reasons:
- It reduces accidental damage to unrelated code.
- It gives the model a more stable editing primitive for iterative fixes.
This is one reason patch-based workflows feel noticeably more reliable than naive full-file rewrites.
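The search-and-replace primitive is worth seeing in miniature, because its safety comes from one check: the edit only applies when the search block matches exactly once. This is a sketch, not any particular runtime's implementation; the Ruby snippet being patched echoes the expired-token bug from the transcript above.

```python
# Sketch of a search-and-replace edit primitive. Refusing zero matches
# catches stale context; refusing multiple matches catches ambiguity.

def apply_edit(source: str, search: str, replace: str) -> str:
    count = source.count(search)
    if count == 0:
        raise ValueError("search block not found; file may have changed")
    if count > 1:
        raise ValueError("search block not unique; refusing ambiguous edit")
    return source.replace(search, replace)

code = "def expired?(token)\n  token.expiry < Time.now.to_s\nend\n"
patched = apply_edit(
    code,
    "token.expiry < Time.now.to_s",  # comparing strings: the bug
    "token.expiry < Time.now",       # comparing timestamps: the fix
)
```

Everything outside the matched block is untouched, which is exactly the property a full-file rewrite cannot guarantee.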
Step 6: check the work against reality
An agent is only useful if it can compare its changes against reality.
For software tasks, “reality” usually means one or more of:
- tests
- linters
- type checks
- builds
- runtime output
The model proposes a change. The runtime runs the relevant checks. The model then sees the result and decides whether the job is actually done.
That is the difference between a flashy demo and a tool you might actually trust. The demo stops when the code looks plausible. The useful tool stops when the environment says the change holds up.
This is also where models run into a hard limit. They are good at predicting plausible next steps. They are much worse at knowing, from their own internal confidence alone, whether those steps actually worked. That is why verification is not optional. The model’s guess is not the ground truth.
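A minimal version of that verification gate: parse the check's output and let the environment, not the model, decide whether the task is done. The summary format here mirrors the rspec-style "N examples, M failures" line from the transcript; treating unparseable output as "not verified" is a deliberate conservative choice.

```python
# Sketch of a verification gate: done-ness comes from the test runner's
# output, never from the model's own confidence.

import re

def checks_pass(test_output: str) -> bool:
    m = re.search(r"(\d+) examples?, (\d+) failures?", test_output)
    if not m:
        return False  # unparseable output counts as unverified
    return int(m.group(2)) == 0

done = checks_pass("1 example, 0 failures")
not_done = checks_pass("3 examples, 1 failure")
```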
Where agents usually break
When people say a coding agent “just kind of fell apart,” the failure is often boring:
- the model emitted a tool call across multiple stream chunks and the runtime acted too early
- the tool result got appended in the wrong format
- the agent lost the thread after a long wall of shell output
- the patch applied, but the model never saw the real post-patch state
- the patch failed to apply cleanly and the retry logic made things worse
- a command timed out and the runtime treated that like useful output
- the system skipped verification and returned confident nonsense
This is why I am suspicious of sweeping claims about model quality without any discussion of runtime quality. A fragile harness can make a good model look bad. A disciplined harness can make a merely decent model feel much better than expected.
Context management is where things quietly break
As the session gets longer, the agent’s job gets harder.
Every tool result, file read, and patch explanation consumes context window space. If you keep everything, the model eventually drowns in stale logs and low-value history.
So real agents need compaction strategies:
- keep recent turns verbatim
- summarize older work
- drop noisy command output
- retain the current plan and latest repository state
- preserve durable instructions while discarding dead ends
This is not the glamorous part of agent design, but it matters more than people think. A lot of agent failures are really context failures wearing a fake mustache.
There is also a tradeoff here that people skip past too quickly: compaction is lossy. Summaries are useful, but sometimes the exact detail you threw away is the detail you needed three turns later. Long-running agents are always balancing recall against context budget.
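The simplest compaction strategy from the list above, keep recent turns verbatim and collapse the rest, fits in a few lines. The summarizer is stubbed as a placeholder string; a real agent would typically ask the model itself to write the summary, which is exactly where the lossiness comes from.

```python
# Sketch of lossy compaction: the last few turns survive verbatim,
# everything older collapses into a single summary message.

def compact(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {
        "role": "system",
        "content": f"[summary of {len(old)} earlier turns elided]",
    }
    return [summary] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact(history)
```

Ten turns become five messages. Whatever was only in the six elided turns is now gone, which is the recall-versus-budget tradeoff in its rawest form.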
The model matters, but the harness matters more
Different models are better or worse at planning, tool use, structured output, and code generation. That absolutely affects the experience.
But once you start building agents, you realize something uncomfortable:
A strong model in a weak harness is frustrating. A decent model in a strong harness is often more useful than it has any right to be.
The harness determines whether the model can:
- find the right file
- survive long sessions
- recover from failed commands
- apply surgical edits
- prove that the task is complete
That is why two products using similarly capable models can feel wildly different in practice.
Why this gets harder locally
This point matters even more for local agents.
Cloud coding products usually have polished runtimes, mature prompt formatting, and enough infrastructure around the model to hide a lot of rough edges. Local setups are less forgiving.
You run into issues like:
- tighter memory limits
- smaller practical context windows
- worse latency when you overfeed the model
- more brittle tool calling
- more prompt-format sensitivity
- less guardrail infrastructure around long sessions
That does not make local agents pointless. I still like them. It just means the boring systems work matters even more. If your local agent feels unstable, it may not need a smarter model first. It may need a better loop, cleaner context, and stricter verification.
Permissions, sandboxing, and safety are part of the design
Another missing piece in a lot of simplified agent diagrams is the operating envelope.
Real coding agents usually do not have unlimited power. Some tools are read-only. Some filesystem paths are writable and others are not. Some commands require explicit approval. Network access may be blocked. Destructive operations may be denied or wrapped in extra checks.
That is not an annoying implementation detail. It is part of how the system works. The runtime is not just giving the model hands. It is also deciding what the hands are allowed to touch.
The same goes for observability. If the agent cannot show you what tool it called, what came back, what got truncated, and why it stopped, debugging turns into superstition.
A better way to think about coding agents
The mental model I keep settling on is this:
- the model is the reasoning engine
- the agent runtime is the operating system around that engine
The runtime gives the model senses, memory, and hands:
- senses through file reads, search, test output, and external tools
- memory through conversation state, summaries, and cached repository context
- hands through patches, shell commands, and API calls
Once you see the system that way, a lot of confusing behavior stops being confusing. You stop asking, “Why didn’t the model just do it?” and start asking the more useful question:
“What part of the agent loop failed?”
What I am leaving out on purpose
There are more advanced pieces beyond the basic loop:
- planner/executor splits
- long-term memory systems
- background agents
- richer approval flows
- evaluation harnesses
- multi-agent coordination
Those matter, but they come later.
The first-order problem is still the same boring one: can the model ask for a tool, can the runtime execute it, can the result get fed back correctly, and can the system verify the change before it stops?
Final takeaway
An LLM coding agent does not build software by generating one brilliant answer.
It builds software by repeatedly doing four things well:
- gathering the right context
- choosing the next action
- executing that action through tools
- checking the result against reality
If you want to build a better local agent, spend less time imagining a magical autonomous coder and more time improving those four steps.
What looks like intelligence is often just good plumbing.