<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://joeywang.github.io//feed.xml" rel="self" type="application/atom+xml" /><link href="https://joeywang.github.io//" rel="alternate" type="text/html" /><updated>2026-04-21T22:19:15+00:00</updated><id>https://joeywang.github.io//feed.xml</id><title type="html">Joey</title><subtitle>The world of learning is constantly evolving and working on e-learning allows me to be a part of that challenge. I love being in a supportive team, with people from different backgrounds and specialties, and exploring the opportunities that we make possible together.</subtitle><entry><title type="html"></title><link href="https://joeywang.github.io//posts/2026-01-20-redis-cleanup/" rel="alternate" type="text/html" title="" /><published>2026-04-21T22:19:15+00:00</published><updated>2026-04-21T22:19:15+00:00</updated><id>https://joeywang.github.io//posts/2026-01-20-redis-cleanup</id><content type="html" xml:base="https://joeywang.github.io//posts/2026-01-20-redis-cleanup/"><![CDATA[<h2 id="1-core-principles-read-this-first">1. Core Principles (Read This First)</h2>

<p>Before touching any cleanup script, internalize these rules:</p>

<ol>
  <li><strong>Redis memory issues are almost always caused by retention mistakes, not leaks</strong></li>
  <li><em>*KEYS **</em> is forbidden in production</li>
  <li><strong>DEL is dangerous for large keys — UNLINK is preferred</strong></li>
  <li><strong>Backups must come before cleanup</strong></li>
  <li><strong>TTL is the only sustainable memory strategy</strong></li>
</ol>

<p>If Redis data can grow forever, it eventually will.</p>

<hr />

<h2 id="2-redis-in-kubernetes-what-makes-it-tricky">2. Redis in Kubernetes: What Makes It Tricky</h2>

<p>Kubernetes adds unique failure modes:</p>

<ul>
  <li>Pods can restart unexpectedly → memory spikes repeat</li>
  <li>RSS vs used_memory confusion in container limits</li>
  <li>Eviction by the kubelet if Redis exceeds memory limits</li>
  <li>PersistentVolumes hide real memory growth</li>
</ul>

<h3 id="recommendation">Recommendation</h3>

<ul>
  <li>Always set <strong>Redis pod memory limits</strong></li>
  <li>Always configure <strong>Redis maxmemory</strong></li>
  <li>Never rely on K8s eviction alone</li>
</ul>

<hr />

<h2 id="3-baseline-health-checks-run-these-first">3. Baseline Health Checks (Run These First)</h2>

<h3 id="memory-overview">Memory overview</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>redis-cli INFO memory
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Key fields:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">used_memory_human</code></li>
  <li><code class="language-plaintext highlighter-rouge">used_memory_rss_human</code></li>
  <li><code class="language-plaintext highlighter-rouge">mem_fragmentation_ratio</code></li>
  <li><code class="language-plaintext highlighter-rouge">maxmemory</code></li>
  <li><code class="language-plaintext highlighter-rouge">maxmemory_policy</code></li>
</ul>

<h3 id="keyspace-overview">Keyspace overview</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>redis-cli INFO keyspace
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This tells you <strong>where keys live</strong>, not how large they are.</p>

<hr />

<h2 id="4-the-silent-killers-large-keys">4. The Silent Killers: Large Keys</h2>

<p>Redis is fast — until you store <strong>huge values</strong>.</p>

<p>Common offenders:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">stat:*</code> (Sidekiq statistics)</li>
  <li>Large JSON strings</li>
  <li>Unbounded hashes or lists</li>
  <li>Job payloads stored as strings</li>
</ul>

<h3 id="find-the-biggest-keys-safely">Find the biggest keys safely</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre><span class="nv">DB</span><span class="o">=</span>0
<span class="nv">TOP</span><span class="o">=</span>20

redis-cli <span class="nt">-n</span> <span class="s2">"</span><span class="nv">$DB</span><span class="s2">"</span> <span class="nt">--scan</span> <span class="se">\</span>
| <span class="k">while </span><span class="nb">read</span> <span class="nt">-r</span> key<span class="p">;</span> <span class="k">do
    </span><span class="nv">bytes</span><span class="o">=</span><span class="si">$(</span>redis-cli <span class="nt">-n</span> <span class="s2">"</span><span class="nv">$DB</span><span class="s2">"</span> MEMORY USAGE <span class="s2">"</span><span class="nv">$key</span><span class="s2">"</span> 2&gt;/dev/null<span class="si">)</span>
    <span class="o">[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$bytes</span><span class="s2">"</span> <span class="o">]</span> <span class="o">&amp;&amp;</span> <span class="nv">bytes</span><span class="o">=</span>0
    <span class="nb">printf</span> <span class="s2">"%12s  %s</span><span class="se">\n</span><span class="s2">"</span> <span class="s2">"</span><span class="nv">$bytes</span><span class="s2">"</span> <span class="s2">"</span><span class="nv">$key</span><span class="s2">"</span>
  <span class="k">done</span> <span class="se">\</span>
| <span class="nb">sort</span> <span class="nt">-nr</span> <span class="se">\</span>
| <span class="nb">head</span> <span class="nt">-n</span> <span class="s2">"</span><span class="nv">$TOP</span><span class="s2">"</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Never use <code class="language-plaintext highlighter-rouge">KEYS *</code>.</p>

<hr />

<h2 id="5-backups-before-cleanup-nonnegotiable">5. Backups Before Cleanup (Non‑Negotiable)</h2>

<h3 id="recommended-rdb-snapshot">Recommended: RDB snapshot</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>redis-cli BGSAVE
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Locate and copy:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>redis-cli CONFIG GET <span class="nb">dir
</span>redis-cli CONFIG GET dbfilename
<span class="nb">cp</span> /var/lib/redis/dump.rdb /backup/redis/pre-cleanup-<span class="si">$(</span><span class="nb">date</span> +%F<span class="si">)</span>.rdb
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Why this works:</p>

<ul>
  <li>Handles very large keys</li>
  <li>Fast</li>
  <li>Easy restore</li>
</ul>

<hr />

<h2 id="6-safe-cleanup-patterns">6. Safe Cleanup Patterns</h2>

<h3 id="rule-unlink--del">Rule: UNLINK &gt; DEL</h3>

<p><code class="language-plaintext highlighter-rouge">DEL</code> blocks Redis while freeing memory.
<code class="language-plaintext highlighter-rouge">UNLINK</code> frees memory asynchronously.</p>

<hr />

<h3 id="pattern-1-delete-keys-by-pattern">Pattern 1: Delete keys by pattern</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="nv">DB</span><span class="o">=</span>0

redis-cli <span class="nt">-n</span> <span class="s2">"</span><span class="nv">$DB</span><span class="s2">"</span> <span class="nt">--scan</span> MATCH <span class="s1">'Course#linked_course_uuids_and_self*'</span> <span class="se">\</span>
| <span class="k">while </span><span class="nb">read</span> <span class="nt">-r</span> key<span class="p">;</span> <span class="k">do
    </span>redis-cli <span class="nt">-n</span> <span class="s2">"</span><span class="nv">$DB</span><span class="s2">"</span> UNLINK <span class="s2">"</span><span class="nv">$key</span><span class="s2">"</span>
  <span class="k">done</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h3 id="pattern-2-rate-limited-cleanup-extra-safe">Pattern 2: Rate-limited cleanup (extra safe)</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre><span class="nv">DB</span><span class="o">=</span>0

redis-cli <span class="nt">-n</span> <span class="s2">"</span><span class="nv">$DB</span><span class="s2">"</span> <span class="nt">--scan</span> MATCH <span class="s1">'stat:*'</span> <span class="se">\</span>
| <span class="k">while </span><span class="nb">read</span> <span class="nt">-r</span> key<span class="p">;</span> <span class="k">do
    </span>redis-cli <span class="nt">-n</span> <span class="s2">"</span><span class="nv">$DB</span><span class="s2">"</span> UNLINK <span class="s2">"</span><span class="nv">$key</span><span class="s2">"</span>
    <span class="nb">sleep </span>0.01
  <span class="k">done</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="7-sidekiq-the-biggest-redis-memory-trap">7. Sidekiq: The Biggest Redis Memory Trap</h2>

<h3 id="why-sidekiq-causes-redis-memory-explosions">Why Sidekiq causes Redis memory explosions</h3>

<p>By default:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">stat:*</code> keys <strong>never expire</strong></li>
  <li>Retry jobs accumulate</li>
  <li>Dead jobs remain for months</li>
</ul>

<p>This is expected behavior — and dangerous without tuning.</p>

<hr />

<h3 id="fix-1-apply-ttl-to-sidekiq-stats">Fix 1: Apply TTL to Sidekiq stats</h3>

<p><code class="language-plaintext highlighter-rouge">config/initializers/sidekiq.rb</code></p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="rouge-code"><pre><span class="no">Sidekiq</span><span class="p">.</span><span class="nf">configure_server</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span>
  <span class="n">config</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="ss">:startup</span><span class="p">)</span> <span class="k">do</span>
    <span class="no">Sidekiq</span><span class="p">.</span><span class="nf">redis</span> <span class="k">do</span> <span class="o">|</span><span class="n">conn</span><span class="o">|</span>
      <span class="n">retention_days</span> <span class="o">=</span> <span class="mi">30</span>
      <span class="n">ttl</span> <span class="o">=</span> <span class="n">retention_days</span> <span class="o">*</span> <span class="mi">24</span> <span class="o">*</span> <span class="mi">60</span> <span class="o">*</span> <span class="mi">60</span>

      <span class="n">conn</span><span class="p">.</span><span class="nf">scan_each</span><span class="p">(</span><span class="ss">match: </span><span class="s1">'stat:*'</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">key</span><span class="o">|</span>
        <span class="n">conn</span><span class="p">.</span><span class="nf">expire</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">ttl</span><span class="p">)</span>
      <span class="k">end</span>
    <span class="k">end</span>
  <span class="k">end</span>
<span class="k">end</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h3 id="fix-2-reduce-retry-pressure">Fix 2: Reduce retry pressure</h3>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">MyWorker</span>
  <span class="kp">include</span> <span class="no">Sidekiq</span><span class="o">::</span><span class="no">Worker</span>
  <span class="n">sidekiq_options</span> <span class="ss">retry: </span><span class="mi">5</span>
<span class="k">end</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Disable retries for non-critical jobs:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="n">sidekiq_options</span> <span class="ss">retry: </span><span class="kp">false</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h3 id="fix-3-tune-dead-job-retention">Fix 3: Tune dead job retention</h3>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="no">Sidekiq</span><span class="p">.</span><span class="nf">configure_server</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span>
  <span class="n">config</span><span class="p">.</span><span class="nf">options</span><span class="p">[</span><span class="ss">:dead_timeout</span><span class="p">]</span> <span class="o">=</span> <span class="mi">30</span> <span class="o">*</span> <span class="mi">24</span> <span class="o">*</span> <span class="mi">60</span> <span class="o">*</span> <span class="mi">60</span>
  <span class="n">config</span><span class="p">.</span><span class="nf">options</span><span class="p">[</span><span class="ss">:dead_max_jobs</span><span class="p">]</span> <span class="o">=</span> <span class="mi">2000</span>
<span class="k">end</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="8-redis-maxmemory-k8s-safety-net">8. Redis maxmemory (K8s Safety Net)</h2>

<p>Unbounded Redis is dangerous in containers.</p>

<h3 id="recommended-baseline">Recommended baseline</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>redis-cli CONFIG SET maxmemory 512mb
redis-cli CONFIG SET maxmemory-policy allkeys-lru
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Choose a value <strong>below your pod memory limit</strong>.</p>

<hr />

<h2 id="9-fragmentation--rss-troubleshooting">9. Fragmentation &amp; RSS Troubleshooting</h2>

<h3 id="when-rss-is-much-higher-than-used_memory">When RSS is much higher than used_memory</h3>

<p>Run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>redis-cli MEMORY DOCTOR
</pre></td></tr></tbody></table></code></pre></div></div>

<p>If caused by historical peak:</p>

<ul>
  <li>Harmless</li>
  <li>RSS will be reused</li>
</ul>

<p>Try:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>redis-cli MEMORY PURGE
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Guaranteed fix:</p>

<ul>
  <li>Rolling restart</li>
</ul>

<hr />

<h2 id="10-production-troubleshooting-checklist">10. Production Troubleshooting Checklist</h2>

<h3 id="check-eviction--hit-rate">Check eviction &amp; hit rate</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>redis-cli INFO stats | egrep <span class="s1">'evicted_keys|expired_keys|keyspace_hits|keyspace_misses'</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="check-retry--dead-size">Check retry &amp; dead size</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>redis-cli ZCARD retry
redis-cli ZCARD dead
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="check-biggest-keys-again-after-cleanup">Check biggest keys again after cleanup</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>redis-cli <span class="nt">--scan</span> | <span class="nb">head</span> <span class="nt">-n</span> 20
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="11-kubernetes-specific-recommendations">11. Kubernetes-Specific Recommendations</h2>

<ul>
  <li>Use <strong>StatefulSet</strong> for Redis</li>
  <li>Set <strong>resources.limits.memory</strong></li>
  <li>Avoid OOMKills by setting Redis maxmemory</li>
  <li>Prefer managed Redis for critical workloads</li>
</ul>

<hr />

<h2 id="12-final-takeaways">12. Final Takeaways</h2>

<ul>
  <li>Redis problems are <strong>predictable</strong></li>
  <li>TTL beats cleanup scripts</li>
  <li>UNLINK beats DEL</li>
  <li>Backups beat regret</li>
  <li>Sidekiq defaults are not production-safe</li>
</ul>

<p>If you fix retention, Redis becomes boring again — and boring is good.</p>

<hr />

<p>If you want, this guide can be adapted into:</p>

<ul>
  <li>An internal runbook</li>
  <li>A Helm chart checklist</li>
  <li>A Sidekiq-specific hardening guide</li>
  <li>A Grafana alerting spec</li>
</ul>

<p>Just say the word.</p>]]></content><author><name></name></author></entry><entry><title type="html">Tiny Puzzles for Testing and Debugging AI Agents</title><link href="https://joeywang.github.io//posts/agent-debugging-puzzles/" rel="alternate" type="text/html" title="Tiny Puzzles for Testing and Debugging AI Agents" /><published>2026-04-21T00:00:00+00:00</published><updated>2026-04-21T00:00:00+00:00</updated><id>https://joeywang.github.io//posts/agent-debugging-puzzles</id><content type="html" xml:base="https://joeywang.github.io//posts/agent-debugging-puzzles/"><![CDATA[<h1 id="tiny-puzzles-for-testing-and-debugging-ai-agents">Tiny Puzzles for Testing and Debugging AI Agents</h1>

<p>One of the smallest tests I use is also one of the dumbest:</p>

<ol>
  <li>send <code class="language-plaintext highlighter-rouge">hello</code></li>
  <li>send <code class="language-plaintext highlighter-rouge">hello again</code></li>
</ol>

<p>I am not doing that because I think it is a meaningful intelligence test.</p>

<p>I am doing it because I want to see whether the stack behaves differently on the second request. If I expect prompt caching or KV reuse, I want some sign of it in latency, logs, or runtime warnings. If nothing changes, or LM Studio logs complain about cache support, that already tells me something useful.</p>

<p>That tiny test changed how I think about debugging agents.</p>

<p>One thing I keep coming back to with local agents is this: if you only test them on real work, debugging takes forever.</p>

<p>A big task fails and you do not know why.</p>

<p>Was function calling broken?
Did the model ignore the skill instructions?
Did the prompt get too long?
Did the agent runtime format messages wrong?
Did LM Studio drop some warning you only notice after the fact?</p>

<p>When the task is large, every failure mode gets mixed together. That makes the whole setup feel mystical when it usually is not.</p>

<p>What helped me was building tiny puzzles.</p>

<p>By “puzzle,” I do not mean benchmark games or synthetic IQ tests. I mean small, repeatable probe tasks that isolate one capability at a time. A good probe gives you a yes-or-no answer about one layer of the stack.</p>

<p>That is much more useful than asking a vague question and hoping the agent “feels smart.”</p>

<h2 id="probe-not-benchmark">Probe, not benchmark</h2>

<p>This distinction matters.</p>

<p>A benchmark tries to rank systems.</p>

<p>A probe tries to isolate a failure.</p>

<p>A benchmark asks:</p>

<blockquote>
  <p>Which model is better?</p>
</blockquote>

<p>A probe asks:</p>

<blockquote>
  <p>What broke?</p>
</blockquote>

<p>For debugging, I care much more about the second question.</p>

<h2 id="what-i-actually-want-from-a-probe">What I actually want from a probe</h2>

<p>A good agent-debugging probe should be:</p>

<ul>
  <li><strong>small</strong> enough to run in seconds</li>
  <li><strong>repeatable</strong> enough to compare across models or prompt changes</li>
  <li><strong>narrow</strong> enough to isolate one failure mode</li>
  <li><strong>cheap</strong> enough that I can run it many times</li>
  <li><strong>observable</strong> enough that I can inspect logs and understand what happened</li>
</ul>

<p>If the task is too broad, you learn almost nothing.</p>

<p>That is why I like tests like:</p>

<ul>
  <li>“call exactly one tool”</li>
  <li>“use the right skill when the trigger phrase appears”</li>
  <li>“survive a prompt that is 3x longer than the easy case”</li>
  <li>“repeat the same prompt twice and see whether cache behavior changes”</li>
</ul>

<p>Those are not glamorous. They are useful.</p>

<h2 id="the-basic-idea-build-a-probe-suite-not-one-giant-eval">The basic idea: build a probe suite, not one giant eval</h2>

<p>I would break agent testing into five buckets:</p>

<ol>
  <li>function calling support</li>
  <li>skill awareness</li>
  <li>prompt-length and context pressure</li>
  <li>debugging by layer: agent runtime vs model runtime</li>
  <li>metrics, logs, and traces</li>
</ol>

<p>You can think of this as a cheap local eval harness for yourself.</p>

<p>Here is the compact version:</p>

<table>
  <thead>
    <tr>
      <th>Probe</th>
      <th>What it tests</th>
      <th>Common failure</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>single-tool sanity check</td>
      <td>basic function calling</td>
      <td>model answers without using the tool</td>
    </tr>
    <tr>
      <td>tool-loop continuation</td>
      <td>post-tool reasoning</td>
      <td>runtime stops after the tool result</td>
    </tr>
    <tr>
      <td>skill trigger check</td>
      <td>skill awareness</td>
      <td>skill never enters context or gets ignored</td>
    </tr>
    <tr>
      <td>buried instruction retention</td>
      <td>prompt pressure</td>
      <td>instruction gets lost in long context</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">hello</code> / <code class="language-plaintext highlighter-rouge">hello again</code></td>
      <td>cache or runtime behavior</td>
      <td>no reuse signal, warning in logs</td>
    </tr>
  </tbody>
</table>

<h2 id="1-test-function-calling-support-first">1. Test function calling support first</h2>

<p>This is the first thing I test because if it is broken, everything above it becomes noise.</p>

<p>The goal here is not “can the model describe tool use?” The goal is “can the model produce the exact structured action the runtime needs, and can the runtime round-trip the result correctly?”</p>

<h3 id="the-simplest-probe">The simplest probe</h3>

<p>Give the agent one tool and one obvious task.</p>

<p>For example:</p>

<blockquote>
  <p>You have one tool: <code class="language-plaintext highlighter-rouge">read_file(path)</code>.
Read <code class="language-plaintext highlighter-rouge">README.md</code> and tell me the first sentence.</p>
</blockquote>

<p>Why this works:</p>

<ul>
  <li>only one tool is available</li>
  <li>the right action is obvious</li>
  <li>the result is easy to verify</li>
  <li>failure is easy to classify</li>
</ul>

<p>Possible outcomes:</p>

<ul>
  <li>it calls the tool correctly: good</li>
  <li>it answers without the tool: prompt or runtime issue</li>
  <li>it emits malformed tool JSON: model formatting issue</li>
  <li>it calls the wrong tool: tool selection issue</li>
  <li>it calls the tool but ignores the result: loop or message formatting issue</li>
</ul>

<h3 id="slightly-better-function-call-probes">Slightly better function-call probes</h3>

<p>Once the trivial case passes, I like a short ladder:</p>

<ol>
  <li>one tool, one obvious call</li>
  <li>two tools, only one correct choice</li>
  <li>two sequential tool calls</li>
  <li>read, then patch, then verify</li>
  <li>multiple independent reads in parallel</li>
</ol>

<p>That ladder tells you where the system starts to break.</p>

<h3 id="what-i-look-for-in-logs">What I look for in logs</h3>

<p>At this stage, I want to see:</p>

<ul>
  <li>the full prompt sent by the agent</li>
  <li>the tool schema exposed to the model</li>
  <li>the raw model output before the runtime sanitizes it</li>
  <li>the parsed tool call</li>
  <li>the returned tool result</li>
  <li>the next model turn after the tool result</li>
</ul>

<p>If any one of those is missing, debugging gets much harder.</p>

<h2 id="2-test-skill-awareness-separately">2. Test skill awareness separately</h2>

<p>This one is easy to confuse with general intelligence.</p>

<p>If you use skills, rulesets, or task-specific system instructions, you should test whether the agent notices and follows them before you test complicated work.</p>

<p>The puzzle here is not “is the output good?” The puzzle is “did the agent notice the right behavior cue?”</p>

<h3 id="a-good-skill-awareness-probe">A good skill-awareness probe</h3>

<p>Create two nearly identical tasks where only one should trigger the skill.</p>

<p>For example:</p>

<ul>
  <li>Task A: “Summarize this file.”</li>
  <li>Task B: “Humanize this article.”</li>
</ul>

<p>If the <code class="language-plaintext highlighter-rouge">humanizer</code> skill is available and Task B does not activate the expected behavior, that is not a generic writing failure. That is a skill-selection failure.</p>

<h3 id="another-useful-pattern">Another useful pattern</h3>

<p>Make the skill instruction produce a visible artifact.</p>

<p>Examples:</p>

<ul>
  <li>a required heading</li>
  <li>a required command pattern</li>
  <li>a required output shape</li>
  <li>a required refusal to do some prohibited action</li>
</ul>

<p>That way you can verify whether the skill was active without guessing.</p>

<h3 id="failure-modes-to-watch">Failure modes to watch</h3>

<ul>
  <li>the skill never enters context</li>
  <li>the skill enters context but gets ignored</li>
  <li>the skill conflicts with stronger system instructions</li>
  <li>the skill works on short prompts but disappears on long ones</li>
  <li>the skill triggers, but only partially</li>
</ul>

<p>That last one is common. The agent acts like it vaguely remembers the rule but not enough to follow it cleanly.</p>

<h2 id="3-test-prompt-length-and-context-pressure-on-purpose">3. Test prompt length and context pressure on purpose</h2>

<p>This is where local setups get weird fast.</p>

<p>A model can look fine on a short prompt and then fall apart once you add:</p>

<ul>
  <li>a large system prompt</li>
  <li>a skill catalog</li>
  <li>tool schemas</li>
  <li>a long file</li>
  <li>a diff</li>
  <li>previous turns</li>
</ul>

<p>That is not a minor edge case. That is the normal shape of agent work.</p>

<p>So I like to test prompt pressure deliberately instead of waiting for a real task to reveal it.</p>

<h3 id="the-ladder-i-use">The ladder I use</h3>

<p>Run the same task at different prompt sizes:</p>

<ol>
  <li>tiny prompt</li>
  <li>normal prompt</li>
  <li>long prompt</li>
  <li>long prompt plus previous turns</li>
  <li>long prompt plus tools plus previous turns</li>
</ol>

<p>The task itself should stay almost the same. Only the context load changes.</p>

<p>That helps answer a very useful question:</p>

<blockquote>
  <p>Is the model bad at the task, or is the system bad under context pressure?</p>
</blockquote>

<h3 id="a-cheap-kv-cache-probe">A cheap KV-cache probe</h3>

<p>One of my favorite tiny tests is embarrassingly simple:</p>

<ol>
  <li>send <code class="language-plaintext highlighter-rouge">hello</code></li>
  <li>send <code class="language-plaintext highlighter-rouge">hello again</code></li>
</ol>

<p>I am not using that to test intelligence. I am using it to test cache behavior and runtime traces.</p>

<p>If the stack supports prompt caching or KV reuse, I want to see evidence in latency, logs, or runtime counters. If I expect cache reuse and nothing changes, that is already a clue.</p>

<p>This is where I often check LM Studio logs. I want to see:</p>

<ul>
  <li>whether the requests actually reached the model runtime</li>
  <li>whether there are warnings about cache support</li>
  <li>whether the second request behaves differently</li>
  <li>whether the runtime reports any unsupported cache path</li>
</ul>

<p>That tiny test is not sufficient, but it is cheap and surprisingly revealing.</p>

<h3 id="another-prompt-length-probe">Another prompt-length probe</h3>

<p>Ask the model to follow one simple instruction buried at the very end of a long prompt.</p>

<p>For example:</p>

<blockquote>
  <p>After reading everything above, answer with exactly: <code class="language-plaintext highlighter-rouge">CACHE_OK</code></p>
</blockquote>

<p>If it misses that reliably only when the prompt gets large, you have learned something real about instruction retention under load.</p>

<h2 id="control-one-variable-at-a-time">Control one variable at a time</h2>

<p>This sounds obvious, but I have wasted plenty of time ignoring it.</p>

<p>If you change the model, the prompt, the agent settings, and the runtime configuration all at once, a bad result tells you almost nothing.</p>

<p>So when I run these probes, I try to be strict:</p>

<ul>
  <li>keep the model fixed when testing agent changes</li>
  <li>keep the agent fixed when testing runtime changes</li>
  <li>keep the task fixed when testing prompt length</li>
  <li>keep the prompt fixed when testing cache behavior</li>
</ul>

<p>Half of debugging is refusing to create your own confusion.</p>

<h2 id="4-debug-by-layer-agent-problem-or-model-problem">4. Debug by layer: agent problem or model problem?</h2>

<p>This is the distinction that saves the most time.</p>

<p>When an agent fails, I try to ask:</p>

<ol>
  <li>Did the agent send the right prompt?</li>
  <li>Did the model return something usable?</li>
  <li>Did the runtime parse it correctly?</li>
  <li>Did the tool result get fed back in the right shape?</li>
  <li>Did the loop stop too early or continue too long?</li>
</ol>

<p>If you do not separate those layers, you end up rewriting prompts when the real problem is message formatting, or blaming the model when the real problem is unsupported KV cache in the backend.</p>

<h3 id="my-rough-debugging-stack">My rough debugging stack</h3>

<p>I usually debug in this order:</p>

<h4 id="layer-1-agent-client">Layer 1: agent client</h4>

<p>This is Pi, OpenCode, OpenHands, or whatever harness you are using.</p>

<p>Questions:</p>

<ul>
  <li>What exact system prompt did it send?</li>
  <li>What tools did it expose?</li>
  <li>Did it strip or rewrite any message roles?</li>
  <li>Did it truncate context?</li>
  <li>Did it retry without telling me?</li>
  <li>Did it sanitize model output before I saw it?</li>
</ul>

<p>This layer is responsible for a lot more weirdness than people expect.</p>

<h4 id="layer-2-transport--api-compatibility">Layer 2: transport / API compatibility</h4>

<p>Questions:</p>

<ul>
  <li>Did the request format match what the backend expects?</li>
  <li>Are tool calls encoded the way the runtime expects?</li>
  <li>Are reasoning channels or special tokens being passed through cleanly?</li>
  <li>Is the compatibility mode wrong for this backend?</li>
</ul>

<p>This layer is boring until it is broken. Then everything looks cursed.</p>

<h4 id="layer-3-model-runtime">Layer 3: model runtime</h4>

<p>For a local stack, this is where LM Studio, Ollama, MLX, llama.cpp, or another runtime starts to matter.</p>

<p>Questions:</p>

<ul>
  <li>Did the model start generating?</li>
  <li>Did it stall at 0%?</li>
  <li>Did it leak think tags?</li>
  <li>Did it warn about unsupported KV caching?</li>
  <li>Did it hit memory pressure?</li>
  <li>Did it silently fall back to a degraded path?</li>
</ul>

<p>This is why I check LM Studio logs so often. If the runtime is complaining, I want to know before I waste an hour blaming the prompt.</p>

<h4 id="layer-4-model-behavior">Layer 4: model behavior</h4>

<p>Only after the first three layers look healthy do I spend much time blaming the model itself.</p>

<p>Questions:</p>

<ul>
  <li>Did the model choose the wrong tool?</li>
  <li>Did it ignore a strong instruction?</li>
  <li>Did it lose track of prior context?</li>
  <li>Did it produce malformed structured output?</li>
</ul>

<p>Sometimes the answer really is “the model is weak at this.” But that should be the last conclusion, not the first.</p>

<h2 id="5-metrics-logs-and-traces-what-i-actually-want">5. Metrics, logs, and traces: what I actually want</h2>

<p>If you are serious about debugging agents, text output alone is not enough.</p>

<p>You want a trace.</p>

<p>At minimum, I would want:</p>

<ul>
  <li>request timestamp</li>
  <li>model name</li>
  <li>prompt token count or prompt length estimate</li>
  <li>completion length</li>
  <li>latency to first token</li>
  <li>total latency</li>
  <li>tool calls emitted</li>
  <li>tool-call parse failures</li>
  <li>retries</li>
  <li>stop reason</li>
  <li>cache hit or cache reuse signals if available</li>
  <li>warnings from the model runtime</li>
</ul>

<p>That is the minimum useful set.</p>

<p>Even a fake trace like this is better than nothing:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="rouge-code"><pre>request_id=42
agent=opencode
model=gemma-4
prompt_tokens=4187
completion_tokens=96
ttft_ms=842
total_latency_ms=2310
tool_calls=1
tool_parse_ok=true
tool_roundtrip_ok=true
cache_reuse=false
stop_reason=tool_call
warning="RotatingKVCache Quantization NYI"
</pre></td></tr></tbody></table></code></pre></div></div>

<p>If I can capture something shaped like that for each probe, I can usually stop guessing and start comparing.</p>

<h3 id="the-most-useful-derived-metrics">The most useful derived metrics</h3>

<p>Once you have the basics, the metrics I care about most are:</p>

<ul>
  <li><strong>tool-call success rate</strong>: how often the raw output becomes a valid tool call</li>
  <li><strong>tool round-trip success rate</strong>: how often the model continues correctly after getting a tool result</li>
  <li><strong>instruction retention under load</strong>: does the model still obey the obvious rule in long prompts</li>
  <li><strong>latency by prompt size</strong>: when does the system start to bend</li>
  <li><strong>failure clustering</strong>: are most failures coming from one layer</li>
</ul>

<p>That last one matters because random-seeming instability often is not random at all. It clusters around one part of the stack.</p>

<h2 id="a-practical-probe-suite-i-would-keep-around">A practical probe suite I would keep around</h2>

<p>If I were setting this up from scratch, I would keep a small set of named probes.</p>

<h3 id="probe-1-single-tool-sanity-check">Probe 1: single-tool sanity check</h3>

<p>Goal: verify basic tool calling.</p>

<p>Prompt:</p>

<blockquote>
  <p>Use the available file-read tool to read <code class="language-plaintext highlighter-rouge">README.md</code> and return only its first sentence.</p>
</blockquote>

<h3 id="probe-2-tool-loop-continuation">Probe 2: tool-loop continuation</h3>

<p>Goal: verify that the agent continues after a tool result instead of narrating.</p>

<p>Prompt:</p>

<blockquote>
  <p>Read <code class="language-plaintext highlighter-rouge">README.md</code>, then tell me how many top-level sections it has.</p>
</blockquote>

<p>This forces a read, then a second reasoning step based on the returned content.</p>

<h3 id="probe-3-skill-trigger-check">Probe 3: skill trigger check</h3>

<p>Goal: verify that the right skill or ruleset activates.</p>

<p>Prompt:</p>

<blockquote>
  <p>Humanize the following draft.</p>
</blockquote>

<p>Expected result: visible behavior that clearly matches the skill instructions.</p>

<h3 id="probe-4-buried-instruction-retention">Probe 4: buried instruction retention</h3>

<p>Goal: check prompt-pressure behavior.</p>

<p>Prompt:</p>

<blockquote>
  <p>[long filler context]
Final instruction: reply with exactly <code class="language-plaintext highlighter-rouge">CACHE_OK</code></p>
</blockquote>

<h3 id="probe-5-cache-reuse-smoke-test">Probe 5: cache reuse smoke test</h3>

<p>Goal: see whether repeated prompts behave differently.</p>

<p>Prompt sequence:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">hello</code></li>
  <li><code class="language-plaintext highlighter-rouge">hello again</code></li>
</ol>

<p>This is not deep, but it is cheap. For local debugging, cheap matters.</p>

<h3 id="probe-6-runtime-warning-trap">Probe 6: runtime warning trap</h3>

<p>Goal: catch backend-specific failures early.</p>

<p>Procedure:</p>

<ol>
  <li>run one short prompt</li>
  <li>run one long prompt</li>
  <li>inspect LM Studio logs</li>
</ol>

<p>I want to catch things like:</p>

<ul>
  <li>unsupported KV cache paths</li>
  <li>prompt stalls</li>
  <li>memory pressure warnings</li>
  <li>reasoning-tag leakage</li>
</ul>

<h3 id="probe-7-patch-and-verify-microtask">Probe 7: patch-and-verify microtask</h3>

<p>Goal: test the full coding loop in miniature.</p>

<p>Prompt:</p>

<blockquote>
  <p>Read <code class="language-plaintext highlighter-rouge">foo.txt</code>, replace <code class="language-plaintext highlighter-rouge">cat</code> with <code class="language-plaintext highlighter-rouge">dog</code>, then verify the file changed.</p>
</blockquote>

<p>This sounds silly, but it exercises the real sequence:</p>

<ul>
  <li>read</li>
  <li>patch</li>
  <li>observe</li>
  <li>continue</li>
</ul>

<p>That makes it more valuable than a pure text-generation test.</p>

<h2 id="do-not-over-trust-a-passing-probe">Do not over-trust a passing probe</h2>

<p>This is the other trap.</p>

<p>A probe can pass while the real workflow is still broken.</p>

<p>Examples:</p>

<ul>
  <li>the single-tool test passes, but tool use collapses under long prompts</li>
  <li>the skill trigger works alone, but disappears once the tool manifest gets large</li>
  <li>the cache smoke test looks fine on tiny prompts, but session reuse breaks on real workloads</li>
</ul>

<p>So I do not use probes to declare victory. I use them to narrow the search space.</p>

<h2 id="keep-a-failure-diary">Keep a failure diary</h2>

<p>I would also keep a tiny log for myself.</p>

<p>Nothing fancy. Just enough to compare runs without relying on memory:</p>

<ul>
  <li>probe name</li>
  <li>model</li>
  <li>agent or harness</li>
  <li>prompt size</li>
  <li>result</li>
  <li>runtime warnings</li>
  <li>note about what changed since the last run</li>
</ul>

<p>This matters because agent debugging gets fuzzy very quickly. A failure diary turns “I think it got worse after that config change” into something you can actually verify.</p>

<h2 id="what-makes-a-puzzle-good">What makes a puzzle good?</h2>

<p>The best puzzles are not the most clever ones. They are the ones that make failure obvious.</p>

<p>Bad puzzle:</p>

<blockquote>
  <p>“Review this medium-sized repository and suggest improvements.”</p>
</blockquote>

<p>If it fails, you learn almost nothing.</p>

<p>Better puzzle:</p>

<blockquote>
  <p>“You have one tool. Read one file. Return one sentence.”</p>
</blockquote>

<p>If it fails, you know where to look.</p>

<p>That is the mindset shift.</p>

<p>Do not start by asking whether the agent is smart. Start by asking whether one layer of the system is behaving.</p>

<h2 id="the-point-is-not-to-win-the-puzzle">The point is not to win the puzzle</h2>

<p>This is worth saying because people drift into benchmark brain very easily.</p>

<p>The goal of these puzzles is not to prove that one model is superior in the abstract. The goal is to make the stack debuggable.</p>

<p>That means:</p>

<ul>
  <li>isolate one variable</li>
  <li>keep the task cheap</li>
  <li>inspect the logs</li>
  <li>compare runs after each config change</li>
  <li>only then move back to real work</li>
</ul>

<p>I still care about real tasks in the end. Of course I do. But I do not trust a real-task result very much if I do not have a clean probe suite underneath it.</p>

<p>That is especially true for local agent setups, where a lot of failures come from runtime behavior, prompt overhead, API compatibility, or cache paths rather than from some deep model limitation.</p>

<h2 id="what-i-would-do-first-in-a-local-lm-studio-setup">What I would do first in a local LM Studio setup</h2>

<p>If I were debugging Pi or OpenCode against LM Studio from scratch, I would do this in order:</p>

<ol>
  <li>Run a one-line plain text prompt and confirm the backend is alive.</li>
  <li>Run a one-tool puzzle and confirm tool calling works.</li>
  <li>Run the same task with a longer prompt and compare latency and output quality.</li>
  <li>Run the <code class="language-plaintext highlighter-rouge">hello</code> / <code class="language-plaintext highlighter-rouge">hello again</code> cache smoke test.</li>
  <li>Inspect LM Studio logs after both short and long prompts.</li>
  <li>Only then try a real coding task.</li>
</ol>

<p>That order is boring. It also saves time.</p>

<h2 id="final-takeaway">Final takeaway</h2>

<p>When an agent fails, I do not want a mystery. I want a small broken puzzle.</p>

<p>That is what lets me debug one layer at a time:</p>

<ul>
  <li>tool calling</li>
  <li>skill awareness</li>
  <li>prompt pressure</li>
  <li>runtime health</li>
  <li>logs and metrics</li>
</ul>

<p>Real work is too expensive to be your only test harness.</p>

<p>Tiny puzzles are cheaper. And for debugging, cheaper usually wins.</p>]]></content><author><name>Joey Wang</name></author><category term="AI" /><category term="Engineering" /><category term="ai" /><category term="llm" /><category term="agents" /><category term="debugging" /><category term="testing" /><category term="lm-studio" /><category term="pi" /><category term="opencode" /><category term="local-llm" /><summary type="html"><![CDATA[A practical guide to building small probe tasks to test function calling, skill awareness, prompt length, cache behavior, and runtime debugging for local AI agents.]]></summary></entry><entry><title type="html">LM Studio Local Agent Runbook: Pi and OpenCode Step by Step</title><link href="https://joeywang.github.io//posts/lm-studio-local-agent-runbook/" rel="alternate" type="text/html" title="LM Studio Local Agent Runbook: Pi and OpenCode Step by Step" /><published>2026-04-21T00:00:00+00:00</published><updated>2026-04-21T00:00:00+00:00</updated><id>https://joeywang.github.io//posts/lm-studio-local-agent-runbook</id><content type="html" xml:base="https://joeywang.github.io//posts/lm-studio-local-agent-runbook/"><![CDATA[<h1 id="lm-studio-local-agent-runbook-pi-and-opencode-step-by-step">LM Studio Local Agent Runbook: Pi and OpenCode Step by Step</h1>

<p>This is the setup guide I wanted while I was trying to make LM Studio work as a local engine for coding agents.</p>

<p>This post is not about theory. It is about getting a model running locally, exposing it through LM Studio’s OpenAI-compatible endpoint, wiring it into Pi or OpenCode, and checking that the stack is alive before you waste time debugging the wrong thing.</p>

<p>If you want the broader context on why I was doing this and where the setup still falls short, read the companion article:</p>

<p><a href="/posts/lm-studio-gemma4/">Using LM Studio and Gemma as a Local Engine for Coding Agents</a></p>

<h2 id="what-you-are-building">What you are building</h2>

<p>The target architecture is simple:</p>

<ol>
  <li>LM Studio runs the model locally.</li>
  <li>LM Studio exposes <code class="language-plaintext highlighter-rouge">http://127.0.0.1:1234/v1</code>.</li>
  <li>Pi agents or OpenCode point at that endpoint.</li>
  <li>Your agent harness uses LM Studio as if it were an OpenAI-compatible backend.</li>
</ol>

<p>That does not give you a perfect local agent. It gives you a local model backend your agent can talk to.</p>

<h2 id="prerequisites">Prerequisites</h2>

<p>Before you start, make sure you have:</p>

<ul>
  <li>LM Studio installed</li>
  <li>at least one model downloaded in LM Studio</li>
  <li>a model that actually fits your machine</li>
  <li>Pi or OpenCode available locally if you want to test integration immediately</li>
</ul>

<p>One warning up front: model selection is constrained by memory much faster than people expect.</p>

<p>For example, I hit memory trouble when pushing into Qwen 35B-class MLX setups. Bigger is not automatically better if the runtime becomes unstable or dies during actual agent work.</p>

<h2 id="step-1-start-lm-studio-and-load-a-model">Step 1: start LM Studio and load a model</h2>

<p>Open LM Studio and load the model you want to serve.</p>

<p>My actual workflow is:</p>

<ol>
  <li>Open LM Studio.</li>
  <li>Download or select the model.</li>
  <li>Open the Developer or Local Server section.</li>
  <li>Start the local server.</li>
  <li>Confirm it is listening on <code class="language-plaintext highlighter-rouge">http://127.0.0.1:1234/v1</code>.</li>
</ol>

<p>The important detail is that you are not just chatting in the LM Studio UI. You are turning it into a local API server.</p>

<h2 id="step-2-sanity-check-the-local-endpoint">Step 2: sanity-check the local endpoint</h2>

<p>Before configuring any agent client, make sure the LM Studio endpoint responds.</p>

<p>Run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>curl http://127.0.0.1:1234/v1/models
</pre></td></tr></tbody></table></code></pre></div></div>

<p>If that works, you should get a model list back.</p>

<p>If it does not work, stop there and fix the server first. Do not start editing Pi or OpenCode config until this responds.</p>

<h2 id="step-3-change-the-lm-studio-settings-that-matter">Step 3: change the LM Studio settings that matter</h2>

<p>These are the first settings I would touch before trying agent workloads.</p>

<h3 id="disable-unified-kv-cache">Disable Unified KV Cache</h3>

<p>This helped the most when I was seeing prompt stalls or weird runtime behavior.</p>

<p>If you see large prompts stuck at 0%, this is one of the first things I would change.</p>

<h3 id="set-context-length-manually">Set context length manually</h3>

<p>Do not leave context length on auto if you want more predictable behavior.</p>

<p>Choose an explicit number based on your machine and the kind of tasks you are running.</p>

<h3 id="add-stop-tokens-if-gemma-leaks-think-tags">Add stop tokens if Gemma leaks think tags</h3>

<p>Gemma can leak reasoning markers into visible output. That is annoying in normal chat and actively bad in agent loops.</p>

<p>If you are seeing leaked thought markers, add stop tokens early instead of pretending this will sort itself out later.</p>

<h2 id="step-4-wire-lm-studio-into-piagentsmodelsjson">Step 4: wire LM Studio into <code class="language-plaintext highlighter-rouge">~/.pi/agents/models.json</code></h2>

<p>If you are using Pi-style agents, this is the general shape:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
</pre></td><td class="rouge-code"><pre><span class="p">{</span><span class="w">
  </span><span class="nl">"providers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"lm-studio"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"local-lm-studio"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"LM Studio"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"api"</span><span class="p">:</span><span class="w"> </span><span class="s2">"openai-completions"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"compatibility"</span><span class="p">:</span><span class="w"> </span><span class="s2">"legacy-system-role"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"apiKey"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ollama"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"baseUrl"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:1234/v1"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"models"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"qwen3.5-9b-sushi-coder-rl-mlx"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"_launch"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
          </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"qwen 3.5 sushi coder"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"contextWindow"</span><span class="p">:</span><span class="w"> </span><span class="mi">84000</span><span class="p">,</span><span class="w">
          </span><span class="nl">"input"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"text"</span><span class="p">],</span><span class="w">
          </span><span class="nl">"reasoning"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gemma-4-e4b-it-mlx"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"_launch"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
          </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Gemma 4 E4B"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"contextWindow"</span><span class="p">:</span><span class="w"> </span><span class="mi">84000</span><span class="p">,</span><span class="w">
          </span><span class="nl">"input"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"text"</span><span class="p">],</span><span class="w">
          </span><span class="nl">"reasoning"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></div></div>

<p>What matters here:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">baseUrl</code> should point at the LM Studio endpoint</li>
  <li><code class="language-plaintext highlighter-rouge">apiKey</code> is usually just a placeholder for local use</li>
  <li><code class="language-plaintext highlighter-rouge">compatibility</code> can matter if the client is picky about role handling</li>
  <li><code class="language-plaintext highlighter-rouge">reasoning: true</code> is only useful if your harness can handle reasoning output cleanly</li>
</ul>

<p>That last point is not academic. If the harness does not know what to do with reasoning output, turning it on can make the whole setup noisier instead of smarter.</p>

<h2 id="step-5-wire-lm-studio-into-opencode">Step 5: wire LM Studio into OpenCode</h2>

<p>For OpenCode, the provider block is usually simpler:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="rouge-code"><pre><span class="p">{</span><span class="w">
  </span><span class="nl">"lm-studio"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"models"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"gemma-4-26b-a4b-it-mlx"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"_launch"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
        </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Gemma 4 26B A4B IT MLX"</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"LM Studio"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"npm"</span><span class="p">:</span><span class="w"> </span><span class="s2">"@ai-sdk/openai-compatible"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"options"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"baseURL"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://127.0.0.1:1234/v1"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></div></div>

<p>The important part is not the shape of the JSON. The important part is that OpenCode is treating LM Studio like an OpenAI-compatible backend.</p>

<p>That is what makes this setup useful for experimentation. You can swap the model backend without rewriting the rest of the agent harness.</p>

<h2 id="step-6-verify-with-a-cheap-test-before-a-real-agent-task">Step 6: verify with a cheap test before a real agent task</h2>

<p>Do not jump straight into a huge repository review.</p>

<p>Do these first:</p>

<ol>
  <li>Confirm the client can see the configured model.</li>
  <li>Send one short plain-text prompt.</li>
  <li>Try one tiny bounded task, like summarizing a short file or reviewing a tiny diff.</li>
</ol>

<p>You want to prove the stack works in the smallest possible way before turning the task size up.</p>

<h2 id="step-7-simplify-the-pi-prompt-before-blaming-the-model">Step 7: simplify the Pi prompt before blaming the model</h2>

<p>This matters more for local models than for stronger cloud models.</p>

<p>If Pi is dragging around a huge system prompt, a giant skill catalog, MCP server descriptions, tool docs, and every meta-instruction imaginable, you are burning context before the real work even starts.</p>

<p>If you are struggling with context pressure, simplify aggressively:</p>

<ul>
  <li>remove skills that are not needed</li>
  <li>avoid loading MCP descriptions the task will not use</li>
  <li>shorten the system prompt</li>
  <li>keep the operational rules direct</li>
</ul>

<p>You are not trying to make the agent less capable. You are trying to stop wasting the context window on scaffolding.</p>

<h2 id="step-8-know-the-problems-you-are-likely-to-hit">Step 8: know the problems you are likely to hit</h2>

<p>This is the part people usually skip. They should not.</p>

<h3 id="problem-qwen-35b-on-mlx-runs-out-of-memory">Problem: Qwen 35B on MLX runs out of memory</h3>

<p>This is a machine constraint problem, not a prompting problem.</p>

<p>If the model is too large for the actual workflow, move down to something that stays alive under repeated turns.</p>

<h3 id="problem-kv-cache-support-on-lm-studio--mlx-is-incomplete">Problem: KV cache support on LM Studio + MLX is incomplete</h3>

<p>This showed up for me as runtime weirdness and failed expectations around caching.</p>

<p>I would not assume KV-related optimizations are fully reliable just because they exist in a settings panel.</p>

<h3 id="problem-broken-kv-cache-state-causes-0-prompt-stalls">Problem: broken KV cache state causes 0% prompt stalls</h3>

<p>If the prompt sits at 0%, treat runtime cache state as suspicious immediately.</p>

<p>My default response is:</p>

<ol>
  <li>stop the run</li>
  <li>disable Unified KV Cache</li>
  <li>retry with a much smaller prompt</li>
  <li>verify the model is healthy again before doing anything larger</li>
</ol>

<h3 id="problem-gemma-leaks-think-tags">Problem: Gemma leaks think tags</h3>

<p>This is real and it matters.</p>

<p>In an agent loop, think-tag leakage can break parsing, pollute tool output, and waste context.</p>

<h3 id="problem-gemma-shared-kv-is-not-something-i-would-rely-on-here">Problem: Gemma shared KV is not something I would rely on here</h3>

<p>For the LM Studio runtime I was testing, shared KV for Gemma was not something I could treat as usable.</p>

<p>If that is part of your mental model for why the setup should be fast or stable, remove that assumption first.</p>

<h2 id="concrete-error-text-worth-recognizing">Concrete error text worth recognizing</h2>

<p>One runtime error I hit looked like this:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>Error: Error in iterating prediction stream:
NotImplementedError: RotatingKVCache Quantization NYI
</pre></td></tr></tbody></table></code></pre></div></div>

<p>That is useful because it tells you this is not a “maybe I should rewrite my prompt” situation.</p>

<p>It points to a runtime capability gap.</p>

<h2 id="troubleshooting-checklist">Troubleshooting checklist</h2>

<p>If the whole setup is failing, I would check things in this order:</p>

<ol>
  <li>Is LM Studio actually serving on <code class="language-plaintext highlighter-rouge">127.0.0.1:1234</code>?</li>
  <li>Does <code class="language-plaintext highlighter-rouge">curl http://127.0.0.1:1234/v1/models</code> work?</li>
  <li>Does the client model ID exactly match what LM Studio exposes?</li>
  <li>Is the client expecting a different OpenAI compatibility mode than the provider config uses?</li>
  <li>Is shared KV part of the setup assumptions?</li>
  <li>Is Unified KV Cache making things worse?</li>
  <li>Is the prompt too large before the actual task even starts?</li>
  <li>Is Gemma leaking think tags into output the harness expects to parse cleanly?</li>
</ol>

<p>That order has saved me time because it forces me to debug the runtime before I start rewriting prompts for no reason.</p>

<h2 id="what-i-would-do-first-on-a-fresh-machine">What I would do first on a fresh machine</h2>

<p>If I had to set this up again from zero, I would keep it boring:</p>

<ol>
  <li>Start LM Studio.</li>
  <li>Load one model that fits comfortably.</li>
  <li>Confirm <code class="language-plaintext highlighter-rouge">curl http://127.0.0.1:1234/v1/models</code> works.</li>
  <li>Disable Unified KV Cache.</li>
  <li>Set context length manually.</li>
  <li>Wire one client into the endpoint.</li>
  <li>Test with one tiny task.</li>
  <li>Only then move to bigger prompts or more capable models.</li>
</ol>

<p>That sounds conservative. It is also much faster than trying to brute-force your way through three problems at once.</p>

<h2 id="final-note">Final note</h2>

<p>The biggest trap in local agent setup is mixing all the variables together.</p>

<p>Do not debug:</p>

<ul>
  <li>a new model</li>
  <li>a large prompt</li>
  <li>a new harness</li>
  <li>a new cache setting</li>
  <li>and a new client config</li>
</ul>

<p>all at the same time.</p>

<p>Make the stack boring first. Then make it ambitious.</p>]]></content><author><name>Joey Wang</name></author><category term="AI" /><category term="Engineering" /><category term="ai" /><category term="llm" /><category term="gemma" /><category term="lm-studio" /><category term="agents" /><category term="pi" /><category term="opencode" /><category term="local-llm" /><summary type="html"><![CDATA[A step-by-step runbook for using LM Studio as a local OpenAI-compatible backend for Pi agents and OpenCode, including config examples, verification steps, and troubleshooting.]]></summary></entry><entry><title type="html">How an LLM Coding Agent Actually Builds Software</title><link href="https://joeywang.github.io//posts/llm-agent-building/" rel="alternate" type="text/html" title="How an LLM Coding Agent Actually Builds Software" /><published>2026-04-16T00:00:00+00:00</published><updated>2026-04-16T00:00:00+00:00</updated><id>https://joeywang.github.io//posts/llm-agent-building</id><content type="html" xml:base="https://joeywang.github.io//posts/llm-agent-building/"><![CDATA[<h1 id="how-an-llm-coding-agent-actually-builds-software">How an LLM Coding Agent Actually Builds Software</h1>

<p>The first time I tried to wire a local coding agent around Gemma, I thought the hard part would be the model.</p>

<p>It wasn’t.</p>

<p>The model looked flaky because my agent loop was flaky.</p>

<p>One of the first tasks I gave it was boring on purpose: find a file, make a small code change, then run the relevant test. The model did the first part correctly. It asked for the file. It asked for the test. Then it drifted. Instead of taking the next tool step, it started explaining what should happen next like a consultant with a checklist.</p>

<p>At first glance that looked like a model failure. It wasn’t. I was parsing the response stream too early and mishandling the turn after the tool result. The model never really got a clean chance to continue.</p>

<p>That changed how I think about coding agents.</p>

<p>A coding agent is not just a model with a bigger prompt. It is a small software system wrapped around a model. The model does the reasoning. The runtime does the state management, tool execution, file edits, and validation.</p>

<p>That distinction matters because people blame or praise “the model” for behavior the surrounding harness is actually causing.</p>

<p>After spending time with Gemma and OpenCode-style local workflows, I keep coming back to the same conclusion: the model is only one part of the system. The loop around it is what turns text generation into software work.</p>

<p>If I had to reduce the whole thing to one line, it would be this:</p>

<blockquote>
  <p>Most of what feels magical in a coding agent is just a model sitting inside a well-built loop.</p>
</blockquote>

<h2 id="what-the-system-actually-is">What the system actually is</h2>

<p>At a high level, a coding agent has five moving parts:</p>

<ol>
  <li>The <strong>model</strong> that interprets the request and decides what to do next</li>
  <li>The <strong>prompt and context builder</strong> that prepares instructions and relevant repository state</li>
  <li>The <strong>tool runtime</strong> that executes shell commands, file reads, searches, and patches</li>
  <li>The <strong>agent loop</strong> that keeps calling the model after each tool result</li>
  <li>The <strong>verification layer</strong> that runs tests, linters, or builds before returning control</li>
</ol>

<p>A normal chatbot answers once. An agent reads, acts, checks what happened, then goes again.</p>

<p>Here is the simplest version of the flow:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre>User request
  -&gt; context builder
  -&gt; model
  -&gt; tool call
  -&gt; runtime executes tool
  -&gt; tool result goes back to model
  -&gt; patch / command / follow-up tool call
  -&gt; tests or lint
  -&gt; final answer
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Not glamorous, but more useful than model marketing.</p>

<h2 id="step-1-build-the-working-context">Step 1: build the working context</h2>

<p>Before the model can do useful work, the agent has to decide what context to provide.</p>

<p>That usually includes:</p>

<ul>
  <li>system instructions</li>
  <li>the user’s request</li>
  <li>recent conversation history</li>
  <li>tool definitions</li>
  <li>repository structure or symbol summaries</li>
  <li>relevant file contents or search results</li>
</ul>

<p>This part gets hand-waved a lot. The agent is not dumping an entire repository into the model. It is assembling a useful working set.</p>

<p>In practice, good agents do some combination of:</p>

<ul>
  <li><strong>repository mapping</strong>: build a high-level view of files, symbols, or modules</li>
  <li><strong>targeted retrieval</strong>: read only the files that look relevant</li>
  <li><strong>context trimming</strong>: keep the active state small enough to fit the model’s window</li>
  <li><strong>caching</strong>: avoid re-reading the same large context on every turn</li>
</ul>

<p>The important point is simple: context is assembled. It is not magically remembered.</p>

<h2 id="step-2-let-the-model-decide-the-next-move">Step 2: let the model decide the next move</h2>

<p>Once the context is ready, the model gets a turn.</p>

<p>For a coding task such as “fix the failing login flow,” the model is not supposed to immediately output code. A good model will first decide what information it still needs:</p>

<ul>
  <li>inspect the auth code</li>
  <li>search for the failing controller or service</li>
  <li>read the relevant test</li>
  <li>run the test suite or a narrowed test target</li>
</ul>

<p>Reasoning helps, but reasoning on its own is not enough. If the runtime does not support tools and iteration, the model can only describe a plan. It cannot carry it out.</p>

<p>That is the gap between:</p>

<ul>
  <li>“You should inspect <code class="language-plaintext highlighter-rouge">auth.rb</code> and run the tests”</li>
  <li>actually reading <code class="language-plaintext highlighter-rouge">auth.rb</code>, running the tests, seeing the failure, and proposing a patch</li>
</ul>

<p>If you have built one of these loops badly, you can feel the difference immediately. The model sounds smart. Nothing gets done.</p>

<h2 id="step-3-turn-intent-into-tool-calls">Step 3: turn intent into tool calls</h2>

<p>The model does not directly touch your filesystem. It emits structured intent.</p>

<p>Depending on the runtime, that may look like a function call, JSON object, or tool invocation block. The meaning is the same:</p>

<blockquote>
  <p>“Run this shell command.”</p>

  <p>“Read this file.”</p>

  <p>“Apply this patch.”</p>
</blockquote>

<p>This is one of the most useful mental models in the whole setup:</p>

<blockquote>
  <p><strong>The model does not execute tools. The runtime executes tools.</strong></p>
</blockquote>

<p>That separation is what makes the system manageable. The runtime can validate arguments, reject unsafe actions, log what happened, and feed the results back into the conversation.</p>

<p>It also means a lot of agent bugs are not really model bugs. They are runtime bugs:</p>

<ul>
  <li>tool calls parsed incorrectly</li>
  <li>partial streaming output handled too early</li>
  <li>malformed tool results appended to history</li>
  <li>message roles mismatched for the target model</li>
  <li>missing loop after a tool result</li>
</ul>

<p>I ran into exactly this in my own local setup. The model looked flaky until I realized the harness was the flaky part.</p>

<p>One subtle point here: the runtime is not just a dumb pipe. It defines the contract. It decides which tools exist, what arguments are allowed, how results are formatted, and what the model gets back when something fails, times out, or succeeds. The model can only operate inside that contract.</p>

<h2 id="message-formatting-is-part-of-the-system">Message formatting is part of the system</h2>

<p>This sounds boring until it breaks.</p>

<p>Different models and runtimes expect different message shapes, role names, and tool-call formats. Some want explicit <code class="language-plaintext highlighter-rouge">tool</code> messages with IDs. Some expect tool results folded back into a user turn. Some tolerate loose formatting. Some absolutely do not.</p>

<p>If you get this wrong, the failure mode is annoying because it does not always look like a protocol error. It just looks like the model got weird. It ignores a tool result. It repeats itself. It forgets what just happened. It starts narrating instead of acting.</p>

<p>That is another reason I hesitate when people talk about agent quality as if it were just a model ranking problem. A surprising amount of the real work is message plumbing.</p>

<h2 id="step-4-execute-observe-loop">Step 4: execute, observe, loop</h2>

<p>This is the part many first-time agent builders miss.</p>

<p>The basic loop looks like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="rouge-code"><pre><span class="n">messages</span> <span class="o">=</span> <span class="n">initial_context</span>

<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
    <span class="n">response</span> <span class="o">=</span> <span class="nf">llm</span><span class="p">(</span><span class="n">messages</span><span class="p">,</span> <span class="n">tools</span><span class="o">=</span><span class="n">tools</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">response</span><span class="p">.</span><span class="n">tool_calls</span><span class="p">:</span>
        <span class="n">messages</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>

        <span class="k">for</span> <span class="n">call</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="n">tool_calls</span><span class="p">:</span>
            <span class="n">result</span> <span class="o">=</span> <span class="nf">execute_tool</span><span class="p">(</span><span class="n">call</span><span class="p">)</span>
            <span class="n">messages</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>

        <span class="k">continue</span>

    <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">final_text</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The crucial line is <code class="language-plaintext highlighter-rouge">continue</code>.</p>

<p>After each tool result, the model needs another turn. That is how it moves from:</p>

<ol>
  <li>reading files</li>
  <li>forming a hypothesis</li>
  <li>patching code</li>
  <li>running tests</li>
  <li>adjusting the patch if the tests still fail</li>
</ol>

<p>Without that loop, you do not have much of an agent. You have a one-shot assistant that knows how to talk about tool syntax.</p>

<p>The runtime also needs clear stopping conditions. A good agent should stop when the checks pass, when it is genuinely blocked and needs user input, or when another retry is just burning tokens without improving anything. Otherwise you get the other classic failure mode: the agent that keeps “working” long after it should have stopped.</p>

<h2 id="a-tiny-end-to-end-example">A tiny end-to-end example</h2>

<p>This is what a healthy loop looks like in practice.</p>

<p>Imagine the user asks:</p>

<blockquote>
  <p>“Fix the failing login test.”</p>
</blockquote>

<p>What happens next is usually something like this:</p>

<ol>
  <li>The agent searches for the failing test or runs a narrowed test command.</li>
  <li>The runtime sends the failure output back to the model.</li>
  <li>The model asks to read <code class="language-plaintext highlighter-rouge">auth.rb</code> and the matching test file.</li>
  <li>The runtime returns both file contents.</li>
  <li>The model proposes a small patch.</li>
  <li>The runtime applies the patch.</li>
  <li>The model asks to rerun the test.</li>
  <li>The runtime returns either a pass or a new failure.</li>
  <li>If it still fails, the loop continues.</li>
</ol>

<p>In rough pseudo-transcript form:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre>user: Fix the failing login test.

assistant -&gt; tool: run_test("bundle exec rspec spec/requests/login_spec.rb")
tool -&gt; assistant: failure in "returns 401 for expired token"

assistant -&gt; tool: read_file("app/services/auth.rb")
assistant -&gt; tool: read_file("spec/requests/login_spec.rb")
tool -&gt; assistant: [file contents]

assistant -&gt; tool: apply_patch(...)
tool -&gt; assistant: patch applied

assistant -&gt; tool: run_test("bundle exec rspec spec/requests/login_spec.rb")
tool -&gt; assistant: 1 example, 0 failures

assistant: Fixed. The token expiry check was comparing strings instead of timestamps.
</pre></td></tr></tbody></table></code></pre></div></div>

<p>That is the job. Not one huge leap of intelligence. A sequence of small moves grounded in feedback.</p>

<h2 id="parallel-work-helps-but-only-when-the-dependency-graph-is-real">Parallel work helps, but only when the dependency graph is real</h2>

<p>A naive agent does everything in sequence. Better agents can overlap independent work.</p>

<p>Reading three files at once is usually fine. Searching two directories in parallel is usually fine. Running a linter and a type check at the same time is often fine.</p>

<p>But the runtime has to know where parallelism stops being safe. Reading a file and patching it at the same time is a bug. Running a test against code that another step is still modifying is a bug. Parallelism is useful, but only when the operations are actually independent.</p>

<h2 id="step-5-make-precise-edits-instead-of-rewriting-everything">Step 5: make precise edits instead of rewriting everything</h2>

<p>When the agent decides to change code, the safest path is usually not “rewrite the whole file.”</p>

<p>Better runtimes prefer targeted edits such as:</p>

<ul>
  <li>search-and-replace for a unique block</li>
  <li>line-oriented patching</li>
  <li>unified diff application</li>
</ul>

<p>That approach helps for two reasons:</p>

<ol>
  <li>It reduces accidental damage to unrelated code.</li>
  <li>It gives the model a more stable editing primitive for iterative fixes.</li>
</ol>

<p>This is one reason patch-based workflows feel noticeably more reliable than naive full-file rewrites.</p>

<h2 id="step-6-check-the-work-against-reality">Step 6: check the work against reality</h2>

<p>An agent is only useful if it can compare its changes against reality.</p>

<p>For software tasks, “reality” usually means one or more of:</p>

<ul>
  <li>tests</li>
  <li>linters</li>
  <li>type checks</li>
  <li>builds</li>
  <li>runtime output</li>
</ul>

<p>The model proposes a change. The runtime runs the relevant checks. The model then sees the result and decides whether the job is actually done.</p>

<p>That is the difference between a flashy demo and a tool you might actually trust. The demo stops when the code looks plausible. The useful tool stops when the environment says the change holds up.</p>

<p>This is also where models run into a hard limit. They are good at predicting plausible next steps. They are much worse at knowing, from their own internal confidence alone, whether those steps actually worked. That is why verification is not optional. The model’s guess is not the ground truth.</p>

<h2 id="where-agents-usually-break">Where agents usually break</h2>

<p>When people say a coding agent “just kind of fell apart,” the failure is often boring:</p>

<ul>
  <li>the model emitted a tool call across multiple stream chunks and the runtime acted too early</li>
  <li>the tool result got appended in the wrong format</li>
  <li>the agent lost the thread after a long wall of shell output</li>
  <li>the patch applied, but the model never saw the real post-patch state</li>
  <li>the patch failed to apply cleanly and the retry logic made things worse</li>
  <li>a command timed out and the runtime treated that like useful output</li>
  <li>the system skipped verification and returned confident nonsense</li>
</ul>

<p>This is why I am suspicious of sweeping claims about model quality without any discussion of runtime quality. A fragile harness can make a good model look bad. A disciplined harness can make a merely decent model feel much better than expected.</p>

<h2 id="context-management-is-where-things-quietly-break">Context management is where things quietly break</h2>

<p>As the session gets longer, the agent’s job gets harder.</p>

<p>Every tool result, file read, and patch explanation consumes context window space. If you keep everything, the model eventually drowns in stale logs and low-value history.</p>

<p>So real agents need compaction strategies:</p>

<ul>
  <li>keep recent turns verbatim</li>
  <li>summarize older work</li>
  <li>drop noisy command output</li>
  <li>retain the current plan and latest repository state</li>
  <li>preserve durable instructions while discarding dead ends</li>
</ul>

<p>This is not the glamorous part of agent design, but it matters more than people think. A lot of agent failures are really context failures wearing a fake mustache.</p>

<p>There is also a tradeoff here that people skip past too quickly: compaction is lossy. Summaries are useful, but sometimes the exact detail you threw away is the detail you needed three turns later. Long-running agents are always balancing recall against context budget.</p>

<h2 id="the-model-matters-but-the-harness-matters-more">The model matters, but the harness matters more</h2>

<p>Different models are better or worse at planning, tool use, structured output, and code generation. That absolutely affects the experience.</p>

<p>But once you start building agents, you realize something uncomfortable:</p>

<blockquote>
  <p>A strong model in a weak harness is frustrating. A decent model in a strong harness is often more useful than it has any right to be.</p>
</blockquote>

<p>The harness determines whether the model can:</p>

<ul>
  <li>find the right file</li>
  <li>survive long sessions</li>
  <li>recover from failed commands</li>
  <li>apply surgical edits</li>
  <li>prove that the task is complete</li>
</ul>

<p>That is why two products using similarly capable models can feel wildly different in practice.</p>

<h2 id="why-this-gets-harder-locally">Why this gets harder locally</h2>

<p>This point matters even more for local agents.</p>

<p>Cloud coding products usually have polished runtimes, mature prompt formatting, and enough infrastructure around the model to hide a lot of rough edges. Local setups are less forgiving.</p>

<p>You run into issues like:</p>

<ul>
  <li>tighter memory limits</li>
  <li>smaller practical context windows</li>
  <li>worse latency when you overfeed the model</li>
  <li>more brittle tool calling</li>
  <li>more prompt-format sensitivity</li>
  <li>less guardrail infrastructure around long sessions</li>
</ul>

<p>That does not make local agents pointless. I still like them. It just means the boring systems work matters even more. If your local agent feels unstable, it may not need a smarter model first. It may need a better loop, cleaner context, and stricter verification.</p>

<h2 id="permissions-sandboxing-and-safety-are-part-of-the-design">Permissions, sandboxing, and safety are part of the design</h2>

<p>Another missing piece in a lot of simplified agent diagrams is the operating envelope.</p>

<p>Real coding agents usually do not have unlimited power. Some tools are read-only. Some filesystem paths are writable and others are not. Some commands require explicit approval. Network access may be blocked. Destructive operations may be denied or wrapped in extra checks.</p>

<p>That is not an annoying implementation detail. It is part of how the system works. The runtime is not just giving the model hands. It is also deciding what the hands are allowed to touch.</p>

<p>The same goes for observability. If the agent cannot show you what tool it called, what came back, what got truncated, and why it stopped, debugging turns into superstition.</p>

<h2 id="a-better-way-to-think-about-coding-agents">A better way to think about coding agents</h2>

<p>The mental model I keep settling on is this:</p>

<ul>
  <li>the <strong>model</strong> is the reasoning engine</li>
  <li>the <strong>agent runtime</strong> is the operating system around that engine</li>
</ul>

<p>The runtime gives the model senses, memory, and hands:</p>

<ul>
  <li><strong>senses</strong> through file reads, search, test output, and external tools</li>
  <li><strong>memory</strong> through conversation state, summaries, and cached repository context</li>
  <li><strong>hands</strong> through patches, shell commands, and API calls</li>
</ul>

<p>Once you see the system that way, a lot of confusing behavior stops being confusing. You stop asking, “Why didn’t the model just do it?” and start asking the more useful question:</p>

<blockquote>
  <p>“What part of the agent loop failed?”</p>
</blockquote>

<h2 id="what-i-am-leaving-out-on-purpose">What I am leaving out on purpose</h2>

<p>There are more advanced pieces beyond the basic loop:</p>

<ul>
  <li>planner/executor splits</li>
  <li>long-term memory systems</li>
  <li>background agents</li>
  <li>richer approval flows</li>
  <li>evaluation harnesses</li>
  <li>multi-agent coordination</li>
</ul>

<p>Those matter, but they come later.</p>

<p>The first-order problem is still the same boring one: can the model ask for a tool, can the runtime execute it, can the result get fed back correctly, and can the system verify the change before it stops?</p>

<h2 id="final-takeaway">Final takeaway</h2>

<p>An LLM coding agent does not build software by generating one brilliant answer.</p>

<p>It builds software by repeatedly doing four things well:</p>

<ol>
  <li>gathering the right context</li>
  <li>choosing the next action</li>
  <li>executing that action through tools</li>
  <li>checking the result against reality</li>
</ol>

<p>If you want to build a better local agent, spend less time imagining a magical autonomous coder and more time improving those four steps.</p>

<p>What looks like intelligence is often just good plumbing.</p>]]></content><author><name>Joey Wang</name></author><category term="AI" /><category term="Engineering" /><category term="ai" /><category term="llm" /><category term="agents" /><category term="coding-agent" /><category term="gemma" /><category term="opencode" /><category term="software-engineering" /><summary type="html"><![CDATA[A practical breakdown of how coding agents work: model, tool loop, context management, patching, and verification.]]></summary></entry><entry><title type="html">Using LM Studio and Gemma as a Local Engine for Coding Agents</title><link href="https://joeywang.github.io//posts/lm-studio-gemma4/" rel="alternate" type="text/html" title="Using LM Studio and Gemma as a Local Engine for Coding Agents" /><published>2026-04-13T00:00:00+00:00</published><updated>2026-04-13T00:00:00+00:00</updated><id>https://joeywang.github.io//posts/lm-studio-gemma4</id><content type="html" xml:base="https://joeywang.github.io//posts/lm-studio-gemma4/"><![CDATA[<h1 id="using-lm-studio-and-gemma-as-a-local-engine-for-coding-agents">Using LM Studio and Gemma as a Local Engine for Coding Agents</h1>

<p>I did not start this experiment because I wanted a nicer chatbot on my laptop.</p>

<p>What I wanted was much more specific: a local model endpoint I could plug into agent-style workflows for code review, repo questions, bounded refactors, and private documentation-heavy tasks. Something that felt close enough to the OpenAI-compatible APIs many tools already expect, but without sending every prompt, diff, and internal doc set to the cloud.</p>

<p>That turned out to be possible. It was also a lot less smooth than the demo videos make it look.</p>

<p>The core setup that worked best for me was LM Studio plus a GGUF build of Gemma 4 26B. Once I got the settings under control, it became a usable local engine for agent clients. Not perfect. Not something I would trust blindly. But good enough that I would actually use it.</p>

<p>This post is the version I wish I had found before I started.</p>

<h2 id="why-i-wanted-this-locally">Why I wanted this locally</h2>

<p>There were a few reasons I kept pushing on local setup instead of giving up and using cloud models for everything.</p>

<p>First, privacy. If I am experimenting with agent workflows against private repositories, internal notes, or ugly half-finished local docs, I want the option to keep all of that on my machine.</p>

<p>Second, cost. Agent loops are noisy. They read files, retry, summarize, call tools, and sometimes spiral. That is manageable when you are testing locally. It gets annoying fast when every bad prompt is burning money.</p>

<p>Third, iteration speed. I wanted to try weird things: change prompts, swap harnesses, feed in local notes, break the loop, fix the loop, try again. Local models are slower in raw quality terms, but they are very forgiving for this kind of experimentation.</p>

<p>And fourth, I wanted to understand the boundary between “local model” and “local agent.” Those are not the same thing, and a lot of the confusion here comes from treating them as if they are.</p>

<h2 id="the-stack-i-settled-on">The stack I settled on</h2>

<p>After trying a few variations, this is the shape that felt the most practical:</p>

<ul>
  <li>LM Studio as the local server</li>
  <li>Gemma 4 26B in GGUF format</li>
  <li>LM Studio’s OpenAI-compatible local endpoint</li>
  <li>An agent client on top, such as OpenCode, OpenHands, or a custom harness</li>
</ul>

<p>That last bullet matters.</p>

<p>LM Studio gives you model hosting, local inference, and a familiar API surface. That is useful. But it is not the agent. It is the model backend.</p>

<p>The agent still has to do the hard part:</p>

<ul>
  <li>manage message history</li>
  <li>decide when to call tools</li>
  <li>execute those tools</li>
  <li>feed tool results back into the model</li>
  <li>handle long outputs</li>
  <li>sanitize weird model output</li>
  <li>know when the task is actually done</li>
</ul>

<p>If that loop is weak, a good local model still feels unreliable. If the loop is solid, even an imperfect local model becomes usable much faster than you would expect.</p>

<h2 id="if-you-want-the-setup-guide-read-the-runbook">If you want the setup guide, read the runbook</h2>

<p>The setup details ended up being long enough that I split them into a separate post:</p>

<p><a href="/posts/lm-studio-local-agent-runbook/">LM Studio local agent runbook: Pi and OpenCode step by step</a></p>

<p>That runbook covers:</p>

<ul>
  <li>starting the LM Studio server</li>
  <li>checking the local endpoint</li>
  <li>wiring the model into <code class="language-plaintext highlighter-rouge">~/.pi/agents/models.json</code></li>
  <li>wiring the same endpoint into OpenCode</li>
  <li>doing a cheap verification pass before larger agent tasks</li>
  <li>the runtime and cache-related failures I would check first</li>
</ul>

<h2 id="why-i-ended-up-preferring-gguf-here">Why I ended up preferring GGUF here</h2>

<p>For this specific setup, GGUF was the path of least resistance.</p>

<p>I am not making a universal claim that GGUF is always better than every other format. I am saying that in my own tests, it was the easiest way to get Gemma into a stable LM Studio workflow for agent-style tasks.</p>

<p>The key word there is stable.</p>

<p>I care less about benchmark bragging rights than about whether the model can survive long prompts, diffs, and repeated turns without drifting into nonsense or getting stuck in some half-broken internal state. GGUF plus LM Studio gave me a workflow I could keep using. That was enough for me.</p>

<h2 id="what-broke-first">What broke first</h2>

<p>The first wave of problems had very little to do with “intelligence” and a lot to do with runtime behavior.</p>

<h3 id="1-long-prompts-hanging-at-0">1. Long prompts hanging at 0%</h3>

<p>This was the most annoying one.</p>

<p>I would send a large diff or a code-heavy prompt and LM Studio would sit there looking busy while doing nothing useful. In practice, the model was not making forward progress. It just looked stuck.</p>

<p>The fix that helped most was turning off Unified KV Cache and setting the context length manually instead of leaving it on auto.</p>

<p>That was my first clue that local agent work is often about runtime stability, not just model selection.</p>

<h3 id="2-reasoning-tag-leakage">2. Reasoning tag leakage</h3>

<p>Gemma can leak internal reasoning markers or thought-channel fragments into the visible output. If you are just chatting in a UI, that is ugly. If you are piping the response into an agent workflow, it is worse than ugly. It can break downstream parsing or contaminate structured output.</p>

<p>I saw this most often when the prompt mixed code review instructions with tool-like formatting expectations.</p>

<p>Adding stop tokens helped. Tightening the system prompt helped. But the deeper lesson was that local agent clients need output sanitation anyway. You should not assume the model will always hand you clean final text.</p>

<h3 id="3-context-pressure-shows-up-fast">3. Context pressure shows up fast</h3>

<p>A local coding model can look fine on a short task and then fall apart the moment you give it a real repository diff, a stack trace, and a few tool results in the same conversation.</p>

<p>This is where local setups stop feeling like “cheap cloud replacements” and start feeling like engineering systems with real constraints. Context is not just a number on the model card. It is a budget, and agent loops spend that budget aggressively.</p>

<h3 id="4-good-answers-are-easier-than-reliable-behavior">4. Good answers are easier than reliable behavior</h3>

<p>This one took me a while to admit.</p>

<p>On a single prompt, Gemma could often produce a respectable answer about code. That made me optimistic. But agent workloads are harsher than single prompts. They require consistency across turns, clean tool-use behavior, and enough formatting discipline that the harness can keep going.</p>

<p>A model that gives one good answer is not automatically a good agent backend.</p>

<h2 id="troubleshooting-notes-from-my-own-setup">Troubleshooting notes from my own setup</h2>

<p>These are the problems I actually ran into while trying to make this usable for agents.</p>

<p>One thing I wish more local-LLM writeups did: include the ugly error text.</p>

<p>When you are debugging this stack, vague advice is not that helpful. Concrete failure signatures are.</p>

<h3 id="qwen-35b-on-mlx-ran-out-of-memory">Qwen 35B on MLX ran out of memory</h3>

<p>This was one of the first practical limits I hit.</p>

<p>On paper, a larger local model always sounds attractive. In practice, once I pushed into Qwen 35B-class MLX setups, memory pressure became the story. The model might load partway, fail during inference, or behave badly once context and KV usage started climbing.</p>

<p>That pushed me toward setups that were slightly smaller but much easier to keep alive for repeated agent turns.</p>

<p>This is one reason I think local agent work should be judged as a system, not just a model preference. A model that looks stronger in theory is not useful if it keeps falling over in the workflow you actually want.</p>

<h3 id="kv-cache-support-on-lm-studio--mlx-was-not-fully-there-for-me">KV cache support on LM Studio + MLX was not fully there for me</h3>

<p>This was another source of confusion.</p>

<p>I expected KV cache behavior to be a straightforward performance win. In this setup, it was not. My experience was that KV caching on the MLX path inside LM Studio was not something I could treat as fully reliable for agent-style workloads.</p>

<p>That matters because agent sessions are exactly the kind of workload where you want caching to help. Long turns, repeated context, retries, and follow-up prompts should benefit from it. But if that layer is unstable, it stops being an optimization and turns into a source of weird failures.</p>

<p>One example I hit looked like this:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre>Error: Error in iterating prediction stream:
NotImplementedError: RotatingKVCache Quantization NYI
</pre></td></tr></tbody></table></code></pre></div></div>

<p>If you run into that kind of error, I would treat it as a runtime capability gap first, not something you are supposed to fix with prompting.</p>

<p>In plain English: the stack is telling you that the KV/cache path you are trying to use is not fully implemented for that configuration yet.</p>

<h3 id="shared-kv-support-for-gemma-was-not-there-in-lm-studio-runtime">Shared KV support for Gemma was not there in LM Studio runtime</h3>

<p>This deserves to be called out separately because it wasted a lot of time for me.</p>

<p>Gemma plus shared KV sounded like exactly the kind of optimization that should help agent loops. In practice, in LM Studio runtime, I had to treat shared KV as not supported for the setup I was using.</p>

<p>That matters because if you assume shared KV is working, you can end up chasing fake explanations for behavior that is really coming from unsupported runtime features.</p>

<p>My rule here became simple: if Gemma is behaving strangely and shared KV is part of the configuration, remove that variable first.</p>

<h3 id="0-prompt-stalls-were-tied-to-bad-kv-cache-state">0% prompt stalls were tied to bad KV cache state</h3>

<p>This was the most visible symptom.</p>

<p>When I saw LM Studio sit at 0% on a prompt that should have started normally, the issue often traced back to invalid or broken KV caching state. Turning off Unified KV Cache and resetting the session usually helped more than trying to tweak the prompt itself.</p>

<p>That is why I would treat 0% prompt stalls as a runtime problem first, not a prompting problem first.</p>

<p>If I hit this again, my default response would be:</p>

<ol>
  <li>stop the current run</li>
  <li>clear or avoid the problematic cache path</li>
  <li>disable Unified KV Cache</li>
  <li>retry with a smaller prompt to confirm the model is healthy again</li>
</ol>

<p>That sequence saved me time because it kept me from debugging the wrong layer.</p>

<p>Another important lesson here: the prompt is often innocent.</p>

<p>If the runtime state is broken, rewriting the prompt five times is just a way to waste an afternoon.</p>

<h3 id="gemma-think-tag-leakage-is-real">Gemma think-tag leakage is real</h3>

<p>This one showed up often enough that I think it belongs in the troubleshooting section, not just the general setup notes.</p>

<p>Gemma can leak internal reasoning or think-tag markers into visible output. Depending on the client, this may show up as raw reasoning blocks, partial channel markers, or output that looks like it forgot to separate internal thought from the final answer.</p>

<p>That is annoying in a chat window. In an agent loop, it is worse because it can:</p>

<ul>
  <li>break parsing</li>
  <li>contaminate structured output</li>
  <li>confuse downstream tool-call handling</li>
  <li>waste context on junk you never wanted to keep</li>
</ul>

<p>What helped me:</p>

<ul>
  <li>add stop tokens for the leaked markers</li>
  <li>keep the output format expectations simple</li>
  <li>avoid mixing too many formatting rules into one prompt</li>
  <li>sanitize model output in the harness instead of assuming the model will cleanly separate reasoning every time</li>
</ul>

<p>I would treat this as a normal engineering concern when using Gemma for agents, not as a weird edge case.</p>

<h3 id="simplifying-pi-prompts-matters-more-than-i-expected">Simplifying Pi prompts matters more than I expected</h3>

<p>Another lesson from the agent side: the prompt budget disappears fast.</p>

<p>If Pi is carrying a huge system prompt, a long skill catalog, MCP server descriptions, tool instructions, and model-specific behavior rules, you can burn a surprising amount of context before the real task even starts.</p>

<p>For local models, that overhead hurts more.</p>

<p>The practical fix was to simplify the Pi prompt aggressively:</p>

<ul>
  <li>trim skills that are not needed for the current workflow</li>
  <li>avoid loading MCP/tool descriptions that the task will not use</li>
  <li>keep the system prompt direct and task-specific</li>
  <li>prefer short operational rules over a giant “do everything” instruction block</li>
</ul>

<p>The goal is not to make the agent dumb. The goal is to stop spending half the context window on scaffolding.</p>

<p>For local coding models in particular, I think this matters a lot. Smaller prompt overhead usually means:</p>

<ul>
  <li>less wasted context</li>
  <li>fewer formatting mistakes</li>
  <li>less chance of the model getting distracted by meta-instructions</li>
  <li>more room for the actual code, diff, logs, and tool results</li>
</ul>

<p>If the model is struggling, one of the highest-leverage fixes is often not “find a smarter model.” It is “stop stuffing the prompt with framework overhead the model does not need.”</p>

<h3 id="a-short-troubleshooting-checklist-i-would-use-again">A short troubleshooting checklist I would use again</h3>

<p>If I were setting this up from scratch and it started failing, this is the order I would check things:</p>

<ol>
  <li>Confirm LM Studio is actually serving the model you think it is serving.</li>
  <li>Check whether the exact model ID in the client matches the runtime.</li>
  <li>Remove shared KV assumptions from the setup.</li>
  <li>Disable Unified KV Cache.</li>
  <li>Treat MLX KV/cache errors as runtime limitations first.</li>
  <li>Retry with a much smaller prompt.</li>
  <li>Strip down the Pi or agent system prompt so the model gets more room for the real task.</li>
  <li>If Gemma output is leaking think tags, fix that in both prompt design and output sanitation.</li>
</ol>

<p>That sequence is not elegant, but it is the one I trust now.</p>

<h2 id="the-settings-that-made-it-usable">The settings that made it usable</h2>

<p>These were the settings and habits that moved the setup from “interesting demo” to “something I would actually plug into a tool.”</p>

<h3 id="disable-unified-kv-cache">Disable Unified KV Cache</h3>

<p>This was the biggest quality-of-life improvement in my setup for long prompts and review-style workloads.</p>

<p>If you are seeing prompt hangs, weird stalls, or long code-heavy requests that never quite start, this is the first thing I would change.</p>

<h3 id="set-context-length-manually">Set context length manually</h3>

<p>I had better results when I chose an explicit context size instead of trusting auto mode.</p>

<p>That gave me more predictable behavior, especially when I was switching between smaller code questions and much larger review tasks.</p>

<h3 id="add-stop-tokens-for-leaked-thought-markers">Add stop tokens for leaked thought markers</h3>

<p>If your model output is spraying thought tags into normal text, stop tokens are worth trying immediately.</p>

<p>This is not a complete fix. It is more like putting guardrails around a messy edge. Still worth doing.</p>

<h3 id="use-conservative-expectations-for-long-diffs">Use conservative expectations for long diffs</h3>

<p>This is less a setting and more a survival rule: do not assume the local model should ingest your entire diff, your entire style guide, and a huge pile of tool output in one go.</p>

<p>Chunk work where you can. Summarize aggressively. Keep the agent loop disciplined.</p>

<h3 id="treat-sampling-as-workload-specific">Treat sampling as workload-specific</h3>

<p>I would not hard-sell one magical sampling profile for every coding task. In practice, I found that code review, repo Q&amp;A, and action-oriented tool use want slightly different behavior.</p>

<p>For me, the more important point was not the exact number. It was avoiding the temptation to keep turning the model into a “creative” assistant when what I actually needed was predictable structured behavior.</p>

<h2 id="where-agents-come-in">Where agents come in</h2>

<p>This is the part that is usually missing from local LLM articles.</p>

<p>Running Gemma in LM Studio does not give you a coding agent. It gives you a model server.</p>

<p>To get agent behavior, something above that server has to implement a loop like this:</p>

<ol>
  <li>Send the current task and available tools to the model.</li>
  <li>Parse whether the model wants to answer directly or call a tool.</li>
  <li>Execute the tool if requested.</li>
  <li>Feed the result back into the conversation.</li>
  <li>Repeat until the model produces a final answer or hits a stopping rule.</li>
</ol>

<p>That sounds obvious when written out. It is also where a lot of local setups quietly fail.</p>

<p>If the harness does not execute the tool correctly, the model looks dumb.
If the harness appends tool results in a format the model does not handle well, the model looks confused.
If the harness keeps dumping huge command output back into context, the model looks forgetful.</p>

<p>The local endpoint is only one piece of the system.</p>

<h2 id="where-this-fits-with-openhands-opencode-and-custom-harnesses">Where this fits with OpenHands, OpenCode, and custom harnesses</h2>

<p>What I like about LM Studio in this role is that it can behave like a local OpenAI-compatible backend. That makes experimentation easier because a lot of agent tooling already assumes that style of interface.</p>

<p>So the setup I kept coming back to was:</p>

<ul>
  <li>LM Studio hosts the model locally</li>
  <li>the agent client points at <code class="language-plaintext highlighter-rouge">http://localhost:1234/v1</code></li>
  <li>the agent loop handles tools, memory pressure, and retries</li>
</ul>

<p>That works well for:</p>

<ul>
  <li>OpenCode-style local coding sessions</li>
  <li>OpenHands-style experiments where you want a local model backend</li>
  <li>custom agents that already know how to talk to OpenAI-compatible endpoints</li>
</ul>

<p>What it does not solve is model capability mismatch. If your agent framework expects extremely clean function calling, long-context reliability, or excellent multi-step planning under pressure, local Gemma may still feel rough compared to stronger cloud models.</p>

<p>That is not a failure of LM Studio. It is just the reality of the stack.</p>

<h2 id="what-this-setup-is-actually-good-for">What this setup is actually good for</h2>

<p>Once I stopped expecting a local model to be a universal drop-in replacement for the best cloud systems, the setup got much more useful.</p>

<p>Here is where I think it makes sense.</p>

<h3 id="pr-review-and-diff-reading">PR review and diff reading</h3>

<p>This is one of the better use cases because the task is bounded and the output format is easy to evaluate.</p>

<p>The model can read a diff, point out obvious risk, summarize a change, and flag suspicious sections. You still need judgment. You still need tests. But it is useful enough to be worth keeping around.</p>

<h3 id="repo-qa">Repo Q&amp;A</h3>

<p>If you want to ask questions about a codebase, architecture notes, or internal docs that you do not want to upload elsewhere, local is attractive.</p>

<p>This gets even better when the agent harness is disciplined about retrieval and does not just dump huge blobs of text into the prompt.</p>

<h3 id="bounded-refactors">Bounded refactors</h3>

<p>I would trust this more for “rename this pattern in these files” than for “redesign the authentication system.”</p>

<p>That distinction is important. Local models can be very handy for narrow, repetitive engineering work. They are much less trustworthy when the task becomes open-ended or architectural.</p>

<h3 id="private-knowledge-heavy-tasks">Private knowledge-heavy tasks</h3>

<p>If the task depends on local notes, internal guides, or a private personal knowledge base, a local endpoint starts to feel compelling even when the model is not state of the art.</p>

<p>Sometimes privacy and convenience matter more than squeezing out the last 10% of model quality.</p>

<h2 id="where-it-still-falls-short">Where it still falls short</h2>

<p>This setup is useful, but I would not oversell it.</p>

<h3 id="it-is-still-easier-to-get-bad-agent-behavior-than-good-agent-behavior">It is still easier to get bad agent behavior than good agent behavior</h3>

<p>Single-turn demos flatter local models. Multi-step agent loops expose every weakness.</p>

<p>You notice formatting issues, small reasoning slips, and context handling problems much more when the model has to survive several turns in a row.</p>

<h3 id="long-context-work-remains-fragile">Long-context work remains fragile</h3>

<p>Even when the model technically accepts a large context, that does not mean it uses it well.</p>

<p>If your workflow depends on feeding in very large diffs, long logs, and many tool results without careful pruning, you will probably have a bad time.</p>

<h3 id="tool-use-can-be-brittle">Tool use can be brittle</h3>

<p>Some failures are obvious. The model outputs malformed JSON. Or it leaks thought tags into a tool call. Or it starts narrating instead of acting.</p>

<p>Some failures are subtler. The tool call shape is technically valid but not useful. The model calls the wrong tool. The model forgets why it called the tool in the first place.</p>

<p>Again, this is why the harness matters so much.</p>

<h3 id="it-is-not-the-best-choice-for-every-job">It is not the best choice for every job</h3>

<p>If I need maximum reliability on a complex, high-context, multi-step coding task, I would still reach for a stronger cloud model first.</p>

<p>The local setup wins when privacy, cost control, experimentation, or offline access matter enough to justify the trade.</p>

<h2 id="my-practical-recommendation">My practical recommendation</h2>

<p>If your goal is to build a local engine for agent usage, I think LM Studio plus Gemma is a reasonable place to start.</p>

<p>Not because it is perfect. Because it is accessible.</p>

<p>You get a local server, an OpenAI-compatible endpoint, and a model that is capable enough to make the experiment real. That is a good combination for people who want to learn how local agents actually work instead of just reading about them.</p>

<p>I would recommend it to:</p>

<ul>
  <li>developers experimenting with local coding agents</li>
  <li>people who want to review private code or docs locally</li>
  <li>anyone building a custom harness and needing a local backend to test against</li>
</ul>

<p>I would not recommend it as your only serious option if:</p>

<ul>
  <li>you need highly reliable long-context reasoning</li>
  <li>you need polished tool calling with very little cleanup</li>
  <li>you are trying to match the strongest hosted coding models head-on</li>
</ul>

<p>My own conclusion is pretty simple.</p>

<p>LM Studio plus Gemma can absolutely work as a local engine for agent workflows. It is good enough for real experiments and some real tasks. But the useful mental model is not “I installed a local genius on my laptop.”</p>

<p>It is “I built a constrained local backend, then did the engineering work required to make an agent around it behave.”</p>

<p>That framing is less glamorous. It is also much closer to the truth.</p>]]></content><author><name>Joey Wang</name></author><category term="AI" /><category term="Engineering" /><category term="ai" /><category term="llm" /><category term="gemma" /><category term="lm-studio" /><category term="agents" /><category term="coding-agent" /><category term="local-llm" /><summary type="html"><![CDATA[A practical guide to using LM Studio and Gemma as a local OpenAI-compatible backend for coding agents, including what broke, what settings mattered, and where the setup still falls short.]]></summary></entry><entry><title type="html">Building a Local Coding Agent (Codex/Claude-Code Style) with Gemma</title><link href="https://joeywang.github.io//posts/code-agent/" rel="alternate" type="text/html" title="Building a Local Coding Agent (Codex/Claude-Code Style) with Gemma" /><published>2026-04-09T00:00:00+00:00</published><updated>2026-04-09T00:00:00+00:00</updated><id>https://joeywang.github.io//posts/code-agent</id><content type="html" xml:base="https://joeywang.github.io//posts/code-agent/"><![CDATA[<h1 id="building-a-local-coding-agent-codexclaude-code-style-with-gemma">Building a Local Coding Agent (Codex/Claude-Code Style) with Gemma</h1>

<p>Last week I spent an evening trying to get Gemma 4 (26B) to run a simple coding task through my own agent: “find all the Ruby files in this directory and replace <code class="language-plaintext highlighter-rouge">before_filter</code> with <code class="language-plaintext highlighter-rouge">before_action</code>.” The first tool call worked perfectly. The model correctly asked to run <code class="language-plaintext highlighter-rouge">find . -name '*.rb'</code>. But when I fed the file list back to it, instead of calling <code class="language-plaintext highlighter-rouge">sed</code> or a file editor, it started <em>explaining</em> what I should do next, as if I were asking for advice rather than expecting it to act.</p>

<p>I bumped the temperature down. I rewrote the prompt three times. I tried adding explicit instructions like “you must call a tool.” Nothing helped.</p>

<p>The problem wasn’t Gemma 4. The problem was my agent. I was treating it like a chatbot with tool access, but what I actually needed was a state machine.</p>

<h2 id="whats-missing-from-most-local-agent-setups">What’s Missing from Most Local Agent Setups</h2>

<p>Ollama gives you model execution, streaming, and a tool call output format. It even supports a thinking channel now. But it doesn’t give you an agent loop. It doesn’t execute tools, manage state, or decide whether to call the model again. It just returns whatever the model outputs and leaves the rest to you.</p>

<p><code class="language-plaintext highlighter-rouge">llama.cpp</code> is even more bare. You get raw model inference and nothing else. That’s fine if you want to build everything from scratch. But it means you can’t just drop in a tool schema and expect multi-step behavior to work.</p>

<p>This gap is where things fall apart. The model knows how to request a tool call. But if your runtime doesn’t execute that call, append the result to the message history, and feed it back into the model, the conversation stops dead. Or worse, the model improvises. That usually means it starts describing what <em>would</em> happen if someone ran the tool, instead of actually asking for it to run.</p>

<h2 id="the-agent-loop-is-the-whole-thing">The Agent Loop Is the Whole Thing</h2>

<p>Here’s the core loop, stripped down:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre><span class="n">messages</span> <span class="o">=</span> <span class="n">initial_messages</span>

<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
    <span class="n">response</span> <span class="o">=</span> <span class="nf">llm</span><span class="p">(</span><span class="n">messages</span><span class="p">,</span> <span class="n">tools</span><span class="o">=</span><span class="n">tools</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">response</span><span class="p">.</span><span class="n">tool_calls</span><span class="p">:</span>
        <span class="n">messages</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">call</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="n">tool_calls</span><span class="p">:</span>
            <span class="n">result</span> <span class="o">=</span> <span class="nf">execute_tool</span><span class="p">(</span><span class="n">call</span><span class="p">)</span>
            <span class="n">messages</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
                <span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">tool</span><span class="sh">"</span><span class="p">,</span>
                <span class="sh">"</span><span class="s">tool_call_id</span><span class="sh">"</span><span class="p">:</span> <span class="n">call</span><span class="p">.</span><span class="nb">id</span><span class="p">,</span>
                <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">json</span><span class="p">.</span><span class="nf">dumps</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
            <span class="p">})</span>
        <span class="c1"># loop back
</span>    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">content</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>That’s it. The <code class="language-plaintext highlighter-rouge">continue</code> back into the model after each tool execution is the part most people miss, or implement badly. Without it, you get one tool call and then nothing.</p>

<h3 id="what-i-got-wrong-first">What I Got Wrong First</h3>

<p>My original code collected the streaming response and checked for tool calls, but I was only looking at the <em>first</em> response chunk. With Gemma 4’s thinking channel enabled, the first chunk contained reasoning text, and the actual tool call came several chunks later. I was executing on incomplete output.</p>

<p>The fix was simple: accumulate the full stream, then parse. Don’t act on partial data.</p>

<h2 id="what-the-model-exposes-vs-what-the-agent-must-build">What the Model Exposes vs What the Agent Must Build</h2>

<p>When you use something like Claude Code or Codex, the smooth experience comes from a tight integration between what the model exposes and what the agent runtime does with it. Here’s a mapping of the main LLM features to what your agent needs to support.</p>

<h3 id="tool--function-calling">Tool / Function Calling</h3>

<p><strong>What the model does:</strong> Outputs structured tool call objects with a name and arguments. It doesn’t call the tool. It just signals intent. In the raw token stream, this often looks like a special marker followed by JSON:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>&lt;/think&gt;

{"name": "run_shell", "arguments": {"command": "ls -la"}}
</pre></td></tr></tbody></table></code></pre></div></div>

<p><strong>What the agent must build:</strong> Parse the tool call from the stream, validate the JSON arguments, dispatch to the right function, capture stdout/stderr/exit code, and return the result as a structured <code class="language-plaintext highlighter-rouge">tool</code> role message. The model expects the result to come back with the same <code class="language-plaintext highlighter-rouge">tool_call_id</code> it used. If you lose that ID, the model has no way to know which tool call this result corresponds to. That matters even more when the model fires multiple tool calls in parallel.</p>

<h3 id="thinking--reasoning-channel">Thinking / Reasoning Channel</h3>

<p><strong>What the model does:</strong> Emits reasoning tokens before the actual answer or tool call. These tokens represent the model’s “chain of thought,” the step-by-step reasoning that leads to a decision. In some models, these tokens are hidden from the final output but still influence the model’s behavior.</p>

<p><strong>What the agent must build:</strong> Separate the thinking tokens from the actual output. You have two choices: show them to the user (for transparency, like Claude Code does) or hide them (for a cleaner UI). Critically, thinking tokens must <em>not</em> be treated as tool calls or as the final answer. They’re an internal monologue. If your agent confuses them with real output, the user sees garbled text and the tool loop breaks.</p>

<p>I found that enabling the thinking channel for simple tasks actually made things worse. The model would reason out loud about what tool to call, then somehow convince itself the task was already done. Now I only enable reasoning for the planning phase, figuring out what steps to take, and disable it for the actual tool execution loop.</p>

<h3 id="streaming">Streaming</h3>

<p><strong>What the model does:</strong> Emits tokens one at a time (or in small batches). Tool call JSON often arrives across multiple chunks. Thinking tokens and answer tokens can interleave.</p>

<p><strong>What the agent must build:</strong> A streaming buffer that accumulates all chunks until the model signals completion. Only then should you parse the full response to decide what happened. If you act on partial data, you’ll try to execute a tool call with truncated JSON, or display an incomplete answer to the user.</p>

<p>This is the most common bug I see in custom agents. The stream arrives in chunks, the first chunk looks like a tool call, the agent fires off the tool, and the rest of the tool call data arrives too late. Then the model gets confused because it never got a result for the <em>full</em> tool call it made.</p>

<h3 id="message-roles-and-format">Message Roles and Format</h3>

<p><strong>What the model does:</strong> Expects messages in a specific format. Different models have different expectations. OpenAI models expect <code class="language-plaintext highlighter-rouge">system</code>, <code class="language-plaintext highlighter-rouge">user</code>, <code class="language-plaintext highlighter-rouge">assistant</code>, and <code class="language-plaintext highlighter-rouge">tool</code> roles. Gemma 4 works better with fewer role types — mainly <code class="language-plaintext highlighter-rouge">user</code> and <code class="language-plaintext highlighter-rouge">model</code> — with system instructions flattened into the conversation context.</p>

<p><strong>What the agent must build:</strong> A message format adapter that translates between your internal representation and what the model expects. If you’re using Ollama’s OpenAI-compatible endpoint, the adapter is built in. If you’re talking to the model directly (llama.cpp, or Ollama’s native API), you need to handle this yourself.</p>

<p>The thing that tripped me up: I was appending tool results with a <code class="language-plaintext highlighter-rouge">tool</code> role, but my Ollama setup expected them as part of a <code class="language-plaintext highlighter-rouge">user</code> turn. The model received messages in a format it didn’t recognize, and the tool loop silently broke — no error, just the model ignoring the result and generating something else.</p>

<h3 id="context-window-management">Context Window Management</h3>

<p><strong>What the model does:</strong> Has a fixed context window (e.g., 8K, 16K tokens). Everything in the message history — user prompts, model responses, tool results — consumes tokens. When you exceed the window, older tokens get truncated.</p>

<p><strong>What the agent must build:</strong> A strategy for managing context growth. Tool results can be large — a <code class="language-plaintext highlighter-rouge">grep -r</code> across a codebase might return thousands of lines. If you dump every tool result into the message history without thinking, you’ll burn through your context window in three turns.</p>

<p>Common approaches: summarize tool results before appending, truncate output that exceeds a threshold, or maintain a separate “memory” that the model can query instead of keeping everything in the active context. For coding tasks, I’ve had good luck truncating file contents to relevant sections and summarizing long shell output.</p>

<h3 id="parallel-tool-calls">Parallel Tool Calls</h3>

<p><strong>What the model does:</strong> Can emit multiple tool calls in a single response when it determines they’re independent — for example, reading five files at once. The model expects these to be executed and the results returned in the same turn.</p>

<p><strong>What the agent must build:</strong> Logic to detect multiple tool calls, decide whether to run them in parallel or sequentially, and collect all results before feeding them back to the model. Running independent file reads in parallel is a nice optimization, but running dependent tool calls in parallel (read a file, then write to it) is a bug.</p>

<h3 id="stop-sequences-and-completion-detection">Stop Sequences and Completion Detection</h3>

<p><strong>What the model does:</strong> Signals that it’s done generating by hitting a stop sequence or a special end-of-turn token. The runtime (Ollama, llama.cpp) tells you when generation is complete.</p>

<p><strong>What the agent must build:</strong> Use the completion signal to decide the next step. If the response contains tool calls → execute and loop. If it’s plain text → return it to the user. Without reliable completion detection, you can’t reliably distinguish between “the model is still thinking” and “the model is done.”</p>

<h2 id="putting-it-all-together">Putting It All Together</h2>

<p>Here’s how these pieces interact in a working agent:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
</pre></td><td class="rouge-code"><pre>User sends request
    ↓
Agent builds message history (with context management)
    ↓
Agent calls LLM with tools schema
    ↓
LLM streams tokens (thinking + answer + possibly tool calls)
    ↓
Agent buffers the full stream
    ↓
Agent parses: thinking tokens → tool calls OR final answer
    ↓
If tool calls:
    → Agent executes them (parallel if independent)
    → Agent captures results
    → Agent appends results to message history
    → Agent loops back to LLM call
If final answer:
    → Agent returns to user
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Every arrow in that diagram is a place where something can go wrong. The streaming buffer is where partial-data bugs live. The message history is where format mismatches cause silent failures. The tool executor is where you need to handle errors, timeouts, and large outputs.</p>

<h2 id="a-minimal-working-setup">A Minimal Working Setup</h2>

<p>The message type:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="kd">type</span> <span class="nx">Message</span> <span class="o">=</span>
  <span class="o">|</span> <span class="p">{</span> <span class="na">role</span><span class="p">:</span> <span class="dl">"</span><span class="s2">user</span><span class="dl">"</span><span class="p">;</span> <span class="nl">content</span><span class="p">:</span> <span class="kr">string</span> <span class="p">}</span>
  <span class="o">|</span> <span class="p">{</span> <span class="na">role</span><span class="p">:</span> <span class="dl">"</span><span class="s2">assistant</span><span class="dl">"</span><span class="p">;</span> <span class="nl">content</span><span class="p">?:</span> <span class="kr">string</span><span class="p">;</span> <span class="nl">tool_calls</span><span class="p">?:</span> <span class="nx">ToolCall</span><span class="p">[]</span> <span class="p">}</span>
  <span class="o">|</span> <span class="p">{</span> <span class="na">role</span><span class="p">:</span> <span class="dl">"</span><span class="s2">tool</span><span class="dl">"</span><span class="p">;</span> <span class="nl">tool_call_id</span><span class="p">:</span> <span class="kr">string</span><span class="p">;</span> <span class="nl">content</span><span class="p">:</span> <span class="kr">string</span> <span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>A tool schema — nothing fancy:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="rouge-code"><pre><span class="p">{</span><span class="w">
  </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"get_weather"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Get weather by city"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"parameters"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"object"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"city"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="w"> </span><span class="p">}</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"required"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"city"</span><span class="p">]</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></div></div>

<p>The core loop:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
</pre></td><td class="rouge-code"><pre><span class="k">async</span> <span class="kd">function</span> <span class="nf">runAgent</span><span class="p">(</span><span class="nx">messages</span><span class="p">:</span> <span class="nx">Message</span><span class="p">[]):</span> <span class="nb">Promise</span><span class="o">&lt;</span><span class="kr">string</span><span class="o">&gt;</span> <span class="p">{</span>
  <span class="k">while </span><span class="p">(</span><span class="kc">true</span><span class="p">)</span> <span class="p">{</span>
    <span class="kd">const</span> <span class="nx">response</span> <span class="o">=</span> <span class="k">await</span> <span class="nf">llm</span><span class="p">(</span><span class="nx">messages</span><span class="p">)</span>

    <span class="k">if </span><span class="p">(</span><span class="nx">response</span><span class="p">.</span><span class="nx">tool_calls</span><span class="p">)</span> <span class="p">{</span>
      <span class="nx">messages</span><span class="p">.</span><span class="nf">push</span><span class="p">(</span><span class="nx">response</span><span class="p">)</span>

      <span class="k">for </span><span class="p">(</span><span class="kd">const</span> <span class="nx">call</span> <span class="k">of</span> <span class="nx">response</span><span class="p">.</span><span class="nx">tool_calls</span><span class="p">)</span> <span class="p">{</span>
        <span class="kd">const</span> <span class="nx">result</span> <span class="o">=</span> <span class="k">await</span> <span class="nf">executeTool</span><span class="p">(</span><span class="nx">call</span><span class="p">)</span>

        <span class="nx">messages</span><span class="p">.</span><span class="nf">push</span><span class="p">({</span>
          <span class="na">role</span><span class="p">:</span> <span class="dl">"</span><span class="s2">tool</span><span class="dl">"</span><span class="p">,</span>
          <span class="na">tool_call_id</span><span class="p">:</span> <span class="nx">call</span><span class="p">.</span><span class="nx">id</span><span class="p">,</span>
          <span class="na">content</span><span class="p">:</span> <span class="nx">JSON</span><span class="p">.</span><span class="nf">stringify</span><span class="p">(</span><span class="nx">result</span><span class="p">)</span>
        <span class="p">})</span>
      <span class="p">}</span>

      <span class="k">continue</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="nx">response</span><span class="p">.</span><span class="nx">content</span>
  <span class="p">}</span>
<span class="p">}</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h2 id="planner-and-executor-split">Planner and Executor Split</h2>

<p>One thing I picked up from watching how Claude Code behaves: it doesn’t use a single model call for everything. There’s a planning phase (which may involve tool calls) and an execution phase (which produces the final answer). You can approximate this by using one pass of the model to decide on steps and call tools, then a second pass — sometimes even with temperature set to 0 — to synthesize the result.</p>

<p>It’s not strictly necessary for simple tasks, but once you’re doing multi-step refactoring work, the split makes the agent more predictable. The planner can wander through tool calls without worrying about producing a clean final answer. The executor gets a complete set of tool results and just needs to summarize.</p>

<h2 id="ollama-vs-llamacpp">Ollama vs llama.cpp</h2>

<p>Just use Ollama to start. You’ll save yourself days. The tool support works out of the box, the thinking channel is already wired in, and you can focus on building your agent loop instead of wrestling with low-level inference.</p>

<p>Switch to <code class="language-plaintext highlighter-rouge">llama.cpp</code> only when you need control over something that Ollama hides from you. I haven’t needed to yet, but when I do, it’ll be because I want to optimize the prompt format or handle the thinking channel more precisely.</p>

<h2 id="what-to-skip">What to Skip</h2>

<p>You don’t need a complex session system. You don’t need the full Ollama API. You don’t need abstractions around abstractions. You need:</p>

<ul>
  <li>A working tool loop</li>
  <li>A streaming buffer</li>
  <li>A prompt adapter for Gemma</li>
  <li>A state machine with maybe four states</li>
  <li>A tool executor that actually executes things</li>
</ul>

<h2 id="my-setup-on-m1-pro-32gb">My Setup on M1 Pro 32GB</h2>

<ul>
  <li><strong>Model:</strong> <code class="language-plaintext highlighter-rouge">gemma4:26b</code></li>
  <li><strong>Runtime:</strong> Ollama</li>
  <li><strong>Context:</strong> 8K (16K if the task needs it, but it burns memory fast)</li>
  <li><strong>Temperature:</strong> 0.1 for tool-heavy tasks, 0.3 for planning</li>
</ul>

<p>26B fits in unified memory alongside everything else I’m running, and the tool calling is reliable enough that I actually use it for real tasks now — not just experiments.</p>

<p>The thing I keep coming back to is that model quality matters less than people think once your loop is correct. A good agent loop makes a mediocre model feel capable. A broken agent loop makes a great model feel useless.</p>]]></content><author><name></name></author><category term="AI" /><category term="LLM" /><category term="Gemma" /><category term="Ollama" /><category term="coding-agent" /><summary type="html"><![CDATA[Building a Local Coding Agent (Codex/Claude-Code Style) with Gemma]]></summary></entry><entry><title type="html">Escaping the Cloud Token Trap: Building a Multi-Tenant Graph RAG System Locally</title><link href="https://joeywang.github.io//posts/rag-hub/" rel="alternate" type="text/html" title="Escaping the Cloud Token Trap: Building a Multi-Tenant Graph RAG System Locally" /><published>2026-04-08T14:00:00+00:00</published><updated>2026-04-08T14:00:00+00:00</updated><id>https://joeywang.github.io//posts/rag-hub</id><content type="html" xml:base="https://joeywang.github.io//posts/rag-hub/"><![CDATA[<h1 id="escaping-the-cloud-token-trap-building-a-multi-tenant-graph-rag-system-locally">Escaping the Cloud Token Trap: Building a Multi-Tenant Graph RAG System Locally</h1>

<p>We’ve all been there. You want to build a Retrieval-Augmented Generation (RAG) system, so you do the “standard” thing: you chunk your entire database, throw it into a vector database, and wire it up to a massive cloud LLM.</p>

<p>Then the bill arrives.</p>

<p>Worse, you realize that while the cloud model is brilliant at reasoning, it suffers from “lost in the middle” syndrome when fed thousands of tokens of irrelevant context. You are paying a premium to confuse the smartest models on the market.</p>

<p>The solution isn’t to stop using cloud models—it’s to stop using them for the dirty work. By shifting the indexing, embedding, and initial retrieval phases to a local environment using tools like <strong>Ollama</strong> and <strong>LightRAG</strong>, you can build a highly precise, multi-tenant knowledge graph that only sends the most critical context to the expensive models.</p>

<p>Here is how we designed a local, Hub-and-Spoke Graph RAG system using PostgreSQL, and the deep architectural lessons we learned along the way.</p>

<hr />

<h2 id="️-the-architecture-hub-and-spoke-graph-rag">🏗️ The Architecture: Hub-and-Spoke Graph RAG</h2>

<p>Traditional RAG relies heavily on Vector Databases (Dense Retrieval). While great for finding semantically similar sentences, vectors are terrible at understanding <em>relationships</em>. If a customer asks, <em>“How does feature X affect my billing?”</em>, a pure vector search might pull up a document about Feature X, and a separate document about Billing, but completely miss the bridge between them.</p>

<p>This is where <strong>LightRAG</strong> steps in. It builds a Knowledge Graph (GraphRAG), extracting entities (nodes) and their relationships (edges).</p>

<p>Our architecture uses a central Python Gateway to orchestrate this data flow from our existing PostgreSQL database into isolated LightRAG workspaces.</p>

<h3 id="the-core-stack">The Core Stack</h3>
<ul>
  <li><strong>Source of Truth:</strong> PostgreSQL (Housing raw app data, customer logs, and markdown docs).</li>
  <li><strong>The Engine:</strong> Ollama running <code class="language-plaintext highlighter-rouge">llama3.1</code> (for local reasoning/extraction) and <code class="language-plaintext highlighter-rouge">nomic-embed-text</code> (for embeddings).</li>
  <li><strong>The Brain:</strong> LightRAG, operating via a Dockerized container.</li>
  <li><strong>The Gateway:</strong> A FastAPI layer enforcing strict “Workspace” routing.</li>
</ul>

<p>By running a daily ETL (Extract, Transform, Load) script, we pull rows from Postgres, format them into highly structured markdown blocks, and push them to specific LightRAG workspaces (e.g., <code class="language-plaintext highlighter-rouge">customer_facing</code>, <code class="language-plaintext highlighter-rouge">internal_dev</code>, <code class="language-plaintext highlighter-rouge">product_codex</code>).</p>

<hr />

<h2 id="-deep-thoughts-why-this-approach-wins">🧠 Deep Thoughts: Why This Approach Wins</h2>

<h3 id="1-the-power-of-data-partitioning-in-rag">1. The Power of “Data Partitioning” in RAG</h3>
<p>One of the biggest mistakes in enterprise AI is creating a single, monolithic vector database. If you dump your API documentation, product roadmaps, and English learning resources into the same index, the LLM’s latent space gets muddy.</p>

<p>By utilizing <strong>Multi-Workspace Isolation</strong>, we apply a principle as old as database design: <em>separation of concerns</em>. The Gateway ensures that a customer asking a grammar question only queries the <code class="language-plaintext highlighter-rouge">customer_facing</code> graph, while a developer debugging an endpoint queries the <code class="language-plaintext highlighter-rouge">internal_dev</code> graph. It drastically reduces hallucination and limits unauthorized data access.</p>

<h3 id="2-graph-retrieval--vector-retrieval-for-complex-codebases">2. Graph Retrieval &gt; Vector Retrieval for Complex Codebases</h3>
<p>Code and product documentation are inherently relational. A function calls another function; a product feature requires a specific database schema.</p>

<p>When LightRAG processes a document, it doesn’t just store the text. It uses the local Ollama model to actively read the text and extract a graph. The initial ingestion is slow—local LLMs churn hard to build these relationships—but the querying is lightning fast. When you query the system, you perform a <strong>Hybrid Retrieval</strong>: BM25 (exact keyword) + Vector (semantic) + Graph (relational).</p>

<h3 id="3-asymmetric-compute-costs">3. Asymmetric Compute Costs</h3>
<p>We leverage a cheap, local LLM to do the heavy lifting of <em>reading and mapping</em> the data (the Graph extraction phase). We only use expensive cloud models (like Gemini) when the user asks a highly complex, reasoning-heavy question. The local system retrieves the perfect 800-token context block from the graph, and we forward <em>only</em> that precise block to the cloud model. We turned a $100/month prompt habit into pennies.</p>

<hr />

<h2 id="-further-improvements-scaling-the-system">🚀 Further Improvements: Scaling the System</h2>

<p>While this local Docker Compose stack is incredibly powerful, preparing it for massive scale requires a few architectural evolutions.</p>

<h3 id="1-semantic-caching-redismomento">1. Semantic Caching (Redis/Momento)</h3>
<p>Right now, if 100 users ask the chatbot, <em>“How do I reset my password?”</em>, the system performs the full graph-retrieval and generation process 100 times.
<strong>The Fix:</strong> Implement a Semantic Cache layer in front of the API Gateway. If the embedded intent of a new query has a 95% similarity match to a recently answered query, return the cached response instantly. This drops latency from seconds to milliseconds.</p>

<h3 id="2-agentic-routing-llm-as-a-router">2. Agentic Routing (LLM as a Router)</h3>
<p>Currently, our routing is deterministic (e.g., <code class="language-plaintext highlighter-rouge">if user_type == 'dev'</code>). As the system grows, we can deploy a micro-model (like an 8B parameter model) at the Gateway level to act purely as a “Traffic Cop.”
<strong>The Fix:</strong> The Traffic Cop reads the prompt and decides <em>which</em> workspace to query. If a Project Manager asks, <em>“Did the new API endpoint cause customer complaints?”</em>, the routing agent can intelligently query <em>both</em> the <code class="language-plaintext highlighter-rouge">internal_dev</code> workspace and the <code class="language-plaintext highlighter-rouge">customer_facing</code> workspace, synthesizing a cross-functional answer.</p>

<h3 id="3-event-driven-ingestion-webhooks">3. Event-Driven Ingestion (Webhooks)</h3>
<p>Batch syncing via cron jobs leaves the knowledge base out of date for up to 24 hours. 
<strong>The Fix:</strong> Move from an ETL script to an Event-Driven architecture using PostgreSQL triggers or a message broker (like RabbitMQ/Kafka). When a developer merges a pull request or updates a PRD, a webhook instantly fires a payload to LightRAG, keeping the Knowledge Graph updated in near real-time.</p>

<hr />

<h2 id="the-takeaway">The Takeaway</h2>

<p>Building AI systems isn’t just about calling the smartest API; it’s about systems engineering. By treating LLMs as modular components within a traditional software architecture—using local models for data processing and graph generation, and strictly controlling data flow via workspaces—you build systems that are not only cheaper to run, but exponentially more accurate.</p>

<p>The future of RAG isn’t just bigger context windows; it’s smarter, structured, local retrieval.</p>]]></content><author><name></name></author><category term="rag" /><summary type="html"><![CDATA[Escaping the Cloud Token Trap: Building a Multi-Tenant Graph RAG System Locally]]></summary></entry><entry><title type="html">The 2026 Developer’s Guide to Token Efficiency</title><link href="https://joeywang.github.io//posts/token-saving/" rel="alternate" type="text/html" title="The 2026 Developer’s Guide to Token Efficiency" /><published>2026-04-07T10:00:00+00:00</published><updated>2026-04-07T10:00:00+00:00</updated><id>https://joeywang.github.io//posts/token-saving</id><content type="html" xml:base="https://joeywang.github.io//posts/token-saving/"><![CDATA[<h1 id="the-2026-developers-guide-to-token-efficiency">The 2026 Developer’s Guide to Token Efficiency</h1>
<h3 id="mastering-the-context-tax-in-the-era-of-ai-agents">Mastering the “Context Tax” in the Era of AI Agents</h3>

<p>In 2026, the bottleneck for AI coding isn’t model intelligence—it’s the <strong>Context Tax</strong>. As agents like Claude Code (CC) and Codex become more autonomous, they tend to “over-read” your codebase, leading to massive input bills and hit-rate limits.</p>

<p>Here is the definitive breakdown of how to architect your workflow for maximum token thrift.</p>

<hr />

<h2 id="1-input-the-skeleton-architecture">1. Input: The “Skeleton” Architecture</h2>
<p><strong>The Idea:</strong> Move from “sending files” to “sending blueprints.” Use high-density index files to guide the AI.</p>

<ul>
  <li><strong>Solution:</strong> <strong><code class="language-plaintext highlighter-rouge">ai-codex</code></strong> or <strong>RepoMix</strong>.</li>
  <li><strong>Pros:</strong> Prevents the AI from running <code class="language-plaintext highlighter-rouge">ls -R</code> or <code class="language-plaintext highlighter-rouge">cat</code> on 50 different files just to find a single variable. Huge savings on “discovery” tokens.</li>
  <li><strong>Cons:</strong> If the index is outdated, the AI might hallucinate file paths that no longer exist.</li>
  <li><strong>Best Practice:</strong> Re-generate your skeleton file after every major refactor. Lead your session with: <em>“Read index.md first, do not search files until I give a specific task.”</em></li>
</ul>

<h2 id="2-output-the-caveman-protocol">2. Output: The “Caveman” Protocol</h2>
<p><strong>The Idea:</strong> Silence the AI’s “inner polite assistant.” Every word of “Sure, I’d be happy to help!” is a token you paid for.</p>

<ul>
  <li><strong>Solution:</strong> The <strong>Caveman</strong> plugin (by Julius Brussee) or telegraphic system prompts.</li>
  <li><strong>Pros:</strong> Reduces output tokens by 40–70%. Faster response times (lower latency).</li>
  <li><strong>Cons:</strong> Can feel “cold.” Complex logic might occasionally lose nuance if the compression is too aggressive.</li>
  <li><strong>Best Practice:</strong> Use <code class="language-plaintext highlighter-rouge">/caveman full</code> in CC or add <code class="language-plaintext highlighter-rouge">Output: Telegraphic, fragments only, no preamble</code> to your rules.</li>
</ul>

<h2 id="3-command-output-rtk--terminal-compaction">3. Command Output: RTK &amp; Terminal Compaction</h2>
<p><strong>The Idea:</strong> Stop letting 2,000 lines of “Test Passed” logs flood your chat history.</p>

<ul>
  <li><strong>Solution:</strong> <strong>RTK (Real-time Kitchen)</strong> or <strong>distill</strong>.</li>
  <li><strong>Pros:</strong> Turns a massive stack trace or <code class="language-plaintext highlighter-rouge">npm install</code> log into a 3-line summary. Keeps your conversation “clean” for much longer.</li>
  <li><strong>Cons:</strong> “Blindness.” If a minor warning was the root cause of a bug, the compression might hide it.</li>
  <li><strong>Best Practice:</strong> Use RTK by default. If the AI is stuck, run a “Raw” command once to see the full context: <code class="language-plaintext highlighter-rouge">rtk --raw &lt;command&gt;</code>.</li>
</ul>

<h2 id="4-code-locating-symbol-level-retrieval">4. Code Locating: Symbol-Level Retrieval</h2>
<p><strong>The Idea:</strong> Don’t read the haystack to find the needle. Use AST (Abstract Syntax Tree) indexing.</p>

<ul>
  <li><strong>Solution:</strong> <strong>Serena</strong> (LSP-based) or <strong>CocoIndex</strong>.</li>
  <li><strong>Pros:</strong> Instead of reading a 2,000-token file, the AI uses a tool to fetch <em>only</em> a specific function. <strong>CocoIndex</strong> is particularly loved for its AST-based “incremental” indexing that stays sync’d with your git branches.</li>
  <li><strong>Cons:</strong> Requires a Language Server (LSP) to be running in the background.</li>
  <li><strong>Best Practice:</strong> Favor “Get Symbol” tools over “Read File” tools for large legacy codebases.</li>
</ul>

<h2 id="5-mcp-proxy-the-middleware-layer">5. MCP Proxy: The Middleware Layer</h2>
<p><strong>The Idea:</strong> Intercept and optimize the communication between your IDE and the LLM.</p>

<ul>
  <li><strong>Solution:</strong> <strong>Lean-ctx</strong> (MCP) or <strong>Graphify</strong>.</li>
  <li><strong>Pros:</strong> <strong>Graphify</strong> turns your project into a Knowledge Graph; the AI queries the graph (cheap) instead of scanning files (expensive). <strong>Lean-ctx</strong> acts as a “shredder,” stripping comments and whitespace before tokens are counted.</li>
  <li><strong>Cons:</strong> Adds a slight layer of setup complexity to your <code class="language-plaintext highlighter-rouge">config.json</code>.</li>
  <li><strong>Best Practice:</strong> Use <strong>Graphify</strong> for architectural understanding and <strong>Lean-ctx</strong> for day-to-day coding to strip boilerplate.</li>
</ul>

<h2 id="6-context-shorten-wenyan--compaction">6. Context Shorten: “Wenyan” &amp; Compaction</h2>
<p><strong>The Idea:</strong> Use high-density languages or “Garbage Collection” to keep the window small.</p>

<ul>
  <li><strong>The “Wenyan” Hack:</strong> In 2026, some devs use <strong>Wenyan (Classical Chinese)</strong> MCPs to store documentation. Because Classical Chinese is so dense, it can store 5x more information per token than English.</li>
  <li><strong>Solution:</strong> The <strong><code class="language-plaintext highlighter-rouge">/compact</code></strong> command or <strong>Session Layering</strong>.</li>
  <li><strong>Pros:</strong> Flushes the memory of 10 prompts ago that are no longer relevant. Prevents “Attention Drift.”</li>
  <li><strong>Cons:</strong> If you haven’t committed your work, the AI might “forget” the previous state of the code.</li>
  <li><strong>Best Practice:</strong> Always <strong>Git Commit</strong> a working chunk, then run a compaction command. Treat your chat history like a <code class="language-plaintext highlighter-rouge">tmp</code> folder—delete it often.</li>
</ul>

<hr />

<h3 id="final-verdict-the-lean-stack">Final Verdict: The “Lean” Stack</h3>
<p>For the ultimate 2026 setup, combine these:</p>
<ol>
  <li><strong>Map:</strong> <code class="language-plaintext highlighter-rouge">ai-codex</code> (The Map)</li>
  <li><strong>Locate:</strong> <code class="language-plaintext highlighter-rouge">CocoIndex</code> (The AST Surgeon)</li>
  <li><strong>Shred:</strong> <code class="language-plaintext highlighter-rouge">Lean-ctx</code> (The Token Shredder)</li>
  <li><strong>Muzzle:</strong> <code class="language-plaintext highlighter-rouge">Caveman</code> mode (The Assistant Muzzle)</li>
</ol>

<p><strong>Deeper Thinking:</strong> Token saving isn’t just about money; it’s about <strong>Model IQ</strong>. The more “junk” tokens (logs, politeness, redundant imports) you feed an AI, the lower its effective reasoning becomes. <strong>A lean context is a smart context.</strong></p>]]></content><author><name></name></author><category term="Token" /><category term="Efficiency" /><category term="token" /><category term="efficiency" /><category term="context" /><category term="tax" /><summary type="html"><![CDATA[The 2026 Developer’s Guide to Token Efficiency Mastering the “Context Tax” in the Era of AI Agents]]></summary></entry><entry><title type="html">Incus vs. Docker: The Next-Generation Guide to System Containers</title><link href="https://joeywang.github.io//posts/incus-vs-docker/" rel="alternate" type="text/html" title="Incus vs. Docker: The Next-Generation Guide to System Containers" /><published>2026-04-06T14:00:00+00:00</published><updated>2026-04-06T14:00:00+00:00</updated><id>https://joeywang.github.io//posts/incus-vs-docker</id><content type="html" xml:base="https://joeywang.github.io//posts/incus-vs-docker/"><![CDATA[<h2 id="incus-vs-docker-the-next-generation-guide-to-system-containers">Incus vs. Docker: The Next-Generation Guide to System Containers</h2>

<p>In the world of containerization, <strong>Docker</strong> has long been the household name. However, for developers who need more than just a place to run a single process, <strong>Incus</strong> has emerged as the premier community-driven alternative.</p>

<p>While Docker focuses on <strong>Application Containers</strong> (packaging a single app), Incus focuses on <strong>System Containers</strong> (packaging a full Linux OS). Think of Incus as a way to create “instant Virtual Machines” that run at the speed of a container.</p>

<hr />

<h3 id="-key-differences-at-a-glance">🚀 Key Differences at a Glance</h3>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Feature</th>
      <th style="text-align: left">Docker</th>
      <th style="text-align: left">Incus</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Philosophy</strong></td>
      <td style="text-align: left">“One process per container”</td>
      <td style="text-align: left">“One full OS per container”</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Primary Use</strong></td>
      <td style="text-align: left">Microservices, CI/CD, Deployment</td>
      <td style="text-align: left">Development Labs, AI Sandboxing, VPS replacement</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Init System</strong></td>
      <td style="text-align: left">No (Usually just <code class="language-plaintext highlighter-rouge">entrypoint</code>)</td>
      <td style="text-align: left">Yes (<code class="language-plaintext highlighter-rouge">systemd</code>, <code class="language-plaintext highlighter-rouge">OpenRC</code> work natively)</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Security</strong></td>
      <td style="text-align: left">Process-level isolation</td>
      <td style="text-align: left">Unprivileged containers by default + VM support</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Persistence</strong></td>
      <td style="text-align: left">Volatile (requires Volumes/Bind mounts)</td>
      <td style="text-align: left">Persistent (acts like a physical disk)</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Hardware</strong></td>
      <td style="text-align: left">Hard to pass through GPUs/USB</td>
      <td style="text-align: left">Native, low-latency device passthrough</td>
    </tr>
  </tbody>
</table>

<hr />

<h3 id="️-command-comparison-speaking-the-language">⌨️ Command Comparison: Speaking the Language</h3>
<p>If you already know Docker, learning Incus is a matter of mapping your existing knowledge to a new set of verbs.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Action</th>
      <th style="text-align: left">Docker Command</th>
      <th style="text-align: left">Incus Command</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Start a container</strong></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">docker run -d --name web ubuntu</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">incus launch images:ubuntu/24.04 web</code></td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>List containers</strong></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">docker ps</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">incus list</code></td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Access shell</strong></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">docker exec -it web bash</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">incus shell web</code></td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Stop container</strong></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">docker stop web</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">incus stop web</code></td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Remove container</strong></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">docker rm -f web</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">incus delete -f web</code></td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Create Image</strong></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">docker commit web my-image</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">incus publish web --alias my-image</code></td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>View Logs</strong></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">docker logs web</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">incus info --show-log web</code></td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Copy Files</strong></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">docker cp file web:/path</code></td>
      <td style="text-align: left"><code class="language-plaintext highlighter-rouge">incus file push file web/path</code></td>
    </tr>
  </tbody>
</table>

<hr />

<h3 id="️-setting-up-your-incus-environment-ubuntu-2404">🛠️ Setting Up Your Incus Environment (Ubuntu 24.04+)</h3>

<p>Incus is now officially supported in the latest Ubuntu repositories, making installation a breeze.</p>

<h4 id="1-installation--init">1. Installation &amp; Init</h4>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="c"># Install the core packages</span>
<span class="nb">sudo </span>apt update <span class="o">&amp;&amp;</span> <span class="nb">sudo </span>apt <span class="nb">install</span> <span class="nt">-y</span> incus

<span class="c"># Add your user to the management group</span>
<span class="nb">sudo </span>usermod <span class="nt">-aG</span> incus-admin <span class="nv">$USER</span>
newgrp incus-admin

<span class="c"># Initialize the system (Interactive Wizard)</span>
incus admin init
</pre></td></tr></tbody></table></code></pre></div></div>
<p><em>Tip: During <code class="language-plaintext highlighter-rouge">init</code>, choosing <strong>ZFS</strong> or <strong>Btrfs</strong> for storage allows for near-instant snapshots.</em></p>

<h4 id="2-launching-your-first-dev-box">2. Launching your first “Dev Box”</h4>
<p>Unlike Docker Hub, Incus uses multiple “remotes.” The most common is the community-maintained <code class="language-plaintext highlighter-rouge">images:</code> server.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="c"># Launch a persistent Ubuntu 24.04 container</span>
incus launch images:ubuntu/24.04 dev-box

<span class="c"># Launch a MicroVM (for AI sandboxing or extra security)</span>
incus launch images:ubuntu/24.04 ai-box <span class="nt">--vm</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h3 id="-advanced-management-the-pro-workflow">🤖 Advanced Management: The “Pro” Workflow</h3>

<h4 id="using-profiles-for-automation">Using Profiles for Automation</h4>
<p>Instead of manual configuration, you can use <strong>Profiles</strong> to apply settings (like GPU access or mounted folders) to many containers at once.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="c"># Create a profile for Rails development</span>
incus profile create rails-dev

<span class="c"># Add a device to map your code folder from the host</span>
incus profile device add rails-dev my-code disk <span class="se">\</span>
    <span class="nb">source</span><span class="o">=</span>/home/user/projects/app <span class="se">\</span>
    <span class="nv">path</span><span class="o">=</span>/root/app

<span class="c"># Apply this profile to your container</span>
incus profile add dev-box rails-dev
</pre></td></tr></tbody></table></code></pre></div></div>

<h4 id="snapshotting-the-undo-button">Snapshotting (The “Undo” Button)</h4>
<p>This is where Incus shines over Docker for development. Before making a big change:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="c"># Create a snapshot</span>
incus snapshot create dev-box pre-upgrade

<span class="c"># Messed up? Restore instantly</span>
incus restore dev-box pre-upgrade
</pre></td></tr></tbody></table></code></pre></div></div>

<h4 id="running-docker-inside-incus">Running Docker inside Incus</h4>
<p>Yes, you can have the best of both worlds. To run Docker inside an Incus container (nesting):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre>incus config <span class="nb">set </span>dev-box security.nesting<span class="o">=</span><span class="nb">true
</span>incus restart dev-box
<span class="c"># Now install docker inside the dev-box as usual!</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h3 id="-conclusion">🎯 Conclusion</h3>
<p><strong>Use Docker</strong> when you have a finished app that you want to ship to the cloud.
<strong>Use Incus</strong> when you are <em>building</em> that app. It provides a stable, persistent, and high-performance environment that handles system services and hardware with ease—all while keeping your host machine clean and organized.</p>]]></content><author><name></name></author><category term="incus" /><category term="docker" /><summary type="html"><![CDATA[Incus vs. Docker: The Next-Generation Guide to System Containers]]></summary></entry><entry><title type="html">Mastering Redis HA, Shared Sessions, and Fault Tolerance</title><link href="https://joeywang.github.io//posts/rails-cache-queue-session/" rel="alternate" type="text/html" title="Mastering Redis HA, Shared Sessions, and Fault Tolerance" /><published>2026-04-05T14:00:00+00:00</published><updated>2026-04-05T14:00:00+00:00</updated><id>https://joeywang.github.io//posts/rails-cache-queue-session</id><content type="html" xml:base="https://joeywang.github.io//posts/rails-cache-queue-session/"><![CDATA[<h1 id="the-resilient-rails-stack-mastering-redis-ha-shared-sessions-and-fault-tolerance">The Resilient Rails Stack: Mastering Redis HA, Shared Sessions, and Fault Tolerance</h1>

<p>In a modern microservices or multi-app architecture, Redis is often the “glue” that holds everything together. It manages your user sessions, speeds up your app via caching, and handles background job orchestration.</p>

<p>However, many teams fall into the trap of the <strong>“Single Point of Failure”</strong>—using one Redis instance for everything. If that instance blips during a cloud provider node upgrade, your entire platform goes dark. Here is the blueprint for a “Bulletproof” Web Service.</p>

<hr />

<h2 id="1-the-strategy-isolation-via-the-triple-redis">1. The Strategy: Isolation via “The Triple-Redis”</h2>
<p>The most important best practice is <strong>Decoupling</strong>. You should split your Redis usage into three distinct functional groups (StatefulSets in Kubernetes).</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Group</th>
      <th style="text-align: left">Data Type</th>
      <th style="text-align: left">Priority</th>
      <th style="text-align: left">Failure Impact</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Session</strong></td>
      <td style="text-align: left">User IDs, CSRF tokens</td>
      <td style="text-align: left"><strong>Critical</strong></td>
      <td style="text-align: left">Users are logged out (Global outage)</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Cache</strong></td>
      <td style="text-align: left">HTML fragments, API results</td>
      <td style="text-align: left"><strong>Medium</strong></td>
      <td style="text-align: left">Site slows down (Degraded performance)</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Sidekiq</strong></td>
      <td style="text-align: left">Background Job Metadata</td>
      <td style="text-align: left"><strong>High</strong></td>
      <td style="text-align: left">Emails/Uploads stop (Data delay)</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="2-shared-sessions--optimal-durations">2. Shared Sessions &amp; Optimal Durations</h2>
<p>When sharing sessions across multiple apps (e.g., <code class="language-plaintext highlighter-rouge">dashboard.example.com</code> and <code class="language-plaintext highlighter-rouge">learn.example.com</code>), you must use a centralized Redis store so a user remains logged in as they move between subdomains.</p>

<h3 id="the-best-practice-duration">The Best Practice Duration</h3>
<p>For most SaaS or Educational platforms, <strong>2 to 4 hours</strong> is the “sweet spot.”</p>
<ul>
  <li><strong>Why?</strong> It covers a standard study or work session.</li>
  <li><strong>Security:</strong> Since you are using Redis (Server-side storage), you can instantly revoke a session if a device is stolen—something you can’t do with pure CookieStore.</li>
</ul>

<hr />

<h2 id="3-the-implementation-plan">3. The Implementation Plan</h2>

<h3 id="step-a-kubernetes-anti-affinity">Step A: Kubernetes Anti-Affinity</h3>
<p>To ensure GCP node upgrades don’t kill all your Redis replicas at once, use <strong>Pod Anti-Affinity</strong>. This forces Kubernetes to place your Redis pods on different physical nodes.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="c1"># Partial StatefulSet Spec</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">template</span><span class="pi">:</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="na">affinity</span><span class="pi">:</span>
        <span class="na">podAntiAffinity</span><span class="pi">:</span>
          <span class="na">requiredDuringSchedulingIgnoredDuringExecution</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="na">labelSelector</span><span class="pi">:</span>
              <span class="na">matchExpressions</span><span class="pi">:</span>
              <span class="pi">-</span> <span class="na">key</span><span class="pi">:</span> <span class="s">app</span>
                <span class="na">operator</span><span class="pi">:</span> <span class="s">In</span>
                <span class="na">values</span><span class="pi">:</span>
                <span class="pi">-</span> <span class="s">redis-session</span>
            <span class="na">topologyKey</span><span class="pi">:</span> <span class="s2">"</span><span class="s">kubernetes.io/hostname"</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="step-b-resilient-rails-configuration">Step B: Resilient Rails Configuration</h3>
<p>Don’t let a Redis connection error trigger a 500 page. Use error handlers to “fail soft.”</p>

<p><strong><code class="language-plaintext highlighter-rouge">config/environments/production.rb</code></strong></p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre><span class="c1"># 1. Resilient Cache</span>
<span class="n">config</span><span class="p">.</span><span class="nf">cache_store</span> <span class="o">=</span> <span class="ss">:redis_cache_store</span><span class="p">,</span> <span class="p">{</span>
  <span class="ss">url: </span><span class="no">ENV</span><span class="p">[</span><span class="s1">'REDIS_CACHE_URL'</span><span class="p">],</span>
  <span class="ss">connect_timeout: </span><span class="mi">1</span><span class="p">,</span>
  <span class="ss">read_timeout: </span><span class="mf">0.2</span><span class="p">,</span>
  <span class="ss">error_handler: </span><span class="o">-&gt;</span> <span class="p">(</span><span class="nb">method</span><span class="p">:,</span> <span class="n">returning</span><span class="p">:,</span> <span class="n">exception</span><span class="p">:)</span> <span class="p">{</span>
    <span class="no">Rails</span><span class="p">.</span><span class="nf">logger</span><span class="p">.</span><span class="nf">error</span> <span class="s2">"Redis Cache Down: </span><span class="si">#{</span><span class="n">exception</span><span class="p">.</span><span class="nf">message</span><span class="si">}</span><span class="s2">"</span>
    <span class="n">returning</span> <span class="c1"># Returns nil, forcing a DB fetch instead of a crash</span>
  <span class="p">}</span>
<span class="p">}</span>

<span class="c1"># 2. Shared Session Store</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">application</span><span class="p">.</span><span class="nf">config</span><span class="p">.</span><span class="nf">session_store</span> <span class="ss">:redis_store</span><span class="p">,</span>
  <span class="ss">servers: </span><span class="p">[</span><span class="no">ENV</span><span class="p">[</span><span class="s1">'REDIS_SESSION_URL'</span><span class="p">]],</span>
  <span class="ss">key: </span><span class="s1">'_shared_org_session'</span><span class="p">,</span>
  <span class="ss">domain: :all</span><span class="p">,</span> <span class="c1"># Allows subdomains to share the cookie</span>
  <span class="ss">expire_after: </span><span class="mi">4</span><span class="p">.</span><span class="nf">hours</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h3 id="step-c-sidekiq-safety-valve">Step C: Sidekiq “Safety Valve”</h3>
<p>If the Sidekiq Redis is down, we want to avoid crashing the web request when enqueuing a job.</p>

<p><strong><code class="language-plaintext highlighter-rouge">app/jobs/application_job.rb</code></strong></p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="k">class</span> <span class="nc">ApplicationJob</span> <span class="o">&lt;</span> <span class="no">ActiveJob</span><span class="o">::</span><span class="no">Base</span>
  <span class="c1"># Detect if Redis is alive; if not, run the job immediately (Inline)</span>
  <span class="nb">self</span><span class="p">.</span><span class="nf">queue_adapter</span> <span class="o">=</span> <span class="k">begin</span>
    <span class="no">Sidekiq</span><span class="p">.</span><span class="nf">redis</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:ping</span><span class="p">)</span>
    <span class="ss">:sidekiq</span>
  <span class="k">rescue</span> <span class="no">StandardError</span> <span class="o">=&gt;</span> <span class="n">e</span>
    <span class="no">Rails</span><span class="p">.</span><span class="nf">logger</span><span class="p">.</span><span class="nf">warn</span> <span class="s2">"Sidekiq Redis Unreachable: Falling back to :inline. </span><span class="si">#{</span><span class="n">e</span><span class="p">.</span><span class="nf">message</span><span class="si">}</span><span class="s2">"</span>
    <span class="ss">:inline</span>
  <span class="k">end</span>
<span class="k">end</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="4-auditing-memory-is-it-worth-it">4. Auditing Memory: Is it worth it?</h2>
<p>As your app grows, you need to know if your Redis cost is justified. Use this “Internal Audit” script to see exactly where your memory is going.</p>

<p><strong><code class="language-plaintext highlighter-rouge">redis_audit.sh</code></strong></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="rouge-code"><pre><span class="c">#!/bin/bash</span>
<span class="c"># Find the Top 5 memory-hogging keys in the current DB</span>
<span class="nb">echo</span> <span class="s2">"Scanning for top memory consumers..."</span>
redis-cli <span class="nt">--scan</span> | xargs <span class="nt">-I</span> <span class="o">{}</span> redis-cli MEMORY USAGE <span class="o">{}</span> | <span class="nb">paste</span> - - | <span class="nb">sort</span> <span class="nt">-nr</span> <span class="nt">-k</span> 2 | <span class="nb">head</span> <span class="nt">-n</span> 5 | <span class="nb">awk</span> <span class="s1">'{printf "  %s bytes\t%s\n", $2, $1}'</span>

<span class="c"># Summarize by data type</span>
redis-cli <span class="nt">--bigkeys</span> | <span class="nb">grep</span> <span class="nt">-E</span> <span class="s2">"summarized|payload"</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<hr />

<h2 id="5-conclusion-the-zero-downtime-mindset">5. Conclusion: The “Zero Downtime” Mindset</h2>
<p>By following this plan, you transform your infrastructure from a fragile house of cards into a resilient mesh:</p>
<ol>
  <li><strong>Isolation:</strong> A cache spike never logs a user out.</li>
  <li><strong>Redundancy:</strong> K8s Anti-Affinity protects you from Cloud Provider upgrades.</li>
  <li><strong>Graceful Degradation:</strong> If Redis fails, the code knows how to skip the cache and keep the student learning.</li>
</ol>

<p>This is the standard for high-performance, professional web services in 2026.</p>]]></content><author><name></name></author><category term="redis" /><summary type="html"><![CDATA[The Resilient Rails Stack: Mastering Redis HA, Shared Sessions, and Fault Tolerance]]></summary></entry></feed>