Tiny Puzzles for Testing and Debugging AI Agents
A practical guide to building small probe tasks to test function calling, skill awareness, prompt length, cache behavior, and runtime debugging for local AI agents.
One of the smallest tests I use is also one of the dumbest:
- send `hello`
- send `hello again`
I am not doing that because I think it is a meaningful intelligence test.
I am doing it because I want to see whether the stack behaves differently on the second request. If I expect prompt caching or KV reuse, I want some sign of it in latency, logs, or runtime warnings. If nothing changes, or LM Studio logs complain about cache support, that already tells me something useful.
That tiny test changed how I think about debugging agents.
One thing I keep coming back to with local agents is this: if you only test them on real work, debugging takes forever.
A big task fails and you do not know why.
Was function calling broken? Did the model ignore the skill instructions? Did the prompt get too long? Did the agent runtime format messages wrong? Did LM Studio drop some warning you only notice after the fact?
When the task is large, every failure mode gets mixed together. That makes the whole setup feel mystical when it usually is not.
What helped me was building tiny puzzles.
By “puzzle,” I do not mean benchmark games or synthetic IQ tests. I mean small, repeatable probe tasks that isolate one capability at a time. A good probe gives you a yes-or-no answer about one layer of the stack.
That is much more useful than asking a vague question and hoping the agent “feels smart.”
Probe, not benchmark
This distinction matters.
A benchmark tries to rank systems.
A probe tries to isolate a failure.
A benchmark asks:
Which model is better?
A probe asks:
What broke?
For debugging, I care much more about the second question.
What I actually want from a probe
A good agent-debugging probe should be:
- small enough to run in seconds
- repeatable enough to compare across models or prompt changes
- narrow enough to isolate one failure mode
- cheap enough that I can run it many times
- observable enough that I can inspect logs and understand what happened
If the task is too broad, you learn almost nothing.
That is why I like tests like:
- “call exactly one tool”
- “use the right skill when the trigger phrase appears”
- “survive a prompt that is 3x longer than the easy case”
- “repeat the same prompt twice and see whether cache behavior changes”
Those are not glamorous. They are useful.
The basic idea: build a probe suite, not one giant eval
I would break agent testing into five buckets:
- function calling support
- skill awareness
- prompt-length and context pressure
- debugging by layer: agent runtime vs model runtime
- metrics, logs, and traces
You can think of this as a cheap local eval harness for yourself.
Here is the compact version:
| Probe | What it tests | Common failure |
|---|---|---|
| single-tool sanity check | basic function calling | model answers without using the tool |
| tool-loop continuation | post-tool reasoning | runtime stops after the tool result |
| skill trigger check | skill awareness | skill never enters context or gets ignored |
| buried instruction retention | prompt pressure | instruction gets lost in long context |
| hello / hello again | cache or runtime behavior | no reuse signal, warning in logs |
1. Test function calling support first
This is the first thing I test because if it is broken, everything above it becomes noise.
The goal here is not “can the model describe tool use?” The goal is “can the model produce the exact structured action the runtime needs, and can the runtime round-trip the result correctly?”
The simplest probe
Give the agent one tool and one obvious task.
For example:
You have one tool: `read_file(path)`. Read `README.md` and tell me the first sentence.
Why this works:
- only one tool is available
- the right action is obvious
- the result is easy to verify
- failure is easy to classify
Possible outcomes:
- it calls the tool correctly: good
- it answers without the tool: prompt or runtime issue
- it emits malformed tool JSON: model formatting issue
- it calls the wrong tool: tool selection issue
- it calls the tool but ignores the result: loop or message formatting issue
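The outcome list above can be turned into a tiny classifier. This is a sketch under one loud assumption: that your runtime surfaces a raw tool call as a JSON object with `"tool"` and `"args"` keys. Real harnesses differ, so adapt the parsing to whatever your stack actually emits.

```python
import json

def classify_tool_output(raw, allowed_tools):
    """Map one raw model turn to a coarse failure label.
    Assumes tool calls arrive as JSON with "tool" and "args" keys
    (a hypothetical shape; adjust to your harness's real format)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "no_tool_call"            # plain text: answered without the tool
    if not isinstance(call, dict) or "tool" not in call:
        return "malformed_tool_json"     # structured, but not a usable call
    if call["tool"] not in allowed_tools:
        return "wrong_tool"              # tool selection issue
    return "ok"
```

Running this over a batch of probe outputs gives you counts per failure label instead of a vague sense that "tool calling is flaky."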
Slightly better function-call probes
Once the trivial case passes, I like a short ladder:
- one tool, one obvious call
- two tools, only one correct choice
- two sequential tool calls
- read, then patch, then verify
- multiple independent reads in parallel
That ladder tells you where the system starts to break.
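One way to make the ladder concrete is to write it down as data and record, per run, which rung is the first to fail. The rung names, tool names, and expected call counts below are illustrative, not a fixed standard:

```python
# Each rung of the function-call ladder, with the number of tool calls
# a correct run should produce. Tool names are illustrative.
FUNCTION_CALL_LADDER = [
    {"name": "one_tool_obvious",    "tools": ["read_file"],               "expected_calls": 1},
    {"name": "two_tools_one_right", "tools": ["read_file", "write_file"], "expected_calls": 1},
    {"name": "two_sequential",      "tools": ["read_file"],               "expected_calls": 2},
    {"name": "read_patch_verify",   "tools": ["read_file", "write_file"], "expected_calls": 3},
    {"name": "parallel_reads",      "tools": ["read_file"],               "expected_calls": 3},
]

def first_failing_rung(results):
    """Given {rung_name: passed}, return the first failing rung,
    or None if the whole ladder passed."""
    for rung in FUNCTION_CALL_LADDER:
        if not results.get(rung["name"], False):
            return rung["name"]
    return None
```

Tracking the first failing rung across models or config changes is a compact way to see whether a change moved the breaking point up or down.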
What I look for in logs
At this stage, I want to see:
- the full prompt sent by the agent
- the tool schema exposed to the model
- the raw model output before the runtime sanitizes it
- the parsed tool call
- the returned tool result
- the next model turn after the tool result
If any one of those is missing, debugging gets much harder.
2. Test skill awareness separately
This one is easy to confuse with general intelligence.
If you use skills, rulesets, or task-specific system instructions, you should test whether the agent notices and follows them before you test complicated work.
The puzzle here is not “is the output good?” The puzzle is “did the agent notice the right behavior cue?”
A good skill-awareness probe
Create two nearly identical tasks where only one should trigger the skill.
For example:
- Task A: “Summarize this file.”
- Task B: “Humanize this article.”
If the humanizer skill is available and Task B does not activate the expected behavior, that is not a generic writing failure. That is a skill-selection failure.
Another useful pattern
Make the skill instruction produce a visible artifact.
Examples:
- a required heading
- a required command pattern
- a required output shape
- a required refusal to do some prohibited action
That way you can verify whether the skill was active without guessing.
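Checking for those artifacts is just substring matching. Here is a minimal sketch; the marker strings are made-up examples, since the real ones come from your own skill instructions:

```python
def check_skill_artifacts(output, artifacts):
    """Return the labels of required artifacts missing from the output.
    `artifacts` maps a label to the required substring (hypothetical
    markers; use whatever your skill instructions actually mandate)."""
    return [label for label, marker in artifacts.items() if marker not in output]
```

An empty return list means every required artifact appeared, which is the closest thing to proof that the skill was active.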
Failure modes to watch
- the skill never enters context
- the skill enters context but gets ignored
- the skill conflicts with stronger system instructions
- the skill works on short prompts but disappears on long ones
- the skill triggers, but only partially
That last one is common. The agent acts like it vaguely remembers the rule but not enough to follow it cleanly.
3. Test prompt length and context pressure on purpose
This is where local setups get weird fast.
A model can look fine on a short prompt and then fall apart once you add:
- a large system prompt
- a skill catalog
- tool schemas
- a long file
- a diff
- previous turns
That is not a minor edge case. That is the normal shape of agent work.
So I like to test prompt pressure deliberately instead of waiting for a real task to reveal it.
The ladder I use
Run the same task at different prompt sizes:
- tiny prompt
- normal prompt
- long prompt
- long prompt plus previous turns
- long prompt plus tools plus previous turns
The task itself should stay almost the same. Only the context load changes.
That helps answer a very useful question:
Is the model bad at the task, or is the system bad under context pressure?
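Generating the ladder is mechanical: keep the task fixed and prepend increasing amounts of filler. A minimal sketch, where the rung sizes are arbitrary and should be tuned to your model's context window:

```python
def build_ladder(task, filler_paragraph, sizes=(0, 5, 20, 80)):
    """Produce the same task under increasing context load by
    prepending N copies of a filler paragraph for each rung."""
    return [("\n\n".join([filler_paragraph] * n) + "\n\n" + task).strip()
            for n in sizes]
```

Because only the padding varies, any quality drop across rungs points at context pressure, not at the task itself.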
A cheap KV-cache probe
One of my favorite tiny tests is embarrassingly simple:
- send `hello`
- send `hello again`
I am not using that to test intelligence. I am using it to test cache behavior and runtime traces.
If the stack supports prompt caching or KV reuse, I want to see evidence in latency, logs, or runtime counters. If I expect cache reuse and nothing changes, that is already a clue.
This is where I often check LM Studio logs. I want to see:
- whether the requests actually reached the model runtime
- whether there are warnings about cache support
- whether the second request behaves differently
- whether the runtime reports any unsupported cache path
That tiny test is not sufficient, but it is cheap and surprisingly revealing.
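The two-request smoke test is easy to script. This sketch assumes nothing about the backend: `send` is a hypothetical stand-in for whatever client call hits your runtime, and the latency comparison is only a heuristic signal, not proof of cache reuse.

```python
import time

def cache_smoke_test(send, prompts=("hello", "hello again")):
    """Send each prompt and record wall-clock latency. With working
    prompt caching or KV reuse, the second request is usually faster;
    `send` is any callable that performs one request against your backend."""
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        send(prompt)
        timings.append(time.perf_counter() - start)
    return timings
```

Pair the timings with the runtime logs: a faster second request plus no cache warnings is a good sign; identical timings plus a cache warning tells you where to look next.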
Another prompt-length probe
Ask the model to follow one simple instruction buried at the very end of a long prompt.
For example:
After reading everything above, answer with exactly:
CACHE_OK
If it misses that reliably only when the prompt gets large, you have learned something real about instruction retention under load.
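Both halves of that probe are easy to automate: building the long prompt with the instruction buried at the end, and checking the reply strictly. A minimal sketch:

```python
def buried_instruction_prompt(filler, n_paragraphs, token="CACHE_OK"):
    """Build a long prompt whose only real instruction sits at the end."""
    body = "\n\n".join(filler for _ in range(n_paragraphs))
    return f"{body}\n\nAfter reading everything above, answer with exactly:\n{token}"

def retained(reply, token="CACHE_OK"):
    """Strict check: the model should reply with the token and nothing else."""
    return reply.strip() == token
```

The strict equality matters: a model that replies "Sure! CACHE_OK" has partially lost the instruction, and a loose substring check would hide that.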
Control one variable at a time
This sounds obvious, but I have wasted plenty of time ignoring it.
If you change the model, the prompt, the agent settings, and the runtime configuration all at once, a bad result tells you almost nothing.
So when I run these probes, I try to be strict:
- keep the model fixed when testing agent changes
- keep the agent fixed when testing runtime changes
- keep the task fixed when testing prompt length
- keep the prompt fixed when testing cache behavior
Half of debugging is refusing to create your own confusion.
4. Debug by layer: agent problem or model problem?
This is the distinction that saves the most time.
When an agent fails, I try to ask:
- Did the agent send the right prompt?
- Did the model return something usable?
- Did the runtime parse it correctly?
- Did the tool result get fed back in the right shape?
- Did the loop stop too early or continue too long?
If you do not separate those layers, you end up rewriting prompts when the real problem is message formatting, or blaming the model when the real problem is unsupported KV cache in the backend.
My rough debugging stack
I usually debug in this order:
Layer 1: agent client
This is Pi, OpenCode, OpenHands, or whatever harness you are using.
Questions:
- What exact system prompt did it send?
- What tools did it expose?
- Did it strip or rewrite any message roles?
- Did it truncate context?
- Did it retry without telling me?
- Did it sanitize model output before I saw it?
This layer is responsible for a lot more weirdness than people expect.
Layer 2: transport / API compatibility
Questions:
- Did the request format match what the backend expects?
- Are tool calls encoded the way the runtime expects?
- Are reasoning channels or special tokens being passed through cleanly?
- Is the compatibility mode wrong for this backend?
This layer is boring until it is broken. Then everything looks cursed.
Layer 3: model runtime
For a local stack, this is where LM Studio, Ollama, MLX, llama.cpp, or another runtime starts to matter.
Questions:
- Did the model start generating?
- Did it stall at 0%?
- Did it leak think tags?
- Did it warn about unsupported KV caching?
- Did it hit memory pressure?
- Did it silently fall back to a degraded path?
This is why I check LM Studio logs so often. If the runtime is complaining, I want to know before I waste an hour blaming the prompt.
Layer 4: model behavior
Only after the first three layers look healthy do I spend much time blaming the model itself.
Questions:
- Did the model choose the wrong tool?
- Did it ignore a strong instruction?
- Did it lose track of prior context?
- Did it produce malformed structured output?
Sometimes the answer really is “the model is weak at this.” But that should be the last conclusion, not the first.
5. Metrics, logs, and traces: what I actually want
If you are serious about debugging agents, text output alone is not enough.
You want a trace.
At minimum, I would want:
- request timestamp
- model name
- prompt token count or prompt length estimate
- completion length
- latency to first token
- total latency
- tool calls emitted
- tool-call parse failures
- retries
- stop reason
- cache hit or cache reuse signals if available
- warnings from the model runtime
That is the minimum useful set.
Even a fake trace like this is better than nothing:
request_id=42
agent=opencode
model=gemma-4
prompt_tokens=4187
completion_tokens=96
ttft_ms=842
total_latency_ms=2310
tool_calls=1
tool_parse_ok=true
tool_roundtrip_ok=true
cache_reuse=false
stop_reason=tool_call
warning="RotatingKVCache Quantization NYI"
If I can capture something shaped like that for each probe, I can usually stop guessing and start comparing.
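Traces in that key=value shape are trivial to parse back into structured records for comparison. A small sketch, assuming one `key=value` pair per line with optional double quotes around string values:

```python
def parse_trace(text):
    """Parse key=value trace lines into a dict, coercing ints and
    booleans so downstream comparisons do not fight over strings."""
    record = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        value = value.strip().strip('"')
        if value in ("true", "false"):
            record[key] = value == "true"
        elif value.isdigit():
            record[key] = int(value)
        else:
            record[key] = value
    return record
```

Once every probe run is a dict, "comparing runs" becomes dict comparison instead of eyeballing log files.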
The most useful derived metrics
Once you have the basics, the metrics I care about most are:
- tool-call success rate: how often the raw output becomes a valid tool call
- tool round-trip success rate: how often the model continues correctly after getting a tool result
- instruction retention under load: does the model still obey the obvious rule in long prompts
- latency by prompt size: when does the system start to bend
- failure clustering: are most failures coming from one layer
That last one matters because random-seeming instability often is not random at all. It clusters around one part of the stack.
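Given a list of trace records, the first two derived rates fall out in a few lines. This sketch assumes each record carries `tool_calls`, `tool_parse_ok`, and `tool_roundtrip_ok` fields like the fake trace above:

```python
def derived_metrics(traces):
    """Compute tool-call and round-trip success rates over the runs
    that attempted at least one tool call."""
    with_tools = [t for t in traces if t.get("tool_calls", 0) > 0]

    def rate(key):
        if not with_tools:
            return None
        return sum(1 for t in with_tools if t.get(key)) / len(with_tools)

    return {
        "tool_call_success": rate("tool_parse_ok"),
        "tool_roundtrip_success": rate("tool_roundtrip_ok"),
    }
```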
A practical probe suite I would keep around
If I were setting this up from scratch, I would keep a small set of named probes.
Probe 1: single-tool sanity check
Goal: verify basic tool calling.
Prompt:
Use the available file-read tool to read `README.md` and return only its first sentence.
Probe 2: tool-loop continuation
Goal: verify that the agent continues after a tool result instead of narrating.
Prompt:
Read `README.md`, then tell me how many top-level sections it has.
This forces a read, then a second reasoning step based on the returned content.
Probe 3: skill trigger check
Goal: verify that the right skill or ruleset activates.
Prompt:
Humanize the following draft.
Expected result: visible behavior that clearly matches the skill instructions.
Probe 4: buried instruction retention
Goal: check prompt-pressure behavior.
Prompt:
[long filler context] Final instruction: reply with exactly
CACHE_OK
Probe 5: cache reuse smoke test
Goal: see whether repeated prompts behave differently.
Prompt sequence:
- `hello`
- `hello again`
This is not deep, but it is cheap. For local debugging, cheap matters.
Probe 6: runtime warning trap
Goal: catch backend-specific failures early.
Procedure:
- run one short prompt
- run one long prompt
- inspect LM Studio logs
I want to catch things like:
- unsupported KV cache paths
- prompt stalls
- memory pressure warnings
- reasoning-tag leakage
Probe 7: patch-and-verify microtask
Goal: test the full coding loop in miniature.
Prompt:
Read `foo.txt`, replace `cat` with `dog`, then verify the file changed.
This sounds silly, but it exercises the real sequence:
- read
- patch
- observe
- continue
That makes it more valuable than a pure text-generation test.
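It also helps to have a ground-truth version of the microtask you can run directly, so the agent's tool trace has something to be checked against. A minimal sketch of the read-patch-observe sequence:

```python
import pathlib

def patch_and_verify(path, old, new):
    """Read a file, apply a substring replacement, and re-read to
    confirm the change landed: the same sequence the agent should drive
    through its tools."""
    p = pathlib.Path(path)
    before = p.read_text()
    p.write_text(before.replace(old, new))
    after = p.read_text()
    return old not in after and new in after
```

If this returns True when run directly but the agent's version of the task fails, the problem is in the tool loop, not in the file or the task.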
Do not over-trust a passing probe
This is the other trap.
A probe can pass while the real workflow is still broken.
Examples:
- the single-tool test passes, but tool use collapses under long prompts
- the skill trigger works alone, but disappears once the tool manifest gets large
- the cache smoke test looks fine on tiny prompts, but session reuse breaks on real workloads
So I do not use probes to declare victory. I use them to narrow the search space.
Keep a failure diary
I would also keep a tiny log for myself.
Nothing fancy. Just enough to compare runs without relying on memory:
- probe name
- model
- agent or harness
- prompt size
- result
- runtime warnings
- note about what changed since the last run
This matters because agent debugging gets fuzzy very quickly. A failure diary turns “I think it got worse after that config change” into something you can actually verify.
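The diary can literally be one append-only JSONL file. This sketch is one possible shape, with the field names taken from the list above:

```python
import json
import time

def log_run(path, probe, model, agent, prompt_tokens, result, warnings="", note=""):
    """Append one probe run to a JSONL failure diary."""
    record = {
        "ts": time.time(),
        "probe": probe,
        "model": model,
        "agent": agent,
        "prompt_tokens": prompt_tokens,
        "result": result,
        "warnings": warnings,
        "note": note,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

JSONL keeps each run independent, so a crashed session never corrupts earlier entries, and `grep`-ing by probe name is enough to see a trend.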
What makes a puzzle good?
The best puzzles are not the most clever ones. They are the ones that make failure obvious.
Bad puzzle:
“Review this medium-sized repository and suggest improvements.”
If it fails, you learn almost nothing.
Better puzzle:
“You have one tool. Read one file. Return one sentence.”
If it fails, you know where to look.
That is the mindset shift.
Do not start by asking whether the agent is smart. Start by asking whether one layer of the system is behaving.
The point is not to win the puzzle
This is worth saying because people drift into benchmark brain very easily.
The goal of these puzzles is not to prove that one model is superior in the abstract. The goal is to make the stack debuggable.
That means:
- isolate one variable
- keep the task cheap
- inspect the logs
- compare runs after each config change
- only then move back to real work
I still care about real tasks in the end. Of course I do. But I do not trust a real-task result very much if I do not have a clean probe suite underneath it.
That is especially true for local agent setups, where a lot of failures come from runtime behavior, prompt overhead, API compatibility, or cache paths rather than from some deep model limitation.
What I would do first in a local LM Studio setup
If I were debugging Pi or OpenCode against LM Studio from scratch, I would do this in order:
- Run a one-line plain text prompt and confirm the backend is alive.
- Run a one-tool puzzle and confirm tool calling works.
- Run the same task with a longer prompt and compare latency and output quality.
- Run the `hello` / `hello again` cache smoke test.
- Inspect LM Studio logs after both short and long prompts.
- Only then try a real coding task.
That order is boring. It also saves time.
Final takeaway
When an agent fails, I do not want a mystery. I want a small broken puzzle.
That is what lets me debug one layer at a time:
- tool calling
- skill awareness
- prompt pressure
- runtime health
- logs and metrics
Real work is too expensive to be your only test harness.
Tiny puzzles are cheaper. And for debugging, cheaper usually wins.