Don't trust your agent. Read the receipt. - GRAIsol Blog

Technology

Don't trust your agent. Read the receipt.

Most agent tools ask you to trust that the work got done. Agent AFK hands you the receipt — an append-only trace of every tool call, sub-agent, and decision it made while you were gone, replayable line by line.

You give an autonomous agent a task, walk away, and come back to a green checkmark: Done ✅.

Now what? You have a diff, maybe a passing test, and a paragraph of confident prose explaining what the agent says it did. If you're a senior engineer, that paragraph is worth approximately nothing. You've read enough postmortems to know that "I ran the tests and they passed" and "the tests passed" are different sentences. The agent that just edited eleven files and force-touched your build config is asking for the same thing every junior asks for on a Friday afternoon: trust me.

The honest answer to "did the work actually get done correctly?" for most agent tools is: open the diff and re-derive it yourself. The agent's own account of its run is a story it tells after the fact, reconstructed from a context window it has probably already compacted. There is no ground truth.

That's the gap Agent AFK is built around. Not "run an agent on a schedule" — everybody can do that now. The bet is narrower and, I think, more load-bearing: unattended work is only worth anything if it leaves evidence you can audit without trusting the thing that produced it.


Here's the receipt

Every Agent AFK session writes an append-only trace to ~/.afk/state/witness/<session>/trace.jsonl. You read it back with one command. Here's a real one — three sub-agents dispatched in parallel to read three files, then a verification step, then a clean exit:

$ afk trace show 7b5c3d16 --all

Trace  7b5c3d16-9cf4-4742-a4d4-147b3faeba8a
File   ~/.afk/state/witness/7b5c3d16-.../trace.jsonl
       sealed (succeeded) · 18 events · 4 tool calls (0 err) · 0 subagents · 0 claims

  15:49:10  phase    session_init_start
  15:49:10  phase    session_init_done   1.6s
  15:49:10  phase    loop_start
  15:49:12  phase    model_ttfb          1.5s
  15:49:18  tool     agent      started          ┐
  15:49:22  tool     agent      started          ├ three sub-agents, one wave
  15:49:22  tool     agent      started          ┘
  15:49:28  tool     agent      ok   9.6s   417 B ┐
  15:49:28  tool     agent      ok   9.6s   442 B ├ all three return together
  15:49:28  tool     agent      ok   9.6s   463 B ┘
  15:49:31  tool     read_file  ok   1ms    822 B   ← verifying a cited line
  15:49:42  phase    loop_end            32.1s
  15:49:42  closure  model_end_turn   turns=1
  15:49:42  SEALED   succeeded   turns=1   (closed 15:49:42.972Z)

You don't have to take the agent's word for what it did. The fan-out is in the timestamps: three agent dispatches between 15:49:18 and 15:49:22, all completing at 15:49:28. The verification is right there — a read_file the agent issued to re-check one line a sub-agent had cited. The run ended on its own terms (model_end_turn, not an abort or a crash) and the trace is SEALED succeeded.
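To make "the fan-out is in the timestamps" concrete, here is a small sketch that checks whether the three agent calls were in flight at the same moment. It uses only the clock times printed in the timeline above; the helper name shared_window_seconds is made up for illustration. The timeline doesn't say which start pairs with which completion, but since all three complete at 15:49:28 the pairing doesn't change the answer.

```python
from datetime import datetime

# Start and completion times of the three `agent` tool calls,
# copied from the timeline above (same day, so only the clock time matters).
calls = [
    ("15:49:18", "15:49:28"),
    ("15:49:22", "15:49:28"),
    ("15:49:22", "15:49:28"),
]


def parse(clock: str) -> datetime:
    return datetime.strptime(clock, "%H:%M:%S")


def shared_window_seconds(intervals) -> float:
    """Seconds during which every interval is open at once (0.0 if never)."""
    latest_start = max(parse(start) for start, _ in intervals)
    earliest_end = min(parse(end) for _, end in intervals)
    return max(0.0, (earliest_end - latest_start).total_seconds())


print(shared_window_seconds(calls))  # 6.0
```

A non-zero window means the calls overlapped; a sequential run would give 0.0.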

This is not a log file someone remembered to add. It's the durable record the runtime produces as a side effect of running at all.


What you're actually looking at

The trace is newline-delimited JSON. Every record is the same shape:

{"ts":"2026-06-11T15:49:42.972Z","seq":17,"kind":"session_sealed",
 "payload":{"status":"succeeded","finalTurnCount":1,"closedAt":"2026-06-11T15:49:42.972Z"}}

Four fields, no surprises. seq is a monotonic counter owned by the writer, so the ordering is total and gaps are detectable: you can't lose an event from the middle of the stream without leaving a hole in the sequence. A counter alone catches dropped records, not deliberate rewriting. kind is a discriminated union: session_phase, tool_call (emitted twice, on start and completion), hook_decision, closure, session_sealed, and a handful more. payload is typed per kind.
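The gap check is easy to do yourself. The sketch below assumes seq increases by exactly one per record, which is how I read "monotonic counter" here; the function name find_seq_gaps and the sample records (including their timestamps and empty payloads) are invented for illustration and are not real trace output.

```python
import json


def find_seq_gaps(lines):
    """Return (expected, found) pairs wherever seq does not increase by exactly 1."""
    gaps = []
    expected = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        seq = json.loads(line)["seq"]
        if expected is not None and seq != expected:
            gaps.append((expected, seq))
        expected = seq + 1
    return gaps


# Illustrative records only: seq 2 has been dropped on purpose.
sample = [
    '{"ts":"2026-06-11T15:49:10.000Z","seq":0,"kind":"session_phase","payload":{}}',
    '{"ts":"2026-06-11T15:49:10.100Z","seq":1,"kind":"session_phase","payload":{}}',
    '{"ts":"2026-06-11T15:49:12.000Z","seq":3,"kind":"tool_call","payload":{}}',
]

print(find_seq_gaps(sample))  # [(2, 3)]
```

An empty list means the sequence is contiguous. It says nothing about whether records were edited in place.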

Three commands cover the workflow:

afk trace list                 # every session that has a trace, newest first
afk trace show <id|latest>     # the human-readable timeline above
afk trace show <id> --json     # the raw NDJSON, for piping to jq

The last one matters more than it looks. The pretty printer is a convenience; the trace is the source of truth, and it's just JSON on your disk. You can jq it, diff two runs, feed it to another tool, or grep six weeks of unattended runs for every session that hit a tool error. Nothing is locked behind a dashboard or a vendor's retention policy.
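If jq isn't your tool of choice, the same stream is a few lines of Python away. This sketch tallies records by kind and relies only on the kind field described above; count_kinds and the sample lines are hypothetical names and made-up data. Because tool_call is emitted on both start and completion, the tool_call tally is the number of records, not the number of calls.

```python
import json
from collections import Counter


def count_kinds(lines):
    """Tally NDJSON trace records by their `kind` field, skipping blank lines."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[json.loads(line)["kind"]] += 1
    return counts


# Illustrative records only; real payloads carry more than this.
sample = [
    '{"seq":0,"kind":"session_phase","payload":{}}',
    '{"seq":1,"kind":"tool_call","payload":{}}',
    '{"seq":2,"kind":"tool_call","payload":{}}',
    '{"seq":3,"kind":"closure","payload":{}}',
    '{"seq":4,"kind":"session_sealed","payload":{}}',
]

print(count_kinds(sample)["tool_call"])  # 2
```

To run it against a real trace, pass sys.stdin (or an open trace.jsonl file) to count_kinds and pipe in the output of afk trace show <id> --json.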

The "sealed" part is a real durability contract, not a status string. When a session ends cleanly, the writer appends a terminal session_sealed record and fsyncs the file before returning, so the seal is on disk before the session-end hooks fire. A trace without that record belongs to a session that is either still running or died before a clean teardown. sealed (succeeded) versus unsealed (live or crashed) tells you whether you're looking at a complete receipt or the scene of an accident. For unattended work, "it crashed and here's exactly how far it got" is the second-most useful thing a system can tell you.
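Here is a minimal sketch of that contract from both sides: append the seal and force it to disk, then read a trace back and decide whether it is sealed. It is not Agent AFK's writer; append_sealed and seal_status are hypothetical helpers, and the only record fields used are the ones shown in the session_sealed example earlier.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_sealed(path, seq, status, final_turn_count):
    """Append a terminal session_sealed record and force it to disk."""
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    now = now.replace("+00:00", "Z")
    record = {
        "ts": now,
        "seq": seq,
        "kind": "session_sealed",
        "payload": {"status": status, "finalTurnCount": final_turn_count, "closedAt": now},
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push it to the disk


def seal_status(path):
    """Return the seal's status if the last record is session_sealed, else None."""
    last = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                last = line
    if last is None:
        return None
    try:
        record = json.loads(last)
    except json.JSONDecodeError:
        return None  # torn final line: the writer died mid-append
    if record.get("kind") != "session_sealed":
        return None
    return record["payload"]["status"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trace.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        # Illustrative record, not real trace output.
        f.write('{"ts":"2026-06-11T15:49:10.000Z","seq":0,"kind":"session_phase","payload":{}}\n')
    print(seal_status(path))  # None
    append_sealed(path, seq=1, status="succeeded", final_turn_count=1)
    print(seal_status(path))  # succeeded
```

The flush before fsync matters: fsync only covers what the OS has already been handed, so data still sitting in Python's buffer would not be made durable.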


"Everybody has cron now"

Yes. Scheduling an agent is table stakes in 2026, and if that were the pitch I wouldn't be writing this. Running unattended is the easy half. The hard half is the part that makes unattended work usable: when you weren't watching, what happened?

Observability is the thing that's hard to add later and hard to copy, because it can't be a feature bolted onto the side — it has to be how the runtime is built. If the trace is an afterthought, it's incomplete exactly when you need it (the crash, the silent retry loop, the sub-agent that quietly failed). The reason Agent AFK's receipt is trustworthy is that the agent doesn't write it. The runtime does, underneath, whether the model cooperates or not.


Why the receipt is trustworthy

The trace isn't a subsystem off to the side. It's the seam every other subsystem is sewn through:

  • The tool dispatcher emits a tool_call on start and completion for every single tool — bash, file edits, web scrapes, sub-agent dispatch. Duration and output size come for free. That's the body of the receipt.
  • The hook registry emits hook_decision. When a PreToolUse hook blocks a dangerous call, that block is written to the trace — so you can audit not just what ran, but what was prevented from running.
  • Sub-agent orchestration — both the agent tool and the compose DAG executor — shows up as tool calls with real wall-clock durations. The parallel wave above is one example; a compose graph of research → implement → verify nodes is another. Delegation is visible, not a black box.
  • The session loop brackets everything with loop_start, model_ttfb, loop_end, a closure carrying the terminal reason and token counts, and the session_sealed durability record.
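The shape of that design can be sketched in a few dozen lines. This is an illustration of the idea, not Agent AFK's code: TraceWriter, dispatch, block_rm, and every payload field name (tool, phase, durationMs, bytes, decision, reason) are assumptions of mine. The point it shows is that the dispatcher, not the tool or the model, writes the events, so a call cannot run without leaving a start and a completion record.

```python
import json
import time
from datetime import datetime, timezone


class TraceWriter:
    """In-memory stand-in for the on-disk trace: one JSON line per event."""

    def __init__(self):
        self.lines = []
        self._seq = 0

    def emit(self, kind, payload):
        record = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "seq": self._seq,
            "kind": kind,
            "payload": payload,
        }
        self._seq += 1
        self.lines.append(json.dumps(record))


def dispatch(trace, hooks, tool_name, tool_fn, *args):
    """Run one tool call, recording hook blocks and start/completion events."""
    for hook in hooks:
        reason = hook(tool_name, args)  # hypothetical hook: reason string to block, else None
        if reason is not None:
            trace.emit("hook_decision", {"tool": tool_name, "decision": "block", "reason": reason})
            return None  # blocked: the tool never runs
    trace.emit("tool_call", {"tool": tool_name, "phase": "started"})
    started = time.monotonic()
    try:
        output = tool_fn(*args)
    except Exception as exc:
        elapsed_ms = round((time.monotonic() - started) * 1000)
        trace.emit("tool_call", {"tool": tool_name, "phase": "error",
                                 "durationMs": elapsed_ms, "error": str(exc)})
        raise
    elapsed_ms = round((time.monotonic() - started) * 1000)
    trace.emit("tool_call", {"tool": tool_name, "phase": "ok", "durationMs": elapsed_ms,
                             "bytes": len(str(output).encode("utf-8"))})
    return output


def block_rm(tool_name, args):
    """Example PreToolUse-style hook: refuse recursive deletes in bash."""
    if tool_name == "bash" and "rm -rf" in args[0]:
        return "destructive command"
    return None


trace = TraceWriter()
dispatch(trace, [block_rm], "read_file", lambda path: "line one\n", "README.md")
blocked = dispatch(trace, [block_rm], "bash", lambda cmd: "", "rm -rf build")
print([json.loads(line)["kind"] for line in trace.lines])
# ['tool_call', 'tool_call', 'hook_decision']
```

The read_file call leaves two tool_call records and the blocked bash call leaves a single hook_decision, which is the "what was prevented from running" half of the audit.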

The skills — /review, /diagnose, /mint, and the rest — are composable workflows defined in plain prompt files. They don't have privileged logging; they orchestrate the same primitives, so a /review run leaves the same kind of receipt as anything else. And the system prompt is doing real work here too: it's the reason a session ends in an explicit terminal state and re-reads a line to verify a sub-agent's claim instead of trusting it. The prompting produces the behavior; the architecture records it. The receipt is just where the two meet and become inspectable.


The honest caveats

If I'm asking you to trust a receipt, I owe you its rough edges. Reading real traces, here's what's not polished yet:

  • The summary counters undercount. That run dispatched three sub-agents, but the header says 0 subagents. The counter only tallies a narrower kind of forked session, not the dispatches the agent and compose tools emit, so it reads low. The timeline is accurate (the three agent lines are right there), but don't trust the top-line tally yet.
  • Cost shows $0.0000. These runs went through subscription-based auth, which doesn't meter per-call cost, so the trace records zero. On a metered API key you get a real dollar figure in the closure payload; on a subscription you don't. The field is honest about what it can see.
  • claims is sparse. The agent verified a claim in this run (the read_file), and said so in its output, but the structured claims counter stayed 0 — formal claim events aren't emitted everywhere yet. The verification is visible as a tool call; it's just not tallied.

None of these are load-bearing for the core promise — the event stream itself is complete and durable — but they're the difference between "the receipt is trustworthy" and "the receipt is polished," and only the first one is true today.


Try it

It's npx-able and gate-free — no signup, no account, runs locally against your own key or subscription:

npm i -g agent-afk

afk chat -m sonnet "summarize what this repo's README covers, then verify one claim by re-reading the cited line"

afk trace show latest --all

Read your own receipt. That's the whole pitch: you shouldn't have to trust this post either.


Agent AFK is open source (Apache-2.0) at github.com/griffinwork40/agent-afk. It runs Claude, GPT, and local models; the trace works the same regardless. More at agentafk.com.