A contract, an audit, and a humbling test run: Plumbline hits v1.0

A while back I shared Plumbline -- a spec-driven workflow of AI-agent skills that build code in a trackable, accountable way, coordinating entirely through files instead of calling each other. The pitch was "trust, but verify": make the filesystem the contract, so you can actually inspect the contract.

It just hit v1.0, and getting there taught me something I didn't see coming. Quick re-intro for anyone new, then what changed -- including the part where my own framework embarrassed me (on purpose).

THE 30-SECOND REINTRO

To build a house well you follow the plans, double-check the work, and keep everything plumb. Your spec is the plumbline. Plumbline splits a build into single-responsibility skills that never call each other -- they hand off through files:

scaffold --> architect --> foreman --> builder --> inspector

(once) (spec) (blueprint) (code) (proof)

Each stage reads the last one's file and writes its own. The spec is one file and stays the source of truth end to end. The inspector runs with fresh eyes and proves the software against the spec -- "done" means shown, not asserted.

THE NEW PROBLEM I WALKED INTO

A skill is read COLD. A fresh agent -- or a different model entirely -- loads one skill with zero memory of the conversation it was written in. So every shared convention that lives only "in the author's head" is a silent failure waiting to happen. One skill writes a status line one way, another reads it slightly differently, nothing errors -- the run just quietly goes wrong.

I had skills silently agreeing on dozens of little things -- how a spec tags a finished criterion, the exact words the inspector stamps, file-naming patterns -- and that agreement existed nowhere except in my head and a handful of separate files that were free to drift apart.

THE FIX: A CONTRACT THE SKILLS GET CHECKED AGAINST

So v1.0 adds a TERMS file -- a single, canonical definition of every token, status line, and convention the skills share. Every skill reads it first and stops if it can't. The shared ground is explicit instead of assumed.

And because I don't trust myself any more than I trust the agents, I made it enforceable: there's now an audit that checks every skill against the contract, plus a semantic pass for the judgment a script can't do. The framework verifies its own coherence. Trust-but-verify, pointed back at the verifier.

WHAT ELSE CHANGED

It's a proper, installable, versioned plugin now (v1.0.0) instead of a folder of loose skills.
An autonomous build orchestrator and a maintain-mode one drive the same single-purpose skills -- nothing reimplemented.
A git-optional "none" mode, for folks who don't want a repo spun up on them.
A real, signed-off example run lives in the repo, so you can read the actual spec, the blueprint with the inspector's stamps, and the evidence.

WHAT I TOOK AWAY (this time)

Last time the lesson was "the filesystem is the contract." This time it's the next layer down: when you compose skills, the skills need a contract too -- and if you can't check it, it will drift. Writing the contract down was good. Making it verifiable is what made it true.

A REALITY CHECK

None of this is free. All the speccing, planning, testing, and inspecting is overhead -- ballpark ~15% on top of just turning an agent loose on the problem. I look at that as the cost of trusting it. And it buys something specific: when the builder has to deviate from the plan, it writes down exactly where and why -- so "huh, that's not what I asked for" becomes a line in a deviation log instead of a debugging session three weeks later. I'll pay 15% for reliable, testable, documented code over pocketing it and reverse-engineering a silent drift later.

MIT-licensed, v1.0.0, public: https://github.com/BytesFromToby/plumbline

The example run is in examples/ if you want to poke at the artifacts. Would genuinely love feedback from this crowd -- especially on the contract idea: is a shared, checkable vocabulary the right way to keep composed agents honest, or am I overbuilding it? You'll have opinions. :)

5 comments

A contract, an audit, and a humbling test run: Plumbline hits v1.0