Like a lot of people here, I started with Jake's paper. ICM made immediate sense to me: the filesystem as the orchestration, folders as stages, an agent that navigates instead of being fed. So I built with it. Somewhere along the way I realized my main workspace was never a pipeline of the kind ICM describes. So I built a workspace-builder skill from the paper and started my first true pipeline workspace, and the contrast became apparent.
My original workspace has been running for months now, agent-maintained, prose-dense, and the thing I was tending wasn't a sequence of stages anymore. It was a body of knowledge. Decisions piling up, conventions hardening, old files going quietly stale, and the question I kept hitting wasn't "what runs next." It was "does the right knowledge still reach the right task."
Then Google published OKF, and the pieces clicked. Markdown, YAML frontmatter, typed nodes, a semantic layer over what you know. My existing HTML frontmatter wasn't sufficient. But OKF is a format, and it says so itself. Nothing in it governs what happens when the corpus grows for a year. The next realization came when I stopped thinking in layers and started thinking in nodes: a long-running development workspace isn't a stack, it's a graph, and it accretes. So I gave it a name. An Accretive Context Graph.
What it is
An ACG is a project workspace built as a growing knowledge graph of plain markdown files. Each file is a node holding one piece of what the project knows: a decision, a convention, a failure, a spec. Typed links join the nodes and state what governs what, what answers what, and what depends on what. An agent works inside the graph. It walks the links to gather the context each task needs, adds what each session learns, and regenerates the indexes, maps, and checks from the prose, so the structure stays accurate as it grows. A human directs from the edge, judging what the checks and the agent surface, deciding what stays, what supersedes, and what earns a closer look. And because the graph accretes rather than resets, maintenance is part of the work: staleness is tracked, contradictions get reconciled, growth is consolidated on a cadence.
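To make the shape concrete, here is a minimal sketch of nodes with typed links and a walk that gathers context for a task. Everything named here is an assumption for illustration: the `Node` fields, the edge names (`governed_by`, `depends_on`, `answers`), the file paths, and the two-hop limit are hypothetical, not a fixed ACG schema.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """One markdown file in the graph (fields are illustrative)."""
    path: str
    kind: str  # e.g. "decision", "convention", "failure", "spec"
    # Typed outgoing links: edge type -> list of target paths.
    links: dict[str, list[str]] = field(default_factory=dict)


def gather_context(graph, start, follow, max_hops=2):
    """Breadth-first walk from `start`, following only edge types in `follow`."""
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        path, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand nodes at the hop limit
        for edge_type, targets in graph[path].links.items():
            if edge_type not in follow:
                continue
            for target in targets:
                if target in graph and target not in seen:
                    seen.append(target)
                    queue.append((target, hops + 1))
    return seen


# Hypothetical four-file workspace.
graph = {
    "specs/export.md": Node("specs/export.md", "spec", {
        "governed_by": ["conventions/naming.md"],
        "depends_on": ["decisions/csv-over-json.md"],
    }),
    "conventions/naming.md": Node("conventions/naming.md", "convention"),
    "decisions/csv-over-json.md": Node("decisions/csv-over-json.md", "decision", {
        "answers": ["failures/json-blowup.md"],
    }),
    "failures/json-blowup.md": Node("failures/json-blowup.md", "failure"),
}

context = gather_context(graph, "specs/export.md", {"governed_by", "depends_on"})
print(context)
```

Filtering by edge type is the point of the sketch: the same starting file yields a different context set depending on which relationships the task cares about.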
Where an ICM link says what runs next, an ACG link says what governs what. Nothing gets given up in the move. A pipeline nests inside the graph as a path, the substrate is still markdown and a text editor, and this sits comfortably inside Jake's methodology term. ACG is a contribution back into ICM, stretched to cover projects that outlive any single run, such as dynamic or non-deterministic workflows that need elements from across the corpus.
One idea sits underneath all of it. Finding is not binding. Search, links, and a good map make knowledge findable, and none of that guarantees the rule was actually in the window at the moment of the edit. Most of what people call drift is omission. The rule existed, it was findable, and it wasn't loaded when it mattered. Which is why the mechanisms I'm testing are mostly about closing that gap.
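One way to picture the gap between findable and loaded is a check that runs after a session: for every file that was edited, were its governing rules actually in the window? The sketch below assumes you can get a list of files the session loaded and a list it edited. The function name, the `governed_by` mapping, and the paths are all hypothetical.

```python
def missing_rules(edited_files, loaded_files, governed_by):
    """Return {edited file: [governing rules that were never loaded]}."""
    loaded = set(loaded_files)
    gaps = {}
    for path in edited_files:
        absent = [rule for rule in governed_by.get(path, []) if rule not in loaded]
        if absent:
            gaps[path] = absent
    return gaps


# Hypothetical governance links and session log.
governed_by = {
    "src/export.py": ["conventions/naming.md", "decisions/csv-over-json.md"],
    "docs/readme.md": ["conventions/style.md"],
}
edited = ["src/export.py", "docs/readme.md"]
loaded = ["src/export.py", "docs/readme.md", "conventions/naming.md", "conventions/style.md"]

gaps = missing_rules(edited, loaded, governed_by)
print(gaps)
```

An empty result would mean every governing rule was loaded for every edit. Anything else is an omission: the rule existed and was linked, but it was not in the window.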
The study
Claims are cheap, so I'm measuring instead of asserting. Two live workspaces serve as testbeds, the prose-dense one I described above and a script-dense data pipeline as the contrasting pole. Alongside them, seeded test workspaces run in matched pairs: an ON arm with the theory's mechanisms folded in (typed edges, governance links, staleness anchors) and an OFF arm without them. Same tasks, same corpus scale, and the delta is the result. I am logging time, tokens, and tool calls, but what is judged is always output.
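For readers who want the arithmetic of a matched pair spelled out, here is a small sketch. It assumes each arm produces one judged number per task (rework count is used here as a stand-in) and that lower is better. The task names and numbers are placeholders invented for the example, not results from the study.

```python
def paired_deltas(on_arm, off_arm):
    """Per-task ON minus OFF, for tasks present in both arms."""
    shared = sorted(on_arm.keys() & off_arm.keys())
    return {task: on_arm[task] - off_arm[task] for task in shared}


def summarize(deltas):
    """Count tasks where ON was lower than, equal to, or higher than OFF."""
    return {
        "on_lower": sum(1 for d in deltas.values() if d < 0),
        "tie": sum(1 for d in deltas.values() if d == 0),
        "on_higher": sum(1 for d in deltas.values() if d > 0),
    }


# Placeholder rework counts per task, for illustration only.
on_arm = {"task-a": 1, "task-b": 0, "task-c": 2}
off_arm = {"task-a": 3, "task-b": 0, "task-c": 1, "task-d": 4}

deltas = paired_deltas(on_arm, off_arm)
print(deltas)
print(summarize(deltas))
```

Tasks that ran in only one arm are dropped rather than guessed at, which keeps the comparison strictly paired.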
The honesty rules matter more than the mechanisms, so here they are too. Primary evidence is naturally occurring: rework rates, failures a human actually noticed, time to completion. Rubric scores and model judgments are leading indicators only, and when an indicator disagrees with an oracle, the oracle wins. Sample sizes are small and I'm both the author and the operator, so same-author bias gets named, not hidden. And negative results are first-class. Several mechanisms have already earned a "not worth the maintenance cost" verdict, and those go in the brief with the same weight as the wins.
One limit up front, because I'd rather state it than have you assume the opposite: the OFF arm is a theory-blind accretive workspace, not a plain ICM pipeline. So the deltas measure what the governance mechanisms add on an accretive substrate. They don't test ACG against ICM head to head. That comparison is still open, and the brief will say so.
Run the nav test on your own workspace
One instrument from the study is portable enough to share now, and I've pointed this version at the workspace most of you actually run: a standard ICM build, numbered stages, contracts, reference files, a routing file at the top. It measures the thing that actually matters, whether a fresh session can reach the right knowledge, and the whole thing runs on three prompts and two sessions. No part of it requires you to write the questions yourself.
First, a fresh session builds the bank:
> You have full access to this workspace. Build a navigation test bank. Pick 10 facts this workspace genuinely contains, each living in one specific file. Sample across the layers: at least two from stage contracts (inputs, process rules, output specs), at least two from reference material (_config, style guides, constraints), at least one from the routing files (CLAUDE.md or CONTEXT.md decisions), and a few from files nothing links to prominently. For each fact write three things: the question phrased as a working task, the way a real session would need it ("I'm about to run stage 02, which reference files must I load and what does its output need to contain?" beats "what does file Y say"); the correct answer; and the ground-truth file path. Output the questions in one block and the answer key in a separate block. Don't fix or comment on anything you find along the way.
Save both blocks, clear the session, and open a blind one. Give it only the questions:
> You have full access to this workspace. Answer the following questions using only what you can find in it. For each one give your answer, the file or files you used, a confidence score from 0 to 10, and roughly what it cost you to get there (files opened, searches run). If you can't find an answer, say so plainly and score your confidence at 0. Don't guess.
Run the questions, then paste the answer key into that same session:
> Here's the answer key. Grade your run: for each question, mark whether you reached the right file, whether the answer was correct and complete, what it cost, and what you scored your confidence. Sort the failures: a low-confidence miss means the structure was navigable enough that you knew you were lost; a HIGH-confidence miss means the workspace let you feel certain about a wrong answer, and those are the dangerous ones — fix them first. For every miss or partial, diagnose why the structure failed you — what would have had to exist for you to find it — and propose the smallest structural fix: a link, a rename, an index line, a line in a stage contract. One fix per miss. Don't make the changes. List them.
And that fix list is your next prompt. Every miss comes out of the run already diagnosed and already shaped as the thing to do about it, so the test doesn't just score your workspace, it hands you the work order. Keep the bank, re-run it after the fixes land or after any structural change, and the comparison is the result.
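If you keep the graded results from each run, the before-and-after comparison is a few lines of code. This is a sketch under the assumption that each run is recorded as a mapping from question ID to whether the answer was correct. The IDs and values are made up for the example.

```python
def compare_runs(before, after):
    """Compare two graded runs of the same bank; values are True when correct."""
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [q for q in shared if not before[q] and after[q]],
        "regressed": [q for q in shared if before[q] and not after[q]],
        "still_failing": [q for q in shared if not before[q] and not after[q]],
    }


# Placeholder results for a four-question bank.
before = {"q1": True, "q2": False, "q3": False, "q4": True}
after = {"q1": True, "q2": True, "q3": False, "q4": False}

report = compare_runs(before, after)
print(report)
```

The `regressed` list is worth as much attention as `fixed`: a structural change that repairs one path can quietly break another.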
The confidence score is the part I'd argue you shouldn't skip. Reached-the-file and got-it-right tell you about coverage. Confidence-versus-correctness tells you about trust, and a workspace that produces confident wrong answers is worse than one that produces honest dead ends, because in normal work you'll never see the difference. Which is also why the most valuable outcome of the whole run is a miss: a question whose answer exists in your workspace but never gets found is a reachability gap, and it fails silently every day until you test for it. The agent doesn't know what it failed to load, and neither do you.
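The sorting of misses by confidence can also be done mechanically once the run is graded. The sketch below assumes one record per question and treats a confidence of 7 or above as "high". That cutoff, the `Graded` record, and the sample rows are assumptions for illustration, so tune the threshold to your own runs.

```python
from dataclasses import dataclass


@dataclass
class Graded:
    question: str
    reached_file: bool
    correct: bool
    confidence: int  # 0-10, self-reported by the blind session


def triage(results, high=7):
    """Split misses into confident (fix first) and honest (low-confidence)."""
    misses = [r for r in results if not r.correct]
    confident = sorted(
        (r for r in misses if r.confidence >= high),
        key=lambda r: -r.confidence,
    )
    honest = [r for r in misses if r.confidence < high]
    return confident, honest


# Placeholder graded rows.
results = [
    Graded("q1", True, True, 9),
    Graded("q2", False, False, 2),
    Graded("q3", True, False, 8),
    Graded("q4", False, False, 10),
]

confident, honest = triage(results)
print([r.question for r in confident])
print([r.question for r in honest])
```

Correct answers are left out on purpose: the split only matters for misses, where a high score means the workspace produced false certainty.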
If your workspace has grown past the pipeline into something more graph-shaped, densely linked the way mine is, the same three prompts hold. Just ask the generator for tougher questions: multi-hop ones ("what rule governs the thing this file depends on"), questions whose answers span two files, questions about which of two conflicting notes is current. The denser the structure, the harder the bank it can support, and the more the misses tell you.
One honest limit on the self-generated bank: the first session can only ask about what it managed to find, so the very least-reachable content never makes it into the questions. The layer-sampling instruction narrows that hole. It doesn't close it. If you want the stronger version, write two or three questions yourself about things you know are in there, and add them to the bank before the blind run.
What comes next
The formal brief lands once the remaining measurements come in. Every claim will carry its evidence class: what's true by construction, what was measured and at what sample size, and what's still a design hypothesis, each with the fail condition it survived. The related work will cite the field honestly too, because the graph-shaped workspace is a converging idea in 2026 and the scarce thing isn't the coinage, it's tested evidence with the don't-adopts included.
If you run the nav test, I'd genuinely like to see your misses. That failure data helps everyone building this way. I'm not walking in with credentials on this one, just the measurements, and when the brief lands you can check every claim against them yourselves.