This is a summary of the system I've been working on since joining Clief Notes. It draws on a lot of different posts, weekly competitions, David's Corner, and the work that has been shared with the community. This post was written to address a question many members have asked: how does a folder of plain text files become an operating system for AI work? The whole philosophy in one line: this is not automation; it's orchestration through structure. The folders are the framework; the files are the code; a human reads and audits all of it.
1. What this is (plain-language version):
Most people drive AI with one giant prompt: they dump everything they know into a chat box and hope. That breaks down the moment a task has more than one step, because the AI has no map — it improvises, forgets, and contradicts itself.
This is a live implementation of ICM — Interpretable Context Methodology — the principle that structured files delivered at the right moment replace a framework telling the AI what to do. Instead of one giant prompt, the workflow lives in a folder of plain text files, and the folder structure itself is the interface. The way the folders are laid out tells the AI what to do, in what order, and where the boundaries are. There's no app, no code required, no cloud account. An AI agent walks into the folder, reads a small "you are here" file, and the structure routes it to exactly the right instructions at the right moment.
Everything below is ICM in practice. The toolkit isn't a new theory — it's what ICM looks like when you actually build it and run it every day.
One sentence: Structured files at the right moment replace a framework telling the AI what to do.
The payoff: instead of loading 30,000–50,000 words of context for every task, the AI loads only the 2,000–8,000 words relevant to the current step. It's faster, cheaper, more consistent, and — crucially — a human can read the whole system and audit it, because the files are the documentation.
2. The one rule that governs everything: 60 / 30 / 10
Before any task, ask three questions in order:
1. Is the answer deterministic? (a calculation, a lookup, a file operation) → write a script. Stop. Don't involve AI.
2. Can it be written as an if-this-then-that rule? (sorting, routing, categorizing) → a small automation or script. Stop.
3. Does it genuinely need judgment across messy information? (synthesis, writing, analysis, design) → now use AI.
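The three questions above can be sketched as a tiny triage function. This is only an illustration: the `Task` fields and the `Layer` names are hypothetical, and in practice a person answers the two yes/no questions, not the code.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    SCRIPT = "script"  # one right answer: calculation, lookup, file operation
    RULE = "rule"      # clear if-this-then-that criteria
    AI = "ai"          # judgment across messy information


@dataclass
class Task:
    name: str
    deterministic: bool     # answer to question 1
    rule_expressible: bool  # answer to question 2


def triage(task: Task) -> Layer:
    """Ask the questions in order and stop at the first yes."""
    if task.deterministic:
        return Layer.SCRIPT
    if task.rule_expressible:
        return Layer.RULE
    return Layer.AI
```

The order matters: a task that is both deterministic and rule-expressible still lands on a script, because the first yes wins.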
The name "60/30/10" is the effort budget for any system you build — the rough share of the work each layer should own:
60% — Deterministic (scripts, data, established tools). Anything with one right answer that can be calculated, looked up, or formatted to spec: scripts, database/SQL queries, spreadsheet formulas, file operations, APIs with predictable responses. This is the stable foundation — reliable, fast, auditable, cheap to run. It doesn't drift, hallucinate, or cost tokens. A VLOOKUP does not hallucinate.
30% — Rule-based routing & layering (AI navigates, scripts execute). The "which path do we take" decisions that have clear criteria: if/then routing, branching, known-category classification, template selection, threshold checks. The AI may decide which route applies, but a script or automation does the deterministic step. This is the orchestration band — predictable, still drift-free, and the layer that decides which of the 60% gets called.
10% — AI judgment (logic, synthesis, analysis). The slice that genuinely can't be reduced to a rule: writing, analysis of unstructured text, fuzzy categorization, creative work, noticing the pattern in messy data. This is the only place a language model earns its cost.
The deeper truth: the real divide is 90 vs 10, not 60 vs 30. Infrastructure (60%) and orchestration (30%) are both deterministic work — they differ in complexity, not in kind. So when something breaks, the operative question is simply: did something that belongs in the 90% end up running on the 10%? It almost always did.
The endstate — AI as director, not middle management. Once the right scripts exist, the AI's 10% collapses to one job: selecting and calling the deterministic 90% correctly. Judgment in routing, not in doing. A smart director calling solid scripts beats a bloated orchestration layer that hard-codes every path. And it ages well: the 10% (prompts, model choices) decays as models change; the 90% (folder structure, pipelines, routing conventions) survives every tool swap. Patterns last. Tools don't.
The trap this prevents: routing a math problem or a file rename through an AI. It's slow, expensive, and gives a slightly different answer each run. Scripts automate. AI reasons. Never make them compete.
3. The mental model: five context layers:
This is ICM's core model — the toolkit is built directly on top of it. Everything in the system is one of five "layers." Think of it as a factory and its products.
- Layer 0 — Identity: "Who is this project, what agents exist, where do I start?" Lifespan: permanent. Size: ~1 page.
- Layer 1 — Directory / Router: A signpost. No content, only "for X go here, for Y go there." Lifespan: permanent. Size: short.
- Layer 2 — Stage contract: One step's rules — what to load, what to do, what to produce. Lifespan: per step. Size: short.
- Layer 3 — Reference: Stable knowledge — standards, templates, domain facts. Set once, reused. Lifespan: stable. Size: load only what's needed.
- Layer 4 — Output: The artifacts a step produces. A human reviews these before the next step runs. Lifespan: per run. Size: varies.
Layer 3 is the factory. Layer 4 is the product. Knowledge stays stable and gets reused; output is fresh each run and always gets a human's eyes before the workflow advances.
4. How the system is organized (the six functional layers):
At the top level the toolkit is split by function, not by topic. Each folder has its own small router file.
toolkit/
├── README / map      ← the system map; always read first
│
├── constraints/      ← Problem-solving protocols. Short, standalone rules
│                       you load one or two of when a specific failure shows up
│                       (e.g. "AI output sounds generic," "agent wandered out of scope").
│
├── skills/           ← Reusable techniques, model-agnostic. Each is a self-contained
│                       folder (e.g. "audit X," "delegate heavy coding," "fix generic prose").
│                       Flat namespace: pick one off the shelf.
│
├── workflows/        ← Active multi-step pipelines. Numbered stages that run in order.
│                       This is where real multi-step work executes.
│
├── knowledge/        ← The persistent reference layer. Naming rules, methodology docs,
│   │                   templates. Everything points here; it never points back out.
│   ├── _config/      ← global settings shared by every stage (the system's /etc)
│   ├── references/   ← the methodology, guides, patterns
│   └── templates/    ← reusable blank forms
│
├── scaffold/         ← Starters + builders for spinning up a brand-new project,
│                       plus read-only "teaching" examples.
│
└── runtime/          ← The front door at run time: a boundary gate, a small memory
                        service, and the deterministic scripts (lint, scaffold, migrate).
The placement law: subject first, type second. A document about a particular skill lives inside that skill's folder — not in a generic "all plans" bucket. Type-buckets are a last resort for things that genuinely belong to no single subject. This is what keeps the system navigable as it grows.
5. The four routing documents (the "living" part):
This is ICM's filescope principle, customized into the toolkit's daily driver. Four tiny files do all the steering: three permanent ones plus a state file that changes every session. These are what make it a living agent reference rather than a static doc: an agent re-reads them every session, so updating a file instantly changes how every future agent behaves.
- Identity file (Layer 0): Who this folder is. Identity only — if it's getting long, content is hiding that belongs in a subfolder. Limit: ~50 lines.
- Router file (Layer 1): Pointers only. "For this task → that file." No actual content. Limit: ~80 lines.
- Values / Rules file: The non-negotiables and the why behind them. The system's conscience. Limit: short.
- State file: Current session status — what's in progress, what's next. Kept separate from identity so it can change every session without churning the permanent files. Limit: short.
Filescope (the recursive principle): every router/identity file describes only its own folder. A subfolder's detail lives in the subfolder's own file, never copied up into the parent. So an agent entering the "monitoring" subfolder loads 30 lines of monitoring rules — not 400 lines of everything in the project. This is the single trick that keeps each step small and cheap.
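One way to picture filescope is a loader that collects only the router files on the path from the project root down to the folder being entered, and never touches sibling folders. This is a minimal sketch: the file name `ROUTER.md` is an assumption (the post doesn't fix a name), and loading the parent routers along the way is my reading of "walks into the folder and gets routed."

```python
from pathlib import Path

ROUTER_NAME = "ROUTER.md"  # hypothetical file name, not specified in the post


def scoped_context(root: Path, target: Path) -> list[Path]:
    """Return the router files from root down to target, and nothing else."""
    root, target = root.resolve(), target.resolve()
    rel = target.relative_to(root)  # raises ValueError if target is outside root
    chain = [root]
    for part in rel.parts:
        chain.append(chain[-1] / part)
    # Siblings of the folders on this chain are never visited.
    return [d / ROUTER_NAME for d in chain if (d / ROUTER_NAME).is_file()]
```

Calling `scoped_context(project, project / "monitoring")` would return the root router and the monitoring router, and skip every other subfolder.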
6. How a workflow actually runs: the Unix-pipe principle:
A workflow is numbered stages that hand off through a single channel.
01-research/output/ → read by 02-draft
02-draft/output/ → read by 03-review
03-review/output/ → delivered
Three iron rules:
1. The handoff channel is output/, nothing else. Stage 2 reads stage 1's output/ folder — never stage 1's scratch/working files.
2. Adjacent only. Stage 3 reads stage 2, never stage 1. If stage 3 needs something from stage 1, stage 2 must copy it forward into its own output. ("Embed, don't reach.")
3. Config is always available. The global _config/ (voice rules, standards) is like system settings — any stage may read it without "breaking the chain."
Because each stage only cares about the shape of the output it receives, you can rewrite stage 1 entirely without touching stage 2. Stages are independently testable and swappable, exactly like Unix pipes (cmd1 | cmd2 | cmd3).
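The three iron rules can be checked mechanically. The sketch below assumes stage folders start with a two-digit number and that the global config folder is passed in explicitly; the function names are hypothetical.

```python
from pathlib import Path


def stage_dirs(workflow: Path) -> list[Path]:
    """Numbered stage folders (01-research, 02-draft, ...) in run order."""
    return sorted(p for p in workflow.iterdir() if p.is_dir() and p.name[:2].isdigit())


def allowed_inputs(workflow: Path, stage: Path, config: Path) -> list[Path]:
    """The only places a stage may read from."""
    stages = stage_dirs(workflow)
    i = stages.index(stage)
    allowed = [config]                            # rule 3: config is always available
    if i > 0:
        allowed.append(stages[i - 1] / "output")  # rules 1 and 2: adjacent output/ only
    return allowed


def may_read(path: Path, allowed: list[Path]) -> bool:
    """True if path is one of the allowed folders or sits inside one."""
    return any(a == path or a in path.parents for a in allowed)
```

With this check, stage 3 reading anything under `01-research/` is rejected, which forces stage 2 to copy forward whatever stage 3 needs.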
Each stage is defined by a stage contract — just three sections: Inputs (what to load and why), Process (ordered steps), Outputs (what to write, where). That's it. Same contract shape regardless of subject, tool, or AI.
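Because the contract shape never changes, a short script can confirm that a contract has all three sections before a stage runs. The layout assumed here (each section name alone on its own line, optionally followed by a colon) is a guess at one reasonable format, not the toolkit's actual one.

```python
REQUIRED_SECTIONS = ("Inputs", "Process", "Outputs")


def parse_contract(text: str) -> dict[str, list[str]]:
    """Split a contract into its three sections; raise if one is missing or empty."""
    sections: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = line.rstrip(":")
        if heading in REQUIRED_SECTIONS:
            current = heading
            sections[current] = []
        elif current is None:
            raise ValueError(f"content before first section: {line!r}")
        else:
            sections[current].append(line)
    missing = [s for s in REQUIRED_SECTIONS if not sections.get(s)]
    if missing:
        raise ValueError(f"missing or empty sections: {missing}")
    return sections
```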
Design tip: write each stage's output to be interrogated later, not just consumed once. A well-structured output can be pulled into a fresh session weeks later and questioned ("show me where the earlier step flagged X"). A raw dump can't.
7. How to build a specialist (a single expert agent):
A "specialist" is a folder that turns an AI into one focused expert. The discipline is a five-file pattern:
- identity — Holds a philosophy, not a persona. Named principles with a defensible reason to exist. The test: "I catch what the writer missed, because writers can't review their own work" (philosophy) — not "I'm a helpful reviewer" (persona).
- rules — Behavioral constraints only: how it responds, what it never does, how it self-corrects. The test: every rule must be falsifiable — a judge could read one output and mark it pass/fail. "Be thorough" fails this test; "every finding must state where, how to reproduce, and why — or be discarded" passes.
- reference — The protocols, schemas, templates, ordered question-lists. The test: if a section describes a sequence or a schema rather than a behavior, it belongs here, not in rules.
- workflow — The resumability layer: a Pick-Up state (how to resume mid-task) and a Drop state (done-when criteria + where to route next). The test: a new session can inherit a half-finished job and continue.
- examples — Matched do / don't pairs of real input → output, each annotated with why. (See below — this is the highest-leverage file in the folder.)
Two more rules that separate good specialists from generic ones:
- Scope lists earn trust through specificity. "I don't do SEO" is vague. "I don't do semicolons — get a linter" is opinionated enough to be believed. The more concrete the exclusion, the more credible the inclusions.
- Define the clean-pass output explicitly. A specialist with no defined "nothing wrong here" result will invent noise to look useful. Give it a contractual silence: PASS — 0 issues, STRONG — 92/100, No changes recommended.
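The falsifiable rule quoted earlier (every finding states where, how to reproduce, and why, or is discarded) and the clean-pass rule fit together naturally in code. A rough sketch, where the `Finding` fields and the exact clean-pass wording are assumptions:

```python
from dataclasses import dataclass

CLEAN_PASS = "PASS: 0 issues"  # assumed wording; use whatever the specialist's contract defines


@dataclass
class Finding:
    where: str      # exact location
    reproduce: str  # how to reproduce
    why: str        # why it matters


def report(findings: list[Finding]) -> str:
    """Discard incomplete findings; return the clean-pass string if none remain."""
    kept = [f for f in findings if f.where.strip() and f.reproduce.strip() and f.why.strip()]
    if not kept:
        return CLEAN_PASS
    return "\n".join(f"{f.where}: {f.why} (reproduce: {f.reproduce})" for f in kept)
```

A finding with no location is dropped instead of padded, so a reviewer with nothing solid to say returns the clean-pass string and nothing else.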
Why examples carry the most weight:
Of the five files, examples is the one most people underbuild and the one that moves quality the most. Here's why, and how to build it well.
Why it works:
- Examples encode the tacit knowledge that rules can't. A rule can say "be specific," but it cannot fully transmit what specific looks like in this domain. One concrete good-output makes the standard legible in a way ten adjectives never will. The rule names the target; the example is the target.
- Models pattern-match harder off examples than off instructions. When an AI sees a worked input→output pair, it anchors to the shape, depth, and tone of that output. A single strong example pins down a whole class of decisions that would otherwise drift run-to-run — formatting, length, how much to push back, when to stop.
- The "don't" half is where the boundary actually lives. A good-only example shows the destination but not the edge. A bad example — especially a near-miss that looks plausible but fails — teaches the line between acceptable and not. Most specialist failures are subtle (technically-correct-but-useless output), and only a contrastive pair makes that line visible.
- Annotated pairs turn examples into a rubric. When each example says why it's good or bad, the specialist isn't just imitating a sample — it's internalizing the criterion. That's what lets it generalize to inputs you never showed it.
How to build it well:
- Use matched do/don't pairs, not isolated samples. Same input, two outputs: one that passes, one that fails. The contrast is the lesson.
- Annotate every example with the reason. One line under each: "Good — names the exact location and a reproduction step." / "Bad — flags a 'concern' with no evidence; this is the noise the clean-pass rule exists to kill."
- Mine the near-misses, not the obvious failures. A bad example that's wildly wrong teaches nothing. The valuable "don't" is the one that looks right at a glance and fails on inspection — that's the mistake the specialist will actually be tempted to make.
- Include the clean-pass case. Show one example where the correct output is "nothing to report." This is what stops a specialist from manufacturing problems to seem useful.
- Cover the edge, not just the center. One ambiguous or adversarial input → the correct handling. Examples that only show easy cases leave the hard ones to improvisation.
Rule of thumb: if rules is the law, examples is the case law. Rules state the principle; examples show how it's been applied — and applied wrongly. Agents, like junior hires, learn the job faster from worked cases than from the statute.
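The build checklist above can double as a lint for the examples file itself. This sketch assumes the pairs have already been loaded into a simple structure; the `ExamplePair` fields are hypothetical, and it only checks what a script can check (annotations, contrast, a clean-pass case), not whether the near-miss is a good one.

```python
from dataclasses import dataclass


@dataclass
class ExamplePair:
    source_input: str
    do: str        # output that passes
    dont: str      # output that fails, ideally a near-miss
    why_do: str    # one-line reason the "do" passes
    why_dont: str  # one-line reason the "don't" fails


def lint_examples(pairs: list[ExamplePair], clean_pass: str) -> list[str]:
    """Return problems found in an examples file; an empty list means it passes."""
    problems = []
    for i, p in enumerate(pairs, start=1):
        if not (p.why_do.strip() and p.why_dont.strip()):
            problems.append(f"pair {i}: missing annotation")
        if p.do.strip() == p.dont.strip():
            problems.append(f"pair {i}: do and don't are identical, so there is no contrast")
    if not any(p.do.strip() == clean_pass for p in pairs):
        problems.append("no clean-pass example")
    return problems
```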
8. How to build a multi-agent system (several specialists working together):
When one expert isn't enough, you connect specialists through an orchestrator and a shared contract.
your-system/
├── manifest.yaml ← the contract: the exact data shape every handoff must use
├── AGENTS.md ← entry point + "how to activate" guide
├── 0-orchestrator/ ← routes work; never does the work itself
├── 1-specialist-a/
└── 2-specialist-b/
- The orchestrator routes, it never answers. It identifies intent, packages the request (always including the user's original words verbatim), and sends it to the right specialist. If unsure, it asks one clarifying question.
- manifest.yaml is the single source of truth for data shape. Every specialist's handoff references it. Golden rule: never change a handoff format without updating the manifest first — that's the #1 cause of silent breakage ("schema drift").
- Two orchestrator styles:
  - Relay (lead → research → offer → close): work flows stage to stage.
  - Assess-and-decide (score one thing across several dimensions, then rule on it): the orchestrator needs a binding decision table (e.g. SHIP / FIX / ESCALATE), evidence-over-memory discipline, and an adversarial "try to break this" gate before it commits. It returns a decision, never a question.
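A binding decision table for the assess-and-decide style might look like the sketch below. The thresholds, the 0-100 scale, and the "weakest dimension governs" rule are all assumptions made for illustration; the post only names the three outcomes and the adversarial gate.

```python
from enum import Enum


class Decision(Enum):
    SHIP = "SHIP"
    FIX = "FIX"
    ESCALATE = "ESCALATE"


# Hypothetical thresholds on a 0-100 scale; the post does not specify any.
SHIP_MIN = 80
FIX_MIN = 50


def decide(scores: dict[str, int], survived_break_attempt: bool) -> Decision:
    """Apply a fixed table to per-dimension scores. Always returns a decision."""
    if not scores:
        return Decision.ESCALATE  # no evidence, so a human rules on it
    worst = min(scores.values())  # assumption: the weakest dimension governs
    if worst < FIX_MIN:
        return Decision.ESCALATE
    if worst < SHIP_MIN or not survived_break_attempt:
        return Decision.FIX
    return Decision.SHIP
```

Because every branch returns a `Decision`, the orchestrator cannot answer with a question, which is the point of making the table binding.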
Because it's all files, a brand-new agent can be operational in ~30 minutes by reading four files. The filesystem is the language; the files are the code.
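The orchestrator and manifest bullets in this section can be sketched together: the orchestrator packages a request (original words verbatim), and every handoff is checked against the manifest before it is sent. Here `MANIFEST` stands in for the parsed contents of manifest.yaml (parsing is left out because YAML isn't in the Python standard library), and the field names and keyword routing table are hypothetical.

```python
from typing import Union

# Assumed stand-in for the parsed manifest.yaml: required fields and their types.
MANIFEST = {"intent": str, "original_request": str, "target": str}

# Hypothetical keyword routing table.
ROUTES = {"refund": "1-specialist-a", "bug": "2-specialist-b"}


def validate_handoff(handoff: dict) -> None:
    """Raise ValueError if the handoff does not match the manifest exactly."""
    if set(handoff) != set(MANIFEST):
        raise ValueError(f"fields differ from manifest: {sorted(set(handoff) ^ set(MANIFEST))}")
    for key, expected in MANIFEST.items():
        if not isinstance(handoff[key], expected):
            raise ValueError(f"{key} should be {expected.__name__}")


def route(user_words: str) -> Union[dict, str]:
    """Return a validated handoff, or one clarifying question if intent is unclear."""
    matches = [k for k in ROUTES if k in user_words.lower()]
    if len(matches) != 1:
        return "Which of these is this about: " + ", ".join(ROUTES) + "?"
    handoff = {
        "intent": matches[0],
        "original_request": user_words,  # the user's words, verbatim
        "target": ROUTES[matches[0]],
    }
    validate_handoff(handoff)
    return handoff
```

If someone adds a field to a handoff without touching `MANIFEST` first, `validate_handoff` fails loudly instead of letting the schema drift silently.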
9. How to keep it alive (updating without rotting):
A "living agent reference" stays useful only with maintenance discipline:
- Routers stay routers. The moment a router file starts containing content instead of pointers, that's the signal to break out a subfolder with its own router.
- Identity stays under a page. If it's growing, content is leaking in from a subfolder — move it down.
- State is separate from identity. Update state every session; never touch identity to record "what I did today."
- Length limits as smoke alarms: routers < ~80 lines, reference files < ~200 lines (split if longer), identity < ~50 lines. When a file blows its limit, that's a structural problem to fix, not a number to ignore.
- Match constraints to symptoms, don't load the library. Each workspace's router lists only the one or two protocols likely to bite there — not all of them. Context overload defeats the whole purpose.
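The length limits are easy to turn into a real smoke alarm. A minimal sketch, assuming Markdown files, hypothetical file names `IDENTITY.md` and `ROUTER.md`, and reference files living under a folder named `references`; the limits themselves come from the list above, treated as hard cutoffs even though the post calls them approximate.

```python
from pathlib import Path

# Limits from the text; the file names are hypothetical stand-ins.
LIMITS = {"IDENTITY.md": 50, "ROUTER.md": 80}
REFERENCE_LIMIT = 200


def lint_lengths(root: Path) -> list[str]:
    """Return one warning per file that exceeds its line limit."""
    warnings = []
    for path in sorted(root.rglob("*.md")):
        rel = path.relative_to(root)
        limit = LIMITS.get(path.name)
        if limit is None and "references" in rel.parts:
            limit = REFERENCE_LIMIT
        if limit is None:
            continue  # no limit defined for this kind of file
        lines = len(path.read_text(encoding="utf-8").splitlines())
        if lines > limit:
            warnings.append(f"{rel}: {lines} lines (limit {limit})")
    return warnings
```

A warning here is a prompt to restructure (break out a subfolder, split a reference), not to trim lines until the number fits.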
10. The discipline (the unbreakable rules):
These are the cultural rules that make the whole thing work. Generalize them to any team:
1. Navigate before reasoning. If a map/router file exists, read it first. Using AI to reconstruct what a file already states is pure waste.
2. Scripts automate, AI reasons. Honor the 60/30/10 split. Never route deterministic work through AI; never let a script make judgment calls.
3. Stay in scope. An agent works only in the folder it was pointed at. Wandering into sibling folders is a boundary failure, not initiative.
4. Stop and surface failures. When something fails, stop and report it. Don't silently retry — that hides the real problem and burns budget. A human picks the next move.
5. Complete answers, held under pressure. Give the full answer the first time; don't reverse a correct answer just because you were pushed, unless there's genuinely new evidence.
6. Human review gate between stages. Nothing advances until a person approves what landed in output/. This is what makes the workflow interruptible and auditable.
7. Brief before prompt. Define the outcome and the failure criteria before opening the AI. A thin brief earns generic output — correctly.
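Rules 4 and 6 are the two that a script can enforce directly. The sketch below runs at most one stage per call, refuses to move past unapproved output, and raises on failure instead of retrying. The `APPROVED` marker file is a hypothetical convention (a reviewer creates it by hand), and `run_stage` is whatever callable actually executes a stage.

```python
from pathlib import Path
from typing import Callable, List, Optional

APPROVAL_MARKER = "APPROVED"  # hypothetical: an empty file a reviewer creates by hand


class StageFailed(Exception):
    """Raised once, immediately, so a human decides the next move."""


def advance(stages: List[Path], run_stage: Callable[[Path], None]) -> Optional[Path]:
    """Run at most one stage: the first whose output/ is not yet approved.

    Returns the stage now waiting for review, or None if every stage is approved.
    """
    for stage in stages:
        output = stage / "output"
        if (output / APPROVAL_MARKER).exists():
            continue      # a person already signed off on this stage
        if output.is_dir() and any(output.iterdir()):
            return stage  # output exists but is unapproved: wait, do not rerun
        try:
            run_stage(stage)
        except Exception as exc:
            raise StageFailed(f"{stage.name} failed: {exc}") from exc  # no silent retry
        return stage
    return None
```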
11. Quickstart: scaffold this into your subject
Hand an AI this whole document plus the steps below, and it can build the same ICM system for any domain — cooking, legal intake, research, a game, customer support, anything.
1. Pick the subject and name the end-to-end job.
("Take a raw customer email and produce a triaged, routed response.")
2. Find the natural stages already implicit in that job, in order.
Most workflows already have stages — you're just naming them.
→ 01-intake / 02-classify / 03-draft / 04-review
3. For each stage, write a one-page CONTRACT: Inputs · Process · Outputs.
Output of stage N is the ONLY thing stage N+1 reads.
4. Pull the stable stuff (standards, voice, templates, domain facts)
out of the stages into a shared reference/ + _config/ layer.
5. At the root, write three tiny files:
- identity (who this project is — under a page)
- router (pointers only — "for X go to stage 0Y")
- rules (the non-negotiables + why)
6. If a stage needs a focused expert, build it as a specialist:
identity (philosophy) · rules (testable behavior) ·
reference (protocols/schemas) · workflow (pick-up/drop) ·
examples (annotated do/don't pairs — build this one hardest).
7. If multiple experts must coordinate, add a 0-orchestrator that
ROUTES (never answers) and a manifest defining the handoff shape.
Update the manifest BEFORE any handoff format ever changes.
8. Before automating anything, run the 60/30/10 test:
deterministic → script; rule-based → automation; judgment → AI.
Limits to hold: identity < 50 lines · routers < 80 lines ·
reference < 200 lines · each step's context 2,000–8,000 "words."
Every step ends in an output/ folder a human checks before advancing.
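Steps 2 through 5 are deterministic, so by the 60/30/10 test they belong in a script. A minimal scaffold sketch, where the file names (`IDENTITY.md`, `ROUTER.md`, `RULES.md`, `CONTRACT.md`) are hypothetical and the contents are left for a human to write:

```python
from pathlib import Path

CONTRACT_STUB = "Inputs:\n\nProcess:\n\nOutputs:\n"


def scaffold(root: Path, stages: list[str]) -> None:
    """Create the root files, shared layers, and one numbered folder per stage."""
    root.mkdir(parents=True, exist_ok=True)
    for name in ("IDENTITY.md", "ROUTER.md", "RULES.md"):  # hypothetical file names
        (root / name).touch()
    for shared in ("reference", "_config"):
        (root / shared).mkdir(exist_ok=True)
    for number, stage in enumerate(stages, start=1):
        folder = root / f"{number:02d}-{stage}"
        (folder / "output").mkdir(parents=True, exist_ok=True)
        contract = folder / "CONTRACT.md"
        if not contract.exists():  # never overwrite a contract someone has written
            contract.write_text(CONTRACT_STUB, encoding="utf-8")
```

For the customer-email example, `scaffold(Path("support-triage"), ["intake", "classify", "draft", "review"])` would produce `01-intake` through `04-review`, each with an empty contract and an `output/` folder. Running it again is safe: existing contracts are left alone.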