Dual View Architecture - Full Orchestration Engine.
Here is a failure mode that ships more often than it should. The model
that writes an output is also the model that checks it. You send the
result back with "review this and flag anything weak," and the review
skews toward approval, because a model reviewing its own work brings
the same blind spots that produced it. This is a documented
limitation, not a hypothesis.
It is not universal. Plenty of teams already run multi-step pipelines,
separate critic models, and output validators. But self-review as the
only quality gate is still common, and for enterprise-grade output,
where a confident wrong answer is a liability, it is worth solving
properly. This post is how we approached it, where the design is
genuinely sound, and where it is not. I want this community to review
both.
Why self-review is weak
Two things work against a single model checking itself.
The first is autoregressive momentum. A model picks each token based
partly on the tokens it has already written, so the opening of an
output conditions everything after it. A model that has generated thousands
of reports has a strong format prior: summary, background, analysis,
recommendation. Your spec might say to lead with the competitive
threat and drop the background. That instruction competes with the
prior, and a few sentences in, the prior often wins. The output looks
like a report. It is not your report.
The second is that evaluation has a prior too. In training data,
reviews of polished work skew positive, so a model asked to "evaluate
this" leans toward approval. A reviewer that is the same model, or the
same model family, shares the writer's biases.
Here is the honest version of the claim, and it is less dramatic than
how this is usually sold. Prompting is not powerless. It is unreliable
as your only quality control. Self-critique does catch real errors.
It just catches fewer than an independent reviewer would, and you
cannot tell from the output which case you got. So you do not throw
prompting away. You add structure around it.
Dual-view: structured prompting, not no prompting
Dual-view splits one generation into three separate model calls, each
in its own isolated context, each on a different model. I want to be
precise about what this is and is not. All three stages are still
prompted model calls. We did not escape prompting. We structured it,
so that no single model both produces something and signs off on it.
Track 1 builds structure. It reads your methodology and produces a
skeleton of headings and hierarchy, with a filter that blocks prose.
Honest caveat: cleanly separating "structure" from "content" is its
own hard problem, and that filter is a heuristic, not a solved
science.
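To make the idea of a prose filter concrete, here is a minimal sketch
under stated assumptions. It flags any skeleton line that is long or
ends like a sentence. The function name find_prose_lines, the
12-word limit, and the punctuation rule are all hypothetical choices
for illustration, and a real filter would need more than this.

```python
import re

MAX_HEADING_WORDS = 12  # assumed threshold; tune it for your own documents


def find_prose_lines(skeleton: str, max_words: int = MAX_HEADING_WORDS) -> list[str]:
    """Return the skeleton lines that look like prose rather than headings."""
    offenders = []
    for raw in skeleton.splitlines():
        # Drop indentation and outline markers such as "1.", "1.2", "-", "*".
        line = re.sub(r"^\s*(?:[-*]|\d+(?:\.\d+)*\.?)\s+", "", raw).strip()
        if not line:
            continue
        too_long = len(line.split()) > max_words
        ends_like_sentence = line.endswith((".", "!", "?"))
        if too_long or ends_like_sentence:
            offenders.append(raw.strip())
    return offenders
```

A deterministic caller could reject the Track 1 result whenever this
list is non-empty and retry, instead of asking the model to police
itself.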
Track 2 challenges. A second model, ideally from a different family,
reviews the skeleton and cannot approve without naming a specific risk
or alternative. Different families do carry somewhat different
biases, so this reduces shared blind spots. It does not eliminate
them, since most large models share training data and similar tuning,
and it costs you a second integration and added latency. It is a real
improvement, not magic.
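The "cannot approve without naming a specific risk or alternative"
rule can be enforced in code rather than in the prompt. The sketch
below assumes the critic's reply has already been parsed into a small
record. The Critique class, its field names, and the five-word
minimum are hypothetical, and word count is only a crude stand-in for
"specific".

```python
from dataclasses import dataclass, field


@dataclass
class Critique:
    verdict: str  # "approve" or "revise"
    risks: list[str] = field(default_factory=list)
    alternatives: list[str] = field(default_factory=list)


def is_substantive(text: str, min_words: int = 5) -> bool:
    """Crude proxy for 'specific': the finding has at least min_words words."""
    return len(text.split()) >= min_words


def validate_critique(critique: Critique) -> None:
    """Raise ValueError unless the critique names a risk or an alternative."""
    if critique.verdict not in {"approve", "revise"}:
        raise ValueError(f"unknown verdict: {critique.verdict!r}")
    findings = [
        item
        for item in critique.risks + critique.alternatives
        if is_substantive(item)
    ]
    # Applied to both verdicts here, which is slightly stricter than
    # only blocking empty approvals.
    if not findings:
        raise ValueError("critique must name at least one specific risk or alternative")
```

An empty approval then fails validation in the engine, no matter how
the critic model phrased it.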
Track 3 synthesizes. Your most capable model writes the actual
content, using the skeleton as structure and the critique as a lens.
The orchestration engine
The part I am most confident in is the control flow. If a model
decides when to run the quality steps, it can skip them, because "run
governance first" is just another instruction competing with "produce
the output now." So the orchestration is not a model. It is
deterministic code: a workflow engine, middleware, a service. It calls
each track in order, validates each result before the next runs, and
makes the governance outputs required inputs to synthesis, so
synthesis cannot execute without them. Deterministic code has no
training priors that could override the sequence. This is ordinary
good engineering, and it is the right call.
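The control flow described above can be sketched in a few lines. This
is an illustration under assumptions, not a definitive
implementation: ModelCall stands in for whatever client you use, the
validation is reduced to a non-empty check, and every name here
(GovernanceOutputs, StageError, run_pipeline) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

ModelCall = Callable[[str], str]  # prompt in, text out; placeholder for a real client


@dataclass(frozen=True)
class GovernanceOutputs:
    skeleton: str
    critique: str


class StageError(RuntimeError):
    """Raised when a stage returns output that fails validation."""


def _require(stage: str, output: str) -> str:
    # Stand-in validation: a real engine would run stage-specific checks.
    if not output or not output.strip():
        raise StageError(f"{stage} produced no usable output")
    return output


def synthesize(methodology: str, governance: GovernanceOutputs, writer: ModelCall) -> str:
    # The signature requires governance outputs, so there is no code
    # path that reaches the writer without them.
    prompt = (
        f"Methodology:\n{methodology}\n\n"
        f"Skeleton:\n{governance.skeleton}\n\n"
        f"Critique:\n{governance.critique}"
    )
    return _require("track3", writer(prompt))


def run_pipeline(
    methodology: str,
    structurer: ModelCall,
    challenger: ModelCall,
    writer: ModelCall,
) -> str:
    skeleton = _require("track1", structurer(methodology))
    critique = _require("track2", challenger(skeleton))
    governance = GovernanceOutputs(skeleton=skeleton, critique=critique)
    return synthesize(methodology, governance, writer)
```

The ordering lives in plain function calls, so a failed Track 2 stops
the run before Track 3 is ever invoked.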
For enterprise use, the engine also produces an audit record of what
was checked and that the checks ran. For a regulated company that
record is part of the deliverable.
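As a rough sketch of what one audit entry might hold, assuming a
JSON-lines log: the field names and the choice to store a SHA-256
hash of each stage output (rather than the output itself) are
assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_entry(stage: str, check: str, passed: bool, output: str) -> dict:
    """Record that a named check ran on a stage output, and its result."""
    return {
        "stage": stage,
        "check": check,
        "passed": passed,
        # The hash ties the entry to the exact text that was checked.
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }


def to_audit_log(entries: list[dict]) -> str:
    """Serialize entries as JSON lines, one check per line."""
    return "\n".join(json.dumps(entry, sort_keys=True) for entry in entries)
```

Because the engine writes these entries itself, the record shows
which checks ran even when a check fails.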
Where this is honestly weak
A reviewer outside our team pushed hard on this design, and the
critique was fair. The most important point: Track 2 reviews the
skeleton, before any final prose exists. The final synthesized text
from Track 3 is not currently reviewed by an independent model. We
open by criticizing self-checking, and then the actual output leans on
deterministic post-checks rather than a fresh adversarial pass. The
architecture generalizes to more tracks, and a post-synthesis review
track is the obvious next one. We are building toward it, and it is
exactly the kind of thing I want reviewed before we call it done.
The other honest trade-off is cost. Three model calls plus validation,
retries, and human routing is real overhead. The right question is
not "is this architecture correct." It is "are my failure costs high
enough to pay for it." For many features, self-review plus a validator
is fine. For enterprise-level companies producing high-stakes,
regulated, or audited output, the overhead is justified. That is who
this architecture is for.
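The "are my failure costs high enough" question reduces to simple
arithmetic. The sketch below is one way to frame it, assuming you can
estimate a per-output failure rate with and without the pipeline.
The function name and all inputs are hypothetical, and the figures in
the usage note are placeholders, not measurements.

```python
def break_even_failure_cost(
    extra_cost_per_output: float,
    p_fail_baseline: float,
    p_fail_pipeline: float,
) -> float:
    """Failure cost above which the extra pipeline overhead pays for itself.

    The pipeline is worth it when
        extra_cost < (p_fail_baseline - p_fail_pipeline) * failure_cost.
    """
    reduction = p_fail_baseline - p_fail_pipeline
    if reduction <= 0:
        raise ValueError("pipeline must lower the failure rate to ever pay off")
    return extra_cost_per_output / reduction
```

With placeholder inputs of 0.50 extra cost per output and a failure
rate that drops from 5% to 1%, the pipeline pays off once a single
failure costs more than about 12.50 in the same units.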
What we have found, and the ask
In our own testing, building the structure before any content is
generated sharply reduced fabricated figures, because missing data
surfaces as an explicit gap instead of an invented number. The sample
is small and self-run, which is exactly why outside review matters.
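One possible mechanism for surfacing gaps, sketched under
assumptions: if the skeleton lists the figures each section needs,
deterministic code can bind them from the supplied data and mark the
rest as gaps before the writer runs. The bind_figures function, the
gap marker format, and the figure names in the usage note are
hypothetical.

```python
GAP_MARKER = "[DATA GAP: {name}]"


def bind_figures(
    required: list[str], available: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Map each required figure to its value, or to an explicit gap marker."""
    bound: dict[str, str] = {}
    gaps: list[str] = []
    for name in required:
        if name in available:
            bound[name] = available[name]
        else:
            # The gap is written into the input, so the writer sees a
            # marker where it might otherwise invent a number.
            bound[name] = GAP_MARKER.format(name=name)
            gaps.append(name)
    return bound, gaps
```

For example, binding ["q3_revenue", "churn_rate"] against data that
only contains q3_revenue leaves churn_rate as a visible marker, and
the returned gaps list can be routed to a human.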
I want to share all of the findings we have, the strengths and the
weak spots, with people who ship production systems. I want the app
and the findings peer-reviewed by this community. If you have built
LLM features into real products, tell me where we are wrong. Comment
here or message me, and I will get you access and the full findings.
Krystian Swierk
Clief Notes
skool.com/cliefnotes
Jake Van Clief, giving you the Cliff notes on the new AI age.