Dual-View Architecture - Full Orchestration Engine

Krystian Swierk

2h • 📚 Resources & Finds

Here is a failure mode that ships more often than it should. The model

that writes an output is also the model that checks it. You send the

result back with "review this and flag anything weak," and the review

skews toward approval, because a model reviewing its own work shares

its own blind spots. This is a documented limitation, not a

hypothesis.

It is not universal. Plenty of teams already run multi-step pipelines,

separate critic models, and output validators. But self-review as the

only quality gate is still common, and for enterprise-grade output,

where a confident wrong answer is a liability, it is worth solving

properly. This post is how we approached it, where the design is

genuinely sound, and where it is not. I want this community to review

both.

Why self-review is weak

Two things work against a single model checking itself.

The first is autoregressive momentum. A model picks each word partly

from the words it already wrote, so the opening of an output

conditions everything after it. A model that has generated thousands

of reports has a strong format prior: summary, background, analysis,

recommendation. Your spec might say to lead with the competitive

threat and drop the background. That instruction competes with the

prior, and a few sentences in, the prior often wins. The output looks

like a report. It is not your report.

The second is that evaluation has a prior too. In training data,

reviews of polished work skew positive, so a model asked to "evaluate

this" leans toward approval. A reviewer that is the same model, or the

same model family, shares the writer's biases.

Here is the honest version of the claim, and it is less dramatic than

how this is usually sold. Prompting is not powerless; it is unreliable

as your only quality control. Self-critique does catch real errors.

It just catches fewer than an independent reviewer would, and you

cannot tell from the output which case you got. So you do not throw

prompting away. You add structure around it.

Dual-view: structured prompting, not no prompting

Dual-view splits one generation into three separate model calls, each

in its own isolated context, each on a different model. I want to be

precise about what this is and is not. All three stages are still

prompted model calls. We did not escape prompting. We structured it,

so that no single model both produces something and signs off on it.

Track 1 builds structure. It reads your methodology and produces a

skeleton of headings and hierarchy, with a filter that blocks prose.

Honest caveat: cleanly separating "structure" from "content" is its

own hard problem, and that filter is a heuristic, not a solved

science.
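
A minimal sketch of what a prose-blocking heuristic could look like, assuming the skeleton arrives as plain lines of text. The word limit, the punctuation rule, and the function names are illustrative assumptions, not a description of any production filter. It will misfire on headings such as "Q3 vs. Q4", which is the sense in which this stays a heuristic.

```python
import re

MAX_HEADING_WORDS = 12  # assumed threshold; tune against your own specs


def looks_like_prose(line: str) -> bool:
    """Heuristic: flag a skeleton line that reads like a sentence, not a heading."""
    # Drop leading list/heading markers and numbering such as "## 2.1 " or "- ".
    text = line.strip().lstrip("#-*0123456789. ").strip()
    if not text:
        return False
    too_long = len(text.split()) > MAX_HEADING_WORDS
    # Sentence punctuation followed by whitespace or end of line.
    has_sentence_punct = bool(re.search(r"[.!?](\s|$)", text))
    return too_long or has_sentence_punct


def filter_skeleton(skeleton: str) -> tuple[list[str], list[str]]:
    """Split skeleton lines into (kept headings, rejected prose)."""
    kept, rejected = [], []
    for line in skeleton.splitlines():
        if not line.strip():
            continue
        (rejected if looks_like_prose(line) else kept).append(line)
    return kept, rejected
```

Returning the rejected lines, rather than silently dropping them, lets the orchestration code decide whether to retry Track 1 or route the skeleton to a human.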

Track 2 challenges. A second model, ideally from a different family,

reviews the skeleton and cannot approve without naming a specific risk

or alternative. Different families do carry somewhat different

biases, so this reduces shared blind spots. It does not eliminate

them, since most large models draw on overlapping training data and similar tuning,

and it costs you a second integration and added latency. It is a real

improvement, not magic.
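
The "cannot approve without naming a specific risk or alternative" rule is easiest to enforce in code rather than in the prompt. A minimal sketch, assuming the challenger is asked to return JSON with `verdict`, `risks`, and `alternatives` fields. That schema and the names below are hypothetical.

```python
import json


class CritiqueRejected(ValueError):
    """Raised when the challenger's output does not meet the minimum bar."""


def validate_critique(raw: str) -> dict:
    """Parse the challenger's JSON and require at least one named risk or alternative."""
    try:
        critique = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise CritiqueRejected(f"not valid JSON: {exc}") from exc
    if not isinstance(critique, dict):
        raise CritiqueRejected("expected a JSON object")

    findings = []
    for key in ("risks", "alternatives"):
        items = critique.get(key, [])
        if isinstance(items, list):
            # Keep only non-empty strings; whitespace-only entries do not count.
            findings += [s.strip() for s in items if isinstance(s, str) and s.strip()]
    if not findings:
        raise CritiqueRejected("approval without a named risk or alternative")
    return critique
```

This only checks that something was named, not that it is a good critique. Judging quality still needs a human or a further review step.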

Track 3 synthesizes. Your most capable model writes the actual

content, using the skeleton as structure and the critique as a lens.

The orchestration engine

The part I am most confident in is the control flow. If a model

decides when to run the quality steps, it can skip them, because "run

governance first" is just another instruction competing with "produce

the output now." So the orchestration is not a model. It is

deterministic code: a workflow engine, middleware, a service. It calls

each track in order, validates each result before the next runs, and

makes the governance outputs required inputs to synthesis, so

synthesis cannot execute without them. Deterministic code has no

training priors that could override that order. This is ordinary good engineering, and it

is the right call.
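
The control flow described above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the actual engine: each track is passed in as a plain callable (prompt in, text out) standing in for whatever provider SDK you use, the validators are trivial non-empty checks, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

ModelCall = Callable[[str], str]  # prompt in, text out; stands in for any provider SDK


class StageFailed(RuntimeError):
    """A track did not produce valid output within its retry budget."""


@dataclass(frozen=True)
class Governance:
    skeleton: str
    critique: str


def run_stage(name: str, call: ModelCall, prompt: str,
              is_valid: Callable[[str], bool], max_attempts: int = 2) -> str:
    """Call one track and validate its output; retry, then fail loudly."""
    for _ in range(max_attempts):
        result = call(prompt)
        if is_valid(result):
            return result
    raise StageFailed(f"{name} failed validation after {max_attempts} attempts")


def synthesize(writer: ModelCall, methodology: str, gov: Governance) -> str:
    """Track 3. The Governance argument is required, so this cannot run without it."""
    prompt = (f"Methodology:\n{methodology}\n\nSkeleton:\n{gov.skeleton}\n\n"
              f"Critique:\n{gov.critique}\n\nWrite the content.")
    return writer(prompt)


def orchestrate(methodology: str, builder: ModelCall,
                challenger: ModelCall, writer: ModelCall) -> str:
    """Fixed order: structure, challenge, synthesis. No model decides what runs."""
    not_empty = lambda s: bool(s.strip())
    skeleton = run_stage("track1", builder, methodology, not_empty)
    critique = run_stage("track2", challenger, skeleton, not_empty)
    return synthesize(writer, methodology, Governance(skeleton, critique))
```

The design point is in the signatures: `synthesize` takes a `Governance` value, and the only way to build one is to have a skeleton and a critique that passed validation. If Track 2 fails, `StageFailed` is raised and the writer is never called.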

For enterprise use, the engine also produces an audit record of what

was checked and that the checks ran. For a regulated company that

record is part of the deliverable.
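
A minimal sketch of what such an audit record might contain, assuming one entry per stage and a hash of each output rather than the output itself. The field names are illustrative assumptions, and a real deployment would have its own retention and signing requirements.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_entry(stage: str, model: str, output: str, passed: bool) -> dict:
    """One record per stage: what ran, on which model, and whether validation passed."""
    return {
        "stage": stage,
        "model": model,
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
        "validation_passed": passed,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def audit_record(entries: list[dict]) -> str:
    """Serialize the run; all_checks_passed is true only if every stage passed."""
    record = {
        "stages": entries,
        # An empty run counts as not passed, never as vacuously passed.
        "all_checks_passed": bool(entries) and all(e["validation_passed"] for e in entries),
    }
    return json.dumps(record, indent=2, sort_keys=True)
```

Because the orchestration code writes these entries itself, the record reflects what actually executed, not what a model reports having done.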

Where this is honestly weak

A reviewer outside our team pushed hard on this design, and the

critique was fair. The most important point: Track 2 reviews the

skeleton, before any final prose exists. The final synthesized text

from Track 3 is not currently reviewed by an independent model. We

open by criticizing self-checking, and then the actual output leans on

deterministic post-checks rather than a fresh adversarial pass. The

architecture generalizes to more tracks, and a post-synthesis review

track is the obvious next one. We are building toward it, and it is

exactly the kind of thing I want reviewed before we call it done.

The other honest trade-off is cost. Three model calls plus validation,

retries, and human routing is real overhead. The right question is

not "is this architecture correct." It is "are my failure costs high

enough to pay for it." For many features, self-review plus a validator

is fine. For enterprise-level companies producing high-stakes,

regulated, or audited output, the overhead is justified. That is who

this architecture is for.
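
The "are my failure costs high enough" question can be made concrete with a break-even calculation. The sketch below is a simplification that assumes a single average cost per failure and known error rates. The numbers in the usage note are invented for illustration and are not measurements from our system.

```python
def break_even_failure_cost(extra_cost_per_output: float,
                            baseline_error_rate: float,
                            pipeline_error_rate: float) -> float:
    """Failure cost above which the extra pipeline spend pays for itself.

    The pipeline is worth it when
        extra_cost_per_output < (baseline_error_rate - pipeline_error_rate) * failure_cost
    so the break-even failure cost is the extra cost divided by the errors avoided.
    """
    avoided = baseline_error_rate - pipeline_error_rate
    if avoided <= 0:
        raise ValueError("pipeline must reduce the error rate to break even")
    return extra_cost_per_output / avoided
```

For example, if the extra calls add 0.50 per output and the error rate falls from 4% to 2%, `break_even_failure_cost(0.50, 0.04, 0.02)` comes out at about 25. If a wrong answer costs you less than that, self-review plus a validator is probably the better trade.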

What we have found, and the ask

In our own testing, building the structure before any content is

generated sharply reduced fabricated figures, because missing data

surfaces as an explicit gap instead of an invented number. The sample

is small and self-run, which is exactly why outside review matters.
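
One way to picture the "explicit gap" behavior, as a minimal sketch: if each skeleton slot that needs a figure is filled from supplied data before any prose is written, a missing value becomes a visible marker. The marker string and function names are hypothetical, and this is a simplified illustration rather than our implementation.

```python
GAP = "[DATA GAP]"  # assumed marker; any unambiguous sentinel works


def fill_slots(slots: list[str], data: dict[str, str]) -> dict[str, str]:
    """Map each skeleton slot to supplied data, or to an explicit gap marker."""
    return {slot: data.get(slot, GAP) for slot in slots}


def gaps(filled: dict[str, str]) -> list[str]:
    """List the slots that still need real data before synthesis."""
    return [slot for slot, value in filled.items() if value == GAP]
```

The orchestration code can then refuse to run synthesis while `gaps(...)` is non-empty, or pass the markers through so the writer states the gap instead of inventing a number.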

I want to share all of the findings we have, the strengths and the

weak spots, with people who ship production systems. I want the app

and the findings peer-reviewed by this community. If you have built

LLM features into real products, tell me where we are wrong. Comment

here or message me, and I will get you access and the full findings.

@Jake Van Clief

@David Vogel
