Most people think weak AI output means a weak model. Wrong.
Here's the reality. The output is bad because nothing grades it before it ships.
Anthropic just demonstrated this with a feature called Outcomes. You write a rubric for what good looks like. A separate agent scores every output against it and kicks back anything that fails. The agent that did the work never grades itself.
No model change. Just a grading loop.
The result on their benchmarks: 10.1% better PowerPoint quality and 8.4% better Word docs.
You can copy the same loop into any build. I wrote up the exact setup: the 5 steps, the copy-paste grader prompt, and the 3 mistakes that kill it.
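Here is roughly what that loop looks like in code. Treat it as a minimal sketch, not Anthropic's implementation. The names `run_with_grader`, `Verdict`, `stub_worker`, and `stub_grader` are hypothetical, made up for illustration. In a real build, the worker and the grader would each wrap a separate model call. Here they are plain functions so the control flow is easy to see.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    """Grader result: pass/fail plus feedback for the next attempt."""
    passed: bool
    feedback: str


# Hypothetical signatures, assumed for this sketch only.
Worker = Callable[[str, Optional[str]], str]  # (task, feedback) -> draft
Grader = Callable[[str, str], Verdict]        # (rubric, draft) -> verdict


def run_with_grader(
    task: str,
    rubric: str,
    worker: Worker,
    grader: Grader,
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Return (accepted_draft, attempts_used), or raise if nothing passes."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        draft = worker(task, feedback)
        # The grader is a separate callable: the worker never scores itself.
        verdict = grader(rubric, draft)
        if verdict.passed:
            return draft, attempt
        # Kick the failure back so the next draft can address it.
        feedback = verdict.feedback
    raise RuntimeError(f"No draft passed the rubric in {max_attempts} attempts")


def stub_worker(task: str, feedback: Optional[str]) -> str:
    """Stand-in for the working agent; adds a source only after feedback."""
    draft = f"Summary for: {task}"
    if feedback:
        draft += " Source: internal report."
    return draft


def stub_grader(rubric: str, draft: str) -> Verdict:
    """Stand-in for the grading agent; checks one toy rubric rule."""
    if "Source:" in draft:
        return Verdict(passed=True, feedback="")
    return Verdict(passed=False, feedback="Add a source line.")


if __name__ == "__main__":
    draft, attempts = run_with_grader(
        "Q3 results", "Every summary must cite a source.", stub_worker, stub_grader
    )
    print(attempts, draft)
```

The cap on attempts matters. Without it, a rubric that can never be satisfied would loop forever.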