Something I keep hitting and don't have a clean answer for. Catching when an automation BREAKS is easy: it errors, nothing comes out, you get an alert. Catching when it quietly gets WORSE is the hard one. The thing runs fine and returns something that looks right, but the quality slipped and nobody notices for two weeks.

For the mechanical parts (did the row get created, did the email send) checking is simple. For anything open-ended (a draft, a summary, a reply, a piece of content) I have no clean way to score it automatically. "It produced text" is not the same as "it produced good text."

What I do now is a mix: a few hard checks on the mechanical parts, a human spot-checking a sample, and saving the bad outputs so I can see patterns. It works, but it's manual and it doesn't scale past a handful of automations.

So the open question for people running this in production: how are you measuring whether an agent's output is actually good over time, not just that it ran? Is anyone using a model to grade another model's output, and does that actually catch the slips or does it just rubber-stamp them? Curious what's working.
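
To make the "mix" above concrete, here is a rough Python sketch of hard checks plus a sampled human review queue plus a saved pile of bad outputs. Everything in it is an assumption for illustration: the names (QualityLog, hard_checks, record), the 20-word minimum, the "{{" template check, and the 10% sample rate are all made up, not taken from any real setup or library.

```python
import random
from dataclasses import dataclass, field


@dataclass
class QualityLog:
    """Hypothetical in-memory store; a real setup would likely use a DB or sheet."""
    sample_rate: float = 0.1  # assumed: fraction of passing outputs a human looks at
    rng: random.Random = field(default_factory=random.Random)
    review_queue: list = field(default_factory=list)  # sampled for human spot checks
    failures: list = field(default_factory=list)      # saved bad outputs, for patterns


def hard_checks(output: str, min_words: int = 20) -> list[str]:
    """Return the names of the mechanical checks that failed (empty list = pass)."""
    problems = []
    if not output.strip():
        problems.append("empty")
    if len(output.split()) < min_words:
        problems.append("too_short")
    if "{{" in output:  # assumed check: a template placeholder leaked through
        problems.append("unfilled_template")
    return problems


def record(log: QualityLog, run_id: str, output: str) -> list[str]:
    """Run the hard checks, save failures, and sample some passes for review."""
    problems = hard_checks(output)
    if problems:
        log.failures.append({"run_id": run_id, "problems": problems, "output": output})
    elif log.rng.random() < log.sample_rate:
        log.review_queue.append({"run_id": run_id, "output": output})
    return problems
```

Counting the "problems" labels in failures week over week is one way to see patterns. Note this only covers the mechanical side: an output can pass every check here and still be worse than last month's, which is exactly the gap the question is about.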