I’m seeing a recurring pattern with long-running agents: a lot of failures aren’t because the model is “bad”, but because the workflow is packaged poorly.
If an agent is held together by one giant prompt, it tends to drift, forget context, and turn into “prompt spaghetti.”
The cleaner approach (from OpenAI’s Skills + Shell + Compaction framing) is to treat the workflow like software: procedures live in a skill, execution happens in a real environment (Shell), and context gets compacted so runs can continue without falling apart.
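To make the compaction piece concrete, here’s a minimal sketch in Python. It assumes a placeholder summarize() call and a rough chars-per-token heuristic, not any specific SDK API:

```python
# Minimal sketch of the compaction idea, not any particular SDK's API.
# Assumptions: `summarize` stands in for a model call that condenses text,
# and ~4 characters per token is a rough heuristic.
from dataclasses import dataclass


@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool"
    content: str


def estimate_tokens(turns: list[Turn]) -> int:
    # Rough heuristic: about 4 characters per token.
    return sum(len(t.content) for t in turns) // 4


def summarize(turns: list[Turn]) -> str:
    # Placeholder for a model call that condenses older turns into a short
    # running summary (decisions made, files touched, open TODOs).
    return "Summary of earlier work: " + "; ".join(t.content[:40] for t in turns)


def compact(history: list[Turn], budget_tokens: int = 8000, keep_recent: int = 10) -> list[Turn]:
    """Fold older turns into a summary so a long run can keep going."""
    if estimate_tokens(history) <= budget_tokens:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary = Turn(role="assistant", content=summarize(older))
    return [summary] + recent
```

In practice you’d run compact() on the history before each model request, so the agent keeps its recent working context plus a condensed record of everything that came before.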
A detail that stood out: Glean.ai saw skill routing accuracy initially drop by about 20% in evals, largely because descriptions weren’t written like routing logic. The takeaway is simple: skill descriptions should be decision boundaries, not marketing copy.

A few practical habits I’m adopting (rough sketch below): write clear “use when / don’t use when” rules, add negative examples when skills can be confused, keep templates inside the skill (not the system prompt), and when reliability matters, explicitly instruct “Use the <skill name> skill.”
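Here’s roughly what I mean by a description that works as a decision boundary. The field names are illustrative, not the actual Skills schema:

```python
# Illustrative only: these field names are hypothetical, not the real Skills
# format. The point is that the description reads like routing logic, with
# explicit use / don't-use boundaries and a negative example.
salesforce_skill = {
    "name": "update_salesforce_opportunity",
    "description": (
        "Use when the user asks to create or update a Salesforce opportunity, "
        "account, or contact. Don't use for reporting or analytics questions; "
        "route those to the reporting skill."
    ),
    "negative_examples": [
        "What were our Q3 win rates?",  # analytics question, should not route here
    ],
    # The template lives inside the skill, not the system prompt, so it only
    # enters context when the skill is actually invoked.
    "template": "Opportunity: {name}\nStage: {stage}\nAmount: {amount}",
}
```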
There’s also a solid datapoint in the post: a Salesforce-oriented skill example improved eval accuracy from 73% to 85%, and time-to-first-token dropped by 18.1%.
Curious: what’s your biggest failure mode with long-running agents right now?