Dionny Chejito

I just read a study I had to share with you guys. Eight months ago, the best AI agent in the world could complete only 2.5% of real freelance projects to a client-acceptable standard. Today that number is 16.1%.

The Remote Labor Index tests AI agents on actual commissioned work (3D/CAD, architecture, video, web dev, and more). Human evaluators judge every deliverable against a professional's paid output, not against a benchmark score. The new leader is Anthropic's Fable 5, roughly double Opus 4.8 (8.3%) and well ahead of GPT-5.5 (6.3%).

Here's the part I find reassuring: the researchers also tried replacing human evaluators with an AI judge. It overestimated the newest models' performance by up to 3x. Turns out we can't yet trust an AI to reliably judge another AI's work; human evaluation is still doing the heavy lifting.

So no, humans aren't out of the loop yet. But going from 2.5% to 16.1% in under 8 months is the kind of curve that should have people paying attention, regardless of industry.

Curious how others here read this: signal of what's coming, or still early enough not to worry?

Source: https://safe.ai/blog/significant-increase-in-digital-labor-automation

Dionny Chejito

0 likes • 41m

The 3x overestimation by the AI judge is the part I'd underline. It means any automated quality check we put in place has a blind spot unless a human reviews the review occasionally.

Thanh Dinh

3d •

General Discussion 💬

How are you checking your agent's output is still good, not just that it ran?

Something I keep hitting and don't have a clean answer for. Catching when an automation BREAKS is easy. It errors, nothing comes out, you get an alert. Catching when it quietly gets WORSE is the hard one. The thing runs fine and returns something that looks right, but the quality slipped and nobody notices for two weeks.

For the mechanical parts (did the row get created, did the email send) this is simple. For anything open-ended (a draft, a summary, a reply, a piece of content) I have no clean way to score it automatically. "It produced text" is not the same as "it produced good text."

What I do now is a mix: a few hard checks on the mechanical parts, a human spot-checking a sample, and saving the bad outputs so I can see patterns. It works, but it's manual and it doesn't scale past a handful of automations.

So the open question for people running this in production: how are you measuring whether an agent's output is actually good over time, not just that it ran? Is anyone using a model to grade another model's output, and does that actually catch the slips or does it just rubber-stamp them? Curious what's working.
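
For anyone who wants a starting point, here is a minimal sketch of that mix in Python: hard checks, a random sample routed to a human, and a log of failures. The function names, the specific checks, the 2000-character budget, and the 10% sample rate are all assumptions for illustration, not something I run as-is.

```python
import random

def hard_checks(output: str) -> list:
    """Return the names of failed mechanical checks (empty list = all passed)."""
    failures = []
    if not output.strip():
        failures.append("empty")
    if len(output) > 2000:  # assumed length budget, tune per automation
        failures.append("too_long")
    return failures

def triage(outputs, sample_rate=0.1, seed=0):
    """Split outputs into hard-check failures and a random sample for human review."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    bad, review = [], []
    for text in outputs:
        failures = hard_checks(text)
        if failures:
            # Keep the failing output with its reasons so patterns show up later.
            bad.append({"output": text, "failures": failures})
        elif rng.random() < sample_rate:
            review.append(text)
    return bad, review

bad, review = triage(["", "A fine summary.", "x" * 2001], sample_rate=1.0)
print(len(bad), "failed hard checks;", len(review), "queued for human review")
```

The hard checks only cover the mechanical side. The sampled queue is still where a person decides whether the text is actually good.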

Dionny Chejito

0 likes • 2d

I keep the golden set fixed for comparability, but I add one rotating wildcard example pulled from recent live traffic each week. That gives me both a stable trend and a drift signal.

Dionny Chejito

0 likes • 57m

Good question. I promote the wildcard into the permanent set after a week if it consistently catches something the originals missed. The rotation keeps fresh examples in the pipeline either way.
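
A rough sketch of how that rotation and promotion could be tracked, in Python. The `EvalSet` class, its field names, and the threshold of 3 catches are hypothetical, chosen only to make "consistently catches something" concrete.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvalSet:
    golden: List[str]               # permanent examples, kept stable for the trend line
    wildcard: Optional[str] = None  # this week's example pulled from live traffic
    wildcard_catches: int = 0       # times only the wildcard flagged a problem

    def rotate(self, new_wildcard: str, promote_threshold: int = 3) -> bool:
        """Swap in a new wildcard, promoting the old one first if it earned it."""
        promoted = False
        if self.wildcard is not None and self.wildcard_catches >= promote_threshold:
            self.golden.append(self.wildcard)
            promoted = True
        self.wildcard = new_wildcard
        self.wildcard_catches = 0
        return promoted

    def cases(self) -> List[str]:
        """All examples to run this week: the golden set plus the current wildcard."""
        return self.golden + ([self.wildcard] if self.wildcard else [])

evals = EvalSet(golden=["case-a", "case-b"])
evals.rotate("live-week-1")   # first wildcard, nothing to promote yet
evals.wildcard_catches = 3    # it caught three slips the originals missed
print(evals.rotate("live-week-2"), evals.cases())
```

Since promoted examples change the permanent set, it may be worth reporting scores on the original examples separately so the long-run trend stays comparable.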

Frits Erasmus

9h •

Support Needed 💻

API Integration

I have started developing an app using Claude Code. It is fairly simple: the user logs on to the app, views data in a Supabase database, then updates and saves it. So far, everything has been easy.

Help needed: only the data fields to be edited will be pulled from the existing database of a web CRM system. Once the user has edited the data in the app, the updated data must be pushed back to the CRM database.

In principle I understand what an API must do and, at a high level, how it works. But I have never made an API call or POST request myself (I am not a developer), so I would like to know what I need from the developers of the CRM system. If someone has a playbook I can follow for that integration, I would really appreciate some help or pointers.

This is all I have for now, so I would also like to know what else, if anything, I should ask for: https://www.hireandservice.com/terms/developer

Dionny Chejito

0 likes • 1h

Ask the CRM devs for three specific things: the base URL and authentication method for the API, the exact endpoint paths for the tables you need to read and write, and a sample request/response body for each. That covers most of the wiring.
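
To show where those three answers end up, here is a minimal Python sketch of a write-back request. Everything specific in it is a placeholder I made up: the base URL, the `/contacts/{id}` path, the bearer-token auth, and the use of PATCH. The CRM's real values will come from its developers.

```python
import json
import urllib.request

BASE_URL = "https://crm.example.com/api/v1"  # hypothetical: ask for the real base URL
API_TOKEN = "replace-me"                     # hypothetical: depends on their auth method

def build_update_request(record_id: str, fields: dict) -> urllib.request.Request:
    """Build (but do not send) a request that writes edited fields back to the CRM."""
    return urllib.request.Request(
        url=f"{BASE_URL}/contacts/{record_id}",   # assumed endpoint path
        data=json.dumps(fields).encode("utf-8"),  # request body as JSON
        method="PATCH",                           # assumed; some APIs expect PUT or POST
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
    )

req = build_update_request("123", {"phone": "555-0100"})
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```

Each placeholder maps to one of the questions: base URL and auth fill the constants, the endpoint path fills the URL, and the sample body tells you what `fields` should look like.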

Titus Blair

5h •

General Discussion 💬

If you use a coding agent, have it prove its work instead of describing it

The new shot-scraper 1.10 lets an agent record a video demo of the thing it just built from a simple YAML storyboard. Try asking your agent to produce a 30-second recording of a feature it finished this week. Watching the demo catches problems a text summary hides. shot-scraper video source: https://aititus.com/news

Dionny Chejito

0 likes • 2h

The video catches the blind spots you don't know you have. A text summary can describe the feature correctly, but watching it run reveals the missing loading state or the wrong color on hover.

Dionny Chejito

3h •

General Discussion 💬

When an AI agent gives you a result that feels

When an AI agent gives you a result that feels like magic, don't just celebrate. Capture the recipe. A single output is worth little if you can't recreate it.

I learned this the hard way after spending an hour getting a Claude agent to produce the exact kind of analysis I wanted. I was thrilled, until I ran the same prompt again and got something completely different.

The fix was simple: I asked the agent to "write me the system prompt that would create this exact response." It produced a clean, reusable instruction that I saved in my prompts folder. Now every time I iterate and find a good version, I extract the system prompt and store it. Over time you build a library of proven instructions, not just lucky one-offs. You turn random success into a repeatable asset.

Next time you get an output that nails it, type: "now write the system prompt that generated this output." Then save that prompt.

What's one prompt you wish you had captured?
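
If you want the prompts folder to be more than loose text files, here is a small Python sketch of one way to store and reload captured prompts. The JSON layout and the `save_prompt` / `load_prompt` helpers are my own assumptions, not a standard format.

```python
import json
from datetime import date
from pathlib import Path

def save_prompt(folder: Path, name: str, system_prompt: str, notes: str = "") -> Path:
    """Write a captured system prompt to <folder>/<name>.json and return the path."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.json"
    record = {
        "name": name,
        "saved_on": date.today().isoformat(),
        "system_prompt": system_prompt,
        "notes": notes,  # e.g. which task or output this prompt came from
    }
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path

def load_prompt(folder: Path, name: str) -> str:
    """Read back the stored system prompt text for reuse."""
    record = json.loads((folder / f"{name}.json").read_text(encoding="utf-8"))
    return record["system_prompt"]
```

Usage would look like `save_prompt(Path("prompts"), "quarterly-analysis", captured_text)` right after the agent writes the prompt, then `load_prompt(Path("prompts"), "quarterly-analysis")` the next time you need it.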

Dionny Chejito

0 likes • 2h

That's a great addition, especially the evaluation criteria part. I've started doing a lightweight version of that after losing a few good results to missing context.

Dionny Chejito

@dionny-chejito-4957

building AI agents & automations. i share what actually works, and what quietly breaks

Joined May 29, 2026