Memberships

Imperium Academy™

67k members • Free

The AI Advantage

126.2k members • Free

AI Automation Agency Hub

327.6k members • Free

AI Automation Society

417k members • Free

AI Money Lab

84.5k members • Free

Start Writing Online

20.8k members • Free

Ghostwriters Anonymous

16.9k members • Free

Digital Wealth Creators

25.1k members • Free

The One-Person Business

3.4k members • Free

203 contributions to AI Automation Society
Are humans close to being replaced by AI?
I just read a study I had to share with you guys. Eight months ago, the best AI agent in the world could complete only 2.5% of real freelance projects to a client-acceptable standard. Today that number is 16.1%. The Remote Labor Index tests AI agents on actual commissioned work (3D/CAD, architecture, video, web dev, and more), with every deliverable judged by human evaluators against a professional's paid output, not a benchmark score. The new leader is Anthropic's Fable 5, at roughly double Opus 4.8 (8.3%) and well ahead of GPT-5.5 (6.3%). Here's the part I find reassuring: the researchers also tried replacing human evaluators with an AI judge. It overestimated the newest models' performance by up to 3x. Turns out we can't yet trust an AI to reliably judge another AI's work; human evaluation is still doing the heavy lifting. So no, humans aren't out of the loop yet. But going from 2.5% to 16.1% in under 8 months is the kind of curve that should have people paying attention, regardless of industry. Curious how others here read this: a signal of what's coming, or still early enough not to worry? Source: https://safe.ai/blog/significant-increase-in-digital-labor-automation
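For anyone who wants to sanity-check the ratios, here is a quick arithmetic sketch using only the percentages quoted in the post. The variable names are mine, not from the study.

```python
# Completion rates quoted in the post, in percent.
rates = {"Fable 5": 16.1, "Opus 4.8": 8.3, "GPT-5.5": 6.3}
earlier_best = 2.5  # best agent's rate eight months earlier, per the post

growth = rates["Fable 5"] / earlier_best        # change in the top score
vs_opus = rates["Fable 5"] / rates["Opus 4.8"]  # the "roughly double" claim
vs_gpt = rates["Fable 5"] / rates["GPT-5.5"]

print(f"Top score grew {growth:.1f}x")
print(f"Leader vs Opus 4.8: {vs_opus:.1f}x")
print(f"Leader vs GPT-5.5: {vs_gpt:.1f}x")
```

The top score is a bit over six times the earlier figure, and the leader is a little under twice the Opus 4.8 rate, which matches "roughly double".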
0 likes • 41m
The 3x overestimation by the AI judge is the part I'd underline. It means any automated quality check we put in place has a blind spot unless a human reviews the review occasionally.
How are you checking your agent's output is still good, not just that it ran?
Something I keep hitting and don't have a clean answer for. Catching when an automation BREAKS is easy. It errors, nothing comes out, you get an alert. Catching when it quietly gets WORSE is the hard one. The thing runs fine and returns something that looks right, but the quality slipped and nobody notices for two weeks. For the mechanical parts (did the row get created, did the email send) this is simple. For anything open-ended (a draft, a summary, a reply, a piece of content) I have no clean way to score it automatically. "It produced text" is not the same as "it produced good text." What I do now is a mix: a few hard checks on the mechanical parts, a human spot-checking a sample, and saving the bad outputs so I can see patterns. It works, but it's manual and it doesn't scale past a handful of automations. So the open question for people running this in production: how are you measuring whether an agent's output is actually good over time, not just that it ran? Is anyone using a model to grade another model's output, and does that actually catch the slips or does it just rubber-stamp them? Curious what's working.
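A minimal sketch of the mix described above (hard checks, a sampled human review queue, and a log of bad outputs). Everything here is an assumption for illustration: the check names, the `min_chars` threshold, the `required_terms` idea, and the 10% sample rate are hypothetical, not a recommended standard.

```python
import random
from dataclasses import dataclass, field


@dataclass
class QualityLog:
    """Holds failed outputs and a sample queued for human review."""
    failures: list = field(default_factory=list)
    review_queue: list = field(default_factory=list)


def hard_checks(output, min_chars=50, required_terms=()):
    """Return the names of the mechanical checks this output fails."""
    text = output.strip()
    problems = []
    if not text:
        problems.append("empty")
    if len(text) < min_chars:
        problems.append("too_short")
    for term in required_terms:
        if term.lower() not in text.lower():
            problems.append(f"missing:{term}")
    return problems


def record(output, log, sample_rate=0.1, rng=random, **check_args):
    """Save failing outputs; queue a random share of passing ones for a human."""
    problems = hard_checks(output, **check_args)
    if problems:
        # Keep the bad output with its reasons so patterns show up later.
        log.failures.append({"output": output, "problems": problems})
    elif rng.random() < sample_rate:
        log.review_queue.append(output)
```

The hard checks only catch mechanical failures. The sampled queue is where a person still has to judge whether the text is actually good.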
0 likes • 2d
I keep the golden set fixed for comparability, but I add one rotating wildcard example pulled from recent live traffic each week. That gives me both a stable trend and a drift signal.
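Roughly how that could look in code. This is a sketch under assumptions: the example IDs, `weekly_eval_set`, `weekly_report`, and the `score` callable are hypothetical names, and `score` stands in for whatever grading you already do.

```python
import random

GOLDEN_SET = ["case-a", "case-b", "case-c"]  # hypothetical fixed examples


def weekly_eval_set(recent_traffic, rng=random):
    """Return the fixed golden set plus one wildcard from recent live traffic."""
    wildcard = rng.choice(recent_traffic) if recent_traffic else None
    return list(GOLDEN_SET), wildcard


def weekly_report(score, golden, wildcard):
    """Score golden and wildcard separately so the golden trend stays comparable."""
    golden_avg = sum(score(example) for example in golden) / len(golden)
    wildcard_score = score(wildcard) if wildcard is not None else None
    return {"golden_avg": golden_avg, "wildcard": wildcard_score}
```

Keeping the two numbers apart is the point: the golden average is the stable trend, and the wildcard score is the drift signal.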
0 likes • 57m
Good question. I promote the wildcard into the permanent set after a week if it consistently catches something the originals missed. The rotation keeps fresh examples in the pipeline either way.
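One way to write the promotion rule down, as a sketch. I am assuming "consistently" means the wildcard flagged a problem in every run that week while the original set did not. `should_promote` and `promote` are hypothetical helpers.

```python
def should_promote(wildcard_caught, golden_caught):
    """Each argument holds one bool per run during the trial week."""
    if not wildcard_caught or len(wildcard_caught) != len(golden_caught):
        return False
    # Promote only if every run shows the wildcard catching what the originals missed.
    return all(w and not g for w, g in zip(wildcard_caught, golden_caught))


def promote(golden_set, wildcard):
    """Return a new golden set with the wildcard appended (no duplicates)."""
    if wildcard in golden_set:
        return list(golden_set)
    return list(golden_set) + [wildcard]
```

A looser threshold (say, most runs instead of all) is a reasonable variation. The strict version here is just the simplest to reason about.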
API Integration
I am planning, and have started, developing an app using Claude Code. It is fairly simple: the user logs into the app, views data in a Supabase database, then updates and saves it. So far, everything is easy. Help needed: only the data fields to be edited will be pulled from the existing database of a web CRM system, and once the user has edited the data in the app, the updated data must be pushed back to the CRM database. In principle I understand what an API must do and, at a high level, how it works, but since I have never made an API call or POST request (I am not a developer), I would like to know what I need from the developers of the CRM system. If someone has a playbook I can follow for that integration, I would really appreciate the help and pointers. This is all I have for now, so I would like to know what else, if anything, I should ask for: https://www.hireandservice.com/terms/developer
0 likes • 1h
Ask the CRM devs for three specific things: the base URL and authentication method for the API, the exact endpoint paths for the tables you need to read and write, and a sample request/response body for each. That covers most of the wiring.
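To show where those three things plug in, here is a sketch that builds (but does not send) a read and an update request with Python's standard library. Every specific is a placeholder: the base URL, the bearer-token auth, the `/contacts/{id}` path, the PATCH method, and the field names are assumptions until the CRM developers confirm the real ones.

```python
import json
import urllib.request

BASE_URL = "https://crm.example.com/api"  # placeholder: ask for the real base URL
API_TOKEN = "YOUR_TOKEN"                  # placeholder: auth method may differ


def _headers():
    # Assumes bearer-token auth; the CRM may use an API key or OAuth instead.
    return {
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    }


def build_read_request(record_id):
    """GET one record. The /contacts path is a hypothetical endpoint."""
    url = f"{BASE_URL}/contacts/{record_id}"
    return urllib.request.Request(url, headers=_headers(), method="GET")


def build_update_request(record_id, changed_fields):
    """Push only the edited fields back. PATCH is assumed; it may be PUT or POST."""
    url = f"{BASE_URL}/contacts/{record_id}"
    body = json.dumps(changed_fields).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=_headers(), method="PATCH")
```

Once the real values are filled in, passing a request to `urllib.request.urlopen` sends it. The sample request/response bodies from the CRM team tell you what `changed_fields` should look like.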
If you use a coding agent, have it prove its work instead of describing it
The new shot-scraper 1.10 lets an agent record a video demo of the thing it just built from a simple YAML storyboard. Try asking your agent to produce a 30-second recording of a feature it finished this week. Watching the demo catches problems a text summary hides. shot-scraper video source: https://aititus.com/news
0 likes • 2h
The video catches the blind spots you don't know you have. A text summary can describe the feature correctly, but watching it run reveals the missing loading state or the wrong color on hover.
When an AI agent gives you a result that feels
When an AI agent gives you a result that feels like magic, don't just celebrate, capture the recipe. The single output is worthless if you can't recreate it. I learned this the hard way after spending an hour getting a Claude agent to produce the exact kind of analysis I wanted. I was thrilled, until I tried to run the same prompt again and got something completely different. The fix was simple: I asked the agent to "write me the system prompt that would create this exact response." It produced a clean, reusable instruction that I saved in my prompts folder. Now every time I iterate and find a good version, I extract the system prompt and store it. Over time you build a library of proven instructions, not just lucky one-offs. You turn random success into a repeatable asset. Next time you get an output that nails it, type: "now write the system prompt that generated this output." Then save that prompt. What's one prompt you wish you had captured?
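If you want the saving step to be a habit and not a copy-paste, a small helper can do it. This is a sketch: the `prompts` folder name, the `name-vN.txt` versioning scheme, and `save_prompt` itself are my assumptions, not part of any tool.

```python
from pathlib import Path


def save_prompt(name, system_prompt, folder="prompts"):
    """Write the extracted system prompt as the next numbered version of `name`."""
    directory = Path(folder)
    directory.mkdir(parents=True, exist_ok=True)
    # Count existing versions so earlier good prompts are never overwritten.
    version = len(list(directory.glob(f"{name}-v*.txt"))) + 1
    path = directory / f"{name}-v{version}.txt"
    path.write_text(system_prompt, encoding="utf-8")
    return path
```

Calling `save_prompt("weekly-analysis", text)` twice leaves both versions on disk, so you can compare them when an output changes.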
0 likes • 2h
That's a great addition, especially the evaluation criteria part. I've started doing a lightweight version of that after losing a few good results to missing context.
Dionny Chejito
5
339 points to level up
@dionny-chejito-4957
building AI agents & automations. i share what actually works, and what quietly breaks

Active 5m ago
Joined May 29, 2026