A tricky RAG scenario for you guys · AI Automation Society

Mike H

2d • General Discussion 💬

Your retriever scores 90% accuracy on a 5K-document test set, then crashes to 50% in production with 500K docs.

Why? What exactly do you think happened behind the scenes?

Not because the embedding model suddenly became bad.

The real issue is embedding space crowding.

In enterprise systems, a single business decision generates:

• Slack threads

• Jira tickets

• Confluence docs

• Emails

• Meeting transcripts

All of them are semantically related, so they cluster tightly together in embedding space.

But each document contains different facts.

Example:

→ Slack = final decision

→ Jira = deadline

→ Confluence = technical spec

→ Email = customer request

At small scale, the correct doc easily makes it into top-K retrieval.

At large scale, dozens of highly similar docs compete for the same retrieval slots — and the exact answer doc gets pushed out.
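The crowding effect can be reproduced with a toy simulation. This is only a sketch under loud assumptions: the "embeddings" are random Gaussian vectors rather than output from a real model, every document in a cluster is the same topic vector plus noise, and the names (`recall_at_k`, `doc_noise`, `query_noise`) are made up for illustration. It counts how often the one correct doc stays in the top-K as more near-duplicates are added around it.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def noisy(center, scale, rng):
    """Return `center` plus independent Gaussian noise per dimension."""
    return [c + rng.gauss(0.0, scale) for c in center]


def recall_at_k(sizes, k=5, dim=16, trials=100,
                doc_noise=0.3, query_noise=0.3, seed=0):
    """Fraction of trials where the target doc is still in the top-k,
    for each number of same-topic neighbours in `sizes`."""
    rng = random.Random(seed)
    hits = {n: 0 for n in sizes}
    for _ in range(trials):
        topic = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # one "business decision"
        target = noisy(topic, doc_noise, rng)              # the doc with the answer
        query = noisy(target, query_noise, rng)            # question phrased slightly differently
        target_sim = cosine(query, target)
        # For each neighbour from the same topic, record whether it outranks the target.
        beats = [
            cosine(query, noisy(topic, doc_noise, rng)) > target_sim
            for _ in range(max(sizes))
        ]
        for n in sizes:
            # The target stays in the top-k if fewer than k of the first n neighbours beat it.
            if sum(beats[:n]) < k:
                hits[n] += 1
    return {n: hits[n] / trials for n in sizes}


if __name__ == "__main__":
    for n, recall in recall_at_k([4, 40, 400]).items():
        print(f"{n:>4} neighbours -> recall@5 = {recall:.2f}")
```

Because each larger cluster contains the smaller one, recall here can only stay flat or fall as neighbours are added. How fast it falls depends entirely on the assumed noise levels, so treat the numbers as an illustration, not a benchmark.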

A recent Onyx research paper showed this clearly:

• Vector search dropped from 90.7% to 50.6% when scaling from 5K to 500K docs

• BM25 degraded much more gracefully
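One plausible intuition for why BM25 holds up better: it rewards rare exact terms, so the one doc that actually contains "deadline" stands out even when every doc is about the same topic. Below is a minimal sketch of the standard Okapi BM25 formula with the common defaults `k1=1.5` and `b=0.75`. The four documents are invented toy examples mirroring the Slack/Jira/Confluence/Email case above, and none of this is taken from the Onyx paper.

```python
import math
from collections import Counter

# Hypothetical toy corpus: four docs about the same decision, each with a different fact.
DOCS = {
    "slack": "final decision we will migrate billing to the new payment provider",
    "jira": "deadline for the billing migration to the new payment provider is end of q3",
    "confluence": "technical spec for the billing migration to the new payment provider api",
    "email": "customer request to move billing to a new payment provider",
}


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every doc (id -> text) against `query` with Okapi BM25."""
    tokenised = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenised)
    avg_len = sum(len(toks) for toks in tokenised.values()) / n_docs
    # Document frequency: number of docs in which each term appears.
    df = Counter(term for toks in tokenised.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenised.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Rare terms get a high IDF; terms found in every doc get almost none.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores


if __name__ == "__main__":
    scores = bm25_scores("billing deadline", DOCS)
    for doc_id in sorted(scores, key=scores.get, reverse=True):
        print(f"{doc_id:<11} {scores[doc_id]:.3f}")
```

"billing" appears in all four docs, so it barely moves the score. "deadline" appears only in the Jira ticket, which is why that doc ranks first for this query.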

Big lesson for AI engineers:

A retriever that works on small datasets tells you almost nothing about real enterprise performance.

Always test RAG systems at production-scale corpus sizes — because neighbourhood density in embedding space changes everything.
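A rough sketch of what "test at scale" could look like in practice: keep the labelled queries and their gold docs fixed, pad the corpus with more and more distractors, and track recall@k at each size. Everything here is hypothetical, including `recall_by_corpus_size`, the `retrieve(query, corpus, k)` signature, and the deliberately naive word-overlap retriever (with a contrived tie-break) that stands in for your real one.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Doc = Tuple[str, str]  # (doc_id, text)
Retriever = Callable[[str, Sequence[Doc], int], List[str]]


def recall_by_corpus_size(
    retrieve: Retriever,
    labelled_queries: Sequence[Tuple[str, str]],  # (query, gold_doc_id)
    gold_docs: Sequence[Doc],
    distractors: Sequence[Doc],
    sizes: Sequence[int],
    k: int = 5,
) -> Dict[int, float]:
    """Recall@k for the same queries as the corpus grows by `n` distractors."""
    results = {}
    for n in sizes:
        corpus = list(gold_docs) + list(distractors[:n])
        hits = sum(
            1 for query, gold_id in labelled_queries
            if gold_id in retrieve(query, corpus, k)
        )
        results[n] = hits / len(labelled_queries)
    return results


def toy_retrieve(query: str, corpus: Sequence[Doc], k: int) -> List[str]:
    """Stand-in retriever: rank by shared words, break ties by doc id."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda d: (-len(q_words & set(d[1].lower().split())), d[0]),
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical data: one gold doc and five near-duplicates with the same overlap.
GOLD = [("jira-1", "billing migration deadline is end of q3")]
QUERIES = [("billing migration deadline", "jira-1")]
DISTRACTORS = [
    (f"dup-{i:03d}", "billing migration deadline discussion thread")
    for i in range(5)
]

if __name__ == "__main__":
    print(recall_by_corpus_size(toy_retrieve, QUERIES, GOLD, DISTRACTORS,
                                sizes=[0, 1, 5], k=2))
```

Swap `toy_retrieve` for a wrapper around your actual vector or BM25 search and use distractors sampled from real production data. The point of the harness is that the query set stays constant while only the corpus size changes.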

What weird RAG or chatbot system behaviours have you noticed?
