Your retriever scores 90% accuracy on a 5K-document test set, then crashes to 50% in production with 500K docs.
Why? What actually happened behind the scenes?
Not because the embedding model suddenly became bad.
The real issue is embedding space crowding.
In enterprise systems, a single business decision creates:
• Slack threads
• Jira tickets
• Confluence docs
• Emails
• Meeting transcripts
All of them are semantically related, so they cluster tightly together in embedding space.
But each document contains different facts.
Example:
→ Slack = final decision
→ Jira = deadline
→ Confluence = technical spec
→ Email = customer request
At small scale, the correct doc easily makes it into the top-K results.
At large scale, dozens of highly similar docs compete for the same retrieval slots, and the doc with the exact answer gets pushed out.
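Here is a toy simulation of that crowding effect. It is only a sketch: it uses synthetic Gaussian vectors, not real embeddings, and every name and parameter in it (answer_rank, hit_rate, noise, overlap) is made up for illustration. All docs in a cluster share one "topic" direction, and the query is only slightly closer to the answer doc than to its neighbours.

```python
import numpy as np

def answer_rank(n_distractors, dim=64, noise=0.3, overlap=0.15, seed=0):
    """1-based cosine-similarity rank of the answer doc inside one topic cluster."""
    rng = np.random.default_rng(seed)
    topic = rng.normal(size=dim)          # direction shared by the whole cluster
    query_noise = rng.normal(size=dim)    # query-specific variation
    # Row 0 is the answer doc; the other rows are near-duplicate distractors.
    doc_noise = rng.normal(size=(n_distractors + 1, dim))
    docs = topic + noise * doc_noise
    # Assumption: the query shares a small part of the answer doc's specifics.
    query = topic + noise * (overlap * doc_noise[0] + (1 - overlap) * query_noise)
    # Normalise so the dot product below is cosine similarity.
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    query /= np.linalg.norm(query)
    sims = docs @ query
    # Rank = number of docs scoring strictly higher than the answer, plus one.
    return int((sims > sims[0]).sum()) + 1

def hit_rate(n_distractors, k=5, trials=200):
    """Fraction of trials in which the answer doc lands in the top k."""
    hits = sum(answer_rank(n_distractors, seed=s) <= k for s in range(trials))
    return hits / trials

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"{n} distractors -> top-5 hit rate {hit_rate(n):.2f}")
```

The model and the query never change between runs. Only the number of neighbours competing for the same K slots changes, and that alone is enough to push the answer doc out.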
A recent Onyx research paper showed this clearly:
• Vector search dropped from 90.7% to 50.6% when scaling from 5K to 500K docs
• BM25 degraded much more gracefully
Big lesson for AI engineers:
A retriever that works on small datasets tells you almost nothing about real enterprise performance.
Always test RAG systems at production-scale corpus sizes, because neighbourhood density in embedding space changes everything.
What weird RAG/chatbot system behaviours have you noticed?
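One way to put this into practice is to re-run the same labelled queries while padding the corpus with more and more distractor docs. The harness below is a minimal sketch under assumptions: the retrieve(query, corpus, k) interface, the function names, and the toy word-overlap retriever are all hypothetical stand-ins for whatever retriever you actually run.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: returns the indices of the top-k docs in `corpus`.
Retriever = Callable[[str, Sequence[str], int], List[int]]

def recall_at_k(retrieve: Retriever, corpus: Sequence[str],
                queries: Sequence[Tuple[str, int]], k: int = 5) -> float:
    """queries holds (query_text, index_of_answer_doc_in_corpus) pairs."""
    hits = sum(answer in retrieve(q, corpus, k) for q, answer in queries)
    return hits / len(queries)

def recall_by_corpus_size(retrieve: Retriever, gold_docs: Sequence[str],
                          distractors: Sequence[str],
                          queries: Sequence[Tuple[str, int]],
                          sizes: Sequence[int], k: int = 5) -> Dict[int, float]:
    """Measure recall@k as the gold docs are padded with more distractors."""
    results = {}
    for size in sizes:
        # Gold docs stay at the front so the answer indices remain valid.
        extra = max(0, size - len(gold_docs))
        corpus = list(gold_docs) + list(distractors[:extra])
        results[len(corpus)] = recall_at_k(retrieve, corpus, queries, k)
    return results

def overlap_retriever(query: str, corpus: Sequence[str], k: int) -> List[int]:
    """Toy stand-in: rank docs by shared lowercase tokens (not BM25, not vectors)."""
    q_tokens = set(query.lower().split())
    scores = [len(q_tokens & set(doc.lower().split())) for doc in corpus]
    order = sorted(range(len(corpus)), key=lambda i: (-scores[i], i))
    return order[:k]

if __name__ == "__main__":
    gold = ["launch deadline is march 3", "spec uses postgres 15"]
    noise = ["launch party photos", "postgres outage notes"]
    qs = [("when is the launch deadline", 0),
          ("which postgres version does the spec use", 1)]
    print(recall_by_corpus_size(overlap_retriever, gold, noise, qs, [2, 4], k=1))
```

Swap in your real retriever and real distractor docs, then plot recall against corpus size. The shape of that curve tells you far more than a single score on a small test set.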