So last month we had this weird situation. Our voice AI was responding in like 300ms, super fast. LLM streaming, TTS optimized, everything running in parallel. We were pretty happy with ourselves.
Then we get feedback from users in India and Australia saying the system feels laggy and unresponsive.
I'm like, what? Our metrics show 300ms. That's fast.
Spent a week debugging the AI stack. Nothing wrong there.
Finally someone suggested we check actual end-to-end latency from the user's perspective, not just our server logs.
Turns out:
- Mumbai to our Virginia server: 900ms
- Sydney: 1200ms
- Even São Paulo: 800ms
Our 300ms of processing time was getting buried under 500-900ms of pure network travel time.
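If it helps anyone, this is roughly the kind of client-side probe we ended up running from each geography. The endpoints and region names here are made-up placeholders, not our real URLs; the point is just to time the full round trip the way a user experiences it, instead of trusting server-side processing metrics:

```python
import time
import urllib.request

# Hypothetical per-region ping endpoints -- placeholders, not our real URLs.
ENDPOINTS = {
    "us-east": "https://us-east.example.com/ping",
    "mumbai": "https://mumbai.example.com/ping",
}

def measure_rtt(url: str, samples: int = 5) -> float:
    """Average the full request/response round trip, as the caller sees it."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=5) as resp:
            resp.read()  # include transfer time, not just time-to-first-byte
        timings.append((time.perf_counter() - start) * 1000)
    return sum(timings) / len(timings)

if __name__ == "__main__":
    for region, url in ENDPOINTS.items():
        print(f"{region}: {measure_rtt(url):.0f} ms end-to-end")
```

Run something like that from a box (or a browser beacon) in each region and the gap between "server says 300ms" and "user sees 900ms" shows up immediately.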
The actual problem
When someone in Mumbai makes a call, the audio goes: Mumbai → local ISP → regional backbone → submarine cables → Europe → Atlantic → US → our server
Then the response does the same journey back.
That's like 15+ hops through routers, firewalls, and ISPs, each one adding 20-50ms.
Physics problem, not a code problem.
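Back-of-envelope, just to show why no code change fixes this. Light in fiber moves at roughly 200,000 km/s, and Mumbai to Virginia is around 13,000 km great-circle (real cable routes are longer):

```python
# Rough back-of-envelope numbers, not measurements.
SPEED_IN_FIBER_KM_S = 200_000    # light in fiber is ~2/3 the speed of light in vacuum
MUMBAI_TO_VIRGINIA_KM = 13_000   # approximate great-circle distance

one_way_ms = MUMBAI_TO_VIRGINIA_KM / SPEED_IN_FIBER_KM_S * 1000
print(f"theoretical best-case RTT: {2 * one_way_ms:.0f} ms")  # ~130 ms
# Real routes detour through Europe and add per-hop queuing on top,
# so 500ms+ of pure network time isn't surprising.
```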
What we did
Moved our servers closer to users. Sounds obvious now but we initially thought "cloud is cloud, location doesn't matter."
Deployed smaller Kubernetes clusters in:
- Mumbai
- Singapore
- São Paulo
- Sydney
- Plus our existing US and Europe ones
Each location runs the full stack. Not a cache, actual processing.
When someone in Mumbai calls now, they hit the Mumbai server. Processing happens 40ms away instead of 200ms away.
Used GeoDNS so users automatically connect to the nearest location, plus some smart routing in case the nearest one is overloaded.
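The "smart routing" part is nothing fancy; conceptually it's just "nearest region unless it's unhealthy or overloaded, then next-nearest." Something like this sketch, where the region names, latency numbers, and load threshold are all placeholders (our real version lives at the DNS/load-balancer layer, not in application code):

```python
# Candidate regions with latency measured from the client's vantage point (ms).
# GeoDNS handles the "nearest" part; this is the fallback logic layered on top.
candidates = [
    {"name": "mumbai",    "latency_ms": 40,  "healthy": True, "load": 0.95},
    {"name": "singapore", "latency_ms": 70,  "healthy": True, "load": 0.40},
    {"name": "us-east",   "latency_ms": 220, "healthy": True, "load": 0.30},
]

MAX_LOAD = 0.85  # hypothetical threshold for "overloaded"

def pick_region(regions):
    """Nearest healthy region that isn't overloaded; degrade gracefully otherwise."""
    usable = [r for r in regions if r["healthy"] and r["load"] < MAX_LOAD]
    if not usable:  # everything overloaded or down: take any healthy one, else anything
        usable = [r for r in regions if r["healthy"]] or regions
    return min(usable, key=lambda r: r["latency_ms"])

print(pick_region(candidates)["name"])  # -> "singapore" (Mumbai is over the load threshold)
```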
Results
- Mumbai: 900ms → 300ms
- Sydney: 1200ms → 340ms
- São Paulo: 800ms → 310ms
Basically went from "unusable in some regions" to "works everywhere."
The funny part? Our AI didn't change at all. Same models, same code. We just moved the servers closer.
The Kubernetes part
This would've been a nightmare to manage without k8s. We'd need to manually deploy and maintain like 10+ separate systems.
Instead:
- One deployment config
- Apply to all regions
- Each scales independently based on local traffic
- Update all of them with one command
India gets busy during Indian business hours, scales up automatically. Scales down at night. Same for every region.
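To give a concrete flavor of the "one config, every region" workflow: we keep one set of manifests and loop over per-region kubectl contexts. Rough sketch only; the context names and manifest path are made up, and our real setup has per-region overlays for things like replica counts:

```python
import subprocess

# Hypothetical kubectl context names, one per regional cluster.
REGIONS = ["us-east", "eu-west", "mumbai", "singapore", "sao-paulo", "sydney"]

def deploy_everywhere(manifest_dir: str = "k8s/") -> None:
    """Apply the same manifests to every regional cluster."""
    for ctx in REGIONS:
        print(f"==> applying to {ctx}")
        subprocess.run(
            ["kubectl", "--context", ctx, "apply", "-f", manifest_dir],
            check=True,
        )

if __name__ == "__main__":
    deploy_everywhere()
```

Autoscaling then runs per cluster against local traffic (a HorizontalPodAutoscaler or equivalent in each region), which is what gives you the "India scales up during Indian business hours" behavior without any coordination.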
When US East had that outage last week, only the 12% of our users on that region noticed anything. Everyone else didn't even know it happened.
Lesson learned
You can optimize your code all day but if you're sending data halfway around the world, physics wins.
Also, measure what users actually experience, not just what your server processes. Our metrics looked great but user experience sucked in half the world.
Anyway, if you're building anything real-time and have global users, geography matters more than you think.
Has anyone else run into this? How'd you handle it?