So last month we had this weird situation. Our voice AI was responding in like 300ms, super fast. LLM streaming, TTS optimized, everything running in parallel. We were pretty happy with ourselves.
Then we get feedback from users in India and Australia saying the system feels laggy and unresponsive.
I'm like, what? Our metrics show 300ms. That's fast.
Spent a week debugging the AI stack. Nothing wrong there.
Finally someone suggested we check actual end-to-end latency from the user's perspective, not just our server logs.
Turns out:
- Mumbai to our Virginia server: 900ms
- Sydney: 1200ms
- Even São Paulo: 800ms
Our 300ms of processing time was getting buried under 500-900ms of pure network travel time.
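If it helps anyone, this is roughly the kind of client-side probe we ended up running from each geography. The endpoints and region names here are made-up placeholders, not our real URLs; the point is just to time the full round trip the way a user experiences it, instead of trusting server-side processing metrics:

```python
import time
import urllib.request

# Hypothetical per-region ping endpoints -- placeholders, not our real URLs.
ENDPOINTS = {
    "us-east": "https://us-east.example.com/ping",
    "mumbai": "https://mumbai.example.com/ping",
}

def measure_rtt(url: str, samples: int = 5) -> float:
    """Average the full request/response round trip, as the caller sees it."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=5) as resp:
            resp.read()  # include transfer time, not just time-to-first-byte
        timings.append((time.perf_counter() - start) * 1000)
    return sum(timings) / len(timings)

if __name__ == "__main__":
    for region, url in ENDPOINTS.items():
        print(f"{region}: {measure_rtt(url):.0f} ms end-to-end")
```

Run something like that from a box (or a browser beacon) in each region and the gap between "server says 300ms" and "user sees 900ms" shows up immediately.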
The actual problem
When someone in Mumbai makes a call, the audio goes: Mumbai → local ISP → regional backbone → submarine cables → Europe → Atlantic → US → our server
Then the response does the same journey back.
That's like 15+ hops through routers, firewalls, and ISPs, each one adding 20-50ms.
Physics problem, not a code problem.
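Back-of-envelope, just to show why no code change fixes this. Light in fiber moves at roughly 200,000 km/s, and Mumbai to Virginia is around 13,000 km great-circle (real cable routes are longer):

```python
# Rough back-of-envelope numbers, not measurements.
SPEED_IN_FIBER_KM_S = 200_000    # light in fiber is ~2/3 the speed of light in vacuum
MUMBAI_TO_VIRGINIA_KM = 13_000   # approximate great-circle distance

one_way_ms = MUMBAI_TO_VIRGINIA_KM / SPEED_IN_FIBER_KM_S * 1000
print(f"theoretical best-case RTT: {2 * one_way_ms:.0f} ms")  # ~130 ms
# Real routes detour through Europe and add per-hop queuing on top,
# so 500ms+ of pure network time isn't surprising.
```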
What we did
Moved our servers closer to users. Sounds obvious now but we initially thought "cloud is cloud, location doesn't matter."
Deployed smaller Kubernetes clusters in:
- Mumbai
- Singapore
- São Paulo
- Sydney
- Plus our existing US and Europe ones
Each location runs the full stack. Not a cache, actual processing.
When someone in Mumbai calls now, they hit the Mumbai server. Processing happens 40ms away instead of 200ms away.
Used GeoDNS so users automatically connect to the nearest location, plus some smart routing in case the nearest one is overloaded.
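The "smart routing" part is nothing fancy; conceptually it's just "nearest region unless it's unhealthy or overloaded, then next-nearest." Something like this sketch, where the region names, latency numbers, and load threshold are all placeholders (our real version lives at the DNS/load-balancer layer, not in application code):

```python
# Candidate regions with latency measured from the client's vantage point (ms).
# GeoDNS handles the "nearest" part; this is the fallback logic layered on top.
candidates = [
    {"name": "mumbai",    "latency_ms": 40,  "healthy": True, "load": 0.95},
    {"name": "singapore", "latency_ms": 70,  "healthy": True, "load": 0.40},
    {"name": "us-east",   "latency_ms": 220, "healthy": True, "load": 0.30},
]

MAX_LOAD = 0.85  # hypothetical threshold for "overloaded"

def pick_region(regions):
    """Nearest healthy region that isn't overloaded; degrade gracefully otherwise."""
    usable = [r for r in regions if r["healthy"] and r["load"] < MAX_LOAD]
    if not usable:  # everything overloaded or down: take any healthy one, else anything
        usable = [r for r in regions if r["healthy"]] or regions
    return min(usable, key=lambda r: r["latency_ms"])

print(pick_region(candidates)["name"])  # -> "singapore" (Mumbai is over the load threshold)
```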
Results
- Mumbai: 900ms → 300ms
- Sydney: 1200ms → 340ms
- São Paulo: 800ms → 310ms
Basically went from "unusable in some regions" to "works everywhere."
The funny part? Our AI didn't change at all. Same models, same code. We just moved the servers closer.
The Kubernetes part
This would've been a nightmare to manage without k8s. We'd need to manually deploy and maintain like 10+ separate systems.
Instead:
- One deployment config
- Apply to all regions
- Each scales independently based on local traffic
- Update all of them with one command
India gets busy during Indian business hours, scales up automatically. Scales down at night. Same for every region.
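To give a concrete flavor of the "one config, every region" workflow: we keep one set of manifests and loop over per-region kubectl contexts. Rough sketch only; the context names and manifest path are made up, and our real setup has per-region overlays for things like replica counts:

```python
import subprocess

# Hypothetical kubectl context names, one per regional cluster.
REGIONS = ["us-east", "eu-west", "mumbai", "singapore", "sao-paulo", "sydney"]

def deploy_everywhere(manifest_dir: str = "k8s/") -> None:
    """Apply the same manifests to every regional cluster."""
    for ctx in REGIONS:
        print(f"==> applying to {ctx}")
        subprocess.run(
            ["kubectl", "--context", ctx, "apply", "-f", manifest_dir],
            check=True,
        )

if __name__ == "__main__":
    deploy_everywhere()
```

Autoscaling then runs per cluster against local traffic (a HorizontalPodAutoscaler or equivalent in each region), which is what gives you the "India scales up during Indian business hours" behavior without any coordination.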
When US East had that outage last week, only the 12% of our users on that region noticed anything. Everyone else didn't even know it happened.
Lesson learned
You can optimize your code all day but if you're sending data halfway around the world, physics wins.
Also, measure what users actually experience, not just what your server processes. Our metrics looked great but user experience sucked in half the world.
Anyway, if you're building anything real-time and have global users, geography matters more than you think.
Has anyone else run into this? How'd you handle it?