Live agents. Real money flowing. One transaction to move them clean.
Moved two live AI agents from one server to another today while they were actively serving paid chat sessions. No downtime. No maintenance window. No error messages for users. Here's the pattern that made it work: Stand up the new server fully. Get it running, smoke test it, confirm it returns 200. Then flip the database pointer in one transaction. Next webhook call hits the new server. Old server stays stopped but intact on disk. That last part is the thing most people skip. Don't delete the old process. Just stop it. Rollback is then one command, not a rebuild. The gotcha we hit: the webhook URL was stored in two database tables, not one. Updated the first, smoke test half-worked, spent 10 minutes confused before finding the second row. Now it's in the project notes permanently. Observed downtime for users: zero seconds. What does your rollback plan look like when you move a live service?