I cooked up a raw Voice AI orchestration engine from scratch using ๐—Ÿ๐—ถ๐˜ƒ๐—ฒ๐—ž๐—ถ๐˜ & ๐—ฃ๐˜†๐˜๐—ต๐—ผ๐—ป. ๐Ÿณ
While wrappers are great for MVPs, building your own orchestration layer gives you ๐—ณ๐˜‚๐—น๐—น ๐—ผ๐˜„๐—ป๐—ฒ๐—ฟ๐˜€๐—ต๐—ถ๐—ฝ, ๐˜€๐—ถ๐—ด๐—ป๐—ถ๐—ณ๐—ถ๐—ฐ๐—ฎ๐—ป๐˜๐—น๐˜† ๐—น๐—ผ๐˜„๐—ฒ๐—ฟ ๐—ฐ๐—ผ๐˜€๐˜๐˜€, ๐—ฎ๐—ป๐—ฑ ๐—ด๐—ฟ๐—ฎ๐—ป๐˜‚๐—น๐—ฎ๐—ฟ ๐—ฐ๐—ผ๐—ป๐˜๐—ฟ๐—ผ๐—น over the entire conversational pipeline.
I designed this engine to fully replace third-party wrappers like Vapi & Retell AI. Here is a deep dive into whatโ€™s under the hood:
๐Ÿ”„ ๐——๐˜†๐—ป๐—ฎ๐—บ๐—ถ๐—ฐ ๐—”๐—ด๐—ฒ๐—ป๐˜ ๐—–๐—ผ๐—ป๐—ณ๐—ถ๐—ด๐˜‚๐—ฟ๐—ฎ๐˜๐—ถ๐—ผ๐—ป (๐—ฅ๐—ฒ๐—ฎ๐—น-๐—ง๐—ถ๐—บ๐—ฒ ๐—›๐˜†๐—ฑ๐—ฟ๐—ฎ๐˜๐—ถ๐—ผ๐—ป)
Hardcoding agents is a trap. I implemented a system that executes an API call upon call initialization.
โ€ข ๐—›๐—ผ๐˜-๐—ฆ๐˜„๐—ฎ๐—ฝ๐—ฝ๐—ฎ๐—ฏ๐—น๐—ฒ ๐—ฃ๐—ฒ๐—ฟ๐˜€๐—ผ๐—ป๐—ฎ๐˜€: A single engine instance can instantly apply unique System Prompts, Voice IDs, and Temperature settings based on backend parameters.
โ€ข ๐—ฅ๐—ฒ๐˜€๐˜‚๐—น๐˜: You can power thousands of unique agents (e.g., specific to different businesses) without ever redeploying the core code or creating a new instance.
๐Ÿ› ๏ธ ๐—–๐—ผ๐—ป๐˜๐—ฒ๐˜…๐˜-๐—”๐˜„๐—ฎ๐—ฟ๐—ฒ ๐—™๐˜‚๐—ป๐—ฐ๐˜๐—ถ๐—ผ๐—ป ๐—ฅ๐—ผ๐˜‚๐˜๐—ฒ๐—ฟ
When building raw infrastructure, manually mapping tools to agents is a major architectural hassle. I built specialized helper logic for ๐——๐˜†๐—ป๐—ฎ๐—บ๐—ถ๐—ฐ ๐—ง๐—ผ๐—ผ๐—น ๐—œ๐—ป๐—ท๐—ฒ๐—ฐ๐˜๐—ถ๐—ผ๐—ป to solve this.
โ€ข ๐— ๐—ผ๐—ฑ๐˜‚๐—น๐—ฎ๐—ฟ ๐—Ÿ๐—ผ๐—ด๐—ถ๐—ฐ: The router decouples the orchestration engine from business logic. It parses the backend setup and assignsย onlyย the specific tools defined in that agent's configuration (e.g., loading "Appointment Booking" tools only when the specific use-case demands it).
๐Ÿ’พ ๐——๐—ฎ๐˜๐—ฎ ๐—ฃ๐—ฒ๐—ฟ๐˜€๐—ถ๐˜€๐˜๐—ฒ๐—ป๐—ฐ๐—ฒ & ๐—ฃ๐—ผ๐˜€๐˜-๐—–๐—ฎ๐—น๐—น ๐—œ๐—ป๐˜๐—ฒ๐—น๐—น๐—ถ๐—ด๐—ฒ๐—ป๐—ฐ๐—ฒ
Logs aren't enough. I built a save_conversation function that aggregates the full session payload and triggers intelligent sub-functions immediately after the call:
โ€ข ๐—–๐—ฎ๐—น๐—น ๐—ฆ๐˜‚๐—บ๐—บ๐—ฎ๐—ฟ๐˜†: Generates a natural language recap via LLM.
โ€ข ๐—–๐—ฎ๐—น๐—น ๐—˜๐˜ƒ๐—ฎ๐—น๐˜‚๐—ฎ๐˜๐—ถ๐—ผ๐—ป: Structurally classifies the outcome (e.g., "Booked", "Inquiry", "Failed").
โ€ข ๐—ง๐—ฒ๐—น๐—ฒ๐—บ๐—ฒ๐˜๐—ฟ๐˜†: Captures precise Token Usage (for billing) and Latency statistics alongside the transcript.
๐Ÿ›ก๏ธ ๐—ฃ๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐˜๐—ถ๐—ผ๐—ป ๐—š๐˜‚๐—ฎ๐—ฟ๐—ฑ๐—ฟ๐—ฎ๐—ถ๐—น๐˜€
To prevent runaway costs and "zombie" connections, I engineered active background monitors:
โ€ข ๐—œ๐—ป๐—ฎ๐—ฐ๐˜๐—ถ๐˜ƒ๐—ถ๐˜๐˜† ๐— ๐—ผ๐—ป๐—ถ๐˜๐—ผ๐—ฟ: Detects silence (30s default) and gracefully terminates the session.
โ€ข ๐—ฆ๐—ฒ๐˜€๐˜€๐—ถ๐—ผ๐—ป ๐—Ÿ๐—ถ๐—บ๐—ถ๐˜ ๐— ๐—ผ๐—ป๐—ถ๐˜๐—ผ๐—ฟ: Enforces a hard safety cap (15 mins) to prevent infinite loops or abuse.
๐Ÿš€ ๐—ง๐—ต๐—ฒ ๐—ฃ๐—ฟ๐—ผ๐—ผ๐—ณ:
This engine isn't a prototype. It is currently the production backbone for my Dental SaaS, handling real-time scheduling for ๐Ÿฎ๐Ÿฌ+ ๐—ฎ๐—ฐ๐˜๐—ถ๐˜ƒ๐—ฒ ๐—ฐ๐—น๐—ถ๐—ป๐—ถ๐—ฐ๐˜€ across Canada.
If you are interested in having this architecture for your own SaaS, ๐—ฐ๐—ผ๐—บ๐—บ๐—ฒ๐—ป๐˜ "๐—ฉ๐—ผ๐—ถ๐—ฐ๐—ฒ ๐—”๐—œ" or ๐——๐—  ๐—บ๐—ฒ. Let's build. ๐Ÿ‘‡
4
8 comments
Jin Park
3
I cooked up a raw Voice AI orchestration engine from scratch using ๐—Ÿ๐—ถ๐˜ƒ๐—ฒ๐—ž๐—ถ๐˜ & ๐—ฃ๐˜†๐˜๐—ต๐—ผ๐—ป. ๐Ÿณ
powered by
Open Source Voice AI Community
skool.com/open-source-voice-ai-community-6088
Voice AI made open: Learn to build voice agents with Livekit & Pipecat and uncover what the closed platforms are hiding.
Build your own community
Bring people together around your passion and get paid.
Powered by