Activity

[Contribution activity calendar, Jan–Dec]

Memberships

Open Source Voice AI Community

765 members • Free

6 contributions to Open Source Voice AI Community
Zadarma with LiveKit or Pipecat, self-hosted?
Hello, I'm new to the community :) I'm playing with Pipecat and LiveKit self-hosted, and the problem I have is that I need compatibility with Zadarma SIP (https://zadarma.com/) and I can't use it :_( I need to use Zadarma because Twilio and Telnyx are so expensive, and I also can't buy phone numbers from a specific part of Spain. Does anyone use Zadarma SIP with them? Thanks!! 😁
0 likes • 18m
Nir's solution sounds great, and if he can help you get this set up, you should definitely go that route.

Just in case general information about SIP interconnect is useful: with Pipecat, you can either use a telephony provider that supports WebSocket streaming, or use a Pipecat transport that supports SIP directly.

Here's the Pipecat guide to using the DailyTransport with Twilio SIP. Everything this guide says about SIP should apply to any telephony provider that supports SIP. (Only the configuration/setup parts of the guide are specific to Twilio.) https://docs.pipecat.ai/guides/telephony/twilio-daily-sip
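For concreteness, the transport side of that guide looks roughly like this. A minimal sketch, assuming the current pipecat package layout and parameter names; the room URL, token, and the actual SIP dial-in configuration are placeholders covered in the guide:

```python
# Minimal sketch of a SIP-dial-in-capable Pipecat transport. Assumes the
# current pipecat package layout; room URL, token, and SIP setup are
# placeholders (see the Twilio/Daily SIP guide linked above).
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    room_url="https://YOUR_DOMAIN.daily.co/YOUR_ROOM",  # Daily room with SIP dial-in enabled
    token="YOUR_DAILY_TOKEN",
    bot_name="SIP Bot",
    params=DailyParams(
        audio_in_enabled=True,   # receive caller audio from the SIP leg
        audio_out_enabled=True,  # send the bot's TTS audio back to the caller
    ),
)
```

The same transport object then plugs into a normal Pipecat pipeline; only the dial-in plumbing is provider-specific.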
New Gemini Live model release
Google released the latest version of the Gemini Live model today. This is still the 2.5 series (not based on Gemini 3, yet).

Using the AI Studio APIs, the model name is: gemini-2.5-flash-native-audio-preview-12-2025

I've been experimenting with the checkpoints for a couple of weeks, and it's pretty similar to the previous version (flash-native-audio-preview-09-2025). They focused a lot on tool calling reliability in this release, because everybody told them they needed to make tool calling more reliable! It's definitely better on their benchmarks, but I generally found you could prompt the previous model to do tool calling pretty well until you got fairly deep into a multi-turn conversation, at which point all bets were off. What we really need from this API is good context engineering capabilities!

Interestingly, the model is also GA today (not preview) on Google Cloud Vertex. The Vertex model name is gemini-live-2.5-flash-native-audio

I don't really understand the Vertex thinking here, other than that Vertex can give you contracts for committed TPU capacity, which removes one of the variables that makes it hard to deliver reliable latencies for a model like this.

You can try the new model at https://www.pipecat.ai/
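If you want to try the new checkpoint from code, here's a minimal connection sketch using the google-genai Python SDK. The connect() pattern follows Google's Live API docs; treat the config and send/receive details as assumptions and check the current docs:

```python
# Minimal Gemini Live connection sketch using the google-genai SDK.
# The connect() pattern follows Google's Live API docs; details may drift,
# so verify against current documentation. Needs GOOGLE_API_KEY in the env.
import asyncio
from google import genai

# AI Studio name for the new checkpoint; on Vertex, use
# gemini-live-2.5-flash-native-audio instead.
MODEL = "gemini-2.5-flash-native-audio-preview-12-2025"

async def main():
    client = genai.Client()
    config = {"response_modalities": ["AUDIO"]}
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Say hello!"}]}
        )
        async for response in session.receive():
            if response.data:  # audio bytes stream back incrementally
                print(f"received {len(response.data)} bytes of audio")

asyncio.run(main())
```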
Voice agent observability with tracing
Are you using tracing in your voice agent? I thought about this today because the team at LangChain built voice AI support into their agent debugging and monitoring tool, LangSmith.

LangSmith is built around the concept of "tracing." If you've used OpenTelemetry for application logging, you're already familiar with tracing. If you haven't, think about it like this: a trace is a record of an operation that an application performs.

Today's production voice agents are complex, multi-model, multi-modal, multi-turn systems! Tracing gives you leverage to understand what your agents are doing. This saves time during development. And it's critical in production.

You can dig into what happened during each turn of any session. What did the user say, and how was that processed by each model you're using in your voice agent? What was the latency for each inference operation? What audio and text was actually sent back to the user?

You can also run analytics using tracing as your observability data. And you can use traces to build evals.

Tanushree is an engineer at LangChain. Her video below shows using a local (on-device) model for transcription, then switching to using the OpenAI speech-to-text model running in the cloud. You can see the difference in accuracy. (Using Pipecat, switching between different models is a single-line code change.)

Also, the video is fun! It's a French tutor. Which is a voice agent I definitely need.

How to debug voice agents with LangSmith (video): https://youtu.be/0FmbIgzKAkQ

LangSmith Pipecat integration docs page: https://docs.langchain.com/langsmith/trace-with-pipecat

I always like to read the code for nifty Pipecat services like the LangSmith tracing processor. It's here, though I think this nice work will likely make its way into Pipecat core soon: https://github.com/langchain-ai/voice-agents-tracing/blob/main/pipecat/langsmith_processor.py
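If tracing is new to you, here's a tiny illustrative sketch of the per-turn span structure, using the standard OpenTelemetry Python SDK. This is not the LangSmith processor linked above (which wires span creation into Pipecat's pipeline automatically), and the attribute names are made up for illustration:

```python
# Illustrative only: the general shape of per-turn tracing with the standard
# OpenTelemetry Python SDK. One span per conversation turn, with child spans
# for each inference step. Attribute names here are invented placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("voice-agent")

with tracer.start_as_current_span("turn") as turn:
    turn.set_attribute("session.id", "abc123")  # placeholder session id
    with tracer.start_as_current_span("stt") as stt:
        stt.set_attribute("stt.transcript", "what's the weather?")
    with tracer.start_as_current_span("llm") as llm:
        llm.set_attribute("llm.ttft_ms", 312)  # time to first token
    with tracer.start_as_current_span("tts") as tts:
        tts.set_attribute("tts.ttfb_ms", 95)   # time to first audio byte
```

Once spans like these are exported, "what happened in turn 14 of session abc123" becomes a query instead of an archaeology project.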
Who has built an extremely scalable voice AI system with LiveKit & Pipecat?
I mean a system that can do 10k calls per day. Has anyone built a system like this using LiveKit and Pipecat? Did you do it without using your own GPUs?
2 likes • 19d
Which part of the scaling are you thinking about?

For inference, most people doing 10k calls per day are using first-party, hosted services for STT, LLM, and TTS. You definitely can host your own models, but most people don't, for two reasons. First, the open weights models are still less capable than the commercial models. (Though I think that will change.) Second, it's actually more expensive to host your own models than to use the first-party hosted APIs until you are significantly bigger than 10k calls a day.

For hosting the voice agent itself, the basic answer is that you need to get various bits and pieces of Kubernetes auto-scaling set up for your specific use case and cloud provider. (You can also look and see if Pipecat Cloud fits your needs. It's "docker push for voice agents".)
Voice AI without Pipecat or LiveKit
Has anyone built a voice bot without Pipecat or LiveKit? I'd love to connect with them. I have built one and I'm facing some issues on the latency side. My tech stack is an OpenAI LLM, Deepgram for ASR, Azure for TTS, and one telephony vendor that connects everything.
2 likes • 19d
It's great to build things to learn! If there are latency issues that I can help you think through/diagnose, very happy to do that. Typical things that contribute to latency are:

- using WebSockets instead of WebRTC (for edge-to-cloud, but this is probably not your case since you mention telephony),
- long network round trips between all your providers (you want your voice agent very close to where your telephony provider terminates the PSTN, and close to all your inference providers),
- TTFT variance from your providers (you want to build observability tooling into your agent so you can measure and track this; see the sketch below),
- configuration parameters for Deepgram, etc.,
- turn detection configuration,
- not streaming between models.
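On the TTFT point, here's a minimal sketch of the measurement you want, using the OpenAI Python SDK's streaming chat completions. The model name and prompt are placeholders; the same "timestamp the first streamed chunk" pattern applies to your Deepgram and Azure TTS streams too:

```python
# Minimal time-to-first-token measurement for a streaming LLM call, using the
# OpenAI Python SDK. Model and prompt are placeholders; apply the same pattern
# (timestamp the first streamed chunk) to your STT and TTS providers.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.monotonic()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever model you run in production
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta and first_token_at is None:
        first_token_at = time.monotonic()
        print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")

print(f"total: {(time.monotonic() - start) * 1000:.0f} ms")
```

Log these numbers per turn in production (tagged by provider and region) and the variance will tell you where the latency budget is actually going.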
Kwindla Kramer
Level 2 • 8 points to level up
@kwindla-kramer-2446
I work on Pipecat and Daily infrastructure

Active 16m ago
Joined Nov 7, 2025