Wait... someone just open-sourced a real-time OpenClaw agent 🤯
It sees what you see. Hears what you say. Acts on your behalf.
All through your voice. From your Meta glasses or just your iPhone.
It's called VisionClaw. And it uses OpenClaw and Gemini Live.
Put on your glasses or open the app on your phone, tap one button, and just talk.
"What am I looking at?" and Gemini sees through your camera and describes the scene in real time.
"Add milk to my shopping list" and it actually adds it through your connected apps.
"Send John a message saying I'll be late" and it routes through WhatsApp, Telegram, or iMessage.
"Find the best coffee shops nearby" and it searches the web and speaks the results back to you.
All happening live. While you're walking around.
Here's what's running under the hood:
• Camera streams video at ~1 fps to the Gemini Live API (rough sketch below)
• Audio flows bidirectionally in real time over WebSocket
• Gemini processes vision + voice natively (not speech-to-text first)
• OpenClaw gateway gives it access to 56+ skills and connected apps
• Tool calls execute and results are spoken back to you
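For a feel of the streaming side, here's a minimal Swift sketch of that ~1 fps video path over a URLSessionWebSocketTask. The endpoint URL, model wiring, the `realtimeInput` message shape, and the frame source are assumptions for illustration only, not code from the VisionClaw repo:

```swift
import Foundation
import UIKit

// Rough sketch of the ~1 fps video path, assuming a Gemini Live
// bidirectional WebSocket endpoint and a `realtimeInput` JSON message.
// The URL and JSON schema here are illustrative guesses.
final class GeminiLiveStream {
    private var socket: URLSessionWebSocketTask?
    private var frameTimer: Timer?

    func connect(apiKey: String) {
        let url = URL(string:
            "wss://generativelanguage.googleapis.com/ws/"
            + "google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent"
            + "?key=\(apiKey)")!
        socket = URLSession.shared.webSocketTask(with: url)
        socket?.resume()

        // Throttle the camera to roughly one frame per second.
        frameTimer = Timer.scheduledTimer(withTimeInterval: 1.0, repeats: true) { [weak self] _ in
            self?.sendLatestFrame()
        }
    }

    private func sendLatestFrame() {
        guard let jpeg = latestCameraFrame()?.jpegData(compressionQuality: 0.5) else { return }
        // Assumed wire format: a realtimeInput message carrying a base64 JPEG chunk.
        let message: [String: Any] = [
            "realtimeInput": [
                "mediaChunks": [
                    ["mimeType": "image/jpeg", "data": jpeg.base64EncodedString()]
                ]
            ]
        ]
        guard let body = try? JSONSerialization.data(withJSONObject: message),
              let text = String(data: body, encoding: .utf8) else { return }
        socket?.send(.string(text)) { error in
            if let error { print("frame send failed: \(error)") }
        }
    }

    // Placeholder: in the real app this would come from an AVCaptureSession.
    private func latestCameraFrame() -> UIImage? { nil }
}
```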
Your camera captures video. Gemini Live processes what it sees and hears over that WebSocket connection, decides whether action is needed, routes tool calls through OpenClaw, and speaks the result back to you.
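And the action side, in the same hedged spirit: when Gemini Live returns a tool call, something like this could forward it to the OpenClaw gateway and hand the result back to be spoken. The gateway URL, endpoint path, and JSON fields here are illustrative guesses, not OpenClaw's actual API:

```swift
import Foundation

// Hedged sketch of the tool-call leg: forward a requested tool to an
// assumed local OpenClaw gateway endpoint and decode the result.
struct ToolCall: Codable {
    let name: String               // e.g. "send_message" (hypothetical skill name)
    let arguments: [String: String]
}

struct ToolResult: Codable {
    let output: String
}

func runToolCall(_ call: ToolCall) async throws -> ToolResult {
    // Assumed gateway endpoint, for illustration only.
    var request = URLRequest(url: URL(string: "http://localhost:8080/tools/\(call.name)")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(call.arguments)

    let (data, _) = try await URLSession.shared.data(for: request)
    return try JSONDecoder().decode(ToolResult.self, from: data)
}

// The result then goes back over the same WebSocket so Gemini Live can voice it.
```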
No Meta glasses? No problem. Your iPhone camera works just the same.
Clone the repo, add your Gemini API key, build in Xcode, and you're running.
100% open source. Built with Gemini Live API + OpenClaw.