Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

Original link: https://github.com/fikrikarim/parlor

## Parlor: Real-Time, Local Multimodal AI

Parlor is a research preview demonstrating real-time AI conversation with voice and vision, with all processing done *locally* on your device. It uses Google's Gemma 4 E2B to understand audio and visual input, and Kokoro for text-to-speech output.

The motivation behind Parlor is to build a sustainable, free English-learning tool that eliminates server costs by running entirely on-device, made possible by recent advances in smaller, more capable AI models such as Gemma. While it cannot perform complex agentic tasks, it represents a meaningful step toward AI that is accessible to everyone.

Parlor currently runs on macOS (Apple Silicon) and on Linux with a supported GPU, enabling hands-free, interruptible conversation. It can be run by cloning the GitHub repository and needs roughly 3 GB of RAM. The project aims to be ported to phones in the future, envisioning users interacting with an AI simply by showing it objects around them and talking about them, with multilingual support as a fallback.

14 points by karimf 1 hour ago | 2 comments

Related: https://news.ycombinator.com/item?id=47653752

dvt (18 minutes ago): Excellent work and demo. I've built a lot of things with Kokoro, and the latency is really impressive. Apple's showing here is so poor... it feels like your demo should be a Siri demo (I mean that as the sincerest compliment).

karimf (8 minutes ago): Thanks. This reminds me of a bit from the Latent Space newsletter [0]: "The excellent on-device capabilities make one wonder whether these are the basis of the models that will be deployed in the new Siri under the deal with Apple..." [0] https://www.latent.space/p/ainews-gemma-4-the-best-small-mul...

Original text

On-device, real-time multimodal AI. Have natural voice and vision conversations with an AI that runs entirely on your machine.

Parlor uses Gemma 4 E2B for understanding speech and vision, and Kokoro for text-to-speech. You talk, show your camera, and it talks back, all locally.

[Demo video: parlor_realtime_ai_with_audio_video_input_optimized.mp4]

Research preview. This is an early experiment. Expect rough edges and bugs.

I'm self-hosting a totally free voice AI on my home server to help people learn speaking English. It has hundreds of monthly active users, and I've been thinking about how to keep it free while making it sustainable.

The obvious answer: run everything on-device, eliminating any server cost. Six months ago I needed an RTX 5090 to run just the voice models in real-time.

Google just released a super capable small model that I can run on my M3 Pro in real time, with vision too! Sure, you can't do agentic coding with this, but it is a game-changer for people learning a new language. Imagine, a few years from now, people running this locally on their phones: they can point their camera at objects and talk about them. And the model is multilingual, so people can always fall back to their native language if they want. This is essentially what OpenAI demoed a few years ago.

Browser (mic + camera)
    │
    │  WebSocket (audio PCM + JPEG frames)
    ▼
FastAPI server
    ├── Gemma 4 E2B via LiteRT-LM (GPU)  →  understands speech + vision
    └── Kokoro TTS (MLX on Mac, ONNX on Linux)  →  speaks back
    │
    │  WebSocket (streamed audio chunks)
    ▼
Browser (playback + transcript)
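The browser-to-server leg above carries PCM audio chunks and JPEG camera frames over a single WebSocket. A minimal sketch of what such client messages might look like, reimagined in Python (the framing and field names `type`, `audio`, and `frame` are my assumptions, not Parlor's documented protocol):

```python
# Hypothetical message framing for the audio/video WebSocket leg.
# Parlor's actual wire format may differ; this only illustrates the idea
# of multiplexing PCM audio and JPEG frames over one connection.
import base64
import json

def audio_message(pcm_bytes: bytes) -> str:
    """Wrap a chunk of 16-bit PCM audio (assumed framing)."""
    return json.dumps({"type": "audio",
                       "audio": base64.b64encode(pcm_bytes).decode()})

def frame_message(jpeg_bytes: bytes) -> str:
    """Wrap one JPEG camera frame (assumed framing)."""
    return json.dumps({"type": "frame",
                       "frame": base64.b64encode(jpeg_bytes).decode()})

# A real client would open ws://localhost:8000/... and send these messages
# while browser-side VAD gates the microphone stream.
msg = audio_message(b"\x00\x01" * 160)  # 10 ms of 16 kHz mono PCM
print(json.loads(msg)["type"])  # → audio
```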
Features:
  • Voice Activity Detection in the browser (Silero VAD). Hands-free, no push-to-talk.
  • Barge-in. Interrupt the AI mid-sentence by speaking.
  • Sentence-level TTS streaming. Audio starts playing before the full response is generated.

Requirements:
  • Python 3.12+
  • macOS with Apple Silicon, or Linux with a supported GPU
  • ~3 GB free RAM for the model
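Sentence-level TTS streaming boils down to flushing each completed sentence to the TTS engine as soon as its terminator arrives, instead of waiting for the full model response. A minimal sketch under that assumption (this is not Parlor's actual implementation, and real sentence segmentation must handle abbreviations and decimals that this regex ignores):

```python
# Split a token stream into sentences as they complete, so TTS can start
# speaking the first sentence while later ones are still being generated.
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")  # split after ., !, ? + space

def stream_sentences(token_stream):
    """Yield complete sentences as tokens arrive from the model."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():          # flush whatever remains at end of stream
        yield buffer.strip()

tokens = ["Hello", " there.", " How", " are", " you?", " Fine."]
print(list(stream_sentences(tokens)))
# → ['Hello there.', 'How are you?', 'Fine.']
```

Each yielded sentence would then be handed to the TTS backend (Kokoro here) and its audio streamed to the browser immediately.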
```
git clone https://github.com/fikrikarim/parlor.git
cd parlor

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

cd src
uv sync
uv run server.py
```

Open http://localhost:8000, grant camera and microphone access, and start talking.

Models are downloaded automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).

| Variable | Default | Description |
| --- | --- | --- |
| `MODEL_PATH` | auto-download from HuggingFace | Path to a local `gemma-4-E2B-it.litertlm` file |
| `PORT` | `8000` | Server port |
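A sketch of how a server might read these two variables with their defaults (the env-var names come from the table above; the default-handling code itself is my assumption, not Parlor's actual `server.py`):

```python
# Read MODEL_PATH and PORT from the environment, falling back to the
# documented defaults (auto-download and port 8000).
import os

def load_config(env=os.environ):
    return {
        "model_path": env.get("MODEL_PATH"),   # None → auto-download from HuggingFace
        "port": int(env.get("PORT", "8000")),  # default server port
    }

print(load_config({}))  # → {'model_path': None, 'port': 8000}
```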

Performance (Apple M3 Pro)

| Stage | Time |
| --- | --- |
| Speech + vision understanding | ~1.8-2.2 s |
| Response generation (~25 tokens) | ~0.3 s |
| Text-to-speech (1-3 sentences) | ~0.3-0.7 s |
| Total end-to-end | ~2.5-3.0 s |

Decode speed: ~83 tokens/sec on GPU (Apple M3 Pro).
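A quick sanity check on these numbers: at ~83 tokens/sec, a ~25-token response takes about 0.3 s, matching the table, and summing the per-stage ranges roughly brackets the quoted ~2.5-3.0 s total:

```python
# Consistency check on the benchmark figures above.
tokens, decode_rate = 25, 83.0
gen_time = tokens / decode_rate
print(round(gen_time, 2))  # → 0.3

low = 1.8 + gen_time + 0.3   # fastest case of each stage
high = 2.2 + gen_time + 0.7  # slowest case of each stage
print(round(low, 1), round(high, 1))  # → 2.4 3.2
```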

src/
├── server.py              # FastAPI WebSocket server + Gemma 4 inference
├── tts.py                 # Platform-aware TTS (MLX on Mac, ONNX on Linux)
├── index.html             # Frontend UI (VAD, camera, audio playback)
├── pyproject.toml         # Dependencies
└── benchmarks/
    ├── bench.py           # End-to-end WebSocket benchmark
    └── benchmark_tts.py   # TTS backend comparison

Apache 2.0
