I Hate Go, but It Saved My Startup: An Architectural Autopsy

原始链接: http://audiotext.live/blog/posts/hate-go-saved-startup-architecture/

## From Snob to Shipped: Why I Built a SaaS Platform in Go

Although I am an experienced programmer who prefers languages like F#, Haskell, and Rust, I reluctantly chose Go to build AudioText Live, a real-time audio intelligence platform aimed at undercutting Twilio's pricing. My reason was not affection for Go (I find it somewhat dated) but pragmatism: Go's simplicity makes AI code generation remarkably accurate, which dramatically speeds up development and is critical for a solo developer. Its fast compile times were another big win coming from Rust.

Instead of a traditional monolith, I chose a microservices architecture (more than 15 services running on k3s) to isolate critical audio-processing tasks such as recording, transcription, and analysis, so that a failure in one domain cannot spread to the others. The core pipeline uses WebSockets, NATS JetStream for messaging, Soniox for transcription, and Gemini for summarization.

The biggest challenge was not the code itself but configuring NATS correctly and fighting unexpected timeout behavior in the Watermill library. In the end, AudioText Live achieves a significant cost advantage ($0.30/hour versus Twilio's $1.50/hour) by decoupling services and processing audio asynchronously for better accuracy. I still prefer "purer" languages, but Go let me ship a complex, distributed SaaS platform, proof that pragmatism sometimes beats purity.

A solo developer and self-described functional-programming enthusiast (Rust/Scala) reluctantly used Go to quickly build a distributed audio-ingestion system, ultimately cutting costs by 80% compared with Twilio. The author found that LLMs excel at writing Go, thanks to the language's simplicity and the large volume of training data, which enabled rapid development despite his initial reservations. However, the author admits to heavy use of an LLM to "polish" the accompanying article, which sparked debate about its authenticity. Several users reported CSS loading problems across browsers (Firefox, Chrome, Safari) on both mobile and desktop, possibly related to Cloudflare's automatic minification and SRI tags. The author is actively debugging these issues and welcomes questions about the project's architecture. Despite concerns about the article's provenance, the post generated discussion and appreciation for the views the author shared.

Original Text

I am a programming language snob. There, I said it.

I grew up on Assembly and C, moved through the heavyweights (Java, C#), and eventually found enlightenment in the world of Functional Programming. I write F#, Haskell, Scala, Elixir, and Rust. I dream in immutable state and pattern matching. I despise side effects.

So, naturally, when I decided to build a high-performance, real-time audio intelligence platform in my spare time, I chose… Go.

I don’t like Go. It feels like a language designed in the 1980s that just woke up from a coma. It has no ternary operators (seriously?). It has generics now, but in my entire codebase, I haven’t used them once. It has nil pointers that panic. It forces me to write if err != nil until my fingers bleed.

But this article isn’t a rant about Go. It’s an admission of defeat. Because despite my snobbery, Go and specifically the combination of Go + LLMs + Microservices is the only reason I was able to ship a complex, distributed SaaS platform as a solo developer with a day job.

Here is the story of AudioText Live, why I broke the “Build a Monolith” rule, and how I accidentally built a Twilio competitor for 1/5th the price.

## The Problem: “Worse is Better” for AI

My initial goal was to build a live prompter for interviewers using OBS and RTMP. That didn’t work out (yet), but it left me with a stack capable of ingesting live audio streams. I pivoted to telephony because I looked at my old Twilio bills and realized they were charging $1.50/hour for “Voice Intelligence” (transcription + analysis).

I knew I could do it cheaper. The raw cost of high-quality transcription (via Soniox) and LLM summarization (via Gemini) is closer to $0.30/hour. The margin was massive, provided I could build the plumbing to handle the data reliably.

I chose Go for two reasons, neither of which was “I like the syntax”:

  1. AI writes excellent Go. Because Go is simple, verbose, and rigid, Large Language Models (LLMs) rarely hallucinate syntax errors. If I ask an AI to write a Rust macro or complex Haskell monad, it struggles. If I ask it to write a Go struct and a JSON unmarshaller, it gets it right 99% of the time.
  2. Compilation speed. After wrestling with Rust compile times, Go feels like a scripting language, and that is a breath of fresh air.

As a solo dev, velocity is my only currency. I don’t write code anymore; I architect code and let the AI write the boilerplate. Go is the perfect language for this workflow.

## The Architecture: Breaking the “Monolith First” Rule

Standard startup advice is “Build a Monolith.” I ignored that. I built a distributed system with more than 15 microservices running on k3s (Kubernetes).

Why? Because audio processing is fundamentally different from a REST API.

  • The API Service needs to be stateless and responsive.
  • The Recorder needs to hold open WebSocket connections and write raw bytes to disk without garbage collection pauses causing dropped frames.
  • The Transcriber needs to process massive WAV files asynchronously.

If I put this all in one binary, a memory leak in the transcription logic could kill active recording sessions. Isolation was non-negotiable.
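On k3s, that isolation falls out of giving each service its own Deployment with its own resource limits. A hypothetical manifest fragment for the recorder (names, image, and numbers are illustrative, not from the source):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: recorder
spec:
  replicas: 2
  selector:
    matchLabels:
      app: recorder
  template:
    metadata:
      labels:
        app: recorder
    spec:
      containers:
        - name: recorder
          image: registry.example.com/recorder:latest  # hypothetical image
          resources:
            limits:
              memory: "256Mi"  # a leak here OOM-kills this pod only,
              cpu: "500m"      # never the transcriber or the API
```

A runaway transcription job can be OOM-killed and restarted without a single active recording session noticing.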

Here is the 10,000-foot view of the pipeline:

  1. Ingestion: Twilio (or Telnyx/SignalWire) connects via WebSocket to twilio_media_streams.
  2. Buffering: We merge audio packets and shove them into NATS JetStream immediately.
  3. Recording: The recorder service consumes the stream and writes raw PCM to disk.
  4. Post-Processing: Once the call ends, post_recorder converts raw audio to WAV and uploads to S3.
  5. Intelligence: The stt_async_service grabs the WAV, sends it to Soniox for diarization, and then triggers the summarizer (Gemini) and embedding_importer (Qdrant).

## The War Story: NATS, Watermill, and The Timeout From Hell

I needed a message bus. I had used Kafka and RabbitMQ before, but they felt too heavy for a single node. I chose NATS JetStream because it’s written in Go and supports persistent streams.

I hate it.

Okay, I don’t hate NATS, but the learning curve for configuring it correctly is a vertical wall. You have Streams, Subjects, Consumers, Delivery Groups, and Ack Policies. If you get one wrong, your messages vanish into the void or get redelivered infinitely.

I decided to use Watermill, a Go library that abstracts away the underlying Pub/Sub. In hindsight, this was a mistake. Watermill is great, but it hides the knobs you need to turn when things go wrong.

The biggest headache? Backoff strategies.

I wanted a simple retry logic: If the transcription API is down, wait 1 second, then 10 seconds, then 1 minute.

In Watermill (with NATS), if you define a backoff policy, the first duration in your backoff slice becomes the hard timeout for the message ack.

If I set my first retry to 100ms, Watermill tells NATS: “If you don’t hear back from the consumer in 100ms, assume it failed.” But my recorder service takes longer than 100ms just to flush buffers to disk! I spent weeks debugging why my messages were being redelivered constantly, creating duplicate recordings.

I eventually had to write my own migration scripts to forcefully update the JetStream consumer configs, decoupling the AckWait from the BackOff:

// Strategy: External API / Slow Tasks (AI, Email, Webhooks)
// OLD: 1s (Causes immediate timeout & hammering)
// NEW: 45s (Plenty of time for AI generation or SMTP handshake)
backoffExternalAPI := []time.Duration{
    5 * time.Minute,  // Attempt 1 Window (Happy Path)
    7 * time.Minute,  // Retry 1 (Short outage)
    10 * time.Minute, // Retry 2 (Medium outage)
}

This is the unglamorous reality of distributed systems. It’s not about the code; it’s about the config.

## The “Secret Sauce”: $0.30 vs $1.50

The core value proposition of AudioText Live isn’t the code; it’s the arbitrage.

Twilio Intelligence is amazing, but it’s bundled. You pay for the convenience. By unbundling the stack, handling the WebSocket stream myself, storing the audio in S3, and routing the transcription to a dedicated provider like Soniox, I can offer the same (or better) accuracy for a fraction of the price.

The stt_async_service is where the money is saved. Instead of transcribing in real-time (which is expensive and lacks context), we record the full call and process it asynchronously. This allows Soniox to perform global speaker diarization (identifying who spoke when across the entire conversation), which is impossible to do perfectly in real-time.

// stt_async_service/service.go
// We calculate usage in 50-unit blocks to ensure granular billing
usageUnits := int(durationMinutes * 50)

// Idempotency is key. If the service crashes, we don't want to double-charge.
idempotencyKey := fmt.Sprintf("transcription_%s", transcriptionID)

## Conclusion: Purity is a Luxury

If I had stuck to my guns and written this in Haskell or Rust, I would still be fighting the borrow checker or debugging monad transformers.

By embracing Go, a language I find theoretically boring, I let the AI do the heavy lifting. I focused on the architecture: the resiliency of the NATS streams, the idempotency of the ledger, and the security of the Kratos authentication.

The result is a platform that scales, saves money, and actually shipped.

And hey, the compilation times are pretty great.
