I am a programming language snob. There, I said it.
I grew up on Assembly and C, moved through the heavyweights (Java, C#), and eventually found enlightenment in the world of Functional Programming. I write F#, Haskell, Scala, Elixir, and Rust. I dream in immutable state and pattern matching. I despise side effects.
So, naturally, when I decided to build a high-performance, real-time audio intelligence platform in my spare time, I chose… Go.
I don’t like Go. It feels like a language designed in the 1980s that just woke up from a coma. It has no ternary operator (seriously?). It has generics now, but in my entire codebase, I haven’t used them once. It has nil pointers that panic. It forces me to write `if err != nil` until my fingers bleed.
But this article isn’t a rant about Go. It’s an admission of defeat. Because despite my snobbery, Go, and specifically the combination of Go + LLMs + microservices, is the only reason I was able to ship a complex, distributed SaaS platform as a solo developer with a day job.
Here is the story of AudioText Live, why I broke the “Build a Monolith” rule, and how I accidentally built a Twilio competitor for 1/5th the price.
## The Problem: “Worse is Better” for AI
My initial goal was to build a live prompter for interviewers using OBS and RTMP. That didn’t work out (yet), but it left me with a stack capable of ingesting live audio streams. I pivoted to telephony because I looked at my old Twilio bills and realized they were charging $1.50/hour for “Voice Intelligence” (transcription + analysis).
I knew I could do it cheaper. The raw cost of high-quality transcription (via Soniox) and LLM summarization (via Gemini) is closer to $0.30/hour. The margin was massive, provided I could build the plumbing to handle the data reliably.
I chose Go for two reasons, neither of which was “I like the syntax”:
- AI writes excellent Go. Because Go is simple, verbose, and rigid, Large Language Models (LLMs) rarely hallucinate syntax errors. If I ask an AI to write a Rust macro or complex Haskell monad, it struggles. If I ask it to write a Go struct and a JSON unmarshaller, it gets it right 99% of the time.
- Compilation speed. After wrestling with Rust compile times, Go feels like a scripting language. It’s a breath of fresh air.
As a solo dev, velocity is my only currency. I don’t write code anymore; I architect code and let the AI write the boilerplate. Go is the perfect language for this workflow.
## The Architecture: Breaking the “Monolith First” Rule
Standard startup advice is “Build a Monolith.” I ignored that. I built a distributed system with more than 15 microservices running on k3s (Kubernetes).
Why? Because audio processing is fundamentally different from a REST API.
- The API Service needs to be stateless and responsive.
- The Recorder needs to hold open WebSocket connections and write raw bytes to disk without garbage collection pauses causing dropped frames.
- The Transcriber needs to process massive WAV files asynchronously.
If I put this all in one binary, a memory leak in the transcription logic could kill active recording sessions. Isolation was non-negotiable.
Here is the 10,000-foot view of the pipeline:
- Ingestion: Twilio (or Telnyx/SignalWire) connects via WebSocket to `twilio_media_streams`.
- Buffering: We merge audio packets and shove them into NATS JetStream immediately.
- Recording: The `recorder` service consumes the stream and writes raw PCM to disk.
- Post-Processing: Once the call ends, `post_recorder` converts the raw audio to WAV and uploads it to S3.
- Intelligence: The `stt_async_service` grabs the WAV, sends it to Soniox for diarization, and then triggers the `summarizer` (Gemini) and `embedding_importer` (Qdrant).
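The ingestion step above largely boils down to parsing Twilio’s Media Streams frames. A minimal sketch of what a service like `twilio_media_streams` has to do per WebSocket message, assuming the documented `media` event shape (base64-encoded audio in `media.payload`); the helper name and error handling are mine:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// mediaFrame mirrors the relevant parts of a Twilio Media Streams
// "media" event; the payload field carries base64-encoded audio.
type mediaFrame struct {
	Event string `json:"event"`
	Media struct {
		Payload string `json:"payload"`
	} `json:"media"`
}

// decodeFrame extracts raw audio bytes from one WebSocket message.
// Non-media events (start, stop, mark) are skipped by returning nil.
func decodeFrame(msg []byte) ([]byte, error) {
	var f mediaFrame
	if err := json.Unmarshal(msg, &f); err != nil {
		return nil, err
	}
	if f.Event != "media" {
		return nil, nil
	}
	return base64.StdEncoding.DecodeString(f.Media.Payload)
}

func main() {
	msg := []byte(`{"event":"media","media":{"payload":"` +
		base64.StdEncoding.EncodeToString([]byte{0xFF, 0x7F}) + `"}}`)
	audio, err := decodeFrame(msg)
	if err != nil {
		panic(err)
	}
	// These raw bytes are what gets merged and pushed into JetStream.
	fmt.Println(len(audio)) // 2
}
```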
## The War Story: NATS, Watermill, and The Timeout From Hell
I needed a message bus. I had used Kafka and RabbitMQ before, but they felt too heavy for a single node. I chose NATS JetStream because it’s written in Go and supports persistent streams.
I hate it.
Okay, I don’t hate NATS, but the learning curve for configuring it correctly is a vertical wall. You have Streams, Subjects, Consumers, Delivery Groups, and Ack Policies. If you get one wrong, your messages vanish into the void or get redelivered infinitely.
I decided to use Watermill, a Go library that abstracts away the underlying Pub/Sub. In hindsight, this was a mistake. Watermill is great, but it hides the knobs you need to turn when things go wrong.
The biggest headache? Backoff strategies.
I wanted a simple retry logic: If the transcription API is down, wait 1 second, then 10 seconds, then 1 minute.
In Watermill (with NATS), if you define a backoff policy, the first duration in your backoff slice becomes the hard timeout for the message ack.
If I set my first retry to 100ms, Watermill tells NATS: “If you don’t hear back from the consumer in 100ms, assume it failed.” But my recorder service takes longer than 100ms just to flush buffers to disk! I spent weeks debugging why my messages were being redelivered constantly, creating duplicate recordings.
I eventually had to write my own migration scripts to forcefully update the JetStream consumer configs, decoupling the AckWait from the BackOff:
```go
// Strategy: External API / Slow Tasks (AI, Email, Webhooks)
// OLD AckWait: 1s (caused immediate timeouts & hammering)
// NEW AckWait: 45s (plenty of time for AI generation or SMTP handshake)
backoffExternalAPI := []time.Duration{
	5 * time.Minute,  // Attempt 1 window (happy path)
	7 * time.Minute,  // Retry 1 (short outage)
	10 * time.Minute, // Retry 2 (medium outage)
}
```

This is the unglamorous reality of distributed systems. It’s not about the code; it’s about the config.
## The “Secret Sauce”: $0.30 vs $1.50
The core value proposition of AudioText Live isn’t the code; it’s the arbitrage.
Twilio Intelligence is amazing, but it’s bundled. You pay for the convenience. By unbundling the stack, handling the WebSocket stream myself, storing the audio in S3, and routing the transcription to a dedicated provider like Soniox, I can offer the same (or better) accuracy for a fraction of the price.
The stt_async_service is where the money is saved. Instead of transcribing in real-time (which is expensive and lacks context), we record the full call and process it asynchronously. This allows Soniox to perform global speaker diarization (identifying who spoke when across the entire conversation), which is impossible to do perfectly in real-time.
```go
// stt_async_service/service.go

// We calculate usage in 50-unit blocks to ensure granular billing.
usageUnits := int(durationMinutes * 50)

// Idempotency is key. If the service crashes, we don't want to double-charge.
idempotencyKey := fmt.Sprintf("transcription_%s", transcriptionID)
```

## Conclusion: Purity is a Luxury
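An idempotency key only helps if something enforces it. The real ledger lives behind a database constraint, but a minimal in-memory sketch of the guard (type and method names are illustrative) shows the shape of the check:

```go
package main

import (
	"fmt"
	"sync"
)

// BillingLedger records which idempotency keys have already been charged.
// A production version would back this with a unique constraint in the DB
// rather than an in-memory map.
type BillingLedger struct {
	mu      sync.Mutex
	charged map[string]bool
}

func NewBillingLedger() *BillingLedger {
	return &BillingLedger{charged: make(map[string]bool)}
}

// Charge bills usageUnits exactly once per key; repeat calls are no-ops,
// so a crashed-and-redelivered message cannot double-charge a customer.
func (l *BillingLedger) Charge(key string, usageUnits int) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.charged[key] {
		return false // already billed: this is a redelivery
	}
	l.charged[key] = true
	return true
}

func main() {
	ledger := NewBillingLedger()
	key := "transcription_abc123"
	fmt.Println(ledger.Charge(key, 150)) // true: first delivery bills
	fmt.Println(ledger.Charge(key, 150)) // false: redelivery is a no-op
}
```

Combined with the JetStream redelivery behavior above, this is what makes at-least-once delivery safe for billing.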
If I had stuck to my guns and written this in Haskell or Rust, I would still be fighting the borrow checker or debugging monad transformers.
By embracing Go, a language I find theoretically boring, I let the AI do the heavy lifting. I focused on the architecture: the resiliency of the NATS streams, the idempotency of the ledger, and the security of the Kratos authentication.
The result is a platform that scales, saves money, and actually shipped.
And hey, the compilation times are pretty great.