OpenAI charges by the minute, so speed up your audio

Original link: https://george.mand.is/2025/06/openai-charges-by-the-minute-so-make-the-minutes-shorter/

To transcribe audio faster and more cheaply with OpenAI, try speeding up the audio file before uploading it. Using ffmpeg to speed the audio up to 2x or 3x shortens its duration and therefore reduces the number of tokens used for transcription, especially with the gpt-4o-transcribe model. This can lower costs significantly (about 33% at 3x speed) with minimal impact on transcription quality. Keep in mind that output tokens account for most of the cost. The author found that 2x or 3x strikes a good balance between efficiency and accuracy. While 4x may introduce too much distortion to transcribe reliably, experimenting with different speeds is encouraged. With the gpt-4o-transcribe model, the cost of a 40-minute file can drop from $0.24 to $0.08 by speeding up the audio. This simple trick saves both time and money while still producing a usable transcript.

This Hacker News thread discusses a technique for lowering the cost of OpenAI's transcription API, which charges by the minute. The original poster sped up audio files with ffmpeg, effectively packing more speech into each minute and reducing the overall cost. Commenters explored related ideas, such as stripping silence to shorten the audio further and improve transcription quality, and using cheaper alternative transcription services such as Groq or local Whisper models. The discussion also touched on optimal listening speeds, with some users finding higher speeds beneficial and others preferring slower speeds for better comprehension. Several commenters shared tools and techniques for extracting subtitles from YouTube videos, including yt-dlp and unofficial YouTube APIs. The accuracy of transcripts generated at different speeds was debated, with some suggesting LLM-based analysis to judge overall summary quality rather than simple word-for-word comparison. Overall, the thread highlights a range of strategies for optimizing audio transcription workflows and minimizing costs.

Original article

• ~2,000 words • 9 minute read

Want to make OpenAI transcriptions faster and cheaper? Just speed up your audio.

I mean that very literally. Run your audio through ffmpeg at 2x or 3x before transcribing it. You’ll spend fewer tokens and less time waiting with almost no drop in transcription quality.

That’s it!

Here’s a script combining all my favorite little toys and tricks to get the job done. You’ll need yt-dlp, ffmpeg and llm installed.

# Extract the audio from the video
yt-dlp -f 'bestaudio[ext=m4a]' --extract-audio --audio-format m4a -o 'video-audio.m4a' "https://www.youtube.com/watch?v=LCEmiRjPEtQ" -k;

# Create a low-bitrate MP3 version at 3x speed
ffmpeg -i "video-audio.m4a" -filter:a "atempo=3.0" -ac 1 -b:a 64k video-audio-3x.mp3;

# Send it along to OpenAI for a transcription
curl --request POST \
  --url https://api.openai.com/v1/audio/transcriptions \
  --header "Authorization: Bearer $OPENAI_API_KEY" \
  --header 'Content-Type: multipart/form-data' \
  --form file=@video-audio-3x.mp3 \
  --form model=gpt-4o-transcribe > video-transcript.txt;

# Get a nice little summary

cat video-transcript.txt | llm --system "Summarize the main points of this talk."

I just saved you time by jumping straight to the point, but read on if you want more of a story about how I accidentally discovered this while trying to summarize a 40-minute talk from Andrej Karpathy.

Also read on if you’re wondering why I didn’t just use the built-in auto-transcription that YouTube provides, though the short answer there is easy: I’m sort of a doofus and thought—incorrectly—it wasn’t available. So I did things the hard way.

I Just Wanted the TL;DW(atch)

A former colleague of mine sent me this talk from Andrej Karpathy about how AI is changing software. I wasn’t familiar with Andrej, but saw he’d worked at Tesla. That coupled with the talk being part of a Y Combinator series and 40 minutes made me think “Ugh. Do I… really want to watch this? Another 'AI is changing everything' talk from the usual suspects, to the usual crowds?”

If ever there were a use-case for dumping something into an LLM to get the gist of it and walk away, this felt like it. I respected the person who sent it to me though and wanted to do the noble thing: use AI to summarize the thing for me, blindly trust it and engage with the person pretending I had watched it.

My first instinct was to pipe the transcript into an LLM and get the gist of it. This script is the one I would previously reach for to pull the auto-generated transcripts from YouTube:

yt-dlp --all-subs --skip-download \
  --sub-format ttml/vtt/best \
  [url]

For some reason though, no subtitles were downloaded. I kept running into an error!

Later, after some head-scratching and rereading the documentation, I realized my version (2025.04.03) was outdated.

Long story short: Updating to the latest version (2025.06.09) fixed it, but for some reason I did not try this before going down a totally different rabbit hole. I guess I got this little write-up and exploration out of it though.
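(Depending on how you installed it, updating is a one-liner; a couple of common routes, pick whichever matches your setup:)

# Self-update, if you're using the standalone binary
yt-dlp -U

# Or, if you installed it with pip
pip install -U yt-dlp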

If you care more about summarizing transcripts and less about the vagaries of audio-transcriptions and tokens, this is the correct answer and your off-ramp.

My Transcription Workflow

I already had an old, home-brewed script that would extract the audio from any video URL, pipe it through whisper locally and dump the transcription in a text file.
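The gist of it, as a minimal sketch (assuming the openai-whisper CLI is installed; the exact flags and filenames in my script differ):

# Pull just the audio from the video URL
yt-dlp -f 'bestaudio[ext=m4a]' --extract-audio --audio-format m4a -o 'audio.m4a' "$1";

# Transcribe it locally with Whisper and drop a .txt next to it
whisper audio.m4a --model base --output_format txt --output_dir .;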

That worked, but I was on dwindling battery power in a coffee shop. Not ideal for longer, local inference, mighty as my M3 MacBook Air still feels to me. I figured I would try offloading it to OpenAI’s API instead. Surely that would be faster?

Testing OpenAI’s Transcription Tools

Okay, using the whisper-1 model it’s still pretty slow, but it gets the job done. Had I opted for the model I knew and moved on, the story might end here.
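(For reference, the whisper-1 call is the same endpoint as in the script up top, just with a different model parameter; a minimal sketch:)

curl --request POST \
  --url https://api.openai.com/v1/audio/transcriptions \
  --header "Authorization: Bearer $OPENAI_API_KEY" \
  --header 'Content-Type: multipart/form-data' \
  --form file=@video-audio.m4a \
  --form model=whisper-1 > video-transcript.txt;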

However, out of curiosity, I went straight for the newer gpt-4o-transcribe model first. It’s built to handle multimodal inputs and promises faster responses.

I quickly hit another roadblock: there’s a 25-minute audio limit and my audio was nearly 40 minutes long.
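(If you want to check where your own file lands, ffprobe will print the duration in seconds:)

# Print the audio duration in seconds
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 video-audio.m4a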

Let's Try Something Obvious

At first I thought about trimming the audio to fit somehow, but there wasn’t an obvious 14 minutes to cut. Trimming the beginning and end would give me a minute or so at most.

An interesting, weird idea I thought about for a second but never tried was cutting a chunk or two out of the middle. Maybe I would somehow still have enough info for a relevant summary?

Then it crossed my mind—what if I just sped up the audio before sending it over? People listen to podcasts at accelerated 1-2x speeds all the time.

So I wrote a quick script:

ffmpeg -i video-audio.m4a -filter:a "atempo=2.0" -ac 1 -b:a 64k video-audio-2x.mp3
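(One caveat: older ffmpeg builds cap the atempo filter at 2.0 per instance, so if atempo=3.0 errors out for you, chain two filters instead; a sketch:)

# Equivalent 3x speed-up by chaining filters: 2.0 x 1.5 = 3.0
ffmpeg -i video-audio.m4a -filter:a "atempo=2.0,atempo=1.5" -ac 1 -b:a 64k video-audio-3x.mp3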

Ta-da! Now I had something closer to a 20-minute file to send to OpenAI.

I uploaded it and… it worked like a charm! Behold the summary bestowed upon me that gave me enough confidence to reply to my colleague as though I had watched it.

But there was something... interesting here. Did I just stumble across a sort of obvious, straightforward hack? Is everyone in the audio-transcription business already doing this and am I just haphazardly bumbling into their secrets?

I had to dig deeper.

Why This Works: Our Brains Forgive, and So Does AI

There’s an interesting parallel here in my mind with optimizing images. Traditionally you have lossy and lossless file formats. A lossy file format kind of gives away the game in its description—the further you crunch and compact the bytes, the more fidelity you’re going to lose. It works because the human brain just isn’t likely to pick up on the artifacts and imperfections.

But even with a “lossless” file format there are tricks you can lean into that rely on the limits of human perception. One of the primary ways you can do that with a PNG or GIF is reducing the number of unique colors in the palette. You’d be surprised by how often a palette of 64 colors or fewer might actually be enough and perceived as significantly more.

There’s also a parallel in my head between this and the brain’s ability to still comprehend text with spelling mistakes, dropped words and other errors, i.e. transposed letter effects. Our brains have a knack for filling in the gaps, and when you go looking through the world with a magnifying glass you’ll start to notice lots of them.

Speeding up the audio starts to drop the more subtle sounds and occasionally shorter words from the audio, but it doesn’t seem to hurt my ability to comprehend what I’m hearing—even if I do have to focus. These audio transcription models seem to be pretty good at this as well.

Wait—how far can I push this? Does It Actually Save Money?

Turns out yes. OpenAI charges for transcription based on audio tokens, which scale with the duration of the input. Faster audio = fewer seconds = fewer tokens.

Here are some rounded numbers based on the 40-minute audio file breaking down the audio input and text output token costs:

Speed           Duration (seconds)   Audio Input Tokens   Input Token Cost   Output Token Cost
1x (original)   2,372                NA (too long)        NA                 NA
2x              1,186                11,856               $0.07              $0.02
3x              791                  7,904                $0.04              $0.02

That’s a solid 33% price reduction on input tokens at 3x compared to 2x! However, the bulk of your costs for these transcription models is still going to be the output tokens. Those are priced at $10 per 1M tokens, whereas audio input tokens are priced at $6 per 1M tokens as of the time of this writing.

Also interesting to note—my output tokens for the 2x and 3x versions were exactly the same: 2,048. This kind of makes sense, I think? To the extent the output tokens are a reflection of that model’s ability to understand and summarize the input, my takeaway is that a “summarized” (i.e. reduced-token) version of the same audio yields the same amount of comprehensibility.

This is also probably a reflection of the 4,096 token ceiling on transcriptions generally when using the gpt-4o-transcribe model. I suspect half the context window is reserved for the output tokens and this is basically reflecting our request using it up in its entirety. I suspect we might get diminishing results with longer transcriptions.

But back to money.

So the back-of-the-envelope calculator for a single transcription looks something like this:

6 * (audio_input_tokens / 1_000_000) + 10 * (text_output_tokens / 1_000_000);

That does not quite seem to jibe with the estimated cost of $0.006 per minute stated on the pricing page, at least for the 2x speed. That version (19-20 minutes) seemed to cost about $0.09 whereas the 3x version (13 minutes) cost about $0.07 (pretty accurate actually), if I’m adding up the tokens correctly.

# Pricing for 2x speed
6 * (11_856 / 1_000_000) + 10 * (2_048 / 1_000_000) = 0.09

# Pricing for 3x speed
6 * (7_904 / 1_000_000) + 10 * (2_048 / 1_000_000) = 0.07
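(If you want to redo this arithmetic on your own numbers, a tiny hypothetical shell helper does the trick; the $6 and $10 per-million rates are the ones quoted above:)

# Hypothetical helper: cost = $6 per 1M audio input tokens + $10 per 1M text output tokens
transcription_cost() {
  awk -v in_tok="$1" -v out_tok="$2" 'BEGIN { printf "$%.2f\n", 6 * in_tok / 1e6 + 10 * out_tok / 1e6 }'
}

transcription_cost 11856 2048   # 2x speed: $0.09
transcription_cost 7904 2048    # 3x speed: $0.07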

It would seem that estimate isn’t just based on the length of the audio but also some assumptions around how many tokens per minute are going to be generated from a normal speaking cadence.

That’s… kind of fascinating! I wonder how John Moschitta feels about this.

Comparing these costs to whisper-1 is easy because the pricing table more confidently advertises the cost—not “estimated” cost—as a flat $0.006 per minute. I’m assuming that’s per minute of audio processed, not per minute of inference.

The gpt-4o-transcribe model actually compares pretty favorably.

Speed   Duration        Cost (whisper-1)
1x      2,372 seconds   $0.24
2x      1,186 seconds   $0.12
3x      791 seconds     $0.08

Does This Save Money?

In short, yes! It’s not particularly rigorous, but it seems like we reduced the cost of transcribing our 40-minute audio file by 23% from $0.09 to $0.07 simply by speeding up the audio.

If we could compare to a 1x version of the audio file trimmed to the 25-minute limit, I bet we could paint an even more impressive picture of cost reduction. We kind of can with the whisper-1 chart. You could make the case this technique reduced costs by 67%!
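(If you wanted to run that comparison yourself, trimming to the limit is a one-liner; a sketch, where -t 1500 caps the output at 25 minutes:)

ffmpeg -i video-audio.m4a -t 1500 -c copy video-audio-25min.m4a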

Is It Accurate?

I don’t know—I didn’t watch it, lol. That was the whole point. And if that answer makes you uncomfortable, buckle up for this future we're hurtling toward. Boy, howdy.

More helpfully, I didn’t compare word-for-word, but spot checks on the 2x and 3x versions looked solid. 4x speed was too fast—the transcription started getting hilariously weird. So, 2x and 3x seem to be the sweet spot between efficiency and fidelity, though it will obviously depend on how fast the people are speaking in the first place.
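(One low-effort way to sanity-check beyond spot checks: feed both transcripts back to an LLM and ask it to flag substantive disagreements. A sketch, with hypothetical filenames for the 2x and 3x transcripts:)

cat video-transcript-2x.txt video-transcript-3x.txt | llm --system "These are two transcripts of the same talk. Do they disagree on any substantive points?"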

Why Not 4x?

When I pushed it to 4x the results became comically unusable.

[Image: output of a 4x transcription mostly repeating "And how do we talk about that?" over and over again]

That sure didn't stop my call to summarize from trying though.

Hey, not the worst talk I've been to!

In Summary

In short: to save time and money, consider doubling or tripling the speed of the audio you want to transcribe. The trade-off, as always, is fidelity, but the savings aren’t insignificant.

Simple, fast, and surprisingly effective.

TL;DR

  • OpenAI charges for transcriptions based on audio duration (whisper-1) or tokens (gpt-4o-transcribe).
  • You can speed up audio with ffmpeg before uploading to save time and money.
  • This reduces audio tokens (or duration), lowering your bill.
  • 2x or 3x speed works well.
  • 4x speed? Probably too much—but fun to try.

If you find problems with my math, have questions, or have found a more rigorous study comparing transcription quality across different speeds, please get in touch! Or if you thought this was so cool you want to hire me for something fun...

--

Published on Tuesday, June 24th 2025. Read this post in Markdown or plain-text.

If you enjoyed this, consider signing up for my newsletter.
