Why Memory Stocks Crashed Today: TurboQuant Just Changed The Game With "Google's DeepSeek Moment"

原始链接: https://www.zerohedge.com/markets/why-memory-stocks-crashed-today-turboquant-just-changed-game-googles-deepseek-moment

Despite a broadly positive session for stocks, memory-chip names (such as Micron and SanDisk) notably lagged, prompting plenty of "sanity checking" among investors. The drop stemmed from Google Research's release of "TurboQuant," a new compression algorithm for AI models. TurboQuant dramatically reduces the memory needed for AI inference, by as much as 6x, without sacrificing accuracy. The breakthrough threatens to erode the massive demand for memory chips that has recently driven prices higher. The algorithm focuses on compressing the "KV cache," a key bottleneck in AI processing, and uniquely avoids the accuracy loss typically associated with compression. Although still at the research stage (slated for presentation in 2026), the market reacted swiftly on fears that demand could collapse. Analysts believe this could hurt companies currently benefiting from high memory prices and even challenge the narrative of sustained DRAM/NAND demand. The situation echoes past "middle-out"-style innovations in which software improvements reduced hardware demand, and has prompted calls to short memory stocks, and even the Korean market, which leans heavily on memory-chip makers.


Original Article

With stocks closing solidly in the green despite some painful wobbles during the day, one sector was a notable laggard: the same sector that had dramatically outperformed the S&P since memory prices soared last October: memory stocks, most notably MU and SNDK.

In his EOD wrap, Goldman tech specialist Peter Callahan wrote that while there wasn't that much actual "angst" out there, his clients complained of plenty of "sanity checking" on the moves today in memory (MU / SNDK lower vs. OEMs higher), and especially the 5-day slide in MU: Micron has underperformed the SOX by 20% in five days, starting with the company's blowout earnings report, a move that ranks as its largest five days of underperformance relative to Semis/SOX since 2011.

What caused today's remarkable slump, which at one point saw Micron shares fall over 6% and SanDisk slide 9% before paring losses, with other notable decliners including Western Digital (-6.7%) and Seagate Technology (-8.5%)?

The answer was the latest announcement from Google Research, which after the close on Wednesday unveiled TurboQuant, a compression algorithm for large language models and vector search engines that shrinks a major inference-memory bottleneck: it reduces an AI model's memory footprint 6x, making inference 8x faster on the same number of GPUs, all while maintaining zero loss in accuracy and "redefining AI efficiency."

The paper is slated for presentation at ICLR 2026, but the reaction online was immediate: Cloudflare CEO Matthew Prince called it "Google's DeepSeek moment."

The implication is clear: if Google can achieve the same inference results with one-sixth of the hardware, then demand for memory chips will collapse in inverse proportion - the same ravenous demand that until recently sent DDR prices as much as 7x higher in just 3 months when the memory bottleneck for AI became apparent...

... and more recently sent inference-heavy NAND Flash prices also surging.

If this sounds similar to the infamous Middle Out algorithm from Silicon Valley, it's because it is, minus the jerking off part:

Of course, that's a bit hyperbolic, but the premise is there: taking existing hardware and achieving a far better compression result.

A quick technical side note on how TurboQuant achieves this remarkable improvement in efficiency:

Quantization efficiency is a big achievement by itself. But "zero accuracy loss" needs context. TurboQuant targets the KV cache—the chunk of GPU memory that stores everything a language model needs to remember during a conversation.

As context windows grow toward millions of tokens, those caches balloon into hundreds of gigabytes per session. That's the actual bottleneck. Not compute power but raw memory.
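To make the "hundreds of gigabytes" claim concrete, here is a back-of-the-envelope calculation for a hypothetical 70B-class model. The layer count, head count, and head dimension are illustrative assumptions, not figures from the paper:

```python
# Back-of-the-envelope KV cache size for transformer inference.
# All model dimensions below are illustrative assumptions.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    # Each token stores one key vector and one value vector
    # per layer per KV head, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# A hypothetical 70B-class model: 80 layers, 8 KV heads, head_dim 128,
# fp16 values (2 bytes), and a 1-million-token context window.
full = kv_cache_bytes(80, 8, 128, 1_000_000, 2)
print(full / 1e9, "GB")        # hundreds of GB for one session at fp16

# A 6x compression, as TurboQuant claims, would shrink that dramatically.
compressed = full / 6
print(compressed / 1e9, "GB")
```

At these (assumed) dimensions a single million-token session needs roughly 328 GB of KV cache at fp16, which is exactly the kind of number that turns memory, not compute, into the bottleneck.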

Traditional compression methods try to shrink those caches by rounding numbers down: from 32-bit floats to 16-bit, then to 8-bit or 4-bit integers, for example. To picture it, think of shrinking an image from 4K to full HD to 720p and so on: it's recognizably the same image throughout, but the 4K version holds more detail.

The catch: they have to store extra "quantization constants" alongside the compressed data to keep the model from going stupid. Those constants add 1 to 2 bits per value, partially eroding the gains.
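The rounding-plus-constants scheme described above can be sketched as a simple per-block symmetric quantizer. This is a generic textbook illustration of where the extra bits come from, not any particular library's implementation:

```python
import numpy as np

def quantize_block(x, bits=4):
    # Naive symmetric quantization: round values to `bits`-bit integers,
    # storing one fp16 scale per block -- the "quantization constant"
    # needed later to undo the rounding.
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, np.float16(scale)

def dequantize_block(q, scale):
    return q.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)
q, scale = quantize_block(x)
x_hat = dequantize_block(q, scale)
err = np.abs(x - x_hat).max()   # bounded by roughly scale / 2

# Overhead accounting: one 16-bit scale per 64-value block is only
# 0.25 extra bits/value, but real schemes use much finer blocks (and
# sometimes zero-points too), pushing overhead to 1-2 bits per value.
```

The point of the sketch is the `scale` value: the compressed integers are useless without it, so it must be stored alongside them, and that side storage is the overhead TurboQuant claims to eliminate.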

TurboQuant claims it eliminates that overhead entirely.

It does this via two sub-algorithms. PolarQuant separates magnitude from direction in vectors, and QJL (Quantized Johnson-Lindenstrauss) takes the tiny residual error left over and reduces it to a single sign bit, positive or negative, with zero stored constants.
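A toy sketch of those two ideas, separating magnitude from direction and keeping only a sign bit of the residual, purely as illustration; the paper's actual construction is certainly more involved:

```python
import numpy as np

def polar_split(v):
    # PolarQuant is described as separating a vector into its magnitude
    # (one scalar) and its direction (a unit vector); each part can then
    # be quantized on its own terms.
    mag = np.linalg.norm(v)
    return mag, v / mag

def sign_bit_residual(residual):
    # QJL-style idea: whatever small error the direction quantizer leaves
    # behind is reduced to one sign bit per component, with no stored
    # quantization constants at all.
    return np.sign(residual).astype(np.int8)

rng = np.random.default_rng(1)
v = rng.standard_normal(128)
mag, direction = polar_split(v)
# Reconstruction from magnitude + direction alone is exact here; a real
# quantizer would also compress `direction` and keep only the signs of
# the error that compression introduces.
v_hat = mag * direction
```

The appeal of the split is that magnitudes and directions have very different statistics, so quantizing them separately wastes fewer bits than quantizing the raw components.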

The result, Google says, is a mathematically unbiased estimator for the attention calculations that drive transformer models.
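To see what "mathematically unbiased" buys, consider stochastic rounding, the textbook way to make a quantizer unbiased. This illustrates the property itself, not TurboQuant's method: the expected value of the quantized output equals the input exactly, so errors cancel on average rather than accumulating across billions of attention operations.

```python
import numpy as np

def stochastic_round(x, rng):
    # Round down or up at random, with the up-probability equal to the
    # fractional part, so that E[stochastic_round(x)] == x exactly.
    lo = np.floor(x)
    return lo + (rng.random(x.shape) < (x - lo))

rng = np.random.default_rng(2)
x = np.full(200_000, 0.3)
# Each individual rounding is wrong (it outputs 0 or 1, never 0.3),
# but the average over many roundings converges to the true value.
est = stochastic_round(x, rng).mean()
```

A biased quantizer would instead drift systematically in one direction, which is how small per-value errors turn into visible accuracy loss at scale.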

In benchmarks using Gemma and Mistral, TurboQuant matched full-precision performance under 4x compression, including perfect retrieval accuracy on needle-in-haystack tasks up to 104,000 tokens.

For context on why those benchmarks matter, expanding a model's usable context without quality loss has been one of the hardest problems in LLM deployment.

Now, the fine print. "Zero accuracy loss" applies to KV cache compression during inference—not to the model's weights. Compressing weights is a completely different, harder problem. TurboQuant doesn't touch those.

What it compresses is the temporary memory storing mid-session attention computations, which is more forgiving because that data can theoretically be reconstructed.

There's also the gap between a clean benchmark and a production system serving billions of requests. TurboQuant was tested on open-source models—Gemma, Mistral, Llama—not Google's own Gemini stack at scale.

The punchline: unlike DeepSeek's efficiency gains, which required deep architectural decisions baked in from the start, TurboQuant requires no retraining or fine-tuning and claims negligible runtime overhead. In theory, it drops straight into existing inference pipelines.

That's the part that spooked the memory hardware sector, because if it works in production, every major AI lab will run much leaner on the same GPUs they already own. Or, said in terms of P&L: AI companies, already deeply cash flow negative and suddenly bleeding even more profit margin (which they don't have but assume they do) to soaring RAM prices, have found a software way to require far less hardware, potentially as much as 6x less, and thus flip the table on the memory makers who are generating massive profits precisely because they refuse to produce more memory, in what some would call cartel-like behavior. In doing so, they may have eliminated the entire physical memory bottleneck, courtesy of the memory cartel which magically can't find any new supply until 2027 or later.

But wait, it gets better: if Google has already found a compression algo that achieves such phenomenal efficiency improvements, further optimization - and competing algos - will almost surely deliver far greater efficiency, reducing the amount of hardware needed even further.

And just like that, the memory bubble, built on the assumption that demand for DRAM and NAND will persist well into the future, suddenly looks set to burst, as software may have just solved a very sticky hardware problem.

The Google paper goes to ICLR 2026. Until it ships in production, the "zero loss" headline stays in the lab, but the market isn't waiting, and the mere threat that demand for memory may tumble by orders of magnitude could shock the entire ecosystem. In which case, buy puts on the Kospi, which is about 100% overvalued if the "memory benefit" of its two core stocks, Samsung and SK Hynix, disappears. Come to think of it, short everything memory.
