It's great to see vLLM getting faster/better for DeepSeek. I tested vLLM vs SGLang a couple of weeks ago and SGLang's DeepSeek support was much better/faster (on 2 x p5 H100 nodes). It's great that no one's standing still. I saw a recent AMD article reporting that SGLang perf on MI300X has increased 4x over the past couple of weeks: https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR...
(With the extra memory, V3/R1 fits on a single MI300X or H200 node.) It'll be interesting to see whether either project can get any benefit from this FlashMLA implementation.
I don't think I got it backwards; I believe what I said is correct: FA does not improve inference time.

From the authors of FlashAttention:

> This [decoding] operation has been optimized with FlashAttention (v1 and v2 recently) in the training case, where the bottleneck is the memory bandwidth to read and write the intermediate results

And then they continue with:

> However, these optimizations don't apply directly to the inference case, because the bottlenecks are different. For training, FlashAttention parallelizes across the batch size and query length dimensions. During inference, the query length is typically 1 ... With a batch size of 1, FlashAttention will use less than 1% of the GPU!

And then they come up with a different proposal, Flash-Decoding, that optimizes for inference time:

> Our new approach Flash-Decoding is based on FlashAttention, and adds a new parallelization dimension: the keys/values sequence length. It combines the benefits of the 2 approaches from above. Like FlashAttention, it stores very little extra data to global memory, however it fully utilizes the GPU even when the batch size is small, as long as the context length is large enough.

Link: https://crfm.stanford.edu/2023/10/12/flashdecoding.html
You need to load the cached K/V tensors, in addition to the weights. It's going to take me a few minutes to figure out what's wrong in this napkin math. Will edit or reply to this comment later.
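For anyone checking the numbers, this is roughly the shape of that napkin math: per decode step you stream the weights plus the whole cached K/V for the batch out of HBM. All figures below are placeholders (a generic fp16 cache, one GPU's worth of bandwidth), not DeepSeek's actual config; MLA compresses the per-token cache well below this.

```python
# Back-of-the-envelope bandwidth bound for one decode step.
# Placeholder model/cache sizes; adjust to taste.
def decode_step_bytes(n_layers=60, n_kv_heads=8, head_dim=128,
                      seq_len=8192, batch=32, weight_bytes=40e9 * 2):
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2  # K and V, 2 bytes each
    kv_cache_bytes = kv_bytes_per_token * seq_len * batch
    return weight_bytes + kv_cache_bytes

total = decode_step_bytes()
hbm_bw = 3.35e12  # H100 SXM HBM3, roughly 3.35 TB/s
print(f"~{total / 1e9:.0f} GB per decode step -> at least "
      f"{total / hbm_bw * 1e3:.1f} ms/step if bandwidth-bound")
```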
Dang, only forward passes. The real secret was in the backward pass! I was also curious to learn how they implemented the DualPipe scheduler.
Do they even have an optimized backward? It looks like optimizations like this aren't needed during training. Their V2 paper also suggests so.
I'm confused. Weren't there sanctions against Chinese companies over Hopper GPUs? Are they just admitting that they had access to H100s in violation of US sanctions?!
It isn't illegal for Chinese companies to buy H100 cards; it is illegal for US companies to sell them to China. So the "admitting" part wouldn't be on China's side.
Not really. Singapore is a trading hub: a lot of multinational companies have regional or head offices there, so if the head office buys anything for anywhere, the purchase shows up as Singapore. Despite Nvidia reporting such large revenue from Singapore, the actual number of GPUs shipped to Singapore is not that high. That's not to say none of the GPUs are going to China, but there is a valid reason for Nvidia's Singapore revenue numbers.
https://www.tomshardware.com/tech-industry/deepseek-gpu-smug...
I can't tell whether you're insinuating that Singapore is a pass-through for H100s heading to China, or whether there is some significant development taking place in Singapore that I'm unaware of.
I'd be very careful when using that word in this situation. If China wants X, and another country has X, who are you to say they shouldn't trade with each other?
No, especially considering that they open-sourced everything. (not OP)
Also, they could have outsourced the computation to a subsidiary company in the US, I suppose.
Today's H100-cluster models are tomorrow's edge-computing models.
With the next wave of investment targeting local, on-device robotics, I'm way more bullish on local AI than on vertical SaaS AI.
What do you mean by "only" a developer? Someone who just knows how to code when given a spec but lacks domain knowledge (in this case, AI math and hardware optimization) and broader context?
I don't find it a reasonable take. It's like saying stackoverflow.com is taking developer jobs by making it easier to code, so we'd better build a new stackoverflow.com.
I heard their inference framework has much lower overhead than typical deployment methods. Can this be verified from that open-source project? How does it stack up against vLLM or llama.cpp?
Right, I think the PTX use is a bigger deal than the coverage it's getting. It creates an opening for other vendors to get their foot in the door with PTX-to-LLVM-IR translation for existing CUDA kernels.
I don't think the decision is based on infra or any other technical reason; it's more on the service-support side. How does a 200-person company support 44M iPhone users in China?
I am very curious to see how well-optimized DeepSeek's code is compared to leading LLM serving software like vLLM or SGLang.