How many people does it take to implement this? A 10% gain in performance could pay for a lot of people's salaries when your company is spending hundreds of millions on GPU clusters.
If you consider how many people looked at this and failed to find the optimization during the community's earlier performance efforts, you could argue for quite a big number.
I generally avoid FP8 and prefer I8, but your question got me wondering how well cuBLAS performs.
First of all, cuBLAS needs the cuBLASLt extension API for mixed-precision workloads to handle FP8. Second, some seemingly adequate type combinations, like E5M2 x E5M2 for A x B, are not supported, while others, like E5M2 x E4M3, are! Moreover, matrix A must always come in a transposed layout on Ampere, Hopper, and Blackwell... and the list of constraints goes on. I've integrated FP8 cuBLASLt benchmarks into my "Less Slow C++" repository <https://github.com/ashvardanian/less_slow.cpp>, adding to the existing cuBLAS and hand-rolled CUDA and PTX benchmarks. I'm running them on H200 GPUs, which should match the H100 in compute performance. For square inputs, the throughput peaks around 1.35 peta-ops per second.
That's around 67% of the advertised number for dense GEMM <https://resources.nvidia.com/en-us-data-center-overview-mc/e...>.
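For anyone curious what driving that path looks like, here is a minimal sketch of a single FP8 cuBLASLt call, assuming CUDA 12.x on Hopper. Error checks, per-tensor scale pointers, and heuristic/algorithm selection are stripped out, leading dimensions are assumed to satisfy cuBLASLt's alignment rules, and the function and variable names are mine, not from any particular codebase:

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>

// Sketch of one FP8 GEMM, D = A^T * B, with E4M3 inputs and BF16 output.
// Layouts follow the TN requirement: A is stored transposed (k x m, ld = k).
void fp8_gemm_sketch(cublasLtHandle_t lt, int m, int n, int k,
                     const void* dA,           // k x m, E4M3 (A transposed)
                     const void* dB,           // k x n, E4M3
                     void* dD,                 // m x n, BF16
                     void* workspace, size_t workspaceBytes,
                     cudaStream_t stream) {
    float alpha = 1.f, beta = 0.f;

    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasOperation_t transA = CUBLAS_OP_T, transB = CUBLAS_OP_N;
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSA, &transA, sizeof(transA));
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSB, &transB, sizeof(transB));

    cublasLtMatrixLayout_t aDesc, bDesc, dDesc;
    cublasLtMatrixLayoutCreate(&aDesc, CUDA_R_8F_E4M3, k, m, k);  // stored shape of A^T
    cublasLtMatrixLayoutCreate(&bDesc, CUDA_R_8F_E4M3, k, n, k);
    cublasLtMatrixLayoutCreate(&dDesc, CUDA_R_16BF, m, n, m);

    // beta is zero, so C can alias D; a NULL algo lets cuBLASLt pick one
    // through its internal heuristics.
    cublasLtMatmul(lt, op, &alpha, dA, aDesc, dB, bDesc,
                   &beta, dD, dDesc, dD, dDesc,
                   /*algo=*/nullptr, workspace, workspaceBytes, stream);

    cublasLtMatrixLayoutDestroy(dDesc);
    cublasLtMatrixLayoutDestroy(bDesc);
    cublasLtMatrixLayoutDestroy(aDesc);
    cublasLtMatmulDescDestroy(op);
}
```

The optional per-tensor scale pointers (CUBLASLT_MATMUL_DESC_A_SCALE_POINTER and friends) and the alignment requirements on leading dimensions are part of the "list of constraints" mentioned above.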
I heard that it is possible to achieve better performance than cuBLAS using CUTLASS? I thought they chose the better of cuBLAS and CUTLASS as the baseline.
Basically every activation function throws away half of the dynamic range at every neuron (ReLU, for example, maps every negative input to zero, so the negative half of the representable values never shows up downstream), which across a deep network adds up to a lot.
You make a good point about LayerNorm; it's probably even worse.
This might be rendered moot by native microscaling support in Blackwell (MXFP). They've manually done a coarser-grained version of that for Hopper, but with full FP32 scaling factors.
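To make "coarser-grained with FP32 scaling factors" concrete, here is a small host-side sketch of block-scaled FP8 quantization. The 128-element block size and the greedy amax-based scale choice are assumptions for illustration; MXFP on Blackwell instead shares a power-of-two E8M0 scale across 32-element blocks.

```cpp
#include <cuda_fp8.h>   // __nv_fp8_e4m3, host/device conversions (CUDA 11.8+)
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Quantize `x` in blocks of `block` values, each block sharing one FP32 scale
// chosen so the block's largest magnitude lands near E4M3's max normal (448).
void quantize_block_scaled(const std::vector<float>& x,
                           std::vector<__nv_fp8_e4m3>& q,
                           std::vector<float>& scales,
                           std::size_t block = 128) {
    q.resize(x.size());
    scales.resize((x.size() + block - 1) / block);
    for (std::size_t b = 0; b * block < x.size(); ++b) {
        std::size_t lo = b * block, hi = std::min(x.size(), lo + block);
        float amax = 0.f;
        for (std::size_t i = lo; i < hi; ++i) amax = std::max(amax, std::fabs(x[i]));
        float scale = amax > 0.f ? amax / 448.f : 1.f;  // kept in full FP32
        scales[b] = scale;
        for (std::size_t i = lo; i < hi; ++i)
            q[i] = __nv_fp8_e4m3(x[i] / scale);          // round-to-nearest on convert
    }
}
// Dequantization multiplies each block back by its scale; in the GEMM itself
// the scales are applied during the FP32 accumulation ("promotion") step.
```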
> This stuff must be documented internally
Probably not. These details are likely documented only in internal architecture design docs and specs, which you surely would not want to share.
Honestly, this is beyond my usage and understanding. But I really appreciate findings and improvements being shared like this so that everyone can benefit from them. It's refreshing.
> We observe a performance improvement in the CUTLASS FP8 kernel between NVCC 12.2 and 12.3. By comparing the compiled SASS, we discover that one bit in a series of FADD instructions is flipped in an interleaving pattern. After referencing some open-source CUDA assembler implementations, we identified that this bit controls yield, which may enhance warp-level parallelism (just a guess, yielding the current warp and let other warps work).
> To leverage this, we develop a similar script to modify the FFMA instructions in the compiled binary. Besides simply modifying the yield bit, we also flip the reuse bit (registers cannot be reused if the warp is yielded). This adjustment improves performance (10%+ in some cases) for fine-grained scaling FP8 GEMMs by creating more opportunities to overlap MMA instructions with promotion FFMA instructions.
I would say it is really mind-blowing.
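To make the quoted trick a bit more concrete, here is a purely conceptual C++ sketch of what such a post-compilation patch does: walk the 128-bit SASS words of a compiled kernel and flip control bits on alternating FFMA instructions. Every constant below (opcode mask, bit positions) is a made-up placeholder, not a real SM90 encoding; the real values have to be recovered by diffing nvdisasm output across compiler versions, as the quoted passage describes.

```cpp
#include <cstdint>
#include <vector>

// One 128-bit Hopper SASS instruction word, split into two 64-bit halves.
struct SassWord { uint64_t lo, hi; };

// PLACEHOLDER encodings, not real SM90 values: a real tool derives these by
// disassembling known binaries and diffing the resulting bit patterns.
constexpr uint64_t kFfmaOpcodeMask = 0xFFFull;   // hypothetical opcode field
constexpr uint64_t kFfmaOpcode     = 0x223ull;   // hypothetical FFMA opcode
constexpr int      kYieldBit       = 109;        // hypothetical yield flag
constexpr int      kReuseBit       = 58;         // hypothetical reuse flag

bool looks_like_ffma(const SassWord& w) {
    return (w.lo & kFfmaOpcodeMask) == kFfmaOpcode;  // stand-in opcode match
}

// Flip the yield bit and clear the reuse bit on every other FFMA, producing
// the "interleaving pattern" described above: a yielded warp gives the
// scheduler room to overlap MMA work with the FP32 promotion FFMAs.
void interleave_ffma(std::vector<SassWord>& code) {
    bool patch_this_one = false;
    for (SassWord& w : code) {
        if (!looks_like_ffma(w)) continue;
        if (patch_this_one) {
            uint64_t& y = (kYieldBit < 64) ? w.lo : w.hi;
            y ^= 1ull << (kYieldBit % 64);           // request a yield
            uint64_t& r = (kReuseBit < 64) ? w.lo : w.hi;
            r &= ~(1ull << (kReuseBit % 64));        // reuse is invalid once yielded
        }
        patch_this_one = !patch_this_one;
    }
}
```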