Adds CUDA dequantization for the TQ4_1S (5.0 bpv) and TQ3_1S (4.0 bpv) WHT-rotated weight compression types. These achieve a 27-37% model size reduction at +1.0-1.9% PPL on the Qwen and Phi families. The base types plus the Metal and CPU quantize/dequantize paths come from TheTom's PR TheTom#45.

CUDA additions:
- turbo-quant.cuh: weight centroids (Lloyd-Max on N(0,1), 16/8 levels) and the sign array for the 32-element inverse WHT
- dequantize.cuh: dequantize_tq4_1s/tq3_1s — full 32-element block inverse RHT (5 butterfly stages + normalize + unsign)
- convert.cu: TQ4_1S/TQ3_1S added to all 4 dequant dispatchers
- ggml-cuda.cu: supports_op for MUL_MAT and GET_ROWS; excluded from mmvq/mmq (uses the cuBLAS dequant-to-f16 path)

The cuBLAS path is correct for initial support. A future optimization is to pre-rotate activations via a warp-shuffle WHT (the same pattern as the KV cache Q rotation) to eliminate the per-block inverse WHT.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
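For context on the centroid tables mentioned above: Lloyd-Max quantization for N(0,1)-distributed weights can be approximated with 1-D k-means on Gaussian samples. This is a host-side C++ sketch under that assumption; the PR's actual precomputed centroid tables and construction method may differ.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Sample-based Lloyd-Max codebook for N(0,1) weights (hypothetical sketch).
// `levels` would be 16 for TQ4_1S and 8 for TQ3_1S per the PR description.
std::vector<float> lloyd_max_gauss(int levels, int iters = 30) {
    std::mt19937 rng(0);
    std::normal_distribution<float> gauss(0.f, 1.f);
    std::vector<float> samples(100000);
    for (float &s : samples) s = gauss(rng);

    // Initialize centroids evenly over [-3, 3]; sorted order is preserved
    // by 1-D k-means updates from a sorted start.
    std::vector<float> c(levels);
    for (int i = 0; i < levels; ++i)
        c[i] = -3.f + 6.f * (i + 0.5f) / levels;

    for (int it = 0; it < iters; ++it) {
        std::vector<double> sum(levels, 0.0);
        std::vector<int>    cnt(levels, 0);
        for (float s : samples) {
            int best = 0;  // nearest centroid; linear scan is fine at this size
            for (int k = 1; k < levels; ++k)
                if (std::fabs(s - c[k]) < std::fabs(s - c[best])) best = k;
            sum[best] += s;
            cnt[best] += 1;
        }
        for (int k = 0; k < levels; ++k)       // Lloyd update: centroid =
            if (cnt[k]) c[k] = (float)(sum[k] / cnt[k]);  // mean of its cell
    }
    return c;
}
```

The resulting codebook is roughly symmetric about zero, which is why a separate sign array (as in turbo-quant.cuh) can halve the stored index width in some designs.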
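The "5 butterfly stages + normalize" structure of the inverse transform in dequantize.cuh can be sketched on the host as a standard fast Walsh-Hadamard transform over one 32-element block. This is an illustrative C++ version assuming orthonormal 1/sqrt(32) scaling; the CUDA kernel's layout, sign handling, and normalization constant may differ.

```cpp
#include <cmath>

// 32-element inverse Walsh-Hadamard transform (hypothetical host sketch):
// 5 butterfly stages at strides 1, 2, 4, 8, 16, then 1/sqrt(32) scaling.
// The Hadamard matrix satisfies H*H = 32*I, so the scaled transform is
// self-inverse: the same butterfly network serves forward and inverse.
void iwht32(float v[32]) {
    for (int stride = 1; stride < 32; stride <<= 1) {        // 5 stages
        for (int i = 0; i < 32; i += 2 * stride) {
            for (int j = i; j < i + stride; ++j) {
                float a = v[j], b = v[j + stride];
                v[j]          = a + b;   // butterfly: sum
                v[j + stride] = a - b;   // butterfly: difference
            }
        }
    }
    const float norm = 1.0f / sqrtf(32.0f);
    for (int i = 0; i < 32; ++i) v[i] *= norm;               // orthonormal scale
}
```

On the GPU, the same butterfly pattern maps naturally onto warp shuffles (one lane per element, `__shfl_xor_sync` per stage), which is the direction the "pre-rotate activations" optimization above points at.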