Posts from China
How to Ensure Kernels Actually Overlap
- 15 February 2026
While the CPU-side scheduler controls the kernel launch order to favor overlap, the GPU's Hyper-Q hardware scheduler [Bradley, 2013] ultimately dictates the actual execution order. This process is inherently non-deterministic and heavily influenced by transient GPU resource occupancy.
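As a minimal illustration (a sketch, not taken from the post itself), the PyTorch snippet below enqueues two independent kernels on separate CUDA streams. Separate streams only make overlap possible: whether the kernels actually run concurrently is decided by the GPU at launch time, depending on how many SMs, registers, and shared-memory slots happen to be free.

```python
import torch

# Sketch: two independent kernels issued on separate CUDA streams.
# Streams make concurrent execution eligible; the GPU's scheduler still
# decides at runtime whether they truly overlap, depending on transient
# resource occupancy (free SMs, registers, shared memory).
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

with torch.cuda.stream(s1):
    c = a @ a  # compute kernel enqueued on stream 1

with torch.cuda.stream(s2):
    d = b @ b  # independent kernel enqueued on stream 2

torch.cuda.synchronize()  # confirm actual overlap with a profiler (e.g. Nsight Systems)
```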
Distributed-Native FFA (Coming Soon)
- 14 February 2026
The full blog post will be released soon. Stay tuned!
Attention Engine for Inference (Coming Soon)
- 08 February 2026
The full blog post will be released soon. Stay tuned!
Support Blackwell with FFA_FA4 Backend
- 07 February 2026
Before the release of MagiAttention-v1.1.0, MagiAttention supported only Hopper GPUs, since its attention kernel backend, Flex-Flash-Attention (FFA), is built upon the open-source Flash-Attention 3 (FA3) [Shah et al., 2024], which is tailored to the SM90 compute capability.
Support Muon QK-Clip
- 04 February 2026
The Muon optimizer [Jordan et al., 2024], which leverages matrix orthogonalization, has shown faster convergence than traditional optimizers such as Adam [Kingma and Ba, 2017, Loshchilov and Hutter, 2019] on smaller language models, and was subsequently demonstrated by Kimi to scale to large models [Liu et al., 2025].
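To make the matrix-orthogonalization step concrete, here is a hedged Python sketch of the Newton-Schulz iteration commonly used in Muon reference implementations to approximately orthogonalize a 2D momentum matrix; the coefficients and step count below are the widely quoted reference values and are assumptions of this sketch, not taken from the post.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix (Muon-style Newton-Schulz sketch).

    Coefficients and step count follow commonly quoted reference values and are
    assumptions here, not taken from this post.
    """
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic iteration coefficients
    x = g / (g.norm() + 1e-7)              # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:                          # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x
```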
Optimize Sparse Attention in FFA (Coming Soon)
- 25 January 2026
The full blog post will be released soon. Stay tuned!
Support Native Group Collective
- 24 January 2026
With the release of MagiAttention-v1.1.0, we are excited to announce support for native group-collective CUDA kernels for both intra-node and inter-node communication, built upon the amazing work of DeepEP [Zhao et al., 2025].
Dynamic Attention Solver (Coming Soon)
- 21 January 2026
The full blog post will be released soon. Stay tuned!
Flash Attention 2 Math Derivation
- 22 December 2025
This blog post is a detailed math derivation of the well-known Flash Attention 2 (FA2), a memory-efficient, highly optimized, and de facto standard kernel implementation [Dao, 2023, Dao et al., 2022, Shah et al., 2024] of the scaled dot-product attention operation introduced by the Transformer [Vaswani et al., 2023], which is re-implemented and further extended in the Flex-Flash-Attention kernels of MagiAttention [Zewei and Yunpeng, 2025].
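For reference, the derivation starts from the standard scaled dot-product attention (the notation below is the usual one, not quoted from the post):

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,
\]

and FA2's central trick is the online-softmax recurrence that consumes \(K, V\) block by block while maintaining a running row-wise maximum \(m\) and normalizer \(\ell\), with \(S^{(j)} = Q\,(K^{(j)})^\top/\sqrt{d}\):

\[
m^{(j)} = \max\!\big(m^{(j-1)}, \operatorname{rowmax}(S^{(j)})\big), \qquad
\ell^{(j)} = e^{\,m^{(j-1)} - m^{(j)}}\,\ell^{(j-1)} + \operatorname{rowsum}\!\big(e^{\,S^{(j)} - m^{(j)}}\big),
\]
\[
\tilde{O}^{(j)} = e^{\,m^{(j-1)} - m^{(j)}}\,\tilde{O}^{(j-1)} + e^{\,S^{(j)} - m^{(j)}}\,V^{(j)}, \qquad
O = \tilde{O}^{(T)} / \ell^{(T)},
\]

so the final output equals the exact softmax result without ever materializing the full attention matrix.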
Support Learnable Attention Sink
- 17 November 2025
Large-scale models assign significant attention to a few tokens (such as the initial tokens in the sequence), even if they are not semantically important, a phenomenon known as the attention sink [Xiao et al., 2024]. Researchers attribute this to the nature of \(\mathrm{softmax}\), which forces the attention scores of each query token to sum to \(1\) over all key tokens in the context, even when a query token does not strongly attend to any key token at all [Gu et al., 2025]. Therefore, during training, we can deliberately add some learnable sink tokens to the key sequence for each query token to collect those unneeded attention scores and relax the "sum-up-to-one" constraint, as a learnable version of the \(\textit{off-by-one}\) softmax [Miller, 2024].
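One common formulation (the notation here is ours, not quoted from the post): given query-key logits \(x_1, \dots, x_n\), the off-by-one softmax adds a constant \(1\) to the denominator, and the learnable-sink variant replaces that constant with \(e^{s}\) for a trainable sink logit \(s\):

\[
\mathrm{softmax}_1(x)_i = \frac{e^{x_i}}{1 + \sum_{j=1}^{n} e^{x_j}}, \qquad
a_i = \frac{e^{x_i}}{e^{s} + \sum_{j=1}^{n} e^{x_j}},
\]

so the scores over the real keys may sum to less than \(1\), with the remainder absorbed by the sink.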
MagiAttention
- 21 April 2025
A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Mask Training