Posts in MagiAttention

How to Ensure Kernels Actually Overlap

While the CPU scheduler controls the kernel launch order to favor overlap, the GPU's Hyper-Q scheduler [Bradley, 2013] ultimately determines the actual execution order non-deterministically, influenced further by transient GPU resource occupancy; a minimal way to verify empirically that kernels really overlapped is sketched below.

Read more ...
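As a quick illustration of the problem this post tackles, here is a hedged PyTorch sketch (assuming a CUDA-capable build; it is not MagiAttention code) that launches two matmul kernels on separate CUDA streams and uses CUDA events to check whether the GPU actually ran them concurrently, since interleaved launches alone do not guarantee overlap.

```python
import torch

assert torch.cuda.is_available()

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
_ = a @ b  # warm up cuBLAS so one-time setup cost does not skew the timing
torch.cuda.synchronize()

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
start1, end1, start2, end2, wall_start, wall_end = (
    torch.cuda.Event(enable_timing=True) for _ in range(6)
)

wall_start.record()
with torch.cuda.stream(s1):
    start1.record(s1)
    _ = a @ a          # kernel launched on stream 1
    end1.record(s1)
with torch.cuda.stream(s2):
    start2.record(s2)
    _ = b @ b          # kernel launched on stream 2 right after
    end2.record(s2)
# Make the default stream wait for both side streams before stopping the clock.
torch.cuda.default_stream().wait_stream(s1)
torch.cuda.default_stream().wait_stream(s2)
wall_end.record()
torch.cuda.synchronize()

t1, t2 = start1.elapsed_time(end1), start2.elapsed_time(end2)
wall = wall_start.elapsed_time(wall_end)
# If the kernels truly overlapped, wall is well below t1 + t2; if the GPU
# scheduler serialized them despite the launch order, wall is roughly t1 + t2.
print(f"kernel1 {t1:.2f} ms | kernel2 {t2:.2f} ms | wall {wall:.2f} ms")
```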


Distributed-Native FFA

This upcoming blog post will be released soon. Stay tuned!

Read more ...


Attention Engine for Inference

This upcoming blog post will be released soon. Stay tuned!

Read more ...


Support Blackwell with FFA_FA4 Backend

This upcoming blog post will be released soon. Stay tuned!

Read more ...


Support Muon QK-Clip

The Muon optimizer [Jordan et al., 2024], which leverages matrix orthogonalization, has shown faster convergence than traditional optimizers such as Adam and AdamW [Kingma and Ba, 2017, Loshchilov and Hutter, 2019] on smaller language models, and was subsequently demonstrated by Kimi to scale to large models [Liu et al., 2025]; a toy sketch of the orthogonalization idea follows below.

Read more ...
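For readers unfamiliar with Muon, the following toy PyTorch sketch illustrates what orthogonalizing an update matrix means; the helper `orthogonalize_update` is hypothetical, and this is neither Muon's actual Newton-Schulz implementation nor MagiAttention's QK-Clip integration.

```python
import torch

def orthogonalize_update(grad: torch.Tensor) -> torch.Tensor:
    # Toy illustration only: take the orthogonal (polar) factor of a 2D update
    # matrix via an exact SVD. Muon itself approximates this step with a few
    # Newton-Schulz iterations for speed; this is not that implementation.
    u, _, vh = torch.linalg.svd(grad, full_matrices=False)
    return u @ vh  # every singular value becomes 1, keeping only directions

g = torch.randn(512, 256)                 # stand-in for a weight-gradient matrix
update = orthogonalize_update(g)
print(torch.linalg.svdvals(update)[:3])   # ~[1., 1., 1.]: an orthogonal update
```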


Optimize Sparse Attention in FFA

This upcoming blog post will be released soon. Stay tuned!

Read more ...


Support Native Group Collective Based on DeepEP

This upcoming blog post will be released soon. Stay tuned!

Read more ...


Dynamic Attention Solver

This upcoming blog post will be released soon. Stay tuned!

Read more ...


Flash Attention 2 Math Derivation

This blog post is a detailed math derivation of the well-known Flash Attention 2 (FA2), a memory-efficient, highly optimized, de facto standard kernel implementation [Dao, 2023, Dao et al., 2022, Shah et al., 2024] of the scaled dot-product attention operation introduced by the Transformer [Vaswani et al., 2023], which is re-implemented and further extended in the Flex-Flash-Attention kernels of MagiAttention [Zewei and Yunpeng, 2025]; the online-softmax recurrence at the heart of the derivation is sketched below.

Read more ...
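For a taste of the derivation, below is a minimal, non-fused PyTorch sketch of the online-softmax recurrence that FA2's tiling builds on; the function `attention_online_softmax` is hypothetical, single-head, and purely illustrative, whereas the real kernel fuses these updates into SRAM-resident tiles.

```python
import torch

def attention_online_softmax(q, k, v, block_size=128):
    # q, k, v: (seq_len, head_dim) for a single head. The running max m and
    # running denominator l are updated block by block, so the full (seq, seq)
    # score matrix never has to be materialized at once.
    scale = q.shape[-1] ** -0.5
    o = torch.zeros_like(q)
    m = torch.full((q.shape[0], 1), float("-inf"))   # running row-wise max
    l = torch.zeros(q.shape[0], 1)                   # running softmax denominator
    for start in range(0, k.shape[0], block_size):
        k_blk, v_blk = k[start:start + block_size], v[start:start + block_size]
        s = (q @ k_blk.T) * scale                    # score tile for this block
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                     # stabilized partial probs
        rescale = torch.exp(m - m_new)               # correct previous partials
        l = l * rescale + p.sum(dim=-1, keepdim=True)
        o = o * rescale + p @ v_blk
        m = m_new
    return o / l                                     # final normalization

q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
print((attention_online_softmax(q, k, v) - ref).abs().max())  # tiny (~1e-6)
```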


Support Learnable Attention Sink

Large-scale models assign significant attention to a few tokens (such as the initial tokens in the sequence), even if they are not semantically important, a phenomenon known as attention sink [Xiao et al., 2024]. Researchers attribute this to the nature of \(\mathrm{softmax}\), which forces the attention scores of each query token to sum to \(1\) over all key tokens in the context, even when a query token does not strongly attend to any key token at all [Gu et al., 2025]. Therefore, during training, we can deliberately add a few learnable sink tokens to the key sequence for each query token to absorb the unneeded attention scores and relax the "sum-up-to-one" constraint, as a learnable version of \(\textit{off-by-one}\ \mathrm{softmax}\) [Miller, 2024]; a minimal sketch of this idea follows below.

Read more ...
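To make the idea concrete, here is a hedged PyTorch sketch of a toy `SinkSoftmaxAttention` module (a hypothetical name, not MagiAttention's actual API) that appends one learnable sink logit per head to the softmax and then discards its probability mass.

```python
import torch
import torch.nn as nn

class SinkSoftmaxAttention(nn.Module):
    # Toy module: one extra learnable logit per head joins the softmax, so the
    # probabilities over real key tokens no longer have to sum exactly to one.
    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        self.sink = nn.Parameter(torch.zeros(num_heads))  # learnable sink logit
        self.scale = head_dim ** -0.5

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, seq, head_dim)
        scores = (q @ k.transpose(-2, -1)) * self.scale
        sink = self.sink.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
        probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
        # Drop the sink column: its probability mass is simply discarded,
        # which relaxes the "sum-up-to-one" constraint on real tokens.
        return probs[..., :-1] @ v

attn = SinkSoftmaxAttention(num_heads=8, head_dim=64)
q, k, v = (torch.randn(2, 8, 128, 64) for _ in range(3))
out = attn(q, k, v)   # (2, 8, 128, 64)
```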


Long-Context Attention Benchmark

From Kernel Efficiency to Distributed Scalability

Read more ...


MagiAttention

A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Mask Training

Read more ...