MagiAttention documentation#
Overview#
MagiAttention is a distributed attention (Context Parallelism) solution tailored to the demanding requirements of ultra-long sequences and heterogeneous masking patterns. It combines Flex-Flash-Attention (FFA), a kernel supporting distributable and flexible mask representations, with a dispatch solver for load-balanced computation and new Group Collective communication primitives that achieve zero-redundant communication. By coordinating these components through an adaptive multi-stage overlap strategy, MagiAttention delivers linear scalability across a broad range of training scenarios, such as large-scale video generation in Magi-1.
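To make the context-parallel data flow concrete, below is a minimal, deliberately naive sketch in PyTorch; the function name and sharding scheme are illustrative only and are not MagiAttention's API. Each rank owns one contiguous shard of the sequence and all-gathers the full K/V before attending, which is exactly the redundant traffic that MagiAttention's Group Collective primitives avoid by communicating only the K/V tokens each rank's mask region requires, overlapped with FFA compute.

```python
# Conceptual sketch of context parallelism (NOT MagiAttention's actual API).
# Assumes torch.distributed has already been initialized, e.g. via torchrun.
import torch
import torch.distributed as dist
import torch.nn.functional as F


def naive_context_parallel_attention(q_local, k_local, v_local):
    """q/k/v_local: [local_seq, heads, dim], the sequence shard owned by this rank.

    Naive baseline: every rank gathers the *full* K/V before computing
    attention for its local queries. MagiAttention instead communicates
    only the tokens demanded by each rank's mask (zero redundancy) and
    overlaps that communication with kernel execution.
    """
    world = dist.get_world_size()

    # Redundant all-gather of the complete key/value sequence.
    k_shards = [torch.empty_like(k_local) for _ in range(world)]
    v_shards = [torch.empty_like(v_local) for _ in range(world)]
    dist.all_gather(k_shards, k_local)
    dist.all_gather(v_shards, v_local)
    k_full = torch.cat(k_shards, dim=0)  # [total_seq, heads, dim]
    v_full = torch.cat(v_shards, dim=0)

    # Local queries attend over the entire gathered sequence.
    out = F.scaled_dot_product_attention(
        q_local.transpose(0, 1),   # [heads, local_seq, dim]
        k_full.transpose(0, 1),    # [heads, total_seq, dim]
        v_full.transpose(0, 1),
    )
    return out.transpose(0, 1)     # back to [local_seq, heads, dim]
```

Note also that this naive contiguous sharding leaves ranks with unequal work under non-uniform masks (e.g. causal), which is the imbalance MagiAttention's dispatch solver is designed to remove.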
We are committed to continually improving the performance and generality of MagiAttention for the broader research community. Stay tuned for exciting enhancements and new features on the horizon!
Contents
- User Guide
- Blogs
  - MagiAttention
  - Long-Context Attention Benchmark
  - Support Native Group Collective
  - Support Blackwell with FFA_FA4 Backend
  - Support Learnable Attention Sink
  - Support Muon QK-Clip
  - How to Ensure Kernels Actually Overlap
  - Support JIT Compilation in FFA
  - Flash Attention 2 Math Derivation
  - Optimize Sparse Attention in FFA (Coming Soon)
  - Dynamic Attention Solver (Coming Soon)
  - Distributed-Native FFA (Coming Soon)
  - Attention Engine for Inference (Coming Soon)