Posts tagged Context Parallelism

How to Ensure Kernels Actually Overlap

While the CPU-side scheduler controls the kernel launch order to favor overlapping, the GPU’s Hyper-Q hardware scheduler [Bradley, 2013] ultimately dictates the actual execution order. This process is inherently non-deterministic and heavily influenced by transient GPU resource occupancy.
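As a minimal illustration of the launch-side half of this picture (the kernel and buffer names below are hypothetical), enqueuing work on distinct CUDA streams only makes overlap *possible*; whether the kernels actually run concurrently is decided by the hardware scheduler at runtime:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration; real workloads differ.
__global__ void compute_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launching on separate streams makes these kernels *eligible*
    // to overlap; Hyper-Q decides the actual execution order based
    // on transient SM occupancy, so overlap is not guaranteed.
    compute_kernel<<<n / 256, 256, 0, s1>>>(a, n);
    compute_kernel<<<n / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If both kernels were instead launched on the default stream, they would serialize regardless of available GPU resources.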

Read more ...


Distributed-Native FFA (Coming Soon)

This blog post is coming soon. Stay tuned!

Read more ...


Attention Engine for Inference (Coming Soon)

This blog post is coming soon. Stay tuned!

Read more ...


Support Native Group Collective

With the release of MagiAttention-v1.1.0, we are excited to announce support for native group-collective CUDA kernels for both intranode and internode communication, building on the excellent work of DeepEP [Zhao et al., 2025].

Read more ...


Dynamic Attention Solver (Coming Soon)

This blog post is coming soon. Stay tuned!

Read more ...


Long-Context Attention Benchmark

From Kernel Efficiency to Distributed Scalability

Read more ...


MagiAttention

A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Mask Training

Read more ...