Posted in 2026

How to Ensure Kernels Actually Overlapped

15 February 2026

The CPU scheduler controls kernel launch order to favor overlap, but the GPU Hyper-Q driver [Bradley, 2013] ultimately determines the actual execution order. That order is non-deterministic and also depends on transient GPU resource occupancy.

Read more ...

Distributed-Native FFA

14 February 2026

This blog post will be released soon. Stay tuned!

Read more ...

Attention Engine for Inference

08 February 2026

This blog post will be released soon. Stay tuned!

Read more ...

Support Blackwell with FFA_FA4 Backend

07 February 2026

This blog post will be released soon. Stay tuned!

Read more ...

Support Muon QK-Clip

04 February 2026

The Muon optimizer [Jordan et al., 2024], which leverages matrix orthogonalization, has shown faster convergence than traditional optimizers such as Adam [Kingma and Ba, 2017; Loshchilov and Hutter, 2019] on smaller language models, and Kimi [Liu et al., 2025] subsequently demonstrated that it scales to large models.

Read more ...