Installation#

Warning

MagiAttention currently supports only the NVIDIA Hopper and Blackwell GPU architectures. We are actively working to support more GPU architectures in upcoming releases.
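
To check whether your GPUs match one of these architectures, you can query their compute capability (Hopper reports 9.0, while Blackwell reports 10.x or 12.x; the compute_cap query field requires a reasonably recent NVIDIA driver):

    nvidia-smi --query-gpu=name,compute_cap --format=csv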

Setup Environment#

Activate an NGC-PyTorch Container#

Tip

We recommend using the standard NGC-PyTorch Docker releases to keep basic dependencies such as Python, CUDA, and PyTorch consistent.

  • docker run command:

    # choose one compatible version
    MAJOR_VERSION=25
    MINOR_VERSION=10 # choose from {05, 06, 08, 09, 10}
    
    # specify your own names and paths
    CONTAINER_NAME=...
    HOST_MNT_ROOT=...
    CONTAINER_MNT_ROOT=...
    
    docker run --name ${CONTAINER_NAME} -v ${HOST_MNT_ROOT}:${CONTAINER_MNT_ROOT} -it -d --privileged --gpus all --network host --ipc host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:${MAJOR_VERSION}.${MINOR_VERSION}-py3 /bin/bash
    
  • docker exec command:

    docker exec -it ${CONTAINER_NAME} /bin/bash
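
Once inside the container, you can optionally run a quick sanity check of the base environment using standard PyTorch APIs shipped with the NGC image:

    # print the PyTorch version, the CUDA version it was built with,
    # and whether the GPUs are visible to PyTorch
    python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"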
    

Pull Source Code#

  • git commands:

    git clone https://github.com/SandAI-org/MagiAttention.git
    
    cd MagiAttention
    
    git submodule update --init --recursive
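
To confirm the submodules were initialized, you can list their status; a leading "-" in the output marks a submodule that is still uninitialized:

    git submodule status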
    

Enable IBGDA (optional)#

Note

If you would like to try our native group-collective kernels as the communication backend when cp_size > 8, i.e. for a process group involving both intranode peers (connected through NVLink) and internode peers (visible through RDMA), you're required to enable IBGDA on your bare-metal host machine.

Warning

This step needs to be performed on the BARE-METAL HOST OPERATING SYSTEM, NOT inside a Docker or other containerized environment, as containers do not manage the host kernel.

  • bash script:

    bash scripts/enable_ibgda_on_host.sh
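
After the script finishes (and after a reboot, if it requests one), one way to verify the change is to inspect the NVIDIA driver's module parameters. This is a hedged check that assumes the script follows the common IBGDA setup of setting the driver's PeerMappingOverride registry key:

    # expect PeerMappingOverride=1 to appear among the registry dwords
    grep -i registrydwords /proc/driver/nvidia/params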
    

Setup Dependencies#

Install Required Packages#

  • pip install command:

    pip install -r requirements.txt
    

Install flash_attn_cute (optional)#

Note

If you would like to try MagiAttention on Blackwell, you're currently required to install the flash_attn_cute package to enable the FFA_FA backend as a temporary workaround.

  • bash script:

    bash scripts/install_flash_attn_cute.sh
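
You can then verify the installation by importing the package. Note that the module name below is assumed from the package name above and may differ in your environment:

    # assumed import name; adjust if the installed module is named differently
    python -c "import flash_attn_cute" && echo "flash_attn_cute OK"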
    

Install MagiAttention#

Install MagiAttention From Source#

Warning

This process may take around 10-20 minutes the first time and can occupy up to 90% of your CPU resources.

Note

We provide several environment variables for fine-grained control over the installation process, especially for building the CUDA extension modules. A quick post-install sanity check is sketched after the commands below.

  • pip install command for Hopper:

    pip install --no-build-isolation .
    
  • pip install command for Blackwell:

    export MAGI_ATTENTION_PREBUILD_FFA=0 # skip pre-building the FFA kernels during installation
    pip install --no-build-isolation .
    
    export MAGI_ATTENTION_FA4_BACKEND=1 # always set this when using MagiAttention on Blackwell
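
After either install path completes, a minimal post-install sanity check is to import the package (the version attribute is read defensively here, since we do not assume it exists):

    python -c "import magi_attention; print(getattr(magi_attention, '__version__', 'installed'))"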
    

PreCompile FFA_FA4 kernels (optional)#

Note

If you would like to try MagiAttention on Blackwell and have already installed both magi_attention and flash_attn_cute to enable the FFA_FA backend, we further recommend pre-compiling the common cases of the FFA_FA4 kernels before production usage: since they are built upon CuteDSL, this avoids runtime JIT re-compilation overhead.

The pre-compiled kernels are cached under /path/to/magi_attention/lib/ffa_fa4_cache/ by default; set the environment variable MAGI_ATTENTION_FFA_FA4_CACHE_DIR to use a custom cache directory instead.

  • python script:

    # You can change the cases to pre-compile in the script according to your needs;
    # the whole pre-compilation process is logged in the terminal via tqdm,
    # so you can track its progress and results.
    python tools/precompile_ffa_fa4.py
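
If you need a custom cache location, set the environment variable documented above before running the script:

    # optional: redirect the pre-compiled kernel cache to a custom directory
    export MAGI_ATTENTION_FFA_FA4_CACHE_DIR=/your/custom/cache/dir
    python tools/precompile_ffa_fa4.py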