Installation

Warning

MagiAttention currently supports only the NVIDIA Hopper and Blackwell GPU architectures. We are actively working to support more GPU architectures in upcoming releases.
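
To check whether a given GPU falls into one of these architectures, one option is to look at its CUDA compute capability. The sketch below assumes the usual mapping of major version 9 to Hopper and major version 10 to data-center Blackwell. The helper `gpu_architecture` is hypothetical and not part of MagiAttention, and this page does not say whether other Blackwell variants (for example compute capability 12.x) are covered.

```python
from typing import Optional, Tuple

# Assumed mapping from compute-capability major version to architecture name.
# This is an illustration, not MagiAttention's own check.
_SUPPORTED_MAJORS = {9: "Hopper", 10: "Blackwell"}


def gpu_architecture(capability: Tuple[int, int]) -> Optional[str]:
    """Return the architecture name for a (major, minor) capability, or None."""
    major, _minor = capability
    return _SUPPORTED_MAJORS.get(major)


if __name__ == "__main__":
    try:
        import torch  # only needed to query the local GPU
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        print(cap, gpu_architecture(cap) or "not in the assumed supported set")
    else:
        print("No CUDA device visible; cannot query compute capability.")
```

Running the file inside the container prints the capability of device 0 and the assumed architecture name, or a short message when no GPU is visible.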

Setup Environment

Activate an NGC-PyTorch Container

Tip

We recommend using the standard NGC-PyTorch Docker releases to keep basic dependencies such as Python, CUDA, and PyTorch consistent.

Warning

Due to performance issues with CUDA-12, we recommend using CUDA-13+ based NGC-PyTorch containers for optimal performance.

The setup.py script asserts the CUDA version and aborts the installation if it is lower than 13.0.

If you insist on using CUDA-12 based containers, you can set the environment variable MAGI_ATTENTION_ALLOW_BUILD_WITH_CUDA12=1, but please be aware that it may lead to significant performance degradation compared to CUDA-13+.
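
The behaviour described above can be pictured as a small version gate. The following is only a sketch under assumptions: the function names `parse_cuda_version` and `check_cuda_version` are hypothetical, the messages are invented for illustration, and the real check in setup.py may be written differently. Only the 13.0 threshold and the environment variable name come from the text above.

```python
import os
from typing import Mapping, Tuple

# Variable name taken from the warning above.
ALLOW_CUDA12_ENV = "MAGI_ATTENTION_ALLOW_BUILD_WITH_CUDA12"


def parse_cuda_version(version: str) -> Tuple[int, int]:
    """Turn a string such as '13.0' or '12.8.1' into (major, minor)."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def check_cuda_version(version: str, env: Mapping[str, str] = os.environ) -> None:
    """Raise if CUDA is older than 13.0, unless the CUDA-12 override is '1'."""
    if parse_cuda_version(version) >= (13, 0):
        return
    if env.get(ALLOW_CUDA12_ENV) == "1":
        # Hypothetical warning text; the build proceeds in this case.
        print(f"Warning: building with CUDA {version}; expect lower performance.")
        return
    raise RuntimeError(
        f"CUDA {version} is lower than 13.0; set {ALLOW_CUDA12_ENV}=1 to build anyway."
    )
```

In a PyTorch environment, `torch.version.cuda` is one place to obtain the version string to pass in, for example `check_cuda_version(torch.version.cuda)`.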

  • docker run command:

    # choose one compatible version
    MAJOR_VERSION=25
    MINOR_VERSION=10
    
    # specify your own names and paths
    CONTAINER_NAME=...
    HOST_MNT_ROOT=...
    CONTAINER_MNT_ROOT=...
    
    docker run --name ${CONTAINER_NAME} \
        -v ${HOST_MNT_ROOT}:${CONTAINER_MNT_ROOT} \
        -it -d --privileged --gpus all \
        --network host --ipc host \
        --ulimit memlock=-1 --ulimit stack=67108864 \
        nvcr.io/nvidia/pytorch:${MAJOR_VERSION}.${MINOR_VERSION}-py3 /bin/bash
    
  • docker exec command:

    docker exec -it ${CONTAINER_NAME} /bin/bash
    

Pull Source Code

  • git commands:

    git clone https://github.com/SandAI-org/MagiAttention.git
    
    cd MagiAttention
    
    git submodule update --init --recursive
    

Enable IBGDA (optional)

Note

If you would like to use our native group-collective kernels as the communication backend when cp_size > 8, i.e. with a process group involving both intranode peers (connected through NVLink) and internode peers (visible through RDMA), you must enable IBGDA on your bare-metal host machine.

Warning

This step needs to be performed on the BARE-METAL HOST OPERATING SYSTEM, NOT inside a Docker or other containerized environment, as containers do not manage the host kernel.

  • bash script:

    bash scripts/enable_ibgda_on_host.sh
    

Setup Dependencies

Install Required Packages

  • pip install command:

    pip install -r requirements.txt
    

Install flash_attn_cute (optional)

Note

If you would like to try MagiAttention on Blackwell, you currently need to install the flash_attn_cute package to enable the FFA_FA backend as a temporary workaround.

  • bash script:

    bash scripts/install_flash_attn_cute.sh
    

Install MagiAttention

Install MagiAttention From Source

Warning

The first build may take around 10 to 20 minutes and occupy 90% of CPU resources.

Note

Several environment variables give fine-grained control over the installation process, especially over building the CUDA extension modules.

  • pip install command for Hopper:

    pip install --no-build-isolation .
    
  • pip install command for Blackwell:

    export MAGI_ATTENTION_PREBUILD_FFA=0
    pip install --no-build-isolation .
    
    export MAGI_ATTENTION_FA4_BACKEND=1 # always set it when using MagiAttention on Blackwell
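
When the install is driven from a script rather than typed by hand, the per-architecture settings above can be kept in one place. The sketch below only restates the commands shown above. The helper names (`build_env_overrides`, `runtime_env_overrides`, `install_from_source`) are hypothetical and not part of MagiAttention.

```python
import os
import subprocess
import sys
from typing import Dict


def build_env_overrides(arch: str) -> Dict[str, str]:
    """Environment variables to set before `pip install` for an architecture."""
    arch = arch.lower()
    if arch == "hopper":
        return {}
    if arch == "blackwell":
        return {"MAGI_ATTENTION_PREBUILD_FFA": "0"}
    raise ValueError(f"unsupported architecture: {arch!r}")


def runtime_env_overrides(arch: str) -> Dict[str, str]:
    """Environment variables to set whenever MagiAttention is used at runtime."""
    if arch.lower() == "blackwell":
        return {"MAGI_ATTENTION_FA4_BACKEND": "1"}
    return {}


def install_from_source(arch: str, repo_dir: str = ".") -> None:
    """Run `pip install --no-build-isolation .` with the build-time overrides."""
    env = {**os.environ, **build_env_overrides(arch)}
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--no-build-isolation", "."],
        cwd=repo_dir,
        env=env,
        check=True,
    )
```

For example, `install_from_source("blackwell", "MagiAttention")` would run the same build as the Blackwell commands above. The runtime variable still has to be exported in every process that later uses MagiAttention on Blackwell.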
    

Pre-compile FFA_FA4 Kernels (optional)

Note

If you would like to try MagiAttention on Blackwell and have already installed both magi_attention and flash_attn_cute to enable the FFA_FA backend, we further recommend pre-compiling the common cases for the FFA_FA4 kernels before production use. Because these kernels are built on the CuTe Python DSL, pre-compiling them avoids runtime JIT re-compilation overhead.

By default, the cache directory for pre-compiled kernels is /path/to/magi_attention/lib/ffa_fa4_cache/. To use a custom cache directory, set the environment variable MAGI_ATTENTION_FFA_FA4_CACHE_DIR.

  • python script:

    # Adjust the cases to pre-compile in the script as needed.
    # tqdm logs the progress and results of the whole pre-compilation
    # to the terminal so you can track them.
    python tools/precompile_ffa_fa4.py
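
The cache-directory rule described in the note above amounts to "use the override if it is set, otherwise fall back to the default under the package". Here is a minimal sketch of that rule. The function `resolve_cache_dir` is hypothetical, and treating an empty value as "not set" is an assumption rather than documented behaviour. The package directory is passed in explicitly because the default path depends on where magi_attention is installed.

```python
import os
from pathlib import Path
from typing import Mapping

# Variable name taken from the note above.
CACHE_DIR_ENV = "MAGI_ATTENTION_FFA_FA4_CACHE_DIR"


def resolve_cache_dir(package_dir: str, env: Mapping[str, str] = os.environ) -> Path:
    """Return the override if set and non-empty, else <package_dir>/lib/ffa_fa4_cache."""
    override = env.get(CACHE_DIR_ENV)
    if override:
        return Path(override)
    return Path(package_dir) / "lib" / "ffa_fa4_cache"
```

To direct the pre-compiled kernels to a custom location, export MAGI_ATTENTION_FFA_FA4_CACHE_DIR before running tools/precompile_ffa_fa4.py, and keep it set to the same value in the processes that later load the kernels.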