154 lines
6.2 KiB
Markdown
154 lines
6.2 KiB
Markdown
# Performance
|
||
|
||
## Data Pipeline
|
||
|
||
The runtime of the data pipeline (i.e. genetic sequence search and template
|
||
search) can vary significantly depending on the size of the input and the number
|
||
of homologous sequences found, as well as the available hardware (disk speed can
|
||
influence genetic search speed in particular). If you would like to improve
|
||
performance, it’s recommended to increase the disk speed (e.g. by leveraging a
|
||
RAM-backed filesystem), or increase the available CPU cores and add more
|
||
parallelisation. Also note that for sequences with deep MSAs, Jackhmmer or
|
||
Nhmmer may need a substantial amount of RAM beyond the recommended 64 GB of RAM.
|
||
|
||
## Model Inference
|
||
|
||
Table 8 in the Supplementary Information of the
|
||
[AlphaFold 3 paper](https://nature.com/articles/s41586-024-07487-w) provides
|
||
compile-free inference timings for AlphaFold 3 when configured to run on 16
|
||
NVIDIA A100s, with 40 GB of memory per device. In contrast, this repository
|
||
supports running AlphaFold 3 on a single NVIDIA A100 with 80 GB of memory in a
|
||
configuration optimised to maximise throughput.
|
||
|
||
We compare compile-free inference timings of these two setups in the table below
|
||
using GPU seconds (i.e. multiplying by 16 when using 16 A100s). The setup in
|
||
this repository is more efficient (by at least 2×) across all token sizes,
|
||
indicating its suitability for high-throughput applications.
|
||
|
||
Num Tokens | 1 A100 80 GB (GPU secs) | 16 A100 40 GB (GPU secs) | Improvement
|
||
:--------- | ----------------------: | -----------------------: | ----------:
|
||
1024 | 62 | 352 | 5.7×
|
||
2048 | 275 | 1136 | 4.1×
|
||
3072 | 703 | 2016 | 2.9×
|
||
4096 | 1434 | 3648 | 2.5×
|
||
5120 | 2547 | 5552 | 2.2×
|
||
|
||
## Running the Pipeline in Stages
|
||
|
||
The `run_alphafold.py` script can be executed in stages to optimise resource
|
||
utilisation. This can be useful for:
|
||
|
||
1. Splitting the CPU-only data pipeline from model inference (which requires a
|
||
GPU), to optimise cost and resource usage.
|
||
1. Caching the results of MSA/template search, then reusing the augmented JSON
|
||
for multiple different inferences across seeds or across variations of other
|
||
features (e.g. a ligand).
|
||
|
||
### Data Pipeline Only
|
||
|
||
Launch `run_alphafold.py` with `--norun_inference` to generate Multiple Sequence
|
||
Alignments (MSAs) and templates, without running featurisation and model
|
||
inference. This stage can be quite costly in terms of runtime, CPU, and RAM use.
|
||
The output will be JSON files augmented with MSAs and templates that can then be
|
||
directly used as input for running inference.
|
||
|
||
### Featurisation and Model Inference Only
|
||
|
||
Launch `run_alphafold.py` with `--norun_data_pipeline` to skip the data pipeline
|
||
and run only featurisation and model inference. This stage requires the input
|
||
JSON file to contain pre-computed MSAs and templates.
|
||
|
||
## Accelerator Hardware Requirements
|
||
|
||
We officially support the following configurations, and have extensively tested
|
||
them for numerical accuracy and throughput efficiency:
|
||
|
||
- 1 NVIDIA A100 (80 GB)
|
||
- 1 NVIDIA H100 (80 GB)
|
||
|
||
### Other Hardware Configurations
|
||
|
||
#### NVIDIA A100 (40 GB)
|
||
|
||
AlphaFold 3 can run on a single NVIDIA A100 (40 GB) with the following
|
||
configuration changes:
|
||
|
||
1. Enabling [unified memory](#unified-memory).
|
||
1. Adjusting `pair_transition_shard_spec` in `model_config.py`:
|
||
|
||
```py
|
||
pair_transition_shard_spec: Sequence[_Shape2DType] = (
|
||
(2048, None),
|
||
(3072, 1024),
|
||
(None, 512),
|
||
)
|
||
```
|
||
|
||
While numerically accurate, this configuration will have lower throughput
|
||
compared to the set up on the NVIDIA A100 (80 GB), due to less available memory.
|
||
|
||
#### NVIDIA V100 (16 GB)
|
||
|
||
While you can run AlphaFold 3 on sequences up to 1,280 tokens on a single NVIDIA
|
||
V100 using the flag `--flash_attention_implementation=xla` in
|
||
`run_alphafold.py`, this configuration has not been tested for numerical
|
||
accuracy or throughput efficiency, so please proceed with caution.
|
||
|
||
## Additional Flags
|
||
|
||
### Compilation Time Workaround with XLA Flags
|
||
|
||
To work around a known XLA issue causing the compilation time to greatly
|
||
increase, the following environment variable must be set (it is set by default
|
||
in the provided `Dockerfile`).
|
||
|
||
```sh
|
||
ENV XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
|
||
```
|
||
|
||
### GPU Memory
|
||
|
||
The following environment variables (set by default in the `Dockerfile`) enable
|
||
folding a single input of size up to 5,120 tokens on a single A100 with 80 GB of
|
||
memory:
|
||
|
||
```sh
|
||
ENV XLA_PYTHON_CLIENT_PREALLOCATE=true
|
||
ENV XLA_CLIENT_MEM_FRACTION=0.95
|
||
```
|
||
|
||
#### Unified Memory
|
||
|
||
If you would like to run AlphaFold 3 on a GPU with less memory (an A100 with 40
|
||
GB of memory, for instance), we recommend enabling unified memory. Enabling
|
||
unified memory allows the program to spill GPU memory to host memory if there
|
||
isn't enough space. This prevents an OOM, at the cost of making the program
|
||
slower by accessing host memory instead of device memory. To learn more, check
|
||
out the
|
||
[NVIDIA blog post](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/).
|
||
|
||
You can enable unified memory by setting the following environment variables in
|
||
your `Dockerfile`:
|
||
|
||
```sh
|
||
ENV XLA_PYTHON_CLIENT_PREALLOCATE=false
|
||
ENV TF_FORCE_UNIFIED_MEMORY=true
|
||
ENV XLA_CLIENT_MEM_FRACTION=3.2
|
||
```
|
||
|
||
### JAX Persistent Compilation Cache
|
||
|
||
You may also want to make use of the JAX persistent compilation cache, to avoid
|
||
unnecessary recompilation of the model between runs. You can enable the
|
||
compilation cache with the `--jax_compilation_cache_dir <YOUR_DIRECTORY>` flag
|
||
in `run_alphafold.py`.
|
||
|
||
More detailed instructions are available in the
|
||
[JAX documentation](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html#persistent-compilation-cache),
|
||
and more specifically the instructions for use on
|
||
[Google Cloud](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html#persistent-compilation-cache).
|
||
In particular, note that if you would like to make use of a non-local
|
||
filesystem, such as Google Cloud Storage, you will need to install
|
||
[`etils`](https://github.com/google/etils) (this is not included by default in
|
||
the AlphaFold 3 Docker container).
|