# Performance

## Data Pipeline

The runtime of the data pipeline (i.e. genetic sequence search and template
search) can vary significantly depending on the size of the input and the number
of homologous sequences found, as well as the available hardware (disk speed can
influence genetic search speed in particular). If you would like to improve
performance, it's recommended to increase the disk speed (e.g. by leveraging a
RAM-backed filesystem), or to increase the number of available CPU cores and add
more parallelisation. Also note that for sequences with deep MSAs, Jackhmmer or
Nhmmer may need a substantial amount of RAM beyond the recommended 64 GB.
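
For example, on a Linux machine with enough spare RAM you can stage the genetic
databases on a RAM-backed filesystem before running the pipeline. The sketch
below is illustrative rather than prescriptive: the database path and the tmpfs
size are placeholders, and `--db_dir`, `--json_path`, and `--output_dir` are the
standard `run_alphafold.py` flags.

```sh
# Create a RAM-backed filesystem large enough to hold the databases
# (the size here is illustrative; check your actual database footprint first).
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=700G tmpfs /mnt/ramdisk
cp -r /srv/alphafold_databases/* /mnt/ramdisk/

# Point the genetic search at the RAM-backed copy of the databases.
python run_alphafold.py \
  --db_dir=/mnt/ramdisk \
  --json_path=/root/af_input/fold_input.json \
  --output_dir=/root/af_output
```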

## Model Inference

Table 8 in the Supplementary Information of the
[AlphaFold 3 paper](https://nature.com/articles/s41586-024-07487-w) provides
compile-free inference timings for AlphaFold 3 when configured to run on 16
NVIDIA A100s, with 40 GB of memory per device. In contrast, this repository
supports running AlphaFold 3 on a single NVIDIA A100 with 80 GB of memory in a
configuration optimised to maximise throughput.

We compare compile-free inference timings of these two setups in the table below
using GPU seconds (i.e. multiplying by 16 when using 16 A100s). The setup in
this repository is more efficient (by at least 2×) across all token sizes,
indicating its suitability for high-throughput applications.

Num Tokens | 1 A100 80 GB (GPU secs) | 16 A100 40 GB (GPU secs) | Improvement
:--------- | ----------------------: | ------------------------: | ----------:
1024       | 62                      | 352                        | 5.7×
2048       | 275                     | 1136                       | 4.1×
3072       | 703                     | 2016                       | 2.9×
4096       | 1434                    | 3648                       | 2.5×
5120       | 2547                    | 5552                       | 2.2×

## Running the Pipeline in Stages

The `run_alphafold.py` script can be executed in stages to optimise resource
utilisation. This can be useful for:

1. Splitting the CPU-only data pipeline from model inference (which requires a
   GPU), to optimise cost and resource usage.
1. Caching the results of MSA/template search, then reusing the augmented JSON
   for multiple different inferences across seeds or across variations of other
   features (e.g. a ligand).

### Data Pipeline Only

Launch `run_alphafold.py` with `--norun_inference` to generate Multiple Sequence
Alignments (MSAs) and templates, without running featurisation and model
inference. This stage can be quite costly in terms of runtime, CPU, and RAM use.
The output will be JSON files augmented with MSAs and templates that can then be
directly used as input for running inference.
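
For example, a data-pipeline-only run might look like the following sketch
(paths are placeholders; `--json_path` and `--output_dir` are the standard
input/output flags of `run_alphafold.py`):

```sh
# Stage 1: MSA and template search only; no GPU is needed for this stage.
python run_alphafold.py \
  --json_path=/root/af_input/fold_input.json \
  --output_dir=/root/af_output \
  --norun_inference
```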

### Featurisation and Model Inference Only

Launch `run_alphafold.py` with `--norun_data_pipeline` to skip the data pipeline
and run only featurisation and model inference. This stage requires the input
JSON file to contain pre-computed MSAs and templates.
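
Continuing the sketch above, the inference-only stage can then reuse the
augmented JSON produced by the data pipeline (the exact name and location of the
augmented JSON depend on the job name, so the path below is a placeholder):

```sh
# Stage 2: featurisation and model inference on the augmented JSON.
python run_alphafold.py \
  --json_path=/root/af_output/fold_input/fold_input_data.json \
  --output_dir=/root/af_output/inference \
  --norun_data_pipeline
```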

## Accelerator Hardware Requirements

We officially support the following configurations, and have extensively tested
them for numerical accuracy and throughput efficiency:

- 1 NVIDIA A100 (80 GB)
- 1 NVIDIA H100 (80 GB)

### Other Hardware Configurations

#### NVIDIA A100 (40 GB)

AlphaFold 3 can run on a single NVIDIA A100 (40 GB) with the following
configuration changes:

1. Enabling [unified memory](#unified-memory).
1. Adjusting `pair_transition_shard_spec` in `model_config.py` (the comments
   below reflect the assumed reading of each entry):

    ```py
    # Each entry is read as (num_tokens upper bound, shard size);
    # None means "no upper bound" / "do not shard".
    pair_transition_shard_spec: Sequence[_Shape2DType] = (
        (2048, None),
        (3072, 1024),
        (None, 512),
    )
    ```

While numerically accurate, this configuration will have lower throughput
compared to the setup on the NVIDIA A100 (80 GB), due to less available memory.

#### NVIDIA V100 (16 GB)

While you can run AlphaFold 3 on sequences up to 1,280 tokens on a single NVIDIA
V100 using the flag `--flash_attention_implementation=xla` in
`run_alphafold.py`, this configuration has not been tested for numerical
accuracy or throughput efficiency, so please proceed with caution.
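
As a rough sketch, such a run might look like the following; only
`--flash_attention_implementation=xla` is specific to this section, and the
input/output flags are the usual placeholders:

```sh
python run_alphafold.py \
  --json_path=/root/af_input/fold_input.json \
  --output_dir=/root/af_output \
  --flash_attention_implementation=xla
```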

## Additional Flags

### Compilation Time Workaround with XLA Flags

To work around a known XLA issue causing the compilation time to greatly
increase, the following environment variable must be set (it is set by default
in the provided `Dockerfile`):

```sh
ENV XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
```
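
If you are running outside the provided `Dockerfile`, the equivalent setting in
a shell is:

```sh
export XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
```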

### GPU Memory

The following environment variables (set by default in the `Dockerfile`) enable
folding a single input of size up to 5,120 tokens on a single A100 with 80 GB of
memory:

```sh
ENV XLA_PYTHON_CLIENT_PREALLOCATE=true
ENV XLA_CLIENT_MEM_FRACTION=0.95
```

#### Unified Memory

If you would like to run AlphaFold 3 on a GPU with less memory (an A100 with 40
GB of memory, for instance), we recommend enabling unified memory. Enabling
unified memory allows the program to spill GPU memory to host memory if there
isn't enough space. This prevents an OOM, at the cost of making the program
slower by accessing host memory instead of device memory. To learn more, check
out the
[NVIDIA blog post](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/).

You can enable unified memory by setting the following environment variables in
your `Dockerfile`:

```sh
# Disable JAX's up-front preallocation of GPU memory.
ENV XLA_PYTHON_CLIENT_PREALLOCATE=false
# Enable unified memory so allocations can spill to host RAM.
ENV TF_FORCE_UNIFIED_MEMORY=true
# Allow addressing up to 3.2x the physical GPU memory.
ENV XLA_CLIENT_MEM_FRACTION=3.2
```

### JAX Persistent Compilation Cache

You may also want to make use of the JAX persistent compilation cache, to avoid
unnecessary recompilation of the model between runs. You can enable the
compilation cache with the `--jax_compilation_cache_dir <YOUR_DIRECTORY>` flag
in `run_alphafold.py`.
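
For example (the cache directory below is a placeholder):

```sh
python run_alphafold.py \
  --json_path=/root/af_input/fold_input.json \
  --output_dir=/root/af_output \
  --jax_compilation_cache_dir=/root/af_cache
```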

More detailed instructions are available in the
[JAX documentation](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html#persistent-compilation-cache),
and more specifically the instructions for use on
[Google Cloud](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html#persistent-compilation-cache).
In particular, note that if you would like to make use of a non-local
filesystem, such as Google Cloud Storage, you will need to install
[`etils`](https://github.com/google/etils) (this is not included by default in
the AlphaFold 3 Docker container).