# Performance

## Data Pipeline

The runtime of the data pipeline (i.e. genetic sequence search and template
search) can vary significantly depending on the size of the input and the number
of homologous sequences found, as well as the available hardware (disk speed can
influence genetic search speed in particular). If you would like to improve
performance, it's recommended to increase the disk speed (e.g. by leveraging a
RAM-backed filesystem), or to increase the number of available CPU cores and add
more parallelisation. Also note that for sequences with deep MSAs, Jackhmmer or
Nhmmer may need a substantial amount of RAM beyond the recommended 64 GB.
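
For example, on a Linux machine with enough spare RAM you can stage the genetic
databases on a RAM-backed filesystem before running the pipeline. The sketch
below is illustrative rather than prescriptive: the database path and the tmpfs
size are placeholders, and `--db_dir`, `--json_path`, and `--output_dir` are the
standard `run_alphafold.py` flags.

```sh
# Create a RAM-backed filesystem large enough to hold the databases
# (the size here is illustrative; check your actual database footprint first).
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=700G tmpfs /mnt/ramdisk
cp -r /srv/alphafold_databases/* /mnt/ramdisk/

# Point the genetic search at the RAM-backed copy of the databases.
python run_alphafold.py \
  --db_dir=/mnt/ramdisk \
  --json_path=/root/af_input/fold_input.json \
  --output_dir=/root/af_output
```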

## Model Inference

Table 8 in the Supplementary Information of the
[AlphaFold 3 paper](https://nature.com/articles/s41586-024-07487-w) provides
compile-free inference timings for AlphaFold 3 when configured to run on 16
NVIDIA A100s, with 40 GB of memory per device. In contrast, this repository
supports running AlphaFold 3 on a single NVIDIA A100 with 80 GB of memory in a
configuration optimised to maximise throughput.

We compare compile-free inference timings of these two setups in the table below
using GPU seconds (i.e. multiplying by 16 when using 16 A100s). The setup in
this repository is more efficient (by at least 2×) across all token sizes,
indicating its suitability for high-throughput applications.

Num Tokens | 1 A100 80 GB (GPU secs) | 16 A100 40 GB (GPU secs) | Improvement
:--------- | ----------------------: | ------------------------: | ----------:
1024       | 62                      | 352                        | 5.7×
2048       | 275                     | 1136                       | 4.1×
3072       | 703                     | 2016                       | 2.9×
4096       | 1434                    | 3648                       | 2.5×
5120       | 2547                    | 5552                       | 2.2×

## Running the Pipeline in Stages

The `run_alphafold.py` script can be executed in stages to optimise resource
utilisation. This can be useful for:

1. Splitting the CPU-only data pipeline from model inference (which requires a
   GPU), to optimise cost and resource usage.
1. Caching the results of MSA/template search, then reusing the augmented JSON
   for multiple different inferences across seeds or across variations of other
   features (e.g. a ligand).

### Data Pipeline Only

Launch `run_alphafold.py` with `--norun_inference` to generate Multiple Sequence
Alignments (MSAs) and templates, without running featurisation and model
inference. This stage can be quite costly in terms of runtime, CPU, and RAM use.
The output will be JSON files augmented with MSAs and templates that can then be
directly used as input for running inference.
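
For example, a data-pipeline-only run might look like the following sketch
(paths are placeholders; `--json_path` and `--output_dir` are the standard
input/output flags of `run_alphafold.py`):

```sh
# Stage 1: MSA and template search only; no GPU is needed for this stage.
python run_alphafold.py \
  --json_path=/root/af_input/fold_input.json \
  --output_dir=/root/af_output \
  --norun_inference
```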

### Featurisation and Model Inference Only

Launch `run_alphafold.py` with `--norun_data_pipeline` to skip the data pipeline
and run only featurisation and model inference. This stage requires the input
JSON file to contain pre-computed MSAs and templates.
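
Continuing the sketch above, the inference-only stage can then reuse the
augmented JSON produced by the data pipeline (the exact name and location of the
augmented JSON depend on the job name, so the path below is a placeholder):

```sh
# Stage 2: featurisation and model inference on the augmented JSON.
python run_alphafold.py \
  --json_path=/root/af_output/fold_input/fold_input_data.json \
  --output_dir=/root/af_output/inference \
  --norun_data_pipeline
```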

## Accelerator Hardware Requirements

We officially support the following configurations, and have extensively tested
them for numerical accuracy and throughput efficiency:

- 1 NVIDIA A100 (80 GB)
- 1 NVIDIA H100 (80 GB)

### Other Hardware Configurations

#### NVIDIA A100 (40 GB)

AlphaFold 3 can run on a single NVIDIA A100 (40 GB) with the following
configuration changes:

1. Enabling [unified memory](#unified-memory).
1. Adjusting `pair_transition_shard_spec` in `model_config.py` (the comments
   below reflect the assumed reading of each entry):

    ```py
    # Each entry is read as (num_tokens upper bound, shard size);
    # None means "no upper bound" / "do not shard".
    pair_transition_shard_spec: Sequence[_Shape2DType] = (
        (2048, None),
        (3072, 1024),
        (None, 512),
    )
    ```

While numerically accurate, this configuration will have lower throughput
compared to the setup on the NVIDIA A100 (80 GB), due to less available memory.

#### NVIDIA V100 (16 GB)

While you can run AlphaFold 3 on sequences up to 1,280 tokens on a single NVIDIA
V100 using the flag `--flash_attention_implementation=xla` in
`run_alphafold.py`, this configuration has not been tested for numerical
accuracy or throughput efficiency, so please proceed with caution.
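
As a rough sketch, such a run might look like the following; only
`--flash_attention_implementation=xla` is specific to this section, and the
input/output flags are the usual placeholders:

```sh
python run_alphafold.py \
  --json_path=/root/af_input/fold_input.json \
  --output_dir=/root/af_output \
  --flash_attention_implementation=xla
```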

## Additional Flags

### Compilation Time Workaround with XLA Flags

To work around a known XLA issue causing the compilation time to greatly
increase, the following environment variable must be set (it is set by default
in the provided `Dockerfile`):

```sh
ENV XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
```
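
If you are running outside the provided `Dockerfile`, the equivalent setting in
a shell is:

```sh
export XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
```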

### GPU Memory

The following environment variables (set by default in the `Dockerfile`) enable
folding a single input of size up to 5,120 tokens on a single A100 with 80 GB of
memory:

```sh
ENV XLA_PYTHON_CLIENT_PREALLOCATE=true
ENV XLA_CLIENT_MEM_FRACTION=0.95
```

#### Unified Memory

If you would like to run AlphaFold 3 on a GPU with less memory (an A100 with 40
GB of memory, for instance), we recommend enabling unified memory. Enabling
unified memory allows the program to spill GPU memory to host memory if there
isn't enough space. This prevents an OOM, at the cost of making the program
slower by accessing host memory instead of device memory. To learn more, check
out the
[NVIDIA blog post](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/).

You can enable unified memory by setting the following environment variables in
your `Dockerfile`:

```sh
# Disable JAX's up-front preallocation of GPU memory.
ENV XLA_PYTHON_CLIENT_PREALLOCATE=false
# Enable unified memory so allocations can spill to host RAM.
ENV TF_FORCE_UNIFIED_MEMORY=true
# Allow addressing up to 3.2x the physical GPU memory.
ENV XLA_CLIENT_MEM_FRACTION=3.2
```

### JAX Persistent Compilation Cache

You may also want to make use of the JAX persistent compilation cache, to avoid
unnecessary recompilation of the model between runs. You can enable the
compilation cache with the `--jax_compilation_cache_dir <YOUR_DIRECTORY>` flag
in `run_alphafold.py`.
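
For example (the cache directory below is a placeholder):

```sh
python run_alphafold.py \
  --json_path=/root/af_input/fold_input.json \
  --output_dir=/root/af_output \
  --jax_compilation_cache_dir=/root/af_cache
```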

More detailed instructions are available in the
[JAX documentation](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html#persistent-compilation-cache),
and more specifically the instructions for use on
[Google Cloud](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html#persistent-compilation-cache).
In particular, note that if you would like to make use of a non-local
filesystem, such as Google Cloud Storage, you will need to install
[`etils`](https://github.com/google/etils) (this is not included by default in
the AlphaFold 3 Docker container).