first add

Your Name
2024-11-24 20:53:33 +08:00
commit c0239f4a3d
180 changed files with 57702 additions and 0 deletions

31
docs/contributing.md Normal file

@@ -0,0 +1,31 @@
# How to Contribute
We welcome small patches related to bug fixes and documentation, but we do not
plan to make any major changes to this repository.
## Before You Begin
### Sign Our Contributor License Agreement
Contributions to this project must be accompanied by a
[Contributor License Agreement](https://cla.developers.google.com/about) (CLA).
You (or your employer) retain the copyright to your contribution; this simply
gives us permission to use and redistribute your contributions as part of the
project.
If you or your current employer have already signed the Google CLA (even if it
was for a different project), you probably don't need to do it again.
Visit <https://cla.developers.google.com/> to see your current agreements or to
sign a new one.
### Review Our Community Guidelines
This project follows
[Google's Open Source Community Guidelines](https://opensource.google/conduct/).
## Contribution Process
We won't accept pull requests directly, but if you send one, we will review it.
If we send a fix based on your pull request, we will make sure to credit you in
the release notes.

BIN
docs/header.jpg Normal file

Binary file not shown (380 KiB).

723
docs/input.md Normal file

@@ -0,0 +1,723 @@
# AlphaFold 3 Input
## Specifying Input Files
You can provide inputs to `run_alphafold.py` in one of two ways:
- Single input file: Use the `--json_path` flag followed by the path to a
single JSON file.
- Multiple input files: Use the `--input_dir` flag followed by the path to a
directory of JSON files.
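
As a minimal sketch, the two input modes could be wrapped in a small helper that builds the command line. The helper name `build_command`, the `python` launcher, and the "exactly one mode" check are illustrative assumptions; only the `--json_path` and `--input_dir` flags come from the text above.

```python
from typing import Optional


def build_command(json_path: Optional[str] = None,
                  input_dir: Optional[str] = None) -> list[str]:
    """Builds an argument list for run_alphafold.py (hypothetical helper)."""
    # Assumption: exactly one of the two input modes is chosen per run.
    if (json_path is None) == (input_dir is None):
        raise ValueError("Provide exactly one of json_path or input_dir.")
    cmd = ["python", "run_alphafold.py"]
    if json_path is not None:
        cmd += ["--json_path", json_path]
    else:
        cmd += ["--input_dir", input_dir]
    return cmd
```

The returned list can be passed to `subprocess.run` once the environment (for example the Docker container) is set up.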
## Input Format
AlphaFold 3 uses a custom JSON input format differing from the
[AlphaFold Server JSON input format](https://github.com/google-deepmind/alphafold/tree/main/server).
See [below](#alphafold-server-json-compatibility) for more information.
The custom AlphaFold 3 format allows:
* Specifying protein, RNA, and DNA chains, including modified residues.
* Specifying custom multiple sequence alignment (MSA) for protein and RNA
chains.
* Specifying custom structural templates for protein chains.
* Specifying ligands using
[Chemical Component Dictionary (CCD)](https://www.wwpdb.org/data/ccd) codes.
* Specifying ligands using SMILES.
* Specifying ligands by defining them using the CCD mmCIF format and supplying
them via the [user-provided CCD](#user-provided-ccd).
* Specifying covalent bonds between entities.
* Specifying multiple random seeds.
## AlphaFold Server JSON Compatibility
The [AlphaFold Server](https://alphafoldserver.com/) uses a separate
[JSON format](https://github.com/google-deepmind/alphafold/tree/main/server)
from the one used here in the AlphaFold 3 codebase. In particular, the JSON
format used in the AlphaFold 3 codebase offers more flexibility and control in
defining custom ligands, branched glycans, and covalent bonds between entities.
We provide a converter in `run_alphafold.py` which automatically detects the
input JSON format, denoted `dialect` in the converter code. The converter
denotes the AlphaFold Server JSON as `alphafoldserver`, and the JSON format
defined here in the AlphaFold 3 codebase as `alphafold3`. If the detected input
JSON format is `alphafoldserver`, then the converter will translate that into
the JSON format `alphafold3`.
### Multiple Inputs
The top-level of the `alphafoldserver` JSON format is a list, allowing
specification of multiple inputs in a single JSON. In contrast, the `alphafold3`
JSON format requires exactly one input per JSON file. Specifying multiple inputs
in a single `alphafoldserver` JSON is fully supported.
Note that the converter distinguishes between `alphafoldserver` and `alphafold3`
JSON formats by checking if the top-level of the JSON is a list or not. In
particular, if you pass in an `alphafoldserver`-style JSON without a top-level
list, then this is considered incorrect and `run_alphafold.py` will raise an
error.
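
The detection rule described above can be sketched as follows. This is not the converter's actual code; the function name `detect_dialect` is an assumption made for illustration.

```python
import json


def detect_dialect(json_text: str) -> str:
    """Guesses the input dialect from the top-level JSON type (illustrative only)."""
    data = json.loads(json_text)
    # Rule described above: a top-level list means the alphafoldserver format.
    if isinstance(data, list):
        return "alphafoldserver"
    if isinstance(data, dict):
        return "alphafold3"
    raise ValueError("Top level of the input JSON must be a list or an object.")
```

Under this rule, an `alphafoldserver`-style JSON without the top-level list is classified as `alphafold3` and would then fail validation, which matches the error behaviour described above.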
### Glycans
If the JSON in `alphafoldserver` format specifies glycans, the converter will
raise an error. This is because translating glycans specified in the
`alphafoldserver` format to the `alphafold3` format is not currently supported.
### Random Seeds
The `alphafoldserver` JSON format allows users to specify `"modelSeeds": []`, in
which case a seed is chosen randomly for the user. On the other hand, the
`alphafold3` format requires users to specify a seed.
The converter will choose a seed randomly if `"modelSeeds": []` is set when
translating from `alphafoldserver` JSON format to `alphafold3` JSON format. If
seeds are specified in the `alphafoldserver` JSON format, then those will be
preserved in the translation to the `alphafold3` JSON format.
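
A minimal sketch of this seed-handling rule is shown below. The function name `resolve_seeds` and the 32-bit seed range are assumptions; the real converter may draw seeds differently.

```python
import random


def resolve_seeds(model_seeds: list[int]) -> list[int]:
    """Keeps user-provided seeds, or picks one at random if none were given."""
    if model_seeds:
        # Seeds given in the alphafoldserver JSON are preserved as-is.
        return list(model_seeds)
    # Assumption: a single seed drawn from the unsigned 32-bit range.
    return [random.randint(0, 2**32 - 1)]
```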
### Ions
While AlphaFold Server treats ions and ligands as different entity types in the
JSON format, AlphaFold 3 treats ions as ligands. Therefore, to specify e.g. a
magnesium ion, one would specify it as an entity of type `ligand` with
`ccdCodes: ["MG"]`.
### Sequence IDs
The `alphafold3` JSON format requires the user to specify a unique identifier
(`id`) for each entity. On the other hand, the `alphafoldserver` does not allow
specification of an `id` for each entity. Thus, the converter automatically
assigns one.
The converter iterates through the list provided in the `sequences` field of the
`alphafoldserver` JSON format, assigning an `id` to each entity using the
following order ("reverse spreadsheet style"):
```
A, B, ..., Z, AA, BA, CA, ..., ZA, AB, BB, CB, ..., ZB, ...
```
For any entity with `count > 1`, an `id` is assigned arbitrarily to each "copy"
of the entity.
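
The "reverse spreadsheet style" ordering can be reproduced with a short function. This is a sketch that follows the sequence listed above, not the converter's own implementation; the name `reverse_spreadsheet_id` is hypothetical.

```python
import string


def reverse_spreadsheet_id(index: int) -> str:
    """Maps a 0-based index to A, B, ..., Z, AA, BA, CA, ..., ZA, AB, ..."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = []
    n = index + 1
    while n > 0:
        # Bijective base-26, emitting the least significant letter first,
        # which is what makes the order "reversed" relative to a spreadsheet.
        n, remainder = divmod(n - 1, 26)
        letters.append(string.ascii_uppercase[remainder])
    return "".join(letters)
```

For example, indices 0, 25, 26, 27 and 52 map to `A`, `Z`, `AA`, `BA` and `AB` respectively.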
## Top-level Structure
The top-level structure of the input JSON is:
```json
{
"name": "Job name goes here",
"modelSeeds": [1, 2], # At least one seed required.
"sequences": [
{"protein": {...}},
{"rna": {...}},
{"dna": {...}},
{"ligand": {...}}
],
"bondedAtomPairs": [...], # Optional
"userCCD": "...", # Optional
"dialect": "alphafold3", # Required
"version": 1 # Required
}
```
The fields specify the following:
* `name: str`: The name of the job. A sanitised version of this name is used
for naming the output files.
* `modelSeeds: list[int]`: A list of integer random seeds. The pipeline and
the model will be invoked with each of the seeds in the list. I.e. if you
provide *n* random seeds, you will get *n* predicted structures, each with
the respective random seed. You must provide at least one random seed.
* `sequences: list[Protein | RNA | DNA | Ligand]`: A list of sequence
dictionaries, each defining a molecular entity, see below.
* `bondedAtomPairs: list[Bond]`: An optional list of covalently bonded atoms.
These can link atoms within an entity, or across two entities. See more
below.
* `userCCD: str`: An optional string with user-provided chemical components
dictionary. This is an expert mode for providing custom molecules when
SMILES is not sufficient. This should also be used when you have a custom
molecule that needs to be bonded with other entities - SMILES can't be used
in such cases since it doesn't give the possibility of uniquely naming all
atoms. It can also be used to provide a reference conformer for cases where
RDKit fails to generate a conformer. See more below.
* `dialect: str`: The dialect of the input JSON. This must be set to
`alphafold3`. See
[AlphaFold Server JSON Compatibility](#alphafold-server-json-compatibility)
for more information.
* `version: int`: The version of the input JSON. This must be set to 1. See
[AlphaFold Server JSON Compatibility](#alphafold-server-json-compatibility)
for more information.
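
As an illustration, the required top-level fields can be assembled in Python and serialised with the standard `json` module. The helper `make_input` is hypothetical; the field names and the fixed `dialect` and `version` values are taken from the list above.

```python
import json


def make_input(name: str, seeds: list[int], sequences: list[dict]) -> dict:
    """Builds a minimal alphafold3-dialect input dictionary (illustrative)."""
    if not seeds:
        raise ValueError("At least one model seed is required.")
    return {
        "name": name,
        "modelSeeds": seeds,
        "sequences": sequences,
        "dialect": "alphafold3",  # Must be set to alphafold3.
        "version": 1,             # Must be set to 1.
    }


fold_input = make_input(
    name="Hello fold",
    seeds=[1, 2],
    sequences=[{"protein": {"id": "A", "sequence": "PVLSCGEWQL"}}],
)
print(json.dumps(fold_input, indent=2))
```

The optional `bondedAtomPairs` and `userCCD` fields are omitted here; they can be added to the dictionary when needed.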
## Sequences
The `sequences` section specifies the protein chains, RNA chains, DNA chains,
and ligands. Every entity in `sequences` must have a unique ID. IDs don't have
to be sorted alphabetically.
### Protein
Specifies a single protein chain.
```json
{
"protein": {
"id": "A",
"sequence": "PVLSCGEWQL",
"modifications": [
{"ptmType": "HY3", "ptmPosition": 1},
{"ptmType": "P1L", "ptmPosition": 5}
],
"unpairedMsa": ...,
"pairedMsa": ...,
"templates": [...]
}
}
```
The fields specify the following:
* `id: str | list[str]`: An uppercase letter or multiple letters specifying
the unique IDs for each copy of this protein chain. The IDs are then also
used in the output mmCIF file. Specifying a list of IDs (e.g. `["A", "B",
"C"]`) implies a homomeric chain with multiple copies.
* `sequence: str`: The amino-acid sequence, specified as a string that uses
the 1-letter standard amino acid codes.
* `modifications: list[ProteinModification]`: An optional list of
post-translational modifications. Each modification is specified using its
CCD code and 1-based residue position. In the example above, we see that the
first residue won't be a proline (`P`) but instead `HY3`.
* `unpairedMsa: str`: An optional multiple sequence alignment for this chain.
This is specified using the A3M format (equivalent to the FASTA format, but
also allows gaps denoted by the hyphen `-` character). See more details
below.
* `pairedMsa: str`: We recommend *not* using this optional field and using the
`unpairedMsa` for the purposes of pairing. See more details below.
* `templates: list[Template]`: An optional list of structural templates. See
more details below.
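
To make the 1-based `ptmPosition` convention concrete, the sketch below lists the residues of the example chain after the modifications are applied. The helper `apply_modifications` is an illustrative assumption and is not part of AlphaFold 3.

```python
def apply_modifications(sequence: str, modifications: list[dict]) -> list[str]:
    """Returns per-residue codes, with modified positions replaced by CCD codes."""
    residues = list(sequence)
    for mod in modifications:
        position = mod["ptmPosition"]  # 1-based residue position.
        if not 1 <= position <= len(residues):
            raise ValueError(f"ptmPosition {position} is outside the sequence.")
        residues[position - 1] = mod["ptmType"]
    return residues


example = apply_modifications(
    "PVLSCGEWQL",
    [{"ptmType": "HY3", "ptmPosition": 1}, {"ptmType": "P1L", "ptmPosition": 5}],
)
```

Here `example[0]` is `HY3` (replacing the proline) and `example[4]` is `P1L`; all other entries keep their 1-letter codes.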
### RNA
Specifies a single RNA chain.
```json
{
"rna": {
"id": "A",
"sequence": "AGCU",
"modifications": [
{"modificationType": "2MG", "basePosition": 1},
{"modificationType": "5MC", "basePosition": 4}
],
"unpairedMsa": ...
}
}
```
The fields specify the following:
* `id: str | list[str]`: An uppercase letter or multiple letters specifying
the unique IDs for each copy of this RNA chain. The IDs are then also used
in the output mmCIF file. Specifying a list of IDs (e.g. `["A", "B", "C"]`)
implies a homomeric chain with multiple copies.
* `sequence: str`: The RNA sequence, specified as a string using only the
letters `A`, `C`, `G`, `U`.
* `modifications: list[RnaModification]`: An optional list of modifications.
Each modification is specified using its CCD code and 1-based base position.
* `unpairedMsa: str`: An optional multiple sequence alignment for this chain.
This is specified using the A3M format. See more details below.
### DNA
Specifies a single DNA chain.
```json
{
"dna": {
"id": "A",
"sequence": "GACCTCT",
"modifications": [
{"modificationType": "6OG", "basePosition": 1},
{"modificationType": "6MA", "basePosition": 2}
]
}
}
```
The fields specify the following:
* `id: str | list[str]`: An uppercase letter or multiple letters specifying
the unique IDs for each copy of this DNA chain. The IDs are then also used
in the output mmCIF file. Specifying a list of IDs (e.g. `["A", "B", "C"]`)
implies a homomeric chain with multiple copies.
* `sequence: str`: The DNA sequence, specified as a string using only the
letters `A`, `C`, `G`, `T`.
* `modifications: list[DnaModification]`: An optional list of modifications.
Each modification is specified using its CCD code and 1-based base position.
### Ligands
Specifies a single ligand. Ligands can be specified using 3 different formats:
1. [CCD code(s)](https://www.wwpdb.org/data/ccd). This is the easiest way to
specify ligands. Supports specifying covalent bonds to other entities. CCD
from 2022-09-28 is used. If multiple CCD codes are specified, you may want
to specify a bond between these and/or a bond to some other entity. See the
[bonds](#bonds) section below.
2. [SMILES string](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System).
This enables specifying ligands that are not in CCD. If using SMILES, you
cannot specify covalent bonds to other entities as these rely on specific
atom names - see the next option for what to use for this case.
3. User-provided CCD + custom ligand codes. This enables specifying ligands not
in CCD, while also supporting specification of covalent bonds to other
entities and backup reference coordinates for when RDKit fails to generate a
conformer. This offers the most flexibility, but also requires careful
attention to get all of the details right.
```json
{
"ligand": {
"id": ["G", "H", "I"],
"ccdCodes": ["ATP"]
}
},
{
"ligand": {
"id": "J",
"ccdCodes": ["LIG-1337"]
}
},
{
"ligand": {
"id": "K",
"smiles": "CC(=O)OC1C[NH+]2CCC1CC2"
}
}
```
The fields specify the following:
* `id: str | list[str]`: An uppercase letter (or multiple letters) specifying
the unique ID of this ligand. This ID is then also used in the output mmCIF
file. Specifying a list of IDs (e.g. `["A", "B", "C"]`) implies a ligand
that has multiple copies.
* `ccdCodes: list[str]`: An optional list of CCD codes. These could be either
standard CCD codes, or custom codes pointing to the
[user-provided CCD](#user-provided-ccd).
* `smiles: str`: An optional string defining the ligand using a SMILES string.
Each ligand may be specified using CCD codes or SMILES but not both, i.e. for a
given ligand, the `ccdCodes` and `smiles` fields are mutually exclusive.
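
The mutual exclusivity of `ccdCodes` and `smiles` can be checked before submitting a job. The sketch below is hypothetical; in particular, treating a ligand with neither field as an error is an assumption.

```python
def check_ligand(ligand: dict) -> None:
    """Raises ValueError unless exactly one of ccdCodes / smiles is present."""
    has_ccd = "ccdCodes" in ligand
    has_smiles = "smiles" in ligand
    # Assumption: a ligand with neither field is also rejected.
    if has_ccd == has_smiles:
        raise ValueError("Specify exactly one of ccdCodes or smiles.")


check_ligand({"id": ["G", "H", "I"], "ccdCodes": ["ATP"]})
check_ligand({"id": "K", "smiles": "CC(=O)OC1C[NH+]2CCC1CC2"})
```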
### Ions
Ions are treated as ligands, e.g. a magnesium ion would simply be a ligand with
`ccdCodes: ["MG"]`.
## Multiple Sequence Alignment
Protein and RNA chains allow setting a custom Multiple Sequence Alignment (MSA).
If not set, the data pipeline will automatically build MSAs for protein and RNA
entities using Jackhmmer/Nhmmer search over genetic databases as described in
the paper.
There are 3 modes for MSA:
1. If the `unpairedMsa` field is unset, AlphaFold 3 will build the MSA
automatically. This is the recommended option.
2. If the `unpairedMsa` field is set to an empty string (`""`), AlphaFold 3
will not build the MSA and the MSA input to the model will be empty.
3. If the `unpairedMsa` field is set to a custom A3M string, AlphaFold 3 will
use the provided MSA instead of building one as part of the data pipeline.
This is considered an expert option.
Note that if you set the `unpairedMsa` field for a particular protein entity,
you will also have to explicitly set the `pairedMsa` field (typically to empty
string) and templates (either to a list of templates, or an empty list to run
template-free). For example this will run the protein chain A with the given
MSA, but without any templates:
```json
{
"protein": {
"id": "A",
"sequence": ...,
"unpairedMsa": "The A3M you want to run with",
"pairedMsa": "",
"templates": []
}
}
```
When setting your own MSA, you have to make sure that:
1. The MSA is a valid A3M file. This means adhering to the FASTA format while
also allowing lowercase characters denoting inserted residues and hyphens
(`-`) denoting gaps in sequences.
2. The first sequence is exactly equal to the query sequence.
3. If all insertions are removed from MSA hits (i.e. all lowercase letters are
removed), all sequences have exactly the same length as the query (they form
an exact rectangular matrix).
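
The second and third requirements above can be checked with a few lines of Python. This is a simplified sketch and not the parser AlphaFold 3 uses; the function names are hypothetical.

```python
def parse_a3m(a3m: str) -> list[str]:
    """Returns the sequences of an A3M string, ignoring the header lines."""
    sequences = []
    current = None
    for line in a3m.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                sequences.append(current)
            current = ""
        elif current is None:
            raise ValueError("Sequence data found before the first '>' header.")
        else:
            current += line
    if current is not None:
        sequences.append(current)
    return sequences


def check_msa(a3m: str, query: str) -> None:
    """Raises ValueError if the MSA breaks the rules listed above."""
    sequences = parse_a3m(a3m)
    if not sequences or sequences[0] != query:
        raise ValueError("The first sequence must equal the query sequence.")
    for seq in sequences:
        # Lowercase letters denote insertions and are removed before comparing.
        aligned = "".join(c for c in seq if not c.islower())
        if len(aligned) != len(query):
            raise ValueError(f"Row {seq!r} does not match the query length.")
```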
### MSA Pairing
MSA pairing matters only when folding multiple chains (multimers), since we need
to find a way to concatenate MSAs for the individual chains along the sequence
dimension. If done naively, by simply concatenating the individual MSA matrices
along the sequence dimension and padding so that all MSAs have the same depth,
one can end up with rows in the concatenated MSA that are formed by sequences
from different organisms.
It may be desirable to ensure that across multiple chains, sequences in the MSA
that are from the same organism end up in the same MSA row. AlphaFold 3
internally achieves this by looking for the UniProt organism ID in the
`pairedMsa` and pairing sequences based on this information.
We recommend that users do the pairing manually, or use the output of appropriate
software, and then provide the MSA using only the `unpairedMsa` field. This
method gives exact control over the placement of each sequence in the MSA, as
opposed to relying on name-matching post-processing heuristics used for
`pairedMsa`.
When setting `unpairedMsa` manually, the `pairedMsa` must be explicitly set to an
empty string (`""`).
For instance, if there are two chains `DEEP` and `MIND` which we want to be
paired on organism A and C, we can achieve it as follows:
```text
> query
DEEP
> match 1 (organism A)
D--P
> match 2 (organism B)
DD-P
> match 3 (organism C)
DD-P
```
```text
> query
MIND
> match 1 (organism A)
M--D
> Empty hit to make sure pairing is achieved
----
> match 2 (organism C)
MIN-
```
The resulting MSA when chains are concatenated will then be:
```text
> query
DEEPMIND
> match 1 + match 1
D--PM--D
> match 2 + padding
DD-P----
> match 3 + match 2
DD-PMIN-
```
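
The row-wise concatenation shown above can be reproduced with the sketch below. It illustrates the idea only and is not AlphaFold 3's internal procedure; the function name and the padding behaviour for MSAs of different depth are assumptions.

```python
def concatenate_msas(msa_a: list[str], msa_b: list[str]) -> list[str]:
    """Joins two row-aligned MSAs along the sequence dimension."""
    width_a, width_b = len(msa_a[0]), len(msa_b[0])
    depth = max(len(msa_a), len(msa_b))
    # Pad the shallower MSA with all-gap rows so both have the same depth.
    padded_a = msa_a + ["-" * width_a] * (depth - len(msa_a))
    padded_b = msa_b + ["-" * width_b] * (depth - len(msa_b))
    return [row_a + row_b for row_a, row_b in zip(padded_a, padded_b)]


rows = concatenate_msas(
    ["DEEP", "D--P", "DD-P", "DD-P"],
    ["MIND", "M--D", "----", "MIN-"],
)
```

The explicit all-gap row in the second MSA is what keeps the organism C hits on the same row.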
## Structural Templates
Structural templates can be specified only for protein chains:
```json
"templates": [
{
"mmcif": ...,
"queryIndices": [0, 1, 2, 4, 5, 6],
"templateIndices": [0, 1, 2, 3, 4, 8]
}
]
```
A template is specified as an mmCIF string containing a single chain with the
structural template together with a 0-based mapping that maps query residue
indices to the template residue indices. The mapping is specified using two
lists of the same length. E.g. to express a mapping `{0: 0, 1: 2, 2: 5, 3: 6}`,
you would specify the two indices lists as:
```json
"queryIndices": [0, 1, 2, 3],
"templateIndices": [0, 2, 5, 6]
```
You can provide multiple structural templates. Note that if an mmCIF containing
more than one chain is provided, you will get an error since it is not possible
to determine which of the chains should be used as the template.
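
If the mapping is held as a Python dictionary, the two index lists can be derived as in the sketch below. The helper name `mapping_to_indices` is hypothetical.

```python
def mapping_to_indices(mapping: dict[int, int]) -> tuple[list[int], list[int]]:
    """Splits a query-to-template mapping into two equal-length index lists."""
    query_indices = sorted(mapping)
    template_indices = [mapping[q] for q in query_indices]
    return query_indices, template_indices


query_indices, template_indices = mapping_to_indices({0: 0, 1: 2, 2: 5, 3: 6})
```

The two results can then be stored under the `queryIndices` and `templateIndices` keys of a template.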
## Bonds
To manually specify covalent bonds, use the `bondedAtomPairs` field. This is
intended for modelling covalent ligands, and for defining multi-CCD ligands
(e.g. glycans). Defining covalent bonds between or within polymer entities is
not currently supported.
Bonds are specified as pairs of (source atom, destination atom), with each atom
being uniquely addressed using 3 fields:
* **Entity ID** (`str`): this corresponds to the `id` field for that entity.
* **Residue ID** (`int`): this is 1-based residue index *within* the chain.
For single-residue ligands, this is simply set to 1.
* **Atom name** (`str`): this is the unique atom name *within* the given
residue. The atom name for protein/RNA/DNA residues or CCD ligands can be
looked up in the CCD for the given chemical component. This also explains
why SMILES ligands don't support bonds: there is no atom name that could be
used to define the bond. This shortcoming can be addressed by using the
user-provided CCD format (see below).
The example below shows two bonds:
```json
"bondedAtomPairs": [
[["A", 145, "SG"], ["L", 1, "C04"]],
[["J", 1, "O6"], ["J", 2, "C1"]]
]
```
The first bond is between chain A, residue 145, atom SG and chain L, residue 1,
atom C04. This is a typical example for a covalent ligand. The second bond is
between chain J, residue 1, atom O6 and chain J, residue 2, atom C1. This bond
is within the same entity and is a typical example when defining a glycan.
All bonds are implicitly assumed to be covalent bonds. Other bond types are not
supported.
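
A structural sanity check for `bondedAtomPairs` might look like the sketch below. It is hypothetical: it only verifies the shape of each bond and that the entity IDs exist, and does not check atom names against the CCD.

```python
def check_bonds(bonds: list, entity_ids: set[str]) -> None:
    """Raises ValueError if a bond is not a pair of (entity, residue, atom)."""
    for bond in bonds:
        if len(bond) != 2:
            raise ValueError(f"A bond needs exactly two atoms: {bond!r}")
        for entity_id, residue_id, atom_name in bond:
            if entity_id not in entity_ids:
                raise ValueError(f"Unknown entity ID: {entity_id!r}")
            if not isinstance(residue_id, int) or residue_id < 1:
                raise ValueError(f"Residue ID must be a 1-based integer: {residue_id!r}")
            if not isinstance(atom_name, str) or not atom_name:
                raise ValueError(f"Atom name must be a non-empty string: {atom_name!r}")


check_bonds(
    [[["A", 145, "SG"], ["L", 1, "C04"]], [["J", 1, "O6"], ["J", 2, "C1"]]],
    entity_ids={"A", "L", "J"},
)
```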
### Defining Glycans
Glycans are bound to a protein residue, and they are typically formed of
multiple chemical components. To define a glycan, define a new ligand with all
of the chemical components of the glycan. Then define a bond that links the
glycan to the protein residue, and all bonds that are within the glycan between
its individual chemical components.
For example, to define the following glycan composed of 4 components (CMP1,
CMP2, CMP3, CMP4) bound to an arginine in a protein chain A:
```
ALA              CMP4
 |                |
ARG --- CMP1 --- CMP2
 |                |
ALA              CMP3
```
You will need to specify:
1. Protein chain A.
2. Ligand chain B with the 4 components.
3. Bonds ARG-CMP1, CMP1-CMP2, CMP2-CMP3, CMP2-CMP4.
## User-provided CCD
There are two approaches to model a custom ligand not defined in the CCD. If the
ligand is not bonded to other entities, it can be defined using a
[SMILES string](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System).
Otherwise, it is necessary to define that particular ligand using the
[CCD mmCIF format](https://www.wwpdb.org/data/ccd#mmcifFormat).
Once defined, this ligand needs to be assigned a name that doesn't clash with
existing CCD ligand names (e.g. `LIG-1`). Avoid underscores (`_`) in the name,
as it could cause issues in the mmCIF format.
The newly defined ligand can then be used as a standard CCD ligand using its
custom name, and bonds can be linked to it using its named atom scheme.
### User-provided CCD Format
The user-provided CCD must be passed in the `userCCD` field (in the root of the
input JSON) as a string. Note that JSON doesn't allow newlines within strings,
so newline characters (`\n`) must be used to delimit lines. Single rather than
double quotes should also be used around strings like the chemical formula.
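
When the input JSON is generated from Python, the standard `json` module takes care of the newline escaping. The fragment below is a sketch: `MY-LIG` is a hypothetical component name and the four lines are not a complete CCD entry.

```python
import json

# Hypothetical, incomplete mmCIF fragment used only to show the escaping.
ccd_lines = [
    "data_MY-LIG",
    "#",
    "_chem_comp.id MY-LIG",
    "_chem_comp.formula 'C10 H6 O4'",
]
user_ccd = "\n".join(ccd_lines)

# json.dumps writes each newline as the two characters "\" and "n".
encoded = json.dumps({"userCCD": user_ccd})
```

Using single quotes inside the mmCIF text, as in the formula line, avoids having to escape double quotes within the JSON string.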
The main pieces of information used are the atom names and elements, bonds, and
also the ideal coordinates (`pdbx_model_Cartn_{x,y,z}_ideal`) which essentially
serve as a structural template for the ligand if RDKit fails to generate
conformers for that ligand.
The `userCCD` can also be used to redefine standard chemical components in the
CCD. This can be useful if you need to redefine the ideal coordinates.
Below is an example `userCCD` redefining component X7F, which serves to
illustrate the required sections. For readability purposes, newlines have not
been replaced by `\n`.
```
data_MY-X7F
#
_chem_comp.id MY-X7F
_chem_comp.name '5,8-bis(oxidanyl)naphthalene-1,4-dione'
_chem_comp.type non-polymer
_chem_comp.formula 'C10 H6 O4'
_chem_comp.mon_nstd_parent_comp_id ?
_chem_comp.pdbx_synonyms ?
_chem_comp.formula_weight 190.152
#
loop_
_chem_comp_atom.comp_id
_chem_comp_atom.atom_id
_chem_comp_atom.alt_atom_id
_chem_comp_atom.type_symbol
_chem_comp_atom.charge
_chem_comp_atom.pdbx_align
_chem_comp_atom.pdbx_aromatic_flag
_chem_comp_atom.pdbx_leaving_atom_flag
_chem_comp_atom.pdbx_stereo_config
_chem_comp_atom.pdbx_backbone_atom_flag
_chem_comp_atom.pdbx_n_terminal_atom_flag
_chem_comp_atom.pdbx_c_terminal_atom_flag
_chem_comp_atom.model_Cartn_x
_chem_comp_atom.model_Cartn_y
_chem_comp_atom.model_Cartn_z
_chem_comp_atom.pdbx_model_Cartn_x_ideal
_chem_comp_atom.pdbx_model_Cartn_y_ideal
_chem_comp_atom.pdbx_model_Cartn_z_ideal
_chem_comp_atom.pdbx_component_atom_id
_chem_comp_atom.pdbx_component_comp_id
_chem_comp_atom.pdbx_ordinal
MY-X7F C02 C1 C 0 1 N N N N N N 48.727 17.090 17.537 -1.418 -1.260 0.018 C02 MY-X7F 1
MY-X7F C03 C2 C 0 1 N N N N N N 47.344 16.691 17.993 -0.665 -2.503 -0.247 C03 MY-X7F 2
MY-X7F C04 C3 C 0 1 N N N N N N 47.166 16.016 19.310 0.677 -2.501 -0.235 C04 MY-X7F 3
MY-X7F C05 C4 C 0 1 N N N N N N 48.363 15.728 20.184 1.421 -1.257 0.043 C05 MY-X7F 4
MY-X7F C06 C5 C 0 1 Y N N N N N 49.790 16.142 19.699 0.706 0.032 0.008 C06 MY-X7F 5
MY-X7F C07 C6 C 0 1 Y N N N N N 49.965 16.791 18.444 -0.706 0.030 -0.004 C07 MY-X7F 6
MY-X7F C08 C7 C 0 1 Y N N N N N 51.249 17.162 18.023 -1.397 1.240 -0.037 C08 MY-X7F 7
MY-X7F C10 C8 C 0 1 Y N N N N N 52.359 16.893 18.837 -0.685 2.443 -0.057 C10 MY-X7F 8
MY-X7F C11 C9 C 0 1 Y N N N N N 52.184 16.247 20.090 0.679 2.445 -0.045 C11 MY-X7F 9
MY-X7F C12 C10 C 0 1 Y N N N N N 50.899 15.876 20.515 1.394 1.243 -0.013 C12 MY-X7F 10
MY-X7F O01 O1 O 0 1 N N N N N N 48.876 17.630 16.492 -2.611 -1.301 0.247 O01 MY-X7F 11
MY-X7F O09 O2 O 0 1 N N N N N N 51.423 17.798 16.789 -2.752 1.249 -0.049 O09 MY-X7F 12
MY-X7F O13 O3 O 0 1 N N N N N N 50.710 15.236 21.750 2.750 1.257 -0.001 O13 MY-X7F 13
MY-X7F O14 O4 O 0 1 N N N N N N 48.229 15.189 21.234 2.609 -1.294 0.298 O14 MY-X7F 14
MY-X7F H1 H1 H 0 1 N N N N N N 46.487 16.894 17.367 -1.199 -3.419 -0.452 H1 MY-X7F 15
MY-X7F H2 H2 H 0 1 N N N N N N 46.178 15.732 19.640 1.216 -3.416 -0.429 H2 MY-X7F 16
MY-X7F H3 H3 H 0 1 N N N N N N 53.348 17.177 18.511 -1.221 3.381 -0.082 H3 MY-X7F 17
MY-X7F H4 H4 H 0 1 N N N N N N 53.040 16.041 20.716 1.212 3.384 -0.062 H4 MY-X7F 18
MY-X7F H5 H5 H 0 1 N N N N N N 50.579 17.904 16.365 -3.154 1.271 0.830 H5 MY-X7F 19
MY-X7F H6 H6 H 0 1 N N N N N N 49.785 15.059 21.877 3.151 1.241 -0.880 H6 MY-X7F 20
#
loop_
_chem_comp_bond.comp_id
_chem_comp_bond.atom_id_1
_chem_comp_bond.atom_id_2
_chem_comp_bond.value_order
_chem_comp_bond.pdbx_aromatic_flag
_chem_comp_bond.pdbx_stereo_config
_chem_comp_bond.pdbx_ordinal
MY-X7F O01 C02 DOUB N N 1
MY-X7F O09 C08 SING N N 2
MY-X7F C02 C03 SING N N 3
MY-X7F C02 C07 SING N N 4
MY-X7F C03 C04 DOUB N N 5
MY-X7F C08 C07 DOUB Y N 6
MY-X7F C08 C10 SING Y N 7
MY-X7F C07 C06 SING Y N 8
MY-X7F C10 C11 DOUB Y N 9
MY-X7F C04 C05 SING N N 10
MY-X7F C06 C05 SING N N 11
MY-X7F C06 C12 DOUB Y N 12
MY-X7F C11 C12 SING Y N 13
MY-X7F C05 O14 DOUB N N 14
MY-X7F C12 O13 SING N N 15
MY-X7F C03 H1 SING N N 16
MY-X7F C04 H2 SING N N 17
MY-X7F C10 H3 SING N N 18
MY-X7F C11 H4 SING N N 19
MY-X7F O09 H5 SING N N 20
MY-X7F O13 H6 SING N N 21
#
_pdbx_chem_comp_descriptor.type SMILES_CANONICAL
_pdbx_chem_comp_descriptor.descriptor 'Oc1ccc(O)c2C(=O)C=CC(=O)c12'
#
```
## Full Example
An example illustrating all the aspects of the input format is provided below.
Note that AlphaFold 3 won't run this input out of the box as it abbreviates
certain fields and the sequences are not biologically meaningful.
```json
{
"name": "Hello fold",
"modelSeeds": [10, 42],
"sequences": [
{
"protein": {
"id": "A",
"sequence": "PVLSCGEWQL",
"modifications": [
{"ptmType": "HY3", "ptmPosition": 1},
{"ptmType": "P1L", "ptmPosition": 5}
],
"unpairedMsa": ...
}
},
{
"protein": {
"id": "B",
"sequence": "RPACQLW",
"templates": [
{
"mmcif": ...,
"queryIndices": [0, 1, 2, 4, 5, 6],
"templateIndices": [0, 1, 2, 3, 4, 8]
}
]
}
},
{
"dna": {
"id": "C",
"sequence": "GACCTCT",
"modifications": [
{"modificationType": "6OG", "basePosition": 1},
{"modificationType": "6MA", "basePosition": 2}
]
}
},
{
"rna": {
"id": "E",
"sequence": "AGCU",
"modifications": [
{"modificationType": "2MG", "basePosition": 1},
{"modificationType": "5MC", "basePosition": 4}
],
"unpairedMsa": ...
}
},
{
"ligand": {
"id": ["F", "G", "H"],
"ccdCodes": ["ATP"]
}
},
{
"ligand": {
"id": "I",
"ccdCodes": ["NAG", "FUC"]
}
},
{
"ligand": {
"id": "Z",
"smiles": "CC(=O)OC1C[NH+]2CCC1CC2"
}
}
],
"bondedAtomPairs": [
[["A", 1, "CA"], ["B", 1, "CA"]],
[["A", 1, "CA"], ["G", 1, "CHA"]],
[["I", 1, "O6"], ["I", 2, "C1"]]
],
"userCCD": ...,
"dialect": "alphafold3",
"version": 1
}
```

355
docs/installation.md Normal file

@@ -0,0 +1,355 @@
# Installation and Running Your First Prediction
You will need a machine running Linux; AlphaFold 3 does not support other
operating systems. Full installation requires up to 1 TB of disk space to keep
genetic databases (SSD storage is recommended) and an NVIDIA GPU with Compute
Capability 8.0 or greater (GPUs with more memory can predict larger protein
structures). We have verified that inputs with up to 5,120 tokens can fit on a
single NVIDIA A100 80 GB, or a single NVIDIA H100 80 GB. We have verified
numerical accuracy on both NVIDIA A100 and H100 GPUs.
Especially for long targets, the genetic search stage can consume a lot of RAM;
we recommend running with at least 64 GB of RAM.
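
A quick pre-flight check of disk space and RAM could look like the sketch below. The thresholds mirror the figures quoted above; interpreting them in decimal units, the function names, and the use of `os.sysconf` (Linux only) are assumptions. GPU capability is not checked here.

```python
import os
import shutil

MIN_FREE_DISK_BYTES = 10**12   # "Up to 1 TB" for the genetic databases.
MIN_RAM_BYTES = 64 * 10**9     # "At least 64 GB of RAM".


def meets_requirements(free_disk_bytes: int, ram_bytes: int) -> bool:
    """Compares measured resources against the recommended minimums."""
    return free_disk_bytes >= MIN_FREE_DISK_BYTES and ram_bytes >= MIN_RAM_BYTES


def probe_machine(path: str = "/") -> tuple[int, int]:
    """Returns (free disk bytes at path, total physical RAM bytes) on Linux."""
    free_disk = shutil.disk_usage(path).free
    ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return free_disk, ram
```

Calling `meets_requirements(*probe_machine(path))` with the directory intended for the databases gives a rough yes/no answer before starting the download.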
We provide installation instructions for a machine with an NVIDIA A100 80 GB GPU
and a clean Ubuntu 22.04 LTS installation, and expect that these instructions
should aid others with different setups.
The instructions provided below describe how to:
1. Provision a machine on GCP.
1. Install Docker.
1. Install NVIDIA drivers for an A100.
1. Obtain genetic databases.
1. Obtain model parameters.
1. Build the AlphaFold 3 Docker container or Singularity image.
## Provisioning a Machine
Clean Ubuntu images are available on Google Cloud, AWS, Azure, and other major
platforms.
We first provisioned a new machine in Google Cloud Platform using the following
command. We were using a Google Cloud project that was already set up.
* We recommend using `--machine-type a2-ultragpu-1g` but feel free to use
`--machine-type a2-highgpu-1g` for smaller predictions.
* If desired, replace `--zone us-central1-a` with a zone that has quota for
the machine you have selected. See
[gpu-regions-zones](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones).
```sh
gcloud compute instances create alphafold3 \
--machine-type a2-ultragpu-1g \
--zone us-central1-a \
--image-family ubuntu-2204-lts \
--image-project ubuntu-os-cloud \
--maintenance-policy TERMINATE \
--boot-disk-size 1000 \
--boot-disk-type pd-balanced
```
This provisions a bare Ubuntu 22.04 LTS image on an
[A2 Ultra](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a2-vms)
machine with 12 CPUs, 170 GB RAM, 1 TB disk and NVIDIA A100 80 GB GPU attached.
We verified the following installation steps from this point.
## Installing Docker
These instructions are for rootless Docker.
### Installing Docker on Host
Note that these instructions apply only to Ubuntu 22.04 LTS images (see above).
Add Docker's official GPG key. Official Docker instructions are
[here](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository).
The commands we ran are:
```sh
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
```
Add the repository to apt sources:
```sh
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world
```
### Enabling Rootless Docker
Official Docker instructions are
[here](https://docs.docker.com/engine/security/rootless/#distribution-specific-hint).
The commands we ran are:
```sh
sudo apt-get install -y uidmap systemd-container
sudo machinectl shell $(whoami)@ /bin/bash -c 'dockerd-rootless-setuptool.sh install && sudo loginctl enable-linger $(whoami) && DOCKER_HOST=unix:///run/user/1001/docker.sock docker context use rootless'
```
## Installing GPU Support
### Installing NVIDIA Drivers
Official Ubuntu instructions are
[here](https://documentation.ubuntu.com/server/how-to/graphics/install-nvidia-drivers/).
The commands we ran are:
```sh
sudo apt-get -y install alsa-utils ubuntu-drivers-common
sudo ubuntu-drivers install
sudo nvidia-smi --gpu-reset
nvidia-smi # Check that the drivers are installed.
```
Accept the "Pending kernel upgrade" dialog if it appears.
You will need to reboot the instance with `sudo reboot now` to reset the GPU if
you see the following warning:
```text
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
```
Proceed only if `nvidia-smi` has a sensible output.
### Installing NVIDIA Support for Docker
Official NVIDIA instructions are
[here](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
The commands we ran are:
```sh
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
systemctl --user restart docker
sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place
```
Check that your container can see the GPU:
```sh
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
```
The output should look similar to this:
```text
Mon Nov 11 12:00:00 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-80GB Off | 00000000:00:05.0 Off | 0 |
| N/A 34C P0 51W / 400W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
```
## Obtaining AlphaFold 3 Source Code
You will need to have `git` installed to download the AlphaFold 3 repository:
```sh
git clone https://github.com/google-deepmind/alphafold3.git
```
## Obtaining Genetic Databases
This step requires `curl` and `zstd` to be installed on your machine.
AlphaFold 3 needs multiple genetic (sequence) protein and RNA databases to run:
* [BFD small](https://bfd.mmseqs.com/)
* [MGnify](https://www.ebi.ac.uk/metagenomics/)
* [PDB](https://www.rcsb.org/) (structures in the mmCIF format)
* [PDB seqres](https://www.rcsb.org/)
* [UniProt](https://www.uniprot.org/uniprot/)
* [UniRef90](https://www.uniprot.org/help/uniref)
* [NT](https://www.ncbi.nlm.nih.gov/nucleotide/)
* [RFam](https://rfam.org/)
* [RNACentral](https://rnacentral.org/)
We provide a Python program `fetch_databases.py` that can be used to download
and set up all of these databases. This process takes around 45 minutes when not
installing on local SSD. We recommend running the following in a `screen` or
`tmux` session as downloading and decompressing the databases takes some time.
```sh
cd alphafold3 # Navigate to the directory with cloned AlphaFold 3 repository.
python3 fetch_databases.py --download_destination=<DATABASES_DIR>
```
This script downloads the databases from a mirror hosted on GCS, with all
versions being the same as used in the AlphaFold 3 paper.
:ledger: **Note: The download directory `<DATABASES_DIR>` should *not* be a
subdirectory in the AlphaFold 3 repository directory.** If it is, the Docker
build will be slow as the large databases will be copied during the image
creation.
:ledger: **Note: The total download size for the full databases is around 252 GB
and the total size when unzipped is 630 GB. Please make sure you have sufficient
hard drive space, bandwidth, and time to download. We recommend using an SSD for
better genetic search performance, and faster runtime of `fetch_databases.py`.**
:ledger: **Note: If the download directory and datasets don't have full read and
write permissions, it can cause errors with the MSA tools, with opaque
(external) error messages. Please ensure the required permissions are applied,
e.g. with the `sudo chmod 755 --recursive <DATABASES_DIR>` command.**
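
Before starting the download, it can help to confirm that the target disk has enough room. The following is a minimal sketch, assuming the 630 GB unpacked size quoted above is the figure to check against; whether extra space is needed while archives are being decompressed is not stated here, so treat the threshold as a lower bound. The function names are hypothetical and not part of AlphaFold 3.

```python
import shutil

# Figure quoted in the note above: total size of the unpacked databases.
UNPACKED_DATABASES_GB = 630


def free_space_gb(path: str) -> float:
    """Returns the free space, in GB (10**9 bytes), on the disk holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_databases(
    path: str, required_gb: float = UNPACKED_DATABASES_GB
) -> bool:
    """True if the disk holding `path` has at least `required_gb` GB free."""
    return free_space_gb(path) >= required_gb


# Example: inspect the current directory; replace "." with your <DATABASES_DIR>.
print(f"Free space: {free_space_gb('.'):.0f} GB")
```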
Once the script has finished, you should have the following directory structure:
```sh
pdb_2022_09_28_mmcif_files.tar # ~200k PDB mmCIF files in this tar.
bfd-first_non_consensus_sequences.fasta
mgy_clusters_2022_05.fa
nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta
pdb_seqres_2022_09_28.fasta
rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta
rnacentral_active_seq_id_90_cov_80_linclust.fasta
uniprot_all_2021_04.fa
uniref90_2022_05.fa
```
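
To confirm that the download completed, you could compare the directory contents against the listing above. This is a hedged sketch: the file names are copied from the listing, while `missing_database_files` is a hypothetical helper, not something shipped with AlphaFold 3.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_DATABASE_FILES = (
    "pdb_2022_09_28_mmcif_files.tar",
    "bfd-first_non_consensus_sequences.fasta",
    "mgy_clusters_2022_05.fa",
    "nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta",
    "pdb_seqres_2022_09_28.fasta",
    "rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta",
    "rnacentral_active_seq_id_90_cov_80_linclust.fasta",
    "uniprot_all_2021_04.fa",
    "uniref90_2022_05.fa",
)


def missing_database_files(databases_dir: str) -> list[str]:
    """Returns the expected database files that are absent from `databases_dir`."""
    root = Path(databases_dir)
    return [name for name in EXPECTED_DATABASE_FILES if not (root / name).is_file()]
```

An empty list from `missing_database_files("<DATABASES_DIR>")` means every file in the listing is present; it does not verify file integrity.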
## Obtaining Model Parameters
To request access to the AlphaFold 3 model parameters, please complete
[this form](https://forms.gle/svvpY4u2jsHEwWYS6). Access will be granted at
Google DeepMind's sole discretion. We will aim to respond to requests within 2–3
business days. You may only use AlphaFold 3 model parameters if received
directly from Google. Use is subject to these
[terms of use](https://github.com/google-deepmind/alphafold3/blob/main/WEIGHTS_TERMS_OF_USE.md).
## Building the Docker Container That Will Run AlphaFold 3
Then, build the Docker container. This builds a container with all the right
Python dependencies:
```sh
docker build -t alphafold3 -f docker/Dockerfile .
```
You can now run AlphaFold 3!
```sh
docker run -it \
--volume $HOME/af_input:/root/af_input \
--volume $HOME/af_output:/root/af_output \
--volume <MODEL_PARAMETERS_DIR>:/root/models \
--volume <DATABASES_DIR>:/root/public_databases \
--gpus all \
alphafold3 \
python run_alphafold.py \
--json_path=/root/af_input/fold_input.json \
--model_dir=/root/models \
--output_dir=/root/af_output
```
:ledger: **Note: In the example above the databases have been placed on the
persistent disk, which is slow.** If you want better genetic and template search
performance, make sure all databases are placed on a local SSD.
If you get an error like the following, make sure the model parameters and
databases are in the host locations given to the `--volume` flags above.
```
docker: Error response from daemon: error while creating mount source path '/srv/alphafold3_data/models': mkdir /srv/alphafold3_data/models: permission denied.
```
## Running Using Singularity Instead of Docker
You may prefer to run AlphaFold 3 within Singularity. You'll still need to
*build* the Singularity image from the Docker container. Afterwards, you will
not have to depend on Docker (at structure prediction time).
### Install Singularity
Official Singularity instructions are
[here](https://docs.sylabs.io/guides/3.3/user-guide/installation.html). The
commands we ran are:
```sh
wget https://github.com/sylabs/singularity/releases/download/v4.2.1/singularity-ce_4.2.1-jammy_amd64.deb
sudo dpkg --install singularity-ce_4.2.1-jammy_amd64.deb
sudo apt-get install -f
```
### Build the Singularity Container From the Docker Image
After building the *Docker* container above with `docker build -t`, start a
local Docker registry and upload your image `alphafold3` to it. Singularity's
instructions are [here](https://github.com/apptainer/singularity/issues/1537).
The commands we ran are:
```sh
docker run -d -p 5000:5000 --restart=always --name registry registry:2
docker tag alphafold3 localhost:5000/alphafold3
docker push localhost:5000/alphafold3
```
Then build the Singularity container:
```sh
SINGULARITY_NOHTTPS=1 singularity build alphafold3.simg docker://localhost:5000/alphafold3:latest
```
You can confirm your build by starting a shell and inspecting the environment.
For example, you may want to ensure the Singularity image can access your GPU.
You may want to restart your computer if you have issues with this.
```sh
singularity exec --nv alphafold3.simg sh -c 'nvidia-smi'
```
You can now run AlphaFold 3!
```sh
singularity exec --nv alphafold3.simg <<args>>
```
For example:
```sh
singularity exec \
--nv \
--bind $HOME/af_input:/root/af_input \
--bind $HOME/af_output:/root/af_output \
--bind <MODEL_PARAMETERS_DIR>:/root/models \
--bind <DATABASES_DIR>:/root/public_databases \
alphafold3.simg \
python alphafold3/run_alphafold.py \
--json_path=/root/af_input/fold_input.json \
--model_dir=/root/models \
--db_dir=/root/public_databases \
--output_dir=/root/af_output
```

16
docs/known_issues.md Normal file

@@ -0,0 +1,16 @@
# Known Issues
## Numerical Accuracy above 5,120 Tokens
AlphaFold 3 does not currently support inference on inputs larger than 5,120
tokens. An error will be raised if the input is larger than this threshold.
This is due to a numerical issue with the custom Pallas kernel implementing the
Gated Linear Unit. The numerical issue only occurs at inputs above the 5,120
tokens threshold, and results in degraded accuracy in the predicted structure.
This numerical issue is unique to the single GPU configuration used in this
repository, and does not affect the results in the
[AlphaFold 3 paper](https://www.nature.com/articles/s41586-024-07487-w).
We hope to resolve this issue soon and remove this check on input size.

187
docs/output.md Normal file

@@ -0,0 +1,187 @@
# AlphaFold 3 Output
## Output Directory Structure
For every input job, AlphaFold 3 writes all its outputs to a directory named
after the sanitized version of the job name. For example, for the job name
"My first fold (test)", AlphaFold 3 writes its outputs to a directory called
`my_first_fold_test`.
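
The exact sanitization rules are not spelled out here. As a rough sketch, the following hypothetical function reproduces the example above by assuming that names are lowercased, spaces become underscores, and any character other than letters, digits, `_`, `-` and `.` is dropped. The real implementation may differ.

```python
import string

# Assumed set of characters that survive sanitization (an illustration only).
_ALLOWED_CHARS = set(string.ascii_lowercase + string.digits + "_-.")


def sanitise_job_name(name: str) -> str:
    """Approximates the job-name sanitization under the assumptions stated above."""
    lowered = name.lower().replace(" ", "_")
    return "".join(ch for ch in lowered if ch in _ALLOWED_CHARS)


print(sanitise_job_name("My first fold (test)"))
```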
The following structure is used within the output directory:
* Sub-directories with results for each sample and seed. There will be
*num\_seeds* \* *num\_samples* such sub-directories. The naming pattern is
`seed-<seed value>_sample-<sample number>`. Each of these directories
contains a confidence JSON, summary confidence JSON, and the mmCIF with the
predicted structure.
* Top-ranking prediction mmCIF: `<job_name>_model.cif`. This file contains the
predicted coordinates and should be compatible with most structural biology
tools. We do not provide the output in the PDB format, but the CIF file can be
easily converted into one if needed.
* Top-ranking prediction confidence JSON: `<job_name>_confidences.json`.
* Top-ranking prediction summary confidence JSON:
`<job_name>_summary_confidences.json`.
* Job input JSON file with the MSA and template data added by the data
pipeline: `<job_name>_data.json`.
* Ranking scores for all predictions: `ranking_scores.csv`. The prediction
with the highest ranking score is the one included in the root directory.
* Output terms of use: `TERMS_OF_USE.md`.
Below is an example AlphaFold 3 output directory listing for a job called
"Hello Fold" that has been run with 1 seed and 5 samples:
```text
hello_fold/
├── seed-1234_sample-0/
│ ├── confidences.json
│ ├── model.cif
│ └── summary_confidences.json
├── seed-1234_sample-1/
│ ├── confidences.json
│ ├── model.cif
│ └── summary_confidences.json
├── seed-1234_sample-2/
│ ├── confidences.json
│ ├── model.cif
│ └── summary_confidences.json
├── seed-1234_sample-3/
│ ├── confidences.json
│ ├── model.cif
│ └── summary_confidences.json
├── seed-1234_sample-4/
│ ├── confidences.json
│ ├── model.cif
│ └── summary_confidences.json
├── TERMS_OF_USE.md
├── hello_fold_confidences.json
├── hello_fold_data.json
├── hello_fold_model.cif
├── hello_fold_summary_confidences.json
└── ranking_scores.csv
```
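
The `seed-<seed value>_sample-<sample number>` naming pattern makes the per-sample directories easy to enumerate. A minimal sketch, assuming seeds and sample numbers are non-negative integers (`list_samples` is a hypothetical helper):

```python
import re
from pathlib import Path

# Matches directory names such as "seed-1234_sample-0".
_SAMPLE_DIR_PATTERN = re.compile(r"^seed-(\d+)_sample-(\d+)$")


def list_samples(output_dir: str) -> list[tuple[int, int, Path]]:
    """Returns (seed, sample, directory) for each sample sub-directory, sorted."""
    samples = []
    for entry in Path(output_dir).iterdir():
        match = _SAMPLE_DIR_PATTERN.match(entry.name)
        if match and entry.is_dir():
            samples.append((int(match.group(1)), int(match.group(2)), entry))
    return sorted(samples)
```

For the listing above, `list_samples("hello_fold")` would yield five entries for seed 1234, samples 0 to 4, and skip the top-level files.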
## Confidence Metrics
Similar to AlphaFold2 and AlphaFold-Multimer, AlphaFold 3 outputs include
confidence metrics. The main metrics are:
* **pLDDT:** a per-atom confidence estimate on a 0-100 scale where a higher
value indicates higher confidence. pLDDT aims to predict a modified LDDT
score that only considers distances to polymers. For proteins this is
similar to the
[lDDT-Cα metric](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3799472/) but
with more granularity as it can vary per atom not just per residue. For
ligand atoms, the modified LDDT considers the errors only between the ligand
atom and polymers, not other ligand atoms. For DNA/RNA a wider radius of 30
Å is used for the modified LDDT instead of 15 Å.
* **PAE (predicted aligned error)**: an estimate of the error in the relative
position and orientation between two tokens in the predicted structure.
Higher values indicate higher predicted error and therefore lower
confidence. For proteins and nucleic acids, the PAE score is essentially the
same as in AlphaFold2, where the error is measured relative to frames
constructed from the protein backbone. For small molecules and
post-translational modifications, a frame is constructed for each atom from
its closest neighbors from a reference conformer.
* **pTM and ipTM scores**: the predicted template modeling (pTM) score and the
interface predicted template modeling (ipTM) score are both derived from a
measure called the template modeling (TM) score. This measures the accuracy
of the entire structure
([Zhang and Skolnick, 2004](https://doi.org/10.1002/prot.20264);
[Xu and Zhang, 2010](https://doi.org/10.1093/bioinformatics/btq066)). A pTM
score above 0.5 means the overall predicted fold for the complex might be
similar to the true structure. ipTM measures the accuracy of the predicted
relative positions of the subunits within the complex. Values higher than
0.8 represent confident high-quality predictions, while values below 0.6
suggest a failed prediction. ipTM values between 0.6 and 0.8 are a gray zone
where predictions could be correct or incorrect. The TM score is very strict
for small structures or short chains, so pTM assigns values less than 0.05
when fewer than 20 tokens are involved; for these cases PAE or pLDDT may be
more indicative of prediction quality.
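
The ipTM thresholds above can be written as a small helper. This is only an illustrative sketch of the rule of thumb stated in the text (above 0.8, below 0.6, and the gray zone in between); the function name and labels are hypothetical.

```python
def interpret_iptm(iptm: float) -> str:
    """Maps an ipTM value to the qualitative bands described in the text."""
    if iptm > 0.8:
        return "confident"
    if iptm < 0.6:
        return "likely failed"
    # Values from 0.6 to 0.8 inclusive: could be correct or incorrect.
    return "gray zone"


for value in (0.92, 0.7, 0.35):
    print(value, interpret_iptm(value))
```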
For a detailed description of these confidence metrics see the
[AlphaFold 3 paper](https://www.nature.com/articles/s41586-024-07487-w). For
protein components, the
[AlphaFold: A Practical guide](https://www.ebi.ac.uk/training/online/courses/alphafold/inputs-and-outputs/evaluating-alphafolds-predicted-structures-using-confidence-scores/)
course for structures provides additional tutorials on the confidence metrics.
If you are interested in a specific entity or interaction, then there are
confidences available in the outputs which are specific to each chain or
chain-pair, as opposed to the full complex. See below for more details on all
the confidence metrics that are returned.
## Multi-Seed and Multi-Sample Results
By default, the model samples five predictions per seed. The top-ranked
prediction across all samples and seeds is available at the top-level of the
output directory. All samples along with their associated confidences are
available in subdirectories of the output directory.
For ranking of the full complex use the `ranking_score` (higher is better). This
score uses overall structure confidences (pTM and ipTM), but also includes terms
that penalize clashes and encourage disordered regions not to have spurious
helices. Because of these extra terms, the score should only be used to rank
structures.
If you are interested in a specific entity or interaction, you may want to rank
by a metric specific to that chain or chain-pair, as opposed to the full
complex. In that case, use the per chain or per chain-pair confidence metrics
described below for ranking.
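
As a sketch of ranking by a specific interface, the following picks the sample with the highest `chain_pair_iptm` entry for a given pair of chains. It assumes that each sample directory holds a `summary_confidences.json` containing a `chain_pair_iptm` 2D array, and that you already know the array indices of the two chains of interest; the mapping from index to chain ID is not covered here. `best_sample_for_interface` is a hypothetical helper.

```python
import json
from pathlib import Path


def best_sample_for_interface(
    output_dir: str, chain_i: int, chain_j: int
) -> tuple[str, float]:
    """Returns (sample directory name, ipTM) for the best-scoring sample."""
    best = None
    pattern = "seed-*_sample-*/summary_confidences.json"
    for summary_path in sorted(Path(output_dir).glob(pattern)):
        summary = json.loads(summary_path.read_text())
        score = summary["chain_pair_iptm"][chain_i][chain_j]
        # Higher ipTM means higher confidence in this interface.
        if best is None or score > best[1]:
            best = (summary_path.parent.name, score)
    if best is None:
        raise FileNotFoundError(f"No sample summaries found in {output_dir}")
    return best
```

The same loop works for `chain_ptm` or `chain_iptm` by swapping the key and indexing with a single chain index.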
## Metrics in Confidences JSON
For each predicted sample we provide two JSON files. One contains summary
metrics for the whole structure, per chain, or per chain-pair, and the other
contains full 1D or 2D arrays.
Summary outputs:
* `ptm`: A scalar in the range 0-1 indicating the predicted TM-score for the
full structure.
* `iptm`: A scalar in the range 0-1 indicating predicted interface TM-score
(confidence in the predicted interfaces) for all interfaces in the
structure.
* `fraction_disordered`: A scalar in the range 0-1 that indicates what
fraction of the predicted structure is disordered, as measured by
accessible surface area (see our
[paper](https://www.nature.com/articles/s41586-024-07487-w) for details).
* `has_clash`: A boolean indicating if the structure has a significant number
of clashing atoms (more than 50% of a chain, or a chain with more than 100
clashing atoms).
* `ranking_score`: A scalar in the range \[-100, 1.5\] that can be used for
ranking predictions. It incorporates `ptm`, `iptm`, `fraction_disordered`
and `has_clash` into a single number with the following equation: 0.8 × ipTM
\+ 0.2 × pTM \+ 0.5 × disorder − 100 × has_clash.
* `chain_pair_pae_min`: A \[num_chains, num_chains\] array. Element (i, j) of
the array contains the lowest PAE value across rows restricted to chain i
and columns restricted to chain j. This has been found to correlate with
whether two chains interact or not, and in some cases can be used to
distinguish binders from non-binders.
* `chain_pair_iptm`: A \[num_chains, num_chains\] array. Off-diagonal element
(i, j) of the array contains the ipTM restricted to tokens from chains i and
j. Diagonal element (i, i) contains the pTM restricted to chain i. Can be
used for ranking a specific interface between two chains, when you know that
they interact, e.g. for antibody-antigen interactions.
* `chain_ptm`: A \[num_chains\] array. Element i contains the pTM restricted
to chain i. Can be used for ranking individual chains when the structure of
that chain is most of interest, rather than the cross-chain interactions it
is involved with.
* `chain_iptm`: A \[num_chains\] array that gives the average confidence
(interface pTM) in the interface between each chain and all other chains.
Can be used for ranking a specific chain, when you care about where the
chain binds to the rest of the complex and you do not know which other
chains you expect it to interact with. This is often the case with ligands.
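
The `ranking_score` combination can be illustrated with a few lines of Python. This is a sketch of the weighted sum consistent with the stated range of \[-100, 1.5\], with the clash term subtracted; it is not the project's own implementation, and it assumes all four inputs are available (for example, that `iptm` is defined for the input).

```python
def ranking_score(
    iptm: float, ptm: float, fraction_disordered: float, has_clash: bool
) -> float:
    """0.8 * ipTM + 0.2 * pTM + 0.5 * disorder - 100 * has_clash."""
    return (
        0.8 * iptm
        + 0.2 * ptm
        + 0.5 * fraction_disordered
        - 100.0 * float(has_clash)
    )


# The two extremes reproduce the documented range of the score.
print(ranking_score(1.0, 1.0, 1.0, has_clash=False))
print(ranking_score(0.0, 0.0, 0.0, has_clash=True))
```

The clash penalty dominates every other term, so any prediction flagged with `has_clash` ranks below all clash-free predictions.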
Full array outputs:
* `pae`: A \[num\_tokens, num\_tokens\] array. Element (i, j) indicates the
predicted error in the position of token j, when the prediction is aligned
to the ground truth using the frame of token i.
* `atom_plddts`: A \[num_atoms\] array. Element i indicates the predicted
local distance difference test (pLDDT) for atom i in the prediction.
* `contact_probs`: A \[num_tokens, num_tokens\] array. Element (i, j)
indicates the predicted probability that token i and token j are in contact
(within 8 Å between the representative atoms of the two tokens); see the
[paper](https://www.nature.com/articles/s41586-024-07487-w) for details.
* `token_chain_ids`: A \[num_tokens\] array indicating the chain ids
corresponding to each token in the prediction.
* `atom_chain_ids`: A \[num_atoms\] array indicating the chain ids
corresponding to each atom in the prediction.
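
The full arrays can be reduced to per-chain summaries yourself. The sketch below recomputes a chain-pair minimum PAE from `pae` and `token_chain_ids`, following the definition of `chain_pair_pae_min` given earlier (rows restricted to one chain, columns restricted to the other). It uses plain Python lists and returns a dictionary keyed by chain ID pairs, which is an illustrative choice rather than the format of the JSON output.

```python
def chain_pair_pae_min(
    pae: list[list[float]], token_chain_ids: list[str]
) -> dict[tuple[str, str], float]:
    """Lowest PAE for each (row chain, column chain) pair."""
    # Unique chain IDs in first-seen order.
    chains = list(dict.fromkeys(token_chain_ids))
    # Token indices belonging to each chain.
    tokens = {
        chain: [k for k, c in enumerate(token_chain_ids) if c == chain]
        for chain in chains
    }
    return {
        (ci, cj): min(pae[r][c] for r in tokens[ci] for c in tokens[cj])
        for ci in chains
        for cj in chains
    }


example_pae = [[0.5, 4.0, 9.0], [3.0, 0.4, 7.0], [8.0, 6.0, 0.3]]
print(chain_pair_pae_min(example_pae, ["A", "A", "B"]))
```

Note that PAE is not symmetric, so the value for (A, B) generally differs from the value for (B, A).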

153
docs/performance.md Normal file

@@ -0,0 +1,153 @@
# Performance
## Data Pipeline
The runtime of the data pipeline (i.e. genetic sequence search and template
search) can vary significantly depending on the size of the input and the number
of homologous sequences found, as well as the available hardware (disk speed can
influence genetic search speed in particular). If you would like to improve
performance, we recommend increasing the disk speed (e.g. by leveraging a
RAM-backed filesystem), or increasing the available CPU cores and adding more
parallelisation. Also note that for sequences with deep MSAs, Jackhmmer or
Nhmmer may need a substantial amount of RAM beyond the recommended 64 GB of RAM.
## Model Inference
Table 8 in the Supplementary Information of the
[AlphaFold 3 paper](https://nature.com/articles/s41586-024-07487-w) provides
compile-free inference timings for AlphaFold 3 when configured to run on 16
NVIDIA A100s, with 40 GB of memory per device. In contrast, this repository
supports running AlphaFold 3 on a single NVIDIA A100 with 80 GB of memory in a
configuration optimised to maximise throughput.
We compare compile-free inference timings of these two setups in the table below
using GPU seconds (i.e. multiplying by 16 when using 16 A100s). The setup in
this repository is more efficient (by at least 2×) across all token sizes,
indicating its suitability for high-throughput applications.
Num Tokens | 1 A100 80 GB (GPU secs) | 16 A100 40 GB (GPU secs) | Improvement
:--------- | ----------------------: | -----------------------: | ----------:
1024 | 62 | 352 | 5.7×
2048 | 275 | 1136 | 4.1×
3072 | 703 | 2016 | 2.9×
4096 | 1434 | 3648 | 2.5×
5120 | 2547 | 5552 | 2.2×
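
The Improvement column is the ratio of the two GPU-second figures, and the 16-GPU wall-clock time is the GPU-second figure divided by 16. A short sketch of that arithmetic, using the numbers from the table (the variable and function names are illustrative):

```python
NUM_GPUS_PAPER_SETUP = 16

# Num tokens -> (1x A100 80 GB GPU secs, 16x A100 40 GB GPU secs), from the table.
GPU_SECONDS = {
    1024: (62, 352),
    2048: (275, 1136),
    3072: (703, 2016),
    4096: (1434, 3648),
    5120: (2547, 5552),
}


def improvement(num_tokens: int) -> float:
    """Ratio of 16-GPU GPU-seconds to single-GPU GPU-seconds, to one decimal."""
    single_gpu, sixteen_gpus = GPU_SECONDS[num_tokens]
    return round(sixteen_gpus / single_gpu, 1)


def wall_clock_seconds_16_gpus(num_tokens: int) -> float:
    """Wall-clock time of the 16-GPU setup implied by its GPU-seconds."""
    return GPU_SECONDS[num_tokens][1] / NUM_GPUS_PAPER_SETUP


for n in GPU_SECONDS:
    print(n, improvement(n), wall_clock_seconds_16_gpus(n))
```

The 16-GPU setup still finishes each fold sooner in wall-clock terms (for example 22 seconds versus 62 seconds at 1024 tokens); the single-GPU setup wins on total GPU time consumed.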
## Running the Pipeline in Stages
The `run_alphafold.py` script can be executed in stages to optimise resource
utilisation. This can be useful for:
1. Splitting the CPU-only data pipeline from model inference (which requires a
GPU), to optimise cost and resource usage.
1. Caching the results of MSA/template search, then reusing the augmented JSON
for multiple different inferences across seeds or across variations of other
features (e.g. a ligand).
### Data Pipeline Only
Launch `run_alphafold.py` with `--norun_inference` to generate Multiple Sequence
Alignments (MSAs) and templates, without running featurisation and model
inference. This stage can be quite costly in terms of runtime, CPU, and RAM use.
The output will be JSON files augmented with MSAs and templates that can then be
directly used as input for running inference.
### Featurisation and Model Inference Only
Launch `run_alphafold.py` with `--norun_data_pipeline` to skip the data pipeline
and run only featurisation and model inference. This stage requires the input
JSON file to contain pre-computed MSAs and templates.
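
The two stages can be scripted as separate invocations. The sketch below only builds the argument lists; it assumes the flags shown elsewhere in this documentation (`--json_path`, `--model_dir`, `--output_dir`) and that the augmented JSON is written as `<job_name>_data.json` inside the job's output directory. Whether the data-pipeline stage needs additional flags such as `--db_dir` depends on your setup. The helper names are hypothetical.

```python
from pathlib import PurePosixPath


def data_pipeline_command(json_path: str, output_dir: str) -> list[str]:
    """Stage 1: MSA and template search only (CPU)."""
    return [
        "python", "run_alphafold.py",
        f"--json_path={json_path}",
        f"--output_dir={output_dir}",
        "--norun_inference",
    ]


def augmented_json_path(output_dir: str, sanitized_job_name: str) -> str:
    """Assumed location of the JSON augmented with MSAs and templates."""
    job_dir = PurePosixPath(output_dir) / sanitized_job_name
    return str(job_dir / f"{sanitized_job_name}_data.json")


def inference_command(json_path: str, model_dir: str, output_dir: str) -> list[str]:
    """Stage 2: featurisation and model inference only (GPU)."""
    return [
        "python", "run_alphafold.py",
        f"--json_path={json_path}",
        f"--model_dir={model_dir}",
        f"--output_dir={output_dir}",
        "--norun_data_pipeline",
    ]
```

Each list can be passed to `subprocess.run(..., check=True)` on the machine suited to that stage, with the output of `augmented_json_path` used as the `json_path` of the second stage.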
## Accelerator Hardware Requirements
We officially support the following configurations, and have extensively tested
them for numerical accuracy and throughput efficiency:
- 1 NVIDIA A100 (80 GB)
- 1 NVIDIA H100 (80 GB)
### Other Hardware Configurations
#### NVIDIA A100 (40 GB)
AlphaFold 3 can run on a single NVIDIA A100 (40 GB) with the following
configuration changes:
1. Enabling [unified memory](#unified-memory).
1. Adjusting `pair_transition_shard_spec` in `model_config.py`:
```py
pair_transition_shard_spec: Sequence[_Shape2DType] = (
(2048, None),
(3072, 1024),
(None, 512),
)
```
While numerically accurate, this configuration will have lower throughput
compared to the setup on the NVIDIA A100 (80 GB), due to less available memory.
#### NVIDIA V100 (16 GB)
While you can run AlphaFold 3 on sequences up to 1,280 tokens on a single NVIDIA
V100 using the flag `--flash_attention_implementation=xla` in
`run_alphafold.py`, this configuration has not been tested for numerical
accuracy or throughput efficiency, so please proceed with caution.
## Additional Flags
### Compilation Time Workaround with XLA Flags
To work around a known XLA issue causing the compilation time to greatly
increase, the following environment variable must be set (it is set by default
in the provided `Dockerfile`).
```sh
ENV XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
```
### GPU Memory
The following environment variables (set by default in the `Dockerfile`) enable
folding a single input of size up to 5,120 tokens on a single A100 with 80 GB of
memory:
```sh
ENV XLA_PYTHON_CLIENT_PREALLOCATE=true
ENV XLA_CLIENT_MEM_FRACTION=0.95
```
#### Unified Memory
If you would like to run AlphaFold 3 on a GPU with less memory (an A100 with 40
GB of memory, for instance), we recommend enabling unified memory. Enabling
unified memory allows the program to spill GPU memory to host memory if there
isn't enough space. This prevents an OOM, at the cost of making the program
slower by accessing host memory instead of device memory. To learn more, check
out the
[NVIDIA blog post](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/).
You can enable unified memory by setting the following environment variables in
your `Dockerfile`:
```sh
ENV XLA_PYTHON_CLIENT_PREALLOCATE=false
ENV TF_FORCE_UNIFIED_MEMORY=true
ENV XLA_CLIENT_MEM_FRACTION=3.2
```
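
If you run outside Docker, the same variables can be set from Python, provided this happens before JAX is imported. The sketch below assumes that a memory fraction above 1 lets the process address more than the physical device memory once unified memory is on, so that 3.2 on a 40 GB card corresponds to roughly 128 GB of combined device and host memory. The helper names are hypothetical.

```python
import os
from collections.abc import MutableMapping

# Values copied from the Dockerfile snippet above.
UNIFIED_MEMORY_ENV = {
    "XLA_PYTHON_CLIENT_PREALLOCATE": "false",
    "TF_FORCE_UNIFIED_MEMORY": "true",
    "XLA_CLIENT_MEM_FRACTION": "3.2",
}


def apply_unified_memory_env(environ: MutableMapping[str, str]) -> None:
    """Writes the unified-memory settings into `environ` (e.g. os.environ)."""
    environ.update(UNIFIED_MEMORY_ENV)


def addressable_memory_gb(device_memory_gb: float, mem_fraction: float) -> float:
    """Memory the process may use, under the assumption stated above."""
    return device_memory_gb * mem_fraction


# Call apply_unified_memory_env(os.environ) before importing jax.
print(addressable_memory_gb(40, 3.2))
```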
### JAX Persistent Compilation Cache
You may also want to make use of the JAX persistent compilation cache, to avoid
unnecessary recompilation of the model between runs. You can enable the
compilation cache with the `--jax_compilation_cache_dir <YOUR_DIRECTORY>` flag
in `run_alphafold.py`.
More detailed instructions are available in the
[JAX documentation](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html#persistent-compilation-cache),
and more specifically the instructions for use on
[Google Cloud](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html#persistent-compilation-cache).
In particular, note that if you would like to make use of a non-local
filesystem, such as Google Cloud Storage, you will need to install
[`etils`](https://github.com/google/etils) (this is not included by default in
the AlphaFold 3 Docker container).