first add
This commit is contained in:
31
docs/contributing.md
Normal file
@@ -0,0 +1,31 @@
|
||||
# How to Contribute
|
||||
|
||||
We welcome small patches related to bug fixes and documentation, but we do not
|
||||
plan to make any major changes to this repository.
|
||||
|
||||
## Before You Begin
|
||||
|
||||
### Sign Our Contributor License Agreement
|
||||
|
||||
Contributions to this project must be accompanied by a
|
||||
[Contributor License Agreement](https://cla.developers.google.com/about) (CLA).
|
||||
You (or your employer) retain the copyright to your contribution; this simply
|
||||
gives us permission to use and redistribute your contributions as part of the
|
||||
project.
|
||||
|
||||
If you or your current employer have already signed the Google CLA (even if it
|
||||
was for a different project), you probably don't need to do it again.
|
||||
|
||||
Visit <https://cla.developers.google.com/> to see your current agreements or to
|
||||
sign a new one.
|
||||
|
||||
### Review Our Community Guidelines
|
||||
|
||||
This project follows
|
||||
[Google's Open Source Community Guidelines](https://opensource.google/conduct/).
|
||||
|
||||
## Contribution Process
|
||||
|
||||
We won't accept pull requests directly, but if you send one, we will review it.
|
||||
If we send a fix based on your pull request, we will make sure to credit you in
|
||||
the release notes.
|
||||
BIN
docs/header.jpg
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 380 KiB |
723
docs/input.md
Normal file
@@ -0,0 +1,723 @@
|
||||
# AlphaFold 3 Input
|
||||
|
||||
## Specifying Input Files
|
||||
|
||||
You can provide inputs to `run_alphafold.py` in one of two ways:
|
||||
|
||||
- Single input file: Use the `--json_path` flag followed by the path to a
|
||||
single JSON file.
|
||||
- Multiple input files: Use the `--input_dir` flag followed by the path to a
|
||||
directory of JSON files.
|
||||
|
||||
## Input Format
|
||||
|
||||
AlphaFold 3 uses a custom JSON input format differing from the
|
||||
[AlphaFold Server JSON input format](https://github.com/google-deepmind/alphafold/tree/main/server).
|
||||
See [below](#alphafold-server-json-compatibility) for more information.
|
||||
|
||||
The custom AlphaFold 3 format allows:
|
||||
|
||||
* Specifying protein, RNA, and DNA chains, including modified residues.
|
||||
* Specifying custom multiple sequence alignment (MSA) for protein and RNA
|
||||
chains.
|
||||
* Specifying custom structural templates for protein chains.
|
||||
* Specifying ligands using
|
||||
[Chemical Component Dictionary (CCD)](https://www.wwpdb.org/data/ccd) codes.
|
||||
* Specifying ligands using SMILES.
|
||||
* Specifying ligands by defining them using the CCD mmCIF format and supplying
|
||||
them via the [user-provided CCD](#user-provided-ccd).
|
||||
* Specifying covalent bonds between entities.
|
||||
* Specifying multiple random seeds.
|
||||
|
||||
## AlphaFold Server JSON Compatibility
|
||||
|
||||
The [AlphaFold Server](https://alphafoldserver.com/) uses a separate
|
||||
[JSON format](https://github.com/google-deepmind/alphafold/tree/main/server)
|
||||
from the one used here in the AlphaFold 3 codebase. In particular, the JSON
|
||||
format used in the AlphaFold 3 codebase offers more flexibility and control in
|
||||
defining custom ligands, branched glycans, and covalent bonds between entities.
|
||||
|
||||
We provide a converter in `run_alphafold.py` which automatically detects the
|
||||
input JSON format, denoted `dialect` in the converter code. The converter
|
||||
denotes the AlphaFoldServer JSON as `alphafoldserver`, and the JSON format
|
||||
defined here in the AlphaFold 3 codebase as `alphafold3`. If the detected input
|
||||
JSON format is `alphafoldserver`, then the converter will translate that into
|
||||
the JSON format `alphafold3`.
|
||||
|
||||
### Multiple Inputs
|
||||
|
||||
The top-level of the `alphafoldserver` JSON format is a list, allowing
|
||||
specification of multiple inputs in a single JSON. In contrast, the `alphafold3`
|
||||
JSON format requires exactly one input per JSON file. Specifying multiple inputs
|
||||
in a single `alphafoldserver` JSON is fully supported.
|
||||
|
||||
Note that the converter distinguishes between `alphafoldserver` and `alphafold3`
|
||||
JSON formats by checking if the top-level of the JSON is a list or not. In
|
||||
particular, if you pass in an `alphafoldserver`-style JSON without a top-level
|
||||
list, then this is considered incorrect and `run_alphafold.py` will raise an
|
||||
error.
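
The detection rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the converter's actual code: the function name `detect_dialect` and the error message are assumptions.

```python
import json


def detect_dialect(json_text: str) -> str:
    """Guess the input dialect from the top-level JSON type.

    Hypothetical helper that mirrors the rule described in the text; it is
    not the real converter in run_alphafold.py.
    """
    data = json.loads(json_text)
    if isinstance(data, list):
        # A top-level list means one or more AlphaFold Server jobs.
        return "alphafoldserver"
    if isinstance(data, dict) and data.get("dialect") == "alphafold3":
        # A single top-level object that declares the alphafold3 dialect.
        return "alphafold3"
    raise ValueError("expected a top-level list or an alphafold3 object")
```

Under this sketch, an `alphafoldserver`-style job that is not wrapped in a list falls through to the error branch, which matches the behaviour described above.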
|
||||
|
||||
### Glycans
|
||||
|
||||
If the JSON in `alphafoldserver` format specifies glycans, the converter will
|
||||
raise an error. This is because translating glycans specified in the
|
||||
`alphafoldserver` format to the `alphafold3` format is not currently supported.
|
||||
|
||||
### Random Seeds
|
||||
|
||||
The `alphafoldserver` JSON format allows users to specify `"modelSeeds": []`, in
|
||||
which case a seed is chosen randomly for the user. On the other hand, the
|
||||
`alphafold3` format requires users to specify a seed.
|
||||
|
||||
The converter will choose a seed randomly if `"modelSeeds": []` is set when
|
||||
translating from `alphafoldserver` JSON format to `alphafold3` JSON format. If
|
||||
seeds are specified in the `alphafoldserver` JSON format, then those will be
|
||||
preserved in the translation to the `alphafold3` JSON format.
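
As a rough sketch of this seed handling (the function name `resolve_model_seeds` and the seed range are assumptions, not taken from the converter):

```python
import random


def resolve_model_seeds(model_seeds: list[int]) -> list[int]:
    """Return the given seeds, or one random seed if the list is empty."""
    if model_seeds:
        # Seeds supplied by the user are preserved unchanged.
        return list(model_seeds)
    # Assumed range: any non-negative 32-bit integer.
    return [random.randint(0, 2**32 - 1)]
```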
|
||||
|
||||
### Ions
|
||||
|
||||
While AlphaFold Server treats ions and ligands as different entity types in the
|
||||
JSON format, AlphaFold 3 treats ions as ligands. Therefore, to specify e.g. a
|
||||
magnesium ion, one would specify it as an entity of type `ligand` with
|
||||
`ccdCodes: ["MG"]`.
|
||||
|
||||
### Sequence IDs
|
||||
|
||||
The `alphafold3` JSON format requires the user to specify a unique identifier
|
||||
(`id`) for each entity. On the other hand, the `alphafoldserver` does not allow
|
||||
specification of an `id` for each entity. Thus, the converter automatically
|
||||
assigns one.
|
||||
|
||||
The converter iterates through the list provided in the `sequences` field of the
|
||||
`alphafoldserver` JSON format, assigning an `id` to each entity using the
|
||||
following order ("reverse spreadsheet style"):
|
||||
|
||||
```
|
||||
A, B, ..., Z, AA, BA, CA, ..., ZA, AB, BB, CB, ..., ZB, ...
|
||||
```
|
||||
|
||||
For any entity with `count > 1`, an `id` is assigned arbitrarily to each "copy"
|
||||
of the entity.
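
The "reverse spreadsheet style" ordering can be reproduced with a bijective base-26 conversion whose least-significant letter is written first. The helper below is an illustrative sketch (the name `reverse_spreadsheet_id` is an assumption), not the converter's own implementation.

```python
import string


def reverse_spreadsheet_id(index: int) -> str:
    """Map a 0-based index to an ID: 0 -> A, 25 -> Z, 26 -> AA, 27 -> BA."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = []
    remaining = index + 1  # Bijective base-26 has no zero digit.
    while remaining > 0:
        remaining, digit = divmod(remaining - 1, 26)
        letters.append(string.ascii_uppercase[digit])
    # The least-significant letter comes first, hence "reverse" style.
    return "".join(letters)


first_ids = [reverse_spreadsheet_id(i) for i in range(28)]
```

Here `first_ids` starts with `A` and ends with `Z`, `AA`, `BA`, matching the order listed above.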
|
||||
|
||||
## Top-level Structure
|
||||
|
||||
The top-level structure of the input JSON is:
|
||||
|
||||
```json
{
  "name": "Job name goes here",
  "modelSeeds": [1, 2],  # At least one seed required.
  "sequences": [
    {"protein": {...}},
    {"rna": {...}},
    {"dna": {...}},
    {"ligand": {...}}
  ],
  "bondedAtomPairs": [...],  # Optional
  "userCCD": "...",  # Optional
  "dialect": "alphafold3",  # Required
  "version": 1  # Required
}
```
|
||||
|
||||
The fields specify the following:
|
||||
|
||||
* `name: str`: The name of the job. A sanitised version of this name is used
|
||||
for naming the output files.
|
||||
* `modelSeeds: list[int]`: A list of integer random seeds. The pipeline and
|
||||
the model will be invoked with each of the seeds in the list. I.e. if you
|
||||
provide *n* random seeds, you will get *n* predicted structures, each with
|
||||
the respective random seed. You must provide at least one random seed.
|
||||
* `sequences: list[Protein | RNA | DNA | Ligand]`: A list of sequence
|
||||
dictionaries, each defining a molecular entity, see below.
|
||||
* `bondedAtomPairs: list[Bond]`: An optional list of covalently bonded atoms.
|
||||
These can link atoms within an entity, or across two entities. See more
|
||||
below.
|
||||
* `userCCD: str`: An optional string with user-provided chemical components
|
||||
dictionary. This is an expert mode for providing custom molecules when
|
||||
SMILES is not sufficient. This should also be used when you have a custom
|
||||
molecule that needs to be bonded with other entities - SMILES can't be used
|
||||
in such cases since it doesn't give the possibility of uniquely naming all
|
||||
atoms. It can also be used to provide a reference conformer for cases where
|
||||
RDKit fails to generate a conformer. See more below.
|
||||
* `dialect: str`: The dialect of the input JSON. This must be set to
|
||||
`alphafold3`. See
|
||||
[AlphaFold Server JSON Compatibility](#alphafold-server-json-compatibility)
|
||||
for more information.
|
||||
* `version: int`: The version of the input JSON. This must be set to 1. See
|
||||
[AlphaFold Server JSON Compatibility](#alphafold-server-json-compatibility)
|
||||
for more information.
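
Before running a job, it can help to check the required top-level fields. The following is a minimal sketch under the rules listed above; `check_top_level` and `REQUIRED_KEYS` are hypothetical names, and the real pipeline performs its own (more thorough) validation.

```python
import json

REQUIRED_KEYS = ("name", "modelSeeds", "sequences", "dialect", "version")


def check_top_level(job: dict) -> list[str]:
    """Return a list of problems found in the top-level fields (empty if none)."""
    problems = [
        f"missing required field: {key}" for key in REQUIRED_KEYS if key not in job
    ]
    if not job.get("modelSeeds"):
        problems.append("modelSeeds must contain at least one integer seed")
    if job.get("dialect") != "alphafold3":
        problems.append("dialect must be 'alphafold3'")
    if job.get("version") != 1:
        problems.append("version must be 1")
    return problems


job = {
    "name": "Job name goes here",
    "modelSeeds": [1, 2],
    "sequences": [{"protein": {"id": "A", "sequence": "PVLSCGEWQL"}}],
    "dialect": "alphafold3",
    "version": 1,
}
problems = check_top_level(job)
job_json = json.dumps(job, indent=2)  # Text that could be written to a file.
```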
|
||||
|
||||
## Sequences
|
||||
|
||||
The `sequences` section specifies the protein chains, RNA chains, DNA chains,
|
||||
and ligands. Every entity in `sequences` must have a unique ID. IDs don't have
|
||||
to be sorted alphabetically.
|
||||
|
||||
### Protein
|
||||
|
||||
Specifies a single protein chain.
|
||||
|
||||
```json
{
  "protein": {
    "id": "A",
    "sequence": "PVLSCGEWQL",
    "modifications": [
      {"ptmType": "HY3", "ptmPosition": 1},
      {"ptmType": "P1L", "ptmPosition": 5}
    ],
    "unpairedMsa": ...,
    "pairedMsa": ...,
    "templates": [...]
  }
}
```
|
||||
|
||||
The fields specify the following:
|
||||
|
||||
* `id: str | list[str]`: An uppercase letter or multiple letters specifying
|
||||
the unique IDs for each copy of this protein chain. The IDs are then also
|
||||
used in the output mmCIF file. Specifying a list of IDs (e.g. `["A", "B",
|
||||
"C"]`) implies a homomeric chain with multiple copies.
|
||||
* `sequence: str`: The amino-acid sequence, specified as a string that uses
|
||||
the 1-letter standard amino acid codes.
|
||||
* `modifications: list[ProteinModification]`: An optional list of
|
||||
post-translational modifications. Each modification is specified using its
|
||||
CCD code and 1-based residue position. In the example above, we see that the
|
||||
first residue won't be a proline (`P`) but instead `HY3`.
|
||||
* `unpairedMsa: str`: An optional multiple sequence alignment for this chain.
|
||||
This is specified using the A3M format (equivalent to the FASTA format, but
|
||||
also allows gaps denoted by the hyphen `-` character). See more details
|
||||
below.
|
||||
* `pairedMsa: str`: We recommend *not* using this optional field and using the
|
||||
`unpairedMsa` for the purposes of pairing. See more details below.
|
||||
* `templates: list[Template]`: An optional list of structural templates. See
|
||||
more details below.
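
Because `ptmPosition` is 1-based while most programming languages index strings from 0, off-by-one mistakes are easy to make. The sketch below (with the hypothetical helper `describe_modifications`) shows which residue of the example sequence each modification replaces.

```python
def describe_modifications(
    sequence: str, modifications: list[dict]
) -> list[tuple[int, str, str]]:
    """List (1-based position, original residue, CCD code) per modification."""
    described = []
    for mod in modifications:
        position = mod["ptmPosition"]
        if not 1 <= position <= len(sequence):
            raise ValueError(f"ptmPosition {position} is outside the sequence")
        # Positions are 1-based; Python strings are 0-based.
        described.append((position, sequence[position - 1], mod["ptmType"]))
    return described


replaced = describe_modifications(
    "PVLSCGEWQL",
    [
        {"ptmType": "HY3", "ptmPosition": 1},
        {"ptmType": "P1L", "ptmPosition": 5},
    ],
)
```

For the example protein, position 1 is the proline (`P`) and position 5 is the cysteine (`C`).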
|
||||
|
||||
### RNA
|
||||
|
||||
Specifies a single RNA chain.
|
||||
|
||||
```json
{
  "rna": {
    "id": "A",
    "sequence": "AGCU",
    "modifications": [
      {"modificationType": "2MG", "basePosition": 1},
      {"modificationType": "5MC", "basePosition": 4}
    ],
    "unpairedMsa": ...
  }
}
```
|
||||
|
||||
The fields specify the following:
|
||||
|
||||
* `id: str | list[str]`: An uppercase letter or multiple letters specifying
|
||||
the unique IDs for each copy of this RNA chain. The IDs are then also used
|
||||
in the output mmCIF file. Specifying a list of IDs (e.g. `["A", "B", "C"]`)
|
||||
implies a homomeric chain with multiple copies.
|
||||
* `sequence: str`: The RNA sequence, specified as a string using only the
|
||||
letters `A`, `C`, `G`, `U`.
|
||||
* `modifications: list[RnaModification]`: An optional list of modifications.
|
||||
Each modification is specified using its CCD code and 1-based base position.
|
||||
* `unpairedMsa: str`: An optional multiple sequence alignment for this chain.
|
||||
This is specified using the A3M format. See more details below.
|
||||
|
||||
### DNA
|
||||
|
||||
Specifies a single DNA chain.
|
||||
|
||||
```json
{
  "dna": {
    "id": "A",
    "sequence": "GACCTCT",
    "modifications": [
      {"modificationType": "6OG", "basePosition": 1},
      {"modificationType": "6MA", "basePosition": 2}
    ]
  }
}
```
|
||||
|
||||
The fields specify the following:
|
||||
|
||||
* `id: str | list[str]`: An uppercase letter or multiple letters specifying
|
||||
the unique IDs for each copy of this DNA chain. The IDs are then also used
|
||||
in the output mmCIF file. Specifying a list of IDs (e.g. `["A", "B", "C"]`)
|
||||
implies a homomeric chain with multiple copies.
|
||||
* `sequence: str`: The DNA sequence, specified as a string using only the
|
||||
letters `A`, `C`, `G`, `T`.
|
||||
* `modifications: list[DnaModification]`: An optional list of modifications.
|
||||
Each modification is specified using its CCD code and 1-based base position.
|
||||
|
||||
### Ligands
|
||||
|
||||
Specifies a single ligand. Ligands can be specified using 3 different formats:
|
||||
|
||||
1. [CCD code(s)](https://www.wwpdb.org/data/ccd). This is the easiest way to
|
||||
specify ligands. Supports specifying covalent bonds to other entities. CCD
|
||||
from 2022-09-28 is used. If multiple CCD codes are specified, you may want
|
||||
to specify a bond between these and/or a bond to some other entity. See the
|
||||
[bonds](#bonds) section below.
|
||||
2. [SMILES string](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System).
|
||||
This enables specifying ligands that are not in CCD. If using SMILES, you
|
||||
cannot specify covalent bonds to other entities as these rely on specific
|
||||
atom names - see the next option for what to use for this case.
|
||||
3. User-provided CCD + custom ligand codes. This enables specifying ligands not
|
||||
in CCD, while also supporting specification of covalent bonds to other
|
||||
entities and backup reference coordinates for when RDKit fails to generate a
|
||||
conformer. This offers the most flexibility, but also requires careful
|
||||
attention to get all of the details right.
|
||||
|
||||
```json
{
  "ligand": {
    "id": ["G", "H", "I"],
    "ccdCodes": ["ATP"]
  }
},
{
  "ligand": {
    "id": "J",
    "ccdCodes": ["LIG-1337"]
  }
},
{
  "ligand": {
    "id": "K",
    "smiles": "CC(=O)OC1C[NH+]2CCC1CC2"
  }
}
```
|
||||
|
||||
The fields specify the following:
|
||||
|
||||
* `id: str | list[str]`: An uppercase letter (or multiple letters) specifying
|
||||
the unique ID of this ligand. This ID is then also used in the output mmCIF
|
||||
file. Specifying a list of IDs (e.g. `["A", "B", "C"]`) implies a ligand
|
||||
that has multiple copies.
|
||||
* `ccdCodes: list[str]`: An optional list of CCD codes. These could be either
|
||||
standard CCD codes, or custom codes pointing to the
|
||||
[user-provided CCD](#user-provided-ccd).
|
||||
* `smiles: str`: An optional string defining the ligand using a SMILES string.
|
||||
|
||||
Each ligand may be specified using CCD codes or SMILES but not both, i.e. for a
|
||||
given ligand, the `ccdCodes` and `smiles` fields are mutually exclusive.
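
The mutual-exclusivity rule can be checked with a few lines of Python. This sketch assumes that exactly one of the two fields must be present; `check_ligand` and `ligand_ids` are hypothetical helper names, not part of the AlphaFold 3 code.

```python
def check_ligand(ligand: dict) -> None:
    """Raise ValueError unless exactly one of ccdCodes / smiles is given."""
    has_ccd = "ccdCodes" in ligand
    has_smiles = "smiles" in ligand
    if has_ccd == has_smiles:
        raise ValueError("specify exactly one of 'ccdCodes' or 'smiles'")


def ligand_ids(ligand: dict) -> list[str]:
    """Normalise the id field (a string or a list of strings) to a list."""
    ids = ligand["id"]
    return [ids] if isinstance(ids, str) else list(ids)


atp = {"id": ["G", "H", "I"], "ccdCodes": ["ATP"]}
check_ligand(atp)  # Passes silently: only ccdCodes is set.
copies = ligand_ids(atp)  # One ID per copy of the ligand.
```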
|
||||
|
||||
### Ions
|
||||
|
||||
Ions are treated as ligands, e.g. a magnesium ion would simply be a ligand with
|
||||
`ccdCodes: ["MG"]`.
|
||||
|
||||
## Multiple Sequence Alignment
|
||||
|
||||
Protein and RNA chains allow setting a custom Multiple Sequence Alignment (MSA).
|
||||
If not set, the data pipeline will automatically build MSAs for protein and RNA
|
||||
entities using Jackhmmer/Nhmmer search over genetic databases as described in
|
||||
the paper.
|
||||
|
||||
There are 3 modes for MSA:
|
||||
|
||||
1. If the `unpairedMsa` field is unset, AlphaFold 3 will build the MSA
|
||||
automatically. This is the recommended option.
|
||||
2. If the `unpairedMsa` field is set to an empty string (`""`), AlphaFold 3
|
||||
will not build the MSA and the MSA input to the model will be empty.
|
||||
3. If the `unpairedMsa` field is set to a custom A3M string, AlphaFold 3 will
|
||||
use the provided MSA instead of building one as part of the data pipeline.
|
||||
This is considered an expert option.
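
The difference between an unset field and an empty string is easy to overlook, so here is a small sketch that classifies a chain into one of the three modes. The function name and the returned labels are assumptions, and the sketch additionally assumes that a JSON `null` counts as unset.

```python
def msa_mode(chain: dict) -> str:
    """Classify a protein or RNA chain dictionary into one of the MSA modes."""
    unpaired = chain.get("unpairedMsa")
    if unpaired is None:
        # Key absent (or null, by assumption): the pipeline builds the MSA.
        return "auto"
    if unpaired == "":
        # Explicit empty string: no MSA is built and the model sees none.
        return "empty"
    # Any other string is treated as a user-provided A3M.
    return "custom"
```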
|
||||
|
||||
Note that if you set the `unpairedMsa` field for a particular protein entity,
|
||||
you will also have to explicitly set the `pairedMsa` field (typically to empty
|
||||
string) and templates (either to a list of templates, or an empty list to run
|
||||
template-free). For example this will run the protein chain A with the given
|
||||
MSA, but without any templates:
|
||||
|
||||
```json
{
  "protein": {
    "id": "A",
    "sequence": ...,
    "unpairedMsa": "The A3M you want to run with",
    "pairedMsa": "",
    "templates": []
  }
}
```
|
||||
|
||||
When setting your own MSA, you have to make sure that:
|
||||
|
||||
1. The MSA is a valid A3M file. This means adhering to the FASTA format while
|
||||
also allowing lowercase characters denoting inserted residues and hyphens
|
||||
(`-`) denoting gaps in sequences.
|
||||
2. The first sequence is exactly equal to the query sequence.
|
||||
3. If all insertions are removed from MSA hits (i.e. all lowercase letters are
|
||||
removed), all sequences have exactly the same length as the query (they form
|
||||
an exact rectangular matrix).
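
The three requirements above can be checked before submitting a job. The following is a minimal sketch, not the parser used by AlphaFold 3; `parse_a3m` and `check_msa` are hypothetical helpers.

```python
def parse_a3m(a3m: str) -> list[tuple[str, str]]:
    """Parse A3M text into (description, sequence) pairs."""
    entries = []
    description = None
    chunks: list[str] = []
    for line in a3m.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if description is not None:
                entries.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks.append(line)
    if description is not None:
        entries.append((description, "".join(chunks)))
    return entries


def check_msa(a3m: str, query: str) -> None:
    """Raise ValueError if the MSA breaks one of the rules listed above."""
    entries = parse_a3m(a3m)
    if not entries:
        raise ValueError("the MSA contains no sequences")
    if entries[0][1] != query:
        raise ValueError("the first sequence must equal the query")
    for description, sequence in entries:
        # Lowercase letters are insertions and do not occupy a column.
        aligned = "".join(c for c in sequence if not c.islower())
        if len(aligned) != len(query):
            raise ValueError(f"wrong aligned length for {description!r}")
```

For example, `check_msa(">query\nDEEP\n>hit\nD-aEP\n", "DEEP")` passes, because removing the inserted `a` leaves four aligned columns.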
|
||||
|
||||
### MSA Pairing
|
||||
|
||||
MSA pairing matters only when folding multiple chains (multimers), since we need
|
||||
to find a way to concatenate MSAs for the individual chains along the sequence
|
||||
dimension. If done naively, by simply concatenating the individual MSA matrices
|
||||
along the sequence dimension and padding so that all MSAs have the same depth,
|
||||
one can end up with rows in the concatenated MSA that are formed by sequences
|
||||
from different organisms.
|
||||
|
||||
It may be desirable to ensure that across multiple chains, sequences in the MSA
|
||||
that are from the same organism end up in the same MSA row. AlphaFold 3
|
||||
internally achieves this by looking for the UniProt organism ID in the
|
||||
`pairedMsa` and pairing sequences based on this information.
|
||||
|
||||
We recommend users do the pairing manually or use the output of appropriate
software and then provide the MSA using only the `unpairedMsa` field. This
|
||||
method gives exact control over the placement of each sequence in the MSA, as
|
||||
opposed to relying on name-matching post-processing heuristics used for
|
||||
`pairedMsa`.
|
||||
|
||||
When setting `unpairedMsa` manually, the `pairedMsa` must be explicitly set to an
empty string (`""`).
|
||||
|
||||
For instance, if there are two chains `DEEP` and `MIND` which we want to be
|
||||
paired on organisms A and C, we can achieve it as follows:
|
||||
|
||||
```text
|
||||
> query
|
||||
DEEP
|
||||
> match 1 (organism A)
|
||||
D--P
|
||||
> match 2 (organism B)
|
||||
DD-P
|
||||
> match 3 (organism C)
|
||||
DD-P
|
||||
```
|
||||
|
||||
```text
|
||||
> query
|
||||
MIND
|
||||
> match 1 (organism A)
|
||||
M--D
|
||||
> Empty hit to make sure pairing is achieved
|
||||
----
|
||||
> match 2 (organism C)
|
||||
MIN-
|
||||
```
|
||||
|
||||
The resulting MSA when chains are concatenated will then be:
|
||||
|
||||
```text
|
||||
> query
|
||||
DEEPMIND
|
||||
> match 1 + match 1
|
||||
D--PM--D
|
||||
> match 2 + padding
|
||||
DD-P----
|
||||
> match 3 + match 2
|
||||
DD-PMIN-
|
||||
```
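
The row-by-row concatenation shown above can be reproduced with a short Python sketch. It only illustrates the idea of aligning rows and padding with gaps; it is not AlphaFold 3's internal code, and `concatenate_rows` is a hypothetical name.

```python
def concatenate_rows(msa_a: list[str], msa_b: list[str]) -> list[str]:
    """Join two per-chain MSAs row by row, padding the shallower one with gaps."""
    depth = max(len(msa_a), len(msa_b))
    width_a, width_b = len(msa_a[0]), len(msa_b[0])
    padded_a = msa_a + ["-" * width_a] * (depth - len(msa_a))
    padded_b = msa_b + ["-" * width_b] * (depth - len(msa_b))
    return [row_a + row_b for row_a, row_b in zip(padded_a, padded_b)]


# Aligned rows of the two example MSAs, query first.
deep = ["DEEP", "D--P", "DD-P", "DD-P"]
mind = ["MIND", "M--D", "----", "MIN-"]
combined = concatenate_rows(deep, mind)
```

The explicit `----` row in the second MSA is what keeps the organism C hits on the same row.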
|
||||
|
||||
## Structural Templates
|
||||
|
||||
Structural templates can be specified only for protein chains:
|
||||
|
||||
```json
"templates": [
  {
    "mmcif": ...,
    "queryIndices": [0, 1, 2, 4, 5, 6],
    "templateIndices": [0, 1, 2, 3, 4, 8]
  }
]
```
|
||||
|
||||
A template is specified as an mmCIF string containing a single chain with the
|
||||
structural template together with a 0-based mapping that maps query residue
|
||||
indices to the template residue indices. The mapping is specified using two
|
||||
lists of the same length. E.g. to express a mapping `{0: 0, 1: 2, 2: 5, 3: 6}`,
|
||||
you would specify the two indices lists as:
|
||||
|
||||
```json
|
||||
"queryIndices": [0, 1, 2, 3],
|
||||
"templateIndices": [0, 2, 5, 6]
|
||||
```
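
If you hold the mapping as a dictionary, the two lists can be derived mechanically. A small sketch (the function name is an assumption):

```python
def mapping_to_index_lists(mapping: dict[int, int]) -> tuple[list[int], list[int]]:
    """Split a query-to-template mapping into the two parallel index lists."""
    query_indices = sorted(mapping)
    template_indices = [mapping[query] for query in query_indices]
    return query_indices, template_indices


query_indices, template_indices = mapping_to_index_lists({0: 0, 1: 2, 2: 5, 3: 6})
```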
|
||||
|
||||
You can provide multiple structural templates. Note that if an mmCIF containing
|
||||
more than one chain is provided, you will get an error since it is not possible
|
||||
to determine which of the chains should be used as the template.
|
||||
|
||||
## Bonds
|
||||
|
||||
To manually specify covalent bonds, use the `bondedAtomPairs` field. This is
|
||||
intended for modelling covalent ligands, and for defining multi-CCD ligands
|
||||
(e.g. glycans). Defining covalent bonds between or within polymer entities is
|
||||
not currently supported.
|
||||
|
||||
Bonds are specified as pairs of (source atom, destination atom), with each atom
|
||||
being uniquely addressed using 3 fields:
|
||||
|
||||
* **Entity ID** (`str`): this corresponds to the `id` field for that entity.
|
||||
* **Residue ID** (`int`): this is 1-based residue index *within* the chain.
|
||||
For single-residue ligands, this is simply set to 1.
|
||||
* **Atom name** (`str`): this is the unique atom name *within* the given
|
||||
residue. The atom name for protein/RNA/DNA residues or CCD ligands can be
|
||||
looked up in the CCD for the given chemical component. This also explains
|
||||
why SMILES ligands don't support bonds: there is no atom name that could be
|
||||
used to define the bond. This shortcoming can be addressed by using the
|
||||
user-provided CCD format (see below).
|
||||
|
||||
The example below shows two bonds:
|
||||
|
||||
```json
"bondedAtomPairs": [
  [["A", 145, "SG"], ["L", 1, "C04"]],
  [["J", 1, "O6"], ["J", 2, "C1"]]
]
```
|
||||
|
||||
The first bond is between chain A, residue 145, atom SG and chain L, residue 1,
|
||||
atom C04. This is a typical example for a covalent ligand. The second bond is
|
||||
between chain J, residue 1, atom O6 and chain J, residue 2, atom C1. This bond
|
||||
is within the same entity and is a typical example when defining a glycan.
|
||||
|
||||
All bonds are implicitly assumed to be covalent bonds. Other bond types are not
|
||||
supported.
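
The shape of a bond entry can be checked with a short sketch like the one below. It validates only the structure (two atoms, each addressed by entity ID, 1-based residue ID, and atom name); it does not check that the atoms exist. `check_bond` is a hypothetical helper.

```python
def check_bond(bond: list) -> None:
    """Raise ValueError unless the bond is a pair of [entity, residue, atom]."""
    if len(bond) != 2:
        raise ValueError("a bond needs exactly two atoms")
    for atom in bond:
        if len(atom) != 3:
            raise ValueError("an atom is [entity ID, residue ID, atom name]")
        entity_id, residue_id, atom_name = atom
        if not isinstance(entity_id, str) or not isinstance(atom_name, str):
            raise ValueError("entity ID and atom name must be strings")
        if not isinstance(residue_id, int) or residue_id < 1:
            raise ValueError("residue ID must be a 1-based integer")


bonded_atom_pairs = [
    [["A", 145, "SG"], ["L", 1, "C04"]],
    [["J", 1, "O6"], ["J", 2, "C1"]],
]
for bond in bonded_atom_pairs:
    check_bond(bond)  # Both example bonds pass.
```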
|
||||
|
||||
### Defining Glycans
|
||||
|
||||
Glycans are bound to a protein residue, and they are typically formed of
|
||||
multiple chemical components. To define a glycan, define a new ligand with all
|
||||
of the chemical components of the glycan. Then define a bond that links the
|
||||
glycan to the protein residue, and all bonds that are within the glycan between
|
||||
its individual chemical components.
|
||||
|
||||
For example, to define the following glycan composed of 4 components (CMP1,
|
||||
CMP2, CMP3, CMP4) bound to an arginine in a protein chain A:
|
||||
|
||||
```
 ⋮
ALA              CMP4
 |                |
ARG --- CMP1 --- CMP2
 |                |
ALA              CMP3
 ⋮
```
|
||||
|
||||
You will need to specify:
|
||||
|
||||
1. Protein chain A.
|
||||
2. Ligand chain B with the 4 components.
|
||||
3. Bonds ARG-CMP1, CMP1-CMP2, CMP2-CMP3, CMP2-CMP4.
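
The three steps can be sketched as follows. Note the assumptions: the arginine position (`7`) and the atom names (`ATOM1`, `ATOM2`) are placeholders that must be replaced with the real residue index and the atom names from the CCD entries of the components involved; `glycan_bonds` is a hypothetical helper.

```python
def glycan_bonds(
    protein_id: str,
    protein_residue: int,
    ligand_id: str,
    atom_pairs: list[tuple[str, str]],
) -> list[list[list]]:
    """Build bondedAtomPairs for the four-component glycan described above.

    atom_pairs holds one (source atom, destination atom) name pair per bond,
    in the order ARG-CMP1, CMP1-CMP2, CMP2-CMP3, CMP2-CMP4.
    """
    # Residue IDs inside the ligand chain are 1-based and follow ccdCodes order.
    endpoints = [
        ((protein_id, protein_residue), (ligand_id, 1)),  # ARG-CMP1
        ((ligand_id, 1), (ligand_id, 2)),  # CMP1-CMP2
        ((ligand_id, 2), (ligand_id, 3)),  # CMP2-CMP3
        ((ligand_id, 2), (ligand_id, 4)),  # CMP2-CMP4
    ]
    if len(atom_pairs) != len(endpoints):
        raise ValueError("expected one atom-name pair per bond")
    return [
        [[*source, source_atom], [*destination, destination_atom]]
        for (source, destination), (source_atom, destination_atom) in zip(
            endpoints, atom_pairs
        )
    ]


# Ligand chain B lists the four components in order.
glycan = {"ligand": {"id": "B", "ccdCodes": ["CMP1", "CMP2", "CMP3", "CMP4"]}}
# Placeholder residue position and atom names: replace with real values.
bonds = glycan_bonds("A", 7, "B", [("ATOM1", "ATOM2")] * 4)
```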
|
||||
|
||||
## User-provided CCD
|
||||
|
||||
There are two approaches to model a custom ligand not defined in the CCD. If the
|
||||
ligand is not bonded to other entities, it can be defined using a
|
||||
[SMILES string](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System).
|
||||
Otherwise, it is necessary to define that particular ligand using the
|
||||
[CCD mmCIF format](https://www.wwpdb.org/data/ccd#mmcifFormat).
|
||||
|
||||
Once defined, this ligand needs to be assigned a name that doesn't clash with
|
||||
existing CCD ligand names (e.g. `LIG-1`). Avoid underscores (`_`) in the name,
|
||||
as it could cause issues in the mmCIF format.
|
||||
|
||||
The newly defined ligand can then be used as a standard CCD ligand using its
|
||||
custom name, and bonds can be linked to it using its named atom scheme.
|
||||
|
||||
### User-provided CCD Format
|
||||
|
||||
The user-provided CCD must be passed in the `userCCD` field (in the root of the
|
||||
input JSON) as a string. Note that JSON doesn't allow newlines within strings,
|
||||
so newline characters (`\n`) must be used to delimit lines. Single rather than
|
||||
double quotes should also be used around strings like the chemical formula.
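
If you generate the input JSON programmatically, a JSON library takes care of the newline escaping for you. The sketch below uses Python's standard `json` module; the short `ccd_text` value is a truncated stand-in for a real component definition.

```python
import json

# Truncated stand-in for a real CCD mmCIF definition (assumption for brevity).
ccd_text = (
    "data_MY-X7F\n"
    "#\n"
    "_chem_comp.id MY-X7F\n"
    "_chem_comp.formula 'C10 H6 O4'\n"
    "#\n"
)

job_fragment = {"userCCD": ccd_text}
# json.dumps writes each newline as the two-character escape \n, and the
# single quotes around the formula need no escaping at all.
encoded = json.dumps(job_fragment)
decoded = json.loads(encoded)
```

Reading the JSON back yields the original multi-line text, so the round trip is lossless.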
|
||||
|
||||
The main pieces of information used are the atom names and elements, bonds, and
|
||||
also the ideal coordinates (`pdbx_model_Cartn_{x,y,z}_ideal`) which essentially
|
||||
serve as a structural template for the ligand if RDKit fails to generate
|
||||
conformers for that ligand.
|
||||
|
||||
The `userCCD` can also be used to redefine standard chemical components in the
|
||||
CCD. This can be useful if you need to redefine the ideal coordinates.
|
||||
|
||||
Below is an example `userCCD` redefining component X7F, which serves to
|
||||
illustrate the required sections. For readability purposes, newlines have not
|
||||
been replaced by `\n`.
|
||||
|
||||
```
|
||||
data_MY-X7F
|
||||
#
|
||||
_chem_comp.id MY-X7F
|
||||
_chem_comp.name '5,8-bis(oxidanyl)naphthalene-1,4-dione'
|
||||
_chem_comp.type non-polymer
|
||||
_chem_comp.formula 'C10 H6 O4'
|
||||
_chem_comp.mon_nstd_parent_comp_id ?
|
||||
_chem_comp.pdbx_synonyms ?
|
||||
_chem_comp.formula_weight 190.152
|
||||
#
|
||||
loop_
|
||||
_chem_comp_atom.comp_id
|
||||
_chem_comp_atom.atom_id
|
||||
_chem_comp_atom.alt_atom_id
|
||||
_chem_comp_atom.type_symbol
|
||||
_chem_comp_atom.charge
|
||||
_chem_comp_atom.pdbx_align
|
||||
_chem_comp_atom.pdbx_aromatic_flag
|
||||
_chem_comp_atom.pdbx_leaving_atom_flag
|
||||
_chem_comp_atom.pdbx_stereo_config
|
||||
_chem_comp_atom.pdbx_backbone_atom_flag
|
||||
_chem_comp_atom.pdbx_n_terminal_atom_flag
|
||||
_chem_comp_atom.pdbx_c_terminal_atom_flag
|
||||
_chem_comp_atom.model_Cartn_x
|
||||
_chem_comp_atom.model_Cartn_y
|
||||
_chem_comp_atom.model_Cartn_z
|
||||
_chem_comp_atom.pdbx_model_Cartn_x_ideal
|
||||
_chem_comp_atom.pdbx_model_Cartn_y_ideal
|
||||
_chem_comp_atom.pdbx_model_Cartn_z_ideal
|
||||
_chem_comp_atom.pdbx_component_atom_id
|
||||
_chem_comp_atom.pdbx_component_comp_id
|
||||
_chem_comp_atom.pdbx_ordinal
|
||||
MY-X7F C02 C1 C 0 1 N N N N N N 48.727 17.090 17.537 -1.418 -1.260 0.018 C02 MY-X7F 1
|
||||
MY-X7F C03 C2 C 0 1 N N N N N N 47.344 16.691 17.993 -0.665 -2.503 -0.247 C03 MY-X7F 2
|
||||
MY-X7F C04 C3 C 0 1 N N N N N N 47.166 16.016 19.310 0.677 -2.501 -0.235 C04 MY-X7F 3
|
||||
MY-X7F C05 C4 C 0 1 N N N N N N 48.363 15.728 20.184 1.421 -1.257 0.043 C05 MY-X7F 4
|
||||
MY-X7F C06 C5 C 0 1 Y N N N N N 49.790 16.142 19.699 0.706 0.032 0.008 C06 MY-X7F 5
|
||||
MY-X7F C07 C6 C 0 1 Y N N N N N 49.965 16.791 18.444 -0.706 0.030 -0.004 C07 MY-X7F 6
|
||||
MY-X7F C08 C7 C 0 1 Y N N N N N 51.249 17.162 18.023 -1.397 1.240 -0.037 C08 MY-X7F 7
|
||||
MY-X7F C10 C8 C 0 1 Y N N N N N 52.359 16.893 18.837 -0.685 2.443 -0.057 C10 MY-X7F 8
|
||||
MY-X7F C11 C9 C 0 1 Y N N N N N 52.184 16.247 20.090 0.679 2.445 -0.045 C11 MY-X7F 9
|
||||
MY-X7F C12 C10 C 0 1 Y N N N N N 50.899 15.876 20.515 1.394 1.243 -0.013 C12 MY-X7F 10
|
||||
MY-X7F O01 O1 O 0 1 N N N N N N 48.876 17.630 16.492 -2.611 -1.301 0.247 O01 MY-X7F 11
|
||||
MY-X7F O09 O2 O 0 1 N N N N N N 51.423 17.798 16.789 -2.752 1.249 -0.049 O09 MY-X7F 12
|
||||
MY-X7F O13 O3 O 0 1 N N N N N N 50.710 15.236 21.750 2.750 1.257 -0.001 O13 MY-X7F 13
|
||||
MY-X7F O14 O4 O 0 1 N N N N N N 48.229 15.189 21.234 2.609 -1.294 0.298 O14 MY-X7F 14
|
||||
MY-X7F H1 H1 H 0 1 N N N N N N 46.487 16.894 17.367 -1.199 -3.419 -0.452 H1 MY-X7F 15
|
||||
MY-X7F H2 H2 H 0 1 N N N N N N 46.178 15.732 19.640 1.216 -3.416 -0.429 H2 MY-X7F 16
|
||||
MY-X7F H3 H3 H 0 1 N N N N N N 53.348 17.177 18.511 -1.221 3.381 -0.082 H3 MY-X7F 17
|
||||
MY-X7F H4 H4 H 0 1 N N N N N N 53.040 16.041 20.716 1.212 3.384 -0.062 H4 MY-X7F 18
|
||||
MY-X7F H5 H5 H 0 1 N N N N N N 50.579 17.904 16.365 -3.154 1.271 0.830 H5 MY-X7F 19
|
||||
MY-X7F H6 H6 H 0 1 N N N N N N 49.785 15.059 21.877 3.151 1.241 -0.880 H6 MY-X7F 20
|
||||
#
|
||||
loop_
|
||||
_chem_comp_bond.comp_id
|
||||
_chem_comp_bond.atom_id_1
|
||||
_chem_comp_bond.atom_id_2
|
||||
_chem_comp_bond.value_order
|
||||
_chem_comp_bond.pdbx_aromatic_flag
|
||||
_chem_comp_bond.pdbx_stereo_config
|
||||
_chem_comp_bond.pdbx_ordinal
|
||||
MY-X7F O01 C02 DOUB N N 1
|
||||
MY-X7F O09 C08 SING N N 2
|
||||
MY-X7F C02 C03 SING N N 3
|
||||
MY-X7F C02 C07 SING N N 4
|
||||
MY-X7F C03 C04 DOUB N N 5
|
||||
MY-X7F C08 C07 DOUB Y N 6
|
||||
MY-X7F C08 C10 SING Y N 7
|
||||
MY-X7F C07 C06 SING Y N 8
|
||||
MY-X7F C10 C11 DOUB Y N 9
|
||||
MY-X7F C04 C05 SING N N 10
|
||||
MY-X7F C06 C05 SING N N 11
|
||||
MY-X7F C06 C12 DOUB Y N 12
|
||||
MY-X7F C11 C12 SING Y N 13
|
||||
MY-X7F C05 O14 DOUB N N 14
|
||||
MY-X7F C12 O13 SING N N 15
|
||||
MY-X7F C03 H1 SING N N 16
|
||||
MY-X7F C04 H2 SING N N 17
|
||||
MY-X7F C10 H3 SING N N 18
|
||||
MY-X7F C11 H4 SING N N 19
|
||||
MY-X7F O09 H5 SING N N 20
|
||||
MY-X7F O13 H6 SING N N 21
|
||||
#
|
||||
_pdbx_chem_comp_descriptor.type SMILES_CANONICAL
|
||||
_pdbx_chem_comp_descriptor.descriptor 'Oc1ccc(O)c2C(=O)C=CC(=O)c12'
|
||||
#
|
||||
```
|
||||
|
||||
## Full Example
|
||||
|
||||
An example illustrating all the aspects of the input format is provided below.
|
||||
Note that AlphaFold 3 won't run this input out of the box as it abbreviates
|
||||
certain fields and the sequences are not biologically meaningful.
|
||||
|
||||
```json
{
  "name": "Hello fold",
  "modelSeeds": [10, 42],
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "PVLSCGEWQL",
        "modifications": [
          {"ptmType": "HY3", "ptmPosition": 1},
          {"ptmType": "P1L", "ptmPosition": 5}
        ],
        "unpairedMsa": ...
      }
    },
    {
      "protein": {
        "id": "B",
        "sequence": "RPACQLW",
        "templates": [
          {
            "mmcif": ...,
            "queryIndices": [0, 1, 2, 4, 5, 6],
            "templateIndices": [0, 1, 2, 3, 4, 8]
          }
        ]
      }
    },
    {
      "dna": {
        "id": "C",
        "sequence": "GACCTCT",
        "modifications": [
          {"modificationType": "6OG", "basePosition": 1},
          {"modificationType": "6MA", "basePosition": 2}
        ]
      }
    },
    {
      "rna": {
        "id": "E",
        "sequence": "AGCU",
        "modifications": [
          {"modificationType": "2MG", "basePosition": 1},
          {"modificationType": "5MC", "basePosition": 4}
        ],
        "unpairedMsa": ...
      }
    },
    {
      "ligand": {
        "id": ["F", "G", "H"],
        "ccdCodes": ["ATP"]
      }
    },
    {
      "ligand": {
        "id": "I",
        "ccdCodes": ["NAG", "FUC"]
      }
    },
    {
      "ligand": {
        "id": "Z",
        "smiles": "CC(=O)OC1C[NH+]2CCC1CC2"
      }
    }
  ],
  "bondedAtomPairs": [
    [["A", 1, "CA"], ["B", 1, "CA"]],
    [["A", 1, "CA"], ["G", 1, "CHA"]],
    [["I", 1, "O6"], ["I", 2, "C1"]]
  ],
  "userCCD": ...,
  "dialect": "alphafold3",
  "version": 1
}
```
|
||||
355
docs/installation.md
Normal file
@@ -0,0 +1,355 @@
|
||||
# Installation and Running Your First Prediction
|
||||
|
||||
You will need a machine running Linux; AlphaFold 3 does not support other
|
||||
operating systems. Full installation requires up to 1 TB of disk space to keep
|
||||
genetic databases (SSD storage is recommended) and an NVIDIA GPU with Compute
|
||||
Capability 8.0 or greater (GPUs with more memory can predict larger protein
|
||||
structures). We have verified that inputs with up to 5,120 tokens can fit on a
|
||||
single NVIDIA A100 80 GB, or a single NVIDIA H100 80 GB. We have verified
|
||||
numerical accuracy on both NVIDIA A100 and H100 GPUs.
|
||||
|
||||
Especially for long targets, the genetic search stage can consume a lot of RAM –
|
||||
we recommend running with at least 64 GB of RAM.
|
||||
|
||||
We provide installation instructions for a machine with an NVIDIA A100 80 GB GPU
|
||||
and a clean Ubuntu 22.04 LTS installation, and expect that these instructions
|
||||
should aid others with different setups.
|
||||
|
||||
The instructions provided below describe how to:
|
||||
|
||||
1. Provision a machine on GCP.
|
||||
1. Install Docker.
|
||||
1. Install NVIDIA drivers for an A100.
|
||||
1. Obtain genetic databases.
|
||||
1. Obtain model parameters.
|
||||
1. Build the AlphaFold 3 Docker container or Singularity image.
|
||||
|
||||
## Provisioning a Machine
|
||||
|
||||
Clean Ubuntu images are available on Google Cloud, AWS, Azure, and other major
|
||||
platforms.
|
||||
|
||||
We first provisioned a new machine in Google Cloud Platform using the following
|
||||
command. We were using a Google Cloud project that was already set up.
|
||||
|
||||
* We recommend using `--machine-type a2-ultragpu-1g` but feel free to use
|
||||
`--machine-type a2-highgpu-1g` for smaller predictions.
|
||||
* If desired, replace `--zone us-central1-a` with a zone that has quota for
|
||||
the machine you have selected. See
|
||||
[gpu-regions-zones](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones).
|
||||
|
||||
```sh
|
||||
gcloud compute instances create alphafold3 \
|
||||
--machine-type a2-ultragpu-1g \
|
||||
--zone us-central1-a \
|
||||
--image-family ubuntu-2204-lts \
|
||||
--image-project ubuntu-os-cloud \
|
||||
--maintenance-policy TERMINATE \
|
||||
--boot-disk-size 1000 \
|
||||
--boot-disk-type pd-balanced
|
||||
```
|
||||
|
||||
This provisions a bare Ubuntu 22.04 LTS image on an
|
||||
[A2 Ultra](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a2-vms)
|
||||
machine with 12 CPUs, 170 GB RAM, a 1 TB disk, and an NVIDIA A100 80 GB GPU attached.
|
||||
We verified the following installation steps from this point.
|
||||
|
||||
## Installing Docker
|
||||
|
||||
These instructions are for rootless Docker.
|
||||
|
||||
### Installing Docker on Host
|
||||
|
||||
Note that these instructions apply only to Ubuntu 22.04 LTS images (see above).
|
||||
|
||||
Add Docker's official GPG key. Official Docker instructions are
|
||||
[here](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository).
|
||||
The commands we ran are:
|
||||
|
||||
```sh
|
||||
sudo apt-get update
|
||||
sudo apt-get install ca-certificates curl
|
||||
sudo install -m 0755 -d /etc/apt/keyrings
|
||||
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
|
||||
sudo chmod a+r /etc/apt/keyrings/docker.asc
|
||||
```
|
||||
|
||||
Add the repository to apt sources:
|
||||
|
||||
```sh
|
||||
echo \
|
||||
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
|
||||
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
|
||||
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
|
||||
sudo apt-get update
|
||||
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
|
||||
sudo docker run hello-world
|
||||
```
|
||||
|
||||
### Enabling Rootless Docker
|
||||
|
||||
Official Docker instructions are
|
||||
[here](https://docs.docker.com/engine/security/rootless/#distribution-specific-hint).
|
||||
The commands we ran are:
|
||||
|
||||
```sh
|
||||
sudo apt-get install -y uidmap systemd-container
|
||||
|
||||
sudo machinectl shell $(whoami)@ /bin/bash -c 'dockerd-rootless-setuptool.sh install && sudo loginctl enable-linger $(whoami) && DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock docker context use rootless'
|
||||
```
|
||||
|
||||
## Installing GPU Support
|
||||
|
||||
### Installing NVIDIA Drivers
|
||||
|
||||
Official Ubuntu instructions are
|
||||
[here](https://documentation.ubuntu.com/server/how-to/graphics/install-nvidia-drivers/).
|
||||
The commands we ran are:
|
||||
|
||||
```sh
|
||||
sudo apt-get -y install alsa-utils ubuntu-drivers-common
|
||||
sudo ubuntu-drivers install
|
||||
|
||||
sudo nvidia-smi --gpu-reset
|
||||
|
||||
nvidia-smi # Check that the drivers are installed.
|
||||
```
|
||||
|
||||
Accept the "Pending kernel upgrade" dialog if it appears.
|
||||
|
||||
You will need to reboot the instance with `sudo reboot now` to reset the GPU if
|
||||
you see the following warning:
|
||||
|
||||
```text
|
||||
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
|
||||
Make sure that the latest NVIDIA driver is installed and running.
|
||||
```
|
||||
|
||||
Proceed only if `nvidia-smi` produces sensible output.
|
||||
|
||||
### Installing NVIDIA Support for Docker
|
||||
|
||||
Official NVIDIA instructions are
|
||||
[here](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
|
||||
The commands we ran are:
|
||||
|
||||
```sh
|
||||
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
|
||||
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
|
||||
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
|
||||
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
|
||||
sudo apt-get update
|
||||
sudo apt-get install -y nvidia-container-toolkit
|
||||
nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
|
||||
systemctl --user restart docker
|
||||
sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place
|
||||
```
|
||||
|
||||
Check that your container can see the GPU:
|
||||
|
||||
```sh
|
||||
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
|
||||
```
|
||||
|
||||
The output should look similar to this:
|
||||
|
||||
```text
|
||||
Mon Nov 11 12:00:00 2024
|
||||
+-----------------------------------------------------------------------------------------+
|
||||
| NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.6 |
|
||||
|-----------------------------------------+------------------------+----------------------+
|
||||
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
|
||||
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|
||||
| | | MIG M. |
|
||||
|=========================================+========================+======================|
|
||||
| 0 NVIDIA A100-SXM4-80GB Off | 00000000:00:05.0 Off | 0 |
|
||||
| N/A 34C P0 51W / 400W | 1MiB / 81920MiB | 0% Default |
|
||||
| | | Disabled |
|
||||
+-----------------------------------------+------------------------+----------------------+
|
||||
|
||||
+-----------------------------------------------------------------------------------------+
|
||||
| Processes: |
|
||||
| GPU GI CI PID Type Process name GPU Memory |
|
||||
| ID ID Usage |
|
||||
|=========================================================================================|
|
||||
| No running processes found |
|
||||
+-----------------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
## Obtaining AlphaFold 3 Source Code
|
||||
|
||||
You will need to have `git` installed to download the AlphaFold 3 repository:
|
||||
|
||||
```sh
|
||||
git clone https://github.com/google-deepmind/alphafold3.git
|
||||
```
|
||||
|
||||
## Obtaining Genetic Databases
|
||||
|
||||
This step requires `curl` and `zstd` to be installed on your machine.
|
||||
|
||||
AlphaFold 3 needs multiple genetic (sequence) protein and RNA databases to run:
|
||||
|
||||
* [BFD small](https://bfd.mmseqs.com/)
|
||||
* [MGnify](https://www.ebi.ac.uk/metagenomics/)
|
||||
* [PDB](https://www.rcsb.org/) (structures in the mmCIF format)
|
||||
* [PDB seqres](https://www.rcsb.org/)
|
||||
* [UniProt](https://www.uniprot.org/uniprot/)
|
||||
* [UniRef90](https://www.uniprot.org/help/uniref)
|
||||
* [NT](https://www.ncbi.nlm.nih.gov/nucleotide/)
|
||||
* [RFam](https://rfam.org/)
|
||||
* [RNACentral](https://rnacentral.org/)
|
||||
|
||||
We provide a Python program `fetch_databases.py` that can be used to download
|
||||
and set up all of these databases. This process takes around 45 minutes when not
|
||||
installing on local SSD. We recommend running the following in a `screen` or
|
||||
`tmux` session as downloading and decompressing the databases takes some time.
|
||||
|
||||
```sh
|
||||
cd alphafold3 # Navigate to the directory with cloned AlphaFold 3 repository.
|
||||
python3 fetch_databases.py --download_destination=<DATABASES_DIR>
|
||||
```
|
||||
|
||||
This script downloads the databases from a mirror hosted on GCS, with all
|
||||
versions being the same as used in the AlphaFold 3 paper.
|
||||
|
||||
:ledger: **Note: The download directory `<DATABASES_DIR>` should *not* be a
|
||||
subdirectory in the AlphaFold 3 repository directory.** If it is, the Docker
|
||||
build will be slow as the large databases will be copied during the image
|
||||
creation.
|
||||
|
||||
:ledger: **Note: The total download size for the full databases is around 252 GB
|
||||
and the total size when unzipped is 630 GB. Please make sure you have sufficient
|
||||
hard drive space, bandwidth, and time to download. We recommend using an SSD for
|
||||
better genetic search performance, and faster runtime of `fetch_databases.py`.**
|
||||
|
||||
:ledger: **Note: If the download directory and datasets don't have full read and
|
||||
write permissions, it can cause errors with the MSA tools, with opaque
|
||||
(external) error messages. Please ensure the required permissions are applied,
|
||||
e.g. with the `sudo chmod 755 --recursive <DATABASES_DIR>` command.**
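
The first two notes above can be checked before starting the download. The sketch below is an illustration under stated assumptions, not part of `fetch_databases.py`: the helper names are hypothetical, and the 630 GB figure is the unzipped size quoted above (the compressed download needs additional temporary space).

```python
import shutil
from pathlib import Path

REQUIRED_GB = 630  # Unzipped database size quoted in the note above.


def is_inside(path: Path, parent: Path) -> bool:
    """True if `path` is `parent` itself or lies somewhere beneath it."""
    path, parent = path.resolve(), parent.resolve()
    return path == parent or parent in path.parents


def check_destination(databases_dir: Path, repo_dir: Path) -> list[str]:
    """Returns a list of human-readable problems (empty if none found)."""
    problems = []
    if is_inside(databases_dir, repo_dir):
        problems.append("destination is inside the repository directory")
    # Measure free space on the nearest existing ancestor of the destination.
    probe = databases_dir.resolve()
    while not probe.exists():
        probe = probe.parent
    free_gb = shutil.disk_usage(probe).free / 1e9
    if free_gb < REQUIRED_GB:
        problems.append(f"only {free_gb:.0f} GB free, need ~{REQUIRED_GB} GB")
    return problems
```

For example, `check_destination(Path("/data/af3_databases"), Path("alphafold3"))` (both paths hypothetical) returns an empty list when the destination looks suitable.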
|
||||
|
||||
Once the script has finished, you should have the following directory structure:
|
||||
|
||||
```text
|
||||
pdb_2022_09_28_mmcif_files.tar # ~200k PDB mmCIF files in this tar.
|
||||
bfd-first_non_consensus_sequences.fasta
|
||||
mgy_clusters_2022_05.fa
|
||||
nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta
|
||||
pdb_seqres_2022_09_28.fasta
|
||||
rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta
|
||||
rnacentral_active_seq_id_90_cov_80_linclust.fasta
|
||||
uniprot_all_2021_04.fa
|
||||
uniref90_2022_05.fa
|
||||
```
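
To confirm that the download completed, the listing above can be compared against the contents of `<DATABASES_DIR>`. This is a minimal sketch with a hypothetical helper name; the file names are copied from the listing above and would need updating if the database versions change.

```python
from pathlib import Path

# File names taken verbatim from the expected directory listing.
EXPECTED_FILES = (
    "pdb_2022_09_28_mmcif_files.tar",
    "bfd-first_non_consensus_sequences.fasta",
    "mgy_clusters_2022_05.fa",
    "nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta",
    "pdb_seqres_2022_09_28.fasta",
    "rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta",
    "rnacentral_active_seq_id_90_cov_80_linclust.fasta",
    "uniprot_all_2021_04.fa",
    "uniref90_2022_05.fa",
)


def missing_database_files(databases_dir: Path) -> list[str]:
    """Returns the expected files that are absent from `databases_dir`."""
    return [
        name for name in EXPECTED_FILES
        if not (databases_dir / name).is_file()
    ]
```

An empty return value means every expected file is present; it does not verify that the files are complete or uncorrupted.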
|
||||
|
||||
## Obtaining Model Parameters
|
||||
|
||||
To request access to the AlphaFold 3 model parameters, please complete
|
||||
[this form](https://forms.gle/svvpY4u2jsHEwWYS6). Access will be granted at
|
||||
Google DeepMind’s sole discretion. We will aim to respond to requests within 2–3
|
||||
business days. You may only use AlphaFold 3 model parameters if received
|
||||
directly from Google. Use is subject to these
|
||||
[terms of use](https://github.com/google-deepmind/alphafold3/blob/main/WEIGHTS_TERMS_OF_USE.md).
|
||||
|
||||
## Building the Docker Container That Will Run AlphaFold 3
|
||||
|
||||
Then, build the Docker container. This builds a container with all the right
|
||||
Python dependencies:
|
||||
|
||||
```sh
|
||||
docker build -t alphafold3 -f docker/Dockerfile .
|
||||
```
|
||||
|
||||
You can now run AlphaFold 3!
|
||||
|
||||
```sh
|
||||
docker run -it \
|
||||
--volume $HOME/af_input:/root/af_input \
|
||||
--volume $HOME/af_output:/root/af_output \
|
||||
--volume <MODEL_PARAMETERS_DIR>:/root/models \
|
||||
--volume <DATABASES_DIR>:/root/public_databases \
|
||||
--gpus all \
|
||||
alphafold3 \
|
||||
python run_alphafold.py \
|
||||
--json_path=/root/af_input/fold_input.json \
|
||||
--model_dir=/root/models \
|
||||
--output_dir=/root/af_output
|
||||
```
|
||||
|
||||
:ledger: **Note: In the example above the databases have been placed on the
|
||||
persistent disk, which is slow.** If you want better genetic and template search
|
||||
performance, make sure all databases are placed on a local SSD.
|
||||
|
||||
If you get an error like the following, make sure the models and data are in the
locations given by the `--volume` flags above.
|
||||
|
||||
```text
|
||||
docker: Error response from daemon: error while creating mount source path '/srv/alphafold3_data/models': mkdir /srv/alphafold3_data/models: permission denied.
|
||||
```
|
||||
|
||||
## Running Using Singularity Instead of Docker
|
||||
|
||||
You may prefer to run AlphaFold 3 within Singularity. You'll still need to
|
||||
*build* the Singularity image from the Docker container. Afterwards, you will
|
||||
not have to depend on Docker (at structure prediction time).
|
||||
|
||||
### Install Singularity
|
||||
|
||||
Official Singularity instructions are
|
||||
[here](https://docs.sylabs.io/guides/3.3/user-guide/installation.html). The
|
||||
commands we ran are:
|
||||
|
||||
```sh
|
||||
wget https://github.com/sylabs/singularity/releases/download/v4.2.1/singularity-ce_4.2.1-jammy_amd64.deb
|
||||
sudo dpkg --install singularity-ce_4.2.1-jammy_amd64.deb
|
||||
sudo apt-get install -f
|
||||
```
|
||||
|
||||
### Build the Singularity Container From the Docker Image
|
||||
|
||||
After building the *Docker* container above with `docker build -t`, start a
|
||||
local Docker registry and upload your image `alphafold3` to it. Singularity's
|
||||
instructions are [here](https://github.com/apptainer/singularity/issues/1537).
|
||||
The commands we ran are:
|
||||
|
||||
```sh
|
||||
docker run -d -p 5000:5000 --restart=always --name registry registry:2
|
||||
docker tag alphafold3 localhost:5000/alphafold3
|
||||
docker push localhost:5000/alphafold3
|
||||
```
|
||||
|
||||
Then build the Singularity container:
|
||||
|
||||
```sh
|
||||
SINGULARITY_NOHTTPS=1 singularity build alphafold3.simg docker://localhost:5000/alphafold3:latest
|
||||
```
|
||||
|
||||
You can confirm your build by starting a shell and inspecting the environment.
|
||||
For example, you may want to ensure the Singularity image can access your GPU.
|
||||
You may want to restart your computer if you have issues with this.
|
||||
|
||||
```sh
|
||||
singularity exec --nv alphafold3.simg sh -c 'nvidia-smi'
|
||||
```
|
||||
|
||||
You can now run AlphaFold 3!
|
||||
|
||||
```sh
|
||||
singularity exec --nv alphafold3.simg <<args>>
|
||||
```
|
||||
|
||||
For example:
|
||||
|
||||
```sh
|
||||
singularity exec \
|
||||
--nv \
--bind $HOME/af_input:/root/af_input \
--bind $HOME/af_output:/root/af_output \
--bind <MODEL_PARAMETERS_DIR>:/root/models \
--bind <DATABASES_DIR>:/root/public_databases \
alphafold3.simg \
python alphafold3/run_alphafold.py \
|
||||
--json_path=/root/af_input/fold_input.json \
|
||||
--model_dir=/root/models \
|
||||
--db_dir=/root/public_databases \
|
||||
--output_dir=/root/af_output
|
||||
```
|
||||
16
docs/known_issues.md
Normal file
@@ -0,0 +1,16 @@
|
||||
# Known Issues
|
||||
|
||||
## Numerical Accuracy above 5,120 Tokens
|
||||
|
||||
AlphaFold 3 does not currently support inference on inputs larger than 5,120
|
||||
tokens. An error will be raised if the input is larger than this threshold.
|
||||
|
||||
This is due to a numerical issue with the custom Pallas kernel implementing the
|
||||
Gated Linear Unit. The numerical issue only occurs for inputs above the
5,120-token threshold and results in degraded accuracy in the predicted structure.
|
||||
|
||||
This numerical issue is unique to the single GPU configuration used in this
|
||||
repository, and does not affect the results in the
|
||||
[AlphaFold 3 paper](https://www.nature.com/articles/s41586-024-07487-w).
|
||||
|
||||
We hope to resolve this issue soon and remove this check on input size.
|
||||
187
docs/output.md
Normal file
@@ -0,0 +1,187 @@
|
||||
# AlphaFold 3 Output
|
||||
|
||||
## Output Directory Structure
|
||||
|
||||
For every input job, AlphaFold 3 writes all its outputs in a directory named after
|
||||
the sanitized version of the job name. E.g. for job name "My first fold (test)",
|
||||
AlphaFold 3 will write its outputs in a directory called `my_first_fold_test`.
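
The exact sanitization rules are not spelled out here. As a hedged sketch, the following hypothetical function reproduces the example above by lowercasing, replacing spaces with underscores, and dropping every character outside an assumed allowed set; the rules AlphaFold 3 actually applies may differ.

```python
import string

# Assumed set of characters kept in the directory name.
_ALLOWED = set(string.ascii_lowercase + string.digits + "_-.")


def sanitize_job_name(name: str) -> str:
    """Lowercases, replaces spaces with underscores, drops other characters."""
    lowered = name.lower().replace(" ", "_")
    return "".join(ch for ch in lowered if ch in _ALLOWED)


print(sanitize_job_name("My first fold (test)"))  # my_first_fold_test
```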
|
||||
|
||||
The following structure is used within the output directory:
|
||||
|
||||
* Sub-directories with results for each sample and seed. There will be
|
||||
*num\_seeds* \* *num\_samples* such sub-directories. The naming pattern is
|
||||
`seed-<seed value>_sample-<sample number>`. Each of these directories
|
||||
contains a confidence JSON, summary confidence JSON, and the mmCIF with the
|
||||
predicted structure.
|
||||
* Top-ranking prediction mmCIF: `<job_name>_model.cif`. This file contains the
|
||||
predicted coordinates and should be compatible with most structural biology
|
||||
tools. We do not provide the output in the PDB format; the CIF file can
easily be converted to PDB if needed.
|
||||
* Top-ranking prediction confidence JSON: `<job_name>_confidences.json`.
|
||||
* Top-ranking prediction summary confidence JSON:
|
||||
`<job_name>_summary_confidences.json`.
|
||||
* Job input JSON file with the MSA and template data added by the data
|
||||
pipeline: `<job_name>_data.json`.
|
||||
* Ranking scores for all predictions: `ranking_scores.csv`. The prediction
|
||||
with the highest ranking score is the one included in the root directory.
|
||||
* Output terms of use: `TERMS_OF_USE.md`.
|
||||
|
||||
Below is an example AlphaFold 3 output directory listing for a job called
|
||||
"Hello Fold" that has been run with 1 seed and 5 samples:
|
||||
|
||||
```text
|
||||
hello_fold/
|
||||
├── seed-1234_sample-0/
|
||||
│ ├── confidences.json
|
||||
│ ├── model.cif
|
||||
│ └── summary_confidences.json
|
||||
├── seed-1234_sample-1/
|
||||
│ ├── confidences.json
|
||||
│ ├── model.cif
|
||||
│ └── summary_confidences.json
|
||||
├── seed-1234_sample-2/
|
||||
│ ├── confidences.json
|
||||
│ ├── model.cif
|
||||
│ └── summary_confidences.json
|
||||
├── seed-1234_sample-3/
|
||||
│ ├── confidences.json
|
||||
│ ├── model.cif
|
||||
│ └── summary_confidences.json
|
||||
├── seed-1234_sample-4/
|
||||
│ ├── confidences.json
|
||||
│ ├── model.cif
|
||||
│ └── summary_confidences.json
|
||||
├── TERMS_OF_USE.md
|
||||
├── hello_fold_confidences.json
|
||||
├── hello_fold_data.json
|
||||
├── hello_fold_model.cif
|
||||
├── hello_fold_summary_confidences.json
|
||||
└── ranking_scores.csv
|
||||
```
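
When post-processing results, the `seed-<seed value>_sample-<sample number>` naming pattern can be parsed to locate each sample. The following is a minimal sketch with hypothetical helper names; it assumes that both the seed and the sample number are non-negative integers, as in the listing above.

```python
import re
from pathlib import Path
from typing import Optional

_SAMPLE_DIR = re.compile(r"^seed-(\d+)_sample-(\d+)$")


def parse_sample_dir(name: str) -> Optional[tuple[int, int]]:
    """Returns (seed, sample) for a sample directory name, else None."""
    match = _SAMPLE_DIR.match(name)
    return (int(match.group(1)), int(match.group(2))) if match else None


def list_samples(job_dir: Path) -> dict[tuple[int, int], Path]:
    """Maps (seed, sample) to the model.cif path of each sample directory."""
    samples = {}
    for entry in sorted(job_dir.iterdir()):
        key = parse_sample_dir(entry.name)
        if entry.is_dir() and key is not None:
            samples[key] = entry / "model.cif"
    return samples


print(parse_sample_dir("seed-1234_sample-3"))  # (1234, 3)
print(parse_sample_dir("ranking_scores.csv"))  # None
```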
|
||||
|
||||
## Confidence Metrics
|
||||
|
||||
Similar to AlphaFold2 and AlphaFold-Multimer, AlphaFold 3 outputs include
|
||||
confidence metrics. The main metrics are:
|
||||
|
||||
* **pLDDT:** a per-atom confidence estimate on a 0-100 scale where a higher
|
||||
value indicates higher confidence. pLDDT aims to predict a modified LDDT
|
||||
score that only considers distances to polymers. For proteins this is
|
||||
similar to the
|
||||
[lDDT-Cα metric](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3799472/) but
|
||||
with more granularity as it can vary per atom not just per residue. For
|
||||
ligand atoms, the modified LDDT considers the errors only between the ligand
|
||||
atom and polymers, not other ligand atoms. For DNA/RNA a wider radius of 30
|
||||
Å is used for the modified LDDT instead of 15 Å.
|
||||
* **PAE (predicted aligned error)**: an estimate of the error in the relative
|
||||
position and orientation between two tokens in the predicted structure.
|
||||
Higher values indicate higher predicted error and therefore lower
|
||||
confidence. For proteins and nucleic acids, the PAE score is essentially the
same as in AlphaFold2, where the error is measured relative to frames
|
||||
constructed from the protein backbone. For small molecules and
|
||||
post-translational modifications, a frame is constructed for each atom from
|
||||
its closest neighbors from a reference conformer.
|
||||
* **pTM and ipTM scores**: the predicted template modeling (pTM) score and the
|
||||
interface predicted template modeling (ipTM) score are both derived from a
|
||||
measure called the template modeling (TM) score. This measures the accuracy
|
||||
of the entire structure
|
||||
([Zhang and Skolnick, 2004](https://doi.org/10.1002/prot.20264);
|
||||
[Xu and Zhang, 2010](https://doi.org/10.1093/bioinformatics/btq066)). A pTM
|
||||
score above 0.5 means the overall predicted fold for the complex might be
|
||||
similar to the true structure. ipTM measures the accuracy of the predicted
|
||||
relative positions of the subunits within the complex. Values higher than
|
||||
0.8 represent confident high-quality predictions, while values below 0.6
|
||||
suggest a failed prediction. ipTM values between 0.6 and 0.8 are a gray zone
|
||||
where predictions could be correct or incorrect. The TM score is very strict
|
||||
for small structures or short chains, so pTM assigns values less than 0.05
|
||||
when fewer than 20 tokens are involved; for these cases PAE or pLDDT may be
|
||||
more indicative of prediction quality.
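
The ipTM thresholds quoted above can be written down as a small helper. This is only a sketch of the rule of thumb from the text, with a hypothetical function name and labels; it is not a replacement for inspecting PAE and pLDDT, particularly for small structures.

```python
def interpret_iptm(iptm: float) -> str:
    """Maps an ipTM value to the qualitative bands described above."""
    if iptm > 0.8:
        return "confident"      # Confident, high-quality prediction.
    if iptm < 0.6:
        return "likely failed"  # Suggests a failed prediction.
    return "gray zone"          # Could be correct or incorrect.


print(interpret_iptm(0.85))  # confident
print(interpret_iptm(0.70))  # gray zone
print(interpret_iptm(0.40))  # likely failed
```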
|
||||
|
||||
For a detailed description of these confidence metrics see the
|
||||
[AlphaFold 3 paper](https://www.nature.com/articles/s41586-024-07487-w). For
|
||||
protein components, the
|
||||
[AlphaFold: A Practical guide](https://www.ebi.ac.uk/training/online/courses/alphafold/inputs-and-outputs/evaluating-alphafolds-predicted-structures-using-confidence-scores/)
|
||||
course for structures provides additional tutorials on the confidence metrics.
|
||||
|
||||
If you are interested in a specific entity or interaction, then there are
|
||||
confidences available in the outputs which are specific to each chain or
|
||||
chain-pair, as opposed to the full complex. See below for more details on all
|
||||
the confidence metrics that are returned.
|
||||
|
||||
## Multi-Seed and Multi-Sample Results
|
||||
|
||||
By default, the model samples five predictions per seed. The top-ranked
|
||||
prediction across all samples and seeds is available at the top-level of the
|
||||
output directory. All samples along with their associated confidences are
|
||||
available in subdirectories of the output directory.
|
||||
|
||||
For ranking of the full complex use the `ranking_score` (higher is better). This
|
||||
score uses overall structure confidences (pTM and ipTM), but also includes terms
|
||||
that penalize clashes and encourage disordered regions not to have spurious
|
||||
helices – these extra terms mean the score should only be used to rank
|
||||
structures.
|
||||
|
||||
If you are interested in a specific entity or interaction, you may want to rank
|
||||
by a metric specific to that chain or chain-pair, as opposed to the full
|
||||
complex. In that case, use the per chain or per chain-pair confidence metrics
|
||||
described below for ranking.
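
As an illustration of ranking by a chain-pair metric, the sketch below picks the sample with the highest `chain_pair_iptm` entry for one pair of chains. It assumes the summary confidence JSON of each sample has already been loaded into a dictionary keyed by sample directory name; the function name and the numbers are made up for the example.

```python
from typing import Any, Mapping


def best_sample_for_pair(
    summaries: Mapping[str, Mapping[str, Any]], chain_i: int, chain_j: int
) -> str:
    """Returns the sample whose chain_pair_iptm[chain_i][chain_j] is highest."""
    return max(
        summaries,
        key=lambda name: summaries[name]["chain_pair_iptm"][chain_i][chain_j],
    )


# Illustrative values only, not real AlphaFold 3 output.
summaries = {
    "seed-1234_sample-0": {"chain_pair_iptm": [[0.90, 0.35], [0.35, 0.80]]},
    "seed-1234_sample-1": {"chain_pair_iptm": [[0.88, 0.72], [0.72, 0.79]]},
}
print(best_sample_for_pair(summaries, 0, 1))  # seed-1234_sample-1
```

Here the second sample wins for the interface between chains 0 and 1 even though its diagonal (per-chain) entries are slightly lower, which is the situation where a pair-specific ranking differs from a whole-complex one.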
|
||||
|
||||
## Metrics in Confidences JSON
|
||||
|
||||
For each predicted sample we provide two JSON files. One contains summary
|
||||
metrics – summaries for either the whole structure, per chain or per chain-pair
|
||||
– and the other contains full 1D or 2D arrays.
|
||||
|
||||
Summary outputs:
|
||||
|
||||
* `ptm`: A scalar in the range 0-1 indicating the predicted TM-score for the
|
||||
full structure.
|
||||
* `iptm`: A scalar in the range 0-1 indicating predicted interface TM-score
|
||||
(confidence in the predicted interfaces) for all interfaces in the
|
||||
structure.
|
||||
* `fraction_disordered`: A scalar in the range 0-1 that indicates what
|
||||
fraction of the predicted structure is disordered, as measured by
accessible surface area; see our
|
||||
[paper](https://www.nature.com/articles/s41586-024-07487-w) for details.
|
||||
* `has_clash`: A boolean indicating if the structure has a significant number
|
||||
of clashing atoms (more than 50% of a chain, or a chain with more than 100
|
||||
clashing atoms).
|
||||
* `ranking_score`: A scalar in the range \[-100, 1.5\] that can be used for
|
||||
ranking predictions. It incorporates `ptm`, `iptm`, `fraction_disordered`
and `has_clash` into a single number with the following equation: 0.8 × ipTM
\+ 0.2 × pTM \+ 0.5 × disorder − 100 × has_clash, where disorder is the
`fraction_disordered` value.
|
||||
* `chain_pair_pae_min`: A \[num_chains, num_chains\] array. Element (i, j) of
|
||||
the array contains the lowest PAE value across rows restricted to chain i
|
||||
and columns restricted to chain j. This has been found to correlate with
|
||||
whether two chains interact or not, and in some cases can be used to
|
||||
distinguish binders from non-binders.
|
||||
* `chain_pair_iptm`: A \[num_chains, num_chains\] array. Off-diagonal element
|
||||
(i, j) of the array contains the ipTM restricted to tokens from chains i and
|
||||
j. Diagonal element (i, i) contains the pTM restricted to chain i. Can be
|
||||
used for ranking a specific interface between two chains, when you know that
|
||||
they interact, e.g. for antibody-antigen interactions.
|
||||
* `chain_ptm`: A \[num_chains\] array. Element i contains the pTM restricted
|
||||
to chain i. Can be used for ranking individual chains when the structure of
|
||||
that chain is most of interest, rather than the cross-chain interactions it
|
||||
is involved with.
|
||||
* `chain_iptm`: A \[num_chains\] array that gives the average confidence
|
||||
(interface pTM) in the interface between each chain and all other chains.
|
||||
Can be used for ranking a specific chain, when you care about where the
|
||||
chain binds to the rest of the complex and you do not know which other
|
||||
chains you expect it to interact with. This is often the case with ligands.
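
The `ranking_score` equation given in the list above can be reproduced directly from the other summary fields. This is a minimal sketch with a hypothetical function name; it follows the equation as written and does not cover any special cases the actual implementation may handle.

```python
def ranking_score(
    ptm: float, iptm: float, fraction_disordered: float, has_clash: bool
) -> float:
    """0.8 * ipTM + 0.2 * pTM + 0.5 * disorder - 100 * has_clash."""
    return (
        0.8 * iptm
        + 0.2 * ptm
        + 0.5 * fraction_disordered
        - 100.0 * float(has_clash)
    )


print(round(ranking_score(0.5, 0.9, 0.1, False), 2))  # 0.87
print(round(ranking_score(0.5, 0.9, 0.1, True), 2))   # -99.13
```

With all three scalar inputs at 1 and no clash the score is 1.5, and with all at 0 and a clash it is −100, which matches the stated range.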
|
||||
|
||||
Full array outputs:
|
||||
|
||||
* `pae`: A \[num\_tokens, num\_tokens\] array. Element (i, j) indicates the
|
||||
predicted error in the position of token j, when the prediction is aligned
|
||||
to the ground truth using the frame of token i.
|
||||
* `atom_plddts`: A \[num_atoms\] array. Element i indicates the predicted
|
||||
local distance difference test (pLDDT) for atom i in the prediction.
|
||||
* `contact_probs`: A \[num_tokens, num_tokens\] array. Element (i, j)
|
||||
indicates the predicted probability that token i and token j are in contact
|
||||
(within 8 Å between the representative atoms of the two tokens); see the
|
||||
[paper](https://www.nature.com/articles/s41586-024-07487-w) for details.
|
||||
* `token_chain_ids`: A \[num_tokens\] array indicating the chain ids
|
||||
corresponding to each token in the prediction.
|
||||
* `atom_chain_ids`: A \[num_atoms\] array indicating the chain ids
|
||||
corresponding to each atom in the prediction.
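
The full arrays can be combined to derive per-chain-pair summaries. The sketch below recomputes a minimum-PAE matrix from `pae` and `token_chain_ids`, following the description of `chain_pair_pae_min` given earlier. It uses plain Python lists, a hypothetical function name, and made-up numbers; the exact computation inside AlphaFold 3 may differ.

```python
def chain_pair_pae_min(
    pae: list[list[float]], token_chain_ids: list[str]
) -> tuple[list[str], list[list[float]]]:
    """Lowest PAE over rows in chain i and columns in chain j, per pair."""
    # Unique chain IDs in order of first appearance.
    chains = list(dict.fromkeys(token_chain_ids))
    index = {chain: k for k, chain in enumerate(chains)}
    n = len(chains)
    result = [[float("inf")] * n for _ in range(n)]
    for row, row_chain in zip(pae, token_chain_ids):
        for value, col_chain in zip(row, token_chain_ids):
            i, j = index[row_chain], index[col_chain]
            result[i][j] = min(result[i][j], value)
    return chains, result


# Illustrative 3-token example: two tokens in chain A, one in chain B.
pae = [
    [0.5, 1.0, 9.0],
    [1.2, 0.4, 7.5],
    [8.0, 6.0, 0.3],
]
chains, mins = chain_pair_pae_min(pae, ["A", "A", "B"])
print(chains)  # ['A', 'B']
print(mins)    # [[0.4, 7.5], [6.0, 0.3]]
```

Note that the result is not symmetric in general, because PAE element (i, j) depends on which token's frame is used for alignment.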
|
||||
153
docs/performance.md
Normal file
@@ -0,0 +1,153 @@
|
||||
# Performance
|
||||
|
||||
## Data Pipeline
|
||||
|
||||
The runtime of the data pipeline (i.e. genetic sequence search and template
|
||||
search) can vary significantly depending on the size of the input and the number
|
||||
of homologous sequences found, as well as the available hardware (disk speed can
|
||||
influence genetic search speed in particular). If you would like to improve
|
||||
performance, it’s recommended to increase the disk speed (e.g. by leveraging a
|
||||
RAM-backed filesystem), or increase the available CPU cores and add more
|
||||
parallelisation. Also note that for sequences with deep MSAs, Jackhmmer or
|
||||
Nhmmer may need substantially more RAM than the recommended 64 GB.
|
||||
|
||||
## Model Inference
|
||||
|
||||
Table 8 in the Supplementary Information of the
|
||||
[AlphaFold 3 paper](https://nature.com/articles/s41586-024-07487-w) provides
|
||||
compile-free inference timings for AlphaFold 3 when configured to run on 16
|
||||
NVIDIA A100s, with 40 GB of memory per device. In contrast, this repository
|
||||
supports running AlphaFold 3 on a single NVIDIA A100 with 80 GB of memory in a
|
||||
configuration optimised to maximise throughput.
|
||||
|
||||
We compare compile-free inference timings of these two setups in the table below
|
||||
using GPU seconds (i.e. multiplying by 16 when using 16 A100s). The setup in
|
||||
this repository is more efficient (by at least 2×) across all token sizes,
|
||||
indicating its suitability for high-throughput applications.
|
||||
|
||||
Num Tokens | 1 A100 80 GB (GPU secs) | 16 A100 40 GB (GPU secs) | Improvement
:--------- | ----------------------: | -----------------------: | ----------:
1024       | 62                      | 352                      | 5.7×
2048       | 275                     | 1136                     | 4.1×
3072       | 703                     | 2016                     | 2.9×
4096       | 1434                    | 3648                     | 2.5×
5120       | 2547                    | 5552                     | 2.2×
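
The improvement column is the ratio of the two GPU-second figures. The short sketch below recomputes it from the table and also divides the 16-GPU figure by 16 to recover the corresponding wall-clock time; the variable names are ours, and the numbers are copied from the table.

```python
# (num_tokens, GPU secs on 1 A100 80 GB, GPU secs on 16 A100 40 GB)
TIMINGS = [
    (1024, 62, 352),
    (2048, 275, 1136),
    (3072, 703, 2016),
    (4096, 1434, 3648),
    (5120, 2547, 5552),
]

for tokens, single_gpu_secs, multi_gpu_secs in TIMINGS:
    improvement = multi_gpu_secs / single_gpu_secs
    # GPU seconds = wall-clock seconds x number of GPUs, so divide by 16.
    wall_clock_16 = multi_gpu_secs / 16
    print(
        f"{tokens} tokens: {improvement:.1f}x fewer GPU seconds; "
        f"16-GPU wall clock {wall_clock_16:.0f} s vs {single_gpu_secs} s"
    )
```

The single-GPU setup uses fewer GPU seconds in every row, but the 16-GPU setup still finishes sooner in wall-clock terms, so the comparison is about throughput per GPU rather than latency.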
|
||||
|
||||
## Running the Pipeline in Stages
|
||||
|
||||
The `run_alphafold.py` script can be executed in stages to optimise resource
|
||||
utilisation. This can be useful for:
|
||||
|
||||
1. Splitting the CPU-only data pipeline from model inference (which requires a
|
||||
GPU), to optimise cost and resource usage.
|
||||
1. Caching the results of MSA/template search, then reusing the augmented JSON
|
||||
for multiple different inferences across seeds or across variations of other
|
||||
features (e.g. a ligand).
|
||||
|
||||
### Data Pipeline Only
|
||||
|
||||
Launch `run_alphafold.py` with `--norun_inference` to generate Multiple Sequence
|
||||
Alignments (MSAs) and templates, without running featurisation and model
|
||||
inference. This stage can be quite costly in terms of runtime, CPU, and RAM use.
|
||||
The output will be JSON files augmented with MSAs and templates that can then be
|
||||
directly used as input for running inference.
|
||||
|
||||
### Featurisation and Model Inference Only
|
||||
|
||||
Launch `run_alphafold.py` with `--norun_data_pipeline` to skip the data pipeline
|
||||
and run only featurisation and model inference. This stage requires the input
|
||||
JSON file to contain pre-computed MSAs and templates.
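
The two stages can be scripted as separate commands, for example to run the first on a CPU-only machine and the second on a GPU machine. The sketch below only builds and prints the command lines. The helper names are hypothetical, the paths are the container paths used in the installation example, and the location of the augmented JSON (`<output_dir>/<job_name>/<job_name>_data.json`) is an assumption based on the output directory layout.

```python
import shlex
from pathlib import Path


def data_pipeline_command(json_path: Path, output_dir: Path) -> list[str]:
    """Stage 1: MSA and template search only, no inference."""
    return [
        "python", "run_alphafold.py",
        f"--json_path={json_path}",
        f"--output_dir={output_dir}",
        "--norun_inference",
    ]


def inference_command(
    augmented_json: Path, model_dir: Path, output_dir: Path
) -> list[str]:
    """Stage 2: featurisation and inference only, no data pipeline."""
    return [
        "python", "run_alphafold.py",
        f"--json_path={augmented_json}",
        f"--model_dir={model_dir}",
        f"--output_dir={output_dir}",
        "--norun_data_pipeline",
    ]


output_dir = Path("/root/af_output")
# Assumed location of the augmented JSON written by stage 1 for a job
# whose sanitized name is "hello_fold".
augmented = output_dir / "hello_fold" / "hello_fold_data.json"

print(shlex.join(
    data_pipeline_command(Path("/root/af_input/fold_input.json"), output_dir)
))
print(shlex.join(inference_command(augmented, Path("/root/models"), output_dir)))
```

Because the augmented JSON already contains the MSAs and templates, the second command can be repeated with different seeds without repeating the search.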
|
||||
|
||||
## Accelerator Hardware Requirements
|
||||
|
||||
We officially support the following configurations, and have extensively tested
|
||||
them for numerical accuracy and throughput efficiency:
|
||||
|
||||
- 1 NVIDIA A100 (80 GB)
|
||||
- 1 NVIDIA H100 (80 GB)
|
||||
|
||||
### Other Hardware Configurations
|
||||
|
||||
#### NVIDIA A100 (40 GB)
|
||||
|
||||
AlphaFold 3 can run on a single NVIDIA A100 (40 GB) with the following
|
||||
configuration changes:
|
||||
|
||||
1. Enabling [unified memory](#unified-memory).
|
||||
1. Adjusting `pair_transition_shard_spec` in `model_config.py`:
|
||||
|
||||
```py
|
||||
pair_transition_shard_spec: Sequence[_Shape2DType] = (
|
||||
(2048, None),
|
||||
(3072, 1024),
|
||||
(None, 512),
|
||||
)
|
||||
```
|
||||
|
||||
While numerically accurate, this configuration will have lower throughput
|
||||
compared to the setup on the NVIDIA A100 (80 GB), due to less available memory.
|
||||
|
||||
#### NVIDIA V100 (16 GB)
|
||||
|
||||
While you can run AlphaFold 3 on sequences up to 1,280 tokens on a single NVIDIA
|
||||
V100 using the flag `--flash_attention_implementation=xla` in
|
||||
`run_alphafold.py`, this configuration has not been tested for numerical
|
||||
accuracy or throughput efficiency, so please proceed with caution.
|
||||
|
||||
## Additional Flags
|
||||
|
||||
### Compilation Time Workaround with XLA Flags
|
||||
|
||||
To work around a known XLA issue causing the compilation time to greatly
|
||||
increase, the following environment variable must be set (it is set by default
|
||||
in the provided `Dockerfile`).
|
||||
|
||||
```dockerfile
|
||||
ENV XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
|
||||
```
|
||||
|
||||
### GPU Memory
|
||||
|
||||
The following environment variables (set by default in the `Dockerfile`) enable
|
||||
folding a single input of size up to 5,120 tokens on a single A100 with 80 GB of
|
||||
memory:
|
||||
|
||||
```dockerfile
|
||||
ENV XLA_PYTHON_CLIENT_PREALLOCATE=true
|
||||
ENV XLA_CLIENT_MEM_FRACTION=0.95
|
||||
```
|
||||
|
||||
#### Unified Memory
|
||||
|
||||
If you would like to run AlphaFold 3 on a GPU with less memory (an A100 with 40
|
||||
GB of memory, for instance), we recommend enabling unified memory. Enabling
|
||||
unified memory allows the program to spill GPU memory to host memory if there
|
||||
isn't enough space. This prevents an out-of-memory (OOM) error, at the cost of making the program
|
||||
slower by accessing host memory instead of device memory. To learn more, check
|
||||
out the
|
||||
[NVIDIA blog post](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/).
|
||||
|
||||
You can enable unified memory by setting the following environment variables in
|
||||
your `Dockerfile`:
|
||||
|
||||
```dockerfile
|
||||
ENV XLA_PYTHON_CLIENT_PREALLOCATE=false
|
||||
ENV TF_FORCE_UNIFIED_MEMORY=true
|
||||
ENV XLA_CLIENT_MEM_FRACTION=3.2
|
||||
```
|
||||
|
||||
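To make the numbers above more concrete, here is a minimal sketch that applies the same three settings from Python and estimates the resulting memory budget. It assumes that the memory fraction is interpreted relative to the GPU's device memory, so that a value above 1 lets allocations spill into host memory once unified memory is on. The helper names are hypothetical, and the settings only take effect if they are applied before JAX is imported.

```python
import os

# Same values as the Dockerfile snippet above.
UNIFIED_MEMORY_ENV = {
    "XLA_PYTHON_CLIENT_PREALLOCATE": "false",
    "TF_FORCE_UNIFIED_MEMORY": "true",
    "XLA_CLIENT_MEM_FRACTION": "3.2",
}


def enable_unified_memory(environ=os.environ) -> None:
    """Copies the unified memory settings into `environ`."""
    environ.update(UNIFIED_MEMORY_ENV)


def memory_budget_gb(device_memory_gb: float, mem_fraction: float) -> float:
    """Estimates the addressable memory, assuming the fraction scales device memory."""
    return device_memory_gb * mem_fraction
```

Under that assumption, a fraction of 3.2 on an A100 (40 GB) corresponds to roughly 128 GB of addressable memory, while the default fraction of 0.95 on an A100 (80 GB) corresponds to roughly 76 GB. The host needs enough free RAM to back whatever spills over.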
### JAX Persistent Compilation Cache
|
||||
|
||||
You may also want to make use of the JAX persistent compilation cache, to avoid
|
||||
unnecessary recompilation of the model between runs. You can enable the
|
||||
compilation cache with the `--jax_compilation_cache_dir <YOUR_DIRECTORY>` flag
|
||||
in `run_alphafold.py`.
|
||||
|
||||
More detailed instructions are available in the
|
||||
[JAX documentation](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html#persistent-compilation-cache),
|
||||
and more specifically the instructions for use on
|
||||
[Google Cloud](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html#google-cloud).
|
||||
In particular, note that if you would like to make use of a non-local
|
||||
filesystem, such as Google Cloud Storage, you will need to install
|
||||
[`etils`](https://github.com/google/etils) (this is not included by default in
|
||||
the AlphaFold 3 Docker container).
|
||||
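The flags mentioned in this document can be combined on one command line. The sketch below is a hypothetical wrapper that assembles such a command as an argument list. It only covers `--json_path`, `--jax_compilation_cache_dir` and `--flash_attention_implementation`; any other flags your setup needs are deliberately left out, and the file and directory names in the usage note are placeholders.

```python
from typing import List, Optional


def build_command(
    json_path: str,
    cache_dir: Optional[str] = None,
    flash_attention_implementation: Optional[str] = None,
) -> List[str]:
    """Builds an argument list for run_alphafold.py (hypothetical helper)."""
    cmd = ["python", "run_alphafold.py", f"--json_path={json_path}"]
    if cache_dir is not None:
        # Reuse compiled programs between runs.
        cmd.append(f"--jax_compilation_cache_dir={cache_dir}")
    if flash_attention_implementation is not None:
        # For example "xla" on an NVIDIA V100.
        cmd.append(
            f"--flash_attention_implementation={flash_attention_implementation}"
        )
    return cmd
```

For example, `build_command("fold_input.json", cache_dir="/tmp/jax_cache")` returns a list that you could pass to `subprocess.run(..., check=True)`. Pointing repeated runs at the same cache directory is what lets later runs skip recompilation.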