first add

2024-11-24 20:53:33 +08:00
commit c0239f4a3d
180 changed files with 57702 additions and 0 deletions
--- a/docs/contributing.md
+++ b/docs/contributing.md
@@ -0,0 +1,31 @@
+# How to Contribute
+
+We welcome small patches related to bug fixes and documentation, but we do not
+plan to make any major changes to this repository.
+
+## Before You Begin
+
+### Sign Our Contributor License Agreement
+
+Contributions to this project must be accompanied by a
+[Contributor License Agreement](https://cla.developers.google.com/about) (CLA).
+You (or your employer) retain the copyright to your contribution; this simply
+gives us permission to use and redistribute your contributions as part of the
+project.
+
+If you or your current employer have already signed the Google CLA (even if it
+was for a different project), you probably don't need to do it again.
+
+Visit <https://cla.developers.google.com/> to see your current agreements or to
+sign a new one.
+
+### Review Our Community Guidelines
+
+This project follows
+[Google's Open Source Community Guidelines](https://opensource.google/conduct/).
+
+## Contribution Process
+
+We won't accept pull requests directly, but if you send one, we will review it.
+If we send a fix based on your pull request, we will make sure to credit you in
+the release notes.
--- a/docs/header.jpg
+++ b/docs/header.jpg
--- a/docs/input.md
+++ b/docs/input.md
@@ -0,0 +1,723 @@
+# AlphaFold 3 Input
+
+## Specifying Input Files
+
+You can provide inputs to `run_alphafold.py` in one of two ways:
+
+-   Single input file: Use the `--json_path` flag followed by the path to a
+    single JSON file.
+-   Multiple input files: Use the `--input_dir` flag followed by the path to a
+    directory of JSON files.
+
+## Input Format
+
+AlphaFold 3 uses a custom JSON input format differing from the
+[AlphaFold Server JSON input format](https://github.com/google-deepmind/alphafold/tree/main/server).
+See [below](#alphafold-server-json-compatibility) for more information.
+
+The custom AlphaFold 3 format allows:
+
+*   Specifying protein, RNA, and DNA chains, including modified residues.
+*   Specifying a custom multiple sequence alignment (MSA) for protein and RNA
+    chains.
+*   Specifying custom structural templates for protein chains.
+*   Specifying ligands using
+    [Chemical Component Dictionary (CCD)](https://www.wwpdb.org/data/ccd) codes.
+*   Specifying ligands using SMILES.
+*   Specifying ligands by defining them using the CCD mmCIF format and supplying
+    them via the [user-provided CCD](#user-provided-ccd).
+*   Specifying covalent bonds between entities.
+*   Specifying multiple random seeds.
+
+## AlphaFold Server JSON Compatibility
+
+The [AlphaFold Server](https://alphafoldserver.com/) uses a separate
+[JSON format](https://github.com/google-deepmind/alphafold/tree/main/server)
+from the one used here in the AlphaFold 3 codebase. In particular, the JSON
+format used in the AlphaFold 3 codebase offers more flexibility and control in
+defining custom ligands, branched glycans, and covalent bonds between entities.
+
+We provide a converter in `run_alphafold.py` which automatically detects the
+input JSON format, denoted `dialect` in the converter code. The converter
+denotes the AlphaFold Server JSON as `alphafoldserver`, and the JSON format
+defined here in the AlphaFold 3 codebase as `alphafold3`. If the detected input
+JSON format is `alphafoldserver`, then the converter will translate that into
+the JSON format `alphafold3`.
+
+### Multiple Inputs
+
+The top level of the `alphafoldserver` JSON format is a list, allowing
+specification of multiple inputs in a single JSON. In contrast, the `alphafold3`
+JSON format requires exactly one input per JSON file. Specifying multiple inputs
+in a single `alphafoldserver` JSON is fully supported.
+
+Note that the converter distinguishes between `alphafoldserver` and `alphafold3`
+JSON formats by checking whether the top level of the JSON is a list. In
+particular, if you pass in an `alphafoldserver`-style JSON without a top-level
+list, then this is considered incorrect and `run_alphafold.py` will raise an
+error.
+
+### Glycans
+
+If the JSON in `alphafoldserver` format specifies glycans, the converter will
+raise an error. This is because translating glycans specified in the
+`alphafoldserver` format to the `alphafold3` format is not currently supported.
+
+### Random Seeds
+
+The `alphafoldserver` JSON format allows users to specify `"modelSeeds": []`, in
+which case a seed is chosen randomly for the user. On the other hand, the
+`alphafold3` format requires users to specify a seed.
+
+The converter will choose a seed randomly if `"modelSeeds": []` is set when
+translating from `alphafoldserver` JSON format to `alphafold3` JSON format. If
+seeds are specified in the `alphafoldserver` JSON format, then those will be
+preserved in the translation to the `alphafold3` JSON format.
+
+### Ions
+
+While AlphaFold Server treats ions and ligands as different entity types in the
+JSON format, AlphaFold 3 treats ions as ligands. Therefore, to specify e.g. a
+magnesium ion, one would specify it as an entity of type `ligand` with
+`ccdCodes: ["MG"]`.
+
+### Sequence IDs
+
+The `alphafold3` JSON format requires the user to specify a unique identifier
+(`id`) for each entity. On the other hand, the `alphafoldserver` format does not
+allow specification of an `id` for each entity. Thus, the converter
+automatically assigns one.
+
+The converter iterates through the list provided in the `sequences` field of the
+`alphafoldserver` JSON format, assigning an `id` to each entity using the
+following order ("reverse spreadsheet style"):
+
+```
+A, B, ..., Z, AA, BA, CA, ..., ZA, AB, BB, CB, ..., ZB, ...
+```
+
+For any entity with `count > 1`, an `id` is assigned arbitrarily to each "copy"
+of the entity.
+
+## Top-level Structure
+
+The top-level structure of the input JSON is:
+
+```json
+{
+  "name": "Job name goes here",
+  "modelSeeds": [1, 2],  # At least one seed required.
+  "sequences": [
+    {"protein": {...}},
+    {"rna": {...}},
+    {"dna": {...}},
+    {"ligand": {...}}
+  ],
+  "bondedAtomPairs": [...],  # Optional
+  "userCCD": "...",  # Optional
+  "dialect": "alphafold3",  # Required
+  "version": 1  # Required
+}
+```
+
+The fields specify the following:
+
+*   `name: str`: The name of the job. A sanitised version of this name is used
+    for naming the output files.
+*   `modelSeeds: list[int]`: A list of integer random seeds. The pipeline and
+    the model will be invoked with each of the seeds in the list. I.e. if you
+    provide *n* random seeds, you will get *n* predicted structures, each with
+    the respective random seed. You must provide at least one random seed.
+*   `sequences: list[Protein | RNA | DNA | Ligand]`: A list of sequence
+    dictionaries, each defining a molecular entity, see below.
+*   `bondedAtomPairs: list[Bond]`: An optional list of covalently bonded atoms.
+    These can link atoms within an entity, or across two entities. See more
+    below.
+*   `userCCD: str`: An optional string with user-provided chemical components
+    dictionary. This is an expert mode for providing custom molecules when
+    SMILES is not sufficient. This should also be used when you have a custom
+    molecule that needs to be bonded with other entities - SMILES can't be used
+    in such cases since it doesn't give the possibility of uniquely naming all
+    atoms. It can also be used to provide a reference conformer for cases where
+    RDKit fails to generate a conformer. See more below.
+*   `dialect: str`: The dialect of the input JSON. This must be set to
+    `alphafold3`. See
+    [AlphaFold Server JSON Compatibility](#alphafold-server-json-compatibility)
+    for more information.
+*   `version: int`: The version of the input JSON. This must be set to 1. See
+    [AlphaFold Server JSON Compatibility](#alphafold-server-json-compatibility)
+    for more information.
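+
+As a hedged illustration of the fields above, the following is a minimal input
+that sets only the required fields and omits the optional `bondedAtomPairs` and
+`userCCD`. The job name and the seed value are arbitrary placeholders, and the
+sequence is the illustrative one used elsewhere in this document, not a
+biologically meaningful protein.
+
+```json
+{
+  "name": "Minimal example",
+  "modelSeeds": [1],
+  "sequences": [
+    {
+      "protein": {
+        "id": "A",
+        "sequence": "PVLSCGEWQL"
+      }
+    }
+  ],
+  "dialect": "alphafold3",
+  "version": 1
+}
+```
+
+Unlike the annotated skeleton above, this block contains no comments or `...`
+placeholders, so it parses as strict JSON.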
+
+## Sequences
+
+The `sequences` section specifies the protein chains, RNA chains, DNA chains,
+and ligands. Every entity in `sequences` must have a unique ID. IDs don't have
+to be sorted alphabetically.
+
+### Protein
+
+Specifies a single protein chain.
+
+```json
+{
+  "protein": {
+    "id": "A",
+    "sequence": "PVLSCGEWQL",
+    "modifications": [
+      {"ptmType": "HY3", "ptmPosition": 1},
+      {"ptmType": "P1L", "ptmPosition": 5}
+    ],
+    "unpairedMsa": ...,
+    "pairedMsa": ...,
+    "templates": [...]
+  }
+}
+```
+
+The fields specify the following:
+
+*   `id: str | list[str]`: An uppercase letter or multiple letters specifying
+    the unique IDs for each copy of this protein chain. The IDs are then also
+    used in the output mmCIF file. Specifying a list of IDs (e.g. `["A", "B",
+    "C"]`) implies a homomeric chain with multiple copies.
+*   `sequence: str`: The amino-acid sequence, specified as a string that uses
+    the 1-letter standard amino acid codes.
+*   `modifications: list[ProteinModification]`: An optional list of
+    post-translational modifications. Each modification is specified using its
+    CCD code and 1-based residue position. In the example above, we see that the
+    first residue won't be a proline (`P`) but instead `HY3`.
+*   `unpairedMsa: str`: An optional multiple sequence alignment for this chain.
+    This is specified using the A3M format (equivalent to the FASTA format, but
+    also allows gaps denoted by the hyphen `-` character). See more details
+    below.
+*   `pairedMsa: str`: We recommend *not* using this optional field and using the
+    `unpairedMsa` for the purposes of pairing. See more details below.
+*   `templates: list[Template]`: An optional list of structural templates. See
+    more details below.
+
+### RNA
+
+Specifies a single RNA chain.
+
+```json
+{
+  "rna": {
+    "id": "A",
+    "sequence": "AGCU",
+    "modifications": [
+      {"modificationType": "2MG", "basePosition": 1},
+      {"modificationType": "5MC", "basePosition": 4}
+    ],
+    "unpairedMsa": ...
+  }
+}
+```
+
+The fields specify the following:
+
+*   `id: str | list[str]`: An uppercase letter or multiple letters specifying
+    the unique IDs for each copy of this RNA chain. The IDs are then also used
+    in the output mmCIF file. Specifying a list of IDs (e.g. `["A", "B", "C"]`)
+    implies a homomeric chain with multiple copies.
+*   `sequence: str`: The RNA sequence, specified as a string using only the
+    letters `A`, `C`, `G`, `U`.
+*   `modifications: list[RnaModification]`: An optional list of modifications.
+    Each modification is specified using its CCD code and 1-based base position.
+*   `unpairedMsa: str`: An optional multiple sequence alignment for this chain.
+    This is specified using the A3M format. See more details below.
+
+### DNA
+
+Specifies a single DNA chain.
+
+```json
+{
+  "dna": {
+    "id": "A",
+    "sequence": "GACCTCT",
+    "modifications": [
+      {"modificationType": "6OG", "basePosition": 1},
+      {"modificationType": "6MA", "basePosition": 2}
+    ]
+  }
+}
+```
+
+The fields specify the following:
+
+*   `id: str | list[str]`: An uppercase letter or multiple letters specifying
+    the unique IDs for each copy of this DNA chain. The IDs are then also used
+    in the output mmCIF file. Specifying a list of IDs (e.g. `["A", "B", "C"]`)
+    implies a homomeric chain with multiple copies.
+*   `sequence: str`: The DNA sequence, specified as a string using only the
+    letters `A`, `C`, `G`, `T`.
+*   `modifications: list[DnaModification]`: An optional list of modifications.
+    Each modification is specified using its CCD code and 1-based base position.
+
+### Ligands
+
+Specifies a single ligand. Ligands can be specified using 3 different formats:
+
+1.  [CCD code(s)](https://www.wwpdb.org/data/ccd). This is the easiest way to
+    specify ligands. Supports specifying covalent bonds to other entities. CCD
+    from 2022-09-28 is used. If multiple CCD codes are specified, you may want
+    to specify a bond between these and/or a bond to some other entity. See the
+    [bonds](#bonds) section below.
+2.  [SMILES string](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System).
+    This enables specifying ligands that are not in CCD. If using SMILES, you
+    cannot specify covalent bonds to other entities as these rely on specific
+    atom names - see the next option for what to use for this case.
+3.  User-provided CCD + custom ligand codes. This enables specifying ligands not
+    in CCD, while also supporting specification of covalent bonds to other
+    entities and backup reference coordinates for when RDKit fails to generate a
+    conformer. This offers the most flexibility, but also requires careful
+    attention to get all of the details right.
+
+```json
+{
+  "ligand": {
+    "id": ["G", "H", "I"],
+    "ccdCodes": ["ATP"]
+  }
+},
+{
+  "ligand": {
+    "id": "J",
+    "ccdCodes": ["LIG-1337"]
+  }
+},
+{
+  "ligand": {
+    "id": "K",
+    "smiles": "CC(=O)OC1C[NH+]2CCC1CC2"
+  }
+}
+```
+
+The fields specify the following:
+
+*   `id: str | list[str]`: An uppercase letter (or multiple letters) specifying
+    the unique ID of this ligand. This ID is then also used in the output mmCIF
+    file. Specifying a list of IDs (e.g. `["A", "B", "C"]`) implies a ligand
+    that has multiple copies.
+*   `ccdCodes: list[str]`: An optional list of CCD codes. These could be either
+    standard CCD codes, or custom codes pointing to the
+    [user-provided CCD](#user-provided-ccd).
+*   `smiles: str`: An optional string defining the ligand using a SMILES string.
+
+Each ligand may be specified using CCD codes or SMILES but not both, i.e. for a
+given ligand, the `ccdCodes` and `smiles` fields are mutually exclusive.
+
+### Ions
+
+Ions are treated as ligands, e.g. a magnesium ion would simply be a ligand with
+`ccdCodes: ["MG"]`.
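+
+Written out as a full entity, this could look like the sketch below. The ID `M`
+is an arbitrary placeholder; any ID not used by another entity should work.
+
+```json
+{
+  "ligand": {
+    "id": "M",
+    "ccdCodes": ["MG"]
+  }
+}
+```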
+
+## Multiple Sequence Alignment
+
+Protein and RNA chains allow setting a custom Multiple Sequence Alignment (MSA).
+If not set, the data pipeline will automatically build MSAs for protein and RNA
+entities using Jackhmmer/Nhmmer search over genetic databases as described in
+the paper.
+
+There are 3 modes for MSA:
+
+1.  If the `unpairedMsa` field is unset, AlphaFold 3 will build the MSA
+    automatically. This is the recommended option.
+2.  If the `unpairedMsa` field is set to an empty string (`""`), AlphaFold 3
+    will not build the MSA and the MSA input to the model will be empty.
+3.  If the `unpairedMsa` field is set to a custom A3M string, AlphaFold 3 will
+    use the provided MSA instead of building one as part of the data pipeline.
+    This is considered an expert option.
+
+Note that if you set the `unpairedMsa` field for a particular protein entity,
+you will also have to explicitly set the `pairedMsa` field (typically to an
+empty string) and the `templates` field (either to a list of templates, or to an
+empty list to run template-free). For example, this will run the protein chain A
+with the given MSA, but without any templates:
+
+```json
+{
+  "protein": {
+    "id": "A",
+    "sequence": ...,
+    "unpairedMsa": "The A3M you want to run with",
+    "pairedMsa": "",
+    "templates": []
+  }
+}
+```
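+
+The second mode follows the same pattern. As a sketch, assuming the same
+requirement to set `pairedMsa` and `templates` explicitly, an entity that runs
+with an empty MSA and no templates could look like this (the sequence is an
+illustrative placeholder):
+
+```json
+{
+  "protein": {
+    "id": "A",
+    "sequence": "PVLSCGEWQL",
+    "unpairedMsa": "",
+    "pairedMsa": "",
+    "templates": []
+  }
+}
+```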
+
+When setting your own MSA, you have to make sure that:
+
+1.  The MSA is a valid A3M file. This means adhering to the FASTA format while
+    also allowing lowercase characters denoting inserted residues and hyphens
+    (`-`) denoting gaps in sequences.
+2.  The first sequence is exactly equal to the query sequence.
+3.  If all insertions are removed from MSA hits (i.e. all lowercase letters are
+    removed), all sequences have exactly the same length as the query (they form
+    an exact rectangular matrix).
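+
+A small entity that satisfies these three rules is sketched below. The hit names
+(`hit_1`, `hit_2`) and the sequences are made up for illustration. Because the
+A3M is embedded in a JSON string, each line break is written as `\n`. The
+lowercase `a` in the second hit is an insertion: removing it leaves `DEEP`,
+which has the same length as the query.
+
+```json
+{
+  "protein": {
+    "id": "A",
+    "sequence": "DEEP",
+    "unpairedMsa": ">query\nDEEP\n>hit_1\nD--P\n>hit_2\nDEaEP\n",
+    "pairedMsa": "",
+    "templates": []
+  }
+}
+```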
+
+### MSA Pairing
+
+MSA pairing matters only when folding multiple chains (multimers), since we need
+to find a way to concatenate MSAs for the individual chains along the sequence
+dimension. If done naively, by simply concatenating the individual MSA matrices
+along the sequence dimension and padding so that all MSAs have the same depth,
+one can end up with rows in the concatenated MSA that are formed by sequences
+from different organisms.
+
+It may be desirable to ensure that across multiple chains, sequences in the MSA
+that are from the same organism end up in the same MSA row. AlphaFold 3
+internally achieves this by looking for the UniProt organism ID in the
+`pairedMsa` and pairing sequences based on this information.
+
+We recommend users do the pairing manually or use the output of appropriate
+software and then provide the MSA using only the `unpairedMsa` field. This
+method gives exact control over the placement of each sequence in the MSA, as
+opposed to relying on name-matching post-processing heuristics used for
+`pairedMsa`.
+
+When setting `unpairedMsa` manually, the `pairedMsa` field must be explicitly set
+to an empty string (`""`), as described above.
+
+For instance, if there are two chains `DEEP` and `MIND` which we want to be
+paired on organisms A and C, we can achieve it as follows:
+
+```text
+> query
+DEEP
+> match 1 (organism A)
+D--P
+> match 2 (organism B)
+DD-P
+> match 3 (organism C)
+DD-P
+```
+
+```text
+> query
+MIND
+> match 1 (organism A)
+M--D
+> Empty hit to make sure pairing is achieved
+----
+> match 2 (organism C)
+MIN-
+```
+
+The resulting MSA when chains are concatenated will then be:
+
+```text
+> query
+DEEPMIND
+> match 1 + match 1
+D--PM--D
+> match 2 + padding
+DD-P----
+> match 3 + match 2
+DD-PMIN-
+```
+
+## Structural Templates
+
+Structural templates can be specified only for protein chains:
+
+```json
+"templates": [
+  {
+    "mmcif": ...,
+    "queryIndices": [0, 1, 2, 4, 5, 6],
+    "templateIndices": [0, 1, 2, 3, 4, 8]
+  }
+]
+```
+
+A template is specified as an mmCIF string containing a single chain with the
+structural template together with a 0-based mapping that maps query residue
+indices to the template residue indices. The mapping is specified using two
+lists of the same length. E.g. to express a mapping `{0: 0, 1: 2, 2: 5, 3: 6}`,
+you would specify the two indices lists as:
+
+```json
+"queryIndices":    [0, 1, 2, 3],
+"templateIndices": [0, 2, 5, 6]
+```
+
+You can provide multiple structural templates. Note that if an mmCIF containing
+more than one chain is provided, you will get an error since it is not possible
+to determine which of the chains should be used as the template.
+
+## Bonds
+
+To manually specify covalent bonds, use the `bondedAtomPairs` field. This is
+intended for modelling covalent ligands, and for defining multi-CCD ligands
+(e.g. glycans). Defining covalent bonds between or within polymer entities is
+not currently supported.
+
+Bonds are specified as pairs of (source atom, destination atom), with each atom
+being uniquely addressed using 3 fields:
+
+*   **Entity ID** (`str`): this corresponds to the `id` field for that entity.
+*   **Residue ID** (`int`): this is 1-based residue index *within* the chain.
+    For single-residue ligands, this is simply set to 1.
+*   **Atom name** (`str`): this is the unique atom name *within* the given
+    residue. The atom name for protein/RNA/DNA residues or CCD ligands can be
+    looked up in the CCD for the given chemical component. This also explains
+    why SMILES ligands don't support bonds: there is no atom name that could be
+    used to define the bond. This shortcoming can be addressed by using the
+    user-provided CCD format (see below).
+
+The example below shows two bonds:
+
+```json
+"bondedAtomPairs": [
+  [["A", 145, "SG"], ["L", 1, "C04"]],
+  [["J", 1, "O6"], ["J", 2, "C1"]]
+]
+```
+
+The first bond is between chain A, residue 145, atom SG and chain L, residue 1,
+atom C04. This is a typical example for a covalent ligand. The second bond is
+between chain J, residue 1, atom O6 and chain J, residue 2, atom C1. This bond
+is within the same entity and is a typical example when defining a glycan.
+
+All bonds are implicitly assumed to be covalent bonds. Other bond types are not
+supported.
+
+### Defining Glycans
+
+Glycans are bound to a protein residue, and they are typically formed of
+multiple chemical components. To define a glycan, define a new ligand with all
+of the chemical components of the glycan. Then define a bond that links the
+glycan to the protein residue, and all bonds that are within the glycan between
+its individual chemical components.
+
+For example, to define the following glycan composed of 4 components (CMP1,
+CMP2, CMP3, CMP4) bound to an arginine in a protein chain A:
+
+```
+ ⋮
+ALA              CMP4
+ |                |
+ARG --- CMP1 --- CMP2
+ |                |
+ALA              CMP3
+ ⋮
+```
+
+You will need to specify:
+
+1.  Protein chain A.
+2.  Ligand chain B with the 4 components.
+3.  Bonds ARG-CMP1, CMP1-CMP2, CMP2-CMP3, CMP2-CMP4.
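+
+A hedged sketch of these three pieces is shown below. `CMP1` to `CMP4` are the
+placeholder component names from the diagram, not real CCD codes. The ligand
+atom names (`C1`, `O3`, `O4`, `O6`) and the choice of `NH1` on the arginine are
+assumptions made purely for illustration; look up the real atom names in the CCD
+for the components you use. The three-residue sequence `ARA` places the arginine
+at residue 2, and we assume that residue IDs within ligand B follow the 1-based
+order of `ccdCodes`. Only the relevant top-level fields are shown.
+
+```json
+{
+  "sequences": [
+    {"protein": {"id": "A", "sequence": "ARA"}},
+    {"ligand": {"id": "B", "ccdCodes": ["CMP1", "CMP2", "CMP3", "CMP4"]}}
+  ],
+  "bondedAtomPairs": [
+    [["A", 2, "NH1"], ["B", 1, "C1"]],
+    [["B", 1, "O4"], ["B", 2, "C1"]],
+    [["B", 2, "O3"], ["B", 3, "C1"]],
+    [["B", 2, "O6"], ["B", 4, "C1"]]
+  ]
+}
+```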
+
+## User-provided CCD
+
+There are two approaches to model a custom ligand not defined in the CCD. If the
+ligand is not bonded to other entities, it can be defined using a
+[SMILES string](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System).
+Otherwise, it is necessary to define that particular ligand using the
+[CCD mmCIF format](https://www.wwpdb.org/data/ccd#mmcifFormat).
+
+Once defined, this ligand needs to be assigned a name that doesn't clash with
+existing CCD ligand names (e.g. `LIG-1`). Avoid underscores (`_`) in the name,
+as they could cause issues in the mmCIF format.
+
+The newly defined ligand can then be used as a standard CCD ligand using its
+custom name, and bonds can be linked to it using its named atom scheme.
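+
+For illustration, the fragment below uses a custom component named `MY-X7F` (the
+name used in the example later in this section) as ligand `L` and bonds its
+`C04` atom to the `SG` atom of the cysteine at residue 5 of chain A. The choice
+of atoms is an assumption for illustration only and is not meant to be
+chemically meaningful. The definition of `MY-X7F` itself would be supplied in
+the top-level `userCCD` field, which is omitted here along with the other
+top-level fields.
+
+```json
+{
+  "sequences": [
+    {"protein": {"id": "A", "sequence": "PVLSCGEWQL"}},
+    {"ligand": {"id": "L", "ccdCodes": ["MY-X7F"]}}
+  ],
+  "bondedAtomPairs": [
+    [["A", 5, "SG"], ["L", 1, "C04"]]
+  ]
+}
+```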
+
+### User-provided CCD Format
+
+The user-provided CCD must be passed in the `userCCD` field (in the root of the
+input JSON) as a string. Note that JSON doesn't allow literal line breaks within
+strings, so each line break must be written as the escape sequence `\n`. Single
+rather than double quotes should also be used around strings like the chemical
+formula, since an unescaped double quote would terminate the JSON string.
+
+The main pieces of information used are the atom names and elements, bonds, and
+also the ideal coordinates (`pdbx_model_Cartn_{x,y,z}_ideal`) which essentially
+serve as a structural template for the ligand if RDKit fails to generate
+conformers for that ligand.
+
+The `userCCD` can also be used to redefine standard chemical components in the
+CCD. This can be useful if you need to redefine the ideal coordinates.
+
+Below is an example `userCCD` redefining component X7F, which serves to
+illustrate the required sections. For readability purposes, newlines have not
+been replaced by `\n`.
+
+```
+data_MY-X7F
+#
+_chem_comp.id MY-X7F
+_chem_comp.name '5,8-bis(oxidanyl)naphthalene-1,4-dione'
+_chem_comp.type non-polymer
+_chem_comp.formula 'C10 H6 O4'
+_chem_comp.mon_nstd_parent_comp_id ?
+_chem_comp.pdbx_synonyms ?
+_chem_comp.formula_weight 190.152
+#
+loop_
+_chem_comp_atom.comp_id
+_chem_comp_atom.atom_id
+_chem_comp_atom.alt_atom_id
+_chem_comp_atom.type_symbol
+_chem_comp_atom.charge
+_chem_comp_atom.pdbx_align
+_chem_comp_atom.pdbx_aromatic_flag
+_chem_comp_atom.pdbx_leaving_atom_flag
+_chem_comp_atom.pdbx_stereo_config
+_chem_comp_atom.pdbx_backbone_atom_flag
+_chem_comp_atom.pdbx_n_terminal_atom_flag
+_chem_comp_atom.pdbx_c_terminal_atom_flag
+_chem_comp_atom.model_Cartn_x
+_chem_comp_atom.model_Cartn_y
+_chem_comp_atom.model_Cartn_z
+_chem_comp_atom.pdbx_model_Cartn_x_ideal
+_chem_comp_atom.pdbx_model_Cartn_y_ideal
+_chem_comp_atom.pdbx_model_Cartn_z_ideal
+_chem_comp_atom.pdbx_component_atom_id
+_chem_comp_atom.pdbx_component_comp_id
+_chem_comp_atom.pdbx_ordinal
+MY-X7F C02 C1 C 0 1 N N N N N N 48.727 17.090 17.537 -1.418 -1.260 0.018 C02 MY-X7F 1
+MY-X7F C03 C2 C 0 1 N N N N N N 47.344 16.691 17.993 -0.665 -2.503 -0.247 C03 MY-X7F 2
+MY-X7F C04 C3 C 0 1 N N N N N N 47.166 16.016 19.310 0.677 -2.501 -0.235 C04 MY-X7F 3
+MY-X7F C05 C4 C 0 1 N N N N N N 48.363 15.728 20.184 1.421 -1.257 0.043 C05 MY-X7F 4
+MY-X7F C06 C5 C 0 1 Y N N N N N 49.790 16.142 19.699 0.706 0.032 0.008 C06 MY-X7F 5
+MY-X7F C07 C6 C 0 1 Y N N N N N 49.965 16.791 18.444 -0.706 0.030 -0.004 C07 MY-X7F 6
+MY-X7F C08 C7 C 0 1 Y N N N N N 51.249 17.162 18.023 -1.397 1.240 -0.037 C08 MY-X7F 7
+MY-X7F C10 C8 C 0 1 Y N N N N N 52.359 16.893 18.837 -0.685 2.443 -0.057 C10 MY-X7F 8
+MY-X7F C11 C9 C 0 1 Y N N N N N 52.184 16.247 20.090 0.679 2.445 -0.045 C11 MY-X7F 9
+MY-X7F C12 C10 C 0 1 Y N N N N N 50.899 15.876 20.515 1.394 1.243 -0.013 C12 MY-X7F 10
+MY-X7F O01 O1 O 0 1 N N N N N N 48.876 17.630 16.492 -2.611 -1.301 0.247 O01 MY-X7F 11
+MY-X7F O09 O2 O 0 1 N N N N N N 51.423 17.798 16.789 -2.752 1.249 -0.049 O09 MY-X7F 12
+MY-X7F O13 O3 O 0 1 N N N N N N 50.710 15.236 21.750 2.750 1.257 -0.001 O13 MY-X7F 13
+MY-X7F O14 O4 O 0 1 N N N N N N 48.229 15.189 21.234 2.609 -1.294 0.298 O14 MY-X7F 14
+MY-X7F H1 H1 H 0 1 N N N N N N 46.487 16.894 17.367 -1.199 -3.419 -0.452 H1 MY-X7F 15
+MY-X7F H2 H2 H 0 1 N N N N N N 46.178 15.732 19.640 1.216 -3.416 -0.429 H2 MY-X7F 16
+MY-X7F H3 H3 H 0 1 N N N N N N 53.348 17.177 18.511 -1.221 3.381 -0.082 H3 MY-X7F 17
+MY-X7F H4 H4 H 0 1 N N N N N N 53.040 16.041 20.716 1.212 3.384 -0.062 H4 MY-X7F 18
+MY-X7F H5 H5 H 0 1 N N N N N N 50.579 17.904 16.365 -3.154 1.271 0.830 H5 MY-X7F 19
+MY-X7F H6 H6 H 0 1 N N N N N N 49.785 15.059 21.877 3.151 1.241 -0.880 H6 MY-X7F 20
+#
+loop_
+_chem_comp_bond.comp_id
+_chem_comp_bond.atom_id_1
+_chem_comp_bond.atom_id_2
+_chem_comp_bond.value_order
+_chem_comp_bond.pdbx_aromatic_flag
+_chem_comp_bond.pdbx_stereo_config
+_chem_comp_bond.pdbx_ordinal
+MY-X7F O01 C02 DOUB N N 1
+MY-X7F O09 C08 SING N N 2
+MY-X7F C02 C03 SING N N 3
+MY-X7F C02 C07 SING N N 4
+MY-X7F C03 C04 DOUB N N 5
+MY-X7F C08 C07 DOUB Y N 6
+MY-X7F C08 C10 SING Y N 7
+MY-X7F C07 C06 SING Y N 8
+MY-X7F C10 C11 DOUB Y N 9
+MY-X7F C04 C05 SING N N 10
+MY-X7F C06 C05 SING N N 11
+MY-X7F C06 C12 DOUB Y N 12
+MY-X7F C11 C12 SING Y N 13
+MY-X7F C05 O14 DOUB N N 14
+MY-X7F C12 O13 SING N N 15
+MY-X7F C03 H1 SING N N 16
+MY-X7F C04 H2 SING N N 17
+MY-X7F C10 H3 SING N N 18
+MY-X7F C11 H4 SING N N 19
+MY-X7F O09 H5 SING N N 20
+MY-X7F O13 H6 SING N N 21
+#
+_pdbx_chem_comp_descriptor.type SMILES_CANONICAL
+_pdbx_chem_comp_descriptor.descriptor 'Oc1ccc(O)c2C(=O)C=CC(=O)c12'
+#
+```
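+
+To make the escaping concrete, the sketch below shows how the first lines of the
+example above would appear inside the input JSON once every line break is
+replaced by `\n`. The string is cut off after the `_chem_comp` block to keep it
+short, so it is not a complete component definition; a real `userCCD` value
+would continue with the atom, bond, and descriptor sections.
+
+```json
+{
+  "userCCD": "data_MY-X7F\n#\n_chem_comp.id MY-X7F\n_chem_comp.name '5,8-bis(oxidanyl)naphthalene-1,4-dione'\n_chem_comp.type non-polymer\n_chem_comp.formula 'C10 H6 O4'\n_chem_comp.mon_nstd_parent_comp_id ?\n_chem_comp.pdbx_synonyms ?\n_chem_comp.formula_weight 190.152\n#\n"
+}
+```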
+
+## Full Example
+
+An example illustrating all the aspects of the input format is provided below.
+Note that AlphaFold 3 won't run this input out of the box as it abbreviates
+certain fields and the sequences are not biologically meaningful.
+
+```json
+{
+  "name": "Hello fold",
+  "modelSeeds": [10, 42],
+  "sequences": [
+    {
+      "protein": {
+        "id": "A",
+        "sequence": "PVLSCGEWQL",
+        "modifications": [
+          {"ptmType": "HY3", "ptmPosition": 1},
+          {"ptmType": "P1L", "ptmPosition": 5}
+        ],
+        "unpairedMsa": ...
+      }
+    },
+    {
+      "protein": {
+        "id": "B",
+        "sequence": "RPACQLW",
+        "templates": [
+          {
+            "mmcif": ...,
+            "queryIndices": [0, 1, 2, 4, 5, 6],
+            "templateIndices": [0, 1, 2, 3, 4, 8]
+          }
+        ]
+      }
+    },
+    {
+      "dna": {
+        "id": "C",
+        "sequence": "GACCTCT",
+        "modifications": [
+          {"modificationType": "6OG", "basePosition": 1},
+          {"modificationType": "6MA", "basePosition": 2}
+        ]
+      }
+    },
+    {
+      "rna": {
+        "id": "E",
+        "sequence": "AGCU",
+        "modifications": [
+          {"modificationType": "2MG", "basePosition": 1},
+          {"modificationType": "5MC", "basePosition": 4}
+        ],
+        "unpairedMsa": ...
+      }
+    },
+    {
+      "ligand": {
+        "id": ["F", "G", "H"],
+        "ccdCodes": ["ATP"]
+      }
+    },
+    {
+      "ligand": {
+        "id": "I",
+        "ccdCodes": ["NAG", "FUC"]
+      }
+    },
+    {
+      "ligand": {
+        "id": "Z",
+        "smiles": "CC(=O)OC1C[NH+]2CCC1CC2"
+      }
+    }
+  ],
+  "bondedAtomPairs": [
+    [["A", 1, "CA"], ["G", 1, "CHA"]],
+    [["I", 1, "O6"], ["I", 2, "C1"]]
+  ],
+  "userCCD": ...,
+  "dialect": "alphafold3",
+  "version": 1
+}
+
+```
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -0,0 +1,355 @@
+# Installation and Running Your First Prediction
+
+You will need a machine running Linux; AlphaFold 3 does not support other
+operating systems. Full installation requires up to 1 TB of disk space to keep
+genetic databases (SSD storage is recommended) and an NVIDIA GPU with Compute
+Capability 8.0 or greater (GPUs with more memory can predict larger protein
+structures). We have verified that inputs with up to 5,120 tokens can fit on a
+single NVIDIA A100 80 GB, or a single NVIDIA H100 80 GB. We have verified
+numerical accuracy on both NVIDIA A100 and H100 GPUs.
+
+Especially for long targets, the genetic search stage can consume a lot of RAM –
+we recommend running with at least 64 GB of RAM.
+
+We provide installation instructions for a machine with an NVIDIA A100 80 GB GPU
+and a clean Ubuntu 22.04 LTS installation, and expect that these instructions
+should aid others with different setups.
+
+The instructions provided below describe how to:
+
+1.  Provision a machine on GCP.
+1.  Install Docker.
+1.  Install NVIDIA drivers for an A100.
+1.  Obtain genetic databases.
+1.  Obtain model parameters.
+1.  Build the AlphaFold 3 Docker container or Singularity image.
+
+## Provisioning a Machine
+
+Clean Ubuntu images are available on Google Cloud, AWS, Azure, and other major
+platforms.
+
+We first provisioned a new machine in Google Cloud Platform using the following
+command. We were using a Google Cloud project that was already set up.
+
+*   We recommend using `--machine-type a2-ultragpu-1g` but feel free to use
+    `--machine-type a2-highgpu-1g` for smaller predictions.
+*   If desired, replace `--zone us-central1-a` with a zone that has quota for
+    the machine you have selected. See
+    [gpu-regions-zones](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones).
+
+```sh
+gcloud compute instances create alphafold3 \
+    --machine-type a2-ultragpu-1g \
+    --zone us-central1-a \
+    --image-family ubuntu-2204-lts \
+    --image-project ubuntu-os-cloud \
+    --maintenance-policy TERMINATE \
+    --boot-disk-size 1000 \
+    --boot-disk-type pd-balanced
+```
+
+This provisions a bare Ubuntu 22.04 LTS image on an
+[A2 Ultra](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a2-vms)
+machine with 12 CPUs, 170 GB RAM, a 1 TB disk, and an NVIDIA A100 80 GB GPU
+attached. We verified the following installation steps from this point.
+
+## Installing Docker
+
+These instructions are for rootless Docker.
+
+### Installing Docker on Host
+
+Note that these instructions apply only to Ubuntu 22.04 LTS images (see above).
+
+Add Docker's official GPG key. Official Docker instructions are
+[here](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository).
+The commands we ran are:
+
+```sh
+sudo apt-get update
+sudo apt-get install ca-certificates curl
+sudo install -m 0755 -d /etc/apt/keyrings
+sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
+sudo chmod a+r /etc/apt/keyrings/docker.asc
+```
+
+Add the repository to apt sources:
+
+```sh
+echo \
+  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
+  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
+  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
+sudo apt-get update
+sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
+sudo docker run hello-world
+```
+
+### Enabling Rootless Docker
+
+Official Docker instructions are
+[here](https://docs.docker.com/engine/security/rootless/#distribution-specific-hint).
+The commands we ran are:
+
+```sh
+sudo apt-get install -y uidmap systemd-container
+
+sudo machinectl shell $(whoami)@ /bin/bash -c 'dockerd-rootless-setuptool.sh install && sudo loginctl enable-linger $(whoami) && DOCKER_HOST=unix:///run/user/1001/docker.sock docker context use rootless'
+```
+
+## Installing GPU Support
+
+### Installing NVIDIA Drivers
+
+Official Ubuntu instructions are
+[here](https://documentation.ubuntu.com/server/how-to/graphics/install-nvidia-drivers/).
+The commands we ran are:
+
+```sh
+sudo apt-get -y install alsa-utils ubuntu-drivers-common
+sudo ubuntu-drivers install
+
+sudo nvidia-smi --gpu-reset
+
+nvidia-smi  # Check that the drivers are installed.
+```
+
+Accept the "Pending kernel upgrade" dialog if it appears.
+
+You will need to reboot the instance with `sudo reboot now` to reset the GPU if
+you see the following warning:
+
+```text
+NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
+Make sure that the latest NVIDIA driver is installed and running.
+```
+
+Proceed only if `nvidia-smi` prints sensible output.
+
+### Installing NVIDIA Support for Docker
+
+Official NVIDIA instructions are
+[here](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
+The commands we ran are:
+
+```sh
+curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
+  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
+    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
+    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
+sudo apt-get update
+sudo apt-get install -y nvidia-container-toolkit
+nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
+systemctl --user restart docker
+sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place
+```
+
+Check that your container can see the GPU:
+
+```sh
+docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
+```
+
+The output should look similar to this:
+
+```text
+Mon Nov  11 12:00:00 2024
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.6     |
+|-----------------------------------------+------------------------+----------------------+
+| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
+|                                         |                        |               MIG M. |
+|=========================================+========================+======================|
+|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:05.0 Off |                    0 |
+| N/A   34C    P0             51W /  400W |       1MiB /  81920MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+
++-----------------------------------------------------------------------------------------+
+| Processes:                                                                              |
+|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
+|        ID   ID                                                               Usage      |
+|=========================================================================================|
+|  No running processes found                                                             |
++-----------------------------------------------------------------------------------------+
+```
+
+## Obtaining AlphaFold 3 Source Code
+
+You will need to have `git` installed to download the AlphaFold 3 repository:
+
+```sh
+git clone https://github.com/google-deepmind/alphafold3.git
+```
+
+## Obtaining Genetic Databases
+
+This step requires `curl` and `zstd` to be installed on your machine.
+
+AlphaFold 3 needs multiple genetic (sequence) protein and RNA databases to run:
+
+*   [BFD small](https://bfd.mmseqs.com/)
+*   [MGnify](https://www.ebi.ac.uk/metagenomics/)
+*   [PDB](https://www.rcsb.org/) (structures in the mmCIF format)
+*   [PDB seqres](https://www.rcsb.org/)
+*   [UniProt](https://www.uniprot.org/uniprot/)
+*   [UniRef90](https://www.uniprot.org/help/uniref)
+*   [NT](https://www.ncbi.nlm.nih.gov/nucleotide/)
+*   [RFam](https://rfam.org/)
+*   [RNACentral](https://rnacentral.org/)
+
+We provide a Python program `fetch_databases.py` that can be used to download
+and set up all of these databases. This process takes around 45 minutes when not
+installing on local SSD. We recommend running the following in a `screen` or
+`tmux` session as downloading and decompressing the databases takes some time.
+
+```sh
+cd alphafold3  # Navigate to the directory with cloned AlphaFold 3 repository.
+python3 fetch_databases.py --download_destination=<DATABASES_DIR>
+```
+
+This script downloads the databases from a mirror hosted on GCS, with all
+versions being the same as used in the AlphaFold 3 paper.
+
+:ledger: **Note: The download directory `<DATABASES_DIR>` should *not* be a
+subdirectory in the AlphaFold 3 repository directory.** If it is, the Docker
+build will be slow as the large databases will be copied during the image
+creation.
+
+:ledger: **Note: The total download size for the full databases is around 252 GB
+and the total size when unzipped is 630 GB. Please make sure you have sufficient
+hard drive space, bandwidth, and time to download. We recommend using an SSD for
+better genetic search performance, and faster runtime of `fetch_databases.py`.**
+
+:ledger: **Note: If the download directory and datasets don't have full read and
+write permissions, it can cause errors with the MSA tools, with opaque
+(external) error messages. Please ensure the required permissions are applied,
+e.g. with the `sudo chmod 755 --recursive <DATABASES_DIR>` command.**
+
+Once the script has finished, you should have the following directory structure:
+
+```sh
+pdb_2022_09_28_mmcif_files.tar  # ~200k PDB mmCIF files in this tar.
+bfd-first_non_consensus_sequences.fasta
+mgy_clusters_2022_05.fa
+nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta
+pdb_seqres_2022_09_28.fasta
+rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta
+rnacentral_active_seq_id_90_cov_80_linclust.fasta
+uniprot_all_2021_04.fa
+uniref90_2022_05.fa
+```
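+
+To confirm that the download completed, you could compare the contents of
+`<DATABASES_DIR>` against the listing above. The following is a minimal sketch,
+not part of the repository: the helper name `missing_database_files` is
+hypothetical, and the file names are copied from the listing above.
+
+```python
+import pathlib
+
+# File names copied from the directory listing above.
+EXPECTED_DATABASE_FILES = (
+    "pdb_2022_09_28_mmcif_files.tar",
+    "bfd-first_non_consensus_sequences.fasta",
+    "mgy_clusters_2022_05.fa",
+    "nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta",
+    "pdb_seqres_2022_09_28.fasta",
+    "rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta",
+    "rnacentral_active_seq_id_90_cov_80_linclust.fasta",
+    "uniprot_all_2021_04.fa",
+    "uniref90_2022_05.fa",
+)
+
+
+def missing_database_files(databases_dir):
+    """Returns the expected file names that are absent from databases_dir."""
+    root = pathlib.Path(databases_dir)
+    return [name for name in EXPECTED_DATABASE_FILES if not (root / name).is_file()]
+```
+
+An empty list means every file in the listing is present; it does not check
+file sizes or permissions.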
+
+## Obtaining Model Parameters
+
+To request access to the AlphaFold 3 model parameters, please complete
+[this form](https://forms.gle/svvpY4u2jsHEwWYS6). Access will be granted at
+Google DeepMind’s sole discretion. We will aim to respond to requests within 2–3
+business days. You may only use AlphaFold 3 model parameters if received
+directly from Google. Use is subject to these
+[terms of use](https://github.com/google-deepmind/alphafold3/blob/main/WEIGHTS_TERMS_OF_USE.md).
+
+## Building the Docker Container That Will Run AlphaFold 3
+
+Next, build the Docker container from the root of the cloned repository. The
+resulting image includes all the required Python dependencies:
+
+```sh
+docker build -t alphafold3 -f docker/Dockerfile .
+```
+
+You can now run AlphaFold 3!
+
+```sh
+docker run -it \
+    --volume $HOME/af_input:/root/af_input \
+    --volume $HOME/af_output:/root/af_output \
+    --volume <MODEL_PARAMETERS_DIR>:/root/models \
+    --volume <DATABASES_DIR>:/root/public_databases \
+    --gpus all \
+    alphafold3 \
+python run_alphafold.py \
+    --json_path=/root/af_input/fold_input.json \
+    --model_dir=/root/models \
+    --output_dir=/root/af_output
+```
+
+:ledger: **Note: In the example above the databases have been placed on the
+persistent disk, which is slow.** If you want better genetic and template search
+performance, make sure all databases are placed on a local SSD.
+
+If you get an error like the following, make sure the model parameters and
+databases exist at the host paths passed to the `--volume` flags above.
+
+```text
+docker: Error response from daemon: error while creating mount source path '/srv/alphafold3_data/models': mkdir /srv/alphafold3_data/models: permission denied.
+```
+
+## Running Using Singularity Instead of Docker
+
+You may prefer to run AlphaFold 3 within Singularity. You'll still need to
+*build* the Singularity image from the Docker container. Afterwards, you will
+not have to depend on Docker (at structure prediction time).
+
+### Install Singularity
+
+Official Singularity instructions are
+[here](https://docs.sylabs.io/guides/3.3/user-guide/installation.html). The
+commands we ran are:
+
+```sh
+wget https://github.com/sylabs/singularity/releases/download/v4.2.1/singularity-ce_4.2.1-jammy_amd64.deb
+sudo dpkg --install singularity-ce_4.2.1-jammy_amd64.deb
+sudo apt-get install -f
+```
+
+### Build the Singularity Container From the Docker Image
+
+After building the *Docker* container above with `docker build -t`, start a
+local Docker registry and upload your image `alphafold3` to it. Singularity's
+instructions are [here](https://github.com/apptainer/singularity/issues/1537).
+The commands we ran are:
+
+```sh
+docker run -d -p 5000:5000 --restart=always --name registry registry:2
+docker tag alphafold3 localhost:5000/alphafold3
+docker push localhost:5000/alphafold3
+```
+
+Then build the Singularity container:
+
+```sh
+SINGULARITY_NOHTTPS=1 singularity build alphafold3.simg docker://localhost:5000/alphafold3:latest
+```
+
+You can confirm your build by starting a shell and inspecting the environment.
+For example, you may want to ensure the Singularity image can access your GPU.
+You may want to restart your computer if you have issues with this.
+
+```sh
+singularity exec --nv alphafold3.simg sh -c 'nvidia-smi'
+```
+
+You can now run AlphaFold 3!
+
+```sh
+singularity exec --nv alphafold3.simg <<args>>
+```
+
+For example:
+
+```sh
+singularity exec \
+     --nv \
+     --bind $HOME/af_input:/root/af_input \
+     --bind $HOME/af_output:/root/af_output \
+     --bind <MODEL_PARAMETERS_DIR>:/root/models \
+     --bind <DATABASES_DIR>:/root/public_databases \
+     alphafold3.simg \
+python alphafold3/run_alphafold.py \
+     --json_path=/root/af_input/fold_input.json \
+     --model_dir=/root/models \
+     --db_dir=/root/public_databases \
+     --output_dir=/root/af_output
+```
--- a/docs/known_issues.md
+++ b/docs/known_issues.md
@@ -0,0 +1,16 @@
+# Known Issues
+
+## Numerical Accuracy above 5,120 Tokens
+
+AlphaFold 3 does not currently support inference on inputs larger than 5,120
+tokens. An error will be raised if the input is larger than this threshold.
+
+This is due to a numerical issue with the custom Pallas kernel implementing the
+Gated Linear Unit. The numerical issue only occurs for inputs above the
+5,120-token threshold, and results in degraded accuracy in the predicted
+structure.
+
+This numerical issue is unique to the single GPU configuration used in this
+repository, and does not affect the results in the
+[AlphaFold 3 paper](https://www.nature.com/articles/s41586-024-07487-w).
+
+We hope to resolve this issue soon and remove this check on input size.
--- a/docs/output.md
+++ b/docs/output.md
@@ -0,0 +1,187 @@
+# AlphaFold 3 Output
+
+## Output Directory Structure
+
+For every input job, AlphaFold 3 writes all its outputs to a directory named
+after the sanitized version of the job name. For example, for the job name
+"My first fold (test)", AlphaFold 3 writes its outputs to a directory called
+`my_first_fold_test`.
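+
+This document does not spell out the exact sanitization rule. The sketch below
+is an assumption that reproduces the example above: it lowercases the name,
+replaces spaces with underscores, and drops every other character that is not a
+lowercase letter, a digit, or one of `_`, `-`, `.`. The function name is
+hypothetical.
+
+```python
+import string
+
+
+def sanitised_job_name(name: str) -> str:
+    """Sketch of job-name sanitization (assumed rule, see the text above)."""
+    allowed = set(string.ascii_lowercase + string.digits + "_-.")
+    lowered = name.lower().replace(" ", "_")
+    # Keep only characters that are safe in a directory name.
+    return "".join(ch for ch in lowered if ch in allowed)
+
+
+print(sanitised_job_name("My first fold (test)"))  # my_first_fold_test
+```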
+
+The following structure is used within the output directory:
+
+*   Sub-directories with results for each sample and seed. There will be
+    *num\_seeds* \* *num\_samples* such sub-directories. The naming pattern is
+    `seed-<seed value>_sample-<sample number>`. Each of these directories
+    contains a confidence JSON, summary confidence JSON, and the mmCIF with the
+    predicted structure.
+*   Top-ranking prediction mmCIF: `<job_name>_model.cif`. This file contains the
+    predicted coordinates and should be compatible with most structural biology
+    tools. We do not provide the output in the PDB format, but the CIF file can
+    be easily converted into one if needed.
+*   Top-ranking prediction confidence JSON: `<job_name>_confidences.json`.
+*   Top-ranking prediction summary confidence JSON:
+    `<job_name>_summary_confidences.json`.
+*   Job input JSON file with the MSA and template data added by the data
+    pipeline: `<job_name>_data.json`.
+*   Ranking scores for all predictions: `ranking_scores.csv`. The prediction
+    with the highest ranking score is the one included in the root directory.
+*   Output terms of use: `TERMS_OF_USE.md`.
+
+Below is an example AlphaFold 3 output directory listing for a job called
+"Hello Fold", that has been run with 1 seed and 5 samples:
+
+```text
+hello_fold/
+├── seed-1234_sample-0/
+│   ├── confidences.json
+│   ├── model.cif
+│   └── summary_confidences.json
+├── seed-1234_sample-1/
+│   ├── confidences.json
+│   ├── model.cif
+│   └── summary_confidences.json
+├── seed-1234_sample-2/
+│   ├── confidences.json
+│   ├── model.cif
+│   └── summary_confidences.json
+├── seed-1234_sample-3/
+│   ├── confidences.json
+│   ├── model.cif
+│   └── summary_confidences.json
+├── seed-1234_sample-4/
+│   ├── confidences.json
+│   ├── model.cif
+│   └── summary_confidences.json
+├── TERMS_OF_USE.md
+├── hello_fold_confidences.json
+├── hello_fold_data.json
+├── hello_fold_model.cif
+├── hello_fold_summary_confidences.json
+└── ranking_scores.csv
+```
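+
+When post-processing many jobs, it can help to recover the seed and sample
+number from each sub-directory name. The following is a minimal sketch of
+parsing the `seed-<seed value>_sample-<sample number>` pattern; it assumes both
+values are non-negative integers, and the helper name is hypothetical.
+
+```python
+import re
+
+# Matches names such as "seed-1234_sample-0".
+_SAMPLE_DIR = re.compile(r"^seed-(\d+)_sample-(\d+)$")
+
+
+def parse_sample_dir(name):
+    """Returns (seed, sample) for a sample sub-directory name, else None."""
+    match = _SAMPLE_DIR.match(name)
+    if match is None:
+        return None
+    return int(match.group(1)), int(match.group(2))
+
+
+print(parse_sample_dir("seed-1234_sample-3"))  # (1234, 3)
+```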
+
+## Confidence Metrics
+
+Similar to AlphaFold2 and AlphaFold-Multimer, AlphaFold 3 outputs include
+confidence metrics. The main metrics are:
+
+*   **pLDDT**: a per-atom confidence estimate on a 0-100 scale where a higher
+    value indicates higher confidence. pLDDT aims to predict a modified LDDT
+    score that only considers distances to polymers. For proteins this is
+    similar to the
+    [lDDT-Cα metric](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3799472/) but
+    with more granularity as it can vary per atom not just per residue. For
+    ligand atoms, the modified LDDT considers the errors only between the ligand
+    atom and polymers, not other ligand atoms. For DNA/RNA a wider radius of 30
+    Å is used for the modified LDDT instead of 15 Å.
+*   **PAE (predicted aligned error)**: an estimate of the error in the relative
+    position and orientation between two tokens in the predicted structure.
+    Higher values indicate higher predicted error and therefore lower
+    confidence. For proteins and nucleic acids, PAE score is essentially the
+    same as AlphaFold2, where the error is measured relative to frames
+    constructed from the protein backbone. For small molecules and
+    post-translational modifications, a frame is constructed for each atom from
+    its closest neighbors from a reference conformer.
+*   **pTM and ipTM scores**: the predicted template modeling (pTM) score and the
+    interface predicted template modeling (ipTM) score are both derived from a
+    measure called the template modeling (TM) score. This measures the accuracy
+    of the entire structure
+    ([Zhang and Skolnick, 2004](https://doi.org/10.1002/prot.20264);
+    [Xu and Zhang, 2010](https://doi.org/10.1093/bioinformatics/btq066)). A pTM
+    score above 0.5 means the overall predicted fold for the complex might be
+    similar to the true structure. ipTM measures the accuracy of the predicted
+    relative positions of the subunits within the complex. Values higher than
+    0.8 represent confident high-quality predictions, while values below 0.6
+    suggest a failed prediction. ipTM values between 0.6 and 0.8 are a gray zone
+    where predictions could be correct or incorrect. The TM score is very strict
+    for small structures or short chains, so pTM assigns values less than 0.05
+    when fewer than 20 tokens are involved; for these cases PAE or pLDDT may be
+    more indicative of prediction quality.
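+
+The ipTM thresholds above can be written as a small helper. This is only an
+illustrative sketch: the function name and the returned labels are
+hypothetical, and the thresholds are rules of thumb, not hard guarantees.
+
+```python
+def interpret_iptm(iptm: float) -> str:
+    """Maps an ipTM value to the qualitative bands described above."""
+    if iptm > 0.8:
+        return "confident"  # Confident, high-quality prediction.
+    if iptm < 0.6:
+        return "likely failed"  # Suggests a failed prediction.
+    return "gray zone"  # Could be correct or incorrect.
+
+
+print(interpret_iptm(0.72))  # gray zone
+```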
+
+For a detailed description of these confidence metrics, see the
+[AlphaFold 3 paper](https://www.nature.com/articles/s41586-024-07487-w). For
+protein components, the
+[AlphaFold: A Practical guide](https://www.ebi.ac.uk/training/online/courses/alphafold/inputs-and-outputs/evaluating-alphafolds-predicted-structures-using-confidence-scores/)
+course for structures provides additional tutorials on the confidence metrics.
+
+If you are interested in a specific entity or interaction, then there are
+confidences available in the outputs which are specific to each chain or
+chain-pair, as opposed to the full complex. See below for more details on all
+the confidence metrics that are returned.
+
+## Multi-Seed and Multi-Sample Results
+
+By default, the model samples five predictions per seed. The top-ranked
+prediction across all samples and seeds is available at the top-level of the
+output directory. All samples along with their associated confidences are
+available in subdirectories of the output directory.
+
+For ranking of the full complex use the `ranking_score` (higher is better). This
+score uses overall structure confidences (pTM and ipTM), but also includes terms
+that penalize clashes and encourage disordered regions not to have spurious
+helices – these extra terms mean the score should only be used to rank
+structures.
+
+If you are interested in a specific entity or interaction, you may want to rank
+by a metric specific to that chain or chain-pair, as opposed to the full
+complex. In that case, use the per chain or per chain-pair confidence metrics
+described below for ranking.
+
+## Metrics in Confidences JSON
+
+For each predicted sample we provide two JSON files. One contains summary
+metrics – summaries for either the whole structure, per chain or per chain-pair
+– and the other contains full 1D or 2D arrays.
+
+Summary outputs:
+
+*   `ptm`: A scalar in the range 0-1 indicating the predicted TM-score for the
+    full structure.
+*   `iptm`: A scalar in the range 0-1 indicating predicted interface TM-score
+    (confidence in the predicted interfaces) for all interfaces in the
+    structure.
+*   `fraction_disordered`: A scalar in the range 0-1 that indicates what
+    fraction of the predicted structure is disordered, as measured by
+    accessible surface area. See our
+    [paper](https://www.nature.com/articles/s41586-024-07487-w) for details.
+*   `has_clash`: A boolean indicating if the structure has a significant number
+    of clashing atoms (more than 50% of a chain, or a chain with more than 100
+    clashing atoms).
+*   `ranking_score`: A scalar in the range \[-100, 1.5\] that can be used for
+    ranking predictions. It combines `ptm`, `iptm`, `fraction_disordered` and
+    `has_clash` into a single number with the following equation: 0.8 × ipTM
+    \+ 0.2 × pTM \+ 0.5 × disorder − 100 × has_clash.
+*   `chain_pair_pae_min`: A \[num_chains, num_chains\] array. Element (i, j) of
+    the array contains the lowest PAE value across rows restricted to chain i
+    and columns restricted to chain j. This has been found to correlate with
+    whether two chains interact or not, and in some cases can be used to
+    distinguish binders from non-binders.
+*   `chain_pair_iptm`: A \[num_chains, num_chains\] array. Off-diagonal element
+    (i, j) of the array contains the ipTM restricted to tokens from chains i and
+    j. Diagonal element (i, i) contains the pTM restricted to chain i. Can be
+    used for ranking a specific interface between two chains, when you know that
+    they interact, e.g. for antibody-antigen interactions.
+*   `chain_ptm`: A \[num_chains\] array. Element i contains the pTM restricted
+    to chain i. Can be used for ranking individual chains when the structure of
+    that chain is most of interest, rather than the cross-chain interactions it
+    is involved with.
+*   `chain_iptm`: A \[num_chains\] array that gives the average confidence
+    (interface pTM) in the interface between each chain and all other chains.
+    Can be used for ranking a specific chain, when you care about where the
+    chain binds to the rest of the complex and you do not know which other
+    chains you expect it to interact with. This is often the case with ligands.
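+
+As a worked example of the `ranking_score` equation above, here is a minimal
+sketch that recomputes the score from the summary metrics. The function name is
+hypothetical, and the reference value is always the `ranking_score` field
+written by AlphaFold 3 itself.
+
+```python
+def ranking_score(iptm, ptm, fraction_disordered, has_clash):
+    """0.8 * ipTM + 0.2 * pTM + 0.5 * disorder - 100 * has_clash."""
+    return (
+        0.8 * iptm
+        + 0.2 * ptm
+        + 0.5 * fraction_disordered
+        - 100.0 * float(has_clash)
+    )
+
+
+# The extremes match the documented range [-100, 1.5].
+print(ranking_score(1.0, 1.0, 1.0, False))  # 1.5
+print(ranking_score(0.0, 0.0, 0.0, True))  # -100.0
+```
+
+Because a clash subtracts 100, any prediction flagged with `has_clash` ranks
+below every prediction without one.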
+
+Full array outputs:
+
+*   `pae`: A \[num\_tokens, num\_tokens\] array. Element (i, j) indicates the
+    predicted error in the position of token j, when the prediction is aligned
+    to the ground truth using the frame of token i.
+*   `atom_plddts`: A \[num_atoms\] array. Element i indicates the predicted
+    local distance difference test (pLDDT) for atom i in the prediction.
+*   `contact_probs`: A \[num_tokens, num_tokens\] array. Element (i, j)
+    indicates the predicted probability that token i and token j are in contact
+    (within 8 Å, measured between the representative atoms of the two tokens);
+    see the [paper](https://www.nature.com/articles/s41586-024-07487-w) for
+    details.
+*   `token_chain_ids`: A \[num_tokens\] array indicating the chain ids
+    corresponding to each token in the prediction.
+*   `atom_chain_ids`: A \[num_atoms\] array indicating the chain ids
+    corresponding to each atom in the prediction.
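+
+To illustrate how the full arrays relate to the summary metrics, the sketch
+below derives a `chain_pair_pae_min`-style matrix from `pae` and
+`token_chain_ids`, following the definition given earlier (lowest PAE over rows
+of chain i and columns of chain j). It is not the repository's implementation:
+the function name is hypothetical, and ordering chains by first appearance is
+an assumption.
+
+```python
+def chain_pair_pae_min(pae, token_chain_ids):
+    """Returns (chains, matrix) with matrix[i][j] = min PAE for chains i, j."""
+    # Assumption: chains are ordered by their first appearance among tokens.
+    chains = list(dict.fromkeys(token_chain_ids))
+    matrix = []
+    for row_chain in chains:
+        rows = [i for i, c in enumerate(token_chain_ids) if c == row_chain]
+        matrix_row = []
+        for col_chain in chains:
+            cols = [j for j, c in enumerate(token_chain_ids) if c == col_chain]
+            matrix_row.append(min(pae[i][j] for i in rows for j in cols))
+        matrix.append(matrix_row)
+    return chains, matrix
+
+
+# Toy input: three tokens, the first two in chain A and the last in chain B.
+toy_pae = [[0.5, 4.0, 9.0], [3.0, 0.4, 7.0], [8.0, 6.0, 0.3]]
+print(chain_pair_pae_min(toy_pae, ["A", "A", "B"]))
+```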
--- a/docs/performance.md
+++ b/docs/performance.md
@@ -0,0 +1,153 @@
+# Performance
+
+## Data Pipeline
+
+The runtime of the data pipeline (i.e. genetic sequence search and template
+search) can vary significantly depending on the size of the input and the number
+of homologous sequences found, as well as the available hardware (disk speed can
+influence genetic search speed in particular). If you would like to improve
+performance, it’s recommended to increase the disk speed (e.g. by leveraging a
+RAM-backed filesystem), or increase the available CPU cores and add more
+parallelisation. Also note that for sequences with deep MSAs, Jackhmmer or
+Nhmmer may need substantially more RAM than the recommended 64 GB.
+
+## Model Inference
+
+Table 8 in the Supplementary Information of the
+[AlphaFold 3 paper](https://nature.com/articles/s41586-024-07487-w) provides
+compile-free inference timings for AlphaFold 3 when configured to run on 16
+NVIDIA A100s, with 40 GB of memory per device. In contrast, this repository
+supports running AlphaFold 3 on a single NVIDIA A100 with 80 GB of memory in a
+configuration optimised to maximise throughput.
+
+We compare compile-free inference timings of these two setups in the table below
+using GPU seconds (i.e. multiplying by 16 when using 16 A100s). The setup in
+this repository is more efficient (by at least 2×) across all token sizes,
+indicating its suitability for high-throughput applications.
+
+Num Tokens | 1 A100 80 GB (GPU secs) | 16 A100 40 GB (GPU secs) | Improvement
+:--------- | ----------------------: | -----------------------: | ----------:
+1024       | 62                      | 352                      | 5.7×
+2048       | 275                     | 1136                     | 4.1×
+3072       | 703                     | 2016                     | 2.9×
+4096       | 1434                    | 3648                     | 2.5×
+5120       | 2547                    | 5552                     | 2.2×
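+
+The Improvement column is the ratio of the two GPU-seconds columns. The short
+sketch below recomputes it from the table values and also divides the 16-GPU
+figure by 16 to recover the approximate wall-clock time of that setup. The
+variable and function names are illustrative only.
+
+```python
+# (num_tokens, GPU secs on 1 A100 80 GB, GPU secs on 16 A100 40 GB),
+# copied from the table above.
+TIMINGS = (
+    (1024, 62, 352),
+    (2048, 275, 1136),
+    (3072, 703, 2016),
+    (4096, 1434, 3648),
+    (5120, 2547, 5552),
+)
+
+
+def improvement(single_gpu_secs, multi_gpu_secs):
+    """Ratio of GPU seconds: 16-GPU setup over single-GPU setup."""
+    return multi_gpu_secs / single_gpu_secs
+
+
+for num_tokens, single, multi in TIMINGS:
+    wall_clock_16 = multi / 16  # GPU seconds back to wall-clock seconds.
+    print(
+        f"{num_tokens} tokens: {improvement(single, multi):.1f}x, "
+        f"{wall_clock_16:.1f} s wall-clock on 16 GPUs"
+    )
+```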
+
+## Running the Pipeline in Stages
+
+The `run_alphafold.py` script can be executed in stages to optimise resource
+utilisation. This can be useful for:
+
+1.  Splitting the CPU-only data pipeline from model inference (which requires a
+    GPU), to optimise cost and resource usage.
+1.  Caching the results of MSA/template search, then reusing the augmented JSON
+    for multiple different inferences across seeds or across variations of other
+    features (e.g. a ligand).
+
+### Data Pipeline Only
+
+Launch `run_alphafold.py` with `--norun_inference` to generate Multiple Sequence
+Alignments (MSAs) and templates, without running featurisation and model
+inference. This stage can be quite costly in terms of runtime, CPU, and RAM use.
+The output will be JSON files augmented with MSAs and templates that can then be
+directly used as input for running inference.
+
+### Featurisation and Model Inference Only
+
+Launch `run_alphafold.py` with `--norun_data_pipeline` to skip the data pipeline
+and run only featurisation and model inference. This stage requires the input
+JSON file to contain pre-computed MSAs and templates.
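+
+The two stages above can be sketched as a pair of command lines built in
+Python. This is a minimal illustration under stated assumptions: the helper
+names and all paths are hypothetical placeholders, only flags mentioned in this
+documentation are used, and your setup may need further flags (for example
+database locations).
+
+```python
+def data_pipeline_command(json_path, output_dir):
+    """Stage 1: MSA and template search only (CPU-only, no inference)."""
+    return [
+        "python",
+        "run_alphafold.py",
+        f"--json_path={json_path}",
+        f"--output_dir={output_dir}",
+        "--norun_inference",
+    ]
+
+
+def inference_command(augmented_json_path, model_dir, output_dir):
+    """Stage 2: featurisation and model inference only (requires a GPU)."""
+    return [
+        "python",
+        "run_alphafold.py",
+        f"--json_path={augmented_json_path}",
+        f"--model_dir={model_dir}",
+        f"--output_dir={output_dir}",
+        "--norun_data_pipeline",
+    ]
+
+
+# Placeholder paths; stage 2 takes the augmented JSON written by stage 1.
+print(" ".join(data_pipeline_command("/root/af_input/fold_input.json", "/root/af_output")))
+```
+
+The resulting lists can be passed to `subprocess.run`, possibly on different
+machines for the two stages.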
+
+## Accelerator Hardware Requirements
+
+We officially support the following configurations, and have extensively tested
+them for numerical accuracy and throughput efficiency:
+
+-   1 NVIDIA A100 (80 GB)
+-   1 NVIDIA H100 (80 GB)
+
+### Other Hardware Configurations
+
+#### NVIDIA A100 (40 GB)
+
+AlphaFold 3 can run on a single NVIDIA A100 (40 GB) with the following
+configuration changes:
+
+1.  Enabling [unified memory](#unified-memory).
+1.  Adjusting `pair_transition_shard_spec` in `model_config.py`:
+
+    ```py
+      pair_transition_shard_spec: Sequence[_Shape2DType] = (
+          (2048, None),
+          (3072, 1024),
+          (None, 512),
+      )
+    ```
+
+While numerically accurate, this configuration will have lower throughput
+compared to the setup on the NVIDIA A100 (80 GB), due to less available memory.
+
+#### NVIDIA V100 (16 GB)
+
+While you can run AlphaFold 3 on sequences up to 1,280 tokens on a single NVIDIA
+V100 using the flag `--flash_attention_implementation=xla` in
+`run_alphafold.py`, this configuration has not been tested for numerical
+accuracy or throughput efficiency, so please proceed with caution.
+
+## Additional Flags
+
+### Compilation Time Workaround with XLA Flags
+
+To work around a known XLA issue causing the compilation time to greatly
+increase, the following environment variable must be set (it is set by default
+in the provided `Dockerfile`).
+
+```sh
+ENV XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
+```
+
+### GPU Memory
+
+The following environment variables (set by default in the `Dockerfile`) enable
+folding a single input of size up to 5,120 tokens on a single A100 with 80 GB of
+memory:
+
+```sh
+ENV XLA_PYTHON_CLIENT_PREALLOCATE=true
+ENV XLA_CLIENT_MEM_FRACTION=0.95
+```
+
+#### Unified Memory
+
+If you would like to run AlphaFold 3 on a GPU with less memory (an A100 with 40
+GB of memory, for instance), we recommend enabling unified memory. Enabling
+unified memory allows the program to spill GPU memory to host memory if there
+isn't enough space. This prevents an OOM, at the cost of making the program
+slower by accessing host memory instead of device memory. To learn more, check
+out the
+[NVIDIA blog post](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/).
+
+You can enable unified memory by setting the following environment variables in
+your `Dockerfile`:
+
+```sh
+ENV XLA_PYTHON_CLIENT_PREALLOCATE=false
+ENV TF_FORCE_UNIFIED_MEMORY=true
+ENV XLA_CLIENT_MEM_FRACTION=3.2
+```
+
+### JAX Persistent Compilation Cache
+
+You may also want to make use of the JAX persistent compilation cache, to avoid
+unnecessary recompilation of the model between runs. You can enable the
+compilation cache with the `--jax_compilation_cache_dir <YOUR_DIRECTORY>` flag
+in `run_alphafold.py`.
+
+More detailed instructions are available in the
+[JAX documentation](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html#persistent-compilation-cache),
+and more specifically the instructions for use on
+[Google Cloud](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html#persistent-compilation-cache).
+In particular, note that if you would like to make use of a non-local
+filesystem, such as Google Cloud Storage, you will need to install
+[`etils`](https://github.com/google/etils) (this is not included by default in
+the AlphaFold 3 Docker container).