update
This commit is contained in:
668
test/tutorial.ipynb
Executable file
668
test/tutorial.ipynb
Executable file
@@ -0,0 +1,668 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Tutorial"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"colab_type": "text",
|
||||
"id": "tM3wFk1e_COd",
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"## The Basics\n",
|
||||
"We begin by importing `selfies`. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"base_uri": "https://localhost:8080/",
|
||||
"height": 89
|
||||
},
|
||||
"colab_type": "code",
|
||||
"id": "GH0DQxBN_Fei",
|
||||
"outputId": "56aa043e-df48-4081-f938-49711a166d33"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import selfies as sf"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"First, let's try translating between SMILES and SELFIES - as an example, we will use benzaldehyde. To translate from SMILES to SELFIES, use the `selfies.encoder` function, and to translate from SMILES back to SELFIES, use the `selfies.decoder` function."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"original_smiles = \"O=Cc1ccccc1\" # benzaldehyde\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" encoded_selfies = sf.encoder(original_smiles) # SMILES -> SELFIES\n",
|
||||
" decoded_smiles = sf.decoder(encoded_selfies) # SELFIES -> SMILES\n",
|
||||
"except sf.EncoderError as err: \n",
|
||||
" pass # sf.encoder error...\n",
|
||||
"except sf.DecoderError as err: \n",
|
||||
" pass # sf.decoder error..."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'[O][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]'"
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"encoded_selfies"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'O=CC1=CC=CC=C1'"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"decoded_smiles"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"colab_type": "text",
|
||||
"id": "Ng8PmMiB_RvJ"
|
||||
},
|
||||
"source": [
|
||||
"Note that `original_smiles` and `decoded_smiles` are different strings, but they both represent benzaldehyde. Thus, when comparing the two SMILES strings, string equality should _not_ be used. Insead, use RDKit to check whether the SMILES strings represent the same molecule."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"base_uri": "https://localhost:8080/",
|
||||
"height": 34
|
||||
},
|
||||
"colab_type": "code",
|
||||
"id": "iAc5FVrP_XV6",
|
||||
"outputId": "b503f896-a2a0-46a6-fc5b-9c474c01ba62"
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"True"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from rdkit import Chem\n",
|
||||
"\n",
|
||||
"Chem.CanonSmiles(original_smiles) == Chem.CanonSmiles(decoded_smiles)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"colab_type": "text",
|
||||
"id": "IKfNr5m6_h4f",
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"## Customizing SELFIES\n",
|
||||
"The SELFIES grammar is derived dynamically from a set of semantic constraints, which assign bonding capacities to various atoms. Let's customize the semantic constraints that `selfies` operates on. By default, the following constraints are used:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"base_uri": "https://localhost:8080/",
|
||||
"height": 200
|
||||
},
|
||||
"colab_type": "code",
|
||||
"id": "Xmce7wvV_t4Y",
|
||||
"outputId": "8b10af2f-486e-4910-8a71-055b59a09746"
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'H': 1,\n",
|
||||
" 'F': 1,\n",
|
||||
" 'Cl': 1,\n",
|
||||
" 'Br': 1,\n",
|
||||
" 'I': 1,\n",
|
||||
" 'O': 2,\n",
|
||||
" 'O+1': 3,\n",
|
||||
" 'O-1': 1,\n",
|
||||
" 'N': 3,\n",
|
||||
" 'N+1': 4,\n",
|
||||
" 'N-1': 2,\n",
|
||||
" 'C': 4,\n",
|
||||
" 'C+1': 5,\n",
|
||||
" 'C-1': 3,\n",
|
||||
" 'P': 5,\n",
|
||||
" 'P+1': 6,\n",
|
||||
" 'P-1': 4,\n",
|
||||
" 'S': 6,\n",
|
||||
" 'S+1': 7,\n",
|
||||
" 'S-1': 5,\n",
|
||||
" '?': 8}"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"sf.get_preset_constraints(\"default\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"These constraints map atoms (they keys) to their bonding capacities (the values). The special `?` key maps to the bonding capacity for all atoms that are not explicitly listed in the constraints. For example, S and Li are constrained to a maximum of 6 and 8 bonds, respectively. Every SELFIES string can be decoded into a molecule that obeys the current constraints."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'[Li]=CCS=CC#S'"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"sf.decoder(\"[Li][=C][C][S][=C][C][#S]\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"colab_type": "text",
|
||||
"id": "KevVGyIEAVlu"
|
||||
},
|
||||
"source": [
|
||||
"But suppose that we instead wanted to constrain S and Li to a maximum of 2 and 1 bond(s), respectively. To do so, we create a new set of constraints, and tell `selfies` to operate on them using `selfies.set_semantic_constraints`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {
|
||||
"colab": {},
|
||||
"colab_type": "code",
|
||||
"id": "y5EbmzkKATkD"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"new_constraints = sf.get_preset_constraints(\"default\")\n",
|
||||
"new_constraints['Li'] = 1\n",
|
||||
"new_constraints['S'] = 2\n",
|
||||
"\n",
|
||||
"sf.set_semantic_constraints(new_constraints)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To check that the update was succesful, we can use `selfies.get_semantic_constraints`, which returns the semantic constraints that `selfies` is currently operating on."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'H': 1,\n",
|
||||
" 'F': 1,\n",
|
||||
" 'Cl': 1,\n",
|
||||
" 'Br': 1,\n",
|
||||
" 'I': 1,\n",
|
||||
" 'O': 2,\n",
|
||||
" 'O+1': 3,\n",
|
||||
" 'O-1': 1,\n",
|
||||
" 'N': 3,\n",
|
||||
" 'N+1': 4,\n",
|
||||
" 'N-1': 2,\n",
|
||||
" 'C': 4,\n",
|
||||
" 'C+1': 5,\n",
|
||||
" 'C-1': 3,\n",
|
||||
" 'P': 5,\n",
|
||||
" 'P+1': 6,\n",
|
||||
" 'P-1': 4,\n",
|
||||
" 'S': 2,\n",
|
||||
" 'S+1': 7,\n",
|
||||
" 'S-1': 5,\n",
|
||||
" '?': 8,\n",
|
||||
" 'Li': 1}"
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"sf.get_semantic_constraints()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"colab_type": "text",
|
||||
"id": "e_djGmr_AvM7"
|
||||
},
|
||||
"source": [
|
||||
"Our previous SELFIES string is now decoded like so. Notice that the specified bonding capacities are met, with every S and Li making only 2 and 1 bonds, respectively."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {
|
||||
"colab": {},
|
||||
"colab_type": "code",
|
||||
"id": "TCzbjZMAAxpo"
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'[Li]CCSCC=S'"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"sf.decoder(\"[Li][=C][C][S][=C][C][#S]\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"colab_type": "text",
|
||||
"id": "Ng1Lr_e6A3cB"
|
||||
},
|
||||
"source": [
|
||||
"Finally, to revert back to the default constraints, simply call: "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {
|
||||
"colab": {},
|
||||
"colab_type": "code",
|
||||
"id": "zwC00Rx5A6eQ"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sf.set_semantic_constraints()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
" Please refer to the API reference for more details and more preset constraints.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## SELFIES in Practice \n",
|
||||
"\n",
|
||||
"Let's use a simple example to show how `selfies` can be used in practice, as well as highlight some convenient utility functions from the library. We start with a toy dataset of SMILES strings. As before, we can use `selfies.encoder` to convert the dataset into SELFIES form."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['[C][O][C]',\n",
|
||||
" '[F][C][F]',\n",
|
||||
" '[O][=O]',\n",
|
||||
" '[O][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]']"
|
||||
]
|
||||
},
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"smiles_dataset = [\"COC\", \"FCF\", \"O=O\", \"O=Cc1ccccc1\"]\n",
|
||||
"selfies_dataset = list(map(sf.encoder, smiles_dataset))\n",
|
||||
"\n",
|
||||
"selfies_dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The function `selfies.len_selfies` computes the symbol length of a SELFIES string. We can use it to find the maximum symbol length of the SELFIES strings in the dataset. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"10"
|
||||
]
|
||||
},
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"max_len = max(sf.len_selfies(s) for s in selfies_dataset)\n",
|
||||
"max_len"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To extract the SELFIES symbols that form the dataset, use `selfies.get_alphabet_from_selfies`. Here, we add `[nop]` to the alphabet, which is a special padding character that `selfies` recognizes."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['[=Branch1]', '[=C]', '[=O]', '[C]', '[F]', '[O]', '[Ring1]', '[nop]']"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"alphabet = sf.get_alphabet_from_selfies(selfies_dataset)\n",
|
||||
"alphabet.add(\"[nop]\")\n",
|
||||
"\n",
|
||||
"alphabet = list(sorted(alphabet))\n",
|
||||
"alphabet"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Then, create a mapping between the alphabet SELFIES symbols and indices."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"{'[=Branch1]': 0,\n",
|
||||
" '[=C]': 1,\n",
|
||||
" '[=O]': 2,\n",
|
||||
" '[C]': 3,\n",
|
||||
" '[F]': 4,\n",
|
||||
" '[O]': 5,\n",
|
||||
" '[Ring1]': 6,\n",
|
||||
" '[nop]': 7}"
|
||||
]
|
||||
},
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vocab_stoi = {symbol: idx for idx, symbol in enumerate(alphabet)}\n",
|
||||
"vocab_itos = {idx: symbol for symbol, idx in vocab_stoi.items()}\n",
|
||||
"\n",
|
||||
"vocab_stoi"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"SELFIES provides some convenience methods to convert between SELFIES strings and label (integer) and one-hot encodings. Using the first entry of the dataset (dimethyl ether) as an example:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dimethyl_ether = selfies_dataset[0]\n",
|
||||
"label, one_hot = sf.selfies_to_encoding(dimethyl_ether, vocab_stoi, pad_to_len=max_len)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[3, 5, 3, 7, 7, 7, 7, 7, 7, 7]"
|
||||
]
|
||||
},
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"label"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[[0, 0, 0, 1, 0, 0, 0, 0],\n",
|
||||
" [0, 0, 0, 0, 0, 1, 0, 0],\n",
|
||||
" [0, 0, 0, 1, 0, 0, 0, 0],\n",
|
||||
" [0, 0, 0, 0, 0, 0, 0, 1],\n",
|
||||
" [0, 0, 0, 0, 0, 0, 0, 1],\n",
|
||||
" [0, 0, 0, 0, 0, 0, 0, 1],\n",
|
||||
" [0, 0, 0, 0, 0, 0, 0, 1],\n",
|
||||
" [0, 0, 0, 0, 0, 0, 0, 1],\n",
|
||||
" [0, 0, 0, 0, 0, 0, 0, 1],\n",
|
||||
" [0, 0, 0, 0, 0, 0, 0, 1]]"
|
||||
]
|
||||
},
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"one_hot"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'[C][O][C][nop][nop][nop][nop][nop][nop][nop]'"
|
||||
]
|
||||
},
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"dimethyl_ether = sf.encoding_to_selfies(one_hot, vocab_itos, enc_type=\"one_hot\")\n",
|
||||
"dimethyl_ether"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'COC'"
|
||||
]
|
||||
},
|
||||
"execution_count": 22,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"sf.decoder(dimethyl_ether) # sf.decoder ignores [nop]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If different encoding strategies are desired, `selfies.split_selfies` can be used to tokenize a SELFIES string into its individual symbols."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 24,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['[C]', '[O]', '[C]']"
|
||||
]
|
||||
},
|
||||
"execution_count": 24,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"list(sf.split_selfies(\"[C][O][C]\"))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Please refer to the API reference for more details and utility functions."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"collapsed_sections": [],
|
||||
"name": "selfies_example.ipynb",
|
||||
"provenance": []
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.13"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
||||
Reference in New Issue
Block a user