mm644706215
2025-10-16 17:26:35 +08:00
parent b1d437a06d
commit ea218a3a39
49 changed files with 694742 additions and 2 deletions

376866
test/SIME-MacroValidator.ipynb Normal file

File diff suppressed because one or more lines are too long

20
test/SIME_chemplot_tSNE.ipynb Normal file → Executable file

@@ -1,5 +1,12 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"micromamba create -n qsar -c rdkit -c mordred-descriptor -c conda-forge rdkit numpy mordred scikit-learn pandas matplotlib padelpy fuzzywuzzy optuna hydra-core ipykernel loguru ipython joblib openbabel mopac jupyter chemplot -y"
]
},
{
"cell_type": "code",
"execution_count": 1,
@@ -94,7 +101,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 5,
"metadata": {},
"outputs": [
{
@@ -6091,6 +6098,17 @@
"[15:20:17] Explicit valence for atom # 29 C, 5, is greater than permitted\n",
"[15:20:17] Explicit valence for atom # 29 C, 5, is greater than permitted\n"
]
},
{
"ename": "",
"evalue": "",
"output_type": "error",
"traceback": [
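"\u001b[1;31mThe Kernel crashed while executing code in the current cell or a previous cell.\n",
"\u001b[1;31mPlease review the code in the cell(s) to identify a possible cause of the failure.\n",
"\u001b[1;31mClick <a href='https://aka.ms/vscodeJupyterKernelCrash'>here</a> for more info.\n",
"\u001b[1;31mView the Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."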
]
}
],
"source": [

6409
test/SIME_chemplot_tSNE100w.ipynb Executable file

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large Load Diff

88
test/macrocycles_core.ipynb Executable file

@@ -0,0 +1,88 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'H': 1, 'F': 1, 'Cl': 1, 'Br': 1, 'I': 1, 'B': 3, 'B+1': 2, 'B-1': 4, 'O': 2, 'O+1': 3, 'O-1': 1, 'N': 3, 'N+1': 4, 'N-1': 2, 'C': 4, 'C+1': 5, 'C-1': 3, 'P': 5, 'P+1': 6, 'P-1': 4, 'S': 6, 'S+1': 7, 'S-1': 5, '?': 8}\n"
]
}
],
"source": [
"import selfies as sf\n",
"\n",
"# Get the default semantic constraints dictionary\n",
"constraints = sf.get_preset_constraints(\"default\")\n",
"print(constraints)\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import selfies as sf\n",
"new_constraints = sf.get_preset_constraints(\"default\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"sf.set_semantic_constraints(new_constraints)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"smiles_dataset = [\"COC\", \"FCF\", \"O=O\", \"O=Cc1ccccc1\"]\n",
"selfies_dataset = list(map(sf.encoder, smiles_dataset))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"alphabet = sf.get_alphabet_from_selfies(selfies_dataset)\n",
"alphabet.add(\"[nop]\")\n",
"\n",
"alphabet = list(sorted(alphabet))\n",
"alphabet"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "frage",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.16"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

483
test/test.ipynb Executable file

File diff suppressed because one or more lines are too long

668
test/tutorial.ipynb Executable file

@@ -0,0 +1,668 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "tM3wFk1e_COd",
"tags": []
},
"source": [
"## The Basics\n",
"We begin by importing `selfies`. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 89
},
"colab_type": "code",
"id": "GH0DQxBN_Fei",
"outputId": "56aa043e-df48-4081-f938-49711a166d33"
},
"outputs": [],
"source": [
"import selfies as sf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's try translating between SMILES and SELFIES. As an example, we will use benzaldehyde. To translate from SMILES to SELFIES, use the `selfies.encoder` function, and to translate from SELFIES back to SMILES, use the `selfies.decoder` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"original_smiles = \"O=Cc1ccccc1\" # benzaldehyde\n",
"\n",
"try:\n",
" encoded_selfies = sf.encoder(original_smiles) # SMILES -> SELFIES\n",
" decoded_smiles = sf.decoder(encoded_selfies) # SELFIES -> SMILES\n",
"except sf.EncoderError as err: \n",
" pass # sf.encoder error...\n",
"except sf.DecoderError as err: \n",
" pass # sf.decoder error..."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'[O][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoded_selfies"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'O=CC1=CC=CC=C1'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"decoded_smiles"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Ng8PmMiB_RvJ"
},
"source": [
"Note that `original_smiles` and `decoded_smiles` are different strings, but they both represent benzaldehyde. Thus, when comparing the two SMILES strings, string equality should _not_ be used. Instead, use RDKit to canonicalize both strings and check whether they represent the same molecule."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "iAc5FVrP_XV6",
"outputId": "b503f896-a2a0-46a6-fc5b-9c474c01ba62"
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from rdkit import Chem\n",
"\n",
"Chem.CanonSmiles(original_smiles) == Chem.CanonSmiles(decoded_smiles)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "IKfNr5m6_h4f",
"tags": []
},
"source": [
"## Customizing SELFIES\n",
"The SELFIES grammar is derived dynamically from a set of semantic constraints, which assign bonding capacities to various atoms. Let's customize the semantic constraints that `selfies` operates on. By default, the following constraints are used:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 200
},
"colab_type": "code",
"id": "Xmce7wvV_t4Y",
"outputId": "8b10af2f-486e-4910-8a71-055b59a09746"
},
"outputs": [
{
"data": {
"text/plain": [
"{'H': 1,\n",
" 'F': 1,\n",
" 'Cl': 1,\n",
" 'Br': 1,\n",
" 'I': 1,\n",
" 'O': 2,\n",
" 'O+1': 3,\n",
" 'O-1': 1,\n",
" 'N': 3,\n",
" 'N+1': 4,\n",
" 'N-1': 2,\n",
" 'C': 4,\n",
" 'C+1': 5,\n",
" 'C-1': 3,\n",
" 'P': 5,\n",
" 'P+1': 6,\n",
" 'P-1': 4,\n",
" 'S': 6,\n",
" 'S+1': 7,\n",
" 'S-1': 5,\n",
" '?': 8}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sf.get_preset_constraints(\"default\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These constraints map atoms (the keys) to their bonding capacities (the values). The special `?` key maps to the bonding capacity for all atoms that are not explicitly listed in the constraints. For example, S is constrained to a maximum of 6 bonds, while Li, which is not listed, falls under `?` and is constrained to a maximum of 8 bonds. Every SELFIES string can be decoded into a molecule that obeys the current constraints."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'[Li]=CCS=CC#S'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sf.decoder(\"[Li][=C][C][S][=C][C][#S]\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "KevVGyIEAVlu"
},
"source": [
"But suppose that we instead wanted to constrain S and Li to a maximum of 2 and 1 bond(s), respectively. To do so, we create a new set of constraints, and tell `selfies` to operate on them using `selfies.set_semantic_constraints`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "y5EbmzkKATkD"
},
"outputs": [],
"source": [
"new_constraints = sf.get_preset_constraints(\"default\")\n",
"new_constraints['Li'] = 1\n",
"new_constraints['S'] = 2\n",
"\n",
"sf.set_semantic_constraints(new_constraints)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check that the update was successful, we can use `selfies.get_semantic_constraints`, which returns the semantic constraints that `selfies` is currently operating on."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'H': 1,\n",
" 'F': 1,\n",
" 'Cl': 1,\n",
" 'Br': 1,\n",
" 'I': 1,\n",
" 'O': 2,\n",
" 'O+1': 3,\n",
" 'O-1': 1,\n",
" 'N': 3,\n",
" 'N+1': 4,\n",
" 'N-1': 2,\n",
" 'C': 4,\n",
" 'C+1': 5,\n",
" 'C-1': 3,\n",
" 'P': 5,\n",
" 'P+1': 6,\n",
" 'P-1': 4,\n",
" 'S': 2,\n",
" 'S+1': 7,\n",
" 'S-1': 5,\n",
" '?': 8,\n",
" 'Li': 1}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sf.get_semantic_constraints()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "e_djGmr_AvM7"
},
"source": [
"Our previous SELFIES string is now decoded like so. Notice that the specified bonding capacities are respected, with every S and Li making at most 2 and 1 bond(s), respectively."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "TCzbjZMAAxpo"
},
"outputs": [
{
"data": {
"text/plain": [
"'[Li]CCSCC=S'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sf.decoder(\"[Li][=C][C][S][=C][C][#S]\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Ng1Lr_e6A3cB"
},
"source": [
"Finally, to revert to the default constraints, simply call:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "zwC00Rx5A6eQ"
},
"outputs": [],
"source": [
"sf.set_semantic_constraints()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please refer to the API reference for more details and more preset constraints."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SELFIES in Practice \n",
"\n",
"Let's use a simple example to show how `selfies` can be used in practice, as well as highlight some convenient utility functions from the library. We start with a toy dataset of SMILES strings. As before, we can use `selfies.encoder` to convert the dataset into SELFIES form."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['[C][O][C]',\n",
" '[F][C][F]',\n",
" '[O][=O]',\n",
" '[O][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]']"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"smiles_dataset = [\"COC\", \"FCF\", \"O=O\", \"O=Cc1ccccc1\"]\n",
"selfies_dataset = list(map(sf.encoder, smiles_dataset))\n",
"\n",
"selfies_dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function `selfies.len_selfies` computes the symbol length of a SELFIES string. We can use it to find the maximum symbol length of the SELFIES strings in the dataset. "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"max_len = max(sf.len_selfies(s) for s in selfies_dataset)\n",
"max_len"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To extract the SELFIES symbols that form the dataset, use `selfies.get_alphabet_from_selfies`. Here, we add `[nop]` to the alphabet, which is a special padding character that `selfies` recognizes."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['[=Branch1]', '[=C]', '[=O]', '[C]', '[F]', '[O]', '[Ring1]', '[nop]']"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alphabet = sf.get_alphabet_from_selfies(selfies_dataset)\n",
"alphabet.add(\"[nop]\")\n",
"\n",
"alphabet = list(sorted(alphabet))\n",
"alphabet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, create a mapping between the alphabet SELFIES symbols and indices."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'[=Branch1]': 0,\n",
" '[=C]': 1,\n",
" '[=O]': 2,\n",
" '[C]': 3,\n",
" '[F]': 4,\n",
" '[O]': 5,\n",
" '[Ring1]': 6,\n",
" '[nop]': 7}"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vocab_stoi = {symbol: idx for idx, symbol in enumerate(alphabet)}\n",
"vocab_itos = {idx: symbol for symbol, idx in vocab_stoi.items()}\n",
"\n",
"vocab_stoi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SELFIES provides some convenience methods to convert between SELFIES strings and label (integer) and one-hot encodings. Using the first entry of the dataset (dimethyl ether) as an example:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"dimethyl_ether = selfies_dataset[0]\n",
"label, one_hot = sf.selfies_to_encoding(dimethyl_ether, vocab_stoi, pad_to_len=max_len)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[3, 5, 3, 7, 7, 7, 7, 7, 7, 7]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"label"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[0, 0, 0, 1, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 1, 0, 0],\n",
" [0, 0, 0, 1, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0, 0, 0, 1],\n",
" [0, 0, 0, 0, 0, 0, 0, 1],\n",
" [0, 0, 0, 0, 0, 0, 0, 1],\n",
" [0, 0, 0, 0, 0, 0, 0, 1],\n",
" [0, 0, 0, 0, 0, 0, 0, 1],\n",
" [0, 0, 0, 0, 0, 0, 0, 1],\n",
" [0, 0, 0, 0, 0, 0, 0, 1]]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"one_hot"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'[C][O][C][nop][nop][nop][nop][nop][nop][nop]'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dimethyl_ether = sf.encoding_to_selfies(one_hot, vocab_itos, enc_type=\"one_hot\")\n",
"dimethyl_ether"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'COC'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sf.decoder(dimethyl_ether) # sf.decoder ignores [nop]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If different encoding strategies are desired, `selfies.split_selfies` can be used to tokenize a SELFIES string into its individual symbols."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['[C]', '[O]', '[C]']"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(sf.split_selfies(\"[C][O][C]\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please refer to the API reference for more details and utility functions."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "selfies_example.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}