{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SMARTS Match Detection and Visualization\n", "\n", "This notebook:\n", "1. Reads the `smiles` column from ring16/temp.csv\n", "2. Tests each molecule against the SMARTS pattern: `O=C1C[C@@H](O)[*:15][*:17][*:18]C[*:23]C(=O)/C=C/[*:28]=C/[*:7][*:8]O1`\n", "3. Handles the dummy atoms ([*:X]) in two ways:\n", "   - leaving the dummy atoms unchanged\n", "   - replacing the dummy atoms with C\n", "4. Visualizes the matches with the matched atoms highlighted\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Import the required libraries\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All modules imported successfully!\n" ] } ], "source": [ "import sys\n", "from pathlib import Path\n", "import re\n", "\n", "# Add the project root to the Python path\n", "notebook_dir = Path().resolve()\n", "project_root = notebook_dir.parent\n", "sys.path.insert(0, str(project_root))\n", "\n", "from rdkit import Chem\n", "from rdkit.Chem import Draw\n", "from rdkit.Chem.Draw import rdMolDraw2D\n", "from IPython.display import SVG, display, HTML\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from collections import Counter\n", "\n", "print(\"All modules imported successfully!\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Load the data\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset size: 2022 molecules\n", "Column names: ['IDs', 'molecule_pref_name', 'max_pChEMBL', 'max_pChEMBL_target', '# Target Organisms', 'Target Organisms', '# Known Targets', 'Known Targets', 'target_pref_name', 'smiles']\n", "\n", "SMILES column found, 2022 valid SMILES in total\n", "\n", "First 5 SMILES examples:\n", "['C/C(=C\\\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[C@H](C)[C@H](O)[C@@H](C)C(=O)C(C)(C)[C@@H](O)CC(=O)O1', 'C/C(=C\\\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[C@H](C)[C@H](O)[C@@H](C)C(=O)C(C)(C)[C@@H](O)CC(=O)O1', 'Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C)C(=O)N5CCC[C@H]5C(=O)N(C)CC(=O)N(C)[C@@H](C(C)C)C(=O)O[C@@H]4C)c3nc-2c(C(=O)N[C@@H]2C(=O)N[C@H](C(C)C)C(=O)N3CCC[C@H]3C(=O)N(C)CC(=O)N(C)[C@@H](C(C)C)C(=O)O[C@@H]2C)c(N)c1=O', 'CCCCCCCC(=O)SCC/C=C/[C@@H]1CC(=O)NCc2nc(cs2)C2=N[C@@](C)(CS2)C(=O)N[C@@H](C(C)C)C(=O)O1', 'Cc1cc2ccc1[C@@H](C)COC(=O)Nc1ccc(S(=O)(=O)C3CC3)c(c1)CN(C)C(=O)[C@@H]2Nc1ccc2c(N)ncc(F)c2c1']\n" ] }, { "data": { "text/html": [ "
|  | IDs | molecule_pref_name | max_pChEMBL | max_pChEMBL_target | # Target Organisms | Target Organisms | # Known Targets | Known Targets | target_pref_name | smiles |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CHEMBL94657 | PATUPILONE | 10.67 | CHEMBL1075590 | 695 | Sus scrofa, Mus musculus, None, Plasmodium fal... | 695 | CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE... | AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6... | C/C(=C\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[... |
| 1 | CHEMBL94657 | PATUPILONE | 10.67 | CHEMBL1075590 | 695 | Sus scrofa, Mus musculus, None, Plasmodium fal... | 695 | CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE... | AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6... | C/C(=C\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[... |
| 2 | CHEMBL1554 | DACTINOMYCIN | 10.10 | CHEMBL614533 | 177 | Giardia intestinalis, Trypanosoma cruzi, Equus... | 177 | CHEMBL388, CHEMBL614151, CHEMBL3577, CHEMBL551... | HT-29, CCRF-CEM, WIL2-NS, Unchecked, Caspase-7... | Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C... |
| 3 | CHEMBL1173445 | LARGAZOLE | 8.80 | CHEMBL612545 | 45 | Homo sapiens, None | 45 | CHEMBL392, CHEMBL3192, CHEMBL3524, CHEMBL5103,... | Histone deacetylase 9, Ubiquitin-like modifier... | CCCCCCCC(=O)SCC/C=C/[C@@H]1CC(=O)NCc2nc(cs2)C2... |
| 4 | CHEMBL3902498 | NaN | 9.37 | CHEMBL2095194,CHEMBL3991 | 17 | Homo sapiens, None | 17 | CHEMBL2820, CHEMBL3991, CHEMBL1801, CHEMBL204,... | Coagulation factor X, Kallikrein 1, Coagulatio... | Cc1cc2ccc1[C@@H](C)COC(=O)Nc1ccc(S(=O)(=O)C3CC... |