{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
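"# Screening Drug Candidates for the Sandmeyer Reaction: Zhang Xiaheng Reaction Replacement Analysis\n",
"\n",
"## Background\n",
"\n",
"### The Sandmeyer reaction in brief\n",
"\n",
"The Sandmeyer reaction is the classic method for converting aromatic amines via a diazonium intermediate:\n",
"\n",
"**Ar-NH₂ → [Ar-N₂]⁺ → Ar-X**\n",
"\n",
"where X = Cl, Br, I, CN, OH, SCN, and so on.\n",
"\n",
"### The Zhang Xiaheng reaction\n",
"\n",
"According to the paper `s41586-025-09791-5_reference.pdf`, the Zhang Xiaheng reaction may be a new alternative\n",
"that converts aromatic amines to aryl halides more efficiently.\n",
"\n",
"### Screening strategy\n",
"\n",
"We identify aromatic halogen motifs in drug molecules that may have been installed by a Sandmeyer reaction,\n",
"and use them to find candidate drugs whose process chemistry could be optimized with the Zhang Xiaheng reaction.\n",
"\n",
"**Important caveats:**\n",
"\n",
"- This screen is based on molecular structure alone\n",
"- The synthetic route must ultimately be confirmed from the literature\n",
"- Not every halogen-containing drug is made with a Sandmeyer reaction"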
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"## Import the required libraries"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"from pathlib import Path\n",
"\n",
"from IPython.display import SVG, display\n",
"from rdkit.Chem import AllChem\n",
"from rdkit.Chem.Draw import rdMolDraw2D\n",
"\n",
"# Create the output directories\n",
"output_base = Path(\"../data/drug_targetmol\")\n",
"scheme_a_dir = output_base / \"scheme_A_visualizations\"\n",
"scheme_b_dir = output_base / \"scheme_B_visualizations\"\n",
"\n",
"# parents=True also creates output_base when it does not exist yet\n",
"scheme_a_dir.mkdir(parents=True, exist_ok=True)\n",
"scheme_b_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"Created directory: {scheme_a_dir}\")\n",
"print(f\"Created directory: {scheme_b_dir}\")\n",
"\n",
"\n",
"def generate_highlighted_svg(mol, highlight_atoms, filename, title=\"\"):\n",
"    \"\"\"Write a high-resolution SVG of the molecule with its halogen atoms highlighted.\"\"\"\n",
"    # Compute 2D coordinates\n",
"    AllChem.Compute2DCoords(mol)\n",
"\n",
"    # Create the SVG drawer; the large canvas keeps the drawing sharp\n",
"    drawer = rdMolDraw2D.MolDraw2DSVG(1200, 900)\n",
"    drawer.SetFontSize(12)\n",
"\n",
"    # Drawing options\n",
"    draw_options = drawer.drawOptions()\n",
"    draw_options.addAtomIndices = False  # hide atom indices to keep the picture clean\n",
"    draw_options.addBondIndices = False\n",
"    draw_options.addStereoAnnotation = True\n",
"    draw_options.fixedFontSize = 12\n",
"\n",
"    # Highlight the halogen atoms in red\n",
"    atom_colors = {}\n",
"    for atom_idx in highlight_atoms:\n",
"        atom_colors[atom_idx] = (1.0, 0.3, 0.3)\n",
"\n",
"    # Draw the molecule\n",
"    drawer.DrawMolecule(mol,\n",
"                        highlightAtoms=highlight_atoms,\n",
"                        highlightAtomColors=atom_colors)\n",
"\n",
"    drawer.FinishDrawing()\n",
"    svg_content = drawer.GetDrawingText()\n",
"\n",
"    # Add the title\n",
"    if title:\n",
"        svg_lines = svg_content.split(\"\\n\")\n",
"        # Insert the title before the first transformed <g> tag\n",
"        for i, line in enumerate(svg_lines):\n",
"            if \"<g \" in line and \"transform\" in line:\n",
"                # Single-quoted f-string so the SVG attribute quotes do not end the string\n",
"                svg_lines.insert(i, f'<text x=\"50%\" y=\"30\" text-anchor=\"middle\" font-size=\"16\" font-weight=\"bold\">{title}</text>')\n",
"                break\n",
"        svg_with_title = \"\\n\".join(svg_lines)\n",
"    else:\n",
"        svg_with_title = svg_content\n",
"\n",
"    # Save the file\n",
"    with open(filename, \"w\", encoding=\"utf-8\") as f:\n",
"        f.write(svg_with_title)\n",
"\n",
"    print(f\"Saved SVG: {filename}\")\n",
"\n",
"    return svg_content\n",
"\n",
"\n",
"def visualize_molecules(df, output_dir, scheme_name, max_molecules=50):\n",
"    \"\"\"Generate a structure image for each molecule in the DataFrame.\"\"\"\n",
"    print(f\"\\nGenerating visualizations for {scheme_name}...\")\n",
"    print(f\"Output directory: {output_dir}\")\n",
"\n",
"    generated_count = 0\n",
"\n",
"    for idx, row in df.iterrows():\n",
"        if generated_count >= max_molecules:\n",
"            print(f\"Reached the limit of {max_molecules} molecules; stopping\")\n",
"            break\n",
"\n",
"        cas = str(row.get(\"CAS\", \"unknown\")).strip()\n",
"        name = str(row.get(\"Name\", \"unknown\")).strip()\n",
"\n",
"        # Skip invalid CAS numbers\n",
"        if not cas or cas == \"nan\" or cas == \"unknown\":\n",
"            continue\n",
"\n",
"        mol = row.get(\"ROMol\")\n",
"        if mol is None:\n",
"            continue\n",
"\n",
"        # Collect the halogen atoms\n",
"        halogen_atoms = []\n",
"        for atom in mol.GetAtoms():\n",
"            if atom.GetAtomicNum() in [9, 17, 35, 53]:  # F, Cl, Br, I\n",
"                halogen_atoms.append(atom.GetIdx())\n",
"\n",
"        if not halogen_atoms:\n",
"            continue\n",
"\n",
"        # Build the file name and title\n",
"        filename = output_dir / f\"{cas}.svg\"\n",
"        title = f\"{name} ({cas})\"\n",
"\n",
"        try:\n",
"            # Generate the SVG\n",
"            generate_highlighted_svg(mol, halogen_atoms, filename, title)\n",
"            generated_count += 1\n",
"\n",
"            # Report progress every 10 molecules\n",
"            if generated_count % 10 == 0:\n",
"                print(f\"Generated {generated_count} molecule images\")\n",
"\n",
"        except Exception as e:\n",
"            print(f\"Failed to generate {cas}: {e}\")\n",
"            continue\n",
"\n",
"    print(f\"Done. Generated {generated_count} visualizations for {scheme_name}\")\n",
"    return generated_count\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
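"## Defining the screening patterns\n",
"\n",
"### About the SMARTS patterns\n",
"\n",
"#### Scheme A: heteroaryl halogens (highest priority)\n",
"- **Screening logic**: halogens on heteroaromatic rings are the most likely to be Sandmeyer products\n",
"- **Rationale**: direct halogenation of heteroaromatic rings is usually difficult, so the Sandmeyer reaction is an important synthetic route\n",
"- **Expected result**: few candidates, but high precision\n",
"\n",
"#### Scheme B: all aromatic halogens (medium priority)\n",
"- **Screening logic**: halogens on any aromatic ring\n",
"- **Rationale**: some of these halogens may come from other routes, but the wider net broadens the screen\n",
"- **Expected result**: more candidates, which need more literature verification\n",
"\n",
"**Notes on SMARTS pattern refinement:**\n",
"- The original pattern `n:c:[Cl,Br,I]` was incorrect: `:` requires an aromatic bond, and the carbon-halogen bond is a single bond, so the pattern can never match\n",
"- It was replaced with patterns that match the ring structure more accurately\n",
"- The new patterns describe the atom environment more precisely"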
|
||
]
|
||
},
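{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to see why `n:c:[Cl,Br,I]` fails while `[n,o,s]c[Cl,Br,I]` works is to test both against a simple molecule.\n",
"The cell below is only an illustrative sketch: it assumes RDKit is installed, and 2-chloropyridine (`Clc1ccccn1`) is a\n",
"hypothetical test molecule chosen for this check, not an entry taken from the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rdkit import Chem\n",
"\n",
"# 2-chloropyridine, used purely as an illustration\n",
"mol = Chem.MolFromSmiles(\"Clc1ccccn1\")\n",
"\n",
"# ':' demands an aromatic bond, but the C-Cl bond is a single bond\n",
"broken = Chem.MolFromSmarts(\"n:c:[Cl,Br,I]\")\n",
"# With no explicit bond symbol, SMARTS accepts a single or an aromatic bond\n",
"fixed = Chem.MolFromSmarts(\"[n,o,s]c[Cl,Br,I]\")\n",
"\n",
"print(mol.HasSubstructMatch(broken))  # expected: False\n",
"print(mol.HasSubstructMatch(fixed))   # expected: True"
]
},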
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"from rdkit import Chem\n",
"\n",
"# Define the screening patterns\n",
"SCREENING_PATTERNS = {\n",
"    'heteroaryl_halides': {\n",
"        'name': 'Heteroaryl halides',\n",
"        'description': 'Cl, Br, I atoms on heteroaromatic rings (Scheme A)',\n",
"        'smarts': [\n",
"            '[n,o,s]c[Cl,Br,I]',  # halogen on the carbon adjacent to a ring heteroatom\n",
"            '[n,o,s]cc[Cl,Br,I]',  # halogen one carbon further from the heteroatom\n",
"            'c1[n,o,s]c([Cl,Br,I])ccc1',  # six-membered heteroaromatic ring, halogen next to the heteroatom (e.g. 2-halopyridine)\n",
"            'c1c([Cl,Br,I])cncn1',  # 5-halopyrimidine\n",
"            'c1ccc2c([Cl,Br,I])ccnc2c1',  # 4-haloquinoline\n",
"            'c1c([Cl,Br,I])cncc1',  # 3-halopyridine\n",
"            'c1([Cl,Br,I])scnc1',  # 5-halothiazole\n",
"        ],\n",
"        'scheme': 'A'\n",
"    },\n",
"    'aryl_halides': {\n",
"        'name': 'Aryl halides',\n",
"        'description': 'Cl, Br, I atoms on any aromatic ring (Scheme B)',\n",
"        'smarts': [\n",
"            'c[Cl,Br,I]',  # any aromatic Cl, Br, or I\n",
"            'c-C#N',  # aryl nitrile\n",
"            'c1ccc(S(=O)(=O)N)cc1',  # benzenesulfonamide core\n",
"            'c1c(Cl)cc(Cl)cc1',  # benzene ring with two chlorines in a 1,3 relationship\n",
"        ],\n",
"        'scheme': 'B'\n",
"    }\n",
"}\n",
"\n",
"\n",
"def create_pattern_matchers():\n",
"    \"\"\"Compile the SMARTS strings into RDKit query molecules.\"\"\"\n",
"    matchers = {}\n",
"    for key, pattern_info in SCREENING_PATTERNS.items():\n",
"        matchers[key] = {\n",
"            'info': pattern_info,\n",
"            # MolFromSmarts returns None for a SMARTS string it cannot parse\n",
"            'matchers': [Chem.MolFromSmarts(smarts) for smarts in pattern_info['smarts']]\n",
"        }\n",
"    return matchers\n",
"\n",
"\n",
"PATTERN_MATCHERS = create_pattern_matchers()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
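"## Data loading and preprocessing\n",
"\n",
"### About the SDF file\n",
"- Location: `data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
"- Contains drug structures together with rich property annotations\n",
"- Each record includes SMILES, molecular formula, molecular weight, approval status, indication, and more"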
|
||
]
|
||
},
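{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the SDF file might be loaded into the DataFrame `df` that the screening cells use.\n",
"It assumes RDKit's `PandasTools.LoadSDF` and the same `../data/drug_targetmol/` relative location as the other paths in this notebook;\n",
"the variable name `sdf_path` is hypothetical, and this is not necessarily the loading code that produced the recorded output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rdkit.Chem import PandasTools\n",
"\n",
"# Assumed path: the file named above, relative to the notebook directory\n",
"sdf_path = \"../data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf\"\n",
"\n",
"# LoadSDF turns each SDF record into one row, copies the SD properties into columns,\n",
"# and stores the RDKit molecule in the column named by molColName.\n",
"# Records that RDKit cannot parse or sanitize are skipped.\n",
"df = PandasTools.LoadSDF(sdf_path, molColName=\"ROMol\")\n",
"\n",
"print(f\"Loaded {len(df)} molecules\")\n",
"print(df.columns.tolist())"
]
},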
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
"Reading the SDF file...\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[15:05:51] Both bonds on one end of an atropisomer are on the same side - atoms is : 3\n",
|
||
"[15:05:51] Explicit valence for atom # 2 N greater than permitted\n",
|
||
"[15:05:51] ERROR: Could not sanitize molecule ending on line 217340\n",
|
||
"[15:05:51] ERROR: Explicit valence for atom # 2 N greater than permitted\n",
|
||
"[15:05:52] Explicit valence for atom # 4 N greater than permitted\n",
|
||
"[15:05:52] ERROR: Could not sanitize molecule ending on line 317283\n",
|
||
"[15:05:52] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
|
||
"[15:05:52] Explicit valence for atom # 4 N greater than permitted\n",
|
||
"[15:05:52] ERROR: Could not sanitize molecule ending on line 324666\n",
|
||
"[15:05:52] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
|
||
"[15:05:52] Explicit valence for atom # 5 N greater than permitted\n",
|
||
"[15:05:52] ERROR: Could not sanitize molecule ending on line 365883\n",
|
||
"[15:05:52] ERROR: Explicit valence for atom # 5 N greater than permitted\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
"Successfully loaded 3276 molecules\n",
"\n",
"Data overview:\n",
|
||
" Index Plate Row Col ID Name \\\n",
|
||
"0 1 L1010-1 a 2 Dexamethasone \n",
|
||
"1 2 L1010-1 a 3 Danicopan \n",
|
||
"2 3 L1010-1 a 4 Cyclosporin A \n",
|
||
"3 4 L1010-1 a 5 L-Carnitine \n",
|
||
"4 5 L1010-1 a 6 Trimetazidine dihydrochloride \n",
|
||
"\n",
|
||
" Synonyms CAS \\\n",
|
||
"0 MK 125;Prednisolone F;NSC 34521;Hexadecadrol 50-02-2 \n",
|
||
"1 ACH-4471 1903768-17-1 \n",
|
||
"2 Cyclosporine A;Ciclosporin;Cyclosporine 59865-13-3 \n",
|
||
"3 L(-)-Carnitine;Levocarnitine 541-15-1 \n",
|
||
"4 Yoshimilon;Kyurinett;Vastarel F 13171-25-0 \n",
|
||
"\n",
|
||
" SMILES \\\n",
|
||
"0 C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO \n",
|
||
"1 CC(=O)c1nn(CC(=O)N2C[C@H](F)C[C@H]2C(=O)Nc2cccc(Br)n2)c2ccc(cc12)-c1cnc(C)nc1 \n",
|
||
"2 [C@H]([C@@H](C/C=C/C)C)(O)[C@@]1(N(C)C(=O)[C@H]([C@@H](C)C)N(C)C(=O)[C@H](CC(C)C)N(C)C(=O)[C@H](... \n",
|
||
"3 C[N+](C)(C)C[C@@H](O)CC([O-])=O \n",
|
||
"4 Cl.Cl.COC1=C(OC)C(OC)=C(CN2CCNCC2)C=C1 \n",
|
||
"\n",
|
||
" Formula MolWt Approved status \\\n",
|
||
"0 C22H29FO5 392.46 NMPA;EMA;FDA \n",
|
||
"1 C26H23BrFN7O3 580.41 FDA \n",
|
||
"2 C62H111N11O12 1202.61 FDA \n",
|
||
"3 C7H15NO3 161.2 FDA \n",
|
||
"4 C14H24Cl2N2O3 339.258 NMPA;EMA \n",
|
||
"\n",
|
||
" Pharmacopoeia \\\n",
|
||
"0 USP39-NF34;BP2015;JP16;IP2010 \n",
|
||
"1 NaN \n",
|
||
"2 Martindale the Extra Pharmacopoei, EP10.2, USP43-NF38, Ph.Int_6th, JP17 \n",
|
||
"3 NaN \n",
|
||
"4 BP2019;KP Ⅹ;EP9.2;IP2010;JP17;Martindale:The Extra Pharmacopoeia \n",
|
||
"\n",
|
||
" Disease \\\n",
|
||
"0 Metabolism \n",
|
||
"1 Others \n",
|
||
"2 Immune system \n",
|
||
"3 Cardiovascular system \n",
|
||
"4 Cardiovascular system \n",
|
||
"\n",
|
||
" Pathways \\\n",
|
||
"0 Antibody-drug Conjugate/ADC Related;Autophagy;Endocrinology/Hormones;Immunology/Inflammation;Mic... \n",
|
||
"1 Immunology/Inflammation \n",
|
||
"2 Immunology/Inflammation;Metabolism;Microbiology/Virology \n",
|
||
"3 Metabolism \n",
|
||
"4 Autophagy;Metabolism \n",
|
||
"\n",
|
||
" Target \\\n",
|
||
"0 Antibacterial;Antibiotic;Autophagy;Complement System;Glucocorticoid Receptor;IL Receptor;Mitopha... \n",
|
||
"1 Complement System \n",
|
||
"2 Phosphatase;Antibiotic;Complement System \n",
|
||
"3 Endogenous Metabolite;Fatty Acid Synthase \n",
|
||
"4 Autophagy;Fatty Acid Synthase \n",
|
||
"\n",
|
||
" Receptor \\\n",
|
||
"0 Antibiotic; Autophagy; Bacterial; Complement System; Glucocorticoid Receptor; IL receptor; Mitop... \n",
|
||
"1 Complement System; factor D \n",
|
||
"2 Antibiotic; calcineurin phosphatase; Complement System; Phosphatase \n",
|
||
"3 Endogenous Metabolite; FAS \n",
|
||
"4 Autophagy; mitochondrial long-chain 3-ketoacyl thiolase \n",
|
||
"\n",
|
||
" Bioactivity \\\n",
|
||
"0 Dexamethasone is a glucocorticoid receptor agonist and IL receptor modulator with anti-inflammat... \n",
|
||
"1 Danicopan (ACH-4471) (ACH-4471) is a selective, orally active small molecule factor D inhibitor ... \n",
|
||
"2 Cyclosporin A is a natural product and an active fungal metabolite, classified as a cyclic polyp... \n",
|
||
"3 L-Carnitine (L(-)-Carnitine) is an amino acid derivative. L-Carnitine facilitates long-chain fat... \n",
|
||
"4 Trimetazidine dihydrochloride (Vastarel F) can improve myocardial glucose utilization by inhibit... \n",
|
||
"\n",
|
||
" Reference \\\n",
|
||
"0 Li M, Yu H. Identification of WP1066, an inhibitor of JAK2 and STAT3, as a Kv1. 3 potassium chan... \n",
|
||
"1 Yuan X, et al. Small-molecule factor D inhibitors selectively block the alternative pathway of c... \n",
|
||
"2 D'Angelo G, et al. Cyclosporin A prevents the hypoxic adaptation by activating hypoxia-inducible... \n",
|
||
"3 Jogl G, Tong L. Cell. 2003 Jan 10; 112(1):113-22. \n",
|
||
"4 Yang Q, et al. Int J Clin Exp Pathol. 2015, 8(4):3735-3741.;Liu Z, et al. Metabolism. 2016, 65(3... \n",
|
||
"\n",
|
||
" ROMol \n",
|
||
"0 <rdkit.Chem.rdchem.Mol object at 0x743d782049e0> \n",
|
||
"1 <rdkit.Chem.rdchem.Mol object at 0x743d782871b0> \n",
|
||
"2 <rdkit.Chem.rdchem.Mol object at 0x743d78287220> \n",
|
||
"3 <rdkit.Chem.rdchem.Mol object at 0x743d782873e0> \n",
|
||
"4 <rdkit.Chem.rdchem.Mol object at 0x743d78287450> \n",
|
||
"\n",
"Column names: ['Index', 'Plate', 'Row', 'Col', 'ID', 'Name', 'Synonyms', 'CAS', 'SMILES', 'Formula', 'MolWt', 'Approved status', 'Pharmacopoeia', 'Disease', 'Pathways', 'Target', 'Receptor', 'Bioactivity', 'Reference', 'ROMol']\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"# Read the screening-result CSV files\n",
"import pandas as pd\n",
"from rdkit import Chem\n",
"\n",
"print(\"Reading the screening-result CSV files...\")\n",
"\n",
"# Read the Scheme A results\n",
"df_a = pd.read_csv(\"../data/drug_targetmol/sandmeyer_candidates_scheme_A_heteroaryl_halides.csv\")\n",
"print(f\"Scheme A data: {len(df_a)} rows\")\n",
"\n",
"# Read the Scheme B results\n",
"df_b = pd.read_csv(\"../data/drug_targetmol/sandmeyer_candidates_scheme_B_aryl_halides.csv\")\n",
"print(f\"Scheme B data: {len(df_b)} rows\")\n",
"\n",
"\n",
"def rebuild_molecules(df):\n",
"    \"\"\"Rebuild the RDKit molecule objects from the SMILES_from_mol column.\"\"\"\n",
"    mols = []\n",
"    for idx, row in df.iterrows():\n",
"        smiles = row.get(\"SMILES_from_mol\", \"\")\n",
"        if smiles and str(smiles) != \"nan\":\n",
"            # MolFromSmiles returns None when the SMILES cannot be parsed\n",
"            mol = Chem.MolFromSmiles(str(smiles))\n",
"            mols.append(mol)\n",
"        else:\n",
"            mols.append(None)\n",
"    df[\"ROMol\"] = mols\n",
"    valid_mols = sum(1 for m in mols if m is not None)\n",
"    print(f\"Rebuilt {valid_mols} molecule objects\")\n",
"    return df\n",
"\n",
"\n",
"df_a = rebuild_molecules(df_a)\n",
"df_b = rebuild_molecules(df_b)\n",
"\n",
"print(\"\\nData loading complete\")\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
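"## Molecule screening functions\n",
"\n",
"### Screening logic\n",
"\n",
"1. **Molecule validation**: make sure the molecular structure is valid\n",
"2. **Substructure matching**: use RDKit SMARTS matching\n",
"3. **Result recording**: record which patterns matched and how many times\n",
"4. **Data completeness**: keep all of the original property columns"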
|
||
]
|
||
},
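{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first three steps can be sketched on a single molecule. This is only an illustration, assuming RDKit is installed:\n",
"it uses the Danicopan SMILES shown in the data overview and the first Scheme A pattern, and prints the atom indices of each match."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from rdkit import Chem\n",
"\n",
"# Danicopan, SMILES copied from the data overview\n",
"smiles = \"CC(=O)c1nn(CC(=O)N2C[C@H](F)C[C@H]2C(=O)Nc2cccc(Br)n2)c2ccc(cc12)-c1cnc(C)nc1\"\n",
"\n",
"# Step 1: MolFromSmiles returns None when the structure is invalid\n",
"mol = Chem.MolFromSmiles(smiles)\n",
"pattern = Chem.MolFromSmarts(\"[n,o,s]c[Cl,Br,I]\")\n",
"\n",
"# Step 2: substructure matching\n",
"if mol is not None and mol.HasSubstructMatch(pattern):\n",
"    # Step 3: each match is a tuple of atom indices, one per SMARTS atom\n",
"    for match in mol.GetSubstructMatches(pattern):\n",
"        symbols = [mol.GetAtomWithIdx(i).GetSymbol() for i in match]\n",
"        print(match, symbols)"
]
},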
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"def screen_molecules_for_patterns(df, pattern_key):\n",
"    \"\"\"\n",
"    Screen for molecules that contain specific substructures.\n",
"\n",
"    Args:\n",
"        df: DataFrame containing the molecules\n",
"        pattern_key: key of the screening pattern set\n",
"\n",
"    Returns:\n",
"        DataFrame of screening hits\n",
"    \"\"\"\n",
"    pattern_info = PATTERN_MATCHERS[pattern_key]['info']\n",
"    matchers = PATTERN_MATCHERS[pattern_key]['matchers']\n",
"\n",
"    print(f\"\\nScreening: {pattern_info['name']}\")\n",
"    print(f\"Description: {pattern_info['description']}\")\n",
"    print(f\"Number of SMARTS patterns: {len(pattern_info['smarts'])}\")\n",
"\n",
"    matched_molecules = []\n",
"\n",
"    for idx, row in df.iterrows():\n",
"        mol = row['ROMol']\n",
"        if mol is None:\n",
"            continue\n",
"\n",
"        # Check the molecule against every pattern\n",
"        matched_patterns = []\n",
"        for i, matcher in enumerate(matchers):\n",
"            if matcher is None:\n",
"                continue\n",
"            if mol.HasSubstructMatch(matcher):\n",
"                matched_patterns.append({\n",
"                    'pattern_index': i,\n",
"                    'smarts': pattern_info['smarts'][i],\n",
"                    'matches': len(mol.GetSubstructMatches(matcher))\n",
"                })\n",
"\n",
"        if matched_patterns:\n",
"            # Build the match record\n",
"            match_record = row.copy()\n",
"            match_record['matched_patterns'] = matched_patterns\n",
"            match_record['total_matches'] = sum(p['matches'] for p in matched_patterns)\n",
"            match_record['screening_scheme'] = pattern_info['scheme']\n",
"            matched_molecules.append(match_record)\n",
"\n",
"    result_df = pd.DataFrame(matched_molecules)\n",
"    print(f\"Found {len(result_df)} matching molecules\")\n",
"\n",
"    return result_df\n",
"\n",
"\n",
"def save_screening_results(df, filename, description):\n",
"    \"\"\"Save the screening results to CSV (description is a label and is currently unused).\"\"\"\n",
"    output_path = f\"../data/drug_targetmol/{filename}\"\n",
"\n",
"    # Convert the ROMol column to SMILES, because ROMol objects cannot be stored in a CSV\n",
"    df_export = df.copy()\n",
"    if 'ROMol' in df_export.columns:\n",
"        df_export['SMILES_from_mol'] = df_export['ROMol'].apply(lambda x: Chem.MolToSmiles(x) if x else '')\n",
"        df_export = df_export.drop('ROMol', axis=1)\n",
"\n",
"    df_export.to_csv(output_path, index=False, encoding='utf-8')\n",
"    print(f\"Results saved to: {output_path}\")\n",
"    print(f\"Contains {len(df_export)} molecules and {len(df_export.columns)} property columns\")\n",
"\n",
"    return output_path"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
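"## Scheme A screen: heteroaryl halogens\n",
"\n",
"### How it runs\n",
"- Uses the most conservative screening strategy\n",
"- Matches only halogens on heteroaromatic rings\n",
"- Expected to give high-precision results\n",
"- The synthetic routes still need to be verified afterwards"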
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\n",
"Screening: Heteroaryl halides\n",
"Description: Cl, Br, I atoms on heteroaromatic rings (Scheme A)\n",
"Number of SMARTS patterns: 7\n",
"Found 57 matching molecules\n",
"\n",
"Scheme A screening summary:\n",
|
||
" Name CAS Formula \\\n",
|
||
"1 Danicopan 1903768-17-1 C26H23BrFN7O3 \n",
|
||
"8 Lonafarnib 193275-84-2 C27H31Br2ClN4O2 \n",
|
||
"19 Idoxuridine 54-42-2 C9H11IN2O5 \n",
|
||
"144 Dimenhydrinate 523-87-5 C24H28ClN5O3 \n",
|
||
"259 Sertaconazole 99592-32-2 C20H15Cl3N2OS \n",
|
||
"311 Tioconazole 65899-73-2 C16H13Cl3N2OS \n",
|
||
"337 Gimeracil 103766-25-2 C5H4ClNO2 \n",
|
||
"580 Bromocriptine mesylate 22260-51-1 C33H44BrN5O8S \n",
|
||
"592 Clofarabine 123318-82-1 C10H11ClFN5O3 \n",
|
||
"684 Vorasidenib 1644545-52-7 C14H13ClF6N6 \n",
|
||
"\n",
|
||
" matched_patterns \\\n",
|
||
"1 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}, {'pattern_index': 2, 'smarts... \n",
|
||
"8 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}, {'pattern_index': 5, 'smart... \n",
|
||
"19 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 2}, {'pattern_index': 3, 'smart... \n",
|
||
"144 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 2}] \n",
|
||
"259 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}] \n",
|
||
"311 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}] \n",
|
||
"337 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}, {'pattern_index': 5, 'smart... \n",
|
||
"580 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}] \n",
|
||
"592 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 2}] \n",
|
||
"684 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}, {'pattern_index': 2, 'smarts... \n",
|
||
"\n",
|
||
" total_matches \n",
|
||
"1 2 \n",
|
||
"8 2 \n",
|
||
"19 3 \n",
|
||
"144 2 \n",
|
||
"259 1 \n",
|
||
"311 1 \n",
|
||
"337 2 \n",
|
||
"580 1 \n",
|
||
"592 2 \n",
|
||
"684 2 \n",
"Results saved to: ../data/drug_targetmol/sandmeyer_candidates_scheme_A_heteroaryl_halides.csv\n",
"Contains 57 molecules and 23 property columns\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"# Run the Scheme A screen\n",
"scheme_a_results = screen_molecules_for_patterns(df, 'heteroaryl_halides')\n",
"\n",
"# Show a summary of the results\n",
"if len(scheme_a_results) > 0:\n",
"    print(\"\\nScheme A screening summary:\")\n",
"    print(scheme_a_results[['Name', 'CAS', 'Formula', 'matched_patterns', 'total_matches']].head(10))\n",
"\n",
"    # Save the results\n",
"    save_screening_results(\n",
"        scheme_a_results,\n",
"        'sandmeyer_candidates_scheme_A_heteroaryl_halides.csv',\n",
"        'Scheme A: heteroaryl halide screening results'\n",
"    )\n",
"else:\n",
"    print(\"\\nScheme A found no matching molecules\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
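"## Scheme B screen: all aromatic halogens\n",
"\n",
"### How it runs\n",
"- Uses a more permissive screening strategy\n",
"- Matches halogens on any aromatic ring\n",
"- Returns more candidate molecules\n",
"- Requires more literature verification work"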
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\n",
"Screening: Aryl halides\n",
"Description: Cl, Br, I atoms on any aromatic ring (Scheme B)\n",
"Number of SMARTS patterns: 4\n",
"Found 548 matching molecules\n",
"\n",
"Scheme B screening summary:\n",
|
||
" Name CAS Formula \\\n",
|
||
"1 Danicopan 1903768-17-1 C26H23BrFN7O3 \n",
|
||
"8 Lonafarnib 193275-84-2 C27H31Br2ClN4O2 \n",
|
||
"9 Ketoconazole 65277-42-1 C26H28Cl2N4O4 \n",
|
||
"13 Ozanimod 1306760-87-1 C23H24N4O3 \n",
|
||
"14 Ponesimod 854107-55-4 C23H25ClN2O4S \n",
|
||
"19 Idoxuridine 54-42-2 C9H11IN2O5 \n",
|
||
"53 Moclobemide 71320-77-9 C13H17ClN2O2 \n",
|
||
"74 Clemastine 15686-51-8 C21H26ClNO \n",
|
||
"75 Buclizine dihydrochloride 129-74-8 C28H33ClN2·2HCl \n",
|
||
"78 Asenapine 65576-45-6 C17H16ClNO \n",
|
||
"\n",
|
||
" matched_patterns \\\n",
|
||
"1 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||
"8 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 3}] \n",
|
||
"9 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 2}, {'pattern_index': 3, 'smarts': 'c1c... \n",
|
||
"13 [{'pattern_index': 1, 'smarts': 'c-C#N', 'matches': 1}] \n",
|
||
"14 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||
"19 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||
"53 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||
"74 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||
"75 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||
"78 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||
"\n",
|
||
" total_matches \n",
|
||
"1 1 \n",
|
||
"8 3 \n",
|
||
"9 3 \n",
|
||
"13 1 \n",
|
||
"14 1 \n",
|
||
"19 1 \n",
|
||
"53 1 \n",
|
||
"74 1 \n",
|
||
"75 1 \n",
|
||
"78 1 \n",
"Results saved to: ../data/drug_targetmol/sandmeyer_candidates_scheme_B_aryl_halides.csv\n",
"Contains 548 molecules and 23 property columns\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"# Run the Scheme B screen\n",
"scheme_b_results = screen_molecules_for_patterns(df, 'aryl_halides')\n",
"\n",
"# Show a summary of the results\n",
"if len(scheme_b_results) > 0:\n",
"    print(\"\\nScheme B screening summary:\")\n",
"    print(scheme_b_results[['Name', 'CAS', 'Formula', 'matched_patterns', 'total_matches']].head(10))\n",
"\n",
"    # Save the results\n",
"    save_screening_results(\n",
"        scheme_b_results,\n",
"        'sandmeyer_candidates_scheme_B_aryl_halides.csv',\n",
"        'Scheme B: all aryl halide screening results'\n",
"    )\n",
"else:\n",
"    print(\"\\nScheme B found no matching molecules\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"## Results analysis and summary\n",
"\n",
"### Screening statistics"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
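"=== Screening statistics ===\n",
"Total molecules: 3276\n",
"Scheme A (heteroaryl halides) matches: 57\n",
"Scheme B (all aryl halides) matches: 548\n",
"Molecules matched by both schemes: 57\n",
"Molecules matched only by Scheme A: 0\n",
"Molecules matched only by Scheme B: 490\n"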
|
||
]
|
||
}
|
||
],
|
||
"source": [
"# Screening statistics\n",
"print(\"=== Screening statistics ===\")\n",
"print(f\"Total molecules: {len(df)}\")\n",
"print(f\"Scheme A (heteroaryl halides) matches: {len(scheme_a_results)}\")\n",
"print(f\"Scheme B (all aryl halides) matches: {len(scheme_b_results)}\")\n",
"\n",
"if len(scheme_a_results) > 0 and len(scheme_b_results) > 0:\n",
"    # Overlap analysis, keyed on unique CAS numbers\n",
"    scheme_a_cas = set(scheme_a_results['CAS'].dropna())\n",
"    scheme_b_cas = set(scheme_b_results['CAS'].dropna())\n",
"    overlap = scheme_a_cas & scheme_b_cas\n",
"    print(f\"Molecules matched by both schemes: {len(overlap)}\")\n",
"\n",
"    # Unique to Scheme A\n",
"    scheme_a_only = scheme_a_cas - scheme_b_cas\n",
"    print(f\"Molecules matched only by Scheme A: {len(scheme_a_only)}\")\n",
"\n",
"    # Unique to Scheme B\n",
"    scheme_b_only = scheme_b_cas - scheme_a_cas\n",
"    print(f\"Molecules matched only by Scheme B: {len(scheme_b_only)}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
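"## Usage recommendations\n",
"\n",
"### Recommended priorities\n",
"\n",
"1. **First priority**: Scheme A results\n",
"   - Heteroaryl halogens are the most likely to be Sandmeyer products\n",
"   - The candidate set is relatively small, which makes in-depth study practical\n",
"   - Focus on checking the synthetic routes of these molecules\n",
"\n",
"2. **Second priority**: results unique to Scheme B\n",
"   - Halogens on benzene rings can come from many routes\n",
"   - The synthetic plausibility needs careful assessment\n",
"   - Suitable as a supplementary screen\n",
"\n",
"### Follow-up validation steps\n",
"\n",
"1. **Literature survey**: look up the synthetic routes of the candidate molecules\n",
"2. **Reaction-condition assessment**: confirm whether a Sandmeyer reaction was actually used\n",
"3. **Economic analysis**: assess the potential of the Zhang Xiaheng reaction for the molecule\n",
"4. **Experimental validation**: run small-scale validation experiments where necessary\n",
"\n",
"### Notes\n",
"\n",
"- This screen is based on structural features and does not confirm the synthetic route\n",
"- Some halogens may come from the starting materials rather than from a synthetic step\n",
"- Molecular complexity and synthetic feasibility need to be weighed together\n",
"- Prioritize candidates by combining each drug's importance with its market size"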
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
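"## Molecular structure visualization\n",
"\n",
"### Creating the output directories and visualization functions\n",
"\n",
"This section generates high-resolution SVG structure drawings for the screened candidates, with the halogen atoms highlighted."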
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": []
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.14.0"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 4
|
||
}
|