macro_split/notebooks/screen_sandmeyer_candidates.ipynb
2025-11-14 20:34:58 +08:00

{
"cells": [
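{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Screening Sandmeyer Reaction Candidate Drugs: Zhang Xiaheng Reaction Replacement Analysis\n",
"\n",
"## Background\n",
"\n",
"### The Sandmeyer reaction in brief\n",
"The Sandmeyer reaction is the classic method for converting aromatic amines via a diazonium salt:\n",
"\n",
"**Ar-NH₂ → [Ar-N₂]⁺ → Ar-X**\n",
"\n",
"where X = Cl, Br, I, CN, OH, SCN, and so on.\n",
"\n",
"### The Zhang Xiaheng reaction\n",
"According to the paper `s41586-025-09791-5_reference.pdf`, the Zhang Xiaheng (张夏恒) reaction may be a new alternative\n",
"that converts aromatic amines to aryl halides more efficiently.\n",
"\n",
"### Screening strategy\n",
"We identify aryl halide substructures in drug molecules that may have been introduced by a Sandmeyer reaction,\n",
"and from them pick out candidate drugs whose process could be optimized with the Zhang Xiaheng reaction.\n",
"\n",
"**Important caveats:**\n",
"- The screen is based on molecular structure alone.\n",
"- The actual synthetic route must be confirmed from the literature.\n",
"- Not every halogen-containing drug is made by a Sandmeyer reaction."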
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import required libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from html import escape\n",
"from pathlib import Path\n",
"\n",
"from IPython.display import SVG, display\n",
"from rdkit import Chem  # used by the screening cells below\n",
"from rdkit.Chem import AllChem\n",
"from rdkit.Chem.Draw import rdMolDraw2D\n",
"\n",
"# Create the output directories\n",
"output_base = Path(\"../data/drug_targetmol\")\n",
"scheme_a_dir = output_base / \"scheme_A_visualizations\"\n",
"scheme_b_dir = output_base / \"scheme_B_visualizations\"\n",
"\n",
"# parents=True so this also works when output_base does not exist yet\n",
"scheme_a_dir.mkdir(parents=True, exist_ok=True)\n",
"scheme_b_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"Created directory: {scheme_a_dir}\")\n",
"print(f\"Created directory: {scheme_b_dir}\")\n",
"\n",
"def generate_highlighted_svg(mol, highlight_atoms, filename, title=\"\"):\n",
"    \"\"\"Render a high-resolution SVG of the molecule with the given atoms highlighted.\"\"\"\n",
"    # Compute 2D coordinates\n",
"    AllChem.Compute2DCoords(mol)\n",
"\n",
"    # Create the SVG drawer; the large canvas keeps the picture sharp\n",
"    drawer = rdMolDraw2D.MolDraw2DSVG(1200, 900)\n",
"    drawer.SetFontSize(12)\n",
"\n",
"    # Drawing options\n",
"    draw_options = drawer.drawOptions()\n",
"    draw_options.addAtomIndices = False  # no atom indices, to keep the picture clean\n",
"    draw_options.addBondIndices = False\n",
"    draw_options.addStereoAnnotation = True\n",
"    draw_options.fixedFontSize = 12\n",
"\n",
"    # Highlight the halogen atoms in red\n",
"    atom_colors = {atom_idx: (1.0, 0.3, 0.3) for atom_idx in highlight_atoms}\n",
"\n",
"    # Draw the molecule\n",
"    drawer.DrawMolecule(mol,\n",
"                        highlightAtoms=highlight_atoms,\n",
"                        highlightAtomColors=atom_colors)\n",
"\n",
"    drawer.FinishDrawing()\n",
"    svg_content = drawer.GetDrawingText()\n",
"\n",
"    # Add the title\n",
"    if title:\n",
"        # Single-quoted f-string so the attribute quotes need no escaping;\n",
"        # escape() keeps names containing & or < from breaking the SVG\n",
"        title_tag = f'<text x=\"50%\" y=\"30\" text-anchor=\"middle\" font-size=\"16\" font-weight=\"bold\">{escape(title)}</text>'\n",
"        svg_lines = svg_content.split(\"\\n\")\n",
"        # Insert the title before the first transformed <g> element\n",
"        for i, line in enumerate(svg_lines):\n",
"            if \"<g \" in line and \"transform\" in line:\n",
"                svg_lines.insert(i, title_tag)\n",
"                svg_with_title = \"\\n\".join(svg_lines)\n",
"                break\n",
"        else:\n",
"            # No such element: put the title just before the closing tag instead\n",
"            svg_with_title = svg_content.replace(\"</svg>\", title_tag + \"\\n</svg>\")\n",
"    else:\n",
"        svg_with_title = svg_content\n",
"\n",
"    # Save the file\n",
"    with open(filename, \"w\", encoding=\"utf-8\") as f:\n",
"        f.write(svg_with_title)\n",
"\n",
"    print(f\"Saved SVG: {filename}\")\n",
"\n",
"    return svg_content\n",
"\n",
"def visualize_molecules(df, output_dir, scheme_name, max_molecules=50):\n",
"    \"\"\"Generate a structure image for each molecule in the DataFrame.\"\"\"\n",
"    print(f\"\\nGenerating images for {scheme_name}...\")\n",
"    print(f\"Output directory: {output_dir}\")\n",
"\n",
"    generated_count = 0\n",
"\n",
"    for idx, row in df.iterrows():\n",
"        if generated_count >= max_molecules:\n",
"            print(f\"Reached the maximum number of images ({max_molecules}); stopping\")\n",
"            break\n",
"\n",
"        cas = str(row.get(\"CAS\", \"unknown\")).strip()\n",
"        name = str(row.get(\"Name\", \"unknown\")).strip()\n",
"\n",
"        # Skip rows without a usable CAS number\n",
"        if not cas or cas in (\"nan\", \"unknown\"):\n",
"            continue\n",
"\n",
"        mol = row.get(\"ROMol\")\n",
"        if mol is None:\n",
"            continue\n",
"\n",
"        # Collect the halogen atoms: F, Cl, Br, I\n",
"        halogen_atoms = [atom.GetIdx() for atom in mol.GetAtoms()\n",
"                         if atom.GetAtomicNum() in (9, 17, 35, 53)]\n",
"\n",
"        if not halogen_atoms:\n",
"            continue\n",
"\n",
"        # Build the file name and title\n",
"        filename = output_dir / f\"{cas}.svg\"\n",
"        title = f\"{name} ({cas})\"\n",
"\n",
"        try:\n",
"            # Generate the SVG\n",
"            generate_highlighted_svg(mol, halogen_atoms, filename, title)\n",
"            generated_count += 1\n",
"\n",
"            # Report progress every 10 molecules\n",
"            if generated_count % 10 == 0:\n",
"                print(f\"Generated {generated_count} molecule images\")\n",
"\n",
"        except Exception as e:\n",
"            print(f\"Failed to generate {cas}: {e}\")\n",
"            continue\n",
"\n",
"    print(f\"Done! Generated {generated_count} images for {scheme_name}\")\n",
"    return generated_count\n"
]
},
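{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the screening patterns\n",
"\n",
"### About the SMARTS patterns\n",
"\n",
"#### Scheme A: heteroaryl halides (highest priority)\n",
"- **Screening logic**: halogens on heteroaromatic rings are the most likely to be Sandmeyer products.\n",
"- **Rationale**: direct halogenation of heteroaromatic rings is usually difficult, so the Sandmeyer reaction is an important route to them.\n",
"- **Expected result**: few candidates, but high precision.\n",
"\n",
"#### Scheme B: all aryl halides (medium priority)\n",
"- **Screening logic**: halogens on any aromatic ring.\n",
"- **Rationale**: some of these halogens may come from other routes, but the wider net enlarges the candidate pool.\n",
"- **Expected result**: many candidates, which need more literature verification.\n",
"\n",
"**Notes on the SMARTS refinements:**\n",
"- The original pattern `n:c:[Cl,Br,I]` was wrong: it asks for an aromatic bond (`:`) to the halogen, so it can never match.\n",
"- It was replaced by patterns that match the ring structures more accurately.\n",
"- The new patterns describe the atom environment more precisely."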
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Screening patterns\n",
"SCREENING_PATTERNS = {\n",
"    'heteroaryl_halides': {\n",
"        'name': 'Heteroaryl halides',\n",
"        'description': 'Cl, Br, or I on a heteroaromatic ring (scheme A)',\n",
"        'smarts': [\n",
"            '[n,o,s]c[Cl,Br,I]',  # halogen on the carbon next to a heteroatom\n",
"            '[n,o,s]cc[Cl,Br,I]',  # halogen one carbon removed from the heteroatom\n",
"            'c1[n,o,s]c([Cl,Br,I])ccc1',  # six-membered ring, halogen next to the heteroatom (e.g. 2-halopyridine)\n",
"            'c1c([Cl,Br,I])cncn1',  # 5-halopyrimidine\n",
"            'c1ccc2c([Cl,Br,I])ccnc2c1',  # 4-haloquinoline\n",
"            'c1c([Cl,Br,I])cncc1',  # 3-halopyridine\n",
"            'c1([Cl,Br,I])scnc1',  # 5-halothiazole\n",
"        ],\n",
"        'scheme': 'A'\n",
"    },\n",
"    'aryl_halides': {\n",
"        'name': 'Aryl halides',\n",
"        'description': 'Cl, Br, or I on any aromatic ring (scheme B)',\n",
"        'smarts': [\n",
"            'c[Cl,Br,I]',  # any aryl Cl, Br, or I\n",
"            'c-C#N',  # aryl nitrile\n",
"            'c1ccc(S(=O)(=O)N)cc1',  # benzenesulfonamide core\n",
"            'c1c(Cl)cc(Cl)cc1',  # 1,3-dichlorobenzene motif\n",
"        ],\n",
"        'scheme': 'B'\n",
"    }\n",
"}\n",
"\n",
"def create_pattern_matchers():\n",
"    \"\"\"Compile every SMARTS string into an RDKit query molecule.\"\"\"\n",
"    matchers = {}\n",
"    for key, pattern_info in SCREENING_PATTERNS.items():\n",
"        matchers[key] = {\n",
"            'info': pattern_info,\n",
"            # MolFromSmarts returns None for a SMARTS string it cannot parse\n",
"            'matchers': [Chem.MolFromSmarts(smarts) for smarts in pattern_info['smarts']]\n",
"        }\n",
"    return matchers\n",
"\n",
"PATTERN_MATCHERS = create_pattern_matchers()"
]
},
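{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data loading and preprocessing\n",
"\n",
"### About the SDF file\n",
"- Location: `data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
"- Contains drug structures together with rich property annotations.\n",
"- Each record includes the SMILES, molecular formula, molecular weight, approval status, indications, and more."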
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reading SDF file...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[15:05:51] Both bonds on one end of an atropisomer are on the same side - atoms is : 3\n",
"[15:05:51] Explicit valence for atom # 2 N greater than permitted\n",
"[15:05:51] ERROR: Could not sanitize molecule ending on line 217340\n",
"[15:05:51] ERROR: Explicit valence for atom # 2 N greater than permitted\n",
"[15:05:52] Explicit valence for atom # 4 N greater than permitted\n",
"[15:05:52] ERROR: Could not sanitize molecule ending on line 317283\n",
"[15:05:52] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
"[15:05:52] Explicit valence for atom # 4 N greater than permitted\n",
"[15:05:52] ERROR: Could not sanitize molecule ending on line 324666\n",
"[15:05:52] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
"[15:05:52] Explicit valence for atom # 5 N greater than permitted\n",
"[15:05:52] ERROR: Could not sanitize molecule ending on line 365883\n",
"[15:05:52] ERROR: Explicit valence for atom # 5 N greater than permitted\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Successfully loaded 3276 molecules\n",
"\n",
"Data overview:\n",
" Index Plate Row Col ID Name \\\n",
"0 1 L1010-1 a 2 Dexamethasone \n",
"1 2 L1010-1 a 3 Danicopan \n",
"2 3 L1010-1 a 4 Cyclosporin A \n",
"3 4 L1010-1 a 5 L-Carnitine \n",
"4 5 L1010-1 a 6 Trimetazidine dihydrochloride \n",
"\n",
" Synonyms CAS \\\n",
"0 MK 125;Prednisolone F;NSC 34521;Hexadecadrol 50-02-2 \n",
"1 ACH-4471 1903768-17-1 \n",
"2 Cyclosporine A;Ciclosporin;Cyclosporine 59865-13-3 \n",
"3 L(-)-Carnitine;Levocarnitine 541-15-1 \n",
"4 Yoshimilon;Kyurinett;Vastarel F 13171-25-0 \n",
"\n",
" SMILES \\\n",
"0 C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO \n",
"1 CC(=O)c1nn(CC(=O)N2C[C@H](F)C[C@H]2C(=O)Nc2cccc(Br)n2)c2ccc(cc12)-c1cnc(C)nc1 \n",
"2 [C@H]([C@@H](C/C=C/C)C)(O)[C@@]1(N(C)C(=O)[C@H]([C@@H](C)C)N(C)C(=O)[C@H](CC(C)C)N(C)C(=O)[C@H](... \n",
"3 C[N+](C)(C)C[C@@H](O)CC([O-])=O \n",
"4 Cl.Cl.COC1=C(OC)C(OC)=C(CN2CCNCC2)C=C1 \n",
"\n",
" Formula MolWt Approved status \\\n",
"0 C22H29FO5 392.46 NMPA;EMA;FDA \n",
"1 C26H23BrFN7O3 580.41 FDA \n",
"2 C62H111N11O12 1202.61 FDA \n",
"3 C7H15NO3 161.2 FDA \n",
"4 C14H24Cl2N2O3 339.258 NMPA;EMA \n",
"\n",
" Pharmacopoeia \\\n",
"0 USP39-NF34;BP2015;JP16;IP2010 \n",
"1 NaN \n",
"2 Martindale the Extra Pharmacopoei, EP10.2, USP43-NF38, Ph.Int_6th, JP17 \n",
"3 NaN \n",
"4 BP2019;KP ;EP9.2;IP2010;JP17;Martindale:The Extra Pharmacopoeia \n",
"\n",
" Disease \\\n",
"0 Metabolism \n",
"1 Others \n",
"2 Immune system \n",
"3 Cardiovascular system \n",
"4 Cardiovascular system \n",
"\n",
" Pathways \\\n",
"0 Antibody-drug Conjugate/ADC Related;Autophagy;Endocrinology/Hormones;Immunology/Inflammation;Mic... \n",
"1 Immunology/Inflammation \n",
"2 Immunology/Inflammation;Metabolism;Microbiology/Virology \n",
"3 Metabolism \n",
"4 Autophagy;Metabolism \n",
"\n",
" Target \\\n",
"0 Antibacterial;Antibiotic;Autophagy;Complement System;Glucocorticoid Receptor;IL Receptor;Mitopha... \n",
"1 Complement System \n",
"2 Phosphatase;Antibiotic;Complement System \n",
"3 Endogenous Metabolite;Fatty Acid Synthase \n",
"4 Autophagy;Fatty Acid Synthase \n",
"\n",
" Receptor \\\n",
"0 Antibiotic; Autophagy; Bacterial; Complement System; Glucocorticoid Receptor; IL receptor; Mitop... \n",
"1 Complement System; factor D \n",
"2 Antibiotic; calcineurin phosphatase; Complement System; Phosphatase \n",
"3 Endogenous Metabolite; FAS \n",
"4 Autophagy; mitochondrial long-chain 3-ketoacyl thiolase \n",
"\n",
" Bioactivity \\\n",
"0 Dexamethasone is a glucocorticoid receptor agonist and IL receptor modulator with anti-inflammat... \n",
"1 Danicopan (ACH-4471) (ACH-4471) is a selective, orally active small molecule factor D inhibitor ... \n",
"2 Cyclosporin A is a natural product and an active fungal metabolite, classified as a cyclic polyp... \n",
"3 L-Carnitine (L(-)-Carnitine) is an amino acid derivative. L-Carnitine facilitates long-chain fat... \n",
"4 Trimetazidine dihydrochloride (Vastarel F) can improve myocardial glucose utilization by inhibit... \n",
"\n",
" Reference \\\n",
"0 Li M, Yu H. Identification of WP1066, an inhibitor of JAK2 and STAT3, as a Kv1. 3 potassium chan... \n",
"1 Yuan X, et al. Small-molecule factor D inhibitors selectively block the alternative pathway of c... \n",
"2 D'Angelo G, et al. Cyclosporin A prevents the hypoxic adaptation by activating hypoxia-inducible... \n",
"3 Jogl G, Tong L. Cell. 2003 Jan 10; 112(1):113-22. \n",
"4 Yang Q, et al. Int J Clin Exp Pathol. 2015, 8(4):3735-3741.;Liu Z, et al. Metabolism. 2016, 65(3... \n",
"\n",
" ROMol \n",
"0 <rdkit.Chem.rdchem.Mol object at 0x743d782049e0> \n",
"1 <rdkit.Chem.rdchem.Mol object at 0x743d782871b0> \n",
"2 <rdkit.Chem.rdchem.Mol object at 0x743d78287220> \n",
"3 <rdkit.Chem.rdchem.Mol object at 0x743d782873e0> \n",
"4 <rdkit.Chem.rdchem.Mol object at 0x743d78287450> \n",
"\n",
"列名:['Index', 'Plate', 'Row', 'Col', 'ID', 'Name', 'Synonyms', 'CAS', 'SMILES', 'Formula', 'MolWt', 'Approved status', 'Pharmacopoeia', 'Disease', 'Pathways', 'Target', 'Receptor', 'Bioactivity', 'Reference', 'ROMol']\n"
]
}
],
"source": [
"# Load the screening-result CSV files\n",
"# Note: the screening cells below operate on `df`, the full library loaded from the SDF file;\n",
"# this cell only reloads the results they saved.\n",
"import pandas as pd\n",
"from rdkit import Chem\n",
"\n",
"print(\"Reading the screening-result CSV files...\")\n",
"\n",
"# Scheme A results\n",
"df_a = pd.read_csv(\"../data/drug_targetmol/sandmeyer_candidates_scheme_A_heteroaryl_halides.csv\")\n",
"print(f\"Scheme A data: {len(df_a)} rows\")\n",
"\n",
"# Scheme B results\n",
"df_b = pd.read_csv(\"../data/drug_targetmol/sandmeyer_candidates_scheme_B_aryl_halides.csv\")\n",
"print(f\"Scheme B data: {len(df_b)} rows\")\n",
"\n",
"def rebuild_molecules(df):\n",
"    \"\"\"Rebuild RDKit Mol objects from the SMILES_from_mol column.\"\"\"\n",
"    mols = []\n",
"    for idx, row in df.iterrows():\n",
"        smiles = row.get(\"SMILES_from_mol\", \"\")\n",
"        if pd.notna(smiles) and str(smiles).strip():\n",
"            # MolFromSmiles returns None for SMILES it cannot parse\n",
"            mols.append(Chem.MolFromSmiles(str(smiles)))\n",
"        else:\n",
"            mols.append(None)\n",
"    df[\"ROMol\"] = mols\n",
"    valid_mols = sum(1 for m in mols if m is not None)\n",
"    print(f\"Rebuilt {valid_mols} molecule objects\")\n",
"    return df\n",
"\n",
"df_a = rebuild_molecules(df_a)\n",
"df_b = rebuild_molecules(df_b)\n",
"\n",
"print(\"\\nData loading complete\")\n"
]
},
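{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Molecule screening functions\n",
"\n",
"### How the screen works\n",
"\n",
"1. **Molecule validation**: make sure the molecular structure is valid.\n",
"2. **Substructure matching**: use RDKit SMARTS matching.\n",
"3. **Result recording**: record the matched patterns and the specific substructures.\n",
"4. **Data integrity**: keep all original property columns."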
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def screen_molecules_for_patterns(df, pattern_key):\n",
"    \"\"\"\n",
"    Select the molecules that contain any substructure of a screening pattern.\n",
"\n",
"    Args:\n",
"        df: DataFrame holding the molecules (ROMol column)\n",
"        pattern_key: key of the screening pattern in PATTERN_MATCHERS\n",
"\n",
"    Returns:\n",
"        DataFrame of the matching molecules\n",
"    \"\"\"\n",
"    pattern_info = PATTERN_MATCHERS[pattern_key]['info']\n",
"    matchers = PATTERN_MATCHERS[pattern_key]['matchers']\n",
"\n",
"    print(f\"\\nScreening: {pattern_info['name']}\")\n",
"    print(f\"Description: {pattern_info['description']}\")\n",
"    print(f\"Number of SMARTS patterns: {len(pattern_info['smarts'])}\")\n",
"\n",
"    matched_molecules = []\n",
"\n",
"    for idx, row in df.iterrows():\n",
"        mol = row['ROMol']\n",
"        if mol is None:\n",
"            continue\n",
"\n",
"        # Check the molecule against every pattern\n",
"        matched_patterns = []\n",
"        for i, matcher in enumerate(matchers):\n",
"            if matcher is None:  # SMARTS that failed to parse\n",
"                continue\n",
"            if mol.HasSubstructMatch(matcher):\n",
"                matched_patterns.append({\n",
"                    'pattern_index': i,\n",
"                    'smarts': pattern_info['smarts'][i],\n",
"                    'matches': len(mol.GetSubstructMatches(matcher))\n",
"                })\n",
"\n",
"        if matched_patterns:\n",
"            # Build the match record on top of the original row\n",
"            match_record = row.copy()\n",
"            match_record['matched_patterns'] = matched_patterns\n",
"            match_record['total_matches'] = sum(p['matches'] for p in matched_patterns)\n",
"            match_record['screening_scheme'] = pattern_info['scheme']\n",
"            matched_molecules.append(match_record)\n",
"\n",
"    result_df = pd.DataFrame(matched_molecules)\n",
"    print(f\"Found {len(result_df)} matching molecules\")\n",
"\n",
"    return result_df\n",
"\n",
"def save_screening_results(df, filename, description):\n",
"    \"\"\"Save the screening results to CSV.\"\"\"\n",
"    output_path = f\"../data/drug_targetmol/{filename}\"\n",
"\n",
"    # Convert the ROMol column to SMILES, because Mol objects cannot be stored in a CSV\n",
"    df_export = df.copy()\n",
"    if 'ROMol' in df_export.columns:\n",
"        df_export['SMILES_from_mol'] = df_export['ROMol'].apply(\n",
"            lambda x: Chem.MolToSmiles(x) if x is not None else '')\n",
"        df_export = df_export.drop('ROMol', axis=1)\n",
"\n",
"    df_export.to_csv(output_path, index=False, encoding='utf-8')\n",
"    print(f\"Results saved to: {output_path}\")\n",
"    print(f\"Contains {len(df_export)} molecules and {len(df_export.columns)} property columns\")\n",
"\n",
"    return output_path"
]
},
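{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scheme A screen: heteroaryl halides\n",
"\n",
"### Execution logic\n",
"- Uses the most conservative screening strategy.\n",
"- Matches only halogens on heteroaromatic rings.\n",
"- Expected to give high-precision results.\n",
"- Synthetic routes still need further verification."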
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
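{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Screening: Heteroaryl halides\n",
"Description: Cl, Br, or I on a heteroaromatic ring (scheme A)\n",
"Number of SMARTS patterns: 7\n",
"Found 57 matching molecules\n",
"\n",
"Scheme A screening results (summary):\n",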
" Name CAS Formula \\\n",
"1 Danicopan 1903768-17-1 C26H23BrFN7O3 \n",
"8 Lonafarnib 193275-84-2 C27H31Br2ClN4O2 \n",
"19 Idoxuridine 54-42-2 C9H11IN2O5 \n",
"144 Dimenhydrinate 523-87-5 C24H28ClN5O3 \n",
"259 Sertaconazole 99592-32-2 C20H15Cl3N2OS \n",
"311 Tioconazole 65899-73-2 C16H13Cl3N2OS \n",
"337 Gimeracil 103766-25-2 C5H4ClNO2 \n",
"580 Bromocriptine mesylate 22260-51-1 C33H44BrN5O8S \n",
"592 Clofarabine 123318-82-1 C10H11ClFN5O3 \n",
"684 Vorasidenib 1644545-52-7 C14H13ClF6N6 \n",
"\n",
" matched_patterns \\\n",
"1 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}, {'pattern_index': 2, 'smarts... \n",
"8 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}, {'pattern_index': 5, 'smart... \n",
"19 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 2}, {'pattern_index': 3, 'smart... \n",
"144 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 2}] \n",
"259 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}] \n",
"311 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}] \n",
"337 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}, {'pattern_index': 5, 'smart... \n",
"580 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}] \n",
"592 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 2}] \n",
"684 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}, {'pattern_index': 2, 'smarts... \n",
"\n",
" total_matches \n",
"1 2 \n",
"8 2 \n",
"19 3 \n",
"144 2 \n",
"259 1 \n",
"311 1 \n",
"337 2 \n",
"580 1 \n",
"592 2 \n",
"684 2 \n",
"结果已保存到:../data/drug_targetmol/sandmeyer_candidates_scheme_A_heteroaryl_halides.csv\n",
"包含 57 个分子23 个属性列\n"
]
}
],
"source": [
"# Run the scheme A screen\n",
"scheme_a_results = screen_molecules_for_patterns(df, 'heteroaryl_halides')\n",
"\n",
"# Show a summary of the results\n",
"if len(scheme_a_results) > 0:\n",
"    print(\"\\nScheme A screening results (summary):\")\n",
"    print(scheme_a_results[['Name', 'CAS', 'Formula', 'matched_patterns', 'total_matches']].head(10))\n",
"\n",
"    # Save the results\n",
"    save_screening_results(\n",
"        scheme_a_results,\n",
"        'sandmeyer_candidates_scheme_A_heteroaryl_halides.csv',\n",
"        'Scheme A: heteroaryl halide screening results'\n",
"    )\n",
"else:\n",
"    print(\"\\nScheme A found no matching molecules\")"
]
},
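{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scheme B screen: all aryl halides\n",
"\n",
"### Execution logic\n",
"- Uses a more permissive screening strategy.\n",
"- Matches halogens on any aromatic ring.\n",
"- Yields more candidate molecules.\n",
"- Requires more literature verification."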
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
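{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Screening: Aryl halides\n",
"Description: Cl, Br, or I on any aromatic ring (scheme B)\n",
"Number of SMARTS patterns: 4\n",
"Found 548 matching molecules\n",
"\n",
"Scheme B screening results (summary):\n",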
" Name CAS Formula \\\n",
"1 Danicopan 1903768-17-1 C26H23BrFN7O3 \n",
"8 Lonafarnib 193275-84-2 C27H31Br2ClN4O2 \n",
"9 Ketoconazole 65277-42-1 C26H28Cl2N4O4 \n",
"13 Ozanimod 1306760-87-1 C23H24N4O3 \n",
"14 Ponesimod 854107-55-4 C23H25ClN2O4S \n",
"19 Idoxuridine 54-42-2 C9H11IN2O5 \n",
"53 Moclobemide 71320-77-9 C13H17ClN2O2 \n",
"74 Clemastine 15686-51-8 C21H26ClNO \n",
"75 Buclizine dihydrochloride 129-74-8 C28H33ClN2·2HCl \n",
"78 Asenapine 65576-45-6 C17H16ClNO \n",
"\n",
" matched_patterns \\\n",
"1 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"8 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 3}] \n",
"9 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 2}, {'pattern_index': 3, 'smarts': 'c1c... \n",
"13 [{'pattern_index': 1, 'smarts': 'c-C#N', 'matches': 1}] \n",
"14 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"19 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"53 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"74 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"75 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"78 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"\n",
" total_matches \n",
"1 1 \n",
"8 3 \n",
"9 3 \n",
"13 1 \n",
"14 1 \n",
"19 1 \n",
"53 1 \n",
"74 1 \n",
"75 1 \n",
"78 1 \n",
"结果已保存到:../data/drug_targetmol/sandmeyer_candidates_scheme_B_aryl_halides.csv\n",
"包含 548 个分子23 个属性列\n"
]
}
],
"source": [
"# Run the scheme B screen\n",
"scheme_b_results = screen_molecules_for_patterns(df, 'aryl_halides')\n",
"\n",
"# Show a summary of the results\n",
"if len(scheme_b_results) > 0:\n",
"    print(\"\\nScheme B screening results (summary):\")\n",
"    print(scheme_b_results[['Name', 'CAS', 'Formula', 'matched_patterns', 'total_matches']].head(10))\n",
"\n",
"    # Save the results\n",
"    save_screening_results(\n",
"        scheme_b_results,\n",
"        'sandmeyer_candidates_scheme_B_aryl_halides.csv',\n",
"        'Scheme B: all aryl halide screening results'\n",
"    )\n",
"else:\n",
"    print(\"\\nScheme B found no matching molecules\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Result analysis and summary\n",
"\n",
"### Screening statistics"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
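{
"name": "stdout",
"output_type": "stream",
"text": [
"=== Screening statistics ===\n",
"Total molecules: 3276\n",
"Scheme A (heteroaryl halides) matches: 57\n",
"Scheme B (all aryl halides) matches: 548\n",
"Molecules matched by both schemes: 57\n",
"Molecules matched by scheme A only: 0\n",
"Molecules matched by scheme B only: 490\n"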
]
}
],
"source": [
"# Result statistics\n",
"print(\"=== Screening statistics ===\")\n",
"print(f\"Total molecules: {len(df)}\")\n",
"print(f\"Scheme A (heteroaryl halides) matches: {len(scheme_a_results)}\")\n",
"print(f\"Scheme B (all aryl halides) matches: {len(scheme_b_results)}\")\n",
"\n",
"if len(scheme_a_results) > 0 and len(scheme_b_results) > 0:\n",
"    # Overlap analysis, keyed on the unique non-missing CAS numbers\n",
"    scheme_a_cas = set(scheme_a_results['CAS'].dropna())\n",
"    scheme_b_cas = set(scheme_b_results['CAS'].dropna())\n",
"    overlap = scheme_a_cas & scheme_b_cas\n",
"    print(f\"Molecules matched by both schemes: {len(overlap)}\")\n",
"\n",
"    # Unique to scheme A\n",
"    scheme_a_only = scheme_a_cas - scheme_b_cas\n",
"    print(f\"Molecules matched by scheme A only: {len(scheme_a_only)}\")\n",
"\n",
"    # Unique to scheme B\n",
"    scheme_b_only = scheme_b_cas - scheme_a_cas\n",
"    print(f\"Molecules matched by scheme B only: {len(scheme_b_only)}\")"
]
},
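{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage recommendations\n",
"\n",
"### Recommended priorities\n",
"\n",
"1. **First priority**: scheme A results\n",
"   - Heteroaryl halides are the most likely Sandmeyer products.\n",
"   - The candidate list is relatively short, which makes in-depth study practical.\n",
"   - Focus on looking up the synthetic routes of these molecules.\n",
"\n",
"2. **Second priority**: results unique to scheme B\n",
"   - Halogens on benzene rings can come from many routes.\n",
"   - The likelihood of a Sandmeyer step needs careful assessment.\n",
"   - Best used as a supplementary screen.\n",
"\n",
"### Follow-up verification steps\n",
"\n",
"1. **Literature survey**: look up the synthetic routes of the candidate molecules.\n",
"2. **Reaction assessment**: confirm whether a Sandmeyer reaction was actually used.\n",
"3. **Economic analysis**: assess the potential of the Zhang Xiaheng reaction for the molecule.\n",
"4. **Experimental validation**: run small-scale validation experiments where necessary.\n",
"\n",
"### Notes\n",
"\n",
"- The screen is based on structural features and does not confirm the synthetic route.\n",
"- Some halogens may come from starting materials rather than from a synthetic step.\n",
"- Molecular complexity and synthetic feasibility need to be weighed together.\n",
"- Consider ranking candidates by the drug's importance and market size."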
]
},
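{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Molecular structure visualization\n",
"\n",
"### Create output directories and visualization functions\n",
"\n",
"This section generates high-resolution SVG structure drawings of the screened candidates, with the halogens highlighted."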
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}