first add

This commit is contained in:
2025-11-14 20:34:58 +08:00
commit 0d99f7d12c
46 changed files with 698209 additions and 0 deletions


@@ -0,0 +1,190 @@
# Analysis Guide for 16-Membered-Ring Macrolactone Molecules
## Files
- **Notebook**: `analyze_ring16_molecules.ipynb`
- **Input data**: `../output/ring16_match_smarts.csv` (307 molecules)
- **Output directory**: `../output/`
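## Usage
### 1. Activate the environment
```bash
cd /home/zly/project/macro_split
pixi shell
```
### 2. Run the notebook
```bash
jupyter notebook notebooks/analyze_ring16_molecules.ipynb
```
Or use JupyterLab:
```bash
jupyter lab notebooks/analyze_ring16_molecules.ipynb
```
### 3. Run all cells in order
The notebook automatically:
1. Computes drug-like properties for every molecule (molecular weight, LogP, QED, TPSA, etc.)
2. Performs side-chain cleavage analysis
3. Tallies the fragment distribution at each position (3-16)
4. Generates distribution plots and saves them to the `output/` directory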
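## Generated output files
### Image files
- `ring16_molecular_properties_distribution.png` - distributions of molecular properties (4 subplots)
  - Molecular weight distribution
  - LogP distribution
  - QED distribution
  - TPSA distribution
- `atom_count_distribution_ring16.png` - atom-count distribution at each position (14 subplots, positions 3-16)
- `molecular_weight_distribution_ring16.png` - molecular-weight distribution at each position (14 subplots, positions 3-16)
### Data files
- `ring16_fragments_analysis.csv` - detailed information on every fragment
  - Columns: `fragment_id`, `parent_id`, `parent_smiles`, `cleavage_position`, `fragment_smiles`, `atom_count`, `molecular_weight`
- `ring16_molecular_properties.csv` - property data for every molecule
  - Columns: `unique_id`, `mol_weight`, `logP`, `num_h_donors`, `num_h_acceptors`, `num_rotatable_bonds`, `tpsa`, `qed`, `num_atoms`, `num_heavy_atoms`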
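As a rough illustration of how the fragment table could be consumed outside the notebook, the sketch below groups rows by `cleavage_position` and averages `atom_count` and `molecular_weight` using only the standard library. The column names come from the list above; the sample rows, the `*`-prefixed fragment SMILES, and the helper name `summarize_by_position` are invented for the example and are not taken from the real output.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows that follow the documented column names.
SAMPLE_CSV = """fragment_id,parent_id,parent_smiles,cleavage_position,fragment_smiles,atom_count,molecular_weight
f1,m1,,3,*C,1,15.03
f2,m1,,5,*O,1,17.01
f3,m2,,3,*CC,2,29.06
f4,m2,,5,*O,1,17.01
"""


def summarize_by_position(handle):
    """Group fragment rows by cleavage position and average their sizes."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[int(row["cleavage_position"])].append(row)

    summary = {}
    for position, rows in sorted(groups.items()):
        summary[position] = {
            "n_fragments": len(rows),
            "mean_atom_count": mean(int(r["atom_count"]) for r in rows),
            "mean_molecular_weight": mean(float(r["molecular_weight"]) for r in rows),
        }
    return summary


summary = summarize_by_position(io.StringIO(SAMPLE_CSV))
for position, stats in summary.items():
    print(position, stats)
```

To run it on the real file, pass an open file handle instead of the in-memory sample, for example `open("../output/ring16_fragments_analysis.csv", newline="")`.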
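## Analysis contents
### Completed
1. **Basic molecular property calculation**
   - Molecular weight, LogP, QED, TPSA
   - Hydrogen-bond donor and acceptor counts, rotatable-bond count
   - Atom count, heavy-atom count
2. **Side-chain cleavage analysis**
   - Uses the prepackaged `MacrolactoneFragmenter` class
   - Batch-processes all 307 molecules
   - Tallies fragment types and counts at each position
3. **Distribution plots**
   - Follows the plotting logic of `test_align_two_molecules.ipynb`
   - A 4x4 subplot grid shows the distributions for positions 3-16
   - Plotted with seaborn and matplotlib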
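### Suggested extended analyses
The last cell of the notebook (Section 9) gives detailed suggestions for extended analyses, including:
#### Priority 1 (strongly recommended) ⭐⭐⭐
- **LogP analysis**: identify the side-chain positions that contribute most to lipophilicity
- **QED analysis**: compare the side chains of high-QED and low-QED molecules
- **TPSA analysis**: analyze the distribution pattern of polar side chains
#### Priority 2 (important) ⭐⭐⭐
- **SAR analysis**: if activity data (`max_pChEMBL`) is available, analyze structure-activity relationships
- **Privileged side chains**: find side chains that occur frequently in active molecules
#### Priority 3 (valuable) ⭐⭐
- **Fragment diversity analysis**: count the unique fragment types at each position
- **Cluster analysis**: cluster molecules based on fragment fingerprints
- **Polarity/hydrophobicity analysis**: analyze the polarity of the side chains
#### Optional analyses ⭐
- **3D properties**: 3D descriptors such as PMI and NPR
- **Lipinski rules**: check drug-likeness rules
- **Stereochemistry**: chiral-center analysis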
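The fragment-diversity suggestion can be prototyped without any chemistry toolkit once fragments are available as `(cleavage_position, fragment_smiles)` pairs. The sketch below counts unique fragments per position and reports the most frequent one. The Shannon entropy column is an extra measure added here for illustration, not something the notebook is known to compute, and the demo pairs and the function name `fragment_diversity` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fragment_diversity(records):
    """Summarize fragment diversity per position.

    records: iterable of (cleavage_position, fragment_smiles) pairs.
    """
    by_position = defaultdict(Counter)
    for position, smiles in records:
        by_position[position][smiles] += 1

    result = {}
    for position, counts in sorted(by_position.items()):
        total = sum(counts.values())
        # Shannon entropy in bits: 0 when a single fragment dominates completely.
        entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
        result[position] = {
            "n_unique": len(counts),
            "most_common": counts.most_common(1)[0],
            "shannon_entropy_bits": entropy,
        }
    return result


# Hypothetical (position, fragment SMILES) pairs.
demo = [(3, "*C"), (3, "*C"), (3, "*CC"), (3, "*O"), (5, "*O"), (5, "*O")]
diversity = fragment_diversity(demo)
for position, stats in diversity.items():
    print(position, stats)
```

A position with many unique fragments and high entropy is a natural candidate for the clustering and SAR analyses listed above, while a low-entropy position is effectively conserved.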
## Code examples
The notebook contains complete code examples that can be run directly or modified. The main steps:
```python
# 1. Compute molecular properties
props = calculate_properties(smiles)
# 2. Batch cleavage
fragmenter = MacrolactoneFragmenter(ring_size=16)
batch_results = fragmenter.process_csv(csv_file)
# 3. Summary statistics
df_fragments = fragmenter.batch_to_dataframe(batch_results)
position_stats = df_fragments.groupby('cleavage_position').agg(...)
# 4. Plotting
sns.histplot(values, kde=True, ax=ax, bins=30)
```
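The body of `calculate_properties` is not shown in this README, so the following is only a guess at what such a helper might look like, written against standard RDKit descriptor functions. The dictionary keys mirror the columns of `ring16_molecular_properties.csv`; counting `num_atoms` with explicit hydrogens is an assumption, and the notebook's real implementation may differ.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, QED, rdMolDescriptors


def calculate_properties(smiles):
    """Return basic drug-likeness properties, or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "mol_weight": Descriptors.MolWt(mol),
        "logP": Crippen.MolLogP(mol),
        "num_h_donors": Lipinski.NumHDonors(mol),
        "num_h_acceptors": Lipinski.NumHAcceptors(mol),
        "num_rotatable_bonds": Lipinski.NumRotatableBonds(mol),
        "tpsa": rdMolDescriptors.CalcTPSA(mol),
        "qed": QED.qed(mol),
        # Assumption: total atom count includes hydrogens.
        "num_atoms": Chem.AddHs(mol).GetNumAtoms(),
        "num_heavy_atoms": mol.GetNumHeavyAtoms(),
    }


props = calculate_properties("CCO")  # ethanol as a small smoke test
print(props)
```

Returning `None` for unparsable input lets a batch loop skip bad rows instead of raising halfway through the dataset.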
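## Key insights
### Value of LogP
- Reflects a molecule's lipophilicity, which is critical for membrane permeability and drug distribution
- Macrolactones usually have relatively high LogP
- Knowing each side chain's contribution to LogP helps optimize drug design
### Significance of QED
- QED gives a composite estimate of drug-likeness
- Macrolactones often violate Lipinski's rules (molecular weight > 500) yet can still be good drugs
- Comparing high-QED and low-QED molecules can reveal the side chains that most affect drug-likeness
### Importance of TPSA
- TPSA correlates closely with oral bioavailability (values below 140 Å² are generally preferred)
- Polar side chains contribute substantially to TPSA
- This can guide side-chain modification strategies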
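The thresholds mentioned above can be turned into a quick filter over the property table. The sketch below counts violations of Lipinski's rule of five (molecular weight > 500, LogP > 5, more than 5 hydrogen-bond donors, more than 10 acceptors) and applies the 140 Å² TPSA guideline. The keys follow the columns of `ring16_molecular_properties.csv`; the example values are made up to resemble a large macrolactone and do not describe any molecule in the dataset.

```python
def lipinski_violations(props):
    """Count rule-of-five violations in a property dictionary."""
    checks = [
        props["mol_weight"] > 500,
        props["logP"] > 5,
        props["num_h_donors"] > 5,
        props["num_h_acceptors"] > 10,
    ]
    return sum(checks)


def tpsa_ok(props, limit=140.0):
    """Return True when TPSA is below the commonly cited oral-bioavailability limit."""
    return props["tpsa"] < limit


# Hypothetical property values for illustration only.
example = {
    "mol_weight": 733.9,
    "logP": 2.6,
    "num_h_donors": 5,
    "num_h_acceptors": 14,
    "tpsa": 193.9,
}

print("Lipinski violations:", lipinski_violations(example))
print("TPSA below 140:", tpsa_ok(example))
```

A molecule like this fails on molecular weight and acceptor count, which is why QED and per-position side-chain analysis can be more informative for macrolactones than a hard pass/fail rule.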
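## Notes
1. **Environment requirements**
   - `seaborn` and `matplotlib` must be installed
   - If they are missing, the notebook prompts you to run `pixi add seaborn matplotlib`
2. **Processing time**
   - Processing the 307 molecules may take a few minutes
   - Drawing the distribution plots also takes some time
3. **Memory usage**
   - Batch processing and plotting use a fair amount of memory
   - If you run into memory problems, reduce the `max_rows` parameter
4. **Image resolution**
   - Images are saved at 300 DPI by default
   - Adjust the `dpi` parameter as needed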
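## Follow-up work
Based on the analysis results, the following steps are suggested:
1. **Quantitative relationship between LogP and side chains**
   - Compute the change in LogP after removing each side chain
   - Identify the positions that contribute most to LogP
2. **Correlation with activity data** (if available)
   - Analyze the side-chain features of highly active molecules
   - Identify "privileged side chains"
3. **Fragment library construction**
   - Collect the common fragments at each position
   - Use them to guide the design of new molecules
4. **Machine-learning prediction**
   - Predict molecular properties from fragment features
   - Build QSAR models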
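One simple way to approach the "privileged side chains" step is an enrichment ratio: for each (position, fragment) pair, compare the fraction of its parent molecules that are active with the active fraction of the whole set. The sketch below is a minimal version under several assumptions that do not come from the notebook: activity is a per-molecule number such as `max_pChEMBL`, a molecule counts as active at or above a threshold of 6.0, and fragments seen in fewer than `min_parents` molecules are ignored. All data and names in the example are hypothetical.

```python
from collections import defaultdict


def fragment_enrichment(fragments, activity, threshold=6.0, min_parents=2):
    """Rank (position, fragment) pairs by enrichment among active molecules.

    fragments: iterable of (parent_id, cleavage_position, fragment_smiles)
    activity:  dict mapping parent_id to an activity value
    Returns rows of ((position, smiles), n_parents, active_fraction, enrichment).
    """
    active = {pid for pid, value in activity.items() if value >= threshold}
    if not activity or not active:
        return []
    baseline = len(active) / len(activity)

    parents_by_fragment = defaultdict(set)
    for parent_id, position, smiles in fragments:
        if parent_id in activity:
            parents_by_fragment[(position, smiles)].add(parent_id)

    rows = []
    for key, parents in parents_by_fragment.items():
        if len(parents) < min_parents:
            continue  # too few observations to say anything
        fraction = len(parents & active) / len(parents)
        rows.append((key, len(parents), fraction, fraction / baseline))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical activity values and fragment assignments.
activity = {"m1": 7.2, "m2": 6.5, "m3": 4.9, "m4": 5.1}
fragments = [
    ("m1", 3, "*C"), ("m2", 3, "*C"),
    ("m3", 3, "*CC"), ("m4", 3, "*CC"),
    ("m1", 5, "*O"), ("m3", 5, "*O"),
]
rows = fragment_enrichment(fragments, activity)
for row in rows:
    print(row)
```

An enrichment above 1 means the fragment appears more often in active molecules than expected by chance. With only a few hundred molecules the ratios are noisy, so a statistical test would be a sensible next step before drawing conclusions.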
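## References
- `filter_molecules.ipynb` - molecule filtering and cleavage logic
- `test_align_two_molecules.ipynb` - reference for the plotting logic
- `src/macrolactone_fragmenter.py` - the packaged fragmenter class
- `src/ring_visualization.py` - visualization utilities
## Troubleshooting
If you run into problems:
1. Check that you are inside the `pixi shell` environment
2. Confirm that all dependencies are installed
3. Check that the output directory is writable
4. Check that the CSV file path is correct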
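Checks 2 to 4 above can be scripted. The sketch below uses only the standard library. The function name `preflight` is made up for this example, and the default package list simply reflects the two plotting dependencies named in the notes. It does not try to detect whether a `pixi shell` is active.

```python
import importlib.util
import os
from pathlib import Path


def preflight(csv_path, output_dir, packages=("seaborn", "matplotlib")):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for name in packages:
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")
    if not Path(csv_path).is_file():
        problems.append(f"input CSV not found: {csv_path}")
    out = Path(output_dir)
    if not out.is_dir():
        problems.append(f"output directory not found: {output_dir}")
    elif not os.access(out, os.W_OK):
        problems.append(f"output directory not writable: {output_dir}")
    return problems


# Paths taken from the file list at the top of this README, relative to notebooks/.
for problem in preflight("../output/ring16_match_smarts.csv", "../output"):
    print(problem)
```

Running it from the `notebooks/` directory before opening the notebook prints nothing when everything is in place.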

File diff suppressed because one or more lines are too long



@@ -0,0 +1,93 @@
{
"cells": [
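{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SMARTS Screening of Drug Molecules: A Strategy for Replacing the Sandmeyer Reaction with the Zhang Xiaheng Reaction\n",
"\n",
"## Background\n",
"\n",
"This notebook screens a drug-molecule database for compounds that could potentially be synthesized with the **Zhang Xiaheng reaction** instead of the **Sandmeyer reaction**.\n",
"\n",
"### Key concepts\n",
"\n",
"**Sandmeyer reaction**: the traditional method for converting aromatic amines\n",
"- Scheme: Ar-NH₂ → [Ar-N₂⁺] → Ar-X\n",
"- Products: aryl halides and related Ar-X compounds (X = Cl, Br, I, CN, OH, SCN, etc.)\n",
"\n",
"**Zhang Xiaheng reaction**: an emerging green reaction method\n",
"- Offers a more environmentally friendly synthetic route\n",
"- May replace the traditional Sandmeyer reaction\n",
"\n",
"### Screening strategy\n",
"\n",
"Based on the principle of **isomeric bioisosteric replacement**:\n",
"- If compound A, synthesized via the Sandmeyer reaction, is active\n",
"- then compound B, built on the same scaffold via the Zhang Xiaheng reaction, may show similar activity\n",
"\n",
"### Screening logic\n",
"\n",
"**Core assumption**: drugs that contain an aromatic halogen may have been synthesized via the Sandmeyer reaction\n",
"\n",
"**Priority ranking**:\n",
"1. **Heteroaryl halides** (highest priority)\n",
"   - Chloropyridines, chloropyrimidines, etc.\n",
"   - These structures are more likely to be made via Sandmeyer or SNAr reactions\n",
"   \n",
"2. **General aryl halides** (high priority)\n",
"   - Any aromatic chloride, bromide, or iodide\n",
"   - May come from a Sandmeyer reaction; requires literature verification\n",
"\n",
"### Three screening plans\n",
"\n",
"#### Plan A (most conservative): heteroaryl halide screening\n",
"- **SMARTS patterns**: `n:c-[Cl,Br,I]` or `n1c([Cl,Br,I])cccc1`\n",
"- **Advantage**: highest precision, low false-positive rate\n",
"- **Use case**: quickly finding the most likely candidate drugs\n",
"- **Expected result**: few candidates, but precise\n",
"\n",
"#### Plan B (balanced): screening for all aryl halides\n",
"- **SMARTS pattern**: `c[Cl,Br,I]`\n",
"- **Advantage**: broader coverage, balancing precision and breadth\n",
"- **Use case**: comprehensive screening of a drug library\n",
"- **Expected result**: a moderate number of candidates with a moderate false-positive rate\n",
"\n",
"#### Plan C (removed): simplified version\n",
"- Screened only for halogen-containing compounds\n",
"- Low precision; deprecated\n",
"\n",
"---\n",
"\n",
"## File information\n",
"\n",
"- **Input file**: `/data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
"- **Output directory**: `/data/drug_targetmol/`\n",
"- **Output files**:\n",
"  - `candidates_planA_heteroaryl_halides.csv` (Plan A results)\n",
"  - `candidates_planB_all_aromatic_halides.csv` (Plan B results)"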
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long

1134
notebooks/mactch_test.ipynb Normal file

File diff suppressed because one or more lines are too long

321
notebooks/rdkit_show.ipynb Normal file

File diff suppressed because one or more lines are too long


@@ -0,0 +1,754 @@
{
"cells": [
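{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Screening Aromatic-Amine Drug Candidates: Analysis of Sandmeyer Reaction Starting Materials\n",
"\n",
"## Background\n",
"\n",
"### Sandmeyer reaction recap\n",
"The Sandmeyer reaction is the classic method for converting aromatic amines:\n",
"**Ar-NH₂ → [Ar-N₂]⁺ → Ar-X**\n",
"where X = Cl, Br, I, CN, OH, SCN, etc.\n",
"\n",
"### Screening goal\n",
"Identify drug molecules that contain an aromatic amine (Ar-NH₂)\n",
"and could therefore serve as Sandmeyer reaction starting materials.\n",
"Such molecules might traditionally be converted to aryl halides via the Sandmeyer reaction,\n",
"and could now be transformed more efficiently with the Zhang Xiaheng reaction.\n",
"\n",
"### SMARTS pattern\n",
"The SMARTS pattern `[c,n][NH2]` matches:\n",
"- `[c,n]`: an aromatic carbon or nitrogen atom\n",
"- `[NH2]`: an amino group (-NH₂)\n",
"\n",
"**Important notes:**\n",
"- This screen is based on structural features only\n",
"- The synthetic route must ultimately be confirmed from the literature\n",
"- Not every drug that contains an aromatic amine is made via the Sandmeyer reaction"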
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import required libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from pathlib import Path\n",
"from rdkit import Chem\n",
"from rdkit.Chem import PandasTools, Draw\n",
"from rdkit.Chem.Draw import rdMolDraw2D\n",
"from IPython.display import SVG, display\n",
"from rdkit.Chem import AllChem\n",
"import pandas as pd\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Configure pandas display options\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_colwidth', 100)"
]
},
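{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the screening pattern and visualization function\n",
"\n",
"### SMARTS pattern setup\n",
"- **Target pattern**: `[c,n][NH2]` - an amino group attached to an aromatic carbon or nitrogen atom\n",
"- **Matching logic**: find every molecule that contains this substructure"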
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Using SMARTS pattern: [c,n][NH2]\n",
"Pattern validation: ✓\n",
"\n",
"Created directory: ../data/drug_targetmol/aniline_candidates\n",
"Created visualization directory: ../data/drug_targetmol/aniline_candidates/visualizations\n"
]
}
],
"source": [
"# Define the screening pattern\n",
"TARGET_SMARTS = '[c,n][NH2]'\n",
"pattern = Chem.MolFromSmarts(TARGET_SMARTS)\n",
"\n",
"if pattern is None:\n",
"    raise ValueError(f\"Invalid SMARTS pattern: {TARGET_SMARTS}\")\n",
"\n",
"print(f\"Using SMARTS pattern: {TARGET_SMARTS}\")\n",
"print(f\"Pattern validation: {'✓' if pattern else '✗'}\")\n",
"\n",
"# Create the output directories (including any missing parents)\n",
"output_base = Path(\"../data/drug_targetmol\")\n",
"output_dir = output_base / \"aniline_candidates\"\n",
"visualization_dir = output_dir / \"visualizations\"\n",
"\n",
"output_dir.mkdir(parents=True, exist_ok=True)\n",
"visualization_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"\\nCreated directory: {output_dir}\")\n",
"print(f\"Created visualization directory: {visualization_dir}\")\n",
"\n",
"def generate_highlighted_svg(mol, highlight_atoms, filename, title=\"\"):\n",
"    \"\"\"Generate a high-resolution SVG with the matched substructure highlighted.\"\"\"\n",
"    # Compute 2D coordinates\n",
"    AllChem.Compute2DCoords(mol)\n",
"    \n",
"    # Create the SVG drawer\n",
"    drawer = rdMolDraw2D.MolDraw2DSVG(1200, 900)  # larger canvas for better clarity\n",
"    drawer.SetFontSize(12)\n",
"    \n",
"    # Drawing options\n",
"    draw_options = drawer.drawOptions()\n",
"    draw_options.addAtomIndices = False  # hide atom indices to keep the image clean\n",
"    draw_options.addBondIndices = False\n",
"    draw_options.addStereoAnnotation = True\n",
"    draw_options.fixedFontSize = 12\n",
"    \n",
"    # Highlight the matched atoms in blue\n",
"    atom_colors = {}\n",
"    for atom_idx in highlight_atoms:\n",
"        atom_colors[atom_idx] = (0.3, 0.3, 1.0)  # blue highlight\n",
"    \n",
"    # Draw the molecule\n",
"    drawer.DrawMolecule(mol,\n",
"                        highlightAtoms=highlight_atoms,\n",
"                        highlightAtomColors=atom_colors)\n",
"    \n",
"    drawer.FinishDrawing()\n",
"    svg_content = drawer.GetDrawingText()\n",
"    \n",
"    # Add a title\n",
"    if title:\n",
"        # Split the SVG into lines on real newlines\n",
"        svg_lines = svg_content.split(\"\\n\")\n",
"        # Insert the title before the first <g> tag that has a transform\n",
"        for i, line in enumerate(svg_lines):\n",
"            if \"<g \" in line and \"transform\" in line:\n",
"                svg_lines.insert(i, f\"<text x=\\\"50%\\\" y=\\\"30\\\" text-anchor=\\\"middle\\\" font-size=\\\"16\\\" font-weight=\\\"bold\\\">{title}</text>\")\n",
"                break\n",
"        svg_with_title = \"\\n\".join(svg_lines)\n",
"    else:\n",
"        svg_with_title = svg_content\n",
"    \n",
"    # Save the file (UTF-8 so that non-ASCII titles are written correctly)\n",
"    with open(filename, \"w\", encoding=\"utf-8\") as f:\n",
"        f.write(svg_with_title)\n",
"    \n",
"    return svg_content"
]
},
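{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data loading and molecule screening\n",
"\n",
"### Data source\n",
"- File location: `data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
"- Contains drug molecule structures together with rich property annotations\n",
"\n",
"### Screening logic\n",
"1. Read the SDF file\n",
"2. Run the SMARTS match on every molecule\n",
"3. Record the matched atoms and the number of matches\n",
"4. Save the matches to a CSV file\n",
"5. Generate highlighted visualization images"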
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reading SDF file...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[21:24:23] Both bonds on one end of an atropisomer are on the same side - atoms is : 3\n",
"[21:24:23] Explicit valence for atom # 2 N greater than permitted\n",
"[21:24:23] ERROR: Could not sanitize molecule ending on line 217340\n",
"[21:24:23] ERROR: Explicit valence for atom # 2 N greater than permitted\n",
"[21:24:24] Explicit valence for atom # 4 N greater than permitted\n",
"[21:24:24] ERROR: Could not sanitize molecule ending on line 317283\n",
"[21:24:24] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
"[21:24:24] Explicit valence for atom # 4 N greater than permitted\n",
"[21:24:24] ERROR: Could not sanitize molecule ending on line 324666\n",
"[21:24:24] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
"[21:24:24] Explicit valence for atom # 5 N greater than permitted\n",
"[21:24:24] ERROR: Could not sanitize molecule ending on line 365883\n",
"[21:24:24] ERROR: Explicit valence for atom # 5 N greater than permitted\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded 3276 molecules\n",
"\n",
"Data overview:\n",
" Index Plate Row Col ID Name \\\n",
"0 1 L1010-1 a 2 Dexamethasone \n",
"1 2 L1010-1 a 3 Danicopan \n",
"2 3 L1010-1 a 4 Cyclosporin A \n",
"3 4 L1010-1 a 5 L-Carnitine \n",
"4 5 L1010-1 a 6 Trimetazidine dihydrochloride \n",
"\n",
" Synonyms CAS \\\n",
"0 MK 125;Prednisolone F;NSC 34521;Hexadecadrol 50-02-2 \n",
"1 ACH-4471 1903768-17-1 \n",
"2 Cyclosporine A;Ciclosporin;Cyclosporine 59865-13-3 \n",
"3 L(-)-Carnitine;Levocarnitine 541-15-1 \n",
"4 Yoshimilon;Kyurinett;Vastarel F 13171-25-0 \n",
"\n",
" SMILES \\\n",
"0 C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO \n",
"1 CC(=O)c1nn(CC(=O)N2C[C@H](F)C[C@H]2C(=O)Nc2cccc(Br)n2)c2ccc(cc12)-c1cnc(C)nc1 \n",
"2 [C@H]([C@@H](C/C=C/C)C)(O)[C@@]1(N(C)C(=O)[C@H]([C@@H](C)C)N(C)C(=O)[C@H](CC(C)C)N(C)C(=O)[C@H](... \n",
"3 C[N+](C)(C)C[C@@H](O)CC([O-])=O \n",
"4 Cl.Cl.COC1=C(OC)C(OC)=C(CN2CCNCC2)C=C1 \n",
"\n",
" Formula MolWt Approved status \\\n",
"0 C22H29FO5 392.46 NMPA;EMA;FDA \n",
"1 C26H23BrFN7O3 580.41 FDA \n",
"2 C62H111N11O12 1202.61 FDA \n",
"3 C7H15NO3 161.2 FDA \n",
"4 C14H24Cl2N2O3 339.258 NMPA;EMA \n",
"\n",
" Pharmacopoeia \\\n",
"0 USP39-NF34;BP2015;JP16;IP2010 \n",
"1 NaN \n",
"2 Martindale the Extra Pharmacopoei, EP10.2, USP43-NF38, Ph.Int_6th, JP17 \n",
"3 NaN \n",
"4 BP2019;KP ;EP9.2;IP2010;JP17;Martindale:The Extra Pharmacopoeia \n",
"\n",
" Disease \\\n",
"0 Metabolism \n",
"1 Others \n",
"2 Immune system \n",
"3 Cardiovascular system \n",
"4 Cardiovascular system \n",
"\n",
" Pathways \\\n",
"0 Antibody-drug Conjugate/ADC Related;Autophagy;Endocrinology/Hormones;Immunology/Inflammation;Mic... \n",
"1 Immunology/Inflammation \n",
"2 Immunology/Inflammation;Metabolism;Microbiology/Virology \n",
"3 Metabolism \n",
"4 Autophagy;Metabolism \n",
"\n",
" Target \\\n",
"0 Antibacterial;Antibiotic;Autophagy;Complement System;Glucocorticoid Receptor;IL Receptor;Mitopha... \n",
"1 Complement System \n",
"2 Phosphatase;Antibiotic;Complement System \n",
"3 Endogenous Metabolite;Fatty Acid Synthase \n",
"4 Autophagy;Fatty Acid Synthase \n",
"\n",
" Receptor \\\n",
"0 Antibiotic; Autophagy; Bacterial; Complement System; Glucocorticoid Receptor; IL receptor; Mitop... \n",
"1 Complement System; factor D \n",
"2 Antibiotic; calcineurin phosphatase; Complement System; Phosphatase \n",
"3 Endogenous Metabolite; FAS \n",
"4 Autophagy; mitochondrial long-chain 3-ketoacyl thiolase \n",
"\n",
" Bioactivity \\\n",
"0 Dexamethasone is a glucocorticoid receptor agonist and IL receptor modulator with anti-inflammat... \n",
"1 Danicopan (ACH-4471) (ACH-4471) is a selective, orally active small molecule factor D inhibitor ... \n",
"2 Cyclosporin A is a natural product and an active fungal metabolite, classified as a cyclic polyp... \n",
"3 L-Carnitine (L(-)-Carnitine) is an amino acid derivative. L-Carnitine facilitates long-chain fat... \n",
"4 Trimetazidine dihydrochloride (Vastarel F) can improve myocardial glucose utilization by inhibit... \n",
"\n",
" Reference \\\n",
"0 Li M, Yu H. Identification of WP1066, an inhibitor of JAK2 and STAT3, as a Kv1. 3 potassium chan... \n",
"1 Yuan X, et al. Small-molecule factor D inhibitors selectively block the alternative pathway of c... \n",
"2 D'Angelo G, et al. Cyclosporin A prevents the hypoxic adaptation by activating hypoxia-inducible... \n",
"3 Jogl G, Tong L. Cell. 2003 Jan 10; 112(1):113-22. \n",
"4 Yang Q, et al. Int J Clin Exp Pathol. 2015, 8(4):3735-3741.;Liu Z, et al. Metabolism. 2016, 65(3... \n",
"\n",
" ROMol \n",
"0 <rdkit.Chem.rdchem.Mol object at 0x77530d73c820> \n",
"1 <rdkit.Chem.rdchem.Mol object at 0x77530d73c890> \n",
"2 <rdkit.Chem.rdchem.Mol object at 0x77530a3f6f10> \n",
"3 <rdkit.Chem.rdchem.Mol object at 0x77530a3f70d0> \n",
"4 <rdkit.Chem.rdchem.Mol object at 0x77530a3f7140> \n",
"\n",
"列名:['Index', 'Plate', 'Row', 'Col', 'ID', 'Name', 'Synonyms', 'CAS', 'SMILES', 'Formula', 'MolWt', 'Approved status', 'Pharmacopoeia', 'Disease', 'Pathways', 'Target', 'Receptor', 'Bioactivity', 'Reference', 'ROMol']\n"
]
}
],
"source": [
"# Read the SDF file\n",
"sdf_path = '../data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf'\n",
"\n",
"print(\"Reading SDF file...\")\n",
"df = PandasTools.LoadSDF(sdf_path)\n",
"print(f\"Loaded {len(df)} molecules\")\n",
"\n",
"# Show basic information about the data\n",
"print(\"\\nData overview:\")\n",
"print(df.head())\n",
"print(f\"\\nColumns: {list(df.columns)}\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Screening for aromatic amine substructures...\n",
"SMARTS pattern: [c,n][N&H2]\n",
"Found 262 matching molecules (processed 3276 molecules)\n",
"\n",
"Screening summary:\n",
" Name CAS Formula total_matches\n",
"17 Guanosine 118-00-3 C10H13N5O5 1\n",
"20 Ganciclovir 82410-32-0 C9H13N5O4 1\n",
"22 Imiquimod maleate 896106-16-4 C18H20N4O4 1\n",
"27 Brincidofovir 444805-28-1 C27H52N3O7P 1\n",
"28 Imiquimod 99011-02-6 C14H16N4 1\n",
"32 Ganciclovir sodium 107910-75-8 C9H13N5NaO4 1\n",
"33 Cytarabine 147-94-4 C9H13N3O5 1\n",
"35 Vidarabine 5536-17-4 C10H13N5O4 1\n",
"38 Penciclovir 39809-25-1 C10H15N5O3 1\n",
"41 Famciclovir 104227-87-4 C14H19N5O4 1\n",
"... and 252 more molecules\n"
]
}
],
"source": [
"def screen_molecules_for_aniline(df, smarts_pattern, max_molecules=100):\n",
"    \"\"\"\n",
"    Screen for molecules that contain an aromatic amine substructure.\n",
"    \n",
"    Args:\n",
"        df: DataFrame containing the molecules\n",
"        smarts_pattern: RDKit SMARTS pattern object\n",
"        max_molecules: maximum number of molecules to process\n",
"    \n",
"    Returns:\n",
"        DataFrame of screening results\n",
"    \"\"\"\n",
"    print(\"Screening for aromatic amine substructures...\")\n",
"    print(f\"SMARTS pattern: {Chem.MolToSmarts(smarts_pattern)}\")\n",
"    \n",
"    matched_molecules = []\n",
"    processed_count = 0\n",
"    \n",
"    for idx, row in df.iterrows():\n",
"        if processed_count >= max_molecules:\n",
"            break\n",
"        \n",
"        mol = row['ROMol']\n",
"        if mol is None:\n",
"            continue\n",
"        \n",
"        processed_count += 1\n",
"        \n",
"        # Find every match of the SMARTS pattern (empty tuple if none)\n",
"        matches = mol.GetSubstructMatches(smarts_pattern)\n",
"        if matches:\n",
"            # Collect all matched atom indices\n",
"            matched_atoms = set()\n",
"            for match in matches:\n",
"                matched_atoms.update(match)\n",
"            \n",
"            # Build the match record\n",
"            match_record = row.copy()\n",
"            match_record['matched_atoms'] = sorted(matched_atoms)\n",
"            match_record['total_matches'] = len(matches)\n",
"            match_record['smarts_pattern'] = Chem.MolToSmarts(smarts_pattern)\n",
"            matched_molecules.append(match_record)\n",
"    \n",
"    result_df = pd.DataFrame(matched_molecules)\n",
"    print(f\"Found {len(result_df)} matching molecules (processed {processed_count} molecules)\")\n",
"    \n",
"    return result_df\n",
"\n",
"# Run the screen\n",
"matched_df = screen_molecules_for_aniline(df, pattern, max_molecules=1000000)\n",
"\n",
"# Show a summary of the results\n",
"if len(matched_df) > 0:\n",
"    print(\"\\nScreening summary:\")\n",
"    summary_cols = ['Name', 'CAS', 'Formula', 'total_matches']\n",
"    if len(matched_df) <= 10:\n",
"        print(matched_df[summary_cols])\n",
"    else:\n",
"        print(matched_df[summary_cols].head(10))\n",
"        print(f\"... and {len(matched_df) - 10} more molecules\")\n",
"else:\n",
"    print(\"\\nNo matching molecules found\")"
]
},
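{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save the screening results\n",
"\n",
"### Output files\n",
"1. **CSV file**: property data and match details for every matching molecule\n",
"2. **SVG images**: a structure drawing of each matching molecule with the aromatic amine highlighted"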
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
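{
"name": "stdout",
"output_type": "stream",
"text": [
"CSV results saved to: ../data/drug_targetmol/aniline_candidates/aniline_candidates.csv\n",
"Contains 262 molecules and 23 attribute columns\n",
"\n",
"Generating visualization images (up to 500)...\n",
"Generated 10 molecule images\n",
"Generated 20 molecule images\n",
"Generated 30 molecule images\n",
"Generated 40 molecule images\n",
"Generated 50 molecule images\n",
"Generated 60 molecule images\n",
"Generated 70 molecule images\n",
"Generated 80 molecule images\n",
"Generated 90 molecule images\n",
"Generated 100 molecule images\n",
"Generated 110 molecule images\n",
"Generated 120 molecule images\n",
"Generated 130 molecule images\n",
"Generated 140 molecule images\n",
"Generated 150 molecule images\n",
"Generated 160 molecule images\n",
"Generated 170 molecule images\n",
"Generated 180 molecule images\n",
"Generated 190 molecule images\n",
"Generated 200 molecule images\n",
"Generated 210 molecule images\n",
"Generated 220 molecule images\n",
"Generated 230 molecule images\n",
"Generated 240 molecule images\n",
"Generated 250 molecule images\n",
"Generated 260 molecule images\n",
"Done! Generated 262 visualization images in total\n",
"\n",
"Example image: 118-00-3_Guanosine.svg\n"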
]
},
{
"data": {
"image/svg+xml": [
"<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:rdkit=\"http://www.rdkit.org/xml\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" version=\"1.1\" baseProfile=\"full\" xml:space=\"preserve\" width=\"1200px\" height=\"900px\" viewBox=\"0 0 1200 900\">\n",
"<!-- END OF HEADER -->\n",
"<rect style=\"opacity:1.0;fill:#FFFFFF;stroke:none\" width=\"1200.0\" height=\"900.0\" x=\"0.0\" y=\"0.0\"> </rect>\n",
"<path class=\"bond-0 atom-0 atom-1\" d=\"M 912.0,197.7 L 940.1,201.0 L 924.8,332.9 L 896.6,329.6 Z\" style=\"fill:#4C4CFF;fill-rule:evenodd;fill-opacity:1;stroke:#4C4CFF;stroke-width:0.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
"<ellipse cx=\"932.9\" cy=\"201.5\" rx=\"26.6\" ry=\"26.6\" class=\"atom-0\" style=\"fill:#4C4CFF;fill-rule:evenodd;stroke:#4C4CFF;stroke-width:1.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<ellipse cx=\"910.7\" cy=\"331.2\" rx=\"26.6\" ry=\"26.6\" class=\"atom-1\" style=\"fill:#4C4CFF;fill-rule:evenodd;stroke:#4C4CFF;stroke-width:1.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-0 atom-0 atom-1\" d=\"M 925.1,208.0 L 910.7,331.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 910.7,331.2 L 853.5,355.9\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 853.5,355.9 L 796.4,380.6\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 908.0,354.1 L 856.2,376.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 856.2,376.5 L 804.3,398.9\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-2 atom-2 atom-3\" d=\"M 787.8,392.5 L 780.6,454.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-2 atom-2 atom-3\" d=\"M 780.6,454.1 L 773.4,515.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-3 atom-3 atom-4\" d=\"M 773.4,515.8 L 879.9,595.0\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-3 atom-3 atom-4\" d=\"M 794.5,506.6 L 882.6,572.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-4 atom-4 atom-5\" d=\"M 879.9,595.0 L 860.1,653.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-4 atom-4 atom-5\" d=\"M 860.1,653.6 L 840.4,712.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 829.8,720.7 L 767.3,720.0\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 767.3,720.0 L 704.7,719.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 830.1,700.8 L 774.7,700.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 774.7,700.2 L 719.4,699.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-6 atom-6 atom-7\" d=\"M 704.7,719.3 L 686.2,660.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-6 atom-6 atom-7\" d=\"M 686.2,660.3 L 667.8,601.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-7 atom-3 atom-7\" d=\"M 773.4,515.8 L 723.0,551.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-7 atom-3 atom-7\" d=\"M 723.0,551.5 L 672.7,587.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-8 atom-7 atom-8\" d=\"M 657.5,590.0 L 598.4,570.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-8 atom-7 atom-8\" d=\"M 598.4,570.1 L 539.3,550.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-9 atom-8 atom-9\" d=\"M 539.3,550.1 L 489.3,585.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-9 atom-8 atom-9\" d=\"M 489.3,585.6 L 439.2,621.0\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-10 atom-9 atom-10\" d=\"M 422.7,620.8 L 373.6,584.2\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-10 atom-9 atom-10\" d=\"M 373.6,584.2 L 324.4,547.7\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-11 atom-10 atom-11\" d=\"M 324.4,547.7 L 197.7,587.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-12 atom-11 atom-12\" d=\"M 197.7,587.2 L 153.0,546.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-12 atom-11 atom-12\" d=\"M 153.0,546.1 L 108.3,504.9\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-13 atom-10 atom-13\" d=\"M 324.4,547.7 L 366.9,421.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-14 atom-13 atom-14\" d=\"M 366.9,421.8 L 331.6,372.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-14 atom-13 atom-14\" d=\"M 331.6,372.1 L 296.3,322.3\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-15 atom-13 atom-15\" d=\"M 366.9,421.8 L 499.7,423.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-16 atom-8 atom-15\" d=\"M 539.3,550.1 L 499.7,423.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-17 atom-15 atom-16\" d=\"M 499.7,423.4 L 536.0,374.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-17 atom-15 atom-16\" d=\"M 536.0,374.5 L 572.4,325.6\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-18 atom-4 atom-17\" d=\"M 879.9,595.0 L 1001.8,542.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 991.3,547.0 L 1042.7,585.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 1042.7,585.2 L 1094.1,623.5\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 1003.2,531.0 L 1054.6,569.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 1054.6,569.3 L 1106.0,607.5\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-20 atom-17 atom-19\" d=\"M 1001.8,542.4 L 1009.0,480.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-20 atom-17 atom-19\" d=\"M 1009.0,480.8 L 1016.2,419.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-21 atom-1 atom-19\" d=\"M 910.7,331.2 L 960.2,368.0\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-21 atom-1 atom-19\" d=\"M 960.2,368.0 L 1009.6,404.9\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path d=\"M 707.8,719.4 L 704.7,719.3 L 703.7,716.4\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
"<path d=\"M 204.0,585.3 L 197.7,587.2 L 195.5,585.2\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
"<path d=\"M 995.7,545.0 L 1001.8,542.4 L 1002.2,539.3\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
"<path class=\"atom-0\" d=\"M 924.2 195.1 L 927.0 199.6 Q 927.3 200.0, 927.7 200.8 Q 928.1 201.7, 928.2 201.7 L 928.2 195.1 L 929.3 195.1 L 929.3 203.6 L 928.1 203.6 L 925.1 198.7 Q 924.8 198.1, 924.4 197.4 Q 924.1 196.8, 924.0 196.6 L 924.0 203.6 L 922.9 203.6 L 922.9 195.1 L 924.2 195.1 \" fill=\"#000000\"/>\n",
"<path class=\"atom-0\" d=\"M 930.9 195.1 L 932.1 195.1 L 932.1 198.7 L 936.4 198.7 L 936.4 195.1 L 937.6 195.1 L 937.6 203.6 L 936.4 203.6 L 936.4 199.7 L 932.1 199.7 L 932.1 203.6 L 930.9 203.6 L 930.9 195.1 \" fill=\"#000000\"/>\n",
"<path class=\"atom-0\" d=\"M 939.2 203.3 Q 939.4 202.8, 939.9 202.5 Q 940.4 202.2, 941.1 202.2 Q 942.0 202.2, 942.4 202.6 Q 942.9 203.1, 942.9 203.9 Q 942.9 204.7, 942.3 205.5 Q 941.7 206.3, 940.4 207.2 L 943.0 207.2 L 943.0 207.8 L 939.2 207.8 L 939.2 207.3 Q 940.3 206.6, 940.9 206.0 Q 941.5 205.5, 941.8 205.0 Q 942.1 204.5, 942.1 203.9 Q 942.1 203.4, 941.8 203.1 Q 941.6 202.8, 941.1 202.8 Q 940.7 202.8, 940.4 203.0 Q 940.0 203.2, 939.8 203.6 L 939.2 203.3 \" fill=\"#000000\"/>\n",
"<path class=\"atom-2\" d=\"M 786.9 379.6 L 789.7 384.1 Q 790.0 384.6, 790.4 385.4 Q 790.8 386.2, 790.9 386.2 L 790.9 379.6 L 792.0 379.6 L 792.0 388.1 L 790.8 388.1 L 787.8 383.2 Q 787.5 382.6, 787.1 382.0 Q 786.8 381.3, 786.7 381.1 L 786.7 388.1 L 785.6 388.1 L 785.6 379.6 L 786.9 379.6 \" fill=\"#0000FF\"/>\n",
"<path class=\"atom-5\" d=\"M 835.6 716.6 L 838.4 721.1 Q 838.6 721.5, 839.1 722.3 Q 839.5 723.1, 839.5 723.2 L 839.5 716.6 L 840.7 716.6 L 840.7 725.1 L 839.5 725.1 L 836.5 720.2 Q 836.2 719.6, 835.8 718.9 Q 835.4 718.3, 835.3 718.1 L 835.3 725.1 L 834.2 725.1 L 834.2 716.6 L 835.6 716.6 \" fill=\"#0000FF\"/>\n",
"<path class=\"atom-7\" d=\"M 663.2 588.3 L 666.0 592.8 Q 666.3 593.3, 666.7 594.1 Q 667.2 594.9, 667.2 594.9 L 667.2 588.3 L 668.3 588.3 L 668.3 596.8 L 667.1 596.8 L 664.2 591.9 Q 663.8 591.3, 663.4 590.7 Q 663.1 590.0, 663.0 589.8 L 663.0 596.8 L 661.9 596.8 L 661.9 588.3 L 663.2 588.3 \" fill=\"#0000FF\"/>\n",
"<path class=\"atom-9\" d=\"M 427.1 626.9 Q 427.1 624.9, 428.1 623.8 Q 429.1 622.6, 431.0 622.6 Q 432.8 622.6, 433.9 623.8 Q 434.9 624.9, 434.9 626.9 Q 434.9 629.0, 433.8 630.2 Q 432.8 631.3, 431.0 631.3 Q 429.1 631.3, 428.1 630.2 Q 427.1 629.0, 427.1 626.9 M 431.0 630.4 Q 432.3 630.4, 433.0 629.5 Q 433.7 628.6, 433.7 626.9 Q 433.7 625.3, 433.0 624.4 Q 432.3 623.6, 431.0 623.6 Q 429.7 623.6, 429.0 624.4 Q 428.3 625.3, 428.3 626.9 Q 428.3 628.7, 429.0 629.5 Q 429.7 630.4, 431.0 630.4 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-12\" d=\"M 87.7 493.1 L 88.9 493.1 L 88.9 496.7 L 93.2 496.7 L 93.2 493.1 L 94.4 493.1 L 94.4 501.6 L 93.2 501.6 L 93.2 497.6 L 88.9 497.6 L 88.9 501.6 L 87.7 501.6 L 87.7 493.1 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-12\" d=\"M 96.1 497.3 Q 96.1 495.3, 97.1 494.1 Q 98.1 493.0, 100.0 493.0 Q 101.9 493.0, 102.9 494.1 Q 103.9 495.3, 103.9 497.3 Q 103.9 499.4, 102.9 500.5 Q 101.9 501.7, 100.0 501.7 Q 98.2 501.7, 97.1 500.5 Q 96.1 499.4, 96.1 497.3 M 100.0 500.7 Q 101.3 500.7, 102.0 499.9 Q 102.7 499.0, 102.7 497.3 Q 102.7 495.6, 102.0 494.8 Q 101.3 493.9, 100.0 493.9 Q 98.7 493.9, 98.0 494.8 Q 97.3 495.6, 97.3 497.3 Q 97.3 499.0, 98.0 499.9 Q 98.7 500.7, 100.0 500.7 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-14\" d=\"M 277.8 309.3 L 278.9 309.3 L 278.9 312.9 L 283.3 312.9 L 283.3 309.3 L 284.4 309.3 L 284.4 317.8 L 283.3 317.8 L 283.3 313.9 L 278.9 313.9 L 278.9 317.8 L 277.8 317.8 L 277.8 309.3 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-14\" d=\"M 286.2 313.6 Q 286.2 311.5, 287.2 310.4 Q 288.2 309.2, 290.1 309.2 Q 292.0 309.2, 293.0 310.4 Q 294.0 311.5, 294.0 313.6 Q 294.0 315.6, 293.0 316.8 Q 291.9 318.0, 290.1 318.0 Q 288.2 318.0, 287.2 316.8 Q 286.2 315.6, 286.2 313.6 M 290.1 317.0 Q 291.4 317.0, 292.1 316.1 Q 292.8 315.3, 292.8 313.6 Q 292.8 311.9, 292.1 311.0 Q 291.4 310.2, 290.1 310.2 Q 288.8 310.2, 288.1 311.0 Q 287.4 311.9, 287.4 313.6 Q 287.4 315.3, 288.1 316.1 Q 288.8 317.0, 290.1 317.0 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-16\" d=\"M 575.1 316.9 Q 575.1 314.8, 576.1 313.7 Q 577.1 312.5, 579.0 312.5 Q 580.8 312.5, 581.8 313.7 Q 582.9 314.8, 582.9 316.9 Q 582.9 318.9, 581.8 320.1 Q 580.8 321.3, 579.0 321.3 Q 577.1 321.3, 576.1 320.1 Q 575.1 318.9, 575.1 316.9 M 579.0 320.3 Q 580.2 320.3, 580.9 319.4 Q 581.7 318.6, 581.7 316.9 Q 581.7 315.2, 580.9 314.3 Q 580.2 313.5, 579.0 313.5 Q 577.7 313.5, 576.9 314.3 Q 576.3 315.2, 576.3 316.9 Q 576.3 318.6, 576.9 319.4 Q 577.7 320.3, 579.0 320.3 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-16\" d=\"M 584.2 312.6 L 585.3 312.6 L 585.3 316.2 L 589.7 316.2 L 589.7 312.6 L 590.8 312.6 L 590.8 321.1 L 589.7 321.1 L 589.7 317.2 L 585.3 317.2 L 585.3 321.1 L 584.2 321.1 L 584.2 312.6 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-18\" d=\"M 1104.5 621.7 Q 1104.5 619.7, 1105.5 618.5 Q 1106.5 617.4, 1108.4 617.4 Q 1110.2 617.4, 1111.3 618.5 Q 1112.3 619.7, 1112.3 621.7 Q 1112.3 623.8, 1111.2 624.9 Q 1110.2 626.1, 1108.4 626.1 Q 1106.5 626.1, 1105.5 624.9 Q 1104.5 623.8, 1104.5 621.7 M 1108.4 625.1 Q 1109.7 625.1, 1110.4 624.3 Q 1111.1 623.4, 1111.1 621.7 Q 1111.1 620.0, 1110.4 619.2 Q 1109.7 618.3, 1108.4 618.3 Q 1107.1 618.3, 1106.4 619.2 Q 1105.7 620.0, 1105.7 621.7 Q 1105.7 623.4, 1106.4 624.3 Q 1107.1 625.1, 1108.4 625.1 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-19\" d=\"M 1015.3 406.3 L 1018.1 410.8 Q 1018.4 411.2, 1018.8 412.0 Q 1019.3 412.8, 1019.3 412.9 L 1019.3 406.3 L 1020.4 406.3 L 1020.4 414.8 L 1019.3 414.8 L 1016.3 409.8 Q 1015.9 409.3, 1015.6 408.6 Q 1015.2 407.9, 1015.1 407.7 L 1015.1 414.8 L 1014.0 414.8 L 1014.0 406.3 L 1015.3 406.3 \" fill=\"#0000FF\"/>\n",
"<path class=\"atom-19\" d=\"M 1022.1 406.3 L 1023.2 406.3 L 1023.2 409.9 L 1027.6 409.9 L 1027.6 406.3 L 1028.7 406.3 L 1028.7 414.8 L 1027.6 414.8 L 1027.6 410.8 L 1023.2 410.8 L 1023.2 414.8 L 1022.1 414.8 L 1022.1 406.3 \" fill=\"#0000FF\"/>\n",
"</svg>"
],
"text/plain": [
"<IPython.core.display.SVG object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def save_aniline_screening_results(df, output_dir, visualization_dir, max_visualizations=500):\n",
" \"\"\"保存芳香胺筛选结果\"\"\"\n",
" \n",
" # 保存CSV文件\n",
" csv_path = output_dir / \"aniline_candidates.csv\"\n",
" \n",
" # 转换ROMol列为SMILES因为ROMol对象无法保存到CSV\n",
" df_export = df.copy()\n",
" if 'ROMol' in df_export.columns:\n",
" df_export['SMILES_from_mol'] = df_export['ROMol'].apply(lambda x: Chem.MolToSmiles(x) if x else '')\n",
" df_export = df_export.drop('ROMol', axis=1)\n",
" \n",
" df_export.to_csv(csv_path, index=False, encoding='utf-8')\n",
" print(f\"CSV结果已保存到{csv_path}\")\n",
" print(f\"包含 {len(df_export)} 个分子,{len(df_export.columns)} 个属性列\")\n",
" \n",
" # 生成可视化图片\n",
" print(f\"\\n开始生成可视化图片最多{max_visualizations}个)...\")\n",
" generated_count = 0\n",
" \n",
" for idx, row in df.iterrows():\n",
" if generated_count >= max_visualizations:\n",
" print(f\"已达到最大可视化数量限制 ({max_visualizations}),停止生成\")\n",
" break\n",
" \n",
" cas = str(row.get('CAS', 'unknown')).strip()\n",
" name = str(row.get('Name', 'unknown')).strip()\n",
" \n",
" # 清理文件名(去除特殊字符)\n",
" safe_name = \"\".join(c for c in name if c.isalnum() or c in (' ', '-', '_')).rstrip()\n",
" safe_cas = \"\".join(c for c in cas if c.isalnum() or c in ('-',)).rstrip()\n",
" \n",
" # 跳过无效的标识符\n",
" if not safe_cas or safe_cas == 'nan' or safe_cas == 'unknown':\n",
" continue\n",
" \n",
" mol = row.get('ROMol')\n",
" if mol is None:\n",
" continue\n",
" \n",
" matched_atoms = row.get('matched_atoms', [])\n",
" if not matched_atoms:\n",
" continue\n",
" \n",
" # 生成文件名和标题\n",
" filename = visualization_dir / f\"{safe_cas}_{safe_name.replace(' ', '_')}.svg\"\n",
" title = f\"{name} ({cas}) - 芳香胺结构\"\n",
" \n",
" try:\n",
" # 生成SVG\n",
" svg_content = generate_highlighted_svg(mol, matched_atoms, filename, title)\n",
" generated_count += 1\n",
" \n",
" # 每10个显示一次进度\n",
" if generated_count % 10 == 0:\n",
" print(f\"已生成 {generated_count} 个分子图片\")\n",
" \n",
" except Exception as e:\n",
" print(f\"生成 {safe_cas} 失败: {e}\")\n",
" continue\n",
" \n",
" print(f\"完成!共生成 {generated_count} 个可视化图片\")\n",
" return csv_path, generated_count\n",
"\n",
"# 保存结果\n",
"if len(matched_df) > 0:\n",
" csv_path, viz_count = save_aniline_screening_results(\n",
" matched_df, output_dir, visualization_dir, max_visualizations=500\n",
" )\n",
" \n",
" # 显示第一个生成的图片作为示例\n",
" if viz_count > 0:\n",
" example_files = list(visualization_dir.glob(\"*.svg\"))\n",
" if example_files:\n",
" example_file = example_files[0]\n",
" print(f\"\\n示例图片: {example_file.name}\")\n",
" with open(example_file, \"r\") as f:\n",
" svg_content = f.read()\n",
" display(SVG(svg_content))\n",
"else:\n",
" print(\"没有匹配结果,无需保存\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 结果统计和分析\n",
"\n",
"### 筛选统计\n",
"- 总分子数\n",
"- 匹配分子数\n",
"- 可视化文件数量\n",
"- 输出文件位置"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== 芳香胺筛选结果统计 ===\n",
"总分子数3276\n",
"匹配分子数262\n",
"匹配率8.00%\n",
"\n",
"输出目录:../data/drug_targetmol/aniline_candidates\n",
"CSV文件../data/drug_targetmol/aniline_candidates/aniline_candidates.csv\n",
"可视化目录:../data/drug_targetmol/aniline_candidates/visualizations\n",
"SVG文件数量262\n",
"\n",
"匹配数量最多的分子:\n",
" Name CAS total_matches\n",
"432 Proflavine Hemisulfate 1811-28-5 4\n",
"1064 Triamterene 396-01-0 3\n",
"335 Pemetrexed disodium hemipenta hydrate 357166-30-4 2\n",
"463 Lamotrigine 84057-84-1 2\n",
"779 Pyrimethamine 58-14-0 2\n"
]
}
],
"source": [
"# 结果统计\n",
"print(\"=== 芳香胺筛选结果统计 ===\")\n",
"print(f\"总分子数:{len(df)}\")\n",
"print(f\"匹配分子数:{len(matched_df)}\")\n",
"print(f\"匹配率:{len(matched_df)/len(df)*100:.2f}%\")\n",
"print(f\"\\n输出目录{output_dir}\")\n",
"print(f\"CSV文件{output_dir}/aniline_candidates.csv\")\n",
"print(f\"可视化目录:{visualization_dir}\")\n",
"print(f\"SVG文件数量{len(list(visualization_dir.glob('*.svg')))}\")\n",
"\n",
"# 显示匹配最多的前几个分子\n",
"if len(matched_df) > 0:\n",
" print(\"\\n匹配数量最多的分子\")\n",
" top_matches = matched_df.nlargest(5, 'total_matches')[['Name', 'CAS', 'total_matches']]\n",
" print(top_matches)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 使用建议\n",
"\n",
"### 筛选结果解读\n",
"- **匹配分子**包含芳香胺结构Ar-NH₂的药物\n",
"- **蓝色高亮**匹配的SMARTS结构芳香碳/氮 + 氨基)\n",
"- **多重匹配**:分子中可能存在多个芳香胺基团\n",
"\n",
"### 后续分析建议\n",
"1. **合成路线验证**:查阅匹配分子的合成文献\n",
"2. **Sandmeyer反应确认**确认是否使用Sandmeyer反应引入卤素\n",
"3. **张夏恒反应评估**评估替代Sandmeyer反应的可行性\n",
"4. **工艺优化潜力**:分析替换为张夏恒反应的经济效益\n",
"\n",
"### 文件说明\n",
"- **CSV文件**:完整的分子属性和匹配信息\n",
"- **SVG文件**:结构可视化,蓝色高亮芳香胺结构\n",
"- **命名规则**{CAS}_{Name}.svg特殊字符已清理"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 抗生素筛选结果\n",
"\n",
"/home/zly/project/macro_split/data/drug_targetmol/aniline_candidates/antibiotics_identified.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

@@ -0,0 +1,774 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 筛选芳香胺候选药物 - Sandmeyer反应起始物分析\n",
"\n",
"## 背景介绍\n",
"\n",
"### Sandmeyer反应回顾\n",
"Sandmeyer反应是经典的芳香胺转化方法\n",
"**Ar-NH₂ → [Ar-N₂]⁺ → Ar-X**\n",
"其中 X = Cl, Br, I, CN, OH, SCN 等\n",
"\n",
"### 筛选目标\n",
"通过识别药物分子中含有芳香胺结构Ar-NH₂的化合物\n",
"找出可能作为Sandmeyer反应起始物的候选药物。\n",
"这些分子可能原本通过Sandmeyer反应引入芳香卤素\n",
"现在可以用张夏恒反应进行更高效的转化。\n",
"\n",
"### SMARTS模式\n",
"使用SMARTS模式 `[c,n][NH2]` 匹配:\n",
"- `[c,n]`: 芳香碳或氮原子\n",
"- `[NH2]`: 氨基(-NH₂\n",
"\n",
"**重要提醒:**\n",
"- 此筛选基于分子结构特征\n",
"- 最终需要查阅文献确认合成路线\n",
"- 并非所有含芳香胺的药物都使用Sandmeyer反应"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 导入所需库"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2025-11-11T13:21:31.660096Z",
"iopub.status.busy": "2025-11-11T13:21:31.657369Z",
"iopub.status.idle": "2025-11-11T13:21:32.943162Z",
"shell.execute_reply": "2025-11-11T13:21:32.938881Z"
}
},
"outputs": [],
"source": [
"import os\n",
"from pathlib import Path\n",
"from rdkit import Chem\n",
"from rdkit.Chem import PandasTools, Draw\n",
"from rdkit.Chem.Draw import rdMolDraw2D\n",
"from IPython.display import SVG, display\n",
"from rdkit.Chem import AllChem\n",
"import pandas as pd\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# 设置显示选项\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_colwidth', 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 定义筛选模式和可视化函数\n",
"\n",
"### SMARTS模式设置\n",
"- **目标模式**: `[c,n][NH2]` - 芳香碳/氮原子连接的氨基\n",
"- **匹配逻辑**: 寻找所有包含此子结构的分子"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2025-11-11T13:21:32.959832Z",
"iopub.status.busy": "2025-11-11T13:21:32.957734Z",
"iopub.status.idle": "2025-11-11T13:21:32.987085Z",
"shell.execute_reply": "2025-11-11T13:21:32.980584Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"使用SMARTS模式: [c,n][NH2]\n",
"模式验证: ✓\n",
"\n",
"创建目录:../data/drug_targetmol/aniline_candidates\n",
"创建可视化目录:../data/drug_targetmol/aniline_candidates/visualizations\n"
]
}
],
"source": [
"# 定义筛选模式\n",
"TARGET_SMARTS = '[c,n][NH2]'\n",
"pattern = Chem.MolFromSmarts(TARGET_SMARTS)\n",
"\n",
"if pattern is None:\n",
" raise ValueError(f\"无效的SMARTS模式: {TARGET_SMARTS}\")\n",
"\n",
"print(f\"使用SMARTS模式: {TARGET_SMARTS}\")\n",
"print(f\"模式验证: {'✓' if pattern else '✗'}\")\n",
"\n",
"# 创建输出目录\n",
"output_base = Path(\"../data/drug_targetmol\")\n",
"output_dir = output_base / \"aniline_candidates\"\n",
"visualization_dir = output_dir / \"visualizations\"\n",
"\n",
"output_dir.mkdir(exist_ok=True)\n",
"visualization_dir.mkdir(exist_ok=True)\n",
"\n",
"print(f\"\\n创建目录{output_dir}\")\n",
"print(f\"创建可视化目录:{visualization_dir}\")\n",
"\n",
"def generate_highlighted_svg(mol, highlight_atoms, filename, title=\"\"):\n",
" \"\"\"生成高亮匹配结构的高清晰度SVG图片\"\"\"\n",
" # 计算2D坐标\n",
" AllChem.Compute2DCoords(mol)\n",
" \n",
" # 创建SVG绘制器\n",
" drawer = rdMolDraw2D.MolDraw2DSVG(1200, 900) # 更大的尺寸以提高清晰度\n",
" drawer.SetFontSize(12)\n",
" \n",
" # 绘制选项\n",
" draw_options = drawer.drawOptions()\n",
" draw_options.addAtomIndices = False # 不显示原子索引,保持简洁\n",
" draw_options.addBondIndices = False\n",
" draw_options.addStereoAnnotation = True\n",
" draw_options.fixedFontSize = 12\n",
" \n",
" # 高亮匹配的原子(蓝色)\n",
" atom_colors = {}\n",
" for atom_idx in highlight_atoms:\n",
" atom_colors[atom_idx] = (0.3, 0.3, 1.0) # 蓝色高亮\n",
" \n",
" # 绘制分子\n",
" drawer.DrawMolecule(mol, \n",
" highlightAtoms=highlight_atoms,\n",
" highlightAtomColors=atom_colors)\n",
" \n",
" drawer.FinishDrawing()\n",
" svg_content = drawer.GetDrawingText()\n",
" \n",
" # 添加标题\n",
" if title:\n",
" # 在SVG中添加标题\n",
" svg_lines = svg_content.split(\"\\\\n\")\n",
" # 在<g>标签前插入标题\n",
" for i, line in enumerate(svg_lines):\n",
" if \"<g \" in line and \"transform\" in line:\n",
" svg_lines.insert(i, f\"<text x=\\\"50%\\\" y=\\\"30\\\" text-anchor=\\\"middle\\\" font-size=\\\"16\\\" font-weight=\\\"bold\\\">{title}</text>\")\n",
" break\n",
" svg_with_title = \"\\\\n\".join(svg_lines)\n",
" else:\n",
" svg_with_title = svg_content\n",
" \n",
" # 保存文件\n",
" with open(filename, \"w\") as f:\n",
" f.write(svg_with_title)\n",
" \n",
" return svg_content"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 数据加载和分子筛选\n",
"\n",
"### 数据源\n",
"- 文件位置:`data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
"- 包含药物分子结构和丰富属性信息\n",
"\n",
"### 筛选逻辑\n",
"1. 读取SDF文件\n",
"2. 对每个分子进行SMARTS匹配\n",
"3. 记录匹配的原子和匹配数量\n",
"4. 保存匹配结果到CSV\n",
"5. 生成高亮可视化图片"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2025-11-11T13:21:33.114695Z",
"iopub.status.busy": "2025-11-11T13:21:33.113063Z",
"iopub.status.idle": "2025-11-11T13:21:35.754026Z",
"shell.execute_reply": "2025-11-11T13:21:35.745369Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"正在读取SDF文件...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[21:21:34] Both bonds on one end of an atropisomer are on the same side - atoms is : 3\n",
"[21:21:34] Explicit valence for atom # 2 N greater than permitted\n",
"[21:21:34] ERROR: Could not sanitize molecule ending on line 217340\n",
"[21:21:34] ERROR: Explicit valence for atom # 2 N greater than permitted\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[21:21:35] Explicit valence for atom # 4 N greater than permitted\n",
"[21:21:35] ERROR: Could not sanitize molecule ending on line 317283\n",
"[21:21:35] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
"[21:21:35] Explicit valence for atom # 4 N greater than permitted\n",
"[21:21:35] ERROR: Could not sanitize molecule ending on line 324666\n",
"[21:21:35] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
"[21:21:35] Explicit valence for atom # 5 N greater than permitted\n",
"[21:21:35] ERROR: Could not sanitize molecule ending on line 365883\n",
"[21:21:35] ERROR: Explicit valence for atom # 5 N greater than permitted\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"成功加载 3276 个分子\n",
"\n",
"数据概览:\n",
" Index Plate Row Col ID Name \\\n",
"0 1 L1010-1 a 2 Dexamethasone \n",
"1 2 L1010-1 a 3 Danicopan \n",
"2 3 L1010-1 a 4 Cyclosporin A \n",
"3 4 L1010-1 a 5 L-Carnitine \n",
"4 5 L1010-1 a 6 Trimetazidine dihydrochloride \n",
"\n",
" Synonyms CAS \\\n",
"0 MK 125;Prednisolone F;NSC 34521;Hexadecadrol 50-02-2 \n",
"1 ACH-4471 1903768-17-1 \n",
"2 Cyclosporine A;Ciclosporin;Cyclosporine 59865-13-3 \n",
"3 L(-)-Carnitine;Levocarnitine 541-15-1 \n",
"4 Yoshimilon;Kyurinett;Vastarel F 13171-25-0 \n",
"\n",
" SMILES \\\n",
"0 C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO \n",
"1 CC(=O)c1nn(CC(=O)N2C[C@H](F)C[C@H]2C(=O)Nc2cccc(Br)n2)c2ccc(cc12)-c1cnc(C)nc1 \n",
"2 [C@H]([C@@H](C/C=C/C)C)(O)[C@@]1(N(C)C(=O)[C@H]([C@@H](C)C)N(C)C(=O)[C@H](CC(C)C)N(C)C(=O)[C@H](... \n",
"3 C[N+](C)(C)C[C@@H](O)CC([O-])=O \n",
"4 Cl.Cl.COC1=C(OC)C(OC)=C(CN2CCNCC2)C=C1 \n",
"\n",
" Formula MolWt Approved status \\\n",
"0 C22H29FO5 392.46 NMPA;EMA;FDA \n",
"1 C26H23BrFN7O3 580.41 FDA \n",
"2 C62H111N11O12 1202.61 FDA \n",
"3 C7H15NO3 161.2 FDA \n",
"4 C14H24Cl2N2O3 339.258 NMPA;EMA \n",
"\n",
" Pharmacopoeia \\\n",
"0 USP39-NF34;BP2015;JP16;IP2010 \n",
"1 NaN \n",
"2 Martindale the Extra Pharmacopoei, EP10.2, USP43-NF38, Ph.Int_6th, JP17 \n",
"3 NaN \n",
"4 BP2019;KP ;EP9.2;IP2010;JP17;Martindale:The Extra Pharmacopoeia \n",
"\n",
" Disease \\\n",
"0 Metabolism \n",
"1 Others \n",
"2 Immune system \n",
"3 Cardiovascular system \n",
"4 Cardiovascular system \n",
"\n",
" Pathways \\\n",
"0 Antibody-drug Conjugate/ADC Related;Autophagy;Endocrinology/Hormones;Immunology/Inflammation;Mic... \n",
"1 Immunology/Inflammation \n",
"2 Immunology/Inflammation;Metabolism;Microbiology/Virology \n",
"3 Metabolism \n",
"4 Autophagy;Metabolism \n",
"\n",
" Target \\\n",
"0 Antibacterial;Antibiotic;Autophagy;Complement System;Glucocorticoid Receptor;IL Receptor;Mitopha... \n",
"1 Complement System \n",
"2 Phosphatase;Antibiotic;Complement System \n",
"3 Endogenous Metabolite;Fatty Acid Synthase \n",
"4 Autophagy;Fatty Acid Synthase \n",
"\n",
" Receptor \\\n",
"0 Antibiotic; Autophagy; Bacterial; Complement System; Glucocorticoid Receptor; IL receptor; Mitop... \n",
"1 Complement System; factor D \n",
"2 Antibiotic; calcineurin phosphatase; Complement System; Phosphatase \n",
"3 Endogenous Metabolite; FAS \n",
"4 Autophagy; mitochondrial long-chain 3-ketoacyl thiolase \n",
"\n",
" Bioactivity \\\n",
"0 Dexamethasone is a glucocorticoid receptor agonist and IL receptor modulator with anti-inflammat... \n",
"1 Danicopan (ACH-4471) (ACH-4471) is a selective, orally active small molecule factor D inhibitor ... \n",
"2 Cyclosporin A is a natural product and an active fungal metabolite, classified as a cyclic polyp... \n",
"3 L-Carnitine (L(-)-Carnitine) is an amino acid derivative. L-Carnitine facilitates long-chain fat... \n",
"4 Trimetazidine dihydrochloride (Vastarel F) can improve myocardial glucose utilization by inhibit... \n",
"\n",
" Reference \\\n",
"0 Li M, Yu H. Identification of WP1066, an inhibitor of JAK2 and STAT3, as a Kv1. 3 potassium chan... \n",
"1 Yuan X, et al. Small-molecule factor D inhibitors selectively block the alternative pathway of c... \n",
"2 D'Angelo G, et al. Cyclosporin A prevents the hypoxic adaptation by activating hypoxia-inducible... \n",
"3 Jogl G, Tong L. Cell. 2003 Jan 10; 112(1):113-22. \n",
"4 Yang Q, et al. Int J Clin Exp Pathol. 2015, 8(4):3735-3741.;Liu Z, et al. Metabolism. 2016, 65(3... \n",
"\n",
" ROMol \n",
"0 <rdkit.Chem.rdchem.Mol object at 0x774684c557e0> \n",
"1 <rdkit.Chem.rdchem.Mol object at 0x7746818ffdf0> \n",
"2 <rdkit.Chem.rdchem.Mol object at 0x7746818ffd80> \n",
"3 <rdkit.Chem.rdchem.Mol object at 0x7746816e0040> \n",
"4 <rdkit.Chem.rdchem.Mol object at 0x7746816e00b0> \n",
"\n",
"列名:['Index', 'Plate', 'Row', 'Col', 'ID', 'Name', 'Synonyms', 'CAS', 'SMILES', 'Formula', 'MolWt', 'Approved status', 'Pharmacopoeia', 'Disease', 'Pathways', 'Target', 'Receptor', 'Bioactivity', 'Reference', 'ROMol']\n"
]
}
],
"source": [
"# 读取SDF文件\n",
"sdf_path = '../data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf'\n",
"\n",
"print(\"正在读取SDF文件...\")\n",
"df = PandasTools.LoadSDF(sdf_path)\n",
"print(f\"成功加载 {len(df)} 个分子\")\n",
"\n",
"# 显示数据基本信息\n",
"print(\"\\n数据概览\")\n",
"print(df.head())\n",
"print(f\"\\n列名{list(df.columns)}\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2025-11-11T13:21:35.770585Z",
"iopub.status.busy": "2025-11-11T13:21:35.768752Z",
"iopub.status.idle": "2025-11-11T13:21:36.114723Z",
"shell.execute_reply": "2025-11-11T13:21:36.111467Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"开始筛选芳香胺结构...\n",
"SMARTS模式: [c,n][N&H2]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"找到 78 个匹配分子(处理了 1000 个分子)\n",
"\n",
"筛选结果摘要:\n",
" Name CAS Formula total_matches\n",
"17 Guanosine 118-00-3 C10H13N5O5 1\n",
"20 Ganciclovir 82410-32-0 C9H13N5O4 1\n",
"22 Imiquimod maleate 896106-16-4 C18H20N4O4 1\n",
"27 Brincidofovir 444805-28-1 C27H52N3O7P 1\n",
"28 Imiquimod 99011-02-6 C14H16N4 1\n",
"32 Ganciclovir sodium 107910-75-8 C9H13N5NaO4 1\n",
"33 Cytarabine 147-94-4 C9H13N3O5 1\n",
"35 Vidarabine 5536-17-4 C10H13N5O4 1\n",
"38 Penciclovir 39809-25-1 C10H15N5O3 1\n",
"41 Famciclovir 104227-87-4 C14H19N5O4 1\n",
"... 还有 68 个分子\n"
]
}
],
"source": [
"def screen_molecules_for_aniline(df, smarts_pattern, max_molecules=100):\n",
" \"\"\"\n",
" 筛选包含芳香胺结构的分子\n",
" \n",
" Args:\n",
" df: 包含分子的DataFrame\n",
" smarts_pattern: RDKit SMARTS模式对象\n",
" max_molecules: 最大处理分子数量\n",
" \n",
" Returns:\n",
" 筛选结果DataFrame\n",
" \"\"\"\n",
" print(f\"开始筛选芳香胺结构...\")\n",
" print(f\"SMARTS模式: {Chem.MolToSmarts(smarts_pattern)}\")\n",
" \n",
" matched_molecules = []\n",
" processed_count = 0\n",
" \n",
" for idx, row in df.iterrows():\n",
" if processed_count >= max_molecules:\n",
" break\n",
" \n",
" mol = row['ROMol']\n",
" if mol is None:\n",
" continue\n",
" \n",
" processed_count += 1\n",
" \n",
" # 检查是否匹配SMARTS模式\n",
" if mol.HasSubstructMatch(smarts_pattern):\n",
" matches = mol.GetSubstructMatches(smarts_pattern)\n",
" \n",
" # 收集所有匹配的原子\n",
" matched_atoms = set()\n",
" for match in matches:\n",
" matched_atoms.update(match)\n",
" \n",
" # 创建匹配记录\n",
" match_record = row.copy()\n",
" match_record['matched_atoms'] = list(matched_atoms)\n",
" match_record['total_matches'] = len(matches)\n",
" match_record['smarts_pattern'] = Chem.MolToSmarts(smarts_pattern)\n",
" matched_molecules.append(match_record)\n",
" \n",
" result_df = pd.DataFrame(matched_molecules)\n",
" print(f\"找到 {len(result_df)} 个匹配分子(处理了 {processed_count} 个分子)\")\n",
" \n",
" return result_df\n",
"\n",
"# 执行筛选\n",
"matched_df = screen_molecules_for_aniline(df, pattern, max_molecules=1000)\n",
"\n",
"# 显示结果摘要\n",
"if len(matched_df) > 0:\n",
" print(\"\\n筛选结果摘要\")\n",
" summary_cols = ['Name', 'CAS', 'Formula', 'total_matches']\n",
" if len(matched_df) <= 10:\n",
" print(matched_df[summary_cols])\n",
" else:\n",
" print(matched_df[summary_cols].head(10))\n",
" print(f\"... 还有 {len(matched_df) - 10} 个分子\")\n",
"else:\n",
" print(\"\\n未找到匹配分子\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 保存筛选结果\n",
"\n",
"### 输出文件\n",
"1. **CSV文件**:包含所有匹配分子的属性信息和匹配详情\n",
"2. **SVG图片**:每个匹配分子的结构可视化,高亮芳香胺结构"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2025-11-11T13:21:36.120981Z",
"iopub.status.busy": "2025-11-11T13:21:36.120553Z",
"iopub.status.idle": "2025-11-11T13:21:36.279125Z",
"shell.execute_reply": "2025-11-11T13:21:36.277892Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CSV结果已保存到../data/drug_targetmol/aniline_candidates/aniline_candidates.csv\n",
"包含 78 个分子23 个属性列\n",
"\n",
"开始生成可视化图片最多50个...\n",
"已生成 10 个分子图片\n",
"已生成 20 个分子图片\n",
"已生成 30 个分子图片\n",
"已生成 40 个分子图片\n",
"已生成 50 个分子图片\n",
"已达到最大可视化数量限制 (50),停止生成\n",
"完成!共生成 50 个可视化图片\n",
"\n",
"示例图片: 118-00-3_Guanosine.svg\n"
]
},
{
"data": {
"image/svg+xml": [
"<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:rdkit=\"http://www.rdkit.org/xml\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" version=\"1.1\" baseProfile=\"full\" xml:space=\"preserve\" width=\"1200px\" height=\"900px\" viewBox=\"0 0 1200 900\">\n",
"<!-- END OF HEADER -->\n",
"<rect style=\"opacity:1.0;fill:#FFFFFF;stroke:none\" width=\"1200.0\" height=\"900.0\" x=\"0.0\" y=\"0.0\"> </rect>\n",
"<path class=\"bond-0 atom-0 atom-1\" d=\"M 912.0,197.7 L 940.1,201.0 L 924.8,332.9 L 896.6,329.6 Z\" style=\"fill:#4C4CFF;fill-rule:evenodd;fill-opacity:1;stroke:#4C4CFF;stroke-width:0.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
"<ellipse cx=\"932.9\" cy=\"201.5\" rx=\"26.6\" ry=\"26.6\" class=\"atom-0\" style=\"fill:#4C4CFF;fill-rule:evenodd;stroke:#4C4CFF;stroke-width:1.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<ellipse cx=\"910.7\" cy=\"331.2\" rx=\"26.6\" ry=\"26.6\" class=\"atom-1\" style=\"fill:#4C4CFF;fill-rule:evenodd;stroke:#4C4CFF;stroke-width:1.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-0 atom-0 atom-1\" d=\"M 925.1,208.0 L 910.7,331.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 910.7,331.2 L 853.5,355.9\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 853.5,355.9 L 796.4,380.6\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 908.0,354.1 L 856.2,376.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 856.2,376.5 L 804.3,398.9\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-2 atom-2 atom-3\" d=\"M 787.8,392.5 L 780.6,454.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-2 atom-2 atom-3\" d=\"M 780.6,454.1 L 773.4,515.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-3 atom-3 atom-4\" d=\"M 773.4,515.8 L 879.9,595.0\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-3 atom-3 atom-4\" d=\"M 794.5,506.6 L 882.6,572.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-4 atom-4 atom-5\" d=\"M 879.9,595.0 L 860.1,653.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-4 atom-4 atom-5\" d=\"M 860.1,653.6 L 840.4,712.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 829.8,720.7 L 767.3,720.0\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 767.3,720.0 L 704.7,719.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 830.1,700.8 L 774.7,700.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 774.7,700.2 L 719.4,699.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-6 atom-6 atom-7\" d=\"M 704.7,719.3 L 686.2,660.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-6 atom-6 atom-7\" d=\"M 686.2,660.3 L 667.8,601.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-7 atom-3 atom-7\" d=\"M 773.4,515.8 L 723.0,551.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-7 atom-3 atom-7\" d=\"M 723.0,551.5 L 672.7,587.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-8 atom-7 atom-8\" d=\"M 657.5,590.0 L 598.4,570.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-8 atom-7 atom-8\" d=\"M 598.4,570.1 L 539.3,550.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-9 atom-8 atom-9\" d=\"M 539.3,550.1 L 489.3,585.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-9 atom-8 atom-9\" d=\"M 489.3,585.6 L 439.2,621.0\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-10 atom-9 atom-10\" d=\"M 422.7,620.8 L 373.6,584.2\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-10 atom-9 atom-10\" d=\"M 373.6,584.2 L 324.4,547.7\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-11 atom-10 atom-11\" d=\"M 324.4,547.7 L 197.7,587.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-12 atom-11 atom-12\" d=\"M 197.7,587.2 L 153.0,546.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-12 atom-11 atom-12\" d=\"M 153.0,546.1 L 108.3,504.9\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-13 atom-10 atom-13\" d=\"M 324.4,547.7 L 366.9,421.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-14 atom-13 atom-14\" d=\"M 366.9,421.8 L 331.6,372.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-14 atom-13 atom-14\" d=\"M 331.6,372.1 L 296.3,322.3\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-15 atom-13 atom-15\" d=\"M 366.9,421.8 L 499.7,423.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-16 atom-8 atom-15\" d=\"M 539.3,550.1 L 499.7,423.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-17 atom-15 atom-16\" d=\"M 499.7,423.4 L 536.0,374.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-17 atom-15 atom-16\" d=\"M 536.0,374.5 L 572.4,325.6\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-18 atom-4 atom-17\" d=\"M 879.9,595.0 L 1001.8,542.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 991.3,547.0 L 1042.7,585.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 1042.7,585.2 L 1094.1,623.5\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 1003.2,531.0 L 1054.6,569.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 1054.6,569.3 L 1106.0,607.5\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-20 atom-17 atom-19\" d=\"M 1001.8,542.4 L 1009.0,480.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-20 atom-17 atom-19\" d=\"M 1009.0,480.8 L 1016.2,419.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-21 atom-1 atom-19\" d=\"M 910.7,331.2 L 960.2,368.0\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path class=\"bond-21 atom-1 atom-19\" d=\"M 960.2,368.0 L 1009.6,404.9\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
"<path d=\"M 707.8,719.4 L 704.7,719.3 L 703.7,716.4\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
"<path d=\"M 204.0,585.3 L 197.7,587.2 L 195.5,585.2\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
"<path d=\"M 995.7,545.0 L 1001.8,542.4 L 1002.2,539.3\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
"<path class=\"atom-0\" d=\"M 924.2 195.1 L 927.0 199.6 Q 927.3 200.0, 927.7 200.8 Q 928.1 201.7, 928.2 201.7 L 928.2 195.1 L 929.3 195.1 L 929.3 203.6 L 928.1 203.6 L 925.1 198.7 Q 924.8 198.1, 924.4 197.4 Q 924.1 196.8, 924.0 196.6 L 924.0 203.6 L 922.9 203.6 L 922.9 195.1 L 924.2 195.1 \" fill=\"#000000\"/>\n",
"<path class=\"atom-0\" d=\"M 930.9 195.1 L 932.1 195.1 L 932.1 198.7 L 936.4 198.7 L 936.4 195.1 L 937.6 195.1 L 937.6 203.6 L 936.4 203.6 L 936.4 199.7 L 932.1 199.7 L 932.1 203.6 L 930.9 203.6 L 930.9 195.1 \" fill=\"#000000\"/>\n",
"<path class=\"atom-0\" d=\"M 939.2 203.3 Q 939.4 202.8, 939.9 202.5 Q 940.4 202.2, 941.1 202.2 Q 942.0 202.2, 942.4 202.6 Q 942.9 203.1, 942.9 203.9 Q 942.9 204.7, 942.3 205.5 Q 941.7 206.3, 940.4 207.2 L 943.0 207.2 L 943.0 207.8 L 939.2 207.8 L 939.2 207.3 Q 940.3 206.6, 940.9 206.0 Q 941.5 205.5, 941.8 205.0 Q 942.1 204.5, 942.1 203.9 Q 942.1 203.4, 941.8 203.1 Q 941.6 202.8, 941.1 202.8 Q 940.7 202.8, 940.4 203.0 Q 940.0 203.2, 939.8 203.6 L 939.2 203.3 \" fill=\"#000000\"/>\n",
"<path class=\"atom-2\" d=\"M 786.9 379.6 L 789.7 384.1 Q 790.0 384.6, 790.4 385.4 Q 790.8 386.2, 790.9 386.2 L 790.9 379.6 L 792.0 379.6 L 792.0 388.1 L 790.8 388.1 L 787.8 383.2 Q 787.5 382.6, 787.1 382.0 Q 786.8 381.3, 786.7 381.1 L 786.7 388.1 L 785.6 388.1 L 785.6 379.6 L 786.9 379.6 \" fill=\"#0000FF\"/>\n",
"<path class=\"atom-5\" d=\"M 835.6 716.6 L 838.4 721.1 Q 838.6 721.5, 839.1 722.3 Q 839.5 723.1, 839.5 723.2 L 839.5 716.6 L 840.7 716.6 L 840.7 725.1 L 839.5 725.1 L 836.5 720.2 Q 836.2 719.6, 835.8 718.9 Q 835.4 718.3, 835.3 718.1 L 835.3 725.1 L 834.2 725.1 L 834.2 716.6 L 835.6 716.6 \" fill=\"#0000FF\"/>\n",
"<path class=\"atom-7\" d=\"M 663.2 588.3 L 666.0 592.8 Q 666.3 593.3, 666.7 594.1 Q 667.2 594.9, 667.2 594.9 L 667.2 588.3 L 668.3 588.3 L 668.3 596.8 L 667.1 596.8 L 664.2 591.9 Q 663.8 591.3, 663.4 590.7 Q 663.1 590.0, 663.0 589.8 L 663.0 596.8 L 661.9 596.8 L 661.9 588.3 L 663.2 588.3 \" fill=\"#0000FF\"/>\n",
"<path class=\"atom-9\" d=\"M 427.1 626.9 Q 427.1 624.9, 428.1 623.8 Q 429.1 622.6, 431.0 622.6 Q 432.8 622.6, 433.9 623.8 Q 434.9 624.9, 434.9 626.9 Q 434.9 629.0, 433.8 630.2 Q 432.8 631.3, 431.0 631.3 Q 429.1 631.3, 428.1 630.2 Q 427.1 629.0, 427.1 626.9 M 431.0 630.4 Q 432.3 630.4, 433.0 629.5 Q 433.7 628.6, 433.7 626.9 Q 433.7 625.3, 433.0 624.4 Q 432.3 623.6, 431.0 623.6 Q 429.7 623.6, 429.0 624.4 Q 428.3 625.3, 428.3 626.9 Q 428.3 628.7, 429.0 629.5 Q 429.7 630.4, 431.0 630.4 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-12\" d=\"M 87.7 493.1 L 88.9 493.1 L 88.9 496.7 L 93.2 496.7 L 93.2 493.1 L 94.4 493.1 L 94.4 501.6 L 93.2 501.6 L 93.2 497.6 L 88.9 497.6 L 88.9 501.6 L 87.7 501.6 L 87.7 493.1 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-12\" d=\"M 96.1 497.3 Q 96.1 495.3, 97.1 494.1 Q 98.1 493.0, 100.0 493.0 Q 101.9 493.0, 102.9 494.1 Q 103.9 495.3, 103.9 497.3 Q 103.9 499.4, 102.9 500.5 Q 101.9 501.7, 100.0 501.7 Q 98.2 501.7, 97.1 500.5 Q 96.1 499.4, 96.1 497.3 M 100.0 500.7 Q 101.3 500.7, 102.0 499.9 Q 102.7 499.0, 102.7 497.3 Q 102.7 495.6, 102.0 494.8 Q 101.3 493.9, 100.0 493.9 Q 98.7 493.9, 98.0 494.8 Q 97.3 495.6, 97.3 497.3 Q 97.3 499.0, 98.0 499.9 Q 98.7 500.7, 100.0 500.7 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-14\" d=\"M 277.8 309.3 L 278.9 309.3 L 278.9 312.9 L 283.3 312.9 L 283.3 309.3 L 284.4 309.3 L 284.4 317.8 L 283.3 317.8 L 283.3 313.9 L 278.9 313.9 L 278.9 317.8 L 277.8 317.8 L 277.8 309.3 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-14\" d=\"M 286.2 313.6 Q 286.2 311.5, 287.2 310.4 Q 288.2 309.2, 290.1 309.2 Q 292.0 309.2, 293.0 310.4 Q 294.0 311.5, 294.0 313.6 Q 294.0 315.6, 293.0 316.8 Q 291.9 318.0, 290.1 318.0 Q 288.2 318.0, 287.2 316.8 Q 286.2 315.6, 286.2 313.6 M 290.1 317.0 Q 291.4 317.0, 292.1 316.1 Q 292.8 315.3, 292.8 313.6 Q 292.8 311.9, 292.1 311.0 Q 291.4 310.2, 290.1 310.2 Q 288.8 310.2, 288.1 311.0 Q 287.4 311.9, 287.4 313.6 Q 287.4 315.3, 288.1 316.1 Q 288.8 317.0, 290.1 317.0 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-16\" d=\"M 575.1 316.9 Q 575.1 314.8, 576.1 313.7 Q 577.1 312.5, 579.0 312.5 Q 580.8 312.5, 581.8 313.7 Q 582.9 314.8, 582.9 316.9 Q 582.9 318.9, 581.8 320.1 Q 580.8 321.3, 579.0 321.3 Q 577.1 321.3, 576.1 320.1 Q 575.1 318.9, 575.1 316.9 M 579.0 320.3 Q 580.2 320.3, 580.9 319.4 Q 581.7 318.6, 581.7 316.9 Q 581.7 315.2, 580.9 314.3 Q 580.2 313.5, 579.0 313.5 Q 577.7 313.5, 576.9 314.3 Q 576.3 315.2, 576.3 316.9 Q 576.3 318.6, 576.9 319.4 Q 577.7 320.3, 579.0 320.3 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-16\" d=\"M 584.2 312.6 L 585.3 312.6 L 585.3 316.2 L 589.7 316.2 L 589.7 312.6 L 590.8 312.6 L 590.8 321.1 L 589.7 321.1 L 589.7 317.2 L 585.3 317.2 L 585.3 321.1 L 584.2 321.1 L 584.2 312.6 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-18\" d=\"M 1104.5 621.7 Q 1104.5 619.7, 1105.5 618.5 Q 1106.5 617.4, 1108.4 617.4 Q 1110.2 617.4, 1111.3 618.5 Q 1112.3 619.7, 1112.3 621.7 Q 1112.3 623.8, 1111.2 624.9 Q 1110.2 626.1, 1108.4 626.1 Q 1106.5 626.1, 1105.5 624.9 Q 1104.5 623.8, 1104.5 621.7 M 1108.4 625.1 Q 1109.7 625.1, 1110.4 624.3 Q 1111.1 623.4, 1111.1 621.7 Q 1111.1 620.0, 1110.4 619.2 Q 1109.7 618.3, 1108.4 618.3 Q 1107.1 618.3, 1106.4 619.2 Q 1105.7 620.0, 1105.7 621.7 Q 1105.7 623.4, 1106.4 624.3 Q 1107.1 625.1, 1108.4 625.1 \" fill=\"#FF0000\"/>\n",
"<path class=\"atom-19\" d=\"M 1015.3 406.3 L 1018.1 410.8 Q 1018.4 411.2, 1018.8 412.0 Q 1019.3 412.8, 1019.3 412.9 L 1019.3 406.3 L 1020.4 406.3 L 1020.4 414.8 L 1019.3 414.8 L 1016.3 409.8 Q 1015.9 409.3, 1015.6 408.6 Q 1015.2 407.9, 1015.1 407.7 L 1015.1 414.8 L 1014.0 414.8 L 1014.0 406.3 L 1015.3 406.3 \" fill=\"#0000FF\"/>\n",
"<path class=\"atom-19\" d=\"M 1022.1 406.3 L 1023.2 406.3 L 1023.2 409.9 L 1027.6 409.9 L 1027.6 406.3 L 1028.7 406.3 L 1028.7 414.8 L 1027.6 414.8 L 1027.6 410.8 L 1023.2 410.8 L 1023.2 414.8 L 1022.1 414.8 L 1022.1 406.3 \" fill=\"#0000FF\"/>\n",
"</svg>"
],
"text/plain": [
"<IPython.core.display.SVG object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def save_aniline_screening_results(df, output_dir, visualization_dir, max_visualizations=50):\n",
" \"\"\"保存芳香胺筛选结果\"\"\"\n",
" \n",
" # 保存CSV文件\n",
" csv_path = output_dir / \"aniline_candidates.csv\"\n",
" \n",
" # 转换ROMol列为SMILES因为ROMol对象无法保存到CSV\n",
" df_export = df.copy()\n",
" if 'ROMol' in df_export.columns:\n",
" df_export['SMILES_from_mol'] = df_export['ROMol'].apply(lambda x: Chem.MolToSmiles(x) if x else '')\n",
" df_export = df_export.drop('ROMol', axis=1)\n",
" \n",
" df_export.to_csv(csv_path, index=False, encoding='utf-8')\n",
" print(f\"CSV结果已保存到{csv_path}\")\n",
" print(f\"包含 {len(df_export)} 个分子,{len(df_export.columns)} 个属性列\")\n",
" \n",
" # 生成可视化图片\n",
" print(f\"\\n开始生成可视化图片最多{max_visualizations}个)...\")\n",
" generated_count = 0\n",
" \n",
" for idx, row in df.iterrows():\n",
" if generated_count >= max_visualizations:\n",
" print(f\"已达到最大可视化数量限制 ({max_visualizations}),停止生成\")\n",
" break\n",
" \n",
" cas = str(row.get('CAS', 'unknown')).strip()\n",
" name = str(row.get('Name', 'unknown')).strip()\n",
" \n",
" # 清理文件名(去除特殊字符)\n",
" safe_name = \"\".join(c for c in name if c.isalnum() or c in (' ', '-', '_')).rstrip()\n",
" safe_cas = \"\".join(c for c in cas if c.isalnum() or c in ('-',)).rstrip()\n",
" \n",
" # 跳过无效的标识符\n",
" if not safe_cas or safe_cas == 'nan' or safe_cas == 'unknown':\n",
" continue\n",
" \n",
" mol = row.get('ROMol')\n",
" if mol is None:\n",
" continue\n",
" \n",
" matched_atoms = row.get('matched_atoms', [])\n",
" if not matched_atoms:\n",
" continue\n",
" \n",
" # 生成文件名和标题\n",
" filename = visualization_dir / f\"{safe_cas}_{safe_name.replace(' ', '_')}.svg\"\n",
" title = f\"{name} ({cas}) - 芳香胺结构\"\n",
" \n",
" try:\n",
" # 生成SVG\n",
" svg_content = generate_highlighted_svg(mol, matched_atoms, filename, title)\n",
" generated_count += 1\n",
" \n",
" # 每10个显示一次进度\n",
" if generated_count % 10 == 0:\n",
" print(f\"已生成 {generated_count} 个分子图片\")\n",
" \n",
" except Exception as e:\n",
" print(f\"生成 {safe_cas} 失败: {e}\")\n",
" continue\n",
" \n",
" print(f\"完成!共生成 {generated_count} 个可视化图片\")\n",
" return csv_path, generated_count\n",
"\n",
"# 保存结果\n",
"if len(matched_df) > 0:\n",
" csv_path, viz_count = save_aniline_screening_results(\n",
" matched_df, output_dir, visualization_dir, max_visualizations=50\n",
" )\n",
" \n",
" # 显示第一个生成的图片作为示例\n",
" if viz_count > 0:\n",
" example_files = list(visualization_dir.glob(\"*.svg\"))\n",
" if example_files:\n",
" example_file = example_files[0]\n",
" print(f\"\\n示例图片: {example_file.name}\")\n",
" with open(example_file, \"r\") as f:\n",
" svg_content = f.read()\n",
" display(SVG(svg_content))\n",
"else:\n",
" print(\"没有匹配结果,无需保存\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 结果统计和分析\n",
"\n",
"### 筛选统计\n",
"- 总分子数\n",
"- 匹配分子数\n",
"- 可视化文件数量\n",
"- 输出文件位置"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2025-11-11T13:21:36.282118Z",
"iopub.status.busy": "2025-11-11T13:21:36.281886Z",
"iopub.status.idle": "2025-11-11T13:21:36.317857Z",
"shell.execute_reply": "2025-11-11T13:21:36.316621Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== 芳香胺筛选结果统计 ===\n",
"总分子数3276\n",
"匹配分子数78\n",
"匹配率2.38%\n",
"\n",
"输出目录:../data/drug_targetmol/aniline_candidates\n",
"CSV文件../data/drug_targetmol/aniline_candidates/aniline_candidates.csv\n",
"可视化目录:../data/drug_targetmol/aniline_candidates/visualizations\n",
"SVG文件数量50\n",
"\n",
"匹配数量最多的分子:\n",
" Name CAS total_matches\n",
"432 Proflavine Hemisulfate 1811-28-5 4\n",
"335 Pemetrexed disodium hemipenta hydrate 357166-30-4 2\n",
"463 Lamotrigine 84057-84-1 2\n",
"779 Pyrimethamine 58-14-0 2\n",
"784 Dapsone 80-08-0 2\n"
]
}
],
"source": [
"# 结果统计\n",
"print(\"=== 芳香胺筛选结果统计 ===\")\n",
"print(f\"总分子数:{len(df)}\")\n",
"print(f\"匹配分子数:{len(matched_df)}\")\n",
"print(f\"匹配率:{len(matched_df)/len(df)*100:.2f}%\")\n",
"print(f\"\\n输出目录{output_dir}\")\n",
"print(f\"CSV文件{output_dir}/aniline_candidates.csv\")\n",
"print(f\"可视化目录:{visualization_dir}\")\n",
"print(f\"SVG文件数量{len(list(visualization_dir.glob('*.svg')))}\")\n",
"\n",
"# 显示匹配最多的前几个分子\n",
"if len(matched_df) > 0:\n",
" print(\"\\n匹配数量最多的分子\")\n",
" top_matches = matched_df.nlargest(5, 'total_matches')[['Name', 'CAS', 'total_matches']]\n",
" print(top_matches)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 使用建议\n",
"\n",
"### 筛选结果解读\n",
"- **匹配分子**包含芳香胺结构Ar-NH₂的药物\n",
"- **蓝色高亮**匹配的SMARTS结构芳香碳/氮 + 氨基)\n",
"- **多重匹配**:分子中可能存在多个芳香胺基团\n",
"\n",
"### 后续分析建议\n",
"1. **合成路线验证**:查阅匹配分子的合成文献\n",
"2. **Sandmeyer反应确认**确认是否使用Sandmeyer反应引入卤素\n",
"3. **张夏恒反应评估**评估替代Sandmeyer反应的可行性\n",
"4. **工艺优化潜力**:分析替换为张夏恒反应的经济效益\n",
"\n",
"### 文件说明\n",
"- **CSV文件**:完整的分子属性和匹配信息\n",
"- **SVG文件**:结构可视化,蓝色高亮芳香胺结构\n",
"- **命名规则**{CAS}_{Name}.svg特殊字符已清理"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,797 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 筛选Sandmeyer反应候选药物 - 张夏恒反应替代分析\n",
"\n",
"## 背景介绍\n",
"\n",
"### Sandmeyer反应回顾\n",
"Sandmeyer反应是经典的芳香胺转化方法\n",
"**Ar-NH₂ → [Ar-N₂]⁺ → Ar-X**\n",
"其中 X = Cl, Br, I, CN, OH, SCN 等\n",
"\n",
"### 张夏恒反应\n",
"根据论文《s41586-025-09791-5_reference.pdf》张夏恒反应可能是一种新的替代方法\n",
"可以更有效地实现芳香胺到芳香卤素的转化。\n",
"\n",
"### 筛选策略\n",
"我们通过识别药物分子中可能来自Sandmeyer反应的芳香卤素结构\n",
"找出可以考虑用张夏恒反应进行工艺优化的候选药物。\n",
"\n",
"**重要提醒:**\n",
"- 此筛选仅基于分子结构特征\n",
"- 最终需要查阅文献确认合成路线\n",
"- 并非所有含卤素的药物都使用Sandmeyer反应合成"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 导入所需库"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from pathlib import Path\n",
"from rdkit.Chem.Draw import rdMolDraw2D\n",
"from IPython.display import SVG, display\n",
"from rdkit.Chem import AllChem\n",
"\n",
"# 创建输出目录\n",
"output_base = Path(\"../data/drug_targetmol\")\n",
"scheme_a_dir = output_base / \"scheme_A_visualizations\"\n",
"scheme_b_dir = output_base / \"scheme_B_visualizations\"\n",
"\n",
"scheme_a_dir.mkdir(exist_ok=True)\n",
"scheme_b_dir.mkdir(exist_ok=True)\n",
"\n",
"print(f\"创建目录:{scheme_a_dir}\")\n",
"print(f\"创建目录:{scheme_b_dir}\")\n",
"\n",
"def generate_highlighted_svg(mol, highlight_atoms, filename, title=\"\"):\n",
" \"\"\"生成高亮卤素结构的高清晰度SVG图片\"\"\"\n",
" from rdkit.Chem import AllChem\n",
" \n",
" # 计算2D坐标\n",
" AllChem.Compute2DCoords(mol)\n",
" \n",
" # 创建SVG绘制器\n",
" drawer = rdMolDraw2D.MolDraw2DSVG(1200, 900) # 更大的尺寸以提高清晰度\n",
" drawer.SetFontSize(12)\n",
" \n",
" # 绘制选项\n",
" draw_options = drawer.drawOptions()\n",
" draw_options.addAtomIndices = False # 不显示原子索引,保持简洁\n",
" draw_options.addBondIndices = False\n",
" draw_options.addStereoAnnotation = True\n",
" draw_options.fixedFontSize = 12\n",
" \n",
" # 高亮卤素原子(红色)\n",
" atom_colors = {}\n",
" for atom_idx in highlight_atoms:\n",
" atom_colors[atom_idx] = (1.0, 0.3, 0.3) # 红色高亮\n",
" \n",
" # 绘制分子\n",
" drawer.DrawMolecule(mol, \n",
" highlightAtoms=highlight_atoms,\n",
" highlightAtomColors=atom_colors)\n",
" \n",
" drawer.FinishDrawing()\n",
" svg_content = drawer.GetDrawingText()\n",
" \n",
" # 添加标题\n",
" if title:\n",
" # 在SVG中添加标题\n",
" svg_lines = svg_content.split(\"\\n\")\n",
" # 在<g>标签前插入标题\n",
" for i, line in enumerate(svg_lines):\n",
" if \"<g \" in line and \"transform\" in line:\n",
" svg_lines.insert(i, f\"<text x=\"50%\" y=\"30\" text-anchor=\"middle\" font-size=\"16\" font-weight=\"bold\">{title}</text>\")\n",
" break\n",
" svg_with_title = \"\\n\".join(svg_lines)\n",
" else:\n",
" svg_with_title = svg_content\n",
" \n",
" # 保存文件\n",
" with open(filename, \"w\") as f:\n",
" f.write(svg_with_title)\n",
" \n",
" print(f\"保存SVG: {filename}\")\n",
" \n",
" return svg_content\n",
"\n",
"def visualize_molecules(df, output_dir, scheme_name, max_molecules=50):\n",
" \"\"\"为DataFrame中的分子生成可视化图片\"\"\"\n",
" print(f\"\\n开始生成{scheme_name}的可视化图片...\")\n",
" print(f\"输出目录: {output_dir}\")\n",
" \n",
" generated_count = 0\n",
" \n",
" for idx, row in df.iterrows():\n",
" if generated_count >= max_molecules:\n",
" print(f\"已达到最大生成数量限制 ({max_molecules}),停止生成\")\n",
" break\n",
" \n",
" cas = str(row.get(\"CAS\", \"unknown\")).strip()\n",
" name = str(row.get(\"Name\", \"unknown\")).strip()\n",
" \n",
" # 跳过无效的CAS号\n",
" if not cas or cas == \"nan\" or cas == \"unknown\":\n",
" continue\n",
" \n",
" mol = row.get(\"ROMol\")\n",
" if mol is None:\n",
" continue\n",
" \n",
" # 找出卤素原子\n",
" halogen_atoms = []\n",
" for atom in mol.GetAtoms():\n",
" if atom.GetAtomicNum() in [9, 17, 35, 53]: # F, Cl, Br, I\n",
" halogen_atoms.append(atom.GetIdx())\n",
" \n",
" if not halogen_atoms:\n",
" continue\n",
" \n",
" # 生成文件名和标题\n",
" filename = output_dir / f\"{cas}.svg\"\n",
" title = f\"{name} ({cas})\"\n",
" \n",
" try:\n",
" # 生成SVG\n",
" generate_highlighted_svg(mol, halogen_atoms, filename, title)\n",
" generated_count += 1\n",
" \n",
" # 每10个显示一次进度\n",
" if generated_count % 10 == 0:\n",
" print(f\"已生成 {generated_count} 个分子图片\")\n",
" \n",
" except Exception as e:\n",
" print(f\"生成 {cas} 失败: {e}\")\n",
" continue\n",
" \n",
" print(f\"完成!共生成 {generated_count} 个{scheme_name}的可视化图片\")\n",
" return generated_count\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 定义筛选模式\n",
"\n",
"### SMARTS模式说明\n",
"\n",
"#### 方案A杂芳环卤素最高优先级\n",
"- **筛选逻辑**杂芳环上的卤素最可能是Sandmeyer反应产物\n",
"- **原因**杂芳环直接卤代通常较困难Sandmeyer反应是重要合成方法\n",
"- **预期结果**:候选数量少但精准度高\n",
"\n",
"#### 方案B所有芳香卤素中等优先级 \n",
"- **筛选逻辑**:所有芳环上的卤素\n",
"- **原因**:虽然有些卤素可能来自其他途径,但可以扩大筛选范围\n",
"- **预期结果**:候选数量较多,需要更多文献验证\n",
"\n",
"**SMARTS模式优化说明**\n",
"- 原始模式 `n:c:[Cl,Br,I]` 语法有误\n",
"- 优化为更准确的环结构匹配模式\n",
"- 使用更精确的原子环境描述"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# 定义筛选模式\n",
"SCREENING_PATTERNS = {\n",
" 'heteroaryl_halides': {\n",
" 'name': '杂芳环卤素',\n",
" 'description': '杂环上的Cl, Br, I原子方案A',\n",
" 'smarts': [\n",
" '[n,o,s]c[Cl,Br,I]', # 杂原子邻位卤素\n",
" '[n,o,s]cc[Cl,Br,I]', # 杂原子邻位卤素(隔一个碳)\n",
" 'c1[n,o,s]c([Cl,Br,I])ccc1', # 卤代吡咯类\n",
" 'c1c([Cl,Br,I])cncn1', # 卤代嘧啶\n",
" 'c1ccc2c([Cl,Br,I])ccnc2c1', # 卤代喹啉\n",
" 'c1c([Cl,Br,I])cncc1', # 卤代吡嗪\n",
" 'c1([Cl,Br,I])scnc1', # 卤代噻唑\n",
" ],\n",
" 'scheme': 'A'\n",
" },\n",
" 'aryl_halides': {\n",
" 'name': '芳香卤素',\n",
" 'description': '所有芳环上的Cl, Br, I原子方案B',\n",
" 'smarts': [\n",
" 'c[Cl,Br,I]', # 任意芳香氯\n",
" 'c-C#N', # 芳香氰基\n",
" 'c1ccc(S(=O)(=O)N)cc1', # 磺胺核心\n",
" 'c1c(Cl)cc(Cl)cc1', # 多卤代苯\n",
" ],\n",
" 'scheme': 'B'\n",
" }\n",
"}\n",
"\n",
"def create_pattern_matchers():\n",
" \"\"\"创建SMARTS模式匹配器\"\"\"\n",
" matchers = {}\n",
" for key, pattern_info in SCREENING_PATTERNS.items():\n",
" matchers[key] = {\n",
" 'info': pattern_info,\n",
" 'matchers': [Chem.MolFromSmarts(smarts) for smarts in pattern_info['smarts']]\n",
" }\n",
" return matchers\n",
"\n",
"PATTERN_MATCHERS = create_pattern_matchers()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 数据加载和预处理\n",
"\n",
"### SDF文件说明\n",
"- 文件位置:`data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
"- 包含药物分子结构和丰富属性信息\n",
"- 每个分子记录包含SMILES、分子式、分子量、批准状态、适应症等"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"正在读取SDF文件...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[15:05:51] Both bonds on one end of an atropisomer are on the same side - atoms is : 3\n",
"[15:05:51] Explicit valence for atom # 2 N greater than permitted\n",
"[15:05:51] ERROR: Could not sanitize molecule ending on line 217340\n",
"[15:05:51] ERROR: Explicit valence for atom # 2 N greater than permitted\n",
"[15:05:52] Explicit valence for atom # 4 N greater than permitted\n",
"[15:05:52] ERROR: Could not sanitize molecule ending on line 317283\n",
"[15:05:52] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
"[15:05:52] Explicit valence for atom # 4 N greater than permitted\n",
"[15:05:52] ERROR: Could not sanitize molecule ending on line 324666\n",
"[15:05:52] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
"[15:05:52] Explicit valence for atom # 5 N greater than permitted\n",
"[15:05:52] ERROR: Could not sanitize molecule ending on line 365883\n",
"[15:05:52] ERROR: Explicit valence for atom # 5 N greater than permitted\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"成功加载 3276 个分子\n",
"\n",
"数据概览:\n",
" Index Plate Row Col ID Name \\\n",
"0 1 L1010-1 a 2 Dexamethasone \n",
"1 2 L1010-1 a 3 Danicopan \n",
"2 3 L1010-1 a 4 Cyclosporin A \n",
"3 4 L1010-1 a 5 L-Carnitine \n",
"4 5 L1010-1 a 6 Trimetazidine dihydrochloride \n",
"\n",
" Synonyms CAS \\\n",
"0 MK 125;Prednisolone F;NSC 34521;Hexadecadrol 50-02-2 \n",
"1 ACH-4471 1903768-17-1 \n",
"2 Cyclosporine A;Ciclosporin;Cyclosporine 59865-13-3 \n",
"3 L(-)-Carnitine;Levocarnitine 541-15-1 \n",
"4 Yoshimilon;Kyurinett;Vastarel F 13171-25-0 \n",
"\n",
" SMILES \\\n",
"0 C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO \n",
"1 CC(=O)c1nn(CC(=O)N2C[C@H](F)C[C@H]2C(=O)Nc2cccc(Br)n2)c2ccc(cc12)-c1cnc(C)nc1 \n",
"2 [C@H]([C@@H](C/C=C/C)C)(O)[C@@]1(N(C)C(=O)[C@H]([C@@H](C)C)N(C)C(=O)[C@H](CC(C)C)N(C)C(=O)[C@H](... \n",
"3 C[N+](C)(C)C[C@@H](O)CC([O-])=O \n",
"4 Cl.Cl.COC1=C(OC)C(OC)=C(CN2CCNCC2)C=C1 \n",
"\n",
" Formula MolWt Approved status \\\n",
"0 C22H29FO5 392.46 NMPA;EMA;FDA \n",
"1 C26H23BrFN7O3 580.41 FDA \n",
"2 C62H111N11O12 1202.61 FDA \n",
"3 C7H15NO3 161.2 FDA \n",
"4 C14H24Cl2N2O3 339.258 NMPA;EMA \n",
"\n",
" Pharmacopoeia \\\n",
"0 USP39-NF34;BP2015;JP16;IP2010 \n",
"1 NaN \n",
"2 Martindale the Extra Pharmacopoei, EP10.2, USP43-NF38, Ph.Int_6th, JP17 \n",
"3 NaN \n",
"4 BP2019;KP ;EP9.2;IP2010;JP17;Martindale:The Extra Pharmacopoeia \n",
"\n",
" Disease \\\n",
"0 Metabolism \n",
"1 Others \n",
"2 Immune system \n",
"3 Cardiovascular system \n",
"4 Cardiovascular system \n",
"\n",
" Pathways \\\n",
"0 Antibody-drug Conjugate/ADC Related;Autophagy;Endocrinology/Hormones;Immunology/Inflammation;Mic... \n",
"1 Immunology/Inflammation \n",
"2 Immunology/Inflammation;Metabolism;Microbiology/Virology \n",
"3 Metabolism \n",
"4 Autophagy;Metabolism \n",
"\n",
" Target \\\n",
"0 Antibacterial;Antibiotic;Autophagy;Complement System;Glucocorticoid Receptor;IL Receptor;Mitopha... \n",
"1 Complement System \n",
"2 Phosphatase;Antibiotic;Complement System \n",
"3 Endogenous Metabolite;Fatty Acid Synthase \n",
"4 Autophagy;Fatty Acid Synthase \n",
"\n",
" Receptor \\\n",
"0 Antibiotic; Autophagy; Bacterial; Complement System; Glucocorticoid Receptor; IL receptor; Mitop... \n",
"1 Complement System; factor D \n",
"2 Antibiotic; calcineurin phosphatase; Complement System; Phosphatase \n",
"3 Endogenous Metabolite; FAS \n",
"4 Autophagy; mitochondrial long-chain 3-ketoacyl thiolase \n",
"\n",
" Bioactivity \\\n",
"0 Dexamethasone is a glucocorticoid receptor agonist and IL receptor modulator with anti-inflammat... \n",
"1 Danicopan (ACH-4471) (ACH-4471) is a selective, orally active small molecule factor D inhibitor ... \n",
"2 Cyclosporin A is a natural product and an active fungal metabolite, classified as a cyclic polyp... \n",
"3 L-Carnitine (L(-)-Carnitine) is an amino acid derivative. L-Carnitine facilitates long-chain fat... \n",
"4 Trimetazidine dihydrochloride (Vastarel F) can improve myocardial glucose utilization by inhibit... \n",
"\n",
" Reference \\\n",
"0 Li M, Yu H. Identification of WP1066, an inhibitor of JAK2 and STAT3, as a Kv1. 3 potassium chan... \n",
"1 Yuan X, et al. Small-molecule factor D inhibitors selectively block the alternative pathway of c... \n",
"2 D'Angelo G, et al. Cyclosporin A prevents the hypoxic adaptation by activating hypoxia-inducible... \n",
"3 Jogl G, Tong L. Cell. 2003 Jan 10; 112(1):113-22. \n",
"4 Yang Q, et al. Int J Clin Exp Pathol. 2015, 8(4):3735-3741.;Liu Z, et al. Metabolism. 2016, 65(3... \n",
"\n",
" ROMol \n",
"0 <rdkit.Chem.rdchem.Mol object at 0x743d782049e0> \n",
"1 <rdkit.Chem.rdchem.Mol object at 0x743d782871b0> \n",
"2 <rdkit.Chem.rdchem.Mol object at 0x743d78287220> \n",
"3 <rdkit.Chem.rdchem.Mol object at 0x743d782873e0> \n",
"4 <rdkit.Chem.rdchem.Mol object at 0x743d78287450> \n",
"\n",
"列名:['Index', 'Plate', 'Row', 'Col', 'ID', 'Name', 'Synonyms', 'CAS', 'SMILES', 'Formula', 'MolWt', 'Approved status', 'Pharmacopoeia', 'Disease', 'Pathways', 'Target', 'Receptor', 'Bioactivity', 'Reference', 'ROMol']\n"
]
}
],
"source": [
"# 读取筛选结果CSV文件\n",
"import pandas as pd\n",
"from rdkit import Chem\n",
"\n",
"print(\"正在读取筛选结果CSV文件...\")\n",
"\n",
"# 读取方案A结果\n",
"df_a = pd.read_csv(\"../data/drug_targetmol/sandmeyer_candidates_scheme_A_heteroaryl_halides.csv\")\n",
"print(f\"方案A数据: {len(df_a)} 行\")\n",
"\n",
"# 读取方案B结果\n",
"df_b = pd.read_csv(\"../data/drug_targetmol/sandmeyer_candidates_scheme_B_aryl_halides.csv\")\n",
"print(f\"方案B数据: {len(df_b)} 行\")\n",
"\n",
"# 重建分子对象\n",
"def rebuild_molecules(df):\n",
" mols = []\n",
" for idx, row in df.iterrows():\n",
" smiles = row.get(\"SMILES_from_mol\", \"\")\n",
" if smiles and str(smiles) != \"nan\":\n",
" mol = Chem.MolFromSmiles(str(smiles))\n",
" mols.append(mol)\n",
" else:\n",
" mols.append(None)\n",
" df[\"ROMol\"] = mols\n",
" valid_mols = sum(1 for m in mols if m is not None)\n",
" print(f\"成功重建 {valid_mols} 个分子对象\")\n",
" return df\n",
"\n",
"df_a = rebuild_molecules(df_a)\n",
"df_b = rebuild_molecules(df_b)\n",
"\n",
"print(\"\n",
"数据加载完成\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 分子筛选函数\n",
"\n",
"### 筛选逻辑说明\n",
"\n",
"1. **分子验证**:确保分子结构有效\n",
"2. **子结构匹配**使用RDKit的SMARTS匹配\n",
"3. **结果记录**:记录匹配的模式和具体子结构\n",
"4. **数据完整性**:保留所有原始属性信息"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def screen_molecules_for_patterns(df, pattern_key):\n",
" \"\"\"\n",
" 筛选包含特定子结构的分子\n",
" \n",
" Args:\n",
" df: 包含分子的DataFrame\n",
" pattern_key: 筛选模式键名\n",
" \n",
" Returns:\n",
" 筛选结果DataFrame\n",
" \"\"\"\n",
" pattern_info = PATTERN_MATCHERS[pattern_key]['info']\n",
" matchers = PATTERN_MATCHERS[pattern_key]['matchers']\n",
" \n",
" print(f\"\\n开始筛选{pattern_info['name']}\")\n",
" print(f\"描述:{pattern_info['description']}\")\n",
" print(f\"SMARTS模式数量{len(pattern_info['smarts'])}\")\n",
" \n",
" matched_molecules = []\n",
" \n",
" for idx, row in df.iterrows():\n",
" mol = row['ROMol']\n",
" if mol is None:\n",
" continue\n",
" \n",
" # 检查是否匹配任何模式\n",
" matched_patterns = []\n",
" for i, matcher in enumerate(matchers):\n",
" if matcher is None:\n",
" continue\n",
" if mol.HasSubstructMatch(matcher):\n",
" matched_patterns.append({\n",
" 'pattern_index': i,\n",
" 'smarts': pattern_info['smarts'][i],\n",
" 'matches': len(mol.GetSubstructMatches(matcher))\n",
" })\n",
" \n",
" if matched_patterns:\n",
" # 创建匹配记录\n",
" match_record = row.copy()\n",
" match_record['matched_patterns'] = matched_patterns\n",
" match_record['total_matches'] = sum(p['matches'] for p in matched_patterns)\n",
" match_record['screening_scheme'] = pattern_info['scheme']\n",
" matched_molecules.append(match_record)\n",
" \n",
" result_df = pd.DataFrame(matched_molecules)\n",
" print(f\"找到 {len(result_df)} 个匹配分子\")\n",
" \n",
" return result_df\n",
"\n",
"def save_screening_results(df, filename, description):\n",
" \"\"\"保存筛选结果到CSV\"\"\"\n",
" output_path = f\"../data/drug_targetmol/{filename}\"\n",
" \n",
" # 转换ROMol列为SMILES因为ROMol对象无法保存到CSV\n",
" df_export = df.copy()\n",
" if 'ROMol' in df_export.columns:\n",
" df_export['SMILES_from_mol'] = df_export['ROMol'].apply(lambda x: Chem.MolToSmiles(x) if x else '')\n",
" df_export = df_export.drop('ROMol', axis=1)\n",
" \n",
" df_export.to_csv(output_path, index=False, encoding='utf-8')\n",
" print(f\"结果已保存到:{output_path}\")\n",
" print(f\"包含 {len(df_export)} 个分子,{len(df_export.columns)} 个属性列\")\n",
" \n",
" return output_path"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 方案A筛选杂芳环卤素\n",
"\n",
"### 执行逻辑\n",
"- 使用最保守的筛选策略\n",
"- 只匹配杂芳环上的卤素\n",
"- 预期获得高精度结果\n",
"- 需要进一步的合成路线验证"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"开始筛选:杂芳环卤素\n",
"描述杂环上的Cl, Br, I原子方案A\n",
"SMARTS模式数量7\n",
"找到 57 个匹配分子\n",
"\n",
"方案A筛选结果摘要\n",
" Name CAS Formula \\\n",
"1 Danicopan 1903768-17-1 C26H23BrFN7O3 \n",
"8 Lonafarnib 193275-84-2 C27H31Br2ClN4O2 \n",
"19 Idoxuridine 54-42-2 C9H11IN2O5 \n",
"144 Dimenhydrinate 523-87-5 C24H28ClN5O3 \n",
"259 Sertaconazole 99592-32-2 C20H15Cl3N2OS \n",
"311 Tioconazole 65899-73-2 C16H13Cl3N2OS \n",
"337 Gimeracil 103766-25-2 C5H4ClNO2 \n",
"580 Bromocriptine mesylate 22260-51-1 C33H44BrN5O8S \n",
"592 Clofarabine 123318-82-1 C10H11ClFN5O3 \n",
"684 Vorasidenib 1644545-52-7 C14H13ClF6N6 \n",
"\n",
" matched_patterns \\\n",
"1 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}, {'pattern_index': 2, 'smarts... \n",
"8 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}, {'pattern_index': 5, 'smart... \n",
"19 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 2}, {'pattern_index': 3, 'smart... \n",
"144 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 2}] \n",
"259 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}] \n",
"311 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}] \n",
"337 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}, {'pattern_index': 5, 'smart... \n",
"580 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}] \n",
"592 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 2}] \n",
"684 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}, {'pattern_index': 2, 'smarts... \n",
"\n",
" total_matches \n",
"1 2 \n",
"8 2 \n",
"19 3 \n",
"144 2 \n",
"259 1 \n",
"311 1 \n",
"337 2 \n",
"580 1 \n",
"592 2 \n",
"684 2 \n",
"结果已保存到:../data/drug_targetmol/sandmeyer_candidates_scheme_A_heteroaryl_halides.csv\n",
"包含 57 个分子23 个属性列\n"
]
}
],
"source": [
"# 执行方案A筛选\n",
"scheme_a_results = screen_molecules_for_patterns(df, 'heteroaryl_halides')\n",
"\n",
"# 显示结果摘要\n",
"if len(scheme_a_results) > 0:\n",
" print(\"\\n方案A筛选结果摘要\")\n",
" print(scheme_a_results[['Name', 'CAS', 'Formula', 'matched_patterns', 'total_matches']].head(10))\n",
" \n",
" # 保存结果\n",
" save_screening_results(\n",
" scheme_a_results, \n",
" 'sandmeyer_candidates_scheme_A_heteroaryl_halides.csv',\n",
" '方案A杂芳环卤素筛选结果'\n",
" )\n",
"else:\n",
" print(\"\\n方案A未找到匹配分子\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 方案B筛选所有芳香卤素\n",
"\n",
"### 执行逻辑\n",
"- 使用更宽松的筛选策略 \n",
"- 匹配所有芳环上的卤素\n",
"- 会包含更多候选分子\n",
"- 需要更多的文献验证工作"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"开始筛选:芳香卤素\n",
"描述所有芳环上的Cl, Br, I原子方案B\n",
"SMARTS模式数量4\n",
"找到 548 个匹配分子\n",
"\n",
"方案B筛选结果摘要\n",
" Name CAS Formula \\\n",
"1 Danicopan 1903768-17-1 C26H23BrFN7O3 \n",
"8 Lonafarnib 193275-84-2 C27H31Br2ClN4O2 \n",
"9 Ketoconazole 65277-42-1 C26H28Cl2N4O4 \n",
"13 Ozanimod 1306760-87-1 C23H24N4O3 \n",
"14 Ponesimod 854107-55-4 C23H25ClN2O4S \n",
"19 Idoxuridine 54-42-2 C9H11IN2O5 \n",
"53 Moclobemide 71320-77-9 C13H17ClN2O2 \n",
"74 Clemastine 15686-51-8 C21H26ClNO \n",
"75 Buclizine dihydrochloride 129-74-8 C28H33ClN2·2HCl \n",
"78 Asenapine 65576-45-6 C17H16ClNO \n",
"\n",
" matched_patterns \\\n",
"1 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"8 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 3}] \n",
"9 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 2}, {'pattern_index': 3, 'smarts': 'c1c... \n",
"13 [{'pattern_index': 1, 'smarts': 'c-C#N', 'matches': 1}] \n",
"14 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"19 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"53 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"74 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"75 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"78 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
"\n",
" total_matches \n",
"1 1 \n",
"8 3 \n",
"9 3 \n",
"13 1 \n",
"14 1 \n",
"19 1 \n",
"53 1 \n",
"74 1 \n",
"75 1 \n",
"78 1 \n",
"结果已保存到:../data/drug_targetmol/sandmeyer_candidates_scheme_B_aryl_halides.csv\n",
"包含 548 个分子23 个属性列\n"
]
}
],
"source": [
"# 执行方案B筛选\n",
"scheme_b_results = screen_molecules_for_patterns(df, 'aryl_halides')\n",
"\n",
"# 显示结果摘要\n",
"if len(scheme_b_results) > 0:\n",
" print(\"\\n方案B筛选结果摘要\")\n",
" print(scheme_b_results[['Name', 'CAS', 'Formula', 'matched_patterns', 'total_matches']].head(10))\n",
" \n",
" # 保存结果\n",
" save_screening_results(\n",
" scheme_b_results, \n",
" 'sandmeyer_candidates_scheme_B_aryl_halides.csv',\n",
" '方案B所有芳香卤素筛选结果'\n",
" )\n",
"else:\n",
" print(\"\\n方案B未找到匹配分子\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 结果分析和总结\n",
"\n",
"### 筛选统计"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== 筛选结果统计 ===\n",
"总分子数3276\n",
"方案A杂芳环卤素匹配数57\n",
"方案B所有芳香卤素匹配数548\n",
"两方案重叠分子数57\n",
"仅方案A匹配的分子数0\n",
"仅方案B匹配的分子数490\n"
]
}
],
"source": [
"# 结果统计\n",
"print(\"=== 筛选结果统计 ===\")\n",
"print(f\"总分子数:{len(df)}\")\n",
"print(f\"方案A杂芳环卤素匹配数{len(scheme_a_results)}\")\n",
"print(f\"方案B所有芳香卤素匹配数{len(scheme_b_results)}\")\n",
"\n",
"if len(scheme_a_results) > 0 and len(scheme_b_results) > 0:\n",
" # 分析重叠\n",
" scheme_a_cas = set(scheme_a_results['CAS'].dropna())\n",
" scheme_b_cas = set(scheme_b_results['CAS'].dropna())\n",
" overlap = scheme_a_cas & scheme_b_cas\n",
" print(f\"两方案重叠分子数:{len(overlap)}\")\n",
" \n",
" # 方案A特有\n",
" scheme_a_only = scheme_a_cas - scheme_b_cas\n",
" print(f\"仅方案A匹配的分子数{len(scheme_a_only)}\")\n",
" \n",
" # 方案B特有\n",
" scheme_b_only = scheme_b_cas - scheme_a_cas\n",
" print(f\"仅方案B匹配的分子数{len(scheme_b_only)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 使用建议\n",
"\n",
"### 优先级推荐\n",
"\n",
"1. **第一优先级**方案A结果\n",
" - 杂芳环卤素最可能是Sandmeyer反应产物\n",
" - 候选数量相对较少,便于深入研究\n",
" - 建议重点查阅这些分子的合成路线\n",
"\n",
"2. **第二优先级**方案B独有结果\n",
" - 苯环卤素可能来自多种途径\n",
" - 需要仔细评估合成可能性\n",
" - 适合作为补充筛选\n",
"\n",
"### 后续验证步骤\n",
"\n",
"1. **文献调研**:查阅候选分子的合成路线\n",
"2. **反应条件评估**确认是否使用了Sandmeyer反应\n",
"3. **经济性分析**:评估张夏恒反应用于该分子的潜力\n",
"4. **实验验证**:必要时进行小规模验证实验\n",
"\n",
"### 注意事项\n",
"\n",
"- 此筛选基于结构特征,不等同于合成路线确认\n",
"- 部分卤素可能来自原料而非合成步骤\n",
"- 分子复杂程度和合成可行性需要综合考虑\n",
"- 建议结合药物的重要性和市场规模进行优先级排序"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 分子结构可视化\n",
"\n",
"### 创建输出目录和可视化函数\n",
"\n",
"本节将为筛选出的候选分子生成高清晰度的SVG结构图突出显示卤素结构。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,285 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SMARTS匹配检测与可视化\n",
"\n",
"本notebook用于\n",
"1. 读取ring16/temp.csv中的smiles列\n",
"2. 对SMARTS模式进行匹配检测`O=C1C[C@@H](O)[*:15][*:17][*:18]C[*:23]C(=O)/C=C/[*:28]=C/[*:7][*:8]O1`\n",
"3. 处理dummy原子[*:X]),尝试两种方式:\n",
" - 不替换dummy原子\n",
" - 将dummy原子替换为C\n",
"4. 可视化匹配的原子高亮显示\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. 导入必要的库\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"所有模块导入成功!\n"
]
}
],
"source": [
"import sys\n",
"from pathlib import Path\n",
"import re\n",
"\n",
"# 添加项目根目录到 Python 路径\n",
"notebook_dir = Path().resolve()\n",
"project_root = notebook_dir.parent\n",
"sys.path.insert(0, str(project_root))\n",
"\n",
"from rdkit import Chem\n",
"from rdkit.Chem import Draw\n",
"from rdkit.Chem.Draw import rdMolDraw2D\n",
"from IPython.display import SVG, display, HTML\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from collections import Counter\n",
"\n",
"print(\"所有模块导入成功!\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. 读取数据\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"数据集大小: 2022 个分子\n",
"列名: ['IDs', 'molecule_pref_name', 'max_pChEMBL', 'max_pChEMBL_target', '# Target Organisms', 'Target Organisms', '# Known Targets', 'Known Targets', 'target_pref_name', 'smiles']\n",
"\n",
"SMILES列存在共 2022 个有效SMILES\n",
"\n",
"前5个SMILES示例:\n",
"['C/C(=C\\\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[C@H](C)[C@H](O)[C@@H](C)C(=O)C(C)(C)[C@@H](O)CC(=O)O1', 'C/C(=C\\\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[C@H](C)[C@H](O)[C@@H](C)C(=O)C(C)(C)[C@@H](O)CC(=O)O1', 'Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C)C(=O)N5CCC[C@H]5C(=O)N(C)CC(=O)N(C)[C@@H](C(C)C)C(=O)O[C@@H]4C)c3nc-2c(C(=O)N[C@@H]2C(=O)N[C@H](C(C)C)C(=O)N3CCC[C@H]3C(=O)N(C)CC(=O)N(C)[C@@H](C(C)C)C(=O)O[C@@H]2C)c(N)c1=O', 'CCCCCCCC(=O)SCC/C=C/[C@@H]1CC(=O)NCc2nc(cs2)C2=N[C@@](C)(CS2)C(=O)N[C@@H](C(C)C)C(=O)O1', 'Cc1cc2ccc1[C@@H](C)COC(=O)Nc1ccc(S(=O)(=O)C3CC3)c(c1)CN(C)C(=O)[C@@H]2Nc1ccc2c(N)ncc(F)c2c1']\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>IDs</th>\n",
" <th>molecule_pref_name</th>\n",
" <th>max_pChEMBL</th>\n",
" <th>max_pChEMBL_target</th>\n",
" <th># Target Organisms</th>\n",
" <th>Target Organisms</th>\n",
" <th># Known Targets</th>\n",
" <th>Known Targets</th>\n",
" <th>target_pref_name</th>\n",
" <th>smiles</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>CHEMBL94657</td>\n",
" <td>PATUPILONE</td>\n",
" <td>10.67</td>\n",
" <td>CHEMBL1075590</td>\n",
" <td>695</td>\n",
" <td>Sus scrofa, Mus musculus, None, Plasmodium fal...</td>\n",
" <td>695</td>\n",
" <td>CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE...</td>\n",
" <td>AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6...</td>\n",
" <td>C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>CHEMBL94657</td>\n",
" <td>PATUPILONE</td>\n",
" <td>10.67</td>\n",
" <td>CHEMBL1075590</td>\n",
" <td>695</td>\n",
" <td>Sus scrofa, Mus musculus, None, Plasmodium fal...</td>\n",
" <td>695</td>\n",
" <td>CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE...</td>\n",
" <td>AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6...</td>\n",
" <td>C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>CHEMBL1554</td>\n",
" <td>DACTINOMYCIN</td>\n",
" <td>10.10</td>\n",
" <td>CHEMBL614533</td>\n",
" <td>177</td>\n",
" <td>Giardia intestinalis, Trypanosoma cruzi, Equus...</td>\n",
" <td>177</td>\n",
" <td>CHEMBL388, CHEMBL614151, CHEMBL3577, CHEMBL551...</td>\n",
" <td>HT-29, CCRF-CEM, WIL2-NS, Unchecked, Caspase-7...</td>\n",
" <td>Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>CHEMBL1173445</td>\n",
" <td>LARGAZOLE</td>\n",
" <td>8.80</td>\n",
" <td>CHEMBL612545</td>\n",
" <td>45</td>\n",
" <td>Homo sapiens, None</td>\n",
" <td>45</td>\n",
" <td>CHEMBL392, CHEMBL3192, CHEMBL3524, CHEMBL5103,...</td>\n",
" <td>Histone deacetylase 9, Ubiquitin-like modifier...</td>\n",
" <td>CCCCCCCC(=O)SCC/C=C/[C@@H]1CC(=O)NCc2nc(cs2)C2...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>CHEMBL3902498</td>\n",
" <td>NaN</td>\n",
" <td>9.37</td>\n",
" <td>CHEMBL2095194,CHEMBL3991</td>\n",
" <td>17</td>\n",
" <td>Homo sapiens, None</td>\n",
" <td>17</td>\n",
" <td>CHEMBL2820, CHEMBL3991, CHEMBL1801, CHEMBL204,...</td>\n",
" <td>Coagulation factor X, Kallikrein 1, Coagulatio...</td>\n",
" <td>Cc1cc2ccc1[C@@H](C)COC(=O)Nc1ccc(S(=O)(=O)C3CC...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" IDs molecule_pref_name max_pChEMBL max_pChEMBL_target \\\n",
"0 CHEMBL94657 PATUPILONE 10.67 CHEMBL1075590 \n",
"1 CHEMBL94657 PATUPILONE 10.67 CHEMBL1075590 \n",
"2 CHEMBL1554 DACTINOMYCIN 10.10 CHEMBL614533 \n",
"3 CHEMBL1173445 LARGAZOLE 8.80 CHEMBL612545 \n",
"4 CHEMBL3902498 NaN 9.37 CHEMBL2095194,CHEMBL3991 \n",
"\n",
" # Target Organisms Target Organisms \\\n",
"0 695 Sus scrofa, Mus musculus, None, Plasmodium fal... \n",
"1 695 Sus scrofa, Mus musculus, None, Plasmodium fal... \n",
"2 177 Giardia intestinalis, Trypanosoma cruzi, Equus... \n",
"3 45 Homo sapiens, None \n",
"4 17 Homo sapiens, None \n",
"\n",
" # Known Targets Known Targets \\\n",
"0 695 CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE... \n",
"1 695 CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE... \n",
"2 177 CHEMBL388, CHEMBL614151, CHEMBL3577, CHEMBL551... \n",
"3 45 CHEMBL392, CHEMBL3192, CHEMBL3524, CHEMBL5103,... \n",
"4 17 CHEMBL2820, CHEMBL3991, CHEMBL1801, CHEMBL204,... \n",
"\n",
" target_pref_name \\\n",
"0 AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6... \n",
"1 AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6... \n",
"2 HT-29, CCRF-CEM, WIL2-NS, Unchecked, Caspase-7... \n",
"3 Histone deacetylase 9, Ubiquitin-like modifier... \n",
"4 Coagulation factor X, Kallikrein 1, Coagulatio... \n",
"\n",
" smiles \n",
"0 C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[... \n",
"1 C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[... \n",
"2 Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C... \n",
"3 CCCCCCCC(=O)SCC/C=C/[C@@H]1CC(=O)NCc2nc(cs2)C2... \n",
"4 Cc1cc2ccc1[C@@H](C)COC(=O)Nc1ccc(S(=O)(=O)C3CC... "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 读取CSV文件\n",
"input_file = project_root / 'ring16' / 'temp.csv'\n",
"df = pd.read_csv(input_file)\n",
"\n",
"print(f\"数据集大小: {len(df)} 个分子\")\n",
"print(f\"列名: {df.columns.tolist()}\")\n",
"\n",
"# 检查smiles列\n",
"if 'smiles' in df.columns:\n",
" print(f\"\\nSMILES列存在共 {df['smiles'].notna().sum()} 个有效SMILES\")\n",
" print(f\"\\n前5个SMILES示例:\")\n",
" print(df['smiles'].head().tolist())\n",
"else:\n",
" print(\"错误: 未找到smiles列\")\n",
" \n",
"df.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. 定义SMARTS模式和处理函数\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

File diff suppressed because one or more lines are too long