first add
190
notebooks/README_analyze_ring16.md
Normal file
@@ -0,0 +1,190 @@
# Analysis Guide for 16-Membered Macrolactone Molecules

## Files

- **Notebook**: `analyze_ring16_molecules.ipynb`
- **Input data**: `../output/ring16_match_smarts.csv` (307 molecules)
- **Output directory**: `../output/`

## Usage

### 1. Activate the environment

```bash
cd /home/zly/project/macro_split
pixi shell
```

### 2. Run the notebook

```bash
jupyter notebook notebooks/analyze_ring16_molecules.ipynb
```

Or use JupyterLab:

```bash
jupyter lab notebooks/analyze_ring16_molecules.ipynb
```

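### 3. Run all cells in order

The notebook automatically:
1. Computes drug-like properties for every molecule (molecular weight, LogP, QED, TPSA, etc.)
2. Performs side-chain cleavage analysis
3. Tallies the fragment distribution at each position (3-16)
4. Generates distribution plots and saves them to the `output/` directory
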
## Generated output files

### Image files
- `ring16_molecular_properties_distribution.png` - molecular property distributions (4 subplots)
  - Molecular weight distribution
  - LogP distribution
  - QED distribution
  - TPSA distribution

- `atom_count_distribution_ring16.png` - atom-count distribution at each position (14 subplots, positions 3-16)

- `molecular_weight_distribution_ring16.png` - molecular-weight distribution at each position (14 subplots, positions 3-16)

### Data files
- `ring16_fragments_analysis.csv` - detailed information for every fragment
  - Columns: fragment_id, parent_id, parent_smiles, cleavage_position, fragment_smiles, atom_count, molecular_weight

- `ring16_molecular_properties.csv` - property data for every molecule
  - Columns: unique_id, mol_weight, logP, num_h_donors, num_h_acceptors, num_rotatable_bonds, tpsa, qed, num_atoms, num_heavy_atoms

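As a rough illustration of how the fragment table can be consumed, the sketch below groups rows by `cleavage_position` using only the standard library. The column names come from the list above; the helper name `summarize_fragments` and the sample rows are hypothetical and are not taken from the real dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def summarize_fragments(csv_text):
    """Group fragment rows by cleavage position and summarize each group.

    Hypothetical helper: it expects the column names listed for
    ring16_fragments_analysis.csv and returns
    {position: {"count", "mean_atom_count", "mean_molecular_weight"}}.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[int(row["cleavage_position"])].append(row)

    summary = {}
    for position in sorted(groups):
        rows = groups[position]
        summary[position] = {
            "count": len(rows),
            "mean_atom_count": mean(int(r["atom_count"]) for r in rows),
            "mean_molecular_weight": mean(float(r["molecular_weight"]) for r in rows),
        }
    return summary


# Made-up rows for illustration only (values are placeholders, not real data).
SAMPLE = """fragment_id,parent_id,parent_smiles,cleavage_position,fragment_smiles,atom_count,molecular_weight
f1,m1,,3,C,1,15.0
f2,m1,,5,CC,2,29.0
f3,m2,,3,CCC,3,43.0
"""

if __name__ == "__main__":
    for position, stats in summarize_fragments(SAMPLE).items():
        print(position, stats)
```

To run it on the real file, read the file's text and pass it to the same function. With pandas, the equivalent would be a `groupby('cleavage_position')` aggregation.
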
## Analysis content

### Completed

1. **Basic molecular property calculation** ✅
   - Molecular weight, LogP, QED, TPSA
   - Hydrogen-bond donor and acceptor counts, rotatable-bond count
   - Atom count, heavy-atom count

2. **Side-chain cleavage analysis** ✅
   - Uses the prepackaged `MacrolactoneFragmenter` class
   - Batch-processes all 307 molecules
   - Counts the fragment types and their frequencies at each position

3. **Distribution plotting** ✅
   - Follows the plotting logic of `test_align_two_molecules.ipynb`
   - 4x4 subplot layout showing the distributions for positions 3-16
   - Plots with seaborn and matplotlib

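Positions 3-16 give 14 panels, so a 4x4 grid leaves two cells empty. The sketch below shows one way to map a ring position to a grid cell. It assumes row-major order, which the notebook may or may not use, and the helper name `subplot_index` is hypothetical.

```python
def subplot_index(position, first_position=3, ncols=4):
    """Map a ring position to a (row, col) cell in a row-major grid.

    Assumption: panels are filled left to right, top to bottom,
    starting with `first_position` in the top-left cell.
    """
    offset = position - first_position
    if offset < 0:
        raise ValueError(f"position {position} is below {first_position}")
    return divmod(offset, ncols)


NROWS, NCOLS = 4, 4
layout = {position: subplot_index(position, ncols=NCOLS) for position in range(3, 17)}
unused_panels = NROWS * NCOLS - len(layout)

if __name__ == "__main__":
    for position, (row, col) in layout.items():
        print(f"position {position:2d} -> row {row}, col {col}")
    print("unused panels:", unused_panels)
```

With matplotlib, each `(row, col)` pair would index into the axes array returned by `plt.subplots(4, 4)`, and the two unused axes can be hidden with `ax.set_visible(False)`.
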
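### Suggestions for extended analysis

The last cell of the notebook (Section 9) gives detailed suggestions for extended analysis, including:

#### Priority 1 (strongly recommended) ⭐⭐⭐
- **LogP analysis**: identify the side-chain positions that contribute most to lipophilicity
- **QED analysis**: compare the side chains of high-QED and low-QED molecules
- **TPSA analysis**: analyze the distribution pattern of polar side chains

#### Priority 2 (important) ⭐⭐⭐
- **SAR analysis**: if activity data (max_pChEMBL) is available, analyze structure-activity relationships
- **Privileged side chains**: identify side chains that occur frequently in active molecules

#### Priority 3 (valuable) ⭐⭐
- **Fragment diversity analysis**: count the distinct fragment types at each position
- **Cluster analysis**: cluster molecules based on fragment fingerprints
- **Polarity/hydrophobicity analysis**: analyze the polarity of the side chains

#### Optional analyses ⭐
- **3D properties**: 3D descriptors such as PMI and NPR
- **Lipinski rules**: check the drug-likeness rules
- **Stereochemistry**: analyze chiral centers
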
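The fragment diversity analysis suggested above can be sketched with standard-library counters. This is an illustrative outline and not the notebook's code. It assumes each fragment is available as a `(cleavage_position, fragment_smiles)` pair and that the SMILES strings are already canonical, so identical fragments compare equal as text. The function name `fragment_diversity` is hypothetical.

```python
from collections import Counter, defaultdict


def fragment_diversity(records):
    """Count distinct fragments per position.

    records: iterable of (cleavage_position, fragment_smiles) pairs.
    Returns {position: {"unique", "total", "most_common"}} where
    "most_common" is a (smiles, count) tuple.
    """
    per_position = defaultdict(Counter)
    for position, smiles in records:
        per_position[position][smiles] += 1

    return {
        position: {
            "unique": len(counts),
            "total": sum(counts.values()),
            "most_common": counts.most_common(1)[0],
        }
        for position, counts in sorted(per_position.items())
    }


if __name__ == "__main__":
    # Made-up fragments for illustration only.
    demo = [(3, "C"), (3, "C"), (3, "CO"), (5, "CC")]
    for position, stats in fragment_diversity(demo).items():
        print(position, stats)
```

A position where `unique` is close to `total` carries highly varied side chains. A position dominated by one fragment is a natural place to look for conserved substituents.
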
## Code examples

The notebook contains complete code examples that can be run or modified directly. The main operations are:

```python
# 1. Compute molecular properties
props = calculate_properties(smiles)

# 2. Batch cleavage
fragmenter = MacrolactoneFragmenter(ring_size=16)
batch_results = fragmenter.process_csv(csv_file)

# 3. Statistical analysis
df_fragments = fragmenter.batch_to_dataframe(batch_results)
position_stats = df_fragments.groupby('cleavage_position').agg(...)

# 4. Plotting
sns.histplot(values, kde=True, ax=ax, bins=30)
```

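## Key insights

### Why LogP matters
- It reflects a molecule's lipophilicity, which is critical for membrane permeability and drug distribution
- Macrolactones usually have relatively high LogP values
- Understanding each side chain's contribution to LogP helps optimize drug design

### What QED tells us
- QED is a composite estimate of "drug-likeness"
- Macrolactones often violate Lipinski's rules (molecular weight > 500) yet can still be good drugs
- Comparing high-QED and low-QED molecules can reveal the side chains that most affect drug-likeness

### Why TPSA matters
- TPSA is closely related to oral bioavailability (values below 140 Å² are generally preferred)
- Polar side chains contribute substantially to TPSA
- It can guide side-chain modification strategies
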
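The thresholds mentioned above can be turned into a small check. The sketch below applies the standard Lipinski rule-of-five cutoffs (molecular weight 500, LogP 5, 5 hydrogen-bond donors, 10 acceptors) and the 140 Å² TPSA guideline to one row of `ring16_molecular_properties.csv`. The function names and the example values are hypothetical; the numbers are not taken from the dataset.

```python
def lipinski_violations(props):
    """Return the names of the rule-of-five criteria that a molecule violates.

    props: dict using the column names of ring16_molecular_properties.csv.
    """
    checks = {
        "mol_weight > 500": props["mol_weight"] > 500,
        "logP > 5": props["logP"] > 5,
        "num_h_donors > 5": props["num_h_donors"] > 5,
        "num_h_acceptors > 10": props["num_h_acceptors"] > 10,
    }
    return [name for name, violated in checks.items() if violated]


def tpsa_within_guideline(props, limit=140.0):
    """True when TPSA is below the commonly quoted oral-bioavailability limit."""
    return props["tpsa"] < limit


# Illustrative values for a large macrolide-like molecule (not from the dataset).
example = {
    "mol_weight": 733.9,
    "logP": 2.6,
    "num_h_donors": 5,
    "num_h_acceptors": 14,
    "tpsa": 193.9,
}

if __name__ == "__main__":
    print("Lipinski violations:", lipinski_violations(example))
    print("TPSA within guideline:", tpsa_within_guideline(example))
```

A molecule like this fails on molecular weight and acceptor count, which is why the rule of five alone is a weak filter for macrolactones.
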
## Notes

1. **Environment requirements**:
   - `seaborn` and `matplotlib` must be installed
   - If they are missing, the notebook prompts you to run `pixi add seaborn matplotlib`

2. **Processing time**:
   - Processing 307 molecules may take a few minutes
   - Drawing the distribution plots also takes some time

3. **Memory usage**:
   - Batch processing and plotting use a noticeable amount of memory
   - If you run into memory problems, reduce the `max_rows` parameter

4. **Image resolution**:
   - Figures are saved at 300 DPI by default
   - Adjust the `dpi` parameter as needed

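To show what capping the number of rows means in practice, here is a minimal standard-library sketch that reads only the first `max_rows` records of a CSV. It illustrates the idea only. How the notebook's own `max_rows` parameter is implemented is not shown in this README, and `read_first_rows` is a hypothetical helper.

```python
import csv
import io
from itertools import islice


def read_first_rows(csv_text, max_rows):
    """Return at most `max_rows` records from CSV text as a list of dicts.

    islice stops the reader early, so later rows are never parsed.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(islice(reader, max_rows))


if __name__ == "__main__":
    # Made-up rows for illustration only.
    text = "unique_id,smiles\nm1,C\nm2,CC\nm3,CCC\n"
    print(read_first_rows(text, max_rows=2))
```
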
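## Future work

Based on the analysis results, the suggested next steps are:

1. **Quantitative relationship between LogP and side chains**
   - Compute the change in LogP after removing each side chain
   - Identify the positions that contribute most to LogP

2. **Linking to activity data** (if available)
   - Analyze the side-chain features of highly active molecules
   - Identify "privileged side chains"

3. **Fragment library construction**
   - Collect the common fragments at each position
   - Use them to guide the design of new molecules

4. **Machine-learning prediction**
   - Predict molecular properties from fragment features
   - Build QSAR models
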
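One simple way to look for "privileged side chains" is to count how often each fragment occurs in active molecules relative to all molecules. The sketch below is a hypothetical outline under several assumptions: each molecule is a dict with an `activity` value (for example max_pChEMBL) and a list of fragment SMILES, the activity threshold is chosen by the user, and a plain frequency ratio stands in for a proper statistical enrichment test.

```python
from collections import Counter


def privileged_fragments(molecules, activity_threshold, min_count=2):
    """Rank fragments by the fraction of their occurrences in active molecules.

    molecules: list of dicts with keys "activity" and "fragments".
    Returns (fragment, n_active, active_fraction) tuples for fragments seen
    in at least `min_count` active molecules, best fraction first.
    """
    all_counts, active_counts = Counter(), Counter()
    for mol in molecules:
        fragments = set(mol["fragments"])  # count each fragment once per molecule
        all_counts.update(fragments)
        if mol["activity"] >= activity_threshold:
            active_counts.update(fragments)

    ranked = [
        (frag, active_counts[frag], active_counts[frag] / n_all)
        for frag, n_all in all_counts.items()
        if active_counts[frag] >= min_count
    ]
    ranked.sort(key=lambda item: (-item[2], -item[1], item[0]))
    return ranked


if __name__ == "__main__":
    # Made-up molecules for illustration only.
    demo = [
        {"activity": 7.5, "fragments": ["CO", "C"]},
        {"activity": 8.0, "fragments": ["CO", "CC"]},
        {"activity": 5.0, "fragments": ["C", "CC"]},
        {"activity": 4.0, "fragments": ["C"]},
    ]
    print(privileged_fragments(demo, activity_threshold=7.0))
```

With only 307 molecules, small counts make these ratios noisy, so the `min_count` filter and a follow-up significance test are worth keeping in mind.
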
## References

- `filter_molecules.ipynb` - molecule filtering and cleavage logic
- `test_align_two_molecules.ipynb` - reference for the plotting logic
- `src/macrolactone_fragmenter.py` - the packaged fragmenter class
- `src/ring_visualization.py` - visualization utilities

## Troubleshooting

If you run into problems:
1. Check that you are inside the `pixi shell` environment
2. Confirm that all dependencies are installed
3. Check that the output directory is writable
4. Check that the CSV file path is correct

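Several of the checks above can be automated before running the notebook. The sketch below is a hypothetical helper, not part of the repository. It checks that the input CSV exists, that the output directory is writable, and that the plotting packages can be imported. It does not try to detect whether the shell is a `pixi shell`.

```python
import importlib.util
import os
from pathlib import Path


def preflight(csv_path, output_dir, packages=("seaborn", "matplotlib")):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    if not Path(csv_path).is_file():
        problems.append(f"input CSV not found: {csv_path}")

    out = Path(output_dir)
    if not out.is_dir():
        problems.append(f"output directory does not exist: {output_dir}")
    elif not os.access(out, os.W_OK):
        problems.append(f"output directory is not writable: {output_dir}")

    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not installed: {name}")

    return problems


if __name__ == "__main__":
    # Paths follow the README; adjust them to your checkout.
    for problem in preflight("../output/ring16_match_smarts.csv", "../output"):
        print("WARNING:", problem)
```
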
376866
notebooks/SIME-MacroValidator.ipynb
Normal file
File diff suppressed because one or more lines are too long
7259
notebooks/analyze_ring12_20_molecules_CLEAN.ipynb
Normal file
File diff suppressed because one or more lines are too long
1385
notebooks/analyze_ring16_molecules.ipynb
Normal file
File diff suppressed because one or more lines are too long
1346
notebooks/demo_single_molecule.ipynb
Normal file
File diff suppressed because one or more lines are too long
93
notebooks/drug_screening_sandmeyer.ipynb
Normal file
@@ -0,0 +1,93 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
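"# SMARTS Screening of Drug Molecules: A Strategy for Replacing the Sandmeyer Reaction with the Zhang Xiaheng Reaction\n",
"\n",
"## Background\n",
"\n",
"This notebook screens a drug-molecule database for compounds whose synthesis could use the **Zhang Xiaheng (张夏恒) reaction** in place of the **Sandmeyer reaction**.\n",
"\n",
"### Key concepts\n",
"\n",
"**Sandmeyer reaction**: the traditional method for converting aromatic amines\n",
"- Reaction scheme: Ar-NH₂ → [Ar-N₂⁺] → Ar-X\n",
"- Products: aryl halides and related derivatives (X = Cl, Br, I, CN, OH, SCN, etc.)\n",
"\n",
"**Zhang Xiaheng reaction**: an emerging green reaction method\n",
"- Offers a more environmentally friendly synthetic route\n",
"- May replace the traditional Sandmeyer reaction\n",
"\n",
"### Screening strategy\n",
"\n",
"Based on the principle of **isomeric bioisosteric replacement**:\n",
"- If compound A (synthesized via the Sandmeyer reaction) is active\n",
"- then compound B (the same scaffold synthesized via the Zhang Xiaheng reaction) may have similar activity\n",
"\n",
"### Screening logic\n",
"\n",
"**Core assumption**: drugs containing an aromatic halogen may have been synthesized via the Sandmeyer reaction\n",
"\n",
"**Priority ranking**:\n",
"1. **Heteroaryl halides** (highest priority)\n",
"   - Chloropyridines, chloropyrimidines, etc.\n",
"   - These structures are more likely to be made via Sandmeyer or SNAr reactions\n",
"\n",
"2. **Ordinary aromatic halides** (high priority)\n",
"   - Any aromatic chlorine, bromine, or iodine\n",
"   - May come from a Sandmeyer reaction; requires literature verification\n",
"\n",
"### Three screening plans\n",
"\n",
"#### Plan A (most conservative): heteroaryl halide screening\n",
"- **SMARTS pattern**: `n:c-[Cl,Br,I]` or `n1c([Cl,Br,I])cccc1`\n",
"- **Advantage**: highest precision, low false-positive rate\n",
"- **Use case**: quickly finding the most likely candidate drugs\n",
"- **Expected result**: few candidates, but precise\n",
"\n",
"#### Plan B (balanced): screening all aromatic halides\n",
"- **SMARTS pattern**: `c[Cl,Br,I]`\n",
"- **Advantage**: broader coverage, balancing precision and breadth\n",
"- **Use case**: comprehensive screening of the drug library\n",
"- **Expected result**: a moderate number of candidates with a moderate false-positive rate\n",
"\n",
"#### Plan C (removed): simplified version\n",
"- Screens only for halogen-containing compounds\n",
"- Lower precision; deprecated\n",
"\n",
"---\n",
"\n",
"## File information\n",
"\n",
"- **Input file**: `/data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
"- **Output directory**: `/data/drug_targetmol/`\n",
"- **Output files**:\n",
"  - `candidates_planA_heteroaryl_halides.csv` (Plan A results)\n",
"  - `candidates_planB_all_aromatic_halides.csv` (Plan B results)"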
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
||||
5229
notebooks/filter_molecules.ipynb
Normal file
File diff suppressed because one or more lines are too long
1134
notebooks/mactch_test.ipynb
Normal file
File diff suppressed because one or more lines are too long
321
notebooks/rdkit_show.ipynb
Normal file
File diff suppressed because one or more lines are too long
754
notebooks/screen_aniline_candidates.ipynb
Normal file
@@ -0,0 +1,754 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
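"# Screening Aromatic Amine Drug Candidates - Analysis of Sandmeyer Reaction Starting Materials\n",
"\n",
"## Background\n",
"\n",
"### Sandmeyer reaction recap\n",
"The Sandmeyer reaction is the classic method for converting aromatic amines:\n",
"**Ar-NH₂ → [Ar-N₂]⁺ → Ar-X**\n",
"where X = Cl, Br, I, CN, OH, SCN, etc.\n",
"\n",
"### Screening goal\n",
"By identifying drug molecules that contain an aromatic amine (Ar-NH₂),\n",
"this notebook finds candidate drugs that could serve as Sandmeyer starting materials.\n",
"Such molecules may originally have had aromatic halogens introduced via the Sandmeyer reaction\n",
"and could now be converted more efficiently with the Zhang Xiaheng (张夏恒) reaction.\n",
"\n",
"### SMARTS pattern\n",
"The SMARTS pattern `[c,n][NH2]` matches:\n",
"- `[c,n]`: an aromatic carbon or aromatic nitrogen atom\n",
"- `[NH2]`: an amino group (-NH₂)\n",
"\n",
"**Important reminders:**\n",
"- This screen is based on molecular structure alone\n",
"- The synthetic route must ultimately be confirmed in the literature\n",
"- Not every drug containing an aromatic amine is made with the Sandmeyer reaction"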
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Import required libraries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from pathlib import Path\n",
|
||||
"from rdkit import Chem\n",
|
||||
"from rdkit.Chem import PandasTools, Draw\n",
|
||||
"from rdkit.Chem.Draw import rdMolDraw2D\n",
|
||||
"from IPython.display import SVG, display\n",
|
||||
"from rdkit.Chem import AllChem\n",
|
||||
"import pandas as pd\n",
|
||||
"import warnings\n",
|
||||
"warnings.filterwarnings('ignore')\n",
|
||||
"\n",
|
||||
"# Set display options\n",
|
||||
"pd.set_option('display.max_columns', None)\n",
|
||||
"pd.set_option('display.max_colwidth', 100)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
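"## Define the screening pattern and visualization function\n",
"\n",
"### SMARTS pattern setup\n",
"- **Target pattern**: `[c,n][NH2]` - an amino group attached to an aromatic carbon or nitrogen atom\n",
"- **Matching logic**: find every molecule that contains this substructure"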
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Using SMARTS pattern: [c,n][NH2]\n",
"Pattern valid: ✓\n",
"\n",
"Created directory: ../data/drug_targetmol/aniline_candidates\n",
"Created visualization directory: ../data/drug_targetmol/aniline_candidates/visualizations\n"
]
}
],
"source": [
"# Define the screening pattern\n",
"TARGET_SMARTS = '[c,n][NH2]'\n",
"pattern = Chem.MolFromSmarts(TARGET_SMARTS)\n",
"\n",
"if pattern is None:\n",
"    raise ValueError(f\"Invalid SMARTS pattern: {TARGET_SMARTS}\")\n",
"\n",
"print(f\"Using SMARTS pattern: {TARGET_SMARTS}\")\n",
"print(f\"Pattern valid: {'✓' if pattern else '✗'}\")\n",
"\n",
"# Create the output directories\n",
"output_base = Path(\"../data/drug_targetmol\")\n",
"output_dir = output_base / \"aniline_candidates\"\n",
"visualization_dir = output_dir / \"visualizations\"\n",
"\n",
"output_dir.mkdir(parents=True, exist_ok=True)\n",
"visualization_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"\\nCreated directory: {output_dir}\")\n",
"print(f\"Created visualization directory: {visualization_dir}\")\n",
"\n",
"def generate_highlighted_svg(mol, highlight_atoms, filename, title=\"\"):\n",
"    \"\"\"Generate a high-resolution SVG with the matched substructure highlighted.\"\"\"\n",
"    # Compute 2D coordinates\n",
"    AllChem.Compute2DCoords(mol)\n",
"\n",
"    # Create the SVG drawer\n",
"    drawer = rdMolDraw2D.MolDraw2DSVG(1200, 900)  # larger canvas for better clarity\n",
"    drawer.SetFontSize(12)\n",
"\n",
"    # Drawing options\n",
"    draw_options = drawer.drawOptions()\n",
"    draw_options.addAtomIndices = False  # hide atom indices to keep the image clean\n",
"    draw_options.addBondIndices = False\n",
"    draw_options.addStereoAnnotation = True\n",
"    draw_options.fixedFontSize = 12\n",
"\n",
"    # Highlight the matched atoms (blue)\n",
"    atom_colors = {}\n",
"    for atom_idx in highlight_atoms:\n",
"        atom_colors[atom_idx] = (0.3, 0.3, 1.0)  # blue highlight\n",
"\n",
"    # Draw the molecule\n",
"    drawer.DrawMolecule(mol,\n",
"                        highlightAtoms=highlight_atoms,\n",
"                        highlightAtomColors=atom_colors)\n",
"\n",
"    drawer.FinishDrawing()\n",
"    svg_content = drawer.GetDrawingText()\n",
"\n",
"    # Add the title\n",
"    if title:\n",
"        # Split on real newlines so the title can be inserted as its own line\n",
"        svg_lines = svg_content.split(\"\\n\")\n",
"        # Insert the title before the first <g> tag that has a transform\n",
"        for i, line in enumerate(svg_lines):\n",
"            if \"<g \" in line and \"transform\" in line:\n",
"                svg_lines.insert(i, f\"<text x=\\\"50%\\\" y=\\\"30\\\" text-anchor=\\\"middle\\\" font-size=\\\"16\\\" font-weight=\\\"bold\\\">{title}</text>\")\n",
"                break\n",
"        svg_with_title = \"\\n\".join(svg_lines)\n",
"    else:\n",
"        svg_with_title = svg_content\n",
"\n",
"    # Save the file\n",
"    with open(filename, \"w\") as f:\n",
"        f.write(svg_with_title)\n",
"\n",
"    return svg_content"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Data loading and molecule screening\n",
"\n",
"### Data source\n",
"- File location: `data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
"- Contains drug molecule structures along with rich property annotations\n",
"\n",
"### Screening logic\n",
"1. Read the SDF file\n",
"2. Run a SMARTS match on each molecule\n",
"3. Record the matched atoms and the number of matches\n",
"4. Save the matches to CSV\n",
"5. Generate highlighted visualizations"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"正在读取SDF文件...\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[21:24:23] Both bonds on one end of an atropisomer are on the same side - atoms is : 3\n",
|
||||
"[21:24:23] Explicit valence for atom # 2 N greater than permitted\n",
|
||||
"[21:24:23] ERROR: Could not sanitize molecule ending on line 217340\n",
|
||||
"[21:24:23] ERROR: Explicit valence for atom # 2 N greater than permitted\n",
|
||||
"[21:24:24] Explicit valence for atom # 4 N greater than permitted\n",
|
||||
"[21:24:24] ERROR: Could not sanitize molecule ending on line 317283\n",
|
||||
"[21:24:24] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
|
||||
"[21:24:24] Explicit valence for atom # 4 N greater than permitted\n",
|
||||
"[21:24:24] ERROR: Could not sanitize molecule ending on line 324666\n",
|
||||
"[21:24:24] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
|
||||
"[21:24:24] Explicit valence for atom # 5 N greater than permitted\n",
|
||||
"[21:24:24] ERROR: Could not sanitize molecule ending on line 365883\n",
|
||||
"[21:24:24] ERROR: Explicit valence for atom # 5 N greater than permitted\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"成功加载 3276 个分子\n",
|
||||
"\n",
|
||||
"数据概览:\n",
|
||||
" Index Plate Row Col ID Name \\\n",
|
||||
"0 1 L1010-1 a 2 Dexamethasone \n",
|
||||
"1 2 L1010-1 a 3 Danicopan \n",
|
||||
"2 3 L1010-1 a 4 Cyclosporin A \n",
|
||||
"3 4 L1010-1 a 5 L-Carnitine \n",
|
||||
"4 5 L1010-1 a 6 Trimetazidine dihydrochloride \n",
|
||||
"\n",
|
||||
" Synonyms CAS \\\n",
|
||||
"0 MK 125;Prednisolone F;NSC 34521;Hexadecadrol 50-02-2 \n",
|
||||
"1 ACH-4471 1903768-17-1 \n",
|
||||
"2 Cyclosporine A;Ciclosporin;Cyclosporine 59865-13-3 \n",
|
||||
"3 L(-)-Carnitine;Levocarnitine 541-15-1 \n",
|
||||
"4 Yoshimilon;Kyurinett;Vastarel F 13171-25-0 \n",
|
||||
"\n",
|
||||
" SMILES \\\n",
|
||||
"0 C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO \n",
|
||||
"1 CC(=O)c1nn(CC(=O)N2C[C@H](F)C[C@H]2C(=O)Nc2cccc(Br)n2)c2ccc(cc12)-c1cnc(C)nc1 \n",
|
||||
"2 [C@H]([C@@H](C/C=C/C)C)(O)[C@@]1(N(C)C(=O)[C@H]([C@@H](C)C)N(C)C(=O)[C@H](CC(C)C)N(C)C(=O)[C@H](... \n",
|
||||
"3 C[N+](C)(C)C[C@@H](O)CC([O-])=O \n",
|
||||
"4 Cl.Cl.COC1=C(OC)C(OC)=C(CN2CCNCC2)C=C1 \n",
|
||||
"\n",
|
||||
" Formula MolWt Approved status \\\n",
|
||||
"0 C22H29FO5 392.46 NMPA;EMA;FDA \n",
|
||||
"1 C26H23BrFN7O3 580.41 FDA \n",
|
||||
"2 C62H111N11O12 1202.61 FDA \n",
|
||||
"3 C7H15NO3 161.2 FDA \n",
|
||||
"4 C14H24Cl2N2O3 339.258 NMPA;EMA \n",
|
||||
"\n",
|
||||
" Pharmacopoeia \\\n",
|
||||
"0 USP39-NF34;BP2015;JP16;IP2010 \n",
|
||||
"1 NaN \n",
|
||||
"2 Martindale the Extra Pharmacopoei, EP10.2, USP43-NF38, Ph.Int_6th, JP17 \n",
|
||||
"3 NaN \n",
|
||||
"4 BP2019;KP Ⅹ;EP9.2;IP2010;JP17;Martindale:The Extra Pharmacopoeia \n",
|
||||
"\n",
|
||||
" Disease \\\n",
|
||||
"0 Metabolism \n",
|
||||
"1 Others \n",
|
||||
"2 Immune system \n",
|
||||
"3 Cardiovascular system \n",
|
||||
"4 Cardiovascular system \n",
|
||||
"\n",
|
||||
" Pathways \\\n",
|
||||
"0 Antibody-drug Conjugate/ADC Related;Autophagy;Endocrinology/Hormones;Immunology/Inflammation;Mic... \n",
|
||||
"1 Immunology/Inflammation \n",
|
||||
"2 Immunology/Inflammation;Metabolism;Microbiology/Virology \n",
|
||||
"3 Metabolism \n",
|
||||
"4 Autophagy;Metabolism \n",
|
||||
"\n",
|
||||
" Target \\\n",
|
||||
"0 Antibacterial;Antibiotic;Autophagy;Complement System;Glucocorticoid Receptor;IL Receptor;Mitopha... \n",
|
||||
"1 Complement System \n",
|
||||
"2 Phosphatase;Antibiotic;Complement System \n",
|
||||
"3 Endogenous Metabolite;Fatty Acid Synthase \n",
|
||||
"4 Autophagy;Fatty Acid Synthase \n",
|
||||
"\n",
|
||||
" Receptor \\\n",
|
||||
"0 Antibiotic; Autophagy; Bacterial; Complement System; Glucocorticoid Receptor; IL receptor; Mitop... \n",
|
||||
"1 Complement System; factor D \n",
|
||||
"2 Antibiotic; calcineurin phosphatase; Complement System; Phosphatase \n",
|
||||
"3 Endogenous Metabolite; FAS \n",
|
||||
"4 Autophagy; mitochondrial long-chain 3-ketoacyl thiolase \n",
|
||||
"\n",
|
||||
" Bioactivity \\\n",
|
||||
"0 Dexamethasone is a glucocorticoid receptor agonist and IL receptor modulator with anti-inflammat... \n",
|
||||
"1 Danicopan (ACH-4471) (ACH-4471) is a selective, orally active small molecule factor D inhibitor ... \n",
|
||||
"2 Cyclosporin A is a natural product and an active fungal metabolite, classified as a cyclic polyp... \n",
|
||||
"3 L-Carnitine (L(-)-Carnitine) is an amino acid derivative. L-Carnitine facilitates long-chain fat... \n",
|
||||
"4 Trimetazidine dihydrochloride (Vastarel F) can improve myocardial glucose utilization by inhibit... \n",
|
||||
"\n",
|
||||
" Reference \\\n",
|
||||
"0 Li M, Yu H. Identification of WP1066, an inhibitor of JAK2 and STAT3, as a Kv1. 3 potassium chan... \n",
|
||||
"1 Yuan X, et al. Small-molecule factor D inhibitors selectively block the alternative pathway of c... \n",
|
||||
"2 D'Angelo G, et al. Cyclosporin A prevents the hypoxic adaptation by activating hypoxia-inducible... \n",
|
||||
"3 Jogl G, Tong L. Cell. 2003 Jan 10; 112(1):113-22. \n",
|
||||
"4 Yang Q, et al. Int J Clin Exp Pathol. 2015, 8(4):3735-3741.;Liu Z, et al. Metabolism. 2016, 65(3... \n",
|
||||
"\n",
|
||||
" ROMol \n",
|
||||
"0 <rdkit.Chem.rdchem.Mol object at 0x77530d73c820> \n",
|
||||
"1 <rdkit.Chem.rdchem.Mol object at 0x77530d73c890> \n",
|
||||
"2 <rdkit.Chem.rdchem.Mol object at 0x77530a3f6f10> \n",
|
||||
"3 <rdkit.Chem.rdchem.Mol object at 0x77530a3f70d0> \n",
|
||||
"4 <rdkit.Chem.rdchem.Mol object at 0x77530a3f7140> \n",
|
||||
"\n",
|
||||
"列名:['Index', 'Plate', 'Row', 'Col', 'ID', 'Name', 'Synonyms', 'CAS', 'SMILES', 'Formula', 'MolWt', 'Approved status', 'Pharmacopoeia', 'Disease', 'Pathways', 'Target', 'Receptor', 'Bioactivity', 'Reference', 'ROMol']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Read the SDF file\n",
"sdf_path = '../data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf'\n",
"\n",
"print(\"正在读取SDF文件...\")\n",
"df = PandasTools.LoadSDF(sdf_path)\n",
"print(f\"成功加载 {len(df)} 个分子\")\n",
"\n",
"# Show basic information about the data\n",
"print(\"\\n数据概览:\")\n",
"print(df.head())\n",
"print(f\"\\n列名:{list(df.columns)}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Screening for aromatic amine structures...\n",
"SMARTS pattern: [c,n][N&H2]\n",
"Found 262 matching molecules (processed 3276 molecules)\n",
"\n",
"Screening result summary:\n",
" Name CAS Formula total_matches\n",
"17 Guanosine 118-00-3 C10H13N5O5 1\n",
"20 Ganciclovir 82410-32-0 C9H13N5O4 1\n",
"22 Imiquimod maleate 896106-16-4 C18H20N4O4 1\n",
"27 Brincidofovir 444805-28-1 C27H52N3O7P 1\n",
"28 Imiquimod 99011-02-6 C14H16N4 1\n",
"32 Ganciclovir sodium 107910-75-8 C9H13N5NaO4 1\n",
"33 Cytarabine 147-94-4 C9H13N3O5 1\n",
"35 Vidarabine 5536-17-4 C10H13N5O4 1\n",
"38 Penciclovir 39809-25-1 C10H15N5O3 1\n",
"41 Famciclovir 104227-87-4 C14H19N5O4 1\n",
"... and 252 more molecules\n"
]
}
],
"source": [
"def screen_molecules_for_aniline(df, smarts_pattern, max_molecules=100):\n",
"    \"\"\"\n",
"    Screen for molecules that contain an aromatic amine substructure.\n",
"\n",
"    Args:\n",
"        df: DataFrame containing the molecules\n",
"        smarts_pattern: RDKit SMARTS pattern object\n",
"        max_molecules: maximum number of molecules to process\n",
"\n",
"    Returns:\n",
"        DataFrame of screening results\n",
"    \"\"\"\n",
"    print(\"Screening for aromatic amine structures...\")\n",
"    print(f\"SMARTS pattern: {Chem.MolToSmarts(smarts_pattern)}\")\n",
"\n",
"    matched_molecules = []\n",
"    processed_count = 0\n",
"\n",
"    for idx, row in df.iterrows():\n",
"        if processed_count >= max_molecules:\n",
"            break\n",
"\n",
"        mol = row['ROMol']\n",
"        if mol is None:\n",
"            continue\n",
"\n",
"        processed_count += 1\n",
"\n",
"        # Check whether the molecule matches the SMARTS pattern\n",
"        if mol.HasSubstructMatch(smarts_pattern):\n",
"            matches = mol.GetSubstructMatches(smarts_pattern)\n",
"\n",
"            # Collect every matched atom index\n",
"            matched_atoms = set()\n",
"            for match in matches:\n",
"                matched_atoms.update(match)\n",
"\n",
"            # Build the match record\n",
"            match_record = row.copy()\n",
"            match_record['matched_atoms'] = list(matched_atoms)\n",
"            match_record['total_matches'] = len(matches)\n",
"            match_record['smarts_pattern'] = Chem.MolToSmarts(smarts_pattern)\n",
"            matched_molecules.append(match_record)\n",
"\n",
"    result_df = pd.DataFrame(matched_molecules)\n",
"    print(f\"Found {len(result_df)} matching molecules (processed {processed_count} molecules)\")\n",
"\n",
"    return result_df\n",
"\n",
"# Run the screen\n",
"matched_df = screen_molecules_for_aniline(df, pattern, max_molecules=1000000)\n",
"\n",
"# Show a summary of the results\n",
"if len(matched_df) > 0:\n",
"    print(\"\\nScreening result summary:\")\n",
"    summary_cols = ['Name', 'CAS', 'Formula', 'total_matches']\n",
"    if len(matched_df) <= 10:\n",
"        print(matched_df[summary_cols])\n",
"    else:\n",
"        print(matched_df[summary_cols].head(10))\n",
"        print(f\"... and {len(matched_df) - 10} more molecules\")\n",
"else:\n",
"    print(\"\\nNo matching molecules found\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
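"## Save the screening results\n",
"\n",
"### Output files\n",
"1. **CSV file**: property information and match details for every matched molecule\n",
"2. **SVG images**: a structure drawing of each matched molecule with the aromatic amine highlighted"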
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"CSV结果已保存到:../data/drug_targetmol/aniline_candidates/aniline_candidates.csv\n",
|
||||
"包含 262 个分子,23 个属性列\n",
|
||||
"\n",
|
||||
"开始生成可视化图片(最多500个)...\n",
|
||||
"已生成 10 个分子图片\n",
|
||||
"已生成 20 个分子图片\n",
|
||||
"已生成 30 个分子图片\n",
|
||||
"已生成 40 个分子图片\n",
|
||||
"已生成 50 个分子图片\n",
|
||||
"已生成 60 个分子图片\n",
|
||||
"已生成 70 个分子图片\n",
|
||||
"已生成 80 个分子图片\n",
|
||||
"已生成 90 个分子图片\n",
|
||||
"已生成 100 个分子图片\n",
|
||||
"已生成 110 个分子图片\n",
|
||||
"已生成 120 个分子图片\n",
|
||||
"已生成 130 个分子图片\n",
|
||||
"已生成 140 个分子图片\n",
|
||||
"已生成 150 个分子图片\n",
|
||||
"已生成 160 个分子图片\n",
|
||||
"已生成 170 个分子图片\n",
|
||||
"已生成 180 个分子图片\n",
|
||||
"已生成 190 个分子图片\n",
|
||||
"已生成 200 个分子图片\n",
|
||||
"已生成 210 个分子图片\n",
|
||||
"已生成 220 个分子图片\n",
|
||||
"已生成 230 个分子图片\n",
|
||||
"已生成 240 个分子图片\n",
|
||||
"已生成 250 个分子图片\n",
|
||||
"已生成 260 个分子图片\n",
|
||||
"完成!共生成 262 个可视化图片\n",
|
||||
"\n",
|
||||
"示例图片: 118-00-3_Guanosine.svg\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"image/svg+xml": [
|
||||
"<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:rdkit=\"http://www.rdkit.org/xml\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" version=\"1.1\" baseProfile=\"full\" xml:space=\"preserve\" width=\"1200px\" height=\"900px\" viewBox=\"0 0 1200 900\">\n",
|
||||
"<!-- END OF HEADER -->\n",
|
||||
"<rect style=\"opacity:1.0;fill:#FFFFFF;stroke:none\" width=\"1200.0\" height=\"900.0\" x=\"0.0\" y=\"0.0\"> </rect>\n",
|
||||
"<path class=\"bond-0 atom-0 atom-1\" d=\"M 912.0,197.7 L 940.1,201.0 L 924.8,332.9 L 896.6,329.6 Z\" style=\"fill:#4C4CFF;fill-rule:evenodd;fill-opacity:1;stroke:#4C4CFF;stroke-width:0.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
|
||||
"<ellipse cx=\"932.9\" cy=\"201.5\" rx=\"26.6\" ry=\"26.6\" class=\"atom-0\" style=\"fill:#4C4CFF;fill-rule:evenodd;stroke:#4C4CFF;stroke-width:1.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<ellipse cx=\"910.7\" cy=\"331.2\" rx=\"26.6\" ry=\"26.6\" class=\"atom-1\" style=\"fill:#4C4CFF;fill-rule:evenodd;stroke:#4C4CFF;stroke-width:1.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-0 atom-0 atom-1\" d=\"M 925.1,208.0 L 910.7,331.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 910.7,331.2 L 853.5,355.9\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 853.5,355.9 L 796.4,380.6\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 908.0,354.1 L 856.2,376.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 856.2,376.5 L 804.3,398.9\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-2 atom-2 atom-3\" d=\"M 787.8,392.5 L 780.6,454.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-2 atom-2 atom-3\" d=\"M 780.6,454.1 L 773.4,515.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-3 atom-3 atom-4\" d=\"M 773.4,515.8 L 879.9,595.0\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-3 atom-3 atom-4\" d=\"M 794.5,506.6 L 882.6,572.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-4 atom-4 atom-5\" d=\"M 879.9,595.0 L 860.1,653.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-4 atom-4 atom-5\" d=\"M 860.1,653.6 L 840.4,712.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 829.8,720.7 L 767.3,720.0\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 767.3,720.0 L 704.7,719.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 830.1,700.8 L 774.7,700.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 774.7,700.2 L 719.4,699.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-6 atom-6 atom-7\" d=\"M 704.7,719.3 L 686.2,660.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-6 atom-6 atom-7\" d=\"M 686.2,660.3 L 667.8,601.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-7 atom-3 atom-7\" d=\"M 773.4,515.8 L 723.0,551.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-7 atom-3 atom-7\" d=\"M 723.0,551.5 L 672.7,587.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-8 atom-7 atom-8\" d=\"M 657.5,590.0 L 598.4,570.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-8 atom-7 atom-8\" d=\"M 598.4,570.1 L 539.3,550.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-9 atom-8 atom-9\" d=\"M 539.3,550.1 L 489.3,585.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-9 atom-8 atom-9\" d=\"M 489.3,585.6 L 439.2,621.0\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-10 atom-9 atom-10\" d=\"M 422.7,620.8 L 373.6,584.2\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-10 atom-9 atom-10\" d=\"M 373.6,584.2 L 324.4,547.7\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-11 atom-10 atom-11\" d=\"M 324.4,547.7 L 197.7,587.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-12 atom-11 atom-12\" d=\"M 197.7,587.2 L 153.0,546.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-12 atom-11 atom-12\" d=\"M 153.0,546.1 L 108.3,504.9\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-13 atom-10 atom-13\" d=\"M 324.4,547.7 L 366.9,421.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-14 atom-13 atom-14\" d=\"M 366.9,421.8 L 331.6,372.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-14 atom-13 atom-14\" d=\"M 331.6,372.1 L 296.3,322.3\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-15 atom-13 atom-15\" d=\"M 366.9,421.8 L 499.7,423.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-16 atom-8 atom-15\" d=\"M 539.3,550.1 L 499.7,423.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-17 atom-15 atom-16\" d=\"M 499.7,423.4 L 536.0,374.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-17 atom-15 atom-16\" d=\"M 536.0,374.5 L 572.4,325.6\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-18 atom-4 atom-17\" d=\"M 879.9,595.0 L 1001.8,542.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 991.3,547.0 L 1042.7,585.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 1042.7,585.2 L 1094.1,623.5\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 1003.2,531.0 L 1054.6,569.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 1054.6,569.3 L 1106.0,607.5\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-20 atom-17 atom-19\" d=\"M 1001.8,542.4 L 1009.0,480.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-20 atom-17 atom-19\" d=\"M 1009.0,480.8 L 1016.2,419.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-21 atom-1 atom-19\" d=\"M 910.7,331.2 L 960.2,368.0\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-21 atom-1 atom-19\" d=\"M 960.2,368.0 L 1009.6,404.9\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path d=\"M 707.8,719.4 L 704.7,719.3 L 703.7,716.4\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
|
||||
"<path d=\"M 204.0,585.3 L 197.7,587.2 L 195.5,585.2\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
|
||||
"<path d=\"M 995.7,545.0 L 1001.8,542.4 L 1002.2,539.3\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
|
||||
"<path class=\"atom-0\" d=\"M 924.2 195.1 L 927.0 199.6 Q 927.3 200.0, 927.7 200.8 Q 928.1 201.7, 928.2 201.7 L 928.2 195.1 L 929.3 195.1 L 929.3 203.6 L 928.1 203.6 L 925.1 198.7 Q 924.8 198.1, 924.4 197.4 Q 924.1 196.8, 924.0 196.6 L 924.0 203.6 L 922.9 203.6 L 922.9 195.1 L 924.2 195.1 \" fill=\"#000000\"/>\n",
|
||||
"<path class=\"atom-0\" d=\"M 930.9 195.1 L 932.1 195.1 L 932.1 198.7 L 936.4 198.7 L 936.4 195.1 L 937.6 195.1 L 937.6 203.6 L 936.4 203.6 L 936.4 199.7 L 932.1 199.7 L 932.1 203.6 L 930.9 203.6 L 930.9 195.1 \" fill=\"#000000\"/>\n",
|
||||
"<path class=\"atom-0\" d=\"M 939.2 203.3 Q 939.4 202.8, 939.9 202.5 Q 940.4 202.2, 941.1 202.2 Q 942.0 202.2, 942.4 202.6 Q 942.9 203.1, 942.9 203.9 Q 942.9 204.7, 942.3 205.5 Q 941.7 206.3, 940.4 207.2 L 943.0 207.2 L 943.0 207.8 L 939.2 207.8 L 939.2 207.3 Q 940.3 206.6, 940.9 206.0 Q 941.5 205.5, 941.8 205.0 Q 942.1 204.5, 942.1 203.9 Q 942.1 203.4, 941.8 203.1 Q 941.6 202.8, 941.1 202.8 Q 940.7 202.8, 940.4 203.0 Q 940.0 203.2, 939.8 203.6 L 939.2 203.3 \" fill=\"#000000\"/>\n",
|
||||
"<path class=\"atom-2\" d=\"M 786.9 379.6 L 789.7 384.1 Q 790.0 384.6, 790.4 385.4 Q 790.8 386.2, 790.9 386.2 L 790.9 379.6 L 792.0 379.6 L 792.0 388.1 L 790.8 388.1 L 787.8 383.2 Q 787.5 382.6, 787.1 382.0 Q 786.8 381.3, 786.7 381.1 L 786.7 388.1 L 785.6 388.1 L 785.6 379.6 L 786.9 379.6 \" fill=\"#0000FF\"/>\n",
|
||||
"<path class=\"atom-5\" d=\"M 835.6 716.6 L 838.4 721.1 Q 838.6 721.5, 839.1 722.3 Q 839.5 723.1, 839.5 723.2 L 839.5 716.6 L 840.7 716.6 L 840.7 725.1 L 839.5 725.1 L 836.5 720.2 Q 836.2 719.6, 835.8 718.9 Q 835.4 718.3, 835.3 718.1 L 835.3 725.1 L 834.2 725.1 L 834.2 716.6 L 835.6 716.6 \" fill=\"#0000FF\"/>\n",
|
||||
"<path class=\"atom-7\" d=\"M 663.2 588.3 L 666.0 592.8 Q 666.3 593.3, 666.7 594.1 Q 667.2 594.9, 667.2 594.9 L 667.2 588.3 L 668.3 588.3 L 668.3 596.8 L 667.1 596.8 L 664.2 591.9 Q 663.8 591.3, 663.4 590.7 Q 663.1 590.0, 663.0 589.8 L 663.0 596.8 L 661.9 596.8 L 661.9 588.3 L 663.2 588.3 \" fill=\"#0000FF\"/>\n",
|
||||
"<path class=\"atom-9\" d=\"M 427.1 626.9 Q 427.1 624.9, 428.1 623.8 Q 429.1 622.6, 431.0 622.6 Q 432.8 622.6, 433.9 623.8 Q 434.9 624.9, 434.9 626.9 Q 434.9 629.0, 433.8 630.2 Q 432.8 631.3, 431.0 631.3 Q 429.1 631.3, 428.1 630.2 Q 427.1 629.0, 427.1 626.9 M 431.0 630.4 Q 432.3 630.4, 433.0 629.5 Q 433.7 628.6, 433.7 626.9 Q 433.7 625.3, 433.0 624.4 Q 432.3 623.6, 431.0 623.6 Q 429.7 623.6, 429.0 624.4 Q 428.3 625.3, 428.3 626.9 Q 428.3 628.7, 429.0 629.5 Q 429.7 630.4, 431.0 630.4 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-12\" d=\"M 87.7 493.1 L 88.9 493.1 L 88.9 496.7 L 93.2 496.7 L 93.2 493.1 L 94.4 493.1 L 94.4 501.6 L 93.2 501.6 L 93.2 497.6 L 88.9 497.6 L 88.9 501.6 L 87.7 501.6 L 87.7 493.1 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-12\" d=\"M 96.1 497.3 Q 96.1 495.3, 97.1 494.1 Q 98.1 493.0, 100.0 493.0 Q 101.9 493.0, 102.9 494.1 Q 103.9 495.3, 103.9 497.3 Q 103.9 499.4, 102.9 500.5 Q 101.9 501.7, 100.0 501.7 Q 98.2 501.7, 97.1 500.5 Q 96.1 499.4, 96.1 497.3 M 100.0 500.7 Q 101.3 500.7, 102.0 499.9 Q 102.7 499.0, 102.7 497.3 Q 102.7 495.6, 102.0 494.8 Q 101.3 493.9, 100.0 493.9 Q 98.7 493.9, 98.0 494.8 Q 97.3 495.6, 97.3 497.3 Q 97.3 499.0, 98.0 499.9 Q 98.7 500.7, 100.0 500.7 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-14\" d=\"M 277.8 309.3 L 278.9 309.3 L 278.9 312.9 L 283.3 312.9 L 283.3 309.3 L 284.4 309.3 L 284.4 317.8 L 283.3 317.8 L 283.3 313.9 L 278.9 313.9 L 278.9 317.8 L 277.8 317.8 L 277.8 309.3 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-14\" d=\"M 286.2 313.6 Q 286.2 311.5, 287.2 310.4 Q 288.2 309.2, 290.1 309.2 Q 292.0 309.2, 293.0 310.4 Q 294.0 311.5, 294.0 313.6 Q 294.0 315.6, 293.0 316.8 Q 291.9 318.0, 290.1 318.0 Q 288.2 318.0, 287.2 316.8 Q 286.2 315.6, 286.2 313.6 M 290.1 317.0 Q 291.4 317.0, 292.1 316.1 Q 292.8 315.3, 292.8 313.6 Q 292.8 311.9, 292.1 311.0 Q 291.4 310.2, 290.1 310.2 Q 288.8 310.2, 288.1 311.0 Q 287.4 311.9, 287.4 313.6 Q 287.4 315.3, 288.1 316.1 Q 288.8 317.0, 290.1 317.0 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-16\" d=\"M 575.1 316.9 Q 575.1 314.8, 576.1 313.7 Q 577.1 312.5, 579.0 312.5 Q 580.8 312.5, 581.8 313.7 Q 582.9 314.8, 582.9 316.9 Q 582.9 318.9, 581.8 320.1 Q 580.8 321.3, 579.0 321.3 Q 577.1 321.3, 576.1 320.1 Q 575.1 318.9, 575.1 316.9 M 579.0 320.3 Q 580.2 320.3, 580.9 319.4 Q 581.7 318.6, 581.7 316.9 Q 581.7 315.2, 580.9 314.3 Q 580.2 313.5, 579.0 313.5 Q 577.7 313.5, 576.9 314.3 Q 576.3 315.2, 576.3 316.9 Q 576.3 318.6, 576.9 319.4 Q 577.7 320.3, 579.0 320.3 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-16\" d=\"M 584.2 312.6 L 585.3 312.6 L 585.3 316.2 L 589.7 316.2 L 589.7 312.6 L 590.8 312.6 L 590.8 321.1 L 589.7 321.1 L 589.7 317.2 L 585.3 317.2 L 585.3 321.1 L 584.2 321.1 L 584.2 312.6 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-18\" d=\"M 1104.5 621.7 Q 1104.5 619.7, 1105.5 618.5 Q 1106.5 617.4, 1108.4 617.4 Q 1110.2 617.4, 1111.3 618.5 Q 1112.3 619.7, 1112.3 621.7 Q 1112.3 623.8, 1111.2 624.9 Q 1110.2 626.1, 1108.4 626.1 Q 1106.5 626.1, 1105.5 624.9 Q 1104.5 623.8, 1104.5 621.7 M 1108.4 625.1 Q 1109.7 625.1, 1110.4 624.3 Q 1111.1 623.4, 1111.1 621.7 Q 1111.1 620.0, 1110.4 619.2 Q 1109.7 618.3, 1108.4 618.3 Q 1107.1 618.3, 1106.4 619.2 Q 1105.7 620.0, 1105.7 621.7 Q 1105.7 623.4, 1106.4 624.3 Q 1107.1 625.1, 1108.4 625.1 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-19\" d=\"M 1015.3 406.3 L 1018.1 410.8 Q 1018.4 411.2, 1018.8 412.0 Q 1019.3 412.8, 1019.3 412.9 L 1019.3 406.3 L 1020.4 406.3 L 1020.4 414.8 L 1019.3 414.8 L 1016.3 409.8 Q 1015.9 409.3, 1015.6 408.6 Q 1015.2 407.9, 1015.1 407.7 L 1015.1 414.8 L 1014.0 414.8 L 1014.0 406.3 L 1015.3 406.3 \" fill=\"#0000FF\"/>\n",
|
||||
"<path class=\"atom-19\" d=\"M 1022.1 406.3 L 1023.2 406.3 L 1023.2 409.9 L 1027.6 409.9 L 1027.6 406.3 L 1028.7 406.3 L 1028.7 414.8 L 1027.6 414.8 L 1027.6 410.8 L 1023.2 410.8 L 1023.2 414.8 L 1022.1 414.8 L 1022.1 406.3 \" fill=\"#0000FF\"/>\n",
|
||||
"</svg>"
|
||||
],
|
||||
"text/plain": [
|
||||
"<IPython.core.display.SVG object>"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"source": [
"def save_aniline_screening_results(df, output_dir, visualization_dir, max_visualizations=500):\n",
"    \"\"\"Save the aromatic amine screening results.\"\"\"\n",
"    \n",
"    # Save the CSV file\n",
"    csv_path = output_dir / \"aniline_candidates.csv\"\n",
"    \n",
"    # Convert the ROMol column to SMILES (ROMol objects cannot be written to CSV)\n",
"    df_export = df.copy()\n",
"    if 'ROMol' in df_export.columns:\n",
"        df_export['SMILES_from_mol'] = df_export['ROMol'].apply(lambda x: Chem.MolToSmiles(x) if x else '')\n",
"        df_export = df_export.drop('ROMol', axis=1)\n",
"    \n",
"    df_export.to_csv(csv_path, index=False, encoding='utf-8')\n",
"    print(f\"CSV results saved to: {csv_path}\")\n",
"    print(f\"Contains {len(df_export)} molecules and {len(df_export.columns)} property columns\")\n",
"    \n",
"    # Generate the visualization images\n",
"    print(f\"\\nGenerating visualization images (at most {max_visualizations})...\")\n",
"    generated_count = 0\n",
"    \n",
"    for idx, row in df.iterrows():\n",
"        if generated_count >= max_visualizations:\n",
"            print(f\"Reached the visualization limit ({max_visualizations}); stopping\")\n",
"            break\n",
"        \n",
"        cas = str(row.get('CAS', 'unknown')).strip()\n",
"        name = str(row.get('Name', 'unknown')).strip()\n",
"        \n",
"        # Sanitize the file name parts (drop special characters)\n",
"        safe_name = \"\".join(c for c in name if c.isalnum() or c in (' ', '-', '_')).rstrip()\n",
"        safe_cas = \"\".join(c for c in cas if c.isalnum() or c in ('-',)).rstrip()\n",
"        \n",
"        # Skip invalid identifiers\n",
"        if not safe_cas or safe_cas == 'nan' or safe_cas == 'unknown':\n",
"            continue\n",
"        \n",
"        mol = row.get('ROMol')\n",
"        if mol is None:\n",
"            continue\n",
"        \n",
"        matched_atoms = row.get('matched_atoms', [])\n",
"        if not matched_atoms:\n",
"            continue\n",
"        \n",
"        # Build the file name and title\n",
"        filename = visualization_dir / f\"{safe_cas}_{safe_name.replace(' ', '_')}.svg\"\n",
"        title = f\"{name} ({cas}) - aromatic amine structure\"\n",
"        \n",
"        try:\n",
"            # Generate the SVG\n",
"            svg_content = generate_highlighted_svg(mol, matched_atoms, filename, title)\n",
"            generated_count += 1\n",
"            \n",
"            # Report progress every 10 molecules\n",
"            if generated_count % 10 == 0:\n",
"                print(f\"Generated {generated_count} molecule images\")\n",
"        \n",
"        except Exception as e:\n",
"            print(f\"Failed to generate {safe_cas}: {e}\")\n",
"            continue\n",
"    \n",
"    print(f\"Done! Generated {generated_count} visualization images\")\n",
"    return csv_path, generated_count\n",
"\n",
"# Save the results\n",
"if len(matched_df) > 0:\n",
"    csv_path, viz_count = save_aniline_screening_results(\n",
"        matched_df, output_dir, visualization_dir, max_visualizations=500\n",
"    )\n",
"    \n",
"    # Display the first generated image as an example\n",
"    if viz_count > 0:\n",
"        example_files = list(visualization_dir.glob(\"*.svg\"))\n",
"        if example_files:\n",
"            example_file = example_files[0]\n",
"            print(f\"\\nExample image: {example_file.name}\")\n",
"            with open(example_file, \"r\") as f:\n",
"                svg_content = f.read()\n",
"            display(SVG(svg_content))\n",
"else:\n",
"    print(\"No matches; nothing to save\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"## Result Statistics and Analysis\n",
"\n",
"### Screening statistics\n",
"- Total number of molecules\n",
"- Number of matched molecules\n",
"- Number of visualization files\n",
"- Output file locations"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
"=== Aromatic amine screening statistics ===\n",
"Total molecules: 3276\n",
"Matched molecules: 262\n",
"Match rate: 8.00%\n",
"\n",
"Output directory: ../data/drug_targetmol/aniline_candidates\n",
"CSV file: ../data/drug_targetmol/aniline_candidates/aniline_candidates.csv\n",
"Visualization directory: ../data/drug_targetmol/aniline_candidates/visualizations\n",
"Number of SVG files: 262\n",
"\n",
"Molecules with the most matches:\n",
"                                       Name          CAS  total_matches\n",
"432                  Proflavine Hemisulfate    1811-28-5              4\n",
"1064                            Triamterene     396-01-0              3\n",
"335   Pemetrexed disodium hemipenta hydrate  357166-30-4              2\n",
"463                             Lamotrigine   84057-84-1              2\n",
"779                           Pyrimethamine      58-14-0              2\n"
]
}
],
"source": [
"# Result statistics\n",
"print(\"=== Aromatic amine screening statistics ===\")\n",
"print(f\"Total molecules: {len(df)}\")\n",
"print(f\"Matched molecules: {len(matched_df)}\")\n",
"print(f\"Match rate: {len(matched_df)/len(df)*100:.2f}%\")\n",
"print(f\"\\nOutput directory: {output_dir}\")\n",
"print(f\"CSV file: {output_dir}/aniline_candidates.csv\")\n",
"print(f\"Visualization directory: {visualization_dir}\")\n",
"print(f\"Number of SVG files: {len(list(visualization_dir.glob('*.svg')))}\")\n",
"\n",
"# Show the molecules with the most matches\n",
"if len(matched_df) > 0:\n",
"    print(\"\\nMolecules with the most matches:\")\n",
"    top_matches = matched_df.nlargest(5, 'total_matches')[['Name', 'CAS', 'total_matches']]\n",
"    print(top_matches)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
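"## Usage Notes\n",
"\n",
"### Interpreting the screening results\n",
"- **Matched molecules**: drugs that contain an aromatic amine group (Ar-NH₂)\n",
"- **Blue highlight**: atoms matched by the SMARTS pattern (aromatic carbon/nitrogen + amino group)\n",
"- **Multiple matches**: a molecule may contain more than one aromatic amine group\n",
"\n",
"### Suggested follow-up analysis\n",
"1. **Verify the synthetic route**: consult the synthesis literature for each matched molecule\n",
"2. **Confirm the Sandmeyer reaction**: check whether a Sandmeyer reaction is actually used to introduce the halogen\n",
"3. **Assess the Zhang Xiaheng reaction**: evaluate whether it can replace the Sandmeyer reaction\n",
"4. **Process optimization potential**: analyze the economic benefit of switching to the Zhang Xiaheng reaction\n",
"\n",
"### Output files\n",
"- **CSV file**: complete molecular properties and match information\n",
"- **SVG files**: structure drawings with the aromatic amine group highlighted in blue\n",
"- **Naming rule**: {CAS}_{Name}.svg (special characters removed)"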
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"## Antibiotic Screening Results\n",
|
||||
"\n",
|
||||
"/home/zly/project/macro_split/data/drug_targetmol/aniline_candidates/antibiotics_identified.csv"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.14.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
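The notebook above names each drawing `{CAS}_{Name}.svg` after stripping special characters and skips rows whose CAS number is missing. A minimal standalone sketch of that rule follows, using only the standard library. The helper name `build_svg_filename` is hypothetical (the notebook inlines this logic inside `save_aniline_screening_results`), and the sample inputs are taken from the statistics output.

```python
from typing import Optional


def build_svg_filename(cas, name) -> Optional[str]:
    """Return '{CAS}_{Name}.svg' with unsafe characters removed.

    Hypothetical helper mirroring the notebook's inline logic; returns None
    when the CAS number is empty, 'nan', or 'unknown'.
    """
    cas = str(cas).strip()
    name = str(name).strip()

    # Keep letters, digits, spaces, hyphens and underscores in the name.
    safe_name = "".join(c for c in name if c.isalnum() or c in (" ", "-", "_")).rstrip()
    # Keep only letters, digits and hyphens in the CAS number.
    safe_cas = "".join(c for c in cas if c.isalnum() or c == "-").rstrip()

    # Rows without a usable CAS number are skipped by the notebook.
    if not safe_cas or safe_cas in ("nan", "unknown"):
        return None

    # Spaces in the name become underscores in the final file name.
    return f"{safe_cas}_{safe_name.replace(' ', '_')}.svg"


print(build_svg_filename("1811-28-5", "Proflavine Hemisulfate"))
# prints: 1811-28-5_Proflavine_Hemisulfate.svg
print(build_svg_filename("nan", "Unknown compound"))
# prints: None
```

Because the file name is derived from the CAS number and the cleaned name only, two rows that share both would overwrite each other's SVG; the sketch does not attempt to handle that case.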
|
||||
774
notebooks/screen_aniline_candidates_executed.ipynb
Normal file
@@ -0,0 +1,774 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
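"# Screening Aromatic Amine Drug Candidates - Sandmeyer Reaction Starting Material Analysis\n",
"\n",
"## Background\n",
"\n",
"### Sandmeyer reaction recap\n",
"The Sandmeyer reaction is the classic method for converting aromatic amines:\n",
"**Ar-NH₂ → [Ar-N₂]⁺ → Ar-X**\n",
"where X = Cl, Br, I, CN, OH, SCN, etc.\n",
"\n",
"### Screening goal\n",
"Identify drug molecules that contain an aromatic amine group (Ar-NH₂)\n",
"and flag them as candidate starting materials for the Sandmeyer reaction.\n",
"These molecules may originally have had their aryl halide introduced via a Sandmeyer reaction\n",
"and could now be converted more efficiently with the Zhang Xiaheng reaction.\n",
"\n",
"### SMARTS pattern\n",
"The SMARTS pattern `[c,n][NH2]` matches:\n",
"- `[c,n]`: an aromatic carbon or nitrogen atom\n",
"- `[NH2]`: an amino group (-NH₂)\n",
"\n",
"**Important notes:**\n",
"- This screen is based on structural features only\n",
"- The synthetic route must still be confirmed from the literature\n",
"- Not every drug containing an aromatic amine is made via a Sandmeyer reaction"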
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"## Import Required Libraries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2025-11-11T13:21:31.660096Z",
|
||||
"iopub.status.busy": "2025-11-11T13:21:31.657369Z",
|
||||
"iopub.status.idle": "2025-11-11T13:21:32.943162Z",
|
||||
"shell.execute_reply": "2025-11-11T13:21:32.938881Z"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from pathlib import Path\n",
|
||||
"from rdkit import Chem\n",
|
||||
"from rdkit.Chem import PandasTools, Draw\n",
|
||||
"from rdkit.Chem.Draw import rdMolDraw2D\n",
|
||||
"from IPython.display import SVG, display\n",
|
||||
"from rdkit.Chem import AllChem\n",
|
||||
"import pandas as pd\n",
|
||||
"import warnings\n",
|
||||
"warnings.filterwarnings('ignore')\n",
|
||||
"\n",
"# Set display options\n",
|
||||
"pd.set_option('display.max_columns', None)\n",
|
||||
"pd.set_option('display.max_colwidth', 100)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
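"## Define the Screening Pattern and Visualization Function\n",
"\n",
"### SMARTS pattern setup\n",
"- **Target pattern**: `[c,n][NH2]` - an amino group attached to an aromatic carbon/nitrogen atom\n",
"- **Matching logic**: find every molecule that contains this substructure"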
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2025-11-11T13:21:32.959832Z",
|
||||
"iopub.status.busy": "2025-11-11T13:21:32.957734Z",
|
||||
"iopub.status.idle": "2025-11-11T13:21:32.987085Z",
|
||||
"shell.execute_reply": "2025-11-11T13:21:32.980584Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
"Using SMARTS pattern: [c,n][NH2]\n",
"Pattern validation: ✓\n",
"\n",
"Created directory: ../data/drug_targetmol/aniline_candidates\n",
"Created visualization directory: ../data/drug_targetmol/aniline_candidates/visualizations\n"
]
}
],
"source": [
"# Define the screening pattern\n",
"TARGET_SMARTS = '[c,n][NH2]'\n",
"pattern = Chem.MolFromSmarts(TARGET_SMARTS)\n",
"\n",
"if pattern is None:\n",
"    raise ValueError(f\"Invalid SMARTS pattern: {TARGET_SMARTS}\")\n",
"\n",
"print(f\"Using SMARTS pattern: {TARGET_SMARTS}\")\n",
"print(f\"Pattern validation: {'✓' if pattern else '✗'}\")\n",
"\n",
"# Create the output directories (including any missing parents)\n",
"output_base = Path(\"../data/drug_targetmol\")\n",
"output_dir = output_base / \"aniline_candidates\"\n",
"visualization_dir = output_dir / \"visualizations\"\n",
"\n",
"output_dir.mkdir(parents=True, exist_ok=True)\n",
"visualization_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print(f\"\\nCreated directory: {output_dir}\")\n",
"print(f\"Created visualization directory: {visualization_dir}\")\n",
"\n",
"def generate_highlighted_svg(mol, highlight_atoms, filename, title=\"\"):\n",
"    \"\"\"Generate a high-resolution SVG with the matched substructure highlighted.\"\"\"\n",
"    # Compute 2D coordinates\n",
"    AllChem.Compute2DCoords(mol)\n",
"    \n",
"    # Create the SVG drawer\n",
"    drawer = rdMolDraw2D.MolDraw2DSVG(1200, 900)  # larger canvas for a sharper image\n",
"    drawer.SetFontSize(12)\n",
"    \n",
"    # Drawing options\n",
"    draw_options = drawer.drawOptions()\n",
"    draw_options.addAtomIndices = False  # hide atom indices to keep the drawing clean\n",
"    draw_options.addBondIndices = False\n",
"    draw_options.addStereoAnnotation = True\n",
"    draw_options.fixedFontSize = 12\n",
"    \n",
"    # Highlight the matched atoms (blue)\n",
"    atom_colors = {}\n",
"    for atom_idx in highlight_atoms:\n",
"        atom_colors[atom_idx] = (0.3, 0.3, 1.0)  # blue highlight\n",
"    \n",
"    # Draw the molecule\n",
"    drawer.DrawMolecule(mol, \n",
"                        highlightAtoms=highlight_atoms,\n",
"                        highlightAtomColors=atom_colors)\n",
"    \n",
"    drawer.FinishDrawing()\n",
"    svg_content = drawer.GetDrawingText()\n",
"    \n",
"    # Add the title\n",
"    if title:\n",
"        # Split on real newlines (the original split on a literal backslash-n, which never matched)\n",
"        svg_lines = svg_content.split(\"\\n\")\n",
"        # Insert the title before the first <g> tag that carries a transform\n",
"        for i, line in enumerate(svg_lines):\n",
"            if \"<g \" in line and \"transform\" in line:\n",
"                svg_lines.insert(i, f\"<text x=\\\"50%\\\" y=\\\"30\\\" text-anchor=\\\"middle\\\" font-size=\\\"16\\\" font-weight=\\\"bold\\\">{title}</text>\")\n",
"                break\n",
"        svg_with_title = \"\\n\".join(svg_lines)\n",
"    else:\n",
"        svg_with_title = svg_content\n",
"    \n",
"    # Save the file\n",
"    with open(filename, \"w\") as f:\n",
"        f.write(svg_with_title)\n",
"    \n",
"    return svg_content"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"## Data Loading and Molecule Screening\n",
"\n",
"### Data source\n",
"- File location: `data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
"- Contains drug structures along with rich property annotations\n",
"\n",
"### Screening logic\n",
"1. Read the SDF file\n",
"2. Run a SMARTS match on each molecule\n",
"3. Record the matched atoms and the number of matches\n",
"4. Save the matches to CSV\n",
"5. Generate highlighted visualization images"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2025-11-11T13:21:33.114695Z",
|
||||
"iopub.status.busy": "2025-11-11T13:21:33.113063Z",
|
||||
"iopub.status.idle": "2025-11-11T13:21:35.754026Z",
|
||||
"shell.execute_reply": "2025-11-11T13:21:35.745369Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
"Reading the SDF file...\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[21:21:34] Both bonds on one end of an atropisomer are on the same side - atoms is : 3\n",
|
||||
"[21:21:34] Explicit valence for atom # 2 N greater than permitted\n",
|
||||
"[21:21:34] ERROR: Could not sanitize molecule ending on line 217340\n",
|
||||
"[21:21:34] ERROR: Explicit valence for atom # 2 N greater than permitted\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[21:21:35] Explicit valence for atom # 4 N greater than permitted\n",
|
||||
"[21:21:35] ERROR: Could not sanitize molecule ending on line 317283\n",
|
||||
"[21:21:35] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
|
||||
"[21:21:35] Explicit valence for atom # 4 N greater than permitted\n",
|
||||
"[21:21:35] ERROR: Could not sanitize molecule ending on line 324666\n",
|
||||
"[21:21:35] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
|
||||
"[21:21:35] Explicit valence for atom # 5 N greater than permitted\n",
|
||||
"[21:21:35] ERROR: Could not sanitize molecule ending on line 365883\n",
|
||||
"[21:21:35] ERROR: Explicit valence for atom # 5 N greater than permitted\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
"Successfully loaded 3276 molecules\n",
"\n",
"Data overview:\n",
|
||||
" Index Plate Row Col ID Name \\\n",
|
||||
"0 1 L1010-1 a 2 Dexamethasone \n",
|
||||
"1 2 L1010-1 a 3 Danicopan \n",
|
||||
"2 3 L1010-1 a 4 Cyclosporin A \n",
|
||||
"3 4 L1010-1 a 5 L-Carnitine \n",
|
||||
"4 5 L1010-1 a 6 Trimetazidine dihydrochloride \n",
|
||||
"\n",
|
||||
" Synonyms CAS \\\n",
|
||||
"0 MK 125;Prednisolone F;NSC 34521;Hexadecadrol 50-02-2 \n",
|
||||
"1 ACH-4471 1903768-17-1 \n",
|
||||
"2 Cyclosporine A;Ciclosporin;Cyclosporine 59865-13-3 \n",
|
||||
"3 L(-)-Carnitine;Levocarnitine 541-15-1 \n",
|
||||
"4 Yoshimilon;Kyurinett;Vastarel F 13171-25-0 \n",
|
||||
"\n",
|
||||
" SMILES \\\n",
|
||||
"0 C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO \n",
|
||||
"1 CC(=O)c1nn(CC(=O)N2C[C@H](F)C[C@H]2C(=O)Nc2cccc(Br)n2)c2ccc(cc12)-c1cnc(C)nc1 \n",
|
||||
"2 [C@H]([C@@H](C/C=C/C)C)(O)[C@@]1(N(C)C(=O)[C@H]([C@@H](C)C)N(C)C(=O)[C@H](CC(C)C)N(C)C(=O)[C@H](... \n",
|
||||
"3 C[N+](C)(C)C[C@@H](O)CC([O-])=O \n",
|
||||
"4 Cl.Cl.COC1=C(OC)C(OC)=C(CN2CCNCC2)C=C1 \n",
|
||||
"\n",
|
||||
" Formula MolWt Approved status \\\n",
|
||||
"0 C22H29FO5 392.46 NMPA;EMA;FDA \n",
|
||||
"1 C26H23BrFN7O3 580.41 FDA \n",
|
||||
"2 C62H111N11O12 1202.61 FDA \n",
|
||||
"3 C7H15NO3 161.2 FDA \n",
|
||||
"4 C14H24Cl2N2O3 339.258 NMPA;EMA \n",
|
||||
"\n",
|
||||
" Pharmacopoeia \\\n",
|
||||
"0 USP39-NF34;BP2015;JP16;IP2010 \n",
|
||||
"1 NaN \n",
|
||||
"2 Martindale the Extra Pharmacopoei, EP10.2, USP43-NF38, Ph.Int_6th, JP17 \n",
|
||||
"3 NaN \n",
|
||||
"4 BP2019;KP Ⅹ;EP9.2;IP2010;JP17;Martindale:The Extra Pharmacopoeia \n",
|
||||
"\n",
|
||||
" Disease \\\n",
|
||||
"0 Metabolism \n",
|
||||
"1 Others \n",
|
||||
"2 Immune system \n",
|
||||
"3 Cardiovascular system \n",
|
||||
"4 Cardiovascular system \n",
|
||||
"\n",
|
||||
" Pathways \\\n",
|
||||
"0 Antibody-drug Conjugate/ADC Related;Autophagy;Endocrinology/Hormones;Immunology/Inflammation;Mic... \n",
|
||||
"1 Immunology/Inflammation \n",
|
||||
"2 Immunology/Inflammation;Metabolism;Microbiology/Virology \n",
|
||||
"3 Metabolism \n",
|
||||
"4 Autophagy;Metabolism \n",
|
||||
"\n",
|
||||
" Target \\\n",
|
||||
"0 Antibacterial;Antibiotic;Autophagy;Complement System;Glucocorticoid Receptor;IL Receptor;Mitopha... \n",
|
||||
"1 Complement System \n",
|
||||
"2 Phosphatase;Antibiotic;Complement System \n",
|
||||
"3 Endogenous Metabolite;Fatty Acid Synthase \n",
|
||||
"4 Autophagy;Fatty Acid Synthase \n",
|
||||
"\n",
|
||||
" Receptor \\\n",
|
||||
"0 Antibiotic; Autophagy; Bacterial; Complement System; Glucocorticoid Receptor; IL receptor; Mitop... \n",
|
||||
"1 Complement System; factor D \n",
|
||||
"2 Antibiotic; calcineurin phosphatase; Complement System; Phosphatase \n",
|
||||
"3 Endogenous Metabolite; FAS \n",
|
||||
"4 Autophagy; mitochondrial long-chain 3-ketoacyl thiolase \n",
|
||||
"\n",
|
||||
" Bioactivity \\\n",
|
||||
"0 Dexamethasone is a glucocorticoid receptor agonist and IL receptor modulator with anti-inflammat... \n",
|
||||
"1 Danicopan (ACH-4471) (ACH-4471) is a selective, orally active small molecule factor D inhibitor ... \n",
|
||||
"2 Cyclosporin A is a natural product and an active fungal metabolite, classified as a cyclic polyp... \n",
|
||||
"3 L-Carnitine (L(-)-Carnitine) is an amino acid derivative. L-Carnitine facilitates long-chain fat... \n",
|
||||
"4 Trimetazidine dihydrochloride (Vastarel F) can improve myocardial glucose utilization by inhibit... \n",
|
||||
"\n",
|
||||
" Reference \\\n",
|
||||
"0 Li M, Yu H. Identification of WP1066, an inhibitor of JAK2 and STAT3, as a Kv1. 3 potassium chan... \n",
|
||||
"1 Yuan X, et al. Small-molecule factor D inhibitors selectively block the alternative pathway of c... \n",
|
||||
"2 D'Angelo G, et al. Cyclosporin A prevents the hypoxic adaptation by activating hypoxia-inducible... \n",
|
||||
"3 Jogl G, Tong L. Cell. 2003 Jan 10; 112(1):113-22. \n",
|
||||
"4 Yang Q, et al. Int J Clin Exp Pathol. 2015, 8(4):3735-3741.;Liu Z, et al. Metabolism. 2016, 65(3... \n",
|
||||
"\n",
|
||||
" ROMol \n",
|
||||
"0 <rdkit.Chem.rdchem.Mol object at 0x774684c557e0> \n",
|
||||
"1 <rdkit.Chem.rdchem.Mol object at 0x7746818ffdf0> \n",
|
||||
"2 <rdkit.Chem.rdchem.Mol object at 0x7746818ffd80> \n",
|
||||
"3 <rdkit.Chem.rdchem.Mol object at 0x7746816e0040> \n",
|
||||
"4 <rdkit.Chem.rdchem.Mol object at 0x7746816e00b0> \n",
|
||||
"\n",
"Column names: ['Index', 'Plate', 'Row', 'Col', 'ID', 'Name', 'Synonyms', 'CAS', 'SMILES', 'Formula', 'MolWt', 'Approved status', 'Pharmacopoeia', 'Disease', 'Pathways', 'Target', 'Receptor', 'Bioactivity', 'Reference', 'ROMol']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
"# Read the SDF file\n",
"sdf_path = '../data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf'\n",
"\n",
"print(\"Reading the SDF file...\")\n",
"df = PandasTools.LoadSDF(sdf_path)\n",
"print(f\"Successfully loaded {len(df)} molecules\")\n",
"\n",
"# Show basic information about the data\n",
"print(\"\\nData overview:\")\n",
"print(df.head())\n",
"print(f\"\\nColumn names: {list(df.columns)}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2025-11-11T13:21:35.770585Z",
|
||||
"iopub.status.busy": "2025-11-11T13:21:35.768752Z",
|
||||
"iopub.status.idle": "2025-11-11T13:21:36.114723Z",
|
||||
"shell.execute_reply": "2025-11-11T13:21:36.111467Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"开始筛选芳香胺结构...\n",
|
||||
"SMARTS模式: [c,n][N&H2]\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"找到 78 个匹配分子(处理了 1000 个分子)\n",
|
||||
"\n",
|
||||
"筛选结果摘要:\n",
|
||||
" Name CAS Formula total_matches\n",
|
||||
"17 Guanosine 118-00-3 C10H13N5O5 1\n",
|
||||
"20 Ganciclovir 82410-32-0 C9H13N5O4 1\n",
|
||||
"22 Imiquimod maleate 896106-16-4 C18H20N4O4 1\n",
|
||||
"27 Brincidofovir 444805-28-1 C27H52N3O7P 1\n",
|
||||
"28 Imiquimod 99011-02-6 C14H16N4 1\n",
|
||||
"32 Ganciclovir sodium 107910-75-8 C9H13N5NaO4 1\n",
|
||||
"33 Cytarabine 147-94-4 C9H13N3O5 1\n",
|
||||
"35 Vidarabine 5536-17-4 C10H13N5O4 1\n",
|
||||
"38 Penciclovir 39809-25-1 C10H15N5O3 1\n",
|
||||
"41 Famciclovir 104227-87-4 C14H19N5O4 1\n",
|
||||
"... 还有 68 个分子\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
"def screen_molecules_for_aniline(df, smarts_pattern, max_molecules=100):\n",
"    \"\"\"\n",
"    Screen for molecules that contain an aromatic amine substructure.\n",
"\n",
"    Args:\n",
"        df: DataFrame containing the molecules (in the 'ROMol' column)\n",
"        smarts_pattern: RDKit SMARTS pattern object\n",
"        max_molecules: maximum number of molecules to process\n",
"\n",
"    Returns:\n",
"        DataFrame of the matching molecules\n",
"    \"\"\"\n",
"    print(\"Screening for aromatic amine substructures...\")\n",
"    print(f\"SMARTS pattern: {Chem.MolToSmarts(smarts_pattern)}\")\n",
"\n",
"    matched_molecules = []\n",
"    processed_count = 0\n",
"\n",
"    for idx, row in df.iterrows():\n",
"        if processed_count >= max_molecules:\n",
"            break\n",
"\n",
"        mol = row['ROMol']\n",
"        if mol is None:\n",
"            continue\n",
"\n",
"        processed_count += 1\n",
"\n",
"        # Check whether the molecule matches the SMARTS pattern\n",
"        if mol.HasSubstructMatch(smarts_pattern):\n",
"            matches = mol.GetSubstructMatches(smarts_pattern)\n",
"\n",
"            # Collect every matched atom index\n",
"            matched_atoms = set()\n",
"            for match in matches:\n",
"                matched_atoms.update(match)\n",
"\n",
"            # Build the match record\n",
"            match_record = row.copy()\n",
"            match_record['matched_atoms'] = list(matched_atoms)\n",
"            match_record['total_matches'] = len(matches)\n",
"            match_record['smarts_pattern'] = Chem.MolToSmarts(smarts_pattern)\n",
"            matched_molecules.append(match_record)\n",
"\n",
"    result_df = pd.DataFrame(matched_molecules)\n",
"    print(f\"Found {len(result_df)} matching molecules (processed {processed_count} molecules)\")\n",
"\n",
"    return result_df\n",
"\n",
"# Run the screen\n",
"matched_df = screen_molecules_for_aniline(df, pattern, max_molecules=1000)\n",
"\n",
"# Show a summary of the results\n",
"if len(matched_df) > 0:\n",
"    print(\"\\nScreening summary:\")\n",
"    summary_cols = ['Name', 'CAS', 'Formula', 'total_matches']\n",
"    if len(matched_df) <= 10:\n",
"        print(matched_df[summary_cols])\n",
"    else:\n",
"        print(matched_df[summary_cols].head(10))\n",
"        print(f\"... and {len(matched_df) - 10} more molecules\")\n",
"else:\n",
"    print(\"\\nNo matching molecules found\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
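"## Save the screening results\n",
"\n",
"### Output files\n",
"1. **CSV file**: attribute information and match details for every matching molecule\n",
"2. **SVG images**: a structure visualization of each matching molecule, with the aromatic amine highlighted"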
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2025-11-11T13:21:36.120981Z",
|
||||
"iopub.status.busy": "2025-11-11T13:21:36.120553Z",
|
||||
"iopub.status.idle": "2025-11-11T13:21:36.279125Z",
|
||||
"shell.execute_reply": "2025-11-11T13:21:36.277892Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
"CSV results saved to: ../data/drug_targetmol/aniline_candidates/aniline_candidates.csv\n",
"Contains 78 molecules, 23 attribute columns\n",
"\n",
"Generating visualizations (at most 50)...\n",
"Generated 10 molecule images\n",
"Generated 20 molecule images\n",
"Generated 30 molecule images\n",
"Generated 40 molecule images\n",
"Generated 50 molecule images\n",
"Reached the visualization limit (50); stopping\n",
"Done. Generated 50 visualizations in total\n",
"\n",
"Example image: 118-00-3_Guanosine.svg\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"image/svg+xml": [
|
||||
"<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:rdkit=\"http://www.rdkit.org/xml\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" version=\"1.1\" baseProfile=\"full\" xml:space=\"preserve\" width=\"1200px\" height=\"900px\" viewBox=\"0 0 1200 900\">\n",
|
||||
"<!-- END OF HEADER -->\n",
|
||||
"<rect style=\"opacity:1.0;fill:#FFFFFF;stroke:none\" width=\"1200.0\" height=\"900.0\" x=\"0.0\" y=\"0.0\"> </rect>\n",
|
||||
"<path class=\"bond-0 atom-0 atom-1\" d=\"M 912.0,197.7 L 940.1,201.0 L 924.8,332.9 L 896.6,329.6 Z\" style=\"fill:#4C4CFF;fill-rule:evenodd;fill-opacity:1;stroke:#4C4CFF;stroke-width:0.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
|
||||
"<ellipse cx=\"932.9\" cy=\"201.5\" rx=\"26.6\" ry=\"26.6\" class=\"atom-0\" style=\"fill:#4C4CFF;fill-rule:evenodd;stroke:#4C4CFF;stroke-width:1.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<ellipse cx=\"910.7\" cy=\"331.2\" rx=\"26.6\" ry=\"26.6\" class=\"atom-1\" style=\"fill:#4C4CFF;fill-rule:evenodd;stroke:#4C4CFF;stroke-width:1.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-0 atom-0 atom-1\" d=\"M 925.1,208.0 L 910.7,331.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 910.7,331.2 L 853.5,355.9\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 853.5,355.9 L 796.4,380.6\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 908.0,354.1 L 856.2,376.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-1 atom-1 atom-2\" d=\"M 856.2,376.5 L 804.3,398.9\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-2 atom-2 atom-3\" d=\"M 787.8,392.5 L 780.6,454.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-2 atom-2 atom-3\" d=\"M 780.6,454.1 L 773.4,515.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-3 atom-3 atom-4\" d=\"M 773.4,515.8 L 879.9,595.0\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-3 atom-3 atom-4\" d=\"M 794.5,506.6 L 882.6,572.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-4 atom-4 atom-5\" d=\"M 879.9,595.0 L 860.1,653.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-4 atom-4 atom-5\" d=\"M 860.1,653.6 L 840.4,712.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 829.8,720.7 L 767.3,720.0\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 767.3,720.0 L 704.7,719.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 830.1,700.8 L 774.7,700.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-5 atom-5 atom-6\" d=\"M 774.7,700.2 L 719.4,699.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-6 atom-6 atom-7\" d=\"M 704.7,719.3 L 686.2,660.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-6 atom-6 atom-7\" d=\"M 686.2,660.3 L 667.8,601.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-7 atom-3 atom-7\" d=\"M 773.4,515.8 L 723.0,551.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-7 atom-3 atom-7\" d=\"M 723.0,551.5 L 672.7,587.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-8 atom-7 atom-8\" d=\"M 657.5,590.0 L 598.4,570.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-8 atom-7 atom-8\" d=\"M 598.4,570.1 L 539.3,550.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-9 atom-8 atom-9\" d=\"M 539.3,550.1 L 489.3,585.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-9 atom-8 atom-9\" d=\"M 489.3,585.6 L 439.2,621.0\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-10 atom-9 atom-10\" d=\"M 422.7,620.8 L 373.6,584.2\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-10 atom-9 atom-10\" d=\"M 373.6,584.2 L 324.4,547.7\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-11 atom-10 atom-11\" d=\"M 324.4,547.7 L 197.7,587.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-12 atom-11 atom-12\" d=\"M 197.7,587.2 L 153.0,546.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-12 atom-11 atom-12\" d=\"M 153.0,546.1 L 108.3,504.9\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-13 atom-10 atom-13\" d=\"M 324.4,547.7 L 366.9,421.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-14 atom-13 atom-14\" d=\"M 366.9,421.8 L 331.6,372.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-14 atom-13 atom-14\" d=\"M 331.6,372.1 L 296.3,322.3\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-15 atom-13 atom-15\" d=\"M 366.9,421.8 L 499.7,423.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-16 atom-8 atom-15\" d=\"M 539.3,550.1 L 499.7,423.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-17 atom-15 atom-16\" d=\"M 499.7,423.4 L 536.0,374.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-17 atom-15 atom-16\" d=\"M 536.0,374.5 L 572.4,325.6\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-18 atom-4 atom-17\" d=\"M 879.9,595.0 L 1001.8,542.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 991.3,547.0 L 1042.7,585.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 1042.7,585.2 L 1094.1,623.5\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 1003.2,531.0 L 1054.6,569.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-19 atom-17 atom-18\" d=\"M 1054.6,569.3 L 1106.0,607.5\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-20 atom-17 atom-19\" d=\"M 1001.8,542.4 L 1009.0,480.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-20 atom-17 atom-19\" d=\"M 1009.0,480.8 L 1016.2,419.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-21 atom-1 atom-19\" d=\"M 910.7,331.2 L 960.2,368.0\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path class=\"bond-21 atom-1 atom-19\" d=\"M 960.2,368.0 L 1009.6,404.9\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
|
||||
"<path d=\"M 707.8,719.4 L 704.7,719.3 L 703.7,716.4\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
|
||||
"<path d=\"M 204.0,585.3 L 197.7,587.2 L 195.5,585.2\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
|
||||
"<path d=\"M 995.7,545.0 L 1001.8,542.4 L 1002.2,539.3\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
|
||||
"<path class=\"atom-0\" d=\"M 924.2 195.1 L 927.0 199.6 Q 927.3 200.0, 927.7 200.8 Q 928.1 201.7, 928.2 201.7 L 928.2 195.1 L 929.3 195.1 L 929.3 203.6 L 928.1 203.6 L 925.1 198.7 Q 924.8 198.1, 924.4 197.4 Q 924.1 196.8, 924.0 196.6 L 924.0 203.6 L 922.9 203.6 L 922.9 195.1 L 924.2 195.1 \" fill=\"#000000\"/>\n",
|
||||
"<path class=\"atom-0\" d=\"M 930.9 195.1 L 932.1 195.1 L 932.1 198.7 L 936.4 198.7 L 936.4 195.1 L 937.6 195.1 L 937.6 203.6 L 936.4 203.6 L 936.4 199.7 L 932.1 199.7 L 932.1 203.6 L 930.9 203.6 L 930.9 195.1 \" fill=\"#000000\"/>\n",
|
||||
"<path class=\"atom-0\" d=\"M 939.2 203.3 Q 939.4 202.8, 939.9 202.5 Q 940.4 202.2, 941.1 202.2 Q 942.0 202.2, 942.4 202.6 Q 942.9 203.1, 942.9 203.9 Q 942.9 204.7, 942.3 205.5 Q 941.7 206.3, 940.4 207.2 L 943.0 207.2 L 943.0 207.8 L 939.2 207.8 L 939.2 207.3 Q 940.3 206.6, 940.9 206.0 Q 941.5 205.5, 941.8 205.0 Q 942.1 204.5, 942.1 203.9 Q 942.1 203.4, 941.8 203.1 Q 941.6 202.8, 941.1 202.8 Q 940.7 202.8, 940.4 203.0 Q 940.0 203.2, 939.8 203.6 L 939.2 203.3 \" fill=\"#000000\"/>\n",
|
||||
"<path class=\"atom-2\" d=\"M 786.9 379.6 L 789.7 384.1 Q 790.0 384.6, 790.4 385.4 Q 790.8 386.2, 790.9 386.2 L 790.9 379.6 L 792.0 379.6 L 792.0 388.1 L 790.8 388.1 L 787.8 383.2 Q 787.5 382.6, 787.1 382.0 Q 786.8 381.3, 786.7 381.1 L 786.7 388.1 L 785.6 388.1 L 785.6 379.6 L 786.9 379.6 \" fill=\"#0000FF\"/>\n",
|
||||
"<path class=\"atom-5\" d=\"M 835.6 716.6 L 838.4 721.1 Q 838.6 721.5, 839.1 722.3 Q 839.5 723.1, 839.5 723.2 L 839.5 716.6 L 840.7 716.6 L 840.7 725.1 L 839.5 725.1 L 836.5 720.2 Q 836.2 719.6, 835.8 718.9 Q 835.4 718.3, 835.3 718.1 L 835.3 725.1 L 834.2 725.1 L 834.2 716.6 L 835.6 716.6 \" fill=\"#0000FF\"/>\n",
|
||||
"<path class=\"atom-7\" d=\"M 663.2 588.3 L 666.0 592.8 Q 666.3 593.3, 666.7 594.1 Q 667.2 594.9, 667.2 594.9 L 667.2 588.3 L 668.3 588.3 L 668.3 596.8 L 667.1 596.8 L 664.2 591.9 Q 663.8 591.3, 663.4 590.7 Q 663.1 590.0, 663.0 589.8 L 663.0 596.8 L 661.9 596.8 L 661.9 588.3 L 663.2 588.3 \" fill=\"#0000FF\"/>\n",
|
||||
"<path class=\"atom-9\" d=\"M 427.1 626.9 Q 427.1 624.9, 428.1 623.8 Q 429.1 622.6, 431.0 622.6 Q 432.8 622.6, 433.9 623.8 Q 434.9 624.9, 434.9 626.9 Q 434.9 629.0, 433.8 630.2 Q 432.8 631.3, 431.0 631.3 Q 429.1 631.3, 428.1 630.2 Q 427.1 629.0, 427.1 626.9 M 431.0 630.4 Q 432.3 630.4, 433.0 629.5 Q 433.7 628.6, 433.7 626.9 Q 433.7 625.3, 433.0 624.4 Q 432.3 623.6, 431.0 623.6 Q 429.7 623.6, 429.0 624.4 Q 428.3 625.3, 428.3 626.9 Q 428.3 628.7, 429.0 629.5 Q 429.7 630.4, 431.0 630.4 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-12\" d=\"M 87.7 493.1 L 88.9 493.1 L 88.9 496.7 L 93.2 496.7 L 93.2 493.1 L 94.4 493.1 L 94.4 501.6 L 93.2 501.6 L 93.2 497.6 L 88.9 497.6 L 88.9 501.6 L 87.7 501.6 L 87.7 493.1 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-12\" d=\"M 96.1 497.3 Q 96.1 495.3, 97.1 494.1 Q 98.1 493.0, 100.0 493.0 Q 101.9 493.0, 102.9 494.1 Q 103.9 495.3, 103.9 497.3 Q 103.9 499.4, 102.9 500.5 Q 101.9 501.7, 100.0 501.7 Q 98.2 501.7, 97.1 500.5 Q 96.1 499.4, 96.1 497.3 M 100.0 500.7 Q 101.3 500.7, 102.0 499.9 Q 102.7 499.0, 102.7 497.3 Q 102.7 495.6, 102.0 494.8 Q 101.3 493.9, 100.0 493.9 Q 98.7 493.9, 98.0 494.8 Q 97.3 495.6, 97.3 497.3 Q 97.3 499.0, 98.0 499.9 Q 98.7 500.7, 100.0 500.7 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-14\" d=\"M 277.8 309.3 L 278.9 309.3 L 278.9 312.9 L 283.3 312.9 L 283.3 309.3 L 284.4 309.3 L 284.4 317.8 L 283.3 317.8 L 283.3 313.9 L 278.9 313.9 L 278.9 317.8 L 277.8 317.8 L 277.8 309.3 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-14\" d=\"M 286.2 313.6 Q 286.2 311.5, 287.2 310.4 Q 288.2 309.2, 290.1 309.2 Q 292.0 309.2, 293.0 310.4 Q 294.0 311.5, 294.0 313.6 Q 294.0 315.6, 293.0 316.8 Q 291.9 318.0, 290.1 318.0 Q 288.2 318.0, 287.2 316.8 Q 286.2 315.6, 286.2 313.6 M 290.1 317.0 Q 291.4 317.0, 292.1 316.1 Q 292.8 315.3, 292.8 313.6 Q 292.8 311.9, 292.1 311.0 Q 291.4 310.2, 290.1 310.2 Q 288.8 310.2, 288.1 311.0 Q 287.4 311.9, 287.4 313.6 Q 287.4 315.3, 288.1 316.1 Q 288.8 317.0, 290.1 317.0 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-16\" d=\"M 575.1 316.9 Q 575.1 314.8, 576.1 313.7 Q 577.1 312.5, 579.0 312.5 Q 580.8 312.5, 581.8 313.7 Q 582.9 314.8, 582.9 316.9 Q 582.9 318.9, 581.8 320.1 Q 580.8 321.3, 579.0 321.3 Q 577.1 321.3, 576.1 320.1 Q 575.1 318.9, 575.1 316.9 M 579.0 320.3 Q 580.2 320.3, 580.9 319.4 Q 581.7 318.6, 581.7 316.9 Q 581.7 315.2, 580.9 314.3 Q 580.2 313.5, 579.0 313.5 Q 577.7 313.5, 576.9 314.3 Q 576.3 315.2, 576.3 316.9 Q 576.3 318.6, 576.9 319.4 Q 577.7 320.3, 579.0 320.3 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-16\" d=\"M 584.2 312.6 L 585.3 312.6 L 585.3 316.2 L 589.7 316.2 L 589.7 312.6 L 590.8 312.6 L 590.8 321.1 L 589.7 321.1 L 589.7 317.2 L 585.3 317.2 L 585.3 321.1 L 584.2 321.1 L 584.2 312.6 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-18\" d=\"M 1104.5 621.7 Q 1104.5 619.7, 1105.5 618.5 Q 1106.5 617.4, 1108.4 617.4 Q 1110.2 617.4, 1111.3 618.5 Q 1112.3 619.7, 1112.3 621.7 Q 1112.3 623.8, 1111.2 624.9 Q 1110.2 626.1, 1108.4 626.1 Q 1106.5 626.1, 1105.5 624.9 Q 1104.5 623.8, 1104.5 621.7 M 1108.4 625.1 Q 1109.7 625.1, 1110.4 624.3 Q 1111.1 623.4, 1111.1 621.7 Q 1111.1 620.0, 1110.4 619.2 Q 1109.7 618.3, 1108.4 618.3 Q 1107.1 618.3, 1106.4 619.2 Q 1105.7 620.0, 1105.7 621.7 Q 1105.7 623.4, 1106.4 624.3 Q 1107.1 625.1, 1108.4 625.1 \" fill=\"#FF0000\"/>\n",
|
||||
"<path class=\"atom-19\" d=\"M 1015.3 406.3 L 1018.1 410.8 Q 1018.4 411.2, 1018.8 412.0 Q 1019.3 412.8, 1019.3 412.9 L 1019.3 406.3 L 1020.4 406.3 L 1020.4 414.8 L 1019.3 414.8 L 1016.3 409.8 Q 1015.9 409.3, 1015.6 408.6 Q 1015.2 407.9, 1015.1 407.7 L 1015.1 414.8 L 1014.0 414.8 L 1014.0 406.3 L 1015.3 406.3 \" fill=\"#0000FF\"/>\n",
|
||||
"<path class=\"atom-19\" d=\"M 1022.1 406.3 L 1023.2 406.3 L 1023.2 409.9 L 1027.6 409.9 L 1027.6 406.3 L 1028.7 406.3 L 1028.7 414.8 L 1027.6 414.8 L 1027.6 410.8 L 1023.2 410.8 L 1023.2 414.8 L 1022.1 414.8 L 1022.1 406.3 \" fill=\"#0000FF\"/>\n",
|
||||
"</svg>"
|
||||
],
|
||||
"text/plain": [
|
||||
"<IPython.core.display.SVG object>"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"source": [
"def save_aniline_screening_results(df, output_dir, visualization_dir, max_visualizations=50):\n",
"    \"\"\"Save the aromatic amine screening results.\"\"\"\n",
"\n",
"    # Save the CSV file\n",
"    csv_path = output_dir / \"aniline_candidates.csv\"\n",
"\n",
"    # Convert the ROMol column to SMILES (ROMol objects cannot be written to CSV)\n",
"    df_export = df.copy()\n",
"    if 'ROMol' in df_export.columns:\n",
"        df_export['SMILES_from_mol'] = df_export['ROMol'].apply(lambda x: Chem.MolToSmiles(x) if x is not None else '')\n",
"        df_export = df_export.drop('ROMol', axis=1)\n",
"\n",
"    df_export.to_csv(csv_path, index=False, encoding='utf-8')\n",
"    print(f\"CSV results saved to: {csv_path}\")\n",
"    print(f\"Contains {len(df_export)} molecules, {len(df_export.columns)} attribute columns\")\n",
"\n",
"    # Generate the visualizations\n",
"    print(f\"\\nGenerating visualizations (at most {max_visualizations})...\")\n",
"    generated_count = 0\n",
"\n",
"    for idx, row in df.iterrows():\n",
"        if generated_count >= max_visualizations:\n",
"            print(f\"Reached the visualization limit ({max_visualizations}); stopping\")\n",
"            break\n",
"\n",
"        cas = str(row.get('CAS', 'unknown')).strip()\n",
"        name = str(row.get('Name', 'unknown')).strip()\n",
"\n",
"        # Sanitize the file name (drop special characters)\n",
"        safe_name = \"\".join(c for c in name if c.isalnum() or c in (' ', '-', '_')).rstrip()\n",
"        safe_cas = \"\".join(c for c in cas if c.isalnum() or c in ('-',)).rstrip()\n",
"\n",
"        # Skip invalid identifiers\n",
"        if not safe_cas or safe_cas == 'nan' or safe_cas == 'unknown':\n",
"            continue\n",
"\n",
"        mol = row.get('ROMol')\n",
"        if mol is None:\n",
"            continue\n",
"\n",
"        matched_atoms = row.get('matched_atoms', [])\n",
"        if not matched_atoms:\n",
"            continue\n",
"\n",
"        # Build the file name and title\n",
"        filename = visualization_dir / f\"{safe_cas}_{safe_name.replace(' ', '_')}.svg\"\n",
"        title = f\"{name} ({cas}) - aromatic amine substructure\"\n",
"\n",
"        try:\n",
"            # Generate the SVG\n",
"            svg_content = generate_highlighted_svg(mol, matched_atoms, filename, title)\n",
"            generated_count += 1\n",
"\n",
"            # Report progress every 10 molecules\n",
"            if generated_count % 10 == 0:\n",
"                print(f\"Generated {generated_count} molecule images\")\n",
"\n",
"        except Exception as e:\n",
"            print(f\"Failed to generate {safe_cas}: {e}\")\n",
"            continue\n",
"\n",
"    print(f\"Done. Generated {generated_count} visualizations in total\")\n",
"    return csv_path, generated_count\n",
"\n",
"# Save the results\n",
"if len(matched_df) > 0:\n",
"    csv_path, viz_count = save_aniline_screening_results(\n",
"        matched_df, output_dir, visualization_dir, max_visualizations=50\n",
"    )\n",
"\n",
"    # Display one of the generated images as an example\n",
"    if viz_count > 0:\n",
"        example_files = list(visualization_dir.glob(\"*.svg\"))\n",
"        if example_files:\n",
"            example_file = example_files[0]\n",
"            print(f\"\\nExample image: {example_file.name}\")\n",
"            with open(example_file, \"r\") as f:\n",
"                svg_content = f.read()\n",
"            display(SVG(svg_content))\n",
"else:\n",
"    print(\"No matches, nothing to save\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"## Result statistics and analysis\n",
"\n",
"### Screening statistics\n",
"- Total number of molecules\n",
"- Number of matching molecules\n",
"- Number of visualization files\n",
"- Output file locations"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"execution": {
|
||||
"iopub.execute_input": "2025-11-11T13:21:36.282118Z",
|
||||
"iopub.status.busy": "2025-11-11T13:21:36.281886Z",
|
||||
"iopub.status.idle": "2025-11-11T13:21:36.317857Z",
|
||||
"shell.execute_reply": "2025-11-11T13:21:36.316621Z"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
"=== Aromatic amine screening statistics ===\n",
"Total molecules: 3276\n",
"Matching molecules: 78\n",
"Match rate: 2.38%\n",
"\n",
"Output directory: ../data/drug_targetmol/aniline_candidates\n",
"CSV file: ../data/drug_targetmol/aniline_candidates/aniline_candidates.csv\n",
"Visualization directory: ../data/drug_targetmol/aniline_candidates/visualizations\n",
"Number of SVG files: 50\n",
"\n",
"Molecules with the most matches:\n",
"                                      Name          CAS  total_matches\n",
"432                 Proflavine Hemisulfate    1811-28-5              4\n",
"335  Pemetrexed disodium hemipenta hydrate  357166-30-4              2\n",
"463                            Lamotrigine   84057-84-1              2\n",
"779                          Pyrimethamine      58-14-0              2\n",
"784                                Dapsone      80-08-0              2\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
"# Summary statistics\n",
"print(\"=== Aromatic amine screening statistics ===\")\n",
"print(f\"Total molecules: {len(df)}\")\n",
"print(f\"Matching molecules: {len(matched_df)}\")\n",
"# Note: the screen above was capped at max_molecules=1000, so this rate (computed over\n",
"# all loaded molecules) understates the hit rate among the molecules actually screened.\n",
"print(f\"Match rate: {len(matched_df)/len(df)*100:.2f}%\")\n",
"print(f\"\\nOutput directory: {output_dir}\")\n",
"print(f\"CSV file: {output_dir}/aniline_candidates.csv\")\n",
"print(f\"Visualization directory: {visualization_dir}\")\n",
"print(f\"Number of SVG files: {len(list(visualization_dir.glob('*.svg')))}\")\n",
"\n",
"# Show the molecules with the most matches\n",
"if len(matched_df) > 0:\n",
"    print(\"\\nMolecules with the most matches:\")\n",
"    top_matches = matched_df.nlargest(5, 'total_matches')[['Name', 'CAS', 'total_matches']]\n",
"    print(top_matches)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
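"## Usage notes\n",
"\n",
"### Interpreting the screening results\n",
"- **Matching molecules**: drugs that contain an aromatic amine group (Ar-NH₂)\n",
"- **Blue highlight**: the atoms matched by the SMARTS pattern (aromatic carbon/nitrogen plus the amino group)\n",
"- **Multiple matches**: a molecule may contain more than one aromatic amine group\n",
"\n",
"### Suggested follow-up analysis\n",
"1. **Synthesis route verification**: consult the synthesis literature for each matching molecule\n",
"2. **Sandmeyer reaction confirmation**: confirm whether a Sandmeyer reaction is used to introduce the halogen\n",
"3. **Zhang Xiaheng reaction (张夏恒反应) assessment**: assess the feasibility of replacing the Sandmeyer reaction\n",
"4. **Process optimization potential**: analyze the economic benefit of switching to the Zhang Xiaheng reaction\n",
"\n",
"### File description\n",
"- **CSV file**: complete molecular attributes and match information\n",
"- **SVG files**: structure visualizations with the aromatic amine highlighted in blue\n",
"- **Naming convention**: {CAS}_{Name}.svg (special characters removed)"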
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.14.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
||||
797
notebooks/screen_sandmeyer_candidates.ipynb
Normal file
@@ -0,0 +1,797 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
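"# Screening candidate drugs for the Sandmeyer reaction - analysis of the Zhang Xiaheng reaction (张夏恒反应) as an alternative\n",
"\n",
"## Background\n",
"\n",
"### Sandmeyer reaction recap\n",
"The Sandmeyer reaction is the classic method for converting aromatic amines:\n",
"**Ar-NH₂ → [Ar-N₂]⁺ → Ar-X**\n",
"where X = Cl, Br, I, CN, OH, SCN, etc.\n",
"\n",
"### Zhang Xiaheng reaction\n",
"According to the paper `s41586-025-09791-5_reference.pdf`, the Zhang Xiaheng reaction may be a new alternative\n",
"that converts aromatic amines to aryl halides more efficiently.\n",
"\n",
"### Screening strategy\n",
"We identify aryl halide substructures in drug molecules that may originate from a Sandmeyer reaction,\n",
"and so find candidate drugs whose process could be optimized with the Zhang Xiaheng reaction.\n",
"\n",
"**Important:**\n",
"- This screen is based on molecular structure alone\n",
"- The synthesis route must ultimately be confirmed from the literature\n",
"- Not every halogen-containing drug is synthesized with a Sandmeyer reaction"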
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"## Import required libraries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from pathlib import Path\n",
|
||||
"from rdkit.Chem.Draw import rdMolDraw2D\n",
|
||||
"from IPython.display import SVG, display\n",
|
||||
"from rdkit.Chem import AllChem\n",
|
||||
"\n",
|
||||
"# 创建输出目录\n",
|
||||
"output_base = Path(\"../data/drug_targetmol\")\n",
|
||||
"scheme_a_dir = output_base / \"scheme_A_visualizations\"\n",
|
||||
"scheme_b_dir = output_base / \"scheme_B_visualizations\"\n",
|
||||
"\n",
|
||||
"scheme_a_dir.mkdir(exist_ok=True)\n",
|
||||
"scheme_b_dir.mkdir(exist_ok=True)\n",
|
||||
"\n",
|
||||
"print(f\"创建目录:{scheme_a_dir}\")\n",
|
||||
"print(f\"创建目录:{scheme_b_dir}\")\n",
|
||||
"\n",
|
||||
"def generate_highlighted_svg(mol, highlight_atoms, filename, title=\"\"):\n",
|
||||
" \"\"\"生成高亮卤素结构的高清晰度SVG图片\"\"\"\n",
|
||||
" from rdkit.Chem import AllChem\n",
|
||||
" \n",
|
||||
" # 计算2D坐标\n",
|
||||
" AllChem.Compute2DCoords(mol)\n",
|
||||
" \n",
|
||||
" # 创建SVG绘制器\n",
|
||||
" drawer = rdMolDraw2D.MolDraw2DSVG(1200, 900) # 更大的尺寸以提高清晰度\n",
|
||||
" drawer.SetFontSize(12)\n",
|
||||
" \n",
|
||||
" # 绘制选项\n",
|
||||
" draw_options = drawer.drawOptions()\n",
|
||||
" draw_options.addAtomIndices = False # 不显示原子索引,保持简洁\n",
|
||||
" draw_options.addBondIndices = False\n",
|
||||
" draw_options.addStereoAnnotation = True\n",
|
||||
" draw_options.fixedFontSize = 12\n",
|
||||
" \n",
|
||||
" # 高亮卤素原子(红色)\n",
|
||||
" atom_colors = {}\n",
|
||||
" for atom_idx in highlight_atoms:\n",
|
||||
" atom_colors[atom_idx] = (1.0, 0.3, 0.3) # 红色高亮\n",
|
||||
" \n",
|
||||
" # 绘制分子\n",
|
||||
" drawer.DrawMolecule(mol, \n",
|
||||
" highlightAtoms=highlight_atoms,\n",
|
||||
" highlightAtomColors=atom_colors)\n",
|
||||
" \n",
|
||||
" drawer.FinishDrawing()\n",
|
||||
" svg_content = drawer.GetDrawingText()\n",
|
||||
" \n",
|
||||
" # 添加标题\n",
|
||||
" if title:\n",
|
||||
" # 在SVG中添加标题\n",
|
||||
" svg_lines = svg_content.split(\"\\n\")\n",
|
||||
" # 在<g>标签前插入标题\n",
|
||||
" for i, line in enumerate(svg_lines):\n",
|
||||
" if \"<g \" in line and \"transform\" in line:\n",
|
||||
" svg_lines.insert(i, f\"<text x=\"50%\" y=\"30\" text-anchor=\"middle\" font-size=\"16\" font-weight=\"bold\">{title}</text>\")\n",
|
||||
" break\n",
|
||||
" svg_with_title = \"\\n\".join(svg_lines)\n",
|
||||
" else:\n",
|
||||
" svg_with_title = svg_content\n",
|
||||
" \n",
|
||||
" # 保存文件\n",
|
||||
" with open(filename, \"w\") as f:\n",
|
||||
" f.write(svg_with_title)\n",
|
||||
" \n",
|
||||
" print(f\"保存SVG: {filename}\")\n",
|
||||
" \n",
|
||||
" return svg_content\n",
|
||||
"\n",
|
||||
"def visualize_molecules(df, output_dir, scheme_name, max_molecules=50):\n",
|
||||
" \"\"\"为DataFrame中的分子生成可视化图片\"\"\"\n",
|
||||
" print(f\"\\n开始生成{scheme_name}的可视化图片...\")\n",
|
||||
" print(f\"输出目录: {output_dir}\")\n",
|
||||
" \n",
|
||||
" generated_count = 0\n",
|
||||
" \n",
|
||||
" for idx, row in df.iterrows():\n",
|
||||
" if generated_count >= max_molecules:\n",
|
||||
" print(f\"已达到最大生成数量限制 ({max_molecules}),停止生成\")\n",
|
||||
" break\n",
|
||||
" \n",
|
||||
" cas = str(row.get(\"CAS\", \"unknown\")).strip()\n",
|
||||
" name = str(row.get(\"Name\", \"unknown\")).strip()\n",
|
||||
" \n",
|
||||
" # 跳过无效的CAS号\n",
|
||||
" if not cas or cas == \"nan\" or cas == \"unknown\":\n",
|
||||
" continue\n",
|
||||
" \n",
|
||||
" mol = row.get(\"ROMol\")\n",
|
||||
" if mol is None:\n",
|
||||
" continue\n",
|
||||
" \n",
|
||||
" # 找出卤素原子\n",
|
||||
" halogen_atoms = []\n",
|
||||
" for atom in mol.GetAtoms():\n",
|
||||
" if atom.GetAtomicNum() in [9, 17, 35, 53]: # F, Cl, Br, I\n",
|
||||
" halogen_atoms.append(atom.GetIdx())\n",
|
||||
" \n",
|
||||
" if not halogen_atoms:\n",
|
||||
" continue\n",
|
||||
" \n",
"        # Build the file name and title\n",
"        filename = output_dir / f\"{cas}.svg\"\n",
"        title = f\"{name} ({cas})\"\n",
"        \n",
"        try:\n",
"            # Generate the SVG\n",
"            generate_highlighted_svg(mol, halogen_atoms, filename, title)\n",
"            generated_count += 1\n",
"            \n",
"            # Report progress every 10 molecules\n",
"            if generated_count % 10 == 0:\n",
"                print(f\"Generated {generated_count} molecule images\")\n",
"                \n",
"        except Exception as e:\n",
"            print(f\"Failed to generate {cas}: {e}\")\n",
"            continue\n",
"    \n",
"    print(f\"Done! Generated {generated_count} visualization images for {scheme_name}\")\n",
|
||||
" return generated_count\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
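"## Defining the screening patterns\n",
"\n",
"### SMARTS patterns\n",
"\n",
"#### Scheme A: heteroaryl halides (highest priority)\n",
"- **Screening logic**: a halogen on a heteroaromatic ring is the most likely to have been installed by a Sandmeyer reaction\n",
"- **Rationale**: direct halogenation of heteroaromatic rings is usually difficult, so the Sandmeyer reaction is an important synthetic route\n",
"- **Expected result**: few candidates, but high precision\n",
"\n",
"#### Scheme B: all aryl halides (medium priority)\n",
"- **Screening logic**: a halogen on any aromatic ring\n",
"- **Rationale**: some of these halogens may come from other routes, but this widens the screen\n",
"- **Expected result**: more candidates, requiring more literature verification\n",
"\n",
"**Notes on the SMARTS patterns:**\n",
"- The original pattern `n:c:[Cl,Br,I]` was incorrect: the `:` before the halogen requires an aromatic C-X bond, so it cannot match an aryl halide\n",
"- It was replaced with patterns that match the ring structure more accurately\n",
"- The new patterns describe the atom environment more precisely"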
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
"# Define the screening patterns\n",
"SCREENING_PATTERNS = {\n",
"    'heteroaryl_halides': {\n",
"        'name': 'Heteroaryl halides',\n",
"        'description': 'Cl, Br, or I atoms on heteroaromatic rings (scheme A)',\n",
"        'smarts': [\n",
"            '[n,o,s]c[Cl,Br,I]',  # halogen on the carbon adjacent to the heteroatom\n",
"            '[n,o,s]cc[Cl,Br,I]',  # halogen one carbon further from the heteroatom\n",
"            'c1[n,o,s]c([Cl,Br,I])ccc1',  # six-membered ring with one heteroatom, halogen next to it (e.g. 2-halopyridine)\n",
"            'c1c([Cl,Br,I])cncn1',  # halopyrimidine\n",
"            'c1ccc2c([Cl,Br,I])ccnc2c1',  # haloquinoline\n",
"            'c1c([Cl,Br,I])cncc1',  # 3-halopyridine (the pattern has a single ring nitrogen)\n",
"            'c1([Cl,Br,I])scnc1',  # halothiazole\n",
"        ],\n",
"        'scheme': 'A'\n",
"    },\n",
"    'aryl_halides': {\n",
"        'name': 'Aryl halides',\n",
"        'description': 'Cl, Br, or I atoms on any aromatic ring (scheme B)',\n",
"        'smarts': [\n",
"            'c[Cl,Br,I]',  # any aromatic Cl, Br, or I\n",
"            'c-C#N',  # aryl nitrile\n",
"            'c1ccc(S(=O)(=O)N)cc1',  # sulfonamide core\n",
"            'c1c(Cl)cc(Cl)cc1',  # polyhalogenated benzene (1,3-dichloro pattern)\n",
"        ],\n",
"        'scheme': 'B'\n",
"    }\n",
"}\n",
|
||||
"\n",
|
||||
"def create_pattern_matchers():\n",
"    \"\"\"Create the SMARTS pattern matchers\"\"\"\n",
|
||||
" matchers = {}\n",
|
||||
" for key, pattern_info in SCREENING_PATTERNS.items():\n",
|
||||
" matchers[key] = {\n",
|
||||
" 'info': pattern_info,\n",
|
||||
" 'matchers': [Chem.MolFromSmarts(smarts) for smarts in pattern_info['smarts']]\n",
|
||||
" }\n",
|
||||
" return matchers\n",
|
||||
"\n",
|
||||
"PATTERN_MATCHERS = create_pattern_matchers()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
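"## Data loading and preprocessing\n",
"\n",
"### About the SDF file\n",
"- File location: `data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
"- Contains drug molecule structures and rich property annotations\n",
"- Each molecule record includes SMILES, molecular formula, molecular weight, approval status, indications, and more"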
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
"Reading SDF file...\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[15:05:51] Both bonds on one end of an atropisomer are on the same side - atoms is : 3\n",
|
||||
"[15:05:51] Explicit valence for atom # 2 N greater than permitted\n",
|
||||
"[15:05:51] ERROR: Could not sanitize molecule ending on line 217340\n",
|
||||
"[15:05:51] ERROR: Explicit valence for atom # 2 N greater than permitted\n",
|
||||
"[15:05:52] Explicit valence for atom # 4 N greater than permitted\n",
|
||||
"[15:05:52] ERROR: Could not sanitize molecule ending on line 317283\n",
|
||||
"[15:05:52] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
|
||||
"[15:05:52] Explicit valence for atom # 4 N greater than permitted\n",
|
||||
"[15:05:52] ERROR: Could not sanitize molecule ending on line 324666\n",
|
||||
"[15:05:52] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
|
||||
"[15:05:52] Explicit valence for atom # 5 N greater than permitted\n",
|
||||
"[15:05:52] ERROR: Could not sanitize molecule ending on line 365883\n",
|
||||
"[15:05:52] ERROR: Explicit valence for atom # 5 N greater than permitted\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
"Successfully loaded 3276 molecules\n",
"\n",
"Data overview:\n",
|
||||
" Index Plate Row Col ID Name \\\n",
|
||||
"0 1 L1010-1 a 2 Dexamethasone \n",
|
||||
"1 2 L1010-1 a 3 Danicopan \n",
|
||||
"2 3 L1010-1 a 4 Cyclosporin A \n",
|
||||
"3 4 L1010-1 a 5 L-Carnitine \n",
|
||||
"4 5 L1010-1 a 6 Trimetazidine dihydrochloride \n",
|
||||
"\n",
|
||||
" Synonyms CAS \\\n",
|
||||
"0 MK 125;Prednisolone F;NSC 34521;Hexadecadrol 50-02-2 \n",
|
||||
"1 ACH-4471 1903768-17-1 \n",
|
||||
"2 Cyclosporine A;Ciclosporin;Cyclosporine 59865-13-3 \n",
|
||||
"3 L(-)-Carnitine;Levocarnitine 541-15-1 \n",
|
||||
"4 Yoshimilon;Kyurinett;Vastarel F 13171-25-0 \n",
|
||||
"\n",
|
||||
" SMILES \\\n",
|
||||
"0 C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO \n",
|
||||
"1 CC(=O)c1nn(CC(=O)N2C[C@H](F)C[C@H]2C(=O)Nc2cccc(Br)n2)c2ccc(cc12)-c1cnc(C)nc1 \n",
|
||||
"2 [C@H]([C@@H](C/C=C/C)C)(O)[C@@]1(N(C)C(=O)[C@H]([C@@H](C)C)N(C)C(=O)[C@H](CC(C)C)N(C)C(=O)[C@H](... \n",
|
||||
"3 C[N+](C)(C)C[C@@H](O)CC([O-])=O \n",
|
||||
"4 Cl.Cl.COC1=C(OC)C(OC)=C(CN2CCNCC2)C=C1 \n",
|
||||
"\n",
|
||||
" Formula MolWt Approved status \\\n",
|
||||
"0 C22H29FO5 392.46 NMPA;EMA;FDA \n",
|
||||
"1 C26H23BrFN7O3 580.41 FDA \n",
|
||||
"2 C62H111N11O12 1202.61 FDA \n",
|
||||
"3 C7H15NO3 161.2 FDA \n",
|
||||
"4 C14H24Cl2N2O3 339.258 NMPA;EMA \n",
|
||||
"\n",
|
||||
" Pharmacopoeia \\\n",
|
||||
"0 USP39-NF34;BP2015;JP16;IP2010 \n",
|
||||
"1 NaN \n",
|
||||
"2 Martindale the Extra Pharmacopoei, EP10.2, USP43-NF38, Ph.Int_6th, JP17 \n",
|
||||
"3 NaN \n",
|
||||
"4 BP2019;KP Ⅹ;EP9.2;IP2010;JP17;Martindale:The Extra Pharmacopoeia \n",
|
||||
"\n",
|
||||
" Disease \\\n",
|
||||
"0 Metabolism \n",
|
||||
"1 Others \n",
|
||||
"2 Immune system \n",
|
||||
"3 Cardiovascular system \n",
|
||||
"4 Cardiovascular system \n",
|
||||
"\n",
|
||||
" Pathways \\\n",
|
||||
"0 Antibody-drug Conjugate/ADC Related;Autophagy;Endocrinology/Hormones;Immunology/Inflammation;Mic... \n",
|
||||
"1 Immunology/Inflammation \n",
|
||||
"2 Immunology/Inflammation;Metabolism;Microbiology/Virology \n",
|
||||
"3 Metabolism \n",
|
||||
"4 Autophagy;Metabolism \n",
|
||||
"\n",
|
||||
" Target \\\n",
|
||||
"0 Antibacterial;Antibiotic;Autophagy;Complement System;Glucocorticoid Receptor;IL Receptor;Mitopha... \n",
|
||||
"1 Complement System \n",
|
||||
"2 Phosphatase;Antibiotic;Complement System \n",
|
||||
"3 Endogenous Metabolite;Fatty Acid Synthase \n",
|
||||
"4 Autophagy;Fatty Acid Synthase \n",
|
||||
"\n",
|
||||
" Receptor \\\n",
|
||||
"0 Antibiotic; Autophagy; Bacterial; Complement System; Glucocorticoid Receptor; IL receptor; Mitop... \n",
|
||||
"1 Complement System; factor D \n",
|
||||
"2 Antibiotic; calcineurin phosphatase; Complement System; Phosphatase \n",
|
||||
"3 Endogenous Metabolite; FAS \n",
|
||||
"4 Autophagy; mitochondrial long-chain 3-ketoacyl thiolase \n",
|
||||
"\n",
|
||||
" Bioactivity \\\n",
|
||||
"0 Dexamethasone is a glucocorticoid receptor agonist and IL receptor modulator with anti-inflammat... \n",
|
||||
"1 Danicopan (ACH-4471) (ACH-4471) is a selective, orally active small molecule factor D inhibitor ... \n",
|
||||
"2 Cyclosporin A is a natural product and an active fungal metabolite, classified as a cyclic polyp... \n",
|
||||
"3 L-Carnitine (L(-)-Carnitine) is an amino acid derivative. L-Carnitine facilitates long-chain fat... \n",
|
||||
"4 Trimetazidine dihydrochloride (Vastarel F) can improve myocardial glucose utilization by inhibit... \n",
|
||||
"\n",
|
||||
" Reference \\\n",
|
||||
"0 Li M, Yu H. Identification of WP1066, an inhibitor of JAK2 and STAT3, as a Kv1. 3 potassium chan... \n",
|
||||
"1 Yuan X, et al. Small-molecule factor D inhibitors selectively block the alternative pathway of c... \n",
|
||||
"2 D'Angelo G, et al. Cyclosporin A prevents the hypoxic adaptation by activating hypoxia-inducible... \n",
|
||||
"3 Jogl G, Tong L. Cell. 2003 Jan 10; 112(1):113-22. \n",
|
||||
"4 Yang Q, et al. Int J Clin Exp Pathol. 2015, 8(4):3735-3741.;Liu Z, et al. Metabolism. 2016, 65(3... \n",
|
||||
"\n",
|
||||
" ROMol \n",
|
||||
"0 <rdkit.Chem.rdchem.Mol object at 0x743d782049e0> \n",
|
||||
"1 <rdkit.Chem.rdchem.Mol object at 0x743d782871b0> \n",
|
||||
"2 <rdkit.Chem.rdchem.Mol object at 0x743d78287220> \n",
|
||||
"3 <rdkit.Chem.rdchem.Mol object at 0x743d782873e0> \n",
|
||||
"4 <rdkit.Chem.rdchem.Mol object at 0x743d78287450> \n",
|
||||
"\n",
"Columns: ['Index', 'Plate', 'Row', 'Col', 'ID', 'Name', 'Synonyms', 'CAS', 'SMILES', 'Formula', 'MolWt', 'Approved status', 'Pharmacopoeia', 'Disease', 'Pathways', 'Target', 'Receptor', 'Bioactivity', 'Reference', 'ROMol']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
"# Read the screening-result CSV files\n",
"import pandas as pd\n",
"from rdkit import Chem\n",
"\n",
"print(\"Reading screening-result CSV files...\")\n",
"\n",
"# Read the scheme A results\n",
"df_a = pd.read_csv(\"../data/drug_targetmol/sandmeyer_candidates_scheme_A_heteroaryl_halides.csv\")\n",
"print(f\"Scheme A data: {len(df_a)} rows\")\n",
"\n",
"# Read the scheme B results\n",
"df_b = pd.read_csv(\"../data/drug_targetmol/sandmeyer_candidates_scheme_B_aryl_halides.csv\")\n",
"print(f\"Scheme B data: {len(df_b)} rows\")\n",
"\n",
"# Rebuild the molecule objects from the saved SMILES\n",
"def rebuild_molecules(df):\n",
"    mols = []\n",
"    for idx, row in df.iterrows():\n",
"        smiles = row.get(\"SMILES_from_mol\", \"\")\n",
"        if smiles and str(smiles) != \"nan\":\n",
"            mol = Chem.MolFromSmiles(str(smiles))\n",
"            mols.append(mol)\n",
"        else:\n",
"            mols.append(None)\n",
"    df[\"ROMol\"] = mols\n",
"    valid_mols = sum(1 for m in mols if m is not None)\n",
"    print(f\"Rebuilt {valid_mols} molecule objects\")\n",
"    return df\n",
"\n",
"df_a = rebuild_molecules(df_a)\n",
"df_b = rebuild_molecules(df_b)\n",
"\n",
"print(\"\\nData loading complete\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
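"## Molecule screening functions\n",
"\n",
"### Screening logic\n",
"\n",
"1. **Molecule validation**: make sure each molecular structure is valid\n",
"2. **Substructure matching**: use RDKit SMARTS matching\n",
"3. **Result recording**: record the matched patterns and the specific substructures\n",
"4. **Data integrity**: keep all original property columns"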
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def screen_molecules_for_patterns(df, pattern_key):\n",
"    \"\"\"\n",
"    Screen for molecules that contain specific substructures\n",
"    \n",
"    Args:\n",
"        df: DataFrame containing the molecules\n",
"        pattern_key: key of the screening pattern\n",
"    \n",
"    Returns:\n",
"        DataFrame of screening results\n",
"    \"\"\"\n",
"    pattern_info = PATTERN_MATCHERS[pattern_key]['info']\n",
"    matchers = PATTERN_MATCHERS[pattern_key]['matchers']\n",
"    \n",
"    print(f\"\\nStarting screen: {pattern_info['name']}\")\n",
"    print(f\"Description: {pattern_info['description']}\")\n",
"    print(f\"Number of SMARTS patterns: {len(pattern_info['smarts'])}\")\n",
|
||||
" \n",
|
||||
" matched_molecules = []\n",
|
||||
" \n",
|
||||
" for idx, row in df.iterrows():\n",
|
||||
" mol = row['ROMol']\n",
|
||||
" if mol is None:\n",
|
||||
" continue\n",
|
||||
" \n",
"        # Check whether any pattern matches\n",
|
||||
" matched_patterns = []\n",
|
||||
" for i, matcher in enumerate(matchers):\n",
|
||||
" if matcher is None:\n",
|
||||
" continue\n",
|
||||
" if mol.HasSubstructMatch(matcher):\n",
|
||||
" matched_patterns.append({\n",
|
||||
" 'pattern_index': i,\n",
|
||||
" 'smarts': pattern_info['smarts'][i],\n",
|
||||
" 'matches': len(mol.GetSubstructMatches(matcher))\n",
|
||||
" })\n",
|
||||
" \n",
|
||||
" if matched_patterns:\n",
"            # Build the match record\n",
|
||||
" match_record = row.copy()\n",
|
||||
" match_record['matched_patterns'] = matched_patterns\n",
|
||||
" match_record['total_matches'] = sum(p['matches'] for p in matched_patterns)\n",
|
||||
" match_record['screening_scheme'] = pattern_info['scheme']\n",
|
||||
" matched_molecules.append(match_record)\n",
|
||||
" \n",
|
||||
" result_df = pd.DataFrame(matched_molecules)\n",
"    print(f\"Found {len(result_df)} matching molecules\")\n",
|
||||
" \n",
|
||||
" return result_df\n",
|
||||
"\n",
|
||||
"def save_screening_results(df, filename, description):\n",
"    \"\"\"Save screening results to CSV\"\"\"\n",
"    output_path = f\"../data/drug_targetmol/{filename}\"\n",
"    \n",
"    # Convert the ROMol column to SMILES (ROMol objects cannot be saved to CSV)\n",
"    df_export = df.copy()\n",
"    if 'ROMol' in df_export.columns:\n",
"        df_export['SMILES_from_mol'] = df_export['ROMol'].apply(lambda x: Chem.MolToSmiles(x) if x else '')\n",
"        df_export = df_export.drop('ROMol', axis=1)\n",
"    \n",
"    df_export.to_csv(output_path, index=False, encoding='utf-8')\n",
"    print(f\"Results saved to: {output_path}\")\n",
"    print(f\"Contains {len(df_export)} molecules and {len(df_export.columns)} property columns\")\n",
|
||||
" \n",
|
||||
" return output_path"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
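"## Scheme A screen: heteroaryl halides\n",
"\n",
"### Execution logic\n",
"- Uses the most conservative screening strategy\n",
"- Matches only halogens on heteroaromatic rings\n",
"- Expected to give high-precision results\n",
"- Requires further verification of the synthetic routes"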
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
"Starting screen: Heteroaryl halides\n",
"Description: Cl, Br, or I atoms on heteroaromatic rings (scheme A)\n",
"Number of SMARTS patterns: 7\n",
"Found 57 matching molecules\n",
"\n",
"Scheme A screening summary:\n",
|
||||
" Name CAS Formula \\\n",
|
||||
"1 Danicopan 1903768-17-1 C26H23BrFN7O3 \n",
|
||||
"8 Lonafarnib 193275-84-2 C27H31Br2ClN4O2 \n",
|
||||
"19 Idoxuridine 54-42-2 C9H11IN2O5 \n",
|
||||
"144 Dimenhydrinate 523-87-5 C24H28ClN5O3 \n",
|
||||
"259 Sertaconazole 99592-32-2 C20H15Cl3N2OS \n",
|
||||
"311 Tioconazole 65899-73-2 C16H13Cl3N2OS \n",
|
||||
"337 Gimeracil 103766-25-2 C5H4ClNO2 \n",
|
||||
"580 Bromocriptine mesylate 22260-51-1 C33H44BrN5O8S \n",
|
||||
"592 Clofarabine 123318-82-1 C10H11ClFN5O3 \n",
|
||||
"684 Vorasidenib 1644545-52-7 C14H13ClF6N6 \n",
|
||||
"\n",
|
||||
" matched_patterns \\\n",
|
||||
"1 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}, {'pattern_index': 2, 'smarts... \n",
|
||||
"8 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}, {'pattern_index': 5, 'smart... \n",
|
||||
"19 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 2}, {'pattern_index': 3, 'smart... \n",
|
||||
"144 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 2}] \n",
|
||||
"259 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}] \n",
|
||||
"311 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}] \n",
|
||||
"337 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}, {'pattern_index': 5, 'smart... \n",
|
||||
"580 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}] \n",
|
||||
"592 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 2}] \n",
|
||||
"684 [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}, {'pattern_index': 2, 'smarts... \n",
|
||||
"\n",
|
||||
" total_matches \n",
|
||||
"1 2 \n",
|
||||
"8 2 \n",
|
||||
"19 3 \n",
|
||||
"144 2 \n",
|
||||
"259 1 \n",
|
||||
"311 1 \n",
|
||||
"337 2 \n",
|
||||
"580 1 \n",
|
||||
"592 2 \n",
|
||||
"684 2 \n",
"Results saved to: ../data/drug_targetmol/sandmeyer_candidates_scheme_A_heteroaryl_halides.csv\n",
"Contains 57 molecules and 23 property columns\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
"# Run the scheme A screen\n",
"scheme_a_results = screen_molecules_for_patterns(df, 'heteroaryl_halides')\n",
"\n",
"# Show a summary of the results\n",
"if len(scheme_a_results) > 0:\n",
"    print(\"\\nScheme A screening summary:\")\n",
"    print(scheme_a_results[['Name', 'CAS', 'Formula', 'matched_patterns', 'total_matches']].head(10))\n",
"    \n",
"    # Save the results\n",
"    save_screening_results(\n",
"        scheme_a_results, \n",
"        'sandmeyer_candidates_scheme_A_heteroaryl_halides.csv',\n",
"        'Scheme A: heteroaryl halide screening results'\n",
"    )\n",
"else:\n",
"    print(\"\\nNo matching molecules found for scheme A\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"## Scheme B screen: all aryl halides\n",
"\n",
"### Execution logic\n",
"- Uses a more permissive screening strategy\n",
"- Matches halogens on any aromatic ring\n",
"- Includes more candidate molecules\n",
"- Requires more literature verification"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
"Starting screen: Aryl halides\n",
"Description: Cl, Br, or I atoms on any aromatic ring (scheme B)\n",
"Number of SMARTS patterns: 4\n",
"Found 548 matching molecules\n",
"\n",
"Scheme B screening summary:\n",
|
||||
" Name CAS Formula \\\n",
|
||||
"1 Danicopan 1903768-17-1 C26H23BrFN7O3 \n",
|
||||
"8 Lonafarnib 193275-84-2 C27H31Br2ClN4O2 \n",
|
||||
"9 Ketoconazole 65277-42-1 C26H28Cl2N4O4 \n",
|
||||
"13 Ozanimod 1306760-87-1 C23H24N4O3 \n",
|
||||
"14 Ponesimod 854107-55-4 C23H25ClN2O4S \n",
|
||||
"19 Idoxuridine 54-42-2 C9H11IN2O5 \n",
|
||||
"53 Moclobemide 71320-77-9 C13H17ClN2O2 \n",
|
||||
"74 Clemastine 15686-51-8 C21H26ClNO \n",
|
||||
"75 Buclizine dihydrochloride 129-74-8 C28H33ClN2·2HCl \n",
|
||||
"78 Asenapine 65576-45-6 C17H16ClNO \n",
|
||||
"\n",
|
||||
" matched_patterns \\\n",
|
||||
"1 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||||
"8 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 3}] \n",
|
||||
"9 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 2}, {'pattern_index': 3, 'smarts': 'c1c... \n",
|
||||
"13 [{'pattern_index': 1, 'smarts': 'c-C#N', 'matches': 1}] \n",
|
||||
"14 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||||
"19 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||||
"53 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||||
"74 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||||
"75 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||||
"78 [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}] \n",
|
||||
"\n",
|
||||
" total_matches \n",
|
||||
"1 1 \n",
|
||||
"8 3 \n",
|
||||
"9 3 \n",
|
||||
"13 1 \n",
|
||||
"14 1 \n",
|
||||
"19 1 \n",
|
||||
"53 1 \n",
|
||||
"74 1 \n",
|
||||
"75 1 \n",
|
||||
"78 1 \n",
"Results saved to: ../data/drug_targetmol/sandmeyer_candidates_scheme_B_aryl_halides.csv\n",
"Contains 548 molecules and 23 property columns\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
"# Run the scheme B screen\n",
"scheme_b_results = screen_molecules_for_patterns(df, 'aryl_halides')\n",
"\n",
"# Show a summary of the results\n",
"if len(scheme_b_results) > 0:\n",
"    print(\"\\nScheme B screening summary:\")\n",
"    print(scheme_b_results[['Name', 'CAS', 'Formula', 'matched_patterns', 'total_matches']].head(10))\n",
"    \n",
"    # Save the results\n",
"    save_screening_results(\n",
"        scheme_b_results, \n",
"        'sandmeyer_candidates_scheme_B_aryl_halides.csv',\n",
"        'Scheme B: all aryl halide screening results'\n",
"    )\n",
"else:\n",
"    print(\"\\nNo matching molecules found for scheme B\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"## Result analysis and summary\n",
"\n",
"### Screening statistics"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
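"=== Screening statistics ===\n",
"Total molecules: 3276\n",
"Scheme A (heteroaryl halides) matches: 57\n",
"Scheme B (all aryl halides) matches: 548\n",
"Molecules matched by both schemes: 57\n",
"Molecules matched only by scheme A: 0\n",
"Molecules matched only by scheme B: 490\n"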
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
"# Result statistics\n",
"print(\"=== Screening statistics ===\")\n",
"print(f\"Total molecules: {len(df)}\")\n",
"print(f\"Scheme A (heteroaryl halides) matches: {len(scheme_a_results)}\")\n",
"print(f\"Scheme B (all aryl halides) matches: {len(scheme_b_results)}\")\n",
"\n",
"if len(scheme_a_results) > 0 and len(scheme_b_results) > 0:\n",
"    # Analyze the overlap\n",
"    scheme_a_cas = set(scheme_a_results['CAS'].dropna())\n",
"    scheme_b_cas = set(scheme_b_results['CAS'].dropna())\n",
"    overlap = scheme_a_cas & scheme_b_cas\n",
"    print(f\"Molecules matched by both schemes: {len(overlap)}\")\n",
"    \n",
"    # Scheme A only\n",
"    scheme_a_only = scheme_a_cas - scheme_b_cas\n",
"    print(f\"Molecules matched only by scheme A: {len(scheme_a_only)}\")\n",
"    \n",
"    # Scheme B only\n",
"    scheme_b_only = scheme_b_cas - scheme_a_cas\n",
"    print(f\"Molecules matched only by scheme B: {len(scheme_b_only)}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
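"## Usage recommendations\n",
"\n",
"### Recommended priorities\n",
"\n",
"1. **First priority**: scheme A results\n",
"   - Heteroaryl halides are the most likely Sandmeyer reaction products\n",
"   - The candidate set is relatively small, which makes in-depth study practical\n",
"   - Focus on looking up the synthetic routes of these molecules\n",
"\n",
"2. **Second priority**: results unique to scheme B\n",
"   - Halogens on benzene rings can come from many routes\n",
"   - The synthetic plausibility needs careful evaluation\n",
"   - Suitable as a supplementary screen\n",
"\n",
"### Follow-up verification steps\n",
"\n",
"1. **Literature survey**: look up the synthetic routes of the candidate molecules\n",
"2. **Reaction-condition assessment**: confirm whether a Sandmeyer reaction was used\n",
"3. **Economic analysis**: assess the potential of applying the Zhang Xiaheng (张夏恒) reaction to the molecule\n",
"4. **Experimental verification**: run small-scale validation experiments where necessary\n",
"\n",
"### Caveats\n",
"\n",
"- This screen is based on structural features and does not confirm a synthetic route\n",
"- Some halogens may come from the starting materials rather than from a synthetic step\n",
"- Molecular complexity and synthetic feasibility need to be considered together\n",
"- Prioritize by also taking the drug's importance and market size into account"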
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
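"## Molecular structure visualization\n",
"\n",
"### Creating the output directory and visualization functions\n",
"\n",
"This section generates high-resolution SVG structure drawings for the screened candidate molecules, with the halogen atoms highlighted."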
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.14.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
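The statistics cell in the notebook above compares the two result tables by CAS number using set operations. A minimal, dependency-free sketch of that comparison follows. The function name `compare_schemes` and the sample CAS lists are hypothetical and exist only for illustration; they are not part of the notebook. Because the comparison works on sets, duplicate or missing identifiers collapse, so the three counts need not add up to the row counts of the tables.

```python
def compare_schemes(ids_a, ids_b):
    """Return overlap counts for two collections of identifiers (e.g. CAS numbers)."""
    # Converting to sets drops duplicates; None stands in for a missing value.
    set_a = {i for i in ids_a if i is not None}
    set_b = {i for i in ids_b if i is not None}
    return {
        "both": len(set_a & set_b),
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
    }


# Hypothetical sample data: scheme B has one duplicate and one missing value.
scheme_a = ["50-02-2", "54-42-2"]
scheme_b = ["50-02-2", "54-42-2", "523-87-5", "523-87-5", None]
result = compare_schemes(scheme_a, scheme_b)
print(result)
```

In this sample, scheme B has five rows but only three distinct identifiers, which is the kind of gap that can appear between a table's row count and its set-based statistics.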
|
||||
285
notebooks/smarts_match_visualization.ipynb
Normal file
@@ -0,0 +1,285 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"# SMARTS match detection and visualization\n",
"\n",
"This notebook:\n",
"1. Reads the smiles column from ring16/temp.csv\n",
"2. Tests each molecule against the SMARTS pattern `O=C1C[C@@H](O)[*:15][*:17][*:18]C[*:23]C(=O)/C=C/[*:28]=C/[*:7][*:8]O1`\n",
"3. Handles the dummy atoms ([*:X]) in two ways:\n",
"   - leaving the dummy atoms unreplaced\n",
"   - replacing the dummy atoms with C\n",
"4. Visualizes the matches with the matched atoms highlighted\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"## 1. Import the required libraries\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
"All modules imported successfully!\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import sys\n",
|
||||
"from pathlib import Path\n",
|
||||
"import re\n",
|
||||
"\n",
"# Add the project root directory to the Python path\n",
|
||||
"notebook_dir = Path().resolve()\n",
|
||||
"project_root = notebook_dir.parent\n",
|
||||
"sys.path.insert(0, str(project_root))\n",
|
||||
"\n",
|
||||
"from rdkit import Chem\n",
|
||||
"from rdkit.Chem import Draw\n",
|
||||
"from rdkit.Chem.Draw import rdMolDraw2D\n",
|
||||
"from IPython.display import SVG, display, HTML\n",
|
||||
"import pandas as pd\n",
|
||||
"import numpy as np\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"from collections import Counter\n",
|
||||
"\n",
"print(\"All modules imported successfully!\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"## 2. Read the data\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
"Dataset size: 2022 molecules\n",
"Columns: ['IDs', 'molecule_pref_name', 'max_pChEMBL', 'max_pChEMBL_target', '# Target Organisms', 'Target Organisms', '# Known Targets', 'Known Targets', 'target_pref_name', 'smiles']\n",
"\n",
"SMILES column present, with 2022 valid SMILES\n",
"\n",
"First 5 SMILES examples:\n",
|
||||
"['C/C(=C\\\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[C@H](C)[C@H](O)[C@@H](C)C(=O)C(C)(C)[C@@H](O)CC(=O)O1', 'C/C(=C\\\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[C@H](C)[C@H](O)[C@@H](C)C(=O)C(C)(C)[C@@H](O)CC(=O)O1', 'Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C)C(=O)N5CCC[C@H]5C(=O)N(C)CC(=O)N(C)[C@@H](C(C)C)C(=O)O[C@@H]4C)c3nc-2c(C(=O)N[C@@H]2C(=O)N[C@H](C(C)C)C(=O)N3CCC[C@H]3C(=O)N(C)CC(=O)N(C)[C@@H](C(C)C)C(=O)O[C@@H]2C)c(N)c1=O', 'CCCCCCCC(=O)SCC/C=C/[C@@H]1CC(=O)NCc2nc(cs2)C2=N[C@@](C)(CS2)C(=O)N[C@@H](C(C)C)C(=O)O1', 'Cc1cc2ccc1[C@@H](C)COC(=O)Nc1ccc(S(=O)(=O)C3CC3)c(c1)CN(C)C(=O)[C@@H]2Nc1ccc2c(N)ncc(F)c2c1']\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>IDs</th>\n",
|
||||
" <th>molecule_pref_name</th>\n",
|
||||
" <th>max_pChEMBL</th>\n",
|
||||
" <th>max_pChEMBL_target</th>\n",
|
||||
" <th># Target Organisms</th>\n",
|
||||
" <th>Target Organisms</th>\n",
|
||||
" <th># Known Targets</th>\n",
|
||||
" <th>Known Targets</th>\n",
|
||||
" <th>target_pref_name</th>\n",
|
||||
" <th>smiles</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>CHEMBL94657</td>\n",
|
||||
" <td>PATUPILONE</td>\n",
|
||||
" <td>10.67</td>\n",
|
||||
" <td>CHEMBL1075590</td>\n",
|
||||
" <td>695</td>\n",
|
||||
" <td>Sus scrofa, Mus musculus, None, Plasmodium fal...</td>\n",
|
||||
" <td>695</td>\n",
|
||||
" <td>CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE...</td>\n",
|
||||
" <td>AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6...</td>\n",
|
||||
" <td>C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>CHEMBL94657</td>\n",
|
||||
" <td>PATUPILONE</td>\n",
|
||||
" <td>10.67</td>\n",
|
||||
" <td>CHEMBL1075590</td>\n",
|
||||
" <td>695</td>\n",
|
||||
" <td>Sus scrofa, Mus musculus, None, Plasmodium fal...</td>\n",
|
||||
" <td>695</td>\n",
|
||||
" <td>CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE...</td>\n",
|
||||
" <td>AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6...</td>\n",
|
||||
" <td>C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>CHEMBL1554</td>\n",
|
||||
" <td>DACTINOMYCIN</td>\n",
|
||||
" <td>10.10</td>\n",
|
||||
" <td>CHEMBL614533</td>\n",
|
||||
" <td>177</td>\n",
|
||||
" <td>Giardia intestinalis, Trypanosoma cruzi, Equus...</td>\n",
|
||||
" <td>177</td>\n",
|
||||
" <td>CHEMBL388, CHEMBL614151, CHEMBL3577, CHEMBL551...</td>\n",
|
||||
" <td>HT-29, CCRF-CEM, WIL2-NS, Unchecked, Caspase-7...</td>\n",
|
||||
" <td>Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>CHEMBL1173445</td>\n",
|
||||
" <td>LARGAZOLE</td>\n",
|
||||
" <td>8.80</td>\n",
|
||||
" <td>CHEMBL612545</td>\n",
|
||||
" <td>45</td>\n",
|
||||
" <td>Homo sapiens, None</td>\n",
|
||||
" <td>45</td>\n",
|
||||
" <td>CHEMBL392, CHEMBL3192, CHEMBL3524, CHEMBL5103,...</td>\n",
|
||||
" <td>Histone deacetylase 9, Ubiquitin-like modifier...</td>\n",
|
||||
" <td>CCCCCCCC(=O)SCC/C=C/[C@@H]1CC(=O)NCc2nc(cs2)C2...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>CHEMBL3902498</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>9.37</td>\n",
|
||||
" <td>CHEMBL2095194,CHEMBL3991</td>\n",
|
||||
" <td>17</td>\n",
|
||||
" <td>Homo sapiens, None</td>\n",
|
||||
" <td>17</td>\n",
|
||||
" <td>CHEMBL2820, CHEMBL3991, CHEMBL1801, CHEMBL204,...</td>\n",
|
||||
" <td>Coagulation factor X, Kallikrein 1, Coagulatio...</td>\n",
|
||||
" <td>Cc1cc2ccc1[C@@H](C)COC(=O)Nc1ccc(S(=O)(=O)C3CC...</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" IDs molecule_pref_name max_pChEMBL max_pChEMBL_target \\\n",
|
||||
"0 CHEMBL94657 PATUPILONE 10.67 CHEMBL1075590 \n",
|
||||
"1 CHEMBL94657 PATUPILONE 10.67 CHEMBL1075590 \n",
|
||||
"2 CHEMBL1554 DACTINOMYCIN 10.10 CHEMBL614533 \n",
|
||||
"3 CHEMBL1173445 LARGAZOLE 8.80 CHEMBL612545 \n",
|
||||
"4 CHEMBL3902498 NaN 9.37 CHEMBL2095194,CHEMBL3991 \n",
|
||||
"\n",
|
||||
" # Target Organisms Target Organisms \\\n",
|
||||
"0 695 Sus scrofa, Mus musculus, None, Plasmodium fal... \n",
|
||||
"1 695 Sus scrofa, Mus musculus, None, Plasmodium fal... \n",
|
||||
"2 177 Giardia intestinalis, Trypanosoma cruzi, Equus... \n",
|
||||
"3 45 Homo sapiens, None \n",
|
||||
"4 17 Homo sapiens, None \n",
|
||||
"\n",
|
||||
" # Known Targets Known Targets \\\n",
|
||||
"0 695 CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE... \n",
|
||||
"1 695 CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE... \n",
|
||||
"2 177 CHEMBL388, CHEMBL614151, CHEMBL3577, CHEMBL551... \n",
|
||||
"3 45 CHEMBL392, CHEMBL3192, CHEMBL3524, CHEMBL5103,... \n",
|
||||
"4 17 CHEMBL2820, CHEMBL3991, CHEMBL1801, CHEMBL204,... \n",
|
||||
"\n",
|
||||
" target_pref_name \\\n",
|
||||
"0 AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6... \n",
|
||||
"1 AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6... \n",
|
||||
"2 HT-29, CCRF-CEM, WIL2-NS, Unchecked, Caspase-7... \n",
|
||||
"3 Histone deacetylase 9, Ubiquitin-like modifier... \n",
|
||||
"4 Coagulation factor X, Kallikrein 1, Coagulatio... \n",
|
||||
"\n",
|
||||
" smiles \n",
|
||||
"0 C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[... \n",
|
||||
"1 C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[... \n",
|
||||
"2 Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C... \n",
|
||||
"3 CCCCCCCC(=O)SCC/C=C/[C@@H]1CC(=O)NCc2nc(cs2)C2... \n",
|
||||
"4 Cc1cc2ccc1[C@@H](C)COC(=O)Nc1ccc(S(=O)(=O)C3CC... "
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
"# Read the CSV file\n",
"input_file = project_root / 'ring16' / 'temp.csv'\n",
"df = pd.read_csv(input_file)\n",
"\n",
"print(f\"Dataset size: {len(df)} molecules\")\n",
"print(f\"Columns: {df.columns.tolist()}\")\n",
"\n",
"# Check the smiles column\n",
"if 'smiles' in df.columns:\n",
"    print(f\"\\nSMILES column present, with {df['smiles'].notna().sum()} valid SMILES\")\n",
"    print(f\"\\nFirst 5 SMILES examples:\")\n",
"    print(df['smiles'].head().tolist())\n",
"else:\n",
"    print(\"Error: smiles column not found\")\n",
"    \n",
"df.head()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
"## 3. Define the SMARTS pattern and helper functions\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.14.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
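The notebook above describes two ways of handling the `[*:X]` dummy atoms in its SMARTS pattern, but the code cell for that step is empty in this commit. The following is a minimal sketch of the string-level part only, assuming the dummy atoms always have the form `[*:N]`. The helper names `dummy_labels` and `replace_dummy_atoms` are hypothetical and not taken from the repository.

```python
import re

# Matches atom-mapped wildcard atoms such as [*:15].
DUMMY_ATOM = re.compile(r"\[\*:(\d+)\]")


def dummy_labels(smarts):
    """Return the map numbers of all [*:N] dummy atoms, in order of appearance."""
    return [int(n) for n in DUMMY_ATOM.findall(smarts)]


def replace_dummy_atoms(smarts, replacement="C"):
    """Replace every [*:N] dummy atom with a fixed atom symbol."""
    return DUMMY_ATOM.sub(replacement, smarts)


pattern = "O=C1C[C@@H](O)[*:15][*:17][*:18]C[*:23]C(=O)/C=C/[*:28]=C/[*:7][*:8]O1"
labels = dummy_labels(pattern)
carbon_pattern = replace_dummy_atoms(pattern)
print(labels)
print(carbon_pattern)
```

Either string can then be passed to RDKit's `Chem.MolFromSmarts`. In the unreplaced form each `*` matches any atom, while in the replaced form each position matches only aliphatic carbon, so the second variant is the stricter query.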
|
||||
3322
notebooks/test_align_two_molecules.ipynb
Normal file
File diff suppressed because one or more lines are too long