first add

2025-11-14 20:34:58 +08:00
commit 0d99f7d12c
46 changed files with 698209 additions and 0 deletions
--- a/notebooks/README_analyze_ring16.md
+++ b/notebooks/README_analyze_ring16.md
@@ -0,0 +1,190 @@
+# Analysis Guide for 16-Membered Macrolactones
+
+## Files
+
+- **Notebook**: `analyze_ring16_molecules.ipynb`
+- **Input data**: `../output/ring16_match_smarts.csv` (307 molecules)
+- **Output directory**: `../output/`
+
+## Usage
+
+### 1. Activate the environment
+
+```bash
+cd /home/zly/project/macro_split
+pixi shell
+```
+
+### 2. Launch the notebook
+
+```bash
+jupyter notebook notebooks/analyze_ring16_molecules.ipynb
+```
+
+Or use JupyterLab:
+
+```bash
+jupyter lab notebooks/analyze_ring16_molecules.ipynb
+```
+
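+### 3. Run all cells in order
+
+The notebook automatically:
+1. Computes drug-relevant properties for every molecule (molecular weight, LogP, QED, TPSA, etc.)
+2. Performs the side-chain cleavage analysis
+3. Summarizes the fragment distribution at each position (3-16)
+4. Generates distribution plots and saves them to the `output/` directory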
+
+## Generated Output Files
+
+### Image files
+- `ring16_molecular_properties_distribution.png` - distributions of molecular properties (4 subplots)
+  - Molecular weight distribution
+  - LogP distribution
+  - QED distribution
+  - TPSA distribution
+
+- `atom_count_distribution_ring16.png` - atom-count distribution at each position (14 subplots, positions 3-16)
+
+- `molecular_weight_distribution_ring16.png` - molecular-weight distribution at each position (14 subplots, positions 3-16)
+
+### Data files
+- `ring16_fragments_analysis.csv` - detailed information for every fragment
+  - Columns: fragment_id, parent_id, parent_smiles, cleavage_position, fragment_smiles, atom_count, molecular_weight
+
+- `ring16_molecular_properties.csv` - property data for every molecule
+  - Columns: unique_id, mol_weight, logP, num_h_donors, num_h_acceptors, num_rotatable_bonds, tpsa, qed, num_atoms, num_heavy_atoms
+
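+## Analysis Contents
+
+### Completed
+
+1. **Basic molecular property calculation** ✅
+   - Molecular weight, LogP, QED, TPSA
+   - Hydrogen-bond donor and acceptor counts, rotatable-bond count
+   - Atom count, heavy-atom count
+
+2. **Side-chain cleavage analysis** ✅
+   - Uses the packaged `MacrolactoneFragmenter` class
+   - Batch-processes all 307 molecules
+   - Counts the fragment types and their occurrences at each position
+
+3. **Distribution plots** ✅
+   - Follows the plotting logic of `test_align_two_molecules.ipynb`
+   - 4x4 subplot grid showing the distributions for positions 3-16 (14 subplots)
+   - Plotted with seaborn and matplotlib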
+
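+### Suggested Follow-up Analyses
+
+The last cell of the notebook (Section 9) gives detailed suggestions for follow-up analyses, including:
+
+#### Priority 1 (strongly recommended) ⭐⭐⭐
+- **LogP analysis**: identify the side-chain positions that contribute most to lipophilicity
+- **QED analysis**: compare the side chains of high-QED and low-QED molecules
+- **TPSA analysis**: examine how polar side chains are distributed
+
+#### Priority 2 (important) ⭐⭐⭐
+- **SAR analysis**: if activity data (max_pChEMBL) are available, analyze structure-activity relationships
+- **Privileged side chains**: find the side chains that occur frequently in active molecules
+
+#### Priority 3 (valuable) ⭐⭐
+- **Fragment diversity analysis**: count the unique fragment types at each position
+- **Cluster analysis**: cluster molecules by fragment fingerprints
+- **Polarity/hydrophobicity analysis**: characterize the polarity of the side chains
+
+#### Optional analyses ⭐
+- **3D properties**: 3D descriptors such as PMI and NPR
+- **Lipinski rules**: check the drug-likeness rules
+- **Stereochemistry**: analyze chiral centers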
+
+## Code Example
+
+The notebook contains complete code examples that can be run directly or modified. The main functions are:
+
+```python
+# 1. Compute molecular properties
+props = calculate_properties(smiles)
+
+# 2. Batch cleavage
+fragmenter = MacrolactoneFragmenter(ring_size=16)
+batch_results = fragmenter.process_csv(csv_file)
+
+# 3. Summary statistics
+df_fragments = fragmenter.batch_to_dataframe(batch_results)
+position_stats = df_fragments.groupby('cleavage_position').agg(...)
+
+# 4. Plotting
+sns.histplot(values, kde=True, ax=ax, bins=30)
+```
+
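+## Key Insights
+
+### Why LogP matters
+- It reflects a molecule's lipophilicity, which is critical for membrane permeability and drug distribution
+- Macrolactones typically have high LogP values
+- Knowing how each side chain contributes to LogP helps guide drug-design optimization
+
+### What QED tells us
+- QED is a composite score of "drug-likeness"
+- Macrolactones often violate Lipinski's rules (molecular weight > 500) yet can still be good drugs
+- Comparing high-QED and low-QED molecules can reveal the side chains that most affect drug-likeness
+
+### Why TPSA matters
+- TPSA correlates closely with oral bioavailability (values below 140 Å² are generally preferred)
+- Polar side chains contribute substantially to TPSA
+- It can guide side-chain modification strategies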
+
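+## Notes
+
+1. **Environment requirements**:
+   - `seaborn` and `matplotlib` must be installed
+   - If they are missing, the notebook prompts: `pixi add seaborn matplotlib`
+
+2. **Processing time**:
+   - Processing the 307 molecules may take a few minutes
+   - Drawing the distribution plots also takes some time
+
+3. **Memory usage**:
+   - Batch processing and plotting consume a fair amount of memory
+   - If you run into memory problems, reduce the `max_rows` parameter
+
+4. **Image resolution**:
+   - Figures are saved at 300 DPI by default
+   - Adjust the `dpi` parameter as needed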
+
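+## Future Work
+
+Based on the analysis results, the suggested next steps are:
+
+1. **Quantitative relationship between LogP and side chains**
+   - Compute the change in LogP after removing each side chain
+   - Identify the positions that contribute most to LogP
+
+2. **Linking to activity data** (if available)
+   - Analyze the side-chain features of highly active molecules
+   - Identify "privileged side chains"
+
+3. **Fragment library construction**
+   - Catalog the common fragments at each position
+   - Use them to guide the design of new molecules
+
+4. **Machine-learning prediction**
+   - Predict molecular properties from fragment features
+   - Build QSAR models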
+
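+## References
+
+- `filter_molecules.ipynb` - molecule filtering and cleavage logic
+- `test_align_two_molecules.ipynb` - plotting logic reference
+- `src/macrolactone_fragmenter.py` - the packaged fragmenter class
+- `src/ring_visualization.py` - visualization utilities
+
+## Troubleshooting
+
+If you run into problems:
+1. Check that you are inside the `pixi shell` environment
+2. Confirm that all dependencies are installed
+3. Check that the output directory is writable
+4. Check that the CSV file path is correct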
+
+
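Review note on step 3 of the README above (per-position fragment statistics). The following is a minimal, standard-library-only sketch of how rows shaped like `ring16_fragments_analysis.csv` could be summarized by `cleavage_position`. The column names are the ones listed in the README; the function name `summarize_by_position` and the sample rows are hypothetical, and the notebook itself relies on a pandas `groupby` rather than this code.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def summarize_by_position(rows):
    """Group fragment rows by cleavage_position and summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[int(row["cleavage_position"])].append(row)

    summary = {}
    for position in sorted(groups):
        frags = groups[position]
        summary[position] = {
            "n_fragments": len(frags),
            "n_unique_smiles": len({f["fragment_smiles"] for f in frags}),
            "mean_atom_count": mean(int(f["atom_count"]) for f in frags),
            "mean_molecular_weight": mean(
                float(f["molecular_weight"]) for f in frags
            ),
        }
    return summary


# Hypothetical sample rows, NOT taken from the real output file.
SAMPLE_CSV = """fragment_id,parent_id,parent_smiles,cleavage_position,fragment_smiles,atom_count,molecular_weight
f1,m1,,3,C,1,16.04
f2,m1,,5,CC,2,30.07
f3,m2,,3,C,1,16.04
f4,m2,,5,CO,2,32.04
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = summarize_by_position(rows)
for position, stats in summary.items():
    print(position, stats)
```

To try it on the real data, the same function could be fed `csv.DictReader(open(path, newline=""))`, assuming the file keeps the column names documented in the README.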
--- a/notebooks/SIME-MacroValidator.ipynb
+++ b/notebooks/SIME-MacroValidator.ipynb
--- a/notebooks/analyze_ring12_20_molecules_CLEAN.ipynb
+++ b/notebooks/analyze_ring12_20_molecules_CLEAN.ipynb
--- a/notebooks/analyze_ring16_molecules.ipynb
+++ b/notebooks/analyze_ring16_molecules.ipynb
--- a/notebooks/demo_single_molecule.ipynb
+++ b/notebooks/demo_single_molecule.ipynb
--- a/notebooks/drug_screening_sandmeyer.ipynb
+++ b/notebooks/drug_screening_sandmeyer.ipynb
@@ -0,0 +1,93 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
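+    "# SMARTS Screening of Drug Molecules: Replacing the Sandmeyer Reaction with the Zhang Xiaheng Reaction\n",
+    "\n",
+    "## Background\n",
+    "\n",
+    "This notebook screens a drug-molecule database for compounds that could be synthesized with the **Zhang Xiaheng reaction (张夏恒反应)** in place of the **Sandmeyer reaction**.\n",
+    "\n",
+    "### Key concepts\n",
+    "\n",
+    "**Sandmeyer reaction**: the traditional method for converting aromatic amines\n",
+    "- Scheme: Ar-NH₂ → [Ar-N₂⁺] → Ar-X\n",
+    "- Products: aryl halides and related Ar-X compounds (X = Cl, Br, I, CN, OH, SCN, etc.)\n",
+    "\n",
+    "**Zhang Xiaheng reaction**: an emerging green method\n",
+    "- Offers a more environmentally friendly synthetic route\n",
+    "- May replace the traditional Sandmeyer reaction\n",
+    "\n",
+    "### Screening strategy\n",
+    "\n",
+    "Based on the principle of **isomeric bioisosteric replacement**:\n",
+    "- If compound A (made via the Sandmeyer reaction) is active,\n",
+    "- then compound B (the same scaffold made via the Zhang Xiaheng reaction) may show similar activity\n",
+    "\n",
+    "### Screening logic\n",
+    "\n",
+    "**Core assumption**: drugs that contain an aromatic halogen may have been synthesized via the Sandmeyer reaction\n",
+    "\n",
+    "**Priority ranking**:\n",
+    "1. **Heteroaryl halides** (highest priority)\n",
+    "   - Chloropyridines, chloropyrimidines, etc.\n",
+    "   - These structures are more likely to be made by Sandmeyer or SNAr chemistry\n",
+    "   \n",
+    "2. **General aryl halides** (high priority)\n",
+    "   - Any aromatic chloride, bromide, or iodide\n",
+    "   - May come from a Sandmeyer reaction; requires literature verification\n",
+    "\n",
+    "### Three screening plans\n",
+    "\n",
+    "#### Plan A (most conservative): heteroaryl halide screening\n",
+    "- **SMARTS patterns**: `n:c-[Cl,Br,I]` or `n1c([Cl,Br,I])cccc1`\n",
+    "- **Advantage**: highest precision, low false-positive rate\n",
+    "- **Use case**: quickly finding the most likely candidate drugs\n",
+    "- **Expected result**: few candidates, but precise\n",
+    "\n",
+    "#### Plan B (balanced): screening for all aromatic halides\n",
+    "- **SMARTS pattern**: `c[Cl,Br,I]`\n",
+    "- **Advantage**: broader coverage, balancing precision and breadth\n",
+    "- **Use case**: comprehensive screening of a drug library\n",
+    "- **Expected result**: a moderate number of candidates with a moderate false-positive rate\n",
+    "\n",
+    "#### Plan C (removed): simplified version\n",
+    "- Screens only for halogen-containing compounds\n",
+    "- Low precision; deprecated\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## File information\n",
+    "\n",
+    "- **Input file**: `/data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
+    "- **Output directory**: `/data/drug_targetmol/`\n",
+    "- **Output files**:\n",
+    "  - `candidates_planA_heteroaryl_halides.csv` (Plan A results)\n",
+    "  - `candidates_planB_all_aromatic_halides.csv` (Plan B results)"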
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/notebooks/filter_molecules.ipynb
+++ b/notebooks/filter_molecules.ipynb
--- a/notebooks/mactch_test.ipynb
+++ b/notebooks/mactch_test.ipynb
--- a/notebooks/rdkit_show.ipynb
+++ b/notebooks/rdkit_show.ipynb
--- a/notebooks/screen_aniline_candidates.ipynb
+++ b/notebooks/screen_aniline_candidates.ipynb
@@ -0,0 +1,754 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
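+    "# Screening Aromatic-Amine Drug Candidates: Sandmeyer Starting-Material Analysis\n",
+    "\n",
+    "## Background\n",
+    "\n",
+    "### The Sandmeyer reaction in brief\n",
+    "The Sandmeyer reaction is the classic method for converting aromatic amines:\n",
+    "**Ar-NH₂ → [Ar-N₂]⁺ → Ar-X**\n",
+    "where X = Cl, Br, I, CN, OH, SCN, etc.\n",
+    "\n",
+    "### Screening goal\n",
+    "By identifying drug molecules that contain an aromatic amine (Ar-NH₂),\n",
+    "we look for candidates that could serve as Sandmeyer starting materials.\n",
+    "Such molecules may originally have had an aromatic halogen introduced via the Sandmeyer reaction\n",
+    "and could now be converted more efficiently with the Zhang Xiaheng reaction (张夏恒反应).\n",
+    "\n",
+    "### SMARTS pattern\n",
+    "The SMARTS pattern `[c,n][NH2]` matches:\n",
+    "- `[c,n]`: an aromatic carbon or aromatic nitrogen atom\n",
+    "- `[NH2]`: a primary amino group (-NH₂)\n",
+    "\n",
+    "**Important:**\n",
+    "- This screen is based on structural features only\n",
+    "- The synthetic route must ultimately be confirmed from the literature\n",
+    "- Not every drug containing an aromatic amine is made via the Sandmeyer reaction"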
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Import required libraries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from pathlib import Path\n",
+    "from rdkit import Chem\n",
+    "from rdkit.Chem import PandasTools, Draw\n",
+    "from rdkit.Chem.Draw import rdMolDraw2D\n",
+    "from IPython.display import SVG, display\n",
+    "from rdkit.Chem import AllChem\n",
+    "import pandas as pd\n",
+    "import warnings\n",
+    "warnings.filterwarnings('ignore')\n",
+    "\n",
+    "# Configure pandas display options\n",
+    "pd.set_option('display.max_columns', None)\n",
+    "pd.set_option('display.max_colwidth', 100)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
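+    "## Define the screening pattern and visualization function\n",
+    "\n",
+    "### SMARTS pattern setup\n",
+    "- **Target pattern**: `[c,n][NH2]` - an amino group attached to an aromatic carbon or nitrogen atom\n",
+    "- **Matching logic**: find every molecule that contains this substructure"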
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "使用SMARTS模式: [c,n][NH2]\n",
+      "模式验证: ✓\n",
+      "\n",
+      "创建目录：../data/drug_targetmol/aniline_candidates\n",
+      "创建可视化目录：../data/drug_targetmol/aniline_candidates/visualizations\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Define the screening pattern\n",
+    "TARGET_SMARTS = '[c,n][NH2]'\n",
+    "pattern = Chem.MolFromSmarts(TARGET_SMARTS)\n",
+    "\n",
+    "if pattern is None:\n",
+    "    raise ValueError(f\"Invalid SMARTS pattern: {TARGET_SMARTS}\")\n",
+    "\n",
+    "print(f\"使用SMARTS模式: {TARGET_SMARTS}\")\n",
+    "print(f\"模式验证: {'✓' if pattern else '✗'}\")\n",
+    "\n",
+    "# Create the output directories (including any missing parents)\n",
+    "output_base = Path(\"../data/drug_targetmol\")\n",
+    "output_dir = output_base / \"aniline_candidates\"\n",
+    "visualization_dir = output_dir / \"visualizations\"\n",
+    "\n",
+    "output_dir.mkdir(parents=True, exist_ok=True)\n",
+    "visualization_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"\\n创建目录：{output_dir}\")\n",
+    "print(f\"创建可视化目录：{visualization_dir}\")\n",
+    "\n",
+    "def generate_highlighted_svg(mol, highlight_atoms, filename, title=\"\"):\n",
+    "    \"\"\"Generate a high-resolution SVG with the matched substructure highlighted.\"\"\"\n",
+    "    # Compute 2D coordinates\n",
+    "    AllChem.Compute2DCoords(mol)\n",
+    "    \n",
+    "    # Create the SVG drawer\n",
+    "    drawer = rdMolDraw2D.MolDraw2DSVG(1200, 900)  # large canvas for a sharper image\n",
+    "    drawer.SetFontSize(12)\n",
+    "    \n",
+    "    # Drawing options\n",
+    "    draw_options = drawer.drawOptions()\n",
+    "    draw_options.addAtomIndices = False  # hide atom indices to keep the drawing clean\n",
+    "    draw_options.addBondIndices = False\n",
+    "    draw_options.addStereoAnnotation = True\n",
+    "    draw_options.fixedFontSize = 12\n",
+    "    \n",
+    "    # Highlight the matched atoms (blue)\n",
+    "    atom_colors = {}\n",
+    "    for atom_idx in highlight_atoms:\n",
+    "        atom_colors[atom_idx] = (0.3, 0.3, 1.0)  # blue highlight\n",
+    "    \n",
+    "    # Draw the molecule\n",
+    "    drawer.DrawMolecule(mol, \n",
+    "                       highlightAtoms=highlight_atoms,\n",
+    "                       highlightAtomColors=atom_colors)\n",
+    "    \n",
+    "    drawer.FinishDrawing()\n",
+    "    svg_content = drawer.GetDrawingText()\n",
+    "    \n",
+    "    # Add the title\n",
+    "    if title:\n",
+    "        # Split the SVG into lines so the title can be inserted\n",
+    "        svg_lines = svg_content.split(\"\\n\")\n",
+    "        # Insert the title before the first <g ... transform ...> tag, if one exists\n",
+    "        for i, line in enumerate(svg_lines):\n",
+    "            if \"<g \" in line and \"transform\" in line:\n",
+    "                svg_lines.insert(i, f\"<text x=\\\"50%\\\" y=\\\"30\\\" text-anchor=\\\"middle\\\" font-size=\\\"16\\\" font-weight=\\\"bold\\\">{title}</text>\")\n",
+    "                break\n",
+    "        svg_with_title = \"\\n\".join(svg_lines)\n",
+    "    else:\n",
+    "        svg_with_title = svg_content\n",
+    "    \n",
+    "    # Save to file\n",
+    "    with open(filename, \"w\") as f:\n",
+    "        f.write(svg_with_title)\n",
+    "    \n",
+    "    return svg_content"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
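+    "## Data loading and molecule screening\n",
+    "\n",
+    "### Data source\n",
+    "- File location: `data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
+    "- Contains drug structures together with rich property annotations\n",
+    "\n",
+    "### Screening logic\n",
+    "1. Read the SDF file\n",
+    "2. Run the SMARTS match on each molecule\n",
+    "3. Record the matched atoms and the number of matches\n",
+    "4. Save the matches to CSV\n",
+    "5. Generate highlighted visualizations"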
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "正在读取SDF文件...\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[21:24:23] Both bonds on one end of an atropisomer are on the same side - atoms is : 3\n",
+      "[21:24:23] Explicit valence for atom # 2 N greater than permitted\n",
+      "[21:24:23] ERROR: Could not sanitize molecule ending on line 217340\n",
+      "[21:24:23] ERROR: Explicit valence for atom # 2 N greater than permitted\n",
+      "[21:24:24] Explicit valence for atom # 4 N greater than permitted\n",
+      "[21:24:24] ERROR: Could not sanitize molecule ending on line 317283\n",
+      "[21:24:24] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
+      "[21:24:24] Explicit valence for atom # 4 N greater than permitted\n",
+      "[21:24:24] ERROR: Could not sanitize molecule ending on line 324666\n",
+      "[21:24:24] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
+      "[21:24:24] Explicit valence for atom # 5 N greater than permitted\n",
+      "[21:24:24] ERROR: Could not sanitize molecule ending on line 365883\n",
+      "[21:24:24] ERROR: Explicit valence for atom # 5 N greater than permitted\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "成功加载 3276 个分子\n",
+      "\n",
+      "数据概览：\n",
+      "  Index    Plate Row Col ID                           Name  \\\n",
+      "0     1  L1010-1   a   2                     Dexamethasone   \n",
+      "1     2  L1010-1   a   3                         Danicopan   \n",
+      "2     3  L1010-1   a   4                     Cyclosporin A   \n",
+      "3     4  L1010-1   a   5                       L-Carnitine   \n",
+      "4     5  L1010-1   a   6     Trimetazidine dihydrochloride   \n",
+      "\n",
+      "                                       Synonyms           CAS  \\\n",
+      "0  MK 125;Prednisolone F;NSC 34521;Hexadecadrol       50-02-2   \n",
+      "1                                      ACH-4471  1903768-17-1   \n",
+      "2       Cyclosporine A;Ciclosporin;Cyclosporine    59865-13-3   \n",
+      "3                  L(-)-Carnitine;Levocarnitine      541-15-1   \n",
+      "4               Yoshimilon;Kyurinett;Vastarel F    13171-25-0   \n",
+      "\n",
+      "                                                                                                SMILES  \\\n",
+      "0              C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO   \n",
+      "1                        CC(=O)c1nn(CC(=O)N2C[C@H](F)C[C@H]2C(=O)Nc2cccc(Br)n2)c2ccc(cc12)-c1cnc(C)nc1   \n",
+      "2  [C@H]([C@@H](C/C=C/C)C)(O)[C@@]1(N(C)C(=O)[C@H]([C@@H](C)C)N(C)C(=O)[C@H](CC(C)C)N(C)C(=O)[C@H](...   \n",
+      "3                                                                      C[N+](C)(C)C[C@@H](O)CC([O-])=O   \n",
+      "4                                                               Cl.Cl.COC1=C(OC)C(OC)=C(CN2CCNCC2)C=C1   \n",
+      "\n",
+      "         Formula    MolWt Approved status  \\\n",
+      "0      C22H29FO5   392.46    NMPA;EMA;FDA   \n",
+      "1  C26H23BrFN7O3   580.41             FDA   \n",
+      "2  C62H111N11O12  1202.61             FDA   \n",
+      "3       C7H15NO3    161.2             FDA   \n",
+      "4  C14H24Cl2N2O3  339.258        NMPA;EMA   \n",
+      "\n",
+      "                                                             Pharmacopoeia  \\\n",
+      "0                                            USP39-NF34;BP2015;JP16;IP2010   \n",
+      "1                                                                      NaN   \n",
+      "2  Martindale the Extra Pharmacopoei, EP10.2, USP43-NF38, Ph.Int_6th, JP17   \n",
+      "3                                                                      NaN   \n",
+      "4         BP2019;KP Ⅹ;EP9.2;IP2010;JP17;Martindale:The Extra Pharmacopoeia   \n",
+      "\n",
+      "                 Disease  \\\n",
+      "0             Metabolism   \n",
+      "1                 Others   \n",
+      "2          Immune system   \n",
+      "3  Cardiovascular system   \n",
+      "4  Cardiovascular system   \n",
+      "\n",
+      "                                                                                              Pathways  \\\n",
+      "0  Antibody-drug Conjugate/ADC Related;Autophagy;Endocrinology/Hormones;Immunology/Inflammation;Mic...   \n",
+      "1                                                                              Immunology/Inflammation   \n",
+      "2                                             Immunology/Inflammation;Metabolism;Microbiology/Virology   \n",
+      "3                                                                                           Metabolism   \n",
+      "4                                                                                 Autophagy;Metabolism   \n",
+      "\n",
+      "                                                                                                Target  \\\n",
+      "0  Antibacterial;Antibiotic;Autophagy;Complement System;Glucocorticoid Receptor;IL Receptor;Mitopha...   \n",
+      "1                                                                                    Complement System   \n",
+      "2                                                             Phosphatase;Antibiotic;Complement System   \n",
+      "3                                                            Endogenous Metabolite;Fatty Acid Synthase   \n",
+      "4                                                                        Autophagy;Fatty Acid Synthase   \n",
+      "\n",
+      "                                                                                              Receptor  \\\n",
+      "0  Antibiotic; Autophagy; Bacterial; Complement System; Glucocorticoid Receptor; IL receptor; Mitop...   \n",
+      "1                                                                          Complement System; factor D   \n",
+      "2                                  Antibiotic; calcineurin phosphatase; Complement System; Phosphatase   \n",
+      "3                                                                           Endogenous Metabolite; FAS   \n",
+      "4                                              Autophagy; mitochondrial long-chain 3-ketoacyl thiolase   \n",
+      "\n",
+      "                                                                                           Bioactivity  \\\n",
+      "0  Dexamethasone is a glucocorticoid receptor agonist and IL receptor modulator with anti-inflammat...   \n",
+      "1  Danicopan (ACH-4471) (ACH-4471) is a selective, orally active small molecule factor D inhibitor ...   \n",
+      "2  Cyclosporin A is a natural product and an active fungal metabolite, classified as a cyclic polyp...   \n",
+      "3  L-Carnitine (L(-)-Carnitine) is an amino acid derivative. L-Carnitine facilitates long-chain fat...   \n",
+      "4  Trimetazidine dihydrochloride (Vastarel F) can improve myocardial glucose utilization by inhibit...   \n",
+      "\n",
+      "                                                                                             Reference  \\\n",
+      "0  Li M, Yu H. Identification of WP1066, an inhibitor of JAK2 and STAT3, as a Kv1. 3 potassium chan...   \n",
+      "1  Yuan X, et al. Small-molecule factor D inhibitors selectively block the alternative pathway of c...   \n",
+      "2  D'Angelo G, et al. Cyclosporin A prevents the hypoxic adaptation by activating hypoxia-inducible...   \n",
+      "3                                                    Jogl G, Tong L. Cell. 2003 Jan 10; 112(1):113-22.   \n",
+      "4  Yang Q, et al. Int J Clin Exp Pathol. 2015, 8(4):3735-3741.;Liu Z, et al. Metabolism. 2016, 65(3...   \n",
+      "\n",
+      "                                              ROMol  \n",
+      "0  <rdkit.Chem.rdchem.Mol object at 0x77530d73c820>  \n",
+      "1  <rdkit.Chem.rdchem.Mol object at 0x77530d73c890>  \n",
+      "2  <rdkit.Chem.rdchem.Mol object at 0x77530a3f6f10>  \n",
+      "3  <rdkit.Chem.rdchem.Mol object at 0x77530a3f70d0>  \n",
+      "4  <rdkit.Chem.rdchem.Mol object at 0x77530a3f7140>  \n",
+      "\n",
+      "列名：['Index', 'Plate', 'Row', 'Col', 'ID', 'Name', 'Synonyms', 'CAS', 'SMILES', 'Formula', 'MolWt', 'Approved status', 'Pharmacopoeia', 'Disease', 'Pathways', 'Target', 'Receptor', 'Bioactivity', 'Reference', 'ROMol']\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Read the SDF file\n",
+    "sdf_path = '../data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf'\n",
+    "\n",
+    "print(\"正在读取SDF文件...\")\n",
+    "df = PandasTools.LoadSDF(sdf_path)\n",
+    "print(f\"成功加载 {len(df)} 个分子\")\n",
+    "\n",
+    "# Show basic information about the data\n",
+    "print(\"\\n数据概览：\")\n",
+    "print(df.head())\n",
+    "print(f\"\\n列名：{list(df.columns)}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "开始筛选芳香胺结构...\n",
+      "SMARTS模式: [c,n][N&H2]\n",
+      "找到 262 个匹配分子（处理了 3276 个分子）\n",
+      "\n",
+      "筛选结果摘要：\n",
+      "                  Name          CAS      Formula  total_matches\n",
+      "17           Guanosine     118-00-3   C10H13N5O5              1\n",
+      "20         Ganciclovir   82410-32-0    C9H13N5O4              1\n",
+      "22   Imiquimod maleate  896106-16-4   C18H20N4O4              1\n",
+      "27       Brincidofovir  444805-28-1  C27H52N3O7P              1\n",
+      "28           Imiquimod   99011-02-6     C14H16N4              1\n",
+      "32  Ganciclovir sodium  107910-75-8  C9H13N5NaO4              1\n",
+      "33          Cytarabine     147-94-4    C9H13N3O5              1\n",
+      "35          Vidarabine    5536-17-4   C10H13N5O4              1\n",
+      "38         Penciclovir   39809-25-1   C10H15N5O3              1\n",
+      "41         Famciclovir  104227-87-4   C14H19N5O4              1\n",
+      "... 还有 252 个分子\n"
+     ]
+    }
+   ],
+   "source": [
+    "def screen_molecules_for_aniline(df, smarts_pattern, max_molecules=100):\n",
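+    "    \"\"\"\n",
+    "    Screen for molecules that contain an aromatic amine.\n",
+    "    \n",
+    "    Args:\n",
+    "        df: DataFrame containing the molecules (in an 'ROMol' column)\n",
+    "        smarts_pattern: RDKit query molecule built from a SMARTS pattern\n",
+    "        max_molecules: maximum number of valid molecules to process\n",
+    "    \n",
+    "    Returns:\n",
+    "        DataFrame of the matched molecules\n",
+    "    \"\"\"\n",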
+    "    print(f\"开始筛选芳香胺结构...\")\n",
+    "    print(f\"SMARTS模式: {Chem.MolToSmarts(smarts_pattern)}\")\n",
+    "    \n",
+    "    matched_molecules = []\n",
+    "    processed_count = 0\n",
+    "    \n",
+    "    for idx, row in df.iterrows():\n",
+    "        if processed_count >= max_molecules:\n",
+    "            break\n",
+    "            \n",
+    "        mol = row['ROMol']\n",
+    "        if mol is None:\n",
+    "            continue\n",
+    "            \n",
+    "        processed_count += 1\n",
+    "        \n",
+    "        # Check whether the molecule matches the SMARTS pattern\n",
+    "        if mol.HasSubstructMatch(smarts_pattern):\n",
+    "            matches = mol.GetSubstructMatches(smarts_pattern)\n",
+    "            \n",
+    "            # Collect every matched atom index\n",
+    "            matched_atoms = set()\n",
+    "            for match in matches:\n",
+    "                matched_atoms.update(match)\n",
+    "            \n",
+    "            # Build the match record\n",
+    "            match_record = row.copy()\n",
+    "            match_record['matched_atoms'] = list(matched_atoms)\n",
+    "            match_record['total_matches'] = len(matches)\n",
+    "            match_record['smarts_pattern'] = Chem.MolToSmarts(smarts_pattern)\n",
+    "            matched_molecules.append(match_record)\n",
+    "    \n",
+    "    result_df = pd.DataFrame(matched_molecules)\n",
+    "    print(f\"找到 {len(result_df)} 个匹配分子（处理了 {processed_count} 个分子）\")\n",
+    "    \n",
+    "    return result_df\n",
+    "\n",
+    "# Run the screen\n",
+    "matched_df = screen_molecules_for_aniline(df, pattern, max_molecules=1000000)\n",
+    "\n",
+    "# Show a summary of the results\n",
+    "if len(matched_df) > 0:\n",
+    "    print(\"\\n筛选结果摘要：\")\n",
+    "    summary_cols = ['Name', 'CAS', 'Formula', 'total_matches']\n",
+    "    if len(matched_df) <= 10:\n",
+    "        print(matched_df[summary_cols])\n",
+    "    else:\n",
+    "        print(matched_df[summary_cols].head(10))\n",
+    "        print(f\"... 还有 {len(matched_df) - 10} 个分子\")\n",
+    "else:\n",
+    "    print(\"\\n未找到匹配分子\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
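+    "## Save the screening results\n",
+    "\n",
+    "### Output files\n",
+    "1. **CSV file**: property information and match details for every matched molecule\n",
+    "2. **SVG images**: a structure drawing of each matched molecule with the aromatic amine highlighted"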
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CSV结果已保存到：../data/drug_targetmol/aniline_candidates/aniline_candidates.csv\n",
+      "包含 262 个分子，23 个属性列\n",
+      "\n",
+      "开始生成可视化图片（最多500个）...\n",
+      "已生成 10 个分子图片\n",
+      "已生成 20 个分子图片\n",
+      "已生成 30 个分子图片\n",
+      "已生成 40 个分子图片\n",
+      "已生成 50 个分子图片\n",
+      "已生成 60 个分子图片\n",
+      "已生成 70 个分子图片\n",
+      "已生成 80 个分子图片\n",
+      "已生成 90 个分子图片\n",
+      "已生成 100 个分子图片\n",
+      "已生成 110 个分子图片\n",
+      "已生成 120 个分子图片\n",
+      "已生成 130 个分子图片\n",
+      "已生成 140 个分子图片\n",
+      "已生成 150 个分子图片\n",
+      "已生成 160 个分子图片\n",
+      "已生成 170 个分子图片\n",
+      "已生成 180 个分子图片\n",
+      "已生成 190 个分子图片\n",
+      "已生成 200 个分子图片\n",
+      "已生成 210 个分子图片\n",
+      "已生成 220 个分子图片\n",
+      "已生成 230 个分子图片\n",
+      "已生成 240 个分子图片\n",
+      "已生成 250 个分子图片\n",
+      "已生成 260 个分子图片\n",
+      "完成！共生成 262 个可视化图片\n",
+      "\n",
+      "示例图片: 118-00-3_Guanosine.svg\n"
+     ]
+    },
+    {
+     "data": {
+      "image/svg+xml": [
+       "<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:rdkit=\"http://www.rdkit.org/xml\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" version=\"1.1\" baseProfile=\"full\" xml:space=\"preserve\" width=\"1200px\" height=\"900px\" viewBox=\"0 0 1200 900\">\n",
+       "<!-- END OF HEADER -->\n",
+       "<rect style=\"opacity:1.0;fill:#FFFFFF;stroke:none\" width=\"1200.0\" height=\"900.0\" x=\"0.0\" y=\"0.0\"> </rect>\n",
+       "<path class=\"bond-0 atom-0 atom-1\" d=\"M 912.0,197.7 L 940.1,201.0 L 924.8,332.9 L 896.6,329.6 Z\" style=\"fill:#4C4CFF;fill-rule:evenodd;fill-opacity:1;stroke:#4C4CFF;stroke-width:0.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
+       "<ellipse cx=\"932.9\" cy=\"201.5\" rx=\"26.6\" ry=\"26.6\" class=\"atom-0\" style=\"fill:#4C4CFF;fill-rule:evenodd;stroke:#4C4CFF;stroke-width:1.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<ellipse cx=\"910.7\" cy=\"331.2\" rx=\"26.6\" ry=\"26.6\" class=\"atom-1\" style=\"fill:#4C4CFF;fill-rule:evenodd;stroke:#4C4CFF;stroke-width:1.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-0 atom-0 atom-1\" d=\"M 925.1,208.0 L 910.7,331.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-1 atom-1 atom-2\" d=\"M 910.7,331.2 L 853.5,355.9\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-1 atom-1 atom-2\" d=\"M 853.5,355.9 L 796.4,380.6\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-1 atom-1 atom-2\" d=\"M 908.0,354.1 L 856.2,376.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-1 atom-1 atom-2\" d=\"M 856.2,376.5 L 804.3,398.9\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-2 atom-2 atom-3\" d=\"M 787.8,392.5 L 780.6,454.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-2 atom-2 atom-3\" d=\"M 780.6,454.1 L 773.4,515.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-3 atom-3 atom-4\" d=\"M 773.4,515.8 L 879.9,595.0\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-3 atom-3 atom-4\" d=\"M 794.5,506.6 L 882.6,572.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-4 atom-4 atom-5\" d=\"M 879.9,595.0 L 860.1,653.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-4 atom-4 atom-5\" d=\"M 860.1,653.6 L 840.4,712.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-5 atom-5 atom-6\" d=\"M 829.8,720.7 L 767.3,720.0\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-5 atom-5 atom-6\" d=\"M 767.3,720.0 L 704.7,719.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-5 atom-5 atom-6\" d=\"M 830.1,700.8 L 774.7,700.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-5 atom-5 atom-6\" d=\"M 774.7,700.2 L 719.4,699.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-6 atom-6 atom-7\" d=\"M 704.7,719.3 L 686.2,660.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-6 atom-6 atom-7\" d=\"M 686.2,660.3 L 667.8,601.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-7 atom-3 atom-7\" d=\"M 773.4,515.8 L 723.0,551.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-7 atom-3 atom-7\" d=\"M 723.0,551.5 L 672.7,587.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-8 atom-7 atom-8\" d=\"M 657.5,590.0 L 598.4,570.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-8 atom-7 atom-8\" d=\"M 598.4,570.1 L 539.3,550.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-9 atom-8 atom-9\" d=\"M 539.3,550.1 L 489.3,585.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-9 atom-8 atom-9\" d=\"M 489.3,585.6 L 439.2,621.0\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-10 atom-9 atom-10\" d=\"M 422.7,620.8 L 373.6,584.2\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-10 atom-9 atom-10\" d=\"M 373.6,584.2 L 324.4,547.7\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-11 atom-10 atom-11\" d=\"M 324.4,547.7 L 197.7,587.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-12 atom-11 atom-12\" d=\"M 197.7,587.2 L 153.0,546.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-12 atom-11 atom-12\" d=\"M 153.0,546.1 L 108.3,504.9\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-13 atom-10 atom-13\" d=\"M 324.4,547.7 L 366.9,421.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-14 atom-13 atom-14\" d=\"M 366.9,421.8 L 331.6,372.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-14 atom-13 atom-14\" d=\"M 331.6,372.1 L 296.3,322.3\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-15 atom-13 atom-15\" d=\"M 366.9,421.8 L 499.7,423.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-16 atom-8 atom-15\" d=\"M 539.3,550.1 L 499.7,423.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-17 atom-15 atom-16\" d=\"M 499.7,423.4 L 536.0,374.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-17 atom-15 atom-16\" d=\"M 536.0,374.5 L 572.4,325.6\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-18 atom-4 atom-17\" d=\"M 879.9,595.0 L 1001.8,542.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-19 atom-17 atom-18\" d=\"M 991.3,547.0 L 1042.7,585.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-19 atom-17 atom-18\" d=\"M 1042.7,585.2 L 1094.1,623.5\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-19 atom-17 atom-18\" d=\"M 1003.2,531.0 L 1054.6,569.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-19 atom-17 atom-18\" d=\"M 1054.6,569.3 L 1106.0,607.5\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-20 atom-17 atom-19\" d=\"M 1001.8,542.4 L 1009.0,480.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-20 atom-17 atom-19\" d=\"M 1009.0,480.8 L 1016.2,419.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-21 atom-1 atom-19\" d=\"M 910.7,331.2 L 960.2,368.0\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-21 atom-1 atom-19\" d=\"M 960.2,368.0 L 1009.6,404.9\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path d=\"M 707.8,719.4 L 704.7,719.3 L 703.7,716.4\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
+       "<path d=\"M 204.0,585.3 L 197.7,587.2 L 195.5,585.2\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
+       "<path d=\"M 995.7,545.0 L 1001.8,542.4 L 1002.2,539.3\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
+       "<path class=\"atom-0\" d=\"M 924.2 195.1 L 927.0 199.6 Q 927.3 200.0, 927.7 200.8 Q 928.1 201.7, 928.2 201.7 L 928.2 195.1 L 929.3 195.1 L 929.3 203.6 L 928.1 203.6 L 925.1 198.7 Q 924.8 198.1, 924.4 197.4 Q 924.1 196.8, 924.0 196.6 L 924.0 203.6 L 922.9 203.6 L 922.9 195.1 L 924.2 195.1 \" fill=\"#000000\"/>\n",
+       "<path class=\"atom-0\" d=\"M 930.9 195.1 L 932.1 195.1 L 932.1 198.7 L 936.4 198.7 L 936.4 195.1 L 937.6 195.1 L 937.6 203.6 L 936.4 203.6 L 936.4 199.7 L 932.1 199.7 L 932.1 203.6 L 930.9 203.6 L 930.9 195.1 \" fill=\"#000000\"/>\n",
+       "<path class=\"atom-0\" d=\"M 939.2 203.3 Q 939.4 202.8, 939.9 202.5 Q 940.4 202.2, 941.1 202.2 Q 942.0 202.2, 942.4 202.6 Q 942.9 203.1, 942.9 203.9 Q 942.9 204.7, 942.3 205.5 Q 941.7 206.3, 940.4 207.2 L 943.0 207.2 L 943.0 207.8 L 939.2 207.8 L 939.2 207.3 Q 940.3 206.6, 940.9 206.0 Q 941.5 205.5, 941.8 205.0 Q 942.1 204.5, 942.1 203.9 Q 942.1 203.4, 941.8 203.1 Q 941.6 202.8, 941.1 202.8 Q 940.7 202.8, 940.4 203.0 Q 940.0 203.2, 939.8 203.6 L 939.2 203.3 \" fill=\"#000000\"/>\n",
+       "<path class=\"atom-2\" d=\"M 786.9 379.6 L 789.7 384.1 Q 790.0 384.6, 790.4 385.4 Q 790.8 386.2, 790.9 386.2 L 790.9 379.6 L 792.0 379.6 L 792.0 388.1 L 790.8 388.1 L 787.8 383.2 Q 787.5 382.6, 787.1 382.0 Q 786.8 381.3, 786.7 381.1 L 786.7 388.1 L 785.6 388.1 L 785.6 379.6 L 786.9 379.6 \" fill=\"#0000FF\"/>\n",
+       "<path class=\"atom-5\" d=\"M 835.6 716.6 L 838.4 721.1 Q 838.6 721.5, 839.1 722.3 Q 839.5 723.1, 839.5 723.2 L 839.5 716.6 L 840.7 716.6 L 840.7 725.1 L 839.5 725.1 L 836.5 720.2 Q 836.2 719.6, 835.8 718.9 Q 835.4 718.3, 835.3 718.1 L 835.3 725.1 L 834.2 725.1 L 834.2 716.6 L 835.6 716.6 \" fill=\"#0000FF\"/>\n",
+       "<path class=\"atom-7\" d=\"M 663.2 588.3 L 666.0 592.8 Q 666.3 593.3, 666.7 594.1 Q 667.2 594.9, 667.2 594.9 L 667.2 588.3 L 668.3 588.3 L 668.3 596.8 L 667.1 596.8 L 664.2 591.9 Q 663.8 591.3, 663.4 590.7 Q 663.1 590.0, 663.0 589.8 L 663.0 596.8 L 661.9 596.8 L 661.9 588.3 L 663.2 588.3 \" fill=\"#0000FF\"/>\n",
+       "<path class=\"atom-9\" d=\"M 427.1 626.9 Q 427.1 624.9, 428.1 623.8 Q 429.1 622.6, 431.0 622.6 Q 432.8 622.6, 433.9 623.8 Q 434.9 624.9, 434.9 626.9 Q 434.9 629.0, 433.8 630.2 Q 432.8 631.3, 431.0 631.3 Q 429.1 631.3, 428.1 630.2 Q 427.1 629.0, 427.1 626.9 M 431.0 630.4 Q 432.3 630.4, 433.0 629.5 Q 433.7 628.6, 433.7 626.9 Q 433.7 625.3, 433.0 624.4 Q 432.3 623.6, 431.0 623.6 Q 429.7 623.6, 429.0 624.4 Q 428.3 625.3, 428.3 626.9 Q 428.3 628.7, 429.0 629.5 Q 429.7 630.4, 431.0 630.4 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-12\" d=\"M 87.7 493.1 L 88.9 493.1 L 88.9 496.7 L 93.2 496.7 L 93.2 493.1 L 94.4 493.1 L 94.4 501.6 L 93.2 501.6 L 93.2 497.6 L 88.9 497.6 L 88.9 501.6 L 87.7 501.6 L 87.7 493.1 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-12\" d=\"M 96.1 497.3 Q 96.1 495.3, 97.1 494.1 Q 98.1 493.0, 100.0 493.0 Q 101.9 493.0, 102.9 494.1 Q 103.9 495.3, 103.9 497.3 Q 103.9 499.4, 102.9 500.5 Q 101.9 501.7, 100.0 501.7 Q 98.2 501.7, 97.1 500.5 Q 96.1 499.4, 96.1 497.3 M 100.0 500.7 Q 101.3 500.7, 102.0 499.9 Q 102.7 499.0, 102.7 497.3 Q 102.7 495.6, 102.0 494.8 Q 101.3 493.9, 100.0 493.9 Q 98.7 493.9, 98.0 494.8 Q 97.3 495.6, 97.3 497.3 Q 97.3 499.0, 98.0 499.9 Q 98.7 500.7, 100.0 500.7 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-14\" d=\"M 277.8 309.3 L 278.9 309.3 L 278.9 312.9 L 283.3 312.9 L 283.3 309.3 L 284.4 309.3 L 284.4 317.8 L 283.3 317.8 L 283.3 313.9 L 278.9 313.9 L 278.9 317.8 L 277.8 317.8 L 277.8 309.3 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-14\" d=\"M 286.2 313.6 Q 286.2 311.5, 287.2 310.4 Q 288.2 309.2, 290.1 309.2 Q 292.0 309.2, 293.0 310.4 Q 294.0 311.5, 294.0 313.6 Q 294.0 315.6, 293.0 316.8 Q 291.9 318.0, 290.1 318.0 Q 288.2 318.0, 287.2 316.8 Q 286.2 315.6, 286.2 313.6 M 290.1 317.0 Q 291.4 317.0, 292.1 316.1 Q 292.8 315.3, 292.8 313.6 Q 292.8 311.9, 292.1 311.0 Q 291.4 310.2, 290.1 310.2 Q 288.8 310.2, 288.1 311.0 Q 287.4 311.9, 287.4 313.6 Q 287.4 315.3, 288.1 316.1 Q 288.8 317.0, 290.1 317.0 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-16\" d=\"M 575.1 316.9 Q 575.1 314.8, 576.1 313.7 Q 577.1 312.5, 579.0 312.5 Q 580.8 312.5, 581.8 313.7 Q 582.9 314.8, 582.9 316.9 Q 582.9 318.9, 581.8 320.1 Q 580.8 321.3, 579.0 321.3 Q 577.1 321.3, 576.1 320.1 Q 575.1 318.9, 575.1 316.9 M 579.0 320.3 Q 580.2 320.3, 580.9 319.4 Q 581.7 318.6, 581.7 316.9 Q 581.7 315.2, 580.9 314.3 Q 580.2 313.5, 579.0 313.5 Q 577.7 313.5, 576.9 314.3 Q 576.3 315.2, 576.3 316.9 Q 576.3 318.6, 576.9 319.4 Q 577.7 320.3, 579.0 320.3 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-16\" d=\"M 584.2 312.6 L 585.3 312.6 L 585.3 316.2 L 589.7 316.2 L 589.7 312.6 L 590.8 312.6 L 590.8 321.1 L 589.7 321.1 L 589.7 317.2 L 585.3 317.2 L 585.3 321.1 L 584.2 321.1 L 584.2 312.6 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-18\" d=\"M 1104.5 621.7 Q 1104.5 619.7, 1105.5 618.5 Q 1106.5 617.4, 1108.4 617.4 Q 1110.2 617.4, 1111.3 618.5 Q 1112.3 619.7, 1112.3 621.7 Q 1112.3 623.8, 1111.2 624.9 Q 1110.2 626.1, 1108.4 626.1 Q 1106.5 626.1, 1105.5 624.9 Q 1104.5 623.8, 1104.5 621.7 M 1108.4 625.1 Q 1109.7 625.1, 1110.4 624.3 Q 1111.1 623.4, 1111.1 621.7 Q 1111.1 620.0, 1110.4 619.2 Q 1109.7 618.3, 1108.4 618.3 Q 1107.1 618.3, 1106.4 619.2 Q 1105.7 620.0, 1105.7 621.7 Q 1105.7 623.4, 1106.4 624.3 Q 1107.1 625.1, 1108.4 625.1 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-19\" d=\"M 1015.3 406.3 L 1018.1 410.8 Q 1018.4 411.2, 1018.8 412.0 Q 1019.3 412.8, 1019.3 412.9 L 1019.3 406.3 L 1020.4 406.3 L 1020.4 414.8 L 1019.3 414.8 L 1016.3 409.8 Q 1015.9 409.3, 1015.6 408.6 Q 1015.2 407.9, 1015.1 407.7 L 1015.1 414.8 L 1014.0 414.8 L 1014.0 406.3 L 1015.3 406.3 \" fill=\"#0000FF\"/>\n",
+       "<path class=\"atom-19\" d=\"M 1022.1 406.3 L 1023.2 406.3 L 1023.2 409.9 L 1027.6 409.9 L 1027.6 406.3 L 1028.7 406.3 L 1028.7 414.8 L 1027.6 414.8 L 1027.6 410.8 L 1023.2 410.8 L 1023.2 414.8 L 1022.1 414.8 L 1022.1 406.3 \" fill=\"#0000FF\"/>\n",
+       "</svg>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.SVG object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "def save_aniline_screening_results(df, output_dir, visualization_dir, max_visualizations=500):\n",
+    "    \"\"\"Save the aromatic amine screening results (CSV plus highlighted SVG images).\"\"\"\n",
+    "    \n",
+    "    # Save the CSV file\n",
+    "    csv_path = output_dir / \"aniline_candidates.csv\"\n",
+    "    \n",
+    "    # Convert the ROMol column to SMILES (ROMol objects cannot be written to CSV)\n",
+    "    df_export = df.copy()\n",
+    "    if 'ROMol' in df_export.columns:\n",
+    "        df_export['SMILES_from_mol'] = df_export['ROMol'].apply(lambda x: Chem.MolToSmiles(x) if x else '')\n",
+    "        df_export = df_export.drop('ROMol', axis=1)\n",
+    "    \n",
+    "    df_export.to_csv(csv_path, index=False, encoding='utf-8')\n",
+    "    print(f\"CSV results saved to: {csv_path}\")\n",
+    "    print(f\"Contains {len(df_export)} molecules and {len(df_export.columns)} property columns\")\n",
+    "    \n",
+    "    # Generate the visualization images\n",
+    "    print(f\"\\nGenerating visualization images (at most {max_visualizations})...\")\n",
+    "    generated_count = 0\n",
+    "    \n",
+    "    for idx, row in df.iterrows():\n",
+    "        if generated_count >= max_visualizations:\n",
+    "            print(f\"Reached the visualization limit ({max_visualizations}); stopping\")\n",
+    "            break\n",
+    "            \n",
+    "        cas = str(row.get('CAS', 'unknown')).strip()\n",
+    "        name = str(row.get('Name', 'unknown')).strip()\n",
+    "        \n",
+    "        # Sanitize the file name parts (drop special characters)\n",
+    "        safe_name = \"\".join(c for c in name if c.isalnum() or c in (' ', '-', '_')).rstrip()\n",
+    "        safe_cas = \"\".join(c for c in cas if c.isalnum() or c in ('-',)).rstrip()\n",
+    "        \n",
+    "        # Skip rows without a usable CAS identifier\n",
+    "        if not safe_cas or safe_cas == 'nan' or safe_cas == 'unknown':\n",
+    "            continue\n",
+    "            \n",
+    "        mol = row.get('ROMol')\n",
+    "        if mol is None:\n",
+    "            continue\n",
+    "            \n",
+    "        matched_atoms = row.get('matched_atoms', [])\n",
+    "        if not matched_atoms:\n",
+    "            continue\n",
+    "            \n",
+    "        # Build the file name and the title\n",
+    "        filename = visualization_dir / f\"{safe_cas}_{safe_name.replace(' ', '_')}.svg\"\n",
+    "        title = f\"{name} ({cas}) - aromatic amine structure\"\n",
+    "        \n",
+    "        try:\n",
+    "            # Generate the SVG\n",
+    "            svg_content = generate_highlighted_svg(mol, matched_atoms, filename, title)\n",
+    "            generated_count += 1\n",
+    "            \n",
+    "            # Report progress every 10 images\n",
+    "            if generated_count % 10 == 0:\n",
+    "                print(f\"Generated {generated_count} molecule images\")\n",
+    "                \n",
+    "        except Exception as e:\n",
+    "            print(f\"Failed to generate {safe_cas}: {e}\")\n",
+    "            continue\n",
+    "    \n",
+    "    print(f\"Done! Generated {generated_count} visualization images in total\")\n",
+    "    return csv_path, generated_count\n",
+    "\n",
+    "# Save the results\n",
+    "if len(matched_df) > 0:\n",
+    "    csv_path, viz_count = save_aniline_screening_results(\n",
+    "        matched_df, output_dir, visualization_dir, max_visualizations=500\n",
+    "    )\n",
+    "    \n",
+    "    # Display one of the generated images as an example\n",
+    "    if viz_count > 0:\n",
+    "        example_files = list(visualization_dir.glob(\"*.svg\"))\n",
+    "        if example_files:\n",
+    "            example_file = example_files[0]\n",
+    "            print(f\"\\nExample image: {example_file.name}\")\n",
+    "            with open(example_file, \"r\") as f:\n",
+    "                svg_content = f.read()\n",
+    "            display(SVG(svg_content))\n",
+    "else:\n",
+    "    print(\"No matches found; nothing to save\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
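+    "## Result statistics and analysis\n",
+    "\n",
+    "### Screening statistics\n",
+    "- Total number of molecules\n",
+    "- Number of matched molecules\n",
+    "- Number of visualization files\n",
+    "- Location of the output files"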
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "=== Aromatic amine screening summary ===\n",
+      "Total molecules: 3276\n",
+      "Matched molecules: 262\n",
+      "Match rate: 8.00%\n",
+      "\n",
+      "Output directory: ../data/drug_targetmol/aniline_candidates\n",
+      "CSV file: ../data/drug_targetmol/aniline_candidates/aniline_candidates.csv\n",
+      "Visualization directory: ../data/drug_targetmol/aniline_candidates/visualizations\n",
+      "SVG file count: 262\n",
+      "\n",
+      "Molecules with the most matches:\n",
+      "                                       Name          CAS  total_matches\n",
+      "432                  Proflavine Hemisulfate    1811-28-5              4\n",
+      "1064                            Triamterene     396-01-0              3\n",
+      "335   Pemetrexed disodium hemipenta hydrate  357166-30-4              2\n",
+      "463                             Lamotrigine   84057-84-1              2\n",
+      "779                           Pyrimethamine      58-14-0              2\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Summary statistics\n",
+    "print(\"=== Aromatic amine screening summary ===\")\n",
+    "print(f\"Total molecules: {len(df)}\")\n",
+    "print(f\"Matched molecules: {len(matched_df)}\")\n",
+    "print(f\"Match rate: {len(matched_df)/len(df)*100:.2f}%\")\n",
+    "print(f\"\\nOutput directory: {output_dir}\")\n",
+    "print(f\"CSV file: {output_dir}/aniline_candidates.csv\")\n",
+    "print(f\"Visualization directory: {visualization_dir}\")\n",
+    "print(f\"SVG file count: {len(list(visualization_dir.glob('*.svg')))}\")\n",
+    "\n",
+    "# Show the molecules with the most SMARTS matches\n",
+    "if len(matched_df) > 0:\n",
+    "    print(\"\\nMolecules with the most matches:\")\n",
+    "    top_matches = matched_df.nlargest(5, 'total_matches')[['Name', 'CAS', 'total_matches']]\n",
+    "    print(top_matches)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
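+    "## Usage recommendations\n",
+    "\n",
+    "### Interpreting the screening results\n",
+    "- **Matched molecules**: drugs that contain an aromatic amine group (Ar-NH₂)\n",
+    "- **Blue highlight**: the atoms matched by the SMARTS pattern (aromatic carbon/nitrogen plus the amino group)\n",
+    "- **Multiple matches**: a molecule may contain more than one aromatic amine group\n",
+    "\n",
+    "### Suggested follow-up analysis\n",
+    "1. **Verify the synthetic route**: consult the synthesis literature for each matched molecule\n",
+    "2. **Confirm the Sandmeyer step**: check whether a Sandmeyer reaction is actually used to introduce the halogen\n",
+    "3. **Assess the Zhang Xiaheng (张夏恒) reaction**: evaluate whether it is a feasible replacement for the Sandmeyer reaction\n",
+    "4. **Process optimization potential**: analyze the economic benefit of switching to the Zhang Xiaheng reaction\n",
+    "\n",
+    "### File descriptions\n",
+    "- **CSV file**: complete molecular properties and match information\n",
+    "- **SVG files**: structure visualizations with the aromatic amine group highlighted in blue\n",
+    "- **Naming convention**: {CAS}_{Name}.svg (special characters removed)"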
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Antibiotic screening results\n",
+    "\n",
+    "/home/zly/project/macro_split/data/drug_targetmol/aniline_candidates/antibiotics_identified.csv"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.14.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/notebooks/screen_aniline_candidates_executed.ipynb
+++ b/notebooks/screen_aniline_candidates_executed.ipynb
@@ -0,0 +1,774 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
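+    "# Screening aromatic amine drug candidates - analysis of Sandmeyer reaction starting materials\n",
+    "\n",
+    "## Background\n",
+    "\n",
+    "### Sandmeyer reaction recap\n",
+    "The Sandmeyer reaction is a classic method for converting aromatic amines:\n",
+    "**Ar-NH₂ → [Ar-N₂]⁺ → Ar-X**\n",
+    "where X = Cl, Br, I, CN, OH, SCN, etc.\n",
+    "\n",
+    "### Screening goal\n",
+    "Identify drug molecules that contain an aromatic amine group (Ar-NH₂)\n",
+    "to find candidates that could serve as Sandmeyer reaction starting materials.\n",
+    "These molecules may originally have had an aryl halogen introduced via a Sandmeyer reaction\n",
+    "and could now be transformed more efficiently with the Zhang Xiaheng (张夏恒) reaction.\n",
+    "\n",
+    "### SMARTS pattern\n",
+    "The SMARTS pattern `[c,n][NH2]` matches:\n",
+    "- `[c,n]`: an aromatic carbon or aromatic nitrogen atom\n",
+    "- `[NH2]`: an amino group (-NH₂)\n",
+    "\n",
+    "**Important notes:**\n",
+    "- This screen is based on structural features only\n",
+    "- The synthetic route must ultimately be confirmed against the literature\n",
+    "- Not every drug that contains an aromatic amine is made via a Sandmeyer reaction"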
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Import required libraries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-11-11T13:21:31.660096Z",
+     "iopub.status.busy": "2025-11-11T13:21:31.657369Z",
+     "iopub.status.idle": "2025-11-11T13:21:32.943162Z",
+     "shell.execute_reply": "2025-11-11T13:21:32.938881Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from pathlib import Path\n",
+    "from rdkit import Chem\n",
+    "from rdkit.Chem import PandasTools, Draw\n",
+    "from rdkit.Chem.Draw import rdMolDraw2D\n",
+    "from IPython.display import SVG, display\n",
+    "from rdkit.Chem import AllChem\n",
+    "import pandas as pd\n",
+    "import warnings\n",
+    "warnings.filterwarnings('ignore')\n",
+    "\n",
+    "# Set pandas display options\n",
+    "pd.set_option('display.max_columns', None)\n",
+    "pd.set_option('display.max_colwidth', 100)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
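+    "## Define the screening pattern and the visualization function\n",
+    "\n",
+    "### SMARTS pattern setup\n",
+    "- **Target pattern**: `[c,n][NH2]` - an amino group attached to an aromatic carbon or nitrogen atom\n",
+    "- **Matching logic**: find every molecule that contains this substructure"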
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-11-11T13:21:32.959832Z",
+     "iopub.status.busy": "2025-11-11T13:21:32.957734Z",
+     "iopub.status.idle": "2025-11-11T13:21:32.987085Z",
+     "shell.execute_reply": "2025-11-11T13:21:32.980584Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Using SMARTS pattern: [c,n][NH2]\n",
+      "Pattern valid: ✓\n",
+      "\n",
+      "Created directory: ../data/drug_targetmol/aniline_candidates\n",
+      "Created visualization directory: ../data/drug_targetmol/aniline_candidates/visualizations\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Define the screening pattern\n",
+    "TARGET_SMARTS = '[c,n][NH2]'\n",
+    "pattern = Chem.MolFromSmarts(TARGET_SMARTS)\n",
+    "\n",
+    "if pattern is None:\n",
+    "    raise ValueError(f\"Invalid SMARTS pattern: {TARGET_SMARTS}\")\n",
+    "\n",
+    "print(f\"Using SMARTS pattern: {TARGET_SMARTS}\")\n",
+    "print(f\"Pattern valid: {'✓' if pattern else '✗'}\")\n",
+    "\n",
+    "# Create the output directories\n",
+    "output_base = Path(\"../data/drug_targetmol\")\n",
+    "output_dir = output_base / \"aniline_candidates\"\n",
+    "visualization_dir = output_dir / \"visualizations\"\n",
+    "\n",
+    "# parents=True so a missing parent directory does not raise FileNotFoundError\n",
+    "output_dir.mkdir(parents=True, exist_ok=True)\n",
+    "visualization_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"\\nCreated directory: {output_dir}\")\n",
+    "print(f\"Created visualization directory: {visualization_dir}\")\n",
+    "\n",
+    "def generate_highlighted_svg(mol, highlight_atoms, filename, title=\"\"):\n",
+    "    \"\"\"Write a high-resolution SVG with the matched atoms highlighted; return the untitled SVG text.\"\"\"\n",
+    "    # Compute 2D coordinates\n",
+    "    AllChem.Compute2DCoords(mol)\n",
+    "    \n",
+    "    # Create the SVG drawer\n",
+    "    drawer = rdMolDraw2D.MolDraw2DSVG(1200, 900)  # larger canvas for a sharper image\n",
+    "    drawer.SetFontSize(12)\n",
+    "    \n",
+    "    # Drawing options\n",
+    "    draw_options = drawer.drawOptions()\n",
+    "    draw_options.addAtomIndices = False  # hide atom indices to keep the image clean\n",
+    "    draw_options.addBondIndices = False\n",
+    "    draw_options.addStereoAnnotation = True\n",
+    "    draw_options.fixedFontSize = 12\n",
+    "    \n",
+    "    # Highlight the matched atoms in blue\n",
+    "    atom_colors = {}\n",
+    "    for atom_idx in highlight_atoms:\n",
+    "        atom_colors[atom_idx] = (0.3, 0.3, 1.0)  # blue highlight\n",
+    "    \n",
+    "    # Draw the molecule\n",
+    "    drawer.DrawMolecule(mol, \n",
+    "                       highlightAtoms=highlight_atoms,\n",
+    "                       highlightAtomColors=atom_colors)\n",
+    "    \n",
+    "    drawer.FinishDrawing()\n",
+    "    svg_content = drawer.GetDrawingText()\n",
+    "    \n",
+    "    # Add the title\n",
+    "    if title:\n",
+    "        # Split on real newlines (splitting on a literal backslash-n never splits the SVG)\n",
+    "        svg_lines = svg_content.split(\"\\n\")\n",
+    "        # Insert the title before the first transformed <g> tag, if one exists\n",
+    "        for i, line in enumerate(svg_lines):\n",
+    "            if \"<g \" in line and \"transform\" in line:\n",
+    "                svg_lines.insert(i, f\"<text x=\\\"50%\\\" y=\\\"30\\\" text-anchor=\\\"middle\\\" font-size=\\\"16\\\" font-weight=\\\"bold\\\">{title}</text>\")\n",
+    "                break\n",
+    "        svg_with_title = \"\\n\".join(svg_lines)\n",
+    "    else:\n",
+    "        svg_with_title = svg_content\n",
+    "    \n",
+    "    # Save the file (explicit UTF-8 because the title may contain non-ASCII text)\n",
+    "    with open(filename, \"w\", encoding=\"utf-8\") as f:\n",
+    "        f.write(svg_with_title)\n",
+    "    \n",
+    "    return svg_content"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
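+    "## Data loading and molecule screening\n",
+    "\n",
+    "### Data source\n",
+    "- File location: `data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
+    "- Contains drug molecule structures together with rich property annotations\n",
+    "\n",
+    "### Screening logic\n",
+    "1. Read the SDF file\n",
+    "2. Run the SMARTS match against every molecule\n",
+    "3. Record the matched atoms and the number of matches\n",
+    "4. Save the matches to CSV\n",
+    "5. Generate highlighted visualization images"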
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-11-11T13:21:33.114695Z",
+     "iopub.status.busy": "2025-11-11T13:21:33.113063Z",
+     "iopub.status.idle": "2025-11-11T13:21:35.754026Z",
+     "shell.execute_reply": "2025-11-11T13:21:35.745369Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Reading SDF file...\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[21:21:34] Both bonds on one end of an atropisomer are on the same side - atoms is : 3\n",
+      "[21:21:34] Explicit valence for atom # 2 N greater than permitted\n",
+      "[21:21:34] ERROR: Could not sanitize molecule ending on line 217340\n",
+      "[21:21:34] ERROR: Explicit valence for atom # 2 N greater than permitted\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[21:21:35] Explicit valence for atom # 4 N greater than permitted\n",
+      "[21:21:35] ERROR: Could not sanitize molecule ending on line 317283\n",
+      "[21:21:35] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
+      "[21:21:35] Explicit valence for atom # 4 N greater than permitted\n",
+      "[21:21:35] ERROR: Could not sanitize molecule ending on line 324666\n",
+      "[21:21:35] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
+      "[21:21:35] Explicit valence for atom # 5 N greater than permitted\n",
+      "[21:21:35] ERROR: Could not sanitize molecule ending on line 365883\n",
+      "[21:21:35] ERROR: Explicit valence for atom # 5 N greater than permitted\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Successfully loaded 3276 molecules\n",
+      "\n",
+      "Data overview:\n",
+      "  Index    Plate Row Col ID                           Name  \\\n",
+      "0     1  L1010-1   a   2                     Dexamethasone   \n",
+      "1     2  L1010-1   a   3                         Danicopan   \n",
+      "2     3  L1010-1   a   4                     Cyclosporin A   \n",
+      "3     4  L1010-1   a   5                       L-Carnitine   \n",
+      "4     5  L1010-1   a   6     Trimetazidine dihydrochloride   \n",
+      "\n",
+      "                                       Synonyms           CAS  \\\n",
+      "0  MK 125;Prednisolone F;NSC 34521;Hexadecadrol       50-02-2   \n",
+      "1                                      ACH-4471  1903768-17-1   \n",
+      "2       Cyclosporine A;Ciclosporin;Cyclosporine    59865-13-3   \n",
+      "3                  L(-)-Carnitine;Levocarnitine      541-15-1   \n",
+      "4               Yoshimilon;Kyurinett;Vastarel F    13171-25-0   \n",
+      "\n",
+      "                                                                                                SMILES  \\\n",
+      "0              C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO   \n",
+      "1                        CC(=O)c1nn(CC(=O)N2C[C@H](F)C[C@H]2C(=O)Nc2cccc(Br)n2)c2ccc(cc12)-c1cnc(C)nc1   \n",
+      "2  [C@H]([C@@H](C/C=C/C)C)(O)[C@@]1(N(C)C(=O)[C@H]([C@@H](C)C)N(C)C(=O)[C@H](CC(C)C)N(C)C(=O)[C@H](...   \n",
+      "3                                                                      C[N+](C)(C)C[C@@H](O)CC([O-])=O   \n",
+      "4                                                               Cl.Cl.COC1=C(OC)C(OC)=C(CN2CCNCC2)C=C1   \n",
+      "\n",
+      "         Formula    MolWt Approved status  \\\n",
+      "0      C22H29FO5   392.46    NMPA;EMA;FDA   \n",
+      "1  C26H23BrFN7O3   580.41             FDA   \n",
+      "2  C62H111N11O12  1202.61             FDA   \n",
+      "3       C7H15NO3    161.2             FDA   \n",
+      "4  C14H24Cl2N2O3  339.258        NMPA;EMA   \n",
+      "\n",
+      "                                                             Pharmacopoeia  \\\n",
+      "0                                            USP39-NF34;BP2015;JP16;IP2010   \n",
+      "1                                                                      NaN   \n",
+      "2  Martindale the Extra Pharmacopoei, EP10.2, USP43-NF38, Ph.Int_6th, JP17   \n",
+      "3                                                                      NaN   \n",
+      "4         BP2019;KP Ⅹ;EP9.2;IP2010;JP17;Martindale:The Extra Pharmacopoeia   \n",
+      "\n",
+      "                 Disease  \\\n",
+      "0             Metabolism   \n",
+      "1                 Others   \n",
+      "2          Immune system   \n",
+      "3  Cardiovascular system   \n",
+      "4  Cardiovascular system   \n",
+      "\n",
+      "                                                                                              Pathways  \\\n",
+      "0  Antibody-drug Conjugate/ADC Related;Autophagy;Endocrinology/Hormones;Immunology/Inflammation;Mic...   \n",
+      "1                                                                              Immunology/Inflammation   \n",
+      "2                                             Immunology/Inflammation;Metabolism;Microbiology/Virology   \n",
+      "3                                                                                           Metabolism   \n",
+      "4                                                                                 Autophagy;Metabolism   \n",
+      "\n",
+      "                                                                                                Target  \\\n",
+      "0  Antibacterial;Antibiotic;Autophagy;Complement System;Glucocorticoid Receptor;IL Receptor;Mitopha...   \n",
+      "1                                                                                    Complement System   \n",
+      "2                                                             Phosphatase;Antibiotic;Complement System   \n",
+      "3                                                            Endogenous Metabolite;Fatty Acid Synthase   \n",
+      "4                                                                        Autophagy;Fatty Acid Synthase   \n",
+      "\n",
+      "                                                                                              Receptor  \\\n",
+      "0  Antibiotic; Autophagy; Bacterial; Complement System; Glucocorticoid Receptor; IL receptor; Mitop...   \n",
+      "1                                                                          Complement System; factor D   \n",
+      "2                                  Antibiotic; calcineurin phosphatase; Complement System; Phosphatase   \n",
+      "3                                                                           Endogenous Metabolite; FAS   \n",
+      "4                                              Autophagy; mitochondrial long-chain 3-ketoacyl thiolase   \n",
+      "\n",
+      "                                                                                           Bioactivity  \\\n",
+      "0  Dexamethasone is a glucocorticoid receptor agonist and IL receptor modulator with anti-inflammat...   \n",
+      "1  Danicopan (ACH-4471) (ACH-4471) is a selective, orally active small molecule factor D inhibitor ...   \n",
+      "2  Cyclosporin A is a natural product and an active fungal metabolite, classified as a cyclic polyp...   \n",
+      "3  L-Carnitine (L(-)-Carnitine) is an amino acid derivative. L-Carnitine facilitates long-chain fat...   \n",
+      "4  Trimetazidine dihydrochloride (Vastarel F) can improve myocardial glucose utilization by inhibit...   \n",
+      "\n",
+      "                                                                                             Reference  \\\n",
+      "0  Li M, Yu H. Identification of WP1066, an inhibitor of JAK2 and STAT3, as a Kv1. 3 potassium chan...   \n",
+      "1  Yuan X, et al. Small-molecule factor D inhibitors selectively block the alternative pathway of c...   \n",
+      "2  D'Angelo G, et al. Cyclosporin A prevents the hypoxic adaptation by activating hypoxia-inducible...   \n",
+      "3                                                    Jogl G, Tong L. Cell. 2003 Jan 10; 112(1):113-22.   \n",
+      "4  Yang Q, et al. Int J Clin Exp Pathol. 2015, 8(4):3735-3741.;Liu Z, et al. Metabolism. 2016, 65(3...   \n",
+      "\n",
+      "                                              ROMol  \n",
+      "0  <rdkit.Chem.rdchem.Mol object at 0x774684c557e0>  \n",
+      "1  <rdkit.Chem.rdchem.Mol object at 0x7746818ffdf0>  \n",
+      "2  <rdkit.Chem.rdchem.Mol object at 0x7746818ffd80>  \n",
+      "3  <rdkit.Chem.rdchem.Mol object at 0x7746816e0040>  \n",
+      "4  <rdkit.Chem.rdchem.Mol object at 0x7746816e00b0>  \n",
+      "\n",
+      "列名：['Index', 'Plate', 'Row', 'Col', 'ID', 'Name', 'Synonyms', 'CAS', 'SMILES', 'Formula', 'MolWt', 'Approved status', 'Pharmacopoeia', 'Disease', 'Pathways', 'Target', 'Receptor', 'Bioactivity', 'Reference', 'ROMol']\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Load the SDF file into a DataFrame; PandasTools adds an 'ROMol' column of RDKit Mol objects\n",
+    "sdf_path = '../data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf'\n",
+    "\n",
+    "print(\"正在读取SDF文件...\")\n",
+    "df = PandasTools.LoadSDF(sdf_path)\n",
+    "print(f\"成功加载 {len(df)} 个分子\")\n",
+    "\n",
+    "# Show a quick overview of the data\n",
+    "print(\"\\n数据概览：\")\n",
+    "print(df.head())\n",
+    "print(f\"\\n列名：{list(df.columns)}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-11-11T13:21:35.770585Z",
+     "iopub.status.busy": "2025-11-11T13:21:35.768752Z",
+     "iopub.status.idle": "2025-11-11T13:21:36.114723Z",
+     "shell.execute_reply": "2025-11-11T13:21:36.111467Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "开始筛选芳香胺结构...\n",
+      "SMARTS模式: [c,n][N&H2]\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "找到 78 个匹配分子（处理了 1000 个分子）\n",
+      "\n",
+      "筛选结果摘要：\n",
+      "                  Name          CAS      Formula  total_matches\n",
+      "17           Guanosine     118-00-3   C10H13N5O5              1\n",
+      "20         Ganciclovir   82410-32-0    C9H13N5O4              1\n",
+      "22   Imiquimod maleate  896106-16-4   C18H20N4O4              1\n",
+      "27       Brincidofovir  444805-28-1  C27H52N3O7P              1\n",
+      "28           Imiquimod   99011-02-6     C14H16N4              1\n",
+      "32  Ganciclovir sodium  107910-75-8  C9H13N5NaO4              1\n",
+      "33          Cytarabine     147-94-4    C9H13N3O5              1\n",
+      "35          Vidarabine    5536-17-4   C10H13N5O4              1\n",
+      "38         Penciclovir   39809-25-1   C10H15N5O3              1\n",
+      "41         Famciclovir  104227-87-4   C14H19N5O4              1\n",
+      "... 还有 68 个分子\n"
+     ]
+    }
+   ],
+   "source": [
+    "def screen_molecules_for_aniline(df, smarts_pattern, max_molecules=100):\n",
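+    "    \"\"\"\n",
+    "    Screen molecules that contain an aromatic amine substructure.\n",
+    "    \n",
+    "    Args:\n",
+    "        df: DataFrame of molecules with an 'ROMol' column\n",
+    "        smarts_pattern: RDKit query molecule built from a SMARTS pattern\n",
+    "        max_molecules: maximum number of valid (non-None) molecules to process\n",
+    "    \n",
+    "    Returns:\n",
+    "        DataFrame of matching rows with added 'matched_atoms', 'total_matches' and 'smarts_pattern' columns\n",
+    "    \"\"\"\n",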
+    "    print(f\"开始筛选芳香胺结构...\")\n",
+    "    print(f\"SMARTS模式: {Chem.MolToSmarts(smarts_pattern)}\")\n",
+    "    \n",
+    "    matched_molecules = []\n",
+    "    processed_count = 0\n",
+    "    \n",
+    "    for idx, row in df.iterrows():\n",
+    "        if processed_count >= max_molecules:\n",
+    "            break\n",
+    "            \n",
+    "        mol = row['ROMol']\n",
+    "        if mol is None:\n",
+    "            continue\n",
+    "            \n",
+    "        processed_count += 1\n",
+    "        \n",
+    "        # Check whether the molecule matches the SMARTS pattern\n",
+    "        if mol.HasSubstructMatch(smarts_pattern):\n",
+    "            matches = mol.GetSubstructMatches(smarts_pattern)\n",
+    "            \n",
+    "            # Collect the atom indices of all matches\n",
+    "            matched_atoms = set()\n",
+    "            for match in matches:\n",
+    "                matched_atoms.update(match)\n",
+    "            \n",
+    "            # Build the match record\n",
+    "            match_record = row.copy()\n",
+    "            match_record['matched_atoms'] = list(matched_atoms)\n",
+    "            match_record['total_matches'] = len(matches)\n",
+    "            match_record['smarts_pattern'] = Chem.MolToSmarts(smarts_pattern)\n",
+    "            matched_molecules.append(match_record)\n",
+    "    \n",
+    "    result_df = pd.DataFrame(matched_molecules)\n",
+    "    print(f\"找到 {len(result_df)} 个匹配分子（处理了 {processed_count} 个分子）\")\n",
+    "    \n",
+    "    return result_df\n",
+    "\n",
+    "# Run the screen (note: max_molecules=1000 caps the scan at the first 1000 valid molecules of df)\n",
+    "matched_df = screen_molecules_for_aniline(df, pattern, max_molecules=1000)\n",
+    "\n",
+    "# Show a summary of the results\n",
+    "if len(matched_df) > 0:\n",
+    "    print(\"\\n筛选结果摘要：\")\n",
+    "    summary_cols = ['Name', 'CAS', 'Formula', 'total_matches']\n",
+    "    if len(matched_df) <= 10:\n",
+    "        print(matched_df[summary_cols])\n",
+    "    else:\n",
+    "        print(matched_df[summary_cols].head(10))\n",
+    "        print(f\"... 还有 {len(matched_df) - 10} 个分子\")\n",
+    "else:\n",
+    "    print(\"\\n未找到匹配分子\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
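+    "## Saving the screening results\n",
+    "\n",
+    "### Output files\n",
+    "1. **CSV file**: attributes and match details for every matched molecule\n",
+    "2. **SVG images**: a structure drawing per matched molecule (capped at `max_visualizations`), with the aromatic amine highlighted"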
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-11-11T13:21:36.120981Z",
+     "iopub.status.busy": "2025-11-11T13:21:36.120553Z",
+     "iopub.status.idle": "2025-11-11T13:21:36.279125Z",
+     "shell.execute_reply": "2025-11-11T13:21:36.277892Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CSV结果已保存到：../data/drug_targetmol/aniline_candidates/aniline_candidates.csv\n",
+      "包含 78 个分子，23 个属性列\n",
+      "\n",
+      "开始生成可视化图片（最多50个）...\n",
+      "已生成 10 个分子图片\n",
+      "已生成 20 个分子图片\n",
+      "已生成 30 个分子图片\n",
+      "已生成 40 个分子图片\n",
+      "已生成 50 个分子图片\n",
+      "已达到最大可视化数量限制 (50)，停止生成\n",
+      "完成！共生成 50 个可视化图片\n",
+      "\n",
+      "示例图片: 118-00-3_Guanosine.svg\n"
+     ]
+    },
+    {
+     "data": {
+      "image/svg+xml": [
+       "<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:rdkit=\"http://www.rdkit.org/xml\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" version=\"1.1\" baseProfile=\"full\" xml:space=\"preserve\" width=\"1200px\" height=\"900px\" viewBox=\"0 0 1200 900\">\n",
+       "<!-- END OF HEADER -->\n",
+       "<rect style=\"opacity:1.0;fill:#FFFFFF;stroke:none\" width=\"1200.0\" height=\"900.0\" x=\"0.0\" y=\"0.0\"> </rect>\n",
+       "<path class=\"bond-0 atom-0 atom-1\" d=\"M 912.0,197.7 L 940.1,201.0 L 924.8,332.9 L 896.6,329.6 Z\" style=\"fill:#4C4CFF;fill-rule:evenodd;fill-opacity:1;stroke:#4C4CFF;stroke-width:0.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
+       "<ellipse cx=\"932.9\" cy=\"201.5\" rx=\"26.6\" ry=\"26.6\" class=\"atom-0\" style=\"fill:#4C4CFF;fill-rule:evenodd;stroke:#4C4CFF;stroke-width:1.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<ellipse cx=\"910.7\" cy=\"331.2\" rx=\"26.6\" ry=\"26.6\" class=\"atom-1\" style=\"fill:#4C4CFF;fill-rule:evenodd;stroke:#4C4CFF;stroke-width:1.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-0 atom-0 atom-1\" d=\"M 925.1,208.0 L 910.7,331.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-1 atom-1 atom-2\" d=\"M 910.7,331.2 L 853.5,355.9\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-1 atom-1 atom-2\" d=\"M 853.5,355.9 L 796.4,380.6\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-1 atom-1 atom-2\" d=\"M 908.0,354.1 L 856.2,376.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-1 atom-1 atom-2\" d=\"M 856.2,376.5 L 804.3,398.9\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-2 atom-2 atom-3\" d=\"M 787.8,392.5 L 780.6,454.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-2 atom-2 atom-3\" d=\"M 780.6,454.1 L 773.4,515.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-3 atom-3 atom-4\" d=\"M 773.4,515.8 L 879.9,595.0\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-3 atom-3 atom-4\" d=\"M 794.5,506.6 L 882.6,572.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-4 atom-4 atom-5\" d=\"M 879.9,595.0 L 860.1,653.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-4 atom-4 atom-5\" d=\"M 860.1,653.6 L 840.4,712.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-5 atom-5 atom-6\" d=\"M 829.8,720.7 L 767.3,720.0\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-5 atom-5 atom-6\" d=\"M 767.3,720.0 L 704.7,719.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-5 atom-5 atom-6\" d=\"M 830.1,700.8 L 774.7,700.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-5 atom-5 atom-6\" d=\"M 774.7,700.2 L 719.4,699.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-6 atom-6 atom-7\" d=\"M 704.7,719.3 L 686.2,660.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-6 atom-6 atom-7\" d=\"M 686.2,660.3 L 667.8,601.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-7 atom-3 atom-7\" d=\"M 773.4,515.8 L 723.0,551.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-7 atom-3 atom-7\" d=\"M 723.0,551.5 L 672.7,587.2\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-8 atom-7 atom-8\" d=\"M 657.5,590.0 L 598.4,570.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-8 atom-7 atom-8\" d=\"M 598.4,570.1 L 539.3,550.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-9 atom-8 atom-9\" d=\"M 539.3,550.1 L 489.3,585.6\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-9 atom-8 atom-9\" d=\"M 489.3,585.6 L 439.2,621.0\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-10 atom-9 atom-10\" d=\"M 422.7,620.8 L 373.6,584.2\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-10 atom-9 atom-10\" d=\"M 373.6,584.2 L 324.4,547.7\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-11 atom-10 atom-11\" d=\"M 324.4,547.7 L 197.7,587.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-12 atom-11 atom-12\" d=\"M 197.7,587.2 L 153.0,546.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-12 atom-11 atom-12\" d=\"M 153.0,546.1 L 108.3,504.9\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-13 atom-10 atom-13\" d=\"M 324.4,547.7 L 366.9,421.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-14 atom-13 atom-14\" d=\"M 366.9,421.8 L 331.6,372.1\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-14 atom-13 atom-14\" d=\"M 331.6,372.1 L 296.3,322.3\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-15 atom-13 atom-15\" d=\"M 366.9,421.8 L 499.7,423.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-16 atom-8 atom-15\" d=\"M 539.3,550.1 L 499.7,423.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-17 atom-15 atom-16\" d=\"M 499.7,423.4 L 536.0,374.5\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-17 atom-15 atom-16\" d=\"M 536.0,374.5 L 572.4,325.6\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-18 atom-4 atom-17\" d=\"M 879.9,595.0 L 1001.8,542.4\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-19 atom-17 atom-18\" d=\"M 991.3,547.0 L 1042.7,585.2\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-19 atom-17 atom-18\" d=\"M 1042.7,585.2 L 1094.1,623.5\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-19 atom-17 atom-18\" d=\"M 1003.2,531.0 L 1054.6,569.3\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-19 atom-17 atom-18\" d=\"M 1054.6,569.3 L 1106.0,607.5\" style=\"fill:none;fill-rule:evenodd;stroke:#FF0000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-20 atom-17 atom-19\" d=\"M 1001.8,542.4 L 1009.0,480.8\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-20 atom-17 atom-19\" d=\"M 1009.0,480.8 L 1016.2,419.1\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-21 atom-1 atom-19\" d=\"M 910.7,331.2 L 960.2,368.0\" style=\"fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path class=\"bond-21 atom-1 atom-19\" d=\"M 960.2,368.0 L 1009.6,404.9\" style=\"fill:none;fill-rule:evenodd;stroke:#0000FF;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1\"/>\n",
+       "<path d=\"M 707.8,719.4 L 704.7,719.3 L 703.7,716.4\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
+       "<path d=\"M 204.0,585.3 L 197.7,587.2 L 195.5,585.2\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
+       "<path d=\"M 995.7,545.0 L 1001.8,542.4 L 1002.2,539.3\" style=\"fill:none;stroke:#000000;stroke-width:2.0px;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:10;stroke-opacity:1;\"/>\n",
+       "<path class=\"atom-0\" d=\"M 924.2 195.1 L 927.0 199.6 Q 927.3 200.0, 927.7 200.8 Q 928.1 201.7, 928.2 201.7 L 928.2 195.1 L 929.3 195.1 L 929.3 203.6 L 928.1 203.6 L 925.1 198.7 Q 924.8 198.1, 924.4 197.4 Q 924.1 196.8, 924.0 196.6 L 924.0 203.6 L 922.9 203.6 L 922.9 195.1 L 924.2 195.1 \" fill=\"#000000\"/>\n",
+       "<path class=\"atom-0\" d=\"M 930.9 195.1 L 932.1 195.1 L 932.1 198.7 L 936.4 198.7 L 936.4 195.1 L 937.6 195.1 L 937.6 203.6 L 936.4 203.6 L 936.4 199.7 L 932.1 199.7 L 932.1 203.6 L 930.9 203.6 L 930.9 195.1 \" fill=\"#000000\"/>\n",
+       "<path class=\"atom-0\" d=\"M 939.2 203.3 Q 939.4 202.8, 939.9 202.5 Q 940.4 202.2, 941.1 202.2 Q 942.0 202.2, 942.4 202.6 Q 942.9 203.1, 942.9 203.9 Q 942.9 204.7, 942.3 205.5 Q 941.7 206.3, 940.4 207.2 L 943.0 207.2 L 943.0 207.8 L 939.2 207.8 L 939.2 207.3 Q 940.3 206.6, 940.9 206.0 Q 941.5 205.5, 941.8 205.0 Q 942.1 204.5, 942.1 203.9 Q 942.1 203.4, 941.8 203.1 Q 941.6 202.8, 941.1 202.8 Q 940.7 202.8, 940.4 203.0 Q 940.0 203.2, 939.8 203.6 L 939.2 203.3 \" fill=\"#000000\"/>\n",
+       "<path class=\"atom-2\" d=\"M 786.9 379.6 L 789.7 384.1 Q 790.0 384.6, 790.4 385.4 Q 790.8 386.2, 790.9 386.2 L 790.9 379.6 L 792.0 379.6 L 792.0 388.1 L 790.8 388.1 L 787.8 383.2 Q 787.5 382.6, 787.1 382.0 Q 786.8 381.3, 786.7 381.1 L 786.7 388.1 L 785.6 388.1 L 785.6 379.6 L 786.9 379.6 \" fill=\"#0000FF\"/>\n",
+       "<path class=\"atom-5\" d=\"M 835.6 716.6 L 838.4 721.1 Q 838.6 721.5, 839.1 722.3 Q 839.5 723.1, 839.5 723.2 L 839.5 716.6 L 840.7 716.6 L 840.7 725.1 L 839.5 725.1 L 836.5 720.2 Q 836.2 719.6, 835.8 718.9 Q 835.4 718.3, 835.3 718.1 L 835.3 725.1 L 834.2 725.1 L 834.2 716.6 L 835.6 716.6 \" fill=\"#0000FF\"/>\n",
+       "<path class=\"atom-7\" d=\"M 663.2 588.3 L 666.0 592.8 Q 666.3 593.3, 666.7 594.1 Q 667.2 594.9, 667.2 594.9 L 667.2 588.3 L 668.3 588.3 L 668.3 596.8 L 667.1 596.8 L 664.2 591.9 Q 663.8 591.3, 663.4 590.7 Q 663.1 590.0, 663.0 589.8 L 663.0 596.8 L 661.9 596.8 L 661.9 588.3 L 663.2 588.3 \" fill=\"#0000FF\"/>\n",
+       "<path class=\"atom-9\" d=\"M 427.1 626.9 Q 427.1 624.9, 428.1 623.8 Q 429.1 622.6, 431.0 622.6 Q 432.8 622.6, 433.9 623.8 Q 434.9 624.9, 434.9 626.9 Q 434.9 629.0, 433.8 630.2 Q 432.8 631.3, 431.0 631.3 Q 429.1 631.3, 428.1 630.2 Q 427.1 629.0, 427.1 626.9 M 431.0 630.4 Q 432.3 630.4, 433.0 629.5 Q 433.7 628.6, 433.7 626.9 Q 433.7 625.3, 433.0 624.4 Q 432.3 623.6, 431.0 623.6 Q 429.7 623.6, 429.0 624.4 Q 428.3 625.3, 428.3 626.9 Q 428.3 628.7, 429.0 629.5 Q 429.7 630.4, 431.0 630.4 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-12\" d=\"M 87.7 493.1 L 88.9 493.1 L 88.9 496.7 L 93.2 496.7 L 93.2 493.1 L 94.4 493.1 L 94.4 501.6 L 93.2 501.6 L 93.2 497.6 L 88.9 497.6 L 88.9 501.6 L 87.7 501.6 L 87.7 493.1 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-12\" d=\"M 96.1 497.3 Q 96.1 495.3, 97.1 494.1 Q 98.1 493.0, 100.0 493.0 Q 101.9 493.0, 102.9 494.1 Q 103.9 495.3, 103.9 497.3 Q 103.9 499.4, 102.9 500.5 Q 101.9 501.7, 100.0 501.7 Q 98.2 501.7, 97.1 500.5 Q 96.1 499.4, 96.1 497.3 M 100.0 500.7 Q 101.3 500.7, 102.0 499.9 Q 102.7 499.0, 102.7 497.3 Q 102.7 495.6, 102.0 494.8 Q 101.3 493.9, 100.0 493.9 Q 98.7 493.9, 98.0 494.8 Q 97.3 495.6, 97.3 497.3 Q 97.3 499.0, 98.0 499.9 Q 98.7 500.7, 100.0 500.7 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-14\" d=\"M 277.8 309.3 L 278.9 309.3 L 278.9 312.9 L 283.3 312.9 L 283.3 309.3 L 284.4 309.3 L 284.4 317.8 L 283.3 317.8 L 283.3 313.9 L 278.9 313.9 L 278.9 317.8 L 277.8 317.8 L 277.8 309.3 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-14\" d=\"M 286.2 313.6 Q 286.2 311.5, 287.2 310.4 Q 288.2 309.2, 290.1 309.2 Q 292.0 309.2, 293.0 310.4 Q 294.0 311.5, 294.0 313.6 Q 294.0 315.6, 293.0 316.8 Q 291.9 318.0, 290.1 318.0 Q 288.2 318.0, 287.2 316.8 Q 286.2 315.6, 286.2 313.6 M 290.1 317.0 Q 291.4 317.0, 292.1 316.1 Q 292.8 315.3, 292.8 313.6 Q 292.8 311.9, 292.1 311.0 Q 291.4 310.2, 290.1 310.2 Q 288.8 310.2, 288.1 311.0 Q 287.4 311.9, 287.4 313.6 Q 287.4 315.3, 288.1 316.1 Q 288.8 317.0, 290.1 317.0 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-16\" d=\"M 575.1 316.9 Q 575.1 314.8, 576.1 313.7 Q 577.1 312.5, 579.0 312.5 Q 580.8 312.5, 581.8 313.7 Q 582.9 314.8, 582.9 316.9 Q 582.9 318.9, 581.8 320.1 Q 580.8 321.3, 579.0 321.3 Q 577.1 321.3, 576.1 320.1 Q 575.1 318.9, 575.1 316.9 M 579.0 320.3 Q 580.2 320.3, 580.9 319.4 Q 581.7 318.6, 581.7 316.9 Q 581.7 315.2, 580.9 314.3 Q 580.2 313.5, 579.0 313.5 Q 577.7 313.5, 576.9 314.3 Q 576.3 315.2, 576.3 316.9 Q 576.3 318.6, 576.9 319.4 Q 577.7 320.3, 579.0 320.3 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-16\" d=\"M 584.2 312.6 L 585.3 312.6 L 585.3 316.2 L 589.7 316.2 L 589.7 312.6 L 590.8 312.6 L 590.8 321.1 L 589.7 321.1 L 589.7 317.2 L 585.3 317.2 L 585.3 321.1 L 584.2 321.1 L 584.2 312.6 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-18\" d=\"M 1104.5 621.7 Q 1104.5 619.7, 1105.5 618.5 Q 1106.5 617.4, 1108.4 617.4 Q 1110.2 617.4, 1111.3 618.5 Q 1112.3 619.7, 1112.3 621.7 Q 1112.3 623.8, 1111.2 624.9 Q 1110.2 626.1, 1108.4 626.1 Q 1106.5 626.1, 1105.5 624.9 Q 1104.5 623.8, 1104.5 621.7 M 1108.4 625.1 Q 1109.7 625.1, 1110.4 624.3 Q 1111.1 623.4, 1111.1 621.7 Q 1111.1 620.0, 1110.4 619.2 Q 1109.7 618.3, 1108.4 618.3 Q 1107.1 618.3, 1106.4 619.2 Q 1105.7 620.0, 1105.7 621.7 Q 1105.7 623.4, 1106.4 624.3 Q 1107.1 625.1, 1108.4 625.1 \" fill=\"#FF0000\"/>\n",
+       "<path class=\"atom-19\" d=\"M 1015.3 406.3 L 1018.1 410.8 Q 1018.4 411.2, 1018.8 412.0 Q 1019.3 412.8, 1019.3 412.9 L 1019.3 406.3 L 1020.4 406.3 L 1020.4 414.8 L 1019.3 414.8 L 1016.3 409.8 Q 1015.9 409.3, 1015.6 408.6 Q 1015.2 407.9, 1015.1 407.7 L 1015.1 414.8 L 1014.0 414.8 L 1014.0 406.3 L 1015.3 406.3 \" fill=\"#0000FF\"/>\n",
+       "<path class=\"atom-19\" d=\"M 1022.1 406.3 L 1023.2 406.3 L 1023.2 409.9 L 1027.6 409.9 L 1027.6 406.3 L 1028.7 406.3 L 1028.7 414.8 L 1027.6 414.8 L 1027.6 410.8 L 1023.2 410.8 L 1023.2 414.8 L 1022.1 414.8 L 1022.1 406.3 \" fill=\"#0000FF\"/>\n",
+       "</svg>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.SVG object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "def save_aniline_screening_results(df, output_dir, visualization_dir, max_visualizations=50):\n",
+    "    \"\"\"Save the aromatic amine screening results (CSV plus highlighted SVGs).\"\"\"\n",
+    "    \n",
+    "    # Save the CSV file\n",
+    "    csv_path = output_dir / \"aniline_candidates.csv\"\n",
+    "    \n",
+    "    # Convert the ROMol column to SMILES (Mol objects cannot be stored meaningfully in a CSV)\n",
+    "    df_export = df.copy()\n",
+    "    if 'ROMol' in df_export.columns:\n",
+    "        df_export['SMILES_from_mol'] = df_export['ROMol'].apply(lambda x: Chem.MolToSmiles(x) if x is not None else '')\n",
+    "        df_export = df_export.drop('ROMol', axis=1)\n",
+    "    \n",
+    "    df_export.to_csv(csv_path, index=False, encoding='utf-8')\n",
+    "    print(f\"CSV结果已保存到：{csv_path}\")\n",
+    "    print(f\"包含 {len(df_export)} 个分子，{len(df_export.columns)} 个属性列\")\n",
+    "    \n",
+    "    # Generate the visualization images\n",
+    "    print(f\"\\n开始生成可视化图片（最多{max_visualizations}个）...\")\n",
+    "    generated_count = 0\n",
+    "    \n",
+    "    for idx, row in df.iterrows():\n",
+    "        if generated_count >= max_visualizations:\n",
+    "            print(f\"已达到最大可视化数量限制 ({max_visualizations})，停止生成\")\n",
+    "            break\n",
+    "            \n",
+    "        cas = str(row.get('CAS', 'unknown')).strip()\n",
+    "        name = str(row.get('Name', 'unknown')).strip()\n",
+    "        \n",
+    "        # Sanitize the file name components (drop special characters)\n",
+    "        safe_name = \"\".join(c for c in name if c.isalnum() or c in (' ', '-', '_')).rstrip()\n",
+    "        safe_cas = \"\".join(c for c in cas if c.isalnum() or c in ('-',)).rstrip()\n",
+    "        \n",
+    "        # Skip rows without a usable CAS identifier\n",
+    "        if not safe_cas or safe_cas == 'nan' or safe_cas == 'unknown':\n",
+    "            continue\n",
+    "            \n",
+    "        mol = row.get('ROMol')\n",
+    "        if mol is None:\n",
+    "            continue\n",
+    "            \n",
+    "        matched_atoms = row.get('matched_atoms', [])\n",
+    "        if not matched_atoms:\n",
+    "            continue\n",
+    "            \n",
+    "        # Build the file name and title\n",
+    "        filename = visualization_dir / f\"{safe_cas}_{safe_name.replace(' ', '_')}.svg\"\n",
+    "        title = f\"{name} ({cas}) - 芳香胺结构\"\n",
+    "        \n",
+    "        try:\n",
+    "            # Generate the SVG\n",
+    "            svg_content = generate_highlighted_svg(mol, matched_atoms, filename, title)\n",
+    "            generated_count += 1\n",
+    "            \n",
+    "            # Report progress every 10 images\n",
+    "            if generated_count % 10 == 0:\n",
+    "                print(f\"已生成 {generated_count} 个分子图片\")\n",
+    "                \n",
+    "        except Exception as e:\n",
+    "            print(f\"生成 {safe_cas} 失败: {e}\")\n",
+    "            continue\n",
+    "    \n",
+    "    print(f\"完成！共生成 {generated_count} 个可视化图片\")\n",
+    "    return csv_path, generated_count\n",
+    "\n",
+    "# Save the results\n",
+    "if len(matched_df) > 0:\n",
+    "    csv_path, viz_count = save_aniline_screening_results(\n",
+    "        matched_df, output_dir, visualization_dir, max_visualizations=50\n",
+    "    )\n",
+    "    \n",
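+    "    # Show one generated image as an example (glob order is filesystem-dependent, so it may not be the first one generated)\n",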
+    "    if viz_count > 0:\n",
+    "        example_files = list(visualization_dir.glob(\"*.svg\"))\n",
+    "        if example_files:\n",
+    "            example_file = example_files[0]\n",
+    "            print(f\"\\n示例图片: {example_file.name}\")\n",
+    "            with open(example_file, \"r\") as f:\n",
+    "                svg_content = f.read()\n",
+    "            display(SVG(svg_content))\n",
+    "else:\n",
+    "    print(\"没有匹配结果，无需保存\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
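+    "## Result statistics and analysis\n",
+    "\n",
+    "### Screening statistics\n",
+    "- Total number of molecules\n",
+    "- Number of matched molecules\n",
+    "- Number of visualization files\n",
+    "- Output file locations"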
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-11-11T13:21:36.282118Z",
+     "iopub.status.busy": "2025-11-11T13:21:36.281886Z",
+     "iopub.status.idle": "2025-11-11T13:21:36.317857Z",
+     "shell.execute_reply": "2025-11-11T13:21:36.316621Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "=== 芳香胺筛选结果统计 ===\n",
+      "总分子数：3276\n",
+      "匹配分子数：78\n",
+      "匹配率：2.38%\n",
+      "\n",
+      "输出目录：../data/drug_targetmol/aniline_candidates\n",
+      "CSV文件：../data/drug_targetmol/aniline_candidates/aniline_candidates.csv\n",
+      "可视化目录：../data/drug_targetmol/aniline_candidates/visualizations\n",
+      "SVG文件数量：50\n",
+      "\n",
+      "匹配数量最多的分子：\n",
+      "                                      Name          CAS  total_matches\n",
+      "432                 Proflavine Hemisulfate    1811-28-5              4\n",
+      "335  Pemetrexed disodium hemipenta hydrate  357166-30-4              2\n",
+      "463                            Lamotrigine   84057-84-1              2\n",
+      "779                          Pyrimethamine      58-14-0              2\n",
+      "784                                Dapsone      80-08-0              2\n"
+     ]
+    }
+   ],
+   "source": [
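+    "# Summary statistics (note: the match rate divides by len(df), although the screen was capped at max_molecules=1000)\n",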
+    "print(\"=== 芳香胺筛选结果统计 ===\")\n",
+    "print(f\"总分子数：{len(df)}\")\n",
+    "print(f\"匹配分子数：{len(matched_df)}\")\n",
+    "print(f\"匹配率：{len(matched_df)/len(df)*100:.2f}%\")\n",
+    "print(f\"\\n输出目录：{output_dir}\")\n",
+    "print(f\"CSV文件：{output_dir}/aniline_candidates.csv\")\n",
+    "print(f\"可视化目录：{visualization_dir}\")\n",
+    "print(f\"SVG文件数量：{len(list(visualization_dir.glob('*.svg')))}\")\n",
+    "\n",
+    "# Show the molecules with the most matches\n",
+    "if len(matched_df) > 0:\n",
+    "    print(\"\\n匹配数量最多的分子：\")\n",
+    "    top_matches = matched_df.nlargest(5, 'total_matches')[['Name', 'CAS', 'total_matches']]\n",
+    "    print(top_matches)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
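+    "## Usage notes\n",
+    "\n",
+    "### Interpreting the screening results\n",
+    "- **Matched molecules**: drugs that contain an aromatic amine (Ar-NH₂)\n",
+    "- **Blue highlight**: the atoms matched by the SMARTS pattern (an aromatic carbon or nitrogen bonded to an NH₂ group)\n",
+    "- **Multiple matches**: a molecule may contain several aromatic amine groups\n",
+    "\n",
+    "### Suggested follow-up analysis\n",
+    "1. **Verify synthetic routes**: consult the synthesis literature for each matched molecule\n",
+    "2. **Confirm Sandmeyer usage**: check whether a Sandmeyer reaction is used to introduce the halogen\n",
+    "3. **Evaluate the Zhang Xiaheng (张夏恒) reaction**: assess whether it can feasibly replace the Sandmeyer step\n",
+    "4. **Process optimization potential**: analyze the economic benefit of switching to the Zhang Xiaheng reaction\n",
+    "\n",
+    "### File descriptions\n",
+    "- **CSV file**: complete molecular attributes and match information\n",
+    "- **SVG files**: structure drawings with the aromatic amine highlighted in blue\n",
+    "- **Naming convention**: {CAS}_{Name}.svg (special characters removed)"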
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.14.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/notebooks/screen_sandmeyer_candidates.ipynb
+++ b/notebooks/screen_sandmeyer_candidates.ipynb
@@ -0,0 +1,797 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
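+    "# Screening Sandmeyer Reaction Candidate Drugs: the Zhang Xiaheng (张夏恒) Reaction as an Alternative\n",
+    "\n",
+    "## Background\n",
+    "\n",
+    "### The Sandmeyer reaction\n",
+    "The Sandmeyer reaction is a classic method for converting aromatic amines via a diazonium intermediate:\n",
+    "**Ar-NH₂ → [Ar-N₂]⁺ → Ar-X**\n",
+    "where X = Cl, Br, I, CN, OH, SCN, etc.\n",
+    "\n",
+    "### The Zhang Xiaheng reaction\n",
+    "According to the paper `s41586-025-09791-5_reference.pdf`, the Zhang Xiaheng reaction may be a new alternative\n",
+    "that converts aromatic amines to aryl halides more efficiently.\n",
+    "\n",
+    "### Screening strategy\n",
+    "We identify aryl halide substructures in drug molecules that may have been introduced by a Sandmeyer reaction,\n",
+    "and from these pick out candidate drugs whose process could be optimized with the Zhang Xiaheng reaction.\n",
+    "\n",
+    "**Important caveats:**\n",
+    "- This screen is based on molecular structure alone\n",
+    "- The synthetic route must ultimately be confirmed from the literature\n",
+    "- Not every halogen-containing drug is synthesized with a Sandmeyer reaction"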
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Import Required Libraries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from pathlib import Path\n",
+    "from rdkit import Chem  # needed by the SMARTS cell below, which calls Chem.MolFromSmarts\n",
+    "from rdkit.Chem.Draw import rdMolDraw2D\n",
+    "from IPython.display import SVG, display\n",
+    "from rdkit.Chem import AllChem\n",
+    "\n",
+    "# Create the output directories\n",
+    "output_base = Path(\"../data/drug_targetmol\")\n",
+    "scheme_a_dir = output_base / \"scheme_A_visualizations\"\n",
+    "scheme_b_dir = output_base / \"scheme_B_visualizations\"\n",
+    "\n",
+    "scheme_a_dir.mkdir(parents=True, exist_ok=True)\n",
+    "scheme_b_dir.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(f\"Created directory: {scheme_a_dir}\")\n",
+    "print(f\"Created directory: {scheme_b_dir}\")\n",
+    "\n",
+    "def generate_highlighted_svg(mol, highlight_atoms, filename, title=\"\"):\n",
+    "    \"\"\"Write a high-resolution SVG of mol with the given (halogen) atoms highlighted.\"\"\"\n",
+    "    \n",
+    "    # Compute 2D coordinates\n",
+    "    AllChem.Compute2DCoords(mol)\n",
+    "    \n",
+    "    # Create the SVG drawer\n",
+    "    drawer = rdMolDraw2D.MolDraw2DSVG(1200, 900)  # large canvas for a sharper image\n",
+    "    drawer.SetFontSize(12)\n",
+    "    \n",
+    "    # Drawing options\n",
+    "    draw_options = drawer.drawOptions()\n",
+    "    draw_options.addAtomIndices = False  # hide atom indices to keep the drawing clean\n",
+    "    draw_options.addBondIndices = False\n",
+    "    draw_options.addStereoAnnotation = True\n",
+    "    draw_options.fixedFontSize = 12\n",
+    "    \n",
+    "    # Highlight the halogen atoms in red\n",
+    "    atom_colors = {}\n",
+    "    for atom_idx in highlight_atoms:\n",
+    "        atom_colors[atom_idx] = (1.0, 0.3, 0.3)  # red highlight\n",
+    "    \n",
+    "    # Draw the molecule\n",
+    "    drawer.DrawMolecule(mol, \n",
+    "                       highlightAtoms=highlight_atoms,\n",
+    "                       highlightAtomColors=atom_colors)\n",
+    "    \n",
+    "    drawer.FinishDrawing()\n",
+    "    svg_content = drawer.GetDrawingText()\n",
+    "    \n",
+    "    # Add the title\n",
+    "    if title:\n",
+    "        # Insert a <text> element into the SVG\n",
+    "        svg_lines = svg_content.split(\"\\n\")\n",
+    "        # Insert the title before the first <g> tag that carries a transform\n",
+    "        for i, line in enumerate(svg_lines):\n",
+    "            if \"<g \" in line and \"transform\" in line:\n",
+    "                svg_lines.insert(i, f'<text x=\"50%\" y=\"30\" text-anchor=\"middle\" font-size=\"16\" font-weight=\"bold\">{title}</text>')\n",
+    "                break\n",
+    "        svg_with_title = \"\\n\".join(svg_lines)\n",
+    "    else:\n",
+    "        svg_with_title = svg_content\n",
+    "    \n",
+    "    # Save the file (UTF-8, since titles may contain non-ASCII names)\n",
+    "    with open(filename, \"w\", encoding=\"utf-8\") as f:\n",
+    "        f.write(svg_with_title)\n",
+    "    \n",
+    "    print(f\"Saved SVG: {filename}\")\n",
+    "    \n",
+    "    return svg_content\n",
+    "\n",
+    "def visualize_molecules(df, output_dir, scheme_name, max_molecules=50):\n",
+    "    \"\"\"Generate visualization images for the molecules in a DataFrame.\"\"\"\n",
+    "    print(f\"\\nGenerating visualizations for {scheme_name}...\")\n",
+    "    print(f\"Output directory: {output_dir}\")\n",
+    "    \n",
+    "    generated_count = 0\n",
+    "    \n",
+    "    for idx, row in df.iterrows():\n",
+    "        if generated_count >= max_molecules:\n",
+    "            print(f\"Reached the maximum number of images ({max_molecules}); stopping\")\n",
+    "            break\n",
+    "            \n",
+    "        cas = str(row.get(\"CAS\", \"unknown\")).strip()\n",
+    "        name = str(row.get(\"Name\", \"unknown\")).strip()\n",
+    "        \n",
+    "        # Skip invalid CAS numbers\n",
+    "        if not cas or cas == \"nan\" or cas == \"unknown\":\n",
+    "            continue\n",
+    "            \n",
+    "        mol = row.get(\"ROMol\")\n",
+    "        if mol is None:\n",
+    "            continue\n",
+    "            \n",
+    "        # Collect the halogen atoms\n",
+    "        halogen_atoms = []\n",
+    "        for atom in mol.GetAtoms():\n",
+    "            if atom.GetAtomicNum() in [9, 17, 35, 53]:  # F, Cl, Br, I\n",
+    "                halogen_atoms.append(atom.GetIdx())\n",
+    "        \n",
+    "        if not halogen_atoms:\n",
+    "            continue\n",
+    "            \n",
+    "        # Build the file name and title\n",
+    "        filename = output_dir / f\"{cas}.svg\"\n",
+    "        title = f\"{name} ({cas})\"\n",
+    "        \n",
+    "        try:\n",
+    "            # Generate the SVG\n",
+    "            generate_highlighted_svg(mol, halogen_atoms, filename, title)\n",
+    "            generated_count += 1\n",
+    "            \n",
+    "            # Report progress every 10 molecules\n",
+    "            if generated_count % 10 == 0:\n",
+    "                print(f\"Generated {generated_count} molecule images\")\n",
+    "                \n",
+    "        except Exception as e:\n",
+    "            print(f\"Failed to generate {cas}: {e}\")\n",
+    "            continue\n",
+    "    \n",
+    "    print(f\"Done! Generated {generated_count} visualizations for {scheme_name}\")\n",
+    "    return generated_count\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
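+    "## Define the Screening Patterns\n",
+    "\n",
+    "### SMARTS Pattern Notes\n",
+    "\n",
+    "#### Scheme A: Heteroaryl Halides (highest priority)\n",
+    "- **Screening logic**: halogens on heteroaromatic rings are the most likely Sandmeyer products\n",
+    "- **Rationale**: direct halogenation of heteroaromatic rings is often difficult, so the Sandmeyer reaction is an important route\n",
+    "- **Expected outcome**: few candidates, but high precision\n",
+    "\n",
+    "#### Scheme B: All Aryl Halides (medium priority)\n",
+    "- **Screening logic**: halogens on any aromatic ring\n",
+    "- **Rationale**: some of these halogens may come from other routes, but this widens the screen\n",
+    "- **Expected outcome**: more candidates, requiring more literature verification\n",
+    "\n",
+    "**Notes on SMARTS pattern refinement:**\n",
+    "- The original pattern `n:c:[Cl,Br,I]` was wrong: `:` is the SMARTS aromatic-bond symbol, so it required an aromatic bond to the halogen\n",
+    "- It was replaced with patterns that match the ring structure more accurately\n",
+    "- The patterns use a more precise description of the atomic environment"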
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define the screening patterns\n",
+    "SCREENING_PATTERNS = {\n",
+    "    'heteroaryl_halides': {\n",
+    "        'name': 'Heteroaryl halides',\n",
+    "        'description': 'Cl, Br, I atoms on heteroaromatic rings (Scheme A)',\n",
+    "        'smarts': [\n",
+    "            '[n,o,s]c[Cl,Br,I]',  # halogen on the carbon adjacent to the heteroatom\n",
+    "            '[n,o,s]cc[Cl,Br,I]', # halogen one carbon further from the heteroatom\n",
+    "            'c1[n,o,s]c([Cl,Br,I])ccc1', # six-membered heteroaromatic ring, halogen next to the heteroatom (e.g. 2-halopyridine)\n",
+    "            'c1c([Cl,Br,I])cncn1', # halopyrimidine\n",
+    "            'c1ccc2c([Cl,Br,I])ccnc2c1', # haloquinoline\n",
+    "            'c1c([Cl,Br,I])cncc1', # 3-halopyridine-type ring (a single ring nitrogen)\n",
+    "            'c1([Cl,Br,I])scnc1', # halothiazole\n",
+    "        ],\n",
+    "        'scheme': 'A'\n",
+    "    },\n",
+    "    'aryl_halides': {\n",
+    "        'name': 'Aryl halides',\n",
+    "        'description': 'Cl, Br, I atoms on any aromatic ring (Scheme B)',\n",
+    "        'smarts': [\n",
+    "            'c[Cl,Br,I]',  # any aromatic Cl, Br, or I\n",
+    "            'c-C#N',       # aryl nitrile (not a halide; also matches halogen-free molecules)\n",
+    "            'c1ccc(S(=O)(=O)N)cc1', # sulfonamide core\n",
+    "            'c1c(Cl)cc(Cl)cc1',     # meta-dichlorobenzene motif\n",
+    "        ],\n",
+    "        'scheme': 'B'\n",
+    "    }\n",
+    "}\n",
+    "\n",
+    "def create_pattern_matchers():\n",
+    "    \"\"\"Compile the SMARTS patterns into RDKit query molecules.\"\"\"\n",
+    "    matchers = {}\n",
+    "    for key, pattern_info in SCREENING_PATTERNS.items():\n",
+    "        matchers[key] = {\n",
+    "            'info': pattern_info,\n",
+    "            'matchers': [Chem.MolFromSmarts(smarts) for smarts in pattern_info['smarts']]\n",
+    "        }\n",
+    "    return matchers\n",
+    "\n",
+    "PATTERN_MATCHERS = create_pattern_matchers()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
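+    "## Data Loading and Preprocessing\n",
+    "\n",
+    "### About the SDF File\n",
+    "- File location: `data/drug_targetmol/0c04ffc9fe8c2ec916412fbdc2a49bf4.sdf`\n",
+    "- Contains drug molecule structures along with rich property annotations\n",
+    "- Each molecule record includes: SMILES, molecular formula, molecular weight, approval status, indication, etc."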
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Reading SDF file...\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[15:05:51] Both bonds on one end of an atropisomer are on the same side - atoms is : 3\n",
+      "[15:05:51] Explicit valence for atom # 2 N greater than permitted\n",
+      "[15:05:51] ERROR: Could not sanitize molecule ending on line 217340\n",
+      "[15:05:51] ERROR: Explicit valence for atom # 2 N greater than permitted\n",
+      "[15:05:52] Explicit valence for atom # 4 N greater than permitted\n",
+      "[15:05:52] ERROR: Could not sanitize molecule ending on line 317283\n",
+      "[15:05:52] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
+      "[15:05:52] Explicit valence for atom # 4 N greater than permitted\n",
+      "[15:05:52] ERROR: Could not sanitize molecule ending on line 324666\n",
+      "[15:05:52] ERROR: Explicit valence for atom # 4 N greater than permitted\n",
+      "[15:05:52] Explicit valence for atom # 5 N greater than permitted\n",
+      "[15:05:52] ERROR: Could not sanitize molecule ending on line 365883\n",
+      "[15:05:52] ERROR: Explicit valence for atom # 5 N greater than permitted\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Loaded 3276 molecules\n",
+      "\n",
+      "Data overview:\n",
+      "  Index    Plate Row Col ID                           Name  \\\n",
+      "0     1  L1010-1   a   2                     Dexamethasone   \n",
+      "1     2  L1010-1   a   3                         Danicopan   \n",
+      "2     3  L1010-1   a   4                     Cyclosporin A   \n",
+      "3     4  L1010-1   a   5                       L-Carnitine   \n",
+      "4     5  L1010-1   a   6     Trimetazidine dihydrochloride   \n",
+      "\n",
+      "                                       Synonyms           CAS  \\\n",
+      "0  MK 125;Prednisolone F;NSC 34521;Hexadecadrol       50-02-2   \n",
+      "1                                      ACH-4471  1903768-17-1   \n",
+      "2       Cyclosporine A;Ciclosporin;Cyclosporine    59865-13-3   \n",
+      "3                  L(-)-Carnitine;Levocarnitine      541-15-1   \n",
+      "4               Yoshimilon;Kyurinett;Vastarel F    13171-25-0   \n",
+      "\n",
+      "                                                                                                SMILES  \\\n",
+      "0              C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@]4(C)[C@@]3(F)[C@@H](O)C[C@]2(C)[C@@]1(O)C(=O)CO   \n",
+      "1                        CC(=O)c1nn(CC(=O)N2C[C@H](F)C[C@H]2C(=O)Nc2cccc(Br)n2)c2ccc(cc12)-c1cnc(C)nc1   \n",
+      "2  [C@H]([C@@H](C/C=C/C)C)(O)[C@@]1(N(C)C(=O)[C@H]([C@@H](C)C)N(C)C(=O)[C@H](CC(C)C)N(C)C(=O)[C@H](...   \n",
+      "3                                                                      C[N+](C)(C)C[C@@H](O)CC([O-])=O   \n",
+      "4                                                               Cl.Cl.COC1=C(OC)C(OC)=C(CN2CCNCC2)C=C1   \n",
+      "\n",
+      "         Formula    MolWt Approved status  \\\n",
+      "0      C22H29FO5   392.46    NMPA;EMA;FDA   \n",
+      "1  C26H23BrFN7O3   580.41             FDA   \n",
+      "2  C62H111N11O12  1202.61             FDA   \n",
+      "3       C7H15NO3    161.2             FDA   \n",
+      "4  C14H24Cl2N2O3  339.258        NMPA;EMA   \n",
+      "\n",
+      "                                                             Pharmacopoeia  \\\n",
+      "0                                            USP39-NF34;BP2015;JP16;IP2010   \n",
+      "1                                                                      NaN   \n",
+      "2  Martindale the Extra Pharmacopoei, EP10.2, USP43-NF38, Ph.Int_6th, JP17   \n",
+      "3                                                                      NaN   \n",
+      "4         BP2019;KP Ⅹ;EP9.2;IP2010;JP17;Martindale:The Extra Pharmacopoeia   \n",
+      "\n",
+      "                 Disease  \\\n",
+      "0             Metabolism   \n",
+      "1                 Others   \n",
+      "2          Immune system   \n",
+      "3  Cardiovascular system   \n",
+      "4  Cardiovascular system   \n",
+      "\n",
+      "                                                                                              Pathways  \\\n",
+      "0  Antibody-drug Conjugate/ADC Related;Autophagy;Endocrinology/Hormones;Immunology/Inflammation;Mic...   \n",
+      "1                                                                              Immunology/Inflammation   \n",
+      "2                                             Immunology/Inflammation;Metabolism;Microbiology/Virology   \n",
+      "3                                                                                           Metabolism   \n",
+      "4                                                                                 Autophagy;Metabolism   \n",
+      "\n",
+      "                                                                                                Target  \\\n",
+      "0  Antibacterial;Antibiotic;Autophagy;Complement System;Glucocorticoid Receptor;IL Receptor;Mitopha...   \n",
+      "1                                                                                    Complement System   \n",
+      "2                                                             Phosphatase;Antibiotic;Complement System   \n",
+      "3                                                            Endogenous Metabolite;Fatty Acid Synthase   \n",
+      "4                                                                        Autophagy;Fatty Acid Synthase   \n",
+      "\n",
+      "                                                                                              Receptor  \\\n",
+      "0  Antibiotic; Autophagy; Bacterial; Complement System; Glucocorticoid Receptor; IL receptor; Mitop...   \n",
+      "1                                                                          Complement System; factor D   \n",
+      "2                                  Antibiotic; calcineurin phosphatase; Complement System; Phosphatase   \n",
+      "3                                                                           Endogenous Metabolite; FAS   \n",
+      "4                                              Autophagy; mitochondrial long-chain 3-ketoacyl thiolase   \n",
+      "\n",
+      "                                                                                           Bioactivity  \\\n",
+      "0  Dexamethasone is a glucocorticoid receptor agonist and IL receptor modulator with anti-inflammat...   \n",
+      "1  Danicopan (ACH-4471) (ACH-4471) is a selective, orally active small molecule factor D inhibitor ...   \n",
+      "2  Cyclosporin A is a natural product and an active fungal metabolite, classified as a cyclic polyp...   \n",
+      "3  L-Carnitine (L(-)-Carnitine) is an amino acid derivative. L-Carnitine facilitates long-chain fat...   \n",
+      "4  Trimetazidine dihydrochloride (Vastarel F) can improve myocardial glucose utilization by inhibit...   \n",
+      "\n",
+      "                                                                                             Reference  \\\n",
+      "0  Li M, Yu H. Identification of WP1066, an inhibitor of JAK2 and STAT3, as a Kv1. 3 potassium chan...   \n",
+      "1  Yuan X, et al. Small-molecule factor D inhibitors selectively block the alternative pathway of c...   \n",
+      "2  D'Angelo G, et al. Cyclosporin A prevents the hypoxic adaptation by activating hypoxia-inducible...   \n",
+      "3                                                    Jogl G, Tong L. Cell. 2003 Jan 10; 112(1):113-22.   \n",
+      "4  Yang Q, et al. Int J Clin Exp Pathol. 2015, 8(4):3735-3741.;Liu Z, et al. Metabolism. 2016, 65(3...   \n",
+      "\n",
+      "                                              ROMol  \n",
+      "0  <rdkit.Chem.rdchem.Mol object at 0x743d782049e0>  \n",
+      "1  <rdkit.Chem.rdchem.Mol object at 0x743d782871b0>  \n",
+      "2  <rdkit.Chem.rdchem.Mol object at 0x743d78287220>  \n",
+      "3  <rdkit.Chem.rdchem.Mol object at 0x743d782873e0>  \n",
+      "4  <rdkit.Chem.rdchem.Mol object at 0x743d78287450>  \n",
+      "\n",
+      "Columns: ['Index', 'Plate', 'Row', 'Col', 'ID', 'Name', 'Synonyms', 'CAS', 'SMILES', 'Formula', 'MolWt', 'Approved status', 'Pharmacopoeia', 'Disease', 'Pathways', 'Target', 'Receptor', 'Bioactivity', 'Reference', 'ROMol']\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Read the screening-result CSV files\n",
+    "import pandas as pd\n",
+    "from rdkit import Chem\n",
+    "\n",
+    "print(\"Reading the screening-result CSV files...\")\n",
+    "\n",
+    "# Read the Scheme A results\n",
+    "df_a = pd.read_csv(\"../data/drug_targetmol/sandmeyer_candidates_scheme_A_heteroaryl_halides.csv\")\n",
+    "print(f\"Scheme A data: {len(df_a)} rows\")\n",
+    "\n",
+    "# Read the Scheme B results\n",
+    "df_b = pd.read_csv(\"../data/drug_targetmol/sandmeyer_candidates_scheme_B_aryl_halides.csv\")\n",
+    "print(f\"Scheme B data: {len(df_b)} rows\")\n",
+    "\n",
+    "# Rebuild the molecule objects from the exported SMILES\n",
+    "def rebuild_molecules(df):\n",
+    "    mols = []\n",
+    "    for idx, row in df.iterrows():\n",
+    "        smiles = row.get(\"SMILES_from_mol\", \"\")\n",
+    "        if smiles and str(smiles) != \"nan\":\n",
+    "            mol = Chem.MolFromSmiles(str(smiles))\n",
+    "            mols.append(mol)\n",
+    "        else:\n",
+    "            mols.append(None)\n",
+    "    df[\"ROMol\"] = mols\n",
+    "    valid_mols = sum(1 for m in mols if m is not None)\n",
+    "    print(f\"Rebuilt {valid_mols} molecule objects\")\n",
+    "    return df\n",
+    "\n",
+    "df_a = rebuild_molecules(df_a)\n",
+    "df_b = rebuild_molecules(df_b)\n",
+    "\n",
+    "# Both candidate tables now carry an ROMol column\n",
+    "print(\"\\nData loading complete\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
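+    "## Molecule Screening Functions\n",
+    "\n",
+    "### Screening Logic\n",
+    "\n",
+    "1. **Molecule validation**: make sure each molecule structure is valid\n",
+    "2. **Substructure matching**: use RDKit SMARTS matching\n",
+    "3. **Result recording**: record each matched pattern, its SMARTS, and its match count\n",
+    "4. **Data integrity**: keep all of the original property columns"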
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def screen_molecules_for_patterns(df, pattern_key):\n",
+    "    \"\"\"\n",
+    "    Screen for molecules that contain specific substructures.\n",
+    "    \n",
+    "    Args:\n",
+    "        df: DataFrame containing the molecules (in an ROMol column)\n",
+    "        pattern_key: key of the screening pattern group\n",
+    "    \n",
+    "    Returns:\n",
+    "        DataFrame of screening results\n",
+    "    \"\"\"\n",
+    "    pattern_info = PATTERN_MATCHERS[pattern_key]['info']\n",
+    "    matchers = PATTERN_MATCHERS[pattern_key]['matchers']\n",
+    "    \n",
+    "    print(f\"\\nStarting screen: {pattern_info['name']}\")\n",
+    "    print(f\"Description: {pattern_info['description']}\")\n",
+    "    print(f\"Number of SMARTS patterns: {len(pattern_info['smarts'])}\")\n",
+    "    \n",
+    "    matched_molecules = []\n",
+    "    \n",
+    "    for idx, row in df.iterrows():\n",
+    "        mol = row['ROMol']\n",
+    "        if mol is None:\n",
+    "            continue\n",
+    "            \n",
+    "        # Check whether the molecule matches any pattern\n",
+    "        matched_patterns = []\n",
+    "        for i, matcher in enumerate(matchers):\n",
+    "            if matcher is None:\n",
+    "                continue\n",
+    "            if mol.HasSubstructMatch(matcher):\n",
+    "                matched_patterns.append({\n",
+    "                    'pattern_index': i,\n",
+    "                    'smarts': pattern_info['smarts'][i],\n",
+    "                    'matches': len(mol.GetSubstructMatches(matcher))\n",
+    "                })\n",
+    "        \n",
+    "        if matched_patterns:\n",
+    "            # Create the match record\n",
+    "            match_record = row.copy()\n",
+    "            match_record['matched_patterns'] = matched_patterns\n",
+    "            match_record['total_matches'] = sum(p['matches'] for p in matched_patterns)\n",
+    "            match_record['screening_scheme'] = pattern_info['scheme']\n",
+    "            matched_molecules.append(match_record)\n",
+    "    \n",
+    "    result_df = pd.DataFrame(matched_molecules)\n",
+    "    print(f\"Found {len(result_df)} matching molecules\")\n",
+    "    \n",
+    "    return result_df\n",
+    "\n",
+    "def save_screening_results(df, filename, description):\n",
+    "    \"\"\"Save the screening results to CSV.\"\"\"\n",
+    "    output_path = f\"../data/drug_targetmol/{filename}\"\n",
+    "    \n",
+    "    # Convert the ROMol column to SMILES (ROMol objects cannot be stored in a CSV)\n",
+    "    df_export = df.copy()\n",
+    "    if 'ROMol' in df_export.columns:\n",
+    "        df_export['SMILES_from_mol'] = df_export['ROMol'].apply(lambda x: Chem.MolToSmiles(x) if x is not None else '')\n",
+    "        df_export = df_export.drop('ROMol', axis=1)\n",
+    "    \n",
+    "    df_export.to_csv(output_path, index=False, encoding='utf-8')\n",
+    "    print(f\"Results saved to: {output_path}\")\n",
+    "    print(f\"Contains {len(df_export)} molecules and {len(df_export.columns)} property columns\")\n",
+    "    \n",
+    "    return output_path"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
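+    "## Scheme A Screen: Heteroaryl Halides\n",
+    "\n",
+    "### Execution Logic\n",
+    "- Uses the most conservative screening strategy\n",
+    "- Matches only halogens on heteroaromatic rings\n",
+    "- Expected to give high-precision results\n",
+    "- Requires further verification of the synthetic routes"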
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
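+      "Starting screen: Heteroaryl halides\n",
+      "Description: Cl, Br, I atoms on heteroaromatic rings (Scheme A)\n",
+      "Number of SMARTS patterns: 7\n",
+      "Found 57 matching molecules\n",
+      "\n",
+      "Scheme A screening summary:\n",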
+      "                       Name           CAS          Formula  \\\n",
+      "1                 Danicopan  1903768-17-1    C26H23BrFN7O3   \n",
+      "8                Lonafarnib   193275-84-2  C27H31Br2ClN4O2   \n",
+      "19              Idoxuridine       54-42-2       C9H11IN2O5   \n",
+      "144          Dimenhydrinate      523-87-5     C24H28ClN5O3   \n",
+      "259           Sertaconazole    99592-32-2    C20H15Cl3N2OS   \n",
+      "311             Tioconazole    65899-73-2    C16H13Cl3N2OS   \n",
+      "337               Gimeracil   103766-25-2        C5H4ClNO2   \n",
+      "580  Bromocriptine mesylate    22260-51-1    C33H44BrN5O8S   \n",
+      "592             Clofarabine   123318-82-1    C10H11ClFN5O3   \n",
+      "684             Vorasidenib  1644545-52-7     C14H13ClF6N6   \n",
+      "\n",
+      "                                                                                        matched_patterns  \\\n",
+      "1    [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}, {'pattern_index': 2, 'smarts...   \n",
+      "8    [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}, {'pattern_index': 5, 'smart...   \n",
+      "19   [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 2}, {'pattern_index': 3, 'smart...   \n",
+      "144                                  [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 2}]   \n",
+      "259                                 [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}]   \n",
+      "311                                  [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}]   \n",
+      "337  [{'pattern_index': 1, 'smarts': '[n,o,s]cc[Cl,Br,I]', 'matches': 1}, {'pattern_index': 5, 'smart...   \n",
+      "580                                  [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}]   \n",
+      "592                                  [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 2}]   \n",
+      "684  [{'pattern_index': 0, 'smarts': '[n,o,s]c[Cl,Br,I]', 'matches': 1}, {'pattern_index': 2, 'smarts...   \n",
+      "\n",
+      "     total_matches  \n",
+      "1                2  \n",
+      "8                2  \n",
+      "19               3  \n",
+      "144              2  \n",
+      "259              1  \n",
+      "311              1  \n",
+      "337              2  \n",
+      "580              1  \n",
+      "592              2  \n",
+      "684              2  \n",
+      "Results saved to: ../data/drug_targetmol/sandmeyer_candidates_scheme_A_heteroaryl_halides.csv\n",
+      "Contains 57 molecules and 23 property columns\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Run the Scheme A screen. NOTE: df must be the full SDF library DataFrame (with an ROMol column); the CSV-loading cell above does not define it\n",
+    "scheme_a_results = screen_molecules_for_patterns(df, 'heteroaryl_halides')\n",
+    "\n",
+    "# Show a summary of the results\n",
+    "if len(scheme_a_results) > 0:\n",
+    "    print(\"\\nScheme A screening summary:\")\n",
+    "    print(scheme_a_results[['Name', 'CAS', 'Formula', 'matched_patterns', 'total_matches']].head(10))\n",
+    "    \n",
+    "    # Save the results\n",
+    "    save_screening_results(\n",
+    "        scheme_a_results, \n",
+    "        'sandmeyer_candidates_scheme_A_heteroaryl_halides.csv',\n",
+    "        'Scheme A: heteroaryl halide screening results'\n",
+    "    )\n",
+    "else:\n",
+    "    print(\"\\nScheme A found no matching molecules\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
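+    "## Scheme B Screen: All Aryl Halides\n",
+    "\n",
+    "### Execution Logic\n",
+    "- Uses a more permissive screening strategy\n",
+    "- Matches halogens on any aromatic ring\n",
+    "- Includes more candidate molecules\n",
+    "- Requires more literature verification work"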
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
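+      "Starting screen: Aryl halides\n",
+      "Description: Cl, Br, I atoms on any aromatic ring (Scheme B)\n",
+      "Number of SMARTS patterns: 4\n",
+      "Found 548 matching molecules\n",
+      "\n",
+      "Scheme B screening summary:\n",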
+      "                         Name           CAS          Formula  \\\n",
+      "1                   Danicopan  1903768-17-1    C26H23BrFN7O3   \n",
+      "8                  Lonafarnib   193275-84-2  C27H31Br2ClN4O2   \n",
+      "9                Ketoconazole    65277-42-1    C26H28Cl2N4O4   \n",
+      "13                   Ozanimod  1306760-87-1       C23H24N4O3   \n",
+      "14                  Ponesimod   854107-55-4    C23H25ClN2O4S   \n",
+      "19                Idoxuridine       54-42-2       C9H11IN2O5   \n",
+      "53                Moclobemide    71320-77-9     C13H17ClN2O2   \n",
+      "74                 Clemastine    15686-51-8       C21H26ClNO   \n",
+      "75  Buclizine dihydrochloride      129-74-8  C28H33ClN2·2HCl   \n",
+      "78                  Asenapine    65576-45-6       C17H16ClNO   \n",
+      "\n",
+      "                                                                                       matched_patterns  \\\n",
+      "1                                          [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}]   \n",
+      "8                                          [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 3}]   \n",
+      "9   [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 2}, {'pattern_index': 3, 'smarts': 'c1c...   \n",
+      "13                                              [{'pattern_index': 1, 'smarts': 'c-C#N', 'matches': 1}]   \n",
+      "14                                         [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}]   \n",
+      "19                                         [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}]   \n",
+      "53                                         [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}]   \n",
+      "74                                         [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}]   \n",
+      "75                                         [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}]   \n",
+      "78                                         [{'pattern_index': 0, 'smarts': 'c[Cl,Br,I]', 'matches': 1}]   \n",
+      "\n",
+      "    total_matches  \n",
+      "1               1  \n",
+      "8               3  \n",
+      "9               3  \n",
+      "13              1  \n",
+      "14              1  \n",
+      "19              1  \n",
+      "53              1  \n",
+      "74              1  \n",
+      "75              1  \n",
+      "78              1  \n",
+      "Results saved to: ../data/drug_targetmol/sandmeyer_candidates_scheme_B_aryl_halides.csv\n",
+      "Contains 548 molecules and 23 property columns\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Run the Scheme B screen. NOTE: df must be the full SDF library DataFrame (with an ROMol column)\n",
+    "scheme_b_results = screen_molecules_for_patterns(df, 'aryl_halides')\n",
+    "\n",
+    "# Show a summary of the results\n",
+    "if len(scheme_b_results) > 0:\n",
+    "    print(\"\\nScheme B screening summary:\")\n",
+    "    print(scheme_b_results[['Name', 'CAS', 'Formula', 'matched_patterns', 'total_matches']].head(10))\n",
+    "    \n",
+    "    # Save the results\n",
+    "    save_screening_results(\n",
+    "        scheme_b_results, \n",
+    "        'sandmeyer_candidates_scheme_B_aryl_halides.csv',\n",
+    "        'Scheme B: all aryl halide screening results'\n",
+    "    )\n",
+    "else:\n",
+    "    print(\"\\nScheme B found no matching molecules\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Result Analysis and Summary\n",
+    "\n",
+    "### Screening Statistics"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
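+      "=== Screening Statistics ===\n",
+      "Total molecules: 3276\n",
+      "Scheme A (heteroaryl halides) matches: 57\n",
+      "Scheme B (all aryl halides) matches: 548\n",
+      "Molecules matched by both schemes: 57\n",
+      "Molecules matched only by Scheme A: 0\n",
+      "Molecules matched only by Scheme B: 490\n"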
+     ]
+    }
+   ],
+   "source": [
+    "# Result statistics\n",
+    "print(\"=== Screening Statistics ===\")\n",
+    "print(f\"Total molecules: {len(df)}\")\n",
+    "print(f\"Scheme A (heteroaryl halides) matches: {len(scheme_a_results)}\")\n",
+    "print(f\"Scheme B (all aryl halides) matches: {len(scheme_b_results)}\")\n",
+    "\n",
+    "if len(scheme_a_results) > 0 and len(scheme_b_results) > 0:\n",
+    "    # Analyze the overlap (by unique, non-missing CAS number)\n",
+    "    scheme_a_cas = set(scheme_a_results['CAS'].dropna())\n",
+    "    scheme_b_cas = set(scheme_b_results['CAS'].dropna())\n",
+    "    overlap = scheme_a_cas & scheme_b_cas\n",
+    "    print(f\"Molecules matched by both schemes: {len(overlap)}\")\n",
+    "    \n",
+    "    # Scheme A only\n",
+    "    scheme_a_only = scheme_a_cas - scheme_b_cas\n",
+    "    print(f\"Molecules matched only by Scheme A: {len(scheme_a_only)}\")\n",
+    "    \n",
+    "    # Scheme B only\n",
+    "    scheme_b_only = scheme_b_cas - scheme_a_cas\n",
+    "    print(f\"Molecules matched only by Scheme B: {len(scheme_b_only)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
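+    "## Usage Notes\n",
+    "\n",
+    "### Recommended Priorities\n",
+    "\n",
+    "1. **First priority**: Scheme A results\n",
+    "   - Heteroaryl halides are the most likely Sandmeyer products\n",
+    "   - The candidate list is relatively short, which makes in-depth study practical\n",
+    "   - Focus on reviewing the synthetic routes of these molecules\n",
+    "\n",
+    "2. **Second priority**: results unique to Scheme B\n",
+    "   - Halogens on benzene rings can come from many routes\n",
+    "   - The likelihood of a Sandmeyer step needs careful assessment\n",
+    "   - Suitable as a supplementary screen\n",
+    "\n",
+    "### Follow-up Verification Steps\n",
+    "\n",
+    "1. **Literature survey**: review the synthetic routes of the candidate molecules\n",
+    "2. **Reaction condition assessment**: confirm whether a Sandmeyer reaction was used\n",
+    "3. **Economic analysis**: assess the potential of the Zhang Xiaheng (张夏恒) reaction for each molecule\n",
+    "4. **Experimental validation**: run small-scale validation experiments where necessary\n",
+    "\n",
+    "### Caveats\n",
+    "\n",
+    "- This screen is based on structural features and does not confirm a synthetic route\n",
+    "- Some halogens may come from the starting materials rather than from a synthetic step\n",
+    "- Molecular complexity and synthetic feasibility must be weighed together\n",
+    "- Prioritize candidates by also considering each drug's importance and market size"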
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
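+    "## Molecular Structure Visualization\n",
+    "\n",
+    "### Create the Output Directories and Visualization Functions\n",
+    "\n",
+    "This section generates high-resolution SVG structure drawings for the screened candidates, with the halogen atoms highlighted."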
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.14.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
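Note on the overlap statistics in the notebook above: the statistics cell compares sets of CAS numbers after `dropna()`, so the set-based counts need not add up to the row counts. In the recorded output, 57 overlapping plus 490 Scheme-B-only molecules gives 547, one fewer than the 548 Scheme B rows, which would be consistent with a missing or duplicated CAS value in that table (an assumption, not checked against the data). The sketch below reproduces the effect on made-up toy lists; `overlap_stats` is a hypothetical helper, not a function from the notebook.

```python
def overlap_stats(cas_a, cas_b):
    """Compare two lists of CAS numbers the way the notebook does:
    drop missing values, deduplicate into sets, then use set arithmetic."""
    set_a = {c for c in cas_a if c is not None}
    set_b = {c for c in cas_b if c is not None}
    return {
        "rows_a": len(cas_a),
        "rows_b": len(cas_b),
        "overlap": len(set_a & set_b),
        "a_only": len(set_a - set_b),
        "b_only": len(set_b - set_a),
    }


# Toy data (illustrative only): scheme B has one duplicated and one missing CAS
scheme_a = ["50-02-2", "54-42-2"]
scheme_b = ["50-02-2", "54-42-2", "523-87-5", "523-87-5", None]

stats = overlap_stats(scheme_a, scheme_b)
print(stats)
# Rows that the set-based counts do not account for
print("unaccounted B rows:", stats["rows_b"] - stats["overlap"] - stats["b_only"])
```

If exact row-level accounting matters, comparing on the DataFrame index or on a guaranteed-unique identifier instead of CAS would avoid this discrepancy.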
--- a/notebooks/smarts_match_visualization.ipynb
+++ b/notebooks/smarts_match_visualization.ipynb
@@ -0,0 +1,285 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
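+        "# SMARTS Match Detection and Visualization\n",
+        "\n",
+        "This notebook:\n",
+        "1. Reads the smiles column from ring16/temp.csv\n",
+        "2. Tests each molecule against the SMARTS pattern `O=C1C[C@@H](O)[*:15][*:17][*:18]C[*:23]C(=O)/C=C/[*:28]=C/[*:7][*:8]O1`\n",
+        "3. Handles the dummy atoms ([*:X]) in two ways:\n",
+        "   - leaving the dummy atoms unchanged\n",
+        "   - replacing the dummy atoms with C\n",
+        "4. Visualizes the matches with the matched atoms highlighted\n"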
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 1. Import the required libraries\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 1,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "All modules imported successfully!\n"
+          ]
+        }
+      ],
+      "source": [
+        "import sys\n",
+        "from pathlib import Path\n",
+        "import re\n",
+        "\n",
+        "# Add the project root to sys.path (assumes the kernel's working directory is notebooks/)\n",
+        "notebook_dir = Path().resolve()\n",
+        "project_root = notebook_dir.parent\n",
+        "sys.path.insert(0, str(project_root))\n",
+        "\n",
+        "from rdkit import Chem\n",
+        "from rdkit.Chem import Draw\n",
+        "from rdkit.Chem.Draw import rdMolDraw2D\n",
+        "from IPython.display import SVG, display, HTML\n",
+        "import pandas as pd\n",
+        "import numpy as np\n",
+        "import matplotlib.pyplot as plt\n",
+        "from collections import Counter\n",
+        "\n",
+        "print(\"All modules imported successfully!\")\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 2. Load the data\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Dataset size: 2022 molecules\n",
+            "Columns: ['IDs', 'molecule_pref_name', 'max_pChEMBL', 'max_pChEMBL_target', '# Target Organisms', 'Target Organisms', '# Known Targets', 'Known Targets', 'target_pref_name', 'smiles']\n",
+            "\n",
+            "smiles column found, with 2022 non-null SMILES\n",
+            "\n",
+            "First 5 SMILES examples:\n",
+            "['C/C(=C\\\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[C@H](C)[C@H](O)[C@@H](C)C(=O)C(C)(C)[C@@H](O)CC(=O)O1', 'C/C(=C\\\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[C@H](C)[C@H](O)[C@@H](C)C(=O)C(C)(C)[C@@H](O)CC(=O)O1', 'Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C)C(=O)N5CCC[C@H]5C(=O)N(C)CC(=O)N(C)[C@@H](C(C)C)C(=O)O[C@@H]4C)c3nc-2c(C(=O)N[C@@H]2C(=O)N[C@H](C(C)C)C(=O)N3CCC[C@H]3C(=O)N(C)CC(=O)N(C)[C@@H](C(C)C)C(=O)O[C@@H]2C)c(N)c1=O', 'CCCCCCCC(=O)SCC/C=C/[C@@H]1CC(=O)NCc2nc(cs2)C2=N[C@@](C)(CS2)C(=O)N[C@@H](C(C)C)C(=O)O1', 'Cc1cc2ccc1[C@@H](C)COC(=O)Nc1ccc(S(=O)(=O)C3CC3)c(c1)CN(C)C(=O)[C@@H]2Nc1ccc2c(N)ncc(F)c2c1']\n"
+          ]
+        },
+        {
+          "data": {
+            "text/html": [
+              "<div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>IDs</th>\n",
+              "      <th>molecule_pref_name</th>\n",
+              "      <th>max_pChEMBL</th>\n",
+              "      <th>max_pChEMBL_target</th>\n",
+              "      <th># Target Organisms</th>\n",
+              "      <th>Target Organisms</th>\n",
+              "      <th># Known Targets</th>\n",
+              "      <th>Known Targets</th>\n",
+              "      <th>target_pref_name</th>\n",
+              "      <th>smiles</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>CHEMBL94657</td>\n",
+              "      <td>PATUPILONE</td>\n",
+              "      <td>10.67</td>\n",
+              "      <td>CHEMBL1075590</td>\n",
+              "      <td>695</td>\n",
+              "      <td>Sus scrofa, Mus musculus, None, Plasmodium fal...</td>\n",
+              "      <td>695</td>\n",
+              "      <td>CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE...</td>\n",
+              "      <td>AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6...</td>\n",
+              "      <td>C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>CHEMBL94657</td>\n",
+              "      <td>PATUPILONE</td>\n",
+              "      <td>10.67</td>\n",
+              "      <td>CHEMBL1075590</td>\n",
+              "      <td>695</td>\n",
+              "      <td>Sus scrofa, Mus musculus, None, Plasmodium fal...</td>\n",
+              "      <td>695</td>\n",
+              "      <td>CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE...</td>\n",
+              "      <td>AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6...</td>\n",
+              "      <td>C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>CHEMBL1554</td>\n",
+              "      <td>DACTINOMYCIN</td>\n",
+              "      <td>10.10</td>\n",
+              "      <td>CHEMBL614533</td>\n",
+              "      <td>177</td>\n",
+              "      <td>Giardia intestinalis, Trypanosoma cruzi, Equus...</td>\n",
+              "      <td>177</td>\n",
+              "      <td>CHEMBL388, CHEMBL614151, CHEMBL3577, CHEMBL551...</td>\n",
+              "      <td>HT-29, CCRF-CEM, WIL2-NS, Unchecked, Caspase-7...</td>\n",
+              "      <td>Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>CHEMBL1173445</td>\n",
+              "      <td>LARGAZOLE</td>\n",
+              "      <td>8.80</td>\n",
+              "      <td>CHEMBL612545</td>\n",
+              "      <td>45</td>\n",
+              "      <td>Homo sapiens, None</td>\n",
+              "      <td>45</td>\n",
+              "      <td>CHEMBL392, CHEMBL3192, CHEMBL3524, CHEMBL5103,...</td>\n",
+              "      <td>Histone deacetylase 9, Ubiquitin-like modifier...</td>\n",
+              "      <td>CCCCCCCC(=O)SCC/C=C/[C@@H]1CC(=O)NCc2nc(cs2)C2...</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>CHEMBL3902498</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>9.37</td>\n",
+              "      <td>CHEMBL2095194,CHEMBL3991</td>\n",
+              "      <td>17</td>\n",
+              "      <td>Homo sapiens, None</td>\n",
+              "      <td>17</td>\n",
+              "      <td>CHEMBL2820, CHEMBL3991, CHEMBL1801, CHEMBL204,...</td>\n",
+              "      <td>Coagulation factor X, Kallikrein 1, Coagulatio...</td>\n",
+              "      <td>Cc1cc2ccc1[C@@H](C)COC(=O)Nc1ccc(S(=O)(=O)C3CC...</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>"
+            ],
+            "text/plain": [
+              "             IDs molecule_pref_name  max_pChEMBL        max_pChEMBL_target  \\\n",
+              "0    CHEMBL94657         PATUPILONE        10.67             CHEMBL1075590   \n",
+              "1    CHEMBL94657         PATUPILONE        10.67             CHEMBL1075590   \n",
+              "2     CHEMBL1554       DACTINOMYCIN        10.10              CHEMBL614533   \n",
+              "3  CHEMBL1173445          LARGAZOLE         8.80              CHEMBL612545   \n",
+              "4  CHEMBL3902498                NaN         9.37  CHEMBL2095194,CHEMBL3991   \n",
+              "\n",
+              "   # Target Organisms                                   Target Organisms  \\\n",
+              "0                 695  Sus scrofa, Mus musculus, None, Plasmodium fal...   \n",
+              "1                 695  Sus scrofa, Mus musculus, None, Plasmodium fal...   \n",
+              "2                 177  Giardia intestinalis, Trypanosoma cruzi, Equus...   \n",
+              "3                  45                                 Homo sapiens, None   \n",
+              "4                  17                                 Homo sapiens, None   \n",
+              "\n",
+              "   # Known Targets                                      Known Targets  \\\n",
+              "0              695  CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE...   \n",
+              "1              695  CHEMBL612519, CHEMBL614129, CHEMBL1075484, CHE...   \n",
+              "2              177  CHEMBL388, CHEMBL614151, CHEMBL3577, CHEMBL551...   \n",
+              "3               45  CHEMBL392, CHEMBL3192, CHEMBL3524, CHEMBL5103,...   \n",
+              "4               17  CHEMBL2820, CHEMBL3991, CHEMBL1801, CHEMBL204,...   \n",
+              "\n",
+              "                                    target_pref_name  \\\n",
+              "0  AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6...   \n",
+              "1  AGS, NCI-H1703, MKN-7, HT-1080, NCI-H226, Lu-6...   \n",
+              "2  HT-29, CCRF-CEM, WIL2-NS, Unchecked, Caspase-7...   \n",
+              "3  Histone deacetylase 9, Ubiquitin-like modifier...   \n",
+              "4  Coagulation factor X, Kallikrein 1, Coagulatio...   \n",
+              "\n",
+              "                                              smiles  \n",
+              "0  C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[...  \n",
+              "1  C/C(=C\\c1csc(C)n1)[C@@H]1C[C@@H]2O[C@]2(C)CCC[...  \n",
+              "2  Cc1c2oc3c(C)ccc(C(=O)N[C@@H]4C(=O)N[C@H](C(C)C...  \n",
+              "3  CCCCCCCC(=O)SCC/C=C/[C@@H]1CC(=O)NCc2nc(cs2)C2...  \n",
+              "4  Cc1cc2ccc1[C@@H](C)COC(=O)Nc1ccc(S(=O)(=O)C3CC...  "
+            ]
+          },
+          "execution_count": 2,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "# Read the CSV file\n",
+        "input_file = project_root / 'ring16' / 'temp.csv'\n",
+        "df = pd.read_csv(input_file)\n",
+        "\n",
+        "print(f\"Dataset size: {len(df)} molecules\")\n",
+        "print(f\"Columns: {df.columns.tolist()}\")\n",
+        "\n",
+        "# Check for the smiles column (notna() counts non-null entries; it does not validate them)\n",
+        "if 'smiles' in df.columns:\n",
+        "    print(f\"\\nsmiles column found, with {df['smiles'].notna().sum()} non-null SMILES\")\n",
+        "    print(\"\\nFirst 5 SMILES examples:\")\n",
+        "    print(df['smiles'].head().tolist())\n",
+        "else:\n",
+        "    print(\"Error: smiles column not found\")\n",
+        "    \n",
+        "df.head()\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 3. Define the SMARTS pattern and helper functions\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": []
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.14.0"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+}
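
The notebook's introduction describes two ways of handling the dummy atoms (`[*:X]`) in the SMARTS pattern, but the section 3 code cell is empty in this commit. The following is a minimal sketch of that string-level preprocessing, using only the standard library. The helper names `dummy_map_numbers` and `replace_dummies` are hypothetical and do not come from the notebook; only the pattern itself is taken from it.

```python
import re

# SMARTS pattern quoted in the notebook's introduction.
SMARTS = (
    "O=C1C[C@@H](O)[*:15][*:17][*:18]C[*:23]C(=O)"
    "/C=C/[*:28]=C/[*:7][*:8]O1"
)

# Matches a mapped dummy atom such as [*:15]; the group captures the map number.
DUMMY_ATOM = re.compile(r"\[\*:(\d+)\]")


def dummy_map_numbers(pattern: str) -> list[int]:
    """Return the atom-map numbers of all dummy atoms, in order of appearance."""
    return [int(n) for n in DUMMY_ATOM.findall(pattern)]


def replace_dummies(pattern: str, element: str = "C") -> str:
    """Replace every mapped dummy atom with a plain element symbol."""
    return DUMMY_ATOM.sub(element, pattern)


# The two variants named in the introduction: dummies kept, and dummies as C.
variants = {
    "dummies_kept": SMARTS,
    "dummies_as_C": replace_dummies(SMARTS),
}
for name, pattern in variants.items():
    print(f"{name}: {pattern}")
print("map numbers:", dummy_map_numbers(SMARTS))
```

Either variant could then be passed to RDKit's `Chem.MolFromSmarts` for substructure matching. The two are not equivalent queries: in SMARTS, `*` matches any atom, while `C` matches only aliphatic carbon, so the replaced variant is the stricter of the two and may match fewer molecules.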
--- a/notebooks/test_align_two_molecules.ipynb
+++ b/notebooks/test_align_two_molecules.ipynb