Add AGENTS.md and ring numbering test

- Add AGENTS.md with AI assistant guidelines for the project
- Add tests/test_ring_numbering.py to verify ring numbering consistency
- Test confirms atom numbering is fixed and reproducible

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-23 09:53:27 +08:00
parent 0d99f7d12c
commit 0ced0fa816
2 changed files with 478 additions and 0 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,255 @@
+# AGENTS.md
+
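+This file gives AI coding assistants (such as Claude, Copilot, and Cursor) project context and development guidelines.
+
+## Project Overview
+
+**Macrolactone Fragmenter** is a specialized tool for cleaving and analyzing the side chains of macrolactones (12- to 20-membered rings), intended for cheminformatics research.
+
+### Core Features
+- Intelligent ring-atom numbering (based on the lactone structure)
+- Automated side-chain cleavage analysis
+- Molecule visualization (SVG/PNG)
+- Batch processing and data export
+
+## Tech Stack
+
+| Component | Technology |
+|------|------|
+| Language | Python 3.8+ |
+| Chemistry library | RDKit |
+| Data processing | Pandas, NumPy |
+| Visualization | Matplotlib, Seaborn |
+| Environment management | Pixi (recommended) / Conda |
+| Documentation | MkDocs + Material |
+| Testing | Pytest |
+| Formatting and linting | Black, Flake8 |
+
+## Project Structure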
+
+```
+macro_split/
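+├── src/                           # Core source code
+│   ├── __init__.py               # Package initialization
+│   ├── macrolactone_fragmenter.py # ⭐ Main entry-point class
+│   ├── macro_lactone_analyzer.py  # Ring-count analyzer
+│   ├── ring_numbering.py          # Ring numbering system
+│   ├── ring_visualization.py      # Visualization tools
+│   ├── fragment_cleaver.py        # Side-chain cleavage logic
+│   ├── fragment_dataclass.py      # Fragment data classes
+│   └── visualizer.py              # Statistical visualization
+├── notebooks/                     # Jupyter Notebook examples
+├── scripts/                       # Batch-processing scripts
+├── tests/                         # Unit tests
+├── docs/                          # Documentation
+├── pyproject.toml                 # Project configuration
+├── pixi.toml                      # Pixi environment configuration
+└── mkdocs.yml                     # Documentation configuration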
+```
+
+## Core Modules
+
+### MacrolactoneFragmenter (main entry point)
+```python
+from src.macrolactone_fragmenter import MacrolactoneFragmenter
+
+fragmenter = MacrolactoneFragmenter(ring_size=16)
+result = fragmenter.process_molecule(smiles, parent_id="mol_001")
+```
+
+### MacroLactoneAnalyzer (ring-count analysis)
+```python
+from src.macro_lactone_analyzer import MacroLactoneAnalyzer
+
+analyzer = MacroLactoneAnalyzer()
+info = analyzer.get_single_ring_info(smiles)
+```
+
+### Data class structure
+```python
+@dataclass
+class Fragment:
+    fragment_smiles: str      # Fragment SMILES
+    parent_smiles: str        # Parent molecule SMILES
+    cleavage_position: int    # Cleavage position (1-N)
+    fragment_id: str          # Fragment ID
+    parent_id: str            # Parent molecule ID
+    atom_count: int           # Number of atoms
+    molecular_weight: float   # Molecular weight
+```
+
+## Development Commands
+
+### Environment setup
+```bash
+# Install dependencies
+pixi install
+
+# Activate the environment
+pixi shell
+```
+
+### Code quality
+```bash
+# Format the code
+pixi run black src/
+
+# Lint the code
+pixi run flake8 src/
+
+# Run the tests
+pixi run pytest
+
+# Measure test coverage
+pixi run pytest --cov=src
+```
+
+### Documentation
+```bash
+# Preview the docs locally
+pixi run mkdocs serve
+
+# Build the docs
+pixi run mkdocs build
+```
+
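+## Coding Conventions
+
+### Python style
+- Format with Black, line width 100 characters
+- Use Google-style docstrings
+- Type annotations: every public function must have type hints
+- Naming: PascalCase for classes, snake_case for functions and variables
+
+### Docstring example
+```python
+def process_molecule(self, smiles: str, parent_id: Optional[str] = None) -> FragmentResult:
+    """
+    Process a single molecule and run side-chain cleavage analysis.
+
+    Args:
+        smiles: SMILES string of the molecule
+        parent_id: Optional molecule identifier
+
+    Returns:
+        A FragmentResult object containing all fragment information
+
+    Raises:
+        ValueError: If the SMILES is invalid or the ring is not the target size
+
+    Example:
+        >>> fragmenter = MacrolactoneFragmenter(ring_size=16)
+        >>> result = fragmenter.process_molecule("C1CC...")
+    """
+```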
+
+### Import order
+```python
+# 1. Standard library
+import json
+from pathlib import Path
+from typing import List, Dict, Optional
+
+# 2. Third-party libraries
+import pandas as pd
+import numpy as np
+from rdkit import Chem
+
+# 3. Local modules
+from src.fragment_dataclass import Fragment
+from src.ring_numbering import RingNumbering
+```
+
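+## Key Concepts
+
+### Ring numbering system
+- **Position 1**: the carbonyl carbon (the C of C=O)
+- **Position 2**: the ester oxygen (the O in the ring)
+- **Positions 3-N**: the remaining ring atoms, numbered in order
+
+### Supported ring sizes
+- 12- to 20-membered rings
+- 16-membered rings are processed by default
+
+### SMARTS pattern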
+```python
+# Lactone bond SMARTS (16-membered ring example); r<n> tests ring size, R<n> would test ring count
+LACTONE_SMARTS_16 = "[C;r16](=O)[O;r16]"
+```
+
+## Testing Guide
+
+### Running tests
+```bash
+# All tests
+pixi run pytest
+
+# A specific module
+pixi run pytest tests/test_fragmenter.py
+
+# Verbose output
+pixi run pytest -v
+
+# A single test
+pixi run pytest tests/test_fragmenter.py::test_process_molecule
+```
+
+### Test data
+Example SMILES for testing (16-membered macrolactones):
+```python
+TEST_SMILES = [
+    "O=C1CCCCCCCC(=O)OCC/C=C/C=C/1",  # Simple 16-membered ring
+    "CCC1OC(=O)C[C@H](O)C(C)[C@@H](O)...",  # Complex structure
+]
+```
+
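+## Common Tasks
+
+### Adding a new feature
+1. Create or modify a module under `src/`
+2. Update `src/__init__.py` to export the new classes/functions
+3. Write unit tests
+4. Update the documentation
+
+### Handling a different ring size
+```python
+# Specify the ring size when constructing MacrolactoneFragmenter
+fragmenter = MacrolactoneFragmenter(ring_size=14)  # 14-membered ring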
+```
+
+### Batch processing
+```python
+results = fragmenter.process_csv(
+    "data/molecules.csv",
+    smiles_column="smiles",
+    id_column="unique_id",
+    max_rows=1000
+)
+df = fragmenter.batch_to_dataframe(results)
+```
+
+## Notes
+
+### RDKit dependency
+- Install RDKit through conda/pixi; this project does not support pip-based installs
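+- Verify that RDKit is available in the environment: `python -c "from rdkit import Chem; print('OK')"`
+
+### Performance considerations
+- For large datasets, use the `process_csv` method
+- Throughput is roughly 100 molecules per minute
+- For large-scale runs, consider `scripts/batch_process_*.py`
+
+### Error handling
+- An invalid SMILES raises `ValueError`
+- Molecules whose ring is not the target size are skipped
+- Batch processing logs the molecules that failed
+
+## Related Resources
+
+- **Documentation**: the `docs/` directory, or run `pixi run mkdocs serve`
+- **Examples**: `notebooks/filter_molecules.ipynb`
+- **Scripts**: `scripts/README.md`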
+
+---
+
+*Last updated: 2026-01-23*
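Note on the ring numbering rule described in AGENTS.md: position 1 is the carbonyl carbon, position 2 is the ring ester oxygen, and the remaining ring atoms follow in order. A minimal sketch of that rule in plain Python is shown here, under stated assumptions. `number_ring` is a hypothetical helper and is not taken from `src/ring_numbering.py`; it assumes the ring atoms are already listed in bonded order and works on plain atom indices rather than an RDKit molecule.

```python
from typing import Dict, List


def number_ring(ordered_ring: List[int], carbonyl_c: int, ester_o: int) -> Dict[int, int]:
    """Map atom index -> 1-based ring position (hypothetical helper).

    ordered_ring lists the ring atoms in bonded order, in either direction.
    """
    n = len(ordered_ring)
    start = ordered_ring.index(carbonyl_c)
    # Pick the walking direction that reaches the ester oxygen right after the carbonyl carbon.
    if ordered_ring[(start + 1) % n] == ester_o:
        step = 1
    elif ordered_ring[(start - 1) % n] == ester_o:
        step = -1
    else:
        raise ValueError("ester oxygen is not adjacent to the carbonyl carbon in the ring")
    return {ordered_ring[(start + step * k) % n]: k + 1 for k in range(n)}


# Toy 6-atom ring: atom 12 is the carbonyl carbon, atom 7 the ester oxygen.
numbering = number_ring([3, 7, 12, 20, 21, 22], carbonyl_c=12, ester_o=7)
print(numbering)  # {12: 1, 7: 2, 3: 3, 22: 4, 21: 5, 20: 6}
```

Because the starting atom and the direction are both fixed by the lactone bond, the same ordered ring always yields the same numbering, which is the property the test file in this commit checks.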
--- a/tests/test_ring_numbering.py
+++ b/tests/test_ring_numbering.py
@@ -0,0 +1,223 @@
+"""
+Test ring numbering: verify that the atom numbering is stable across runs
+"""
+import sys
+from pathlib import Path
+sys.path.insert(0, str(Path(__file__).resolve().parents[1]))  # repo root, so `src` is importable
+from rdkit import Chem
+from rdkit.Chem import AllChem
+from rdkit.Chem.Draw import rdMolDraw2D
+from src.ring_visualization import (
+    get_macrolactone_numbering,
+    get_ring_atoms_by_size
+)
+
+
+def test_ring_numbering_consistency(smiles: str, ring_size: int = 16, num_tests: int = 5):
+    """
+    Check ring numbering consistency: run the numbering several times and confirm it is stable
+    """
+    print("=" * 70)
+    print("Ring numbering consistency test")
+    print("=" * 70)
+    print(f"\nSMILES: {smiles[:80]}...")
+    print(f"Ring size: {ring_size}")
+    print(f"Number of runs: {num_tests}")
+
+    # Parse the molecule
+    mol = Chem.MolFromSmiles(smiles)
+    if mol is None:
+        print("❌ Could not parse SMILES")
+        return False
+
+    print(f"✓ Molecule parsed successfully: {mol.GetNumAtoms()} atoms")
+
+    # Detect the ring size
+    ring_atoms = get_ring_atoms_by_size(mol, ring_size)
+    if ring_atoms is None:
+        for size in range(12, 21):
+            ring_atoms = get_ring_atoms_by_size(mol, size)
+            if ring_atoms:
+                ring_size = size
+                print(f"⚠️  Using the detected {size}-membered ring")
+                break
+
+    if ring_atoms is None:
+        print("❌ No 12- to 20-membered ring found")
+        return False
+
+    print(f"✓ Found a {ring_size}-membered ring with {len(ring_atoms)} atoms")
+
+    # Repeat the numbering several times to check consistency
+    all_numberings = []
+    all_carbonyl_carbons = []
+    all_ester_oxygens = []
+
+    for i in range(num_tests):
+        result = get_macrolactone_numbering(mol, ring_size)
+        ring_atoms_result, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen, (is_valid, reason) = result
+
+        if not is_valid:
+            print(f"❌ Run {i+1} failed: {reason}")
+            return False
+
+        all_numberings.append(ring_numbering.copy())
+        all_carbonyl_carbons.append(carbonyl_carbon)
+        all_ester_oxygens.append(ester_oxygen)
+
+    # Verify consistency
+    print("\n" + "-" * 50)
+    print("Numbering consistency check:")
+    print("-" * 50)
+
+    is_consistent = True
+
+    if len(set(all_carbonyl_carbons)) == 1:
+        print(f"✓ Carbonyl carbon position is consistent: atom index {all_carbonyl_carbons[0]}")
+    else:
+        print(f"❌ Carbonyl carbon position is inconsistent: {all_carbonyl_carbons}")
+        is_consistent = False
+
+    if len(set(all_ester_oxygens)) == 1:
+        print(f"✓ Ester oxygen position is consistent: atom index {all_ester_oxygens[0]}")
+    else:
+        print(f"❌ Ester oxygen position is inconsistent: {all_ester_oxygens}")
+        is_consistent = False
+
+    first_numbering = all_numberings[0]
+    for i, numbering in enumerate(all_numberings[1:], 2):
+        if numbering != first_numbering:
+            print(f"❌ Numbering from run {i} differs from run 1")
+            is_consistent = False
+            break
+
+    if is_consistent:
+        print(f"✓ Numbering is identical across all {num_tests} runs")
+
+    # Show the detailed numbering
+    print("\n" + "-" * 50)
+    print("Ring atom numbering details:")
+    print("-" * 50)
+
+    numbering = all_numberings[0]
+    carbonyl_carbon = all_carbonyl_carbons[0]
+    ester_oxygen = all_ester_oxygens[0]
+
+    sorted_items = sorted(numbering.items(), key=lambda x: x[1])
+
+    print(f"{'Pos':<6} {'Atom idx':<10} {'Elem':<6} {'Note'}")
+    print("-" * 40)
+
+    for atom_idx, position in sorted_items:
+        atom = mol.GetAtomWithIdx(atom_idx)
+        symbol = atom.GetSymbol()
+        note = ""
+        if atom_idx == carbonyl_carbon:
+            note = "← carbonyl carbon (C=O)"
+        elif atom_idx == ester_oxygen:
+            note = "← ester oxygen"
+        print(f"{position:<6} {atom_idx:<10} {symbol:<6} {note}")
+
+    return is_consistent
+
+
+def save_visualization(smiles: str, output_path: str, ring_size: int = 16):
+    """Save a visualization of the molecule."""
+    print("\n" + "=" * 70)
+    print("Saving visualization")
+    print("=" * 70)
+
+    mol = Chem.MolFromSmiles(smiles)
+    if mol is None:
+        print("❌ Could not parse SMILES")
+        return
+
+    # Prefer the requested ring size; otherwise fall back to the first 12- to 20-membered ring found
+    for size in [ring_size, *range(12, 21)]:
+        if get_ring_atoms_by_size(mol, size):
+            ring_size = size
+            break
+
+    result = get_macrolactone_numbering(mol, ring_size)
+    ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen, (is_valid, reason) = result
+
+    if not is_valid:
+        print(f"❌ Could not get the numbering: {reason}")
+        return
+
+    mol_copy = Chem.Mol(mol)
+    AllChem.Compute2DCoords(mol_copy)
+
+    for atom_idx in ring_atoms:
+        if atom_idx in ring_numbering:
+            atom = mol_copy.GetAtomWithIdx(atom_idx)
+            atom.SetProp("atomNote", str(ring_numbering[atom_idx]))
+
+    atom_colors = {}
+    for atom_idx in ring_atoms:
+        atom = mol.GetAtomWithIdx(atom_idx)
+        symbol = atom.GetSymbol()
+
+        if atom_idx == carbonyl_carbon:
+            atom_colors[atom_idx] = (1.0, 0.6, 0.0)
+        elif atom_idx == ester_oxygen:
+            atom_colors[atom_idx] = (1.0, 0.4, 0.4)
+        elif symbol == 'C':
+            atom_colors[atom_idx] = (0.7, 0.85, 1.0)
+        elif symbol == 'O':
+            atom_colors[atom_idx] = (1.0, 0.7, 0.7)
+        elif symbol == 'N':
+            atom_colors[atom_idx] = (0.8, 0.7, 1.0)
+        else:
+            atom_colors[atom_idx] = (0.8, 1.0, 0.8)
+
+    drawer = rdMolDraw2D.MolDraw2DSVG(1000, 1000)
+    drawer.SetFontSize(14)
+    drawer.DrawMolecule(mol_copy, highlightAtoms=list(ring_atoms), highlightAtomColors=atom_colors)
+    drawer.FinishDrawing()
+    svg = drawer.GetDrawingText()
+
+    svg_path = output_path.replace('.png', '.svg')
+    with open(svg_path, 'w', encoding='utf-8') as f:
+        f.write(svg)
+    print(f"✓ SVG saved to: {svg_path}")
+
+    try:
+        drawer_png = rdMolDraw2D.MolDraw2DCairo(1000, 1000)
+        drawer_png.SetFontSize(14)
+        drawer_png.DrawMolecule(mol_copy, highlightAtoms=list(ring_atoms), highlightAtomColors=atom_colors)
+        drawer_png.FinishDrawing()
+        drawer_png.WriteDrawingText(output_path)
+        print(f"✓ PNG saved to: {output_path}")
+    except Exception as e:
+        print(f"⚠️  Failed to save PNG: {e}")
+
+    print("\nColor legend:")
+    print("  Orange: carbonyl carbon (position 1)")
+    print("  Red: ester oxygen (position 2)")
+    print("  Light blue: ring carbon atoms")
+
+
+def main():
+    smiles = "O[C@H]1[C@H]([C@H]([C@H](OC[C@@H]2[C@@H](CC)OC(C[C@H]([C@H](C)[C@H]([C@@H](CC=O)C[C@@H](C)C(/C=C/C(/C)=C/2)=O)O[C@H]2[C@@H]([C@H]([C@@H]([C@@H](C)O2)O[C@H]2C[C@](C)([C@@H]([C@@H](C)O2)O)O)[N@](C)C)O)O)=O)O[C@@H]1C)OC)OC"
+
+    print("\nMacrolactone ring numbering test\n")
+    is_consistent = test_ring_numbering_consistency(smiles, ring_size=16, num_tests=5)
+
+    output_path = "output/test_ring_numbering.png"  # relative to the current working directory
+    save_visualization(smiles, output_path, ring_size=16)
+
+    print("\n" + "=" * 70)
+    print("Test summary")
+    print("=" * 70)
+    if is_consistent:
+        print("✅ All tests passed. The ring atom numbering is stable.")
+    else:
+        print("❌ Test failed: the ring atom numbering is inconsistent")
+
+    return is_consistent
+
+
+if __name__ == "__main__":
+    success = main()
+    sys.exit(0 if success else 1)
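Note on running this file under pytest: `test_ring_numbering_consistency` declares `smiles` without a default value, so pytest would try to resolve it as a fixture during collection, and the function reports its outcome through a return value rather than assertions. A minimal sketch of an assertion-based variant follows. `assert_deterministic` and `fake_numbering` are hypothetical names introduced for illustration; the stand-in replaces the real `get_macrolactone_numbering(mol, ring_size)` call so that the sketch runs without RDKit.

```python
from typing import Callable, Dict


def assert_deterministic(func: Callable[[], Dict[int, int]], runs: int = 5) -> Dict[int, int]:
    """Call func `runs` times and assert that every result equals the first one."""
    first = func()
    for i in range(2, runs + 1):
        again = func()
        assert again == first, f"run {i} differs from run 1: {again} != {first}"
    return first


def fake_numbering() -> Dict[int, int]:
    # Stand-in (assumption) for the ring_numbering dict, which the script above
    # reads from index 1 of the tuple returned by get_macrolactone_numbering.
    return {12: 1, 7: 2, 3: 3}


def test_numbering_is_deterministic() -> None:
    numbering = assert_deterministic(fake_numbering, runs=5)
    # Positions should be exactly 1..N with no gaps or repeats.
    assert sorted(numbering.values()) == list(range(1, len(numbering) + 1))


test_numbering_is_deterministic()
```

With this shape the test takes no arguments, so pytest can collect it directly, and a mismatch fails the run instead of only printing a message.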