Add AGENTS.md and ring numbering test
- Add AGENTS.md with AI assistant guidelines for the project
- Add tests/test_ring_numbering.py to verify ring numbering consistency
- Test confirms atom numbering is fixed and reproducible

Co-Authored-By: Claude <noreply@anthropic.com>
255
AGENTS.md
Normal file
@@ -0,0 +1,255 @@
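# AGENTS.md

This file gives AI coding assistants (such as Claude, Copilot, and Cursor) project context and development guidelines.

## Project Overview

**Macrolactone Fragmenter** is a specialized tool for cleaving and analyzing the side chains of macrolactones (12- to 20-membered rings), intended for cheminformatics research.

### Core Features
- Intelligent ring atom numbering (based on the lactone structure)
- Automatic side-chain cleavage analysis
- Molecular visualization (SVG/PNG)
- Batch processing and data export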

## Tech Stack

| Component | Technology |
|------|------|
| Language | Python 3.8+ |
| Chemistry library | RDKit |
| Data processing | Pandas, NumPy |
| Visualization | Matplotlib, Seaborn |
| Environment management | Pixi (recommended) / Conda |
| Documentation | MkDocs + Material |
| Testing | Pytest |
| Formatting and linting | Black, Flake8 |

## Project Structure

```
macro_split/
├── src/                              # Core source code
│   ├── __init__.py                   # Package initialization
│   ├── macrolactone_fragmenter.py    # ⭐ Main entry class
│   ├── macro_lactone_analyzer.py     # Ring-size analyzer
│   ├── ring_numbering.py             # Ring numbering system
│   ├── ring_visualization.py         # Visualization tools
│   ├── fragment_cleaver.py           # Side-chain cleavage logic
│   ├── fragment_dataclass.py         # Fragment data classes
│   └── visualizer.py                 # Statistical visualization
├── notebooks/                        # Jupyter Notebook examples
├── scripts/                          # Batch processing scripts
├── tests/                            # Unit tests
├── docs/                             # Documentation
├── pyproject.toml                    # Project configuration
├── pixi.toml                         # Pixi environment configuration
└── mkdocs.yml                        # Documentation configuration
```

## Core Modules

### MacrolactoneFragmenter (main entry point)
```python
from src.macrolactone_fragmenter import MacrolactoneFragmenter

fragmenter = MacrolactoneFragmenter(ring_size=16)
result = fragmenter.process_molecule(smiles, parent_id="mol_001")
```

### MacroLactoneAnalyzer (ring-size analysis)
```python
from src.macro_lactone_analyzer import MacroLactoneAnalyzer

analyzer = MacroLactoneAnalyzer()
info = analyzer.get_single_ring_info(smiles)
```

### Data Class Structure
```python
@dataclass
class Fragment:
    fragment_smiles: str      # Fragment SMILES
    parent_smiles: str        # Parent molecule SMILES
    cleavage_position: int    # Cleavage position (1-N)
    fragment_id: str          # Fragment ID
    parent_id: str            # Parent molecule ID
    atom_count: int           # Number of atoms
    molecular_weight: float   # Molecular weight
```
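
To make the field list above concrete, here is a minimal sketch that builds one record and flattens it to a dictionary, for example as a row for export. The `Fragment` class below is a stand-in that only mirrors the fields shown above (the real class lives in `src/fragment_dataclass.py` and may carry more), and every value, including the `[*]CC` fragment notation and the ID format, is made up for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass
class Fragment:
    """Stand-in with the same fields as the structure documented above."""
    fragment_smiles: str
    parent_smiles: str
    cleavage_position: int
    fragment_id: str
    parent_id: str
    atom_count: int
    molecular_weight: float


# Illustrative values only; they were not produced by the project code.
frag = Fragment(
    fragment_smiles="[*]CC",
    parent_smiles="CCC1CCCCCCCCCCCCCC(=O)O1",
    cleavage_position=3,
    fragment_id="mol_001_frag_3",
    parent_id="mol_001",
    atom_count=2,
    molecular_weight=29.06,
)

# asdict() turns the dataclass into a plain dict, one key per field.
row = asdict(frag)
print(row["fragment_id"], row["cleavage_position"])
```

A list of such dictionaries can be passed directly to `pandas.DataFrame` when a tabular view is needed.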

## Development Commands

### Environment Setup
```bash
# Install dependencies
pixi install

# Activate the environment
pixi shell
```

### Code Quality
```bash
# Format code
pixi run black src/

# Lint code
pixi run flake8 src/

# Run tests
pixi run pytest

# Test coverage
pixi run pytest --cov=src
```

### Documentation
```bash
# Preview the docs locally
pixi run mkdocs serve

# Build the docs
pixi run mkdocs build
```

## Coding Conventions

### Python Style
- Format with Black, line width 100 characters
- Use Google-style docstrings
- Type annotations: every public function must have type hints
- Naming: PascalCase for classes, snake_case for functions and variables

### Docstring Example
```python
def process_molecule(self, smiles: str, parent_id: Optional[str] = None) -> FragmentResult:
    """
    Process a single molecule and run the side-chain cleavage analysis.

    Args:
        smiles: SMILES string of the molecule
        parent_id: Optional molecule identifier

    Returns:
        A FragmentResult object containing all fragment information

    Raises:
        ValueError: If the SMILES is invalid or the ring is not the target size

    Example:
        >>> fragmenter = MacrolactoneFragmenter(ring_size=16)
        >>> result = fragmenter.process_molecule("C1CC...")
    """
```

### Import Order
```python
# 1. Standard library
import json
from pathlib import Path
from typing import List, Dict, Optional

# 2. Third-party libraries
import pandas as pd
import numpy as np
from rdkit import Chem

# 3. Local modules
from src.fragment_dataclass import Fragment
from src.ring_numbering import RingNumbering
```

## Key Concepts

### Ring Numbering System
- **Position 1**: the carbonyl carbon (the C in C=O)
- **Position 2**: the ester oxygen (the O in the ring)
- **Positions 3-N**: the remaining ring atoms, numbered in order
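
The numbering rule above can be sketched in plain Python. This is not the project's implementation: `number_ring` is a hypothetical helper that assumes the ring atoms are already listed in bonded (cyclic) order and that the carbonyl carbon and ester oxygen indices are known. It walks the ring in whichever direction places the ester oxygen right after the carbonyl carbon.

```python
from typing import Dict, List


def number_ring(ring_atoms: List[int], carbonyl_c: int, ester_o: int) -> Dict[int, int]:
    """Map atom index -> 1-based ring position.

    `ring_atoms` must list the ring atoms in bonded (cyclic) order.
    """
    n = len(ring_atoms)
    start = ring_atoms.index(carbonyl_c)
    # Pick the walking direction that puts the ester oxygen at position 2.
    if ring_atoms[(start + 1) % n] == ester_o:
        step = 1
    elif ring_atoms[(start - 1) % n] == ester_o:
        step = -1
    else:
        raise ValueError("ester oxygen is not adjacent to the carbonyl carbon")
    return {ring_atoms[(start + step * i) % n]: i + 1 for i in range(n)}


# Toy 6-atom ring with arbitrary atom indices (real targets have 12-20 ring atoms).
ring = [10, 11, 12, 13, 14, 15]
numbering = number_ring(ring, carbonyl_c=13, ester_o=12)
print(numbering)  # {13: 1, 12: 2, 11: 3, 10: 4, 15: 5, 14: 6}
```

Because the start atom and the direction are both fixed by the lactone group, the same input always yields the same numbering, which is the property the ring numbering test in this commit checks.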

### Supported Ring Sizes
- 12-membered to 20-membered rings
- 16-membered rings are handled by default

### SMARTS Patterns
```python
# Lactone bond SMARTS (16-membered ring example); r16 = smallest ring size 16, whereas R16 would mean "in 16 rings"
LACTONE_SMARTS_16 = "[C;r16](=O)[O;r16]"
```
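
Since the supported range is 12 to 20 ring atoms, the pattern can be generated per ring size rather than hard-coded. The sketch below uses a hypothetical helper, `lactone_smarts`, that is not part of the project. It relies on the SMARTS ring-size primitive `r<n>` (size of the smallest ring containing the atom); the uppercase `R<n>` primitive instead counts how many rings an atom belongs to.

```python
def lactone_smarts(ring_size: int) -> str:
    """Build a lactone SMARTS for one ring size using the `r<n>` primitive."""
    if not 12 <= ring_size <= 20:
        raise ValueError(f"ring_size must be in 12-20, got {ring_size}")
    return f"[C;r{ring_size}](=O)[O;r{ring_size}]"


print(lactone_smarts(14))  # [C;r14](=O)[O;r14]
```

One limitation of this form: because `r<n>` refers to the smallest ring, a lactone atom that also sits in a smaller fused ring would not match.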

## Testing Guide

### Running Tests
```bash
# All tests
pixi run pytest

# A specific module
pixi run pytest tests/test_fragmenter.py

# Verbose output
pixi run pytest -v

# A single test
pixi run pytest tests/test_fragmenter.py::test_process_molecule
```

### Test Data
Example SMILES for testing (16-membered macrolactones):
```python
TEST_SMILES = [
    "O=C1CCCCCCCC(=O)OCC/C=C/C=C/1",  # Simple 16-membered ring
    "CCC1OC(=O)C[C@H](O)C(C)[C@@H](O)...",  # Complex structure
]
```

## Common Tasks

### Adding a New Feature
1. Create or modify a module under `src/`
2. Update `src/__init__.py` to export the new class or function
3. Write unit tests
4. Update the documentation

### Handling a Different Ring Size
```python
# Specify the ring size in MacrolactoneFragmenter
fragmenter = MacrolactoneFragmenter(ring_size=14)  # 14-membered ring
```

### Batch Processing
```python
results = fragmenter.process_csv(
    "data/molecules.csv",
    smiles_column="smiles",
    id_column="unique_id",
    max_rows=1000
)
df = fragmenter.batch_to_dataframe(results)
```
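
The call above implies a CSV with one column of SMILES and one column of identifiers. As a rough illustration of that layout, the sketch below reads such a file with the standard library only. `iter_molecules` is a hypothetical stand-in, not how `process_csv` is implemented, and the assumption that `max_rows` caps the number of rows read is inferred from the parameter name.

```python
import csv
import io
from itertools import islice
from typing import Iterator, Optional, TextIO, Tuple


def iter_molecules(
    handle: TextIO,
    smiles_column: str = "smiles",
    id_column: str = "unique_id",
    max_rows: Optional[int] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield (molecule_id, smiles) pairs from a CSV file object."""
    reader = csv.DictReader(handle)
    # islice with None reads every row; otherwise it stops after max_rows rows.
    for row in islice(reader, max_rows):
        yield row[id_column], row[smiles_column]


# In-memory CSV with made-up rows, so the example runs without a data file.
sample = io.StringIO("unique_id,smiles\nmol_001,O=C1CCCCCCCCCCCCCCO1\nmol_002,CCO\n")
pairs = list(iter_molecules(sample, max_rows=1000))
print(pairs)
```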
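
## Notes

### RDKit Dependency
- This project expects RDKit to be installed through conda/pixi; pip-based installation is not supported here
- Confirm that RDKit is available in the environment: `python -c "from rdkit import Chem; print('OK')"`

### Performance Considerations
- Use the `process_csv` method when batch-processing large datasets
- Throughput is roughly 100 molecules per minute
- For large-scale runs, consider `scripts/batch_process_*.py`

### Error Handling
- An invalid SMILES raises `ValueError`
- Molecules whose ring is not the target size are skipped
- Batch processing records failed molecules in the log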
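
The behavior described above (raise on bad input, keep the batch going, log the failures) follows a common pattern that can be sketched as below. `run_batch` and `fake_process` are hypothetical names introduced for this illustration; `fake_process` stands in for a real processing call, and the logger name is arbitrary.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("batch")


def run_batch(
    molecules: Iterable[Tuple[str, str]],
    process: Callable[[str, str], object],
) -> Tuple[List[object], Dict[str, str]]:
    """Apply `process` to each (id, smiles) pair; collect results and failures."""
    results: List[object] = []
    failures: Dict[str, str] = {}
    for mol_id, smiles in molecules:
        try:
            results.append(process(smiles, mol_id))
        except ValueError as exc:
            # A ValueError marks one bad molecule: log it and continue with the rest.
            logger.warning("skipped %s: %s", mol_id, exc)
            failures[mol_id] = str(exc)
    return results, failures


def fake_process(smiles: str, mol_id: str) -> str:
    """Stand-in for a real processing call; rejects empty SMILES."""
    if not smiles:
        raise ValueError("invalid SMILES")
    return f"{mol_id}:ok"


results, failures = run_batch([("a", "CCO"), ("b", "")], fake_process)
print(results, failures)  # ['a:ok'] {'b': 'invalid SMILES'}
```

Returning the failures alongside the results lets a caller report or retry them without parsing the log.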

## Related Resources

- **Documentation**: the `docs/` directory, or run `pixi run mkdocs serve`
- **Examples**: `notebooks/filter_molecules.ipynb`
- **Scripts**: `scripts/README.md`

---

*Last updated: 2025-01-23*
223
tests/test_ring_numbering.py
Normal file
@@ -0,0 +1,223 @@
"""
Test ring numbering - verify that the atom numbering is fixed
"""
import sys
sys.path.insert(0, '/home/zly/project/macro_split')

from rdkit import Chem
from rdkit.Chem import Draw, AllChem
from rdkit.Chem.Draw import rdMolDraw2D
from src.ring_visualization import (
    get_macrolactone_numbering,
    get_ring_atoms_by_size
)


def test_ring_numbering_consistency(smiles: str, ring_size: int = 16, num_tests: int = 5):
    """
    Test ring numbering consistency - run several times to confirm the numbering is fixed
    """
    print("=" * 70)
    print("Ring numbering consistency test")
    print("=" * 70)
    print(f"\nSMILES: {smiles[:80]}...")
    print(f"Ring size: {ring_size}")
    print(f"Number of runs: {num_tests}")

    # Parse the molecule
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print("❌ Could not parse SMILES")
        return False

    print(f"✓ Molecule parsed successfully, {mol.GetNumAtoms()} atoms in total")

    # Detect the ring size, falling back to any 12-20 membered ring
    ring_atoms = get_ring_atoms_by_size(mol, ring_size)
    if not ring_atoms:  # covers both None and an empty result
        for size in range(12, 21):
            ring_atoms = get_ring_atoms_by_size(mol, size)
            if ring_atoms:
                ring_size = size
                print(f"⚠️ Using the detected {size}-membered ring")
                break

    if not ring_atoms:  # covers both None and an empty result
        print("❌ No 12-20 membered ring found")
        return False

    print(f"✓ Found a {ring_size}-membered ring containing {len(ring_atoms)} atoms")

    # Run the numbering several times and collect the results
    all_numberings = []
    all_carbonyl_carbons = []
    all_ester_oxygens = []

    for i in range(num_tests):
        result = get_macrolactone_numbering(mol, ring_size)
        ring_atoms_result, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen, (is_valid, reason) = result

        if not is_valid:
            print(f"❌ Run {i+1} failed: {reason}")
            return False

        all_numberings.append(ring_numbering.copy())
        all_carbonyl_carbons.append(carbonyl_carbon)
        all_ester_oxygens.append(ester_oxygen)

    # Verify consistency
    print("\n" + "-" * 50)
    print("Numbering consistency check:")
    print("-" * 50)

    is_consistent = True

    if len(set(all_carbonyl_carbons)) == 1:
        print(f"✓ Carbonyl carbon position is consistent: atom index {all_carbonyl_carbons[0]}")
    else:
        print(f"❌ Carbonyl carbon position is inconsistent: {all_carbonyl_carbons}")
        is_consistent = False

    if len(set(all_ester_oxygens)) == 1:
        print(f"✓ Ester oxygen position is consistent: atom index {all_ester_oxygens[0]}")
    else:
        print(f"❌ Ester oxygen position is inconsistent: {all_ester_oxygens}")
        is_consistent = False

    first_numbering = all_numberings[0]
    for i, numbering in enumerate(all_numberings[1:], 2):
        if numbering != first_numbering:
            print(f"❌ Numbering from run {i} differs from run 1")
            is_consistent = False
            break

    if is_consistent:
        print(f"✓ Numbering is identical across all {num_tests} runs")

    # Show the detailed numbering
    print("\n" + "-" * 50)
    print("Ring atom numbering details:")
    print("-" * 50)

    numbering = all_numberings[0]
    carbonyl_carbon = all_carbonyl_carbons[0]
    ester_oxygen = all_ester_oxygens[0]

    sorted_items = sorted(numbering.items(), key=lambda x: x[1])

    print(f"{'Pos':<6} {'Atom idx':<10} {'Elem':<6} {'Note'}")
    print("-" * 40)

    for atom_idx, position in sorted_items:
        atom = mol.GetAtomWithIdx(atom_idx)
        symbol = atom.GetSymbol()
        note = ""
        if atom_idx == carbonyl_carbon:
            note = "← carbonyl carbon (C=O)"
        elif atom_idx == ester_oxygen:
            note = "← ester oxygen"
        print(f"{position:<6} {atom_idx:<10} {symbol:<6} {note}")

    return is_consistent


def save_visualization(smiles: str, output_path: str, ring_size: int = 16):
    """Save a visualization image of the molecule"""
    print("\n" + "=" * 70)
    print("Saving visualization image")
    print("=" * 70)

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print("❌ Could not parse SMILES")
        return

    for size in range(12, 21):
        ring_atoms = get_ring_atoms_by_size(mol, size)
        if ring_atoms:
            ring_size = size
            break

    result = get_macrolactone_numbering(mol, ring_size)
    ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen, (is_valid, reason) = result

    if not is_valid:
        print(f"❌ Could not obtain the numbering: {reason}")
        return

    mol_copy = Chem.Mol(mol)
    AllChem.Compute2DCoords(mol_copy)

    for atom_idx in ring_atoms:
        if atom_idx in ring_numbering:
            atom = mol_copy.GetAtomWithIdx(atom_idx)
            atom.SetProp("atomNote", str(ring_numbering[atom_idx]))  # label with ring position

    atom_colors = {}
    for atom_idx in ring_atoms:
        atom = mol.GetAtomWithIdx(atom_idx)
        symbol = atom.GetSymbol()

        if atom_idx == carbonyl_carbon:
            atom_colors[atom_idx] = (1.0, 0.6, 0.0)  # orange
        elif atom_idx == ester_oxygen:
            atom_colors[atom_idx] = (1.0, 0.4, 0.4)  # red
        elif symbol == 'C':
            atom_colors[atom_idx] = (0.7, 0.85, 1.0)  # light blue
        elif symbol == 'O':
            atom_colors[atom_idx] = (1.0, 0.7, 0.7)  # light red
        elif symbol == 'N':
            atom_colors[atom_idx] = (0.8, 0.7, 1.0)  # light purple
        else:
            atom_colors[atom_idx] = (0.8, 1.0, 0.8)  # light green

    drawer = rdMolDraw2D.MolDraw2DSVG(1000, 1000)
    drawer.SetFontSize(14)
    drawer.DrawMolecule(mol_copy, highlightAtoms=list(ring_atoms), highlightAtomColors=atom_colors)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText()

    svg_path = output_path.replace('.png', '.svg')
    with open(svg_path, 'w', encoding='utf-8') as f:
        f.write(svg)
    print(f"✓ SVG saved to: {svg_path}")

    try:
        drawer_png = rdMolDraw2D.MolDraw2DCairo(1000, 1000)
        drawer_png.SetFontSize(14)
        drawer_png.DrawMolecule(mol_copy, highlightAtoms=list(ring_atoms), highlightAtomColors=atom_colors)
        drawer_png.FinishDrawing()
        drawer_png.WriteDrawingText(output_path)
        print(f"✓ PNG saved to: {output_path}")
    except Exception as e:
        print(f"⚠️ Failed to save PNG: {e}")

    print("\nColor legend:")
    print("  Orange: carbonyl carbon (position 1)")
    print("  Red: ester oxygen (position 2)")
    print("  Light blue: ring carbon atoms")


def main():
    smiles = "O[C@H]1[C@H]([C@H]([C@H](OC[C@@H]2[C@@H](CC)OC(C[C@H]([C@H](C)[C@H]([C@@H](CC=O)C[C@@H](C)C(/C=C/C(/C)=C/2)=O)O[C@H]2[C@@H]([C@H]([C@@H]([C@@H](C)O2)O[C@H]2C[C@](C)([C@@H]([C@@H](C)O2)O)O)[N@](C)C)O)O)=O)O[C@@H]1C)OC)OC"

    print("\nMacrolactone ring numbering test\n")
    is_consistent = test_ring_numbering_consistency(smiles, ring_size=16, num_tests=5)

    output_path = "/home/zly/project/macro_split/output/test_ring_numbering.png"
    save_visualization(smiles, output_path, ring_size=16)

    print("\n" + "=" * 70)
    print("Test summary")
    print("=" * 70)
    if is_consistent:
        print("✅ All tests passed! The ring atom numbering is fixed.")
    else:
        print("❌ Test failed: the ring atom numbering is inconsistent")

    return is_consistent


if __name__ == "__main__":
    success = main()
    sys.exit(0 if success else 1)