diff --git a/README.md b/README.md index 8828c36..de8f97e 100644 --- a/README.md +++ b/README.md @@ -1,292 +1,67 @@ -# Macrolactone Fragmenter +# macro_lactone_toolkit -[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) -[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) -[![Pixi](https://img.shields.io/badge/pixi-managed-blue)](https://pixi.sh) +`macro_lactone_toolkit` 是一个正式可安装的 Python 包,用于 12-20 元有效大环内酯的识别、环编号、侧链裂解和简单拼接回组装。 -一个专业的大环内酯(Macrolactone, 12-20 元环)侧链断裂和分析工具,提供从分子编号、侧链断裂到可视化和数据导出的完整工作流。 +## 核心能力 -## ✨ 主要特性 +- 默认自动识别 12-20 元有效大环内酯,也允许显式指定 `ring_size` +- 环编号规则固定为: + - 位置 1 = 内酯羰基碳 + - 位置 2 = 环上的酯键氧 + - 位置 3-N = 沿统一方向连续编号 +- 侧链裂解同时输出两套 SMILES: + - `fragment_smiles_labeled`,例如 `[5*]` + - `fragment_smiles_plain`,例如 `*` +- dummy 原子与连接原子的原始键型保持一致 +- 提供正式 CLI: + - `macro-lactone-toolkit analyze` + - `macro-lactone-toolkit number` + - `macro-lactone-toolkit fragment` -- 🔢 **智能环原子编号** - 支持 12-20 元环,基于内酯结构的固定编号系统 -- ✂️ **自动侧链断裂** - 智能识别并断裂所有侧链,支持单个或批量处理 -- 📊 **强大的可视化** - SVG 显示 + PNG 保存,支持自定义 DPI 和尺寸 -- 💾 **多种导出格式** - JSON、CSV、DataFrame,预留 SQLModel 数据库接口 -- 🚀 **批量处理** - 直接处理 CSV 文件,支持 2000+ 分子的大规模分析 -- 📦 **易于安装** - 使用 pixi 或 pip 安装,支持开发模式 +## 环境 -## 🚀 快速开始 - -### 安装 - -**使用 Pixi(推荐)** +推荐使用 `pixi`,项目已固定到 Python 3.12,并支持 `osx-arm64` 与 `linux-64`。 ```bash -# 安装 pixi -curl -fsSL https://pixi.sh/install.sh | bash - -# 克隆项目并安装依赖 -git clone https://github.com/yourusername/macro_split.git -cd macro_split pixi install - -# 激活环境 -pixi shell -``` - -**使用 Pip** - -```bash -# 首先安装 RDKit(必须通过 conda) -conda install -c conda-forge rdkit - -# 安装项目 -pip install -e . # 开发模式 -``` - -### 基本使用 - -```python -from src.macrolactone_fragmenter import MacrolactoneFragmenter - -# 创建 16 元环处理器 -fragmenter = MacrolactoneFragmenter(ring_size=16) - -# 处理单个分子 -smiles = "CCC1OC(=O)C[C@H](O)C(C)[C@@H](O)..." -result = fragmenter.process_molecule(smiles, parent_id="mol_001") - -# 查看结果 -print(f"总碎片数: {len(result.fragments)}") -for frag in result.fragments[:5]: - print(f"位置 {frag.cleavage_position}: {frag.fragment_smiles}") - -# 可视化 -svg = fragmenter.visualize_molecule(result.parent_smiles) - -# 导出数据 -fragmenter.save_to_json(result, "output/fragments.json") -df = fragmenter.to_dataframe(result) -``` - -### 批量处理 - -```python -# 批量处理 CSV 文件 -results = fragmenter.process_csv( - "data/molecules.csv", - smiles_column="smiles", - id_column="unique_id", - max_rows=100 -) - -# 转换为 DataFrame 并保存 -df_all = fragmenter.batch_to_dataframe(results) -df_all.to_csv("output/all_fragments.csv", index=False) -``` - -## 📖 文档 - -完整的文档请访问:[在线文档](https://yourusername.github.io/macro_split/) - -或本地查看: - -```bash -pixi run mkdocs serve -# 访问 http://localhost:8000 -``` - -### 文档内容 - -- 📘 **[快速开始](docs/getting-started.md)** - 5 分钟上手指南 -- 📗 **[安装指南](docs/installation.md)** - 详细的安装说明 -- 📙 **[用户指南](docs/user-guide/index.md)** - 完整的使用说明 -- 📕 **[API 参考](docs/api/index.md)** - 自动生成的 API 文档 -- 📔 **[教程与示例](docs/tutorials/index.md)** - 实际使用案例 - -## 🎯 核心功能 - -### MacrolactoneFragmenter 类 - -高级封装类,提供一站式解决方案: - -```python -fragmenter = MacrolactoneFragmenter(ring_size=16) - -# 一站式处理 -result = fragmenter.process_molecule(smiles) - -# 单位置断裂 -fragments = fragmenter.cleave_at_position(mol, position=5) - -# 可视化(SVG + PNG) -fragmenter.visualize_molecule(smiles, save_path="mol.png", dpi=600) - -# 批量处理 -results = fragmenter.process_csv("molecules.csv") -``` - -### 环编号系统 - -基于内酯结构的固定编号: - -- **位置 1**: 羰基碳(C=O 中的 C) -- **位置 2**: 酯键氧(环上的 O) -- **位置 3-N**: 按顺序编号环上剩余原子 - -### 数据结构 - -使用 dataclass 存储碎片信息,支持 JSON 导入导出: - -```python -@dataclass -class Fragment: - fragment_smiles: str # 碎片 SMILES(含 dummy 原子 *) - parent_smiles: str # 母分子 SMILES - cleavage_position: int # 断裂位置(1-N) - fragment_id: str # 碎片 ID - parent_id: str # 母分子 ID - atom_count: int # 原子数 - molecular_weight: float # 分子量 -``` - -## 📂 项目结构 - -``` -macro_split/ -├── src/ # 核心源代码 -│ ├── macrolactone_fragmenter.py # ⭐ 高级封装类 -│ ├── ring_numbering.py # 环编号系统 -│ ├── ring_visualization.py # 可视化工具 -│ ├── fragment_dataclass.py # 碎片数据类 -│ ├── fragment_cleaver.py # 侧链断裂 -│ └── visualizer.py # 统计可视化 -├── notebooks/ # Jupyter Notebook 示例 -│ └── filter_molecules.ipynb # ⭐ 完整使用案例 -├── docs/ # MkDocs 文档 -├── scripts/ # 执行脚本 -├── tests/ # 单元测试 -├── pyproject.toml # 项目配置 -├── setup.py # 打包脚本 -├── pixi.toml # Pixi 环境配置 -└── mkdocs.yml # 文档配置 -``` - -## 🔧 环境管理 - -本项目使用 [Pixi](https://pixi.sh/) 进行环境管理。Pixi 是一个现代化的包管理工具,具有以下优势: - -- ✅ 自动安装 RDKit(无需手动配置 conda) -- ✅ 环境隔离,不污染系统 -- ✅ 跨平台支持(Linux、macOS、Windows) -- ✅ 快速且可重现的依赖管理 - -### Pixi 常用命令 - -```bash -# 安装依赖 -pixi install - -# 激活环境 -pixi shell - -# 添加新包 -pixi add package-name - -# 运行命令 -pixi run python script.py -pixi run jupyter notebook - -# 查看已安装包 -pixi list -``` - -## 🧪 运行示例 - -### 方式 1: Jupyter Notebook - -```bash -pixi run jupyter notebook notebooks/filter_molecules.ipynb -``` - -查看第 15 章节的 MacrolactoneFragmenter 完整演示。 - -### 方式 2: Python 脚本 - -```bash -pixi run python examples/basic_usage.py -``` - -### 方式 3: 交互式 Python - -```bash -pixi shell -python ->>> from src.macrolactone_fragmenter import MacrolactoneFragmenter ->>> fragmenter = MacrolactoneFragmenter(ring_size=16) ->>> # 开始使用... -``` - -## 📊 性能 - -- **处理速度**: ~100 分子/分钟 -- **已测试**: 2000+ 个 16 元环大环内酯 -- **支持环大小**: 12-20 元环 -- **输出格式**: JSON, CSV, DataFrame, PNG, SVG - -## 🤝 贡献 - -欢迎贡献!请查看 [贡献指南](docs/development/contributing.md)。 - -### 开发环境设置 - -```bash -# Fork 并克隆项目 -git clone https://github.com/YOUR_USERNAME/macro_split.git -cd macro_split - -# 安装开发依赖 -pixi install - -# 运行测试 pixi run pytest - -# 检查代码风格 -pixi run black src/ -pixi run flake8 src/ +pixi run python -c "import macro_lactone_toolkit" ``` -## 📝 许可证 +## Python API -本项目基于 [MIT License](LICENSE) 开源。 +```python +from macro_lactone_toolkit import MacroLactoneAnalyzer, MacrolactoneFragmenter -## 🔗 相关链接 +analyzer = MacroLactoneAnalyzer() +valid_ring_sizes = analyzer.get_valid_ring_sizes("O=C1CCCCCCCCCCCCCCO1") -- **文档**: [在线文档](https://yourusername.github.io/macro_split/) -- **GitHub**: [macro_split](https://github.com/yourusername/macro_split) -- **PyPI**: [macrolactone-fragmenter](https://pypi.org/project/macrolactone-fragmenter/)(即将发布) -- **问题反馈**: [Issues](https://github.com/yourusername/macro_split/issues) +fragmenter = MacrolactoneFragmenter() +numbering = fragmenter.number_molecule("O=C1CCCCCCCCCCCCCCO1") +result = fragmenter.fragment_molecule("O=C1CCCC(C)CCCCCCCCCCO1", parent_id="mol_001") +``` -## 📧 联系方式 +## CLI -如有问题或建议,请: +单分子分析: -- 提交 [Issue](https://github.com/yourusername/macro_split/issues) -- 参与 [Discussions](https://github.com/yourusername/macro_split/discussions) -- 查看[文档](https://yourusername.github.io/macro_split/) +```bash +pixi run macro-lactone-toolkit analyze --smiles "O=C1CCCCCCCCCCCCCCO1" +pixi run macro-lactone-toolkit number --smiles "O=C1CCCCCCCCCCCCCCO1" +pixi run macro-lactone-toolkit fragment --smiles "O=C1CCCC(C)CCCCCCCCCCO1" --parent-id mol_001 +``` -## 🌟 致谢 +CSV 批处理: -- [RDKit](https://www.rdkit.org/) - 化学信息学核心库 -- [Pixi](https://pixi.sh/) - 现代化的包管理工具 -- [MkDocs Material](https://squidfunk.github.io/mkdocs-material/) - 漂亮的文档主题 +```bash +pixi run macro-lactone-toolkit fragment \ + --input molecules.csv \ + --output fragments.csv \ + --errors-output fragment_errors.csv +``` ---- +默认读取 `smiles` 列;若存在 `id` 列则将其作为 `parent_id`,否则自动生成 `row_`。 -
+## Legacy Scripts -**⭐ 如果这个项目对您有帮助,请给它一个 Star!⭐** - -Made with ❤️ by Macro Split Team - -
+`scripts/` 目录保留为薄封装或迁移提示,不再承载核心实现。正式接口以 `macro_lactone_toolkit.*` 与 `macro-lactone-toolkit` CLI 为准。 diff --git a/pixi.toml b/pixi.toml index 646f6a1..c751e27 100644 --- a/pixi.toml +++ b/pixi.toml @@ -1,24 +1,23 @@ [workspace] authors = ["hotwa "] channels = ["conda-forge"] -name = "macro_split" -platforms = ["linux-64"] +name = "macro_lactone_toolkit" +platforms = ["osx-arm64", "linux-64"] version = "0.1.0" [tasks] +test = "pytest" +lint = "python -m compileall src" +smoke-import = "python -c \"import macro_lactone_toolkit\"" +macro-lactone-toolkit = "python -m macro_lactone_toolkit.cli" [dependencies] +python = "3.12.*" +pip = ">=24,<26" +pytest = ">=8.3,<9" rdkit = ">=2025.9.1,<2026" pandas = ">=2.3.3,<3" numpy = ">=2.3.4,<3" -matplotlib = ">=3.10.7,<4" -seaborn = ">=0.13.2,<0.14" -jupyter = ">=1.1.1,<2" -tqdm = ">=4.67.1,<5" -mkdocs = ">=1.6.1,<2" -mkdocs-material = ">=9.6.23,<10" -mkdocstrings = ">=0.30.1,<0.31" -mkdocstrings-python = ">=1.18.2,<2" [pypi-dependencies] -dataclasses-json = ">=0.6.7, <0.7" +macro_lactone_toolkit = { path = ".", editable = true } diff --git a/pyproject.toml b/pyproject.toml index 70b0c9b..ba8e662 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,74 +1,28 @@ [build-system] -requires = ["setuptools>=61.0", "wheel"] +requires = ["setuptools>=69", "wheel"] build-backend = "setuptools.build_meta" [project] -name = "macrolactone-fragmenter" -version = "1.2.0" -description = "A tool for analyzing and fragmenting macrolactone (12-20 membered ring) side chains" +name = "macro_lactone_toolkit" +version = "0.1.0" +description = "Toolkit for macrolactone detection, numbering, fragmentation, and splicing." readme = "README.md" -requires-python = ">=3.8" -license = {text = "MIT"} +requires-python = ">=3.12" +license = { text = "MIT" } authors = [ - {name = "Macro Split Team", email = "your.email@example.com"}, + { name = "Macro Split Team" }, ] -keywords = ["chemistry", "cheminformatics", "macrolactone", "rdkit", "fragmentation"] -classifiers = [ - "Development Status :: 4 - Beta", - "Intended Audience :: Science/Research", - "Topic :: Scientific/Engineering :: Chemistry", - "License :: OSI Approved :: MIT License", - "Programming Language :: Python :: 3", - "Programming Language :: Python :: 3.8", - "Programming Language :: Python :: 3.9", - "Programming Language :: Python :: 3.10", - "Programming Language :: Python :: 3.11", -] - dependencies = [ - "pandas>=1.3.0", - "numpy>=1.20.0", - "matplotlib>=3.3.0", - "seaborn>=0.11.0", - "dataclasses-json>=0.5.0", - "tqdm>=4.60.0", + "numpy>=1.26", + "pandas>=2.2", ] -[project.optional-dependencies] -dev = [ - "jupyter>=1.0.0", - "pytest>=7.0.0", - "black>=22.0.0", - "flake8>=4.0.0", -] -docs = [ - "mkdocs>=1.4.0", - "mkdocs-material>=9.0.0", - "mkdocstrings[python]>=0.20.0", -] - -[project.urls] -Homepage = "https://github.com/yourusername/macro_split" -Documentation = "https://macro-split.readthedocs.io" -Repository = "https://github.com/yourusername/macro_split" -Issues = "https://github.com/yourusername/macro_split/issues" - [project.scripts] -macro-fragmenter = "src.macrolactone_fragmenter:main" +macro-lactone-toolkit = "macro_lactone_toolkit.cli:main" -[tool.setuptools] -packages = ["src"] - -[tool.setuptools.package-data] -src = ["*.py"] - -[tool.black] -line-length = 100 -target-version = ['py38', 'py39', 'py310', 'py311'] +[tool.setuptools.packages.find] +where = ["src"] +include = ["macro_lactone_toolkit*"] [tool.pytest.ini_options] testpaths = ["tests"] -python_files = "test_*.py" -python_classes = "Test*" -python_functions = "test_*" - diff --git a/scripts/README.md b/scripts/README.md index 4cbe5ba..78c4a18 100644 --- a/scripts/README.md +++ b/scripts/README.md @@ -1,317 +1,11 @@ -# 批量处理脚本使用说明 +# scripts -## 概述 +这些脚本现在都是基于 `macro_lactone_toolkit.*` 的薄封装或迁移提示。 -本文件夹包含两个批量处理脚本,用于对大环内酯分子进行编号、裂解和保存。 - -## 脚本列表 - -### 1. `batch_process_ring16.py` - 处理16元环分子 - -**功能:** -- 处理已过滤的1241个16元环分子 -- 对每个分子进行环原子编号 -- 进行侧链断裂 -- 保存JSON文件 - -**输入文件:** -- `ring16/temp_filtered_complete.csv` - -**输出文件夹:** -- `output/ring16_fragments/` - - `ring16_mol_{idx}/` - 每个分子的文件夹 - - `ring16_mol_{idx}_all_fragments.json` - 所有碎片信息 - - `{fragment_id}.json` - 单个碎片文件 - - `metadata.json` - 分子元数据 - - `processing_log_*.txt` - 处理日志 - - `error_log_*.txt` - 错误日志 - - `processing_statistics.json` - 统计信息 - -**使用方法:** - -```bash -cd /home/zly/project/macro_split - -# 方法1: 直接运行Python脚本 -python scripts/batch_process_ring16.py - -# 方法2: 作为可执行文件运行 -chmod +x scripts/batch_process_ring16.py -./scripts/batch_process_ring16.py -``` - -**预期运行时间:** 10-30分钟(取决于计算机性能) - ---- - -### 2. `batch_process_multi_rings.py` - 处理12-20元环分子 - -**功能:** -- 处理12-20元环的所有大环内酯分子 -- 按环大小自动分类(12元、13元、14元...20元) -- 检测并剔除含有多个内酯键的分子 -- 对每种环大小分别编号和裂解 -- 保存到对应的文件夹 - -**输入文件:** -- `data/ring12_20/temp.csv` - -**输出文件夹:** -- `output/ring12_fragments/` - 12元环碎片 -- `output/ring13_fragments/` - 13元环碎片 -- `output/ring14_fragments/` - 14元环碎片 -- ... -- `output/ring20_fragments/` - 20元环碎片 - -每个文件夹包含: -- `ring{N}_mol_{idx}/` - 每个分子的文件夹 - - `ring{N}_mol_{idx}_all_fragments.json` - - `{fragment_id}.json` - - `metadata.json` -- `processing_log_*.txt` - 处理日志 -- `error_log_*.txt` - 错误日志 -- `multiple_lactone_log_*.txt` - 多内酯键分子记录 -- `processing_statistics.json` - 统计信息 - -**使用方法:** - -```bash -cd /home/zly/project/macro_split - -# 确保输入文件存在 -ls data/ring12_20/temp.csv - -# 运行脚本 -python scripts/batch_process_multi_rings.py -``` - -**预期运行时间:** 根据数据集大小,可能需要30分钟-2小时 - ---- - -## 输出文件说明 - -### JSON文件格式 - -#### 1. `{parent_id}_all_fragments.json` - -包含所有碎片的完整信息: - -```json -{ - "parent_id": "ring16_mol_0", - "parent_smiles": "...", - "fragments": [ - { - "fragment_smiles": "CC(C)C", - "parent_smiles": "...", - "cleavage_position": 5, - "fragment_id": "ring16_mol_0_frag_0", - "parent_id": "ring16_mol_0", - "atom_count": 4, - "molecular_weight": 58.12 - }, - ... - ] -} -``` - -#### 2. `{fragment_id}.json` - -单个碎片的信息: - -```json -{ - "fragment_smiles": "CC(C)C", - "parent_smiles": "...", - "cleavage_position": 5, - "fragment_id": "ring16_mol_0_frag_0", - "parent_id": "ring16_mol_0", - "atom_count": 4, - "molecular_weight": 58.12 -} -``` - -#### 3. `metadata.json` - -分子的元数据: - -```json -{ - "parent_id": "ring16_mol_0", - "molecule_id": "CHEMBL94657", - "molecule_name": "PATUPILONE", - "smiles": "C/C(=C\\c1csc(C)n1)...", - "ring_size": 16, - "num_fragments": 11, - "processing_date": "2025-11-06T10:30:00" -} -``` - -#### 4. `processing_statistics.json` - -处理统计信息: - -```json -{ - "ring_size": 16, - "total": 1241, - "success": 1200, - "failed_parse": 5, - "failed_numbering": 10, - "failed_validation": 8, - "failed_cleavage": 18, - "total_fragments": 12500 -} -``` - ---- - -## 日志文件 - -### 1. `processing_log_*.txt` - -记录所有处理过程: - -``` -[2025-11-06 10:30:00] 开始批量处理 1241 个16元环分子 -[2025-11-06 10:30:05] 分子 0 处理成功 -... -[2025-11-06 11:00:00] 处理完成: 成功 1200/1241 -``` - -### 2. `error_log_*.txt` - -记录所有错误: - -``` -[2025-11-06 10:35:12] ❌ 分子 125 (CHEMBL12345) SMILES解析失败 -[2025-11-06 10:36:45] ❌ 分子 256 (CHEMBL67890) 编号验证失败 -``` - -### 3. `multiple_lactone_log_*.txt` - -记录含有多个内酯键的分子(仅在multi_rings脚本中): - -``` -[2025-11-06 10:40:00] 分子 350 (CHEMBL11111, Molecule Name) 有多个内酯键,已剔除。内酯碳索引: [15, 28, 42] -``` - ---- - -## 常见问题 - -### Q1: 如何检查处理进度? - -**A:** 脚本会显示实时进度条。你也可以查看日志文件: - -```bash -tail -f output/ring16_fragments/processing_log_*.txt -``` - -### Q2: 如何查看处理结果? - -**A:** 查看统计文件: - -```bash -cat output/ring16_fragments/processing_statistics.json -``` - -### Q3: 处理失败了怎么办? - -**A:** -1. 查看 `error_log_*.txt` 了解失败原因 -2. 脚本会跳过失败的分子,继续处理其他分子 -3. 可以手动处理失败的分子 - -### Q4: 如何重新处理? - -**A:** -1. 删除输出文件夹:`rm -rf output/ring16_fragments` -2. 重新运行脚本 - -### Q5: 内存不足怎么办? - -**A:** -- 脚本已优化为逐个处理分子,内存占用很小 -- 如果仍有问题,可以修改脚本分批处理 - ---- - -## 后续分析 - -处理完成后,可以进行以下分析: - -### 1. 统计碎片多样性 - -```python -import json -from pathlib import Path -from collections import Counter - -# 读取所有碎片 -all_fragments = [] -for mol_dir in Path('output/ring16_fragments').iterdir(): - if mol_dir.is_dir() and mol_dir.name.startswith('ring16_mol_'): - fragments_file = mol_dir / f"{mol_dir.name}_all_fragments.json" - if fragments_file.exists(): - with open(fragments_file) as f: - data = json.load(f) - all_fragments.extend([frag['fragment_smiles'] for frag in data['fragments']]) - -# 统计 -fragment_counts = Counter(all_fragments) -print(f"总碎片数: {len(all_fragments)}") -print(f"独特碎片数: {len(fragment_counts)}") -print(f"\n最常见的10个碎片:") -for smiles, count in fragment_counts.most_common(10): - print(f" {smiles}: {count}次") -``` - -### 2. 分析位置分布 - -```python -from collections import defaultdict - -position_fragments = defaultdict(list) - -for mol_dir in Path('output/ring16_fragments').iterdir(): - if mol_dir.is_dir() and mol_dir.name.startswith('ring16_mol_'): - fragments_file = mol_dir / f"{mol_dir.name}_all_fragments.json" - if fragments_file.exists(): - with open(fragments_file) as f: - data = json.load(f) - for frag in data['fragments']: - position = frag['cleavage_position'] - position_fragments[position].append(frag['fragment_smiles']) - -# 统计每个位置的碎片多样性 -for pos in sorted(position_fragments.keys()): - unique_frags = set(position_fragments[pos]) - print(f"位置 {pos:2d}: {len(position_fragments[pos])} 个碎片, {len(unique_frags)} 种不同结构") -``` - ---- - -## 性能优化 - -如果需要处理大量数据,可以考虑: - -1. **并行处理:** 使用 `multiprocessing` 模块 -2. **分批处理:** 将数据集分成多个批次 -3. **增量处理:** 只处理新增的分子 - ---- - -## 联系与支持 - -如有问题,请查看: -- 错误日志文件 -- `BRIDGE_RING_ANALYSIS.md` - 详细技术文档 -- `QUICK_GUIDE.md` - 快速参考指南 - ---- - -**最后更新:** 2025-11-06 -**版本:** 1.0 +- `batch_process.py`: 等价于 `macro-lactone-toolkit fragment` +- `batch_process_ring16.py`: 等价于 `macro-lactone-toolkit fragment --ring-size 16` +- `batch_process_multi_rings.py`: 自动识别模式的批处理封装 +- `analyze_fragments.py`: 等价于 `macro-lactone-toolkit analyze` +- `tylosin_splicer.py`: 使用 `macro_lactone_toolkit.splicing.*` 做简单拼接 +核心实现与正式接口都在 `src/macro_lactone_toolkit/` 中。 diff --git a/scripts/analyze_fragments.py b/scripts/analyze_fragments.py old mode 100755 new mode 100644 index bfb9e93..4febb62 --- a/scripts/analyze_fragments.py +++ b/scripts/analyze_fragments.py @@ -1,496 +1,10 @@ -#!/usr/bin/env python3 -""" -碎片分析脚本 -分析指定环大小的分子碎片: -- 统计每个位置的碎片频率 -- 分析碎片理化性质 -- 找出裂解最多的位置(前5) -- 找出变化最大的位置 -""" +from __future__ import annotations import sys -from pathlib import Path -# 添加项目根目录到 Python 路径 -script_dir = Path(__file__).resolve().parent -project_root = script_dir.parent -sys.path.insert(0, str(project_root)) - -import json -import pandas as pd -import numpy as np -from collections import Counter, defaultdict -from rdkit import Chem -from rdkit.Chem import Descriptors, rdMolDescriptors -from rdkit import DataStructs -from rdkit.Chem import AllChem -import matplotlib.pyplot as plt -import seaborn as sns -from scipy import stats as scipy_stats +from macro_lactone_toolkit.cli import main -def load_fragments_from_ring_size(ring_size): - """加载指定环大小的所有碎片""" - output_dir = Path(f'output/ring{ring_size}_fragments') - - if not output_dir.exists(): - print(f"❌ 未找到 {ring_size}元环的输出文件夹") - return None - - all_fragments = [] - - for mol_dir in output_dir.iterdir(): - if mol_dir.is_dir() and mol_dir.name.startswith(f'ring{ring_size}_mol_'): - fragments_file = mol_dir / f"{mol_dir.name}_all_fragments.json" - if fragments_file.exists(): - with open(fragments_file, 'r', encoding='utf-8') as f: - data = json.load(f) - all_fragments.extend(data['fragments']) - - return all_fragments - - -def calculate_fragment_properties(smiles): - """计算碎片的理化性质""" - mol = Chem.MolFromSmiles(smiles) - if mol is None: - return None - - properties = { - 'smiles': smiles, - 'molecular_weight': Descriptors.MolWt(mol), - 'num_atoms': mol.GetNumAtoms(), - 'num_heavy_atoms': mol.GetNumHeavyAtoms(), - 'num_rotatable_bonds': Descriptors.NumRotatableBonds(mol), - 'num_hbd': Descriptors.NumHDonors(mol), # 氢键供体 - 'num_hba': Descriptors.NumHAcceptors(mol), # 氢键受体 - 'logp': Descriptors.MolLogP(mol), - 'tpsa': Descriptors.TPSA(mol), # 极性表面积 - 'num_rings': Descriptors.RingCount(mol), - 'num_aromatic_rings': Descriptors.NumAromaticRings(mol), - } - - return properties - - -def calculate_shannon_entropy(fragment_list): - """ - 计算香农熵,用于衡量碎片多样性 - 熵越高,表示多样性越大 - """ - if not fragment_list: - return 0 - - counts = Counter(fragment_list) - total = len(fragment_list) - - entropy = 0 - for count in counts.values(): - p = count / total - if p > 0: - entropy -= p * np.log2(p) - - return entropy - - -def calculate_structural_diversity(smiles_list): - """ - 计算结构多样性 - 使用Tanimoto相似度的平均值 - 相似度越低,多样性越高 - """ - if len(smiles_list) < 2: - return 0 - - # 生成指纹 - fps = [] - for smiles in smiles_list: - mol = Chem.MolFromSmiles(smiles) - if mol: - fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024) - fps.append(fp) - - if len(fps) < 2: - return 0 - - # 计算所有配对的Tanimoto相似度 - similarities = [] - for i in range(len(fps)): - for j in range(i+1, len(fps)): - sim = DataStructs.TanimotoSimilarity(fps[i], fps[j]) - similarities.append(sim) - - # 返回平均相似度(1-相似度 = 多样性) - avg_similarity = np.mean(similarities) - diversity = 1 - avg_similarity - - return diversity - - -def analyze_position_statistics(fragments, ring_size): - """分析每个位置的统计信息""" - - # 按位置分组 - position_data = defaultdict(list) - for frag in fragments: - pos = frag['cleavage_position'] - position_data[pos].append(frag) - - # 统计信息 - results = [] - - for pos in sorted(position_data.keys()): - frags = position_data[pos] - smiles_list = [f['fragment_smiles'] for f in frags] - - # 基础统计 - total_count = len(smiles_list) - unique_count = len(set(smiles_list)) - most_common = Counter(smiles_list).most_common(3) - - # 多样性指标 - shannon_entropy = calculate_shannon_entropy(smiles_list) - structural_diversity = calculate_structural_diversity(list(set(smiles_list))) - - # 理化性质统计 - props_list = [] - for smiles in set(smiles_list): - props = calculate_fragment_properties(smiles) - if props: - props_list.append(props) - - if props_list: - mw_mean = np.mean([p['molecular_weight'] for p in props_list]) - mw_std = np.std([p['molecular_weight'] for p in props_list]) - atom_mean = np.mean([p['num_atoms'] for p in props_list]) - atom_std = np.std([p['num_atoms'] for p in props_list]) - else: - mw_mean = mw_std = atom_mean = atom_std = 0 - - results.append({ - 'position': pos, - 'total_count': total_count, - 'unique_count': unique_count, - 'uniqueness_ratio': unique_count / total_count if total_count > 0 else 0, - 'shannon_entropy': shannon_entropy, - 'structural_diversity': structural_diversity, - 'mw_mean': mw_mean, - 'mw_std': mw_std, - 'atom_mean': atom_mean, - 'atom_std': atom_std, - 'most_common': most_common - }) - - return pd.DataFrame(results) - - -def plot_position_analysis(df_stats, ring_size, output_dir): - """绘制位置分析图表""" - - fig, axes = plt.subplots(2, 3, figsize=(18, 12)) - fig.suptitle(f'{ring_size}元环碎片位置分析', fontsize=16, fontweight='bold') - - positions = df_stats['position'].values - - # 1. 碎片总数分布 - ax = axes[0, 0] - bars = ax.bar(positions, df_stats['total_count'], color='steelblue', edgecolor='black', alpha=0.7) - ax.set_xlabel('环位置', fontsize=12) - ax.set_ylabel('碎片总数', fontsize=12) - ax.set_title('各位置碎片总数', fontsize=14, fontweight='bold') - ax.grid(axis='y', alpha=0.3) - - # 标注前5 - top5_indices = df_stats.nlargest(5, 'total_count').index - for idx in top5_indices: - pos = df_stats.loc[idx, 'position'] - count = df_stats.loc[idx, 'total_count'] - ax.text(pos, count, str(int(count)), ha='center', va='bottom', fontsize=9, fontweight='bold') - - # 2. 独特碎片数 - ax = axes[0, 1] - ax.bar(positions, df_stats['unique_count'], color='coral', edgecolor='black', alpha=0.7) - ax.set_xlabel('环位置', fontsize=12) - ax.set_ylabel('独特碎片数', fontsize=12) - ax.set_title('各位置独特碎片数', fontsize=14, fontweight='bold') - ax.grid(axis='y', alpha=0.3) - - # 3. 香农熵(多样性) - ax = axes[0, 2] - bars = ax.bar(positions, df_stats['shannon_entropy'], color='mediumseagreen', edgecolor='black', alpha=0.7) - ax.set_xlabel('环位置', fontsize=12) - ax.set_ylabel('香农熵', fontsize=12) - ax.set_title('各位置碎片多样性(香农熵)', fontsize=14, fontweight='bold') - ax.grid(axis='y', alpha=0.3) - - # 标注熵最高的位置 - top_entropy_idx = df_stats['shannon_entropy'].idxmax() - pos = df_stats.loc[top_entropy_idx, 'position'] - entropy = df_stats.loc[top_entropy_idx, 'shannon_entropy'] - ax.text(pos, entropy, f'{entropy:.2f}', ha='center', va='bottom', fontsize=9, fontweight='bold') - - # 4. 结构多样性 - ax = axes[1, 0] - ax.bar(positions, df_stats['structural_diversity'], color='mediumpurple', edgecolor='black', alpha=0.7) - ax.set_xlabel('环位置', fontsize=12) - ax.set_ylabel('结构多样性', fontsize=12) - ax.set_title('各位置结构多样性(指纹相似度)', fontsize=14, fontweight='bold') - ax.grid(axis='y', alpha=0.3) - - # 5. 分子量分布 - ax = axes[1, 1] - ax.errorbar(positions, df_stats['mw_mean'], yerr=df_stats['mw_std'], - fmt='o-', color='darkred', ecolor='lightcoral', capsize=5, capthick=2, alpha=0.7) - ax.set_xlabel('环位置', fontsize=12) - ax.set_ylabel('分子量 (Da)', fontsize=12) - ax.set_title('各位置碎片平均分子量', fontsize=14, fontweight='bold') - ax.grid(axis='y', alpha=0.3) - - # 6. 原子数分布 - ax = axes[1, 2] - ax.errorbar(positions, df_stats['atom_mean'], yerr=df_stats['atom_std'], - fmt='s-', color='darkgreen', ecolor='lightgreen', capsize=5, capthick=2, alpha=0.7) - ax.set_xlabel('环位置', fontsize=12) - ax.set_ylabel('原子数', fontsize=12) - ax.set_title('各位置碎片平均原子数', fontsize=14, fontweight='bold') - ax.grid(axis='y', alpha=0.3) - - plt.tight_layout() - - # 保存图表 - plot_path = output_dir / f'ring{ring_size}_position_analysis.png' - plt.savefig(plot_path, dpi=300, bbox_inches='tight') - print(f"✓ 图表已保存: {plot_path}") - - plt.show() - - -def plot_fragment_properties_distribution(fragments, ring_size, output_dir): - """绘制碎片理化性质分布""" - - # 计算所有碎片的性质 - all_props = [] - unique_smiles = set([f['fragment_smiles'] for f in fragments]) - - for smiles in unique_smiles: - props = calculate_fragment_properties(smiles) - if props: - all_props.append(props) - - df_props = pd.DataFrame(all_props) - - fig, axes = plt.subplots(2, 3, figsize=(18, 12)) - fig.suptitle(f'{ring_size}元环碎片理化性质分布', fontsize=16, fontweight='bold') - - # 1. 分子量分布 - ax = axes[0, 0] - ax.hist(df_props['molecular_weight'], bins=30, color='steelblue', edgecolor='black', alpha=0.7) - ax.set_xlabel('分子量 (Da)', fontsize=12) - ax.set_ylabel('频数', fontsize=12) - ax.set_title('分子量分布', fontsize=14, fontweight='bold') - ax.axvline(df_props['molecular_weight'].median(), color='red', linestyle='--', - label=f'中位数: {df_props["molecular_weight"].median():.1f}') - ax.legend() - ax.grid(axis='y', alpha=0.3) - - # 2. 原子数分布 - ax = axes[0, 1] - ax.hist(df_props['num_atoms'], bins=range(1, int(df_props['num_atoms'].max())+2), - color='coral', edgecolor='black', alpha=0.7) - ax.set_xlabel('原子数', fontsize=12) - ax.set_ylabel('频数', fontsize=12) - ax.set_title('原子数分布', fontsize=14, fontweight='bold') - ax.axvline(df_props['num_atoms'].median(), color='red', linestyle='--', - label=f'中位数: {df_props["num_atoms"].median():.0f}') - ax.legend() - ax.grid(axis='y', alpha=0.3) - - # 3. LogP分布 - ax = axes[0, 2] - ax.hist(df_props['logp'], bins=30, color='mediumseagreen', edgecolor='black', alpha=0.7) - ax.set_xlabel('LogP', fontsize=12) - ax.set_ylabel('频数', fontsize=12) - ax.set_title('LogP分布(亲脂性)', fontsize=14, fontweight='bold') - ax.axvline(df_props['logp'].median(), color='red', linestyle='--', - label=f'中位数: {df_props["logp"].median():.2f}') - ax.legend() - ax.grid(axis='y', alpha=0.3) - - # 4. TPSA分布 - ax = axes[1, 0] - ax.hist(df_props['tpsa'], bins=30, color='mediumpurple', edgecolor='black', alpha=0.7) - ax.set_xlabel('TPSA (Ų)', fontsize=12) - ax.set_ylabel('频数', fontsize=12) - ax.set_title('极性表面积分布', fontsize=14, fontweight='bold') - ax.axvline(df_props['tpsa'].median(), color='red', linestyle='--', - label=f'中位数: {df_props["tpsa"].median():.1f}') - ax.legend() - ax.grid(axis='y', alpha=0.3) - - # 5. 可旋转键数分布 - ax = axes[1, 1] - ax.hist(df_props['num_rotatable_bonds'], - bins=range(int(df_props['num_rotatable_bonds'].max())+2), - color='darkgoldenrod', edgecolor='black', alpha=0.7) - ax.set_xlabel('可旋转键数', fontsize=12) - ax.set_ylabel('频数', fontsize=12) - ax.set_title('可旋转键数分布(柔性)', fontsize=14, fontweight='bold') - ax.grid(axis='y', alpha=0.3) - - # 6. 氢键统计 - ax = axes[1, 2] - width = 0.35 - x = np.arange(2) - hbd_mean = df_props['num_hbd'].mean() - hba_mean = df_props['num_hba'].mean() - ax.bar(x[0]-width/2, hbd_mean, width, label='氢键供体 (HBD)', color='lightcoral', edgecolor='black') - ax.bar(x[1]+width/2, hba_mean, width, label='氢键受体 (HBA)', color='lightblue', edgecolor='black') - ax.set_ylabel('平均数量', fontsize=12) - ax.set_title('氢键能力统计', fontsize=14, fontweight='bold') - ax.set_xticks(x) - ax.set_xticklabels(['HBD', 'HBA']) - ax.legend() - ax.grid(axis='y', alpha=0.3) - - plt.tight_layout() - - # 保存图表 - plot_path = output_dir / f'ring{ring_size}_properties_distribution.png' - plt.savefig(plot_path, dpi=300, bbox_inches='tight') - print(f"✓ 图表已保存: {plot_path}") - - plt.show() - - return df_props - - -def generate_analysis_report(ring_size): - """生成完整的分析报告""" - - print("="*80) - print(f"{ring_size}元环碎片分析") - print("="*80) - print() - - # 加载碎片 - print("加载碎片数据...") - fragments = load_fragments_from_ring_size(ring_size) - - if fragments is None or len(fragments) == 0: - print(f"❌ 没有找到碎片数据") - return - - print(f"✓ 加载了 {len(fragments)} 个碎片") - print() - - # 创建输出文件夹 - output_dir = Path(f'output/ring{ring_size}_analysis') - output_dir.mkdir(parents=True, exist_ok=True) - - # 1. 位置统计分析 - print("分析各位置统计信息...") - df_stats = analyze_position_statistics(fragments, ring_size) - - # 保存统计表格 - stats_csv = output_dir / f'ring{ring_size}_position_statistics.csv' - df_stats.to_csv(stats_csv, index=False, encoding='utf-8') - print(f"✓ 统计表格已保存: {stats_csv}") - print() - - # 2. 打印关键结果 - print("="*80) - print("关键发现") - print("="*80) - print() - - # 裂解最多的前5个位置 - top5_positions = df_stats.nlargest(5, 'total_count') - print("📊 裂解最多的前5个位置:") - for i, row in enumerate(top5_positions.itertuples(), 1): - print(f" {i}. 位置 {row.position}: {row.total_count} 个碎片 " - f"({row.unique_count} 种独特结构)") - print() - - # 多样性最大的前5个位置(基于香农熵) - top5_diversity = df_stats.nlargest(5, 'shannon_entropy') - print("🌈 多样性最大的前5个位置(香农熵):") - for i, row in enumerate(top5_diversity.itertuples(), 1): - print(f" {i}. 位置 {row.position}: 熵={row.shannon_entropy:.3f}, " - f"独特比={row.uniqueness_ratio:.2%}, " - f"{row.unique_count}/{row.total_count} 独特") - print() - - # 结构多样性最大的前5个位置 - top5_structural = df_stats.nlargest(5, 'structural_diversity') - print("🔬 结构多样性最大的前5个位置(指纹相似度):") - for i, row in enumerate(top5_structural.itertuples(), 1): - print(f" {i}. 位置 {row.position}: 多样性={row.structural_diversity:.3f}, " - f"{row.unique_count} 种结构") - print() - - # 碎片大小变化最大的位置 - top5_mw_variance = df_stats.nlargest(5, 'mw_std') - print("⚖️ 分子量变化最大的前5个位置:") - for i, row in enumerate(top5_mw_variance.itertuples(), 1): - print(f" {i}. 位置 {row.position}: 平均={row.mw_mean:.1f} Da, " - f"标准差={row.mw_std:.1f} Da") - print() - - # 3. 绘制图表 - print("生成可视化图表...") - plot_position_analysis(df_stats, ring_size, output_dir) - print() - - # 4. 理化性质分析 - print("分析碎片理化性质...") - df_props = plot_fragment_properties_distribution(fragments, ring_size, output_dir) - - # 保存理化性质表格 - props_csv = output_dir / f'ring{ring_size}_fragment_properties.csv' - df_props.to_csv(props_csv, index=False, encoding='utf-8') - print(f"✓ 理化性质表格已保存: {props_csv}") - print() - - # 5. 总结统计 - print("="*80) - print("总体统计") - print("="*80) - print(f"总碎片数: {len(fragments)}") - print(f"独特碎片数: {len(set([f['fragment_smiles'] for f in fragments]))}") - print(f"平均分子量: {df_props['molecular_weight'].mean():.2f} ± {df_props['molecular_weight'].std():.2f} Da") - print(f"平均原子数: {df_props['num_atoms'].mean():.2f} ± {df_props['num_atoms'].std():.2f}") - print(f"分子量范围: {df_props['molecular_weight'].min():.1f} - {df_props['molecular_weight'].max():.1f} Da") - print(f"原子数范围: {int(df_props['num_atoms'].min())} - {int(df_props['num_atoms'].max())}") - print() - - # 6. 生成文本报告 - report_path = output_dir / f'ring{ring_size}_analysis_report.txt' - with open(report_path, 'w', encoding='utf-8') as f: - f.write(f"{ring_size}元环碎片分析报告\n") - f.write("="*80 + "\n\n") - f.write(f"生成时间: {pd.Timestamp.now()}\n\n") - - f.write("位置统计摘要:\n") - f.write(df_stats.to_string()) - f.write("\n\n") - - f.write("理化性质摘要:\n") - f.write(df_props.describe().to_string()) - f.write("\n") - - print(f"✓ 完整报告已保存: {report_path}") - print() - print("="*80) - print("分析完成!") - print("="*80) - - -if __name__ == '__main__': - import argparse - - parser = argparse.ArgumentParser(description='分析大环内酯碎片') - parser.add_argument('ring_size', type=int, help='环大小(如:16)') - - args = parser.parse_args() - - generate_analysis_report(args.ring_size) +if __name__ == "__main__": + sys.argv = ["macro-lactone-toolkit", "analyze", *sys.argv[1:]] + main() diff --git a/scripts/batch_process.py b/scripts/batch_process.py index 02d9a41..d9f6709 100644 --- a/scripts/batch_process.py +++ b/scripts/batch_process.py @@ -1,165 +1,10 @@ -""" -Batch processing script for analyzing all macrolactones in the dataset. -""" +from __future__ import annotations + import sys -sys.path.append('..') -import pandas as pd -from pathlib import Path -from tqdm import tqdm -import json -from rdkit import Chem - -from src.fragment_cleaver import process_molecule -from src.fragment_dataclass import MoleculeFragments - - -def batch_process_molecules(csv_path: str, output_base_dir: str, - max_molecules: int = None): - """ - Process all molecules in the CSV file. - - Args: - csv_path: Path to the CSV file containing SMILES - output_base_dir: Base directory for output - max_molecules: Maximum number of molecules to process (None for all) - """ - # Read CSV - df = pd.read_csv(csv_path) - print(f"Loaded {len(df)} molecules from {csv_path}") - - if max_molecules: - df = df.head(max_molecules) - print(f"Processing first {max_molecules} molecules") - - # Create output directory - output_dir = Path(output_base_dir) - output_dir.mkdir(parents=True, exist_ok=True) - - # Statistics - successful = 0 - failed = 0 - failed_molecules = [] - all_fragments = [] - - # Process each molecule - for idx, row in tqdm(df.iterrows(), total=len(df), desc="Processing molecules"): - smiles = row['smiles'] - molecule_id = row.get('IDs', f'molecule_{idx}') - - try: - # Process molecule - mol_fragments = process_molecule(smiles, idx) - - if mol_fragments is None or len(mol_fragments.fragments) == 0: - failed += 1 - failed_molecules.append({ - 'index': idx, - 'id': molecule_id, - 'reason': 'No fragments extracted' - }) - continue - - # Create output directory for this molecule - mol_output_dir = output_dir / mol_fragments.parent_id - mol_output_dir.mkdir(parents=True, exist_ok=True) - - # Save complete molecule fragments - mol_fragments_path = mol_output_dir / f"{mol_fragments.parent_id}_all_fragments.json" - mol_fragments.to_json_file(str(mol_fragments_path)) - - # Save individual fragments - for frag in mol_fragments.fragments: - frag_path = mol_output_dir / f"{frag.fragment_id}.json" - frag.to_json_file(str(frag_path)) - - # Collect for overall statistics - all_fragments.append({ - 'parent_id': frag.parent_id, - 'fragment_id': frag.fragment_id, - 'cleavage_position': frag.cleavage_position, - 'fragment_smiles': frag.fragment_smiles, - 'atom_count': frag.atom_count, - 'molecular_weight': frag.molecular_weight, - 'parent_smiles': frag.parent_smiles - }) - - successful += 1 - - except Exception as e: - failed += 1 - failed_molecules.append({ - 'index': idx, - 'id': molecule_id, - 'error': str(e) - }) - print(f"\nError processing molecule {idx} ({molecule_id}): {e}") - - # Save overall statistics - stats = { - 'total_molecules': len(df), - 'successful': successful, - 'failed': failed, - 'total_fragments': len(all_fragments), - 'failed_molecules': failed_molecules - } - - stats_path = output_dir / 'processing_stats.json' - with open(stats_path, 'w') as f: - json.dump(stats, f, indent=2) - - # Save all fragments as CSV for easy analysis - if all_fragments: - fragments_df = pd.DataFrame(all_fragments) - fragments_csv_path = output_dir / 'all_fragments.csv' - fragments_df.to_csv(fragments_csv_path, index=False) - print(f"\n✓ Saved all fragments to: {fragments_csv_path}") - - # Print summary - print(f"\n{'='*60}") - print(f"PROCESSING COMPLETE") - print(f"{'='*60}") - print(f"Total molecules: {len(df)}") - print(f"Successfully processed: {successful}") - print(f"Failed: {failed}") - print(f"Total fragments extracted: {len(all_fragments)}") - print(f"{'='*60}") - print(f"\nResults saved to: {output_dir}") - print(f"Statistics saved to: {stats_path}") - - return fragments_df if all_fragments else None +from macro_lactone_toolkit.cli import main if __name__ == "__main__": - import argparse - - parser = argparse.ArgumentParser( - description="Batch process macrolactones to extract side chain fragments" - ) - parser.add_argument( - '--input', - type=str, - default='../ring16/temp.csv', - help='Input CSV file path' - ) - parser.add_argument( - '--output', - type=str, - default='../output/fragments', - help='Output directory path' - ) - parser.add_argument( - '--max', - type=int, - default=None, - help='Maximum number of molecules to process (default: all)' - ) - - args = parser.parse_args() - - batch_process_molecules( - csv_path=args.input, - output_base_dir=args.output, - max_molecules=args.max - ) - + sys.argv = ["macro-lactone-toolkit", "fragment", *sys.argv[1:]] + main() diff --git a/scripts/batch_process_multi_rings.py b/scripts/batch_process_multi_rings.py old mode 100755 new mode 100644 index 64ad1cc..d9f6709 --- a/scripts/batch_process_multi_rings.py +++ b/scripts/batch_process_multi_rings.py @@ -1,423 +1,10 @@ -#!/usr/bin/env python3 -""" -批量处理12-20元环大环内酯分子 -处理文件: data/ring12_20/temp.csv -输出文件夹: output/ring{N}_fragments/ (N=12,13,14...20) -""" +from __future__ import annotations import sys -from pathlib import Path -# 添加项目根目录到 Python 路径 -# 获取脚本所在目录的父目录(项目根目录) -script_dir = Path(__file__).resolve().parent -project_root = script_dir.parent -sys.path.insert(0, str(project_root)) - -from rdkit import Chem -import pandas as pd -import json -from tqdm import tqdm -from datetime import datetime -from collections import defaultdict - -from src.ring_numbering import get_ring_atoms -from src.fragment_cleaver import cleave_side_chains +from macro_lactone_toolkit.cli import main -def find_lactone_carbons(mol, ring_size): - """ - 找到指定环大小的环上的内酯羰基碳原子 - 返回所有环上内酯碳的索引列表 - """ - # 获取指定环大小的所有环原子 - ring_pattern = Chem.MolFromSmarts(f"[r{ring_size}]") - if ring_pattern is None: - return [] - - ring_matches = mol.GetSubstructMatches(ring_pattern) - ring_atoms = list(set([match[0] for match in ring_matches])) - - if len(ring_atoms) == 0: - return [] - - # 使用更严格的SMARTS模式:环上的氧连接到环上的羰基碳 - # [r{ring_size};#8] 环上的氧,连接到 [r{ring_size};#6] 环上的碳,该碳有双键氧 (=O) - lactone_pattern = Chem.MolFromSmarts(f"[r{ring_size};#8][r{ring_size};#6](=[#8])") - - if lactone_pattern is None: - return [] - - matches = mol.GetSubstructMatches(lactone_pattern) - - # 提取环上的内酯碳索引 (match[1]是羰基碳) - lactone_carbons = [] - for match in matches: - if len(match) >= 2: - carbonyl_carbon = match[1] # 羰基碳 - # 确保这个碳确实在指定大小的环上 - if carbonyl_carbon in ring_atoms: - lactone_carbons.append(carbonyl_carbon) - - return list(set(lactone_carbons)) # 去重 - - -def order_ring_atoms_clockwise(mol, ring_atoms, start_atom): - """顺时针排列环原子""" - if start_atom not in ring_atoms: - raise ValueError("Start atom is not in the ring") - - ordered = [start_atom] - visited = {start_atom} - current = start_atom - - while len(ordered) < len(ring_atoms): - atom = mol.GetAtomWithIdx(current) - - # 找到下一个环原子 - next_atom = None - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - if neighbor_idx in ring_atoms and neighbor_idx not in visited: - next_atom = neighbor_idx - break - - if next_atom is None: - # 尝试从起点反向遍历 - if len(ordered) < len(ring_atoms): - atom = mol.GetAtomWithIdx(start_atom) - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - if neighbor_idx in ring_atoms and neighbor_idx not in visited: - ordered.insert(0, neighbor_idx) - visited.add(neighbor_idx) - current = neighbor_idx - break - else: - break - else: - break - else: - ordered.append(next_atom) - visited.add(next_atom) - current = next_atom - - return ordered - - -def assign_ring_numbering_general(mol, ring_size): - """ - 为任意大小的环分配编号 - 返回: (numbering_dict, lactone_carbon) 或 (None, None) 或 ('multiple_lactones', lactone_list) - """ - # 找到内酯碳 - lactone_carbons = find_lactone_carbons(mol, ring_size) - - if len(lactone_carbons) == 0: - return None, None - - if len(lactone_carbons) > 1: - return 'multiple_lactones', lactone_carbons - - lactone_c = lactone_carbons[0] - - # 获取环原子 - ring_atoms = get_ring_atoms(mol) - - # 如果环原子数不匹配 - if len(ring_atoms) != ring_size: - return None, None - - # 排列环原子 - try: - ordered_atoms = order_ring_atoms_clockwise(mol, ring_atoms, lactone_c) - except Exception: - return None, None - - # 创建编号字典 - numbering = {} - for position, atom_idx in enumerate(ordered_atoms, start=1): - numbering[atom_idx] = position - - return numbering, lactone_c - - -def validate_numbering_general(numbering, ring_size): - """验证编号是否正确""" - if numbering is None or numbering == 'multiple_lactones': - return False - - if len(numbering) != ring_size: - return False - - positions = set(numbering.values()) - if positions != set(range(1, ring_size + 1)): - return False - - return True - - -def get_ring_size_from_smiles(smiles): - """从SMILES获取分子中的环大小""" - mol = Chem.MolFromSmiles(smiles) - if mol is None: - return [] - - ring_sizes = set() - for ring_size in range(12, 21): # 12-20 - pattern = Chem.MolFromSmarts(f"[r{ring_size}]") - if pattern is not None: - matches = mol.GetSubstructMatches(pattern) - if matches: - unique_atoms = list(set([match[0] for match in matches])) - if len(unique_atoms) == ring_size: - ring_sizes.add(ring_size) - - return sorted(list(ring_sizes)) - - -def process_single_molecule(idx, row, ring_size, output_base_dir, log_file, error_log_file, multiple_lactone_log): - """处理单个分子""" - try: - # 获取分子信息 - molecule_id = row.get('IDs', f'mol_{idx}') - molecule_name = row.get('molecule_pref_name', 'Unknown') - smiles = row['smiles'] - parent_id = f"ring{ring_size}_mol_{idx}" - - # 解析SMILES - mol = Chem.MolFromSmiles(smiles) - if mol is None: - log_message(f"分子 {idx} ({molecule_id}) SMILES解析失败", error_log_file, 'error') - return None, 'parse_failed' - - # 分配编号 - numbering, lactone_c = assign_ring_numbering_general(mol, ring_size) - - # 检查环上是否有多个内酯键 - if numbering == 'multiple_lactones': - log_message( - f"分子 {idx} ({molecule_id}, {molecule_name}) 环上有{len(lactone_c)}个内酯键,已剔除。环上内酯碳索引: {lactone_c}", - multiple_lactone_log, - 'info' - ) - return None, 'multiple_lactones' - - if numbering is None: - log_message(f"分子 {idx} ({molecule_id}) 编号失败", error_log_file, 'error') - return None, 'numbering_failed' - - # 验证编号 - if not validate_numbering_general(numbering, ring_size): - log_message(f"分子 {idx} ({molecule_id}) 编号验证失败", error_log_file, 'error') - return None, 'validation_failed' - - # 侧链断裂 - try: - mol_fragments = cleave_side_chains(mol, smiles, parent_id, numbering) - except Exception as e: - log_message(f"分子 {idx} ({molecule_id}) 裂解失败: {str(e)}", error_log_file, 'error') - return None, 'cleavage_failed' - - # 创建分子专属文件夹 - mol_output_dir = output_base_dir / parent_id - mol_output_dir.mkdir(parents=True, exist_ok=True) - - # 保存整个MoleculeFragments对象 - mol_fragments_path = mol_output_dir / f"{parent_id}_all_fragments.json" - mol_fragments.to_json_file(str(mol_fragments_path)) - - # 保存每个碎片 - for frag in mol_fragments.fragments: - frag_path = mol_output_dir / f"{frag.fragment_id}.json" - frag.to_json_file(str(frag_path)) - - # 保存分子信息到metadata - metadata = { - 'parent_id': parent_id, - 'molecule_id': molecule_id, - 'molecule_name': molecule_name, - 'smiles': smiles, - 'ring_size': ring_size, - 'num_fragments': len(mol_fragments.fragments), - 'processing_date': datetime.now().isoformat() - } - - metadata_path = mol_output_dir / 'metadata.json' - with open(metadata_path, 'w', encoding='utf-8') as f: - json.dump(metadata, f, indent=2, ensure_ascii=False) - - return len(mol_fragments.fragments), 'success' - - except Exception as e: - log_message(f"分子 {idx} 未预期错误: {str(e)}", error_log_file, 'error') - return None, 'unexpected_error' - - -def log_message(message, log_file, log_type='info'): - """记录日志信息""" - timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S') - log_entry = f"[{timestamp}] {message}\n" - - with open(log_file, 'a', encoding='utf-8') as f: - f.write(log_entry) - - if log_type == 'error': - tqdm.write(f"❌ {message}") - elif log_type == 'info' and 'multiple' not in log_file.name: - tqdm.write(f"✓ {message}") - - -def main(): - print("="*80) - print("12-20元环大环内酯批量处理程序") - print("="*80) - print() - - # 读取数据 - input_file = Path('data/ring12_20/temp.csv') - if not input_file.exists(): - print(f"❌ 错误: 找不到输入文件 {input_file}") - return - - df = pd.read_csv(input_file) - print(f"✓ 读取数据集: {len(df)} 个分子") - print(f" 文件: {input_file}") - print() - - # 第一步:按环大小分类 - print("步骤 1: 按环大小分类分子...") - print() - - ring_size_groups = defaultdict(list) - - for idx in tqdm(range(len(df)), desc="分类进度"): - smiles = df.iloc[idx]['smiles'] - ring_sizes = get_ring_size_from_smiles(smiles) - - for ring_size in ring_sizes: - if 12 <= ring_size <= 20: - ring_size_groups[ring_size].append(idx) - - print() - print("分类结果:") - for ring_size in sorted(ring_size_groups.keys()): - print(f" {ring_size}元环: {len(ring_size_groups[ring_size])} 个分子") - print() - - # 第二步:对每种环大小分别处理 - print("步骤 2: 处理各环大小分子...") - print() - - all_stats = {} - - for ring_size in sorted(ring_size_groups.keys()): - indices = ring_size_groups[ring_size] - - print(f"\n处理 {ring_size}元环 ({len(indices)} 个分子)...") - print("-" * 60) - - # 创建输出文件夹 - output_base_dir = Path(f'output/ring{ring_size}_fragments') - output_base_dir.mkdir(parents=True, exist_ok=True) - - # 创建日志文件 - timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") - log_file = output_base_dir / f'processing_log_{timestamp}.txt' - error_log_file = output_base_dir / f'error_log_{timestamp}.txt' - multiple_lactone_log = output_base_dir / f'multiple_lactone_log_{timestamp}.txt' - - # 统计信息 - stats = { - 'ring_size': ring_size, - 'total': len(indices), - 'success': 0, - 'failed_parse': 0, - 'failed_numbering': 0, - 'failed_validation': 0, - 'failed_cleavage': 0, - 'failed_unexpected': 0, - 'multiple_lactones': 0, - 'total_fragments': 0 - } - - log_message(f"开始处理 {len(indices)} 个{ring_size}元环分子", log_file) - - # 处理每个分子 - for idx in tqdm(indices, desc=f"Ring-{ring_size}"): - num_fragments, status = process_single_molecule( - idx, df.iloc[idx], ring_size, - output_base_dir, log_file, error_log_file, multiple_lactone_log - ) - - if status == 'success': - stats['success'] += 1 - stats['total_fragments'] += num_fragments - elif status == 'parse_failed': - stats['failed_parse'] += 1 - elif status == 'numbering_failed': - stats['failed_numbering'] += 1 - elif status == 'validation_failed': - stats['failed_validation'] += 1 - elif status == 'cleavage_failed': - stats['failed_cleavage'] += 1 - elif status == 'multiple_lactones': - stats['multiple_lactones'] += 1 - else: - stats['failed_unexpected'] += 1 - - # 保存统计结果 - stats_file = output_base_dir / 'processing_statistics.json' - with open(stats_file, 'w', encoding='utf-8') as f: - json.dump(stats, f, indent=2) - - all_stats[ring_size] = stats - - # 打印该环大小的统计 - print() - print(f"{ring_size}元环处理结果:") - print(f" 成功: {stats['success']}/{stats['total']} ({stats['success']/stats['total']*100:.1f}%)") - print(f" 碎片数: {stats['total_fragments']}") - if stats['success'] > 0: - print(f" 平均碎片: {stats['total_fragments']/stats['success']:.2f} 个/分子") - print(f" 多个内酯键: {stats['multiple_lactones']}") - - log_message(f"处理完成: 成功 {stats['success']}/{stats['total']}", log_file) - - # 第三步:总结 - print() - print("="*80) - print("所有环大小处理完成!") - print("="*80) - print() - - print("总体统计:") - total_molecules = 0 - total_success = 0 - total_fragments = 0 - total_multiple_lactones = 0 - - for ring_size, stats in sorted(all_stats.items()): - total_molecules += stats['total'] - total_success += stats['success'] - total_fragments += stats['total_fragments'] - total_multiple_lactones += stats['multiple_lactones'] - - print(f" 总分子数: {total_molecules}") - print(f" 总成功数: {total_success} ({total_success/total_molecules*100:.1f}%)") - print(f" 总碎片数: {total_fragments}") - if total_success > 0: - print(f" 平均碎片: {total_fragments/total_success:.2f} 个/分子") - print(f" 多内酯键分子: {total_multiple_lactones}") - print() - - print("各环大小输出文件夹:") - for ring_size in sorted(all_stats.keys()): - output_dir = Path(f'output/ring{ring_size}_fragments') - print(f" {ring_size}元环: {output_dir}") - print() - - -if __name__ == '__main__': +if __name__ == "__main__": + sys.argv = ["macro-lactone-toolkit", "fragment", *sys.argv[1:]] main() - diff --git a/scripts/batch_process_ring16.py b/scripts/batch_process_ring16.py old mode 100755 new mode 100644 index 7b1f0d5..52c9a7c --- a/scripts/batch_process_ring16.py +++ b/scripts/batch_process_ring16.py @@ -1,259 +1,10 @@ -#!/usr/bin/env python3 -""" -批量处理16元环大环内酯分子 -处理文件: ring16/temp_filtered_complete.csv (1241个分子) -输出文件夹: output/ring16_fragments/ -""" +from __future__ import annotations import sys -from pathlib import Path -# 添加项目根目录到 Python 路径 -# 获取脚本所在目录的父目录(项目根目录) -script_dir = Path(__file__).resolve().parent -project_root = script_dir.parent -sys.path.insert(0, str(project_root)) - -from rdkit import Chem -import pandas as pd -import json -from tqdm import tqdm -from datetime import datetime - -from src.ring_numbering import ( - assign_ring_numbering, - validate_numbering, - get_ring_atoms -) -from src.fragment_cleaver import cleave_side_chains +from macro_lactone_toolkit.cli import main -def log_message(message, log_file, log_type='info'): - """记录日志信息""" - timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S') - log_entry = f"[{timestamp}] {message}\n" - - with open(log_file, 'a', encoding='utf-8') as f: - f.write(log_entry) - - if log_type == 'error': - print(f"❌ {message}") - else: - print(f"✓ {message}") - - -def process_single_molecule(idx, row, output_base_dir, log_file, error_log_file, multiple_lactone_log): - """处理单个分子""" - try: - # 获取分子信息 - molecule_id = row['IDs'] - molecule_name = row['molecule_pref_name'] - smiles = row['smiles'] - parent_id = f"ring16_mol_{idx}" - - # 解析SMILES - mol = Chem.MolFromSmiles(smiles) - if mol is None: - log_message(f"分子 {idx} ({molecule_id}) SMILES解析失败", error_log_file, 'error') - return None, 'parse_failed' - - # 检查环上是否有多个内酯键 - from src.ring_numbering import find_lactone_carbon, get_ring_atoms - - # 获取16元环上的所有原子 - ring_atoms = get_ring_atoms(mol) - - if len(ring_atoms) > 0: - # 使用更严格的SMARTS模式:环上的氧连接到环上的羰基碳 - # [r16;O] 环上的氧,连接到 [r16;#6] 环上的碳,该碳有双键氧 (=O) - lactone_pattern = Chem.MolFromSmarts("[r16;#8][r16;#6](=[#8])") - - if lactone_pattern: - matches = mol.GetSubstructMatches(lactone_pattern) - - # 提取环上的内酯碳索引(match[1]是羰基碳) - lactone_carbons_on_ring = [] - for match in matches: - if len(match) >= 2: - carbonyl_carbon = match[1] # 羰基碳 - # 确保这个碳确实在16元环上 - if carbonyl_carbon in ring_atoms: - lactone_carbons_on_ring.append(carbonyl_carbon) - - # 去重 - lactone_carbons_on_ring = list(set(lactone_carbons_on_ring)) - - # 只有当环上有多个内酯键时才过滤 - if len(lactone_carbons_on_ring) > 1: - log_message( - f"分子 {idx} ({molecule_id}, {molecule_name}) 环上有{len(lactone_carbons_on_ring)}个内酯键,已过滤。环上内酯碳索引: {lactone_carbons_on_ring}", - multiple_lactone_log, - 'info' - ) - return None, 'multiple_lactones' - - # 分配编号 - numbering = assign_ring_numbering(mol) - if len(numbering) == 0: - log_message(f"分子 {idx} ({molecule_id}) 编号失败", error_log_file, 'error') - return None, 'numbering_failed' - - # 验证编号 - if not validate_numbering(mol, numbering): - log_message(f"分子 {idx} ({molecule_id}) 编号验证失败", error_log_file, 'error') - return None, 'validation_failed' - - # 侧链断裂 - try: - mol_fragments = cleave_side_chains(mol, smiles, parent_id, numbering) - except Exception as e: - log_message(f"分子 {idx} ({molecule_id}) 裂解失败: {str(e)}", error_log_file, 'error') - return None, 'cleavage_failed' - - # 创建分子专属文件夹 - mol_output_dir = output_base_dir / parent_id - mol_output_dir.mkdir(parents=True, exist_ok=True) - - # 保存整个MoleculeFragments对象 - mol_fragments_path = mol_output_dir / f"{parent_id}_all_fragments.json" - mol_fragments.to_json_file(str(mol_fragments_path)) - - # 保存每个碎片 - for frag in mol_fragments.fragments: - frag_path = mol_output_dir / f"{frag.fragment_id}.json" - frag.to_json_file(str(frag_path)) - - # 保存分子信息到metadata - metadata = { - 'parent_id': parent_id, - 'molecule_id': molecule_id, - 'molecule_name': molecule_name, - 'smiles': smiles, - 'ring_size': 16, - 'num_fragments': len(mol_fragments.fragments), - 'processing_date': datetime.now().isoformat() - } - - metadata_path = mol_output_dir / 'metadata.json' - with open(metadata_path, 'w', encoding='utf-8') as f: - json.dump(metadata, f, indent=2, ensure_ascii=False) - - return len(mol_fragments.fragments), 'success' - - except Exception as e: - log_message(f"分子 {idx} 未预期错误: {str(e)}", error_log_file, 'error') - return None, 'unexpected_error' - - -def main(): - print("="*80) - print("16元环大环内酯批量处理程序") - print("="*80) - print() - - # 读取数据 - input_file = Path('ring16/temp_filtered_complete.csv') - if not input_file.exists(): - print(f"❌ 错误: 找不到输入文件 {input_file}") - return - - df = pd.read_csv(input_file) - print(f"✓ 读取数据集: {len(df)} 个分子") - print(f" 文件: {input_file}") - print() - - # 创建输出文件夹 - output_base_dir = Path('output/ring16_fragments') - output_base_dir.mkdir(parents=True, exist_ok=True) - - # 创建日志文件 - timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") - log_file = output_base_dir / f'processing_log_{timestamp}.txt' - error_log_file = output_base_dir / f'error_log_{timestamp}.txt' - multiple_lactone_log = output_base_dir / f'multiple_lactone_log_{timestamp}.txt' - - print(f"✓ 输出文件夹: {output_base_dir}") - print(f"✓ 日志文件: {log_file}") - print() - - # 统计信息 - stats = { - 'total': len(df), - 'success': 0, - 'failed_parse': 0, - 'failed_numbering': 0, - 'failed_validation': 0, - 'failed_cleavage': 0, - 'failed_unexpected': 0, - 'multiple_lactones': 0, - 'total_fragments': 0 - } - - # 批量处理 - print("开始批量处理...") - print() - - log_message(f"开始批量处理 {len(df)} 个16元环分子", log_file) - - for idx in tqdm(range(len(df)), desc="处理进度"): - num_fragments, status = process_single_molecule( - idx, df.iloc[idx], output_base_dir, log_file, error_log_file, multiple_lactone_log - ) - - if status == 'success': - stats['success'] += 1 - stats['total_fragments'] += num_fragments - elif status == 'parse_failed': - stats['failed_parse'] += 1 - elif status == 'numbering_failed': - stats['failed_numbering'] += 1 - elif status == 'validation_failed': - stats['failed_validation'] += 1 - elif status == 'cleavage_failed': - stats['failed_cleavage'] += 1 - elif status == 'multiple_lactones': - stats['multiple_lactones'] += 1 - else: - stats['failed_unexpected'] += 1 - - print() - print("="*80) - print("处理完成!") - print("="*80) - print() - - # 打印统计结果 - print("统计结果:") - print(f" 总分子数: {stats['total']}") - print(f" 成功处理: {stats['success']} ({stats['success']/stats['total']*100:.2f}%)") - print(f" 总碎片数: {stats['total_fragments']}") - if stats['success'] > 0: - print(f" 平均碎片: {stats['total_fragments']/stats['success']:.2f} 个/分子") - print() - - print("失败情况:") - print(f" SMILES解析失败: {stats['failed_parse']}") - print(f" 编号失败: {stats['failed_numbering']}") - print(f" 验证失败: {stats['failed_validation']}") - print(f" 裂解失败: {stats['failed_cleavage']}") - print(f" 多个内酯键: {stats['multiple_lactones']}") - print(f" 其他错误: {stats['failed_unexpected']}") - total_failed = stats['total'] - stats['success'] - print(f" 总失败: {total_failed} ({total_failed/stats['total']*100:.2f}%)") - print() - - # 保存统计结果 - stats_file = output_base_dir / 'processing_statistics.json' - with open(stats_file, 'w', encoding='utf-8') as f: - json.dump(stats, f, indent=2) - - print(f"✓ 统计结果已保存: {stats_file}") - print(f"✓ 输出文件夹: {output_base_dir}") - print() - - log_message(f"处理完成: 成功 {stats['success']}/{stats['total']}", log_file) - - -if __name__ == '__main__': +if __name__ == "__main__": + sys.argv = ["macro-lactone-toolkit", "fragment", "--ring-size", "16", *sys.argv[1:]] main() - diff --git a/scripts/generate_sdf_and_statistics.py b/scripts/generate_sdf_and_statistics.py index 01c7ba5..4dc6d6a 100644 --- a/scripts/generate_sdf_and_statistics.py +++ b/scripts/generate_sdf_and_statistics.py @@ -1,203 +1,12 @@ -#!/usr/bin/env python3 -""" -统计分析795个成功处理的分子,并生成SDF文件 - -功能: -1. 统计哪些位置被替换(断裂)最多 -2. 为每个分子生成3D SDF文件(使用ETKDGv3) -""" - -import sys -from pathlib import Path -from collections import Counter, defaultdict -import json -from tqdm import tqdm - -# 添加项目根目录到 Python 路径 -script_dir = Path(__file__).resolve().parent -project_root = script_dir.parent -sys.path.insert(0, str(project_root)) - -from rdkit import Chem -from rdkit.Chem import AllChem - -from src.fragment_dataclass import MoleculeFragments +from macro_lactone_toolkit import FragmentationResult -def load_all_fragments(output_dir): - """加载所有分子的碎片信息""" - fragments_dir = Path(output_dir) / "ring16_fragments" - mol_dirs = sorted([d for d in fragments_dir.glob("ring16_mol_*") if d.is_dir()]) - - all_fragments = [] - for mol_dir in tqdm(mol_dirs, desc="加载碎片信息"): - fragments_file = mol_dir / f"{mol_dir.name}_all_fragments.json" - if fragments_file.exists(): - try: - mol_fragments = MoleculeFragments.from_json_file(str(fragments_file)) - all_fragments.append((mol_dir, mol_fragments)) - except Exception as e: - print(f"警告: 无法加载 {fragments_file}: {e}") - - return all_fragments +def main() -> None: + raise SystemExit( + "Legacy helper retired. Use 'macro-lactone-toolkit fragment' to export fragments, " + "then generate SDF/statistics in downstream analysis code." + ) -def analyze_cleavage_positions(all_fragments): - """统计分析哪些位置被替换最多""" - position_counter = Counter() - position_molecules = defaultdict(list) # 记录每个位置出现在哪些分子中 - - for mol_dir, mol_fragments in all_fragments: - positions = mol_fragments.get_cleavage_positions() - for pos in positions: - position_counter[pos] += 1 - position_molecules[pos].append(mol_dir.name) - - # 打印统计结果 - print("\n" + "="*80) - print("位置替换统计(按替换次数排序)") - print("="*80) - print(f"{'位置':<8} {'替换次数':<12} {'占比':<10} {'分子数':<10}") - print("-"*80) - - total_replacements = sum(position_counter.values()) - for pos in sorted(position_counter.keys()): - count = position_counter[pos] - percentage = count / total_replacements * 100 if total_replacements > 0 else 0 - mol_count = len(set(position_molecules[pos])) - print(f"{pos:<8} {count:<12} {percentage:>6.2f}% {mol_count:<10}") - - print("-"*80) - print(f"{'总计':<8} {total_replacements:<12} {'100.00%':<10} {len(all_fragments):<10}") - print("="*80) - - # 保存统计结果 - stats = { - 'position_counts': dict(position_counter), - 'position_molecules': {str(k): list(set(v)) for k, v in position_molecules.items()}, - 'total_replacements': total_replacements, - 'total_molecules': len(all_fragments) - } - - stats_file = Path(output_dir) / "ring16_fragments" / "cleavage_position_statistics.json" - with open(stats_file, 'w', encoding='utf-8') as f: - json.dump(stats, f, indent=2, ensure_ascii=False) - - print(f"\n✓ 统计结果已保存: {stats_file}") - - return position_counter, position_molecules - - -def generate_3d_sdf(mol, output_path, max_attempts=100): - """使用ETKDGv3生成3D SDF文件""" - try: - # 添加氢原子 - mol = Chem.AddHs(mol) - - # 使用ETKDGv3参数 - params = AllChem.ETKDGv3() - - for attempt in range(max_attempts): - try: - status = AllChem.EmbedMolecule(mol, params) - if status == 0: - # UFF优化 - AllChem.UFFOptimizeMolecule(mol) - - writer = Chem.SDWriter(str(output_path)) - writer.write(mol) - writer.close() - return mol, True - except Exception as e: - if attempt == max_attempts - 1: - raise e - continue - - return None, False - except Exception as e: - print(f"生成3D SDF失败: {e}") - return None, False - - -def generate_sdf_files(all_fragments, output_dir): - """为所有分子生成3D SDF文件""" - fragments_dir = Path(output_dir) / "ring16_fragments" - - print("\n" + "="*80) - print("生成3D SDF文件") - print("="*80) - - success_count = 0 - failed_count = 0 - failed_molecules = [] # 记录失败的分子 - - for mol_dir, mol_fragments in tqdm(all_fragments, desc="生成3D SDF"): - smiles = mol_fragments.parent_smiles - - # 先检查能否从SMILES创建分子 - test_mol = Chem.MolFromSmiles(smiles) - if test_mol is None: - failed_count += 1 - failed_molecules.append((mol_dir.name, "无法从SMILES创建分子")) - continue - - base_name = mol_dir.name - sdf_3d_path = mol_dir / f"{base_name}_3d.sdf" - - # 重试机制:最多尝试10次,每次从原始SMILES重新创建分子 - max_retries = 10 - success = False - - for retry in range(max_retries): - # 每次重试都从原始SMILES重新创建分子 - mol = Chem.MolFromSmiles(smiles) - if mol is None: - continue - - mol_3d, success = generate_3d_sdf(mol, sdf_3d_path, max_attempts=100) - if success and mol_3d is not None: - success_count += 1 - break - - if not success: - failed_count += 1 - failed_molecules.append((mol_dir.name, f"10次重试后仍失败")) - - print(f"\n✓ 成功生成3D SDF: {success_count}/{len(all_fragments)}") - if failed_count > 0: - print(f"⚠ 失败: {failed_count}") - print("\n失败的分子列表:") - for mol_name, reason in failed_molecules: - print(f" - {mol_name}: {reason}") - - -def main(): - global output_dir - output_dir = Path('output') - - print("="*80) - print("16元环大环内酯统计分析及SDF生成程序") - print("="*80) - print() - - # 1. 加载所有碎片信息 - print("步骤1: 加载所有分子的碎片信息...") - all_fragments = load_all_fragments(output_dir) - print(f"✓ 加载了 {len(all_fragments)} 个分子的碎片信息") - - # 2. 统计分析位置替换 - print("\n步骤2: 统计分析位置替换...") - position_counter, position_molecules = analyze_cleavage_positions(all_fragments) - - # 3. 生成3D SDF文件 - print("\n步骤3: 生成3D SDF文件...") - generate_sdf_files(all_fragments, output_dir) - - print("\n" + "="*80) - print("处理完成!") - print("="*80) - - -if __name__ == '__main__': +if __name__ == "__main__": main() - diff --git a/scripts/tylosin_splicer.py b/scripts/tylosin_splicer.py index 71f04d4..17dbc70 100644 --- a/scripts/tylosin_splicer.py +++ b/scripts/tylosin_splicer.py @@ -1,10 +1,34 @@ -#!/usr/bin/env python -""" -Tylosin Splicing System - Main Entry Point -""" +from __future__ import annotations + +import argparse + +from rdkit import Chem + +from macro_lactone_toolkit.splicing import ( + activate_fragment, + prepare_macrolactone_scaffold, + splice_molecule, +) + + +def main() -> None: + parser = argparse.ArgumentParser(description="Splice a fragment onto a prepared macrolactone scaffold.") + parser.add_argument("--scaffold-smiles", required=True) + parser.add_argument("--fragment-smiles", required=True) + parser.add_argument("--position", required=True, type=int) + parser.add_argument("--ring-size", type=int, default=None) + parser.add_argument("--strategy", default="smart") + args = parser.parse_args() + + scaffold, _ = prepare_macrolactone_scaffold( + args.scaffold_smiles, + positions=[args.position], + ring_size=args.ring_size, + ) + fragment = activate_fragment(args.fragment_smiles, strategy=args.strategy) + product = splice_molecule(scaffold, fragment, position=args.position) + print(Chem.MolToSmiles(product, isomericSmiles=True)) -def main(): - print("Hello from Tylosin Splicer") if __name__ == "__main__": main() diff --git a/setup.py b/setup.py deleted file mode 100644 index 07f32e6..0000000 --- a/setup.py +++ /dev/null @@ -1,75 +0,0 @@ -""" -Setup script for macrolactone-fragmenter package. - -This package provides tools for analyzing and fragmenting macrolactone -(12-20 membered ring) side chains. - -Note: RDKit must be installed separately via conda: - conda install -c conda-forge rdkit - -For development installation: - pip install -e . - -For regular installation: - pip install . -""" - -from setuptools import setup, find_packages -from pathlib import Path - -# Read the contents of README file -this_directory = Path(__file__).parent -long_description = (this_directory / "README.md").read_text(encoding="utf-8") - -setup( - name="macrolactone-fragmenter", - version="1.2.0", - author="Macro Split Team", - author_email="your.email@example.com", - description="A tool for analyzing and fragmenting macrolactone side chains", - long_description=long_description, - long_description_content_type="text/markdown", - url="https://github.com/yourusername/macro_split", - packages=find_packages(), - classifiers=[ - "Development Status :: 4 - Beta", - "Intended Audience :: Science/Research", - "Topic :: Scientific/Engineering :: Chemistry", - "License :: OSI Approved :: MIT License", - "Programming Language :: Python :: 3", - "Programming Language :: Python :: 3.8", - "Programming Language :: Python :: 3.9", - "Programming Language :: Python :: 3.10", - "Programming Language :: Python :: 3.11", - ], - python_requires=">=3.8", - install_requires=[ - "pandas>=1.3.0", - "numpy>=1.20.0", - "matplotlib>=3.3.0", - "seaborn>=0.11.0", - "dataclasses-json>=0.5.0", - "tqdm>=4.60.0", - ], - extras_require={ - "dev": [ - "jupyter>=1.0.0", - "pytest>=7.0.0", - "black>=22.0.0", - "flake8>=4.0.0", - ], - "docs": [ - "mkdocs>=1.4.0", - "mkdocs-material>=9.0.0", - "mkdocstrings[python]>=0.20.0", - ], - }, - entry_points={ - "console_scripts": [ - # Can add command-line tools here - ], - }, - include_package_data=True, - zip_safe=False, -) - diff --git a/src/__init__.py b/src/__init__.py deleted file mode 100644 index 0f5d380..0000000 --- a/src/__init__.py +++ /dev/null @@ -1,6 +0,0 @@ -""" -16-Membered Macrolactone Side Chain Analysis Package -""" - -__version__ = "1.0.0" - diff --git a/src/fragment_cleaver.py b/src/fragment_cleaver.py deleted file mode 100644 index 79d3e1b..0000000 --- a/src/fragment_cleaver.py +++ /dev/null @@ -1,219 +0,0 @@ -""" -Side chain cleavage and fragment extraction for macrolactones. -""" -from rdkit import Chem -from rdkit.Chem import Descriptors, rdMolDescriptors -from typing import List, Tuple, Optional, Dict -from .ring_numbering import assign_ring_numbering, get_ring_position -from .fragment_dataclass import Fragment, MoleculeFragments - - -def identify_side_chains(mol: Chem.Mol, ring_atoms: List[int]) -> List[Tuple[int, int]]: - """ - Identify side chains attached to the ring. - - Args: - mol: RDKit molecule object - ring_atoms: List of atom indices in the 16-membered ring - - Returns: - List of tuples (ring_atom_idx, side_chain_first_atom_idx) - """ - side_chains = [] - ring_atom_set = set(ring_atoms) - - for ring_atom_idx in ring_atoms: - atom = mol.GetAtomWithIdx(ring_atom_idx) - - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - - # If neighbor is not in the ring, it's a side chain - if neighbor_idx not in ring_atom_set: - side_chains.append((ring_atom_idx, neighbor_idx)) - - return side_chains - - -def extract_side_chain_fragment(mol: Chem.Mol, ring_atom_idx: int, side_chain_start_idx: int, - ring_atoms: List[int]) -> Optional[str]: - """ - Extract a side chain fragment as a SMILES string. - - Args: - mol: RDKit molecule object - ring_atom_idx: Index of the ring atom where side chain is attached - side_chain_start_idx: Index of the first atom in the side chain - ring_atoms: List of all ring atom indices - - Returns: - SMILES string of the fragment, or None if extraction failed - """ - # Create an editable copy - emol = Chem.RWMol(mol) - - # Find the bond between ring and side chain - bond = mol.GetBondBetweenAtoms(ring_atom_idx, side_chain_start_idx) - if bond is None: - return None - - # Get all atoms in the side chain by BFS from side_chain_start_idx - # but not crossing into the ring - ring_atom_set = set(ring_atoms) - visited = set() - queue = [side_chain_start_idx] - side_chain_atoms = [] - - while queue: - current_idx = queue.pop(0) - if current_idx in visited: - continue - - visited.add(current_idx) - side_chain_atoms.append(current_idx) - - atom = mol.GetAtomWithIdx(current_idx) - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - - # Only continue into non-ring atoms - if neighbor_idx not in ring_atom_set and neighbor_idx not in visited: - queue.append(neighbor_idx) - - if not side_chain_atoms: - return None - - # Create a new molecule with only the side chain atoms - # We'll use Chem.RWMol to build it - fragment_mol = Chem.RWMol() - - # Map old atom indices to new ones - old_to_new = {} - - # Add atoms - for old_idx in side_chain_atoms: - atom = mol.GetAtomWithIdx(old_idx) - new_atom = Chem.Atom(atom.GetAtomicNum()) - new_atom.SetFormalCharge(atom.GetFormalCharge()) - new_atom.SetIsAromatic(atom.GetIsAromatic()) - new_idx = fragment_mol.AddAtom(new_atom) - old_to_new[old_idx] = new_idx - - # Add bonds - for old_idx in side_chain_atoms: - atom = mol.GetAtomWithIdx(old_idx) - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - - # Only add bonds within the side chain - if neighbor_idx in old_to_new and old_idx < neighbor_idx: - bond = mol.GetBondBetweenAtoms(old_idx, neighbor_idx) - fragment_mol.AddBond( - old_to_new[old_idx], - old_to_new[neighbor_idx], - bond.GetBondType() - ) - - # Convert to molecule and get SMILES - try: - fragment_mol = fragment_mol.GetMol() - Chem.SanitizeMol(fragment_mol) - fragment_smiles = Chem.MolToSmiles(fragment_mol) - return fragment_smiles - except: - # If sanitization fails, return None - return None - - -def cleave_side_chains(mol: Chem.Mol, parent_smiles: str, parent_id: str, - numbering: Dict[int, int]) -> MoleculeFragments: - """ - Cleave all side chains from a macrolactone and create fragment records. - - Args: - mol: RDKit molecule object - parent_smiles: SMILES string of the parent molecule - parent_id: Identifier for the parent molecule (e.g., "macro_0") - numbering: Ring numbering dictionary from assign_ring_numbering - - Returns: - MoleculeFragments object containing all fragments - """ - mol_fragments = MoleculeFragments( - parent_id=parent_id, - parent_smiles=parent_smiles, - ring_numbering={str(k): v for k, v in numbering.items()} # Convert to str keys for JSON - ) - - ring_atoms = list(numbering.keys()) - side_chains = identify_side_chains(mol, ring_atoms) - - fragment_counter = 0 - - for ring_atom_idx, side_chain_start_idx in side_chains: - # Get the ring position - cleavage_position = get_ring_position(numbering, ring_atom_idx) - - if cleavage_position is None: - continue - - # Extract the fragment - fragment_smiles = extract_side_chain_fragment( - mol, ring_atom_idx, side_chain_start_idx, ring_atoms - ) - - if fragment_smiles is None: - continue - - # Create fragment molecule for properties - frag_mol = Chem.MolFromSmiles(fragment_smiles) - if frag_mol is None: - continue - - # Create Fragment object - fragment = Fragment( - fragment_smiles=fragment_smiles, - parent_smiles=parent_smiles, - cleavage_position=cleavage_position, - fragment_id=f"{parent_id}_frag_{fragment_counter}", - parent_id=parent_id, - atom_count=frag_mol.GetNumAtoms(), - molecular_weight=round(Descriptors.MolWt(frag_mol), 2) - ) - - mol_fragments.add_fragment(fragment) - fragment_counter += 1 - - return mol_fragments - - -def process_molecule(smiles: str, molecule_idx: int) -> Optional[MoleculeFragments]: - """ - Process a single molecule: assign numbering and cleave side chains. - - Args: - smiles: SMILES string of the molecule - molecule_idx: Index number for naming (e.g., 0 -> "macro_0") - - Returns: - MoleculeFragments object or None if processing failed - """ - parent_id = f"macro_{molecule_idx}" - - # Parse molecule - mol = Chem.MolFromSmiles(smiles) - if mol is None: - print(f"Error: Could not parse SMILES for {parent_id}") - return None - - # Assign ring numbering - numbering = assign_ring_numbering(mol) - if not numbering: - print(f"Error: Could not assign ring numbering for {parent_id}") - return None - - # Cleave side chains - mol_fragments = cleave_side_chains(mol, smiles, parent_id, numbering) - - return mol_fragments - diff --git a/src/fragment_dataclass.py b/src/fragment_dataclass.py deleted file mode 100644 index 19ec446..0000000 --- a/src/fragment_dataclass.py +++ /dev/null @@ -1,95 +0,0 @@ -""" -Fragment dataclass for storing side chain information. -""" -from dataclasses import dataclass, field -from typing import List, Optional -from dataclasses_json import dataclass_json -import json - - -@dataclass_json -@dataclass -class Fragment: - """ - Data class to store fragment information from macrocycle side chain cleavage. - - Attributes: - fragment_smiles: SMILES string of the cleaved fragment (contains dummy atom '*' - at the attachment point for molzip compatibility) - parent_smiles: SMILES string of the parent molecule - cleavage_position: Position on the ring where cleavage occurred (1-N, where N is ring size) - fragment_id: Unique identifier for this fragment - parent_id: Identifier of the parent molecule (e.g., "macro_0") - atom_count: Number of atoms in the fragment - molecular_weight: Molecular weight of the fragment - - Note: - The fragment_smiles contains a dummy atom (*) at the attachment point, which allows - for easy reconstruction using RDKit's molzip functionality. This format is compatible - with SQLModel and PostgreSQL storage while maintaining chemical information. - """ - fragment_smiles: str - parent_smiles: str - cleavage_position: int - fragment_id: str - parent_id: str - atom_count: Optional[int] = None - molecular_weight: Optional[float] = None - - def to_json_file(self, filepath: str): - """Save fragment to JSON file.""" - with open(filepath, 'w') as f: - json.dump(self.to_dict(), f, indent=2, ensure_ascii=False) - - @classmethod - def from_json_file(cls, filepath: str): - """Load fragment from JSON file.""" - with open(filepath, 'r') as f: - data = json.load(f) - return cls.from_dict(data) - - -@dataclass_json -@dataclass -class MoleculeFragments: - """ - Collection of all fragments from a single macrocycle molecule. - - Attributes: - parent_id: Identifier of the parent molecule - parent_smiles: SMILES string of the parent molecule - ring_numbering: Dictionary mapping atom indices to ring positions (1-16) - fragments: List of Fragment objects - """ - parent_id: str - parent_smiles: str - ring_numbering: dict = field(default_factory=dict) - fragments: List[Fragment] = field(default_factory=list) - - def add_fragment(self, fragment: Fragment): - """Add a fragment to the collection.""" - self.fragments.append(fragment) - - def to_json_file(self, filepath: str): - """Save all fragments to JSON file.""" - with open(filepath, 'w') as f: - json.dump(self.to_dict(), f, indent=2, ensure_ascii=False) - - @classmethod - def from_json_file(cls, filepath: str): - """Load fragments from JSON file.""" - with open(filepath, 'r') as f: - data = json.load(f) - return cls.from_dict(data) - - def get_cleavage_positions(self) -> List[int]: - """Get list of all cleavage positions.""" - return [frag.cleavage_position for frag in self.fragments] - - def get_fragment_by_position(self, position: int) -> Optional[Fragment]: - """Get fragment by cleavage position.""" - for frag in self.fragments: - if frag.cleavage_position == position: - return frag - return None - diff --git a/src/macro_lactone_analyzer.py b/src/macro_lactone_analyzer.py deleted file mode 100644 index be980d1..0000000 --- a/src/macro_lactone_analyzer.py +++ /dev/null @@ -1,494 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -""" -@file :macro_lactone_analyzer.py -@Description: :大环内酯分析器 - 环数识别、验证和分类 - : - :主要功能: - 1. 识别分子中所有12-20元环的大小 - 2. 验证是否为有效的大环内酯(含有酯键) - 3. 支持单分子和批量分析 - 4. 分类为单环和桥环分子 - 5. 支持动态SMARTS模式匹配 - : -@Date :2025-11-09 (更新:2025-11-09) -@Author :参考 lyzeng 的设计 -@version :2.0 -""" - -from dataclasses import dataclass, field -from typing import List, Dict, Optional, Union, Tuple, Set -from pathlib import Path -import pandas as pd -from rdkit import Chem -from rdkit.Chem import Descriptors, Crippen, QED - -# 导入项目内的辅助函数 -from .ring_visualization import get_ring_atoms_by_size - - -@dataclass -class MacroLactoneAnalyzer: - """ - 大环内酯分析器 - 用于识别、验证和分类大环内酯分子。 - - 属性: - smiles_list (List[str]): SMILES字符串列表,可通过add_smiles方法添加。 - ester_patterns (List[str]): 用于识别酯键的SMARTS模式列表。 - """ - smiles_list: List[str] = field(default_factory=list) - ester_patterns: List[str] = field(default_factory=list) - - def __post_init__(self): - """初始化后设置默认的酯键SMARTS模式""" - if not self.ester_patterns: - self.ester_patterns = [ - "[C](=O)[O]", # 通用酯键 - "C(=O)O", # 简单酯键 - ] - - @staticmethod - def detect_ring_sizes(mol: Chem.Mol) -> List[int]: - """ - 识别分子中所有12-20元环的大小 - - Args: - mol: RDKit分子对象 - - Returns: - 环大小列表,按升序排列 - """ - if mol is None: - return [] - - ring_sizes = [] - ring_info = mol.GetRingInfo() - rings = ring_info.AtomRings() - - for ring in rings: - ring_size = len(ring) - if 12 <= ring_size <= 20: - ring_sizes.append(ring_size) - - # 去重并排序 - return sorted(list(set(ring_sizes))) - - @staticmethod - def has_ester_on_ring(mol: Chem.Mol, ring_atoms: List[int], ester_patterns: Optional[List[str]] = None) -> bool: - """ - 检查环上是否有酯键(大环内酯特征) - - Args: - mol: RDKit分子对象 - ring_atoms: 环原子索引列表 - ester_patterns: 可选的酯键SMARTS模式列表 - - Returns: - 是否为大环内酯 - """ - if mol is None or not ring_atoms: - return False - - ring_atoms_set = set(ring_atoms) - patterns = ester_patterns or ["[C](=O)[O]", "C(=O)O"] - - for pattern_str in patterns: - pattern = Chem.MolFromSmarts(pattern_str) - if pattern is None: - continue - - matches = mol.GetSubstructMatches(pattern) - for match in matches: - if len(match) >= 3: - carbonyl_c = match[0] - ester_o = match[2] - # 检查酯键原子是否在环上 - if carbonyl_c in ring_atoms_set and ester_o in ring_atoms_set: - return True - - return False - - @staticmethod - def is_valid_macrolactone(mol: Chem.Mol, size: int, ester_patterns: Optional[List[str]] = None) -> bool: - """ - 检查分子是否为指定大小的有效大环内酯 - - Args: - mol: RDKit分子对象 - size: 环大小 - ester_patterns: 可选的酯键SMARTS模式列表 - - Returns: - 是否为有效大环内酯 - """ - if mol is None: - return False - - # 找到指定大小的环 - ring_atoms = get_ring_atoms_by_size(mol, size) - if not ring_atoms: - return False - - # 检查环上是否有酯键 - return MacroLactoneAnalyzer.has_ester_on_ring(mol, ring_atoms, ester_patterns) - - @staticmethod - def analyze_smiles(smiles: str, - ring_range: range = range(12, 21), - ester_patterns: Optional[List[str]] = None) -> Optional[Dict[str, Union[int, List[int], bool]]]: - """ - 对单个SMILES字符串进行大环内酯检测和分析。 - - Args: - smiles: 分子的SMILES表示 - ring_range: 需要检测的环大小范围,默认为12至20 - ester_patterns: 可选的酯键SMARTS模式列表 - - Returns: - 包含分析结果的字典: - - 'ring_sizes': 检测到的所有环大小列表 - - 'valid_sizes': 有效大环内酯的环大小列表 - - 'is_macrolactone': 是否为大环内酯(布尔值) - - 'has_ester': 是否含有酯键(布尔值) - - 'is_bridge': 是否为桥环(布尔值) - 如果未匹配到任何大环内酯或解析失败则返回None - """ - if not isinstance(smiles, str): - raise TypeError("输入必须为SMILES字符串。") - - mol = Chem.MolFromSmiles(smiles) - if mol is None: - raise ValueError(f"无法解析SMILES: {smiles}") - - # 检测环大小 - ring_sizes = MacroLactoneAnalyzer.detect_ring_sizes(mol) - - # 过滤出有效的内酯环 - valid_sizes = [] - for size in ring_range: - if size in ring_sizes: - if MacroLactoneAnalyzer.is_valid_macrolactone(mol, size, ester_patterns): - valid_sizes.append(size) - - # 分析结果 - result = { - 'ring_sizes': ring_sizes, - 'valid_sizes': valid_sizes, - 'is_macrolactone': len(valid_sizes) > 0, - 'has_ester': any( - MacroLactoneAnalyzer.has_ester_on_ring( - mol, get_ring_atoms_by_size(mol, size), ester_patterns - ) for size in ring_sizes - ), - 'is_bridge': len(valid_sizes) > 1 - } - - return result if result['is_macrolactone'] else None - - @staticmethod - def dynamic_smarts_match(smiles: str, ring_size: int) -> Optional[List[int]]: - """ - 使用动态构造的SMARTS模式匹配大环内酯 - - 参考用户的analyze_smiles方法实现 - - Args: - smiles: SMILES字符串 - ring_size: 环大小 - - Returns: - 匹配到的原子索引列表,未匹配到则返回None - """ - mol = Chem.MolFromSmiles(smiles) - if mol is None: - return None - - # 动态构造SMARTS模式:[r{ring_size}]([#8][#6](=[#8])) - smarts = f'[r{ring_size}]([#8][#6](=[#8]))' - query = Chem.MolFromSmarts(smarts) - if query is None: - return None - - matches = mol.GetSubstructMatches(query) - return matches[0] if matches else None - - def get_single_ring_info(self, - smiles: str, - ring_range: range = range(12, 21), - ester_patterns: Optional[List[str]] = None) -> Optional[Dict[str, Union[int, List[int], bool]]]: - """ - 对单个SMILES字符串返回大环内酯的详细信息。 - - Args: - smiles: 分子的SMILES表示 - ring_range: 环大小检测范围,默认为12至20 - ester_patterns: 可选的酯键SMARTS模式列表 - - Returns: - 包含分析结果的字典;若未匹配到则返回None - """ - return self.analyze_smiles(smiles, ring_range, ester_patterns) - - def analyze_list(self, - smiles_list: Optional[List[str]] = None, - ring_range: range = range(12, 21), - ester_patterns: Optional[List[str]] = None) -> Dict[str, Union[int, List[str]]]: - """ - 对SMILES字符串列表进行统计分析。 - - Args: - smiles_list: SMILES字符串列表,如果为None则使用实例的smiles_list - ring_range: 环大小检测范围 - ester_patterns: 可选的酯键SMARTS模式列表 - - Returns: - 统计结果字典,包含: - - 'total': 总分子数 - - 'macrolactones': 大环内酯分子数 - - 'bridge_rings': 桥环分子数 - - 'ring_size_stats': 各环大小分子数统计 - - 'valid_molecules': 有效大环内酯的SMILES列表 - - 'bridge_molecules': 桥环分子的SMILES列表 - """ - target_list = smiles_list if smiles_list is not None else self.smiles_list - - if not target_list: - return {'total': 0, 'macrolactones': 0} - - stats = { - 'total': len(target_list), - 'macrolactones': 0, - 'bridge_rings': 0, - 'ring_size_stats': {size: 0 for size in ring_range}, - 'valid_molecules': [], - 'bridge_molecules': [], - 'failed_molecules': [] - } - - for smiles in target_list: - try: - result = self.analyze_smiles(smiles, ring_range, ester_patterns) - if result: - stats['macrolactones'] += 1 - if result['is_bridge']: - stats['bridge_rings'] += 1 - stats['bridge_molecules'].append(smiles) - else: - size = result['valid_sizes'][0] - stats['ring_size_stats'][size] += 1 - stats['valid_molecules'].append(smiles) - except Exception as e: - stats['failed_molecules'].append((smiles, str(e))) - - return stats - - def classify_molecules(self, - df: pd.DataFrame, - smiles_column: str = 'smiles', - id_column: Optional[str] = None, - ring_range: range = range(12, 21), - ester_patterns: Optional[List[str]] = None) -> Tuple[Dict[int, pd.DataFrame], pd.DataFrame]: - """ - 将DataFrame中的分子按环大小分类(参考notebook中的classify_molecules_by_ring_size函数) - - Args: - df: 包含SMILES的DataFrame - smiles_column: SMILES列名 - id_column: ID列名(可选) - ring_range: 环大小检测范围 - ester_patterns: 可选的酯键SMARTS模式列表 - - Returns: - Tuple of (ring_size_to_df_dict, ambiguous_df) - - ring_size_to_df_dict: 环大小 -> DataFrame映射 - - ambiguous_df: 有多个可能环数的分子DataFrame - """ - print(f"开始对 {len(df)} 个分子进行环数识别和分类...") - - # 初始化结果 - ring_size_dfs = {size: [] for size in ring_range} - ambiguous_molecules = [] - - stats = { - 'processed': 0, - 'single_ring': {size: 0 for size in ring_range}, - 'ambiguous': 0, - 'no_valid_ring': 0, - 'failed': 0 - } - - for idx, row in df.iterrows(): - smiles = row[smiles_column] - mol_id = row[id_column] if id_column else f"row_{idx}" - - try: - mol = Chem.MolFromSmiles(smiles) - if mol is None: - stats['no_valid_ring'] += 1 - continue - - # 检测环大小 - ring_sizes = self.detect_ring_sizes(mol) - - # 过滤出有效的内酯环 - valid_sizes = [] - for size in ring_sizes: - if self.is_valid_macrolactone(mol, size, ester_patterns): - valid_sizes.append(size) - - # 分类 - if len(valid_sizes) == 1: - # 单个有效环大小 - size = valid_sizes[0] - row_copy = row.copy() - row_copy['detected_ring_sizes'] = valid_sizes - row_copy['ring_size'] = size - ring_size_dfs[size].append(row_copy) - stats['single_ring'][size] += 1 - elif len(valid_sizes) > 1: - # 多个可能环大小(桥环情况) - row_copy = row.copy() - row_copy['detected_ring_sizes'] = valid_sizes - ambiguous_molecules.append(row_copy) - stats['ambiguous'] += 1 - else: - # 没有有效环 - stats['no_valid_ring'] += 1 - - except Exception as e: - stats['failed'] += 1 - print(f"警告:分子 {mol_id} 处理失败: {e}") - - stats['processed'] += 1 - - # 进度显示 - if stats['processed'] % 100 == 0: - print(f"已处理 {stats['processed']}/{len(df)} 个分子...") - - # 转换为DataFrame - ring_size_to_df = {} - for size in ring_range: - if ring_size_dfs[size]: - ring_size_to_df[size] = pd.DataFrame(ring_size_dfs[size]) - else: - ring_size_to_df[size] = pd.DataFrame() - - ambiguous_df = pd.DataFrame(ambiguous_molecules) if ambiguous_molecules else pd.DataFrame() - - # 打印统计结果 - print("\n" + "="*60) - print("环数识别和分类结果:") - print("="*60) - print(f"总处理分子数: {stats['processed']}") - print(f"无有效环分子: {stats['no_valid_ring']}") - print(f"桥环分子(多个环数): {stats['ambiguous']}") - if stats['failed'] > 0: - print(f"处理失败分子: {stats['failed']}") - print("\n各环大小分子数:") - for size in ring_range: - count = stats['single_ring'][size] - if count > 0: - print(f" {size}元环: {count} 个分子") - print("="*60) - - return ring_size_to_df, ambiguous_df - - def add_smiles(self, smiles: Union[str, List[str]]): - """ - 添加一个或多个SMILES字符串到当前分析器中。 - - Args: - smiles: 单个SMILES字符串或字符串列表 - - Raises: - TypeError: 如果输入既不是字符串也不是字符串列表 - """ - if isinstance(smiles, str): - self.smiles_list.append(smiles) - elif isinstance(smiles, list): - for s in smiles: - if not isinstance(s, str): - raise TypeError("列表中的每个元素必须为SMILES字符串。") - self.smiles_list.append(s) - else: - raise TypeError("输入必须为SMILES字符串或字符串列表。") - - def calculate_molecular_properties(self, smiles: str) -> Optional[Dict[str, float]]: - """ - 计算分子的基本性质 - - Args: - smiles: SMILES字符串 - - Returns: - 包含分子性质的字典,如果计算失败则返回None - """ - try: - mol = Chem.MolFromSmiles(smiles) - if mol is None: - return None - - return { - 'mol_weight': Descriptors.MolWt(mol), - 'logP': Crippen.MolLogP(mol), - 'qed': QED.qed(mol), - 'tpsa': Descriptors.TPSA(mol), - 'num_atoms': mol.GetNumAtoms() - } - except Exception as e: - print(f"警告:计算分子性质失败: {e}") - return None - - -# 示例用法和测试 -if __name__ == "__main__": - # 创建分析器实例 - analyzer = MacroLactoneAnalyzer() - - # 测试SMILES - test_smiles = [ - "O=C1CCCCCCCC(=O)OCC/C=C/C=C/1", # 16元环大环内酯 - "CC(=O)OC1=CC=CC=C1C(=O)O", # 非大环内酯 - ] - - # 单分子分析 - print("="*60) - print("单分子分析测试") - print("="*60) - for smiles in test_smiles: - result = analyzer.get_single_ring_info(smiles) - print(f"\nSMILES: {smiles[:50]}...") - if result: - print(f" ✓ 有效大环内酯") - print(f" 环大小: {result['valid_sizes']}") - print(f" 是否桥环: {result['is_bridge']}") - else: - print(f" ✗ 非大环内酯或无有效环") - - # 批量分析 - print("\n" + "="*60) - print("批量分析测试") - print("="*60) - analyzer.add_smiles(test_smiles) - stats = analyzer.analyze_list() - print(f"\n统计结果:") - print(f" 总分子数: {stats['total']}") - print(f" 大环内酯: {stats['macrolactones']}") - print(f" 桥环: {stats['bridge_rings']}") - print(f" 环大小统计: {stats['ring_size_stats']}") - - # DataFrame分类测试 - print("\n" + "="*60) - print("DataFrame分类测试") - print("="*60) - test_df = pd.DataFrame({ - 'ID': ['mol1', 'mol2', 'mol3'], - 'smiles': test_smiles + ["O=C1CCCCCCCC(=O)OCC/C=C/C=C/1"] - }) - - ring_size_dfs, ambiguous_df = analyzer.classify_molecules(test_df, smiles_column='smiles', id_column='ID') - print(f"\n分类结果:") - for size, df_size in ring_size_dfs.items(): - if not df_size.empty: - print(f" {size}元环: {len(df_size)} 个分子") - print(f"桥环分子: {len(ambiguous_df)} 个") diff --git a/src/macro_lactone_toolkit/__init__.py b/src/macro_lactone_toolkit/__init__.py new file mode 100644 index 0000000..133d893 --- /dev/null +++ b/src/macro_lactone_toolkit/__init__.py @@ -0,0 +1,25 @@ +from .analyzer import MacroLactoneAnalyzer +from .errors import ( + AmbiguousMacrolactoneError, + FragmentationError, + MacrolactoneDetectionError, + MacrolactoneError, + RingNumberingError, +) +from .fragmenter import MacrolactoneFragmenter +from .models import FragmentationResult, RingNumberingResult, SideChainFragment + +__all__ = [ + "AmbiguousMacrolactoneError", + "FragmentationError", + "FragmentationResult", + "MacroLactoneAnalyzer", + "MacrolactoneDetectionError", + "MacrolactoneError", + "MacrolactoneFragmenter", + "RingNumberingError", + "RingNumberingResult", + "SideChainFragment", +] + +__version__ = "0.1.0" diff --git a/src/macro_lactone_toolkit/_core.py b/src/macro_lactone_toolkit/_core.py new file mode 100644 index 0000000..7fbee07 --- /dev/null +++ b/src/macro_lactone_toolkit/_core.py @@ -0,0 +1,243 @@ +from __future__ import annotations + +from collections import deque +from dataclasses import dataclass +from typing import Iterable + +from rdkit import Chem + +from .errors import MacrolactoneDetectionError, RingNumberingError +from .models import RingNumberingResult + + +VALID_RING_SIZES = tuple(range(12, 21)) + + +@dataclass(frozen=True) +class DetectedMacrolactone: + ring_size: int + ring_atoms: tuple[int, ...] + carbonyl_carbon_idx: int + ester_oxygen_idx: int + + +def ensure_mol(mol_input: str | Chem.Mol) -> tuple[Chem.Mol, str]: + if isinstance(mol_input, Chem.Mol): + mol = Chem.Mol(mol_input) + else: + mol = Chem.MolFromSmiles(mol_input) + if mol is None: + raise MacrolactoneDetectionError(f"Unable to parse SMILES: {mol_input}") + return mol, Chem.MolToSmiles(mol, isomericSmiles=True) + + +def find_macrolactone_candidates( + mol: Chem.Mol, + ring_size: int | None = None, +) -> list[DetectedMacrolactone]: + candidates: list[DetectedMacrolactone] = [] + seen: set[tuple[tuple[int, ...], int, int]] = set() + + for ring in mol.GetRingInfo().AtomRings(): + if len(ring) not in VALID_RING_SIZES: + continue + if ring_size is not None and len(ring) != ring_size: + continue + + ester_match = _find_ring_ester_atoms(mol, ring) + if ester_match is None: + continue + + carbonyl_carbon_idx, ester_oxygen_idx = ester_match + key = (tuple(sorted(ring)), carbonyl_carbon_idx, ester_oxygen_idx) + if key in seen: + continue + seen.add(key) + candidates.append( + DetectedMacrolactone( + ring_size=len(ring), + ring_atoms=tuple(ring), + carbonyl_carbon_idx=carbonyl_carbon_idx, + ester_oxygen_idx=ester_oxygen_idx, + ) + ) + + return sorted( + candidates, + key=lambda candidate: ( + candidate.ring_size, + tuple(sorted(candidate.ring_atoms)), + candidate.carbonyl_carbon_idx, + candidate.ester_oxygen_idx, + ), + ) + + +def build_numbering_result(mol: Chem.Mol, candidate: DetectedMacrolactone) -> RingNumberingResult: + ring_atoms = list(candidate.ring_atoms) + ring_atom_set = set(ring_atoms) + adjacency = { + atom_idx: sorted( + neighbor.GetIdx() + for neighbor in mol.GetAtomWithIdx(atom_idx).GetNeighbors() + if neighbor.GetIdx() in ring_atom_set + ) + for atom_idx in ring_atoms + } + + ordered_atoms = [candidate.carbonyl_carbon_idx, candidate.ester_oxygen_idx] + visited = set(ordered_atoms) + previous_atom = candidate.carbonyl_carbon_idx + current_atom = candidate.ester_oxygen_idx + + while len(ordered_atoms) < candidate.ring_size: + next_atoms = [ + atom_idx + for atom_idx in adjacency[current_atom] + if atom_idx != previous_atom and atom_idx not in visited + ] + if len(next_atoms) != 1: + raise RingNumberingError("Unable to determine a unique ring traversal order.") + next_atom = next_atoms[0] + ordered_atoms.append(next_atom) + visited.add(next_atom) + previous_atom, current_atom = current_atom, next_atom + + if candidate.carbonyl_carbon_idx not in adjacency[current_atom]: + raise RingNumberingError("Ring traversal did not close on the lactone carbonyl carbon.") + + position_to_atom = {position: atom_idx for position, atom_idx in enumerate(ordered_atoms, start=1)} + atom_to_position = {atom_idx: position for position, atom_idx in position_to_atom.items()} + + return RingNumberingResult( + ring_size=candidate.ring_size, + ring_atoms=sorted(ring_atoms), + ordered_atoms=ordered_atoms, + carbonyl_carbon_idx=candidate.carbonyl_carbon_idx, + ester_oxygen_idx=candidate.ester_oxygen_idx, + atom_to_position=atom_to_position, + position_to_atom=position_to_atom, + ) + + +def collect_side_chain_atoms( + mol: Chem.Mol, + start_atom_idx: int, + ring_atom_indices: Iterable[int], +) -> list[int]: + ring_atom_set = set(ring_atom_indices) + queue: deque[int] = deque([start_atom_idx]) + visited: set[int] = set() + side_chain_atoms: list[int] = [] + + while queue: + atom_idx = queue.popleft() + if atom_idx in visited or atom_idx in ring_atom_set: + continue + visited.add(atom_idx) + side_chain_atoms.append(atom_idx) + + atom = mol.GetAtomWithIdx(atom_idx) + for neighbor in atom.GetNeighbors(): + neighbor_idx = neighbor.GetIdx() + if neighbor_idx not in visited and neighbor_idx not in ring_atom_set: + queue.append(neighbor_idx) + + return side_chain_atoms + + +def is_intrinsic_lactone_neighbor( + mol: Chem.Mol, + candidate: DetectedMacrolactone, + ring_atom_idx: int, + neighbor_idx: int, +) -> bool: + if ring_atom_idx != candidate.carbonyl_carbon_idx: + return False + bond = mol.GetBondBetweenAtoms(ring_atom_idx, neighbor_idx) + neighbor = mol.GetAtomWithIdx(neighbor_idx) + return bond is not None and bond.GetBondType() == Chem.BondType.DOUBLE and neighbor.GetAtomicNum() == 8 + + +def build_fragment_smiles( + mol: Chem.Mol, + side_chain_atoms: list[int], + side_chain_start_idx: int, + ring_atom_idx: int, + dummy_isotope: int, +) -> str: + fragment_mol = Chem.RWMol() + old_to_new: dict[int, int] = {} + + for old_idx in side_chain_atoms: + old_atom = mol.GetAtomWithIdx(old_idx) + new_atom = Chem.Atom(old_atom) + new_idx = fragment_mol.AddAtom(new_atom) + old_to_new[old_idx] = new_idx + + for old_idx in side_chain_atoms: + atom = mol.GetAtomWithIdx(old_idx) + for neighbor in atom.GetNeighbors(): + neighbor_idx = neighbor.GetIdx() + if neighbor_idx in old_to_new and old_idx < neighbor_idx: + bond = mol.GetBondBetweenAtoms(old_idx, neighbor_idx) + fragment_mol.AddBond( + old_to_new[old_idx], + old_to_new[neighbor_idx], + bond.GetBondType(), + ) + + bond_to_ring = mol.GetBondBetweenAtoms(ring_atom_idx, side_chain_start_idx) + if bond_to_ring is None: + raise RingNumberingError("Missing ring-to-side-chain bond during fragmentation.") + + dummy_atom = Chem.Atom(0) + dummy_atom.SetIsotope(dummy_isotope) + dummy_idx = fragment_mol.AddAtom(dummy_atom) + fragment_mol.AddBond( + old_to_new[side_chain_start_idx], + dummy_idx, + bond_to_ring.GetBondType(), + ) + + fragment = fragment_mol.GetMol() + Chem.SanitizeMol(fragment) + return Chem.MolToSmiles(fragment, isomericSmiles=True) + + +def _find_ring_ester_atoms(mol: Chem.Mol, ring_atoms: Iterable[int]) -> tuple[int, int] | None: + ring_atom_set = set(ring_atoms) + matches: list[tuple[int, int]] = [] + + for atom_idx in ring_atom_set: + atom = mol.GetAtomWithIdx(atom_idx) + if atom.GetAtomicNum() != 6: + continue + + has_carbonyl_oxygen = any( + bond.GetBondType() == Chem.BondType.DOUBLE and bond.GetOtherAtom(atom).GetAtomicNum() == 8 + for bond in atom.GetBonds() + ) + if not has_carbonyl_oxygen: + continue + + for neighbor in atom.GetNeighbors(): + neighbor_idx = neighbor.GetIdx() + bond = mol.GetBondBetweenAtoms(atom_idx, neighbor_idx) + if ( + neighbor_idx in ring_atom_set + and neighbor.GetAtomicNum() == 8 + and bond is not None + and bond.GetBondType() == Chem.BondType.SINGLE + ): + ring_neighbors = [ + ring_neighbor.GetIdx() + for ring_neighbor in neighbor.GetNeighbors() + if ring_neighbor.GetIdx() in ring_atom_set + ] + if len(ring_neighbors) == 2: + matches.append((atom_idx, neighbor_idx)) + + if not matches: + return None + return sorted(matches)[0] diff --git a/src/macro_lactone_toolkit/analyzer.py b/src/macro_lactone_toolkit/analyzer.py new file mode 100644 index 0000000..5158696 --- /dev/null +++ b/src/macro_lactone_toolkit/analyzer.py @@ -0,0 +1,27 @@ +from __future__ import annotations + +from rdkit import Chem + +from ._core import ensure_mol, find_macrolactone_candidates + + +class MacroLactoneAnalyzer: + """Identify valid 12-20 membered macrolactones.""" + + def get_valid_ring_sizes(self, mol_input: str | Chem.Mol) -> list[int]: + mol, _ = ensure_mol(mol_input) + candidates = find_macrolactone_candidates(mol) + return sorted({candidate.ring_size for candidate in candidates}) + + def analyze_molecule(self, mol_input: str | Chem.Mol) -> dict: + mol, smiles = ensure_mol(mol_input) + candidates = find_macrolactone_candidates(mol) + valid_ring_sizes = sorted({candidate.ring_size for candidate in candidates}) + is_ambiguous = len(valid_ring_sizes) > 1 or len(candidates) > 1 + return { + "smiles": smiles, + "valid_ring_sizes": valid_ring_sizes, + "candidate_count": len(candidates), + "is_ambiguous": is_ambiguous, + "selected_ring_size": valid_ring_sizes[0] if len(valid_ring_sizes) == 1 and len(candidates) == 1 else None, + } diff --git a/src/macro_lactone_toolkit/cli.py b/src/macro_lactone_toolkit/cli.py new file mode 100644 index 0000000..da6ff1c --- /dev/null +++ b/src/macro_lactone_toolkit/cli.py @@ -0,0 +1,160 @@ +from __future__ import annotations + +import argparse +import json +from dataclasses import asdict +from pathlib import Path + +import pandas as pd + +from .analyzer import MacroLactoneAnalyzer +from .errors import MacrolactoneError +from .fragmenter import MacrolactoneFragmenter + + +def main() -> None: + parser = build_parser() + args = parser.parse_args() + if not hasattr(args, "func"): + parser.print_help() + raise SystemExit(1) + args.func(args) + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(prog="macro-lactone-toolkit") + subparsers = parser.add_subparsers(dest="command") + + analyze = subparsers.add_parser("analyze") + _add_common_input_arguments(analyze) + analyze.add_argument("--ring-size", type=int, default=None) + analyze.set_defaults(func=run_analyze) + + number = subparsers.add_parser("number") + number.add_argument("--smiles", required=True) + number.add_argument("--ring-size", type=int, default=None) + number.set_defaults(func=run_number) + + fragment = subparsers.add_parser("fragment") + _add_common_input_arguments(fragment) + fragment.add_argument("--ring-size", type=int, default=None) + fragment.add_argument("--parent-id", default=None) + fragment.add_argument("--errors-output", default=None) + fragment.set_defaults(func=run_fragment) + + return parser + + +def run_analyze(args: argparse.Namespace) -> None: + analyzer = MacroLactoneAnalyzer() + if args.smiles: + payload = analyzer.analyze_molecule(args.smiles) + _write_output(payload, args.output) + return + + rows = _read_csv_rows(args.input, args.smiles_column, args.id_column) + payload = [] + for row in rows: + analysis = analyzer.analyze_molecule(row["smiles"]) + analysis["parent_id"] = row["parent_id"] + payload.append(analysis) + _write_output(payload, args.output) + + +def run_number(args: argparse.Namespace) -> None: + fragmenter = MacrolactoneFragmenter(ring_size=args.ring_size) + payload = fragmenter.number_molecule(args.smiles).to_dict() + _write_json(payload, None) + + +def run_fragment(args: argparse.Namespace) -> None: + fragmenter = MacrolactoneFragmenter(ring_size=args.ring_size) + + if args.smiles: + result = fragmenter.fragment_molecule(args.smiles, parent_id=args.parent_id) + _write_output(result.to_dict(), args.output) + return + + rows = _read_csv_rows(args.input, args.smiles_column, args.id_column) + fragment_rows: list[dict] = [] + error_rows: list[dict] = [] + + for row in rows: + try: + result = fragmenter.fragment_molecule(row["smiles"], parent_id=row["parent_id"]) + except MacrolactoneError as exc: + error_rows.append( + { + "parent_id": row["parent_id"], + "smiles": row["smiles"], + "error_type": type(exc).__name__, + "error_message": str(exc), + } + ) + continue + + for fragment in result.fragments: + fragment_rows.append( + { + "parent_id": result.parent_id, + "ring_size": result.ring_size, + **fragment.to_dict(), + } + ) + + if args.output: + _write_output(fragment_rows, args.output) + else: + _write_json({"fragments": fragment_rows, "errors": error_rows}, None) + + if args.errors_output: + _write_output(error_rows, args.errors_output) + + +def _add_common_input_arguments(parser: argparse.ArgumentParser) -> None: + group = parser.add_mutually_exclusive_group(required=True) + group.add_argument("--smiles") + group.add_argument("--input") + parser.add_argument("--smiles-column", default="smiles") + parser.add_argument("--id-column", default="id") + parser.add_argument("--output", default=None) + + +def _read_csv_rows(input_path: str, smiles_column: str, id_column: str) -> list[dict]: + dataframe = pd.read_csv(input_path) + rows = [] + for index, row in dataframe.iterrows(): + parent_id = row[id_column] if id_column in dataframe.columns else f"row_{index}" + rows.append( + { + "parent_id": str(parent_id), + "smiles": row[smiles_column], + } + ) + return rows + + +def _write_output(payload: list[dict] | dict, output_path: str | None) -> None: + if output_path is None: + _write_json(payload, None) + return + + path = Path(output_path) + if path.suffix.lower() == ".csv": + dataframe = pd.DataFrame(payload) + dataframe.to_csv(path, index=False) + return + + _write_json(payload, path) + + +def _write_json(payload: list[dict] | dict, output_path: Path | None) -> None: + text = json.dumps(payload, indent=2, ensure_ascii=False) + if output_path is None: + print(text) + else: + output_path.write_text(text + "\n", encoding="utf-8") + + +if __name__ == "__main__": + main() diff --git a/src/macro_lactone_toolkit/errors.py b/src/macro_lactone_toolkit/errors.py new file mode 100644 index 0000000..6102157 --- /dev/null +++ b/src/macro_lactone_toolkit/errors.py @@ -0,0 +1,18 @@ +class MacrolactoneError(Exception): + """Base class for macrolactone toolkit errors.""" + + +class MacrolactoneDetectionError(MacrolactoneError): + """Raised when a valid macrolactone cannot be detected.""" + + +class AmbiguousMacrolactoneError(MacrolactoneDetectionError): + """Raised when multiple valid macrolactone candidates are detected.""" + + +class RingNumberingError(MacrolactoneError): + """Raised when ring numbering cannot be assigned.""" + + +class FragmentationError(MacrolactoneError): + """Raised when side-chain fragmentation fails.""" diff --git a/src/macro_lactone_toolkit/fragmenter.py b/src/macro_lactone_toolkit/fragmenter.py new file mode 100644 index 0000000..11fb4f8 --- /dev/null +++ b/src/macro_lactone_toolkit/fragmenter.py @@ -0,0 +1,114 @@ +from __future__ import annotations + +from rdkit import Chem +from rdkit.Chem import Descriptors + +from ._core import ( + build_fragment_smiles, + build_numbering_result, + collect_side_chain_atoms, + ensure_mol, + find_macrolactone_candidates, + is_intrinsic_lactone_neighbor, +) +from .analyzer import MacroLactoneAnalyzer +from .errors import AmbiguousMacrolactoneError, FragmentationError, MacrolactoneDetectionError +from .models import FragmentationResult, RingNumberingResult, SideChainFragment + + +class MacrolactoneFragmenter: + """Number and fragment 12-20 membered macrolactones.""" + + def __init__(self, ring_size: int | None = None): + if ring_size is not None and not 12 <= ring_size <= 20: + raise ValueError("ring_size must be between 12 and 20") + self.ring_size = ring_size + self.analyzer = MacroLactoneAnalyzer() + + def number_molecule(self, mol_input: str | Chem.Mol) -> RingNumberingResult: + mol, _ = ensure_mol(mol_input) + candidate = self._select_candidate(mol) + return build_numbering_result(mol, candidate) + + def fragment_molecule( + self, + mol_input: str | Chem.Mol, + parent_id: str | None = None, + ) -> FragmentationResult: + mol, parent_smiles = ensure_mol(mol_input) + candidate = self._select_candidate(mol) + numbering = build_numbering_result(mol, candidate) + parent_id = parent_id or "molecule_0" + + ring_atom_set = set(candidate.ring_atoms) + fragments: list[SideChainFragment] = [] + + for position, ring_atom_idx in numbering.position_to_atom.items(): + ring_atom = mol.GetAtomWithIdx(ring_atom_idx) + for neighbor in ring_atom.GetNeighbors(): + neighbor_idx = neighbor.GetIdx() + if neighbor_idx in ring_atom_set: + continue + if is_intrinsic_lactone_neighbor(mol, candidate, ring_atom_idx, neighbor_idx): + continue + + side_chain_atoms = collect_side_chain_atoms(mol, neighbor_idx, ring_atom_set) + if not side_chain_atoms: + continue + + try: + fragment_smiles_labeled = build_fragment_smiles( + mol=mol, + side_chain_atoms=side_chain_atoms, + side_chain_start_idx=neighbor_idx, + ring_atom_idx=ring_atom_idx, + dummy_isotope=position, + ) + fragment_smiles_plain = build_fragment_smiles( + mol=mol, + side_chain_atoms=side_chain_atoms, + side_chain_start_idx=neighbor_idx, + ring_atom_idx=ring_atom_idx, + dummy_isotope=0, + ) + except Exception as exc: # pragma: no cover - sanitized by tests + raise FragmentationError(str(exc)) from exc + + plain_fragment = Chem.MolFromSmiles(fragment_smiles_plain) + if plain_fragment is None: + raise FragmentationError("Generated fragment_smiles_plain could not be read by RDKit.") + + fragment_id = f"{parent_id}_frag_{len(fragments)}" + fragments.append( + SideChainFragment( + parent_id=parent_id, + fragment_id=fragment_id, + cleavage_position=position, + attachment_atom_idx=ring_atom_idx, + fragment_smiles_labeled=fragment_smiles_labeled, + fragment_smiles_plain=fragment_smiles_plain, + atom_count=sum(1 for atom in plain_fragment.GetAtoms() if atom.GetAtomicNum() != 0), + molecular_weight=round(Descriptors.MolWt(plain_fragment), 4), + ) + ) + + return FragmentationResult( + parent_id=parent_id, + parent_smiles=parent_smiles, + ring_size=numbering.ring_size, + numbering=numbering, + fragments=fragments, + ) + + def _select_candidate(self, mol: Chem.Mol): + candidates = find_macrolactone_candidates(mol, ring_size=self.ring_size) + if not candidates: + requested = f"{self.ring_size}-membered " if self.ring_size is not None else "" + raise MacrolactoneDetectionError(f"No valid {requested}macrolactone was detected.") + + valid_ring_sizes = sorted({candidate.ring_size for candidate in candidates}) + if len(candidates) > 1 or len(valid_ring_sizes) > 1: + raise AmbiguousMacrolactoneError( + "Multiple valid macrolactone candidates were detected. Pass ring_size explicitly." + ) + return candidates[0] diff --git a/src/macro_lactone_toolkit/models.py b/src/macro_lactone_toolkit/models.py new file mode 100644 index 0000000..36d927c --- /dev/null +++ b/src/macro_lactone_toolkit/models.py @@ -0,0 +1,52 @@ +from __future__ import annotations + +from dataclasses import asdict, dataclass, field + + +@dataclass(frozen=True) +class RingNumberingResult: + ring_size: int + ring_atoms: list[int] + ordered_atoms: list[int] + carbonyl_carbon_idx: int + ester_oxygen_idx: int + atom_to_position: dict[int, int] + position_to_atom: dict[int, int] + + def to_dict(self) -> dict: + return asdict(self) + + +@dataclass(frozen=True) +class SideChainFragment: + parent_id: str + fragment_id: str + cleavage_position: int + attachment_atom_idx: int + fragment_smiles_labeled: str + fragment_smiles_plain: str + atom_count: int + molecular_weight: float + + def to_dict(self) -> dict: + return asdict(self) + + +@dataclass(frozen=True) +class FragmentationResult: + parent_id: str + parent_smiles: str + ring_size: int + numbering: RingNumberingResult + fragments: list[SideChainFragment] = field(default_factory=list) + warnings: list[str] = field(default_factory=list) + + def to_dict(self) -> dict: + return { + "parent_id": self.parent_id, + "parent_smiles": self.parent_smiles, + "ring_size": self.ring_size, + "numbering": self.numbering.to_dict(), + "fragments": [fragment.to_dict() for fragment in self.fragments], + "warnings": list(self.warnings), + } diff --git a/src/macro_lactone_toolkit/splicing/__init__.py b/src/macro_lactone_toolkit/splicing/__init__.py new file mode 100644 index 0000000..729fc3e --- /dev/null +++ b/src/macro_lactone_toolkit/splicing/__init__.py @@ -0,0 +1,9 @@ +from .engine import splice_molecule +from .fragment_prep import activate_fragment +from .scaffold_prep import prepare_macrolactone_scaffold + +__all__ = [ + "activate_fragment", + "prepare_macrolactone_scaffold", + "splice_molecule", +] diff --git a/src/macro_lactone_toolkit/splicing/engine.py b/src/macro_lactone_toolkit/splicing/engine.py new file mode 100644 index 0000000..aabefff --- /dev/null +++ b/src/macro_lactone_toolkit/splicing/engine.py @@ -0,0 +1,51 @@ +from __future__ import annotations + +from rdkit import Chem + + +def splice_molecule(scaffold: Chem.Mol, fragment: Chem.Mol, position: int) -> Chem.Mol: + combined = Chem.CombineMols(scaffold, fragment) + rwmol = Chem.RWMol(combined) + + scaffold_atom_count = scaffold.GetNumAtoms() + total_atoms = rwmol.GetNumAtoms() + + scaffold_dummy_idx = -1 + for atom_idx in range(scaffold_atom_count): + atom = rwmol.GetAtomWithIdx(atom_idx) + if atom.GetAtomicNum() == 0 and atom.GetIsotope() == position: + scaffold_dummy_idx = atom_idx + break + if scaffold_dummy_idx == -1: + raise ValueError(f"Scaffold dummy atom with isotope {position} not found") + + fragment_dummy_idx = -1 + for atom_idx in range(scaffold_atom_count, total_atoms): + atom = rwmol.GetAtomWithIdx(atom_idx) + if atom.GetAtomicNum() == 0: + fragment_dummy_idx = atom_idx + break + if fragment_dummy_idx == -1: + raise ValueError("Fragment does not contain a dummy atom") + + scaffold_dummy = rwmol.GetAtomWithIdx(scaffold_dummy_idx) + fragment_dummy = rwmol.GetAtomWithIdx(fragment_dummy_idx) + if scaffold_dummy.GetDegree() != 1: + raise ValueError("Scaffold dummy atom must have exactly one neighbor") + if fragment_dummy.GetDegree() != 1: + raise ValueError("Fragment dummy atom must have exactly one neighbor") + + scaffold_anchor_idx = scaffold_dummy.GetNeighbors()[0].GetIdx() + fragment_anchor_idx = fragment_dummy.GetNeighbors()[0].GetIdx() + fragment_bond = rwmol.GetBondBetweenAtoms(fragment_dummy_idx, fragment_anchor_idx) + if fragment_bond is None: + raise ValueError("Fragment dummy atom is not bonded to an anchor atom") + + rwmol.AddBond(scaffold_anchor_idx, fragment_anchor_idx, fragment_bond.GetBondType()) + + for atom_idx in sorted((scaffold_dummy_idx, fragment_dummy_idx), reverse=True): + rwmol.RemoveAtom(atom_idx) + + product = rwmol.GetMol() + Chem.SanitizeMol(product) + return product diff --git a/src/macro_lactone_toolkit/splicing/fragment_prep.py b/src/macro_lactone_toolkit/splicing/fragment_prep.py new file mode 100644 index 0000000..abc31b3 --- /dev/null +++ b/src/macro_lactone_toolkit/splicing/fragment_prep.py @@ -0,0 +1,49 @@ +from __future__ import annotations + +from rdkit import Chem + + +def activate_fragment(smiles: str, strategy: str = "smart") -> Chem.Mol: + mol = Chem.MolFromSmiles(smiles) + if mol is None: + raise ValueError(f"Invalid SMILES string: {smiles}") + + target_idx = -1 + + if strategy == "smart": + for smarts in ("[N;!H0]", "[O;H1]", "[S;H1]"): + pattern = Chem.MolFromSmarts(smarts) + if pattern is None: + continue + matches = mol.GetSubstructMatches(pattern) + if matches: + target_idx = matches[0][0] + break + if target_idx == -1: + strategy = "random" + + if strategy == "random": + carbon_candidates: list[int] = [] + other_candidates: list[int] = [] + for atom in mol.GetAtoms(): + if atom.GetTotalNumHs() == 0: + continue + if atom.GetAtomicNum() == 6: + carbon_candidates.append(atom.GetIdx()) + else: + other_candidates.append(atom.GetIdx()) + candidates = carbon_candidates or other_candidates + if not candidates: + raise ValueError("No suitable atoms with hydrogens found to attach to.") + target_idx = candidates[0] + + if target_idx == -1: + raise ValueError("Could not identify a target atom for activation.") + + rwmol = Chem.RWMol(mol) + dummy_idx = rwmol.AddAtom(Chem.Atom(0)) + rwmol.AddBond(target_idx, dummy_idx, Chem.BondType.SINGLE) + + result = rwmol.GetMol() + Chem.SanitizeMol(result) + return result diff --git a/src/macro_lactone_toolkit/splicing/scaffold_prep.py b/src/macro_lactone_toolkit/splicing/scaffold_prep.py new file mode 100644 index 0000000..7f4d30d --- /dev/null +++ b/src/macro_lactone_toolkit/splicing/scaffold_prep.py @@ -0,0 +1,63 @@ +from __future__ import annotations + +from typing import Iterable + +from rdkit import Chem + +from .._core import collect_side_chain_atoms, ensure_mol, find_macrolactone_candidates, is_intrinsic_lactone_neighbor +from ..fragmenter import MacrolactoneFragmenter + + +def prepare_macrolactone_scaffold( + mol_input: str | Chem.Mol, + positions: Iterable[int], + ring_size: int | None = None, +) -> tuple[Chem.Mol, dict[int, int]]: + positions = list(positions) + mol, _ = ensure_mol(mol_input) + fragmenter = MacrolactoneFragmenter(ring_size=ring_size) + numbering = fragmenter.number_molecule(mol) + candidate = find_macrolactone_candidates(mol, ring_size=numbering.ring_size)[0] + ring_atom_set = set(numbering.ordered_atoms) + + atoms_to_remove: set[int] = set() + dummy_specs: list[tuple[int, int, Chem.BondType]] = [] + + for position in positions: + if position not in numbering.position_to_atom: + raise ValueError(f"Position {position} not found in ring numbering.") + ring_atom_idx = numbering.position_to_atom[position] + ring_atom = mol.GetAtomWithIdx(ring_atom_idx) + for neighbor in ring_atom.GetNeighbors(): + neighbor_idx = neighbor.GetIdx() + if neighbor_idx in ring_atom_set: + continue + if is_intrinsic_lactone_neighbor(mol, candidate, ring_atom_idx, neighbor_idx): + continue + side_chain_atoms = collect_side_chain_atoms(mol, neighbor_idx, ring_atom_set) + atoms_to_remove.update(side_chain_atoms) + bond = mol.GetBondBetweenAtoms(ring_atom_idx, neighbor_idx) + if bond is not None: + dummy_specs.append((ring_atom_idx, position, bond.GetBondType())) + + if not any(spec_position == position for _, spec_position, _ in dummy_specs): + dummy_specs.append((ring_atom_idx, position, Chem.BondType.SINGLE)) + + rwmol = Chem.RWMol(mol) + for ring_atom_idx, position, bond_type in dummy_specs: + dummy_atom = Chem.Atom(0) + dummy_atom.SetIsotope(position) + dummy_idx = rwmol.AddAtom(dummy_atom) + rwmol.AddBond(ring_atom_idx, dummy_idx, bond_type) + + for atom_idx in sorted(atoms_to_remove, reverse=True): + rwmol.RemoveAtom(atom_idx) + + scaffold = rwmol.GetMol() + Chem.SanitizeMol(scaffold) + dummy_map = { + atom.GetIsotope(): atom.GetIdx() + for atom in scaffold.GetAtoms() + if atom.GetAtomicNum() == 0 and atom.GetIsotope() in positions + } + return scaffold, dummy_map diff --git a/src/macrolactone_fragmenter.py b/src/macrolactone_fragmenter.py deleted file mode 100644 index c730876..0000000 --- a/src/macrolactone_fragmenter.py +++ /dev/null @@ -1,856 +0,0 @@ -""" -Macrolactone Fragmenter - 大环内酯侧链断裂和分析器 - -专门用于 12-20 元环大环内酯的: -- 环原子编号(羰基C为1,酯键O为2) -- 侧链断裂(单个位置或全部) -- 分子可视化(带编号) -- 碎片可视化 -- 数据导出(JSON、DataFrame、数据库) - -Author: Auto-generated -Date: 2025-11-07 -""" - -from typing import Optional, Union, List, Tuple, Dict -from pathlib import Path -import pandas as pd -from rdkit import Chem -from rdkit.Chem import Descriptors, AllChem, Draw -from rdkit.Chem.Draw import rdMolDraw2D -from IPython.display import display, SVG - -from .fragment_dataclass import Fragment, MoleculeFragments -from .ring_visualization import ( - get_macrolactone_numbering, - identify_side_chains, - extract_side_chain_fragment, - cleave_side_chain_at_position, - draw_mol_with_ring_numbering -) - - -class MacrolactoneFragmenter: - """ - 大环内酯(Macrolactone)侧链断裂和分析器 - - 专门用于 12-20 元环大环内酯的: - - 环原子编号(羰基C为1,酯键O为2) - - 侧链断裂(单个位置或全部) - - 分子可视化(带编号) - - 碎片可视化 - - 数据导出(JSON、DataFrame、数据库) - - 使用场景: - >>> # 处理16元环大环内酯 - >>> fragmenter = MacrolactoneFragmenter(ring_size=16) - >>> result = fragmenter.process_molecule(smiles) - >>> - >>> # 可视化 - >>> svg = fragmenter.visualize_molecule(result.parent_smiles) - >>> display(SVG(svg)) - >>> - >>> # 保存结果 - >>> fragmenter.save_to_json(result, "output.json") - >>> - >>> # 批量处理 - >>> results = fragmenter.process_csv("molecules.csv", smiles_column="smiles") - - 编号规则: - - 位置1: 羰基碳(C=O中的C,在环上) - - 位置2: 酯键氧(环上的O) - - 位置3-N: 从位置2开始,按照O→C反方向顺序编号环上剩余原子 - - Attributes: - ring_size: 环的大小(12-20) - smarts_pattern: 用于识别大环内酯的SMARTS模式 - """ - - # 预设常用SMARTS模式(通用酯键模式适用于12-20元环) - DEFAULT_SMARTS = { - # 使用通用酯键模式,适用于所有12-20元环大环内酯 - # 格式:[ring_size][C](=O)[O] 表示环上的碳连接羰基和氧 - 12: "[C](=O)[O]", # 通用酯键模式 - 13: "[C](=O)[O]", # 通用酯键模式 - 14: "[C](=O)[O]", # 通用酯键模式 - 15: "[C](=O)[O]", # 通用酯键模式 - 16: "[C](=O)[O]", # 通用酯键模式(原模式:"O=C1CCCCCCCC(=O)OCC/C=C/C=C/1") - 17: "[C](=O)[O]", # 通用酯键模式 - 18: "[C](=O)[O]", # 通用酯键模式 - 19: "[C](=O)[O]", # 通用酯键模式 - 20: "[C](=O)[O]", # 通用酯键模式 - } - - def __init__( - self, - ring_size: int = 16, - custom_smarts: Optional[str] = None - ): - """ - 初始化大环内酯断裂器 - - Args: - ring_size: 环的大小(12-20),默认16 - custom_smarts: 自定义SMARTS模式(可选) - 如果不提供,使用预设模式 - - Raises: - ValueError: 如果ring_size不在12-20范围内 - ValueError: 如果custom_smarts无效 - ValueError: 如果ring_size没有预设SMARTS且未提供custom_smarts - """ - # 验证ring_size - if not (12 <= ring_size <= 20): - raise ValueError( - f"Ring size must be between 12 and 20, got {ring_size}" - ) - - self.ring_size = ring_size - - # 确定SMARTS模式 - if custom_smarts is not None: - # 使用自定义SMARTS - self.smarts_pattern = custom_smarts - # 验证SMARTS有效性 - test_mol = Chem.MolFromSmarts(self.smarts_pattern) - if test_mol is None: - raise ValueError( - f"Invalid SMARTS pattern: {self.smarts_pattern}" - ) - else: - # 使用预设SMARTS - if ring_size not in self.DEFAULT_SMARTS: - raise ValueError( - f"No default SMARTS pattern for ring size {ring_size}. " - f"Available sizes: {list(self.DEFAULT_SMARTS.keys())}. " - f"Please provide custom_smarts parameter." - ) - self.smarts_pattern = self.DEFAULT_SMARTS[ring_size] - - # ============ 核心处理方法 ============ - - def process_molecule( - self, - mol_input: Union[str, Chem.Mol], - parent_id: Optional[str] = None - ) -> MoleculeFragments: - """ - 一站式处理:编号 + 断裂所有侧链 - - Args: - mol_input: SMILES字符串或RDKit Mol对象 - parent_id: 可选的分子ID(如果不提供,自动生成) - - Returns: - MoleculeFragments对象,包含所有断裂的侧链 - - Raises: - ValueError: 如果分子解析失败 - ValueError: 如果不是有效的大环内酯(找不到环或酯键) - """ - # 解析输入 - mol, smiles = self._parse_mol_input(mol_input) - - # 自动生成parent_id - if parent_id is None: - parent_id = f"macro_{self.ring_size}_{id(mol)}" - - # 断裂所有侧链 - mol_fragments = self.cleave_all_sidechains(mol, parent_id) - - return mol_fragments - - def number_ring( - self, - mol: Chem.Mol - ) -> Tuple[List[int], Dict[int, int], int, int]: - """ - 对环原子进行编号 - - 编号规则: - - 位置1: 羰基碳(C=O中的C) - - 位置2: 酯键氧(环上的O) - - 位置3-N: 按照O→C反方向顺序编号 - - Args: - mol: RDKit分子对象 - - Returns: - Tuple of: - - ring_atoms: 环原子索引列表 - - ring_numbering: 原子索引到位置编号的映射 {atom_idx: position} - - carbonyl_carbon: 羰基碳原子索引 - - ester_oxygen: 酯键氧原子索引 - - Raises: - ValueError: 如果找不到指定大小的环 - ValueError: 如果找不到酯键(不是有效的大环内酯) - """ - ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen = \ - get_macrolactone_numbering(mol, ring_size=self.ring_size) - - if ring_numbering is None: - raise ValueError( - f"Failed to number the ring. " - f"This molecule may not be a valid {self.ring_size}-membered macrolactone. " - f"Expected to find a {self.ring_size}-membered ring with an ester bond." - ) - - return ring_atoms, ring_numbering, carbonyl_carbon, ester_oxygen - - def cleave_at_position( - self, - mol: Chem.Mol, - position: int, - parent_id: Optional[str] = None - ) -> List[Fragment]: - """ - 在指定位置断裂侧链 - - Args: - mol: RDKit分子对象 - position: 环上位置编号(1到ring_size) - parent_id: 母分子ID(可选) - - Returns: - 该位置的所有侧链碎片列表(通常0-1个Fragment) - 如果该位置没有侧链,返回空列表 - - Raises: - ValueError: 如果position超出范围 - ValueError: 如果分子编号失败 - """ - # 验证position - if not (1 <= position <= self.ring_size): - raise ValueError( - f"Position must be between 1 and {self.ring_size}, got {position}" - ) - - # 获取编号 - ring_atoms, ring_numbering, carbonyl_carbon, ester_oxygen = self.number_ring(mol) - - # 获取SMILES - parent_smiles = Chem.MolToSmiles(mol) - if parent_id is None: - parent_id = f"macro_{self.ring_size}_{id(mol)}" - - # 断裂指定位置 - fragment_smiles_list = cleave_side_chain_at_position( - mol, ring_atoms, ring_numbering, position - ) - - # 创建Fragment对象 - fragments = [] - for i, frag_smiles in enumerate(fragment_smiles_list): - if frag_smiles: - frag_mol = Chem.MolFromSmiles(frag_smiles) - if frag_mol: - fragment = Fragment( - fragment_smiles=frag_smiles, - parent_smiles=parent_smiles, - cleavage_position=position, - fragment_id=f"{parent_id}_pos{position}_{i}", - parent_id=parent_id, - atom_count=frag_mol.GetNumAtoms(), - molecular_weight=round(Descriptors.MolWt(frag_mol), 2) - ) - fragments.append(fragment) - - return fragments - - def cleave_all_sidechains( - self, - mol: Chem.Mol, - parent_id: Optional[str] = None, - skip_positions: Optional[List[int]] = None - ) -> MoleculeFragments: - """ - 断裂所有侧链(默认位置3到ring_size) - - Args: - mol: RDKit分子对象 - parent_id: 母分子ID(可选) - skip_positions: 要跳过的位置列表(默认跳过1和2,即酯键位置) - - Returns: - MoleculeFragments对象,包含所有断裂的侧链 - - Raises: - ValueError: 如果分子编号失败 - """ - # 获取编号 - ring_atoms, ring_numbering, carbonyl_carbon, ester_oxygen = self.number_ring(mol) - - # 获取SMILES - parent_smiles = Chem.MolToSmiles(mol) - if parent_id is None: - parent_id = f"macro_{self.ring_size}_{id(mol)}" - - # 创建MoleculeFragments对象 - mol_fragments = MoleculeFragments( - parent_id=parent_id, - parent_smiles=parent_smiles, - ring_numbering={str(k): v for k, v in ring_numbering.items()} - ) - - # 确定要跳过的位置(默认跳过1和2) - if skip_positions is None: - skip_positions = [1, 2] - - # 断裂所有侧链 - fragment_counter = 0 - for position in range(1, self.ring_size + 1): - if position in skip_positions: - continue - - fragment_smiles_list = cleave_side_chain_at_position( - mol, ring_atoms, ring_numbering, position - ) - - for frag_smiles in fragment_smiles_list: - if frag_smiles: - frag_mol = Chem.MolFromSmiles(frag_smiles) - if frag_mol: - fragment = Fragment( - fragment_smiles=frag_smiles, - parent_smiles=parent_smiles, - cleavage_position=position, - fragment_id=f"{parent_id}_frag_{fragment_counter}", - parent_id=parent_id, - atom_count=frag_mol.GetNumAtoms(), - molecular_weight=round(Descriptors.MolWt(frag_mol), 2) - ) - mol_fragments.add_fragment(fragment) - fragment_counter += 1 - - return mol_fragments - - # ============ 可视化方法 ============ - - def visualize_molecule( - self, - mol_input: Union[str, Chem.Mol], - show_numbering: bool = True, - size: Tuple[int, int] = (800, 800), - title: str = "", - save_path: Optional[Union[str, Path]] = None, - dpi: int = 600 - ) -> str: - """ - 可视化分子结构 - - Args: - mol_input: SMILES字符串或RDKit Mol对象 - show_numbering: 是否显示环原子编号 - size: 图像大小(宽,高) - title: 图像标题 - save_path: 保存PNG的路径(可选)。如果提供,保存PNG图片 - dpi: PNG图片的DPI(仅在save_path提供时有效) - - Returns: - SVG字符串(可在Jupyter中用display(SVG(...))显示) - - Raises: - ValueError: 如果分子解析失败 - ValueError: 如果分子编号失败(show_numbering=True时) - """ - # 解析输入 - mol, smiles = self._parse_mol_input(mol_input) - - if show_numbering: - # 获取编号 - ring_atoms, ring_numbering, _, _ = self.number_ring(mol) - - # 生成SVG - svg = draw_mol_with_ring_numbering( - mol, - ring_numbering=ring_numbering, - ring_atoms=ring_atoms, - size=size, - title=title - ) - else: - # 不显示编号,只高亮环原子 - try: - ring_atoms, _, _, _ = self.number_ring(mol) - - # 确保有2D坐标 - AllChem.Compute2DCoords(mol) - - drawer = rdMolDraw2D.MolDraw2DSVG(size[0], size[1]) - drawer.SetFontSize(6) - - draw_options = drawer.drawOptions() - draw_options.addAtomIndices = False - - # 高亮环原子 - highlight_atoms = list(ring_atoms) - atom_colors = {idx: (0.8, 0.9, 1.0) for idx in ring_atoms} - - drawer.DrawMolecule( - mol, - highlightAtoms=highlight_atoms, - highlightAtomColors=atom_colors - ) - - drawer.FinishDrawing() - svg = drawer.GetDrawingText() - except: - # 如果编号失败,只显示基本分子 - AllChem.Compute2DCoords(mol) - drawer = rdMolDraw2D.MolDraw2DSVG(size[0], size[1]) - drawer.DrawMolecule(mol) - drawer.FinishDrawing() - svg = drawer.GetDrawingText() - - # 如果提供了save_path,保存PNG - if save_path is not None: - self._save_png(mol, save_path, size, dpi, show_numbering) - - return svg - - def visualize_fragment( - self, - fragment: Fragment, - size: Tuple[int, int] = (400, 400), - save_path: Optional[Union[str, Path]] = None, - dpi: int = 600 - ) -> str: - """ - 可视化单个碎片 - - Args: - fragment: Fragment对象 - size: 图像大小 - save_path: 保存PNG的路径(可选) - dpi: PNG图片的DPI - - Returns: - SVG字符串 - - Raises: - ValueError: 如果碎片SMILES无效 - """ - frag_mol = Chem.MolFromSmiles(fragment.fragment_smiles) - if frag_mol is None: - raise ValueError(f"Invalid fragment SMILES: {fragment.fragment_smiles}") - - # 确保有2D坐标 - AllChem.Compute2DCoords(frag_mol) - - # 生成SVG - drawer = rdMolDraw2D.MolDraw2DSVG(size[0], size[1]) - drawer.SetFontSize(6) - - title = f"Position {fragment.cleavage_position}" - drawer.DrawMolecule(frag_mol) - drawer.FinishDrawing() - svg = drawer.GetDrawingText() - - # 如果提供了save_path,保存PNG - if save_path is not None: - save_path = Path(save_path) - save_path.parent.mkdir(parents=True, exist_ok=True) - - # 计算实际尺寸 - scale_factor = dpi / 72.0 - actual_width = int(size[0] * scale_factor) - actual_height = int(size[1] * scale_factor) - - drawer_png = rdMolDraw2D.MolDraw2DCairo(actual_width, actual_height) - font_size = max(6.0, min(40.0, 12.0 * scale_factor)) - drawer_png.SetFontSize(font_size) - - drawer_png.DrawMolecule(frag_mol) - drawer_png.FinishDrawing() - - png_data = drawer_png.GetDrawingText() - with open(save_path, 'wb') as f: - f.write(png_data) - - return svg - - def visualize_all_fragments( - self, - mol_fragments: MoleculeFragments, - max_display: int = 10, - save_dir: Optional[Union[str, Path]] = None, - dpi: int = 600 - ): - """ - 可视化所有碎片(在Jupyter中显示) - - Args: - mol_fragments: MoleculeFragments对象 - max_display: 最多显示的碎片数量(防止输出过多) - save_dir: 保存PNG图片的目录(可选)。如果提供,所有碎片图片会保存到此目录 - dpi: PNG图片的DPI - - Note: - 此方法会直接在Jupyter中显示图片,不返回值 - """ - fragments = mol_fragments.fragments[:max_display] - - if save_dir is not None: - save_dir = Path(save_dir) - save_dir.mkdir(parents=True, exist_ok=True) - - for i, frag in enumerate(fragments, 1): - print(f"\n{'='*60}") - print(f"Fragment {i}/{len(fragments)}") - print(f"Position: {frag.cleavage_position}") - print(f"SMILES: {frag.fragment_smiles}") - print(f"Atoms: {frag.atom_count}, MW: {frag.molecular_weight}") - print('='*60) - - # 确定保存路径 - frag_save_path = None - if save_dir is not None: - frag_save_path = save_dir / f"{frag.fragment_id}.png" - - # 生成SVG并显示 - svg = self.visualize_fragment(frag, save_path=frag_save_path, dpi=dpi) - display(SVG(svg)) - - if len(mol_fragments.fragments) > max_display: - print(f"\n... and {len(mol_fragments.fragments) - max_display} more fragments") - - # ============ 数据导出方法 ============ - - def save_to_json( - self, - mol_fragments: MoleculeFragments, - filepath: Union[str, Path] - ): - """ - 保存到JSON文件 - - Args: - mol_fragments: MoleculeFragments对象 - filepath: JSON文件路径 - """ - filepath = Path(filepath) - filepath.parent.mkdir(parents=True, exist_ok=True) - mol_fragments.to_json_file(str(filepath)) - - def to_dataframe( - self, - mol_fragments: MoleculeFragments - ) -> pd.DataFrame: - """ - 转换为DataFrame(方便分析) - - Args: - mol_fragments: MoleculeFragments对象 - - Returns: - 包含所有碎片信息的DataFrame,列包括: - - fragment_id - - parent_id - - parent_smiles - - cleavage_position - - fragment_smiles - - atom_count - - molecular_weight - """ - data = [] - for frag in mol_fragments.fragments: - data.append({ - 'fragment_id': frag.fragment_id, - 'parent_id': frag.parent_id, - 'parent_smiles': frag.parent_smiles, - 'cleavage_position': frag.cleavage_position, - 'fragment_smiles': frag.fragment_smiles, - 'atom_count': frag.atom_count, - 'molecular_weight': frag.molecular_weight - }) - - return pd.DataFrame(data) - - # ============ 批处理方法 ============ - - def process_csv( - self, - csv_path: Union[str, Path], - smiles_column: str = "smiles", - id_column: Optional[str] = "unique_id", - max_rows: Optional[int] = None - ) -> List[MoleculeFragments]: - """ - 批量处理CSV文件中的分子 - - Args: - csv_path: CSV文件路径 - smiles_column: SMILES列名 - id_column: ID列名(可选)。如果为None,自动生成ID - max_rows: 最多处理的行数(可选)。用于测试或限制处理量 - - Returns: - MoleculeFragments对象列表 - - Raises: - FileNotFoundError: 如果CSV文件不存在 - ValueError: 如果CSV中没有指定的列 - """ - csv_path = Path(csv_path) - if not csv_path.exists(): - raise FileNotFoundError(f"CSV file not found: {csv_path}") - - # 读取CSV - df = pd.read_csv(csv_path) - - # 验证列存在 - if smiles_column not in df.columns: - raise ValueError( - f"Column '{smiles_column}' not found in CSV. " - f"Available columns: {df.columns.tolist()}" - ) - - if id_column is not None and id_column not in df.columns: - raise ValueError( - f"Column '{id_column}' not found in CSV. " - f"Available columns: {df.columns.tolist()}" - ) - - # 限制行数 - if max_rows is not None: - df = df.head(max_rows) - - # 批量处理 - results = [] - failed_count = 0 - - for idx, row in df.iterrows(): - smiles = row[smiles_column] - mol_id = row[id_column] if id_column is not None else f"mol_{idx}" - - try: - mol_fragments = self.process_molecule(smiles, parent_id=mol_id) - results.append(mol_fragments) - except Exception as e: - failed_count += 1 - print(f"Failed to process {mol_id}: {e}") - continue - - print(f"\nProcessed {len(results)} molecules successfully") - if failed_count > 0: - print(f"Failed: {failed_count} molecules") - - return results - - def batch_to_dataframe( - self, - results: List[MoleculeFragments] - ) -> pd.DataFrame: - """ - 将批处理结果转换为单个DataFrame - - Args: - results: MoleculeFragments对象列表 - - Returns: - 包含所有分子、所有碎片的DataFrame - """ - all_data = [] - for mol_fragments in results: - df = self.to_dataframe(mol_fragments) - all_data.append(df) - - if all_data: - return pd.concat(all_data, ignore_index=True) - else: - return pd.DataFrame() - - # ============ 数据库接口(预留SQLModel)============ - - def save_to_database( - self, - mol_fragments: MoleculeFragments, - session # SQLModel Session - ): - """ - 保存到数据库(预留接口) - - 预期Schema: - - ```python - from sqlmodel import SQLModel, Field, Relationship - from typing import Optional, List - - class MacrolactoneParent(SQLModel, table=True): - \"\"\"母分子表\"\"\" - id: Optional[int] = Field(default=None, primary_key=True) - parent_id: str = Field(index=True, unique=True) - parent_smiles: str - ring_size: int - ring_numbering: str # JSON字符串 - - # 关系 - fragments: List["FragmentSidechain"] = Relationship(back_populates="parent") - - class FragmentSidechain(SQLModel, table=True): - \"\"\"侧链碎片表\"\"\" - id: Optional[int] = Field(default=None, primary_key=True) - fragment_id: str = Field(index=True) - parent_id: str = Field(foreign_key="macrolactoneparent.parent_id") - cleavage_position: int - fragment_smiles: str - atom_count: int - molecular_weight: float - - # 关系 - parent: Optional[MacrolactoneParent] = Relationship(back_populates="fragments") - ``` - - 使用示例: - ```python - from sqlmodel import create_engine, Session - - engine = create_engine("postgresql://user:pass@localhost/db") - - with Session(engine) as session: - fragmenter.save_to_database(mol_fragments, session) - ``` - - Args: - mol_fragments: MoleculeFragments对象 - session: SQLModel Session对象 - - TODO: 实现SQLModel集成 - """ - raise NotImplementedError( - "Database integration not yet implemented. " - "Will use SQLModel with schema: MacrolactoneParent + FragmentSidechain. " - "See method docstring for expected schema." - ) - - @classmethod - def load_from_database( - cls, - parent_id: str, - session # SQLModel Session - ) -> MoleculeFragments: - """ - 从数据库加载(预留接口) - - Args: - parent_id: 母分子ID - session: SQLModel Session对象 - - Returns: - MoleculeFragments对象 - - TODO: 实现SQLModel集成 - """ - raise NotImplementedError("Database loading not yet implemented") - - # ============ 私有辅助方法 ============ - - def _parse_mol_input( - self, - mol_input: Union[str, Chem.Mol] - ) -> Tuple[Chem.Mol, str]: - """ - 解析分子输入(SMILES或Mol对象) - - Args: - mol_input: SMILES字符串或RDKit Mol对象 - - Returns: - Tuple of (RDKit Mol对象, SMILES字符串) - - Raises: - ValueError: 如果输入无效 - """ - if isinstance(mol_input, str): - # SMILES字符串 - mol = Chem.MolFromSmiles(mol_input) - if mol is None: - raise ValueError(f"Invalid SMILES: {mol_input}") - smiles = mol_input - elif isinstance(mol_input, Chem.Mol): - # RDKit Mol对象 - mol = mol_input - smiles = Chem.MolToSmiles(mol) - else: - raise ValueError( - f"mol_input must be SMILES string or RDKit Mol object, " - f"got {type(mol_input)}" - ) - - return mol, smiles - - def _save_png( - self, - mol: Chem.Mol, - save_path: Union[str, Path], - size: Tuple[int, int], - dpi: int, - show_numbering: bool - ): - """ - 保存分子为PNG图片 - - Args: - mol: RDKit分子对象 - save_path: 保存路径 - size: 图像大小 - dpi: DPI - show_numbering: 是否显示编号 - """ - save_path = Path(save_path) - save_path.parent.mkdir(parents=True, exist_ok=True) - - # 计算实际尺寸 - scale_factor = dpi / 72.0 - actual_width = int(size[0] * scale_factor) - actual_height = int(size[1] * scale_factor) - - drawer = rdMolDraw2D.MolDraw2DCairo(actual_width, actual_height) - - # 设置字体大小 - font_size = max(6.0, min(40.0, 12.0 * scale_factor)) - drawer.SetFontSize(font_size) - - draw_options = drawer.drawOptions() - draw_options.addAtomIndices = False - - if show_numbering: - try: - ring_atoms, ring_numbering, _, _ = self.number_ring(mol) - - # 高亮环原子 - highlight_atoms = list(ring_atoms) - atom_colors = {idx: (0.8, 0.9, 1.0) for idx in ring_atoms} - - # 设置编号 - mol_copy = Chem.Mol(mol) - for atom_idx in ring_atoms: - if atom_idx in ring_numbering: - atom = mol_copy.GetAtomWithIdx(atom_idx) - atom.SetProp("atomNote", str(ring_numbering[atom_idx])) - - drawer.DrawMolecule( - mol_copy, - highlightAtoms=highlight_atoms, - highlightAtomColors=atom_colors - ) - except: - # 如果编号失败,只绘制基本分子 - drawer.DrawMolecule(mol) - else: - drawer.DrawMolecule(mol) - - drawer.FinishDrawing() - - # 保存PNG - png_data = drawer.GetDrawingText() - with open(save_path, 'wb') as f: - f.write(png_data) - - def __repr__(self) -> str: - return ( - f"MacrolactoneFragmenter(ring_size={self.ring_size}, " - f"smarts='{self.smarts_pattern[:50]}...')" - ) - diff --git a/src/ring_numbering.py b/src/ring_numbering.py deleted file mode 100644 index 6b76224..0000000 --- a/src/ring_numbering.py +++ /dev/null @@ -1,211 +0,0 @@ -""" -Ring numbering system for 16-membered macrolactones. -""" -from rdkit import Chem -from typing import Dict, Tuple, Optional, List - - -def find_lactone_carbon(mol: Chem.Mol) -> Optional[int]: - """ - Find the lactone carbonyl carbon in the 16-membered ring. - Uses SMARTS pattern: [r16;#8][r16;#6](=[#8]) to ensure both O and C are on the ring. - - Args: - mol: RDKit molecule object - - Returns: - Atom index of the lactone carbonyl carbon (C=O in the ring) - Returns None if not found - """ - # First, get the 16-membered ring atoms - ring_atoms = get_ring_atoms(mol) - if len(ring_atoms) == 0: - return None - - ring_atoms_set = set(ring_atoms) - - # SMARTS pattern to match ester group where both O and C are on the ring - # [r16;#8] = ring atom that is oxygen - # [r16;#6](=[#8]) = ring atom that is carbon with double bond to oxygen - lactone_pattern = Chem.MolFromSmarts("[r16;#8][r16;#6](=[#8])") - - if lactone_pattern is None: - return None - - matches = mol.GetSubstructMatches(lactone_pattern) - - if not matches: - return None - - # Get the first match - # matches[0] contains: (ring_O, ring_C, O_double) - # We want the carbon (index 1) which should be in the ring - for match in matches: - if len(match) >= 2: - carbonyl_carbon = match[1] # The carbonyl carbon - # Double check that it's in the ring (should be guaranteed by SMARTS, but verify) - if carbonyl_carbon in ring_atoms_set: - return carbonyl_carbon - - return None - - -def get_ring_atoms(mol: Chem.Mol) -> List[int]: - """ - Get all atoms in the 16-membered ring. - - Args: - mol: RDKit molecule object - - Returns: - List of atom indices in the 16-membered ring - """ - # Find atoms in 16-membered rings - # Note: Use r16 directly, not r{16} which is not supported in all RDKit versions - ring_pattern = Chem.MolFromSmarts("[r16]") - - if ring_pattern is None: - return [] - - matches = mol.GetSubstructMatches(ring_pattern) - - # Extract unique atom indices - ring_atoms = list(set([match[0] for match in matches])) - - return ring_atoms - - -def order_ring_atoms_clockwise(mol: Chem.Mol, ring_atoms: List[int], start_atom: int) -> List[int]: - """ - Order ring atoms in a clockwise manner starting from a specific atom. - - Args: - mol: RDKit molecule object - ring_atoms: List of atom indices in the ring - start_atom: Starting atom index - - Returns: - Ordered list of atom indices - """ - if start_atom not in ring_atoms: - raise ValueError("Start atom is not in the ring") - - ordered = [start_atom] - visited = {start_atom} - current = start_atom - - while len(ordered) < len(ring_atoms): - atom = mol.GetAtomWithIdx(current) - - # Find the next ring atom among neighbors - next_atom = None - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - if neighbor_idx in ring_atoms and neighbor_idx not in visited: - next_atom = neighbor_idx - break - - if next_atom is None: - # If we can't find a forward path, we might need to go backwards from start - if len(ordered) < len(ring_atoms): - # Try going backwards from the start - atom = mol.GetAtomWithIdx(start_atom) - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - if neighbor_idx in ring_atoms and neighbor_idx not in visited: - # Insert at the beginning - ordered.insert(0, neighbor_idx) - visited.add(neighbor_idx) - current = neighbor_idx - break - else: - break - else: - break - else: - ordered.append(next_atom) - visited.add(next_atom) - current = next_atom - - return ordered - - -def assign_ring_numbering(mol: Chem.Mol) -> Dict[int, int]: - """ - Assign fixed numbering (1-16) to atoms in the 16-membered ring. - Numbering starts from the lactone carbonyl carbon (position 1) and proceeds clockwise. - - Args: - mol: RDKit molecule object - - Returns: - Dictionary mapping atom index to ring position (1-16) - Returns empty dict if no valid ring found - """ - # Find the lactone carbon (will be position 1) - lactone_c = find_lactone_carbon(mol) - - if lactone_c is None: - print("Warning: Could not find lactone carbon") - return {} - - # Get all 16-ring atoms - ring_atoms = get_ring_atoms(mol) - - if len(ring_atoms) != 16: - print(f"Warning: Expected 16 ring atoms, found {len(ring_atoms)}") - # Still proceed if we found some ring atoms - if len(ring_atoms) == 0: - return {} - - # Order atoms clockwise starting from lactone carbon - ordered_atoms = order_ring_atoms_clockwise(mol, ring_atoms, lactone_c) - - # Create numbering dictionary - numbering = {} - for position, atom_idx in enumerate(ordered_atoms, start=1): - numbering[atom_idx] = position - - return numbering - - -def get_ring_position(numbering: Dict[int, int], atom_idx: int) -> Optional[int]: - """ - Get the ring position for a given atom index. - - Args: - numbering: Dictionary from assign_ring_numbering - atom_idx: Atom index - - Returns: - Ring position (1-16) or None if not in ring - """ - return numbering.get(atom_idx) - - -def validate_numbering(mol: Chem.Mol, numbering: Dict[int, int]) -> bool: - """ - Validate that the numbering is correct. - - Args: - mol: RDKit molecule object - numbering: Dictionary from assign_ring_numbering - - Returns: - True if valid, False otherwise - """ - if len(numbering) != 16: - return False - - # Check that positions are 1-16 - positions = set(numbering.values()) - if positions != set(range(1, 17)): - return False - - # Check that position 1 is the lactone carbon - lactone_c = find_lactone_carbon(mol) - if lactone_c is None or numbering.get(lactone_c) != 1: - return False - - return True - diff --git a/src/ring_visualization.py b/src/ring_visualization.py deleted file mode 100644 index 7dce6e1..0000000 --- a/src/ring_visualization.py +++ /dev/null @@ -1,1175 +0,0 @@ -""" -Ring numbering and visualization utilities for macrolactones. - -This module provides functions for: -1. Assigning ring numbering starting from C=O carbonyl carbon -2. Visualizing molecules with ring numbering in Jupyter notebooks -3. Supporting both general molecules and macrolactones (12-20 membered rings) -""" -from rdkit import Chem -from rdkit.Chem import Draw, AllChem -from rdkit.Chem.Draw import rdMolDraw2D -from IPython.display import SVG, display -from typing import Dict, Tuple, Optional, List, Union -from collections import deque -from pathlib import Path -import pandas as pd - -# 导入分析器类 -try: - from src.macro_lactone_analyzer import MacroLactoneAnalyzer -except ImportError: - # 如果导入失败,定义一个简单的替代类 - class MacroLactoneAnalyzer: - def is_valid_macrolactone(self, mol, ring_size): - # 简单检查是否有指定大小的环 - ring_atoms = get_ring_atoms_by_size(mol, ring_size) - return ring_atoms is not None - -# 原子序数映射,支持元素符号和原子序数字符串 -ATOMIC_NUMBERS = { - "H": 1, "C": 6, "N": 7, "O": 8, "S": 16, "P": 15, - "F": 9, "Cl": 17, "Br": 35, "I": 53 -} - -def normalize_atom_types(atom_types: Optional[List[str]]) -> Optional[List[int]]: - """ - 标准化原子类型列表,将元素符号转换为原子序数 - - Args: - atom_types: 原子类型列表,可以是元素符号["C", "N"]或原子序数["6", "7"] - - Returns: - 标准化的原子序数列表,如果输入为None则返回None - """ - if atom_types is None: - return None - - normalized = [] - for atom_type in atom_types: - if atom_type.isdigit(): - # 已经是原子序数字符串 - normalized.append(int(atom_type)) - elif atom_type.upper() in ATOMIC_NUMBERS: - # 元素符号,转换为原子序数 - normalized.append(ATOMIC_NUMBERS[atom_type.upper()]) - else: - raise ValueError(f"不支持的原子类型: {atom_type}") - - return normalized - -def validate_ring_atom_composition( - mol: Chem.Mol, - ring_atoms: List[int], - carbonyl_carbon_idx: int, - ester_oxygen_idx: int, - allowed_atom_types: Optional[List[int]] = None -) -> Tuple[bool, str]: - """ - 验证环原子组成是否符合要求 - - Args: - mol: RDKit分子对象 - ring_atoms: 环原子索引列表 - carbonyl_carbon_idx: 羰基碳索引(位置1) - ester_oxygen_idx: 酯氧索引(位置2) - allowed_atom_types: 允许的原子类型列表(原子序数) - - Returns: - (is_valid, reason): 是否有效及原因说明 - """ - if allowed_atom_types is None: - return True, "不限制原子类型" - - # 检查环中每个原子(除了酯键氧) - invalid_atoms = [] - - for atom_idx in ring_atoms: - # 跳过酯键氧原子(位置2) - if atom_idx == ester_oxygen_idx: - continue - - atom = mol.GetAtomWithIdx(atom_idx) - atomic_num = atom.GetAtomicNum() - - if atomic_num not in allowed_atom_types: - symbol = atom.GetSymbol() - invalid_atoms.append(f"{symbol}(索引:{atom_idx})") - - if invalid_atoms: - return False, f"环中包含不允许的原子类型: {', '.join(invalid_atoms)}" - - return True, f"环原子组成符合要求,只允许: {[Chem.GetPeriodicTable().GetElementSymbol(num) for num in allowed_atom_types]}" - - -def get_ring_atoms_by_size(mol: Chem.Mol, ring_size: int) -> Optional[List[int]]: - """ - Find atoms in a ring of specified size. - - Args: - mol: RDKit molecule object - ring_size: Size of the ring to find (e.g., 16 for 16-membered ring) - - Returns: - List of atom indices in the ring, or None if not found - """ - ring_info = mol.GetRingInfo() - rings = ring_info.AtomRings() - - for ring in rings: - if len(ring) == ring_size: - return list(ring) - - return None - - -def find_ester_smarts(mol: Chem.Mol, ring_atoms: List[int]) -> Tuple[Optional[List[int]], Optional[str]]: - """ - Find ester group atoms using SMARTS patterns. - - Args: - mol: RDKit molecule object - ring_atoms: List of atom indices in the ring - - Returns: - Tuple of (ester_atoms_list, pattern_string) or (None, None) if not found - """ - ring_atoms_set = set(ring_atoms) - - # Try different SMARTS patterns - patterns = [ - "[r16][C](=O)[O]", # Original pattern for 16-membered ring - "[C](=O)[O]", # General pattern - "C(=O)O", # Simple pattern - ] - - for pattern_str in patterns: - pattern = Chem.MolFromSmarts(pattern_str) - if pattern is None: - continue - - matches = mol.GetSubstructMatches(pattern) - for match in matches: - # Check if first atom is in the ring - if match[0] in ring_atoms_set: - return list(match), pattern_str - - return None, None - - -def get_carbonyl_carbon_in_ring(mol: Chem.Mol, ester_atoms: List[int], ring_atoms: List[int]) -> Optional[int]: - """ - Find the C=O carbonyl carbon atom in the ring. - - Args: - mol: RDKit molecule object - ester_atoms: List of ester group atom indices - ring_atoms: List of ring atom indices - - Returns: - Atom index of carbonyl carbon, or None if not found - """ - ring_atoms_set = set(ring_atoms) - - for idx in ester_atoms: - atom = mol.GetAtomWithIdx(idx) - # Find C atom in the ring - if atom.GetSymbol() == 'C' and idx in ring_atoms_set: - # Check if this C has a double bond to O (carbonyl feature) - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - bond = mol.GetBondBetweenAtoms(idx, neighbor_idx) - if bond.GetBondType() == Chem.BondType.DOUBLE and neighbor.GetSymbol() == 'O': - return idx - - return None - - -def get_ester_oxygen_in_ring(mol: Chem.Mol, ester_atoms: List[int], ring_atoms: List[int]) -> Optional[int]: - """ - Find the ester oxygen atom in the ring. - - Args: - mol: RDKit molecule object - ester_atoms: List of ester group atom indices - ring_atoms: List of ring atom indices - - Returns: - Atom index of ester oxygen in ring, or None if not found - """ - ring_atoms_set = set(ring_atoms) - - for idx in ester_atoms: - atom = mol.GetAtomWithIdx(idx) - if atom.GetSymbol() == 'O' and idx in ring_atoms_set: - return idx - - return None - - -def order_ring_atoms_from_start( - mol: Chem.Mol, - ring_atoms: List[int], - start_atom_idx: int, - target_atom_idx: Optional[int] = None -) -> List[int]: - """ - Order ring atoms starting from a specific atom, optionally prioritizing path to target atom. - - Args: - mol: RDKit molecule object - ring_atoms: List of ring atom indices - start_atom_idx: Starting atom index - target_atom_idx: Optional target atom index to prioritize path towards - - Returns: - Ordered list of atom indices - """ - ring_atoms_set = set(ring_atoms) - - if start_atom_idx not in ring_atoms_set: - return ring_atoms - - ordered = [start_atom_idx] - remaining = ring_atoms_set - {start_atom_idx} - current = start_atom_idx - - # If target atom specified, prioritize path towards it - if target_atom_idx and target_atom_idx in ring_atoms_set: - queue = deque([(start_atom_idx, [start_atom_idx])]) - visited = {start_atom_idx} - - while queue: - node, path = queue.popleft() - if node == target_atom_idx: - # Found path, add atoms in path - for atom_idx in path[1:]: # Skip first (already in ordered) - if atom_idx in remaining: - ordered.append(atom_idx) - remaining.remove(atom_idx) - break - - atom = mol.GetAtomWithIdx(node) - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - if neighbor_idx in ring_atoms_set and neighbor_idx not in visited: - visited.add(neighbor_idx) - queue.append((neighbor_idx, path + [neighbor_idx])) - - # Continue adding remaining ring atoms - while remaining: - atom = mol.GetAtomWithIdx(current) - found_next = False - - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - if neighbor_idx in remaining: - ordered.append(neighbor_idx) - remaining.remove(neighbor_idx) - current = neighbor_idx - found_next = True - break - - if not found_next: - if remaining: - next_atom = remaining.pop() - ordered.append(next_atom) - current = next_atom - - return ordered - - -def create_ring_numbering( - mol: Chem.Mol, - ring_atoms: List[int], - carbonyl_carbon_idx: int, - ester_oxygen_idx: int -) -> Tuple[Dict[int, int], List[int]]: - """ - Create ring numbering mapping starting from C=O carbonyl carbon. - - Args: - mol: RDKit molecule object - ring_atoms: List of ring atom indices - carbonyl_carbon_idx: Index of C=O carbonyl carbon (will be position 1) - ester_oxygen_idx: Index of ester oxygen in ring (for ordering direction) - - Returns: - Tuple of (numbering_dict, ordered_atoms_list) - numbering_dict: Maps atom index to position (1-N) - ordered_atoms_list: Ordered list of atom indices - """ - ordered_atoms = order_ring_atoms_from_start( - mol, ring_atoms, carbonyl_carbon_idx, ester_oxygen_idx - ) - numbering = {} - for i, atom_idx in enumerate(ordered_atoms, start=1): - numbering[atom_idx] = i - - return numbering, ordered_atoms - - -def get_macrolactone_numbering( - mol: Union[Chem.Mol, str], - ring_size: int = 16, - allowed_ring_atom_types: Optional[List[str]] = None -) -> Tuple[Optional[List[int]], Optional[Dict[int, int]], Optional[List[int]], Optional[int], Optional[int], Tuple[bool, str]]: - """ - Get ring numbering for a macrolactone molecule with optional atom type filtering. - - This function performs the complete numbering workflow: - 1. Find ring of specified size - 2. Find ester group - 3. Find carbonyl carbon and ester oxygen - 4. Validate ring atom composition (if filtering requested) - 5. Create numbering mapping - - Args: - mol: RDKit molecule object or SMILES string - ring_size: Size of the macrolactone ring (12-20, default 16) - allowed_ring_atom_types: Allowed atom types for ring atoms (excluding ester oxygen) - - None: No restriction (default behavior) - - ["C"]: Only carbon atoms allowed (except ester oxygen) - - ["C", "N"]: Carbon and nitrogen atoms allowed (except ester oxygen) - - Can use element symbols or atomic number strings: ["6", "7"] - - Returns: - Tuple of (ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen, (is_valid, validation_reason)) - Returns (None, None, None, None, None, (False, reason)) if numbering fails - """ - # Convert SMILES to molecule if needed - if isinstance(mol, str): - mol = Chem.MolFromSmiles(mol) - if mol is None: - return None, None, None, None, None, (False, "无法解析SMILES字符串") - - # 标准化原子类型 - try: - allowed_atom_numbers = normalize_atom_types(allowed_ring_atom_types) - except ValueError as e: - return None, None, None, None, None, (False, f"原子类型参数错误: {str(e)}") - - # Find ring of specified size - ring_atoms = get_ring_atoms_by_size(mol, ring_size) - if ring_atoms is None: - return None, None, None, None, None, (False, f"未找到{ring_size}元环") - - # Find ester group - ester_atoms, pattern = find_ester_smarts(mol, ring_atoms) - if ester_atoms is None: - return None, None, None, None, None, (False, "未在环中找到酯键") - - # Find carbonyl carbon and ester oxygen - carbonyl_carbon = get_carbonyl_carbon_in_ring(mol, ester_atoms, ring_atoms) - ester_oxygen = get_ester_oxygen_in_ring(mol, ester_atoms, ring_atoms) - - if carbonyl_carbon is None or ester_oxygen is None: - return None, None, None, None, None, (False, "无法识别羰基碳或酯氧原子") - - # 验证环原子组成(如果指定了允许的原子类型) - if allowed_atom_numbers is not None: - is_valid, validation_reason = validate_ring_atom_composition( - mol, ring_atoms, carbonyl_carbon, ester_oxygen, allowed_atom_numbers - ) - if not is_valid: - return None, None, None, None, None, (False, validation_reason) - else: - validation_reason = "不限制原子类型" - - # Create numbering - ring_numbering, ordered_atoms = create_ring_numbering( - mol, ring_atoms, carbonyl_carbon, ester_oxygen - ) - - return ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen, (True, validation_reason) - - -def get_macrolactone_numbering_legacy( - mol: Union[Chem.Mol, str], - ring_size: int = 16 -) -> Tuple[Optional[List[int]], Optional[Dict[int, int]], Optional[List[int]], Optional[int], Optional[int]]: - """ - 向后兼容版本的get_macrolactone_numbering函数 - - Args: - mol: RDKit molecule object or SMILES string - ring_size: Size of the macrolactone ring (12-20, default 16) - - Returns: - Tuple of (ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen) - Returns (None, None, None, None, None) if numbering fails - """ - ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen, (is_valid, _) = \ - get_macrolactone_numbering(mol, ring_size, allowed_ring_atom_types=None) - - if not is_valid: - return None, None, None, None, None - - return ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen - - -def is_pure_carbon_macrolactone(mol: Union[Chem.Mol, str], ring_size: int) -> Tuple[bool, str]: - """ - 便捷函数:检查是否为纯碳大环内酯(除酯键氧外只含碳原子) - - Args: - mol: RDKit molecule object or SMILES string - ring_size: Size of the macrolactone ring - - Returns: - (is_valid, reason): 是否为纯碳大环内酯及原因 - """ - _, _, _, _, _, (is_valid, reason) = get_macrolactone_numbering( - mol, ring_size, allowed_ring_atom_types=["C"] - ) - return is_valid, reason - - -def is_carbon_nitrogen_macrolactone(mol: Union[Chem.Mol, str], ring_size: int) -> Tuple[bool, str]: - """ - 便捷函数:检查是否为碳氮大环内酯(除酯键氧外只含碳和氮原子) - - Args: - mol: RDKit molecule object or SMILES string - ring_size: Size of the macrolactone ring - - Returns: - (is_valid, reason): 是否为碳氮大环内酯及原因 - """ - _, _, _, _, _, (is_valid, reason) = get_macrolactone_numbering( - mol, ring_size, allowed_ring_atom_types=["C", "N"] - ) - return is_valid, reason - - -def visualize_macrolactone_with_auto_coloring( - mol: Union[Chem.Mol, str], - ring_size: Optional[int] = None, - allowed_ring_atom_types: Optional[List[str]] = None, - size: Tuple[int, int] = (800, 800), - title: str = "", - return_svg: bool = False, - show_atom_labels: bool = True -) -> Union[str, None]: - """ - 可视化大环内酯,自动根据原子类型进行颜色编码 - - Args: - mol: RDKit分子对象或SMILES字符串 - ring_size: 环大小(12-20),如果为None则自动检测 - allowed_ring_atom_types: 允许的环原子类型(除酯氧外) - - None: 不限制,显示所有原子类型 - - ["C"]: 只允许碳原子(默认筛选条件) - - ["C", "N"]: 允许碳和氮原子 - size: 图像大小 - title: 图像标题 - return_svg: 是否返回SVG字符串 - show_atom_labels: 是否显示原子编号标签 - - Returns: - SVG字符串(如果return_svg=True)或显示图像 - """ - # 转换SMILES到分子对象 - if isinstance(mol, str): - mol_obj = Chem.MolFromSmiles(mol) - if mol_obj is None: - if return_svg: - return None - else: - print("❌ 无法解析SMILES字符串") - return None - else: - mol_obj = mol - - # 自动检测环大小(如果未指定) - if ring_size is None: - ring_sizes = [] - for size in range(12, 21): - ring_atoms = get_ring_atoms_by_size(mol_obj, size) - if ring_atoms: - # 检查是否为有效大环内酯 - analyzer = MacroLactoneAnalyzer() - if analyzer.is_valid_macrolactone(mol_obj, size): - ring_sizes.append(size) - - if not ring_sizes: - if return_svg: - return None - else: - print("❌ 未找到12-20元大环内酯") - return None - - # 使用找到的第一个环大小 - ring_size = ring_sizes[0] - if len(ring_sizes) > 1: - print(f"⚠️ 找到多个环大小: {ring_sizes},使用 {ring_size}") - - # 获取环编号信息 - ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen, (is_valid, validation_reason) = \ - get_macrolactone_numbering(mol_obj, ring_size, allowed_ring_atom_types) - - if not is_valid or not ring_atoms: - if return_svg: - return None - else: - print(f"❌ 无法获取环编号: {validation_reason}") - return None - - # 创建分子副本并设置原子标签 - mol_copy = Chem.Mol(mol_obj) - AllChem.Compute2DCoords(mol_copy) - - if show_atom_labels: - for atom_idx in ring_atoms: - if atom_idx in ring_numbering: - atom = mol_copy.GetAtomWithIdx(atom_idx) - atom.SetProp("atomNote", str(ring_numbering[atom_idx])) - - # 定义原子颜色方案 - def get_atom_color(symbol: str, is_ester_oxygen: bool = False) -> Tuple[float, float, float]: - """获取原子颜色""" - if is_ester_oxygen: - return (1.0, 0.4, 0.4) # 酯氧用深红色 - elif symbol == 'C': - return (0.7, 0.8, 1.0) # 碳用蓝色 - elif symbol == 'N': - return (1.0, 0.8, 1.0) # 氮用粉色 - elif symbol == 'O': - return (1.0, 0.7, 0.7) # 氧用红色 - elif symbol == 'S': - return (1.0, 1.0, 0.6) # 硫用黄色 - elif symbol == 'P': - return (0.8, 0.6, 1.0) # 磷用紫色 - else: - return (0.8, 1.0, 0.8) # 其他用绿色 - - # 设置原子颜色 - atom_colors = {} - atom_type_stats = {} - - for atom_idx in ring_atoms: - atom = mol_obj.GetAtomWithIdx(atom_idx) - symbol = atom.GetSymbol() - is_ester_oxygen = (atom_idx == ester_oxygen) - - # 设置颜色 - color = get_atom_color(symbol, is_ester_oxygen) - atom_colors[atom_idx] = color - - # 统计原子类型(不包括酯氧) - if not is_ester_oxygen: - atom_type_stats[symbol] = atom_type_stats.get(symbol, 0) + 1 - - # 绘制分子 - # 确保size是元组格式 - if isinstance(size, int): - size = (size, size) - elif isinstance(size, (list, tuple)) and len(size) == 1: - size = (size[0], size[0]) - elif not isinstance(size, (list, tuple)) or len(size) != 2: - size = (800, 800) # 默认大小 - - drawer = rdMolDraw2D.MolDraw2DSVG(int(size[0]), int(size[1])) - drawer.SetFontSize(12) # 设置合适的字体大小(最小为6) - - # 注意:某些RDKit版本不支持DrawTitle,暂时注释掉 - # if title: - # drawer.DrawTitle(title) - - drawer.DrawMolecule(mol_copy, - highlightAtoms=ring_atoms, - highlightAtomColors=atom_colors) - drawer.FinishDrawing() - svg = drawer.GetDrawingText() - - # 显示统计信息 - if not return_svg: - display(SVG(svg)) - - print(f"\n📊 分子信息:") - print(f" 环大小: {ring_size} 元环") - print(f" 羰基碳位置: {ring_numbering.get(carbonyl_carbon, 'N/A')} (深红色标记)") - print(f" 酯氧位置: {ring_numbering.get(ester_oxygen, 'N/A')} (深红色标记)") - print(f" 环原子组成: {atom_type_stats}") - print(f" 筛选条件: {validation_reason}") - - # 显示颜色说明 - print(f"\n🎨 颜色说明:") - print(f" 深红色: 酯键氧原子 (位置2)") - print(f" 蓝色: 碳原子") - print(f" 粉色: 氮原子") - print(f" 红色: 氧原子") - print(f" 黄色: 硫原子") - print(f" 紫色: 磷原子") - print(f" 绿色: 其他原子") - - return svg if return_svg else None - - -def batch_visualize_macrolactones( - data_file: Path, - ring_sizes: List[int] = None, - allowed_ring_atom_types: Optional[List[str]] = None, - max_examples_per_size: int = 3, - output_dir: Optional[Path] = None -) -> Dict[int, List[Dict]]: - """ - 批量可视化大环内酯分子 - - Args: - data_file: CSV数据文件路径 - ring_sizes: 要测试的环大小列表,默认为12-20 - allowed_ring_atom_types: 允许的环原子类型 - max_examples_per_size: 每种环大小最大示例数 - output_dir: 输出目录,如果指定则保存SVG文件 - - Returns: - 按环大小分组的可视化结果字典 - """ - if ring_sizes is None: - ring_sizes = list(range(12, 21)) - - print(f"🔍 开始批量可视化大环内酯") - print(f" 数据文件: {data_file}") - print(f" 环大小范围: {ring_sizes}") - print(f" 筛选条件: {allowed_ring_atom_types or '无限制'}") - - # 加载数据 - if not data_file.exists(): - print(f"❌ 数据文件不存在: {data_file}") - return {} - - df = pd.read_csv(data_file) - print(f"✓ 加载数据: {len(df)} 个分子") - - results = {} - - for ring_size in ring_sizes: - print(f"\n🔄 处理 {ring_size} 元环...") - - size_results = [] - found_count = 0 - - for idx, row in df.iterrows(): - if found_count >= max_examples_per_size: - break - - smiles = row.get('smiles', '') - if not smiles: - continue - - try: - # 测试分子 - mol = Chem.MolFromSmiles(smiles) - if not mol: - continue - - # 检查是否为指定大小的有效大环内酯 - analyzer = MacroLactoneAnalyzer() - if not analyzer.is_valid_macrolactone(mol, ring_size): - continue - - # 应用筛选条件(如果指定) - if allowed_ring_atom_types: - is_valid, validation_reason = get_macrolactone_numbering( - mol, ring_size, allowed_ring_atom_types - )[5] - if not is_valid: - continue - - # 获取详细信息 - ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen, (is_valid, validation_reason) = \ - get_macrolactone_numbering(mol, ring_size, allowed_ring_atom_types) - - if is_valid: - # 分析原子组成 - composition = analyze_ring_atom_composition(smiles, ring_size) - - result = { - 'index': idx, - 'smiles': smiles, - 'molecule_id': row.get('molecule_id', f'mol_{idx}'), - 'ring_size': ring_size, - 'composition': composition, - 'validation_reason': validation_reason, - 'carbonyl_carbon_pos': ring_numbering.get(carbonyl_carbon, 'N/A'), - 'ester_oxygen_pos': ring_numbering.get(ester_oxygen, 'N/A') - } - - size_results.append(result) - found_count += 1 - - print(f" ✓ 找到示例 {found_count}: 分子{idx} ({composition})") - - # 保存可视化(如果指定了输出目录) - if output_dir: - output_dir.mkdir(parents=True, exist_ok=True) - filename = f"ring{ring_size}_example{found_count}_mol{idx}.svg" - output_path = output_dir / filename - - svg = visualize_macrolactone_with_auto_coloring( - mol, ring_size, allowed_ring_atom_types, - return_svg=True, - title=f"{ring_size}-元环示例 {found_count}" - ) - - if svg: - with open(output_path, 'w', encoding='utf-8') as f: - f.write(svg) - print(f" 💾 保存到: {output_path}") - - except Exception as e: - print(f" ⚠️ 处理分子 {idx} 时出错: {str(e)}") - continue - - results[ring_size] = size_results - print(f" 📊 {ring_size} 元环: 找到 {len(size_results)} 个示例") - - # 显示总体统计 - print(f"\n📈 总体统计:") - total_found = sum(len(results[size]) for size in ring_sizes) - print(f" 总示例数: {total_found}") - - for ring_size in ring_sizes: - count = len(results[ring_size]) - print(f" {ring_size} 元环: {count} 个示例") - - return results - - -def test_all_ring_sizes_with_filtering( - filtered_data_dir: Path, - allowed_ring_atom_types: Optional[List[str]] = ["C"], # 默认只允许碳原子 - output_dir: Optional[Path] = None -) -> Dict[int, Dict]: - """ - 测试所有环大小(12-20)的筛选效果 - - Args: - filtered_data_dir: 包含filtered CSV文件的目录 - allowed_ring_atom_types: 允许的环原子类型 - output_dir: 输出目录 - - Returns: - 详细的测试结果 - """ - print(f"🧪 开始测试所有环大小的筛选效果") - print(f" 数据目录: {filtered_data_dir}") - print(f" 筛选条件: {allowed_ring_atom_types or '无限制'}") - - all_results = {} - - for ring_size in range(12, 21): - print(f"\n{'='*60}") - print(f"🔍 测试 {ring_size} 元环") - print(f"{'='*60}") - - # 查找对应的数据文件 - data_file = filtered_data_dir / f'macrolactone_ring{ring_size}_filtered.csv' - - if not data_file.exists(): - print(f"⚠️ 未找到 {ring_size} 元环数据文件: {data_file}") - all_results[ring_size] = { - 'data_file_exists': False, - 'total_molecules': 0, - 'valid_macrolactones': 0, - 'filtered_molecules': 0, - 'examples': [] - } - continue - - # 批量测试 - batch_results = batch_visualize_macrolactones( - data_file, - ring_sizes=[ring_size], - allowed_ring_atom_types=allowed_ring_atom_types, - max_examples_per_size=5, - output_dir=output_dir / f'ring{ring_size}_examples' if output_dir else None - ) - - # 统计结果 - size_results = batch_results.get(ring_size, []) - - # 加载完整数据进行统计 - try: - df = pd.read_csv(data_file) - total_molecules = len(df) - - # 统计有效大环内酯数量 - valid_count = 0 - filtered_count = 0 - - for _, row in df.iterrows(): - smiles = row.get('smiles', '') - if not smiles: - continue - - try: - mol = Chem.MolFromSmiles(smiles) - if mol: - analyzer = MacroLactoneAnalyzer() - if analyzer.is_valid_macrolactone(mol, ring_size): - valid_count += 1 - - if allowed_ring_atom_types: - is_valid, _ = get_macrolactone_numbering( - mol, ring_size, allowed_ring_atom_types - )[5] - if is_valid: - filtered_count += 1 - else: - filtered_count += 1 - except: - continue - - all_results[ring_size] = { - 'data_file_exists': True, - 'total_molecules': total_molecules, - 'valid_macrolactones': valid_count, - 'filtered_molecules': filtered_count, - 'filter_rate': filtered_count / valid_count * 100 if valid_count > 0 else 0, - 'examples': size_results - } - - print(f"\n📊 {ring_size} 元环统计:") - print(f" 总分子数: {total_molecules}") - print(f" 有效大环内酯: {valid_count}") - print(f" 通过筛选: {filtered_count}") - print(f" 筛选通过率: {filtered_count/valid_count*100:.1f}%" if valid_count > 0 else " 筛选通过率: 0%") - - except Exception as e: - print(f"❌ 统计 {ring_size} 元环时出错: {str(e)}") - all_results[ring_size] = { - 'data_file_exists': True, - 'error': str(e), - 'examples': size_results - } - - # 显示总体统计 - print(f"\n{'='*60}") - print(f"📈 所有环大小测试总结") - print(f"{'='*60}") - - total_molecules = sum(result.get('total_molecules', 0) for result in all_results.values()) - total_valid = sum(result.get('valid_macrolactones', 0) for result in all_results.values()) - total_filtered = sum(result.get('filtered_molecules', 0) for result in all_results.values()) - - print(f"总分子数: {total_molecules}") - print(f"总有效大环内酯: {total_valid}") - print(f"总通过筛选: {total_filtered}") - print(f"总体筛选通过率: {total_filtered/total_valid*100:.1f}%" if total_valid > 0 else "总体筛选通过率: 0%") - - return all_results - - -def analyze_ring_atom_composition(mol: Union[Chem.Mol, str], ring_size: int) -> Dict[str, int]: - # 首先获取环编号信息 - ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen, (is_valid, _) = \ - get_macrolactone_numbering(mol, ring_size) - - if not is_valid or not ring_atoms: - return {} - - # 转换SMILES到分子对象(如果需要) - if isinstance(mol, str): - mol = Chem.MolFromSmiles(mol) - if mol is None: - return {} - - # 统计原子类型 - composition = {} - - for atom_idx in ring_atoms: - # 跳过酯键氧原子 - if atom_idx == ester_oxygen: - continue - - atom = mol.GetAtomWithIdx(atom_idx) - symbol = atom.GetSymbol() - composition[symbol] = composition.get(symbol, 0) + 1 - - return composition - - -def draw_mol_with_ring_numbering( - mol: Union[Chem.Mol, str], - ring_numbering: Optional[Dict[int, int]] = None, - ring_atoms: Optional[List[int]] = None, - size: Tuple[int, int] = (800, 800), - title: str = "", - ring_size: int = 16, - return_svg: bool = False -) -> Optional[str]: - """ - Draw molecule with ring numbering displayed. - - This function can work in two modes: - 1. If ring_numbering is provided, use it directly - 2. If ring_numbering is None, automatically compute it for macrolactones - - Args: - mol: RDKit molecule object or SMILES string - ring_numbering: Optional pre-computed numbering dictionary - ring_atoms: Optional pre-computed ring atoms list - size: Image size (width, height) - title: Optional title for the image - ring_size: Ring size for auto-numbering (default 16) - return_svg: If True, return SVG string instead of displaying - - Returns: - SVG string if return_svg=True, None otherwise (displays in notebook) - """ - # Convert SMILES to molecule if needed - if isinstance(mol, str): - mol = Chem.MolFromSmiles(mol) - if mol is None: - print("Error: Could not parse SMILES") - return None - - # Auto-compute numbering if not provided - if ring_numbering is None: - ring_atoms, ring_numbering, _, _, _ = get_macrolactone_numbering(mol, ring_size) - if ring_numbering is None: - print("Error: Could not compute ring numbering") - return None - - # Get ring atoms if not provided - if ring_atoms is None: - ring_atoms = list(ring_numbering.keys()) - - # Create drawer - drawer = rdMolDraw2D.MolDraw2DSVG(size[0], size[1]) - drawer.SetFontSize(6) # Minimum font size is 6 - - draw_options = drawer.drawOptions() - draw_options.addAtomIndices = False - - # Highlight ring atoms - highlight_atoms = list(ring_atoms) - atom_colors = {} - for atom_idx in ring_atoms: - atom_colors[atom_idx] = (0.8, 0.9, 1.0) # Light blue - - # Create copy and set atom notes (ring numbering) - mol_copy = Chem.Mol(mol) - for atom_idx in ring_atoms: - if atom_idx in ring_numbering: - atom = mol_copy.GetAtomWithIdx(atom_idx) - atom.SetProp("atomNote", str(ring_numbering[atom_idx])) - - # Draw molecule - drawer.DrawMolecule( - mol_copy, - highlightAtoms=highlight_atoms, - highlightAtomColors=atom_colors - ) - - drawer.FinishDrawing() - svg = drawer.GetDrawingText() - - if return_svg: - return svg - else: - display(SVG(svg)) - return None - - -def visualize_molecule_with_numbering( - mol: Union[Chem.Mol, str], - ring_size: int = 16, - size: Tuple[int, int] = (800, 800), - title: str = "" -) -> Tuple[Optional[Dict[int, int]], Optional[List[int]]]: - """ - Convenience function to visualize a molecule with automatic ring numbering. - - This is the main function to use in Jupyter notebooks for quick visualization. - - Args: - mol: RDKit molecule object or SMILES string - ring_size: Ring size for macrolactones (default 16) - size: Image size (width, height) - title: Optional title - - Returns: - Tuple of (ring_numbering_dict, ring_atoms_list) - """ - # Get numbering - ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen = \ - get_macrolactone_numbering(mol, ring_size) - - if ring_numbering is None: - print("Error: Could not compute ring numbering") - return None, None - - # Display - print(f"Ring size: {ring_size}") - print(f"Carbonyl C position: {ring_numbering.get(carbonyl_carbon, 'N/A')}") - print(f"Ester O position: {ring_numbering.get(ester_oxygen, 'N/A')}") - print(f"Numbering range: 1-{len(ring_numbering)}") - - draw_mol_with_ring_numbering( - mol, ring_numbering, ring_atoms, size, title, ring_size - ) - - return ring_numbering, ring_atoms - - -def identify_side_chains(mol: Chem.Mol, ring_atoms: List[int]) -> List[Tuple[int, int]]: - """ - Identify side chains attached to the ring. - - Args: - mol: RDKit molecule object - ring_atoms: List of atom indices in the ring - - Returns: - List of tuples (ring_atom_idx, side_chain_first_atom_idx) - """ - side_chains = [] - ring_atom_set = set(ring_atoms) - - for ring_atom_idx in ring_atoms: - atom = mol.GetAtomWithIdx(ring_atom_idx) - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - # If neighbor is not in the ring, it's a side chain - if neighbor_idx not in ring_atom_set: - side_chains.append((ring_atom_idx, neighbor_idx)) - - return side_chains - - -def extract_side_chain_fragment( - mol: Chem.Mol, - ring_atom_idx: int, - side_chain_start_idx: int, - ring_atoms: List[int] -) -> Optional[str]: - """ - Extract a side chain fragment as a SMILES string with dummy atom (*) at attachment point. - - Args: - mol: RDKit molecule object - ring_atom_idx: Index of the ring atom where side chain is attached - side_chain_start_idx: Index of the first atom in the side chain - ring_atoms: List of all ring atom indices - - Returns: - SMILES string of the fragment (contains dummy atom *), or None if extraction failed - """ - ring_atom_set = set(ring_atoms) - visited = set() - queue = [side_chain_start_idx] - side_chain_atoms = [] - - # Use BFS to collect all atoms in the side chain - while queue: - current_idx = queue.pop(0) - if current_idx in visited: - continue - - visited.add(current_idx) - side_chain_atoms.append(current_idx) - - atom = mol.GetAtomWithIdx(current_idx) - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - # Only continue into non-ring atoms - if neighbor_idx not in ring_atom_set and neighbor_idx not in visited: - queue.append(neighbor_idx) - - if not side_chain_atoms: - return None - - # Get the bond type to the ring - bond_to_ring = mol.GetBondBetweenAtoms(ring_atom_idx, side_chain_start_idx) - if bond_to_ring is None: - return None - bond_type = bond_to_ring.GetBondType() - - # Create a new molecule with only the side chain atoms - fragment_mol = Chem.RWMol() - old_to_new = {} - - # Add atoms - for old_idx in side_chain_atoms: - atom = mol.GetAtomWithIdx(old_idx) - new_atom = Chem.Atom(atom.GetAtomicNum()) - new_atom.SetFormalCharge(atom.GetFormalCharge()) - new_atom.SetIsAromatic(atom.GetIsAromatic()) - new_idx = fragment_mol.AddAtom(new_atom) - old_to_new[old_idx] = new_idx - - # Add bonds (within side chain) - for old_idx in side_chain_atoms: - atom = mol.GetAtomWithIdx(old_idx) - for neighbor in atom.GetNeighbors(): - neighbor_idx = neighbor.GetIdx() - if neighbor_idx in old_to_new and old_idx < neighbor_idx: - bond = mol.GetBondBetweenAtoms(old_idx, neighbor_idx) - fragment_mol.AddBond( - old_to_new[old_idx], - old_to_new[neighbor_idx], - bond.GetBondType() - ) - - # Add dummy atom (atomic number 0, displays as * in SMILES) - # Dummy atom connects to the atom that was originally connected to the ring - attachment_point = old_to_new[side_chain_start_idx] - dummy_atom_idx = fragment_mol.AddAtom(Chem.Atom(0)) - - # Add bond between dummy atom and attachment point, keeping original bond type - fragment_mol.AddBond(attachment_point, dummy_atom_idx, bond_type) - - # Convert to molecule and get SMILES - try: - fragment_mol = fragment_mol.GetMol() - Chem.SanitizeMol(fragment_mol) - fragment_smiles = Chem.MolToSmiles(fragment_mol) - return fragment_smiles - except Exception as e: - return None - - -def cleave_side_chain_at_position( - mol: Chem.Mol, - ring_atoms: List[int], - ring_numbering: Dict[int, int], - position: int -) -> List[str]: - """ - Cleave side chains at a specified ring position. - - Args: - mol: RDKit molecule object - ring_atoms: List of ring atom indices - ring_numbering: Ring numbering dictionary mapping atom index to position - position: Position number to cleave (1-N, where N is ring size) - - Returns: - List of SMILES strings, one for each side chain (contains dummy atom *, - compatible with molzip for reconstruction) - """ - # Find the ring atom index for this position - ring_atom_idx = None - for atom_idx, pos in ring_numbering.items(): - if pos == position: - ring_atom_idx = atom_idx - break - - if ring_atom_idx is None: - return [] - - # Find all side chains attached to this ring atom - side_chains = identify_side_chains(mol, ring_atoms) - fragments = [] - - for ring_atom, side_start in side_chains: - if ring_atom == ring_atom_idx: - fragment_smiles = extract_side_chain_fragment(mol, ring_atom, side_start, ring_atoms) - if fragment_smiles: - fragments.append(fragment_smiles) - - return fragments - diff --git a/src/splicing/__init__.py b/src/splicing/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/src/splicing/engine.py b/src/splicing/engine.py deleted file mode 100644 index 3a4e3a9..0000000 --- a/src/splicing/engine.py +++ /dev/null @@ -1,79 +0,0 @@ -from rdkit import Chem - -def splice_molecule(scaffold: Chem.Mol, fragment: Chem.Mol, position: int) -> Chem.Mol: - """ - Connects a scaffold and a fragment by replacing a specific dummy atom on the scaffold - and a dummy atom on the fragment with a single bond. - - Args: - scaffold: The scaffold molecule containing labeled dummy atoms (e.g., [1*]). - fragment: The fragment molecule containing a dummy atom (*). - position: The isotope number of the dummy atom on the scaffold to attach to. - - Returns: - Chem.Mol: The spliced molecule. - - Raises: - ValueError: If the specified dummy atom is not found on the scaffold - or if the fragment does not contain a dummy atom. - """ - # 1. Combine molecules - # Note: CombineMols preserves atom indices of mol1 (scaffold), and appends mol2 (fragment). - # Atoms 0 to N-1 are scaffold, N to N+M-1 are fragment. - combined = Chem.CombineMols(scaffold, fragment) - rw_mol = Chem.RWMol(combined) - - scaffold_atom_count = scaffold.GetNumAtoms() - total_atoms = rw_mol.GetNumAtoms() - - # 2. Find Scaffold Dummy - scaffold_dummy_idx = -1 - for i in range(scaffold_atom_count): - atom = rw_mol.GetAtomWithIdx(i) - if atom.GetAtomicNum() == 0 and atom.GetIsotope() == position: - scaffold_dummy_idx = i - break - - if scaffold_dummy_idx == -1: - raise ValueError(f"Scaffold dummy atom with isotope {position} not found") - - # 3. Find Fragment Dummy - # We search in the fragment part (indices >= scaffold_atom_count) - fragment_dummy_idx = -1 - for i in range(scaffold_atom_count, total_atoms): - atom = rw_mol.GetAtomWithIdx(i) - # We assume any dummy atom in the fragment is the attachment point. - # Usually fragment has isotope 0 on its dummy. - if atom.GetAtomicNum() == 0: - fragment_dummy_idx = i - break - - if fragment_dummy_idx == -1: - raise ValueError("Fragment does not contain a dummy atom") - - # 4. Identify Neighbors (Anchors) - scaffold_dummy = rw_mol.GetAtomWithIdx(scaffold_dummy_idx) - if scaffold_dummy.GetDegree() != 1: - raise ValueError(f"Scaffold dummy atom at index {scaffold_dummy_idx} must have exactly one neighbor") - scaffold_anchor_idx = scaffold_dummy.GetNeighbors()[0].GetIdx() - - fragment_dummy = rw_mol.GetAtomWithIdx(fragment_dummy_idx) - if fragment_dummy.GetDegree() != 1: - raise ValueError(f"Fragment dummy atom at index {fragment_dummy_idx} must have exactly one neighbor") - fragment_anchor_idx = fragment_dummy.GetNeighbors()[0].GetIdx() - - # 5. Add Bond - rw_mol.AddBond(scaffold_anchor_idx, fragment_anchor_idx, Chem.BondType.SINGLE) - - # 6. Remove Dummy Atoms - # Remove the higher index first to preserve the lower index - # We know fragment_dummy_idx > scaffold_dummy_idx because fragment atoms were appended. - # However, just to be safe, we sort them. - indices_to_remove = sorted([scaffold_dummy_idx, fragment_dummy_idx], reverse=True) - for idx in indices_to_remove: - rw_mol.RemoveAtom(idx) - - # 7. Sanitize - Chem.SanitizeMol(rw_mol) - - return rw_mol.GetMol() diff --git a/src/splicing/fragment_prep.py b/src/splicing/fragment_prep.py deleted file mode 100644 index 788851d..0000000 --- a/src/splicing/fragment_prep.py +++ /dev/null @@ -1,87 +0,0 @@ -import random -from rdkit import Chem - -def activate_fragment(smiles: str, strategy: str = "smart") -> Chem.Mol: - """ - Convert a small molecule fragment into an attachable R-group by adding a dummy atom (*). - - Args: - smiles: SMILES string of the fragment. - strategy: 'smart' (prioritize heteroatoms) or 'random' (any atom with H). - - Returns: - Chem.Mol: The activated fragment with a dummy atom attached. - """ - mol = Chem.MolFromSmiles(smiles) - if mol is None: - raise ValueError(f"Invalid SMILES string: {smiles}") - - target_idx = -1 - - if strategy == "smart": - # Order of preference: Amine, Alcohol/Phenol, Thiol - # Amine: [N;!H0] - Nitrogen with at least one H - # Alcohol/Phenol: [O;H1] - Oxygen with 1 H (usually 2 bonds total) - # Thiol: [S;H1] - - smarts_patterns = [ - "[N;!H0]", # Primary/Secondary amine - "[O;H1]", # Alcohol/Phenol - "[S;H1]" # Thiol - ] - - for smarts in smarts_patterns: - pattern = Chem.MolFromSmarts(smarts) - if pattern: - matches = mol.GetSubstructMatches(pattern) - if matches: - # Pick the first match - target_idx = matches[0][0] - break - - if target_idx == -1: - # Fallback to random if no smart match found - strategy = "random" - - if strategy == "random": - # Find all atoms with at least one H - candidates = [] - carbon_candidates = [] - - for atom in mol.GetAtoms(): - # GetTotalNumHs includes implicit and explicit Hs - if atom.GetTotalNumHs() > 0: - candidates.append(atom.GetIdx()) - if atom.GetSymbol() == 'C': - carbon_candidates.append(atom.GetIdx()) - - if not candidates: - raise ValueError("No suitable atoms with hydrogens found to attach to.") - - # Prefer Carbon atoms if available - if carbon_candidates: - # Pick the first one for deterministic behavior - target_idx = carbon_candidates[0] - else: - target_idx = candidates[0] - - if target_idx == -1: - # Should be caught by the candidates check, but just in case - raise ValueError("Could not identify a target atom for activation.") - - # Perform attachment - rwmol = Chem.RWMol(mol) - - # Add dummy atom - dummy_idx = rwmol.AddAtom(Chem.Atom('*')) - - # Add bond to target atom - rwmol.AddBond(target_idx, dummy_idx, Chem.BondType.SINGLE) - - # Sanitize to fix implicit H counts and ensure validity - try: - Chem.SanitizeMol(rwmol) - except Exception as e: - raise ValueError(f"Failed to sanitize molecule after activation: {e}") - - return rwmol.GetMol() diff --git a/src/splicing/scaffold_prep.py b/src/splicing/scaffold_prep.py deleted file mode 100644 index 3ce30d9..0000000 --- a/src/splicing/scaffold_prep.py +++ /dev/null @@ -1,123 +0,0 @@ -from rdkit import Chem -from typing import List, Dict, Tuple, Set, Optional -from src.ring_numbering import assign_ring_numbering - -def get_subtree_indices(mol: Chem.Mol, start_atom_idx: int, forbidden_idx: int) -> Set[int]: - """ - Get all atom indices in the subtree starting at start_atom_idx, - moving away from forbidden_idx. - Used to identify side chain atoms to remove. - """ - subtree = set() - stack = [start_atom_idx] - - while stack: - current = stack.pop() - if current in subtree: - continue - subtree.add(current) - - atom = mol.GetAtomWithIdx(current) - for neighbor in atom.GetNeighbors(): - n_idx = neighbor.GetIdx() - # Traverse neighbors except the one leading back to the ring (forbidden) - # and those already visited - if n_idx != forbidden_idx and n_idx not in subtree: - stack.append(n_idx) - - return subtree - -def prepare_tylosin_scaffold(smiles: str, positions: List[int]) -> Tuple[Chem.Mol, Dict[int, int]]: - """ - Prepare the Tylosin scaffold by removing side chains at specified positions - and marking them with dummy atoms. - - Args: - smiles: SMILES string of the scaffold/molecule. - positions: List of ring positions (1-16) to prepare (add sockets). - - Returns: - Tuple of (Modified Molecule, Dict mapping position -> new_dummy_atom_idx) - The returned molecule has dummy atoms ('*') at the specified positions, - marked with Isotope = position number. - """ - mol = Chem.MolFromSmiles(smiles) - if not mol: - raise ValueError(f"Invalid SMILES: {smiles}") - - # 1. Ring numbering to identify target atoms - ring_map = assign_ring_numbering(mol) # atom_idx -> ring_pos - if not ring_map: - raise ValueError("Could not assign ring numbering. Is this a 16-membered lactone?") - - # Reverse map for easy lookup: ring_pos -> atom_idx - pos_to_atom = {v: k for k, v in ring_map.items()} - - atoms_to_remove = set() - dummies_to_add = [] # List of (ring_atom_idx, position) - - # 2. Identify edits - for pos in positions: - if pos not in pos_to_atom: - raise ValueError(f"Position {pos} not found in ring numbering.") - - ring_atom_idx = pos_to_atom[pos] - ring_atom = mol.GetAtomWithIdx(ring_atom_idx) - - # Identify non-ring neighbors (side chains) - # Note: neighbors in ring have indices in ring_map - side_chain_neighbors = [] - for n in ring_atom.GetNeighbors(): - if n.GetIdx() not in ring_map: - side_chain_neighbors.append(n.GetIdx()) - - # If side chains exist, mark them for removal - if side_chain_neighbors: - for sc_idx in side_chain_neighbors: - subtree = get_subtree_indices(mol, sc_idx, forbidden_idx=ring_atom_idx) - atoms_to_remove.update(subtree) - - # Plan to add dummy at this ring atom - dummies_to_add.append((ring_atom_idx, pos)) - - # 3. Apply edits using RWMol - rwmol = Chem.RWMol(mol) - - # Step A: Add dummy atoms - # We add them before deletion to use stable ring indices. - # Note: Adding atoms does not change existing indices. - - for ring_idx, pos in dummies_to_add: - # Create dummy atom - dummy = Chem.Atom('*') - dummy.SetIsotope(pos) # Mark with position - - new_idx = rwmol.AddAtom(dummy) - - # Add bond to ring atom - rwmol.AddBond(ring_idx, new_idx, Chem.BondType.SINGLE) - - # Step B: Remove side chain atoms - # Sort descending to preserve lower indices during deletion - sorted_remove = sorted(list(atoms_to_remove), reverse=True) - for idx in sorted_remove: - rwmol.RemoveAtom(idx) - - # 4. Finalize - mol_final = rwmol.GetMol() - - try: - Chem.SanitizeMol(mol_final) - except Exception: - # Sometime dummies trigger sanitization errors, but usually partial sanitization works. - # We'll ignore strict sanitization errors for the scaffold as it has dummies. - pass - - # 5. Build result map (position -> atom_index in new mol) - # Since indices shifted, we find dummies by their isotope markers. - final_dummy_map = {} - for atom in mol_final.GetAtoms(): - if atom.GetSymbol() == '*' and atom.GetIsotope() in positions: - final_dummy_map[atom.GetIsotope()] = atom.GetIdx() - - return mol_final, final_dummy_map diff --git a/src/visualizer.py b/src/visualizer.py deleted file mode 100644 index e1ecdb8..0000000 --- a/src/visualizer.py +++ /dev/null @@ -1,222 +0,0 @@ -""" -Visualization utilities for macrolactone analysis. -""" -from rdkit import Chem -from rdkit.Chem import Draw -from typing import Dict, List -import matplotlib.pyplot as plt -import seaborn as sns -import pandas as pd -import numpy as np - - -def draw_molecule_with_numbering(mol: Chem.Mol, numbering: Dict[int, int], - highlight_atoms: List[int] = None, - figsize=(10, 10)) -> None: - """ - Draw molecule with ring atom numbering. - - Args: - mol: RDKit molecule object - numbering: Ring numbering dictionary - highlight_atoms: Optional list of atom indices to highlight - figsize: Figure size tuple - """ - # Create atom labels - atom_labels = {} - for atom_idx, position in numbering.items(): - atom_labels[atom_idx] = str(position) - - # Draw with labels - img = Draw.MolToImage( - mol, - size=(figsize[0]*100, figsize[1]*100), - highlightAtoms=highlight_atoms or [], - atomLabels=atom_labels - ) - - plt.figure(figsize=figsize) - plt.imshow(img) - plt.axis('off') - plt.tight_layout() - plt.show() - - -def plot_substitution_frequency(position_counts: Dict[int, int], - save_path: str = None, - figsize=(12, 6)) -> None: - """ - Plot bar chart of substitution frequency at each ring position. - - Args: - position_counts: Dictionary mapping ring position to count - save_path: Optional path to save the figure - figsize: Figure size tuple - """ - positions = sorted(position_counts.keys()) - counts = [position_counts[pos] for pos in positions] - - plt.figure(figsize=figsize) - - # Create bar plot - bars = plt.bar(positions, counts, color='steelblue', edgecolor='black', alpha=0.7) - - # Color bars by frequency (gradient) - max_count = max(counts) if counts else 1 - for bar, count in zip(bars, counts): - normalized = count / max_count - bar.set_color(plt.cm.YlOrRd(normalized)) - - plt.xlabel('Ring Position', fontsize=14, fontweight='bold') - plt.ylabel('Number of Substitutions', fontsize=14, fontweight='bold') - plt.title('Side Chain Substitution Frequency by Ring Position', - fontsize=16, fontweight='bold') - plt.xticks(positions, fontsize=12) - plt.yticks(fontsize=12) - plt.grid(axis='y', alpha=0.3, linestyle='--') - - # Add value labels on bars - for i, (pos, count) in enumerate(zip(positions, counts)): - plt.text(pos, count + max_count*0.01, str(count), - ha='center', va='bottom', fontsize=10, fontweight='bold') - - plt.tight_layout() - - if save_path: - plt.savefig(save_path, dpi=300, bbox_inches='tight') - print(f"Figure saved to {save_path}") - - plt.show() - - -def plot_substitution_heatmap(position_fragment_data: pd.DataFrame, - save_path: str = None, - figsize=(14, 8)) -> None: - """ - Plot heatmap showing fragment diversity at each position. - - Args: - position_fragment_data: DataFrame with position and fragment information - save_path: Optional path to save the figure - figsize: Figure size tuple - """ - # Create pivot table - pivot = position_fragment_data.pivot_table( - index='fragment_smiles', - columns='cleavage_position', - values='count', - fill_value=0 - ) - - plt.figure(figsize=figsize) - - sns.heatmap(pivot, cmap='YlOrRd', annot=False, - cbar_kws={'label': 'Frequency'}, - linewidths=0.5, linecolor='gray') - - plt.xlabel('Ring Position', fontsize=14, fontweight='bold') - plt.ylabel('Fragment Type', fontsize=14, fontweight='bold') - plt.title('Fragment Diversity Across Ring Positions', - fontsize=16, fontweight='bold') - plt.tight_layout() - - if save_path: - plt.savefig(save_path, dpi=300, bbox_inches='tight') - print(f"Figure saved to {save_path}") - - plt.show() - - -def plot_circular_substitution_pattern(position_counts: Dict[int, int], - save_path: str = None, - figsize=(10, 10)) -> None: - """ - Plot circular/radar plot of substitution patterns around the ring. - - Args: - position_counts: Dictionary mapping ring position to count - save_path: Optional path to save the figure - figsize: Figure size tuple - """ - positions = list(range(1, 17)) # All 16 positions - counts = [position_counts.get(pos, 0) for pos in positions] - - # Append first value to close the circle - counts_circle = counts + [counts[0]] - - # Calculate angles - angles = np.linspace(0, 2 * np.pi, 16, endpoint=False).tolist() - angles_circle = angles + [angles[0]] - - fig, ax = plt.subplots(figsize=figsize, subplot_kw=dict(projection='polar')) - - # Plot - ax.plot(angles_circle, counts_circle, 'o-', linewidth=2, color='steelblue', - markersize=8, markerfacecolor='orange', markeredgecolor='black', markeredgewidth=1.5) - ax.fill(angles_circle, counts_circle, alpha=0.25, color='steelblue') - - # Set labels - ax.set_xticks(angles) - ax.set_xticklabels([str(i) for i in positions], fontsize=12) - ax.set_ylim(0, max(counts) * 1.1 if counts else 1) - - ax.set_title('Circular Substitution Pattern\n(16-Membered Ring)', - fontsize=16, fontweight='bold', pad=20) - ax.grid(True, linestyle='--', alpha=0.6) - - plt.tight_layout() - - if save_path: - plt.savefig(save_path, dpi=300, bbox_inches='tight') - print(f"Figure saved to {save_path}") - - plt.show() - - -def plot_fragment_statistics(fragments_df: pd.DataFrame, - save_path: str = None, - figsize=(15, 5)) -> None: - """ - Plot statistics about fragment properties. - - Args: - fragments_df: DataFrame with fragment information - save_path: Optional path to save the figure - figsize: Figure size tuple - """ - fig, axes = plt.subplots(1, 3, figsize=figsize) - - # Plot 1: Fragment size distribution - axes[0].hist(fragments_df['atom_count'], bins=20, color='skyblue', - edgecolor='black', alpha=0.7) - axes[0].set_xlabel('Number of Atoms', fontsize=12, fontweight='bold') - axes[0].set_ylabel('Frequency', fontsize=12, fontweight='bold') - axes[0].set_title('Fragment Size Distribution', fontsize=14, fontweight='bold') - axes[0].grid(axis='y', alpha=0.3) - - # Plot 2: Molecular weight distribution - axes[1].hist(fragments_df['molecular_weight'], bins=20, color='lightcoral', - edgecolor='black', alpha=0.7) - axes[1].set_xlabel('Molecular Weight (Da)', fontsize=12, fontweight='bold') - axes[1].set_ylabel('Frequency', fontsize=12, fontweight='bold') - axes[1].set_title('Fragment MW Distribution', fontsize=14, fontweight='bold') - axes[1].grid(axis='y', alpha=0.3) - - # Plot 3: Fragments per molecule - frags_per_mol = fragments_df.groupby('parent_id').size() - axes[2].hist(frags_per_mol, bins=range(int(frags_per_mol.min()), - int(frags_per_mol.max())+2), - color='lightgreen', edgecolor='black', alpha=0.7) - axes[2].set_xlabel('Fragments per Molecule', fontsize=12, fontweight='bold') - axes[2].set_ylabel('Number of Molecules', fontsize=12, fontweight='bold') - axes[2].set_title('Fragments per Macrolactone', fontsize=14, fontweight='bold') - axes[2].grid(axis='y', alpha=0.3) - - plt.tight_layout() - - if save_path: - plt.savefig(save_path, dpi=300, bbox_inches='tight') - print(f"Figure saved to {save_path}") - - plt.show() - diff --git a/tests/__init__.py b/tests/__init__.py new file mode 100644 index 0000000..3e0ea00 --- /dev/null +++ b/tests/__init__.py @@ -0,0 +1 @@ +# Tests package marker for helper imports. diff --git a/tests/helpers.py b/tests/helpers.py new file mode 100644 index 0000000..4efbc75 --- /dev/null +++ b/tests/helpers.py @@ -0,0 +1,96 @@ +from __future__ import annotations + +from dataclasses import dataclass +from typing import Mapping + +from rdkit import Chem + + +@dataclass(frozen=True) +class BuiltMacrolactone: + mol: Chem.Mol + smiles: str + position_to_atom: dict[int, int] + + +def build_macrolactone( + ring_size: int, + side_chains: Mapping[int, str] | None = None, +) -> BuiltMacrolactone: + if not 12 <= ring_size <= 20: + raise ValueError("ring_size must be between 12 and 20") + + side_chains = dict(side_chains or {}) + rwmol = Chem.RWMol() + + position_to_atom: dict[int, int] = { + 1: rwmol.AddAtom(Chem.Atom("C")), + 2: rwmol.AddAtom(Chem.Atom("O")), + } + for position in range(3, ring_size + 1): + position_to_atom[position] = rwmol.AddAtom(Chem.Atom("C")) + + carbonyl_oxygen_idx = rwmol.AddAtom(Chem.Atom("O")) + + rwmol.AddBond(position_to_atom[1], position_to_atom[2], Chem.BondType.SINGLE) + for position in range(2, ring_size): + rwmol.AddBond( + position_to_atom[position], + position_to_atom[position + 1], + Chem.BondType.SINGLE, + ) + rwmol.AddBond(position_to_atom[ring_size], position_to_atom[1], Chem.BondType.SINGLE) + rwmol.AddBond(position_to_atom[1], carbonyl_oxygen_idx, Chem.BondType.DOUBLE) + + for position, side_chain in side_chains.items(): + if position not in position_to_atom: + raise ValueError(f"Invalid ring position: {position}") + _add_side_chain(rwmol, position_to_atom[position], side_chain) + + mol = rwmol.GetMol() + Chem.SanitizeMol(mol) + return BuiltMacrolactone( + mol=mol, + smiles=Chem.MolToSmiles(mol, isomericSmiles=True), + position_to_atom=position_to_atom, + ) + + +def build_ambiguous_smiles() -> str: + mol_12 = build_macrolactone(12).mol + mol_14 = build_macrolactone(14).mol + combined = Chem.CombineMols(mol_12, mol_14) + return Chem.MolToSmiles(combined, isomericSmiles=True) + + +def canonicalize(smiles_or_mol: str | Chem.Mol) -> str: + if isinstance(smiles_or_mol, Chem.Mol): + mol = smiles_or_mol + else: + mol = Chem.MolFromSmiles(smiles_or_mol) + if mol is None: + raise ValueError(f"Unable to parse SMILES: {smiles_or_mol}") + return Chem.MolToSmiles(mol, isomericSmiles=True) + + +def _add_side_chain(rwmol: Chem.RWMol, ring_atom_idx: int, side_chain: str) -> None: + if side_chain == "methyl": + carbon_idx = rwmol.AddAtom(Chem.Atom("C")) + rwmol.AddBond(ring_atom_idx, carbon_idx, Chem.BondType.SINGLE) + return + + if side_chain == "ethyl": + carbon_1_idx = rwmol.AddAtom(Chem.Atom("C")) + carbon_2_idx = rwmol.AddAtom(Chem.Atom("C")) + rwmol.AddBond(ring_atom_idx, carbon_1_idx, Chem.BondType.SINGLE) + rwmol.AddBond(carbon_1_idx, carbon_2_idx, Chem.BondType.SINGLE) + return + + if side_chain == "exocyclic_alkene": + carbon_1_idx = rwmol.AddAtom(Chem.Atom("C")) + carbon_2_idx = rwmol.AddAtom(Chem.Atom("C")) + rwmol.AddBond(ring_atom_idx, carbon_1_idx, Chem.BondType.DOUBLE) + rwmol.AddBond(carbon_1_idx, carbon_2_idx, Chem.BondType.SINGLE) + return + + raise ValueError(f"Unsupported side chain: {side_chain}") diff --git a/tests/test_cli.py b/tests/test_cli.py new file mode 100644 index 0000000..9b047db --- /dev/null +++ b/tests/test_cli.py @@ -0,0 +1,74 @@ +from __future__ import annotations + +import json +import subprocess +import sys + +import pandas as pd + +from .helpers import build_ambiguous_smiles, build_macrolactone + + +def run_cli(*args: str) -> subprocess.CompletedProcess[str]: + return subprocess.run( + [sys.executable, "-m", "macro_lactone_toolkit.cli", *args], + capture_output=True, + text=True, + check=False, + ) + + +def test_cli_smoke_commands(): + built = build_macrolactone(16, {5: "methyl"}) + + analyze = run_cli("analyze", "--smiles", built.smiles) + assert analyze.returncode == 0, analyze.stderr + analyze_payload = json.loads(analyze.stdout) + assert analyze_payload["valid_ring_sizes"] == [16] + + number = run_cli("number", "--smiles", built.smiles) + assert number.returncode == 0, number.stderr + number_payload = json.loads(number.stdout) + assert number_payload["ring_size"] == 16 + assert number_payload["position_to_atom"]["1"] >= 0 + + fragment = run_cli("fragment", "--smiles", built.smiles, "--parent-id", "cli_1") + assert fragment.returncode == 0, fragment.stderr + fragment_payload = json.loads(fragment.stdout) + assert fragment_payload["parent_id"] == "cli_1" + assert fragment_payload["ring_size"] == 16 + assert fragment_payload["fragments"][0]["fragment_smiles_labeled"] + + +def test_cli_fragment_csv_skips_ambiguous_and_records_errors(tmp_path): + valid = build_macrolactone(14, {4: "methyl"}) + ambiguous = build_ambiguous_smiles() + input_path = tmp_path / "molecules.csv" + output_path = tmp_path / "fragments.csv" + errors_path = tmp_path / "errors.csv" + + pd.DataFrame( + [ + {"id": "valid_1", "smiles": valid.smiles}, + {"id": "ambiguous_1", "smiles": ambiguous}, + ] + ).to_csv(input_path, index=False) + + completed = run_cli( + "fragment", + "--input", + str(input_path), + "--output", + str(output_path), + "--errors-output", + str(errors_path), + ) + + assert completed.returncode == 0, completed.stderr + + fragments = pd.read_csv(output_path) + errors = pd.read_csv(errors_path) + + assert set(fragments["parent_id"]) == {"valid_1"} + assert errors.loc[0, "parent_id"] == "ambiguous_1" + assert errors.loc[0, "error_type"] == "AmbiguousMacrolactoneError" diff --git a/tests/test_detection_and_numbering.py b/tests/test_detection_and_numbering.py new file mode 100644 index 0000000..1cbc4a8 --- /dev/null +++ b/tests/test_detection_and_numbering.py @@ -0,0 +1,73 @@ +import pytest +from rdkit import Chem + +from macro_lactone_toolkit import ( + AmbiguousMacrolactoneError, + MacroLactoneAnalyzer, + MacrolactoneDetectionError, + MacrolactoneFragmenter, +) + +from .helpers import build_ambiguous_smiles, build_macrolactone + + +@pytest.mark.parametrize("ring_size", [12, 14, 16, 20]) +def test_analyzer_detects_supported_ring_sizes(ring_size: int): + built = build_macrolactone(ring_size) + analyzer = MacroLactoneAnalyzer() + + assert analyzer.get_valid_ring_sizes(built.smiles) == [ring_size] + + +def test_analyzer_rejects_non_lactone_macrocycle(): + analyzer = MacroLactoneAnalyzer() + + assert analyzer.get_valid_ring_sizes("C1CCCCCCCCCCC1") == [] + + +def test_fragmenter_auto_numbers_ring_with_expected_positions(): + built = build_macrolactone(16, {5: "methyl"}) + result = MacrolactoneFragmenter().number_molecule(built.mol) + + assert result.ring_size == 16 + assert result.position_to_atom == built.position_to_atom + assert set(result.position_to_atom) == set(range(1, 17)) + assert result.atom_to_position == { + atom_idx: position for position, atom_idx in built.position_to_atom.items() + } + + carbonyl_atom = built.mol.GetAtomWithIdx(result.position_to_atom[1]) + assert carbonyl_atom.GetSymbol() == "C" + assert any( + bond.GetBondType() == Chem.BondType.DOUBLE and bond.GetOtherAtom(carbonyl_atom).GetSymbol() == "O" + for bond in carbonyl_atom.GetBonds() + ) + + ester_oxygen = built.mol.GetAtomWithIdx(result.position_to_atom[2]) + assert ester_oxygen.GetSymbol() == "O" + + +def test_fragmenter_requires_explicit_ring_size_for_ambiguous_molecule(): + ambiguous_smiles = build_ambiguous_smiles() + + with pytest.raises(AmbiguousMacrolactoneError): + MacrolactoneFragmenter().number_molecule(ambiguous_smiles) + + +def test_fragmenter_raises_for_missing_macrolactone(): + with pytest.raises(MacrolactoneDetectionError): + MacrolactoneFragmenter().number_molecule("CCO") + + +def test_explicit_ring_size_selects_requested_ring(): + built = build_macrolactone(14) + result = MacrolactoneFragmenter(ring_size=14).number_molecule(built.smiles) + + assert result.ring_size == 14 + + +def test_explicit_ring_size_rejects_wrong_ring(): + built = build_macrolactone(12) + + with pytest.raises(MacrolactoneDetectionError): + MacrolactoneFragmenter(ring_size=16).number_molecule(built.smiles) diff --git a/tests/test_env_integration.py b/tests/test_env_integration.py deleted file mode 100644 index 8ed3779..0000000 --- a/tests/test_env_integration.py +++ /dev/null @@ -1,39 +0,0 @@ -import sys -import os -from pathlib import Path - -# Add SIME to path -SIME_PATH = "/home/zly/project/SIME" -if SIME_PATH not in sys.path: - sys.path.append(SIME_PATH) - -# Add project root to path so we can import 'src' -PROJECT_ROOT = str(Path(__file__).parent.parent) -if PROJECT_ROOT not in sys.path: - sys.path.append(PROJECT_ROOT) - -def test_imports(): - """Verify that we can import from both local project and SIME.""" - print(f"sys.path: {sys.path}") - - # 1. Test local import from src - try: - # Correct function name based on file inspection - from src.ring_numbering import assign_ring_numbering - assert callable(assign_ring_numbering) - print("Successfully imported src.ring_numbering.assign_ring_numbering") - except ImportError as e: - print(f"Failed to import src.ring_numbering: {e}") - raise - - # 2. Test SIME import - try: - from utils.mole_predictor import ParallelBroadSpectrumPredictor - assert ParallelBroadSpectrumPredictor is not None - print("Successfully imported ParallelBroadSpectrumPredictor from utils.mole_predictor") - except ImportError as e: - print(f"Failed to import from SIME: {e}") - raise - -if __name__ == "__main__": - test_imports() diff --git a/tests/test_fragment_prep.py b/tests/test_fragment_prep.py index 5245f9c..592f7f7 100644 --- a/tests/test_fragment_prep.py +++ b/tests/test_fragment_prep.py @@ -1,95 +1,42 @@ import pytest from rdkit import Chem -from src.splicing.fragment_prep import activate_fragment + +from macro_lactone_toolkit.splicing.fragment_prep import activate_fragment + def test_activate_smart_ethanol(): - """Test 'smart' activation on Ethanol (CCO). Should attach to Oxygen.""" - smiles = "CCO" - mol = activate_fragment(smiles, strategy="smart") - - # Check if we have a dummy atom + mol = activate_fragment("CCO", strategy="smart") + assert mol is not None - assert mol.GetNumAtoms() == 4 # C, C, O, * - - # Check if the dummy atom is attached to Oxygen - # Find the dummy atom - dummy_atom = None - for atom in mol.GetAtoms(): - if atom.GetSymbol() == '*': - dummy_atom = atom - break - - assert dummy_atom is not None - - # Check neighbors of dummy atom - neighbors = dummy_atom.GetNeighbors() - assert len(neighbors) == 1 - assert neighbors[0].GetSymbol() == 'O' - - # Check output SMILES format - out_smiles = Chem.MolToSmiles(mol) - assert '*' in out_smiles + assert mol.GetNumAtoms() == 4 + dummy_atom = next(atom for atom in mol.GetAtoms() if atom.GetAtomicNum() == 0) + assert dummy_atom.GetNeighbors()[0].GetSymbol() == "O" + assert "*" in Chem.MolToSmiles(mol) + def test_activate_smart_amine(): - """Test 'smart' activation on Ethylamine (CCN). Should attach to Nitrogen.""" - smiles = "CCN" - mol = activate_fragment(smiles, strategy="smart") - - assert mol is not None - - # Find the dummy atom - dummy_atom = None - for atom in mol.GetAtoms(): - if atom.GetSymbol() == '*': - dummy_atom = atom - break - - assert dummy_atom is not None - neighbors = dummy_atom.GetNeighbors() - assert neighbors[0].GetSymbol() == 'N' + mol = activate_fragment("CCN", strategy="smart") + + dummy_atom = next(atom for atom in mol.GetAtoms() if atom.GetAtomicNum() == 0) + assert dummy_atom.GetNeighbors()[0].GetSymbol() == "N" + def test_activate_random_pentane(): - """Test 'random' activation on Pentane (CCCCC). Should attach to a Carbon.""" - smiles = "CCCCC" - # Seed is not easily passed to the function unless we add it to the signature or fix it inside. - # For this test, any Carbon is fine. - mol = activate_fragment(smiles, strategy="random") - - assert mol is not None - assert mol.GetNumAtoms() == 6 # 5 C + 1 * - - dummy_atom = None - for atom in mol.GetAtoms(): - if atom.GetSymbol() == '*': - dummy_atom = atom - break - - assert dummy_atom is not None - neighbors = dummy_atom.GetNeighbors() - assert neighbors[0].GetSymbol() == 'C' + mol = activate_fragment("CCCCC", strategy="random") + + assert mol.GetNumAtoms() == 6 + dummy_atom = next(atom for atom in mol.GetAtoms() if atom.GetAtomicNum() == 0) + assert dummy_atom.GetNeighbors()[0].GetSymbol() == "C" + def test_activate_smart_fallback(): - """Test 'smart' fallback when no heteroatoms are found (e.g. Propane).""" - smiles = "CCC" - # Should fall back to finding a terminal carbon or random - # The requirement says "fall back to a terminal Carbon" or random. - # Let's assume the implementation picks a terminal carbon if possible, or just behaves like random on C. - mol = activate_fragment(smiles, strategy="smart") - - assert mol is not None - dummy_atom = None - for atom in mol.GetAtoms(): - if atom.GetSymbol() == '*': - dummy_atom = atom - break - - assert dummy_atom is not None - neighbor = dummy_atom.GetNeighbors()[0] - assert neighbor.GetSymbol() == 'C' - # Verify it's a valid molecule + mol = activate_fragment("CCC", strategy="smart") + + dummy_atom = next(atom for atom in mol.GetAtoms() if atom.GetAtomicNum() == 0) + assert dummy_atom.GetNeighbors()[0].GetSymbol() == "C" assert Chem.SanitizeMol(mol) == Chem.SanitizeFlags.SANITIZE_NONE + def test_invalid_smiles(): with pytest.raises(ValueError): activate_fragment("NotASmiles", strategy="smart") - diff --git a/tests/test_fragmentation.py b/tests/test_fragmentation.py new file mode 100644 index 0000000..02a1a58 --- /dev/null +++ b/tests/test_fragmentation.py @@ -0,0 +1,53 @@ +from rdkit import Chem + +from macro_lactone_toolkit import MacrolactoneFragmenter + +from .helpers import build_macrolactone + + +def test_fragmentation_returns_empty_list_without_sidechains(): + built = build_macrolactone(12) + result = MacrolactoneFragmenter().fragment_molecule(built.smiles, parent_id="plain") + + assert result.fragments == [] + + +def test_fragmentation_emits_labeled_and_plain_smiles_round_trip(): + built = build_macrolactone(16, {5: "ethyl", 8: "methyl"}) + result = MacrolactoneFragmenter().fragment_molecule(built.smiles, parent_id="mol_001") + + assert result.parent_id == "mol_001" + assert result.ring_size == 16 + assert {fragment.cleavage_position for fragment in result.fragments} == {5, 8} + + for fragment in result.fragments: + labeled = Chem.MolFromSmiles(fragment.fragment_smiles_labeled) + plain = Chem.MolFromSmiles(fragment.fragment_smiles_plain) + + assert labeled is not None + assert plain is not None + assert Chem.MolToSmiles(labeled, isomericSmiles=True) + assert Chem.MolToSmiles(plain, isomericSmiles=True) + assert any( + atom.GetAtomicNum() == 0 and atom.GetIsotope() == fragment.cleavage_position + for atom in labeled.GetAtoms() + ) + assert any( + atom.GetAtomicNum() == 0 and atom.GetIsotope() == 0 + for atom in plain.GetAtoms() + ) + + +def test_fragmentation_preserves_attachment_bond_type(): + built = build_macrolactone(16, {6: "exocyclic_alkene"}) + result = MacrolactoneFragmenter().fragment_molecule(built.smiles, parent_id="bond_type") + + fragment = next(fragment for fragment in result.fragments if fragment.cleavage_position == 6) + labeled = Chem.MolFromSmiles(fragment.fragment_smiles_labeled) + plain = Chem.MolFromSmiles(fragment.fragment_smiles_plain) + + for mol in (labeled, plain): + dummy_atom = next(atom for atom in mol.GetAtoms() if atom.GetAtomicNum() == 0) + neighbor = dummy_atom.GetNeighbors()[0] + bond = mol.GetBondBetweenAtoms(dummy_atom.GetIdx(), neighbor.GetIdx()) + assert bond.GetBondType() == Chem.BondType.DOUBLE diff --git a/tests/test_imports.py b/tests/test_imports.py new file mode 100644 index 0000000..9aed8e4 --- /dev/null +++ b/tests/test_imports.py @@ -0,0 +1,5 @@ +import macro_lactone_toolkit + + +def test_public_imports_smoke(): + assert macro_lactone_toolkit is not None diff --git a/tests/test_ring_numbering.py b/tests/test_ring_numbering.py deleted file mode 100644 index 170f0d5..0000000 --- a/tests/test_ring_numbering.py +++ /dev/null @@ -1,223 +0,0 @@ -""" -测试环编号功能 - 验证原子编号是否固定 -""" -import sys -sys.path.insert(0, '/home/zly/project/macro_split') - -from rdkit import Chem -from rdkit.Chem import Draw, AllChem -from rdkit.Chem.Draw import rdMolDraw2D -from src.ring_visualization import ( - get_macrolactone_numbering, - get_ring_atoms_by_size -) - - -def test_ring_numbering_consistency(smiles: str, ring_size: int = 16, num_tests: int = 5): - """ - 测试环编号的一致性 - 多次运行确保编号固定 - """ - print("=" * 70) - print("测试环编号一致性") - print("=" * 70) - print(f"\nSMILES: {smiles[:80]}...") - print(f"环大小: {ring_size}") - print(f"测试次数: {num_tests}") - - # 解析分子 - mol = Chem.MolFromSmiles(smiles) - if mol is None: - print("❌ 无法解析SMILES") - return False - - print(f"✓ 分子解析成功,共 {mol.GetNumAtoms()} 个原子") - - # 检测环大小 - ring_atoms = get_ring_atoms_by_size(mol, ring_size) - if ring_atoms is None: - for size in range(12, 21): - ring_atoms = get_ring_atoms_by_size(mol, size) - if ring_atoms: - ring_size = size - print(f"⚠️ 使用检测到的{size}元环") - break - - if ring_atoms is None: - print("❌ 未找到12-20元环") - return False - - print(f"✓ 找到{ring_size}元环,包含 {len(ring_atoms)} 个原子") - - # 多次测试编号一致性 - all_numberings = [] - all_carbonyl_carbons = [] - all_ester_oxygens = [] - - for i in range(num_tests): - result = get_macrolactone_numbering(mol, ring_size) - ring_atoms_result, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen, (is_valid, reason) = result - - if not is_valid: - print(f"❌ 第{i+1}次测试失败: {reason}") - return False - - all_numberings.append(ring_numbering.copy()) - all_carbonyl_carbons.append(carbonyl_carbon) - all_ester_oxygens.append(ester_oxygen) - - # 验证一致性 - print("\n" + "-" * 50) - print("编号一致性检查:") - print("-" * 50) - - is_consistent = True - - if len(set(all_carbonyl_carbons)) == 1: - print(f"✓ 羰基碳位置一致: 原子索引 {all_carbonyl_carbons[0]}") - else: - print(f"❌ 羰基碳位置不一致: {all_carbonyl_carbons}") - is_consistent = False - - if len(set(all_ester_oxygens)) == 1: - print(f"✓ 酯氧位置一致: 原子索引 {all_ester_oxygens[0]}") - else: - print(f"❌ 酯氧位置不一致: {all_ester_oxygens}") - is_consistent = False - - first_numbering = all_numberings[0] - for i, numbering in enumerate(all_numberings[1:], 2): - if numbering != first_numbering: - print(f"❌ 第{i}次编号与第1次不一致") - is_consistent = False - break - - if is_consistent: - print(f"✓ 所有{num_tests}次测试的编号完全一致") - - # 显示详细编号信息 - print("\n" + "-" * 50) - print("环原子编号详情:") - print("-" * 50) - - numbering = all_numberings[0] - carbonyl_carbon = all_carbonyl_carbons[0] - ester_oxygen = all_ester_oxygens[0] - - sorted_items = sorted(numbering.items(), key=lambda x: x[1]) - - print(f"{'位置':<6} {'原子索引':<10} {'元素':<6} {'说明'}") - print("-" * 40) - - for atom_idx, position in sorted_items: - atom = mol.GetAtomWithIdx(atom_idx) - symbol = atom.GetSymbol() - note = "" - if atom_idx == carbonyl_carbon: - note = "← 羰基碳 (C=O)" - elif atom_idx == ester_oxygen: - note = "← 酯键氧" - print(f"{position:<6} {atom_idx:<10} {symbol:<6} {note}") - - return is_consistent - - -def save_visualization(smiles: str, output_path: str, ring_size: int = 16): - """保存分子可视化图片""" - print("\n" + "=" * 70) - print("保存可视化图片") - print("=" * 70) - - mol = Chem.MolFromSmiles(smiles) - if mol is None: - print("❌ 无法解析SMILES") - return - - for size in range(12, 21): - ring_atoms = get_ring_atoms_by_size(mol, size) - if ring_atoms: - ring_size = size - break - - result = get_macrolactone_numbering(mol, ring_size) - ring_atoms, ring_numbering, ordered_atoms, carbonyl_carbon, ester_oxygen, (is_valid, reason) = result - - if not is_valid: - print(f"❌ 无法获取编号: {reason}") - return - - mol_copy = Chem.Mol(mol) - AllChem.Compute2DCoords(mol_copy) - - for atom_idx in ring_atoms: - if atom_idx in ring_numbering: - atom = mol_copy.GetAtomWithIdx(atom_idx) - atom.SetProp("atomNote", str(ring_numbering[atom_idx])) - - atom_colors = {} - for atom_idx in ring_atoms: - atom = mol.GetAtomWithIdx(atom_idx) - symbol = atom.GetSymbol() - - if atom_idx == carbonyl_carbon: - atom_colors[atom_idx] = (1.0, 0.6, 0.0) - elif atom_idx == ester_oxygen: - atom_colors[atom_idx] = (1.0, 0.4, 0.4) - elif symbol == 'C': - atom_colors[atom_idx] = (0.7, 0.85, 1.0) - elif symbol == 'O': - atom_colors[atom_idx] = (1.0, 0.7, 0.7) - elif symbol == 'N': - atom_colors[atom_idx] = (0.8, 0.7, 1.0) - else: - atom_colors[atom_idx] = (0.8, 1.0, 0.8) - - drawer = rdMolDraw2D.MolDraw2DSVG(1000, 1000) - drawer.SetFontSize(14) - drawer.DrawMolecule(mol_copy, highlightAtoms=list(ring_atoms), highlightAtomColors=atom_colors) - drawer.FinishDrawing() - svg = drawer.GetDrawingText() - - svg_path = output_path.replace('.png', '.svg') - with open(svg_path, 'w', encoding='utf-8') as f: - f.write(svg) - print(f"✓ SVG已保存到: {svg_path}") - - try: - drawer_png = rdMolDraw2D.MolDraw2DCairo(1000, 1000) - drawer_png.SetFontSize(14) - drawer_png.DrawMolecule(mol_copy, highlightAtoms=list(ring_atoms), highlightAtomColors=atom_colors) - drawer_png.FinishDrawing() - drawer_png.WriteDrawingText(output_path) - print(f"✓ PNG已保存到: {output_path}") - except Exception as e: - print(f"⚠️ PNG保存失败: {e}") - - print("\n颜色说明:") - print(" 橙色: 羰基碳 (位置1)") - print(" 红色: 酯键氧 (位置2)") - print(" 浅蓝色: 环上碳原子") - - -def main(): - smiles = "O[C@H]1[C@H]([C@H]([C@H](OC[C@@H]2[C@@H](CC)OC(C[C@H]([C@H](C)[C@H]([C@@H](CC=O)C[C@@H](C)C(/C=C/C(/C)=C/2)=O)O[C@H]2[C@@H]([C@H]([C@@H]([C@@H](C)O2)O[C@H]2C[C@](C)([C@@H]([C@@H](C)O2)O)O)[N@](C)C)O)O)=O)O[C@@H]1C)OC)OC" - - print("\n大环内酯环编号测试\n") - is_consistent = test_ring_numbering_consistency(smiles, ring_size=16, num_tests=5) - - output_path = "/home/zly/project/macro_split/output/test_ring_numbering.png" - save_visualization(smiles, output_path, ring_size=16) - - print("\n" + "=" * 70) - print("测试总结") - print("=" * 70) - if is_consistent: - print("✅ 所有测试通过!环原子编号是固定的。") - else: - print("❌ 测试失败:环原子编号不一致") - - return is_consistent - - -if __name__ == "__main__": - success = main() - sys.exit(0 if success else 1) diff --git a/tests/test_scaffold_prep.py b/tests/test_scaffold_prep.py deleted file mode 100644 index d1d5a8d..0000000 --- a/tests/test_scaffold_prep.py +++ /dev/null @@ -1,84 +0,0 @@ -import pytest -from rdkit import Chem -from src.splicing.scaffold_prep import prepare_tylosin_scaffold -from src.ring_numbering import assign_ring_numbering - -def test_prepare_tylosin_scaffold(): - # Construct a 16-membered lactone with side chains - # Numbering logic (assumed based on implementation): - # 1: C=O - # 2-6: CH2 - # 7: CH(CH3) <- Methyl side chain - # 8-14: CH2 - # 15: CH(CC) <- Ethyl side chain - # 16: O - - # SMILES: - # O=C1 (pos 1) - # CCCCC (pos 2-6) - # C(C) (pos 7, with Methyl) - # CCCCCCC (pos 8-14) - # C(CC) (pos 15, with Ethyl) - # O1 (pos 16) - - smiles = "O=C1CCCCC(C)CCCCCCCCC(CC)O1" - - # Verify initial assumption about numbering - mol = Chem.MolFromSmiles(smiles) - numbering = assign_ring_numbering(mol) - - # Find atom indices for pos 7 and 15 to ensure our SMILES construction is correct for the test - pos_map = {v: k for k, v in numbering.items()} - assert 7 in pos_map, "Position 7 not found in ring" - assert 15 in pos_map, "Position 15 not found in ring" - assert 5 in pos_map, "Position 5 not found in ring" - - atom7 = mol.GetAtomWithIdx(pos_map[7]) - atom15 = mol.GetAtomWithIdx(pos_map[15]) - atom5 = mol.GetAtomWithIdx(pos_map[5]) - - # Check side chains exist - # Atom 7 should have 3 neighbors (2 ring, 1 methyl) - assert len(atom7.GetNeighbors()) == 3 - # Atom 15 should have 3 neighbors (2 ring, 1 ethyl) - assert len(atom15.GetNeighbors()) == 3 - # Atom 5 should have 2 neighbors (2 ring, 2 implicit H) - assert len(atom5.GetNeighbors()) == 2 - - # Execute scaffold prep - target_positions = [5, 7, 15] - res_mol, dummy_map = prepare_tylosin_scaffold(smiles, target_positions) - - assert res_mol is not None - assert len(dummy_map) == 3 - - # Verify dummies - for pos in target_positions: - assert pos in dummy_map - dummy_idx = dummy_map[pos] - dummy_atom = res_mol.GetAtomWithIdx(dummy_idx) - assert dummy_atom.GetSymbol() == "*" - assert dummy_atom.GetIsotope() == pos - - # Check that dummy is connected to the correct ring position - neighbors = dummy_atom.GetNeighbors() - assert len(neighbors) == 1 - - # Verify side chains removed - # New atom counts. - # Original: 16 (ring) + 1 (O=) + 1 (Me) + 2 (Et) = 20 heavy atoms. - # Removed: Me (1), Et (2). Total -3. - # Added: 3 dummies. Total +3. - # Net: 20. - assert res_mol.GetNumAtoms() == 20 - - # Check that the specific side chains are gone. - # Count carbons. - # Original C count: 1 (C=O) + 14 (CH2/CH) + 1(Me) + 2(Et) = 18 C. - # New C count: 1 (C=O) + 14 (Ring C) = 15 C. - # Dummies are *. O are O. - c_count = sum(1 for a in res_mol.GetAtoms() if a.GetSymbol() == 'C') - assert c_count == 15, f"Expected 15 Carbons, found {c_count}" - - dummy_count = sum(1 for a in res_mol.GetAtoms() if a.GetSymbol() == '*') - assert dummy_count == 3 diff --git a/tests/test_splicing_engine.py b/tests/test_splicing_engine.py index e3a6edf..09ee986 100644 --- a/tests/test_splicing_engine.py +++ b/tests/test_splicing_engine.py @@ -1,77 +1,51 @@ import pytest from rdkit import Chem -from src.splicing.engine import splice_molecule + +from macro_lactone_toolkit import MacrolactoneFragmenter +from macro_lactone_toolkit.splicing.engine import splice_molecule +from macro_lactone_toolkit.splicing.scaffold_prep import prepare_macrolactone_scaffold + +from .helpers import build_macrolactone, canonicalize + def test_splice_benzene_methyl(): - """ - Test splicing a benzene scaffold (isotope 1) with a methyl fragment. - Scaffold: c1ccccc1[1*] (Phenyl radical-ish dummy) - Fragment: C* (Methyl radical-ish dummy) - Result: Cc1ccccc1 (Toluene) - """ scaffold = Chem.MolFromSmiles("c1ccccc1[1*]") fragment = Chem.MolFromSmiles("C*") - - assert scaffold is not None - assert fragment is not None - + product = splice_molecule(scaffold, fragment, position=1) - - # Expected result: Toluene - expected_smiles = "Cc1ccccc1" - expected_mol = Chem.MolFromSmiles(expected_smiles) - expected_canonical = Chem.MolToSmiles(expected_mol, isomericSmiles=True) - - product_canonical = Chem.MolToSmiles(product, isomericSmiles=True) - - assert product_canonical == expected_canonical + + assert canonicalize(product) == canonicalize("Cc1ccccc1") + def test_splice_missing_isotope(): - """Test that error is raised if the requested position is not found on scaffold.""" - scaffold = Chem.MolFromSmiles("c1ccccc1[2*]") # Isotope 2 + scaffold = Chem.MolFromSmiles("c1ccccc1[2*]") fragment = Chem.MolFromSmiles("C*") - + with pytest.raises(ValueError, match="Scaffold dummy atom with isotope 1 not found"): splice_molecule(scaffold, fragment, position=1) + def test_splice_no_fragment_dummy(): - """Test that error is raised if fragment has no dummy atom.""" scaffold = Chem.MolFromSmiles("c1ccccc1[1*]") - fragment = Chem.MolFromSmiles("C") # Methane, no dummy - + fragment = Chem.MolFromSmiles("C") + with pytest.raises(ValueError, match="Fragment does not contain a dummy atom"): splice_molecule(scaffold, fragment, position=1) -def test_complex_splicing(): - """ - Test splicing with more complex structures. - Scaffold: Pyridine derivative n1cccc1CC[1*] - Fragment: Cyclopropane C1CC1* - Result: n1cccc1CCC1CC1 - """ - scaffold = Chem.MolFromSmiles("n1cccc1CC[1*]") - fragment = Chem.MolFromSmiles("*C1CC1") - - product = splice_molecule(scaffold, fragment, position=1) - - expected = Chem.MolFromSmiles("n1cccc1CCC1CC1") - - assert Chem.MolToSmiles(product) == Chem.MolToSmiles(expected) -def test_scaffold_with_multiple_different_dummies(): - """ - Test splicing when scaffold has multiple dummies with different isotopes. - Scaffold: [1*]c1ccccc1[2*] - Fragment: C* - Target: Splicing at 1 should leave [2*] intact. - """ - scaffold = Chem.MolFromSmiles("[1*]c1ccccc1[2*]") - fragment = Chem.MolFromSmiles("C*") - - # Splice at 1 - product = splice_molecule(scaffold, fragment, position=1) - - # Expected: Cc1ccccc1[2*] - expected = Chem.MolFromSmiles("Cc1ccccc1[2*]") - - assert Chem.MolToSmiles(product) == Chem.MolToSmiles(expected) +def test_prepare_scaffold_and_reassemble_fragment(): + built = build_macrolactone(16, {5: "ethyl"}) + result = MacrolactoneFragmenter(ring_size=16).fragment_molecule(built.smiles, parent_id="reassemble") + fragment = next(fragment for fragment in result.fragments if fragment.cleavage_position == 5) + + scaffold, dummy_map = prepare_macrolactone_scaffold( + built.smiles, + positions=[5], + ring_size=16, + ) + + assert 5 in dummy_map + + product = splice_molecule(scaffold, Chem.MolFromSmiles(fragment.fragment_smiles_labeled), position=5) + + assert canonicalize(product) == canonicalize(built.mol)