From 3e07402f4edb27095f34e5b2acf51285b225c471 Mon Sep 17 00:00:00 2001
From: lingyuzeng
Date: Fri, 20 Mar 2026 15:14:31 +0800
Subject: [PATCH] feat(numbering): publish canonical numbering API

Add a public numbering module and route fragmenting, validation, and
scaffold preparation through the canonical numbering entry. Rewrite the
repository entry docs around the fixed numbering contract, add MkDocs
landing pages, and document the mirror mapping used for
medicinal-chemistry comparisons. Also refresh the validation analysis
reports to explain the canonical-versus-mirrored numbering relationship.
---
 AGENTS.md                                      |  38 +++
 README.md                                      | 178 +++--------
 docs/development/index.md                      |  19 ++
 docs/development/project-structure.md          |  29 ++
 docs/getting-started.md                        |  39 +++
 docs/index.md                                  |  29 ++
 docs/project-docs/AGENTS.md                    | 284 +-----------------
 docs/user-guide/index.md                       |  17 ++
 docs/user-guide/ring-numbering.md              |  55 ++++
 mkdocs.yml                                     |  39 +--
 pixi.toml                                      |   4 +
 .../analyze_validation_fragment_library.py     |  35 ++-
 src/macro_lactone_toolkit/__init__.py          |   8 +
 src/macro_lactone_toolkit/fragmenter.py        |   5 +-
 src/macro_lactone_toolkit/numbering.py         |  70 +++++
 .../splicing/scaffold_prep.py                  |   5 +-
 .../validation/validator.py                    |   4 +-
 tests/test_documentation_entrypoints.py        |  34 +++
 tests/test_numbering_api.py                    |  52 ++++
 tests/test_scripts_and_docs.py                 |   4 +
 .../fragment_library_analysis_report.md        |  11 +
 .../fragment_library_analysis_report_zh.md     |  14 +-
 22 files changed, 529 insertions(+), 444 deletions(-)
 create mode 100644 AGENTS.md
 create mode 100644 docs/development/index.md
 create mode 100644 docs/development/project-structure.md
 create mode 100644 docs/getting-started.md
 create mode 100644 docs/index.md
 create mode 100644 docs/user-guide/index.md
 create mode 100644 docs/user-guide/ring-numbering.md
 create mode 100644 src/macro_lactone_toolkit/numbering.py
 create mode 100644 tests/test_documentation_entrypoints.py
 create mode 100644 tests/test_numbering_api.py

diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644 index 0000000..4d330a7 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,38 @@ +# AGENTS.md + +This is the only authoritative agent entry for this repository. +If another `AGENTS.md` file says something different, follow this file. + +## Canonical numbering + +- `1 = 内酯羰基碳` +- `2 = 相邻酯氧` +- `3..N = 从 2 位出发沿环唯一图遍历顺序继续编号` + +For 16-membered rings, the mirror mapping is fixed: + +- `3 → 16` +- `4 → 15` +- `5 → 14` +- `6 → 13` +- `7 → 12` +- `8 → 11` +- `9 → 10` + +This numbering is deterministic and is not a visual clockwise / anticlockwise toggle. +它不是视觉顺时针,也不是视觉逆时针切换。 +The public API does not expose `clockwise` or `anticlockwise` parameters. + +## Practical rule + +- Use canonical numbering in code, reports, and validation outputs. +- Convert to literature-style mirrored labels only when you are comparing against a source that numbers the ring from the opposite direction. +- Keep bridge / fused multi-anchor cases explicit; do not silently reinterpret them as a direction choice. + +## Entry points + +- `README.md` is the progressive disclosure landing page. +- `docs/index.md` is the documentation landing page. +- `docs/user-guide/ring-numbering.md` is the canonical numbering reference. +- `docs/development/project-structure.md` is the repository layout reference. +- `docs/project-docs/AGENTS.md` points back here and should never override this file. 
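The fixed 16-membered mirror table in `AGENTS.md` follows from one rule used throughout this patch: positions 1 and 2 stay unchanged, and positions 3 and up map to `ring_size - p + 3`. The sketch below is a minimal standalone illustration of that rule. `mirror_position` and `mirror_table` are hypothetical names chosen for this example, and the block deliberately does not import the package.

```python
def mirror_position(position: int, ring_size: int) -> int:
    """Illustrative helper (hypothetical name, not the package API)."""
    # Position 1 (lactone carbonyl carbon) and position 2 (ester oxygen)
    # are the same in both numbering directions.
    if position <= 2:
        return position
    # Positions 3..N are reflected: p_mirror = ring_size - p + 3.
    return ring_size - position + 3


def mirror_table(ring_size: int) -> dict[int, int]:
    """Map every canonical position 1..ring_size to its mirrored label."""
    return {p: mirror_position(p, ring_size) for p in range(1, ring_size + 1)}


table = mirror_table(16)

# Mirroring twice returns the original position, so the mapping is an involution.
assert all(mirror_position(m, 16) == p for p, m in table.items())

print(table[3], table[9], table[16])  # prints: 16 10 3
```

Because mirroring twice gives back the original label, one table serves both directions of comparison, canonical to literature and literature to canonical.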
diff --git a/README.md b/README.md index 64ca723..01b5d62 100644 --- a/README.md +++ b/README.md @@ -1,157 +1,71 @@ # macro_lactone_toolkit -`macro_lactone_toolkit` 是一个正式可安装的 Python 包,用于 12-20 元有效大环内酯的识别、环编号、侧链裂解和简单拼接回组装。 +`macro_lactone_toolkit` 是一个用于 12-20 元大环内酯识别、canonical numbering、侧链裂解和拼接准备的 Python 工具包。 -## 核心能力 +## 先看哪里 -- 默认自动识别 12-20 元有效大环内酯,也允许显式指定 `ring_size` -- 环编号规则固定为: - - 位置 1 = 内酯羰基碳 - - 位置 2 = 环上的酯键氧 - - 位置 3-N = 沿统一方向连续编号 -- 侧链裂解同时输出两套 SMILES: - - `fragment_smiles_labeled`,例如 `[5*]` - - `fragment_smiles_plain`,例如 `*` -- dummy 原子与连接原子的原始键型保持一致 -- 提供正式 CLI: - - `macro-lactone-toolkit analyze` - - `macro-lactone-toolkit number` - - `macro-lactone-toolkit fragment` +- 只想快速上手: [docs/index.md](docs/index.md) +- 只想看编号规则: [docs/user-guide/ring-numbering.md](docs/user-guide/ring-numbering.md) +- 只想看项目结构: [docs/development/project-structure.md](docs/development/project-structure.md) +- 只想看 agent 入口: [AGENTS.md](AGENTS.md) -## 环境 +## 渐进式入口 -推荐使用 `pixi`,项目已固定到 Python 3.12,并支持 `osx-arm64` 与 `linux-64`。 +### 1. 先确认编号契约 -```bash -pixi install -pixi run pytest -pixi run python -c "import macro_lactone_toolkit" -``` +这个仓库只有一套 canonical numbering: -## Python API +- `1 = 内酯羰基碳` +- `2 = 相邻酯氧` +- `3..N = 从 2 位出发沿环唯一图遍历顺序继续编号` + +对 16 元环,镜像映射是固定的: + +- `3 → 16` +- `4 → 15` +- `5 → 14` +- `6 → 13` +- `7 → 12` +- `8 → 11` +- `9 → 10` +- `10 → 9` +- `11 → 8` +- `12 → 7` +- `13 → 6` +- `14 → 5` +- `15 → 4` +- `16 → 3` + +这不是视觉顺时针/逆时针切换,公开 API 也不提供 `clockwise` / `anticlockwise` 参数。 + +### 2. 
再看最小用法 ```python from macro_lactone_toolkit import MacroLactoneAnalyzer, MacrolactoneFragmenter analyzer = MacroLactoneAnalyzer() -valid_ring_sizes = analyzer.get_valid_ring_sizes("O=C1CCCCCCCCCCCCCCO1") - fragmenter = MacrolactoneFragmenter() -numbering = fragmenter.number_molecule("O=C1CCCCCCCCCCCCCCO1") -result = fragmenter.fragment_molecule("O=C1CCCC(C)CCCCCCCCCCO1", parent_id="mol_001") + +print(analyzer.get_valid_ring_sizes("O=C1CCCCCCCCCCCCCCO1")) +print(fragmenter.number_molecule("O=C1CCCCCCCCCCCCCCO1").position_to_atom) ``` -## CLI - -单分子分析: - ```bash +pixi install +pixi run pytest pixi run macro-lactone-toolkit analyze --smiles "O=C1CCCCCCCCCCCCCCO1" pixi run macro-lactone-toolkit number --smiles "O=C1CCCCCCCCCCCCCCO1" pixi run macro-lactone-toolkit fragment --smiles "O=C1CCCC(C)CCCCCCCCCCO1" --parent-id mol_001 ``` -CSV 批处理: +### 3. 再深入到页面 -```bash -pixi run macro-lactone-toolkit fragment \ - --input molecules.csv \ - --output fragments.csv \ - --errors-output fragment_errors.csv -``` +- [docs/getting-started.md](docs/getting-started.md) +- [docs/user-guide/index.md](docs/user-guide/index.md) +- [docs/development/index.md](docs/development/index.md) -默认读取 `smiles` 列;若存在 `id` 列则将其作为 `parent_id`,否则自动生成 `row_`。 +## 维护约束 -## MacrolactoneDB 验证模块 - -用于对 MacrolactoneDB 数据库进行抽样验证、分类、侧链断裂和数据库存储。 - -### 验证脚本使用 - -```bash -# 基本使用(10% 分层抽样) -pixi run python scripts/validate_macrolactone_db.py \ - --input data/MacrolactoneDB/ring12_20/temp.csv \ - --output validation_output \ - --sample-ratio 0.1 - -# 处理全量数据 -pixi run python scripts/validate_macrolactone_db.py \ - --input data/MacrolactoneDB/ring12_20/temp.csv \ - --output validation_output \ - --sample-ratio 1.0 - -# 指定列名(如果 CSV 列名不同) -pixi run python scripts/validate_macrolactone_db.py \ - --input data.csv \ - --output validation_output \ - --id-col ml_id \ - --chembl-id-col IDs \ - --smiles-col smiles -``` - -### 输出结构 - -``` -validation_output/ -├── README.md # 目录说明 -├── fragments.db # SQLite 数据库 -├── 
fragment_library.csv # 最终片段库导出(含 has_dummy_atom / splice_ready) -├── summary.csv # 汇总表(含 ml_id, chembl_id) -├── summary_statistics.json # 统计信息 -├── ring_size_12/ # 按环大小组织 -├── ring_size_13/ -... -└── ring_size_20/ - ├── standard/ - │ ├── numbered/ # 带编号的环图(文件名使用 ml_id) - │ │ └── {ml_id}_numbered.png - │ └── sidechains/ # 片段图 - │ └── {ml_id}/ - │ └── {ml_id}_frag_{n}_pos{pos}.png - ├── non_standard/original/ - └── rejected/original/ -``` - -### 数据库查询示例 - -```bash -# 查看表结构 -sqlite3 validation_output/fragments.db ".tables" - -# 查询标准大环内酯 -sqlite3 validation_output/fragments.db \ - "SELECT ml_id, chembl_id, ring_size, num_sidechains \ - FROM parent_molecules \ - WHERE classification='standard_macrolactone' LIMIT 5;" - -# 查询最终片段库 -sqlite3 validation_output/fragments.db \ - "SELECT source_type, source_parent_ml_id, cleavage_position, has_dummy_atom, splice_ready \ - FROM fragment_library_entries LIMIT 10;" - -# 查询片段 -sqlite3 validation_output/fragments.db \ - "SELECT fragment_id, cleavage_position, dummy_isotope, has_dummy_atom, dummy_atom_count \ - FROM side_chain_fragments LIMIT 10;" - -# 按环大小统计 -sqlite3 validation_output/fragments.db \ - "SELECT ring_size, COUNT(*) FROM parent_molecules GROUP BY ring_size;" -``` - -### 关键字段说明 - -| 字段 | 说明 | -|------|------| -| `ml_id` | MacrolactoneDB 唯一 ID(如 ML00000001),用于文件命名 | -| `chembl_id` | 原始 CHEMBL ID(如 CHEMBL94657),可能为空 | -| `classification` | standard_macrolactone / non_standard_macrocycle / not_macrolactone | -| `dummy_isotope` | 裂解位置编号,用于片段重建 | -| `cleavage_position` | 环上的断裂位置 | -| `has_dummy_atom` | 该片段是否带 dummy 原子,可用于区分可直接拼接片段 | -| `splice_ready` | 是否与当前单锚点拼接流程直接兼容 | - -## Legacy Scripts - -`scripts/` 目录保留为薄封装或迁移提示,不再承载核心实现。正式接口以 `macro_lactone_toolkit.*` 与 `macro-lactone-toolkit` CLI 为准。 +- 根目录 `AGENTS.md` 是唯一权威 agent 入口。 +- 入口文档只保留当前真实存在、持续维护的页面。 +- 如果你要把药化文献里的镜像位置拿来对照,先按 canonical numbering 记账,再做镜像转换。 diff --git a/docs/development/index.md b/docs/development/index.md new file mode 100644 index 0000000..031059f --- 
/dev/null +++ b/docs/development/index.md @@ -0,0 +1,19 @@ +# 开发者指南 + +这里给维护者看:项目结构、入口和约束。 + +## 先看什么 + +- [项目结构](project-structure.md) +- 仓库根目录的 `AGENTS.md` 是唯一权威 agent 入口 + +## 维护原则 + +- 入口文档只保留真实存在、持续维护的页面。 +- 编号规则只使用 canonical numbering。 +- 不引入 `clockwise` / `anticlockwise` 参数。 + +## 适合继续往下看的内容 + +- 如果你在找包和脚本分别负责什么,去 [project-structure.md](project-structure.md) +- 如果你在找 agent 约束,直接查看仓库根目录 `AGENTS.md` diff --git a/docs/development/project-structure.md b/docs/development/project-structure.md new file mode 100644 index 0000000..bb33945 --- /dev/null +++ b/docs/development/project-structure.md @@ -0,0 +1,29 @@ +# 项目结构 + +这是当前仓库里真正承担职责的目录划分。 + +## 顶层目录 + +- `src/macro_lactone_toolkit/`: 正式 Python 包,包含分析、编号、裂解、可视化、工作流和验证模块。 +- `scripts/`: 薄封装和批处理脚本,基于正式包接口运行。 +- `tests/`: pytest 测试,覆盖入口、脚本和核心行为。 +- `docs/`: 面向使用者和维护者的入口文档。 +- `notebooks/`: 探索性或归档性的 notebook,不作为权威接口说明。 +- `validation_output/`: 生成的验证产物和报告,属于输出,不是核心源码。 + +## 关键入口 + +- `macro_lactone_toolkit.analyzer.MacroLactoneAnalyzer` +- `macro_lactone_toolkit.fragmenter.MacrolactoneFragmenter` +- `macro-lactone-toolkit` CLI + +## 结构约束 + +- 代码和文档都只认 canonical numbering。 +- 16 元环镜像映射按 `p_mirror = ring_size - p + 3` 处理。 +- 不用 `clockwise` / `anticlockwise` 参数来表达编号方向。 + +## 维护提示 + +- `scripts/README.md` 解释脚本层的现状。 +- `docs/project-docs/AGENTS.md` 只是项目文档入口,不是权威 agent 入口。 diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..979a02f --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,39 @@ +# 快速开始 + +这页只放最短路径:安装、验证、第一次调用。 + +## 安装与验证 + +```bash +pixi install +pixi run pytest +pixi run python -c "import macro_lactone_toolkit" +``` + +## 第一次分析 + +```python +from macro_lactone_toolkit import MacroLactoneAnalyzer, MacrolactoneFragmenter + +analyzer = MacroLactoneAnalyzer() +fragmenter = MacrolactoneFragmenter() + +print(analyzer.get_valid_ring_sizes("O=C1CCCCCCCCCCCCCCO1")) +print(fragmenter.number_molecule("O=C1CCCCCCCCCCCCCCO1").position_to_atom) +``` + +```bash +pixi run 
macro-lactone-toolkit analyze --smiles "O=C1CCCCCCCCCCCCCCO1" +pixi run macro-lactone-toolkit number --smiles "O=C1CCCCCCCCCCCCCCO1" +pixi run macro-lactone-toolkit fragment --smiles "O=C1CCCC(C)CCCCCCCCCCO1" --parent-id mol_001 +``` + +## 你需要记住的规则 + +- `1 = 内酯羰基碳` +- `2 = 相邻酯氧` +- `3..N = 从 2 位出发沿环唯一图遍历顺序继续编号` +- 16 元环镜像映射固定为 `p_mirror = ring_size - p + 3` +- 不支持 `clockwise` / `anticlockwise` 参数 + +如果你要继续往下看,去 [user-guide/index.md](user-guide/index.md)。 diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..697dac3 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,29 @@ +# 文档首页 + +这里是文档入口。先按你要解决的问题选路,不要一开始就翻完整套材料。 + +## 快速路径 + +- 想先跑起来: [getting-started.md](getting-started.md) +- 想先确认编号规则: [user-guide/ring-numbering.md](user-guide/ring-numbering.md) +- 想先看仓库结构: [development/project-structure.md](development/project-structure.md) + +## 入口约定 + +这个项目只使用一套 canonical numbering: + +- `1 = 内酯羰基碳` +- `2 = 相邻酯氧` +- `3..N = 从 2 位出发沿环唯一图遍历顺序继续编号` + +对 16 元环,镜像映射是固定的 `p_mirror = ring_size - p + 3`,因此 `6 → 13`、`7 → 12`、`15 → 4`、`16 → 3`。 +公开 API 不支持 `clockwise` / `anticlockwise` 参数。 + +## 这套文档保留什么 + +- `README.md`: 渐进式入口 +- `AGENTS.md`: 唯一权威 agent 入口 +- `user-guide/`: 面向使用者的稳定规则 +- `development/`: 面向维护者的结构说明 + +如果你只想开始干活,先看 [getting-started.md](getting-started.md)。 diff --git a/docs/project-docs/AGENTS.md b/docs/project-docs/AGENTS.md index 2432209..336eba8 100644 --- a/docs/project-docs/AGENTS.md +++ b/docs/project-docs/AGENTS.md @@ -1,275 +1,23 @@ -# AGENTS.md +# Project Docs AGENTS -本文件为 AI 编程助手(如 Claude、Copilot、Cursor 等)提供项目上下文和开发指南。 +This page is a project-docs landing note only. +The authoritative agent entry is the repository root `AGENTS.md`. 
-## 项目概述 +## What belongs here -**Macrolactone Fragmenter** 是一个专业的大环内酯(12-20元环)侧链断裂和分析工具,用于化学信息学研究。 +- Docs-system notes +- Project documentation summaries +- Short commands and maintenance reminders -### 核心功能 -- 智能环原子编号(基于内酯结构) -- 自动侧链断裂分析 -- 分子可视化(SVG/PNG) -- 批量处理和数据导出 +## What does not belong here -## 技术栈 +- Canonical policy overrides +- Alternate numbering rules +- `clockwise` / `anticlockwise` controls -| 组件 | 技术 | -|------|------| -| 语言 | Python 3.8+ | -| 化学库 | RDKit | -| 数据处理 | Pandas, NumPy | -| 可视化 | Matplotlib, Seaborn | -| 环境管理 | Pixi (推荐) / Conda | -| 文档 | MkDocs + Material | -| 测试 | Pytest | -| 代码格式 | Black, Flake8 | +## Stable rule reminder -## 项目结构 - -``` -macro_split/ -├── src/ # 核心源代码 -│ ├── __init__.py # 包初始化 -│ ├── macrolactone_fragmenter.py # ⭐ 主入口类 -│ ├── macro_lactone_analyzer.py # 环数分析器 -│ ├── ring_numbering.py # 环编号系统 -│ ├── ring_visualization.py # 可视化工具 -│ ├── fragment_cleaver.py # 侧链断裂逻辑 -│ ├── fragment_dataclass.py # 碎片数据类 -│ ├── visualizer.py # 统计可视化 -│ └── splicing/ # 分子拼接模块 -│ ├── engine.py # 拼接引擎 -│ ├── scaffold_prep.py # 骨架准备 -│ └── fragment_prep.py # 片段激活 -├── notebooks/ # Jupyter Notebook 示例 -├── scripts/ # 批量处理脚本 -├── tests/ # 单元测试 -├── docs/ # 文档目录 -├── pyproject.toml # 项目配置 -├── pixi.toml # Pixi 环境配置 -└── mkdocs.yml # 文档配置 -``` - -## 核心模块说明 - -### MacrolactoneFragmenter (主入口) -```python -from src.macrolactone_fragmenter import MacrolactoneFragmenter - -fragmenter = MacrolactoneFragmenter(ring_size=16) -result = fragmenter.process_molecule(smiles, parent_id="mol_001") -``` - -### MacroLactoneAnalyzer (环数分析) -```python -from src.macro_lactone_analyzer import MacroLactoneAnalyzer - -analyzer = MacroLactoneAnalyzer() -info = analyzer.get_single_ring_info(smiles) -``` - -### Splicing 模块 (分子拼接) -```python -from src.splicing.scaffold_prep import prepare_tylosin_scaffold -from src.splicing.fragment_prep import activate_fragment -from src.splicing.engine import splice_molecule - -# 准备骨架(移除侧链,标记dummy原子) -scaffold, dummy_map = 
prepare_tylosin_scaffold(smiles, positions=[3, 5, 9]) - -# 激活片段(添加连接点) -fragment = activate_fragment(fragment_smiles, strategy="smart") - -# 拼接分子 -new_mol = splice_molecule(scaffold, fragment, position=3) -``` - -### 数据类结构 -```python -@dataclass -class Fragment: - fragment_smiles: str # 碎片 SMILES - parent_smiles: str # 母分子 SMILES - cleavage_position: int # 断裂位置 (1-N) - fragment_id: str # 碎片 ID - parent_id: str # 母分子 ID - atom_count: int # 原子数 - molecular_weight: float # 分子量 -``` - -## 开发命令 - -### 环境设置 -```bash -# 安装依赖 -pixi install - -# 激活环境 -pixi shell -``` - -### 代码质量 -```bash -# 格式化代码 -pixi run black src/ - -# 代码检查 -pixi run flake8 src/ - -# 运行测试 -pixi run pytest - -# 测试覆盖率 -pixi run pytest --cov=src -``` - -### 文档 -```bash -# 本地预览文档 -pixi run mkdocs serve - -# 构建文档 -pixi run mkdocs build -``` - -## 编码规范 - -### Python 风格 -- 使用 Black 格式化,行宽 100 字符 -- 使用 Google 风格的 docstring -- 类型注解:所有公共函数必须有类型提示 -- 命名:类用 PascalCase,函数/变量用 snake_case - -### Docstring 示例 -```python -def process_molecule(self, smiles: str, parent_id: str = None) -> FragmentResult: - """ - 处理单个分子,进行侧链断裂分析。 - - Args: - smiles: 分子的 SMILES 字符串 - parent_id: 可选的分子标识符 - - Returns: - FragmentResult 对象,包含所有碎片信息 - - Raises: - ValueError: 如果 SMILES 无效或不是目标环大小 - - Example: - >>> fragmenter = MacrolactoneFragmenter(ring_size=16) - >>> result = fragmenter.process_molecule("C1CC...") - """ -``` - -### 导入顺序 -```python -# 1. 标准库 -import json -from pathlib import Path -from typing import List, Dict, Optional - -# 2. 第三方库 -import pandas as pd -import numpy as np -from rdkit import Chem - -# 3. 
本地模块 -from src.fragment_dataclass import Fragment -from src.ring_numbering import RingNumbering -``` - -## 关键概念 - -### 环编号系统 -- **位置 1**: 羰基碳(C=O 中的 C) -- **位置 2**: 酯键氧(环上的 O) -- **位置 3-N**: 按顺序编号环上剩余原子 - -### 支持的环大小 -- 12元环 到 20元环 -- 默认处理 16元环 - -### SMARTS 模式 -```python -# 内酯键 SMARTS(16元环示例) -LACTONE_SMARTS_16 = "[C;R16](=O)[O;R16]" -``` - -## 测试指南 - -### 运行测试 -```bash -# 全部测试 -pixi run pytest - -# 特定模块 -pixi run pytest tests/test_fragmenter.py - -# 详细输出 -pixi run pytest -v - -# 单个测试 -pixi run pytest tests/test_fragmenter.py::test_process_molecule -``` - -### 测试数据 -测试用的 SMILES 示例(16元环大环内酯): -```python -TEST_SMILES = [ - "O=C1CCCCCCCC(=O)OCC/C=C/C=C/1", # 简单 16 元环 - "CCC1OC(=O)C[C@H](O)C(C)[C@@H](O)...", # 复杂结构 -] -``` - -## 常见任务 - -### 添加新功能 -1. 在 `src/` 目录创建或修改模块 -2. 更新 `src/__init__.py` 导出新类/函数 -3. 编写单元测试 -4. 更新文档 - -### 处理新的环大小 -```python -# 在 MacrolactoneFragmenter 中指定环大小 -fragmenter = MacrolactoneFragmenter(ring_size=14) # 14元环 -``` - -### 批量处理 -```python -results = fragmenter.process_csv( - "data/molecules.csv", - smiles_column="smiles", - id_column="unique_id", - max_rows=1000 -) -df = fragmenter.batch_to_dataframe(results) -``` - -## 注意事项 - -### RDKit 依赖 -- RDKit 必须通过 conda/pixi 安装,不支持 pip -- 确保环境中有 RDKit:`python -c "from rdkit import Chem; print('OK')"` - -### 性能考虑 -- 批量处理大数据集时,使用 `process_csv` 方法 -- 处理速度约 ~100 分子/分钟 -- 大规模处理考虑使用 `scripts/batch_process_*.py` - -### 错误处理 -- 无效 SMILES 会抛出 `ValueError` -- 非目标环大小会被跳过 -- 批量处理会记录失败的分子到日志 - -## 相关资源 - -- **文档**: `docs/` 目录或运行 `pixi run mkdocs serve` -- **示例**: `notebooks/filter_molecules.ipynb` -- **脚本**: `scripts/README.md` - ---- - -*最后更新: 2025-01-23* +- `1 = 内酯羰基碳` +- `2 = 相邻酯氧` +- `3..N = 从 2 位出发沿环唯一图遍历顺序继续编号` +- 16 元环镜像映射固定为 `p_mirror = ring_size - p + 3` diff --git a/docs/user-guide/index.md b/docs/user-guide/index.md new file mode 100644 index 0000000..c2900d8 --- /dev/null +++ b/docs/user-guide/index.md @@ -0,0 +1,17 @@ +# 用户指南 + +这里收的是稳定规则,不放临时笔记。 + +## 你最可能要看的内容 + +- [环编号系统](ring-numbering.md) + +## 
这个目录的边界 + +- 这里只描述公开、可复用、会长期维持的行为。 +- 所有编号说明都以 canonical numbering 为准。 +- 如果文献图里用的是反向标注,先把它转换成镜像映射,再和代码结果对照。 + +## 一句话约定 + +这套工具没有 `clockwise` / `anticlockwise` 开关,编号方向不靠参数切换。 diff --git a/docs/user-guide/ring-numbering.md b/docs/user-guide/ring-numbering.md new file mode 100644 index 0000000..ffed9e2 --- /dev/null +++ b/docs/user-guide/ring-numbering.md @@ -0,0 +1,55 @@ +# 环编号系统 + +本项目只采用一套 canonical numbering。它是确定性的,不依赖 `clockwise` / `anticlockwise` 参数,也不是视觉上的方向切换。 + +## 规则 + +- `1 = 内酯羰基碳` +- `2 = 相邻酯氧` +- `3..N = 从 2 位出发沿环唯一图遍历顺序继续编号` + +这条规则在代码、文档、测试和验证输出中都保持一致。 + +## 16 元环镜像映射 + +当你需要把代码里的 canonical numbering 转成文献里常见的反向标注时,使用: + +```text +p_mirror = ring_size - p + 3 +``` + +对 16 元环,这会得到: + +- `3 → 16` +- `4 → 15` +- `5 → 14` +- `6 → 13` +- `7 → 12` +- `8 → 11` +- `9 → 10` +- `10 → 9` +- `11 → 8` +- `12 → 7` +- `13 → 6` +- `14 → 5` +- `15 → 4` +- `16 → 3` + +常见对照点: + +- `6 → 13` +- `7 → 12` +- `15 → 4` +- `16 → 3` + +## 使用建议 + +- 先在代码和数据库里保留 canonical numbering。 +- 只在图注、论文、汇报里按需要加镜像标签。 +- 不要把方向差异实现成 API 参数。 +- 如果分子是 bridge / fused multi-anchor,先把结构语义说明清楚,再讨论编号可视化。 + +## 最短结论 + +如果你在看代码,记 canonical numbering。 +如果你在对照文献,记 `p_mirror = ring_size - p + 3`。 diff --git a/mkdocs.yml b/mkdocs.yml index c81b50a..7463ff2 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -130,49 +130,25 @@ nav: - 首页: - index.md - 快速开始: getting-started.md - - 安装指南: installation.md - 用户指南: - user-guide/index.md - - MacrolactoneFragmenter 使用: user-guide/fragmenter-usage.md - 环编号系统: user-guide/ring-numbering.md - - 可视化功能: user-guide/visualization.md - - 批量处理: user-guide/batch-processing.md - - 数据导出: user-guide/data-export.md - - - 教程与示例: - - tutorials/index.md - - 基础教程: tutorials/basic-tutorial.md - - 环数识别教程: tutorials/using-macro-lactone-analyzer.md - - 高级用法: tutorials/advanced-usage.md - - 使用案例: tutorials/use-cases.md - - - API 参考: - - api/index.md - - MacroLactoneAnalyzer: api/macro-lactone-analyzer.md - - MacrolactoneFragmenter: api/macrolactone-fragmenter.md - - Fragment 数据类: 
api/fragment-dataclass.md - - 环编号模块: api/ring-numbering.md - - 可视化模块: api/ring-visualization.md - - 工具函数: api/utilities.md - 开发者指南: - development/index.md - - 贡献指南: development/contributing.md - 项目结构: development/project-structure.md - - 测试: development/testing.md - 项目文档: - project-docs/AGENTS.md - - 实现总结: project-docs/IMPLEMENTATION_SUMMARY.md - - 清理总结: project-docs/CLEANUP_SUMMARY.md - - 文档指南: project-docs/DOCUMENTATION_GUIDE.md - - 快速命令: project-docs/QUICK_COMMANDS.md - - 关于: - - about/index.md - - 更新日志: about/changelog.md - - 许可证: about/license.md +not_in_nav: | + /SUMMARY.md + /plans/ + /project-docs/CLEANUP_SUMMARY.md + /project-docs/DOCUMENTATION_GUIDE.md + /project-docs/IMPLEMENTATION_SUMMARY.md + /project-docs/QUICK_COMMANDS.md # Extra extra: @@ -201,4 +177,3 @@ extra: note: >- Thanks for your feedback! Help us improve this page by creating an issue. - diff --git a/pixi.toml b/pixi.toml index 3fcc455..9f2edf8 100644 --- a/pixi.toml +++ b/pixi.toml @@ -24,3 +24,7 @@ sqlmodel = ">=0.0.37,<0.0.38" [pypi-dependencies] macro_lactone_toolkit = { path = ".", editable = true } +mkdocs = ">=1.6,<2" +mkdocs-material = ">=9.6,<10" +mkdocstrings = ">=0.28,<0.29" +mkdocstrings-python = ">=1.16,<2" diff --git a/scripts/analyze_validation_fragment_library.py b/scripts/analyze_validation_fragment_library.py index 4930613..98de199 100644 --- a/scripts/analyze_validation_fragment_library.py +++ b/scripts/analyze_validation_fragment_library.py @@ -438,6 +438,16 @@ def format_top_positions(table: pd.DataFrame, sort_column: str, limit: int = 5) return subset.to_string(index=False) +def mirror_ring_position(position: int, ring_size: int) -> int: + if position <= 2: + return position + return ring_size - position + 3 + + +def format_position_mapping(positions: list[int], ring_size: int) -> str: + return ", ".join(f"{position} → {mirror_ring_position(position, ring_size)}" for position in positions) + + def build_markdown_report( output_dir: Path, analysis_df: pd.DataFrame, @@ 
-559,6 +569,15 @@ def build_markdown_report( "- Position 15: supported as a **frequent modification site**, but the retained chemotypes are concentrated into a small number of acyl substituents.", "- Position 16: not prevalent in the current database, but the few retained fragments are structurally distinct singletons; this makes it a **low-evidence exploratory site**, not a high-confidence natural hotspot.", "", + "## Numbering Alignment With Medicinal-Chemistry Labels", + "", + "- The codebase uses one canonical numbering rule: position 1 is the lactone carbonyl carbon, position 2 is the ester oxygen, and positions 3..N follow the unique ring traversal that starts from position 2 in `build_numbering_result()`.", + "- If a medicinal-chemistry scheme keeps positions 1 and 2 fixed but numbers the rest of the ring in the mirrored direction, then the conversion for positions >=3 is `p_mirror = ring_size - p + 3`.", + f"- For a {ring_size}-membered ring, literature labels `{','.join(str(position) for position in hotspot_table['cleavage_position'].tolist())}` map to current-code labels `{format_position_mapping(hotspot_table['cleavage_position'].tolist(), ring_size)}`.", + f"- Conversely, the current-code natural-diversity hotspots `13, 3, 4, 12` correspond to mirrored medicinal-chemistry labels `6, 16, 15, 7` in a {ring_size}-membered ring.", + "- This means the apparent disagreement was a numbering-direction mismatch, not a chemical contradiction between the database analysis and the literature-guided hotspot list.", + "- Practical rule: keep the database and cleavage-position statistics in the current canonical code numbering, but add mirrored medicinal-chemistry labels in figures, tables and manuscripts whenever you compare against literature.", + "", "## Figure 6. 
Are the Top Positions Driven by Ring-Bearing Side Chains?", "", f"![Ring {ring_size} ring sensitivity](ring{ring_size}_position_ring_sensitivity.png)", @@ -616,6 +635,7 @@ def build_markdown_report( "### Recommended paper-safe wording", "", f"> In the validated MacrolactoneDB fragment library, natural side-chain diversity of {ring_size}-membered macrolactones is concentrated primarily at positions 13, 3/4 and 12. After excluding fragments with <=3 heavy atoms to focus on design-relevant substituents, position 6 remains strongly diversity-enriched and position 15 remains frequency-enriched, whereas positions 7 and 16 are sparse and should be interpreted as literature-guided derivatization sites rather than statistically dominant natural hotspots.", + f"> If medicinal-chemistry labels are reported in the mirrored direction, those natural-diversity hotspots correspond to literature labels 6, 16, 15 and 7, while literature hotspot labels 6, 7, 15 and 16 correspond to current-code positions 13, 12, 4 and 3.", "", "### Practical interpretation for fragment-library design", "", @@ -624,6 +644,7 @@ def build_markdown_report( f"- For 16-membered macrolide design, prioritize positions **13, 3, 4, 12 and 6** for natural-diversity-driven fragment mining.", "- Keep positions **15** as a targeted acyl-modification site even though its chemotype diversity is narrower.", "- Treat positions **7 and 16** as hypothesis-driven medicinal chemistry positions that need literature or synthesis justification beyond database prevalence.", + "- When comparing to literature numbering, either rerun the hotspot panel with mirrored positions or label every reported position as `code_position (medchem_position)` to avoid directional ambiguity.", "", ] return "\n".join(lines) @@ -698,6 +719,15 @@ def build_markdown_report_zh( "- 其中 6 位最能支持你的药化判断:它不是最高频位点,但在设计相关大侧链中显示出很高的结构多样性。", "- 15 位则更偏向高频但低多样性的酰基修饰位点。", "", + "## 编号校准说明(代码编号 vs 药化编号)", + "", + "- 当前代码和数据库采用统一编号:`1 = 内酯羰基碳`,`2 = 相邻酯氧`,`3..N` 则从 
2 位出发沿环的唯一遍历顺序继续编号。", + "- 如果药化文献同样固定 1 和 2 位,但把 `3..N` 按相反方向编号,则对于 `p >= 3` 有镜像换算公式:`p_镜像 = ring_size - p + 3`。", + f"- 对于 {ring_size} 元环,你关心的药化位点 `{','.join(str(position) for position in hotspot_table['cleavage_position'].tolist())}`,在当前代码编号下对应为:`{format_position_mapping(hotspot_table['cleavage_position'].tolist(), ring_size)}`。", + f"- 反过来,当前代码编号下的天然多样性热点 `13、3、4、12`,在药化镜像编号下分别对应 `6、16、15、7`。", + "- 因此,之前看起来对不上的 `13、3、4、12` 与 `6、7、15、16`,本质上是同一组位点的方向镜像,不是化学结论冲突。", + "- 建议后续统一规则:数据库、断裂结果、拼接和模型训练一律使用当前代码编号;论文、图表和药化讨论中若需对照文献,再同时标注镜像药化编号。", + "", "## 桥环 / 稠环干扰的敏感性分析", "", "桥连或双锚点侧链不会进入当前片段库,因为断裂逻辑只保留与主环存在 **1 个连接点** 的侧链组件。也就是说,真正的 bridge / fused multi-anchor components 已被代码层面排除。", @@ -756,9 +786,10 @@ def build_markdown_report_zh( "", "## 建议的论文表述方式", "", - "- 若讨论天然产物中的侧链多样性,可写为:`16 元大环内酯的天然侧链多样性主要集中在 13、3/4 和 12 位,并在 6 位保留较强的设计相关多样性。`", - "- 若讨论药化半合成改造热点,可写为:`6、7、15、16 位代表文献和先导化合物研究中优先使用的衍生化位点,其中 6 和 15 位在数据库统计中分别对应高多样性和高频率信号,而 7 和 16 位更多体现为文献指导的探索性位点。`", + "- 若讨论天然产物中的侧链多样性,可写为:`按当前代码编号,16 元大环内酯的天然侧链多样性主要集中在 13、3/4 和 12 位,并在 6 位保留较强的设计相关多样性;若换成药化镜像编号,则对应为 6、16/15 和 7 位。`", + "- 若讨论药化半合成改造热点,可写为:`按药化镜像编号,6、7、15、16 位代表文献和先导化合物研究中优先使用的衍生化位点;在当前代码编号下,它们对应 13、12、4、3 位。`", "- 若专门讨论非环状侧链设计,则应强调:`在排除 <=3 重原子小片段并进一步排除带环侧链后,15 位是最主要的非环状侧链修饰位点。`", + "- 若在图表中同时展示两套体系,建议统一写成:`代码编号 13(药化 6)` 这类双标签格式,而不要在同一表中混用单独编号。", "", "## 相关图表", "", diff --git a/src/macro_lactone_toolkit/__init__.py b/src/macro_lactone_toolkit/__init__.py index 734c93c..2d9765f 100644 --- a/src/macro_lactone_toolkit/__init__.py +++ b/src/macro_lactone_toolkit/__init__.py @@ -13,6 +13,11 @@ from .models import ( RingNumberingResult, SideChainFragment, ) +from .numbering import ( + mirror_macrolactone_position, + mirror_macrolactone_positions, + number_macrolactone, +) from .visualization import ( fragment_svg, numbered_molecule_svg, @@ -38,6 +43,9 @@ __all__ = [ "MacrolactoneFragmenter", "MacrolactoneValidator", "MacrocycleClassificationResult", + "mirror_macrolactone_position", + 
"mirror_macrolactone_positions", + "number_macrolactone", "numbered_molecule_svg", "ParentMolecule", "RingNumberingError", diff --git a/src/macro_lactone_toolkit/fragmenter.py b/src/macro_lactone_toolkit/fragmenter.py index de08d58..1da3a2c 100644 --- a/src/macro_lactone_toolkit/fragmenter.py +++ b/src/macro_lactone_toolkit/fragmenter.py @@ -14,6 +14,7 @@ from ._core import ( from .analyzer import MacroLactoneAnalyzer from .errors import AmbiguousMacrolactoneError, FragmentationError, MacrolactoneDetectionError from .models import FragmentationResult, RingNumberingResult, SideChainFragment +from .numbering import number_macrolactone class MacrolactoneFragmenter: @@ -26,9 +27,7 @@ class MacrolactoneFragmenter: self.analyzer = MacroLactoneAnalyzer() def number_molecule(self, mol_input: str | Chem.Mol) -> RingNumberingResult: - mol, _ = ensure_mol(mol_input) - candidate = self._select_candidate(mol) - return build_numbering_result(mol, candidate) + return number_macrolactone(mol_input, ring_size=self.ring_size) def fragment_molecule( self, diff --git a/src/macro_lactone_toolkit/numbering.py b/src/macro_lactone_toolkit/numbering.py new file mode 100644 index 0000000..254034e --- /dev/null +++ b/src/macro_lactone_toolkit/numbering.py @@ -0,0 +1,70 @@ +from __future__ import annotations + +from collections.abc import Iterable + +from rdkit import Chem + +from ._core import build_numbering_result, classify_macrolactone, ensure_mol, find_macrolactone_candidates +from .errors import AmbiguousMacrolactoneError, MacrolactoneDetectionError +from .models import RingNumberingResult + + +def number_macrolactone( + mol_input: str | Chem.Mol, + ring_size: int | None = None, +) -> RingNumberingResult: + """ + Return the canonical ring numbering for a supported macrolactone. + + Canonical numbering is fixed across the project: + 1 = lactone carbonyl carbon, 2 = ester oxygen, and 3..N continue by the + unique graph traversal that starts from position 2. 
This API does not + expose clockwise/anticlockwise options; mirrored medicinal-chemistry labels + should be handled through the mirror helpers in this module. + """ + + mol, smiles = ensure_mol(mol_input) + classification = classify_macrolactone(mol, smiles=smiles, ring_size=ring_size) + if classification.classification != "standard_macrolactone": + raise MacrolactoneDetectionError( + "Macrolactone rejected: " + f"classification={classification.classification} " + f"primary_reason_code={classification.primary_reason_code}" + ) + + candidates = find_macrolactone_candidates(mol, ring_size=ring_size) + valid_ring_sizes = sorted({candidate.ring_size for candidate in candidates}) + if len(candidates) > 1 or len(valid_ring_sizes) > 1: + raise AmbiguousMacrolactoneError( + "Multiple valid macrolactone candidates were detected. Pass ring_size explicitly." + ) + return build_numbering_result(mol, candidates[0]) + + +def mirror_macrolactone_position(position: int, ring_size: int) -> int: + """ + Convert a canonical ring position to its mirrored medicinal-chemistry label. + + Positions 1 and 2 are invariant. For positions >= 3, the mirrored label is + computed as `ring_size - position + 3`. + """ + + if position <= 2: + return position + return ring_size - position + 3 + + +def mirror_macrolactone_positions( + positions: Iterable[int], + ring_size: int, +) -> dict[int, int]: + """ + Convert multiple canonical positions to mirrored medicinal-chemistry labels. + + The input order is preserved in the returned mapping. 
+ """ + + return { + int(position): mirror_macrolactone_position(int(position), ring_size) + for position in positions + } diff --git a/src/macro_lactone_toolkit/splicing/scaffold_prep.py b/src/macro_lactone_toolkit/splicing/scaffold_prep.py index 2367fcd..3d5bb55 100644 --- a/src/macro_lactone_toolkit/splicing/scaffold_prep.py +++ b/src/macro_lactone_toolkit/splicing/scaffold_prep.py @@ -5,7 +5,7 @@ from typing import Iterable from rdkit import Chem from .._core import collect_fragmentable_side_chain_atoms, ensure_mol, find_macrolactone_candidates, is_intrinsic_lactone_neighbor -from ..fragmenter import MacrolactoneFragmenter +from ..numbering import number_macrolactone def prepare_macrolactone_scaffold( @@ -15,8 +15,7 @@ def prepare_macrolactone_scaffold( ) -> tuple[Chem.Mol, dict[int, int]]: positions = list(positions) mol, _ = ensure_mol(mol_input) - fragmenter = MacrolactoneFragmenter(ring_size=ring_size) - numbering = fragmenter.number_molecule(mol) + numbering = number_macrolactone(mol, ring_size=ring_size) candidate = find_macrolactone_candidates(mol, ring_size=numbering.ring_size)[0] ring_atom_set = set(numbering.ordered_atoms) diff --git a/src/macro_lactone_toolkit/validation/validator.py b/src/macro_lactone_toolkit/validation/validator.py index be956ed..da595cd 100644 --- a/src/macro_lactone_toolkit/validation/validator.py +++ b/src/macro_lactone_toolkit/validation/validator.py @@ -11,11 +11,11 @@ from sqlmodel import select from macro_lactone_toolkit import MacroLactoneAnalyzer from macro_lactone_toolkit._core import ( - build_numbering_result, collect_fragmentable_side_chain_atoms, find_macrolactone_candidates, is_intrinsic_lactone_neighbor, ) +from macro_lactone_toolkit.numbering import number_macrolactone from macro_lactone_toolkit.validation.database import get_engine, get_session, init_database from macro_lactone_toolkit.validation.isotope_utils import build_fragment_with_isotope from macro_lactone_toolkit.validation.models import ( @@ -163,7 
+163,7 @@ class MacrolactoneValidator: candidate = candidates[0] # Get numbering - numbering = build_numbering_result(mol, candidate) + numbering = number_macrolactone(mol, ring_size=parent.ring_size) # Save numbering to database numbering_record = RingNumbering( diff --git a/tests/test_documentation_entrypoints.py b/tests/test_documentation_entrypoints.py new file mode 100644 index 0000000..d0794eb --- /dev/null +++ b/tests/test_documentation_entrypoints.py @@ -0,0 +1,34 @@ +from __future__ import annotations + +from pathlib import Path + + +PROJECT_ROOT = Path(__file__).resolve().parents[1] + + +def test_root_readme_documents_canonical_numbering() -> None: + readme = (PROJECT_ROOT / "README.md").read_text(encoding="utf-8") + + assert "1 = 内酯羰基碳" in readme + assert "2 = 相邻酯氧" in readme + assert "3..N = 从 2 位出发沿环唯一图遍历顺序继续编号" in readme + assert "6 → 13" in readme + assert "7 → 12" in readme + + +def test_root_agents_exists_and_documents_numbering_invariants() -> None: + agents_path = PROJECT_ROOT / "AGENTS.md" + + assert agents_path.exists() + agents = agents_path.read_text(encoding="utf-8") + assert "canonical numbering" in agents + assert "不是视觉顺时针" in agents + assert "bridge / fused multi-anchor" in agents + + +def test_mkdocs_ring_numbering_page_documents_mirror_mapping() -> None: + ring_doc = (PROJECT_ROOT / "docs" / "user-guide" / "ring-numbering.md").read_text(encoding="utf-8") + + assert "p_mirror = ring_size - p + 3" in ring_doc + assert "6 → 13" in ring_doc + assert "15 → 4" in ring_doc diff --git a/tests/test_numbering_api.py b/tests/test_numbering_api.py new file mode 100644 index 0000000..a697558 --- /dev/null +++ b/tests/test_numbering_api.py @@ -0,0 +1,52 @@ +from __future__ import annotations + +from macro_lactone_toolkit import ( + MacrolactoneFragmenter, + mirror_macrolactone_position, + mirror_macrolactone_positions, + number_macrolactone, +) +from macro_lactone_toolkit.splicing.scaffold_prep import prepare_macrolactone_scaffold + +from .helpers 
import build_macrolactone + + +def test_number_macrolactone_matches_fragmenter_numbering() -> None: + built = build_macrolactone(16, {5: "methyl"}) + + api_result = number_macrolactone(built.smiles, ring_size=16) + fragmenter_result = MacrolactoneFragmenter(ring_size=16).number_molecule(built.smiles) + + assert api_result.position_to_atom == fragmenter_result.position_to_atom + assert api_result.atom_to_position == fragmenter_result.atom_to_position + + +def test_mirror_macrolactone_position_for_ring16() -> None: + assert mirror_macrolactone_position(6, 16) == 13 + assert mirror_macrolactone_position(7, 16) == 12 + assert mirror_macrolactone_position(15, 16) == 4 + assert mirror_macrolactone_position(16, 16) == 3 + + +def test_mirror_macrolactone_positions_returns_stable_mapping() -> None: + assert mirror_macrolactone_positions([6, 7, 15, 16], 16) == { + 6: 13, + 7: 12, + 15: 4, + 16: 3, + } + + +def test_prepare_scaffold_keeps_requested_position_label() -> None: + built = build_macrolactone(16, {5: "ethyl"}) + + scaffold, dummy_map = prepare_macrolactone_scaffold( + built.mol, + positions=[5], + ring_size=16, + ) + + numbering = number_macrolactone(built.mol, ring_size=16) + assert 5 in dummy_map + assert numbering.position_to_atom[5] == built.position_to_atom[5] + assert numbering.position_to_atom[5] == scaffold.GetAtomWithIdx(dummy_map[5]).GetNeighbors()[0].GetIdx() diff --git a/tests/test_scripts_and_docs.py b/tests/test_scripts_and_docs.py index 59e57f7..0336882 100644 --- a/tests/test_scripts_and_docs.py +++ b/tests/test_scripts_and_docs.py @@ -285,6 +285,10 @@ def test_analyze_validation_fragment_library_script_generates_reports(tmp_path): report_zh = (output_dir / "fragment_library_analysis_report_zh.md").read_text(encoding="utf-8") assert "桥连或双锚点侧链不会进入当前片段库" in report_zh assert "cyclic single-anchor side chains" in report_zh + assert "6 → 13" in report_zh + assert "7 → 12" in report_zh + assert "15 → 4" in report_zh + assert "16 → 3" in report_zh def 
test_active_text_assets_do_not_reference_legacy_api(): diff --git a/validation_output/fragment_library_analysis/fragment_library_analysis_report.md b/validation_output/fragment_library_analysis/fragment_library_analysis_report.md index 07dbf79..23b2ca5 100644 --- a/validation_output/fragment_library_analysis/fragment_library_analysis_report.md +++ b/validation_output/fragment_library_analysis/fragment_library_analysis_report.md @@ -84,6 +84,15 @@ This panel focuses on positions 6, 7, 15 and 16 because these are the literature - Position 15: supported as a **frequent modification site**, but the retained chemotypes are concentrated into a small number of acyl substituents. - Position 16: not prevalent in the current database, but the few retained fragments are structurally distinct singletons; this makes it a **low-evidence exploratory site**, not a high-confidence natural hotspot. +## Numbering Alignment With Medicinal-Chemistry Labels + +- The codebase uses one canonical numbering rule: position 1 is the lactone carbonyl carbon, position 2 is the ester oxygen, and positions 3..N follow the unique ring traversal that starts from position 2 in `build_numbering_result()`. +- If a medicinal-chemistry scheme keeps positions 1 and 2 fixed but numbers the rest of the ring in the mirrored direction, then the conversion for positions >=3 is `p_mirror = ring_size - p + 3`. +- For a 16-membered ring, literature labels `6,7,15,16` map to current-code labels `6 → 13, 7 → 12, 15 → 4, 16 → 3`. +- Conversely, the current-code natural-diversity hotspots `13, 3, 4, 12` correspond to mirrored medicinal-chemistry labels `6, 16, 15, 7` in a 16-membered ring. +- This means the apparent disagreement was a numbering-direction mismatch, not a chemical contradiction between the database analysis and the literature-guided hotspot list. 
+- Practical rule: keep the database and cleavage-position statistics in the current canonical code numbering, but add mirrored medicinal-chemistry labels in figures, tables and manuscripts whenever you compare against literature. + ## Figure 6. Are the Top Positions Driven by Ring-Bearing Side Chains? ![Ring 16 ring sensitivity](ring16_position_ring_sensitivity.png) @@ -140,6 +149,7 @@ The earlier statement that `6,7,15,16` are important 16-membered macrolide modif ### Recommended paper-safe wording > In the validated MacrolactoneDB fragment library, natural side-chain diversity of 16-membered macrolactones is concentrated primarily at positions 13, 3/4 and 12. After excluding fragments with <=3 heavy atoms to focus on design-relevant substituents, position 6 remains strongly diversity-enriched and position 15 remains frequency-enriched, whereas positions 7 and 16 are sparse and should be interpreted as literature-guided derivatization sites rather than statistically dominant natural hotspots. +> If medicinal-chemistry labels are reported in the mirrored direction, those natural-diversity hotspots correspond to literature labels 6, 16, 15 and 7, while literature hotspot labels 6, 7, 15 and 16 correspond to current-code positions 13, 12, 4 and 3. ### Practical interpretation for fragment-library design @@ -148,4 +158,5 @@ The earlier statement that `6,7,15,16` are important 16-membered macrolide modif - For 16-membered macrolide design, prioritize positions **13, 3, 4, 12 and 6** for natural-diversity-driven fragment mining. - Keep positions **15** as a targeted acyl-modification site even though its chemotype diversity is narrower. - Treat positions **7 and 16** as hypothesis-driven medicinal chemistry positions that need literature or synthesis justification beyond database prevalence. 
+- When comparing to literature numbering, either rerun the hotspot panel with mirrored positions or label every reported position as `code_position (medchem_position)` to avoid directional ambiguity. diff --git a/validation_output/fragment_library_analysis/fragment_library_analysis_report_zh.md b/validation_output/fragment_library_analysis/fragment_library_analysis_report_zh.md index 9775fdb..9c4e383 100644 --- a/validation_output/fragment_library_analysis/fragment_library_analysis_report_zh.md +++ b/validation_output/fragment_library_analysis/fragment_library_analysis_report_zh.md @@ -30,6 +30,15 @@ - 其中 6 位最能支持你的药化判断:它不是最高频位点,但在设计相关大侧链中显示出很高的结构多样性。 - 15 位则更偏向高频但低多样性的酰基修饰位点。 +## 编号校准说明(代码编号 vs 药化编号) + +- 当前代码和数据库采用统一编号:`1 = 内酯羰基碳`,`2 = 相邻酯氧`,`3..N` 则从 2 位出发沿环的唯一遍历顺序继续编号。 +- 如果药化文献同样固定 1 和 2 位,但把 `3..N` 按相反方向编号,则对于 `p >= 3` 有镜像换算公式:`p_镜像 = ring_size - p + 3`。 +- 对于 16 元环,你关心的药化位点 `6,7,15,16`,在当前代码编号下对应为:`6 → 13, 7 → 12, 15 → 4, 16 → 3`。 +- 反过来,当前代码编号下的天然多样性热点 `13、3、4、12`,在药化镜像编号下分别对应 `6、16、15、7`。 +- 因此,之前看起来对不上的 `13、3、4、12` 与 `6、7、15、16`,本质上是同一组位点的方向镜像,不是化学结论冲突。 +- 建议后续统一规则:数据库、断裂结果、拼接和模型训练一律使用当前代码编号;论文、图表和药化讨论中若需对照文献,再同时标注镜像药化编号。 + ## 桥环 / 稠环干扰的敏感性分析 桥连或双锚点侧链不会进入当前片段库,因为断裂逻辑只保留与主环存在 **1 个连接点** 的侧链组件。也就是说,真正的 bridge / fused multi-anchor components 已被代码层面排除。 @@ -88,9 +97,10 @@ ## 建议的论文表述方式 -- 若讨论天然产物中的侧链多样性,可写为:`16 元大环内酯的天然侧链多样性主要集中在 13、3/4 和 12 位,并在 6 位保留较强的设计相关多样性。` -- 若讨论药化半合成改造热点,可写为:`6、7、15、16 位代表文献和先导化合物研究中优先使用的衍生化位点,其中 6 和 15 位在数据库统计中分别对应高多样性和高频率信号,而 7 和 16 位更多体现为文献指导的探索性位点。` +- 若讨论天然产物中的侧链多样性,可写为:`按当前代码编号,16 元大环内酯的天然侧链多样性主要集中在 13、3/4 和 12 位,并在 6 位保留较强的设计相关多样性;若换成药化镜像编号,则对应为 6、16/15 和 7 位。` +- 若讨论药化半合成改造热点,可写为:`按药化镜像编号,6、7、15、16 位代表文献和先导化合物研究中优先使用的衍生化位点;在当前代码编号下,它们对应 13、12、4、3 位。` - 若专门讨论非环状侧链设计,则应强调:`在排除 <=3 重原子小片段并进一步排除带环侧链后,15 位是最主要的非环状侧链修饰位点。` +- 若在图表中同时展示两套体系,建议统一写成:`代码编号 13(药化 6)` 这类双标签格式,而不要在同一表中混用单独编号。 ## 相关图表