feat(numbering): publish canonical numbering API

Add a public numbering module and route fragmenting, validation, and scaffold preparation through the canonical numbering entry. Rewrite the repository entry docs around the fixed numbering contract, add MkDocs landing pages, and document the mirror mapping used for medicinal-chemistry comparisons. Also refresh the validation analysis reports to explain the canonical-versus-mirrored numbering relationship.
2026-03-20 15:14:31 +08:00
parent 8071a141ee
commit 3e07402f4e
22 changed files with 529 additions and 444 deletions
--- a/validation_output/fragment_library_analysis/fragment_library_analysis_report.md
+++ b/validation_output/fragment_library_analysis/fragment_library_analysis_report.md
@@ -84,6 +84,15 @@ This panel focuses on positions 6, 7, 15 and 16 because these are the literature
 - Position 15: supported as a **frequent modification site**, but the retained chemotypes are concentrated into a small number of acyl substituents.
 - Position 16: not prevalent in the current database, but the few retained fragments are structurally distinct singletons; this makes it a **low-evidence exploratory site**, not a high-confidence natural hotspot.

+## Numbering Alignment With Medicinal-Chemistry Labels
+
+- The codebase uses one canonical numbering rule: position 1 is the lactone carbonyl carbon, position 2 is the ester oxygen, and positions 3..N follow the unique ring traversal that starts from position 2 in `build_numbering_result()`.
+- If a medicinal-chemistry scheme keeps positions 1 and 2 fixed but numbers the rest of the ring in the mirrored direction, then the conversion for positions >=3 is `p_mirror = ring_size - p + 3`. The formula is its own inverse, so it converts in either direction.
+- For a 16-membered ring, the literature labels `6, 7, 15, 16` map to current-code positions as `6 → 13, 7 → 12, 15 → 4, 16 → 3`.
+- Conversely, the current-code natural-diversity hotspots `13, 3, 4, 12` correspond to mirrored medicinal-chemistry labels `6, 16, 15, 7` in a 16-membered ring.
+- This means the apparent disagreement was a numbering-direction mismatch, not a chemical contradiction between the database analysis and the literature-guided hotspot list.
+- Practical rule: keep the database and cleavage-position statistics in the current canonical code numbering, but add mirrored medicinal-chemistry labels in figures, tables and manuscripts whenever you compare against literature.
+
 ## Figure 6. Are the Top Positions Driven by Ring-Bearing Side Chains?

 ![Ring 16 ring sensitivity](ring16_position_ring_sensitivity.png)
@@ -140,6 +149,7 @@ The earlier statement that `6,7,15,16` are important 16-membered macrolide modif
 ### Recommended paper-safe wording

 > In the validated MacrolactoneDB fragment library, natural side-chain diversity of 16-membered macrolactones is concentrated primarily at positions 13, 3/4 and 12. After excluding fragments with <=3 heavy atoms to focus on design-relevant substituents, position 6 remains strongly diversity-enriched and position 15 remains frequency-enriched, whereas positions 7 and 16 are sparse and should be interpreted as literature-guided derivatization sites rather than statistically dominant natural hotspots.
+> If medicinal-chemistry labels are reported in the mirrored direction, those natural-diversity hotspots correspond to literature labels 6, 16, 15 and 7, while literature hotspot labels 6, 7, 15 and 16 correspond to current-code positions 13, 12, 4 and 3.

 ### Practical interpretation for fragment-library design

@@ -148,4 +158,5 @@ The earlier statement that `6,7,15,16` are important 16-membered macrolide modif
 - For 16-membered macrolide design, prioritize positions **13, 3, 4, 12 and 6** for natural-diversity-driven fragment mining.
 - Keep position **15** as a targeted acyl-modification site even though its chemotype diversity is narrower.
 - Treat positions **7 and 16** as hypothesis-driven medicinal chemistry positions that need literature or synthesis justification beyond database prevalence.
+- When comparing to literature numbering, either rerun the hotspot panel with mirrored positions or label every reported position as `code_position (medchem_position)` to avoid directional ambiguity.
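
The mirror conversion and the `code_position (medchem_position)` labeling convention described in this report can be sketched as follows. This is a minimal illustration that assumes only the formula `p_mirror = ring_size - p + 3` stated in the report. The helper names `mirror_position` and `dual_label` are hypothetical and are not taken from the repository's numbering module.

```python
def mirror_position(position: int, ring_size: int) -> int:
    """Map a ring position between canonical and mirrored numbering.

    Hypothetical helper: positions 1 and 2 stay fixed, and positions
    3..ring_size are reflected with p_mirror = ring_size - p + 3.
    The mapping is its own inverse, so it works in both directions.
    """
    if ring_size < 3:
        raise ValueError("ring_size must be at least 3")
    if not 1 <= position <= ring_size:
        raise ValueError(f"position {position} is outside 1..{ring_size}")
    if position <= 2:
        return position
    return ring_size - position + 3


def dual_label(code_position: int, ring_size: int) -> str:
    """Format a position as 'code_position (medchem_position)'."""
    return f"{code_position} ({mirror_position(code_position, ring_size)})"


if __name__ == "__main__":
    # Literature labels for a 16-membered ring, converted to code positions.
    for literature_label in (6, 7, 15, 16):
        print(literature_label, "->", mirror_position(literature_label, 16))
    # Code-numbered hotspots with their mirrored labels attached.
    print([dual_label(p, 16) for p in (13, 3, 4, 12)])
```

Because the reflection is symmetric, a single function covers both the literature-to-code and code-to-literature directions, which keeps figures and tables from depending on two separate lookup tables.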

--- a/validation_output/fragment_library_analysis/fragment_library_analysis_report_zh.md
+++ b/validation_output/fragment_library_analysis/fragment_library_analysis_report_zh.md
@@ -30,6 +30,15 @@
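 - Position 6 gives the strongest support for your medicinal-chemistry judgment: it is not the most frequent site, but it shows very high structural diversity among design-relevant large side chains.
 - Position 15, by contrast, is more of a high-frequency, low-diversity acyl-modification site.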

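+## Numbering Calibration Notes (Code Numbering vs. Medicinal-Chemistry Numbering)
+
+- The current code and database use a single numbering scheme: `1 = lactone carbonyl carbon`, `2 = adjacent ester oxygen`, and `3..N` continue from position 2 along the unique ring traversal order.
+- If the medicinal-chemistry literature also fixes positions 1 and 2 but numbers `3..N` in the opposite direction, then for `p >= 3` the mirror conversion formula is `p_mirror = ring_size - p + 3`.
+- For a 16-membered ring, the medicinal-chemistry positions you care about, `6, 7, 15, 16`, correspond in the current code numbering to `6 → 13, 7 → 12, 15 → 4, 16 → 3`.
+- Conversely, the natural-diversity hotspots `13, 3, 4, 12` in the current code numbering correspond to `6, 16, 15, 7`, respectively, in the mirrored medicinal-chemistry numbering.
+- The earlier apparent mismatch between `13, 3, 4, 12` and `6, 7, 15, 16` is therefore the same set of positions mirrored in direction, not a conflict in chemical conclusions.
+- Recommended rule going forward: use the current code numbering throughout for the database, cleavage results, assembly, and model training; when papers, figures, and medicinal-chemistry discussions need to be compared against the literature, add the mirrored medicinal-chemistry numbering alongside.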
+
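 ## Sensitivity Analysis for Bridged / Fused Ring Interference

 Bridged or dual-anchor side chains do not enter the current fragment library, because the cleavage logic keeps only side-chain components that have **1 attachment point** to the main ring. In other words, true bridge / fused multi-anchor components are already excluded at the code level.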
@@ -88,9 +97,10 @@

 ## Suggested Wording for the Paper

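-- When discussing side-chain diversity in natural products, one can write: `The natural side-chain diversity of 16-membered macrolactones is concentrated mainly at positions 13, 3/4 and 12, and position 6 retains strong design-relevant diversity.`
-- When discussing medicinal-chemistry semisynthetic modification hotspots, one can write: `Positions 6, 7, 15 and 16 represent the derivatization sites preferentially used in the literature and in lead-compound studies; positions 6 and 15 correspond to high-diversity and high-frequency signals, respectively, in the database statistics, whereas positions 7 and 16 act more as literature-guided exploratory sites.`
+- When discussing side-chain diversity in natural products, one can write: `In the current code numbering, the natural side-chain diversity of 16-membered macrolactones is concentrated mainly at positions 13, 3/4 and 12, and position 6 retains strong design-relevant diversity; in the mirrored medicinal-chemistry numbering, these correspond to positions 6, 16/15 and 7.`
+- When discussing medicinal-chemistry semisynthetic modification hotspots, one can write: `In the mirrored medicinal-chemistry numbering, positions 6, 7, 15 and 16 represent the derivatization sites preferentially used in the literature and in lead-compound studies; in the current code numbering, they correspond to positions 13, 12, 4 and 3.`
 - When specifically discussing acyclic side-chain design, emphasize: `After excluding small fragments with <=3 heavy atoms and further excluding ring-bearing side chains, position 15 is the dominant acyclic side-chain modification site.`
+- When a figure or table shows both systems, use a uniform dual-label format such as `code position 13 (medchem 6)` rather than mixing standalone numbers within the same table.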

 ## Related Figures and Tables