feat(validation): archive key result assets
Keep key validation outputs and analysis tables tracked directly, package analysis plot PNGs into a small tar.gz backup, and add analysis scripts plus tests so the stored results remain reproducible without flooding git with large image trees.
This commit is contained in:
BIN
validation_output/analysis_plots.tar.gz
Normal file
BIN
validation_output/analysis_plots.tar.gz
Normal file
Binary file not shown.
34830
validation_output/fragment_library.csv
Normal file
34830
validation_output/fragment_library.csv
Normal file
File diff suppressed because it is too large
Load Diff
@@ -0,0 +1,47 @@
|
||||
Analyzed rows: 34829
|
||||
Unique parent molecules: 4451
|
||||
Unique fragment smiles: 1852
|
||||
Fragment atom count percentiles: p05=1.0, p25=1.0, p50=1.0, p75=2.0, p95=14.0
|
||||
|
||||
Filter candidates (drop fragments with atom_count <= threshold):
|
||||
<= 1: remove 23994 rows (68.9%), remove 10 unique fragments (0.5%)
|
||||
<= 2: remove 28069 rows (80.6%), remove 26 unique fragments (1.4%)
|
||||
<= 3: remove 28550 rows (82.0%), remove 52 unique fragments (2.8%)
|
||||
<= 4: remove 29045 rows (83.4%), remove 88 unique fragments (4.8%)
|
||||
<= 5: remove 29272 rows (84.0%), remove 141 unique fragments (7.6%)
|
||||
|
||||
Ring 16 rows: 8108
|
||||
Ring 16 unique fragment smiles: 596
|
||||
Ring 16 rows with >= 4 heavy atoms: 1880
|
||||
Ring 16 unique fragment smiles with >= 4 heavy atoms: 566
|
||||
|
||||
Ring 16 top positions by normalized Shannon entropy:
|
||||
Position 7: entropy=0.857, unique=4, mean_atom_count=2.57
|
||||
Position 13: entropy=0.739, unique=198, mean_atom_count=15.50
|
||||
Position 4: entropy=0.584, unique=70, mean_atom_count=6.89
|
||||
Position 12: entropy=0.490, unique=99, mean_atom_count=3.63
|
||||
Position 3: entropy=0.449, unique=121, mean_atom_count=5.10
|
||||
|
||||
Ring 16 top positions by mean pairwise Tanimoto distance:
|
||||
Position 16: distance=0.901, entropy=0.415, atom_count_range=12
|
||||
Position 10: distance=0.871, entropy=0.077, atom_count_range=13
|
||||
Position 7: distance=0.860, entropy=0.857, atom_count_range=9
|
||||
Position 14: distance=0.848, entropy=0.375, atom_count_range=13
|
||||
Position 12: distance=0.839, entropy=0.490, atom_count_range=20
|
||||
|
||||
Ring 16 top filtered positions by normalized Shannon entropy:
|
||||
Position 6: entropy=0.973, unique=60, total=89, mean_atom_count=12.58
|
||||
Position 12: entropy=0.886, unique=83, total=177, mean_atom_count=10.00
|
||||
Position 3: entropy=0.854, unique=117, total=269, mean_atom_count=15.41
|
||||
Position 13: entropy=0.763, unique=193, total=709, mean_atom_count=18.91
|
||||
Position 9: entropy=0.729, unique=37, total=141, mean_atom_count=7.82
|
||||
|
||||
Medicinal-chemistry hotspot comparison:
|
||||
Position 6: all=536, >=4 atoms=89, unique_filtered=60, entropy_filtered=0.973
|
||||
Position 7: all=23, >=4 atoms=4, unique_filtered=1, entropy_filtered=0.000
|
||||
Position 15: all=747, >=4 atoms=205, unique_filtered=8, entropy_filtered=0.456
|
||||
Position 16: all=135, >=4 atoms=5, unique_filtered=5, entropy_filtered=1.000
|
||||
|
||||
Interpretation note: atom-count spread is only a coarse proxy for diversity.
|
||||
Use entropy and fingerprint distance as primary diversity evidence; use atom-count spread as supporting context.
|
||||
For cyclic-side-chain sensitivity, see ring_sensitivity output and the markdown report.
|
||||
@@ -0,0 +1,11 @@
|
||||
drop_if_atom_count_lte,removed_rows,removed_row_fraction,retained_rows,retained_row_fraction,removed_unique_fragments,removed_unique_fraction,retained_unique_fragments,retained_unique_fraction
|
||||
1,23994,0.6889086680639697,10835,0.31109133193603034,10,0.005399568034557235,1842,0.9946004319654428
|
||||
2,28069,0.80590886904591,6760,0.19409113095408997,26,0.014038876889848811,1826,0.9859611231101512
|
||||
3,28550,0.8197191995176434,6279,0.18028080048235665,52,0.028077753779697623,1800,0.9719222462203023
|
||||
4,29045,0.8339314938700508,5784,0.16606850612994917,88,0.047516198704103674,1764,0.9524838012958964
|
||||
5,29272,0.8404490510781245,5557,0.15955094892187544,141,0.07613390928725702,1711,0.923866090712743
|
||||
6,29353,0.8427746992448821,5476,0.15722530075511787,188,0.10151187904967603,1664,0.8984881209503239
|
||||
7,29446,0.845444887880789,5383,0.154555112119211,248,0.13390928725701945,1604,0.8660907127429806
|
||||
8,29603,0.849952625685492,5226,0.15004737431450801,331,0.1787257019438445,1521,0.8212742980561555
|
||||
9,29812,0.8559533721898418,5017,0.1440466278101582,411,0.22192224622030238,1441,0.7780777537796977
|
||||
10,30146,0.8655430819144966,4683,0.13445691808550345,510,0.275377969762419,1342,0.724622030237581
|
||||
|
@@ -0,0 +1,46 @@
|
||||
fragment_atom_count,row_count,unique_fragment_count,row_fraction,unique_fragment_fraction
|
||||
1,23994,10,0.6889086680639697,0.005399568034557235
|
||||
2,4075,16,0.11700020098194033,0.008639308855291577
|
||||
3,481,26,0.013810330471733325,0.014038876889848811
|
||||
4,495,36,0.014212294352407477,0.019438444924406047
|
||||
5,227,53,0.006517557208073732,0.028617710583153346
|
||||
6,81,47,0.002325648166757587,0.025377969762419007
|
||||
7,93,60,0.0026701886359068593,0.032397408207343416
|
||||
8,157,83,0.004507737804702977,0.044816414686825054
|
||||
9,209,80,0.006000746504349824,0.04319654427645788
|
||||
10,334,99,0.009589709724654743,0.05345572354211663
|
||||
11,375,142,0.010766889660914755,0.07667386609071274
|
||||
12,2002,148,0.057480834936403574,0.07991360691144708
|
||||
13,382,100,0.01096787160125183,0.05399568034557235
|
||||
14,504,133,0.014470699704269431,0.07181425485961124
|
||||
15,230,97,0.006603692325361049,0.052375809935205186
|
||||
16,110,79,0.003158287633868328,0.04265658747300216
|
||||
17,122,71,0.0035028281030176005,0.03833693304535637
|
||||
18,128,81,0.003675098337592236,0.04373650107991361
|
||||
19,109,53,0.003129575928105889,0.028617710583153346
|
||||
20,30,26,0.0008613511728731804,0.014038876889848811
|
||||
21,56,48,0.0016078555226966035,0.02591792656587473
|
||||
22,24,21,0.0006890809382985443,0.011339092872570195
|
||||
23,137,42,0.0039335036894541904,0.02267818574514039
|
||||
24,32,29,0.000918774584398059,0.01565874730021598
|
||||
25,26,21,0.000746504349823423,0.011339092872570195
|
||||
26,50,36,0.0014355852881219673,0.019438444924406047
|
||||
27,69,29,0.001981107697608315,0.01565874730021598
|
||||
28,41,23,0.0011771799362600133,0.012419006479481642
|
||||
29,84,34,0.002411783284044905,0.0183585313174946
|
||||
30,30,21,0.0008613511728731804,0.011339092872570195
|
||||
31,18,13,0.0005168107037239083,0.007019438444924406
|
||||
32,33,23,0.0009474862901604985,0.012419006479481642
|
||||
33,14,10,0.0004019638806741509,0.005399568034557235
|
||||
34,11,9,0.0003158287633868328,0.004859611231101512
|
||||
35,14,10,0.0004019638806741509,0.005399568034557235
|
||||
36,8,7,0.00022969364609951476,0.003779697624190065
|
||||
37,8,8,0.00022969364609951476,0.004319654427645789
|
||||
38,8,7,0.00022969364609951476,0.003779697624190065
|
||||
39,14,9,0.0004019638806741509,0.004859611231101512
|
||||
40,2,2,5.742341152487869e-05,0.0010799136069114472
|
||||
41,2,2,5.742341152487869e-05,0.0010799136069114472
|
||||
43,2,2,5.742341152487869e-05,0.0010799136069114472
|
||||
44,2,2,5.742341152487869e-05,0.0010799136069114472
|
||||
45,3,3,8.613511728731804e-05,0.0016198704103671706
|
||||
48,3,1,8.613511728731804e-05,0.0005399568034557236
|
||||
|
@@ -0,0 +1,2 @@
|
||||
rows,unique_parent_molecules,unique_fragment_smiles,min_atom_count,p05_atom_count,p25_atom_count,median_atom_count,mean_atom_count,p75_atom_count,p95_atom_count,max_atom_count
|
||||
34829,4451,1852,1,1.0,1.0,1.0,3.3206523299549224,2.0,14.0,48
|
||||
|
@@ -0,0 +1,151 @@
|
||||
# Fragment Library Analysis Report
|
||||
|
||||
## Scope and Dataset
|
||||
|
||||
- Input fragment library: `validation_extract` rows merged with `parent_molecules` metadata.
|
||||
- Current validated design library contains **34,829** splice-ready fragment rows from **4,451** parent macrolactones.
|
||||
- The 16-membered subset contains **8,108** fragment rows from **1,105** parent molecules.
|
||||
- This dataset is narrower than the earlier broad workflow summary because it only keeps **splice-ready, single-anchor fragments** from the validated library. It should not be compared to prior total-fragment counts as if they were the same denominator.
|
||||
|
||||
## Figure 1. Global Atom-Count Distribution
|
||||
|
||||

|
||||
|
||||
- Heavy-atom count is extremely right-skewed: p05=1.0, p25=1.0, median=1.0, p75=2.0, p95=14.0.
|
||||
- A conservative cleanup filter of `<= 2` heavy atoms removes **28,069** rows (80.6%) but only **26** unique fragment SMILES (1.4%).
|
||||
- A design-oriented filter aligned with your previous analysis, `<= 3` heavy atoms, removes **28,550** rows (82.0%) and **52** unique fragment SMILES (2.8%).
|
||||
- Interpretation: `<= 2` is the safer default for library cleanup, while `> 3` is useful for positional diversity analysis because it suppresses one- and two-atom noise.
|
||||
|
||||
## Figure 2. Ring-Specific Counts Before and After Size Filtering
|
||||
|
||||

|
||||
|
||||
- This figure compares all splice-ready fragments with the design-oriented subset that keeps fragments with **>= 4 heavy atoms**.
|
||||
- After size filtering, the most populated 16-membered positions are:
|
||||
|
||||
```text
|
||||
cleavage_position total_fragments_gt3 gt3_row_fraction
|
||||
13 709 0.809361
|
||||
3 269 0.256679
|
||||
4 269 0.452101
|
||||
15 205 0.274431
|
||||
12 177 0.190323
|
||||
9 141 0.192098
|
||||
```
|
||||
|
||||
## Figure 3. Atom-Count Spread for Design-Relevant Fragments
|
||||
|
||||

|
||||
|
||||
- This boxplot only includes fragments with **>= 4 heavy atoms**.
|
||||
- Positions 13, 3, 4, 11 and 6 carry the broadest large-fragment size envelopes; position 15 remains common but its size spread is narrower, indicating repeated reuse of a small set of acyl-like substituents.
|
||||
|
||||
## Figure 4. Position-Wise Diversity Metrics After Filtering
|
||||
|
||||

|
||||
|
||||
- Diversity is evaluated with three complementary metrics:
|
||||
1. `unique_fragments`: raw chemotype count.
|
||||
2. `normalized_shannon_entropy`: how evenly those chemotypes are distributed.
|
||||
3. `mean_pairwise_tanimoto_distance`: structural spread in Morgan fingerprint space.
|
||||
|
||||
Top robust positions after removing <=3-heavy-atom fragments (`total_fragments >= 20`):
|
||||
|
||||
```text
|
||||
cleavage_position total_fragments unique_fragments normalized_shannon_entropy mean_pairwise_tanimoto_distance mean_atom_count
|
||||
6 89 60 0.973126 0.780092 12.584270
|
||||
12 177 83 0.885717 0.825769 10.000000
|
||||
3 269 117 0.854464 0.764966 15.405204
|
||||
13 709 193 0.763147 0.565162 18.906911
|
||||
9 141 37 0.729256 0.804709 7.815603
|
||||
4 269 63 0.585202 0.701620 13.375465
|
||||
15 205 8 0.455556 0.769222 4.692683
|
||||
```
|
||||
|
||||
- Within the filtered 16-membered set, **position 6** is the strongest support for your medicinal-chemistry hypothesis: it has moderate abundance (**89** rows) but high chemotype diversity and near-maximal entropy.
|
||||
- **Position 15** is also clearly relevant because it retains **205** filtered rows, but its diversity is narrow; the site is frequent, yet dominated by a few acyl motifs.
|
||||
|
||||
## Figure 5. Focus on Medicinal-Chemistry Hotspots
|
||||
|
||||

|
||||
|
||||
This panel focuses on positions 6, 7, 15 and 16 because these are the literature-guided derivatization positions from tylosin / tilmicosin / tildipirosin / tulathromycin-like scaffold analysis.
|
||||
|
||||
```text
|
||||
cleavage_position total_fragments_all total_fragments_gt3 unique_fragments_gt3 normalized_shannon_entropy_gt3 mean_pairwise_tanimoto_distance_gt3
|
||||
6 536 89 60 0.973126 0.780092
|
||||
7 23 4 1 0.000000 0.000000
|
||||
15 747 205 8 0.455556 0.769222
|
||||
16 135 5 5 1.000000 0.891010
|
||||
```
|
||||
|
||||
- Position 6: supported by the current database as a **design-relevant and structurally diverse** site.
|
||||
- Position 7: not supported as a prevalent natural hotspot in the current database; it appears only a few times after filtering, so it should be described as a **literature- or scaffold-guided modification site**, not a database-enriched site.
|
||||
- Position 15: supported as a **frequent modification site**, but the retained chemotypes are concentrated into a small number of acyl substituents.
|
||||
- Position 16: not prevalent in the current database, but the few retained fragments are structurally distinct singletons; this makes it a **low-evidence exploratory site**, not a high-confidence natural hotspot.
|
||||
|
||||
## Figure 6. Are the Top Positions Driven by Ring-Bearing Side Chains?
|
||||
|
||||

|
||||
|
||||
- Multi-anchor bridge or fused components do **not** enter the fragment library because the collector only keeps side-chain components with exactly one ring connection; see [src/macro_lactone_toolkit/_core.py](/Users/lingyuzeng/project/macro-lactone-sidechain-profiler/macro_split/src/macro_lactone_toolkit/_core.py#L293) and [src/macro_lactone_toolkit/validation/validator.py](/Users/lingyuzeng/project/macro-lactone-sidechain-profiler/macro_split/src/macro_lactone_toolkit/validation/validator.py#L206).
|
||||
- What can still happen is that a retained fragment is a **single-anchor but ring-bearing side chain** such as a sugar, heterocycle or other cyclic appendage.
|
||||
|
||||
```text
|
||||
cleavage_position total_fragments_gt3 cyclic_fragment_rows acyclic_fragment_rows acyclic_row_fraction cyclic_unique_fragments acyclic_unique_fragments
|
||||
3 269 231 38 0.141264 94 23
|
||||
4 269 263 6 0.022305 57 6
|
||||
5 0 0 0 0.000000 0 0
|
||||
6 89 82 7 0.078652 54 6
|
||||
7 4 0 4 1.000000 0 1
|
||||
8 0 0 0 0.000000 0 0
|
||||
9 141 61 80 0.567376 29 8
|
||||
10 4 4 0 0.000000 4 0
|
||||
11 7 0 7 1.000000 0 5
|
||||
12 177 116 61 0.344633 38 45
|
||||
13 709 689 20 0.028209 187 6
|
||||
14 1 1 0 0.000000 1 0
|
||||
15 205 4 201 0.980488 3 5
|
||||
16 5 4 1 0.200000 4 1
|
||||
```
|
||||
|
||||
- This shows that positions 13, 4, 3 and 6 are indeed dominated by **ring-bearing single-anchor side chains**, not by leaked bridge fragments.
|
||||
- Position 15 behaves differently: almost all retained `>3` fragments are **acyclic** acyl-like substituents.
|
||||
- Therefore, if your scientific question is specifically about non-cyclic side-chain diversification, the ranking should be recalculated on an acyclic-only subset.
|
||||
|
||||
Acyclic-only ranking after removing <=3-heavy-atom fragments:
|
||||
|
||||
```text
|
||||
cleavage_position total_fragments unique_fragments normalized_shannon_entropy mean_pairwise_tanimoto_distance mean_atom_count
|
||||
15 201 5 0.526528 0.633544 4.562189
|
||||
9 80 8 0.527036 0.607697 4.387500
|
||||
12 61 45 0.976171 0.792308 8.409836
|
||||
3 38 23 0.949162 0.677847 12.236842
|
||||
13 20 6 0.900676 0.749984 7.500000
|
||||
6 7 6 0.975504 0.591713 7.571429
|
||||
11 7 5 0.962961 0.695245 14.857143
|
||||
4 6 6 1.000000 0.730513 5.500000
|
||||
```
|
||||
|
||||
- Under this stricter acyclic-only view, position 15 becomes the dominant site, followed by 9, 12 and 3. Positions 13 and 4 no longer dominate, confirming that their earlier prominence is mainly driven by cyclic side chains.
|
||||
|
||||
## Reconciliation With the Previous Conclusion
|
||||
|
||||
The earlier statement that `6,7,15,16` are important 16-membered macrolide modification positions is **not invalidated** by the current analysis, but it must be phrased carefully because it answers a different question.
|
||||
|
||||
- The **previous conclusion** came from scaffold-centric medicinal chemistry on known 16-membered macrolides and identifies **synthetically exploited modification positions**.
|
||||
- The **current MacrolactoneDB analysis** measures **natural side-chain occurrence and diversity** in a validated fragment library.
|
||||
- These are related but not identical concepts. A site can be medicinally attractive even if it is rare in natural products, and a site can be naturally diverse without being the preferred position for semisynthetic optimization.
|
||||
|
||||
### Recommended paper-safe wording
|
||||
|
||||
> In the validated MacrolactoneDB fragment library, natural side-chain diversity of 16-membered macrolactones is concentrated primarily at positions 13, 3/4 and 12. After excluding fragments with <=3 heavy atoms to focus on design-relevant substituents, position 6 remains strongly diversity-enriched and position 15 remains frequency-enriched, whereas positions 7 and 16 are sparse and should be interpreted as literature-guided derivatization sites rather than statistically dominant natural hotspots.
|
||||
|
||||
### Practical interpretation for fragment-library design
|
||||
|
||||
- Use `<= 2` heavy atoms as the **default cleanup filter** for the production fragment library.
|
||||
- Use the stricter `> 3` heavy-atom subset when discussing **position-wise diversity** or aligning with the previous exploratory analysis.
|
||||
- For 16-membered macrolide design, prioritize positions **13, 3, 4, 12 and 6** for natural-diversity-driven fragment mining.
|
||||
- Keep positions **15** as a targeted acyl-modification site even though its chemotype diversity is narrower.
|
||||
- Treat positions **7 and 16** as hypothesis-driven medicinal chemistry positions that need literature or synthesis justification beyond database prevalence.
|
||||
|
||||
@@ -0,0 +1,103 @@
|
||||
# 大环内酯碎片库分析报告(中文)
|
||||
|
||||
## 数据范围
|
||||
|
||||
- 当前验证后的可拼接碎片库包含 **34,829** 条片段记录,来源于 **4,451** 个母体分子。
|
||||
- 其中 16 元环子集包含 **8,108** 条片段记录,来源于 **1,105** 个母体分子。
|
||||
- 用于设计相关位点分析的严格子集定义为:片段重原子数 **>= 4**。
|
||||
|
||||
## 全库碎片大小结论
|
||||
|
||||
- 默认清洗阈值建议使用 `<= 2` 重原子删除。该阈值会删除 **28,069** 条记录(80.6%),但仅删除 **26** 个唯一片段(1.4%)。
|
||||
- 若用于和你之前的分析口径对齐,则建议采用 `> 3` 重原子作为设计相关片段集合。此时会删除 **28,550** 条记录(82.0%)。
|
||||
|
||||
## 16 元环位点结论
|
||||
|
||||
- 在 16 元环中,保留所有可拼接片段时,天然侧链多样性较高的位置主要集中在 `13、3/4、12`,并且 6 位也显示出较强的设计相关多样性。
|
||||
- 在 `> 3` 重原子子集中,样本量足够且多样性较高的位点为:
|
||||
|
||||
```text
|
||||
cleavage_position total_fragments unique_fragments normalized_shannon_entropy mean_pairwise_tanimoto_distance mean_atom_count
|
||||
6 89 60 0.973126 0.780092 12.584270
|
||||
12 177 83 0.885717 0.825769 10.000000
|
||||
3 269 117 0.854464 0.764966 15.405204
|
||||
13 709 193 0.763147 0.565162 18.906911
|
||||
9 141 37 0.729256 0.804709 7.815603
|
||||
4 269 63 0.585202 0.701620 13.375465
|
||||
15 205 8 0.455556 0.769222 4.692683
|
||||
```
|
||||
|
||||
- 其中 6 位最能支持你的药化判断:它不是最高频位点,但在设计相关大侧链中显示出很高的结构多样性。
|
||||
- 15 位则更偏向高频但低多样性的酰基修饰位点。
|
||||
|
||||
## 桥环 / 稠环干扰的敏感性分析
|
||||
|
||||
桥连或双锚点侧链不会进入当前片段库,因为断裂逻辑只保留与主环存在 **1 个连接点** 的侧链组件。也就是说,真正的 bridge / fused multi-anchor components 已被代码层面排除。
|
||||
|
||||
但是,需要额外区分另一类情况:**cyclic single-anchor side chains**。这类片段虽然只在一个位置连到主环,因此会被保留下来,但片段自身可能包含糖环、杂环或其他环状骨架,仍然会显著影响位点多样性排名。
|
||||
|
||||
敏感性分析结果如下:
|
||||
|
||||
```text
|
||||
cleavage_position total_fragments_gt3 cyclic_fragment_rows acyclic_fragment_rows acyclic_row_fraction cyclic_unique_fragments acyclic_unique_fragments
|
||||
3 269 231 38 0.141264 94 23
|
||||
4 269 263 6 0.022305 57 6
|
||||
5 0 0 0 0.000000 0 0
|
||||
6 89 82 7 0.078652 54 6
|
||||
7 4 0 4 1.000000 0 1
|
||||
8 0 0 0 0.000000 0 0
|
||||
9 141 61 80 0.567376 29 8
|
||||
10 4 4 0 0.000000 4 0
|
||||
11 7 0 7 1.000000 0 5
|
||||
12 177 116 61 0.344633 38 45
|
||||
13 709 689 20 0.028209 187 6
|
||||
14 1 1 0 0.000000 1 0
|
||||
15 205 4 201 0.980488 3 5
|
||||
16 5 4 1 0.200000 4 1
|
||||
```
|
||||
|
||||
- `13` 位:`689/709` 条大侧链片段本身带环,说明该位点的高多样性主要由天然糖基/环状侧链驱动。
|
||||
- `4` 位:`263/269` 条大侧链片段带环,几乎同样完全由环状侧链主导。
|
||||
- `3` 位:`231/269` 条大侧链片段带环,带环片段占主导。
|
||||
- `12` 位:环状与非环状侧链并存,但环状侧链仍占多数。
|
||||
- `6` 位:多数保留的大侧链也带环,但仍表现出较强的多样性。
|
||||
- `15` 位:几乎全部是非环状侧链,因此更接近半合成药化修饰位点的特征。
|
||||
|
||||
## 仅看非环状侧链时的位点排序
|
||||
|
||||
- 当只保留 `> 3` 重原子且 **不含环** 的侧链时,位点排序明显变化:
|
||||
|
||||
```text
|
||||
cleavage_position total_fragments unique_fragments normalized_shannon_entropy mean_pairwise_tanimoto_distance mean_atom_count
|
||||
15 201 5 0.526528 0.633544 4.562189
|
||||
9 80 8 0.527036 0.607697 4.387500
|
||||
12 61 45 0.976171 0.792308 8.409836
|
||||
3 38 23 0.949162 0.677847 12.236842
|
||||
13 20 6 0.900676 0.749984 7.500000
|
||||
6 7 6 0.975504 0.591713 7.571429
|
||||
11 7 5 0.962961 0.695245 14.857143
|
||||
4 6 6 1.000000 0.730513 5.500000
|
||||
```
|
||||
|
||||
- 在这个更严格的口径下,`15` 位成为最主要的非环状侧链位点,随后是 `9、12、3` 位。
|
||||
- 因此,`13、3、4、12` 的高天然多样性是真实存在的,但它主要表征的是**带环单锚点天然侧链**的富集,而不是桥连片段泄漏。
|
||||
|
||||
## 论文可直接引用的中文讨论段落
|
||||
|
||||
在当前验证后的 MacrolactoneDB 片段库中,桥连或双锚点侧链不会进入当前片段库,因此 16 元大环内酯位点排序的变化并不是由桥环片段误纳入所致。本研究的断裂规则仅保留与主环具有单一连接点的可拼接侧链;然而,许多被保留的大侧链本身仍可包含糖环、杂环或稠合环等 cyclic single-anchor side chains(带环单锚点侧链),这会显著抬高 13、3、4、12 以及 6 位的天然侧链多样性统计。相反,当仅保留 >3 个重原子且进一步限定为非环状侧链后,位点排序转而更偏向 15、9、12 和 3 位,说明 13、3、4、12 的优势主要反映了天然带环侧链的富集,而 15 位则更接近药物化学中常见的非环状酰基修饰位点。因此,在论文讨论中应明确区分“天然带环侧链多样性热点”和“非环状半合成修饰热点”这两类位点概念。
|
||||
|
||||
## 建议的论文表述方式
|
||||
|
||||
- 若讨论天然产物中的侧链多样性,可写为:`16 元大环内酯的天然侧链多样性主要集中在 13、3/4 和 12 位,并在 6 位保留较强的设计相关多样性。`
|
||||
- 若讨论药化半合成改造热点,可写为:`6、7、15、16 位代表文献和先导化合物研究中优先使用的衍生化位点,其中 6 和 15 位在数据库统计中分别对应高多样性和高频率信号,而 7 和 16 位更多体现为文献指导的探索性位点。`
|
||||
- 若专门讨论非环状侧链设计,则应强调:`在排除 <=3 重原子小片段并进一步排除带环侧链后,15 位是最主要的非环状侧链修饰位点。`
|
||||
|
||||
## 相关图表
|
||||
|
||||
- 全库原子数分布:`fragment_atom_count_distribution.png`
|
||||
- 16 元环过滤前后位点数量对比:`ring16_position_count_comparison.png`
|
||||
- 16 元环大侧链 boxplot:`ring16_position_atom_count_boxplot_gt3.png`
|
||||
- 16 元环位点多样性图:`ring16_position_diversity_gt3.png`
|
||||
- 16 元环桥环/带环侧链敏感性图:`ring16_position_ring_sensitivity.png`
|
||||
- 16 元环药化热点对比图:`ring16_medchem_hotspot_comparison.png`
|
||||
|
||||
@@ -0,0 +1,5 @@
|
||||
cleavage_position,total_fragments_all,unique_fragments_all,mean_atom_count_all,total_fragments_gt3,unique_fragments_gt3,mean_atom_count_gt3,gt3_row_fraction,gt3_unique_fraction,normalized_shannon_entropy_all,mean_pairwise_tanimoto_distance_all,normalized_shannon_entropy_gt3,mean_pairwise_tanimoto_distance_gt3
|
||||
6,536,68,2.9402985074626864,89,60,12.584269662921349,0.166044776119403,0.8823529411764706,0.30950388972503856,0.8133239394372517,0.9731262721536741,0.7800916405422387
|
||||
7,23,4,2.5652173913043477,4,1,10.0,0.17391304347826086,0.25,0.8567985762221871,0.86,0.0,0.0
|
||||
15,747,13,2.0174029451137887,205,8,4.692682926829268,0.2744310575635877,0.6153846153846154,0.3487511597393803,0.8168500968741627,0.45555578681139725,0.7692219308030589
|
||||
16,135,8,1.9851851851851852,5,5,8.4,0.037037037037037035,0.625,0.4148320683071004,0.9006253315127155,1.0000000000000002,0.8910101403362273
|
||||
|
@@ -0,0 +1,15 @@
|
||||
cleavage_position,total_fragments_all,unique_fragments_all,mean_atom_count_all,total_fragments_gt3,unique_fragments_gt3,mean_atom_count_gt3,gt3_row_fraction,gt3_unique_fraction
|
||||
3,1048,121,5.097328244274809,269,117,15.405204460966543,0.2566793893129771,0.9669421487603306
|
||||
4,595,70,6.894117647058824,269,63,13.37546468401487,0.45210084033613446,0.9
|
||||
5,54,2,1.0,0,0,0.0,0.0,0.0
|
||||
6,536,68,2.9402985074626864,89,60,12.584269662921349,0.166044776119403,0.8823529411764706
|
||||
7,23,4,2.5652173913043477,4,1,10.0,0.17391304347826086,0.25
|
||||
8,123,4,1.016260162601626,0,0,0.0,0.0,0.0
|
||||
9,734,41,2.310626702997275,141,37,7.815602836879433,0.19209809264305178,0.9024390243902439
|
||||
10,993,7,1.0433031218529707,4,4,11.5,0.004028197381671702,0.5714285714285714
|
||||
11,249,8,1.3895582329317269,7,5,14.857142857142858,0.028112449799196786,0.625
|
||||
12,930,99,3.6268817204301076,177,83,10.0,0.19032258064516128,0.8383838383838383
|
||||
13,876,198,15.501141552511415,709,193,18.90691114245416,0.8093607305936074,0.9747474747474747
|
||||
14,1065,6,1.267605633802817,1,1,14.0,0.0009389671361502347,0.16666666666666666
|
||||
15,747,13,2.0174029451137887,205,8,4.692682926829268,0.2744310575635877,0.6153846153846154
|
||||
16,135,8,1.9851851851851852,5,5,8.4,0.037037037037037035,0.625
|
||||
|
@@ -0,0 +1,15 @@
|
||||
cleavage_position,total_fragments,unique_fragments,normalized_unique_ratio,shannon_entropy,normalized_shannon_entropy,mean_pairwise_tanimoto_distance,mean_atom_count,median_atom_count,std_atom_count,iqr_atom_count,min_atom_count,max_atom_count,atom_count_range
|
||||
3,1048,121,0.11545801526717557,2.1513093369119027,0.4485828387328402,0.7727580364590483,5.097328244274809,2.0,7.125515417214252,5.0,1,39,38
|
||||
4,595,70,0.11764705882352941,2.4817848918166048,0.584156213064148,0.7382343170047687,6.894117647058824,2.0,6.017946467329047,12.5,1,18,17
|
||||
5,54,2,0.037037037037037035,0.0922160573371918,0.13303964861069892,0.8,1.0,1.0,0.0,0.0,1,1,0
|
||||
6,536,68,0.12686567164179105,1.3059540474767763,0.30950388972503856,0.8133239394372517,2.9402985074626864,1.0,4.444116006032291,0.0,1,20,19
|
||||
7,23,4,0.17391304347826086,1.1877750348323688,0.8567985762221871,0.86,2.5652173913043477,1.0,3.4113122166840055,0.0,1,10,9
|
||||
8,123,4,0.032520325203252036,0.2372288446747181,0.171124438884017,0.7915343915343915,1.016260162601626,1.0,0.1795993661331262,0.0,1,3,2
|
||||
9,734,41,0.055858310626702996,1.472202321733401,0.39643833357458974,0.8284402664223979,2.310626702997275,1.0,3.280936828670794,0.0,1,17,16
|
||||
10,993,7,0.007049345417925478,0.14975384152906807,0.07695825092529042,0.8712928851639762,1.0433031218529707,1.0,0.6755237166842796,0.0,1,14,13
|
||||
11,249,8,0.0321285140562249,0.41116030451725305,0.1977263107791457,0.8041654723607728,1.3895582329317269,1.0,3.5978646341022595,0.0,1,41,40
|
||||
12,930,99,0.1064516129032258,2.251679166978565,0.49001532939616665,0.8394325303461457,3.6268817204301076,3.0,3.4290862065303043,2.0,1,21,20
|
||||
13,876,198,0.22602739726027396,3.9101740989981857,0.7394055701617327,0.5841396079985519,15.501141552511415,13.0,9.597433474135945,11.0,1,36,35
|
||||
14,1065,6,0.005633802816901409,0.6717715558650733,0.3749228439431623,0.848458035826457,1.267605633802817,1.0,0.5852108438838486,1.0,1,14,13
|
||||
15,747,13,0.01740294511378849,0.8945290630874894,0.3487511597393803,0.8168500968741627,2.0174029451137887,1.0,1.7641278118940684,3.0,1,13,12
|
||||
16,135,8,0.05925925925925926,0.8626190356587518,0.4148320683071004,0.9006253315127155,1.9851851851851852,2.0,1.4141359627921224,0.5,1,13,12
|
||||
|
@@ -0,0 +1,13 @@
|
||||
cleavage_position,total_fragments,unique_fragments,normalized_unique_ratio,shannon_entropy,normalized_shannon_entropy,mean_pairwise_tanimoto_distance,mean_atom_count,median_atom_count,std_atom_count,iqr_atom_count,min_atom_count,max_atom_count,atom_count_range
|
||||
3,269,117,0.4349442379182156,4.069106586040983,0.8544640833690577,0.7649661414788775,15.405204460966543,14.0,7.356263980699357,10.0,4,39,35
|
||||
4,269,63,0.2342007434944238,2.424572263644206,0.5852023706107886,0.7016204220824205,13.37546468401487,14.0,1.7682432704870303,0.0,5,18,13
|
||||
6,89,60,0.6741573033707865,3.984314260747859,0.9731262721536741,0.7800916405422387,12.584269662921349,13.0,2.7016806619569276,3.0,5,20,15
|
||||
7,4,1,0.25,-0.0,0.0,0.0,10.0,10.0,0.0,0.0,10,10,0
|
||||
9,141,37,0.2624113475177305,2.633283400210265,0.7292559576027439,0.8047085590557639,7.815602836879433,5.0,4.30339275172448,7.0,4,17,13
|
||||
10,4,4,1.0,1.3862943611198906,1.0,0.7794154494906375,11.5,11.5,1.8027756377319946,2.0,9,14,5
|
||||
11,7,5,0.7142857142857143,1.5498260458782016,0.9629610647945144,0.6952446898083474,14.857142857142858,4.0,16.54801301346713,19.5,4,41,37
|
||||
12,177,83,0.4689265536723164,3.913840989343892,0.8857167154747141,0.8257691691480503,10.0,10.0,2.788191402941166,4.0,4,21,17
|
||||
13,709,193,0.27221438645980256,4.016208118657202,0.7631473589542491,0.5651618154111845,18.90691114245416,15.0,7.276898560461122,13.0,4,36,32
|
||||
14,1,1,1.0,-0.0,0.0,0.0,14.0,14.0,0.0,0.0,14,14,0
|
||||
15,205,8,0.03902439024390244,0.9473016276482624,0.45555578681139725,0.7692219308030589,4.692682926829268,5.0,1.2049471689465932,1.0,4,13,9
|
||||
16,5,5,1.0,1.6094379124341005,1.0000000000000002,0.8910101403362273,8.4,7.0,2.4979991993593593,2.0,6,13,7
|
||||
|
@@ -0,0 +1,15 @@
|
||||
cleavage_position,total_fragments_gt3,cyclic_fragment_rows,acyclic_fragment_rows,cyclic_row_fraction,acyclic_row_fraction,unique_fragments_gt3,cyclic_unique_fragments,acyclic_unique_fragments
|
||||
3,269,231,38,0.8587360594795539,0.1412639405204461,117,94,23
|
||||
4,269,263,6,0.9776951672862454,0.022304832713754646,63,57,6
|
||||
5,0,0,0,0.0,0.0,0,0,0
|
||||
6,89,82,7,0.9213483146067416,0.07865168539325842,60,54,6
|
||||
7,4,0,4,0.0,1.0,1,0,1
|
||||
8,0,0,0,0.0,0.0,0,0,0
|
||||
9,141,61,80,0.4326241134751773,0.5673758865248227,37,29,8
|
||||
10,4,4,0,1.0,0.0,4,4,0
|
||||
11,7,0,7,0.0,1.0,5,0,5
|
||||
12,177,116,61,0.655367231638418,0.3446327683615819,83,38,45
|
||||
13,709,689,20,0.9717912552891397,0.028208744710860368,193,187,6
|
||||
14,1,1,0,1.0,0.0,1,1,0
|
||||
15,205,4,201,0.01951219512195122,0.9804878048780488,8,3,5
|
||||
16,5,4,1,0.8,0.2,5,4,1
|
||||
|
Reference in New Issue
Block a user