lingyuzeng 46a438dd36 feat(validation): enforce single-anchor fragments
- skip fused/shared/multi-anchor side systems during extraction
- add fragment library schema and fragment_library.csv export
- make scaffold prep strict for non-spliceable positions
2026-03-19 14:20:32 +08:00

# macro_lactone_toolkit
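`macro_lactone_toolkit` is an installable Python package for identifying valid 12- to 20-membered macrolactones, numbering their ring atoms, cleaving side chains, and performing simple splice-back reassembly.
## Core capabilities
- Automatically detects valid 12- to 20-membered macrolactones by default; `ring_size` can also be specified explicitly
- Ring numbering follows a fixed rule:
  - Position 1 = lactone carbonyl carbon
  - Position 2 = ester oxygen in the ring
  - Positions 3 to N = numbered consecutively in one consistent direction
- Side-chain cleavage outputs two sets of SMILES:
  - `fragment_smiles_labeled`, e.g. `[5*]`
  - `fragment_smiles_plain`, e.g. `*`
- The bond between the dummy atom and the attachment atom keeps its original bond type
- A formal CLI is provided:
  - `macro-lactone-toolkit analyze`
  - `macro-lactone-toolkit number`
  - `macro-lactone-toolkit fragment`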
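
The ring numbering rule can be illustrated with a small, library-free sketch. It is not the toolkit's implementation: `number_ring` is a hypothetical helper, and it assumes the ring atoms are already available as a list of atom indices in bonded order, with the lactone carbonyl carbon and the ring ester oxygen already identified.

```python
from typing import Dict, Sequence


def number_ring(
    ring_atoms: Sequence[int], carbonyl_c: int, ester_o: int
) -> Dict[int, int]:
    """Map ring positions 1..N to atom indices (hypothetical helper).

    `ring_atoms` must list the ring atom indices in bonded order.
    """
    n = len(ring_atoms)
    start = ring_atoms.index(carbonyl_c)
    # Pick the walking direction that makes the ester oxygen position 2.
    if ring_atoms[(start + 1) % n] == ester_o:
        step = 1
    elif ring_atoms[(start - 1) % n] == ester_o:
        step = -1
    else:
        raise ValueError("ester oxygen is not adjacent to the carbonyl carbon")
    # Position 1 is the carbonyl carbon; continue in the chosen direction.
    return {pos: ring_atoms[(start + step * (pos - 1)) % n] for pos in range(1, n + 1)}


print(number_ring([10, 11, 12, 13, 14], carbonyl_c=12, ester_o=11))
# → {1: 12, 2: 11, 3: 10, 4: 14, 5: 13}
```

Fixing both the start atom and the direction is what makes the position labels comparable across molecules with the same ring size.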
## Environment
`pixi` is recommended. The project is pinned to Python 3.12 and supports `osx-arm64` and `linux-64`.
```bash
pixi install
pixi run pytest
pixi run python -c "import macro_lactone_toolkit"
```
## Python API
```python
from macro_lactone_toolkit import MacroLactoneAnalyzer, MacrolactoneFragmenter

# Find which ring sizes qualify as valid macrolactone rings in the molecule
analyzer = MacroLactoneAnalyzer()
valid_ring_sizes = analyzer.get_valid_ring_sizes("O=C1CCCCCCCCCCCCCCO1")

# Number the ring atoms, then cleave side chains from a methyl-substituted analogue
fragmenter = MacrolactoneFragmenter()
numbering = fragmenter.number_molecule("O=C1CCCCCCCCCCCCCCO1")
result = fragmenter.fragment_molecule("O=C1CCCC(C)CCCCCCCCCCO1", parent_id="mol_001")
```
## CLI
Single-molecule analysis:
```bash
pixi run macro-lactone-toolkit analyze --smiles "O=C1CCCCCCCCCCCCCCO1"
pixi run macro-lactone-toolkit number --smiles "O=C1CCCCCCCCCCCCCCO1"
pixi run macro-lactone-toolkit fragment --smiles "O=C1CCCC(C)CCCCCCCCCCO1" --parent-id mol_001
```
CSV batch processing:
```bash
pixi run macro-lactone-toolkit fragment \
--input molecules.csv \
--output fragments.csv \
--errors-output fragment_errors.csv
```
By default the `smiles` column is read. If an `id` column exists, it is used as `parent_id`; otherwise `row_<index>` is generated automatically.
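
The column convention can be sketched with the standard `csv` module. This is an illustration only, not the CLI's code: `iter_molecules` is a hypothetical helper, and the 0-based `index` in `row_<index>` is an assumption, since the source does not say where counting starts.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def iter_molecules(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (parent_id, smiles) pairs from a CSV handle (hypothetical helper)."""
    reader = csv.DictReader(handle)
    has_id = "id" in (reader.fieldnames or [])
    for index, row in enumerate(reader):
        # Assumption: generated IDs count rows from 0.
        parent_id = row["id"] if has_id else f"row_{index}"
        yield parent_id, row["smiles"]


with_ids = io.StringIO("id,smiles\nmol_001,O=C1CCCCCCCCCCCCCCO1\n")
print(list(iter_molecules(with_ids)))
# → [('mol_001', 'O=C1CCCCCCCCCCCCCCO1')]
```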
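## MacrolactoneDB validation module
Performs sampled validation, classification, side-chain cleavage, and database storage for the MacrolactoneDB database.
### Using the validation script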
```bash
# Basic usage (10% stratified sampling)
pixi run python scripts/validate_macrolactone_db.py \
--input data/MacrolactoneDB/ring12_20/temp.csv \
--output validation_output \
--sample-ratio 0.1
# Process the full dataset
pixi run python scripts/validate_macrolactone_db.py \
--input data/MacrolactoneDB/ring12_20/temp.csv \
--output validation_output \
--sample-ratio 1.0
# Specify column names (if the CSV uses different ones)
pixi run python scripts/validate_macrolactone_db.py \
--input data.csv \
--output validation_output \
--id-col ml_id \
--chembl-id-col IDs \
--smiles-col smiles
```
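
As a rough picture of what `--sample-ratio` does under stratified sampling, the sketch below draws the same fraction from each stratum. It rests on assumptions the source does not confirm: that strata are ring sizes, that each stratum keeps at least one record, and that counts are rounded. `stratified_sample` is a hypothetical name, not a function of the script.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_sample(
    records: Sequence[Tuple[int, str]], ratio: float, seed: int = 0
) -> List[str]:
    """Sample `ratio` of the record IDs from each stratum (assumed: ring size)."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata: Dict[int, List[str]] = defaultdict(list)
    for ring_size, record_id in records:
        strata[ring_size].append(record_id)

    sample: List[str] = []
    for ring_size in sorted(strata):
        members = strata[ring_size]
        # Assumption: keep at least one record so rare ring sizes survive.
        k = min(len(members), max(1, round(len(members) * ratio)))
        sample.extend(rng.sample(members, k))
    return sample


records = [(12, f"a{i}") for i in range(20)] + [(14, f"b{i}") for i in range(10)]
print(len(stratified_sample(records, ratio=0.1)))
# → 3
```

With `ratio=1.0` every record is returned, which matches the full-dataset invocation above.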
### Output structure
```
validation_output/
├── README.md                    # Directory description
├── fragments.db                 # SQLite database
├── fragment_library.csv         # Final fragment library export (includes has_dummy_atom / splice_ready)
├── summary.csv                  # Summary table (includes ml_id, chembl_id)
├── summary_statistics.json      # Statistics
├── ring_size_12/                # Organized by ring size
├── ring_size_13/
...
└── ring_size_20/
    ├── standard/
    │   ├── numbered/            # Numbered ring images (file names use ml_id)
    │   │   └── {ml_id}_numbered.png
    │   └── sidechains/          # Fragment images
    │       └── {ml_id}/
    │           └── {ml_id}_frag_{n}_pos{pos}.png
    ├── non_standard/original/
    └── rejected/original/
```
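
To locate a fragment image programmatically, the naming pattern in the tree can be turned into a small path builder. This is a sketch under assumptions: `fragment_image_path` is a hypothetical helper, every `ring_size_N/` directory is assumed to share the layout shown for `ring_size_20/`, and `{n}` and `{pos}` are read as a fragment counter and the cleavage position.

```python
from pathlib import PurePosixPath


def fragment_image_path(
    output_dir: str, ring_size: int, ml_id: str, n: int, pos: int
) -> PurePosixPath:
    """Build the expected path of a side-chain fragment image (hypothetical helper)."""
    return (
        PurePosixPath(output_dir)
        / f"ring_size_{ring_size}"
        / "standard"
        / "sidechains"
        / ml_id
        / f"{ml_id}_frag_{n}_pos{pos}.png"
    )


print(fragment_image_path("validation_output", 14, "ML00000001", n=1, pos=5))
# → validation_output/ring_size_14/standard/sidechains/ML00000001/ML00000001_frag_1_pos5.png
```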
### Database query examples
```bash
# List the tables
sqlite3 validation_output/fragments.db ".tables"
# Query standard macrolactones
sqlite3 validation_output/fragments.db \
"SELECT ml_id, chembl_id, ring_size, num_sidechains \
FROM parent_molecules \
WHERE classification='standard_macrolactone' LIMIT 5;"
# Query the final fragment library
sqlite3 validation_output/fragments.db \
"SELECT source_type, source_parent_ml_id, cleavage_position, has_dummy_atom, splice_ready \
FROM fragment_library_entries LIMIT 10;"
# Query side-chain fragments
sqlite3 validation_output/fragments.db \
"SELECT fragment_id, cleavage_position, dummy_isotope, has_dummy_atom, dummy_atom_count \
FROM side_chain_fragments LIMIT 10;"
# Count molecules by ring size
sqlite3 validation_output/fragments.db \
"SELECT ring_size, COUNT(*) FROM parent_molecules GROUP BY ring_size;"
```
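
The same queries can be run from Python with the standard `sqlite3` module. The sketch below mirrors the ring-size count, using only the `parent_molecules` table and `ring_size` column that appear in the queries above. `count_by_ring_size` is a hypothetical helper, and the added `ORDER BY` is a convenience rather than part of the original query.

```python
import sqlite3
from typing import Dict


def count_by_ring_size(conn: sqlite3.Connection) -> Dict[int, int]:
    """Return {ring_size: number of parent molecules} (hypothetical helper)."""
    rows = conn.execute(
        "SELECT ring_size, COUNT(*) FROM parent_molecules "
        "GROUP BY ring_size ORDER BY ring_size"
    ).fetchall()
    # Each row is a (ring_size, count) tuple, so dict() builds the mapping.
    return dict(rows)
```

A typical call would open the database with `sqlite3.connect("validation_output/fragments.db")` and pass the connection in. Taking a connection rather than a path also makes it easy to test the helper against an in-memory database.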
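### Key fields
| Field | Description |
|------|------|
| `ml_id` | Unique MacrolactoneDB ID (e.g. ML00000001), used for file naming |
| `chembl_id` | Original ChEMBL ID (e.g. CHEMBL94657); may be empty |
| `classification` | standard_macrolactone / non_standard_macrocycle / not_macrolactone |
| `dummy_isotope` | Cleavage position number, used to reassemble fragments |
| `cleavage_position` | Cleavage position on the ring |
| `has_dummy_atom` | Whether the fragment carries a dummy atom; helps identify fragments that can be spliced directly |
| `splice_ready` | Whether the fragment is directly compatible with the current single-anchor splicing workflow |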
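
The relationship between `dummy_isotope`, the labeled SMILES, and the plain SMILES can be shown at the string level. This is only an illustration that assumes dummy atoms are written exactly as `[N*]`; the helper names are hypothetical, and a cheminformatics library would be the more robust way to do this in practice.

```python
import re
from typing import List

# Matches an isotope-labeled dummy atom such as [5*]; assumes no charge or
# other annotations inside the brackets.
_LABELED_DUMMY = re.compile(r"\[(\d+)\*\]")


def dummy_isotopes(labeled_smiles: str) -> List[int]:
    """Return the isotope labels of all dummy atoms in a labeled SMILES."""
    return [int(label) for label in _LABELED_DUMMY.findall(labeled_smiles)]


def to_plain(labeled_smiles: str) -> str:
    """Drop the isotope labels, turning every [N*] into a bare *."""
    return _LABELED_DUMMY.sub("*", labeled_smiles)


print(dummy_isotopes("[5*]CC(C)C"))  # → [5]
print(to_plain("[5*]CC(C)C"))        # → *CC(C)C
```

Under this reading, a fragment with exactly one entry in `dummy_isotopes(...)` corresponds to the single-anchor case.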
## Legacy Scripts
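The `scripts/` directory is kept only as thin wrappers or migration notices and no longer holds the core implementation. The official interfaces are `macro_lactone_toolkit.*` and the `macro-lactone-toolkit` CLI.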