Files

hotwa 68f171ad1d Add splicing module and related tests

- Add src/splicing/ module with scaffold_prep, fragment_prep, and engine
- Add tylosin_splicer.py entry script
- Add unit tests for splicing components

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-18 17:47:00 +08:00

__init__.py

first add

2025-11-14 20:34:58 +08:00

analyze_fragments.py

first add

2025-11-14 20:34:58 +08:00

batch_process_multi_rings.py

first add

2025-11-14 20:34:58 +08:00

batch_process_ring16.py

first add

2025-11-14 20:34:58 +08:00

batch_process.py

first add

2025-11-14 20:34:58 +08:00

generate_sdf_and_statistics.py

first add

2025-11-14 20:34:58 +08:00

README.md

first add

2025-11-14 20:34:58 +08:00

tylosin_splicer.py

Add splicing module and related tests

2026-03-18 17:47:00 +08:00

README.md

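Batch Processing Scripts: Usage Guide

Overview

This folder contains two batch processing scripts that number the ring atoms of macrolide molecules, cleave their side chains, and save the resulting fragments.

Scripts

1. `batch_process_ring16.py` - Processes 16-membered-ring molecules

What it does:

Processes the 1241 pre-filtered 16-membered-ring molecules
Numbers the ring atoms of each molecule
Cleaves the side chains
Saves the results as JSON files

Input file:

ring16/temp_filtered_complete.csv

Output folder:

output/ring16_fragments/
- ring16_mol_{idx}/ - one folder per molecule
  - ring16_mol_{idx}_all_fragments.json - information on all fragments
  - {fragment_id}.json - a single fragment
  - metadata.json - molecule metadata
- processing_log_*.txt - processing log
- error_log_*.txt - error log
- processing_statistics.json - summary statistics

Usage: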

cd /home/zly/project/macro_split

# Option 1: run the Python script directly
python scripts/batch_process_ring16.py

# Option 2: run it as an executable
chmod +x scripts/batch_process_ring16.py
./scripts/batch_process_ring16.py

Expected runtime: 10-30 minutes, depending on machine performance

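2. `batch_process_multi_rings.py` - Processes 12- to 20-membered-ring molecules

What it does:

Processes all macrolide molecules with 12- to 20-membered rings
Classifies them automatically by ring size (12-membered, 13-membered, 14-membered, ... 20-membered)
Detects and excludes molecules that contain more than one lactone bond
Numbers and cleaves the molecules of each ring size separately
Saves the results to the corresponding folder

Input file:

data/ring12_20/temp.csv

Output folders:

output/ring12_fragments/ - 12-membered-ring fragments
output/ring13_fragments/ - 13-membered-ring fragments
output/ring14_fragments/ - 14-membered-ring fragments
...
output/ring20_fragments/ - 20-membered-ring fragments

Each folder contains:

ring{N}_mol_{idx}/ - one folder per molecule
- ring{N}_mol_{idx}_all_fragments.json
- {fragment_id}.json
- metadata.json
processing_log_*.txt - processing log
error_log_*.txt - error log
multiple_lactone_log_*.txt - record of molecules with multiple lactone bonds
processing_statistics.json - summary statistics

Usage: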

cd /home/zly/project/macro_split

# Make sure the input file exists
ls data/ring12_20/temp.csv

# Run the script
python scripts/batch_process_multi_rings.py

Expected runtime: roughly 30 minutes to 2 hours, depending on dataset size
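
The classification and exclusion rules listed for `batch_process_multi_rings.py` can be illustrated without a chemistry toolkit. The following is a minimal sketch, assuming the ring size and the lactone carbon indices of each molecule have already been computed. The record fields (`idx`, `ring_size`, `lactone_carbons`) and the function name `route_molecules` are hypothetical and are not taken from the script itself.

```python
from collections import defaultdict


def route_molecules(records, output_root="output", min_ring=12, max_ring=20):
    """Group molecule indices by output folder and collect the rejected ones.

    Each record is a dict with hypothetical keys:
    idx, ring_size, lactone_carbons (list of atom indices).
    """
    routed = defaultdict(list)  # folder name -> molecule indices
    rejected = []               # (molecule index, reason)
    for rec in records:
        size = rec["ring_size"]
        if not min_ring <= size <= max_ring:
            rejected.append((rec["idx"], "ring size out of range"))
        elif len(rec["lactone_carbons"]) > 1:
            rejected.append((rec["idx"], "multiple lactone bonds"))
        else:
            routed[f"{output_root}/ring{size}_fragments"].append(rec["idx"])
    return dict(routed), rejected


# Illustrative records only; the values are made up for this example.
records = [
    {"idx": 0, "ring_size": 16, "lactone_carbons": [15]},
    {"idx": 1, "ring_size": 14, "lactone_carbons": [9]},
    {"idx": 350, "ring_size": 16, "lactone_carbons": [15, 28, 42]},
]
routed, rejected = route_molecules(records)
print(routed)
print(rejected)
```

In this sketch the rejected list plays the role of `multiple_lactone_log_*.txt`: it records which molecules were excluded and why.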

Output Files

JSON file formats

1. `{parent_id}_all_fragments.json`

Complete information on all fragments:

{
  "parent_id": "ring16_mol_0",
  "parent_smiles": "...",
  "fragments": [
    {
      "fragment_smiles": "CC(C)C",
      "parent_smiles": "...",
      "cleavage_position": 5,
      "fragment_id": "ring16_mol_0_frag_0",
      "parent_id": "ring16_mol_0",
      "atom_count": 4,
      "molecular_weight": 58.12
    },
    ...
  ]
}

2. `{fragment_id}.json`

Information on a single fragment:

{
  "fragment_smiles": "CC(C)C",
  "parent_smiles": "...",
  "cleavage_position": 5,
  "fragment_id": "ring16_mol_0_frag_0",
  "parent_id": "ring16_mol_0",
  "atom_count": 4,
  "molecular_weight": 58.12
}
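
For downstream code it can help to load a fragment record into a typed structure, so that a missing or misspelled key fails early. The following is a minimal sketch, assuming every `{fragment_id}.json` file carries exactly the seven keys shown above. The `Fragment` class and `parse_fragment` helper are hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One fragment record, mirroring the keys of the sample JSON."""
    fragment_smiles: str
    parent_smiles: str
    cleavage_position: int
    fragment_id: str
    parent_id: str
    atom_count: int
    molecular_weight: float


def parse_fragment(text: str) -> Fragment:
    """Parse one fragment payload; raises TypeError on missing or extra keys."""
    return Fragment(**json.loads(text))


# The sample record from this README; "..." stands for the elided parent SMILES.
sample = """{
  "fragment_smiles": "CC(C)C",
  "parent_smiles": "...",
  "cleavage_position": 5,
  "fragment_id": "ring16_mol_0_frag_0",
  "parent_id": "ring16_mol_0",
  "atom_count": 4,
  "molecular_weight": 58.12
}"""
frag = parse_fragment(sample)
print(frag.fragment_id, frag.cleavage_position, frag.molecular_weight)
```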

3. `metadata.json`

Molecule metadata:

{
  "parent_id": "ring16_mol_0",
  "molecule_id": "CHEMBL94657",
  "molecule_name": "PATUPILONE",
  "smiles": "C/C(=C\\c1csc(C)n1)...",
  "ring_size": 16,
  "num_fragments": 11,
  "processing_date": "2025-11-06T10:30:00"
}

4. `processing_statistics.json`

Processing statistics:

{
  "ring_size": 16,
  "total": 1241,
  "success": 1200,
  "failed_parse": 5,
  "failed_numbering": 10,
  "failed_validation": 8,
  "failed_cleavage": 18,
  "total_fragments": 12500
}
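
In the sample above, the success count and the four failure counts add up to the total (1200 + 5 + 10 + 8 + 18 = 1241). The following is a minimal sketch of that sanity check plus two derived ratios, assuming the failure categories are mutually exclusive and that every key beginning with `failed_` is a failure counter. The `summarize` helper is a hypothetical name, not part of the scripts.

```python
import json

# The sample statistics from this README.
stats = json.loads("""{
  "ring_size": 16,
  "total": 1241,
  "success": 1200,
  "failed_parse": 5,
  "failed_numbering": 10,
  "failed_validation": 8,
  "failed_cleavage": 18,
  "total_fragments": 12500
}""")


def summarize(stats: dict) -> dict:
    """Check that the counters add up and derive a few ratios."""
    failed = sum(v for k, v in stats.items() if k.startswith("failed_"))
    success = stats["success"]
    total = stats["total"]
    return {
        "failed": failed,
        "accounted_for": success + failed == total,
        "success_rate": success / total if total else 0.0,
        "fragments_per_molecule": stats["total_fragments"] / success if success else 0.0,
    }


summary = summarize(stats)
print(summary["failed"], summary["accounted_for"])
print(f"{summary['success_rate']:.1%} succeeded, "
      f"{summary['fragments_per_molecule']:.1f} fragments per molecule")
```

If `accounted_for` is false for a real run, some molecules were counted in more than one category or in none, which is worth checking against the error log.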

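Log files

1. `processing_log_*.txt`

Records every processing step. Messages are written in Chinese; the sample below reads "Starting batch processing of 1241 16-membered-ring molecules", "Molecule 0 processed successfully", and "Processing complete: 1200/1241 succeeded":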

[2025-11-06 10:30:00] 开始批量处理 1241 个16元环分子
[2025-11-06 10:30:05] 分子 0 处理成功
...
[2025-11-06 11:00:00] 处理完成: 成功 1200/1241

2. `error_log_*.txt`

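Records every error. The sample entries below report a SMILES parsing failure for molecule 125 and a numbering validation failure for molecule 256: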

[2025-11-06 10:35:12] ❌ 分子 125 (CHEMBL12345) SMILES解析失败
[2025-11-06 10:36:45] ❌ 分子 256 (CHEMBL67890) 编号验证失败

3. `multiple_lactone_log_*.txt`

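Records molecules that contain more than one lactone bond (written only by the multi_rings script). The sample entry below reads "has multiple lactone bonds, excluded; lactone carbon indices: [15, 28, 42]":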

[2025-11-06 10:40:00] 分子 350 (CHEMBL11111, Molecule Name) 有多个内酯键，已剔除。内酯碳索引: [15, 28, 42]
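
The error and multiple-lactone samples above share the layout `[timestamp] 分子 <index> (<molecule ID>) <message>`, which makes them easy to tally. The following is a minimal sketch, assuming real log lines follow exactly the format of these samples and that molecule names contain no closing parenthesis. The pattern and the `parse_log_line` helper are hypothetical and would need adjusting if the actual format differs.

```python
import re
from collections import Counter

# [timestamp] (optional ❌) 分子 <index> (<molecule ID>[, name]) <message>
LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+(?:❌\s*)?分子\s+(?P<index>\d+)\s+"
    r"\((?P<molecule_id>[^,)]+)[^)]*\)\s+(?P<message>.+)$"
)


def parse_log_line(line):
    """Return a dict for a per-molecule log entry, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["index"] = int(entry["index"])
    return entry


sample_lines = [
    "[2025-11-06 10:35:12] ❌ 分子 125 (CHEMBL12345) SMILES解析失败",
    "[2025-11-06 10:36:45] ❌ 分子 256 (CHEMBL67890) 编号验证失败",
    "[2025-11-06 10:30:05] 分子 0 处理成功",  # no molecule ID, so it is skipped
]
entries = [e for e in map(parse_log_line, sample_lines) if e is not None]
failure_counts = Counter(e["message"] for e in entries)  # message -> count
failed_ids = [e["molecule_id"] for e in entries]         # candidates for manual review

print(failure_counts)
print(failed_ids)
```

The list of failed molecule IDs gives a starting point for processing those molecules by hand.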

FAQ

Q1: How do I check progress?

A: The script displays a live progress bar. You can also follow the log file:

tail -f output/ring16_fragments/processing_log_*.txt

Q2: How do I view the results?

A: Check the statistics file:

cat output/ring16_fragments/processing_statistics.json

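Q3: What should I do if processing fails?

Check error_log_*.txt to find the cause of the failure
The script skips failed molecules and continues with the rest
Failed molecules can be processed manually afterwards

Q4: How do I reprocess the data?

Delete the output folder: rm -rf output/ring16_fragments
Run the script again

Q5: What if I run out of memory?

The script processes molecules one at a time, so its memory footprint is small
If problems persist, modify the script to process the data in batches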

Follow-up Analysis

Once processing is complete, the following analyses are possible:

1. Count fragment diversity

import json
from pathlib import Path
from collections import Counter

# Collect the SMILES string of every fragment
all_fragments = []
for mol_dir in Path('output/ring16_fragments').iterdir():
    if mol_dir.is_dir() and mol_dir.name.startswith('ring16_mol_'):
        fragments_file = mol_dir / f"{mol_dir.name}_all_fragments.json"
        if fragments_file.exists():
            with open(fragments_file) as f:
                data = json.load(f)
                all_fragments.extend(frag['fragment_smiles'] for frag in data['fragments'])

# Summarize
fragment_counts = Counter(all_fragments)
print(f"Total fragments: {len(all_fragments)}")
print(f"Unique fragments: {len(fragment_counts)}")
print("\nTop 10 most common fragments:")
for smiles, count in fragment_counts.most_common(10):
    print(f"  {smiles}: {count} occurrences")

2. Analyze the position distribution

import json
from pathlib import Path
from collections import defaultdict

# Map each cleavage position to the fragment SMILES found there
position_fragments = defaultdict(list)

for mol_dir in Path('output/ring16_fragments').iterdir():
    if mol_dir.is_dir() and mol_dir.name.startswith('ring16_mol_'):
        fragments_file = mol_dir / f"{mol_dir.name}_all_fragments.json"
        if fragments_file.exists():
            with open(fragments_file) as f:
                data = json.load(f)
                for frag in data['fragments']:
                    position = frag['cleavage_position']
                    position_fragments[position].append(frag['fragment_smiles'])

# Report fragment diversity at each position
for pos in sorted(position_fragments):
    unique_frags = set(position_fragments[pos])
    print(f"Position {pos:2d}: {len(position_fragments[pos])} fragments, {len(unique_frags)} distinct structures")

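Performance Optimization

When processing large amounts of data, consider:

Parallel processing: use the multiprocessing module
Batch processing: split the dataset into several batches
Incremental processing: process only newly added molecules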
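
Batch and incremental processing can be combined in a few lines. The following is a minimal sketch, assuming that the presence of `metadata.json` in a molecule folder marks that molecule as already processed; the scripts do not document this, so treat it as an assumption. The helpers `batched` and `pending_indices` are hypothetical names, and the demo runs in a temporary directory rather than in `output/`.

```python
import tempfile
from itertools import islice
from pathlib import Path


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def pending_indices(indices, output_dir, ring_size=16):
    """Return the indices whose molecule folder has no metadata.json yet.

    Assumption: a metadata.json file marks a molecule as already processed.
    """
    root = Path(output_dir)
    return [
        idx for idx in indices
        if not (root / f"ring{ring_size}_mol_{idx}" / "metadata.json").exists()
    ]


# Demo in a throwaway directory where molecule 1 is already done.
with tempfile.TemporaryDirectory() as tmp:
    done = Path(tmp) / "ring16_mol_1"
    done.mkdir()
    (done / "metadata.json").write_text("{}")
    todo = pending_indices(range(5), tmp)
    batches = list(batched(todo, 2))

print(todo)     # [0, 2, 3, 4]
print(batches)  # [[0, 2], [3, 4]]
```

Each batch could then be handed to a worker from the `multiprocessing` module, provided the per-molecule processing function is picklable and writes only to its own molecule folder.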

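Contact and Support

If you run into problems, consult:

The error log files
BRIDGE_RING_ANALYSIS.md - detailed technical documentation
QUICK_GUIDE.md - quick reference guide

Last updated: 2025-11-06
Version: 1.0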
