first add

2025-11-14 20:34:58 +08:00
commit 0d99f7d12c
46 changed files with 698209 additions and 0 deletions
--- a/scripts/README.md
+++ b/scripts/README.md
@@ -0,0 +1,317 @@
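+# Batch Processing Scripts: Usage Guide
+
+## Overview
+
+This folder contains two batch processing scripts that number the ring atoms of macrolide molecules, cleave their side chains, and save the resulting fragments.
+
+## Scripts
+
+### 1. `batch_process_ring16.py` - Process 16-membered ring molecules
+
+**What it does:**
+- Processes the 1241 pre-filtered 16-membered ring molecules
+- Numbers the ring atoms of each molecule
+- Cleaves the side chains
+- Saves the results as JSON files
+
+**Input file:**
+- `ring16/temp_filtered_complete.csv`
+
+**Output folder:**
+- `output/ring16_fragments/`
+  - `ring16_mol_{idx}/` - one folder per molecule
+    - `ring16_mol_{idx}_all_fragments.json` - information on all fragments
+    - `{fragment_id}.json` - a single fragment
+    - `metadata.json` - molecule metadata
+  - `processing_log_*.txt` - processing log
+  - `error_log_*.txt` - error log
+  - `processing_statistics.json` - statistics
+
+**Usage:**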
+
+```bash
+cd /home/zly/project/macro_split
+
+# Option 1: run the Python script directly
+python scripts/batch_process_ring16.py
+
+# Option 2: run it as an executable
+chmod +x scripts/batch_process_ring16.py
+./scripts/batch_process_ring16.py
+```
+
+**Expected runtime:** 10-30 minutes (depending on machine performance)
+
+---
+
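+### 2. `batch_process_multi_rings.py` - Process 12- to 20-membered ring molecules
+
+**What it does:**
+- Processes all macrolide molecules with 12- to 20-membered rings
+- Automatically groups them by ring size (12-membered, 13-membered, 14-membered ... 20-membered)
+- Detects and excludes molecules that contain more than one lactone bond
+- Numbers and cleaves the molecules of each ring size separately
+- Saves the results to the corresponding folder
+
+**Input file:**
+- `data/ring12_20/temp.csv`
+
+**Output folders:**
+- `output/ring12_fragments/` - 12-membered ring fragments
+- `output/ring13_fragments/` - 13-membered ring fragments
+- `output/ring14_fragments/` - 14-membered ring fragments
+- ...
+- `output/ring20_fragments/` - 20-membered ring fragments
+
+Each folder contains:
+- `ring{N}_mol_{idx}/` - one folder per molecule
+  - `ring{N}_mol_{idx}_all_fragments.json`
+  - `{fragment_id}.json`
+  - `metadata.json`
+- `processing_log_*.txt` - processing log
+- `error_log_*.txt` - error log
+- `multiple_lactone_log_*.txt` - record of molecules with multiple lactone bonds
+- `processing_statistics.json` - statistics
+
+**Usage:**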
+
+```bash
+cd /home/zly/project/macro_split
+
+# Make sure the input file exists
+ls data/ring12_20/temp.csv
+
+# Run the script
+python scripts/batch_process_multi_rings.py
+```
+
+**Expected runtime:** roughly 30 minutes to 2 hours, depending on the size of the dataset
+
+---
+
+## Output Files
+
+### JSON file formats
+
+#### 1. `{parent_id}_all_fragments.json`
+
+Contains the complete information for all fragments of one molecule:
+
+```json
+{
+  "parent_id": "ring16_mol_0",
+  "parent_smiles": "...",
+  "fragments": [
+    {
+      "fragment_smiles": "CC(C)C",
+      "parent_smiles": "...",
+      "cleavage_position": 5,
+      "fragment_id": "ring16_mol_0_frag_0",
+      "parent_id": "ring16_mol_0",
+      "atom_count": 4,
+      "molecular_weight": 58.12
+    },
+    ...
+  ]
+}
+```
+
+#### 2. `{fragment_id}.json`
+
+Information for a single fragment:
+
+```json
+{
+  "fragment_smiles": "CC(C)C",
+  "parent_smiles": "...",
+  "cleavage_position": 5,
+  "fragment_id": "ring16_mol_0_frag_0",
+  "parent_id": "ring16_mol_0",
+  "atom_count": 4,
+  "molecular_weight": 58.12
+}
+```
+
+#### 3. `metadata.json`
+
+Metadata for the molecule:
+
+```json
+{
+  "parent_id": "ring16_mol_0",
+  "molecule_id": "CHEMBL94657",
+  "molecule_name": "PATUPILONE",
+  "smiles": "C/C(=C\\c1csc(C)n1)...",
+  "ring_size": 16,
+  "num_fragments": 11,
+  "processing_date": "2025-11-06T10:30:00"
+}
+```
+
+#### 4. `processing_statistics.json`
+
+Processing statistics:
+
+```json
+{
+  "ring_size": 16,
+  "total": 1241,
+  "success": 1200,
+  "failed_parse": 5,
+  "failed_numbering": 10,
+  "failed_validation": 8,
+  "failed_cleavage": 18,
+  "total_fragments": 12500
+}
+```
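+
+As a quick sanity check on these counts, the sketch below reads one statistics file, tests whether `success` plus all `failed_*` counters equals `total`, and computes the success rate. The helper name `summarize_stats` is hypothetical and not part of the scripts. Only the field names are taken from the example above, and whether the scripts guarantee that the counters add up is an assumption.
+
+```python
+import json
+from pathlib import Path
+
+
+def summarize_stats(stats_path):
+    """Return (counts_consistent, success_rate) for one processing_statistics.json."""
+    stats = json.loads(Path(stats_path).read_text(encoding="utf-8"))
+    # Sum every counter whose key starts with "failed_" (assumed naming scheme).
+    failed = sum(value for key, value in stats.items() if key.startswith("failed_"))
+    counts_consistent = stats["success"] + failed == stats["total"]
+    success_rate = stats["success"] / stats["total"] if stats["total"] else 0.0
+    return counts_consistent, success_rate
+
+
+if __name__ == "__main__":
+    path = Path("output/ring16_fragments/processing_statistics.json")
+    if path.exists():
+        consistent, rate = summarize_stats(path)
+        print(f"counts consistent: {consistent}, success rate: {rate:.1%}")
+```
+
+For the multi-ring script, the same helper could be called once per `output/ring{N}_fragments/` folder.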
+
+---
+
+## Log Files
+
+### 1. `processing_log_*.txt`
+
+Records every processing step:
+
+```
+[2025-11-06 10:30:00] Starting batch processing of 1241 16-membered ring molecules
+[2025-11-06 10:30:05] Molecule 0 processed successfully
+...
+[2025-11-06 11:00:00] Processing complete: 1200/1241 succeeded
+```
+
+### 2. `error_log_*.txt`
+
+Records every error:
+
+```
+[2025-11-06 10:35:12] ❌ Molecule 125 (CHEMBL12345) SMILES parsing failed
+[2025-11-06 10:36:45] ❌ Molecule 256 (CHEMBL67890) numbering validation failed
+```
+
+### 3. `multiple_lactone_log_*.txt`
+
+Records molecules that contain more than one lactone bond (multi_rings script only):
+
+```
+[2025-11-06 10:40:00] Molecule 350 (CHEMBL11111, Molecule Name) has multiple lactone bonds and was excluded. Lactone carbon indices: [15, 28, 42]
+```
+
+---
+
+## FAQ
+
+### Q1: How do I check processing progress?
+
+**A:** The script shows a live progress bar. You can also follow the log file:
+
+```bash
+tail -f output/ring16_fragments/processing_log_*.txt
+```
+
+### Q2: How do I view the results?
+
+**A:** Check the statistics file:
+
+```bash
+cat output/ring16_fragments/processing_statistics.json
+```
+
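+### Q3: What if processing fails?
+
+**A:** 
+1. Check `error_log_*.txt` for the cause of the failure
+2. The script skips failed molecules and continues with the rest
+3. Failed molecules can be processed manually
+
+### Q4: How do I rerun the processing?
+
+**A:** 
+1. Delete the output folder: `rm -rf output/ring16_fragments`
+2. Run the script again
+
+### Q5: What if I run out of memory?
+
+**A:** 
+- The script already processes molecules one at a time, so its memory footprint is small
+- If problems persist, modify the script to process the data in batches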
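+
+One way to batch the input is sketched below. It streams the CSV with the standard library and yields a fixed number of rows at a time, so only one batch is held in memory. The helper name `iter_batches` and the batch size are hypothetical, and the sketch assumes the input CSV has a header row; it does not reproduce how the scripts actually read their input.
+
+```python
+import csv
+import os
+from itertools import islice
+
+
+def iter_batches(csv_path, batch_size=200):
+    """Yield lists of at most batch_size rows (as dicts) from a CSV file."""
+    with open(csv_path, newline="", encoding="utf-8") as handle:
+        reader = csv.DictReader(handle)  # assumes the first line is a header
+        while True:
+            batch = list(islice(reader, batch_size))
+            if not batch:
+                break
+            yield batch
+
+
+if __name__ == "__main__":
+    source = "ring16/temp_filtered_complete.csv"
+    if os.path.exists(source):
+        for number, batch in enumerate(iter_batches(source)):
+            # Replace this print with the per-molecule processing step.
+            print(f"batch {number}: {len(batch)} molecules")
+```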
+
+---
+
+## Follow-up Analysis
+
+Once processing is complete, you can run analyses such as the following:
+
+### 1. Fragment diversity statistics
+
+```python
+import json
+from pathlib import Path
+from collections import Counter
+
+# Read all fragments
+all_fragments = []
+for mol_dir in Path('output/ring16_fragments').iterdir():
+    if mol_dir.is_dir() and mol_dir.name.startswith('ring16_mol_'):
+        fragments_file = mol_dir / f"{mol_dir.name}_all_fragments.json"
+        if fragments_file.exists():
+            with open(fragments_file) as f:
+                data = json.load(f)
+                all_fragments.extend([frag['fragment_smiles'] for frag in data['fragments']])
+
+# Count how often each fragment SMILES occurs
+fragment_counts = Counter(all_fragments)
+print(f"Total fragments: {len(all_fragments)}")
+print(f"Unique fragments: {len(fragment_counts)}")
+print("\nTop 10 most common fragments:")
+for smiles, count in fragment_counts.most_common(10):
+    print(f"  {smiles}: {count} occurrences")
+```
+
+### 2. Cleavage position distribution
+
+```python
+import json
+from collections import defaultdict
+from pathlib import Path
+
+position_fragments = defaultdict(list)
+
+for mol_dir in Path('output/ring16_fragments').iterdir():
+    if mol_dir.is_dir() and mol_dir.name.startswith('ring16_mol_'):
+        fragments_file = mol_dir / f"{mol_dir.name}_all_fragments.json"
+        if fragments_file.exists():
+            with open(fragments_file) as f:
+                data = json.load(f)
+                for frag in data['fragments']:
+                    position = frag['cleavage_position']
+                    position_fragments[position].append(frag['fragment_smiles'])
+
+# Report fragment diversity at each cleavage position
+for pos in sorted(position_fragments.keys()):
+    unique_frags = set(position_fragments[pos])
+    print(f"Position {pos:2d}: {len(position_fragments[pos])} fragments, {len(unique_frags)} distinct structures")
+```
+
+---
+
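+## Performance Optimization
+
+If you need to process large amounts of data, consider:
+
+1. **Parallel processing:** use the `multiprocessing` module
+2. **Batch processing:** split the dataset into multiple batches
+3. **Incremental processing:** process only newly added molecules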
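+
+Incremental processing could be sketched as follows. It assumes that a molecule counts as done once its folder contains `metadata.json`, and that molecule indices run from 0 to `total - 1`. Both are assumptions about the output layout described earlier, and the helper name `pending_indices` is hypothetical.
+
+```python
+from pathlib import Path
+
+
+def pending_indices(output_dir, ring_size, total):
+    """Return the molecule indices in range(total) that have no metadata.json yet."""
+    root = Path(output_dir)
+    return [
+        idx
+        for idx in range(total)
+        if not (root / f"ring{ring_size}_mol_{idx}" / "metadata.json").exists()
+    ]
+
+
+if __name__ == "__main__":
+    todo = pending_indices("output/ring16_fragments", ring_size=16, total=1241)
+    print(f"{len(todo)} molecules still to process")
+```
+
+Under these assumptions, molecules that failed in an earlier run would be listed again, because they have no `metadata.json` either.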
+
+---
+
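+## Contact and Support
+
+If you run into problems, consult:
+- The error log files
+- `BRIDGE_RING_ANALYSIS.md` - detailed technical documentation
+- `QUICK_GUIDE.md` - quick reference guide
+
+---
+
+**Last updated:** 2025-11-06  
+**Version:** 1.0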
+