first add
scripts/README.md
# Batch Processing Scripts - Usage Guide

## Overview

This folder contains two batch-processing scripts that number, cleave, and save macrolide molecules.

## Scripts

### 1. `batch_process_ring16.py` - 16-membered ring molecules

**What it does:**
- Processes the 1241 pre-filtered 16-membered ring molecules
- Numbers the ring atoms of each molecule
- Cleaves the side chains
- Saves the results as JSON files

**Input file:**
- `ring16/temp_filtered_complete.csv`

**Output folder:**
- `output/ring16_fragments/`
  - `ring16_mol_{idx}/` - one folder per molecule
    - `ring16_mol_{idx}_all_fragments.json` - all fragments of the molecule
    - `{fragment_id}.json` - a single fragment
    - `metadata.json` - molecule metadata
  - `processing_log_*.txt` - processing log
  - `error_log_*.txt` - error log
  - `processing_statistics.json` - statistics

**Usage:**

```bash
cd /home/zly/project/macro_split

# Option 1: run the Python script directly
python scripts/batch_process_ring16.py

# Option 2: run it as an executable (requires a shebang line at the top of the script)
chmod +x scripts/batch_process_ring16.py
./scripts/batch_process_ring16.py
```

**Expected runtime:** 10-30 minutes, depending on machine performance

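
---

### 2. `batch_process_multi_rings.py` - 12- to 20-membered ring molecules

**What it does:**
- Processes all macrolide molecules with 12- to 20-membered rings
- Classifies them automatically by ring size (12, 13, 14, ... 20)
- Detects and excludes molecules that contain more than one lactone bond
- Numbers and cleaves each ring size separately
- Saves the results to the matching folder
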
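The classification and exclusion steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the script's actual code: it assumes each molecule has already been reduced to a record with a `ring_size` and a list of `lactone_carbons` (atom indices of lactone carbons). Both field names, and the function `bucket_by_ring_size`, are hypothetical.

```python
from collections import defaultdict


def bucket_by_ring_size(records, min_size=12, max_size=20):
    """Group records by ring size and set aside multi-lactone molecules.

    `records` is an iterable of dicts with the hypothetical keys
    `ring_size` (int) and `lactone_carbons` (list of atom indices).
    Returns a tuple (buckets, rejected).
    """
    buckets = defaultdict(list)
    rejected = []
    for record in records:
        if len(record["lactone_carbons"]) > 1:
            # More than one lactone bond: exclude, but keep for the log.
            rejected.append(record)
        elif min_size <= record["ring_size"] <= max_size:
            buckets[record["ring_size"]].append(record)
        # Records outside the 12-20 range are silently ignored here.
    return dict(buckets), rejected


# Made-up records, for illustration only.
records = [
    {"id": "mol_a", "ring_size": 14, "lactone_carbons": [3]},
    {"id": "mol_b", "ring_size": 16, "lactone_carbons": [15, 28, 42]},
    {"id": "mol_c", "ring_size": 16, "lactone_carbons": [7]},
]
buckets, rejected = bucket_by_ring_size(records)
print(sorted(buckets))              # ring sizes that received molecules
print([r["id"] for r in rejected])  # molecules excluded as multi-lactone
```

Checking the lactone count before the ring size means a multi-lactone molecule is always reported as rejected, whatever its ring size.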

**Input file:**
- `data/ring12_20/temp.csv`

**Output folders:**
- `output/ring12_fragments/` - 12-membered ring fragments
- `output/ring13_fragments/` - 13-membered ring fragments
- `output/ring14_fragments/` - 14-membered ring fragments
- ...
- `output/ring20_fragments/` - 20-membered ring fragments

Each folder contains:
- `ring{N}_mol_{idx}/` - one folder per molecule
  - `ring{N}_mol_{idx}_all_fragments.json`
  - `{fragment_id}.json`
  - `metadata.json`
- `processing_log_*.txt` - processing log
- `error_log_*.txt` - error log
- `multiple_lactone_log_*.txt` - record of molecules with multiple lactone bonds
- `processing_statistics.json` - statistics

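To get a quick overview of what a run produced, the folder layout described above can be walked with `pathlib`. The sketch below assumes only the naming scheme `output/ring{N}_fragments/ring{N}_mol_{idx}/`; the helper name `summarize_outputs` is hypothetical.

```python
from pathlib import Path


def summarize_outputs(output_root, ring_sizes=range(12, 21)):
    """Count molecule folders under <output_root>/ring{N}_fragments/ for each N."""
    counts = {}
    for n in ring_sizes:
        ring_dir = Path(output_root) / f"ring{n}_fragments"
        if not ring_dir.is_dir():
            continue  # nothing was written for this ring size
        counts[n] = sum(
            1
            for child in ring_dir.iterdir()
            if child.is_dir() and child.name.startswith(f"ring{n}_mol_")
        )
    return counts


if __name__ == "__main__":
    for ring_size, n_molecules in summarize_outputs("output").items():
        print(f"ring{ring_size}_fragments: {n_molecules} molecule folders")
```

Ring sizes with no output folder are left out of the result rather than reported as zero, so a missing folder can be told apart from an empty one.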

**Usage:**

```bash
cd /home/zly/project/macro_split

# Make sure the input file exists
ls data/ring12_20/temp.csv

# Run the script
python scripts/batch_process_multi_rings.py
```

**Expected runtime:** 30 minutes to 2 hours, depending on the size of the dataset


---

## Output files

### JSON file formats

#### 1. `{parent_id}_all_fragments.json`

Contains the complete information for every fragment of one molecule (`...` marks elided content):

```json
{
  "parent_id": "ring16_mol_0",
  "parent_smiles": "...",
  "fragments": [
    {
      "fragment_smiles": "CC(C)C",
      "parent_smiles": "...",
      "cleavage_position": 5,
      "fragment_id": "ring16_mol_0_frag_0",
      "parent_id": "ring16_mol_0",
      "atom_count": 4,
      "molecular_weight": 58.12
    },
    ...
  ]
}
```

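When consuming these files downstream, a light sanity check can catch truncated or hand-edited output. The following sketch assumes that every fragment carries the seven keys shown in the example and repeats the top-level `parent_id`; neither is stated as a guarantee by the scripts, and `check_fragments_file` is a hypothetical helper.

```python
REQUIRED_FRAGMENT_KEYS = {
    "fragment_smiles", "parent_smiles", "cleavage_position",
    "fragment_id", "parent_id", "atom_count", "molecular_weight",
}


def check_fragments_file(data):
    """Return a list of problems found in a parsed *_all_fragments.json dict."""
    problems = []
    for i, frag in enumerate(data.get("fragments", [])):
        missing = REQUIRED_FRAGMENT_KEYS - set(frag)
        if missing:
            problems.append(f"fragment {i}: missing keys {sorted(missing)}")
        if frag.get("parent_id") != data.get("parent_id"):
            problems.append(f"fragment {i}: parent_id does not match the file")
    return problems


# The example record from this README, as a Python dict.
sample = {
    "parent_id": "ring16_mol_0",
    "parent_smiles": "...",
    "fragments": [
        {
            "fragment_smiles": "CC(C)C",
            "parent_smiles": "...",
            "cleavage_position": 5,
            "fragment_id": "ring16_mol_0_frag_0",
            "parent_id": "ring16_mol_0",
            "atom_count": 4,
            "molecular_weight": 58.12,
        }
    ],
}
print(check_fragments_file(sample))  # an empty list means no problems were found
```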

#### 2. `{fragment_id}.json`

Information about a single fragment:

```json
{
  "fragment_smiles": "CC(C)C",
  "parent_smiles": "...",
  "cleavage_position": 5,
  "fragment_id": "ring16_mol_0_frag_0",
  "parent_id": "ring16_mol_0",
  "atom_count": 4,
  "molecular_weight": 58.12
}
```

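The example IDs suggest the naming pattern `ring{N}_mol_{idx}_frag_{k}`. Assuming that pattern holds for every fragment (it is inferred from the examples here, not documented), a fragment ID can be split back into its parts. The helper `parse_fragment_id` is hypothetical.

```python
import re

# Pattern inferred from the example "ring16_mol_0_frag_0".
FRAGMENT_ID_RE = re.compile(r"ring(\d+)_mol_(\d+)_frag_(\d+)")


def parse_fragment_id(fragment_id):
    """Split a fragment ID into ring size, molecule index and fragment index."""
    match = FRAGMENT_ID_RE.fullmatch(fragment_id)
    if match is None:
        raise ValueError(f"unexpected fragment ID: {fragment_id!r}")
    ring_size, mol_idx, frag_idx = (int(group) for group in match.groups())
    return {
        "ring_size": ring_size,
        "molecule_index": mol_idx,
        "fragment_index": frag_idx,
        "parent_id": f"ring{ring_size}_mol_{mol_idx}",
    }


print(parse_fragment_id("ring16_mol_0_frag_0"))
```

Raising on an unexpected ID, rather than returning `None`, makes a change in the naming scheme fail loudly.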

#### 3. `metadata.json`

Metadata for one molecule:

```json
{
  "parent_id": "ring16_mol_0",
  "molecule_id": "CHEMBL94657",
  "molecule_name": "PATUPILONE",
  "smiles": "C/C(=C\\c1csc(C)n1)...",
  "ring_size": 16,
  "num_fragments": 11,
  "processing_date": "2025-11-06T10:30:00"
}
```

#### 4. `processing_statistics.json`

Processing statistics for one run:

```json
{
  "ring_size": 16,
  "total": 1241,
  "success": 1200,
  "failed_parse": 5,
  "failed_numbering": 10,
  "failed_validation": 8,
  "failed_cleavage": 18,
  "total_fragments": 12500
}
```

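In the example above the counters add up: 1200 successes plus 5 + 10 + 8 + 18 = 41 failures give the total of 1241. The sketch below turns that observation into a check. It assumes the `failed_*` counters are mutually exclusive, which the example is consistent with but which is not stated explicitly; `summarize_statistics` is a hypothetical helper.

```python
def summarize_statistics(stats):
    """Derive failure count, unaccounted molecules and rates from the statistics."""
    failed = sum(value for key, value in stats.items() if key.startswith("failed_"))
    total = stats["total"]
    success = stats["success"]
    return {
        "failed": failed,
        # Non-zero means some molecules are neither a success nor a counted failure.
        "unaccounted": total - success - failed,
        "success_rate": success / total if total else 0.0,
        "fragments_per_molecule": stats["total_fragments"] / success if success else 0.0,
    }


# The example values from this README.
stats = {
    "ring_size": 16,
    "total": 1241,
    "success": 1200,
    "failed_parse": 5,
    "failed_numbering": 10,
    "failed_validation": 8,
    "failed_cleavage": 18,
    "total_fragments": 12500,
}
summary = summarize_statistics(stats)
print(f"failed: {summary['failed']}, unaccounted: {summary['unaccounted']}")
print(f"success rate: {summary['success_rate']:.1%}")
```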

---

## Log files

### 1. `processing_log_*.txt`

Records every processing step. The messages are written in Chinese; the sample below reads "Starting batch processing of 1241 16-membered ring molecules", "Molecule 0 processed successfully", and "Processing finished: 1200/1241 succeeded":

```
[2025-11-06 10:30:00] 开始批量处理 1241 个16元环分子
[2025-11-06 10:30:05] 分子 0 处理成功
...
[2025-11-06 11:00:00] 处理完成: 成功 1200/1241
```

### 2. `error_log_*.txt`

Records every error. In the sample below, molecule 125 failed SMILES parsing ("SMILES解析失败") and molecule 256 failed numbering validation ("编号验证失败"):

```
[2025-11-06 10:35:12] ❌ 分子 125 (CHEMBL12345) SMILES解析失败
[2025-11-06 10:36:45] ❌ 分子 256 (CHEMBL67890) 编号验证失败
```

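To collect the failed molecules for manual follow-up, the error log can be parsed with a regular expression. This sketch assumes every error line follows the layout of the two sample lines (timestamp in brackets, then the molecule index, then the molecule ID in parentheses, then the message); the names `ERROR_LINE_RE` and `parse_error_log` are hypothetical.

```python
import re

# Language-independent pattern: "[timestamp] ... <index> (<molecule id>) <message>".
ERROR_LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\].*?(?P<index>\d+) \((?P<molecule_id>[^)]+)\)"
    r"\s*(?P<message>.*)$"
)


def parse_error_log(lines):
    """Extract one record per recognizable error line; skip everything else."""
    records = []
    for line in lines:
        match = ERROR_LINE_RE.match(line.strip())
        if match:
            record = match.groupdict()
            record["index"] = int(record["index"])
            records.append(record)
    return records


# The two sample lines from this README.
sample = """\
[2025-11-06 10:35:12] ❌ 分子 125 (CHEMBL12345) SMILES解析失败
[2025-11-06 10:36:45] ❌ 分子 256 (CHEMBL67890) 编号验证失败
"""
failed = parse_error_log(sample.splitlines())
print([(record["index"], record["molecule_id"]) for record in failed])
```

The pattern keys on the brackets, digits and parentheses rather than on the Chinese words, so it keeps working if the message wording changes.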
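
### 3. `multiple_lactone_log_*.txt`

Records molecules that contain more than one lactone bond (written by the multi-ring script only). The sample line reads "Molecule 350 (CHEMBL11111, Molecule Name) has multiple lactone bonds and was excluded. Lactone carbon indices: [15, 28, 42]":

```
[2025-11-06 10:40:00] 分子 350 (CHEMBL11111, Molecule Name) 有多个内酯键,已剔除。内酯碳索引: [15, 28, 42]
```
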
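The lactone carbon indices at the end of each line can be recovered for later inspection. The sketch below assumes the line layout of the single sample above, including a molecule name without parentheses and a trailing bracketed index list; `parse_lactone_line` is a hypothetical helper.

```python
import re

LACTONE_LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\].*?(?P<index>\d+) "
    r"\((?P<molecule_id>[^,)]+),\s*(?P<name>.*?)\)"
    r".*\[(?P<carbons>[\d,\s]*)\]\s*$"
)


def parse_lactone_line(line):
    """Return index, IDs and lactone carbon indices from one log line, or None."""
    match = LACTONE_LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "index": int(match["index"]),
        "molecule_id": match["molecule_id"],
        "molecule_name": match["name"],
        "lactone_carbons": [
            int(item) for item in match["carbons"].split(",") if item.strip()
        ],
    }


# The sample line from this README.
line = (
    "[2025-11-06 10:40:00] 分子 350 (CHEMBL11111, Molecule Name) "
    "有多个内酯键,已剔除。内酯碳索引: [15, 28, 42]"
)
record = parse_lactone_line(line)
print(record["molecule_id"], record["lactone_carbons"])
```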

---

## FAQ

### Q1: How do I check the progress?

**A:** The script shows a live progress bar. You can also follow the log file:

```bash
tail -f output/ring16_fragments/processing_log_*.txt
```

### Q2: How do I view the results?

**A:** Open the statistics file:

```bash
cat output/ring16_fragments/processing_statistics.json
```

### Q3: What if processing fails?

**A:**
1. Check `error_log_*.txt` for the cause of the failure
2. The script skips failed molecules and continues with the rest
3. Failed molecules can be processed manually afterwards

### Q4: How do I reprocess everything?

**A:**
1. Delete the output folder: `rm -rf output/ring16_fragments`
2. Run the script again

### Q5: What if I run out of memory?

**A:**
- The script already processes molecules one at a time, so memory usage is small
- If problems persist, modify the script to process the data in batches

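One way to process in batches is to stream the input CSV instead of loading it whole. The following is a minimal sketch, assuming the CSV has a header row (required by `csv.DictReader`); the function names and the default batch size are hypothetical and are not taken from the scripts.

```python
import csv
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most `batch_size` items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def iter_csv_batches(csv_path, batch_size=200):
    """Stream a CSV file in batches without loading it all into memory."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        yield from iter_batches(csv.DictReader(handle), batch_size)


# In-memory demonstration of the batching logic.
print(list(iter_batches(range(5), 2)))
```

Only one batch of rows is held in memory at a time, so peak memory depends on `batch_size` rather than on the size of the file.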

---

## Follow-up analysis

Once processing has finished, the results can be analyzed as follows.

### 1. Fragment diversity

```python
import json
from pathlib import Path
from collections import Counter

# Collect the SMILES string of every fragment
all_fragments = []
for mol_dir in Path('output/ring16_fragments').iterdir():
    if mol_dir.is_dir() and mol_dir.name.startswith('ring16_mol_'):
        fragments_file = mol_dir / f"{mol_dir.name}_all_fragments.json"
        if fragments_file.exists():
            with open(fragments_file, encoding='utf-8') as f:
                data = json.load(f)
            all_fragments.extend(frag['fragment_smiles'] for frag in data['fragments'])

# Summarize. Identical structures are merged only if their SMILES strings match exactly.
fragment_counts = Counter(all_fragments)
print(f"Total fragments: {len(all_fragments)}")
print(f"Unique fragments: {len(fragment_counts)}")
print("\nTop 10 most common fragments:")
for smiles, count in fragment_counts.most_common(10):
    print(f"  {smiles}: {count}")
```

### 2. Distribution by cleavage position

```python
import json
from collections import defaultdict
from pathlib import Path

# Map each cleavage position to the fragments found there
position_fragments = defaultdict(list)

for mol_dir in Path('output/ring16_fragments').iterdir():
    if mol_dir.is_dir() and mol_dir.name.startswith('ring16_mol_'):
        fragments_file = mol_dir / f"{mol_dir.name}_all_fragments.json"
        if fragments_file.exists():
            with open(fragments_file, encoding='utf-8') as f:
                data = json.load(f)
            for frag in data['fragments']:
                position = frag['cleavage_position']
                position_fragments[position].append(frag['fragment_smiles'])

# Report fragment diversity at each position
for pos in sorted(position_fragments.keys()):
    unique_frags = set(position_fragments[pos])
    print(f"Position {pos:2d}: {len(position_fragments[pos])} fragments, {len(unique_frags)} distinct structures")
```


---

## Performance tuning

For large datasets, consider:

1. **Parallel processing:** use the `multiprocessing` module
2. **Batch processing:** split the dataset into several batches
3. **Incremental processing:** process only newly added molecules

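Incremental processing can be approximated by skipping molecules whose output already exists. The sketch below treats the presence of `metadata.json` in a molecule folder as the completion marker. That is an assumption made for illustration, since the scripts do not document when this file is written, and `pending_molecules` is a hypothetical helper.

```python
from pathlib import Path


def pending_molecules(indices, output_dir, ring_size):
    """Return the molecule indices that have no finished output yet.

    A molecule counts as finished when its folder contains `metadata.json`
    (an assumed completion marker, not a documented guarantee).
    """
    output_dir = Path(output_dir)
    pending = []
    for idx in indices:
        marker = output_dir / f"ring{ring_size}_mol_{idx}" / "metadata.json"
        if not marker.exists():
            pending.append(idx)
    return pending


if __name__ == "__main__":
    todo = pending_molecules(range(1241), "output/ring16_fragments", 16)
    print(f"{len(todo)} molecules still to process")
```

The resulting list could then be split into batches or handed to `multiprocessing.Pool.map`, which covers the other two options in the list above.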

---

## Contact and support

If you run into problems, check:
- The error log files
- `BRIDGE_RING_ANALYSIS.md` - detailed technical documentation
- `QUICK_GUIDE.md` - quick reference guide

---

**Last updated:** 2025-11-06
**Version:** 1.0