Save filtering results

2025-08-12 18:45:33 +08:00
parent df72a0b9c0
commit e58f90cd1e
6 changed files with 25790 additions and 151 deletions

297
README.md

@@ -1,151 +1,36 @@
## Directory Structure
```shell
project_root/
├── input/
│ ├── receptors/
│ │ ├── TrpE_entry_1.pdb
│ │ └── TrpE_entry_1.pdbqt
│ ├── ligands/
│ │ ├── sdf/
│ │ │ ├── ligand_001.sdf
│ │ │ ├── ligand_002.sdf
│ │ │ └── ...
│ │ └── pdbqt/
│ │ ├── ligand_001.pdbqt
│ │ ├── ligand_002.pdbqt
│ │ └── ...
│ └── configs/
│ ├── TrpE_entry_1.box.txt
│ └── TrpE_entry_1.box.pdb
├── results/
│ ├── poses/
│ │ ├── ligand_001_out.pdbqt
│ │ ├── ligand_002_out.pdbqt
│ │ └── ...
│ └── scores/
│ ├── docking_scores.csv
│ └── summary_report.txt
└── scripts/
├── batch_prepare_ligands.sh
├── batch_docking.sh
└── analyze_results.py
```
```
.
├── config/ # Configuration files (box definitions, etc.)
├── ligand/ # Ligand files
│ └── pdbqt/ # Prepared ligand files in PDBQT format
├── receptor/ # Receptor files
├── result/ # Docking results
│ ├── fgbar/ # FgBar dataset results
│ │ └── poses_all/ # Individual docking results in SDF format
│ ├── trpe/ # TrpE dataset results
│ │ └── poses_all/ # Individual docking results in SDF format
│ └── refence/ # Reference molecule files
│ ├── fgbar/ # FgBar reference molecules
│ └── trpe/ # TrpE reference molecules
├── scripts/ # Analysis scripts and utilities
└── README.md # This file
```
## 1. Preparation
Before running the pipeline, you need to prepare the following files:
1. Protein structure file (PDB format)
2. Ligand library (MOL2 format, named according to the format CNPxxxxxx.1.mol2)
3. Configuration file (box.txt format, defining the docking box parameters)
## 2. Execution Steps
### 2.1 Protein Preparation
```bash
prepare_receptor4.py -r protein.pdb -o protein.pdbqt
```
### 2.2 Ligand Preparation
```bash
prepare_ligand4.py -l ligand.mol2 -o ligand.pdbqt
```
### 2.3 Docking Execution
```bash
vina --config box.txt --receptor protein.pdbqt --ligand ligand.pdbqt --out out.pdbqt
```
### 2.4 Result Format Conversion
Convert PDBQT format results to SDF format:
```bash
mk_export.py ./*_out.pdbqt --suffix _converted
```
## 3. Result Analysis
### 3.1 Calculate QED Properties
Calculate QED values for all molecules:
```bash
cd scripts
python calculate_qed_values.py
```
This script processes both the docked molecules in the poses_all directories and the reference molecules in the refence directories. It generates two CSV files:
- qed_values_fgbar.csv
- qed_values_trpe.csv
Each CSV file contains the following columns:
- smiles: SMILES representation of the molecule
- filename: Name of the source file
- qed: QED value of the molecule
- molecular_weight: Molecular weight of the molecule
- vina_scores: List of Vina scores for all conformers
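The column layout above can be read with the standard library alone. The following is a minimal sketch, assuming that `vina_scores` is stored as a list literal such as `"[-6.5, -6.1]"`; the sample rows and the `load_rows` helper are hypothetical and are not part of the repository scripts.

```python
import ast
import csv
import io

# Hypothetical sample rows; the real files are qed_values_fgbar.csv and qed_values_trpe.csv.
SAMPLE = '''smiles,filename,qed,molecular_weight,vina_scores
CCO,mol_a.sdf,0.41,46.07,"[-2.1, -1.9]"
c1ccccc1,mol_b.sdf,0.44,78.11,"[-3.4, -3.3, -3.0]"
'''


def load_rows(handle):
    """Read rows and convert the numeric columns (assumes a list-literal vina_scores)."""
    rows = []
    for row in csv.DictReader(handle):
        row["qed"] = float(row["qed"])
        row["molecular_weight"] = float(row["molecular_weight"])
        row["vina_scores"] = [float(s) for s in ast.literal_eval(row["vina_scores"])]
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
# The most negative Vina score is the best conformer of each molecule.
best = {r["filename"]: min(r["vina_scores"]) for r in rows}
print(best)
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper.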
### 3.2 Analyze QED and Molecular Weight Distribution
Analyze the distribution of QED and molecular weight properties and generate KDE plots:
```bash
python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe
```
This will generate four plots:
1. kde_distribution_fgbar_normalized.png - Normalized distribution for fgbar dataset
2. kde_distribution_fgbar_actual.png - Actual values distribution for fgbar dataset
3. kde_distribution_trpe_normalized.png - Normalized distribution for trpe dataset
4. kde_distribution_trpe_actual.png - Actual values distribution for trpe dataset
Each plot contains three distributions:
- QED distribution (blue)
- Molecular weight distribution (red)
- Vina score distribution (green)
Reference molecules are marked with different colored markers and labeled with their identifiers and corresponding values.
### 3.3 Advanced Analysis Options
You can also choose which conformation rank supplies the reference Vina score:
```bash
python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe --rank 0
```
The `--rank` option allows you to specify which conformation from the reference molecule's docking results to use for the Vina score reference.
- Rank 0 (default) uses the best scoring conformation (rank 1 in Vina results)
- Rank 1 uses the second best scoring conformation, and so on
The maximum valid rank is determined by the minimum number of conformations generated across all docked molecules. If you specify a rank that exceeds this minimum, the script will raise an error and inform you of the maximum valid rank.
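The rank check described above can be sketched as follows. This is an illustration only, assuming zero-based ranks and score lists ordered best first; `max_valid_rank` and `score_at_rank` are hypothetical names, not functions from `analyze_qed_mw_distribution.py`.

```python
def max_valid_rank(score_lists):
    """Return the largest zero-based rank present in every score list."""
    if not score_lists:
        raise ValueError("no molecules supplied")
    return min(len(scores) for scores in score_lists) - 1


def score_at_rank(scores, rank, limit):
    """Pick the score at a zero-based rank, rejecting ranks beyond the limit."""
    if not 0 <= rank <= limit:
        raise ValueError(f"rank {rank} is invalid; the maximum valid rank is {limit}")
    return scores[rank]


# Hypothetical Vina scores, best conformation first.
docked = [[-6.5, -6.1, -5.9], [-7.0, -6.8]]
limit = max_valid_rank(docked)
reference = [-5.3, -5.1, -4.8]
print(limit, score_at_rank(reference, 0, limit))
```

Because the limit comes from the molecule with the fewest conformations, a single molecule with one pose restricts the valid range to rank 0.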
You can also specify custom reference scores:
```bash
python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe --reference-scores '{"fgbar": {"9NY": -5.268}, "trpe": {"0GA": -6.531}}'
```
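The `--reference-scores` value is a JSON object keyed by dataset name and then by reference molecule identifier. A minimal sketch of parsing and validating such a string is shown here; `parse_reference_scores` is a hypothetical helper and may differ from how the script handles the option.

```python
import json


def parse_reference_scores(text):
    """Parse a {"dataset": {"reference_id": score}} JSON string into floats."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object keyed by dataset name")
    parsed = {}
    for dataset, refs in data.items():
        if not isinstance(refs, dict):
            raise ValueError(f"scores for {dataset!r} must be a JSON object")
        parsed[dataset] = {name: float(score) for name, score in refs.items()}
    return parsed


scores = parse_reference_scores('{"fgbar": {"9NY": -5.268}, "trpe": {"0GA": -6.531}}')
print(scores["trpe"]["0GA"])
```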
## 4. API Usage
The analysis functions can also be called directly from Python:
```python
import sys
sys.path.append('scripts')
from analyze_qed_mw_distribution import main_api
# Basic usage
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'])
# With custom reference scores
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'],
reference_scores={'fgbar': {'9NY': -5.268}, 'trpe': {'0GA': -6.531}})
# With specific conformation rank
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'], rank=0)
```
## 5. Output Files
The analysis generates several output files:
- CSV files with QED values and Vina scores for all molecules
- KDE distribution plots in both normalized and actual values formats
# AutoDock Vina Pipeline
This repository contains a complete pipeline for molecular docking using AutoDock Vina, including preparation, execution, result processing, and analysis.
## Preparing the Receptor PDBQT File
@@ -193,6 +78,29 @@ mk_prepare_receptor.py -i prot.pdb \
```
### Receptor Preparation Overview
```shell
mk_prepare_receptor.py -i xxx.pdb -o my_receptor -p -j -v
mk_prepare_receptor.py -i FgBar1_cut_proteinprep.pdb -o FgBar1_cut_proteinprep -p
```
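This generates:
- my_receptor.pdbqt (receptor file used for docking)
- my_receptor.json (structure metadata, useful for scripting)
- my_receptor.vina_box.txt (docking box parameters, for vina)
| Option | Purpose | Example output file |
| ---- | -------------------- | ------------------ |
| `-p` | Write the receptor PDBQT file | `xxx.pdbqt` |
| `-j` | Write the JSON metadata file | `xxx.json` |
| `-v` | Write the Vina box parameters | `xxx.vina_box.txt` |
| `-g` | Write a GPF file (for legacy AutoDock) | `xxx.gpf` |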
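When scripting around `mk_prepare_receptor.py`, it can help to know which files a given flag combination should leave behind. The sketch below only encodes the flag-to-suffix mapping from the table above; `expected_outputs` is a hypothetical helper and does not call Meeko.

```python
# Suffix written for each flag, copied from the table above.
FLAG_SUFFIX = {
    "-p": ".pdbqt",
    "-j": ".json",
    "-v": ".vina_box.txt",
    "-g": ".gpf",
}


def expected_outputs(basename, flags):
    """List the output file names expected for an -o basename and a set of flags."""
    unknown = [flag for flag in flags if flag not in FLAG_SUFFIX]
    if unknown:
        raise ValueError(f"unknown flags: {unknown}")
    return [basename + FLAG_SUFFIX[flag] for flag in flags]


print(expected_outputs("my_receptor", ["-p", "-j", "-v"]))
```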
## Preparing Small-Molecule 3D Conformers
Each small molecule needs an initial 3D conformer, stored in `ligand/sdf`.
@@ -263,7 +171,7 @@ vina --receptor input/receptors/TrpE_entry_1.pdbqt \
## Environment Setup
```shell
conda install -c conda-forge vina meeko rdkit joblib rich ipython parallel -y
conda install -c conda-forge vina meeko rdkit joblib rich ipython parallel openpyxl pandas mordred -y
```
@@ -321,7 +229,7 @@ mk_export.py vina_results.pdbqt -j my_receptor.json -s lig_docked.sdf -p rec_doc
## djob: Long-Running Batch Jobs
```
```shell
24562323 vina_job15 RUNNING lyzeng24 default default 2025/07/31 23:16:30 - agent-ARM-17
24562322 vina_job14 RUNNING lyzeng24 default default 2025/07/31 23:16:30 - agent-ARM-17
24562321 vina_job13 RUNNING lyzeng24 default default 2025/07/31 23:16:30 - agent-ARM-17
@@ -337,6 +245,31 @@ mk_export.py vina_results.pdbqt -j my_receptor.json -s lig_docked.sdf -p rec_doc
24562311 vina_job3 RUNNING lyzeng24 default default 2025/07/31 23:16:27 - agent-ARM-19
```
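## Preparing the plant_metabolit Dataset
These are the metabolites contained in the plant metabolic network previously built by Ruan Yao (阮耀), a senior lab member.
Run: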
```shell
cd /Users/lingyuzeng/Downloads/211.69.141.180/202508021824/vina/ligand/plant_meta
chmod +x run_convert_smiles.sh
./run_convert_smiles.sh
```
Output:
```shell
Conversion Summary:
Total SMILES processed: 8086
Successfully converted: 6238
Failed conversions: 1848
Skipped molecules (empty abbreviation): 0
Output directory: sdf
Success rate: 77.1%
Script execution completed.
```
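The reported success rate can be checked against the counts in the summary. This short sketch assumes the rate is the number of successful conversions divided by the total number of SMILES processed, which matches the figures above.

```python
# Counts copied from the conversion summary above.
total = 8086
converted = 6238
failed = 1848
skipped = 0

# Every processed SMILES should be accounted for exactly once.
assert converted + failed + skipped == total

rate = 100 * converted / total
print(f"Success rate: {rate:.1f}%")
```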
## Docking the Reference Molecules with AutoDock Vina
trpe:(PDB ID: 5cwa)
@@ -347,7 +280,7 @@ trpe:(PDB ID: 5cwa)
result:
```
```shell
AutoDock Vina v1.2.7
#################################################################
# If you used AutoDock Vina in your work, please cite: #
@@ -417,7 +350,7 @@ fgbar:PDB ID 8izd
result:
```
```shell
AutoDock Vina v1.2.7
#################################################################
# If you used AutoDock Vina in your work, please cite: #
@@ -484,14 +417,76 @@ mode | affinity | dist from best mode
## Analysis Strategy
### trpe
### trpeCOCUNT
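AutoDock Vina and QED: for trpe, which has a small pocket and low-molecular-weight ligands, filter by QED. ()
#### AutoDock Vina Screening
Filtering results:
1. For the trpe pocket
Because the trpe pocket and its reference molecule are small, first filter for small molecules (MW < 800).
For the AutoDock Vina score, use the align_5cwa_0GA_addH result as the reference and keep molecules scoring < -6.5.
From the remainder, select the top 100 molecules by QED rank as the final experimental candidates.
2. For the fgbar pocket
The reference molecule for the fgbar pocket is larger, so no MW filter is applied. Filter by QED (QED > 0.5) and by Vina score (< -5.2, based on rank 1 of the reference molecule align_8izd_F_9NY_addH), then select the top 100 ranked molecules.
AutoDock Vina and QED: for trpe, which has a small pocket and low-molecular-weight ligands, filter by QED.
Filtering results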
```shell
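Inspected the data structure of the two CSV files with the head command
Verified the integrity of the vina_scores column
trpe dataset: 1919 files have fewer than 20 conformations
fgbar dataset: 404 files have fewer than 20 conformations
The minimum number of conformations across all molecules is 1
Implemented the data filtering as specified in README.md:
TRPE filter: MW < 800 and Vina < -6.5
FGBAR filter: QED > 0.5 and Vina < -5.2
Generated the filtered result files:
/result/filtered_results/qed_values_fgbar_combined_filtered.csv (1878.1KB)
/result/filtered_results/qed_values_fgbar_top100.csv (27.6KB)
/result/filtered_results/qed_values_trpe_combined_filtered.csv (6090.1KB)
/result/filtered_results/qed_values_trpe_top100.csv (27.5KB)
Printed statistics:
TRPE statistics:
Total original records: 41166
Records after the QED filter only: 7229
Records after the Vina score filter only: 29728
Records satisfying both the QED and Vina score conditions: 18787
FGBAR statistics:
Total original records: 41166
Records after the QED filter only: 7228
Records after the Vina score filter only: 36111
Records satisfying both the QED and Vina score conditions: 6568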
```
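The two filters described in this section can be sketched in plain Python. This is an illustration under stated assumptions, not the script that produced the files above: it takes the best (most negative) Vina score per molecule, and it ranks the fgbar survivors by QED, which the text does not state explicitly. The sample rows are hypothetical.

```python
TRPE_MAX_MW = 800
TRPE_MAX_VINA = -6.5
FGBAR_MIN_QED = 0.5
FGBAR_MAX_VINA = -5.2


def best_vina(row):
    """Most negative Vina score among a molecule's conformations (assumption)."""
    return min(row["vina_scores"])


def top_by_qed(rows, top_n):
    """Sort by QED, highest first, and keep the first top_n rows."""
    return sorted(rows, key=lambda r: r["qed"], reverse=True)[:top_n]


def filter_trpe(rows, top_n=100):
    """trpe: MW < 800 and Vina < -6.5, then the top molecules by QED."""
    kept = [r for r in rows
            if r["molecular_weight"] < TRPE_MAX_MW and best_vina(r) < TRPE_MAX_VINA]
    return top_by_qed(kept, top_n)


def filter_fgbar(rows, top_n=100):
    """fgbar: QED > 0.5 and Vina < -5.2; ranking by QED is an assumption."""
    kept = [r for r in rows
            if r["qed"] > FGBAR_MIN_QED and best_vina(r) < FGBAR_MAX_VINA]
    return top_by_qed(kept, top_n)


# Hypothetical rows shaped like the qed_values_*.csv columns.
sample = [
    {"filename": "a.sdf", "qed": 0.72, "molecular_weight": 310.4, "vina_scores": [-7.1, -6.8]},
    {"filename": "b.sdf", "qed": 0.35, "molecular_weight": 455.0, "vina_scores": [-6.9]},
    {"filename": "c.sdf", "qed": 0.81, "molecular_weight": 912.3, "vina_scores": [-8.0]},
    {"filename": "d.sdf", "qed": 0.64, "molecular_weight": 280.1, "vina_scores": [-5.0]},
]
print([r["filename"] for r in filter_trpe(sample)])
print([r["filename"] for r in filter_fgbar(sample)])
```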
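#### karamadock Screening
Awaiting feedback on the structure results.
karamadock: only examine docking for the small molecules that remain after QED filtering (filter criteria: **small molecule**, QED)
glide: small molecule, QED (the 10,000 molecules with the best Vina scores, per the substrate criteria)
#### glide Screening
Previously the intersection under the substrate criteria was considered; here, all molecules remaining under the substrate criteria are docked with glide.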
`qed_values_fgbar_filtered.csv`
`qed_values_fgbar_filtered.csv`
---
### fgbar