1. 目录结构调整: - 创建scripts目录统一存放分析脚本 - 保持数据文件在原有目录结构中 - 生成的CSV文件和PNG图表文件也放在scripts目录下 2. 功能改进: - 更新calculate_qed_values.py脚本,添加对参考分子SDF文件的处理 - 修改analyze_qed_mw_distribution.py脚本,统一使用SDF文件stem名称作为参考分子标识符 - 改进Vina得分提取逻辑,支持从SDF文件中提取所有构象的得分 - 完善KDE分布图绘制,确保参考分子在所有图表中显示统一的名称 3. 文档更新: - 更新README.md中的目录结构说明 - 更新命令行和API使用示例 - 添加详细的使用说明和示例 4. 示例代码: - 更新example_api_usage.py以适应新的目录结构和API调用方式
500 lines
18 KiB
Markdown
500 lines
18 KiB
Markdown
## 目录结构
|
||
|
||
```
|
||
.
|
||
├── config/ # Configuration files (box definitions, etc.)
|
||
├── ligand/ # Ligand files
|
||
│ └── pdbqt/ # Prepared ligand files in PDBQT format
|
||
├── receptor/ # Receptor files
|
||
├── result/ # Docking results
|
||
│ ├── fgbar/ # FgBar dataset results
|
||
│ │ └── poses_all/ # Individual docking results in SDF format
|
||
│ ├── trpe/ # TrpE dataset results
|
||
│ │ └── poses_all/ # Individual docking results in SDF format
|
||
│ └── refence/ # Reference molecule files
|
||
│ ├── fgbar/ # FgBar reference molecules
|
||
│ └── trpe/ # TrpE reference molecules
|
||
├── scripts/ # Analysis scripts and utilities
|
||
└── README.md # This file
|
||
```
|
||
|
||
## 1. Preparation
|
||
|
||
Before running the pipeline, you need to prepare the following files:
|
||
|
||
1. Protein structure file (PDB format)
|
||
2. Ligand library (MOL2 format, named according to the format CNPxxxxxx.1.mol2)
|
||
3. Configuration file (box.txt format, defining the docking box parameters)
|
||
|
||
## 2. Execution Steps
|
||
|
||
### 2.1 Protein Preparation
|
||
|
||
```bash
|
||
prepare_receptor4.py -r protein.pdb -o protein.pdbqt
|
||
```
|
||
|
||
### 2.2 Ligand Preparation
|
||
|
||
```bash
|
||
prepare_ligand4.py -l ligand.mol2 -o ligand.pdbqt
|
||
```
|
||
|
||
### 2.3 Docking Execution
|
||
|
||
```bash
|
||
vina --config box.txt --receptor protein.pdbqt --ligand ligand.pdbqt --out out.pdbqt
|
||
```
|
||
|
||
### 2.4 Result Format Conversion
|
||
|
||
Convert PDBQT format results to SDF format:
|
||
|
||
```bash
|
||
mk_export.py ./*_out.pdbqt --suffix _converted
|
||
```
|
||
|
||
## 3. Result Analysis
|
||
|
||
### 3.1 Calculate QED Properties
|
||
|
||
Calculate QED values for all molecules:
|
||
|
||
```bash
|
||
cd scripts
|
||
python calculate_qed_values.py
|
||
```
|
||
|
||
This script processes both the docked molecules in the poses_all directories and the reference molecules in the refence directories. It generates two CSV files:
|
||
- qed_values_fgbar.csv
|
||
- qed_values_trpe.csv
|
||
|
||
Each CSV file contains the following columns:
|
||
- smiles: SMILES representation of the molecule
|
||
- filename: Name of the source file
|
||
- qed: QED value of the molecule
|
||
- molecular_weight: Molecular weight of the molecule
|
||
- vina_scores: List of Vina scores for all conformers
|
||
|
||
### 3.2 Analyze QED and Molecular Weight Distribution
|
||
|
||
Analyze the distribution of QED and molecular weight properties and generate KDE plots:
|
||
|
||
```bash
|
||
python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe
|
||
```
|
||
|
||
This will generate four plots:
|
||
1. kde_distribution_fgbar_normalized.png - Normalized distribution for fgbar dataset
|
||
2. kde_distribution_fgbar_actual.png - Actual values distribution for fgbar dataset
|
||
3. kde_distribution_trpe_normalized.png - Normalized distribution for trpe dataset
|
||
4. kde_distribution_trpe_actual.png - Actual values distribution for trpe dataset
|
||
|
||
Each plot contains three distributions:
|
||
- QED distribution (blue)
|
||
- Molecular weight distribution (red)
|
||
- Vina score distribution (green)
|
||
|
||
Reference molecules are marked with different colored markers and labeled with their identifiers and corresponding values.
|
||
|
||
### 3.3 Advanced Analysis Options
|
||
|
||
You can also specify custom reference scores and conformation rank:
|
||
|
||
```bash
|
||
python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe --rank 0
|
||
```
|
||
|
||
The `--rank` option allows you to specify which conformation from the reference molecule's docking results to use for the Vina score reference.
|
||
- Rank 0 (default) uses the best scoring conformation (rank 1 in Vina results)
|
||
- Rank 1 uses the second best scoring conformation, and so on
|
||
|
||
The maximum valid rank is determined by the minimum number of conformations generated across all docked molecules. If you specify a rank that exceeds this minimum, the script will raise an error and inform you of the maximum valid rank.
|
||
|
||
You can also specify custom reference scores:
|
||
|
||
```bash
|
||
python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe --reference-scores '{"fgbar": {"9NY": -5.268}, "trpe": {"0GA": -6.531}}'
|
||
```
|
||
|
||
## 4. API Usage
|
||
|
||
The analysis functions can also be called directly from Python:
|
||
|
||
```python
|
||
import sys
|
||
sys.path.append('scripts')
|
||
from analyze_qed_mw_distribution import main_api
|
||
|
||
# Basic usage
|
||
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'])
|
||
|
||
# With custom reference scores
|
||
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'],
|
||
reference_scores={'fgbar': {'9NY': -5.268}, 'trpe': {'0GA': -6.531}})
|
||
|
||
# With specific conformation rank
|
||
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'], rank=0)
|
||
```
|
||
|
||
## 5. Output Files
|
||
|
||
The analysis generates several output files:
|
||
- CSV files with QED values and Vina scores for all molecules
|
||
- KDE distribution plots in both normalized and actual values formats
|
||
|
||
# AutoDock Vina Pipeline
|
||
|
||
This repository contains a complete pipeline for molecular docking using AutoDock Vina, including preparation, execution, result processing, and analysis.
|
||
|
||
## 受体准备 pdbqt 文件
|
||
|
||
使用 alphafold 预测 pdb 文件 cif 文件。
|
||
|
||
修复使用 moderller 同源建模,或者 pdbfixer,MOE,maestro 等
|
||
|
||
这里使用 maestro 的 `Protein reparation Workflow` 模块
|
||
|
||
然后导出 pdb 文件
|
||
|
||
使用 meeko 准备受体文件 pdbqt 文件,详细可以[参考](https://meeko.readthedocs.io/en/release-doc/rec_overview.html)
|
||
|
||
```shell
|
||
micromamba run -n vina mk_prepare_receptor.py -i receptor/FgBar1_cut_proteinprep.pdb --write_pdbqt receptor/FgBar1_cut_proteinprep.pdbqt
|
||
```
|
||
|
||
选项组合用法
|
||
|
||
### 举例1:用默认输出名生成 pdbqt 和 vina box 配置
|
||
|
||
mk_prepare_receptor.py -i 1abc.pdb -o 1abc_clean --write_pdbqt --write_vina_box
|
||
|
||
得到 1abc_clean_rigid.pdbqt, 1abc_clean.vina.txt
|
||
|
||
### 举例2:为指定残基设置模板/柔性,并生成 box 配置
|
||
|
||
```shell
|
||
mk_prepare_receptor.py -i system.pdb \
|
||
--output_basename system_prep \
|
||
-f "A:42,B:23" \
|
||
-n "A:5,7=CYX,B:17=HID" \
|
||
--write_pdbqt --write_vina_box
|
||
```
|
||
|
||
|
||
### 举例3:自动包络某配体生成 box 配置
|
||
|
||
```
|
||
mk_prepare_receptor.py -i prot.pdb \
|
||
--box_enveloping ligand.pdb \
|
||
--padding 3.0 \
|
||
--output_basename dock_ready \
|
||
--write_pdbqt --write_vina_box
|
||
```
|
||
|
||
|
||
## 小分子 3D 构象准备
|
||
|
||
需要给小分子一个初始化的 3d 构象存放到`ligand/sdf`
|
||
|
||
```shell
|
||
python sdf2to3d.py --src_dir ./2d_sdf_dir --out_dir ./3d_sdf_dir --n_jobs 8
|
||
```
|
||
|
||
## 小分子格式转化
|
||
|
||
使用 meeko 将 `ligand/sdf` 转为 `ligand/pdbqt`
|
||
|
||
```shell
|
||
micromamba run -n vina ./scripts/batch_prepare_ligands.sh ligands/sdf ligands/pdbqt/ batch_prepare_ligands.log 128
|
||
```
|
||
|
||
## 小分子批量提交对接
|
||
|
||
分割小分子文件将 ligand 目录里面的 pdbqt 文件夹拆分 n 个子文件夹(pdbqt1,pdbqt2,pdbqt3...pdbqtn)
|
||
|
||
```shell
|
||
micromamba run -n vina python vina_split_and_submit.py <split_number_n>
|
||
```
|
||
|
||
执行完成后会自动使用 dsub 命令将对接任务提交给华为多瑙调度系统
|
||
|
||
需要注意有时候提交执行速度过快可能有批次遗漏,可以在合并时候检查
|
||
|
||
## 对接结果合并
|
||
|
||
在对接完成之后会在 `result` 文件夹里面创建 n 个对接结果文件夹(poses1,poses2,poses3...posesn)
|
||
|
||
每个文件夹中都有对应的`*_out.pdbqt`文件与`*_converted.sdf`文件,调用
|
||
|
||
```shell
|
||
micromamba run -n vina python vina_merge_and_check.py --n_splits <split_number_n> --out_dir ./result --output_prefix poses --poses_dir ./result/poses_all
|
||
```
|
||
|
||
会将所有的n 个对接结果文件夹中`*_converted.sdf`文件存放到 `./result/poses_all` 目录,同时会检测是否有提交时候过快导致遗漏某个批次没有对接,需要注意查看。
|
||
|
||
## 分析对接结果
|
||
|
||
在`*_converted.sdf`文件中存在`20`个对接构象,取决于`scripts/batch_docking.sh` 中 `NUM_MODES` 设置多少数目,默认设置为 20。
|
||
|
||
其中每个 sdf 构象存在下面的`<meeko>`字段 用于获取对接打分等属性用于后续筛选分子。
|
||
|
||
```
|
||
> <meeko> (20)
|
||
{"is_sidechain": [false], "free_energy": -6.38, "intermolecular_energy": -15.695, "internal_energy": -2.912}
|
||
```
|
||
|
||
## batch 模式对接
|
||
|
||
vina=1.2.7可以使用batch 模式进行批量对接。
|
||
|
||
```shell
|
||
mkdir -p results/poses
|
||
vina --receptor input/receptors/TrpE_entry_1.pdbqt \
|
||
--batch input/ligands/test \
|
||
--config ./configs/TrpE_entry_1.box.txt \
|
||
--dir results/poses \
|
||
--exhaustiveness=32
|
||
|
||
# 使用脚本对接
|
||
./scripts/batch_docking.sh ./receptors/TrpE_entry_1.pdbqt ./config/TrpE_entry_1.box.txt ligands/test output test.log /share/home/lyzeng24/rdkit_script/vina/vina
|
||
```
|
||
|
||
## 环境安装
|
||
|
||
```shell
|
||
conda install -c conda-forge vina meeko rdkit joblib rich ipython parallel -y
|
||
```
|
||
|
||
|
||
## 准备小分子pdbqt
|
||
|
||
```shell
|
||
# 单个配体准备
|
||
mk_prepare_ligand.py -i molecule.sdf -o molecule.pdbqt
|
||
|
||
# 批量准备
|
||
micromamba run -n vina ./scripts/batch_prepare_ligands.sh ligands/sdf ligands/pdbqt/ batch_prepare_ligands.log 128
|
||
|
||
#监控文件
|
||
watch -n 1 "ls -l pdbqt/*.pdbqt 2>/dev/null | wc -l"
|
||
```
|
||
|
||
## 准备受体pdbqt
|
||
|
||
```shell
|
||
# 受体准备(带柔性侧链)
|
||
mk_prepare_receptor.py -i nucleic_acid.cif -o my_receptor -j -p -f A:42
|
||
```
|
||
|
||
## batch对接模式
|
||
|
||
```shell
|
||
./scripts/batch_docking.sh input/receptors/TrpE_entry_1.pdbqt \
|
||
input/configs/TrpE_entry_1.box.txt \
|
||
input/ligands/pdbqt \
|
||
results/poses \
|
||
results/batch_docking.log
|
||
```
|
||
|
||
## 监控对接结果
|
||
|
||
```shell
|
||
watch -n 1 'for i in {1..12}; do printf "poses$i: "; ls results/poses$i/*.pdbqt 2>/dev/null | wc -l; done'
|
||
```
|
||
|
||
## 将对接结果还原为sdf文件
|
||
|
||
mk_export.py 命令行工具的各个参数选项。
|
||
|
||
```shell
|
||
cd output
|
||
mk_export.py ./*_out.pdbqt --suffix _converted
|
||
```
|
||
|
||
## 分析vina对接结果
|
||
|
||
```shell
|
||
# 结果导出
|
||
mk_export.py vina_results.pdbqt -j my_receptor.json -s lig_docked.sdf -p rec_docked.pdb
|
||
```
|
||
|
||
## djob 运行时间耗时长的批次任务
|
||
|
||
```
|
||
24562323 vina_job15 RUNNING lyzeng24 default default 2025/07/31 23:16:30 - agent-ARM-17
|
||
24562322 vina_job14 RUNNING lyzeng24 default default 2025/07/31 23:16:30 - agent-ARM-17
|
||
24562321 vina_job13 RUNNING lyzeng24 default default 2025/07/31 23:16:30 - agent-ARM-17
|
||
24562320 vina_job12 RUNNING lyzeng24 default default 2025/07/31 23:16:29 - agent-ARM-21
|
||
24562319 vina_job11 RUNNING lyzeng24 default default 2025/07/31 23:16:29 - agent-ARM-21
|
||
24562318 vina_job10 RUNNING lyzeng24 default default 2025/07/31 23:16:29 - agent-ARM-21
|
||
24562317 vina_job9 RUNNING lyzeng24 default default 2025/07/31 23:16:28 - agent-ARM-21
|
||
24562316 vina_job8 RUNNING lyzeng24 default default 2025/07/31 23:16:28 - agent-ARM-16
|
||
24562315 vina_job7 RUNNING lyzeng24 default default 2025/07/31 23:16:28 - agent-ARM-16
|
||
24562314 vina_job6 RUNNING lyzeng24 default default 2025/07/31 23:16:27 - agent-ARM-16
|
||
24562313 vina_job5 RUNNING lyzeng24 default default 2025/07/31 23:16:27 - agent-ARM-19
|
||
24562312 vina_job4 RUNNING lyzeng24 default default 2025/07/31 23:16:27 - agent-ARM-19
|
||
24562311 vina_job3 RUNNING lyzeng24 default default 2025/07/31 23:16:27 - agent-ARM-19
|
||
```
|
||
|
||
## autodock vina 参考分子对接
|
||
|
||
trpe:(PDB ID: 5cwa)
|
||
|
||
```shell
|
||
./vina --receptor ./refence/trpe/TrpE_entry_1.pdbqt --ligand ./refence/trpe/align_5cwa_0GA_addH.pdbqt --config ./refence/trpe/TrpE_entry_1.box.txt --out ./refence/trpe/align_5cwa_0GA_addH_out.pdbqt --exhaustiveness="32" --num_modes="20" --energy_range="5.0"
|
||
```
|
||
|
||
result:
|
||
|
||
```
|
||
AutoDock Vina v1.2.7
|
||
#################################################################
|
||
# If you used AutoDock Vina in your work, please cite: #
|
||
# #
|
||
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli #
|
||
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force #
|
||
# Field, and Python Bindings, J. Chem. Inf. Model. (2021) #
|
||
# DOI 10.1021/acs.jcim.1c00203 #
|
||
# #
|
||
# O. Trott, A. J. Olson, #
|
||
# AutoDock Vina: improving the speed and accuracy of docking #
|
||
# with a new scoring function, efficient optimization and #
|
||
# multithreading, J. Comp. Chem. (2010) #
|
||
# DOI 10.1002/jcc.21334 #
|
||
# #
|
||
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for #
|
||
# more information. #
|
||
#################################################################
|
||
|
||
Scoring function : vina
|
||
Rigid receptor: ./refence/trpe/TrpE_entry_1.pdbqt
|
||
Ligand: ./refence/trpe/align_5cwa_0GA_addH.pdbqt
|
||
Grid center: X 7.402 Y -4.783 Z -11.818
|
||
Grid size : X 30 Y 30 Z 30
|
||
Grid space : 0.375
|
||
Exhaustiveness: 32
|
||
CPU: 0
|
||
Verbosity: 1
|
||
|
||
Computing Vina grid ... done.
|
||
WARNING: At low exhaustiveness, it may be impossible to utilize all CPUs.
|
||
Performing docking (random seed: 650309048) ...
|
||
0% 10 20 30 40 50 60 70 80 90 100%
|
||
|----|----|----|----|----|----|----|----|----|----|
|
||
***************************************************
|
||
|
||
mode | affinity | dist from best mode
|
||
| (kcal/mol) | rmsd l.b.| rmsd u.b.
|
||
-----+------------+----------+----------
|
||
1 -6.531 0 0
|
||
2 -6.352 3.988 6.453
|
||
3 -6.3 1.447 5.602
|
||
4 -6.291 1.94 5.284
|
||
5 -6.283 1.044 2.037
|
||
6 -6.159 3.798 5.275
|
||
7 -6.124 1.43 5.553
|
||
8 -5.988 3.499 5.489
|
||
9 -5.925 3.311 4.252
|
||
10 -5.912 3.647 4.894
|
||
11 -5.889 7.256 10.49
|
||
12 -5.821 2.351 5.29
|
||
13 -5.763 3.731 6.18
|
||
14 -5.732 3.557 6.002
|
||
15 -5.729 7.213 9.251
|
||
16 -5.693 4.179 5.642
|
||
17 -5.684 3.058 4.111
|
||
18 -5.679 4.117 5.518
|
||
19 -5.671 4.656 6.098
|
||
20 -5.663 4.112 5.705
|
||
```
|
||
|
||
fgbar:(PDB ID: 8izd)
|
||
|
||
```shell
|
||
./vina --receptor ./refence/fgbar/FgBar1_cut_proteinprep.pdbqt --ligand ./refence/fgbar/align_8izd_F_9NY_addH.pdbqt --config ./refence/fgbar/FgBar1_entry_1.box.txt --out ./refence/fgbar/align_8izd_F_9NY_addH_out.pdbqt --exhaustiveness="32" --num_modes="20" --energy_range="5.0"
|
||
```
|
||
|
||
reusult:
|
||
|
||
```
|
||
AutoDock Vina v1.2.7
|
||
#################################################################
|
||
# If you used AutoDock Vina in your work, please cite: #
|
||
# #
|
||
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli #
|
||
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force #
|
||
# Field, and Python Bindings, J. Chem. Inf. Model. (2021) #
|
||
# DOI 10.1021/acs.jcim.1c00203 #
|
||
# #
|
||
# O. Trott, A. J. Olson, #
|
||
# AutoDock Vina: improving the speed and accuracy of docking #
|
||
# with a new scoring function, efficient optimization and #
|
||
# multithreading, J. Comp. Chem. (2010) #
|
||
# DOI 10.1002/jcc.21334 #
|
||
# #
|
||
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for #
|
||
# more information. #
|
||
#################################################################
|
||
|
||
Scoring function : vina
|
||
Rigid receptor: ./refence/fgbar/FgBar1_cut_proteinprep.pdbqt
|
||
Ligand: ./refence/fgbar/align_8izd_F_9NY_addH.pdbqt
|
||
Grid center: X -12.7 Y -9.1 Z -0.3
|
||
Grid size : X 49.1 Y 37.6 Z 35.2
|
||
Grid space : 0.375
|
||
Exhaustiveness: 32
|
||
CPU: 0
|
||
Verbosity: 1
|
||
|
||
WARNING: Search space volume is greater than 27000 Angstrom^3 (See FAQ)
|
||
Computing Vina grid ... done.
|
||
WARNING: At low exhaustiveness, it may be impossible to utilize all CPUs.
|
||
Performing docking (random seed: -399012800) ...
|
||
0% 10 20 30 40 50 60 70 80 90 100%
|
||
|----|----|----|----|----|----|----|----|----|----|
|
||
***************************************************
|
||
|
||
mode | affinity | dist from best mode
|
||
| (kcal/mol) | rmsd l.b.| rmsd u.b.
|
||
-----+------------+----------+----------
|
||
1 -5.268 0 0
|
||
2 -5.106 3.453 7.96
|
||
3 -5.003 3.114 6.709
|
||
4 -4.986 6.86 13.92
|
||
5 -4.947 5.434 13
|
||
6 -4.875 4.933 10.47
|
||
7 -4.867 6.888 13.75
|
||
8 -4.862 4.244 9.114
|
||
9 -4.835 3.776 6.806
|
||
10 -4.826 3.682 7.143
|
||
11 -4.824 5.4 10.17
|
||
12 -4.81 5.364 7.809
|
||
13 -4.808 4.364 11.15
|
||
14 -4.805 3.211 5.684
|
||
15 -4.783 3.585 8.995
|
||
16 -4.773 6.47 13.64
|
||
17 -4.773 3.465 6.652
|
||
18 -4.731 4.73 9.619
|
||
19 -4.726 4.867 10.88
|
||
20 -4.716 4.834 8.903
|
||
```
|
||
|
||
对接结果并不理想,可能是分子中灵活的扭转角多,柔性较大。AutoDock Vina 更偏向刚性对接。
|
||
|
||
## 分析策略
|
||
|
||
### trpe
|
||
|
||
AutoDock vina:QED 针对小空间(分子量小的)trpe,QED 过滤。()
|
||
|
||
karamadock:只看 qed 过滤后的小分子对接情况(过滤标准:**小分子**,QED)
|
||
|
||
glide: 小分子,QED。(vina 打分好的 1w 个 , 按照底物标准)
|
||
|
||
---
|
||
|
||
### fgbar
|
||
|
||
vina,karamadock,底物标准,选择 交集做 glide。
|