vina_docking_batch/README.md

## 目录结构

```
.
├── config/                    # Configuration files (box definitions, etc.)
├── ligand/                    # Ligand files
│   └── pdbqt/                 # Prepared ligand files in PDBQT format
├── receptor/                  # Receptor files
├── result/                    # Docking results
│   ├── fgbar/                 # FgBar dataset results
│   │   └── poses_all/         # Individual docking results in SDF format
│   ├── trpe/                  # TrpE dataset results
│   │   └── poses_all/         # Individual docking results in SDF format
│   └── refence/               # Reference molecule files
│       ├── fgbar/             # FgBar reference molecules
│       └── trpe/              # TrpE reference molecules
├── scripts/                   # Analysis scripts and utilities
└── README.md                  # This file
```

## 1. Preparation

Before running the pipeline, you need to prepare the following files:

1. Protein structure file (PDB format)
2. Ligand library (MOL2 format, named according to the format CNPxxxxxx.1.mol2)
3. Configuration file (box.txt format, defining the docking box parameters)

## 2. Execution Steps

### 2.1 Protein Preparation

```bash
prepare_receptor4.py -r protein.pdb -o protein.pdbqt
```

### 2.2 Ligand Preparation

```bash
prepare_ligand4.py -l ligand.mol2 -o ligand.pdbqt
```

### 2.3 Docking Execution

```bash
vina --config box.txt --receptor protein.pdbqt --ligand ligand.pdbqt --out out.pdbqt
```

### 2.4 Result Format Conversion

Convert PDBQT format results to SDF format:

```bash
mk_export.py ./*_out.pdbqt --suffix _converted
```

## 3. Result Analysis

### 3.1 Calculate QED Properties

Calculate QED values for all molecules:

```bash
cd scripts
python calculate_qed_values.py
```

This script processes both the docked molecules in the poses_all directories and the reference molecules in the refence directories. It generates two CSV files:
- qed_values_fgbar.csv
- qed_values_trpe.csv

Each CSV file contains the following columns:
- smiles: SMILES representation of the molecule
- filename: Name of the source file
- qed: QED value of the molecule
- molecular_weight: Molecular weight of the molecule
- vina_scores: List of Vina scores for all conformers

### 3.2 Analyze QED and Molecular Weight Distribution

Analyze the distribution of QED and molecular weight properties and generate KDE plots:

```bash
python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe
```

This will generate four plots:
1. kde_distribution_fgbar_normalized.png - Normalized distribution for fgbar dataset
2. kde_distribution_fgbar_actual.png - Actual values distribution for fgbar dataset
3. kde_distribution_trpe_normalized.png - Normalized distribution for trpe dataset
4. kde_distribution_trpe_actual.png - Actual values distribution for trpe dataset

Each plot contains three distributions:
- QED distribution (blue)
- Molecular weight distribution (red)
- Vina score distribution (green)

Reference molecules are marked with different colored markers and labeled with their identifiers and corresponding values.

### 3.3 Advanced Analysis Options

You can also specify custom reference scores and conformation rank:

```bash
python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe --rank 0
```

The `--rank` option allows you to specify which conformation from the reference molecule's docking results to use for the Vina score reference.
- Rank 0 (default) uses the best scoring conformation (rank 1 in Vina results)
- Rank 1 uses the second best scoring conformation, and so on

The maximum valid rank is determined by the minimum number of conformations generated across all docked molecules. If you specify a rank that exceeds this minimum, the script will raise an error and inform you of the maximum valid rank.

You can also specify custom reference scores:

```bash
python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe --reference-scores '{"fgbar": {"9NY": -5.268}, "trpe": {"0GA": -6.531}}'
```

## 4. API Usage

The analysis functions can also be called directly from Python:

```python
import sys
sys.path.append('scripts')
from analyze_qed_mw_distribution import main_api

# Basic usage
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'])

# With custom reference scores
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'],
         reference_scores={'fgbar': {'9NY': -5.268}, 'trpe': {'0GA': -6.531}})

# With specific conformation rank
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'], rank=0)
```

## 5. Output Files

The analysis generates several output files:
- CSV files with QED values and Vina scores for all molecules
- KDE distribution plots in both normalized and actual values formats

# AutoDock Vina Pipeline

This repository contains a complete pipeline for molecular docking using AutoDock Vina, including preparation, execution, result processing, and analysis.

## 受体准备 pdbqt 文件

使用 alphafold 预测 pdb 文件 cif 文件。

修复使用 moderller 同源建模，或者 pdbfixer，MOE，maestro 等

这里使用 maestro 的 `Protein reparation Workflow` 模块

然后导出 pdb 文件

使用 meeko 准备受体文件 pdbqt 文件，详细可以[参考](https://meeko.readthedocs.io/en/release-doc/rec_overview.html)

```shell
micromamba run -n vina mk_prepare_receptor.py -i receptor/FgBar1_cut_proteinprep.pdb --write_pdbqt receptor/FgBar1_cut_proteinprep.pdbqt
```

选项组合用法

### 举例1：用默认输出名生成 pdbqt 和 vina box 配置

mk_prepare_receptor.py -i 1abc.pdb -o 1abc_clean --write_pdbqt --write_vina_box

得到 1abc_clean_rigid.pdbqt, 1abc_clean.vina.txt

### 举例2：为指定残基设置模板/柔性，并生成 box 配置

```shell
mk_prepare_receptor.py -i system.pdb \
  --output_basename system_prep \
  -f "A:42,B:23" \
  -n "A:5,7=CYX,B:17=HID" \
  --write_pdbqt --write_vina_box
```


### 举例3：自动包络某配体生成 box 配置

```
mk_prepare_receptor.py -i prot.pdb \
  --box_enveloping ligand.pdb \
  --padding 3.0 \
  --output_basename dock_ready \
  --write_pdbqt --write_vina_box
```


## 小分子 3D 构象准备

需要给小分子一个初始化的 3d 构象存放到`ligand/sdf`

```shell
python sdf2to3d.py --src_dir ./2d_sdf_dir --out_dir ./3d_sdf_dir --n_jobs 8
```

## 小分子格式转化

使用 meeko 将 `ligand/sdf` 转为 `ligand/pdbqt`

```shell
micromamba run -n vina ./scripts/batch_prepare_ligands.sh ligands/sdf ligands/pdbqt/ batch_prepare_ligands.log 128
```

## 小分子批量提交对接

分割小分子文件将 ligand 目录里面的 pdbqt 文件夹拆分 n 个子文件夹(pdbqt1,pdbqt2,pdbqt3...pdbqtn)

```shell
micromamba run -n vina python vina_split_and_submit.py <split_number_n>
```

执行完成后会自动使用 dsub 命令将对接任务提交给华为多瑙调度系统

需要注意有时候提交执行速度过快可能有批次遗漏，可以在合并时候检查

## 对接结果合并

在对接完成之后会在 `result` 文件夹里面创建 n 个对接结果文件夹（poses1，poses2，poses3...posesn）

每个文件夹中都有对应的`*_out.pdbqt`文件与`*_converted.sdf`文件，调用

```shell
micromamba run -n vina python vina_merge_and_check.py --n_splits <split_number_n> --out_dir ./result --output_prefix poses --poses_dir ./result/poses_all
```

会将所有的n 个对接结果文件夹中`*_converted.sdf`文件存放到 `./result/poses_all` 目录，同时会检测是否有提交时候过快导致遗漏某个批次没有对接，需要注意查看。

## 分析对接结果

在`*_converted.sdf`文件中存在`20`个对接构象，取决于`scripts/batch_docking.sh` 中 `NUM_MODES` 设置多少数目，默认设置为 20。

其中每个 sdf 构象存在下面的`<meeko>`字段 用于获取对接打分等属性用于后续筛选分子。

```
>  <meeko>  (20)
{"is_sidechain": [false], "free_energy": -6.38, "intermolecular_energy": -15.695, "internal_energy": -2.912}
```

## batch 模式对接

vina=1.2.7可以使用batch 模式进行批量对接。

```shell
mkdir -p results/poses
vina --receptor input/receptors/TrpE_entry_1.pdbqt \
     --batch input/ligands/test \
     --config ./configs/TrpE_entry_1.box.txt \
     --dir results/poses \
     --exhaustiveness=32

# 使用脚本对接
./scripts/batch_docking.sh ./receptors/TrpE_entry_1.pdbqt ./config/TrpE_entry_1.box.txt ligands/test output test.log /share/home/lyzeng24/rdkit_script/vina/vina
```

## 环境安装

```shell
conda install -c conda-forge vina meeko rdkit joblib rich ipython parallel -y
```


## 准备小分子pdbqt

```shell
# 单个配体准备
mk_prepare_ligand.py -i molecule.sdf -o molecule.pdbqt

# 批量准备
micromamba run -n vina ./scripts/batch_prepare_ligands.sh ligands/sdf ligands/pdbqt/ batch_prepare_ligands.log 128

#监控文件
watch -n 1 "ls -l pdbqt/*.pdbqt 2>/dev/null | wc -l"
```

## 准备受体pdbqt

```shell
# 受体准备（带柔性侧链）
mk_prepare_receptor.py -i nucleic_acid.cif -o my_receptor -j -p -f A:42
```

## batch对接模式

```shell
./scripts/batch_docking.sh input/receptors/TrpE_entry_1.pdbqt \
                          input/configs/TrpE_entry_1.box.txt \
                          input/ligands/pdbqt \
                          results/poses \
                          results/batch_docking.log
```

## 监控对接结果

```shell
watch -n 1 'for i in {1..12}; do printf "poses$i: "; ls results/poses$i/*.pdbqt 2>/dev/null | wc -l; done'
```

## 将对接结果还原为sdf文件

mk_export.py 命令行工具的各个参数选项。

```shell
cd output
mk_export.py ./*_out.pdbqt --suffix _converted
```

## 分析vina对接结果

```shell
# 结果导出
mk_export.py vina_results.pdbqt -j my_receptor.json -s lig_docked.sdf -p rec_docked.pdb
```

## djob 运行时间耗时长的批次任务

```
24562323     vina_job15   RUNNING    lyzeng24     default      default      2025/07/31 23:16:30  -                    agent-ARM-17
24562322     vina_job14   RUNNING    lyzeng24     default      default      2025/07/31 23:16:30  -                    agent-ARM-17
24562321     vina_job13   RUNNING    lyzeng24     default      default      2025/07/31 23:16:30  -                    agent-ARM-17
24562320     vina_job12   RUNNING    lyzeng24     default      default      2025/07/31 23:16:29  -                    agent-ARM-21
24562319     vina_job11   RUNNING    lyzeng24     default      default      2025/07/31 23:16:29  -                    agent-ARM-21
24562318     vina_job10   RUNNING    lyzeng24     default      default      2025/07/31 23:16:29  -                    agent-ARM-21
24562317     vina_job9    RUNNING    lyzeng24     default      default      2025/07/31 23:16:28  -                    agent-ARM-21
24562316     vina_job8    RUNNING    lyzeng24     default      default      2025/07/31 23:16:28  -                    agent-ARM-16
24562315     vina_job7    RUNNING    lyzeng24     default      default      2025/07/31 23:16:28  -                    agent-ARM-16
24562314     vina_job6    RUNNING    lyzeng24     default      default      2025/07/31 23:16:27  -                    agent-ARM-16
24562313     vina_job5    RUNNING    lyzeng24     default      default      2025/07/31 23:16:27  -                    agent-ARM-19
24562312     vina_job4    RUNNING    lyzeng24     default      default      2025/07/31 23:16:27  -                    agent-ARM-19
24562311     vina_job3    RUNNING    lyzeng24     default      default      2025/07/31 23:16:27  -                    agent-ARM-19
```

## autodock vina 参考分子对接

trpe:(PDB ID: 5cwa)

```shell
./vina --receptor ./refence/trpe/TrpE_entry_1.pdbqt --ligand ./refence/trpe/align_5cwa_0GA_addH.pdbqt --config ./refence/trpe/TrpE_entry_1.box.txt --out ./refence/trpe/align_5cwa_0GA_addH_out.pdbqt --exhaustiveness="32" --num_modes="20"  --energy_range="5.0"
```

result:

```
AutoDock Vina v1.2.7
#################################################################
# If you used AutoDock Vina in your work, please cite:          #
#                                                               #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli  #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force      #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021)       #
# DOI 10.1021/acs.jcim.1c00203                                  #
#                                                               #
# O. Trott, A. J. Olson,                                        #
# AutoDock Vina: improving the speed and accuracy of docking    #
# with a new scoring function, efficient optimization and       #
# multithreading, J. Comp. Chem. (2010)                         #
# DOI 10.1002/jcc.21334                                         #
#                                                               #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for  #
# more information.                                             #
#################################################################

Scoring function : vina
Rigid receptor: ./refence/trpe/TrpE_entry_1.pdbqt
Ligand: ./refence/trpe/align_5cwa_0GA_addH.pdbqt
Grid center: X 7.402 Y -4.783 Z -11.818
Grid size  : X 30 Y 30 Z 30
Grid space : 0.375
Exhaustiveness: 32
CPU: 0
Verbosity: 1

Computing Vina grid ... done.
WARNING: At low exhaustiveness, it may be impossible to utilize all CPUs.
Performing docking (random seed: 650309048) ...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
   1       -6.531          0          0
   2       -6.352      3.988      6.453
   3         -6.3      1.447      5.602
   4       -6.291       1.94      5.284
   5       -6.283      1.044      2.037
   6       -6.159      3.798      5.275
   7       -6.124       1.43      5.553
   8       -5.988      3.499      5.489
   9       -5.925      3.311      4.252
  10       -5.912      3.647      4.894
  11       -5.889      7.256      10.49
  12       -5.821      2.351       5.29
  13       -5.763      3.731       6.18
  14       -5.732      3.557      6.002
  15       -5.729      7.213      9.251
  16       -5.693      4.179      5.642
  17       -5.684      3.058      4.111
  18       -5.679      4.117      5.518
  19       -5.671      4.656      6.098
  20       -5.663      4.112      5.705
```

fgbar:（PDB ID： 8izd）

```shell
./vina --receptor ./refence/fgbar/FgBar1_cut_proteinprep.pdbqt --ligand ./refence/fgbar/align_8izd_F_9NY_addH.pdbqt --config ./refence/fgbar/FgBar1_entry_1.box.txt --out ./refence/fgbar/align_8izd_F_9NY_addH_out.pdbqt --exhaustiveness="32" --num_modes="20"  --energy_range="5.0"
```

reusult:

```
AutoDock Vina v1.2.7
#################################################################
# If you used AutoDock Vina in your work, please cite:          #
#                                                               #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli  #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force      #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021)       #
# DOI 10.1021/acs.jcim.1c00203                                  #
#                                                               #
# O. Trott, A. J. Olson,                                        #
# AutoDock Vina: improving the speed and accuracy of docking    #
# with a new scoring function, efficient optimization and       #
# multithreading, J. Comp. Chem. (2010)                         #
# DOI 10.1002/jcc.21334                                         #
#                                                               #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for  #
# more information.                                             #
#################################################################

Scoring function : vina
Rigid receptor: ./refence/fgbar/FgBar1_cut_proteinprep.pdbqt
Ligand: ./refence/fgbar/align_8izd_F_9NY_addH.pdbqt
Grid center: X -12.7 Y -9.1 Z -0.3
Grid size  : X 49.1 Y 37.6 Z 35.2
Grid space : 0.375
Exhaustiveness: 32
CPU: 0
Verbosity: 1

WARNING: Search space volume is greater than 27000 Angstrom^3 (See FAQ)
Computing Vina grid ... done.
WARNING: At low exhaustiveness, it may be impossible to utilize all CPUs.
Performing docking (random seed: -399012800) ...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
   1       -5.268          0          0
   2       -5.106      3.453       7.96
   3       -5.003      3.114      6.709
   4       -4.986       6.86      13.92
   5       -4.947      5.434         13
   6       -4.875      4.933      10.47
   7       -4.867      6.888      13.75
   8       -4.862      4.244      9.114
   9       -4.835      3.776      6.806
  10       -4.826      3.682      7.143
  11       -4.824        5.4      10.17
  12        -4.81      5.364      7.809
  13       -4.808      4.364      11.15
  14       -4.805      3.211      5.684
  15       -4.783      3.585      8.995
  16       -4.773       6.47      13.64
  17       -4.773      3.465      6.652
  18       -4.731       4.73      9.619
  19       -4.726      4.867      10.88
  20       -4.716      4.834      8.903
```

对接结果并不理想，可能是分子中灵活的扭转角多，柔性较大。AutoDock Vina 更偏向刚性对接。

## 分析策略

### trpe

AutoDock vina：QED 针对小空间（分子量小的）trpe，QED 过滤。（）

karamadock：只看 qed 过滤后的小分子对接情况（过滤标准：**小分子**，QED）

glide: 小分子，QED。（vina 打分好的 1w 个 ， 按照底物标准）

---

### fgbar

vina，karamadock，底物标准，选择 交集做 glide。