
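lingyuzeng f6c182f38e Refactor the molecular property analysis workflow and update the directory structure

1. Directory structure changes:
   - Create a scripts directory to hold all analysis scripts
   - Keep data files in their original directory structure
   - Generated CSV files and PNG plots are also placed in the scripts directory

2. Functional improvements:
   - Update calculate_qed_values.py to process reference-molecule SDF files
   - Modify analyze_qed_mw_distribution.py to consistently use the SDF file stem as the reference molecule identifier
   - Improve Vina score extraction to support extracting the scores of all conformations from SDF files
   - Refine KDE distribution plotting so that reference molecules appear under a consistent name in all plots

3. Documentation updates:
   - Update the directory structure description in README.md
   - Update the command-line and API usage examples
   - Add detailed usage instructions and examples

4. Example code:
   - Update example_api_usage.py for the new directory structure and API calls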

2025-08-05 17:00:39 +08:00

config

add Fgbar example

2025-08-02 22:13:12 +08:00

receptor

add Fgbar example

2025-08-02 22:13:12 +08:00

result/refence

Refactor the molecular property analysis workflow and update the directory structure

2025-08-05 17:00:39 +08:00

scripts

Refactor the molecular property analysis workflow and update the directory structure

2025-08-05 17:00:39 +08:00

.gitignore

Refactor the molecular property analysis workflow and update the directory structure

2025-08-05 17:00:39 +08:00

clean_vina_jobs.sh

first add

2025-08-02 21:54:31 +08:00

environment.yaml

first add

2025-08-02 21:54:31 +08:00

kill_vina_jobs.sh

first add

2025-08-02 21:54:31 +08:00

README.md

Refactor the molecular property analysis workflow and update the directory structure

2025-08-05 17:00:39 +08:00

remove_files.sh

first add

2025-08-02 21:54:31 +08:00

sdf2to3d.py

first add

2025-08-02 21:54:31 +08:00

submit_sdf2pdbqt.sh

first add

2025-08-02 21:54:31 +08:00

vina

first add

2025-08-02 21:54:31 +08:00

vina_merge_and_check.py

first add

2025-08-02 21:54:31 +08:00

vina_split

first add

2025-08-02 21:54:31 +08:00

vina_split_and_submit.py

Add receptor preparation notes and make the batch submission parameters more flexible

2025-08-02 23:07:05 +08:00

README.md

Directory Structure

.
├── config/                    # Configuration files (box definitions, etc.)
├── ligand/                    # Ligand files
│   └── pdbqt/                 # Prepared ligand files in PDBQT format
├── receptor/                  # Receptor files
├── result/                    # Docking results
│   ├── fgbar/                 # FgBar dataset results
│   │   └── poses_all/         # Individual docking results in SDF format
│   ├── trpe/                  # TrpE dataset results
│   │   └── poses_all/         # Individual docking results in SDF format
│   └── refence/               # Reference molecule files
│       ├── fgbar/             # FgBar reference molecules
│       └── trpe/              # TrpE reference molecules
├── scripts/                   # Analysis scripts and utilities
└── README.md                  # This file

1. Preparation

Before running the pipeline, you need to prepare the following files:

Protein structure file (PDB format)
Ligand library (MOL2 format, named according to the format CNPxxxxxx.1.mol2)
Configuration file (box.txt format, defining the docking box parameters)

2. Execution Steps

2.1 Protein Preparation

prepare_receptor4.py -r protein.pdb -o protein.pdbqt

2.2 Ligand Preparation

prepare_ligand4.py -l ligand.mol2 -o ligand.pdbqt

2.3 Docking Execution

vina --config box.txt --receptor protein.pdbqt --ligand ligand.pdbqt --out out.pdbqt

2.4 Result Format Conversion

Convert PDBQT format results to SDF format:

mk_export.py ./*_out.pdbqt --suffix _converted

3. Result Analysis

3.1 Calculate QED Properties

Calculate QED values for all molecules:

cd scripts
python calculate_qed_values.py

This script processes both the docked molecules in the poses_all directories and the reference molecules in the refence directories. It generates two CSV files:

qed_values_fgbar.csv
qed_values_trpe.csv

Each CSV file contains the following columns:

smiles: SMILES representation of the molecule
filename: Name of the source file
qed: QED value of the molecule
molecular_weight: Molecular weight of the molecule
vina_scores: List of Vina scores for all conformers
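
As a rough illustration of how these columns might be consumed downstream, the sketch below reads a CSV with the column names listed above using only the standard library. The sample rows are made-up placeholders (not computed values), and it assumes that `vina_scores` is stored as a Python-style list literal, which the source does not confirm.

```python
import ast
import csv
import io

# Hypothetical sample; the numbers are placeholders, not real results.
SAMPLE = '''smiles,filename,qed,molecular_weight,vina_scores
CCO,mol_a.sdf,0.41,46.07,"[-3.2, -3.0]"
c1ccccc1,mol_b.sdf,0.44,78.11,"[-4.1, -3.9, -3.5]"
'''


def load_rows(handle):
    """Parse rows, converting the numeric columns and the score list."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "smiles": row["smiles"],
            "filename": row["filename"],
            "qed": float(row["qed"]),
            "molecular_weight": float(row["molecular_weight"]),
            # Assumption: the list is serialized as a Python-style literal.
            "vina_scores": ast.literal_eval(row["vina_scores"]),
        })
    return rows


rows = load_rows(io.StringIO(SAMPLE))
# The most negative score is the best-scoring conformer of each molecule.
best = {r["filename"]: min(r["vina_scores"]) for r in rows}
print(best)
```

For a real file, `open("qed_values_fgbar.csv", newline="")` would replace the in-memory sample.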

3.2 Analyze QED and Molecular Weight Distribution

Analyze the distribution of QED and molecular weight properties and generate KDE plots:

python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe

This will generate four plots:

kde_distribution_fgbar_normalized.png - Normalized distribution for fgbar dataset
kde_distribution_fgbar_actual.png - Actual values distribution for fgbar dataset
kde_distribution_trpe_normalized.png - Normalized distribution for trpe dataset
kde_distribution_trpe_actual.png - Actual values distribution for trpe dataset

Each plot contains three distributions:

QED distribution (blue)
Molecular weight distribution (red)
Vina score distribution (green)

Reference molecules are marked with different colored markers and labeled with their identifiers and corresponding values.

3.3 Advanced Analysis Options

You can also choose which conformation rank is used for the reference Vina score:

python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe --rank 0

The --rank option allows you to specify which conformation from the reference molecule's docking results to use for the Vina score reference.

Rank 0 (default) uses the best scoring conformation (rank 1 in Vina results)
Rank 1 uses the second best scoring conformation, and so on

The maximum valid rank is determined by the minimum number of conformations generated across all docked molecules. If you specify a rank that exceeds this minimum, the script will raise an error and inform you of the maximum valid rank.
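
The rank check described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names are hypothetical, and it assumes each molecule's scores are stored best-first in a list.

```python
def max_valid_rank(score_lists):
    """Return the largest 0-based rank present in every score list."""
    if not score_lists:
        raise ValueError("no molecules given")
    return min(len(scores) for scores in score_lists) - 1


def select_rank(score_lists, rank=0):
    """Pick the score at `rank` for every molecule, or fail with a clear message."""
    limit = max_valid_rank(score_lists)
    if not 0 <= rank <= limit:
        raise ValueError(
            f"rank {rank} is out of range; maximum valid rank is {limit}"
        )
    return [scores[rank] for scores in score_lists]


# Two hypothetical molecules with 3 and 2 conformations respectively.
scores = [[-6.5, -6.3, -6.1], [-5.2, -5.1]]
print(select_rank(scores, rank=1))  # [-6.3, -5.1]
```

With these inputs, `rank=2` raises a `ValueError` because the second molecule has only two conformations.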

You can also specify custom reference scores:

python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe --reference-scores '{"fgbar": {"9NY": -5.268}, "trpe": {"0GA": -6.531}}'
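
To show how a command line of this shape can be parsed, here is a small `argparse` sketch. It mirrors the flags documented above but is only an assumption about how the real script defines them; `build_parser` is a hypothetical name.

```python
import argparse
import json


def build_parser():
    """Build a parser that accepts the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented CLI")
    parser.add_argument("csv_files", nargs="+", help="one CSV per dataset")
    # Repeating the flag appends one name per dataset.
    parser.add_argument("--dataset-names", action="append", default=None)
    parser.add_argument("--rank", type=int, default=0)
    # The JSON string is decoded into a nested dict by json.loads.
    parser.add_argument("--reference-scores", type=json.loads, default=None)
    return parser


args = build_parser().parse_args([
    "qed_values_fgbar.csv", "qed_values_trpe.csv",
    "--dataset-names", "fgbar", "--dataset-names", "trpe",
    "--reference-scores", '{"fgbar": {"9NY": -5.268}, "trpe": {"0GA": -6.531}}',
])
print(args.dataset_names, args.reference_scores["trpe"]["0GA"])
```

Quoting the JSON in single quotes on the shell, as in the command above, keeps the inner double quotes intact.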

4. API Usage

The analysis functions can also be called directly from Python:

import sys
sys.path.append('scripts')
from analyze_qed_mw_distribution import main_api

# Basic usage
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'])

# With custom reference scores
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'], 
         reference_scores={'fgbar': {'9NY': -5.268}, 'trpe': {'0GA': -6.531}})

# With specific conformation rank
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'], rank=0)

5. Output Files

The analysis generates several output files:

CSV files with QED values and Vina scores for all molecules
KDE distribution plots in both normalized and actual values formats

AutoDock Vina Pipeline

This repository contains a complete pipeline for molecular docking using AutoDock Vina, including preparation, execution, result processing, and analysis.

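Receptor Preparation (PDBQT File)

Predict the structure with AlphaFold to obtain a PDB or CIF file.

Repair the structure by homology modeling with MODELLER, or with tools such as PDBFixer, MOE, or Maestro.

Here we use the Protein Preparation Workflow module in Maestro.

Then export the result as a PDB file.

Use Meeko to prepare the receptor PDBQT file. For details, refer to: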

micromamba run -n vina mk_prepare_receptor.py -i receptor/FgBar1_cut_proteinprep.pdb --write_pdbqt receptor/FgBar1_cut_proteinprep.pdbqt

Combining Options

Example 1: Generate the PDBQT file and the Vina box configuration using the default output names

mk_prepare_receptor.py -i 1abc.pdb -o 1abc_clean --write_pdbqt --write_vina_box

This produces 1abc_clean_rigid.pdbqt and 1abc_clean.vina.txt.

Example 2: Set templates/flexibility for specific residues and generate the box configuration

mk_prepare_receptor.py -i system.pdb \
  --output_basename system_prep \
  -f "A:42,B:23" \
  -n "A:5,7=CYX,B:17=HID" \
  --write_pdbqt --write_vina_box

Example 3: Automatically generate the box configuration by enveloping a ligand

mk_prepare_receptor.py -i prot.pdb \
  --box_enveloping ligand.pdb \
  --padding 3.0 \
  --output_basename dock_ready \
  --write_pdbqt --write_vina_box
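
To make the idea of an enveloping box concrete, the sketch below computes a box from ligand coordinates. It is not Meeko's implementation: it assumes the padding is added on every side of the ligand's bounding box, and the function names and coordinates are hypothetical. The `center_*` and `size_*` keys are the standard Vina config keys.

```python
def enveloping_box(coords, padding=3.0):
    """Return (center, size) of an axis-aligned box around the coordinates."""
    xs, ys, zs = zip(*coords)
    center = tuple((min(a) + max(a)) / 2 for a in (xs, ys, zs))
    # Assumption: padding is applied to both sides along each axis.
    size = tuple((max(a) - min(a)) + 2 * padding for a in (xs, ys, zs))
    return center, size


def to_vina_config(center, size):
    """Format the box as Vina config lines."""
    axes = ("x", "y", "z")
    lines = [f"center_{k} = {c:.3f}" for k, c in zip(axes, center)]
    lines += [f"size_{k} = {s:.3f}" for k, s in zip(axes, size)]
    return "\n".join(lines)


# Hypothetical ligand atom positions in Angstrom.
coords = [(0.0, 0.0, 0.0), (4.0, 2.0, 6.0), (2.0, 1.0, 3.0)]
center, size = enveloping_box(coords, padding=3.0)
print(to_vina_config(center, size))
```

With a padding of 3.0, a ligand spanning 4 x 2 x 6 Angstrom yields a 10 x 8 x 12 Angstrom box centered on the ligand.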

Ligand 3D Conformation Preparation

Each small molecule needs an initial 3D conformation, stored in ligand/sdf.

python sdf2to3d.py --src_dir ./2d_sdf_dir --out_dir ./3d_sdf_dir --n_jobs 8

Ligand Format Conversion

Use Meeko to convert ligand/sdf to ligand/pdbqt.

micromamba run -n vina ./scripts/batch_prepare_ligands.sh ligands/sdf ligands/pdbqt/ batch_prepare_ligands.log 128

Batch Submission of Ligand Docking

Split the ligand files: the pdbqt folder inside the ligand directory is divided into n subfolders (pdbqt1, pdbqt2, pdbqt3 ... pdbqtn).

micromamba run -n vina python vina_split_and_submit.py <split_number_n>

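Once the split finishes, the docking jobs are submitted automatically to the Huawei Donau scheduler with the dsub command.

Note that when submission runs too fast, some batches may occasionally be skipped; this can be checked during the merge step.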
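
The splitting step can be pictured with the sketch below, which distributes file names into n buckets named pdbqt1 ... pdbqtn. How vina_split_and_submit.py actually assigns files is not stated in this README; round-robin assignment and the function name are assumptions made for illustration.

```python
def split_round_robin(names, n_splits):
    """Assign file names to pdbqt1..pdbqtN buckets in round-robin order."""
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    buckets = {f"pdbqt{i + 1}": [] for i in range(n_splits)}
    for idx, name in enumerate(sorted(names)):
        buckets[f"pdbqt{idx % n_splits + 1}"].append(name)
    return buckets


# Seven hypothetical ligand files split into three batches.
files = [f"lig{i}.pdbqt" for i in range(7)]
buckets = split_round_robin(files, 3)
for folder, members in buckets.items():
    print(folder, len(members))
```

Round-robin keeps the batch sizes within one file of each other, which helps the submitted jobs finish at roughly the same time.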

Merging Docking Results

After docking completes, n result folders (poses1, poses2, poses3 ... posesn) are created in the result folder.

Each folder contains the corresponding *_out.pdbqt and *_converted.sdf files. Run:

micromamba run -n vina python vina_merge_and_check.py --n_splits <split_number_n> --out_dir ./result --output_prefix poses --poses_dir ./result/poses_all

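This collects the *_converted.sdf files from all n result folders into the ./result/poses_all directory. It also checks whether any batch was skipped because submission ran too fast, so review its output carefully.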
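
A check for skipped batches could look like the sketch below. It is not the logic of vina_merge_and_check.py, which this README does not show; it simply assumes that a batch counts as missing when its folder is absent or contains no *_converted.sdf file. The function name is hypothetical.

```python
import tempfile
from pathlib import Path


def find_missing_batches(out_dir, n_splits, prefix="poses"):
    """Return the names of batch folders that are absent or have no SDF output."""
    missing = []
    for i in range(1, n_splits + 1):
        batch = Path(out_dir) / f"{prefix}{i}"
        if not batch.is_dir() or not any(batch.glob("*_converted.sdf")):
            missing.append(batch.name)
    return missing


# Demo in a temporary directory: batches 1 and 3 exist, batch 2 was skipped.
with tempfile.TemporaryDirectory() as tmp:
    for i in (1, 3):
        folder = Path(tmp) / f"poses{i}"
        folder.mkdir()
        (folder / "lig_converted.sdf").write_text("")
    demo_missing = find_missing_batches(tmp, 3)

print(demo_missing)  # ['poses2']
```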

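Analyzing Docking Results

Each *_converted.sdf file contains up to 20 docked conformations; the number depends on the NUM_MODES setting in scripts/batch_docking.sh, which defaults to 20.

Each conformation in the SDF carries the <meeko> field shown below, which provides the docking score and related properties for downstream molecule filtering.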

>  <meeko>  (20) 
{"is_sidechain": [false], "free_energy": -6.38, "intermolecular_energy": -15.695, "internal_energy": -2.912}
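
As an illustration of how this field might be read, the sketch below extracts the JSON payload that follows each `<meeko>` header using only the standard library. It assumes the JSON sits on the single line after the header, as in the excerpt above; the function name is hypothetical.

```python
import json
import re

# Excerpt copied from the example above, followed by the SDF record terminator.
SDF_SNIPPET = '''>  <meeko>  (20)
{"is_sidechain": [false], "free_energy": -6.38, "intermolecular_energy": -15.695, "internal_energy": -2.912}

$$$$
'''

HEADER = re.compile(r"^>\s*<meeko>")


def meeko_records(sdf_text):
    """Return the decoded <meeko> JSON object of every conformation."""
    lines = sdf_text.splitlines()
    records = []
    for i, line in enumerate(lines):
        # Assumption: the JSON payload occupies exactly the next line.
        if HEADER.match(line) and i + 1 < len(lines):
            records.append(json.loads(lines[i + 1]))
    return records


records = meeko_records(SDF_SNIPPET)
print([r["free_energy"] for r in records])  # [-6.38]
```

The `free_energy` values collected this way can then be used to rank or filter molecules.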

Batch-Mode Docking

Vina 1.2.7 supports a batch mode for docking many ligands in one run.

mkdir -p results/poses
vina --receptor input/receptors/TrpE_entry_1.pdbqt \
     --batch input/ligands/test \
     --config ./configs/TrpE_entry_1.box.txt \
     --dir results/poses \
     --exhaustiveness=32

# Dock using the helper script
./scripts/batch_docking.sh ./receptors/TrpE_entry_1.pdbqt ./config/TrpE_entry_1.box.txt ligands/test output test.log /share/home/lyzeng24/rdkit_script/vina/vina

Environment Setup

conda install -c conda-forge vina meeko rdkit joblib rich ipython parallel -y

Preparing Ligand PDBQT Files

# Prepare a single ligand
mk_prepare_ligand.py -i molecule.sdf -o molecule.pdbqt

# Prepare ligands in batch
micromamba run -n vina ./scripts/batch_prepare_ligands.sh ligands/sdf ligands/pdbqt/ batch_prepare_ligands.log 128

# Monitor the number of generated files
watch -n 1 "ls -l pdbqt/*.pdbqt 2>/dev/null | wc -l"

Preparing the Receptor PDBQT File

# Receptor preparation (with flexible side chains)
mk_prepare_receptor.py -i nucleic_acid.cif -o my_receptor -j -p -f A:42

Batch Docking Mode

./scripts/batch_docking.sh input/receptors/TrpE_entry_1.pdbqt \
                          input/configs/TrpE_entry_1.box.txt \
                          input/ligands/pdbqt \
                          results/poses \
                          results/batch_docking.log

Monitoring Docking Results

watch -n 1 'for i in {1..12}; do printf "poses$i: "; ls results/poses$i/*.pdbqt 2>/dev/null | wc -l; done'

Converting Docking Results Back to SDF

Command-line options of the mk_export.py tool:

cd output
mk_export.py ./*_out.pdbqt --suffix _converted

Analyzing Vina Docking Results

# Export the results
mk_export.py vina_results.pdbqt -j my_receptor.json -s lig_docked.sdf -p rec_docked.pdb

djob: Long-Running Batch Jobs

24562323     vina_job15   RUNNING    lyzeng24     default      default      2025/07/31 23:16:30  -                    agent-ARM-17         
24562322     vina_job14   RUNNING    lyzeng24     default      default      2025/07/31 23:16:30  -                    agent-ARM-17         
24562321     vina_job13   RUNNING    lyzeng24     default      default      2025/07/31 23:16:30  -                    agent-ARM-17         
24562320     vina_job12   RUNNING    lyzeng24     default      default      2025/07/31 23:16:29  -                    agent-ARM-21         
24562319     vina_job11   RUNNING    lyzeng24     default      default      2025/07/31 23:16:29  -                    agent-ARM-21         
24562318     vina_job10   RUNNING    lyzeng24     default      default      2025/07/31 23:16:29  -                    agent-ARM-21         
24562317     vina_job9    RUNNING    lyzeng24     default      default      2025/07/31 23:16:28  -                    agent-ARM-21         
24562316     vina_job8    RUNNING    lyzeng24     default      default      2025/07/31 23:16:28  -                    agent-ARM-16         
24562315     vina_job7    RUNNING    lyzeng24     default      default      2025/07/31 23:16:28  -                    agent-ARM-16         
24562314     vina_job6    RUNNING    lyzeng24     default      default      2025/07/31 23:16:27  -                    agent-ARM-16         
24562313     vina_job5    RUNNING    lyzeng24     default      default      2025/07/31 23:16:27  -                    agent-ARM-19         
24562312     vina_job4    RUNNING    lyzeng24     default      default      2025/07/31 23:16:27  -                    agent-ARM-19         
24562311     vina_job3    RUNNING    lyzeng24     default      default      2025/07/31 23:16:27  -                    agent-ARM-19

AutoDock Vina Docking of Reference Molecules

trpe (PDB ID: 5cwa):

./vina --receptor ./refence/trpe/TrpE_entry_1.pdbqt --ligand ./refence/trpe/align_5cwa_0GA_addH.pdbqt --config ./refence/trpe/TrpE_entry_1.box.txt --out ./refence/trpe/align_5cwa_0GA_addH_out.pdbqt --exhaustiveness="32" --num_modes="20"  --energy_range="5.0"

result:

AutoDock Vina v1.2.7
#################################################################
# If you used AutoDock Vina in your work, please cite:          #
#                                                               #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli  #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force      #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021)       #
# DOI 10.1021/acs.jcim.1c00203                                  #
#                                                               #
# O. Trott, A. J. Olson,                                        #
# AutoDock Vina: improving the speed and accuracy of docking    #
# with a new scoring function, efficient optimization and       #
# multithreading, J. Comp. Chem. (2010)                         #
# DOI 10.1002/jcc.21334                                         #
#                                                               #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for  #
# more information.                                             #
#################################################################

Scoring function : vina
Rigid receptor: ./refence/trpe/TrpE_entry_1.pdbqt
Ligand: ./refence/trpe/align_5cwa_0GA_addH.pdbqt
Grid center: X 7.402 Y -4.783 Z -11.818
Grid size  : X 30 Y 30 Z 30
Grid space : 0.375
Exhaustiveness: 32
CPU: 0
Verbosity: 1

Computing Vina grid ... done.
WARNING: At low exhaustiveness, it may be impossible to utilize all CPUs.
Performing docking (random seed: 650309048) ... 
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
   1       -6.531          0          0
   2       -6.352      3.988      6.453
   3         -6.3      1.447      5.602
   4       -6.291       1.94      5.284
   5       -6.283      1.044      2.037
   6       -6.159      3.798      5.275
   7       -6.124       1.43      5.553
   8       -5.988      3.499      5.489
   9       -5.925      3.311      4.252
  10       -5.912      3.647      4.894
  11       -5.889      7.256      10.49
  12       -5.821      2.351       5.29
  13       -5.763      3.731       6.18
  14       -5.732      3.557      6.002
  15       -5.729      7.213      9.251
  16       -5.693      4.179      5.642
  17       -5.684      3.058      4.111
  18       -5.679      4.117      5.518
  19       -5.671      4.656      6.098
  20       -5.663      4.112      5.705
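
The affinity table in a log like the one above can be extracted with a small regular-expression parser. The sketch below is one possible approach, using the first three rows of the trpe table as input; `parse_modes` is a hypothetical helper, not part of Vina or of this repository.

```python
import re

# First rows of the trpe result table shown above.
LOG_EXCERPT = """
mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
   1       -6.531          0          0
   2       -6.352      3.988      6.453
   3         -6.3      1.447      5.602
"""

# A data row is: mode index, affinity, RMSD lower bound, RMSD upper bound.
ROW = re.compile(
    r"^\s*(\d+)\s+(-?\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s*$"
)


def parse_modes(log_text):
    """Return a list of (mode, affinity, rmsd_lb, rmsd_ub) tuples."""
    modes = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match:
            mode, affinity, lb, ub = match.groups()
            modes.append((int(mode), float(affinity), float(lb), float(ub)))
    return modes


modes = parse_modes(LOG_EXCERPT)
best_affinity = modes[0][1]  # mode 1 is the best-scoring pose
print(best_affinity)  # -6.531
```

The header, separator, and progress-bar lines do not match the row pattern, so only the numeric rows are returned.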

fgbar (PDB ID: 8izd):

./vina --receptor ./refence/fgbar/FgBar1_cut_proteinprep.pdbqt --ligand ./refence/fgbar/align_8izd_F_9NY_addH.pdbqt --config ./refence/fgbar/FgBar1_entry_1.box.txt --out ./refence/fgbar/align_8izd_F_9NY_addH_out.pdbqt --exhaustiveness="32" --num_modes="20"  --energy_range="5.0"

result:

AutoDock Vina v1.2.7
#################################################################
# If you used AutoDock Vina in your work, please cite:          #
#                                                               #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli  #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force      #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021)       #
# DOI 10.1021/acs.jcim.1c00203                                  #
#                                                               #
# O. Trott, A. J. Olson,                                        #
# AutoDock Vina: improving the speed and accuracy of docking    #
# with a new scoring function, efficient optimization and       #
# multithreading, J. Comp. Chem. (2010)                         #
# DOI 10.1002/jcc.21334                                         #
#                                                               #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for  #
# more information.                                             #
#################################################################

Scoring function : vina
Rigid receptor: ./refence/fgbar/FgBar1_cut_proteinprep.pdbqt
Ligand: ./refence/fgbar/align_8izd_F_9NY_addH.pdbqt
Grid center: X -12.7 Y -9.1 Z -0.3
Grid size  : X 49.1 Y 37.6 Z 35.2
Grid space : 0.375
Exhaustiveness: 32
CPU: 0
Verbosity: 1

WARNING: Search space volume is greater than 27000 Angstrom^3 (See FAQ)
Computing Vina grid ... done.
WARNING: At low exhaustiveness, it may be impossible to utilize all CPUs.
Performing docking (random seed: -399012800) ... 
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
   1       -5.268          0          0
   2       -5.106      3.453       7.96
   3       -5.003      3.114      6.709
   4       -4.986       6.86      13.92
   5       -4.947      5.434         13
   6       -4.875      4.933      10.47
   7       -4.867      6.888      13.75
   8       -4.862      4.244      9.114
   9       -4.835      3.776      6.806
  10       -4.826      3.682      7.143
  11       -4.824        5.4      10.17
  12        -4.81      5.364      7.809
  13       -4.808      4.364      11.15
  14       -4.805      3.211      5.684
  15       -4.783      3.585      8.995
  16       -4.773       6.47      13.64
  17       -4.773      3.465      6.652
  18       -4.731       4.73      9.619
  19       -4.726      4.867      10.88
  20       -4.716      4.834      8.903

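The docking results are not ideal, possibly because the molecule has many flexible torsions and is therefore highly flexible. AutoDock Vina leans toward rigid docking.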
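
The fgbar log above also warns that the search space exceeds 27000 cubic Angstrom, while the trpe log does not. The short calculation below reproduces that difference from the grid sizes printed in the two logs; the helper names are hypothetical.

```python
# Threshold quoted in the Vina warning message, in cubic Angstrom.
VOLUME_LIMIT = 27000.0


def box_volume(size):
    """Volume of a docking box given its (x, y, z) edge lengths."""
    x, y, z = size
    return x * y * z


def exceeds_limit(size, limit=VOLUME_LIMIT):
    """True when the box is strictly larger than the warning threshold."""
    return box_volume(size) > limit


# Grid sizes taken from the two logs above.
trpe_size = (30, 30, 30)
fgbar_size = (49.1, 37.6, 35.2)

for name, size in (("trpe", trpe_size), ("fgbar", fgbar_size)):
    print(name, round(box_volume(size), 1), exceeds_limit(size))
```

The trpe box sits exactly at the threshold (30 x 30 x 30 = 27000), so no warning is printed for it, whereas the fgbar box is roughly 65000 cubic Angstrom, more than twice the threshold.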

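Analysis Strategy

trpe

AutoDock Vina: trpe has a small binding space (suited to low-molecular-weight compounds), so filter by QED.

karamadock: only consider the docking results of small molecules that pass the QED filter (filter criteria: small molecule, QED).

Glide: small molecules, QED (the top 10,000 molecules by Vina score, judged against the substrate as the standard).

fgbar

Vina, karamadock, and the substrate standard: take the intersection and run Glide on it.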
