Save filtered results
This commit is contained in:
297
README.md
@@ -1,151 +1,36 @@
## Directory Structure

```shell
project_root/
├── input/
│   ├── receptors/
│   │   ├── TrpE_entry_1.pdb
│   │   └── TrpE_entry_1.pdbqt
│   ├── ligands/
│   │   ├── sdf/
│   │   │   ├── ligand_001.sdf
│   │   │   ├── ligand_002.sdf
│   │   │   └── ...
│   │   └── pdbqt/
│   │       ├── ligand_001.pdbqt
│   │       ├── ligand_002.pdbqt
│   │       └── ...
│   └── configs/
│       ├── TrpE_entry_1.box.txt
│       └── TrpE_entry_1.box.pdb
├── results/
│   ├── poses/
│   │   ├── ligand_001_out.pdbqt
│   │   ├── ligand_002_out.pdbqt
│   │   └── ...
│   └── scores/
│       ├── docking_scores.csv
│       └── summary_report.txt
└── scripts/
    ├── batch_prepare_ligands.sh
    ├── batch_docking.sh
    └── analyze_results.py
```
```
.
├── config/              # Configuration files (box definitions, etc.)
├── ligand/              # Ligand files
│   └── pdbqt/           # Prepared ligand files in PDBQT format
├── receptor/            # Receptor files
├── result/              # Docking results
│   ├── fgbar/           # FgBar dataset results
│   │   └── poses_all/   # Individual docking results in SDF format
│   ├── trpe/            # TrpE dataset results
│   │   └── poses_all/   # Individual docking results in SDF format
│   └── refence/         # Reference molecule files
│       ├── fgbar/       # FgBar reference molecules
│       └── trpe/        # TrpE reference molecules
├── scripts/             # Analysis scripts and utilities
└── README.md            # This file
```

## 1. Preparation

Before running the pipeline, you need to prepare the following files:

1. Protein structure file (PDB format)
2. Ligand library (MOL2 format, with files named like `CNPxxxxxx.1.mol2`)
3. Configuration file (`box.txt` format, defining the docking box parameters)

## 2. Execution Steps

### 2.1 Protein Preparation

```bash
prepare_receptor4.py -r protein.pdb -o protein.pdbqt
```

### 2.2 Ligand Preparation

```bash
prepare_ligand4.py -l ligand.mol2 -o ligand.pdbqt
```

### 2.3 Docking Execution

```bash
vina --config box.txt --receptor protein.pdbqt --ligand ligand.pdbqt --out out.pdbqt
```
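
For a whole ligand library, the single docking command above has to be repeated once per ligand. The following is a minimal sketch of how such a batch could be assembled in Python. The function name `build_vina_commands` and the directory arguments are hypothetical and are not part of this repository. Only the `vina` flags shown above and the `_out.pdbqt` output naming are taken from the README.

```python
from pathlib import Path


def build_vina_commands(receptor, config, ligand_dir, out_dir):
    """Return one vina argument list per ligand PDBQT file in ligand_dir.

    Hypothetical helper: it only builds the command lines and does not
    run vina itself.
    """
    commands = []
    for ligand in sorted(Path(ligand_dir).glob("*.pdbqt")):
        # Mirror the "<ligand>_out.pdbqt" naming used for docking results.
        out_file = Path(out_dir) / f"{ligand.stem}_out.pdbqt"
        commands.append([
            "vina",
            "--config", str(config),
            "--receptor", str(receptor),
            "--ligand", str(ligand),
            "--out", str(out_file),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Building the commands separately from running them makes it easy to inspect or split the batch before launching it.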

### 2.4 Result Format Conversion

Convert PDBQT format results to SDF format:

```bash
mk_export.py ./*_out.pdbqt --suffix _converted
```

## 3. Result Analysis

### 3.1 Calculate QED Properties

Calculate QED values for all molecules:

```bash
cd scripts
python calculate_qed_values.py
```

This script processes both the docked molecules in the `poses_all` directories and the reference molecules in the `refence` directories. It generates two CSV files:

- `qed_values_fgbar.csv`
- `qed_values_trpe.csv`

Each CSV file contains the following columns:

- `smiles`: SMILES representation of the molecule
- `filename`: Name of the source file
- `qed`: QED value of the molecule
- `molecular_weight`: Molecular weight of the molecule
- `vina_scores`: List of Vina scores for all conformers
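
As an illustration of the column layout above, here is a small sketch that reads such a CSV with the standard library. It assumes that `vina_scores` is stored as a list literal such as `[-6.5, -6.1]`. The README does not state the exact serialization, so treat that as an assumption. The helper name `load_qed_rows` and the sample row are made up for the example.

```python
import ast
import csv
import io


def load_qed_rows(handle):
    """Parse rows, converting the numeric columns and the vina_scores list."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "smiles": row["smiles"],
            "filename": row["filename"],
            "qed": float(row["qed"]),
            "molecular_weight": float(row["molecular_weight"]),
            # Assumption: the list is stored as a literal like "[-6.5, -6.1]".
            "vina_scores": [float(s) for s in ast.literal_eval(row["vina_scores"])],
        })
    return rows


# Made-up example row, not real data from the pipeline.
sample = io.StringIO(
    "smiles,filename,qed,molecular_weight,vina_scores\n"
    'CCO,example.sdf,0.41,46.07,"[-2.1, -1.9]"\n'
)
rows = load_qed_rows(sample)
print(rows[0]["filename"], rows[0]["vina_scores"])
```

With a real file, pass `open("qed_values_trpe.csv", newline="")` instead of the in-memory sample.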

### 3.2 Analyze QED and Molecular Weight Distribution

Analyze the distributions of QED and molecular weight and generate KDE plots:

```bash
python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe
```

This generates four plots:

1. `kde_distribution_fgbar_normalized.png`: normalized distribution for the fgbar dataset
2. `kde_distribution_fgbar_actual.png`: actual-value distribution for the fgbar dataset
3. `kde_distribution_trpe_normalized.png`: normalized distribution for the trpe dataset
4. `kde_distribution_trpe_actual.png`: actual-value distribution for the trpe dataset

Each plot contains three distributions:

- QED distribution (blue)
- Molecular weight distribution (red)
- Vina score distribution (green)

Reference molecules are shown with differently colored markers and labeled with their identifiers and corresponding values.

### 3.3 Advanced Analysis Options

You can also choose which conformation rank supplies the reference Vina score:

```bash
python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe --rank 0
```

The `--rank` option selects which conformation from the reference molecule's docking results is used as the Vina score reference.

- Rank 0 (default) uses the best-scoring conformation (rank 1 in Vina results)
- Rank 1 uses the second-best-scoring conformation, and so on

The maximum valid rank is determined by the minimum number of conformations generated across all docked molecules. If you specify a rank that exceeds this limit, the script raises an error and reports the maximum valid rank.
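
The rank check described above can be sketched as follows. This is not the script's actual code. The function name `reference_score` is hypothetical, and the sketch assumes ranks are 0-based list indices, so the maximum valid rank is the smallest conformer count minus one.

```python
def reference_score(reference_scores, all_scores, rank=0):
    """Return the reference Vina score at `rank` after validating the rank.

    reference_scores: Vina scores of the reference molecule, best first.
    all_scores: one list of Vina scores per docked molecule.
    """
    min_conformers = min(len(scores) for scores in all_scores)
    max_rank = min_conformers - 1  # assumption: ranks are 0-based indices
    if not 0 <= rank <= max_rank:
        raise ValueError(
            f"rank {rank} is out of range; maximum valid rank is {max_rank}"
        )
    return reference_scores[rank]


# Made-up scores for illustration only.
docked = [[-7.0, -6.8], [-5.0, -4.0, -3.0]]
print(reference_score([-6.5, -6.0], docked, rank=1))
```

Because the smallest molecule in this example has two conformers, `rank=2` would raise a `ValueError` that names 1 as the maximum valid rank.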

You can also specify custom reference scores:

```bash
python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe --reference-scores '{"fgbar": {"9NY": -5.268}, "trpe": {"0GA": -6.531}}'
```
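
To make the shape of these arguments concrete, the sketch below parses the same command line with `argparse`. It is only an assumption about how such options could be handled. The real script may use a different parser, and `build_parser` is a hypothetical name.

```python
import argparse
import json


def build_parser():
    """Hypothetical parser mirroring the options shown in the README."""
    parser = argparse.ArgumentParser()
    parser.add_argument("csv_files", nargs="+")
    # A repeated --dataset-names option accumulates into a list.
    parser.add_argument("--dataset-names", action="append", default=[])
    # The JSON string maps dataset name -> {reference id: Vina score}.
    parser.add_argument("--reference-scores", type=json.loads, default=None)
    parser.add_argument("--rank", type=int, default=0)
    return parser


args = build_parser().parse_args([
    "qed_values_fgbar.csv", "qed_values_trpe.csv",
    "--dataset-names", "fgbar", "--dataset-names", "trpe",
    "--reference-scores", '{"fgbar": {"9NY": -5.268}, "trpe": {"0GA": -6.531}}',
])
print(args.dataset_names, args.reference_scores["trpe"]["0GA"])
```

The JSON value must be quoted as a single shell argument, as in the command above, so that the braces and inner double quotes reach the parser unchanged.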

## 4. API Usage

The analysis functions can also be called directly from Python:

```python
import sys
sys.path.append('scripts')
from analyze_qed_mw_distribution import main_api

# Basic usage
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'])

# With custom reference scores
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'],
         reference_scores={'fgbar': {'9NY': -5.268}, 'trpe': {'0GA': -6.531}})

# With specific conformation rank
main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'], rank=0)
```

## 5. Output Files

The analysis generates several output files:

- CSV files with QED values and Vina scores for all molecules
- KDE distribution plots in both normalized and actual-value formats

# AutoDock Vina Pipeline

This repository contains a complete pipeline for molecular docking using AutoDock Vina, including preparation, execution, result processing, and analysis.

## Receptor Preparation: PDBQT Files

@@ -193,6 +78,29 @@ mk_prepare_receptor.py -i prot.pdb \
```

### Receptor Preparation Overview

```shell
mk_prepare_receptor.py -i xxx.pdb -o my_receptor -p -j -v
mk_prepare_receptor.py -i FgBar1_cut_proteinprep.pdb -o FgBar1_cut_proteinprep -p
```

This generates:

- `my_receptor.pdbqt` (the receptor file used for docking)
- `my_receptor.json` (structure metadata, useful for scripting)
- `my_receptor.vina_box.txt` (docking box parameters, for vina)

| Option | Purpose | Example output file |
| ---- | -------------------- | ------------------ |
| `-p` | Write a PDBQT file (receptor) | `xxx.pdbqt` |
| `-j` | Write a JSON file (metadata) | `xxx.json` |
| `-v` | Write the vina box parameters | `xxx.vina_box.txt` |
| `-g` | Write a GPF file (for legacy AutoDock) | `xxx.gpf` |

## Small-Molecule 3D Conformer Preparation

Each small molecule needs an initial 3D conformer, stored in `ligand/sdf`.
@@ -263,7 +171,7 @@ vina --receptor input/receptors/TrpE_entry_1.pdbqt \
## Environment Setup

```shell
conda install -c conda-forge vina meeko rdkit joblib rich ipython parallel -y
conda install -c conda-forge vina meeko rdkit joblib rich ipython parallel openpyxl pandas mordred -y
```

@@ -321,7 +229,7 @@ mk_export.py vina_results.pdbqt -j my_receptor.json -s lig_docked.sdf -p rec_doc

## djob: Long-Running Batch Jobs

```
```shell
24562323 vina_job15 RUNNING lyzeng24 default default 2025/07/31 23:16:30 - agent-ARM-17
24562322 vina_job14 RUNNING lyzeng24 default default 2025/07/31 23:16:30 - agent-ARM-17
24562321 vina_job13 RUNNING lyzeng24 default default 2025/07/31 23:16:30 - agent-ARM-17
@@ -337,6 +245,31 @@ mk_export.py vina_results.pdbqt -j my_receptor.json -s lig_docked.sdf -p rec_doc
24562311 vina_job3 RUNNING lyzeng24 default default 2025/07/31 23:16:27 - agent-ARM-19
```

## plant_metabolit Dataset Preparation

This dataset consists of the metabolites in the plant metabolic network previously built by senior labmate Ruan Yao (阮耀).

Command:

```shell
cd /Users/lingyuzeng/Downloads/211.69.141.180/202508021824/vina/ligand/plant_meta
chmod +x run_convert_smiles.sh
./run_convert_smiles.sh
```

Result:

```shell
Conversion Summary:
Total SMILES processed: 8086
Successfully converted: 6238
Failed conversions: 1848
Skipped molecules (empty abbreviation): 0
Output directory: sdf
Success rate: 77.1%
Script execution completed.
```
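
As a quick consistency check on the summary above, the counts and the reported success rate can be reproduced with a few lines of Python. The numbers are copied from the log. The assumption here is that skipped molecules are counted separately from failed ones.

```python
total = 8086
converted = 6238
failed = 1848
skipped = 0

# Every processed SMILES should be accounted for exactly once.
assert converted + failed + skipped == total

success_rate = 100 * converted / total
print(f"Success rate: {success_rate:.1f}%")
```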

## AutoDock Vina Reference Molecule Docking

trpe: (PDB ID: 5cwa)
@@ -347,7 +280,7 @@ trpe:(PDB ID: 5cwa)

result:

```
```shell
AutoDock Vina v1.2.7
#################################################################
# If you used AutoDock Vina in your work, please cite: #
@@ -417,7 +350,7 @@ fgbar:(PDB ID: 8izd)

result:

```
```shell
AutoDock Vina v1.2.7
#################################################################
# If you used AutoDock Vina in your work, please cite: #
@@ -484,14 +417,76 @@ mode | affinity | dist from best mode
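
## Analysis Strategy

### trpe
### trpe (COCUNT)

AutoDock Vina: QED. For trpe, which has a small pocket (low-molecular-weight ligands), filter by QED. ()
#### AutoDock Vina Screening

Filtering results:

1. trpe pocket

Because the trpe pocket and its reference molecule are small, filter for small molecules first (MW < 800).

For the AutoDock Vina score, use the align_5cwa_0GA_addH result as the reference and keep molecules with a score below -6.5.

From the remaining molecules, select the top 100 by QED ranking as the final experimental candidates.

2. fgbar pocket screening

The reference molecule for the fgbar pocket is large, so no MW filter is applied. Filter by QED (QED > 0.5) and by Vina score, using the rank 1 score of the reference molecule align_8izd_F_9NY_addH as the threshold (Vina < -5.2), then select the top 100 ranked molecules.

AutoDock Vina: QED. For trpe, which has a small pocket (low-molecular-weight ligands), filter by QED.

Filtering results
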
```shell
Inspected the data structure of both CSV files with the head command

Verified the integrity of the vina_scores column

trpe dataset: 1919 files have fewer than 20 conformations
fgbar dataset: 404 files have fewer than 20 conformations
Minimum number of conformations across all molecules: 1
Implemented the data filtering required by README.md:

TRPE filter: MW < 800 and Vina < -6.5
FGBAR filter: QED > 0.5 and Vina < -5.2
Generated the filtered result files:

/result/filtered_results/qed_values_fgbar_combined_filtered.csv (1878.1KB)
/result/filtered_results/qed_values_fgbar_top100.csv (27.6KB)
/result/filtered_results/qed_values_trpe_combined_filtered.csv (6090.1KB)
/result/filtered_results/qed_values_trpe_top100.csv (27.5KB)
Printed the statistics:

TRPE statistics:
Total records in original data: 41166
Records after QED filter only: 7229
Records after Vina score filter only: 29728
Records meeting both the QED and Vina score conditions: 18787
FGBAR statistics:
Total records in original data: 41166
Records after QED filter only: 7228
Records after Vina score filter only: 36111
Records meeting both the QED and Vina score conditions: 6568
```
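
The filtering criteria described above (MW < 800 and Vina < -6.5 for trpe; QED > 0.5 and Vina < -5.2 for fgbar; then the top 100) can be sketched as below. This is an illustrative sketch, not the script that produced the files listed above. It assumes that the "Vina score" of a molecule is its best (lowest) score across conformers and that the final ranking is by QED for both targets, which the text states only for trpe. The function names and sample rows are made up.

```python
def passes_trpe(row):
    """trpe criteria: MW < 800 and best Vina score < -6.5."""
    return row["molecular_weight"] < 800 and min(row["vina_scores"]) < -6.5


def passes_fgbar(row):
    """fgbar criteria: QED > 0.5 and best Vina score < -5.2."""
    return row["qed"] > 0.5 and min(row["vina_scores"]) < -5.2


def top_by_qed(rows, keep, n=100):
    """Keep rows passing `keep`, then return the n rows with highest QED."""
    selected = [row for row in rows if keep(row)]
    return sorted(selected, key=lambda row: row["qed"], reverse=True)[:n]


# Made-up rows for illustration only.
rows = [
    {"filename": "a.sdf", "qed": 0.72, "molecular_weight": 310.0, "vina_scores": [-7.1, -6.9]},
    {"filename": "b.sdf", "qed": 0.35, "molecular_weight": 450.0, "vina_scores": [-6.8]},
    {"filename": "c.sdf", "qed": 0.81, "molecular_weight": 905.0, "vina_scores": [-8.0]},
    {"filename": "d.sdf", "qed": 0.60, "molecular_weight": 280.0, "vina_scores": [-4.9]},
]
print([row["filename"] for row in top_by_qed(rows, passes_trpe)])
print([row["filename"] for row in top_by_qed(rows, passes_fgbar)])
```

Keeping each target's criteria in its own predicate makes it straightforward to change a threshold, for example when a different reference conformation rank is chosen.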

#### karamadock Screening

Awaiting feedback on the structure results.

karamadock: only look at the docking of small molecules that remain after QED filtering (filter criteria: **small molecules**, QED).

glide: small molecules, QED (the 10,000 molecules with the best Vina scores, following the substrate criteria).

#### glide Screening

Previously the intersection with the substrate criteria was considered. Here, all molecules remaining under the substrate criteria are docked with glide.

Take `qed_values_fgbar_filtered.csv`

---

### fgbar
