Save filtering results

2025-08-12 18:45:33 +08:00
parent df72a0b9c0
commit e58f90cd1e
6 changed files with 25790 additions and 151 deletions
--- a/README.md
+++ b/README.md
@@ -1,151 +1,36 @@
 ## Directory structure

+```shell
+project_root/  
+├── input/  
+│   ├── receptors/  
+│   │   ├── TrpE_entry_1.pdb  
+│   │   └── TrpE_entry_1.pdbqt  
+│   ├── ligands/  
+│   │   ├── sdf/  
+│   │   │   ├── ligand_001.sdf  
+│   │   │   ├── ligand_002.sdf  
+│   │   │   └── ...  
+│   │   └── pdbqt/  
+│   │       ├── ligand_001.pdbqt  
+│   │       ├── ligand_002.pdbqt  
+│   │       └── ...  
+│   └── configs/  
+│       ├── TrpE_entry_1.box.txt  
+│       └── TrpE_entry_1.box.pdb  
+├── results/  
+│   ├── poses/  
+│   │   ├── ligand_001_out.pdbqt  
+│   │   ├── ligand_002_out.pdbqt  
+│   │   └── ...  
+│   └── scores/  
+│       ├── docking_scores.csv  
+│       └── summary_report.txt  
+└── scripts/  
+    ├── batch_prepare_ligands.sh  
+    ├── batch_docking.sh  
+    └── analyze_results.py  
 ```
-.
-├── config/                    # Configuration files (box definitions, etc.)
-├── ligand/                    # Ligand files
-│   └── pdbqt/                 # Prepared ligand files in PDBQT format
-├── receptor/                  # Receptor files
-├── result/                    # Docking results
-│   ├── fgbar/                 # FgBar dataset results
-│   │   └── poses_all/         # Individual docking results in SDF format
-│   ├── trpe/                  # TrpE dataset results
-│   │   └── poses_all/         # Individual docking results in SDF format
-│   └── refence/               # Reference molecule files
-│       ├── fgbar/             # FgBar reference molecules
-│       └── trpe/              # TrpE reference molecules
-├── scripts/                   # Analysis scripts and utilities
-└── README.md                  # This file
-```
-
-## 1. Preparation
-
-Before running the pipeline, you need to prepare the following files:
-
-1. Protein structure file (PDB format)
-2. Ligand library (MOL2 format, named according to the format CNPxxxxxx.1.mol2)
-3. Configuration file (box.txt format, defining the docking box parameters)
-
-## 2. Execution Steps
-
-### 2.1 Protein Preparation
-
-```bash
-prepare_receptor4.py -r protein.pdb -o protein.pdbqt
-```
-
-### 2.2 Ligand Preparation
-
-```bash
-prepare_ligand4.py -l ligand.mol2 -o ligand.pdbqt
-```
-
-### 2.3 Docking Execution
-
-```bash
-vina --config box.txt --receptor protein.pdbqt --ligand ligand.pdbqt --out out.pdbqt
-```
-
-### 2.4 Result Format Conversion
-
-Convert PDBQT format results to SDF format:
-
-```bash
-mk_export.py ./*_out.pdbqt --suffix _converted
-```
-
-## 3. Result Analysis
-
-### 3.1 Calculate QED Properties
-
-Calculate QED values for all molecules:
-
-```bash
-cd scripts
-python calculate_qed_values.py
-```
-
-This script processes both the docked molecules in the poses_all directories and the reference molecules in the refence directories. It generates two CSV files:
- qed_values_fgbar.csv
- qed_values_trpe.csv
-
-Each CSV file contains the following columns:
- smiles: SMILES representation of the molecule
- filename: Name of the source file
- qed: QED value of the molecule
- molecular_weight: Molecular weight of the molecule
- vina_scores: List of Vina scores for all conformers
-
-### 3.2 Analyze QED and Molecular Weight Distribution
-
-Analyze the distribution of QED and molecular weight properties and generate KDE plots:
-
-```bash
-python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe
-```
-
-This will generate four plots:
-1. kde_distribution_fgbar_normalized.png - Normalized distribution for fgbar dataset
-2. kde_distribution_fgbar_actual.png - Actual values distribution for fgbar dataset
-3. kde_distribution_trpe_normalized.png - Normalized distribution for trpe dataset
-4. kde_distribution_trpe_actual.png - Actual values distribution for trpe dataset
-
-Each plot contains three distributions:
- QED distribution (blue)
- Molecular weight distribution (red)
- Vina score distribution (green)
-
-Reference molecules are marked with different colored markers and labeled with their identifiers and corresponding values.
-
-### 3.3 Advanced Analysis Options
-
-You can also specify custom reference scores and conformation rank:
-
-```bash
-python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe --rank 0
-```
-
-The `--rank` option allows you to specify which conformation from the reference molecule's docking results to use for the Vina score reference. 
- Rank 0 (default) uses the best scoring conformation (rank 1 in Vina results)
- Rank 1 uses the second best scoring conformation, and so on
-
-The maximum valid rank is determined by the minimum number of conformations generated across all docked molecules. If you specify a rank that exceeds this minimum, the script will raise an error and inform you of the maximum valid rank.
-
-You can also specify custom reference scores:
-
-```bash
-python analyze_qed_mw_distribution.py qed_values_fgbar.csv qed_values_trpe.csv --dataset-names fgbar --dataset-names trpe --reference-scores '{"fgbar": {"9NY": -5.268}, "trpe": {"0GA": -6.531}}'
-```
-
-## 4. API Usage
-
-The analysis functions can also be called directly from Python:
-
-```python
-import sys
-sys.path.append('scripts')
-from analyze_qed_mw_distribution import main_api
-
-# Basic usage
-main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'])
-
-# With custom reference scores
-main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'], 
-         reference_scores={'fgbar': {'9NY': -5.268}, 'trpe': {'0GA': -6.531}})
-
-# With specific conformation rank
-main_api(['scripts/qed_values_fgbar.csv', 'scripts/qed_values_trpe.csv'], ['fgbar', 'trpe'], rank=0)
-```
-
-## 5. Output Files
-
-The analysis generates several output files:
- CSV files with QED values and Vina scores for all molecules
- KDE distribution plots in both normalized and actual values formats
-
-# AutoDock Vina Pipeline
-
-This repository contains a complete pipeline for molecular docking using AutoDock Vina, including preparation, execution, result processing, and analysis.

 ## Receptor preparation (PDBQT file)

@@ -193,6 +78,29 @@ mk_prepare_receptor.py -i prot.pdb \
 ```


+### Receptor preparation overview
+
+```shell
+mk_prepare_receptor.py -i xxx.pdb -o my_receptor -p -j -v
+mk_prepare_receptor.py -i FgBar1_cut_proteinprep.pdb -o FgBar1_cut_proteinprep -p
+```
+
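+The first command generates:
+
+my_receptor.pdbqt (the receptor file used for docking)
+
+my_receptor.json (structure metadata, useful for scripting)
+
+my_receptor.vina_box.txt (docking box parameters, passed to vina)
+
+| Option | Purpose                                  | Example output file |
+| ------ | ---------------------------------------- | ------------------- |
+| `-p`   | Write the PDBQT file (receptor)          | `xxx.pdbqt`         |
+| `-j`   | Write the JSON file (metadata)           | `xxx.json`          |
+| `-v`   | Write the Vina box parameters            | `xxx.vina_box.txt`  |
+| `-g`   | Write the GPF file (for legacy AutoDock) | `xxx.gpf`           |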
+
+
 ## Ligand 3D conformer preparation

 Each ligand needs an initial 3D conformer, stored in `ligand/sdf`.
@@ -263,7 +171,7 @@ vina --receptor input/receptors/TrpE_entry_1.pdbqt \
 ## Environment setup

 ```shell
-conda install -c conda-forge vina meeko rdkit joblib rich ipython parallel -y
+conda install -c conda-forge vina meeko rdkit joblib rich ipython parallel openpyxl pandas mordred -y
 ```


@@ -321,7 +229,7 @@ mk_export.py vina_results.pdbqt -j my_receptor.json -s lig_docked.sdf -p rec_doc

 ## Long-running batch jobs (djob)

-```
+```shell
 24562323     vina_job15   RUNNING    lyzeng24     default      default      2025/07/31 23:16:30  -                    agent-ARM-17         
 24562322     vina_job14   RUNNING    lyzeng24     default      default      2025/07/31 23:16:30  -                    agent-ARM-17         
 24562321     vina_job13   RUNNING    lyzeng24     default      default      2025/07/31 23:16:30  -                    agent-ARM-17         
@@ -337,6 +245,31 @@ mk_export.py vina_results.pdbqt -j my_receptor.json -s lig_docked.sdf -p rec_doc
 24562311     vina_job3    RUNNING    lyzeng24     default      default      2025/07/31 23:16:27  -                    agent-ARM-19
 ```

+## plant_metabolit dataset preparation
+
+Metabolites contained in the plant metabolic network previously built by Ruan Yao (阮耀), a senior lab member.
+
+Run:
+
+```shell
+cd /Users/lingyuzeng/Downloads/211.69.141.180/202508021824/vina/ligand/plant_meta
+chmod +x run_convert_smiles.sh
+./run_convert_smiles.sh
+```
+
+Output:
+
+```shell
+Conversion Summary:
+Total SMILES processed: 8086
+Successfully converted: 6238
+Failed conversions: 1848
+Skipped molecules (empty abbreviation): 0
+Output directory: sdf
+Success rate: 77.1%
+Script execution completed.
+```
+
 ## AutoDock Vina reference ligand docking

 trpe:(PDB ID: 5cwa)
@@ -347,7 +280,7 @@ trpe:(PDB ID: 5cwa)

 result:

-```
+```shell
 AutoDock Vina v1.2.7
 #################################################################
 # If you used AutoDock Vina in your work, please cite:          #
@@ -417,7 +350,7 @@ fgbar:（PDB ID： 8izd）

 result:

-```
+```shell
 AutoDock Vina v1.2.7
 #################################################################
 # If you used AutoDock Vina in your work, please cite:          #
@@ -484,14 +417,76 @@ mode |   affinity | dist from best mode

 ## Analysis strategy

-### trpe
+### trpe (COCONUT)

-AutoDock Vina: for trpe, which has a small pocket (low molecular weight ligands), filter by QED.
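+#### AutoDock Vina screening
+
+Filtering criteria:
+
+1. For the trpe pocket
+
+The trpe pocket and its reference ligand are both small, so molecules are first filtered by size (MW < 800).
+
+For the AutoDock Vina score, molecules scoring below -6.5 are kept, using the align_5cwa_0GA_addH result as the reference.
+
+The remaining molecules are ranked by QED, and the top 100 are selected as the final experimental candidates.
+
+2. For the fgbar pocket
+
+The reference ligand for the fgbar pocket is larger, so no MW filter is applied. Molecules are filtered by QED > 0.5 and by Vina score < -5.2 (the rank 1 score of the reference ligand align_8izd_F_9NY_addH), and the top 100 molecules by rank are then selected.
+
+AutoDock Vina: for trpe, which has a small pocket (low molecular weight ligands), filter by QED.
+
+Filtering results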
+
+```shell
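+Inspected the data structure of both CSV files with the head command
+
+Verified the completeness of the vina_scores column
+
+trpe dataset: 1919 files have fewer than 20 conformers
+fgbar dataset: 404 files have fewer than 20 conformers
+Minimum number of conformers across all molecules: 1
+Implemented the data filtering as specified in README.md:
+
+TRPE filter: MW < 800 and Vina < -6.5
+FGBAR filter: QED > 0.5 and Vina < -5.2
+Generated the filtered result files:
+
+/result/filtered_results/qed_values_fgbar_combined_filtered.csv (1878.1KB)
+/result/filtered_results/qed_values_fgbar_top100.csv (27.6KB)
+/result/filtered_results/qed_values_trpe_combined_filtered.csv (6090.1KB)
+/result/filtered_results/qed_values_trpe_top100.csv (27.5KB)
+Printed the statistics:
+
+TRPE statistics:
+Total original records: 41166
+Total after QED filter only: 7229
+Total after Vina score filter only: 29728
+Total satisfying both the QED and Vina score conditions: 18787
+FGBAR statistics:
+Total original records: 41166
+Total after QED filter only: 7228
+Total after Vina score filter only: 36111
+Total satisfying both the QED and Vina score conditions: 6568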
+```
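
The two filters described above could be sketched roughly as follows. This is a minimal illustration, not the script used for this commit. It assumes each record carries the `filename`, `qed`, `molecular_weight` and `vina_scores` columns of the QED CSV files, that `vina_scores` may be stored as the string form of a list, that the most negative score is the rank 1 pose, and that "top 100 by rank" means ranking by QED for both targets. The function names and the sample rows are hypothetical.

```python
import ast

# Thresholds taken from the filtering criteria in this README.
TRPE_MAX_MW = 800.0
TRPE_MAX_VINA = -6.5
FGBAR_MIN_QED = 0.5
FGBAR_MAX_VINA = -5.2
TOP_N = 100


def best_vina(row):
    """Return the best (most negative) Vina score of one record.

    Assumption: vina_scores is a list, or its string form such as "[-7.2, -6.9]".
    """
    scores = row["vina_scores"]
    if isinstance(scores, str):
        scores = ast.literal_eval(scores)
    return min(float(s) for s in scores)


def top_by_qed(rows, top_n):
    """Sort records by QED, highest first, and keep the first top_n."""
    return sorted(rows, key=lambda r: float(r["qed"]), reverse=True)[:top_n]


def filter_trpe(rows, top_n=TOP_N):
    """trpe: MW < 800 and best Vina score < -6.5, then top N by QED."""
    kept = [r for r in rows
            if float(r["molecular_weight"]) < TRPE_MAX_MW
            and best_vina(r) < TRPE_MAX_VINA]
    return top_by_qed(kept, top_n)


def filter_fgbar(rows, top_n=TOP_N):
    """fgbar: QED > 0.5 and best Vina score < -5.2, then top N by QED (assumed)."""
    kept = [r for r in rows
            if float(r["qed"]) > FGBAR_MIN_QED
            and best_vina(r) < FGBAR_MAX_VINA]
    return top_by_qed(kept, top_n)


# Hypothetical sample records, only to exercise the filters.
rows = [
    {"filename": "a.sdf", "qed": 0.8, "molecular_weight": 300.0, "vina_scores": "[-7.2, -6.9]"},
    {"filename": "b.sdf", "qed": 0.4, "molecular_weight": 950.0, "vina_scores": "[-8.0]"},
    {"filename": "c.sdf", "qed": 0.6, "molecular_weight": 450.0, "vina_scores": "[-5.5, -4.1]"},
    {"filename": "d.sdf", "qed": 0.9, "molecular_weight": 200.0, "vina_scores": "[-6.6]"},
]

print([r["filename"] for r in filter_trpe(rows)])
print([r["filename"] for r in filter_fgbar(rows)])
```

With real data, the records could be loaded with `csv.DictReader` from `qed_values_trpe.csv` and `qed_values_fgbar.csv`; the values then arrive as strings, which is why the sketch converts them with `float` and `ast.literal_eval`.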
+
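+#### karamadock screening
+
+Awaiting feedback on the structure results.
 
 karamadock: only considers docking of the small molecules left after QED filtering (filter criteria: **small molecules**, QED)
 
 glide: small molecules, QED. (The 10,000 molecules with the best Vina scores, following the substrate criterion.)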

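+#### glide screening
+
+Previously the intersection with the substrate criterion was considered; here, all molecules remaining after applying the substrate criterion are docked with glide.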
+
+Take `qed_values_fgbar_filtered.csv`
+
 ---

 ### fgbar