add cluster func
README.md
@@ -469,12 +469,222 @@ FGBAR data statistics:
Total after QED filtering only: 7228

Total after Vina-score filtering only: 36111

Total satisfying both the QED and Vina-score criteria: 6568
## Clustering Exploration

### trpe clustering exploration

`radius=3` makes the clustering more sensitive to structural differences between molecules, especially differences in functional-group placement.

`n-bits=1024` is large enough to discriminate tens of thousands to hundreds of thousands of molecules while keeping the computational cost acceptable.
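As a concrete reference for what Tanimoto similarity computes on these bit fingerprints, here is a minimal numpy sketch (a stand-in for RDKit's `ExplicitBitVect` + `TanimotoSimilarity`; the 5-bit toy vectors are illustrative, not real 1024-bit Morgan fingerprints):

```python
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto similarity of two binary fingerprint vectors:
    |a AND b| / |a OR b|."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter / union) if union else 0.0

a = np.array([1, 1, 0, 1, 0], dtype=bool)
b = np.array([1, 0, 0, 1, 1], dtype=bool)
print(tanimoto(a, b))  # 2 shared bits / 4 set bits = 0.5
```

A larger `n-bits` reduces bit collisions between different substructures, which is why 1024 bits still separates molecules well at this scale.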
```shell
(vina) lingyuzeng@Mac-mini vina % python scripts/cluster_granularity_scan.py \
--csv result/filtered_results/qed_values_trpe_combined_filtered.csv \
--smiles-col smiles \
--radius 3 \
--n-bits 1024
Computing Tanimoto similarity matrix...
Method          Params                    #Clusters  AvgSize    AvgIntraSim
Butina          {'cutoff': 0.4}           8960       2.10       0.706
Butina          {'cutoff': 0.5}           6720       2.80       0.625
Butina          {'cutoff': 0.6}           4648       4.04       0.548
Butina          {'cutoff': 0.7}           2783       6.75       0.463
Butina          {'cutoff': 0.8}           958        19.61      0.333
Hierarchical    {'threshold': 0.3}        12235      1.54       0.814
Hierarchical    {'threshold': 0.4}        9603       1.96       0.739
Hierarchical    {'threshold': 0.5}        7300       2.57       0.664
DBSCAN          {'eps': 0.2}              2050       3.18       0.106
DBSCAN          {'eps': 0.3}              2275       4.61       0.113
DBSCAN          {'eps': 0.4}              2014       6.65       0.113
KMeans          {'k': 10}                 10         1878.70    0.204
KMeans          {'k': 20}                 20         939.35     0.200
KMeans          {'k': 50}                 50         375.74     0.233
```
| Method | Candidate params | #Clusters | AvgSize | AvgIntraSim | Interpretation |
| ------------ | ----------------- | ---------- | ------- | ----------- | ------------------------------------ |
| Butina | **cutoff=0.6** | 4648 | 4.04 | 0.548 | Moderate cluster count; intra-cluster similarity ≈ 0.55, i.e. some structural diversity within clusters without being too loose |
| Butina | cutoff=0.7 | 2783 | 6.75 | 0.463 | Fewer clusters, but low intra-cluster similarity; likely too loose |
| Hierarchical | **threshold=0.5** | 7300 | 2.57 | 0.664 | Finer clusters of 2–3 molecules each with high similarity; good representativeness |
| Hierarchical | threshold=0.6 | (not tested, but would be looser) | - | - | Not recommended; likely too loose |
Recommended parameters:

Butina → `--cutoff 0.6`

Hierarchical → `--cutoff 0.5` (in the script, the Hierarchical cutoff is the distance threshold `t`, so threshold=0.5 is passed directly as 0.5)
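The distance-threshold semantics above can be sketched directly with scipy, which is what the hierarchical path ultimately calls; the 4×4 Tanimoto-distance matrix here is a made-up toy, not project data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy Tanimoto-distance matrix (1 - similarity) for 4 molecules
dist = np.array([
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
])
# linkage() requires the condensed (lower-triangle) form
Z = linkage(squareform(dist), method="average")
# t is the distance cutoff: clusters merge only below this distance
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # two clusters: molecules {0,1} and {2,3}
```

With `t=0.5` the two tight pairs merge (distances 0.2 and 0.1) but the cross-pair average distance (~0.86) keeps them apart, which is exactly how the threshold controls granularity.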
### fgbar clustering exploration
```shell
(vina) [lyzeng24@cli-ARM-01 vina]$ python scripts/cluster_granularity_scan.py --csv result/filtered_results/qed_values_fgbar_combined_filtered.csv \
> --smiles-col smiles \
> --radius 3 \
> --n-bits 1024
Computing Tanimoto similarity matrix...
Method          Params                    #Clusters  AvgSize    AvgIntraSim
Butina          {'cutoff': 0.4}           4810       1.37       0.727
Butina          {'cutoff': 0.5}           3915       1.68       0.628
Butina          {'cutoff': 0.6}           3027       2.17       0.544
Butina          {'cutoff': 0.7}           1943       3.38       0.441
Butina          {'cutoff': 0.8}           690        9.52       0.302
Hierarchical    {'threshold': 0.3}        5661       1.16       0.849
Hierarchical    {'threshold': 0.4}        4993       1.32       0.750
Hierarchical    {'threshold': 0.5}        4210       1.56       0.664
DBSCAN          {'eps': 0.2}              393        2.24       0.091
DBSCAN          {'eps': 0.3}              648        2.54       0.092
DBSCAN          {'eps': 0.4}              862        3.25       0.096
KMeans          {'k': 10}                 10         656.80     0.144
KMeans          {'k': 20}                 20         328.40     0.159
KMeans          {'k': 50}                 50         131.36     0.180
```
| Method | Candidate params | #Clusters | AvgSize | AvgIntraSim | Interpretation |
| ------------ | ----------------- | ---- | ------- | ----------- | ------------- |
| Butina | **cutoff=0.6** | 3027 | 2.17 | 0.544 | Moderate cluster count with decent intra-cluster diversity |
| Butina | cutoff=0.7 | 1943 | 3.38 | 0.441 | Fewer clusters; low intra-cluster similarity, loose |
| Hierarchical | **threshold=0.5** | 4210 | 1.56 | 0.664 | Small, highly similar clusters; good representativeness |
Recommended parameters:

Butina → `--cutoff 0.6`

Hierarchical → `--cutoff 0.5`
### Choosing Butina

A long-established method for molecular-similarity clustering: Butina was designed specifically for molecular fingerprints with Tanimoto similarity, is widely used in cheminformatics, and produces highly interpretable clusters.

Faster: at the scale of tens of thousands of molecules, the Butina algorithm is typically much faster than hierarchical clustering and uses less memory.

Intuitive parameter: the cutoff is simply the lower bound on intra-cluster similarity, which makes tuning straightforward (e.g. 0.6–0.7 to trade off cluster size against diversity).

Hierarchical clustering's advantage is the dendrogram it produces (so the threshold can be chosen dynamically from the tree), but its drawbacks are:

- slow and memory-hungry on large datasets (it needs the full similarity matrix)
- its threshold is less intuitive than Butina's cutoff, and the dendrogram becomes hard to visualize with many molecules

Recommendation:

- routine automated batch screening → Butina
- in-depth exploration of relationships between clusters → Hierarchical

Then, based on the clustering results:

- Cluster: Butina cutoff ≈ 0.6–0.7 or hierarchical threshold ≈ 0.5–0.6 (keep intra-cluster variation under control without producing too many clusters).
- Pick representatives: take one central molecule per cluster (the member with the highest average similarity to the other members).
- If more diversity is still needed, run another round of MaxMin picking on the representative set.
Design rationale:

1. Clustering (controlling intra-cluster variation)

Goal: place similar molecules in the same cluster while keeping different clusters as dissimilar as possible.

A Butina cutoff ≈ 0.6–0.7 or a hierarchical threshold ≈ 0.5–0.6 preserves structural differences without producing too many clusters. A lower threshold yields more, finer-grained clusters (higher diversity, less grouping); a higher threshold yields fewer, coarser clusters (stronger representativeness, lower diversity). The values chosen here sit in the middle: molecules within each cluster stay similar, and the total number of clusters remains manageable.
2. Selecting a representative molecule per cluster

Within each cluster, the molecule with the highest average similarity to all other members is taken as the "cluster center".

This ensures the chosen molecule best represents the chemical features of the whole cluster.

The approach is akin to medoid picking and is more robust than picking an arbitrary member.
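The "cluster center" rule above amounts to a medoid pick on the similarity matrix; a minimal numpy sketch over a toy 3-member cluster (the similarity values are made up):

```python
import numpy as np

def medoid_index(sim_sub: np.ndarray) -> int:
    """Index of the member with the highest mean similarity to the
    rest of its cluster (ties broken by lowest index)."""
    n = sim_sub.shape[0]
    if n == 1:
        return 0
    # mean similarity to the *other* members: drop the self-similarity of 1.0
    mean_sim = (sim_sub.sum(axis=1) - np.diag(sim_sub)) / (n - 1)
    return int(np.argmax(mean_sim))

# Toy 3-member cluster: member 1 is most similar to both others
sim = np.array([
    [1.0, 0.8, 0.4],
    [0.8, 1.0, 0.7],
    [0.4, 0.7, 1.0],
])
print(medoid_index(sim))  # 1
```

In practice `sim_sub` would be the slice of the precomputed Tanimoto matrix for one cluster's member indices.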
3. Further boosting diversity (MaxMin picking)

If the representative set is still too concentrated, run another round of MaxMin picking:

- Pick one starting molecule (e.g. the most active one, or a random one).
- At each step, pick the molecule least similar to the already-selected set.

This maximizes structural differences within the representative set, giving a more dispersed, diverse selection.
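The two steps above reduce to a greedy loop; this is a plain-numpy toy version (RDKit ships an optimized `MaxMinPicker` for fingerprints, so this sketch only illustrates the rule, using 1D point distances as a stand-in for Tanimoto distances):

```python
import numpy as np

def maxmin_pick(dist: np.ndarray, n_pick: int, seed: int = 0) -> list:
    """Greedy MaxMin picking: start from `seed`, then repeatedly add the
    candidate whose minimum distance to the already-picked set is largest."""
    picked = [seed]
    while len(picked) < n_pick:
        # for each candidate, distance to its nearest already-picked point
        min_d = dist[:, picked].min(axis=1)
        min_d[picked] = -1.0  # never re-pick
        picked.append(int(np.argmax(min_d)))
    return picked

# 4 points on a line at 0, 1, 5, 6; distance = |xi - xj|
x = np.array([0.0, 1.0, 5.0, 6.0])
dist = np.abs(x[:, None] - x[None, :])
print(maxmin_pick(dist, 2))  # [0, 3]: the two most distant points
```

For molecules, `dist` would be `1 - sim_mat` restricted to the representative set.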
After selecting one molecule per cluster, rank the representatives by docking score and keep the top 100 molecules for subsequent experiments.
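The final ranking step might look like this in pandas; the column names `cluster_id` / `vina_best` mirror the output of `cluster_best_vina.py`, the four rows are made-up data, and top-2 stands in for top-100 (more negative Vina scores are better):

```python
import pandas as pd

# Hypothetical representative set: one row per cluster,
# vina_best = best (lowest) docking score of that cluster's representative
reps = pd.DataFrame({
    "cluster_id": [0, 1, 2, 3],
    "smiles": ["CCO", "c1ccccc1", "CCN", "CC(=O)O"],
    "vina_best": [-7.2, -9.1, -6.5, -8.3],
})
top = reps.nsmallest(2, "vina_best")  # here top-2 stands in for top-100
print(top["cluster_id"].tolist())  # [1, 3]
```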
## Clustering Results

### trpe clustering results

Butina clustering (cutoff 0.6)
```shell
python scripts/cluster_best_vina.py \
--csv result/filtered_results/qed_values_trpe_combined_filtered.csv \
--smiles-col smiles \
--radius 3 \
--n-bits 1024 \
--cutoff 0.6 \
--out result/trpe_cluster_best_vina_butina.csv
[i] butina clustering done: 8960 clusters, 18787 valid molecules
[+] Saved: result/trpe_cluster_best_vina_butina_butina.csv (8960 rows, 1 per cluster)
```
Hierarchical clustering (cutoff 0.5)
```shell
python scripts/cluster_best_vina.py \
--csv result/filtered_results/qed_values_trpe_combined_filtered.csv \
--smiles-col smiles \
--radius 3 \
--n-bits 1024 \
--method hierarchical \
--cutoff 0.5 \
--out result/trpe_cluster_best_vina_hier.csv
[i] hierarchical clustering done: 7267 clusters, 18787 valid molecules
[+] Saved: result/trpe_cluster_best_vina_hier_hierarchical.csv (7267 rows, 1 per cluster)
```
### fgbar clustering results

Butina clustering (cutoff 0.6)
```shell
python scripts/cluster_best_vina.py \
--csv result/filtered_results/qed_values_fgbar_combined_filtered.csv \
--smiles-col smiles \
--radius 3 \
--n-bits 1024 \
--cutoff 0.6 \
--out result/fgbar_cluster_best_vina_butina.csv
[i] butina clustering done: 4810 clusters, 6568 valid molecules
[+] Saved: result/fgbar_cluster_best_vina_butina_butina.csv (4810 rows, 1 per cluster)
```
Hierarchical clustering (cutoff 0.5)
```shell
python scripts/cluster_best_vina.py \
--csv result/filtered_results/qed_values_fgbar_combined_filtered.csv \
--smiles-col smiles \
--radius 3 \
--n-bits 1024 \
--method hierarchical \
--cutoff 0.5 \
--out result/fgbar_cluster_best_vina_hier.csv
[i] hierarchical clustering done: 4176 clusters, 6568 valid molecules
[+] Saved: result/fgbar_cluster_best_vina_hier_hierarchical.csv (4176 rows, 1 per cluster)
```
#### karamadock screening

Awaiting feedback on the structural results.

karamadock: only look at docking of the small molecules that passed QED filtering (filter criteria: **small molecule**, QED)

glide: small molecules, QED (the 10k with the best Vina scores, per the substrate criteria)
@@ -492,3 +702,94 @@ glide: small molecules, QED (the 10k with the best Vina scores, per the substrate criteria)

### fgbar

vina, karamadock, substrate criteria; take the intersection for glide.
## tanimoto_clustering.py Usage

Environment dependencies:

```shell
micromamba create -n chem_cluster python=3.11 mordred chemplot pandas rdkit scikit-learn
```

```python
from utils.chem_cluster import TanimotoClusterer, search_best_config, SearchConfig
```
Core classes and functions:

- `TanimotoClusterer`: the main clustering class; configurable method and parameters, with `fit()` / `stats()` / `representatives()` / `chemplot_scatter()`.
- `SearchConfig`: the parameter grid for automatic search (fingerprint radius, nBits, clusterer, etc.).
- `search_best_config`: automatically searches for a well-separated clustering configuration and returns the best clusterer, its statistics, and a history table for comparison.
- `ClusterStats`: clustering statistics struct (sample count, cluster count, largest-cluster fraction, silhouette score, etc.).
# Molecular Docking Analysis Pipeline

## Table of Contents

[TOC]
## 1. Project Overview
...

## 2. System Features
...

## 3. Technical Architecture
...

## 4. Technology Choices
...

## 5. Development Environment
...

## 6. Technical Constraints
...

## Usage Guide

### Data Preparation
...

### Molecular Docking
...

### Result Analysis
...

### Clustering Analysis
...
### New: Representative Molecule Selection

#### Selecting the best-Vina-score molecule within each cluster

```bash
# Example command
python scripts/select_representatives.py \
--csv result/filtered_results/qed_values_fgbar_top100.csv \
--smiles-col smiles \
--vina-col vina_scores \
--radius 3 \
--n-bits 1024 \
--sim-cutoff 0.6 \
--output result/clustered_representatives.csv
```
- `--csv`: input CSV file path
- `--smiles-col`: SMILES column name
- `--vina-col`: Vina score column name
- `--radius`: Morgan fingerprint radius (default 3)
- `--n-bits`: fingerprint bit count (default 1024)
- `--sim-cutoff`: Butina similarity cutoff (default 0.6)
- `--output`: output file path
The script will:

1. Cluster the molecules with the Butina algorithm
2. Select the molecule with the best Vina score in each cluster
3. Save the result to a new CSV file

The output file contains one representative molecule per cluster, i.e. as many molecules as there are clusters.
scripts/cluster_best_vina.py
@@ -0,0 +1,160 @@
# -*- coding: utf-8 -*-
"""
Cluster molecules and pick the best vina_scores molecule from each cluster
(supports Butina or Hierarchical [scipy linkage]).

- Parses vina_scores correctly (stringified list -> list[float]); takes the minimum by default (more negative is better)
- Handles invalid SMILES robustly: clusters and groups only on the valid subset, then exports the original rows
- Method selection: --method butina / hierarchical / both
- hierarchical uses scipy.cluster.hierarchy.linkage(method='average') + fcluster split by distance threshold

Example:
python scripts/cluster_best_vina.py \
    --csv result/filtered_results/qed_values_trpe_combined_filtered.csv \
    --smiles-col smiles \
    --method both \
    --cutoff 0.6 \
    --radius 3 --n-bits 1024 \
    --out result/cluster_best_vina.csv

Butina only (recommended starting point):

python scripts/cluster_best_vina.py \
    --csv scripts/finally_data/qed_values_poses_fgbar_all.csv \
    --smiles-col smiles \
    --radius 3 --n-bits 1024 \
    --method butina \
    --cutoff 0.6 \
    --out scripts/finally_data/fgbar_cluster_best_vina_butina.csv

python scripts/cluster_best_vina.py \
    --csv scripts/finally_data/qed_values_poses_fgbar_all.csv \
    --smiles-col smiles \
    --radius 3 --n-bits 1024 \
    --method hierarchical \
    --cutoff 0.6 \
    --out scripts/finally_data/fgbar_cluster_best_vina_hierarchical.csv
# Output: fgbar_cluster_best_vina_hierarchical.csv

python scripts/cluster_best_vina.py \
    --csv scripts/finally_data/qed_values_poses_fgbar_all.csv \
    --smiles-col smiles \
    --radius 3 --n-bits 1024 \
    --method both \
    --cutoff 0.6 \
    --out scripts/finally_data/fgbar_cluster_best_vina.csv
# Output: two files for comparison
"""
import argparse
import ast
import sys, os
from pathlib import Path
import numpy as np
import pandas as pd

# Make utils.chem_cluster importable from the project root
sys.path.append(Path(os.path.abspath(__file__)).parent.parent.as_posix())
from utils.chem_cluster import TanimotoClusterer, FPConfig

def parse_vina_scores(v):
    """Parse '[-6.9, -6.1, ...]' into list[float]; return [nan] on error"""
    try:
        seq = ast.literal_eval(v)
        if isinstance(seq, (list, tuple)):
            return [float(x) for x in seq]
        return [float(v)]
    except Exception:
        return [np.nan]

def select_best_each_cluster(df_valid: pd.DataFrame, labels: np.ndarray) -> pd.DataFrame:
    """
    On the aligned df_valid (valid-SMILES rows only):
    - compute df_valid['vina_best'] = min(vina_scores)
    - keep the row with the lowest vina_best in each cluster
    """
    out = df_valid.copy()
    out["cluster_id"] = labels
    out["vina_list"] = out["vina_scores"].apply(parse_vina_scores)
    # Docking scores: lower (more negative) is better by default; use max if reversed
    out["vina_best"] = out["vina_list"].apply(min)
    # groupby: take the row with the minimum vina_best per cluster
    picked = (
        out.loc[out.groupby("cluster_id")["vina_best"].idxmin()]
        .sort_values(["cluster_id", "vina_best"])
        .reset_index(drop=True)
    )
    return picked

def run_one_method(df: pd.DataFrame, smiles_col: str, fp_radius: int, fp_n_bits: int,
                   method: str, cutoff: float, out_path: Path) -> pd.DataFrame:
    """
    method: 'butina' or 'hierarchical'
    cutoff:
    - butina -> sim_cutoff (similarity threshold)
    - hierarchical -> distance threshold t (= 1 - similarity threshold); --cutoff is used directly as t
    """
    smiles_all = df[smiles_col].astype(str).tolist()

    api = TanimotoClusterer(fp_cfg=FPConfig(radius=fp_radius, n_bits=fp_n_bits))

    if method == "butina":
        res = api.fit_from_smiles(smiles_all, method="butina",
                                  method_kwargs={"sim_cutoff": float(cutoff)})
    elif method == "hierarchical":
        # scipy linkage + fcluster split (average linkage)
        res = api.fit_from_smiles(smiles_all, method="scipy_linkage",
                                  method_kwargs={"method": "average",
                                                 "t": float(cutoff),
                                                 "criterion": "distance"})
    else:
        raise ValueError("method ∈ {'butina','hierarchical'}")

    labels = res["labels"]      # labels only for valid SMILES
    keep_idx = res["keep_idx"]  # row indices of valid SMILES in the original df

    print(f"[i] {method} clustering done: {len(np.unique(labels))} clusters, {len(keep_idx)} valid molecules")

    # Pick representatives on the valid subset only
    df_valid = df.iloc[keep_idx].copy()
    picked_valid = select_best_each_cluster(df_valid, labels)

    # Save
    picked_valid.to_csv(out_path, index=False)
    print(f"[+] Saved: {out_path.as_posix()} ({len(picked_valid)} rows, 1 per cluster)")

    return picked_valid

def main():
    parser = argparse.ArgumentParser(description="Cluster and pick the best vina_scores molecule per cluster (Butina / Hierarchical)")
    parser.add_argument("--csv", required=True, help="input CSV file path")
    parser.add_argument("--smiles-col", default="smiles", help="SMILES column name")
    parser.add_argument("--radius", type=int, default=3, help="ECFP radius (2=ECFP4, 3=ECFP6)")
    parser.add_argument("--n-bits", type=int, default=1024, help="fingerprint bit count (1024/2048)")
    parser.add_argument("--method", choices=["butina", "hierarchical", "both"], default="butina",
                        help="clustering method")
    parser.add_argument("--cutoff", type=float, default=0.6,
                        help="threshold (Butina: sim_cutoff; Hierarchical: distance threshold t)")
    parser.add_argument("--out", default="cluster_best_vina.csv", help="output filename (suffix _butina/_hierarchical is appended)")
    args = parser.parse_args()

    df = pd.read_csv(args.csv)
    if args.smiles_col not in df.columns:
        raise ValueError(f"SMILES column not found: {args.smiles_col}")

    out_base = Path(args.out)

    if args.method in ("butina", "both"):
        run_one_method(
            df, args.smiles_col, args.radius, args.n_bits,
            method="butina", cutoff=args.cutoff,
            out_path=out_base.with_name(f"{out_base.stem}_butina{out_base.suffix}")
        )

    if args.method in ("hierarchical", "both"):
        run_one_method(
            df, args.smiles_col, args.radius, args.n_bits,
            method="hierarchical", cutoff=args.cutoff,
            out_path=out_base.with_name(f"{out_base.stem}_hierarchical{out_base.suffix}")
        )

if __name__ == "__main__":
    main()
scripts/cluster_granularity_scan.py
@@ -0,0 +1,103 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Cluster granularity scan.

Example:
python scripts/cluster_granularity_scan.py \
    --csv result/filtered_results/qed_values_trpe_combined_filtered.csv \
    --smiles-col smiles \
    --radius 3 \
    --n-bits 1024
"""
import sys, os
from pathlib import Path
import argparse
import pandas as pd
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN

# Add the project root to sys.path
sys.path.append(Path(os.path.abspath(__file__)).parent.parent.as_posix())
from utils.chem_cluster import TanimotoClusterer, FPConfig

def tanimoto_matrix(smiles, radius=3, n_bits=1024):
    # SMILES are assumed valid; MolFromSmiles would return None otherwise
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, nBits=n_bits) for s in smiles]
    n = len(fps)
    sim_mat = np.zeros((n, n))
    for i in range(n):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps)
        sim_mat[i, :] = sims
    return sim_mat

def avg_intra_cluster_similarity(labels, sim_mat):
    """Average internal similarity over all clusters"""
    cluster_sims = []
    for lbl in set(labels):
        idx = np.where(labels == lbl)[0]
        if len(idx) > 1:
            sub_sim = sim_mat[np.ix_(idx, idx)]
            tril_idx = np.tril_indices_from(sub_sim, k=-1)
            cluster_sims.append(np.mean(sub_sim[tril_idx]))
    return np.mean(cluster_sims) if cluster_sims else 0

def scan(args):
    df = pd.read_csv(args.csv)
    smiles = df[args.smiles_col].astype(str).tolist()

    # Precompute the similarity matrix
    print("Computing Tanimoto similarity matrix...")
    sim_mat = tanimoto_matrix(smiles, radius=args.radius, n_bits=args.n_bits)
    dist_mat = 1 - sim_mat  # clustering uses the distance matrix

    results = []

    # 1. Butina clustering
    from rdkit.ML.Cluster import Butina
    # Butina.ClusterData with isDistData=True expects the flattened
    # lower-triangle distance list, not the full square matrix
    tri_dists = dist_mat[np.tril_indices_from(dist_mat, k=-1)]
    for cutoff in np.linspace(0.4, 0.8, 5):
        cluster_res = list(Butina.ClusterData(tri_dists, len(smiles), cutoff, isDistData=True))
        labels = np.zeros(len(smiles), dtype=int)
        for cid, members in enumerate(cluster_res):
            for m in members:
                labels[m] = cid
        avg_sim = avg_intra_cluster_similarity(labels, sim_mat)
        results.append(("Butina", {"cutoff": round(cutoff, 2)}, len(set(labels)), np.mean(np.bincount(labels)), avg_sim))

    # 2. Hierarchical clustering
    for thresh in [0.3, 0.4, 0.5]:
        model = AgglomerativeClustering(n_clusters=None, metric='precomputed', linkage='average', distance_threshold=thresh)
        labels = model.fit_predict(dist_mat)
        avg_sim = avg_intra_cluster_similarity(labels, sim_mat)
        results.append(("Hierarchical", {"threshold": thresh}, len(set(labels)), np.mean(np.bincount(labels)), avg_sim))

    # 3. DBSCAN
    for eps in [0.2, 0.3, 0.4]:
        model = DBSCAN(eps=eps, min_samples=2, metric="precomputed")
        labels = model.fit_predict(dist_mat)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        # Drop noise points (-1) while keeping indices aligned with the similarity matrix
        mask = labels != -1
        avg_sim = avg_intra_cluster_similarity(labels[mask], sim_mat[np.ix_(mask, mask)])
        results.append(("DBSCAN", {"eps": eps}, n_clusters, np.mean(np.bincount(labels[mask])) if n_clusters > 0 else 0, avg_sim))

    # 4. KMeans (reduce dimensionality first, then cluster)
    from sklearn.decomposition import PCA
    coords = PCA(n_components=10).fit_transform(sim_mat)
    for k in [10, 20, 50]:
        model = KMeans(n_clusters=k, random_state=42)
        labels = model.fit_predict(coords)
        avg_sim = avg_intra_cluster_similarity(labels, sim_mat)
        results.append(("KMeans", {"k": k}, len(set(labels)), np.mean(np.bincount(labels)), avg_sim))

    # Print the result table
    print(f"{'Method':<15} {'Params':<25} {'#Clusters':<10} {'AvgSize':<10} {'AvgIntraSim':<10}")
    for r in results:
        print(f"{r[0]:<15} {str(r[1]):<25} {r[2]:<10} {r[3]:<10.2f} {r[4]:<10.3f}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Cluster granularity scan")
    parser.add_argument("--csv", type=str, required=True, help="input CSV file path")
    parser.add_argument("--smiles-col", type=str, required=True, help="SMILES column name")
    parser.add_argument("--radius", type=int, default=3, help="Morgan fingerprint radius")
    parser.add_argument("--n-bits", type=int, default=1024, help="fingerprint bit count")
    args = parser.parse_args()
    scan(args)
File diff suppressed because it is too large