2025-10-23 16:21:52 +08:00
parent 5e21419a67
commit 9f0a0fbcdc
25 changed files with 38489 additions and 1 deletions

90
data/ring12_20/README.md Normal file

@@ -0,0 +1,90 @@
Your Filtered Macrolactone Database
11036 compounds have been filtered from MacrolactoneDB based on your specified inputs.
```bash
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text ecfp4_binary
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text tanimoto_top_neighbors
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text smiles
```
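## Embedding and projection tuning
### How projection_x and projection_y are generated
#### UMAP dimensionality reduction
Both coordinates are produced by the _run_umap() function, which uses the UMAP algorithm to reduce the high-dimensional embedding vectors to 2D (projection.py:64-88).
The process has three steps:
1. Compute nearest neighbors: nearest_neighbors() first finds the k nearest neighbors of each point (projection.py:76-83).
2. UMAP projection: UMAP then reduces the dimensionality using the precomputed neighbor information (projection.py:85-86).
3. Assign coordinates: the first column of the result becomes projection_x and the second column becomes projection_y (projection.py:259-260).
#### Default parameters
UMAP uses the following defaults:
- Number of neighbors: 15 nearest neighbors (projection.py:74)
- Distance metric: cosine distance (projection.py:73)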
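To make the nearest-neighbor step concrete, here is a minimal brute-force sketch of a cosine k-nearest-neighbor search in plain Python. It is not the project's nearest_neighbors() implementation, which is likely to use a faster approximate method; the function names, the choice to exclude each point from its own neighbor list, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (norm_a * norm_b)


def k_nearest_neighbors(vectors, k=15):
    """For each vector, return the indices of its k nearest other vectors."""
    neighbors = []
    for i, v in enumerate(vectors):
        # Score every other vector, then keep the k smallest distances.
        scored = sorted(
            (cosine_distance(v, w), j) for j, w in enumerate(vectors) if j != i
        )
        neighbors.append([j for _, j in scored[:k]])
    return neighbors


# Toy data: two vectors near the x axis, two near the y axis.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(k_nearest_neighbors(vectors, k=1))  # [[1], [0], [3], [2]]
```

The brute-force search costs O(n²) distance evaluations, so it is only practical for small samples; for the roughly 11,000 compounds in this dataset an approximate index would normally be preferred.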
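#### Handling of different data types
- Text data: for your SMILES molecule data, the system first generates text embeddings with SentenceTransformers and then reduces them with UMAP (projection.py:251-260).
- Precomputed vectors: if you already have precomputed ECFP4 vectors, the system runs UMAP directly on those vectors (projection.py:311-318).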
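If the fingerprints are stored as text, they need to become numeric vectors before they can be used as precomputed input. The sketch below assumes, without confirmation from the data file, that a column such as ecfp4_binary holds strings of '0' and '1' characters; the helper names are hypothetical and the 4-bit strings are toy values, not real ECFP4 fingerprints.

```python
def bits_to_vector(bit_string):
    """Convert a string such as "0101" into [0.0, 1.0, 0.0, 1.0]."""
    if any(ch not in "01" for ch in bit_string):
        raise ValueError("expected only '0' and '1' characters")
    return [float(ch) for ch in bit_string]


def fingerprints_to_matrix(bit_strings):
    """Convert many bit strings into equal-length float vectors."""
    vectors = [bits_to_vector(s) for s in bit_strings]
    widths = {len(v) for v in vectors}
    if len(widths) > 1:
        # Dimensionality reduction needs every row to have the same width.
        raise ValueError(f"inconsistent fingerprint lengths: {sorted(widths)}")
    return vectors


# Toy 4-bit fingerprints (real ECFP4 vectors are typically 1024 or 2048 bits).
matrix = fingerprints_to_matrix(["0101", "1100"])
print(matrix)  # [[0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
```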
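#### Role in the visualization
In the front-end visualization, these coordinates serve as:
- Scatter plot X/Y axes: the position of each data point in 2D space
- A basis for color encoding: colors can be mapped from the coordinate values (embedding-atlas.md:68-70)
#### Demo data example
The project's demo data generation follows the same flow: it computes embeddings with SentenceTransformers and then generates the projection_x and projection_y coordinates with UMAP (generate_demo_data.py:42-43).
#### Notes
The quality of these projected coordinates depends heavily on the quality of the original embeddings and on the choice of UMAP parameters. For chemical molecule data, a dedicated molecular embedding model usually yields a more meaningful 2D projection, in which structurally similar molecules cluster together.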
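Structural similarity between binary fingerprints such as ECFP4 is commonly measured with the Tanimoto coefficient: the number of shared on-bits divided by the number of bits set in either fingerprint. The sketch below is a generic illustration of that formula; the function name and the toy bit indices are assumptions, and it does not show how the tanimoto_top_neighbors column in this dataset was computed.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Toy fingerprints given as sets of on-bit positions.
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 9, 20}
fp_c = {3, 7}

print(tanimoto(fp_a, fp_b))  # 0.6 (3 shared bits out of 5 distinct bits)
print(tanimoto(fp_a, fp_c))  # 0.0 (no shared bits)
```

Pairs with a high coefficient are the ones a good projection would be expected to place close together.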
### UMAP parameter tuning
You can adjust the UMAP parameters to get a better visualization:
```bash
# Adjust the neighbor count and distance parameters
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text smiles \
--umap-n-neighbors 30 \
--umap-min-dist 0.1 \
--umap-metric cosine \
--umap-random-state 42
```
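For experimenting with the same settings outside the CLI, the values above can be passed to the umap-learn package directly. This is a sketch under the assumption that each flag corresponds to the umap-learn keyword argument of the same name; it is not taken from the project's code, and project_2d is a hypothetical helper.

```python
# Assumed mapping from the CLI flags above to umap-learn keyword arguments.
UMAP_PARAMS = {
    "n_components": 2,     # project to 2D
    "n_neighbors": 30,     # larger values emphasize global structure
    "min_dist": 0.1,       # smaller values allow tighter clusters
    "metric": "cosine",
    "random_state": 42,    # fixed seed for reproducible layouts
}


def project_2d(vectors, params=UMAP_PARAMS):
    """Reduce an (n_samples, n_features) array to 2D.

    Returns two lists that play the roles of projection_x and projection_y.
    """
    import umap  # from the umap-learn package; imported lazily

    coords = umap.UMAP(**params).fit_transform(vectors)
    return coords[:, 0].tolist(), coords[:, 1].tolist()
```

Calling project_2d on a NumPy array of embeddings returns the two coordinate lists. With n_neighbors set to 30, the input should contain comfortably more than 30 rows.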
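## Custom embedding models
For chemical molecule data, you may want to use a specialized model, provided it meets the requirements below.
### Supported models
embedding-atlas supports two kinds of custom model:
- Text embedding models: for text data (such as your SMILES molecule data), the system uses the SentenceTransformers library (projection.py:118-126). This means you can use any Hugging Face model that is compatible with SentenceTransformers.
- Image embedding models: for image data, the system uses the pipeline feature of the transformers library (projection.py:168-180).
### Model format requirements
#### SentenceTransformers compatibility
Text models must be compatible with the SentenceTransformers library (projection.py:98-99). This includes:
- Models trained specifically for sentence embeddings
- Models that support the .encode() method
- Models that output fixed-dimension vectors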
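The last two requirements can be checked quickly before committing to a model. The sketch below is a hypothetical duck-typing check, not validation performed by embedding-atlas or SentenceTransformers; ToyEncoder is a made-up stand-in so the example runs without downloading a model.

```python
class ToyEncoder:
    """Hypothetical stand-in model: maps each string to a 3-number vector."""

    def encode(self, texts):
        return [
            [float(len(t)), float(t.count("C")), float(t.count("O"))]
            for t in texts
        ]


def looks_encoder_compatible(model, samples=("CCO", "C1CCCCC1")):
    """Check that model.encode() exists and yields one fixed-width vector per input."""
    encode = getattr(model, "encode", None)
    if not callable(encode):
        return False
    vectors = encode(list(samples))
    dims = {len(v) for v in vectors}
    return len(vectors) == len(samples) and len(dims) == 1


print(looks_encoder_compatible(ToyEncoder()))  # True
print(looks_encoder_compatible(object()))      # False
```

A loaded SentenceTransformer instance could be passed in place of ToyEncoder, since its encode() method also accepts a list of strings and returns one vector per input.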
```bash
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text smiles \
--umap-n-neighbors 30 \
--umap-min-dist 0.1 \
--umap-metric cosine \
--umap-random-state 42
```

89
data/ring12_20/counts.txt Normal file

@@ -0,0 +1,89 @@
Target Organisms
Homo sapiens 815
Homo sapiens, None 180
Plasmodium falciparum 161
Hepatitis C virus, None 112
Homo sapiens, Plasmodium falciparum 63
Oryctolagus cuniculus 62
Mus musculus 60
Toxoplasma gondii 39
Homo sapiens, Rattus norvegicus 27
Mus musculus, Homo sapiens 24
None, Rattus norvegicus 23
Human immunodeficiency virus 1 20
Hepatitis C virus 18
Rattus norvegicus 17
Homo sapiens, Sus scrofa 11
Homo sapiens, Chlorocebus aethiops 10
Serratia marcescens 9
Escherichia coli 8
Oryctolagus cuniculus, Homo sapiens 7
Streptococcus pneumoniae 6
Oryctolagus cuniculus, Staphylococcus aureus, Raoultella planticola, Bacillus subtilis, Mus musculus, Micrococcus luteus, None, Escherichia coli, Plasmodium falciparum, Streptococcus pneumoniae, Homo sapiens, Escherichia coli K-12, Toxoplasma gondii 6
Plasmodium falciparum K1 5
Bacillus anthracis 5
Mus musculus, Homo sapiens, None 5
Bacillus anthracis, Homo sapiens 4
Candida albicans, Cryptococcus neoformans, Aspergillus fumigatus 4
Mus musculus, None 4
Plasmodium falciparum, Homo sapiens, None 4
None, Homo sapiens, Plasmodium falciparum 3
Bacillus subtilis, Homo sapiens 3
Oryctolagus cuniculus, Homo sapiens, None 3
Sus scrofa, Mus musculus, None, Plasmodium falciparum, Homo sapiens, Rattus norvegicus 2
Homo sapiens, None, Rattus norvegicus 2
Cryptococcus neoformans 2
Homo sapiens, None, Chlorocebus aethiops 2
Staphylococcus aureus 2
Candida albicans, Cryptococcus neoformans, Mycobacterium intracellulare, Aspergillus fumigatus 2
Mus musculus, None, Human immunodeficiency virus 1 2
Escherichia coli (strain K12) 2
Plasmodium falciparum 3D7, Homo sapiens 2
Aspergillus fumigatus 1
Sus scrofa 1
Saccharomyces cerevisiae S288c, Human immunodeficiency virus 1, Human herpesvirus 1, Plasmodium falciparum, None, Homo sapiens, Rattus norvegicus 1
Hepatitis C virus, Homo sapiens, None 1
Plasmodium falciparum 3D7 1
Bacillus subtilis 1
Mus musculus, Homo sapiens, None, Saccharomyces cerevisiae 1
Chlorocebus aethiops 1
Homo sapiens, Escherichia coli K-12, None 1
Hepatitis C virus, Homo sapiens, None, Rattus norvegicus 1
None, Homo sapiens, Human herpesvirus 1 1
Homo sapiens, None, Trypanosoma brucei brucei 1
Homo sapiens, None, Cryptococcus neoformans 1
Homo sapiens, Rattus norvegicus, Human immunodeficiency virus 1 1
None, Plasmodium falciparum, Escherichia coli, Streptococcus pneumoniae, Naegleria fowleri, Homo sapiens, Streptococcus, Toxoplasma gondii 1
Giardia intestinalis, Trypanosoma cruzi, Equus caballus, Bos taurus, Mus musculus, None, Plasmodium falciparum, Chlorocebus aethiops, Homo sapiens 1
Plasmodium falciparum NF54, Trypanosoma cruzi, Trypanosoma brucei rhodesiense, Rattus norvegicus 1
None, Homo sapiens, Plasmodium falciparum K1, Plasmodium falciparum 1
Saccharomyces cerevisiae S288c, Homo sapiens, None, Saccharomyces cerevisiae, Phytophthora sojae 1
Bacillus subtilis, Homo sapiens, Schistosoma mansoni, Saccharomyces cerevisiae, Giardia intestinalis 1
Streptococcus, Homo sapiens, None 1
Mus musculus, Homo sapiens, Rattus norvegicus 1
Homo sapiens, Spinacia oleracea 1
Human immunodeficiency virus 1, Mus musculus, None, Hepatitis C virus, Homo sapiens, Rattus norvegicus 1
None, Plasmodium falciparum, Trypanosoma brucei rhodesiense 1
Hepatitis C virus, None, Rattus norvegicus 1
Homo sapiens, Equus caballus 1
Plasmodium falciparum NF54, Trypanosoma cruzi, Trypanosoma brucei rhodesiense 1
Schistosoma mansoni, Influenza A virus 1
Leishmania chagasi, Trypanosoma cruzi 1
Candida albicans, Cryptococcus neoformans 1
None, Plasmodium falciparum 1
Caenorhabditis elegans 1
Bos taurus, Sus scrofa 1
Plasmodium falciparum, Enterococcus faecium 1
Homo sapiens, Gallus gallus 1
Homo sapiens, Escherichia coli 1
Plasmodium falciparum, Homo sapiens, None, Rattus norvegicus, Schistosoma mansoni 1
Homo sapiens, None, Influenza A virus 1
Mycobacterium tuberculosis, None 1
Escherichia coli, Homo sapiens, Toxoplasma gondii, None, Streptococcus pneumoniae 1
Bacillus subtilis, Oryctolagus cuniculus, Homo sapiens, Schistosoma mansoni, Giardia intestinalis 1
Homo sapiens, None, Rattus norvegicus, Escherichia coli O157:H7 1
Giardia intestinalis, Schistosoma mansoni, Mus musculus, None, Homo sapiens, Saccharomyces cerevisiae 1
Trypanosoma cruzi 1
Influenza A virus 1
Escherichia coli K-12 1
Human herpesvirus 4 (strain B95-8) 1

11037
data/ring12_20/temp.csv Normal file

File diff suppressed because one or more lines are too long