update
90
data/ring12_20/README.md
Normal file
@@ -0,0 +1,90 @@
Your Filtered Macrolactone Database

11036 compounds have been filtered from MacrolactoneDB based on your specified inputs.

```bash
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text ecfp4_binary
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text tanimoto_top_neighbors
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text smiles
```

## Embedding and Projection Optimization

### How projection_x and projection_y Are Generated

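#### UMAP dimensionality reduction

Both coordinates are produced by the `_run_umap()` function, which uses the UMAP algorithm to reduce the high-dimensional embedding vectors to 2D (projection.py:64-88).

The process has three steps:

1. Compute nearest neighbors: `nearest_neighbors()` first finds the k nearest neighbors of each point (projection.py:76-83).
2. Run the UMAP projection: UMAP then reduces the dimensionality using the precomputed neighbor information (projection.py:85-86).
3. Assign coordinates: the first column of the result becomes `projection_x` and the second becomes `projection_y` (projection.py:259-260).
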
#### Default parameters

UMAP runs with the following defaults:

- Number of neighbors: 15 nearest neighbors (projection.py:74)
- Distance metric: cosine distance (projection.py:73)

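The neighbor search and the coordinate split described above can be illustrated with a minimal, dependency-free sketch. It is not the code in projection.py: the function names `cosine_distance`, `knn_cosine`, and `split_projection` are hypothetical, the search is brute force, and each point is excluded from its own neighbor list as a simplification. The UMAP layout step itself is omitted.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def knn_cosine(vectors, k=15):
    """Brute-force k nearest neighbors per point (the point itself excluded)."""
    neighbors = []
    for i, u in enumerate(vectors):
        dists = [(cosine_distance(u, v), j) for j, v in enumerate(vectors) if j != i]
        dists.sort()  # closest first
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def split_projection(coords_2d):
    """Split rows of (x, y) pairs into projection_x and projection_y columns."""
    projection_x = [row[0] for row in coords_2d]
    projection_y = [row[1] for row in coords_2d]
    return projection_x, projection_y


# Toy data: two tight pairs of directions in 2D.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_neighbors = knn_cosine(toy_vectors, k=1)

# Pretend these rows came out of a 2D layout step.
toy_x, toy_y = split_projection([(0.5, 1.5), (2.0, -1.0)])
```

With `k=1`, each toy vector picks the other member of its pair as its nearest neighbor. A real run would use k = 15 and an approximate neighbor index rather than the quadratic loop shown here.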
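#### Handling different data types

##### Text data

For your SMILES molecule data, the system first generates text embeddings with SentenceTransformers and then reduces them with UMAP (projection.py:251-260).

##### Precomputed vectors

If you have precomputed ECFP4 vectors, the system runs UMAP on those vectors directly (projection.py:311-318).
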
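For binary fingerprints such as ECFP4, similarity is commonly measured with the Tanimoto coefficient rather than the cosine distance that is the default above. The sketch below shows how such a similarity and a top-neighbor list could be computed. It assumes each fingerprint is stored as a string of `0` and `1` characters, which may not match the actual encoding of the `ecfp4_binary` column, and the function names are hypothetical.

```python
def tanimoto(bits_a: str, bits_b: str) -> float:
    """Tanimoto coefficient |A & B| / |A | B| of two equal-length bit strings."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    on_a = {i for i, c in enumerate(bits_a) if c == "1"}
    on_b = {i for i, c in enumerate(bits_b) if c == "1"}
    union = on_a | on_b
    if not union:
        return 0.0  # convention chosen here for two all-zero fingerprints
    return len(on_a & on_b) / len(union)


def top_neighbors(fingerprints, query_index, top_n=3):
    """Return (index, similarity) pairs of the most similar other fingerprints."""
    query = fingerprints[query_index]
    scored = [
        (j, tanimoto(query, fp))
        for j, fp in enumerate(fingerprints)
        if j != query_index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:top_n]


# Toy 4-bit fingerprints; real ECFP4 vectors are much longer.
toy_fps = ["1100", "1110", "0011", "1100"]
toy_top = top_neighbors(toy_fps, query_index=0, top_n=2)
```

For the toy data, the fingerprint identical to the query ranks first, followed by the one that shares two of its three set bits.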
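#### Role in the visualization

In the front-end visualization, these coordinates serve as:

- The X/Y axes of the scatter plot: the position of each data point in 2D space.
- The basis for color encoding: colors can be mapped from the coordinate values (embedding-atlas.md:68-70).

#### Demo data example

The project's demo data generation follows the same pipeline: it computes embeddings with SentenceTransformers and then derives the `projection_x` and `projection_y` coordinates with UMAP (generate_demo_data.py:42-43).
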
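#### Notes

The quality of these projected coordinates depends largely on the quality of the original embeddings and on the choice of UMAP parameters. For chemical molecule data, a dedicated molecular embedding model usually yields a more meaningful 2D projection, in which structurally similar molecules cluster together.
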
### UMAP Parameter Tuning

You can adjust the UMAP parameters to improve the visualization:

```bash
# Adjust the number of neighbors and the distance parameters
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text smiles \
  --umap-n-neighbors 30 \
  --umap-min-dist 0.1 \
  --umap-metric cosine \
  --umap-random-state 42
```

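When comparing several settings, it can help to generate the command lines instead of typing them by hand. The following sketch builds one command per parameter combination using only the flags shown above. The grid values and the helper name `build_command` are illustrative assumptions, and the commands are printed rather than executed. As a rough guide, a smaller neighbor count tends to emphasize local structure, while a smaller minimum distance packs points more tightly.

```python
import itertools
import shlex

PARQUET = "data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet"


def build_command(n_neighbors, min_dist, metric="cosine", random_state=42):
    """Build the embedding-atlas argument list for one UMAP setting."""
    return [
        "uv", "run", "embedding-atlas", PARQUET,
        "--text", "smiles",
        "--umap-n-neighbors", str(n_neighbors),
        "--umap-min-dist", str(min_dist),
        "--umap-metric", metric,
        "--umap-random-state", str(random_state),
    ]


# Illustrative grid; these values are assumptions, not recommendations.
neighbor_grid = [15, 30, 50]
min_dist_grid = [0.05, 0.1, 0.5]

commands = [
    build_command(n, d)
    for n, d in itertools.product(neighbor_grid, min_dist_grid)
]

for cmd in commands:
    # shlex.join quotes arguments safely for copy-pasting into a shell.
    print(shlex.join(cmd))
```

Keeping the random state fixed across the grid makes it easier to attribute differences between layouts to the parameters being varied.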
## Custom Embedding Models

For chemical molecule data, you may want to use a specialized model that meets the following requirements:

### Supported model types

embedding-atlas supports two kinds of custom models:

#### Text embedding models

For text data (such as your SMILES molecule data), the system uses the SentenceTransformers library (projection.py:118-126). This means you can use any Hugging Face model that is compatible with SentenceTransformers.

#### Image embedding models

For image data, the system uses the `pipeline` feature of the transformers library (projection.py:168-180).

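### Model format requirements

#### SentenceTransformers compatibility

Text models must be compatible with the SentenceTransformers library (projection.py:98-99). This includes models that:

- are trained specifically for sentence embedding
- support the `.encode()` method
- output fixed-dimension vectors
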
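The last two requirements can be checked with a quick duck-typed test before pointing the tool at a model. The sketch below is an assumption-laden illustration: `TextEncoder`, `ToyCharCountEncoder`, and `check_encoder` are hypothetical names, and the toy encoder merely counts a few SMILES characters so that the example runs without downloading a model. A real SentenceTransformers model could be passed to `check_encoder` in its place.

```python
from typing import Protocol, Sequence


class TextEncoder(Protocol):
    """Minimal interface assumed here: encode() maps texts to vectors."""

    def encode(self, sentences: Sequence[str]) -> Sequence[Sequence[float]]: ...


class ToyCharCountEncoder:
    """Hypothetical stand-in model: counts a few characters per SMILES string."""

    ALPHABET = "CNO=()"

    def encode(self, sentences):
        return [[float(s.count(ch)) for ch in self.ALPHABET] for s in sentences]


def check_encoder(model: TextEncoder, samples: Sequence[str]) -> int:
    """Return the embedding dimension if encode() yields fixed-size vectors."""
    if not callable(getattr(model, "encode", None)):
        raise TypeError("model has no .encode() method")
    vectors = model.encode(list(samples))
    dims = {len(v) for v in vectors}
    if len(vectors) != len(samples) or len(dims) != 1:
        raise ValueError("encode() must return one fixed-dimension vector per input")
    return dims.pop()


# Two SMILES strings of different lengths must map to the same dimension.
dim = check_encoder(ToyCharCountEncoder(), ["CCO", "C1CCC(=O)OCC1"])
```

Passing inputs of different lengths is the point of the check: a compatible model returns vectors of the same dimension regardless of how long each SMILES string is.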
```bash
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text smiles \
  --umap-n-neighbors 30 \
  --umap-min-dist 0.1 \
  --umap-metric cosine \
  --umap-random-state 42
```
89
data/ring12_20/counts.txt
Normal file
@@ -0,0 +1,89 @@
Target Organisms
Homo sapiens 815
Homo sapiens, None 180
Plasmodium falciparum 161
Hepatitis C virus, None 112
Homo sapiens, Plasmodium falciparum 63
Oryctolagus cuniculus 62
Mus musculus 60
Toxoplasma gondii 39
Homo sapiens, Rattus norvegicus 27
Mus musculus, Homo sapiens 24
None, Rattus norvegicus 23
Human immunodeficiency virus 1 20
Hepatitis C virus 18
Rattus norvegicus 17
Homo sapiens, Sus scrofa 11
Homo sapiens, Chlorocebus aethiops 10
Serratia marcescens 9
Escherichia coli 8
Oryctolagus cuniculus, Homo sapiens 7
Streptococcus pneumoniae 6
Oryctolagus cuniculus, Staphylococcus aureus, Raoultella planticola, Bacillus subtilis, Mus musculus, Micrococcus luteus, None, Escherichia coli, Plasmodium falciparum, Streptococcus pneumoniae, Homo sapiens, Escherichia coli K-12, Toxoplasma gondii 6
Plasmodium falciparum K1 5
Bacillus anthracis 5
Mus musculus, Homo sapiens, None 5
Bacillus anthracis, Homo sapiens 4
Candida albicans, Cryptococcus neoformans, Aspergillus fumigatus 4
Mus musculus, None 4
Plasmodium falciparum, Homo sapiens, None 4
None, Homo sapiens, Plasmodium falciparum 3
Bacillus subtilis, Homo sapiens 3
Oryctolagus cuniculus, Homo sapiens, None 3
Sus scrofa, Mus musculus, None, Plasmodium falciparum, Homo sapiens, Rattus norvegicus 2
Homo sapiens, None, Rattus norvegicus 2
Cryptococcus neoformans 2
Homo sapiens, None, Chlorocebus aethiops 2
Staphylococcus aureus 2
Candida albicans, Cryptococcus neoformans, Mycobacterium intracellulare, Aspergillus fumigatus 2
Mus musculus, None, Human immunodeficiency virus 1 2
Escherichia coli (strain K12) 2
Plasmodium falciparum 3D7, Homo sapiens 2
Aspergillus fumigatus 1
Sus scrofa 1
Saccharomyces cerevisiae S288c, Human immunodeficiency virus 1, Human herpesvirus 1, Plasmodium falciparum, None, Homo sapiens, Rattus norvegicus 1
Hepatitis C virus, Homo sapiens, None 1
Plasmodium falciparum 3D7 1
Bacillus subtilis 1
Mus musculus, Homo sapiens, None, Saccharomyces cerevisiae 1
Chlorocebus aethiops 1
Homo sapiens, Escherichia coli K-12, None 1
Hepatitis C virus, Homo sapiens, None, Rattus norvegicus 1
None, Homo sapiens, Human herpesvirus 1 1
Homo sapiens, None, Trypanosoma brucei brucei 1
Homo sapiens, None, Cryptococcus neoformans 1
Homo sapiens, Rattus norvegicus, Human immunodeficiency virus 1 1
None, Plasmodium falciparum, Escherichia coli, Streptococcus pneumoniae, Naegleria fowleri, Homo sapiens, Streptococcus, Toxoplasma gondii 1
Giardia intestinalis, Trypanosoma cruzi, Equus caballus, Bos taurus, Mus musculus, None, Plasmodium falciparum, Chlorocebus aethiops, Homo sapiens 1
Plasmodium falciparum NF54, Trypanosoma cruzi, Trypanosoma brucei rhodesiense, Rattus norvegicus 1
None, Homo sapiens, Plasmodium falciparum K1, Plasmodium falciparum 1
Saccharomyces cerevisiae S288c, Homo sapiens, None, Saccharomyces cerevisiae, Phytophthora sojae 1
Bacillus subtilis, Homo sapiens, Schistosoma mansoni, Saccharomyces cerevisiae, Giardia intestinalis 1
Streptococcus, Homo sapiens, None 1
Mus musculus, Homo sapiens, Rattus norvegicus 1
Homo sapiens, Spinacia oleracea 1
Human immunodeficiency virus 1, Mus musculus, None, Hepatitis C virus, Homo sapiens, Rattus norvegicus 1
None, Plasmodium falciparum, Trypanosoma brucei rhodesiense 1
Hepatitis C virus, None, Rattus norvegicus 1
Homo sapiens, Equus caballus 1
Plasmodium falciparum NF54, Trypanosoma cruzi, Trypanosoma brucei rhodesiense 1
Schistosoma mansoni, Influenza A virus 1
Leishmania chagasi, Trypanosoma cruzi 1
Candida albicans, Cryptococcus neoformans 1
None, Plasmodium falciparum 1
Caenorhabditis elegans 1
Bos taurus, Sus scrofa 1
Plasmodium falciparum, Enterococcus faecium 1
Homo sapiens, Gallus gallus 1
Homo sapiens, Escherichia coli 1
Plasmodium falciparum, Homo sapiens, None, Rattus norvegicus, Schistosoma mansoni 1
Homo sapiens, None, Influenza A virus 1
Mycobacterium tuberculosis, None 1
Escherichia coli, Homo sapiens, Toxoplasma gondii, None, Streptococcus pneumoniae 1
Bacillus subtilis, Oryctolagus cuniculus, Homo sapiens, Schistosoma mansoni, Giardia intestinalis 1
Homo sapiens, None, Rattus norvegicus, Escherichia coli O157:H7 1
Giardia intestinalis, Schistosoma mansoni, Mus musculus, None, Homo sapiens, Saccharomyces cerevisiae 1
Trypanosoma cruzi 1
Influenza A virus 1
Escherichia coli K-12 1
Human herpesvirus 4 (strain B95-8) 1
11037
data/ring12_20/temp.csv
Normal file
File diff suppressed because one or more lines are too long
11037
data/ring12_20/temp_with_macrocycles.csv
Normal file
File diff suppressed because one or more lines are too long
11037
data/ring12_20/temp_with_macrocycles_with_ecfp4.csv
Normal file
File diff suppressed because one or more lines are too long
BIN
data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet
Normal file
Binary file not shown.