embedding_atlas

lingyuzeng/embedding_atlas

lingyuzeng 9f0a0fbcdc update

2025-10-23 16:21:52 +08:00

Your Filtered Macrolactone Database

11036 compounds have been filtered from MacrolactoneDB based on your specified inputs.

uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text ecfp4_binary
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text tanimoto_top_neighbors
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text smiles

Embedding and Projection Optimization

How projection_x and projection_y are generated

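UMAP dimensionality reduction

Both coordinates are produced by the _run_umap() function, which uses the UMAP algorithm to reduce the high-dimensional embedding vectors to 2D (projection.py:64-88).

The process is as follows:

1. Compute nearest neighbors: nearest_neighbors() first finds the k nearest neighbors of every point (projection.py:76-83).
2. UMAP projection: UMAP then runs on the precomputed neighbor information (projection.py:85-86).
3. Assign coordinates: the first column of the result becomes projection_x and the second becomes projection_y (projection.py:259-260).

Default parameters

UMAP uses the following defaults:

- Number of neighbors: 15 nearest neighbors (projection.py:74)
- Distance metric: cosine distance (projection.py:73)

Use with different data types

Text data

For your SMILES molecule data, the system first generates text embeddings with SentenceTransformers and then reduces them with UMAP (projection.py:251-260).

Precomputed vectors

If you have precomputed ECFP4 vectors, the system applies UMAP to them directly (projection.py:311-318).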
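The nearest-neighbor and coordinate-assignment steps described in this section can be illustrated with a small pure-Python sketch. This is a brute-force illustration under stated assumptions, not the code in projection.py: the helper names `cosine_distance`, `k_nearest_neighbors`, and `split_projection` are hypothetical, each point is excluded from its own neighbor list, and the UMAP layout step itself is omitted.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def k_nearest_neighbors(vectors, k):
    """For each vector, return the indices of its k nearest other vectors.

    Brute force, O(n^2) comparisons; fine for a toy example only.
    """
    neighbors = []
    for i, v in enumerate(vectors):
        distances = [
            (cosine_distance(v, w), j)
            for j, w in enumerate(vectors)
            if j != i  # assumption: a point is not its own neighbor here
        ]
        distances.sort()
        neighbors.append([j for _, j in distances[:k]])
    return neighbors


def split_projection(points_2d):
    """Split an (n, 2) result into two columns, as in the coordinate-assignment step."""
    projection_x = [p[0] for p in points_2d]
    projection_y = [p[1] for p in points_2d]
    return projection_x, projection_y


# Four toy embedding vectors forming two obvious pairs.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
knn = k_nearest_neighbors(vectors, k=1)

# A made-up 2D layout standing in for the UMAP output.
xs, ys = split_projection([[0.5, 1.5], [2.0, -1.0]])
```

With k=1, each vector's nearest neighbor is the other member of its pair, which is the kind of local structure UMAP tries to preserve in the 2D layout.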

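Role in the visualization

In the front-end visualization, these coordinates serve as:

- The X/Y axes of the scatter plot: the position of each data point in 2D space
- The basis for color encoding: colors can be mapped from the coordinate values (embedding-atlas.md:68-70)

Demo data example

The project's demo data generation follows the same flow: embeddings are computed with SentenceTransformers, and UMAP then produces the projection_x and projection_y coordinates (generate_demo_data.py:42-43).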

Notes

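The quality of these projected coordinates depends heavily on the quality of the original embeddings and on the choice of UMAP parameters. For chemical molecule data, a dedicated molecular embedding model usually yields a more meaningful 2D projection, in which structurally similar molecules cluster together.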
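One common way to make "structurally similar" concrete for fingerprints such as ECFP4 is the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below assumes a fingerprint is stored as a string of 0 and 1 characters; that is an assumption about the ecfp4_binary column, not something confirmed here. The helper names `on_bits` and `tanimoto` are hypothetical, and the 8-bit strings are toy values (real ECFP4 fingerprints are commonly 1024 or 2048 bits).

```python
def on_bits(bit_string):
    """Return the set of positions whose character is '1'."""
    return {i for i, ch in enumerate(bit_string) if ch == "1"}


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit positions."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


# Toy 8-bit fingerprints, for illustration only.
fp_a = on_bits("11010010")  # on bits: 0, 1, 3, 6
fp_b = on_bits("11000011")  # on bits: 0, 1, 6, 7

similarity = tanimoto(fp_a, fp_b)  # 3 shared bits / 5 bits in the union
```

A value near 1 means the two molecules share most of their substructure bits; a value near 0 means they share almost none.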

UMAP parameter tuning

You can adjust the UMAP parameters to get a better visualization:

# Adjust the number of neighbors and the distance parameters
uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text smiles \
  --umap-n-neighbors 30 \
  --umap-min-dist 0.1 \
  --umap-metric cosine \
  --umap-random-state 42
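When comparing several settings, it can help to generate the command lines from one place. The sketch below only assembles and prints commands that use the flags shown above; `build_command` is a hypothetical helper, and the candidate neighbor counts (15, 30, 50) are illustrative choices, not recommendations. Nothing is executed, since each invocation is meant to be run and inspected one at a time.

```python
import shlex

PARQUET_PATH = "data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet"


def build_command(parquet_path, text_column, n_neighbors, min_dist, metric, random_state):
    """Assemble one embedding-atlas invocation as an argument list."""
    return [
        "uv", "run", "embedding-atlas", parquet_path,
        "--text", text_column,
        "--umap-n-neighbors", str(n_neighbors),
        "--umap-min-dist", str(min_dist),
        "--umap-metric", metric,
        "--umap-random-state", str(random_state),
    ]


# Vary only the neighbor count; keep the other parameters fixed so runs are comparable.
commands = [
    build_command(PARQUET_PATH, "smiles", n_neighbors=n, min_dist=0.1,
                  metric="cosine", random_state=42)
    for n in (15, 30, 50)
]

for command in commands:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(command))
```

Keeping the random state fixed across runs means differences between layouts come from the parameter being varied rather than from random initialization.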

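Custom embedding models

For chemical molecule data, you may want to use a specialized model, as long as it meets the requirements below.

Supported model types

embedding-atlas supports two types of custom models:

- Text embedding models: for text data (such as your SMILES molecule data), the system uses the SentenceTransformers library (projection.py:118-126). This means you can use any Hugging Face model that is compatible with SentenceTransformers.
- Image embedding models: for image data, the system uses the pipeline feature of the transformers library (projection.py:168-180).

Model format requirements

SentenceTransformers compatibility

Text models must be compatible with the SentenceTransformers library (projection.py:98-99). This includes:

- Models trained specifically for sentence embeddings
- Models that support the .encode() method
- Models that output fixed-dimension vectors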
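The last two requirements can be checked mechanically before pointing the tool at a model. The sketch below is a duck-typed check, assuming only that the model object exposes .encode() and returns one vector per input; `check_encoder` and `ToyEncoder` are hypothetical names, and `ToyEncoder` is a stand-in so the example runs without downloading anything. With the real library you would pass a loaded SentenceTransformer instance instead.

```python
def check_encoder(model, samples):
    """Return the embedding dimension if model.encode() yields fixed-size vectors.

    Raises TypeError if there is no callable .encode(), and ValueError if the
    returned vectors do not all share one length.
    """
    encode = getattr(model, "encode", None)
    if not callable(encode):
        raise TypeError("model has no callable .encode() method")
    vectors = encode(samples)
    dims = {len(v) for v in vectors}
    if len(dims) != 1:
        raise ValueError(f"expected exactly one embedding dimension, got {sorted(dims)}")
    return dims.pop()


class ToyEncoder:
    """Hypothetical stand-in for a real model; NOT a meaningful chemical embedding."""

    def encode(self, texts):
        # Three crude features per SMILES string: length, carbon count, oxygen count.
        return [[float(len(t)), float(t.count("C")), float(t.count("O"))] for t in texts]


dimension = check_encoder(ToyEncoder(), ["CCO", "C1CCCCC1"])
```

This check says nothing about the first requirement: whether a model was trained to produce useful sentence-style embeddings still has to be judged from its documentation and from the resulting projection.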

uv run embedding-atlas data/ring12_20/temp_with_macrocycles_with_ecfp4.parquet --text smiles \
  --umap-n-neighbors 30 \
  --umap-min-dist 0.1 \
  --umap-metric cosine \
  --umap-random-state 42
