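feat: add CSV file comparison visualization and update pixi configuration

Main changes:
1. Add CSV file comparison visualization:
   - Add the src/visualization/comparison.py module, which compares two CSV files and visualizes them in different colors
   - Support both command-line and API usage
   - Can generate a static image or launch the interactive viewer
   - Support custom labels, models, and UMAP parameters
2. Update the pixi.toml configuration:
   - Add linux-64 platform support
   - Add several dependencies: ipykernel, anywidget, rdkit, selfies, fastapi, fastmcp, docker, etc.
   - Tighten dependency version constraints
3. Update the README.md documentation:
   - Add a description and usage examples for the CSV file comparison visualization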
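The comparison colors each point by the file it came from. The following is a minimal sketch of that labelling step only, not the code from this commit: `load_column` and `combine_sources` are hypothetical helper names, and the `text` / `source` keys are assumptions chosen for illustration.

```python
import csv
import io


def load_column(csv_text, column):
    """Return the non-empty values of `column` from CSV text (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader if row.get(column)]


def combine_sources(texts1, texts2, labels=("Group1", "Group2")):
    """Merge two text lists into rows tagged with a source label (hypothetical helper)."""
    rows = [{"text": t, "source": labels[0]} for t in texts1]
    rows += [{"text": t, "source": labels[1]} for t in texts2]
    return rows


# Two tiny in-memory CSV files standing in for file1.csv and file2.csv.
csv_a = "smiles,id\nCCO,1\n,2\nCCN,3\n"
csv_b = "SMILES\nc1ccccc1\n"

rows = combine_sources(
    load_column(csv_a, "smiles"),
    load_column(csv_b, "SMILES"),
    labels=("Dataset A", "Dataset B"),
)
```

A plotting or embedding step could then group the rows by `source` to assign one color per dataset.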
79
README.md
@@ -147,6 +147,83 @@ embedding-atlas dataset.parquet --text text_column
 embedding-atlas dataset.parquet --x projection_x --y projection_y
 ```
 
+## CSV file comparison visualization
+
+This project provides a tool for comparing the molecular data in two CSV files and visualizing it with Embedding Atlas.
+
+### Python API usage
+
+```python
+from script.visualize_csv_comparison import visualize_csv_comparison, create_embedding_service
+
+# Compare two CSV files
+visualize_csv_comparison(
+    "file1.csv",
+    "file2.csv",
+    column1="smiles",
+    column2="smiles",
+    output_path="comparison.png",
+    label1="Dataset A",
+    label2="Dataset B",
+    launch_interactive=True,
+    port=5055,
+    model="all-MiniLM-L6-v2"
+)
+
+# Create a visualization service directly from lists of texts
+texts1 = ["CCO", "CCN", "CCC"]
+texts2 = ["c1ccccc1", "CCCN", "CCCO"]
+
+create_embedding_service(
+    texts1,
+    texts2,
+    labels=("Alcohols", "Others"),
+    port=8080,
+    model="all-MiniLM-L6-v2",
+    umap_args={
+        "n_neighbors": 30,
+        "min_dist": 0.05,
+        "metric": "cosine"
+    }
+)
+```
+
+### Command-line usage
+
+```bash
+# Basic usage
+python script/visualize_csv_comparison.py file1.csv file2.csv
+
+# Specify different column names
+python script/visualize_csv_comparison.py file1.csv file2.csv \
+    --column1 smiles --column2 SMILES
+
+# Custom labels and output file
+python script/visualize_csv_comparison.py file1.csv file2.csv \
+    --label1 "Dataset A" --label2 "Dataset B" \
+    --output comparison.png
+
+# Launch the interactive viewer
+python script/visualize_csv_comparison.py file1.csv file2.csv \
+    --interactive --port 8080
+
+# Use a custom model and parameters
+python script/visualize_csv_comparison.py file1.csv file2.csv \
+    --interactive \
+    --model all-MiniLM-L6-v2 \
+    --batch-size 64
+```
+
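+### Features
+
+1. **Dual-mode support**: generate a static visualization image or launch the interactive viewer
+2. **Data source distinction**: the two data sources are drawn in different colors and identified in the legend
+3. **Automatic labeling**: file names are used as data source labels by default; custom labels are also supported
+4. **Port configuration**: the interactive viewer's port can be customized
+5. **Model selection**: a different SentenceTransformer model can be specified
+6. **UMAP parameter tuning**: UMAP parameters can be customized for a better visualization
+7. **Flexible input**: a visualization service can be created directly from lists of texts, without CSV files
+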
 ## Splitting the MolGen first-round fine-tuning dataset
 
 ```bash
@@ -158,7 +235,7 @@ uv run python script/split_drugbank.py \
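 ```
 
 Outputs: `split_train.csv` / `split_val.csv` / `split_test.csv`
-Molecules in `split_val` and `split_test` never appear in the training set, and their overall QED/MW distributions stay close to those of the training set, which suits the later step of “conditional generation from unseen reference molecules and inspecting neighborhood coverage”.
+Molecules in `split_val` and `split_test` never appear in the training set, and their overall QED/MW distributions stay close to those of the training set, which suits the later step of "conditional generation from unseen reference molecules and inspecting neighborhood coverage".
 
 ## Merging the split datasets for visualization
 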
16
pixi.toml
@@ -2,9 +2,23 @@
 authors = ["lingyuzeng <pylyzeng@gmail.com>"]
 channels = ["conda-forge"]
 name = "embedding_atlas"
-platforms = ["osx-arm64"]
+platforms = ["linux-64","osx-arm64"]
 version = "0.1.0"
 
 [tasks]
 
 [dependencies]
+python = "3.12.*"
+ipykernel = "*"
+anywidget = "*"
+notebook = "*"
+rdkit = "*"
+pandas = "*"
+selfies = "==2.1.1"
+fastapi = ">=0.111"
+uvicorn = "*"
+fastmcp = ">=2.11"
+docker = ">=7.1"
+httpx = ">=0.27"
+pydantic-settings = ">=2.2"
+matplotlib = "*"
@@ -12,7 +12,7 @@ import os
 from typing import Optional, List, Dict, Any
 import numpy as np
 
-def launch_interactive_viewer(df: pd.DataFrame, text_column: str, port: int = 5055, host: str = "localhost"):
+def launch_interactive_viewer(df: pd.DataFrame, text_column: str, port: int = 5055, host: str = "0.0.0.0"):
     """Launch the interactive server via the Python API"""
     try:
         from embedding_atlas.server import make_server
@@ -66,7 +66,7 @@ def create_embedding_service(
     texts2: List[str],
     labels: tuple = ("Group1", "Group2"),
     port: int = 5055,
-    host: str = "localhost",
+    host: str = "0.0.0.0",
     text_column: str = "text",
     model: str = "all-MiniLM-L6-v2",
     batch_size: int = 32,
@@ -162,7 +162,7 @@ def visualize_csv_comparison(
     label2: Optional[str] = None,
     launch_interactive: bool = False,
     port: int = 5055,
-    host: str = "localhost",
+    host: str = "0.0.0.0",
     model: str = "all-MiniLM-L6-v2",
     batch_size: int = 32,
     umap_args: Optional[Dict[str, Any]] = None
@@ -180,7 +180,7 @@ def visualize_csv_comparison(
         label2: Label for the second dataset (default: filename)
         launch_interactive: Whether to launch interactive viewer (default: False)
         port: Port for interactive viewer (default: 5055)
-        host: Host for interactive viewer (default: "localhost")
+        host: Host for interactive viewer (default: "0.0.0.0")
         model: Embedding model to use (default: "all-MiniLM-L6-v2")
         batch_size: Batch size for embedding computation (default: 32)
         umap_args: UMAP arguments as dictionary (default: None)
@@ -333,7 +333,7 @@ Examples:
     parser.add_argument("--label2", help="Label for the second dataset (default: filename)")
     parser.add_argument("--interactive", "-i", action="store_true", help="Launch interactive viewer")
     parser.add_argument("--port", "-p", type=int, default=5055, help="Port for interactive viewer (default: 5055)")
-    parser.add_argument("--host", default="localhost", help="Host for interactive viewer (default: localhost)")
+    parser.add_argument("--host", default="0.0.0.0", help="Host for interactive viewer (default: 0.0.0.0)")
     parser.add_argument("--model", default="all-MiniLM-L6-v2", help="Embedding model to use (default: all-MiniLM-L6-v2)")
     parser.add_argument("--batch-size", type=int, default=32, help="Batch size for embedding computation (default: 32)")
 
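Note on the host change: `0.0.0.0` makes the viewer listen on all IPv4 network interfaces, while `localhost` restricts it to the local machine. The sketch below shows how a caller could pass `--host localhost` to keep the old behavior. It assumes a parser shaped like the one in the hunk above; the positional names `file1` and `file2` and the `build_parser` helper are assumptions made for illustration, not code from this commit.

```python
import argparse


def build_parser():
    """Build a reduced parser mirroring the options shown in the diff (sketch)."""
    parser = argparse.ArgumentParser(description="Compare two CSV files (sketch)")
    parser.add_argument("file1")  # assumed positional name
    parser.add_argument("file2")  # assumed positional name
    parser.add_argument("--interactive", "-i", action="store_true")
    parser.add_argument("--port", "-p", type=int, default=5055)
    parser.add_argument("--host", default="0.0.0.0")  # new default from this commit
    return parser


# Override the new default so the viewer is reachable only from this machine.
args = build_parser().parse_args(
    ["file1.csv", "file2.csv", "--interactive", "--host", "localhost"]
)
print(args.host, args.port)  # prints "localhost 5055"
```

On a shared or remote machine the new default is convenient for access from another host, but it also exposes the viewer to the network, so an explicit `--host` is worth considering there.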