AGENTS.md - Embedding Atlas Complete Usage Guide
This document gives AI agents the commands needed to run the Embedding Atlas cheminformatics visualization platform. The only setup required is the one-time environment step in the Quick Start.
Quick Start - For Immediate Use
1. Environment Setup (First Time Only)
# Install uv if not available
curl -LsSf https://astral.sh/uv/install.sh | sh
# Navigate to project directory
cd /Users/lingyuzeng/project/embedding_atlas
# Create virtual environment and install dependencies
uv sync
2. Ready-to-Use Commands
Visualize Molecular Data (SMILES/SELFIES)
# Basic visualization with SMILES column
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles
# Interactive mode on custom port
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --port 8080 --interactive
# Export as standalone web application
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --export-application visualization.zip
Compare Two Datasets
# Compare two CSV files with molecular data
uv run python src/visualization/comparison.py file1.csv file2.csv \
--column1 smiles --column2 smiles \
--label1 "Dataset A" --label2 "Dataset B" \
--interactive --port 5055
# Generate static comparison image
uv run python src/visualization/comparison.py file1.csv file2.csv \
--column1 smiles --column2 smiles \
--output comparison.png
Start Backend Services
# Start FastAPI orchestrator (background)
uv run embedding-backend-api &
# Start MCP server (stdio mode)
uv run embedding-backend-mcp
# Test API endpoint
curl http://localhost:9000/sessions
Project Overview
Embedding Atlas is a complete cheminformatics visualization platform that:
- Creates interactive 2D embeddings of molecular data using sentence transformers
- Supports UMAP dimensionality reduction for chemical structures
- Provides web-based interactive visualization
- Includes dataset comparison tools
- Offers containerized session management
Current Version: Embedding Atlas v0.13.0
Python Version: ≥3.12
Package Manager: uv (cross-platform compatible)
Complete Usage Examples
Example 1: Single Dataset Visualization
# Visualize DrugBank dataset with SMILES
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
--text smiles \
--model all-MiniLM-L6-v2 \
--port 5055 \
--interactive
# Access visualization at http://localhost:5055
Example 2: Dataset Comparison Workflow
# Step 1: Split dataset for comparison
uv run python script/split_drugbank.py \
--in-csv data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
--out-dir splits_v2 \
--train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1
# Step 2: Merge splits with labels
uv run python script/merge_splits.py \
--input-dir splits_v2/ \
--output data/drugbank_split_merge.csv
# Step 3: Visualize merged dataset
uv run embedding-atlas data/drugbank_split_merge.csv --text smiles
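The merge step in Example 2 can be pictured as concatenating the split files and tagging each row with the split it came from. The sketch below is an assumption about what such a merge does, not the actual `merge_splits.py`; the `merge_with_labels` helper and the `split` column name are hypothetical.

```python
import csv
import io


def merge_with_labels(named_csv_texts, label_column="split"):
    """Concatenate CSV texts, adding a column that names the source split.

    `label_column` is a hypothetical name; the real script may use another.
    """
    merged = []
    for label, text in named_csv_texts.items():
        for row in csv.DictReader(io.StringIO(text)):
            row[label_column] = label
            merged.append(row)
    return merged


# Tiny in-memory stand-ins for the split files.
splits = {
    "train": "smiles\nCCO\nCCN\n",
    "val": "smiles\nc1ccccc1\n",
}
rows = merge_with_labels(splits)
print(rows)
```

With a label column like this in the merged CSV, the visualization can color points by split to show how train, validation, and test molecules are distributed.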
Example 3: Advanced Comparison with Custom Parameters
# Compare datasets with custom UMAP parameters
uv run python src/visualization/comparison.py \
data/small_molecules.csv data/large_molecules.csv \
--column1 smiles --column2 smiles \
--label1 "Small Molecules" --label2 "Large Molecules" \
--model all-mpnet-base-v2 \
--batch-size 16 \
--umap-args '{"n_neighbors": 20, "min_dist": 0.2, "metric": "cosine"}' \
--interactive --port 8080
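The `--umap-args` value in Example 3 is a JSON object passed as a single shell argument. A minimal sketch of how such a string could be parsed and overlaid on defaults is shown below; `parse_umap_args` and the default values are hypothetical and are not taken from `comparison.py`.

```python
import json

# Hypothetical defaults, chosen only for illustration.
DEFAULT_UMAP_ARGS = {"n_neighbors": 15, "min_dist": 0.1, "metric": "cosine"}


def parse_umap_args(raw):
    """Parse a JSON object string and overlay it on the defaults."""
    overrides = json.loads(raw) if raw else {}
    if not isinstance(overrides, dict):
        raise ValueError("--umap-args must be a JSON object")
    return {**DEFAULT_UMAP_ARGS, **overrides}


args = parse_umap_args('{"n_neighbors": 20, "min_dist": 0.2}')
print(args)
```

Quoting the JSON in single quotes on the command line, as Example 3 does, keeps the shell from interpreting the inner double quotes.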
Example 4: Backend Orchestration
# Create a session through the orchestrator API (the orchestrator must already be running)
curl -X POST http://localhost:9000/sessions \
-H 'Content-Type: application/json' \
-d '{
"session_id": "demo-session",
"data_url": "https://example.com/molecules.csv",
"extra_args": ["--text", "smiles"],
"environment": {"HF_ENDPOINT": "https://hf-mirror.com"}
}'
# Check session status
curl http://localhost:9000/sessions
# Delete session when done
curl -X DELETE http://localhost:9000/sessions/demo-session
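The same session request can be built from Python with the standard library. This is a sketch that mirrors the curl payload above; the helper name `build_create_session_request` is hypothetical, and the response format is not assumed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9000"  # orchestrator address used in this guide


def build_create_session_request(session_id, data_url, extra_args, environment=None):
    """Build (but do not send) the POST /sessions request shown in Example 4."""
    payload = {
        "session_id": session_id,
        "data_url": data_url,
        "extra_args": list(extra_args),
        "environment": dict(environment or {}),
    }
    return urllib.request.Request(
        f"{BASE_URL}/sessions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_session_request(
    "demo-session",
    "https://example.com/molecules.csv",
    ["--text", "smiles"],
    {"HF_ENDPOINT": "https://hf-mirror.com"},
)
print(request.get_method(), request.full_url)

# Sending is left commented out so the sketch runs without a live orchestrator:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```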
Data Processing Pipeline
1. Prepare Molecular Data
# Add ECFP4 fingerprints for similarity analysis
uv run python script/data_processing/add_ecfp4_tanimoto.py input.csv output_with_fingerprints.csv
# Add macrocycle detection
uv run python script/data_processing/add_macrocycle_columns.py input.csv output_with_macrocycles.csv
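For context on the similarity analysis, Tanimoto similarity between two fingerprints is the number of shared on-bits divided by the number of distinct on-bits. The sketch below works on plain sets of bit indices and does not compute ECFP4 itself; it is an illustration of the metric, not the code in `add_ecfp4_tanimoto.py`.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 77}
print(tanimoto(fp_a, fp_b))  # 2 shared bits / 5 distinct bits = 0.4
```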
2. Structure-Aware Dataset Splitting
# Create train/val/test splits with scaffold diversity
uv run python script/split_drugbank.py \
--in-csv molecules.csv \
--out-dir splits/ \
--train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 \
--n_qed_bins 5 --n_mw_bins 5 \
--largest-first
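One way to read `--largest-first` is as a greedy assignment: groups of related molecules (for example, molecules sharing a scaffold) are kept whole and placed, biggest first, into whichever split is furthest below its target size. The sketch below illustrates that idea only. It is an assumption, not the logic of `split_drugbank.py`, and it ignores the QED and molecular-weight binning.

```python
def largest_first_split(groups, ratios):
    """Assign whole groups to splits, biggest groups first.

    groups: mapping of group key (e.g. a scaffold) to a list of row ids.
    ratios: mapping of split name to target fraction, summing to 1.
    """
    total = sum(len(members) for members in groups.values())
    targets = {name: frac * total for name, frac in ratios.items()}
    assigned = {name: [] for name in ratios}
    # Sort by descending size; break ties by key so the result is deterministic.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _, members in ordered:
        # Pick the split that is furthest below its target size.
        name = max(targets, key=lambda n: targets[n] - len(assigned[n]))
        assigned[name].extend(members)
    return assigned


groups = {"A": [0, 1, 2, 3, 4, 5], "B": [6, 7], "C": [8], "D": [9]}
result = largest_first_split(groups, {"train": 0.8, "val": 0.1, "test": 0.1})
print({name: len(ids) for name, ids in result.items()})
```

Keeping each group in a single split prevents near-identical structures from appearing in both training and evaluation data.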
Docker Deployment (Optional)
Quick Docker Setup
# Build embedding atlas image
docker build -f docker/embedding-atlas.Dockerfile -t embedding-atlas:latest .
# Start full orchestration stack
docker compose -f docker/docker-compose.yml up --build
# Access services:
# - Orchestrator API: http://localhost:9000
# - Individual sessions: http://localhost:<port>, with <port> in the range 6000-6999
Common Data Formats
Supported Input Formats
- CSV files with molecular data
- Parquet files for large datasets
- Hugging Face datasets (via dataset name)
Required Columns
- smiles or SMILES: Molecular structures
- selfies: Alternative molecular representation
- id or compound_id: Unique identifiers (optional)
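Before launching a visualization, it can help to check which structure column a CSV actually contains so the right name is passed to `--text`. The following sketch inspects only the header row; `find_structure_column` is a hypothetical helper and the preference order is an assumption.

```python
import csv
import io

# Column names listed in this guide, in an assumed order of preference.
STRUCTURE_COLUMNS = ("smiles", "SMILES", "selfies")


def find_structure_column(csv_text):
    """Return the first recognised structure column in the CSV header, or None."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    for name in STRUCTURE_COLUMNS:
        if name in header:
            return name
    return None


print(find_structure_column("id,SMILES,qed\n1,CCO,0.4\n"))
```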
Generated Columns
- projection_x, projection_y: 2D coordinates
- __neighbors: Nearest neighbor information
Environment Configuration
Regional Access (China/Restricted Networks)
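# Use Hugging Face mirror
export HF_ENDPOINT=https://hf-mirror.com
# Optional: once the models are cached locally, skip network access entirely.
# Do not set this before the first download, because offline mode blocks it.
# export HF_HUB_OFFLINE=1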
# Use domestic PyPI mirror
export UV_DEFAULT_INDEX=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
Docker Configuration
# Custom Docker settings
export EMBEDDING_DOCKER_URL=tcp://engine:2375
export EMBEDDING_CONTAINER_IMAGE=embedding-atlas:latest
export EMBEDDING_API_PORT=9000
Troubleshooting Quick Reference
Common Issues and Solutions
Model Download Failures
# Set mirror for model downloads
export HF_ENDPOINT=https://hf-mirror.com
uv run embedding-atlas data.csv --text smiles
Port Already in Use
# Use auto-port selection
uv run embedding-atlas data.csv --text smiles --auto-port
# Or specify different port
uv run embedding-atlas data.csv --text smiles --port 8080
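When choosing a port by hand, a quick probe can tell whether it is already taken. The sketch below tries to bind candidate ports on the loopback interface; it is a generic illustration and makes no claim about how `--auto-port` is implemented. Both helper names are hypothetical.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(start, attempts=20, host="127.0.0.1"):
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if port_is_free(port, host):
            return port
    raise RuntimeError(f"no free port found starting at {start}")


print(first_free_port(5055))
```

A port reported as free can still be claimed by another process before the server starts, so treat the result as a hint rather than a guarantee.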
Memory Issues with Large Datasets
# Reduce batch size
uv run embedding-atlas large_dataset.csv --text smiles --batch-size 8
# Use sampling
uv run embedding-atlas large_dataset.csv --text smiles --sample 10000
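Sampling can also be done ahead of time so the same subset is reused across runs. This is a minimal sketch of seeded uniform sampling over rows already loaded in memory; `sample_rows` is a hypothetical helper and is not related to the internals of `--sample`.

```python
import random


def sample_rows(rows, n, seed=0):
    """Return up to n rows chosen uniformly at random, keeping original order."""
    if n >= len(rows):
        return list(rows)
    picked = sorted(random.Random(seed).sample(range(len(rows)), n))
    return [rows[i] for i in picked]


rows = [f"mol-{i}" for i in range(10)]
subset = sample_rows(rows, 3, seed=42)
print(subset)
```

Fixing the seed makes the subset reproducible, which keeps embeddings comparable between runs.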
Docker Connection Issues
# Check Docker daemon (Linux with systemd; on macOS, confirm Docker Desktop is running instead)
sudo systemctl status docker
# Test Docker connection
docker ps
Performance Optimization
For Large Datasets (>100k molecules)
# Use smaller batch size and sampling
uv run embedding-atlas large_dataset.csv \
--text smiles \
--batch-size 8 \
--sample 50000 \
--umap-n-neighbors 10
For High-Quality Embeddings
# Use better model with optimized parameters
uv run embedding-atlas dataset.csv \
--text smiles \
--model all-mpnet-base-v2 \
--umap-n-neighbors 30 \
--umap-min-dist 0.1
Validation Commands
Test Installation
# Verify Embedding Atlas version
uv run python -c "import embedding_atlas; print(f'Version: {embedding_atlas.__version__}')"
# Test basic functionality
uv run embedding-atlas --help
# Test backend
uv run embedding-backend-api --help
Test with Sample Data
# Quick visualization test
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
--text smiles \
--sample 100 \
--port 5055
Project Structure Reference
/Users/lingyuzeng/project/embedding_atlas/
├── data/ # Sample datasets
│ ├── drugbank_pre_filtered_mordred_qed_id_selfies.csv
│ └── splits_v2/ # Pre-split datasets
├── src/visualization/ # Comparison tools
│ └── comparison.py # Main comparison script
├── src/embedding_backend/ # Backend services
│ ├── main.py # FastAPI/MCP entry points
│ ├── config.py # Configuration
│ └── docker_manager.py # Container management
├── script/data_processing/ # Data utilities
│ ├── split_drugbank.py # Dataset splitting
│ ├── merge_splits.py # Dataset merging
│ └── add_ecfp4_tanimoto.py # Fingerprint generation
└── docker/ # Container configs
  ├── embedding-atlas.Dockerfile
  └── docker-compose.yml
Next Steps
- Choose your use case: Single dataset visualization, dataset comparison, or backend orchestration
- Prepare your data: Ensure CSV files have appropriate molecular structure columns
- Run the appropriate command from the examples above
- Access results: Web interface at specified ports or exported files
Support: All commands are tested with Embedding Atlas v0.13.0 on Python ≥3.12 using uv package manager.