embedding_atlas/AGENTS.md
2025-12-01 10:54:04 +08:00

AGENTS.md - Embedding Atlas Complete Usage Guide

This document gives AI agents the instructions needed to use the Embedding Atlas cheminformatics visualization platform directly. The only prerequisite is the one-time environment setup below.

Quick Start - For Immediate Use

1. Environment Setup (First Time Only)

# Install uv if not available
curl -LsSf https://astral.sh/uv/install.sh | sh

# Navigate to project directory
cd /Users/lingyuzeng/project/embedding_atlas

# Create virtual environment and install dependencies
uv sync

2. Ready-to-Use Commands

Visualize Molecular Data (SMILES/SELFIES)

# Basic visualization with SMILES column
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles

# Interactive mode on custom port
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --port 8080 --interactive

# Export as standalone web application
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --export-application visualization.zip

Compare Two Datasets

# Compare two CSV files with molecular data
uv run python src/visualization/comparison.py file1.csv file2.csv \
  --column1 smiles --column2 smiles \
  --label1 "Dataset A" --label2 "Dataset B" \
  --interactive --port 5055

# Generate static comparison image
uv run python src/visualization/comparison.py file1.csv file2.csv \
  --column1 smiles --column2 smiles \
  --output comparison.png

Start Backend Services

# Start FastAPI orchestrator (background)
uv run embedding-backend-api &

# Start MCP server (stdio mode)
uv run embedding-backend-mcp

# Test API endpoint
curl http://localhost:9000/sessions
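
Because the orchestrator is started in the background, the first `curl` can run before the server is listening. A minimal sketch of a readiness check, assuming the API answers on `http://localhost:9000/sessions` as shown above; the helper name `wait_for_http` is hypothetical and not part of the project:

```python
import time
import urllib.error
import urllib.request

# Hypothetical helper: poll an HTTP endpoint until the server responds.
# Proxies are bypassed because the target is expected to be local.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_http(url: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once `url` yields any HTTP response, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with _OPENER.open(url, timeout=2.0):
                return True
        except urllib.error.HTTPError:
            # The server answered with an error status, so it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: not ready yet.
            time.sleep(interval)
    return False
```

Calling `wait_for_http("http://localhost:9000/sessions")` before issuing further requests avoids a spurious "connection refused" on a slow start.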

Project Overview

Embedding Atlas is a complete cheminformatics visualization platform that:

  • Creates interactive 2D embeddings of molecular data using sentence transformers
  • Supports UMAP dimensionality reduction for chemical structures
  • Provides web-based interactive visualization
  • Includes dataset comparison tools
  • Offers containerized session management

Current Version: Embedding Atlas v0.13.0
Python Version: ≥3.12
Package Manager: uv (cross-platform compatible)

Complete Usage Examples

Example 1: Single Dataset Visualization

# Visualize DrugBank dataset with SMILES
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
  --text smiles \
  --model all-MiniLM-L6-v2 \
  --port 5055 \
  --interactive

# Access visualization at http://localhost:5055
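
For agents that assemble the command programmatically, a small sketch of an argument builder may help. It uses only flags that appear in this guide (`--text`, `--model`, `--batch-size`, `--sample`, `--port`); the function name `atlas_command` is hypothetical:

```python
import shlex


def atlas_command(data_path, text_column="smiles", *, model=None,
                  batch_size=None, sample=None, port=None):
    """Build an argv list for `uv run embedding-atlas` from optional settings."""
    cmd = ["uv", "run", "embedding-atlas", data_path, "--text", text_column]
    if model is not None:
        cmd += ["--model", model]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    if sample is not None:
        cmd += ["--sample", str(sample)]
    if port is not None:
        cmd += ["--port", str(port)]
    return cmd


def as_shell(cmd):
    """Render the argv list as a single, safely quoted shell line."""
    return shlex.join(cmd)
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.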

Example 2: Dataset Comparison Workflow

# Step 1: Split dataset for comparison
uv run python script/split_drugbank.py \
  --in-csv data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
  --out-dir splits_v2 \
  --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1

# Step 2: Merge splits with labels
uv run python script/merge_splits.py \
  --input-dir splits_v2/ \
  --output data/drugbank_split_merge.csv

# Step 3: Visualize merged dataset
uv run embedding-atlas data/drugbank_split_merge.csv --text smiles
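
The idea behind Step 2, concatenating split files while recording which split each row came from, can be sketched as follows. This is an illustration only, not the code of `merge_splits.py`; the label column name `split` and the assumption that every file shares the same header are hypothetical:

```python
import csv


def merge_splits(split_files, out_path, label_column="split"):
    """Concatenate CSV files and add a column naming the split of each row.

    split_files maps a label (e.g. "train") to a CSV path.
    Returns the number of data rows written.
    """
    rows_written = 0
    header = None
    writer = None
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        for label, path in split_files.items():
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.DictReader(f)
                fields = list(reader.fieldnames or [])
                if writer is None:
                    # First file defines the output header.
                    header = fields
                    writer = csv.DictWriter(out, fieldnames=header + [label_column])
                    writer.writeheader()
                elif fields != header:
                    raise ValueError(f"{path} has a different header than the first file")
                for row in reader:
                    row[label_column] = label
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Coloring the visualization by the added label column then shows how the splits are distributed in the embedding space.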

Example 3: Advanced Comparison with Custom Parameters

# Compare datasets with custom UMAP parameters
uv run python src/visualization/comparison.py \
  data/small_molecules.csv data/large_molecules.csv \
  --column1 smiles --column2 smiles \
  --label1 "Small Molecules" --label2 "Large Molecules" \
  --model all-mpnet-base-v2 \
  --batch-size 16 \
  --umap-args '{"n_neighbors": 20, "min_dist": 0.2, "metric": "cosine"}' \
  --interactive --port 8080
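
The `--umap-args` value is a JSON object, which is easy to break with shell quoting. A minimal sketch that builds the flag from Python values; the helper name `umap_args_flag` is hypothetical, and the range checks reflect UMAP's own constraints (`n_neighbors` of at least 2, non-negative `min_dist`):

```python
import json
import shlex


def umap_args_flag(n_neighbors=15, min_dist=0.1, metric="cosine"):
    """Return ["--umap-args", "<json>"] for use in an argv list."""
    if n_neighbors < 2:
        raise ValueError("n_neighbors must be at least 2")
    if min_dist < 0:
        raise ValueError("min_dist must be non-negative")
    payload = {"n_neighbors": n_neighbors, "min_dist": min_dist, "metric": metric}
    return ["--umap-args", json.dumps(payload)]


# Example: render the flag as it would be typed in a shell.
example = shlex.join(umap_args_flag(n_neighbors=20, min_dist=0.2))
```

Passing the list form to `subprocess.run` sidesteps quoting entirely; `shlex.join` is only needed when the command is printed or pasted into a shell.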

Example 4: Backend Orchestration

# Create a session (the orchestrator from "Start Backend Services" must already be running)
curl -X POST http://localhost:9000/sessions \
  -H 'Content-Type: application/json' \
  -d '{
    "session_id": "demo-session",
    "data_url": "https://example.com/molecules.csv",
    "extra_args": ["--text", "smiles"],
    "environment": {"HF_ENDPOINT": "https://hf-mirror.com"}
  }'

# Check session status
curl http://localhost:9000/sessions

# Delete session when done
curl -X DELETE http://localhost:9000/sessions/demo-session
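
The same session request can be prepared from Python with the standard library. This sketch mirrors the JSON body of the `curl` call above; the function name is hypothetical, and the response format is not described in this guide, so the example stops at building the request:

```python
import json
import urllib.request

API_BASE = "http://localhost:9000"  # orchestrator address used in this guide


def build_session_request(session_id, data_url, text_column="smiles", hf_endpoint=None):
    """Build (but do not send) the POST /sessions request shown above."""
    payload = {
        "session_id": session_id,
        "data_url": data_url,
        "extra_args": ["--text", text_column],
    }
    if hf_endpoint:
        # Optional mirror setting forwarded to the session environment.
        payload["environment"] = {"HF_ENDPOINT": hf_endpoint}
    return urllib.request.Request(
        f"{API_BASE}/sessions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is a matter of `urllib.request.urlopen(build_session_request(...))`; inspect the returned body before relying on any particular field.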

Data Processing Pipeline

1. Prepare Molecular Data

# Add ECFP4 fingerprints for similarity analysis
uv run python script/data_processing/add_ecfp4_tanimoto.py input.csv output_with_fingerprints.csv

# Add macrocycle detection
uv run python script/data_processing/add_macrocycle_columns.py input.csv output_with_macrocycles.csv
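
For reference, the Tanimoto similarity used with ECFP4 fingerprints is the size of the intersection of the on-bits divided by the size of their union. The sketch below works on plain sets of on-bit indices and does not reproduce the project script; generating the fingerprints themselves (for example with RDKit) is outside its scope, and returning 0.0 for two empty fingerprints is a convention chosen here:

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Both fingerprints empty: similarity is undefined, report 0.0 by convention.
        return 0.0
    return len(a & b) / union


# Two toy fingerprints sharing bits 5 and 9 out of five distinct bits.
print(tanimoto({1, 5, 9, 12}, {5, 9, 20}))  # 0.4
```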

2. Structure-Aware Dataset Splitting

# Create train/val/test splits with scaffold diversity
uv run python script/split_drugbank.py \
  --in-csv molecules.csv \
  --out-dir splits/ \
  --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 \
  --n_qed_bins 5 --n_mw_bins 5 \
  --largest-first
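
One common way to implement a largest-first, group-aware split is to sort groups (for example scaffolds) by size and fill train, then validation, then test, so that no group straddles two splits. The sketch below illustrates that general technique only. It is not the algorithm of `split_drugbank.py`, it does not model the QED or molecular-weight binning options, and it assumes group keys have already been computed:

```python
from collections import defaultdict


def largest_first_split(groups, ratios=(0.8, 0.1, 0.1)):
    """Assign whole groups to train/val/test, largest groups first.

    groups[i] is the group key (e.g. a scaffold string) of molecule i.
    Returns a dict mapping split name to a list of molecule indices.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    members = defaultdict(list)
    for idx, key in enumerate(groups):
        members[key].append(idx)
    # Largest groups first; ties broken by key so the result is deterministic.
    ordered = sorted(members.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    n = len(groups)
    train_cut = ratios[0] * n
    val_cut = (ratios[0] + ratios[1]) * n
    splits = {"train": [], "val": [], "test": []}
    for _, idxs in ordered:
        if len(splits["train"]) + len(idxs) <= train_cut:
            splits["train"].extend(idxs)
        elif len(splits["train"]) + len(splits["val"]) + len(idxs) <= val_cut:
            splits["val"].extend(idxs)
        else:
            splits["test"].extend(idxs)
    return splits
```

Because groups are kept whole, the realized split sizes only approximate the requested ratios, especially when a few groups are very large.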

Docker Deployment (Optional)

Quick Docker Setup

# Build embedding atlas image
docker build -f docker/embedding-atlas.Dockerfile -t embedding-atlas:latest .

# Start full orchestration stack
docker compose -f docker/docker-compose.yml up --build

# Access services:
# - Orchestrator API: http://localhost:9000
# - Individual sessions: http://localhost:<port>, one port per session in the 6000-6999 range

Common Data Formats

Supported Input Formats

  • CSV files with molecular data
  • Parquet files for large datasets
  • Hugging Face datasets (via dataset name)

Required Columns

  • smiles or SMILES: Molecular structures (pass the column name to --text)
  • selfies: Alternative molecular representation
  • id or compound_id: Unique identifiers (optional)
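
Before launching a visualization, it can be worth checking which structure column a CSV actually contains. A minimal sketch, assuming the column names listed above; the helper names are hypothetical:

```python
import csv

# Candidate column names, in order of preference, taken from the list above.
STRUCTURE_COLUMNS = ("smiles", "SMILES", "selfies")


def pick_structure_column(header, candidates=STRUCTURE_COLUMNS):
    """Return the first candidate present in the header, or None."""
    for name in candidates:
        if name in header:
            return name
    return None


def structure_column_in_csv(path):
    """Read only the header row of a CSV file and pick a structure column."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return pick_structure_column(header)
```

The returned name is the value to pass to `--text`; a result of `None` means the file needs a structure column before it can be visualized.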

Generated Columns

  • projection_x, projection_y: 2D coordinates
  • __neighbors: Nearest neighbor information

Environment Configuration

Regional Access (China/Restricted Networks)

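# Use Hugging Face mirror for model downloads
export HF_ENDPOINT=https://hf-mirror.com

# Set offline mode only after the models are cached locally:
# HF_HUB_OFFLINE=1 disables all Hub network requests, including those to the mirror
export HF_HUB_OFFLINE=1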

# Use domestic PyPI mirror (uv reads UV_DEFAULT_INDEX; older uv releases use UV_INDEX_URL)
export UV_DEFAULT_INDEX=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

Docker Configuration

# Custom Docker settings
export EMBEDDING_DOCKER_URL=tcp://engine:2375
export EMBEDDING_CONTAINER_IMAGE=embedding-atlas:latest
export EMBEDDING_API_PORT=9000

Troubleshooting Quick Reference

Common Issues and Solutions

Model Download Failures

# Set mirror for model downloads
export HF_ENDPOINT=https://hf-mirror.com
uv run embedding-atlas data.csv --text smiles

Port Already in Use

# Use auto-port selection
uv run embedding-atlas data.csv --text smiles --auto-port

# Or specify different port
uv run embedding-atlas data.csv --text smiles --port 8080
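
To pick a port up front instead of retrying by hand, a script can test whether a port is bindable. This is a sketch with hypothetical helper names; a port reported as free can still be taken by another process before the server starts:

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(start=5055, attempts=100, host="127.0.0.1"):
    """Return the first bindable port in [start, start + attempts)."""
    for candidate in range(start, min(start + attempts, 65536)):
        if port_is_free(candidate, host):
            return candidate
    raise RuntimeError(f"no free port found starting at {start}")
```

The result can be passed to `--port`, for example `--port 5056` when 5055 is occupied.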

Memory Issues with Large Datasets

# Reduce batch size
uv run embedding-atlas large_dataset.csv --text smiles --batch-size 8

# Use sampling
uv run embedding-atlas large_dataset.csv --text smiles --sample 10000
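
When a file is too large to load comfortably at all, it can be down-sampled beforehand in a single streaming pass. The sketch below uses reservoir sampling (Algorithm R) and is independent of how `--sample` works internally; the function names are hypothetical:

```python
import csv
import random


def reservoir_sample(items, k, seed=0):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def sample_csv(src, dst, k, seed=0):
    """Write the header of src plus a uniform sample of k rows to dst."""
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = reservoir_sample(reader, k, seed)
    with open(dst, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Only `k` rows are held in memory at any time, so the approach scales to files far larger than RAM.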

Docker Connection Issues

# Check Docker daemon (Linux with systemd; on macOS, confirm Docker Desktop is running)
sudo systemctl status docker

# Test Docker connection
docker ps

Performance Optimization

For Large Datasets (>100k molecules)

# Use smaller batch size and sampling
uv run embedding-atlas large_dataset.csv \
  --text smiles \
  --batch-size 8 \
  --sample 50000 \
  --umap-n-neighbors 10

For High-Quality Embeddings

# Use better model with optimized parameters
uv run embedding-atlas dataset.csv \
  --text smiles \
  --model all-mpnet-base-v2 \
  --umap-n-neighbors 30 \
  --umap-min-dist 0.1

Validation Commands

Test Installation

# Verify Embedding Atlas version
uv run python -c "import embedding_atlas; print(f'Version: {embedding_atlas.__version__}')"

# Test basic functionality
uv run embedding-atlas --help

# Test backend
uv run embedding-backend-api --help

Test with Sample Data

# Quick visualization test
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
  --text smiles \
  --sample 100 \
  --port 5055

Project Structure Reference

/Users/lingyuzeng/project/embedding_atlas/
├── data/                          # Sample datasets
│   ├── drugbank_pre_filtered_mordred_qed_id_selfies.csv
│   └── splits_v2/                 # Pre-split datasets
├── src/visualization/             # Comparison tools
│   └── comparison.py              # Main comparison script
├── src/embedding_backend/         # Backend services
│   ├── main.py                    # FastAPI/MCP entry points
│   ├── config.py                  # Configuration
│   └── docker_manager.py          # Container management
├── script/data_processing/        # Data utilities
│   ├── split_drugbank.py          # Dataset splitting
│   ├── merge_splits.py            # Dataset merging
│   └── add_ecfp4_tanimoto.py      # Fingerprint generation
└── docker/                        # Container configs
    ├── embedding-atlas.Dockerfile
    └── docker-compose.yml

Next Steps

  1. Choose your use case: Single dataset visualization, dataset comparison, or backend orchestration
  2. Prepare your data: Ensure CSV files have appropriate molecular structure columns
  3. Run the appropriate command from the examples above
  4. Access results: Web interface at specified ports or exported files

Support: All commands are tested with Embedding Atlas v0.13.0 on Python ≥3.12 using uv package manager.