AGENTS.md - Embedding Atlas Development Guide

This document provides essential information for AI agents working with the Embedding Atlas codebase.

Project Overview

Embedding Atlas is a cheminformatics visualization platform that creates interactive 2D embeddings of molecular data using sentence transformers and UMAP dimensionality reduction. The project supports both standalone CLI usage and containerized session orchestration via FastAPI/FastMCP backends.

Project Structure

/Users/lingyuzeng/project/embedding_atlas/
├── src/                          # Main source code
│   ├── visualization/            # Visualization tools and comparison scripts
│   └── embedding_backend/        # FastAPI/FastMCP session orchestrator
├── script/                       # Data processing and analysis scripts
│   ├── data_processing/          # Splitting, merging, ECFP4 processing
│   └── visualization/            # UMAP visualization scripts
├── data/                         # Sample datasets and splits
├── docker/                       # Containerization configs
└── runtime/sessions/             # Session data storage

Essential Commands

Environment Setup

# Use uv package manager (preferred)
uv lock            # Update lock file
uv sync            # Create/update virtual environment
source .venv/bin/activate  # Activate environment

# Or use pixi for conda environment
pixi install

Running the Application

# Standalone embedding atlas
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles

# Start backend orchestrator
uv run embedding-backend-api     # FastAPI mode
uv run embedding-backend-mcp     # FastMCP stdio mode

# CSV comparison visualization
uv run python src/visualization/comparison.py file1.csv file2.csv --column1 smiles --column2 smiles --interactive --port 5055

Data Processing

# Split DrugBank dataset
uv run python script/split_drugbank.py --in-csv data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --out-dir splits_v2 --seed 20250922 --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 --n_qed_bins 5 --n_mw_bins 5 --largest-first

# Merge splits for visualization
uv run python script/merge_splits.py --input-dir splits_v2/ --output data/drugbank_split_merge.csv

# Add ECFP4 fingerprints
uv run python script/data_processing/add_ecfp4_tanimoto.py input.csv output.csv

Docker Deployment

# Build embedding atlas image
docker build -f docker/embedding-atlas.Dockerfile -t embedding-atlas:latest .

# Start orchestrator with DinD
docker compose -f docker/docker-compose.yml up --build

Code Organization and Patterns

Package Structure

  • src/visualization/: Contains the main comparison tool (comparison.py) that handles CSV file comparison and interactive visualization
  • src/embedding_backend/: FastAPI/FastMCP orchestrator for managing containerized embedding sessions
  • script/: Standalone scripts for data processing and analysis

Key Dependencies

  • embedding-atlas: Core visualization library
  • rdkit: Cheminformatics toolkit
  • sentence-transformers: For molecular embeddings
  • umap-learn: Dimensionality reduction
  • fastapi/uvicorn: Backend API server
  • fastmcp: MCP protocol support
  • docker: Container management

Configuration Management

  • Uses pydantic-settings for environment-based configuration
  • Settings are read from environment variables prefixed with EMBEDDING_
  • Default configuration in src/embedding_backend/config.py
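
The prefix convention can be illustrated with a minimal standard-library sketch. This is not the project's config.py, which relies on pydantic-settings; the field names docker_url, container_image, and auto_remove_seconds are inferred from the variables and settings mentioned in this guide, and the default values are assumptions.

```python
import os
from dataclasses import dataclass, fields

ENV_PREFIX = "EMBEDDING_"


@dataclass(frozen=True)
class BackendSettings:
    # Defaults are illustrative assumptions, not the project's real values.
    docker_url: str = "unix:///var/run/docker.sock"
    container_image: str = "embedding-atlas:latest"
    auto_remove_seconds: int = 10 * 60 * 60  # 10 hours


def load_settings(environ=None) -> BackendSettings:
    """Build settings from EMBEDDING_-prefixed variables, else use defaults."""
    environ = os.environ if environ is None else environ
    overrides = {}
    for field in fields(BackendSettings):
        raw = environ.get(ENV_PREFIX + field.name.upper())
        if raw is not None:
            # Coerce the environment string to the type of the field's default.
            overrides[field.name] = type(field.default)(raw)
    return BackendSettings(**overrides)


settings = load_settings({"EMBEDDING_DOCKER_URL": "tcp://engine:2375"})
print(settings.docker_url)
```

With pydantic-settings the same effect is normally achieved by declaring an env_prefix in the settings class, so the hand-written loop above is only a stand-in for that behavior.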

Naming Conventions and Style

File Naming

  • Python modules use snake_case: split_drugbank.py, comparison.py
  • Scripts in script/ directory follow descriptive naming
  • Docker files use descriptive suffixes: embedding-atlas.Dockerfile

Code Style

  • Type hints used throughout (Python 3.12+)
  • Docstrings follow Google style format
  • Import organization: standard library, third-party, local modules
  • Error handling with try/except blocks for external dependencies

Data Column Conventions

  • Molecular structures: smiles, SMILES, selfies
  • Properties: qed, molecular_weight, mw
  • Identifiers: id, compound_id
  • Projections: projection_x, projection_y
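
Because several spellings are accepted for the same role, a script usually has to resolve which column is actually present. The helper below is a hypothetical sketch, not a function from this repository; the role names and the "first match wins" priority order are assumptions, and the sample row is made up.

```python
import csv
import io

# Candidate names per role, taken from the conventions listed above.
COLUMN_CANDIDATES = {
    "structure": ("smiles", "SMILES", "selfies"),
    "qed": ("qed",),
    "molecular_weight": ("molecular_weight", "mw"),
    "identifier": ("id", "compound_id"),
}


def resolve_columns(header):
    """Map each role to the first candidate found in the header, or None."""
    present = set(header)
    return {
        role: next((name for name in names if name in present), None)
        for role, names in COLUMN_CANDIDATES.items()
    }


# Made-up one-row CSV used only to demonstrate the lookup.
sample = io.StringIO("compound_id,SMILES,mw\nX1,CCO,46.07\n")
header = next(csv.reader(sample))
print(resolve_columns(header))
```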

Testing and Validation

Data Processing Validation

  • Scripts include data validation checks (e.g., RDKit molecule validity)
  • Scaffold-based splitting ensures structural diversity
  • QED/MW distribution alignment for train/val/test splits

Error Handling Patterns

  • Graceful handling of missing RDKit dependencies
  • Fallback mechanisms for SELFIES decoding
  • Timeout handling for downloads and API calls
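
One common way to handle a missing RDKit installation gracefully is an optional import with a sentinel. This is a minimal sketch of that pattern, not code copied from the scripts; the function name is_valid_smiles and the choice to return None when RDKit is absent are assumptions.

```python
try:
    from rdkit import Chem  # optional dependency
except ImportError:
    Chem = None


def is_valid_smiles(smiles: str):
    """Return True/False when RDKit is available, or None when it is not."""
    if Chem is None:
        return None
    # MolFromSmiles returns None for strings it cannot parse.
    return Chem.MolFromSmiles(smiles) is not None


print(is_valid_smiles("CCO"))
```

Returning None rather than raising lets a caller skip validation and continue, which suits scripts where RDKit checks are a safeguard rather than a requirement.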

Important Gotchas and Non-Obvious Patterns

Embedding Atlas Integration

  1. Props Format: Always use the props format for frontend metadata; never add a database field manually
  2. Server Creation: Let make_server() handle database configuration automatically
  3. DataFrame Handling: Pass raw pandas DataFrames to DataSource, not JSON strings

Container Orchestration

  1. Session Management: Sessions auto-expire after 10 hours (configurable via auto_remove_seconds)
  2. Port Allocation: Dynamic port allocation in range 6000-6999 for embedding containers
  3. Volume Sharing: Sessions use shared volumes between orchestrator and embedding containers
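
The expiry and port rules above can be sketched as two small pure functions. This is an illustration under stated assumptions rather than the orchestrator's actual logic: the function names are hypothetical, and picking the lowest free port is only one possible strategy (the real backend may probe sockets or choose differently).

```python
import time

PORT_RANGE = range(6000, 7000)      # 6000-6999 inclusive, as documented
AUTO_REMOVE_SECONDS = 10 * 60 * 60  # 10 hours, the documented default


def allocate_port(ports_in_use):
    """Return the lowest port in the range that is not already in use."""
    for port in PORT_RANGE:
        if port not in ports_in_use:
            return port
    raise RuntimeError("no free port in 6000-6999")


def is_expired(created_at, now=None, ttl=AUTO_REMOVE_SECONDS):
    """True once a session has outlived its time-to-live (in seconds)."""
    now = time.time() if now is None else now
    return now - created_at >= ttl


print(allocate_port({6000, 6001}))
print(is_expired(created_at=0, now=AUTO_REMOVE_SECONDS))
```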

Data Processing

  1. SELFIES/SMILES Conversion: Scripts handle both formats with automatic conversion
  2. Scaffold Detection: Uses RDKit's MurckoScaffold for structure-aware splitting
  3. Distribution Alignment: QED/MW binning ensures representative splits
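
Structure-aware splitting keeps every molecule that shares a scaffold in the same subset. The sketch below shows the widely used greedy, largest-group-first scheme; it is not necessarily what split_drugbank.py does, it omits the QED/MW binning step, and it takes precomputed scaffold keys (plain strings here) instead of calling RDKit's MurckoScaffold.

```python
from collections import defaultdict


def scaffold_split(scaffolds, ratios=(0.8, 0.1, 0.1)):
    """Assign whole scaffold groups to train/val/test, largest groups first.

    `scaffolds` holds one scaffold key per molecule. Returns three lists of
    row indices. The third ratio is implicit: leftovers go to the test set.
    """
    groups = defaultdict(list)
    for index, key in enumerate(scaffolds):
        groups[key].append(index)

    total = len(scaffolds)
    train_cap = ratios[0] * total
    val_cap = ratios[1] * total
    train, val, test = [], [], []
    # Largest groups first; ties are broken by key so the result is stable.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        members = groups[key]
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        elif len(val) + len(members) <= val_cap:
            val.extend(members)
        else:
            test.extend(members)
    return train, val, test


keys = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
train, val, test = scaffold_split(keys)
print(len(train), len(val), len(test))
```

Because groups are never divided, the achieved ratios only approximate the requested ones when a few scaffolds dominate the dataset.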

Environment Variables

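# Hugging Face mirror (for China/regional access)
export HF_ENDPOINT=https://hf-mirror.com
# Offline mode disables all Hub downloads, including via the mirror;
# set it only after the model is already in the local cache
export HF_HUB_OFFLINE=1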

# Docker configuration
export EMBEDDING_DOCKER_URL=tcp://engine:2375
export EMBEDDING_CONTAINER_IMAGE=embedding-atlas:latest

API Usage Patterns

  1. Session Creation: POST /sessions with data_url and optional parameters
  2. Session Cleanup: DELETE /sessions/{session_id} for immediate cleanup
  3. Status Checking: GET /sessions returns active sessions
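
The three calls above can be sketched with the standard library. Several details here are assumptions because this guide does not specify them: the base URL, sending data_url as a JSON body, and the helper names. The functions only build the requests, so nothing is sent until one is passed to urllib.request.urlopen.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed address of the orchestrator


def create_session_request(data_url, **optional):
    """Build (but do not send) a POST /sessions request with a JSON body."""
    payload = {"data_url": data_url, **optional}
    return urllib.request.Request(
        f"{BASE_URL}/sessions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def delete_session_request(session_id):
    """Build a DELETE /sessions/{session_id} request."""
    return urllib.request.Request(
        f"{BASE_URL}/sessions/{session_id}", method="DELETE"
    )


def list_sessions_request():
    """Build a GET /sessions request."""
    return urllib.request.Request(f"{BASE_URL}/sessions", method="GET")


request = create_session_request("https://example.com/data.csv")
print(request.get_method(), request.full_url)
```

To execute a request, wrap it as `with urllib.request.urlopen(request) as response:` and decode the body; the response schema is not documented here, so inspect it before relying on specific fields.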

Common Development Tasks

Adding New Visualization Features

  1. Extend src/visualization/comparison.py for new comparison modes
  2. Follow existing parameter patterns for CLI and API interfaces
  3. Ensure compatibility with both static and interactive modes

Modifying Data Processing

  1. Scripts in script/data_processing/ are standalone and self-contained
  2. Maintain backward compatibility with existing data formats
  3. Include validation and error handling for chemical data

Backend Modifications

  1. Configuration changes go in src/embedding_backend/config.py
  2. Route handlers in src/embedding_backend/routes.py
  3. Docker management logic in src/embedding_backend/docker_manager.py

Debugging Tips

Common Issues

  1. Model Download Failures: Set HF_ENDPOINT to a mirror for regional access, and make sure HF_HUB_OFFLINE is not set to 1 while the model is still uncached
  2. Docker Connection: Ensure Docker daemon is accessible at configured URL
  3. Port Conflicts: Check dynamic port range availability (6000-6999)

Log Locations

  • Backend logs: Standard output from uvicorn/fastapi
  • Container logs: Accessible via Docker API
  • Session data: Stored in runtime/sessions/{session_id}/

Performance Considerations

  • Batch size parameter affects memory usage and processing speed
  • UMAP parameters can be tuned for different dataset sizes
  • Embedding model choice impacts both quality and performance