AGENTS.md - Embedding Atlas Development Guide
This document provides essential information for AI agents working with the Embedding Atlas codebase.
Project Overview
Embedding Atlas is a cheminformatics visualization platform that creates interactive 2D embeddings of molecular data using sentence transformers and UMAP dimensionality reduction. The project supports both standalone CLI usage and containerized session orchestration via FastAPI/FastMCP backends.
Project Structure
/Users/lingyuzeng/project/embedding_atlas/
├── src/ # Main source code
│ ├── visualization/ # Visualization tools and comparison scripts
│ └── embedding_backend/ # FastAPI/FastMCP session orchestrator
├── script/ # Data processing and analysis scripts
│ ├── data_processing/ # Splitting, merging, ECFP4 processing
│ └── visualization/ # UMAP visualization scripts
├── data/ # Sample datasets and splits
├── docker/ # Containerization configs
└── runtime/sessions/ # Session data storage
Essential Commands
Environment Setup
# Use uv package manager (preferred)
uv lock # Update lock file
uv sync # Create/update virtual environment
source .venv/bin/activate # Activate environment
# Or use pixi for conda environment
pixi install
Running the Application
# Standalone embedding atlas
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles
# Start backend orchestrator
uv run embedding-backend-api # FastAPI mode
uv run embedding-backend-mcp # FastMCP stdio mode
# CSV comparison visualization
uv run python src/visualization/comparison.py file1.csv file2.csv --column1 smiles --column2 smiles --interactive --port 5055
Data Processing
# Split DrugBank dataset
uv run python script/split_drugbank.py --in-csv data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --out-dir splits_v2 --seed 20250922 --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 --n_qed_bins 5 --n_mw_bins 5 --largest-first
# Merge splits for visualization
uv run python script/merge_splits.py --input-dir splits_v2/ --output data/drugbank_split_merge.csv
# Add ECFP4 fingerprints
uv run python script/data_processing/add_ecfp4_tanimoto.py input.csv output.csv
Docker Deployment
# Build embedding atlas image
docker build -f docker/embedding-atlas.Dockerfile -t embedding-atlas:latest .
# Start orchestrator with DinD
docker compose -f docker/docker-compose.yml up --build
Code Organization and Patterns
Package Structure
- src/visualization/: Contains the main comparison tool (comparison.py), which handles CSV file comparison and interactive visualization
- src/embedding_backend/: FastAPI/FastMCP orchestrator for managing containerized embedding sessions
- script/: Standalone scripts for data processing and analysis
Key Dependencies
- embedding-atlas: Core visualization library
- rdkit: Cheminformatics toolkit
- sentence-transformers: For molecular embeddings
- umap-learn: Dimensionality reduction
- fastapi/uvicorn: Backend API server
- fastmcp: MCP protocol support
- docker: Container management
Configuration Management
- Uses pydantic-settings for environment-based configuration
- Settings are read from environment variables prefixed with EMBEDDING_
- Default configuration lives in src/embedding_backend/config.py
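The prefix convention above can be illustrated with a minimal standard-library sketch. It is a stand-in for the idea, not the project's actual config.py: the class name BackendSettings, the field names, and the Docker socket default are assumptions chosen to match the environment variables and the 10-hour session lifetime mentioned elsewhere in this guide. In pydantic-settings the same effect comes from setting env_prefix in the model configuration.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BackendSettings:
    """Hypothetical settings object; names and defaults are illustrative."""

    docker_url: str = "unix:///var/run/docker.sock"  # assumed default
    container_image: str = "embedding-atlas:latest"
    auto_remove_seconds: int = 36_000  # 10 hours


def load_settings(environ=None, prefix="EMBEDDING_"):
    """Build settings from variables such as EMBEDDING_DOCKER_URL."""
    environ = os.environ if environ is None else environ
    overrides = {}
    for field in fields(BackendSettings):
        raw = environ.get(prefix + field.name.upper())
        if raw is not None:
            # Environment values are strings; cast to the type of the default.
            overrides[field.name] = type(field.default)(raw)
    return BackendSettings(**overrides)
```

Calling load_settings() with no arguments reads the real process environment; passing a dict makes the lookup easy to exercise in tests.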
Naming Conventions and Style
File Naming
- Python modules use snake_case: split_drugbank.py, comparison.py
- Scripts in the script/ directory follow descriptive naming
- Docker files use descriptive suffixes: embedding-atlas.Dockerfile
Code Style
- Type hints used throughout (Python 3.12+)
- Docstrings follow Google style format
- Import organization: standard library, third-party, local modules
- Error handling with try/except blocks for external dependencies
Data Column Conventions
- Molecular structures: smiles, SMILES, selfies
- Properties: qed, molecular_weight, mw
- Identifiers: id, compound_id
- Projections: projection_x, projection_y
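Because several spellings are accepted for the same kind of column, code that consumes a CSV usually has to pick whichever one is present. The helper below is a hypothetical sketch of that lookup. The candidate tuples come from the list above, but the function name and the first-match-wins order are assumptions.

```python
STRUCTURE_COLUMNS = ("smiles", "SMILES", "selfies")
PROPERTY_MW_COLUMNS = ("molecular_weight", "mw")
ID_COLUMNS = ("id", "compound_id")


def resolve_column(available, candidates):
    """Return the first candidate name found in `available`, or None."""
    present = set(available)
    for name in candidates:
        if name in present:
            return name
    return None
```

With a pandas DataFrame, `resolve_column(df.columns, STRUCTURE_COLUMNS)` would give the column name to pass on to the rest of the pipeline.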
Testing and Validation
Data Processing Validation
- Scripts include data validation checks (e.g., RDKit molecule validity)
- Scaffold-based splitting ensures structural diversity
- QED/MW distribution alignment for train/val/test splits
Error Handling Patterns
- Graceful handling of missing RDKit dependencies
- Fallback mechanisms for SELFIES decoding
- Timeout handling for downloads and API calls
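Graceful handling of a missing RDKit install typically looks like the optional-import pattern sketched below. The function name is_valid_smiles and the choice to return None when RDKit is unavailable are assumptions for illustration. Chem.MolFromSmiles is RDKit's parser and returns None for a string it cannot parse.

```python
try:
    from rdkit import Chem  # optional dependency
except ImportError:
    Chem = None  # validation is skipped rather than crashing at import time


def is_valid_smiles(smiles):
    """Return True or False when RDKit is installed, or None when it is not."""
    if Chem is None:
        return None
    return Chem.MolFromSmiles(smiles) is not None
```

Returning None instead of False keeps "could not check" distinct from "checked and invalid", so callers can decide whether to keep unchecked rows.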
Important Gotchas and Non-Obvious Patterns
Embedding Atlas Integration
- Props Format: Always use the props format for frontend metadata; never add the database field manually
- Server Creation: Let make_server() handle database configuration automatically
- DataFrame Handling: Pass raw pandas DataFrames to DataSource, not JSON strings
Container Orchestration
- Session Management: Sessions auto-expire after 10 hours (configurable via auto_remove_seconds)
- Port Allocation: Dynamic port allocation in range 6000-6999 for embedding containers
- Volume Sharing: Sessions use shared volumes between orchestrator and embedding containers
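The port range and expiry rule above can be sketched with the standard library alone. This is not the orchestrator's code: the function names, the bind-based probe, and the loopback host are assumptions. Only the 6000-6999 range and the 10-hour lifetime come from this guide.

```python
import socket
import time

PORT_RANGE = range(6000, 7000)  # 6000-6999 inclusive
AUTO_REMOVE_SECONDS = 10 * 60 * 60  # 10 hours


def find_free_port(ports=PORT_RANGE, host="127.0.0.1"):
    """Return the first port in the range that can be bound, or None."""
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken; try the next one
            return port
    return None


def is_expired(created_at, now=None, ttl=AUTO_REMOVE_SECONDS):
    """Return True once `ttl` seconds have passed since `created_at`."""
    now = time.time() if now is None else now
    return now - created_at >= ttl
```

A probe like this is inherently racy, since another process can take the port between the check and the container start, so a real allocator would also need to handle a failed start by retrying with the next port.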
Data Processing
- SELFIES/SMILES Conversion: Scripts handle both formats with automatic conversion
- Scaffold Detection: Uses RDKit's MurckoScaffold for structure-aware splitting
- Distribution Alignment: QED/MW binning ensures representative splits
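The structure-aware part of the split can be illustrated with a simplified sketch. It assumes the scaffold strings have already been computed, for example with RDKit's MurckoScaffold, and it shows only the grouping step: whole scaffold groups are assigned largest first, each to the split furthest below its target size. The QED/MW binning is omitted, and the function name and greedy rule are assumptions rather than the logic of the actual split script.

```python
from collections import defaultdict


def scaffold_split(scaffolds, ratios=(0.8, 0.1, 0.1)):
    """Assign row indices to train/val/test without splitting a scaffold.

    `scaffolds` holds one precomputed scaffold string per row.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    names = ("train", "val", "test")
    targets = [ratio * len(scaffolds) for ratio in ratios]
    splits = {name: [] for name in names}

    # Largest groups first, each to the split with the biggest remaining deficit.
    for members in sorted(groups.values(), key=len, reverse=True):
        deficits = [targets[i] - len(splits[names[i]]) for i in range(len(names))]
        chosen = names[deficits.index(max(deficits))]
        splits[chosen].extend(members)
    return splits
```

Keeping each scaffold in a single split means the validation and test sets contain core structures the model has not seen in training, which is the point of scaffold-based evaluation.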
Environment Variables
# Hugging Face mirror (for China/regional access)
export HF_ENDPOINT=https://hf-mirror.com
# Offline mode: use only locally cached models (no downloads, even from the mirror)
export HF_HUB_OFFLINE=1
# Docker configuration
export EMBEDDING_DOCKER_URL=tcp://engine:2375
export EMBEDDING_CONTAINER_IMAGE=embedding-atlas:latest
API Usage Patterns
- Session Creation: POST to /sessions with data_url and optional parameters
- Session Cleanup: DELETE to /sessions/{session_id} for immediate cleanup
- Status Checking: GET /sessions returns active sessions
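The endpoints above can be called with nothing more than urllib. The sketch below only builds the requests, so it makes no assumption about the response shape. The JSON request body, the helper names, and the base URL used in the usage note are assumptions. Only the paths, the HTTP methods, and the data_url field come from this guide.

```python
import json
import urllib.request


def build_create_session_request(base_url, data_url, **optional):
    """Build a POST /sessions request carrying data_url plus optional fields."""
    payload = {"data_url": data_url, **optional}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/sessions",
        data=json.dumps(payload).encode("utf-8"),  # assumes a JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_delete_session_request(base_url, session_id):
    """Build a DELETE /sessions/{session_id} request for immediate cleanup."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/sessions/{session_id}",
        method="DELETE",
    )
```

To send a request, pass it to `urllib.request.urlopen(request, timeout=30)`, for example with a hypothetical base URL such as `http://localhost:8000`.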
Common Development Tasks
Adding New Visualization Features
- Extend src/visualization/comparison.py for new comparison modes
- Follow existing parameter patterns for CLI and API interfaces
- Ensure compatibility with both static and interactive modes
Modifying Data Processing
- Scripts in script/data_processing/ are standalone and self-contained
- Maintain backward compatibility with existing data formats
- Include validation and error handling for chemical data
Backend Modifications
- Configuration changes go in src/embedding_backend/config.py
- Route handlers live in src/embedding_backend/routes.py
- Docker management logic lives in src/embedding_backend/docker_manager.py
Debugging Tips
Common Issues
- Model Download Failures: Set HF_ENDPOINT to a mirror (e.g. https://hf-mirror.com) for regional access
- Docker Connection: Ensure Docker daemon is accessible at configured URL
- Port Conflicts: Check dynamic port range availability (6000-6999)
Log Locations
- Backend logs: Standard output from uvicorn/fastapi
- Container logs: Accessible via Docker API
- Session data: Stored in runtime/sessions/{session_id}/
Performance Considerations
- Batch size parameter affects memory usage and processing speed
- UMAP parameters can be tuned for different dataset sizes
- Embedding model choice impacts both quality and performance
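The effect of batch size on memory can be seen in a generic batching helper: only one batch of inputs is materialized at a time, so a smaller batch_size lowers peak memory at the cost of more iterations. This is an illustrative sketch, not code from the project, and the helper name is an assumption.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield lists of at most `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next chunk lazily; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Each yielded batch could then be passed to the embedding model in turn, so the full dataset never has to be encoded in a single call.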