# AGENTS.md - Embedding Atlas Complete Usage Guide

This document provides complete instructions for AI agents to directly use the Embedding Atlas cheminformatics visualization platform without additional setup.

## Quick Start - For Immediate Use

### 1. Environment Setup (First Time Only)

```bash
# Install uv if not available
curl -LsSf https://astral.sh/uv/install.sh | sh

# Navigate to project directory
cd /Users/lingyuzeng/project/embedding_atlas

# Create virtual environment and install dependencies
uv sync
```

### 2. Ready-to-Use Commands

#### Visualize Molecular Data (SMILES/SELFIES)

```bash
# Basic visualization with SMILES column
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles

# Interactive mode on custom port
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --port 8080 --interactive

# Export as standalone web application
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --export-application visualization.zip
```

#### Compare Two Datasets

```bash
# Compare two CSV files with molecular data
uv run python src/visualization/comparison.py file1.csv file2.csv \
    --column1 smiles --column2 smiles \
    --label1 "Dataset A" --label2 "Dataset B" \
    --interactive --port 5055

# Generate static comparison image
uv run python src/visualization/comparison.py file1.csv file2.csv \
    --column1 smiles --column2 smiles \
    --output comparison.png
```

#### Start Backend Services

```bash
# Start FastAPI orchestrator (background)
uv run embedding-backend-api &

# Start MCP server (stdio mode)
uv run embedding-backend-mcp

# Test API endpoint
curl http://localhost:9000/sessions
```

## Project Overview

**Embedding Atlas** is a complete cheminformatics visualization platform that:

- Creates interactive 2D embeddings of molecular data using sentence transformers
- Supports UMAP dimensionality reduction for chemical structures
- Provides web-based interactive visualization
- Includes dataset comparison tools
- Offers containerized session management

**Current Version**: Embedding Atlas v0.13.0
**Python Version**: ≥3.12
**Package Manager**: uv (cross-platform compatible)

## Complete Usage Examples

### Example 1: Single Dataset Visualization

```bash
# Visualize DrugBank dataset with SMILES
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
    --text smiles \
    --model all-MiniLM-L6-v2 \
    --port 5055 \
    --interactive

# Access visualization at http://localhost:5055
```

### Example 2: Dataset Comparison Workflow

```bash
# Step 1: Split dataset for comparison
uv run python script/split_drugbank.py \
    --in-csv data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
    --out-dir splits_v2 \
    --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1

# Step 2: Merge splits with labels
uv run python script/merge_splits.py \
    --input-dir splits_v2/ \
    --output data/drugbank_split_merge.csv

# Step 3: Visualize merged dataset
uv run embedding-atlas data/drugbank_split_merge.csv --text smiles
```

### Example 3: Advanced Comparison with Custom Parameters

```bash
# Compare datasets with custom UMAP parameters
uv run python src/visualization/comparison.py \
    data/small_molecules.csv data/large_molecules.csv \
    --column1 smiles --column2 smiles \
    --label1 "Small Molecules" --label2 "Large Molecules" \
    --model all-mpnet-base-v2 \
    --batch-size 16 \
    --umap-args '{"n_neighbors": 20, "min_dist": 0.2, "metric": "cosine"}' \
    --interactive --port 8080
```

### Example 4: Backend Orchestration

```bash
# Create a session (the orchestrator must already be running)
curl -X POST http://localhost:9000/sessions \
    -H 'Content-Type: application/json' \
    -d '{
        "session_id": "demo-session",
        "data_url": "https://example.com/molecules.csv",
        "extra_args": ["--text", "smiles"],
        "environment": {"HF_ENDPOINT": "https://hf-mirror.com"}
    }'

# Check session status
curl http://localhost:9000/sessions

# Delete session when done
curl -X DELETE http://localhost:9000/sessions/demo-session
```

## Data Processing Pipeline
### 1. Prepare Molecular Data

```bash
# Add ECFP4 fingerprints for similarity analysis
uv run python script/data_processing/add_ecfp4_tanimoto.py input.csv output_with_fingerprints.csv

# Add macrocycle detection
uv run python script/data_processing/add_macrocycle_columns.py input.csv output_with_macrocycles.csv
```

### 2. Structure-Aware Dataset Splitting

```bash
# Create train/val/test splits with scaffold diversity
uv run python script/split_drugbank.py \
    --in-csv molecules.csv \
    --out-dir splits/ \
    --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 \
    --n_qed_bins 5 --n_mw_bins 5 \
    --largest-first
```

## Docker Deployment (Optional)

### Quick Docker Setup

```bash
# Build embedding atlas image
docker build -f docker/embedding-atlas.Dockerfile -t embedding-atlas:latest .

# Start full orchestration stack
docker compose -f docker/docker-compose.yml up --build

# Access services:
# - Orchestrator API: http://localhost:9000
# - Individual sessions: http://localhost:6000-6999
```

## Common Data Formats

### Supported Input Formats

- **CSV files** with molecular data
- **Parquet files** for large datasets
- **Hugging Face datasets** (via dataset name)

### Required Columns

- `smiles` or `SMILES`: Molecular structures
- `selfies`: Alternative molecular representation
- `id` or `compound_id`: Unique identifiers (optional)

### Generated Columns

- `projection_x`, `projection_y`: 2D coordinates
- `__neighbors`: Nearest neighbor information

## Environment Configuration

### Regional Access (China/Restricted Networks)

```bash
# Use Hugging Face mirror
export HF_ENDPOINT=https://hf-mirror.com

# Or use only locally cached models (skips all downloads)
export HF_HUB_OFFLINE=1

# Use domestic PyPI mirror
export UV_PIP_INDEX_URL=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```

### Docker Configuration

```bash
# Custom Docker settings
export EMBEDDING_DOCKER_URL=tcp://engine:2375
export EMBEDDING_CONTAINER_IMAGE=embedding-atlas:latest
export EMBEDDING_API_PORT=9000
```

## Troubleshooting Quick Reference

### Common Issues and Solutions
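One failure mode worth ruling out first: the `--umap-args` option takes a JSON string, and shell quoting mistakes can mangle it before it reaches the tool. A quick stdlib check (the argument value below is just an example) surfaces malformed JSON before a run is launched:

```python
import json

# Paste the exact string you intend to pass to --umap-args
umap_args = '{"n_neighbors": 20, "min_dist": 0.2, "metric": "cosine"}'

parsed = json.loads(umap_args)  # raises json.JSONDecodeError if malformed
assert isinstance(parsed, dict), "--umap-args must be a JSON object"
print(sorted(parsed))  # -> ['metric', 'min_dist', 'n_neighbors']
```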
**Model Download Failures**

```bash
# Set mirror for model downloads
export HF_ENDPOINT=https://hf-mirror.com
uv run embedding-atlas data.csv --text smiles
```

**Port Already in Use**

```bash
# Use auto-port selection
uv run embedding-atlas data.csv --text smiles --auto-port

# Or specify different port
uv run embedding-atlas data.csv --text smiles --port 8080
```

**Memory Issues with Large Datasets**

```bash
# Reduce batch size
uv run embedding-atlas large_dataset.csv --text smiles --batch-size 8

# Use sampling
uv run embedding-atlas large_dataset.csv --text smiles --sample 10000
```

**Docker Connection Issues**

```bash
# Check Docker daemon
sudo systemctl status docker

# Test Docker connection
docker ps
```

## Performance Optimization

### For Large Datasets (>100k molecules)

```bash
# Use smaller batch size and sampling
uv run embedding-atlas large_dataset.csv \
    --text smiles \
    --batch-size 8 \
    --sample 50000 \
    --umap-n-neighbors 10
```

### For High-Quality Embeddings

```bash
# Use better model with optimized parameters
uv run embedding-atlas dataset.csv \
    --text smiles \
    --model all-mpnet-base-v2 \
    --umap-n-neighbors 30 \
    --umap-min-dist 0.1
```

## Validation Commands

### Test Installation

```bash
# Verify Embedding Atlas version
uv run python -c "import embedding_atlas; print(f'Version: {embedding_atlas.__version__}')"

# Test basic functionality
uv run embedding-atlas --help

# Test backend
uv run embedding-backend-api --help
```

### Test with Sample Data

```bash
# Quick visualization test
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
    --text smiles \
    --sample 100 \
    --port 5055
```

## Project Structure Reference

```
/Users/lingyuzeng/project/embedding_atlas/
├── data/                       # Sample datasets
│   ├── drugbank_pre_filtered_mordred_qed_id_selfies.csv
│   └── splits_v2/              # Pre-split datasets
├── src/visualization/          # Comparison tools
│   └── comparison.py           # Main comparison script
├── src/embedding_backend/      # Backend services
│   ├── main.py                 # FastAPI/MCP entry points
│   ├── config.py               # Configuration
│   └── docker_manager.py       # Container management
├── script/data_processing/     # Data utilities
│   ├── split_drugbank.py       # Dataset splitting
│   ├── merge_splits.py         # Dataset merging
│   └── add_ecfp4_tanimoto.py   # Fingerprint generation
└── docker/                     # Container configs
    ├── embedding-atlas.Dockerfile
    └── docker-compose.yml
```

## Next Steps

1. **Choose your use case**: single dataset visualization, dataset comparison, or backend orchestration
2. **Prepare your data**: ensure CSV files have the appropriate molecular-structure columns
3. **Run the appropriate command** from the examples above
4. **Access results**: via the web interface at the specified port, or through exported files

**Support**: All commands are tested with Embedding Atlas v0.13.0 on Python ≥3.12 using the uv package manager.