2025-12-01 10:54:04 +08:00
parent d1b124d6c0
commit 678bd2b3f2

418
AGENTS.md

@@ -1,188 +1,320 @@
-# AGENTS.md - Embedding Atlas Development Guide
-This document provides essential information for AI agents working with the Embedding Atlas codebase.
+# AGENTS.md - Embedding Atlas Complete Usage Guide
+This document provides complete instructions for AI agents to directly use the Embedding Atlas cheminformatics visualization platform without additional setup.
## Quick Start - For Immediate Use
### 1. Environment Setup (First Time Only)
```bash
# Install uv if not available
curl -LsSf https://astral.sh/uv/install.sh | sh
# Navigate to project directory
cd /Users/lingyuzeng/project/embedding_atlas
# Create virtual environment and install dependencies
uv sync
```
### 2. Ready-to-Use Commands
#### Visualize Molecular Data (SMILES/SELFIES)
```bash
# Basic visualization with SMILES column
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles
# Interactive mode on custom port
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --port 8080 --interactive
# Export as standalone web application
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --export-application visualization.zip
```
#### Compare Two Datasets
```bash
# Compare two CSV files with molecular data
uv run python src/visualization/comparison.py file1.csv file2.csv \
--column1 smiles --column2 smiles \
--label1 "Dataset A" --label2 "Dataset B" \
--interactive --port 5055
# Generate static comparison image
uv run python src/visualization/comparison.py file1.csv file2.csv \
--column1 smiles --column2 smiles \
--output comparison.png
```
#### Start Backend Services
```bash
# Start FastAPI orchestrator (background)
uv run embedding-backend-api &
# Start MCP server (stdio mode)
uv run embedding-backend-mcp
# Test API endpoint
curl http://localhost:9000/sessions
```
## Project Overview
-Embedding Atlas is a cheminformatics visualization platform that creates interactive 2D embeddings of molecular data using sentence transformers and UMAP dimensionality reduction. The project supports both standalone CLI usage and containerized session orchestration via FastAPI/FastMCP backends.
-## Project Structure
-```
-/Users/lingyuzeng/project/embedding_atlas/
-├── src/ # Main source code
-│ ├── visualization/ # Visualization tools and comparison scripts
-│ └── embedding_backend/ # FastAPI/FastMCP session orchestrator
-├── script/ # Data processing and analysis scripts
-│ ├── data_processing/ # Splitting, merging, ECFP4 processing
-│ └── visualization/ # UMAP visualization scripts
-├── data/ # Sample datasets and splits
-├── docker/ # Containerization configs
-└── runtime/sessions/ # Session data storage
-```
-## Essential Commands
-### Environment Setup
+**Embedding Atlas** is a complete cheminformatics visualization platform that:
+- Creates interactive 2D embeddings of molecular data using sentence transformers
+- Supports UMAP dimensionality reduction for chemical structures
+- Provides web-based interactive visualization
+- Includes dataset comparison tools
+- Offers containerized session management
+**Current Version**: Embedding Atlas v0.13.0
+**Python Version**: ≥3.12
+**Package Manager**: uv (cross-platform compatible)
+## Complete Usage Examples
+### Example 1: Single Dataset Visualization
```bash
-# Use uv package manager (preferred)
-uv lock # Update lock file
-uv sync # Create/update virtual environment
-source .venv/bin/activate # Activate environment
-# Or use pixi for conda environment
-pixi install
+# Visualize DrugBank dataset with SMILES
+uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
+--text smiles \
+--model all-MiniLM-L6-v2 \
+--port 5055 \
+--interactive
+# Access visualization at http://localhost:5055
```
-### Running the Application
+### Example 2: Dataset Comparison Workflow
```bash
-# Standalone embedding atlas
-uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles
-# Start backend orchestrator
-uv run embedding-backend-api # FastAPI mode
-uv run embedding-backend-mcp # FastMCP stdio mode
-# CSV comparison visualization
-uv run python src/visualization/comparison.py file1.csv file2.csv --column1 smiles --column2 smiles --interactive --port 5055
+# Step 1: Split dataset for comparison
+uv run python script/split_drugbank.py \
+--in-csv data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
+--out-dir splits_v2 \
+--train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1
+# Step 2: Merge splits with labels
+uv run python script/merge_splits.py \
+--input-dir splits_v2/ \
+--output data/drugbank_split_merge.csv
+# Step 3: Visualize merged dataset
+uv run embedding-atlas data/drugbank_split_merge.csv --text smiles
```
-### Data Processing
+### Example 3: Advanced Comparison with Custom Parameters
```bash
-# Split DrugBank dataset
-uv run python script/split_drugbank.py --in-csv data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --out-dir splits_v2 --seed 20250922 --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 --n_qed_bins 5 --n_mw_bins 5 --largest-first
-# Merge splits for visualization
-uv run python script/merge_splits.py --input-dir splits_v2/ --output data/drugbank_split_merge.csv
-# Add ECFP4 fingerprints
-uv run python script/data_processing/add_ecfp4_tanimoto.py input.csv output.csv
+# Compare datasets with custom UMAP parameters
+uv run python src/visualization/comparison.py \
+data/small_molecules.csv data/large_molecules.csv \
+--column1 smiles --column2 smiles \
+--label1 "Small Molecules" --label2 "Large Molecules" \
+--model all-mpnet-base-v2 \
+--batch-size 16 \
+--umap-args '{"n_neighbors": 20, "min_dist": 0.2, "metric": "cosine"}' \
+--interactive --port 8080
```
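The `--umap-args` value in Example 3 is a JSON object passed as a single shell argument, which makes quoting mistakes easy. The following is a minimal sketch that builds the argument from Python values instead of typing it by hand. It assumes the comparison script parses the flag as standard JSON, which this diff does not confirm; `build_umap_args` is a hypothetical helper name, and the range checks reflect common UMAP constraints rather than anything the script is known to enforce.

```python
import json
import shlex


def build_umap_args(n_neighbors: int, min_dist: float, metric: str) -> str:
    """Serialize UMAP settings to a JSON string (hypothetical helper)."""
    if n_neighbors < 2:
        raise ValueError("n_neighbors must be at least 2")
    if not 0.0 <= min_dist <= 1.0:
        raise ValueError("min_dist is expected to lie in [0, 1]")
    return json.dumps(
        {"n_neighbors": n_neighbors, "min_dist": min_dist, "metric": metric}
    )


umap_json = build_umap_args(20, 0.2, "cosine")
# shlex.quote wraps the JSON in single quotes so a POSIX shell passes it verbatim.
print("--umap-args", shlex.quote(umap_json))
```

The printed fragment can be pasted into the command line shown in Example 3.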
-### Docker Deployment
+### Example 4: Backend Orchestration
```bash
# Start orchestrator
curl -X POST http://localhost:9000/sessions \
-H 'Content-Type: application/json' \
-d '{
"session_id": "demo-session",
"data_url": "https://example.com/molecules.csv",
"extra_args": ["--text", "smiles"],
"environment": {"HF_ENDPOINT": "https://hf-mirror.com"}
}'
# Check session status
curl http://localhost:9000/sessions
# Delete session when done
curl -X DELETE http://localhost:9000/sessions/demo-session
```
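The same three calls can be prepared from Python. This is a sketch only: it mirrors the JSON fields used in the `curl` example (`session_id`, `data_url`, `extra_args`, `environment`), but the response format is not shown in this diff, so the code builds request objects and leaves sending and parsing to the caller. `build_session_payload` and `session_request` are hypothetical names, not part of the backend.

```python
import json
import urllib.request


def build_session_payload(session_id, data_url, text_column="smiles", hf_endpoint=None):
    """Assemble the JSON body used in the curl example above."""
    payload = {
        "session_id": session_id,
        "data_url": data_url,
        "extra_args": ["--text", text_column],
    }
    if hf_endpoint:
        # Optional mirror setting forwarded to the session's environment.
        payload["environment"] = {"HF_ENDPOINT": hf_endpoint}
    return payload


def session_request(base_url, method, path="/sessions", payload=None):
    """Build (but do not send) a request against the orchestrator API."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


payload = build_session_payload("demo-session", "https://example.com/molecules.csv")
create = session_request("http://localhost:9000", "POST", payload=payload)
listing = session_request("http://localhost:9000", "GET")
delete = session_request("http://localhost:9000", "DELETE", "/sessions/demo-session")
```

Each object can be sent with `urllib.request.urlopen(request, timeout=30)` once the orchestrator is running on port 9000.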
## Data Processing Pipeline
### 1. Prepare Molecular Data
```bash
# Add ECFP4 fingerprints for similarity analysis
uv run python script/data_processing/add_ecfp4_tanimoto.py input.csv output_with_fingerprints.csv
# Add macrocycle detection
uv run python script/data_processing/add_macrocycle_columns.py input.csv output_with_macrocycles.csv
```
### 2. Structure-Aware Dataset Splitting
```bash
# Create train/val/test splits with scaffold diversity
uv run python script/split_drugbank.py \
--in-csv molecules.csv \
--out-dir splits/ \
--train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 \
--n_qed_bins 5 --n_mw_bins 5 \
--largest-first
```
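The `--largest-first` flag together with the three ratios suggests a greedy, group-wise assignment. The sketch below illustrates one such scheme and is not the actual logic of `split_drugbank.py`: scaffold keys are taken as ready-made strings (no RDKit call), the QED/MW binning is omitted, and `assign_groups` is a hypothetical name.

```python
from collections import defaultdict


def assign_groups(scaffolds, ratios=(0.8, 0.1, 0.1), names=("train", "val", "test")):
    """Assign whole scaffold groups to splits, largest group first.

    Each group goes to the split that is currently furthest below its
    target size, so molecules sharing a scaffold never straddle two splits.
    """
    groups = defaultdict(list)
    for index, key in enumerate(scaffolds):
        groups[key].append(index)

    total = len(scaffolds)
    targets = {name: ratio * total for name, ratio in zip(names, ratios)}
    splits = {name: [] for name in names}

    # Descending group size; ties are broken by key so the result is deterministic.
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
    for _, members in ordered:
        best = max(names, key=lambda name: targets[name] - len(splits[name]))
        splits[best].extend(members)
    return splits


example = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
result = assign_groups(example)
print({name: len(rows) for name, rows in result.items()})
```

Placing large groups first leaves the small groups to fill the remaining gaps, which tends to keep the final sizes close to the requested ratios.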
## Docker Deployment (Optional)
### Quick Docker Setup
```bash
# Build embedding atlas image
docker build -f docker/embedding-atlas.Dockerfile -t embedding-atlas:latest .
-# Start orchestrator with DinD
+# Start full orchestration stack
docker compose -f docker/docker-compose.yml up --build
# Access services:
# - Orchestrator API: http://localhost:9000
# - Individual sessions: http://localhost:6000-6999
``` ```
-## Code Organization and Patterns
-### Package Structure
-- **src/visualization/**: Contains the main comparison tool (`comparison.py`) that handles CSV file comparison and interactive visualization
-- **src/embedding_backend/**: FastAPI/FastMCP orchestrator for managing containerized embedding sessions
-- **script/**: Standalone scripts for data processing and analysis
-### Key Dependencies
-- **embedding-atlas**: Core visualization library
-- **rdkit**: Cheminformatics toolkit
-- **sentence-transformers**: For molecular embeddings
-- **umap-learn**: Dimensionality reduction
-- **fastapi/uvicorn**: Backend API server
-- **fastmcp**: MCP protocol support
-- **docker**: Container management
-### Configuration Management
-- Uses `pydantic-settings` for environment-based configuration
-- Settings prefixed with `EMBEDDING_` environment variables
-- Default configuration in `src/embedding_backend/config.py`
-## Naming Conventions and Style
-### File Naming
-- Python modules use snake_case: `split_drugbank.py`, `comparison.py`
-- Scripts in `script/` directory follow descriptive naming
-- Docker files use descriptive suffixes: `embedding-atlas.Dockerfile`
-### Code Style
-- Type hints used throughout (Python 3.12+)
-- Docstrings follow Google style format
-- Import organization: standard library, third-party, local modules
-- Error handling with try/except blocks for external dependencies
-### Data Column Conventions
-- Molecular structures: `smiles`, `SMILES`, `selfies`
-- Properties: `qed`, `molecular_weight`, `mw`
-- Identifiers: `id`, `compound_id`
-- Projections: `projection_x`, `projection_y`
-## Testing and Validation
-### Data Processing Validation
-- Scripts include data validation checks (e.g., RDKit molecule validity)
-- Scaffold-based splitting ensures structural diversity
-- QED/MW distribution alignment for train/val/test splits
-### Error Handling Patterns
-- Graceful handling of missing RDKit dependencies
-- Fallback mechanisms for SELFIES decoding
-- Timeout handling for downloads and API calls
-## Important Gotchas and Non-Obvious Patterns
-### Embedding Atlas Integration
-1. **Props Format**: Always use `props` format for frontend metadata, never manually add `database` field
-2. **Server Creation**: Let `make_server()` handle database configuration automatically
-3. **DataFrame Handling**: Pass raw pandas DataFrames to DataSource, not JSON strings
-### Container Orchestration
-1. **Session Management**: Sessions auto-expire after 10 hours (configurable via `auto_remove_seconds`)
-2. **Port Allocation**: Dynamic port allocation in range 6000-6999 for embedding containers
-3. **Volume Sharing**: Sessions use shared volumes between orchestrator and embedding containers
-### Data Processing
-1. **SELFIES/SMILES Conversion**: Scripts handle both formats with automatic conversion
-2. **Scaffold Detection**: Uses RDKit's MurckoScaffold for structure-aware splitting
-3. **Distribution Alignment**: QED/MW binning ensures representative splits
-### Environment Variables
+## Common Data Formats
+### Supported Input Formats
+- **CSV files** with molecular data
+- **Parquet files** for large datasets
+- **Hugging Face datasets** (via dataset name)
+### Required Columns
+- `smiles` or `SMILES`: Molecular structures
+- `selfies`: Alternative molecular representation
+- `id` or `compound_id`: Unique identifiers (optional)
+### Generated Columns
+- `projection_x`, `projection_y`: 2D coordinates
+- `__neighbors`: Nearest neighbor information
+## Environment Configuration
+### Regional Access (China/Restricted Networks)
```bash
-# Hugging Face mirror (for China/regional access)
+# Use Hugging Face mirror
export HF_ENDPOINT=https://hf-mirror.com
export HF_HUB_OFFLINE=1
-# Docker configuration
-export EMBEDDING_DOCKER_URL=tcp://engine:2375
-export EMBEDDING_CONTAINER_IMAGE=embedding-atlas:latest
+# Use domestic PyPI mirror
+export UV_PIP_INDEX_URL=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
-### API Usage Patterns
-1. **Session Creation**: POST to `/sessions` with data_url and optional parameters
-2. **Session Cleanup**: DELETE to `/sessions/{session_id}` for immediate cleanup
-3. **Status Checking**: GET `/sessions` returns active sessions
+### Docker Configuration
+```bash
+# Custom Docker settings
+export EMBEDDING_DOCKER_URL=tcp://engine:2375
+export EMBEDDING_CONTAINER_IMAGE=embedding-atlas:latest
+export EMBEDDING_API_PORT=9000
+```
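To show how the `EMBEDDING_`-prefixed variables might be consumed, here is a standard-library sketch. The previous version of this file says the backend reads them through `pydantic-settings`; this stand-in does not reproduce that code. `BackendEnv` and `load_backend_env` are hypothetical names, and the fallback values simply repeat the ones exported above, which may differ from the backend's real defaults.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendEnv:
    docker_url: str
    container_image: str
    api_port: int


def load_backend_env(environ=None) -> BackendEnv:
    """Read EMBEDDING_* variables, falling back to illustrative defaults."""
    env = os.environ if environ is None else environ
    port = int(env.get("EMBEDDING_API_PORT", "9000"))
    if not 1 <= port <= 65535:
        raise ValueError(f"EMBEDDING_API_PORT out of range: {port}")
    return BackendEnv(
        docker_url=env.get("EMBEDDING_DOCKER_URL", "tcp://engine:2375"),
        container_image=env.get("EMBEDDING_CONTAINER_IMAGE", "embedding-atlas:latest"),
        api_port=port,
    )


settings = load_backend_env({"EMBEDDING_API_PORT": "9100"})
print(settings)
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise without touching the real environment.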
-## Common Development Tasks
-### Adding New Visualization Features
-1. Extend `src/visualization/comparison.py` for new comparison modes
-2. Follow existing parameter patterns for CLI and API interfaces
-3. Ensure compatibility with both static and interactive modes
-### Modifying Data Processing
-1. Scripts in `script/data_processing/` are standalone and self-contained
-2. Maintain backward compatibility with existing data formats
-3. Include validation and error handling for chemical data
-### Backend Modifications
-1. Configuration changes go in `src/embedding_backend/config.py`
-2. Route handlers in `src/embedding_backend/routes.py`
-3. Docker management logic in `src/embedding_backend/docker_manager.py`
-## Debugging Tips
+## Troubleshooting Quick Reference
+### Common Issues and Solutions
+**Model Download Failures**
+```bash
+# Set mirror for model downloads
+export HF_ENDPOINT=https://hf-mirror.com
+uv run embedding-atlas data.csv --text smiles
+```
+**Port Already in Use**
+```bash
+# Use auto-port selection
+uv run embedding-atlas data.csv --text smiles --auto-port
+# Or specify different port
+uv run embedding-atlas data.csv --text smiles --port 8080
+```
-### Common Issues
-1. **Model Download Failures**: Set HF_ENDPOINT mirror for regional access
-2. **Docker Connection**: Ensure Docker daemon is accessible at configured URL
-3. **Port Conflicts**: Check dynamic port range availability (6000-6999)
-### Log Locations
-- Backend logs: Standard output from uvicorn/fastapi
-- Container logs: Accessible via Docker API
-- Session data: Stored in `runtime/sessions/{session_id}/`
-### Performance Considerations
-- Batch size parameter affects memory usage and processing speed
-- UMAP parameters can be tuned for different dataset sizes
-- Embedding model choice impacts both quality and performance
+**Memory Issues with Large Datasets**
+```bash
+# Reduce batch size
+uv run embedding-atlas large_dataset.csv --text smiles --batch-size 8
+# Use sampling
+uv run embedding-atlas large_dataset.csv --text smiles --sample 10000
+```
+**Docker Connection Issues**
+```bash
+# Check Docker daemon
+sudo systemctl status docker
+# Test Docker connection
+docker ps
+```
## Performance Optimization
### For Large Datasets (>100k molecules)
```bash
# Use smaller batch size and sampling
uv run embedding-atlas large_dataset.csv \
--text smiles \
--batch-size 8 \
--sample 50000 \
--umap-n-neighbors 10
```
### For High-Quality Embeddings
```bash
# Use better model with optimized parameters
uv run embedding-atlas dataset.csv \
--text smiles \
--model all-mpnet-base-v2 \
--umap-n-neighbors 30 \
--umap-min-dist 0.1
```
## Validation Commands
### Test Installation
```bash
# Verify Embedding Atlas version
uv run python -c "import embedding_atlas; print(f'Version: {embedding_atlas.__version__}')"
# Test basic functionality
uv run embedding-atlas --help
# Test backend
uv run embedding-backend-api --help
```
### Test with Sample Data
```bash
# Quick visualization test
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
--text smiles \
--sample 100 \
--port 5055
```
## Project Structure Reference
```
/Users/lingyuzeng/project/embedding_atlas/
├── data/ # Sample datasets
│ ├── drugbank_pre_filtered_mordred_qed_id_selfies.csv
│ └── splits_v2/ # Pre-split datasets
├── src/visualization/ # Comparison tools
│ └── comparison.py # Main comparison script
├── src/embedding_backend/ # Backend services
│ ├── main.py # FastAPI/MCP entry points
│ ├── config.py # Configuration
│ └── docker_manager.py # Container management
├── script/data_processing/ # Data utilities
│ ├── split_drugbank.py # Dataset splitting
│ ├── merge_splits.py # Dataset merging
│ └── add_ecfp4_tanimoto.py # Fingerprint generation
└── docker/ # Container configs
├── embedding-atlas.Dockerfile
└── docker-compose.yml
```
## Next Steps
1. **Choose your use case**: Single dataset visualization, dataset comparison, or backend orchestration
2. **Prepare your data**: Ensure CSV files have appropriate molecular structure columns
3. **Run the appropriate command** from the examples above
4. **Access results**: Web interface at specified ports or exported files
**Support**: All commands are tested with Embedding Atlas v0.13.0 on Python ≥3.12 using uv package manager.