Commit 678bd2b3f2 (parent d1b124d6c0), 2025-12-01 10:54:04 +08:00
File: AGENTS.md
# AGENTS.md - Embedding Atlas Complete Usage Guide

This document provides complete instructions for AI agents to directly use the Embedding Atlas cheminformatics visualization platform without additional setup.
## Quick Start - For Immediate Use
### 1. Environment Setup (First Time Only)
```bash
# Install uv if not available
curl -LsSf https://astral.sh/uv/install.sh | sh
# Navigate to project directory
cd /Users/lingyuzeng/project/embedding_atlas
# Create virtual environment and install dependencies
uv sync
```
### 2. Ready-to-Use Commands
#### Visualize Molecular Data (SMILES/SELFIES)
```bash
# Basic visualization with SMILES column
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles
# Interactive mode on custom port
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --port 8080 --interactive
# Export as standalone web application
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --export-application visualization.zip
```
#### Compare Two Datasets
```bash
# Compare two CSV files with molecular data
uv run python src/visualization/comparison.py file1.csv file2.csv \
--column1 smiles --column2 smiles \
--label1 "Dataset A" --label2 "Dataset B" \
--interactive --port 5055
# Generate static comparison image
uv run python src/visualization/comparison.py file1.csv file2.csv \
--column1 smiles --column2 smiles \
--output comparison.png
```
#### Start Backend Services
```bash
# Start FastAPI orchestrator (background)
uv run embedding-backend-api &
# Start MCP server (stdio mode)
uv run embedding-backend-mcp
# Test API endpoint
curl http://localhost:9000/sessions
```
## Project Overview
**Embedding Atlas** is a complete cheminformatics visualization platform that:
- Creates interactive 2D embeddings of molecular data using sentence transformers
- Supports UMAP dimensionality reduction for chemical structures
- Provides web-based interactive visualization
- Includes dataset comparison tools
- Offers containerized session management
**Current Version**: Embedding Atlas v0.13.0
**Python Version**: ≥3.12
**Package Manager**: uv (cross-platform compatible)
## Complete Usage Examples

### Example 1: Single Dataset Visualization
```bash
# Visualize DrugBank dataset with SMILES
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
    --text smiles \
    --model all-MiniLM-L6-v2 \
    --port 5055 \
    --interactive

# Access visualization at http://localhost:5055
```
### Example 2: Dataset Comparison Workflow
```bash
# Step 1: Split dataset for comparison
uv run python script/split_drugbank.py \
    --in-csv data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
    --out-dir splits_v2 \
    --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1

# Step 2: Merge splits with labels
uv run python script/merge_splits.py \
    --input-dir splits_v2/ \
    --output data/drugbank_split_merge.csv

# Step 3: Visualize merged dataset
uv run embedding-atlas data/drugbank_split_merge.csv --text smiles
```
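The merge step boils down to concatenating the split files while tagging each row with its originating split. A minimal pure-Python sketch of that idea (the `split` column name and helper are illustrative, not the actual `merge_splits.py` interface):

```python
import csv
import io

def merge_with_labels(csv_texts):
    """Concatenate several split CSVs, tagging each row with its split name.

    Toy illustration of the merge step; the 'split' column name is assumed.
    """
    rows = []
    for split_name, text in csv_texts.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["split"] = split_name
            rows.append(row)
    return rows

merged = merge_with_labels({
    "train": "id,smiles\n1,CCO\n2,CCC\n",
    "test": "id,smiles\n3,c1ccccc1\n",
})
print(len(merged))  # → 3
```

A label column like this is what lets the merged visualization color points by split.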
### Example 3: Advanced Comparison with Custom Parameters
```bash
# Compare datasets with custom UMAP parameters
uv run python src/visualization/comparison.py \
    data/small_molecules.csv data/large_molecules.csv \
    --column1 smiles --column2 smiles \
    --label1 "Small Molecules" --label2 "Large Molecules" \
    --model all-mpnet-base-v2 \
    --batch-size 16 \
    --umap-args '{"n_neighbors": 20, "min_dist": 0.2, "metric": "cosine"}' \
    --interactive --port 8080
```
### Example 4: Backend Orchestration
```bash
# Create a session (the orchestrator must already be running: uv run embedding-backend-api)
curl -X POST http://localhost:9000/sessions \
-H 'Content-Type: application/json' \
-d '{
"session_id": "demo-session",
"data_url": "https://example.com/molecules.csv",
"extra_args": ["--text", "smiles"],
"environment": {"HF_ENDPOINT": "https://hf-mirror.com"}
}'
# Check session status
curl http://localhost:9000/sessions
# Delete session when done
curl -X DELETE http://localhost:9000/sessions/demo-session
```
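The same session-creation request can be assembled programmatically. A minimal sketch that only builds the JSON body shown in the curl example above (the helper name is hypothetical):

```python
import json

def make_session_payload(session_id, data_url, text_column="smiles", environment=None):
    """Build the JSON body for POST /sessions, mirroring the curl example."""
    payload = {
        "session_id": session_id,
        "data_url": data_url,
        "extra_args": ["--text", text_column],
    }
    if environment:
        payload["environment"] = environment
    return json.dumps(payload)

body = make_session_payload(
    "demo-session",
    "https://example.com/molecules.csv",
    environment={"HF_ENDPOINT": "https://hf-mirror.com"},
)
print(body)
```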
## Data Processing Pipeline
### 1. Prepare Molecular Data
```bash
# Add ECFP4 fingerprints for similarity analysis
uv run python script/data_processing/add_ecfp4_tanimoto.py input.csv output_with_fingerprints.csv
# Add macrocycle detection
uv run python script/data_processing/add_macrocycle_columns.py input.csv output_with_macrocycles.csv
```
### 2. Structure-Aware Dataset Splitting
```bash
# Create train/val/test splits with scaffold diversity
uv run python script/split_drugbank.py \
--in-csv molecules.csv \
--out-dir splits/ \
--train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 \
--n_qed_bins 5 --n_mw_bins 5 \
--largest-first
```
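The binned splitting strategy can be sketched in pure Python: bucket rows by a property (QED or MW in the real script), then split each bucket by the requested ratios. This is a simplified illustration under those assumptions, not the actual `split_drugbank.py` logic (no scaffold handling):

```python
import random
from collections import defaultdict

def stratified_split(rows, key, ratios=(0.8, 0.1, 0.1), n_bins=5, seed=20250922):
    """Split rows into train/val/test while roughly preserving the
    distribution of rows[i][key] via equal-width binning."""
    rng = random.Random(seed)
    values = [r[key] for r in rows]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column
    bins = defaultdict(list)
    for r in rows:
        b = min(int((r[key] - lo) / width), n_bins - 1)
        bins[b].append(r)
    train, val, test = [], [], []
    for members in bins.values():
        rng.shuffle(members)
        n_train = int(len(members) * ratios[0])
        n_val = int(len(members) * ratios[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]
    return train, val, test
```

Binning before splitting is what keeps the QED/MW distributions of the three subsets aligned.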
## Docker Deployment (Optional)
### Quick Docker Setup
```bash
# Build embedding atlas image
docker build -f docker/embedding-atlas.Dockerfile -t embedding-atlas:latest .
# Start full orchestration stack (orchestrator with Docker-in-Docker)
docker compose -f docker/docker-compose.yml up --build
# Access services:
# - Orchestrator API: http://localhost:9000
# - Individual sessions: http://localhost:6000-6999
```
## Common Data Formats

### Supported Input Formats
- **CSV files** with molecular data
- **Parquet files** for large datasets
- **Hugging Face datasets** (via dataset name)

### Required Columns
- `smiles` or `SMILES`: molecular structures
- `selfies`: alternative molecular representation
- `id` or `compound_id`: unique identifiers (optional)

### Generated Columns
- `projection_x`, `projection_y`: 2D coordinates
- `__neighbors`: nearest-neighbor information

## Code Organization and Patterns

### Package Structure
- **src/visualization/**: the main comparison tool (`comparison.py`) for CSV file comparison and interactive visualization
- **src/embedding_backend/**: FastAPI/FastMCP orchestrator for managing containerized embedding sessions
- **script/**: standalone scripts for data processing and analysis

### Key Dependencies
- **embedding-atlas**: core visualization library
- **rdkit**: cheminformatics toolkit
- **sentence-transformers**: molecular embeddings
- **umap-learn**: dimensionality reduction
- **fastapi/uvicorn**: backend API server
- **fastmcp**: MCP protocol support
- **docker**: container management

### Configuration Management
- Uses `pydantic-settings` for environment-based configuration
- Settings read from `EMBEDDING_`-prefixed environment variables
- Default configuration in `src/embedding_backend/config.py`
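The column conventions above can be checked before launching a visualization. A minimal sketch using only the standard library (the helper name is illustrative):

```python
import csv
import io

def find_structure_column(fieldnames):
    """Return the first recognized molecular-structure column, or None."""
    for name in ("smiles", "SMILES", "selfies"):
        if name in fieldnames:
            return name
    return None

sample = "id,smiles,qed\n1,CCO,0.41\n"
reader = csv.DictReader(io.StringIO(sample))
print(find_structure_column(reader.fieldnames))  # → smiles
```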
## Naming Conventions and Style
### File Naming
- Python modules use snake_case: `split_drugbank.py`, `comparison.py`
- Scripts in `script/` directory follow descriptive naming
- Docker files use descriptive suffixes: `embedding-atlas.Dockerfile`
### Code Style
- Type hints used throughout (Python 3.12+)
- Docstrings follow Google style format
- Import organization: standard library, third-party, local modules
- Error handling with try/except blocks for external dependencies
### Data Column Conventions
- Molecular structures: `smiles`, `SMILES`, `selfies`
- Properties: `qed`, `molecular_weight`, `mw`
- Identifiers: `id`, `compound_id`
- Projections: `projection_x`, `projection_y`
## Testing and Validation
### Data Processing Validation
- Scripts include data validation checks (e.g., RDKit molecule validity)
- Scaffold-based splitting ensures structural diversity
- QED/MW distribution alignment for train/val/test splits
### Error Handling Patterns
- Graceful handling of missing RDKit dependencies
- Fallback mechanisms for SELFIES decoding
- Timeout handling for downloads and API calls
## Important Gotchas and Non-Obvious Patterns
### Embedding Atlas Integration
1. **Props Format**: Always use `props` format for frontend metadata, never manually add `database` field
2. **Server Creation**: Let `make_server()` handle database configuration automatically
3. **DataFrame Handling**: Pass raw pandas DataFrames to DataSource, not JSON strings
### Container Orchestration
1. **Session Management**: Sessions auto-expire after 10 hours (configurable via `auto_remove_seconds`)
2. **Port Allocation**: Dynamic port allocation in range 6000-6999 for embedding containers
3. **Volume Sharing**: Sessions use shared volumes between orchestrator and embedding containers
### Data Processing
1. **SELFIES/SMILES Conversion**: Scripts handle both formats with automatic conversion
2. **Scaffold Detection**: Uses RDKit's MurckoScaffold for structure-aware splitting
3. **Distribution Alignment**: QED/MW binning ensures representative splits
## Environment Configuration
### Regional Access (China/Restricted Networks)
```bash
# Use Hugging Face mirror
export HF_ENDPOINT=https://hf-mirror.com

# Optional: force offline mode once models are already cached
export HF_HUB_OFFLINE=1

# Use domestic PyPI mirror
export UV_PIP_INDEX_URL=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
### API Usage Patterns
1. **Session Creation**: POST to `/sessions` with data_url and optional parameters
2. **Session Cleanup**: DELETE to `/sessions/{session_id}` for immediate cleanup
3. **Status Checking**: GET `/sessions` returns active sessions
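The three lifecycle calls map onto a small table of method/URL pairs. A sketch of that mapping (base URL taken from the examples; the helper name is hypothetical):

```python
def session_request(action, session_id=None, base="http://localhost:9000"):
    """Map a session lifecycle action to the (method, url) pair used above."""
    if action == "create":
        return ("POST", f"{base}/sessions")
    if action == "status":
        return ("GET", f"{base}/sessions")
    if action == "delete":
        return ("DELETE", f"{base}/sessions/{session_id}")
    raise ValueError(f"unknown action: {action}")

print(session_request("delete", "demo-session"))
```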
### Docker Configuration
```bash
# Custom Docker settings
export EMBEDDING_DOCKER_URL=tcp://engine:2375
export EMBEDDING_CONTAINER_IMAGE=embedding-atlas:latest
export EMBEDDING_API_PORT=9000
```
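The `EMBEDDING_`-prefixed variables follow the usual pydantic-settings convention of mapping a name like `EMBEDDING_API_PORT` to a field named `api_port`. A dependency-free sketch of that mapping (not the actual `config.py`):

```python
def load_embedding_settings(environ, prefix="EMBEDDING_"):
    """Collect prefix-matched environment variables into a settings dict,
    lower-casing the remainder of each name (pydantic-settings style)."""
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }

env = {
    "EMBEDDING_API_PORT": "9000",
    "EMBEDDING_DOCKER_URL": "tcp://engine:2375",
    "PATH": "/usr/bin",
}
print(load_embedding_settings(env))  # → {'api_port': '9000', 'docker_url': 'tcp://engine:2375'}
```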
## Common Development Tasks

### Adding New Visualization Features
1. Extend `src/visualization/comparison.py` for new comparison modes
2. Follow existing parameter patterns for CLI and API interfaces
3. Ensure compatibility with both static and interactive modes

### Modifying Data Processing
1. Scripts in `script/data_processing/` are standalone and self-contained
2. Maintain backward compatibility with existing data formats
3. Include validation and error handling for chemical data

### Backend Modifications
1. Configuration changes go in `src/embedding_backend/config.py`
2. Route handlers in `src/embedding_backend/routes.py`
3. Docker management logic in `src/embedding_backend/docker_manager.py`

## Troubleshooting Quick Reference

### Common Issues and Solutions

**Model Download Failures**
```bash
# Set mirror for model downloads
export HF_ENDPOINT=https://hf-mirror.com
uv run embedding-atlas data.csv --text smiles
```

**Port Already in Use**
```bash
# Use auto-port selection
uv run embedding-atlas data.csv --text smiles --auto-port

# Or specify a different port
uv run embedding-atlas data.csv --text smiles --port 8080
```

**Memory Issues with Large Datasets**
```bash
# Reduce batch size
uv run embedding-atlas large_dataset.csv --text smiles --batch-size 8

# Use sampling
uv run embedding-atlas large_dataset.csv --text smiles --sample 10000
```

**Docker Connection Issues**
```bash
# Check Docker daemon
sudo systemctl status docker

# Test Docker connection
docker ps
```
Ensure the Docker daemon is accessible at the configured `EMBEDDING_DOCKER_URL`, and check that the dynamic port range (6000-6999) is available.

### Log Locations
- Backend logs: standard output from uvicorn/FastAPI
- Container logs: accessible via the Docker API
- Session data: stored in `runtime/sessions/{session_id}/`

### Performance Considerations
- Batch size affects memory usage and processing speed
- UMAP parameters can be tuned for different dataset sizes
- Embedding model choice impacts both quality and performance
## Performance Optimization
### For Large Datasets (>100k molecules)
```bash
# Use smaller batch size and sampling
uv run embedding-atlas large_dataset.csv \
--text smiles \
--batch-size 8 \
--sample 50000 \
--umap-n-neighbors 10
```
### For High-Quality Embeddings
```bash
# Use better model with optimized parameters
uv run embedding-atlas dataset.csv \
--text smiles \
--model all-mpnet-base-v2 \
--umap-n-neighbors 30 \
--umap-min-dist 0.1
```
## Validation Commands
### Test Installation
```bash
# Verify Embedding Atlas version
uv run python -c "import embedding_atlas; print(f'Version: {embedding_atlas.__version__}')"
# Test basic functionality
uv run embedding-atlas --help
# Test backend
uv run embedding-backend-api --help
```
### Test with Sample Data
```bash
# Quick visualization test
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
--text smiles \
--sample 100 \
--port 5055
```
## Project Structure Reference
```
/Users/lingyuzeng/project/embedding_atlas/
├── data/ # Sample datasets
│ ├── drugbank_pre_filtered_mordred_qed_id_selfies.csv
│ └── splits_v2/ # Pre-split datasets
├── src/visualization/ # Comparison tools
│ └── comparison.py # Main comparison script
├── src/embedding_backend/ # Backend services
│ ├── main.py # FastAPI/MCP entry points
│ ├── config.py # Configuration
│ └── docker_manager.py # Container management
├── script/data_processing/ # Data utilities
│ ├── split_drugbank.py # Dataset splitting
│ ├── merge_splits.py # Dataset merging
│ └── add_ecfp4_tanimoto.py # Fingerprint generation
└── docker/ # Container configs
├── embedding-atlas.Dockerfile
└── docker-compose.yml
```
## Next Steps
1. **Choose your use case**: Single dataset visualization, dataset comparison, or backend orchestration
2. **Prepare your data**: Ensure CSV files have appropriate molecular structure columns
3. **Run the appropriate command** from the examples above
4. **Access results**: Web interface at specified ports or exported files
**Support**: All commands are tested with Embedding Atlas v0.13.0 on Python ≥3.12 using the uv package manager.