# AGENTS.md - Embedding Atlas Complete Usage Guide
This document gives AI agents complete instructions for using the Embedding Atlas cheminformatics visualization platform directly. The only prerequisite is the one-time environment setup below.
## Quick Start - For Immediate Use
### 1. Environment Setup (First Time Only)
```bash
# Install uv if not available
curl -LsSf https://astral.sh/uv/install.sh | sh
# Navigate to project directory (adjust the path to your own checkout)
cd /Users/lingyuzeng/project/embedding_atlas
# Create virtual environment and install dependencies
uv sync
```
### 2. Ready-to-Use Commands
#### Visualize Molecular Data (SMILES/SELFIES)
```bash
# Basic visualization with SMILES column
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles
# Interactive mode on custom port
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --port 8080 --interactive
# Export as standalone web application
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --export-application visualization.zip
```
#### Compare Two Datasets
```bash
# Compare two CSV files with molecular data
uv run python src/visualization/comparison.py file1.csv file2.csv \
--column1 smiles --column2 smiles \
--label1 "Dataset A" --label2 "Dataset B" \
--interactive --port 5055
# Generate static comparison image
uv run python src/visualization/comparison.py file1.csv file2.csv \
--column1 smiles --column2 smiles \
--output comparison.png
```
#### Start Backend Services
```bash
# Start FastAPI orchestrator (background)
uv run embedding-backend-api &
# Start MCP server (stdio mode; runs in the foreground)
uv run embedding-backend-mcp
# Test API endpoint (from another terminal)
curl http://localhost:9000/sessions
```
## Project Overview
**Embedding Atlas** is a complete cheminformatics visualization platform that:
- Creates interactive 2D embeddings of molecular data using sentence transformers
- Supports UMAP dimensionality reduction for chemical structures
- Provides web-based interactive visualization
- Includes dataset comparison tools
- Offers containerized session management
**Current Version**: Embedding Atlas v0.13.0
**Python Version**: ≥3.12
**Package Manager**: uv (cross-platform compatible)
## Complete Usage Examples
### Example 1: Single Dataset Visualization
```bash
# Visualize DrugBank dataset with SMILES
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
--text smiles \
--model all-MiniLM-L6-v2 \
--port 5055 \
--interactive
# Access visualization at http://localhost:5055
```
### Example 2: Dataset Comparison Workflow
```bash
# Step 1: Split dataset for comparison
uv run python script/split_drugbank.py \
--in-csv data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
--out-dir splits_v2 \
--train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1
# Step 2: Merge splits with labels
uv run python script/merge_splits.py \
--input-dir splits_v2/ \
--output data/drugbank_split_merge.csv
# Step 3: Visualize merged dataset
uv run embedding-atlas data/drugbank_split_merge.csv --text smiles
```
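The merge step in this workflow can be pictured as concatenating the per-split files while tagging each row with the split it came from. The sketch below illustrates that idea on in-memory CSV text. The helper `merge_with_labels` and the `split` column name are hypothetical, and the actual output schema of `merge_splits.py` may differ.

```python
import csv
import io


def merge_with_labels(splits: dict[str, str], label_column: str = "split") -> list[dict[str, str]]:
    """Concatenate CSV texts, tagging each row with the name of its split."""
    merged = []
    for label, csv_text in splits.items():
        for row in csv.DictReader(io.StringIO(csv_text)):
            row[label_column] = label  # hypothetical label column
            merged.append(row)
    return merged


# Toy inputs standing in for the per-split CSV files.
rows = merge_with_labels({
    "train": "smiles\nCCO\nCCN\n",
    "test": "smiles\nc1ccccc1\n",
})
print([(r["smiles"], r["split"]) for r in rows])
```

A label column of this kind is what lets the merged file be colored by split in the visualization.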
### Example 3: Advanced Comparison with Custom Parameters
```bash
# Compare datasets with custom UMAP parameters
uv run python src/visualization/comparison.py \
data/small_molecules.csv data/large_molecules.csv \
--column1 smiles --column2 smiles \
--label1 "Small Molecules" --label2 "Large Molecules" \
--model all-mpnet-base-v2 \
--batch-size 16 \
--umap-args '{"n_neighbors": 20, "min_dist": 0.2, "metric": "cosine"}' \
--interactive --port 8080
```
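The `--umap-args` value is a JSON object passed as a single shell argument, so quoting mistakes are easy to make. A minimal sketch for generating that argument, assuming the script parses the value as standard JSON. The helper `build_umap_args` is hypothetical, and the range checks reflect UMAP's usual constraints with its default `spread`.

```python
import json
import shlex


def build_umap_args(n_neighbors: int, min_dist: float, metric: str) -> str:
    """Return a shell-safe JSON string for the --umap-args flag."""
    if n_neighbors < 2:
        raise ValueError("n_neighbors must be at least 2")
    if not 0.0 <= min_dist <= 1.0:
        raise ValueError("min_dist is expected to lie in [0, 1]")
    payload = json.dumps(
        {"n_neighbors": n_neighbors, "min_dist": min_dist, "metric": metric}
    )
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return shlex.quote(payload)


print("--umap-args", build_umap_args(20, 0.2, "cosine"))
```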
### Example 4: Backend Orchestration
```bash
# Create a session (the orchestrator must already be running)
curl -X POST http://localhost:9000/sessions \
-H 'Content-Type: application/json' \
-d '{
"session_id": "demo-session",
"data_url": "https://example.com/molecules.csv",
"extra_args": ["--text", "smiles"],
"environment": {"HF_ENDPOINT": "https://hf-mirror.com"}
}'
# Check session status
curl http://localhost:9000/sessions
# Delete session when done
curl -X DELETE http://localhost:9000/sessions/demo-session
```
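The same session request can be assembled from Python when an agent needs to create sessions programmatically. This is a sketch using only the standard library. It builds the request without sending it, and it assumes the orchestrator listens on `http://localhost:9000` as in the commands above. The helper `build_session_request` is hypothetical.

```python
import json
import urllib.request

API_BASE = "http://localhost:9000"  # assumption: orchestrator address used in this guide


def build_session_request(session_id: str, data_url: str, text_column: str) -> urllib.request.Request:
    """Build (but do not send) a POST /sessions request."""
    body = {
        "session_id": session_id,
        "data_url": data_url,
        "extra_args": ["--text", text_column],
    }
    return urllib.request.Request(
        f"{API_BASE}/sessions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_session_request("demo-session", "https://example.com/molecules.csv", "smiles")
print(request.get_method(), request.full_url)
# To send it once the orchestrator is running:
#   with urllib.request.urlopen(request) as response:
#       print(response.status, response.read().decode())
```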
## Data Processing Pipeline
### 1. Prepare Molecular Data
```bash
# Add ECFP4 fingerprints for similarity analysis
uv run python script/data_processing/add_ecfp4_tanimoto.py input.csv output_with_fingerprints.csv
# Add macrocycle detection
uv run python script/data_processing/add_macrocycle_columns.py input.csv output_with_macrocycles.csv
```
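For readers unfamiliar with the similarity measure named in the fingerprint script, Tanimoto similarity between two fingerprints is the number of shared on-bits divided by the number of distinct on-bits. The sketch below shows the arithmetic on plain Python sets. It does not reproduce `add_ecfp4_tanimoto.py`, which presumably relies on a cheminformatics toolkit to generate real ECFP4 bits.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Toy fingerprints (made-up bit indices, not real ECFP4 output).
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 17}
print(tanimoto(fp_a, fp_b))  # 2 shared bits / 5 distinct bits = 0.4
```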
### 2. Structure-Aware Dataset Splitting
```bash
# Create train/val/test splits with scaffold diversity
uv run python script/split_drugbank.py \
--in-csv molecules.csv \
--out-dir splits/ \
--train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 \
--n_qed_bins 5 --n_mw_bins 5 \
--largest-first
```
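The three ratio flags must describe a complete partition of the dataset. The sketch below covers only that ratio arithmetic, under the assumption that ratios are expected to sum to 1.0. The scaffold, QED, and molecular-weight binning performed by `split_drugbank.py` is not reproduced here, and `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_rows: int, train: float, val: float, test: float) -> tuple[int, int, int]:
    """Turn split ratios into row counts that add up to n_rows."""
    if abs(train + val + test - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")
    n_train = int(n_rows * train)
    n_val = int(n_rows * val)
    # Give the remainder to the test split so no row is dropped by rounding.
    n_test = n_rows - n_train - n_val
    return n_train, n_val, n_test


print(split_sizes(1234, 0.8, 0.1, 0.1))  # (987, 123, 124)
```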
## Docker Deployment (Optional)
### Quick Docker Setup
```bash
# Build embedding atlas image
docker build -f docker/embedding-atlas.Dockerfile -t embedding-atlas:latest .
# Start full orchestration stack
docker compose -f docker/docker-compose.yml up --build
# Access services:
# - Orchestrator API: http://localhost:9000
# - Individual sessions: http://localhost:6000-6999
```
## Common Data Formats
### Supported Input Formats
- **CSV files** with molecular data
- **Parquet files** for large datasets
- **Hugging Face datasets** (via dataset name)
### Required Columns
- `smiles` or `SMILES`: Molecular structures
- `selfies`: Alternative molecular representation
- `id` or `compound_id`: Unique identifiers (optional)
### Generated Columns
- `projection_x`, `projection_y`: 2D coordinates
- `__neighbors`: Nearest neighbor information
## Environment Configuration
### Regional Access (China/Restricted Networks)
```bash
# Use Hugging Face mirror
export HF_ENDPOINT=https://hf-mirror.com
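# Optional: once models are cached locally, force offline mode.
# Leave this unset for the first run; offline mode blocks downloads, including from the mirror.
# export HF_HUB_OFFLINE=1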
# Use domestic PyPI mirror
export UV_DEFAULT_INDEX=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
### Docker Configuration
```bash
# Custom Docker settings
export EMBEDDING_DOCKER_URL=tcp://engine:2375
export EMBEDDING_CONTAINER_IMAGE=embedding-atlas:latest
export EMBEDDING_API_PORT=9000
```
## Troubleshooting Quick Reference
### Common Issues and Solutions
**Model Download Failures**
```bash
# Set mirror for model downloads
export HF_ENDPOINT=https://hf-mirror.com
uv run embedding-atlas data.csv --text smiles
```
**Port Already in Use**
```bash
# Use auto-port selection
uv run embedding-atlas data.csv --text smiles --auto-port
# Or specify different port
uv run embedding-atlas data.csv --text smiles --port 8080
```
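When a script needs to pick the port itself, one common approach is to ask the operating system for a free one and pass the result to `--port`. This is a sketch using the standard `socket` module; both helpers are hypothetical, and there is an unavoidable race window between finding a free port and the server binding to it.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


print(find_free_port())
```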
**Memory Issues with Large Datasets**
```bash
# Reduce batch size
uv run embedding-atlas large_dataset.csv --text smiles --batch-size 8
# Use sampling
uv run embedding-atlas large_dataset.csv --text smiles --sample 10000
```
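If a file is too large to load comfortably, rows can also be pre-sampled before visualization. The sketch below uses reservoir sampling (Algorithm R), which draws a uniform sample of `k` rows in a single pass without holding the whole file in memory. It is an independent illustration and says nothing about how the `--sample` flag is implemented internally.

```python
import random
from collections.abc import Iterable


def reservoir_sample(rows: Iterable[str], k: int, seed: int = 0) -> list[str]:
    """Uniformly sample up to k rows from a stream in one pass."""
    rng = random.Random(seed)
    reservoir: list[str] = []
    for index, row in enumerate(rows):
        if index < k:
            reservoir.append(row)  # fill the reservoir with the first k rows
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = row  # replace with probability k / (index + 1)
    return reservoir


print(reservoir_sample((f"row{i}" for i in range(1000)), 5))
```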
**Docker Connection Issues**
```bash
# Check Docker daemon (Linux hosts with systemd)
sudo systemctl status docker
# Test Docker connection
docker ps
```
## Performance Optimization
### For Large Datasets (>100k molecules)
```bash
# Use smaller batch size and sampling
uv run embedding-atlas large_dataset.csv \
--text smiles \
--batch-size 8 \
--sample 50000 \
--umap-n-neighbors 10
```
### For High-Quality Embeddings
```bash
# Use better model with optimized parameters
uv run embedding-atlas dataset.csv \
--text smiles \
--model all-mpnet-base-v2 \
--umap-n-neighbors 30 \
--umap-min-dist 0.1
```
## Validation Commands
### Test Installation
```bash
# Verify Embedding Atlas version
uv run python -c "import embedding_atlas; print(f'Version: {embedding_atlas.__version__}')"
# Test basic functionality
uv run embedding-atlas --help
# Test backend
uv run embedding-backend-api --help
```
### Test with Sample Data
```bash
# Quick visualization test
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
--text smiles \
--sample 100 \
--port 5055
```
## Project Structure Reference
```
/Users/lingyuzeng/project/embedding_atlas/
├── data/ # Sample datasets
│ ├── drugbank_pre_filtered_mordred_qed_id_selfies.csv
│ └── splits_v2/ # Pre-split datasets
├── src/visualization/ # Comparison tools
│ └── comparison.py # Main comparison script
├── src/embedding_backend/ # Backend services
│ ├── main.py # FastAPI/MCP entry points
│ ├── config.py # Configuration
│ └── docker_manager.py # Container management
├── script/data_processing/ # Data utilities
│ ├── split_drugbank.py # Dataset splitting
│ ├── merge_splits.py # Dataset merging
│ └── add_ecfp4_tanimoto.py # Fingerprint generation
└── docker/ # Container configs
    ├── embedding-atlas.Dockerfile
    └── docker-compose.yml
```
## Next Steps
1. **Choose your use case**: Single dataset visualization, dataset comparison, or backend orchestration
2. **Prepare your data**: Ensure CSV files have appropriate molecular structure columns
3. **Run the appropriate command** from the examples above
4. **Access results**: Web interface at specified ports or exported files
**Support**: All commands are tested with Embedding Atlas v0.13.0 on Python ≥3.12 using the uv package manager.