# AGENTS.md - Embedding Atlas Complete Usage Guide
This document gives AI agents complete, ready-to-run instructions for the Embedding Atlas cheminformatics visualization platform; after the one-time environment setup below, no further configuration is required.

## Quick Start - For Immediate Use

### 1. Environment Setup (First Time Only)

```bash
# Install uv if not available
curl -LsSf https://astral.sh/uv/install.sh | sh

# Navigate to project directory
cd /Users/lingyuzeng/project/embedding_atlas

# Create virtual environment and install dependencies
uv sync
```

### 2. Ready-to-Use Commands

#### Visualize Molecular Data (SMILES/SELFIES)

```bash
# Basic visualization with SMILES column
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles

# Interactive mode on a custom port
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --port 8080 --interactive

# Export as a standalone web application
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles --export-application visualization.zip
```

#### Compare Two Datasets

```bash
# Compare two CSV files with molecular data
uv run python src/visualization/comparison.py file1.csv file2.csv \
    --column1 smiles --column2 smiles \
    --label1 "Dataset A" --label2 "Dataset B" \
    --interactive --port 5055

# Generate static comparison image
uv run python src/visualization/comparison.py file1.csv file2.csv \
    --column1 smiles --column2 smiles \
    --output comparison.png
```

#### Start Backend Services

```bash
# Start FastAPI orchestrator (background)
uv run embedding-backend-api &

# Start MCP server (stdio mode)
uv run embedding-backend-mcp

# Test API endpoint
curl http://localhost:9000/sessions
```

## Project Overview

**Embedding Atlas** is a complete cheminformatics visualization platform that:

- Creates interactive 2D embeddings of molecular data using sentence transformers
- Supports UMAP dimensionality reduction for chemical structures
- Provides web-based interactive visualization
- Includes dataset comparison tools
- Offers containerized session management

- **Current Version**: Embedding Atlas v0.13.0
- **Python Version**: ≥3.12
- **Package Manager**: uv (cross-platform compatible)

## Complete Usage Examples

### Example 1: Single Dataset Visualization

```bash
# Visualize DrugBank dataset with SMILES
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
    --text smiles \
    --model all-MiniLM-L6-v2 \
    --port 5055 \
    --interactive

# Access visualization at http://localhost:5055
```

### Example 2: Dataset Comparison Workflow

```bash
# Step 1: Split dataset for comparison
uv run python script/split_drugbank.py \
    --in-csv data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
    --out-dir splits_v2 \
    --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1

# Step 2: Merge splits with labels
uv run python script/merge_splits.py \
    --input-dir splits_v2/ \
    --output data/drugbank_split_merge.csv

# Step 3: Visualize merged dataset
uv run embedding-atlas data/drugbank_split_merge.csv --text smiles
```

### Example 3: Advanced Comparison with Custom Parameters

```bash
# Compare datasets with custom UMAP parameters
uv run python src/visualization/comparison.py \
    data/small_molecules.csv data/large_molecules.csv \
    --column1 smiles --column2 smiles \
    --label1 "Small Molecules" --label2 "Large Molecules" \
    --model all-mpnet-base-v2 \
    --batch-size 16 \
    --umap-args '{"n_neighbors": 20, "min_dist": 0.2, "metric": "cosine"}' \
    --interactive --port 8080
```
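Because `--umap-args` takes a JSON string, a shell-quoting slip only surfaces after embeddings have already been computed. A small stdlib-only sketch can validate the string up front (`parse_umap_args` is a hypothetical helper, not part of `comparison.py`):

```python
import json

def parse_umap_args(raw: str) -> dict:
    """Parse and sanity-check a --umap-args JSON string before launching a run."""
    args = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(args, dict):
        raise ValueError("--umap-args must be a JSON object")
    return args

args = parse_umap_args('{"n_neighbors": 20, "min_dist": 0.2, "metric": "cosine"}')
print(args["n_neighbors"])  # → 20
```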

### Example 4: Backend Orchestration

```bash
# Create a session (the orchestrator from Quick Start must already be running)
curl -X POST http://localhost:9000/sessions \
    -H 'Content-Type: application/json' \
    -d '{
        "session_id": "demo-session",
        "data_url": "https://example.com/molecules.csv",
        "extra_args": ["--text", "smiles"],
        "environment": {"HF_ENDPOINT": "https://hf-mirror.com"}
    }'

# Check session status
curl http://localhost:9000/sessions

# Delete session when done
curl -X DELETE http://localhost:9000/sessions/demo-session
```
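The same create-session call can be scripted. A stdlib sketch that builds the request from the fields in the curl example above (the field names are assumed from that example); `urllib.request.urlopen(req)` would send it, but it is only constructed here:

```python
import json
import urllib.request

def make_session_request(session_id, data_url, extra_args=(), environment=None,
                         base="http://localhost:9000"):
    """Build the POST /sessions request shown in the curl example above."""
    body = {
        "session_id": session_id,
        "data_url": data_url,
        "extra_args": list(extra_args),
        "environment": environment or {},
    }
    return urllib.request.Request(
        f"{base}/sessions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = make_session_request("demo-session", "https://example.com/molecules.csv",
                           extra_args=["--text", "smiles"])
print(req.full_url)  # → http://localhost:9000/sessions
```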

## Data Processing Pipeline

### 1. Prepare Molecular Data

```bash
# Add ECFP4 fingerprints for similarity analysis
uv run python script/data_processing/add_ecfp4_tanimoto.py input.csv output_with_fingerprints.csv

# Add macrocycle detection
uv run python script/data_processing/add_macrocycle_columns.py input.csv output_with_macrocycles.csv
```

### 2. Structure-Aware Dataset Splitting

```bash
# Create train/val/test splits with scaffold diversity
uv run python script/split_drugbank.py \
    --in-csv molecules.csv \
    --out-dir splits/ \
    --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 \
    --n_qed_bins 5 --n_mw_bins 5 \
    --largest-first
```
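A common slip is passing ratios that do not sum to 1. A hypothetical stdlib pre-check, run before invoking the split script:

```python
import math

def check_split_ratios(train: float, val: float, test: float) -> None:
    """Fail fast if the --train/--val/--test ratios do not sum to 1."""
    total = train + val + test
    if not math.isclose(total, 1.0, rel_tol=0, abs_tol=1e-9):
        raise ValueError(f"split ratios sum to {total}, expected 1.0")

check_split_ratios(0.8, 0.1, 0.1)  # passes silently
```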

## Docker Deployment (Optional)

### Quick Docker Setup

```bash
# Build embedding atlas image
docker build -f docker/embedding-atlas.Dockerfile -t embedding-atlas:latest .

# Start full orchestration stack
docker compose -f docker/docker-compose.yml up --build

# Access services:
# - Orchestrator API: http://localhost:9000
# - Individual sessions: http://localhost:6000-6999
```

## Common Data Formats

### Supported Input Formats

- **CSV files** with molecular data
- **Parquet files** for large datasets
- **Hugging Face datasets** (via dataset name)

### Required Columns

- `smiles` or `SMILES`: Molecular structures
- `selfies`: Alternative molecular representation
- `id` or `compound_id`: Unique identifiers (optional)

### Generated Columns

- `projection_x`, `projection_y`: 2D coordinates
- `__neighbors`: Nearest neighbor information
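A quick pre-flight check that an input file actually carries one of the structure columns listed above can save a failed run. A stdlib sketch (`find_structure_column` is a hypothetical helper, not part of the CLI):

```python
import csv
import io

STRUCTURE_COLUMNS = ("smiles", "SMILES", "selfies")

def find_structure_column(csv_text: str) -> str:
    """Return the first recognized molecular-structure column in a CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    for name in STRUCTURE_COLUMNS:
        if name in header:
            return name
    raise ValueError(f"no structure column found; expected one of {STRUCTURE_COLUMNS}")

sample = "id,smiles,qed\nDB00001,CCO,0.41\n"
print(find_structure_column(sample))  # → smiles
```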

## Environment Configuration

### Regional Access (China/Restricted Networks)

```bash
# Use Hugging Face mirror
export HF_ENDPOINT=https://hf-mirror.com
# Force offline mode (set only after the models are already cached locally)
export HF_HUB_OFFLINE=1

# Use domestic PyPI mirror
export UV_PIP_INDEX_URL=https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
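If you prefer not to export the mirror settings shell-wide, they can be scoped to a single invocation. A sketch (the `mirror_env` wrapper is illustrative, not part of the project):

```python
import os
import subprocess

def mirror_env(offline: bool = False) -> dict:
    """Copy the current environment and add the Hugging Face mirror settings."""
    env = os.environ.copy()
    env["HF_ENDPOINT"] = "https://hf-mirror.com"
    if offline:  # only once models are already cached locally
        env["HF_HUB_OFFLINE"] = "1"
    return env

env = mirror_env()
# subprocess.run(["uv", "run", "embedding-atlas", "data.csv", "--text", "smiles"], env=env)
print(env["HF_ENDPOINT"])  # → https://hf-mirror.com
```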

### Docker Configuration

```bash
# Custom Docker settings
export EMBEDDING_DOCKER_URL=tcp://engine:2375
export EMBEDDING_CONTAINER_IMAGE=embedding-atlas:latest
export EMBEDDING_API_PORT=9000
```

## Troubleshooting Quick Reference

### Common Issues and Solutions

**Model Download Failures**

```bash
# Set mirror for model downloads
export HF_ENDPOINT=https://hf-mirror.com
uv run embedding-atlas data.csv --text smiles
```

**Port Already in Use**

```bash
# Use auto-port selection
uv run embedding-atlas data.csv --text smiles --auto-port

# Or specify a different port
uv run embedding-atlas data.csv --text smiles --port 8080
```
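When you need to know the port before launching (for example, to print the URL for a user), the OS can pick a free one. A minimal stdlib sketch — this mimics auto-selection, it is not how `--auto-port` is implemented internally:

```python
import socket

def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 = let the kernel pick
        return s.getsockname()[1]

# Note: a small race remains — the port could be claimed between this
# call and the actual server start.
port = find_free_port()
```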

**Memory Issues with Large Datasets**

```bash
# Reduce batch size
uv run embedding-atlas large_dataset.csv --text smiles --batch-size 8

# Use sampling
uv run embedding-atlas large_dataset.csv --text smiles --sample 10000
```
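`--sample` draws the subset at load time; if the raw file itself is unwieldy, you can pre-sample it once instead. A stdlib sketch (`sample_csv_rows` is a hypothetical helper; the header row is preserved and the seed makes the subset reproducible):

```python
import random

def sample_csv_rows(rows: list[list[str]], k: int, seed: int = 0) -> list[list[str]]:
    """Keep the header plus a reproducible random sample of up to k data rows."""
    header, data = rows[0], rows[1:]
    rng = random.Random(seed)
    return [header] + rng.sample(data, min(k, len(data)))

rows = [["id", "smiles"]] + [[str(i), "C" * (i % 5 + 1)] for i in range(1000)]
sampled = sample_csv_rows(rows, 100)
print(len(sampled))  # → 101 (header + 100 sampled rows)
```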

**Docker Connection Issues**

```bash
# Check Docker daemon
sudo systemctl status docker

# Test Docker connection
docker ps
```

## Performance Optimization

### For Large Datasets (>100k molecules)

```bash
# Use smaller batch size and sampling
uv run embedding-atlas large_dataset.csv \
    --text smiles \
    --batch-size 8 \
    --sample 50000 \
    --umap-n-neighbors 10
```

### For High-Quality Embeddings

```bash
# Use a stronger model with tuned UMAP parameters
uv run embedding-atlas dataset.csv \
    --text smiles \
    --model all-mpnet-base-v2 \
    --umap-n-neighbors 30 \
    --umap-min-dist 0.1
```

## Validation Commands

### Test Installation

```bash
# Verify Embedding Atlas version
uv run python -c "import embedding_atlas; print(f'Version: {embedding_atlas.__version__}')"

# Test basic functionality
uv run embedding-atlas --help

# Test backend
uv run embedding-backend-api --help
```

### Test with Sample Data

```bash
# Quick visualization test
uv run embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
    --text smiles \
    --sample 100 \
    --port 5055
```

## Project Structure Reference

```
/Users/lingyuzeng/project/embedding_atlas/
├── data/                        # Sample datasets
│   ├── drugbank_pre_filtered_mordred_qed_id_selfies.csv
│   └── splits_v2/               # Pre-split datasets
├── src/visualization/           # Comparison tools
│   └── comparison.py            # Main comparison script
├── src/embedding_backend/       # Backend services
│   ├── main.py                  # FastAPI/MCP entry points
│   ├── config.py                # Configuration
│   └── docker_manager.py        # Container management
├── script/data_processing/      # Data utilities
│   ├── split_drugbank.py        # Dataset splitting
│   ├── merge_splits.py          # Dataset merging
│   └── add_ecfp4_tanimoto.py    # Fingerprint generation
└── docker/                      # Container configs
    ├── embedding-atlas.Dockerfile
    └── docker-compose.yml
```

## Next Steps

1. **Choose your use case**: Single dataset visualization, dataset comparison, or backend orchestration
2. **Prepare your data**: Ensure CSV files have appropriate molecular structure columns
3. **Run the appropriate command** from the examples above
4. **Access results**: Web interface at the specified port, or exported files

**Support**: All commands are tested with Embedding Atlas v0.13.0 on Python ≥3.12 using the uv package manager.