1288 lines
37 KiB
Markdown
1288 lines
37 KiB
Markdown
# MacrolactoneDB Validation Implementation Plan
|
|
|
|
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
|
|
|
**Goal:** Create a validation script that samples 10% of MacrolactoneDB 12-20 membered rings, classifies them, fragments side chains with isotope tagging, and stores results in SQLite with visualizations.
|
|
|
|
**Architecture:** SQLModel ORM for database, RDKit for chemistry, PIL for visualization. Core is a `MacrolactoneValidator` class that orchestrates sampling, processing, and output generation.
|
|
|
|
**Tech Stack:** Python 3.12, SQLModel, RDKit, Pandas, Pixi (environment)
|
|
|
|
**Reference Design:** See `docs/plans/2026-03-19-macrolactone-validation-design.md` for full design details.
|
|
|
|
---
|
|
|
|
## Prerequisites
|
|
|
|
**Worktree:** This plan should be executed in a dedicated worktree.
|
|
|
|
**Design Doc:** Read `docs/plans/2026-03-19-macrolactone-validation-design.md` before starting.
|
|
|
|
---
|
|
|
|
## Task 1: Create Database Models (SQLModel)
|
|
|
|
**Files:**
|
|
- Create: `src/macro_lactone_toolkit/validation/__init__.py`
|
|
- Create: `src/macro_lactone_toolkit/validation/models.py`
|
|
|
|
**Context:** These models implement the schema from the design doc. `SideChainFragment` uses `dummy_isotope` field to store the cleavage position for reconstruction.
|
|
|
|
**Step 1: Create directory and __init__.py**
|
|
|
|
Run:
|
|
```bash
|
|
mkdir -p src/macro_lactone_toolkit/validation
|
|
touch src/macro_lactone_toolkit/validation/__init__.py
|
|
```
|
|
|
|
**Step 2: Write database models**
|
|
|
|
Create `src/macro_lactone_toolkit/validation/models.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
from datetime import datetime
|
|
from enum import Enum
|
|
from typing import List, Optional
|
|
|
|
from sqlmodel import Field, Relationship, SQLModel
|
|
|
|
|
|
class ClassificationType(str, Enum):
|
|
STANDARD = "standard_macrolactone"
|
|
NON_STANDARD = "non_standard_macrocycle"
|
|
NOT_MACROLACTONE = "not_macrolactone"
|
|
|
|
|
|
class ProcessingStatus(str, Enum):
|
|
PENDING = "pending"
|
|
SUCCESS = "success"
|
|
FAILED = "failed"
|
|
SKIPPED = "skipped"
|
|
|
|
|
|
class ParentMolecule(SQLModel, table=True):
|
|
"""Original molecule information."""
|
|
|
|
__tablename__ = "parent_molecules"
|
|
|
|
id: Optional[int] = Field(default=None, primary_key=True)
|
|
source_id: str = Field(index=True)
|
|
molecule_name: Optional[str] = None
|
|
smiles: str = Field(index=True)
|
|
classification: ClassificationType = Field(index=True)
|
|
ring_size: Optional[int] = Field(default=None, index=True)
|
|
primary_reason_code: Optional[str] = None
|
|
primary_reason_message: Optional[str] = None
|
|
processing_status: ProcessingStatus = Field(default=ProcessingStatus.PENDING)
|
|
error_message: Optional[str] = None
|
|
num_sidechains: Optional[int] = None
|
|
cleavage_positions: Optional[str] = None
|
|
numbered_image_path: Optional[str] = None
|
|
created_at: datetime = Field(default_factory=datetime.utcnow)
|
|
processed_at: Optional[datetime] = None
|
|
|
|
fragments: List["SideChainFragment"] = Relationship(back_populates="parent")
|
|
numbering: Optional["RingNumbering"] = Relationship(back_populates="parent")
|
|
|
|
|
|
class RingNumbering(SQLModel, table=True):
|
|
"""Ring numbering details."""
|
|
|
|
__tablename__ = "ring_numberings"
|
|
|
|
id: Optional[int] = Field(default=None, primary_key=True)
|
|
parent_id: int = Field(foreign_key="parent_molecules.id", unique=True)
|
|
ring_size: int
|
|
carbonyl_carbon_idx: int
|
|
ester_oxygen_idx: int
|
|
position_to_atom: str
|
|
atom_to_position: str
|
|
|
|
parent: Optional[ParentMolecule] = Relationship(back_populates="numbering")
|
|
|
|
|
|
class SideChainFragment(SQLModel, table=True):
|
|
"""Side chain fragments from cleavage."""
|
|
|
|
__tablename__ = "side_chain_fragments"
|
|
|
|
id: Optional[int] = Field(default=None, primary_key=True)
|
|
parent_id: int = Field(foreign_key="parent_molecules.id", index=True)
|
|
fragment_id: str = Field(index=True)
|
|
cleavage_position: int = Field(index=True)
|
|
attachment_atom_idx: int
|
|
attachment_atom_symbol: str
|
|
fragment_smiles_labeled: str
|
|
fragment_smiles_plain: str
|
|
dummy_isotope: int
|
|
atom_count: int
|
|
heavy_atom_count: int
|
|
molecular_weight: float
|
|
original_bond_type: str
|
|
image_path: Optional[str] = None
|
|
|
|
parent: Optional[ParentMolecule] = Relationship(back_populates="fragments")
|
|
|
|
|
|
class ValidationResult(SQLModel, table=True):
|
|
"""Manual validation records."""
|
|
|
|
__tablename__ = "validation_results"
|
|
|
|
id: Optional[int] = Field(default=None, primary_key=True)
|
|
parent_id: int = Field(foreign_key="parent_molecules.id")
|
|
numbering_correct: Optional[bool] = None
|
|
cleavage_correct: Optional[bool] = None
|
|
classification_correct: Optional[bool] = None
|
|
notes: Optional[str] = None
|
|
validated_by: Optional[str] = None
|
|
validated_at: Optional[datetime] = None
|
|
```
|
|
|
|
**Step 3: Verify SQLModel imports work**
|
|
|
|
Run:
|
|
```bash
|
|
pixi run python -c "from macro_lactone_toolkit.validation.models import ParentMolecule; print('Models OK')"
|
|
```
|
|
|
|
Expected: `Models OK`
|
|
|
|
**Step 4: Commit**
|
|
|
|
```bash
|
|
git add src/macro_lactone_toolkit/validation/
|
|
git commit -m "feat(validation): add SQLModel database models"
|
|
```
|
|
|
|
---
|
|
|
|
## Task 2: Create Database Connection Module
|
|
|
|
**Files:**
|
|
- Create: `src/macro_lactone_toolkit/validation/database.py`
|
|
|
|
**Context:** Provides SQLite engine, session context manager, and init function.
|
|
|
|
**Step 1: Write database module**
|
|
|
|
Create `src/macro_lactone_toolkit/validation/database.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
from contextlib import contextmanager
|
|
from pathlib import Path
|
|
|
|
from sqlmodel import Session, SQLModel, create_engine
|
|
|
|
|
|
def get_engine(db_path: str | Path):
|
|
"""Create SQLite engine."""
|
|
db_path = Path(db_path)
|
|
db_path.parent.mkdir(parents=True, exist_ok=True)
|
|
url = f"sqlite:///{db_path}"
|
|
return create_engine(url, echo=False)
|
|
|
|
|
|
@contextmanager
|
|
def get_session(engine):
|
|
"""Context manager for database sessions."""
|
|
with Session(engine) as session:
|
|
yield session
|
|
|
|
|
|
def init_database(engine):
|
|
"""Create all tables."""
|
|
SQLModel.metadata.create_all(engine)
|
|
```
|
|
|
|
**Step 2: Test database initialization**
|
|
|
|
Create test script `test_db.py`:
|
|
```python
|
|
from pathlib import Path
|
|
from macro_lactone_toolkit.validation.database import get_engine, init_database, get_session
|
|
from macro_lactone_toolkit.validation.models import ParentMolecule, ClassificationType
|
|
|
|
test_db = Path("/tmp/test_fragments.db")
|
|
if test_db.exists():
|
|
test_db.unlink()
|
|
|
|
engine = get_engine(test_db)
|
|
init_database(engine)
|
|
|
|
with get_session(engine) as session:
|
|
parent = ParentMolecule(
|
|
source_id="TEST001",
|
|
smiles="O=C1CCCCCCCCCCCCCCO1",
|
|
classification=ClassificationType.STANDARD,
|
|
ring_size=16,
|
|
)
|
|
session.add(parent)
|
|
session.commit()
|
|
print(f"Inserted parent with id: {parent.id}")
|
|
|
|
print("Database test passed!")
|
|
```
|
|
|
|
Run:
|
|
```bash
|
|
pixi run python test_db.py
|
|
```
|
|
|
|
Expected:
|
|
```
|
|
Inserted parent with id: 1
|
|
Database test passed!
|
|
```
|
|
|
|
**Step 3: Cleanup and commit**
|
|
|
|
Run:
|
|
```bash
|
|
rm test_db.py /tmp/test_fragments.db
|
|
git add src/macro_lactone_toolkit/validation/database.py
|
|
git commit -m "feat(validation): add database connection module"
|
|
```
|
|
|
|
---
|
|
|
|
## Task 3: Create Isotope Tagging Utilities
|
|
|
|
**Files:**
|
|
- Create: `src/macro_lactone_toolkit/validation/isotope_utils.py`
|
|
|
|
**Context:** Implements isotope tagging inspired by Molassembler. Uses cleavage position as isotope value.
|
|
|
|
**Step 1: Write isotope utilities**
|
|
|
|
Create `src/macro_lactone_toolkit/validation/isotope_utils.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
from rdkit import Chem
|
|
|
|
|
|
def build_fragment_with_isotope(
|
|
mol: Chem.Mol,
|
|
side_chain_atoms: list[int],
|
|
side_chain_start_idx: int,
|
|
ring_atom_idx: int,
|
|
cleavage_position: int,
|
|
) -> tuple[str, str, str]:
|
|
"""
|
|
Build fragment SMILES with isotope tagging.
|
|
|
|
Returns:
|
|
Tuple of (labeled_smiles, plain_smiles, bond_type)
|
|
"""
|
|
# Get original bond type
|
|
bond = mol.GetBondBetweenAtoms(ring_atom_idx, side_chain_start_idx)
|
|
bond_type = bond.GetBondType().name if bond else "SINGLE"
|
|
|
|
# Create editable molecule
|
|
emol = Chem.EditableMol(Chem.Mol(mol))
|
|
|
|
# Add dummy atom with isotope = cleavage position
|
|
dummy_atom = Chem.Atom(0)
|
|
dummy_atom.SetIsotope(cleavage_position)
|
|
dummy_idx = emol.AddAtom(dummy_atom)
|
|
|
|
# Add bond between dummy and side chain start
|
|
emol.AddBond(dummy_idx, side_chain_start_idx, bond.GetBondType())
|
|
|
|
# Remove ring atom to side chain bond (will be reconnected via dummy)
|
|
# Actually, we keep the side chain atoms and dummy, remove everything else
|
|
|
|
# Determine atoms to keep
|
|
atoms_to_keep = set([dummy_idx, side_chain_start_idx] + list(side_chain_atoms))
|
|
|
|
# Remove atoms not in keep list
|
|
# Need to remove in reverse order to maintain valid indices
|
|
all_atoms = list(range(mol.GetNumAtoms()))
|
|
atoms_to_remove = [i for i in all_atoms if i not in atoms_to_keep]
|
|
|
|
for atom_idx in sorted(atoms_to_remove, reverse=True):
|
|
emol.RemoveAtom(atom_idx)
|
|
|
|
fragment = emol.GetMol()
|
|
Chem.SanitizeMol(fragment)
|
|
|
|
# Get labeled SMILES (with isotope)
|
|
labeled_smiles = Chem.MolToSmiles(fragment)
|
|
|
|
# Get plain SMILES (without isotope)
|
|
plain_fragment = Chem.Mol(fragment)
|
|
for atom in plain_fragment.GetAtoms():
|
|
if atom.GetIsotope() > 0:
|
|
atom.SetIsotope(0)
|
|
plain_smiles = Chem.MolToSmiles(plain_fragment)
|
|
|
|
return labeled_smiles, plain_smiles, bond_type
|
|
|
|
|
|
def extract_isotope_position(fragment_smiles: str) -> int:
|
|
"""Extract cleavage position from fragment SMILES."""
|
|
mol = Chem.MolFromSmiles(fragment_smiles)
|
|
if mol is None:
|
|
return 0
|
|
|
|
for atom in mol.GetAtoms():
|
|
if atom.GetAtomicNum() == 0 and atom.GetIsotope() > 0:
|
|
return atom.GetIsotope()
|
|
return 0
|
|
```
|
|
|
|
**Step 2: Write test**
|
|
|
|
Create `tests/validation/test_isotope_utils.py`:
|
|
|
|
```python
|
|
import pytest
|
|
from rdkit import Chem
|
|
|
|
from macro_lactone_toolkit.validation.isotope_utils import (
|
|
build_fragment_with_isotope,
|
|
extract_isotope_position,
|
|
)
|
|
|
|
|
|
def test_build_fragment_with_isotope():
|
|
# Create a simple test molecule: ethyl group attached to position 5
|
|
mol = Chem.MolFromSmiles("CCCC(CC)CCC") # Position 4 (0-indexed) has ethyl
|
|
assert mol is not None
|
|
|
|
side_chain_atoms = [4, 5] # The ethyl group atoms
|
|
side_chain_start = 4
|
|
ring_atom = 3
|
|
cleavage_pos = 5
|
|
|
|
labeled, plain, bond_type = build_fragment_with_isotope(
|
|
mol, side_chain_atoms, side_chain_start, ring_atom, cleavage_pos
|
|
)
|
|
|
|
assert labeled is not None
|
|
assert plain is not None
|
|
assert bond_type == "SINGLE"
|
|
|
|
# Check isotope was set
|
|
extracted_pos = extract_isotope_position(labeled)
|
|
assert extracted_pos == cleavage_pos
|
|
|
|
# Plain should have no isotope
|
|
extracted_plain = extract_isotope_position(plain)
|
|
assert extracted_plain == 0
|
|
```
|
|
|
|
**Step 3: Run test**
|
|
|
|
Run:
|
|
```bash
|
|
pixi run pytest tests/validation/test_isotope_utils.py -v
|
|
```
|
|
|
|
Expected: `test_build_fragment_with_isotope PASSED`
|
|
|
|
**Step 4: Commit**
|
|
|
|
```bash
|
|
git add src/macro_lactone_toolkit/validation/isotope_utils.py tests/validation/test_isotope_utils.py
|
|
git commit -m "feat(validation): add isotope tagging utilities"
|
|
```
|
|
|
|
---
|
|
|
|
## Task 4: Create Stratified Sampling Module
|
|
|
|
**Files:**
|
|
- Create: `src/macro_lactone_toolkit/validation/sampling.py`
|
|
|
|
**Context:** Implements 10% stratified sampling by ring size (12-20).
|
|
|
|
**Step 1: Write sampling module**
|
|
|
|
Create `src/macro_lactone_toolkit/validation/sampling.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
import pandas as pd
|
|
|
|
from macro_lactone_toolkit import MacroLactoneAnalyzer
|
|
|
|
|
|
def stratified_sample_by_ring_size(
|
|
df: pd.DataFrame,
|
|
sample_ratio: float,
|
|
smiles_col: str = "smiles",
|
|
random_state: int = 42,
|
|
) -> pd.DataFrame:
|
|
"""
|
|
Perform stratified sampling by ring size.
|
|
|
|
First classifies all molecules, then samples 10% from each ring size layer.
|
|
"""
|
|
analyzer = MacroLactoneAnalyzer()
|
|
|
|
# Classify all molecules
|
|
classifications = []
|
|
ring_sizes = []
|
|
|
|
for smiles in df[smiles_col]:
|
|
result = analyzer.classify_macrocycle(smiles)
|
|
classifications.append(result.classification)
|
|
ring_sizes.append(result.ring_size)
|
|
|
|
df = df.copy()
|
|
df["_classification"] = classifications
|
|
df["_ring_size"] = ring_sizes
|
|
|
|
# Group by ring size and sample from each group
|
|
sampled_groups = []
|
|
|
|
for ring_size in range(12, 21):
|
|
group = df[df["_ring_size"] == ring_size]
|
|
if len(group) > 0:
|
|
n_samples = max(1, int(len(group) * sample_ratio))
|
|
sampled = group.sample(n=min(n_samples, len(group)), random_state=random_state)
|
|
sampled_groups.append(sampled)
|
|
|
|
# Also sample from unknown ring size (None)
|
|
unknown_group = df[df["_ring_size"].isna()]
|
|
if len(unknown_group) > 0:
|
|
n_samples = max(1, int(len(unknown_group) * sample_ratio))
|
|
sampled = unknown_group.sample(n=min(n_samples, len(unknown_group)), random_state=random_state)
|
|
sampled_groups.append(sampled)
|
|
|
|
if not sampled_groups:
|
|
return pd.DataFrame()
|
|
|
|
result = pd.concat(sampled_groups, ignore_index=True)
|
|
return result
|
|
```
|
|
|
|
**Step 2: Create test**
|
|
|
|
Create `tests/validation/test_sampling.py`:
|
|
|
|
```python
|
|
import pandas as pd
|
|
import pytest
|
|
|
|
from macro_lactone_toolkit.validation.sampling import stratified_sample_by_ring_size
|
|
|
|
|
|
def test_stratified_sample():
|
|
# Create test data with known ring sizes
|
|
data = {
|
|
"smiles": [
|
|
"O=C1CCCCCCCCCCCCCCO1", # 16-membered
|
|
"O=C1CCCCCCCCCCCCO1", # 14-membered
|
|
"O=C1CCCCCCCCCCCCCCCCO1", # 18-membered
|
|
],
|
|
"id": ["A", "B", "C"],
|
|
}
|
|
df = pd.DataFrame(data)
|
|
|
|
sampled = stratified_sample_by_ring_size(df, sample_ratio=0.5, random_state=42)
|
|
|
|
# Should get at least 1 from each ring size (50% of 1 = 1)
|
|
assert len(sampled) >= 1
|
|
assert len(sampled) <= 3
|
|
```
|
|
|
|
**Step 3: Run test**
|
|
|
|
Run:
|
|
```bash
|
|
pixi run pytest tests/validation/test_sampling.py -v
|
|
```
|
|
|
|
Expected: `test_stratified_sample PASSED`
|
|
|
|
**Step 4: Commit**
|
|
|
|
```bash
|
|
git add src/macro_lactone_toolkit/validation/sampling.py tests/validation/test_sampling.py
|
|
git commit -m "feat(validation): add stratified sampling by ring size"
|
|
```
|
|
|
|
---
|
|
|
|
## Task 5: Create Visualization Output Module
|
|
|
|
**Files:**
|
|
- Create: `src/macro_lactone_toolkit/validation/visualization_output.py`
|
|
|
|
**Context:** Handles saving numbered molecule images and fragment images to organized directory structure.
|
|
|
|
**Step 1: Write visualization output module**
|
|
|
|
Create `src/macro_lactone_toolkit/validation/visualization_output.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
from pathlib import Path
|
|
|
|
from rdkit import Chem
|
|
|
|
from macro_lactone_toolkit.visualization import save_numbered_molecule_png, save_fragment_png
|
|
|
|
|
|
def get_output_paths(output_dir: Path, source_id: str, ring_size: int, classification: str) -> dict:
|
|
"""Get organized output paths for a molecule."""
|
|
ring_dir = output_dir / f"ring_size_{ring_size}"
|
|
|
|
if classification == "standard_macrolactone":
|
|
base_dir = ring_dir / "standard"
|
|
numbered_dir = base_dir / "numbered"
|
|
sidechains_dir = base_dir / "sidechains" / source_id
|
|
elif classification == "non_standard_macrocycle":
|
|
base_dir = ring_dir / "non_standard" / "original"
|
|
numbered_dir = base_dir
|
|
sidechains_dir = None
|
|
else:
|
|
base_dir = ring_dir / "rejected" / "original"
|
|
numbered_dir = base_dir
|
|
sidechains_dir = None
|
|
|
|
numbered_dir.mkdir(parents=True, exist_ok=True)
|
|
if sidechains_dir:
|
|
sidechains_dir.mkdir(parents=True, exist_ok=True)
|
|
|
|
return {
|
|
"numbered_image": numbered_dir / f"{source_id}_numbered.png",
|
|
"sidechains_dir": sidechains_dir,
|
|
}
|
|
|
|
|
|
def save_numbered_molecule(
|
|
smiles: str,
|
|
output_path: Path,
|
|
ring_size: int | None = None,
|
|
size: tuple[int, int] = (800, 800),
|
|
) -> Path | None:
|
|
"""Save numbered molecule image."""
|
|
try:
|
|
return save_numbered_molecule_png(
|
|
smiles,
|
|
output_path,
|
|
ring_size=ring_size,
|
|
size=size,
|
|
)
|
|
except Exception as e:
|
|
print(f"Failed to save numbered image: {e}")
|
|
return None
|
|
|
|
|
|
def save_fragment_images(
|
|
fragments: list,
|
|
output_dir: Path,
|
|
source_id: str,
|
|
size: tuple[int, int] = (400, 400),
|
|
) -> list[str]:
|
|
"""Save fragment images and return paths."""
|
|
paths = []
|
|
|
|
for i, fragment in enumerate(fragments):
|
|
try:
|
|
output_path = output_dir / f"{source_id}_frag_{i}_pos{fragment.cleavage_position}.png"
|
|
save_fragment_png(fragment.fragment_smiles_plain, output_path, size=size)
|
|
paths.append(str(output_path.relative_to(output_dir.parent.parent)))
|
|
except Exception as e:
|
|
print(f"Failed to save fragment {i}: {e}")
|
|
paths.append(None)
|
|
|
|
return paths
|
|
```
|
|
|
|
**Step 2: Commit**
|
|
|
|
```bash
|
|
git add src/macro_lactone_toolkit/validation/visualization_output.py
|
|
git commit -m "feat(validation): add visualization output module"
|
|
```
|
|
|
|
---
|
|
|
|
## Task 6: Create Main Validator Class
|
|
|
|
**Files:**
|
|
- Create: `src/macro_lactone_toolkit/validation/validator.py`
|
|
|
|
**Context:** Core orchestrator that processes molecules, stores results in database, and generates visualizations.
|
|
|
|
**Step 1: Write validator class**
|
|
|
|
Create `src/macro_lactone_toolkit/validation/validator.py`:
|
|
|
|
```python
|
|
from __future__ import annotations
|
|
|
|
import json
|
|
from datetime import datetime
|
|
from pathlib import Path
|
|
|
|
import pandas as pd
|
|
from rdkit import Chem
|
|
from rdkit.Chem import Descriptors
|
|
|
|
from macro_lactone_toolkit import MacroLactoneAnalyzer, MacrolactoneFragmenter
|
|
from macro_lactone_toolkit.validation.database import get_engine, get_session, init_database
|
|
from macro_lactone_toolkit.validation.isotope_utils import build_fragment_with_isotope
|
|
from macro_lactone_toolkit.validation.models import (
|
|
ClassificationType,
|
|
ParentMolecule,
|
|
ProcessingStatus,
|
|
RingNumbering,
|
|
SideChainFragment,
|
|
)
|
|
from macro_lactone_toolkit.validation.sampling import stratified_sample_by_ring_size
|
|
from macro_lactone_toolkit.validation.visualization_output import (
|
|
get_output_paths,
|
|
save_fragment_images,
|
|
save_numbered_molecule,
|
|
)
|
|
|
|
|
|
class MacrolactoneValidator:
|
|
"""Validates macrolactone database with sampling and fragmentation."""
|
|
|
|
def __init__(
|
|
self,
|
|
output_dir: str | Path,
|
|
sample_ratio: float = 0.1,
|
|
smiles_col: str = "smiles",
|
|
id_col: str = "IDs",
|
|
):
|
|
self.output_dir = Path(output_dir)
|
|
self.sample_ratio = sample_ratio
|
|
self.smiles_col = smiles_col
|
|
self.id_col = id_col
|
|
|
|
self.analyzer = MacroLactoneAnalyzer()
|
|
self.fragmenter = MacrolactoneFragmenter()
|
|
|
|
# Initialize database
|
|
self.db_path = self.output_dir / "fragments.db"
|
|
self.engine = get_engine(self.db_path)
|
|
init_database(self.engine)
|
|
|
|
def run(self, input_csv: str | Path) -> dict:
|
|
"""Run validation on input CSV."""
|
|
# Load data
|
|
df = pd.read_csv(input_csv)
|
|
print(f"Loaded {len(df)} molecules from {input_csv}")
|
|
|
|
# Stratified sampling
|
|
print(f"Performing stratified sampling (ratio={self.sample_ratio})...")
|
|
sampled = stratified_sample_by_ring_size(df, self.sample_ratio, self.smiles_col)
|
|
print(f"Sampled {len(sampled)} molecules")
|
|
|
|
# Process each molecule
|
|
results = {"total": len(sampled), "success": 0, "failed": 0, "skipped": 0}
|
|
|
|
for idx, row in sampled.iterrows():
|
|
status = self._process_molecule(row)
|
|
results[status] += 1
|
|
if (idx + 1) % 100 == 0:
|
|
print(f"Processed {idx + 1}/{len(sampled)} molecules")
|
|
|
|
# Generate summary
|
|
self._generate_summary()
|
|
|
|
return results
|
|
|
|
def _process_molecule(self, row: pd.Series) -> str:
|
|
"""Process a single molecule. Returns status."""
|
|
source_id = str(row[self.id_col])
|
|
smiles = row[self.smiles_col]
|
|
name = row.get("molecule_pref_name", None)
|
|
|
|
# Classify
|
|
classification_result = self.analyzer.classify_macrocycle(smiles)
|
|
classification = ClassificationType(classification_result.classification)
|
|
ring_size = classification_result.ring_size
|
|
|
|
# Create parent record
|
|
parent = ParentMolecule(
|
|
source_id=source_id,
|
|
molecule_name=name,
|
|
smiles=smiles,
|
|
classification=classification,
|
|
ring_size=ring_size,
|
|
primary_reason_code=classification_result.primary_reason_code,
|
|
primary_reason_message=classification_result.primary_reason_message,
|
|
)
|
|
|
|
with get_session(self.engine) as session:
|
|
session.add(parent)
|
|
session.commit()
|
|
session.refresh(parent)
|
|
|
|
# Skip non-standard molecules
|
|
if classification != ClassificationType.STANDARD:
|
|
parent.processing_status = ProcessingStatus.SKIPPED
|
|
session.add(parent)
|
|
session.commit()
|
|
self._save_original_image(smiles, source_id, ring_size, classification)
|
|
return "skipped"
|
|
|
|
# Process standard macrolactone
|
|
try:
|
|
self._process_standard_macrolactone(session, parent, smiles)
|
|
return "success"
|
|
except Exception as e:
|
|
parent.processing_status = ProcessingStatus.FAILED
|
|
parent.error_message = str(e)
|
|
parent.processed_at = datetime.utcnow()
|
|
session.add(parent)
|
|
session.commit()
|
|
return "failed"
|
|
|
|
def _process_standard_macrolactone(self, session, parent: ParentMolecule, smiles: str):
|
|
"""Process a standard macrolactone."""
|
|
# Get numbering
|
|
numbering = self.fragmenter.number_molecule(smiles)
|
|
|
|
# Save numbering to database
|
|
numbering_record = RingNumbering(
|
|
parent_id=parent.id,
|
|
ring_size=numbering.ring_size,
|
|
carbonyl_carbon_idx=numbering.carbonyl_carbon_idx,
|
|
ester_oxygen_idx=numbering.ester_oxygen_idx,
|
|
position_to_atom=json.dumps(numbering.position_to_atom),
|
|
atom_to_position=json.dumps(numbering.atom_to_position),
|
|
)
|
|
session.add(numbering_record)
|
|
|
|
# Save numbered image
|
|
paths = get_output_paths(
|
|
self.output_dir, parent.source_id, parent.ring_size, "standard_macrolactone"
|
|
)
|
|
image_path = save_numbered_molecule(smiles, paths["numbered_image"], parent.ring_size)
|
|
if image_path:
|
|
parent.numbered_image_path = str(image_path.relative_to(self.output_dir))
|
|
|
|
# Fragment side chains
|
|
mol = Chem.MolFromSmiles(smiles)
|
|
ring_atom_set = set(numbering.ring_atoms)
|
|
fragments = []
|
|
fragment_idx = 0
|
|
|
|
from macro_lactone_toolkit._core import collect_side_chain_atoms, is_intrinsic_lactone_neighbor
|
|
|
|
for position, ring_atom_idx in numbering.position_to_atom.items():
|
|
ring_atom = mol.GetAtomWithIdx(ring_atom_idx)
|
|
|
|
for neighbor in ring_atom.GetNeighbors():
|
|
neighbor_idx = neighbor.GetIdx()
|
|
|
|
# Skip ring atoms and intrinsic lactone neighbors
|
|
if neighbor_idx in ring_atom_set:
|
|
continue
|
|
if is_intrinsic_lactone_neighbor(mol, numbering, ring_atom_idx, neighbor_idx):
|
|
continue
|
|
|
|
# Collect side chain atoms
|
|
side_chain_atoms = collect_side_chain_atoms(mol, neighbor_idx, ring_atom_set)
|
|
if not side_chain_atoms:
|
|
continue
|
|
|
|
# Build fragment with isotope tagging
|
|
labeled_smiles, plain_smiles, bond_type = build_fragment_with_isotope(
|
|
mol, side_chain_atoms, neighbor_idx, ring_atom_idx, position
|
|
)
|
|
|
|
# Calculate properties
|
|
plain_mol = Chem.MolFromSmiles(plain_smiles)
|
|
if plain_mol is None:
|
|
continue
|
|
|
|
atom_count = sum(1 for a in plain_mol.GetAtoms() if a.GetAtomicNum() != 0)
|
|
heavy_atom_count = sum(1 for a in plain_mol.GetAtoms() if a.GetAtomicNum() not in [0, 1])
|
|
mw = Descriptors.MolWt(plain_mol)
|
|
|
|
# Create fragment record
|
|
fragment = SideChainFragment(
|
|
parent_id=parent.id,
|
|
fragment_id=f"{parent.source_id}_frag_{fragment_idx}",
|
|
cleavage_position=position,
|
|
attachment_atom_idx=ring_atom_idx,
|
|
attachment_atom_symbol=ring_atom.GetSymbol(),
|
|
fragment_smiles_labeled=labeled_smiles,
|
|
fragment_smiles_plain=plain_smiles,
|
|
dummy_isotope=position,
|
|
atom_count=atom_count,
|
|
heavy_atom_count=heavy_atom_count,
|
|
molecular_weight=round(mw, 4),
|
|
original_bond_type=bond_type,
|
|
)
|
|
session.add(fragment)
|
|
fragments.append(fragment)
|
|
fragment_idx += 1
|
|
|
|
# Save fragment images
|
|
if fragments and paths["sidechains_dir"]:
|
|
image_paths = save_fragment_images(fragments, paths["sidechains_dir"], parent.source_id)
|
|
for frag, img_path in zip(fragments, image_paths):
|
|
frag.image_path = img_path
|
|
session.add(frag)
|
|
|
|
# Update parent record
|
|
parent.processing_status = ProcessingStatus.SUCCESS
|
|
parent.num_sidechains = len(fragments)
|
|
parent.cleavage_positions = json.dumps([f.cleavage_position for f in fragments])
|
|
parent.processed_at = datetime.utcnow()
|
|
session.add(parent)
|
|
session.commit()
|
|
|
|
def _save_original_image(self, smiles: str, source_id: str, ring_size: int, classification: ClassificationType):
|
|
"""Save original image for non-standard molecules."""
|
|
paths = get_output_paths(self.output_dir, source_id, ring_size, classification.value)
|
|
try:
|
|
from rdkit.Chem import Draw
|
|
mol = Chem.MolFromSmiles(smiles)
|
|
if mol:
|
|
Draw.MolToFile(mol, str(paths["numbered_image"]), size=(400, 400))
|
|
except Exception:
|
|
pass
|
|
|
|
def _generate_summary(self):
|
|
"""Generate summary CSV and statistics."""
|
|
with get_session(self.engine) as session:
|
|
# Query all parents
|
|
from sqlmodel import select
|
|
statement = select(ParentMolecule)
|
|
parents = session.exec(statement).all()
|
|
|
|
# Convert to DataFrame
|
|
data = []
|
|
for p in parents:
|
|
data.append({
|
|
"id": p.id,
|
|
"source_id": p.source_id,
|
|
"molecule_name": p.molecule_name,
|
|
"smiles": p.smiles,
|
|
"classification": p.classification.value,
|
|
"ring_size": p.ring_size,
|
|
"primary_reason_code": p.primary_reason_code,
|
|
"primary_reason_message": p.primary_reason_message,
|
|
"processing_status": p.processing_status.value,
|
|
"error_message": p.error_message,
|
|
"num_sidechains": p.num_sidechains,
|
|
"cleavage_positions": p.cleavage_positions,
|
|
"numbered_image_path": p.numbered_image_path,
|
|
"processed_at": p.processed_at,
|
|
})
|
|
|
|
df = pd.DataFrame(data)
|
|
df.to_csv(self.output_dir / "summary.csv", index=False)
|
|
|
|
# Generate statistics
|
|
stats = {
|
|
"total_molecules": len(parents),
|
|
"by_classification": df["classification"].value_counts().to_dict(),
|
|
"by_ring_size": df[df["ring_size"].notna()]["ring_size"].value_counts().to_dict(),
|
|
"by_status": df["processing_status"].value_counts().to_dict(),
|
|
}
|
|
|
|
with open(self.output_dir / "summary_statistics.json", "w") as f:
|
|
json.dump(stats, f, indent=2, default=str)
|
|
|
|
print(f"\nSummary saved to {self.output_dir / 'summary.csv'}")
|
|
print(f"Statistics: {stats}")
|
|
```
|
|
|
|
**Step 2: Commit**
|
|
|
|
```bash
|
|
git add src/macro_lactone_toolkit/validation/validator.py
|
|
git commit -m "feat(validation): add main validator class"
|
|
```
|
|
|
|
---
|
|
|
|
## Task 7: Create CLI Script
|
|
|
|
**Files:**
|
|
- Create: `scripts/validate_macrolactone_db.py`
|
|
|
|
**Context:** Entry point script that uses pixi environment.
|
|
|
|
**Step 1: Write CLI script**
|
|
|
|
Create `scripts/validate_macrolactone_db.py`:
|
|
|
|
```python
|
|
#!/usr/bin/env python3
|
|
"""
|
|
Validate MacrolactoneDB 12-20 membered rings.
|
|
|
|
Usage:
|
|
pixi run python scripts/validate_macrolactone_db.py \
|
|
--input data/MacrolactoneDB/ring12_20/temp.csv \
|
|
--output validation_output \
|
|
--sample-ratio 0.1
|
|
"""
|
|
|
|
import argparse
|
|
import sys
|
|
from pathlib import Path
|
|
|
|
# Add src to path
|
|
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
|
|
|
|
from macro_lactone_toolkit.validation.validator import MacrolactoneValidator
|
|
|
|
|
|
def main():
|
|
parser = argparse.ArgumentParser(
|
|
description="Validate MacrolactoneDB 12-20 membered rings"
|
|
)
|
|
parser.add_argument(
|
|
"--input",
|
|
type=str,
|
|
default="data/MacrolactoneDB/ring12_20/temp.csv",
|
|
help="Input CSV file path",
|
|
)
|
|
parser.add_argument(
|
|
"--output",
|
|
type=str,
|
|
default="validation_output",
|
|
help="Output directory",
|
|
)
|
|
parser.add_argument(
|
|
"--sample-ratio",
|
|
type=float,
|
|
default=0.1,
|
|
help="Sampling ratio (0.0-1.0)",
|
|
)
|
|
parser.add_argument(
|
|
"--smiles-col",
|
|
type=str,
|
|
default="smiles",
|
|
help="SMILES column name",
|
|
)
|
|
parser.add_argument(
|
|
"--id-col",
|
|
type=str,
|
|
default="IDs",
|
|
help="ID column name",
|
|
)
|
|
|
|
args = parser.parse_args()
|
|
|
|
print("=" * 60)
|
|
print("MacrolactoneDB Validation")
|
|
print("=" * 60)
|
|
print(f"Input: {args.input}")
|
|
print(f"Output: {args.output}")
|
|
print(f"Sample ratio: {args.sample_ratio}")
|
|
print("=" * 60)
|
|
|
|
validator = MacrolactoneValidator(
|
|
output_dir=args.output,
|
|
sample_ratio=args.sample_ratio,
|
|
smiles_col=args.smiles_col,
|
|
id_col=args.id_col,
|
|
)
|
|
|
|
results = validator.run(args.input)
|
|
|
|
print("\n" + "=" * 60)
|
|
print("Validation Complete")
|
|
print("=" * 60)
|
|
print(f"Total processed: {results['total']}")
|
|
print(f"Success: {results['success']}")
|
|
print(f"Failed: {results['failed']}")
|
|
print(f"Skipped: {results['skipped']}")
|
|
print("=" * 60)
|
|
|
|
return 0
|
|
|
|
|
|
if __name__ == "__main__":
|
|
sys.exit(main())
|
|
```
|
|
|
|
Make executable:
|
|
```bash
|
|
chmod +x scripts/validate_macrolactone_db.py
|
|
```
|
|
|
|
**Step 2: Test help message**
|
|
|
|
Run:
|
|
```bash
|
|
pixi run python scripts/validate_macrolactone_db.py --help
|
|
```
|
|
|
|
Expected: Shows help message with all arguments.
|
|
|
|
**Step 3: Commit**
|
|
|
|
```bash
|
|
git add scripts/validate_macrolactone_db.py
|
|
git commit -m "feat(validation): add CLI entry point script"
|
|
```
|
|
|
|
---
|
|
|
|
## Task 8: Create Output Directory README
|
|
|
|
**Files:**
|
|
- Create: Template for `validation_output/README.md` (generated by validator)
|
|
|
|
**Context:** README explaining the output directory structure.
|
|
|
|
**Step 1: Add README generation to validator**
|
|
|
|
Add method to `validator.py` before `_generate_summary`:
|
|
|
|
```python
|
|
def _generate_readme(self):
|
|
"""Generate README explaining output structure."""
|
|
readme_content = """# MacrolactoneDB Validation Output
|
|
|
|
This directory contains validation results for MacrolactoneDB 12-20 membered rings.
|
|
|
|
## Directory Structure
|
|
|
|
```
|
|
validation_output/
|
|
├── README.md # This file
|
|
├── fragments.db # SQLite database with all data
|
|
├── summary.csv # Summary of all processed molecules
|
|
├── summary_statistics.json # Statistical summary
|
|
│
|
|
├── ring_size_12/ # 12-membered rings
|
|
├── ring_size_13/ # 13-membered rings
|
|
...
|
|
└── ring_size_20/ # 20-membered rings
|
|
├── molecules.csv # Molecules in this ring size
|
|
├── standard/ # Standard macrolactones
|
|
│ ├── numbered/ # Numbered ring images
|
|
│ │ └── {id}_numbered.png
|
|
│ └── sidechains/ # Fragment images
|
|
│ └── {id}/
|
|
│ └── {id}_frag_{n}_pos{pos}.png
|
|
├── non_standard/ # Non-standard macrocycles
|
|
│ └── original/
|
|
│ └── {id}_original.png
|
|
└── rejected/ # Not macrolactones
|
|
└── original/
|
|
└── {id}_original.png
|
|
```
|
|
|
|
## Database Schema
|
|
|
|
### Tables
|
|
|
|
- **parent_molecules**: Original molecule information
|
|
- **ring_numberings**: Ring atom numbering details
|
|
- **side_chain_fragments**: Fragmentation results with isotope tags
|
|
- **validation_results**: Manual validation records
|
|
|
|
### Key Fields
|
|
|
|
- `classification`: standard_macrolactone | non_standard_macrocycle | not_macrolactone
|
|
- `dummy_isotope`: Cleavage position stored as isotope value for reconstruction
|
|
- `cleavage_position`: Position on ring where side chain was attached
|
|
|
|
## Ring Numbering Convention
|
|
|
|
1. Position 1 = Lactone carbonyl carbon (C=O)
|
|
2. Position 2 = Ester oxygen (-O-)
|
|
3. Positions 3-N = Sequential around ring
|
|
|
|
## Isotope Tagging
|
|
|
|
Fragments use isotope values to mark cleavage position:
|
|
- `[5*]CCO` = Fragment from position 5, dummy atom has isotope=5
|
|
- This enables precise reconstruction during reassembly
|
|
|
|
## CSV Columns
|
|
|
|
### summary.csv
|
|
|
|
- `source_id`: Original molecule ID from MacrolactoneDB
|
|
- `classification`: Classification result
|
|
- `ring_size`: Detected ring size (12-20)
|
|
- `num_sidechains`: Number of side chains detected
|
|
- `cleavage_positions`: JSON array of cleavage positions
|
|
- `processing_status`: pending | success | failed | skipped
|
|
|
|
## Querying the Database
|
|
|
|
```bash
|
|
# List tables
|
|
sqlite3 fragments.db ".tables"
|
|
|
|
# Get standard macrolactones with fragments
|
|
sqlite3 fragments.db "SELECT * FROM parent_molecules WHERE classification='standard_macrolactone' LIMIT 5;"
|
|
|
|
# Get fragments for a specific molecule
|
|
sqlite3 fragments.db "SELECT * FROM side_chain_fragments WHERE parent_id=1;"
|
|
|
|
# Count by ring size
|
|
sqlite3 fragments.db "SELECT ring_size, COUNT(*) FROM parent_molecules GROUP BY ring_size;"
|
|
```
|
|
"""
|
|
readme_path = self.output_dir / "README.md"
|
|
readme_path.write_text(readme_content)
|
|
```
|
|
|
|
Add call in `run` method before `return results`:
|
|
```python
|
|
self._generate_readme()
|
|
self._generate_summary()
|
|
```
|
|
|
|
**Step 2: Commit**
|
|
|
|
```bash
|
|
git add src/macro_lactone_toolkit/validation/validator.py
|
|
git commit -m "feat(validation): add README generation for output directory"
|
|
```
|
|
|
|
---
|
|
|
|
## Task 9: Update Package __init__.py
|
|
|
|
**Files:**
|
|
- Modify: `src/macro_lactone_toolkit/__init__.py`
|
|
|
|
**Context:** Export validation module.
|
|
|
|
**Step 1: Add validation exports**
|
|
|
|
Modify `src/macro_lactone_toolkit/__init__.py` to add:
|
|
|
|
```python
|
|
# Validation module (optional import)
|
|
try:
|
|
from .validation.validator import MacrolactoneValidator
|
|
from .validation.models import ParentMolecule, SideChainFragment
|
|
except ImportError:
|
|
pass # SQLModel not installed
|
|
```
|
|
|
|
**Step 2: Commit**
|
|
|
|
```bash
|
|
git add src/macro_lactone_toolkit/__init__.py
|
|
git commit -m "feat(validation): export validation module"
|
|
```
|
|
|
|
---
|
|
|
|
## Task 10: Run Integration Test
|
|
|
|
**Files:**
|
|
- Test with: Small sample of actual data
|
|
|
|
**Context:** Run the validator on a small subset to verify everything works.
|
|
|
|
**Step 1: Create test with small sample**
|
|
|
|
Run:
|
|
```bash
|
|
# Create small test sample
|
|
head -20 data/MacrolactoneDB/ring12_20/temp.csv > /tmp/test_sample.csv
|
|
|
|
# Run validation
|
|
pixi run python scripts/validate_macrolactone_db.py \
|
|
--input /tmp/test_sample.csv \
|
|
--output /tmp/test_validation_output \
|
|
--sample-ratio 1.0
|
|
```
|
|
|
|
Expected output shows processing and summary.
|
|
|
|
**Step 2: Verify outputs**
|
|
|
|
Run:
|
|
```bash
|
|
ls -la /tmp/test_validation_output/
|
|
cat /tmp/test_validation_output/summary_statistics.json
|
|
sqlite3 /tmp/test_validation_output/fragments.db "SELECT COUNT(*) FROM parent_molecules;"
|
|
```
|
|
|
|
Expected: Directory exists, has database, summary CSV, and ring size subdirectories.
|
|
|
|
**Step 3: Cleanup**
|
|
|
|
```bash
|
|
rm -rf /tmp/test_validation_output /tmp/test_sample.csv
|
|
```
|
|
|
|
---
|
|
|
|
## Task 11: Final Commit and Summary
|
|
|
|
**Step 1: Final review and commit**
|
|
|
|
```bash
|
|
git status
|
|
git log --oneline -10
|
|
```
|
|
|
|
**Step 2: Push to branch (if using worktree)**
|
|
|
|
```bash
|
|
git push origin HEAD
|
|
```
|
|
|
|
---
|
|
|
|
## Execution Commands Reference
|
|
|
|
### Full validation run
|
|
|
|
```bash
|
|
cd /Users/lingyuzeng/project/macro-lactone-sidechain-profiler/macro_split
|
|
pixi run python scripts/validate_macrolactone_db.py \
|
|
--input data/MacrolactoneDB/ring12_20/temp.csv \
|
|
--output validation_output \
|
|
--sample-ratio 0.1
|
|
```
|
|
|
|
### Query results
|
|
|
|
```bash
|
|
# Summary statistics
|
|
cat validation_output/summary_statistics.json
|
|
|
|
# Database queries
|
|
sqlite3 validation_output/fragments.db "SELECT * FROM parent_molecules LIMIT 5;"
|
|
sqlite3 validation_output/fragments.db "SELECT * FROM side_chain_fragments WHERE cleavage_position > 2 LIMIT 5;"
|
|
```
|
|
|
|
### Check specific ring size
|
|
|
|
```bash
|
|
ls validation_output/ring_size_16/standard/numbered/
|
|
ls validation_output/ring_size_16/standard/sidechains/
|
|
```
|
|
|
|
---
|
|
|
|
## Verification Checklist
|
|
|
|
- [ ] All SQLModel models created and importable
|
|
- [ ] Database initializes without errors
|
|
- [ ] Isotope tagging preserves cleavage position
|
|
- [ ] Stratified sampling produces even distribution
|
|
- [ ] Visualization outputs created in correct structure
|
|
- [ ] Summary CSV contains all expected columns
|
|
- [ ] README generated with accurate documentation
|
|
- [ ] CLI script runs with --help
|
|
- [ ] Integration test passes on small sample
|