macro_split/docs/plans/2026-03-19-macrolactone-validation-implementation-plan.md


MacrolactoneDB Validation Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Create a validation script that samples 10% of MacrolactoneDB 12-20 membered rings, classifies them, fragments side chains with isotope tagging, and stores results in SQLite with visualizations.

Architecture: SQLModel ORM for database, RDKit for chemistry, PIL for visualization. Core is a MacrolactoneValidator class that orchestrates sampling, processing, and output generation.

Tech Stack: Python 3.12, SQLModel, RDKit, Pandas, Pixi (environment)

Reference Design: See docs/plans/2026-03-19-macrolactone-validation-design.md for full design details.


Prerequisites

Worktree: This plan should be executed in a dedicated worktree.

Design Doc: Read docs/plans/2026-03-19-macrolactone-validation-design.md before starting.


Task 1: Create Database Models (SQLModel)

Files:

  • Create: src/macro_lactone_toolkit/validation/__init__.py
  • Create: src/macro_lactone_toolkit/validation/models.py

Context: These models implement the schema from the design doc. SideChainFragment uses dummy_isotope field to store the cleavage position for reconstruction.

Step 1: Create directory and __init__.py

Run:

mkdir -p src/macro_lactone_toolkit/validation
touch src/macro_lactone_toolkit/validation/__init__.py

Step 2: Write database models

Create src/macro_lactone_toolkit/validation/models.py:

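# NOTE: `from __future__ import annotations` is intentionally omitted in this module.
# SQLModel inspects relationship annotations when the classes are created, and
# stringified annotations such as 'List["SideChainFragment"]' can fail to resolve
# when the mappers are configured.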

from datetime import datetime
from enum import Enum
from typing import List, Optional

from sqlmodel import Field, Relationship, SQLModel


class ClassificationType(str, Enum):
    STANDARD = "standard_macrolactone"
    NON_STANDARD = "non_standard_macrocycle"
    NOT_MACROLACTONE = "not_macrolactone"


class ProcessingStatus(str, Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILED = "failed"
    SKIPPED = "skipped"


class ParentMolecule(SQLModel, table=True):
    """Original molecule information."""

    __tablename__ = "parent_molecules"

    id: Optional[int] = Field(default=None, primary_key=True)
    source_id: str = Field(index=True)
    molecule_name: Optional[str] = None
    smiles: str = Field(index=True)
    classification: ClassificationType = Field(index=True)
    ring_size: Optional[int] = Field(default=None, index=True)
    primary_reason_code: Optional[str] = None
    primary_reason_message: Optional[str] = None
    processing_status: ProcessingStatus = Field(default=ProcessingStatus.PENDING)
    error_message: Optional[str] = None
    num_sidechains: Optional[int] = None
    cleavage_positions: Optional[str] = None
    numbered_image_path: Optional[str] = None
    created_at: datetime = Field(default_factory=datetime.utcnow)
    processed_at: Optional[datetime] = None

    fragments: List["SideChainFragment"] = Relationship(back_populates="parent")
    numbering: Optional["RingNumbering"] = Relationship(back_populates="parent")


class RingNumbering(SQLModel, table=True):
    """Ring numbering details."""

    __tablename__ = "ring_numberings"

    id: Optional[int] = Field(default=None, primary_key=True)
    parent_id: int = Field(foreign_key="parent_molecules.id", unique=True)
    ring_size: int
    carbonyl_carbon_idx: int
    ester_oxygen_idx: int
    position_to_atom: str
    atom_to_position: str

    parent: Optional[ParentMolecule] = Relationship(back_populates="numbering")


class SideChainFragment(SQLModel, table=True):
    """Side chain fragments from cleavage."""

    __tablename__ = "side_chain_fragments"

    id: Optional[int] = Field(default=None, primary_key=True)
    parent_id: int = Field(foreign_key="parent_molecules.id", index=True)
    fragment_id: str = Field(index=True)
    cleavage_position: int = Field(index=True)
    attachment_atom_idx: int
    attachment_atom_symbol: str
    fragment_smiles_labeled: str
    fragment_smiles_plain: str
    dummy_isotope: int
    atom_count: int
    heavy_atom_count: int
    molecular_weight: float
    original_bond_type: str
    image_path: Optional[str] = None

    parent: Optional[ParentMolecule] = Relationship(back_populates="fragments")


class ValidationResult(SQLModel, table=True):
    """Manual validation records."""

    __tablename__ = "validation_results"

    id: Optional[int] = Field(default=None, primary_key=True)
    parent_id: int = Field(foreign_key="parent_molecules.id")
    numbering_correct: Optional[bool] = None
    cleavage_correct: Optional[bool] = None
    classification_correct: Optional[bool] = None
    notes: Optional[str] = None
    validated_by: Optional[str] = None
    validated_at: Optional[datetime] = None

Step 3: Verify SQLModel imports work

Run:

pixi run python -c "from macro_lactone_toolkit.validation.models import ParentMolecule; print('Models OK')"

Expected: Models OK
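
To make the parent/fragment relationship concrete, here is a minimal sketch that uses only the standard-library sqlite3 module. The two tables are a trimmed, hypothetical mirror of parent_molecules and side_chain_fragments (most columns are omitted, and the inserted rows are made-up illustration data); in the plan itself the tables are created by SQLModel from models.py.

```python
import sqlite3

# Hypothetical, trimmed-down mirror of two tables from models.py.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE parent_molecules (
        id INTEGER PRIMARY KEY,
        source_id TEXT NOT NULL,
        ring_size INTEGER
    );
    CREATE TABLE side_chain_fragments (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent_molecules(id),
        cleavage_position INTEGER NOT NULL,
        dummy_isotope INTEGER NOT NULL,
        fragment_smiles_labeled TEXT NOT NULL
    );
    """
)
conn.execute(
    "INSERT INTO parent_molecules (id, source_id, ring_size) VALUES (1, 'TEST001', 16)"
)
conn.executemany(
    "INSERT INTO side_chain_fragments "
    "(parent_id, cleavage_position, dummy_isotope, fragment_smiles_labeled) "
    "VALUES (?, ?, ?, ?)",
    [(1, 5, 5, "[5*]CC"), (1, 9, 9, "[9*]O")],  # illustration data only
)

# One parent row fans out to many fragment rows through parent_id.
rows = conn.execute(
    """
    SELECT p.source_id, f.cleavage_position, f.fragment_smiles_labeled
    FROM parent_molecules AS p
    JOIN side_chain_fragments AS f ON f.parent_id = p.id
    ORDER BY f.cleavage_position
    """
).fetchall()
print(rows)
```

Because dummy_isotope duplicates cleavage_position, a fragment's ring position can be recovered either from the column or from the labeled SMILES itself.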

Step 4: Commit

git add src/macro_lactone_toolkit/validation/
git commit -m "feat(validation): add SQLModel database models"

Task 2: Create Database Connection Module

Files:

  • Create: src/macro_lactone_toolkit/validation/database.py

Context: Provides SQLite engine, session context manager, and init function.

Step 1: Write database module

Create src/macro_lactone_toolkit/validation/database.py:

from __future__ import annotations

from contextlib import contextmanager
from pathlib import Path

from sqlmodel import Session, SQLModel, create_engine


def get_engine(db_path: str | Path):
    """Create SQLite engine."""
    db_path = Path(db_path)
    db_path.parent.mkdir(parents=True, exist_ok=True)
    url = f"sqlite:///{db_path}"
    return create_engine(url, echo=False)


@contextmanager
def get_session(engine):
    """Context manager for database sessions."""
    with Session(engine) as session:
        yield session


def init_database(engine):
    """Create all tables."""
    SQLModel.metadata.create_all(engine)
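
Note that get_session only yields the session; committing and rolling back are left to the caller. As a rough illustration of the commit-or-rollback variant of the same context-manager pattern, here is a sketch using the standard-library sqlite3 module. The transaction helper and the single-column table are hypothetical and are not part of the plan's database module.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn: sqlite3.Connection):
    """Commit if the body succeeds, roll back if it raises (hypothetical helper)."""
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parent_molecules (id INTEGER PRIMARY KEY, source_id TEXT UNIQUE)"
)
conn.commit()

# First unit of work succeeds and is committed.
with transaction(conn) as c:
    c.execute("INSERT INTO parent_molecules (source_id) VALUES ('TEST001')")

# Second unit of work fails midway; the TEST002 insert is rolled back with it.
try:
    with transaction(conn) as c:
        c.execute("INSERT INTO parent_molecules (source_id) VALUES ('TEST002')")
        c.execute("INSERT INTO parent_molecules (source_id) VALUES ('TEST001')")
except sqlite3.IntegrityError:
    pass

count = conn.execute("SELECT COUNT(*) FROM parent_molecules").fetchone()[0]
print(count)
```

The same idea applies to a SQLModel Session: rolling back before recording a failure keeps half-written rows out of the database.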

Step 2: Test database initialization

Create test script test_db.py:

from pathlib import Path
from macro_lactone_toolkit.validation.database import get_engine, init_database, get_session
from macro_lactone_toolkit.validation.models import ParentMolecule, ClassificationType

test_db = Path("/tmp/test_fragments.db")
if test_db.exists():
    test_db.unlink()

engine = get_engine(test_db)
init_database(engine)

with get_session(engine) as session:
    parent = ParentMolecule(
        source_id="TEST001",
        smiles="O=C1CCCCCCCCCCCCCCO1",
        classification=ClassificationType.STANDARD,
        ring_size=16,
    )
    session.add(parent)
    session.commit()
    print(f"Inserted parent with id: {parent.id}")

print("Database test passed!")

Run:

pixi run python test_db.py

Expected:

Inserted parent with id: 1
Database test passed!

Step 3: Cleanup and commit

Run:

rm test_db.py /tmp/test_fragments.db
git add src/macro_lactone_toolkit/validation/database.py
git commit -m "feat(validation): add database connection module"

Task 3: Create Isotope Tagging Utilities

Files:

  • Create: src/macro_lactone_toolkit/validation/isotope_utils.py

Context: Implements isotope tagging inspired by Molassembler. Uses cleavage position as isotope value.

Step 1: Write isotope utilities

Create src/macro_lactone_toolkit/validation/isotope_utils.py:

from __future__ import annotations

from rdkit import Chem


def build_fragment_with_isotope(
    mol: Chem.Mol,
    side_chain_atoms: list[int],
    side_chain_start_idx: int,
    ring_atom_idx: int,
    cleavage_position: int,
) -> tuple[str, str, str]:
    """
    Build fragment SMILES with isotope tagging.

    Returns:
        Tuple of (labeled_smiles, plain_smiles, bond_type)
    """
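    # Get original bond type; the ring atom and side-chain start must be bonded
    bond = mol.GetBondBetweenAtoms(ring_atom_idx, side_chain_start_idx)
    if bond is None:
        raise ValueError(
            f"No bond between ring atom {ring_atom_idx} and side-chain atom {side_chain_start_idx}"
        )
    bond_type = bond.GetBondType().name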

    # Create editable molecule
    emol = Chem.EditableMol(Chem.Mol(mol))

    # Add dummy atom with isotope = cleavage position
    dummy_atom = Chem.Atom(0)
    dummy_atom.SetIsotope(cleavage_position)
    dummy_idx = emol.AddAtom(dummy_atom)

    # Add bond between dummy and side chain start
    emol.AddBond(dummy_idx, side_chain_start_idx, bond.GetBondType())

    # Keep only the dummy atom and the side-chain atoms; deleting every other
    # atom (including the ring atom) also drops the original ring bond.
    atoms_to_keep = {dummy_idx, side_chain_start_idx, *side_chain_atoms}

    # Remove the remaining atoms in descending index order so that the
    # indices still to be removed stay valid.
    atoms_to_remove = [i for i in range(mol.GetNumAtoms()) if i not in atoms_to_keep]

    for atom_idx in sorted(atoms_to_remove, reverse=True):
        emol.RemoveAtom(atom_idx)

    fragment = emol.GetMol()
    Chem.SanitizeMol(fragment)

    # Get labeled SMILES (with isotope)
    labeled_smiles = Chem.MolToSmiles(fragment)

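    # Get plain SMILES: clear the position label on the dummy atom only, so any
    # genuine isotope in the side chain (e.g. deuterium) is left untouched
    plain_fragment = Chem.Mol(fragment)
    for atom in plain_fragment.GetAtoms():
        if atom.GetAtomicNum() == 0 and atom.GetIsotope() > 0:
            atom.SetIsotope(0)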
    plain_smiles = Chem.MolToSmiles(plain_fragment)

    return labeled_smiles, plain_smiles, bond_type


def extract_isotope_position(fragment_smiles: str) -> int:
    """Extract cleavage position from fragment SMILES."""
    mol = Chem.MolFromSmiles(fragment_smiles)
    if mol is None:
        return 0

    for atom in mol.GetAtoms():
        if atom.GetAtomicNum() == 0 and atom.GetIsotope() > 0:
            return atom.GetIsotope()
    return 0
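
The labeling convention itself can be illustrated without RDKit. The sketch below reads the isotope label of a dummy atom directly from the SMILES text with a regular expression. It is a string-level heuristic under the assumption that the label is written as `[<digits>*...]`; the function name is hypothetical, and the RDKit-based extract_isotope_position above remains the reference behaviour.

```python
import re

# Matches the start of a bracketed dummy atom with an isotope label, e.g. "[5*]".
_DUMMY_ISOTOPE = re.compile(r"\[(\d+)\*")


def isotope_label_from_smiles(smiles: str) -> int:
    """Return the first dummy-atom isotope label found in the text, or 0 if none."""
    match = _DUMMY_ISOTOPE.search(smiles)
    return int(match.group(1)) if match else 0


print(isotope_label_from_smiles("[5*]CCO"))   # labeled fragment
print(isotope_label_from_smiles("*CCO"))      # plain fragment, no label
print(isotope_label_from_smiles("[13CH3]C"))  # real isotope, not a dummy atom
```

Requiring the `*` immediately after the digits keeps genuine isotopes such as `[13CH3]` from being mistaken for a cleavage label.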

Step 2: Write test

Create tests/validation/test_isotope_utils.py:

import pytest
from rdkit import Chem

from macro_lactone_toolkit.validation.isotope_utils import (
    build_fragment_with_isotope,
    extract_isotope_position,
)


def test_build_fragment_with_isotope():
    # Simple test molecule: an ethyl branch (atoms 4 and 5) attached to atom 3
    mol = Chem.MolFromSmiles("CCCC(CC)CCC")  # Atom 3 (0-indexed) carries the ethyl
    assert mol is not None

    side_chain_atoms = [4, 5]  # The ethyl group atoms
    side_chain_start = 4
    ring_atom = 3  # Stands in for the ring attachment atom
    cleavage_pos = 5  # Arbitrary ring position used as the isotope label

    labeled, plain, bond_type = build_fragment_with_isotope(
        mol, side_chain_atoms, side_chain_start, ring_atom, cleavage_pos
    )

    assert labeled is not None
    assert plain is not None
    assert bond_type == "SINGLE"

    # Check isotope was set
    extracted_pos = extract_isotope_position(labeled)
    assert extracted_pos == cleavage_pos

    # Plain should have no isotope
    extracted_plain = extract_isotope_position(plain)
    assert extracted_plain == 0

Step 3: Run test

Run:

pixi run pytest tests/validation/test_isotope_utils.py -v

Expected: test_build_fragment_with_isotope PASSED

Step 4: Commit

git add src/macro_lactone_toolkit/validation/isotope_utils.py tests/validation/test_isotope_utils.py
git commit -m "feat(validation): add isotope tagging utilities"

Task 4: Create Stratified Sampling Module

Files:

  • Create: src/macro_lactone_toolkit/validation/sampling.py

Context: Implements 10% stratified sampling by ring size (12-20).

Step 1: Write sampling module

Create src/macro_lactone_toolkit/validation/sampling.py:

from __future__ import annotations

import pandas as pd

from macro_lactone_toolkit import MacroLactoneAnalyzer


def stratified_sample_by_ring_size(
    df: pd.DataFrame,
    sample_ratio: float,
    smiles_col: str = "smiles",
    random_state: int = 42,
) -> pd.DataFrame:
    """
    Perform stratified sampling by ring size.

    First classifies all molecules, then samples 10% from each ring size layer.
    """
    analyzer = MacroLactoneAnalyzer()

    # Classify all molecules
    classifications = []
    ring_sizes = []

    for smiles in df[smiles_col]:
        result = analyzer.classify_macrocycle(smiles)
        classifications.append(result.classification)
        ring_sizes.append(result.ring_size)

    df = df.copy()
    df["_classification"] = classifications
    df["_ring_size"] = ring_sizes

    # Group by ring size and sample from each group
    sampled_groups = []

    for ring_size in range(12, 21):
        group = df[df["_ring_size"] == ring_size]
        if len(group) > 0:
            n_samples = max(1, int(len(group) * sample_ratio))
            sampled = group.sample(n=min(n_samples, len(group)), random_state=random_state)
            sampled_groups.append(sampled)

    # Also sample from unknown ring size (None)
    unknown_group = df[df["_ring_size"].isna()]
    if len(unknown_group) > 0:
        n_samples = max(1, int(len(unknown_group) * sample_ratio))
        sampled = unknown_group.sample(n=min(n_samples, len(unknown_group)), random_state=random_state)
        sampled_groups.append(sampled)

    if not sampled_groups:
        return pd.DataFrame()

    result = pd.concat(sampled_groups, ignore_index=True)
    return result
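
The per-stratum arithmetic is easy to check in isolation. The following is a dependency-free sketch of the same idea (group by a key, then take max(1, int(n * ratio)) from each group); the function name and the toy records are hypothetical, and it does not call MacroLactoneAnalyzer.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, ratio, seed=42):
    """Sample `ratio` of each stratum, with at least one record per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sampled = []
    for stratum_key in sorted(strata, key=str):
        members = strata[stratum_key]
        n = max(1, int(len(members) * ratio))  # floor, then enforce a minimum of 1
        sampled.extend(rng.sample(members, min(n, len(members))))
    return sampled


# Toy data: 20 records with ring size 14 and 5 records with ring size 16.
records = [{"id": f"A{i}", "ring_size": 14} for i in range(20)]
records += [{"id": f"B{i}", "ring_size": 16} for i in range(5)]

sample = stratified_sample(records, key=lambda r: r["ring_size"], ratio=0.1)
sizes = sorted(r["ring_size"] for r in sample)
print(sizes)
```

With a ratio of 0.1 the large stratum contributes int(20 * 0.1) = 2 records, while the small one would contribute int(5 * 0.1) = 0 and is raised to 1 by the minimum, so rare ring sizes are slightly over-represented.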

Step 2: Create test

Create tests/validation/test_sampling.py:

import pandas as pd
import pytest

from macro_lactone_toolkit.validation.sampling import stratified_sample_by_ring_size


def test_stratified_sample():
    # Create test data with known ring sizes
    data = {
        "smiles": [
            "O=C1CCCCCCCCCCCCCCO1",  # 16-membered
            "O=C1CCCCCCCCCCCCO1",     # 14-membered
            "O=C1CCCCCCCCCCCCCCCCO1", # 18-membered
        ],
        "id": ["A", "B", "C"],
    }
    df = pd.DataFrame(data)

    sampled = stratified_sample_by_ring_size(df, sample_ratio=0.5, random_state=42)

    # Each ring size holds one molecule: int(1 * 0.5) == 0 is raised to 1 by
    # max(1, ...), so every non-empty stratum contributes one row
    assert len(sampled) >= 1
    assert len(sampled) <= 3

Step 3: Run test

Run:

pixi run pytest tests/validation/test_sampling.py -v

Expected: test_stratified_sample PASSED

Step 4: Commit

git add src/macro_lactone_toolkit/validation/sampling.py tests/validation/test_sampling.py
git commit -m "feat(validation): add stratified sampling by ring size"

Task 5: Create Visualization Output Module

Files:

  • Create: src/macro_lactone_toolkit/validation/visualization_output.py

Context: Handles saving numbered molecule images and fragment images to organized directory structure.

Step 1: Write visualization output module

Create src/macro_lactone_toolkit/validation/visualization_output.py:

from __future__ import annotations

from pathlib import Path

from rdkit import Chem

from macro_lactone_toolkit.visualization import save_numbered_molecule_png, save_fragment_png


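def get_output_paths(output_dir: Path, source_id: str, ring_size: int | None, classification: str) -> dict:
    """Get organized output paths for a molecule."""
    # Rejected molecules may have no detected ring size; avoid a "ring_size_None" directory
    ring_label = ring_size if ring_size is not None else "unknown"
    ring_dir = output_dir / f"ring_size_{ring_label}"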

    if classification == "standard_macrolactone":
        base_dir = ring_dir / "standard"
        numbered_dir = base_dir / "numbered"
        sidechains_dir = base_dir / "sidechains" / source_id
    elif classification == "non_standard_macrocycle":
        base_dir = ring_dir / "non_standard" / "original"
        numbered_dir = base_dir
        sidechains_dir = None
    else:
        base_dir = ring_dir / "rejected" / "original"
        numbered_dir = base_dir
        sidechains_dir = None

    numbered_dir.mkdir(parents=True, exist_ok=True)
    if sidechains_dir:
        sidechains_dir.mkdir(parents=True, exist_ok=True)

    return {
        "numbered_image": numbered_dir / f"{source_id}_numbered.png",
        "sidechains_dir": sidechains_dir,
    }


def save_numbered_molecule(
    smiles: str,
    output_path: Path,
    ring_size: int | None = None,
    size: tuple[int, int] = (800, 800),
) -> Path | None:
    """Save numbered molecule image."""
    try:
        return save_numbered_molecule_png(
            smiles,
            output_path,
            ring_size=ring_size,
            size=size,
        )
    except Exception as e:
        print(f"Failed to save numbered image: {e}")
        return None


def save_fragment_images(
    fragments: list,
    output_dir: Path,
    source_id: str,
    size: tuple[int, int] = (400, 400),
) -> list[str | None]:
    """Save fragment images and return their paths (None where rendering failed)."""
    paths = []

    for i, fragment in enumerate(fragments):
        try:
            output_path = output_dir / f"{source_id}_frag_{i}_pos{fragment.cleavage_position}.png"
            save_fragment_png(fragment.fragment_smiles_plain, output_path, size=size)
            paths.append(str(output_path.relative_to(output_dir.parent.parent)))
        except Exception as e:
            print(f"Failed to save fragment {i}: {e}")
            paths.append(None)

    return paths
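
The directory layout can be reasoned about as a pure path computation, separate from any file I/O. Here is a small sketch of that mapping using pathlib only. The function name, the lookup table, and the "unknown" label for a missing ring size are assumptions made for illustration; the sketch does not create directories or draw images.

```python
from pathlib import PurePosixPath

# Classification value -> (category directory, image directory).
_SUBDIRS = {
    "standard_macrolactone": ("standard", "numbered"),
    "non_standard_macrocycle": ("non_standard", "original"),
    "not_macrolactone": ("rejected", "original"),
}


def numbered_image_path(output_dir, source_id, ring_size, classification):
    """Compute where a molecule image would be written (no I/O performed)."""
    ring_label = "unknown" if ring_size is None else str(ring_size)
    # Anything unrecognised is filed with the rejected molecules.
    subdirs = _SUBDIRS.get(classification, _SUBDIRS["not_macrolactone"])
    return PurePosixPath(
        output_dir, f"ring_size_{ring_label}", *subdirs, f"{source_id}_numbered.png"
    )


print(numbered_image_path("out", "ML001", 16, "standard_macrolactone"))
print(numbered_image_path("out", "ML002", None, "not_macrolactone"))
```

Keeping the path logic free of side effects makes it straightforward to unit test before wiring it to the image writers.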

Step 2: Commit

git add src/macro_lactone_toolkit/validation/visualization_output.py
git commit -m "feat(validation): add visualization output module"

Task 6: Create Main Validator Class

Files:

  • Create: src/macro_lactone_toolkit/validation/validator.py

Context: Core orchestrator that processes molecules, stores results in database, and generates visualizations.
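
The control flow of the orchestrator (classify, skip anything that is not a standard macrolactone, isolate per-molecule failures, tally one status per molecule) can be sketched independently of RDKit and the database. Everything below is a hypothetical stand-in: the toy classifier and fragmenter only mimic the shape of the real calls.

```python
from collections import Counter


def process_all(items, classify, fragment):
    """Tally one status per item; a failure in one item never aborts the run."""
    tally = Counter(total=len(items), success=0, failed=0, skipped=0)
    for item in items:
        if classify(item) != "standard_macrolactone":
            tally["skipped"] += 1
            continue
        try:
            fragment(item)
            tally["success"] += 1
        except Exception:
            tally["failed"] += 1
    return dict(tally)


# Toy stand-ins: names starting with "L" count as standard macrolactones,
# and fragmenting "L_bad" raises to simulate a numbering failure.
def toy_classify(item):
    return "standard_macrolactone" if item.startswith("L") else "not_macrolactone"


def toy_fragment(item):
    if item == "L_bad":
        raise ValueError("cannot number ring")


result = process_all(["L1", "L_bad", "X1", "L2"], toy_classify, toy_fragment)
print(result)
```

The three status counters always add up to the total, which gives a quick sanity check on a finished run.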

Step 1: Write validator class

Create src/macro_lactone_toolkit/validation/validator.py:

from __future__ import annotations

import json
from datetime import datetime
from pathlib import Path

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

from macro_lactone_toolkit import MacroLactoneAnalyzer, MacrolactoneFragmenter
from macro_lactone_toolkit.validation.database import get_engine, get_session, init_database
from macro_lactone_toolkit.validation.isotope_utils import build_fragment_with_isotope
from macro_lactone_toolkit.validation.models import (
    ClassificationType,
    ParentMolecule,
    ProcessingStatus,
    RingNumbering,
    SideChainFragment,
)
from macro_lactone_toolkit.validation.sampling import stratified_sample_by_ring_size
from macro_lactone_toolkit.validation.visualization_output import (
    get_output_paths,
    save_fragment_images,
    save_numbered_molecule,
)


class MacrolactoneValidator:
    """Validates macrolactone database with sampling and fragmentation."""

    def __init__(
        self,
        output_dir: str | Path,
        sample_ratio: float = 0.1,
        smiles_col: str = "smiles",
        id_col: str = "IDs",
    ):
        self.output_dir = Path(output_dir)
        self.sample_ratio = sample_ratio
        self.smiles_col = smiles_col
        self.id_col = id_col

        self.analyzer = MacroLactoneAnalyzer()
        self.fragmenter = MacrolactoneFragmenter()

        # Initialize database
        self.db_path = self.output_dir / "fragments.db"
        self.engine = get_engine(self.db_path)
        init_database(self.engine)

    def run(self, input_csv: str | Path) -> dict:
        """Run validation on input CSV."""
        # Load data
        df = pd.read_csv(input_csv)
        print(f"Loaded {len(df)} molecules from {input_csv}")

        # Stratified sampling
        print(f"Performing stratified sampling (ratio={self.sample_ratio})...")
        sampled = stratified_sample_by_ring_size(df, self.sample_ratio, self.smiles_col)
        print(f"Sampled {len(sampled)} molecules")

        # Process each molecule
        results = {"total": len(sampled), "success": 0, "failed": 0, "skipped": 0}

        for idx, row in sampled.iterrows():
            status = self._process_molecule(row)
            results[status] += 1
            if (idx + 1) % 100 == 0:
                print(f"Processed {idx + 1}/{len(sampled)} molecules")

        # Generate summary
        self._generate_summary()

        return results

    def _process_molecule(self, row: pd.Series) -> str:
        """Process a single molecule. Returns status."""
        source_id = str(row[self.id_col])
        smiles = row[self.smiles_col]
        name = row.get("molecule_pref_name", None)

        # Classify
        classification_result = self.analyzer.classify_macrocycle(smiles)
        classification = ClassificationType(classification_result.classification)
        ring_size = classification_result.ring_size

        # Create parent record
        parent = ParentMolecule(
            source_id=source_id,
            molecule_name=name,
            smiles=smiles,
            classification=classification,
            ring_size=ring_size,
            primary_reason_code=classification_result.primary_reason_code,
            primary_reason_message=classification_result.primary_reason_message,
        )

        with get_session(self.engine) as session:
            session.add(parent)
            session.commit()
            session.refresh(parent)

            # Skip non-standard molecules
            if classification != ClassificationType.STANDARD:
                parent.processing_status = ProcessingStatus.SKIPPED
                session.add(parent)
                session.commit()
                self._save_original_image(smiles, source_id, ring_size, classification)
                return "skipped"

            # Process standard macrolactone
            try:
                self._process_standard_macrolactone(session, parent, smiles)
                return "success"
            except Exception as e:
                # Discard partially added numbering/fragment rows before recording the failure
                session.rollback()
                parent.processing_status = ProcessingStatus.FAILED
                parent.error_message = str(e)
                parent.processed_at = datetime.utcnow()
                session.add(parent)
                session.commit()
                return "failed"

    def _process_standard_macrolactone(self, session, parent: ParentMolecule, smiles: str):
        """Process a standard macrolactone."""
        # Get numbering
        numbering = self.fragmenter.number_molecule(smiles)

        # Save numbering to database
        numbering_record = RingNumbering(
            parent_id=parent.id,
            ring_size=numbering.ring_size,
            carbonyl_carbon_idx=numbering.carbonyl_carbon_idx,
            ester_oxygen_idx=numbering.ester_oxygen_idx,
            position_to_atom=json.dumps(numbering.position_to_atom),
            atom_to_position=json.dumps(numbering.atom_to_position),
        )
        session.add(numbering_record)

        # Save numbered image
        paths = get_output_paths(
            self.output_dir, parent.source_id, parent.ring_size, "standard_macrolactone"
        )
        image_path = save_numbered_molecule(smiles, paths["numbered_image"], parent.ring_size)
        if image_path:
            parent.numbered_image_path = str(image_path.relative_to(self.output_dir))

        # Fragment side chains
        mol = Chem.MolFromSmiles(smiles)
        ring_atom_set = set(numbering.ring_atoms)
        fragments = []
        fragment_idx = 0

        from macro_lactone_toolkit._core import collect_side_chain_atoms, is_intrinsic_lactone_neighbor

        for position, ring_atom_idx in numbering.position_to_atom.items():
            ring_atom = mol.GetAtomWithIdx(ring_atom_idx)

            for neighbor in ring_atom.GetNeighbors():
                neighbor_idx = neighbor.GetIdx()

                # Skip ring atoms and intrinsic lactone neighbors
                if neighbor_idx in ring_atom_set:
                    continue
                if is_intrinsic_lactone_neighbor(mol, numbering, ring_atom_idx, neighbor_idx):
                    continue

                # Collect side chain atoms
                side_chain_atoms = collect_side_chain_atoms(mol, neighbor_idx, ring_atom_set)
                if not side_chain_atoms:
                    continue

                # Build fragment with isotope tagging
                labeled_smiles, plain_smiles, bond_type = build_fragment_with_isotope(
                    mol, side_chain_atoms, neighbor_idx, ring_atom_idx, position
                )

                # Calculate properties
                plain_mol = Chem.MolFromSmiles(plain_smiles)
                if plain_mol is None:
                    continue

                atom_count = sum(1 for a in plain_mol.GetAtoms() if a.GetAtomicNum() != 0)
                heavy_atom_count = sum(1 for a in plain_mol.GetAtoms() if a.GetAtomicNum() not in [0, 1])
                mw = Descriptors.MolWt(plain_mol)

                # Create fragment record
                fragment = SideChainFragment(
                    parent_id=parent.id,
                    fragment_id=f"{parent.source_id}_frag_{fragment_idx}",
                    cleavage_position=position,
                    attachment_atom_idx=ring_atom_idx,
                    attachment_atom_symbol=ring_atom.GetSymbol(),
                    fragment_smiles_labeled=labeled_smiles,
                    fragment_smiles_plain=plain_smiles,
                    dummy_isotope=position,
                    atom_count=atom_count,
                    heavy_atom_count=heavy_atom_count,
                    molecular_weight=round(mw, 4),
                    original_bond_type=bond_type,
                )
                session.add(fragment)
                fragments.append(fragment)
                fragment_idx += 1

        # Save fragment images
        if fragments and paths["sidechains_dir"]:
            image_paths = save_fragment_images(fragments, paths["sidechains_dir"], parent.source_id)
            for frag, img_path in zip(fragments, image_paths):
                frag.image_path = img_path
                session.add(frag)

        # Update parent record
        parent.processing_status = ProcessingStatus.SUCCESS
        parent.num_sidechains = len(fragments)
        parent.cleavage_positions = json.dumps([f.cleavage_position for f in fragments])
        parent.processed_at = datetime.utcnow()
        session.add(parent)
        session.commit()

    def _save_original_image(self, smiles: str, source_id: str, ring_size: int | None, classification: ClassificationType):
        """Save original image for non-standard molecules."""
        paths = get_output_paths(self.output_dir, source_id, ring_size, classification.value)
        try:
            from rdkit.Chem import Draw
            mol = Chem.MolFromSmiles(smiles)
            if mol:
                Draw.MolToFile(mol, str(paths["numbered_image"]), size=(400, 400))
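        except Exception as e:
            # Report the failure instead of swallowing it silently
            print(f"Failed to save original image for {source_id}: {e}")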

    def _generate_summary(self):
        """Generate summary CSV and statistics."""
        with get_session(self.engine) as session:
            # Query all parents
            from sqlmodel import select
            statement = select(ParentMolecule)
            parents = session.exec(statement).all()

            # Convert to DataFrame
            data = []
            for p in parents:
                data.append({
                    "id": p.id,
                    "source_id": p.source_id,
                    "molecule_name": p.molecule_name,
                    "smiles": p.smiles,
                    "classification": p.classification.value,
                    "ring_size": p.ring_size,
                    "primary_reason_code": p.primary_reason_code,
                    "primary_reason_message": p.primary_reason_message,
                    "processing_status": p.processing_status.value,
                    "error_message": p.error_message,
                    "num_sidechains": p.num_sidechains,
                    "cleavage_positions": p.cleavage_positions,
                    "numbered_image_path": p.numbered_image_path,
                    "processed_at": p.processed_at,
                })

            df = pd.DataFrame(data)
            df.to_csv(self.output_dir / "summary.csv", index=False)

            # Generate statistics
            stats = {
                "total_molecules": len(parents),
                "by_classification": df["classification"].value_counts().to_dict(),
                "by_ring_size": df[df["ring_size"].notna()]["ring_size"].value_counts().to_dict(),
                "by_status": df["processing_status"].value_counts().to_dict(),
            }

            with open(self.output_dir / "summary_statistics.json", "w") as f:
                json.dump(stats, f, indent=2, default=str)

            print(f"\nSummary saved to {self.output_dir / 'summary.csv'}")
            print(f"Statistics: {stats}")

Step 2: Commit

git add src/macro_lactone_toolkit/validation/validator.py
git commit -m "feat(validation): add main validator class"

Task 7: Create CLI Script

Files:

  • Create: scripts/validate_macrolactone_db.py

Context: Entry point script that uses pixi environment.

Step 1: Write CLI script

Create scripts/validate_macrolactone_db.py:

#!/usr/bin/env python3
"""
Validate MacrolactoneDB 12-20 membered rings.

Usage:
    pixi run python scripts/validate_macrolactone_db.py \
        --input data/MacrolactoneDB/ring12_20/temp.csv \
        --output validation_output \
        --sample-ratio 0.1
"""

import argparse
import sys
from pathlib import Path

# Add src to path
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

from macro_lactone_toolkit.validation.validator import MacrolactoneValidator


def main():
    parser = argparse.ArgumentParser(
        description="Validate MacrolactoneDB 12-20 membered rings"
    )
    parser.add_argument(
        "--input",
        type=str,
        default="data/MacrolactoneDB/ring12_20/temp.csv",
        help="Input CSV file path",
    )
    parser.add_argument(
        "--output",
        type=str,
        default="validation_output",
        help="Output directory",
    )
    parser.add_argument(
        "--sample-ratio",
        type=float,
        default=0.1,
        help="Sampling ratio (0.0-1.0)",
    )
    parser.add_argument(
        "--smiles-col",
        type=str,
        default="smiles",
        help="SMILES column name",
    )
    parser.add_argument(
        "--id-col",
        type=str,
        default="IDs",
        help="ID column name",
    )

    args = parser.parse_args()

    print("=" * 60)
    print("MacrolactoneDB Validation")
    print("=" * 60)
    print(f"Input: {args.input}")
    print(f"Output: {args.output}")
    print(f"Sample ratio: {args.sample_ratio}")
    print("=" * 60)

    validator = MacrolactoneValidator(
        output_dir=args.output,
        sample_ratio=args.sample_ratio,
        smiles_col=args.smiles_col,
        id_col=args.id_col,
    )

    results = validator.run(args.input)

    print("\n" + "=" * 60)
    print("Validation Complete")
    print("=" * 60)
    print(f"Total processed: {results['total']}")
    print(f"Success: {results['success']}")
    print(f"Failed: {results['failed']}")
    print(f"Skipped: {results['skipped']}")
    print("=" * 60)

    return 0


if __name__ == "__main__":
    sys.exit(main())

Make executable:

chmod +x scripts/validate_macrolactone_db.py

Step 2: Test help message

Run:

pixi run python scripts/validate_macrolactone_db.py --help

Expected: Shows help message with all arguments.
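
The script accepts any float for --sample-ratio. If stricter input checking is wanted, argparse allows a custom type function; the sketch below is one possible way to do that. The sample_ratio function is hypothetical, and the choice to reject 0 (which would otherwise still sample one molecule per stratum) is an assumption, not something the plan specifies.

```python
import argparse


def sample_ratio(value: str) -> float:
    """Parse a sampling ratio and require it to lie in (0, 1]."""
    ratio = float(value)
    if not 0.0 < ratio <= 1.0:
        raise argparse.ArgumentTypeError(
            f"sample ratio must be in (0, 1], got {value}"
        )
    return ratio


parser = argparse.ArgumentParser(description="Ratio validation sketch")
parser.add_argument("--sample-ratio", type=sample_ratio, default=0.1)

args = parser.parse_args(["--sample-ratio", "0.25"])
print(args.sample_ratio)
```

With this in place an out-of-range value makes argparse exit with a usage error before any molecules are loaded.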

Step 3: Commit

git add scripts/validate_macrolactone_db.py
git commit -m "feat(validation): add CLI entry point script"

Task 8: Create Output Directory README

Files:

  • Modify: src/macro_lactone_toolkit/validation/validator.py (generates validation_output/README.md at run time)

Context: README explaining the output directory structure.

Step 1: Add README generation to validator

Add method to validator.py before _generate_summary:

    def _generate_readme(self):
        """Generate README explaining output structure."""
        readme_content = """# MacrolactoneDB Validation Output

This directory contains validation results for MacrolactoneDB 12-20 membered rings.

## Directory Structure

validation_output/ ├── README.md # This file ├── fragments.db # SQLite database with all data ├── summary.csv # Summary of all processed molecules ├── summary_statistics.json # Statistical summary │ ├── ring_size_12/ # 12-membered rings ├── ring_size_13/ # 13-membered rings ... └── ring_size_20/ # 20-membered rings ├── molecules.csv # Molecules in this ring size ├── standard/ # Standard macrolactones │ ├── numbered/ # Numbered ring images │ │ └── {id}_numbered.png │ └── sidechains/ # Fragment images │ └── {id}/ │ └── {id}frag{n}_pos{pos}.png ├── non_standard/ # Non-standard macrocycles │ └── original/ │ └── {id}_original.png └── rejected/ # Not macrolactones └── original/ └── {id}_original.png


## Database Schema

### Tables

- **parent_molecules**: Original molecule information
- **ring_numberings**: Ring atom numbering details
- **side_chain_fragments**: Fragmentation results with isotope tags
- **validation_results**: Manual validation records

### Key Fields

- `classification`: standard_macrolactone | non_standard_macrocycle | not_macrolactone
- `dummy_isotope`: Cleavage position stored as isotope value for reconstruction
- `cleavage_position`: Position on ring where side chain was attached

## Ring Numbering Convention

1. Position 1 = Lactone carbonyl carbon (C=O)
2. Position 2 = Ester oxygen (-O-)
3. Positions 3-N = Sequential around ring

## Isotope Tagging

Fragments use isotope values to mark cleavage position:
- `[5*]CCO` = Fragment from position 5, dummy atom has isotope=5
- This enables precise reconstruction during reassembly

## CSV Columns

### summary.csv

- `source_id`: Original molecule ID from MacrolactoneDB
- `classification`: Classification result
- `ring_size`: Detected ring size (12-20)
- `num_sidechains`: Number of side chains detected
- `cleavage_positions`: JSON array of cleavage positions
- `processing_status`: pending | success | failed | skipped

## Querying the Database

```bash
# List tables
sqlite3 fragments.db ".tables"

# Get standard macrolactones with fragments
sqlite3 fragments.db "SELECT * FROM parent_molecules WHERE classification='standard_macrolactone' LIMIT 5;"

# Get fragments for a specific molecule
sqlite3 fragments.db "SELECT * FROM side_chain_fragments WHERE parent_id=1;"

# Count by ring size
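sqlite3 fragments.db "SELECT ring_size, COUNT(*) FROM parent_molecules GROUP BY ring_size;"
```
"""
        readme_path = self.output_dir / "README.md"
        # The README contains box-drawing characters, so write it as UTF-8 explicitly.
        readme_path.write_text(readme_content, encoding="utf-8")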


Add a call in the `run` method, before `return results`:
```python
self._generate_readme()
self._generate_summary()
```
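
The isotope-tagging convention documented in the README (`[5*]CCO` means the side chain was cleaved at ring position 5) can be spot-checked without a chemistry toolkit. The sketch below is illustrative only: `cleavage_position_from_smiles` is a hypothetical helper that is not part of this plan, and it uses a plain regular expression on the SMILES string, whereas the real pipeline would more naturally read the dummy atom's isotope through RDKit.

```python
import re
from typing import Optional

# Matches an isotope-labelled dummy atom such as "[5*]" or "[12*]".
# Assumption: each fragment carries one bracketed dummy atom of exactly this
# form, with no charge or atom-map number inside the brackets.
_DUMMY_ATOM = re.compile(r"\[(\d+)\*\]")


def cleavage_position_from_smiles(fragment_smiles: str) -> Optional[int]:
    """Return the ring position encoded as the dummy atom's isotope, if any."""
    match = _DUMMY_ATOM.search(fragment_smiles)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(cleavage_position_from_smiles("[5*]CCO"))  # prints 5
    print(cleavage_position_from_smiles("CCO"))      # prints None
```

A check like this could compare each stored `dummy_isotope` value against the `cleavage_position` column to confirm that the two stay in sync.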

Step 2: Commit

git add src/macro_lactone_toolkit/validation/validator.py
git commit -m "feat(validation): add README generation for output directory"

Task 9: Update Package __init__.py

Files:

  • Modify: src/macro_lactone_toolkit/__init__.py

Context: Export validation module.

Step 1: Add validation exports

Modify src/macro_lactone_toolkit/__init__.py to add:

# Validation module (optional import)
try:
    from .validation.validator import MacrolactoneValidator
    from .validation.models import ParentMolecule, SideChainFragment
except ImportError:
    pass  # Optional dependency such as SQLModel is missing; validation exports stay undefined

Step 2: Commit

git add src/macro_lactone_toolkit/__init__.py
git commit -m "feat(validation): export validation module"

Task 10: Run Integration Test

Files:

  • Test with: Small sample of actual data

Context: Run the validator on a small subset to verify everything works.

Step 1: Create test with small sample

Run:

# Create a small test sample (first 20 lines of the CSV, including any header row)
head -20 data/MacrolactoneDB/ring12_20/temp.csv > /tmp/test_sample.csv

# Run validation
pixi run python scripts/validate_macrolactone_db.py \
    --input /tmp/test_sample.csv \
    --output /tmp/test_validation_output \
    --sample-ratio 1.0

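Expected: the script prints its processing output, followed by the "Validation Complete" summary with total, success, failed, and skipped counts.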

Step 2: Verify outputs

Run:

ls -la /tmp/test_validation_output/
cat /tmp/test_validation_output/summary_statistics.json
sqlite3 /tmp/test_validation_output/fragments.db "SELECT COUNT(*) FROM parent_molecules;"

Expected: The output directory exists and contains the database (fragments.db), summary.csv, summary_statistics.json, and ring_size_* subdirectories.
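
The manual checks above can also be scripted. The following is a minimal sketch, assuming the top-level layout described in the generated README; `missing_outputs` and `EXPECTED_FILES` are hypothetical names introduced here for illustration, not part of the validator.

```python
from pathlib import Path
from typing import List

# Top-level files named in the README's directory layout.
EXPECTED_FILES = ("README.md", "fragments.db", "summary.csv", "summary_statistics.json")


def missing_outputs(output_dir: Path) -> List[str]:
    """Return the names of expected outputs that are absent from output_dir."""
    missing = [name for name in EXPECTED_FILES if not (output_dir / name).is_file()]
    # At least one ring_size_<N>/ directory should exist after a successful run.
    if not any(p.is_dir() for p in output_dir.glob("ring_size_*")):
        missing.append("ring_size_*/")
    return missing


if __name__ == "__main__":
    problems = missing_outputs(Path("/tmp/test_validation_output"))
    print("OK" if not problems else f"Missing: {problems}")
```

Run it before the cleanup step, since it only inspects what is still on disk.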

Step 3: Cleanup

rm -rf /tmp/test_validation_output /tmp/test_sample.csv

Task 11: Final Commit and Summary

Step 1: Final review and commit

git status
git log --oneline -10

Step 2: Push to branch (if using worktree)

git push origin HEAD

Execution Commands Reference

Full validation run

cd /Users/lingyuzeng/project/macro-lactone-sidechain-profiler/macro_split
pixi run python scripts/validate_macrolactone_db.py \
    --input data/MacrolactoneDB/ring12_20/temp.csv \
    --output validation_output \
    --sample-ratio 0.1

Query results

# Summary statistics
cat validation_output/summary_statistics.json

# Database queries
sqlite3 validation_output/fragments.db "SELECT * FROM parent_molecules LIMIT 5;"
sqlite3 validation_output/fragments.db "SELECT * FROM side_chain_fragments WHERE cleavage_position > 2 LIMIT 5;"
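
The same queries can be issued from Python with the standard-library `sqlite3` module. The sketch below mirrors the count-by-ring-size query from the README; `count_by_ring_size` is a hypothetical helper, and the in-memory table is a stand-in with only the two columns the query needs, not the real `parent_molecules` schema.

```python
import sqlite3
from typing import List, Tuple


def count_by_ring_size(conn: sqlite3.Connection) -> List[Tuple[int, int]]:
    """Return (ring_size, molecule_count) pairs, ordered by ring size."""
    rows = conn.execute(
        "SELECT ring_size, COUNT(*) FROM parent_molecules "
        "GROUP BY ring_size ORDER BY ring_size"
    ).fetchall()
    return [(int(size), int(count)) for size, count in rows]


if __name__ == "__main__":
    # Stand-in database so the sketch runs without a real fragments.db.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE parent_molecules (id INTEGER PRIMARY KEY, ring_size INTEGER)"
    )
    conn.executemany(
        "INSERT INTO parent_molecules (ring_size) VALUES (?)",
        [(12,), (14,), (14,), (16,)],
    )
    print(count_by_ring_size(conn))  # prints [(12, 1), (14, 2), (16, 1)]
    conn.close()
```

To query real results, replace the in-memory connection with `sqlite3.connect("validation_output/fragments.db")`.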

Check specific ring size

ls validation_output/ring_size_16/standard/numbered/
ls validation_output/ring_size_16/standard/sidechains/

Verification Checklist

  • All SQLModel models created and importable
  • Database initializes without errors
  • Isotope tagging preserves cleavage position
  • Stratified sampling produces even distribution
  • Visualization outputs created in correct structure
  • Summary CSV contains all expected columns
  • README generated with accurate documentation
  • CLI script runs with --help
  • Integration test passes on small sample