# MacrolactoneDB Validation Implementation Plan > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. **Goal:** Create a validation script that samples 10% of MacrolactoneDB 12-20 membered rings, classifies them, fragments side chains with isotope tagging, and stores results in SQLite with visualizations. **Architecture:** SQLModel ORM for database, RDKit for chemistry, PIL for visualization. Core is a `MacrolactoneValidator` class that orchestrates sampling, processing, and output generation. **Tech Stack:** Python 3.12, SQLModel, RDKit, Pandas, Pixi (environment) **Reference Design:** See `docs/plans/2026-03-19-macrolactone-validation-design.md` for full design details. --- ## Prerequisites **Worktree:** This plan should be executed in a dedicated worktree. **Design Doc:** Read `docs/plans/2026-03-19-macrolactone-validation-design.md` before starting. --- ## Task 1: Create Database Models (SQLModel) **Files:** - Create: `src/macro_lactone_toolkit/validation/__init__.py` - Create: `src/macro_lactone_toolkit/validation/models.py` **Context:** These models implement the schema from the design doc. `SideChainFragment` uses `dummy_isotope` field to store the cleavage position for reconstruction. **Step 1: Create directory and __init__.py** Run: ```bash mkdir -p src/macro_lactone_toolkit/validation touch src/macro_lactone_toolkit/validation/__init__.py ``` **Step 2: Write database models** Create `src/macro_lactone_toolkit/validation/models.py`: ```python from __future__ import annotations from datetime import datetime from enum import Enum from typing import List, Optional from sqlmodel import Field, Relationship, SQLModel class ClassificationType(str, Enum): STANDARD = "standard_macrolactone" NON_STANDARD = "non_standard_macrocycle" NOT_MACROLACTONE = "not_macrolactone" class ProcessingStatus(str, Enum): PENDING = "pending" SUCCESS = "success" FAILED = "failed" SKIPPED = "skipped" class ParentMolecule(SQLModel, table=True): """Original molecule information.""" __tablename__ = "parent_molecules" id: Optional[int] = Field(default=None, primary_key=True) source_id: str = Field(index=True) molecule_name: Optional[str] = None smiles: str = Field(index=True) classification: ClassificationType = Field(index=True) ring_size: Optional[int] = Field(default=None, index=True) primary_reason_code: Optional[str] = None primary_reason_message: Optional[str] = None processing_status: ProcessingStatus = Field(default=ProcessingStatus.PENDING) error_message: Optional[str] = None num_sidechains: Optional[int] = None cleavage_positions: Optional[str] = None numbered_image_path: Optional[str] = None created_at: datetime = Field(default_factory=datetime.utcnow) processed_at: Optional[datetime] = None fragments: List["SideChainFragment"] = Relationship(back_populates="parent") numbering: Optional["RingNumbering"] = Relationship(back_populates="parent") class RingNumbering(SQLModel, table=True): """Ring numbering details.""" __tablename__ = "ring_numberings" id: Optional[int] = Field(default=None, primary_key=True) parent_id: int = Field(foreign_key="parent_molecules.id", unique=True) ring_size: int carbonyl_carbon_idx: int ester_oxygen_idx: int position_to_atom: str atom_to_position: str parent: Optional[ParentMolecule] = Relationship(back_populates="numbering") class SideChainFragment(SQLModel, table=True): """Side chain fragments from cleavage.""" __tablename__ = "side_chain_fragments" id: Optional[int] = Field(default=None, primary_key=True) parent_id: int = Field(foreign_key="parent_molecules.id", index=True) fragment_id: str = Field(index=True) cleavage_position: int = Field(index=True) attachment_atom_idx: int attachment_atom_symbol: str fragment_smiles_labeled: str fragment_smiles_plain: str dummy_isotope: int atom_count: int heavy_atom_count: int molecular_weight: float original_bond_type: str image_path: Optional[str] = None parent: Optional[ParentMolecule] = Relationship(back_populates="fragments") class ValidationResult(SQLModel, table=True): """Manual validation records.""" __tablename__ = "validation_results" id: Optional[int] = Field(default=None, primary_key=True) parent_id: int = Field(foreign_key="parent_molecules.id") numbering_correct: Optional[bool] = None cleavage_correct: Optional[bool] = None classification_correct: Optional[bool] = None notes: Optional[str] = None validated_by: Optional[str] = None validated_at: Optional[datetime] = None ``` **Step 3: Verify SQLModel imports work** Run: ```bash pixi run python -c "from macro_lactone_toolkit.validation.models import ParentMolecule; print('Models OK')" ``` Expected: `Models OK` **Step 4: Commit** ```bash git add src/macro_lactone_toolkit/validation/ git commit -m "feat(validation): add SQLModel database models" ``` --- ## Task 2: Create Database Connection Module **Files:** - Create: `src/macro_lactone_toolkit/validation/database.py` **Context:** Provides SQLite engine, session context manager, and init function. **Step 1: Write database module** Create `src/macro_lactone_toolkit/validation/database.py`: ```python from __future__ import annotations from contextlib import contextmanager from pathlib import Path from sqlmodel import Session, SQLModel, create_engine def get_engine(db_path: str | Path): """Create SQLite engine.""" db_path = Path(db_path) db_path.parent.mkdir(parents=True, exist_ok=True) url = f"sqlite:///{db_path}" return create_engine(url, echo=False) @contextmanager def get_session(engine): """Context manager for database sessions.""" with Session(engine) as session: yield session def init_database(engine): """Create all tables.""" SQLModel.metadata.create_all(engine) ``` **Step 2: Test database initialization** Create test script `test_db.py`: ```python from pathlib import Path from macro_lactone_toolkit.validation.database import get_engine, init_database, get_session from macro_lactone_toolkit.validation.models import ParentMolecule, ClassificationType test_db = Path("/tmp/test_fragments.db") if test_db.exists(): test_db.unlink() engine = get_engine(test_db) init_database(engine) with get_session(engine) as session: parent = ParentMolecule( source_id="TEST001", smiles="O=C1CCCCCCCCCCCCCCO1", classification=ClassificationType.STANDARD, ring_size=16, ) session.add(parent) session.commit() print(f"Inserted parent with id: {parent.id}") print("Database test passed!") ``` Run: ```bash pixi run python test_db.py ``` Expected: ``` Inserted parent with id: 1 Database test passed! ``` **Step 3: Cleanup and commit** Run: ```bash rm test_db.py /tmp/test_fragments.db git add src/macro_lactone_toolkit/validation/database.py git commit -m "feat(validation): add database connection module" ``` --- ## Task 3: Create Isotope Tagging Utilities **Files:** - Create: `src/macro_lactone_toolkit/validation/isotope_utils.py` **Context:** Implements isotope tagging inspired by Molassembler. Uses cleavage position as isotope value. **Step 1: Write isotope utilities** Create `src/macro_lactone_toolkit/validation/isotope_utils.py`: ```python from __future__ import annotations from rdkit import Chem def build_fragment_with_isotope( mol: Chem.Mol, side_chain_atoms: list[int], side_chain_start_idx: int, ring_atom_idx: int, cleavage_position: int, ) -> tuple[str, str, str]: """ Build fragment SMILES with isotope tagging. Returns: Tuple of (labeled_smiles, plain_smiles, bond_type) """ # Get original bond type bond = mol.GetBondBetweenAtoms(ring_atom_idx, side_chain_start_idx) bond_type = bond.GetBondType().name if bond else "SINGLE" # Create editable molecule emol = Chem.EditableMol(Chem.Mol(mol)) # Add dummy atom with isotope = cleavage position dummy_atom = Chem.Atom(0) dummy_atom.SetIsotope(cleavage_position) dummy_idx = emol.AddAtom(dummy_atom) # Add bond between dummy and side chain start emol.AddBond(dummy_idx, side_chain_start_idx, bond.GetBondType()) # Remove ring atom to side chain bond (will be reconnected via dummy) # Actually, we keep the side chain atoms and dummy, remove everything else # Determine atoms to keep atoms_to_keep = set([dummy_idx, side_chain_start_idx] + list(side_chain_atoms)) # Remove atoms not in keep list # Need to remove in reverse order to maintain valid indices all_atoms = list(range(mol.GetNumAtoms())) atoms_to_remove = [i for i in all_atoms if i not in atoms_to_keep] for atom_idx in sorted(atoms_to_remove, reverse=True): emol.RemoveAtom(atom_idx) fragment = emol.GetMol() Chem.SanitizeMol(fragment) # Get labeled SMILES (with isotope) labeled_smiles = Chem.MolToSmiles(fragment) # Get plain SMILES (without isotope) plain_fragment = Chem.Mol(fragment) for atom in plain_fragment.GetAtoms(): if atom.GetIsotope() > 0: atom.SetIsotope(0) plain_smiles = Chem.MolToSmiles(plain_fragment) return labeled_smiles, plain_smiles, bond_type def extract_isotope_position(fragment_smiles: str) -> int: """Extract cleavage position from fragment SMILES.""" mol = Chem.MolFromSmiles(fragment_smiles) if mol is None: return 0 for atom in mol.GetAtoms(): if atom.GetAtomicNum() == 0 and atom.GetIsotope() > 0: return atom.GetIsotope() return 0 ``` **Step 2: Write test** Create `tests/validation/test_isotope_utils.py`: ```python import pytest from rdkit import Chem from macro_lactone_toolkit.validation.isotope_utils import ( build_fragment_with_isotope, extract_isotope_position, ) def test_build_fragment_with_isotope(): # Create a simple test molecule: ethyl group attached to position 5 mol = Chem.MolFromSmiles("CCCC(CC)CCC") # Position 4 (0-indexed) has ethyl assert mol is not None side_chain_atoms = [4, 5] # The ethyl group atoms side_chain_start = 4 ring_atom = 3 cleavage_pos = 5 labeled, plain, bond_type = build_fragment_with_isotope( mol, side_chain_atoms, side_chain_start, ring_atom, cleavage_pos ) assert labeled is not None assert plain is not None assert bond_type == "SINGLE" # Check isotope was set extracted_pos = extract_isotope_position(labeled) assert extracted_pos == cleavage_pos # Plain should have no isotope extracted_plain = extract_isotope_position(plain) assert extracted_plain == 0 ``` **Step 3: Run test** Run: ```bash pixi run pytest tests/validation/test_isotope_utils.py -v ``` Expected: `test_build_fragment_with_isotope PASSED` **Step 4: Commit** ```bash git add src/macro_lactone_toolkit/validation/isotope_utils.py tests/validation/test_isotope_utils.py git commit -m "feat(validation): add isotope tagging utilities" ``` --- ## Task 4: Create Stratified Sampling Module **Files:** - Create: `src/macro_lactone_toolkit/validation/sampling.py` **Context:** Implements 10% stratified sampling by ring size (12-20). **Step 1: Write sampling module** Create `src/macro_lactone_toolkit/validation/sampling.py`: ```python from __future__ import annotations import pandas as pd from macro_lactone_toolkit import MacroLactoneAnalyzer def stratified_sample_by_ring_size( df: pd.DataFrame, sample_ratio: float, smiles_col: str = "smiles", random_state: int = 42, ) -> pd.DataFrame: """ Perform stratified sampling by ring size. First classifies all molecules, then samples 10% from each ring size layer. """ analyzer = MacroLactoneAnalyzer() # Classify all molecules classifications = [] ring_sizes = [] for smiles in df[smiles_col]: result = analyzer.classify_macrocycle(smiles) classifications.append(result.classification) ring_sizes.append(result.ring_size) df = df.copy() df["_classification"] = classifications df["_ring_size"] = ring_sizes # Group by ring size and sample from each group sampled_groups = [] for ring_size in range(12, 21): group = df[df["_ring_size"] == ring_size] if len(group) > 0: n_samples = max(1, int(len(group) * sample_ratio)) sampled = group.sample(n=min(n_samples, len(group)), random_state=random_state) sampled_groups.append(sampled) # Also sample from unknown ring size (None) unknown_group = df[df["_ring_size"].isna()] if len(unknown_group) > 0: n_samples = max(1, int(len(unknown_group) * sample_ratio)) sampled = unknown_group.sample(n=min(n_samples, len(unknown_group)), random_state=random_state) sampled_groups.append(sampled) if not sampled_groups: return pd.DataFrame() result = pd.concat(sampled_groups, ignore_index=True) return result ``` **Step 2: Create test** Create `tests/validation/test_sampling.py`: ```python import pandas as pd import pytest from macro_lactone_toolkit.validation.sampling import stratified_sample_by_ring_size def test_stratified_sample(): # Create test data with known ring sizes data = { "smiles": [ "O=C1CCCCCCCCCCCCCCO1", # 16-membered "O=C1CCCCCCCCCCCCO1", # 14-membered "O=C1CCCCCCCCCCCCCCCCO1", # 18-membered ], "id": ["A", "B", "C"], } df = pd.DataFrame(data) sampled = stratified_sample_by_ring_size(df, sample_ratio=0.5, random_state=42) # Should get at least 1 from each ring size (50% of 1 = 1) assert len(sampled) >= 1 assert len(sampled) <= 3 ``` **Step 3: Run test** Run: ```bash pixi run pytest tests/validation/test_sampling.py -v ``` Expected: `test_stratified_sample PASSED` **Step 4: Commit** ```bash git add src/macro_lactone_toolkit/validation/sampling.py tests/validation/test_sampling.py git commit -m "feat(validation): add stratified sampling by ring size" ``` --- ## Task 5: Create Visualization Output Module **Files:** - Create: `src/macro_lactone_toolkit/validation/visualization_output.py` **Context:** Handles saving numbered molecule images and fragment images to organized directory structure. **Step 1: Write visualization output module** Create `src/macro_lactone_toolkit/validation/visualization_output.py`: ```python from __future__ import annotations from pathlib import Path from rdkit import Chem from macro_lactone_toolkit.visualization import save_numbered_molecule_png, save_fragment_png def get_output_paths(output_dir: Path, source_id: str, ring_size: int, classification: str) -> dict: """Get organized output paths for a molecule.""" ring_dir = output_dir / f"ring_size_{ring_size}" if classification == "standard_macrolactone": base_dir = ring_dir / "standard" numbered_dir = base_dir / "numbered" sidechains_dir = base_dir / "sidechains" / source_id elif classification == "non_standard_macrocycle": base_dir = ring_dir / "non_standard" / "original" numbered_dir = base_dir sidechains_dir = None else: base_dir = ring_dir / "rejected" / "original" numbered_dir = base_dir sidechains_dir = None numbered_dir.mkdir(parents=True, exist_ok=True) if sidechains_dir: sidechains_dir.mkdir(parents=True, exist_ok=True) return { "numbered_image": numbered_dir / f"{source_id}_numbered.png", "sidechains_dir": sidechains_dir, } def save_numbered_molecule( smiles: str, output_path: Path, ring_size: int | None = None, size: tuple[int, int] = (800, 800), ) -> Path | None: """Save numbered molecule image.""" try: return save_numbered_molecule_png( smiles, output_path, ring_size=ring_size, size=size, ) except Exception as e: print(f"Failed to save numbered image: {e}") return None def save_fragment_images( fragments: list, output_dir: Path, source_id: str, size: tuple[int, int] = (400, 400), ) -> list[str]: """Save fragment images and return paths.""" paths = [] for i, fragment in enumerate(fragments): try: output_path = output_dir / f"{source_id}_frag_{i}_pos{fragment.cleavage_position}.png" save_fragment_png(fragment.fragment_smiles_plain, output_path, size=size) paths.append(str(output_path.relative_to(output_dir.parent.parent))) except Exception as e: print(f"Failed to save fragment {i}: {e}") paths.append(None) return paths ``` **Step 2: Commit** ```bash git add src/macro_lactone_toolkit/validation/visualization_output.py git commit -m "feat(validation): add visualization output module" ``` --- ## Task 6: Create Main Validator Class **Files:** - Create: `src/macro_lactone_toolkit/validation/validator.py` **Context:** Core orchestrator that processes molecules, stores results in database, and generates visualizations. **Step 1: Write validator class** Create `src/macro_lactone_toolkit/validation/validator.py`: ```python from __future__ import annotations import json from datetime import datetime from pathlib import Path import pandas as pd from rdkit import Chem from rdkit.Chem import Descriptors from macro_lactone_toolkit import MacroLactoneAnalyzer, MacrolactoneFragmenter from macro_lactone_toolkit.validation.database import get_engine, get_session, init_database from macro_lactone_toolkit.validation.isotope_utils import build_fragment_with_isotope from macro_lactone_toolkit.validation.models import ( ClassificationType, ParentMolecule, ProcessingStatus, RingNumbering, SideChainFragment, ) from macro_lactone_toolkit.validation.sampling import stratified_sample_by_ring_size from macro_lactone_toolkit.validation.visualization_output import ( get_output_paths, save_fragment_images, save_numbered_molecule, ) class MacrolactoneValidator: """Validates macrolactone database with sampling and fragmentation.""" def __init__( self, output_dir: str | Path, sample_ratio: float = 0.1, smiles_col: str = "smiles", id_col: str = "IDs", ): self.output_dir = Path(output_dir) self.sample_ratio = sample_ratio self.smiles_col = smiles_col self.id_col = id_col self.analyzer = MacroLactoneAnalyzer() self.fragmenter = MacrolactoneFragmenter() # Initialize database self.db_path = self.output_dir / "fragments.db" self.engine = get_engine(self.db_path) init_database(self.engine) def run(self, input_csv: str | Path) -> dict: """Run validation on input CSV.""" # Load data df = pd.read_csv(input_csv) print(f"Loaded {len(df)} molecules from {input_csv}") # Stratified sampling print(f"Performing stratified sampling (ratio={self.sample_ratio})...") sampled = stratified_sample_by_ring_size(df, self.sample_ratio, self.smiles_col) print(f"Sampled {len(sampled)} molecules") # Process each molecule results = {"total": len(sampled), "success": 0, "failed": 0, "skipped": 0} for idx, row in sampled.iterrows(): status = self._process_molecule(row) results[status] += 1 if (idx + 1) % 100 == 0: print(f"Processed {idx + 1}/{len(sampled)} molecules") # Generate summary self._generate_summary() return results def _process_molecule(self, row: pd.Series) -> str: """Process a single molecule. Returns status.""" source_id = str(row[self.id_col]) smiles = row[self.smiles_col] name = row.get("molecule_pref_name", None) # Classify classification_result = self.analyzer.classify_macrocycle(smiles) classification = ClassificationType(classification_result.classification) ring_size = classification_result.ring_size # Create parent record parent = ParentMolecule( source_id=source_id, molecule_name=name, smiles=smiles, classification=classification, ring_size=ring_size, primary_reason_code=classification_result.primary_reason_code, primary_reason_message=classification_result.primary_reason_message, ) with get_session(self.engine) as session: session.add(parent) session.commit() session.refresh(parent) # Skip non-standard molecules if classification != ClassificationType.STANDARD: parent.processing_status = ProcessingStatus.SKIPPED session.add(parent) session.commit() self._save_original_image(smiles, source_id, ring_size, classification) return "skipped" # Process standard macrolactone try: self._process_standard_macrolactone(session, parent, smiles) return "success" except Exception as e: parent.processing_status = ProcessingStatus.FAILED parent.error_message = str(e) parent.processed_at = datetime.utcnow() session.add(parent) session.commit() return "failed" def _process_standard_macrolactone(self, session, parent: ParentMolecule, smiles: str): """Process a standard macrolactone.""" # Get numbering numbering = self.fragmenter.number_molecule(smiles) # Save numbering to database numbering_record = RingNumbering( parent_id=parent.id, ring_size=numbering.ring_size, carbonyl_carbon_idx=numbering.carbonyl_carbon_idx, ester_oxygen_idx=numbering.ester_oxygen_idx, position_to_atom=json.dumps(numbering.position_to_atom), atom_to_position=json.dumps(numbering.atom_to_position), ) session.add(numbering_record) # Save numbered image paths = get_output_paths( self.output_dir, parent.source_id, parent.ring_size, "standard_macrolactone" ) image_path = save_numbered_molecule(smiles, paths["numbered_image"], parent.ring_size) if image_path: parent.numbered_image_path = str(image_path.relative_to(self.output_dir)) # Fragment side chains mol = Chem.MolFromSmiles(smiles) ring_atom_set = set(numbering.ring_atoms) fragments = [] fragment_idx = 0 from macro_lactone_toolkit._core import collect_side_chain_atoms, is_intrinsic_lactone_neighbor for position, ring_atom_idx in numbering.position_to_atom.items(): ring_atom = mol.GetAtomWithIdx(ring_atom_idx) for neighbor in ring_atom.GetNeighbors(): neighbor_idx = neighbor.GetIdx() # Skip ring atoms and intrinsic lactone neighbors if neighbor_idx in ring_atom_set: continue if is_intrinsic_lactone_neighbor(mol, numbering, ring_atom_idx, neighbor_idx): continue # Collect side chain atoms side_chain_atoms = collect_side_chain_atoms(mol, neighbor_idx, ring_atom_set) if not side_chain_atoms: continue # Build fragment with isotope tagging labeled_smiles, plain_smiles, bond_type = build_fragment_with_isotope( mol, side_chain_atoms, neighbor_idx, ring_atom_idx, position ) # Calculate properties plain_mol = Chem.MolFromSmiles(plain_smiles) if plain_mol is None: continue atom_count = sum(1 for a in plain_mol.GetAtoms() if a.GetAtomicNum() != 0) heavy_atom_count = sum(1 for a in plain_mol.GetAtoms() if a.GetAtomicNum() not in [0, 1]) mw = Descriptors.MolWt(plain_mol) # Create fragment record fragment = SideChainFragment( parent_id=parent.id, fragment_id=f"{parent.source_id}_frag_{fragment_idx}", cleavage_position=position, attachment_atom_idx=ring_atom_idx, attachment_atom_symbol=ring_atom.GetSymbol(), fragment_smiles_labeled=labeled_smiles, fragment_smiles_plain=plain_smiles, dummy_isotope=position, atom_count=atom_count, heavy_atom_count=heavy_atom_count, molecular_weight=round(mw, 4), original_bond_type=bond_type, ) session.add(fragment) fragments.append(fragment) fragment_idx += 1 # Save fragment images if fragments and paths["sidechains_dir"]: image_paths = save_fragment_images(fragments, paths["sidechains_dir"], parent.source_id) for frag, img_path in zip(fragments, image_paths): frag.image_path = img_path session.add(frag) # Update parent record parent.processing_status = ProcessingStatus.SUCCESS parent.num_sidechains = len(fragments) parent.cleavage_positions = json.dumps([f.cleavage_position for f in fragments]) parent.processed_at = datetime.utcnow() session.add(parent) session.commit() def _save_original_image(self, smiles: str, source_id: str, ring_size: int, classification: ClassificationType): """Save original image for non-standard molecules.""" paths = get_output_paths(self.output_dir, source_id, ring_size, classification.value) try: from rdkit.Chem import Draw mol = Chem.MolFromSmiles(smiles) if mol: Draw.MolToFile(mol, str(paths["numbered_image"]), size=(400, 400)) except Exception: pass def _generate_summary(self): """Generate summary CSV and statistics.""" with get_session(self.engine) as session: # Query all parents from sqlmodel import select statement = select(ParentMolecule) parents = session.exec(statement).all() # Convert to DataFrame data = [] for p in parents: data.append({ "id": p.id, "source_id": p.source_id, "molecule_name": p.molecule_name, "smiles": p.smiles, "classification": p.classification.value, "ring_size": p.ring_size, "primary_reason_code": p.primary_reason_code, "primary_reason_message": p.primary_reason_message, "processing_status": p.processing_status.value, "error_message": p.error_message, "num_sidechains": p.num_sidechains, "cleavage_positions": p.cleavage_positions, "numbered_image_path": p.numbered_image_path, "processed_at": p.processed_at, }) df = pd.DataFrame(data) df.to_csv(self.output_dir / "summary.csv", index=False) # Generate statistics stats = { "total_molecules": len(parents), "by_classification": df["classification"].value_counts().to_dict(), "by_ring_size": df[df["ring_size"].notna()]["ring_size"].value_counts().to_dict(), "by_status": df["processing_status"].value_counts().to_dict(), } with open(self.output_dir / "summary_statistics.json", "w") as f: json.dump(stats, f, indent=2, default=str) print(f"\nSummary saved to {self.output_dir / 'summary.csv'}") print(f"Statistics: {stats}") ``` **Step 2: Commit** ```bash git add src/macro_lactone_toolkit/validation/validator.py git commit -m "feat(validation): add main validator class" ``` --- ## Task 7: Create CLI Script **Files:** - Create: `scripts/validate_macrolactone_db.py` **Context:** Entry point script that uses pixi environment. **Step 1: Write CLI script** Create `scripts/validate_macrolactone_db.py`: ```python #!/usr/bin/env python3 """ Validate MacrolactoneDB 12-20 membered rings. Usage: pixi run python scripts/validate_macrolactone_db.py \ --input data/MacrolactoneDB/ring12_20/temp.csv \ --output validation_output \ --sample-ratio 0.1 """ import argparse import sys from pathlib import Path # Add src to path sys.path.insert(0, str(Path(__file__).parent.parent / "src")) from macro_lactone_toolkit.validation.validator import MacrolactoneValidator def main(): parser = argparse.ArgumentParser( description="Validate MacrolactoneDB 12-20 membered rings" ) parser.add_argument( "--input", type=str, default="data/MacrolactoneDB/ring12_20/temp.csv", help="Input CSV file path", ) parser.add_argument( "--output", type=str, default="validation_output", help="Output directory", ) parser.add_argument( "--sample-ratio", type=float, default=0.1, help="Sampling ratio (0.0-1.0)", ) parser.add_argument( "--smiles-col", type=str, default="smiles", help="SMILES column name", ) parser.add_argument( "--id-col", type=str, default="IDs", help="ID column name", ) args = parser.parse_args() print("=" * 60) print("MacrolactoneDB Validation") print("=" * 60) print(f"Input: {args.input}") print(f"Output: {args.output}") print(f"Sample ratio: {args.sample_ratio}") print("=" * 60) validator = MacrolactoneValidator( output_dir=args.output, sample_ratio=args.sample_ratio, smiles_col=args.smiles_col, id_col=args.id_col, ) results = validator.run(args.input) print("\n" + "=" * 60) print("Validation Complete") print("=" * 60) print(f"Total processed: {results['total']}") print(f"Success: {results['success']}") print(f"Failed: {results['failed']}") print(f"Skipped: {results['skipped']}") print("=" * 60) return 0 if __name__ == "__main__": sys.exit(main()) ``` Make executable: ```bash chmod +x scripts/validate_macrolactone_db.py ``` **Step 2: Test help message** Run: ```bash pixi run python scripts/validate_macrolactone_db.py --help ``` Expected: Shows help message with all arguments. **Step 3: Commit** ```bash git add scripts/validate_macrolactone_db.py git commit -m "feat(validation): add CLI entry point script" ``` --- ## Task 8: Create Output Directory README **Files:** - Create: Template for `validation_output/README.md` (generated by validator) **Context:** README explaining the output directory structure. **Step 1: Add README generation to validator** Add method to `validator.py` before `_generate_summary`: ```python def _generate_readme(self): """Generate README explaining output structure.""" readme_content = """# MacrolactoneDB Validation Output This directory contains validation results for MacrolactoneDB 12-20 membered rings. ## Directory Structure ``` validation_output/ ├── README.md # This file ├── fragments.db # SQLite database with all data ├── summary.csv # Summary of all processed molecules ├── summary_statistics.json # Statistical summary │ ├── ring_size_12/ # 12-membered rings ├── ring_size_13/ # 13-membered rings ... └── ring_size_20/ # 20-membered rings ├── molecules.csv # Molecules in this ring size ├── standard/ # Standard macrolactones │ ├── numbered/ # Numbered ring images │ │ └── {id}_numbered.png │ └── sidechains/ # Fragment images │ └── {id}/ │ └── {id}_frag_{n}_pos{pos}.png ├── non_standard/ # Non-standard macrocycles │ └── original/ │ └── {id}_original.png └── rejected/ # Not macrolactones └── original/ └── {id}_original.png ``` ## Database Schema ### Tables - **parent_molecules**: Original molecule information - **ring_numberings**: Ring atom numbering details - **side_chain_fragments**: Fragmentation results with isotope tags - **validation_results**: Manual validation records ### Key Fields - `classification`: standard_macrolactone | non_standard_macrocycle | not_macrolactone - `dummy_isotope`: Cleavage position stored as isotope value for reconstruction - `cleavage_position`: Position on ring where side chain was attached ## Ring Numbering Convention 1. Position 1 = Lactone carbonyl carbon (C=O) 2. Position 2 = Ester oxygen (-O-) 3. Positions 3-N = Sequential around ring ## Isotope Tagging Fragments use isotope values to mark cleavage position: - `[5*]CCO` = Fragment from position 5, dummy atom has isotope=5 - This enables precise reconstruction during reassembly ## CSV Columns ### summary.csv - `source_id`: Original molecule ID from MacrolactoneDB - `classification`: Classification result - `ring_size`: Detected ring size (12-20) - `num_sidechains`: Number of side chains detected - `cleavage_positions`: JSON array of cleavage positions - `processing_status`: pending | success | failed | skipped ## Querying the Database ```bash # List tables sqlite3 fragments.db ".tables" # Get standard macrolactones with fragments sqlite3 fragments.db "SELECT * FROM parent_molecules WHERE classification='standard_macrolactone' LIMIT 5;" # Get fragments for a specific molecule sqlite3 fragments.db "SELECT * FROM side_chain_fragments WHERE parent_id=1;" # Count by ring size sqlite3 fragments.db "SELECT ring_size, COUNT(*) FROM parent_molecules GROUP BY ring_size;" ``` """ readme_path = self.output_dir / "README.md" readme_path.write_text(readme_content) ``` Add call in `run` method before `return results`: ```python self._generate_readme() self._generate_summary() ``` **Step 2: Commit** ```bash git add src/macro_lactone_toolkit/validation/validator.py git commit -m "feat(validation): add README generation for output directory" ``` --- ## Task 9: Update Package __init__.py **Files:** - Modify: `src/macro_lactone_toolkit/__init__.py` **Context:** Export validation module. **Step 1: Add validation exports** Modify `src/macro_lactone_toolkit/__init__.py` to add: ```python # Validation module (optional import) try: from .validation.validator import MacrolactoneValidator from .validation.models import ParentMolecule, SideChainFragment except ImportError: pass # SQLModel not installed ``` **Step 2: Commit** ```bash git add src/macro_lactone_toolkit/__init__.py git commit -m "feat(validation): export validation module" ``` --- ## Task 10: Run Integration Test **Files:** - Test with: Small sample of actual data **Context:** Run the validator on a small subset to verify everything works. **Step 1: Create test with small sample** Run: ```bash # Create small test sample head -20 data/MacrolactoneDB/ring12_20/temp.csv > /tmp/test_sample.csv # Run validation pixi run python scripts/validate_macrolactone_db.py \ --input /tmp/test_sample.csv \ --output /tmp/test_validation_output \ --sample-ratio 1.0 ``` Expected output shows processing and summary. **Step 2: Verify outputs** Run: ```bash ls -la /tmp/test_validation_output/ cat /tmp/test_validation_output/summary_statistics.json sqlite3 /tmp/test_validation_output/fragments.db "SELECT COUNT(*) FROM parent_molecules;" ``` Expected: Directory exists, has database, summary CSV, and ring size subdirectories. **Step 3: Cleanup** ```bash rm -rf /tmp/test_validation_output /tmp/test_sample.csv ``` --- ## Task 11: Final Commit and Summary **Step 1: Final review and commit** ```bash git status git log --oneline -10 ``` **Step 2: Push to branch (if using worktree)** ```bash git push origin HEAD ``` --- ## Execution Commands Reference ### Full validation run ```bash cd /Users/lingyuzeng/project/macro-lactone-sidechain-profiler/macro_split pixi run python scripts/validate_macrolactone_db.py \ --input data/MacrolactoneDB/ring12_20/temp.csv \ --output validation_output \ --sample-ratio 0.1 ``` ### Query results ```bash # Summary statistics cat validation_output/summary_statistics.json # Database queries sqlite3 validation_output/fragments.db "SELECT * FROM parent_molecules LIMIT 5;" sqlite3 validation_output/fragments.db "SELECT * FROM side_chain_fragments WHERE cleavage_position > 2 LIMIT 5;" ``` ### Check specific ring size ```bash ls validation_output/ring_size_16/standard/numbered/ ls validation_output/ring_size_16/standard/sidechains/ ``` --- ## Verification Checklist - [ ] All SQLModel models created and importable - [ ] Database initializes without errors - [ ] Isotope tagging preserves cleavage position - [ ] Stratified sampling produces even distribution - [ ] Visualization outputs created in correct structure - [ ] Summary CSV contains all expected columns - [ ] README generated with accurate documentation - [ ] CLI script runs with --help - [ ] Integration test passes on small sample