macro_split/docs/plans/2026-01-23-tylosin-splicing-design.md

# Tylosin High-Throughput Splicing & Screening System Design

## 1. System Overview

The **Tylosin Splicer** is a combinatorial chemistry engine designed to optimize the Tylosin scaffold. It systematically modifies positions 7, 15, and 16 of the macrolactone ring by splicing in high-potential fragments identified by the SIME platform, and then evaluates the predicted antibacterial activity of each resulting candidate.

## 2. Component Architecture

```mermaid
flowchart TD
    subgraph Inputs
        InputCore[Tylosin SMILES]
        InputFrags["Fragment CSVs<br/>(SIME-predicted high-activity fragments)"]
    end

    subgraph Prep["Core Preparation"]
        CorePrep["Scaffold Preparer<br/>(identifies 7, 15, 16; replaces groups with anchors)"]
        RingNum[Ring Numbering]
    end

    subgraph Frag["Fragment Processing"]
        FragLoad[Fragment Loader]
        AttachSel["Attachment Point Selector<br/>(heuristic rules to find connection points)"]
    end

    subgraph Engine["Splicing Engine"]
        Splicer["Combinatorial Splicer<br/>(RDKit ChemicalReaction or ReplaceSubstructs)"]
        Validator[Conformer Validator]
    end

    subgraph Eval["Evaluation (SIME)"]
        Predictor[Activity Predictor]
        Model[Broad Spectrum Model]
    end

    subgraph Outputs
        Output[Ranked Results CSV]
    end

    InputCore --> CorePrep
    RingNum -.->|"Locate positions"| CorePrep

    InputFrags --> FragLoad
    FragLoad --> AttachSel

    CorePrep -->|"Scaffold with Anchors (*)"| Splicer
    AttachSel -->|"Activated Fragments (R-Groups)"| Splicer

    Splicer -->|"Raw Candidates"| Validator
    Validator -->|"Valid 3D Structures"| Predictor

    Predictor -->|"Inference"| Model
    Model -->|"Scores & Rankings"| Output
```

## 3. Data Flow Strategy

### Step 1: Scaffold Preparation (`CorePrep`)
- **Input**: Tylosin SMILES.
- **Action**:
    1. Parse SMILES using `macro_split` utils.
    2. Use `RingNumbering` to locate the ring atoms at positions 7, 15, and 16.
    3. Perform "surgical removal": break the bonds to the existing side chains at these positions.
    4. Attach anchor atoms (isotope labels or mapped dummy atoms `[*:1]`, `[*:2]`, `[*:3]`) to the ring carbons.
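
The removal-and-anchor step can be sketched with RDKit's editable molecule API. This is a minimal illustration under stated assumptions, not the project's implementation: `install_anchor` is a hypothetical helper, the ring atom index is passed in directly (in the real pipeline it would come from `RingNumbering`), and methylcyclohexane stands in for the Tylosin scaffold. It also assumes the ring atom carries exactly one substituent whose first atom is not itself in a ring.

```python
from rdkit import Chem


def install_anchor(mol, ring_atom_idx, map_num):
    """Swap the single acyclic substituent on a ring atom for a mapped dummy atom.

    Hypothetical helper for illustration; returns a new sanitized molecule.
    """
    ring_atom = mol.GetAtomWithIdx(ring_atom_idx)
    side = [n.GetIdx() for n in ring_atom.GetNeighbors() if not n.IsInRing()]
    if len(side) != 1:
        raise ValueError("expected exactly one acyclic substituent on the ring atom")

    rw = Chem.RWMol(mol)
    # "Surgical removal": cut the bond between the ring atom and its side chain.
    rw.RemoveBond(ring_atom_idx, side[0])

    # Every atom now disconnected from the ring belongs to the old side chain.
    doomed = next(frag for frag in Chem.GetMolFrags(rw) if side[0] in frag)

    # Attach the anchor: a dummy atom (atomic number 0) carrying an atom-map number.
    anchor = Chem.Atom(0)
    anchor.SetAtomMapNum(map_num)
    anchor_idx = rw.AddAtom(anchor)
    rw.AddBond(ring_atom_idx, anchor_idx, Chem.BondType.SINGLE)

    # Delete side-chain atoms from the highest index down so indices stay valid.
    for idx in sorted(doomed, reverse=True):
        rw.RemoveAtom(idx)

    scaffold = rw.GetMol()
    Chem.SanitizeMol(scaffold)
    return scaffold


# Toy stand-in for the macrolactone: methylcyclohexane, ring atom index 1.
toy = Chem.MolFromSmiles("CC1CCCCC1")
anchored = install_anchor(toy, ring_atom_idx=1, map_num=1)
print(Chem.MolToSmiles(anchored))
```

Calling the helper once per position with map numbers 1, 2, and 3 would give a scaffold carrying all three anchors, provided the atom indices are looked up again after each call because atom removal shifts them.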

### Step 2: Fragment Activation (`AttachSel`)
- **Input**: Fragment SMILES from SIME CSVs.
- **Action**: Convert a standalone molecule into a substituent (R-Group).
    - **Strategy A (Smart)**: Identify heteroatoms (-NH2, -OH) as attachment points.
    - **Strategy B (Random)**: Randomly replace a Hydrogen with an attachment point.
    - **Strategy C (Linker)**: Add a small linker (e.g., -CH2-) if needed.
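
As a rough sketch of Strategy A, the snippet below marks each primary amine or hydroxyl in a fragment as a candidate attachment point by bonding a mapped dummy atom to it. The function name `activate_fragment` and the SMARTS pattern are illustrative assumptions; the actual heuristic rules used by `AttachSel` are not specified in this document.

```python
from rdkit import Chem

# Assumed pattern: primary amine nitrogen (-NH2) or hydroxyl oxygen (-OH).
ATTACH_PATTERN = Chem.MolFromSmarts("[NX3H2,OX2H1]")


def activate_fragment(smiles, map_num=1):
    """Return one R-group SMILES per -NH2/-OH site found in the fragment.

    Hypothetical helper: each result has a dummy atom [*:map_num] bonded to
    the heteroatom, taking the place of one of its hydrogens.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []  # unparsable SMILES: nothing to activate

    activated = []
    for (atom_idx,) in mol.GetSubstructMatches(ATTACH_PATTERN):
        rw = Chem.RWMol(mol)
        dummy = Chem.Atom(0)
        dummy.SetAtomMapNum(map_num)
        dummy_idx = rw.AddAtom(dummy)
        rw.AddBond(atom_idx, dummy_idx, Chem.BondType.SINGLE)
        # Sanitization recomputes implicit hydrogens on the heteroatom.
        Chem.SanitizeMol(rw)
        activated.append(Chem.MolToSmiles(rw))
    return activated


# Ethanolamine has both an -NH2 and an -OH, so it yields two R-groups.
for r_group in activate_fragment("NCCO"):
    print(r_group)
```

Fragments with no matching heteroatom return an empty list here, which is where Strategy B or C would take over.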

### Step 3: Combinatorial Splicing (`Splicer`)
- **Input**: 1 Scaffold + N Fragments.
- **Action**:
    - **Single Point**: Modify exactly one of positions 7, 15, or 16.
    - **Multi Point**: Modify several positions combinatorially (e.g., 7+15).
    - **Reaction**: Use `rdkit.Chem.rdChemReactions` or `Chem.ReplaceSubstructs`.
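
The single-point and multi-point modes differ only in how many positions are modified at once, so the search space can be enumerated independently of the chemistry. With N fragments, modifying exactly k of the 3 positions gives C(3, k) · N^k assignments. The sketch below enumerates them with `itertools`; `enumerate_splices` and the `max_points` parameter are hypothetical names, and the fragment strings are placeholders.

```python
from itertools import combinations, product

POSITIONS = (7, 15, 16)


def enumerate_splices(fragments, positions=POSITIONS, max_points=2):
    """Yield {position: fragment} assignments for 1..max_points modified sites."""
    for k in range(1, max_points + 1):
        # Choose which k positions to modify...
        for sites in combinations(positions, k):
            # ...then choose a fragment independently for each chosen position.
            for choice in product(fragments, repeat=k):
                yield dict(zip(sites, choice))


plans = list(enumerate_splices(["fragA", "fragB"], max_points=2))
print(len(plans))   # 3*2 single-point + 3*4 two-point assignments = 18
print(plans[0])     # {7: 'fragA'}
```

Because the count grows as N^k, a library of a few hundred fragments already yields tens of thousands of two-point candidates, which is one reason to generate assignments lazily and feed them to the predictor in batches.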

### Step 4: High-Throughput Prediction (`Predictor`)
- **Integration**: Import `SIME.utils.mole_predictor`.
- **Batching**: Collect valid spliced molecules into batches of 128 or 256.
- **Scoring**: Run `ParallelBroadSpectrumPredictor`.
- **Filtering**: Keep only molecules with `broad_spectrum == True` or high inhibition scores.
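
The batching and filtering logic can be sketched without depending on SIME itself. Everything about the predictor interface below is an assumption: `predict_batch` is a hypothetical callable standing in for `ParallelBroadSpectrumPredictor`, the result keys `broad_spectrum` and `score` are invented for illustration, and `min_score` is a placeholder threshold, since this document does not define what counts as a high inhibition score.

```python
from itertools import islice


def batched(items, batch_size=128):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def screen(smiles_iter, predict_batch, batch_size=128, min_score=0.5):
    """Score molecules batch by batch and return the kept ones, best first.

    predict_batch is an assumed interface: it takes a list of SMILES and
    returns one dict per molecule with 'broad_spectrum' and 'score' keys.
    """
    kept = []
    for batch in batched(smiles_iter, batch_size):
        for smi, result in zip(batch, predict_batch(batch)):
            if result["broad_spectrum"] or result["score"] >= min_score:
                kept.append((smi, result["score"]))
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


def fake_predict(batch):
    """Stand-in predictor for the demo only; real scores come from the SIME model."""
    return [{"broad_spectrum": len(s) > 3, "score": len(s) / 10} for s in batch]


ranked = screen(["CC", "CCCC", "CCCCCC", "C"], fake_predict, batch_size=2)
print(ranked)
```

The sorted list of `(smiles, score)` pairs maps directly onto the rows of the ranked results CSV.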

## 4. Technology Stack
- **Core Logic**: Python 3.9+
- **Chemistry Engine**: RDKit
- **Data Handling**: Pandas, NumPy
- **ML Inference**: PyTorch (via SIME models)
- **Parallelization**: Python `multiprocessing` (via SIME batch predictor)