Move project docs to docs/project-docs and update references

- Move AGENTS.md, CLEANUP_SUMMARY.md, DOCUMENTATION_GUIDE.md,
  IMPLEMENTATION_SUMMARY.md, QUICK_COMMANDS.md to docs/project-docs/
- Update AGENTS.md to include splicing module documentation
- Update mkdocs.yml navigation to include project-docs section
- Update .gitignore to track docs/ directory
- Add docs/plans/ splicing design documents

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 17:56:03 +08:00
parent 68f171ad1d
commit a768d26e47
10 changed files with 555 additions and 7 deletions


@@ -0,0 +1,95 @@
# Tylosin High-Throughput Splicing & Screening System Design
## 1. System Overview
The **Tylosin Splicer** is a combinatorial chemistry engine designed to optimize the Tylosin scaffold. It systematically modifies positions 7, 15, and 16 of the macrolactone ring by splicing high-potential fragments identified by the SIME platform, then immediately evaluating their predicted antibacterial activity.
## 2. Component Architecture
```mermaid
flowchart TD
    subgraph Inputs
        InputCore["Tylosin SMILES"]
        InputFrags["Fragment CSVs<br/>(SIME-predicted high-activity fragments)"]
    end
    subgraph Prep["Core Preparation"]
        CorePrep["Scaffold Preparer<br/>(identifies 7, 15, 16 and replaces groups with anchors)"]
        RingNum["Ring Numbering"]
    end
    subgraph Frag["Fragment Processing"]
        FragLoad["Fragment Loader"]
        AttachSel["Attachment Point Selector<br/>(heuristic rules to find connection points)"]
    end
    subgraph Engine["Splicing Engine"]
        Splicer["Combinatorial Splicer<br/>(RDKit ChemicalReaction or ReplaceSubstructs)"]
        Validator["Conformer Validator"]
    end
    subgraph Eval["Evaluation (SIME)"]
        Predictor["Activity Predictor"]
        Model["Broad Spectrum Model"]
    end
    subgraph Outputs
        Output["Ranked Results CSV"]
    end
    InputCore --> CorePrep
    RingNum -.->|"Locate positions"| CorePrep
    InputFrags --> FragLoad
    FragLoad --> AttachSel
    CorePrep -->|"Scaffold with Anchors (*)"| Splicer
    AttachSel -->|"Activated Fragments (R-Groups)"| Splicer
    Splicer -->|"Raw Candidates"| Validator
    Validator -->|"Valid 3D Structures"| Predictor
    Predictor -->|"Inference"| Model
    Model -->|"Scores & Rankings"| Output
```
## 3. Data Flow Strategy
### Step 1: Scaffold Preparation (`CorePrep`)
- **Input**: Tylosin SMILES.
- **Action**:
  1. Parse SMILES using `macro_split` utils.
  2. Use `RingNumbering` to identify the ring atoms at positions 7, 15, and 16.
  3. Perform "surgical removal": break the bonds to the existing side chains at these positions.
  4. Attach "anchor atoms" (isotope labels or dummy atoms `[*:1]`, `[*:2]`, `[*:3]`) to the ring carbons.
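Steps 3 and 4 can be illustrated on a toy molecule. The sketch below is a minimal example, not the project's `CorePrep` code: it assumes the ring atom and side-chain atom indices are already known (in the real pipeline they would come from the ring numbering), and the helper name `replace_side_chain_with_anchor` is hypothetical. Methylcyclohexane stands in for the Tylosin scaffold.

```python
from rdkit import Chem


def replace_side_chain_with_anchor(mol, ring_idx, side_idx, map_num):
    """Cut the ring_idx/side_idx bond and cap the ring atom with [*:map_num].

    Hypothetical helper for illustration; indices are RDKit atom indices.
    """
    rw = Chem.RWMol(mol)
    rw.RemoveBond(ring_idx, side_idx)

    dummy = Chem.Atom(0)  # atomic number 0 is RDKit's dummy atom
    dummy.SetAtomMapNum(map_num)
    dummy_idx = rw.AddAtom(dummy)
    rw.AddBond(ring_idx, dummy_idx, Chem.BondType.SINGLE)
    Chem.SanitizeMol(rw)

    # The detached side chain is now a separate fragment;
    # keep only the piece that carries the anchor.
    for frag in Chem.GetMolFrags(rw, asMols=True):
        if any(atom.GetAtomicNum() == 0 for atom in frag.GetAtoms()):
            return frag
    raise ValueError("anchor atom not found")


# Toy stand-in: atom 0 is the methyl side chain, atom 1 is the ring carbon.
toy = Chem.MolFromSmiles("CC1CCCCC1")
scaffold = replace_side_chain_with_anchor(toy, ring_idx=1, side_idx=0, map_num=1)
print(Chem.MolToSmiles(scaffold))
```

Using atom-map numbers on the dummy atoms keeps each anchor tied to one ring position, which is what the `[*:1]`, `[*:2]`, `[*:3]` notation above suggests.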
### Step 2: Fragment Activation (`AttachSel`)
- **Input**: Fragment SMILES from SIME CSVs.
- **Action**: Convert a standalone molecule into a substituent (R-Group).
  - **Strategy A (Smart)**: Identify heteroatoms (-NH2, -OH) as attachment points.
  - **Strategy B (Random)**: Randomly replace a hydrogen with an attachment point.
  - **Strategy C (Linker)**: Add a small linker (e.g., -CH2-) if needed.
### Step 3: Combinatorial Splicing (`Splicer`)
- **Input**: 1 Scaffold + N Fragments.
- **Action**:
  - **Single Point**: Modify exactly one of positions 7, 15, or 16.
  - **Multi Point**: Combinatorial modification (e.g., 7+15).
  - **Reaction**: Use `rdkit.Chem.rdChemReactions` or `ReplaceSubstructs`.
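To give a feel for how quickly the design space grows, here is a small sketch that enumerates single-point and multi-point assignments. The function name `enumerate_designs` and the `max_sites` parameter are illustrative assumptions, and the fragments are plain SMILES strings rather than activated molecules.

```python
from itertools import combinations, product


def enumerate_designs(positions, fragments, max_sites=2):
    """Yield {position: fragment} dicts for every choice of 1..max_sites positions."""
    for k in range(1, max_sites + 1):
        for sites in combinations(positions, k):
            for frags in product(fragments, repeat=k):
                yield dict(zip(sites, frags))


designs = list(enumerate_designs([7, 15, 16], ["CCO", "CCN"]))
print(len(designs))  # 3*2 single-point + 3*4 two-point = 18
print(designs[0])    # {7: 'CCO'}
```

With N fragments, single-point splicing yields 3N candidates while two-point splicing already yields 3N², so a cap such as `max_sites` is worth keeping explicit.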
### Step 4: High-Throughput Prediction (`Predictor`)
- **Integration**: Import `SIME.utils.mole_predictor`.
- **Batching**: Collect valid spliced molecules into batches of 128 or 256.
- **Scoring**: Run `ParallelBroadSpectrumPredictor`.
- **Filtering**: Keep only molecules with `broad_spectrum == True` or high inhibition scores.
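The batching and filtering steps can be sketched without the SIME models. The result field names (`broad_spectrum`, `inhibition_score`) and the 0.5 threshold below are assumptions made for illustration; the actual output schema of `ParallelBroadSpectrumPredictor` is not specified here.

```python
from typing import Dict, Iterator, List, Sequence


def batched(items: Sequence[str], size: int = 128) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` SMILES strings."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def keep_hit(row: Dict, min_score: float = 0.5) -> bool:
    """Keep rows flagged broad-spectrum or scoring at or above the threshold."""
    return bool(row.get("broad_spectrum")) or row.get("inhibition_score", 0.0) >= min_score


smiles = ["C" * (i + 1) + "O" for i in range(300)]
print([len(batch) for batch in batched(smiles, size=128)])  # [128, 128, 44]

rows = [
    {"smiles": "CCO", "broad_spectrum": True, "inhibition_score": 0.2},
    {"smiles": "CCN", "broad_spectrum": False, "inhibition_score": 0.9},
    {"smiles": "CCC", "broad_spectrum": False, "inhibition_score": 0.1},
]
hits = [row["smiles"] for row in rows if keep_hit(row)]
print(hits)  # ['CCO', 'CCN']
```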
## 4. Technology Stack
- **Core Logic**: Python 3.9+
- **Chemistry Engine**: RDKit
- **Data Handling**: Pandas, NumPy
- **ML Inference**: PyTorch (via SIME models)
- **Parallelization**: Python `multiprocessing` (via SIME batch predictor)


@@ -0,0 +1,183 @@
# Tylosin Splicing System Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Build a pipeline to splice SIME-identified fragments onto the Tylosin scaffold at positions 7, 15, and 16, and predict their antibacterial activity.
**Architecture:** A Python-based ETL pipeline using RDKit for structural manipulation (`macro_split`) and PyTorch for activity prediction (`SIME`).
**Tech Stack:** Python, RDKit, Pandas, PyTorch (SIME), Pytest.
---
### Task 1: Environment & Project Structure Setup
**Files:**
- Create: `scripts/tylosin_splicer.py` (Main entry point stub)
- Create: `src/splicing/__init__.py`
- Create: `src/splicing/scaffold_prep.py`
- Create: `tests/test_splicing.py`
**Step 1: Create directory structure**
```bash
mkdir -p src/splicing
touch src/splicing/__init__.py
```
**Step 2: Create a basic test to verify environment**
Write a test that imports both `macro_split` and `SIME` modules to ensure the workspace handles imports correctly.
```python
# tests/test_env_integration.py
import sys

# Hack for now, will clean up later: make both projects importable.
sys.path.append("/home/zly/project/SIME")
sys.path.append("/home/zly/project/merge/macro_split")


def test_imports():
    # The test passes if both imports resolve without raising ImportError.
    from src.ring_numbering import get_macrolactone_numbering
    from utils.mole_predictor import ParallelBroadSpectrumPredictor
```
**Step 3: Run test**
`pixi run pytest tests/test_env_integration.py`
---
### Task 2: Scaffold Preparation (The "Socket")
**Files:**
- Modify: `src/splicing/scaffold_prep.py`
- Test: `tests/test_scaffold_prep.py`
**Step 1: Write failing test**
Test that `prepare_tylosin_scaffold` returns a molecule with dummy atoms at positions 7, 15, and 16.
```python
# tests/test_scaffold_prep.py
from rdkit import Chem
from src.splicing.scaffold_prep import prepare_tylosin_scaffold

# Simplified/Example
TYLOSIN_SMILES = "CCC1OC(=O)C(C)C(O)C(C)C(O)C(C)C(OC2CC(C)(O)C(O)C(C)O2)CC(C)C(=O)C=CC=C1COC3OS(C)C(O)C(N(C)C)C3O"


def test_scaffold_prep():
    scaffold, mapping = prepare_tylosin_scaffold(TYLOSIN_SMILES, positions=[7, 15, 16])
    # Check if we have mapped atoms
    assert 7 in mapping
    assert 15 in mapping
    assert 16 in mapping
    # Check if they are dummy atoms or have specific isotopes
```
**Step 2: Implement `prepare_tylosin_scaffold`**
Use `get_macrolactone_numbering` to find the atom indices.
Use `RWMol` to replace the side chains at those indices with a dummy atom (atomic number 0) or an isotope-labeled atom.
**Step 3: Run tests**
`pixi run pytest tests/test_scaffold_prep.py`
---
### Task 3: Fragment Activation (The "Plug")
**Files:**
- Create: `src/splicing/fragment_prep.py`
- Test: `tests/test_fragment_prep.py`
**Step 1: Write failing test**
Test that `activate_fragment` takes a SMILES and returns a molecule with *one* attachment point.
```python
# tests/test_fragment_prep.py
from rdkit import Chem
from src.splicing.fragment_prep import activate_fragment


def test_activate_fragment_smart():
    # Fragment with -OH
    frag_smiles = "CCO"
    activated = activate_fragment(frag_smiles, strategy="smart")
    # Should find the O and replace H with attachment point
    assert "*" in Chem.MolToSmiles(activated)


def test_activate_fragment_random():
    frag_smiles = "CCCCC"
    activated = activate_fragment(frag_smiles, strategy="random")
    assert "*" in Chem.MolToSmiles(activated)
```
**Step 2: Implement `activate_fragment`**
- **Smart**: Look for -NH2, -OH, -SH. Use SMARTS to find them, replace a H with `*`.
- **Random**: Pick a random Carbon, replace a H with `*`.
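One possible shape for `activate_fragment` is sketched below. It is a minimal illustration rather than the planned implementation: the SMARTS patterns, the `seed` parameter, and the choice of the first match for the smart strategy are all assumptions.

```python
import random

from rdkit import Chem

# Assumed SMARTS for heteroatoms that still carry a hydrogen: amines, hydroxyls, thiols.
SMART_PATTERNS = [Chem.MolFromSmarts(s) for s in ("[NX3;H2,H1]", "[OX2H]", "[SX2H]")]


def activate_fragment(smiles, strategy="smart", seed=0):
    """Return a copy of the fragment with one dummy atom (*) attached."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")

    if strategy == "smart":
        sites = [match[0] for patt in SMART_PATTERNS for match in mol.GetSubstructMatches(patt)]
    elif strategy == "random":
        sites = [atom.GetIdx() for atom in mol.GetAtoms()
                 if atom.GetAtomicNum() == 6 and atom.GetTotalNumHs() > 0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if not sites:
        raise ValueError(f"no attachment site found in {smiles}")

    site = sites[0] if strategy == "smart" else random.Random(seed).choice(sites)

    rw = Chem.RWMol(mol)
    atom = rw.GetAtomWithIdx(site)
    # Implicit hydrogens are recomputed on sanitization; explicit ones must be decremented.
    if atom.GetNumExplicitHs() > 0:
        atom.SetNumExplicitHs(atom.GetNumExplicitHs() - 1)
    dummy_idx = rw.AddAtom(Chem.Atom(0))
    rw.AddBond(site, dummy_idx, Chem.BondType.SINGLE)
    Chem.SanitizeMol(rw)
    return rw.GetMol()


print(Chem.MolToSmiles(activate_fragment("CCO", strategy="smart")))
print(Chem.MolToSmiles(activate_fragment("CCCCC", strategy="random")))
```

Seeding the random strategy keeps the generated library reproducible between runs, which makes ranked results easier to compare.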
**Step 3: Run tests**
`pixi run pytest tests/test_fragment_prep.py`
---
### Task 4: Splicing Engine (The Assembly)
**Files:**
- Create: `src/splicing/engine.py`
- Test: `tests/test_splicing_engine.py`
**Step 1: Write failing test**
Test connecting an activated fragment to the scaffold.
```python
# tests/test_splicing_engine.py
from rdkit import Chem
from src.splicing.engine import splice_molecule


def test_splice_molecules():
    scaffold = ...  # prepared scaffold
    fragment = ...  # activated fragment
    product = splice_molecule(scaffold, fragment, position=7)
    assert product is not None
    assert Chem.MolToSmiles(product) != Chem.MolToSmiles(scaffold)
```
**Step 2: Implement `splice_molecule`**
Use `Chem.ReplaceSubstructs` or `Chem.rdChemReactions`.
Ensure the connection is chemically valid.
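A minimal sketch of the `Chem.ReplaceSubstructs` route follows. It is simplified on purpose: it assumes the scaffold carries a single anchor and omits the `position` argument, so selecting one of several anchors (for example by isotope label) is left out. The helper `_strip_anchor` and the `attach_here` property name are hypothetical.

```python
from rdkit import Chem

ANCHOR = Chem.MolFromSmarts("[#0]")  # matches dummy atoms only


def _strip_anchor(fragment):
    """Drop the fragment's dummy atom; return the bare fragment and its attachment index."""
    rw = Chem.RWMol(fragment)
    dummy = next(atom for atom in rw.GetAtoms() if atom.GetAtomicNum() == 0)
    # Tag the neighbor so it can be found again after atom indices shift.
    dummy.GetNeighbors()[0].SetBoolProp("attach_here", True)
    rw.RemoveAtom(dummy.GetIdx())
    Chem.SanitizeMol(rw)
    attach_idx = next(atom.GetIdx() for atom in rw.GetAtoms() if atom.HasProp("attach_here"))
    return rw.GetMol(), attach_idx


def splice_molecule(scaffold, fragment):
    """Replace the scaffold's anchor with the fragment; return None if the product is invalid."""
    if not scaffold.HasSubstructMatch(ANCHOR):
        raise ValueError("scaffold has no anchor atom")
    bare, attach_idx = _strip_anchor(fragment)
    products = Chem.ReplaceSubstructs(
        scaffold, ANCHOR, bare, replacementConnectionPoint=attach_idx
    )
    product = products[0]
    try:
        Chem.SanitizeMol(product)  # rejects chemically invalid connections
    except ValueError:
        return None
    return product


scaffold = Chem.MolFromSmiles("[*:1]C1CCCCC1")  # toy scaffold with one anchor
fragment = Chem.MolFromSmiles("*OCC")           # activated ethanol
product = splice_molecule(scaffold, fragment)
print(Chem.MolToSmiles(product))
```

Sanitizing the product is the cheapest validity check available here; it catches valence errors but says nothing about 3D feasibility, which is the Conformer Validator's job in the design.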
**Step 3: Run tests**
`pixi run pytest tests/test_splicing_engine.py`
---
### Task 5: Prediction Pipeline Integration
**Files:**
- Create: `src/splicing/pipeline.py`
- Test: `tests/test_pipeline.py`
**Step 1: Write failing test (Mocked)**
Mock the SIME predictor to avoid loading heavy models during unit tests.
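```python
# tests/test_pipeline.py
from src.splicing.pipeline import run_splicing_pipeline

# Assumes TYLOSIN_SMILES is defined as in tests/test_scaffold_prep.py.


def test_pipeline_flow(mocker):  # `mocker` is the pytest-mock fixture
    # Mock predictor
    mocker.patch('utils.mole_predictor.ParallelBroadSpectrumPredictor')
    frags = ["CCO", "CCN"]
    results = run_splicing_pipeline(TYLOSIN_SMILES, frags, positions=[7])
    assert len(results) > 0
```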
**Step 2: Implement `run_splicing_pipeline`**
1. Prep scaffold.
2. Loop fragments -> activate -> splice.
3. Batch generate SMILES.
4. Call `ParallelBroadSpectrumPredictor`.
5. Return results.
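The five steps can be sketched with the stages passed in as callables, which also makes the flow testable without loading SIME. This is an illustrative variant, not the planned `run_splicing_pipeline` signature: the name `run_pipeline_sketch` and the `prepare`, `activate`, `splice`, and `predict` parameters are assumptions.

```python
def run_pipeline_sketch(scaffold_smiles, fragments, positions,
                        prepare, activate, splice, predict, batch_size=128):
    """Prep scaffold, splice every fragment at every position, then score in batches."""
    scaffold = prepare(scaffold_smiles, positions)

    candidates = []
    for frag in fragments:
        activated = activate(frag)
        for pos in positions:
            product = splice(scaffold, activated, pos)
            if product is not None:  # skip failed splices
                candidates.append(product)

    results = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        results.extend(zip(batch, predict(batch)))
    return results


# String-based stand-ins replace the chemistry and the model.
results = run_pipeline_sketch(
    "SCAFFOLD", ["CCO", "CCN"], [7],
    prepare=lambda smiles, positions: smiles,
    activate=lambda frag: "*" + frag,
    splice=lambda scaffold, frag, pos: f"{scaffold}@{pos}:{frag}",
    predict=lambda batch: [0.5] * len(batch),
)
print(results)
```

Injecting the predictor as an argument is one way to sidestep patch-target issues in the mocked unit test, since the test can hand in a stub directly.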
**Step 3: Run tests**
`pixi run pytest tests/test_pipeline.py`
---
### Task 6: CLI and Final Execution
**Files:**
- Create: `scripts/run_tylosin_optimization.py`
**Step 1: Implement CLI**
Arguments: `--input-scaffold`, `--fragment-csv`, `--positions`, `--output`.
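A possible `argparse` layout for these four arguments is shown below. Whether `--input-scaffold` takes a SMILES string or a file path is not stated in the plan; the sketch assumes a SMILES string, and the default positions are likewise an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Splice fragments onto the Tylosin scaffold and rank the products."
    )
    parser.add_argument("--input-scaffold", required=True,
                        help="Scaffold SMILES (assumed to be passed as a string)")
    parser.add_argument("--fragment-csv", required=True,
                        help="CSV file of SIME fragments")
    parser.add_argument("--positions", type=int, nargs="+", default=[7, 15, 16],
                        help="Ring positions to modify")
    parser.add_argument("--output", required=True,
                        help="Path for the ranked results CSV")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args([
    "--input-scaffold", "CCO",
    "--fragment-csv", "fragments.csv",
    "--positions", "7", "15",
    "--output", "ranked.csv",
])
print(args.input_scaffold, args.fragment_csv, args.positions, args.output)
```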
**Step 2: Integration Test**
Run with a small subset of the fragment CSV (head -n 10).
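For a subset taken inside Python rather than with `head`, a small helper along these lines would do. Note that it returns the header plus `n` data rows, whereas `head -n 10` counts the header as one of its ten lines. The column names in the sample data are made up for the example.

```python
import csv
import io
from itertools import islice


def head_rows(handle, n=10):
    """Return the header and the first n data rows of an open CSV file."""
    reader = csv.reader(handle)
    header = next(reader)
    return header, list(islice(reader, n))


# In-memory stand-in for a fragment CSV with 25 data rows.
sample = io.StringIO(
    "smiles,score\n" + "".join(f"{'C' * i}O,{i}\n" for i in range(1, 26))
)
header, rows = head_rows(sample, n=10)
print(header, len(rows), rows[0])
```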