macrolactone-toolkit/2026-03-18-standard-macrocycle-classification-plan.md


# Standard vs Non-Standard Macrocycle Classification
## Summary
Add a formal molecule-level classification layer on top of the current `macro_lactone_toolkit` detection logic so the toolkit can distinguish:
- `standard_macrolactone`
- `non_standard_macrocycle`
- `not_macrolactone`
This classification must support the two new rejection rules:
1. After ring numbering is assigned, positions `3..N` must all be carbon atoms. If any atom at positions `3..N` is not carbon, classify the molecule as `non_standard_macrocycle`.
2. If multiple candidate macrolactone rings share ring atoms (that is, they fall in the same connected component of the ring-overlap graph), classify the molecule as `non_standard_macrocycle`. Only overlapping candidate rings trigger this rule; disconnected or non-overlapping candidates do not.
Do not rely on a “largest ring” assumption. Base detection on RDKit ring candidates from `RingInfo.AtomRings()` plus explicit lactone validation, then apply the new standard/non-standard filters.
## Public API And Output Changes
Add a new result type, e.g. `MacrocycleClassificationResult`, with these fields:
- `smiles: str`
- `classification: Literal["standard_macrolactone", "non_standard_macrocycle", "not_macrolactone"]`
- `ring_size: int | None`
- `primary_reason_code: str | None`
- `primary_reason_message: str | None`
- `all_reason_codes: list[str]`
- `all_reason_messages: list[str]`
- `candidate_ring_sizes: list[int]`
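A minimal sketch of what this result type could look like. The field names and types come from the list above; the choice of a frozen `dataclass`, the `Classification` alias, and the default values are assumptions made for illustration, not the shape of the existing models module.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Literal

# Alias name is illustrative; the three literal values come from the plan.
Classification = Literal[
    "standard_macrolactone",
    "non_standard_macrocycle",
    "not_macrolactone",
]


@dataclass(frozen=True)
class MacrocycleClassificationResult:
    """Molecule-level classification payload (defaults are assumptions)."""

    smiles: str
    classification: Classification
    ring_size: int | None = None
    primary_reason_code: str | None = None
    primary_reason_message: str | None = None
    all_reason_codes: list[str] = field(default_factory=list)
    all_reason_messages: list[str] = field(default_factory=list)
    candidate_ring_sizes: list[int] = field(default_factory=list)
```

With these defaults, a `standard_macrolactone` result only needs `smiles`, `classification`, `ring_size`, and `candidate_ring_sizes`; the reason fields stay empty unless a rejection sets them.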
Add a new public API on `MacroLactoneAnalyzer`:
- `classify_macrocycle(mol_input: str | Chem.Mol, ring_size: int | None = None) -> MacrocycleClassificationResult`
Behavior:
- If `ring_size` is omitted, inspect all 12-20 membered lactone candidates.
- If `ring_size` is provided, restrict candidate selection to that size before classification.
- Invalid SMILES should keep raising the existing detection exception path; do not encode invalid input as a classification result.
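- For `standard_macrolactone`, `ring_size` must be the accepted ring size, `primary_reason_code` and `primary_reason_message` must be `None`, and `all_reason_codes` and `all_reason_messages` must be empty lists.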
- For `non_standard_macrocycle`, `ring_size` should be the candidate ring size if exactly one size remains relevant, otherwise `None`.
- For `not_macrolactone`, return no ring size and a reason describing why no valid 12-20 lactone candidate survived.
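The size-handling rules above can be sketched as two small helpers. Both function names and the constants are hypothetical, and the input is assumed to be a plain list of ring sizes for candidates that already passed lactone validation.

```python
from __future__ import annotations

MIN_RING_SIZE = 12  # bounds taken from the plan's 12-20 range
MAX_RING_SIZE = 20


def select_candidate_sizes(
    detected_sizes: list[int], ring_size: int | None = None
) -> list[int]:
    """Keep sizes in 12..20, optionally restricted to one requested size."""
    in_range = [s for s in detected_sizes if MIN_RING_SIZE <= s <= MAX_RING_SIZE]
    if ring_size is not None:
        in_range = [s for s in in_range if s == ring_size]
    return sorted(set(in_range))


def reported_ring_size(classification: str, relevant_sizes: list[int]) -> int | None:
    """Return the single relevant size, or None when it is absent or ambiguous."""
    if classification == "not_macrolactone":
        return None
    unique = set(relevant_sizes)
    return next(iter(unique)) if len(unique) == 1 else None
```

An empty result from `select_candidate_sizes` when `ring_size` was given corresponds to the `requested_ring_size_not_found` case; an empty result without `ring_size` corresponds to `no_lactone_ring_in_12_to_20_range`.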
Reason codes must be decision-complete and fixed:
- `contains_non_carbon_ring_atoms_outside_positions_1_2`
- `multiple_overlapping_macrocycle_candidates`
- `no_lactone_ring_in_12_to_20_range`
- `requested_ring_size_not_found`
Reason messages must be short English sentences:
- `Ring positions 3..N contain non-carbon atoms.`
- `Overlapping macrolactone candidate rings were detected.`
- `No 12-20 membered lactone ring was detected.`
- `The requested ring size was not detected as a lactone ring.`
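One way to keep the four codes and their messages paired is a single lookup table. The code and message strings below are copied from the lists above; the names `REASON_MESSAGES` and `reason_message` are illustrative assumptions.

```python
# Fixed reason codes mapped to their English messages (strings from the plan).
REASON_MESSAGES: dict[str, str] = {
    "contains_non_carbon_ring_atoms_outside_positions_1_2": (
        "Ring positions 3..N contain non-carbon atoms."
    ),
    "multiple_overlapping_macrocycle_candidates": (
        "Overlapping macrolactone candidate rings were detected."
    ),
    "no_lactone_ring_in_12_to_20_range": (
        "No 12-20 membered lactone ring was detected."
    ),
    "requested_ring_size_not_found": (
        "The requested ring size was not detected as a lactone ring."
    ),
}


def reason_message(code: str) -> str:
    """Look up the message for a reason code; unknown codes raise KeyError."""
    return REASON_MESSAGES[code]
```

Raising on unknown codes keeps the set closed, so a typo in a code fails immediately rather than producing a result with a missing message.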
Update CLI `macro-lactone-toolkit analyze` to return this classification result shape for single-SMILES mode and row-wise CSV mode.
Do not add a new CLI subcommand. Keep `analyze` as the classification surface.
## Implementation Changes
### Detection And Candidate Grouping
In the current core detection module:
- Keep the existing lactone-ring candidate search based on `RingInfo.AtomRings()` and lactone atom validation.
- Add an overlap-group pass over candidate rings:
  - Build a graph where two candidates are connected if their ring atom sets intersect.
  - Compute connected components on this graph.
  - If any connected component contains more than one candidate, classify as `non_standard_macrocycle` with `multiple_overlapping_macrocycle_candidates`.
- Do not treat disconnected candidate rings as overlapping.
- Keep `candidate_ring_sizes` as the sorted unique sizes from the filtered candidate list.
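The overlap-group pass can be sketched in plain Python. This assumes each candidate ring arrives as a tuple of atom indices, which is the shape `RingInfo.AtomRings()` yields; the function names are hypothetical.

```python
from __future__ import annotations


def group_overlapping_rings(rings: list[tuple[int, ...]]) -> list[list[int]]:
    """Return connected components of the overlap graph as lists of ring indices."""
    atom_sets = [set(ring) for ring in rings]
    n = len(rings)

    # Two candidates are adjacent when their ring atom sets intersect.
    adjacency: dict[int, set[int]] = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if atom_sets[i] & atom_sets[j]:
                adjacency[i].add(j)
                adjacency[j].add(i)

    # Depth-first traversal to collect each connected component once.
    seen: set[int] = set()
    components: list[list[int]] = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


def has_overlapping_candidates(rings: list[tuple[int, ...]]) -> bool:
    """True when any component holds more than one candidate ring."""
    return any(len(c) > 1 for c in group_overlapping_rings(rings))
```

The pairwise comparison is quadratic in the number of candidates, which should be acceptable because a molecule rarely has more than a handful of 12-20 membered lactone candidates.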
### Standard Macrocycle Filter
For any single candidate that survives overlap rejection:
- Build numbering exactly as today: position 1 is the lactone carbonyl carbon, position 2 is the ring ester oxygen.
- Inspect positions `3..N`.
- Every atom at positions `3..N` must have atomic number 6.
- If any position `3..N` is not carbon, classify as `non_standard_macrocycle` with `contains_non_carbon_ring_atoms_outside_positions_1_2`.
This rule must reject ring peptides and other heteroatom-containing macrocycles even if they contain a lactone bond.
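A minimal sketch of the positions `3..N` check, assuming the numbering step has already produced the ring's atomic numbers in position order (index 0 is position 1). The helper names are hypothetical, and the sketch deliberately works on plain integers rather than RDKit atoms.

```python
from __future__ import annotations

CARBON = 6  # atomic number required at positions 3..N


def non_carbon_positions(numbered_atomic_nums: list[int]) -> list[int]:
    """Return 1-based ring positions in 3..N whose atom is not carbon."""
    return [
        position
        for position, atomic_num in enumerate(numbered_atomic_nums, start=1)
        if position >= 3 and atomic_num != CARBON
    ]


def passes_standard_filter(numbered_atomic_nums: list[int]) -> bool:
    """True when every atom at positions 3..N is carbon."""
    return not non_carbon_positions(numbered_atomic_nums)
```

For example, a 12-membered ring numbered as carbonyl carbon, ester oxygen, then ten carbons passes, while the same ring with a nitrogen at position 5 is reported as `[5]` and would be rejected with `contains_non_carbon_ring_atoms_outside_positions_1_2`.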
### Fragmenter Integration
Update `MacrolactoneFragmenter` so that:
- `number_molecule()` and `fragment_molecule()` first call `classify_macrocycle()`.
- They only proceed when classification is `standard_macrolactone`.
- For `non_standard_macrocycle` or `not_macrolactone`, raise the existing detection exception type with a message that includes the classification and the primary reason code.
- Do not change fragmentation output semantics for standard macrolactones.
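The guard described above can be sketched as a small pre-check. `MacrolactoneDetectionError` is a hypothetical stand-in for the existing detection exception type, and `ensure_standard_macrolactone` is an illustrative name; the real fragmenter would pass in the values from `classify_macrocycle()`.

```python
from __future__ import annotations


class MacrolactoneDetectionError(ValueError):
    """Hypothetical stand-in for the toolkit's existing detection exception."""


def ensure_standard_macrolactone(
    classification: str, primary_reason_code: str | None
) -> None:
    """Raise unless the molecule was classified as a standard macrolactone."""
    if classification == "standard_macrolactone":
        return
    # The message carries both the classification and the primary reason code.
    raise MacrolactoneDetectionError(
        f"Molecule classified as {classification} "
        f"(primary reason: {primary_reason_code})."
    )
```

Calling this at the top of `number_molecule()` and `fragment_molecule()` would leave the code path for standard macrolactones untouched, since the guard returns without side effects in that case.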
### Files To Change
Concentrate changes in:
- `src/macro_lactone_toolkit/_core.py`
- `src/macro_lactone_toolkit/analyzer.py`
- `src/macro_lactone_toolkit/cli.py`
Add the new result type in the existing models module instead of inventing a second schema location.
## Test Plan
Add tests first, verify they fail, then implement.
Required test cases:
- Standard 12, 14, 16, and 20 membered macrolactones still classify as `standard_macrolactone` and return the correct `ring_size`.
- A macrocycle with a valid lactone bond but a non-carbon atom at position `3..N` classifies as `non_standard_macrocycle` with:
  - `primary_reason_code == "contains_non_carbon_ring_atoms_outside_positions_1_2"`
  - the expected English message
- An overlapping-candidate example classifies as `non_standard_macrocycle` with:
  - `primary_reason_code == "multiple_overlapping_macrocycle_candidates"`
  - the expected English message
- A non-lactone macrocycle classifies as `not_macrolactone` with `no_lactone_ring_in_12_to_20_range`.
- Explicit `ring_size` with no candidate of that size returns `not_macrolactone` with `requested_ring_size_not_found`.
- `macro-lactone-toolkit analyze --smiles ...` returns the new fields for:
  - one standard example
  - one heteroatom-rejected example
  - one overlap-rejected example
- Existing numbering, fragmentation, labeled/plain dummy round-trip, and splicing tests remain green for standard macrolactones.
Test fixture guidance:
- Reuse the existing synthetic macrocycle helper for standard rings.
- Extend the helper or add a new fixture helper for:
  - a lactone-containing ring with one non-carbon atom at a numbered position beyond 2
  - an overlapping-candidate ring example specifically built to share ring atoms between candidate rings
## Assumptions And Defaults
- Classification is molecule-level, but the overlap rejection only applies to overlapping candidate rings, not disconnected candidates elsewhere in the molecule.
- Invalid SMILES remain exceptions, not classification payloads.
- `analyze` becomes the official classification output; `get_valid_ring_sizes()` may remain as a lower-level helper.
- The implementation should treat RDKit ring APIs as candidate generators only; they do not define what counts as a standard macrolactone.