macrolactone-toolkit/2026-03-18-standard-macrocycle-classification-plan.md


Standard vs Non-Standard Macrocycle Classification

Summary

Add a formal molecule-level classification layer on top of the current macro_lactone_toolkit detection logic so the toolkit can distinguish:

  • standard_macrolactone
  • non_standard_macrocycle
  • not_macrolactone

This classification must support the two new rejection rules:

  1. After ring numbering is assigned, positions 3..N must all be carbon atoms. If any atom at positions 3..N is not carbon, classify the molecule as non_standard_macrocycle.
  2. If multiple candidate macrolactone rings share ring atoms (that is, they fall in the same connected component of the candidate overlap graph), classify the molecule as non_standard_macrocycle. Only overlapping candidate rings trigger this rule; disconnected or non-overlapping candidates do not.

Do not rely on a “largest ring” assumption. Base detection on RDKit ring candidates from RingInfo.AtomRings() plus explicit lactone validation, then apply the new standard/non-standard filters.

Public API And Output Changes

Add a new result type, e.g. MacrocycleClassificationResult, with these fields:

  • smiles: str
  • classification: Literal["standard_macrolactone", "non_standard_macrocycle", "not_macrolactone"]
  • ring_size: int | None
  • primary_reason_code: str | None
  • primary_reason_message: str | None
  • all_reason_codes: list[str]
  • all_reason_messages: list[str]
  • candidate_ring_sizes: list[int]
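
A minimal sketch of this result type, assuming a frozen dataclass is acceptable. The existing models module may use a different base (for example a plain class or a validation library), so treat the decorator and the defaults as assumptions; only the field names and types come from the list above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Literal

# The three classification labels named in this plan.
Classification = Literal[
    "standard_macrolactone",
    "non_standard_macrocycle",
    "not_macrolactone",
]


@dataclass(frozen=True)
class MacrocycleClassificationResult:
    """Molecule-level classification payload (sketch, defaults are assumptions)."""

    smiles: str
    classification: Classification
    ring_size: int | None = None
    primary_reason_code: str | None = None
    primary_reason_message: str | None = None
    all_reason_codes: list[str] = field(default_factory=list)
    all_reason_messages: list[str] = field(default_factory=list)
    candidate_ring_sizes: list[int] = field(default_factory=list)
```

With these defaults, a standard result only needs smiles, classification, ring_size, and candidate_ring_sizes; the reason fields stay empty.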

Add a new public API on MacroLactoneAnalyzer:

  • classify_macrocycle(mol_input: str | Chem.Mol, ring_size: int | None = None) -> MacrocycleClassificationResult

Behavior:

  • If ring_size is omitted, inspect all 12- to 20-membered lactone candidates.
  • If ring_size is provided, restrict candidate selection to that size before classification.
  • Invalid SMILES should keep raising the existing detection exception path; do not encode invalid input as a classification result.
  • For standard_macrolactone, ring_size must be the accepted ring size and all reason fields must be empty.
  • For non_standard_macrocycle, ring_size should be the candidate ring size if exactly one size remains relevant, otherwise None.
  • For not_macrolactone, return ring_size as None and a reason describing why no valid 12- to 20-membered lactone candidate survived.
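
The candidate selection and ring_size reporting described above can be sketched as two small pure functions. This is an illustration under stated assumptions: select_candidates and reported_ring_size are hypothetical helper names, the input rings are assumed to be lactone-validated atom-index tuples, and "exactly one size remains relevant" is read here as "all remaining candidates share one size".

```python
from __future__ import annotations

from typing import Sequence

MIN_RING_SIZE = 12
MAX_RING_SIZE = 20


def select_candidates(
    lactone_rings: Sequence[Sequence[int]],
    ring_size: int | None = None,
) -> tuple[list[tuple[int, ...]], str | None]:
    """Return (selected rings, reason code); the reason code is None on success."""
    in_range = [
        tuple(ring)
        for ring in lactone_rings
        if MIN_RING_SIZE <= len(ring) <= MAX_RING_SIZE
    ]
    if ring_size is not None:
        # An explicit size restricts selection before any classification.
        selected = [ring for ring in in_range if len(ring) == ring_size]
        if not selected:
            return [], "requested_ring_size_not_found"
        return selected, None
    if not in_range:
        return [], "no_lactone_ring_in_12_to_20_range"
    return in_range, None


def reported_ring_size(rings: Sequence[Sequence[int]]) -> int | None:
    """Report a ring size only when every remaining candidate has the same size."""
    sizes = {len(ring) for ring in rings}
    return next(iter(sizes)) if len(sizes) == 1 else None
```

An empty selection maps to not_macrolactone with the returned reason code; a non-empty selection continues to the overlap and carbon-only filters.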

Reason codes must be decision-complete and fixed:

  • contains_non_carbon_ring_atoms_outside_positions_1_2
  • multiple_overlapping_macrocycle_candidates
  • no_lactone_ring_in_12_to_20_range
  • requested_ring_size_not_found

Reason messages must be short English sentences:

  • Ring positions 3..N contain non-carbon atoms.
  • Overlapping macrolactone candidate rings were detected.
  • No 12-20 membered lactone ring was detected.
  • The requested ring size was not detected as a lactone ring.
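
One way to keep codes and messages fixed is a single lookup table, so a message can never drift from its code. The codes and sentences below are copied from the lists above; the names REASON_MESSAGES and reason_message are hypothetical.

```python
from __future__ import annotations

# Fixed reason codes mapped to their fixed English messages.
REASON_MESSAGES: dict[str, str] = {
    "contains_non_carbon_ring_atoms_outside_positions_1_2": (
        "Ring positions 3..N contain non-carbon atoms."
    ),
    "multiple_overlapping_macrocycle_candidates": (
        "Overlapping macrolactone candidate rings were detected."
    ),
    "no_lactone_ring_in_12_to_20_range": (
        "No 12-20 membered lactone ring was detected."
    ),
    "requested_ring_size_not_found": (
        "The requested ring size was not detected as a lactone ring."
    ),
}


def reason_message(code: str) -> str:
    """Look up the message for a code; an unknown code raises KeyError."""
    return REASON_MESSAGES[code]
```

Raising on an unknown code, rather than returning a fallback string, keeps the set of codes closed.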

Update CLI macro-lactone-toolkit analyze to return this classification result shape for single-SMILES mode and row-wise CSV mode.

Do not add a new CLI subcommand. Keep analyze as the classification surface.

Implementation Changes

Detection And Candidate Grouping

In the current core detection module:

  • Keep the existing lactone-ring candidate search based on RingInfo.AtomRings() and lactone atom validation.
  • Add an overlap-group pass over candidate rings:
    • Build a graph where two candidates are connected if their ring atom sets intersect.
    • Compute connected components on this graph.
    • If any connected component contains more than one candidate, classify as non_standard_macrocycle with multiple_overlapping_macrocycle_candidates.
  • Do not treat disconnected candidate rings as overlapping.
  • Keep candidate_ring_sizes as the sorted unique sizes from the filtered candidate list.
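
The overlap-group pass can be sketched without RDKit, working directly on atom-index tuples of the shape that RingInfo.AtomRings() returns. The function names are hypothetical, and the rings are assumed to have already passed lactone validation.

```python
from __future__ import annotations

from typing import Sequence


def overlap_components(rings: Sequence[Sequence[int]]) -> list[list[int]]:
    """Group candidate ring indices into connected components of the overlap graph."""
    atom_sets = [set(ring) for ring in rings]
    count = len(atom_sets)

    # Two candidates are connected when their ring atom sets intersect.
    adjacency: dict[int, set[int]] = {i: set() for i in range(count)}
    for i in range(count):
        for j in range(i + 1, count):
            if atom_sets[i] & atom_sets[j]:
                adjacency[i].add(j)
                adjacency[j].add(i)

    # Depth-first search to collect connected components.
    seen: set[int] = set()
    components: list[list[int]] = []
    for start in range(count):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component: list[int] = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


def has_overlapping_candidates(rings: Sequence[Sequence[int]]) -> bool:
    """True when any component holds more than one candidate ring."""
    return any(len(component) > 1 for component in overlap_components(rings))
```

For a yes/no answer a pairwise intersection test would be enough; returning the components keeps the grouping available for later reporting.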

Standard Macrocycle Filter

For any single candidate that survives overlap rejection:

  • Build numbering exactly as today: position 1 is the lactone carbonyl carbon, position 2 is the ring ester oxygen.
  • Inspect positions 3..N.
  • Every atom at positions 3..N must have atomic number 6.
  • If any position 3..N is not carbon, classify as non_standard_macrocycle with contains_non_carbon_ring_atoms_outside_positions_1_2.

This rule must reject cyclic peptides and other macrocycles with ring heteroatoms at positions 3..N, even if they contain a lactone bond.
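
The carbon-only check can be sketched on a plain list of atomic numbers ordered by ring position. The helper name is hypothetical; in the toolkit the list would presumably be built from the numbered ring, for example via RDKit's GetAtomWithIdx(idx).GetAtomicNum() on each ring atom.

```python
from __future__ import annotations

from typing import Sequence

CARBON = 6


def first_non_carbon_position(atomic_numbers: Sequence[int]) -> int | None:
    """Return the first ring position in 3..N that is not carbon, else None.

    atomic_numbers[0] is position 1 (lactone carbonyl carbon) and
    atomic_numbers[1] is position 2 (ring ester oxygen); both are skipped.
    """
    for position, atomic_number in enumerate(atomic_numbers, start=1):
        if position >= 3 and atomic_number != CARBON:
            return position
    return None
```

A None result means the candidate passes this filter; any integer result maps to contains_non_carbon_ring_atoms_outside_positions_1_2.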

Fragmenter Integration

Update MacrolactoneFragmenter so that:

  • number_molecule() and fragment_molecule() first call classify_macrocycle().
  • They only proceed when classification is standard_macrolactone.
  • For non_standard_macrocycle or not_macrolactone, raise the existing detection exception type with a message that includes the classification and the primary reason code.
  • Do not change fragmentation output semantics for standard macrolactones.
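
The guard that number_molecule() and fragment_molecule() would run first could look like the sketch below. MacrolactoneDetectionError is a hypothetical stand-in for the existing detection exception type, and the message format is an assumption; the plan only requires that the message include the classification and the primary reason code.

```python
from __future__ import annotations


class MacrolactoneDetectionError(ValueError):
    """Hypothetical stand-in for the toolkit's existing detection exception."""


def require_standard_macrolactone(
    classification: str,
    primary_reason_code: str | None,
) -> None:
    """Raise unless the molecule was classified as a standard macrolactone."""
    if classification == "standard_macrolactone":
        return
    raise MacrolactoneDetectionError(
        f"Molecule classified as {classification} "
        f"(reason: {primary_reason_code})."
    )
```

Because the guard returns without side effects for standard macrolactones, the fragmentation path after it stays unchanged.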

Files To Change

Concentrate changes in:

  • src/macro_lactone_toolkit/_core.py
  • src/macro_lactone_toolkit/analyzer.py
  • src/macro_lactone_toolkit/cli.py

Add the new result type in the existing models module instead of inventing a second schema location.

Test Plan

Add tests first, verify they fail, then implement.

Required test cases:

  • Standard 12-, 14-, 16-, and 20-membered macrolactones still classify as standard_macrolactone and return the correct ring_size.
  • A macrocycle with a valid lactone bond but a non-carbon atom at position 3..N classifies as non_standard_macrocycle with:
    • primary_reason_code == "contains_non_carbon_ring_atoms_outside_positions_1_2"
    • the expected English message
  • An overlapping-candidate example classifies as non_standard_macrocycle with:
    • primary_reason_code == "multiple_overlapping_macrocycle_candidates"
    • the expected English message
  • A non-lactone macrocycle classifies as not_macrolactone with no_lactone_ring_in_12_to_20_range.
  • Explicit ring_size with no candidate of that size returns not_macrolactone with requested_ring_size_not_found.
  • macro-lactone-toolkit analyze --smiles ... returns the new fields for:
    • one standard example
    • one heteroatom-rejected example
    • one overlap-rejected example
  • Existing numbering, fragmentation, labeled/plain dummy round-trip, and splicing tests remain green for standard macrolactones.

Test fixture guidance:

  • Reuse the existing synthetic macrocycle helper for standard rings.
  • Extend the helper or add a new fixture helper for:
    • a lactone-containing ring with one non-carbon atom at a numbered position beyond 2
    • an overlapping-candidate ring example specifically built to share ring atoms between candidate rings

Assumptions And Defaults

  • Classification is molecule-level, but the overlap rejection only applies to overlapping candidate rings, not disconnected candidates elsewhere in the molecule.
  • Invalid SMILES remain exceptions, not classification payloads.
  • analyze becomes the official classification output; get_valid_ring_sizes() may remain as a lower-level helper.
  • The implementation should keep using RDKit ring APIs as candidate generators only; the lactone validation and the standard/non-standard filters, not RDKit ring perception, define what counts as a standard macrolactone.