Standard vs Non-Standard Macrocycle Classification
Summary
Add a formal molecule-level classification layer on top of the current macro_lactone_toolkit detection logic so the toolkit can distinguish:
- standard_macrolactone
- non_standard_macrocycle
- not_macrolactone
This classification must support the two new rejection rules:
- After ring numbering is assigned, positions 3..N must all be carbon atoms. If any atom at positions 3..N is not carbon, classify the molecule as non_standard_macrocycle.
- If multiple candidate macrolactone rings overlap in the same atom set graph, classify the molecule as non_standard_macrocycle. Use only overlapping candidate rings for this rule; disconnected or non-overlapping candidates do not trigger this specific rejection.
Do not rely on a “largest ring” assumption. Base detection on RDKit ring candidates from RingInfo.AtomRings() plus explicit lactone validation, then apply the new standard/non-standard filters.
Public API And Output Changes
Add a new result type, e.g. MacrocycleClassificationResult, with these fields:
- smiles: str
- classification: Literal["standard_macrolactone", "non_standard_macrocycle", "not_macrolactone"]
- ring_size: int | None
- primary_reason_code: str | None
- primary_reason_message: str | None
- all_reason_codes: list[str]
- all_reason_messages: list[str]
- candidate_ring_sizes: list[int]
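A minimal sketch of how this result type could look, assuming a frozen dataclass suits the existing models module. The field names and types are taken from the list above; the decorator, the `Classification` alias, and the default values are assumptions for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Literal

# Hypothetical alias; the plan only fixes the three literal values.
Classification = Literal[
    "standard_macrolactone",
    "non_standard_macrocycle",
    "not_macrolactone",
]


@dataclass(frozen=True)
class MacrocycleClassificationResult:
    smiles: str
    classification: Classification
    ring_size: int | None = None
    primary_reason_code: str | None = None
    primary_reason_message: str | None = None
    # Empty lists by default so a standard result carries no reasons.
    all_reason_codes: list[str] = field(default_factory=list)
    all_reason_messages: list[str] = field(default_factory=list)
    candidate_ring_sizes: list[int] = field(default_factory=list)


# Illustrative 14-membered lactone: carbonyl carbon, 12 ring carbons, ester oxygen.
example = MacrocycleClassificationResult(
    smiles="O=C1" + "C" * 12 + "O1",
    classification="standard_macrolactone",
    ring_size=14,
    candidate_ring_sizes=[14],
)
```

With these defaults, a standard result only needs the SMILES, the classification, and the ring size; the reason fields stay empty unless a rejection populates them.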
Add a new public API on MacroLactoneAnalyzer:
classify_macrocycle(mol_input: str | Chem.Mol, ring_size: int | None = None) -> MacrocycleClassificationResult
Behavior:
- If ring_size is omitted, inspect all 12-20 membered lactone candidates.
- If ring_size is provided, restrict candidate selection to that size before classification.
- Invalid SMILES should continue to raise through the existing detection exception path; do not encode invalid input as a classification result.
- For standard_macrolactone, ring_size must be the accepted ring size and all reason fields must be empty.
- For non_standard_macrocycle, ring_size should be the candidate ring size if exactly one size remains relevant, otherwise None.
- For not_macrolactone, return no ring size and a reason describing why no valid 12-20 lactone candidate survived.
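The ring_size rules above can be sketched as a small helper. This is only an illustration: the function name `reported_ring_size` is hypothetical, and it assumes that "relevant" sizes means the sizes of the candidates that took part in the final decision.

```python
from __future__ import annotations

from collections.abc import Iterable


def reported_ring_size(classification: str, relevant_sizes: Iterable[int]) -> int | None:
    """Pick the ring_size to report for a classification (hypothetical helper)."""
    # not_macrolactone never reports a ring size.
    if classification == "not_macrolactone":
        return None
    unique_sizes = set(relevant_sizes)
    # Report a size only when it is unambiguous.
    if len(unique_sizes) == 1:
        return next(iter(unique_sizes))
    return None
```

Under this reading, an overlap rejection between a 14- and a 16-membered candidate reports None, while two overlapping 14-membered candidates still report 14.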
Reason codes must be decision-complete and fixed:
- contains_non_carbon_ring_atoms_outside_positions_1_2
- multiple_overlapping_macrocycle_candidates
- no_lactone_ring_in_12_to_20_range
- requested_ring_size_not_found
Reason messages must be short English sentences:
- Ring positions 3..N contain non-carbon atoms.
- Overlapping macrolactone candidate rings were detected.
- No 12-20 membered lactone ring was detected.
- The requested ring size was not detected as a lactone ring.
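One way to keep codes and messages fixed is a single lookup table. The sketch below pairs each code with its message in the order both are listed above; the names `REASON_MESSAGES` and `reason_message` are hypothetical.

```python
from __future__ import annotations

# Codes and messages paired in the order they are listed in this plan.
REASON_MESSAGES: dict[str, str] = {
    "contains_non_carbon_ring_atoms_outside_positions_1_2":
        "Ring positions 3..N contain non-carbon atoms.",
    "multiple_overlapping_macrocycle_candidates":
        "Overlapping macrolactone candidate rings were detected.",
    "no_lactone_ring_in_12_to_20_range":
        "No 12-20 membered lactone ring was detected.",
    "requested_ring_size_not_found":
        "The requested ring size was not detected as a lactone ring.",
}


def reason_message(code: str) -> str:
    """Return the fixed message for a reason code; unknown codes raise KeyError."""
    return REASON_MESSAGES[code]
```

Letting unknown codes raise keeps the set of codes closed, which matches the requirement that they be fixed.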
Update CLI macro-lactone-toolkit analyze to return this classification result shape for single-SMILES mode and row-wise CSV mode.
Do not add a new CLI subcommand. Keep analyze as the classification surface.
Implementation Changes
Detection And Candidate Grouping
In the current core detection module:
- Keep the existing lactone-ring candidate search based on RingInfo.AtomRings() and lactone atom validation.
- Add an overlap-group pass over candidate rings:
  - Build a graph where two candidates are connected if their ring atom sets intersect.
  - Compute connected components on this graph.
  - If any connected component contains more than one candidate, classify as non_standard_macrocycle with multiple_overlapping_macrocycle_candidates.
- Do not treat disconnected candidate rings as overlapping.
- Keep candidate_ring_sizes as the sorted unique sizes from the filtered candidate list.
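The overlap-group pass can be sketched without RDKit by treating each candidate as a tuple of atom indices, which is the shape RingInfo.AtomRings() returns. The function names below are hypothetical, and union-find is just one convenient way to compute the connected components.

```python
from __future__ import annotations

from collections.abc import Sequence


def overlap_groups(candidate_rings: Sequence[Sequence[int]]) -> list[list[int]]:
    """Group candidate indices into connected components of the overlap graph."""
    atom_sets = [set(ring) for ring in candidate_rings]
    parent = list(range(len(atom_sets)))

    def find(i: int) -> int:
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two candidates are connected if their ring atom sets intersect.
    for i in range(len(atom_sets)):
        for j in range(i + 1, len(atom_sets)):
            if atom_sets[i] & atom_sets[j]:
                parent[find(i)] = find(j)

    groups: dict[int, list[int]] = {}
    for i in range(len(atom_sets)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def has_overlapping_candidates(candidate_rings: Sequence[Sequence[int]]) -> bool:
    """True if any component holds more than one candidate ring."""
    return any(len(group) > 1 for group in overlap_groups(candidate_rings))
```

Because components are computed transitively, a chain of rings A-B-C where A and C share no atoms still forms one group, while a candidate elsewhere in the molecule stays in its own group and does not trigger the rejection.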
Standard Macrocycle Filter
For any single candidate that survives overlap rejection:
- Build numbering exactly as today: position 1 is the lactone carbonyl carbon, position 2 is the ring ester oxygen.
- Inspect positions 3..N.
- Every atom at positions 3..N must have atomic number 6.
- If any position 3..N is not carbon, classify as non_standard_macrocycle with contains_non_carbon_ring_atoms_outside_positions_1_2.
This rule must reject cyclic peptides and other heteroatom-containing macrocycles even if they contain a lactone bond.
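The position check itself is small. The sketch below assumes the caller has already numbered the ring and passes atomic numbers ordered by position (index 0 is position 1); in the toolkit these values would presumably come from each ring atom's GetAtomicNum(). The function name is hypothetical.

```python
from __future__ import annotations

from collections.abc import Sequence

CARBON = 6  # atomic number of carbon


def non_carbon_positions(atomic_numbers_by_position: Sequence[int]) -> list[int]:
    """Return the 1-based ring positions >= 3 whose atom is not carbon."""
    return [
        position
        for position, atomic_number in enumerate(atomic_numbers_by_position, start=1)
        if position >= 3 and atomic_number != CARBON
    ]


# 14-membered ring: carbonyl carbon (1), ester oxygen (2), then 12 carbons.
standard_ring = [6, 8] + [6] * 12
# Same ring with a nitrogen at position 5.
aza_ring = [6, 8, 6, 6, 7] + [6] * 9
```

An empty result means the candidate passes the filter; any returned position maps to contains_non_carbon_ring_atoms_outside_positions_1_2. Positions 1 and 2 are skipped, so the ester oxygen at position 2 never counts against the ring.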
Fragmenter Integration
Update MacrolactoneFragmenter so that:
- number_molecule() and fragment_molecule() first call classify_macrocycle().
- They only proceed when classification is standard_macrolactone.
- For non_standard_macrocycle or not_macrolactone, raise the existing detection exception type with a message that includes the classification and the primary reason code.
- Do not change fragmentation output semantics for standard macrolactones.
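A minimal sketch of the guard that number_molecule() and fragment_molecule() could share. The plan does not name the existing detection exception, so `DetectionError` here is a hypothetical stand-in, as are the helper name `ensure_standard_macrolactone` and the exact message wording.

```python
from __future__ import annotations


class DetectionError(ValueError):
    """Hypothetical stand-in for the toolkit's existing detection exception."""


def ensure_standard_macrolactone(classification: str, primary_reason_code: str | None) -> None:
    """Raise unless the molecule was classified as a standard macrolactone."""
    if classification == "standard_macrolactone":
        return
    # The message carries both the classification and the primary reason code.
    raise DetectionError(
        f"Molecule classified as {classification} "
        f"(primary reason: {primary_reason_code})"
    )
```

Keeping the check in one helper means both entry points reject the same inputs with the same message format, and the standard path falls through untouched.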
Files To Change
Concentrate changes in:
- src/macro_lactone_toolkit/_core.py
- src/macro_lactone_toolkit/analyzer.py
- src/macro_lactone_toolkit/cli.py
Add the new result type in the existing models module instead of inventing a second schema location.
Test Plan
Add tests first, verify they fail, then implement.
Required test cases:
- Standard 12, 14, 16, and 20 membered macrolactones still classify as standard_macrolactone and return the correct ring_size.
- A macrocycle with a valid lactone bond but a non-carbon atom at position 3..N classifies as non_standard_macrocycle with:
  - primary_reason_code == "contains_non_carbon_ring_atoms_outside_positions_1_2"
  - the expected English message
- An overlapping-candidate example classifies as non_standard_macrocycle with:
  - primary_reason_code == "multiple_overlapping_macrocycle_candidates"
  - the expected English message
- A non-lactone macrocycle classifies as not_macrolactone with no_lactone_ring_in_12_to_20_range.
- Explicit ring_size with no candidate of that size returns not_macrolactone with requested_ring_size_not_found.
- macro-lactone-toolkit analyze --smiles ... returns the new fields for:
  - one standard example
  - one heteroatom-rejected example
  - one overlap-rejected example
- Existing numbering, fragmentation, labeled/plain dummy round-trip, and splicing tests remain green for standard macrolactones.
Test fixture guidance:
- Reuse the existing synthetic macrocycle helper for standard rings.
- Extend the helper or add a new fixture helper for:
  - a lactone-containing ring with one non-carbon atom at a numbered position beyond 2
  - an overlapping-candidate ring example specifically built to share ring atoms between candidate rings
Assumptions And Defaults
- Classification is molecule-level, but the overlap rejection only applies to overlapping candidate rings, not disconnected candidates elsewhere in the molecule.
- Invalid SMILES remain exceptions, not classification payloads.
- analyze becomes the official classification output; get_valid_ring_sizes() may remain as a lower-level helper.
- The implementation should treat RDKit ring APIs as candidate generators, not as the final definition of a standard macrolactone.