# Standard vs Non-Standard Macrocycle Classification

## Summary

Add a formal molecule-level classification layer on top of the current `macro_lactone_toolkit` detection logic so the toolkit can distinguish:

- `standard_macrolactone`
- `non_standard_macrocycle`
- `not_macrolactone`

This classification must support two new rejection rules:

1. After ring numbering is assigned, positions `3..N` must all be carbon atoms. If any atom at positions `3..N` is not carbon, classify the molecule as `non_standard_macrocycle`.
2. If multiple candidate macrolactone rings overlap, that is, their ring atom sets intersect, classify the molecule as `non_standard_macrocycle`. Apply this rule only to overlapping candidate rings; disconnected or non-overlapping candidates do not trigger this rejection.

Do not rely on a “largest ring” assumption. Base detection on RDKit ring candidates from `RingInfo.AtomRings()` plus explicit lactone validation, then apply the new standard/non-standard filters.

## Public API And Output Changes

Add a new result type, e.g. `MacrocycleClassificationResult`, with these fields:

- `smiles: str`
- `classification: Literal["standard_macrolactone", "non_standard_macrocycle", "not_macrolactone"]`
- `ring_size: int | None`
- `primary_reason_code: str | None`
- `primary_reason_message: str | None`
- `all_reason_codes: list[str]`
- `all_reason_messages: list[str]`
- `candidate_ring_sizes: list[int]`

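As a rough illustration, the fields listed above could be carried by a frozen dataclass. This is only a sketch: the plan does not say whether the existing models module uses dataclasses, and the default values and immutability shown here are assumptions.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

Classification = Literal[
    "standard_macrolactone",
    "non_standard_macrocycle",
    "not_macrolactone",
]


@dataclass(frozen=True)
class MacrocycleClassificationResult:
    """Sketch of the result payload; defaults and immutability are assumptions."""

    smiles: str
    classification: Classification
    ring_size: Optional[int] = None
    primary_reason_code: Optional[str] = None
    primary_reason_message: Optional[str] = None
    all_reason_codes: list[str] = field(default_factory=list)
    all_reason_messages: list[str] = field(default_factory=list)
    candidate_ring_sizes: list[int] = field(default_factory=list)


# Illustrative 14-membered lactone: carbonyl carbon, 12 ring carbons, ester oxygen.
example_smiles = "O=C1" + "C" * 12 + "O1"
standard = MacrocycleClassificationResult(
    smiles=example_smiles,
    classification="standard_macrolactone",
    ring_size=14,
    candidate_ring_sizes=[14],
)
```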
Add a new public API on `MacroLactoneAnalyzer`:

- `classify_macrocycle(mol_input: str | Chem.Mol, ring_size: int | None = None) -> MacrocycleClassificationResult`

Behavior:

- If `ring_size` is omitted, inspect all 12- to 20-membered lactone candidates.
- If `ring_size` is provided, restrict candidate selection to that size before classification.
- Invalid SMILES should continue to raise the existing detection exception; do not encode invalid input as a classification result.
- For `standard_macrolactone`, `ring_size` must be the accepted ring size, `primary_reason_code` and `primary_reason_message` must be `None`, and both reason lists must be empty.
- For `non_standard_macrocycle`, `ring_size` should be the candidate ring size if all remaining candidates share a single size, otherwise `None`.
- For `not_macrolactone`, return `ring_size = None` and a reason describing why no valid 12- to 20-membered lactone candidate survived.

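The candidate-selection behavior above can be sketched in plain Python. The helper names `select_candidates` and `reported_ring_size` are hypothetical, and rings are assumed to arrive as tuples of atom indices that have already passed lactone validation. Treating a requested size outside 12 to 20 as `requested_ring_size_not_found` is also an assumption; the plan does not cover that case.

```python
from typing import Optional, Sequence, Tuple

Ring = Tuple[int, ...]  # atom indices of one validated lactone ring

MIN_RING_SIZE = 12
MAX_RING_SIZE = 20


def select_candidates(
    lactone_rings: Sequence[Ring], ring_size: Optional[int] = None
) -> Tuple[list, Optional[str]]:
    """Return (candidates, reason_code); reason_code is None when candidates exist."""
    in_range = [r for r in lactone_rings if MIN_RING_SIZE <= len(r) <= MAX_RING_SIZE]
    if ring_size is None:
        return in_range, (None if in_range else "no_lactone_ring_in_12_to_20_range")
    matching = [r for r in in_range if len(r) == ring_size]
    return matching, (None if matching else "requested_ring_size_not_found")


def reported_ring_size(candidates: Sequence[Ring]) -> Optional[int]:
    """Report a ring size only when every remaining candidate has the same size."""
    sizes = {len(r) for r in candidates}
    return sizes.pop() if len(sizes) == 1 else None


# One 14-membered and one 16-membered ring with disjoint atom indices.
rings = [tuple(range(14)), tuple(range(100, 116))]
all_candidates, no_reason = select_candidates(rings)
only_16, _ = select_candidates(rings, ring_size=16)
none_found, missing_reason = select_candidates(rings, ring_size=12)
```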
Reason codes must be fixed and exhaustive; use exactly these four:

- `contains_non_carbon_ring_atoms_outside_positions_1_2`
- `multiple_overlapping_macrocycle_candidates`
- `no_lactone_ring_in_12_to_20_range`
- `requested_ring_size_not_found`

Reason messages must be short English sentences, one per reason code, in the same order:

- `Ring positions 3..N contain non-carbon atoms.`
- `Overlapping macrolactone candidate rings were detected.`
- `No 12-20 membered lactone ring was detected.`
- `The requested ring size was not detected as a lactone ring.`

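One way to keep codes and messages in sync is a single lookup table, sketched below. The code and message strings are taken from the lists above; the helper `build_reason_fields` is hypothetical, and treating the first reported code as the primary one is an assumption, since the plan does not define a precedence order.

```python
from typing import Dict, Iterable, List, Optional, Tuple

REASON_MESSAGES: Dict[str, str] = {
    "contains_non_carbon_ring_atoms_outside_positions_1_2":
        "Ring positions 3..N contain non-carbon atoms.",
    "multiple_overlapping_macrocycle_candidates":
        "Overlapping macrolactone candidate rings were detected.",
    "no_lactone_ring_in_12_to_20_range":
        "No 12-20 membered lactone ring was detected.",
    "requested_ring_size_not_found":
        "The requested ring size was not detected as a lactone ring.",
}


def build_reason_fields(
    codes: Iterable[str],
) -> Tuple[Optional[str], Optional[str], List[str], List[str]]:
    """Return (primary_code, primary_message, all_codes, all_messages)."""
    unique_codes = list(dict.fromkeys(codes))  # de-duplicate, keep first-seen order
    messages = [REASON_MESSAGES[code] for code in unique_codes]  # KeyError on unknown codes
    if not unique_codes:
        return None, None, [], []
    # Assumption: the first reported code is the primary one.
    return unique_codes[0], messages[0], unique_codes, messages


primary_code, primary_message, all_codes, all_messages = build_reason_fields(
    ["multiple_overlapping_macrocycle_candidates"]
)
```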
Update CLI `macro-lactone-toolkit analyze` to return this classification result shape for single-SMILES mode and row-wise CSV mode.

Do not add a new CLI subcommand. Keep `analyze` as the classification surface.
## Implementation Changes

### Detection And Candidate Grouping

In the current core detection module:

- Keep the existing lactone-ring candidate search based on `RingInfo.AtomRings()` and lactone atom validation.
- Add an overlap-group pass over candidate rings:
  - Build a graph where two candidates are connected if their ring atom sets intersect.
  - Compute connected components on this graph.
  - If any connected component contains more than one candidate, classify as `non_standard_macrocycle` with `multiple_overlapping_macrocycle_candidates`.
  - Do not treat disconnected candidate rings as overlapping.
- Keep `candidate_ring_sizes` as the sorted unique sizes from the filtered candidate list.

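The overlap-group pass can be sketched without RDKit by working on tuples of atom indices, the shape that `RingInfo.AtomRings()` returns. The function names below are hypothetical, and the sketch assumes the input rings have already been validated as lactone candidates.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Ring = Tuple[int, ...]


def group_overlapping_rings(rings: Sequence[Ring]) -> List[List[int]]:
    """Return connected components (lists of ring indices) of the overlap graph."""
    atom_sets = [frozenset(ring) for ring in rings]
    adjacency = {i: set() for i in range(len(atom_sets))}
    # Two candidates are connected when their ring atom sets intersect.
    for i, j in combinations(range(len(atom_sets)), 2):
        if atom_sets[i] & atom_sets[j]:
            adjacency[i].add(j)
            adjacency[j].add(i)

    seen = set()
    components = []
    for start in range(len(atom_sets)):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


def has_overlapping_candidates(rings: Sequence[Ring]) -> bool:
    """True when any component holds more than one candidate ring."""
    return any(len(component) > 1 for component in group_overlapping_rings(rings))


# Rings 0 and 1 share atom 2; ring 2 is disconnected from both.
example_components = group_overlapping_rings([(0, 1, 2), (2, 3, 4), (10, 11, 12)])
```

Note that under the intersection rule even a single shared atom, as in a spiro junction, counts as an overlap.
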
### Standard Macrocycle Filter

For any single candidate that survives overlap rejection:

- Build numbering exactly as today: position 1 is the lactone carbonyl carbon, position 2 is the ring ester oxygen.
- Inspect positions `3..N`.
- Every atom at positions `3..N` must have atomic number 6.
- If any position `3..N` is not carbon, classify as `non_standard_macrocycle` with `contains_non_carbon_ring_atoms_outside_positions_1_2`.

This rule must reject cyclic peptides and other heteroatom-containing macrocycles even if they contain a lactone bond.

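A minimal sketch of the position check follows, assuming the ring's atomic numbers are already ordered by ring position (in an RDKit-based implementation they would typically be read with `Atom.GetAtomicNum()`). The helper name `first_non_carbon_position` is hypothetical.

```python
from typing import Optional, Sequence

CARBON = 6


def first_non_carbon_position(atomic_numbers: Sequence[int]) -> Optional[int]:
    """Return the first 1-based position >= 3 that is not carbon, else None.

    atomic_numbers[0] is position 1 (carbonyl carbon) and
    atomic_numbers[1] is position 2 (ring ester oxygen).
    """
    for position, atomic_number in enumerate(atomic_numbers, start=1):
        if position >= 3 and atomic_number != CARBON:
            return position
    return None


standard_ring = [6, 8] + [6] * 12      # 14-membered ring, carbon at positions 3..14
aza_ring = [6, 8, 6, 6, 7] + [6] * 9   # 14-membered ring, nitrogen at position 5
```

A `None` result means the candidate passes this filter; any integer result maps to `contains_non_carbon_ring_atoms_outside_positions_1_2`.
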
### Fragmenter Integration

Update `MacrolactoneFragmenter` so that:

- `number_molecule()` and `fragment_molecule()` first call `classify_macrocycle()`.
- They only proceed when classification is `standard_macrolactone`.
- For `non_standard_macrocycle` or `not_macrolactone`, raise the existing detection exception type with a message that includes the classification and the primary reason code.
- Do not change fragmentation output semantics for standard macrolactones.

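The gate in front of numbering and fragmentation might look like the sketch below. `DetectionError` is a stand-in, because the plan does not name the toolkit's existing detection exception, and `require_standard_macrolactone` is a hypothetical helper; the message wording is illustrative only.

```python
from typing import Optional


class DetectionError(ValueError):
    """Stand-in for the toolkit's existing detection exception (real name not given)."""


def require_standard_macrolactone(
    classification: str, primary_reason_code: Optional[str]
) -> None:
    """Raise unless the molecule was classified as a standard macrolactone."""
    if classification == "standard_macrolactone":
        return
    # The message carries both the classification and the primary reason code.
    raise DetectionError(
        f"Molecule classified as {classification} "
        f"(reason: {primary_reason_code}); numbering and fragmentation "
        "require a standard macrolactone."
    )
```

Calling this helper at the top of `number_molecule()` and `fragment_molecule()` would leave the code path for standard macrolactones unchanged.
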
### Files To Change

Concentrate changes in:

- `src/macro_lactone_toolkit/_core.py`
- `src/macro_lactone_toolkit/analyzer.py`
- `src/macro_lactone_toolkit/cli.py`

Add the new result type in the existing models module instead of inventing a second schema location.
## Test Plan

Add tests first, verify they fail, then implement.

Required test cases:

- Standard 12-, 14-, 16-, and 20-membered macrolactones still classify as `standard_macrolactone` and return the correct `ring_size`.
- A macrocycle with a valid lactone bond but a non-carbon atom at position `3..N` classifies as `non_standard_macrocycle` with:
  - `primary_reason_code == "contains_non_carbon_ring_atoms_outside_positions_1_2"`
  - the expected English message
- An overlapping-candidate example classifies as `non_standard_macrocycle` with:
  - `primary_reason_code == "multiple_overlapping_macrocycle_candidates"`
  - the expected English message
- A non-lactone macrocycle classifies as `not_macrolactone` with `no_lactone_ring_in_12_to_20_range`.
- Explicit `ring_size` with no candidate of that size returns `not_macrolactone` with `requested_ring_size_not_found`.
- `macro-lactone-toolkit analyze --smiles ...` returns the new fields for:
  - one standard example
  - one heteroatom-rejected example
  - one overlap-rejected example
- Existing numbering, fragmentation, labeled/plain dummy round-trip, and splicing tests remain green for standard macrolactones.

Test fixture guidance:

- Reuse the existing synthetic macrocycle helper for standard rings.
- Extend the helper or add a new fixture helper for:
  - a lactone-containing ring with one non-carbon atom at a numbered position beyond 2
  - an overlapping-candidate ring example specifically built to share ring atoms between candidate rings

## Assumptions And Defaults

- Classification is molecule-level, but the overlap rejection only applies to overlapping candidate rings, not disconnected candidates elsewhere in the molecule.
- Invalid SMILES remain exceptions, not classification payloads.
- `analyze` becomes the official classification output; `get_valid_ring_sizes()` may remain as a lower-level helper.
- The implementation should treat RDKit ring APIs as candidate generators, not as the final definition of a standard macrolactone.