# MacrolactoneDB Validation Output

This directory contains validation results for MacrolactoneDB molecules with 12- to 20-membered rings.

## Directory Structure

```
validation_output/
├── README.md                    # This file
├── fragments.db                 # SQLite database with all data
├── fragment_library.csv         # Unified fragment library export
├── summary.csv                  # Summary of all processed molecules
├── summary_statistics.json      # Statistical summary
│
├── ring_size_12/                # 12-membered rings
├── ring_size_13/                # 13-membered rings
...
└── ring_size_20/                # 20-membered rings
    ├── molecules.csv            # Molecules in this ring size
    ├── standard/                # Standard macrolactones
    │   ├── numbered/            # Numbered ring images
    │   │   └── {id}_numbered.png
    │   └── sidechains/          # Fragment images
    │       └── {id}/
    │           └── {id}_frag_{n}_pos{pos}.png
    ├── non_standard/            # Non-standard macrocycles
    │   └── original/
    │       └── {id}_original.png
    └── rejected/                # Not macrolactones
        └── original/
            └── {id}_original.png
```
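
The layout above can be turned into file paths programmatically. The following is a minimal sketch, not part of the pipeline itself: the helper names `numbered_image_path` and `fragment_image_path` are hypothetical, and it assumes that every `ring_size_N/` directory has the same substructure as the one shown for `ring_size_20/` and that `{id}` is the molecule identifier used in the file names.

```python
from pathlib import Path


def numbered_image_path(root: Path, ring_size: int, mol_id: str) -> Path:
    """Path of the numbered ring image for a standard macrolactone (hypothetical helper)."""
    return root / f"ring_size_{ring_size}" / "standard" / "numbered" / f"{mol_id}_numbered.png"


def fragment_image_path(root: Path, ring_size: int, mol_id: str,
                        frag_index: int, position: int) -> Path:
    """Path of one side-chain fragment image (hypothetical helper).

    Follows the {id}_frag_{n}_pos{pos}.png pattern from the directory tree.
    """
    return (
        root
        / f"ring_size_{ring_size}"
        / "standard"
        / "sidechains"
        / mol_id
        / f"{mol_id}_frag_{frag_index}_pos{position}.png"
    )


root = Path("validation_output")
print(numbered_image_path(root, 14, "ML00000001").as_posix())
print(fragment_image_path(root, 14, "ML00000001", 2, 5).as_posix())
```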

## Database Schema

### Tables

- **parent_molecules**: Original molecule information
- **ring_numberings**: Ring atom numbering details
- **side_chain_fragments**: Fragmentation results with isotope tags
- **fragment_library_entries**: Unified fragment library rows for downstream design
- **validation_results**: Manual validation records

### Key Fields

- `classification`: standard_macrolactone | non_standard_macrocycle | not_macrolactone
- `dummy_isotope`: Cleavage position stored as isotope value for reconstruction
- `cleavage_position`: Position on ring where side chain was attached
- `has_dummy_atom`: Whether the fragment contains a dummy atom for splicing
- `dummy_atom_count`: Number of dummy atoms in the fragment

## Ring Numbering Convention

1. Position 1 = Lactone carbonyl carbon (C=O)
2. Position 2 = Ester oxygen (-O-)
3. Positions 3 to N = Remaining ring atoms, numbered sequentially around the ring (N = ring size)
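
The convention above amounts to rotating the ring so that it starts at the carbonyl carbon and walking in the direction of the ester oxygen. The sketch below illustrates that idea under stated assumptions: `number_ring` is a hypothetical helper, the ring is given as a list of atom indices already in cyclic order, and the carbonyl carbon and ester oxygen have been identified beforehand. It is not the pipeline's actual implementation.

```python
def number_ring(ring_atoms, carbonyl_c, ester_o):
    """Map each ring atom index to its ring position (1-based).

    ring_atoms: atom indices in cyclic order (either direction).
    carbonyl_c: index of the lactone carbonyl carbon (becomes position 1).
    ester_o:    index of the ring ester oxygen (becomes position 2).
    """
    n = len(ring_atoms)
    if carbonyl_c not in ring_atoms or ester_o not in ring_atoms:
        raise ValueError("carbonyl carbon and ester oxygen must both be ring atoms")

    start = ring_atoms.index(carbonyl_c)
    # Pick the walking direction that reaches the ester oxygen first.
    if ring_atoms[(start + 1) % n] == ester_o:
        step = 1
    elif ring_atoms[(start - 1) % n] == ester_o:
        step = -1
    else:
        raise ValueError("ester oxygen must be adjacent to the carbonyl carbon")

    return {ring_atoms[(start + step * i) % n]: i + 1 for i in range(n)}


# Example: a 12-membered ring whose atoms happen to be indexed 0..11,
# with the carbonyl carbon at index 4 and the ester oxygen at index 5.
positions = number_ring(list(range(12)), carbonyl_c=4, ester_o=5)
print(positions[4], positions[5], positions[3])  # 1 2 12
```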

## Isotope Tagging

The dummy atom of each fragment carries an isotope value that records the cleavage position:
- `[5*]CCO` = Fragment from position 5, dummy atom has isotope=5
- During reassembly, the isotope value identifies the ring position at which the fragment is reattached
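
As an illustration of the tagging scheme, the cleavage position can be read back from a fragment SMILES with a regular expression. This is a minimal sketch: `cleavage_positions` is a hypothetical helper, and it only recognises dummy atoms written exactly as `[N*]` (no atom-map numbers or other bracket decorations). A cheminformatics toolkit would be more robust for general SMILES.

```python
import re

# Matches an isotope-labelled dummy atom such as [5*] and captures the isotope.
_TAGGED_DUMMY = re.compile(r"\[(\d+)\*\]")


def cleavage_positions(fragment_smiles):
    """Return the isotope values of all [N*] dummy atoms, in order of appearance."""
    return [int(value) for value in _TAGGED_DUMMY.findall(fragment_smiles)]


print(cleavage_positions("[5*]CCO"))  # [5]
```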

## CSV Columns

### summary.csv

- `ml_id`: MacrolactoneDB unique ID (e.g., ML00000001)
- `chembl_id`: Original CHEMBL ID (if available)
- `classification`: Classification result
- `ring_size`: Detected ring size (12-20)
- `num_sidechains`: Number of side chains detected
- `cleavage_positions`: JSON array of cleavage positions
- `processing_status`: pending | success | failed | skipped
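
Because `cleavage_positions` is stored as a JSON array inside a CSV cell, it needs a second decoding step after the CSV is read. The sketch below shows one way to do this with the standard library. The helper name `load_summary` and the sample row are made up for illustration, and it assumes that numeric columns may be empty for molecules that were not processed successfully.

```python
import csv
import io
import json


def _to_int(value):
    """Convert a CSV cell to int, returning None for empty cells."""
    return int(value) if value else None


def load_summary(handle):
    """Read summary.csv rows and decode the numeric and JSON columns."""
    rows = []
    for row in csv.DictReader(handle):
        row["ring_size"] = _to_int(row["ring_size"])
        row["num_sidechains"] = _to_int(row["num_sidechains"])
        row["cleavage_positions"] = json.loads(row["cleavage_positions"] or "[]")
        rows.append(row)
    return rows


# Illustrative data only; real rows come from open("summary.csv", newline="").
sample = io.StringIO(
    "ml_id,chembl_id,classification,ring_size,num_sidechains,"
    "cleavage_positions,processing_status\n"
    'ML00000001,,standard_macrolactone,14,2,"[3, 5]",success\n'
)
rows = load_summary(sample)
print(rows[0]["ml_id"], rows[0]["cleavage_positions"])
```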

### fragment_library.csv

- `source_type`: validation_extract | supplemental (reserved)
- `has_dummy_atom`: Whether the fragment contains a dummy atom
- `dummy_atom_count`: Number of dummy atoms
- `splice_ready`: Whether the fragment is directly compatible with single-anchor splicing
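
The three dummy-atom columns are related, and the sketch below shows how they could be derived from a fragment SMILES. It rests on an assumption that is not confirmed by this README: that "single-anchor splicing" means the fragment has exactly one dummy atom. The helper name `dummy_flags` is hypothetical, and the regular expression is a simplification that counts bracketed dummy atoms such as `[5*]` or `[*]` as well as bare `*`.

```python
import re

# A bracket atom containing '*' (e.g. [5*], [*]) or a bare '*'.
_DUMMY_ATOM = re.compile(r"\[\d*\*[^\]]*\]|\*")


def dummy_flags(fragment_smiles):
    """Derive the dummy-atom columns for one fragment SMILES."""
    count = len(_DUMMY_ATOM.findall(fragment_smiles))
    return {
        "has_dummy_atom": count > 0,
        "dummy_atom_count": count,
        # ASSUMPTION: splice-ready means exactly one attachment point.
        "splice_ready": count == 1,
    }


print(dummy_flags("[5*]CCO"))
```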

## Querying the Database

```bash
# List tables
sqlite3 fragments.db ".tables"

# List the first five standard macrolactones
sqlite3 fragments.db "SELECT * FROM parent_molecules WHERE classification='standard_macrolactone' LIMIT 5;"

# Get fragments for a specific molecule
sqlite3 fragments.db "SELECT * FROM side_chain_fragments WHERE parent_id=1;"

# Count by ring size
sqlite3 fragments.db "SELECT ring_size, COUNT(*) FROM parent_molecules GROUP BY ring_size;"
```
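
The same queries can be run from Python with the standard `sqlite3` module. The sketch below mirrors the ring-size count above. The function name `count_by_ring_size` is hypothetical, and the in-memory table is a stand-in containing only the two columns used in the commands above (`classification`, `ring_size`); the real `parent_molecules` table in `fragments.db` will have more columns.

```python
import sqlite3


def count_by_ring_size(conn):
    """Return {ring_size: number of parent molecules}."""
    cursor = conn.execute(
        "SELECT ring_size, COUNT(*) FROM parent_molecules "
        "GROUP BY ring_size ORDER BY ring_size"
    )
    return dict(cursor.fetchall())


# Stand-in database for demonstration; use sqlite3.connect("fragments.db")
# to query the real file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parent_molecules ("
    "id INTEGER PRIMARY KEY, classification TEXT, ring_size INTEGER)"
)
conn.executemany(
    "INSERT INTO parent_molecules (classification, ring_size) VALUES (?, ?)",
    [
        ("standard_macrolactone", 14),
        ("standard_macrolactone", 14),
        ("not_macrolactone", 16),
    ],
)
counts = count_by_ring_size(conn)
print(counts)
```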