# SDF File Processing and RDKit Substructure Matching

This project provides a two-step workflow: extract SDF files from several archive formats, then run high-throughput substructure matching with RDKit across many parallel processes.

## Setup

### Prerequisites

- Pixi package manager
- Python 3.11+

### Installation

1. Initialize the pixi environment (already done in this repository; only needed when starting from scratch):
   ```bash
   pixi init
   pixi add rdkit joblib pandas tqdm
   ```

2. Activate the environment:
   ```bash
   pixi shell
   ```

## Project Structure

```
search_macro/
├── pixi.toml                                 # Pixi configuration
├── README.md                                 # This file
├── data/                                     # Input directory with zip/rar/sdf files
├── notebooks/                                # Jupyter notebooks
│   ├── 01_extract_sdf_files.ipynb            # Extract SDF from archives
│   └── 02_rdkit_substructure_matching.ipynb  # Parallel substructure matching
├── extracted_sdf_files/                      # Output directory for extracted SDF files
└── matching_results/                         # Results directory for matching output
```

## Usage

### Step 1: Extract SDF Files

Open the first notebook to extract all SDF files from the archives in `data/`:

```bash
jupyter notebook notebooks/01_extract_sdf_files.ipynb
```

This notebook will:
- Scan the `data/` directory for compressed files (ZIP, RAR, TAR.GZ)
- Extract all SDF/MOL/SD files from archives
- Copy existing SDF files to a unified directory
- Generate a file list with metadata (`sdf_file_list.csv`)
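
The extraction steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the function names are hypothetical, RAR handling is left out (it needs the third-party `rarfile` package), and members are flattened to their bare file names rather than keeping the archive's internal directory layout.

```python
import shutil
import tarfile
import zipfile
from pathlib import Path

# Extensions treated as structure files, per the list above.
SDF_SUFFIXES = {".sdf", ".mol", ".sd"}


def _is_sdf(name: str) -> bool:
    return Path(name).suffix.lower() in SDF_SUFFIXES


def _save(stream, dest: Path) -> Path:
    """Copy an open binary stream to dest, creating parent directories."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with open(dest, "wb") as out:
        shutil.copyfileobj(stream, out)
    return dest


def extract_sdf_members(archive: Path, out_dir: Path) -> list[Path]:
    """Extract SDF/MOL/SD members of a ZIP or TAR(.GZ) archive.

    Files land in out_dir/<archive name up to the first dot>/.
    Anything that is neither a ZIP nor a TAR archive yields an empty list.
    """
    target = out_dir / archive.name.split(".")[0]
    written: list[Path] = []
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            for info in zf.infolist():
                if not info.is_dir() and _is_sdf(info.filename):
                    with zf.open(info) as src:
                        # Keep only the bare file name so members cannot escape target.
                        written.append(_save(src, target / Path(info.filename).name))
    elif tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tf:
            for member in tf.getmembers():
                if member.isfile() and _is_sdf(member.name):
                    with tf.extractfile(member) as src:
                        written.append(_save(src, target / Path(member.name).name))
    return written


def collect_sdf_files(data_dir: Path, out_dir: Path) -> list[Path]:
    """Copy loose SDF files and extract archived ones into out_dir."""
    collected: list[Path] = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        if _is_sdf(path.name):
            with open(path, "rb") as src:
                collected.append(_save(src, out_dir / path.name))
        else:
            collected.extend(extract_sdf_members(path, out_dir))
    return collected
```

Calling `collect_sdf_files(Path("data"), Path("extracted_sdf_files"))` would mirror the directory names used in this project. The sketch assumes the output directory is not inside `data/`.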

### Step 2: Perform Substructure Matching

Open the second notebook for parallel RDKit matching:

```bash
jupyter notebook notebooks/02_rdkit_substructure_matching.ipynb
```

This notebook will:
- Load all extracted SDF files
- Define and compile SMARTS patterns for common chemical substructures
- Process molecules using 220 parallel processes (the `N_PROCESSES` setting)
- Write per-molecule matching results and per-pattern summary statistics
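
As a rough picture of how these steps fit together, the sketch below fans SDF files out to worker processes with `joblib` and matches each molecule with RDKit. It is an assumption-laden illustration rather than the notebook's code: the two SMARTS strings, the function names, and the one-file-per-task granularity are all made up for the example.

```python
from pathlib import Path

from joblib import Parallel, delayed
from rdkit import Chem

# Illustrative SMARTS only; the notebook's actual pattern table is not shown here.
SMARTS = {
    "benzene": "c1ccccc1",
    "carboxylic_acid": "[CX3](=O)[OX2H1]",
}


def match_file(sdf_path: str) -> list[dict]:
    """Match every readable molecule in one SDF file against all patterns."""
    # Compile inside the worker so only plain strings cross process boundaries.
    patterns = {name: Chem.MolFromSmarts(s) for name, s in SMARTS.items()}
    rows = []
    for mol in Chem.SDMolSupplier(sdf_path):
        if mol is None:  # RDKit yields None for records it cannot parse
            continue
        name = mol.GetProp("_Name") if mol.HasProp("_Name") else ""
        row = {"file": Path(sdf_path).name, "name": name}
        for pattern_name, patt in patterns.items():
            row[pattern_name] = mol.HasSubstructMatch(patt)
        rows.append(row)
    return rows


def match_files_parallel(sdf_paths, n_processes: int = 2) -> list[dict]:
    """Run match_file over all paths in parallel and flatten the results."""
    per_file = Parallel(n_jobs=n_processes)(
        delayed(match_file)(str(p)) for p in sdf_paths
    )
    return [row for rows in per_file for row in rows]
```

The resulting list of dictionaries can be passed straight to `pandas.DataFrame(...)`. On the machine described in this README, `n_processes` would be set to 220; the default of 2 here is only to keep the example cheap to run.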

## Features

### Archive Support

- ✅ ZIP files
- ✅ RAR files
- ✅ TAR.GZ files
- ✅ GZIP compressed SDF files
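
For the GZIP case, a single compressed SDF file (for example `ligands.sdf.gz`) can be unpacked with the standard `gzip` module. The helper below is a small sketch with a hypothetical name, not the notebook's implementation.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_sdf(src: Path, out_dir: Path) -> Path:
    """Decompress `name.sdf.gz` into `out_dir/name.sdf` and return the new path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # with_suffix("") drops only the trailing ".gz", leaving "name.sdf".
    dest = out_dir / src.with_suffix("").name
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, so large files are fine
    return dest
```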

### SMARTS Patterns

The matching includes 18 common chemical substructures:
- Benzene rings, pyridine, heterocycles
- Functional groups: carboxylic acids, alcohols, amines, amides, esters, ketones, aldehydes
- Other features: nitro groups, halogens, sulfonamides, aromatic rings, alkenes, alkynes, ethers, phenols
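
To make the idea of "defining and compiling" patterns concrete, here is a small sketch using RDKit's `Chem.MolFromSmarts` and `HasSubstructMatch`. The SMARTS strings are common textbook forms for a few of the classes listed above and are assumptions; the exact strings used in the notebook are not given in this README.

```python
from rdkit import Chem

# Hypothetical SMARTS for a handful of the listed substructure classes.
EXAMPLE_SMARTS = {
    "benzene": "c1ccccc1",
    "pyridine": "n1ccccc1",
    "carboxylic_acid": "[CX3](=O)[OX2H1]",
    "halogen": "[F,Cl,Br,I]",
    "alkyne": "C#C",
}


def compile_patterns(smarts_by_name: dict) -> dict:
    """Compile SMARTS once up front and fail early on invalid strings."""
    compiled = {}
    for name, smarts in smarts_by_name.items():
        patt = Chem.MolFromSmarts(smarts)
        if patt is None:  # RDKit returns None when the SMARTS cannot be parsed
            raise ValueError(f"invalid SMARTS for {name!r}: {smarts}")
        compiled[name] = patt
    return compiled


def matched_patterns(smiles: str, compiled: dict) -> list:
    """Return the names of all patterns found in a molecule given as SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable input: report no matches
        return []
    return [name for name, patt in compiled.items() if mol.HasSubstructMatch(patt)]
```

Compiling once and reusing the query objects avoids re-parsing every SMARTS string for every molecule, which matters at the dataset sizes this project targets.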

### Performance

- **220 parallel processes** (the `N_PROCESSES` setting in the matching notebook)
- Batch processing to manage memory usage
- Progress tracking with tqdm
- Intermediate results saved every 10 batches
- Performance statistics and analysis
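
The batching and intermediate-saving behaviour can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the checkpoint file name and JSON format are invented for the example, and `process_batch` stands in for whatever the notebook runs on each batch of files.

```python
import json
from pathlib import Path
from typing import Callable, Iterator, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(
    files: Sequence[str],
    process_batch: Callable[[Sequence[str]], list],
    out_dir: Path,
    files_per_batch: int = 100,
    save_every: int = 10,
) -> list:
    """Process files batch by batch, writing a checkpoint every save_every batches."""
    out_dir.mkdir(parents=True, exist_ok=True)
    results: list = []
    for batch_no, batch in enumerate(batched(files, files_per_batch), start=1):
        results.extend(process_batch(batch))
        if batch_no % save_every == 0:
            # Hypothetical checkpoint name; results must be JSON-serializable.
            checkpoint = out_dir / f"intermediate_{batch_no:05d}.json"
            checkpoint.write_text(json.dumps(results))
    return results
```

Smaller batches lower peak memory at the cost of more scheduling overhead; `files_per_batch` plays the role of `N_FILES_PER_BATCH` from the matching notebook.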

## Output Files

### Extraction Results (`extracted_sdf_files/`)

- `sdf_file_list.csv` - Complete list of all SDF files with metadata
- Organized subdirectories maintaining original archive structure

### Matching Results (`matching_results/`)

- `complete_matching_results.csv` - All molecules with pattern matches
- `matching_summary.csv` - Summary statistics for each pattern
- `molecules_with_{pattern}.csv` - Individual files for each pattern
- `pattern_frequencies.png` - Visualization of pattern frequencies
- Intermediate results saved every 10 batches

## Performance Notes

- Designed for high-performance systems (256 threads available)
- Uses 220 processes to maximize throughput while leaving some threads free for the rest of the system
- Memory-efficient batch processing
- Automatic cleanup and error handling

## Dependencies

- `rdkit` - Cheminformatics and structure processing
- `joblib` - Parallel processing backend
- `pandas` - Data manipulation and analysis
- `tqdm` - Progress bars
- `rarfile` - RAR archive extraction (third-party; not part of the `pixi add` command above, and it needs the `unrar` tool)
- `pathlib`, `zipfile`, `tarfile` - Path handling and ZIP/TAR extraction (Python standard library, nothing to install)

## Troubleshooting

### Common Issues

1. **RAR file extraction errors**: Install `rarfile` and ensure `unrar` is available
2. **Memory issues**: Reduce `N_PROCESSES` or `N_FILES_PER_BATCH` in the matching notebook
3. **Permission errors**: Ensure write access to output directories
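
For the RAR issue, a quick check like the one below can confirm whether an `unrar` executable is visible on the `PATH` before running the extraction notebook. It only uses `shutil.which` from the standard library; the helper name `find_tool` is illustrative.

```python
import shutil
from typing import Optional, Sequence


def find_tool(candidates: Sequence[str] = ("unrar",)) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, else None."""
    for tool in candidates:
        path = shutil.which(tool)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_tool()
    print(f"unrar found at {location}" if location else "unrar not found on PATH")
```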

### Performance Tuning

- Adjust `N_PROCESSES` based on available CPU cores
- Modify `N_FILES_PER_BATCH` for memory constraints
- Use `max_molecules` parameter for testing with smaller datasets
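
One way to pick `N_PROCESSES` on a different machine is to keep the same ratio this README uses (220 processes on 256 threads, roughly 86%). The heuristic below is only a sketch; the function name and the default fraction are assumptions, not values taken from the notebook.

```python
import os
from typing import Optional


def suggest_n_processes(cpu_count: Optional[int] = None, usage_fraction: float = 0.86) -> int:
    """Suggest a worker count that leaves part of the machine free.

    cpu_count defaults to os.cpu_count(); at least one process is always returned.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, int(cpu_count * usage_fraction))


if __name__ == "__main__":
    print("Suggested N_PROCESSES:", suggest_n_processes())
```

If memory rather than CPU is the limit, lowering `N_FILES_PER_BATCH` first is usually the gentler adjustment, since it does not reduce parallelism.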

## License

This project is provided as-is for research and educational purposes.