Initial commit

2025-11-14 18:46:03 +08:00
commit b85faf48cd
70 changed files with 57687 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,134 @@
# SDF File Processing and RDKit Substructure Matching
This project provides a complete workflow for extracting SDF files from various archive formats and performing high-throughput substructure matching using RDKit with multiprocessing.
## Setup
### Prerequisites
- Pixi package manager
- Python 3.11+
### Installation
1. Initialize the pixi environment and add the dependencies (already done in this repository; shown for reference):
```bash
pixi init
pixi add rdkit joblib pandas tqdm
```
2. Activate the environment:
```bash
pixi shell
```
## Project Structure
```
search_macro/
├── pixi.toml # Pixi configuration
├── README.md # This file
├── data/ # Input directory with zip/rar/sdf files
├── notebooks/ # Jupyter notebooks
│ ├── 01_extract_sdf_files.ipynb # Extract SDF from archives
│ └── 02_rdkit_substructure_matching.ipynb # Parallel substructure matching
├── extracted_sdf_files/ # Output directory for extracted SDF files
└── matching_results/ # Results directory for matching output
```
## Usage
### Step 1: Extract SDF Files
Open the first notebook to extract all SDF files from ZIP, RAR, and TAR.GZ archives:
```bash
jupyter notebook notebooks/01_extract_sdf_files.ipynb
```
This notebook will:
- Scan the `data/` directory for compressed files (ZIP, RAR, TAR.GZ)
- Extract all SDF/MOL/SD files from archives
- Copy existing SDF files to a unified directory
- Generate a comprehensive file list with metadata
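The scan-and-extract steps above can be sketched with the standard library. This is a minimal illustration, not the notebook's actual code: the function names are hypothetical, only ZIP archives and loose files are handled, and RAR and TAR.GZ support would follow the same shape with `rarfile` and `tarfile`.

```python
import shutil
import zipfile
from pathlib import Path

# Extensions treated as structure files, per the list above.
SDF_SUFFIXES = {".sdf", ".sd", ".mol"}


def is_structure_file(name: str) -> bool:
    """Return True if the file name ends in one of the SDF-like extensions."""
    return Path(name).suffix.lower() in SDF_SUFFIXES


def extract_sdf_from_zip(archive: Path, out_dir: Path) -> list:
    """Extract SDF-like members of one ZIP archive into out_dir/<archive stem>/."""
    target = out_dir / archive.stem
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir() or not is_structure_file(info.filename):
                continue
            # ZipFile.extract keeps the member's relative path under `target`
            # and strips absolute paths and ".." components.
            extracted.append(Path(zf.extract(info, path=target)))
    return extracted


def collect_sdf_files(data_dir: Path, out_dir: Path) -> list:
    """Gather SDF-like files from ZIP archives and loose files under data_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    collected = []
    for path in sorted(data_dir.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() == ".zip":
            collected.extend(extract_sdf_from_zip(path, out_dir))
        elif is_structure_file(path.name):
            # Loose files are copied flat into the unified output directory.
            collected.append(Path(shutil.copy2(path, out_dir / path.name)))
    return collected
```

Calling `collect_sdf_files(Path("data"), Path("extracted_sdf_files"))` would return the list of extracted paths, which could then be written out as the file list.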
### Step 2: Perform Substructure Matching
Open the second notebook for parallel RDKit matching:
```bash
jupyter notebook notebooks/02_rdkit_substructure_matching.ipynb
```
This notebook will:
- Load all extracted SDF files
- Define and compile SMARTS patterns for common chemical substructures
- Process molecules using 220 parallel processes
- Generate comprehensive matching results and statistics
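As a rough sketch of the matching step, the snippet below compiles SMARTS patterns once and tests each molecule against them. The two patterns and the function names are illustrative assumptions; the notebook's actual pattern set is not reproduced here.

```python
from rdkit import Chem

# Two illustrative SMARTS patterns (assumed for this sketch only).
SMARTS = {
    "benzene": "c1ccccc1",
    "carboxylic_acid": "[CX3](=O)[OX2H1]",
}


def compile_patterns(smarts: dict) -> dict:
    """Compile each SMARTS string once so the query molecules can be reused."""
    compiled = {}
    for name, pattern in smarts.items():
        query = Chem.MolFromSmarts(pattern)
        if query is None:  # MolFromSmarts returns None for invalid SMARTS
            raise ValueError(f"invalid SMARTS for {name!r}: {pattern}")
        compiled[name] = query
    return compiled


def match_molecule(mol, patterns: dict) -> dict:
    """Map each pattern name to True/False for a single molecule."""
    return {name: mol.HasSubstructMatch(query) for name, query in patterns.items()}


def match_sdf(path: str, patterns: dict) -> list:
    """Match every readable record of one SDF file; unparsable records are skipped."""
    rows = []
    for index, mol in enumerate(Chem.SDMolSupplier(path)):
        if mol is None:  # SDMolSupplier yields None for records it cannot parse
            continue
        rows.append({"file": path, "index": index, **match_molecule(mol, patterns)})
    return rows
```

Each row is a plain dictionary, so a list of rows can be passed directly to `pandas.DataFrame` for the summary statistics.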
## Features
### Archive Support
- ✅ ZIP files
- ✅ RAR files
- ✅ TAR.GZ files
- ✅ GZIP compressed SDF files
### SMARTS Patterns
The matching includes 18 common chemical substructures:
- Benzene rings, pyridine, heterocycles
- Functional groups: carboxylic acids, alcohols, amines, amides, esters, ketones, aldehydes
- Other features: nitro groups, halogens, sulfonamides, aromatic rings, alkenes, alkynes, ethers, phenols
### Performance
- **220 parallel processes** by default (set via `N_PROCESSES`)
- Batch processing to manage memory usage
- Progress tracking with tqdm
- Intermediate result saving
- Performance statistics and analysis
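The batching and intermediate-save behaviour can be sketched as follows. This is a serial illustration under stated assumptions: `run_in_batches`, `write_rows`, and the `intermediate_*.csv` naming are hypothetical, and `process_batch` stands in for whatever function the notebook fans out to its worker processes.

```python
import csv
from pathlib import Path


def batched(items: list, size: int):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_rows(path, rows: list) -> None:
    """Write a list of dictionaries to CSV, using the first row's keys as the header."""
    if not rows:
        return
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def run_in_batches(files, process_batch, out_dir, n_files_per_batch=100, save_every=10):
    """Process files batch by batch, saving cumulative results every `save_every` batches."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for batch_no, batch in enumerate(batched(files, n_files_per_batch), start=1):
        # process_batch returns a list of row dictionaries for this batch.
        rows.extend(process_batch(batch))
        if batch_no % save_every == 0:
            write_rows(out_dir / f"intermediate_{batch_no:05d}.csv", rows)
    return rows
```

Holding only one batch of files in flight at a time bounds peak memory, and the periodic CSV means a crash late in a long run does not lose everything processed so far.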
## Output Files
### Extraction Results (`extracted_sdf_files/`)
- `sdf_file_list.csv` - Complete list of all SDF files with metadata
- Subdirectories that preserve each archive's original structure
### Matching Results (`matching_results/`)
- `complete_matching_results.csv` - All molecules with pattern matches
- `matching_summary.csv` - Summary statistics for each pattern
- `molecules_with_{pattern}.csv` - Individual files for each pattern
- `pattern_frequencies.png` - Visualization of pattern frequencies
- Intermediate results saved every 10 batches
## Performance Notes
- Designed for a high-core-count machine (the defaults assume 256 hardware threads)
- Uses 220 processes to maximize throughput while leaving headroom for the rest of the system
- Memory-efficient batch processing
- Automatic cleanup and error handling
## Dependencies
- `rdkit` - Chemical informatics and structure processing
- `joblib` - Parallel processing backend
- `pandas` - Data manipulation and analysis
- `tqdm` - Progress bars
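- `pathlib` - Modern path handling (Python standard library)
- `zipfile`, `tarfile` - ZIP and TAR archive extraction (Python standard library)
- `rarfile` - RAR archive extraction (third-party; not included in the `pixi add` command above and needs an external `unrar` tool)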
## Troubleshooting
### Common Issues
1. **RAR file extraction errors**: Install the `rarfile` package and make sure the `unrar` executable is on your `PATH`
2. **Memory issues**: Reduce `N_PROCESSES` or `N_FILES_PER_BATCH` in the matching notebook
3. **Permission errors**: Ensure write access to output directories
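Two of these issues can be checked up front, before a long run starts. The helpers below are a hypothetical sketch (the names are not from the notebooks) that use only the standard library:

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def find_unrar(tool: str = "unrar") -> Optional[str]:
    """Return the full path of the given executable, or None if it is not on PATH."""
    return shutil.which(tool)


def is_writable_dir(path) -> bool:
    """Create the directory if needed and report whether the current user can write to it."""
    directory = Path(path)
    try:
        directory.mkdir(parents=True, exist_ok=True)
    except OSError:
        return False
    return os.access(directory, os.W_OK)


if __name__ == "__main__":
    print("unrar:", find_unrar() or "not found on PATH")
    for name in ("extracted_sdf_files", "matching_results"):
        print(name, "writable:", is_writable_dir(name))
```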
### Performance Tuning
- Adjust `N_PROCESSES` based on available CPU cores
- Modify `N_FILES_PER_BATCH` for memory constraints
- Use the `max_molecules` parameter to test on a smaller dataset
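One way to derive `N_PROCESSES` from the machine rather than hard-coding it is sketched below. The default reserve of 36 is an assumption taken from the figures in this README (256 threads minus 220 processes); `pick_n_processes` is a hypothetical helper, not part of the notebooks.

```python
import os
from typing import Optional


def pick_n_processes(available: Optional[int] = None, reserve: int = 36) -> int:
    """Return a worker count that leaves `reserve` hardware threads free (minimum 1)."""
    if available is None:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        available = os.cpu_count() or 1
    return max(1, available - reserve)


N_PROCESSES = pick_n_processes()
```

On a 256-thread machine this gives 220; on a small laptop it falls back to a single process, so a smaller `reserve` would be more sensible there.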
## License
This project is provided as-is for research and educational purposes.