SDF File Processing and RDKit Substructure Matching
This project provides a complete workflow for extracting SDF files from various archive formats and performing high-throughput substructure matching using RDKit with multiprocessing.
Setup
Prerequisites
- Pixi package manager
- Python 3.11+
Installation
- Initialize the pixi environment (already done):
pixi init
pixi add rdkit joblib pandas tqdm
- Activate the environment:
pixi shell
Project Structure
search_macro/
├── pixi.toml # Pixi configuration
├── README.md # This file
├── data/ # Input directory with zip/rar/sdf files
├── notebooks/ # Jupyter notebooks
│ ├── 01_extract_sdf_files.ipynb # Extract SDF from archives
│ └── 02_rdkit_substructure_matching.ipynb # Parallel substructure matching
├── extracted_sdf_files/ # Output directory for extracted SDF files
└── matching_results/ # Results directory for matching output
Usage
Step 1: Extract SDF Files
Open the first notebook to extract all SDF files from ZIP/RAR archives:
jupyter notebook notebooks/01_extract_sdf_files.ipynb
This notebook will:
- Scan the data/ directory for compressed files (ZIP, RAR, TAR.GZ)
- Extract all SDF/MOL/SD files from archives
- Copy existing SDF files to a unified directory
- Generate a comprehensive file list with metadata
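The extraction step for a single ZIP archive can be sketched as follows. This is a minimal illustration using only the standard library, not the notebook's actual code; the function name `extract_sdf_from_zip` and the `SDF_SUFFIXES` set are hypothetical, and RAR or TAR.GZ archives would need `rarfile` or `tarfile` instead.

```python
import zipfile
from pathlib import Path

# Extensions treated as structure files (assumed from the SDF/MOL/SD list above).
SDF_SUFFIXES = {".sdf", ".mol", ".sd"}


def extract_sdf_from_zip(archive: Path, out_dir: Path) -> list[Path]:
    """Extract only SDF/MOL/SD members into out_dir/<archive stem>/."""
    target = out_dir / archive.stem
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            if Path(info.filename).suffix.lower() in SDF_SUFFIXES:
                # extract() keeps the member's subdirectory layout under target
                # and returns the path of the file it created.
                extracted.append(Path(zf.extract(info, path=target)))
    return extracted
```

Extracting each archive into a directory named after it keeps the original archive structure visible in the output directory.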
Step 2: Perform Substructure Matching
Open the second notebook for parallel RDKit matching:
jupyter notebook notebooks/02_rdkit_substructure_matching.ipynb
This notebook will:
- Load all extracted SDF files
- Define and compile SMARTS patterns for common chemical substructures
- Process molecules using 220 parallel processes
- Generate comprehensive matching results and statistics
Features
Archive Support
- ✅ ZIP files
- ✅ RAR files
- ✅ TAR.GZ files
- ✅ GZIP compressed SDF files
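GZIP-compressed SDF files do not need to be unpacked to disk before reading. As a rough sketch (the helper name `count_sdf_records` is hypothetical, and the notebook may handle this differently), the following counts records in either a plain or a gzip-compressed SDF file by looking for the `$$$$` line that terminates each record:

```python
import gzip
from pathlib import Path


def count_sdf_records(path: Path) -> int:
    """Count records in a plain or gzip-compressed SDF file."""
    # Choose the opener from the file extension; both accept text mode.
    opener = gzip.open if path.suffix.lower() == ".gz" else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        # Each SDF record ends with a line containing only "$$$$".
        return sum(1 for line in fh if line.strip() == "$$$$")
```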
SMARTS Patterns
The matching includes 18 common chemical substructures:
- Benzene rings, pyridine, heterocycles
- Functional groups: carboxylic acids, alcohols, amines, amides, esters, ketones, aldehydes
- Other features: nitro groups, halogens, sulfonamides, aromatic rings, alkenes, alkynes, ethers, phenols
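To illustrate how such patterns are applied, here is a minimal sketch using RDKit. The three SMARTS strings are common textbook forms chosen for illustration and are assumptions; the notebook's actual 18 definitions may differ. The `match_patterns` helper is likewise hypothetical.

```python
from rdkit import Chem

# Illustrative SMARTS only (assumed, not taken from the notebook).
SMARTS = {
    "benzene": "c1ccccc1",
    "carboxylic_acid": "[CX3](=O)[OX2H1]",
    "nitro": "[N+](=O)[O-]",
}

# Compile each pattern once so it can be reused for every molecule.
COMPILED = {name: Chem.MolFromSmarts(s) for name, s in SMARTS.items()}


def match_patterns(mol):
    """Return {pattern name: True/False} for one RDKit molecule."""
    return {name: mol.HasSubstructMatch(patt) for name, patt in COMPILED.items()}


mol = Chem.MolFromSmiles("OC(=O)c1ccccc1")  # benzoic acid
print(match_patterns(mol))
```

For molecules stored in SDF files, `Chem.SDMolSupplier` yields molecule objects that can be passed to the same helper; entries that fail to parse come back as `None` and should be skipped.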
Performance
- 220 parallel worker processes
- Batch processing to manage memory usage
- Progress tracking with tqdm
- Intermediate result saving
- Performance statistics and analysis
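The batching and intermediate-saving behaviour can be sketched in plain Python. This is an assumption-laden outline rather than the notebook's code: `batched`, `process_in_batches`, `process_batch`, and `save_checkpoint` are hypothetical names, and in the real workflow `process_batch` would hand each batch to the parallel backend.

```python
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def process_in_batches(files, process_batch, save_checkpoint, batch_size, every=10):
    """Process files batch by batch, saving intermediate results every `every` batches."""
    results = []
    for i, batch in enumerate(batched(files, batch_size), start=1):
        results.extend(process_batch(batch))
        if i % every == 0:
            # Checkpoint so a crash late in the run does not lose everything.
            save_checkpoint(i, results)
    return results
```

Only one batch of files is in flight at a time, which is what bounds memory use; a smaller `batch_size` trades throughput for a lower peak.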
Output Files
Extraction Results (extracted_sdf_files/)
- sdf_file_list.csv - Complete list of all SDF files with metadata
- Organized subdirectories maintaining original archive structure
Matching Results (matching_results/)
- complete_matching_results.csv - All molecules with pattern matches
- matching_summary.csv - Summary statistics for each pattern
- molecules_with_{pattern}.csv - Individual files for each pattern
- pattern_frequencies.png - Visualization of pattern frequencies
- Intermediate results saved every 10 batches
Performance Notes
- Designed for high-performance systems (256 threads available)
- Uses 220 processes to maximize throughput while leaving headroom for other system tasks
- Memory-efficient batch processing
- Automatic cleanup and error handling
Dependencies
- rdkit - Chemical informatics and structure processing
- joblib - Parallel processing backend
- pandas - Data manipulation and analysis
- tqdm - Progress bars
- pathlib - Modern path handling
- zipfile, rarfile, tarfile - Archive extraction
Troubleshooting
Common Issues
- RAR file extraction errors: Install rarfile and ensure unrar is available
- Memory issues: Reduce N_PROCESSES or N_FILES_PER_BATCH in the matching notebook
- Permission errors: Ensure write access to output directories
Performance Tuning
- Adjust N_PROCESSES based on available CPU cores
- Modify N_FILES_PER_BATCH for memory constraints
- Use the max_molecules parameter for testing with smaller datasets
License
This project is provided as-is for research and educational purposes.