# SDF File Processing and RDKit Substructure Matching

This project provides a complete workflow for extracting SDF files from various archive formats and performing high-throughput substructure matching using RDKit with multiprocessing.

## Setup

### Prerequisites

- Pixi package manager
- Python 3.11+

### Installation

1. Initialize the pixi environment (already done for this repository):

   ```bash
   pixi init
   pixi add rdkit joblib pandas tqdm
   ```

2. Activate the environment:

   ```bash
   pixi shell
   ```

## Project Structure

```
search_macro/
├── pixi.toml                                   # Pixi configuration
├── README.md                                   # This file
├── data/                                       # Input directory with zip/rar/sdf files
├── notebooks/                                  # Jupyter notebooks
│   ├── 01_extract_sdf_files.ipynb              # Extract SDF from archives
│   └── 02_rdkit_substructure_matching.ipynb    # Parallel substructure matching
├── extracted_sdf_files/                        # Output directory for extracted SDF files
└── matching_results/                           # Results directory for matching output
```

## Usage

### Step 1: Extract SDF Files

Open the first notebook to extract all SDF files from the archives:

```bash
jupyter notebook notebooks/01_extract_sdf_files.ipynb
```

This notebook will:

- Scan the `data/` directory for compressed files (ZIP, RAR, TAR.GZ)
- Extract all SDF/MOL/SD files from archives
- Copy existing SDF files to a unified directory
- Generate a file list with metadata

### Step 2: Perform Substructure Matching

Open the second notebook for parallel RDKit matching:

```bash
jupyter notebook notebooks/02_rdkit_substructure_matching.ipynb
```

This notebook will:

- Load all extracted SDF files
- Define and compile SMARTS patterns for common chemical substructures
- Process molecules using 220 parallel processes
- Generate matching results and summary statistics

## Features

### Archive Support

- ✅ ZIP files
- ✅ RAR files
- ✅ TAR.GZ files
- ✅ GZIP compressed SDF files

### SMARTS Patterns

The matching includes 18 common chemical substructures:

- Benzene rings, pyridine, heterocycles
- Functional groups: carboxylic acids, alcohols, amines, amides, esters, ketones, aldehydes
- Other features: nitro groups, halogens, sulfonamides, aromatic rings, alkenes, alkynes, ethers, phenols

### Performance

- **220 parallel processes** (see `N_PROCESSES` under Performance Tuning)
- Batch processing to manage memory usage
- Progress tracking with tqdm
- Intermediate result saving
- Performance statistics and analysis

## Output Files

### Extraction Results (`extracted_sdf_files/`)

- `sdf_file_list.csv` - Complete list of all SDF files with metadata
- Organized subdirectories maintaining the original archive structure

### Matching Results (`matching_results/`)

- `complete_matching_results.csv` - All molecules with pattern matches
- `matching_summary.csv` - Summary statistics for each pattern
- `molecules_with_{pattern}.csv` - Individual files for each pattern
- `pattern_frequencies.png` - Visualization of pattern frequencies
- Intermediate results saved every 10 batches

## Performance Notes

- Designed for high-performance systems (256 threads available)
- Uses 220 processes to maximize throughput while leaving headroom for the rest of the system
- Memory-efficient batch processing
- Automatic cleanup and error handling

## Dependencies

Third-party packages:

- `rdkit` - Cheminformatics and structure processing
- `joblib` - Parallel processing backend
- `pandas` - Data manipulation and analysis
- `tqdm` - Progress bars
- `rarfile` - RAR archive extraction (not part of the `pixi add` command above; also requires the `unrar` tool)

Python standard library modules:

- `pathlib` - Modern path handling
- `zipfile`, `tarfile` - ZIP and TAR archive extraction

## Troubleshooting

### Common Issues

1. **RAR file extraction errors**: Install `rarfile` and ensure `unrar` is available
2. **Memory issues**: Reduce `N_PROCESSES` or `N_FILES_PER_BATCH` in the matching notebook
3. **Permission errors**: Ensure write access to the output directories

### Performance Tuning

- Adjust `N_PROCESSES` based on available CPU cores
- Modify `N_FILES_PER_BATCH` for memory constraints
- Use the `max_molecules` parameter for testing with smaller datasets

## License

This project is provided as-is for research and educational purposes.
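
## Example: Extracting Structure Files from ZIP Archives

The extraction step described in Step 1 can be sketched with the standard library alone. This is a minimal illustration and not the notebook's actual code: the function names `extract_structures_from_zip` and `extract_all` and the constant `STRUCTURE_SUFFIXES` are hypothetical, only ZIP archives are handled (RAR and TAR.GZ would need `rarfile` and `tarfile` respectively), and placing each archive's files under a subdirectory named after the archive is an assumption about how the original structure is kept.

```python
import tempfile
import zipfile
from pathlib import Path

# Extensions treated as structure files, following the SDF/MOL/SD list in Step 1.
STRUCTURE_SUFFIXES = {".sdf", ".mol", ".sd"}


def extract_structures_from_zip(archive: Path, out_root: Path) -> list:
    """Extract structure files from one ZIP archive into out_root/<archive stem>/."""
    target_dir = out_root / archive.stem
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            if Path(info.filename).suffix.lower() not in STRUCTURE_SUFFIXES:
                continue
            # ZipFile.extract keeps the member's relative path inside target_dir
            # and returns the path of the file it created.
            extracted.append(Path(zf.extract(info, path=target_dir)))
    return extracted


def extract_all(data_dir: Path, out_root: Path) -> list:
    """Scan data_dir recursively for ZIP archives and extract their structure files."""
    results = []
    for archive in sorted(data_dir.rglob("*.zip")):
        results.extend(extract_structures_from_zip(archive, out_root))
    return results


if __name__ == "__main__":
    # Small self-contained demo using a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp) / "data"
        data_dir.mkdir()
        with zipfile.ZipFile(data_dir / "library.zip", "w") as zf:
            zf.writestr("set1/mol_a.sdf", "placeholder")
            zf.writestr("set1/readme.txt", "ignored")
        for path in extract_all(data_dir, Path(tmp) / "extracted_sdf_files"):
            print(path.name)
```

Filtering on the lowercased suffix means files such as `X.SDF` are picked up as well, which is a design choice of this sketch.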
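
## Example: Parallel SMARTS Matching

The matching step described in Step 2 (compile SMARTS patterns, then test molecules in parallel) might look roughly like the sketch below. Several things here are assumptions made for illustration: the three SMARTS strings are common textbook patterns and not the notebook's actual 18, the helper names `match_batch` and `match_parallel` are hypothetical, and molecules are given as SMILES strings to keep the example self-contained. With real SDF input, `Chem.SDMolSupplier` would provide the molecules instead.

```python
from joblib import Parallel, delayed
from rdkit import Chem

# Illustrative patterns only; the project's real pattern set is not reproduced here.
SMARTS_PATTERNS = {
    "benzene": "c1ccccc1",
    "carboxylic_acid": "[CX3](=O)[OX2H1]",
    "nitro": "[N+](=O)[O-]",
}


def match_batch(smiles_batch, smarts_patterns):
    """Return one dict per parseable molecule with a boolean flag per pattern."""
    # Compile inside the worker so only plain strings cross process boundaries.
    compiled = {name: Chem.MolFromSmarts(s) for name, s in smarts_patterns.items()}
    rows = []
    for smiles in smiles_batch:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            # RDKit returns None for input it cannot parse; skip such entries.
            continue
        row = {"smiles": smiles}
        for name, pattern in compiled.items():
            row[name] = mol.HasSubstructMatch(pattern)
        rows.append(row)
    return rows


def match_parallel(batches, smarts_patterns, n_jobs=2):
    """Run match_batch over all batches with joblib and flatten the results."""
    per_batch = Parallel(n_jobs=n_jobs)(
        delayed(match_batch)(batch, smarts_patterns) for batch in batches
    )
    return [row for rows in per_batch for row in rows]


if __name__ == "__main__":
    demo_batches = [["c1ccccc1C(=O)O", "CCO"], ["C[N+](=O)[O-]"]]
    for row in match_parallel(demo_batches, SMARTS_PATTERNS):
        print(row)
```

The resulting list of dicts can be passed directly to `pandas.DataFrame` to build a table like `complete_matching_results.csv`. The `n_jobs` value plays the role that `N_PROCESSES` plays in the notebook.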
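
## Example: Choosing a Process Count and Batching Files

The tuning advice above (adjust `N_PROCESSES` to the available cores, adjust `N_FILES_PER_BATCH` for memory) can be illustrated with a small standard-library sketch. The helpers `choose_n_processes` and `iter_batches` are hypothetical names, and the `reserve` parameter is an assumption about how headroom might be expressed; the notebook may set its values differently.

```python
import os
from itertools import islice
from typing import Optional


def choose_n_processes(reserve: int = 4, requested: Optional[int] = None) -> int:
    """Pick a worker count that leaves `reserve` logical CPUs free (at least 1 worker)."""
    available = os.cpu_count() or 1
    limit = max(1, available - reserve)
    if requested is None:
        return limit
    # Never exceed the limit, and never go below one worker.
    return max(1, min(requested, limit))


def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


if __name__ == "__main__":
    files = [f"file_{i}.sdf" for i in range(7)]
    print("workers:", choose_n_processes())
    for batch in iter_batches(files, batch_size=3):
        print(batch)
```

On the 256-thread system mentioned in the performance notes, `choose_n_processes(reserve=36)` would give the 220 processes this README uses. Smaller batches lower peak memory at the cost of more scheduling overhead.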