This project provides a complete workflow for extracting SDF files from various archive formats and performing high-throughput substructure matching using RDKit with multiprocessing.

Setup

Prerequisites

Pixi package manager
Python 3.11+

Installation

Initialize the pixi environment (already done):

pixi init
pixi add rdkit joblib pandas tqdm

Activate the environment:

pixi shell

Project Structure

search_macro/
├── pixi.toml              # Pixi configuration
├── README.md              # This file
├── data/                  # Input directory with zip/rar/sdf files
├── notebooks/             # Jupyter notebooks
│   ├── 01_extract_sdf_files.ipynb      # Extract SDF from archives
│   └── 02_rdkit_substructure_matching.ipynb  # Parallel substructure matching
├── extracted_sdf_files/   # Output directory for extracted SDF files
└── matching_results/      # Results directory for matching output

Usage

Step 1: Extract SDF Files

Open the first notebook to extract all SDF files from ZIP/RAR archives:

jupyter notebook notebooks/01_extract_sdf_files.ipynb

This notebook will:

Scan the data/ directory for archives and compressed files (ZIP, RAR, TAR.GZ, GZIP)
Extract all SDF/MOL/SD files from archives
Copy existing SDF files to a unified directory
Generate a comprehensive file list with metadata
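
The extraction steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the helper names (extract_structure_files, _save, _is_structure_file) are hypothetical, RAR handling (which needs the third-party rarfile package) is left out, and for brevity the sketch writes each file to <archive name>/<file name> instead of preserving the full directory tree inside the archive.

```python
import shutil
import tarfile
import zipfile
from pathlib import Path

# Extensions treated as structure files (SDF/MOL/SD, as listed above).
STRUCTURE_EXTS = {".sdf", ".mol", ".sd"}


def _is_structure_file(name):
    return Path(name).suffix.lower() in STRUCTURE_EXTS


def _save(src, target, written):
    """Copy an open binary stream to target and record the path."""
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    written.append(target)


def extract_structure_files(data_dir, out_dir):
    """Collect structure files from archives and loose files in data_dir.

    Hypothetical helper: handles ZIP, TAR.GZ and plain files only.
    Returns the sorted list of paths written under out_dir.
    """
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    written = []
    # Materialize the listing first so the scan is fixed before any writing.
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        lower = path.name.lower()
        if lower.endswith(".zip"):
            with zipfile.ZipFile(path) as zf:
                for info in zf.infolist():
                    if not info.is_dir() and _is_structure_file(info.filename):
                        # Keep only the base name, ignoring paths stored in the archive.
                        target = out_dir / path.stem / Path(info.filename).name
                        with zf.open(info) as src:
                            _save(src, target, written)
        elif lower.endswith((".tar.gz", ".tgz")):
            stem = path.name[: -len(".tar.gz")] if lower.endswith(".tar.gz") else path.stem
            with tarfile.open(path, "r:gz") as tf:
                for member in tf.getmembers():
                    if member.isfile() and _is_structure_file(member.name):
                        target = out_dir / stem / Path(member.name).name
                        with tf.extractfile(member) as src:
                            _save(src, target, written)
        elif _is_structure_file(path.name):
            # Loose structure files are copied straight into out_dir.
            with open(path, "rb") as src:
                _save(src, out_dir / path.name, written)
    return sorted(written)
```

Calling extract_structure_files("data", "extracted_sdf_files") would mirror the directory names used in this project. The output directory is assumed to live outside data/, as in the project structure above.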

Step 2: Perform Substructure Matching

Open the second notebook for parallel RDKit matching:

jupyter notebook notebooks/02_rdkit_substructure_matching.ipynb

This notebook will:

Load all extracted SDF files
Define and compile SMARTS patterns for common chemical substructures
Process molecules using 220 parallel processes
Generate comprehensive matching results and statistics

Features

Archive Support

✅ ZIP files
✅ RAR files
✅ TAR.GZ files
✅ GZIP compressed SDF files

SMARTS Patterns

The matching includes 18 common chemical substructures:

Benzene rings, pyridine, heterocycles
Functional groups: carboxylic acids, alcohols, amines, amides, esters, ketones, aldehydes
Other features: nitro groups, halogens, sulfonamides, aromatic rings, alkenes, alkynes, ethers, phenols
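
As an illustration of how such patterns can be compiled and applied with RDKit, here is a minimal sketch. The SMARTS strings below are common textbook forms chosen for this example and cover only a few of the 18 categories; they are assumptions and may differ from the patterns defined in the notebook. The function names are also hypothetical. RDKit is imported inside the functions that need it, so the pattern table itself can be inspected without RDKit installed.

```python
# Example SMARTS for a handful of the substructures listed above.
# These strings are illustrative assumptions, not the notebook's definitions.
EXAMPLE_SMARTS = {
    "benzene": "c1ccccc1",
    "carboxylic_acid": "[CX3](=O)[OX2H1]",
    "nitro": "[N+](=O)[O-]",
    "halogen": "[F,Cl,Br,I]",
    "alkyne": "C#C",
}


def compile_patterns(smarts_by_name):
    """Compile SMARTS strings once so they can be reused for every molecule."""
    from rdkit import Chem

    compiled = {}
    for name, smarts in smarts_by_name.items():
        pattern = Chem.MolFromSmarts(smarts)
        if pattern is None:  # MolFromSmarts returns None for invalid SMARTS
            raise ValueError(f"Invalid SMARTS for {name!r}: {smarts}")
        compiled[name] = pattern
    return compiled


def match_molecule(mol, compiled):
    """Return {pattern name: True/False} for a single RDKit molecule."""
    return {name: mol.HasSubstructMatch(pattern) for name, pattern in compiled.items()}


def match_sdf_file(sdf_path, compiled):
    """Match every readable molecule in one SDF file; returns a list of row dicts."""
    from rdkit import Chem

    rows = []
    for index, mol in enumerate(Chem.SDMolSupplier(str(sdf_path))):
        if mol is None:  # RDKit yields None for records it cannot parse
            continue
        row = {"file": str(sdf_path), "index": index}
        row.update(match_molecule(mol, compiled))
        rows.append(row)
    return rows
```

Compiling the patterns once and passing the compiled dictionary to each call avoids re-parsing the SMARTS for every molecule. The row dictionaries returned by match_sdf_file can be passed directly to pandas.DataFrame to build a results table.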

Performance

220 parallel processes (adjustable via N_PROCESSES)
Batch processing to manage memory usage
Progress tracking with tqdm
Intermediate result saving
Performance statistics and analysis
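
The combination of batching, parallel workers and progress tracking might look like the sketch below. It assumes a hypothetical per-file worker function (for example, one that matches all molecules in a single SDF file) and is not the notebook's actual implementation; the parameter names n_processes and n_files_per_batch simply mirror the N_PROCESSES and N_FILES_PER_BATCH settings mentioned in this README.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def run_in_batches(files, worker, n_processes, n_files_per_batch):
    """Apply worker(file) to every file, one batch at a time.

    Only one batch of results is produced per Parallel call, which keeps
    the number of in-flight tasks (and their memory) bounded.
    """
    from joblib import Parallel, delayed
    from tqdm import tqdm

    results = []
    for batch in tqdm(make_batches(files, n_files_per_batch), desc="batches"):
        # joblib runs the worker calls of this batch across n_processes workers.
        batch_results = Parallel(n_jobs=n_processes)(
            delayed(worker)(path) for path in batch
        )
        results.extend(batch_results)
    return results
```

Processing one batch per Parallel call is also a natural point at which to write intermediate results to disk, since each batch finishes completely before the next one starts.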

Output Files

Extraction Results (`extracted_sdf_files/`)

sdf_file_list.csv - Complete list of all SDF files with metadata
Subdirectories that preserve each archive's original structure

Matching Results (`matching_results/`)

complete_matching_results.csv - All molecules with pattern matches
matching_summary.csv - Summary statistics for each pattern
molecules_with_{pattern}.csv - Individual files for each pattern
pattern_frequencies.png - Visualization of pattern frequencies
Intermediate results saved every 10 batches

Performance Notes

Designed for high-performance systems (256 hardware threads available)
Uses 220 processes to maximize throughput while leaving some threads free for the rest of the system
Memory-efficient batch processing
Automatic cleanup and error handling

Dependencies

rdkit - Cheminformatics and structure processing
joblib - Parallel processing backend
pandas - Data manipulation and analysis
tqdm - Progress bars
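pathlib - Modern path handling (Python standard library)
zipfile, tarfile - ZIP and TAR archive extraction (Python standard library)
rarfile - RAR archive extraction (third-party; requires an external tool such as unrar)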

Troubleshooting

Common Issues

RAR file extraction errors: Install rarfile and ensure unrar is available
Memory issues: Reduce N_PROCESSES or N_FILES_PER_BATCH in the matching notebook
Permission errors: Ensure write access to output directories
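
For the RAR issue above, a quick way to check whether an extraction tool is on the PATH is sketched below. The function name find_rar_backend is hypothetical; by default it looks only for unrar, the tool named above, and other executable names can be passed in if a different backend is installed.

```python
import shutil


def find_rar_backend(candidates=("unrar",)):
    """Return the first candidate executable found on PATH, or None."""
    for tool in candidates:
        if shutil.which(tool):
            return tool
    return None


if find_rar_backend() is None:
    print("No RAR extraction tool found on PATH; RAR archives cannot be unpacked.")
```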

Performance Tuning

Adjust N_PROCESSES based on available CPU cores
Modify N_FILES_PER_BATCH for memory constraints
Use max_molecules parameter for testing with smaller datasets
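
One way to pick N_PROCESSES from the available cores is sketched here. The helper choose_n_processes and its reserve parameter are assumptions made for illustration; the notebook may set the value differently.

```python
import os


def choose_n_processes(requested=220, reserve=4, available=None):
    """Cap the requested process count at the core count minus a reserve.

    available defaults to os.cpu_count(); at least one process is always returned.
    """
    if available is None:
        available = os.cpu_count() or 1
    return max(1, min(requested, available - reserve))


N_PROCESSES = choose_n_processes()
```

With 256 hardware threads and the default reserve of 4, this keeps the requested 220 processes; on a smaller machine it falls back to the number of cores minus the reserve.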

License

This project is provided as-is for research and educational purposes.

SDF File Processing and RDKit Substructure Matching