# SDF File Processing and RDKit Substructure Matching

This project provides a complete workflow for extracting SDF files from various archive formats and performing high-throughput substructure matching using RDKit with multiprocessing.

## Setup

### Prerequisites
- Pixi package manager
- Python 3.11+

### Installation

1. Initialize the pixi environment and add the dependencies (skip this step if `pixi.toml` already exists):
```bash
pixi init
pixi add rdkit joblib pandas tqdm
```

2. Activate the environment:
```bash
pixi shell
```

## Project Structure

```
search_macro/
├── pixi.toml              # Pixi configuration
├── README.md              # This file
├── data/                  # Input directory with zip/rar/sdf files
├── notebooks/             # Jupyter notebooks
│   ├── 01_extract_sdf_files.ipynb      # Extract SDF from archives
│   └── 02_rdkit_substructure_matching.ipynb  # Parallel substructure matching
├── extracted_sdf_files/   # Output directory for extracted SDF files
└── matching_results/      # Results directory for matching output
```

## Usage

### Step 1: Extract SDF Files

Open the first notebook to extract all SDF files from the supported archives (ZIP, RAR, TAR.GZ):

```bash
jupyter notebook notebooks/01_extract_sdf_files.ipynb
```

This notebook will:
- Scan the `data/` directory for compressed files (ZIP, RAR, TAR.GZ)
- Extract all SDF/MOL/SD files from archives
- Copy existing SDF files to a unified directory
- Generate a comprehensive file list with metadata
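
The extraction step for a single ZIP archive could look roughly like the sketch below. It uses only the standard library; the function name `extract_sdf_from_zip` and the one-subdirectory-per-archive layout are assumptions for illustration, not the notebook's actual code.

```python
import zipfile
from pathlib import Path

# Extensions treated as structure files, matching the SDF/MOL/SD list above.
SDF_SUFFIXES = {".sdf", ".mol", ".sd"}


def extract_sdf_from_zip(archive_path, out_dir):
    """Extract only the SDF/MOL/SD members of a ZIP archive into out_dir.

    Returns the list of extracted file paths.
    """
    archive_path = Path(archive_path)
    # One subdirectory per archive keeps files from different archives apart.
    target = Path(out_dir) / archive_path.stem
    extracted = []
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            if Path(info.filename).suffix.lower() in SDF_SUFFIXES:
                # extract() creates parent directories and returns the final path.
                extracted.append(Path(zf.extract(info, path=target)))
    return extracted
```

TAR.GZ archives can be handled along the same lines with the standard `tarfile` module, and RAR archives with the third-party `rarfile` package.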

### Step 2: Perform Substructure Matching

Open the second notebook for parallel RDKit matching:

```bash
jupyter notebook notebooks/02_rdkit_substructure_matching.ipynb
```

This notebook will:
- Load all extracted SDF files
- Define and compile SMARTS patterns for common chemical substructures
- Process molecules using 220 parallel processes
- Generate comprehensive matching results and statistics
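
As a minimal sketch of the matching step, the example below compiles a few SMARTS patterns once and tests a molecule against each. The three patterns, the `SMARTS_PATTERNS` and `match_patterns` names, and the SMILES input are illustrative assumptions; the notebook's actual 18 patterns may be written differently.

```python
from rdkit import Chem

# A small illustrative subset; the notebook's SMARTS strings may differ.
SMARTS_PATTERNS = {
    "benzene": "c1ccccc1",
    "carboxylic_acid": "[CX3](=O)[OX2H1]",
    "nitro": "[N+](=O)[O-]",
}

# Compile each SMARTS once so it is not re-parsed for every molecule.
COMPILED = {name: Chem.MolFromSmarts(s) for name, s in SMARTS_PATTERNS.items()}


def match_patterns(mol):
    """Return {pattern_name: bool} for one RDKit molecule."""
    return {name: mol.HasSubstructMatch(patt) for name, patt in COMPILED.items()}


# SMILES is used only to keep the example self-contained. For SDF input,
# Chem.SDMolSupplier(path) yields molecules (None for records that fail to parse).
benzoic_acid = Chem.MolFromSmiles("OC(=O)c1ccccc1")
result = match_patterns(benzoic_acid)
print(result)
```

When reading real SDF files, records that fail to parse come back as `None` and should be skipped before calling `match_patterns`.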

## Features

### Archive Support
- ✅ ZIP files
- ✅ RAR files
- ✅ TAR.GZ files
- ✅ GZIP compressed SDF files

### SMARTS Patterns
The matching includes 18 common chemical substructures:
- Benzene rings, pyridine, heterocycles
- Functional groups: carboxylic acids, alcohols, amines, amides, esters, ketones, aldehydes
- Other features: nitro groups, halogens, sulfonamides, aromatic rings, alkenes, alkynes, ethers, phenols

### Performance
- **220 parallel processes** by default (configurable via `N_PROCESSES`)
- Batch processing to manage memory usage
- Progress tracking with tqdm
- Intermediate result saving
- Performance statistics and analysis
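
The batching idea can be sketched with `joblib` as follows. The worker `process_file` is a hypothetical placeholder, and the values of `N_PROCESSES` and `N_FILES_PER_BATCH` here are small illustrative settings rather than the notebook's defaults.

```python
from joblib import Parallel, delayed

N_PROCESSES = 4        # illustrative; the notebook is described as using 220
N_FILES_PER_BATCH = 3  # illustrative value


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_file(path):
    """Hypothetical worker; a real one would parse the SDF and run the matching."""
    return {"file": path, "name_length": len(path)}


def run_in_batches(paths, n_processes=N_PROCESSES, batch_size=N_FILES_PER_BATCH):
    """Process files batch by batch, running each batch in parallel."""
    results = []
    for batch in make_batches(paths, batch_size):
        # Only one batch is in flight at a time, which bounds memory use.
        results.extend(
            Parallel(n_jobs=n_processes)(delayed(process_file)(p) for p in batch)
        )
    return results
```

Processing one batch at a time also gives a natural point to write intermediate results to disk between batches.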

## Output Files

### Extraction Results (`extracted_sdf_files/`)
- `sdf_file_list.csv` - Complete list of all SDF files with metadata
- Organized subdirectories maintaining original archive structure

### Matching Results (`matching_results/`)
- `complete_matching_results.csv` - All molecules with pattern matches
- `matching_summary.csv` - Summary statistics for each pattern
- `molecules_with_{pattern}.csv` - Individual files for each pattern
- `pattern_frequencies.png` - Visualization of pattern frequencies
- Intermediate results saved every 10 batches

## Performance Notes

- Designed for high-performance systems (tuned for a machine with 256 hardware threads)
- Uses 220 processes to maximize throughput while leaving headroom for the operating system and other tasks
- Memory-efficient batch processing
- Automatic cleanup and error handling

## Dependencies

- `rdkit` - Chemical informatics and structure processing
- `joblib` - Parallel processing backend
- `pandas` - Data manipulation and analysis
- `tqdm` - Progress bars
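- `pathlib`, `zipfile`, `tarfile` - Path handling and archive extraction (Python standard library, no installation needed)
- `rarfile` - RAR archive extraction (third-party package; not part of the `pixi add` command above, so it must be installed separately)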

## Troubleshooting

### Common Issues

1. **RAR file extraction errors**: Install the `rarfile` Python package and make sure an `unrar` executable is on your `PATH`
2. **Memory issues**: Reduce `N_PROCESSES` or `N_FILES_PER_BATCH` in the matching notebook
3. **Permission errors**: Ensure write access to output directories
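
A quick pre-flight check for the first and third issues might look like the sketch below. The helper names are hypothetical, and the list of candidate tools is an assumption: which external tools `rarfile` can use depends on its version, so adjust the list to your setup.

```python
import os
import shutil

# Candidate extraction tools (an assumption; check your rarfile version's docs).
CANDIDATE_TOOLS = ("unrar", "unar", "7z", "bsdtar")


def find_rar_tool(candidates=CANDIDATE_TOOLS):
    """Return the path of the first candidate tool found on PATH, or None."""
    for tool in candidates:
        path = shutil.which(tool)
        if path is not None:
            return path
    return None


def is_writable(directory):
    """Return True if directory exists and the current user may write to it."""
    return os.path.isdir(directory) and os.access(directory, os.W_OK)


if __name__ == "__main__":
    print("RAR tool:", find_rar_tool() or "not found")
    for d in ("extracted_sdf_files", "matching_results"):
        print(d, "writable:", is_writable(d))
```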

### Performance Tuning

- Adjust `N_PROCESSES` based on available CPU cores
- Modify `N_FILES_PER_BATCH` for memory constraints
- Use `max_molecules` parameter for testing with smaller datasets
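
One way to pick `N_PROCESSES` for a given machine is to subtract a fixed reserve from the detected thread count. The helper below is a hypothetical sketch; the default reserve of 36 simply mirrors the 256-thread, 220-process setup described above and is not a recommendation for other machines.

```python
import os


def suggest_n_processes(total_threads=None, reserved=36):
    """Leave `reserved` threads free for the system; never return fewer than 1."""
    if total_threads is None:
        # os.cpu_count() may return None on some platforms.
        total_threads = os.cpu_count() or 1
    return max(1, total_threads - reserved)


N_PROCESSES = suggest_n_processes()
```

On a small machine, pass a smaller reserve, for example `suggest_n_processes(reserved=1)`.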

## License

This project is provided as-is for research and educational purposes.