Initial commit

2025-11-14 18:46:03 +08:00
commit b85faf48cd
70 changed files with 57687 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,134 @@
+# SDF File Processing and RDKit Substructure Matching
+
+This project provides a complete workflow for extracting SDF files from various archive formats and performing high-throughput substructure matching using RDKit with multiprocessing.
+
+## Setup
+
+### Prerequisites
+- Pixi package manager
+- Python 3.11+
+
+### Installation
+
+1. Initialize the pixi environment (already done in this repository; shown for reference):
+```bash
+pixi init
+pixi add rdkit joblib pandas tqdm
+```
+
+2. Activate the environment:
+```bash
+pixi shell
+```
+
+## Project Structure
+
+```
+search_macro/
+├── pixi.toml              # Pixi configuration
+├── README.md              # This file
+├── data/                  # Input directory with zip/rar/sdf files
+├── notebooks/             # Jupyter notebooks
+│   ├── 01_extract_sdf_files.ipynb      # Extract SDF from archives
+│   └── 02_rdkit_substructure_matching.ipynb  # Parallel substructure matching
+├── extracted_sdf_files/   # Output directory for extracted SDF files
+└── matching_results/      # Results directory for matching output
+```
+
+## Usage
+
+### Step 1: Extract SDF Files
+
+Open the first notebook to extract all SDF files from ZIP/RAR archives:
+
+```bash
+jupyter notebook notebooks/01_extract_sdf_files.ipynb
+```
+
+This notebook will:
+- Scan the `data/` directory for compressed files (ZIP, RAR, TAR.GZ)
+- Extract all SDF/MOL/SD files from archives
+- Copy existing SDF files to a unified directory
+- Generate a comprehensive file list with metadata
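A minimal sketch of the extraction step described above, using only the standard library. The helper name `extract_structure_files` and the flattened output layout are assumptions for illustration, not the notebook's actual code; RAR handling is left out because it needs the third-party `rarfile` package.

```python
import tarfile
import zipfile
from pathlib import Path

# Extensions named in the list above; matching is case-insensitive.
STRUCTURE_SUFFIXES = {".sdf", ".mol", ".sd"}


def extract_structure_files(archive: Path, out_dir: Path) -> list[Path]:
    """Copy SDF/MOL/SD members of a ZIP or TAR(.GZ) archive into out_dir/<archive stem>/.

    Hypothetical helper. Member paths are flattened to their base name, which
    sidesteps path traversal but drops the archive's internal folder layout.
    """
    target = out_dir / archive.name.split(".")[0]
    target.mkdir(parents=True, exist_ok=True)
    written = []

    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            for info in zf.infolist():
                name = Path(info.filename)
                if not info.is_dir() and name.suffix.lower() in STRUCTURE_SUFFIXES:
                    dest = target / name.name
                    dest.write_bytes(zf.read(info))
                    written.append(dest)
    elif tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tf:
            for member in tf.getmembers():
                name = Path(member.name)
                if member.isfile() and name.suffix.lower() in STRUCTURE_SUFFIXES:
                    dest = target / name.name
                    dest.write_bytes(tf.extractfile(member).read())
                    written.append(dest)
    return written
```

Calling `extract_structure_files(Path("data/batch1.zip"), Path("extracted_sdf_files"))` would place the matching members under `extracted_sdf_files/batch1/` (the archive name here is only an example).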
+
+### Step 2: Perform Substructure Matching
+
+Open the second notebook for parallel RDKit matching:
+
+```bash
+jupyter notebook notebooks/02_rdkit_substructure_matching.ipynb
+```
+
+This notebook will:
+- Load all extracted SDF files
+- Define and compile SMARTS patterns for common chemical substructures
+- Process molecules using 220 parallel processes
+- Generate comprehensive matching results and statistics
+
+## Features
+
+### Archive Support
+- ✅ ZIP files
+- ✅ RAR files
+- ✅ TAR.GZ files
+- ✅ GZIP compressed SDF files
+
+### SMARTS Patterns
+The matching includes 18 common chemical substructures:
+- Benzene rings, pyridine, heterocycles
+- Functional groups: carboxylic acids, alcohols, amines, amides, esters, ketones, aldehydes
+- Other features: nitro groups, halogens, sulfonamides, aromatic rings, alkenes, alkynes, ethers, phenols
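As a rough illustration of how such patterns can be compiled and applied with RDKit, the sketch below uses three SMARTS strings. They are illustrative choices, not necessarily the ones defined in the notebook, and the names `SMARTS`, `PATTERNS` and `match_patterns` are hypothetical.

```python
from rdkit import Chem

# Illustrative SMARTS only; the notebook's actual 18 patterns may differ.
SMARTS = {
    "benzene": "c1ccccc1",
    "carboxylic_acid": "[CX3](=O)[OX2H1]",
    "nitro": "[N+](=O)[O-]",
}

# Compile once so each molecule is matched against ready-made query objects.
PATTERNS = {name: Chem.MolFromSmarts(s) for name, s in SMARTS.items()}


def match_patterns(mol):
    """Return {pattern name: True/False} for one RDKit molecule."""
    return {name: mol.HasSubstructMatch(patt) for name, patt in PATTERNS.items()}


aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
print(match_patterns(aspirin))
```

Compiling the patterns up front keeps the per-molecule work down to the `HasSubstructMatch` calls, which matters when the same patterns are applied to many molecules.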
+
+### Performance
+- **220 parallel processes** (configurable via `N_PROCESSES`)
+- Batch processing to manage memory usage
+- Progress tracking with tqdm
+- Intermediate result saving
+- Performance statistics and analysis
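The batching and intermediate-saving behaviour can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: `process_batch` stands in for whatever the notebook runs per batch, the checkpoint file name is hypothetical, and the interval of 10 batches follows the Output Files section of this README.

```python
import csv
from pathlib import Path


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def write_rows(path, rows):
    """Write a list of dicts to CSV, using the first row's keys as the header."""
    with open(path, "w", newline="") as fh:
        if rows:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)


def run_in_batches(files, process_batch, out_dir, batch_size, save_every=10):
    """Process files batch by batch, checkpointing accumulated rows periodically."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for i, batch in enumerate(iter_batches(files, batch_size), start=1):
        rows.extend(process_batch(batch))
        if i % save_every == 0:
            # Hypothetical checkpoint name; the notebook may use another scheme.
            write_rows(out_dir / f"intermediate_batch_{i:04d}.csv", rows)
    write_rows(out_dir / "complete_matching_results.csv", rows)
    return rows
```

Checkpointing the accumulated rows means an interrupted run loses at most the batches processed since the last checkpoint.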
+
+## Output Files
+
+### Extraction Results (`extracted_sdf_files/`)
+- `sdf_file_list.csv` - Complete list of all SDF files with metadata
+- Organized subdirectories maintaining original archive structure
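One way to build a file list like `sdf_file_list.csv` is sketched below. The column names (`relative_path`, `source_dir`, `size_bytes`) are assumptions chosen for illustration; the notebook's actual metadata columns are not documented here.

```python
import csv
from pathlib import Path

STRUCTURE_SUFFIXES = {".sdf", ".mol", ".sd"}


def write_file_list(sdf_dir, csv_name="sdf_file_list.csv"):
    """Walk sdf_dir recursively and record one CSV row per structure file."""
    sdf_dir = Path(sdf_dir)
    files = sorted(
        p for p in sdf_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in STRUCTURE_SUFFIXES
    )
    out_path = sdf_dir / csv_name
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["relative_path", "source_dir", "size_bytes"])
        for p in files:
            rel = p.relative_to(sdf_dir)
            # The first path component is the per-archive subdirectory, if any.
            source_dir = rel.parts[0] if len(rel.parts) > 1 else ""
            writer.writerow([rel.as_posix(), source_dir, p.stat().st_size])
    return out_path
```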
+
+### Matching Results (`matching_results/`)
+- `complete_matching_results.csv` - All molecules with pattern matches
+- `matching_summary.csv` - Summary statistics for each pattern
+- `molecules_with_{pattern}.csv` - Individual files for each pattern
+- `pattern_frequencies.png` - Visualization of pattern frequencies
+- Intermediate results saved every 10 batches
+
+## Performance Notes
+
+- Designed for high-core-count systems (the target machine has 256 threads available)
+- Uses 220 processes to maximize throughput while leaving some threads free for the operating system and other work
+- Memory-efficient batch processing
+- Automatic cleanup and error handling
+
+## Dependencies
+
+- `rdkit` - Cheminformatics toolkit for structure parsing and substructure matching
+- `joblib` - Parallel processing backend
+- `pandas` - Data manipulation and analysis
+- `tqdm` - Progress bars
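+- `rarfile` - RAR archive extraction (third-party; not included in the `pixi add` command above, so install it separately)
+- `pathlib`, `zipfile`, `tarfile` - Path handling and ZIP/TAR extraction (Python standard library, no installation needed)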
+
+## Troubleshooting
+
+### Common Issues
+
+1. **RAR file extraction errors**: Install the `rarfile` package and ensure the `unrar` command-line tool is on your `PATH`
+2. **Memory issues**: Reduce `N_PROCESSES` or `N_FILES_PER_BATCH` in the matching notebook
+3. **Permission errors**: Ensure write access to output directories
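A quick pre-flight check for the first and third issues might look like the sketch below. The function names are hypothetical, and the default directory names are taken from the project structure in this README.

```python
import os
import shutil
from pathlib import Path


def nearest_existing(path):
    """Walk up from path until a directory or file that exists is found."""
    path = Path(path).resolve()
    while not path.exists():
        path = path.parent
    return path


def check_environment(output_dirs=("extracted_sdf_files", "matching_results")):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if shutil.which("unrar") is None:
        problems.append("unrar not found on PATH; RAR archives cannot be extracted")
    for d in output_dirs:
        # A directory that does not exist yet is fine if its parent is writable.
        probe = nearest_existing(d)
        if not os.access(probe, os.W_OK):
            problems.append(f"{probe} is not writable (needed for {d})")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```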
+
+### Performance Tuning
+
+- Adjust `N_PROCESSES` based on available CPU cores
+- Modify `N_FILES_PER_BATCH` for memory constraints
+- Use `max_molecules` parameter for testing with smaller datasets
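A possible way to derive `N_PROCESSES` from the machine instead of hard-coding it is sketched below. The 14% reserve is an assumption chosen so that 256 threads yield the 220 processes mentioned in this README; it is not taken from the notebook.

```python
import os


def pick_n_processes(requested=220, reserve_fraction=0.14, available=None):
    """Cap the requested worker count so a share of CPU threads stays free."""
    if available is None:
        available = os.cpu_count() or 1  # cpu_count() can return None
    usable = int(available * (1 - reserve_fraction))
    return max(1, min(requested, usable))


N_PROCESSES = pick_n_processes()
```

On a smaller machine the same call scales down automatically, for example to 6 workers when 8 threads are available.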
+
+## License
+
+This project is provided as-is for research and educational purposes.