
SDF File Processing and RDKit Substructure Matching

This project provides a complete workflow for extracting SDF files from various archive formats and performing high-throughput substructure matching using RDKit with multiprocessing.

Setup

Prerequisites

  • Pixi package manager
  • Python 3.11+

Installation

  1. Initialize the pixi environment (already done):
pixi init
pixi add rdkit joblib pandas tqdm
  2. Activate the environment:
pixi shell

Project Structure

search_macro/
├── pixi.toml              # Pixi configuration
├── README.md              # This file
├── data/                  # Input directory with zip/rar/sdf files
├── notebooks/             # Jupyter notebooks
│   ├── 01_extract_sdf_files.ipynb      # Extract SDF from archives
│   └── 02_rdkit_substructure_matching.ipynb  # Parallel substructure matching
├── extracted_sdf_files/   # Output directory for extracted SDF files
└── matching_results/      # Results directory for matching output

Usage

Step 1: Extract SDF Files

Open the first notebook to extract all SDF files from ZIP/RAR archives:

jupyter notebook notebooks/01_extract_sdf_files.ipynb

This notebook will:

  • Scan the data/ directory for compressed files (ZIP, RAR, TAR.GZ)
  • Extract all SDF/MOL/SD files from archives
  • Copy existing SDF files to a unified directory
  • Generate a comprehensive file list with metadata
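The extraction step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the function name extract_structure_files and the per-archive output folder are assumptions, and RAR and plain GZIP inputs are left out because they need rarfile or gzip handling of their own.

```python
import tarfile
import zipfile
from pathlib import Path

# Extensions treated as structure files (SDF/SD/MOL, as listed above).
SDF_SUFFIXES = {".sdf", ".sd", ".mol"}


def extract_structure_files(archive: Path, out_dir: Path) -> list[Path]:
    """Extract SDF/SD/MOL members of a ZIP or TAR archive.

    Hypothetical helper: files land in out_dir/<archive name up to the
    first dot>/, keeping the directory layout stored in the archive.
    """
    target = out_dir / archive.name.split(".")[0]
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            for name in zf.namelist():
                if Path(name).suffix.lower() in SDF_SUFFIXES:
                    # extract() returns the path of the file it wrote
                    extracted.append(Path(zf.extract(name, target)))
    elif tarfile.is_tarfile(archive):
        # tarfile.open detects gzip compression automatically
        with tarfile.open(archive) as tf:
            for member in tf.getmembers():
                if member.isfile() and Path(member.name).suffix.lower() in SDF_SUFFIXES:
                    tf.extract(member, target)
                    extracted.append(target / member.name)
    return extracted
```

Calling extract_structure_files(Path("data/library.zip"), Path("extracted_sdf_files")) would return the paths of the extracted structure files; the archive name here is only an example. Archives from untrusted sources should be checked for unsafe member paths before extraction.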

Step 2: Perform Substructure Matching

Open the second notebook for parallel RDKit matching:

jupyter notebook notebooks/02_rdkit_substructure_matching.ipynb

This notebook will:

  • Load all extracted SDF files
  • Define and compile SMARTS patterns for common chemical substructures
  • Process molecules using 220 parallel processes
  • Generate comprehensive matching results and statistics

Features

Archive Support

  • ZIP files
  • RAR files
  • TAR.GZ files
  • GZIP compressed SDF files

SMARTS Patterns

The matching includes 18 common chemical substructures:

  • Benzene rings, pyridine, heterocycles
  • Functional groups: carboxylic acids, alcohols, amines, amides, esters, ketones, aldehydes
  • Other features: nitro groups, halogens, sulfonamides, aromatic rings, alkenes, alkynes, ethers, phenols
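As a rough illustration of how such patterns are compiled and applied with RDKit, the sketch below covers four of the listed substructures. The SMARTS strings and the names PATTERNS, match_patterns and match_sdf are assumptions chosen for the example; the notebook's own definitions may differ.

```python
from rdkit import Chem

# Illustrative SMARTS for a few of the listed substructures (assumed,
# not copied from the notebook).
SMARTS = {
    "benzene": "c1ccccc1",
    "carboxylic_acid": "[CX3](=O)[OX2H1]",
    "nitro": "[N+](=O)[O-]",
    "alkyne": "C#C",
}

# Compile each pattern once so it can be reused for every molecule.
PATTERNS = {name: Chem.MolFromSmarts(s) for name, s in SMARTS.items()}


def match_patterns(mol):
    """Return {pattern name: True/False} for one RDKit molecule."""
    return {name: mol.HasSubstructMatch(p) for name, p in PATTERNS.items()}


def match_sdf(path):
    """Yield (molecule title, matches) for every parsable record in an SDF file."""
    for mol in Chem.SDMolSupplier(str(path)):
        if mol is None:  # the supplier yields None for records it cannot parse
            continue
        title = mol.GetProp("_Name") if mol.HasProp("_Name") else ""
        yield title, match_patterns(mol)


# Benzoic acid contains a benzene ring and a carboxylic acid group.
print(match_patterns(Chem.MolFromSmiles("OC(=O)c1ccccc1")))
```

Compiling the patterns once at module level means each worker process parses the SMARTS a single time rather than once per molecule.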

Performance

  • 220 parallel processes by default (set via N_PROCESSES)
  • Batch processing to manage memory usage
  • Progress tracking with tqdm
  • Intermediate result saving
  • Performance statistics and analysis

Output Files

Extraction Results (extracted_sdf_files/)

  • sdf_file_list.csv - Complete list of all SDF files with metadata
  • Organized subdirectories maintaining original archive structure

Matching Results (matching_results/)

  • complete_matching_results.csv - All molecules with pattern matches
  • matching_summary.csv - Summary statistics for each pattern
  • molecules_with_{pattern}.csv - Individual files for each pattern
  • pattern_frequencies.png - Visualization of pattern frequencies
  • Intermediate results saved every 10 batches

Performance Notes

  • Designed for high-performance systems (256 threads available)
  • Uses 220 processes to maximize throughput while leaving headroom for the rest of the system
  • Memory-efficient batch processing
  • Automatic cleanup and error handling

Dependencies

  • rdkit - Chemical informatics and structure processing
  • joblib - Parallel processing backend
  • pandas - Data manipulation and analysis
  • tqdm - Progress bars
  • pathlib - Modern path handling (Python standard library)
  • zipfile, tarfile (standard library) and rarfile (third-party, installed separately) - Archive extraction

Troubleshooting

Common Issues

  1. RAR file extraction errors: Install rarfile and ensure the unrar executable is available on your PATH
  2. Memory issues: Reduce N_PROCESSES or N_FILES_PER_BATCH in the matching notebook
  3. Permission errors: Ensure write access to output directories

Performance Tuning

  • Adjust N_PROCESSES based on available CPU cores
  • Modify N_FILES_PER_BATCH for memory constraints
  • Use the max_molecules parameter to test with smaller datasets
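The interaction between N_PROCESSES and N_FILES_PER_BATCH can be sketched as follows. This example uses the standard library's concurrent.futures rather than joblib to stay dependency-free, so it is not the notebook's implementation; the batch size of 50, the reserve of two cores, and the helper names are all assumptions.

```python
import os
from concurrent.futures import ProcessPoolExecutor

N_PROCESSES = 220       # value quoted in this README; lower it on smaller machines
N_FILES_PER_BATCH = 50  # hypothetical default, not taken from the notebook


def effective_processes(requested, reserve=2):
    """Cap the requested worker count at the available cores minus a small reserve."""
    available = os.cpu_count() or 1
    return max(1, min(requested, available - reserve))


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def run_in_batches(files, process_file,
                   n_processes=N_PROCESSES, batch_size=N_FILES_PER_BATCH):
    """Process files one batch at a time to bound how much work is queued.

    process_file must be a module-level function so it can be pickled
    and sent to the worker processes.
    """
    results = []
    with ProcessPoolExecutor(max_workers=effective_processes(n_processes)) as pool:
        for batch in make_batches(files, batch_size):
            results.extend(pool.map(process_file, batch))
    return results
```

Smaller batches lower peak memory at the cost of more scheduling overhead, and a natural place to write intermediate results is after each batch (or every few batches) inside the loop.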

License

This project is provided as-is for research and educational purposes.

Description
Search the 陶术 (TargetMol) molecular database for all macrolide (macrocyclic lactone) structures with 12- to 20-membered rings.