# CRISPR-Cas Analysis Module - Implementation Plan

## Overview

This document outlines the planned implementation of CRISPR-Cas system analysis for the BtToxin Pipeline. This feature is **reserved for future development**; the plan below is a roadmap for integrating CRISPR-Cas detection with toxin activity assessment.

## Status: RESERVED

The supporting infrastructure is in place, but implementation has **not yet started**. The module will be activated once resources and requirements are finalized.

---

## Architecture

### Directory Structure (to be created)

```
crispr_cas/
├── scripts/
│   ├── detect_crispr.py          # CRISPR array detection
│   ├── fusion_analysis.py        # Spacer-toxin gene analysis
│   └── crispr_scoring.py         # Integration with shoter scoring
├── docs/
│   ├── IMPLEMENTATION.md         # This file
│   └── API_REFERENCE.md          # Module API documentation (to be created)
└── tests/
    ├── test_detect_crispr.py
    └── test_fusion_analysis.py
```

### Data Flow

```
Genome (.fna) → CRISPRCasFinder → CRISPR Results (JSON)
                                        ↓
                              Fusion Analysis Module
                                        ↓
                              Toxin Genes (All_Toxins.txt)
                                        ↓
                              Enhanced Shoter Scoring
                                        ↓
                              CRISPR-Augmented Activity Scores
```

---

## Implementation Plan

### Phase 1: CRISPR Detection

**File:** `crispr_cas/scripts/detect_crispr.py`

**Tool:** CRISPRCasFinder (https://crisprcas.i2bc.paris-saclay.fr/)

**Tasks:**
1. Integrate the CRISPRCasFinder CLI or implement a Python wrapper
2. Parse CRISPRCasFinder output (General Case or Cas spacer)
3. Extract:
   - Cas type/subtype (I-E, I-F, II-A, V-A, etc.)
   - CRISPR array positions
   - Spacer sequences
   - Repeat sequences
   - Protospacer Adjacent Motif (PAM) sequences

**Output Format (JSON):**
```json
{
  "strain_id": {
    "cas_type": "I-E",
    "arrays": [
      {
        "position": "contig_1:12345-12678",
        "repeat": "5'-GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC-3'",
        "spacers": [
          {"sequence": "ATGCGTCGAC", "position": 0},
          {"sequence": "CGTAGCTAGC", "position": 37}
        ]
      }
    ],
    "summary": {"num_arrays": 3, "num_spacers": 24}
  }
}
```

---

### Phase 2: Spacer-Toxin Gene Association

**File:** `crispr_cas/scripts/fusion_analysis.py`

**Tasks:**
1. Map CRISPR arrays to genomic positions
2. Identify toxin genes near CRISPR arrays (within a 10 kb window)
3. Analyze potential spacer-target matches:
   - Extract toxin gene sequences
   - BLAST spacers against toxin genes
   - Identify potential immunity or targeting relationships

**Output Format (JSON):**
```json
{
  "strain_id": {
    "crispr_toxin_associations": [
      {
        "crispr_array": "contig_1:12345-12678",
        "nearby_toxins": ["Cry1Ac1", "Cry2Aa3"],
        "spacer_targets": [
          {"spacer": "ATGCGTCGAC", "target": "Cry1Ac1", "identity": 0.95}
        ],
        "distance_to_toxin": 2500
      }
    ]
  }
}
```

---

### Phase 3: Integration with Shoter Scoring

**File:** Modify `scripts/bttoxin_shoter.py`

**Reserved Parameters (add to argument parser):**
```python
# CRISPR-Cas Integration (Reserved for Future Implementation)
ap.add_argument("--crispr_weight", type=float, default=0.0,
                help="[FUTURE] Weight for CRISPR-Cas contribution to activity scores (0-1)")
ap.add_argument("--crispr_results", type=Path, default=None,
                help="[FUTURE] Path to CRISPR-Cas detection results JSON")
ap.add_argument("--crispr_fusion", action="store_true", default=False,
                help="[FUTURE] Enable spacer-toxin fusion analysis")
```

**Scoring Integration (in `score_strain()` function):**
```python
# Reserved: CRISPR-Cas scoring integration
# When CRISPR is enabled, modify strain scores:
#
# if args.crispr_weight > 0 and crispr_data:
#     crispr_boost = calculate_crispr_activity_boost(
#         strain=strain,
#         crispr_data=crispr_data.get(strain, {}),
#         toxin_hits=toxin_hits
#     )
#     # Apply CRISPR boost to target order scores
#     for order, score in sscore.scores.items():
#         sscore.scores[order] = score * (1 - args.crispr_weight) + \
#                               crispr_boost.get(order, 0) * args.crispr_weight
```

---

### Phase 4: Enhanced Visualization

**File:** `scripts/plot_shotter.py`

**Tasks:**
1. Add a CRISPR-Cas panel to the existing heatmaps
2. Visualize:
   - CRISPR array positions on the genome
   - Spacer-toxin targeting relationships
   - CRISPR-enhanced activity scores

**Output Format:**
- Extended PDF report with a CRISPR section
- Additional JSON with CRISPR metadata
- Optional: Genomic track visualization (SVG/PNG)

---

## Pixi Integration

The pixi environment is already configured (commented out) in `pixi.toml`:

```toml
# =========================
# CRISPR-Cas environment: reserved for future CRISPR-Cas analysis
# =========================
# [feature.crispr.dependencies]
# python = ">=3.9"
# biopython = "*"
# pandas = ">=2.0.0"
# =========================
# [feature.crispr.tasks]
# crispr-detect = "python crispr_cas/scripts/detect_crispr.py"
# crispr-fusion = "python crispr_cas/scripts/fusion_analysis.py"
```

**To activate the CRISPR module:**
1. Uncomment the `[feature.crispr.dependencies]` section
2. Uncomment the `[feature.crispr.tasks]` section
3. Add `crispr` to the environments list
4. Run `pixi install`

---

## Usage Examples (When Implemented)

### Basic CRISPR Detection

```bash
pixi run -e crispr crispr-detect \
    --input genome.fna \
    --output crispr_results.json
```

### Full Pipeline with CRISPR Integration

```bash
# Run CRISPR detection first
pixi run -e crispr crispr-detect --input genome.fna --output crispr.json

# Run pipeline with CRISPR-enhanced scoring
pixi run pipeline \
    --input genome.fna \
    --toxicity_csv Data/toxicity-data.csv \
    --crispr_results crispr.json \
    --crispr_weight 0.2 \
    --crispr_fusion
```

### API Integration (Future)

```http
# Backend API endpoint (to be implemented)
POST /api/v1/tasks

{
  "files": ["genome.fna"],
  "crispr_enabled": true,
  "crispr_weight": 0.2,
  "crispr_fusion": true
}
```

---

## Scientific Background

### Why CRISPR-Cas in Bt Analysis?

1. **Self-Immunity**: CRISPR-Cas systems in Bt may provide immunity against phages, affecting strain fitness
2. **Plasmid Tracking**: CRISPR spacers can indicate plasmid content and horizontal gene transfer history
3. **Strain Differentiation**: CRISPR array patterns can distinguish closely related strains
4. **Toxin Gene Proximity**: CRISPR arrays near toxin genes may indicate genomic defense mechanisms

### Expected Benefits

- Enhanced strain characterization beyond toxin profiling
- Better understanding of strain evolution and adaptation
- Potential correlation with biocontrol efficacy
- Additional markers for strain selection

---

## Testing Strategy

### Unit Tests
- CRISPR detection mock data parsing
- Spacer-toxin distance calculation
- CRISPR score calculation logic

### Integration Tests
- End-to-end pipeline with a small genome
- Comparison with manual CRISPRCasFinder results
- Scoring consistency with/without CRISPR

### Validation
- Compare CRISPR-enhanced scores with experimental bioassay data
- Validate CRISPR-toxin associations against published literature

---

## Dependencies

### External Tools
- **CRISPRCasFinder** (v4.2+): https://crisprcas.i2bc.paris-saclay.fr/
- **BLAST+** (for spacer-toxin alignment)

### Python Packages
- biopython >= 1.79
- pandas >= 2.0.0
- numpy >= 1.21.0

---

## Timeline Estimate

- **Phase 1**: 2-3 weeks (CRISPR detection wrapper)
- **Phase 2**: 2-3 weeks (Fusion analysis)
- **Phase 3**: 1-2 weeks (Shoter integration)
- **Phase 4**: 2-3 weeks (Visualization)

**Total**: ~2-3 months for full implementation

---

## References

1. Couvin, D. et al. (2018) CRISPRCasFinder, an update of CRISPRFinder, includes a portable version, a web server and many tools to study CRISPRs. *Bioinformatics*, 34(20), 3579-3581.
2. Chakraborty, S. et al. (2020) CRISPR-Cas systems in Bacillus thuringiensis: diversity, evolution and potential applications. *Frontiers in Microbiology*, 11, 591.
3. BtToxin Pipeline Documentation: `docs/shotter_workflow.md`

---

## Contact

For questions or implementation guidance, refer to the main project documentation or create an issue in the project repository.

**Last Updated:** 2025-01-13

**Status:** Reserved - Implementation Pending
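
---

The two calculations named under Unit Tests (spacer-toxin distance and CRISPR score calculation) could be sketched roughly as follows. This is a minimal illustration only, not part of the pipeline: the helper names `parse_region`, `distance_between`, `is_nearby`, and `blend_scores` are hypothetical, regions are assumed to use the `contig:start-end` form shown in the Phase 1 output, and the blend simply restates the weighted formula from the reserved Phase 3 snippet.

```python
from typing import Dict, Optional, Tuple

WINDOW_BP = 10_000  # Phase 2 proximity window (10 kb)


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split 'contig_1:12345-12678' into (contig, start, end)."""
    contig, span = region.rsplit(":", 1)
    start, end = (int(x) for x in span.split("-"))
    return contig, min(start, end), max(start, end)


def distance_between(region_a: str, region_b: str) -> Optional[int]:
    """Gap in bp between two regions.

    Returns 0 when the regions overlap and None when they lie on
    different contigs (no meaningful distance).
    """
    contig_a, start_a, end_a = parse_region(region_a)
    contig_b, start_b, end_b = parse_region(region_b)
    if contig_a != contig_b:
        return None
    return max(0, max(start_a, start_b) - min(end_a, end_b))


def is_nearby(array_region: str, toxin_region: str,
              window: int = WINDOW_BP) -> bool:
    """True when a toxin gene lies within `window` bp of a CRISPR array."""
    gap = distance_between(array_region, toxin_region)
    return gap is not None and gap <= window


def blend_scores(scores: Dict[str, float],
                 crispr_boost: Dict[str, float],
                 weight: float) -> Dict[str, float]:
    """Weighted blend: score * (1 - w) + boost * w, per target order.

    Orders missing from `crispr_boost` get a boost of 0, mirroring
    crispr_boost.get(order, 0) in the reserved snippet.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return {
        order: score * (1 - weight) + crispr_boost.get(order, 0.0) * weight
        for order, score in scores.items()
    }
```

With the array at `contig_1:12345-12678` and a made-up toxin gene at `contig_1:15178-16000`, `distance_between` gives 2500, which matches the `distance_to_toxin` value in the Phase 2 example. A weight of 0 leaves the scores unchanged, consistent with the `default=0.0` of `--crispr_weight`.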