CRISPR-Cas Analysis Module - Implementation Plan

Overview

This document outlines the planned implementation of CRISPR-Cas system analysis for the BtToxin Pipeline. This feature is reserved for future development and provides a roadmap for integrating CRISPR-Cas detection with toxin activity assessment.

Status: RESERVED

All infrastructure is prepared, but implementation has not yet started. This module will be activated once resources and requirements are finalized.


Architecture

Directory Structure (to be created)

crispr_cas/
├── scripts/
│   ├── detect_crispr.py      # CRISPR array detection
│   ├── fusion_analysis.py    # Spacer-toxin gene analysis
│   └── crispr_scoring.py     # Integration with Shoter scoring
├── docs/
│   ├── IMPLEMENTATION.md      # This file
│   └── API_REFERENCE.md       # Module API documentation (to be created)
└── tests/
    ├── test_detect_crispr.py
    └── test_fusion_analysis.py

Data Flow

Genome (.fna) → CRISPRCasFinder → CRISPR Results (JSON)
                                            ↓
                                    Fusion Analysis Module
                                            ↓
                            Toxin Genes (All_Toxins.txt)
                                            ↓
                                    Enhanced Shoter Scoring
                                            ↓
                            CRISPR-Augmented Activity Scores

Implementation Plan

Phase 1: CRISPR Detection

File: crispr_cas/scripts/detect_crispr.py

Tool: CRISPRCasFinder (https://crisprcas.i2bc.paris-saclay.fr/)

Tasks:

  1. Integrate CRISPRCasFinder CLI or implement Python wrapper
  2. Parse CRISPRCasFinder output (General Case or Cas spacer)
  3. Extract:
    • Cas type/subtype (I-E, I-F, II-A, V-A, etc.)
    • CRISPR array positions
    • Spacer sequences
    • Repeat sequences
    • Protospacer Adjacent Motif (PAM) sequences

Output Format (JSON):

{
  "strain_id": {
    "cas_type": "I-E",
    "arrays": [
      {
        "position": "contig_1:12345-12678",
        "repeat": "5'-GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC-3'",
        "spacers": [
          {"sequence": "ATGCGTCGAC", "position": 0},
          {"sequence": "CGTAGCTAGC", "position": 37}
        ]
      }
    ],
    "summary": {"num_arrays": 3, "num_spacers": 24}
  }
}
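A minimal sketch of how a per-strain record in this format could be assembled once arrays have been parsed. The class and function names (`Spacer`, `CrisprArray`, `build_strain_record`) are hypothetical and not part of the pipeline; the point is that the `summary` counts are derived from the arrays rather than stored independently, so they cannot drift out of sync.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Spacer:
    sequence: str
    position: int  # offset of the spacer, as in the example record


@dataclass
class CrisprArray:
    position: str  # "contig:start-end"
    repeat: str
    spacers: list[Spacer] = field(default_factory=list)


def build_strain_record(cas_type: str, arrays: list[CrisprArray]) -> dict:
    """Assemble one strain's entry; summary counts are computed, not hand-entered."""
    return {
        "cas_type": cas_type,
        "arrays": [asdict(a) for a in arrays],
        "summary": {
            "num_arrays": len(arrays),
            "num_spacers": sum(len(a.spacers) for a in arrays),
        },
    }


# Illustrative values taken from the example record above.
example = build_strain_record(
    "I-E",
    [
        CrisprArray(
            position="contig_1:12345-12678",
            repeat="GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC",
            spacers=[Spacer("ATGCGTCGAC", 0), Spacer("CGTAGCTAGC", 37)],
        )
    ],
)
print(json.dumps({"strain_id": example}, indent=2))
```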

Phase 2: Spacer-Toxin Gene Association

File: crispr_cas/scripts/fusion_analysis.py

Tasks:

  1. Map CRISPR arrays to genomic positions
  2. Identify toxin genes near CRISPR arrays (within a 10 kb window)
  3. Analyze potential spacer-target matches:
    • Extract toxin gene sequences
    • Perform BLAST of spacers against toxin genes
    • Identify potential immunity or targeting relationships
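The proximity step (task 2) could be sketched as follows. This assumes toxin gene coordinates are available as `contig:start-end` strings in the same style as the array positions; the toxin coordinates in the example are invented for illustration, and all function names are hypothetical.

```python
import re

WINDOW_BP = 10_000  # the 10 kb window from task 2

_REGION = re.compile(r"^(?P<contig>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region: str) -> tuple[str, int, int]:
    """Split 'contig:start-end' into (contig, start, end) with start <= end."""
    m = _REGION.match(region)
    if m is None:
        raise ValueError(f"unrecognised region string: {region!r}")
    start, end = int(m["start"]), int(m["end"])
    return m["contig"], min(start, end), max(start, end)


def gap_bp(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Distance in bp between two intervals; 0 if they overlap."""
    return max(0, b_start - a_end, a_start - b_end)


def nearby_toxins(array_region: str, toxins: dict[str, str],
                  window: int = WINDOW_BP) -> list[tuple[str, int]]:
    """Return (toxin, distance) pairs on the same contig within the window."""
    contig, a_start, a_end = parse_region(array_region)
    hits = []
    for name, region in toxins.items():
        t_contig, t_start, t_end = parse_region(region)
        if t_contig != contig:
            continue  # distance across contigs is undefined
        distance = gap_bp(a_start, a_end, t_start, t_end)
        if distance <= window:
            hits.append((name, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical toxin coordinates, for illustration only.
toxins = {
    "Cry1Ac1": "contig_1:15178-18700",
    "Cry2Aa3": "contig_1:40000-42000",
    "Vip3Aa1": "contig_2:100-2500",
}
print(nearby_toxins("contig_1:12345-12678", toxins))  # [('Cry1Ac1', 2500)]
```

Treating overlapping intervals as distance 0 and skipping toxins on other contigs are design choices of this sketch; a draft assembly with many short contigs may need a different rule.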

Output Format (JSON):

{
  "strain_id": {
    "crispr_toxin_associations": [
      {
        "crispr_array": "contig_1:12345-12678",
        "nearby_toxins": ["Cry1Ac1", "Cry2Aa3"],
        "spacer_targets": [
          {"spacer": "ATGCGTCGAC", "target": "Cry1Ac1", "identity": 0.95}
        ],
        "distance_to_toxin": 2500
      }
    ]
  }
}
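To make the `identity` field concrete, here is a deliberately naive stand-in for the BLAST step: an ungapped sliding-window comparison over both strands. It is not BLAST and would be far too slow for whole genomes; it only illustrates what a value such as 0.95 could mean. The function names are hypothetical, and the sketch assumes sequences contain only A, C, G and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_ungapped_identity(spacer: str, target: str) -> float:
    """Highest fraction of matching bases over all ungapped placements
    of the spacer on either strand of the target."""
    spacer, target = spacer.upper(), target.upper()
    n = len(spacer)
    if n == 0 or len(target) < n:
        return 0.0
    best = 0.0
    for strand in (target, revcomp(target)):
        for i in range(len(strand) - n + 1):
            window = strand[i:i + n]
            matches = sum(a == b for a, b in zip(spacer, window))
            best = max(best, matches / n)
    return best


# Toy target sequences, invented for illustration.
print(best_ungapped_identity("ATGCGTCGAC", "TTTTATGCGTCGACGGGG"))  # exact forward match
print(best_ungapped_identity("ATGCGTCGAC", "CCCGTCGACGCATCCC"))    # match on the reverse strand
```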

Phase 3: Integration with Shoter Scoring

File: Modify scripts/bttoxin_shoter.py

Reserved Parameters (add to argument parser):

# CRISPR-Cas Integration (Reserved for Future Implementation)
ap.add_argument("--crispr_weight", type=float, default=0.0,
                help="[FUTURE] Weight for CRISPR-Cas contribution to activity scores (0-1)")
ap.add_argument("--crispr_results", type=Path, default=None,
                help="[FUTURE] Path to CRISPR-Cas detection results JSON")
ap.add_argument("--crispr_fusion", action="store_true", default=False,
                help="[FUTURE] Enable spacer-toxin fusion analysis")
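A self-contained sketch of the same three options, with one addition that is an assumption rather than part of the plan: a small `unit_interval` type function so that values outside the documented 0-1 range are rejected at parse time instead of silently skewing scores.

```python
import argparse
from pathlib import Path


def unit_interval(text: str) -> float:
    """argparse type: a float restricted to the range [0, 1]."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(f"must be between 0 and 1, got {value}")
    return value


def build_parser() -> argparse.ArgumentParser:
    ap = argparse.ArgumentParser(description="Reserved CRISPR-Cas options (sketch)")
    ap.add_argument("--crispr_weight", type=unit_interval, default=0.0,
                    help="[FUTURE] Weight for CRISPR-Cas contribution to activity scores (0-1)")
    ap.add_argument("--crispr_results", type=Path, default=None,
                    help="[FUTURE] Path to CRISPR-Cas detection results JSON")
    ap.add_argument("--crispr_fusion", action="store_true", default=False,
                    help="[FUTURE] Enable spacer-toxin fusion analysis")
    return ap


args = build_parser().parse_args(
    ["--crispr_results", "crispr.json", "--crispr_weight", "0.2"]
)
print(args.crispr_weight, args.crispr_results, args.crispr_fusion)
```

With the defaults (`--crispr_weight 0.0`, no results file), existing invocations behave exactly as before, which is what keeps the reserved flags safe to add ahead of the implementation.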

Scoring Integration (in score_strain() function):

# Reserved: CRISPR-Cas scoring integration
# When CRISPR is enabled, modify strain scores:
#
# if args.crispr_weight > 0 and crispr_data:
#     crispr_boost = calculate_crispr_activity_boost(
#         strain=strain,
#         crispr_data=crispr_data.get(strain, {}),
#         toxin_hits=toxin_hits
#     )
#     # Apply CRISPR boost to target order scores
#     for order, score in sscore.scores.items():
#         sscore.scores[order] = score * (1 - args.crispr_weight) + \
#                                crispr_boost.get(order, 0) * args.crispr_weight
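The blending formula in the reserved block can be isolated as a pure function, which makes it testable before `calculate_crispr_activity_boost` exists. This is a sketch: `blend_scores` is a hypothetical name, and plain dictionaries stand in for `sscore.scores` and the boost map. The target order names and values are made up.

```python
def blend_scores(scores: dict[str, float],
                 crispr_boost: dict[str, float],
                 weight: float) -> dict[str, float]:
    """Linear interpolation between the original score and the CRISPR boost.

    Mirrors: score * (1 - weight) + boost * weight, with a missing boost
    treated as 0, as in the reserved block.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError(f"weight must be in [0, 1], got {weight}")
    return {
        order: score * (1.0 - weight) + crispr_boost.get(order, 0.0) * weight
        for order, score in scores.items()
    }


# Hypothetical per-order scores and boost, for illustration only.
scores = {"Lepidoptera": 0.8, "Diptera": 0.4}
boost = {"Lepidoptera": 1.0}
print(blend_scores(scores, boost, weight=0.2))
```

One consequence worth noting: because a missing boost counts as 0, every order without CRISPR evidence is scaled down by `(1 - weight)` rather than left unchanged. If that is not the intended behavior, the default for missing orders could be the original score instead.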

Phase 4: Enhanced Visualization

File: scripts/plot_shotter.py

Tasks:

  1. Add CRISPR-Cas panel to existing heatmaps
  2. Visualize:
    • CRISPR array positions on genome
    • Spacer-toxin targeting relationships
    • CRISPR-enhanced activity scores

Output Format:

  • Extended PDF report with CRISPR section
  • Additional JSON with CRISPR metadata
  • Optional: Genomic track visualization (SVG/PNG)

Pixi Integration

The pixi environment is already configured (commented out) in pixi.toml:

# =========================
# CRISPR-Cas environment: reserved for future CRISPR-Cas analysis
# =========================
# [feature.crispr.dependencies]
# python = ">=3.9"
# biopython = "*"
# pandas = ">=2.0.0"
# =========================
# [feature.crispr.tasks]
# crispr-detect = "python crispr_cas/scripts/detect_crispr.py"
# crispr-fusion = "python crispr_cas/scripts/fusion_analysis.py"

To activate the CRISPR module:

  1. Uncomment the [feature.crispr.dependencies] section
  2. Uncomment the [feature.crispr.tasks] section
  3. Add crispr to the environments list
  4. Run pixi install

Usage Examples (When Implemented)

Basic CRISPR Detection

pixi run -e crispr crispr-detect \
  --input genome.fna \
  --output crispr_results.json

Full Pipeline with CRISPR Integration

# Run CRISPR detection first
pixi run -e crispr crispr-detect --input genome.fna --output crispr.json

# Run pipeline with CRISPR-enhanced scoring
pixi run pipeline \
  --input genome.fna \
  --toxicity_csv Data/toxicity-data.csv \
  --crispr_results crispr.json \
  --crispr_weight 0.2 \
  --crispr_fusion

API Integration (Future)

# Backend API endpoint (to be implemented)
POST /api/v1/tasks
{
  "files": ["genome.fna"],
  "crispr_enabled": true,
  "crispr_weight": 0.2,
  "crispr_fusion": true
}

Scientific Background

Why CRISPR-Cas in Bt Analysis?

  1. Phage Immunity: CRISPR-Cas systems in Bt may provide immunity against phages, affecting strain fitness
  2. Plasmid Tracking: CRISPR spacers can indicate plasmid content and horizontal gene transfer history
  3. Strain Differentiation: CRISPR array patterns can distinguish closely related strains
  4. Toxin Gene Proximity: CRISPR arrays near toxin genes may indicate genomic defense mechanisms

Expected Benefits

  • Enhanced strain characterization beyond toxin profiling
  • Better understanding of strain evolution and adaptation
  • Potential correlation with biocontrol efficacy
  • Additional markers for strain selection

Testing Strategy

Unit Tests

  • CRISPR detection mock data parsing
  • Spacer-toxin distance calculation
  • CRISPR score calculation logic

Integration Tests

  • End-to-end pipeline with small genome
  • Comparison with manual CRISPRCasFinder results
  • Scoring consistency with/without CRISPR
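The "scoring consistency with/without CRISPR" check could be pinned down as pytest-style tests along these lines. `blend` is a hypothetical stand-in for the Phase 3 formula, not an existing function in the pipeline, and the order names and scores are made up.

```python
def blend(scores: dict[str, float], boost: dict[str, float], weight: float) -> dict[str, float]:
    """Stand-in for the Phase 3 formula: score * (1 - w) + boost * w."""
    return {
        order: score * (1.0 - weight) + boost.get(order, 0.0) * weight
        for order, score in scores.items()
    }


BASELINE = {"Lepidoptera": 0.8, "Diptera": 0.4}
BOOST = {"Lepidoptera": 1.0, "Diptera": 0.5}


def test_zero_weight_leaves_scores_unchanged():
    # The default --crispr_weight is 0.0, so existing results must not move.
    assert blend(BASELINE, BOOST, 0.0) == BASELINE


def test_full_weight_returns_boost_only():
    assert blend(BASELINE, BOOST, 1.0) == BOOST


def test_blended_score_lies_between_inputs():
    blended = blend(BASELINE, BOOST, 0.2)
    for order, score in blended.items():
        low, high = sorted((BASELINE[order], BOOST[order]))
        assert low <= score <= high
```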

Validation

  • Compare CRISPR-enhanced scores with experimental bioassay data
  • Validate CRISPR-toxin associations using known literature

Dependencies

External Tools

  • CRISPRCasFinder (Phase 1 detection)
  • BLAST (Phase 2 spacer-toxin matching)

Python Packages

  • biopython >= 1.79
  • pandas >= 2.0.0
  • numpy >= 1.21.0

Timeline Estimate

  • Phase 1: 2-3 weeks (CRISPR detection wrapper)
  • Phase 2: 2-3 weeks (Fusion analysis)
  • Phase 3: 1-2 weeks (Shoter integration)
  • Phase 4: 2-3 weeks (Visualization)

Total: ~2-3 months for full implementation


References

  1. Couvin, D. et al. (2018) CRISPRCasFinder, an update of CRISPRFinder, includes a portable version, a web server and many tools to study CRISPRs. Bioinformatics, 34(20), 3579-3581.
  2. Chakraborty, S. et al. (2020) CRISPR-Cas systems in Bacillus thuringiensis: diversity, evolution and potential applications. Frontiers in Microbiology, 11, 591.
  3. BtToxin Pipeline Documentation: docs/shotter_workflow.md

Contact

For questions or implementation guidance, refer to the main project documentation or create an issue in the project repository.

Last Updated: 2025-01-13

Status: Reserved - Implementation Pending