# CRISPR-Cas Analysis Module - Implementation Plan
## Overview
This document outlines the planned implementation of CRISPR-Cas system analysis for the BtToxin Pipeline. This feature is **reserved for future development** and provides a roadmap for integrating CRISPR-Cas detection with toxin activity assessment.
## Status: RESERVED
The supporting scaffolding (for example, the commented-out pixi configuration) is prepared, but implementation has **not yet started**. This module will be activated once resources and requirements are finalized.
---
## Architecture
### Directory Structure (to be created)
```
crispr_cas/
├── scripts/
│   ├── detect_crispr.py      # CRISPR array detection
│   ├── fusion_analysis.py    # Spacer-toxin gene analysis
│   └── crispr_scoring.py     # Integration with shoter scoring
├── docs/
│   ├── IMPLEMENTATION.md     # This file
│   └── API_REFERENCE.md      # Module API documentation (to be created)
└── tests/
    ├── test_detect_crispr.py
    └── test_fusion_analysis.py
```
### Data Flow
```
Genome (.fna) → CRISPRCasFinder → CRISPR Results (JSON)
                                          ↓
                                Fusion Analysis Module
                                          ↑
                                Toxin Genes (All_Toxins.txt)
                                          ↓
                                Enhanced Shoter Scoring
                                          ↓
                                CRISPR-Augmented Activity Scores
```
---
## Implementation Plan
### Phase 1: CRISPR Detection
**File:** `crispr_cas/scripts/detect_crispr.py`
**Tool:** CRISPRCasFinder (https://crisprcas.i2bc.paris-saclay.fr/)
**Tasks:**
1. Integrate CRISPRCasFinder CLI or implement Python wrapper
2. Parse CRISPRCasFinder output (General Case or Cas spacer)
3. Extract:
- Cas type/subtype (I-E, I-F, II-A, V-A, etc.)
- CRISPR array positions
- Spacer sequences
- Repeat sequences
- Protospacer Adjacent Motif (PAM) sequences
**Output Format (JSON):**
```json
{
  "strain_id": {
    "cas_type": "I-E",
    "arrays": [
      {
        "position": "contig_1:12345-12678",
        "repeat": "5'-GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC-3'",
        "spacers": [
          {"sequence": "ATGCGTCGAC", "position": 0},
          {"sequence": "CGTAGCTAGC", "position": 37}
        ]
      }
    ],
    "summary": {"num_arrays": 3, "num_spacers": 24}
  }
}
```
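
As a minimal sketch of how a detection wrapper might assemble this layout, the snippet below builds one strain record and derives the `summary` counts from the arrays themselves rather than storing them separately. The helper name `build_strain_record` is hypothetical and not part of the pipeline; the sequences are the placeholder values from the example above.

```python
import json


def build_strain_record(cas_type, arrays):
    """Assemble one strain's record in the JSON layout shown above.

    `arrays` is a list of dicts with "position", "repeat" and "spacers" keys.
    The summary is computed from the arrays so the counts cannot drift.
    """
    num_spacers = sum(len(array["spacers"]) for array in arrays)
    return {
        "cas_type": cas_type,
        "arrays": arrays,
        "summary": {"num_arrays": len(arrays), "num_spacers": num_spacers},
    }


# Placeholder data mirroring the example output (not real detection results).
arrays = [
    {
        "position": "contig_1:12345-12678",
        "repeat": "GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC",
        "spacers": [
            {"sequence": "ATGCGTCGAC", "position": 0},
            {"sequence": "CGTAGCTAGC", "position": 37},
        ],
    }
]

results = {"strain_id": build_strain_record("I-E", arrays)}
print(json.dumps(results, indent=2))
```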
---
### Phase 2: Spacer-Toxin Gene Association
**File:** `crispr_cas/scripts/fusion_analysis.py`
**Tasks:**
1. Map CRISPR arrays to genomic positions
2. Identify toxin genes near CRISPR arrays (within a 10 kb window)
3. Analyze potential spacer-target matches:
- Extract toxin gene sequences
- Perform BLAST of spacers against toxin genes
- Identify potential immunity or targeting relationships
**Output Format (JSON):**
```json
{
  "strain_id": {
    "crispr_toxin_associations": [
      {
        "crispr_array": "contig_1:12345-12678",
        "nearby_toxins": ["Cry1Ac1", "Cry2Aa3"],
        "spacer_targets": [
          {"spacer": "ATGCGTCGAC", "target": "Cry1Ac1", "identity": 0.95}
        ],
        "distance_to_toxin": 2500
      }
    ]
  }
}
```
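
The proximity step (toxin genes within a 10 kb window of an array) could be sketched as follows. This assumes both arrays and toxin genes are described by `contig:start-end` strings like the `position` field above; the function names and the toxin coordinates are hypothetical, chosen only for illustration. Distance is taken as the gap between the two intervals, with 0 for overlaps, and genes on other contigs are ignored.

```python
import re

WINDOW_BP = 10_000  # the 10 kb window from the task list


def parse_region(region):
    """Split 'contig:start-end' into (contig, start, end) with start <= end."""
    match = re.fullmatch(r"(.+):(\d+)-(\d+)", region)
    if match is None:
        raise ValueError(f"unrecognised region string: {region!r}")
    start, end = int(match.group(2)), int(match.group(3))
    return match.group(1), min(start, end), max(start, end)


def interval_distance(a_start, a_end, b_start, b_end):
    """Gap in bp between two intervals; 0 if they overlap or touch."""
    return max(0, b_start - a_end, a_start - b_end)


def nearby_toxins(array_region, toxin_regions, window=WINDOW_BP):
    """Return [(toxin_name, distance)] on the same contig within `window` bp."""
    contig, a_start, a_end = parse_region(array_region)
    hits = []
    for name, region in toxin_regions.items():
        t_contig, t_start, t_end = parse_region(region)
        if t_contig != contig:
            continue  # distances across contigs are undefined
        distance = interval_distance(a_start, a_end, t_start, t_end)
        if distance <= window:
            hits.append((name, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical toxin coordinates, for illustration only.
toxins = {
    "Cry1Ac1": "contig_1:15178-18700",
    "Cry2Aa3": "contig_1:30000-32000",
    "Cry1Ia1": "contig_2:100-2000",
}
hits = nearby_toxins("contig_1:12345-12678", toxins)
print(hits)
```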
---
### Phase 3: Integration with Shoter Scoring
**File:** Modify `scripts/bttoxin_shoter.py`
**Reserved Parameters (add to argument parser):**
```python
# CRISPR-Cas Integration (Reserved for Future Implementation)
ap.add_argument("--crispr_weight", type=float, default=0.0,
                help="[FUTURE] Weight for CRISPR-Cas contribution to activity scores (0-1)")
ap.add_argument("--crispr_results", type=Path, default=None,
                help="[FUTURE] Path to CRISPR-Cas detection results JSON")
ap.add_argument("--crispr_fusion", action="store_true", default=False,
                help="[FUTURE] Enable spacer-toxin fusion analysis")
```
**Scoring Integration (in `score_strain()` function):**
```python
# Reserved: CRISPR-Cas scoring integration
# When CRISPR is enabled, modify strain scores:
#
# if args.crispr_weight > 0 and crispr_data:
#     crispr_boost = calculate_crispr_activity_boost(
#         strain=strain,
#         crispr_data=crispr_data.get(strain, {}),
#         toxin_hits=toxin_hits
#     )
#     # Apply CRISPR boost to target order scores
#     for order, score in sscore.scores.items():
#         sscore.scores[order] = (
#             score * (1 - args.crispr_weight)
#             + crispr_boost.get(order, 0) * args.crispr_weight
#         )
```
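
The blending rule in the reserved block is a linear interpolation between the existing score and the CRISPR boost. A standalone sketch of just that arithmetic is given below; `blend_scores` is a hypothetical helper, the order names and values are illustrative, and `calculate_crispr_activity_boost` is not modelled here. Two properties follow directly from the formula: a weight of 0 leaves every score unchanged, and any order missing from the boost mapping is scaled down by `(1 - weight)` because its boost defaults to 0.

```python
def blend_scores(scores, crispr_boost, crispr_weight):
    """Blend per-order scores with CRISPR boosts.

    new = score * (1 - w) + boost * w, with a boost of 0 for orders
    that have no entry in `crispr_boost`.
    """
    if not 0.0 <= crispr_weight <= 1.0:
        raise ValueError("crispr_weight must be between 0 and 1")
    return {
        order: score * (1 - crispr_weight)
        + crispr_boost.get(order, 0.0) * crispr_weight
        for order, score in scores.items()
    }


# Illustrative values only.
baseline = {"Lepidoptera": 0.8, "Diptera": 0.4}
boost = {"Lepidoptera": 1.0}

blended = blend_scores(baseline, boost, crispr_weight=0.2)
for order, value in blended.items():
    print(f"{order}: {value:.2f}")
```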
---
### Phase 4: Enhanced Visualization
**File:** `scripts/plot_shotter.py`
**Tasks:**
1. Add CRISPR-Cas panel to existing heatmaps
2. Visualize:
- CRISPR array positions on genome
- Spacer-toxin targeting relationships
- CRISPR-enhanced activity scores
**Output Format:**
- Extended PDF report with CRISPR section
- Additional JSON with CRISPR metadata
- Optional: Genomic track visualization (SVG/PNG)
---
## Pixi Integration
The pixi environment is already configured (commented out) in `pixi.toml`:
```toml
# =========================
# CRISPR-Cas environment: reserved for future CRISPR-Cas analysis
# =========================
# [feature.crispr.dependencies]
# python = ">=3.9"
# biopython = "*"
# pandas = ">=2.0.0"
# =========================
# [feature.crispr.tasks]
# crispr-detect = "python crispr_cas/scripts/detect_crispr.py"
# crispr-fusion = "python crispr_cas/scripts/fusion_analysis.py"
```
**To activate CRISPR module:**
1. Uncomment the `[feature.crispr.dependencies]` section
2. Uncomment the `[feature.crispr.tasks]` section
3. Add a `crispr` environment that uses the `crispr` feature to the `[environments]` table
4. Run `pixi install`
---
## Usage Examples (When Implemented)
### Basic CRISPR Detection
```bash
pixi run -e crispr crispr-detect \
--input genome.fna \
--output crispr_results.json
```
### Full Pipeline with CRISPR Integration
```bash
# Run CRISPR detection first
pixi run -e crispr crispr-detect --input genome.fna --output crispr.json
# Run pipeline with CRISPR-enhanced scoring
pixi run pipeline \
--input genome.fna \
--toxicity_csv Data/toxicity-data.csv \
--crispr_results crispr.json \
--crispr_weight 0.2 \
--crispr_fusion
```
### API Integration (Future)
```text
# Backend API endpoint (to be implemented)
POST /api/v1/tasks
{
"files": ["genome.fna"],
"crispr_enabled": true,
"crispr_weight": 0.2,
"crispr_fusion": true
}
```
---
## Scientific Background
### Why CRISPR-Cas in Bt Analysis?
1. **Phage Immunity**: CRISPR-Cas systems in Bt may provide adaptive immunity against phages, which can affect strain fitness
2. **Plasmid Tracking**: CRISPR spacers can indicate plasmid content and horizontal gene transfer history
3. **Strain Differentiation**: CRISPR array patterns can distinguish closely related strains
4. **Toxin Gene Proximity**: CRISPR arrays near toxin genes may indicate genomic defense mechanisms
### Expected Benefits
- Enhanced strain characterization beyond toxin profiling
- Better understanding of strain evolution and adaptation
- Potential correlation with biocontrol efficacy
- Additional markers for strain selection
---
## Testing Strategy
### Unit Tests
- CRISPR detection mock data parsing
- Spacer-toxin distance calculation
- CRISPR score calculation logic
### Integration Tests
- End-to-end pipeline with small genome
- Comparison with manual CRISPRCasFinder results
- Scoring consistency with/without CRISPR
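
The scoring-consistency check could be written as pytest-style tests along these lines. This is a sketch under assumptions: `blend_scores` stands in for the reserved blending logic and is not an existing pipeline function, and the scores are made-up values. The key invariant is that the default weight of 0 must reproduce the baseline scores exactly, so enabling the module cannot change existing results by accident.

```python
def blend_scores(scores, crispr_boost, crispr_weight):
    """Hypothetical stand-in for the reserved blending step."""
    return {
        order: score * (1 - crispr_weight)
        + crispr_boost.get(order, 0.0) * crispr_weight
        for order, score in scores.items()
    }


BASELINE = {"Lepidoptera": 0.8, "Diptera": 0.4}  # made-up scores
BOOST = {"Lepidoptera": 1.0, "Diptera": 0.5}     # made-up boosts


def test_zero_weight_reproduces_baseline():
    # With the default weight of 0.0 the scores must be unchanged.
    assert blend_scores(BASELINE, BOOST, 0.0) == BASELINE


def test_full_weight_returns_boost():
    # With weight 1.0 the result is the boost alone.
    assert blend_scores(BASELINE, BOOST, 1.0) == BOOST


def test_blend_stays_between_inputs():
    # An intermediate weight must land between baseline and boost.
    blended = blend_scores(BASELINE, BOOST, 0.2)
    for order in BASELINE:
        low = min(BASELINE[order], BOOST[order])
        high = max(BASELINE[order], BOOST[order])
        assert low <= blended[order] <= high
```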
### Validation
- Compare CRISPR-enhanced scores with experimental bioassay data
- Validate CRISPR-toxin associations using known literature
---
## Dependencies
### External Tools
- **CRISPRCasFinder** (v4.2+): https://crisprcas.i2bc.paris-saclay.fr/
- **BLAST+** (for spacer-toxin alignment)
### Python Packages
- biopython >= 1.79
- pandas >= 2.0.0
- numpy >= 1.21.0
---
## Timeline Estimate
- **Phase 1**: 2-3 weeks (CRISPR detection wrapper)
- **Phase 2**: 2-3 weeks (Fusion analysis)
- **Phase 3**: 1-2 weeks (Shoter integration)
- **Phase 4**: 2-3 weeks (Visualization)
**Total**: ~2-3 months for full implementation
---
## References
1. Couvin, D. et al. (2018) CRISPRCasFinder, an update of CRISRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins. *Nucleic Acids Research*, 46(W1), W246-W251.
2. Chakraborty, S. et al. (2020) CRISPR-Cas systems in Bacillus thuringiensis: diversity, evolution and potential applications. *Frontiers in Microbiology*, 11, 591.
3. BtToxin Pipeline Documentation: `docs/shotter_workflow.md`
---
## Contact
For questions or implementation guidance, refer to the main project documentation or create an issue in the project repository.
**Last Updated:** 2025-01-13
**Status:** Reserved - Implementation Pending