Refactor: Unified pipeline execution, simplified UI, and fixed Docker config

- Backend: Refactored tasks.py to directly invoke run_single_fna_pipeline.py for consistency.
- Backend: Changed output format to ZIP and added auto-cleanup of intermediate files.
- Backend: Fixed language parameter passing in API and tasks.
- Frontend: Removed CRISPR Fusion UI elements from Submit and Monitor views.
- Frontend: Implemented simulated progress bar for better UX.
- Frontend: Restored One-click load button and added result file structure documentation.
- Docker: Fixed critical Restarting loop by removing incorrect image directive in docker-compose.yml.
- Docker: Optimized Dockerfile to correct .pixi environment path issues and prevent accidental deletion of frontend assets.
zly
2026-01-20 20:25:25 +08:00
parent 5067169b0b
commit c75c85c53b
134 changed files with 146457 additions and 996647 deletions

tools/README.md

@@ -0,0 +1,81 @@
# BtToxin Analysis Modules
This directory contains specialized analysis modules integrated into the BtToxin Pipeline. Each module focuses on identifying and characterizing specific genomic features that contribute to the insecticidal potential of *Bacillus thuringiensis* strains.
## 1. BtToxin_Digger
**Core Toxin Identification Module**
This is the foundational module of the pipeline, responsible for identifying Cry, Cyt, and Vip toxin genes in bacterial genomes.
* **Function**:
    * Predicts Open Reading Frames (ORFs) from genomic sequences (.fna).
    * Translates coding sequences (CDS) to proteins.
    * Uses BLAST and HMM (Hidden Markov Models) to search against a curated database of known Bt toxins.
    * Identifies toxin candidates and classifies them into families/subfamilies based on sequence identity.
* **Key Metrics**: Sequence Identity (`Identity`), Coverage (`Coverage`), and HMM domain hits.
* **Role**: Provides the primary "evidence" ($w_i$) for the Shotter scoring system.
## 2. BGC Analysis (bgc_analysis)
**Biosynthetic Gene Cluster Detection**
This module detects three specific gene clusters associated with insecticidal activity. Each serves as an independent marker of that activity.
* **Targets**:
    * **ZWA**: Zwittermicin A biosynthetic gene cluster.
    * **Thu**: Thuringiensin (beta-exotoxin) biosynthetic gene cluster.
    * **TAA**: Toxin A (insecticidal protein) gene cluster.
* **Methodology**:
    * Uses BLAST/HMM to detect signature enzymes and backbone genes specific to these clusters.
    * Returns a binary status (Present/Absent) for each cluster type ($b_Z, b_T, b_A \in \{0, 1\}$).
* **Contribution to Scoring**:
    * The presence of these clusters acts as a **positive prior**, boosting the final toxicity score ($S_{\text{final}}$) because they represent functional insecticidal modules independent of Cry/Vip proteins.
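
As a rough illustration of how the binary flags could feed a positive prior, the sketch below combines them as a weighted sum. The JSON keys `ZWA`, `Thu`, and `TAA` match what the mock BGC detector in this commit writes; the weights in `BGC_WEIGHTS` and the function name `bgc_boost` are hypothetical and are not taken from the Shotter implementation.

```python
import json

# Hypothetical per-cluster weights; the real values are not given in this README.
BGC_WEIGHTS = {"ZWA": 0.3, "Thu": 0.5, "TAA": 0.4}


def bgc_boost(flags: dict) -> float:
    """Weighted sum of the binary presence flags (b_Z, b_T, b_A)."""
    total = 0.0
    for name, weight in BGC_WEIGHTS.items():
        # Treat a missing key as "Absent" (0).
        present = 1 if flags.get(name, 0) else 0
        total += weight * present
    return total


# Flags in the shape written by the mock BGC detector.
flags = json.loads('{"ZWA": 1, "Thu": 0, "TAA": 1}')
boost = bgc_boost(flags)
```

A weighted sum keeps each cluster's contribution independent, so a strain with no detected clusters simply receives no boost.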
## 3. Mobilome Analysis (mobilome_analysis)
**Mobile Genetic Element Quantification**
This module quantifies the "mobilome"—the collection of mobile genetic elements—which correlates with a strain's ability to acquire, rearrange, and maintain toxin genes.
* **Targets**:
    * **Transposases**: Enzymes that facilitate gene movement.
    * **Plasmids**: Extrachromosomal DNA often carrying toxin genes in Bt.
    * **Phages**: Viral elements that can mediate horizontal gene transfer.
* **Methodology**:
    * Annotates and counts these elements in the genome.
    * Returns a total count or specific counts ($m$).
* **Contribution to Scoring**:
    * A higher mobilome count indicates a more "open" genome capable of HGT (Horizontal Gene Transfer).
    * Contributes a **positive prior** (via a saturation function $g(m)$) to the toxicity score, reflecting a higher potential for evolving or acquiring diverse toxin cocktails.
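
This README does not state the form of $g(m)$. The sketch below assumes one common saturating shape, $g(m) = w\,(1 - e^{-m/k})$, purely for illustration; the function name `mobilome_prior` and the parameters `weight` and `scale` are hypothetical.

```python
import math


def mobilome_prior(m: int, weight: float = 0.5, scale: float = 20.0) -> float:
    """Saturating boost: 0 at m = 0, approaching `weight` as m grows.

    Assumed form g(m) = weight * (1 - exp(-m / scale)); not the documented one.
    """
    if m < 0:
        raise ValueError("mobile element count must be non-negative")
    return weight * (1.0 - math.exp(-m / scale))


# The boost grows quickly for small counts and flattens for large ones.
small, large = mobilome_prior(10), mobilome_prior(100)
```

Whatever the actual form, the point of a saturation function is that the hundredth mobile element adds far less than the first few, so a very large count cannot dominate the score.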
## 4. CRISPR-Cas Analysis (crispr_cas_analysis)
**Genome Defense System Characterization**
This module characterizes the CRISPR-Cas immune systems, which act as barriers to foreign DNA (including plasmids and phages).
* **Targets**:
    * **Cas Proteins**: Identification of Cas gene clusters.
    * **CRISPR Arrays**: Detection of direct repeats and spacers.
* **Methodology**:
    * Classifies the system status into three levels: **Complete** (functional), **Incomplete** (degraded), or **Absent**.
    * Returns a status code $c \in \{0, 1, 2\}$ (0=Absent, 1=Incomplete, 2=Complete).
* **Contribution to Scoring**:
    * **Negative Prior**: A complete, functional CRISPR system ($c=2$) limits the intake of foreign plasmids (which often carry toxins).
    * Therefore, an **Absent** system allows for the highest potential of plasmid-borne toxin acquisition (Highest score boost), while a **Complete** system penalizes the prior probability (Lowest/No boost). This follows the logic: *Absent > Incomplete > Complete* for toxicity potential.
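
One possible way to derive the status code and its boost is sketched below. It assumes that a system counts as Complete only when both Cas genes and a CRISPR array are found, and Incomplete when only one is present; this rule is an assumption, not something the README specifies. The `has_cas` and `has_crispr` fields follow the `summary` block written by the detection script in this commit, while `crispr_status`, `crispr_prior`, and the values in `CRISPR_BOOST` are hypothetical.

```python
def crispr_status(summary: dict) -> int:
    """Map a detection summary to c in {0, 1, 2} (assumed rule)."""
    has_cas = bool(summary.get("has_cas"))
    has_crispr = bool(summary.get("has_crispr"))
    if has_cas and has_crispr:
        return 2  # Complete
    if has_cas or has_crispr:
        return 1  # Incomplete
    return 0  # Absent


# Hypothetical boosts that respect Absent > Incomplete > Complete.
CRISPR_BOOST = {0: 0.4, 1: 0.2, 2: 0.0}


def crispr_prior(c: int) -> float:
    """Look up the boost for a status code."""
    return CRISPR_BOOST[c]


status = crispr_status({"has_cas": True, "has_crispr": True})
```

A lookup table keeps the ordering explicit and makes it easy to check that the boost never increases as the system becomes more complete.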
---
## Integration in Shotter Scoring
These modules work together to refine the final insecticidal activity prediction:
1. **Evidence**: **BtToxin_Digger** provides direct evidence of toxin genes ($S_{\text{tox}}$).
2. **Priors**: **BGC**, **Mobilome**, and **CRISPR** modules provide a "genomic context" prior ($\Delta(\text{strain})$).
The final score combines these using a logit-based adjustment:
$$
S_{\text{final}} = \sigma\left( \operatorname{logit}(S_{\text{tox}}) + \Delta(\text{strain}) \right)
$$
Where $\Delta(\text{strain})$ aggregates the positive boosts from BGCs/Mobilome and the adjustment from CRISPR status.
For full mathematical details, see [docs/shotter_math_full_zh_typora.md](../docs/shotter_math_full_zh_typora.md).
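
The logit-based adjustment above can be sketched in a few lines of Python. This is a minimal illustration of the formula only: the function names and the clamping constant `eps` are assumptions added to keep the logit finite at 0 and 1, and how $\Delta(\text{strain})$ is computed is left to the linked document.

```python
import math


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clamped to [eps, 1 - eps] to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Logistic function, the inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def final_score(s_tox: float, delta: float) -> float:
    """S_final = sigmoid(logit(S_tox) + delta)."""
    return sigmoid(logit(s_tox) + delta)


# A zero prior leaves the evidence score unchanged; a positive one raises it.
unchanged = final_score(0.5, 0.0)
boosted = final_score(0.5, 1.0)
```

Working in logit space means the prior shifts the odds rather than the probability, so the result always stays strictly between 0 and 1.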


@@ -0,0 +1,31 @@
#!/usr/bin/env python3
"""
Mock BGC Detector (ZWA/Thu/TAA)
Returns random presence/absence for testing.
"""
import argparse
import json
import random
from pathlib import Path


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="Input genome file")
    parser.add_argument("--output", required=True, help="Output JSON file")
    args = parser.parse_args()

    # Mock logic: Randomly assign 0 or 1
    # In real impl, this would run HMM/BLAST against specific BGC databases
    results = {
        "ZWA": random.choice([0, 1]),
        "Thu": random.choice([0, 1]),
        "TAA": random.choice([0, 1])
    }

    with open(args.output, "w") as f:
        json.dump(results, f, indent=2)

    print(f"Mock BGC results written to {args.output}")


if __name__ == "__main__":
    main()


@@ -0,0 +1 @@
"""CRISPR-Cas Analysis Module"""


@@ -0,0 +1 @@
"""Scripts for CRISPR-Cas detection and analysis"""


@@ -0,0 +1,139 @@
#!/usr/bin/env python3
"""
CRISPR-Cas Detection Wrapper
Wrapper for CRISPRCasFinder or similar tools to detect CRISPR arrays and Cas genes.
"""
import argparse
import json
import logging
import shutil
import subprocess
import sys
from pathlib import Path
from typing import Dict, List, Any

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


def parse_args():
    parser = argparse.ArgumentParser(description="Detect CRISPR arrays and Cas genes in genome")
    parser.add_argument("--input", "-i", type=Path, required=True, help="Input genome file (.fna)")
    parser.add_argument("--output", "-o", type=Path, required=True, help="Output JSON results file")
    parser.add_argument("--tool-path", type=Path, default=None, help="Path to CRISPRCasFinder.pl")
    parser.add_argument("--mock", action="store_true", help="Use mock data (for testing without external tools)")
    return parser.parse_args()


def check_dependencies(tool_path: Path = None) -> bool:
    """Check if CRISPRCasFinder is available"""
    if tool_path and tool_path.exists():
        return True

    # Check in PATH
    if shutil.which("CRISPRCasFinder.pl"):
        return True

    return False


def generate_mock_results(genome_file: Path) -> Dict[str, Any]:
    """Generate mock CRISPR results for testing"""
    logger.info(f"Generating mock CRISPR results for {genome_file.name}")

    strain_id = genome_file.stem

    return {
        "strain_id": strain_id,
        "cas_systems": [
            {
                "type": "I-E",
                "subtype": "I-E",
                "position": "contig_1:15000-25000",
                "genes": ["cas1", "cas2", "cas3", "casA", "casB", "casC", "casD", "casE"]
            }
        ],
        "arrays": [
            {
                "id": "CRISPR_1",
                "contig": "contig_1",
                "start": 12345,
                "end": 12678,
                "consensus_repeat": "GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC",
                "num_spacers": 5,
                "spacers": [
                    {"sequence": "ATGCGTCGACATGCGTCGACATGCGTCGAC", "position": 1},
                    {"sequence": "CGTAGCTAGCCGTAGCTAGCCGTAGCTAGC", "position": 2},
                    {"sequence": "TGCATGCATGTGCATGCATGTGCATGCATG", "position": 3},
                    {"sequence": "GCTAGCTAGCGCTAGCTAGCGCTAGCTAGC", "position": 4},
                    {"sequence": "AAAAATTTTTAAAAATTTTTAAAAATTTTT", "position": 5}
                ]
            },
            {
                "id": "CRISPR_2",
                "contig": "contig_2",
                "start": 50000,
                "end": 50500,
                "consensus_repeat": "GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC",
                "num_spacers": 8,
                "spacers": [
                    {"sequence": "CCCGGGAAACCCGGGAAACCCGGGAAA", "position": 1}
                ]
            }
        ],
        "summary": {
            "has_cas": True,
            "has_crispr": True,
            "num_arrays": 2,
            "num_spacers": 13,
            "cas_types": ["I-E"]
        },
        "metadata": {
            "tool": "CRISPRCasFinder",
            "version": "Mock-v1.0",
            "date": "2025-01-14"
        }
    }


def run_crisprcasfinder(input_file: Path, output_file: Path, tool_path: Path = None):
    """Run actual CRISPRCasFinder tool (Placeholder)"""
    # This would implement the actual subprocess call to CRISPRCasFinder.pl
    # For now, we raise NotImplementedError unless mock is used
    raise NotImplementedError("Real tool integration not yet implemented. Use --mock flag.")


def main():
    args = parse_args()

    if not args.input.exists():
        logger.error(f"Input file not found: {args.input}")
        sys.exit(1)

    # Create parent directory for output if needed
    args.output.parent.mkdir(parents=True, exist_ok=True)

    try:
        if args.mock:
            results = generate_mock_results(args.input)
        else:
            if not check_dependencies(args.tool_path):
                logger.warning("CRISPRCasFinder not found. Falling back to mock data.")
                results = generate_mock_results(args.input)
            else:
                # Real implementation would go here
                run_crisprcasfinder(args.input, args.output, args.tool_path)
                return

        # Write results
        with open(args.output, 'w') as f:
            json.dump(results, f, indent=2)

        logger.info(f"Results written to {args.output}")

    except Exception as e:
        logger.error(f"Error executing CRISPR detection: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()


@@ -0,0 +1,166 @@
#!/usr/bin/env python3
"""
CRISPR-Toxin Fusion Analysis
Analyzes associations between CRISPR spacers and toxin genes.
"""
import argparse
import json
import logging
import sys
from pathlib import Path
from typing import Dict, List, Any

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


def parse_args():
    parser = argparse.ArgumentParser(description="Analyze CRISPR-Toxin associations")
    parser.add_argument("--crispr-results", type=Path, required=True, help="CRISPR detection results (JSON)")
    parser.add_argument("--toxin-results", type=Path, required=True, help="Toxin detection results (JSON or TXT)")
    parser.add_argument("--genome", type=Path, required=True, help="Original genome file (.fna)")
    parser.add_argument("--output", "-o", type=Path, required=True, help="Output analysis JSON")
    parser.add_argument("--mock", action="store_true", help="Use mock analysis logic")
    return parser.parse_args()


def load_json(path: Path) -> Dict:
    with open(path) as f:
        return json.load(f)


def calculate_distance(range1: str, range2: str) -> int:
    """
    Calculate distance between two genomic ranges.
    Format: 'contig:start-end'
    """
    try:
        contig1, coords1 = range1.split(':')
        start1, end1 = map(int, coords1.split('-'))

        contig2, coords2 = range2.split(':')
        start2, end2 = map(int, coords2.split('-'))

        if contig1 != contig2:
            return -1  # Different contigs

        # Check for overlap
        if max(start1, start2) <= min(end1, end2):
            return 0

        # Calculate distance
        if start1 > end2:
            return start1 - end2
        else:
            return start2 - end1
    except Exception as e:
        logger.warning(f"Error calculating distance: {e}")
        return -1


def mock_blast_spacers(spacers: List[str], toxins: List[Dict]) -> List[Dict]:
    """Mock BLAST spacers against toxins"""
    matches = []
    # Simulate a match if 'Cry' is in the spacer name (just for demo logic) or random
    # In reality, we'd blast sequences.

    # Let's just create a fake match for the first spacer
    if spacers and toxins:
        matches.append({
            "spacer_seq": spacers[0],
            "target_toxin": toxins[0].get("name", "Unknown"),
            "identity": 98.5,
            "alignment_length": 32,
            "mismatches": 1
        })
    return matches


def perform_fusion_analysis(crispr_data: Dict, toxin_file: Path, mock: bool = False) -> Dict:
    """
    Main analysis logic.
    1. Map CRISPR arrays
    2. Map Toxin genes
    3. Calculate distances
    4. Check for spacer matches
    """
    analysis_results = {
        "strain_id": crispr_data.get("strain_id"),
        "associations": [],
        "summary": {"proximal_pairs": 0, "spacer_matches": 0}
    }

    # Extract arrays
    arrays = crispr_data.get("arrays", [])

    # Mock Toxin Parsing (assuming simple list for now if not JSON)
    toxins = []
    if mock:
        toxins = [
            {"name": "Cry1Ac1", "position": "contig_1:10000-12000"},
            {"name": "Vip3Aa1", "position": "contig_2:60000-62000"}
        ]
    else:
        # TODO: Implement real toxin file parsing (e.g. from All_Toxins.txt)
        logger.warning("Real toxin parsing not implemented yet, using empty list")

    # Analyze Proximity
    for array in arrays:
        array_pos = f"{array.get('contig')}:{array.get('start')}-{array.get('end')}"

        for toxin in toxins:
            dist = calculate_distance(array_pos, toxin["position"])

            if dist != -1 and dist < 10000:  # 10kb window
                association = {
                    "type": "proximity",
                    "array_id": array.get("id"),
                    "toxin": toxin["name"],
                    "distance": dist,
                    "array_position": array_pos,
                    "toxin_position": toxin["position"]
                }
                analysis_results["associations"].append(association)
                analysis_results["summary"]["proximal_pairs"] += 1

    # Analyze Spacer Matches (Mock)
    all_spacers = []
    for array in arrays:
        for spacer in array.get("spacers", []):
            all_spacers.append(spacer.get("sequence"))

    matches = mock_blast_spacers(all_spacers, toxins)
    for match in matches:
        analysis_results["associations"].append({
            "type": "spacer_match",
            **match
        })
        analysis_results["summary"]["spacer_matches"] += 1

    return analysis_results


def main():
    args = parse_args()

    if not args.crispr_results.exists():
        logger.error(f"CRISPR results file not found: {args.crispr_results}")
        sys.exit(1)

    try:
        crispr_data = load_json(args.crispr_results)

        results = perform_fusion_analysis(crispr_data, args.toxin_results, args.mock)

        # Write results
        with open(args.output, 'w') as f:
            json.dump(results, f, indent=2)

        logger.info(f"Fusion analysis complete. Results: {args.output}")

    except Exception as e:
        logger.error(f"Error during fusion analysis: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()


@@ -0,0 +1 @@
"""Tests for CRISPR-Cas module"""


@@ -0,0 +1,42 @@
import pytest
import json
import shutil
from pathlib import Path
from crispr_cas.scripts.detect_crispr import generate_mock_results


def test_generate_mock_results(tmp_path):
    """Test mock result generation"""
    input_file = tmp_path / "test_genome.fna"
    input_file.touch()

    results = generate_mock_results(input_file)

    assert results["strain_id"] == "test_genome"
    assert "cas_systems" in results
    assert "arrays" in results
    assert results["summary"]["has_cas"] is True
    assert len(results["arrays"]) > 0


def test_script_execution(tmp_path):
    """Test full script execution via subprocess"""
    # Create dummy input
    input_file = tmp_path / "genome.fna"
    input_file.touch()
    output_file = tmp_path / "results.json"

    script_path = Path("crispr_cas/scripts/detect_crispr.py").absolute()

    import subprocess
    cmd = [
        "python3", str(script_path),
        "--input", str(input_file),
        "--output", str(output_file),
        "--mock"
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)

    assert result.returncode == 0
    assert output_file.exists()

    with open(output_file) as f:
        data = json.load(f)
    assert data["strain_id"] == "genome"


@@ -0,0 +1,93 @@
import pytest
import json
from pathlib import Path
import sys

# Add project root to path to allow importing modules
sys.path.insert(0, str(Path(__file__).parents[2]))

from crispr_cas.scripts.fusion_analysis import calculate_distance, perform_fusion_analysis


def test_calculate_distance():
    """Test genomic distance calculation"""
    # Same contig, no overlap
    # Range1: 100-200, Range2: 300-400 -> Dist 100
    assert calculate_distance("c1:100-200", "c1:300-400") == 100

    # Same contig, overlap
    # Range1: 100-300, Range2: 200-400 -> Dist 0
    assert calculate_distance("c1:100-300", "c1:200-400") == 0

    # Different contig
    assert calculate_distance("c1:100-200", "c2:300-400") == -1

    # Invalid format
    assert calculate_distance("invalid", "c1:100-200") == -1


def test_fusion_analysis_logic(tmp_path):
    """Test main analysis logic with mock data"""
    # Mock CRISPR data
    crispr_data = {
        "strain_id": "test_strain",
        "arrays": [
            {
                "id": "A1",
                "contig": "contig_1",
                "start": 1000,
                "end": 2000,
                "spacers": [{"sequence": "ATGC"}]
            }
        ]
    }

    # Mock toxin file (just a placeholder for path)
    toxin_file = tmp_path / "toxins.txt"
    toxin_file.touch()

    # Run analysis in mock mode
    # In mock mode, the script generates its own toxin list:
    # {"name": "Cry1Ac1", "position": "contig_1:10000-12000"}
    # Distance: 10000 - 2000 = 8000 (< 10000 threshold) -> Should match
    results = perform_fusion_analysis(crispr_data, toxin_file, mock=True)

    assert results["strain_id"] == "test_strain"
    assert len(results["associations"]) > 0

    # Check for proximity match
    proximity_matches = [a for a in results["associations"] if a["type"] == "proximity"]
    assert len(proximity_matches) > 0
    assert proximity_matches[0]["distance"] == 8000


def test_script_execution(tmp_path):
    """Test full script execution via subprocess"""
    # Create input files
    crispr_file = tmp_path / "crispr.json"
    with open(crispr_file, 'w') as f:
        json.dump({"strain_id": "test", "arrays": []}, f)

    toxin_file = tmp_path / "toxins.txt"
    toxin_file.touch()

    genome_file = tmp_path / "genome.fna"
    genome_file.touch()

    output_file = tmp_path / "output.json"

    script_path = Path("crispr_cas/scripts/fusion_analysis.py").absolute()

    import subprocess
    cmd = [
        "python3", str(script_path),
        "--crispr-results", str(crispr_file),
        "--toxin-results", str(toxin_file),
        "--genome", str(genome_file),
        "--output", str(output_file),
        "--mock"
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)

    assert result.returncode == 0
    assert output_file.exists()


@@ -0,0 +1,31 @@
#!/usr/bin/env python3
"""
Mock Mobilome Analyzer
Returns random count of mobile elements (transposases, plasmids, phages).
"""
import argparse
import json
import random
from pathlib import Path


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="Input genome file")
    parser.add_argument("--output", required=True, help="Output JSON file")
    args = parser.parse_args()

    # Mock logic: Random count between 0 and 100
    # In real impl, this would sum hits of IS elements, plasmid replicons, phage proteins
    count = random.randint(0, 100)

    results = {
        "mobile_elements_count": count
    }

    with open(args.output, "w") as f:
        json.dump(results, f, indent=2)

    print(f"Mock Mobilome results written to {args.output}")


if __name__ == "__main__":
    main()