feat: add base_prokka tool and CRISPR-Cas analysis source code

- Add base_prokka genome annotation tool with pixi config
- Add CRISPR-Cas analysis src (CRISPRCasFinder.pl, environment config)
- Add test data and documentation

Co-Authored-By: Claude <noreply@anthropic.com>
zly
2026-01-28 20:31:00 +08:00
parent ef09a1c5d5
commit 8e0deb1691
17 changed files with 14189 additions and 0 deletions

@@ -0,0 +1,80 @@
# CRISPR-Cas Analysis Module
This module provides tools for detecting and analyzing CRISPR-Cas systems in bacterial genomes using CRISPRCasFinder and MacSyFinder.
## Installation & Setup
This directory is a standalone `pixi` project.
1. **Enter the directory**:
```bash
cd tools/crispr_cas_analysis
```
2. **Install dependencies**:
```bash
pixi install
```
3. **Install CASFinder Definitions**:
This step downloads the required CASFinder model definitions.
```bash
pixi run install-casfinder
```
## Usage
### Environment
To run commands, you can either prepend `pixi run` or enter the shell:
```bash
pixi shell
```
### Running Detection
Use the provided `CRISPRCasFinder.pl` script to analyze a genome assembly (FASTA format).
**Example Command (running from `tools/crispr_cas_analysis` directory)**:
```bash
# 1. Clean up previous results if they exist
rm -rf tests/test_output
# 2a. Run from inside the output directory: create it first (if it does not exist)
mkdir -p ./tests/test_output
# Enter the output directory
cd ./tests/test_output
# Run the command from here, with paths adjusted relative to this directory
pixi run perl ../../src/CRISPRCasFinder.pl \
-in ../20141126CLLT035_contig341.fna \
-out . \
-so ../../src/sel392v2.so \
-cas -q -log
# Return to the project directory
cd ../..
# 2b. Alternatively, run detection from the project directory using relative paths
# (the -gffAnnot and -proteome paths below are machine-specific examples; replace them with your own annotation files)
pixi run perl src/CRISPRCasFinder.pl \
-in ./tests/20141126CLLT035_contig341.fna \
-q -cas -log -html -ccvRep \
-cpuMacSyFinder 20 \
-cluster 20000 \
-getSummaryCasfinder \
-gffAnnot /home/gzy/Bt_Project/1_sequencing_genome_annotation/20120412LHLT139/20120412LHLT139.gff \
-proteome /home/gzy/Bt_Project/1_sequencing_genome_annotation/20120412LHLT139/20120412LHLT139.faa \
-out ./tests/test_output \
-so ./src/sel392v2.so
```
### Output Explanation
The output directory (`tests/test_output`) will contain several key files:
* `CRISPR-Cas_summary.tsv`: Summary of detected CRISPR arrays and Cas systems.
* `Cas_REPORT.tsv`: Detailed report of detected Cas proteins.
* `Crisprs_REPORT.tsv`: Detailed report of detected CRISPR arrays.
* `GFF/`: Annotations of the findings.
* `Visualization/`: HTML visualization of the results.
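
After a run, it can be useful to confirm that the report files listed above were actually produced. The following is a minimal sketch, not part of the module: the file names come from the list above, while the helper `find_outputs` and the choice to search subdirectories recursively are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, Optional

# File names taken from the list above; the helper itself is hypothetical.
EXPECTED_FILES = ("CRISPR-Cas_summary.tsv", "Cas_REPORT.tsv", "Crisprs_REPORT.tsv")


def find_outputs(outdir: str) -> Dict[str, Optional[Path]]:
    """Map each expected file name to its path under outdir, or None if absent."""
    root = Path(outdir)
    found: Dict[str, Optional[Path]] = {}
    for name in EXPECTED_FILES:
        # Search recursively because the files may sit in a subdirectory.
        matches = sorted(root.rglob(name))
        found[name] = matches[0] if matches else None
    return found


if __name__ == "__main__":
    for name, path in find_outputs("tests/test_output").items():
        print(f"{name}: {path if path else 'MISSING'}")
```

Run it from the `tools/crispr_cas_analysis` directory so that `tests/test_output` resolves correctly.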
## Directory Structure
* `src/`: Source code and scripts (CRISPRCasFinder.pl, etc.).
* `scripts/`: Wrapper scripts for the pipeline.
* `tests/`: Test data.

@@ -0,0 +1 @@
{

@@ -0,0 +1,85 @@
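# Module 2: CRISPR/Cas Completeness Penalty Score (Template)
> Goal: assess the state of the CRISPR/Cas system in each genome and map it to:
> **absent (0) > incomplete (1) > complete (2)**, where **the more complete the system, the larger the penalty** (negative, monotonic).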
---
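The negative, monotonic mapping described in the goal can be sketched as follows. This is only an illustration: the linear form and the weight `PENALTY_PER_LEVEL` are assumptions, since the template does not specify the magnitude of the penalty, and `crispr_cas_penalty` is a hypothetical name.

```python
# Hypothetical penalty weight; the template does not specify the magnitude.
PENALTY_PER_LEVEL = 1.0


def crispr_cas_penalty(state: int, weight: float = PENALTY_PER_LEVEL) -> float:
    """Return a non-positive score contribution that decreases as state goes 0 -> 1 -> 2."""
    if state not in (0, 1, 2):
        raise ValueError(f"state must be 0, 1 or 2, got {state!r}")
    # Subtract from 0.0 so that state 0 yields 0.0 rather than -0.0.
    return 0.0 - weight * state
```

Any strictly decreasing function of `state` would satisfy the same ordering; the linear form is simply the shortest one to write down.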
## 1) Conda Environment
- Environment name: `crisprcasfinder`
- Description: used to run the CRISPRCasFinder detection tool
- Create and activate:
```bash
conda env create -f ccf.environment.yml -n crisprcasfinder
conda activate crisprcasfinder
conda install -c bioconda macsyfinder=2.1.2
macsydata install -u CASFinder==3.1.0
```
---
## 2) Launch Commands
### 2.1 Single-Genome Run (example template with two steps: detection + completeness classification)
```bash
# Step 1) Run CRISPRCasFinder (leave the -q -cas -log -html -ccvRep flags unchanged)
perl ./script/CRISPRCasFinder.pl \
-in /home/gzy/Bt_Project/1_sequencing_genome_annotation/20120412LHLT139/20120412LHLT139.fna \
-q -cas -log -html -ccvRep \
-cpuMacSyFinder 20 \
-cluster 20000 \
-getSummaryCasfinder \
-so /home/gzy/Bt_Project/software/sel392v2.so \
-gffAnnot /home/gzy/Bt_Project/1_sequencing_genome_annotation/20120412LHLT139/20120412LHLT139.gff \
-proteome /home/gzy/Bt_Project/1_sequencing_genome_annotation/20120412LHLT139/20120412LHLT139.faa
# Step 2) Classify the detection results as 0/1/2 (absent/incomplete/complete)
python crispr_cas_stats.py \
--input <OUTDIR>/CRISPR-Cas_summary.tsv \
--output <OUTDIR>/CRISPR-Cas_statistics.tsv
```
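The classification in Step 2 can be illustrated with a short sketch. The actual rule implemented in `crispr_cas_stats.py` is not shown here, so everything below is an assumption: the rule (0 = neither CRISPR arrays nor Cas clusters, 2 = both, 1 = only one of them), the function names, and the toy column names, which are not the real header of `CRISPR-Cas_summary.tsv`.

```python
import csv
import io


def classify_state(n_crispr_arrays: int, n_cas_clusters: int) -> int:
    """Assumed rule: 0 = neither found, 2 = both found, 1 = only one of them."""
    if n_crispr_arrays == 0 and n_cas_clusters == 0:
        return 0
    if n_crispr_arrays > 0 and n_cas_clusters > 0:
        return 2
    return 1


def classify_summary(tsv_text: str, crispr_col: str, cas_col: str) -> list[int]:
    """Classify every row of a tab-separated summary; column names are supplied by the caller."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [classify_state(int(row[crispr_col]), int(row[cas_col])) for row in reader]


# Toy input with made-up column names and values, for illustration only.
demo = "seq\tn_crispr\tn_cas\ncontigA\t0\t0\ncontigB\t2\t0\ncontigC\t1\t1\n"
print(classify_summary(demo, "n_crispr", "n_cas"))  # [0, 1, 2]
```

Passing the column names as arguments keeps the sketch independent of the exact header that CRISPRCasFinder writes, which should be checked against a real summary file before use.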
---
## 3) Parameters (threshold meanings and defaults)
| Parameter | Meaning | Default / note |
|---|---|---|
| `-in` | Genome FASTA (`.fna`) file | |
| `-cpuMacSyFinder` | Number of threads | 1 |
| `-cluster` | Distance threshold | 20000 |
| `-getSummaryCasfinder` | | |
| `-so` | | ./sel392v2.so |
| `-gffAnnot` | Genome GFF file | GFF file from Prokka annotation |
| `-proteome` | Genome FAA file | FAA file from Prokka annotation |
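
When many genomes are processed, the flags in the table can be assembled programmatically. The sketch below only builds the argument list and does not run anything. The flags and defaults are taken from the table and the commands above, while `build_ccf_command` and the `genome.*` file names are hypothetical placeholders.

```python
import shlex


def build_ccf_command(fasta, so_path="./sel392v2.so", cpus=1, cluster=20000,
                      gff=None, proteome=None, script="CRISPRCasFinder.pl"):
    """Assemble the argument list; optional annotation inputs are added only when given."""
    cmd = ["perl", script, "-in", fasta, "-cas",
           "-cpuMacSyFinder", str(cpus), "-cluster", str(cluster),
           "-getSummaryCasfinder", "-so", so_path]
    if gff:
        cmd += ["-gffAnnot", gff]
    if proteome:
        cmd += ["-proteome", proteome]
    return cmd


# Placeholder file names; substitute the Prokka outputs for a real genome.
cmd = build_ccf_command("genome.fna", cpus=20, gff="genome.gff", proteome="genome.faa")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.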
---
## 4) Output Files (structure and interpretation)
### 4.1 CRISPRCasFinder Output Directory Structure
```
OUTDIR/
  LOGs/
  TSV/
    CRISPR-Cas_summary.tsv
    CRISPR-Cas_clusters.tsv
    Crisprs_REPORT.tsv
    Cas_REPORT.tsv
  Visualization/
    index.html
```
### 4.2 crispr_cas_stats.py Output File Structure
| Field | Type | Description |
|---|---|---|
| `state` | int | 0: no CRISPR/Cas system; 1: CRISPR/Cas system present but incomplete; 2: CRISPR/Cas system present and complete |
| `typess` | string | Inferred system type (I/II/III/V/…) |
---

@@ -0,0 +1,25 @@
name: crisprcasfinder
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python >=3.10
- wget
- curl
- git
- java-jdk
- parallel
- perl-app-cpanminus
- hmmer
- emboss
- blast
- perl-bioperl-core
- perl-xml-simple
- perl-digest-md5
- vmatch
- muscle
- prodigal
- mamba
- macsyfinder=2.0
prefix: crisprcasfinder

@@ -0,0 +1,15 @@
>20141126CLLT035_contig341
AGAAGGATTTTAAAACCGTAAGACACTTAGAGAGGGGAAACAACTATGTCACTTTTACAG
CAACATTTTGAAGAAAGAAGAGAATACATTTTCAATCGTCTTAAACAACCAGAATACATG
GAAAGAAGCATAGAAAAAGTTCGCCAAGCTCAAAAAGAGATCAAAAATACAGTGCGAACG
ATTAAAGATTTGTTACTCTTAGACAAAACCACTGATCCTTGCCTTTAATTTATTCACTAA
TATTACACTGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTACTGTTTTTATCATGATGTTTAAT
GCAAAAGAGAAAATCCTCATGGATTCTTATAAGAAAAAGCGTAGATCACAAACAGAACTT
CATTATGATGTTGCTGACAAAGAAGGGTTTGACAAAGCGTTTTATGAAGCGCGTATTGAT
TCATTACGAAATGACATTCGTGTAATATCTTTCAAAAAGCTATGTGAAAATGAACCCGCA
CCAGAAGACTTAGAACTATTCAAACAACGCTATGAAACAATTGTTTTACCAAAAATACAA
GAAATTGTTTCCCTAATTGAACCAAGTTTAATAGATATAGACGTATTTTTAAATCCAGTA
ATCCAATATGGTGTAGGAGAAATTACTTTAGATGAAATGATTCAAAAACTACACAAAAAC
CTTTCTCTATTTCACGAATTATCAAAGGT