feat(digger): containerize BtToxin_Digger with v5 database integration

- Added Dockerfile and docker-compose.yml for BtToxin_Digger
- Integrated external v5 BLAST database into the container image
- Updated main docker-compose.yml to include the digger service
- Updated documentation with database update instructions
zly
2026-01-17 12:14:39 +08:00
parent 6f2365981d
commit 700bdb8307
33 changed files with 232973 additions and 75716 deletions


@@ -1,255 +1,141 @@
# BtToxin_Digger (pixi) reproduction
# BtToxin_Digger (pixi) reproduction & Docker Image
This repo is a **reproducible runtime environment + example outputs** for
BtToxin_Digger 1.0.10 with **BLAST v5 database compatibility**. It is **not**
an official fork or a new BtToxin_Digger release.
This repo is a **reproducible runtime environment** for BtToxin_Digger 1.0.10, packaged as a Docker image based on `ghcr.io/prefix-dev/pixi`.
It includes:
1. **BtToxin_Digger 1.0.10** (installed via Pixi)
2. **BLAST+ 2.16.0** (compatible with v5 databases)
3. **Pre-bundled BtToxin Database** (baked into the image)
## License / Citation / Disclaimer
- **BtToxin_Digger** is developed by its original authors; cite the upstream
publication if you use it in research.
- **This repository** only provides an environment wrapper (pixi) and example
runs for reproducibility; it does not modify BtToxin_Digger source code.
- **Disclaimer**: This is an independent, community-maintained setup and is
not endorsed by the upstream authors.
- **BtToxin_Digger** is developed by its original authors; cite the upstream publication if you use it in research.
- **This repository** only provides an environment wrapper (pixi/docker); it does not modify BtToxin_Digger source code.
This directory reproduces the BtToxin_Digger environment from
`quay.io/biocontainers/bttoxin_digger:1.0.10--hdfd78af_0` using pixi so the
`scripts/run_single_fna_pipeline.py` digger step can be run without Docker.
## 1. Quick Start with Docker
## 1) Environment definition (vs docker image)
The easiest way to run BtToxin_Digger is with the included `docker-compose.yml` or the global project configuration.
- `pixi.toml` keeps `bttoxin_digger=1.0.10` + `perl=5.26.2` (legacy stack) while
upgrading `blast` to a v5-capable release for BLASTDB v5.
- Changes relative to `quay.io/biocontainers/bttoxin_digger:1.0.10--hdfd78af_0`:
- BLAST+ upgraded from 2.12.0 to 2.16.0 (required to read v5 databases).
- Explicitly pinned `perl-file-tee==0.07` and `perl-list-util==1.38`.
- `channel-priority = "disabled"` to allow mixing bioconda/conda-forge and
the legacy label for perl compatibility.
Create the environment:
### Build the Image
```bash
# In this directory
docker compose build
```
cd /home/zly/project/bttoxin-pipeline/runs/bttoxin_digger_v5_repro
### Run Analysis
Place your input `.fna` files in `examples/inputs` (or mount your own directory), then run:
```bash
# Run help
docker compose run --rm digger-repro pixi run BtToxin_Digger --help
# Run analysis on a specific file
# Note: Input path must match the internal mount point (/app/jobs)
docker compose run --rm digger-repro pixi run BtToxin_Digger \
--SeqPath /app/jobs \
--Scaf_suffix .fna \
--threads 4
```
### Directory Mounting
- `/app/jobs`: Mount your input sequence files here.
- `/app/data`: Mount your desired output directory here (if using absolute paths in arguments).
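To make the two mount points concrete, here is a minimal sketch that prints (but does not execute) a `docker compose run` command binding a host input directory to `/app/jobs` and a host output directory to `/app/data`. The helper name `digger_cmd` and the host paths are hypothetical. The `-v` bindings are an assumption about how you might override the mounts defined in `docker-compose.yml`. As noted above, results only land in `/app/data` if your arguments point there.

```bash
# Hypothetical helper: print the docker compose command for a pair of host
# directories. Nothing is executed, so the command can be inspected first.
# Paths containing spaces would need extra quoting before being run.
digger_cmd() {
  local in_dir="$1" out_dir="$2" threads="${3:-4}"
  printf 'docker compose run --rm -v %s:/app/jobs -v %s:/app/data digger-repro pixi run BtToxin_Digger --SeqPath /app/jobs --Scaf_suffix .fna --threads %s\n' \
    "$in_dir" "$out_dir" "$threads"
}
```

For example, `digger_cmd "$PWD/examples/inputs" "$PWD/out"` prints the command with the default of 4 threads.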
## 2. Docker Image Construction
The image is built using `docker/Dockerfile`.
### Base Image
Uses `ghcr.io/prefix-dev/pixi:latest` to ensure a consistent conda-compatible environment.
### Database Integration
The external database (`external_dbs/bt_toxin`) is **copied into the image** at build time.
Target location: `/app/.pixi/envs/default/bin/BTTCMP_db/bt_toxin`
This replaces the default database shipped with the bioconda package, ensuring:
1. Latest toxin definitions are used.
2. BLAST v5 indices are compatible with the installed BLAST+ 2.16.0.
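Because the database is baked in at build time, it can help to confirm that the index files exist before building. The sketch below assumes the protein index extensions named later in this README (`.phr`, `.pin`, `.psq`). The helper name `has_blast_index` is hypothetical.

```bash
# Hypothetical helper: succeed only if DB_DIR holds protein BLAST index files.
# Globs are used so that multi-volume databases are also matched.
has_blast_index() {
  db_dir="$1"
  for ext in phr pin psq; do
    # ls exits non-zero when the glob matches nothing
    ls "$db_dir"/*."$ext" >/dev/null 2>&1 || {
      echo "missing .$ext file in $db_dir" >&2
      return 1
    }
  done
}
```

A possible use is `has_blast_index external_dbs/bt_toxin/db && docker compose build`.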
### Environment Definition (`pixi.toml`)
- `bttoxin_digger = "==1.0.10"`
- `perl = "==5.26.2"` (Legacy requirement)
- `blast = "==2.16.0"` (Upgraded for v5 DB support)
- `channel-priority = "disabled"`
## 3. Development / Manual Usage
If you want to run without Docker using local Pixi:
```bash
# Install environment
pixi install
# Link the database (required manually if not using Docker)
# The Dockerfile does this automatically by copying files.
ENV_BIN=.pixi/envs/default/bin
rm -rf "$ENV_BIN/BTTCMP_db/bt_toxin"
ln -sfn "$(pwd)/external_dbs/bt_toxin" "$ENV_BIN/BTTCMP_db/bt_toxin"
# Run
pixi run BtToxin_Digger --help
```
## 2) Database wiring (BLAST v4 vs v5)
The external BTTCMP database under `external_dbs/bt_toxin` ships with a BLAST
v5 index (built by newer BLAST+). If you run with BLAST 2.7, you must rebuild
v4 databases; with BLAST >= 2.10, you can use the v5 database directly.
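The version rule above can be checked mechanically. This is a minimal sketch, assuming `sort -V` is available (GNU coreutils or BusyBox) and that the first line of `blastp -version` has the form `blastp: 2.16.0+`. The helper names `version_ge` and `parse_blast_version` are hypothetical; the 2.10 threshold is the one stated in this README.

```bash
# Hypothetical helper: succeed if version $1 is at least version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: read `blastp -version` output on stdin and print the
# bare version number taken from the first line.
parse_blast_version() {
  sed -n '1s/^[^:]*: *\([0-9][0-9.]*\).*/\1/p'
}
```

For example, `version_ge "$(blastp -version | parse_blast_version)" 2.10.0 || echo "rebuild a v4 database"`.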
### Recommended: use the shared `external_dbs` (no copy)
Keep a single source of truth and link it into the pixi environment:
## 4. Repository Layout
```
ENV_BIN=/home/zly/project/bttoxin-pipeline/runs/bttoxin_digger_v5_repro/.pixi/envs/default/bin
ln -sfn /home/zly/project/bttoxin-pipeline/external_dbs/bt_toxin \
"$ENV_BIN/BTTCMP_db/bt_toxin"
.
├── docker/
│ └── Dockerfile # Docker build definition
├── docker-compose.yml # Local test orchestration
├── external_dbs/ # Database source (copied into image)
│ └── bt_toxin/ # The actual database files
├── pixi.toml # Environment dependencies
├── pixi.lock # Exact version lock
└── examples/ # Test inputs and outputs
```
This avoids duplicating a large database inside the repo.
## 5. Updating the Database (Important for Future Updates)
### Optional: freeze a snapshot inside this repo
The database consists of two parts in `external_dbs/bt_toxin`:
1. **`seq/` Directory**: Contains the raw FASTA sequence files (e.g., `bt_toxin20251104.fas`).
2. **`db/` Directory**: Contains the BLAST indices (`.phr`, `.pin`, `.psq`) generated from the sequences.
If you want this repo to be self-contained, copy a snapshot and point the
environment at it (note: consider Git LFS if you intend to push it):
**Relationship**: The files in `db/` are **generated from** the FASTA files in `seq/` using `makeblastdb`. The filename of the source FASTA (e.g., `bt_toxin20251104.fas`) is embedded in the metadata of the `db/` files.
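Since the indices are derived from the FASTA file, a stale index is a plausible failure mode after an update. The sketch below flags index files that are older than the FASTA. It relies on file modification times, which is only a heuristic (a fresh `git checkout` or `cp` without `-a` resets them). The helper name `index_is_fresh` is hypothetical.

```bash
# Hypothetical helper: succeed if every protein index file in DB_DIR exists
# and is not older than FASTA (compared by modification time).
index_is_fresh() {
  fasta="$1"
  db_dir="$2"
  for f in "$db_dir"/*.phr "$db_dir"/*.pin "$db_dir"/*.psq; do
    # An unmatched glob stays literal, so -e also catches missing files
    if [ ! -e "$f" ]; then return 1; fi
    if [ "$f" -ot "$fasta" ]; then return 1; fi
  done
  return 0
}
```

For example, `index_is_fresh external_dbs/bt_toxin/seq/bt_toxin20251104.fas external_dbs/bt_toxin/db || echo "re-run makeblastdb"`.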
```
SNAPSHOT=/home/zly/project/bttoxin-pipeline/runs/bttoxin_digger_v5_repro/external_dbs_snapshot
mkdir -p "$SNAPSHOT"
cp -a /home/zly/project/bttoxin-pipeline/external_dbs/bt_toxin "$SNAPSHOT/"
ln -sfn "$SNAPSHOT/bt_toxin" "$ENV_BIN/BTTCMP_db/bt_toxin"
```
### How to Update (e.g., for 2026/2027 data)
Rebuild `bt_toxin` using the external FASTA:
If a new database version is released (e.g., from https://github.com/liaochenlanruo/BtToxin_Digger), follow these steps:
```
ENV_BIN=/home/zly/project/bttoxin-pipeline/runs/bttoxin_digger_v5_repro/.pixi/envs/default/bin
V4_DB=/home/zly/project/bttoxin-pipeline/runs/bttoxin_digger_v5_repro/bt_toxin_v4
1. **Download New Sequences**:
Place the new FASTA file (e.g., `bt_toxin2026xxxx.fas`) into `external_dbs/bt_toxin/seq/`.
mkdir -p "$V4_DB"
cp -a /home/zly/project/bttoxin-pipeline/external_dbs/bt_toxin/db "$V4_DB/"
ln -sfn /home/zly/project/bttoxin-pipeline/external_dbs/bt_toxin/seq "$V4_DB/seq"
2. **Generate New Indices (Critical Step)**:
You must regenerate the indices in `external_dbs/bt_toxin/db/`. You can use a temporary container or local BLAST+ to do this.
"$ENV_BIN/makeblastdb" \
-in /home/zly/project/bttoxin-pipeline/external_dbs/bt_toxin/seq/bt_toxin20251104.fas \
-dbtype prot \
-out "$V4_DB/db/bt_toxin" \
-parse_seqids
```bash
# Example using the local pixi environment (if installed)
# Or use a container with blast installed
makeblastdb \
-in external_dbs/bt_toxin/seq/bt_toxin2026xxxx.fas \
-dbtype prot \
-out external_dbs/bt_toxin/db/bt_toxin \
-parse_seqids
```
ln -sfn "$V4_DB" "$ENV_BIN/BTTCMP_db/bt_toxin"
```
*Note: The `-out` parameter must end with `bt_toxin` to match what the tool expects.*
For BLAST v5 (current pixi.toml), point back to the external DB:
3. **Rebuild Docker Image**:
The Dockerfile copies `external_dbs/bt_toxin` into the image. You must rebuild it to include the changes.
```
ln -sfn /home/zly/project/bttoxin-pipeline/external_dbs/bt_toxin \
"$ENV_BIN/BTTCMP_db/bt_toxin"
```
```bash
docker compose build --no-cache
```
Rebuild the negative-set (back) database bundled with BtToxin_Digger:
```
"$ENV_BIN/makeblastdb" \
-in "$ENV_BIN/BTTCMP_db/back/seq/negative_set-20210607" \
-dbtype prot \
-out "$ENV_BIN/BTTCMP_db/back/db/back" \
-parse_seqids
```
## 3) Run BtToxin_Digger (assembled genome)
`run_digger_pixi.sh` sets `RATTLER_CACHE_DIR` inside this directory so pixi can
write its cache in the workspace (the default `~/.cache` path is blocked by the
sandbox).
Example for a single `.fna` (use a clean working directory):
```
mkdir -p /home/zly/project/bttoxin-pipeline/runs/bttoxin_digger_v5_repro/work/C15_pixi_run_v5
cd /home/zly/project/bttoxin-pipeline/runs/bttoxin_digger_v5_repro/work/C15_pixi_run_v5
bash ../run_digger_pixi.sh ../examples/inputs .fna 4
```
If you want to bind `external_dbs/bt_toxin` explicitly:
```
bash ../run_digger_pixi.sh ../examples/inputs .fna 4 /home/zly/project/bttoxin-pipeline/external_dbs/bt_toxin
```
Outputs land under `Results/` in the working directory.
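### Parameters (pixi `run_digger_pixi.sh`)
- `input_dir`: input directory (containing the `.fna` files)
- `scaf_suffix`: input file suffix (for example, `.fna`)
- `threads`: number of threads (default: 4)
- `bttoxin_db_dir`: path to an external bt_toxin database (optional)
### Consistency with scripts/run_single_fna_pipeline.py
The pixi script calls BtToxin_Digger with the same arguments as the docker invocation in
`scripts/run_single_fna_pipeline.py`. The core arguments map as follows:
- `--SeqPath <dir>`: input directory
- `--SequenceType nucl`: nucleotide input
- `--Scaf_suffix .fna`: file suffix
- `--threads 4`: number of threads
Differences:
- The docker version automatically binds `external_dbs/bt_toxin` (if it exists) and collects the outputs under
`runs/<out_root>/digger`; the pixi version writes `Results/` into the current working directory by default.
- `scripts/run_single_fna_pipeline.py` goes on to run Shotter + report;
the pixi script runs only BtToxin_Digger itself.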
## 4) Outputs and comparison (examples)
Inputs copied into this workspace:
- `runs/bttoxin_digger_v5_repro/examples/inputs/C15.fna`
- `runs/bttoxin_digger_v5_repro/examples/inputs/HAN055.fna`
- Example pixi runs:
- `runs/bttoxin_digger_v5_repro/examples/C15_pixi_v5`
- `runs/bttoxin_digger_v5_repro/examples/HAN055_pixi_v5_clean`
- Example docker runs:
- `runs/bttoxin_digger_v5_repro/examples/C15_docker/digger`
- `runs/bttoxin_digger_v5_repro/examples/HAN055_docker/digger`
See `runs/bttoxin_digger_v5_repro/examples/COMPARE_REPORT.md` for the comparison summary.
Diff files:
- `runs/bttoxin_digger_v5_repro/examples/diffs/C15_docker_vs_pixi_v5.diff`
- `runs/bttoxin_digger_v5_repro/examples/diffs/HAN055_docker_vs_pixi_v5_clean.diff`
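The README does not say how these diff files were produced. One plausible way, shown here purely as an assumption, is a recursive unified diff of the two result trees. The helper name `compare_runs` is hypothetical.

```bash
# Hypothetical helper: write a recursive unified diff of two result trees to
# a file. Exit status follows diff: 0 if identical, 1 if they differ.
compare_runs() {
  diff -ru "$1" "$2" > "$3"
}
```

For example, `compare_runs examples/C15_docker/digger examples/C15_pixi_v5 C15.diff` leaves an empty `C15.diff` when the outputs match.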
## 5) External DB update (v5)
When `external_dbs/bt_toxin` is updated from the BtToxin_Digger repo, the BLAST
database is v5, which requires BLAST >= 2.10.0. That is why this pixi
environment upgrades BLAST to 2.16.0.
After updating `external_dbs/bt_toxin`, ensure the pixi environment still points
to that directory (see Section 2). With BLAST 2.16.0, no re-index is needed
because the upstream repo already ships v5 indices. If you downgrade BLAST to
2.7, rebuild a v4 DB (Section 2).
### Update steps
```bash
mkdir -p external_dbs
rm -rf external_dbs/bt_toxin tmp_bttoxin_repo
git clone --filter=blob:none --no-checkout https://github.com/liaochenlanruo/BtToxin_Digger.git tmp_bttoxin_repo
cd tmp_bttoxin_repo
git sparse-checkout init --cone
git sparse-checkout set BTTCMP_db/bt_toxin
git checkout master
# Copy the directory into your project's external_dbs
cd ..
cp -a tmp_bttoxin_repo/BTTCMP_db/bt_toxin external_dbs/bt_toxin
# Clean up the temporary repo
rm -rf tmp_bttoxin_repo
```
### Verify the database binding
```bash
# Check that the database files are complete
ls -lh external_dbs/bt_toxin/db/
# Verify that the container can access the bound database
docker run --rm \
-v "$(pwd)/external_dbs/bt_toxin:/usr/local/bin/BTTCMP_db/bt_toxin:ro" \
quay.io/biocontainers/bttoxin_digger:1.0.10--hdfd78af_0 \
bash -lc 'ls -lh /usr/local/bin/BTTCMP_db/bt_toxin/db | head'
```
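The output should list the `.pin`/`.psq`/`.phr` files, with timestamps and sizes matching those on the host, which confirms that the bind mount works.
### Running the pipeline with the external database
The script automatically detects the `external_dbs/bt_toxin` directory and binds it if it exists: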
```bash
# Automatically use external_dbs/bt_toxin (recommended)
uv run python scripts/run_single_fna_pipeline.py --fna tests/test_data/HAN055.fna
# Or specify the database path manually
uv run python scripts/run_single_fna_pipeline.py \
--fna tests/test_data/HAN055.fna \
--bttoxin_db_dir /path/to/custom/bt_toxin
```
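### Notes
- The `db/` directory is required: at run time BLAST reads only the index files under `db/`.
- The `seq/` directory is optional: it is kept only for archiving or for regenerating the indices.
- The bind mount is read-only (`ro`), which prevents the container from accidentally modifying the host database.
- No re-indexing is needed: the GitHub repository already includes prebuilt BLAST indices.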
## 6) Repository layout
```
runs/bttoxin_digger_v5_repro/
├─ .pixi/ # pixi environment cache
├─ pixi.toml # environment definition (bttoxin_digger + blast)
├─ pixi.lock # resolved environment
├─ run_digger_pixi.sh # wrapper to run BtToxin_Digger in this env
├─ README.md
└─ examples/
├─ inputs/ # copied test inputs (C15.fna, HAN055.fna)
├─ C15_pixi_v5/ # pixi run output (example)
├─ HAN055_pixi_v5_clean/ # pixi run output (example)
├─ C15_docker/ # docker output copy (baseline)
├─ HAN055_docker/ # docker output copy (baseline)
├─ diffs/ # docker vs pixi diffs
└─ COMPARE_REPORT.md
```
4. **Verify**:
Check the database version inside the new container:
```bash
docker compose run --rm digger-repro pixi run blastdbcmd -db /app/.pixi/envs/default/bin/BTTCMP_db/bt_toxin/db/bt_toxin -info
```