docs: update README and add pixi-based tests

- Add property-based tests for PixiRunner
- Add HAN055.fna test data file
- Update README with pixi installation and usage guide
- Update .gitignore for pixi and test artifacts
- Update CLI to remove Docker-related arguments
2026-01-08 16:59:17 +08:00
parent ae4c6351d9
commit 8d11216481
6 changed files with 76125 additions and 181 deletions

448
README.md

@@ -1,169 +1,199 @@
# BtToxin Pipeline
Automated Bacillus thuringiensis toxin mining system with CI/CD integration.
Automated Bacillus thuringiensis toxin mining system using pixi-managed environments.
## Quick Start (Single-Machine Deployment)
### uv .venv
```bash
uv venv --managed-python -p 3.12 --seed .venv
uv pip install -r backend/requirements.txt
```
## Quick Start
### Prerequisites
- Docker / Podman
- Python 3.10+
- Node.js 18+
- [pixi](https://pixi.sh) - Modern package manager for conda environments
- Linux x86_64 (linux-64 platform)
### Installation
1. Install pixi (if not already installed):
```bash
# Linux/macOS
curl -fsSL https://pixi.sh/install.sh | bash
# Or via Homebrew
brew install pixi
```
2. Clone and setup the project:
### Development Setup
```bash
# 1. Clone and setup
git clone <your-repo>
cd bttoxin-pipeline
# 2. Initialize and start with the Makefile (single machine)
make setup
make start
# 3. Initialize the database (create tables)
make db-init
# 4. Access the services
# API: http://localhost:8000/docs
# Flower: http://localhost:5555
# Frontend: http://localhost:3000
# (Optional) Local development
# Backend: uvicorn app.main:app --reload
# Frontend: npm run dev
# Install all environments (digger + pipeline)
pixi install
```
## Architecture
This creates two isolated environments:
- `digger`: BtToxin_Digger with bioconda dependencies (perl, blast, etc.)
- `pipeline`: Python analysis tools (pandas, matplotlib, seaborn)
Nginx (Reverse Proxy)
├── Frontend (Vue 3 Static)
└── Backend (FastAPI + Swagger)
├── PostgreSQL (SQLModel via SQLAlchemy)
├── Redis (Broker/Result)
├── Celery (Worker/Beat + Flower)
└── Docker Engine (BtToxin_Digger)
### Running the Pipeline
## Documentation
#### Full Pipeline (Recommended)
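- API docs: open `http://localhost:8000/docs` in a browser
- Single-machine orchestration: `docker/docker-compose.yml` (the single source of truth)
- Example environment variables: `backend/.env.example`
- Common commands: `make help`
### Notes on macOS + Podman
- On macOS, Podman runs inside a virtual machine, so write access may be restricted when a host directory is bind-mounted into a container.
- The run logic already special-cases macOS: it copies the input into `/tmp/input` inside the container, runs BtToxin_Digger under `/tmp`, and afterwards copies `Results/` and the key outputs back to the mounted `/workspace` (the host output directory).
- If write problems persist:
- In the virtual machine's shared directories in Podman Desktop, add the project path and enable write access.
- If needed, enable rootful mode and restart: `podman machine stop && podman machine set --rootful && podman machine start`
- Verify the mount manually: `podman run --rm -v $(pwd)/tests/output:/workspace:rw alpine sh -lc 'echo ok > /workspace/test.txt && ls -l /workspace'`
### Local Offline Container Test (Optional)
Run a minimal test with `scripts/test_bttoxin_digger.py`: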
Run the complete analysis pipeline with a single command:
```bash
uv run python scripts/test_bttoxin_digger.py
pixi run pipeline --fna tests/test_data/HAN055.fna
```
Requirements: `97-27.fna` and `C15.fna` must exist under `tests/test_data`. After a successful test, 6 key files appear under `tests/output/Results/Toxins`.
This executes three stages:
1. **Digger**: BtToxin_Digger toxin mining
2. **Shotter**: Toxin scoring and target prediction
3. **Plot**: Heatmap generation and report creation
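The three stages above can also be pictured as three separate `pixi run` invocations chained through the run directory. The following is a minimal sketch, not the orchestrator's actual code: the helper name `plan_stage_commands` is hypothetical, and it assumes that `digger-only` writes `Results/Toxins/All_Toxins.txt` directly under its `--out_dir`, which matches the documented output layout but is not confirmed here.

```python
from pathlib import Path
from typing import List, Optional


def plan_stage_commands(fna: str, out_root: Optional[str] = None) -> List[List[str]]:
    """Build one `pixi run` argument list per stage (digger, shotter, plot).

    Hypothetical helper: it only assembles the commands, it does not run them.
    """
    stem = Path(fna).stem
    # Documented default for the full pipeline: runs/<stem>_run
    root = Path(out_root) if out_root else Path("runs") / f"{stem}_run"
    digger_dir = root / "digger"
    shotter_dir = root / "shotter"
    # Assumption: Digger places All_Toxins.txt under <out_dir>/Results/Toxins/
    all_toxins = digger_dir / "Results" / "Toxins" / "All_Toxins.txt"
    scores = shotter_dir / "strain_target_scores.tsv"
    return [
        ["pixi", "run", "digger-only", "--fna", fna, "--out_dir", digger_dir.as_posix()],
        ["pixi", "run", "shotter", "--all_toxins", all_toxins.as_posix(),
         "--output_dir", shotter_dir.as_posix()],
        ["pixi", "run", "plot", "--strain_scores", scores.as_posix(),
         "--out_dir", shotter_dir.as_posix()],
    ]


if __name__ == "__main__":
    import shlex

    for cmd in plan_stage_commands("tests/test_data/HAN055.fna"):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order, so a failing stage stops the chain before the next one reads missing inputs.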
#### Input File Format
#### CLI Options
A .fna file is a FASTA-format nucleotide sequence file containing the complete genome sequence of a bacterium:
```bash
pixi run pipeline --fna <file> [options]
- **97-27.fna**: complete genome sequence of Bacillus thuringiensis strain 97-27
- **C15.fna**: complete genome sequence of Bacillus thuringiensis strain C15
Example of the file format:
```>NZ_CP010088.1 Bacillus thuringiensis strain 97-27 chromosome, complete genome
TAATGTAACACCAGTAAATATTTCATTCATATATTCTTTTAACTGTATTTTATATTCTTTCTACTCTACAATTTCTTTTA
ACTGCCAATATGCATCTTCTAGCCAAGGGTGTAAAACTTTCAACGTGTCTTTTCTATCCCACAAATATGAAATATATGCA
...
Options:
--fna PATH Input .fna file (required)
--out_root PATH Output directory (default: runs/<stem>_run)
--toxicity_csv PATH Toxicity data CSV (default: Data/toxicity-data.csv)
--min_identity FLOAT Minimum identity threshold 0-1 (default: 0.0)
--min_coverage FLOAT Minimum coverage threshold 0-1 (default: 0.0)
--disallow_unknown_families Exclude unknown toxin families
--require_index_hit Keep only hits with known specificity
--lang {zh,en} Report language (default: zh)
--bttoxin_db_dir PATH Custom bt_toxin database directory
--threads INT Number of threads (default: 4)
```
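To make the defaults in the option table concrete, here is a minimal `argparse` sketch that mirrors the documented flags. It is an illustration under stated assumptions, not the project's actual parser: `build_parser` and `parse_args` are hypothetical names, and the 0-1 range check is an assumption drawn from the option descriptions.

```python
import argparse
from pathlib import Path
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented `pixi run pipeline` options (illustrative only)."""
    p = argparse.ArgumentParser(prog="pipeline")
    p.add_argument("--fna", type=Path, required=True, help="Input .fna file")
    p.add_argument("--out_root", type=Path, default=None)
    p.add_argument("--toxicity_csv", type=Path, default=Path("Data/toxicity-data.csv"))
    p.add_argument("--min_identity", type=float, default=0.0)
    p.add_argument("--min_coverage", type=float, default=0.0)
    p.add_argument("--disallow_unknown_families", action="store_true")
    p.add_argument("--require_index_hit", action="store_true")
    p.add_argument("--lang", choices=["zh", "en"], default="zh")
    p.add_argument("--bttoxin_db_dir", type=Path, default=None)
    p.add_argument("--threads", type=int, default=4)
    return p


def parse_args(argv: List[str]) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # Assumption: thresholds are fractions in [0, 1], as the option table states.
    for name in ("min_identity", "min_coverage"):
        value = getattr(args, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"--{name} must be between 0 and 1, got {value}")
    # Documented default output directory: runs/<stem>_run
    if args.out_root is None:
        args.out_root = Path("runs") / f"{args.fna.stem}_run"
    return args


if __name__ == "__main__":
    print(parse_args(["--fna", "tests/test_data/HAN055.fna"]))
```

Note that the thresholds default to `0.0`, so a run without filtering flags keeps every hit.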
#### Interpreting the Mining Results
#### Examples
After the analysis completes, BtToxin_Digger generates the following key result files:
```bash
# Basic run with default settings
pixi run pipeline --fna tests/test_data/C15.fna
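**1. Strain toxin list file (`.list`)**
- Contains the detailed classification of each class of toxin protein predicted in each strain
- Toxin types include: Cry, Cyt, Vip, Others, App, Gpp, Mcf, Mpf, Mpp, Mtx, Pra, Prb, Spp, Tpp, Vpa, Vpb, Xpp
- For each toxin it lists the protein ID, length, rank (Rank1-4), BLAST result, best hit, coverage, identity, and the SVM and HMM predictions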
# Strict filtering for high-confidence results
pixi run pipeline --fna tests/test_data/HAN055.fna \
--min_identity 0.50 --min_coverage 0.60 \
--disallow_unknown_families --require_index_hit
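**2. GenBank-format file (`.gbk`)**
- Contains detailed annotation of the predicted toxin genes
- Records gene location, protein description, BLAST alignment details, prediction results, and so on
- Can be used for downstream functional analysis and visualization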
# English report with custom output directory
pixi run pipeline --fna tests/test_data/HAN055.fna \
--out_root runs/HAN055_strict --lang en
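**3. Summary table (`Bt_all_genes.table`)**
- Summary table of toxin genes across all strains
- Shows the count and identity of each type of toxin gene in each strain
**4. Full toxin list (`All_Toxins.txt`)**
- Contains complete information on every predicted toxin gene
- Fields include strain, protein ID, protein length, strand, gene location, SVM prediction, BLAST result, HMM result, hit ID, alignment length, identity, E-value, and others
**Example test results**
- Strain 97-27: 12 toxin genes predicted, including InhA1/2, Bmp1, Spp1Aa1, Zwa5A/6, and others
- Strain C15: several Cry toxin genes predicted (Cry21Aa2, Cry21Aa3, Cry21Ca2, Cry5Ba1) along with other accessory toxins
- Toxin ranks run from Rank1 to Rank4; Rank1 is the highest confidence and Rank4 the lowest
- Identity ranges from 27.62% to 100%, indicating how similar each hit is to known toxins
### Single-Directory Layout (Reliable Cross-Platform Writes)
- Before the run, the program copies the input files into an `input_files/` subdirectory of the host output directory; the container mounts only that output directory (read-write) as `/workspace`.
- At run time the tool's `--SeqPath` points to `/workspace/input_files` and the working directory is fixed at `/workspace`, so all results and intermediate files land under `tests/output/` on the host.
Directory example: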
```
tests/output/
├── input_files/            # Copies of the input files
│   ├── 97-27.fna
│   └── C15.fna
├── Results/                # BtToxin_Digger output
│   └── Toxins/
│       ├── 97-27.list
│       ├── 97-27.gbk
│       └── ...
├── StatsFiles/             # Statistics files (if any)
├── All_Toxins.txt
└── BtToxin_Digger.log
# Use custom database
pixi run pipeline --fna tests/test_data/HAN055.fna \
--bttoxin_db_dir /path/to/custom/bt_toxin
```
## Updating bttoxin_db
### Individual Stage Commands
The database bundled in the BtToxin_Digger container is old (August 2021); using the latest database from the official GitHub repository is recommended.
Run stages separately when needed:
### Database Directory Structure
#### Digger Only
```
external_dbs/bt_toxin/
├── db/                     # BLAST index files (required at run time)
│   ├── bt_toxin.phr
│   ├── bt_toxin.pin
│   ├── bt_toxin.psq
│   ├── bt_toxin.pdb
│   ├── bt_toxin.pjs
│   ├── bt_toxin.pot
│   ├── bt_toxin.ptf
│   ├── bt_toxin.pto
│   └── old/
└── seq/                    # Sequence source files (for archiving/updating)
    ├── bt_toxin20251104.fas
    └── ...
```bash
pixi run digger-only --fna <file> [options]
Options:
--fna PATH Input .fna file (required)
--out_dir PATH Output directory (default: runs/<stem>_digger_only)
--bttoxin_db_dir PATH Custom database directory
--threads INT Number of threads (default: 4)
--sequence_type Sequence type: nucl/orfs/prot/reads (default: nucl)
```
### Update Steps
Example:
```bash
pixi run digger-only --fna tests/test_data/C15.fna --threads 8
```
#### Shotter (Scoring)
```bash
pixi run shotter [options]
Options:
--toxicity_csv PATH Toxicity data CSV
--all_toxins PATH All_Toxins.txt from Digger
--output_dir PATH Output directory
--min_identity FLOAT Minimum identity threshold
--min_coverage FLOAT Minimum coverage threshold
--allow_unknown_families / --disallow_unknown_families
--require_index_hit Keep only indexed hits
```
Example:
```bash
pixi run shotter \
--all_toxins runs/C15_run/digger/Results/Toxins/All_Toxins.txt \
--output_dir runs/C15_run/shotter
```
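The filtering options listed for `shotter` can be read as a per-hit predicate. The sketch below shows one plausible reading and is not the module's actual logic: the `Hit` record and `filter_hits` function are hypothetical, identity and coverage are assumed to be fractions in 0-1, and an unknown family is assumed to be represented as `None`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Hit:
    """Hypothetical per-hit record; field names are illustrative."""
    hit_id: str
    identity: float          # fraction in [0, 1] (assumed)
    coverage: float          # fraction in [0, 1] (assumed)
    family: Optional[str]    # None stands for an unknown family (assumed)
    in_index: bool           # True if the hit has known specificity


def filter_hits(
    hits: Iterable[Hit],
    min_identity: float = 0.0,
    min_coverage: float = 0.0,
    allow_unknown_families: bool = True,
    require_index_hit: bool = False,
) -> List[Hit]:
    """Keep only the hits that pass every enabled filter."""
    kept = []
    for hit in hits:
        if hit.identity < min_identity or hit.coverage < min_coverage:
            continue
        if not allow_unknown_families and hit.family is None:
            continue
        if require_index_hit and not hit.in_index:
            continue
        kept.append(hit)
    return kept


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    demo = [
        Hit("Cry21Aa2", 0.95, 0.98, "Cry21", True),
        Hit("novel_1", 0.30, 0.90, None, False),
    ]
    print([h.hit_id for h in filter_hits(demo, min_identity=0.5)])
```

With all defaults, every hit is kept, which matches the `0.0` default thresholds documented for the full pipeline.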
#### Plot (Visualization)
```bash
pixi run plot [options]
Options:
--strain_scores PATH strain_target_scores.tsv from Shotter
--toxin_support PATH toxin_support.tsv (optional)
--species_scores PATH strain_target_species_scores.tsv (optional)
--out_dir PATH Output directory
--cmap STRING Colormap (default: viridis)
--per_hit_strain NAME Generate per-hit heatmap for specific strain
--merge_unresolved Merge other/unknown into unresolved
--report_mode {summary,paper} Report style (default: paper)
--lang {zh,en} Report language (default: zh)
```
Example:
```bash
pixi run plot \
--strain_scores runs/C15_run/shotter/strain_target_scores.tsv \
--toxin_support runs/C15_run/shotter/toxin_support.tsv \
--out_dir runs/C15_run/shotter \
--per_hit_strain C15 --lang en
```
## Output Structure
After running the pipeline:
```
runs/<strain>_run/
├── stage/                          # Staged input file
│   └── <strain>.fna
├── digger/                         # BtToxin_Digger outputs
│   ├── Results/
│   │   └── Toxins/
│   │       ├── All_Toxins.txt
│   │       ├── <strain>.list
│   │       ├── <strain>.gbk
│   │       └── Bt_all_genes.table
│   └── BtToxin_Digger.log
├── shotter/                        # Shotter outputs
│   ├── strain_target_scores.tsv
│   ├── strain_scores.json
│   ├── toxin_support.tsv
│   ├── strain_target_species_scores.tsv
│   ├── strain_species_scores.json
│   ├── strain_target_scores.png
│   ├── strain_target_species_scores.png
│   ├── per_hit_<strain>.png
│   └── shotter_report_paper.md
├── logs/
│   └── digger_execution.log
└── pipeline_results.tar.gz         # Bundled results
```
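A quick way to confirm that a run finished is to check for a few of the files in the layout above. This is a minimal sketch using only the standard library; `missing_outputs` and the `EXPECTED` list are hypothetical, and the selection of "required" files is an assumption rather than something the pipeline defines.

```python
from pathlib import Path
from typing import List, Sequence

# Assumed subset of the documented layout that a complete run should contain.
EXPECTED: Sequence[str] = (
    "digger/Results/Toxins/All_Toxins.txt",
    "shotter/strain_target_scores.tsv",
    "shotter/toxin_support.tsv",
    "pipeline_results.tar.gz",
)


def missing_outputs(run_dir: str, expected: Sequence[str] = EXPECTED) -> List[str]:
    """Return the expected relative paths that are not regular files under run_dir."""
    root = Path(run_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_outputs("runs/HAN055_run")
    print("complete" if not missing else f"missing: {missing}")
```

An empty list means every checked file exists; anything else points at the stage that likely failed.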
## Database Update
BtToxin_Digger's built-in database may be outdated. To use the latest version from the official GitHub repository:
### Update Steps
```bash
mkdir -p external_dbs
@@ -176,49 +206,149 @@ git sparse-checkout init --cone
git sparse-checkout set BTTCMP_db/bt_toxin
git checkout master
# Copy the directory into your project's external_dbs
cd ..
cp -a tmp_bttoxin_repo/BTTCMP_db/bt_toxin external_dbs/bt_toxin
# Clean up the temporary repo
rm -rf tmp_bttoxin_repo
```
### Verifying the Database Mount
The pipeline automatically detects `external_dbs/bt_toxin` if present.
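The detection order can be sketched as follows: an explicit `--bttoxin_db_dir` wins, then `external_dbs/bt_toxin` if present, otherwise the built-in database. This is an illustrative sketch only; `resolve_db_dir` is a hypothetical name, and checking for the `db/` subdirectory is an assumption based on the fact that the BLAST index files live there.

```python
from pathlib import Path
from typing import Optional


def resolve_db_dir(explicit: Optional[str] = None, project_root: str = ".") -> Optional[Path]:
    """Pick the bt_toxin database directory, or None for the built-in one."""
    if explicit:
        # A user-supplied path always takes precedence.
        return Path(explicit)
    candidate = Path(project_root) / "external_dbs" / "bt_toxin"
    # Assumption: a usable database has a db/ subdirectory with BLAST indexes.
    if (candidate / "db").is_dir():
        return candidate
    return None  # fall back to the database shipped with BtToxin_Digger


if __name__ == "__main__":
    print(resolve_db_dir())
```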
```bash
# Check that the database files are complete
ls -lh external_dbs/bt_toxin/db/
### Database Structure
# Verify that the container can access the mounted database
docker run --rm \
-v "$(pwd)/external_dbs/bt_toxin:/usr/local/bin/BTTCMP_db/bt_toxin:ro" \
quay.io/biocontainers/bttoxin_digger:1.0.10--hdfd78af_0 \
bash -lc 'ls -lh /usr/local/bin/BTTCMP_db/bt_toxin/db | head'
```
external_dbs/bt_toxin/
├── db/                     # BLAST index files (required)
│   ├── bt_toxin.phr
│   ├── bt_toxin.pin
│   ├── bt_toxin.psq
│   └── ...
└── seq/                    # Source sequences (optional, for reference)
    └── bt_toxin*.fas
```
The output should list the `.pin`/`.psq`/`.phr` files with timestamps and sizes matching the host, which confirms the mount works.
## Input File Format
### Running the Pipeline with an External Database
`.fna` files are FASTA-format nucleotide sequence files containing bacterial genome sequences:
The script automatically detects the `external_dbs/bt_toxin` directory and mounts it if it exists:
```bash
# Automatically use external_dbs/bt_toxin (recommended)
uv run python scripts/run_single_fna_pipeline.py --fna tests/test_data/HAN055.fna
# Or specify the database path manually
uv run python scripts/run_single_fna_pipeline.py \
--fna tests/test_data/HAN055.fna \
--bttoxin_db_dir /path/to/custom/bt_toxin
```
>NZ_CP010088.1 Bacillus thuringiensis strain 97-27 chromosome, complete genome
TAATGTAACACCAGTAAATATTTCATTCATATATTCTTTTAACTGTATTTTATATTCTTTCTACTCTACAATTTCTTTTA
ACTGCCAATATGCATCTTCTAGCCAAGGGTGTAAAACTTTCAACGTGTCTTTTCTATCCCACAAATATGAAATATATGCA
...
```
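For a quick sanity check of an input file, the FASTA layout shown above (a `>` header line followed by sequence lines) can be read with a few lines of standard-library Python. This is a minimal sketch, and `fasta_lengths` is a hypothetical helper, not part of the pipeline.

```python
from typing import Dict, Iterable


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each record ID (first token after '>') to its sequence length."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


if __name__ == "__main__":
    with open("tests/test_data/HAN055.fna") as handle:
        for record_id, length in fasta_lengths(handle).items():
            print(record_id, length)
```

A file that yields no records, or records of length zero, is probably not a valid `.fna` input.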
### Notes
## Result Interpretation
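- **The db/ directory is required**: at run time BLAST reads only the index files under `db/`
- **The seq/ directory is optional**: it is kept only for archiving or for regenerating the index
- **The mount is read-only (ro)**: this prevents the container from accidentally modifying the host database
- **No re-indexing is needed**: the GitHub repository already ships prebuilt BLAST indexes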
### Key Output Files
**All_Toxins.txt** - Complete toxin predictions with:
- Strain, Protein ID, coordinates
- SVM/BLAST/HMM predictions
- Hit ID, alignment length, identity, E-value
**strain_target_scores.tsv** - Strain-level target predictions:
- TopOrder: Most likely target insect order
- TopScore: Confidence score (0-1)
- Per-order scores for all target orders
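One way to read the relationship between these columns: if `TopOrder` is simply the order with the highest per-order score, it can be recomputed from the per-order values. That relationship is an assumption made for illustration, not a statement of how Shotter derives the columns, and `top_order` is a hypothetical helper.

```python
from typing import Mapping, Tuple


def top_order(per_order_scores: Mapping[str, float]) -> Tuple[str, float]:
    """Return (order, score) for the highest-scoring target order.

    Assumes TopOrder/TopScore correspond to the maximum per-order score.
    Ties resolve to the first order encountered.
    """
    if not per_order_scores:
        raise ValueError("no per-order scores given")
    best = max(per_order_scores, key=lambda name: per_order_scores[name])
    return best, per_order_scores[best]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    print(top_order({"Lepidoptera": 0.2, "Nematoda": 0.8, "Diptera": 0.1}))
```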
**toxin_support.tsv** - Per-hit contribution details:
- Individual toxin weights and contributions
- Family classification and partner status
### Toxin Rankings
- **Rank1**: Highest confidence (identity ≥78%, coverage ≥80%)
- **Rank2-3**: Moderate confidence
- **Rank4**: Lowest confidence predictions
### Target Orders
Common target groups in predictions (mostly insect orders; Nematoda is a phylum of roundworms, not an insect order):
- **Lepidoptera**: Moths and butterflies
- **Coleoptera**: Beetles
- **Diptera**: Flies and mosquitoes
- **Hemiptera**: True bugs
- **Nematoda**: Roundworms
## Development
### Python Development Environment
For development work outside pixi:
```bash
uv venv --managed-python -p 3.12 --seed .venv
source .venv/bin/activate
uv pip install -e .
```
### Running Tests
```bash
# Run property-based tests
pixi run -e pipeline python -m pytest tests/test_pixi_runner.py -v
```
### Project Structure
```
bttoxin-pipeline/
├── pixi.toml # Pixi environment configuration
├── pyproject.toml # Python package configuration
├── scripts/ # Core pipeline scripts
│ ├── run_single_fna_pipeline.py # Main pipeline orchestrator
│ ├── run_digger_stage.py # Digger-only stage
│ ├── bttoxin_shoter.py # Toxin scoring module
│ ├── plot_shotter.py # Visualization & reporting
│ └── pixi_runner.py # PixiRunner class
├── bttoxin/ # Python package (CLI entry point)
│ ├── __init__.py
│ ├── api.py
│ └── cli.py
├── Data/ # Reference data
│ └── toxicity-data.csv # BPPRC specificity data
├── external_dbs/ # External databases (optional)
│ └── bt_toxin/ # Updated BtToxin database
├── tests/ # Test suite
│ ├── test_pixi_runner.py # Property-based tests
│ └── test_data/ # Test input files
├── docs/ # Documentation
├── runs/ # Pipeline outputs (gitignored)
├── backend/ # FastAPI backend (optional web service)
└── frontend/ # Vue.js frontend (optional web UI)
```
## Troubleshooting
### pixi not found
```bash
# Ensure pixi is in PATH
export PATH="$HOME/.pixi/bin:$PATH"
# Or reinstall
curl -fsSL https://pixi.sh/install.sh | bash
```
### Environment not found
```bash
# Reinstall environments
pixi install
```
### BtToxin_Digger not available
```bash
# Verify digger environment
pixi run -e digger BtToxin_Digger --help
```
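The first and third checks above can be combined into a small preflight step before launching a long run. This is a minimal sketch: `digger_check_command` is a hypothetical helper, and the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil
from typing import Callable, List, Optional


def digger_check_command(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the command that verifies the digger environment.

    Raises FileNotFoundError if pixi itself is not on PATH.
    """
    if which("pixi") is None:
        raise FileNotFoundError(
            "pixi not found on PATH; try: export PATH=\"$HOME/.pixi/bin:$PATH\""
        )
    return ["pixi", "run", "-e", "digger", "BtToxin_Digger", "--help"]


if __name__ == "__main__":
    import subprocess

    result = subprocess.run(digger_check_command(), capture_output=True, text=True)
    print("digger OK" if result.returncode == 0 else result.stderr)
```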
### Permission errors
Ensure you have write permission on the output directory (by default under `runs/`). The pipeline creates missing directories automatically.
## License