docs: update README and add pixi-based tests

- Add property-based tests for PixiRunner
- Add HAN055.fna test data file
- Update README with pixi installation and usage guide
- Update .gitignore for pixi and test artifacts
- Update CLI to remove Docker-related arguments
2026-01-08 16:59:17 +08:00
parent ae4c6351d9
commit 8d11216481
6 changed files with 76125 additions and 181 deletions
--- a/README.md
+++ b/README.md
@@ -1,169 +1,199 @@
 # BtToxin Pipeline

-Automated Bacillus thuringiensis toxin mining system with CI/CD integration.
+Automated Bacillus thuringiensis toxin mining system using pixi-managed environments.

-## Quick Start (Single-Machine Deployment)
-
-### uv .venv
-
-```bash
-uv venv --managed-python -p 3.12 --seed .venv
-uv pip install backend/requirements.txt
-```
+## Quick Start

 ### Prerequisites

- Docker / Podman
- Python 3.10+
- Node.js 18+
+- [pixi](https://pixi.sh) - Modern package manager for conda environments
+- Linux x86_64 (linux-64 platform)
+
+### Installation
+
+1. Install pixi (if not already installed):
+
+```bash
+# Linux/macOS
+curl -fsSL https://pixi.sh/install.sh | bash
+
+# Or via Homebrew
+brew install pixi
+```
+
+2. Clone and set up the project:

-### Development Setup
 ```bash
-# 1. Clone and setup
 git clone <your-repo>
 cd bttoxin-pipeline

-# 2. Initialize and start with the Makefile (single machine)
-make setup
-make start
-
-# 3. Initialize the database (create tables)
-make db-init
-
-# 4. Access the services
-# API:     http://localhost:8000/docs
-# Flower:  http://localhost:5555
-# Frontend:http://localhost:3000
-
-# (Optional) Local development
-# Backend: uvicorn app.main:app --reload
-# Frontend: npm run dev
+# Install all environments (digger + pipeline)
+pixi install
 ```

-## Architecture
+This creates two isolated environments:
+- `digger`: BtToxin_Digger with bioconda dependencies (perl, blast, etc.)
+- `pipeline`: Python analysis tools (pandas, matplotlib, seaborn)

-Nginx (Reverse Proxy)
-  ├── Frontend (Vue 3 Static)
-  └── Backend (FastAPI + Swagger)
-      ├── PostgreSQL (SQLModel via SQLAlchemy)
-      ├── Redis (Broker/Result)
-      ├── Celery (Worker/Beat + Flower)
-      └── Docker Engine (BtToxin_Digger)
+### Running the Pipeline

-## Documentation
+#### Full Pipeline (Recommended)

- API docs: open `http://localhost:8000/docs` in a browser
- Single-machine orchestration: `docker/docker-compose.yml` (the single source of truth)
- Example environment variables: `backend/.env.example`
- Common commands: `make help`
-
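-### Notes on Using macOS + Podman
-
- On macOS, Podman runs inside a virtual machine, so write permissions may be restricted when host directories are bind-mounted into a container.
- The run logic already special-cases macOS: it copies the input to `/tmp/input` inside the container, runs BtToxin_Digger under `/tmp`, and afterwards copies `Results/` and the key outputs back to the mounted `/workspace` (the host output directory).
- If write problems persist:
-  - In Podman Desktop's virtual machine shared directories, add the project path and enable write access.
-  - If needed, enable rootful mode and restart: `podman machine stop && podman machine set --rootful && podman machine start`
-  - Verify the mount manually: `podman run --rm -v $(pwd)/tests/output:/workspace:rw alpine sh -lc 'echo ok > /workspace/test.txt && ls -l /workspace'`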
-
-### Local Offline Container Test (Optional)
-
-Minimal test using `scripts/test_bttoxin_digger.py`:
+Run the complete analysis pipeline with a single command:

 ```bash
-uv run python scripts/test_bttoxin_digger.py
+pixi run pipeline --fna tests/test_data/HAN055.fna
 ```

-Requirements: `97-27.fna` and `C15.fna` must exist under `tests/test_data`; after a successful test, 6 key files appear in `tests/output/Results/Toxins`.
+This executes three stages:
+1. **Digger**: BtToxin_Digger toxin mining
+2. **Shotter**: Toxin scoring and target prediction
+3. **Plot**: Heatmap generation and report creation

-#### Input File Format
+#### CLI Options

-.fna files are FASTA-format nucleotide sequence files containing the complete genome sequence of a bacterium:
+```bash
+pixi run pipeline --fna <file> [options]

- **97-27.fna**: complete genome sequence of Bacillus thuringiensis strain 97-27
- **C15.fna**: complete genome sequence of Bacillus thuringiensis strain C15
-
-File format example:
-```>NZ_CP010088.1 Bacillus thuringiensis strain 97-27 chromosome, complete genome
-TAATGTAACACCAGTAAATATTTCATTCATATATTCTTTTAACTGTATTTTATATTCTTTCTACTCTACAATTTCTTTTA
-ACTGCCAATATGCATCTTCTAGCCAAGGGTGTAAAACTTTCAACGTGTCTTTTCTATCCCACAAATATGAAATATATGCA
-...
+Options:
+  --fna PATH              Input .fna file (required)
+  --out_root PATH         Output directory (default: runs/<stem>_run)
+  --toxicity_csv PATH     Toxicity data CSV (default: Data/toxicity-data.csv)
+  --min_identity FLOAT    Minimum identity threshold 0-1 (default: 0.0)
+  --min_coverage FLOAT    Minimum coverage threshold 0-1 (default: 0.0)
+  --disallow_unknown_families  Exclude unknown toxin families
+  --require_index_hit     Keep only hits with known specificity
+  --lang {zh,en}          Report language (default: zh)
+  --bttoxin_db_dir PATH   Custom bt_toxin database directory
+  --threads INT           Number of threads (default: 4)
 ```
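The options above map directly onto an argument list. The following is a minimal Python sketch for launching the pipeline from another script; `build_pipeline_cmd` is a hypothetical helper that is not part of this repository, and it only uses flags listed in the options table.

```python
import shlex
from typing import List, Optional


def build_pipeline_cmd(
    fna: str,
    min_identity: float = 0.0,
    min_coverage: float = 0.0,
    lang: str = "zh",
    threads: int = 4,
    out_root: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `pixi run pipeline` (hypothetical helper)."""
    # The README documents both thresholds as fractions in the range 0-1.
    for name, value in (("min_identity", min_identity), ("min_coverage", min_coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if lang not in ("zh", "en"):
        raise ValueError(f"lang must be 'zh' or 'en', got {lang!r}")

    cmd = [
        "pixi", "run", "pipeline",
        "--fna", fna,
        "--min_identity", str(min_identity),
        "--min_coverage", str(min_coverage),
        "--lang", lang,
        "--threads", str(threads),
    ]
    # --out_root is optional; the CLI falls back to its own default when omitted.
    if out_root is not None:
        cmd += ["--out_root", out_root]
    return cmd


if __name__ == "__main__":
    cmd = build_pipeline_cmd(
        "tests/test_data/HAN055.fna", min_identity=0.5, min_coverage=0.6, lang="en"
    )
    # Print a shell-safe command line; pass `cmd` to subprocess.run(cmd, check=True)
    # to actually execute it on a machine where pixi is installed.
    print(shlex.join(cmd))
```

Validating the thresholds before spawning the process gives an immediate error for a typo such as `50` instead of `0.50`, rather than a failure partway through a long run.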

-#### Interpreting the Mining Results
+#### Examples

-After the analysis completes, BtToxin_Digger generates the following key result files:
+```bash
+# Basic run with default settings
+pixi run pipeline --fna tests/test_data/C15.fna

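-**1. Strain toxin list file (`.list`)**
- Detailed classification of every toxin protein predicted in each strain
- Toxin types include: Cry, Cyt, Vip, Others, App, Gpp, Mcf, Mpf, Mpp, Mtx, Pra, Prb, Spp, Tpp, Vpa, Vpb, Xpp
- Each toxin shows: protein ID, length, rank (Rank1-4), BLAST result, best hit, coverage, identity, and SVM and HMM predictions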
+# Strict filtering for high-confidence results
+pixi run pipeline --fna tests/test_data/HAN055.fna \
+  --min_identity 0.50 --min_coverage 0.60 \
+  --disallow_unknown_families --require_index_hit

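-**2. GenBank format file (`.gbk`)**
- Detailed annotation of the predicted toxin genes
- Records gene location, protein description, BLAST alignment details, prediction results, and so on
- Can be used for downstream functional analysis and visualization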
+# English report with custom output directory
+pixi run pipeline --fna tests/test_data/HAN055.fna \
+  --out_root runs/HAN055_strict --lang en

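-**3. Summary table (`Bt_all_genes.table`)**
- Summary table of toxin genes across all strains
- Shows the number of toxin genes of each type in each strain, with identity information
-
-**4. Full toxin list (`All_Toxins.txt`)**
- Complete information for every predicted toxin gene
- Fields include: strain, protein ID, protein length, strand, gene coordinates, SVM prediction, BLAST result, HMM result, hit ID, alignment length, identity, E-value, and so on
-
-**Example test results**:
- Strain 97-27: 12 predicted toxin genes, including InhA1/2, Bmp1, Spp1Aa1, Zwa5A/6, and others
- Strain C15: several predicted Cry toxin genes (Cry21Aa2, Cry21Aa3, Cry21Ca2, Cry5Ba1) plus other accessory toxins
- Toxins are ranked Rank1-4; Rank1 is the highest confidence and Rank4 the lowest
- Identity ranges from 27.62% to 100%, indicating the degree of similarity to known toxins
-
-### Single-Directory Scheme (Stable Cross-Platform Writes)
-
- Before a run, the program copies the input files into an `input_files/` subdirectory of the host output directory; the container mounts only that output directory (read-write) as `/workspace`.
- At run time the tool's `--SeqPath` points to `/workspace/input_files` and the working directory is fixed at `/workspace`; all results and intermediate files land under `tests/output/` on the host.
-
-Directory example: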
-
-```
-tests/output/
-├── input_files/          # Copies of the input files
-│   ├── 97-27.fna
-│   └── C15.fna
-├── Results/              # BtToxin_Digger output
-│   └── Toxins/
-│       ├── 97-27.list
-│       ├── 97-27.gbk
-│       └── ...
-├── StatsFiles/           # Statistics files (if any)
-├── All_Toxins.txt
-└── BtToxin_Digger.log
+# Use custom database
+pixi run pipeline --fna tests/test_data/HAN055.fna \
+  --bttoxin_db_dir /path/to/custom/bt_toxin
 ```

-## Updating bttoxin_db
+### Individual Stage Commands

-The database bundled in the BtToxin_Digger container is old (August 2021); using the latest database from the official GitHub repository is recommended.
+Run stages separately when needed:

-### Database Directory Structure
+#### Digger Only

-```
-external_dbs/bt_toxin/
-├── db/                    # BLAST index files (required at run time)
-│   ├── bt_toxin.phr
-│   ├── bt_toxin.pin
-│   ├── bt_toxin.psq
-│   ├── bt_toxin.pdb
-│   ├── bt_toxin.pjs
-│   ├── bt_toxin.pot
-│   ├── bt_toxin.ptf
-│   ├── bt_toxin.pto
-│   └── old/
-└── seq/                   # Source sequence files (for archiving/updates)
-    ├── bt_toxin20251104.fas
-    └── ...
+```bash
+pixi run digger-only --fna <file> [options]
+
+Options:
+  --fna PATH              Input .fna file (required)
+  --out_dir PATH          Output directory (default: runs/<stem>_digger_only)
+  --bttoxin_db_dir PATH   Custom database directory
+  --threads INT           Number of threads (default: 4)
+  --sequence_type         Sequence type: nucl/orfs/prot/reads (default: nucl)
 ```

-### Update Steps
+Example:
+```bash
+pixi run digger-only --fna tests/test_data/C15.fna --threads 8
+```
+
+#### Shotter (Scoring)
+
+```bash
+pixi run shotter [options]
+
+Options:
+  --toxicity_csv PATH     Toxicity data CSV
+  --all_toxins PATH       All_Toxins.txt from Digger
+  --output_dir PATH       Output directory
+  --min_identity FLOAT    Minimum identity threshold
+  --min_coverage FLOAT    Minimum coverage threshold
+  --allow_unknown_families / --disallow_unknown_families
+  --require_index_hit     Keep only indexed hits
+```
+
+Example:
+```bash
+pixi run shotter \
+  --all_toxins runs/C15_run/digger/Results/Toxins/All_Toxins.txt \
+  --output_dir runs/C15_run/shotter
+```
+
+#### Plot (Visualization)
+
+```bash
+pixi run plot [options]
+
+Options:
+  --strain_scores PATH    strain_target_scores.tsv from Shotter
+  --toxin_support PATH    toxin_support.tsv (optional)
+  --species_scores PATH   strain_target_species_scores.tsv (optional)
+  --out_dir PATH          Output directory
+  --cmap STRING           Colormap (default: viridis)
+  --per_hit_strain NAME   Generate per-hit heatmap for specific strain
+  --merge_unresolved      Merge other/unknown into unresolved
+  --report_mode {summary,paper}  Report style (default: paper)
+  --lang {zh,en}          Report language (default: zh)
+```
+
+Example:
+```bash
+pixi run plot \
+  --strain_scores runs/C15_run/shotter/strain_target_scores.tsv \
+  --toxin_support runs/C15_run/shotter/toxin_support.tsv \
+  --out_dir runs/C15_run/shotter \
+  --per_hit_strain C15 --lang en
+```
+
+## Output Structure
+
+After running the pipeline:
+
+```
+runs/<strain>_run/
+├── stage/                    # Staged input file
+│   └── <strain>.fna
+├── digger/                   # BtToxin_Digger outputs
+│   ├── Results/
+│   │   └── Toxins/
+│   │       ├── All_Toxins.txt
+│   │       ├── <strain>.list
+│   │       ├── <strain>.gbk
+│   │       └── Bt_all_genes.table
+│   └── BtToxin_Digger.log
+├── shotter/                  # Shotter outputs
+│   ├── strain_target_scores.tsv
+│   ├── strain_scores.json
+│   ├── toxin_support.tsv
+│   ├── strain_target_species_scores.tsv
+│   ├── strain_species_scores.json
+│   ├── strain_target_scores.png
+│   ├── strain_target_species_scores.png
+│   ├── per_hit_<strain>.png
+│   └── shotter_report_paper.md
+├── logs/
+│   └── digger_execution.log
+└── pipeline_results.tar.gz   # Bundled results
+```
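A quick way to confirm that a run finished is to check for a few of the files in the tree above. The sketch below is a hypothetical helper, not part of the repository; the relative paths are copied from the listing, and the choice of which files count as essential is an assumption.

```python
from pathlib import Path
from typing import List

# Paths relative to runs/<strain>_run/, taken from the output tree above.
# Treating exactly these four as "essential" is an assumption for illustration.
EXPECTED_OUTPUTS = (
    "digger/Results/Toxins/All_Toxins.txt",
    "shotter/strain_target_scores.tsv",
    "shotter/toxin_support.tsv",
    "pipeline_results.tar.gz",
)


def missing_outputs(run_dir: str) -> List[str]:
    """Return the expected output files that are absent under run_dir."""
    root = Path(run_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_outputs("runs/HAN055_run")
    if missing:
        print("Missing outputs:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected outputs are present.")
```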
+
+## Database Update
+
+BtToxin_Digger's built-in database may be outdated. Use the latest from GitHub:
+
+### Update Steps

 ```bash
 mkdir -p external_dbs
@@ -176,49 +206,149 @@ git sparse-checkout init --cone
 git sparse-checkout set BTTCMP_db/bt_toxin
 git checkout master

-# Copy the directory into your project's external_dbs
 cd ..
 cp -a tmp_bttoxin_repo/BTTCMP_db/bt_toxin external_dbs/bt_toxin
-
-# Clean up the temporary repo
 rm -rf tmp_bttoxin_repo
 ```

-### Verifying the Database Mount
+The pipeline automatically detects `external_dbs/bt_toxin` if present.

-```bash
-# Check that the database files are complete
-ls -lh external_dbs/bt_toxin/db/
+### Database Structure

-# Verify that the container can access the mounted database
-docker run --rm \
-  -v "$(pwd)/external_dbs/bt_toxin:/usr/local/bin/BTTCMP_db/bt_toxin:ro" \
-  quay.io/biocontainers/bttoxin_digger:1.0.10--hdfd78af_0 \
-  bash -lc 'ls -lh /usr/local/bin/BTTCMP_db/bt_toxin/db | head'
+```
+external_dbs/bt_toxin/
+├── db/                    # BLAST index files (required)
+│   ├── bt_toxin.phr
+│   ├── bt_toxin.pin
+│   ├── bt_toxin.psq
+│   └── ...
+└── seq/                   # Source sequences (optional, for reference)
+    └── bt_toxin*.fas
 ```

-The output should list the `.pin/.psq/.phr` files, with timestamps/sizes matching the host, which confirms the mount succeeded.
+## Input File Format

-### Running the Pipeline with an External Database
+`.fna` files are FASTA-format nucleotide sequence files containing bacterial genome sequences:

-The script automatically detects the `external_dbs/bt_toxin` directory and mounts it if present:
-
-```bash
-# Automatically use external_dbs/bt_toxin (recommended)
-uv run python scripts/run_single_fna_pipeline.py --fna tests/test_data/HAN055.fna
-
-# Or specify the database path manually
-uv run python scripts/run_single_fna_pipeline.py \
-  --fna tests/test_data/HAN055.fna \
-  --bttoxin_db_dir /path/to/custom/bt_toxin
+```
+>NZ_CP010088.1 Bacillus thuringiensis strain 97-27 chromosome, complete genome
+TAATGTAACACCAGTAAATATTTCATTCATATATTCTTTTAACTGTATTTTATATTCTTTCTACTCTACAATTTCTTTTA
+ACTGCCAATATGCATCTTCTAGCCAAGGGTGTAAAACTTTCAACGTGTCTTTTCTATCCCACAAATATGAAATATATGCA
+...
 ```
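Before starting a long run it can help to confirm that the input really is FASTA. The following is a minimal sketch of a FASTA reader in plain Python; `iter_fasta` is a hypothetical helper, not part of the pipeline, and it assumes the simple layout shown above (a `>` header line followed by sequence lines).

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Report the name and length of each record in a .fna file.
    with open("tests/test_data/HAN055.fna", encoding="utf-8") as handle:
        for name, seq in iter_fasta(handle):
            print(f"{name.split()[0]}\t{len(seq)} bp")
```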

-### Notes
+## Result Interpretation

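- **The db/ directory is required**: at run time BLAST reads only the index files under `db/`
- **The seq/ directory is optional**: used only for archiving or rebuilding the index
- **The mount is read-only (ro)**: prevents the container from accidentally modifying the host database
- **No re-indexing needed**: the GitHub repository already ships prebuilt BLAST indexes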
+
+### Key Output Files
+
+**All_Toxins.txt** - Complete toxin predictions with:
+- Strain, Protein ID, coordinates
+- SVM/BLAST/HMM predictions
+- Hit ID, alignment length, identity, E-value
+
+**strain_target_scores.tsv** - Strain-level target predictions:
+- TopOrder: Most likely target insect order
+- TopScore: Confidence score (0-1)
+- Per-order scores for all target orders
+
+**toxin_support.tsv** - Per-hit contribution details:
+- Individual toxin weights and contributions
+- Family classification and partner status
+
+### Toxin Rankings
+
+- **Rank1**: Highest confidence (identity ≥78%, coverage ≥80%)
+- **Rank2-3**: Moderate confidence
+- **Rank4**: Lowest confidence predictions
+
+### Target Orders
+
+Common target groups in predictions (insect orders, plus the phylum Nematoda):
+- **Lepidoptera**: Moths and butterflies
+- **Coleoptera**: Beetles
+- **Diptera**: Flies and mosquitoes
+- **Hemiptera**: True bugs
+- **Nematoda**: Roundworms
+
+## Development
+
+### Python Development Environment
+
+For development work outside pixi:
+
+```bash
+uv venv --managed-python -p 3.12 --seed .venv
+source .venv/bin/activate
+uv pip install -e .
+```
+
+### Running Tests
+
+```bash
+# Run property-based tests
+pixi run -e pipeline python -m pytest tests/test_pixi_runner.py -v
+```
+
+### Project Structure
+
+```
+bttoxin-pipeline/
+├── pixi.toml                 # Pixi environment configuration
+├── pyproject.toml            # Python package configuration
+├── scripts/                  # Core pipeline scripts
+│   ├── run_single_fna_pipeline.py  # Main pipeline orchestrator
+│   ├── run_digger_stage.py         # Digger-only stage
+│   ├── bttoxin_shoter.py           # Toxin scoring module
+│   ├── plot_shotter.py             # Visualization & reporting
+│   └── pixi_runner.py              # PixiRunner class
+├── bttoxin/                  # Python package (CLI entry point)
+│   ├── __init__.py
+│   ├── api.py
+│   └── cli.py
+├── Data/                     # Reference data
+│   └── toxicity-data.csv     # BPPRC specificity data
+├── external_dbs/             # External databases (optional)
+│   └── bt_toxin/             # Updated BtToxin database
+├── tests/                    # Test suite
+│   ├── test_pixi_runner.py   # Property-based tests
+│   └── test_data/            # Test input files
+├── docs/                     # Documentation
+├── runs/                     # Pipeline outputs (gitignored)
+├── backend/                  # FastAPI backend (optional web service)
+└── frontend/                 # Vue.js frontend (optional web UI)
+```
+
+## Troubleshooting
+
+### pixi not found
+
+```bash
+# Ensure pixi is in PATH
+export PATH="$HOME/.pixi/bin:$PATH"
+
+# Or reinstall
+curl -fsSL https://pixi.sh/install.sh | bash
+```
+
+### Environment not found
+
+```bash
+# Reinstall environments
+pixi install
+```
+
+### BtToxin_Digger not available
+
+```bash
+# Verify digger environment
+pixi run -e digger BtToxin_Digger --help
+```
+
+### Permission errors
+
+Ensure write permissions on output directories. The pipeline creates directories automatically.

 ## License