add usage markdown file

2025-11-21 20:23:15 +08:00
parent 50a901c167
commit 82cfb13e50
4 changed files with 685 additions and 0 deletions
--- a/docs/shotter_user_guide.md
+++ b/docs/shotter_user_guide.md
@@ -0,0 +1,207 @@
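+# Bttoxin Shotter User Guide
+
+This guide is for researchers and engineers who use Bttoxin Shotter. It covers the workflow, the parameters, the generated outputs, and the methodology behind the scores. It tracks the current code exactly:
+- Scoring program: `scripts/bttoxin_shoter.py`
+- Plotting and reporting: `scripts/plot_shotter.py`
+
+For a paper-style report, the plotting script supports a "paper" mode, in Chinese or English.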
+
+---
+
+## 1. Quick Start
+
+1. Prepare the inputs
+   - BPPRC Specificity Database (CSV); default path: `Data/toxicity-data.csv`
+   - The `All_Toxins.txt` file produced by BtToxin_Digger, for example `tests/output/Results/Toxins/All_Toxins.txt`
+
+2. Run the scoring step (Shotter)
+```bash
+python scripts/bttoxin_shoter.py \
+  --toxicity_csv Data/toxicity-data.csv \
+  --all_toxins tests/output/Results/Toxins/All_Toxins.txt \
+  --output_dir shotter_outputs \
+  --min_identity 0.50 \
+  --min_coverage 0.60 \
+  --disallow_unknown_families \
+  --require_index_hit
+```
+Outputs are written to `shotter_outputs/` (or to a writable fallback path).
+
+3. Plot and generate the report
+```bash
+python scripts/plot_shotter.py \
+  --strain_scores shotter_outputs/strain_target_scores.tsv \
+  --toxin_support shotter_outputs/toxin_support.tsv \
+  --species_scores shotter_outputs/strain_target_species_scores.tsv \
+  --out_dir shotter_outputs \
+  --merge_unresolved \
+  --per_hit_strain <your_strain_name> \
+  --report_mode paper \
+  --lang zh
+```
+This generates:
+- Heatmaps: `strain_target_scores.png`, (optional) `per_hit_<strain>.png`, (optional) `strain_target_species_scores.png`
+- Paper-style report: `shotter_report_paper.md` (can be exported to HTML/PDF with pandoc)
+
+---
+
+## 2. Output Files
+- `toxin_support.tsv`
+  - Details for every hit, with one contribution column per insect order, plus TopOrder/TopScore.
+- `strain_target_scores.tsv`
+  - The combined score of each strain for each insect order, plus TopOrder/TopScore.
+- `strain_scores.json`
+  - A JSON version of the same content as `strain_target_scores.tsv`.
+- If the CSV contains `target_species`:
+  - `strain_target_species_scores.tsv`, `strain_species_scores.json`
+- Heatmaps and report:
+  - `strain_target_scores.png`, (optional) `per_hit_<strain>.png`, `strain_target_species_scores.png`
+  - `shotter_report_paper.md` (or `shotter_summary.md`)
+
+---
+
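+## 3. Methodology and Implementation (consistent with the code)
+
+### 3.1 Data source and positive-record filtering
+- From the BPPRC Specificity Database CSV, only positive records with `activity == "Yes"` are kept as evidence (`SpecificityIndex.from_csv`).
+- If a `target_species` column exists, a species-level specificity distribution is built as well.
+
+### 3.2 Potency mapping and normalization
+- Fields: `lc50` and/or `percentage_mortality`.
+- Unit normalization: `ppm` in `units` is merged into the `ug/g` bucket (diet context); values are compared only against others in the same unit bucket.
+- Rules (`_potency_from_row`):
+  - If `lc50` is numeric: keep it, then apply an inverted quantile ranking within the unit bucket ("smaller is stronger") to obtain a value in [0,1].
+  - Otherwise, map the `percentage_mortality` text coarsely: >80% → 0.9; 60-100% or 50-80% → ~0.65; low efficacy or partial effect → ~0.25.
+  - If both are missing but the record is positive → 0.55.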
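Annotation: the LC50 branch of section 3.2 can be sketched as below. This is a minimal illustration, not the code in `_potency_from_row`: the function name `lc50_to_potency` is hypothetical, and the tie handling (average rank) and the single-value case (potency 1.0) are assumptions that the guide does not specify.

```python
def lc50_to_potency(lc50_values):
    """Map LC50 values from ONE unit bucket to [0, 1]; smaller LC50 -> higher potency.

    Assumption: plain rank-based percentile; tied values share the average rank.
    """
    n = len(lc50_values)
    if n == 0:
        return []
    if n == 1:
        return [1.0]  # assumption: a lone value in its bucket gets full potency
    order = sorted(lc50_values)
    potencies = []
    for v in lc50_values:
        first = order.index(v)               # lowest 0-based rank of v
        last = n - 1 - order[::-1].index(v)  # highest 0-based rank of v
        rank = (first + last) / 2            # average rank for ties
        potencies.append(1.0 - rank / (n - 1))
    return potencies


# Toy values, not taken from the BPPRC database
print(lc50_to_potency([1.0, 10.0, 100.0]))
```

The ranking is done per unit bucket, so the caller would group rows by normalized unit before calling this function.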
+
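+### 3.3 Specificity index: P(order|·) and P(species|·)
+- Aggregate by protein name (`name`): sum `_potency` per `target_order` (or `target_species`), then normalize so the values form a distribution.
+- Fallback aggregation: if a name has no distribution, try the subfamily (family plus the first letter, e.g. `Cry1A`), then the family (prefix plus number, e.g. `Cry1`).
+- This yields `name_to_orders`, `subfam_to_orders`, and `fam_to_orders`, plus the optional species-level mappings.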
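Annotation: a rough sketch of the aggregation and the name → subfamily → family fallback described in section 3.3. The helper names `aggregate` and `lookup`, the row layout, and the toy rows are all hypothetical; only the three table names come from the guide.

```python
from collections import defaultdict


def aggregate(rows, key):
    """Sum potency per (row[key], target_order), then normalize each key's orders to sum to 1."""
    sums = defaultdict(lambda: defaultdict(float))
    for r in rows:
        sums[r[key]][r["target_order"]] += r["potency"]
    dist = {}
    for k, per_order in sums.items():
        total = sum(per_order.values())
        if total > 0:
            dist[k] = {o: v / total for o, v in per_order.items()}
    return dist


def lookup(hit, name_to_orders, subfam_to_orders, fam_to_orders):
    """Return the first available distribution: name, then subfamily, then family."""
    for key, table in (
        (hit["name"], name_to_orders),
        (hit["subfamily"], subfam_to_orders),
        (hit["family"], fam_to_orders),
    ):
        if key in table:
            return table[key]
    return None  # no evidence at any level


# Toy rows for illustration only (not real BPPRC records)
rows = [
    {"name": "Cry1Ac1", "subfamily": "Cry1A", "family": "Cry1",
     "target_order": "Lepidoptera", "potency": 0.9},
    {"name": "Cry1Ac1", "subfamily": "Cry1A", "family": "Cry1",
     "target_order": "Diptera", "potency": 0.3},
]
name_to_orders = aggregate(rows, "name")
subfam_to_orders = aggregate(rows, "subfamily")
fam_to_orders = aggregate(rows, "family")

# A hit whose exact name is absent falls back to its subfamily distribution
hit = {"name": "Cry1Ab2", "subfamily": "Cry1A", "family": "Cry1"}
print(lookup(hit, name_to_orders, subfam_to_orders, fam_to_orders))
```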
+
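+### 3.4 Parsing Digger hits and assigning families
+- Read `All_Toxins.txt`, standardize the fields, and compute:
+  - `coverage = Aln_length / Hit_length`; `identity01 = Identity / 100`
+  - `HMM` converted to a boolean (`YES` → True)
+  - `Hit_id_norm`: the hit ID with noise such as the `-other` suffix removed (`normalize_hit_id`)
+  - `family_key`: the family and subfamily parsed from the name with a regular expression (supported: Cry/Cyt/Vip/Vpa/Vpb/Mpp/Tpp/Spp/App/Mcf/Mpf/Pra/Prb/Txp/Gpp/Mtx/Xpp). A failed parse is marked `unknown`.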
+
+### 3.5 Hit weight (w_hit) calculation (`compute_similarity_weight`)
+- `base(identity, coverage)`:
+  - If `identity≥0.78` and `coverage≥0.8`: `base=1.0`
+  - If `0.45≤identity<0.78`: `base=(identity-0.45)/(0.78-0.45)` (clipped to [0,1])
+  - Otherwise `base=0`
+- `w = min(1, base×coverage + 0.1×I(HMM))`
+- Partner requirement (`partner_fulfilled_for_hit`):
+  - Heuristic family-level pairs (Vip1-Vip2, Vpa-Vpb, BinA-BinB).
+  - If a partner is required but the strain contains no partner with `w≥0.3`, the weight is penalized: `w×=0.2`.
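Annotation: the weight rules of section 3.5 translate fairly directly into code. The sketch below is an illustration rather than the body of `compute_similarity_weight`; the function name `similarity_weight` and the `needs_partner` / `partner_ok` parameters are hypothetical. Note the assumption that a hit with `identity≥0.78` but `coverage<0.8` falls into the "otherwise" branch, which is a literal reading of the rules as written.

```python
def similarity_weight(identity, coverage, hmm, needs_partner=False, partner_ok=True):
    """Sketch of w_hit. identity and coverage are fractions in [0, 1]; hmm is a bool."""
    if identity >= 0.78 and coverage >= 0.8:
        base = 1.0
    elif 0.45 <= identity < 0.78:
        base = (identity - 0.45) / (0.78 - 0.45)
        base = max(0.0, min(1.0, base))  # clip to [0, 1]
    else:
        # Assumption: also covers identity >= 0.78 with coverage < 0.8
        base = 0.0
    w = min(1.0, base * coverage + (0.1 if hmm else 0.0))
    if needs_partner and not partner_ok:
        w *= 0.2  # penalty when no partner with w >= 0.3 exists in the strain
    return w


# High-identity, high-coverage hit without HMM support
print(similarity_weight(0.90, 0.90, hmm=False))
# Binary-toxin component whose partner is missing from the strain
print(similarity_weight(0.90, 1.00, hmm=False, needs_partner=True, partner_ok=False))
```

Deciding `partner_ok` requires a first pass over all hits of the strain to compute unpenalized weights, then a second pass that applies the penalty.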
+
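+### 3.6 Contributions and strain-level combination (noisy-OR)
+- For hit `i`, the contribution to target order `o` is `c_i(o) = w_i × P(o|·)`;
+- Strain-level combination: `score(o) = 1 - ∏_i (1 - c_i(o))`;
+- The `other` and `unknown` buckets:
+  - If the family can be parsed but the index holds no evidence for it → the contribution goes to `other`;
+  - If no family prefix can be parsed → the contribution goes to `unknown`;
+  - Top ranking prefers real target orders; `other/unknown` never decides Top unless no real order has a score.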
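Annotation: a small sketch of the noisy-OR combination and the Top selection rule from section 3.6. The function names and the `(w, distribution)` input layout are hypothetical, and treating "no real order has a score" as "no real order has a score above zero" is an assumption.

```python
import math


def noisy_or(contributions):
    """score = 1 - prod(1 - c_i) over all contributions to one order."""
    return 1.0 - math.prod(1.0 - c for c in contributions)


def strain_scores(hits):
    """hits: list of (w_hit, {order: P(order|.)}) for one strain."""
    per_order = {}
    for w, dist in hits:
        for order, p in dist.items():
            per_order.setdefault(order, []).append(w * p)  # c_i(o) = w_i * P(o|.)
    return {o: noisy_or(cs) for o, cs in per_order.items()}


def top_order(scores):
    """Prefer real orders; fall back to other/unknown only if no real order scores > 0."""
    real = {o: s for o, s in scores.items()
            if o not in ("other", "unknown") and s > 0}
    pool = real or scores
    if not pool:
        return None, 0.0
    best = max(pool, key=pool.get)
    return best, pool[best]


# Toy hits for illustration only
hits = [
    (1.0, {"Lepidoptera": 0.5}),
    (0.5, {"Lepidoptera": 1.0}),
    (1.0, {"other": 1.0}),
]
scores = strain_scores(hits)
print(scores, top_order(scores))
```

Two hits that each contribute 0.5 combine to 0.75 rather than 1.0, which is the point of noisy-OR: independent pieces of evidence reinforce each other without the score exceeding 1.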
+
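+### 3.7 Species dimension (optional)
+- If the CSV contains `target_species`, the potential activity score per species and TopSpecies are computed with the same formulas.
+
+### 3.8 Meaning of the Top columns
+- `TopOrder/TopScore` (per hit or per strain): the highest-scoring target order and its score, used for quick reporting and sorting.
+- `TopSpecies/TopSpeciesScore`: the highest-scoring species and its score (if the species dimension exists).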
+
+---
+
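+## 4. Command-Line Arguments (CLI)
+
+### 4.1 Scoring program: `scripts/bttoxin_shoter.py`
+- Core inputs
+  - `--toxicity_csv`: BPPRC CSV (default: `Data/toxicity-data.csv`)
+  - `--all_toxins`: Digger's `All_Toxins.txt` (see the code for the default example path)
+  - `--output_dir`: output directory (a writable path is detected automatically, with fallback)
+- Filters and thresholds (consistent with the code)
+  - `--min_identity [0-1]`, `--min_coverage [0-1]`
+  - `--allow_unknown_families` (enabled by default) / `--disallow_unknown_families`
+  - `--require_index_hit`: keep only hits that resolve in the specificity index at the name, subfamily, or family level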
+
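+### 4.2 Plotting and reporting: `scripts/plot_shotter.py`
+- Basic inputs
+  - `--strain_scores`, `--toxin_support`, `--species_scores` (optional)
+  - `--out_dir`, `--cmap`, `--vmin`, `--vmax`, `--figsize`
+- Visualization and report
+  - `--merge_unresolved`: merge `other+unknown` into `unresolved` (affects visualization and summaries only)
+  - `--per_hit_strain <Strain>`: only used to draw the per-hit × target-order heatmap for a single strain; does not change any computation
+  - `--report_mode {paper,summary}`: paper-style or short report (default: paper)
+  - `--lang {zh,en}`: report language (default: zh)
+  - `--summary_md <PATH>`: path to write the report to (named automatically if omitted)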
+
+---
+
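+## 5. FAQ
+
+- Q: What is the difference between `other` and `unknown`?
+  - A: `other` means the family name can be parsed but the BPPRC specificity index holds no evidence for it; `unknown` means not even a family prefix can be parsed. Neither can be assigned to a specific insect order, but the distinction helps trace whether the cause is a naming/parsing problem or insufficient index coverage.
+
+- Q: Does this mean we have no idea which insects the toxin kills?
+  - A: It means there is no evidence to support assignment to a specific order, which is not the same as "no activity". `other/unknown` never adds to the score of any real order; it is only counted as the unresolved part.
+
+- Q: How can `other/unknown` be reduced?
+  - A:
+    - Raise the thresholds: `--min_identity`, `--min_coverage`;
+    - Filter: `--require_index_hit`, `--disallow_unknown_families`;
+    - Clean up the naming, extend the BPPRC data, or build an alias mapping.
+
+- Q: What does `--per_hit_strain` do?
+  - A: It only selects one strain for the per-hit × target-order heatmap, which helps explain which hits support a given TopOrder/TopScore. It does not change any scoring result.
+
+- Q: What are `TopOrder/TopScore`?
+  - A: The highest-scoring target order and its score (per hit or per strain), used for quick reporting and sorting. If any real order has a score, Top never picks `other/unknown`.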
+
+---
+
+## 6. Examples
+
+- Recommended parameters for a first-pass screen:
+```bash
+python scripts/bttoxin_shoter.py \
+  --toxicity_csv Data/toxicity-data.csv \
+  --all_toxins tests/output/Results/Toxins/All_Toxins.txt \
+  --output_dir shotter_outputs \
+  --min_identity 0.50 --min_coverage 0.60 \
+  --disallow_unknown_families --require_index_hit
+```
+- Plotting and report (paper mode, Chinese):
+```bash
+python scripts/plot_shotter.py \
+  --strain_scores shotter_outputs/strain_target_scores.tsv \
+  --toxin_support shotter_outputs/toxin_support.tsv \
+  --species_scores shotter_outputs/strain_target_species_scores.tsv \
+  --out_dir shotter_outputs \
+  --merge_unresolved \
+  --per_hit_strain C15 \
+  --report_mode paper --lang zh
+```
+- Export the report to HTML (example):
+```bash
+pandoc -s shotter_outputs/shotter_report_paper.md -o shotter_outputs/shotter_report_paper.html
+```
+
+---
+
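+## 7. Reproducibility and Path Strategy
+- Output directory strategy: prefer `--output_dir`; if it is not writable, fall back to `<output_dir>/Shotter`; then fall back to `./shotter_outputs/`.
+- If `species_scores` is not provided, or the CSV has no `target_species`, the species heatmap and the corresponding report section are skipped automatically.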
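Annotation: the fallback order in section 7 could look roughly like the sketch below. The names `resolve_output_dir` and `_is_writable` are hypothetical, and probing writability by creating a temporary file is an assumption; the scripts may detect it differently.

```python
import tempfile
from pathlib import Path


def _is_writable(path: Path) -> bool:
    """Try to create the directory and a throwaway file inside it."""
    try:
        path.mkdir(parents=True, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False


def resolve_output_dir(requested: str) -> Path:
    """Return the first writable candidate, in the order the guide lists them."""
    candidates = [
        Path(requested),
        Path(requested) / "Shotter",
        Path.cwd() / "shotter_outputs",
    ]
    for candidate in candidates:
        if _is_writable(candidate):
            return candidate
    raise OSError("no writable output directory found")
```

Because the result may differ from the path passed on the command line, it is worth logging the resolved directory so downstream plotting commands point at the right files.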
+
+---
+
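+## 8. Appendix: Family/Subfamily Parsing
+- Supported family prefixes (regular expression):
+  - `Cry`, `Cyt`, `Vip`, `Vpa`, `Vpb`, `Mpp`, `Tpp`, `Spp`, `App`, `Mcf`, `Mpf`, `Pra`, `Prb`, `Txp`, `Gpp`, `Mtx`, `Xpp`
+- For example:
+  - `Cry1Ac1` → family=`Cry1`, subfamily=`Cry1A`
+  - `Bmp1-other` is normalized to `Bmp1` (if the index holds no evidence for it, it falls into `other`)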
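Annotation: a minimal sketch of the parsing shown in the appendix examples. The regular expression and the helper names `strip_other_suffix` and `parse_family` are assumptions for illustration; the real `normalize_hit_id` removes other noise too, and the actual pattern in the script may be broader.

```python
import re

# Prefix list copied from the appendix
PREFIXES = ("Cry", "Cyt", "Vip", "Vpa", "Vpb", "Mpp", "Tpp", "Spp", "App",
            "Mcf", "Mpf", "Pra", "Prb", "Txp", "Gpp", "Mtx", "Xpp")
# prefix + number, optionally followed by one uppercase subfamily letter
_FAMILY_RE = re.compile(r"^(%s)(\d+)([A-Z])?" % "|".join(PREFIXES))


def strip_other_suffix(hit_id):
    """Remove a trailing '-other' (only one kind of noise is handled here)."""
    return re.sub(r"-other$", "", hit_id.strip())


def parse_family(name):
    """Return (family, subfamily), or ('unknown', 'unknown') if the prefix is unsupported."""
    m = _FAMILY_RE.match(name)
    if not m:
        return "unknown", "unknown"
    prefix, number, letter = m.groups()
    family = prefix + number
    subfamily = family + letter if letter else family
    return family, subfamily


print(parse_family("Cry1Ac1"))
print(strip_other_suffix("Bmp1-other"))
```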
+
+---
+
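+To merge `other+unknown` into `unresolved` permanently at the computation level, or to insert custom methodology or reference sections into the report, file a request; the corresponding switch and template extensions can be added to the existing scripts.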