From 82cfb13e506eaeb1f357177434a50f75194f6880 Mon Sep 17 00:00:00 2001
From: hotwa <pylyzeng@gmail.com>
Date: Fri, 21 Nov 2025 20:23:15 +0800
Subject: [PATCH] add usage markdown file

---
 docs/shotter_results.md     | 203 +++++++++++++++++++++++++++++++++++
 docs/shotter_user_guide.md  | 207 ++++++++++++++++++++++++++++++++++++
 docs/shotter_workflow.md    | 155 +++++++++++++++++++++++++++
 docs/shotter_workflow_zh.md | 120 +++++++++++++++++++++
 4 files changed, 685 insertions(+)
 create mode 100644 docs/shotter_results.md
 create mode 100644 docs/shotter_user_guide.md
 create mode 100644 docs/shotter_workflow.md
 create mode 100644 docs/shotter_workflow_zh.md

diff --git a/docs/shotter_results.md b/docs/shotter_results.md
new file mode 100644
index 0000000..21ee594
--- /dev/null
+++ b/docs/shotter_results.md
@@ -0,0 +1,203 @@
+# Bttoxin_Shoter 结果文件说明与可视化建议
+
+本页介绍 Shotter 的输出文件结构、字段含义、解读建议，以及如何快速画热力图等可视化。
+
+## 文件一览（默认写入 tests/output/Results/Toxins 或可写回退目录）
+- toxin_support.tsv
+  - 逐命中（per-hit）对各目标“目”的贡献矩阵。
+- strain_target_scores.tsv
+  - 每个菌株（strain）对各“目”的综合分数矩阵（0–1）。
+- strain_scores.json
+  - 与上表相同信息的 JSON 版。
+
+注：All_Toxins.txt 是 BtToxin_Digger 的输出，Shotter 的输入；toxin_support.tsv/strain_target_scores.tsv/strain_scores.json 是 Shotter 的新输出文件。
+
+若默认目录不可写（如容器生成目录归 root 所有），Shotter 会自动回退到：
+1) tests/output/Results/Toxins/Shotter 子目录；若仍不可写，
+2) 当前工作目录下的 ./shotter_outputs。
+命令行也可用 --output_dir 指定任意可写目录。
+
+## 字段与含义
+
+### toxin_support.tsv（per-hit 矩阵）
+- Strain：菌株名（来自 All_Toxins.txt 的第一列）
+- Protein_id：Digger 报告的蛋白/ORF 标识
+- Hit_id：Digger 命中名（去掉 “-other” 尾巴后的标准名）
+- Identity：序列一致性（0–1），由 Digger 的 Identity 除以 100 后得到
+- Aln_length / Hit_length：比对长度与参考长度
+- Coverage：覆盖度（Aln_length/Hit_length，0–1）
+- HMM：Digger 的 HMM 列（YES/NO）
+- Family：解析到的家族键（如 Cry1、Vip3）；无法解析则为 unknown
+- NameKey：用于回退检索的名称键（通常等于 Hit_id）
+- PartnerFulfilled：是否满足配对家族（如 Vip1/Vip2、Vpa/Vpb）
+- Weight：该命中的相似性权重 w_hit，公式：
+  - 若 id≥0.78 且 cov≥0.80：base=1.0；elif 0.45≤id<0.78：base=(id-0.45)/(0.78-0.45)；else base=0
+  - w_hit = min(1, base×coverage + 0.1×HMM)
+  - 若需配对但未满足：w_hit×=0.2
+- 后续每一列：一个 target_order（如 Lepidoptera、Coleoptera… 以及 other、unknown），值为该命中对该“目”的贡献
+  - 贡献 c(order)=w_hit×P(order|name/subfamily/family)
+  - 若能解析出家族但 CSV 无证据 → 计入 other
+  - 若无法解析家族 → 计入 unknown
+
+### strain_target_scores.tsv（per-strain 矩阵）
+- Strain：菌株名
+- 每个 target_order 一列（含 other、unknown）：
+  - Score(strain, order)=1−∏(1−c(order))（noisy-OR 合成）
+  - 落在 0–1。值越大，综合证据越强。可自定阈值作“高置信”标记（例如 ≥0.5）。
+
+### strain_scores.json
+- 与 strain_target_scores.tsv 等价的 JSON，便于程序消费。
+
+## 如何解读
+- per-hit：适合审查具体哪条毒蛋白命中贡献了哪些“目”。Weight 与 Coverage/Identity/HMM 协同体现可信度；PartnerFulfilled=NO 时表示该家族可能需要配对，单独命中仅做弱证据。
+- per-strain：纵览一个菌株对哪些“目”潜在活性更强。可以取 Top-N order 查看最可能靶标，也可设置阈值筛“高置信”目标。
+- 列 other/unknown：
+  - other：可解析到家族，但 CSV 中无对应靶标证据（回退失败）
+  - unknown：无法从命中名中解析出家族前缀
+
+## 热力图可视化示例
+
+### 1) 菌株×目标“目”分数热力图
+```python
+import pandas as pd
+import seaborn as sns
+import matplotlib.pyplot as plt
+
+df = pd.read_csv("tests/output/Results/Toxins/strain_target_scores.tsv", sep="\t")
+# 如果程序回退到 ./shotter_outputs，请改成相应路径
+
+orders = [c for c in df.columns if c != "Strain"]
+mat = df.set_index("Strain")[orders]
+
+plt.figure(figsize=(0.6*len(orders)+3, 0.4*len(mat)+2))
+sns.heatmap(mat, cmap="viridis", vmin=0, vmax=1, cbar_kws={"label": "score"})
+plt.title("Strain vs Target Orders (Shotter)")
+plt.tight_layout()
+plt.show()
+```
+
+### 2) 单个菌株：命中×目标“目”贡献热图
+```python
+import pandas as pd
+import seaborn as sns
+import matplotlib.pyplot as plt
+
+hits = pd.read_csv("tests/output/Results/Toxins/toxin_support.tsv", sep="\t")
+# 选择一个菌株
+s = "97-27"
+sub = hits.query("Strain == @s").copy()
+orders = [c for c in sub.columns if c not in [
+    "Strain","Protein_id","Hit_id","Identity","Aln_length","Hit_length","Coverage",
+    "HMM","Family","NameKey","PartnerFulfilled","Weight"
+]]
+mat = sub.set_index("Hit_id")[orders]
+
+plt.figure(figsize=(0.6*len(orders)+3, 0.4*len(mat)+2))
+sns.heatmap(mat, cmap="magma", vmin=0, vmax=1, cbar_kws={"label": "contrib"})
+plt.title(f"Per-hit contributions for {s}")
+plt.tight_layout()
+plt.show()
+```
+
+## 常见问题
+- 为什么不再做 BLAST？
+  - Shotter 直接使用 Digger 的相似性结果与 HMM 标记；避免重复计算，提高速度与可复现性。
+- 为什么会有 PermissionError？
+  - 某些 Digger 输出目录可能由容器创建（root 所有），普通用户不可写。可传入 `--output_dir` 指向可写目录，或使用程序的自动回退（Shotter 子目录或 ./shotter_outputs）。
+- 是否只看 Top1？
+  - 矩阵保留了所有“目”的得分，更利于发现“次 high”的潜在靶标。若需要，我可以在输出中追加 `TopOrder/TopScore` 两列。
+
+## 复现命令
+```bash
+python scripts/bttoxin_shoter.py \
+  --toxicity_csv Data/toxicity-data.csv \
+  --all_toxins tests/output/Results/Toxins/All_Toxins.txt \
+  --output_dir tests/output/Results/Toxins
+# 若目录不可写可改为： --output_dir tests/output/Results/Toxins/Shotter
+```
+
+## 绘图脚本用法（CLI）与图像解读
+
+- 生成“菌株×目标目”热力图：
+  ```bash
+  python scripts/plot_shotter.py \
+    --strain_scores shotter_outputs/strain_target_scores.tsv \
+    --out_dir shotter_outputs
+  ```
+  - 行=菌株（Strain）。
+  - 列=目标“目”（含 other/unknown）。
+  - 颜色=综合分数 Score(strain, order) ∈ [0,1]。分数越大，综合证据越强。非“能杀”，而是“潜在活性支持度”。建议论文中给出阈值（如≥0.5）或标注Top-N order。
+
+- 生成指定菌株的“命中×目标目贡献”热力图：
+  ```bash
+  python scripts/plot_shotter.py \
+    --strain_scores shotter_outputs/strain_target_scores.tsv \
+    --toxin_support shotter_outputs/toxin_support.tsv \
+    --out_dir shotter_outputs \
+    --per_hit_strain 97-27
+  ```
+  - 行=命中蛋白（Hit_id；可在图注中标注对应 Family）。
+  - 列=目标“目”。
+  - 颜色=贡献 c(order)=w_hit×P(order|·)。可直观看到“哪个命中倾向于哪个‘目’”。
+
+提示：若想在图中暂时不展示 other/unknown，可在绘图前用 pandas 过滤掉相应列。
+
+## 命中识别与打分算法（适合论文引用）
+
+记：
+- id ∈ [0,1]：Identity/100；cov ∈ [0,1]：Coverage=Aln_length/Hit_length；hmm∈{0,1}：HMM 标记。
+- P(order | name/subfamily/family)：由 BPPRC 阳性数据聚合归一得到的“靶标目”分布。
+- 伙伴家族（如 Vip1/Vip2、Vpa/Vpb）需成对出现，否则降权。
+
+1) 单命中相似性权重 w_hit：
+
+  若 id ≥ 0.78 且 cov ≥ 0.80：base = 1.0；
+  否则若 0.45 ≤ id < 0.78：base = (id − 0.45) / (0.78 − 0.45)；
+  否则 base = 0。
+
+  w_hit = min(1, base × cov + 0.1 × hmm)。
+
+  若命中所属家族需配对但菌株内未检测到配对家族（且其权重≥0.3），则 w_hit := 0.2 × w_hit。
+
+2) 单命中对各“目”的贡献 c(order)：
+
+  c(order) = w_hit × P(order | name/subfamily/family)。
+
+  回退顺序：name → subfamily → family。仍无证据时：
+  - 若能解析到家族但无证据 → 贡献计入 other；
+  - 若无法解析家族前缀 → 贡献计入 unknown。
+
+3) 菌株层合成（noisy-OR）：
+
+  Score(strain, order) = 1 − ∏_hits (1 − c(order))。
+
+理由与阈值说明：0.45/0.78 阈值源于 Bt 命名规范（家族/亚家族）；cov≥0.8 保护部分匹配；HMM 加0.1为结构域级支持；noisy-OR 可在[0,1]内自然合成多条证据、避免重复计数。论文中可将 Score 视作“相对支持度/潜在概率分数”，并配置信心阈值或校准方案。
+
+## other/unknown 的来源与处理
+
+- other：
+  - 能解析到家族（如 Cry1），但在特异性索引（BPPRC CSV）中没有该 name/subfamily/family 的靶标证据条目，于是回退失败，贡献计入 other。
+  - 也可能该命名属于非经典 Bt 杀虫蛋白分支、或尚无公开的活性证据。
+- unknown：
+  - 无法从 Hit_id 解析出家族前缀（例如 InhA2-other、Bmp1-other、Zwa6-other 等不在 Cry/Cyt/Vip/Tpp/Mpp…前缀集合中）。此类命中多为其他功能蛋白或未纳入本索引体系的类别。
+- 为什么不能仅用命名系统直接映射靶标？
+  - 命名主要反映系统发育/结构域类型；同一（亚）家族中不同变体对受体/虫阶的谱系可能显著不同，且跨“目”的活性也常见。以经验规则硬编码会产生系统性偏差，因此我们采用从文献/专利汇总的阳性证据构建 P(order|·) 的统计方法，更稳健、可更新。
+- 绘图实践：若 other/unknown 占比高，表示大量命中缺乏到“目”级证据或无法解析；可在展示时隐藏这两列，或在方法学中单独说明其来源与占比。
+
+## 两类热力图在科研中的用途建议
+
+- 菌株×“目”热力图（per-strain）：适合横向比较多个菌株，展示总体“靶标目”潜在谱系；可配 Top-N 或阈值标注。
+- 命中×“目”热力图（per-hit）：适合阐释某一菌株的组成性证据与机制依据，说明哪些命中推动了特定“目”的得分。
+- 论文建议：两者结合使用——正文给出 per-strain 总览、补充材料给出关键样本的 per-hit 解析图。
+
+## 示例图解读（常见疑问）
+
+  - 非零表示“有一定证据支持潜在活性”，并非确证致死。建议报告高于阈值（如≥0.5）的“高置信目标目”，并在方法中给出阈值与依据。
+- “per-hit 图纵坐标是什么？”
+  - 纵轴是命中蛋白的 Hit_id（可在图注中标注对应 Family）。
+  - 行=命中蛋白（Hit_id；可在图注中标注对应 Family）。
+  - 列=目标“目”。
+  - 颜色=贡献 c(order)=w_hit×P(order|·)。可直观看到“哪个命中倾向于哪个‘目’”。
+
+- “why other 很多？”
+  - 多因命中属于未纳入家族前缀集合或 CSV 无证据（例如 InhA2/Bmp1/Zwa6 等），因此贡献汇入 other/unknown。可在展示时隐藏这两列或单列占比说明。
diff --git a/docs/shotter_user_guide.md b/docs/shotter_user_guide.md
new file mode 100644
index 0000000..d8311b6
--- /dev/null
+++ b/docs/shotter_user_guide.md
@@ -0,0 +1,207 @@
+# Bttoxin Shotter 使用指南（User Guide）
+
+本指南面向使用 Bttoxin Shotter 的研究者/工程师，说明工具的使用流程、参数、生成结果与方法学原理。本文档与当前代码严格对应：
+- 评分主程序：`scripts/bttoxin_shoter.py`
+- 绘图与报告：`scripts/plot_shotter.py`
+
+如需论文式报告，绘图脚本已支持“paper”模式（中英文可选）。
+
+---
+
+## 一、快速上手（Quick Start）
+
+1. 准备输入
+   - BPPRC Specificity Database（CSV），路径默认：`Data/toxicity-data.csv`
+   - BtToxin_Digger 产物 `All_Toxins.txt`，例如：`tests/output/Results/Toxins/All_Toxins.txt`
+
+2. 运行评分（Shotter）
+```bash
+python scripts/bttoxin_shoter.py \
+  --toxicity_csv Data/toxicity-data.csv \
+  --all_toxins tests/output/Results/Toxins/All_Toxins.txt \
+  --output_dir shotter_outputs \
+  --min_identity 0.50 \
+  --min_coverage 0.60 \
+  --disallow_unknown_families \
+  --require_index_hit
+```
+输出将保存在 `shotter_outputs/`（或可写的回退路径）。
+
+3. 绘图与报告
+```bash
+python scripts/plot_shotter.py \
+  --strain_scores shotter_outputs/strain_target_scores.tsv \
+  --toxin_support shotter_outputs/toxin_support.tsv \
+  --species_scores shotter_outputs/strain_target_species_scores.tsv \
+  --out_dir shotter_outputs \
+  --merge_unresolved \
+  --per_hit_strain <你的菌株名> \
+  --report_mode paper \
+  --lang zh
+```
+生成：
+- 热图：`strain_target_scores.png`、（可选）`per_hit_<strain>.png`、（可选）`strain_target_species_scores.png`
+- 论文式报告：`shotter_report_paper.md`（可用 pandoc 导出 HTML/PDF）
+
+---
+
+## 二、输出文件说明
+- `toxin_support.tsv`
+  - 每条命中（Hit）的详细信息与对各“昆虫目”的贡献列，含 TopOrder/TopScore。
+- `strain_target_scores.tsv`
+  - 每个菌株在各“昆虫目”的合成分数，含 TopOrder/TopScore。
+- `strain_scores.json`
+  - 与上同内容的 JSON 版本。
+- 若 CSV 含 `target_species`：
+  - `strain_target_species_scores.tsv`、`strain_species_scores.json`
+- 热图与报告：
+  - `strain_target_scores.png`、（可选）`per_hit_<strain>.png`、`strain_target_species_scores.png`
+  - `shotter_report_paper.md`（或 `shotter_summary.md`）
+
+---
+
+## 三、方法学与实现（与代码一致）
+
+### 3.1 数据来源与阳性筛选
+- BPPRC Specificity Database CSV 仅保留 `activity == "Yes"` 的阳性样本作为证据（`SpecificityIndex.from_csv`）。
+- 若存在 `target_species` 列，则同时建立物种维度的特异性分布。
+
+### 3.2 效价（potency）映射与归一
+- 字段：`lc50` 与/或 `percentage_mortality`。
+- 单位归一：`units` 中的 `ppm` 合并到 `ug/g` 的桶（diet 语境），在同一个单位桶内分布。
+- 规则（`_potency_from_row`）：
+  - 若 `lc50` 为数值：先保留，稍后在单位桶内按分位做“越小越强”的反排归一，得到 [0,1]。
+  - 否则按 `percentage_mortality` 文本粗映射：>80%→0.9；60-100%/50-80%→~0.65；低效/部分效应→~0.25。
+  - 全缺省但为阳性 → 0.55。
+
+### 3.3 特异性索引：P(order|·) 与 P(species|·)
+- 对“蛋白名称 name”聚合：按 `target_order`（或 `target_species`）对 `_potency` 求和并整体归一。
+- 回退聚合：若名称无分布，则尝试亚家族（首字母）、再尝试家族（前缀+数字）。
+- 由此得到：`name_to_orders`、`subfam_to_orders`、`fam_to_orders`；以及（可选）物种维度映射。
+
+### 3.4 Digger 命中解析与家族判定
+- 读取 `All_Toxins.txt`，标准化字段，并计算：
+  - `coverage = Aln_length / Hit_length`；`identity01 = Identity / 100`
+  - `HMM` 转布尔（`YES` → True）
+  - `Hit_id_norm`：移除后缀 `-other` 等噪声（`normalize_hit_id`）
+  - `family_key`：使用正则在命名中解析家族与亚家族（支持：Cry/Cyt/Vip/Vpa/Vpb/Mpp/Tpp/Spp/App/Mcf/Mpf/Pra/Prb/Txp/Gpp/Mtx/Xpp）。解析失败标记为 `unknown`。
+
+### 3.5 命中权重（w_hit）计算（`compute_similarity_weight`）
+- `base(identity, coverage)`：
+  - 若 `identity≥0.78` 且 `coverage≥0.8`：`base=1.0`
+  - 若 `0.45≤identity<0.78`：`base=(identity-0.45)/(0.78-0.45)`（截断到 [0,1]）
+  - 否则 `base=0`
+- `w = min(1, base×coverage + 0.1×I(HMM))`
+- 配对要求（`partner_fulfilled_for_hit`）：
+  - 家族层面启发式对（Vip1-Vip2、Vpa-Vpb、BinA-BinB）。
+  - 若需配对但该菌株内不存在满足 `w≥0.3` 的搭档，`w×=0.2` 作为惩罚。
+
+### 3.6 贡献计算与菌株层合成（noisy-OR）
+- 对命中 `i`，在目标目 `o` 的贡献：`c_i(o) = w_i × P(o|·)`；
+- 菌株层合成：`score(o) = 1 - ∏_i (1 - c_i(o))`；
+- `other` 与 `unknown` 桶：
+  - 若能解析家族但在索引中没有任何证据 → 贡献计入 `other`；
+  - 若无法解析出家族前缀 → 贡献计入 `unknown`；
+  - Top 排名时优先在真实“目标目”中选择，不以 `other/unknown` 决定 Top（除非没有任何真实目分数）。
+
+### 3.7 物种维度（可选）
+- 若 CSV 含 `target_species`，同样按上式计算 `species` 的潜在活性分数与 TopSpecies。
+
+### 3.8 Top 列含义
+- `TopOrder/TopScore`（命中或菌株）：在“目标目”维度下的最高分及数值，用于快速汇报与排序。
+- `TopSpecies/TopSpeciesScore`：在“物种”维度下的最高分及数值（若存在物种维度）。
+
+---
+
+## 四、命令行参数（CLI）
+
+### 4.1 评分程序：`scripts/bttoxin_shoter.py`
+- 必要输入
+  - `--toxicity_csv`：BPPRC CSV（默认：`Data/toxicity-data.csv`）
+  - `--all_toxins`：Digger 的 `All_Toxins.txt`（默认示例路径见代码）
+  - `--output_dir`：输出目录（自动检测可写路径并回退）
+- 过滤与阈值（与代码一致）
+  - `--min_identity [0-1]`、`--min_coverage [0-1]`
+  - `--allow_unknown_families`（默认启用）/ `--disallow_unknown_families`
+  - `--require_index_hit`：仅保留能在特异性索引中回退到 name/亚家族/家族之一的命中
+
+### 4.2 绘图与报告：`scripts/plot_shotter.py`
+- 基本输入
+  - `--strain_scores`、`--toxin_support`、`--species_scores`（可选）
+  - `--out_dir`、`--cmap`、`--vmin`、`--vmax`、`--figsize`
+- 可视化与报告
+  - `--merge_unresolved`：将 `other+unknown` 合并为 `unresolved`（仅影响可视化/汇总）
+  - `--per_hit_strain <Strain>`：仅用于绘制“单菌株每命中×目标目”的热图，不改变计算
+  - `--report_mode {paper,summary}`：论文式或简版报告（默认 paper）
+  - `--lang {zh,en}`：报告语言（默认 zh）
+  - `--summary_md <PATH>`：报告写入路径（不指定则自动命名）
+
+---
+
+## 五、常见问题（FAQ）
+
+- Q：`other` 与 `unknown` 有何区别？
+  - A：`other` 表示能解析出家族名，但在 BPPRC 特异性索引里没有任何证据；`unknown` 表示连家族前缀都解析不出来。两者都不能合理分配到具体“昆虫目”，但含义不同，有助溯源是“命名/解析问题”还是“索引覆盖不足”。
+
+- Q：这是否意味着“都不知道杀什么虫子”？
+  - A：是“没有证据支撑分配到具体目”，不等于“无活性”。`other/unknown` 不会增加任何真实“目”的分数，仅计为未解析部分。
+
+- Q：如何减少 `other/unknown`？
+  - A：
+    - 提高阈值：`--min_identity`、`--min_coverage`；
+    - 过滤：`--require_index_hit`、`--disallow_unknown_families`；
+    - 清洗命名、补充 BPPRC 数据或构建别名映射。
+
+- Q：`--per_hit_strain` 是做什么的？
+  - A：仅用于选择一个菌株生成“每命中×目标目”的热图，帮助解释哪些命中支撑了某个 TopOrder/TopScore。不会改变任何评分结果。
+
+- Q：`TopOrder/TopScore` 是什么？
+  - A：在“目标目”维度下的最高分及其数值（命中或菌株），用于快速汇报与排序。若存在真实“目”，Top 不会选 `other/unknown`。
+
+---
+
+## 六、示例（Examples）
+
+- 推荐的初筛参数：
+```bash
+python scripts/bttoxin_shoter.py \
+  --toxicity_csv Data/toxicity-data.csv \
+  --all_toxins tests/output/Results/Toxins/All_Toxins.txt \
+  --output_dir shotter_outputs \
+  --min_identity 0.50 --min_coverage 0.60 \
+  --disallow_unknown_families --require_index_hit
+```
+- 绘图与报告（paper/中文）：
+```bash
+python scripts/plot_shotter.py \
+  --strain_scores shotter_outputs/strain_target_scores.tsv \
+  --toxin_support shotter_outputs/toxin_support.tsv \
+  --species_scores shotter_outputs/strain_target_species_scores.tsv \
+  --out_dir shotter_outputs \
+  --merge_unresolved \
+  --per_hit_strain C15 \
+  --report_mode paper --lang zh
+```
+- 将报告导出为 HTML（示例）：
+```bash
+pandoc -s shotter_outputs/shotter_report_paper.md -o shotter_outputs/shotter_report_paper.html
+```
+
+---
+
+## 七、再现性与路径策略
+- 输出目录写入策略：优先 `--output_dir`；不可写则回退到 `<output_dir>/Shotter`；再回退到 `./shotter_outputs/`。
+- 若未提供 `species_scores` 或 CSV 无 `target_species`，物种热图与报告章节将自动跳过。
+
+---
+
+## 八、附录：家族/亚家族解析
+- 受支持的家族前缀（正则）：
+  - `Cry`、`Cyt`、`Vip`、`Vpa`、`Vpb`、`Mpp`、`Tpp`、`Spp`、`App`、`Mcf`、`Mpf`、`Pra`、`Prb`、`Txp`、`Gpp`、`Mtx`、`Xpp`
+- 例如：
+  - `Cry1Ac1` → family=`Cry1`，subfamily=`Cry1A`
+  - `Bmp1-other` 标准化后 → `Bmp1`（若索引中无证据，将落入 `other`）
+
+---
+
+如需在“计算层面”将 `other+unknown` 永久合并为 `unresolved`，或希望在报告中插入自定义方法学/参考文献段落，请提出需求，我可在现有脚本中加入相应开关与模板扩展。
diff --git a/docs/shotter_workflow.md b/docs/shotter_workflow.md
new file mode 100644
index 0000000..19418ea
--- /dev/null
+++ b/docs/shotter_workflow.md
@@ -0,0 +1,155 @@
+# Bttoxin_Shoter v1 workflow and formulas
+
+This document describes how Bttoxin_Shoter converts BtToxin_Digger outputs into predicted target insect orders and strain-level scores, and the rationale behind each step.
+
+## Inputs
+- Digger outputs (from tests/output/Results/Toxins):
+  - All_Toxins.txt (tab-separated summary across strains)
+- Specificity data (from BPPRC, placed in Data/):
+  - Data/toxicity-data.csv (positive activity records)
+- Optional reference FASTA:
+  - Data/holotype_fasta_sequences.txt (holotype representatives; not required for v1 scoring)
+
+## Outputs (default under output_dir / shotter_outputs)
+- toxin_support.tsv
+  Per-hit contributions and metadata.
+- strain_target_scores.tsv
+  Per-strain target-order scores (0–1) across all known orders.
+- strain_scores.json
+  Same as TSV in JSON form.
+ - If CSV contains target_species:
+   - strain_target_species_scores.tsv
+   - strain_species_scores.json
+
+## CLI (scoring)
+```
+python scripts/bttoxin_shoter.py \
+  --toxicity_csv Data/toxicity-data.csv \
+  --all_toxins tests/output/Results/Toxins/All_Toxins.txt \
+  --output_dir shotter_outputs \
+  --min_identity 0.50 \
+  --min_coverage 0.60 \
+  --disallow_unknown_families \
+  --require_index_hit
+```
+Flags explained
+- --min_identity, --min_coverage: filter weak hits (0–1)
+- --allow_unknown_families/--disallow_unknown_families: keep or drop unparseable families
+- --require_index_hit: keep only hits that can back-off to name/subfamily/family in the specificity index
+
+## High-level pipeline
+
+```mermaid
+flowchart TD
+  A["Genome .fna"]
+  subgraph B["BtToxin_Digger (Docker)"]
+    direction TB
+    B0["ORF/CDS prediction"]
+    B1t["Translate CDS to proteins"]
+    B2t["Toxin identification (BLAST/HMM)"]
+    B0 --> B1t --> B2t
+  end
+  A --> B0
+  B2t --> AT[All_Toxins.txt]
+  B2t --> LG[*.list / *.gbk]
+  C[BPPRC specificity CSV] --> D[Build specificity index]
+  AT --> E[Per-hit parsing]
+  D --> F[Per-hit scoring]
+  E --> F
+  F --> G[Per-strain combination]
+  G --> H[Outputs: toxin_support.tsv, strain_target_scores.tsv, strain_scores.json]
+```
+
+ASCII fallback:
+- Genome -> Digger: ORF/CDS prediction -> Translate to proteins -> Toxin identification (BLAST/HMM) -> All_Toxins.txt, *.list, *.gbk
+- toxicity-data.csv -> Build index
+- All_Toxins -> Per-hit parsing -> Per-hit scoring (with index) -> Strain-level combine -> Outputs
+
+- **是否把基因序列转成“识别好的”蛋白序列**
+  是的。在本项目的用法中我们用核酸输入运行 BtToxin_Digger（见 scripts/test_bttoxin_digger.py 的 `sequence_type="nucl"`）。Digger 会在容器里做 ORF/CDS 预测，把核酸翻译成蛋白，再用 BLAST/HMM 对蛋白进行毒素识别。识别结果被汇总到 `All_Toxins.txt`，其中包含每条蛋白与某个毒素条目的匹配信息（如 `Identity`、`Aln_length`、`Hit_length`、`HMM` 等；参见 scripts/bttoxin_shoter.py 的解析逻辑）。
+
+- **`*.list` `*.gbk` 的用途**
+
+  - **`*.list`**：每个菌株（样本）的毒素命中列表，便于快速查看此菌株命中了哪些毒素名称/家族、对应的蛋白 ID/坐标等，常用于人工复核与回溯。
+  - **`*.gbk`**：带注释的 GenBank 文件，包含序列和注释特征（CDS、蛋白翻译、坐标等），便于在基因组浏览器或其他注释/二次分析工具中查看与下游使用。
+
+- 后续靶点分析/打分是否需要 `*.list`  `*.gbk`
+  不需要。本仓库的打分脚本 `scripts/bttoxin_shoter.py` 只读取 `All_Toxins.txt`（见 docs/shotter_workflow.md 的流程图，只有 `All_Toxins.txt` 被接入打分环节）。`*.list` 和 `*.gbk` 是 Digger 的配套产物，适合做人工检查、回溯坐标与提取序列，但不作为打分脚本的输入。
+
+## Name and family parsing
+- Extract family/subfamily keys from names (e.g., Cry1Ac1 -> family Cry1, subfamily Cry1A). Names like `5618-Cry1Ia` are supported.
+- Hit IDs such as `Xxx-other` are normalized by removing the `-other` suffix.
+
+## Specificity index from CSV (positive-only)
+- Keep only rows where `activity == Yes`.
+- Compute a potency weight per row:
+  - If `lc50` present: normalize within unit buckets by inverted percentile: `potency = 1 - rank_percentile(lc50)`.
+  - If `percentage_mortality` present: map to coarse weights
+    - ≥80% or phrases like "greater than 80%" -> 0.90
+    - 50–80% (or 60–80%, 60–100%) -> 0.65
+    - 0–60%, 0–50%, 25%, or vague "some"/"stunting" -> 0.25
+  - Otherwise default positive evidence -> 0.55
+- Aggregate to distributions P(order | name), P(order | subfamily), P(order | family) by summing potencies per `target_order` then normalizing to 1.
+- Unit harmonization rationale: diet incorporation studies often report `ppm` roughly equivalent to `µg/g`. We place `ppm, µg/g, ug/g, μg/g` into one `ug_per_g` bucket for within-bucket normalization. Other units stay in their own buckets.
+
+## Partner protein logic
+- Some families require partners for activity (e.g., Vip1/Vip2, Vpa/Vpb). A hit from such a family is down-weighted by 0.2 unless a partner family is also present in the same strain with a non-trivial similarity weight (≥0.3).
+- This is applied at family prefix level and is independent of CSV `partnerprotein` text.
+
+## Similarity weight (per hit)
+Let `id` = identity (0–1), `cov` = alignment coverage, `hmm` = HMM flag.
+- Base identity component:
+  - if `id ≥ 0.78` and `cov ≥ 0.80`: `base = 1.0`
+  - elif `0.45 ≤ id < 0.78`: `base = (id - 0.45) / (0.78 - 0.45)`
+  - else: `base = 0.0`
+- Combine with coverage and HMM bonus:
+  - `w_hit = min(1, base * cov + (0.1 if hmm else 0))`
+- Rationale: 0.45 and 0.78 are the conventional Bt nomenclature thresholds for family and subfamily; coverage ≥0.8 protects against partial hits; HMM bonus rewards domain-level confirmation without overpowering identity/coverage.
+
+## Per-hit order contributions and species (optional)
+- For each hit, get a distribution `P(order | name/family)` from the index, with back-off: name -> subfamily -> family. If still unknown, contributions are sent to the `other` bucket.
+- Hit contribution to an order: `c(order) = w_hit × P(order | name/family)`.
+ - If `target_species` is present in the CSV, compute in parallel on species: `c(species) = w_hit × P(species | ·)`
+
+## Strain-level combination
+- Across all hits for a strain, we use noisy-OR:
+  - `Score(strain, order) = 1 - Π_hits (1 - c(order))`
+- This stays in [0,1], increases with multiple concordant hits, and avoids double-counting.
+
+## Special buckets and Top columns
+- `other`: used when a hit has a parsable family (e.g., Cry1, Vip3) but no mapping evidence is found in the specificity index at any level (name/subfamily/family).
+- `unknown`: used when the hit's family cannot be parsed from its name, so no meaningful back-off is possible.
+ - TopOrder/TopScore (per hit and per strain): the max contributing order and its score. Strain-level tops prefer canonical orders over `other/unknown` when present.
+
+## Files and dataclasses
+- Dataclasses:
+  - `ToxinHit(strain, protein_id, hit_id, identity, aln_length, hit_length, coverage, hmm, family, name_key, partner_fulfilled, weight, order_contribs, top_order, top_score)`
+    - Methods: `to_tsv_row`, `save_list_tsv(...)`
+  - `StrainScores(strain, order_scores, top_order, top_score)`
+    - Methods: `to_tsv_row`, `save_list_tsv(...)`, `save_list_json(...)`
+  - `StrainSpeciesScores(strain, species_scores, top_species, top_species_score)` (if species dimension is present)
+    - Methods: `to_tsv_row`, `save_list_tsv(...)`, `save_list_json(...)`
+
+## Plotting and report
+Use `scripts/plot_shotter.py` to render heatmaps and write a Markdown report.
+```
+python scripts/plot_shotter.py \
+  --strain_scores shotter_outputs/strain_target_scores.tsv \
+  --toxin_support shotter_outputs/toxin_support.tsv \
+  --species_scores shotter_outputs/strain_target_species_scores.tsv \
+  --out_dir shotter_outputs \
+  --merge_unresolved \
+  --per_hit_strain <Strain> \
+  --report_mode paper \
+  --lang en
+```
+Options
+- --merge_unresolved: merge `other+unknown` into `unresolved` for visualization only (computation unchanged)
+- --per_hit_strain: pick one strain to show per-hit × order heatmap (interpretation aid)
+- --report_mode: paper or summary; --lang: zh/en
+
+## Notes and limitations
+- Only positive activity records are used from the specificity CSV; negative records are ignored per design.
+- The partner list is heuristic; extend as needed.
+- If no CSV evidence exists for a family, hits are routed to `other` and do not inflate canonical orders.
+- v2 can add HMM family models and phylogenetic placement for improved back-off.
diff --git a/docs/shotter_workflow_zh.md b/docs/shotter_workflow_zh.md
new file mode 100644
index 0000000..574205c
--- /dev/null
+++ b/docs/shotter_workflow_zh.md
@@ -0,0 +1,120 @@
+# Bttoxin_Shoter v1 工作流（中文）
+
+本文档说明 Shotter 如何将 BtToxin_Digger 的识别结果转换为“靶标目”的预测与打分，并给出变量定义与计算理由。
+
+## 输入与输出
+- 输入（来自 Digger）
+  - tests/output/Results/Toxins/All_Toxins.txt（必须）
+  - 说明：该文件包含每条蛋白命中的毒素条目、Identity、比对长度、HMM 标记等。
+- 输入（来自 BPPRC）
+  - Data/toxicity-data.csv（正向活性记录）
+- 输出（由 Shotter 生成）
+  - toxin_support.tsv：逐命中的“命中×目”贡献矩阵（per-hit）
+  - strain_target_scores.tsv：每个菌株的“菌株×目”分数矩阵（per-strain）
+  - strain_scores.json：同上 JSON 版
+  - 若 CSV 含 target_species：
+    - strain_target_species_scores.tsv：每个菌株的“菌株×物种”分数矩阵
+    - strain_species_scores.json：同上 JSON 版
+
+## 总体流程
+```mermaid
+flowchart TD
+  A[Genome .fna] --> B0[ORF/CDS 预测]
+  B0 --> B1[CDS 翻译为蛋白]
+  B1 --> B2[毒素识别 (BLAST/HMM/SVM)]
+  B2 --> AT[All_Toxins.txt]
+  C[BPPRC specificity CSV] --> D[构建靶标分布索引]
+  AT --> E[逐命中解析]
+  D --> F[逐命中打分]
+  E --> F
+  F --> G[菌株层合成]
+  G --> H[toxin_support.tsv / strain_target_scores.tsv / strain_scores.json]
+```
+
+## 变量与符号说明
+- id（Identity）：Digger 的 Identity 字段/100，范围 [0,1]
+- Aln_length：比对长度；Hit_length：参考长度
+- cov（Coverage）：cov = Aln_length / Hit_length，范围 [0,1]
+- hmm：Digger HMM 列，YES→1/NO→0
+- base：基于 id 与阈值的分段线性权重
+- w_hit：单命中的相似性权重（结合了 base、coverage、HMM）
+- P(order | name/subfamily/family)：从 CSV 学得的“蛋白名/亚家族/家族 → 目标目分布”
+- c(order)：单命中对某“目”的贡献
+- Score(strain, order)：某菌株对“目”的综合分数（noisy-OR 合成）
+
+## CSV→靶标分布：P(order | ·) 的构建
+1. 仅保留阳性记录（activity==Yes）。
+2. 为每条记录计算“证据强度”potency：
+   - 若含 lc50：在单位桶内按反分位归一：potency = 1 - rank_percentile(lc50)。理由：LC50 越低毒力越强，反分位能将不同条目在同单位下归一到 [0,1]。
+   - 若含 percentage_mortality：文本→分数：
+     - ≥80% 或“greater than 80%”→0.90
+     - 50–80%（含 60–80%、60–100%）→0.65
+     - 0–60%/0–50%/“25%”/“some”/“stunting”→0.25
+   - 否则（仅“有活性”）→0.55
+3. 单位合并与理由：饮食掺喂实验中 ppm≈µg/g，故将 ppm/µg/g/ug/g/μg/g 合并到同一桶做归一；其它单位各自成桶，避免不合理跨单位比较。
+4. 以“蛋白名 name”“亚家族（如 Cry1A）”“家族（如 Cry1）”三个层级分别聚合至各 target_order 的 potency 和，并在每层归一化为分布 P(order | ·)。
+
+## 相似性权重 w_hit（为什么这样算）
+- 分段阈值来源：Bt 命名规范广泛使用 ~0.45/0.78 作为家族/亚家族划界。
+- 覆盖度约束：≥0.8 能抑制“碎片/部分结构域”的误判。
+- HMM 加分：HMM 表明结构域级别的支持，但不应压倒序列一致性，因此只 +0.1 尾数，且上限 1。 
+- 具体公式：
+  - 若 id ≥ 0.78 且 cov ≥ 0.80：base = 1.0
+  - elif 0.45 ≤ id < 0.78：base = (id - 0.45) / (0.78 - 0.45)
+  - else：base = 0.0
+  - w_hit = min(1, base × cov + 0.1 × hmm)
+- 配对家族（如 Vip1/Vip2、Vpa/Vpb）：若同菌株内未检测到配对家族且其权重≥0.3，则下调：w_hit := 0.2 × w_hit（保留弱证据）。
+
+## 单命中→各“目”的贡献 c(order)
+- 回退顺序：name → subfamily → family；若能解析到家族但无证据，贡献计入 other；若无法解析家族前缀，贡献计入 unknown。
+- c(order) = w_hit × P(order | name/subfamily/family)
+
+## 菌株层合成（noisy-OR）
+- Score(strain, order) = 1 − Π_hits (1 − c(order))
+- 理由：在 0–1 范围内自然累积多条独立证据，避免简单相加的“溢出/重复计数”。
+
+## 特殊列
+- other：可解析出家族但 CSV 无证据的命中所汇总的贡献。
+- unknown：无法解析家族前缀的命中所汇总的贡献。
+
+## 结果解读
+- per-hit（toxin_support.tsv）：检查具体哪条毒蛋白对哪些“目”贡献较大；Weight（w_hit）综合反映一致性、覆盖与 HMM；PartnerFulfilled=NO 时仅做弱证据。
+- per-strain（strain_target_scores.tsv）：看每个菌株对哪些“目”分数高；可取 Top-N 或设阈值（如 ≥0.5）标注高置信目标。
+
+## 可视化与热力图
+- 建议用热力图查看：
+  - 菌株×“目”分数热力图（per-strain 矩阵）
+  - 单菌株命中×“目”贡献热力图（per-hit 矩阵）
+- 我们提供了独立脚本 scripts/plot_shotter.py 以生成 PNG：
+  - 避免将绘图依赖耦合到核心打分脚本；
+  - 也便于以后替换成别的可视化方案。
+
+### 物种维度热图与报告
+- 若生成了 `strain_target_species_scores.tsv`，绘图脚本会输出菌株×物种热图。
+- 绘图脚本还支持论文式报告（`--report_mode paper`，`--lang zh|en`），自动汇总 Top 列与热图。
+
+### 展示开关
+- `--merge_unresolved`：仅在可视化层面合并 `other+unknown` → `unresolved`（不改变计算）。
+- `--per_hit_strain <Strain>`：选择一个菌株绘制“每命中×目标目”的贡献热图，用于解释该菌株的 TopOrder/TopScore 来源。
+
+## Top 列与未解析桶
+- TopOrder/TopScore（命中与菌株）：在“目标目”维度下的最高分及其数值。
+- 若存在真实“目”，Top 排名会优先在真实“目”中选择，不以 `other/unknown` 决定 Top。
+- other：家族可解析但索引无证据；unknown：家族不可解析。
+
+## 命令行参数（评分脚本 bttoxin_shoter.py）
+- 基本输入：`--toxicity_csv`，`--all_toxins`，`--output_dir`。
+- 过滤与阈值：
+  - `--min_identity [0-1]`，`--min_coverage [0-1]`
+  - `--allow_unknown_families`（默认） / `--disallow_unknown_families`
+  - `--require_index_hit`：仅保留可在索引中回退到 name/亚家族/家族之一的命中。
+
+## 物种维度（可选）
+- 若 `toxicity-data.csv` 含 `target_species`，将构建 `P(species | name/subfamily/family)` 并用与“目标目”相同的公式合成菌株层分数：
+  - 命中贡献：`c(species) = w_hit × P(species | ·)`
+  - 菌株层合成：`Score(strain, species) = 1 − Π_hits(1 − c(species))`
+
+## 常见问答
+- 是否需要再做 BLAST？不需要，直接使用 Digger 的相似性结果。
+- 为什么只用阳性记录？与“无活性/阴性”混用容易引入语义歧义，v1 选择保守做法：阳性证据做加权统计。
+- 为什么 noisy-OR？它能够把多条独立证据以概率意义进行合成，避免简单相加导致的>1或重复计数。