Files
bttoxin-pipeline/docs/shotter_workflow.md
2025-11-21 20:23:15 +08:00

156 lines
8.4 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Bttoxin_Shoter v1 workflow and formulas
This document describes how Bttoxin_Shoter converts BtToxin_Digger outputs into predicted target insect orders and strain-level scores, and the rationale behind each step.
## Inputs
- Digger outputs (from tests/output/Results/Toxins):
- All_Toxins.txt (tab-separated summary across strains)
- Specificity data (from BPPRC, placed in Data/):
- Data/toxicity-data.csv (positive activity records)
- Optional reference FASTA:
- Data/holotype_fasta_sequences.txt (holotype representatives; not required for v1 scoring)
## Outputs (default under output_dir / shotter_outputs)
- toxin_support.tsv
Per-hit contributions and metadata.
- strain_target_scores.tsv
Per-strain target-order scores (01) across all known orders.
- strain_scores.json
Same as TSV in JSON form.
- If CSV contains target_species:
- strain_target_species_scores.tsv
- strain_species_scores.json
## CLI (scoring)
```
python scripts/bttoxin_shoter.py \
--toxicity_csv Data/toxicity-data.csv \
--all_toxins tests/output/Results/Toxins/All_Toxins.txt \
--output_dir shotter_outputs \
--min_identity 0.50 \
--min_coverage 0.60 \
--disallow_unknown_families \
--require_index_hit
```
Flags explained
- --min_identity, --min_coverage: filter weak hits (01)
- --allow_unknown_families/--disallow_unknown_families: keep or drop unparseable families
- --require_index_hit: keep only hits that can back-off to name/subfamily/family in the specificity index
## High-level pipeline
```mermaid
flowchart TD
A["Genome .fna"]
subgraph B["BtToxin_Digger (Docker)"]
direction TB
B0["ORF/CDS prediction"]
B1t["Translate CDS to proteins"]
B2t["Toxin identification (BLAST/HMM)"]
B0 --> B1t --> B2t
end
A --> B0
B2t --> AT[All_Toxins.txt]
B2t --> LG[*.list / *.gbk]
C[BPPRC specificity CSV] --> D[Build specificity index]
AT --> E[Per-hit parsing]
D --> F[Per-hit scoring]
E --> F
F --> G[Per-strain combination]
G --> H[Outputs: toxin_support.tsv, strain_target_scores.tsv, strain_scores.json]
```
ASCII fallback:
- Genome -> Digger: ORF/CDS prediction -> Translate to proteins -> Toxin identification (BLAST/HMM) -> All_Toxins.txt, *.list, *.gbk
- toxicity-data.csv -> Build index
- All_Toxins -> Per-hit parsing -> Per-hit scoring (with index) -> Strain-level combine -> Outputs
- **是否把基因序列转成“识别好的”蛋白序列**
是的。在本项目的用法中我们用核酸输入运行 BtToxin_Digger见 scripts/test_bttoxin_digger.py 的 `sequence_type="nucl"`。Digger 会在容器里做 ORF/CDS 预测,把核酸翻译成蛋白,再用 BLAST/HMM 对蛋白进行毒素识别。识别结果被汇总到 `All_Toxins.txt`,其中包含每条蛋白与某个毒素条目的匹配信息(如 `Identity``Aln_length``Hit_length``HMM` 等;参见 scripts/bttoxin_shoter.py 的解析逻辑)。
- **`*.list` `*.gbk` 的用途**
- **`*.list`**:每个菌株(样本)的毒素命中列表,便于快速查看此菌株命中了哪些毒素名称/家族、对应的蛋白 ID/坐标等,常用于人工复核与回溯。
- **`*.gbk`**:带注释的 GenBank 文件包含序列和注释特征CDS、蛋白翻译、坐标等便于在基因组浏览器或其他注释/二次分析工具中查看与下游使用。
- 后续靶点分析/打分是否需要 `*.list` `*.gbk`
不需要。本仓库的打分脚本 `scripts/bttoxin_shoter.py` 只读取 `All_Toxins.txt`(见 docs/shotter_workflow.md 的流程图,只有 `All_Toxins.txt` 被接入打分环节)。`*.list``*.gbk` 是 Digger 的配套产物,适合做人工检查、回溯坐标与提取序列,但不作为打分脚本的输入。
## Name and family parsing
- Extract family/subfamily keys from names (e.g., Cry1Ac1 -> family Cry1, subfamily Cry1A). Names like `5618-Cry1Ia` are supported.
- Hit IDs such as `Xxx-other` are normalized by removing the `-other` suffix.
## Specificity index from CSV (positive-only)
- Keep only rows where `activity == Yes`.
- Compute a potency weight per row:
- If `lc50` present: normalize within unit buckets by inverted percentile: `potency = 1 - rank_percentile(lc50)`.
- If `percentage_mortality` present: map to coarse weights
- ≥80% or phrases like "greater than 80%" -> 0.90
- 5080% (or 6080%, 60100%) -> 0.65
- 060%, 050%, 25%, or vague "some"/"stunting" -> 0.25
- Otherwise default positive evidence -> 0.55
- Aggregate to distributions P(order | name), P(order | subfamily), P(order | family) by summing potencies per `target_order` then normalizing to 1.
- Unit harmonization rationale: diet incorporation studies often report `ppm` roughly equivalent to `µg/g`. We place `ppm, µg/g, ug/g, μg/g` into one `ug_per_g` bucket for within-bucket normalization. Other units stay in their own buckets.
## Partner protein logic
- Some families require partners for activity (e.g., Vip1/Vip2, Vpa/Vpb). A hit from such a family is down-weighted by 0.2 unless a partner family is also present in the same strain with a non-trivial similarity weight (≥0.3).
- This is applied at family prefix level and is independent of CSV `partnerprotein` text.
## Similarity weight (per hit)
Let `id` = identity (01), `cov` = alignment coverage, `hmm` = HMM flag.
- Base identity component:
- if `id ≥ 0.78` and `cov ≥ 0.80`: `base = 1.0`
- elif `0.45 ≤ id < 0.78`: `base = (id - 0.45) / (0.78 - 0.45)`
- else: `base = 0.0`
- Combine with coverage and HMM bonus:
- `w_hit = min(1, base * cov + (0.1 if hmm else 0))`
- Rationale: 0.45 and 0.78 are the conventional Bt nomenclature thresholds for family and subfamily; coverage ≥0.8 protects against partial hits; HMM bonus rewards domain-level confirmation without overpowering identity/coverage.
## Per-hit order contributions and species (optional)
- For each hit, get a distribution `P(order | name/family)` from the index, with back-off: name -> subfamily -> family. If still unknown, contributions are sent to the `other` bucket.
- Hit contribution to an order: `c(order) = w_hit × P(order | name/family)`.
- If `target_species` is present in the CSV, compute in parallel on species: `c(species) = w_hit × P(species | ·)`
## Strain-level combination
- Across all hits for a strain, we use noisy-OR:
- `Score(strain, order) = 1 - Π_hits (1 - c(order))`
- This stays in [0,1], increases with multiple concordant hits, and avoids double-counting.
## Special buckets and Top columns
- `other`: used when a hit has a parsable family (e.g., Cry1, Vip3) but no mapping evidence is found in the specificity index at any level (name/subfamily/family).
- `unknown`: used when the hit's family cannot be parsed from its name, so no meaningful back-off is possible.
- TopOrder/TopScore (per hit and per strain): the max contributing order and its score. Strain-level tops prefer canonical orders over `other/unknown` when present.
## Files and dataclasses
- Dataclasses:
- `ToxinHit(strain, protein_id, hit_id, identity, aln_length, hit_length, coverage, hmm, family, name_key, partner_fulfilled, weight, order_contribs, top_order, top_score)`
- Methods: `to_tsv_row`, `save_list_tsv(...)`
- `StrainScores(strain, order_scores, top_order, top_score)`
- Methods: `to_tsv_row`, `save_list_tsv(...)`, `save_list_json(...)`
- `StrainSpeciesScores(strain, species_scores, top_species, top_species_score)` (if species dimension is present)
- Methods: `to_tsv_row`, `save_list_tsv(...)`, `save_list_json(...)`
## Plotting and report
Use `scripts/plot_shotter.py` to render heatmaps and write a Markdown report.
```
python scripts/plot_shotter.py \
--strain_scores shotter_outputs/strain_target_scores.tsv \
--toxin_support shotter_outputs/toxin_support.tsv \
--species_scores shotter_outputs/strain_target_species_scores.tsv \
--out_dir shotter_outputs \
--merge_unresolved \
--per_hit_strain <Strain> \
--report_mode paper \
--lang en
```
Options
- --merge_unresolved: merge `other+unknown` into `unresolved` for visualization only (computation unchanged)
- --per_hit_strain: pick one strain to show per-hit × order heatmap (interpretation aid)
- --report_mode: paper or summary; --lang: zh/en
## Notes and limitations
- Only positive activity records are used from the specificity CSV; negative records are ignored per design.
- The partner list is heuristic; extend as needed.
- If no CSV evidence exists for a family, hits are routed to `other` and do not inflate canonical orders.
- v2 can add HMM family models and phylogenetic placement for improved back-off.