156 lines
8.4 KiB
Markdown
156 lines
8.4 KiB
Markdown
# Bttoxin_Shoter v1 workflow and formulas
|
||
|
||
This document describes how Bttoxin_Shoter converts BtToxin_Digger outputs into predicted target insect orders and strain-level scores, and the rationale behind each step.
|
||
|
||
## Inputs
|
||
- Digger outputs (from tests/output/Results/Toxins):
|
||
- All_Toxins.txt (tab-separated summary across strains)
|
||
- Specificity data (from BPPRC, placed in Data/):
|
||
- Data/toxicity-data.csv (positive activity records)
|
||
- Optional reference FASTA:
|
||
- Data/holotype_fasta_sequences.txt (holotype representatives; not required for v1 scoring)
|
||
|
||
## Outputs (default under output_dir / shotter_outputs)
|
||
- toxin_support.tsv
|
||
Per-hit contributions and metadata.
|
||
- strain_target_scores.tsv
|
||
Per-strain target-order scores (0–1) across all known orders.
|
||
- strain_scores.json
|
||
Same as TSV in JSON form.
|
||
- If CSV contains target_species:
|
||
- strain_target_species_scores.tsv
|
||
- strain_species_scores.json
|
||
|
||
## CLI (scoring)
|
||
```
|
||
python scripts/bttoxin_shoter.py \
|
||
--toxicity_csv Data/toxicity-data.csv \
|
||
--all_toxins tests/output/Results/Toxins/All_Toxins.txt \
|
||
--output_dir shotter_outputs \
|
||
--min_identity 0.50 \
|
||
--min_coverage 0.60 \
|
||
--disallow_unknown_families \
|
||
--require_index_hit
|
||
```
|
||
Flags explained
|
||
- --min_identity, --min_coverage: filter weak hits (0–1)
|
||
- --allow_unknown_families/--disallow_unknown_families: keep or drop unparseable families
|
||
- --require_index_hit: keep only hits that can back-off to name/subfamily/family in the specificity index
|
||
|
||
## High-level pipeline
|
||
|
||
```mermaid
|
||
flowchart TD
|
||
A["Genome .fna"]
|
||
subgraph B["BtToxin_Digger (Docker)"]
|
||
direction TB
|
||
B0["ORF/CDS prediction"]
|
||
B1t["Translate CDS to proteins"]
|
||
B2t["Toxin identification (BLAST/HMM)"]
|
||
B0 --> B1t --> B2t
|
||
end
|
||
A --> B0
|
||
B2t --> AT[All_Toxins.txt]
|
||
B2t --> LG[*.list / *.gbk]
|
||
C[BPPRC specificity CSV] --> D[Build specificity index]
|
||
AT --> E[Per-hit parsing]
|
||
D --> F[Per-hit scoring]
|
||
E --> F
|
||
F --> G[Per-strain combination]
|
||
G --> H[Outputs: toxin_support.tsv, strain_target_scores.tsv, strain_scores.json]
|
||
```
|
||
|
||
ASCII fallback:
|
||
- Genome -> Digger: ORF/CDS prediction -> Translate to proteins -> Toxin identification (BLAST/HMM) -> All_Toxins.txt, *.list, *.gbk
|
||
- toxicity-data.csv -> Build index
|
||
- All_Toxins -> Per-hit parsing -> Per-hit scoring (with index) -> Strain-level combine -> Outputs
|
||
|
||
- **是否把基因序列转成“识别好的”蛋白序列**
|
||
是的。在本项目的用法中我们用核酸输入运行 BtToxin_Digger(见 scripts/test_bttoxin_digger.py 的 `sequence_type="nucl"`)。Digger 会在容器里做 ORF/CDS 预测,把核酸翻译成蛋白,再用 BLAST/HMM 对蛋白进行毒素识别。识别结果被汇总到 `All_Toxins.txt`,其中包含每条蛋白与某个毒素条目的匹配信息(如 `Identity`、`Aln_length`、`Hit_length`、`HMM` 等;参见 scripts/bttoxin_shoter.py 的解析逻辑)。
|
||
|
||
- **`*.list` `*.gbk` 的用途**
|
||
|
||
- **`*.list`**:每个菌株(样本)的毒素命中列表,便于快速查看此菌株命中了哪些毒素名称/家族、对应的蛋白 ID/坐标等,常用于人工复核与回溯。
|
||
- **`*.gbk`**:带注释的 GenBank 文件,包含序列和注释特征(CDS、蛋白翻译、坐标等),便于在基因组浏览器或其他注释/二次分析工具中查看与下游使用。
|
||
|
||
- 后续靶点分析/打分是否需要 `*.list` `*.gbk`
|
||
不需要。本仓库的打分脚本 `scripts/bttoxin_shoter.py` 只读取 `All_Toxins.txt`(见 docs/shotter_workflow.md 的流程图,只有 `All_Toxins.txt` 被接入打分环节)。`*.list` 和 `*.gbk` 是 Digger 的配套产物,适合做人工检查、回溯坐标与提取序列,但不作为打分脚本的输入。
|
||
|
||
## Name and family parsing
|
||
- Extract family/subfamily keys from names (e.g., Cry1Ac1 -> family Cry1, subfamily Cry1A). Names like `5618-Cry1Ia` are supported.
|
||
- Hit IDs such as `Xxx-other` are normalized by removing the `-other` suffix.
|
||
|
||
## Specificity index from CSV (positive-only)
|
||
- Keep only rows where `activity == Yes`.
|
||
- Compute a potency weight per row:
|
||
- If `lc50` present: normalize within unit buckets by inverted percentile: `potency = 1 - rank_percentile(lc50)`.
|
||
- If `percentage_mortality` present: map to coarse weights
|
||
- ≥80% or phrases like "greater than 80%" -> 0.90
|
||
- 50–80% (or 60–80%, 60–100%) -> 0.65
|
||
- 0–60%, 0–50%, 25%, or vague "some"/"stunting" -> 0.25
|
||
- Otherwise default positive evidence -> 0.55
|
||
- Aggregate to distributions P(order | name), P(order | subfamily), P(order | family) by summing potencies per `target_order` then normalizing to 1.
|
||
- Unit harmonization rationale: diet incorporation studies often report `ppm` roughly equivalent to `µg/g`. We place `ppm, µg/g, ug/g, μg/g` into one `ug_per_g` bucket for within-bucket normalization. Other units stay in their own buckets.
|
||
|
||
## Partner protein logic
|
||
- Some families require partners for activity (e.g., Vip1/Vip2, Vpa/Vpb). A hit from such a family is down-weighted by 0.2 unless a partner family is also present in the same strain with a non-trivial similarity weight (≥0.3).
|
||
- This is applied at family prefix level and is independent of CSV `partnerprotein` text.
|
||
|
||
## Similarity weight (per hit)
|
||
Let `id` = identity (0–1), `cov` = alignment coverage, `hmm` = HMM flag.
|
||
- Base identity component:
|
||
- if `id ≥ 0.78` and `cov ≥ 0.80`: `base = 1.0`
|
||
- elif `0.45 ≤ id < 0.78`: `base = (id - 0.45) / (0.78 - 0.45)`
|
||
- else: `base = 0.0`
|
||
- Combine with coverage and HMM bonus:
|
||
- `w_hit = min(1, base * cov + (0.1 if hmm else 0))`
|
||
- Rationale: 0.45 and 0.78 are the conventional Bt nomenclature thresholds for family and subfamily; coverage ≥0.8 protects against partial hits; HMM bonus rewards domain-level confirmation without overpowering identity/coverage.
|
||
|
||
## Per-hit order contributions and species (optional)
|
||
- For each hit, get a distribution `P(order | name/family)` from the index, with back-off: name -> subfamily -> family. If still unknown, contributions are sent to the `other` bucket.
|
||
- Hit contribution to an order: `c(order) = w_hit × P(order | name/family)`.
|
||
- If `target_species` is present in the CSV, compute in parallel on species: `c(species) = w_hit × P(species | ·)`
|
||
|
||
## Strain-level combination
|
||
- Across all hits for a strain, we use noisy-OR:
|
||
- `Score(strain, order) = 1 - Π_hits (1 - c(order))`
|
||
- This stays in [0,1], increases with multiple concordant hits, and avoids double-counting.
|
||
|
||
## Special buckets and Top columns
|
||
- `other`: used when a hit has a parsable family (e.g., Cry1, Vip3) but no mapping evidence is found in the specificity index at any level (name/subfamily/family).
|
||
- `unknown`: used when the hit's family cannot be parsed from its name, so no meaningful back-off is possible.
|
||
- TopOrder/TopScore (per hit and per strain): the max contributing order and its score. Strain-level tops prefer canonical orders over `other/unknown` when present.
|
||
|
||
## Files and dataclasses
|
||
- Dataclasses:
|
||
- `ToxinHit(strain, protein_id, hit_id, identity, aln_length, hit_length, coverage, hmm, family, name_key, partner_fulfilled, weight, order_contribs, top_order, top_score)`
|
||
- Methods: `to_tsv_row`, `save_list_tsv(...)`
|
||
- `StrainScores(strain, order_scores, top_order, top_score)`
|
||
- Methods: `to_tsv_row`, `save_list_tsv(...)`, `save_list_json(...)`
|
||
- `StrainSpeciesScores(strain, species_scores, top_species, top_species_score)` (if species dimension is present)
|
||
- Methods: `to_tsv_row`, `save_list_tsv(...)`, `save_list_json(...)`
|
||
|
||
## Plotting and report
|
||
Use `scripts/plot_shotter.py` to render heatmaps and write a Markdown report.
|
||
```
|
||
python scripts/plot_shotter.py \
|
||
--strain_scores shotter_outputs/strain_target_scores.tsv \
|
||
--toxin_support shotter_outputs/toxin_support.tsv \
|
||
--species_scores shotter_outputs/strain_target_species_scores.tsv \
|
||
--out_dir shotter_outputs \
|
||
--merge_unresolved \
|
||
--per_hit_strain <Strain> \
|
||
--report_mode paper \
|
||
--lang en
|
||
```
|
||
Options
|
||
- --merge_unresolved: merge `other+unknown` into `unresolved` for visualization only (computation unchanged)
|
||
- --per_hit_strain: pick one strain to show per-hit × order heatmap (interpretation aid)
|
||
- --report_mode: paper or summary; --lang: zh/en
|
||
|
||
## Notes and limitations
|
||
- Only positive activity records are used from the specificity CSV; negative records are ignored per design.
|
||
- The partner list is heuristic; extend as needed.
|
||
- If no CSV evidence exists for a family, hits are routed to `other` and do not inflate canonical orders.
|
||
- v2 can add HMM family models and phylogenetic placement for improved back-off.
|