Files
bttoxin-pipeline/docs/shotter_workflow.md
2025-11-21 20:23:15 +08:00

8.4 KiB
Raw Blame History

Bttoxin_Shoter v1 workflow and formulas

This document describes how Bttoxin_Shoter converts BtToxin_Digger outputs into predicted target insect orders and strain-level scores, and the rationale behind each step.

Inputs

  • Digger outputs (from tests/output/Results/Toxins):
    • All_Toxins.txt (tab-separated summary across strains)
  • Specificity data (from BPPRC, placed in Data/):
    • Data/toxicity-data.csv (positive activity records)
  • Optional reference FASTA:
    • Data/holotype_fasta_sequences.txt (holotype representatives; not required for v1 scoring)

Outputs (default under output_dir / shotter_outputs)

  • toxin_support.tsv Per-hit contributions and metadata.
  • strain_target_scores.tsv Per-strain target-order scores (01) across all known orders.
  • strain_scores.json Same as TSV in JSON form.
  • If CSV contains target_species:
    • strain_target_species_scores.tsv
    • strain_species_scores.json

CLI (scoring)

python scripts/bttoxin_shoter.py \
  --toxicity_csv Data/toxicity-data.csv \
  --all_toxins tests/output/Results/Toxins/All_Toxins.txt \
  --output_dir shotter_outputs \
  --min_identity 0.50 \
  --min_coverage 0.60 \
  --disallow_unknown_families \
  --require_index_hit

Flags explained

  • --min_identity, --min_coverage: filter weak hits (01)
  • --allow_unknown_families/--disallow_unknown_families: keep or drop unparseable families
  • --require_index_hit: keep only hits that can back-off to name/subfamily/family in the specificity index

High-level pipeline

flowchart TD
  A["Genome .fna"]
  subgraph B["BtToxin_Digger (Docker)"]
    direction TB
    B0["ORF/CDS prediction"]
    B1t["Translate CDS to proteins"]
    B2t["Toxin identification (BLAST/HMM)"]
    B0 --> B1t --> B2t
  end
  A --> B0
  B2t --> AT[All_Toxins.txt]
  B2t --> LG[*.list / *.gbk]
  C[BPPRC specificity CSV] --> D[Build specificity index]
  AT --> E[Per-hit parsing]
  D --> F[Per-hit scoring]
  E --> F
  F --> G[Per-strain combination]
  G --> H[Outputs: toxin_support.tsv, strain_target_scores.tsv, strain_scores.json]

ASCII fallback:

  • Genome -> Digger: ORF/CDS prediction -> Translate to proteins -> Toxin identification (BLAST/HMM) -> All_Toxins.txt, *.list, *.gbk

  • toxicity-data.csv -> Build index

  • All_Toxins -> Per-hit parsing -> Per-hit scoring (with index) -> Strain-level combine -> Outputs

  • 是否把基因序列转成“识别好的”蛋白序列 是的。在本项目的用法中我们用核酸输入运行 BtToxin_Digger见 scripts/test_bttoxin_digger.py 的 sequence_type="nucl"。Digger 会在容器里做 ORF/CDS 预测,把核酸翻译成蛋白,再用 BLAST/HMM 对蛋白进行毒素识别。识别结果被汇总到 All_Toxins.txt,其中包含每条蛋白与某个毒素条目的匹配信息(如 IdentityAln_lengthHit_lengthHMM 等;参见 scripts/bttoxin_shoter.py 的解析逻辑)。

  • *.list *.gbk 的用途

    • *.list:每个菌株(样本)的毒素命中列表,便于快速查看此菌株命中了哪些毒素名称/家族、对应的蛋白 ID/坐标等,常用于人工复核与回溯。
    • *.gbk:带注释的 GenBank 文件包含序列和注释特征CDS、蛋白翻译、坐标等便于在基因组浏览器或其他注释/二次分析工具中查看与下游使用。
  • 后续靶点分析/打分是否需要 *.list *.gbk 不需要。本仓库的打分脚本 scripts/bttoxin_shoter.py 只读取 All_Toxins.txt(见 docs/shotter_workflow.md 的流程图,只有 All_Toxins.txt 被接入打分环节)。*.list*.gbk 是 Digger 的配套产物,适合做人工检查、回溯坐标与提取序列,但不作为打分脚本的输入。

Name and family parsing

  • Extract family/subfamily keys from names (e.g., Cry1Ac1 -> family Cry1, subfamily Cry1A). Names like 5618-Cry1Ia are supported.
  • Hit IDs such as Xxx-other are normalized by removing the -other suffix.

Specificity index from CSV (positive-only)

  • Keep only rows where activity == Yes.
  • Compute a potency weight per row:
    • If lc50 present: normalize within unit buckets by inverted percentile: potency = 1 - rank_percentile(lc50).
    • If percentage_mortality present: map to coarse weights
      • ≥80% or phrases like "greater than 80%" -> 0.90
      • 5080% (or 6080%, 60100%) -> 0.65
      • 060%, 050%, 25%, or vague "some"/"stunting" -> 0.25
    • Otherwise default positive evidence -> 0.55
  • Aggregate to distributions P(order | name), P(order | subfamily), P(order | family) by summing potencies per target_order then normalizing to 1.
  • Unit harmonization rationale: diet incorporation studies often report ppm roughly equivalent to µg/g. We place ppm, µg/g, ug/g, μg/g into one ug_per_g bucket for within-bucket normalization. Other units stay in their own buckets.

Partner protein logic

  • Some families require partners for activity (e.g., Vip1/Vip2, Vpa/Vpb). A hit from such a family is down-weighted by 0.2 unless a partner family is also present in the same strain with a non-trivial similarity weight (≥0.3).
  • This is applied at family prefix level and is independent of CSV partnerprotein text.

Similarity weight (per hit)

Let id = identity (01), cov = alignment coverage, hmm = HMM flag.

  • Base identity component:
    • if id ≥ 0.78 and cov ≥ 0.80: base = 1.0
    • elif 0.45 ≤ id < 0.78: base = (id - 0.45) / (0.78 - 0.45)
    • else: base = 0.0
  • Combine with coverage and HMM bonus:
    • w_hit = min(1, base * cov + (0.1 if hmm else 0))
  • Rationale: 0.45 and 0.78 are the conventional Bt nomenclature thresholds for family and subfamily; coverage ≥0.8 protects against partial hits; HMM bonus rewards domain-level confirmation without overpowering identity/coverage.

Per-hit order contributions and species (optional)

  • For each hit, get a distribution P(order | name/family) from the index, with back-off: name -> subfamily -> family. If still unknown, contributions are sent to the other bucket.
  • Hit contribution to an order: c(order) = w_hit × P(order | name/family).
  • If target_species is present in the CSV, compute in parallel on species: c(species) = w_hit × P(species | ·)

Strain-level combination

  • Across all hits for a strain, we use noisy-OR:
    • Score(strain, order) = 1 - Π_hits (1 - c(order))
  • This stays in [0,1], increases with multiple concordant hits, and avoids double-counting.

Special buckets and Top columns

  • other: used when a hit has a parsable family (e.g., Cry1, Vip3) but no mapping evidence is found in the specificity index at any level (name/subfamily/family).
  • unknown: used when the hit's family cannot be parsed from its name, so no meaningful back-off is possible.
  • TopOrder/TopScore (per hit and per strain): the max contributing order and its score. Strain-level tops prefer canonical orders over other/unknown when present.

Files and dataclasses

  • Dataclasses:
    • ToxinHit(strain, protein_id, hit_id, identity, aln_length, hit_length, coverage, hmm, family, name_key, partner_fulfilled, weight, order_contribs, top_order, top_score)
      • Methods: to_tsv_row, save_list_tsv(...)
    • StrainScores(strain, order_scores, top_order, top_score)
      • Methods: to_tsv_row, save_list_tsv(...), save_list_json(...)
    • StrainSpeciesScores(strain, species_scores, top_species, top_species_score) (if species dimension is present)
      • Methods: to_tsv_row, save_list_tsv(...), save_list_json(...)

Plotting and report

Use scripts/plot_shotter.py to render heatmaps and write a Markdown report.

python scripts/plot_shotter.py \
  --strain_scores shotter_outputs/strain_target_scores.tsv \
  --toxin_support shotter_outputs/toxin_support.tsv \
  --species_scores shotter_outputs/strain_target_species_scores.tsv \
  --out_dir shotter_outputs \
  --merge_unresolved \
  --per_hit_strain <Strain> \
  --report_mode paper \
  --lang en

Options

  • --merge_unresolved: merge other+unknown into unresolved for visualization only (computation unchanged)
  • --per_hit_strain: pick one strain to show per-hit × order heatmap (interpretation aid)
  • --report_mode: paper or summary; --lang: zh/en

Notes and limitations

  • Only positive activity records are used from the specificity CSV; negative records are ignored per design.
  • The partner list is heuristic; extend as needed.
  • If no CSV evidence exists for a family, hits are routed to other and do not inflate canonical orders.
  • v2 can add HMM family models and phylogenetic placement for improved back-off.