add test data

2025-12-01 09:41:39 +08:00
parent b80939f0ca
commit 6b923ae567
9 changed files with 168001 additions and 0 deletions


@@ -0,0 +1,200 @@
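# Bti AM65-52 genome download and BtToxin_Digger/Shotter positive-control preparation notes
## 1. Purpose
- Use **Bacillus thuringiensis serovar israelensis strain AM65-52** as the positive control for Diptera (mosquitoes).
- Obtain its **complete genome (chromosome + plasmids) in FASTA format** for the BtToxin_Digger + Shotter pipeline.
- Record the whole download and file-selection process so it can be reproduced later.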
---
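## 2. Data source and assembly information
- Species: *Bacillus thuringiensis* serovar israelensis
- Strain: AM65-52
- Assembly: **ASM344539v2**
- Accession:
  - GenBank: **GCA_003445395.2**
  - RefSeq: **GCF_003445395.2**
- Assembly level: Complete Genome
- Composition: 1 chromosome + multiple plasmids
**Note:**
GCA = the GenBank assembly as originally submitted;
GCF = the NCBI RefSeq version, curated and checked by NCBI on the basis of the GCA record; it is generally more standardized and is the recommended one for analysis.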
---
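## 3. Download procedure on the NCBI website (conceptual steps)
1. Open the NCBI Assembly page for GCF_003445395.2 / ASM344539v2.
2. Click **"Download"** in the upper-right corner of the page.
3. In the dialog that opens, choose options roughly as follows:
   - Data type: **Genomic sequence**
   - File format: **FASTA**
   - Dataset type: **Genome data package**
4. Click **Download** to get a zip archive (`ncbi_dataset.zip` in this case).
**Caution:**
If you only open `CP013275.1` (the chromosome) in Nucleotide and download its FASTA directly, you get a single replicon, the chromosome, and the plasmids are missing.
Whole-genome analysis and toxin searches need the **chromosome + all plasmids**, so download the "whole genome" from the **Assembly** page.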
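Once `ncbi_dataset.zip` is on disk, unpacking it and locating the genomic FASTA files can be scripted. The following is a minimal sketch using only the Python standard library; the function name `extract_genomic_fasta` and the `_genomic.fna` suffix filter are illustrative choices, not part of any NCBI tool.

```python
import zipfile
from pathlib import Path


def extract_genomic_fasta(zip_path: Path, out_dir: Path) -> list[Path]:
    """Extract the whole archive and return the paths of the *_genomic.fna files it contained."""
    with zipfile.ZipFile(zip_path) as archive:
        # Remember which members look like genomic FASTA files before extracting.
        fasta_members = [name for name in archive.namelist() if name.endswith("_genomic.fna")]
        archive.extractall(out_dir)
    return sorted(out_dir / name for name in fasta_members)
```

For example, `extract_genomic_fasta(Path("ncbi_dataset.zip"), Path("tests/test_data"))` would return one path per assembly directory in the package.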
---
## 4. Directory structure after extraction
Structure of the `ncbi_dataset.zip` downloaded here, after extraction:
```text
README.md
md5sum.txt
ncbi_dataset/
└── data/
    ├── data_summary.tsv
    ├── assembly_data_report.jsonl
    ├── dataset_catalog.json
    ├── GCA_003445395.2/
    │   └── GCA_003445395.2_ASM344539v2_genomic.fna
    └── GCF_003445395.2/
        └── GCF_003445395.2_ASM344539v2_genomic.fna
```
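What each file is:
- `README.md`
  Generic NCBI Datasets documentation describing what the archive is and how to use it.
- `md5sum.txt`
  MD5 checksum of each data file, used to verify that the download is complete and uncorrupted.
- `ncbi_dataset/data/data_summary.tsv`
  - One summary row per assembly: organism name, strain, assembly level, total length, contig N50, submission date, gene count, BioProject/BioSample accessions, and so on.
  - A convenient source for a sample information table.
- `ncbi_dataset/data/assembly_data_report.jsonl`
  - Detailed assembly report in JSON Lines format, one assembly per line.
  - Contents include the assembly name and level, sequencing technology and assembly method, assembly statistics (total length, N50, number of sequences), annotation gene counts, and CheckM/ANI results.
  - The report in this package does not list individual replicons; to automate questions such as "which plasmids are there and how long are they", read the headers and sequence lengths from the `.fna` files instead.
- `ncbi_dataset/data/dataset_catalog.json`
  - Describes which files the data package contains (for example which `.fna` files, and their sizes).
  - Mainly useful for programmatic checks or further tooling.
- `ncbi_dataset/data/GCA_003445395.2/GCA_003445395.2_ASM344539v2_genomic.fna`
  - **Genomic nucleotide FASTA** of the GenBank assembly (GCA).
  - Several sequences in one FASTA file (multiFASTA), covering chromosome + plasmids.
- `ncbi_dataset/data/GCF_003445395.2/GCF_003445395.2_ASM344539v2_genomic.fna`
  - **Genomic nucleotide FASTA** of the RefSeq assembly (GCF).
  - Also a multiFASTA covering chromosome + plasmids.
  - This is the file to prefer in the BtToxin_Digger / Shotter pipeline.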
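The integrity check that `md5sum.txt` enables can be sketched in a few lines of Python. This assumes each manifest line has the form `<md5> <relative path>`, as in the `md5sum.txt` of this package, and that paths are relative to the directory holding the manifest; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5sums(root: Path, manifest_name: str = "md5sum.txt") -> dict[str, bool]:
    """Map each path listed in the manifest to True if its MD5 matches, else False."""
    results = {}
    for line in (root / manifest_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.strip()
        target = root / rel_path
        # A missing file counts as a failed check rather than raising.
        results[rel_path] = target.is_file() and md5_of(target) == expected.lower()
    return results
```

Calling `verify_md5sums(Path("tests/test_data"))` would then report, per file, whether the extracted copy matches the checksum shipped by NCBI.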
---
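## 5. Basic concepts: .fna vs FASTA, genome vs plasmid
### 5.1 How .fna relates to .fasta
- **FASTA** is a text format for storing biological sequences:
  each record starts with a header line beginning with `>`, followed by one or more lines of A/C/G/T (or amino-acid letters).
- `.fna` is only a **file-extension convention**:
  - `.fna` = FASTA nucleic acid (nucleotide FASTA)
  - `.faa` = FASTA amino acid (protein FASTA)
  - `.fasta` / `.fa` are just the more general names.
- The content is essentially the same; tools do not treat `.fna` and `.fasta` fundamentally differently.
A single `.fna` file can hold **one or more sequences**: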
```text
>chromosome
ATGCGT...
>plasmid_1
TTGACA...
>plasmid_2
...
```
A file with several records like this is called a **multiFASTA**.
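To see which records a multiFASTA actually holds, it is enough to walk the file once and count bases per header. The sketch below uses plain Python without Biopython; `fasta_record_lengths` is an illustrative name, and the function assumes a well-formed FASTA file.

```python
from pathlib import Path
from typing import Iterator


def fasta_record_lengths(path: Path) -> Iterator[tuple[str, str, int]]:
    """Yield (sequence id, description, length) for each record in a FASTA file."""
    seq_id, description, length = None, "", 0
    with path.open() as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if seq_id is not None:
                    yield seq_id, description, length
                header = line[1:].split(maxsplit=1)
                seq_id = header[0] if header else ""
                description = header[1] if len(header) > 1 else ""
                length = 0
            elif seq_id is not None:
                length += len(line)
    if seq_id is not None:
        yield seq_id, description, length
```

Iterating over `fasta_record_lengths(Path("data/bti_AM65-52_all.fna"))` lists every replicon with its length, which is a quick way to confirm that the plasmids are present.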
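### 5.2 Genome, chromosome, and plasmid
- Chromosome:
  - The large DNA molecule that forms the main body of the bacterial genome; it carries the genes for core metabolism and growth.
- Plasmid:
  - A smaller, often mobile DNA molecule that can be transferred horizontally between strains.
  - Often carries "optional extras" such as antibiotic resistance, toxins, or secretion systems.
For *Bacillus thuringiensis*:
- Many **insecticidal toxin genes such as Cry/Vip/Cyt are on plasmids, not on the chromosome**.
- If only the chromosome is analyzed (for example, only `CP013275.1` is downloaded), BtToxin_Digger may not find any typical Bt toxin.
- A pipeline that predicts insecticidal targets must therefore give Digger the **chromosome + all plasmids** to scan.
### 5.3 What an Assembly is
- A single Nucleotide record (such as `CP013275.1`) is just one sequence: it may be a chromosome, a plasmid, or a scaffold.
- An **Assembly** manages all the sequences belonging to one strain (chromosome + plasmids + any contigs) as a single unit:
  - it assigns one assembly accession (GCA_xxx / GCF_xxx);
  - it records which replicons make up the genome, with their lengths, roles, and whether they are circular;
  - it lets users download the whole genome in one step.
So, in short:
> An Assembly is a curated, archived grouping built on top of the underlying Nucleotide sequences, organized as "the whole genome of one strain".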
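The assembly-level metadata that comes with a data package can be read programmatically from `assembly_data_report.jsonl`. The sketch below pulls a few fields per assembly; the field names (`accession`, `sourceDatabase`, `assemblyInfo.assemblyLevel`, `assemblyStats.totalSequenceLength`, `assemblyStats.numberOfComponentSequences`, `annotationInfo.stats.geneCounts.total`) are the ones visible in the report shipped with this package, while the function name and output keys are illustrative.

```python
import json
from pathlib import Path


def summarize_assemblies(report_path: Path) -> list[dict]:
    """Return one small summary dict per line (assembly) of a JSON Lines assembly report."""
    rows = []
    with report_path.open() as handle:
        for line in handle:
            if not line.strip():
                continue
            record = json.loads(line)
            stats = record.get("assemblyStats", {})
            gene_counts = record.get("annotationInfo", {}).get("stats", {}).get("geneCounts", {})
            rows.append({
                "accession": record.get("accession"),
                "source": record.get("sourceDatabase"),
                "level": record.get("assemblyInfo", {}).get("assemblyLevel"),
                # Lengths are stored as strings in the report, so convert explicitly.
                "total_length": int(stats.get("totalSequenceLength", 0)),
                "sequences": stats.get("numberOfComponentSequences"),
                "genes": gene_counts.get("total"),
            })
    return rows
```

Missing fields come back as `None` (or `0` for the length) instead of raising, so the same function can be reused on packages with sparser metadata.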
---
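## 6. Differences between the two .fna files in this download, and which to use
This data package contains:
1. `GCA_003445395.2_ASM344539v2_genomic.fna` (GenBank submitted version)
2. `GCF_003445395.2_ASM344539v2_genomic.fna` (RefSeq curated version)
Both are **multiFASTA** files containing:
- Chromosome: `NZ_CP013275.1` (or CP013275.1 without the `NZ_` prefix)
- Multiple plasmids: `NZ_CP013276.1` ~ `NZ_CP013284.1` and so on, one sequence each.
For BtToxin_Digger / Shotter:
- The sequence content is equivalent or nearly so across most of the genome, but the two files are not identical: the assembly report sets `refseqGenbankAreDifferent` to `true`, and `dataset_catalog.json` lists different sizes for them (6,799,989 bytes for GCA versus 6,784,825 bytes for GCF).
- The **RefSeq version** is recommended:
  - File path:
    - `ncbi_dataset/data/GCF_003445395.2/GCF_003445395.2_ASM344539v2_genomic.fna`
  - Reason: RefSeq applies additional normalization and quality checks, and its naming is more consistent.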
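Whether the two files carry the same set of replicons can be checked by comparing their header IDs. The sketch below assumes, as described above, that RefSeq IDs are the GenBank IDs with an `NZ_` prefix; the function names are illustrative, and the comparison looks only at identifiers, not at sequence content.

```python
from pathlib import Path


def fasta_ids(path: Path) -> set[str]:
    """Collect the sequence identifier (first word of each '>' header) in a FASTA file."""
    ids = set()
    with path.open() as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def strip_prefix(seq_id: str, prefix: str = "NZ_") -> str:
    """Drop the RefSeq prefix so IDs can be compared with their GenBank counterparts."""
    return seq_id[len(prefix):] if seq_id.startswith(prefix) else seq_id


def compare_replicons(genbank_fna: Path, refseq_fna: Path) -> dict[str, list[str]]:
    """Report which sequence IDs are shared and which occur in only one of the two files."""
    genbank = fasta_ids(genbank_fna)
    refseq = {strip_prefix(seq_id) for seq_id in fasta_ids(refseq_fna)}
    return {
        "shared": sorted(genbank & refseq),
        "only_genbank": sorted(genbank - refseq),
        "only_refseq": sorted(refseq - genbank),
    }
```

A non-empty `only_genbank` list would point to sequences that RefSeq dropped, which is worth knowing before choosing one file as the pipeline input.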
---
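## 7. Choosing the fna for the pipeline, and an example run
### 7.1 Choosing the fna file
For the downstream BtToxin_Digger + Shotter pipeline, the recommendation is:
- Copy or symlink `GCF_003445395.2_ASM344539v2_genomic.fna` into your project directory, for example:
  - `data/bti_AM65-52_all.fna`
### 7.2 Example command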
```bash
python scripts/run_single_fna_pipeline.py \
  --fna data/bti_AM65-52_all.fna \
  --toxicity_csv Data/toxicity-data.csv \
  --out_root runs/bti_AM65-52_run \
  --min_identity 0.50 \
  --min_coverage 0.60 \
  --disallow_unknown_families \
  --require_index_hit \
  --lang zh
```
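After the run completes, the key outputs are:
- `runs/bti_AM65-52_run/digger/Results/Toxins/All_Toxins.txt`
  - Check which Cry/Vip/Cyt toxin families were identified.
- `runs/bti_AM65-52_run/shotter/strain_target_scores.tsv`
  - Predicted scores per insect order; in theory **Diptera should be a high-scoring target**.
- `runs/bti_AM65-52_run/shotter/toxin_support.tsv`
  - Shows which toxin hits support the Diptera target, together with their similarity metrics.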
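The positive-control expectation (Diptera scoring highest) can be turned into an automated check on `strain_target_scores.tsv`. The sketch below is generic on purpose: the column layout of that file is not documented here, so the caller must pass the column names, and any names used in examples are hypothetical.

```python
import csv
from pathlib import Path


def top_target(tsv_path: Path, name_column: str, score_column: str) -> tuple[str, float]:
    """Return (name, score) of the row with the highest numeric score in a TSV file."""
    with tsv_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    if not rows:
        raise ValueError(f"no data rows in {tsv_path}")
    best = max(rows, key=lambda row: float(row[score_column]))
    return best[name_column], float(best[score_column])
```

With hypothetical column names `target` and `score`, a test could assert that `top_target(path, "target", "score")[0]` equals `"Diptera"`; the real column names should be taken from the header of the file the pipeline writes.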
---
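## 8. Summary
- Do not run BtToxin_Digger on the chromosome alone, downloaded from Nucleotide record `CP013275.1`: the insecticidal toxin genes on the plasmids would be missing.
- The correct approach is to download the complete genome data package from **Assembly (GCF_003445395.2 / ASM344539v2)** and take the **RefSeq genomic.fna** from it.
- `.fna` is a nucleotide FASTA file; here it is a multiFASTA containing the AM65-52 chromosome and all plasmids.
- Using this `.fna` as input lets BtToxin_Digger + Shotter identify Cry/Vip/Cyt toxins more completely and should yield the expected Diptera target spectrum for Bti.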

tests/test_data/README.md Normal file

@@ -0,0 +1,41 @@
# NCBI Datasets
https://www.ncbi.nlm.nih.gov/datasets
This zip archive contains an NCBI Datasets Data Package.
NCBI Datasets Data Packages can include sequence, annotation and other data files, and metadata in one or more data report files.
Data report files are in JSON Lines format.
---
## FAQs
### Where is the data I requested?
Your data is in the subdirectory `ncbi_dataset/data/` contained within this zip archive.
### I still can't find my data, can you help?
We have identified a bug affecting Mac Safari users. When downloading data from the NCBI Datasets web interface, you may see only this README file after the download has completed (while other files appear to be missing).
As a workaround to prevent this issue from recurring, we recommend disabling automatic zip archive extraction in Safari until Apple releases a bug fix.
For more information, visit:
https://www.ncbi.nlm.nih.gov/datasets/docs/reference-docs/mac-zip-bug/
### How do I work with JSON Lines data reports?
Visit our JSON Lines data report documentation page:
https://www.ncbi.nlm.nih.gov/datasets/docs/v2/tutorials/working-with-jsonl-data-reports/
### What is NCBI Datasets?
NCBI Datasets is a resource that lets you easily gather data from across NCBI databases. Find and download gene, transcript, protein and genome sequences, annotation and metadata.
### Where can I find NCBI Datasets documentation?
Visit the NCBI Datasets documentation pages:
https://www.ncbi.nlm.nih.gov/datasets/docs/
---
National Center for Biotechnology Information
National Library of Medicine
info@ncbi.nlm.nih.gov


@@ -0,0 +1,5 @@
df09793206f915d51fa4fd51b3d8b317 ncbi_dataset/data/data_summary.tsv
ba49a038334bc374df55a16d4a76e3e5 ncbi_dataset/data/assembly_data_report.jsonl
02066d1bbc77e9ada3dac67ed2027931 ncbi_dataset/data/GCA_003445395.2/GCA_003445395.2_ASM344539v2_genomic.fna
cb0d496001e37f9fe00471fce7693ec2 ncbi_dataset/data/GCF_003445395.2/GCF_003445395.2_ASM344539v2_genomic.fna
9345f52067acb8e2bb8af7fedac54e22 ncbi_dataset/data/dataset_catalog.json

Binary file not shown.


@@ -0,0 +1,2 @@
{"assemblyInfo":{"assemblyLevel":"Complete Genome","assemblyName":"ASM344539v2","assemblyType":"haploid","submitter":"MICALIS INRA","bioprojectLineage":[{"bioprojects":[{"accession":"PRJNA303961","title":"Bacillus thuringiensis serovar israelensis strain AM65-52 and other related strains"}]}],"sequencingTech":"Illumina","biosample":{"accession":"SAMN04288432","lastUpdated":"2021-02-28T02:59:31.467","publicationDate":"2016-05-01T00:00:00.000","submissionDate":"2015-11-24T05:04:10.000","sampleIds":[{"label":"Sample name","value":"WNF19#1"},{"db":"SRA","value":"SRS4274306"}],"description":{"title":"strain AM65-52","organism":{"taxId":1430,"organismName":"Bacillus thuringiensis serovar israelensis"},"comment":"bacterial colony"},"owner":{"name":"MICALIS INRA","contacts":[{}]},"models":["Microbe, viral or environmental"],"package":"Microbe.1.0","attributes":[{"name":"strain","value":"AM65-52"},{"name":"host","value":"Vectobac"},{"name":"isolation_source","value":"Commercial powder"},{"name":"collection_date","value":"2014-06-20"},{"name":"geo_loc_name","value":"France"},{"name":"sample_type","value":"bacterial clone"},{"name":"altitude","value":"100 m"},{"name":"biomaterial_provider","value":"Micalis, INRA"},{"name":"collected_by","value":"Alexei Sorokin"},{"name":"genotype","value":"toxic against mosquito"},{"name":"identified_by","value":"Alexei Sorokin"},{"name":"lab_host","value":"Bacillus thuringiensis"},{"name":"passage_history","value":"direct colony from commertial powder"},{"name":"samp_size","value":"one kg"},{"name":"serovar","value":"serovar=H14"},{"name":"temp","value":"30C"},{"name":"component_organism","value":"Bacillus phage pGIL02"}],"status":{"status":"live","when":"2016-05-01T00:50:56.207"},"biomaterialProvider":"Micalis, INRA","collectedBy":"Alexei Sorokin","collectionDate":"2014-06-20","geoLocName":"France","host":"Vectobac","identifiedBy":"Alexei Sorokin","isolationSource":"Commercial 
powder","serovar":"serovar=H14","strain":"AM65-52"},"comments":"Annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (released 2013). Information about the Pipeline can be found here: http://www.ncbi.nlm.nih.gov/genome/annotation_prok/","assemblyStatus":"current","pairedAssembly":{"accession":"GCF_003445395.2","status":"current","annotationName":"GCF_003445395.2-RS_2025_03_19","onlyGenbank":"chromosome ","refseqGenbankAreDifferent":true,"differences":"Removed chromosome "},"bioprojectAccession":"PRJNA303961","assemblyMethod":"SPAdes v. 3.5.0","releaseDate":"2016-11-29"},"assemblyStats":{"totalNumberOfChromosomes":9,"totalSequenceLength":"6700047","totalUngappedLength":"6700047","numberOfContigs":9,"contigN50":5499731,"contigL50":1,"numberOfScaffolds":9,"scaffoldN50":5499731,"scaffoldL50":1,"numberOfComponentSequences":9,"gcCount":"2343207","gcPercent":35,"genomeCoverage":"300","atgcCount":"6700047"},"annotationInfo":{"name":"NCBI Prokaryotic Genome Annotation Pipeline (PGAP)","provider":"NCBI","releaseDate":"2015-11-24","stats":{"geneCounts":{"total":6950,"proteinCoding":6641,"nonCoding":173,"pseudogene":136}},"method":"Best-placed reference protein set; GeneMarkS+","pipeline":"NCBI Prokaryotic Genome Annotation Pipeline (PGAP)","softwareVersion":"3.0"},"currentAccession":"GCA_003445395.2","checkmInfo":{"checkmMarkerSet":"Bacillus thuringiensis","checkmMarkerSetRank":"species","checkmVersion":"v1.2.3","completeness":99.3,"contamination":0.74,"completenessPercentile":83.2006,"checkmSpeciesTaxId":1428},"averageNucleotideIdentity":{"taxonomyCheckStatus":"Inconclusive","matchStatus":"below_threshold_match","submittedOrganism":"Bacillus thuringiensis serovar israelensis","submittedSpecies":"Bacillus thuringiensis","category":"category_na","submittedAniMatch":{"assembly":"GCA_002243685.1","organismName":"Bacillus 
thuringiensis","category":"suspected_type","ani":96.5,"assemblyCoverage":72.97,"typeAssemblyCoverage":76.02},"bestAniMatch":{"assembly":"GCA_002243685.1","organismName":"Bacillus thuringiensis","category":"suspected_type","ani":96.5,"assemblyCoverage":72.97,"typeAssemblyCoverage":76.02},"comment":"na"},"accession":"GCA_003445395.2","pairedAccession":"GCF_003445395.2","sourceDatabase":"SOURCE_DATABASE_GENBANK","organism":{"taxId":1430,"organismName":"Bacillus thuringiensis serovar israelensis","infraspecificNames":{"strain":"AM65-52"}}}
{"assemblyInfo":{"assemblyLevel":"Complete Genome","assemblyName":"ASM344539v2","assemblyType":"haploid","submitter":"MICALIS INRA","bioprojectLineage":[{"bioprojects":[{"accession":"PRJNA303961","title":"Bacillus thuringiensis serovar israelensis strain AM65-52 and other related strains"}]}],"sequencingTech":"Illumina","biosample":{"accession":"SAMN04288432","lastUpdated":"2021-02-28T02:59:31.467","publicationDate":"2016-05-01T00:00:00.000","submissionDate":"2015-11-24T05:04:10.000","sampleIds":[{"label":"Sample name","value":"WNF19#1"},{"db":"SRA","value":"SRS4274306"}],"description":{"title":"strain AM65-52","organism":{"taxId":1430,"organismName":"Bacillus thuringiensis serovar israelensis"},"comment":"bacterial colony"},"owner":{"name":"MICALIS INRA","contacts":[{}]},"models":["Microbe, viral or environmental"],"package":"Microbe.1.0","attributes":[{"name":"strain","value":"AM65-52"},{"name":"host","value":"Vectobac"},{"name":"isolation_source","value":"Commercial powder"},{"name":"collection_date","value":"2014-06-20"},{"name":"geo_loc_name","value":"France"},{"name":"sample_type","value":"bacterial clone"},{"name":"altitude","value":"100 m"},{"name":"biomaterial_provider","value":"Micalis, INRA"},{"name":"collected_by","value":"Alexei Sorokin"},{"name":"genotype","value":"toxic against mosquito"},{"name":"identified_by","value":"Alexei Sorokin"},{"name":"lab_host","value":"Bacillus thuringiensis"},{"name":"passage_history","value":"direct colony from commertial powder"},{"name":"samp_size","value":"one kg"},{"name":"serovar","value":"serovar=H14"},{"name":"temp","value":"30C"},{"name":"component_organism","value":"Bacillus phage pGIL02"}],"status":{"status":"live","when":"2016-05-01T00:50:56.207"},"biomaterialProvider":"Micalis, INRA","collectedBy":"Alexei Sorokin","collectionDate":"2014-06-20","geoLocName":"France","host":"Vectobac","identifiedBy":"Alexei Sorokin","isolationSource":"Commercial 
powder","serovar":"serovar=H14","strain":"AM65-52"},"comments":"The annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). Information about PGAP can be found here: https://www.ncbi.nlm.nih.gov/genome/annotation_prok/","assemblyStatus":"current","pairedAssembly":{"accession":"GCA_003445395.2","status":"current","annotationName":"NCBI Prokaryotic Genome Annotation Pipeline (PGAP)","onlyGenbank":"chromosome ","refseqGenbankAreDifferent":true,"differences":"Removed chromosome "},"bioprojectAccession":"PRJNA303961","assemblyMethod":"SPAdes v. 3.5.0","releaseDate":"2016-11-29"},"assemblyStats":{"totalNumberOfChromosomes":9,"totalSequenceLength":"6700047","totalUngappedLength":"6700047","numberOfContigs":9,"contigN50":5499731,"contigL50":1,"numberOfScaffolds":9,"scaffoldN50":5499731,"scaffoldL50":1,"numberOfComponentSequences":9,"gcCount":"2343207","gcPercent":35,"genomeCoverage":"300","atgcCount":"6700047"},"annotationInfo":{"name":"GCF_003445395.2-RS_2025_03_19","provider":"NCBI RefSeq","releaseDate":"2025-03-19","stats":{"geneCounts":{"total":7000,"proteinCoding":6579,"nonCoding":179,"pseudogene":242}},"method":"Best-placed reference protein set; GeneMarkS-2+","pipeline":"NCBI Prokaryotic Genome Annotation Pipeline (PGAP)","softwareVersion":"6.9"},"currentAccession":"GCF_003445395.2","checkmInfo":{"checkmMarkerSet":"Bacillus thuringiensis","checkmMarkerSetRank":"species","checkmVersion":"v1.2.3","completeness":99.3,"contamination":0.74,"completenessPercentile":83.2006,"checkmSpeciesTaxId":1428},"averageNucleotideIdentity":{"taxonomyCheckStatus":"Inconclusive","matchStatus":"below_threshold_match","submittedOrganism":"Bacillus thuringiensis serovar israelensis","submittedSpecies":"Bacillus thuringiensis","category":"category_na","submittedAniMatch":{"assembly":"GCA_002243685.1","organismName":"Bacillus 
thuringiensis","category":"suspected_type","ani":96.5,"assemblyCoverage":72.97,"typeAssemblyCoverage":76.02},"bestAniMatch":{"assembly":"GCA_002243685.1","organismName":"Bacillus thuringiensis","category":"suspected_type","ani":96.5,"assemblyCoverage":72.97,"typeAssemblyCoverage":76.02},"comment":"na"},"accession":"GCF_003445395.2","pairedAccession":"GCA_003445395.2","sourceDatabase":"SOURCE_DATABASE_REFSEQ","organism":{"taxId":1430,"organismName":"Bacillus thuringiensis serovar israelensis","infraspecificNames":{"strain":"AM65-52"}}}


@@ -0,0 +1,3 @@
Organism Scientific Name Organism Common Name Organism Qualifier Taxonomy id Assembly Name Assembly Accession Source Annotation Level Contig N50 Size Submission Date Gene Count BioProject BioSample
Bacillus thuringiensis serovar israelensis strain: AM65-52 1430 ASM344539v2 GCA_003445395.2 GenBank NCBI Prokaryotic Genome Annotation Pipeline (PGAP) Complete Genome 5499731 6700047 2016-11-29 6950 PRJNA303961 SAMN04288432
Bacillus thuringiensis serovar israelensis strain: AM65-52 1430 ASM344539v2 GCF_003445395.2 RefSeq GCF_003445395.2-RS_2025_03_19 Complete Genome 5499731 6700047 2016-11-29 7000 PRJNA303961 SAMN04288432


@@ -0,0 +1,35 @@
{
"apiVersion": "V2",
"assemblies": [
{
"files": [
{
"filePath": "data_summary.tsv",
"fileType": "DATA_TABLE",
"uncompressedLengthBytes": "629"
},
{
"filePath": "assembly_data_report.jsonl",
"fileType": "DATA_REPORT",
"uncompressedLengthBytes": "8666"
}
]
},{
"accession": "GCA_003445395.2",
"files": [
{
"filePath": "GCA_003445395.2/GCA_003445395.2_ASM344539v2_genomic.fna",
"fileType": "GENOMIC_NUCLEOTIDE_FASTA",
"uncompressedLengthBytes": "6799989"
}
]
},{
"accession": "GCF_003445395.2",
"files": [
{
"filePath": "GCF_003445395.2/GCF_003445395.2_ASM344539v2_genomic.fna",
"fileType": "GENOMIC_NUCLEOTIDE_FASTA",
"uncompressedLengthBytes": "6784825"
}
]
}]}