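A Complete Guide to De Novo Sequencing Analysis of Bacterial Genomes

2021-02-18 Bioinformatics Analysis of Genes

This article uses sequencing data from a single bacterium to walk through the analysis workflow for bacterial genome sequencing. The bacterial genome in this experiment is about 3.4 Mb. Illumina PE150 sequencing produced roughly 800 Mb of data, which was assembled into a draft genome sequence for the sample and then used for gene annotation and related analyses.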

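Contents

1 Overview

2 Analysis of de novo genome sequencing

2.1 Analysis flowchart

2.2 Sequencing data format

2.3 Data preparation

2.4 Data quality control

2.5 Genome assembly

2.6 Assembly statistics

2.7 Assembly graph visualization

2.8 Genome annotation

3 Online analysis websites

4 Glossary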

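Overview

Microorganisms are found throughout nature and are closely tied to human industry and daily life. They are currently grouped mainly into bacteria, fungi, actinomycetes, spirochetes, rickettsiae, chlamydiae, mycoplasmas, and viruses.

As sequencing costs have dropped sharply and sequencing throughput has risen by orders of magnitude, whole-genome sequencing has greatly advanced the genomics of single microbial isolates. A whole-genome sequence makes it possible to build a genome database for the species, provides an efficient platform for later studies of its growth, development, evolution, and origin, and supplies the DNA sequence information needed for gene mining and functional validation.

Bacterial genome sequencing falls into two categories: de novo sequencing and resequencing. De novo sequencing means sequencing a bacterial species for which no sequence information exists, then assembling the reads with bioinformatics methods to obtain the genome sequence of that species. Resequencing means sequencing the genomes of different individuals of a species that already has a reference sequence, and using the results to analyze differences at the individual or population level. It focuses on variation such as single nucleotide polymorphisms (SNP), insertions and deletions (InDel, Insertion/Deletion), and structural variation (SV).

This article focuses on the analysis workflow for de novo sequencing of bacterial genomes.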

Analysis of de novo genome sequencing

Analysis flowchart

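Sequencing data format

Samples are separated by their barcode sequences, and the extracted data is saved in standard FASTQ format. For paired-end (PE) sequencing, each sample has two files, read1.fastq and read2.fastq, which hold the reads sequenced from the two ends of each fragment, in the 5' -> 3' and 3' -> 5' directions respectively. The FASTQ format was introduced in an earlier post, "Bioinformatics basics: FASTA and FASTQ file formats explained".

Line 4 of each FASTQ record holds the sequencing quality of the read. Each character encodes the quality of the corresponding base on line 2, stored as the ASCII character whose code is the Phred score plus 33. The Phred score is calculated as Q = -10 * log10(e), where e is the base-calling error probability.

Quality scores and the corresponding error probabilities: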

Phred quality score    Probability of incorrect base call    Base call accuracy
10                     1 in 10                               90%
20                     1 in 100                              99%
30                     1 in 1000                             99.9%
40                     1 in 10000                            99.99%

As the table shows, a base with a Q value of 20 has a 1% probability of having been called incorrectly.
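The relationship between quality characters, Phred scores, and error probabilities can be sketched in a few lines of Python. This is a minimal illustration that assumes the Phred+33 encoding described above; the function names and the example quality string are made up for the demonstration.

```python
import math


def phred_to_error(q):
    """Error probability e for a Phred score Q, from Q = -10 * log10(e)."""
    return 10 ** (-q / 10)


def error_to_phred(e):
    """Phred score Q for an error probability e."""
    return -10 * math.log10(e)


def decode_quality(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Each character's ASCII code minus the offset (33 here) is the score.
    """
    return [ord(ch) - offset for ch in quality_line]


# Hypothetical quality string: 'I' -> 40, '5' -> 20, '+' -> 10, '?' -> 30
scores = decode_quality("I5+?")
for q in scores:
    print(f"Q{q}: error probability {phred_to_error(q):.4f}")
```

Applied to line 4 of a real record, `decode_quality` gives one score per base on line 2.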

Data preparation

!ls

test_1.fq.gz test_2.fq.gz

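The ls command shows two compressed FASTQ files in the current directory. They are the raw data produced by Illumina sequencing, and they are also the files usually submitted to the NCBI SRA database. For how to submit them, see the post "Bioinformatics basics: how to upload raw sequencing data to NCBI".

Data quality control

Use FastQC to quality-check the raw sequencing reads. It generates an HTML report that gives a quick view of the data quality.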

!fastqc -t 10 test_1.fq.gz test_2.fq.gz

Started analysis of test_1.fq.gz

Started analysis of test_2.fq.gz

Approx 5% complete for test_1.fq.gz

Approx 5% complete for test_2.fq.gz

Approx 10% complete for test_1.fq.gz

Approx 10% complete for test_2.fq.gz

Approx 15% complete for test_1.fq.gz

Approx 15% complete for test_2.fq.gz

Approx 20% complete for test_1.fq.gz

Approx 20% complete for test_2.fq.gz

Approx 25% complete for test_1.fq.gz

Approx 25% complete for test_2.fq.gz

Approx 30% complete for test_1.fq.gz

Approx 30% complete for test_2.fq.gz

Approx 35% complete for test_1.fq.gz

Approx 35% complete for test_2.fq.gz

Approx 40% complete for test_1.fq.gz

Approx 40% complete for test_2.fq.gz

Approx 45% complete for test_1.fq.gz

Approx 45% complete for test_2.fq.gz

Approx 50% complete for test_1.fq.gz

Approx 50% complete for test_2.fq.gz

Approx 55% complete for test_1.fq.gz

Approx 55% complete for test_2.fq.gz

Approx 60% complete for test_1.fq.gz

Approx 60% complete for test_2.fq.gz

Approx 65% complete for test_1.fq.gz

Approx 65% complete for test_2.fq.gz

Approx 70% complete for test_1.fq.gz

Approx 70% complete for test_2.fq.gz

Approx 75% complete for test_1.fq.gz

Approx 75% complete for test_2.fq.gz

Approx 80% complete for test_1.fq.gz

Approx 80% complete for test_2.fq.gz

Approx 85% complete for test_1.fq.gz

Approx 85% complete for test_2.fq.gz

Approx 90% complete for test_1.fq.gz

Approx 90% complete for test_2.fq.gz

Approx 95% complete for test_1.fq.gz

Approx 95% complete for test_2.fq.gz

Analysis complete for test_1.fq.gz

Analysis complete for test_2.fq.gz

List the result files produced by FastQC

!ls -t test*html #-t sort by modification time, newest first

test_1_fastqc.html test_2_fastqc.html

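Open test_1_fastqc.html to view the quality results as plots:

The quality curves for this dataset all lie above Q30, which indicates good quality. The horizontal axis is the position in the read and the vertical axis is the quality score. The red line is the median, the yellow box spans the 25%-75% range, the whiskers span the 10%-90% range, and the blue line is the mean.

Use fastp to filter the raw sequencing data, removing low-quality reads, adapters, and reads with too many N bases, to obtain the clean data.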

!fastp -i test_1.fq.gz -I test_2.fq.gz -o clean_1.fq.gz -O clean_2.fq.gz

Read1 before filtering:

total reads: 2783612

total bases: 417541800

Q20 bases: 409754276(98.1349%)

Q30 bases: 398333598(95.3997%)

Read1 after filtering:

total reads: 2708145

total bases: 406146010

Q20 bases: 401516631(98.8602%)

Q30 bases: 391420330(96.3743%)

Read2 before filtering:

total reads: 2783612

total bases: 417541800

Q20 bases: 399039636(95.5688%)

Q30 bases: 376827191(90.249%)

Read2 aftering filtering:

total reads: 2708145

total bases: 406146010

Q20 bases: 392948823(96.7506%)

Q30 bases: 372542262(91.7262%)

Filtering result:

reads passed filter: 5416290

reads failed due to low quality: 149360

reads failed due to too many N: 1574

reads failed due to too short: 0

reads with adapter trimmed: 9616

bases trimmed due to adapters: 157916

JSON report: fastp.json

HTML report: fastp.html

fastp -i test_1.fq.gz -I test_2.fq.gz -o clean_1.fq.gz -O clean_2.fq.gz

fastp v0.12.2, time used: 66 seconds

Listing the directory shows the fastp result files clean_1.fq.gz and clean_2.fq.gz

!ls -t

clean_2.fq.gz fastp.html test_1_fastqc.html test_1_fastqc.zip test_1.fq.gz

clean_1.fq.gz fastp.json test_2_fastqc.html test_2_fastqc.zip test_2.fq.gz
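The fastp summary printed above can be cross-checked with simple arithmetic, and the amount of clean data gives a rough sequencing depth. The sketch below copies its numbers from the log; the genome size constant is an assumption taken from the approximate 3.4 Mb figure in the introduction, so the depth is only an estimate.

```python
# Numbers copied from the fastp summary above.
reads_before = 2783612 * 2        # read 1 + read 2 before filtering
reads_passed = 5416290
failed_low_quality = 149360
failed_too_many_n = 1574
failed_too_short = 0
clean_bases = 406146010 * 2       # read 1 + read 2 after filtering

# Assumption: approximate genome size (3.4 Mb) quoted in the introduction.
GENOME_SIZE_BP = 3_400_000

# Every input read should be counted exactly once, as passed or failed.
reads_accounted = (reads_passed + failed_low_quality
                   + failed_too_many_n + failed_too_short)
assert reads_accounted == reads_before

retained = reads_passed / reads_before
depth = clean_bases / GENOME_SIZE_BP

print(f"reads retained after filtering: {retained:.2%}")
print(f"estimated average depth: {depth:.0f}x")
```

With these figures about 97% of the reads survive filtering, and the clean data corresponds to a depth of roughly 240x.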

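Genome assembly

The clean data was assembled with SPAdes (version 3.11), a short-read assembler, and the best assembly was selected after several rounds of parameter tuning. Reads were then mapped back to the assembled contigs, and the assembly was locally reassembled and refined using the paired-end and overlap relationships of the reads.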

!spades.py -h

SPAdes genome assembler v3.11.0

Usage: /usr/local/bin/spades.py [options] -o <output_dir>

Basic options:

-o <output_dir> directory to store all the resulting files (required)

--sc this flag is required for MDA (single-cell) data

--meta this flag is required for metagenomic sample data

--rna this flag is required for RNA-Seq data

--plasmid runs plasmidSPAdes pipeline for plasmid detection

--iontorrent this flag is required for IonTorrent data

--test runs SPAdes on toy dataset

-h/--help prints this usage message

-v/--version prints version

Input data:

--12 <filename> file with interlaced forward and reverse paired-end reads

-1 <filename> file with forward paired-end reads

-2 <filename> file with reverse paired-end reads

-s <filename> file with unpaired reads

--pe<#>-12 <filename> file with interlaced reads for paired-end library number <#> (<#> = 1,2,..,9)

--pe<#>-1 <filename> file with forward reads for paired-end library number <#> (<#> = 1,2,..,9)

--pe<#>-2 <filename> file with reverse reads for paired-end library number <#> (<#> = 1,2,..,9)

--pe<#>-s <filename> file with unpaired reads for paired-end library number <#> (<#> = 1,2,..,9)

--pe<#>-<or> orientation of reads for paired-end library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)

--s<#> <filename> file with unpaired reads for single reads library number <#> (<#> = 1,2,..,9)

--mp<#>-12 <filename> file with interlaced reads for mate-pair library number <#> (<#> = 1,2,..,9)

--mp<#>-1 <filename> file with forward reads for mate-pair library number <#> (<#> = 1,2,..,9)

--mp<#>-2 <filename> file with reverse reads for mate-pair library number <#> (<#> = 1,2,..,9)

--mp<#>-s <filename> file with unpaired reads for mate-pair library number <#> (<#> = 1,2,..,9)

--mp<#>-<or> orientation of reads for mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)

--hqmp<#>-12 <filename> file with interlaced reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)

--hqmp<#>-1 <filename> file with forward reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)

--hqmp<#>-2 <filename> file with reverse reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)

--hqmp<#>-s <filename> file with unpaired reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)

--hqmp<#>-<or> orientation of reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)

--nxmate<#>-1 <filename> file with forward reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)

--nxmate<#>-2 <filename> file with reverse reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)

--sanger <filename> file with Sanger reads

--pacbio <filename> file with PacBio reads

--nanopore <filename> file with Nanopore reads

--tslr <filename> file with TSLR-contigs

--trusted-contigs <filename> file with trusted contigs

--untrusted-contigs <filename> file with untrusted contigs

Pipeline options:

--only-error-correction runs only read error correction (without assembling)

--only-assembler runs only assembling (without read error correction)

--careful tries to reduce number of mismatches and short indels

--continue continue run from the last available check-point

--restart-from <cp> restart run with updated options and from the specified check-point ('ec', 'as', 'k<int>', 'mc')

--disable-gzip-output forces error correction not to compress the corrected reads

--disable-rr disables repeat resolution stage of assembling

Advanced options:

--dataset <filename> file with dataset description in YAML format

-t/--threads <int> number of threads

[default: 16]

-m/--memory <int> RAM limit for SPAdes in Gb (terminates if exceeded)

[default: 250]

--tmp-dir <dirname> directory for temporary files

[default: <output_dir>/tmp]

-k <int,int,...> comma-separated list of k-mer sizes (must be odd and

less than 128) [default: 'auto']

--cov-cutoff <float> coverage cutoff value (a positive float number, or 'auto', or 'off') [default: 'off']

--phred-offset <33 or 64> PHRED quality offset in the input reads (33 or 64)

[default: auto-detect]

Run SPAdes to perform the assembly. The -k option can be set manually to find the k-mer sizes that give the best assembly.

!spades.py -k 77,87,97,117 --careful --only-assembler -1 clean_1.fq.gz -2 clean_2.fq.gz -o assembly -t 20
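To make the role of the -k option more concrete, the sketch below breaks reads into k-mers and links each k-mer's (k-1)-base prefix to its (k-1)-base suffix, which is the basic idea behind a de Bruijn graph. It is a toy illustration only, not how SPAdes is implemented; the reads and function names are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes following it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical overlapping reads from the sequence ATGGCGTGCAA.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
k = 5

counts = Counter(km for read in reads for km in kmers(read, k))
graph = de_bruijn_edges(reads, k)

for kmer, n in sorted(counts.items()):
    print(kmer, n)
for node in sorted(graph):
    print(node, "->", ", ".join(sorted(graph[node])))
```

A larger k tends to resolve more repeats but needs higher coverage for neighbouring k-mers to overlap, which is one reason several values are passed to -k. Odd values are required because an odd-length k-mer can never equal its own reverse complement.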

Assembly statistics

QUAST evaluates genome and metagenome assemblies.

# View the QUAST help documentation

!quast.py -h

QUAST: QUality ASsessment Tool for Genome Assemblies

Version: 4.5

Usage: python /usr/local/bin/quast.py [options] <files_with_contigs>

Options:

-o --output-dir <dirname> Directory to store all result files [default: quast_results/results_<datetime>]

-R <filename> Reference genome file

-G --genes <filename> File with gene coordinates in the reference (GFF, BED, NCBI or TXT)

-O --operons <filename> File with operon coordinates in the reference (GFF, BED, NCBI or TXT)

-m --min-contig <int> Lower threshold for contig length [default: 500]

-t --threads <int> Maximum number of threads [default: 25% of CPUs]

Advanced options:

-s --scaffolds Assemblies are scaffolds, split them and add contigs to the comparison

-l --labels "label, label, ..." Names of assemblies to use in reports, comma-separated. If contain spaces, use quotes

-L Take assembly names from their parent directory names

-f --gene-finding Predict genes using GeneMarkS (prokaryotes, default) or GeneMark-ES (eukaryotes, use --eukaryote)

--glimmer Use GlimmerHMM for gene prediction (instead of the default finder, see above)

--mgm Use MetaGeneMark for gene prediction (instead of the default finder, see above)

--gene-thresholds <int,int,...> Comma-separated list of threshold lengths of genes to search with Gene Finding module

[default: 0,300,1500,3000]

-e --eukaryote Genome is eukaryotic

--est-ref-size <int> Estimated reference size (for computing NGx metrics without a reference)

--gage Use GAGE (results are in gage_report.txt)

--contig-thresholds <int,int,...> Comma-separated list of contig length thresholds [default: 0,1000,5000,10000,25000,50000]

-u --use-all-alignments Compute genome fraction, # genes, # operons in QUAST v1.* style.

By default, QUAST filters Nucmer's alignments to keep only best ones

-i --min-alignment <int> Nucmer's parameter: the minimum alignment length [default: 0]

--min-identity <float> Nucmer's parameter: the minimum alignment identity (80.0, 100.0) [default: 95.0]

-a --ambiguity-usage <none|one|all> Use none, one, or all alignments of a contig when all of them

are almost equally good (see --ambiguity-score) [default: one]

--ambiguity-score <float> Score S for defining equally good alignments of a single contig. All alignments are sorted

by decreasing LEN * IDY% value. All alignments with LEN * IDY% < S * best(LEN * IDY%) are

discarded. S should be between 0.8 and 1.0 [default: 0.99]

--strict-NA Break contigs in any misassembly event when compute NAx and NGAx

By default, QUAST breaks contigs only by extensive misassemblies (not local ones)

-x --extensive-mis-size <int> Lower threshold for extensive misassembly size. All relocations with inconsistency

less than extensive-mis-size are counted as local misassemblies [default: 1000]

--scaffold-gap-max-size <int> Max allowed scaffold gap length difference. All relocations with inconsistency

less than scaffold-gap-size are counted as scaffold gap misassemblies [default: 10000]

Only scaffold assemblies are affected (use -s/--scaffolds)!

--unaligned-part-size <int> Lower threshold for detecting partially unaligned contigs. Such contig should have

at least one unaligned fragment >= the threshold [default: 500]

--fragmented Reference genome may be fragmented into small pieces (e.g. scaffolded reference)

--fragmented-max-indent <int> Mark translocation as fake if both alignments are located no further than N bases

from the ends of the reference fragments [default: 85]

Requires --fragmented option.

--plots-format <str> Save plots in specified format [default: pdf]

Supported formats: emf, eps, pdf, png, ps, raw, rgba, svg, svgz

--memory-efficient Run Nucmer using one thread, separately per each assembly and each chromosome

This may significantly reduce memory consumption on large genomes

--space-efficient Create only reports and plots files. .stdout, .stderr, .coords and other aux files will not be created

This may significantly reduce space consumption on large genomes. Icarus viewers also will not be built

-1 --reads1 <filename> File with forward reads (in FASTQ format, may be gzipped)

-2 --reads2 <filename> File with reverse reads (in FASTQ format, may be gzipped)

--sam <filename> SAM alignment file

--bam <filename> BAM alignment file

Reads (or SAM/BAM file) are used for structural variation detection and

coverage histogram building in Icarus

--sv-bedpe <filename> File with structural variations (in BEDPE format)

Speedup options:

--no-check Do not check and correct input fasta files. Use at your own risk (see manual)

--no-plots Do not draw plots

--no-html Do not build html reports and Icarus viewers

--no-icarus Do not build Icarus viewers

--no-snps Do not report SNPs (may significantly reduce memory consumption on large genomes)

--no-gc Do not compute GC% and GC-distribution

--no-sv Do not run structural variation detection (make sense only if reads are specified)

--no-gzip Do not compress large output files

--fast A combination of all speedup options except --no-check

Other:

--silent Do not print detailed information about each step to stdout (log file is not affected)

--test Run QUAST on the data from the test_data folder, output to quast_test_output

--test-sv Run QUAST with structural variants detection on the data from the test_data folder,

output to quast_test_output

-h --help Print full usage message

-v --version Print version

# Run QUAST

!quast.py assembly/scaffolds.fasta

# View the assembly statistics

!more quast_results/results_2019_02_23_12_05_14/report.tsv

Assembly scaffolds

# contigs (>= 0 bp) 510

# contigs (>= 1000 bp) 18

# contigs (>= 5000 bp) 14

# contigs (>= 10000 bp) 12

# contigs (>= 25000 bp) 11

# contigs (>= 50000 bp) 9

Total length (>= 0 bp) 3421550

Total length (>= 1000 bp) 3256575

Total length (>= 5000 bp) 3245372

Total length (>= 10000 bp) 3233600

Total length (>= 25000 bp) 3216581

Total length (>= 50000 bp) 3122437

# contigs 35

Largest contig 883888

Total length 3266291

GC (%) 57.23

N50 385978

N75 363878

L50 3

L75 5

# N's per 100 kbp 0.00
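The N50, N75, L50, and L75 values in this report follow a simple rule: sort the sequences from longest to shortest, add up their lengths, and report the length (Nx) and the count (Lx) at the point where the running total reaches the chosen fraction of the total length. The sketch below implements that rule on a toy list of lengths; the helper names are hypothetical and this is not QUAST's own code.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths, current = [], 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current:
                lengths.append(current)
            current = 0
        else:
            current += len(line)
    if current:
        lengths.append(current)
    return lengths


def nx(lengths, fraction=0.5):
    """Return (Nx, Lx) for the given fraction of the total length."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must not be empty")


# Toy example: total length 300, so the N50 threshold is 150.
toy_lengths = [100, 80, 60, 40, 20]
n50, l50 = nx(toy_lengths)
n75, l75 = nx(toy_lengths, 0.75)
print("N50", n50, "L50", l50)
print("N75", n75, "L75", l75)
```

To apply it to the assembly, open assembly/scaffolds.fasta, pass the file handle to `fasta_lengths`, and keep only lengths of at least 500 bp before calling `nx`, since the QUAST help above lists 500 as the default --min-contig threshold.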

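Assembly graph visualization

A de novo genome assembly contains many contigs connected in intricate ways. Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily) makes it easy to display the de Bruijn graph of an assembly, and can also overlay gene annotation results on the graph.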

!Bandage image assembly/assembly_graph.fastg assembly.png

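Contig assembly connection graph

The de Bruijn graph of the genome assembly. The upper part of the figure makes up the assembled draft genome: the colored lines are different contigs, and the break points mark branch sequences formed by contigs. The lower part shows contigs that could not be joined into scaffolds.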

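Genome annotation

Prokka is a software tool for rapid annotation of prokaryotic (bacterial and archaeal) and viral genomes. It produces GFF3, GBK, and SQN files that can be edited in Sequin and finally uploaded to the Genbank/DDBJ/ENA databases. It cannot annotate eukaryotes.

Annotation tools used:

Prodigal: coding sequences (CDS)

RNAmmer: ribosomal RNA genes (rRNA)

Aragorn: transfer RNA genes (tRNA)

SignalP: signal peptides

Infernal: non-coding RNA

Outputs:

.fna FASTA file of the original input contigs (nucleotide)

.faa FASTA file of the translated coding genes (protein)

.ffn FASTA file of all genomic features (nucleotide)

.fsa Contig sequences for submission (nucleotide)

.tbl Feature table for submission

.sqn Sequin-editable file for submission

.gbk Genbank file containing sequences and annotations

.gff GFF v3 file containing sequences and annotations

.log Log file

.txt Annotation summary statistics

# View the Prokka help documentation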

!prokka -h

Option h is ambiguous (help, hmms)

Name:

Prokka 1.12 by Torsten Seemann <torsten.seemann@gmail.com>

Synopsis:

rapid bacterial genome annotation

Usage:

prokka [options] <contigs.fasta>

General:

--help This help

--version Print version and exit

--docs Show full manual/documentation

--citation Print citation for referencing Prokka

--quiet No screen output (default OFF)

--debug Debug mode: keep all temporary files (default OFF)

Setup:

--listdb List all configured databases

--setupdb Index all installed databases

--cleandb Remove all database indices

--depends List all software dependencies

Outputs:

--outdir [X] Output folder [auto] (default '')

--force Force overwriting existing output folder (default OFF)

--prefix [X] Filename output prefix [auto] (default '')

--addgenes Add 'gene' features for each 'CDS' feature (default OFF)

--addmrna Add 'mRNA' features for each 'CDS' feature (default OFF)

--locustag [X] Locus tag prefix [auto] (default '')

--increment [N] Locus tag counter increment (default '1')

--gffver [N] GFF version (default '3')

--compliant Force Genbank/ENA/DDJB compliance: --addgenes --mincontiglen 200 --centre XXX (default OFF)

--centre [X] Sequencing centre ID. (default '')

--accver [N] Version to put in Genbank file (default '1')

Organism details:

--genus [X] Genus name (default 'Genus')

--species [X] Species name (default 'species')

--strain [X] Strain name (default 'strain')

--plasmid [X] Plasmid name or identifier (default '')

Annotations:

--kingdom [X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')

--gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')

--gram [X] Gram: -/neg +/pos (default '')

--usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)

--proteins [X] FASTA or GBK file to use as 1st priority (default '')

--hmms [X] Trusted HMM to first annotate from (default '')

--metagenome Improve gene predictions for highly fragmented genomes (default OFF)

--rawproduct Do not clean up /product annotation (default OFF)

--cdsrnaolap Allow [tr]RNA to overlap CDS (default OFF)

Computation:

--cpus [N] Number of CPUs to use [0=all] (default '8')

--fast Fast mode - only use basic BLASTP databases (default OFF)

--noanno For CDS just set /product="unannotated protein" (default OFF)

--mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')

--evalue [n.n] Similarity e-value cut-off (default '1e-06')

--rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')

--norrna Don't run rRNA search (default OFF)

--notrna Don't run tRNA search (default OFF)

--rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)

# Annotate the genome with Prokka

!prokka --outdir annotation/ --force --locustag test --prefix test --cpus 20 --centre test --compliant assembly/scaffolds.fasta

# List the files in the current directory with their sizes

!tree -h -L 2 -n

├── [4.0K] annotation

│ ├── [ 92] errorsummary.val

│ ├── [ 316] test.ecn

│ ├── [422K] test.err

│ ├── [1.0M] test.faa

│ ├── [2.9M] test.ffn

│ ├── [3.3M] test.fna

│ ├── [3.4M] test.fsa

│ ├── [7.4M] test.gbf

│ ├── [4.5M] test.gff

│ ├── [ 45K] test.log

│ ├── [ 13M] test.sqn

│ ├── [857K] test.tbl

│ ├── [230K] test.tsv

│ ├── [ 99] test.txt

│ └── [3.8K] test.val

├── [4.0K] assembly

│ ├── [6.7M] assembly_graph.fastg

│ ├── [3.3M] assembly_graph_with_scaffolds.gfa

│ ├── [3.3M] before_rr.fasta

│ ├── [3.3M] contigs.fasta

│ ├── [ 43K] contigs.paths

│ ├── [ 68] dataset.info

│ ├── [ 186] input_dataset.yaml

│ ├── [4.0K] K117

│ ├── [4.0K] K77

│ ├── [4.0K] K87

│ ├── [4.0K] K97

│ ├── [4.0K] misc

│ ├── [4.0K] mismatch_corrector

│ ├── [1.3K] params.txt

│ ├── [3.3M] scaffolds.fasta

│ ├── [ 43K] scaffolds.paths

│ ├── [144K] spades.log

│ └── [4.0K] tmp

├── [234M] clean_1.fq.gz

├── [258M] clean_2.fq.gz

├── [ 138] do.sh

├── [470K] fastp.html

├── [127K] fastp.json

├── [4.0K] quast_results

│ ├── [ 27] latest -> results_2019_02_23_12_07_03

│ ├── [4.0K] results_2019_02_23_12_05_14

│ ├── [4.0K] results_2019_02_23_12_06_26

│ └── [4.0K] results_2019_02_23_12_07_03

├── [4.0K] test_1_fastqc

│ ├── [123K] fastqc_data.txt

│ ├── [9.9K] fastqc.fo

│ ├── [311K] fastqc_report.html

│ ├── [4.0K] Icons

│ ├── [4.0K] Images

│ └── [ 494] summary.txt

├── [311K] test_1_fastqc.html

├── [389K] test_1_fastqc.zip

├── [208M] test_1.fq.gz

├── [315K] test_2_fastqc.html

├── [393K] test_2_fastqc.zip

└── [231M] test_2.fq.gz

17 directories, 41 files
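Once annotation/test.gff exists, a quick way to see what was annotated is to tally the feature types in column 3 of the GFF3 file. The sketch below assumes the standard nine-column, tab-separated GFF3 layout and stops at the ##FASTA directive that separates the annotations from the embedded sequences; the sample lines are hypothetical, not taken from this run.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF3 feature types (column 3), ignoring comments and sequences."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence data
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 9:
            counts[columns[2]] += 1
    return counts


# Hypothetical GFF3 lines used only to demonstrate the function.
sample = [
    "##gff-version 3",
    "\t".join(["contig_1", "tool", "CDS", "100", "400", ".", "+", "0", "ID=test_00001"]),
    "\t".join(["contig_1", "tool", "tRNA", "500", "575", ".", "-", ".", "ID=test_00002"]),
    "\t".join(["contig_2", "tool", "CDS", "10", "910", ".", "+", "0", "ID=test_00003"]),
    "##FASTA",
    ">contig_1",
    "ACGTACGT",
]

feature_counts = count_feature_types(sample)
for feature, n in sorted(feature_counts.items()):
    print(feature, n)
```

For the real file, pass an open file handle instead of the sample list, for example `count_feature_types(open("annotation/test.gff"))`.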

Online analysis websites

Genome annotation: RAST

http://rast.nmpdr.org/

Gene prediction

http://prodigal.ornl.gov/

http://ccb.jhu.edu/software/glimmer/index.shtml

RNAmmer

http://www.cbs.dtu.dk/services/RNAmmer/

tRNAscan

http://lowelab.ucsc.edu/tRNAscan-SE/

Tandem repeat prediction: TRF

http://tandem.bu.edu/trf/trf407b.linux.download.html

Operon prediction

http://www.microbesonline.org/operons/

http://operondb.cbcb.umd.edu/cgi-bin/operondb/operons.cgi

Genomic island prediction: IslandViewer

http://www.pathogenomics.sfu.ca/islandviewer/browse/

Prophage prediction

http://phaster.ca

Signal peptide prediction

http://www.cbs.dtu.dk/services/SignalP/

Virulence gene data: VFDB

http://www.mgc.ac.cn/VFs/

Antibiotic resistance gene data: CARD

https://card.mcmaster.ca/

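Glossary

1 De novo sequencing
Sequencing from scratch: a species is sequenced directly, without any reference sequence, and the reads are then joined and assembled into a genome sequence map for that species.

2 Library construction
Before randomly sheared genomic fragments are loaded onto the sequencer, they are amplified by PCR so that there is enough signal to support sequencing. This differs from the usual sense of building a library in vectors such as Fosmid or BAC.

3 PCR-free library construction
For samples with extreme GC content (<35% or >65%), a library construction method without PCR is used, to avoid the sequencing bias that PCR in standard library construction introduces and that would affect downstream analysis. Because there is no PCR amplification step, more input material is required (>10 ug). The method is also suitable only for short-insert libraries.

4 Index sequencing
Common when sequencing pooled samples. Different samples are mixed and sequenced together; to tell their data apart, a short known sequence (usually 8 bp) is added after the sequence to be read as a tag (the index).

5 Paired-End sequencing
Sequencing of both ends of the insert, producing reads with a paired-end relationship.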

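6 Sequencing strategy
PE100: Paired-End (100,100), that is, high-throughput sequencing of both ends of the insert with a read length of 100 bp.

7 Read length
The actual length the sequencer can read, that is, the length of the reads, for example 90 bp, 125 bp, or 300 bp.

8 Insert size
The length of the fragments produced by randomly shearing the genome to be sequenced, which are used for the subsequent library construction. Inserts of different lengths play different roles in assembly: short inserts (<500 bp) are generally used for contig-level assembly, and long inserts (>2 kb) for scaffold-level assembly.

9 Raw data (reads)
The data as it comes off the sequencer: the original reads.

10 Clean data (reads)
The data actually used for assembly and analysis, after adapter contamination has been removed, low-quality reads filtered out, and duplicates removed (for long-insert data).

11 Duplication
Two or more Paired-End read pairs in which read1 and read2 are each identical across the pairs are duplicates. During data processing such duplicates are removed and only one pair is kept.

12 Contig
A sequence assembled from overlapping reads that originate from the same genome.

13 Scaffold
Also called a super contig. Contigs are linked using the Paired-End relationships of reads, and the gaps between them are filled with N.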

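14 N50/N90
Sort the assembled sequences (scaffolds or contigs) from longest to shortest and sum their lengths cumulatively. The length of the sequence at which the running total reaches half of the total assembled length is the N50, which is commonly used to judge assembly quality. N90 is defined the same way, at 90% of the total length.

15 K-mer
A k-mer is a DNA sequence of length K, where K is a positive integer. With K=15, for example, it is also called a 15-mer. K-mers have many uses: correcting sequencing errors, building contigs, and estimating genome size, heterozygosity, and repeat content.

16 Inconsistent sequences
Sequences that do not match the sequenced genome, usually contamination from foreign DNA.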

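17 Average sequencing depth (Depth)
The average number of times each base is sequenced. It is related to how thoroughly the genome is covered, and is calculated as the amount of clean data divided by the genome size of the sequenced species. In theory, 30X of data can cover the entire genome sequence; in practice, because genomes differ, the same depth covers different genomes to different degrees.

18 Ref_genome/gene
Reference genome / reference gene.

19 Genome coverage
The proportion of the reference genome accounted for by sequence shared between the assembled genome and the reference. It can be evaluated in three ways: from reads, from k-mer analysis, or against a reference sequence.

20 Gene region coverage
The proportion of the total reference gene length accounted for by genes shared between the assembled genome and the reference genome.

21 Single-base error rate
The number of erroneous bases caused by sequencing or assembly as a proportion of the total sequence length.

22 GC skew
Calculated as (G-C)/(G+C). Because base composition shifts markedly near the bacterial origin of replication, this measure can be used to infer where replication starts.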
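As an illustration of the GC skew entry, the sketch below computes (G-C)/(G+C) in consecutive windows and then the cumulative sum of those values. On a complete circular chromosome the minimum of the cumulative curve is commonly taken as an estimate of the replication origin; on a fragmented draft assembly like the one in this article the result is much less reliable. The sequence, window size, and function names are hypothetical.

```python
def gc_skew(seq, window):
    """GC skew (G - C) / (G + C) for consecutive non-overlapping windows."""
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        g, c = chunk.count("G"), chunk.count("C")
        values.append((g - c) / (g + c) if g + c else 0.0)
    return values


def cumulative(values):
    """Running sum of a list of numbers."""
    total, out = 0.0, []
    for value in values:
        total += value
        out.append(total)
    return out


# Hypothetical toy sequence: a C-rich half followed by a G-rich half.
toy_seq = "CCCA" * 5 + "GGGA" * 5
window = 4

skews = gc_skew(toy_seq, window)
cum = cumulative(skews)
turning_window = cum.index(min(cum))

print("skew per window:", skews)
print("cumulative skew:", cum)
print("minimum reached at window", turning_window,
      "ending at position", (turning_window + 1) * window)
```

For a real genome the window would be far larger (thousands of bases), and the curve is usually plotted rather than reduced to a single index.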
