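This article uses the sequencing data from one bacterium as an example to walk through a bacterial genome sequencing analysis workflow. The genome in this experiment is about 3.4 Mb. Illumina PE150 sequencing produced roughly 800 Mb of data, which was assembled into a draft genome sequence for the sample and then used for gene annotation and related analyses.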
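Contents
1 Overview
2 Analysis of de novo genome sequencing
2.1 Analysis workflow
2.2 Sequencing data format
2.3 Data preparation
2.4 Data quality control
2.5 Genome assembly
2.6 Assembly statistics
2.7 Assembly graph
2.8 Genome annotation
3 Online analysis sites
4 Glossary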
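Overview
Microorganisms are found throughout nature and are closely tied to human production and daily life. They are currently usually divided into bacteria, fungi, actinomycetes, spirochetes, rickettsiae, chlamydiae, mycoplasmas, and viruses.
As sequencing costs have fallen sharply and sequencing throughput has risen by orders of magnitude, whole-genome sequencing has greatly advanced genomics research on single microbial strains. A whole-genome sequence makes it possible to build a genome database for the species. It provides an efficient platform for later studies of the species' growth, development, evolution, and origin, and it supplies the DNA sequence information needed for gene mining and functional validation.
Bacterial genome sequencing falls into two categories: de novo sequencing and resequencing. De novo sequencing means sequencing a bacterial species with no existing sequence information and assembling the reads with bioinformatics methods to obtain the genome sequence of that species. Resequencing means sequencing the genomes of different individuals of a species that already has a reference genome (Reference Sequence), and using the results to analyze differences at the individual or population level. The variants of interest include large numbers of single nucleotide polymorphisms (SNP), insertions and deletions (InDel, Insertion/Deletion), and structural variations (Structural Variation, SV).
This article focuses on the analysis workflow for bacterial de novo genome sequencing.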
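Analysis of de novo genome sequencing
Analysis workflow
Sequencing data format
Samples are separated by their barcode sequences, and the extracted data are saved in standard fastq format. Taking paired-end (PE) data as an example, each sample has two files, read1.fastq and read2.fastq. Read 1 is sequenced from one end of each DNA fragment and read 2 from the opposite end, on the complementary strand; both are reported in the 5' -> 3' direction. The fastq format was introduced in the earlier post "Getting Started with Bioinformatics: Fasta and Fastq File Formats Explained".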
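Line 4 of each fastq record holds the sequencing quality of the read. Each character gives the quality value of the corresponding base in line 2, written as the ASCII character whose code is the Phred value plus 33. The Phred value is calculated as Q = -10*log10(e), where e is the base-calling error rate.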
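Quality scores and the corresponding error probabilities:
Phred Quality Score    Probability of incorrect base call    Base call accuracy
10                     1 in 10                               90%
20                     1 in 100                              99%
30                     1 in 1000                             99.9%
40                     1 in 10000                            99.99%
As the table shows, a base with a Q value of 20 has a 1% probability of having been called incorrectly.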
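The relation Q = -10*log10(e) and the +33 ASCII offset described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original workflow; the function names and the quality string are made up for the example.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Invert Q = -10 * log10(e) to get the error probability e."""
    return 10 ** (-q / 10)


# Hypothetical quality string, not taken from test_1.fq.gz
quality_line = "I5+?"
for ch, q in zip(quality_line, phred_scores(quality_line)):
    print(ch, q, f"{error_probability(q):.4f}")
```

Each printed row shows the quality character, its Phred score, and the implied error probability, which can be checked against the table above.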
Data preparation
!ls
test_1.fq.gz test_2.fq.gz
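The ls command shows two compressed fastq files in the current directory. They are the raw data from Illumina sequencing, and they are also the files that are usually submitted to the NCBI SRA database. For how to submit them, see the post "Getting Started with Bioinformatics: How to Upload Raw Sequencing Data to NCBI".
Data quality control
The fastqc software is used for quality control of the raw sequencing reads. It generates an HTML report that gives a quick view of the data quality.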
!fastqc -t 10 test_1.fq.gz test_2.fq.gz
Started analysis of test_1.fq.gz
Started analysis of test_2.fq.gz
Approx 5% complete for test_1.fq.gz
Approx 5% complete for test_2.fq.gz
Approx 10% complete for test_1.fq.gz
Approx 10% complete for test_2.fq.gz
Approx 15% complete for test_1.fq.gz
Approx 15% complete for test_2.fq.gz
Approx 20% complete for test_1.fq.gz
Approx 20% complete for test_2.fq.gz
Approx 25% complete for test_1.fq.gz
Approx 25% complete for test_2.fq.gz
Approx 30% complete for test_1.fq.gz
Approx 30% complete for test_2.fq.gz
Approx 35% complete for test_1.fq.gz
Approx 35% complete for test_2.fq.gz
Approx 40% complete for test_1.fq.gz
Approx 40% complete for test_2.fq.gz
Approx 45% complete for test_1.fq.gz
Approx 45% complete for test_2.fq.gz
Approx 50% complete for test_1.fq.gz
Approx 50% complete for test_2.fq.gz
Approx 55% complete for test_1.fq.gz
Approx 55% complete for test_2.fq.gz
Approx 60% complete for test_1.fq.gz
Approx 60% complete for test_2.fq.gz
Approx 65% complete for test_1.fq.gz
Approx 65% complete for test_2.fq.gz
Approx 70% complete for test_1.fq.gz
Approx 70% complete for test_2.fq.gz
Approx 75% complete for test_1.fq.gz
Approx 75% complete for test_2.fq.gz
Approx 80% complete for test_1.fq.gz
Approx 80% complete for test_2.fq.gz
Approx 85% complete for test_1.fq.gz
Approx 85% complete for test_2.fq.gz
Approx 90% complete for test_1.fq.gz
Approx 90% complete for test_2.fq.gz
Approx 95% complete for test_1.fq.gz
Approx 95% complete for test_2.fq.gz
Analysis complete for test_1.fq.gz
Analysis complete for test_2.fq.gz
View the result files produced by fastqc
!ls -t test*html #-t sort by modification time, newest first
test_1_fastqc.html test_2_fastqc.html
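Open test_1_fastqc.html to view the quality results as plots:
The quality curves for this data set are all above Q30, which indicates good quality. The horizontal axis is the position in the read and the vertical axis is the quality score. The red line is the median, the yellow box spans the 25%-75% range, the whiskers span the 10%-90% range, and the blue line is the mean.
The fastp software is used to filter the raw sequencing data, removing low-quality reads, adapters, and reads with too many N bases, to produce the clean data.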
!fastp -i test_1.fq.gz -I test_2.fq.gz -o clean_1.fq.gz -O clean_2.fq.gz
Read1 before filtering:
total reads: 2783612
total bases: 417541800
Q20 bases: 409754276(98.1349%)
Q30 bases: 398333598(95.3997%)
Read1 after filtering:
total reads: 2708145
total bases: 406146010
Q20 bases: 401516631(98.8602%)
Q30 bases: 391420330(96.3743%)
Read2 before filtering:
total reads: 2783612
total bases: 417541800
Q20 bases: 399039636(95.5688%)
Q30 bases: 376827191(90.249%)
Read2 aftering filtering:
total reads: 2708145
total bases: 406146010
Q20 bases: 392948823(96.7506%)
Q30 bases: 372542262(91.7262%)
Filtering result:
reads passed filter: 5416290
reads failed due to low quality: 149360
reads failed due to too many N: 1574
reads failed due to too short: 0
reads with adapter trimmed: 9616
bases trimmed due to adapters: 157916
JSON report: fastp.json
HTML report: fastp.html
fastp -i test_1.fq.gz -I test_2.fq.gz -o clean_1.fq.gz -O clean_2.fq.gz
fastp v0.12.2, time used: 66 seconds
The listing shows the result files produced by fastp, clean_1.fq.gz and clean_2.fq.gz
!ls -t
clean_2.fq.gz fastp.html test_1_fastqc.html test_1_fastqc.zip test_1.fq.gz
clean_1.fq.gz fastp.json test_2_fastqc.html test_2_fastqc.zip test_2.fq.gz
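The fastp summary can be checked with a little arithmetic. The sketch below copies its numbers from the fastp output shown earlier and uses the approximate 3.4 Mb genome size from the introduction; the variable names are illustrative. It confirms that the filtering categories add up, and estimates the fraction of reads retained and the average sequencing depth (clean bases divided by genome size).

```python
# Numbers copied from the fastp summary printed earlier in this article
reads_before = 2783612 * 2        # read1 + read2 before filtering
reads_passed = 5416290
failed_low_quality = 149360
failed_too_many_n = 1574
failed_too_short = 0

clean_bases = 406146010 * 2       # read1 + read2 bases after filtering
genome_size = 3.4e6               # approximate genome size, an estimate

# Every input read is either kept or assigned to one failure category
assert (reads_passed + failed_low_quality
        + failed_too_many_n + failed_too_short) == reads_before

retention = reads_passed / reads_before
depth = clean_bases / genome_size

print(f"reads retained: {retention:.2%}")
print(f"estimated average depth: {depth:.0f}x")
```

The depth figure is only as good as the genome size estimate, so it should be read as a rough guide.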
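Genome assembly
The short-read assembler SPAdes (version 3.11) is used to assemble the clean data, and the best assembly is obtained after several rounds of parameter tuning. The reads are then mapped back to the assembled contigs, and the assembly is locally reassembled and refined using the paired-end and overlap relationships among the reads.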
!spades.py -h
SPAdes genome assembler v3.11.0
Usage: /usr/local/bin/spades.py [options] -o <output_dir>
Basic options:
-o <output_dir> directory to store all the resulting files (required)
--sc this flag is required for MDA (single-cell) data
--meta this flag is required for metagenomic sample data
--rna this flag is required for RNA-Seq data
--plasmid runs plasmidSPAdes pipeline for plasmid detection
--iontorrent this flag is required for IonTorrent data
--test runs SPAdes on toy dataset
-h/--help prints this usage message
-v/--version prints version
Input data:
--12 <filename> file with interlaced forward and reverse paired-end reads
-1 <filename> file with forward paired-end reads
-2 <filename> file with reverse paired-end reads
-s <filename> file with unpaired reads
--pe<#>-12 <filename> file with interlaced reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-1 <filename> file with forward reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-2 <filename> file with reverse reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-s <filename> file with unpaired reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-<or> orientation of reads for paired-end library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--s<#> <filename> file with unpaired reads for single reads library number <#> (<#> = 1,2,..,9)
--mp<#>-12 <filename> file with interlaced reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-1 <filename> file with forward reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-2 <filename> file with reverse reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-s <filename> file with unpaired reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-<or> orientation of reads for mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--hqmp<#>-12 <filename> file with interlaced reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-1 <filename> file with forward reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-2 <filename> file with reverse reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-s <filename> file with unpaired reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-<or> orientation of reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--nxmate<#>-1 <filename> file with forward reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--nxmate<#>-2 <filename> file with reverse reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--sanger <filename> file with Sanger reads
--pacbio <filename> file with PacBio reads
--nanopore <filename> file with Nanopore reads
--tslr <filename> file with TSLR-contigs
--trusted-contigs <filename> file with trusted contigs
--untrusted-contigs <filename> file with untrusted contigs
Pipeline options:
--only-error-correction runs only read error correction (without assembling)
--only-assembler runs only assembling (without read error correction)
--careful tries to reduce number of mismatches and short indels
--continue continue run from the last available check-point
--restart-from <cp> restart run with updated options and from the specified check-point ('ec', 'as', 'k<int>', 'mc')
--disable-gzip-output forces error correction not to compress the corrected reads
--disable-rr disables repeat resolution stage of assembling
Advanced options:
--dataset <filename> file with dataset description in YAML format
-t/--threads <int> number of threads
[default: 16]
-m/--memory <int> RAM limit for SPAdes in Gb (terminates if exceeded)
[default: 250]
--tmp-dir <dirname> directory for temporary files
[default: <output_dir>/tmp]
-k <int,int,...> comma-separated list of k-mer sizes (must be odd and
less than 128) [default: 'auto']
--cov-cutoff <float> coverage cutoff value (a positive float number, or 'auto', or 'off') [default: 'off']
--phred-offset <33 or 64> PHRED quality offset in the input reads (33 or 64)
[default: auto-detect]
Run spades to perform the assembly. The -k option can be customized to find the best assembly.
!spades.py -k 77,87,97,117 --careful --only-assembler -1 clean_1.fq.gz -2 clean_2.fq.gz -o assembly -t 20
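Before launching a long assembly it can help to sanity-check the -k list. The help text above states that k-mer sizes must be odd and less than 128. The sketch below also assumes the values should be unique and ascending, and that each k should be shorter than the read length; these two extra checks are assumptions of this example, and `check_kmer_sizes` is a hypothetical helper, not part of SPAdes.

```python
def check_kmer_sizes(kmers, read_length=150):
    """Return a list of problems found in a SPAdes -k list (empty if none)."""
    problems = []
    for k in kmers:
        if k % 2 == 0:
            problems.append(f"{k} is even; k must be odd")
        if k >= 128:
            problems.append(f"{k} is not less than 128")
        if k >= read_length:
            # Assumption of this sketch: a k-mer cannot be longer than a read
            problems.append(f"{k} is not shorter than the {read_length} bp reads")
    if list(kmers) != sorted(set(kmers)):
        # Assumption of this sketch: values are unique and ascending
        problems.append("k values should be unique and in ascending order")
    return problems


# The list used in the spades.py command above
print(check_kmer_sizes([77, 87, 97, 117]))
```

An empty list means none of the checks failed.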
Assembly statistics
QUAST is used to evaluate genome and metagenome assemblies
# Show the quast help
!quast.py -h
QUAST: QUality ASsessment Tool for Genome Assemblies
Version: 4.5
Usage: python /usr/local/bin/quast.py [options] <files_with_contigs>
Options:
-o --output-dir <dirname> Directory to store all result files [default: quast_results/results_<datetime>]
-R <filename> Reference genome file
-G --genes <filename> File with gene coordinates in the reference (GFF, BED, NCBI or TXT)
-O --operons <filename> File with operon coordinates in the reference (GFF, BED, NCBI or TXT)
-m --min-contig <int> Lower threshold for contig length [default: 500]
-t --threads <int> Maximum number of threads [default: 25% of CPUs]
Advanced options:
-s --scaffolds Assemblies are scaffolds, split them and add contigs to the comparison
-l --labels "label, label, ..." Names of assemblies to use in reports, comma-separated. If contain spaces, use quotes
-L Take assembly names from their parent directory names
-f --gene-finding Predict genes using GeneMarkS (prokaryotes, default) or GeneMark-ES (eukaryotes, use --eukaryote)
--glimmer Use GlimmerHMM for gene prediction (instead of the default finder, see above)
--mgm Use MetaGeneMark for gene prediction (instead of the default finder, see above)
--gene-thresholds <int,int,...> Comma-separated list of threshold lengths of genes to search with Gene Finding module
[default: 0,300,1500,3000]
-e --eukaryote Genome is eukaryotic
--est-ref-size <int> Estimated reference size (for computing NGx metrics without a reference)
--gage Use GAGE (results are in gage_report.txt)
--contig-thresholds <int,int,...> Comma-separated list of contig length thresholds [default: 0,1000,5000,10000,25000,50000]
-u --use-all-alignments Compute genome fraction, # genes, # operons in QUAST v1.* style.
By default, QUAST filters Nucmer's alignments to keep only best ones
-i --min-alignment <int> Nucmer's parameter: the minimum alignment length [default: 0]
--min-identity <float> Nucmer's parameter: the minimum alignment identity (80.0, 100.0) [default: 95.0]
-a --ambiguity-usage <none|one|all> Use none, one, or all alignments of a contig when all of them
are almost equally good (see --ambiguity-score) [default: one]
--ambiguity-score <float> Score S for defining equally good alignments of a single contig. All alignments are sorted
by decreasing LEN * IDY% value. All alignments with LEN * IDY% < S * best(LEN * IDY%) are
discarded. S should be between 0.8 and 1.0 [default: 0.99]
--strict-NA Break contigs in any misassembly event when compute NAx and NGAx
By default, QUAST breaks contigs only by extensive misassemblies (not local ones)
-x --extensive-mis-size <int> Lower threshold for extensive misassembly size. All relocations with inconsistency
less than extensive-mis-size are counted as local misassemblies [default: 1000]
--scaffold-gap-max-size <int> Max allowed scaffold gap length difference. All relocations with inconsistency
less than scaffold-gap-size are counted as scaffold gap misassemblies [default: 10000]
Only scaffold assemblies are affected (use -s/--scaffolds)!
--unaligned-part-size <int> Lower threshold for detecting partially unaligned contigs. Such contig should have
at least one unaligned fragment >= the threshold [default: 500]
--fragmented Reference genome may be fragmented into small pieces (e.g. scaffolded reference)
--fragmented-max-indent <int> Mark translocation as fake if both alignments are located no further than N bases
from the ends of the reference fragments [default: 85]
Requires --fragmented option.
--plots-format <str> Save plots in specified format [default: pdf]
Supported formats: emf, eps, pdf, png, ps, raw, rgba, svg, svgz
--memory-efficient Run Nucmer using one thread, separately per each assembly and each chromosome
This may significantly reduce memory consumption on large genomes
--space-efficient Create only reports and plots files. .stdout, .stderr, .coords and other aux files will not be created
This may significantly reduce space consumption on large genomes. Icarus viewers also will not be built
-1 --reads1 <filename> File with forward reads (in FASTQ format, may be gzipped)
-2 --reads2 <filename> File with reverse reads (in FASTQ format, may be gzipped)
--sam <filename> SAM alignment file
--bam <filename> BAM alignment file
Reads (or SAM/BAM file) are used for structural variation detection and
coverage histogram building in Icarus
--sv-bedpe <filename> File with structural variations (in BEDPE format)
Speedup options:
--no-check Do not check and correct input fasta files. Use at your own risk (see manual)
--no-plots Do not draw plots
--no-html Do not build html reports and Icarus viewers
--no-icarus Do not build Icarus viewers
--no-snps Do not report SNPs (may significantly reduce memory consumption on large genomes)
--no-gc Do not compute GC% and GC-distribution
--no-sv Do not run structural variation detection (make sense only if reads are specified)
--no-gzip Do not compress large output files
--fast A combination of all speedup options except --no-check
Other:
--silent Do not print detailed information about each step to stdout (log file is not affected)
--test Run QUAST on the data from the test_data folder, output to quast_test_output
--test-sv Run QUAST with structural variants detection on the data from the test_data folder,
output to quast_test_output
-h --help Print full usage message
-v --version Print version
# Run quast
!quast.py assembly/scaffolds.fasta
# View the assembly statistics
!more quast_results/results_2019_02_23_12_05_14/report.tsv
Assembly scaffolds
# contigs (>= 0 bp) 510
# contigs (>= 1000 bp) 18
# contigs (>= 5000 bp) 14
# contigs (>= 10000 bp) 12
# contigs (>= 25000 bp) 11
# contigs (>= 50000 bp) 9
Total length (>= 0 bp) 3421550
Total length (>= 1000 bp) 3256575
Total length (>= 5000 bp) 3245372
Total length (>= 10000 bp) 3233600
Total length (>= 25000 bp) 3216581
Total length (>= 50000 bp) 3122437
# contigs 35
Largest contig 883888
Total length 3266291
GC (%) 57.23
N50 385978
N75 363878
L50 3
L75 5
# N's per 100 kbp 0.00
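The N50, N75, L50, and L75 values in the report follow from a simple cumulative sum over sorted sequence lengths. The sketch below is a simplified illustration rather than QUAST's implementation, and `nx_lx` and `fasta_lengths` are hypothetical helper names. The 500 bp default mirrors the --min-contig default shown in the QUAST help.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx): the length at which the running total of sorted
    lengths first reaches `fraction` of the total, and how many sequences
    were needed to get there."""
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("no sequence lengths given")


def fasta_lengths(path, min_length=500):
    """Read sequence lengths from a FASTA file, dropping short sequences."""
    lengths = []
    current = None
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line.strip())
    if current is not None:
        lengths.append(current)
    return [n for n in lengths if n >= min_length]


# Toy lengths for illustration only
toy = [80, 70, 50, 40, 30, 20]
print("N50, L50:", nx_lx(toy, 0.5))
print("N75, L75:", nx_lx(toy, 0.75))

# On the real assembly one might call (not run here):
# nx_lx(fasta_lengths("assembly/scaffolds.fasta"))
```

For the toy list the total is 290, so N50 is the length at which the running total first reaches 145.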
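Assembly graph
A de novo assembly contains many contig sequences connected in intricate ways. Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily) makes it easy to display the de Bruijn graph of the assembly, and it can also overlay gene annotation results on the assembly.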
!Bandage image assembly/assembly_graph.fastg assembly.png
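Contig assembly graph
The de Bruijn graph of the genome assembly. The upper part of the figure is the assembled draft genome: the colored lines are different contigs, and the break points mark where contigs form branching sequences. The lower part shows the contigs that could not be joined into scaffolds.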
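Genome annotation
Prokka is a software tool for rapid annotation of prokaryotic genomes (bacteria and archaea) as well as viral genomes. It produces GFF3, GBK, and SQN files that can be edited in Sequin and ultimately submitted to the Genbank/DDBJ/ENA databases. It cannot annotate eukaryotes.
Annotation tools used:
Prodigal coding sequences (CDS)
RNAmmer ribosomal RNA genes (rRNA)
Aragorn transfer RNA genes (tRNA)
SignalP signal peptides
Infernal non-coding RNA
Outputs:
.fna FASTA file of the original input contigs (nucleotide)
.faa FASTA file of the translated coding genes (protein)
.ffn FASTA file of all genomic features (nucleotide)
.fsa contig sequences for submission (nucleotide)
.tbl feature table for submission
.sqn Sequin editable file for submission
.gbk Genbank file containing sequences and annotations
.gff GFF v3 file containing sequences and annotations
.log log file
.txt annotation summary statistics
# Show the prokka help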
!prokka -h
Option h is ambiguous (help, hmms)
Name:
Prokka 1.12 by Torsten Seemann <torsten.seemann@gmail.com>
Synopsis:
rapid bacterial genome annotation
Usage:
prokka [options] <contigs.fasta>
General:
--help This help
--version Print version and exit
--docs Show full manual/documentation
--citation Print citation for referencing Prokka
--quiet No screen output (default OFF)
--debug Debug mode: keep all temporary files (default OFF)
Setup:
--listdb List all configured databases
--setupdb Index all installed databases
--cleandb Remove all database indices
--depends List all software dependencies
Outputs:
--outdir [X] Output folder [auto] (default '')
--force Force overwriting existing output folder (default OFF)
--prefix [X] Filename output prefix [auto] (default '')
--addgenes Add 'gene' features for each 'CDS' feature (default OFF)
--addmrna Add 'mRNA' features for each 'CDS' feature (default OFF)
--locustag [X] Locus tag prefix [auto] (default '')
--increment [N] Locus tag counter increment (default '1')
--gffver [N] GFF version (default '3')
--compliant Force Genbank/ENA/DDJB compliance: --addgenes --mincontiglen 200 --centre XXX (default OFF)
--centre [X] Sequencing centre ID. (default '')
--accver [N] Version to put in Genbank file (default '1')
Organism details:
--genus [X] Genus name (default 'Genus')
--species [X] Species name (default 'species')
--strain [X] Strain name (default 'strain')
--plasmid [X] Plasmid name or identifier (default '')
Annotations:
--kingdom [X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
--gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')
--gram [X] Gram: -/neg +/pos (default '')
--usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
--proteins [X] FASTA or GBK file to use as 1st priority (default '')
--hmms [X] Trusted HMM to first annotate from (default '')
--metagenome Improve gene predictions for highly fragmented genomes (default OFF)
--rawproduct Do not clean up /product annotation (default OFF)
--cdsrnaolap Allow [tr]RNA to overlap CDS (default OFF)
Computation:
--cpus [N] Number of CPUs to use [0=all] (default '8')
--fast Fast mode - only use basic BLASTP databases (default OFF)
--noanno For CDS just set /product="unannotated protein" (default OFF)
--mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
--evalue [n.n] Similarity e-value cut-off (default '1e-06')
--rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
--norrna Don't run rRNA search (default OFF)
--notrna Don't run tRNA search (default OFF)
--rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
# Genome annotation command (prokka)
!prokka --outdir annotation/ --force --locustag test --prefix test --cpus 20 --centre test --compliant assembly/scaffolds.fasta
# List the files in the current directory with their sizes
!tree -h -L 2 -n
.
├── [4.0K] annotation
│ ├── [ 92] errorsummary.val
│ ├── [ 316] test.ecn
│ ├── [422K] test.err
│ ├── [1.0M] test.faa
│ ├── [2.9M] test.ffn
│ ├── [3.3M] test.fna
│ ├── [3.4M] test.fsa
│ ├── [7.4M] test.gbf
│ ├── [4.5M] test.gff
│ ├── [ 45K] test.log
│ ├── [ 13M] test.sqn
│ ├── [857K] test.tbl
│ ├── [230K] test.tsv
│ ├── [ 99] test.txt
│ └── [3.8K] test.val
├── [4.0K] assembly
│ ├── [6.7M] assembly_graph.fastg
│ ├── [3.3M] assembly_graph_with_scaffolds.gfa
│ ├── [3.3M] before_rr.fasta
│ ├── [3.3M] contigs.fasta
│ ├── [ 43K] contigs.paths
│ ├── [ 68] dataset.info
│ ├── [ 186] input_dataset.yaml
│ ├── [4.0K] K117
│ ├── [4.0K] K77
│ ├── [4.0K] K87
│ ├── [4.0K] K97
│ ├── [4.0K] misc
│ ├── [4.0K] mismatch_corrector
│ ├── [1.3K] params.txt
│ ├── [3.3M] scaffolds.fasta
│ ├── [ 43K] scaffolds.paths
│ ├── [144K] spades.log
│ └── [4.0K] tmp
├── [234M] clean_1.fq.gz
├── [258M] clean_2.fq.gz
├── [ 138] do.sh
├── [470K] fastp.html
├── [127K] fastp.json
├── [4.0K] quast_results
│ ├── [ 27] latest -> results_2019_02_23_12_07_03
│ ├── [4.0K] results_2019_02_23_12_05_14
│ ├── [4.0K] results_2019_02_23_12_06_26
│ └── [4.0K] results_2019_02_23_12_07_03
├── [4.0K] test_1_fastqc
│ ├── [123K] fastqc_data.txt
│ ├── [9.9K] fastqc.fo
│ ├── [311K] fastqc_report.html
│ ├── [4.0K] Icons
│ ├── [4.0K] Images
│ └── [ 494] summary.txt
├── [311K] test_1_fastqc.html
├── [389K] test_1_fastqc.zip
├── [208M] test_1.fq.gz
├── [315K] test_2_fastqc.html
├── [393K] test_2_fastqc.zip
└── [231M] test_2.fq.gz
17 directories, 41 files
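A quick way to summarize an annotation is to count the feature types in the GFF file. The sketch below assumes the usual GFF3 layout: nine tab-separated columns with the feature type in column 3, and an optional ##FASTA section at the end. The records are made up for illustration and are not taken from annotation/test.gff; `count_feature_types` is a hypothetical helper.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count the feature type (column 3) of each GFF3 record."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break                  # embedded sequences follow; no more features
        if not line or line.startswith("#"):
            continue               # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Made-up records for illustration only
example = [
    "##gff-version 3",
    "contig_1\tProdigal:2.6\tCDS\t100\t400\t.\t+\t0\tID=test_00001",
    "contig_1\tAragorn:1.2\ttRNA\t500\t575\t.\t-\t.\tID=test_00002",
    "contig_1\tProdigal:2.6\tCDS\t600\t900\t.\t+\t0\tID=test_00003",
    "##FASTA",
    ">contig_1",
    "ACGT",
]
print(dict(count_feature_types(example)))

# On the real output one might call (not run here):
# with open("annotation/test.gff") as handle:
#     print(count_feature_types(handle))
```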
Online analysis sites
Genome annotation (RAST)
http://rast.nmpdr.org/
Gene prediction
http://prodigal.ornl.gov/
http://ccb.jhu.edu/software/glimmer/index.shtml
RNAmmer
http://www.cbs.dtu.dk/services/RNAmmer/
tRNAscan
http://lowelab.ucsc.edu/tRNAscan-SE/
Tandem repeat prediction (TRF)
http://tandem.bu.edu/trf/trf407b.linux.download.html
Operon prediction
http://www.microbesonline.org/operons/
http://operondb.cbcb.umd.edu/cgi-bin/operondb/operons.cgi
Genomic island prediction (IslandViewer)
http://www.pathogenomics.sfu.ca/islandviewer/browse/
Prophage prediction
http://phaster.ca
Signal peptide prediction
http://www.cbs.dtu.dk/services/SignalP/
Virulence gene database (VFDB)
http://www.mgc.ac.cn/VFs/
Antibiotic resistance gene database (CARD)
https://card.mcmaster.ca/
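Glossary
1 De novo sequencing
Sequencing from scratch: a species is sequenced directly without any reference sequence, and the reads are then assembled into a genome sequence map of that species.
2 Library construction
PCR-based amplification of the randomly sheared genomic fragments before they are loaded on the sequencer, performed so that sequencing yields a strong enough signal. It differs from library construction in the traditional sense, which is based on vectors such as Fosmid or BAC.
3 PCR-free library construction
A library construction method that does not rely on PCR, used for samples with abnormal GC content (<35% or >65%) to avoid the sequencing bias that PCR introduces in standard library construction and that would affect downstream analysis. Because there is no PCR amplification step, more sample is required (>10 ug). It is also suitable only for small-insert libraries.
4 Index sequencing
Common when sequencing pooled samples. Different samples are mixed and sequenced together. To tell their data apart, a known sequence (usually 8 bp) is appended to each target sequence as a tag (the index).
5 Paired-End sequencing
Sequencing both ends of an insert, which produces reads with a Paired-End relationship.
6 Sequencing strategy
PE100: Paired-End (100,100), meaning both ends of each insert are sequenced by high-throughput paired-end sequencing with a read length of 100 bp.
7 Read length
The actual length the sequencer can read, that is, the length of the reads, for example 90 bp, 125 bp, or 300 bp.
8 Insert size
The length of the fragments into which the target genome is randomly sheared for library construction. Inserts of different lengths play different roles in assembly: small inserts (<500 bp) are generally used for Contig-level assembly, and large inserts (>2 kb) for Scaffold-level assembly.
9 Raw data (reads)
The data as they come off the sequencer: the original reads.
10 Clean data (reads)
The data actually used for assembly and analysis, obtained from the raw data by removing adapter contamination, filtering low-quality reads, and removing duplication (for large-insert data).
11 Duplication
Two or more pairs of Paired-End reads in which read1 is identical across the pairs and read2 is identical across the pairs. Such duplication is removed during data processing so that only one pair of reads is kept.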
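12 Contig
A sequence assembled from overlapping reads that originate from the same genome.
13 Scaffold
Also called a super Contig: Contigs joined using the Paired-End relationships of reads, with the gaps between them filled with N.
14 N50/N90
Sort the assembled sequences (Scaffolds or Contigs) from longest to shortest and sum their lengths cumulatively. The length of the sequence at which the running total reaches half of the total assembled length is the N50 of the assembly, and it is commonly used to measure assembly quality. N90 is defined in the same way, as the sequence length at which the running total reaches 90% of the total length.
15 K-mer
A K-mer is a DNA sequence of length K, where K is a positive integer. For example, when K=15 it is also called a 15-mer. K-mers have many uses: correcting sequencing errors, building Contigs, and estimating genome size, heterozygosity, and repeat content.
16 Inconsistent sequences
Sequences that do not match the sequenced genome, usually contamination from foreign DNA.
17 Average sequencing depth (Depth)
The average number of times each base is sequenced. It is related to how completely the genome is covered and is calculated as the amount of clean data divided by the genome size of the sequenced species. In theory 30X of data can cover the whole genome sequence, but in practice the coverage achieved at the same depth varies with the characteristics of the genome.
18 Ref_genome/gene
Reference genome or reference gene.
19 Genome coverage
The fraction of sequence shared between the assembled genome and the reference genome, relative to the full reference sequence. It can be evaluated in three ways: from reads, from K-mer analysis, or from a reference sequence.
20 Gene region coverage
The fraction of gene length shared between the assembled genome and the reference genome, relative to the total gene length in the reference.
21 Single-base error rate
The number of incorrect bases caused by sequencing or assembly, as a proportion of the total sequence length.
22 GC skew
Calculated as (G-C)/(G+C). Because the G/C composition changes markedly near the replication origin of a bacterial genome, this measure can be used to infer the origin of replication.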
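The GC skew formula (G-C)/(G+C) is usually evaluated in windows along the genome. The sketch below is a minimal illustration: `gc_skew` is a hypothetical helper, the window size is arbitrary, and the toy sequence is made up so that the sign change is easy to see.

```python
def gc_skew(sequence, window=1000):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    sequence = sequence.upper()
    values = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows with no G or C get a skew of 0.0 to avoid dividing by zero
        values.append((g - c) / (g + c) if g + c else 0.0)
    return values


# Toy sequence: the first half is G-rich, the second half is C-rich
toy = "GGGC" * 5 + "CCCG" * 5
print(gc_skew(toy, window=20))
```

On a real genome the windows would be much larger, and the position where the skew changes sign is a candidate for the replication origin or terminus.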