Background, adapted from the original English text:
Several samples can be combined into a single sequencer run by "multiplexing": a barcode sequence that identifies the sample is inserted into the sequencing construct. Barcodes are also called index sequences.
With Illumina sequencing, the barcode is usually positioned before the sequencing primer, so it does not appear in the forward reads that contain the biological sequence. Barcodes are obtained by making one (single-indexing) or two (dual-indexing) additional reads, which are sometimes called i1 for single indexing and i5+i7 for dual indexing.
With other next-generation sequencers, the barcode sequence usually appears at the beginning of the read, possibly after a machine-specific sequence such as TCAG for 454.
With current Illumina software and standard library preparation protocols, demultiplexing is usually done for you, and the BaseSpace download includes one FASTQ file for each sample; the index reads are not included. However, it is sometimes useful to do the demultiplexing yourself, in which case you can get the "raw" i1, r1 and r2 reads.
With both 454 and Illumina, reads are assigned to the wrong sample because of incorrect barcode sequences at a surprisingly high rate. The original author calls this problem cross-talk. A suggested strategy for reducing cross-talk is to use a sparse dual index scheme where most pairs of indexes are not assigned to samples.
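To illustrate why a sparse dual-index scheme helps, here is a minimal shell sketch. It assumes two hypothetical tab-separated tables that are not part of the original data: one listing the index pairs assigned to samples (sample, i5, i7) and one listing the index pair observed for each read (read, i5, i7). The function name and both file layouts are assumptions made for illustration only.

```shell
# classify_index_pairs ASSIGNED OBSERVED
#   ASSIGNED: sample<TAB>i5<TAB>i7   (hypothetical layout, one line per sample)
#   OBSERVED: read<TAB>i5<TAB>i7     (hypothetical layout, one line per read)
# Prints read<TAB>sample, or read<TAB>unassigned when the observed
# index pair was never assigned to any sample.
classify_index_pairs() {
  awk -F'\t' '
    # First file: remember which (i5, i7) pair belongs to which sample
    NR == FNR { sample[$2 SUBSEP $3] = $1; next }
    # Second file: look up the observed pair for each read
    {
      key = $2 SUBSEP $3
      if (key in sample) print $1 "\t" sample[key]
      else               print $1 "\tunassigned"
    }
  ' "$1" "$2"
}
```

With a sparse scheme, most erroneous index combinations fall on pairs that were never assigned, so they show up as unassigned instead of contaminating a real sample.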
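The data I received consists of two paired-end 16S amplicon sequencing files with the suffixes R1.fq.gz and R2.fq.gz. Sixty samples are pooled in these two files and need to be separated by their sample barcodes. I found that several tools can do this, for example fastq-multx, seqtk_demultiplex and usearch -fastx_demux.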
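Confirming read orientation
Before demultiplexing, first check whether the R1 and R2 file names are labeled correctly. In my batch they were swapped, which I only found out later while troubleshooting. To check, search both files for the forward and reverse primers: the file whose reads match the forward primer on the plus strand is R1. This also verifies that the primers themselves are correct. The usearch -search_oligodb command can be used for this: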
# Path to the software
usearch=/home/tools/protocols/dix-seq-0.0.1/binaries
# Primer sequence files
[shenmy@extranet-206 raw_data]$ cat 515F.fa
>515F
GTGCCAGCMGCCGCGGTAA
[shenmy@extranet-206 raw_data]$ cat 806R.fa
>806R
GGACTACHVGGGTWTCTAAT
# Search both read files for each primer
$usearch/usearch -search_oligodb s2_H5WK5BCX2_L2_1.clean.fq -db 515F.fa -strand plus -userout R1_515F.txt -userfields query+qlo+qhi+qstrand
$usearch/usearch -search_oligodb s2_H5WK5BCX2_L2_1.clean.fq -db 806R.fa -strand plus -userout R1_806R.txt -userfields query+qlo+qhi+qstrand
$usearch/usearch -search_oligodb s2_H5WK5BCX2_L2_2.clean.fq -db 515F.fa -strand plus -userout R2_515F.txt -userfields query+qlo+qhi+qstrand
$usearch/usearch -search_oligodb s2_H5WK5BCX2_L2_2.clean.fq -db 806R.fa -strand plus -userout R2_806R.txt -userfields query+qlo+qhi+qstrand
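My batch turned out to be unusual. Normally only the forward primer is detected in the R1 file (a hit rate of about 100% for R1_515F and 0% for R1_806R). In my data, however, each primer was detected in roughly 50% of the reads in both the forward and the reverse file. This tells us that the R1 file above contains a mix of true R1 and true R2 reads. Hmm... why would that happen?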
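The hit rates mentioned above can be estimated from the -userout tables. The following is a rough sketch, assuming the table is tab-separated with the read label in the first column (as requested by -userfields query+qlo+qhi+qstrand) and that the FASTQ file uses exactly four lines per record. The function name primer_hit_rate is made up for this example.

```shell
# primer_hit_rate USEROUT FASTQ
# Prints the percentage of reads in FASTQ that have at least one hit
# listed in USEROUT (first column = read label).
primer_hit_rate() {
  # Number of reads, assuming four lines per FASTQ record
  total_reads=$(( $(wc -l < "$2") / 4 ))
  # A read may be listed more than once, so count unique labels only
  reads_with_hit=$(cut -f1 "$1" | sort -u | wc -l)
  awk -v h="$reads_with_hit" -v t="$total_reads" \
    'BEGIN { printf "%.1f\n", (t > 0 ? 100 * h / t : 0) }'
}

# Example call with the file names used above:
#   primer_hit_rate R1_515F.txt s2_H5WK5BCX2_L2_1.clean.fq
```

A correctly labeled R1 file should give a value near 100 for the forward primer and near 0 for the reverse primer.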
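Demultiplexing the samples
I used usearch -fastx_demux. I do not know whether the index is single or dual; the barcode list I have contains one tag per sample, so I could only demultiplex by that tag.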
# Paired reads with one barcode per sample. No -index file is passed, so this matches barcodes inside the reads rather than in a separate i1 read
$usearch/usearch -fastx_demux s2_H5WK5BCX2_L2_1.clean.fq -reverse s2_H5WK5BCX2_L2_2.clean.fq -fastqout fwd_demux.fq -output2 rev_demux.fq -barcodes barcode.fa
# Demuxed 6690945 / 13882527 (48.2%)
$usearch/usearch -fastx_demux R1.fq -reverse R2.fq -fastqout R1_demux.fq -output2 R2_demux.fq -barcodes ../barcode.fa
## 01:35 77Mb 100.0% Demuxed 12715252 / 13721440 (92.7%)
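When I first tried demultiplexing the files directly, only 48.2% of the reads were assigned, which means about half of the data was not demultiplexed. I then analyzed the R1 and R2 reads above: I collected all reads carrying the 806R primer sequence into the R1 file and extracted their mates into the R2 file, which raised the demultiplexing rate to 92.7%. I am still wondering whether this is the right approach, and what went wrong in the first place. (Does anyone know?)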
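A minimal sketch of this kind of re-sorting is shown here; it is not the exact set of commands used to produce the numbers above. It assumes four-line FASTQ records, the same read order in both input files, and exact primer matching with IUPAC codes expanded to character classes (no mismatches allowed, which is stricter than a primer search that tolerates differences). Pairs with no primer hit in either read are dropped. The function name reorient_pairs is hypothetical.

```shell
# reorient_pairs PRIMER IN1 IN2 OUT1 OUT2
# Writes each pair so that the read containing PRIMER goes to OUT1
# and its mate goes to OUT2.
reorient_pairs() {
  awk -v primer="$1" -v in2="$3" -v out1="$4" -v out2="$5" '
    BEGIN {
      # Expand IUPAC degenerate bases into regex character classes
      iupac["A"]="A"; iupac["C"]="C"; iupac["G"]="G"; iupac["T"]="T"
      iupac["R"]="[AG]"; iupac["Y"]="[CT]"; iupac["S"]="[CG]"; iupac["W"]="[AT]"
      iupac["K"]="[GT]"; iupac["M"]="[AC]"; iupac["B"]="[CGT]"; iupac["D"]="[AGT]"
      iupac["H"]="[ACT]"; iupac["V"]="[ACG]"; iupac["N"]="[ACGT]"
      for (i = 1; i <= length(primer); i++) re = re iupac[substr(primer, i, 1)]
    }
    {
      # Read one four-line record from each file in lockstep
      h1 = $0; getline s1; getline p1; getline q1
      getline h2 < in2; getline s2 < in2; getline p2 < in2; getline q2 < in2
      rec1 = h1 "\n" s1 "\n" p1 "\n" q1
      rec2 = h2 "\n" s2 "\n" p2 "\n" q2
      if (s1 ~ re)      { print rec1 > out1; print rec2 > out2 }  # already oriented
      else if (s2 ~ re) { print rec2 > out1; print rec1 > out2 }  # swap the mates
      # no hit in either read: the pair is dropped
    }
  ' "$2"
}

# Example call with the file names used above:
#   reorient_pairs GGACTACHVGGGTWTCTAAT s2_H5WK5BCX2_L2_1.clean.fq s2_H5WK5BCX2_L2_2.clean.fq R1.fq R2.fq
```

Read headers are copied unchanged, so any /1 or /2 suffix in the labels will no longer match the output file a read ends up in.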
This really did turn out to be a problematic batch of data. If you have questions, please contact me.
Personal WeChat ID:
Shenmengyuan1993