New algorithm enables large-scale multiple sequence alignment
Author: Xiaoke Robot
Published: 2019/12/3 12:33:22
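Recently, Cedric Notredame, Evan Floden, and colleagues at the Barcelona Institute of Science and Technology in Spain developed an algorithm for large-scale multiple sequence alignment (MSA). The paper was published online in Nature Biotechnology on December 2, 2019.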
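The researchers introduce a regressive algorithm that enables MSA of up to 1.4 million sequences on a standard workstation and substantially improves accuracy on datasets larger than 10,000 sequences. The regressive algorithm works the opposite way from the progressive algorithm: it begins by aligning the most dissimilar sequences. It uses an efficient divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity.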
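The divide-and-conquer idea can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a hypothetical `bucket_size` cap on how many sequences any single aligner call receives, and it uses contiguous chunks of an ordered list as a stand-in for guide-tree subtrees. The function names are invented for illustration. The sketch only lists the sub-problems that would be handed to a third-party aligner, starting from the top-level representatives and working down.

```python
from typing import List, Sequence


def split_into_groups(items: Sequence[str], k: int) -> List[List[str]]:
    """Cut `items` into at most k contiguous groups of near-equal size."""
    size = -(-len(items) // k)  # ceiling division
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def regressive_subproblems(seqs: Sequence[str], bucket_size: int) -> List[List[str]]:
    """List the bounded-size sub-alignments in root-to-leaf order.

    Assumption: `seqs` is already ordered so that contiguous chunks
    play the role of guide-tree subtrees (hypothetical simplification).
    """
    if bucket_size < 2:
        raise ValueError("bucket_size must be at least 2")
    if len(seqs) <= bucket_size:
        return [list(seqs)]  # small enough for a single aligner call

    groups = split_into_groups(seqs, bucket_size)
    # One representative per group is aligned first (the "root" step).
    representatives = [group[0] for group in groups]
    subproblems = [representatives]
    # Each group keeps its representative, so a child sub-alignment
    # shares one sequence with its parent and could be merged through it.
    for group in groups:
        if len(group) > 1:
            subproblems.extend(regressive_subproblems(group, bucket_size))
    return subproblems


subs = regressive_subproblems([f"seq{i}" for i in range(10)], bucket_size=3)
print(len(subs), max(len(s) for s in subs))
```

In this toy model every aligner call sees at most `bucket_size` sequences and contributes at least one sequence not seen before, so the number of calls cannot exceed the number of sequences. That is one way to see how a bounded per-call cost could yield overall linear scaling, whatever the complexity of the aligner itself.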
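The approach will enable analyses of extremely large genomic datasets, such as the recently announced Earth BioGenome Project, which comprises 1.5 million eukaryotic genomes.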
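By way of background, MSAs are used for structural and evolutionary predictions, but the complexity of aligning large datasets requires approximate solutions, including the progressive algorithm. Progressive MSA methods start by aligning the most similar sequences and then incorporate the remaining sequences, from leaf to root, following a guide tree. Their accuracy declines substantially as the number of sequences grows.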
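The leaf-to-root order of a progressive method can be made concrete with a small sketch. It assumes the guide tree is given as nested 2-tuples with sequence names at the leaves. Both this representation and the function name are hypothetical. The sketch performs no alignment and only records which groups would be merged at each step.

```python
from typing import List, Tuple, Union

# A guide tree is either a leaf name or a pair of subtrees (assumed representation).
GuideTree = Union[str, Tuple["GuideTree", "GuideTree"]]
Merge = Tuple[Tuple[str, ...], Tuple[str, ...]]


def progressive_merge_order(tree: GuideTree) -> List[Merge]:
    """Return the merges in post-order, i.e. from the leaves up to the root."""
    merges: List[Merge] = []

    def visit(node: GuideTree) -> List[str]:
        if isinstance(node, str):
            return [node]  # a leaf: a single unaligned sequence
        left, right = node
        left_names = visit(left)
        right_names = visit(right)
        # Both children are complete, so their alignments can be merged.
        merges.append((tuple(left_names), tuple(right_names)))
        return left_names + right_names

    visit(tree)
    return merges


tree = (("A", "B"), ("C", ("D", "E")))
for left, right in progressive_merge_order(tree):
    print(left, "+", right)
```

The closest pairs at the leaves are merged first and the root merge comes last, so the final step combines the two largest and most divergent groups.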
Appendix: original English text
Title: Large multiple sequence alignments with a root-to-leaf regressive method
Author: Edgar Garriga, Paolo Di Tommaso, Cedrik Magis, Ionas Erb, Leila Mansouri, Athanasios Baltzis, Hafid Laayouni, Fyodor Kondrashov, Evan Floden, Cedric Notredame
Issue&Volume: 2019-12-02
Abstract: Multiple sequence alignments (MSAs) are used for structural and evolutionary predictions, but the complexity of aligning large datasets requires the use of approximate solutions, including the progressive algorithm. Progressive MSA methods start by aligning the most similar sequences and subsequently incorporate the remaining sequences, from leaf to root, based on a guide tree. Their accuracy declines substantially as the number of sequences is scaled up. We introduce a regressive algorithm that enables MSA of up to 1.4 million sequences on a standard workstation and substantially improves accuracy on datasets larger than 10,000 sequences. Our regressive algorithm works the other way around from the progressive algorithm and begins by aligning the most dissimilar sequences. It uses an efficient divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity. Our approach will enable analyses of extremely large genomic datasets such as the recently announced Earth BioGenome Project, which comprises 1.5 million eukaryotic genomes.
DOI: 10.1038/s41587-019-0333-6
Source: https://www.nature.com/articles/s41587-019-0333-6