Computational biology 計算生物學
Artificial intelligence is solving one of biology’s biggest challenges.
人工智慧正在攻克生物學最大的挑戰之一
To understand life, you must understand proteins. These molecular chains, each assembled from a menu of 20 types of chemical links called amino acids, do biology’s heavy lifting. In the guise of enzymes they catalyse the chemistry that keeps bodies running. Actin and myosin, the proteins of muscles, permit those bodies to move around. Keratin provides their skin and hair. Haemoglobin carries their oxygen. Insulin regulates their metabolism. And a protein called spike allows coronaviruses to invade human cells, thereby shutting down entire economies.
要了解生命,必須了解蛋白質。這些分子鏈每條都由20種稱為胺基酸的化學單元中的若干種連接而成,擔負著大量的生理功能。它們以酶的形式催化著使身體保持運轉的化學反應。肌肉中的肌動蛋白和肌球蛋白讓身體可以自由活動。角蛋白是皮膚和頭髮的主要成分。血紅蛋白攜帶氧氣。胰島素調節身體的新陳代謝。而一種叫做刺突的蛋白質讓冠狀病毒得以入侵人體細胞,進而令整個經濟體停擺。
amino [ə'miːnəʊ; ə'maɪnəʊ] n. (化學)氨基
adj. 氨基的
guise [ɡaɪz] n. 偽裝;裝束;外觀
vt. 使化裝
vi. 偽裝
enzyme [ˈenzaɪm] n. [生化] 酶
metabolism [məˈtæbəlɪzəm] n. [生理] 新陳代謝
spike [spaɪk] n. 長釘,道釘;釘鞋;細高跟;(病毒的)刺突
vt. 阻止;以大釘釘牢;用尖物刺穿
invade [ɪnˈveɪd] vt. 侵略;侵襲;侵擾;湧入
vi. 侵略;侵入;侵襲;侵犯
do heavy lifting 挑起重擔,擔負
Listing a protein’s amino acids is easy. Machines to do so have existed for decades. But this is only half the battle in the quest to understand how proteins work. What a protein does, and how it does it, depends also on how it folds up after its creation, into its final, intricate shape.
列出組成一種蛋白質的胺基酸很容易。具備這種功能的機器已經發明幾十年了。但要了解蛋白質的作用機制,列出胺基酸只是完成了工作的一半。一種蛋白質的作用及其作用機制還取決於蛋白質在生成後如何摺疊成最終的複雜形狀。
At the moment, molecular biologists can probe proteins’ shapes experimentally, using techniques like X-ray crystallography. But this is fiddly and time-consuming. Now, things may be about to get much easier. On November 30th researchers from DeepMind, an artificial-intelligence (AI) laboratory owned by Alphabet, Google’s parent company, presented results suggesting that they have made enormous progress on one of biology’s grandest challenges: how to use a computer to predict a protein’s shape from just a list of its amino-acid components.
目前,分子生物學家可以用X射線晶體學等技術通過實驗來探究蛋白質的形狀。但這既費力又耗時。現在,事情可能要容易多了。11月30日,谷歌母公司Alphabet旗下的AI實驗室DeepMind的研究人員發布的研究顯示,他們在生物學中最嚴峻的一項挑戰上取得了巨大進展,那就是僅憑一種蛋白質的胺基酸成分,用計算機預測它的形狀。
To non-biologists, this may sound somewhere between arcane and prosaic. In fact, it is a big achievement. Replacing months of experiments with a few hours of computing time could shed new light on the inner workings of cells. It could speed up drug development. And it could in particular suggest treatments for diseases like Alzheimer’s, in which misshapen proteins are thought to play a role.
對除生物學家以外的人來說,這聽起來可能有些晦澀和單調。實際上這是一項重大成就。用幾小時的計算替代幾個月的實驗可能會進一步揭示細胞的內部運作機制。它可以加速藥物研發,尤其是可以為阿爾茲海默症這類與畸形蛋白質有關的疾病提出療法。
But there is yet more to it than that. Until now, the machine-learning techniques which DeepMind’s team used to attack the protein-folding problem have been best known for powering things like face-recognition cameras and voice assistants, and for defeating human beings at tricky games like Go. But Demis Hassabis, DeepMind’s boss, who founded in 2010 what was then an independent firm, did so hoping that they could also be employed to accelerate the progress of science. This result demonstrates how that might work in practice.
不止於此。DeepMind團隊用機器學習技術來攻克蛋白質摺疊問題,而這種技術到目前為止最廣為人知的應用包括面部識別攝像頭和語音助手,以及在圍棋等複雜比賽中擊敗人類。但DeepMind的老闆戴米斯·哈薩比斯(Demis Hassabis)在2010年創建公司時(當時還是一家獨立公司)希望機器學習還可以用來加速科學的發展。現在這項成果顯示了這可能如何付諸實踐。
The idea of using computers to predict proteins’ shapes is half a century old. Progress has been real, but slow, says Ewan Birney, deputy director of the European Molecular Biology Laboratory, a multinational endeavour with headquarters in Germany. And it has been marked by a history of wrong turns and premature declarations of victory.
用計算機預測蛋白質形狀的想法在半個世紀前就有了。總部位於德國的政府間組織歐洲分子生物學實驗室(European Molecular Biology Laboratory)的副主管伊萬·伯尼(Ewan Birney)說,之前有實質性的進展,但很慢。過程中還走了不少彎路,也曾過早地宣布勝利。
These days a humbler field, protein-shape prediction now measures its progress by how well algorithms perform in something called Critical Assessment of Protein Structure Prediction (CASP). This is a biennial experiment-cum-competition which started in 1994 and is jokingly dubbed the “Olympics of protein-folding”. In it, algorithms are subjected to blind tests of their ability to predict the shapes of several proteins of known structure.
如今這個領域已變得更謙遜。其進展由「蛋白質結構預測關鍵評估」(CASP)中算法的表現來衡量。CASP是一項始於1994年的實驗兼比賽,兩年舉辦一次,被戲稱為「蛋白質摺疊的奧林匹克競賽」。它通過盲測檢驗算法預測幾種結構已知的蛋白質形狀的能力。
humbler ['hʌmblə] adj. 較低級的;更加謙卑的(humble的比較級)
n. 謙虛的人
biennial [baɪˈeniəl] adj. 兩年一次的
n. [植] 二年生植物
cum [kʌm] prep. 和,與;附有,連同;兼作
dub [dʌb] v. 給……起別名(或綽號);封……為爵士;裝上(假蠅)做釣餌;將(軟毛、羊毛等材料)摻入魚餌;在(皮革上)塗油;譯製;配音;複製;合成;(高爾夫)打糟(球的一擊);付清,捐款
n. 配音,配樂;混錄音樂;新手;強節奏音樂,強節拍詩歌;鼓聲;笨蛋
DeepMind’s first entry to CASP, two years ago, was dubbed AlphaFold. It made waves by performing much better than any other then-existing program. The current version, AlphaFold 2, has stretched that lead still further (see chart). One measure of success within CASP is the global-distance test. This assigns algorithms a score between zero and 100 by comparing the predicted locations of atoms in a molecule’s structure with their location in reality. AlphaFold 2 had an average score of 92.4—an accuracy that CASP’s founder, John Moult, who is a biologist at the University of Maryland, says is roughly comparable with what can be obtained by techniques like X-ray crystallography.
DeepMind兩年前首次參加CASP的程序名為「阿爾法否」(AlphaFold)。它的表現遠勝於當時任何其他程序,引起轟動。現在這版阿爾法否2進一步擴大了領先優勢(見圖表)。CASP衡量成功的一個標準是全局距離測試。通過比較算法預測出分子結構中原子的位置及其實際位置,按百分制給算法打分。阿爾法否2的平均得分為92.4。CASP的創始人、馬裡蘭大學的生物學家約翰·穆爾特(John Moult)說,它的準確性與X射線晶體學等技術大致相當。
make waves 引起轟動;興風作浪
then-existing 當時已有的,當時存在的
global-distance 全局距離
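The scoring idea behind the global-distance test can be sketched in a few lines. This is a simplified illustration, not CASP’s actual scoring code: it assumes the predicted and experimental structures have already been superimposed, uses one representative point per residue, and averages the share of points that fall within the conventional 1, 2, 4 and 8 ångström cutoffs. The function name `gdt_ts` and the sample coordinates are made up for the example.

```python
import math

def gdt_ts(predicted, actual, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-style score between 0 and 100.

    Assumes both structures are already superimposed and list the same
    residues in the same order (the real test also searches for the best
    superposition, which this sketch skips).
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Distance between each predicted point and its experimental counterpart
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]
    # Fraction of points within each distance cutoff
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    # Average the fractions and express the result on a 0-100 scale
    return 100.0 * sum(fractions) / len(cutoffs)

# Hypothetical four-residue example (coordinates in angstroms)
actual = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 10.0, 0.0)]
print(gdt_ts(predicted, actual))  # 56.25
```

A perfect prediction scores 100. In this toy case three of the four points are close and one is badly misplaced, which pulls the score down to 56.25.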
Until now, DeepMind was probably best known for its success in teaching computers to play games—particularly Go, a pastime of deceptively simple rules but fiendish strategy that had been a totem of AI researchers since the field began. In 2016 a DeepMind program called AlphaGo defeated Lee Sedol, one of the world’s best players. Superficially, this may seem of little consequence. But Dr Hassabis says that more similarities exist between protein-folding and Go than might, at first, appear.
以前,DeepMind最出名的可能是教計算機玩遊戲,特別是下圍棋。圍棋看似規則簡單,但策略極其複雜,自AI研究興起以來一直是研究人員的一種圖騰。2016年,DeepMind名為阿爾法狗(AlphaGo)的程序擊敗了世界頂級棋手之一李世石。從表面上看這似乎無關緊要。但哈薩比斯說,蛋白質摺疊和圍棋之間的相似之處可比乍看上去要多。
pastime [ˈpɑːstaɪm] n. 娛樂,消遣
deceptively [dɪˈseptɪvli] adv. 看似;不像看上去那麼;比實際更顯得
fiendish [ˈfiːndɪʃ] adj. 惡魔似的,殘忍的;極壞的
totem [ˈtəʊtəm] n. 圖騰;崇拜物
superficially [ˌsuːpəˈfɪʃəli; ˌsjuːpəˈfɪʃəli] adv. 表面地;淺薄地
One is the impracticality of attacking either problem with computational brute force. There are thought to be around 10^170 legal arrangements of stones on a Go board. That is much greater than the number of atoms in the observable universe, and it is therefore far beyond the reach of any computer unless computational shortcuts can be devised.
一個相似之處是這兩個問題都不能靠蠻力運算解決。據估計,圍棋的棋局大約有10^170種。這個數字遠大於可觀察宇宙中的原子數,遠遠超出了任何計算機的運算能力,除非設計出計算捷徑。
Proteins are even more complicated than Go. One estimate is that a reasonably complex protein could, in principle, take any of as many as 10^300 different shapes. The shape which it does eventually settle into is a result of a balance of various atom-scale forces that act within its amino-acid building blocks, between those building blocks, and between the building blocks and other, surrounding, molecules, particularly those of water. These are all matters of considerable complexity which are difficult to measure. It is therefore clear that, as with playing Go, the only way to perform the trick of predicting protein-folding is to look for shortcuts.
蛋白質比圍棋還要複雜。一種估計是,一個相當複雜的蛋白質原則上可以有多達10^300種不同形狀。它最終的形狀是在其胺基酸結構單元的內部、之間,以及胺基酸與周圍其他分子(尤其是水分子)之間各種原子級作用力平衡的結果。這些過程都相當複雜而難以測量。因此很顯然,和下圍棋一樣,預測蛋白質摺疊的唯一方法就是尋找捷徑。
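A rough back-of-the-envelope calculation shows why brute force is hopeless for both problems. The sketch below works in powers of ten. The 10^170 and 10^300 figures come from the text above; the 10^80 atom count is a commonly quoted order-of-magnitude estimate, and the machine checking 10^18 states per second is purely hypothetical.

```python
import math

BOARD_POINTS = 19 * 19   # intersections on a standard Go board
STATES_PER_POINT = 3     # each point is empty, black or white

# Crude upper bound on board arrangements: 3^361, ignoring legality rules
log10_upper_bound = BOARD_POINTS * math.log10(STATES_PER_POINT)

LOG10_LEGAL_GO = 170        # figure quoted in the article
LOG10_PROTEIN_SHAPES = 300  # figure quoted in the article
LOG10_ATOMS = 80            # assumption: commonly quoted rough estimate

def log10_years_to_enumerate(log10_states, log10_rate_per_second=18):
    """Power of ten of the years needed to visit every state once.

    The default rate of 10^18 states per second is a hypothetical machine.
    """
    seconds_per_year = 365.25 * 24 * 3600
    return log10_states - log10_rate_per_second - math.log10(seconds_per_year)

print(f"3^361 is about 10^{log10_upper_bound:.1f}")
print(f"Go positions per atom: 10^{LOG10_LEGAL_GO - LOG10_ATOMS}")
print(f"Years to enumerate Go: 10^{log10_years_to_enumerate(LOG10_LEGAL_GO):.1f}")
print(f"Years to enumerate protein shapes: "
      f"10^{log10_years_to_enumerate(LOG10_PROTEIN_SHAPES):.1f}")
```

Even under these generous assumptions the enumeration times run to well over 10^100 years, which is why both fields depend on shortcuts rather than exhaustive search.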
The progress that computers have made on the problem over the years demonstrates that these shortcuts do exist. And it also turns out that even inexpert humans can learn such tricks by playing around. Dr Hassabis recalls being struck by the ability of human amateurs to achieve good results with FoldIt, a science-oriented video game launched in 2008 that invites its players to try folding proteins themselves, and which has generated a clutch of papers and discoveries.
多年來計算機在此課題上取得的進步表明捷徑確實存在。而且事實證明,即使非專業人士也可以通過遊戲來學習摺疊蛋白質。哈薩比斯回憶說,業餘愛好者玩科學電子遊戲「疊它」(FoldIt)取得的佳績曾讓他倍感震驚,這款2008年推出的遊戲邀請玩家嘗試自己摺疊蛋白質,催生了許多論文和發現。
Getting players of FoldIt to explain exactly what they have been up to, though, is tricky. This is another parallel with Go. Rather than describing step by step what they are thinking, players of both games tend to talk in vaguer terms of “intuition” and “what feels right”. This is where the machine learning comes in. Fed enough examples, computers are able to learn and apply shortcuts and rules-of-thumb of the sort that human beings also exploit, but struggle to articulate. Sometimes, the machines come up with insights that surprise human experts. As Dr Moult observes, “In general, the detail of the backbone [the molecular scaffolding that joins amino acids together] is extraordinary. [AlphaFold 2] has decided that if you don’t get the details right, you won’t get the big things right. This is a school of thought that’s been around for some time, but I thought it wasn’t correct.”
但是,讓「疊它」的玩家確切解釋他們在做什麼很難。這是與圍棋的另一個相似之處。兩種遊戲的玩家都不會一步一步地描述他們的想法,而是用「直覺」和「覺得這樣是對的」等模糊的說法表述。這就有了機器學習的用武之地。給計算機提供足夠的示例,它們就可以學習和應用人類也會利用卻說不清道不明的捷徑和經驗法則。有時,機器會得出讓人類專家感到驚訝的見解。正如穆爾特所觀察到的,「總的來說,主鏈(將胺基酸連接在一起的分子基架)的細節非常特別。(阿爾法否2)的結論是如果這些細節沒弄對,那麼大方向也會錯。這種看法已經存在了一段時間,但以前我認為這是不對的。」
parallel [ˈpærəlel] n. 平行線;對比
vt. 使…與…平行
adj. 平行的;類似的,相同的
vague [veɪɡ] adj. 模糊的;含糊的;不明確的;曖昧的
intuition [ˌɪntjuˈɪʃn] n. 直覺;直覺力;直覺的知識
articulate [ɑːˈtɪkjuleɪt] vt. 清晰地發(音);明確有力地表達;用關節連接;使相互連貫
vi. 發音;清楚地講話;用關節連接起來
adj. 發音清晰的;口才好的;有關節的
n. 【動物學】有節體的動物
scaffold [ˈskæfəʊld] n. 腳手架;鷹架;絞刑臺
vt. 給…搭腳手架;用支架支撐
As an achievement in AI, AlphaFold 2 is not quite so far ahead of the field as was AlphaGo. Plenty of other research groups have applied machine learning to the protein-structure problem, and have seen encouraging progress. Exactly what DeepMind has done to seize the lead remains unclear, though the firm has promised a technical paper that will delve into the details. For now, John Jumper, the project’s leader, points out that machine learning is a box which contains a variety of tools, and says the team has abandoned the system it used to build the original AlphaFold in 2018, after it became clear that it had reached the limits of its ability.
作為AI的一項成就,阿爾法否2的領先優勢不像當年的阿爾法狗那麼大。其他許多研究小組也已經將機器學習應用於蛋白質結構問題,並且看到了令人鼓舞的進展。DeepMind具體做了些什麼而取得領先地位仍不得而知,不過它承諾會發表一篇技術論文來深入探討細節。目前,該項目的負責人約翰·姜普(John Jumper)指出機器學習是一個裝有各種工具的盒子,他還說,在2018年用於打造最初版阿爾法否的系統顯示能力達到極限後,他們就放棄了它。
The current version, says Dr Jumper, has more room to grow. He thinks space exists to boost the software’s accuracy still further. There are also, for now, things that remain beyond its reach, such as how structures built from several proteins are joined together.
現在的版本有更大的發展空間,姜普說。他認為有空間來進一步提高軟體的準確性。到目前為止,也有一些事情是這個版本無法確知的,例如由幾種蛋白質構建的結構是如何結合在一起的。
Moreover, as Ken Dill, a biologist at Stony Brook University in New York state, who is the author of a recent overview of the field, points out, what AlphaFold 2, its rivals and, indeed, techniques like X-ray crystallography discover are static structures. Action in biology comes, by contrast, from how molecules interact with each other. “It is”, he puts it, “a bit like someone asking how a car works, so you open the hood [bonnet] and take a picture and say, ‘There, that’s how it works!’” Useful, in other words, but not quite the entire story.
此外,正如紐約州立大學石溪分校的生物學家肯·迪爾(Ken Dill)在最近發表的一篇有關該領域的綜述中所指出的,阿爾法否2、它的競爭對手,甚至還有X射線晶體學等技術所發現的都是靜態結構。相反,生理活動源於分子之間的相互作用方式。他說:「這有點像有人問汽車的工作原理,你打開引擎蓋拍了張照片說,『喏,就是這樣子的!』」換句話說,這也並非無用,但沒完全說清楚。
Nonetheless—and depending on how DeepMind decides to license the technology—an ability to generate protein structures routinely in this way could have a big impact on the field. Around 180m amino-acid sequences are known to science. But only some 170,000 of them have had their structures determined. Dr Moult thinks that boosting this number could help screen drug candidates to see which are likely to bind well to a particular protein. It could be used to reanalyse existing drugs to see what else they might do. And it could boost synthetic biology, by speeding up the creation of human-designed proteins intended to catalyse chemical reactions.
但是,能常規化地以這種方式得出蛋白質的結構可能會對這個領域產生重大影響,當然這也要看DeepMind決定以何種方式授權使用這項技術。科學上已知的胺基酸序列有約1.8億個,但其中只有約17萬個的結構得以確定。穆爾特認為,確定更多的序列結構可以幫助篩選候選藥物,看哪些藥物可能與特定的蛋白質很好地結合。也可以對現有藥物重新分析,看它們還有什麼其他功效。還可以通過加快創造人工設計的用於催化化學反應的蛋白質,促進合成生物學的發展。
screen [skriːn] n. 屏,幕;屏風
vt. 篩;拍攝;放映;掩蔽
vi. 拍電影
synthetic [sɪnˈθetɪk] adj. 綜合的;合成的,人造的
n. 合成物
catalyse [ˈkætəlaɪz] vt. 催化;促成(等於catalyze)
Some promising successes have, indeed, already happened. For example, AlphaFold 2 was able to predict the structures of several of the proteins used by the new coronavirus, including spike. As for Dr Birney, he says, 「We’re definitely going to want to spend some time kicking the tyres. But when I first saw these results, I nearly fell off my chair.」
一些具有應用潛力的發現實際上已經出現了。例如,阿爾法否2能夠預測刺突等幾種被新冠病毒利用的蛋白質的結構。伯尼說:「我們肯定會要花一些時間做檢驗。但當我第一次看到這些結果時,我激動得差點從椅子上摔下來。」