Machine learning and statistical models give inconsistent predictions of clinical risk for individual patients
Author: 小柯機器人 (Xiaoke Robot)
Published: 2020/11/8 22:27:12
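The team of Tjeerd Pieter van Staa at the University of Manchester, UK, studied how consistently a variety of machine learning and statistical models predict clinical risks for individual patients. The study was published in The BMJ (British Medical Journal) on 4 November 2020.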
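To assess the consistency of machine learning and statistical techniques in predicting individual level and population level risks of cardiovascular disease, and the effects of censoring on risk predictions, the team conducted a longitudinal cohort study covering 1 January 1998 to 31 December 2018.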
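The team used data on 3.6 million patients registered at 391 general practices in England, with linked hospital admission and mortality records. Model performance measures included discrimination, calibration, and the consistency of individual risk predictions for the same patients among models with comparable performance. Nineteen prediction techniques were applied: 12 families of machine learning models, three Cox proportional hazards models, three parametric survival models, and one logistic model.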
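The various models had similar population level performance. However, predictions of individual cardiovascular disease risk varied widely between and within different types of machine learning and statistical models, especially in patients with higher risks. A patient with a risk of 9.5-10.5% predicted by QRISK3 had a risk of 2.9-9.2% in a random forest and 2.4-7.2% in a neural network.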
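The differences in predicted risk between QRISK3 and a neural network ranged between –23.2% and 0.1% (95% range). Models that ignored censoring, that is, assumed censored patients to be event free, substantially underestimated the risk of cardiovascular disease. Of the 223815 patients with a cardiovascular disease risk above 7.5% with QRISK3, 57.8% would be reclassified below 7.5% when using another model.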
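To see why treating censored patients as event free lowers the estimated risk, the following minimal sketch compares a naive event proportion with a Kaplan-Meier estimate on a small made-up cohort. The data, the function names, and the 10 year horizon are illustrative assumptions, not taken from the study, and the Kaplan-Meier estimator stands in here for the survival models the authors actually used.

```python
from typing import Sequence


def naive_risk(times: Sequence[float], events: Sequence[int], horizon: float) -> float:
    """Share of patients with an observed event by `horizon`.

    Censored patients stay in the denominator as if they were event free.
    """
    observed = sum(1 for t, e in zip(times, events) if e == 1 and t <= horizon)
    return observed / len(times)


def kaplan_meier_risk(times: Sequence[float], events: Sequence[int], horizon: float) -> float:
    """1 minus the Kaplan-Meier survival estimate at `horizon`."""
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1 and t <= horizon})
    for t in event_times:
        # Patients still under follow-up just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Events observed exactly at time t.
        d = sum(1 for u, e in zip(times, events) if e == 1 and u == t)
        survival *= 1 - d / at_risk
    return 1 - survival


# Hypothetical cohort: follow-up time in years, event flag (1 = event, 0 = censored).
times = [2, 3, 4, 5, 6, 8, 10, 10, 10, 10]
events = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]

naive = naive_risk(times, events, horizon=10)
km = kaplan_meier_risk(times, events, horizon=10)
print(f"naive 10 year risk:        {naive:.3f}")
print(f"Kaplan-Meier 10 year risk: {km:.3f}")
```

In this toy cohort the naive estimate is 0.300 while the Kaplan-Meier estimate is about 0.444, because patients who left follow-up early are removed from the at-risk set instead of being counted as event free.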
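The results show that a variety of models predicted risks for the same patients very differently despite similar model performance. Logistic models and commonly used machine learning models should not be directly applied to the prediction of long term risks without considering censoring.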
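One simple way to check how consistently two models classify the same patients is to count how many patients flagged as high risk by one model fall below the threshold with the other. The sketch below does this at the 7.5% threshold mentioned above. The predicted risks are hypothetical and the function name is an assumption for illustration, not the authors' procedure.

```python
from typing import Sequence


def reclassified_below(
    risk_a: Sequence[float], risk_b: Sequence[float], threshold: float = 0.075
) -> float:
    """Fraction of patients above `threshold` with model A who are below it with model B."""
    high_a = [(a, b) for a, b in zip(risk_a, risk_b) if a > threshold]
    if not high_a:
        return 0.0
    moved = sum(1 for _, b in high_a if b < threshold)
    return moved / len(high_a)


# Hypothetical predicted 10 year risks for the same six patients from two models.
risk_a = [0.10, 0.08, 0.12, 0.05, 0.20, 0.09]
risk_b = [0.04, 0.09, 0.07, 0.06, 0.18, 0.03]

# Per-patient difference in predicted risk (model B minus model A).
differences = [b - a for a, b in zip(risk_a, risk_b)]

share = reclassified_below(risk_a, risk_b)
print(f"reclassified below 7.5%: {share:.0%}")
print(f"difference range: {min(differences):+.2f} to {max(differences):+.2f}")
```

Here five patients exceed 7.5% with model A and three of them drop below 7.5% with model B, a reclassification share of 60%, even though both sets of predictions could produce similar population level summaries.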
Appendix: original English abstract
Title: Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: longitudinal cohort study using cardiovascular disease as exemplar
Author: Yan Li, Matthew Sperrin, Darren M Ashcroft, Tjeerd Pieter van Staa
Issue&Volume: 2020/11/04
Abstract:
Objective To assess the consistency of machine learning and statistical techniques in predicting individual level and population level risks of cardiovascular disease and the effects of censoring on risk predictions.
Design Longitudinal cohort study from 1 January 1998 to 31 December 2018.
Setting and participants 3.6 million patients from the Clinical Practice Research Datalink registered at 391 general practices in England with linked hospital admission and mortality records.
Main outcome measures Model performance including discrimination, calibration, and consistency of individual risk prediction for the same patients among models with comparable model performance. 19 different prediction techniques were applied, including 12 families of machine learning models (grid searched for best models), three Cox proportional hazards models (local fitted, QRISK3, and Framingham), three parametric survival models, and one logistic model.
Results The various models had similar population level performance (C statistics of about 0.87 and similar calibration). However, the predictions for individual risks of cardiovascular disease varied widely between and within different types of machine learning and statistical models, especially in patients with higher risks. A patient with a risk of 9.5-10.5% predicted by QRISK3 had a risk of 2.9-9.2% in a random forest and 2.4-7.2% in a neural network. The differences in predicted risks between QRISK3 and a neural network ranged between –23.2% and 0.1% (95% range). Models that ignored censoring (that is, assumed censored patients to be event free) substantially underestimated risk of cardiovascular disease. Of the 223815 patients with a cardiovascular disease risk above 7.5% with QRISK3, 57.8% would be reclassified below 7.5% when using another model.
Conclusions A variety of models predicted risks for the same patients very differently despite similar model performances. The logistic models and commonly used machine learning models should not be directly applied to the prediction of long term risks without considering censoring. Survival models that consider censoring and that are explainable, such as QRISK3, are preferable. The level of consistency within and between models should be routinely assessed before they are used for clinical decision making.
DOI: 10.1136/bmj.m3919
Source: https://www.bmj.com/content/371/bmj.m3919