書接上回……
線性回歸適用條件和模型的診斷多涉及殘差分析。R的基礎包幾個常用的殘差計算分析函數如下:
residuals():模型殘差;
rstandard():標準化殘差,或者直接計算sqrt(deviance(model)/df.residual(model)),…);
rstudent():學生化殘差;
predict():預測值;
plot():繪製模型診斷圖或構建相應的殘差圖。plot函數可利用模型殘差繪製6幅圖(默認4幅,相當於which=1:4,Cook距離和Cook距離對槓桿值圖默認不顯示),實現模型正態性、方差齊性及異常點等的分析。
Six plots (selectable by which) are currently available: a plot of residuals against fitted values, a Scale-Location plot of sqrt(|residuals|) against fitted values, a Normal Q-Q plot, a plot of Cook's distances versus row labels, a plot of residuals against leverages, and a plot of Cook's distances against leverage/(1-leverage). By default, the first three and 5 are provided. 1:Residuals vs Fitted; 2: Normal Q-Q; 3: Scale-Location; 4: Cook's distance; 5: Residuals vs Leverage; 6:Cook's dist vs Leverage hii/(1-hii).
par(mfrow=c(2,3)) #par {graphics}將plot()函數生成的四幅圖繪製在一個大圖中,大圖按2×3排列四幅小圖plot(lmfit, which=1:6, col ="blue")
結果如下,基本滿足線性、正態性和方差齊性,但可能存在異常點。
殘差對擬合值散點圖:可用於判斷殘差中是否還存在非線性趨勢成分,紅色的是lowess(局部加權回歸散點修勻法)曲線;There could be a non-linear relationship between predictor variables and an outcome variable and the pattern could show up in this plot if the model doesn’t capture the non-linear relationship. If you find equally spread residuals around a horizontal line without distinct patterns, that is a good indication you don’t have non-linear relationships。該散點圖也常用來判斷判斷方差齊性。
殘差的正態QQ圖:用於判斷殘差的正態性,It’s good if residuals are lined well on the straight dashed line.
Scale-Location圖:殘差絕對值平方根對擬合值散點圖,常用於判斷方差齊性。It’s also called Spread-Location plot. This plot shows if residuals are spread equally along the ranges of predictors. This is how you can check the assumption of equal variance (homoscedasticity). It’s good if you see a horizontal line with equally (randomly) spread points.
Cook距離圖:各觀察值的Cook距離,判斷異常值Cook's distance is a measure of how much the residuals of all records would change if a particular record were excluded from the calculation of the model coefficients. A large Cook's distance indicates that excluding a record from changes the coefficients substantially, and should therefore be considered influential.[SPSS]
標準化殘差對槓桿值散點圖:尋找強影響點。This plot helps us to find influential cases.We watch out for outlying values at the upper right corner or at the lower right corner. Those spots are the places where cases can be influential against a regression line. Look for cases outside of a dashed line, Cook’s distance. When cases are outside of the Cook’s distance (meaning they have high Cook’s distance scores), the cases are influential to the regression results. The regression results will be altered if we exclude those cases.
In the Cook's distance vs leverage/(1-leverage) plot, contours of standardized residuals that are equal in magnitude are lines through the origin. The contour lines are labelled with the magnitudes.通過原點的線是標準化殘差等高線。
參考:
A Modern Approach to Regression with R.
Understanding Diagnostic Plots for Linear Regression Analysis:https://data.library.virginia.edu/diagnostic-plots/
How does plot.lm() determine outliers for residual vs fitted plot?:https://stackoverflow.com/questions/39259252/how-does-plot-lm-determine-outliers-for-residual-vs-fitted-plot
除了基礎包,R中有大量的包(比如car包、olsrr包、plotmo包、HH包、gvlma包)都可以實現對線性回歸適用條件的考察。
獨立性的判斷還是以事實經驗為主,沒有理由認為一個嬰兒的出生體重會對其他嬰兒的出生體重有影響,可以認為因變量值(或者殘差)相互獨立。Durbin-Watson檢驗可用於檢測數據的是否具有隨時間變化的自相關(相隔時間越近相關性越大),但該檢驗統計不顯著僅代表沒有自相關關係,並不代表數據就不相關,本例非聚集性數據並不適用。R中的Durbin-Watson檢驗函數有很多:DurbinWatsonTest {DescTools}、dwtest {lmtest}、test.DW {dcv}、durbinWatsonTest {car}、score_dw {auditor}、reg.dw {YRmisc}、reg.dw {PMmisc}……,使用很簡單需要時可參考,比如
library(DescTools)
DurbinWatsonTest(lmfit)
自變量與結局變量間的散點圖、殘差與自變量的散點圖、殘差對擬合值散點圖、偏殘差圖(成分殘差圖)是常用的考察辦法。當解釋變量和結局變量不滿足線性關係時,解釋變量是分類變量應考慮將其設為啞變量,解釋變量是連續變量進行Box-Tidwell變換。
各個自變量與結局變量間的散點圖矩陣可以直觀的大體了解各個自變量與響應變量間的關係。在數據初步考察裡面我們用到了各個變量間的散點圖來初步觀察這種關係。相比這種散點圖,殘差與自變量的散點圖效率更高一些,該圖在<<線性回歸中的線性考察>>一文中做過介紹。以繪製殘差與自變量age的散點圖為例:
lmres<-residuals(lmfit)
plot(lmres~lmdata$age,col="blue")
abline(h=0,lwd=2,col="red") #添水平直線y=0,線寬為2,顏色為綠色。添加水平直線用v,添加垂直直線h。添加殘差與變量的的擬合直線:abline(lm(lmres~lmdata$age),lwd=1, col="blue")
年齡對應的殘差點基本上是平均分布在y=0的直線兩側,無明顯偏負或偏正,年齡與新生兒體重基本呈線性。
我們也可以直接利用現成的函數來繪製,比如residual.plots {HH},但函數residual.plots()結果顯示不友好,可以改用函數residual.plots.lattice {HH}。Residual plots for a linear model. Four sets of plots are produced: (1) response against each of the predictor variables, (2) residuals against each of the predictor variables, (3) partial residuals for each predictor against that predictor ("partial residuals plots", and (4) partial residuals against the residuals of each predictor regressed on the other predictors ("added variable plots").
residual.plots.lattice {HH}
Construct four sets of regression plots. Response variable $Y$ against each $X_j$, residuals $e$ against each $X_j$, partial residuals plots of $e^j$ against each $X_j$, added variable plots of $e^j$ against the residuals of each $X_j$ adjusted for the other $X$ columns. The slopes shown in the panels of both bottom rows are equal to the regression coefficients.
library(HH);library(lattice);library(grid);library(latticeExtra);library(multcomp);library(survival);library(TH.data);library(MASS);library(gridExtra)
residual.plots.lattice(lmfit)
結果中第一行是各個自變量與因變量的關係;第二行是各個自變量與殘差的關係,從各個自變量與殘差的關係看,基本滿足線性關係;第三行是各自變量的偏殘差圖;第四行是添加變量圖(偏回歸圖)。
除了普通殘差圖,偏殘差圖(Partial Residual Plots)是更為專業的線性判斷方法,它可以排除混雜因素的影響,能夠更準確地判定自變量和因變量是否為線性。偏殘差圖也稱成分+殘差圖(Component plus Residual Plots),偏殘差圖就是偏殘差(用除自變量Xi外其他所有自變量校正過的因變量)與預測變量Xi構建的散點圖,詳細介紹可參見<<偏回歸圖與偏殘差圖>>。上述residual.plots.lattice {HH}函數給出的結果中,第三行結果就是各個自變量的偏殘差圖,結果顯示校正了其他變量的影響後,年齡和隨訪次數對新生兒體重的影響不大,孕前體重越重新生兒體重越重,母親吸菸、有早產經歷、患高血壓、有應激性新生兒體重越輕,相比白人,黑人和其他種族的孕婦的新生兒體重更輕一些。當然這些只是圖片上顯示出來的信息,並不能判定有沒有統計學意義。變量較多時residual.plots.lattice {HH}的結果細節顯示不足,可以使用ols_plot_comp_plus_resid{olsrr}和crPlots{car}來繪製出成分殘差圖。crPlots{car}結果的成分殘差圖不僅給出了偏殘差圖的散點(Xi,ε+βXi)及其平滑擬合曲線(粉紅色,局部加權回歸散點平滑法(locally weighted scatterplot smoothing,LOWESS或LOESS)),同時給出了Xi與βXi的擬合直線,如果直線與擬合曲線差別較大則表明自變量與因變量之間可能存在非線性關係,則需要添加一些曲線成分如二次項、三次項等;如果擬合直線與曲線接近重合則表明自變量Xi與因變量存在很好的線性關係,但非水平直線才表明對影響變量有影響(水平線表示該自變量係數為0)。library(car);library(carData)crPlots(lmfit)ols_plot_comp_plus_resid(lmfit)
我們僅展示age、lwt及race的結果。由於種族是3個水平被設置為因子後結果以2個虛擬變量來展示。各變量線性關係良好,但age對bwt影響不大,孕前體重與新生兒體重呈正相關,與白人相比,黑人和其他種族孕母的新生兒體重更輕一些。
olsrr包功能強大,Tools for Building OLS Regression Models.Tools designed to make it easier for users, particularly beginner/intermediate R users to build ordinary least squares regression models. Includes comprehensive regression output, heteroskedasticity tests, collinearity diagnostics, residual diagnostics, measures of influence, model fit assessment and variable selection procedures.
https://cran.r-project.org/web/packages/olsrr/
(3)正態性
線性回歸模型的正態性指的並不是指因變量呈正態分布,而是要求模型的殘差服從正態分布,詳細可參見<<線性回歸中的正態分布>>。你可以可對模型殘差進行統計學檢驗,也可以採用相關的專業圖進行判定。如果數據不滿足正態分布,可以考慮進行數據變換,比如Box-Cox變換,或者換用其他的統計學方法。
第一類方法是先生成模型的殘差,然後對殘差進行正態分布的檢驗。比較常見的方法是Shapiro-Wilk W檢驗和Kolmogorov-Smirnov檢驗,這些我們在<<R筆記:正態分布的檢驗>>一文中都有介紹,以Shapiro-Wilk W檢驗為例,shapiro.test(lmres)
結果如下,表明殘差滿足正態分布(P=0.612>0.05)。
Shapiro-Wilk normality test
data: lmres
W = 0.99378, p-value = 0.612
除了直接對殘差進行檢驗,我們還可以進行圖示法,比如在前面plot函數結果中有模型殘差的Q-Q圖,
qqnorm(lmres)
結果就是Q-Q圖我們就不單獨展示了。我們還可以使用專用函數,比如qqPlot{car}:
library(car);library(carData)qqPlot(lmfit) #Quantile-Comparison Plot.these functions return the labels of identified points, unless a grouping factor is employed, in which case NULL is returned invisibly.這是對函數返回值的解釋,我其實沒怎麼看懂,但結合前面的圖示返回值應該是兩個異常點。
[1] 130 132
所有的點都在Q-Q線95%CI內(如在95%CI的之外的點可初步判定為離群點),表明滿足殘差滿足正態分布的要求。但130和132可能是異常點。(4)方差齊性
線性回歸中的等方差指因變量殘差不隨所有自變量取值水平的變化而變化,在自變量與殘差或預測值與殘差的散點圖上,標準化殘差隨機、均勻的散布在0橫線上下兩側,即不論自變量或因變量預測值如何變化,標準化殘差的波動基本保持穩定,可認為方差基本相等。同正態性檢驗一樣,方差齊性檢驗除了圖示法,也可以進行統計檢驗。這些我們在<<線性回歸中的方差齊性探察>>一文中都有介紹。如果殘差不滿足方差齊同的要求,因變量變換、加權最小二乘法、穩健回歸是常用的處理方法。
圖示法我們在前面用plot()繪製模型診斷圖時已有過說明,此處演示plotres{plotmo}繪製Sqrt abs residuals vs fitted圖,結果同前,不再贅述。該函數同樣可以輸出殘差的其他診斷圖,比如Q-Q圖,殘差對預測值等。library(plotmo);library(Formula);library(plotrix);library(TeachingDemos)plotres(lmfit,which=c(1,3,4,6),col="blue") #Which plots do draw. Default is 1:4. 1:Model plot. What gets plotted here depends on the model class. For example, for earth models this is a model selection plot. Nothing will be displayed for some models. For details, please see the plotres vignette; 2:Cumulative distribution of abs residuals; 3:Residuals vs fitted; 4:QQ plot; 5:Abs residuals vs fitted; 6:Sqrt abs residuals vs fitted; 7:Abs residuals vs log fitted; 8:Cube root of the squared residuals vs log fitted; 9:Log abs residuals vs log fitted.
R中很多函數都可以實現線性回歸的方差齊性檢驗:bptest{lmtest}、ncvTest{car}、spreadLevelPlot{car}。統計檢驗法我們以bptest()和ncvTest()為例,兩種方法的檢測結果均顯示殘差滿足方差等同(P均>0.05)。
library(lmtest);library(zoo)
bptest(lmfit)
studentized Breusch-Pagan test
data: lmfit
BP = 12.718, df = 9, p-value = 0.1758
library(car);library(carData)ncvTest(lmfit)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.0005196456, Df = 1, p = 0.98181
olsrr 包也提供了4種異方差檢驗Bartlett Test、Breusch Pagan Test、Score Test、F Test,感興趣可以自己去練習一下。最後介紹一個集線性(Linearity)、同方差性(Homoscedasticity)、不相關(即獨立,Uncorrelatedness)、正態性(Normality)的檢驗為一體的綜合假設檢驗(Global Validation of Linear Models Assumptions):gvlma{gvlma},該函數還單獨給出了偏度、峰度和異方差檢驗結果。gvlma(lmfit)
結果顯示前提條件並沒有完全滿足,比如異方差檢驗未通過。這與前面單獨檢測結果有差異。
下次R筆記預告:模型評估與診斷,如模型擬合優度、多重共線、強影響點、離群值、槓桿值等R分析方法。