What will this chapter tell me?
To learn how to assess your data, check whether they meet the assumptions of your statistical models, and correct problems when they don't.
When assumptions are broken, we stop being able to draw accurate conclusions about reality.
Different statistical models assume different things, and if these models are going to reflect reality accurately then these assumptions need to be true.
Assumptions of parametric data
(1) Normally distributed data: the rationale behind hypothesis testing relies on having something that is normally distributed, so if this assumption is not met then the logic behind hypothesis testing is flawed.
(2) Homogeneity of variance: the variances should be the same throughout the data (roughly equal in different groups, or at different levels of another variable).
(3) Interval data: Data should be measured at least at the interval level.
(4) Independence: This assumption, like that of normality, is different depending on the test you’re using.
Packages used in this chapter
If you do not already have the packages used in this chapter, you can install them by executing the following commands:
install.packages("car"); install.packages("ggplot2");
install.packages("pastecs"); install.packages("psych"); install.packages("Rcmdr")
You then need to load these packages by executing the commands:
library(car); library(ggplot2); library(pastecs); library(psych); library(Rcmdr)
The assumption of normality
5.5.1 Oh no, it’s that pesky frequency distribution again: checking normality visually
Using the ggplot2 package to draw the histograms:
First, load the data:
dlf<-read.delim("DownloadFestival.dat",header=TRUE)
Second, create the histogram and display it:
hist.day1 <- ggplot(dlf, aes(day1)) + opts(legend.position = "none") + geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + labs(x = "Hygiene score on day 1", y = "Density")
hist.day1
(Note: in current versions of ggplot2, opts() has been removed; use theme(legend.position = "none") instead.)
Then, add a normal curve to the chart:
hist.day1 + stat_function(fun = dnorm, args = list(mean = mean(dlf$day1, na.rm = TRUE), sd = sd(dlf$day1, na.rm = TRUE)), colour = "black", size = 1)
Alternatively, use the ggplot2 package to draw a Q-Q plot, using the qplot() function in conjunction with the qq statistic:
qqplot.day1 <- qplot(sample = dlf$day1, stat="qq")
qqplot.day1
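Note that later versions of ggplot2 removed the stat argument from qplot(); a minimal sketch of the equivalent Q-Q plot in current ggplot2 syntax, assuming the same dlf dataframe:
qqplot.day1 <- ggplot(dlf, aes(sample = day1)) + stat_qq() + stat_qq_line()
qqplot.day1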
5.5.2 Quantifying normality with numbers
We want to quantify the shape of the distributions and look for outliers.
Use the describe() function of the psych package to explore the distribution of the variables:
describe(dlf$day1)
The second way is to use the stat.desc() function of the pastecs package, which takes the general form:
stat.desc(variable name, basic = TRUE, norm = FALSE)
By default basic = TRUE and norm = FALSE; to get statistics relating to normality, override the default and specify norm = TRUE:
stat.desc(dlf$day1, basic = FALSE, norm = TRUE)
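stat.desc() reports many decimal places; as an optional convenience (not part of the original command), the output can be rounded for readability:
round(stat.desc(dlf$day1, basic = FALSE, norm = TRUE), digits = 3)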
describe() and stat.desc() can be used with more than one variable at the same time; use the cbind() function to combine two or more variables:
describe(cbind(dlf$day1, dlf$day2, dlf$day3))
stat.desc(cbind(dlf$day1, dlf$day2, dlf$day3), basic = FALSE, norm = TRUE)
Alternatively, select the variable names directly from the dataframe:
describe(dlf[,c("day1", "day2", "day3")])
stat.desc(dlf[, c("day1", "day2", "day3")], basic = FALSE, norm = TRUE)
5.5.3 Exploring groups of data
Ways to produce basic descriptive statistics for groups of data:
the by() function and the subset() function
These functions allow you to specify a grouping variable which splits the data, or to select a subset of cases.
5.5.3.1. Running the analysis for all data
Data:
Step 1: open the file RExam.dat:
rexam <- read.delim("rexam.dat", header=TRUE)
Step 2: set the variable uni to be a factor:
rexam$uni<-factor(rexam$uni, levels = c(0:1), labels = c("Duncetown University", "Sussex University"))
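As an optional sanity check (not in the original steps), you can confirm that uni is now a factor with the two labels:
str(rexam$uni)
levels(rexam$uni)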
5.5.3.2. Running the analysis for different groups
To obtain separate descriptive statistics for each group, use the by() function, which takes the general form:
by(data = dataFrame, INDICES = grouping variable, FUN = a function that you want to apply to the data)
To get descriptive statistics for the variable exam for each university separately:
by(data = rexam$exam, INDICES = rexam$uni, FUN = describe)
Using stat.desc() instead of describe():
by(data = rexam$exam, INDICES = rexam$uni, FUN = stat.desc)
Passing additional arguments to the function:
by(rexam$exam, rexam$uni, stat.desc, basic = FALSE, norm = TRUE)
Descriptive statistics for multiple variables: combine them with cbind():
by(cbind(data=rexam$exam,data=rexam$numeracy), rexam$uni, describe)
This looks at the descriptive statistics of both the previous R exam and the numeracy test.
Output:
(1) the results for students at Duncetown University
(2) the results for those attending Sussex University
Histograms for different groups: use the subset() function to create separate dataframes, then ggplot2 to plot each group.
(1) Create separate dataframes
Subset for Duncetown University:
dunceData<-subset(rexam, rexam$uni=="Duncetown University")
Subset for Sussex University:
sussexData<-subset(rexam,rexam$uni=="Sussex University")
(2) Create a histogram of the numeracy scores for Duncetown University:
hist.numeracy.duncetown <- ggplot(dunceData, aes(numeracy)) + opts(legend.position = "none") + geom_histogram(aes(y = ..density..), fill = "white", colour = "black", binwidth = 1) + labs(x = "Numeracy Score", y = "Density") + stat_function(fun = dnorm, args = list(mean = mean(dunceData$numeracy, na.rm = TRUE), sd = sd(dunceData$numeracy, na.rm = TRUE)), colour = "blue", size = 1)
hist.numeracy.duncetown
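A sketch of the corresponding histogram for Sussex University, assuming the sussexData dataframe created above (again, substitute theme() for opts() in current ggplot2):
hist.numeracy.sussex <- ggplot(sussexData, aes(numeracy)) + opts(legend.position = "none") + geom_histogram(aes(y = ..density..), fill = "white", colour = "black", binwidth = 1) + labs(x = "Numeracy Score", y = "Density") + stat_function(fun = dnorm, args = list(mean = mean(sussexData$numeracy, na.rm = TRUE), sd = sd(sussexData$numeracy, na.rm = TRUE)), colour = "blue", size = 1)
hist.numeracy.sussex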
Testing whether a distribution is normal
5.6.1 Doing the Shapiro–Wilk test in R
The Shapiro–Wilk test is part of the output from the stat.desc() function (the normtest.W and normtest.p values), but it can also be run directly with the shapiro.test() function:
shapiro.test(variable)
where variable is the name of the variable that you'd like to test for normality.
for example: shapiro.test(rexam$exam)
Output: W = 0.9613, p-value = 0.004991. A significant value (p < .05) indicates a deviation from normality; for the R exam scores (p = .005), the Shapiro–Wilk test is therefore significant.
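Because normality can differ between groups, the test can also be run separately for each university by combining shapiro.test() with the by() function introduced earlier:
by(rexam$exam, rexam$uni, shapiro.test)
by(rexam$numeracy, rexam$uni, shapiro.test)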
5.6.2. Reporting the Shapiro–Wilk test
The Shapiro–Wilk test can be used to see if a distribution of scores significantly differs from a normal distribution. If the Shapiro–Wilk test is significant (p-value less than .05) then the scores are significantly different from a normal distribution. Otherwise, scores are approximately normally distributed.
The percentage on the R exam, W = 0.96, p = .005, was significantly non-normal.
Testing for homogeneity of variance
① As you go through levels of one variable, the variance of the other should not change.
② If you have collected groups of data, the variance of the outcome variable or variables should be the same in each of these groups.
In other words, the variance of one variable should be stable at all levels of the other variable.
5.7.1 Levene’s test
① One way is simply to look at whether the values of the variances are similar; for correlational data we tend to use graphs instead (see section 7.9.5).
② For groups of data we tend to use a test called Levene's test (Levene, 1960).
③ When Levene's test is significant (p ≤ .05), we conclude that the assumption of homogeneity of variance has been violated.
④ When it is non-significant (p > .05), the variances are roughly equal and the assumption is tenable.
5.7.1.1 Levene’s test with R Commander
① Data ⇒ Import data ⇒ from text file, clipboard or URL
② Select the file RExam.dat (see section 3.7.3).
③ Convert uni to a factor (see section 3.7.3).
④ Select Statistics ⇒ Variances ⇒ Levene's test.
⑤ Select a grouping variable.
⑥ Choose the variable on the right that you want to test for equality of variances across the groups defined by uni.
⑦ Run the analysis for both exam and numeracy.
5.7.1.2. Levene’s test with R
① Levene's test function, which takes the general form:
leveneTest(outcome variable, group, center = median/mean)
② The function takes two variables: the outcome variable and the grouping variable. By default the test is centred on the median (the more robust option); adding center = mean runs Levene's original test. For the exam scores we could execute:
leveneTest(rexam$exam, rexam$uni)
leveneTest(rexam$exam, rexam$uni, center = mean)
③ For the numeracy scores we could execute:
leveneTest(rexam$numeracy, rexam$uni)
5.7.1.3. Levene’s test output
① For the exam scores, Levene's test is non-significant: the variances are not significantly different (i.e., they are similar and the homogeneity of variance assumption is tenable).
② For the numeracy scores, the test is significant: the variances are significantly different, so the assumption has been violated.
5.7.2. Reporting Levene’s test
For the percentage on the R exam, the variances were similar for Duncetown and Sussex students, F(1, 98) = 2.09, ns; but for numeracy scores the variances were significantly different in the two groups, F(1, 98) = 5.37, p = .023.
5.7.3. Hartley's Fmax: the variance ratio
① Hartley's Fmax is the ratio of the variances between the group with the biggest variance and the group with the smallest variance.
② The critical value depends on the number of cases per group: with around 10 cases per group, Fmax needs to be less than about 10 to be non-significant; with 15–20 cases per group, the ratio should be below about 5; and with 30–60 cases per group, it should be below about 2 or 3.
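A minimal sketch (not from the original notes) of computing this variance ratio by hand for the rexam data, using base R:
variances <- tapply(rexam$exam, rexam$uni, var, na.rm = TRUE)
max(variances)/min(variances)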
Correcting problems in the data
5.8.1 Dealing with outliers
If you detect outliers in the data there are several options for reducing the impact of these values.
1 Remove the case
2 Transform the data
3 Change the score
5.8.2 Dealing with non-normality and unequal variances
5.8.2.1. Transforming data
Log transformation (log(Xi)); square root transformation (√Xi); reciprocal transformation (1/Xi); reverse score transformations.
5.8.2.2. Choosing a transformation
The simple answer is trial and error: try one out and see if it helps and if it doesn’t then try a different one.
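One way to operationalize this trial and error (a sketch added here, reusing the describe() function from earlier) is to compare the skew statistic of the raw and transformed scores:
describe(dlf$day1)               # skew of the raw scores
describe(log(dlf$day1 + 1))      # skew after a log transformation
describe(sqrt(dlf$day1))         # skew after a square root transformation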
5.8.3 Transforming the data using R
5.8.3.1. Computing new variables
Transformations are very easy using R. A new variable is created either by applying a function to an old variable, or by doing arithmetic with one or more old variables:
newVariable <- function(oldVariable)
newVariable <- arithmetic with oldVariable(s)
Addition: dlf$day1PlusDay2 <- dlf$day1 + dlf$day2
Subtraction: dlf$day2MinusDay1 <- dlf$day2 - dlf$day1
Multiplication: dlf$day2Times5 <- dlf$day2 * 5
Exponentiation:
dlf$day2Squared <- dlf$day2 ** 2 or dlf$day2Squared <- dlf$day2 ^ 2
Less than: dlf$day1LessThanOne <- dlf$day1 < 1
Less than or equal to: dlf$day1LessThanOrEqualOne <- dlf$day1 <= 1
Greater than: dlf$day1GreaterThanOne <- dlf$day1 > 1
Greater than or equal to: dlf$day1GreaterThanOrEqualOne <- dlf$day1 >= 1
Equal to (double equals): dlf$male <- dlf$gender == "Male"
Not equal to: dlf$notMale <- dlf$gender != "Male"
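These operators return TRUE/FALSE vectors, which can also be used to select cases; a small illustrative sketch, reusing the gender variable from above:
maleDay1 <- dlf$day1[dlf$gender == "Male"]
mean(maleDay1, na.rm = TRUE)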
5.8.3.2. The log transformation in R
To transform the variable day1 and create a new variable logday1, we execute this command:
dlf$logday1 <- log(dlf$day1)
Because the logarithm of zero (or of a negative number) is undefined, if the data contain zeros we should add a constant to all scores before taking the log. The advantage of adding 1 is that the logarithm of 1 is equal to 0, so people who scored a zero before the transformation score a zero after the transformation. To do this transformation we would execute:
dlf$logday1 <- log(dlf$day1 + 1)
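The same transformation can be applied to the hygiene scores for the other two days:
dlf$logday2 <- log(dlf$day2 + 1)
dlf$logday3 <- log(dlf$day3 + 1)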
5.8.3.3. The square root transformation in R
Therefore, to create a variable called sqrtday1 that contains the square root of the values in the variable day1, we would execute: dlf$sqrtday1 <- sqrt(dlf$day1)
5.8.3.4. The reciprocal transformation in R
We could use a name such as recday1; note that we again add 1 because dividing by zero is undefined. To create this variable we would execute:
dlf$recday1 <- 1/(dlf$day1 + 1)
5.8.3.5. The ifelse() function in R
The ifelse() function is used to create a new variable, or change an old variable, depending on some other values. This function takes the general form:
ifelse(a conditional argument, what happens if the argument is TRUE, what happens if the argument is FALSE)
The first argument is the condition to test; the rest of the function tells it what to do. For example, we might want to set a score to missing (NA) if it is over 4, but keep the old score if it is not over 4. In that case we could execute this command:
dlf$day1NoOutlier <- ifelse(dlf$day1 > 4, NA, dlf$day1)
This command creates a new variable called day1NoOutlier, which takes the value NA if day1 is greater than 4, but keeps the original value of day1 if it is 4 or less.
5.8.3.6. The effect of transformations
Data analysis can be frustrating sometimes!
5.8.4 When it all goes horribly wrong
A much more promising approach is to use robust methods, such as a trimmed mean and M-estimators.
The second general procedure is the bootstrap.
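As a small illustration (added here, not the book's code), base R's mean() computes a trimmed mean directly; for example, a 10% trimmed mean of the day-1 hygiene scores:
mean(dlf$day1, trim = 0.1, na.rm = TRUE)  # trims 10% from each end before averaging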
There are two ways to access Wilcox's functions in R for running numerous robust tests:
(1) To access the WRS package of R:
install.packages("WRS",repos="http://R-Forge.R-project.org")
library(WRS)
(2) To source the functions from Wilcox's website:
source("http://www-rcf.usc.edu/~rwilcox/Rallfun-v14")
Summary
張昱城 (Hebei University of Technology; Human Resources, Organizational Behavior)
張蒙 (Hebei University of Technology; Human Resource Management)
王志靈 (Hebei University of Technology, Tianjin; Human Resource Management)
李晶 (Hebei University of Technology, Tianjin; Business Administration)
石蕊 (Qinhuangdao, Hebei Province; Economics and Management)
曹菲 (Jiangnan University; Social Psychology)
Play with R, Issue 1: Why learn statistics?