What will this chapter tell me?
To learn how to assess your data, check whether they meet the assumptions of your statistical models, and correct problems when they don't.
When assumptions are broken, we stop being able to draw accurate conclusions about reality.
Different statistical models assume different things, and if these models are going to reflect reality accurately then these assumptions need to be true.
Assumptions of parametric data
(1) Normally distributed data: the rationale behind hypothesis testing relies on having something that is normally distributed, so if this assumption is not met then the logic behind hypothesis testing is flawed.
(2) Homogeneity of variance: the variances should be the same throughout the data (roughly equal in different groups, or at different levels of another variable).
(3) Interval data: Data should be measured at least at the interval level.
(4) Independence: This assumption, like that of normality, is different depending on the test you’re using.
Packages used in this chapter
If you do not already have the packages used in this chapter, you can install them by executing the following commands:
install.packages("car"); install.packages("ggplot2");
install.packages("pastecs"); install.packages("psych"); install.packages("Rcmdr")
You then need to load these packages by executing the commands:
library(car); library(ggplot2); library(pastecs); library(psych); library(Rcmdr)
The assumption of normality
5.5.1 Oh no, it’s that pesky frequency distribution again: checking normality visually
Using the ggplot2 package to draw the histograms:
First, load the data:
dlf<-read.delim("DownloadFestival.dat",header=TRUE)
Second, create the histogram and display it:
hist.day1 <- ggplot(dlf, aes(day1)) + opts(legend.position = "none") + geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + labs(x = "Hygiene score on day 1", y = "Density")
hist.day1
(Note: in current versions of ggplot2, opts() has been removed; use theme(legend.position = "none") instead.)
Then, add a normal curve to the chart:
hist.day1 + stat_function(fun = dnorm, args = list(mean = mean(dlf$day1, na.rm = TRUE), sd = sd(dlf$day1, na.rm = TRUE)), colour = "black", size = 1)
Alternatively, use the ggplot2 package to draw a Q-Q plot, using the qplot() function in conjunction with the qq statistic:
qqplot.day1 <- qplot(sample = dlf$day1, stat="qq")
qqplot.day1
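Note that later versions of ggplot2 removed the stat argument from qplot(); a minimal sketch of the equivalent Q-Q plot in current ggplot2 syntax, assuming the same dlf dataframe:
qqplot.day1 <- ggplot(dlf, aes(sample = day1)) + stat_qq() + stat_qq_line()
qqplot.day1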
5.5.2 Quantifying normality with numbers
We want to quantify the shape of the distributions and look for outliers.
Use the describe() function of the psych package to explore the distribution of the variables:
describe(dlf$day1)
The second way is to use the stat.desc() function of the pastecs package, which takes the general form:
stat.desc(variable name, basic = TRUE, norm = FALSE)
By default basic = TRUE and norm = FALSE; to get statistics relating to normality, override the default and specify norm = TRUE:
stat.desc(dlf$day1, basic = FALSE, norm = TRUE)
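stat.desc() reports many decimal places; as an optional convenience (not part of the original command), the output can be rounded for readability:
round(stat.desc(dlf$day1, basic = FALSE, norm = TRUE), digits = 3)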
describe() and stat.desc() can be used with more than one variable at the same time; use the cbind() function to combine two or more variables:
describe(cbind(dlf$day1, dlf$day2, dlf$day3))
stat.desc(cbind(dlf$day1, dlf$day2, dlf$day3), basic = FALSE, norm = TRUE)
Alternatively, select the variable names directly from the dataframe:
describe(dlf[,c("day1", "day2", "day3")])
stat.desc(dlf[, c("day1", "day2", "day3")], basic = FALSE, norm = TRUE)
5.5.3 Exploring groups of data
Ways to produce basic descriptive statistics for groups of data:
the by() function and the subset() function
These functions allow you to specify a grouping variable which splits the data, or to select a subset of cases.
5.5.3.1. Running the analysis for all data
Data:
Step 1: open the file RExam.dat:
rexam <- read.delim("rexam.dat", header=TRUE)
Step 2: set the variable uni to be a factor:
rexam$uni<-factor(rexam$uni, levels = c(0:1), labels = c("Duncetown University", "Sussex University"))
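As an optional sanity check (not in the original steps), you can confirm that uni is now a factor with the two labels:
str(rexam$uni)
levels(rexam$uni)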
5.5.3.2. Running the analysis for different groups
To obtain separate descriptive statistics for each group, use the by() function, which takes the general form:
by(data = dataFrame, INDICES = grouping variable, FUN = a function that you want to apply to the data)
To get descriptive statistics for the variable exam for each university separately:
by(data = rexam$exam, INDICES = rexam$uni, FUN = describe)
Using stat.desc() instead of describe():
by(data = rexam$exam, INDICES = rexam$uni, FUN = stat.desc)
Passing additional arguments to the function:
by(rexam$exam, rexam$uni, stat.desc, basic = FALSE, norm = TRUE)
Descriptive statistics for multiple variables: combine them with cbind():
by(cbind(data=rexam$exam,data=rexam$numeracy), rexam$uni, describe)
This looks at the descriptive statistics of both the previous R exam and the numeracy test.
Output:
(1) the results for students at Duncetown University
(2) the results for those attending Sussex University
Histograms for different groups: use the subset() function to create separate dataframes, then ggplot2 to plot each group.
(1) Create separate dataframes
Subset for Duncetown University:
dunceData<-subset(rexam, rexam$uni=="Duncetown University")
Subset for Sussex University:
sussexData<-subset(rexam,rexam$uni=="Sussex University")
(2) Create a histogram of the numeracy scores for Duncetown University:
hist.numeracy.duncetown <- ggplot(dunceData, aes(numeracy)) + opts(legend.position = "none") + geom_histogram(aes(y = ..density..), fill = "white", colour = "black", binwidth = 1) + labs(x = "Numeracy Score", y = "Density") + stat_function(fun = dnorm, args = list(mean = mean(dunceData$numeracy, na.rm = TRUE), sd = sd(dunceData$numeracy, na.rm = TRUE)), colour = "blue", size = 1)
hist.numeracy.duncetown
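A sketch of the corresponding histogram for Sussex University, assuming the sussexData dataframe created above (again, substitute theme() for opts() in current ggplot2):
hist.numeracy.sussex <- ggplot(sussexData, aes(numeracy)) + opts(legend.position = "none") + geom_histogram(aes(y = ..density..), fill = "white", colour = "black", binwidth = 1) + labs(x = "Numeracy Score", y = "Density") + stat_function(fun = dnorm, args = list(mean = mean(sussexData$numeracy, na.rm = TRUE), sd = sd(sussexData$numeracy, na.rm = TRUE)), colour = "blue", size = 1)
hist.numeracy.sussex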
Testing whether a distribution is normal
5.6.1 Doing the Shapiro–Wilk test in R
The Shapiro–Wilk test is part of the output from the stat.desc() function (the normtest.W and normtest.p values), but it can also be run directly with the shapiro.test() function:
shapiro.test(variable)
where variable is the name of the variable that you'd like to test for normality.
for example: shapiro.test(rexam$exam)
Output: W = 0.9613, p-value = 0.004991. A significant value (p < .05) indicates a deviation from normality; for the R exam scores (p = .005), the Shapiro–Wilk test is therefore significant.
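Because normality can differ between groups, the test can also be run separately for each university by combining shapiro.test() with the by() function introduced earlier:
by(rexam$exam, rexam$uni, shapiro.test)
by(rexam$numeracy, rexam$uni, shapiro.test)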
5.6.2. Reporting the Shapiro–Wilk test
The Shapiro–Wilk test can be used to see if a distribution of scores significantly differs from a normal distribution. If the Shapiro–Wilk test is significant (p-value less than .05) then the scores are significantly different from a normal distribution. Otherwise, scores are approximately normally distributed.
The percentage on the R exam, W = 0.96, p = .005, was significantly non-normal.
Testing for homogeneity of variance
① As you go through levels of one variable, the variance of the other should not change.
② If you have collected groups of data, the variance of the outcome variable or variables should be the same in each of these groups.
In other words, the variance of one variable should be stable at all levels of the other variable.
5.7.1 Levene’s test
① One way is simply to look at whether the values of the variances are similar; for correlational data we tend to use graphs instead (see section 7.9.5).
② For groups of data we tend to use a test called Levene's test (Levene, 1960).
③ When Levene's test is significant (p ≤ .05), we conclude that the assumption of homogeneity of variance has been violated.
④ When it is non-significant (p > .05), the variances are roughly equal and the assumption is tenable.
5.7.1.1 Levene’s test with R Commander
① Data ⇒ Import data ⇒ from text file, clipboard or URL
② Select the file RExam.dat (see section 3.7.3).
③ Convert uni to a factor (see section 3.7.3).
④ Select Statistics ⇒ Variances ⇒ Levene's test.
⑤ Select a grouping variable.
⑥ Choose the variable on the right that you want to test for equality of variances across the groups defined by uni.
⑦ Run the analysis for both exam and numeracy.
5.7.1.2. Levene’s test with R
① Levene's test function, which takes the general form:
leveneTest(outcome variable, group, center = median/mean)
② The function takes two variables: the outcome variable and the grouping variable. By default the test is centred on the median (the more robust option); adding center = mean runs Levene's original test. For the exam scores we could execute:
leveneTest(rexam$exam, rexam$uni)
leveneTest(rexam$exam, rexam$uni, center = mean)
③ For the numeracy scores we could execute:
leveneTest(rexam$numeracy, rexam$uni)
5.7.1.3. Levene’s test output
① For the exam scores, Levene's test is non-significant: the variances are not significantly different (i.e., they are similar and the homogeneity of variance assumption is tenable).
② For the numeracy scores, the test is significant: the variances are significantly different, so the assumption has been violated.
5.7.2. Reporting Levene’s test
For the percentage on the R exam, the variances were similar for Duncetown and Sussex students, F(1, 98) = 2.09, ns; but for numeracy scores the variances were significantly different in the two groups, F(1, 98) = 5.37, p = .023.
5.7.3. Hartley's Fmax: the variance ratio
① Hartley's Fmax is the ratio of the variances between the group with the biggest variance and the group with the smallest variance.
② The critical value depends on the number of cases per group: with around 10 cases per group, Fmax needs to be less than about 10 to be non-significant; with 15–20 cases per group, the ratio should be below about 5; and with 30–60 cases per group, it should be below about 2 or 3.
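A minimal sketch (not from the original notes) of computing this variance ratio by hand for the rexam data, using base R:
variances <- tapply(rexam$exam, rexam$uni, var, na.rm = TRUE)
max(variances)/min(variances)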
Correcting problems in the data
5.8.1 Dealing with outliers
If you detect outliers in the data there are several options for reducing the impact of these values.
1 Remove the case
2 Transform the data
3 Change the score
5.8.2 Dealing with non-normality and unequal variances
5.8.2.1. Transforming data
Log transformation (log(Xi)); square root transformation (√Xi); reciprocal transformation (1/Xi); reverse score transformations.
5.8.2.2. Choosing a transformation
The simple answer is trial and error: try one out and see if it helps and if it doesn’t then try a different one.
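One way to operationalize this trial and error (a sketch added here, reusing the describe() function from earlier) is to compare the skew statistic of the raw and transformed scores:
describe(dlf$day1)               # skew of the raw scores
describe(log(dlf$day1 + 1))      # skew after a log transformation
describe(sqrt(dlf$day1))         # skew after a square root transformation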
5.8.3 Transforming the data using R
5.8.3.1. Computing new variables
Transformations are very easy using R. A new variable is created either by applying a function to an old variable, or by doing arithmetic with one or more old variables:
newVariable <- function(oldVariable)
newVariable <- arithmetic with oldVariable(s)
Addition: dlf$day1PlusDay2 <- dlf$day1 + dlf$day2
Subtraction: dlf$day2MinusDay1 <- dlf$day2 - dlf$day1
Multiplication: dlf$day2Times5 <- dlf$day2 * 5
Exponentiation:
dlf$day2Squared <- dlf$day2 ** 2 or dlf$day2Squared <- dlf$day2 ^ 2
Less than: dlf$day1LessThanOne <- dlf$day1 < 1
Less than or equal to: dlf$day1LessThanOrEqualOne <- dlf$day1 <= 1
Greater than: dlf$day1GreaterThanOne <- dlf$day1 > 1
Greater than or equal to: dlf$day1GreaterThanOrEqualOne <- dlf$day1 >= 1
Equal to (double equals): dlf$male <- dlf$gender == "Male"
Not equal to: dlf$notMale <- dlf$gender != "Male"
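These operators return TRUE/FALSE vectors, which can also be used to select cases; a small illustrative sketch, reusing the gender variable from above:
maleDay1 <- dlf$day1[dlf$gender == "Male"]
mean(maleDay1, na.rm = TRUE)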
5.8.3.2. The log transformation in R
To transform the variable day1 and create a new variable logday1, we execute this command:
dlf$logday1 <- log(dlf$day1)
Because the logarithm of zero (or of a negative number) is undefined, if the data contain zeros we should add a constant to all scores before taking the log. The advantage of adding 1 is that the logarithm of 1 is equal to 0, so people who scored a zero before the transformation score a zero after the transformation. To do this transformation we would execute:
dlf$logday1 <- log(dlf$day1 + 1)
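The same transformation can be applied to the hygiene scores for the other two days:
dlf$logday2 <- log(dlf$day2 + 1)
dlf$logday3 <- log(dlf$day3 + 1)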
5.8.3.3. The square root transformation in R
Therefore, to create a variable called sqrtday1 that contains the square root of the values in the variable day1, we would execute: dlf$sqrtday1 <- sqrt(dlf$day1)
5.8.3.4. The reciprocal transformation in R
We could use a name such as recday1; note that we again add 1 because dividing by zero is undefined. To create this variable we would execute:
dlf$recday1 <- 1/(dlf$day1 + 1)
5.8.3.5. The ifelse() function in R
The ifelse() function is used to create a new variable, or change an old variable, depending on some other values. This function takes the general form:
ifelse(a conditional argument, what happens if the argument is TRUE, what happens if the argument is FALSE)
The first argument is the condition to test; the rest of the function tells it what to do. For example, we might want to set a score to missing (NA) if it is over 4, but keep the old score if it is not over 4. In that case we could execute this command:
dlf$day1NoOutlier <- ifelse(dlf$day1 > 4, NA, dlf$day1)
This command creates a new variable called day1NoOutlier, which takes the value NA if day1 is greater than 4, but keeps the original value of day1 if it is 4 or less.
5.8.3.6. The effect of transformations
Data analysis can be frustrating sometimes!
5.8.4 When it all goes horribly wrong
A much more promising approach is to use robust methods, such as a trimmed mean and M-estimators.
The second general procedure is the bootstrap.
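As a small illustration (added here, not the book's code), base R's mean() computes a trimmed mean directly; for example, a 10% trimmed mean of the day-1 hygiene scores:
mean(dlf$day1, trim = 0.1, na.rm = TRUE)  # trims 10% from each end before averaging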
There are two ways to access Wilcox's functions in R for running numerous robust tests:
(1) To access the WRS package of R:
install.packages("WRS",repos="http://R-Forge.R-project.org")
library(WRS)
(2) To source the functions from Wilcox's website:
source("http://www-rcf.usc.edu/~rwilcox/Rallfun-v14")
Summary
張昱城 (Hebei University of Technology; Human Resources, Organizational Behavior)
張蒙 (Hebei University of Technology; Human Resource Management)
王志靈 (Hebei University of Technology, Tianjin; Human Resource Management)
李晶 (Hebei University of Technology, Tianjin; Business Administration)
石蕊 (Qinhuangdao, Hebei Province; Economics and Management)
曹菲 (Jiangnan University; Social Psychology)
Play with R, Issue 1: Why learn statistics?