Play with R, Issue 5

2021-02-21 Dutch Psychological Statistics Alliance

 What will this chapter tell me?

This chapter shows how to check whether your data meet the assumptions of parametric tests, and how to correct the data when they do not.

When assumptions are broken, we stop being able to draw accurate conclusions about reality.

Different statistical models assume different things, and if these models are going to reflect reality accurately then these assumptions need to be true.

Assumptions of parametric data

(1) Normally distributed data: the rationale behind hypothesis testing relies on having something that is normally distributed, so if this assumption is not met then the logic behind hypothesis testing is flawed.

(2) Homogeneity of variance: This assumption means that the variances should be the same throughout the data.

(3) Interval data: Data should be measured at least at the interval level.

(4) Independence: This assumption, like that of normality, is different depending on the test you’re using.

 Packages used in this chapter

R: you can install the packages by executing the following commands:

install.packages("car"); install.packages("ggplot2"); install.packages("pastecs")

install.packages("psych"); install.packages("Rcmdr")

You then need to load these packages by executing the commands:

library(car);library(ggplot2);library(pastecs);library(psych);library(Rcmdr)

The assumption of normality

5.5.1 Oh no, it’s that pesky frequency distribution again: checking normality visually

Use the ggplot2 package to draw the histograms:

First, load the data:

dlf<-read.delim("DownloadFestival.dat",header=TRUE)

Second, code something like:

hist.day1 <- ggplot(dlf, aes(day1)) + theme(legend.position = "none") + geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") + labs(x = "Hygiene score on day 1", y = "Density")

hist.day1

Then, add a normal curve to the chart:

hist.day1+stat_function(fun=dnorm,args=list(mean= mean(dlf$day1,na.rm = TRUE),sd=sd(dlf$day1,na.rm = TRUE)), colour = "black", size = 1)

Next, use the ggplot2 package to draw the Q-Q plot (map the variable to the sample aesthetic and add the qq statistic with stat_qq()):

qqplot.day1 <- ggplot(dlf, aes(sample = day1)) + stat_qq()

qqplot.day1
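To make the idea behind the Q-Q plot concrete, here is a minimal sketch in base R rather than ggplot2. It uses simulated scores, not the festival data, and the variable names are assumptions made for illustration. qqnorm() with plot.it = FALSE returns the theoretical and sample quantiles that the plot pairs up; the closer their correlation is to 1, the closer the points lie to a straight line.

```r
# Simulated data (hypothetical; not DownloadFestival.dat)
set.seed(42)
normal.scores <- rnorm(200, mean = 2, sd = 0.7)  # roughly normal
skewed.scores <- rexp(200, rate = 1)             # positively skewed

# Theoretical (x) and sample (y) quantiles, without drawing the plot
qq.normal <- qqnorm(normal.scores, plot.it = FALSE)
qq.skewed <- qqnorm(skewed.scores, plot.it = FALSE)

# Straightness of the Q-Q points, summarised as a correlation
r.normal <- cor(qq.normal$x, qq.normal$y)
r.skewed <- cor(qq.skewed$x, qq.skewed$y)
print(c(normal = r.normal, skewed = r.skewed))
```

The skewed sample should give the lower correlation, which corresponds to the curved pattern you would see in its Q-Q plot.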

5.5.2 Quantifying normality with numbers

We want to quantify the shape of the distributions and look for outliers.

Use the describe() function of the psych package to explore the distribution of the variables:

describe(dlf$day1)

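The second way is to use the stat.desc() function of the pastecs package, which takes the general form:

stat.desc(variableName, basic = TRUE, norm = FALSE)

By default basic = TRUE and norm = FALSE. To get the normality statistics (skew, kurtosis and the Shapiro–Wilk test) instead of the basic counts, override the defaults: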

stat.desc(dlf$day1, basic = FALSE, norm = TRUE)

Both describe() and stat.desc() accept more than one variable at a time. One way is to combine two or more variables with the cbind() function:

describe(cbind(dlf$day1, dlf$day2, dlf$day3))

stat.desc(cbind(dlf$day1, dlf$day2, dlf$day3), basic = FALSE, norm = TRUE)

The second way is to select the variable names directly from the data set:

describe(dlf[,c("day1", "day2", "day3")])

stat.desc(dlf[, c("day1", "day2", "day3")], basic = FALSE, norm = TRUE)
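As a rough illustration of what the skew and kurtosis values in this output measure, the sketch below computes the plain moment-based versions in base R on simulated data. The function names skewness.moment() and kurtosis.moment() are hypothetical helpers written for this example; describe() and stat.desc() may apply small-sample adjustments, so their values can differ slightly from these.

```r
# Moment-based skewness: third central moment scaled by the variance^(3/2)
skewness.moment <- function(x) {
  x <- x[!is.na(x)]
  dev <- x - mean(x)
  (sum(dev^3) / length(x)) / (sum(dev^2) / length(x))^(3 / 2)
}

# Excess kurtosis: fourth central moment scaled by variance^2, minus 3
# (so a normal distribution scores about 0)
kurtosis.moment <- function(x) {
  x <- x[!is.na(x)]
  dev <- x - mean(x)
  (sum(dev^4) / length(x)) / (sum(dev^2) / length(x))^2 - 3
}

set.seed(1)
scores <- rexp(500)  # simulated, positively skewed (hypothetical data)
skew <- skewness.moment(scores)
kurt <- kurtosis.moment(scores)
print(c(skew = skew, kurtosis = kurt))
```

Positive skew indicates a pile-up of scores on the left of the distribution, and values near zero on both statistics are what a normal distribution would produce.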

5.5.3 Exploring groups of data

Ways to produce basic descriptive statistics for groups of data:

the by() function and the subset() function

These functions allow you to specify a grouping variable which splits the data, or to select a subset of cases.

5.5.3.1. Running the analysis for all data

Data:

Step 1. Open the file RExam.dat:

rexam <- read.delim("RExam.dat", header = TRUE)

Step 2. Set the variable uni to be a factor:

rexam$uni<-factor(rexam$uni, levels = c(0:1), labels = c("Duncetown University", "Sussex University"))

5.5.3.2.  Running the analysis for different groups

To obtain separate descriptive statistics for each group, use the by() function, which takes the general form:

by(data = dataFrame, INDICES = groupingVariable, FUN = functionToApply)

For example, to get descriptive statistics for the variable exam for each university separately:

by(data = rexam$exam, INDICES = rexam$uni, FUN = describe)

Using stat.desc() instead of describe():

by(data = rexam$exam, INDICES = rexam$uni, FUN = stat.desc)

Passing extra options to the function (they are added after the function name):

by(rexam$exam, rexam$uni, stat.desc, basic = FALSE, norm = TRUE)

Descriptive statistics for multiple variables: select the columns you need (or combine them with cbind()). For example, to look at the descriptive statistics of both the R exam and the numeracy test for each university:

by(rexam[, c("exam", "numeracy")], rexam$uni, describe)

Output:

(1) the results for students at Duncetown University

(2) the results for those attending Sussex University
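The grouping behaviour of by() can be tried without the data file. The following is a minimal sketch on a simulated data frame: the values are made up and only the variable names mirror rexam. A small anonymous function stands in for describe() so that the example runs in base R.

```r
# Simulated stand-in for rexam (hypothetical values)
set.seed(123)
sim <- data.frame(
  uni  = factor(rep(c("Duncetown University", "Sussex University"), each = 50)),
  exam = c(rnorm(50, mean = 40, sd = 12), rnorm(50, mean = 76, sd = 10))
)

# Apply the same summary function to the exam scores of each university
group.stats <- by(sim$exam, sim$uni, function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x))
})
print(group.stats)
```

by() returns one result per level of the grouping factor, which is why the output comes in two blocks, one for each university.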

Histograms for different groups: use the subset() function to split the data, then ggplot2 to create a plot for each group

(1) Create separate dataframes

Data for Duncetown University:

dunceData <- subset(rexam, rexam$uni == "Duncetown University")

Data for Sussex University:

sussexData <- subset(rexam, rexam$uni == "Sussex University")

(2) Create a histogram of the numeracy scores for Duncetown University

hist.numeracy.duncetown <- ggplot(dunceData, aes(numeracy)) + theme(legend.position = "none") +
  geom_histogram(aes(y = after_stat(density)), fill = "white", colour = "black", binwidth = 1) +
  labs(x = "Numeracy Score", y = "Density") +
  stat_function(fun = dnorm, args = list(mean = mean(dunceData$numeracy, na.rm = TRUE), sd = sd(dunceData$numeracy, na.rm = TRUE)), colour = "blue", size = 1)

Testing whether a distribution is normal

5.6.1 Doing the Shapiro–Wilk test in R

The Shapiro–Wilk test is part of the output of the stat.desc() function when norm = TRUE. It can also be run on its own with the shapiro.test() function:

shapiro.test(variable) (variable is the name of the variable that you’d like to test for normality)

For example: shapiro.test(rexam$exam)

Output: W = 0.9613, p-value = 0.004991. A significant value (p-value less than .05) indicates a deviation from normality; for the R exam scores (p = .005) the Shapiro–Wilk test is significant.

5.6.2. Reporting the Shapiro–Wilk test

The Shapiro–Wilk test can be used to see if a distribution of scores significantly differs from a normal distribution. If the Shapiro–Wilk test is significant (p-value less than .05) then the scores are significantly different from a normal distribution. Otherwise, scores are approximately normally distributed.

The percentage on the R exam, W = 0.96, p= .005, was significantly non-normal.
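To see where the reported numbers come from, here is a small sketch that runs shapiro.test() on simulated, deliberately skewed data (not the exam scores) and pulls W and p out of the result. The reporting format in the last line is only one possible convention.

```r
set.seed(2021)
skewed <- rexp(100)              # simulated, clearly non-normal scores

result <- shapiro.test(skewed)   # base R (stats package)
W <- unname(result$statistic)    # the W statistic
p <- result$p.value              # its p-value

# Build a reporting string in the style "W = 0.96, p = .005"
p.text <- if (p < .001) "< .001" else sub("^0", "", sprintf("= %.3f", p))
cat(sprintf("W = %.2f, p %s\n", W, p.text))
```

Because the simulated scores are strongly skewed, the test should come out significant, that is, the distribution deviates from normality.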

Testing for homogeneity of variance

① As you go through the levels of one variable, the variance of the other should not change.

② If you have collected groups of data, the variance of the outcome variable or variables should be the same in each of these groups.

PS: the variance of one variable should be stable at all levels of the other variable.

5.7.1 Levene’s test

① The aim is to see whether the values of the variances are similar.

② In practice we use graphs (see section 7.9.5), and for groups of data we tend to use a test called Levene’s test (Levene, 1960).

③ If Levene’s test is significant (p ≤ .05), we conclude that the variances differ significantly, so the assumption of homogeneity of variances has been violated.

④ If Levene’s test is non-significant (p > .05), the variances are roughly equal and the assumption is tenable.

5.7.1.1 Levene’s test with R Commander

①Data ⇒ Import data ⇒ from text file, clipboard or URL

② select the file RExam.dat (see section 3.7.3)

③ convert uni to a factor (see section 3.7.3)

④ select Statistics⇒Variances⇒Levene’s test

⑤ select a grouping variable

⑥ Choose the variable on the right that you want to test for equality of variances across the groups defined by uni

⑦ Run the analysis for both exam and numeracy.

5.7.1.2. Levene’s test with R

① The leveneTest() function from the car package takes the general form:

leveneTest(outcomeVariable, group, center = median) (or center = mean)

② It needs two variables: the outcome variable and the grouping variable. For the exam scores we could execute:

leveneTest(rexam$exam, rexam$uni)

leveneTest(rexam$exam, rexam$uni, center = mean)

③ For the numeracy scores we could execute:

leveneTest(rexam$numeracy, rexam$uni)
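To show what the test is doing under the hood, the sketch below reproduces the median-centred version by hand in base R: it runs a one-way ANOVA on the absolute deviations of each score from its own group median. The data are simulated and the variable names are assumptions for illustration; with real data you would simply call leveneTest().

```r
# Simulated two-group data with deliberately unequal spread (hypothetical)
set.seed(7)
group <- factor(rep(c("A", "B"), each = 50))
score <- c(rnorm(50, mean = 50, sd = 5), rnorm(50, mean = 50, sd = 15))

# Absolute deviation of every score from the median of its own group
abs.dev <- abs(score - ave(score, group, FUN = median))

# One-way ANOVA on those deviations; the F-test for 'group' is the Levene statistic
levene.table <- anova(lm(abs.dev ~ group))
print(levene.table)
```

With two groups of 50 the degrees of freedom are 1 and 98, the same as in the results reported for the exam data.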

5.7.1.3. Levene’s test output

① For the exam scores, Levene’s test indicates that the variances are not significantly different (i.e., they are similar and the homogeneity of variance assumption is tenable).

② For the numeracy scores, the variances are significantly different.

5.7.2. Reporting Levene’s test

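For the percentage on the R exam, the variances were similar for the two universities, F(1, 98) = 2.09, non-significant.

For the numeracy scores, the variances were significantly different in the two groups, F(1, 98) = 5.37, p = .023.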

5.7.3. Hartley’s Fmax : the variance ratio

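① Fmax is the ratio of the variances between the group with the biggest variance and the group with the smallest variance.

② The critical value depends on the sample size per group. As a rule of thumb:

with n = 10 per group, Fmax < 10 is almost always non-significant;

with n = 15 to 20 per group, the ratio needs to be less than about 5;

with n = 30 to 60 per group, the ratio should be below about 2 (or 3).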
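The variance ratio itself is easy to compute in base R. This is a minimal sketch on a tiny made-up data set; the names scores, group and f.max are hypothetical.

```r
# Two small groups: variance 1 in group a, variance 4 in group b
scores <- c(1, 2, 3, 2, 4, 6)
group  <- factor(c("a", "a", "a", "b", "b", "b"))

# Variance of the scores within each group
group.vars <- tapply(scores, group, var)

# Hartley's Fmax: largest group variance divided by the smallest
f.max <- max(group.vars) / min(group.vars)
print(f.max)  # 4
```

The resulting ratio is then compared against the threshold that matches the number of cases per group.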

 Correcting problems in the data

5.8.1 Dealing with outliers

If you detect outliers in the data there are several options for reducing the impact of these values.

1 Remove the case

2 Transform the data

3 Change the score

5.8.2 Dealing with non-normality and unequal variances

5.8.2.1. Transforming data

Log transformation (log(Xi)); Square root transformation (√Xi); Reciprocal transformation (1/Xi); Reverse score transformations

5.8.2.2. Choosing a transformation

The simple answer is trial and error: try one out and see if it helps and if it doesn’t then try a different one.

5.8.3 Transforming the data using R

5.8.3.1. Computing new variables

Transformations are very easy using R.

newVariable <- function(oldVariable)

newVariable <- arithmetic with oldVariable(s)

Addition: dlf$day1PlusDay2 <- dlf$day1 + dlf$day2

Subtraction: dlf$day2MinusDay1 <- dlf$day2 - dlf$day1

Multiplication: dlf$day2Times5 <- dlf$day2 * 5

Exponentiation:

dlf$day2Squared <- dlf$day2 ** 2 or dlf$day2Squared <- dlf$day2 ^ 2

Less than: dlf$day1LessThanOne <- dlf$day1 < 1

Less than or equal to: dlf$day1LessThanOrEqualOne <- dlf$day1 <= 1

Greater than: dlf$day1GreaterThanOne <- dlf$day1 > 1

Greater than or equal to: dlf$day1GreaterThanOrEqualOne <- dlf$day1 >= 1

Equal to (double equals): dlf$male <- dlf$gender == "Male"

Not equal to: dlf$notMale <- dlf$gender != "Male"

5.8.3.2. The log transformation in R

To transform the variable day1 and create a new variable logday1, we execute this command:

dlf$logday1 <- log(dlf$day1)

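If a variable contains zeros, the logarithm of those scores is undefined (R returns -Inf), so we add a constant to every score before transforming. The advantage of adding 1 is that the logarithm of 1 is equal to 0, so people who scored a zero before the transformation score a zero after the transformation. To do this transformation we would execute: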

dlf$logday1 <- log(dlf$day1 + 1)
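A quick way to convince yourself of this is to try it on a few made-up scores. The vector x below is hypothetical; log1p() is a base R function that computes log(1 + x) directly.

```r
x <- c(0, 1, 9)           # made-up scores, including a zero

plain   <- log(x)         # the zero becomes -Inf
shifted <- log(x + 1)     # the zero stays 0 after the transformation
same    <- log1p(x)       # base R shortcut for log(1 + x)

print(plain)
print(shifted)
```

Both log(x + 1) and log1p(x) give the same values, so either form can be used for the transformation.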

5.8.3.3. The square root transformation in R

To create a variable called sqrtday1 that contains the square root of the values in the variable day1, we would execute:

dlf$sqrtday1 <- sqrt(dlf$day1)


5.8.3.4. The reciprocal transformation in R

We could use a name such as recday1, and to create this variable we would execute:

dlf$recday1 <- 1/(dlf$day1 + 1)


5.8.3.5. The ifelse() function in R

The ifelse() function is used to create a new variable, or change an old variable, depending on some other values. This function takes the general form:

ifelse(a conditional argument, what happens if the argument is TRUE, what happens if the argument is FALSE)

The rest of the function tells it what to do, for example, we might want to set it to missing (NA) if the score is over 4, but keep it as the old score if the score is not over 4. In which case we could execute this command:

dlf$day1NoOutlier <- ifelse(dlf$day1 > 4, NA, dlf$day1)

This command creates a new variable called day1NoOutlier, which takes the value NA if day1 is greater than 4 but keeps the value of day1 if day1 is less than or equal to 4.
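The behaviour at the boundary and with missing values is easiest to see on a short made-up vector. The scores below are hypothetical and stand in for dlf$day1.

```r
day1 <- c(1.5, 20.02, 3.0, NA, 4.0)   # made-up scores; 20.02 plays the outlier

# Scores over 4 become NA; everything else keeps its old value
day1NoOutlier <- ifelse(day1 > 4, NA, day1)
print(day1NoOutlier)  # 1.5 NA 3.0 NA 4.0
```

A score of exactly 4 is kept because the condition is "greater than 4", and a score that was already missing stays missing.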


5.8.3.6. The effect of transformations

Data analysis can be frustrating sometimes!

5.8.4 When it all goes horribly wrong 

A much more promising approach is to use robust methods. The first general procedure is to use a trimmed mean or M-estimators.

The second general procedure is the bootstrap.

Numerous robust tests are available in R through Wilcox’s functions, which can be obtained in two ways:

(1) Install and load the WRS package:

install.packages("WRS",repos="http://R-Forge.R-project.org")

library(WRS)

(2) To source the function from Wilcox’s website:

source("http://www-rcf.usc.edu/~rwilcox/Rallfun-v14")
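Independently of Wilcox’s functions, the two general ideas can be sketched in base R. The example below is a minimal illustration on simulated scores with a few extreme values; the 20% trim, the 2000 resamples and all variable names are assumptions chosen for the example, not recommendations from the chapter.

```r
# Simulated scores with a handful of outliers (hypothetical data)
set.seed(99)
scores <- c(rnorm(95, mean = 2, sd = 0.5), 18, 20, 22, 25, 30)

# Ordinary mean versus 20% trimmed mean (drops the lowest and highest 20%)
ordinary <- mean(scores)
trimmed  <- mean(scores, trim = 0.2)

# Percentile bootstrap: resample with replacement, recompute the trimmed mean
boot.means <- replicate(2000, mean(sample(scores, replace = TRUE), trim = 0.2))
ci <- quantile(boot.means, probs = c(0.025, 0.975))

print(c(mean = ordinary, trimmed = trimmed))
print(ci)
```

The trimmed mean is pulled far less by the extreme scores than the ordinary mean, and the bootstrap interval is obtained without assuming that the scores are normally distributed.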

Summary

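張昱城

Hebei University of Technology

Human Resources, Organizational Behavior

張蒙

Hebei University of Technology

Human Resource Management

王志靈

Hebei University of Technology (Tianjin)

Human Resource Management

李晶

Hebei University of Technology (Tianjin)

Business Administration

石蕊

Qinhuangdao, Hebei Province

Economics and Management

曹菲

Jiangnan University

Social Psychology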

‍‍
