「R」 Row-wise operations with dplyr

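2021-03-02 優雅R
「Original source: the dplyr documentation」
Previous post: 「R」 Column-wise operations with dplyr
dplyr, and R in general, are well suited to operating on columns, while operating on rows tends to be more awkward. In this article we look at dplyr's approach to row-wise data frames, which are created with rowwise().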
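This article discusses three common use cases:
• Row-wise summary statistics (for example, the mean of x, y, and z within each row).
• Working with list-columns.
• Calling a function repeatedly with varying arguments.
Problems like these are usually easy to solve with a for loop, but it is nice to have a solution that fits naturally into a pipeline.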
❝
Of course, someone has to write loops. It doesn’t have to be you. — Jenny Bryan
❞
Loading the package
library(dplyr, warn.conflicts = FALSE)
Row-wise operations require a special type of grouping in which each group consists of a single row. You create it with rowwise():
df <- tibble(x = 1:2, y = 3:4, z = 5:6)
df %>% rowwise()
#> # A tibble: 2 x 3
#> # Rowwise: 
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     3     5
#> 2     2     4     6
Like group_by(), rowwise() does not do anything by itself; it only changes how the other verbs work. For example, compare the results of the two mutate() calls below:
df %>% mutate(m = mean(c(x, y, z)))
#> # A tibble: 2 x 4
#>       x     y     z     m
#>   <int> <int> <int> <dbl>
#> 1     1     3     5   3.5
#> 2     2     4     6   3.5
df %>% rowwise() %>% mutate(m = mean(c(x, y, z)))
#> # A tibble: 2 x 4
#> # Rowwise: 
#>       x     y     z     m
#>   <int> <int> <int> <dbl>
#> 1     1     3     5     3
#> 2     2     4     6     4
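If you use mutate() on a regular data frame, it computes the mean of x, y, and z across all rows. If you apply it to a row-wise data frame, it computes the mean for each row.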
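To make the difference concrete, both results can be reproduced by hand in base R. This is only an illustrative sketch of the semantics, not a description of how dplyr computes them internally; the names `overall` and `per_row` are introduced here for illustration.

```r
# Same values as df above, held in a plain data frame
df <- data.frame(x = 1:2, y = 3:4, z = 5:6)

# Regular mutate(): mean() sees the complete columns at once
overall <- mean(c(df$x, df$y, df$z))

# Row-wise mutate(): mean() is evaluated once per row
per_row <- vapply(
  seq_len(nrow(df)),
  function(i) mean(c(df$x[i], df$y[i], df$z[i])),
  numeric(1)
)

overall  # 3.5
per_row  # 3 4
```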
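You can optionally supply "identifier" variables in the call to rowwise(). These variables are preserved when you call summarise(), so they behave much like the grouping variables passed to group_by():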
df <- tibble(name = c("Mara", "Hadley"), x = 1:2, y = 3:4, z = 5:6)

df %>% 
  rowwise() %>% 
  summarise(m = mean(c(x, y, z)))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 1
#>       m
#>   <dbl>
#> 1     3
#> 2     4

df %>% 
  rowwise(name) %>% 
  summarise(m = mean(c(x, y, z)))
#> `summarise()` regrouping output by 'name' (override with `.groups` argument)
#> # A tibble: 2 x 2
#> # Groups:   name [2]
#>   name       m
#>   <chr>  <dbl>
#> 1 Mara       3
#> 2 Hadley     4
rowwise() is just a special form of grouping, so if you want to remove it from a data frame, call ungroup().
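A small sketch of that round trip, assuming only that row-wise data frames carry the class "rowwise_df" (the variable names `rf`, `plain`, and `res` are chosen here for illustration):

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = 1:2, y = 3:4, z = 5:6)

rf <- df %>% rowwise()      # row-wise grouping added
plain <- rf %>% ungroup()   # row-wise grouping removed again

inherits(rf, "rowwise_df")     # TRUE
inherits(plain, "rowwise_df")  # FALSE

# After ungroup(), mutate() works on whole columns again
res <- plain %>% mutate(m = mean(c(x, y, z)))
res$m  # 3.5 3.5
```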
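Per-row summary statistics
dplyr::summarise() makes it easy to summarise values across the rows of one column. Combined with rowwise(), it also makes it easy to summarise values across the columns of one row. To see how this works, we start by creating a small data frame: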
df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df
#> # A tibble: 6 x 5
#>      id     w     x     y     z
#>   <int> <int> <int> <int> <int>
#> 1     1    10    20    30    40
#> 2     2    11    21    31    41
#> 3     3    12    22    32    42
#> 4     4    13    23    33    43
#> # … with 2 more rows
Suppose we want to compute the sum of w, x, y, and z for each row. We first create a row-wise data frame:
rf <- df %>% rowwise(id)
We can then use mutate() to add a new column, or summarise() to return only the summary column:
rf %>% mutate(total = sum(c(w, x, y, z)))
#> # A tibble: 6 x 6
#> # Rowwise:  id
#>      id     w     x     y     z total
#>   <int> <int> <int> <int> <int> <int>
#> 1     1    10    20    30    40   100
#> 2     2    11    21    31    41   104
#> 3     3    12    22    32    42   108
#> 4     4    13    23    33    43   112
#> # … with 2 more rows
rf %>% summarise(total = sum(c(w, x, y, z)))
#> `summarise()` regrouping output by 'id' (override with `.groups` argument)
#> # A tibble: 6 x 2
#> # Groups:   id [6]
#>      id total
#>   <int> <int>
#> 1     1   100
#> 2     2   104
#> 3     3   108
#> 4     4   112
#> # … with 2 more rows
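Of course, if you have a lot of variables, typing every variable name gets tedious. Instead, you can use c_across(), which supports tidy selection syntax and lets you select many variables at once: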
rf %>% mutate(total = sum(c_across(w:z)))
#> # A tibble: 6 x 6
#> # Rowwise:  id
#>      id     w     x     y     z total
#>   <int> <int> <int> <int> <int> <int>
#> 1     1    10    20    30    40   100
#> 2     2    11    21    31    41   104
#> 3     3    12    22    32    42   108
#> 4     4    13    23    33    43   112
#> # … with 2 more rows
rf %>% mutate(total = sum(c_across(where(is.numeric))))
#> # A tibble: 6 x 6
#> # Rowwise:  id
#>      id     w     x     y     z total
#>   <int> <int> <int> <int> <int> <int>
#> 1     1    10    20    30    40   100
#> 2     2    11    21    31    41   104
#> 3     3    12    22    32    42   108
#> 4     4    13    23    33    43   112
#> # … with 2 more rows
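c_across() is not limited to sum(); any function that reduces a vector to a single value fits the same pattern. As a minimal sketch (the column names `lo` and `hi` and the object `ranges` are made up for this example), the per-row minimum and maximum can be computed like this:

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)

ranges <- df %>%
  rowwise(id) %>%
  mutate(
    lo = min(c_across(w:z)),  # smallest of w, x, y, z in this row
    hi = max(c_across(w:z))   # largest of w, x, y, z in this row
  ) %>%
  ungroup()

ranges$lo  # 10 11 12 13 14 15
ranges$hi  # 40 41 42 43 44 45
```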
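You can combine this with column-wise operations (see the previous post) to express each value as a proportion of its row total: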
rf %>% 
  mutate(total = sum(c_across(w:z))) %>% 
  ungroup() %>% 
  mutate(across(w:z, ~ . / total))
#> # A tibble: 6 x 6
#>      id     w     x     y     z total
#>   <int> <dbl> <dbl> <dbl> <dbl> <int>
#> 1     1 0.1   0.2   0.3   0.4     100
#> 2     2 0.106 0.202 0.298 0.394   104
#> 3     3 0.111 0.204 0.296 0.389   108
#> 4     4 0.116 0.205 0.295 0.384   112
#> # … with 2 more rows
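Row-wise summary functions
The rowwise() approach works with any summary function. If speed matters, though, it is worth looking for a built-in row-wise variant of the summary function you need. These are more efficient because they operate on the data frame as a whole; they do not split it into rows, compute the summary, and then join the results back together.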
df %>% mutate(total = rowSums(across(where(is.numeric))))
#> # A tibble: 6 x 6
#>      id     w     x     y     z total
#>   <int> <int> <int> <int> <int> <dbl>
#> 1     1    10    20    30    40   101
#> 2     2    11    21    31    41   106
#> 3     3    12    22    32    42   111
#> 4     4    13    23    33    43   116
#> # … with 2 more rows
df %>% mutate(mean = rowMeans(across(where(is.numeric))))
#> # A tibble: 6 x 6
#>      id     w     x     y     z  mean
#>   <int> <int> <int> <int> <int> <dbl>
#> 1     1    10    20    30    40  20.2
#> 2     2    11    21    31    41  21.2
#> 3     3    12    22    32    42  22.2
#> 4     4    13    23    33    43  23.2
#> # … with 2 more rows
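Note that on the ungrouped df, where(is.numeric) also selects id, which is why the totals above start at 101 rather than 100. The sketch below restricts the selection to w:z and checks that the vectorised result agrees with the rowwise() version; the names `fast` and `slow` are introduced here only for the comparison.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)

# Vectorised: rowSums() receives the selected columns as one data frame
fast <- df %>% mutate(total = rowSums(across(w:z)))

# Row-wise: sum() is called once per row
slow <- df %>%
  rowwise(id) %>%
  mutate(total = sum(c_across(w:z))) %>%
  ungroup()

fast$total                                     # 100 104 108 112 116 120
all.equal(fast$total, as.numeric(slow$total))  # TRUE
```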
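List-columns
rowwise() operations are a natural pairing when you have list-columns. They let you avoid explicit loops and functions from the apply() or purrr::map() families.
Motivation
Imagine you have the following data frame and you want to compute the length of each element: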
df <- tibble(
  x = list(1, 2:3, 4:6)
)
You might try length():
df %>% mutate(l = length(x))
#> # A tibble: 3 x 2
#>   x             l
#>   <list>    <int>
#> 1 <dbl [1]>     3
#> 2 <int [2]>     3
#> 3 <int [3]>     3
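But that returns the length of the column, not the length of each individual value. If you are an R documentation aficionado, you may know that base R already has a function for exactly this case: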
df %>% mutate(l = lengths(x))
#> # A tibble: 3 x 2
#>   x             l
#>   <list>    <int>
#> 1 <dbl [1]>     1
#> 2 <int [2]>     2
#> 3 <int [3]>     3
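Or, if you are an experienced R programmer, you may know how to apply an operation to each element with sapply() or a similar function such as purrr::map_int():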
df %>% mutate(l = sapply(x, length))
#> # A tibble: 3 x 2
#>   x             l
#>   <list>    <int>
#> 1 <dbl [1]>     1
#> 2 <int [2]>     2
#> 3 <int [3]>     3
df %>% mutate(l = purrr::map_int(x, length))
#> # A tibble: 3 x 2
#>   x             l
#>   <list>    <int>
#> 1 <dbl [1]>     1
#> 2 <int [2]>     2
#> 3 <int [3]>     3
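But wouldn't it be nice if you could just write length(x) and have dplyr work out that you want the length of each element inside x? Since you have read this far, you can probably guess the answer: this is just another application of the row-wise pattern.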
df %>% 
  rowwise() %>% 
  mutate(l = length(x))
#> # A tibble: 3 x 2
#> # Rowwise: 
#>   x             l
#>   <list>    <int>
#> 1 <dbl [1]>     1
#> 2 <int [2]>     2
#> 3 <int [3]>     3
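The same pattern applies to any function of the individual elements, not only length(). A short sketch, with the column name `avg` and the object `out` chosen here for illustration:

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = list(1, 2:3, 4:6))

out <- df %>%
  rowwise() %>%
  mutate(
    l = length(x),  # length of the element in this row
    avg = mean(x)   # mean of the element in this row
  ) %>%
  ungroup()

out$l    # 1 2 3
out$avg  # 1.0 2.5 5.0
```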
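Subsetting
Before we continue, I want to briefly mention the magic that makes this work. It is not something you normally need to think about (it just works), but it is useful to know about when something goes wrong.
There is an important difference between a grouped data frame in which each group happens to have one row and a row-wise data frame in which every group always has one row. Take these two data frames as an example: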
df <- tibble(g = 1:2, y = list(1:3, "a"))
gf <- df %>% group_by(g)
rf <- df %>% rowwise(g)
If we compute some properties of y, the results look different:
gf %>% mutate(type = typeof(y), length = length(y))
#> # A tibble: 2 x 4
#> # Groups:   g [2]
#>       g y         type  length
#>   <int> <list>    <chr>  <int>
#> 1     1 <int [3]> list       1
#> 2     2 <chr [1]> list       1
rf %>% mutate(type = typeof(y), length = length(y))
#> # A tibble: 2 x 4
#> # Rowwise:  g
#>       g y         type      length
#>   <int> <list>    <chr>      <int>
#> 1     1 <int [3]> integer        3
#> 2     2 <chr [1]> character      1
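The key difference is that when mutate() slices up the columns to pass to length(y), the grouped mutate uses [ while the row-wise mutate uses [[. The following code illustrates the difference with a for loop: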
# grouped
out1 <- integer(2)
for (i in 1:2) {
  out1[[i]] <- length(df$y[i])
}
out1
#> [1] 1 1

# rowwise
out2 <- integer(2)
for (i in 1:2) {
  out2[[i]] <- length(df$y[[i]])
}
out2
#> [1] 3 1
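The typeof() results shown earlier follow from the same rule. In plain base R, with `kept` and `pulled` as illustrative names:

```r
y <- list(1:3, "a")

kept <- y[1]      # `[` keeps the list wrapper: a list holding one element
pulled <- y[[1]]  # `[[` extracts the element itself: an integer vector

typeof(kept)    # "list"
length(kept)    # 1
typeof(pulled)  # "integer"
length(pulled)  # 3
```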
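Note that this magic only applies when you refer to existing columns, not when you create new rows. This can be confusing, but we are fairly confident it is the least bad solution, particularly given the hint in the error message.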
gf %>% mutate(y2 = y)
#> # A tibble: 2 x 3
#> # Groups:   g [2]
#>       g y         y2       
#>   <int> <list>    <list>   
#> 1     1 <int [3]> <int [3]>
#> 2     2 <chr [1]> <chr [1]>
rf %>% mutate(y2 = y)
#> Error: Problem with `mutate()` input `y2`.
#> x Input `y2` can't be recycled to size 1.
#> ℹ Input `y2` is `y`.
#> ℹ Input `y2` must be size 1, not 3.
#> ℹ Did you mean: `y2 = list(y)` ?
#> ℹ The error occurred in row 1.
rf %>% mutate(y2 = list(y))
#> # A tibble: 2 x 3
#> # Rowwise:  g
#>       g y         y2       
#>   <int> <list>    <list>   
#> 1     1 <int [3]> <int [3]>
#> 2     2 <chr [1]> <chr [1]>
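❝
Translator's note: in the second example, y has already been unwrapped from its list, so it has to be wrapped in list() again.
❞
Modelling
rowwise() data frames let us solve many modelling problems in a particularly elegant way. We start by creating a nested data frame: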
by_cyl <- mtcars %>% nest_by(cyl)
#> `summarise()` ungrouping output (override with `.groups` argument)
by_cyl
#> # A tibble: 3 x 2
#> # Rowwise:  cyl
#>     cyl data              
#>   <dbl> <list>            
#> 1     4 <tibble [11 × 12]>
#> 2     6 <tibble [7 × 12]> 
#> 3     8 <tibble [14 × 12]>
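This is a little different from the usual group_by() output: we have visibly changed the structure of the data. We now have three rows (one per group) and a list-column, data, that stores the data for each group. Also note that the output is rowwise(); this matters because it makes working with that list of data frames much easier.
Once we have one data frame per row, it is straightforward to fit one model per row: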
mods <- by_cyl %>% mutate(mod = list(lm(mpg ~ wt, data = data)))
mods
#> # A tibble: 3 x 3
#> # Rowwise:  cyl
#>     cyl data               mod   
#>   <dbl> <list>             <list>
#> 1     4 <tibble [11 × 12]> <lm>  
#> 2     6 <tibble [7 × 12]>  <lm>  
#> 3     8 <tibble [14 × 12]> <lm>
And supplement that with one set of predictions per row:
mods <- mods %>% mutate(pred = list(predict(mod, data)))
mods
#> # A tibble: 3 x 4
#> # Rowwise:  cyl
#>     cyl data               mod    pred      
#>   <dbl> <list>             <list> <list>    
#> 1     4 <tibble [11 × 12]> <lm>   <dbl [11]>
#> 2     6 <tibble [7 × 12]>  <lm>   <dbl [7]> 
#> 3     8 <tibble [14 × 12]> <lm>   <dbl [14]>
You can then summarise the model in a variety of ways:
mods %>% summarise(rmse = sqrt(mean((pred - data$mpg) ^ 2)))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 3 x 2
#> # Groups:   cyl [3]
#>     cyl  rmse
#>   <dbl> <dbl>
#> 1     4 3.01 
#> 2     6 0.985
#> 3     8 1.87
mods %>% summarise(rsq = summary(mod)$r.squared)
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 3 x 2
#> # Groups:   cyl [3]
#>     cyl   rsq
#>   <dbl> <dbl>
#> 1     4 0.509
#> 2     6 0.465
#> 3     8 0.423
mods %>% summarise(broom::glance(mod))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 3 x 13
#> # Groups:   cyl [3]
#>     cyl r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#>   <dbl>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1     4     0.509         0.454  3.33      9.32  0.0137     1 -27.7   61.5  62.7
#> 2     6     0.465         0.357  1.17      4.34  0.0918     1  -9.83  25.7  25.5
#> 3     8     0.423         0.375  2.02      8.80  0.0118     1 -28.7   63.3  65.2
#> # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Or easily access the parameters of each model:
mods %>% summarise(broom::tidy(mod))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 6 x 6
#> # Groups:   cyl [3]
#>     cyl term        estimate std.error statistic    p.value
#>   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
#> 1     4 (Intercept)    39.6       4.35      9.10 0.00000777
#> 2     4 wt             -5.65      1.85     -3.05 0.0137    
#> 3     6 (Intercept)    28.4       4.18      6.79 0.00105   
#> 4     6 wt             -2.78      1.33     -2.08 0.0918    
#> # … with 2 more rows
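If only a single coefficient is needed, it can be pulled out with base R's coef() instead of broom. The sketch below rebuilds the models from mtcars so that it is self-contained; `slopes` is a name introduced here, and `.groups = "drop"` is used only to silence the regrouping message.

```r
library(dplyr, warn.conflicts = FALSE)

slopes <- mtcars %>%
  nest_by(cyl) %>%                                 # one row per cyl, data in a list-column
  mutate(mod = list(lm(mpg ~ wt, data = data))) %>%  # one model per row
  summarise(slope = coef(mod)[["wt"]], .groups = "drop")

slopes
```

The slopes for the 4- and 6-cylinder groups should match the wt estimates in the broom::tidy() output above (about -5.65 and -2.78).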
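Repeated function calls
rowwise() does not only work with functions that return a length-1 vector (also known as summary functions); it works with any function as long as the result is wrapped in a list. This means that rowwise() and mutate() provide an elegant way to call a function many times with varying arguments and to store the outputs alongside the inputs.
Simulations
I think this is a particularly elegant way to run simulations, because it lets you store the simulated values together with the parameters that generated them. For example, imagine you have the following data frame, which describes the properties of 3 samples from the uniform distribution: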
df <- tribble(
  ~ n, ~ min, ~ max,
    1,     0,     1,
    2,    10,   100,
    3,   100,  1000,
)
You can supply these parameters to runif() by using rowwise() and mutate():
df %>% 
  rowwise() %>% 
  mutate(data = list(runif(n, min, max)))
#> # A tibble: 3 x 4
#> # Rowwise: 
#>       n   min   max data     
#>   <dbl> <dbl> <dbl> <list>   
#> 1     1     0     1 <dbl [1]>
#> 2     2    10   100 <dbl [2]>
#> 3     3   100  1000 <dbl [3]>
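Note the use of list() here: runif() returns multiple values, whereas a mutate() expression has to return something of length 1. list() means we get a list-column in which each row is a list containing multiple values. If you forget to use list(), dplyr gives you a hint: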
df %>% 
  rowwise() %>% 
  mutate(data = runif(n, min, max))
#> Error: Problem with `mutate()` input `data`.
#> x Input `data` can't be recycled to size 1.
#> ℹ Input `data` is `runif(n, min, max)`.
#> ℹ Input `data` must be size 1, not 2.
#> ℹ Did you mean: `data = list(runif(n, min, max))` ?
#> ℹ The error occurred in row 2.
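Because the result stays row-wise, a follow-up mutate() can inspect each simulated vector directly. A minimal sketch, assuming an arbitrary seed and the illustrative names `params`, `sims`, `len`, and `in_range`:

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(42)  # arbitrary seed, only to make the draws repeatable

params <- tribble(
  ~n, ~min, ~max,
   1,    0,    1,
   2,   10,  100,
   3,  100, 1000
)

sims <- params %>%
  rowwise() %>%
  mutate(data = list(runif(n, min, max))) %>%
  # data is now an existing list-column, so each row sees its own vector
  mutate(
    len = length(data),
    in_range = all(data >= min & data <= max)
  ) %>%
  ungroup()

sims$len       # 1 2 3
sims$in_range  # TRUE TRUE TRUE
```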
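Repeated combinations
What if you want to call a function for every combination of inputs? You can use expand.grid() (or tidyr::expand_grid()) to generate the data frame and then repeat the pattern above: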
df <- expand.grid(mean = c(-1, 0, 1), sd = c(1, 10, 100))

df %>% 
  rowwise() %>% 
  mutate(data = list(rnorm(10, mean, sd)))
#> # A tibble: 9 x 3
#> # Rowwise: 
#>    mean    sd data      
#>   <dbl> <dbl> <list>    
#> 1    -1     1 <dbl [10]>
#> 2     0     1 <dbl [10]>
#> 3     1     1 <dbl [10]>
#> 4    -1    10 <dbl [10]>
#> # … with 5 more rows
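As a quick sanity check on this pattern, the sketch below confirms that expand.grid() produces one row per combination (3 x 3 = 9, with the first argument varying fastest) and that every row ends up with its own 10 draws. The names `grid`, `sims`, and `n_draws` are introduced here for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

grid <- expand.grid(mean = c(-1, 0, 1), sd = c(1, 10, 100))

sims <- grid %>%
  rowwise() %>%
  mutate(data = list(rnorm(10, mean, sd))) %>%
  mutate(n_draws = length(data)) %>%  # length of each row's sample
  ungroup()

nrow(sims)          # 9
unique(sims$n_draws)  # 10
```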
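Varying functions
In more complicated problems, you may also want to vary the function being called. This is a somewhat more awkward fit for the approach, because the columns of the input tibble are less regular. It is still possible, though, and it is a natural place to use do.call():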
df <- tribble(
   ~rng,     ~params,
   "runif",  list(n = 10), 
   "rnorm",  list(n = 20),
   "rpois",  list(n = 10, lambda = 5),
) %>%
  rowwise()

df %>% 
  mutate(data = list(do.call(rng, params)))
#> # A tibble: 3 x 3
#> # Rowwise: 
#>   rng   params           data      
#>   <chr> <list>           <list>    
#> 1 runif <named list [1]> <dbl [10]>
#> 2 rnorm <named list [1]> <dbl [20]>
#> 3 rpois <named list [2]> <int [10]>
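For readers less familiar with do.call(): it calls a function, given either as an object or by name, with its arguments taken from a list. The base R sketch below shows that this is equivalent to writing the call out directly; the seed and the names `direct` and `via_do_call` are arbitrary choices for the example.

```r
set.seed(1)
direct <- runif(n = 3, min = 0, max = 1)

set.seed(1)
# Function given by name, arguments supplied as a named list
via_do_call <- do.call("runif", list(n = 3, min = 0, max = 1))

identical(direct, via_do_call)  # TRUE
```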
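Previously
rowwise()
rowwise() was also in the questioning stage for quite some time, partly because I did not appreciate how many people needed a native way to compute summaries across multiple variables for each row. As an alternative, we recommended performing row-wise operations with the purrr map() functions. That was challenging, however, because you had to pick a map function based on the number of varying arguments and the type of the result, which required considerable knowledge of purrr.
I also resisted rowwise() because I felt that automatically switching from [ to [[ was too magical, in the same way that automatically list()-ing results made do() too magical. I have now persuaded myself that the row-wise magic is good magic, partly because most people find the distinction between [ and [[ mystifying, and rowwise() means you do not have to think about it.
Since rowwise() is clearly useful, it is no longer in the questioning stage, and we expect it to be around for the long term.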
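do()
We have questioned the need for do() for quite some time, because it never felt very similar to the other dplyr verbs. It had two main modes of operation:
• Without argument names: you could call functions that take and return data frames, using . to refer to the "current" group. For example, the following code gets the first row of each group: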
mtcars %>% 
  group_by(cyl) %>% 
  do(head(., 1))
#> # A tibble: 3 x 13
#> # Groups:   cyl [3]
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  cyl2  cyl4
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1     8    16
#> 2  21       6   160   110  3.9   2.62  16.5     0     1     4     4    12    24
#> 3  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2    16    32
This has been superseded by cur_data() together with the more permissive summarise(), which can now create multiple columns and multiple rows.
mtcars %>% 
  group_by(cyl) %>% 
  summarise(head(cur_data(), 1))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 13
#>     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb  cyl2  cyl4
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  22.8   108    93  3.85  2.32  18.6     1     1     4     1     8    16
#> 2     6  21     160   110  3.9   2.62  16.5     0     1     4     4    12    24
#> 3     8  18.7   360   175  3.15  3.44  17.0     0     0     3     2    16    32
• With arguments: it worked like mutate() but automatically wrapped every element in a list:
mtcars %>% 
  group_by(cyl) %>% 
  do(nrows = nrow(.))
#> # A tibble: 3 x 2
#> # Rowwise: 
#>     cyl nrows    
#>   <dbl> <list>   
#> 1     4 <int [1]>
#> 2     6 <int [1]>
#> 3     8 <int [1]>
I now believe this behaviour is both too magical and not very useful; it can be replaced by summarise() and cur_data().
mtcars %>% 
  group_by(cyl) %>% 
  summarise(nrows = nrow(cur_data()))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>     cyl nrows
#>   <dbl> <int>
#> 1     4    11
#> 2     6     7
#> 3     8    14
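If needed (unlike here), you can wrap the result in a list yourself.
The addition of cur_data()/across() and the increased scope of summarise() mean that do() is no longer needed, so it is now superseded.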
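For these two particular tasks there are also more direct verbs that avoid both do() and cur_data(). This is an alternative sketch rather than the approach described above; it assumes dplyr 1.0.0 or later, where slice_head() is available. It may also be worth knowing that in more recent dplyr releases (1.1.0 and later, as far as I am aware) cur_data() is itself deprecated in favour of pick(). The names `first_rows` and `group_sizes` are introduced here for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# First row of each group, without do() or cur_data()
first_rows <- mtcars %>%
  group_by(cyl) %>%
  slice_head(n = 1) %>%
  ungroup()

# Number of rows per group, using n()
group_sizes <- mtcars %>%
  group_by(cyl) %>%
  summarise(nrows = n(), .groups = "drop")

group_sizes$nrows  # 11 7 14
```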