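Learning R: Data Visualization Skills, Pair Plots, Logistic Regression by Hand, dplyr's across() Function, and Learning ggplot2 Quickly

2022-01-01 R

Here are my notes from this week of learning R.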

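Data visualization skills in R

The purpose of data visualization is discovery and communication.

R is well suited to data visualization.

Data visualization is one of the core skills of data science work.

I enjoy learning and practicing data visualization in R to improve my own skills.

How would you design and build the following figure with ggplot2: a scatter plot of hourly wage against years of experience, colored by sex, with one linear fit per group and one panel per sector?

Reference code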

# Install mosaicData on first use; it provides the CPS85 data set
if (!requireNamespace("mosaicData", quietly = TRUE)) {
  install.packages("mosaicData")
}
library(dplyr)
library(ggplot2)

# Load CPS85 explicitly so the code also works right after installation
data(CPS85, package = "mosaicData")

# Keep only observations with an hourly wage below 40
plotdata <- CPS85 %>%
  filter(wage < 40)

ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              se = FALSE,
              linewidth = 1.5) +  # use size = 1.5 with ggplot2 < 3.4.0
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender") +
  theme_minimal()

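This figure draws on the following ggplot2 building blocks.

1) The ggplot() function

2) geoms

3) scales

4) facets

5) labs

6) themes

Once we are familiar with these parts of ggplot2, we can use them to design and build a wide range of useful graphics that help us gain insight from data and communicate more effectively.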
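As a compact illustration of the six building blocks listed above, here is a minimal sketch that uses R's built-in mtcars data instead of CPS85, so it runs without installing mosaicData. The choice of variables, colors, and labels is my own assumption and is only meant to show where each building block appears.

```r
library(ggplot2)

# Built-in mtcars data; cyl and am become factors so they can drive color and facets.
# In mtcars, am = 0 is automatic and am = 1 is manual.
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("automatic", "manual")))

p <- ggplot(data = cars,                                  # 1) ggplot(): data + mapping
            mapping = aes(x = wt, y = mpg, color = cyl)) +
  geom_point(alpha = 0.7, size = 3) +                     # 2) geom: points
  scale_x_continuous(breaks = seq(1, 6, 1)) +             # 3) scales: axis breaks, colors
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue",
                                "darkgreen")) +
  facet_wrap(~am) +                                       # 4) facet: one panel per transmission
  labs(title = "Fuel economy and weight",                 # 5) labs: titles and axis labels
       x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       color = "Cylinders") +
  theme_minimal()                                         # 6) theme: overall look

p
```

Removing one line at a time and redrawing is a quick way to see what each building block contributes.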

Reference:

https://rkabacoff.github.io/datavis/IntroGGPLOT.html

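How do we draw a pair plot with the tidyverse?

A pair plot is well suited to exploratory analysis of the relationships among several variables.

We draw the pair plot with the tidyverse packages.

Here we use the Palmer penguins data set.

[Figure: metadata description of the penguins data set]

Reference code for a pair plot of the penguins data set.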

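library(palmerpenguins)
library(tidyverse)

# Inspect the penguins data set
penguins %>% glimpse()
penguins %>%
  slice_head(n = 10) %>%
  View()
# Data preparation: add a row id, turn year into a factor so that it is
# not treated as a numeric variable, and keep only the numeric columns
df <- penguins %>%
  rowid_to_column() %>%
  mutate(year = factor(year)) %>%
  select(where(is.numeric))
df %>% glimpse()
df %>%
  slice_head(n = 10) %>%
  View()
# Reshape to long format, then self-join on rowid so that every variable
# is paired with every variable for the same penguin
df1 <- df %>%
  pivot_longer(cols = -rowid) %>%
  full_join(., ., by = "rowid")

# Attach the species so it can be used for coloring
df2 <- df1 %>%
  left_join(penguins %>%
              rowid_to_column() %>%
              select(rowid, species),
            by = "rowid")

# One panel per pair of variables
df2 %>%
  drop_na() %>%
  ggplot(aes(x = value.x, y = value.y, color = species)) +
  geom_point(alpha = 0.5) +
  facet_wrap(name.x ~ name.y, scales = "free") +
  theme(axis.title = element_blank(),
        legend.position = "bottom")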

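[Figure: resulting pair plot of the penguins data set, colored by species]

Key takeaways:

1 Use dplyr and tidyr for data preparation

2 Use ggplot2 for data visualization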
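The data preparation step relies on a self-join, which can be hard to picture. The sketch below applies the same pivot_longer() plus full_join() idea to a tiny made-up table (the values are hypothetical) so the shape of the result is easy to inspect. Note that dplyr 1.1.0 and later warns about the many-to-many match in this join; passing relationship = "many-to-many" silences it on those versions.

```r
library(dplyr)
library(tidyr)

# Toy data: 3 observations and 2 numeric variables (hypothetical values).
toy <- tibble(rowid = 1:3,
              a = c(1, 2, 3),
              b = c(10, 20, 30))

# Long format: one row per (rowid, variable).
toy_long <- toy %>%
  pivot_longer(cols = -rowid)

# Self-join on rowid: for each observation, every variable is paired with
# every variable, giving name.x/value.x and name.y/value.y columns.
toy_pairs <- full_join(toy_long, toy_long, by = "rowid")

toy_pairs
```

Each combination of name.x and name.y then becomes one panel of the pair plot, with value.x on the horizontal axis and value.y on the vertical axis.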


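Logistic regression

Logistic regression is a model I use frequently in my day-to-day work.

[Figure: handwritten derivation of the logistic regression model]

Question to think about:

1 How should we interpret the cost function of the logistic regression model?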
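One way to approach the question is to code the cost function directly. The sketch below is my own illustration, not the handwritten derivation from the figure: it assumes the usual cross-entropy cost, J(beta) = -(1/n) * sum(y * log(p) + (1 - y) * log(1 - p)) with p = sigmoid(X beta), minimizes it with base R's optim(), and compares the result with glm(). The mtcars variables am and wt are an arbitrary choice of example data.

```r
# Sigmoid maps a linear score to a probability in (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# Average negative log-likelihood (cross-entropy) for coefficients beta.
# Probabilities are clamped slightly away from 0 and 1 to avoid log(0).
cost <- function(beta, X, y) {
  p <- sigmoid(X %*% beta)
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Analytic gradient of the unclamped cost with respect to beta.
gradient <- function(beta, X, y) {
  p <- sigmoid(X %*% beta)
  as.vector(t(X) %*% (p - y)) / nrow(X)
}

# Example data: transmission type (am, 0/1) explained by weight (wt).
X <- cbind(intercept = 1, wt = mtcars$wt)
y <- mtcars$am

# Minimize the cost numerically, starting from all-zero coefficients.
fit_hand <- optim(par = c(0, 0), fn = cost, gr = gradient,
                  X = X, y = y, method = "BFGS",
                  control = list(reltol = 1e-12, maxit = 500))

# Reference fit from base R.
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients from both approaches, side by side.
rbind(hand = fit_hand$par, glm = coef(fit_glm))
```

Read this way, the cost is the average penalty for assigning low probability to the class that actually occurred, and minimizing it is the same as maximizing the likelihood that glm() maximizes.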

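dplyr's across() function

The across() function in dplyr is powerful, and I use it often.

With across() we can apply the same operation to several columns at once and name the resulting columns with glue syntax.

An example of using across().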
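As a small offline illustration of that idea, here is a minimal sketch on the built-in iris data (my choice of data set, so no download is needed). It averages every numeric column per species and uses a glue pattern in .names to build the output column names.

```r
library(dplyr)

# Mean of every numeric column per species; the glue pattern in .names
# appends "_mean" to each original column name.
iris_means <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "{.col}_mean"),
            .groups = "drop")

names(iris_means)
```

The lambda form ~ mean(.x, na.rm = TRUE) keeps the extra argument next to the function it belongs to, which is the style current dplyr documentation recommends over passing na.rm through across() itself.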

# The across() function in dplyr
library(readr)
library(dplyr)
ac_items <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/items.csv')

ac_items %>% glimpse()
ac_items %>%
  slice_head(n = 100) %>%
  View()

# Summary statistics without across():
# mean of sell_value and buy_value per category
ac_items %>%
  group_by(category) %>%
  summarise(sell_value = mean(sell_value, na.rm = TRUE),
            buy_value = mean(buy_value, na.rm = TRUE),
            .groups = "drop")

# The same summary with across()
# (a lambda is used because passing extra arguments such as na.rm
#  through across() is deprecated in dplyr >= 1.1.0)
ac_items %>%
  group_by(category) %>%
  summarise(across(c(sell_value, buy_value), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")

# Each value as a proportion of the maximum within its category
ac_items %>%
  group_by(category) %>%
  mutate(across(c(sell_value, buy_value), ~ .x / max(.x, na.rm = TRUE),
                .names = "{.col}_prop")) %>%
  select(category, ends_with("prop"))

# Rename the output columns
# with the .names argument of across()
ac_items %>%
  group_by(category) %>%
  summarise(across(c(sell_value, buy_value), ~ mean(.x, na.rm = TRUE),
                   .names = "{.col}_mean"))

# Note: .names is written in glue syntax

# Apply several functions at once
ac_items %>%
  group_by(category) %>%
  summarise(across(c(sell_value, buy_value),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd = ~ sd(.x, na.rm = TRUE)),
                   .names = "{.col}_{.fn}"))

# Select columns whose names match a pattern with contains()
ac_items %>%
  group_by(category) %>%
  summarise(across(contains("value"),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "{.col}_mean"))
# Select all numeric columns with where(is.numeric)
ac_items %>%
  group_by(category) %>%
  summarise(across(where(is.numeric),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "{.col}_mean"))

# Define a custom summary function
summarizer <- function(data, numeric_cols = NULL, ...) {
  data %>%
    group_by(...) %>%
    summarise(across({{numeric_cols}}, list(
      mean = ~ mean(.x, na.rm = TRUE),
      sd = ~ sd(.x, na.rm = TRUE),
      q05 = ~ quantile(.x, 0.05, na.rm = TRUE),
      q95 = ~ quantile(.x, 0.95, na.rm = TRUE)
    ), .names = "{.col}_{.fn}"))
}

summarizer(ac_items, numeric_cols = c(sell_value, buy_value), category)

[Figure: output of the custom summary function]

Reference:

https://willhipson.netlify.app/post/dplyr_across/dplyr_across/

I recommend mastering this function.

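How can we learn ggplot2 for data visualization quickly?

Data visualization with ggplot2 rests on three core components.

Let's start from the term "data visualization" itself. It has three parts:

the data

the visualization

the mapping between the data and the visualization

These correspond to the three basic components of ggplot2:

data

geoms

mapping = aes()

Once the three basic components have produced a basic plot, the rest is a matter of refining it step by step according to actual needs until we are satisfied.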


We can work through the following code step by step to see how to get to a usable plot quickly.

# How to learn plotting with ggplot2 quickly
library(dslabs)
library(dplyr)
library(ggplot2)

data("murders")
murders %>%
  glimpse()

# 1) The ggplot object
p <- ggplot(data = murders)
class(p)

# 2) Geometries + aesthetic mappings
# Draw a scatter plot
murders %>% ggplot() +
  geom_point(aes(x = population/10^6, y = total))

# 3) Layers
p + geom_point(aes(x = population/10^6, y = total)) +
  geom_text(aes(population/10^6, total, label = abb))

# 4) Global versus local aesthetic mappings
p <- murders %>% ggplot(aes(population/10^6, total, label = abb))

# 5) Scales
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10")
# or, equivalently
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10()

# 6) Labels and titles
p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")

# 7) Categories as colors
p <- murders %>% ggplot(aes(population/10^6, total, label = abb)) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")
# A fixed color for all points
p + geom_point(size = 3, color = "blue")

# Color mapped to the region variable
p + geom_point(aes(col = region), size = 3)

# 8) Annotation, shapes, and adjustments
# National average rate: murders per million people
r <- murders %>%
  summarize(rate = sum(total) / sum(population) * 10^6) %>%
  pull(rate)
r

# Reference line for the national average rate (slope defaults to 1)
p + geom_point(aes(col = region), size = 3) +
  geom_abline(intercept = log10(r))

# Draw the line first so the points are plotted on top of it
p <- p + geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
  geom_point(aes(col = region), size = 3)

p <- p + scale_color_discrete(name = "Region")

# 9) Themes
ds_theme_set()
library(ggthemes)
p + theme_economist()

# 10) The complete code, tidied up
library(dslabs)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(ggrepel)
data(murders)

r <- murders %>%
  summarize(rate = sum(total) / sum(population) * 10^6) %>%
  pull(rate)

murders %>% ggplot(aes(population/10^6, total, label = abb)) +
  geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
  geom_point(aes(col = region), size = 3) +
  geom_text_repel() +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010") +
  scale_color_discrete(name = "Region") +
  theme_economist()

[Figure: plot produced by the complete code]

Reference:
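The reference line uses intercept = log10(r) with the default slope of 1, which may look surprising at first. If a state had exactly the national rate, its total would be r times its population in millions, so on log10 axes log10(total) = log10(r) + log10(population). The sketch below checks this numerically in base R; the rate of 30 and the population values are made up for illustration and are not taken from the murders data.

```r
# Hypothetical national rate: murders per million people (not the real value).
r <- 30

# Hypothetical populations in millions, spread over two orders of magnitude.
pop_millions <- c(0.5, 1, 5, 10, 50)

# Expected totals if every state had exactly the national rate.
expected_total <- r * pop_millions

# On log10 axes these points fall on a straight line.
line_fit <- lm(log10(expected_total) ~ log10(pop_millions))
coef(line_fit)

# The intercept equals log10(r) and the slope equals 1, up to floating-point error.
all.equal(unname(coef(line_fit)), c(log10(r), 1))
```

States above the dashed line therefore have a murder rate higher than the national average, and states below it have a lower one.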

https://rafalab.github.io/dsbook/ggplot2.html

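Learning and applying the tidymodels package

Part 1: Getting to know tidymodels: what tidymodels is, how to install and load it, its ecosystem and commonly used functions, plus a simple example.

Part 2: Linear regression with tidymodels

1 Derivation of the linear regression model

2 A linear regression case study with tidymodels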

# 02 Linear regression case study
library(tidyverse)
library(tidymodels)
library(vip)

# Load the data sets
advertising <- read_rds(url('https://gmudatamining.com/data/advertising.rds'))
home_sales <- read_rds(url('https://gmudatamining.com/data/home_sales.rds')) %>%
  select(-selling_date)

advertising %>% glimpse()
home_sales %>% glimpse()

# Split the data
# into a training set and a test set
set.seed(314)

advertising_split <- initial_split(advertising, prop = 0.75,
                                   strata = Sales)
# Training set
advertising_training <- advertising_split %>%
  training()
# Test set
advertising_test <- advertising_split %>%
  testing()

# Specify the model with parsnip, which gives a unified interface
# Pick a model type
# Set the engine
# Set the mode (either regression or classification)

lm_model <- linear_reg() %>%
  set_engine('lm') %>% # adds lm implementation of linear regression
  set_mode('regression')

lm_model

# Fit the linear regression model on the training set
# with parsnip's fit() function
# It takes three arguments:

# a parsnip model object specification
# a model formula
# a data frame with the training data

lm_fit <- lm_model %>%
  fit(Sales ~ ., data = advertising_training)
lm_fit

# Explore the fitted object
names(lm_fit)

summary(lm_fit$fit)

# Diagnostic plots for the fitted regression model
par(mfrow=c(2, 2))
plot(lm_fit$fit, pch = 16, col = "#006EA1")

# Tidy summaries of the fit
# tidy() returns the coefficient table
# and
# glance() returns model-level statistics
# (both generics are attached by library(tidymodels))
tidy(lm_fit)
glance(lm_fit)

# Variable importance
vip(lm_fit)

# Evaluate predictive performance on the test set,
# that is, how well the model generalizes
# using parsnip's predict() function, which takes:

# a trained parsnip model object
# new_data for which to generate predictions

predict(lm_fit, new_data = advertising_test)

advertising_test_results <- predict(lm_fit, new_data = advertising_test) %>%
  bind_cols(advertising_test)
advertising_test_results

# Compute RMSE and R^2 on the test set
# with yardstick's rmse() and rsq() functions

yardstick::rmse(advertising_test_results,
     truth = Sales,
     estimate = .pred)

yardstick::rsq(advertising_test_results,
    truth = Sales,
    estimate = .pred)

# Visualize the results
# R^2 plot
# Ideally the points lie on the line y = x
ggplot(data = advertising_test_results,
       mapping = aes(x = .pred, y = Sales)) +
  geom_point(color = '#006EA1') +
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Linear Regression Results - Advertising Test Set',
       x = 'Predicted Sales',
       y = 'Actual Sales')

# Upgraded version:
# build a machine learning workflow
# Step 1: Split our data
set.seed(314)

# Create a split object
advertising_split <- initial_split(advertising, prop = 0.75,
                                   strata = Sales)

# Build training data set
advertising_training <- advertising_split %>%
  training()

# Build testing data set
advertising_test <- advertising_split %>%
  testing()

# Step 2: Feature engineering
advertising_recipe <- recipe(Sales ~ ., data = advertising_training) %>%
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())

# Step 3: Specify a model
lm_model <- linear_reg() %>%
  set_engine('lm') %>%
  set_mode('regression')

# Step 4: Create the workflow
# using the workflows package
# We start with workflow() to create an empty workflow and then add our model and recipe with add_model() and add_recipe().
advertising_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(advertising_recipe)

# Step 5: Run the workflow
# last_fit() trains on the training set and evaluates on the test set
advertising_fit <- advertising_workflow %>%
  last_fit(split = advertising_split)

# Model performance on the test set
advertising_fit %>% collect_metrics()

# Test set predictions
# Obtain test set predictions data frame
test_results <- advertising_fit %>%
  collect_predictions()
# View results
test_results

[Figure: partial results]

Reference:
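To make the two test-set metrics concrete, the sketch below computes them by hand in base R. It uses the built-in mtcars data as a stand-in (my assumption, since the advertising data has to be downloaded) and a plain random split rather than a stratified one. RMSE is the square root of the mean squared prediction error, and R^2 is computed as the squared correlation between truth and prediction, which is how yardstick documents rsq().

```r
set.seed(314)

# Hold out 8 of the 32 rows (25%) as a test set; simple random split, no strata.
test_idx <- sample(nrow(mtcars), size = 8)
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# Fit on the training rows only, then predict the held-out rows.
fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# RMSE: square root of the mean squared prediction error.
rmse_hand <- sqrt(mean((test$mpg - pred)^2))

# R^2 as the squared correlation between observed and predicted values.
rsq_hand <- cor(test$mpg, pred)^2

c(rmse = rmse_hand, rsq = rsq_hand)
```

RMSE is expressed in the units of the outcome, so it is easier to interpret in context, while R^2 is unitless and easier to compare across outcomes.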

https://www.gmudatamining.com/lesson-10-r-tutorial.html
