Data preprocessing with scikit-learn

2021-03-02 Datawhale

A more advanced introduction to scikit-learn

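Introduction
Why this tutorial?
1. Basic use case: training and testing a classifier
Exercise
2. A more advanced use case: preprocessing the data before training and testing a classifier
2.1 Standardizing your data
2.2 Wrong preprocessing patterns
2.3 Keep it simple, stupid: using scikit-learn's pipeline connector
Exercise
3. When more is better than less: cross-validation instead of a single split
Exercise
4. Hyperparameter optimization: fine-tuning the inside of a pipeline
Exercise
5. Summary: my scikit-learn pipeline in fewer than 10 lines of code (skipping the import statements)
6. Heterogeneous data: when you work with data other than numerical
Exercise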

Introduction

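It has been several days since my last post, long enough that it felt as if I had gone missing; I was away celebrating the New Year, haha. A while ago 黃大大 shared a machine learning link, and when I got back yesterday I found it very good, so today I started translating it and writing it up, which is how this article came about. One more thing: I have opened a 光城 translated-documents project in my own repository (follow the link to the original post to reach the repository), and I look forward to your contributions! Now, on to my first article of 2019!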

This article is a translation of:

https://github.com/glemaitre/pyparis-2018-sklearn


Enabling inline mode

Several figures are drawn in this tutorial, so we activate matplotlib to display them inline in the notebook.

%matplotlib inline
import matplotlib.pyplot as plt

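Why this tutorial?

scikit-learn provides state-of-the-art machine learning algorithms. These algorithms, however, cannot be used directly on raw data; the raw data must be preprocessed first. So, in addition to machine learning algorithms, scikit-learn offers a set of preprocessing methods. It also provides connectors for pipelining these estimators (transformers, regressors, classifiers, clusterers, and so on). This tutorial presents the tools for pipelining estimators, evaluating those pipelines, tuning them with hyperparameter optimization, and creating complex preprocessing steps.

1. Basic use case: training and testing a classifier

For this first example, we will train and test a classifier on a dataset. We will use this example to recall the scikit-learn API.

We will use the digits dataset, a dataset of handwritten digits.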

from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

Each row of X contains the intensities of the 64 pixels of an image. For each sample in X, y gives the digit that was written.

plt.imshow(X[0].reshape(8, 8), cmap='gray');

plt.axis('off')

print('The digit in the image is {}'.format(y[0]))

Output:

The digit in the image is 0

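In machine learning, we should evaluate a model by training and testing it on different sets of data. train_test_split is a utility function that splits the data into two independent sets. The stratify parameter forces the class distribution of the training and test sets to match the class distribution of the whole dataset.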

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
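The effect of stratify can be checked directly. The following is a minimal sketch, not part of the original tutorial, that compares the class proportions of each subset with those of the full digits dataset; the variable names full_dist, train_dist, test_dist and max_gap are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Fraction of samples belonging to each digit class in each subset.
full_dist = np.bincount(y) / y.size
train_dist = np.bincount(y_train) / y_train.size
test_dist = np.bincount(y_test) / y_test.size

# With stratify=y, the largest gap between proportions should be very small.
max_gap = max(np.abs(train_dist - full_dist).max(),
              np.abs(test_dist - full_dist).max())
print('Largest difference in class proportion: {:.4f}'.format(max_gap))
```

Dropping stratify=y from the call and rerunning the comparison is a quick way to see how much the proportions can drift with a purely random split.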

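Once we have independent training and test sets, we can learn a machine learning model with the fit method. We then evaluate the model with the score method, which relies on the default accuracy metric.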

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=5000, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Accuracy score of the LogisticRegression is 0.95

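The scikit-learn API is consistent across classifiers. We can therefore easily replace the LogisticRegression classifier with a RandomForestClassifier. The change is minimal and only concerns the creation of the classifier instance.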

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Output:

Accuracy score of the RandomForestClassifier is 0.96
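Because the API is shared, the same training and scoring code can be run over a list of classifiers. The sketch below is an illustration added here, with the accuracies dictionary as a name of my own choosing; it reuses the two classifiers from this section with slightly simplified constructor arguments.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

classifiers = [
    LogisticRegression(max_iter=5000, random_state=42),
    RandomForestClassifier(n_estimators=100, random_state=42),
]

# Every classifier exposes the same fit / score methods.
accuracies = {}
for clf in classifiers:
    clf.fit(X_train, y_train)
    accuracies[clf.__class__.__name__] = clf.score(X_test, y_test)

for name, accuracy in accuracies.items():
    print('Accuracy score of the {} is {:.2f}'.format(name, accuracy))
```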

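Exercise

Complete the following exercise:

2. A more advanced use case: preprocessing the data before training and testing a classifier

2.1 Standardizing your data

Preprocessing may be required before learning a model. For example, a user may want to create hand-crafted features, or an algorithm may make some a priori assumptions about the data. In our case, the solver used by LogisticRegression expects the data to be normalized. We therefore need to scale the data before training the model. To observe this requirement, we will check the number of iterations needed to train the model.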

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=5000, random_state=42)
clf.fit(X_train, y_train)
print('{} required {} iterations to be fitted'.format(clf.__class__.__name__, clf.n_iter_[0]))

Output:

LogisticRegression required 1841 iterations to be fitted

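The MinMaxScaler transformer is used to normalize the data. This scaler should be applied as follows: learn the statistics on the training set (the fit method), then scale both the training set and the test set (the transform method). Finally, we train and test the model on the scaled data.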

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_scaled, y_train)
accuracy = clf.score(X_test_scaled, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))
print('{} required {} iterations to be fitted'.format(clf.__class__.__name__, clf.n_iter_[0]))

Output:

Accuracy score of the LogisticRegression is 0.96
LogisticRegression required 190 iterations to be fitted

With normalized data, the model converges much faster than with the unscaled data: 190 iterations instead of 1841.
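To make the fit/transform split concrete, here is a small sketch on made-up one-column arrays (the values are hypothetical and chosen only for illustration). The scaler learns the minimum and maximum from the training data and reuses them on the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one feature, three samples per set.
X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[0.0], [4.0], [7.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min=1 and max=5
X_test_scaled = scaler.transform(X_test)        # reuses those statistics

print(scaler.data_min_, scaler.data_max_)
print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

The scaled test values here are -0.25, 0.75 and 1.5: they can fall outside [0, 1] because the range was learned from the training set only. That is expected behaviour and not a sign that the scaler should be refitted on the test set.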

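2.2 Wrong preprocessing patterns

We have shown how to preprocess the data and train a machine learning model properly. It is also instructive to look at wrong ways of preprocessing data. There are two potential mistakes that are easy to make but also easy to spot.

The first pattern is to scale the whole dataset before splitting it into training and test sets.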

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_train_prescaled, X_test_prescaled, y_train_prescaled, y_test_prescaled = train_test_split(
    X_scaled, y, stratify=y, random_state=42)

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_prescaled, y_train_prescaled)
accuracy = clf.score(X_test_prescaled, y_test_prescaled)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Output:

Accuracy score of the LogisticRegression is 0.96

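The second pattern is to scale the training and test sets independently, which amounts to calling the fit method on both the training set and the test set. As a result, the training and test sets are scaled differently.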

scaler = MinMaxScaler()
X_train_prescaled = scaler.fit_transform(X_train)

X_test_prescaled = scaler.fit_transform(X_test)

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_prescaled, y_train)
accuracy = clf.score(X_test_prescaled, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Output:

Accuracy score of the LogisticRegression is 0.96
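On the digits data both wrong patterns still print 0.96, so the accuracy alone does not reveal the problem. The sketch below uses hypothetical toy values to show what the second pattern does: the same raw value is mapped to different scaled values depending on which set the scaler was fitted on.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: the raw value 5.0 appears in both sets.
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[5.0], [6.0], [7.0]])

# Correct: the statistics come from the training set only.
scaler = MinMaxScaler().fit(X_train)
correct = scaler.transform(X_test).ravel()

# Wrong: a separate fit on the test set uses the test set's own min and max.
wrong = MinMaxScaler().fit_transform(X_test).ravel()

print('raw value 5.0 in train ->', scaler.transform([[5.0]]).ravel()[0])
print('test scaled with train statistics ->', correct)
print('test scaled with its own statistics ->', wrong)
```

With the training statistics, 5.0 maps to 0.5 in both sets; after refitting on the test set it maps to 0.0, so the classifier would see inputs on a different scale from the one it was trained on.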

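2.3 Keep it simple, stupid: using scikit-learn's pipeline connector

The two patterns above are data leakage problems. When preprocessing has to be done by hand, however, such mistakes are hard to prevent. scikit-learn therefore introduced the Pipeline object, which sequentially connects several transformers and a classifier (or a regressor). We can create a pipeline as follows: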

from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[('scaler', MinMaxScaler()),
                       ('clf', LogisticRegression(solver='lbfgs', multi_class='auto', random_state=42))])

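We see that this pipeline holds the parameters of both the scaler and the classifier. Naming every estimator in the pipeline can sometimes be tedious. make_pipeline names each estimator automatically, using the lowercased class name.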

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='lbfgs', multi_class='auto', random_state=42, max_iter=1000))

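The pipeline has the same API. We use fit to train the classifier and score to check the accuracy. However, calling fit calls the fit_transform method of every transformer in the pipeline, and calling score (or predict and predict_proba) calls the transform method of every transformer in the pipeline. This corresponds to the scaling procedure of section 2.1.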

pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(pipe.__class__.__name__, accuracy))

Accuracy score of the Pipeline is 0.96
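The equivalence with the manual procedure of section 2.1 can be verified. The sketch below, added for illustration, fits the same scaler and classifier once by hand and once through a pipeline and compares the predictions; the names manual_pred and pipe_pred are my own, and the constructor arguments are simplified compared with the code above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Manual version: scale, then fit and predict.
scaler = MinMaxScaler()
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(scaler.fit_transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

# Pipeline version: the same two steps chained in one estimator.
pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(max_iter=1000, random_state=42))
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

print('Identical predictions:', np.array_equal(manual_pred, pipe_pred))
```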

We can inspect all the parameters of the pipeline with get_params().

pipe.get_params()

Output:

{'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=1000, multi_class='auto',
           n_jobs=None, penalty='l2', random_state=42, solver='lbfgs',
           tol=0.0001, verbose=0, warm_start=False),
'logisticregression__C': 1.0,
...
...
...}
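The keys returned by get_params() follow a '<step name>__<parameter name>' convention, and the same keys can be passed to set_params(). A minimal sketch of this naming scheme, assuming a pipeline built with make_pipeline as above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Step names are the lowercased class names.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Nested parameters are addressed as '<step name>__<parameter name>'.
pipe.set_params(logisticregression__C=10.0)
print(pipe.get_params()['logisticregression__C'])
print(pipe.named_steps['logisticregression'].C)
```

These double-underscore keys are the ones used later in the parameter grid of the grid search.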

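Exercise

Reuse the breast cancer dataset from the first exercise for training. SGDClassifier can be imported from linear_model. Use this classifier and the StandardScaler transformer imported from sklearn.preprocessing to create a pipeline. Then train and test this pipeline.

3. When more is better than less: cross-validation instead of a single split

Splitting the data is necessary to evaluate the performance of a statistical model. However, it reduces the number of samples available for learning the model. Cross-validation should therefore be used whenever possible. Having several splits also gives information about the stability of the model.

scikit-learn provides three functions: cross_val_score, cross_val_predict and cross_validate. The last one gives more information about fit times and about training and test scores, and it can also return several scores at once.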

from sklearn.model_selection import cross_validate

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='lbfgs', multi_class='auto',
                                        max_iter=1000, random_state=42))
scores = cross_validate(pipe, X, y, cv=3, return_train_score=True)
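Returning several scores at once is done through the scoring argument. The following sketch is an addition to the tutorial: it passes a list of two scorer names, and the choice of 'accuracy' and 'balanced_accuracy' as well as the name multi_scores are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(max_iter=1000, random_state=42))

# A list of scorer names yields one 'test_<name>' entry per metric.
multi_scores = cross_validate(
    pipe, X, y, cv=3, scoring=['accuracy', 'balanced_accuracy'])

for key in sorted(multi_scores):
    print(key, multi_scores[key])
```

Each entry of the returned dictionary is an array with one value per fold, so the result can be wrapped in a pandas DataFrame in the same way as a single-metric result.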

With the cross-validation function, we can quickly check the training and test scores and plot them with pandas.

import pandas as pd

df_scores = pd.DataFrame(scores)
df_scores

df_scores[['train_score', 'test_score']].boxplot()

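Exercise

Take the pipeline from the previous exercise and evaluate it with cross-validation instead of a single split.

4. Hyperparameter optimization: fine-tuning the inside of a pipeline

Sometimes you want to find the parameters of a pipeline component that give the best accuracy. We have already seen that the parameters of a pipeline can be inspected with get_params().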

pipe.get_params()

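Hyperparameters can be optimized by exhaustive search. GridSearchCV provides this utility: it runs a cross-validated grid search over a parameter grid.

In the following example, we want to optimize the C and penalty parameters of the LogisticRegression classifier.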

from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='saga', multi_class='auto',
                                        random_state=42, max_iter=5000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)

Output:

GridSearchCV(cv=3, error_score='raise-deprecating',
       ...
       ...
       ...
       scoring=None, verbose=0)

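When fitted, the grid-search object finds the best parameter combination on the training set (using cross-validation). The results of the search are available through the cv_results_ attribute, which lets us inspect the effect of the parameters on the performance of the model.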

df_grid = pd.DataFrame(grid.cv_results_)
df_grid

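By default, the grid-search object also behaves as an estimator. Once it is fitted, calling score uses the hyperparameters fixed to the best ones found.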

grid.best_params_

Output:

{'logisticregression__C': 10, 'logisticregression__penalty': 'l2'}

In addition, the grid-search object can be called like any other classifier to make predictions.

accuracy = grid.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(grid.__class__.__name__, accuracy))

Accuracy score of the GridSearchCV is 0.96
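The sketch below ties these attributes together. It is an illustration rather than the tutorial's own code: to keep it fast it searches over C only with the default solver, whereas the grid above also searches over penalty with the saga solver. It reads a few columns of cv_results_ and checks that predicting with the grid-search object is the same as predicting with its best_estimator_.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(max_iter=1000, random_state=42))
# A deliberately small grid (C only) so that the sketch runs quickly.
grid = GridSearchCV(pipe, {'logisticregression__C': [0.1, 1.0, 10]}, cv=3)
grid.fit(X_train, y_train)

# One row per parameter combination, best rank first.
df_grid = pd.DataFrame(grid.cv_results_)
print(df_grid[['param_logisticregression__C', 'mean_test_score',
               'rank_test_score']].sort_values('rank_test_score'))

# With the default refit=True, best_estimator_ is the pipeline refitted on
# the whole training set with best_params_; grid.predict delegates to it.
same = np.array_equal(grid.predict(X_test),
                      grid.best_estimator_.predict(X_test))
print('grid.predict matches best_estimator_.predict:', same)
```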

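So far, we have fitted the grid search on a single split only. However, as stated earlier, we may want an outer cross-validation to estimate the performance of the model on different samples of the data and to check the potential variation in performance. Since the grid search is an estimator, we can use it directly in the cross_validate function.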

scores = cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True)
df_scores = pd.DataFrame(scores)
df_scores

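Exercise

Reuse the previous pipeline on the breast cancer dataset and run a grid search to evaluate the difference between the hinge and log losses. In addition, fine-tune the penalty.

5. Summary: my scikit-learn pipeline in fewer than 10 lines of code (skipping the import statements)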

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='saga', multi_class='auto', random_state=42, max_iter=5000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1)
scores = pd.DataFrame(cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True))
scores[['train_score', 'test_score']].boxplot()

6. Heterogeneous data: when you work with data other than numerical

So far, we have used scikit-learn to train models on numerical data.

X

Output:

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])

X is a NumPy array containing only floating-point values. A dataset, however, can contain mixed types.

import os
data = pd.read_csv(os.path.join('data', 'titanic_openml.csv'), na_values='?')
data.head()

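The Titanic dataset contains categorical, text and numerical features. We will use this dataset to predict whether a passenger survived the Titanic.

Let us split the data into training and test sets and use the survived column as the target.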

y = data['survived']
X = data.drop(columns='survived')

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

First, one might try a LogisticRegression classifier and see how well it performs.

clf = LogisticRegression()
clf.fit(X_train, y_train)

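Oops: most classifiers are designed to work with numerical data. We therefore need to convert the categorical data into numerical features. The simplest way is to one-hot encode each categorical feature with OneHotEncoder. Let us take the sex and embarked columns as an example. Note that we also run into some missing data; we will use SimpleImputer to replace the missing values with a constant value.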

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
ohe = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder())
X_encoded = ohe.fit_transform(X_train[['sex', 'embarked']])
X_encoded.toarray()

Output:

array([[0., 1., 0., 0., 1., 0.],
       [0., 1., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0.]])
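To see which output column corresponds to which category, the fitted encoder's categories_ attribute can be inspected. The sketch below uses a hypothetical three-row DataFrame rather than the Titanic file, and passes an explicit fill_value='missing', which is a choice made here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the Titanic file used in the article.
toy = pd.DataFrame({'sex': ['male', 'female', 'female'],
                    'embarked': ['S', np.nan, 'C']}, dtype=object)

ohe = make_pipeline(SimpleImputer(strategy='constant', fill_value='missing'),
                    OneHotEncoder())
encoded = ohe.fit_transform(toy).toarray()

# One output column per category, in the order listed in categories_.
categories = ohe.named_steps['onehotencoder'].categories_
print(categories)
print(encoded)
```

There are two categories for sex and three for embarked (including the imputed 'missing' value), hence five output columns, and every row contains exactly one 1 per original column.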

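In this way, categorical features can be encoded. However, we also want to standardize the numerical features. We therefore need to split the original data into two subgroups and apply a different preprocessing to each: (i) one-hot encoding for the categorical data; (ii) standard scaling for the numerical data. We also need to handle missing values in both cases: for the categorical columns, we replace missing values with the string 'missing_value', which is then treated as a category of its own. For the numerical data, we replace missing values with the mean of the feature concerned.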

col_cat = ['sex', 'embarked']
col_num = ['age', 'sibsp', 'parch', 'fare']

X_train_cat = X_train[col_cat]
X_train_num = X_train[col_num]
X_test_cat = X_test[col_cat]
X_test_num = X_test[col_num]

from sklearn.preprocessing import StandardScaler

scaler_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder())
X_train_cat_enc = scaler_cat.fit_transform(X_train_cat)
X_test_cat_enc = scaler_cat.transform(X_test_cat)

scaler_num = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
X_train_num_scaled = scaler_num.fit_transform(X_train_num)
X_test_num_scaled = scaler_num.transform(X_test_num)

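As in section 2.1, these transformations should be fitted on the training set and then applied to both the training and test sets.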

import numpy as np
from scipy import sparse

X_train_scaled = sparse.hstack((X_train_cat_enc,
                                sparse.csr_matrix(X_train_num_scaled)))
X_test_scaled = sparse.hstack((X_test_cat_enc,
                               sparse.csr_matrix(X_test_num_scaled)))

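Once the transformations are done, we can combine all the information, which is now numerical. Finally, we use a LogisticRegression classifier as the model.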

clf = LogisticRegression(solver='lbfgs')
clf.fit(X_train_scaled, y_train)
accuracy = clf.score(X_test_scaled, y_test)
print('Accuracy score of the {} is {:.2f}'.format(clf.__class__.__name__, accuracy))

Output:

Accuracy score of the LogisticRegression is 0.79

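The pattern above, transforming the data first and then fitting and scoring the classifier, is exactly the pattern of section 2.1. We would therefore like to use a pipeline for this purpose. However, we also want different processing for different columns of the matrix. The ColumnTransformer transformer or the make_column_transformer function should be used for that; it automatically applies different pipelines to different columns.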

from sklearn.compose import make_column_transformer

pipe_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore'))
pipe_num = make_pipeline(SimpleImputer(), StandardScaler())
# Each tuple is (transformer, columns); older releases used the reverse order.
preprocessor = make_column_transformer((pipe_cat, col_cat), (pipe_num, col_num))

pipe = make_pipeline(preprocessor, LogisticRegression(solver='lbfgs'))

pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
print('Accuracy score of the {} is {:.2f}'.format(pipe.__class__.__name__, accuracy))

Output:

Accuracy score of the Pipeline is 0.79
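The same idea can be tried without the Titanic file. The sketch below builds a ColumnTransformer by hand on a hypothetical four-row DataFrame; the step names 'cat' and 'num' and the toy values are assumptions introduced here, and sparse_threshold=0 is set only so that the result is a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for a few Titanic columns.
toy = pd.DataFrame({'sex': ['male', 'female', 'female', 'male'],
                    'age': [22.0, np.nan, 35.0, 58.0],
                    'fare': [7.25, 71.3, 8.05, 26.55]})
toy['sex'] = toy['sex'].astype(object)

# Each entry is (name, transformer, columns).
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
    ('num', make_pipeline(SimpleImputer(strategy='mean'), StandardScaler()),
     ['age', 'fare']),
], sparse_threshold=0)

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
print(transformed)
```

The output has four columns: two one-hot columns for sex followed by the standardized age and fare. With explicit names, nested parameters are addressed as, for example, 'num__simpleimputer__strategy', which follows the same double-underscore convention as in pipelines.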

Moreover, it can itself be used inside another pipeline. We are therefore able to use all the scikit-learn utilities, such as cross_validate or GridSearchCV.

pipe.get_params()

Output:

{'columntransformer': ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
          transformer_weights=None,
          transformers=[('pipeline-1', Pipeline(memory=None,
      ...]}

Putting it all together and visualizing:

pipe_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore'))
pipe_num = make_pipeline(StandardScaler(), SimpleImputer())
# Each tuple is (transformer, columns); older releases used the reverse order.
preprocessor = make_column_transformer((pipe_cat, col_cat), (pipe_num, col_num))

pipe = make_pipeline(preprocessor, LogisticRegression(solver='lbfgs'))

param_grid = {'columntransformer__pipeline-2__simpleimputer__strategy': ['mean', 'median'],
              'logisticregression__C': [0.1, 1.0, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
scores = pd.DataFrame(cross_validate(grid, X, y, scoring='balanced_accuracy', cv=5, n_jobs=-1, return_train_score=True))
scores[['train_score', 'test_score']].boxplot()

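Exercise

Complete the following exercise:

Load the adult dataset located at ./data/adult_openml.csv. Build your own ColumnTransformer preprocessor and pipeline it with a classifier. Fine-tune it and check the prediction accuracy with cross-validation.

Pipeline the preprocessor with a LogisticRegression classifier. Then define a grid search to find the best parameter C. Train and test this workflow in a cross-validation scheme using cross_validate.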
