Implementing cross-validation with Python + sklearn

2021-02-20, 機器學習算法與知識圖譜 (Machine Learning Algorithms and Knowledge Graphs)

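Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that simply repeats the labels of the samples it was trained on would get a perfect score, yet fail to predict anything useful on data it has not seen. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word "experiment" is not meant to denote academic use only, because even in commercial settings machine learning usually starts out experimentally.

A flowchart of the typical cross-validation workflow in model training accompanies this passage (figure not reproduced here). Within that workflow, the best parameters can be determined by grid search techniques.

In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function. Let's load the iris dataset and fit a linear support vector machine on it: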

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> from sklearn import svm
>>> X, y = datasets.load_iris(return_X_y=True)
>>> X.shape, y.shape
((150, 4), (150,))
We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.4, random_state=0)
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))
>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.96...
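When evaluating different settings ("hyperparameters") for an estimator, such as the C parameter that must be set manually for an SVM, there is still a risk of overfitting on the test set, because the parameters can be tweaked until the estimator performs optimally on it. In this way, knowledge about the test set can "leak" into the model, and the evaluation metrics no longer reflect generalization performance. To solve this problem, yet another part of the dataset can be held out as a "validation set": training proceeds on the training set, evaluation is done on the validation set, and once the experiment seems successful, a final evaluation is done on the test set.

However, by partitioning the available data into three sets, we drastically reduce the number of samples that can be used for learning the model, and the results can depend on a particular random choice of the (training, validation) pair.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but a separate validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but they generally follow the same principles). The following procedure is followed for each of the k "folds":

- a model is trained using k-1 of the folds as training data;
- the resulting model is validated on the remaining fold (that is, the fold is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but it does not waste much data (unlike fixing an arbitrary validation set), which is a major advantage in problems where the number of samples is very small, such as inverse inference.

1. Computing cross-validated metrics

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset. The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores
array([0.96..., 1.  ..., 0.96..., 0.96..., 1.        ])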
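The mean score and an approximate 95% confidence interval of the score estimate (taken here as plus or minus two standard deviations of the fold scores) are hence given by:

>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)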
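To make the k-fold procedure concrete, the cv=5 call above can be reproduced by hand. This is only a sketch of what the helper does for a classifier, assuming the unshuffled StratifiedKFold default; the variable names manual_scores and helper_scores are illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# Assumption: for a classifier and an integer cv, cross_val_score
# uses StratifiedKFold without shuffling, so we build it explicitly.
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    model = clone(clf)                     # fresh, unfitted copy for each fold
    model.fit(X[train_idx], y[train_idx])  # train on the other k-1 folds
    # validate on the single fold that was left out
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

helper_scores = cross_val_score(clf, X, y, cv=5)
print(manual_scores)
print(helper_scores)
print("mean accuracy: %0.2f" % manual_scores.mean())
```

The two arrays should agree fold by fold, and the reported figure is simply their mean.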
By default, the score computed at each CV iteration is the one returned by the estimator's score method. This can be changed with the scoring parameter:

>>> from sklearn import metrics
>>> scores = cross_val_score(
...     clf, X, y, cv=5, scoring='f1_macro')
>>> scores
array([0.96..., 1.  ..., 0.96..., 0.96..., 1.        ])
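See "The scoring parameter: defining model evaluation rules" for details. In the case of the iris dataset, the samples are balanced across target classes, so accuracy and the F1-score are almost equal.

When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategy by default, the latter being used if the estimator derives from ClassifierMixin.

It is also possible to use other cross-validation strategies by passing a cross-validation iterator instead, for instance:

>>> from sklearn.model_selection import ShuffleSplit
>>> n_samples = X.shape[0]
>>> cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
>>> cross_val_score(clf, X, y, cv=cv)
array([0.977..., 0.977..., 1.  ..., 0.955..., 1.        ])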
Another option is to pass an iterable that yields (train, test) splits as arrays of indices, for example:

>>> def custom_cv_2folds(X):
...     n = X.shape[0]
...     i = 1
...     while i <= 2:
...         idx = np.arange(n * (i - 1) / 2, n * i / 2, dtype=int)
...         yield idx, idx  # the same indices serve as train and test here
...         i += 1
...
>>> custom_cv = custom_cv_2folds(X)
>>> cross_val_score(clf, X, y, cv=custom_cv)
array([1.        , 0.973...])
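The generator in the example above yields the same indices for training and testing, so its scores are training scores. As a hedged variant, the sketch below builds two splits in which each half of a shuffled index array is held out in turn. The helper name holdout_halves and its seed argument are hypothetical, introduced only for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

def holdout_halves(n_samples, seed=0):
    """Hypothetical helper: yield two (train, test) index pairs,
    swapping the roles of two shuffled halves."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(n_samples)  # shuffle, since iris is ordered by class
    first, second = perm[: n_samples // 2], perm[n_samples // 2:]
    yield first, second   # train on the first half, test on the second
    yield second, first   # then swap the roles

# Materialize the generator so the same splits can be inspected and reused.
splits = list(holdout_halves(X.shape[0]))
scores = cross_val_score(clf, X, y, cv=splits)
print(scores)
```

Turning the generator into a list is a deliberate choice here: a generator can be consumed only once, whereas a list of index pairs can be passed to several calls.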
Just as it is important to test a predictor on data held out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations should likewise be learnt from the training set and then applied to the held-out data for prediction:

>>> from sklearn import preprocessing
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.4, random_state=0)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train_transformed = scaler.transform(X_train)
>>> clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
>>> X_test_transformed = scaler.transform(X_test)
>>> clf.score(X_test_transformed, y_test)
0.9333...
A Pipeline makes it easier to compose estimators, and provides this behavior under cross-validation:

>>> from sklearn.pipeline import make_pipeline
>>> clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
>>> cross_val_score(clf, X, y, cv=cv)
array([0.977..., 0.933..., 0.955..., 0.933..., 0.977...])
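1.1. The cross_validate function and multiple metric evaluation

The cross_validate function differs from cross_val_score in two ways:

- it allows specifying multiple metrics for evaluation;
- in addition to the test score, it returns a dict containing fit times, score times (and optionally training scores as well as fitted estimators).

For single metric evaluation, where the scoring parameter is a string, a callable or None, the keys will be ['test_score', 'fit_time', 'score_time'].

For multiple metric evaluation, the return value is a dict with the keys ['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time'].

return_train_score is set to False by default to save computation time. To evaluate the scores on the training set as well, set it to True. You may also retain the estimator fitted on each training set by setting return_estimator=True.

The multiple metrics can be specified as a list, tuple or set of predefined scorer names:

>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import recall_score
>>> scoring = ['precision_macro', 'recall_macro']
>>> clf = svm.SVC(kernel='linear', C=1, random_state=0)
>>> scores = cross_validate(clf, X, y, scoring=scoring)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
>>> scores['test_recall_macro']
array([0.96..., 1.  ..., 0.96..., 0.96..., 1.        ])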
Or as a dict mapping a scorer name to a predefined or custom scoring function:

>>> from sklearn.metrics import make_scorer
>>> scoring = {'prec_macro': 'precision_macro',
...            'rec_macro': make_scorer(recall_score, average='macro')}
>>> scores = cross_validate(clf, X, y, scoring=scoring,
...                         cv=5, return_train_score=True)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_prec_macro', 'test_rec_macro',
 'train_prec_macro', 'train_rec_macro']
>>> scores['train_rec_macro']
array([0.97..., 0.97..., 0.99..., 0.98..., 0.98...])
Here is an example of cross_validate using a single metric:

>>> scores = cross_validate(clf, X, y,
...                         scoring='precision_macro', cv=5,
...                         return_estimator=True)
>>> sorted(scores.keys())
['estimator', 'fit_time', 'score_time', 'test_score']
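One use of return_estimator=True is to check that preprocessing really is re-learnt on each training fold when a pipeline is cross-validated. The following is a minimal sketch under that assumption; the names pipe, results and fold_means are illustrative, and the step name 'standardscaler' relies on make_pipeline naming steps after the lowercased class name.

```python
import numpy as np
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

X, y = datasets.load_iris(return_X_y=True)
pipe = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

# Keep the pipeline fitted on each of the 5 training folds.
results = cross_validate(pipe, X, y, cv=5, return_estimator=True)

# Collect the feature means that each fold's scaler learnt.
fold_means = np.array([
    est.named_steps['standardscaler'].mean_
    for est in results['estimator']
])
print(fold_means)      # one row of feature means per training fold
print(X.mean(axis=0))  # means over the full dataset, for comparison
```

The rows differ slightly from one another and from the full-data means, which indicates that each scaler only saw its own training fold.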
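1.2. Obtaining predictions by cross-validation

The cross_val_predict function has a similar interface to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign every element to a test set exactly once can be used (otherwise, an exception is raised).

Warning: a note on inappropriate usage of cross_val_predict. The result of cross_val_predict may differ from the one obtained with cross_val_score, because the elements are grouped in different ways. cross_val_score takes an average over the cross-validation folds, whereas cross_val_predict simply returns the labels (or probabilities) produced by several distinct models. Thus, cross_val_predict is not an appropriate measure of generalization error.

The function cross_val_predict is appropriate for:

- visualization of predictions obtained from different models;
- model blending: when the predictions of one supervised estimator are used to train another estimator in ensemble methods.

A related example in the scikit-learn documentation is "Receiver Operating Characteristic (ROC) with cross validation".

2. Cross-validation iterators

The following sections list utilities that generate indices which can be used to produce dataset splits according to different cross-validation strategies.

2.1. Cross-validation iterators for i.i.d. data

Assuming that some data is independent and identically distributed (i.i.d.) amounts to assuming that all samples stem from the same generative process and that the generative process has no memory of past generated samples. The following cross-validators can be used in such cases.

While i.i.d. data is a common assumption in machine learning theory, it rarely holds in practice. If the samples are known to have been generated by a time-dependent process, it is safer to use a time-series aware cross-validation scheme; similarly, if the generative process is known to have a group structure (samples collected from different subjects, experiments or measurement devices), it is safer to use group-wise cross-validation.

2.1.1. K-fold

KFold divides all the samples into k groups of samples, called folds, of equal size where possible. Example of 2-fold cross-validation on a dataset with 4 samples:

>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]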
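(The original article shows a figure visualizing this cross-validation behavior; it is not reproduced here.) Note that KFold is not affected by classes or groups.

Each fold is made up of two arrays: the first holds the indices of the training set and the second those of the test set. Thus, training/test sets can be created using numpy indexing:

>>> X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
>>> y = np.array([0, 1, 0, 1])
>>> X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]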
2.1.2. Repeated K-Fold

RepeatedKFold repeats K-Fold n times, producing different splits in each repetition. Example of 2-fold K-Fold repeated 2 times:

>>> import numpy as np
>>> from sklearn.model_selection import RepeatedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> random_state = 12883823
>>> rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
>>> for train, test in rkf.split(X):
...     print("%s %s" % (train, test))
...
[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]
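Similarly, RepeatedStratifiedKFold repeats Stratified K-Fold n times with different randomization in each repetition.

2.1.3. Leave One Out (LOO)

LeaveOneOut (or LOO) is a simple cross-validation in which each training set is created by taking all the samples except one, the test set being the single sample left out:

>>> from sklearn.model_selection import LeaveOneOut
>>> X = [1, 2, 3, 4]
>>> loo = LeaveOneOut()
>>> for train, test in loo.split(X):
...     print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]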
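As a rough illustration of what leaving one sample out means for a real dataset, the sketch below runs LeaveOneOut through cross_val_score on iris. The choice of a linear SVC simply mirrors the earlier examples and is an assumption, not part of the LOO description.

```python
from sklearn import datasets, svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# One model is fitted per sample, so 150 fits for iris.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

# Each test set holds a single sample, so each score is either 0.0 or 1.0.
print(len(loo_scores))
print(loo_scores.mean())
```

The number of fits equals the number of samples, which is why the cost grows quickly on larger datasets.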
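Users of LOO for model selection should weigh a few known caveats. Compared with k-fold cross-validation, LOO builds n models from n samples instead of k models, where n > k.

2.1.4. Leave P Out (LPO)

LeavePOut is very similar to LeaveOneOut: it creates all the possible training/test sets by removing p samples from the complete set. Example of Leave-2-Out on a dataset with 4 samples:

>>> from sklearn.model_selection import LeavePOut
>>> X = np.ones(4)
>>> lpo = LeavePOut(p=2)
>>> for train, test in lpo.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]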
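The six splits above correspond to the number of ways of choosing 2 samples out of 4. A small sketch of that arithmetic follows, comparing get_n_splits with the binomial coefficient; the (n, p) pairs beyond (4, 2) are arbitrary illustrations.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

counts = []
for n, p in [(4, 2), (10, 2), (10, 3)]:
    lpo = LeavePOut(p=p)
    n_splits = lpo.get_n_splits(np.ones(n))  # number of train/test pairs
    counts.append((n, p, n_splits, comb(n, p)))
    print("n=%d p=%d splits=%d C(n,p)=%d" % (n, p, n_splits, comb(n, p)))
```

Because the count grows combinatorially with n, exhaustive Leave-P-Out quickly becomes expensive.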
2.1.5. Random permutations cross-validation, a.k.a. Shuffle & Split

The ShuffleSplit iterator generates a user-defined number of independent train/test dataset splits. Samples are first shuffled and then split into a pair of training and test sets.

The randomness can be controlled, for reproducibility of the results, by explicitly seeding the random_state pseudo-random number generator.

>>> from sklearn.model_selection import ShuffleSplit
>>> X = np.arange(10)
>>> ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
>>> for train_index, test_index in ss.split(X):
...     print("%s %s" % (train_index, test_index))
[9 1 6 7 3 0 5] [2 8 4]
[2 9 8 0 6 7 4] [3 5 1]
[4 5 1 0 6 9 7] [2 3 8]
[2 7 5 8 0 3 4] [6 1 9]
[4 1 0 6 8 9 3] [5 2 7]
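(The original article shows a figure visualizing this cross-validation behavior; it is not reproduced here.) Note that ShuffleSplit is not affected by classes or groups.

ShuffleSplit is thus a good alternative to KFold cross-validation that allows finer control over the number of iterations and the proportion of samples on each side of the train/test split.

2.2. Cross-validation iterators with stratification based on class labels

Some classification problems exhibit a large imbalance in the distribution of the target classes: for instance, there could be several times more negative samples than positive samples. In such cases it is recommended to use stratified sampling, as implemented in StratifiedKFold and StratifiedShuffleSplit, to ensure that relative class frequencies are approximately preserved in each train and validation fold.

2.2.1. Stratified k-fold

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.

Here is an example of stratified 3-fold cross-validation on a dataset with 50 samples from two unbalanced classes. We show the number of samples in each class and compare with KFold.

>>> from sklearn.model_selection import StratifiedKFold, KFold
>>> import numpy as np
>>> X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))
>>> skf = StratifiedKFold(n_splits=3)
>>> for train, test in skf.split(X, y):
...     print('train -  {}   |   test -  {}'.format(
...         np.bincount(y[train]), np.bincount(y[test])))
train -  [30  3]   |   test -  [15  2]
train -  [30  3]   |   test -  [15  2]
train -  [30  4]   |   test -  [15  1]
>>> kf = KFold(n_splits=3)
>>> for train, test in kf.split(X, y):
...     print('train -  {}   |   test -  {}'.format(
...         np.bincount(y[train]), np.bincount(y[test])))
train -  [28  5]   |   test -  [17]
train -  [28  5]   |   test -  [17]
train -  [34]   |   test -  [11  5]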
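We can see that StratifiedKFold preserves the class ratios (approximately 1/10) in both the training and test datasets.

RepeatedStratifiedKFold can be used to repeat Stratified K-Fold n times with different randomization in each repetition.

2.3. Cross-validation iterators for grouped data

2.3.1. Group k-fold

GroupKFold is a variation of k-fold which ensures that the same group is not represented in both the testing and training sets. In the following example the samples come from three subjects, identified by the groups array:

>>> from sklearn.model_selection import GroupKFold
>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
>>> groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
>>> gkf = GroupKFold(n_splits=3)
>>> for train, test in gkf.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]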
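Each subject is in a different testing fold, and the same subject is never in both the testing and the training fold. Notice that the folds do not have exactly the same size, due to the imbalance in the data.

2.3.2. Leave One Group Out

LeaveOneGroupOut is a cross-validation scheme which holds out the samples according to a third-party provided array of integer groups. This group information can be used to encode arbitrary domain-specific predefined cross-validation folds.

Each training set is thus constituted by all the samples except the ones related to a specific group.

For example, in the case of multiple experiments, LeaveOneGroupOut can be used to create a cross-validation based on the different experiments: we create a training set using the samples of all the experiments except one:

>>> from sklearn.model_selection import LeaveOneGroupOut
>>> X = [1, 5, 10, 50, 60, 70, 80]
>>> y = [0, 1, 1, 2, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3, 3]
>>> logo = LeaveOneGroupOut()
>>> for train, test in logo.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]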
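Another common application is to use time information: for instance, the groups could be the year in which the samples were collected, which allows cross-validation against time-based splits.

2.3.3. Leave P Groups Out

LeavePGroupsOut is similar to LeaveOneGroupOut, but removes the samples related to P groups for each training/test set.

Example of Leave-2-Group Out:

>>> from sklearn.model_selection import LeavePGroupsOut
>>> X = np.arange(6)
>>> y = [1, 1, 1, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3]
>>> lpgo = LeavePGroupsOut(n_groups=2)
>>> for train, test in lpgo.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]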
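The defining property of the group-wise splitters is that no group appears on both sides of a split. The sketch below checks this for the Leave-2-Groups-Out example and also counts the splits, which with 3 groups and 2 held out should be C(3, 2) = 3. The variable names overlaps and n_splits are illustrative.

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

X = np.arange(6)
y = np.array([1, 1, 1, 2, 2, 2])
groups = np.array([1, 1, 2, 2, 3, 3])

lpgo = LeavePGroupsOut(n_groups=2)
n_splits = lpgo.get_n_splits(X, y, groups)  # needs groups to count them

overlaps = []
for train, test in lpgo.split(X, y, groups=groups):
    # groups present on both sides of the split (expected: none)
    shared = set(groups[train]) & set(groups[test])
    overlaps.append(shared)
    print(sorted(set(groups[train])), sorted(set(groups[test])), shared)

print(n_splits)
```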
2.3.4. Group Shuffle Split

The GroupShuffleSplit iterator behaves as a combination of ShuffleSplit and LeavePGroupsOut, and generates a sequence of randomized partitions in which a subset of groups is held out for each split.

>>> from sklearn.model_selection import GroupShuffleSplit
>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "a"]
>>> groups = [1, 1, 2, 2, 3, 3, 4, 4]
>>> gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
>>> for train, test in gss.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
...
[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]
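This class is useful when the behavior of LeavePGroupsOut is desired, but the number of groups is large enough that generating all possible partitions with P groups withheld would be prohibitively expensive. In such a scenario, GroupShuffleSplit provides a random sample (with replacement) of the train/test splits generated by LeavePGroupsOut.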
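A group-aware splitter can also be handed to cross_val_score, which forwards the groups argument to it. The sketch below assumes a made-up grouping of the iris samples (30 groups of 5 samples, assigned by index modulo 30) purely for illustration; real group labels would come from the data collection process.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GroupShuffleSplit, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# Hypothetical grouping, for illustration only: 30 groups of 5 samples.
groups = np.arange(X.shape[0]) % 30

# Hold out roughly 20% of the groups (not of the samples) in each split.
gss = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = cross_val_score(clf, X, y, groups=groups, cv=gss)
print(scores)

# Confirm that no group is shared between train and test in any split.
disjoint = [
    len(set(groups[train]) & set(groups[test])) == 0
    for train, test in gss.split(X, y, groups=groups)
]
print(disjoint)
```

Note that test_size here is a proportion of groups, so the number of held-out samples depends on how large the selected groups are.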
