The sklearn.feature_selection module is widely used for feature selection and dimensionality reduction on sample datasets: it reduces the number of features, improves the understanding of the features and their values, and can raise prediction accuracy or improve performance on high-dimensional datasets. This article introduces the sklearn.feature_selection module based on the official documentation.
Features are usually selected with two considerations in mind:

Feature divergence: if a feature does not diverge, e.g. its variance is close to 0, the samples show essentially no difference on that feature, so it contributes little to distinguishing samples.
Correlation with the target: features that are highly correlated with the target should be preferred.

Depending on how the selection is carried out, feature selection methods fall into three categories:

Filter: score each feature by its divergence or its correlation with the target, then keep features by setting a threshold on the score.
Wrapper: select or exclude a few features at a time according to an objective function (usually a prediction score).
Embedded: first train a machine learning model to obtain a weight coefficient for each feature, then select features from the largest coefficients downwards.

VarianceThreshold is the most basic feature selection method: it removes every feature whose variance does not reach a given threshold. The reasoning is that if a single value of a feature occurs in more than a certain proportion of the samples (e.g. 90%), the feature can be judged to contribute little. By default it removes all zero-variance features, i.e. features that take the same value in every sample. It is intended for discrete-valued features; continuous features should be discretized first.
Example: suppose we have a dataset of boolean features and want to remove every feature in which either 1s or 0s make up more than 80% of the samples. A boolean feature is a Bernoulli random variable with variance Var[X] = p*(1-p), so the threshold is 0.8*(1-0.8) = 0.16. In the code below, the first column has p = 5/6 > 0.8, so VarianceThreshold removes it.
from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
print('Initial X:\n', X)
sel = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
X_selected = sel.fit_transform(X)
print('Selected X:\n', X_selected)
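The fitted selector also exposes the per-feature variances and a boolean support mask, which makes it easy to verify which column fell below the threshold. A small follow-up using the sel object fitted above:

# Per-feature variances computed during fit; column 0 has 5/36 ≈ 0.139 < 0.16
print('Variances:', sel.variances_)
# Boolean mask marking the retained columns
print('Support mask:', sel.get_support())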
Univariate feature selection selects the best features based on univariate statistical tests. Scikit-learn implements feature selection as objects that expose a transform method. The approach is simple, easy to run and to understand, and it usually works well for understanding the data, but it is not necessarily effective for optimizing the feature set or improving generalization.
SelectKBest keeps the K highest-scoring features specified by the user (top K).
SelectPercentile keeps a user-specified percentage of features (top k%).
Selectors based on common univariate statistical tests applied to each feature: SelectFpr (false positive rate), SelectFdr (false discovery rate), and SelectFwe (family-wise error).
GenericUnivariateSelect performs univariate feature selection with a configurable strategy, which makes it possible to pick the best univariate selection strategy with a hyper-parameter search estimator.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target
print('Initial X shape: ', X.shape)
X_selected = SelectKBest(chi2, k=2).fit_transform(X, y)
print('Selected X shape: ', X_selected.shape)
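GenericUnivariateSelect is not shown above; as a minimal sketch (reusing the iris X and y just loaded), mode='k_best' with param=2 mirrors SelectKBest(chi2, k=2), and the mode string can be switched to 'percentile', 'fpr', 'fdr', or 'fwe':

from sklearn.feature_selection import GenericUnivariateSelect, chi2

# Equivalent to SelectKBest(chi2, k=2); change mode/param for the other strategies
transformer = GenericUnivariateSelect(chi2, mode='k_best', param=2)
X_generic = transformer.fit_transform(X, y)
print('GenericUnivariateSelect X shape: ', X_generic.shape)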
These selectors take a scoring function that is applied to the features and returns univariate scores (such as an F statistic, f_score) and p-values (p_value, to be compared against a significance level); SelectKBest and SelectPercentile also accept scoring functions that return scores only, without p-values. The scoring functions available for the different kinds of problems are:

For regression: f_regression, mutual_info_regression.
For classification: chi2, f_classif, mutual_info_classif.
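As a small illustration (again reusing the iris X and y), a fitted selector exposes its scores through scores_ and, when the scoring function provides them, p-values through pvalues_; with a mutual-information scorer there are no p-values, and pvalues_ is expected to be None, although that detail may differ across versions:

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# f_classif returns both F scores and p-values
skb_f = SelectKBest(f_classif, k=2).fit(X, y)
print('F scores :', skb_f.scores_)
print('p-values :', skb_f.pvalues_)

# mutual_info_classif returns scores only, so no p-values are available
skb_mi = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print('MI scores:', skb_mi.scores_)
print('p-values :', skb_mi.pvalues_)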
Note (1): do not use a regression scoring function on a classification problem; the result is meaningless. When working with sparse data (e.g. data stored as a sparse matrix), chi2, mutual_info_regression and mutual_info_classif can process the data without destroying its sparsity.
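A minimal sketch of the sparse case, with a purely illustrative random sparse matrix (chi2 requires non-negative feature values): selecting with chi2 on a scipy sparse input keeps the result sparse instead of densifying it.

import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative non-negative sparse data and a random binary target
rng = np.random.RandomState(0)
X_sparse = sparse.random(100, 50, density=0.1, random_state=rng)
y_cls = rng.randint(0, 2, size=100)

X_sel = SelectKBest(chi2, k=10).fit_transform(X_sparse, y_cls)
print(type(X_sel), X_sel.shape)

The longer example below, adapted from the scikit-learn gallery, compares univariate F-test scores with the weights of an SVM on the iris data augmented with 20 noise features.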
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif

# Add 20 uninformative noise features to the 4 iris features
iris = datasets.load_iris()
E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))
X = np.hstack((iris.data, E))
y = iris.target

plt.figure(1)
plt.clf()
X_indices = np.arange(X.shape[-1])

# Univariate feature selection with F-test scoring
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X, y)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)',
        color='darkorange', edgecolor='black')

# Weights of an SVM trained on all features
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
svm_weights = (clf.coef_ ** 2).sum(axis=0)
svm_weights /= svm_weights.max()
plt.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight',
        color='navy', edgecolor='black')

# Weights of an SVM trained only on the selected features
clf_selected = svm.SVC(kernel='linear')
clf_selected.fit(selector.transform(X), y)
svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
svm_weights_selected /= svm_weights_selected.max()
plt.bar(X_indices[selector.get_support()] - .05, svm_weights_selected, width=.2,
        label='SVM weights after selection', color='c', edgecolor='black')

plt.title("Comparing feature selection")
plt.xlabel('Feature number')
plt.yticks(())
plt.axis('tight')
plt.legend(loc='upper right')
plt.show()
Note (2): the F-test evaluates the linear dependence between two random variables, whereas mutual information can capture any kind of statistical dependence; however, being non-parametric, it needs more samples for an accurate estimate. The example below compares the two on a target that depends linearly on x1, non-linearly on x2, and not at all on x3:
from sklearn.feature_selection import f_regression, mutual_info_regression

# y depends linearly on x1, non-linearly on x2, and not at all on x3
np.random.seed(0)
X = np.random.rand(1000, 3)
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)

f_test, _ = f_regression(X, y)
f_test /= np.max(f_test)
mi = mutual_info_regression(X, y)
mi /= np.max(mi)

plt.figure(figsize=(15, 5))
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plt.scatter(X[:, i], y, edgecolor='black', s=20)
    plt.xlabel("$x_{}$".format(i + 1), fontsize=14)
    if i == 0:
        plt.ylabel("$y$", fontsize=14)
    plt.title("F-test={:.2f}, MI={:.2f}".format(f_test[i], mi[i]), fontsize=16)
plt.show()
Given a predictive model that assigns weights to features (e.g. the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller feature sets. The procedure is as follows: first, train the estimator on the initial feature set and obtain each feature's importance from the coef_ or feature_importances_ attribute; then remove the least important feature(s) from the current set; finally, repeat this process recursively until the desired number of features is reached.
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Rank every pixel: eliminate one feature per iteration until one remains
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)

plt.matshow(ranking, cmap=plt.cm.Blues)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()
RFECV runs RFE inside a cross-validation loop so that the optimal number of features is chosen automatically:

from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2), scoring='accuracy')
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)

# Note: in recent scikit-learn versions grid_scores_ has been replaced
# by rfecv.cv_results_['mean_test_score']
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
SelectFromModel can be used with any estimator that has a coef_ or feature_importances_ attribute after fitting, which is why it is also called a meta-transformer. You can either set a numeric threshold so that features scoring below it are dropped, or pass a string argument to use a built-in heuristic for the threshold, such as "mean", "median", or a scaled value like "0.1*mean".
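A minimal sketch of the string heuristics, assuming iris data and an ExtraTreesClassifier purely for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

# Keep features whose importance exceeds the median importance
sfm_median = SelectFromModel(clf, threshold="median", prefit=True)
print('threshold="median":', sfm_median.transform(X).shape)

# Keep features whose importance exceeds 0.1 times the mean importance
sfm_scaled = SelectFromModel(clf, threshold="0.1*mean", prefit=True)
print('threshold="0.1*mean":', sfm_scaled.transform(X).shape)

The longer example below, adapted from the scikit-learn gallery, instead starts from a numeric threshold and raises it until only the two most important Boston housing features remain.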
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# load_boston has been removed from recent scikit-learn versions;
# substitute another regression dataset (e.g. fetch_california_housing) there
boston = load_boston()
X, y = boston['data'], boston['target']

# Start with a fixed threshold, then raise it until only 2 features remain
clf = LassoCV(cv=5)
sfm = SelectFromModel(clf, threshold=0.25)
sfm.fit(X, y)
X_transform = sfm.transform(X)
n_features = X_transform.shape[1]

while n_features > 2:
    sfm.threshold += 0.1
    X_transform = sfm.transform(X)
    n_features = X_transform.shape[1]

plt.title("Features selected from Boston using SelectFromModel with "
          "threshold %0.3f." % sfm.threshold)
feature1 = X_transform[:, 0]
feature2 = X_transform[:, 1]
plt.plot(feature1, feature2, 'r.')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.ylim([np.min(feature2), np.max(feature2)])
plt.show()
Linear models penalized with the L1 norm yield sparse solutions: most of the estimated coefficients are zero. When the goal is to reduce the dimensionality of the data for use with another classifier, the features with non-zero coefficients can be selected with SelectFromModel. Sparse estimators commonly used for this purpose are linear_model.Lasso (regression) and linear_model.LogisticRegression and svm.LinearSVC (classification).
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
print('X shape:', X.shape)

# The smaller C is, the fewer features are selected
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print('X_new shape:', X_new.shape)
Tree-based estimators (the sklearn.tree module and the sklearn.ensemble module) can compute feature importances, which in turn can be used to discard irrelevant features.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
print('X shape:', X.shape)
clf = ExtraTreesClassifier(n_estimators=10)
clf = clf.fit(X, y)
print('feature_importance:', clf.feature_importances_)
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
print('X_new shape:', X_new.shape)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: only 3 of the 10 features are informative
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           random_state=0, shuffle=False)

forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_
# Standard deviation of the importances across the individual trees
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices], color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesClassifier

# Keep only the first 5 subjects of the Olivetti faces dataset
data = fetch_olivetti_faces()
X = data.images.reshape((len(data.images), -1))
y = data.target
mask = y < 5
X = X[mask]
y = y[mask]

forest = ExtraTreesClassifier(n_estimators=1000, max_features=128, n_jobs=1, random_state=0)
forest.fit(X, y)
# Reshape the per-pixel importances back into image form for plotting
importances = forest.feature_importances_
importances = importances.reshape(data.images[0].shape)

plt.matshow(importances, cmap=plt.cm.hot)
plt.title("Pixel importances with forests of trees")
plt.show()
Feature selection is often used as a preprocessing step before the actual learning. In scikit-learn this can be done with sklearn.pipeline.Pipeline. The example below combines LinearSVC with SelectFromModel to evaluate feature importances and select features, then trains a RandomForestClassifier on the transformed output, i.e. on the selected features only.
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False, max_iter=3000))),
    ('classification', RandomForestClassifier(n_estimators=100))
])
clf.fit(X, y)
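As a small follow-up, the fitted pipeline can be inspected to see which features were kept and used directly for prediction; named_steps and get_support are standard Pipeline and SelectFromModel attributes, and the step name matches the one used above:

# Boolean mask of the features kept by the selection step
print('Selected features:', clf.named_steps['feature_selection'].get_support())
# At prediction time the pipeline applies the selection, then the classifier
print('Predictions for the first 5 samples:', clf.predict(X[:5]))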