The sklearn.feature_selection module is widely used for feature selection and dimensionality reduction on sample datasets: it reduces the number of features, improves the understanding of the features and their values, and can raise prediction accuracy or improve performance on high-dimensional datasets. This article introduces the sklearn.feature_selection module based on the official documentation.
Features are usually selected with two considerations in mind:

Feature divergence: if a feature does not diverge, e.g. its variance is close to 0, the samples show essentially no difference on that feature, so it contributes little to distinguishing samples.
Correlation with the target: features that are highly correlated with the target should be preferred.

Depending on how the selection is carried out, feature selection methods fall into three categories:

Filter: score each feature by its divergence or its correlation with the target, then keep features by setting a threshold on the score.
Wrapper: select or exclude a few features at a time according to an objective function (usually a prediction score).
Embedded: first train a machine learning model to obtain a weight coefficient for each feature, then select features from the largest coefficients downwards.

VarianceThreshold is the most basic feature selection method: it removes every feature whose variance does not reach a given threshold. The reasoning is that if a single value of a feature occurs in more than a certain proportion of the samples (e.g. 90%), the feature can be judged to contribute little. By default it removes all zero-variance features, i.e. features that take the same value in every sample. It is intended for discrete-valued features; continuous features should be discretized first.
Example: suppose we have a dataset of boolean features and want to remove every feature in which either 1s or 0s make up more than 80% of the samples. A boolean feature is a Bernoulli random variable with variance Var[X] = p*(1-p), so the threshold is 0.8*(1-0.8) = 0.16. In the code below, the first column has p = 5/6 > 0.8, so VarianceThreshold removes it.
from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
print('Initial X:\n', X)
sel = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
X_selected = sel.fit_transform(X)
print('Selected X:\n', X_selected)
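The fitted selector also exposes the per-feature variances and a boolean support mask, which makes it easy to verify which column fell below the threshold. A small follow-up using the sel object fitted above:

# Per-feature variances computed during fit; column 0 has 5/36 ≈ 0.139 < 0.16
print('Variances:', sel.variances_)
# Boolean mask marking the retained columns
print('Support mask:', sel.get_support())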
Univariate feature selection selects the best features based on univariate statistical tests. Scikit-learn implements feature selection as objects that expose a transform method. The approach is simple, easy to run and to understand, and it usually works well for understanding the data, but it is not necessarily effective for optimizing the feature set or improving generalization.
SelectKBest keeps the K highest-scoring features specified by the user (top K).
SelectPercentile keeps a user-specified percentage of features (top k%).
Selectors based on common univariate statistical tests applied to each feature: SelectFpr (false positive rate), SelectFdr (false discovery rate), and SelectFwe (family-wise error).
GenericUnivariateSelect performs univariate feature selection with a configurable strategy, which makes it possible to pick the best univariate selection strategy with a hyper-parameter search estimator.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target
print('Initial X shape: ', X.shape)
X_selected = SelectKBest(chi2, k=2).fit_transform(X, y)
print('Selected X shape: ', X_selected.shape)
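GenericUnivariateSelect is not shown above; as a minimal sketch (reusing the iris X and y just loaded), mode='k_best' with param=2 mirrors SelectKBest(chi2, k=2), and the mode string can be switched to 'percentile', 'fpr', 'fdr', or 'fwe':

from sklearn.feature_selection import GenericUnivariateSelect, chi2

# Equivalent to SelectKBest(chi2, k=2); change mode/param for the other strategies
transformer = GenericUnivariateSelect(chi2, mode='k_best', param=2)
X_generic = transformer.fit_transform(X, y)
print('GenericUnivariateSelect X shape: ', X_generic.shape)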
These selectors take a scoring function that is applied to the features and returns univariate scores (such as an F statistic, f_score) and p-values (p_value, to be compared against a significance level); SelectKBest and SelectPercentile also accept scoring functions that return scores only, without p-values. The scoring functions available for the different kinds of problems are:

For regression: f_regression, mutual_info_regression.
For classification: chi2, f_classif, mutual_info_classif.
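As a small illustration (again reusing the iris X and y), a fitted selector exposes its scores through scores_ and, when the scoring function provides them, p-values through pvalues_; with a mutual-information scorer there are no p-values, and pvalues_ is expected to be None, although that detail may differ across versions:

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# f_classif returns both F scores and p-values
skb_f = SelectKBest(f_classif, k=2).fit(X, y)
print('F scores :', skb_f.scores_)
print('p-values :', skb_f.pvalues_)

# mutual_info_classif returns scores only, so no p-values are available
skb_mi = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print('MI scores:', skb_mi.scores_)
print('p-values :', skb_mi.pvalues_)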
Note (1): do not use a regression scoring function on a classification problem; the result is meaningless. When working with sparse data (e.g. data stored as a sparse matrix), chi2, mutual_info_regression and mutual_info_classif can process the data without destroying its sparsity.
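A minimal sketch of the sparse case, with a purely illustrative random sparse matrix (chi2 requires non-negative feature values): selecting with chi2 on a scipy sparse input keeps the result sparse instead of densifying it.

import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative non-negative sparse data and a random binary target
rng = np.random.RandomState(0)
X_sparse = sparse.random(100, 50, density=0.1, random_state=rng)
y_cls = rng.randint(0, 2, size=100)

X_sel = SelectKBest(chi2, k=10).fit_transform(X_sparse, y_cls)
print(type(X_sel), X_sel.shape)

The longer example below, adapted from the scikit-learn gallery, compares univariate F-test scores with the weights of an SVM on the iris data augmented with 20 noise features.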
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif

# Add 20 uninformative noise features to the 4 iris features
iris = datasets.load_iris()
E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))
X = np.hstack((iris.data, E))
y = iris.target

plt.figure(1)
plt.clf()
X_indices = np.arange(X.shape[-1])

# Univariate feature selection with F-test scoring
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X, y)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)',
        color='darkorange', edgecolor='black')

# Weights of an SVM trained on all features
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
svm_weights = (clf.coef_ ** 2).sum(axis=0)
svm_weights /= svm_weights.max()
plt.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight',
        color='navy', edgecolor='black')

# Weights of an SVM trained only on the selected features
clf_selected = svm.SVC(kernel='linear')
clf_selected.fit(selector.transform(X), y)
svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
svm_weights_selected /= svm_weights_selected.max()
plt.bar(X_indices[selector.get_support()] - .05, svm_weights_selected, width=.2,
        label='SVM weights after selection', color='c', edgecolor='black')

plt.title("Comparing feature selection")
plt.xlabel('Feature number')
plt.yticks(())
plt.axis('tight')
plt.legend(loc='upper right')
plt.show()
Note (2): the F-test evaluates the linear dependence between two random variables, whereas mutual information can capture any kind of statistical dependence; however, being non-parametric, it needs more samples for an accurate estimate. The example below compares the two on a target that depends linearly on x1, non-linearly on x2, and not at all on x3:
from sklearn.feature_selection import f_regression, mutual_info_regression

# y depends linearly on x1, non-linearly on x2, and not at all on x3
np.random.seed(0)
X = np.random.rand(1000, 3)
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)

f_test, _ = f_regression(X, y)
f_test /= np.max(f_test)
mi = mutual_info_regression(X, y)
mi /= np.max(mi)

plt.figure(figsize=(15, 5))
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plt.scatter(X[:, i], y, edgecolor='black', s=20)
    plt.xlabel("$x_{}$".format(i + 1), fontsize=14)
    if i == 0:
        plt.ylabel("$y$", fontsize=14)
    plt.title("F-test={:.2f}, MI={:.2f}".format(f_test[i], mi[i]), fontsize=16)
plt.show()
Given a predictive model that assigns weights to features (e.g. the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller feature sets. The procedure is as follows: first, train the estimator on the initial feature set and obtain each feature's importance from the coef_ or feature_importances_ attribute; then remove the least important feature(s) from the current set; finally, repeat this process recursively until the desired number of features is reached.
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Rank every pixel: eliminate one feature per iteration until one remains
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)

plt.matshow(ranking, cmap=plt.cm.Blues)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()
RFECV runs RFE inside a cross-validation loop so that the optimal number of features is chosen automatically:

from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2), scoring='accuracy')
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)

# Note: in recent scikit-learn versions grid_scores_ has been replaced
# by rfecv.cv_results_['mean_test_score']
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
SelectFromModel can be used with any estimator that has a coef_ or feature_importances_ attribute after fitting, which is why it is also called a meta-transformer. You can either set a numeric threshold so that features scoring below it are dropped, or pass a string argument to use a built-in heuristic for the threshold, such as "mean", "median", or a scaled value like "0.1*mean".
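A minimal sketch of the string heuristics, assuming iris data and an ExtraTreesClassifier purely for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

# Keep features whose importance exceeds the median importance
sfm_median = SelectFromModel(clf, threshold="median", prefit=True)
print('threshold="median":', sfm_median.transform(X).shape)

# Keep features whose importance exceeds 0.1 times the mean importance
sfm_scaled = SelectFromModel(clf, threshold="0.1*mean", prefit=True)
print('threshold="0.1*mean":', sfm_scaled.transform(X).shape)

The longer example below, adapted from the scikit-learn gallery, instead starts from a numeric threshold and raises it until only the two most important Boston housing features remain.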
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# load_boston has been removed from recent scikit-learn versions;
# substitute another regression dataset (e.g. fetch_california_housing) there
boston = load_boston()
X, y = boston['data'], boston['target']

# Start with a fixed threshold, then raise it until only 2 features remain
clf = LassoCV(cv=5)
sfm = SelectFromModel(clf, threshold=0.25)
sfm.fit(X, y)
X_transform = sfm.transform(X)
n_features = X_transform.shape[1]

while n_features > 2:
    sfm.threshold += 0.1
    X_transform = sfm.transform(X)
    n_features = X_transform.shape[1]

plt.title("Features selected from Boston using SelectFromModel with "
          "threshold %0.3f." % sfm.threshold)
feature1 = X_transform[:, 0]
feature2 = X_transform[:, 1]
plt.plot(feature1, feature2, 'r.')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.ylim([np.min(feature2), np.max(feature2)])
plt.show()
Linear models penalized with the L1 norm yield sparse solutions: most of the estimated coefficients are zero. When the goal is to reduce the dimensionality of the data for use with another classifier, the features with non-zero coefficients can be selected with SelectFromModel. Sparse estimators commonly used for this purpose are linear_model.Lasso (regression) and linear_model.LogisticRegression and svm.LinearSVC (classification).
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
print('X shape:', X.shape)

# The smaller C is, the fewer features are selected
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print('X_new shape:', X_new.shape)
Tree-based estimators (the sklearn.tree module and the sklearn.ensemble module) can compute feature importances, which in turn can be used to discard irrelevant features.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
print('X shape:', X.shape)
clf = ExtraTreesClassifier(n_estimators=10)
clf = clf.fit(X, y)
print('feature_importance:', clf.feature_importances_)
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
print('X_new shape:', X_new.shape)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: only 3 of the 10 features are informative
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           random_state=0, shuffle=False)

forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_
# Standard deviation of the importances across the individual trees
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices], color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesClassifier

# Keep only the first 5 subjects of the Olivetti faces dataset
data = fetch_olivetti_faces()
X = data.images.reshape((len(data.images), -1))
y = data.target
mask = y < 5
X = X[mask]
y = y[mask]

forest = ExtraTreesClassifier(n_estimators=1000, max_features=128, n_jobs=1, random_state=0)
forest.fit(X, y)
# Reshape the per-pixel importances back into image form for plotting
importances = forest.feature_importances_
importances = importances.reshape(data.images[0].shape)

plt.matshow(importances, cmap=plt.cm.hot)
plt.title("Pixel importances with forests of trees")
plt.show()
Feature selection is often used as a preprocessing step before the actual learning. In scikit-learn this can be done with sklearn.pipeline.Pipeline. The example below combines LinearSVC with SelectFromModel to evaluate feature importances and select features, then trains a RandomForestClassifier on the transformed output, i.e. on the selected features only.
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False, max_iter=3000))),
    ('classification', RandomForestClassifier(n_estimators=100))
])
clf.fit(X, y)
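As a small follow-up, the fitted pipeline can be inspected to see which features were kept and used directly for prediction; named_steps and get_support are standard Pipeline and SelectFromModel attributes, and the step name matches the one used above:

# Boolean mask of the features kept by the selection step
print('Selected features:', clf.named_steps['feature_selection'].get_support())
# At prediction time the pipeline applies the selection, then the classifier
print('Predictions for the first 5 samples:', clf.predict(X[:5]))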