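Mathematical Derivation + Pure Python Implementation of Machine Learning Algorithms 20: Random Forest

2021-03-02 Machine Learning Lab

Python Implementations of Machine Learning Algorithms

Author: louwill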

Machine Learning Lab

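With the 14th article, the single-model algorithms were essentially all covered. We then moved on to the ensemble learning series and spent five full articles on Boosting, the most representative framework in ensemble learning, going from AdaBoost through the GBDT family with a fairly detailed look at XGBoost, LightGBM and CatBoost. This article, the last one on ensemble learning, introduces the Bagging framework, which differs from Boosting.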

Bagging and Random Forest

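Bagging is the most typical representative of parallel ensemble learning methods. Its core concept is bootstrap sampling: given a dataset of m samples, randomly draw one sample with replacement and place it in the sampling set; after m draws we obtain a sampling set of the same size as the original dataset. We can draw T such sampling sets of m samples each, train one base learner on each sampling set, and finally combine these base learners. This is the main idea of Bagging. A schematic of Bagging and Boosting is shown below:

[Figure: schematic comparison of Bagging and Boosting]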

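As the figure shows clearly, Bagging is a parallel framework, whereas Boosting is a sequential one (although it can also be implemented with some parallelism).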
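
To make the bootstrap sampling step described above concrete, here is a minimal sketch using numpy. The function name `bootstrap_sample` and the toy arrays are assumptions for illustration only and are not part of the original article. It also checks a well-known property: because each sample is missed by a single draw with probability 1 - 1/m, roughly 1 - 1/e (about 63.2%) of the original samples appear in one bootstrap sample when m is large.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw one bootstrap sample: m row indices chosen with replacement."""
    m = X.shape[0]
    idx = rng.integers(0, m, size=m)  # indices may repeat
    return X[idx], y[idx], idx

rng = np.random.default_rng(0)
X_toy = np.arange(2000).reshape(1000, 2)  # toy data: 1000 samples, 2 features
y_toy = np.arange(1000) % 2

X_boot, y_boot, idx = bootstrap_sample(X_toy, y_toy, rng)

# Fraction of distinct original samples that made it into the bootstrap sample
unique_fraction = np.unique(idx).size / X_toy.shape[0]
print(X_boot.shape, y_boot.shape)
print(round(unique_fraction, 3))  # expected to be near 1 - 1/e
```

The samples that are never drawn for a given tree are the reason each base learner sees a slightly different view of the data.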

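With the groundwork from the earlier articles on decision trees and the basic idea of Bagging described above, random forest holds little that is hard to understand. A random forest is a forest built from many decision trees, and it is called random because of the randomness in how it is constructed. The random forest algorithm is a typical representative of the Bagging framework.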

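For the process of building a decision tree, see articles 4 and 5 of this series; it is not repeated here. Since the underlying derivations were completed in those earlier articles, we can go straight to the procedure of the random forest algorithm, which in short comes down to two sources of randomness. The steps are as follows: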

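1. Given M samples, randomly select M samples with replacement (draw one sample at a time, put it back, then draw again).

2. Given N features per sample, whenever a node of the decision tree needs to be split, randomly pick n of the N features, with n << N, and choose the splitting feature from among these n features.

3. Build a decision tree by node splitting, based on the M sampled rows and the n features.

4. Repeat steps 1 to 3 to build a large number of decision trees that make up the random forest, then aggregate the results of the trees (voting for classification, averaging for regression).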
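
The four steps above can be sketched compactly as follows. This is only an illustration under stated assumptions: it uses scikit-learn's `DecisionTreeClassifier` as a stand-in base learner (not the author's `ClassificationTree`), relying on its `max_features` parameter to consider a random subset of features at each split, and the names `fit_forest` and `predict_forest` as well as the parameter values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=10, n_sub_features=4, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    forest = []
    for _ in range(n_trees):
        # Step 1: draw m row indices with replacement
        idx = rng.integers(0, m, size=m)
        # Step 2: the tree considers n_sub_features random features per split
        tree = DecisionTreeClassifier(
            max_features=n_sub_features,
            max_depth=3,
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        # Step 3: grow the tree on the bootstrap sample
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def predict_forest(forest, X):
    # Step 4: majority vote over the trees (labels must be non-negative ints)
    votes = np.stack([tree.predict(X) for tree in forest]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])

X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=2, n_redundant=0,
                                     n_clusters_per_class=1, random_state=1)
forest = fit_forest(X_demo, y_demo)
pred = predict_forest(forest, X_demo)
print((pred == y_demo).mean())  # training accuracy of the small forest
```

For regression, the vote in step 4 would be replaced by an average of the trees' predictions.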

So once we are familiar with the basic idea of Bagging and with how a decision tree is built, random forest is easy to understand.

Implementing the Random Forest Algorithm

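In this article we implement a random forest by hand using numpy. The idea behind the implementation is quite clear, but building it from scratch requires a lot of decision tree groundwork, such as defining tree nodes, building a basic decision tree, and building classification and regression trees on top of it. The author has already implemented all of these in earlier articles, so they are not repeated here.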

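On that basis, building the random forest mainly involves randomly selecting samples, randomly selecting features, constructing the forest and fitting each tree in it, and producing the forest's prediction from the predictions of the individual trees. Note that, for simplicity, the code below draws the feature subset once per tree rather than at every node split.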

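Import the required modules and generate a simulated dataset.

import numpy as np
# ClassificationTree is the classification tree implemented in an earlier article
from ClassificationTree import *
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Number of trees
n_estimators = 10
# Number of features used by each tree
max_features = 15
# Generate a simulated binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)
# Add uniform noise to the features
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
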
Define the first source of randomness, row sampling to select samples:

def bootstrap_sampling(X, y):
    # Stack features and labels so that rows stay aligned during sampling
    X_y = np.concatenate([X, y.reshape(-1, 1)], axis=1)
    np.random.shuffle(X_y)
    n_samples = X.shape[0]
    sampling_subsets = []

    for _ in range(n_estimators):
        # Draw n_samples row indices with replacement
        idx1 = np.random.choice(n_samples, n_samples, replace=True)
        bootstrap_Xy = X_y[idx1, :]
        bootstrap_X = bootstrap_Xy[:, :-1]
        bootstrap_y = bootstrap_Xy[:, -1]
        sampling_subsets.append([bootstrap_X, bootstrap_y])
    return sampling_subsets

Then build the random forest from classification trees:

trees = []
# Build the forest from decision trees
for _ in range(n_estimators):
    tree = ClassificationTree(min_samples_split=2, min_impurity=0,
                              max_depth=3)
    trees.append(tree)

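Define the training function, which fits each tree in the random forest.

def fit(X, y):
    n_features = X.shape[1]
    # One bootstrap sample per tree
    sub_sets = bootstrap_sampling(X, y)
    for i in range(n_estimators):
        sub_X, sub_y = sub_sets[i]
        # Second source of randomness: pick a feature subset for this tree.
        # Sampling without replacement avoids duplicated feature columns.
        idx2 = np.random.choice(n_features, max_features, replace=False)
        sub_X = sub_X[:, idx2]
        trees[i].fit(sub_X, sub_y)
        # Remember which features this tree was trained on
        trees[i].feature_indices = idx2
        print('Tree {} has been trained.'.format(i+1))
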
We now encapsulate the process above, defining the bootstrap sampling method, the random forest training method, and the prediction method. The complete code is as follows:

class RandomForest():
    def __init__(self, n_estimators=100, min_samples_split=2, min_gain=0,
                 max_depth=float("inf"), max_features=None):
        # Number of trees
        self.n_estimators = n_estimators
        # Minimum number of samples required to split a node
        self.min_samples_split = min_samples_split
        # Minimum gain required to split a node
        self.min_gain = min_gain
        # Maximum depth of each tree
        self.max_depth = max_depth
        # Number of features used by each tree
        self.max_features = max_features

        # Build the forest from (untrained) classification trees
        self.trees = []
        for _ in range(self.n_estimators):
            tree = ClassificationTree(min_samples_split=self.min_samples_split,
                                      min_impurity=self.min_gain,
                                      max_depth=self.max_depth)
            self.trees.append(tree)

    def bootstrap_sampling(self, X, y):
        # Stack features and labels so that rows stay aligned during sampling
        X_y = np.concatenate([X, y.reshape(-1, 1)], axis=1)
        np.random.shuffle(X_y)
        n_samples = X.shape[0]
        sampling_subsets = []

        for _ in range(self.n_estimators):
            # Draw n_samples row indices with replacement
            idx1 = np.random.choice(n_samples, n_samples, replace=True)
            bootstrap_Xy = X_y[idx1, :]
            bootstrap_X = bootstrap_Xy[:, :-1]
            bootstrap_y = bootstrap_Xy[:, -1]
            sampling_subsets.append([bootstrap_X, bootstrap_y])
        return sampling_subsets

    def fit(self, X, y):
        # One bootstrap sample per tree
        sub_sets = self.bootstrap_sampling(X, y)
        n_features = X.shape[1]
        # Default to sqrt(n_features) features per tree
        if self.max_features is None:
            self.max_features = int(np.sqrt(n_features))

        for i in range(self.n_estimators):
            sub_X, sub_y = sub_sets[i]
            # Pick a feature subset without replacement to avoid duplicated columns
            idx2 = np.random.choice(n_features, self.max_features, replace=False)
            sub_X = sub_X[:, idx2]
            self.trees[i].fit(sub_X, sub_y)
            # Remember which features this tree was trained on
            self.trees[i].feature_indices = idx2
            print('Tree {} has been trained.'.format(i+1))

    def predict(self, X):
        y_preds = []
        for i in range(self.n_estimators):
            # Use the same feature subset the tree was trained on
            idx = self.trees[i].feature_indices
            sub_X = X[:, idx]
            y_pred = self.trees[i].predict(sub_X)
            y_preds.append(y_pred)

        # Rows become samples, columns become trees
        y_preds = np.array(y_preds).T
        res = []
        for j in y_preds:
            # Majority vote across trees for each sample
            res.append(np.bincount(j.astype('int')).argmax())
        return res

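The voting step in the prediction method can be seen in isolation with a small worked example. The prediction matrix below is made up for illustration (three hypothetical trees, four samples), and the helper name `majority_vote` is an assumption rather than part of the article's code. It relies on `np.bincount` counting how often each non-negative integer label occurs and on `argmax` returning the most frequent one.

```python
import numpy as np

# Hypothetical predictions: one row per tree, one column per sample
tree_preds = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])

def majority_vote(tree_preds):
    # Transpose so that each row holds all trees' votes for one sample
    per_sample = np.asarray(tree_preds).T.astype(int)
    return np.array([np.bincount(votes).argmax() for votes in per_sample])

print(majority_vote(tree_preds))  # [0 1 1 0]
```

One design consequence worth knowing: when two labels receive the same number of votes, `argmax` returns the first maximum, so ties are resolved in favor of the smaller label.
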
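Train and validate on the simulated dataset using the random forest class above:

# accuracy_score is needed to evaluate the predictions
from sklearn.metrics import accuracy_score

rf = RandomForest(n_estimators=10, max_features=15)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(accuracy_score(y_test, y_pred))
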
sklearn also provides an API for the random forest algorithm, so let us try it and compare the result with the handwritten numpy version:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

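As we can see, sklearn's prediction accuracy is slightly higher than that of our handwritten version. Of course, our result could be improved further through parameter tuning. For tuning a random forest, refer to the official sklearn documentation; it is skipped here.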
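
Although tuning is outside the scope of this article, a minimal sketch of what such a search might look like with sklearn's `GridSearchCV` is given here. The parameter grid, the 3-fold setting and the variable names are illustrative assumptions, not recommendations from the original text, and the dataset is regenerated so that the snippet is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Regenerate a dataset similar to the one used in this article
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_redundant=0, n_informative=2,
                                     random_state=1, n_clusters_per_class=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0)

# Illustrative grid only: a real search would cover values suited to the data
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, 5],
    "max_features": ["sqrt", 15],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy of the refitted best model
```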