【機器學習基礎】xgboost系列丨xgboost建樹過程分析及代碼實現

2021-02-20 機器學習初學者

前面我們通過對論文中的公式詳細解讀，一步步推導了XGBoost的優化目標以及建樹方法。下面我們就來動手實踐，拿真實的數據來手動計算，並且使用python來實現一個簡易的XGBoost。

XGBoost的建樹過程包括下面幾個關鍵步驟。

計算分裂前樣本的G,H(每個樣本的g,h求和)

貪心枚舉每個特徵每個值做為分隔條件

計算分隔後左右節點的G_l,H_l,G_r,H_r，算出Gain

更新最大增益Gain_max，更新分隔點。

最終得到最優分隔點。

根據上述過程遞歸建樹直到終止條件。

計算葉節點的權重w

在建完一個樹後，再建下棵樹時，注意G/H/Gain的計算用到的預測值要進行更新，對之前的所有樹的預測值求和可以得到建下棵樹時的。

這裡我們使用一個簡單的UCI數據集：

http://archive.ics.uci.edu/ml/datasets/Container+Crane+Controller+Data+Set
數據集的樣本條數只有15條，2個特徵。具體如下：

<< 左右滑動查看更多 >>

import pandas as pd
df = pd.read_csv('1.csv',index_col=0)
# df = pd.read_clipboard(index_col=0) 可以直接複製下面這個表格後使用read_clipboard讀成DataFrame

對於二分類問題概率是預測值經過Sigmoid函數變換後得到的，默認預測概率為0.5，也就是默認的預測值為0。

<< 左右滑動查看更多 >>

def log_loss_obj(preds, labels):
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess

base_predict = np.zeros_like(df.y)
g,h = log_loss_obj(base_predict,df.y.values) # 計算每個樣本的g和h
df['g'], df['h'] = g,h

x1x2yghID

11-500.50.2522500.50.2533-21-0.50.254121-0.50.255201-0.50.2566-51-0.50.257751-0.50.2586-200.50.2597200.50.2510601-0.50.25118-51-0.50.2512951-0.50.251310-200.50.25148200.50.2515901-0.50.25

首先對特徵x1上的不同的值進行枚舉分裂增益。特徵一總共有以下幾個取值：

sorted(df.x1.unique())
# [1, 2, 3, 6, 7, 8, 9, 10]

對x1的取值進行排序後，最小值為1，顯然沒有樣本的x1特徵值<1，以x1劃分相當於沒有進行劃分，因此從x1=2進行劃分。

<< 左右滑動查看更多 >>

def split_data(df, split_feature, split_value):
    left_instance = df[df[split_feature] < split_value]
    right_instance = df[df[split_feature] >= split_value]
    return left_instance, right_instance

left_instance, right_instance = split_data(df,'x1',2)
left_instance.index.tolist(),right_instance.index.tolist()
#  ([1, 4], [2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])

可以看到樣本1,4被劃分到了左側，其餘樣本劃分到了右側。對劃分後的數據計算G/H，並求出分裂後的增益Gain為0.055727554179566596。

<< 左右滑動查看更多 >>

reg_lambda = 1
G,H = df.g.sum(), df.h.sum() # 分裂前全部樣本的G/H
for thresh_values in sorted(df.x1.unique())[1:]:
    G_left, H_left = left_instance[['g', 'h']].sum() # 分裂後的G_l,H_l
    G_right, H_right = right_instance[['g', 'h']].sum() # 分裂後的G_r,H_r

    Gain = G_left**2/(H_left+reg_lambda)+G_right**2 / \
        (H_right+reg_lambda)-G**2/(H+reg_lambda) # 分裂後的增益計算

對x1每個取值進行劃分，得到下面不同劃分下的增益值。

G_leftH_leftG_rightH_rightGain20.00.50-1.53.250.05572830.01.00-1.52.750.1263166-0.51.25-1.02.50-0.0768597-1.02.00-0.51.75-0.0494428-1.02.50-0.51.25-0.0768599-1.03.00-0.50.75-0.08082710-2.03.500.50.250.615205

同樣，對於特徵x2每個取值的分裂情況：

G_leftH_leftG_rightH_rightGain-2-0.50.75-1.03.00-0.08082700.01.50-1.52.250.2186232-1.52.250.01.500.2186235-1.03.00-0.50.75-0.080827

我們可以看到當最大的增益為特徵x1=10時，增益為0.615205，因此將x1<10作為第一個分裂條件。

接下來開始進行第2個分裂條件的選擇：

<< 左右滑動查看更多 >>

use_instance,_ = split_data(df,'x1',10)

G,H = use_instance.g.sum(), use_instance.h.sum()
for thresh_values in sorted(use_instance.x1.unique())[1:]:
    left_instance, right_instance = split_data(use_instance,'x1',thresh_values)
    G_left, H_left = left_instance[['g', 'h']].sum()
    G_right, H_right = right_instance[['g', 'h']].sum()
    Gain = G_left**2/(H_left+reg_lambda)+G_right**2 / \
        (H_right+reg_lambda)-G**2/(H+reg_lambda)

對於特徵x1：

G_leftH_leftG_rightH_rightGain20.00.50-2.03.000.11111130.01.00-2.02.500.2539686-0.51.25-1.52.25-0.0854707-1.02.00-1.01.50-0.1555568-1.02.50-1.01.00-0.1031759-1.03.00-1.00.500.027778

對於特徵x2：

G_leftH_leftG_rightH_rightGain-2-0.50.75-1.52.75-0.1460320-0.51.25-1.52.25-0.0854702-2.02.000.01.500.4444445-1.52.75-0.50.75-0.146032

因此最優的劃分為特徵x2<2。

通過不斷的對上述過程迭代，即可遞歸地建出第一棵樹。

簡易版XGBoost實現

我們首先使用XGBoost的開源實現來構建模型，和下面我們的實現做對比。

<< 左右滑動查看更多 >>

import xgboost as xgb
model = xgb.XGBClassifier(n_estimators=2,max_depth=3,learning_rate=0.1,random_state=1,reg_lambda=1,gamma=0,objective='binary:logistic',min_child_weight=0,)
model.fit(df[['x1','x2']],df.y)
xgb.to_graphviz(model,num_trees=0)

可以看到第一棵樹的結構如下：

第二棵樹的結構如下：

xgb.to_graphviz(model,num_trees=1)

下面我們來用Python實現自己的簡易版XGBoost。
首先創建節點類，通過節點的遞歸構建出一棵樹。

<< 左右滑動查看更多 >>

from graphviz import Digraph

class Node(object):
    """結點
       leaf_value ：記錄葉子結點值
       split_feature ：分裂節點對應的特徵名
       split_value ：分裂節點對應的特徵的值
       left ：左子樹
       right ：右子樹
    """

    def __init__(self, leaf_value=None, split_feature=None, split_value=None, left=None, right=None):
        self.leaf_value = leaf_value
        self.split_feature = split_feature
        self.split_value = split_value
        self.left = left
        self.right = right

    def show(self):
        print(
            f'weight: {self.leaf_value}, split_feature: {self.split_feature}, split_value: {self.split_value}.')

    def visualize_tree(self):
        """
        使用graphviz遞歸繪製樹
        """
        def add_nodes_edges(self, dot=None):
            if dot is None:
                dot = Digraph()
                dot.node(name=str(self),
                         label=f'{self.split_feature}<{self.split_value}')
            # Add nodes
            if self.left:
                if self.left.leaf_value:
                    dot.node(name=str(self.left),
                             label=f'leaf={self.left.leaf_value:.10f}')
                else:
                    dot.node(name=str(self.left),
                             label=f'{self.left.split_feature}<{self.left.split_value}')
                dot.edge(str(self), str(self.left))
                dot = add_nodes_edges(self.left, dot=dot)
            if self.right:
                if self.right.leaf_value:
                    dot.node(name=str(self.right),
                             label=f'leaf={self.right.leaf_value:.10f}')
                else:
                    dot.node(name=str(self.right),
                             label=f'{self.right.split_feature}<{self.right.split_value}')
                dot.edge(str(self), str(self.right))
                dot = add_nodes_edges(self.right, dot=dot)
            return dot

        dot = add_nodes_edges(self)
        return dot

注意這裡我們寫了一個visualize_tree方法，使用graphviz來繪製建好的樹的結構，在XGBoost的開源實現中也是使用類似的方案。該方法不是必須的，但是卻對我們了解建樹過程有著很好的幫助。同時我們使用show方法將節點的信息也字符形式顯示出來。

下面來看一下損失函數和建樹過程的實現：

<< 左右滑動查看更多 >>

# 樹的結構參數
reg_lambda = 1 # 葉節點權重L2正則係數
min_samples_split = 1 # 分裂所需的最小樣本個數
max_depth = 3 # 樹的深度

# 建樹過程參數
learning_rate = 0.1 # 學習率
n_estimators = 2 # 樹的個數

# log損失函數
def log_loss_obj(preds, labels):
    # preds是建該樹之前模型的輸出，對於二分類問題需要的是概率，因此將該值經過Sigmoid轉換
    probs = 1.0 / (1.0 + np.exp(-preds))
    grad = probs - labels
    hess = probs * (1.0 - probs)
    return grad, hess

def build_tree(df, feature_names, depth=1):
    df = df.copy()
    df['g'], df['h'] = log_loss_obj(df.y_pred, df.y)
    G, H = df[['g', 'h']].sum()
    Gain_max = float('-inf')

    # 終止條件當前節點個數小於分裂所需最小樣本個數，深度大於max_depth，葉節點只有一類樣本無需再分
    if df.shape[0] > min_samples_split and depth <= max_depth and df.y.nunique() > 1:
        for feature in feature_names: # 遍歷每個特徵
            thresholds = sorted(set(df[feature])) # 特徵取值排序
            for thresh_value in thresholds[1:]: # 遍歷每個取值
                left_instance = df[df[feature] < thresh_value] # 劃分到左右節點的樣本
                right_instance = df[df[feature] >= thresh_value]
                G_left, H_left = left_instance[['g', 'h']].sum()
                G_right, H_right = right_instance[['g', 'h']].sum()

                Gain = G_left**2/(H_left+reg_lambda)+G_right**2 / \
                    (H_right+reg_lambda)-G**2/(H+reg_lambda) # 評價劃分的增益效果
                if Gain >= Gain_max:
                    Gain_max = Gain
                    split_feature = feature # 最大增益對應的分裂特徵
                    split_value = thresh_value # 最大增益對應的分裂值
                    left_data = left_instance # 最大增益對應的分裂後左節點樣本
                    right_data = right_instance # 最大增益對應的分裂後右節點樣本

        left = build_tree(left_data, feature_names,  depth+1) # 遞歸建左子樹
        right = build_tree(right_data, feature_names, depth+1)# 遞歸建右子樹
        return Node(split_feature=split_feature, split_value=split_value, left=left, right=right) # 返回分裂節點
    return Node(leaf_value=-G/(H+reg_lambda)*learning_rate) # 分裂終止，返回葉節點權重

對於單棵樹的預測，我們使用predict函數，根據樹的分裂條件以及當前數據的信息來確認每個樣本被分到當前這棵樹的哪個葉節點，得到對應葉節點的權重。

<< 左右滑動查看更多 >>

def predict(x, tree):
    # 遞歸每個分裂點直到樣本對應的葉節點
    # 終止條件：葉節點
    if tree.leaf_value is not None:
        return tree.leaf_value
    if x[tree.split_feature] < tree.split_value:
        return predict(x, tree.left)
    else:
        return predict(x, tree.right)

根據已建好的樹來更新預測結果，進而確定新建樹的結構，不斷迭代最終得到全部的樹：

<< 左右滑動查看更多 >>

trees = []
y_pred = 0  # 初始預測值
for i in range(n_estimators):
    df['y_pred'] = y_pred
    tree = build_tree(df, feature_names=['x1', 'x2'])
    data_weight = df[['x1', 'x2']].apply(
        predict, tree=tree, axis=1)
    y_pred += data_weight
    trees.append(tree)

對於已建好的樹，使用我們前面定義的visualize_tree函數，可以方便的看到樹的結構。

第一棵數的結構：

trees[0].visualize_tree()

第二棵樹的結構：

trees[1].visualize_tree()

可以看到，我們自己實現的XGBoost與開源的XGBoost得到的樹的結構是一致的。

<< 左右滑動查看更多 >>

import numpy as np
import pandas as pd
import xgboost as xgb
from graphviz import Digraph

class Node(object):
    """結點
       leaf_value ：記錄葉子結點值
       split_feature ：特徵i
       split_value ：特徵i的值
       left ：左子樹
       right ：右子樹
    """

    def __init__(self, leaf_value=None, split_feature=None, split_value=None, left=None, right=None):
        self.leaf_value = leaf_value
        self.split_feature = split_feature
        self.split_value = split_value
        self.left = left
        self.right = right

    def show(self):
        print(
            f'weight: {self.leaf_value}, split_feature: {self.split_feature}, split_value: {self.split_value}.')

    def visualize_tree(self):
        """
        遞歸查找繪製樹
        """

        def add_nodes_edges(self, dot=None):
            if dot is None:
                dot = Digraph()
                dot.node(name=str(self),
                         label=f'{self.split_feature}<{self.split_value}')
            # Add nodes
            if self.left:
                if self.left.leaf_value:
                    dot.node(name=str(self.left),
                             label=f'leaf={self.left.leaf_value:.10f}')
                else:
                    dot.node(name=str(self.left),
                             label=f'{self.left.split_feature}<{self.left.split_value}')
                dot.edge(str(self), str(self.left))
                dot = add_nodes_edges(self.left, dot=dot)
            if self.right:
                if self.right.leaf_value:
                    dot.node(name=str(self.right),
                             label=f'leaf={self.right.leaf_value:.10f}')
                else:
                    dot.node(name=str(self.right),
                             label=f'{self.right.split_feature}<{self.right.split_value}')
                dot.edge(str(self), str(self.right))
                dot = add_nodes_edges(self.right, dot=dot)
            return dot

        dot = add_nodes_edges(self)
        return dot

def log_loss_obj(preds, labels):
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess

def mse_obj(preds, labels):
    grad = labels-y_pred
    hess = np.ones_like(labels)
    return grad, hess

class XGB:
    def __init__(self, n_estimators=2, learning_rate=0.1, max_depth=3, min_samples_split=0, reg_lambda=1, base_score=0.5, loss=log_loss_obj):
        # 學習控制參數
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.base_score = base_score
        # 樹參數
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.loss = loss
        self.reg_lambda = reg_lambda

        self.trees = []
        self.feature_names = None

    def sigmoid_array(self, x):
        return 1 / (1 + np.exp(-x))

    def _predict(self, x, tree):
        # 循環終止條件：葉節點
        if tree.leaf_value is not None:
            return tree.leaf_value
        if x[tree.split_feature] < tree.split_value:
            return self._predict(x, tree.left)
        else:
            return self._predict(x, tree.right)

    def _build_tree(self, df, depth=1):
        df = df.copy()
        df['g'], df['h'] = self.loss(df.y_pred, df.y)
        G, H = df[['g', 'h']].sum()

        Gain_max = float('-inf')
        if df.shape[0] > self.min_samples_split and depth <= self.max_depth and df.y.nunique() > 1:
            for feature in self.feature_names:
                thresholds = sorted(set(df[feature]))
                for thresh_value in thresholds[1:]:
                    left_instance = df[df[feature] < thresh_value]
                    right_instance = df[df[feature] >= thresh_value]
                    G_left, H_left = left_instance[['g', 'h']].sum()
                    G_right, H_right = right_instance[['g', 'h']].sum()

                    Gain = G_left**2/(H_left+self.reg_lambda)+G_right**2 / \
                        (H_right+self.reg_lambda)-G**2/(H+self.reg_lambda)
                    if Gain >= Gain_max:
                        Gain_max = Gain
                        split_feature = feature
                        split_value = thresh_value
                        left_data = left_instance
                        right_data = right_instance
                    # print(feature,'Gain:',Gain,'G-Left',G_left,'H-left',H_left,'G-Right',G_right,'H-right',H_right,'----',thresh_value)
            # print(Gain_max,split_feature,split_value)

            left = self._build_tree(left_data,  depth+1)
            right = self._build_tree(right_data,  depth+1)
            return Node(split_feature=split_feature, split_value=split_value, left=left, right=right)
        return Node(leaf_value=-G/(H+self.reg_lambda)*self.learning_rate)

    def fit(self, X, y):
        y_pred = -np.log((1/self.base_score)-1)
        df = pd.DataFrame(X)
        df['y'] = y
        self.feature_names = df.columns.tolist()[:-1]

        for i in range(self.n_estimators):
            df['y_pred'] = y_pred
            tree = self._build_tree(df)
            data_weight = df[['x1', 'x2']].apply(
                self._predict, tree=tree, axis=1)
            y_pred += data_weight
            self.trees.append(tree)

    def predict(self, X):
        df = pd.DataFrame(X)
        y_pred = -np.log((1/self.base_score)-1)
        for tree in self.trees:
            df['y_pred'] = y_pred
            data_weight = df[['x1', 'x2']].apply(
                self._predict, tree=tree, axis=1)
            y_pred += data_weight
        return self.sigmoid_array(y_pred)

    def __repr__(self):
        return 'XGBClassifier('+', '.join(f'{k}={v}' for k, v in self.__dict__.items() if not k.startswith('_'))+')'

使用前面同樣的方法，調用我們自己封裝好的代碼，可以看到很簡潔的實現了跟官方開源實現類似的效果。

df = pd.read_csv('1.csv',index_col=0)
model = XGB()
model.fit(df[['x1', 'x2']], df.y)
model.predict(df[['x1', 'x2']])
model.trees[0].visualize_tree()

當然這裡只是一個DEMO，來幫助我們更好的理解論文中的公式以及XGBoost的實現過程，更多的細節優化可以參考官方實現：https://github.com/dmlc/xgboost

參考

https://xgboost.readthedocs.io/en/latest/tutorials/model.html

https://blog.csdn.net/qq_22238533/article/details/79477547

https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py

https://github.com/lxmly/machine-learning/blob/master/decision_tree.py

獲取本站知識星球優惠券，複製連結直接打開：

https://t.zsxq.com/qFiUFMV

本站qq群704220115。

加入微信群請掃碼：

【機器學習基礎】xgboost系列丨xgboost建樹過程分析及代碼實現

相關焦點

【機器學習】XGBoost算法學習小結

從零解讀 Xgboost 算法 (原理+代碼)

Python XGBoost算法代碼實現和篩選特徵應用

資料|陳天奇介紹Xgboost原理的PPT

資源 | XGBoost 中文文檔開放:上去就是一把梭

機器學習實戰|GBDT Xgboost LightGBM對比

我的XGBoost學習經歷及動手實踐

第113天: Python XGBoost 算法項目實戰

數學推導 + 純 Python 實現 XGBoost

機器學習 | 機器學習之XGBoost Part2

中信建投證券:xgboost中證500指數增強7月持倉組合發布

100天搞定機器學習|Day61 手算+可視化,終於徹底理解了 XGBoost

學界 | 多 GPU 加速學習,這是一份嶄新的 XGBoost 庫

數學推導+純Python實現機器學習算法17:XGBoost

SHAP——Xgboost的歸因新境界

大戰三回合:XGBoost、LightGBM和Catboost一決高低

基於Python實現XGBoost

XGBoost源碼解讀

XGBoost缺失值引發的問題及其深度分析|CSDN博文精選