a univariate quadratic function of the leaf weight, so its extremum can be obtained directly from the standard vertex formula for a quadratic. Since the leaves are mutually independent, the overall loss reaches its optimum once every leaf attains its own optimal value. With the tree structure fixed, differentiating the expression above with respect to each leaf weight and setting the derivative to zero gives the optimal leaf weight and the corresponding optimal objective value:

w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{Obj}^* = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T

where G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i are the sums of the first- and second-order gradients of the samples assigned to leaf j.

This completes the derivation of the XGBoost loss function. The overall framework is still that of GBDT; what distinguishes XGBoost is its use of the second-order derivative information of the loss, which allows a more accurate approximation of the true loss.

[Figure: leaf-node weight (structure score) calculation, as illustrated in the XGBoost paper]

Because the second-order information already brings the approximation very close to the true loss, XGBoost's node-splitting procedure is essentially the same as that of a CART tree; only the computation of the information gain differs. Suppose the model splits a node on some feature. Before the split, the objective of that node can be written as

\mathrm{Obj}_{\text{before}} = -\frac{1}{2}\,\frac{(G_L + G_R)^2}{H_L + H_R + \lambda} + \gamma,

and after splitting it into a left and a right child it becomes

\mathrm{Obj}_{\text{after}} = -\frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}\right] + 2\gamma,

so the gain of the split is

\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma.

If Gain > 0, that is, if the objective decreases after the node is split into two leaves, the split is kept. In practice, all features (and their candidate split points) have to be enumerated to find the best splitting feature.

This concludes the core mathematical derivation of XGBoost. The complete XGBoost system involves many additional engineering details that are not elaborated here; readers are strongly encouraged to study the XGBoost paper carefully.

[Figure: schematic overview of the complete XGBoost derivation]

With the GBDT implementation as a reference, implementing XGBoost is not particularly difficult: most of the low-level code is similar, and the main changes concern the information-gain computation, the leaf-score computation, and the use of second-order derivative information of the loss. As before, we first lay out the code skeleton:
The definitions of the underlying decision-tree node and tree model are essentially the same as for GBDT and are not repeated here; see Lecture 15 on the GBDT implementation. The main work is to inherit from those base classes to define a single XGBoost tree and the fitting procedure of the full XGBoost model. (The Tree base class, LogisticLoss, to_categorical, train_test_split and accuracy_score used below are not redefined here; they are assumed to be the helper classes and functions accompanying the Lecture 15 GBDT implementation, where train_test_split takes a seed argument.)

import numpy as np

class XGBoostTree(Tree):
    # Split the concatenated matrix back into labels and current predictions
    def _split(self, y):
        col = int(np.shape(y)[1] / 2)
        y, y_pred = y[:, :col], y[:, col:]
        return y, y_pred

    # Structure score of a node (simplified: no lambda/gamma regularization)
    def _gain(self, y, y_pred):
        Gradient = np.power((y * self.loss.gradient(y, y_pred)).sum(), 2)
        Hessian = self.loss.hess(y, y_pred).sum()
        return 0.5 * (Gradient / Hessian)

    # Split gain based on the second-order Taylor expansion
    def _gain_by_taylor(self, y, y1, y2):
        # Split each matrix into labels and predictions
        y, y_pred = self._split(y)
        y1, y1_pred = self._split(y1)
        y2, y2_pred = self._split(y2)

        true_gain = self._gain(y1, y1_pred)
        false_gain = self._gain(y2, y2_pred)
        gain = self._gain(y, y_pred)
        return true_gain + false_gain - gain

    # Optimal leaf weight via a Newton step (G / H)
    def _approximate_update(self, y):
        # y is split into labels and current predictions
        y, y_pred = self._split(y)
        # Newton's method
        gradient = np.sum(y * self.loss.gradient(y, y_pred), axis=0)
        hessian = np.sum(self.loss.hess(y, y_pred), axis=0)
        update_approximation = gradient / hessian
        return update_approximation

    # Fit a single tree using the XGBoost gain and leaf-weight rules
    def fit(self, X, y):
        self._impurity_calculation = self._gain_by_taylor
        self._leaf_value_calculation = self._approximate_update
        super(XGBoostTree, self).fit(X, y)


class XGBoost(object):
    def __init__(self, n_estimators=200, learning_rate=0.001,
                 min_samples_split=2, min_impurity=1e-7, max_depth=2):
        self.n_estimators = n_estimators            # number of trees
        self.learning_rate = learning_rate          # step size for weight updates
        self.min_samples_split = min_samples_split  # minimum number of samples required to split
        self.min_impurity = min_impurity            # minimum impurity reduction required to split
        self.max_depth = max_depth                  # maximum tree depth

        # Cross-entropy loss
        self.loss = LogisticLoss()

        # Initialize one tree per boosting round (forward stagewise fitting)
        self.estimators = []
        for _ in range(n_estimators):
            tree = XGBoostTree(
                min_samples_split=self.min_samples_split,
                min_impurity=self.min_impurity,
                max_depth=self.max_depth,
                loss=self.loss)
            self.estimators.append(tree)

    def fit(self, X, y):
        y = to_categorical(y)
        y_pred = np.zeros(np.shape(y))
        for i in range(self.n_estimators):
            tree = self.estimators[i]
            y_and_pred = np.concatenate((y, y_pred), axis=1)
            tree.fit(X, y_and_pred)
            update_pred = tree.predict(X)
            y_pred -= np.multiply(self.learning_rate, update_pred)

    def predict(self, X):
        y_pred = None
        # Accumulate the (scaled) predictions of all trees
        for tree in self.estimators:
            update_pred = tree.predict(X)
            if y_pred is None:
                y_pred = np.zeros_like(update_pred)
            y_pred -= np.multiply(self.learning_rate, update_pred)
        # Softmax to turn the raw scores into class probabilities
        y_pred = np.exp(y_pred) / np.sum(np.exp(y_pred), axis=1, keepdims=True)
        # Convert probability predictions into class labels
        y_pred = np.argmax(y_pred, axis=1)
        return y_pred

We test the implementation on the iris dataset:

from sklearn import datasets

data = datasets.load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, seed=2)

clf = XGBoost()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9666666666666667

XGBoost also comes with an official library that we can call directly; install it with pip install xgboost:

import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt

# Model parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 2,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'silent': 1,
    'eta': 0.001,
    'seed': 1000,
    'nthread': 4,
}
plst = list(params.items())

dtrain = xgb.DMatrix(X_train, y_train)
num_rounds = 200
model = xgb.train(plst, dtrain, num_rounds)

# Predict on the test set
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Plot feature importances
plot_importance(model)
plt.show()

Accuracy: 0.9166666666666666
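One piece the from-scratch code above does not show is the LogisticLoss class it depends on. The sketch below is a minimal, assumed implementation of that interface (a gradient method returning the first derivative of the cross-entropy loss with respect to the raw score, and a hess method returning the second derivative); the actual helper accompanying the Lecture 15 code may differ in details such as sign conventions.

import numpy as np

class LogisticLoss:
    # Minimal sketch of a cross-entropy loss exposing the gradient/hess
    # interface expected by XGBoostTree (an assumption, not the original helper).
    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def loss(self, y, y_pred):
        p = np.clip(self.sigmoid(y_pred), 1e-15, 1 - 1e-15)
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    def gradient(self, y, y_pred):
        # First derivative of the loss w.r.t. the raw score: p - y
        return self.sigmoid(y_pred) - y

    def hess(self, y, y_pred):
        # Second derivative of the loss w.r.t. the raw score: p * (1 - p)
        p = self.sigmoid(y_pred)
        return p * (1 - p)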
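Note also that the simplified structure score used in _gain above, 0.5 * G^2 / H, drops the regularization terms lambda and gamma that appear in the full split-gain formula from the derivation. For readers who want to sanity-check the formula itself, here is a small, purely illustrative sketch with made-up gradient and Hessian values; split_gain is a hypothetical helper written for this example, not part of the xgboost library or the code above.

import numpy as np

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # Gain = 0.5 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
    #               - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    def structure_score(G, H):
        return G ** 2 / (H + lam)
    return 0.5 * (structure_score(G_L, H_L) + structure_score(G_R, H_R)
                  - structure_score(G_L + G_R, H_L + H_R)) - gamma

# Toy first- and second-order gradients of four samples at a node
g = np.array([0.3, -0.8, 0.5, -0.2])
h = np.array([0.21, 0.16, 0.25, 0.16])
left = np.array([True, True, False, False])  # candidate split assignment

G_L, H_L = g[left].sum(), h[left].sum()
G_R, H_R = g[~left].sum(), h[~left].sum()

# The split is kept only if the gain is positive
print(split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.1))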
References:
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16).
https://www.jianshu.com/p/ac1c12f3fba1