My XGBoost Learning Journey and Hands-On Practice

2021-02-20 Datawhale


Author: Li Zuxian (李祖賢), Shenzhen University, member of the Datawhale university group

Zhihu: http://www.zhihu.com/people/meng-di-76-92

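Today I mainly want to introduce XGBoost, one of the "big three" ensemble learning methods in machine learning. It shone in machine learning competitions in earlier years and is a very handy ensemble algorithm.

XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately.

The same code runs on the major distributed environments (Hadoop, SGE, MPI) and can handle problems with billions of examples. XGBoost exploits out-of-core computation, which lets data scientists process hundreds of millions of samples on a single machine. Finally, these techniques are combined into an end-to-end system that scales to even larger datasets with minimal cluster resources.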

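How XGBoost Works

I started from zero and went through all the ups and downs of deriving the formulas. Below I show my own handwritten derivation, in the hope that it encourages everyone to love machine learning and data mining even more!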

[Figure: XGBoost formulas 2 (handwritten derivation)]

Now let's walk through the content of the notes in detail:

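Our task is to find a set of trees that minimizes Obj. This objective is simply the sum of the loss over the samples and a penalty on model complexity:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

2. Additive training (Additive Training, Boosting)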

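The core idea: once the first t-1 trees have been trained they are no longer adjusted, so the prediction after adding the t-th tree can be written as:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

(1). If we now train the t-th tree, the objective function is (with $l$ the loss and $\Omega$ the complexity penalty of a tree):

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \sum_{k=1}^{t-1} \Omega(f_k)$$

Taking the second-order Taylor expansion of the loss around $\hat{y}_i^{(t-1)}$:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \sum_{k=1}^{t-1} \Omega(f_k)$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right)$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right)$ are the first and second derivatives of the loss.

Since the first t-1 trees are known, $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ and $\sum_{k=1}^{t-1} \Omega(f_k)$ are constants and can be dropped, so

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

(2). The loss part has now been discussed thoroughly, but the second part, $\Omega(f_t)$, is still only a symbol without a definition, so let us define it. Suppose the t-th tree to be trained has T leaf nodes, and write the vector of leaf outputs as:

$$w = (w_1, w_2, \dots, w_T)$$

Let $q(x)$ denote the mapping from a sample to the index of the leaf it falls into, so that $f_t(x) = w_{q(x)}$. We then define:

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

(3). With $I_j = \{ i \mid q(x_i) = j \}$ the set of samples in leaf j, $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, the objective finally simplifies to:

$$Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$$

Having found the objective, we need to optimize it. Each term is a quadratic in $w_j$, so setting the derivative to zero gives:

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$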
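To make the closed-form solution concrete, here is a minimal sketch in plain Python. It assumes the squared-error loss l = 0.5 * (y_pred - y)^2, for which g_i = y_pred_i - y_i and h_i = 1, and evaluates the optimal leaf output w* = -G / (H + lambda) for a single leaf. The function names are made up for this illustration and are not part of the XGBoost API.

```python
def squared_error_grad_hess(y_true, y_pred):
    """First and second derivatives of 0.5 * (y_pred - y_true)**2 w.r.t. y_pred."""
    grads = [p - y for y, p in zip(y_true, y_pred)]
    hess = [1.0 for _ in y_true]
    return grads, hess


def leaf_weight(G, H, lam):
    """Optimal leaf output w* = -G / (H + lambda)."""
    return -G / (H + lam)


def leaf_score(G, H, lam):
    """Contribution of one leaf to the minimized objective: -0.5 * G^2 / (H + lambda)."""
    return -0.5 * G * G / (H + lam)


# Three samples that fall into the same leaf; the previous trees predict 0.0 for all of them.
y_true = [1.0, 2.0, 3.0]
y_prev = [0.0, 0.0, 0.0]

g, h = squared_error_grad_hess(y_true, y_prev)
G, H = sum(g), sum(h)                     # G = -6.0, H = 3.0

w_no_reg = leaf_weight(G, H, lam=0.0)     # 2.0, the mean residual
w_reg = leaf_weight(G, H, lam=1.0)        # 1.5, shrunk towards zero by lambda
print(w_no_reg, w_reg, leaf_score(G, H, lam=1.0))
```

Without regularization the leaf simply outputs the mean residual; a positive lambda pulls the output towards zero, which is where the "more conservative" behaviour of the L2 term comes from.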

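The derivation above assumed that the first t-1 trees are known, so we now look at how a tree is actually grown. Following the usual decision-tree strategy, each time a node is split we look for the split that reduces the objective the most. We call this reduction, the objective before the split minus the objective after it, the Gain:

$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

Here $G_L, H_L$ and $G_R, H_R$ are the sums of the first and second derivatives of the loss over the samples in the left and right child, $\lambda$ is the L2 penalty on leaf outputs and $\gamma$ is the cost of adding a leaf.

The larger the Gain, the more the objective decreases after the split. (From the formula: the larger $\frac{G^2}{H + \lambda}$ is, the smaller Obj becomes.)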
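As a small illustration of the Gain formula, Gain = 0.5 * [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - (G_L+G_R)^2/(H_L+H_R+lambda)] - gamma, the sketch below evaluates it for two hand-picked splits. `split_gain` is a hypothetical helper written for this example, not an XGBoost function.

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Objective before the split minus objective after it, for one candidate split."""
    def score(G, H):
        return G * G / (H + lam)

    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma


# Children whose gradients point in opposite directions: the split separates them well.
gain_good = split_gain(G_L=-4.0, H_L=2.0, G_R=4.0, H_R=2.0)

# Children that look alike: splitting only adds a leaf, so the gain is negative.
gain_none = split_gain(G_L=-2.0, H_L=2.0, G_R=-2.0, H_R=2.0)

# A larger gamma raises the bar every split has to clear.
gain_good_penalized = split_gain(-4.0, 2.0, 4.0, 2.0, gamma=1.0)

print(gain_good, gain_none, gain_good_penalized)
```

A split is only worth making when its gain is positive, which is how gamma (min_split_loss) acts as a pruning threshold.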

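4. Finding the best split point:

In a decision tree (CART) we use the basic exact greedy algorithm: sort all values of every feature (very costly in both time and memory), then compare the Gini index at every candidate point and pick the one with the largest change. For a continuous feature, the values are discretized by taking the midpoint of two adjacent values as the split point. The sorting step is clearly expensive, because every feature has to be scanned over the whole sample and sorted as well!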

[Figure: pseudocode of the exact greedy algorithm from the paper]
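The exact greedy search can be sketched for a single feature as follows. This is a simplified illustration in plain Python, not the paper's pseudocode or XGBoost's implementation: it scores candidates with the gradient-based gain (rather than Gini), uses the midpoint of adjacent values as the threshold, and the name `best_split_exact` is made up for this example.

```python
def best_split_exact(x, g, h, lam=1.0, gamma=0.0):
    """Scan every distinct value of one feature; return (best_gain, threshold)."""
    order = sorted(range(len(x)), key=lambda i: x[i])   # the costly sorting step
    G, H = sum(g), sum(h)
    G_L = H_L = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[nxt]:
            continue                                    # cannot split between equal values
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam)
                      - G ** 2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x[i] + x[nxt]) / 2.0      # midpoint of the two neighbours
    return best_gain, best_threshold


x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
g = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]   # gradients of two clearly separated groups
h = [1.0] * 6

gain, threshold = best_split_exact(x, g, h)
print(gain, threshold)   # 2.25 6.5
```

For a full tree this scan is repeated for every feature at every node, which is why the cost grows quickly with the number of samples and features.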

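XGBoost therefore also provides an approximate algorithm: it first proposes candidate split points from the percentiles of the feature distribution, maps the continuous feature into the buckets delimited by these candidates, aggregates the statistics, and picks the best split among the proposals based on the aggregated statistics. For a feature k, the algorithm first finds a candidate set of cut points $S_k = \{s_{k1}, s_{k2}, \dots, s_{kl}\}$ from the quantiles of the feature distribution, then assigns the values of feature k to buckets according to $S_k$, accumulates the statistics G and H of the samples in each bucket, and finally searches for the best split point over these accumulated statistics.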
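The bucket-based idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: candidates come from plain (unweighted) quantiles, whereas the paper uses a weighted quantile sketch, and the names `quantile_candidates` and `best_split_approx` are made up for this example.

```python
from bisect import bisect_left


def quantile_candidates(x, n_buckets):
    """Candidate thresholds taken at evenly spaced quantiles of the feature values."""
    xs = sorted(x)
    return sorted({xs[(len(xs) * k) // n_buckets] for k in range(1, n_buckets)})


def best_split_approx(x, g, h, n_buckets=3, lam=1.0, gamma=0.0):
    """Evaluate only the quantile candidates; return (best_gain, threshold)."""
    cands = quantile_candidates(x, n_buckets)
    # Bucket b holds the samples with cands[b-1] < x <= cands[b]; the last bucket holds the rest.
    G_b = [0.0] * (len(cands) + 1)
    H_b = [0.0] * (len(cands) + 1)
    for xi, gi, hi in zip(x, g, h):
        b = bisect_left(cands, xi)
        G_b[b] += gi
        H_b[b] += hi
    G, H = sum(G_b), sum(H_b)
    G_L = H_L = 0.0
    best_gain, best_threshold = 0.0, None
    for b, s in enumerate(cands):          # split rule: x <= s goes left
        G_L += G_b[b]
        H_L += H_b[b]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam)
                      - G ** 2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, s
    return best_gain, best_threshold


x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
g = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
h = [1.0] * 6

gain, threshold = best_split_approx(x, g, h, n_buckets=3)
print(gain, threshold)   # 2.25 3.0
```

Only the candidate thresholds are evaluated, so the per-node cost depends on the number of buckets rather than on the number of distinct feature values.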

[Figure: pseudocode of the approximate algorithm from the paper]

XGBoost hands-on practice:

1. Import the basic libraries:

import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline
2. Getting started with the native XGBoost API:
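import xgboost as xgb
# Read the data (LIBSVM text files from the demo/data directory of the XGBoost repository)
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')
# Specify the parameters through a dict
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
num_round = 2  # number of boosting rounds
bst = xgb.train(param, dtrain, num_round)
# Predict
preds = bst.predict(dtest)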
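3. XGBoost parameters (names in parentheses are aliases, as used for example in the scikit-learn interface)
XGBoost parameters fall into three groups: general parameters, booster parameters and learning task parameters.
General parameters:
booster: which weak learner to train; default gbtree, options are gbtree, gblinear or dart
nthread: number of parallel threads used to run XGBoost; defaults to the maximum number of threads available
verbosity: verbosity of printed messages. Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug).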
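Tree Booster parameters:
eta (learning_rate): the learning rate, i.e. the step-size shrinkage applied in each update to prevent overfitting. Default = 0.3, range: [0, 1]; typical values are 0.01-0.2
gamma (min_split_loss): default = 0. A node is split only if the loss reduction is at least gamma. The larger gamma is, the more conservative the algorithm and the less prone to overfitting, although predictive performance may suffer, so a balance is needed. Range: [0, ∞]
max_depth: default = 6, the maximum depth of a tree. Increasing it makes the model more complex and more likely to overfit. Range: [0, ∞]
min_child_weight: default = 1. If the sum of instance weights (Hessians) in a newly created node is less than min_child_weight, splitting stops. This can reduce overfitting, but too high a value leads to underfitting. Range: [0, ∞]
max_delta_step: default = 0, the maximum delta step allowed for each leaf output. A value of 0 means no constraint; a positive value makes the update step more conservative. This parameter is usually not needed, but it may help logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 may help control the update. Range: [0, ∞]
subsample: default = 1, the fraction of samples drawn to build each tree. If set to 0.5, XGBoost randomly selects half of the samples as the training set for each tree. Range: (0, 1]
sampling_method: default = uniform, the method used to sample the training instances.
colsample_bytree: default = 1, the column (feature) sampling ratio. Range: (0, 1]
lambda (reg_lambda): default = 1, L2 regularization term on weights. Increasing it makes the model more conservative.
alpha (reg_alpha): default = 0, L1 regularization term on weights. Increasing it makes the model more conservative.
tree_method: default = auto, the tree construction algorithm used by XGBoost.
  exact: exact greedy algorithm; enumerates all split candidates.
  approx: approximate greedy algorithm using quantile sketches and gradient histograms.
  hist: faster histogram-optimized approximate greedy algorithm. (LightGBM also uses a histogram algorithm.)
  gpu_hist: GPU implementation of the hist algorithm.
scale_pos_weight: controls the balance of positive and negative weights, which is useful for imbalanced classes. In Kaggle competitions it is usually set to sum(negative instances) / sum(positive instances). With highly imbalanced classes, setting it to a value greater than 0 can speed up convergence.
num_parallel_tree: default = 1, the number of parallel trees built in each iteration. This option is used to support boosted random forests.
monotone_constraints: monotonicity constraints on variables. When there is a very strong prior belief that the true relationship is monotonic, constraints can be used to improve the predictive performance of the model. (For example, params_constrained['monotone_constraints'] = "(1,-1)" tells XGBoost to impose an increasing constraint on the first predictor and a decreasing constraint on the second.)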
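Linear Booster parameters:
lambda (reg_lambda): default = 0, L2 regularization term on weights. Increasing it makes the model more conservative. Normalized by the number of training examples.
alpha (reg_alpha): default = 0, L1 regularization term on weights. Increasing it makes the model more conservative. Normalized by the number of training examples.
updater: default = shotgun.
feature_selector: default = cyclic. Feature selection and ordering method
top_k: the number of top features to select (used by the greedy and thrifty feature selectors)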
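General parameters: there are two kinds of booster, tree-based and linear. Tree boosters usually perform much better than the linear booster, so the linear booster is rarely used.
Learning task parameters: these control the optimization objective and the metric used to evaluate the result at each step.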
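As a small worked example of the sum(negative instances) / sum(positive instances) rule for scale_pos_weight, the sketch below computes it from a label list and places it in a native-API parameter dict. The helper name `suggested_scale_pos_weight` and the example values of the other parameters are made up for illustration; they are starting points, not recommendations from the XGBoost documentation.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive instances for 0/1 labels."""
    n_pos = sum(1 for v in labels if v == 1)
    n_neg = sum(1 for v in labels if v == 0)
    if n_pos == 0:
        raise ValueError("no positive instances in labels")
    return n_neg / n_pos


# 90 negatives and 10 positives: a 9-to-1 imbalance.
labels = [0] * 90 + [1] * 10

params = {
    'booster': 'gbtree',              # general parameter
    'objective': 'binary:logistic',   # learning task parameter
    'eta': 0.1,                       # tree booster parameters from here on
    'max_depth': 6,
    'min_child_weight': 1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'lambda': 1,
    'alpha': 0,
    'scale_pos_weight': suggested_scale_pos_weight(labels),
}
print(params['scale_pos_weight'])   # 9.0
```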
4. Notes on tuning XGBoost:
General steps for parameter tuning:
5. XGBoost in detail:
1). Installing XGBoost
Method 1:
Method 2:
2). Data interface (DMatrix, the data format XGBoost works with)
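import numpy as np
import pandas
import scipy.sparse
import xgboost as xgb
# 1. Load LIBSVM text files or XGBoost binary files (the file names are placeholders)
dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')
# 2. Load CSV files; label_column is the index of the label column
dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
# 3. Load a NumPy array
data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)
# 4. Load a scipy.sparse matrix (dat, row and col are your own values and their row/column indices)
csr = scipy.sparse.csr_matrix((dat, (row, col)))
dtrain = xgb.DMatrix(csr)
# 5. Load a pandas DataFrame
data = pandas.DataFrame(np.arange(12).reshape((4,3)), columns=['a', 'b', 'c'])
label = pandas.DataFrame(np.random.randint(2, size=4))
dtrain = xgb.DMatrix(data, label=label)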
Author's recommendation: save the data to an XGBoost binary file first and load it from there afterwards, which makes loading faster.
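# 1. Save a DMatrix to an XGBoost binary file
dtrain = xgb.DMatrix('train.svm.txt')
dtrain.save_binary('train.buffer')
# 2. Treat -999.0 as the missing-value marker (data and label are the 5-row NumPy arrays from above)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0)
# 3. Set per-sample weights when needed
w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)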
3). Setting the parameters:
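# Load and prepare the wine dataset
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
df_wine = df_wine[df_wine['Class label'] != 1]  # drop class 1, leaving two classes
y = df_wine['Class label'].values
X = df_wine[['Alcohol', 'OD280/OD315 of diluted wines']].values
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.preprocessing import LabelEncoder  # label encoding
le = LabelEncoder()
y = le.fit_transform(y)  # classes 2 and 3 become 0 and 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)
# Booster parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class objective, predicts class labels
    'num_class': 2,  # number of classes; two remain after dropping class 1
    'gamma': 0.1,  # minimum loss reduction needed to split; larger is more conservative
    'max_depth': 12,  # maximum tree depth; deeper trees overfit more easily
    'lambda': 2,  # L2 regularization on leaf weights
    'subsample': 0.7,  # row sampling ratio per tree
    'colsample_bytree': 0.7,  # column sampling ratio per tree
    'min_child_weight': 3,
    'verbosity': 0,  # 0 = silent
    'eta': 0.007,  # learning rate
    'seed': 1000,
    'nthread': 4,  # number of CPU threads
    'eval_metric': 'merror'  # multi-class classification error rate
}
plst = list(params.items())  # xgb.train accepts a dict or a list of (name, value) pairs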
4). Training
num_round = 10
bst = xgb.train(plst, dtrain, num_round)
5). Saving the model
bst.save_model('0001.model')  # save the trained model
bst.dump_model('dump.raw.txt')  # dump the trees as readable text
6). Loading a saved model
bst = xgb.Booster({'nthread': 4})  # initialize an empty booster
bst.load_model('0001.model')  # load the saved model
7). Early stopping
train(..., evals=evals, early_stopping_rounds=10)
8). Prediction
ypred = bst.predict(dtest)
9). Plotting
6. Worked examples:
1). Classification example
from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score  # accuracy
# Load the sample dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234565)  # split the dataset
# Parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 6,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.75,
    'min_child_weight': 3,
    'verbosity': 1,  # 1 = print warnings
    'eta': 0.1,
    'seed': 1,
    'nthread': 4,
}
plst = list(params.items())
# Train
dtrain = xgb.DMatrix(X_train, y_train)  # build the DMatrix
num_rounds = 500
model = xgb.train(plst, dtrain, num_rounds)
# Predict on the test set
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)
# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: %.2f%%" % (accuracy*100.0))
# Plot the feature importance
plot_importance(model)
plt.show()
2). Regression example
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
# Load the dataset. load_boston was removed in scikit-learn 1.2, so the bundled diabetes regression dataset is used here instead
dataset = load_diabetes()
X, y = dataset.data, dataset.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Parameters
params = {
    'booster': 'gbtree',
    'objective': 'reg:squarederror',
    'gamma': 0.1,
    'max_depth': 5,
    'lambda': 3,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'verbosity': 0,  # 0 = silent
    'eta': 0.1,
    'seed': 1000,
    'nthread': 4,
}
# Train
dtrain = xgb.DMatrix(X_train, y_train)
num_rounds = 300
plst = list(params.items())
model = xgb.train(plst, dtrain, num_rounds)
# Predict on the test set and report the mean squared error
dtest = xgb.DMatrix(X_test)
ans = model.predict(dtest)
print("MSE: %.3f" % mean_squared_error(y_test, ans))
# Plot the feature importance
plot_importance(model)
plt.show()
7. Tuning XGBoost (with scikit-learn grid search)
import xgboost as xgb
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
iris = load_iris()
X, y = iris.data, iris.target
col = iris.target_names
train_x, valid_x, train_y, valid_y = train_test_split(X, y, test_size=0.3, random_state=1)  # split into training and validation sets
# Candidate values for each parameter
parameters = {
    'max_depth': [5, 10, 15, 20, 25],
    'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
    'n_estimators': [500, 1000, 2000, 3000, 5000],
    'min_child_weight': [0, 2, 5, 10, 20],
    'max_delta_step': [0, 0.2, 0.6, 1, 2],
    'subsample': [0.6, 0.7, 0.8, 0.85, 0.95],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9],
    'reg_alpha': [0, 0.25, 0.5, 0.75, 1],
    'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1],
    'scale_pos_weight': [0.2, 0.4, 0.6, 0.8, 1]
}
xlf = xgb.XGBClassifier(max_depth=10,
                        learning_rate=0.01,
                        n_estimators=2000,
                        verbosity=0,  # 0 = silent
                        objective='multi:softmax',
                        num_class=3,
                        n_jobs=-1,  # use all available threads
                        gamma=0,
                        min_child_weight=1,
                        max_delta_step=0,
                        subsample=0.85,
                        colsample_bytree=0.7,
                        colsample_bylevel=1,
                        reg_alpha=0,
                        reg_lambda=1,
                        scale_pos_weight=1,
                        random_state=0)
# Searching every list above at once is far too expensive, so search a few parameters at a time
param_grid = {name: parameters[name] for name in ('max_depth', 'learning_rate')}
gs = GridSearchCV(xlf, param_grid=param_grid, scoring='accuracy', cv=3)
gs.fit(train_x, train_y)
print("Best score: %0.3f" % gs.best_score_)
print("Best parameters set: %s" % gs.best_params_)
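To get a feel for the cost of grid search, the sketch below counts how many model fits a grid like the one above implies (ten parameters with five candidate values each, 3-fold cross-validation), and compares it with a staged plan that tunes a couple of related parameters at a time. The grouping into stages is a hypothetical example, not a prescribed recipe.

```python
from math import prod

# Number of candidate values per parameter (five each, as in the grid above).
grid_sizes = {
    'max_depth': 5, 'learning_rate': 5, 'n_estimators': 5, 'min_child_weight': 5,
    'max_delta_step': 5, 'subsample': 5, 'colsample_bytree': 5,
    'reg_alpha': 5, 'reg_lambda': 5, 'scale_pos_weight': 5,
}
cv_folds = 3

# Full grid: every combination of every parameter, each fitted once per fold.
full_combinations = prod(grid_sizes.values())
full_fits = full_combinations * cv_folds

# Hypothetical staged plan: tune related parameters together, keep the best
# values fixed, then move on to the next group.
stages = [
    ('max_depth', 'min_child_weight'),
    ('subsample', 'colsample_bytree'),
    ('reg_alpha', 'reg_lambda'),
    ('learning_rate', 'n_estimators'),
]
staged_fits = sum(prod(grid_sizes[p] for p in stage) for stage in stages) * cv_folds

print(full_fits)     # 29296875
print(staged_fits)   # 300
```

The staged search is not guaranteed to find the same optimum as the full grid, since parameters in different stages can interact, but it keeps the number of fits manageable.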