Author: Xu Jing (徐靜), AI Image Algorithm R&D Engineer
Blog: https://dataxujing.github.io/
GitHub: https://github.com/DataXujing
CatBoost (Categorical Boosting) is a gradient boosting algorithm in the same family as XGBoost and LightGBM. It introduces two main innovations: first, it handles categorical feature values with ordered target statistics (ordered TS); second, it offers two training modes, Ordered and Plain (see the pseudocode in [3]).
The idea of ordered boosting resolves the prediction shift problem that commonly arises in gradient boosting.
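To make the ordered TS idea concrete, here is a minimal illustrative sketch (a simplification, not CatBoost's actual implementation; the function name, the prior, and the smoothing weight alpha are hypothetical choices): each sample's category value is encoded using only the target values of samples that precede it in a random permutation, smoothed with a prior, which is what avoids target leakage.
```python
import numpy as np

def ordered_target_statistic(categories, targets, prior=0.5, alpha=1.0, seed=42):
    """Encode one categorical column with ordered target statistics.

    Each row is encoded using only rows that appear *before* it in a
    random permutation, so its own target never leaks into its encoding."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(categories))
    sums, counts = {}, {}  # running target sum / count per category, over the "history"
    encoded = np.empty(len(categories))
    for idx in perm:
        c = categories[idx]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[idx] = (s + alpha * prior) / (n + alpha)  # smoothed TS from history only
        sums[c] = s + targets[idx]
        counts[c] = n + 1
    return encoded

# Toy usage: two 'a' rows get different encodings depending on their position
cats = np.array(['a', 'b', 'a', 'a', 'b', 'c'])
ys = np.array([1, 0, 1, 0, 1, 1])
print(ordered_target_statistic(cats, ys))
```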
CatBoost can currently be called and trained from Python, R, and the command line, and it supports GPU training. It also provides powerful visualization of the training process, which can be monitored from a Jupyter notebook, CatBoost Viewer, or TensorBoard. Its documentation is extensive, making it easy to pick up.
In this article we train CatBoost models in both Python and R on the public Kaggle Titanic dataset.
CatBoost in Python
1. Loading the data:
```python
from catboost.datasets import titanic
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score

train_df, test_df = titanic()

X = train_df.drop('Survived', axis=1)
y = train_df.Survived

# Fill missing values: CatBoost does not accept NaN in categorical features
X.fillna(-999, inplace=True)

# Indices of the non-float columns, treated as categorical features
categorical_features_indices = np.where(X.dtypes != float)[0]

# Split the data
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, train_size=0.75, random_state=42)

X_test = test_df
```
Here we work with DataFrames directly. CatBoost accepts NumPy arrays and pandas DataFrames, and also provides its own Pool data structure; if training speed or memory usage matters, the official documentation recommends using Pool. In this article we stick with DataFrames for the examples.
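For reference, a minimal sketch of wrapping the same DataFrames in a Pool (reusing categorical_features_indices from the loading step above):
```python
from catboost import Pool

# A Pool bundles the features, labels and categorical column indices together
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
validation_pool = Pool(X_validation, y_validation, cat_features=categorical_features_indices)
```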
2. Hyperparameter tuning with hyperopt:
```python
import hyperopt
# Note: newer hyperopt versions expect np.random.default_rng here instead
from numpy.random import RandomState

# The objective to minimize
def hyperopt_objective(params):
    model = CatBoostClassifier(
        l2_leaf_reg=int(params['l2_leaf_reg']),
        learning_rate=params['learning_rate'],
        iterations=500,
        eval_metric='Accuracy',
        random_seed=42,
        logging_level='Silent'
    )
    cv_data = cv(
        Pool(X, y, cat_features=categorical_features_indices),
        model.get_params()
    )
    best_accuracy = np.max(cv_data['test-Accuracy-mean'])
    return 1 - best_accuracy

# Candidate values for the parameters to optimize
params_space = {
    'l2_leaf_reg': hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1),
    'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),
}

trials = hyperopt.Trials()

# Run the parameter search
best = hyperopt.fmin(
    hyperopt_objective,
    space=params_space,
    algo=hyperopt.tpe.suggest,
    max_evals=50,
    trials=trials,
    rstate=RandomState(123)
)

# Print the best parameter combination
print(best)
```
Cross-validate with the best parameters, then retrain on the full training set:
```python
model = CatBoostClassifier(
    l2_leaf_reg=int(best['l2_leaf_reg']),
    learning_rate=best['learning_rate'],
    iterations=500,
    eval_metric='Accuracy',
    random_seed=42,
    logging_level='Silent'
)
cv_data = cv(Pool(X, y, cat_features=categorical_features_indices), model.get_params())

model.fit(X, y, cat_features=categorical_features_indices)
```
Alternatively, grid search or randomized parameter search also works; a GridSearchCV example is given below for reference:
```python
from catboost import CatBoostClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# AUC helper over the train/test splits
def auc(m, X_train, X_test):
    return (metrics.roc_auc_score(y_train, m.predict_proba(X_train)[:, 1]),
            metrics.roc_auc_score(y_test, m.predict_proba(X_test)[:, 1]))

params = {'depth': [4, 7, 10],
          'learning_rate': [0.03, 0.1, 0.15],
          'l2_leaf_reg': [1, 4, 9],
          'iterations': [300]}
cb = CatBoostClassifier()
model = GridSearchCV(cb, params, scoring="roc_auc", cv=3)
model.fit(X_train, y_train, cat_features=categorical_features_indices)
```
Many interesting parameters can be adjusted during training, mostly trading off training speed against accuracy; there are also visualization- and GPU-related options. See the official CatBoost documentation for details.
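As a hedged illustration of such options (the specific values below are assumptions for demonstration, not tuned recommendations; task_type='GPU' requires a CUDA-capable GPU):
```python
fast_model = CatBoostClassifier(
    iterations=300,
    learning_rate=0.1,
    boosting_type='Plain',   # Plain mode trades some accuracy for speed vs. Ordered
    border_count=32,         # fewer feature-split candidates -> faster training
    task_type='GPU',         # assumption: a GPU is available; drop for CPU training
    devices='0',
    logging_level='Silent'
)
fast_model.fit(X_train, y_train, cat_features=categorical_features_indices)
```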
3. Feature importance and prediction:
```python
# Print feature importances (get_feature_importance expects a Pool)
feature_importances = model.get_feature_importance(
    Pool(X_train, y_train, cat_features=categorical_features_indices))
feature_names = X_train.columns
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print('{}: {}'.format(name, score))

# Prediction: results from the 3 prediction types
print(model.predict_proba(data=X_validation))
print(model.predict(data=X_validation))
raw_pred = model.predict(data=X_validation, prediction_type='RawFormulaVal')
print(raw_pred)

# For a Logloss model, applying the sigmoid to the raw scores recovers probabilities
import math
def sigmoid(x):
    return 1 / (1 + math.exp(-x))
probabilities = [sigmoid(x) for x in raw_pred]
print(np.array(probabilities))
```
4. Model persistence:
```python
# Save the model (other file extensions also work)
model.save_model('catboost_model.bin')

# Load the model
my_best_model = CatBoostClassifier()
my_best_model.load_model('catboost_model.bin')
print(my_best_model.get_params())
print(my_best_model.random_seed_)
print(my_best_model.learning_rate_)
```
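save_model also takes a format argument for exporting to other representations; a brief sketch (which formats are available depends on the catboost version, so treat this as an assumption to verify):
```python
# Native binary format (the default)
model.save_model('catboost_model.cbm', format='cbm')
# Human-readable JSON export
model.save_model('catboost_model.json', format='json')
```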
CatBoost in R
1. Building the Pool data structure
A. Reading from a file:
```R
library(catboost)
library(caret)
library(titanic)

pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
column_description_path <- system.file("extdata", "adult.cd", package = "catboost")
pool <- catboost.load_pool(pool_path, column_description = column_description_path)
head(pool, 1)
```
Two files are involved: pool_path, which holds the actual feature values, and column_description_path, which describes the column types. Three types are currently supported: Target (the label), Categ, and Num (the default). Note that the column indices in the description file follow the Python convention and start from 0.
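For illustration, a hypothetical column description file in this format might look like the following (tab-separated column index and type; the indices shown are made up for this example):
```
0	Target
3	Categ
5	Categ
```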
B. Building from a matrix:
```R
pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")

column_description_vector <- rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'

data <- read.table(pool_path, head = F, sep = "\t",
                   colClasses = column_description_vector, na.strings = 'NAN')

# Transform categorical features to numeric
for (i in cat_features)
  data[, i] <- as.numeric(factor(data[, i]))

target <- c(1)
# Caution: catboost uses zero-based feature indices, and the target column is
# dropped below, so cat_features may need shifting to match the matrix layout
pool <- catboost.load_pool(as.matrix(data[, -target]),
                           label = as.matrix(data[, target]),
                           cat_features = cat_features)
head(pool, 1)
```
Note that a matrix can hold only numeric data, so all features must be converted to numeric before being placed in the matrix; the indices of the categorical columns are then specified when the Pool is constructed.
C. Building from a data frame:
```R
train_path <- system.file("extdata", "adult_train.1000", package = "catboost")
test_path <- system.file("extdata", "adult_test.1000", package = "catboost")

column_description_vector <- rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'

train <- read.table(train_path, head = F, sep = "\t",
                    colClasses = column_description_vector, na.strings = 'NAN')
test <- read.table(test_path, head = F, sep = "\t",
                   colClasses = column_description_vector, na.strings = 'NAN')
target <- c(1)
train_pool <- catboost.load_pool(data = train[, -target], label = train[, target])
test_pool <- catboost.load_pool(data = test[, -target], label = test[, target])
head(train_pool, 1)
head(test_pool, 1)
```
Note that categorical variables must be converted to factors, numeric variables stay numeric, and the label must also be numeric.
2. Training and prediction:
```R
fit_params <- list(iterations = 100,
                   thread_count = 10,
                   loss_function = 'Logloss',
                   ignored_features = c(4, 9),
                   border_count = 32,
                   depth = 5,
                   learning_rate = 0.03,
                   l2_leaf_reg = 3.5,
                   train_dir = 'train_dir',
                   logging_level = 'Silent')
model <- catboost.train(train_pool, test_pool, fit_params)
# See help(catboost.train) for more parameters

# Accuracy helper
calc_accuracy <- function(prediction, expected) {
  labels <- ifelse(prediction > 0.5, 1, -1)
  accuracy <- sum(labels == expected) / length(labels)
  return(accuracy)
}

# Probability predictions
prediction <- catboost.predict(model, test_pool, prediction_type = 'Probability')
cat("Sample predictions: ", sample(prediction, 5), "\n")

# Class predictions
labels <- catboost.predict(model, test_pool, prediction_type = 'Class')
table(labels, test[, target])

# Works properly only for Logloss
accuracy <- calc_accuracy(prediction, test[, target])
cat("\nAccuracy: ", accuracy, "\n")

# Feature importances
cat("\nFeature importances", "\n")
catboost.get_feature_importance(model, train_pool)

cat("\nTree count: ", model$tree_count, "\n")
```
3. Using the caret package
A. Loading the data:
```r
set.seed(12345)

data <- as.data.frame(as.matrix(titanic_train), stringsAsFactors = TRUE)

# Impute missing ages with the most frequent age level
age_levels <- levels(data$Age)
most_frequent_age <- which.max(table(data$Age))
data$Age[is.na(data$Age)] <- age_levels[most_frequent_age]

drop_columns <- c("PassengerId", "Survived", "Name", "Ticket", "Cabin")
x <- data[, !(names(data) %in% drop_columns)]
y <- data[, c("Survived")]
```
B. Training the model with caret:
```r
fit_control <- trainControl(method = "cv", number = 5, classProbs = TRUE)

grid <- expand.grid(depth = c(4, 6, 8),
                    learning_rate = 0.1,
                    iterations = 100,
                    l2_leaf_reg = 0.1,
                    rsm = 0.95,
                    border_count = 64)

model <- train(x, as.factor(make.names(y)),
               method = catboost.caret,
               logging_level = 'Silent',
               preProc = NULL,
               tuneGrid = grid,
               trControl = fit_control)
```
C. Printing the model and feature importances:
```r
print(model)

importance <- varImp(model, scale = FALSE)
print(importance)
```
D. Prediction:
```r
pre_prob <- predict(model, type = 'prob')
print(pre_prob)
```
Reference
[1] https://github.com/catboost/tutorials
[2] https://github.com/catboost
[3] CatBoost: unbiased boosting with categorical features
[4] CatBoost: gradient boosting with categorical features support
[5] Who Is the King of Data Science Competitions? CatBoost vs. LightGBM vs. XGBoost