Author: Xu Jing (徐靜), AI Image Algorithm R&D Engineer
Blog: https://dataxujing.github.io/
GitHub: https://github.com/DataXujing
CatBoost (Categorical Boosting) is a gradient boosting algorithm in the same family as XGBoost and LightGBM. It introduces two main innovations: first, it handles categorical feature values with ordered target statistics (ordered TS); second, it offers two training modes, Ordered and Plain (see the pseudocode in [3]).
The idea of ordered boosting resolves the prediction shift problem that commonly arises in gradient boosting.
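To make the ordered TS idea concrete, here is a minimal illustrative sketch (a simplification, not CatBoost's actual implementation; the function name, the prior, and the smoothing weight alpha are hypothetical choices): each sample's category value is encoded using only the target values of samples that precede it in a random permutation, smoothed with a prior, which is what avoids target leakage.
```python
import numpy as np

def ordered_target_statistic(categories, targets, prior=0.5, alpha=1.0, seed=42):
    """Encode one categorical column with ordered target statistics.

    Each row is encoded using only rows that appear *before* it in a
    random permutation, so its own target never leaks into its encoding."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(categories))
    sums, counts = {}, {}  # running target sum / count per category, over the "history"
    encoded = np.empty(len(categories))
    for idx in perm:
        c = categories[idx]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[idx] = (s + alpha * prior) / (n + alpha)  # smoothed TS from history only
        sums[c] = s + targets[idx]
        counts[c] = n + 1
    return encoded

# Toy usage: two 'a' rows get different encodings depending on their position
cats = np.array(['a', 'b', 'a', 'a', 'b', 'c'])
ys = np.array([1, 0, 1, 0, 1, 1])
print(ordered_target_statistic(cats, ys))
```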
CatBoost can currently be called and trained from Python, R, and the command line, and it supports GPU training. It also provides powerful visualization of the training process, which can be monitored from a Jupyter notebook, CatBoost Viewer, or TensorBoard. Its documentation is extensive, making it easy to pick up.
In this article we train CatBoost models in both Python and R on the public Kaggle Titanic dataset.
CatBoost in Python
1. Loading the data:
```python
from catboost.datasets import titanic
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score

train_df, test_df = titanic()

X = train_df.drop('Survived', axis=1)
y = train_df.Survived

# Fill missing values: CatBoost does not accept NaN in categorical features
X.fillna(-999, inplace=True)

# Indices of the non-float columns, treated as categorical features
categorical_features_indices = np.where(X.dtypes != float)[0]

# Split the data
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, train_size=0.75, random_state=42)

X_test = test_df
```
Here we work with DataFrames directly. CatBoost accepts NumPy arrays and pandas DataFrames, and also provides its own Pool data structure; if training speed or memory usage matters, the official documentation recommends using Pool. In this article we stick with DataFrames for the examples.
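For reference, a minimal sketch of wrapping the same DataFrames in a Pool (reusing categorical_features_indices from the loading step above):
```python
from catboost import Pool

# A Pool bundles the features, labels and categorical column indices together
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
validation_pool = Pool(X_validation, y_validation, cat_features=categorical_features_indices)
```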
2. Hyperparameter tuning with hyperopt:
```python
import hyperopt
# Note: newer hyperopt versions expect np.random.default_rng here instead
from numpy.random import RandomState

# The objective to minimize
def hyperopt_objective(params):
    model = CatBoostClassifier(
        l2_leaf_reg=int(params['l2_leaf_reg']),
        learning_rate=params['learning_rate'],
        iterations=500,
        eval_metric='Accuracy',
        random_seed=42,
        logging_level='Silent'
    )
    cv_data = cv(
        Pool(X, y, cat_features=categorical_features_indices),
        model.get_params()
    )
    best_accuracy = np.max(cv_data['test-Accuracy-mean'])
    return 1 - best_accuracy

# Candidate values for the parameters to optimize
params_space = {
    'l2_leaf_reg': hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1),
    'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),
}

trials = hyperopt.Trials()

# Run the parameter search
best = hyperopt.fmin(
    hyperopt_objective,
    space=params_space,
    algo=hyperopt.tpe.suggest,
    max_evals=50,
    trials=trials,
    rstate=RandomState(123)
)

# Print the best parameter combination
print(best)
```
Cross-validate with the best parameters, then retrain on the full training set:
```python
model = CatBoostClassifier(
    l2_leaf_reg=int(best['l2_leaf_reg']),
    learning_rate=best['learning_rate'],
    iterations=500,
    eval_metric='Accuracy',
    random_seed=42,
    logging_level='Silent'
)
cv_data = cv(Pool(X, y, cat_features=categorical_features_indices), model.get_params())

model.fit(X, y, cat_features=categorical_features_indices)
```
Alternatively, grid search or randomized parameter search also works; a GridSearchCV example is given below for reference:
```python
from catboost import CatBoostClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# AUC helper over the train/test splits
def auc(m, X_train, X_test):
    return (metrics.roc_auc_score(y_train, m.predict_proba(X_train)[:, 1]),
            metrics.roc_auc_score(y_test, m.predict_proba(X_test)[:, 1]))

params = {'depth': [4, 7, 10],
          'learning_rate': [0.03, 0.1, 0.15],
          'l2_leaf_reg': [1, 4, 9],
          'iterations': [300]}
cb = CatBoostClassifier()
model = GridSearchCV(cb, params, scoring="roc_auc", cv=3)
model.fit(X_train, y_train, cat_features=categorical_features_indices)
```
Many interesting parameters can be adjusted during training, mostly trading off training speed against accuracy; there are also visualization- and GPU-related options. See the official CatBoost documentation for details.
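As a hedged illustration of such options (the specific values below are assumptions for demonstration, not tuned recommendations; task_type='GPU' requires a CUDA-capable GPU):
```python
fast_model = CatBoostClassifier(
    iterations=300,
    learning_rate=0.1,
    boosting_type='Plain',   # Plain mode trades some accuracy for speed vs. Ordered
    border_count=32,         # fewer feature-split candidates -> faster training
    task_type='GPU',         # assumption: a GPU is available; drop for CPU training
    devices='0',
    logging_level='Silent'
)
fast_model.fit(X_train, y_train, cat_features=categorical_features_indices)
```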
3. Feature importance and prediction:
```python
# Print feature importances (get_feature_importance expects a Pool)
feature_importances = model.get_feature_importance(
    Pool(X_train, y_train, cat_features=categorical_features_indices))
feature_names = X_train.columns
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print('{}: {}'.format(name, score))

# Prediction: results from the 3 prediction types
print(model.predict_proba(data=X_validation))
print(model.predict(data=X_validation))
raw_pred = model.predict(data=X_validation, prediction_type='RawFormulaVal')
print(raw_pred)

# For a Logloss model, applying the sigmoid to the raw scores recovers probabilities
import math
def sigmoid(x):
    return 1 / (1 + math.exp(-x))
probabilities = [sigmoid(x) for x in raw_pred]
print(np.array(probabilities))
```
4. Model persistence:
```python
# Save the model (other file extensions also work)
model.save_model('catboost_model.bin')

# Load the model
my_best_model = CatBoostClassifier()
my_best_model.load_model('catboost_model.bin')
print(my_best_model.get_params())
print(my_best_model.random_seed_)
print(my_best_model.learning_rate_)
```
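save_model also takes a format argument for exporting to other representations; a brief sketch (which formats are available depends on the catboost version, so treat this as an assumption to verify):
```python
# Native binary format (the default)
model.save_model('catboost_model.cbm', format='cbm')
# Human-readable JSON export
model.save_model('catboost_model.json', format='json')
```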
CatBoost in R
1. Building the Pool data structure
A. Reading from a file:
```R
library(catboost)
library(caret)
library(titanic)

pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
column_description_path <- system.file("extdata", "adult.cd", package = "catboost")
pool <- catboost.load_pool(pool_path, column_description = column_description_path)
head(pool, 1)
```
Two files are involved: pool_path, which holds the actual feature values, and column_description_path, which describes the column types. Three types are currently supported: Target (the label), Categ, and Num (the default). Note that the column indices in the description file follow the Python convention and start from 0.
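For illustration, a hypothetical column description file in this format might look like the following (tab-separated column index and type; the indices shown are made up for this example):
```
0	Target
3	Categ
5	Categ
```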
B. Building from a matrix:
```R
pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")

column_description_vector <- rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'

data <- read.table(pool_path, head = F, sep = "\t",
                   colClasses = column_description_vector, na.strings = 'NAN')

# Transform categorical features to numeric
for (i in cat_features)
  data[, i] <- as.numeric(factor(data[, i]))

target <- c(1)
# Caution: catboost uses zero-based feature indices, and the target column is
# dropped below, so cat_features may need shifting to match the matrix layout
pool <- catboost.load_pool(as.matrix(data[, -target]),
                           label = as.matrix(data[, target]),
                           cat_features = cat_features)
head(pool, 1)
```
Note that a matrix can hold only numeric data, so all features must be converted to numeric before being placed in the matrix; the indices of the categorical columns are then specified when the Pool is constructed.
C. Building from a data frame:
```R
train_path <- system.file("extdata", "adult_train.1000", package = "catboost")
test_path <- system.file("extdata", "adult_test.1000", package = "catboost")

column_description_vector <- rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'

train <- read.table(train_path, head = F, sep = "\t",
                    colClasses = column_description_vector, na.strings = 'NAN')
test <- read.table(test_path, head = F, sep = "\t",
                   colClasses = column_description_vector, na.strings = 'NAN')
target <- c(1)
train_pool <- catboost.load_pool(data = train[, -target], label = train[, target])
test_pool <- catboost.load_pool(data = test[, -target], label = test[, target])
head(train_pool, 1)
head(test_pool, 1)
```
Note that categorical variables must be converted to factors, numeric variables stay numeric, and the label must also be numeric.
2. Training and prediction:
```R
fit_params <- list(iterations = 100,
                   thread_count = 10,
                   loss_function = 'Logloss',
                   ignored_features = c(4, 9),
                   border_count = 32,
                   depth = 5,
                   learning_rate = 0.03,
                   l2_leaf_reg = 3.5,
                   train_dir = 'train_dir',
                   logging_level = 'Silent')
model <- catboost.train(train_pool, test_pool, fit_params)
# See help(catboost.train) for more parameters

# Accuracy helper
calc_accuracy <- function(prediction, expected) {
  labels <- ifelse(prediction > 0.5, 1, -1)
  accuracy <- sum(labels == expected) / length(labels)
  return(accuracy)
}

# Probability predictions
prediction <- catboost.predict(model, test_pool, prediction_type = 'Probability')
cat("Sample predictions: ", sample(prediction, 5), "\n")

# Class predictions
labels <- catboost.predict(model, test_pool, prediction_type = 'Class')
table(labels, test[, target])

# Works properly only for Logloss
accuracy <- calc_accuracy(prediction, test[, target])
cat("\nAccuracy: ", accuracy, "\n")

# Feature importances
cat("\nFeature importances", "\n")
catboost.get_feature_importance(model, train_pool)

cat("\nTree count: ", model$tree_count, "\n")
```
3. Using the caret package
A. Loading the data:
```r
set.seed(12345)

data <- as.data.frame(as.matrix(titanic_train), stringsAsFactors = TRUE)

# Impute missing ages with the most frequent age level
age_levels <- levels(data$Age)
most_frequent_age <- which.max(table(data$Age))
data$Age[is.na(data$Age)] <- age_levels[most_frequent_age]

drop_columns <- c("PassengerId", "Survived", "Name", "Ticket", "Cabin")
x <- data[, !(names(data) %in% drop_columns)]
y <- data[, c("Survived")]
```
B. Training the model with caret:
```r
fit_control <- trainControl(method = "cv", number = 5, classProbs = TRUE)

grid <- expand.grid(depth = c(4, 6, 8),
                    learning_rate = 0.1,
                    iterations = 100,
                    l2_leaf_reg = 0.1,
                    rsm = 0.95,
                    border_count = 64)

model <- train(x, as.factor(make.names(y)),
               method = catboost.caret,
               logging_level = 'Silent',
               preProc = NULL,
               tuneGrid = grid,
               trControl = fit_control)
```
C. Printing the model and feature importances:
```r
print(model)

importance <- varImp(model, scale = FALSE)
print(importance)
```
D. Prediction:
```r
pre_prob <- predict(model, type = 'prob')
print(pre_prob)
```
Reference
[1] https://github.com/catboost/tutorials
[2] https://github.com/catboost
[3] CatBoost: unbiased boosting with categorical features
[4] CatBoost: gradient boosting with categorical features support
[5] Who Is the King of Data Science Competitions? CatBoost vs. LightGBM vs. XGBoost