Auto Machine Learning: Notes on Automated Machine Learning

2021-03-02 Datawhale

⭐ Intended audience: readers with a grounding in machine learning algorithms

1. How far can auto-sklearn take the "auto"?

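For classification models in machine learning:

The conventional ML framework is the gray part of the figure below: load data, clean data, feature engineering, classifier, output predictions.

The auto part is the green boxes in the figure below: meta-learning is added on the left of the ML framework and build-ensemble on the right, while hyperparameter tuning uses Bayesian optimization.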

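Automatic learning from sample data: meta-learning. It learns what the sample data looks like and automatically recommends suitable models, for example which model works well for text data, or which works well for data with many discrete features.

Automatic hyperparameter tuning: Bayesian optimizer, i.e. Bayesian optimization.

Automatic model ensembling: build-ensemble. Ensembling is a staple technique in competitions: several models are combined into one stronger model, which often improves predictive accuracy.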

CASH problem: AutoML as a Combined Algorithm Selection and Hyperparameter optimization (CASH) problem
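To make the CASH idea concrete, here is a minimal sketch that searches jointly over which algorithm to use and which hyperparameters to give it. Everything in it is an assumption for illustration: the toy 1-D data, the two tiny "algorithms", and the use of plain random search in place of auto-sklearn's Bayesian optimizer.

```python
import random

# Toy 1-D binary classification data: (feature, label). Made up for this sketch.
TRAIN = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 0), (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
VALID = [(0.15, 0), (0.3, 0), (0.45, 0), (0.55, 1), (0.75, 1), (0.95, 1)]


def threshold_classifier(train, threshold):
    """Algorithm 1: predict 1 when the feature exceeds a fixed threshold."""
    return lambda x: 1 if x > threshold else 0


def knn_classifier(train, k):
    """Algorithm 2: majority vote among the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > k else 0
    return predict


# Joint search space: each algorithm comes with its own hyperparameter sampler.
SEARCH_SPACE = {
    "threshold": (threshold_classifier, lambda rng: {"threshold": rng.uniform(0.0, 1.0)}),
    "knn": (knn_classifier, lambda rng: {"k": rng.choice([1, 3, 5])}),
}


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


def solve_cash(n_trials=50, seed=0):
    """Random search over (algorithm, hyperparameters) pairs, scored on VALID."""
    rng = random.Random(seed)
    best = (-1.0, None, None)
    for _ in range(n_trials):
        name = rng.choice(sorted(SEARCH_SPACE))   # algorithm selection
        build, sample = SEARCH_SPACE[name]
        config = sample(rng)                      # hyperparameter choice
        score = accuracy(build(TRAIN, **config), VALID)
        if score > best[0]:
            best = (score, name, config)
    return best


best_score, best_algorithm, best_config = solve_cash()
print(best_algorithm, best_config, best_score)
```

The point of the sketch is only that the algorithm name is treated as one more dimension of the search space, which is what "combined" means in CASH.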

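In other words, ordinary classification and regression models are about to reach, or have already reached, the point where modeling has a low or zero barrier to entry and can even be free.

In fact every step of machine learning can move toward automation, and there are many ways to automate each one.

The hard parts of automating machine learning remain data cleaning and feature engineering; for model selection, model ensembling, and hyperparameter tuning, fairly mature, usable code already exists.

Our vision is a machine learning system that everyone can afford to use 🙂 Very Google, isn't it!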

2. Which companies are working on AutoML, and which open-source projects are on GitHub?

Industry progress on AutoML:

Google: Cloud AutoML, Google’s Prediction API

https://cloud.google.com/automl/

Microsoft: Custom Vision, Azure Machine Learning

Amazon: Amazon Machine Learning

others: BigML.com, Wise.io, SkyTree.com, RapidMiner.com, Dato.com, Prediction.io, DataRobot.com

Open-source projects on GitHub:

auto-sklearn (2.4k stars!)

https://github.com/automl/auto-sklearn

Paper link:

http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf

ClimbsRocks/auto_ml: worth reading the code to learn how to write a pipeline

https://github.com/ClimbsRocks/auto_ml

autokeras: an open-source AutoML project built on keras

http://codewithzhangyi.com/2018/07/26/AutoML/

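3. A look at auto-sklearn's overall architecture

Well... make do with this for now; for details, browse the file structure on GitHub.
The backbone of the framework is the second column, the essence of the second column is the pipeline, and the heart of the pipeline is its components:

16 classifiers (can be specified or filtered, e.g. include_estimators=["random_forest", ])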

adaboost, bernoulli_nb, decision_tree, extra_trees, gaussian_nb, gradient_boosting, k_nearest_neighbors, lda, liblinear_svc, libsvm_svc, multinomial_nb, passive_aggressive, qda, random_forest, sgd, xgradient_boosting

13 regressors (can be specified or filtered, e.g. exclude_estimators=None)

adaboost, ard_regression, decision_tree, extra_trees, gaussian_process, gradient_boosting, k_nearest_neighbors, liblinear_svr, libsvm_svr, random_forest, ridge_regression, sgd, xgradient_boosting

18 feature preprocessing methods (these steps can be switched off entirely or in part, e.g. include_preprocessors=["no_preprocessing", ])

densifier, extra_trees_preproc_for_classification, extra_trees_preproc_for_regression, fast_ica, feature_agglomeration, kernel_pca, kitchen_sinks, liblinear_svc_preprocessor, no_preprocessing, nystroem_sampler, pca, polynomial, random_trees_embedding, select_percentile, select_percentile_classification, select_percentile_regression, select_rates, truncatedSVD

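5 data preprocessing methods (these steps cannot be switched off manually)

balancing, imputation, one_hot_encoding, rescaling, variance_threshold (a pleasant surprise already; there is a lot to explore behind each one)

more than 110 hyperparameters
Here include_estimators lists the methods to search, and exclude_estimators lists the methods not to search; exclude_estimators cannot be combined with include_estimators.
For include_preprocessors, refer to the manual.

auto-sklearn is built on the sklearn library, so it inherits sklearn's impressive collection of models and data/feature preprocessing tools, a solid professional pedigree.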
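The effect of include_estimators and exclude_estimators on the search space can be sketched in a few lines. The helper resolve_estimators below is hypothetical and is not auto-sklearn code; it only mirrors the behaviour described above (an include list, an exclude list, and the rule that the two cannot be combined), using the 16 classifier names listed earlier.

```python
# The 16 classifier names listed in this section.
CLASSIFIERS = [
    "adaboost", "bernoulli_nb", "decision_tree", "extra_trees", "gaussian_nb",
    "gradient_boosting", "k_nearest_neighbors", "lda", "liblinear_svc",
    "libsvm_svc", "multinomial_nb", "passive_aggressive", "qda",
    "random_forest", "sgd", "xgradient_boosting",
]


def resolve_estimators(available, include=None, exclude=None):
    """Hypothetical helper: return the estimators that remain in the search."""
    if include is not None and exclude is not None:
        raise ValueError("include and exclude cannot be used together")
    unknown = set(include or exclude or []) - set(available)
    if unknown:
        raise ValueError(f"unknown estimators: {sorted(unknown)}")
    if include is not None:
        # Search only the listed methods.
        return [name for name in available if name in include]
    if exclude is not None:
        # Search everything except the listed methods.
        return [name for name in available if name not in exclude]
    return list(available)


only_rf = resolve_estimators(CLASSIFIERS, include=["random_forest"])
without_sgd = resolve_estimators(CLASSIFIERS, exclude=["sgd"])
print(only_rf)
print(len(without_sgd))
```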

4. What does meta-learning do?

https://ml.informatik.uni-freiburg.de/papers/15-AAAI-MI-SMBO-poster.pdf

What is MI-SMBO?
Meta-learning Initialized Sequential Model-Based Bayesian Optimization

What is meta-learning?
Mimics human domain experts: use configurations which are known to work well on similar datasets

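In other words, it learns the modeling habits of algorithm engineers, who see a certain type of data and know which model is likely to fit, by producing metafeatures that describe the data:

Left: the black part is the standard Bayesian optimization flow; the red part is Bayesian optimization with meta-learning added.

Right: Metafeatures for the Iris dataset, i.e. features that describe what the data looks like. The formula beneath them computes the similarity between datasets; once a similar dataset is found, well-performing classifiers can be recommended from experience. A larger figure gives a feel for what metafeatures actually look like:
🔗 Paper link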

http://aad.informatik.uni-freiburg.de/papers/15-AAAI-MI-SMBO.pdf
🔗supplementary.pdf

http://codewithzhangyi.com/2018/07/26/AutoML/www.automl.org/aaai2015-mi-smbo-supplementary.pdf
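The meta-learning step described in this section can be sketched as: compute metafeatures for the new dataset, find the most similar known dataset, and start from the configuration that worked there. The sketch below is a simplification under stated assumptions: the four metafeatures, the unnormalized L1 distance, and the names dataset_a and dataset_b with their "best" classifiers are all made up for illustration.

```python
import math
from collections import Counter


def metafeatures(X, y):
    """Compute a few simple metafeatures for a dataset (rows X, labels y)."""
    n_samples = len(X)
    n_features = len(X[0])
    counts = Counter(y)
    probs = [c / n_samples for c in counts.values()]
    return {
        "log_n_samples": math.log(n_samples),
        "log_n_features": math.log(n_features),
        "n_classes": float(len(counts)),
        "class_entropy": -sum(p * math.log2(p) for p in probs),
    }


def l1_distance(a, b):
    """Sum of absolute differences over the shared metafeature keys."""
    return sum(abs(a[key] - b[key]) for key in a)


def most_similar(new_mf, known):
    """Return the name of the known dataset closest to new_mf."""
    return min(known, key=lambda name: l1_distance(new_mf, known[name]["metafeatures"]))


# Hypothetical experience store: metafeatures plus the classifier that did well.
KNOWN = {
    "dataset_a": {
        "metafeatures": {"log_n_samples": math.log(100), "log_n_features": math.log(4),
                         "n_classes": 2.0, "class_entropy": 1.0},
        "best_classifier": "random_forest",
    },
    "dataset_b": {
        "metafeatures": {"log_n_samples": math.log(10000), "log_n_features": math.log(50),
                         "n_classes": 10.0, "class_entropy": 3.3},
        "best_classifier": "libsvm_svc",
    },
}

# An Iris-shaped dataset: 150 rows, 4 features, 3 balanced classes.
X_new = [[0.0, 0.0, 0.0, 0.0] for _ in range(150)]
y_new = [i % 3 for i in range(150)]

mf = metafeatures(X_new, y_new)
neighbor = most_similar(mf, KNOWN)
print(neighbor, KNOWN[neighbor]["best_classifier"])
```

In a real system the suggested configurations would seed the Bayesian optimizer rather than replace it.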

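5. How does auto-sklearn tune hyperparameters automatically?

Terminology

SMBO: Sequential Model-based Bayesian/Global Optimization; most hyperparameter tuning methods are based on SMBO

SMAC: Sequential Model-based Algorithm Configuration; it records the performance observed for configurations in the configuration space and uses that experience to guide the search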

TPE: Tree-structured Parzen Estimator

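Hyperparameter tuning methods:

Grid Search (exhaustive search)
Impractical in high-dimensional spaces.

Random Search
Hyperparameter settings are sampled independently of one another, so they can be evaluated in parallel. Some settings perform well; others do not.

Heuristic Tuning (manual tuning)
Rule of thumb; time-consuming. (I am not sure "heuristic" is the right English term for this.)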

Automatic Hyperparameter Tuning

Bayesian Optimization

SMAC

TPE
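The difference between the first two methods in this list can be illustrated with a small sketch. The objective validation_loss and its two hyperparameters are assumptions made up for the example; the point is how the number of grid trials grows with the number of hyperparameters, while random search keeps a fixed budget.

```python
import itertools
import random


def validation_loss(learning_rate, momentum):
    """Toy objective: learning_rate dominates; the minimum is near 0.3."""
    return (learning_rate - 0.3) ** 2 + 0.01 * momentum


def grid_search(values_per_axis):
    """Evaluate every combination on an evenly spaced grid over [0, 1]^2."""
    grid = [i / (values_per_axis - 1) for i in range(values_per_axis)]
    trials = list(itertools.product(grid, grid))
    best = min(trials, key=lambda cfg: validation_loss(*cfg))
    return best, len(trials)


def random_search(n_trials, seed=0):
    """Sample each configuration independently and uniformly from [0, 1]^2."""
    rng = random.Random(seed)
    trials = [(rng.random(), rng.random()) for _ in range(n_trials)]
    best = min(trials, key=lambda cfg: validation_loss(*cfg))
    return best, len(trials)


def grid_size(values_per_axis, n_hyperparameters):
    """Number of trials a full grid needs."""
    return values_per_axis ** n_hyperparameters


grid_best, grid_trials = grid_search(values_per_axis=5)
random_best, random_trials = random_search(n_trials=25)
print(grid_best, grid_trials)
print(random_best, random_trials)
print(grid_size(5, 10))  # trials needed for 10 hyperparameters at 5 values each
```

With 5 values per axis the grid only ever tries 5 distinct learning rates, whereas 25 random trials try 25 distinct ones, which is one reason random search tends to hold up better when only a few hyperparameters matter.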

In auto-sklearn, the "bayesian optimizer" that keeps appearing is the answer: hyperparameters are tuned automatically with Bayesian optimization.
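The sequential, model-based loop behind this kind of tuning can be sketched without any library. This is a toy under loud assumptions: the 1-D objective is made up, the surrogate is a crude inverse-distance-weighted average rather than the random forest or Gaussian process a real optimizer would fit, and the acquisition function is a simple lower confidence bound. It is not SMAC and not auto-sklearn's implementation.

```python
def objective(x):
    """Toy validation loss with its minimum at x = 0.3."""
    return (x - 0.3) ** 2


def surrogate(x, history):
    """Crude surrogate model fitted to the evaluations so far.

    Returns an inverse-distance-weighted mean of observed losses and the
    distance to the nearest observation as a stand-in for uncertainty.
    """
    weights = [1.0 / (abs(x - xi) + 1e-9) for xi, _ in history]
    mean = sum(w * yi for w, (_, yi) in zip(weights, history)) / sum(weights)
    uncertainty = min(abs(x - xi) for xi, _ in history)
    return mean, uncertainty


def smbo(budget=10, kappa=1.0):
    """Sequential model-based optimization over a fixed candidate grid."""
    candidates = [i / 100 for i in range(101)]
    history = [(x, objective(x)) for x in (0.0, 1.0)]  # initial design
    while len(history) < budget:
        seen = {xi for xi, _ in history}

        def lower_confidence_bound(x):
            mean, uncertainty = surrogate(x, history)
            return mean - kappa * uncertainty  # low loss or high uncertainty wins

        # Pick the most promising unseen candidate, then pay for one evaluation.
        x_next = min((x for x in candidates if x not in seen), key=lower_confidence_bound)
        history.append((x_next, objective(x_next)))
    return min(history, key=lambda point: point[1]), history


best, history = smbo()
print(best)
```

Each iteration uses everything learned so far to decide where to evaluate next, which is what separates this family of methods from grid and random search.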

🔗 Link to a detailed explanation of Bayesian optimization

http://codewithzhangyi.com/2018/07/31/Auto%20Hyperparameter%20Tuning%20-%20Bayesian%20Optimization/

🔗 Paper link

https://pdfs.semanticscholar.org/681e/518fd8e3e986ba25bc1fb33aac8873b521e7.pdf

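6. How does auto-sklearn build model ensembles automatically?

Official answer: automated ensemble construction: use all classifiers that were found by Bayesian optimization
The library currently has 16 classifiers; the classifiers found by Bayesian optimization are combined into a weighted ensemble, for example (0.4 random forest + 0.2 sgd + 0.4 xgboost).
After fitting, you can print the result to see which classifiers make up the final model and what their parameter values are: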

import autosklearn.classification

# Build an AutoML classifier with default settings and fit it
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)

# Show the models in the final ensemble
automl.show_models()

Printing automl.show_models() shows which models make up the automatically built ensemble, their weights, and their hyperparameter values.
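Where weights such as 0.4 / 0.2 / 0.4 can come from is easier to see in code. As far as I understand the auto-sklearn paper, the ensemble is built by greedy ensemble selection over models evaluated during the search; the sketch below is a simplified version of that idea, not the library's code. The validation probabilities and the three model names are made-up inputs.

```python
def ensemble_accuracy(member_probs, y_true):
    """Accuracy of the averaged positive-class probability, thresholded at 0.5."""
    n = len(y_true)
    averaged = [sum(p[i] for p in member_probs) / len(member_probs) for i in range(n)]
    return sum((a >= 0.5) == bool(t) for a, t in zip(averaged, y_true)) / n


def greedy_ensemble_selection(val_probs, y_true, ensemble_size=5):
    """Repeatedly add, with replacement, the model that helps validation accuracy most.

    A model picked k times out of ensemble_size ends up with weight k / ensemble_size.
    """
    chosen = []
    for _ in range(ensemble_size):
        best_name = max(
            sorted(val_probs),
            key=lambda name: ensemble_accuracy(
                [val_probs[m] for m in chosen + [name]], y_true),
        )
        chosen.append(best_name)
    return {name: chosen.count(name) / ensemble_size for name in sorted(set(chosen))}


# Hypothetical positive-class probabilities on a validation set of 6 rows.
y_valid = [1, 0, 1, 1, 0, 0]
val_probs = {
    "random_forest": [0.9, 0.2, 0.4, 0.8, 0.3, 0.6],
    "sgd": [0.6, 0.6, 0.7, 0.4, 0.2, 0.3],
    "xgboost": [0.7, 0.1, 0.6, 0.6, 0.6, 0.2],
}

weights = greedy_ensemble_selection(val_probs, y_valid)
print(weights)
```

Selecting with replacement is what lets a strong model receive a larger weight than a weak one without any separate weight-fitting step.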

7. How do you use auto-sklearn?

Supported system: Linux

👍installation

http://automl.github.io/auto-sklearn/stable/installation.html

Official documentation 🔗

http://automl.github.io/auto-sklearn/stable/index.html

API documentation 🔗

http://automl.github.io/auto-sklearn/stable/api.html

Examples 🔗

http://automl.github.io/auto-sklearn/stable/manual.html

The usual recipe:

# Done in 4 lines of code
import autosklearn.classification
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)  # predicted class labels (0 or 1)

predictions_prob = automl.predict_proba(X_test)  # class probabilities between 0 and 1

From my own testing, X_train and y_train must not contain non-numeric data; strings such as Male/Female raise an error.
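One way around this is to turn string categories into integer codes before calling fit. The sketch below is a minimal pure-Python version under assumed data: the rows, the column position, and the helper names fit_mapping and apply_mapping are all hypothetical. The mapping is learned from the training rows and then reused on the test rows so both share the same codes.

```python
def fit_mapping(rows, column):
    """Learn a category -> integer code mapping from one column of the training rows."""
    categories = sorted({row[column] for row in rows})
    return {category: code for code, category in enumerate(categories)}


def apply_mapping(rows, column, mapping):
    """Return copies of the rows with the given column replaced by its integer code."""
    encoded = []
    for row in rows:
        new_row = list(row)
        new_row[column] = mapping[row[column]]  # KeyError if the category is unseen
        encoded.append(new_row)
    return encoded


# Made-up rows: [gender, age]
X_train_raw = [["Male", 34.0], ["Female", 28.0], ["Female", 45.0]]
X_test_raw = [["Female", 51.0], ["Male", 19.0]]

gender_codes = fit_mapping(X_train_raw, column=0)
X_train = apply_mapping(X_train_raw, 0, gender_codes)
X_test = apply_mapping(X_test_raw, 0, gender_codes)

print(gender_codes)
print(X_train)
print(X_test)
```

Integer codes imply an order that categories such as gender do not have; for columns with more than two categories, one-hot encoding is usually the safer choice.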

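Whatever features the training set has, the test set must have too. You can think of it as skipping manual feature selection, so the rawer the features initially loaded from the training set, the better.

It prints a great deal of output; read it patiently and you will find patterns like the one below.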

automl.sprint_statistics()

# Output:
# 'auto-sklearn results:
# Dataset name: 46b7545efa67d8cd76f70e71eb67b72e
# Metric: accuracy
# Best validation score: 0.955932
# Number of target algorithm runs: 1278
# Number of successful target algorithm runs: 1252
# Number of crashed target algorithm runs: 23
# Number of target algorithms that exceeded the time limit: 3
# Number of target algorithms that exceeded the memory limit: 0'

automl._get_automl_class()

# Output:
# autosklearn.automl.AutoMLClassifier
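The statistics text is easier to reason about once it is parsed. The sketch below assumes the output has the "key: value" layout shown above (the exact whitespace is an assumption), and parse_statistics is a hypothetical helper, not an auto-sklearn function. It checks that the run counts add up and computes the share of successful runs.

```python
# The statistics shown above, copied into a string for the sketch.
STATS_TEXT = """auto-sklearn results:
  Dataset name: 46b7545efa67d8cd76f70e71eb67b72e
  Metric: accuracy
  Best validation score: 0.955932
  Number of target algorithm runs: 1278
  Number of successful target algorithm runs: 1252
  Number of crashed target algorithm runs: 23
  Number of target algorithms that exceeded the time limit: 3
  Number of target algorithms that exceeded the memory limit: 0"""


def parse_statistics(text):
    """Turn 'key: value' lines into a dict, converting whole-number values to int."""
    stats = {}
    for line in text.splitlines()[1:]:  # skip the 'auto-sklearn results:' header
        key, _, value = line.strip().partition(": ")
        stats[key] = int(value) if value.isdigit() else value
    return stats


stats = parse_statistics(STATS_TEXT)
total_runs = stats["Number of target algorithm runs"]
accounted_for = (
    stats["Number of successful target algorithm runs"]
    + stats["Number of crashed target algorithm runs"]
    + stats["Number of target algorithms that exceeded the time limit"]
    + stats["Number of target algorithms that exceeded the memory limit"]
)
success_rate = stats["Number of successful target algorithm runs"] / total_runs

print(total_runs, accounted_for)
print(round(success_rate, 4))
```

A high share of crashed or timed-out runs is usually the first thing worth checking before trusting the best validation score.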

Other operations worth trying:

automl.score(X,y)

automl.get_models_with_weights()

automl.get_configuration_space(X,y)

8. What are auto-sklearn's current shortcomings?

9. The state of AutoML

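With Google releasing Cloud AutoML and its impressive features, attention to this area will probably keep growing~
Machine learning competitions are nothing new any more; there are now plenty of competitions dedicated to AutoML:
http://automl.chalearn.org/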
