Python手寫決策樹並應對過度擬合問題

2021-02-08 DeepHub IMBA

介紹

決策樹是一種用於監督學習的算法。它使用樹結構，其中包含兩種類型的節點：決策節點和葉節點。決策節點通過在要素上詢問布爾值將數據分為兩個分支。葉節點代表一個類。訓練過程是關於在具有特定特徵的特定特徵中找到「最佳」分割。預測過程是通過沿著路徑的每個決策節點回答問題來從根到達葉節點。

基尼不純度和熵

術語「最佳」拆分是指拆分之後，兩個分支比任何其他可能的拆分更「有序」。我們如何定義更多有序的？這取決於我們選擇哪種指標。通常，度量有兩種類型：基尼不純度和熵。這些指標越小，數據集就越「有序」。

這兩個指標之間的差異非常微妙。但在大多數應用中，兩個指標的行為類似。以下是用於計算每個指標的代碼。

def gini_impurity(y):
    # calculate gini_impurity given labels/classes of each example
    m = y.shape[0]
    cnts = dict(zip(*np.unique(y, return_counts = True)))
    impurity = 1 - sum((cnt/m)**2 for cnt in cnts.values())
    return impurity

def entropy(y):
    # calculate entropy given labels/classes of each example
    m = y.shape[0]
    cnts = dict(zip(*np.unique(y, return_counts = True)))
    disorder = - sum((cnt/m)*log(cnt/m) for cnt in cnts.values())
    return disorder
構建樹訓練過程實質上是在樹上建樹。關鍵步驟是確定「最佳」分配。過程如下：我們嘗試按每個功能中的每個唯一值分割數據，然後選擇混亂程度最小的最佳數據。現在我們可以將此過程轉換為python代碼。
def get_split(X, y):
    # loop through features and values to find best combination with the most information gain
    best_gain, best_index, best_value = 0, None, None
    cur_gini = gini_impurity(y)
    n_features = X.shape[1]  

    for index in range(n_features):  
        values = np.unique(X[:, index], return_counts = False)  

        for value in values:  
            left, right = test_split(index, value, X, y)
            if left['y'].shape[0] == 0 or right['y'].shape[0] == 0:
                continue
            gain = info_gain(left['y'], right['y'], cur_gini)
            if gain > best_gain:
                best_gain, best_index, best_value = gain, index, value
    best_split = {'gain': best_gain, 'index': best_index, 'value': best_value}
    return best_split
    
def test_split(index, value, X, y):
    # split a group of examples based on given index (feature) and value
    mask = X[:, index] < value 
    left = {'X': X[mask, :], 'y': y[mask]}
    right = {'X': X[~mask, :], 'y': y[~mask]}
    return left, right
    
def info_gain(l_y, r_y, cur_gini):
    # calculate the information gain for a certain split
    m, n = l_y.shape[0], r_y.shape[0]
    p = m / (m + n)
    return cur_gini - p * gini_impurity(l_y) - (1 - p) * gini_impurity(r_y)
在構建樹之前，我們先定義決策節點和葉節點。決策節點指定將在其上拆分的特徵和值。它還指向左，右子項。葉節點包括類似於Counter對象的字典，該字典顯示每個類有多少訓練示例。這對於計算訓練的準確性很有用。另外，它導致到達該葉子的每個示例的結果預測。
class Decision_Node:
    # define a decision node
    def __init__(self, index, value, left, right):
        self.index, self.value = index, value
        self.left, self.right = left, right

class Leaf:
    # define a leaf node
    def __init__(self, y):
        self.counts = dict(zip(*np.unique(y, return_counts = True)))
        self.prediction = max(self.counts.keys(), key = lambda x: self.counts[x])
鑑於其結構，通過遞歸構造樹是最方便的。遞歸的出口是葉節點。當我們無法通過拆分提高數據純度時，就會發生這種情況。如果我們可以找到「最佳」拆分，則這將成為決策節點。然後，我們對其左，右子級遞歸執行相同的操作。
def decision_tree(X, y):
    # train the decision tree model with a dataset
    correct_prediction = 0
    
    def build_tree(X, y):
        # recursively build the tree
        split = get_split(X, y)

        if split['gain'] == 0:
            nonlocal correct_prediction
            leaf = Leaf(y)
            correct_prediction += leaf.counts[leaf.prediction]
            return leaf

        left, right = test_split(split['index'], split['value'], X, y)
        left_node = build_tree(left['X'], left['y'])
        right_node = build_tree(right['X'], right['y'])

        return Decision_Node(split['index'], split['value'], left_node, right_node)
    
    root = build_tree(X, y)
    return correct_prediction/y.shape[0], root
預測現在我們可以遍歷樹直到葉節點來預測一個示例。
def predict(x, node):
    if isinstance(node, Leaf):
        return node.prediction
    
    if x[node.index] < node.value:
        return predict(x, node.left)
    else:
        return predict(x, node.right)
事實證明，訓練精度為100％，決策邊界看起來很奇怪！顯然，該模型過度擬合了訓練數據。好吧，如果考慮到這一點，如果我們繼續拆分直到數據集變得更純淨，決策樹將過度適合數據。換句話說，如果我們不停止分裂，該模型將正確分類每個示例！訓練準確性為100％（除非具有完全相同功能的不同類別的示例），這絲毫不令人驚訝。
如何應對過度擬合？
從上一節中，我們知道決策樹過擬合的幕後原因。為了防止過度擬合，我們需要在某個時候停止拆分樹。因此，我們需要引入兩個用於訓練的超參數。它們是：樹的最大深度和葉子的最小尺寸。讓我們重寫樹的構建部分。
def decision_tree(X, y, max_dep = 5, min_size = 10):
    # train the decision tree model with a dataset
    correct_prediction = 0
    
    def build_tree(X, y, dep, max_dep = max_dep, min_size = min_size):
        # recursively build the tree
        split = get_split(X, y)

        if split['gain'] == 0 or dep >= max_dep or y.shape[0] <= min_size:
            nonlocal correct_prediction
            leaf = Leaf(y)
            correct_prediction += leaf.counts[leaf.prediction]
            return leaf

        left, right = test_split(split['index'], split['value'], X, y)

        left_node = build_tree(left['X'], left['y'], dep + 1)
        right_node = build_tree(right['X'], right['y'], dep + 1)

        return Decision_Node(split['index'], split['value'], left_node, right_node)
    
    root = build_tree(X, y, 0)
    
    return correct_prediction/y.shape[0], root
現在我們可以重新訓練數據並繪製決策邊界。
樹的可視化
接下來，我們將通過列印出決策樹的節點來可視化決策樹。節點的壓痕與其深度成正比。
def print_tree(node, indent = "|---"):
    # print the tree
    if isinstance(node, Leaf):
        print(indent + 'Class:', node.prediction)
        return

    print(indent + 'feature_' + str(node.index) + 
           ' <= ' + str(round(node.value, 2)))
    print_tree(node.left, '|   ' + indent)

    print(indent + 'feature_' + str(node.index) + 
           ' > ' + str(round(node.value, 2)))
    print_tree(node.right, '|   ' + indent)
結果如下：
|---feature_1 <= 1.87
|   |---feature_1 <= -0.74
|   |   |---feature_1 <= -1.79
|   |   |   |---feature_1 <= -2.1
|   |   |   |   |---Class: 2
|   |   |   |---feature_1 > -2.1
|   |   |   |   |---Class: 2
|   |   |---feature_1 > -1.79
|   |   |   |---feature_0 <= 1.62
|   |   |   |   |---feature_0 <= -1.31
|   |   |   |   |   |---Class: 2
|   |   |   |   |---feature_0 > -1.31
|   |   |   |   |   |---feature_1 <= -1.49
|   |   |   |   |   |   |---Class: 1
|   |   |   |   |   |---feature_1 > -1.49
|   |   |   |   |   |   |---Class: 1
|   |   |   |---feature_0 > 1.62
|   |   |   |   |---Class: 2
|   |---feature_1 > -0.74
|   |   |---feature_1 <= 0.76
|   |   |   |---feature_0 <= 0.89
|   |   |   |   |---feature_0 <= -0.86
|   |   |   |   |   |---feature_0 <= -2.24
|   |   |   |   |   |   |---Class: 2
|   |   |   |   |   |---feature_0 > -2.24
|   |   |   |   |   |   |---Class: 1
|   |   |   |   |---feature_0 > -0.86
|   |   |   |   |   |---Class: 0
|   |   |   |---feature_0 > 0.89
|   |   |   |   |---feature_0 <= 2.13
|   |   |   |   |   |---Class: 1
|   |   |   |   |---feature_0 > 2.13
|   |   |   |   |   |---Class: 2
|   |   |---feature_1 > 0.76
|   |   |   |---feature_0 <= -1.6
|   |   |   |   |---Class: 2
|   |   |   |---feature_0 > -1.6
|   |   |   |   |---feature_0 <= 1.35
|   |   |   |   |   |---feature_1 <= 1.66
|   |   |   |   |   |   |---Class: 1
|   |   |   |   |   |---feature_1 > 1.66
|   |   |   |   |   |   |---Class: 1
|   |   |   |   |---feature_0 > 1.35
|   |   |   |   |   |---Class: 2
|---feature_1 > 1.87
|   |---Class: 2
總結與其他回歸模型不同，決策樹不使用正則化來對抗過度擬合。相反，它使用樹修剪。選擇正確的超參數（樹的深度和葉子的大小）還需要進行實驗，例如 使用超參數矩陣進行交叉驗證。
對於完整的工作流，包括數據生成和繪圖決策邊界，完整的代碼在這裡：https://github.com/JunWorks/ML-Algorithm-with-Python/blob/master/decision_tree/decision_tree.ipynb

作者：Jun M
deephub翻譯組

Python手寫決策樹並應對過度擬合問題

相關焦點

用Python構建和可視化決策樹

以鳶尾花數據集為例,用Python對決策樹進行分類

決策樹

python分類-決策樹

Python機器學習sklearn模塊——決策樹

決策樹可視化-python

sklearn學習(三):決策樹

【量化課堂】決策樹入門及Python應用

機器學習系列(二)決策樹(Decision Tree)

新知識 | 「過度擬合」

關於基於樹的建模的完整教程(從R&Python)

機器學習 | 決策樹之回歸樹

機器學習入門之決策樹1

機器學習 | 決策樹之分類樹

詳解決策樹 C4.5 算法

機器學習決策樹:sklearn分類和回歸

Python隨機森林 - CodeProject

人工智慧算法問題——正則化解決神經網絡中的過度擬合

【分類算法】基於 R 語言決策樹算法介紹及應用

決策樹與隨機森林(4)—— 決策樹C5.0算法

Python手寫決策樹並應對過度擬合問題

相關焦點

用Python構建和可視化決策樹

以鳶尾花數據集為例,用Python對決策樹進行分類

決策樹

python分類-決策樹

Python機器學習sklearn模塊——決策樹

決策樹可視化-python

sklearn學習(三):決策樹

【量化課堂】決策樹入門及Python應用

機器學習系列(二)決策樹(Decision Tree)

新知識 | 「過度擬合」

關於基於樹的建模的完整教程(從R&Python)

​機器學習 | 決策樹之回歸樹

機器學習入門之決策樹1

機器學習 | 決策樹之分類樹

詳解決策樹 C4.5 算法

機器學習決策樹:sklearn分類和回歸

Python隨機森林 - CodeProject

人工智慧算法問題——正則化解決神經網絡中的過度擬合

【分類算法】基於 R 語言決策樹算法介紹及應用

決策樹與隨機森林(4)—— 決策樹C5.0算法

機器學習 | 決策樹之回歸樹