A decision tree is an algorithm for supervised learning. It uses a tree structure containing two types of nodes: decision nodes and leaf nodes. A decision node splits the data into two branches by asking a boolean question about a feature; a leaf node represents a class. The training process consists of finding the "best" splits, i.e., which feature to split on and at which value. The prediction process reaches a leaf by starting at the root and answering the question at each decision node along the path.
Gini Impurity and Entropy

The term "best" split means that, after the split, the two branches are more "ordered" than they would be under any other possible split. How do we define "more ordered"? It depends on which metric we choose. There are generally two metrics: Gini impurity and entropy. The smaller the metric, the more "ordered" the dataset is.
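For reference (the original text only names the metrics; these are the standard definitions that the code below implements), write p_i for the fraction of examples of class i in a node:

Gini impurity: G = 1 - sum_i p_i^2
Entropy: H = - sum_i p_i * log(p_i)

The "information gain" used later is simply the parent's impurity minus the size-weighted average impurity of the two branches produced by a split.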
The difference between the two metrics is quite subtle, and in most applications they behave similarly. Below is the code to compute each metric.
import numpy as np
from math import log

def gini_impurity(y):
    # calculate Gini impurity given the labels/classes of each example
    m = y.shape[0]
    cnts = dict(zip(*np.unique(y, return_counts=True)))
    impurity = 1 - sum((cnt/m)**2 for cnt in cnts.values())
    return impurity

def entropy(y):
    # calculate entropy given the labels/classes of each example
    m = y.shape[0]
    cnts = dict(zip(*np.unique(y, return_counts=True)))
    disorder = -sum((cnt/m)*log(cnt/m) for cnt in cnts.values())
    return disorder

Building the Tree

The training process is essentially about growing the tree. The key step is determining the "best" split. The procedure is as follows: we try splitting the data on every unique value of every feature, then pick the split whose branches are the least disordered, i.e., the one with the greatest information gain. Now we can turn this procedure into Python code.
def get_split(X, y):
    # loop through features and values to find the combination with the most information gain
    best_gain, best_index, best_value = 0, None, None
    cur_gini = gini_impurity(y)
    n_features = X.shape[1]
    for index in range(n_features):
        values = np.unique(X[:, index], return_counts=False)
        for value in values:
            left, right = test_split(index, value, X, y)
            if left['y'].shape[0] == 0 or right['y'].shape[0] == 0:
                continue
            gain = info_gain(left['y'], right['y'], cur_gini)
            if gain > best_gain:
                best_gain, best_index, best_value = gain, index, value
    best_split = {'gain': best_gain, 'index': best_index, 'value': best_value}
    return best_split

def test_split(index, value, X, y):
    # split a group of examples based on a given index (feature) and value
    mask = X[:, index] < value
    left = {'X': X[mask, :], 'y': y[mask]}
    right = {'X': X[~mask, :], 'y': y[~mask]}
    return left, right

def info_gain(l_y, r_y, cur_gini):
    # calculate the information gain for a certain split
    m, n = l_y.shape[0], r_y.shape[0]
    p = m / (m + n)
    return cur_gini - p * gini_impurity(l_y) - (1 - p) * gini_impurity(r_y)

Before building the tree, let's define the decision node and the leaf node. A decision node specifies the feature and the value it splits on, and holds references to its left and right children. A leaf node contains a dictionary, similar to a Counter object, recording how many training examples of each class reached that leaf. This is useful for computing the training accuracy, and it also determines the prediction made for every example that ends up at that leaf.
class Decision_Node:
    # define a decision node
    def __init__(self, index, value, left, right):
        self.index, self.value = index, value
        self.left, self.right = left, right

class Leaf:
    # define a leaf node
    def __init__(self, y):
        self.counts = dict(zip(*np.unique(y, return_counts=True)))
        self.prediction = max(self.counts.keys(), key=lambda x: self.counts[x])

Given its structure, the tree is most conveniently built recursively. The exit of the recursion is a leaf node, which happens when we can no longer improve the purity of the data by splitting. If we can find a "best" split, it becomes a decision node, and we then recursively do the same for its left and right children.
def decision_tree(X, y):
    # train the decision tree model with a dataset
    correct_prediction = 0
    def build_tree(X, y):
        # recursively build the tree
        split = get_split(X, y)
        if split['gain'] == 0:
            nonlocal correct_prediction
            leaf = Leaf(y)
            correct_prediction += leaf.counts[leaf.prediction]
            return leaf
        left, right = test_split(split['index'], split['value'], X, y)
        left_node = build_tree(left['X'], left['y'])
        right_node = build_tree(right['X'], right['y'])
        return Decision_Node(split['index'], split['value'], left_node, right_node)
    root = build_tree(X, y)
    return correct_prediction/y.shape[0], root

Prediction

Now we can predict an example by traversing the tree until we reach a leaf node.
def predict(x, node):
    # walk down the tree until a leaf is reached, then return its class
    if isinstance(node, Leaf):
        return node.prediction
    if x[node.index] < node.value:
        return predict(x, node.left)
    else:
        return predict(x, node.right)

It turns out that the training accuracy is 100%, and the decision boundary looks strange! The model has clearly overfit the training data. If you think about it, this is expected: a decision tree will overfit if we keep splitting until the data cannot get any purer. In other words, if we never stop splitting, the model will classify every single example correctly. A 100% training accuracy (barring examples from different classes with exactly the same features) is not surprising at all.
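For concreteness, this is roughly how the model above is trained and queried (a minimal sketch; X and y stand for the training feature matrix and label vector, which are generated in the notebook linked at the end and not shown here):

# X: (n_samples, n_features) NumPy array, y: (n_samples,) array of integer class labels
train_acc, root = decision_tree(X, y)
print('training accuracy:', train_acc)      # comes out as 1.0 when splitting never stops

# classify one example by walking the tree from the root to a leaf
print('predicted class:', predict(X[0], root))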
How to Deal with Overfitting

From the previous section we know the reason behind the decision tree's overfitting. To prevent it, we need to stop splitting the tree at some point, so we introduce two hyperparameters for training: the maximum depth of the tree and the minimum size of a leaf. Let's rewrite the tree-building part.
def decision_tree(X, y, max_dep=5, min_size=10):
    # train the decision tree model with a dataset
    correct_prediction = 0
    def build_tree(X, y, dep, max_dep=max_dep, min_size=min_size):
        # recursively build the tree
        split = get_split(X, y)
        if split['gain'] == 0 or dep >= max_dep or y.shape[0] <= min_size:
            nonlocal correct_prediction
            leaf = Leaf(y)
            correct_prediction += leaf.counts[leaf.prediction]
            return leaf
        left, right = test_split(split['index'], split['value'], X, y)
        left_node = build_tree(left['X'], left['y'], dep + 1)
        right_node = build_tree(right['X'], right['y'], dep + 1)
        return Decision_Node(split['index'], split['value'], left_node, right_node)
    root = build_tree(X, y, 0)
    return correct_prediction/y.shape[0], root

Now we can retrain the model on the data and plot the decision boundary.
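Retraining with the stopping criteria might look like the following minimal sketch (the default limits are used here as an example; the plotting code lives in the notebook linked at the end and is not reproduced here):

# cap the depth and the minimum leaf size so that splitting stops early
train_acc, root = decision_tree(X, y, max_dep=5, min_size=10)
print('training accuracy with stopping criteria:', train_acc)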
Tree Visualization

Next, we will visualize the decision tree by printing out its nodes. The indentation of a node is proportional to its depth.
def print_tree(node, indent="|---"):
    # print the tree
    if isinstance(node, Leaf):
        print(indent + 'Class:', node.prediction)
        return
    print(indent + 'feature_' + str(node.index) +
          ' <= ' + str(round(node.value, 2)))
    print_tree(node.left, '| ' + indent)
    print(indent + 'feature_' + str(node.index) +
          ' > ' + str(round(node.value, 2)))
    print_tree(node.right, '| ' + indent)

The result is as follows:
|---feature_1 <= 1.87
| |---feature_1 <= -0.74
| | |---feature_1 <= -1.79
| | | |---feature_1 <= -2.1
| | | | |---Class: 2
| | | |---feature_1 > -2.1
| | | | |---Class: 2
| | |---feature_1 > -1.79
| | | |---feature_0 <= 1.62
| | | | |---feature_0 <= -1.31
| | | | | |---Class: 2
| | | | |---feature_0 > -1.31
| | | | | |---feature_1 <= -1.49
| | | | | | |---Class: 1
| | | | | |---feature_1 > -1.49
| | | | | | |---Class: 1
| | | |---feature_0 > 1.62
| | | | |---Class: 2
| |---feature_1 > -0.74
| | |---feature_1 <= 0.76
| | | |---feature_0 <= 0.89
| | | | |---feature_0 <= -0.86
| | | | | |---feature_0 <= -2.24
| | | | | | |---Class: 2
| | | | | |---feature_0 > -2.24
| | | | | | |---Class: 1
| | | | |---feature_0 > -0.86
| | | | | |---Class: 0
| | | |---feature_0 > 0.89
| | | | |---feature_0 <= 2.13
| | | | | |---Class: 1
| | | | |---feature_0 > 2.13
| | | | | |---Class: 2
| | |---feature_1 > 0.76
| | | |---feature_0 <= -1.6
| | | | |---Class: 2
| | | |---feature_0 > -1.6
| | | | |---feature_0 <= 1.35
| | | | | |---feature_1 <= 1.66
| | | | | | |---Class: 1
| | | | | |---feature_1 > 1.66
| | | | | | |---Class: 1
| | | | |---feature_0 > 1.35
| | | | | |---Class: 2
|---feature_1 > 1.87
| |---Class: 2

Summary

Unlike regression models, a decision tree does not use regularization to fight overfitting; instead, it relies on tree pruning. Choosing the right hyperparameters (the tree depth and the minimum leaf size) also takes experimentation, e.g., cross-validation over a hyperparameter grid.
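To illustrate that last point, a simple search over the two hyperparameters might look like the sketch below. It uses a single held-out validation split instead of full cross-validation for brevity, and it assumes X_train, y_train, X_val, y_val and the accuracy helper, none of which appear in the original post:

import numpy as np

def accuracy(X, y, root):
    # fraction of examples whose prediction matches the true label
    preds = np.array([predict(x, root) for x in X])
    return np.mean(preds == y)

best = None
for max_dep in [3, 5, 7, 10]:
    for min_size in [1, 5, 10, 20]:
        _, root = decision_tree(X_train, y_train, max_dep=max_dep, min_size=min_size)
        val_acc = accuracy(X_val, y_val, root)
        if best is None or val_acc > best[0]:
            best = (val_acc, max_dep, min_size)
print('best validation accuracy %.3f with max_dep=%d, min_size=%d' % best)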
For the full workflow, including data generation and plotting of the decision boundary, the complete code is here: https://github.com/JunWorks/ML-Algorithm-with-Python/blob/master/decision_tree/decision_tree.ipynb
Author: Jun M
Translated by the deephub translation team