直播案例 | 理解維度災難問題

2021-03-02 數據酷客

維度災難（curse of dimensionality）是指隨著特徵數量的增多，模型的複雜度指數級增長。本案例首先通過高維單位球體積的變化形象地演示高維空間中體積的反直覺現象。然後通過隨機生成的數據演示高維空間中歐式距離的失效特點，幫助理解維度災難對機器學習的影響。其次，我們通過 Trunk 數據集理解了 Hughes phenomenon，即分類器的性能隨著維數的增多先增大後減小的特點。我們通過一個月牙形數據集展示了核方法在利用維數的祝福方面的作用。最後，通過在一份隨機噪音數據集上訓練分類模型，理解隨著維數的增大，分類器不斷擬合噪音的過程。

高維空間中有很多反直觀的例子，例如對於高維球體，可以證明球體內部的面積和球體表面積處的體積相比可以忽略不計。我們首先來看看高維單位球體積隨維度的變化。

1 單位球體積隨著維度的變化

假設維度為

import numpy as np
import math
from scipy.special import gamma
def V(d,r):
    return math.pi**(d/2)*(r**d)/gamma(d/2 + 1)
print(V(3,1))
4.188790204786391
import pandas as pd

df = pd.DataFrame()

df["d"] = np.arange(1,20)
df["V"] = V(df["d"],1)
df.round(9)

dV012123.141593234.18879344.934802455.263789565.167713674.724766784.058712893.2985099102.55016410111.88410411121.33526312130.91062913140.59926514150.38144315160.23533116170.14098117180.08214618190.046622藉助 Python 的可視化工具 Matplotlib ，將單位球的體積隨著維度的變化使用折線圖進行可視化。
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(figsize=(12, 6)) #設置圖片大小
ds = np.arange(1,50)
plt.plot(ds,V(ds,1),color="#E4007F",marker="o")
plt.xlabel("維度 $d$")
plt.ylabel("單位球體積 $V_d$")
Text(0, 0.5, '單位球體積 $V_d$')
觀察上圖可以發現，單位球體積先隨著維度增大而增大（維度小於5），然後隨著維度的增大不斷體積不斷減小，不斷向0靠近。
0.9半徑體積的比例
def ratio(d):
    return (V(d,1) - V(d,0.9))/ V(d,1)
df["ratio90"] =  1 - 0.9**(df.d)
df.round(6).head(20)

dVratio90012.0000000.100000123.1415930.190000234.1887900.271000344.9348020.343900455.2637890.409510565.1677130.468559674.7247660.521703784.0587120.569533893.2985090.6125809102.5501640.65132210111.8841040.68618911121.3352630.71757012130.9106290.74581313140.5992650.77123214150.3814430.79410915160.2353310.81469816170.1409810.83322817180.0821460.84990518190.0466220.864915import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(figsize=(12, 6)) #設置圖片大小
ds = np.arange(1,50)
plt.plot(ds,ratio(ds),color="#E4007F",marker="o")
plt.xlabel("維度 $d$")
plt.ylabel("0.1邊界體積佔比")
Text(0, 0.5, '0.1邊界體積佔比')
2 高維空間中樣本距離的特點在高維空間中，距離度量失效，樣本之間的最大距離和最小距離的差距不斷減小。也即樣本之間的歐式距離差別不大。首先看看歐式距離的計算公式：
假設數據表示為 
def data_euclidean_dist(x):
    sum_x = np.sum(np.square(x), 1)
    dist = np.add(np.add(-2 * np.dot(x, x.T), sum_x).T, sum_x)
    return dist
x = np.array([[0, 1, 3], [3, 8, 9], [2, 3, 5]])
dist_matrix = data_euclidean_dist(x)
dist_matrix
array([[ 0, 94, 12],
[94,  0, 42],
[12, 42,  0]])
mask = np.ones(dist_matrix.shape, dtype=bool)
np.fill_diagonal(mask, 0)
dist_matrix[mask].min()
12
min_max = (dist_matrix.max() - dist_matrix.min())/dist_matrix.max()
min_max
1.0
生成一個包含5000個樣本500個維度的數據集。每一個維度都是從[-1,1]之間隨機生成。
X = np.random.uniform(-1,1,(5000,500))
X
array([[ 0.4996856 ,  0.8324241 ,  0.3870355 , ..., -0.23761176,
0.60227451, -0.15201565],
[ 0.73878924,  0.84450388, -0.6010043 , ...,  0.74107434,
0.67783831, -0.26190667],
[ 0.29775479, -0.6546214 , -0.44797607, ...,  0.1708173 ,-0.20420226, -0.76420954],
...,[-0.8260602 ,  0.42442039, -0.33873362, ..., -0.79854579,
0.08217504,  0.05219312],[ 0.34246276,  0.92894392, -0.24763796, ..., -0.03167687,
0.44623362,  0.46104014],[ 0.52112198, -0.39278192,  0.98184547, ...,  0.24774073,0.51651399,  0.42129946]])
X.shape
(5000, 500)
下面我們來觀察，隨著維度的增大，樣本之間歐式距離最大值和最小值之間的差距的變化趨勢。注意最大值和最小值應該去掉對角線的元素。
min_max_list = []
for d in range(1,500):
    dist_matrix = data_euclidean_dist(X[:,:d])
    mask = np.ones(dist_matrix.shape, dtype=bool)
    np.fill_diagonal(mask, 0)
    min_max = (dist_matrix[mask].max() - dist_matrix[mask].min())/dist_matrix[mask].max()
    if d%10 == 0:
        print(d,min_max.round(3))
    min_max_list.append(min_max)
10 0.997
20 0.974
30 0.925
40 0.894
50 0.834
60 0.813
70 0.799
80 0.767
90 0.763
100 0.725
110 0.717
120 0.691
130 0.702
140 0.691
150 0.653
160 0.656
170 0.646
180 0.639
190 0.602
200 0.603
210 0.592
220 0.585
230 0.579
240 0.556
250 0.552
260 0.556
270 0.537
280 0.532
290 0.527
300 0.528
310 0.529
320 0.531
330 0.521
340 0.511
350 0.501
360 0.496
370 0.483
380 0.478
390 0.487
400 0.478
410 0.464
420 0.456
430 0.446
440 0.451
450 0.441
460 0.438
470 0.445
480 0.443
490 0.443
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(figsize=(16, 4)) #設置圖片大小
ds = np.arange(0,len(min_max_list))
plt.plot(ds,min_max_list,color="#E4007F")
plt.xlabel("維度 $d$")
plt.ylabel("最大-最小距離差距")
Text(0, 0.5, '最大-最小距離差距')
可見，隨著維度的不斷增大，樣本之間的歐式距離趨向相同，距離的度量將不再有效。這會影響一系列基於距離的機器學習算法的有效性，包括K近鄰，K-Means聚類，支持向量機等。
3 維度對分類性能的影響（Hughes phenomenon）1979 年 G.V. Trunk 發表了論文 a very clear and simple example of the peaking phenomenon[1]。這篇論文被多次引用用來解釋和說明在分類模型中的 Hughes 現象：隨著維數的增加，分類器的效果會先上升後下降。
下面我們通過 Trunk 中的方法來理解 Hughes 現象。首先，需要生成隨機數據集。數據集樣本有兩個類別，按照如下方法不斷生成特徵：
• 2.對於特徵 從上述生成過程可以看到，隨著特徵的增多，兩類數據的可區分性越來越小。
首先，生成 Trunk 數據集。
import pandas as pd
import numpy as np
max_features,num_samples = 1000,500 #最大特徵數量和樣本數量
X_pos = pd.DataFrame()
X_neg = pd.DataFrame()

for i in range(max_features):
    X_pos["f"+str(i+1)] = np.random.randn(num_samples)+ np.sqrt(1/(i+1)) #生成當前特徵的正類樣本
    X_neg["f"+str(i+1)] = np.random.randn(num_samples)- np.sqrt(1/(i+1)) #生成當前特徵的負類樣本
    
X_pos["label"],X_neg["label"] = 0,1 #添加標籤
trunk_data = pd.concat([X_pos,X_neg],axis=0) #合併正類和負類樣本
trunk_data.head()

f1f2f3f4f5f6f7f8f9f10...f992f993f994f99501.595439-0.8112310.4588221.5980861.283388-0.495683-0.1317140.177101-0.2413470.485646...-1.559063-1.8883671.1956280.34565910.1413030.8219661.527353-0.264535-0.015876-0.1888311.236983-0.249804-0.3388630.408868...0.265744-0.2146620.3977450.1543262-0.5010481.7827750.375562-0.210164-1.2503341.30217-0.8443010.5923012.76939-0.753161...1.3677661.2423550.550537-1.43681132.149112-0.30866-1.1729041.8786851.0147261.6088351.137846-1.0552610.1504651.34898...-0.20990.9745322.2785010.14405440.347441.905093-0.86761.456725-0.2713070.6592960.639625-0.5155940.078049-1.194074...1.894632-0.9460710.1036792.0017935 rows × 1001 columns
隨機選擇幾個特徵，用散點圖可視化正負類樣本的分布情況。
import matplotlib.pyplot as plt
%matplotlib inline
features = [1,5,10,100]
fig, ax = plt.subplots(figsize=(16, 4)) #設置圖片大小

for i in range(len(features)):
    plt.subplot(1,4,i+1)
    plt.scatter(trunk_data[trunk_data.label == 0]["f" + str(features[i])], trunk_data[trunk_data.label == 0]["f" + str(features[i]+1)], color="#007979")
    plt.scatter(trunk_data[trunk_data.label == 1]["f" + str(features[i])], trunk_data[trunk_data.label == 1]["f" + str(features[i]+1)],color="#E4007F")
    plt.xlabel("feature " + str(features[i]))
    plt.ylabel("feature " + str(features[i]+1))
plt.tight_layout() 
下面，我們不斷增加特徵數量，觀察分類性能隨著維數的變化。
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(trunk_data.iloc[:,:-1],trunk_data["label"],test_size=0.5)#訓練集和測試集劃分
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
num_features = np.arange(1,max_features,10)
exp_times = 10 #試驗次數
test_result = np.zeros(len(num_features))
train_result = np.zeros(len(num_features))
for t in range(exp_times): #運行多次試驗
    scores_train = [] #記錄訓練集正確率
    scores_test = [] #記錄測試集正確率
    for num_feature in num_features:  #使用不同特徵數量
        clf = LinearDiscriminantAnalysis()
        clf.fit(X_train.iloc[:,:num_feature],y_train)
        score_train = clf.score(X_train.iloc[:,:num_feature],y_train)
        score_test = clf.score(X_test.iloc[:,:num_feature],y_test)
        scores_train.append(score_train)
        scores_test.append(score_test)
    train_result += np.array(scores_train)
    test_result += np.array(scores_test)
    print(t)
test_result /= exp_times
train_result /= exp_times
/explorer/pyenv/jupyter-py36/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:388: UserWarning: Variables are collinear. warnings.warn("Variables are collinear.")
0
1
2
3
4
5
6
7
8
9
隨著維數的增多，在測試集上的分類性能結果如下。
fig, ax = plt.subplots(figsize=(12, 6)) 
ax.plot(np.log10(num_features),test_result,color="#E4007F",marker=".")
plt.ylim(0.5,1)
plt.xlabel("number of features")
plt.ylabel("accuracy")
Text(0, 0.5, 'accuracy')
4 使用高維處理非線性問題4.1 維度的祝福生成一份月牙形隨機線性不可分隨機數據集。
import pandas as pd
from sklearn import datasets 
sample,target = datasets.make_moons(n_samples=300,shuffle=True,noise=0.1,random_state=0)
data = pd.DataFrame(data=sample,columns=["x1","x2"])
data["label"] = target
將兩類數據使用散點圖進行可視化。
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['axes.unicode_minus']=False  # 用來正常顯示負號
fig, ax = plt.subplots(figsize=(6, 6)) #設置圖片大小
ax.scatter(data[data.label==0]["x1"],data[data.label==0]["x2"],color="#007979") 
ax.scatter(data[data.label==1]["x1"],data[data.label==1]["x2"],color="#E4007F") 
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
Text(0, 0.5, '$x_2$')
使用線性支持向量機進行分類。
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(data[["x1","x2"]],data.label,test_size=0.5)
from sklearn.svm import LinearSVC
svm_linear = LinearSVC()
svm_linear.fit(X_train,y_train)
svm_linear.score(X_test,y_test)
0.9133333333333333
svm_linear.coef_
array([[ 0.31442609, -1.48346841]])
svm_linear.intercept_
array([0.15382003])

將線性分類器的決策直線繪製出來。
fig, ax = plt.subplots(figsize=(6, 6)) #設置圖片大小
ax.scatter(X_train[y_train==0]["x1"],X_train[y_train==0]["x2"],color="#007979") 
ax.scatter(X_train[y_train==1]["x1"],X_train[y_train==1]["x2"],color="#E4007F")
ax.scatter(X_test[y_test==0]["x1"],X_test[y_test==0]["x2"],color="#007979",marker="^",alpha=0.6) 
ax.scatter(X_test[y_test==1]["x1"],X_test[y_test==1]["x2"],color="#E4007F",marker="^",alpha=0.6)
x1 = np.linspace(-1.5,2.5,50)
x2 = - x1*svm_linear.coef_[0][0]/svm_linear.coef_[0][1] - svm_linear.intercept_[0]/svm_linear.coef_[0][1]
ax.plot(x1,x2,color="gray")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
Text(0, 0.5, '$x_2$')
現在使用帶核函數的支持向量機模型。
from sklearn.svm import SVC
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train,y_train)
svm_rbf.score(X_test,y_test)
0.98
可以看到，使用核函數映射到高維空間後，非線性分類問題得到更好地解決。在測試集上的分類正確率上升到 95% 以上。下面我們來看下使用核函數的支持向量機的決策面。
下面是兩個繪製分類決策面的輔助函數，代碼來源於Sklearn官網示例[2]。
def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))
    return xx, yy
將決策平面繪製出來。
xx, yy = make_meshgrid(X_train["x1"], X_train["x2"])

fig, ax = plt.subplots(figsize=(6, 6)) #設置圖片大小
ax.scatter(X_train[y_train==0]["x1"],X_train[y_train==0]["x2"],color="#007979") 
ax.scatter(X_train[y_train==1]["x1"],X_train[y_train==1]["x2"],color="#E4007F")
ax.scatter(X_test[y_test==0]["x1"],X_test[y_test==0]["x2"],color="#007979",marker="^",alpha=0.6) 
ax.scatter(X_test[y_test==1]["x1"],X_test[y_test==1]["x2"],color="#E4007F",marker="^",alpha=0.6)

plot_contours(ax, svm_rbf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.5)

plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
Text(0, 0.5, '$x_2$')
4.2 維度的詛咒從上一小節的分析我們看到，支持向量機使用核函數，幫助我們解決了低維空間線性不可分的問題。維度的增加也可能給我們帶來一系列問題，使得我們在訓練集上的訓練誤差不斷減小，造成模型效果不斷提升的錯覺。實際上，模型只是在不斷嘗試擬合訓練數據中的誤差而已，在測試機上的效果會很差。這種問題也稱為過度擬合。
下面來看一個比較極端的例子。我們隨機生成一份滿足標準正態分布的數據集，包含 1000 個樣本，500 個特徵。類別標籤是從 0 和 1 中隨機生成的。然後我們將數據集以 1：1 劃分為訓練集和測試集。
X = np.random.randn(1000,500) #生成滿足標準正太分布的特徵
y = np.random.randint(0,2,1000) #生成標籤
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.5)#訓練集和測試集劃分
藉助線性支持向量機模型，我們不斷增大使用的特徵數量，觀察模型在訓練集和測試集上的正確率變化情況。
from sklearn.svm import LinearSVC
num_features = np.arange(1,X.shape[1],20)
scores_train = [] #記錄訓練集正確率
scores_test = [] #記錄測試集正確率
for num_feature in num_features:
    linear_svm = LinearSVC() #新建線性支持向量機
    linear_svm.fit(X_train[:,:num_feature],y_train)
    score_train = linear_svm.score(X_train[:,:num_feature],y_train)
    score_test = linear_svm.score(X_test[:,:num_feature],y_test)
    scores_train.append(score_train)
    scores_test.append(score_test)
    print(num_feature,score_train,score_test)
1 0.526 0.486
21 0.562 0.502
41 0.63 0.51
61 0.638 0.482
81 0.664 0.476
101 0.688 0.478
121 0.712 0.454
141 0.728 0.46
161 0.77 0.498
181 0.764 0.478
201 0.79 0.492
221 0.796 0.5
241 0.846 0.52
261 0.908 0.504
281 0.986 0.488
301 0.994 0.47
321 1.0 0.49
341 1.0 0.474
361 1.0 0.476
381 1.0 0.46
401 1.0 0.458
421 1.0 0.484
441 1.0 0.472
461 1.0 0.472
481 1.0 0.462
將訓練集和測試集的分類正確率隨著維數的變化使用折線圖進行可視化。
fig, ax = plt.subplots(figsize=(8, 6)) #設置圖片大小
ax.plot(num_features,scores_train,color="#007979",marker="o")
ax.plot(num_features,scores_test,color="#E4007F",marker="^")
ax.hlines(0.5,0,500,color="gray",linestyles="--")
plt.ylim(0,1.2)
plt.xlabel("number of features")
plt.ylabel("accuracy")
Text(0, 0.5, 'accuracy')
5 總結通過一些數據集和案例展示了機器學習中的維度災難問題。
參考資料[1]a very clear and simple example of the peaking phenomenon: https://ieeexplore.ieee.org/document/4766926?tp=&arnumber=4766926&contentType=Journals%20%26%20Magazines&searchField%3DSearch_All%26queryText%3Dtrunk%201979%22%20data-mce-href=
[2]官網示例: https://scikit-learn.org/stable/auto_examples/svm/plot_iris_svc.html

直播案例 | 理解維度災難問題

相關焦點

不同維度之間互動障礙的解惑

基於關卡設計維度的戰棋遊戲系統與關卡設計用例

【徐進】時間維度與戰略目標

跨維度?

霍金:災難從未遠離我們

如何解決案例分析出現的問題(必背)

十分鐘理解Transformer

二十七篇系列:從多維度談Supercell估值下滑2/3,和產品的遇挫問題

直播沙龍報名 | ClickHouse 在 A/B 實驗和模型訓練的使用

直播:期貨MMA實戰系統

如何理解:城市的「信息化→智能化→智慧化」

網課|7-12年級暑假職業探索之商科案例分析 & 市場營銷

【綜述專欄】十分鐘理解Transformer

ICCV | 深度三維殘差神經網絡:視頻理解新突破

鍋爐水冷壁吹損,爆管停爐案例

TensorRT模型加速部署方案解析(視頻/案例講解)

Autopilot的深度學習網絡如何才能趨近真正的AI ——從實景中快速獲取時間維度的訓練數據方法(3)

國際心理大師布萊克曼:十天集訓系統講解101種心理防禦工作坊(線上直播)

今日直播 | 通過2D3D實際案例學習Cinemachine虛擬相機

中國美院2018年畢業展|視覺傳達設計系——關懷的維度2018

直播案例 | 理解維度災難問題

相關焦點

不同維度之間互動障礙的解惑

基於關卡設計維度的戰棋遊戲系統與關卡設計用例

【徐進】時間維度與戰略目標

跨維度?

霍金:災難從未遠離我們

如何解決案例分析出現的問題(必背)

十分鐘理解Transformer

二十七篇系列:從多維度談Supercell估值下滑2/3,和產品的遇挫問題

直播沙龍報名 | ClickHouse 在 A/B 實驗和模型訓練的使用

直播:期貨MMA實戰系統

如何理解:城市的「信息化→智能化→智慧化」

網課|7-12年級 暑假職業探索之 商科案例分析 & 市場營銷

【綜述專欄】十分鐘理解Transformer

ICCV | 深度三維殘差神經網絡:視頻理解新突破

鍋爐水冷壁吹損,爆管停爐案例

TensorRT模型加速部署方案解析(視頻/案例講解)

Autopilot的深度學習網絡如何才能趨近真正的AI ——從實景中快速獲取時間維度的訓練數據方法(3)

國際心理大師布萊克曼:十天集訓系統講解101種心理防禦工作坊(線上直播)

今日直播 | 通過2D3D實際案例學習Cinemachine虛擬相機

中國美院2018年畢業展|視覺傳達設計系——關懷的維度2018

網課|7-12年級暑假職業探索之商科案例分析 & 市場營銷