A 10,000-Word Summary of the Complex and Wonderful Gaussian Process!

2021-02-20 數據分析1480

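Theory of Gaussian processes

Implementing Gaussian processes in Python

Manual implementation with NumPy

Implementing Gaussian processes with `Scikit-learn`

Summary

Gaussian Process

Theory of Gaussian processes

The basic idea of non-parametric methods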

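In supervised learning, we often use a parametric model $p(\mathbf{y} \mid \mathbf{X}, \boldsymbol\theta)$ to explain the data, and infer the parameters $\boldsymbol\theta$ by maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP).

If needed, we can also infer a full posterior distribution $p(\boldsymbol\theta \mid \mathbf{X}, \mathbf{y})$ instead of a point estimate $\hat{\boldsymbol\theta}$.

As the complexity of the data increases, models with more parameters are usually needed to explain the data reasonably well. In non-parametric methods, the number of parameters depends on the size of the dataset.

For example, in Nadaraya-Watson kernel regression, a weight $w_i$ is assigned to each observed target $y_i$, and the prediction at a new point $\mathbf{x}$ is the weighted average

$$f(\mathbf{x}) = \sum_{i=1}^{N} w_i(\mathbf{x})\, y_i, \qquad w_i(\mathbf{x}) = \frac{\kappa(\mathbf{x}, \mathbf{x}_i)}{\sum_{i'=1}^{N} \kappa(\mathbf{x}, \mathbf{x}_{i'})}$$

Here, observations closer to $\mathbf{x}$ receive a higher weight than observations farther away. The weights are computed from $\mathbf{x}$ and the observed $\mathbf{x}_i$ with a kernel function $\kappa$.

A special case is k-nearest neighbors (KNN), where the $k$ observations closest to $\mathbf{x}$ each have weight $1/k$ and all other observations have weight $0$.

Non-parametric methods usually need to process all training data to make a prediction, so inference is slower than with parametric methods. On the other hand, training is usually faster, because a non-parametric model only needs to remember the training data.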
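
To make the weighted-average idea concrete, here is a minimal NumPy sketch of Nadaraya-Watson regression and its KNN special case for one-dimensional inputs. The Gaussian weighting kernel, the `bandwidth` parameter and all function names are assumptions chosen for illustration; they are not part of the article's own code.

```python
import numpy as np

def gaussian_kernel(x, xi, bandwidth=1.0):
    """Gaussian weighting kernel; `bandwidth` is an assumed smoothing parameter."""
    return np.exp(-0.5 * ((x - xi) / bandwidth) ** 2)

def nadaraya_watson(x_new, x_obs, y_obs, bandwidth=1.0):
    """Predict at x_new as a kernel-weighted average of the observed targets."""
    # Pairwise kernel values, shape (len(x_new), len(x_obs))
    k = gaussian_kernel(x_new[:, None], x_obs[None, :], bandwidth)
    # Normalize each row so that the weights w_i(x) sum to one
    w = k / k.sum(axis=1, keepdims=True)
    return w @ y_obs

def knn_regress(x_new, x_obs, y_obs, k=3):
    """KNN special case: the k nearest observations get weight 1/k, the rest 0."""
    dist = np.abs(x_new[:, None] - x_obs[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_obs[nearest].mean(axis=1)

# Observations of sin(x) on a symmetric grid
x_obs = np.linspace(-3, 3, 13)
y_obs = np.sin(x_obs)
x_new = np.array([-1.0, 0.0, 1.0])

y_nw = nadaraya_watson(x_new, x_obs, y_obs, bandwidth=0.5)
y_knn = knn_regress(x_new, x_obs, y_obs, k=3)
print(y_nw, y_knn)
```

Both predictors need the full set of observations at prediction time, which is the inference cost mentioned above.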

Note 1: Kernel regression is a non-parametric regression method; the model is based on (feature …

Note 2: …

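Basic concepts of Gaussian processes

Another example of a non-parametric method is the Gaussian process (GP). A parametric method infers a distribution over parameters, whereas a non-parametric method such as a Gaussian process infers a distribution over functions directly.

A Gaussian process defines a prior over functions. After some function values have been observed, the prior can be converted into a posterior over functions by algebraic manipulation.

Inference of continuous function values in this setting is called GP regression, but GPs can also be used for classification.

A Gaussian process is a random process that assigns a random variable $f(\mathbf{x})$ to every point $\mathbf{x} \in \mathbb{R}^d$, such that the joint distribution of any finite number of these variables, $p(f(\mathbf{x}_1), \dots, f(\mathbf{x}_N))$, is itself Gaussian:

$$p(\mathbf{f} \mid \mathbf{X}) = \mathcal{N}(\mathbf{f} \mid \boldsymbol\mu, \mathbf{K}) \tag{1}$$

Here $\mathbf{f} = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_N))$, $\boldsymbol\mu = (m(\mathbf{x}_1), \dots, m(\mathbf{x}_N))$ and $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$.

$m$ is the mean function, for which $m(\mathbf{x}) = 0$ is commonly used, and $\kappa$ is a positive definite kernel function, also called the covariance function. A Gaussian process is therefore a distribution over functions whose shape (smoothness and similar properties) is determined by the kernel matrix $\mathbf{K}$.

There is one underlying assumption here: if the kernel function considers the points $\mathbf{x}_i$ and $\mathbf{x}_j$ to be similar, then the function values at these points, $f(\mathbf{x}_i)$ and $f(\mathbf{x}_j)$, can be expected to be similar as well.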

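After some data $\mathbf{y}$ have been observed, the GP prior $p(\mathbf{f} \mid \mathbf{X})$ can be converted into a GP posterior $p(\mathbf{f} \mid \mathbf{X}, \mathbf{y})$, which can then be used to make predictions $\mathbf{f}_*$ at new inputs $\mathbf{X}_*$:

$$p(\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y}) = \int p(\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{f})\, p(\mathbf{f} \mid \mathbf{X}, \mathbf{y})\, d\mathbf{f} = \mathcal{N}(\mathbf{f}_* \mid \boldsymbol\mu_*, \boldsymbol\Sigma_*) \tag{2}$$

where (2) is the posterior predictive distribution, which is again Gaussian, with mean $\boldsymbol\mu_*$ and covariance $\boldsymbol\Sigma_*$.

By the definition of a GP, the observed data $\mathbf{y}$ and the predictions $\mathbf{f}_*$ are jointly Gaussian:

$$\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{pmatrix} \mathbf{K}_y & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**} \end{pmatrix}\right) \tag{3}$$

Given $N$ training points and $N_*$ new inputs, $\mathbf{K}_y = \kappa(\mathbf{X}, \mathbf{X}) + \sigma_y^2 \mathbf{I} = \mathbf{K} + \sigma_y^2 \mathbf{I}$ is $N \times N$, $\mathbf{K}_* = \kappa(\mathbf{X}, \mathbf{X}_*)$ is $N \times N_*$ and $\mathbf{K}_{**} = \kappa(\mathbf{X}_*, \mathbf{X}_*)$ is $N_* \times N_*$. $\sigma_y^2$ is the noise term on the diagonal of $\mathbf{K}_y$: it is zero for noise-free training targets and greater than zero for noisy observations. The mean is set to $\mathbf{0}$ for notational simplicity.

The sufficient statistics of the posterior predictive distribution are then

$$\boldsymbol\mu_* = \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{y} \tag{4}$$

$$\boldsymbol\Sigma_* = \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{K}_* \tag{5}$$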
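
For intuition, the conditioning step can be worked through by hand in the smallest possible case: one noise-free training point and one test point. The sketch below assumes a zero-mean GP, the squared exponential kernel introduced in the next section with unit parameters, and the standard Gaussian conditioning formulas, mean = K_*^T K_y^{-1} y and covariance = K_** - K_*^T K_y^{-1} K_*. The helper name `rbf` and the numbers are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, l=1.0, sigma_f=1.0):
    """Squared exponential kernel for scalar inputs (illustrative helper)."""
    return sigma_f**2 * np.exp(-0.5 * (a - b) ** 2 / l**2)

x_train, y_train = 0.0, 1.0   # one noise-free observation
x_star = 1.0                  # one test input

K_y = rbf(x_train, x_train)   # prior variance at the training input (no noise added)
K_s = rbf(x_train, x_star)    # prior covariance between training and test input
K_ss = rbf(x_star, x_star)    # prior variance at the test input

# With scalars, the matrix inverse reduces to a division
mu_star = K_s / K_y * y_train
var_star = K_ss - K_s / K_y * K_s

print(mu_star, var_star)
```

The posterior mean is exp(-0.5), about 0.607, and the posterior variance is 1 - exp(-1), about 0.632: the observation pulls the prediction toward 1 and reduces the prior variance of 1, and both effects weaken as the test point moves away from the training point.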

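This is the minimum knowledge needed to implement a Gaussian process and apply it to regression problems. The next part shows how to implement a GP from scratch with plain NumPy, and the part after that shows how to use the GP implementation in scikit-learn.

Implementing Gaussian processes in Python

Manual implementation with NumPy

Defining the kernel function

Here we use the squared exponential kernel, also known as the Gaussian kernel or RBF kernel:

$$\kappa(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\left(-\frac{1}{2l^2} (\mathbf{x}_i - \mathbf{x}_j)^T (\mathbf{x}_i - \mathbf{x}_j)\right) \tag{6}$$

The length parameter $l$ controls the smoothness of the function, and $\sigma_f$ controls its vertical variation. For simplicity, the same length parameter $l$ is used for all input dimensions (an isotropic kernel).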

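import numpy as np

def kernel(X1, X2, l=1.0, sigma_f=1.0):
    '''
    Isotropic squared exponential kernel.
    Computes the covariance matrix between the points in X1 and X2.
    
    Args:
        X1: ndarray of m points (m x d).
        X2: ndarray of n points (n x d).
        l: length parameter of the kernel.
        sigma_f: vertical variation parameter of the kernel.

    Returns:
        Covariance matrix (m x n).
    '''
    # Squared distances for all pairs: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)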
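
The vectorized distance computation in `kernel` relies on the expansion ‖a − b‖² = ‖a‖² + ‖b‖² − 2 a·b. As a quick sanity check, the sketch below compares it with a deliberately slow reference that evaluates the kernel one pair at a time. The name `kernel_loops` and the random test inputs are assumptions made for this check.

```python
import numpy as np

def kernel(X1, X2, l=1.0, sigma_f=1.0):
    """Isotropic squared exponential kernel, vectorized (same formula as above)."""
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)

def kernel_loops(X1, X2, l=1.0, sigma_f=1.0):
    """Reference implementation that evaluates the kernel pair by pair."""
    K = np.empty((len(X1), len(X2)))
    for i, a in enumerate(X1):
        for j, b in enumerate(X2):
            diff = a - b
            K[i, j] = sigma_f**2 * np.exp(-0.5 * diff.dot(diff) / l**2)
    return K

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))   # 4 points in 2 dimensions
B = rng.normal(size=(3, 2))   # 3 points in 2 dimensions

K_fast = kernel(A, B, l=0.7, sigma_f=1.5)
K_slow = kernel_loops(A, B, l=0.7, sigma_f=1.5)
print(K_fast.shape, np.allclose(K_fast, K_slow))  # (4, 3) True
```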
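Defining the prior

We first define a prior over functions with mean 0 and a covariance matrix computed with the kernel parameters $l = 1$ and $\sigma_f = 1$.
To draw random functions from this GP, we draw random samples from the corresponding multivariate Gaussian distribution.
The following example draws three random samples and plots them together with the zero mean and the 95% confidence interval (computed from the diagonal of the covariance matrix).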
# Plotting helper functions
import matplotlib.pyplot as plt

from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib versions

def plot_gp(mu, cov, X, X_train=None, Y_train=None, samples=()):
    X = X.ravel()
    mu = mu.ravel()
    # Half-width of the 95% confidence interval
    uncertainty = 1.96 * np.sqrt(np.diag(cov))
    
    plt.fill_between(X, mu + uncertainty, mu - uncertainty, alpha=0.1)
    
    plt.plot(X, mu, label='Mean')
    for i, sample in enumerate(samples):
        plt.plot(X, sample, lw=1, ls='--', label=f'Sample {i+1}')
    if X_train is not None:
        plt.plot(X_train, Y_train, 'rx')
    plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left")

def plot_gp_2D(gx, gy, mu, X_train, Y_train, title, i):
    ax = plt.gcf().add_subplot(1, 2, i, projection='3d')
    ax.plot_surface(gx, gy, mu.reshape(gx.shape), cmap=cm.coolwarm, linewidth=0, alpha=0.2, antialiased=False)
    ax.scatter(X_train[:,0], X_train[:,1], Y_train, c=Y_train, cmap=cm.coolwarm)
    ax.set_title(title)
# IPython/Jupyter magic: render figures inline (omit in a plain Python script)
%matplotlib inline

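# A finite set of input points
X = np.arange(-5, 5, 0.2).reshape(-1, 1)

# Mean and covariance of the prior
mu = np.zeros(X.shape)
cov = kernel(X, X)

# Draw three samples from the prior (a multivariate Gaussian distribution)
samples = np.random.multivariate_normal(mu.ravel(), cov, 3)

# Plot the GP mean, the confidence interval and the samples
plot_gp(mu, cov, X, samples=samples)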
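
Sampling from the multivariate Gaussian can also be written out explicitly, which shows what the draw involves. The sketch below factors the covariance as L Lᵀ with a Cholesky decomposition and transforms standard normal noise z into samples L z. The jitter value `1e-6` added to the diagonal and the fixed random seed are assumptions made here for numerical stability and reproducibility; this is an alternative to `np.random.multivariate_normal`, not the method used above.

```python
import numpy as np

def kernel(X1, X2, l=1.0, sigma_f=1.0):
    """Isotropic squared exponential kernel (same formula as above)."""
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)

X = np.arange(-5, 5, 0.2).reshape(-1, 1)
cov = kernel(X, X)

# The kernel matrix is nearly singular, so a small jitter (assumed value)
# is added to the diagonal before factorizing
jitter = 1e-6
L = np.linalg.cholesky(cov + jitter * np.eye(len(X)))

# If z ~ N(0, I), then L z ~ N(0, L L^T), i.e. approximately N(0, cov)
rng = np.random.default_rng(42)
z = rng.standard_normal((len(X), 3))
samples = (L @ z).T   # three sampled functions, one per row

print(samples.shape)
```

The rows of `samples` have the same layout as the output of `np.random.multivariate_normal(mu.ravel(), cov, 3)`, so they can be passed to a plotting helper in the same way.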
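Prediction from noise-free training data

To compute the sufficient statistics, i.e. the mean and covariance matrix of the posterior predictive distribution, we implement equations (4) and (5) with the following code.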
# Import inv() for computing the matrix inverse
from numpy.linalg import inv

def posterior_predictive(X_s, X_train, Y_train, l=1.0, sigma_f=1.0, sigma_y=1e-8):
    '''
    Computes the sufficient statistics of the posterior predictive
    distribution, given m training points X_train and Y_train
    and n new inputs X_s.
    
    Args:
        X_s: new input locations (n x d).
        X_train: training inputs (m x d).
        Y_train: training outputs (m x 1).
        l: length parameter of the kernel.
        sigma_f: vertical variation parameter of the kernel.
        sigma_y: noise parameter.
    
    Returns:
        Posterior mean vector (n x 1) and covariance matrix (n x n).
    '''
    K = kernel(X_train, X_train, l, sigma_f) + sigma_y**2 * np.eye(len(X_train))
    K_s = kernel(X_train, X_s, l, sigma_f)
    # A small jitter on the diagonal keeps K_ss numerically well behaved
    K_ss = kernel(X_s, X_s, l, sigma_f) + 1e-8 * np.eye(len(X_s))
    K_inv = inv(K)
    
    # Equation (4)
    mu_s = K_s.T.dot(K_inv).dot(Y_train)

    # Equation (5)
    cov_s = K_ss - K_s.T.dot(K_inv).dot(K_s)
    
    return mu_s, cov_s
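
Explicitly inverting the kernel matrix is easy to read but is not the most robust option numerically. The sketch below is one possible variant that solves the linear systems with a Cholesky factorization through `scipy.linalg.cho_factor` and `cho_solve`. The function name `posterior_predictive_chol` and the example data are assumptions; the statistics it computes are the same posterior mean and covariance as above.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def kernel(X1, X2, l=1.0, sigma_f=1.0):
    """Isotropic squared exponential kernel (same formula as above)."""
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)

def posterior_predictive_chol(X_s, X_train, Y_train, l=1.0, sigma_f=1.0, sigma_y=1e-8):
    """Posterior mean and covariance via Cholesky solves instead of inv(K)."""
    K = kernel(X_train, X_train, l, sigma_f) + sigma_y**2 * np.eye(len(X_train))
    K_s = kernel(X_train, X_s, l, sigma_f)
    K_ss = kernel(X_s, X_s, l, sigma_f) + 1e-8 * np.eye(len(X_s))

    factor = cho_factor(K, lower=True)            # K = L L^T
    mu_s = K_s.T @ cho_solve(factor, Y_train)     # K_s^T K^{-1} y
    cov_s = K_ss - K_s.T @ cho_solve(factor, K_s) # K_ss - K_s^T K^{-1} K_s
    return mu_s, cov_s

# Small example with an assumed noise level of 0.1
X_train = np.array([-4, -3, -2, -1, 1], dtype=float).reshape(-1, 1)
Y_train = np.sin(X_train)
X_s = np.linspace(-5, 5, 25).reshape(-1, 1)

mu_s, cov_s = posterior_predictive_chol(X_s, X_train, Y_train, sigma_y=0.1)
print(mu_s.shape, cov_s.shape)
```

For the small problems in this article both approaches give matching results; the difference matters mainly when the kernel matrix is large or badly conditioned.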
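We now apply them to the noise-free training data X_train and Y_train. The following example draws 3 samples from the posterior predictive distribution and plots them together with the mean, the confidence interval and the training data. In a noise-free model, the variance at the training points is 0.
Note: all random functions drawn from the posterior predictive distribution pass through the training points.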
# 5 noise-free training inputs
X_train = np.array([-4, -3, -2, -1, 1]).reshape(-1, 1)
# y = sin(x)
Y_train = np.sin(X_train)

# Mean vector and covariance matrix of the posterior predictive distribution
mu_s, cov_s = posterior_predictive(X, X_train, Y_train)

# Draw 3 samples from the posterior predictive distribution
samples = np.random.multivariate_normal(mu_s.ravel(), cov_s, 3)
plot_gp(mu_s, cov_s, X, X_train=X_train, Y_train=Y_train, samples=samples)
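Prediction from noisy training data

If the model includes some noise, the training points are only approximated, and the variance at the training points is non-zero.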
# Noise level
noise = 0.4

# Noisy training data
X_train = np.arange(-3, 4, 1).reshape(-1, 1)
Y_train = np.sin(X_train) + noise * np.random.randn(*X_train.shape)

# Compare the noisy training data with the noise-free function values
plt.figure()
plt.plot(X_train, Y_train, lw=1, ls='-.', label='Noisy')
plt.plot(X_train, np.sin(X_train), lw=2.5, ls='-', label='Noise-free')
plt.plot(X_train, np.sin(X_train), 'rx')
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left")
plt.show()
# Mean vector and covariance matrix of the posterior predictive distribution
mu_s, cov_s = posterior_predictive(X, X_train, Y_train, sigma_y=noise)

# Draw 3 samples from the posterior predictive distribution
samples = np.random.multivariate_normal(mu_s.ravel(), cov_s, 3)
plot_gp(mu_s, cov_s, X, X_train=X_train, Y_train=Y_train, samples=samples)
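
The two claims about the training points (zero variance and exact interpolation without noise, non-zero variance and only approximate fit with noise) can be checked numerically by evaluating the posterior at the training inputs themselves. This is a small sketch that repeats the two functions from above so that it runs on its own; the data and the noise level 0.4 are taken from the examples, and the printed quantities are illustrative.

```python
import numpy as np
from numpy.linalg import inv

def kernel(X1, X2, l=1.0, sigma_f=1.0):
    """Isotropic squared exponential kernel (same formula as above)."""
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)

def posterior_predictive(X_s, X_train, Y_train, l=1.0, sigma_f=1.0, sigma_y=1e-8):
    """Posterior mean and covariance, following the implementation above."""
    K = kernel(X_train, X_train, l, sigma_f) + sigma_y**2 * np.eye(len(X_train))
    K_s = kernel(X_train, X_s, l, sigma_f)
    K_ss = kernel(X_s, X_s, l, sigma_f) + 1e-8 * np.eye(len(X_s))
    K_inv = inv(K)
    return K_s.T.dot(K_inv).dot(Y_train), K_ss - K_s.T.dot(K_inv).dot(K_s)

X_train = np.array([-4, -3, -2, -1, 1], dtype=float).reshape(-1, 1)
Y_train = np.sin(X_train)

# Evaluate the posterior at the training inputs, without and with noise
mu_free, cov_free = posterior_predictive(X_train, X_train, Y_train)
mu_noisy, cov_noisy = posterior_predictive(X_train, X_train, Y_train, sigma_y=0.4)

gap_free = np.abs(mu_free - Y_train).max()     # distance of the mean from the targets
gap_noisy = np.abs(mu_noisy - Y_train).max()
print(gap_free, np.diag(cov_free).max())       # both close to zero
print(gap_noisy, np.diag(cov_noisy).min())     # both clearly above zero
```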
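Effect of kernel parameters and the noise parameter

The following example shows the effect of the kernel parameters $l$ and $\sigma_f$ and of the noise parameter $\sigma_y$:
The length parameter $l$ controls smoothness: larger values give smoother functions and a coarser fit to the training data, while smaller values give more wiggly functions with wide confidence intervals between training points (first row of the figure below).
$\sigma_f$ controls the vertical variation of functions drawn from the GP (second row of the figure below).
$\sigma_y$ represents the amount of noise in the training data: larger values lead to a coarser fit, which avoids overfitting noisy data (third row of the figure below).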
params = [
    (0.3, 1.0, 0.2),
    (3.0, 1.0, 0.2),
    (1.0, 0.3, 0.2),
    (1.0, 3.0, 0.2),
    (1.0, 1.0, 0.05),
    (1.0, 1.0, 1.5),
]

plt.figure(figsize=(12, 5))

for i, (l, sigma_f, sigma_y) in enumerate(params):
    mu_s, cov_s = posterior_predictive(X, X_train, Y_train, l=l, 
                                       sigma_f=sigma_f, 
                                       sigma_y=sigma_y)
    plt.subplot(3, 2, i + 1)
    plt.subplots_adjust(top=2)
    plt.title(f'l = {l}, sigma_f = {sigma_f}, sigma_y = {sigma_y}')
    plot_gp(mu_s, cov_s, X, X_train=X_train, Y_train=Y_train)
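The optimal values of these parameters can be obtained by maximizing the log marginal likelihood, given by [1] [3]:

$$\log p(\mathbf{y} \mid \mathbf{X}) = \log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{K}_y) = -\frac{1}{2} \mathbf{y}^T \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2} \log \lvert \mathbf{K}_y \rvert - \frac{N}{2} \log(2\pi) \tag{7}$$

where $\mathbf{K}_y = \mathbf{K} + \sigma_y^2 \mathbf{I}$ is the kernel matrix of the training inputs plus the noise term. In the code below, we minimize the negative log marginal likelihood to obtain the kernel parameters $l$ and $\sigma_f$; the noise level $\sigma_y$ is set to the known noise level of the data.
from numpy.linalg import cholesky, det, lstsq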
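from scipy.optimize import minimize

def nll_fn(X_train, Y_train, noise, naive=True):
    '''
    Returns a function that computes the negative log marginal
    likelihood for the given training data X_train and Y_train
    and the given noise level.
    
    Args:
        X_train: training inputs (m x d).
        Y_train: training outputs (m x 1).
        noise: noise level of Y_train.
        naive: if True, use a direct implementation of equation (7);
               if False, use a numerically more stable implementation
               based on the Cholesky decomposition.
        
    Returns:
        The objective function to be minimized.
    '''
    # Flatten the targets so that the objective returns a scalar
    Y_train = Y_train.ravel()

    def nll_naive(theta):
        # Direct implementation of equation (7).
        # Numerically less stable than nll_stable below.
        K = kernel(X_train, X_train, l=theta[0], sigma_f=theta[1]) + \
            noise**2 * np.eye(len(X_train))
        return 0.5 * np.log(det(K)) + \
               0.5 * Y_train.dot(inv(K).dot(Y_train)) + \
               0.5 * len(X_train) * np.log(2*np.pi)

    def nll_stable(theta):
        # Numerically more stable than nll_naive: works with the
        # Cholesky factor L of K instead of det(K) and inv(K)
        K = kernel(X_train, X_train, l=theta[0], sigma_f=theta[1]) + \
            noise**2 * np.eye(len(X_train))
        L = cholesky(K)
        return np.sum(np.log(np.diagonal(L))) + \
               0.5 * Y_train.dot(lstsq(L.T, lstsq(L, Y_train, rcond=None)[0], rcond=None)[0]) + \
               0.5 * len(X_train) * np.log(2*np.pi)
    
    if naive:
        return nll_naive
    else:
        return nll_stable

# Find the parameters l and sigma_f that minimize the objective.
# In practice, the minimization should be run several times with different
# initializations to avoid local minima, but this is skipped here for simplicity.
res = minimize(nll_fn(X_train, Y_train, noise), [1, 1], 
               bounds=((1e-5, None), (1e-5, None)),
               method='L-BFGS-B')

# Store the optimization results in global variables so that they can be
# compared later with the results of other implementations
l_opt, sigma_f_opt = res.x

# Compute the posterior predictive statistics with the optimized kernel
# parameters and plot the results
mu_s, cov_s = posterior_predictive(X, X_train, Y_train, l=l_opt, sigma_f=sigma_f_opt, sigma_y=noise)
plot_gp(mu_s, cov_s, X, X_train=X_train, Y_train=Y_train)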
The optimized kernel parameters l and sigma_f are:
print(l_opt, sigma_f_opt)
0.9872536793237083 0.8613778055591963
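
The code comments above note that the minimization should really be repeated from several starting points to avoid local minima. A minimal sketch of such a restart loop follows. The number of restarts (10), the sampling range for the initial values (0.1 to 5.0), the fixed seed and the name `nll` are all assumptions, and the objective is written with `scipy.linalg.cho_factor` rather than with the article's `nll_fn`.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def kernel(X1, X2, l=1.0, sigma_f=1.0):
    """Isotropic squared exponential kernel (same formula as above)."""
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)

def nll(theta, X_train, Y_train, noise):
    """Negative log marginal likelihood for theta = (l, sigma_f)."""
    y = Y_train.ravel()
    K = kernel(X_train, X_train, l=theta[0], sigma_f=theta[1]) + noise**2 * np.eye(len(y))
    c, lower = cho_factor(K, lower=True)
    # 0.5 * log|K| equals the sum of the log-diagonal of the Cholesky factor
    return (np.sum(np.log(np.diag(c)))
            + 0.5 * y @ cho_solve((c, lower), y)
            + 0.5 * len(y) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
noise = 0.4
X_train = np.arange(-3, 4, 1, dtype=float).reshape(-1, 1)
Y_train = np.sin(X_train) + noise * rng.standard_normal(X_train.shape)

# Run L-BFGS-B from 10 random initial (l, sigma_f) pairs and keep every result
results = []
for start in rng.uniform(0.1, 5.0, size=(10, 2)):
    res = minimize(nll, start, args=(X_train, Y_train, noise),
                   bounds=((1e-5, None), (1e-5, None)), method='L-BFGS-B')
    results.append(res)

# The best restart is the one with the lowest objective value
best = min(results, key=lambda r: r.fun)
l_best, sigma_f_best = best.x
print(l_best, sigma_f_best, best.fun)
```

If most restarts end at the same parameters, the single-start result is probably not a poor local minimum; if they disagree strongly, the restart with the lowest objective value is the one to keep.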
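Gaussian processes in higher dimensions

The implementation above can also be used for input data of higher dimension. Here, a GP is fitted to noisy samples of the function $\sin(0.5\,\lVert\mathbf{x}\rVert)$ on the x-y plane.
The figure below shows the noisy samples and the posterior predictive mean before and after kernel parameter optimization.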
# Noise level
noise_2D = 0.1

rx, ry = np.arange(-5, 5, 0.3), np.arange(-5, 5, 0.3)
gx, gy = np.meshgrid(rx, ry)

X_2D = np.c_[gx.ravel(), gy.ravel()]

X_2D_train = np.random.uniform(-4, 4, (100, 2))
Y_2D_train = np.sin(0.5 * np.linalg.norm(X_2D_train, axis=1)) + \
             noise_2D * np.random.randn(len(X_2D_train))

plt.figure(figsize=(14,7))

mu_s, _ = posterior_predictive(X_2D, X_2D_train, Y_2D_train, sigma_y=noise_2D)
plot_gp_2D(gx, gy, mu_s, X_2D_train, Y_2D_train, 
           f'Before parameter optimization: l={1.00} sigma_f={1.00}', 1)

res = minimize(nll_fn(X_2D_train, Y_2D_train, noise_2D), [1, 1], 
               bounds=((1e-5, None), (1e-5, None)),
               method='L-BFGS-B')

mu_s, _ = posterior_predictive(X_2D, X_2D_train, Y_2D_train, *res.x, sigma_y=noise_2D)
plot_gp_2D(gx, gy, mu_s, X_2D_train, Y_2D_train,
           f'After parameter optimization: l={res.x[0]:.2f} sigma_f={res.x[1]:.2f}', 2)
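Implementing Gaussian processes with Scikit-learn

scikit-learn provides the GaussianProcessRegressor class for GP regression models. You can define your own kernel or use one of the built-in kernels, and kernels can also be combined. The squared exponential kernel is called RBF in scikit-learn. The RBF kernel in scikit-learn only has a length parameter (length_scale), which corresponds to $l$ above; to obtain a $\sigma_f$ parameter as well, the code below multiplies it by a ConstantKernel. The noise level is passed to GaussianProcessRegressor as alpha=noise**2.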
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rbf = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=rbf, alpha=noise**2)

# Fit to the 1D training samples (the kernel parameters are optimized during fitting)
gpr.fit(X_train, Y_train)

# Mean vector and covariance matrix of the posterior predictive distribution
mu_s, cov_s = gpr.predict(X, return_cov=True)

# Optimized kernel parameters (constant_value corresponds to sigma_f squared)
l = gpr.kernel_.k2.get_params()['length_scale']
sigma_f = np.sqrt(gpr.kernel_.k1.get_params()['constant_value'])

# Compare with the results of the manual implementation
assert(np.isclose(l_opt, l))
assert(np.isclose(sigma_f_opt, sigma_f))

# Plot the results
plot_gp(mu_s, cov_s, X, X_train=X_train, Y_train=Y_train)
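
If only per-point error bars are needed, the full covariance matrix is not required. The sketch below is a self-contained variant that asks `predict` for standard deviations with `return_std=True` and lets scikit-learn restart the hyperparameter optimization through `n_restarts_optimizer`. The random seed, the number of restarts (5) and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rng = np.random.default_rng(0)
noise = 0.4

# Noisy samples of sin(x), as in the 1D example above
X_train = np.arange(-3, 4, 1, dtype=float).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + noise * rng.standard_normal(len(X_train))
X = np.arange(-5, 5, 0.2).reshape(-1, 1)

gp_kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=gp_kernel, alpha=noise**2,
                               n_restarts_optimizer=5, random_state=0)
gpr.fit(X_train, y_train)

# Posterior mean and per-point standard deviation
mu, std = gpr.predict(X, return_std=True)

# 95% interval around the mean, suitable for drawing error bars
lower, upper = mu - 1.96 * std, mu + 1.96 * std

print(gpr.kernel_)                          # kernel with optimized parameters
print(gpr.log_marginal_likelihood_value_)   # log marginal likelihood at the optimum
```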
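Summary

As shown above, unlike many common machine learning models, a Gaussian process makes predictions by directly producing a posterior predictive distribution (which is again Gaussian).
This means that we obtain not just a "bare" predicted value but also information about the uncertainty of that prediction, which can be used to draw error bars and the like!
From a statistical point of view, predictions made with a Gaussian process model are therefore very valuable.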
