基於信號處理技術的機器學習

2021-01-11 智能航空發動機

本文講解一篇關於基於信號處理技術的機器學習的博客：《Machine Learning with Signal Processing Techniques》，極力推薦！！

連結：http://ataspinar.com/2018/04/04/machine-learning-with-signal-processing-techniques/

隨機信號分析是一個與隨機信號的處理、修正和分析相關的學科。

任何有物理或工程背景的人都一定程度上了解信號分析技術，這些技術是什麼，以及如何使用它們來分析、建模和分類信號。但是，來自不同領域的數據科學家，比如計算機科學或統計學，可能沒有意識到這些技術可以帶來的分析能力。

在這篇博客文章中，我們將介紹如何使用隨機信號分析技術，結合傳統的機器學習分類器，來對時間序列和信號進行精確地建模和分類。

你可能經常遇到描述數據集的術語，例如時間序列和信號，但是並不清楚它們之間究竟有什麼區別。

在時間序列數據集中，待預測值 y 是時間 t 的函數 y=y(t)，但是信號是一個更一般的概念：因變量 y 未必是時間的函數，也可以是空間坐標的函數 y=y(x, y)，或者距原點距離的函數 y=y(r)。信號具有多種多樣的形式，像聲音信號、圖片、雷達數據等。從本質上說，任何東西都可以叫做信號，只要它本身攜帶信息。

連續信號與離散信號的主要區別在於：連續信號有一個連續的自變量（例如，時間或者空間坐標）。如果你將信號放大後觀察，你會發現在每一個可以取到實數值的地方都存在連續信號的值。而離散信號只會在某些特定的時刻存在值，比如t=0.1s、t=0.2s、t=0.3s...，在t=0.13s就沒有值。

為了能在電腦上分析和可視化一個信號，通常需要將一個模擬連續信號進行數位化（以一定頻率進行採樣），將其變為離散信號。

選擇一個好的採樣率很重要：如果採樣率太低，離散信號會丟失很多原始信號的信息。採樣率需要滿足奈奎斯特採樣定律，即採樣率要大於原始信號中最大頻率的兩倍。

2、時域和頻域之間的變換（FFT、PSD、Wavelet）

傅立葉分析是一個分析信號周期性的研究領域。如果一個信號含有周期性成分，傅立葉分析可以用來將這個信號分解成周期性成分，並且能夠告訴我們每個周期性分量的頻率是多少。

傅立葉變換可以將複合信號分解成一系列的周期性成分（三角函數）。

在Python中，一個信號的快速傅立葉變換可以用Scipy庫計算。

from scipy.fftpack import fft
def get_fft_values(y_values, T, N, f_s):    f_values = np.linspace(0.0, 1.0/(2.0*T), N//2)    fft_values_ = fft(y_values)    fft_values = 2.0/N * np.abs(fft_values_[0:N//2])    return f_values, fft_values
t_n = 10N = 1000T = t_n / Nf_s = 1/T
x_value = np.linspace(0,t_n,N)amplitudes = [4, 6, 8, 10, 14]frequencies = [6.5, 5, 3, 1.5, 1]y_values = [amplitudes[ii]*np.sin(2*np.pi*frequencies[ii]*x_value) for ii in range(0,len(amplitudes))]composite_y_value = np.sum(y_values, axis=0)f_values, fft_values = get_fft_values(composite_y_value, T, N, f_s)
plt.plot(f_values, fft_values, linestyle='-', color='blue')plt.xlabel('Frequency [Hz]', fontsize=16)plt.ylabel('Amplitude', fontsize=16)plt.title("Frequency domain of the signal", fontsize=16)plt.show()
該複合信號是由頻率分別為6.5、5、3、1.5和1Hz以及相應振幅為4、6、8、10和14的sin函數組成的。
與傅立葉變換密切相關的一個概念是功率譜密度分析。和傅立葉變換類似，它也是描述一個信號頻譜的方法，並且考慮了每個頻率的功率分布。一般來說，PSD圖中峰值對應的頻率和FFT圖中峰值對應的頻率一樣，但是峰值的高度和寬度不一樣。PSD圖中峰值下面的面積對應著那個頻率的功率分布。
from scipy.signal import welch
def get_psd_values(y_values, T, N, f_s):    f_values, psd_values = welch(y_values, fs=f_s)    return f_values, psd_values

t_n = 10N = 1000T = t_n / Nf_s = 1/T
x_value = np.linspace(0,t_n,N)amplitudes = [4, 6, 8, 10, 14]frequencies = [6.5, 5, 3, 1.5, 1]y_values = [amplitudes[ii]*np.sin(2*np.pi*frequencies[ii]*x_value) for ii in range(0,len(amplitudes))]composite_y_value = np.sum(y_values, axis=0)f_values, psd_values = get_psd_values(composite_y_value, T, N, f_s)
plt.plot(f_values, psd_values, linestyle='-', color='blue')plt.xlabel('Frequency [Hz]')plt.ylabel('PSD [V**2 / Hz]')plt.show()
2.3 在Python中計算自相關係數
自相關函數計算信號與自身延時版本的相關性：如果一個信號的周期為t，那麼該信號和其自身延遲t秒的版本之間就會有很高的相關性。
Scipy庫中沒有現成的計算自相關係數的函數，但是我們可以藉助numpy庫中的correlate()函數實現。自相關係數是與時延t有關的函數值，因此t的取值不能超過原始信號的長度。
def autocorr(x):    result = np.correlate(x, x, mode='full')    return result[len(result)//2:]
def get_autocorr_values(y_values, T, N, f_s):    autocorr_values = autocorr(y_values)    x_values = np.array([T * jj for jj in range(0, N)])    return x_values, autocorr_values
t_n = 10N = 1000T = t_n / Nf_s = 1/T
x_value = np.linspace(0,t_n,N)amplitudes = [4, 6, 8, 10, 14]frequencies = [6.5, 5, 3, 1.5, 1]y_values = [amplitudes[ii]*np.sin(2*np.pi*frequencies[ii]*x_value) for ii in range(0,len(amplitudes))]composite_y_value = np.sum(y_values, axis=0)t_values, autocorr_values = get_autocorr_values(composite_y_value, T, N, f_s)
plt.plot(t_values, autocorr_values, linestyle='-', color='blue')plt.xlabel('time delay [s]')plt.ylabel('Autocorrelation amplitude')plt.show()
通過對自相關係數進行傅立葉變換，可以得到與FFT一樣的頻率成分。自相關係數與PSD也是一對傅立葉變換，也就是說通過對自相關係數進行傅立葉變換，可以得到PSD；對PSD做傅立葉反變換，可以得到自相關係數。
小波變換適用於動態頻譜的信號（信號的頻率隨著時間不斷變換）。
https://www.mathworks.com/help/wavelet/examples.html這個連結裡提供了matlab小波工具箱裡的一些實用範例，包含了如下內容：

3、統計參數估計和特徵提取
前文已經講述了利用FFT、PSD和自相關函數三種方法來計算信號的頻域特性。這三種方法將時域信號轉換到頻域上，並給出了頻譜圖。
將信號轉換到頻域上後，我們可以從轉換後的信號中提取特徵，並將其作為分類器的輸入。那麼，可以提取哪些特徵呢？一個比較好的想法是將這三種方法得到的圖中峰值的坐標作為特徵，如下圖所示。
關於數據集的描述：This dataset contains measurements done by 30 people between the ages of 19 to 48. The measurements are done with a smartphone placed on the waist while doing one of the following six activities:    walking,    walking upstairs,    walking downstairs,    sitting,    standing or    laying.    The measurements are done at a constant rate of 50 Hz. After filtering out the noise, the signals are cut in fixed-width windows of 2.56 sec with an overlap of 1.28 sec.Each signal will therefore have 50 x 2.56 = 128 samples in total.
The smartphone measures three-axial linear body acceleration, three-axial linear total acceleration and three-axial angular velocity. So per measurement, the total signal consists of nine components

劃分好訓練集和測試集後，訓練集的大小為（7352，128，9），測試集的大小為（2497，128，9）。也就是說，訓練集中有7352個信號，每個信號的9個分量均含有128個樣本點。選擇其中的一個信號，對其分別做FFT、PSD和自相關係數變換後的結果如下圖所示。
特徵提取流程：    transform a signal by means of the FFT, PSD or autocorrelation function.    locate the peaks in the transformation with the peak-finding function.
所提取的特徵個數決定於下面幾個因素：· The number of columns in each matrix depend on depends on your choice of features. · Each signal has nine components, and for each component you can calculate either just the FFT or all three of the transformations. · For each transformation you can decide to look at the first n peaks in the signal. · And for each peak you can decide to take only the x value, or both the x and y values. In the example above, we have taken the x and y values of the first 5 peaks of each transform, so we have 270 columns (9*3*5*2) in total.

4、利用傳統的scikit-learn庫中的分類器進行分類
使用scikit-learn庫中的隨機森林分類器的結果如下。
from sklearn.ensemble import RandomForestClassifierfrom sklearn.metrics import classification_report
clf = RandomForestClassifier(n_estimators=1000)clf.fit(X_train, Y_train)print("Accuracy on training set is : {}".format(clf.score(X_train, Y_train)))print("Accuracy on test set is : {}".format(clf.score(X_test, Y_test)))Y_test_pred = clf.predict(X_test)print(classification_report(Y_test, Y_test_pred))
---Accuracy on training set is : 1.0---Accuracy on test set is : 0.9097387173396675---          1       0.96      0.98      0.97       496---          2       0.94      0.95      0.95       471---          3       0.94      0.90      0.92       420---          4       0.84      0.82      0.83       491---          5       0.86      0.96      0.90       532---          6       0.94      0.85      0.89       537- avg / total      0.91      0.91      0.91      2947

但是，存在特徵選擇的問題：如果可以進一步地進行有效的特徵選擇，那麼精度應該會有所提升。
It is understandable that some of the 270 features will be more informative than other ones. It could be that some transformations of some components do not have five peaks, or that the frequency value of the peaks is more informative than the amplitude value, or that the FFT is always more informative than the auto-correlation.
The accuracy will increase even more if we actively select the features, transformations and components which are important for classification. Maybe we can even choose a different classifier or play around with its parameter values (hyperparameter optimization) to achieve a higher accuracy.

5、結語
隨機信號分析為我們提供了一套對時間序列和信號進行分析、建模和分類的有力工具（FFT、PSD和自相關係數）。