Outlier Detection in Python in Practice (with Code and Visualizations)

2021-12-09 Python Data Science

1 Introduction

What is outlier detection?

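Outlier detection is also known as anomaly detection, noise detection, deviation detection, or exception mining. There is no universally accepted definition. An early definition by Grubbs (1969) is: an outlying observation, or outlier, is one that appears to deviate markedly from the other members of the sample in which it occurs. A more recent definition by Barnett and Lewis (1994) is: an observation that appears to be inconsistent with the remainder of that set of data.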

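Causes

The most common causes of outliers include:

Natural occurrence (not an error, but novelty in the data that arises from its diversity)

Applications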

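Outlier detection has a wide range of applications, for example:

Activity monitoring: detecting mobile phone fraud by monitoring phone activity, or suspicious trades in the equity markets.
Network performance: monitoring the performance of computer networks, for example to detect network bottlenecks.
Fault diagnosis: detecting faults in, for example, motors, generators, pipelines, or space instruments on space shuttles.
Time-series monitoring: monitoring safety-critical applications such as drilling or high-speed milling.
Novelty detection in text: detecting the onset of news stories, for topic detection and tracking, or for traders to pinpoint equity, commodity, and foreign-exchange trading stories and outperforming or underperforming commodities.
Detecting unexpected entries in databases: for data mining, to detect errors, fraud, or valid but unexpected entries.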

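Categories of methods

There are three classes of outlier detection approaches:

1) Determine the outliers with no prior knowledge of the data. This is analogous to unsupervised clustering.

2) Model both normality and abnormality. This is analogous to supervised classification and requires labelled data.

3) Model only normality. This is called novelty detection and is analogous to semi-supervised recognition. It requires labelled data belonging to the normal class.

I will deal with the first approach, which is also the most common case: most datasets come with no labels for outliers.

Outlier detection algorithms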

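The algorithms used in this article are Isolation Forest, Extended Isolation Forest, Local Outlier Factor (LOF), DBSCAN, and One-Class SVM.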

2 Practice

Dataset

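Here we use the Pokemon[1] dataset and run outlier detection on two columns, ['HP', 'Speed']. The dataset has few observations, so the computations are fast. Only two columns were chosen so that the results can be visualized in two dimensions, but the approach carries over to more dimensions.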

The code

import numpy as np
import pandas as pd
from scipy import stats
import eif as iso
from sklearn import svm
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

import matplotlib.dates as md
from scipy.stats import norm
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid") #possible choices: white, dark, whitegrid, darkgrid, ticks

import matplotlib.pyplot as plt
plt.style.use('ggplot')
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
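# use fully qualified option names; short names such as 'max_columns' can be ambiguous in recent pandas
pd.set_option('display.float_format', '{:f}'.format)
pd.set_option('display.max_columns', 250)
pd.set_option('display.max_rows', 150)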

data = pd.read_csv('Pokemon.csv')

data.head().T

x1='HP'; x2='Speed'
X = data[[x1,x2]]
X.shape

(800, 2)

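Isolation Forest

I will use IsolationForest from the sklearn library. When defining the algorithm there is an important parameter called contamination: the proportion of observations that the algorithm expects to be outliers. We fit X (the two features HP and Speed) and call fit_predict on X. This yields plain labels (-1 for outliers, 1 for inliers). We can also use decision_function to get the score that Isolation Forest assigns to each sample.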

clf = IsolationForest(max_samples='auto', random_state = 1, contamination= 0.02)

preds = clf.fit_predict(X)

data['isoletionForest_outliers'] = preds
data['isoletionForest_outliers'] = data['isoletionForest_outliers'].astype(str)
data['isoletionForest_scores'] = clf.decision_function(X)

print(data['isoletionForest_outliers'].value_counts())
data[152:156]

1 785
-1 15
Name: isoletionForest_outliers, dtype: int64
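To make the relation between the labels and the scores concrete, here is a minimal sketch on synthetic data. The array `X_demo`, the seed, and the sample size are illustrative assumptions and not part of the Pokemon example. It relies on scikit-learn's documented behaviour that the predicted label is -1 exactly where `decision_function` is negative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (assumption): 500 points from a 2-D Gaussian.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

forest = IsolationForest(random_state=1, contamination=0.02)
labels = forest.fit_predict(X_demo)         # -1 = outlier, 1 = inlier
scores = forest.decision_function(X_demo)   # negative = outlier side of the cutoff

n_flagged = int((labels == -1).sum())
print("flagged:", n_flagged, "of", len(X_demo))
print("labels agree with score sign:", np.array_equal(labels == -1, scores < 0))
```

With contamination=0.02 the number of flagged points should be close to 2% of the sample, which mirrors the 15 of 800 observations flagged above.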

Let's plot the results.

fig = px.scatter(data, x=x1, y=x2, color='isoletionForest_outliers', hover_name='Name')
fig.update_layout(title='Isolation Forest Outlier Detection', title_x=0.5, yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

fig = px.scatter(data, x=x1, y=x2, color="isoletionForest_scores")
fig.update_layout(title='Isolation Forest Outlier Detection (scores)', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

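Visually, these 15 points lie outside the main body of the data, so flagging them as outliers seems reasonable.

Beyond plotting outliers and inliers, we can produce a more advanced visualization that shows the Isolation Forest's decision boundary.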

data['isoletionForest_outliers']=='1'

0 True
1 True
2 True
3 True
4 True
...
795 True
796 True
797 True
798 True
799 True
Name: isoletionForest_outliers, Length: 800, dtype: bool

X_inliers = data.loc[data['isoletionForest_outliers']=='1'][[x1,x2]]
X_outliers = data.loc[data['isoletionForest_outliers']=='-1'][[x1,x2]]

xx, yy = np.meshgrid(np.linspace(X.iloc[:, 0].min(), X.iloc[:, 0].max(), 50), np.linspace(X.iloc[:, 1].min(), X.iloc[:, 1].max(), 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

fig, ax = plt.subplots(figsize=(15, 7))
plt.title("Isolation Forest Outlier Detection with Outlier Areas", fontsize = 15, loc='center')
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

inl = plt.scatter(X_inliers.iloc[:, 0], X_inliers.iloc[:, 1], c='white', s=20, edgecolor='k')
outl = plt.scatter(X_outliers.iloc[:, 0], X_outliers.iloc[:, 1], c='red',s=20, edgecolor='k')

plt.axis('tight')
plt.xlim((X.iloc[:, 0].min(), X.iloc[:, 0].max()))
plt.ylim((X.iloc[:, 1].min(), X.iloc[:, 1].max()))
plt.legend([inl, outl],["normal observations", "abnormal observations"],loc="upper left");
# plt.show()

The darker the color, the more outlying the region. The following code shows the distribution of the scores.

fig, ax = plt.subplots(figsize=(20, 7))
ax.set_title('Distribution of Isolation Forest Scores', fontsize = 15, loc='center')
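# distplot is deprecated in seaborn >= 0.11; histplot with kde=True is the closest replacement
sns.histplot(data['isoletionForest_scores'], color='#3366ff', label='if', alpha=0.35, kde=True, stat='density', ax=ax);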

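The distribution matters: it helps us choose a suitable contamination value for our case. If we change the contamination value, isoletionForest_scores will change, but the shape of the distribution stays the same; the algorithm only moves the cutoff for outliers along the distribution.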
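The claim that only the cutoff moves can be checked directly. The sketch below, on synthetic data (`X_demo` and both contamination values are assumptions chosen for illustration), fits two forests with the same seed and compares them. It uses the scikit-learn attributes `score_samples` and `offset_`; `decision_function` is the raw score minus `offset_`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 2))   # synthetic stand-in data (assumption)

# Same data and seed, two different contamination values.
low = IsolationForest(random_state=1, contamination=0.02).fit(X_demo)
high = IsolationForest(random_state=1, contamination=0.10).fit(X_demo)

# The raw scores do not depend on contamination ...
same_raw = np.allclose(low.score_samples(X_demo), high.score_samples(X_demo))
# ... while decision_function is the raw score minus the fitted offset_.
shift = low.decision_function(X_demo) - high.decision_function(X_demo)

print("raw scores identical:", same_raw)
print("constant shift:", np.allclose(shift, high.offset_ - low.offset_))
print("offsets:", low.offset_, high.offset_)
```

Because every score is shifted by the same constant, a histogram of the scores keeps its shape and only slides along the x-axis.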

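Extended Isolation Forest

Isolation Forest has a drawback: its decision boundaries are either vertical or horizontal. Because the cuts can only be parallel to the axes, some regions end up crossed by many branch cuts while containing only a few observations or a single one, which gives some observations incorrect anomaly scores.

Install it with: pip install git+https://github.com/sahandha/eif.git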

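Extended Isolation Forest instead selects:

1) a random slope for the branch cut, and

2) a random intercept chosen from the range of values available in the training data. Slope and intercept are the same terms that define a linear regression line.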
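The two choices above can be sketched for a single branch cut as follows. This is an illustration in plain NumPy on synthetic data, not the eif library's implementation; the names `X_demo`, `normal`, and `intercept` are assumptions introduced here. An axis-parallel cut, as in the standard Isolation Forest, is shown for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))   # synthetic data (assumption)

# Random intercept: a point drawn uniformly from the bounding box of the data.
intercept = rng.uniform(X_demo.min(axis=0), X_demo.max(axis=0))

# Extended cut: a random normal vector gives the cut a random slope.
normal = rng.normal(size=X_demo.shape[1])
left_extended = (X_demo - intercept) @ normal <= 0

# Axis-parallel cut (standard Isolation Forest): the normal is a unit axis vector.
axis = int(rng.integers(X_demo.shape[1]))
axis_normal = np.zeros(X_demo.shape[1])
axis_normal[axis] = 1.0
left_axis = (X_demo - intercept) @ axis_normal <= 0

print("extended cut sends", int(left_extended.sum()), "points to the left branch")
print("axis-parallel cut sends", int(left_axis.sum()), "points to the left branch")
```

The axis-parallel cut is the special case in which the normal vector has a single non-zero component, so it reduces to thresholding one feature.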

X_data = X.values.astype('double')
F1 = iso.iForest(X_data, ntrees=100, sample_size=256, ExtensionLevel=X.shape[1]-1) # X needs to be a numpy array
# calculate anomaly scores
anomaly_scores = F1.compute_paths(X_in = X_data)
# negate so that lower scores mean more anomalous, as with the sklearn scores above
data['extendedIsoletionForest_scores'] = -anomaly_scores
# determine lowest 2% as outliers (compute the threshold once instead of once per row)
threshold = data['extendedIsoletionForest_scores'].quantile(0.02)
data['extendedIsoletionForest_outliers'] = data['extendedIsoletionForest_scores'].apply(lambda x: '-1' if x<=threshold else '1')
print(data['extendedIsoletionForest_outliers'].value_counts())

1 784
-1 16
Name: extendedIsoletionForest_outliers, dtype: int64

fig = px.scatter(data, x=x1, y=x2, color='extendedIsoletionForest_outliers', hover_name='Name')
fig.update_layout(title='Extended Isolation Forest Outlier Detection', title_x=0.5, yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

fig = px.scatter(data, x=x1, y=x2, color="extendedIsoletionForest_scores")
fig.update_layout(title='Extended Isolation Forest Outlier Detection (scores)', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

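Extended Isolation Forest does not return plain outlier/inlier labels (such as -1 and 1). We created them ourselves by treating the lowest-scoring 2% as outliers. The scores also differ from those of the basic Isolation Forest: all of them are negative.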

X_inliers = data.loc[data['extendedIsoletionForest_outliers']=='1'][[x1,x2]]
X_outliers = data.loc[data['extendedIsoletionForest_outliers']=='-1'][[x1,x2]]

xx, yy = np.meshgrid(np.linspace(X.iloc[:, 0].min(), X.iloc[:, 0].max(), 50), np.linspace(X.iloc[:, 1].min()-30, X.iloc[:, 1].max()+30, 50))

S1 = F1.compute_paths(X_in=np.c_[xx.ravel(), yy.ravel()])
S1 = S1.reshape(xx.shape)

fig, ax = plt.subplots(figsize=(15, 7))
plt.title("Extended Isolation Forest Outlier Detection with Outlier Areas", fontsize = 15, loc='center')
levels = np.linspace(np.min(S1),np.max(S1),50)
CS = ax.contourf(xx, yy, S1, levels, cmap=plt.cm.Blues)

inl = plt.scatter(X_inliers.iloc[:, 0], X_inliers.iloc[:, 1], c='white', s=20, edgecolor='k')
outl = plt.scatter(X_outliers.iloc[:, 0], X_outliers.iloc[:, 1], c='red',s=20, edgecolor='k')

plt.axis('tight')
plt.xlim((X.iloc[:, 0].min(), X.iloc[:, 0].max()))
plt.ylim((X.iloc[:, 1].min()-30, X.iloc[:, 1].max()+30))
plt.legend([inl, outl],["normal observations", "abnormal observations"],loc="upper left")
# plt.show()

fig, ax = plt.subplots(figsize=(20, 7))
ax.set_title('Distribution of Extended Isolation Scores', fontsize = 15, loc='center')
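# distplot is deprecated in seaborn >= 0.11; histplot with kde=True is the closest replacement
sns.histplot(data['extendedIsoletionForest_scores'], color='red', label='eif', alpha=0.5, kde=True, stat='density', ax=ax);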

3 Local Outlier Factor (LOF)

clf = LocalOutlierFactor(n_neighbors=11)
y_pred = clf.fit_predict(X)

data['localOutlierFactor_outliers'] = y_pred.astype(str)
print(data['localOutlierFactor_outliers'].value_counts())
data['localOutlierFactor_scores'] = clf.negative_outlier_factor_

1 779
-1 21
Name: localOutlierFactor_outliers, dtype: int64

The most important parameter is n_neighbors. The default is 20, which gives 45 outliers. I changed it to 11 to get fewer outliers, close to 2%.
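For reference, the number of outliers depends on n_neighbors because, with the default contamination='auto', scikit-learn labels a point as an outlier when its negative_outlier_factor_ falls below a fixed offset of -1.5. The sketch below shows this on synthetic data (`X_demo` is an assumption), together with the `contamination` parameter as an alternative way to target a fraction such as 2%. Whether that is preferable for the Pokemon data is not something this article tests.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))   # synthetic stand-in data (assumption)

# Default contamination='auto': fixed cutoff at offset_ = -1.5.
lof_auto = LocalOutlierFactor(n_neighbors=11)
labels_auto = lof_auto.fit_predict(X_demo)
below_cutoff = lof_auto.negative_outlier_factor_ < lof_auto.offset_

# Explicit contamination: the cutoff is set from a percentile of the scores.
lof_fixed = LocalOutlierFactor(n_neighbors=11, contamination=0.02)
labels_fixed = lof_fixed.fit_predict(X_demo)
n_fixed = int((labels_fixed == -1).sum())

print("auto offset:", lof_auto.offset_)
print("auto labels follow the cutoff:", np.array_equal(labels_auto == -1, below_cutoff))
print("outliers with contamination=0.02:", n_fixed, "of", len(X_demo))
```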

fig = px.scatter(data, x=x1, y=x2, color='localOutlierFactor_outliers', hover_name='Name')
fig.update_layout(title='Local Outlier Factor Outlier Detection', title_x=0.5, yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

fig = px.scatter(data, x=x1, y=x2, color="localOutlierFactor_scores", hover_name='Name')
fig.update_layout(title='Local Outlier Factor Outlier Detection', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

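We can create another interesting plot in which the more outlying a point is according to LOF, the larger the circle drawn around it.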

fig, ax = plt.subplots(figsize=(15, 7.5))
ax.set_title('Local Outlier Factor Scores Outlier Detection', fontsize = 15, loc='center')

plt.scatter(X.iloc[:, 0], X.iloc[:, 1], color='k', s=3., label='Data points')
radius = (data['localOutlierFactor_scores'].max() - data['localOutlierFactor_scores']) / (data['localOutlierFactor_scores'].max() - data['localOutlierFactor_scores'].min())
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], s=2000 * radius, edgecolors='r', facecolors='none', label='Outlier scores')
plt.axis('tight')
legend = plt.legend(loc='upper left')
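# Legend.legendHandles was renamed to legend_handles in matplotlib 3.7; support both names
handles = getattr(legend, 'legend_handles', None) or legend.legendHandles
handles[0]._sizes = [10]
handles[1]._sizes = [20]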
plt.show();

fig, ax = plt.subplots(figsize=(20, 7))
ax.set_title('Distribution of Local Outlier Factor Scores', fontsize = 15, loc='center')
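# distplot is deprecated in seaborn >= 0.11; histplot with kde=True is the closest replacement
sns.histplot(data['localOutlierFactor_scores'], color='red', label='lof', alpha=0.5, kde=True, stat='density', ax=ax);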

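This algorithm is quite different from the previous ones and finds outliers in a different way: rather than isolating points with random cuts, it compares the local density around each point with that of its neighbors.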

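DBSCAN

A classic clustering algorithm. Points that it cannot assign to any cluster are labelled -1 (noise), and those are the points we treat as outliers here.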

from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(eps = 20, metric='euclidean', min_samples = 5,n_jobs = -1)
clusters = outlier_detection.fit_predict(X)

data['dbscan_outliers'] = clusters
data['dbscan_outliers'] = data['dbscan_outliers'].apply(lambda x: str(1) if x>-1 else str(-1))
print(data['dbscan_outliers'].value_counts())

1 787
-1 13
Name: dbscan_outliers, dtype: int64

The most important parameter to tune is eps, the maximum distance at which two samples still count as neighbors.
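To get a feel for how eps drives the result, one can sweep a few values and count the noise points. The sketch below does this on synthetic data; `X_demo` and the grid `eps_values` are illustrative assumptions, not tuned for the Pokemon data. For a fixed min_samples, a larger eps can only shrink the set of noise points.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): a dense blob plus a few scattered points.
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 2)),
    rng.uniform(-8.0, 8.0, size=(15, 2)),
])

eps_values = [0.3, 0.5, 1.0, 2.0]   # illustrative grid, not tuned
noise_counts = []
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_demo)
    noise_counts.append(int((labels == -1).sum()))   # label -1 means noise
    print(f"eps={eps}: {noise_counts[-1]} noise points")
```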

fig = px.scatter(data, x=x1, y=x2, color="dbscan_outliers", hover_name='Name')
fig.update_layout(title='DBSCAN Outlier Detection', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

One-Class SVM

For more information on One-Class SVM, see:

Outlier Detection with One-Class SVMs[2]
One-Class Classification Algorithms for Imbalanced Datasets[3]

clf = svm.OneClassSVM(nu=0.08, kernel='rbf', gamma='auto')
outliers = clf.fit_predict(X)
data['ocsvm_outliers'] = outliers
data['ocsvm_outliers'] = data['ocsvm_outliers'].apply(lambda x: str(-1) if x==-1 else str(1))
data['ocsvm_scores'] = clf.score_samples(X)
print(data['ocsvm_outliers'].value_counts())

-1 481
1 319
Name: ocsvm_outliers, dtype: int64

fig = px.scatter(data, x=x1, y=x2, color="ocsvm_outliers", hover_name='Name')
fig.update_layout(title='One Class SVM Outlier Detection', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

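I could not find a better nu for this data; the parameter does not seem to help in this example. For other values of nu, the outliers outnumbered the normal observations by even more.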
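One possible explanation, which is an assumption and has not been verified on the Pokemon data, is feature scale: gamma='auto' sets gamma to 1 / n_features, which makes the RBF kernel very narrow when features range over tens or hundreds of units. The sketch below compares the raw fit with a standardized one on synthetic data (`X_demo` is hypothetical); it is not part of the original analysis.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): two features on a scale of tens to hundreds.
X_demo = rng.normal(loc=70.0, scale=25.0, size=(800, 2))

nu = 0.08
# Raw features, same settings as in the article.
raw = OneClassSVM(nu=nu, kernel='rbf', gamma='auto').fit_predict(X_demo)
# Standardize each feature to zero mean and unit variance before fitting.
scaled_model = make_pipeline(StandardScaler(), OneClassSVM(nu=nu, kernel='rbf', gamma='auto'))
scaled = scaled_model.fit_predict(X_demo)

frac_raw = float((raw == -1).mean())
frac_scaled = float((scaled == -1).mean())
print(f"outlier fraction, raw features:    {frac_raw:.3f}")
print(f"outlier fraction, scaled features: {frac_scaled:.3f}")
```

On standardized features the flagged fraction should land near nu, which is how the parameter is meant to behave.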

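Ensemble

Finally, let's combine the algorithms into a more robust one. I will simply add up the outlier columns, in which -1 marks an outlier and 1 a normal observation.

One-Class SVM is left out because it performs poorly in this example, so four of the five algorithms are combined.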

data['outliers_sum'] = data['isoletionForest_outliers'].astype(int)+data['extendedIsoletionForest_outliers'].astype(int)+data['localOutlierFactor_outliers'].astype(int)+data['dbscan_outliers'].astype(int)

data['outliers_sum'].value_counts()

3 774
1 11
-3 8
-1 7
Name: outliers_sum, dtype: int64

fig = px.scatter(data, x=x1, y=x2, color="outliers_sum", hover_name='Name')
fig.update_layout(title='Ensemble Outlier Detection', title_x=0.5,yaxis=dict(gridcolor = '#DFEAF4'), xaxis=dict(gridcolor = '#DFEAF4'), plot_bgcolor='white')
# fig.show()

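An observation with outliers_sum = 4 means that all 4 algorithms agree it is a normal observation, while full agreement on an outlier gives a sum of -4.

First, let's look at which observations are considered outliers by all of the algorithms; then we keep the observations with sum = 4 as normal and treat the rest as outliers.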

data.loc[data['outliers_sum']==-4]['Name']

121 Chansey
155 Snorlax
217 Wobbuffet
261 Blissey
313 Slaking
316 Shedinja
431 DeoxysSpeed Forme
495 Munchlax
Name: Name, dtype: object

data['outliers_sum'] = data['outliers_sum'].apply(lambda x: str(1) if x==4 else str(-1))
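The voting logic can be seen in isolation on a toy table. In the sketch below the detector columns and their values are made up for illustration. The strict rule matches the one used above (normal only when every detector agrees); the majority rule is a hypothetical alternative that this article does not use.

```python
import pandas as pd

# Toy votes from four hypothetical detectors: '1' = normal, '-1' = outlier.
votes = pd.DataFrame({
    'detector_a': ['1', '1', '-1', '-1'],
    'detector_b': ['1', '-1', '-1', '-1'],
    'detector_c': ['1', '1', '1', '-1'],
    'detector_d': ['1', '1', '-1', '-1'],
})

n_detectors = votes.shape[1]
vote_sum = votes.astype(int).sum(axis=1)

# Strict rule: normal only if all detectors vote 1 (sum equals the detector count).
strict = vote_sum.apply(lambda s: '1' if s == n_detectors else '-1')
# Majority rule (alternative): outlier if more detectors vote -1 than 1.
majority_outlier = vote_sum < 0

print(pd.DataFrame({'sum': vote_sum, 'strict': strict, 'majority_outlier': majority_outlier}))
```

The strict rule flags any observation that at least one detector dislikes, so it trades more false alarms for fewer missed outliers.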
