A note before we start: I have been away on a business trip and have not posted anything for over half a month! Keeping up a steady learning routine as a working professional is hard.
1 What is AttentionFM?
AttentionFM (Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks, AFM for short) was proposed in 2017 by Zhejiang University and the National University of Singapore. As the name suggests, it adds an attention mechanism to the FM model.
1.1 A quick recap of FM
FM (Factorization Machine) learns a latent vector for every feature and models each pair of crossed features through the inner product of their latent vectors.
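For reference, the standard second-order FM prediction (with $\mathbf{v}_i \in \mathbb{R}^k$ the latent vector of feature $i$) can be written as:

$$
\hat{y}_{FM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j
$$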
From the way FM works, we can see that it has the following two problems:
- When combining features, each feature uses one and the same latent vector against every other feature. FFM was later proposed to address this: features are grouped into fields, and a feature uses a separate latent vector for each field it interacts with, which nicely solves the single-latent-vector problem.
- All crossed features share the same weight w, i.e. every feature combination is treated as equally important. AFM is designed to solve this second problem.
In any given prediction, not every feature combination is actually useful. AFM introduces an attention mechanism that assigns a different weight to each feature combination, which strengthens the model's expressive power. It also makes the model more interpretable, so the important feature combinations can be singled out for further analysis.
1.2 The AFM model
In models such as FM, DeepFM, and NFM, after the embedding vectors of different fields are crossed, the resulting interaction vectors are simply summed along the embedding dimension. This treats all crossed features as "equal", ignores how much each of them contributes to the prediction, and in effect discards a lot of valuable information.
To address this, the attention mechanism was introduced into recommender systems. Attention here amounts to a weighted average, where the attention value is the weight describing how important each feature interaction is. For example, when predicting whether a male user will buy a mouse, the crossed feature "gender = male" x "purchase history contains a keyboard" matters more than "gender = male" x "age = 25", so the model puts more "attention" on the former.
The AFM prediction formula is:

$$
\hat{y}_{AFM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \mathbf{p}^{T} \sum_{i=1}^{n}\sum_{j=i+1}^{n} a_{ij}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j
$$

Here $\mathbf{v}_i, \mathbf{v}_j \in \mathbb{R}^{k}$ are the latent (embedding) vectors of the fields that features $i$ and $j$ belong to, and $\odot$ denotes the element-wise product of two embedding vectors:

$$
(\mathbf{v}_i \odot \mathbf{v}_j)_l = v_{il}\, v_{jl}
$$

$a_{ij}$ is the attention value, i.e. the attention score of the crossed feature. It is learned with a single-layer fully connected network:

$$
a'_{ij} = \mathbf{h}^{T} \mathrm{ReLU}\big(\mathbf{W}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j + \mathbf{b}\big), \qquad
a_{ij} = \frac{\exp(a'_{ij})}{\sum_{(i,j)} \exp(a'_{ij})}
$$

where $\mathbf{W} \in \mathbb{R}^{t \times k}$, $\mathbf{b} \in \mathbb{R}^{t}$, $\mathbf{h} \in \mathbb{R}^{t}$; $k$ is the dimension of the embedded vectors and $t$ is the hidden dimension of the attention network.
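A quick numerical sketch of the second-order term (standalone NumPy with made-up toy vectors, not part of the model code below): with the attention score set to 1 and $\mathbf{p}$ set to a vector of ones, the AFM term for a single pair falls back to FM's inner-product term.

```python
import numpy as np

k = 4                                  # embedding dimension (toy value)
v_i = np.array([0.1, -0.2, 0.3, 0.5])  # toy latent vectors
v_j = np.array([0.4, 0.1, -0.3, 0.2])

inter = v_i * v_j                      # element-wise product, shape (k,)

# AFM second-order contribution for this single pair (x_i = x_j = 1):
a_ij = 0.7                             # toy attention score
p = np.random.rand(k)                  # projection vector p
afm_term = p @ (a_ij * inter)

# With a_ij = 1 and p = ones, the term reduces to FM's inner product <v_i, v_j>:
fm_term = np.ones(k) @ (1.0 * inter)
assert np.isclose(fm_term, v_i @ v_j)
```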
At a high level, AFM only adds an attention mechanism on top of FM. In practice, though, because the second-order terms end up in a weighted sum, no deeper network is used to learn non-linear feature interactions, so its ceiling is close to FFM's and it does not fully exploit the strengths of DNNs.
2 Model structure
AFM inserts an attention network between the "pair-wise interaction layer" and the "output layer"; the attention network's job is to assign a weight to every crossed feature. The AFM model structure is shown in the figure below.
[Figure: AFM model structure]
Note that the figure omits the linear (LR) part of FM and only shows the structure of the second-order term. The first three parts in the figure, the sparse input (input layer), the embedding layer, and the pair-wise interaction layer, are the same as in FM. The last two parts are where AFM innovates. Viewed from a fairly high level, AFM uses an attention net to produce a weight for each feature interaction term, turning FM's plain sum over second-order terms into a weighted sum.
2.1 Sparse input (input layer) and embedding layer
Sample feature fields usually fall into the following cases:
- Binary categorical features, such as gender or whether the user bought a certain product; the input value is 0 or 1.
- Multi-valued categorical features, such as age group (child, young adult, senior); the input value is 0, 1, or 2.
- Continuous features, fed in directly. (Large continuous values should be normalized during feature engineering, or discretized into categorical values first.)
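To make this concrete, here is a small, hypothetical encoding sketch (column names and values are made up for illustration and are unrelated to the dataset used later):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw samples
df = pd.DataFrame({
    "gender": ["male", "female", "male"],        # binary categorical
    "age_group": ["child", "young", "senior"],   # multi-valued categorical
    "monthly_spend": [29.85, 56.95, 108.15],     # continuous
})

# Binary / multi-valued categoricals -> integer codes (0/1, 0/1/2, ...)
df["gender"] = df["gender"].map({"male": 0, "female": 1})
df["age_group"] = df["age_group"].map({"child": 0, "young": 1, "senior": 2})

# Continuous features -> standardized
df[["monthly_spend"]] = StandardScaler().fit_transform(df[["monthly_spend"]])
```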
As I have mentioned in earlier posts, an embedding is essentially one-hot encoding followed by a Dense layer. For simplicity, I therefore define a custom embedding layer here to embed the features.
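As a quick standalone illustration of the "one-hot + Dense" equivalence (a sketch with toy sizes, separate from the model code below): multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, which is exactly an embedding lookup.

```python
import numpy as np

n_categories, emb_dim = 5, 3
W = np.random.rand(n_categories, emb_dim)   # the "Dense" weight matrix / embedding table

idx = 2
one_hot = np.eye(n_categories)[idx]         # one-hot encoding of category 2

# one_hot @ W picks out row `idx` of W, i.e. the embedding of that category
assert np.allclose(one_hot @ W, W[idx])
```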
```python
class CateEmbedding(keras.layers.Layer):
    def __init__(self, emb_dim, **kwargs):
        self.emb_dim = emb_dim
        super(CateEmbedding, self).__init__(**kwargs)

    def build(self, input_shape):
        """
        :param input_shape: (batch_size, number of fields)
        :return:
        """
        self.kernel = self.add_weight(name='cate_em_vecs',
                                      shape=(input_shape[1], self.emb_dim),
                                      initializer='glorot_uniform',
                                      trainable=True)

    def call(self, x, **kwargs):
        # (batch_size, n_features) -> (batch_size, n_features, 1) -> (batch_size, n_features, emb_dim)
        x = K.expand_dims(x, axis=2)
        x = K.repeat_elements(x, rep=self.emb_dim, axis=2)
        # scale each field's embedding vector by the field's input value
        out = x * self.kernel
        return out

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], self.emb_dim)
```

The detailed steps are as follows:
- First define a kernel, i.e. the embedding matrix, of size [n_features, emb_dim].
- Expand the input by one dimension from [batch_size, n_features] to [batch_size, n_features, 1], then repeat it emb_dim times along the last axis, giving [batch_size, n_features, emb_dim].
- Multiply the two tensors element-wise to obtain the output of the embedding layer.

2.2 Pair-wise Interaction Layer (interaction layer)

This layer models the feature combinations: the original $m$ embedding vectors are turned, via the element-wise (Hadamard) product, into $m(m-1)/2$ interaction vectors, each with the same dimension emb_dim as the embeddings. Formally:

$$
f_{PI}(\mathcal{E}) = \{ (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j \}_{(i,j) \in \mathcal{R}_x}
$$

where $\mathcal{E}$ is the set of embedding vectors and $\mathcal{R}_x = \{(i,j)\}_{i<j}$ is the set of feature pairs.
In other words, the pair-wise interaction layer takes all embedding vectors as input and outputs another set of vectors: the element-wise product of every pair of embedding vectors. Each pair of embeddings yields one interacted vector, so $m$ embedding vectors produce $m(m-1)/2$ interaction vectors.
If we ignore the attention mechanism and go straight from the pair-wise interaction layer to the final output, it can be formalized as:

$$
\hat{y} = \mathbf{p}^{T} \sum_{i=1}^{n}\sum_{j=i+1}^{n} (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j + b
$$

where $\mathbf{p} \in \mathbb{R}^{k}$ is a weight vector and $b$ a bias. When $\mathbf{p}$ is all ones, this is, quite obviously, just plain FM.
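For reference, a straightforward (but slower) double-loop version of the pair-wise interaction computation might look like the sketch below (standalone NumPy, assuming an input of shape (batch_size, n_features, emb_dim)); the vectorized Keras layer that follows replaces exactly this kind of loop:

```python
import numpy as np
from itertools import combinations

def pairwise_interactions_loop(emb):
    """emb: (batch_size, n_features, emb_dim) -> (batch_size, n*(n-1)//2, emb_dim)"""
    n_features = emb.shape[1]
    pairs = [emb[:, i, :] * emb[:, j, :]          # element-wise product per feature pair
             for i, j in combinations(range(n_features), 2)]
    return np.stack(pairs, axis=1)

emb = np.random.rand(2, 4, 3)                     # toy batch: 2 samples, 4 fields, emb_dim = 3
print(pairwise_interactions_loop(emb).shape)      # (2, 6, 3)
```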
Unlike other write-ups, I did not implement the pair-wise interaction layer with two nested loops, for a simple reason: the double loop is both slow and inelegant, and I like to do things properly. Here is my implementation:
```python
class PairWiseInteraction(keras.layers.Layer):
    def __init__(self, **kwargs):
        super(PairWiseInteraction, self).__init__(**kwargs)

    def call(self, x):
        """
        pair-wise interaction layer
        :param x: (batch_size, n_features, emb_dim)
        :return:
        """
        # (batch_size, n_features, emb_dim) -> (batch_size, n_features, 1, emb_dim)
        x = K.expand_dims(tf.cast(x, tf.float32), axis=2)
        # (batch_size, n_features, 1, emb_dim) -> (batch_size, n_features, n_features, emb_dim)
        x = K.repeat_elements(x, rep=x.shape[1], axis=2)
        xt = tf.transpose(x, perm=[0, 2, 1, 3])
        out = x * xt
        # (1, emb_dim, n_features, n_features)
        mask = 1 - tf.linalg.band_part(tf.ones((1, out.shape[3], out.shape[1], out.shape[2])), -1, 0)
        # (1, n_features, n_features, emb_dim)
        mask = tf.transpose(mask, perm=[0, 2, 3, 1])
        out = out * mask
        return tf.reshape(out, shape=(-1, out.shape[1] * out.shape[2], out.shape[3]))

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1] * input_shape[1], input_shape[2])
```

I expand the input and transpose it, then multiply the transposed tensor element-wise with the pre-transpose one, which implements the pair-wise interaction layer purely with matrix operations. The product, however, is a symmetric tensor that contains all three cases $\mathbf{v}_i \odot \mathbf{v}_j$, $\mathbf{v}_j \odot \mathbf{v}_i$, and $\mathbf{v}_i \odot \mathbf{v}_i$, i.e. $n^2$ vectors in total (where $n$ is the number of features), while in practice we only need its upper (or lower) triangle. I therefore define a mask matrix that zeros out the duplicate interaction vectors, which does not affect the softmax that follows.
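Before walking through the steps, here is a quick sanity check of what the band-part mask keeps, for a toy case with 3 features (a standalone sketch with shapes reduced for readability):

```python
import tensorflow as tf

n = 3
# band_part(..., -1, 0) keeps the lower triangle including the diagonal;
# 1 - that keeps only the strictly upper triangle, i.e. each pair (i, j) with i < j exactly once.
mask = 1 - tf.linalg.band_part(tf.ones((n, n)), -1, 0)
print(mask.numpy())
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]
```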
The detailed steps are as follows:
- Expand the input by one dimension from [batch_size, n_features, emb_dim] to [batch_size, n_features, 1, emb_dim], then repeat it n_features times along axis 2, giving a tensor of size [batch_size, n_features, n_features, emb_dim], denoted x.
- Transpose the expanded tensor x to get xt, and multiply x and xt element-wise. The result, denoted out, has size [batch_size, n_features, n_features, emb_dim]; it holds the interaction vector of every pair of features, covering all three cases $\mathbf{v}_i \odot \mathbf{v}_j$, $\mathbf{v}_j \odot \mathbf{v}_i$, and $\mathbf{v}_i \odot \mathbf{v}_i$.
- Define an upper-triangular matrix of size [1, emb_dim, n_features, n_features] and transpose it so that its shape matches out; denote it mask.
- Multiply out and mask element-wise, then reshape to [batch_size, n_features * n_features, emb_dim]. This yields $n^2$ interaction vectors, of which only $n(n-1)/2$ are non-zero after masking.

2.3 Attention-based Pooling (attention layer)

In FM, after the feature vectors are crossed pair by pair, they go straight into sum pooling: the second-order interaction vectors are summed with equal weights. Intuitively, though, different crossed features should have different importance: unimportant interactions should be down-weighted and important ones up-weighted. This is exactly the idea behind attention, so the AFM authors introduced it into the model, giving every crossed feature an importance weight and using these weights for a weighted sum when pooling the second-order interaction vectors.
2.3.1 Attention score

To compute the importance weights (attention scores), the authors build an attention network, which is essentially an MLP with a single hidden layer:

$$
a'_{ij} = \mathbf{h}^{T} \mathrm{ReLU}\big(\mathbf{W}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j + \mathbf{b}\big), \qquad
a_{ij} = \frac{\exp(a'_{ij})}{\sum_{(i,j) \in \mathcal{R}_x} \exp(a'_{ij})}
$$

where $\mathbf{W} \in \mathbb{R}^{t \times k}$, $\mathbf{b} \in \mathbb{R}^{t}$, $\mathbf{h} \in \mathbb{R}^{t}$; $k$ is the dimension of the embedded vectors and $t$ is the hidden dimension of the attention network. The resulting $a_{ij}$ is the attention score, i.e. the importance weight of the corresponding second-order crossed feature.
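As a standalone sketch of this computation (toy shapes; `interactions` stands in for the masked pair-wise interaction vectors, and the layer objects here are only illustrative, not the model's own layers):

```python
import tensorflow as tf
from tensorflow import keras

emb_dim, attention_dim, n_pairs = 16, 16, 6
interactions = tf.random.normal((1, n_pairs, emb_dim))      # one sample, 6 interaction vectors

w_b = keras.layers.Dense(attention_dim, activation="relu")  # W and b, with ReLU
h = keras.layers.Dense(1, use_bias=False)                   # h^T projection to a scalar score

scores = h(w_b(interactions))                               # (1, n_pairs, 1) unnormalized scores
attention = tf.nn.softmax(scores, axis=1)                   # normalize over the interaction axis
print(attention.shape)                                       # (1, 6, 1); weights sum to 1
```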
As you can see, the attention network in this paper is really just a one-layer neural network:
- The input is the element-wise product of the embedding vectors (the interacted vectors output by the interaction layer, which encode the crossed features in the embedding space).
- The hidden layer has $t$ neurons, with weights $\mathbf{W}$ and bias $\mathbf{b}$, and uses the ReLU activation.
- The output layer has a single neuron with parameter $\mathbf{h}$ and no bias, producing the unnormalized attention score $a'_{ij}$.

Here is my implementation, without further commentary.
```python
self.attention_wb = keras.layers.Dense(attention_dim, use_bias=True, activation="relu", name="attention_wb")
self.attention_h = keras.layers.Dense(1, use_bias=False, name="attention_h")
self.attention_softmax = keras.layers.Lambda(lambda x: K.softmax(x, axis=1), name="attention_softmax")

# (batch_size, n_features * n_features, attention_size)
a_out = self.attention_wb(wise_product)
# (batch_size, n_features * n_features, 1)
a_out = self.attention_h(a_out)
# (batch_size, n_features * n_features, 1)
a_out = self.attention_softmax(a_out)
```

2.3.2 Sum pooling

This step performs a weighted sum pooling over the second-order crossed features, which can be expressed as:

$$
f_{att} = \mathbf{p}^{T} \sum_{(i,j) \in \mathcal{R}_x} a_{ij}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j
$$
This involves two main steps:
- Weighted sum: multiply the attention scores with the interacted vectors element-wise, then sum over axis 1, giving a vector of size [batch_size, emb_dim].
- Pooling: use a Dense layer to pool the vector from [batch_size, emb_dim] down to [batch_size, 1] (this plays the role of $\mathbf{p}^{T}$).

Here is my implementation, without further commentary.
```python
class AttentionPSum(keras.layers.Layer):
    def __init__(self, **kwargs):
        self.p_sum_layer = keras.layers.Dense(1, use_bias=False)
        super(AttentionPSum, self).__init__(**kwargs)

    def call(self, inp):
        """
        :param inp: attention_out: (batch_size, n_features * n_features, 1);
                    wise_product: (batch_size, n_features * n_features, emb_dim)
        :return:
        """
        attention_out, wise_product = inp
        # (batch_size, emb_dim)
        out = tf.reduce_sum(tf.multiply(attention_out, wise_product), axis=1)
        # (batch_size, 1)
        out = self.p_sum_layer(out)
        return out

    def compute_output_shape(self, input_shape):
        return (input_shape[1][0], 1)
```

2.4 Output layer

There is not much to say about this layer: it adds up the first-order term, the bias, and the output of the attention layer, and then applies a sigmoid activation.
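Putting the pieces together, the quantity fed to the sigmoid corresponds to the AFM prediction formula from section 1.2:

$$
\hat{y} = \sigma\Big(w_0 + \sum_{i=1}^{n} w_i x_i + \mathbf{p}^{T} \sum_{i=1}^{n}\sum_{j=i+1}^{n} a_{ij}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j\Big)
$$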
2.5 Overall model structure and code

The code is implemented on TensorFlow 2.x; it also runs if you use Keras directly.
afm.py
```python
import tensorflow as tf
from tensorflow import keras
import tensorflow.keras.backend as K


class CateEmbedding(keras.layers.Layer):
    def __init__(self, emb_dim, **kwargs):
        self.emb_dim = emb_dim
        super(CateEmbedding, self).__init__(**kwargs)

    def build(self, input_shape):
        """
        :param input_shape: (batch_size, number of fields)
        :return:
        """
        self.kernel = self.add_weight(name='cate_em_vecs',
                                      shape=(input_shape[1], self.emb_dim),
                                      initializer='glorot_uniform',
                                      trainable=True)

    def call(self, x, **kwargs):
        # (batch_size, n_features) -> (batch_size, n_features, emb_dim)
        x = K.expand_dims(x, axis=2)
        x = K.repeat_elements(x, rep=self.emb_dim, axis=2)
        out = x * self.kernel
        return out

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], self.emb_dim)


class PairWiseInteraction(keras.layers.Layer):
    def __init__(self, **kwargs):
        super(PairWiseInteraction, self).__init__(**kwargs)

    def call(self, x):
        """
        pair-wise interaction layer
        :param x: (batch_size, n_features, emb_dim)
        :return:
        """
        # (batch_size, n_features, emb_dim) -> (batch_size, n_features, 1, emb_dim)
        x = K.expand_dims(tf.cast(x, tf.float32), axis=2)
        # (batch_size, n_features, 1, emb_dim) -> (batch_size, n_features, n_features, emb_dim)
        x = K.repeat_elements(x, rep=x.shape[1], axis=2)
        xt = tf.transpose(x, perm=[0, 2, 1, 3])
        out = x * xt
        # (1, emb_dim, n_features, n_features)
        mask = 1 - tf.linalg.band_part(tf.ones((1, out.shape[3], out.shape[1], out.shape[2])), -1, 0)
        # (1, n_features, n_features, emb_dim)
        mask = tf.transpose(mask, perm=[0, 2, 3, 1])
        out = out * mask
        return tf.reshape(out, shape=(-1, out.shape[1] * out.shape[2], out.shape[3]))

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1] * input_shape[1], input_shape[2])


class AttentionPSum(keras.layers.Layer):
    def __init__(self, **kwargs):
        self.p_sum_layer = keras.layers.Dense(1, use_bias=False)
        super(AttentionPSum, self).__init__(**kwargs)

    def call(self, inp):
        """
        :param inp: attention_out: (batch_size, n_features * n_features, 1);
                    wise_product: (batch_size, n_features * n_features, emb_dim)
        :return:
        """
        attention_out, wise_product = inp
        # (batch_size, emb_dim)
        out = tf.reduce_sum(tf.multiply(attention_out, wise_product), axis=1)
        # (batch_size, 1)
        out = self.p_sum_layer(out)
        return out

    def compute_output_shape(self, input_shape):
        return (input_shape[1][0], 1)


class AttentionFM(object):
    def __init__(self, n_features, emb_dim=16, attention_dim=16):
        self.inp = keras.Input(shape=(n_features,), name="inp")
        self.embedding_layer = CateEmbedding(emb_dim, name="embedding_layer")
        self.lr_layer = keras.layers.Dense(1, use_bias=True, name="lr")
        self.wise_product_layer = PairWiseInteraction(name="wise_product_interaction")
        self.attention_wb = keras.layers.Dense(attention_dim, use_bias=True, activation="relu", name="attention_wb")
        self.attention_h = keras.layers.Dense(1, use_bias=False, name="attention_h")
        self.attention_softmax = keras.layers.Lambda(lambda x: K.softmax(x, axis=1), name="attention_softmax")
        self.attention_psum = AttentionPSum(name="psum")
        self.add_lr_ap = keras.layers.Add()
        self.sigmoid = keras.layers.Activation("sigmoid")

    def build(self):
        # (batch_size, n_features) -> (batch_size, 1)
        lr = self.lr_layer(self.inp)
        # (batch_size, n_features) -> (batch_size, n_features, emb_dim)
        x = self.embedding_layer(self.inp)
        # (batch_size, n_features * n_features, emb_dim)
        wise_product = self.wise_product_layer(x)
        # (batch_size, n_features * n_features, attention_size)
        a_out = self.attention_wb(wise_product)
        # (batch_size, n_features * n_features, 1)
        a_out = self.attention_h(a_out)
        # (batch_size, n_features * n_features, 1)
        a_out = self.attention_softmax(a_out)
        # (batch_size, n_features * n_features, emb_dim) -> (batch_size, emb_dim) -> (batch_size, 1)
        p_sum = self.attention_psum([a_out, wise_product])
        self.out = self.add_lr_ap([lr, p_sum])
        self.out = self.sigmoid(self.out)
        self.model = keras.Model(self.inp, self.out)
        self.model.compile(loss=keras.losses.binary_crossentropy,
                           optimizer="adam",
                           metrics=[keras.metrics.binary_accuracy, keras.metrics.Recall()])
```

The model structure diagram is shown below:
```python
keras.utils.plot_model(afmal.model, "afm.png", show_layer_names=True, show_shapes=True)
```

[Figure: AFM model structure]

3 Case study: predicting telecom customer churn with AFM

3.1 Importing the libraries

```python
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow import keras
import numpy as np
import pandas as pd

from afm import AttentionFM

# Let GPU memory grow on demand
gpus = tf.config.experimental.list_physical_devices(device_type="GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```

3.2 Preparing the input data

Read the data:
```python
data = pd.read_csv("./telecom-churn/vec_tel_churn.csv", header=0)
data.head()

#output (one row per line):
   Unnamed: 0  customerID  gender  SeniorCitizen  Partner  Dependents  tenure  PhoneService  MultipleLines  InternetService  OnlineSecurity  OnlineBackup  DeviceProtection  TechSupport  StreamingTV  StreamingMovies  Contract  PaperlessBilling  PaymentMethod  MonthlyCharges  TotalCharges  Churn
0  0  7590-VHVEG  0.0  0.0  1.0  0.0  1.0   0.0  2.0  1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  29.85    29.85  0.0
1  1  5575-GNVDE  1.0  0.0  0.0  0.0  34.0  1.0  0.0  1.0  1.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  1.0  56.95  1889.50  0.0
2  2  3668-QPYBK  1.0  0.0  0.0  0.0  2.0   1.0  0.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  1.0  53.85   108.15  1.0
3  3  7795-CFOCW  1.0  0.0  0.0  0.0  45.0  0.0  2.0  1.0  1.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0  2.0  42.30  1840.75  0.0
4  4  9237-HQITU  0.0  0.0  0.0  0.0  2.0   1.0  0.0  2.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  70.70   151.65  1.0
```

The first two columns are just indexes and can be ignored. customerID is the customer ID, the following 19 columns are user features, and the last column is the churn label (0/1).
```python
# Single-valued (binary) categorical features
single_discrete = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']
# Multi-valued categorical features
multi_discrete = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                  'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod']
# Continuous numerical features
continuous = ["tenure", "MonthlyCharges", "TotalCharges"]

# Standardize the continuous features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data[continuous] = scaler.fit_transform(data[continuous])

features = single_discrete + multi_discrete + continuous

# Train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data[features], data["Churn"],
                                                    test_size=.1, random_state=10, shuffle=True)

# Shuffle, batch, and convert to tensors the model can consume
train_dataset = tf.data.Dataset.from_tensor_slices((X_train.values, y_train.values))
train_dataset = train_dataset.shuffle(len(X_train)).batch(32)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test.values, y_test.values))
test_dataset = test_dataset.batch(32)
```

3.3 Building the model

```python
n_features, emb_dim, attention_dim = len(features), 16, 16
afmal = AttentionFM(n_features, emb_dim, attention_dim)
afmal.build()
afmal.model.summary()

#output:
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
inp (InputLayer)                [(None, 19)]         0
__________________________________________________________________________________________________
embedding_layer (CateEmbedding) (None, 19, 16)       304         inp[0][0]
__________________________________________________________________________________________________
wise_product_interaction (PairW (None, 361, 16)      0           embedding_layer[0][0]
__________________________________________________________________________________________________
attention_wb (Dense)            (None, 361, 16)      272         wise_product_interaction[0][0]
__________________________________________________________________________________________________
attention_h (Dense)             (None, 361, 1)       16          attention_wb[0][0]
__________________________________________________________________________________________________
attention_softmax (Lambda)      (None, 361, 1)       0           attention_h[0][0]
__________________________________________________________________________________________________
lr (Dense)                      (None, 1)            20          inp[0][0]
__________________________________________________________________________________________________
psum (AttentionPSum)            (None, 1)            16          attention_softmax[0][0]
                                                                 wise_product_interaction[0][0]
__________________________________________________________________________________________________
add (Add)                       (None, 1)            0           lr[0][0]
                                                                 psum[0][0]
__________________________________________________________________________________________________
activation (Activation)         (None, 1)            0           add[0][0]
==================================================================================================
Total params: 628
Trainable params: 628
Non-trainable params: 0
```

The model has only 628 trainable parameters in total, far fewer than other DNN-based models.
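The parameter count is easy to verify from the summary above:

- embedding_layer: 19 fields x 16 = 304
- lr: 19 weights + 1 bias = 20
- attention_wb: 16 x 16 + 16 = 272
- attention_h: 16
- psum (the $\mathbf{p}$ vector): 16

Total: 304 + 20 + 272 + 16 + 16 = 628.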
```python
afmal.model.fit(train_dataset, epochs=20)

#output:
Train for 199 steps
Epoch 1/20
199/199 [==============================] - 2s 11ms/step - loss: 0.5290 - binary_accuracy: 0.7166 - recall: 0.1871
Epoch 2/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4698 - binary_accuracy: 0.7520 - recall: 0.3333
Epoch 3/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4562 - binary_accuracy: 0.7687 - recall: 0.4497
Epoch 4/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4458 - binary_accuracy: 0.7764 - recall: 0.4947
Epoch 5/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4410 - binary_accuracy: 0.7782 - recall: 0.5105
Epoch 6/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4378 - binary_accuracy: 0.7840 - recall: 0.5222
Epoch 7/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4348 - binary_accuracy: 0.7886 - recall: 0.5398
Epoch 8/20
199/199 [==============================] - 1s 4ms/step - loss: 0.4323 - binary_accuracy: 0.7919 - recall: 0.5310
Epoch 9/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4258 - binary_accuracy: 0.7965 - recall: 0.5444
Epoch 10/20
199/199 [==============================] - 1s 4ms/step - loss: 0.4267 - binary_accuracy: 0.7969 - recall: 0.5450
Epoch 11/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4282 - binary_accuracy: 0.7982 - recall: 0.5579
Epoch 12/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4196 - binary_accuracy: 0.7974 - recall: 0.5532
Epoch 13/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4205 - binary_accuracy: 0.7979 - recall: 0.5538
Epoch 14/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4165 - binary_accuracy: 0.7987 - recall: 0.5456
Epoch 15/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4151 - binary_accuracy: 0.7999 - recall: 0.5433
Epoch 16/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4156 - binary_accuracy: 0.8017 - recall: 0.5363
Epoch 17/20
199/199 [==============================] - 1s 4ms/step - loss: 0.4145 - binary_accuracy: 0.8037 - recall: 0.5485
Epoch 18/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4177 - binary_accuracy: 0.8007 - recall: 0.5409
Epoch 19/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4141 - binary_accuracy: 0.8021 - recall: 0.5415
Epoch 20/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4139 - binary_accuracy: 0.8039 - recall: 0.5392
<tensorflow.python.keras.callbacks.History at 0x1a7c092eda0>
```

After 20 epochs, accuracy is around 80% and recall about 54%, comparable to other DNN+FM models; thanks to the small number of parameters, training time is much shorter. I will not go into hyperparameter tuning here.
3.4 Model evaluation

```python
loss, acc, recall = afmal.model.evaluate(test_dataset)

#output:
23/23 [==============================] - 0s 11ms/step - loss: 0.3909 - binary_accuracy: 0.7986 - recall: 0.4843
```

On the test set the accuracy is 79.86%, about the same as on the training set, and the recall is 48.43%.
4 Summary

Compared with other DNN models (such as Wide&Deep and Deep&Cross) that learn feature combinations implicitly through an MLP, AttentionFM is built directly on FM and models feature combinations explicitly through products of pairs of latent vectors, which makes it more interpretable.
By directly extending FM, AFM introduces an attention mechanism to learn a weight for each feature combination, which both preserves the model's interpretability and improves its performance. (A personal gripe: the Hadamard product used here does not have an obvious physical meaning; it might be worth trying self-attention instead.)
Another role of DNNs is to extract high-order feature combinations, whereas AttentionFM simply adds the weighted sum of the second-order terms to the first-order term as its output, without a deeper network to learn higher-order interactions. This is arguably a weakness of the AttentionFM model.