Computational Advertising CTR Prediction Series (10): AFM Model Theory and Practice

2021-02-21 機器學習薦貨情報局 (Machine Learning Recommendation Intelligence Bureau)

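The input of the Attention-based Pooling Layer is the output of the Pair-wise Interaction Layer. It consists of m(m-1)/2 vectors, each of dimension k (k is the embedding dimension and m is the number of embedding vectors in the Embedding Layer).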

The output of the Attention-based Pooling Layer is a single k-dimensional vector. It is obtained by weighted sum pooling over the interacted vectors, using the attention scores as weights.
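As a minimal sketch of this pooling step, the pure-Python snippet below builds the m(m-1)/2 interacted vectors and collapses them into one k-dimensional vector. The function names and the toy numbers (including the attention scores, which are simply made up here) are illustrative assumptions, not part of the original code.

```python
from itertools import combinations


def pairwise_interactions(embeddings):
    """Element-wise product for every pair (i < j) of embedding vectors."""
    return [
        [a * b for a, b in zip(embeddings[i], embeddings[j])]
        for i, j in combinations(range(len(embeddings)), 2)
    ]


def attention_pooling(interacted, scores):
    """Weighted sum of the interacted vectors; returns one k-dimensional vector."""
    k = len(interacted[0])
    return [sum(s * vec[d] for s, vec in zip(scores, interacted)) for d in range(k)]


# Toy example: m = 3 embedding vectors of dimension k = 2 (made-up numbers).
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
interacted = pairwise_interactions(embeddings)  # 3 * (3 - 1) / 2 = 3 vectors
scores = [0.5, 0.25, 0.25]                      # hypothetical attention scores
pooled = attention_pooling(interacted, scores)
print(len(interacted), pooled)
```

Whatever the number of pairs, the pooled result always has dimension k, which is why the later prediction layer only needs k weights.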

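How to learn the attention scores is a problem in itself. A natural idea is to learn them directly by minimizing the loss, but then the attention scores of feature combinations that never co-occur in the training set cannot be learned.

AFM therefore uses an Attention Network to learn them.

The attention network is essentially a one-layer MLP with ReLU activation. Its size is given by the attention factor, i.e. the number of hidden neurons.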

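The input of the attention network is the element-wise product of two embedding vectors (the interacted vector, which encodes the feature interaction in the embedding space); its output is the attention score of that feature interaction. Finally, the attention scores are normalized with softmax. The Attention Network is formalized as follows:

$$a'_{ij} = \mathbf{h}^{T}\,\mathrm{ReLU}\big(\mathbf{W}(\mathbf{v}_i \odot \mathbf{v}_j)\,x_i x_j + \mathbf{b}\big)$$

$$a_{ij} = \frac{\exp(a'_{ij})}{\sum_{(i,j)\in\mathcal{R}_x}\exp(a'_{ij})}$$

Here $\mathbf{W}\in\mathbb{R}^{t\times k}$, $\mathbf{b}\in\mathbb{R}^{t}$ and $\mathbf{h}\in\mathbb{R}^{t}$ are the parameters of the attention network, t is the attention factor, and $\mathcal{R}_x$ is the set of pairs of non-zero features in x.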
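To make the scoring step concrete, here is a small pure-Python sketch of a one-hidden-layer ReLU network followed by a softmax over all pairs. The function name `attention_scores` and all parameter values are assumptions for illustration; the interacted vectors are assumed to already include the feature values x_i x_j.

```python
import math


def attention_scores(interacted, W, b, h):
    """Score each interacted vector with a one-hidden-layer ReLU MLP, then softmax.

    W: t x k weight matrix, b: length-t bias, h: length-t projection vector,
    where t plays the role of the attention factor.
    """
    raw = []
    for vec in interacted:
        # Hidden layer: ReLU(W e + b), one value per hidden neuron.
        hidden = [
            max(0.0, sum(w * x for w, x in zip(row, vec)) + bias)
            for row, bias in zip(W, b)
        ]
        # Unnormalized score: h^T hidden.
        raw.append(sum(hv * hu for hv, hu in zip(h, hidden)))
    # Softmax over all pairs (shifted by the max for numerical stability).
    top = max(raw)
    exps = [math.exp(r - top) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


# Toy parameters: k = 2, attention factor t = 2 (all values are made up).
interacted = [[3.0, 8.0], [5.0, 12.0], [15.0, 24.0]]
W = [[0.1, 0.0], [0.0, -0.1]]
b = [0.0, 0.0]
h = [1.0, 1.0]
scores = attention_scores(interacted, W, b, h)
print(scores)
```

Because the score is a function of the interacted vector rather than a free parameter per pair, a pair that never co-occurred in training still receives a score at prediction time.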

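In summary, the overall AFM model is formalized as follows:

$$\hat{y}_{AFM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \mathbf{p}^{T}\sum_{i=1}^{n}\sum_{j=i+1}^{n} a_{ij}\,(\mathbf{v}_i \odot \mathbf{v}_j)\,x_i x_j$$

The first part is the linear part. The second part takes the element-wise product of every pair of embedding vectors to obtain the interacted vectors, uses the attention mechanism to compute an attention score for each feature interaction, and performs weighted sum pooling with these scores. Finally, the resulting k-dimensional vector is projected by the weight vector p to produce the final prediction.

3.2 Model Training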

AFM uses different loss functions for different tasks.

Regression: square loss.

Classification: log loss.

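The paper focuses on regression, so it uses the square loss, formalized as follows:

$$L = \sum_{\mathbf{x}\in\mathcal{T}}\big(\hat{y}_{AFM}(\mathbf{x}) - y(\mathbf{x})\big)^{2}$$

where $\mathcal{T}$ is the set of training instances.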

Model parameters are estimated with SGD.
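The two losses and a plain SGD update can be sketched in a few lines of pure Python. The function names, the clipping constant `eps`, and the choice to average the log loss are assumptions made for this illustration.

```python
import math


def square_loss(preds, labels):
    """Sum of squared errors, as used for regression."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels))


def log_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy for labels in {0, 1}."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def sgd_step(weights, grads, lr):
    """One plain SGD update: w <- w - lr * grad."""
    return [w - lr * g for w, g in zip(weights, grads)]


print(square_loss([2.5, 0.0], [3.0, 1.0]))
print(log_loss([0.5, 0.5], [1, 0]))
print(sgd_step([1.0, 2.0], [0.5, -1.0], lr=0.5))
```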

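3.3 Overfitting

Common ways to prevent overfitting are Dropout and L2 or L1 regularization. AFM does the following:

Applies Dropout to the output of the Pair-wise Interaction Layer

Applies L2 regularization in the Attention Network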

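The Attention Network is a one-layer MLP. Dropout is not applied to it because the authors found that using Dropout in both the interaction layer and the Attention Network makes training unstable and degrades performance.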
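For readers unfamiliar with Dropout, the sketch below shows the usual "inverted dropout" formulation applied to a list of interacted vectors. The function name, the `keep_prob` parameter and the toy input are assumptions for illustration only; they are not taken from the paper.

```python
import random


def dropout(vectors, keep_prob, rng):
    """Inverted dropout: zero each element with probability 1 - keep_prob and
    rescale the survivors by 1 / keep_prob so the expected value is unchanged."""
    return [
        [x / keep_prob if rng.random() < keep_prob else 0.0 for x in vec]
        for vec in vectors
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
interacted = [[3.0, 8.0], [5.0, 12.0], [15.0, 24.0]]
dropped = dropout(interacted, keep_prob=0.5, rng=rng)
print(dropped)
```

Dropout is only active during training; at prediction time the vectors are passed through unchanged.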

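So AFM's loss function becomes:

$$L = \sum_{\mathbf{x}\in\mathcal{T}}\big(\hat{y}_{AFM}(\mathbf{x}) - y(\mathbf{x})\big)^{2} + \lambda\lVert\mathbf{W}\rVert^{2}$$

where W is the weight matrix of the Attention Network and λ controls the strength of the L2 regularization.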
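A minimal sketch of this regularized objective, assuming the attention weight matrix is stored as a list of rows. The function name `afm_regularized_loss` and the toy values are hypothetical.

```python
def afm_regularized_loss(preds, labels, attention_W, l2_reg):
    """Square loss plus l2_reg * ||W||^2 over the attention network weights only."""
    data_loss = sum((p - y) ** 2 for p, y in zip(preds, labels))
    l2_penalty = sum(w * w for row in attention_W for w in row)
    return data_loss + l2_reg * l2_penalty


# Toy values (made up): two predictions and a 2 x 2 attention weight matrix.
loss = afm_regularized_loss(
    preds=[2.5, 0.0],
    labels=[3.0, 1.0],
    attention_W=[[1.0, 2.0], [0.0, -1.0]],
    l2_reg=0.5,
)
print(loss)
```

Note that only the attention weights are penalized in this formulation; the embeddings are regularized through Dropout instead.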

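4. Summary

AFM is an improvement built on top of FM. Other DNN models, such as Wide&Deep and DeepCross, learn feature interactions implicitly through an MLP. These deep methods lack interpretability, because the contribution of each individual feature interaction is unknown. By contrast, FM learns each feature interaction through the inner product of two latent vectors, which is much easier to interpret.

By directly extending FM, AFM introduces an attention mechanism to learn the weight of each feature interaction, which both preserves interpretability and improves performance. However, another role of DNNs is to extract higher-order feature interactions, and AFM still models only second-order interactions, which is arguably one of its limitations.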

5. Code Practice

For the complete code, data, and paper materials, please head over to GitHub, and don't forget to star the repo~

https://github.com/gutouyu/ML_CIA

The core network-construction code is as follows:

First set up the parameters and initialize the Embedding and Linear weight matrices:

import tensorflow as tf  # TensorFlow 1.x API (tf.get_variable, tf.contrib)
from tensorflow import contrib

# `params`, `features` and `mode` are assumed to come from the enclosing
# tf.estimator model_fn.
field_size = params['field_size']
feature_size = params['feature_size']
embedding_size = params['embedding_size']
l2_reg = params['l2_reg']
learning_rate = params['learning_rate']

dropout = params['dropout']
attention_factor = params['attention_factor']

# Trainable weights: global bias, linear weights, embedding matrix.
Global_Bias = tf.get_variable("bias", shape=[1], initializer=tf.constant_initializer(0.0))
Feat_Wgts = tf.get_variable("linear", shape=[feature_size], initializer=tf.glorot_normal_initializer())
Feat_Emb = tf.get_variable("emb", shape=[feature_size, embedding_size], initializer=tf.glorot_normal_initializer())

# Inputs: one feature id and one feature value per field.
feat_ids = features['feat_ids']
feat_vals = features['feat_vals']
feat_ids = tf.reshape(feat_ids, shape=[-1, field_size])
feat_vals = tf.reshape(feat_vals, shape=[-1, field_size])

The linear part of FM:

with tf.variable_scope("Linear-part"):
    feat_wgts = tf.nn.embedding_lookup(Feat_Wgts, feat_ids)  # [batch, field_size]
    y_linear = tf.reduce_sum(tf.multiply(feat_wgts, feat_vals), 1)  # [batch]

The Embedding Layer:

with tf.variable_scope("Embedding_Layer"):
    embeddings = tf.nn.embedding_lookup(Feat_Emb, feat_ids)  # [batch, field_size, k]
    feat_vals = tf.reshape(feat_vals, shape=[-1, field_size, 1])
    embeddings = tf.multiply(embeddings, feat_vals)  # scale each embedding by its feature value

The Pair-wise Interaction Layer takes the element-wise product of every pair of embedding vectors:

with tf.variable_scope("Pair-wise_Interaction_Layer"):
    # Integer division so that the count can be used directly as a shape.
    num_interactions = field_size * (field_size - 1) // 2
    element_wise_product_list = []
    for i in range(0, field_size):
        for j in range(i + 1, field_size):
            element_wise_product_list.append(tf.multiply(embeddings[:, i, :], embeddings[:, j, :]))
    element_wise_product_list = tf.stack(element_wise_product_list)  # [num_interactions, batch, k]
    element_wise_product_list = tf.transpose(element_wise_product_list, perm=[1,0,2])  # [batch, num_interactions, k]
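When inspecting attention scores later, it helps to know which slot along the interaction axis belongs to which pair of fields. The sketch below reproduces the (i, j) order of the nested loops above with `itertools.combinations`; `field_size = 4` and the `pair_index` lookup are assumptions chosen purely for illustration.

```python
from itertools import combinations

field_size = 4  # hypothetical number of fields

# Same (i, j) order as the nested loops: i < j, outer loop over i.
pairs = list(combinations(range(field_size), 2))
num_interactions = field_size * (field_size - 1) // 2

# Map a pair of field indices to its position along the interaction axis.
pair_index = {pair: idx for idx, pair in enumerate(pairs)}

print(pairs)
print(num_interactions, pair_index[(1, 3)])
```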

The Attention Network produces the attention scores:

with tf.variable_scope("Attention_Network"):
    # Flatten so that every interacted vector is scored independently.
    deep_inputs = tf.reshape(element_wise_product_list, shape=[-1, embedding_size])

    # Hidden layer with `attention_factor` neurons, ReLU activation, L2-regularized.
    deep_inputs = contrib.layers.fully_connected(inputs=deep_inputs, num_outputs=attention_factor, activation_fn=tf.nn.relu,weights_regularizer=contrib.layers.l2_regularizer(l2_reg), scope="attention_net_mlp")

    # One unnormalized score per interacted vector.
    aij = contrib.layers.fully_connected(inputs=deep_inputs, num_outputs=1, activation_fn=tf.identity, weights_regularizer=contrib.layers.l2_regularizer(l2_reg), scope="attention_net_out")

    # Softmax over the interaction axis: [batch, num_interactions, 1].
    aij = tf.reshape(aij, shape=[-1, int(num_interactions), 1])
    aij_softmax = tf.nn.softmax(aij, dim=1, name="attention_net_softout")

    # Note: unlike the scheme described in section 3.3, this code also applies
    # dropout to the normalized attention scores during training.
    if mode == tf.estimator.ModeKeys.TRAIN:
        aij_softmax = tf.nn.dropout(aij_softmax, keep_prob=dropout[0])

Once the attention scores are available, they are combined with the interacted vectors by weighted sum pooling; this is the Attention-based Pooling Layer:

with tf.variable_scope("Attention-based_Pooling_Layer"):
    deep_inputs = tf.multiply(element_wise_product_list, aij_softmax)  # weight each interacted vector
    deep_inputs = tf.reduce_sum(deep_inputs, axis=1)  # [batch, k]

    if mode == tf.estimator.ModeKeys.TRAIN:
        deep_inputs = tf.nn.dropout(deep_inputs, keep_prob=dropout[1])

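Prediction Layer: finally, the k-dimensional output of the Attention-based Pooling Layer is mapped to the final prediction. This layer can be viewed as a fully connected layer with a single neuron. Note that this neuron produces a logit-like value, not yet a probability. This value is added to the FM global bias and the FM linear part to give the final logit, which is then passed through a sigmoid to obtain the predicted probability:

with tf.variable_scope("Prediction_Layer"):
    # Single-neuron projection of the pooled vector (the role of p in the formula).
    deep_inputs = contrib.layers.fully_connected(inputs=deep_inputs, num_outputs=1, activation_fn=tf.identity, weights_regularizer=contrib.layers.l2_regularizer(l2_reg), scope="afm_out")
    y_deep = tf.reshape(deep_inputs, shape=[-1])

with tf.variable_scope("AFM_overall"):
    y_bias = Global_Bias * tf.ones_like(y_deep, dtype=tf.float32)
    y = y_bias + y_linear + y_deep  # final logit
    pred = tf.nn.sigmoid(y)  # predicted click probability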
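To check one's understanding of how the layers fit together, the following pure-Python toy runs the whole forward pass for a single sample without TensorFlow. It is only a sketch: the function names `afm_predict` and `sigmoid`, the argument layout, and all numeric values are assumptions, and it omits Dropout and regularization.

```python
import math
from itertools import combinations


def afm_predict(x, w0, w, V, W, b, h, p):
    """Toy AFM forward pass for one sample; returns the logit.

    x: feature values, w0: global bias, w: linear weights, V: embeddings (n x k),
    W, b, h: attention network parameters, p: length-k projection vector.
    """
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))

    # Pair-wise Interaction Layer: (v_i * v_j) x_i x_j for non-zero features, i < j.
    active = [i for i, xi in enumerate(x) if xi != 0]
    interacted = [
        [vi * vj * x[i] * x[j] for vi, vj in zip(V[i], V[j])]
        for i, j in combinations(active, 2)
    ]
    if not interacted:
        return linear

    # Attention Network: h^T ReLU(W e + b), softmax-normalized over all pairs.
    raw = []
    for vec in interacted:
        hidden = [max(0.0, sum(wr * e for wr, e in zip(row, vec)) + bu)
                  for row, bu in zip(W, b)]
        raw.append(sum(hu * hd for hu, hd in zip(h, hidden)))
    top = max(raw)
    exps = [math.exp(r - top) for r in raw]
    total = sum(exps)
    scores = [e / total for e in exps]

    # Attention-based Pooling Layer followed by the projection p.
    pooled = [sum(a * vec[d] for a, vec in zip(scores, interacted))
              for d in range(len(p))]
    return linear + sum(pd * v for pd, v in zip(p, pooled))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy parameters (made up): n = 3 features, k = 2, attention factor t = 2.
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
zero_W = [[0.0, 0.0], [0.0, 0.0]]
y = afm_predict([1.0, 1.0, 1.0], 1.0, [0.5, 0.5, 0.0], V,
                zero_W, [0.0, 0.0], [1.0, 1.0], [1.0, 1.0])
print(y, sigmoid(y))
```

With the attention weights set to zero as in this toy call, every pair gets the same score, so the interaction term is just a uniform average of the pairwise inner products, i.e. a scaled version of FM's second-order term.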

Screenshots of the run results:

[Screenshots: Train, Evaluate, Predict]

Reference

Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks

https://github.com/lambdaji/tf_repos/blob/master/deep_ctr/Model_pipeline/AFM.py

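Previous posts in the Computational Advertising CTR Prediction series

Computational Advertising CTR Prediction Series (1): DeepFM Theory
Computational Advertising CTR Prediction Series (2): DeepFM Practice
Computational Advertising CTR Prediction Series (3): FFM Theory and Practice
Computational Advertising CTR Prediction Series (4): Wide&Deep Theory and Practice
Computational Advertising CTR Prediction Series (5): Alibaba Deep Interest Network Theory
Computational Advertising CTR Prediction Series (6): Alibaba Mixed Logistic Regression
Computational Advertising CTR Prediction Series (7): Facebook's Classic LR+GBDT Model, Theory and Practice
Computational Advertising CTR Prediction Series (8): PNN Model Theory and Practice
Computational Advertising CTR Prediction Series (9): NFM Model Theory and Practice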
