A note before we start: I have been away on a business trip and have not posted anything for over half a month! Keeping up a steady learning routine as a working professional is hard.
1 What is AttentionFM?
AttentionFM (Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks, AFM for short) was proposed in 2017 by Zhejiang University and the National University of Singapore. As the name suggests, it adds an attention mechanism to the FM model.
1.1 A quick recap of FM
FM (Factorization Machine) learns a latent vector for every feature and models each pair of crossed features through the inner product of their latent vectors.
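For reference, the standard second-order FM prediction (with $\mathbf{v}_i \in \mathbb{R}^k$ the latent vector of feature $i$) can be written as:

$$
\hat{y}_{FM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j
$$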
From the way FM works, we can see that it has the following two problems:
- When combining features, each feature uses one and the same latent vector against every other feature. FFM was later proposed to address this: features are grouped into fields, and a feature uses a separate latent vector for each field it interacts with, which nicely solves the single-latent-vector problem.
- All crossed features share the same weight w, i.e. every feature combination is treated as equally important. AFM is designed to solve this second problem.
In any given prediction, not every feature combination is actually useful. AFM introduces an attention mechanism that assigns a different weight to each feature combination, which strengthens the model's expressive power. It also makes the model more interpretable, so the important feature combinations can be singled out for further analysis.
1.2 The AFM model
In models such as FM, DeepFM, and NFM, after the embedding vectors of different fields are crossed, the resulting interaction vectors are simply summed along the embedding dimension. This treats all crossed features as "equal", ignores how much each of them contributes to the prediction, and in effect discards a lot of valuable information.
To address this, the attention mechanism was introduced into recommender systems. Attention here amounts to a weighted average, where the attention value is the weight describing how important each feature interaction is. For example, when predicting whether a male user will buy a mouse, the crossed feature "gender = male" x "purchase history contains a keyboard" matters more than "gender = male" x "age = 25", so the model puts more "attention" on the former.
The AFM prediction formula is:

$$
\hat{y}_{AFM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \mathbf{p}^{T} \sum_{i=1}^{n}\sum_{j=i+1}^{n} a_{ij}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j
$$

Here $\mathbf{v}_i, \mathbf{v}_j \in \mathbb{R}^{k}$ are the latent (embedding) vectors of the fields that features $i$ and $j$ belong to, and $\odot$ denotes the element-wise product of two embedding vectors:

$$
(\mathbf{v}_i \odot \mathbf{v}_j)_l = v_{il}\, v_{jl}
$$

$a_{ij}$ is the attention value, i.e. the attention score of the crossed feature. It is learned with a single-layer fully connected network:

$$
a'_{ij} = \mathbf{h}^{T} \mathrm{ReLU}\big(\mathbf{W}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j + \mathbf{b}\big), \qquad
a_{ij} = \frac{\exp(a'_{ij})}{\sum_{(i,j)} \exp(a'_{ij})}
$$

where $\mathbf{W} \in \mathbb{R}^{t \times k}$, $\mathbf{b} \in \mathbb{R}^{t}$, $\mathbf{h} \in \mathbb{R}^{t}$; $k$ is the dimension of the embedded vectors and $t$ is the hidden dimension of the attention network.
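A quick numerical sketch of the second-order term (standalone NumPy with made-up toy vectors, not part of the model code below): with the attention score set to 1 and $\mathbf{p}$ set to a vector of ones, the AFM term for a single pair falls back to FM's inner-product term.

```python
import numpy as np

k = 4                                  # embedding dimension (toy value)
v_i = np.array([0.1, -0.2, 0.3, 0.5])  # toy latent vectors
v_j = np.array([0.4, 0.1, -0.3, 0.2])

inter = v_i * v_j                      # element-wise product, shape (k,)

# AFM second-order contribution for this single pair (x_i = x_j = 1):
a_ij = 0.7                             # toy attention score
p = np.random.rand(k)                  # projection vector p
afm_term = p @ (a_ij * inter)

# With a_ij = 1 and p = ones, the term reduces to FM's inner product <v_i, v_j>:
fm_term = np.ones(k) @ (1.0 * inter)
assert np.isclose(fm_term, v_i @ v_j)
```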
At a high level, AFM only adds an attention mechanism on top of FM. In practice, though, because the second-order terms end up in a weighted sum, no deeper network is used to learn non-linear feature interactions, so its ceiling is close to FFM's and it does not fully exploit the strengths of DNNs.
2 Model structure
AFM inserts an attention network between the "pair-wise interaction layer" and the "output layer"; the attention network's job is to assign a weight to every crossed feature. The AFM model structure is shown in the figure below.
[Figure: AFM model structure]
Note that the figure omits the linear (LR) part of FM and only shows the structure of the second-order term. The first three parts in the figure, the sparse input (input layer), the embedding layer, and the pair-wise interaction layer, are the same as in FM. The last two parts are where AFM innovates. Viewed from a fairly high level, AFM uses an attention net to produce a weight for each feature interaction term, turning FM's plain sum over second-order terms into a weighted sum.
2.1 Sparse input (input layer) and embedding layer
Sample feature fields usually fall into the following cases:
- Binary categorical features, such as gender or whether the user bought a certain product; the input value is 0 or 1.
- Multi-valued categorical features, such as age group (child, young adult, senior); the input value is 0, 1, or 2.
- Continuous features, fed in directly. (Large continuous values should be normalized during feature engineering, or discretized into categorical values first.)
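To make this concrete, here is a small, hypothetical encoding sketch (column names and values are made up for illustration and are unrelated to the dataset used later):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw samples
df = pd.DataFrame({
    "gender": ["male", "female", "male"],        # binary categorical
    "age_group": ["child", "young", "senior"],   # multi-valued categorical
    "monthly_spend": [29.85, 56.95, 108.15],     # continuous
})

# Binary / multi-valued categoricals -> integer codes (0/1, 0/1/2, ...)
df["gender"] = df["gender"].map({"male": 0, "female": 1})
df["age_group"] = df["age_group"].map({"child": 0, "young": 1, "senior": 2})

# Continuous features -> standardized
df[["monthly_spend"]] = StandardScaler().fit_transform(df[["monthly_spend"]])
```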
As I have mentioned in earlier posts, an embedding is essentially one-hot encoding followed by a Dense layer. For simplicity, I therefore define a custom embedding layer here to embed the features.
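As a quick standalone illustration of the "one-hot + Dense" equivalence (a sketch with toy sizes, separate from the model code below): multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, which is exactly an embedding lookup.

```python
import numpy as np

n_categories, emb_dim = 5, 3
W = np.random.rand(n_categories, emb_dim)   # the "Dense" weight matrix / embedding table

idx = 2
one_hot = np.eye(n_categories)[idx]         # one-hot encoding of category 2

# one_hot @ W picks out row `idx` of W, i.e. the embedding of that category
assert np.allclose(one_hot @ W, W[idx])
```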
```python
class CateEmbedding(keras.layers.Layer):
    def __init__(self, emb_dim, **kwargs):
        self.emb_dim = emb_dim
        super(CateEmbedding, self).__init__(**kwargs)

    def build(self, input_shape):
        """
        :param input_shape: (batch_size, number of fields)
        :return:
        """
        self.kernel = self.add_weight(name='cate_em_vecs',
                                      shape=(input_shape[1], self.emb_dim),
                                      initializer='glorot_uniform',
                                      trainable=True)

    def call(self, x, **kwargs):
        # (batch_size, n_features) -> (batch_size, n_features, 1) -> (batch_size, n_features, emb_dim)
        x = K.expand_dims(x, axis=2)
        x = K.repeat_elements(x, rep=self.emb_dim, axis=2)
        # scale each field's embedding vector by the field's input value
        out = x * self.kernel
        return out

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], self.emb_dim)
```

The detailed steps are as follows:
- First define a kernel, i.e. the embedding matrix, of size [n_features, emb_dim].
- Expand the input by one dimension from [batch_size, n_features] to [batch_size, n_features, 1], then repeat it emb_dim times along the last axis, giving [batch_size, n_features, emb_dim].
- Multiply the two tensors element-wise to obtain the output of the embedding layer.

2.2 Pair-wise Interaction Layer (interaction layer)

This layer models the feature combinations: the original $m$ embedding vectors are turned, via the element-wise (Hadamard) product, into $m(m-1)/2$ interaction vectors, each with the same dimension emb_dim as the embeddings. Formally:

$$
f_{PI}(\mathcal{E}) = \{ (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j \}_{(i,j) \in \mathcal{R}_x}
$$

where $\mathcal{E}$ is the set of embedding vectors and $\mathcal{R}_x = \{(i,j)\}_{i<j}$ is the set of feature pairs.
In other words, the pair-wise interaction layer takes all embedding vectors as input and outputs another set of vectors: the element-wise product of every pair of embedding vectors. Each pair of embeddings yields one interacted vector, so $m$ embedding vectors produce $m(m-1)/2$ interaction vectors.
If we ignore the attention mechanism and go straight from the pair-wise interaction layer to the final output, it can be formalized as:

$$
\hat{y} = \mathbf{p}^{T} \sum_{i=1}^{n}\sum_{j=i+1}^{n} (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j + b
$$

where $\mathbf{p} \in \mathbb{R}^{k}$ is a weight vector and $b$ a bias. When $\mathbf{p}$ is all ones, this is, quite obviously, just plain FM.
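For reference, a straightforward (but slower) double-loop version of the pair-wise interaction computation might look like the sketch below (standalone NumPy, assuming an input of shape (batch_size, n_features, emb_dim)); the vectorized Keras layer that follows replaces exactly this kind of loop:

```python
import numpy as np
from itertools import combinations

def pairwise_interactions_loop(emb):
    """emb: (batch_size, n_features, emb_dim) -> (batch_size, n*(n-1)//2, emb_dim)"""
    n_features = emb.shape[1]
    pairs = [emb[:, i, :] * emb[:, j, :]          # element-wise product per feature pair
             for i, j in combinations(range(n_features), 2)]
    return np.stack(pairs, axis=1)

emb = np.random.rand(2, 4, 3)                     # toy batch: 2 samples, 4 fields, emb_dim = 3
print(pairwise_interactions_loop(emb).shape)      # (2, 6, 3)
```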
Unlike other write-ups, I did not implement the pair-wise interaction layer with two nested loops, for a simple reason: the double loop is both slow and inelegant, and I like to do things properly. Here is my implementation:
```python
class PairWiseInteraction(keras.layers.Layer):
    def __init__(self, **kwargs):
        super(PairWiseInteraction, self).__init__(**kwargs)

    def call(self, x):
        """
        pair-wise interaction layer
        :param x: (batch_size, n_features, emb_dim)
        :return:
        """
        # (batch_size, n_features, emb_dim) -> (batch_size, n_features, 1, emb_dim)
        x = K.expand_dims(tf.cast(x, tf.float32), axis=2)
        # (batch_size, n_features, 1, emb_dim) -> (batch_size, n_features, n_features, emb_dim)
        x = K.repeat_elements(x, rep=x.shape[1], axis=2)
        xt = tf.transpose(x, perm=[0, 2, 1, 3])
        out = x * xt
        # (1, emb_dim, n_features, n_features)
        mask = 1 - tf.linalg.band_part(tf.ones((1, out.shape[3], out.shape[1], out.shape[2])), -1, 0)
        # (1, n_features, n_features, emb_dim)
        mask = tf.transpose(mask, perm=[0, 2, 3, 1])
        out = out * mask
        return tf.reshape(out, shape=(-1, out.shape[1] * out.shape[2], out.shape[3]))

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1] * input_shape[1], input_shape[2])
```

I expand the input and transpose it, then multiply the transposed tensor element-wise with the pre-transpose one, which implements the pair-wise interaction layer purely with matrix operations. The product, however, is a symmetric tensor that contains all three cases $\mathbf{v}_i \odot \mathbf{v}_j$, $\mathbf{v}_j \odot \mathbf{v}_i$, and $\mathbf{v}_i \odot \mathbf{v}_i$, i.e. $n^2$ vectors in total (where $n$ is the number of features), while in practice we only need its upper (or lower) triangle. I therefore define a mask matrix that zeros out the duplicate interaction vectors, which does not affect the softmax that follows.
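Before walking through the steps, here is a quick sanity check of what the band-part mask keeps, for a toy case with 3 features (a standalone sketch with shapes reduced for readability):

```python
import tensorflow as tf

n = 3
# band_part(..., -1, 0) keeps the lower triangle including the diagonal;
# 1 - that keeps only the strictly upper triangle, i.e. each pair (i, j) with i < j exactly once.
mask = 1 - tf.linalg.band_part(tf.ones((n, n)), -1, 0)
print(mask.numpy())
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]
```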
The detailed steps are as follows:
- Expand the input by one dimension from [batch_size, n_features, emb_dim] to [batch_size, n_features, 1, emb_dim], then repeat it n_features times along axis 2, giving a tensor of size [batch_size, n_features, n_features, emb_dim], denoted x.
- Transpose the expanded tensor x to get xt, and multiply x and xt element-wise. The result, denoted out, has size [batch_size, n_features, n_features, emb_dim]; it holds the interaction vector of every pair of features, covering all three cases $\mathbf{v}_i \odot \mathbf{v}_j$, $\mathbf{v}_j \odot \mathbf{v}_i$, and $\mathbf{v}_i \odot \mathbf{v}_i$.
- Define an upper-triangular matrix of size [1, emb_dim, n_features, n_features] and transpose it so that its shape matches out; denote it mask.
- Multiply out and mask element-wise, then reshape to [batch_size, n_features * n_features, emb_dim]. This yields $n^2$ interaction vectors, of which only $n(n-1)/2$ are non-zero after masking.

2.3 Attention-based Pooling (attention layer)

In FM, after the feature vectors are crossed pair by pair, they go straight into sum pooling: the second-order interaction vectors are summed with equal weights. Intuitively, though, different crossed features should have different importance: unimportant interactions should be down-weighted and important ones up-weighted. This is exactly the idea behind attention, so the AFM authors introduced it into the model, giving every crossed feature an importance weight and using these weights for a weighted sum when pooling the second-order interaction vectors.
2.3.1 Attention score

To compute the importance weights (attention scores), the authors build an attention network, which is essentially an MLP with a single hidden layer:

$$
a'_{ij} = \mathbf{h}^{T} \mathrm{ReLU}\big(\mathbf{W}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j + \mathbf{b}\big), \qquad
a_{ij} = \frac{\exp(a'_{ij})}{\sum_{(i,j) \in \mathcal{R}_x} \exp(a'_{ij})}
$$

where $\mathbf{W} \in \mathbb{R}^{t \times k}$, $\mathbf{b} \in \mathbb{R}^{t}$, $\mathbf{h} \in \mathbb{R}^{t}$; $k$ is the dimension of the embedded vectors and $t$ is the hidden dimension of the attention network. The resulting $a_{ij}$ is the attention score, i.e. the importance weight of the corresponding second-order crossed feature.
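As a standalone sketch of this computation (toy shapes; `interactions` stands in for the masked pair-wise interaction vectors, and the layer objects here are only illustrative, not the model's own layers):

```python
import tensorflow as tf
from tensorflow import keras

emb_dim, attention_dim, n_pairs = 16, 16, 6
interactions = tf.random.normal((1, n_pairs, emb_dim))      # one sample, 6 interaction vectors

w_b = keras.layers.Dense(attention_dim, activation="relu")  # W and b, with ReLU
h = keras.layers.Dense(1, use_bias=False)                   # h^T projection to a scalar score

scores = h(w_b(interactions))                               # (1, n_pairs, 1) unnormalized scores
attention = tf.nn.softmax(scores, axis=1)                   # normalize over the interaction axis
print(attention.shape)                                       # (1, 6, 1); weights sum to 1
```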
As you can see, the attention network in this paper is really just a one-layer neural network:
- The input is the element-wise product of the embedding vectors (the interacted vectors output by the interaction layer, which encode the crossed features in the embedding space).
- The hidden layer has $t$ neurons, with weights $\mathbf{W}$ and bias $\mathbf{b}$, and uses the ReLU activation.
- The output layer has a single neuron with parameter $\mathbf{h}$ and no bias, producing the unnormalized attention score $a'_{ij}$.

Here is my implementation, without further commentary.
```python
self.attention_wb = keras.layers.Dense(attention_dim, use_bias=True, activation="relu", name="attention_wb")
self.attention_h = keras.layers.Dense(1, use_bias=False, name="attention_h")
self.attention_softmax = keras.layers.Lambda(lambda x: K.softmax(x, axis=1), name="attention_softmax")

# (batch_size, n_features * n_features, attention_size)
a_out = self.attention_wb(wise_product)
# (batch_size, n_features * n_features, 1)
a_out = self.attention_h(a_out)
# (batch_size, n_features * n_features, 1)
a_out = self.attention_softmax(a_out)
```

2.3.2 Sum pooling

This step performs a weighted sum pooling over the second-order crossed features, which can be expressed as:

$$
f_{att} = \mathbf{p}^{T} \sum_{(i,j) \in \mathcal{R}_x} a_{ij}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j
$$
This involves two main steps:
- Weighted sum: multiply the attention scores with the interacted vectors element-wise, then sum over axis 1, giving a vector of size [batch_size, emb_dim].
- Pooling: use a Dense layer to pool the vector from [batch_size, emb_dim] down to [batch_size, 1] (this plays the role of $\mathbf{p}^{T}$).

Here is my implementation, without further commentary.
```python
class AttentionPSum(keras.layers.Layer):
    def __init__(self, **kwargs):
        self.p_sum_layer = keras.layers.Dense(1, use_bias=False)
        super(AttentionPSum, self).__init__(**kwargs)

    def call(self, inp):
        """
        :param inp: attention_out: (batch_size, n_features * n_features, 1);
                    wise_product: (batch_size, n_features * n_features, emb_dim)
        :return:
        """
        attention_out, wise_product = inp
        # (batch_size, emb_dim)
        out = tf.reduce_sum(tf.multiply(attention_out, wise_product), axis=1)
        # (batch_size, 1)
        out = self.p_sum_layer(out)
        return out

    def compute_output_shape(self, input_shape):
        return (input_shape[1][0], 1)
```

2.4 Output layer

There is not much to say about this layer: it adds up the first-order term, the bias, and the output of the attention layer, and then applies a sigmoid activation.
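Putting the pieces together, the quantity fed to the sigmoid corresponds to the AFM prediction formula from section 1.2:

$$
\hat{y} = \sigma\Big(w_0 + \sum_{i=1}^{n} w_i x_i + \mathbf{p}^{T} \sum_{i=1}^{n}\sum_{j=i+1}^{n} a_{ij}\,(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j\Big)
$$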
2.5 Overall model structure and code

The code is implemented on TensorFlow 2.x; it also runs if you use Keras directly.
afm.py
```python
import tensorflow as tf
from tensorflow import keras
import tensorflow.keras.backend as K


class CateEmbedding(keras.layers.Layer):
    def __init__(self, emb_dim, **kwargs):
        self.emb_dim = emb_dim
        super(CateEmbedding, self).__init__(**kwargs)

    def build(self, input_shape):
        """
        :param input_shape: (batch_size, number of fields)
        :return:
        """
        self.kernel = self.add_weight(name='cate_em_vecs',
                                      shape=(input_shape[1], self.emb_dim),
                                      initializer='glorot_uniform',
                                      trainable=True)

    def call(self, x, **kwargs):
        # (batch_size, n_features) -> (batch_size, n_features, emb_dim)
        x = K.expand_dims(x, axis=2)
        x = K.repeat_elements(x, rep=self.emb_dim, axis=2)
        out = x * self.kernel
        return out

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], self.emb_dim)


class PairWiseInteraction(keras.layers.Layer):
    def __init__(self, **kwargs):
        super(PairWiseInteraction, self).__init__(**kwargs)

    def call(self, x):
        """
        pair-wise interaction layer
        :param x: (batch_size, n_features, emb_dim)
        :return:
        """
        # (batch_size, n_features, emb_dim) -> (batch_size, n_features, 1, emb_dim)
        x = K.expand_dims(tf.cast(x, tf.float32), axis=2)
        # (batch_size, n_features, 1, emb_dim) -> (batch_size, n_features, n_features, emb_dim)
        x = K.repeat_elements(x, rep=x.shape[1], axis=2)
        xt = tf.transpose(x, perm=[0, 2, 1, 3])
        out = x * xt
        # (1, emb_dim, n_features, n_features)
        mask = 1 - tf.linalg.band_part(tf.ones((1, out.shape[3], out.shape[1], out.shape[2])), -1, 0)
        # (1, n_features, n_features, emb_dim)
        mask = tf.transpose(mask, perm=[0, 2, 3, 1])
        out = out * mask
        return tf.reshape(out, shape=(-1, out.shape[1] * out.shape[2], out.shape[3]))

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1] * input_shape[1], input_shape[2])


class AttentionPSum(keras.layers.Layer):
    def __init__(self, **kwargs):
        self.p_sum_layer = keras.layers.Dense(1, use_bias=False)
        super(AttentionPSum, self).__init__(**kwargs)

    def call(self, inp):
        """
        :param inp: attention_out: (batch_size, n_features * n_features, 1);
                    wise_product: (batch_size, n_features * n_features, emb_dim)
        :return:
        """
        attention_out, wise_product = inp
        # (batch_size, emb_dim)
        out = tf.reduce_sum(tf.multiply(attention_out, wise_product), axis=1)
        # (batch_size, 1)
        out = self.p_sum_layer(out)
        return out

    def compute_output_shape(self, input_shape):
        return (input_shape[1][0], 1)


class AttentionFM(object):
    def __init__(self, n_features, emb_dim=16, attention_dim=16):
        self.inp = keras.Input(shape=(n_features,), name="inp")
        self.embedding_layer = CateEmbedding(emb_dim, name="embedding_layer")
        self.lr_layer = keras.layers.Dense(1, use_bias=True, name="lr")
        self.wise_product_layer = PairWiseInteraction(name="wise_product_interaction")
        self.attention_wb = keras.layers.Dense(attention_dim, use_bias=True, activation="relu", name="attention_wb")
        self.attention_h = keras.layers.Dense(1, use_bias=False, name="attention_h")
        self.attention_softmax = keras.layers.Lambda(lambda x: K.softmax(x, axis=1), name="attention_softmax")
        self.attention_psum = AttentionPSum(name="psum")
        self.add_lr_ap = keras.layers.Add()
        self.sigmoid = keras.layers.Activation("sigmoid")

    def build(self):
        # (batch_size, n_features) -> (batch_size, 1)
        lr = self.lr_layer(self.inp)
        # (batch_size, n_features) -> (batch_size, n_features, emb_dim)
        x = self.embedding_layer(self.inp)
        # (batch_size, n_features * n_features, emb_dim)
        wise_product = self.wise_product_layer(x)
        # (batch_size, n_features * n_features, attention_size)
        a_out = self.attention_wb(wise_product)
        # (batch_size, n_features * n_features, 1)
        a_out = self.attention_h(a_out)
        # (batch_size, n_features * n_features, 1)
        a_out = self.attention_softmax(a_out)
        # (batch_size, n_features * n_features, emb_dim) -> (batch_size, emb_dim) -> (batch_size, 1)
        p_sum = self.attention_psum([a_out, wise_product])
        self.out = self.add_lr_ap([lr, p_sum])
        self.out = self.sigmoid(self.out)
        self.model = keras.Model(self.inp, self.out)
        self.model.compile(loss=keras.losses.binary_crossentropy,
                           optimizer="adam",
                           metrics=[keras.metrics.binary_accuracy, keras.metrics.Recall()])
```

The model structure diagram is shown below:
```python
keras.utils.plot_model(afmal.model, "afm.png", show_layer_names=True, show_shapes=True)
```

[Figure: AFM model structure]

3 Case study: predicting telecom customer churn with AFM

3.1 Importing the libraries

```python
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow import keras
import numpy as np
import pandas as pd

from afm import AttentionFM

# Let GPU memory grow on demand
gpus = tf.config.experimental.list_physical_devices(device_type="GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```

3.2 Preparing the input data

Read the data:
```python
data = pd.read_csv("./telecom-churn/vec_tel_churn.csv", header=0)
data.head()

#output (one row per line):
   Unnamed: 0  customerID  gender  SeniorCitizen  Partner  Dependents  tenure  PhoneService  MultipleLines  InternetService  OnlineSecurity  OnlineBackup  DeviceProtection  TechSupport  StreamingTV  StreamingMovies  Contract  PaperlessBilling  PaymentMethod  MonthlyCharges  TotalCharges  Churn
0  0  7590-VHVEG  0.0  0.0  1.0  0.0  1.0   0.0  2.0  1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  29.85    29.85  0.0
1  1  5575-GNVDE  1.0  0.0  0.0  0.0  34.0  1.0  0.0  1.0  1.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  1.0  56.95  1889.50  0.0
2  2  3668-QPYBK  1.0  0.0  0.0  0.0  2.0   1.0  0.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  1.0  53.85   108.15  1.0
3  3  7795-CFOCW  1.0  0.0  0.0  0.0  45.0  0.0  2.0  1.0  1.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0  2.0  42.30  1840.75  0.0
4  4  9237-HQITU  0.0  0.0  0.0  0.0  2.0   1.0  0.0  2.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  70.70   151.65  1.0
```

The first two columns are just indexes and can be ignored. customerID is the customer ID, the following 19 columns are user features, and the last column is the churn label (0/1).
```python
# Single-valued (binary) categorical features
single_discrete = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']
# Multi-valued categorical features
multi_discrete = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                  'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod']
# Continuous numerical features
continuous = ["tenure", "MonthlyCharges", "TotalCharges"]

# Standardize the continuous features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data[continuous] = scaler.fit_transform(data[continuous])

features = single_discrete + multi_discrete + continuous

# Train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data[features], data["Churn"],
                                                    test_size=.1, random_state=10, shuffle=True)

# Shuffle, batch, and convert to tensors the model can consume
train_dataset = tf.data.Dataset.from_tensor_slices((X_train.values, y_train.values))
train_dataset = train_dataset.shuffle(len(X_train)).batch(32)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test.values, y_test.values))
test_dataset = test_dataset.batch(32)
```

3.3 Building the model

```python
n_features, emb_dim, attention_dim = len(features), 16, 16
afmal = AttentionFM(n_features, emb_dim, attention_dim)
afmal.build()
afmal.model.summary()

#output:
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
inp (InputLayer)                [(None, 19)]         0
__________________________________________________________________________________________________
embedding_layer (CateEmbedding) (None, 19, 16)       304         inp[0][0]
__________________________________________________________________________________________________
wise_product_interaction (PairW (None, 361, 16)      0           embedding_layer[0][0]
__________________________________________________________________________________________________
attention_wb (Dense)            (None, 361, 16)      272         wise_product_interaction[0][0]
__________________________________________________________________________________________________
attention_h (Dense)             (None, 361, 1)       16          attention_wb[0][0]
__________________________________________________________________________________________________
attention_softmax (Lambda)      (None, 361, 1)       0           attention_h[0][0]
__________________________________________________________________________________________________
lr (Dense)                      (None, 1)            20          inp[0][0]
__________________________________________________________________________________________________
psum (AttentionPSum)            (None, 1)            16          attention_softmax[0][0]
                                                                 wise_product_interaction[0][0]
__________________________________________________________________________________________________
add (Add)                       (None, 1)            0           lr[0][0]
                                                                 psum[0][0]
__________________________________________________________________________________________________
activation (Activation)         (None, 1)            0           add[0][0]
==================================================================================================
Total params: 628
Trainable params: 628
Non-trainable params: 0
```

The model has only 628 trainable parameters in total, far fewer than other DNN-based models.
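The parameter count is easy to verify from the summary above:

- embedding_layer: 19 fields x 16 = 304
- lr: 19 weights + 1 bias = 20
- attention_wb: 16 x 16 + 16 = 272
- attention_h: 16
- psum (the $\mathbf{p}$ vector): 16

Total: 304 + 20 + 272 + 16 + 16 = 628.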
```python
afmal.model.fit(train_dataset, epochs=20)

#output:
Train for 199 steps
Epoch 1/20
199/199 [==============================] - 2s 11ms/step - loss: 0.5290 - binary_accuracy: 0.7166 - recall: 0.1871
Epoch 2/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4698 - binary_accuracy: 0.7520 - recall: 0.3333
Epoch 3/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4562 - binary_accuracy: 0.7687 - recall: 0.4497
Epoch 4/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4458 - binary_accuracy: 0.7764 - recall: 0.4947
Epoch 5/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4410 - binary_accuracy: 0.7782 - recall: 0.5105
Epoch 6/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4378 - binary_accuracy: 0.7840 - recall: 0.5222
Epoch 7/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4348 - binary_accuracy: 0.7886 - recall: 0.5398
Epoch 8/20
199/199 [==============================] - 1s 4ms/step - loss: 0.4323 - binary_accuracy: 0.7919 - recall: 0.5310
Epoch 9/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4258 - binary_accuracy: 0.7965 - recall: 0.5444
Epoch 10/20
199/199 [==============================] - 1s 4ms/step - loss: 0.4267 - binary_accuracy: 0.7969 - recall: 0.5450
Epoch 11/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4282 - binary_accuracy: 0.7982 - recall: 0.5579
Epoch 12/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4196 - binary_accuracy: 0.7974 - recall: 0.5532
Epoch 13/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4205 - binary_accuracy: 0.7979 - recall: 0.5538
Epoch 14/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4165 - binary_accuracy: 0.7987 - recall: 0.5456
Epoch 15/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4151 - binary_accuracy: 0.7999 - recall: 0.5433
Epoch 16/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4156 - binary_accuracy: 0.8017 - recall: 0.5363
Epoch 17/20
199/199 [==============================] - 1s 4ms/step - loss: 0.4145 - binary_accuracy: 0.8037 - recall: 0.5485
Epoch 18/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4177 - binary_accuracy: 0.8007 - recall: 0.5409
Epoch 19/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4141 - binary_accuracy: 0.8021 - recall: 0.5415
Epoch 20/20
199/199 [==============================] - 1s 5ms/step - loss: 0.4139 - binary_accuracy: 0.8039 - recall: 0.5392
<tensorflow.python.keras.callbacks.History at 0x1a7c092eda0>
```

After 20 epochs, accuracy is around 80% and recall about 54%, comparable to other DNN+FM models; thanks to the small number of parameters, training time is much shorter. I will not go into hyperparameter tuning here.
3.4 Model evaluation

```python
loss, acc, recall = afmal.model.evaluate(test_dataset)

#output:
23/23 [==============================] - 0s 11ms/step - loss: 0.3909 - binary_accuracy: 0.7986 - recall: 0.4843
```

On the test set the accuracy is 79.86%, about the same as on the training set, and the recall is 48.43%.
4 Summary

Compared with other DNN models (such as Wide&Deep and Deep&Cross) that learn feature combinations implicitly through an MLP, AttentionFM is built directly on FM and models feature combinations explicitly through products of pairs of latent vectors, which makes it more interpretable.
By directly extending FM, AFM introduces an attention mechanism to learn a weight for each feature combination, which both preserves the model's interpretability and improves its performance. (A personal gripe: the Hadamard product used here does not have an obvious physical meaning; it might be worth trying self-attention instead.)
Another role of DNNs is to extract high-order feature combinations, whereas AttentionFM simply adds the weighted sum of the second-order terms to the first-order term as its output, without a deeper network to learn higher-order interactions. This is arguably a weakness of the AttentionFM model.