SDM模型——建模用户长短期兴趣的Match模型

本文主要是介绍SDM模型——建模用户长短期兴趣的Match模型，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

1. 引言

SDM模型(Sequential Deep Matching Model)是阿里团队在2019年CIKM的一篇paper。模型属于序列召回模型，研究的是如何通过用户的历史行为序列去学习到用户的丰富兴趣。

SDM模型把用户的历史序列根据交互的时间分成了短期和长期两类，然后从短期会话和长期行为中分别采取相应的措施(短期的RNN+多头注意力机制，长期的Att Net) 学习用户的短期兴趣和长期行为偏好，并巧妙的设计了一个门控网络将长短期兴趣进行融合，得到用户的最终兴趣向量。

创新点：长期偏好的行为表示，多头注意力机制学习多兴趣，长短期兴趣的融合机制等

目录如下：

背景与动机
SDM的网络结构与细节剖析
SDM模型的简易复现

2. 背景与动机

召回，是从海量商品中得到一个小的候选集，然后通过排序模型做精确的筛选。因此召回模块对于候选对象的筛选中起着至关重要的作用。

淘宝目前的召回模型基于协同过滤，通过用户与商品的历史交互建模，得到用户的物品的表示向量，但这个过程是静态的，而用户的行为或者兴趣是时刻变化的，协同过滤并不能很好的捕捉到用户整个行为序列的动态变化。所以作者认为，序列顺序信息应当加到召回模块。

阿里在精排模型方面的进化从DIN->DIEN->DSIN，解决的问题和本文类似，只不过召回使用了SDM模型进行检索。

学习长序列行为过程中，用户的兴趣可能发生转移，因此需要以Session会话为单位，先把长序列进行切分。同时，与DSIN模型分成多个会话之后，直接进行Transformer不同，SDM模型把最近的一次会话，和之前的会话分别视为了用户短期行为和长期行为分别进行了建模，并采用不同的措施学习用户的短期兴趣和长期兴趣，然后通过一个门控机制融合得到用户最终的表示向量。

每个会话对于当前商品的预测，显然并不是同等重要的，往往是越邻近的会话，重要程度越大一些，而比较久远的会话，可能影响程度小一些，但并不一定没有影响。因此需要把短期会话和长期会话分开建模。

对于短期会话，使用相对复杂的模型，从多方面考虑用户的短期兴趣
作者发现用户的兴趣点在一个会话里也是多重的
作者使用了LSTM学习序列关系，然后使用Multi-head attention机制，学习用户的多兴趣
对于长期会话，使用简单模型，获取一个用户的长期兴趣

同时，用户的长期行为也会影响当前的决策，比如用户的兴趣、爱好相关，因此长期偏好或者行为往往是复杂广泛的，对于融合长短期兴趣，作者使用了一个门控循环单元。

总结：SDM模型首先使用RNN学习序列关系，其次通过多头注意力机制捕捉多兴趣，然后通过一个Attention Net加权得到短期兴趣表示，长期会话通过Attention Net融合，然后过DNN，得到用户的长期表示，并设计了一个类似于LSTM的门控单元，融合两种兴趣，得到用户最终的表示向量。

3. SDM的网络结构与细节剖析

3.1 问题定义

U 表示用户集合，I 表示item集合，模型考虑在时间 t 时刻用户 u 是否会对 i 产生交互。对于 u，我们能够得到它的历史行为序列，会话的划分规则：

相同会话ID的商品算是一个会话
相邻的商品，时间间隔小于10分钟算一个会话
同一个会话中的商品不能超过50个

同时，对于用户u的短期行为定义是离目前最近的这次会话，用Su序列表示。而长期的用户行为是过去一周内的会话，不包括短期的这次会话。

在这里插入图片描述

接收用户的短期行为和长期行为，然后分别通过两个盲盒得到表示向量，再通过门控融合就得到了最终的用户表示。训练的时候，用户向量与当前交互商品做softmax操作进行采样，然后计算损失，具体的可以参考YouTubeDNN。

训练好模型之后，在out与softmax层得到item的所有向量，当一个用户产生行为，通过这个模块拿到该用户表示，然后与所有item向量做最近邻检索，拿到相似的TOP-K推荐。

3.2 Input Embedding with side Information

在淘宝的推荐场景中，顾客与物品产生交互行为的时候，不仅考虑特定的商品本身，还考虑产品，商铺，价格等。所以对于一个商品来说，不仅要用到Item ID，还用了更多的side info信息，包括leat category, fist level category, brand,shop。

所以，假设用户的短期行为是 $\mathcal{S}^{u}=\left[i_{1}^{u}, \ldots, i_{t}^{u}, \ldots, i_{m}^{u}\right]$ ，其中每个商品 $i_{t}^{u}$ 有5个属性表示，每个属性本质是ID，但转成 embedding之后，就得到了5个embedding，所以这里就涉及到了融合问题。这里用 $\boldsymbol{e}_{t}^{u} \in \mathbb{R}^{d \times 1}$ 来表示每个 $i_{t}^{u}$ ，但这里不是embedding 的pooling操作，而是Concat
$\boldsymbol{e}_{i_{t}^{u}}=\operatorname{concat}\left(\left\{\boldsymbol{e}_{i}^{f} \mid f \in \mathcal{F}\right\}\right)$
其中， $\boldsymbol{e}_{i}^{f}=\boldsymbol{W}^{f} \boldsymbol{x}_{i}^{f} \in \mathbb{R}^{d_{f} \times 1}$ ，将每个side info的 id 通过 embedding layer 得到各自的embedding。这里 embedding的维度是 $d_{f}$ ，拼接之后维度为 $d$ 。

然后是用户的base表示向量就是用户的基础画像得到embedding，然后Concat：
$\boldsymbol{e}_{u}=\operatorname{concat}\left(\left\{\boldsymbol{e}_{u}^{p} \mid p \in \mathcal{P}\right\}\right)$
其中， $e_{u}^{p}$ 是特征 $p$ 的 embedding。

在这里插入图片描述

3.3 短期用户行为建模

短期用户行为是下面的那个框，输入是用户最近的一次会话，里面各个商品加入了 side info信息之后，得到最终的 embedding表示 $\left[\boldsymbol{e}_{i 1}, \ldots, \boldsymbol{e}_{i t}\right]$ 。
然后通过LSTM模型学习序列信息，直接上公式:
$\begin{aligned} \boldsymbol{i} \boldsymbol{n}_{t}^{u} &=\sigma\left(\boldsymbol{W}_{i n}^{1} \boldsymbol{e}_{i t}^{u}+\boldsymbol{W}_{i n}^{2} \boldsymbol{h}_{t-1}^{u}+b_{i n}\right) \\ f_{t}^{u} &=\sigma\left(\boldsymbol{W}_{f}^{1} \boldsymbol{e}_{i t}^{u}+\boldsymbol{W}_{f}^{2} \boldsymbol{h}_{t-1}^{u}+b_{f}\right) \\ \boldsymbol{o}_{t}^{u} &=\sigma\left(\boldsymbol{W}_{o}^{1} \boldsymbol{e}_{i}^{u}+\boldsymbol{W}_{o}^{2} \boldsymbol{h}_{t-1}^{u}+b_{o}\right) \\ \boldsymbol{c}_{t}^{u} &=\boldsymbol{f}_{t} \boldsymbol{c}_{t-1}^{u}+\boldsymbol{i} \boldsymbol{n}_{t}^{u} \tanh \left(\boldsymbol{W}_{c}^{1} \boldsymbol{e}_{i t}^{u}+\boldsymbol{W}_{c}^{2} \boldsymbol{h}_{t-1}^{u}+b_{c}\right) \\ \boldsymbol{h}_{t}^{u} &=\boldsymbol{o}_{t}^{u} \tanh \left(\boldsymbol{c}_{t}^{u}\right) \end{aligned}$
采用了多输入多输出，即每个时间步都会有一个隐藏状态 $h_{t}^{u}$ 输出出来，经过LSTM之后，原始的序列就有了序列相关信息，得到了 $\left[\boldsymbol{h}_{1}^{u}, \ldots, \boldsymbol{h}_{t}^{u}\right]$ , 记为 $\boldsymbol{X}^{u}$ 。这里的 $\boldsymbol{h}_{t}^{u} \in \mathbb{R}^{d \times 1}$ 表示时间 $t$ 的序列偏好表示。
然后经过Multi-head self-attention层，学习 $h_{i}^{u}$ 系列之间的相关性，类似聚类操作，因为我们先用多头矩阵把 $h_{i}^{u}$ 系列映射到多个空间，然后从各个空间中互求相关性
$\text { head }{ }_{i}^{u}=\operatorname{Attention}\left(\boldsymbol{W}_{i}^{Q} \boldsymbol{X}^{u}, \boldsymbol{W}_{i}^{K} \boldsymbol{X}^{u}, \boldsymbol{W}_{i}^{V} \boldsymbol{X}^{u}\right)$
得到权重后，对原始的向量加权融合。使 $Q_{i}^{u}=W_{i}^{Q} X^{u} ， K_{i}^{u}=W_{i}^{K} \boldsymbol{X}^{u}, V_{i}^{u}=W_{i}^{V} X^{u}$ ，计算公式如下:
$\begin{gathered} f\left(Q_{i}^{u}, K_{i}^{u}\right)=Q_{i}^{u T} K_{i}^{u} \\ A_{i}^{u}=\operatorname{softmax}\left(f\left(Q_{i}^{u}, K_{i}^{u}\right)\right) \\ \operatorname{head}_{i}^{u}=V_{i}^{u} A_{i}^{u T} \end{gathered}$
这是一个头的计算，接下来每个头都这么算，假设有 $h$ 个头，这里会通过上面的映射矩阵 $W$ 系列，先把原始的 $h_{i}^{u}$ 向量映射到 $d_{k}=\frac{1}{h} d$ 维度，然后计算 $h e a d_{i}^{u}$ 也是 $d_{k}$ 维，这样 $h$ 个head进行拼接，正好是 $d$ 维，接下来过一个全连接或者线性映射得到Multi Head的输出。
$\hat{X}^{u}=\operatorname{MultiHead}\left(X^{u}\right)=W^{O} \operatorname{concat}\left(\operatorname{head}_{1}^{u}, \ldots, \text { head }_{h}^{u}\right)$
相当于将相似的 $h_{i}^{u}$ 融合到了一块，就能学习到用户的多兴趣。

注意，这里没有使用残差连接， $\hat{X}^{u}=\left[\hat{\boldsymbol{h}}_{1}^{u}, \ldots, \hat{\boldsymbol{h}}_{t}^{u}\right]$ 已经融合多兴趣。
得到这个东西之后，接下来再过一个User Attention层，挖掘更细粒度的用户个性化信息。当然，这个就是普通的embedding层了，用户的base向量 $e_{u}$ 作为 query，与 $\hat{X}^{u}$ 的每个向量做Attention，然后加权求和得最终向量:
$\begin{aligned} \alpha_{k} &=\frac{\exp \left(\hat{\boldsymbol{h}}_{k}^{u T} \boldsymbol{e}_{u}\right)}{\sum_{k=1}^{t} \exp \left(\hat{\boldsymbol{h}}_{k}^{u T} \boldsymbol{e}_{u}\right)} \\ \boldsymbol{s}_{t}^{u} &=\sum_{k=1}^{t} \alpha_{k} \hat{\boldsymbol{h}}_{k}^{u} \end{aligned}$
其中 $s_{t}^{u} \in \mathbb{R}^{d \times 1}$ ，形成了短期行为兴趣。

3.4 用户长期行为建模

从长期的视角来看，用户在不同的维度上可能积累了广泛的兴趣，用户可能经常访问一组类似的商店，并反复购买属于同一类别的商品。所以长期行为 ${L}^{u}$ 来自于不同的特征尺度，并包含了各种side特征。
$\mathcal{L}^{u}=\left\{\mathcal{L}_{f}^{u} \mid f \in \mathcal{F}\right\}$
与短期行为不同，长期行为是从特征的维度进行聚合，把用户的历史长序列分成了多个特征，比如用户历史点击过的商品，历史逛过的店铺，历史看过的商品的类别，品牌等，分成了多个特征子集，然后这每个特征子集里面有对应的id，比如商品有商品id, 店铺有店铺id等，对于每个子集，通过user Attention layer，和用户的base向量求Attention，相当于看看用户喜欢逛啥样的商店，喜欢啥样的品牌，啥样的商品类别等等，得到每个子集最终的表示向量。每个子集的计算过程如下：
$\begin{aligned} \alpha_{k} &=\frac{\exp \left(\boldsymbol{g}_{k}^{u T} \boldsymbol{e}_{u}\right)}{\sum_{k=1}^{\left|\mathcal{L}_{f}^{u}\right|} \exp \left(\boldsymbol{g}_{k}^{u T} \boldsymbol{e}_{u}\right)} \\ z_{f}^{u} &=\sum_{k=1}^{\left|\mathcal{L}_{f}^{u}\right|} \alpha_{k} \boldsymbol{g}_{k}^{u} \end{aligned}$
每个子集都会得到一个加权的向量，把这个东西拼起来，然后通过DNN。
$\begin{aligned} &\boldsymbol{z}^{u}=\operatorname{concat}\left(\left\{z_{f}^{u} \mid f \in \mathcal{F}\right\}\right) \\ &\boldsymbol{p}^{u}=\tanh \left(\boldsymbol{W}^{p} z^{u}+b\right) \end{aligned}$
这里的 $\boldsymbol{p}^{u} \in \mathbb{R}^{d \times 1}$ ，得到用户的长期兴趣表示。

3.5 短长期兴趣融合

长短期兴趣融合，作者发现之前模型往往喜欢直接拼接起来，或者加和，注意力加权等，作者认为这样不能很好的将两类兴趣融合起来，因为长期序列里面，只有很少的一部分行为和当前有关，因此直接融合是有问题的。所以作者采用了门控机制：
$\begin{gathered} G_{t}^{u}=\operatorname{sigmoid}\left(\boldsymbol{W}^{1} \boldsymbol{e}_{u}+\boldsymbol{W}^{2} s_{t}^{u}+\boldsymbol{W}^{3} \boldsymbol{p}^{u}+b\right) \\ o_{t}^{u}=\left(1-G_{t}^{u}\right) \odot p^{u}+G_{t}^{u} \odot s_{t}^{u} \end{gathered}$
这个和LSTM的这种门控机制很像，首先门控接收的输入有用户画像 $e_{u}$ ，用户短期兴趣 $s_{t}^{u}$ ，用户长期兴趣 $p^{u}$ ，经过sigmoid函数得到了 $G_{t}^{u} \in \mathbb{R}^{d \times 1}$ ，用来决定在 $t$ 时刻短期和长期兴趣的贡献程度。然后根据这个贡献程度对短期和长期偏好加权进行融合。

实验中证明了这种融合的有效性，因为，最终得到的短期或者长期兴趣都是 d 维的向量，每一个维度可能代表着不同的兴趣偏好，比如第一维度代表品牌，第二个维度代表类别，第三个维度代表价格，第四个维度代表商店等。如果我们是直接相加或者是加权相加，其实都意味着长短期兴趣这每个维度都有很高的保留。

门控机制的巧妙就在于，我会给每个维度都学习到一个权重，而这个权重非0即1，融合的时候，通过这个门控机制，取长期和短期兴趣向量每个维度上的其中一个，这样就不会有冲突发生。这样就使得用户长期兴趣和短期兴趣融合的时候，每个维度上的信息保留变得有选择。使得兴趣的融合方式更加灵活。

论文后面的实验就是作了方法对比，消融研究了各个模块的有效性等。

4. SDM模型的简易复现

参考DeepMatch，并在新闻推荐的数据集上进行召回任务。

关于数据集的介绍，可以参考YouTubeDNN那篇文章。

4.1 模型的输入

SDM模型将用户的行为序列分成了会话的形式，在构造模型输入和其他模型有较大区别。

产生数据集的时候，需要传入短期会话长度及长期会话长度，对于一个行为序列，构造数据集的时候要按照两个长度分成短期行为和长期行为，并且每一种都需要指明真实的序列长度。

另外，由于用到了商品的side info信息，所以这里加入了文章的两个类别特征cat_1和cat_2，作为文章的side info。

产生数据集：

"""构造sdm数据集"""
def get_data_set(click_data, seq_short_len=5, seq_prefer_len=50):""":param: seq_short_len: 短期会话的长度:param: seq_prefer_len: 会话的最长长度"""click_data.sort_values("expo_time", inplace=True)train_set, test_set = [], []for user_id, hist_click in tqdm(click_data.groupby('user_id')):pos_list = hist_click['article_id'].tolist()cat1_list = hist_click['cat_1'].tolist()cat2_list = hist_click['cat_2'].tolist()# 滑动窗口切分数据for i in range(1, len(pos_list)):hist = pos_list[:i]cat1_hist = cat1_list[:i]cat2_hist = cat2_list[:i]# 序列长度只够短期的if i <= seq_short_len and i != len(pos_list) - 1:train_set.append((# 用户id, 用户短期历史行为序列， 用户长期历史行为序列， 当前行为文章， label， user_id, hist[::-1], [0]*seq_prefer_len, pos_list[i], 1, # 用户短期历史序列长度， 用户长期历史序列长度， len(hist[::-1]), 0, # 用户短期历史序列对应类别1， 用户长期历史行为序列对应类别1cat1_hist[::-1], [0]*seq_prefer_len, # 历史短期历史序列对应类别2， 用户长期历史行为序列对应类别2 cat2_hist[::-1], [0]*seq_prefer_len))# 序列长度够长期的elif i != len(pos_list) - 1:train_set.append((# 用户id, 用户短期历史行为序列，用户长期历史行为序列， 当前行为文章， labeluser_id, hist[::-1][:seq_short_len], hist[::-1][seq_short_len:], pos_list[i], 1, # 用户短期行为序列长度，用户长期行为序列长度，seq_short_len, len(hist[::-1])-seq_short_len,# 用户短期历史行为序列对应类别1， 用户长期历史行为序列对应类别1cat1_hist[::-1][:seq_short_len], cat1_hist[::-1][seq_short_len:],# 用户短期历史行为序列对应类别2， 用户长期历史行为序列对应类别2cat2_hist[::-1][:seq_short_len], cat2_hist[::-1][seq_short_len:]             ))# 测试集保留最长的那一条elif i <= seq_short_len and i == len(pos_list) - 1:test_set.append((user_id, hist[::-1], [0]*seq_prefer_len, pos_list[i], 1,len(hist[::-1]), 0, cat1_hist[::-1], [0]*seq_perfer_len, cat2_hist[::-1], [0]*seq_prefer_len))else:test_set.append((user_id, hist[::-1][:seq_short_len], hist[::-1][seq_short_len:], pos_list[i], 1,seq_short_len, len(hist[::-1])-seq_short_len, cat1_hist[::-1][:seq_short_len], cat1_hist[::-1][seq_short_len:],cat2_list[::-1][:seq_short_len], cat2_hist[::-1][seq_short_len:]))random.shuffle(train_set)random.shuffle(test_set)return train_set, test_set

根据会话的长短，把之前的一个长行为序列划分成了短期和长期两个，然后加入了两个新的side info特征。

下面产生模型的输入，模型输入依然需要指明短期序列长度和长期序列长度，padding的时候会用到。

构造SDM模型的输入：

def gen_model_input(train_set, user_profile, seq_short_len, seq_prefer_len):"""构造模型输入"""# row: [user_id, short_train_seq, perfer_train_seq, item_id, label, short_len, perfer_len, cat_1_short, cat_1_perfer, cat_2_short, cat_2_prefer]train_uid = np.array([row[0] for row in train_set])short_train_seq = [row[1] for row in train_set]prefer_train_seq = [row[2] for row in train_set]train_iid = np.array([row[3] for row in train_set])train_label = np.array([row[4] for row in train_set])train_short_len = np.array([row[5] for row in train_set])train_prefer_len = np.array([row[6] for row in train_set])short_train_seq_cat1 = np.array([row[7] for row in train_set])prefer_train_seq_cat1 = np.array([row[8] for row in train_set])short_train_seq_cat2 = np.array([row[9] for row in train_set])prefer_train_seq_cat2 = np.array([row[10] for row in train_set])# padding操作train_short_item_pad = pad_sequences(short_train_seq, maxlen=seq_short_len, padding='post', truncating='post', value=0)train_prefer_item_pad = pad_sequences(prefer_train_seq, maxlen=seq_prefer_len, padding='post', truncating='post', value=0)train_short_cat1_pad = pad_sequences(short_train_seq_cat1, maxlen=seq_short_len, padding='post', truncating='post', value=0)train_prefer_cat1_pad = pad_sequences(prefer_train_seq_cat1, maxlen=seq_prefer_len, padding='post', truncating='post', value=0)train_short_cat2_pad = pad_sequences(short_train_seq_cat2, maxlen=seq_short_len, padding='post', truncating='post', value=0)train_prefer_cat2_pad = pad_sequences(prefer_train_seq_cat2, maxlen=seq_prefer_len, padding='post', truncating='post', value=0)# 形成输入词典train_model_input = {"user_id": train_uid,"doc_id": train_iid,"short_doc_id": train_short_item_pad,"prefer_doc_id": train_prefer_item_pad,"prefer_sess_length": train_prefer_len,"short_sess_length": train_short_len,"short_cat1": train_short_cat1_pad,"prefer_cat1": train_prefer_cat1_pad,"short_cat2": train_short_cat2_pad,"prefer_cat2": train_prefer_cat2_pad}# 其他的用户特征加入for key in ["gender", "age", "city"]:train_model_input[key] = user_profile.loc[train_model_input['user_id']][key].valuesreturn train_model_input, train_label

4.2 模型的代码架构

整个SDM模型，参考deepmatch修改的一个简易版本：

def SDM(user_feature_columns, item_feature_columns, history_feature_list, num_sampled=5, units=32, rnn_layers=2,dropout_rate=0.2, rnn_num_res=1, num_head=4, l2_reg_embedding=1e-6, dnn_activation='tanh', seed=1024):""":param rnn_num_res: rnn的残差层个数 :param history_feature_list: short和long sequence field"""# item_feature目前只支持doc_id， 再加别的就不行了，其实这里可以改造下if (len(item_feature_columns)) > 1: raise ValueError("SDM only support 1 item feature like doc_id")# 获取item_feature的一些属性item_feature_column = item_feature_columns[0]item_feature_name = item_feature_column.nameitem_vocabulary_size = item_feature_column.vocabulary_size# 为用户特征创建Input层user_input_layer_dict = build_input_layers(user_feature_columns)item_input_layer_dict = build_input_layers(item_feature_columns)# 将Input层转化成列表的形式作为model的输入user_input_layers = list(user_input_layer_dict.values())item_input_layers = list(item_input_layer_dict.values())# 筛选出特征中的sparse特征和dense特征，方便单独处理sparse_feature_columns = list(filter(lambda x: isinstance(x, SparseFeat), user_feature_columns)) if user_feature_columns else []dense_feature_columns = list(filter(lambda x: isinstance(x, DenseFeat), user_feature_columns)) if user_feature_columns else []if len(dense_feature_columns) != 0:raise ValueError("SDM dont support dense feature")  # 目前不支持Dense featurevarlen_feature_columns = list(filter(lambda x: isinstance(x, VarLenSparseFeat), user_feature_columns)) if user_feature_columns else []# 构建embedding字典embedding_layer_dict = build_embedding_layers(user_feature_columns+item_feature_columns)# 拿到短期会话和长期会话列 之前的命名规则在这里起作用sparse_varlen_feature_columns = []prefer_history_columns = []short_history_columns = []prefer_fc_names = list(map(lambda x: "prefer_" + x, history_feature_list))short_fc_names = list(map(lambda x: "short_" + x, history_feature_list))for fc in varlen_feature_columns:if fc.name in prefer_fc_names:prefer_history_columns.append(fc)elif fc.name in short_fc_names:short_history_columns.append(fc)else:sparse_varlen_feature_columns.append(fc)# 获取用户的长期行为序列列表 L^u # [<tf.Tensor 'emb_prefer_doc_id_2/Identity:0' shape=(None, 50, 32) dtype=float32>, <tf.Tensor 'emb_prefer_cat1_2/Identity:0' shape=(None, 50, 32) dtype=float32>, <tf.Tensor 'emb_prefer_cat2_2/Identity:0' shape=(None, 50, 32) dtype=float32>]prefer_emb_list = embedding_lookup(prefer_fc_names, user_input_layer_dict, embedding_layer_dict)# 获取用户的短期序列列表 S^u# [<tf.Tensor 'emb_short_doc_id_2/Identity:0' shape=(None, 5, 32) dtype=float32>, <tf.Tensor 'emb_short_cat1_2/Identity:0' shape=(None, 5, 32) dtype=float32>, <tf.Tensor 'emb_short_cat2_2/Identity:0' shape=(None, 5, 32) dtype=float32>]short_emb_list = embedding_lookup(short_fc_names, user_input_layer_dict, embedding_layer_dict)# 用户离散特征的输入层与embedding层拼接 e^uuser_emb_list = embedding_lookup([col.name for col in sparse_feature_columns], user_input_layer_dict, embedding_layer_dict)user_emb = concat_func(user_emb_list)user_emb_output = Dense(units, activation=dnn_activation, name='user_emb_output')(user_emb)  # (None, 1, 32)# 长期序列行为编码# 过AttentionSequencePoolingLayer --> Concat --> DNNprefer_sess_length = user_input_layer_dict['prefer_sess_length']prefer_att_outputs = []# 遍历长期行为序列for i, prefer_emb in enumerate(prefer_emb_list):prefer_attention_output = AttentionSequencePoolingLayer(dropout_rate=0)([user_emb_output, prefer_emb, prefer_sess_length])prefer_att_outputs.append(prefer_attention_output)prefer_att_concat = concat_func(prefer_att_outputs)   # (None, 1, 64) <== Concat(item_embedding，cat1_embedding,cat2_embedding)prefer_output = Dense(units, activation=dnn_activation, name='prefer_output')(prefer_att_concat)# print(prefer_output.shape)   # (None, 1, 32)# 短期行为序列编码short_sess_length = user_input_layer_dict['short_sess_length']short_emb_concat = concat_func(short_emb_list)   # (None, 5, 64)   这里注意下， 对于短期序列，描述item的side info信息进行了拼接short_emb_input = Dense(units, activation=dnn_activation, name='short_emb_input')(short_emb_concat)  # (None, 5, 32)# 过rnn 这里的return_sequence=True， 每个时间步都需要输出hshort_rnn_output = DynamicMultiRNN(num_units=units, return_sequence=True, num_layers=rnn_layers, num_residual_layers=rnn_num_res,   # 这里竟然能用到残差dropout_rate=dropout_rate)([short_emb_input, short_sess_length])# print(short_rnn_output) # (None, 5, 32)# 过MultiHeadAttention  # (None, 5, 32)short_att_output = MultiHeadAttention(num_units=units, head_num=num_head, dropout_rate=dropout_rate)([short_rnn_output, short_sess_length]) # (None, 5, 64)# user_attention # (None, 1, 32)short_output = UserAttention(num_units=units, activation=dnn_activation, use_res=True, dropout_rate=dropout_rate)([user_emb_output, short_att_output, short_sess_length])# 门控融合gated_input = concat_func([prefer_output, short_output, user_emb_output])gate = Dense(units, activation='sigmoid')(gated_input)   # (None, 1, 32)# temp = tf.multiply(gate, short_output) + tf.multiply(1-gate, prefer_output)  感觉这俩一样？gated_output = Lambda(lambda x: tf.multiply(x[0], x[1]) + tf.multiply(1-x[0], x[2]))([gate, short_output, prefer_output])  # [None, 1,32]gated_output_reshape = Lambda(lambda x: tf.squeeze(x, 1))(gated_output)  # (None, 32)  这个维度必须要和docembedding层的维度一样，否则后面没法sortmax_loss# 接下来item_embedding_matrix = embedding_layer_dict[item_feature_name]  # 获取doc_id的embedding层item_index = EmbeddingIndex(list(range(item_vocabulary_size)))(item_input_layer_dict[item_feature_name]) # 所有doc_id的索引item_embedding_weight = NoMask()(item_embedding_matrix(item_index))  # 拿到所有item的embeddingpooling_item_embedding_weight = PoolingLayer()([item_embedding_weight])  # 这里依然是当可能不止item_id，或许还有brand_id, cat_id等，需要池化# 这里传入的是整个doc_id的embedding， user_embedding, 以及用户点击的doc_id，然后去进行负采样计算损失操作output = SampledSoftmaxLayer(num_sampled)([pooling_item_embedding_weight, gated_output_reshape, item_input_layer_dict[item_feature_name]])model = Model(inputs=user_input_layers+item_input_layers, outputs=output)# 下面是等模型训练完了之后，获取用户和item的embeddingmodel.__setattr__("user_input", user_input_layers)model.__setattr__("user_embedding", gated_output_reshape)  # 用户embedding是取得门控融合的用户向量model.__setattr__("item_input", item_input_layers)# item_embedding取得pooling_item_embedding_weight, 这个会发现是负采样操作训练的那个embedding矩阵model.__setattr__("item_embedding", get_item_embedding(pooling_item_embedding_weight, item_input_layer_dict[item_feature_name]))return model

函数式API搭建模型的方式，首先需要传入封装好的用户特征描述以及item特征描述，比如：

# 建立模型
user_feature_columns = [SparseFeat('user_id', feature_max_idx['user_id'], 16),SparseFeat('gender', feature_max_idx['gender'], 16),SparseFeat('age', feature_max_idx['age'], 16),SparseFeat('city', feature_max_idx['city'], 16),VarLenSparseFeat(SparseFeat('short_doc_id', feature_max_idx['article_id'], embedding_dim, embedding_name="doc_id"), SEQ_LEN_short, 'mean', 'short_sess_length'),    VarLenSparseFeat(SparseFeat('prefer_doc_id', feature_max_idx['article_id'], embedding_dim, embedding_name='doc_id'), SEQ_LEN_prefer, 'mean', 'prefer_sess_length'),VarLenSparseFeat(SparseFeat('short_cat1', feature_max_idx['cat_1'], embedding_dim, embedding_name='cat_1'), SEQ_LEN_short, 'mean', 'short_sess_length'),VarLenSparseFeat(SparseFeat('prefer_cat1', feature_max_idx['cat_1'], embedding_dim, embedding_name='cat_1'), SEQ_LEN_prefer, 'mean', 'prefer_sess_length'),VarLenSparseFeat(SparseFeat('short_cat2', feature_max_idx['cat_2'], embedding_dim, embedding_name='cat_2'), SEQ_LEN_short, 'mean', 'short_sess_length'),VarLenSparseFeat(SparseFeat('prefer_cat2', feature_max_idx['cat_2'], embedding_dim, embedding_name='cat_2'), SEQ_LEN_prefer, 'mean', 'prefer_sess_length'),]item_feature_columns = [SparseFeat('doc_id', feature_max_idx['article_id'], embedding_dim)]

这里需要注意的一个点是短期和长期序列的名字，必须严格的short__， prefer_ 进行标识，因为在模型搭建的时候就是靠着这个去找到短期和长期序列特征。

逻辑比较清晰，首先是建立Input层，然后是embedding层，接下来，根据命名选择出用户的base特征列，短期行为序列和长期行为序列。

长期序列通过Attention Pooling Layer层进行编码，本质就是注意力机制然后融合，需要注意的一个点是for循环，也就是长期序列行为里面的特征列，比如商品，cat_1, cat_2是for循环的形式求融合向量，再拼接起来过DNN，和论文图保持一致。

短期序列编码部分，是item_embedding, cat_1 embedding， cat_2 embedding拼接起来，过Dynamic Multi RNN 层学习序列信息，通过Multi Head Attention学习多兴趣，最后通过User Attention Layer进行向量融合。

接下来，长期兴趣向量和短期兴趣向量以及用户base向量，过门控融合机制，得到最终的user_embedding。

后面的那块是为了模型训练完之后，方便取出user embedding和item embedding.

接下来，看几个重要的层。

4.3 Attention Sequence Pooling Layer层

首先是长期行为序列中使用的Att_Net层，本质上就是传统的求Attention机制。

class AttentionSequencePoolingLayer(Layer):def __init__(self, dropout_rate=0, scale=True, **kwargs):self.dropout_rate = dropout_rateself.scale = scalesuper(AttentionSequencePoolingLayer, self).__init__(**kwargs)def build(self, input_shape):self.projection_layer = Dense(units=1, activation='tanh')       super(AttentionSequencePoolingLayer, self).build(input_shape)def call(self, inputs, mask=None, **kwargs):# queries[None, 1, 64], keys[None, 50, 32], keys_length[None, 1]， 表示真实的会话长度， 后面mask会用到queries, keys, keys_length = inputshist_len = keys.get_shape()[1]key_mask = tf.sequence_mask(keys_length, hist_len)  # mask 矩阵 (None, 1, 50) queries = tf.tile(queries, [1, hist_len, 1])  # [None, 50, 64]   为每个key都配备一个query# 后面，把queries与keys拼起来过全连接， 这里是这样的， 本身接下来是要求queryies与keys中每个item相关性，常规操作我们能想到的就是直接内积求得分数# 而这里的Attention实现，使用的是LuongAttention， 传统的Attention原来是有两种BahdanauAttention与LuongAttention， 这个在博客上整理下# 这里采用的LuongAttention，对其方式是q_k先拼接，然后过DNN的方式q_k = tf.concat([queries, keys], axis=-1)  # [None, 50, 96]output_scores = self.projection_layer(q_k)  # [None, 50, 1]if self.scale:output_scores = output_scores / (q_k.get_shape().as_list()[-1] ** 0.5)attention_score = tf.transpose(output_scores, [0, 2, 1])# 加权求和 需要把填充的那部分mask掉paddings = tf.ones_like(attention_score) * (-2 ** 32 + 1)attention_score = tf.where(key_mask, attention_score, paddings)attention_score = softmax(attention_score)  # [None, 1, 50]outputs = tf.matmul(attention_score, keys)  # [None, 1, 64]return outputs

这个层的实现注意力机制的方式是Q和key向量拼接，然后过一个全连接层得到权重，然后反乘到key上。

之前是直接向量内积得到权重不就行, 于是去查Attention的实现方法，结果发现了一点新东西:

传统的Attention其实是有两种经典注意力机制的，分别是Luong Attention和Bahdanau Attention，这两个在理念上大致相同，但是实现细节上有些区别。看下面两个图：

在这里插入图片描述

经常看到的计算注意力的方式是右边这个，在NLP里面算注意力的时候， $t$ 步的注意力 $c_{t}$ 是由解码器的第 $t - 1$ 步隐藏状态 $h_{t-1}$ 与编码器中的每个隐藏状态 $\overline{\mathbf{h}}_{s}$ 加权计算得到的。称为Bahdanau Attention。
与其不一样的是左边的这个，计算 $c_{t}$ 的时候，用的是解码器第 $t$ 步隐藏状态 $h_{t}$ 与编码器的隐藏状态 $\overline{\mathbf{h}}{ }_{s}$ 得到的，这种attention机制是要在解码器中先计算出 $h_{t}$ 来，然后这东西与编码器的各个隐层求出加权求和得到 $c_{t}$ ，然后这俩拼接起来，再过一个全连接，得到 $\tilde{h}_{t}$ ，用这个计算当前时间步的输出 $\hat{\mathbf{y}}_{t}$ 。
所以这两种Attention机制的区别总结如下:

注意力计算方式不同。在 Luong Attention 机制中，第 $t$ 步的注意力 $c_{t}$ 是由 decoder 第 $\mathrm{t}$ 步的 hidden state $\mathbf{h}_{t}$ 与 encoder 中的每一个 hidden state $\overline{\mathbf{h}}_{s}$ 加权计算得出的。而在 Bahdanau Attention 机制中，第 $t$ 步的注意力 $c_{t}$ 是由 decoder 第 $t - 1$ 步的 hidden state $\mathbf{h}_{t-1}$ 与 encoder 中的每一个 hidden state $\overline{\mathbf{h}}_{s}$ 加权计算得出的。
decoder的输入输出不同。在 Bahdanau Attention 机制中， decoder 在第 $t$ 步时，输入是由注意力 $\mathbf{c}_{t}$ 与前一步的 hidden state $\mathbf{h}_{t-1}$ 拼接 (concatenate) 得出的，得到第 $t$ 步的 hidden state $\mathbf{h}_{t}$ 并直接输出 $\hat{\mathbf{y}}_{t+1}$ 。而 Luong Attention 机制在 decoder 部分建立了一层额外的网络结构，以注意力 $\mathbf{c}_{t}$ 与原 decoder 第 $t$ 步的 hidden state $\mathbf{h}_{t}$ 拼接作为输入，得到第 $t$ 步的 hidden state $\tilde{\mathbf{h}}_{t}$ 并输出 $\hat{\mathbf{y}}_{t}$ 。
因此，Bahdanau Attention 机制的计算流程为 $\mathbf{h}_{t-1} \rightarrow \mathbf{a}_{t} \rightarrow \mathbf{c}_{t} \rightarrow \mathbf{h}_{t}$ ，而 Luong attention 机制的计算流程为 $\mathbf{h}_{t} \rightarrow \mathbf{a}_{t} \rightarrow$ $\mathbf{c}_{t} \rightarrow \tilde{\mathbf{h}}_{t}$ 。相较而言，Luong attention 机制中的 decoder 在每一步使用当前步 (而非前一步) 的 hidden state 来计算注意力，从逻辑上更自然，但需要使用一层额外的 RNN decoder 来计算输出。
Bahdanau Attention只在 concat 对齐函数上进行了实验，Luong Attention在多种对齐函数进行了实验，下面为Luong Attention设计的三种对齐函数

$\operatorname{score}\left(\boldsymbol{h}_{t}, \overline{\boldsymbol{h}}_{s}\right)= \begin{cases}\boldsymbol{h}_{t}^{\top} \overline{\boldsymbol{h}}_{s} & \text { dot } \\ \boldsymbol{h}_{t}^{\top} \boldsymbol{W}_{a} \overline{\boldsymbol{h}}_{s} & \text { general } \\ \boldsymbol{v}_{a}^{\top} \tanh \left(\boldsymbol{W}_{\boldsymbol{a}}\left[\boldsymbol{h}_{t} ; \overline{\boldsymbol{h}}_{s}\right]\right) & \text { concat }\end{cases}$

这样，关于经典Attention机制的内容就更加全面了，上面代码里面实现的就是最下面这种concat对齐函数下的Luong Attention。

后面的动态 RNN 以及多头注意力机制都是都比较常规的。

总结

SDM 模型一个标准的序列推荐召回模型，文章把用户的行为训练以会话的形式进行切分，然后再根据时间，分成了短期会话和长期会话，然后分别采用不同的策略去学习用户的短期兴趣和长期兴趣。

对于短期会话，和当前预测相关性较大，首先用RNN来学习序列信息，然后采用多头注意力机制得到用户的多兴趣，接下来就是和用户的base向量进行注意力融合得到短期兴趣
长期会话序列中，每个side info信息进行分开，然后分别进行注意力编码融合得到
为了使得长期会话中对当前预测有用的部分得以体现，在融合短期兴趣和长期兴趣的时候，采用了门控的方式，而不是普通的拼接或者加和等操作，使得兴趣保留信息变得有选择

可以借鉴的地方：

首先是多头注意力机制也能学习到用户的多兴趣，这样对于多兴趣，就有了胶囊网络与多头注意力机制两种思路。另外对于两个向量融合，提供了一种门控融合机制；

另外还学习到了传统的两种经典注意力机制以及区别。

参考：