Transformer模型-Multi-Head Attention多头注意力的简明介绍

本文主要是介绍Transformer模型-Multi-Head Attention多头注意力的简明介绍，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

今天介绍transformer模型的Multi-Head Attention多头注意力。

原论文计算scaled dot-product attention和multi-head attention

实际整合到一起的流程为：

通过之前文章，假定我们已经理解了attention；今天我们按顺序来梳理一下整合之后的顺序。重新梳理Attention Is All You Need（Transformer模型）: Attention=距离，权重，概率；Multi-Head attention-CSDN博客https://blog.csdn.net/ank1983/article/details/136998593

当使用多头注意力时，通常d_key = d_value = (d_model / n_heads)，其中n_heads是头的数量。研究人员表示，模型之所以能够“关注不同位置的不同表示子空间中的信息”，所以经常使用并行注意力层而不是全维度层。只有一个头时，平均化会阻止这种情况。

第一步：通过线性层W*传递输入Q、K和V

计算注意力的第一步是获取Q、K和V张量；它们分别是查询、键和值张量。它们是通过获取位置编码的嵌入（记作X）并同时将张量传递通过三个线性层（分别记作Wq、Wk和Wv）来计算的。这可以在上面的详细图像中看到。

Q = XWq
K = XWk
V = XWv
X has a size of (batch_size, seq_length, d_model). An example would be a batch of 32 sequences of length 10 with an embedding of 512, which would have a shape of (32, 10, 512).
Wq, Wk, and Wv have a size of (d_model, d_model). Following the example above, they would have a shape of (512, 512).

The linear layers for Wq, Wk, and Wv can be created using nn.Linear(d_model, d_model).

**关于W*和线性层，可参考文章：

学习transformer模型-线性层（Linear Layer），全连接层（Fully Connected Layer）或密集层（Dense Layer）的简明介绍-CSDN博客https://blog.csdn.net/ank1983/article/details/137212380学习transformer模型-权重矩阵Wq，Wk，Wv的简明介绍-CSDN博客https://blog.csdn.net/ank1983/article/details/137160105

第二步：将Q、K和V分割为各自的头

创建了Q、K和V张量后，现在可以通过将d_model的视图更改为(n_heads, d_key)来将它们分割为各自的头。n_heads可以是一个任意数，但在处理较大的嵌入时，通常会选择8、10或12。请注意，d_key = (d_model / n_heads)。

Q has a shape of (batch_size, n_heads, Q_length, d_key)
K has a shape of (batch_size, n_heads, K_length, d_key)
V has a shape of (batch_size, n_heads, V_length, d_key)

第三步：对每个头计算attention

关于点积和矩阵乘法，请参看：

学习transformer模型-点积dot product，计算attention-CSDN博客https://blog.csdn.net/ank1983/article/details/137093906学习transformer模型-矩阵乘法；与点积dot product的关系；计算attention-CSDN博客https://blog.csdn.net/ank1983/article/details/137090019