The Transformer Model and the Self-Attention Mechanism

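Before the Transformer, sequence transduction tasks such as translation (and sequence or time-series tasks more generally) were usually handled with RNN or CNN architectures. RNNs have two drawbacks: (1) they encode the order of a sequence through the order of computation, so the time steps must be computed serially; (2) information from early in a long sequence can be lost. CNNs have a different drawback: how far apart two positions can interact is bounded by the convolution window, so relating distant positions in a long sequence can require many stacked convolutional layers.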

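The Transformer was proposed in response to these problems. Its new architecture aims at (1) better parallelization and (2) better modeling of long sequences.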

Transformer Model Architecture

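In the architecture shown below, the encoder is on the left and the decoder on the right. The encoder maps the inputs $(x_1,\ldots,x_n)$ to $\boldsymbol{z}=(z_1,\ldots,z_n)$. Given $\boldsymbol{z}$, the decoder generates the output sequence $(y_1,\ldots,y_m)$ autoregressively: at each step it takes the previous outputs $y_{<i}$ as additional input and produces the token $y_i$ (more precisely, a probability for every token in the vocabulary). The key components of the Transformer are introduced in turn below.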

1. Tokenization

Text first has to be converted into tokens, as shown in the figure below:

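Before training, a Transformer maintains a vocabulary, and every token is an element of that vocabulary. To process tokens, the Input / Output Embeddings layers convert them into embeddings: each token in the vocabulary has its own vector (Embedding). In the original Transformer paper this vector has 512 dimensions (later models increase the embedding dimension as their parameter counts grow).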

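These embeddings are usually initialized randomly from a uniform or normal distribution; for example, nn.Embedding by default draws its weights from the standard normal distribution $\mathcal{N}(0,1)$. The embeddings are updated by backpropagation during training and, once training ends, are stored as part of the model weights. The following is a code example of an embedding_layer.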

import torch.nn as nn
d_model = 512  # or whatever value the model specifies
vocab_size = 30000  # assume a vocabulary of 30,000 tokens
embedding_layer = nn.Embedding(vocab_size, d_model)
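
To make the lookup concrete, here is a minimal sketch of how such a layer might be used. The token ids are made up for illustration and do not come from any real tokenizer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab_size = 512, 30000
embedding_layer = nn.Embedding(vocab_size, d_model)

# Hypothetical token ids for a batch of 2 sequences of length 4
token_ids = torch.tensor([[5, 17, 256, 9], [1, 17, 0, 42]])
embeddings = embedding_layer(token_ids)
print(embeddings.shape)  # torch.Size([2, 4, 512])

# The same token id always maps to the same row of the weight matrix
print(torch.equal(embeddings[0, 1], embeddings[1, 1]))  # True
```

The layer is just a trainable lookup table of shape [vocab_size, d_model]; indexing it with integer ids returns the corresponding rows.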

Byte-Pair Encoding (BPE)

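Tokenization techniques were an important research topic in NLP well before the Transformer. This section introduces one common method, Byte-Pair Encoding (BPE). It has little to do with the Transformer itself, so feel free to skip it.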

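BPE consists of two steps. The first step builds a vocabulary from a large text corpus. It starts by counting the frequency of every word and appending an end-of-word marker </w> to each word. For example, suppose the initial corpus is: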

{"yes</w>": 7, "highest</w>": 3, "high</w>": 9}

Given this corpus, each word is split into characters and the frequency of every adjacent symbol pair is counted:

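After aggregating the counts, the most frequent pair is h-i, which occurs 12 times (i-g and g-h also occur 12 times; h-i is the one picked here). After adding hi to the vocabulary, the pair frequencies are recounted on the updated corpus: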

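The most frequent pair is now g-h, again with 12 occurrences, so gh is added to the vocabulary. Merging continues iteratively until a stopping condition is met (usually a target vocabulary size or a fixed number of merges). The final vocabulary looks like this (hi and gh were merged into high along the way):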

{"y", "es", "</w>", "high", "t", "high</w>"}
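
The counting and merging steps above can be sketched in a few lines of plain Python. This is a minimal illustration on the toy corpus, not a production tokenizer; the helper names `get_pair_counts` and `merge_pair` are made up for this example.

```python
from collections import Counter


def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus from the text: each word is a tuple of characters plus </w>
word_freqs = {"yes": 7, "highest": 3, "high": 9}
corpus = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}

counts = get_pair_counts(corpus)
print(counts[("h", "i")], counts[("i", "g")], counts[("g", "h")])  # 12 12 12

# Merge h-i, then recount on the updated corpus
corpus = merge_pair(corpus, ("h", "i"))
counts = get_pair_counts(corpus)
print(counts[("hi", "g")], counts[("g", "h")])  # 12 12
```

Several pairs tie at 12 in this corpus, so which one is merged first depends on the tie-breaking rule; the walkthrough above simply picks h-i and then g-h.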

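The second step uses this vocabulary to split words into subwords, i.e. to encode text. Encoding iterates over the vocabulary from the longest token to the shortest and tries to replace substrings of the given word with these tokens. Any part that still matches no token at the end, because it was never seen during training, is replaced with an unknown token.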
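
The longest-to-shortest matching described here can be sketched as follows. This is a simplified illustration of the procedure in the text (many practical BPE implementations instead replay the learned merges in order); the function name `bpe_encode` and the `<unk>` symbol are assumptions made for this example.

```python
def bpe_encode(word, vocab, unk="<unk>"):
    """Segment `word` by trying vocabulary tokens from longest to shortest."""
    text = word + "</w>"
    # Each segment is (piece, is_token); unmatched pieces are still open
    segments = [(text, False)]
    for token in sorted(vocab, key=len, reverse=True):
        next_segments = []
        for piece, done in segments:
            if done or token not in piece:
                next_segments.append((piece, done))
                continue
            parts = piece.split(token)
            for idx, part in enumerate(parts):
                if part:
                    next_segments.append((part, False))
                if idx < len(parts) - 1:
                    next_segments.append((token, True))
        segments = next_segments
    # Anything that never matched a token becomes the unknown token
    return [piece if done else unk for piece, done in segments]


vocab = {"y", "es", "</w>", "high", "t", "high</w>"}
print(bpe_encode("highest", vocab))  # ['high', 'es', 't', '</w>']
print(bpe_encode("high", vocab))     # ['high</w>']
print(bpe_encode("yes", vocab))      # ['y', 'es', '</w>']
print(bpe_encode("higher", vocab))   # ['high', '<unk>', '</w>']
```

The last call shows the fallback: "er" was never seen when the vocabulary was built, so it is emitted as a single unknown token.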

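Byte-Pair Encoding is essentially a data compression algorithm. By building subwords it keeps the vocabulary small (far more compact than a word-level vocabulary), can cover new words by combining subwords, and can even share tokens across languages (for example, Latin roots), which makes it more broadly applicable.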

2. Positional Encoding

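Once each token has an embedding vector, the Transformer represents the positions of tokens by adding to each token's vector another vector that encodes its position (the positional encoding), defined as follows: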

$$ \begin{aligned} PE_{(pos,2i)} &=\sin\left(pos/10000^{2i/d_{\text{model}}}\right) \\ PE_{(pos,2i+1)} &=\cos\left(pos/10000^{2i/d_{\text{model}}}\right) \end{aligned} $$

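Here $PE_{(pos,2i)}$ is the value at dimension $2i$ of the vector for position $pos$. Unlike an RNN, which represents order through the order of computation, this positional encoding supplies the order as part of the input, so all positions can be processed in parallel.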
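
A minimal PyTorch sketch of this formula is given below. It assumes an even d_model; the function name `sinusoidal_positional_encoding` and the `max_len` parameter are illustrative choices, not part of the original text.

```python
import math
import torch


def sinusoidal_positional_encoding(max_len, d_model):
    """Return a [max_len, d_model] table of sinusoidal position vectors."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [max_len, 1]
    # 1 / 10000^(2i / d_model) for the even indices 2i = 0, 2, 4, ...
    even_idx = torch.arange(0, d_model, 2, dtype=torch.float32)
    div_term = torch.exp(even_idx * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe


pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # torch.Size([50, 512])

# Add the encoding to token embeddings of shape [batch, seq_len, d_model]
x = torch.randn(4, 10, 512)
x = x + pe[:10]  # broadcast over the batch dimension
print(x.shape)   # torch.Size([4, 10, 512])
```

The table depends only on the position and the dimension, so it can be computed once and reused for every batch.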

3. Attention

Scaled Dot-Product Attention

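We start with a single attention operation. The inputs are queries $Q\in \mathbb{R}^{n\times d_k}$, keys $K\in \mathbb{R}^{n\times d_k}$, and values $V\in \mathbb{R}^{n\times d_v}$, where $n$ is the input sequence length. Attention compares every query in $Q$ with every key in $K$ to get similarity scores, applies softmax, and uses the results as weights on $V$: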

$$ \text{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V $$

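The softmax is applied to each row of the (scaled) matrix $Q K^T\in \mathbb{R}^{n\times n}$, so row $i$ holds the similarities between $q_i$ and all keys $k_{1,\ldots,n}$. Multiplying this similarity matrix by $V$ gives the attention output $V'\in \mathbb{R}^{n\times d_v}$. Each $v'_i\in \mathbb{R}^{d_v}$ is a weighted sum of all $v_{1,\ldots,n}$, with weights given by the similarities between $q_i$ and all $k_{1,\ldots,n}$.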
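
The computation can be sketched for a single unbatched sequence as follows. The sizes n = 5, d_k = 64, d_v = 32 are arbitrary values chosen for this illustration.

```python
import math
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V):
    """Q: [n, d_k], K: [n, d_k], V: [n, d_v] -> ([n, d_v], [n, n])."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # [n, n] similarities
    weights = F.softmax(scores, dim=-1)                # softmax over each row
    return weights @ V, weights


torch.manual_seed(0)
n, d_k, d_v = 5, 64, 32
Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_v)

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # torch.Size([5, 32])
print(weights.shape)  # torch.Size([5, 5])

# Row i of the output is the weighted sum of all value vectors
manual_row0 = (weights[0].unsqueeze(1) * V).sum(dim=0)
print(torch.allclose(out[0], manual_row0, atol=1e-5))  # True
```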

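Dividing by $\sqrt{d_k}$ shrinks the range of the dot products so that gradients through the softmax stay stable. When the vectors are long (large $d_k$), the dot products grow in magnitude, and softmax then pushes the largest entry toward 1 and the others toward 0. Once the outputs are saturated near 0 and 1, the gradient of the softmax becomes very small, i.e. the gradient vanishes.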
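
A short calculation shows why $\sqrt{d_k}$ is the natural scale. It rests on an idealized assumption that is not stated in the text above: the components of $q$ and $k$ are independent random variables with mean 0 and variance 1.

$$ q\cdot k=\sum_{j=1}^{d_k} q_j k_j,\qquad \mathbb{E}[q\cdot k]=0,\qquad \operatorname{Var}(q\cdot k)=\sum_{j=1}^{d_k}\operatorname{Var}(q_j k_j)=d_k $$

Under that assumption the dot product has standard deviation $\sqrt{d_k}$, so $\operatorname{Var}\left(q\cdot k/\sqrt{d_k}\right)=1$ regardless of $d_k$, which keeps the softmax inputs in a range where it does not saturate.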

Masked Attention

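The attention described above is global. In the decoder, however, attention over the previously generated sequence $(y_1,\ldots,y_m)$ uses Masked Attention: the output at position $t$ must not see the outputs at positions after $t$. The scores for all later positions are therefore replaced with a very large negative number before the softmax, so that their weights become 0.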

Concretely, after Masked Attention $v'_{i}$ is no longer a weighted sum of all $v_{1,\ldots,n}$, but only of $v_{\leq i}$.
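
The effect of the mask can be checked on a tiny example. This is a sketch with arbitrary sizes (n = 4, d_k = 8); it fills masked scores with -1e9 as the "very large negative number".

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_k = 4, 8
Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_k)

scores = Q @ K.T / d_k ** 0.5

# future[i, j] is True when j > i, i.e. key j lies after query i
future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, -1e9)

weights = F.softmax(scores, dim=-1)
print(weights)  # entries above the diagonal are 0

out = weights @ V
# The first position can only attend to itself, so its output equals v_1
print(torch.allclose(out[0], V[0]))  # True
```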

Multi-Head Attention

In Scaled Dot-Product Attention, once Q, K, and V are given, the result $V'$ follows directly; there are no learnable parameters. The Transformer therefore actually uses Multi-Head Attention, defined as:

$$ \begin{aligned} \text{MultiHead}(Q, K, V) &=\operatorname{Concat}\left(\text{head}_1, \ldots, \text{head}_h\right) W^O \\ \text{where head}_i &=\operatorname{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right) \end{aligned} $$

Here $Q,K,V\in \mathbb{R}^{n\times d_{\text{model}}}$ ($n$ is the input sequence length), $W_i^Q,W_i^K\in \mathbb{R}^{d_{\text{model}}\times d_k}$, $W_i^V\in \mathbb{R}^{d_{\text{model}}\times d_v}$, and $W^O\in \mathbb{R}^{hd_v\times d_{\text{model}}}$ (these $W$ matrices are the learnable parameters).
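
As a worked example of these shapes, take the setting of the original paper, $d_{\text{model}}=512$ and $h=8$ with $d_k=d_v=d_{\text{model}}/h=64$ (the values of $h$, $d_k$, and $d_v$ are not given in the text above). Ignoring bias terms, the parameter count of one Multi-Head Attention module is:

$$ \underbrace{h\,(2\,d_{\text{model}} d_k + d_{\text{model}} d_v)}_{W_i^Q,\,W_i^K,\,W_i^V} + \underbrace{h\,d_v\,d_{\text{model}}}_{W^O} = 8\cdot(3\cdot 512\cdot 64) + 512\cdot 512 = 4\cdot 512^2 = 1{,}048{,}576 $$

Because $h\,d_k=d_{\text{model}}$, the total matches that of a single head with full-size projections; splitting into heads changes how the projections are used, not how many parameters there are.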

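The overall idea of Multi-Head Attention is to project $Q,K,V$ into several subspaces, run attention in each one separately, and concatenate the results. The structure is shown below: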

The Multi-Head Attention module appears in three places in the Transformer architecture:

In the encoder it is self-attention: $V,K,Q$ all come from the same input;
In the decoder it is masked self-attention: $V,K,Q$ all come from the same input, but $v'_{i}$ is a weighted sum of $v_{\leq i}$ only;
In the encoder-decoder attention layer, the encoder output provides $V,K\in \mathbb{R}^{n\times d}$, while $Q \in \mathbb{R}^{m\times d}$ comes from the decoder's preceding output. The result is $V'\in \mathbb{R}^{m\times d}$, where each $v'_i$ is a weighted sum of the encoder-provided $v_{1,\ldots,n}$.
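
The shapes in the encoder-decoder case can be checked with PyTorch's built-in nn.MultiheadAttention. This is a sketch with made-up sizes (n = 12 source positions, m = 7 target positions); the variable names `memory` and `tgt` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads = 512, 8
n, m, batch_size = 12, 7, 2

cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

memory = torch.randn(batch_size, n, d_model)  # encoder output, used as K and V
tgt = torch.randn(batch_size, m, d_model)     # decoder states, used as Q

out, weights = cross_attn(query=tgt, key=memory, value=memory)
print(out.shape)      # torch.Size([2, 7, 512])
print(weights.shape)  # torch.Size([2, 7, 12])
```

The output has one row per query (m rows), and each row of the weight matrix distributes attention over the n encoder positions.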

4. Layer Normalization

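Batch Normalization: for each feature $i$, normalize feature $i$ across all samples so that it has mean 0 and variance 1
Layer Normalization: for each sample, normalize across all of its features so that they have mean 0 and variance 1
Why the Transformer uses Layer Normalization: in sequence problems the valid length differs from sample to sample (padding positions are usually filled with 0), so the per-feature statistics that BN relies on are easily skewed by the valid lengths of the training samples, for example when an unusually long sample shows up at test time.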
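
The difference in normalization axes can be seen directly in PyTorch. The sketch below uses random data of shape [batch, seq_len, d_model] with arbitrary sizes, and leaves both layers in their default training mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 10, 512) * 3 + 5  # [batch, seq_len, d_model]

# LayerNorm: statistics over the feature dimension of each token
layer_norm = nn.LayerNorm(512)
y = layer_norm(x)
print(y.mean(dim=-1).abs().max())            # close to 0 for every token
print(y.var(dim=-1, unbiased=False).mean())  # close to 1

# BatchNorm1d expects [batch, features, seq_len]; statistics are taken
# per feature over all samples and positions in the batch
batch_norm = nn.BatchNorm1d(512)
z = batch_norm(x.transpose(1, 2)).transpose(1, 2)
print(z.mean(dim=(0, 1)).abs().max())        # close to 0 for every feature
print(z.mean(dim=-1).abs().max())            # per-token means are not normalized
```

LayerNorm's output for a token depends only on that token, so it is unaffected by how long the other sequences in the batch are.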

5. Feed-Forward Networks

This is simply a two-layer MLP, computed as follows:

$$ \operatorname{FFN}(x)=\max \left(0, x W_1+b_1\right) W_2+b_2 $$
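
A minimal PyTorch sketch of this formula follows. The hidden size d_ff = 2048 is the value used in the original paper and is not stated in the text above; the class name `FeedForward` is illustrative.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""

    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)  # W1, b1
        self.linear2 = nn.Linear(d_ff, d_model)  # W2, b2

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))


torch.manual_seed(0)
ffn = FeedForward()
x = torch.randn(4, 10, 512)  # [batch, seq_len, d_model]
y = ffn(x)
print(y.shape)  # torch.Size([4, 10, 512])

# Each position is transformed on its own: feeding one token gives the same result
print(torch.allclose(y[0, 3], ffn(x[0, 3]), atol=1e-5))  # True
```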

Multi-Head Attention in Code

First, a hand-written implementation of Multi-Head Attention (useful for understanding how it works step by step):

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.1):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Linear projection layers (all heads are packed into one matrix each)
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Compute attention scores, scaled by sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)

        # Apply the mask if one is given (positions where mask == 0 are hidden)
        if mask is not None:
            if mask.dim() == 2:  # mask shape: [seq_len, seq_len]
                mask = mask.unsqueeze(0).unsqueeze(0)
            elif mask.dim() == 3:  # mask shape: [batch_size, seq_len, seq_len]
                mask = mask.unsqueeze(1)
            scores = scores.masked_fill(mask == 0, -1e9)

        # Compute attention weights
        attn_weights = F.softmax(scores, dim=-1)

        # Apply dropout to the attention weights
        attn_weights = self.dropout(attn_weights)

        # Compute the context vectors
        output = torch.matmul(attn_weights, V)
        return output, attn_weights
        
    def split_heads(self, x):
        # Split into heads: [batch_size, seq_len, d_model] -> [batch_size, num_heads, seq_len, d_k]
        batch_size, seq_len, _ = x.size()
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        # Merge heads: [batch_size, num_heads, seq_len, d_k] -> [batch_size, seq_len, d_model]
        batch_size, _, seq_len, _ = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        
    def forward(self, Q, K, V, mask=None):
        # Apply the linear projections and split into heads
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        # Scaled dot-product attention, computed for all heads at once
        attn_output, attn_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # Merge the heads and apply the output projection
        attn_output = self.combine_heads(attn_output)
        output = self.W_o(attn_output)

        return output, attn_weights
    

if __name__ == "__main__":
    # Hyperparameters
    d_model = 512
    num_heads = 8
    dropout = 0.1

    # Dimensions of the dummy data
    seq_len = 10
    batch_size = 4

    # Build the module
    model = MultiHeadAttention(d_model, num_heads, dropout)

    # Random inputs (Q, K, V)
    Q = torch.randn(batch_size, seq_len, d_model)
    K = torch.randn(batch_size, seq_len, d_model)
    V = torch.randn(batch_size, seq_len, d_model)

    # Build a mask (optional)
    mask = torch.ones(batch_size, seq_len, seq_len)  # example: all ones, so nothing is masked

    # Forward pass
    output, attn_weights = model(Q, K, V, mask)

    print(output.shape)        # torch.Size([batch_size, seq_len, d_model])
    print(attn_weights.shape)  # torch.Size([batch_size, num_heads, seq_len, seq_len])

In practice, nn.MultiheadAttention can be called directly to get the same functionality:

import torch
import torch.nn as nn

# Hyperparameters
d_model, num_heads, dropout = 512, 8, 0.1
seq_len, batch_size = 10, 4

# Build the official module (includes the projection layers and the multi-head computation)
mha = nn.MultiheadAttention(
    embed_dim=d_model,
    num_heads=num_heads,
    dropout=dropout,
    batch_first=True  # inputs and outputs are [batch, seq, features]
)

# Random input (Q/K/V share the same tensor, i.e. self-attention)
Q = K = V = torch.randn(batch_size, seq_len, d_model)

# Build a causal mask: True strictly above the diagonal marks future positions
causal_mask = torch.triu(  # True means the position is masked out
    torch.ones(seq_len, seq_len), diagonal=1
).bool()

# Run the attention computation
output, attn_weights = mha(
    query=Q,
    key=K,
    value=V,
    attn_mask=causal_mask,          # hide future positions
    need_weights=True               # also return the attention weights
)

print("Output shape:", output.shape)        # [4, 10, 512]
print("Weights shape:", attn_weights.shape) # [4, 10, 10], averaged over heads