BERT Model Source Code Analysis (Part 7)


for _ in range(num_dims - 2):
  position_broadcast_shape.append(1)
position_broadcast_shape.extend([seq_length, width])  # extend to [..., seq_length, width]
position_embeddings = tf.reshape(position_embeddings,
                                 position_broadcast_shape)  # reshape for broadcasting
output += position_embeddings  # add the position embeddings into the output
output = layer_norm_and_dropout(output, dropout_prob)  # layer normalization and dropout
return output
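To make the broadcast above concrete, here is a minimal NumPy sketch (the sizes batch_size=2, seq_length=4, width=8 and the zero/random tensors are made up for illustration, not taken from modeling.py); the broadcasting rule that tf.reshape plus += relies on is the same one NumPy applies here.

import numpy as np

# Hypothetical sizes, for illustration only.
batch_size, seq_length, width = 2, 4, 8
num_dims = 3  # rank of `output`: [batch_size, seq_length, width]

output = np.zeros((batch_size, seq_length, width), dtype=np.float32)
# The position table is shared across the batch: shape [seq_length, width].
position_embeddings = np.random.randn(seq_length, width).astype(np.float32)

# Build the broadcast shape exactly as in the snippet above:
# one leading 1 per extra leading dimension, then [seq_length, width].
position_broadcast_shape = []
for _ in range(num_dims - 2):
    position_broadcast_shape.append(1)
position_broadcast_shape.extend([seq_length, width])
print(position_broadcast_shape)   # [1, 4, 8]

# Reshaping to [1, seq_length, width] lets the addition broadcast over the
# batch dimension, so every example in the batch gets the same position table.
output += position_embeddings.reshape(position_broadcast_shape)
print(output.shape)               # (2, 4, 8)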
Creating the mask
■ Create an attention mask from the input mask
def create_attention_mask_from_input_mask(from_tensor, to_mask):
  """Create 3D attention mask from a 2D tensor mask.

  Args:
    from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
    to_mask: int32 Tensor of shape [batch_size, to_seq_length].

  Returns:
    float Tensor of shape [batch_size, from_seq_length, to_seq_length].
  """
  # Get the shape of the input tensor.
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  # Get the shape of the mask tensor.
  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  # First reshape, then cast to float32.
  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  # We don't assume that `from_tensor` is a mask (although it could be). We
  # don't actually care if we attend *from* padding tokens (only *to* padding
  # tokens), so we create a tensor of all ones.
  #
  # `broadcast_ones` = [batch_size, from_seq_length, 1]
  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)

  # Here we broadcast along two dimensions to create the mask.
  mask = broadcast_ones * to_mask

  return mask
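As a quick usage sketch (not from the original post), the same two broadcasts can be replayed in NumPy to see what the function returns for a padded batch; the values below are made up:

import numpy as np

# Toy batch of two sequences padded to length 4 (1 = real token, 0 = padding).
to_mask = np.array([[1, 1, 1, 0],
                    [1, 1, 0, 0]], dtype=np.int32)
batch_size, to_seq_length = to_mask.shape
from_seq_length = to_seq_length   # self-attention case: from == to

# The same two broadcasts as the TF code above, written in NumPy.
to_mask_f = to_mask.reshape(batch_size, 1, to_seq_length).astype(np.float32)
broadcast_ones = np.ones((batch_size, from_seq_length, 1), dtype=np.float32)
mask = broadcast_ones * to_mask_f          # shape (2, 4, 4)

print(mask[0])
# [[1. 1. 1. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 0.]]

Every row of mask[b] is identical, which is the point of the all-ones broadcast: only the *to* (key) positions are restricted, never the *from* (query) positions.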
The attention layer
■ The attention layer (attention_layer)
def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
"""Performs multi-headed attention from `from_tensor` to `to_tensor`.
多头的注意力
This is an implementation of multi-headed attention
based on "Attention is all you Need".
这是一个多头注意力的实现,注意的才是需要的
如果from_tensor和to_tensor是一样的 , name这个注意力就是自己注意自己,也叫自注意力 。
If `from_tensor` and `to_tensor` are the same, then
this is self-attention. Each timestep in `from_tensor` attends to the
corresponding sequence in `to_tensor`, and returns a fixed-with vector.
先将from_tensor投射成query张量,并且将to_tensor投射成key和value张量 。
这将产生一系列张量 , 张量个数=头数,
其中每个张量的形状都是[批处理量,序列长度,头的大小]
This function first projects `from_tensor` into a "query" tensor and
`to_tensor` into "key" and "value" tensors. These are (effectively) a list
of tensors of length `num_attention_heads`, where each tensor is of shape
[batch_size, seq_length, size_per_head].
query 张量和key张量都是 点积的 和成比例的??? 。
通过softmax运算从而获取注意力数据 。
value 张量通过这些注意力数据差值计算得出 , 然后把它们连接成一个张量 。
Then, the query and key tensors are dot-producted and scaled. These are
softmaxed to obtain attention probabilities. The value tensors are then
interpolated by these probabilities, then concatenated back to a single
tensor and returned.
实际操作中,多头注意力进行转置和变形运算 , 而不是独立的张量运算 。
In practice, the multi-headed attention are done with transposes and
reshapes rather than actual separate tensors.
Args: 入参,输入张量 , 输出张量
from_tensor: float Tensor of shape [batch_size, from_seq_length,
from_width].
to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
注意力掩码
attention_mask: (optional) int32 Tensor of shape [batch_size,
from_seq_length, to_seq_length]. The values should be 1 or 0. The
attention scores will effectively be set to -infinity for any positions in
the mask that are 0, and will be unchanged for positions that are 1.
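Since only the docstring appears up to this point, here is a simplified NumPy sketch of the computation it describes. This is not the TF code in modeling.py: the projection weights are random placeholders, the sizes B, F, T, N, H are made up, and the -10000.0 adder is just one way to realize the "effectively -infinity" masking the docstring mentions.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes for illustration.
B, F, T = 2, 4, 4            # batch, from_seq_length, to_seq_length
N, H = 3, 8                  # num_attention_heads, size_per_head
W = N * H                    # hidden width

rng = np.random.default_rng(0)
from_tensor = rng.standard_normal((B, F, W)).astype(np.float32)
to_tensor = from_tensor                        # self-attention
attention_mask = np.ones((B, F, T), dtype=np.float32)
attention_mask[:, :, -1] = 0.0                 # pretend the last position is padding

# Random stand-ins for the learned query/key/value projection weights.
Wq, Wk, Wv = (rng.standard_normal((W, W)).astype(np.float32) for _ in range(3))

def split_heads(x):
    # [B, S, N*H] -> [B, N, S, H]: the transpose/reshape trick from the docstring.
    b, s, _ = x.shape
    return x.reshape(b, s, N, H).transpose(0, 2, 1, 3)

q = split_heads(from_tensor @ Wq)              # [B, N, F, H]
k = split_heads(to_tensor @ Wk)                # [B, N, T, H]
v = split_heads(to_tensor @ Wv)                # [B, N, T, H]

# Dot-product the queries and keys, scale by 1/sqrt(size_per_head).
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(H)        # [B, N, F, T]

# Masking: add a large negative number where attention_mask is 0 so that
# softmax drives those probabilities to (almost) zero.
adder = (1.0 - attention_mask[:, None, :, :]) * -10000.0
probs = softmax(scores + adder)                           # [B, N, F, T]

# "Interpolate" the values by the probabilities, then merge the heads back.
context = probs @ v                                       # [B, N, F, H]
context = context.transpose(0, 2, 1, 3).reshape(B, F, W)

print(context.shape)     # (2, 4, 24)
print(probs[0, 0, 0])    # last entry ~0: the padded position is never attended to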
