Attention works by having the network learn an attention score for each part of the input, indicating how important that part is. These scores are turned into weights that are used to aggregate the input, which lets the network concentrate on the most informative parts of the data.
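
As a minimal sketch (in NumPy, with made-up toy dimensions), a single query scores each item of the input, the scores are softmax-normalized into weights, and the weighted sum of the items is the aggregated output:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
query = rng.normal(size=(8,))      # query embedding (dim 8)
items = rng.normal(size=(4, 8))    # 4 parts of the input, dim 8 each

scores = items @ query             # one raw attention score per item
weights = softmax(scores)          # normalized importance weights (sum to 1)
context = weights @ items          # weighted sum: the aggregated representation
```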

Cross-attention computes attention between two different sequences. Self-attention computes the attention of a sequence with respect to itself: for a given token of the input, we compute an attention weight over all the other tokens of that same sequence.
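
A sketch of the difference, assuming plain dot-product attention and arbitrary toy shapes: the same `attend` function does self-attention when the queries, keys, and values all come from one sequence, and cross-attention when the queries come from a different sequence than the keys and values:

```python
import numpy as np

def attend(queries, keys, values):
    """Dot-product attention: each query aggregates the values,
    weighted by its similarity to the keys."""
    scores = queries @ keys.T                          # (n_q, n_kv) raw scores
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values                            # (n_q, d) aggregated outputs

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))   # sequence A: 5 tokens, dim 16
y = rng.normal(size=(7, 16))   # sequence B: 7 tokens, dim 16

self_attn  = attend(x, x, x)   # self-attention: queries, keys, values all from x
cross_attn = attend(x, y, y)   # cross-attention: x queries attend over y
```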

There are a few different ways to compute the attention score between two embeddings: the dot product, cosine similarity, a bilinear form (score(a, b) = a^T W b), or an MLP.
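
A sketch of these scoring functions on two random embeddings (the weight matrices here are random stand-ins for what would normally be learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
a, b = rng.normal(size=(d,)), rng.normal(size=(d,))   # two embeddings

# Dot product
dot_score = a @ b

# Cosine similarity: dot product of the L2-normalized embeddings
cos_score = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Bilinear: score(a, b) = a^T W b, with a (normally learned) matrix W
W = rng.normal(size=(d, d))
bilinear_score = a @ W @ b

# MLP: a small network scores the concatenated pair
W1 = rng.normal(size=(2 * d, 32))
w2 = rng.normal(size=(32,))
mlp_score = np.tanh(np.concatenate([a, b]) @ W1) @ w2
```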