Posts tagged Attention Sink
Support Learnable Attention Sink
- 17 November 2025
Large-scale models assign significant attention to a few tokens (such as the initial tokens in the sequence) even when they are not semantically important, a phenomenon known as the attention sink [Xiao et al., 2024]. Researchers attribute this to the nature of \(softmax\), which forces the attention scores of each query token to sum to \(1\) over all key tokens in the context, even when a query token does not strongly attend to any key token at all [Gu et al., 2025]. Therefore, during training, we can deliberately add learnable sink tokens to the key sequence so that each query token can deposit its unneeded attention scores there, relaxing the "sum-up-to-one" constraint, as a learnable version of \(\textit{off-by-one}\space softmax\) [Miller, 2024].
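The idea above can be sketched in a few lines of PyTorch: append a learnable sink logit to each query's row of attention scores before the softmax, then drop the sink column afterwards, so rows no longer need to sum exactly to \(1\). This is a minimal single-head sketch, not the implementation from any of the cited papers; the class name `SinkAttention` and the choice of a single scalar sink parameter are assumptions for illustration.

```python
import torch
import torch.nn as nn


class SinkAttention(nn.Module):
    """Single-head attention with one learnable sink logit (illustrative sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.scale = d_model ** -0.5
        # Learnable sink logit shared across queries; initialized to 0,
        # so at init the sink absorbs as much weight as one average key.
        self.sink = nn.Parameter(torch.zeros(1))

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (batch, seq_len, d_model)
        scores = (q @ k.transpose(-2, -1)) * self.scale        # (B, Tq, Tk)
        b, tq, _ = scores.shape
        sink_col = self.sink.expand(b, tq, 1)                  # sink logit per query row
        scores = torch.cat([scores, sink_col], dim=-1)         # (B, Tq, Tk + 1)
        attn = torch.softmax(scores, dim=-1)[..., :-1]         # drop sink column:
        # each row now sums to 1 minus the sink's share, i.e. strictly less than 1
        return attn @ v
```

With the sink logit at its zero initialization, each row of real attention weights sums to strictly less than one, and training can push the sink logit up for queries that should attend to nothing, instead of forcing them to dump probability mass onto the initial tokens.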