Attention Mechanisms in Transformers: Comparing MHA, MQA, and GQA

Background The Transformer (Vaswani et al., 2017) is a model based on the encoder-decoder architecture. This model has demonstrated outstanding performance in the field of natural language processing (NLP), leading to a series of optimized models based on it, such as BERT (Devlin et al., 2018) which uses only the encoder, GPT (Radford et al., 2018) series which uses only the decoder, and subsequent large language models (LLMs) like LLaMA (Touvron et al., 2023) and GPT-4 (OpenAI et al., 2024), most of which adopt a decoder-only architecture. ...

2025-01-16 · 29 min · 6139 words · Yue Shui