Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality

Chen, Siyu; Sheen, Heejune; Wang, Tianhao; Yang, Zhuoran

Session: 1D-2 (Neural Networks), Sunday, Jun 30, 16:30-17:45

Abstract: We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning (ICL) of multi-task linear regression. We establish the convergence of gradient flow under suitable choices of initialization, where the key technique is to reveal the spectral dynamics of the attention weights along gradient flow. In addition, we prove that the limiting model learned by gradient flow is on par with the best possible multi-head softmax attention model up to a constant factor. Our analysis also delineates a strict separation in ICL prediction accuracy between single-head and multi-head attention models. Furthermore, we prove that the model learned by gradient flow exhibits an interesting "task allocation" phenomenon, where each attention head focuses on solving a single task of the multi-task model. To the best of our knowledge, our work provides the first convergence result for the multi-head softmax attention model.
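
For readers unfamiliar with the setting, the following is a minimal sketch of what a multi-head softmax attention predictor for in-context multi-task regression might look like. The abstract does not specify the parametrization, so everything here is an assumption made for illustration: the function names, the bilinear score q^T W x, and the per-head output weights u are hypothetical and are not taken from the paper. Each head attends over the in-context examples and returns a softmax-weighted average of their labels; the heads are then combined with per-task output weights.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear(q, W, x):
    # Attention score q^T W x (assumed form, for illustration only).
    return sum(q[a] * W[a][b] * x[b]
               for a in range(len(q)) for b in range(len(x)))

def head_output(q, xs, ys, W):
    # One head: softmax-weighted average of the in-context labels ys,
    # with weights determined by the query q and the covariates xs.
    weights = softmax([bilinear(q, W, x) for x in xs])
    n_tasks = len(ys[0])
    return [sum(w * y[t] for w, y in zip(weights, ys))
            for t in range(n_tasks)]

def multi_head_predict(q, xs, ys, Ws, Us):
    # Combine heads: head h contributes u_h[t] * (head output)[t] to task t.
    # Ws and Us are hypothetical per-head attention and output weights.
    n_tasks = len(ys[0])
    pred = [0.0] * n_tasks
    for W, u in zip(Ws, Us):
        out = head_output(q, xs, ys, W)
        for t in range(n_tasks):
            pred[t] += u[t] * out[t]
    return pred

# Usage: two in-context examples, two tasks, two heads.
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [[2.0, 4.0], [6.0, 8.0]]
q = [1.0, 1.0]
zero = [[0.0, 0.0], [0.0, 0.0]]   # zero attention weights give uniform attention
Ws = [zero, zero]
Us = [[1.0, 0.0], [0.0, 1.0]]     # one-hot output weights: one head per task
print(multi_head_predict(q, xs, ys, Ws, Us))
```

In this toy configuration the one-hot output weights make each head responsible for exactly one task, which loosely mirrors the "task allocation" pattern the abstract describes; in the paper that pattern emerges from training rather than being imposed by hand.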