Authors:
(1) Suzanna Sia, Johns Hopkins University; (2) David Mueller; (3) Kevin Duh.
In-context learning (ICL) refers to the phenomenon in which large generative pretrained transformers (GPTs) perform tasks with no gradient updates when shown task examples or descriptions in their context (Brown et al., 2020; Bommasani et al., 2021). While in-context learning in GPT models appears to be applicable to virtually any natural language task, we use Machine Translation (MT) to study task location because there is little to no ambiguity in evaluating whether the model has recognised the task: it must generate tokens in a different language. While in-context MT has yet to reach parity with supervised neural MT models, its off-the-shelf translation performance is comparatively strong and suggests a promising direction for the future of MT (Hendy et al., 2023; Garcia et al., 2023).

1 Johns Hopkins University. Correspondence to: Suzanna Sia. Conference Paper Under Review.

Prior work on in-context MT has focused on prompt engineering, treating GPT models as black boxes and asking which examples to provide in-context (Moslem et al., 2023). Agrawal et al. (2022) apply similarity-based retrieval to select in-context examples, while Sia & Duh (2023) suggest a coherence-based approach. However, these works apply surface-level interventions, leaving the internal mechanism of MT in GPT models largely unexamined.
In this work, we ask: where does in-context Machine Translation occur in GPT models? We conduct an initial exploration into locating the self-attention layers responsible for in-context MT in three base pre-trained and one instruction-tuned open-source GPT models. Using causal masking over different parts of the context, we demonstrate that there exists a "task-recognition" point after which attention to the context is no longer necessary (Section 3). A potential implication is large computational savings when the context is several times longer than the test source sentence (Section 5). Having identified the layers in which "task recognition" occurs, we study whether the subsequent layers are redundant or remain critical to the task. Simple layer-wise masking shows that for 3B-parameter models, removing attention around the "task-recognition" layers can cause the model to fail to perform translation altogether, whereas layers towards the end of the model are far more redundant (Section 4.1).
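To make the intervention concrete, the sketch below shows the kind of per-layer attention mask such an experiment uses: standard causal masking everywhere, plus, from a chosen layer onward, blocking attention from the test-sentence positions back to the in-context prompt. This is a minimal NumPy illustration under our own assumptions, not the authors' actual implementation; all function names (`layer_masks`, `attention_weights`) and the toy dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(q, k, mask):
    # q, k: (seq, d); mask: (seq, seq) boolean, True = may attend.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # blocked positions get zero weight
    return softmax(scores, axis=-1)

def layer_masks(seq_len, ctx_len, mask_from_layer, n_layers):
    """One boolean attention mask per layer: causal everywhere, and from
    `mask_from_layer` onward, positions after the context may no longer
    attend to the first `ctx_len` (prompt) tokens."""
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    masks = []
    for layer in range(n_layers):
        m = causal.copy()
        if layer >= mask_from_layer:
            m[ctx_len:, :ctx_len] = False  # cut attention to the prompt
        masks.append(m)
    return masks
```

If translation quality is unchanged when `mask_from_layer` is set past some layer r, attention to the prompt after layer r was unnecessary, which is how a "task-recognition" point can be located; sweeping r over all layers turns this into the layer-wise redundancy probe described above.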
In-Context Learning was first demonstrated by Brown et al. (2020) who showed that GPT-3 could be used to perform a huge variety of tasks without any task-specific parameters or training, by conditioning the model’s generation on a prompt which included a few labeled examples of the task of interest. Since then, interest in using GPT models for ICL has grown significantly, with several recent works introducing methods such as instruction-tuning (Sanh et al., 2022; Wang et al., 2022) or chain-of-thought prompting (Wei et al., 2022) to improve downstream ICL accuracy.
In-Context Machine Translation While GPT models are strong few-shot learners, their pre-training data has historically been dominated by English, limiting their ability to perform translation tasks (Hendy et al., 2023). Lin et al. (2022) find that an explicitly multilingual GPT significantly outperforms traditional English-centric models such as GPT-3, and Garcia et al. (2023) find that such models can even be competitive with supervised MT models in some settings. However, even with explicit multilingual pre-training, in-context MT has been found to be very sensitive to the examples used (Liu et al., 2022) and their ordering (Lu et al., 2022). In response, recent work focuses on how to select prompts that elicit the best downstream MT performance (Agrawal et al., 2022; Sia & Duh, 2023). However, further improvement to translation with GPT models is limited by our understanding of how MT emerges in GPT models. Our work directly analyses when, in layer representations, a GPT model becomes a translation model via in-context learning, and how that may inform decisions around parameter tuning and redundancy.