Building models that understand and generate natural language well is one the grand goals of machine learning (ML) research and has a direct impact on building smart systems for everyday applications. Improving the quality of language models is a key target for researchers to make progress toward such a goal.
Most common paradigms to build and train language models use either autoregressive decoder-only architectures (e.g., PaLM or GPT-3), where the model is trained to predict the next word for a given prefix phrase, or span corruption-based encoder-decoder architectures (e.g., T5, ST-MoE), where the training objective is to recover the subset of words masked out of the input. On the one hand, T5-like models perform well on supervised fine-tuning tasks, but struggle with few-shot in-context learning. On the other hand, autoregressive language models are great for open-ended generation (e.g., dialog generation with LaMDA) and prompt-based learning (e.g., in-context learning with PaLM), but may perform suboptimally on fine-tuning tasks. Thus, there remains an opportunity to create an effective unified framework for pre-training models.
In “Unifying Language Learning Paradigms”, we present a novel language pre-training paradigm called Unified Language Learner (UL2) that improves the performance of language models universally across datasets and setups. UL2 frames different objective functions for training language models as denoising tasks, where the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers that samples from a varied set of such objectives, each with different configurations. We demonstrate that models trained using the UL2 framework perform well in a variety of language domains, including prompt-based few-shot learning and models fine-tuned for down-stream tasks. Additionally, we show that UL2 excels in generation, language understanding, retrieval, long-text understanding and question answering tasks. Finally, we are excited to publicly release the checkpoints for our best performing UL2 20 billion parameter model.
Background: Language Modeling Objectives and Architectures
Common objective functions for training language models can mostly be framed as learning data transformations that map inputs to targets. The model is conditioned on different forms of input to predict target tokens. To this end, different objectives utilize different properties of the inputs.
The standard Causal Language modeling objective (CausalLM) is trained to predict full sequence lengths and so, only recognizes tokens in the target output. The prefix language modeling objective (PrefixLM) modifies this process by randomly sampling a contiguous span of k tokens from the given tokenized text to form the input of the model, referred to as the “prefix”. The span corruption objective masks contiguous spans from the inputs and trains the model to predict these masked spans.
In the table below, we list the common objectives on which state-of-the-art language models are trained along with different characteristics of the input, i.e., how it is presented to the model. Moreover, we characterize the example efficiency of each objective in terms of the ability of the model for exploiting supervision signals from a single input, e.g., how much of the input tokens contribute to the calculation of the loss.
Efficiency CausalLM none text N/A full seq_len PrefixLM text (up to position k) text (after position k) contiguous seq_len – k Span corruption masked text masked_tokens non-contiguous, may be bi-directional typically lower than others Common objectives used in today’s language models. Throughout, “text” indicates tokenized text.
UL2 leverages the strengths of each of these objective functions through a framework that generalizes over each of them, which enables the ability to reason and unify common pre-training objectives. Based on this framework, the main task for training a language model is to learn the transformation of a sequence of input tokens to a sequence of target tokens. Then all the objective functions introduced above can be simply reduced to different ways of generating input and target tokens. For instance, the PrefixLM objective can be viewed as a transformation that moves a segment of k contiguous tokens from the inputs to the targets. Meanwhile, the span corruption objective is a data transformation that corrupts spans (a subsequence of tokens in the input), replacing them with mask tokens that are shifted to the targets.
It is worth noting that one can decouple the model architecture and the objective function with which it’s trained. Thus, it is possible to train different architectures, such as the common single stack decoder-only and two-stack encoder-decoder models, with any of these objectives.
Mixture of Denoisers
The UL2 framework can be used to train a model on a mixture of pre-training objectives and supply it with capabilities and inductive bias benefits from different pre-training tasks. Training on the mixture helps the model leverage the strengths of different tasks and mitigates the weaknesses of others. For instance, the mixture-of-denoisers objective can strongly improve the prompt-based learning capability of the model as opposed to a span corruption-only T5 model.
UL2 is trained using a mixture of three denoising tasks: (1) R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective; (2) X-denoising (or extreme span corruption); and (3) S-denoising (or sequential PrefixLM). During pre-training, we sample from the available denoising tasks based on user-specified ratios (i.e., different combinations of the R, X, and S-denoisers) and prepare the input and target appropriately. Then, a paradigm token is appended to the input (one of [R], [X], or [S]) indicating the denoising task at hand.
Improving Trade-Offs Across Learning Paradigms
Many existing commonly used language learning paradigms typically excel at one type of task or application, such as fine-tuning performance or prompt-based in-context learning. In the plot below, we show baseline objective functions on different tasks compared to UL2: CausalLM (referred to as GPT-like), PrefixLM, Span Corrupt (also referred to as T5 in the plot), and a baseline objective function proposed by UniLM. We use these objectives for training decoder only architectures (green) and encoder-decoder architectures (blue) and evaluate different combinations of objective functions and architectures on two main sets of tasks:
Fine-tuning, by measuring performance on SuperGLUE (y-axis of the plot below) In-context learning, by measuring performance of the model on a suite of 1-shot GEM tasks (e.g., XSUM, SGD or Schema guided dialog and TOTTO) (x-axis of the plot below).
For most of the existing language learning paradigms, there is a trade-off between the quality of the model on these two sets of tasks. We show that UL2 bridges this trade-off across in-context learning and fine-tuning.
In both decoder-only and encoder-decoder setups, UL2 strikes a significantly improved balance in performance between fine-tuned discriminative tasks and prompt-based 1-shot open-ended text generation compared to previous methods. (All models are comparable in terms of computational costs, i.e., FLOPs (EncDec models are 300M and Dec models are 150M parameters).
UL2 for Few-Shot Prompting and Chain-of-Thought Reasoning
We scale up UL2 and train a 20 billion parameter encoder-decoder model on the public C4 corpus and demonstrate some impressive capabilities of the UL2 20B model.
UL2 is a powerful in-context learner that excels at both few-shot and chain-of-thought (CoT) prompting. In the table below, we compare UL2 with other state-of-the-art models (e.g, T5 XXL and PaLM) for few-shot prompting on the XSUM summarization dataset. Our results show that UL2 20B outperforms PaLM and T5, both of which are in the same ballpark of compute cost.
Model ROUGE-1 ROUGE-2 ROUGE-L LaMDA 137B – 5.4 – PaLM 62B – 11.2 – PaLM 540B – 12.2 – PaLM 8B – 4.5 – T5 XXL 11B 0.6 0.1 0.6 T5 XXL 11B + LM 13.3 2.3 10.7 UL2 20B 25.5 8.6 19.8 Comparison of UL2 with T5 XXL, PaLM and LamDA 137B on 1-shot summarization (XSUM) in terms of ROUGE-1/2/L (higher is better), which captures the quality by comparing the generated summaries with the gold summaries as reference.
Most CoT prompting results have been obtained using much larger language models, such as GPT-3 175B, PaLM 540B, or LaMDA 137B. We show that reasoning via CoT prompting can be achieved with UL2 20B, which is both publicly available and several times smaller than prior models that leverage chain-of-thought prompting. This enables an open avenue for researchers to conduct research on CoT prompting and reasoning at an accessible scale. In the table below, we show that for UL2, CoT prompting outperforms standard prompting on math word problems with a range of difficulties (GSM8K, SVAMP, ASDiv, AQuA, and MAWPS). We also show that self-consistency further improves performance.
Chain-of-thought (CoT) prompting and self-consistency (SC) results on five arithmetic reasoning benchmarks.
Conclusion and Future Directions
UL2 demonstrates superior performance on a plethora of fine-tuning and few-shot tasks. We publicly release checkpoints of our best performing UL2 model with 20 billion parameters, which we hope will inspire faster progress in developing better language models in the machine learning community as a whole.
It was an honor and privilege to work on this with Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby and Donald Metzler. We further acknowledge Alexey Gritsenko, Andrew M. Dai, Jacob Devlin, Jai Gupta, William Fedus, Orhan Firat, Sebastian Gerhmann, Nan Du, Dave Uthus, Siamak Shakeri, Slav Petrov and Quoc Le for support and discussions. We thank the Jax and T5X team for building such wonderful infrastructure that made this research possible.