Text Alignment Is An Efficient Unified Model for Massive NLP Tasks
- URL: http://arxiv.org/abs/2307.02729v2
- Date: Thu, 2 Nov 2023 03:49:19 GMT
- Title: Text Alignment Is An Efficient Unified Model for Massive NLP Tasks
- Authors: Yuheng Zha, Yichi Yang, Ruichen Li, Zhiting Hu
- Abstract summary: Next-word prediction is often not an efficient formulation for many NLP tasks.
We propose text alignment as an efficient unified model for a wide range of crucial tasks.
Our model delivers on par or even superior performance with much smaller model sizes.
- Score: 24.069447197357164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs), typically designed as a function of next-word
prediction, have excelled across extensive NLP tasks. Despite the generality,
next-word prediction is often not an efficient formulation for many of the
tasks, demanding an extreme scale of model parameters (10s or 100s of billions)
and sometimes yielding suboptimal performance. In practice, it is often
desirable to build more efficient models -- despite being less versatile, they
still apply to a substantial subset of problems, delivering on par or even
superior performance with much smaller model sizes. In this paper, we propose
text alignment as an efficient unified model for a wide range of crucial tasks
involving text entailment, similarity, question answering (and answerability),
factual consistency, and so forth. Given a pair of texts, the model measures
the degree of alignment between their information. We instantiate an alignment
model (Align) through lightweight finetuning of RoBERTa (355M parameters) using
5.9M examples from 28 datasets. Despite its compact size, extensive experiments
show the model's efficiency and strong performance: (1) On over 20 datasets of
aforementioned diverse tasks, the model matches or surpasses FLAN-T5 models
that have around 2x or 10x more parameters; the single unified model also
outperforms task-specific models finetuned on individual datasets; (2) When
applied to evaluate factual consistency of language generation on 23 datasets,
our model improves over various baselines, including the much larger GPT-3.5
(ChatGPT) and sometimes even GPT-4; (3) The lightweight model can also serve as
an add-on component for LLMs such as GPT-3.5 in question answering tasks,
improving the average exact match (EM) score by 17.94 and F1 score by 15.05
through identifying unanswerable questions.
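The exact match (EM) and F1 gains in point (3) come from abstaining on unanswerable questions. The sketch below illustrates that scoring logic under stated assumptions: it uses SQuAD-style answer normalization, which the abstract does not confirm as the paper's evaluation script, and `gate_answer`, `alignment_score`, and the `0.5` threshold are hypothetical names and values chosen for illustration, not the authors' API.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty (a correct abstention) scores 1.0, otherwise 0.0.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def gate_answer(answer: str, alignment_score: float, threshold: float = 0.5) -> str:
    """Hypothetical gate: return an empty answer when the alignment score is low."""
    return answer if alignment_score >= threshold else ""


if __name__ == "__main__":
    # Unanswerable question: the reference answer is the empty string.
    raw = "Paris"
    gated = gate_answer(raw, alignment_score=0.1)
    print(exact_match(raw, ""), exact_match(gated, ""))
```

In this toy case the ungated answer scores zero on an unanswerable question, while the gated (empty) answer is counted as correct, which is the mechanism by which abstention can raise average EM and F1.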
Related papers
- What Should Baby Models Read? Exploring Sample-Efficient Data Composition on Model Performance [0.0]
We evaluate several dataset sources, including child-directed speech (CHILDES), classic books (Gutenberg), synthetic data (TinyStories) and a mix of these across different model sizes.
Our experiments show that smaller models (e.g., GPT2-97M, GPT2-705M, Llama-360M) perform better when trained on more complex and rich datasets like Gutenberg.
arXiv Detail & Related papers (2024-11-11T02:37:21Z)
- Evaluating the Performance of Large Language Models for SDG Mapping (Technical Report) [6.789534723913505]
Large language models (LLMs) enable users to protect data privacy by eliminating the need to provide data to third parties.
We compare the performance of various language models on the Sustainable Development Goal mapping task.
According to the results of this study, LLaMA 2 and Gemma still have significant room for improvement.
arXiv Detail & Related papers (2024-08-05T03:05:02Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
- Explanation-based Finetuning Makes Models More Robust to Spurious Cues [21.327036110196637]
Large Language Models (LLMs) are so powerful that they sometimes learn correlations between labels and features that are irrelevant to the task.
We propose explanation-based finetuning as a general approach to mitigate LLMs' reliance on spurious correlations.
We finetune the model to additionally generate a free-text explanation supporting its answer.
arXiv Detail & Related papers (2023-05-08T18:53:45Z)
- Maximizing Use-Case Specificity through Precision Model Tuning [0.0]
We present an in-depth analysis of the performance of four transformer-based language models on the task of biomedical information retrieval.
Our findings suggest that smaller models, with <10B parameters and fine-tuned on domain-specific datasets, tend to outperform larger language models on highly specific questions.
arXiv Detail & Related papers (2022-12-29T07:50:14Z)
- Data-Efficient Finetuning Using Cross-Task Nearest Neighbors [75.07773863013001]
We use unlabeled target-task examples to retrieve the most similar labeled examples from a pool of multitask data augmented with prompts.
Our approach of finetuning models on cross-task nearest neighbors is significantly more data-efficient.
arXiv Detail & Related papers (2022-12-01T00:53:04Z)
- Scaling Instruction-Finetuned Language Models [126.4789306516927]
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance.
We find that instruction finetuning dramatically improves performance on a variety of model classes.
arXiv Detail & Related papers (2022-10-20T16:58:32Z)
- Are Sample-Efficient NLP Models More Robust? [90.54786862811183]
We investigate the relationship between sample efficiency (the amount of data needed to reach a given in-distribution (ID) accuracy) and robustness (how models fare on out-of-distribution (OOD) evaluation).
We find that higher sample efficiency is only correlated with better average OOD robustness on some modeling interventions and tasks, but not others.
These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent.
arXiv Detail & Related papers (2022-10-12T17:54:59Z)
- Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
arXiv Detail & Related papers (2021-12-20T17:05:11Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)