Catwalk: A Unified Language Model Evaluation Framework for Many Datasets
- URL: http://arxiv.org/abs/2312.10253v1
- Date: Fri, 15 Dec 2023 23:11:45 GMT
- Title: Catwalk: A Unified Language Model Evaluation Framework for Many Datasets
- Authors: Dirk Groeneveld and Anas Awadalla and Iz Beltagy and Akshita Bhagia
and Ian Magnusson and Hao Peng and Oyvind Tafjord and Pete Walsh and Kyle
Richardson and Jesse Dodge
- Abstract summary: Catwalk provides a unified interface to a broad range of existing NLP datasets and models.
Catwalk substantially lowers the barriers to conducting controlled experiments at scale.
- Score: 50.75378592254184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The success of large language models has shifted the evaluation paradigms in
natural language processing (NLP). The community's interest has drifted towards
comparing NLP models across many tasks, domains, and datasets, often at an
extreme scale. This imposes new engineering challenges: efforts in constructing
datasets and models have been fragmented, and their formats and interfaces are
incompatible. As a result, it often takes extensive (re)implementation efforts
to make fair and controlled comparisons at scale.
Catwalk aims to address these issues. Catwalk provides a unified interface to
a broad range of existing NLP datasets and models, ranging from both canonical
supervised training and fine-tuning, to more modern paradigms like in-context
learning. Its carefully-designed abstractions allow for easy extensions to many
others. Catwalk substantially lowers the barriers to conducting controlled
experiments at scale. For example, we finetuned and evaluated over 64 models on
over 86 datasets with a single command, without writing any code. Maintained by
the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2),
Catwalk is an ongoing open-source effort: https://github.com/allenai/catwalk.
Related papers
- How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z) - Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts)
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Placing Objects in Context via Inpainting for Out-of-distribution Segmentation [59.00092709848619]
Placing Objects in Context (POC) is a pipeline to realistically add objects to an image.
POC can be used to extend any dataset with an arbitrary number of objects.
We present different anomaly segmentation datasets based on POC-generated data and show that POC can improve the performance of recent state-of-the-art anomaly fine-tuning methods.
arXiv Detail & Related papers (2024-02-26T08:32:41Z) - Cramming: Training a Language Model on a Single GPU in One Day [64.18297923419627]
Recent trends in language modeling have focused on increasing performance through scaling.
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU.
We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings.
arXiv Detail & Related papers (2022-12-28T18:59:28Z) - Zero-Shot Learners for Natural Language Understanding via a Unified
Multiple Choice Perspective [26.41585967095811]
Zero-shot learning aims to train a model on a given task such that it can address new learning tasks without any additional training.
Our approach converts zero-shot learning into multiple-choice tasks, avoiding problems in commonly used large-scale generative models such as FLAN.
Our approach shows state-of-the-art performance on several benchmarks and produces satisfactory results on tasks such as natural language inference and text classification.
arXiv Detail & Related papers (2022-10-16T17:24:06Z) - CoreLM: Coreference-aware Language Model Fine-Tuning [0.0]
We propose a Fine-Tuning framework, named CoreLM, that extends the architecture of current Pretrained Language Models.
We make available information outside the contextual space of the model, which results in a better Language Model for a fraction of the computational cost.
Our proposed model achieves a lower Perplexity in GUMBY and LAMBDADA datasets when compared to GPT2 and a fine-tuned version of GPT2 without any changes.
arXiv Detail & Related papers (2021-11-04T08:44:31Z) - Program Synthesis with Large Language Models [40.41120807053989]
We evaluate large language models for program synthesis in Python.
We find that synthesis performance scales log-linearly with model size.
We find that even our best models are generally unable to predict the output of a program given a specific input.
arXiv Detail & Related papers (2021-08-16T03:57:30Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - A Data Efficient End-To-End Spoken Language Understanding Architecture [22.823732899634518]
We introduce a data efficient system which is trained end-to-end, with no additional, pre-trained external module.
The proposed model achieves a reasonable size and competitive results with respect to state-of-the-art while using a small training dataset.
arXiv Detail & Related papers (2020-02-14T10:24:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.