Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on
a developmentally plausible corpus
- URL: http://arxiv.org/abs/2301.11796v1
- Date: Fri, 27 Jan 2023 15:52:50 GMT
- Title: Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on
a developmentally plausible corpus
- Authors: Alex Warstadt, Leshem Choshen, Aaron Mueller, Adina Williams, Ethan
Wilcox, Chengxu Zhuang
- Abstract summary: We present the call for papers for the BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus.
This shared task is intended for participants with an interest in small scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling.
- Score: 32.51325830633226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the call for papers for the BabyLM Challenge: Sample-efficient
pretraining on a developmentally plausible corpus. This shared task is intended
for participants with an interest in small scale language modeling, human
language acquisition, low-resource NLP, and cognitive modeling. In partnership
with CoNLL and CMCL, we provide a platform for approaches to pretraining with a
limited-size corpus sourced from data inspired by the input to children. The
task has three tracks, two of which restrict the training data to pre-released
datasets of 10M and 100M words and are dedicated to explorations of approaches
such as architectural variations, self-supervised objectives, or curriculum
learning. The final track only restricts the amount of text used, allowing
innovation in the choice of the data, its domain, and even its modality (i.e.,
data from sources other than text is welcome). We will release a shared
evaluation pipeline which scores models on a variety of benchmarks and tasks,
including targeted syntactic evaluations and natural language understanding.
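The targeted syntactic evaluations mentioned above are typically scored with minimal pairs: the model "passes" an item if it assigns higher probability to the grammatical sentence than to its ungrammatical counterpart. Below is a minimal sketch of that scoring idea using Hugging Face Transformers; the checkpoint name ("gpt2"), the helper function, and the example pair are illustrative assumptions, not the official BabyLM evaluation pipeline.

```python
# Minimal-pair syntactic evaluation sketch (illustrative only; not the
# official BabyLM pipeline). Assumes the `torch` and `transformers` packages
# and an arbitrary causal LM checkpoint ("gpt2" as a placeholder).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Total log-probability the model assigns to `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    # `loss` is the mean negative log-likelihood over the predicted
    # (shifted) positions, so undo the mean to get a summed log-prob.
    num_predicted = inputs["input_ids"].shape[1] - 1
    return -outputs.loss.item() * num_predicted

# A hypothetical BLiMP-style minimal pair (subject-verb agreement).
grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."
print(sentence_log_prob(grammatical) > sentence_log_prob(ungrammatical))
```

In a full evaluation suite, accuracy is the fraction of such minimal pairs on which the grammatical sentence receives the higher probability.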
Related papers
- TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment [30.93798042712827]
Training language models (LMs) and their application agents is increasingly costly due to large datasets and models.
We propose a pipeline to refine text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns.
Our experiments show that leaner pre-training boosts LM learning efficiency.
arXiv Detail & Related papers (2024-12-31T16:08:15Z)
- Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora [79.03392191805028]
The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners.
Participants compete to optimize language model training on a fixed language data budget of 100 million words or less.
arXiv Detail & Related papers (2024-12-06T16:06:08Z)
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows CoAnnotating to be an effective means of allocating work, with results on different datasets showing up to a 21% performance improvement over a random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information [100.03188187735624]
We introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model.
Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints - utterances that correspond to given intents.
Our method is thus able to leverage the expressive power of large language models to produce diverse training data.
arXiv Detail & Related papers (2023-02-10T07:37:49Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Leveraging Pre-Trained Language Models to Streamline Natural Language Interaction for Self-Tracking [25.28975864365579]
We propose a novel NLP task for self-tracking that extracts close- and open-ended information from a retrospective activity log.
To address the cold-start problem of bootstrapping a new tracking topic, the framework augments the prompt with synthetic samples, transforming the task into 10-shot learning.
arXiv Detail & Related papers (2022-05-31T01:58:04Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.