Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
- URL: http://arxiv.org/abs/2301.11796v1
- Date: Fri, 27 Jan 2023 15:52:50 GMT
- Title: Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
- Authors: Alex Warstadt, Leshem Choshen, Aaron Mueller, Adina Williams, Ethan Wilcox, Chengxu Zhuang
- Abstract summary: We present the call for papers for the BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus.
This shared task is intended for participants with an interest in small-scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling.
- Score: 32.51325830633226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the call for papers for the BabyLM Challenge: Sample-efficient
pretraining on a developmentally plausible corpus. This shared task is intended
for participants with an interest in small-scale language modeling, human
language acquisition, low-resource NLP, and cognitive modeling. In partnership
with CoNLL and CMCL, we provide a platform for approaches to pretraining with a
limited-size corpus sourced from data inspired by the input to children. The
task has three tracks, two of which restrict the training data to pre-released
datasets of 10M and 100M words and are dedicated to explorations of approaches
such as architectural variations, self-supervised objectives, or curriculum
learning. The final track only restricts the amount of text used, allowing
innovation in the choice of the data, its domain, and even its modality (i.e.,
data from sources other than text is welcome). We will release a shared
evaluation pipeline which scores models on a variety of benchmarks and tasks,
including targeted syntactic evaluations and natural language understanding.
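The abstract mentions targeted syntactic evaluations as one component of the shared evaluation pipeline. As a rough illustration of how such evaluations typically work (this is not the challenge's actual pipeline), the sketch below scores a causal language model on minimal pairs, counting a pair as correct when the model assigns higher total log-probability to the grammatical sentence; the model checkpoint, the example pairs, and the scoring convention are all assumptions.

```python
# Minimal BLiMP-style sketch: a model "passes" a minimal pair if it assigns
# a higher total log-probability to the grammatical sentence. The checkpoint
# and the example pairs below are placeholders, not the official pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM checkpoint could be used here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of token log-probabilities under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the returned loss is the mean negative log-likelihood
        # over the predicted tokens; multiply back to recover the total.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

# Hypothetical minimal pairs: (grammatical, ungrammatical)
pairs = [
    ("The cats near the door are sleeping.",
     "The cats near the door is sleeping."),
    ("The author that the critics praised won the award.",
     "The author that the critics praised were won the award."),
]

correct = sum(sentence_logprob(good) > sentence_logprob(bad) for good, bad in pairs)
print(f"minimal-pair accuracy: {correct / len(pairs):.2f}")
```

In practice a pipeline like this would iterate over many phenomenon-specific sets of minimal pairs and report per-phenomenon accuracy alongside the natural language understanding benchmarks.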
Related papers
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows CoAnnotating to be an effective means of allocating work, with up to a 21% performance improvement over a random baseline across different datasets.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Pre-Training to Learn in Context [138.0745138788142]
The ability of in-context learning is not fully exploited because language models are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x the parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z)
- Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information [100.03188187735624]
We introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that measures the usefulness of a datapoint for training a model (see the illustrative sketch after this list).
Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints: utterances that correspond to given intents.
Our method is thus able to leverage the expressive power of large language models to produce diverse training data.
arXiv Detail & Related papers (2023-02-10T07:37:49Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Leveraging Pre-Trained Language Models to Streamline Natural Language Interaction for Self-Tracking [25.28975864365579]
We propose a novel NLP task for self-tracking that extracts close- and open-ended information from a retrospective activity log.
The framework augments the prompt with synthetic samples to turn the task into 10-shot learning, addressing the cold-start problem of bootstrapping a new tracking topic.
arXiv Detail & Related papers (2022-05-31T01:58:04Z)
- ORCA: Interpreting Prompted Language Models via Locating Supporting Data Evidence in the Ocean of Pretraining Data [38.20984369410193]
Large pretrained language models have been performing increasingly well in a variety of downstream tasks via prompting.
It remains unclear where the model learns task-specific knowledge from, especially in a zero-shot setup.
In this work, we want to find evidence of the model's task-specific competence in pretraining and are specifically interested in locating a very small subset of the pretraining data.
arXiv Detail & Related papers (2022-05-25T09:25:06Z)
- On the Use of External Data for Spoken Named Entity Recognition [40.93448412171246]
Recent advances in self-supervised speech representations have made it feasible to consider learning models with limited labeled data.
We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline approaches.
arXiv Detail & Related papers (2021-12-14T18:49:26Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
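The "Selective In-Context Data Augmentation" entry above scores datapoints with pointwise V-information (PVI). As a hedged sketch of that metric (not the paper's implementation), PVI for a labeled example (x, y) can be computed from two models, one fine-tuned with inputs and one fine-tuned on a null input: PVI(x -> y) = -log2 g_null(y) + log2 g(y | x). The callables and probabilities below are assumptions used only for illustration.

```python
import math
from typing import Callable

def pointwise_v_information(
    prob_with_input: Callable[[str, str], float],  # assumed: returns P(y | x) from a model fine-tuned with inputs
    prob_null_input: Callable[[str], float],       # assumed: returns P(y) from a model fine-tuned on a null input
    x: str,
    y: str,
) -> float:
    """PVI(x -> y) = -log2 g_null(y) + log2 g(y | x).

    Positive values mean the input x makes the label y easier to predict,
    i.e. the datapoint carries usable information for the model.
    """
    return -math.log2(prob_null_input(y)) + math.log2(prob_with_input(x, y))

# Toy usage with made-up probabilities: the label becomes much more likely
# once the utterance is visible, so this example scores a high PVI.
example_pvi = pointwise_v_information(
    prob_with_input=lambda x, y: 0.9,
    prob_null_input=lambda y: 0.1,
    x="book a table for two tonight",
    y="restaurant_reservation",
)
print(f"PVI = {example_pvi:.2f} bits")
```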
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.