BabyLM Turns 4 and Goes Multilingual: Call for Papers for the 2026 BabyLM Workshop
- URL: http://arxiv.org/abs/2602.20092v2
- Date: Tue, 24 Feb 2026 17:51:23 GMT
- Title: BabyLM Turns 4 and Goes Multilingual: Call for Papers for the 2026 BabyLM Workshop
- Authors: Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Jaap Jumelet, Tal Linzen, Aaron Mueller, Suchir Salhan, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox
- Abstract summary: The goal of BabyLM is to stimulate new research connections between cognitive modeling and language model pretraining. This year, we move beyond our previous English-only pretraining datasets with a new Multilingual track, focusing on English, Dutch, and Chinese. For the workshop, we call for papers related to the overall theme of BabyLM, which includes training efficiency, small-scale training datasets, cognitive modeling, model evaluation, and architecture innovation.
- Score: 73.0356575273869
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of BabyLM is to stimulate new research connections between cognitive modeling and language model pretraining. We invite contributions in this vein to the BabyLM Workshop, which will also include the 4th iteration of the BabyLM Challenge. As in previous years, the challenge features two "standard" tracks (Strict and Strict-Small), in which participants must train language models on under 100M or 10M words of data, respectively. This year, we move beyond our previous English-only pretraining datasets with a new Multilingual track, focusing on English, Dutch, and Chinese. For the workshop, we call for papers related to the overall theme of BabyLM, which includes training efficiency, small-scale training datasets, cognitive modeling, model evaluation, and architecture innovation.
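As a rough illustration of the track budgets above, the sketch below checks whether a candidate pretraining corpus stays under the Strict (100M-word) or Strict-Small (10M-word) limit. It is not official BabyLM tooling: the corpus directory, file layout, and whitespace-based word count are assumptions, and the challenge organizers define the exact counting rules.

```python
# Minimal sketch (not official BabyLM tooling): check a corpus against the
# Strict and Strict-Small word budgets. Directory layout, .txt file format,
# and whitespace-based word counting are assumptions for illustration only.
from pathlib import Path

STRICT_BUDGET = 100_000_000       # words, Strict track
STRICT_SMALL_BUDGET = 10_000_000  # words, Strict-Small track

def count_words(corpus_dir: str) -> int:
    """Count whitespace-separated words across all .txt files in a directory."""
    total = 0
    for path in Path(corpus_dir).rglob("*.txt"):
        with path.open(encoding="utf-8") as f:
            for line in f:
                total += len(line.split())
    return total

if __name__ == "__main__":
    n = count_words("my_babylm_corpus")  # hypothetical corpus directory
    print(f"{n:,} words")
    print("Within Strict budget:", n <= STRICT_BUDGET)
    print("Within Strict-Small budget:", n <= STRICT_SMALL_BUDGET)
```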
Related papers
- BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data [30.00078536496021]
BabyBabelLM is a collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages.
arXiv Detail & Related papers (2025-10-11T10:50:47Z)
- Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora [84.03928547166873]
Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget.
arXiv Detail & Related papers (2025-04-10T23:22:43Z)
- Towards Data-Efficient Language Models: A Child-Inspired Approach to Language Learning [2.565964707090901]
We use various methods to train language models (LMs) with significantly less data than traditional large language models (LLMs). We develop a model trained on a curated dataset consisting of 10 million words, primarily sourced from child-directed transcripts. We reduce the vocabulary size to 32,000 tokens, aligning it with the limited vocabulary of children in the early stages of language acquisition.
arXiv Detail & Related papers (2025-03-06T16:57:26Z)
- BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop [77.62533643491747]
BabyLM aims to dissolve the boundaries between cognitive modeling and language modeling. We call for both workshop papers and for researchers to join the 3rd BabyLM competition.
arXiv Detail & Related papers (2025-02-15T02:46:43Z)
- BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context [2.57490464660469]
The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the number of words children are exposed to during development. New architectures for data-efficient language modelling outperformed models trained on trillions of words.
arXiv Detail & Related papers (2025-01-07T15:13:45Z)
- Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora [79.03392191805028]
The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or less.
arXiv Detail & Related papers (2024-12-06T16:06:08Z)
- A surprisal oracle for when every layer counts [2.5716627278119444]
Active Curriculum Language Modeling (ACLM) is a learner-directed approach to training a language model. We propose an updated ACLM process for the BabyLM 2024 task.
arXiv Detail & Related papers (2024-12-04T07:53:45Z)
- Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities [2.047424180164312]
Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges.
We introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English.
arXiv Detail & Related papers (2024-07-09T17:51:37Z)
- [Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus [81.34965784440176]
This CfP provides rules for the BabyLM Challenge 2024-2025.
The overarching goals of the challenge remain the same.
We replace the loose track with a paper track.
We relax the rules around pretraining data.
We introduce a multimodal vision-and-language track.
arXiv Detail & Related papers (2024-04-09T11:04:50Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- Tencent AI Lab - Shanghai Jiao Tong University Low-Resource Translation System for the WMT22 Translation Task [49.916963624249355]
This paper describes Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) Low-Resource Translation systems for the WMT22 shared task.
We participate in the general translation task on English⇔Livonian.
Our system is based on M2M100 with novel techniques that adapt it to the target language pair.
arXiv Detail & Related papers (2022-10-17T04:34:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.