LongTail-Swap: benchmarking language models' abilities on rare words
- URL: http://arxiv.org/abs/2510.04268v1
- Date: Sun, 05 Oct 2025 16:17:33 GMT
- Title: LongTail-Swap: benchmarking language models' abilities on rare words
- Authors: Robin Algayres, Charles-Éric Saint-James, Mahi Luthra, Jiayi Shen, Dongyan Lin, Youssef Benchekroun, Rashel Moritz, Juan Pino, Emmanuel Dupoux
- Abstract summary: LongTail-Swap is a benchmark that focuses on the tail of the distribution. It measures the ability of LMs to learn new words with very little exposure. LT-Swap is a pretraining corpus-specific test set of acceptable versus unacceptable sentence pairs.
- Score: 16.946063624357745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Children learn to speak with a low amount of data and can be taught new words on a few-shot basis, making them particularly data-efficient learners. The BabyLM challenge aims at exploring language model (LM) training in the low-data regime but uses metrics that concentrate on the head of the word distribution. Here, we introduce LongTail-Swap (LT-Swap), a benchmark that focuses on the tail of the distribution, i.e., measures the ability of LMs to learn new words with very little exposure, like infants do. LT-Swap is a pretraining corpus-specific test set of acceptable versus unacceptable sentence pairs that isolate semantic and syntactic usage of rare words. Models are evaluated in a zero-shot fashion by computing the average log probabilities over the two members of each pair. We built two such test sets associated with the 10M words and 100M words BabyLM training sets, respectively, and evaluated 16 models from the BabyLM leaderboard. Our results not only highlight the poor performance of language models on rare words but also reveal that performance differences across LM architectures are much more pronounced in the long tail than in the head. This offers new insights into which architectures are better at handling rare word generalization. We've also made the code publicly available.
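As a concrete illustration of this zero-shot protocol, here is a minimal sketch in Python, assuming a Hugging Face-style causal LM ("gpt2" is a placeholder rather than one of the 16 evaluated BabyLM models, and the example pair is invented); the authors' released code is the authoritative implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; LT-Swap evaluates models from the BabyLM leaderboard.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_logprob(sentence: str) -> float:
    """Average per-token log probability of a sentence under the LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position t predicts token t+1, so shift logits against targets.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

# A pair is scored correct when the acceptable member is more probable.
acceptable = "She petted the axolotl."
unacceptable = "She petted the axolotl slept."
correct = avg_logprob(acceptable) > avg_logprob(unacceptable)
```

Benchmark accuracy is then the fraction of pairs scored correct; averaging log probabilities per token, rather than summing, keeps pairs of unequal length comparable.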
Related papers
- Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora [84.03928547166873]
Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget.
arXiv Detail & Related papers (2025-04-10T23:22:43Z)
- Towards Data-Efficient Language Models: A Child-Inspired Approach to Language Learning [2.565964707090901]
We use various methods of training language models (LMs) with significantly less data compared to traditional large language models (LLMs). We develop a model trained on a curated dataset consisting of 10 million words, primarily sourced from child-directed transcripts. We reduce the vocabulary size to 32,000 tokens, aligning it with the limited vocabulary of children in the early stages of language acquisition.
arXiv Detail & Related papers (2025-03-06T16:57:26Z)
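As a sketch of how such a capped vocabulary might be produced, assuming the Hugging Face tokenizers library and a hypothetical transcripts.txt of child-directed speech (the paper's actual tokenizer settings are not given here):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a BPE tokenizer capped at 32,000 tokens on child-directed transcripts.
# "transcripts.txt" and the special tokens are assumptions for illustration.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["transcripts.txt"], trainer=trainer)
tokenizer.save("babylm-32k.json")
```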
- BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context [2.57490464660469]
The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the number of words children are exposed to during development. The challenge produced new architectures for data-efficient language modelling that outperformed models trained on trillions of words.
arXiv Detail & Related papers (2025-01-07T15:13:45Z)
- The Neglected Tails in Vision-Language Models [51.79913798808725]
We show that vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts.
We propose REtrieval-Augmented Learning (REAL) to mitigate the imbalanced performance of zero-shot VLMs.
arXiv Detail & Related papers (2024-01-23T01:25:00Z)
- Paloma: A Benchmark for Evaluating Language Model Fit [112.481957296585]
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training. We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains.
arXiv Detail & Related papers (2023-12-16T19:12:45Z)
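To make the contrast with a single held-out perplexity concrete, a minimal sketch of per-domain perplexity, computed as the exponential of the mean per-token negative log-likelihood (the domain names and NLL values below are invented):

```python
import math

def perplexity(nlls: list[float]) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(nlls) / len(nlls))

# Invented per-token NLLs for two domains. Reporting one perplexity per
# domain, instead of one number over pooled data, shows where a model
# fits well and where it does not.
domains = {"news": [2.1, 3.0, 2.4], "code": [1.2, 0.9, 1.5]}
for name, nlls in domains.items():
    print(f"{name}: ppl = {perplexity(nlls):.2f}")
```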
- The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT), which restricts embedding entries to the language of interest to improve time and memory efficiency.
We apply two heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - across different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
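A rough sketch of the Unicode-based script filtering idea, assuming the goal is to keep only vocabulary entries written in a target script (Latin here); the paper's exact procedure and its corpus-based selection step are not reproduced:

```python
import unicodedata

def in_script(token: str, script_prefix: str = "LATIN") -> bool:
    """Keep a token only if every alphabetic character belongs to the
    target script, judged by its Unicode character name."""
    return all(
        unicodedata.name(ch, "").startswith(script_prefix)
        for ch in token
        if ch.isalpha()
    )

vocab = ["hello", "мир", "42", "世界", "##ing"]
trimmed = [tok for tok in vocab if in_script(tok)]
# trimmed == ["hello", "42", "##ing"]
```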
- Pre-training LLMs using human-like development data corpus [3.5757761767474876]
We pre-train and evaluate Large Language Models (LLMs) on their ability to learn contextual word representations using roughly the same number of tokens as seen by children.
We provide a strong set of baselines with different architectures, evaluate changes in performance across epochs, and report pre-training metrics for the strict-small and strict tracks of the task.
arXiv Detail & Related papers (2023-11-08T13:13:23Z)
- Mini Minds: Exploring Bebeshka and Zlata Baby Models [3.558894829990311]
We describe the University of Lyon 2 submission to the Strict-Small track of the BabyLM competition.
We introduce two small-size language models (LMs) that were submitted for evaluation.
Despite being half the scale of the baseline LMs, our proposed models achieve comparable performance.
arXiv Detail & Related papers (2023-11-06T16:01:10Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
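As a rough illustration of the kind of word-order ablation this line of work relies on (a schematic reconstruction, not the paper's exact procedure), shuffling words within each sentence destroys syntax while preserving bag-of-words co-occurrence statistics:

```python
import random

def shuffle_sentence(sentence: str, seed: int = 0) -> str:
    """Permute word order within a sentence, leaving the bag of words intact."""
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

# Pretraining on corpora perturbed this way tests whether MLM performance
# depends on syntax or only on co-occurrence statistics.
print(shuffle_sentence("the cat sat on the mat"))
```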