Paloma: A Benchmark for Evaluating Language Model Fit
- URL: http://arxiv.org/abs/2312.10523v2
- Date: Sat, 07 Dec 2024 20:22:22 GMT
- Title: Paloma: A Benchmark for Evaluating Language Model Fit
- Authors: Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge,
- Abstract summary: Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training.
We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains.
- Score: 112.481957296585
- License:
- Abstract: Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains--varying distributions of language. We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains, instead of assuming perplexity on one distribution extrapolates to others. We include two new datasets of the top 100 subreddits (e.g., r/depression on Reddit) and programming languages (e.g., Java on GitHub), both sources common in contemporary LMs. With our benchmark, we release 6 baseline 1B LMs carefully controlled to provide fair comparisons about which pretraining corpus is best and code for others to apply those controls to their own experiments. Our case studies demonstrate how the fine-grained results from Paloma surface findings such as that models pretrained without data beyond Common Crawl exhibit anomalous gaps in LM fit to many domains or that loss is dominated by the most frequently occurring strings in the vocabulary.
Related papers
- TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking [6.070192392563392]
We present TituLLMs, the first large pretrained Bangla LLMs in 1B and 3B parameter sizes.
To train TituLLMs, we collected a pretraining dataset of approximately 37 billion tokens.
We extended the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge.
arXiv Detail & Related papers (2025-02-16T16:22:23Z) - Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models [10.482557806309174]
We introduce LexBench, a comprehensive evaluation suite enabled to test language models (LMs) on semantic phrase processing tasks.
Thanks to ourbenchmark, we assess the performance of 15 LMs across model architectures and parameter scales in classification, extraction, and interpretation tasks.
Our benchmarking findings can serve future research aiming to improve the generic capability of LMs on semantic phrase comprehension.
arXiv Detail & Related papers (2024-05-05T09:20:38Z) - Acquiring Linguistic Knowledge from Multimodal Input [10.965306219502303]
In contrast to children, language models (LMs) exhibit considerably inferior data efficiency when acquiring language.
We test the hypothesis that this data efficiency gap is partly caused by a lack of multimodal input and grounding in the learning environment of typical language models.
arXiv Detail & Related papers (2024-02-27T23:29:10Z) - Are Pre-trained Language Models Useful for Model Ensemble in Chinese
Grammatical Error Correction? [10.302225525539003]
We explore several ensemble strategies based on strong PLMs with four sophisticated single models.
The performance does not improve but even gets worse after the PLM-based ensemble.
arXiv Detail & Related papers (2023-05-24T14:18:52Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z) - Bridging the Data Gap between Training and Inference for Unsupervised
Neural Machine Translation [49.916963624249355]
A UNMT model is trained on the pseudo parallel data with translated source, and natural source sentences in inference.
The source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses the pseudo parallel data natural source, translated target to mimic the inference scenario.
arXiv Detail & Related papers (2022-03-16T04:50:27Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Pre-training Multilingual Neural Machine Translation by Leveraging
Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across a diverse setting, including low, medium, rich resource, and as well as transferring to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.