Language model developers should report train-test overlap
- URL: http://arxiv.org/abs/2410.08385v1
- Date: Thu, 10 Oct 2024 21:44:56 GMT
- Title: Language model developers should report train-test overlap
- Authors: Andy K Zhang, Kevin Klyman, Yifan Mai, Yoav Levine, Yian Zhang, Rishi Bommasani, Percy Liang
- Abstract summary: We document the practices of 30 model developers, finding that just 9 developers report train-test overlap.
We hope our work increases transparency into train-test overlap to increase the community-wide trust in model evaluations.
- Score: 52.523638165129505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap, which refers to the extent to which the language model is trained on the very data it is being tested on. The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot directly measure train-test overlap since they do not have access to the training data. To make this clear, we document the practices of 30 model developers, finding that just 9 developers report train-test overlap: 4 developers release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 developers publish their train-test overlap methodology and statistics. By engaging with language model developers, we provide novel information about train-test overlap for three additional developers. Overall, we take the position that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets. We hope our work increases transparency into train-test overlap to increase the community-wide trust in model evaluations.
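As a concrete illustration of what reporting such a statistic could involve, the sketch below flags a test example as overlapping when any of its word n-grams also appears in the training corpus. This is a minimal sketch only, not the methodology of any developer surveyed in the paper; the n-gram length, the at-least-one-match criterion, and the in-memory placeholder corpora are assumptions made for the example.

```python
# Minimal sketch: estimate train-test overlap as the fraction of test examples
# that share at least one word n-gram with the training corpus. Illustrative
# only; the n-gram length, matching criterion, and in-memory corpora are
# assumptions, not the methodology of any developer discussed above.
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Word n-grams of a lowercased string (n = 13 is an arbitrary choice here)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_train_index(train_docs: Iterable[str], n: int = 13) -> Set[Tuple[str, ...]]:
    """Collect every n-gram that appears anywhere in the training corpus."""
    index: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index


def train_test_overlap(test_examples: List[str], train_index: Set[Tuple[str, ...]], n: int = 13) -> float:
    """Fraction of test examples sharing at least one n-gram with training data."""
    if not test_examples:
        return 0.0
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_index)
    return flagged / len(test_examples)


# Hypothetical usage with placeholder in-memory corpora:
train_corpus = ["..."]  # training documents would go here
test_set = ["..."]      # public test set examples would go here
index = build_train_index(train_corpus)
print(f"train-test overlap: {train_test_overlap(test_set, index):.2%}")
```

At real scale this would use hashed n-grams or suffix-array lookups over on-disk data rather than Python sets, but the reported statistic keeps the same shape: the share of test instances with detected overlap.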
Related papers
- TÜLU 3: Pushing Frontiers in Open Language Model Post-Training [94.14908801708049]
We introduce TÜLU 3, a family of fully-open state-of-the-art post-trained models.
TÜLU 3 builds on Llama 3.1 base models and achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5 Haiku.
arXiv Detail & Related papers (2024-11-22T18:44:04Z)
- Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data [75.7383558074758]
This work presents an Open Whisper-style Speech Model (OWSM).
OWSM reproduces Whisper-style training using an open-source toolkit and publicly available data.
We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science.
arXiv Detail & Related papers (2023-09-25T05:01:34Z)
- The CRINGE Loss: Learning what language not to model [35.40992193113732]
We show that even with large amounts of positive training data, issues remain that can be alleviated with relatively small amounts of negative data.
We propose a novel procedure to train with such data called the CRINGE loss (ContRastive Iterative Negative GEneration).
Our models outperform multiple strong baselines and are conceptually simple, easy to train and implement.
arXiv Detail & Related papers (2022-11-10T19:30:08Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- BERT Fine-Tuning for Sentiment Analysis on Indonesian Mobile Apps Reviews [1.5749416770494706]
This study examines the effectiveness of fine-tuning BERT for sentiment analysis using two different pre-trained models.
The dataset consists of Indonesian user reviews of the ten best apps of 2020 on the Google Play site.
Two training-data labeling approaches, score-based and lexicon-based, were also tested to determine the effectiveness of the model; a minimal fine-tuning sketch appears after this list.
arXiv Detail & Related papers (2021-07-14T16:00:15Z)
- Deduplicating Training Data Makes Language Models Better [50.22588162039083]
Existing language modeling datasets contain many near-duplicate examples and long repetitive substrings.
Over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data.
We develop two tools that allow us to deduplicate training datasets; a near-duplicate detection sketch appears after this list.
arXiv Detail & Related papers (2021-07-14T06:06:52Z)
- Few-shot learning through contextual data augmentation [74.20290390065475]
Machine translation models need to adapt to new data to maintain their performance over time.
We show that adaptation on the scale of one to five examples is possible.
Our model reports better accuracy scores than a reference system trained on an average of 313 parallel examples.
arXiv Detail & Related papers (2021-03-31T09:05:43Z)
- Pre-Training BERT on Arabic Tweets: Practical Considerations [11.087099497830552]
We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing.
All are intended to support Arabic dialects and social media.
The new models achieve state-of-the-art results on several downstream tasks.
arXiv Detail & Related papers (2021-02-21T20:51:33Z)
- Learning from Imperfect Annotations [15.306536555936692]
Many machine learning systems today are trained on large amounts of human-annotated data.
We propose a new end-to-end framework that enables us to merge the aggregation step with model training.
We show accuracy gains of up to 25% over the current state-of-the-art approaches for aggregating annotations.
arXiv Detail & Related papers (2020-04-07T15:21:08Z)
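The BERT fine-tuning entry above describes a standard sequence-classification fine-tune; the sketch referenced there is below. It is a minimal illustration with the Hugging Face transformers library, assuming a multilingual checkpoint, a three-class label scheme, toy data, and toy hyperparameters; none of these details come from that paper.

```python
# Minimal sketch of fine-tuning a BERT-style model for review sentiment with the
# Hugging Face transformers library. The checkpoint, three-class label scheme,
# toy data, and hyperparameters are assumptions for illustration only.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-multilingual-cased"  # assumed; the paper compares two pre-trained models
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Toy reviews; real labels would come from the score-based or lexicon-based labeling.
texts = ["Aplikasi ini sangat membantu", "Sering error setelah update"]
labels = [2, 0]  # assumed scheme: 0 = negative, 1 = neutral, 2 = positive


class ReviewDataset(Dataset):
    """Wraps tokenized reviews and labels for the Trainer."""

    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item


trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3, per_device_train_batch_size=8),
    train_dataset=ReviewDataset(texts, labels),
)
trainer.train()
```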
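The deduplication entry above motivates removing near-duplicate documents from training corpora; the sketch referenced there follows. It is a minimal MinHash approximation of pairwise document similarity, not the two tools developed in that paper; the shingle size, number of hash functions, and 0.8 threshold are assumptions.

```python
# Minimal MinHash sketch for flagging near-duplicate training documents.
# Illustrative only; shingle size, number of hash functions, and the 0.8
# threshold are assumptions, not the tooling from the paper above.
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Word k-shingles of a lowercased document."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash(shingle_set: Set[str], num_hashes: int = 64) -> List[int]:
    """MinHash signature: for each seed, keep the smallest hash over all shingles."""
    if not shingle_set:
        return [0] * num_hashes  # degenerate signature for empty documents
    return [
        min(int(hashlib.sha1(f"{seed}|{s}".encode()).hexdigest(), 16) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical usage: mark a pair as near-duplicate above an assumed threshold.
doc_a = "the model is trained on a large web corpus scraped in 2020 " * 3
doc_b = "the model is trained on a large web corpus scraped in 2021 " * 3
sim = estimated_jaccard(minhash(shingles(doc_a)), minhash(shingles(doc_b)))
print(f"estimated Jaccard: {sim:.2f}, near-duplicate: {sim > 0.8}")
```

In practice, signatures would be bucketed with locality-sensitive hashing so that only candidate pairs are compared, rather than scoring every pair of documents.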
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.