No News is Good News: A Critique of the One Billion Word Benchmark
- URL: http://arxiv.org/abs/2110.12609v1
- Date: Mon, 25 Oct 2021 02:41:27 GMT
- Title: No News is Good News: A Critique of the One Billion Word Benchmark
- Authors: Helen Ngo, João G.M. Araújo, Jeffrey Hui, Nicholas Frosst
- Abstract summary: The One Billion Word Benchmark is a dataset derived from the WMT 2011 News Crawl.
We train models solely on Common Crawl web scrapes partitioned by year, and demonstrate that they perform worse on this task over time due to distributional shift.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The One Billion Word Benchmark is a dataset derived from the WMT 2011 News
Crawl, commonly used to measure language modeling ability in natural language
processing. We train models solely on Common Crawl web scrapes partitioned by
year, and demonstrate that they perform worse on this task over time due to
distributional shift. Analysis of this corpus reveals that it contains several
examples of harmful text, as well as outdated references to current events. We
suggest that the temporal nature of news and its distribution shift over time
makes it poorly suited for measuring language modeling ability, and discuss
potential impact and considerations for researchers building language models
and evaluation datasets.
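The abstract's central claim (models trained on later web-text partitions score worse on 2011-era news) can be illustrated with a toy sketch. Everything below, the yearly corpora and the add-one-smoothed unigram model alike, is a hypothetical stand-in for illustration only; the paper itself trains full language models on Common Crawl partitions and evaluates on the One Billion Word Benchmark.

```python
# Toy sketch of the experiment's core idea: train one model per year of
# web text and score a fixed news-style test set. As the training text
# drifts away from news vocabulary, perplexity on the news set rises.
import math
from collections import Counter

def train_unigram(tokens, vocab):
    """MLE unigram probabilities with add-one (Laplace) smoothing."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def perplexity(model, tokens):
    """exp of the average negative log-likelihood per token."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

# Hand-written yearly corpora: later years share less news vocabulary.
corpora = {
    2013: "the markets fell after the news report on the economy".split(),
    2016: "users shared the report online after the stream went live".split(),
    2019: "the app update changed the feed ranking for most users".split(),
}
# Fixed news-domain test set, standing in for the 2011 News Crawl.
test = "the report on the markets moved the economy".split()

# Shared vocabulary so every model assigns nonzero mass to every test token.
vocab = set(test).union(*corpora.values())

for year in sorted(corpora):
    model = train_unigram(corpora[year], vocab)
    print(year, round(perplexity(model, test), 2))  # perplexity rises with year
```

With these toy corpora the perplexity increases monotonically across the three "years", mirroring the distributional-shift trend the paper reports at scale.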
Related papers
- AutoCast++: Enhancing World Event Prediction with Zero-shot Ranking-based Context Retrieval
We introduce AutoCast++, a zero-shot ranking-based context retrieval system.
Our approach first re-ranks articles based on zero-shot question-passage relevance, homing in on semantically pertinent news.
We conduct both the relevance evaluation and article summarization without needing domain-specific training.
arXiv Detail & Related papers (2023-10-03T08:34:44Z)
- Text2Time: Transformer-based Article Time Period Prediction
This work investigates the problem of predicting the publication period of a text document, specifically a news article, based on its textual content.
We create our own extensive labeled dataset of over 350,000 news articles published by The New York Times over six decades.
In our approach, we use a pretrained BERT model fine-tuned for the task of text classification, specifically for time period prediction.
arXiv Detail & Related papers (2023-04-21T10:05:03Z)
- Models See Hallucinations: Evaluating the Factuality in Video Captioning
We conduct a human evaluation of the factuality in video captioning and collect two annotated factuality datasets.
We find that 57.0% of the model-generated sentences have factual errors, indicating it is a severe problem in this field.
We propose a weakly-supervised, model-based factuality metric FactVC, which outperforms previous metrics on factuality evaluation of video captioning.
arXiv Detail & Related papers (2023-03-06T08:32:50Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model against character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- Are Some Words Worth More than Others?
We propose two new intrinsic evaluation measures within the framework of a simple word prediction task.
We evaluate several commonly-used large English language models using our proposed metrics.
arXiv Detail & Related papers (2020-10-12T23:12:11Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.