Separating the Wheat from the Chaff with BREAD: An open-source benchmark
and metrics to detect redundancy in text
- URL: http://arxiv.org/abs/2311.06440v1
- Date: Sat, 11 Nov 2023 00:11:50 GMT
- Title: Separating the Wheat from the Chaff with BREAD: An open-source benchmark
and metrics to detect redundancy in text
- Authors: Isaac Caswell, Lisa Wang, Isabel Papadimitriou
- Abstract summary: We create and release BREAD, a human-labeled benchmark on repetitive boilerplate vs. plausible linguistic content.
We release several baseline CRED (Character REDundancy) scores along with it, and evaluate their effectiveness on BREAD.
- Score: 9.484323358958706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data quality is a problem that perpetually resurfaces throughout the field of
NLP, regardless of task, domain, or architecture, and remains especially severe
for lower-resource languages. A typical and insidious issue, affecting both
training data and model output, is data that is repetitive and dominated by
linguistically uninteresting boilerplate, such as price catalogs or
computer-generated log files. Though this problem permeates many web-scraped
corpora, there has yet to be a benchmark to test against, or a systematic study
to find simple metrics that generalize across languages and agree with human
judgements of data quality. In the present work, we create and release BREAD, a
human-labeled benchmark on repetitive boilerplate vs. plausible linguistic
content, spanning 360 languages. We release several baseline CRED (Character
REDundancy) scores along with it, and evaluate their effectiveness on BREAD. We
hope that the community will use this resource to develop better filtering
methods, and that our reference implementations of CRED scores can become
standard corpus evaluation tools, driving the development of cleaner language
modeling corpora, especially in low-resource languages.
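The abstract does not spell out how the baseline CRED (Character REDundancy) scores are computed, so the following is only a rough sketch of one character-level redundancy signal of the kind the paper targets: a gzip compression ratio, a common heuristic for flagging repetitive boilerplate such as price catalogs or log files. The function name and score definition here are illustrative assumptions, not the paper's reference implementation.

```python
# Illustrative sketch only: BREAD's actual CRED variants are defined in the
# paper's reference implementation and are not reproduced here. This shows a
# plausible compression-based character-redundancy heuristic.
import gzip


def compression_redundancy(text: str) -> float:
    """Return a redundancy score in [0, 1): higher means more repetitive.

    Highly repetitive documents (e.g. price catalogs or log files) compress
    far better than natural language, so 1 - compressed/raw size rises
    toward 1 as redundancy increases.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    compressed = gzip.compress(raw)
    return max(0.0, 1.0 - len(compressed) / len(raw))


if __name__ == "__main__":
    boilerplate = "ITEM 0001 PRICE $9.99\n" * 50
    prose = ("Data quality is a problem that perpetually resurfaces "
             "throughout the field of NLP, regardless of task or domain.")
    print(f"boilerplate: {compression_redundancy(boilerplate):.2f}")
    print(f"prose:       {compression_redundancy(prose):.2f}")
```

In practice such a score would be thresholded or ranked against human labels like those in BREAD to decide which documents to filter.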
Related papers
- Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling [24.870429379543193]
We tackle the challenge of limited labeled data for low-resource languages in ASR, focusing on Hindi.
Our framework integrates multiple base models for transcription and evaluators for assessing audio-transcript pairs, resulting in robust pseudo-labeling for low-resource languages.
We validate our approach with a new benchmark, IndicYT, comprising diverse YouTube audio files from multiple content categories.
arXiv Detail & Related papers (2024-08-26T05:36:35Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models [4.168157981135698]
We show how knowledge can be distilled from Large Language Models (LLMs) to improve upon learned metrics without requiring human annotators.
We show that the performance of a BLEURT-like model on lower resource languages can be improved in this way.
arXiv Detail & Related papers (2023-02-07T14:35:35Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.