Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages
- URL: http://arxiv.org/abs/2305.05858v1
- Date: Wed, 10 May 2023 03:07:17 GMT
- Title: Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages
- Authors: Rahul Aralikatte, Ziling Cheng, Sumanth Doddapaneni, Jackie Chi Kit Cheung
- Abstract summary: This dataset includes 41.8 million news articles in 14 different Indic languages (and English).
To the best of our knowledge, this is the largest collection of curated articles for Indic languages currently available.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Vārta, a large-scale multilingual dataset for headline
generation in Indic languages. This dataset includes 41.8 million news articles
in 14 different Indic languages (and English), which come from a variety of
high-quality sources. To the best of our knowledge, this is the largest
collection of curated articles for Indic languages currently available. We use
the data collected in a series of experiments to answer important questions
related to Indic NLP and multilinguality research in general. We show that the
dataset is challenging even for state-of-the-art abstractive models and that
they perform only slightly better than extractive baselines. Owing to its size,
we also show that the dataset can be used to pretrain strong language models
that outperform competitive baselines in both NLU and NLG benchmarks.
Related papers
- Navigating Text-to-Image Generative Bias across Indic Languages
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Pretraining Data and Tokenizer for Indic LLM
We develop a novel approach to data preparation for building a multilingual Indic large language model.
Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia.
For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content.
arXiv Detail & Related papers (2024-07-17T11:06:27Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
Existing datasets are almost exclusively in English.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages
L3Cube-IndicNews is a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages.
We have centered our work on 10 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi.
Each of these news datasets comprises 10 or more classes of news articles.
arXiv Detail & Related papers (2024-01-04T13:11:17Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction
Most languages of the Americas are low-resource, with limited parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions arising from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- Aksharantar: Open Indic-language Transliteration datasets and models for the Next Billion Users
We introduce Aksharantar, the largest publicly available transliteration dataset for Indian languages, created by mining monolingual and parallel corpora.
The dataset contains 26 million transliteration pairs for 21 Indic languages from 3 language families using 12 scripts.
Aksharantar is 21 times larger than existing datasets and is the first publicly available dataset for 7 languages and 1 language family.
arXiv Detail & Related papers (2022-05-06T05:13:12Z)
- IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages
In this paper, we present the IndicNLG suite, a collection of datasets for benchmarking Natural Language Generation for 11 Indic languages.
We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes (WikiBio), news headline generation, sentence summarization, question generation and paraphrase generation.
arXiv Detail & Related papers (2022-03-10T15:53:58Z)
- XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages
We present XL-Sum, a comprehensive and diverse dataset of 1 million professionally annotated article-summary pairs from BBC.
The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available.
XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
arXiv Detail & Related papers (2021-06-25T18:00:24Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.