IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic
Languages
- URL: http://arxiv.org/abs/2203.05437v1
- Date: Thu, 10 Mar 2022 15:53:58 GMT
- Title: IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic
Languages
- Authors: Aman Kumar, Himani Shrotriya, Prachi Sahu, Raj Dabre, Ratish
Puduppully, Anoop Kunchukuttan, Amogh Mishra, Mitesh M. Khapra, Pratyush
Kumar
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present the IndicNLG suite, a collection of datasets for
benchmarking Natural Language Generation (NLG) for 11 Indic languages. We focus
on five diverse tasks, namely, biography generation using Wikipedia infoboxes
(WikiBio), news headline generation, sentence summarization, question
generation and paraphrase generation. We describe the process of creating the
datasets and present statistics of the datasets, after which we train and
report a variety of strong monolingual and multilingual baselines that leverage
pre-trained sequence-to-sequence models and analyze the results to understand
the challenges involved in Indic language NLG. To the best of our knowledge,
this is the first NLG dataset for Indic languages and also the largest
multilingual NLG dataset. Our methods can also be easily applied to
modest-resource languages with reasonable monolingual and parallel corpora, as
well as corpora containing structured data like Wikipedia. We hope this dataset
spurs research in NLG on diverse languages and tasks, particularly for Indic
languages. The datasets and models are publicly available at
https://indicnlp.ai4bharat.org/indicnlg-suite.
Related papers
- MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions [54.08017526771947]
Multilingual Reverse Instructions (MURI) generates high-quality instruction tuning datasets for low-resource languages.
MURI produces instruction-output pairs from existing human-written texts in low-resource languages.
Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages.
arXiv Detail & Related papers (2024-09-19T17:59:20Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Pretraining Data and Tokenizer for Indic LLM [1.7729311045335219]
We develop a novel approach to data preparation for developing a multilingual Indic large language model.
Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia.
For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content.
arXiv Detail & Related papers (2024-07-17T11:06:27Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages [0.4194295877935868]
L3Cube-IndicNews is a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages.
We have centered our work on 10 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi.
Each of these news datasets comprises 10 or more classes of news articles.
arXiv Detail & Related papers (2024-01-04T13:11:17Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages [21.018996007110324]
This dataset includes 41.8 million news articles in 14 different Indic languages (and English).
To the best of our knowledge, this is the largest collection of curated articles for Indic languages currently available.
arXiv Detail & Related papers (2023-05-10T03:07:17Z)
- IndicXNLI: Evaluating Multilingual Inference for Indian Languages [9.838755823660147]
IndicXNLI is an NLI dataset for 11 Indic languages.
By finetuning different pre-trained LMs on this IndicXNLI, we analyze various cross-lingual transfer techniques.
These experiments provide us with useful insights into the behaviour of pre-trained models for a diverse set of languages.
arXiv Detail & Related papers (2022-04-19T09:49:00Z)
- GLGE: A New General Language Generation Evaluation Benchmark [139.25515221280767]
General Language Generation Evaluation (GLGE) is a new multi-task benchmark for evaluating the generalization capabilities of NLG models.
To encourage research on pretraining and transfer learning on NLG models, we make GLGE publicly available and build a leaderboard with strong baselines.
arXiv Detail & Related papers (2020-11-24T06:59:45Z)
- The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.