IndicBART: A Pre-trained Model for Natural Language Generation of Indic
Languages
- URL: http://arxiv.org/abs/2109.02903v1
- Date: Tue, 7 Sep 2021 07:08:33 GMT
- Title: IndicBART: A Pre-trained Model for Natural Language Generation of Indic
Languages
- Authors: Raj Dabre and Himani Shrotriya and Anoop Kunchukuttan and Ratish
Puduppully and Mitesh M. Khapra and Pratyush Kumar
- Abstract summary: IndicBART is a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English.
We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization.
- Score: 24.638109544527104
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper we present IndicBART, a multilingual, sequence-to-sequence
pre-trained model focusing on 11 Indic languages and English. Different from
existing pre-trained models, IndicBART utilizes the orthographic similarity
between Indic scripts to improve transfer learning between similar Indic
languages. We evaluate IndicBART on two NLG tasks: Neural Machine Translation
(NMT) and extreme summarization. Our experiments on NMT for 12 language pairs
and extreme summarization for 7 languages using multilingual fine-tuning show
that IndicBART is competitive with or better than mBART50 despite containing
significantly fewer parameters. Our analyses focus on identifying the impact of
script unification (to Devanagari), corpora size as well as multilingualism on
the final performance. The IndicBART model is available under the MIT license
at https://indicnlp.ai4bharat.org/indic-bart .
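The script unification mentioned in the abstract exploits the fact that most Indic scripts descend from ISCII and occupy Unicode blocks that are aligned offset-for-offset with the Devanagari block, so a single shared subword vocabulary can cover orthographically similar languages. The Python sketch below illustrates the idea with a plain code-point shift; it is an illustrative toy under that alignment assumption, not the authors' pipeline, and production converters (e.g., the Indic NLP Library) also handle script-specific characters that have no one-to-one Devanagari counterpart.

```python
# Illustrative sketch (not the authors' exact pipeline) of script unification
# to Devanagari. Most Indic scripts inherit the ISCII layout, so their Unicode
# blocks are aligned offset-for-offset with the Devanagari block; this toy
# code-point shift ignores the handful of script-specific exceptions.

DEVANAGARI_START = 0x0900
INDIC_BLOCK_START = {     # start of each script's 128-code-point Unicode block
    "bengali":   0x0980,  # also used for Assamese
    "gurmukhi":  0x0A00,  # Punjabi
    "gujarati":  0x0A80,
    "oriya":     0x0B00,
    "tamil":     0x0B80,
    "telugu":    0x0C00,
    "kannada":   0x0C80,
    "malayalam": 0x0D00,
}

def to_devanagari(text: str, script: str) -> str:
    """Shift characters of `script` into the Devanagari block; anything outside
    that block (digits, punctuation, Latin) is passed through unchanged."""
    start = INDIC_BLOCK_START[script]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + 0x80:
            out.append(chr(DEVANAGARI_START + (cp - start)))
        else:
            out.append(ch)
    return "".join(out)

print(to_devanagari("ভারত", "bengali"))  # Bengali "Bharat" -> भारत
```

Mapping every script into one block means orthographically similar words in related languages share subword units and embeddings, which is the transfer-learning effect the abstract attributes to script unification.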
Related papers
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD).
We highlight the morphosyntactic differences between the closely related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z)
- Machine Translation by Projecting Text into the Same Phonetic-Orthographic Space Using a Common Encoding [3.0422770070015295]
We propose an approach based on common multilingual Latin-based encodings (WX notation) that take advantage of language similarity.
We validate the proposed approach with experiments on similar language pairs.
We also obtain improvements of up to 1 BLEU point on distant and zero-shot language pairs.
arXiv Detail & Related papers (2023-05-21T06:46:33Z)
- Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages [19.91781398526369]
We aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes.
Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families.
We create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages.
Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature.
arXiv Detail & Related papers (2022-12-11T04:45:50Z)
- CLSRIL-23: Cross Lingual Speech Representations for Indic Languages [0.0]
CLSRIL-23 is a self supervised learning based model which learns cross lingual speech representations from raw audio across 23 Indic languages.
It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations.
We compare the language-wise loss during pretraining to study the effects of monolingual versus multilingual pretraining.
arXiv Detail & Related papers (2021-07-15T15:42:43Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages [0.0]
Indic languages have rich morphosyntax, grammatical genders, free linear word-order, and highly inflectional morphology.
We introduce Vyākarana: a benchmark of gender-balanced Colorless Green sentences in Indic languages for syntactic evaluation of multilingual language models.
We use the datasets from the evaluation tasks to probe five multilingual language models of varying architectures for syntax in Indic languages.
arXiv Detail & Related papers (2021-03-01T09:07:58Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
arXiv Detail & Related papers (2020-04-24T17:21:32Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, UNMT can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.