Summarizing Indian Languages using Multilingual Transformers based Models
- URL: http://arxiv.org/abs/2303.16657v1
- Date: Wed, 29 Mar 2023 13:05:17 GMT
- Title: Summarizing Indian Languages using Multilingual Transformers based Models
- Authors: Dhaval Taunk and Vasudeva Varma
- Abstract summary: We study how these multilingual models perform on datasets that have Indian languages as both source and target text.
We experiment with the IndicBART and mT5 models and report ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-4 scores as performance metrics.
- Score: 13.062351454646912
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the advent of multilingual models like mBART, mT5, and IndicBART,
summarization in low-resource Indian languages is receiving a lot of attention
nowadays, but the number of available datasets is still small. In this work, we
(Team HakunaMatata) study how these multilingual models perform on summarization
datasets that have Indian languages as both the source and target text. We
experiment with the IndicBART and mT5 models and report ROUGE-1, ROUGE-2,
ROUGE-3, and ROUGE-4 scores as performance metrics.
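For illustration, below is a minimal sketch of the setup the abstract describes: loading a pretrained multilingual seq2seq model and generating an abstractive summary with Hugging Face transformers. The checkpoint names (google/mt5-small, ai4bharat/IndicBART) and all decoding settings are assumptions; the paper does not publish its exact configuration.

```python
# Minimal sketch of multilingual abstractive summarization with Hugging Face
# transformers. Checkpoint names and decoding settings are assumptions, not
# the authors' published configuration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"  # or "ai4bharat/IndicBART"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = "..."  # source article text in an Indian language
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(
    **inputs,
    max_length=128,          # assumed summary length budget
    num_beams=4,             # beam search; a common default, not confirmed
    no_repeat_ngram_size=3,  # discourage repetitive output
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```

The reported ROUGE-N metrics measure n-gram overlap between the generated and reference summaries. A self-contained, recall-oriented version is sketched below; whitespace tokenization is a simplification, and published scores typically come from a dedicated library such as rouge-score.

```python
# Illustrative, recall-oriented ROUGE-N on whitespace tokens.
from collections import Counter

def rouge_n_recall(reference: str, candidate: str, n: int) -> float:
    """Share of reference n-grams that also appear in the candidate."""
    ref, cand = reference.split(), candidate.split()
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    overlap = sum((ref_ngrams & cand_ngrams).values())  # clipped n-gram matches
    total = sum(ref_ngrams.values())
    return overlap / total if total else 0.0

reference_summary = "..."  # gold summary for the same article (hypothetical)
# ROUGE-1 through ROUGE-4, the metrics reported in the paper
scores = {f"ROUGE-{n}": rouge_n_recall(reference_summary, summary, n)
          for n in range(1, 5)}
```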
Related papers
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present a human-annotated named entity corpus of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Implementing Deep Learning-Based Approaches for Article Summarization in Indian Languages [1.5749416770494706]
This paper presents a summary of various deep-learning approaches used for the ILSUM 2022 Indic language summarization datasets.
ILSUM 2022 consists of news articles written in Indian English, Hindi, and Gujarati, along with their ground-truth summaries.
arXiv Detail & Related papers (2022-12-12T04:50:43Z)
- Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find that finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
arXiv Detail & Related papers (2022-11-03T13:19:32Z)
- TaTa: A Multilingual Table-to-Text Dataset for African Languages [32.348630887289524]
Table-to-Text in African languages (TaTa) is the first large multilingual table-to-text dataset with a focus on African languages.
TaTa includes 8,700 examples in nine languages, including four African languages (Hausa, Igbo, Swahili, and Yorùbá) and a zero-shot test language (Russian).
arXiv Detail & Related papers (2022-10-31T21:05:42Z)
- IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages [16.121708272597154]
We release the IndicSUPERB benchmark for speech recognition in 12 Indian languages.
We train and evaluate different self-supervised models alongside a commonly used baseline benchmark.
We show that language-specific fine-tuned models are more accurate than the baseline on most of the tasks.
arXiv Detail & Related papers (2022-08-24T20:14:52Z)
- Transfer Learning for Scene Text Recognition in Indian Languages [27.609596088151644]
We investigate the power of transfer learning for all the layers of deep scene text recognition networks from English to two common Indian languages.
We show that the transfer of English models to simple synthetic datasets of Indian languages is not practical.
We set new benchmarks for scene-text recognition on the Hindi, Telugu, and Malayalam datasets from IIIT-ILST and the Bangla dataset from MLT-17.
arXiv Detail & Related papers (2022-01-10T06:14:49Z)
- MTVR: Multilingual Moment Retrieval in Videos [89.24431389933703]
We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips.
The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese queries and subtitles.
We propose mXML, a multilingual moment retrieval model that learns and operates on data from both languages.
arXiv Detail & Related papers (2021-07-30T20:01:03Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model with character language models trained on varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with existing work, our method does not rely on bilingual sentences for training and requires only one training process for multiple target languages; a minimal sketch of this style of augmentation follows after this list.
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
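As referenced in the CoSDA-ML entry above, lexicon-based code-switching augmentation can be sketched in a few lines: randomly substitute tokens of a source-language sentence with dictionary translations into other target languages, so that a single fine-tuning run exposes the model to mixed-language input. The toy dictionaries and replacement rate below are illustrative assumptions, not the authors' released resources.

```python
# Illustrative sketch of lexicon-based code-switching augmentation in the
# spirit of CoSDA-ML. Dictionaries and replacement rate are assumptions.
import random

LEXICONS = {
    "hi": {"good": "accha", "friend": "dost"},   # toy Hindi entries
    "es": {"good": "bueno", "friend": "amigo"},  # toy Spanish entries
}

def code_switch(sentence: str, rate: float = 0.3, seed: int = 0) -> str:
    """Randomly replace tokens with dictionary translations into target languages."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        lang = rng.choice(sorted(LEXICONS))           # pick a target language
        translation = LEXICONS[lang].get(tok.lower())
        out.append(translation if translation and rng.random() < rate else tok)
    return " ".join(out)

# One pass yields mixed-language training text for fine-tuning mBERT
print(code_switch("good morning my friend"))
```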