Summarizing Indian Languages using Multilingual Transformers based
Models
- URL: http://arxiv.org/abs/2303.16657v1
- Date: Wed, 29 Mar 2023 13:05:17 GMT
- Title: Summarizing Indian Languages using Multilingual Transformers based
Models
- Authors: Dhaval Taunk and Vasudeva Varma
- Abstract summary: We study how these multilingual models perform on datasets in which both the source and target text are in Indian languages.
We experiment with the IndicBART and mT5 models and report ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-4 scores as performance metrics.
- Score: 13.062351454646912
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the advent of multilingual models such as mBART, mT5, and IndicBART,
summarization in low-resource Indian languages is receiving a lot of attention.
However, the number of available datasets remains small. In this work, we
(Team HakunaMatata) study how these multilingual models perform on summarization
datasets in which both the source and target text are in Indian languages. We
experiment with the IndicBART and mT5 models and report ROUGE-1, ROUGE-2,
ROUGE-3, and ROUGE-4 scores as performance metrics.
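As a rough illustration of the pipeline the abstract describes (generate a summary with a multilingual seq2seq model, then score it with ROUGE-1 through ROUGE-4), the sketch below uses Hugging Face transformers and the rouge_score package. It is not the authors' released code: the mT5 checkpoint name, generation settings, and placeholder texts are assumptions, and in practice the model would first be fine-tuned on an Indian-language summarization dataset before the scores are meaningful.

```python
# Minimal sketch: summarize one article with a multilingual seq2seq model and
# score it with ROUGE-1..ROUGE-4. Checkpoint and texts are placeholders.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from rouge_score import rouge_scorer

# Assumption: any mT5- or IndicBART-style seq2seq checkpoint could be used here.
MODEL_NAME = "google/mt5-small"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def summarize(article: str, max_input_len: int = 512, max_summary_len: int = 64) -> str:
    """Generate an abstractive summary for a single article with beam search."""
    inputs = tokenizer(article, truncation=True, max_length=max_input_len, return_tensors="pt")
    summary_ids = model.generate(**inputs, max_length=max_summary_len, num_beams=4)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# ROUGE-1 through ROUGE-4, matching the metrics reported in the abstract.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rouge3", "rouge4"], use_stemmer=False)

article = "..."    # an Indian-language news article (placeholder)
reference = "..."  # its gold summary (placeholder)
prediction = summarize(article)

scores = scorer.score(reference, prediction)
for name, result in scores.items():
    print(f"{name}: F1={result.fmeasure:.4f}")
```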
Related papers
- L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi [0.4194295877935868]
We present the MahaSUM dataset, a large-scale collection of diverse news articles in Marathi.
The dataset was created by scraping articles from a wide range of online news sources and manually verifying the abstract summaries.
We train an IndicBART model, a variant of the BART model tailored for Indic languages, using the MahaSUM dataset.
arXiv Detail & Related papers (2024-10-11T18:37:37Z)
- MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions [54.08017526771947]
Multilingual Reverse Instructions (MURI) generates high-quality instruction tuning datasets for low-resource languages.
MURI produces instruction-output pairs from existing human-written texts in low-resource languages.
Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages.
arXiv Detail & Related papers (2024-09-19T17:59:20Z)
- Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5 [8.21020989074456]
We introduce the Hindi-BEIR benchmark, comprising 15 datasets across seven distinct tasks.
We evaluate state-of-the-art multilingual retrieval models on the Hindi-BEIR benchmark, identifying task and domain-specific challenges.
We introduce NLLB-E5, a multilingual retrieval model that leverages a zero-shot approach to support Hindi without the need for Hindi training data.
arXiv Detail & Related papers (2024-09-09T07:57:43Z)
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present human-annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Implementing Deep Learning-Based Approaches for Article Summarization in Indian Languages [1.5749416770494706]
This paper presents a summary of various deep-learning approaches used for the ILSUM 2022 Indic language summarization datasets.
The ILSUM 2022 dataset consists of news articles written in Indian English, Hindi, and Gujarati, along with their ground-truth summaries.
arXiv Detail & Related papers (2022-12-12T04:50:43Z)
- Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find that finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
arXiv Detail & Related papers (2022-11-03T13:19:32Z)
- IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages [16.121708272597154]
We release the IndicSUPERB benchmark for speech recognition in 12 Indian languages.
We train and evaluate different self-supervised models alongside a commonly used baseline benchmark.
We show that language-specific fine-tuned models are more accurate than the baseline on most of the tasks.
arXiv Detail & Related papers (2022-08-24T20:14:52Z)
- MTVR: Multilingual Moment Retrieval in Videos [89.24431389933703]
We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips.
The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese queries and subtitles.
We propose mXML, a multilingual moment retrieval model that learns and operates on data from both languages.
arXiv Detail & Related papers (2021-07-30T20:01:03Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z)