Implementing Deep Learning-Based Approaches for Article Summarization in
Indian Languages
- URL: http://arxiv.org/abs/2212.05702v1
- Date: Mon, 12 Dec 2022 04:50:43 GMT
- Title: Implementing Deep Learning-Based Approaches for Article Summarization in
Indian Languages
- Authors: Rahul Tangsali, Aabha Pingle, Aditya Vyawahare, Isha Joshi, Raviraj
Joshi
- Abstract summary: This paper presents a summary of various deep-learning approaches used for the ILSUM 2022 Indic language summarization datasets.
The ILSUM 2022 dataset consists of news articles written in Indian English, Hindi, and Gujarati, along with their ground-truth summaries.
- Score: 1.5749416770494706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The research on text summarization for low-resource Indian languages has been
limited by the scarcity of relevant datasets. This paper presents a
summary of various deep-learning approaches used for the ILSUM 2022 Indic
language summarization datasets. The ILSUM 2022 dataset consists of news
articles written in Indian English, Hindi, and Gujarati, along with their
ground-truth summaries. In our work, we explore different pre-trained
seq2seq models and fine-tune them on the ILSUM 2022 datasets. In our
experiments, the fine-tuned state-of-the-art PEGASUS model performed best for
English, the fine-tuned IndicBART model with augmented data performed best for
Hindi, and a fine-tuned PEGASUS model combined with a translation
mapping-based approach performed best for Gujarati. The generated summaries
were evaluated using ROUGE-1, ROUGE-2, and ROUGE-4 as the evaluation metrics.
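To make the fine-tuning and evaluation pipeline concrete, the snippet below is a minimal sketch (not the authors' released code) of fine-tuning a pre-trained seq2seq model such as PEGASUS on article/summary pairs with the Hugging Face transformers library and scoring a generated summary with ROUGE. The file names, column names ("article", "summary"), and hyperparameters are placeholders rather than values from the paper; IndicBART can be swapped in the same way for Hindi.

```python
# Minimal sketch: fine-tune a seq2seq summarizer and score it with ROUGE.
# Assumptions (not from the paper): CSV files with "article"/"summary" columns,
# PEGASUS-large as the base checkpoint, and illustrative hyperparameters.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          DataCollatorForSeq2Seq)
from datasets import load_dataset
from rouge_score import rouge_scorer

model_name = "google/pegasus-large"  # e.g. IndicBART could be used for Hindi
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical files standing in for the ILSUM 2022 splits.
raw = load_dataset("csv", data_files={"train": "ilsum_train.csv",
                                      "validation": "ilsum_dev.csv"})

def preprocess(batch):
    # Tokenize articles as inputs and reference summaries as labels.
    model_inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True,
                    remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="pegasus-ilsum",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Generate one summary and compute ROUGE-1/2/4 against the reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rouge4"], use_stemmer=True)
example = raw["validation"][0]
inputs = tokenizer(example["article"], return_tensors="pt",
                   truncation=True, max_length=1024).to(model.device)
summary_ids = model.generate(**inputs, max_length=128, num_beams=4)
prediction = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(scorer.score(example["summary"], prediction))
```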
Related papers
- L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi [0.4194295877935868]
We present the MahaSUM dataset, a large-scale collection of diverse news articles in Marathi.
The dataset was created by scraping articles from a wide range of online news sources and manually verifying the abstract summaries.
We train an IndicBART model, a variant of the BART model tailored for Indic languages, using the MahaSUM dataset.
arXiv Detail & Related papers (2024-10-11T18:37:37Z) - Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z) - Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present human-annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z) - Summarizing Indian Languages using Multilingual Transformers based
Models [13.062351454646912]
We study how these multilingual models perform on the datasets which have Indian languages as source and target text.
We experiment with IndicBART and mT5 models and report the ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-4 scores as performance metrics.
arXiv Detail & Related papers (2023-03-29T13:05:17Z) - Indian Language Summarization using Pretrained Sequence-to-Sequence
Models [11.695648989161878]
The ILSUM task focuses on text summarization for two major Indian languages, Hindi and Gujarati, along with English.
We present a detailed overview of the models and our approaches in this paper.
arXiv Detail & Related papers (2023-03-25T13:05:54Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - GAE-ISumm: Unsupervised Graph-Based Summarization of Indian Languages [5.197307534263253]
Document summarization aims to create a precise and coherent summary of a text document.
Many deep learning summarization models are developed mainly for English, often requiring a large training corpus.
We propose GAE-ISumm, an unsupervised Indic summarization model that extracts summaries from text documents.
arXiv Detail & Related papers (2022-12-25T17:20:03Z) - mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z) - Facebook AI's WMT20 News Translation Task Submission [69.92594751788403]
This paper describes Facebook AI's submission to WMT20 shared news translation task.
We focus on the low resource setting and participate in two language pairs, Tamil -> English and Inuktitut -> English.
We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain.
arXiv Detail & Related papers (2020-11-16T21:49:00Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)