L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi
- URL: http://arxiv.org/abs/2410.09184v1
- Date: Fri, 11 Oct 2024 18:37:37 GMT
- Title: L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi
- Authors: Pranita Deshmukh, Nikita Kulkarni, Sanhita Kulkarni, Kareena Manghani, Raviraj Joshi
- Abstract summary: We present the MahaSUM dataset, a large-scale collection of diverse news articles in Marathi.
The dataset was created by scraping articles from a wide range of online news sources and manually verifying the abstract summaries.
We train an IndicBART model, a variant of the BART model tailored for Indic languages, using the MahaSUM dataset.
- Score: 0.4194295877935868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the MahaSUM dataset, a large-scale collection of diverse news articles in Marathi, designed to facilitate the training and evaluation of models for abstractive summarization tasks in Indic languages. The dataset, containing 25k samples, was created by scraping articles from a wide range of online news sources and manually verifying the abstract summaries. Additionally, we train an IndicBART model, a variant of the BART model tailored for Indic languages, using the MahaSUM dataset. We evaluate the performance of our trained models on the task of abstractive summarization and demonstrate their effectiveness in producing high-quality summaries in Marathi. Our work contributes to the advancement of natural language processing research in Indic languages and provides a valuable resource for future research in this area using state-of-the-art models. The dataset and models are shared publicly at https://github.com/l3cube-pune/MarathiNLP
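As a rough sketch of how such a fine-tuning run might look with HuggingFace Transformers (the model id is the public ai4bharat/IndicBART checkpoint; the data file, column names, and hyperparameters are assumptions, not the authors' reported configuration):
```python
# Minimal sketch: fine-tuning IndicBART for Marathi abstractive summarization.
# Data file, column names, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Flags follow the IndicBART model card's recommended tokenizer settings.
tokenizer = AutoTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBART")

# Hypothetical local copy of MahaSUM with "article" and "summary" fields.
data = load_dataset("json", data_files="mahasum.jsonl")["train"]

def preprocess(batch):
    # Canonical IndicBART inputs also carry language tags such as "<2mr>";
    # omitted here to keep the sketch short.
    inputs = tokenizer(batch["article"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

data = data.map(preprocess, batched=True, remove_columns=data.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="mahasum-indicbart",
                                  per_device_train_batch_size=8,
                                  learning_rate=3e-5,
                                  num_train_epochs=3),
    train_dataset=data,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```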
Related papers
- SPRING Lab IITM's submission to Low Resource Indic Language Translation Shared Task [10.268444449457956]
We develop a robust translation model for four low-resource Indic languages: Khasi, Mizo, Manipuri, and Assamese.
Our approach includes a comprehensive pipeline from data collection and preprocessing to training and evaluation.
To address the scarcity of bilingual data, we use back-translation techniques on monolingual datasets for Mizo and Khasi.
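A minimal sketch of that back-translation step (the reverse-direction model id is a placeholder, not a real public checkpoint):
```python
# Back-translation sketch: a reverse-direction (target->source) model turns
# monolingual target-language text into synthetic parallel training pairs.
from transformers import pipeline

reverse_mt = pipeline("translation", model="example-org/mt-lus-eng")  # hypothetical id

monolingual_mizo = [
    "...",  # monolingual Mizo sentences collected separately
]

synthetic_bitext = []
for tgt in monolingual_mizo:
    src = reverse_mt(tgt, max_length=256)[0]["translation_text"]
    # (synthetic source, genuine target) pairs augment the scarce real bitext
    synthetic_bitext.append({"source": src, "target": tgt})
```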
arXiv Detail & Related papers (2024-11-01T16:39:03Z)
- From News to Summaries: Building a Hungarian Corpus for Extractive and Abstractive Summarization [0.19107347888374507]
HunSum-2 is an open-source Hungarian corpus suitable for training abstractive and extractive summarization models.
The dataset is assembled from segments of the Common Crawl corpus, which undergo thorough cleaning.
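An illustrative cleaning pass in that spirit (the authors' actual filters, such as boilerplate removal, language identification, and quality heuristics, are not reproduced here):
```python
# Length filtering and exact deduplication by content hash for web-crawled docs.
import hashlib

def clean(docs, min_words=50):
    seen = set()
    for doc in docs:
        text = doc["text"].strip()
        if len(text.split()) < min_words:   # drop near-empty pages
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:                  # drop exact duplicates
            continue
        seen.add(digest)
        yield doc
```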
arXiv Detail & Related papers (2024-04-04T16:07:06Z)
- Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets [2.8123257987021058]
We focus on enhancing the LLaMA-2-Amharic model by integrating task-specific and generative datasets.
We compile an Amharic instruction fine-tuning dataset and use it to fine-tune the LLaMA-2-Amharic model.
The fine-tuned model shows promising results in different NLP tasks.
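A hypothetical example of casting one labeled example into instruction-tuning format (the field names and instruction text are illustrative only, not the paper's schema):
```python
# Convert a labeled NLP example into an (instruction, input, output) record.
def to_instruction(example):
    return {
        "instruction": "Classify the sentiment of the following Amharic sentence.",
        "input": example["sentence"],
        "output": example["label"],
    }

to_instruction({"sentence": "...", "label": "positive"})
```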
arXiv Detail & Related papers (2024-02-12T19:25:11Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
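The general recipe might be sketched as follows, using a public multilingual encoder and an assumed tag set rather than the authors' exact framework:
```python
# Fine-tune a multilingual encoder for token classification on English SMS
# data, then continue training on the ~30 labeled Arabic samples.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-MERCHANT", "I-MERCHANT", "B-AMOUNT", "I-AMOUNT"]  # assumed tags
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels))
# Step 1: train on the full English SMS dataset.
# Step 2: continue training on the handful of Arabic samples; the shared
# multilingual representation carries entity patterns across languages.
```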
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), which works with data from different tasks.
UMLNMT achieves substantial improvements over dataset-specific models with significantly reduced model deployment costs.
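One common way to realize a single model over many datasets, shown here as a simplification rather than the paper's method, is to prepend a dataset tag the model can condition on:
```python
# Prefix each source sentence with a dataset/domain tag.
def tag_example(src, tgt, dataset_name):
    return {"source": f"<{dataset_name}> {src}", "target": tgt}

tag_example("Good morning.", "सुप्रभात", "news")
# -> {'source': '<news> Good morning.', 'target': 'सुप्रभात'}
```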
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- GAE-ISumm: Unsupervised Graph-Based Summarization of Indian Languages [5.197307534263253]
Document summarization aims to create a precise and coherent summary of a text document.
Many deep learning summarization models are developed mainly for English, often requiring a large training corpus.
We propose GAE-ISumm, an unsupervised Indic summarization model that extracts summaries from text documents.
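For intuition, a TextRank-style sketch of graph-based extractive summarization, a simpler relative of GAE-ISumm's graph-autoencoder approach rather than the model itself:
```python
# Build a sentence-similarity graph and rank sentences by PageRank centrality.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extract_summary(sentences, k=3):
    vectors = TfidfVectorizer().fit_transform(sentences)
    graph = nx.from_numpy_array(cosine_similarity(vectors))
    scores = nx.pagerank(graph)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # preserve document order
```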
arXiv Detail & Related papers (2022-12-25T17:20:03Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
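One common way to build such a consistency signal, sketched here with a public NLI checkpoint rather than the paper's own evaluation models, is entailment scoring of each summary sentence against the source:
```python
# Mean entailment probability of summary sentences given the source article.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def consistency_score(source, summary_sentences):
    scores = []
    for sent in summary_sentences:
        results = nli({"text": source, "text_pair": sent}, top_k=None)
        entail = next(r["score"] for r in results
                      if r["label"].upper() == "ENTAILMENT")
        scores.append(entail)
    return sum(scores) / len(scores)
```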
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- Building Machine Translation Systems for the Next Thousand Languages [102.24310122155073]
We describe results in three research domains: building clean, web-mined datasets for 1500+ languages, developing practical MT models for under-served languages, and studying the limitations of evaluation metrics for these languages.
We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
arXiv Detail & Related papers (2022-05-09T00:24:13Z)
- Experimental Evaluation of Deep Learning models for Marathi Text Classification [0.0]
We evaluate CNN, LSTM, ULMFiT, and BERT based models on two publicly available Marathi text classification datasets.
We show that basic single layer models based on CNN and LSTM coupled with FastText embeddings perform on par with the BERT based models on the available datasets.
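A sketch of the kind of single-layer CNN baseline described (vocabulary size, dimensions, and class count are placeholders; in practice the embedding layer would be initialized with pretrained FastText vectors):
```python
# Single-layer CNN text classifier over word embeddings.
import torch
import torch.nn as nn

class CnnTextClassifier(nn.Module):
    def __init__(self, vocab_size=50_000, emb_dim=300, n_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # load FastText weights here
        self.conv = nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)       # (batch, emb, seq)
        x = torch.relu(self.conv(x)).max(dim=2).values  # max-over-time pooling
        return self.fc(x)                               # (batch, n_classes)
```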
arXiv Detail & Related papers (2021-01-13T06:21:27Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
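For reference, ROUGE-1 F1 (the metric those improvements are reported in) is unigram-overlap F1 between candidate and reference; a minimal version, without stemming or language-specific tokenization:
```python
# Minimal ROUGE-1 F1 on whitespace-tokenized text.
from collections import Counter

def rouge1_f1(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```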
arXiv Detail & Related papers (2020-10-18T00:21:53Z)