Indian Language Summarization using Pretrained Sequence-to-Sequence
Models
- URL: http://arxiv.org/abs/2303.14461v1
- Date: Sat, 25 Mar 2023 13:05:54 GMT
- Title: Indian Language Summarization using Pretrained Sequence-to-Sequence
Models
- Authors: Ashok Urlana, Sahil Manoj Bhatt, Nirmal Surange, Manish Shrivastava
- Abstract summary: The ILSUM task focuses on text summarization for two major Indian languages, Hindi and Gujarati, along with English.
We present a detailed overview of the models and our approaches in this paper.
- Score: 11.695648989161878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ILSUM shared task focuses on text summarization for two major
Indian languages, Hindi and Gujarati, along with English. In this task, we
experiment with various pretrained sequence-to-sequence models to identify the
best model for each language. We present a detailed overview of the models and
our approaches in this paper. Our systems secure the first rank across all
three sub-tasks (English, Hindi, and Gujarati). The paper also extensively
analyzes the impact of k-fold cross-validation when experimenting with limited
data, and we combine the original data with a filtered version in various
experiments to determine the efficacy of the pretrained models.
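As an illustration of the setup described in the abstract, the sketch below fine-tunes a pretrained multilingual sequence-to-sequence checkpoint on each fold of a k-fold split and averages ROUGE-L over the held-out folds. This is a minimal sketch, not the authors' released code: the checkpoint (google/mt5-small), the hyperparameters, and the toy article/summary pairs are assumptions standing in for the actual ILSUM data and systems.

```python
# Minimal sketch (not the authors' code): k-fold cross-validation with a
# pretrained multilingual seq2seq summarizer. The checkpoint, hyperparameters,
# and toy data are illustrative assumptions only.
import numpy as np
from datasets import Dataset
from rouge_score import rouge_scorer
from sklearn.model_selection import KFold
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Replace with the ILSUM article/summary pairs for the target language.
articles = ["toy article %d ..." % i for i in range(6)]
summaries = ["toy summary %d" % i for i in range(6)]

checkpoint = "google/mt5-small"
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
fold_scores = []

for train_idx, val_idx in KFold(n_splits=3, shuffle=True,
                                random_state=42).split(articles):
    # Fresh model per fold so folds do not leak into each other.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    def preprocess(batch):
        enc = tokenizer(batch["article"], max_length=512, truncation=True)
        enc["labels"] = tokenizer(text_target=batch["summary"],
                                  max_length=64, truncation=True)["input_ids"]
        return enc

    train = Dataset.from_dict(
        {"article": [articles[i] for i in train_idx],
         "summary": [summaries[i] for i in train_idx]}
    ).map(preprocess, batched=True)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="ckpt", num_train_epochs=1,
                                      per_device_train_batch_size=2),
        train_dataset=train,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()  # fine-tune on this fold's training split

    # Score the held-out fold with ROUGE-L F1 and keep the fold average.
    model.eval()
    rouge_l = []
    for i in val_idx:
        ids = tokenizer(articles[i], return_tensors="pt",
                        truncation=True, max_length=512).input_ids
        pred = tokenizer.decode(
            model.generate(ids.to(model.device), max_length=64)[0],
            skip_special_tokens=True)
        rouge_l.append(scorer.score(summaries[i], pred)["rougeL"].fmeasure)
    fold_scores.append(float(np.mean(rouge_l)))

print("mean ROUGE-L across folds:", np.mean(fold_scores))
```

With limited training data, averaging scores over folds gives a less noisy estimate of a checkpoint's quality than a single train/validation split, which is the setting the abstract's k-fold analysis addresses.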
Related papers
- CompoundPiece: Evaluating and Improving Decompounding Performance of
Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- UniMax: Fairer and more Effective Language Sampling for Large-Scale
Multilingual Pretraining [92.3702056505905]
We propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages.
We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases.
arXiv Detail & Related papers (2023-04-18T17:45:50Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost transfer learning method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks as a single sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
- A Survey of Recent Abstract Summarization Techniques [0.0]
We investigate the impact of pre-trained models on several Wikipedia datasets in English and Indonesian.
The most significant factors that influence ROUGE performance are coverage, density, and compression.
T5-Large, Pegasus-XSum, and ProphetNet-CNNDM provide the best summarization results.
arXiv Detail & Related papers (2021-04-15T20:01:34Z)
- Investigating Monolingual and Multilingual BERT Models for Vietnamese
Aspect Category Detection [0.0]
This paper investigates the performance of various monolingual pre-trained language models compared with multilingual models on the Vietnamese aspect category detection problem.
The experimental results demonstrate that the monolingual PhoBERT model outperforms the other models on two datasets.
To the best of our knowledge, this study is the first attempt to evaluate the various available pre-trained language models on the aspect category detection task.
arXiv Detail & Related papers (2021-03-17T09:04:03Z)
- Indic-Transformers: An Analysis of Transformer Language Models for
Indian Languages [0.8155575318208631]
Language models based on the Transformer architecture have achieved state-of-the-art performance on a wide range of NLP tasks.
However, this performance is usually tested and reported on high-resource languages, like English, French, Spanish, and German.
Indian languages, on the other hand, are underrepresented in such benchmarks.
arXiv Detail & Related papers (2020-11-04T14:43:43Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)