Indian Language Summarization using Pretrained Sequence-to-Sequence
Models
- URL: http://arxiv.org/abs/2303.14461v1
- Date: Sat, 25 Mar 2023 13:05:54 GMT
- Title: Indian Language Summarization using Pretrained Sequence-to-Sequence
Models
- Authors: Ashok Urlana, Sahil Manoj Bhatt, Nirmal Surange, Manish Shrivastava
- Abstract summary: The ILSUM task focuses on text summarization for two major Indian languages, Hindi and Gujarati, along with English.
We present a detailed overview of the models and our approaches in this paper.
- Score: 11.695648989161878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ILSUM shared task focuses on text summarization for two major
Indian languages, Hindi and Gujarati, along with English. In this task, we
experiment with various pretrained sequence-to-sequence models to identify the
best model for each language. We present a detailed overview of the models and
our approaches in this paper. Our systems secure the first rank across all
three sub-tasks (English, Hindi, and Gujarati). The paper also extensively
analyzes the impact of k-fold cross-validation when experimenting with limited
data, and we combine the original data with a filtered version in various
experiments to determine the efficacy of the pretrained models.
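As an illustration of the setup described in the abstract, the sketch below fine-tunes a pretrained multilingual sequence-to-sequence checkpoint on each fold of a k-fold split and averages ROUGE-L over the held-out folds. This is a minimal sketch, not the authors' released code: the checkpoint (google/mt5-small), the hyperparameters, and the toy article/summary pairs are assumptions standing in for the actual ILSUM data and systems.

```python
# Minimal sketch (not the authors' code): k-fold cross-validation with a
# pretrained multilingual seq2seq summarizer. The checkpoint, hyperparameters,
# and toy data are illustrative assumptions only.
import numpy as np
from datasets import Dataset
from rouge_score import rouge_scorer
from sklearn.model_selection import KFold
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Replace with the ILSUM article/summary pairs for the target language.
articles = ["toy article %d ..." % i for i in range(6)]
summaries = ["toy summary %d" % i for i in range(6)]

checkpoint = "google/mt5-small"
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
fold_scores = []

for train_idx, val_idx in KFold(n_splits=3, shuffle=True,
                                random_state=42).split(articles):
    # Fresh model per fold so folds do not leak into each other.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    def preprocess(batch):
        enc = tokenizer(batch["article"], max_length=512, truncation=True)
        enc["labels"] = tokenizer(text_target=batch["summary"],
                                  max_length=64, truncation=True)["input_ids"]
        return enc

    train = Dataset.from_dict(
        {"article": [articles[i] for i in train_idx],
         "summary": [summaries[i] for i in train_idx]}
    ).map(preprocess, batched=True)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="ckpt", num_train_epochs=1,
                                      per_device_train_batch_size=2),
        train_dataset=train,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()  # fine-tune on this fold's training split

    # Score the held-out fold with ROUGE-L F1 and keep the fold average.
    model.eval()
    rouge_l = []
    for i in val_idx:
        ids = tokenizer(articles[i], return_tensors="pt",
                        truncation=True, max_length=512).input_ids
        pred = tokenizer.decode(
            model.generate(ids.to(model.device), max_length=64)[0],
            skip_special_tokens=True)
        rouge_l.append(scorer.score(summaries[i], pred)["rougeL"].fmeasure)
    fold_scores.append(float(np.mean(rouge_l)))

print("mean ROUGE-L across folds:", np.mean(fold_scores))
```

With limited training data, averaging scores over folds gives a less noisy estimate of a checkpoint's quality than a single train/validation split, which is the setting the abstract's k-fold analysis addresses.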
Related papers
- CompoundPiece: Evaluating and Improving Decompounding Performance of
Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- UniMax: Fairer and more Effective Language Sampling for Large-Scale
Multilingual Pretraining [92.3702056505905]
We propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages.
We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases.
arXiv Detail & Related papers (2023-04-18T17:45:50Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost transfer learning method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks as a single sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
- A Survey of Recent Abstract Summarization Techniques [0.0]
We investigate the impact of pre-trained models on several Wikipedia datasets in English and Indonesian.
The most significant factors that influence ROUGE performance are coverage, density, and compression.
T5-Large, Pegasus-XSum, and ProphetNet-CNNDM provide the best summarization results.
arXiv Detail & Related papers (2021-04-15T20:01:34Z)
- Investigating Monolingual and Multilingual BERT Models for Vietnamese
Aspect Category Detection [0.0]
This paper investigates the performance of various monolingual pre-trained language models compared with multilingual models on the Vietnamese aspect category detection problem.
The experimental results demonstrate that the monolingual PhoBERT model outperforms the other models on two datasets.
To the best of our knowledge, this study is the first attempt to evaluate the various available pre-trained language models on the aspect category detection task.
arXiv Detail & Related papers (2021-03-17T09:04:03Z)
- Indic-Transformers: An Analysis of Transformer Language Models for
Indian Languages [0.8155575318208631]
Language models based on the Transformer architecture have achieved state-of-the-art performance on a wide range of NLP tasks.
However, this performance is usually tested and reported on high-resource languages, like English, French, Spanish, and German.
Indian languages, on the other hand, are underrepresented in such benchmarks.
arXiv Detail & Related papers (2020-11-04T14:43:43Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)