Low Resource Summarization using Pre-trained Language Models
- URL: http://arxiv.org/abs/2310.02790v1
- Date: Wed, 4 Oct 2023 13:09:39 GMT
- Title: Low Resource Summarization using Pre-trained Language Models
- Authors: Mubashir Munaf, Hammad Afzal, Naima Iltaf, Khawir Mahmood
- Abstract summary: We propose a methodology for adapting self-attentive transformer-based architecture models (mBERT, mT5) for low-resource summarization.
Our adapted summarization model urT5 can capture contextual information of a low-resource language effectively, with evaluation scores (up to 46.35 ROUGE-1, 77 BERTScore) on par with state-of-the-art models for the high-resource language English.
- Score: 1.26404863283601
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the advent of Deep Learning-based Artificial Neural Network models,
Natural Language Processing (NLP) has witnessed significant improvements in
textual data processing in terms of efficiency and accuracy. However, this
research is mostly restricted to high-resource languages such as English, while
low-resource languages still lack available resources such as training datasets
and models with even baseline evaluation results.
Considering the limited availability of resources for low-resource languages,
we propose a methodology for adapting self-attentive transformer-based
architecture models (mBERT, mT5) for low-resource summarization, supplemented
by the construction of a new baseline dataset (76.5k article-summary pairs) in
a low-resource language, Urdu. Choosing news (a publicly available source) as
the application domain has the potential to make the proposed methodology
reproducible in other languages with limited resources. Our adapted
summarization model urT5, with up to a 44.78% reduction in size compared to
mT5, can capture contextual information of a low-resource language effectively,
with evaluation scores (up to 46.35 ROUGE-1, 77 BERTScore) on par with
state-of-the-art models for the high-resource language English
(PEGASUS: 47.21, BART: 45.14 on the XSUM dataset). The proposed method
provides a baseline approach to both extractive and abstractive summarization,
with competitive evaluation results in a limited-resource setup.
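The abstract does not spell out the paper's training recipe (including the vocabulary reduction that yields the roughly 44.78% smaller urT5), so the following is only a minimal sketch of fine-tuning an mT5 checkpoint for abstractive summarization with Hugging Face Transformers; the dataset file `urdu_news_summaries.json` and all hyperparameters are hypothetical placeholders.

```python
# Minimal sketch (not the paper's actual setup): fine-tune mT5 for abstractive
# summarization in a low-resource language. Dataset path and hyperparameters
# are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical JSON file with {"article": ..., "summary": ...} records.
data = load_dataset("json", data_files="urdu_news_summaries.json")["train"]
data = data.train_test_split(test_size=0.1)

def preprocess(batch):
    model_inputs = tokenizer(batch["article"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = data.map(preprocess, batched=True,
                     remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="urt5-style-summarizer",   # placeholder output directory
    learning_rate=3e-4,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

Generated summaries can then be scored with the `rouge` and `bertscore` metrics from the `evaluate` library to obtain numbers comparable to the ROUGE-1 and BERTScore values quoted above.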
Related papers
- Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition [2.7247388777405597]
We introduce a novel application of weighted cross-entropy, typically used for unbalanced datasets.
We fine-tune the Whisper multilingual ASR model on five high-resource languages and one low-resource language.
arXiv Detail & Related papers (2024-09-25T14:09:09Z)
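As a rough illustration of the weighted cross-entropy idea in the entry above (the exact weighting scheme used for the Whisper fine-tuning is not given here), a minimal PyTorch sketch with made-up class weights:

```python
import torch
import torch.nn as nn

# Illustrative only: upweight the single low-resource language (index 5)
# relative to five high-resource languages, so that its examples contribute
# more to the loss during fine-tuning.
class_weights = torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0, 4.0])  # made-up weights
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 6)            # batch of 8 examples, 6 language classes
targets = torch.randint(0, 6, (8,))   # gold labels
loss = criterion(logits, targets)     # mistakes on class 5 cost 4x as much
```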
- Chain-of-Translation Prompting (CoTR): A Novel Prompting Technique for Low Resource Languages [0.4499833362998489]
Chain of Translation Prompting (CoTR) is a novel strategy designed to enhance the performance of language models in low-resource languages.
CoTR restructures prompts to first translate the input context from a low-resource language into a higher-resource language, such as English.
We demonstrate the effectiveness of this method through a case study on the low-resource Indic language Marathi.
arXiv Detail & Related papers (2024-09-06T17:15:17Z)
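A minimal sketch of the prompt restructuring described in the CoTR entry above (translate the low-resource input into English, then perform the task on the translation); the template wording below is an assumption, not the paper's exact prompt:

```python
def chain_of_translation_prompt(text: str, source_language: str, task: str) -> str:
    """Build a CoTR-style prompt: translate first, then solve the task.
    The wording is illustrative, not the paper's exact template."""
    return (
        f"Text ({source_language}): {text}\n"
        f"Step 1: Translate the text above from {source_language} into English.\n"
        f"Step 2: Using the English translation, {task}\n"
        f"Answer:"
    )

# Example usage with a placeholder Marathi input.
prompt = chain_of_translation_prompt(
    text="...",  # a Marathi sentence
    source_language="Marathi",
    task="classify the sentiment as positive, negative, or neutral.",
)
```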
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
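As a sketch of the basic idea behind the model-merging entry above only (the paper's merging method may be more sophisticated than uniform weight interpolation), two checkpoints sharing the Llama-2-7B architecture, one adapted to the target language and one tuned for the task, can be combined by averaging their parameters; the checkpoint paths below are placeholders:

```python
from transformers import AutoModelForCausalLM

# Placeholder checkpoint paths; both models must share the same architecture.
model_lang = AutoModelForCausalLM.from_pretrained("path/to/language-adapted-llama2")
model_task = AutoModelForCausalLM.from_pretrained("path/to/task-tuned-llama2")

state_lang = model_lang.state_dict()
state_task = model_task.state_dict()

# Uniform interpolation of floating-point parameters; non-float buffers are
# taken from the language-adapted model unchanged.
merged = {
    name: 0.5 * state_lang[name] + 0.5 * state_task[name]
    if state_lang[name].is_floating_point() else state_lang[name]
    for name in state_lang
}

model_lang.load_state_dict(merged)
model_lang.save_pretrained("merged-llama2")  # placeholder output directory
```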
- TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale [66.01943465390548]
We introduce TriSum, a framework for distilling large language models' text summarization abilities into a compact, local model.
Our method enhances local model performance on various benchmarks.
It also improves interpretability by providing insights into the summarization rationale.
arXiv Detail & Related papers (2024-03-15T14:36:38Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models [4.168157981135698]
We show how knowledge can be distilled from Large Language Models (LLMs) to improve upon learned metrics without requiring human annotators.
We show that the performance of a BLEURT-like model on lower resource languages can be improved in this way.
arXiv Detail & Related papers (2023-02-07T14:35:35Z)
- Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation [21.057178077747754]
In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval.
By separating the cross-lingual knowledge from knowledge of query document matching, OPTICAL only needs bitext data for distillation training.
Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages.
arXiv Detail & Related papers (2023-01-29T22:30:36Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
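For context on the new-scripts entry above: the generic baseline that such data-efficient methods improve on is to extend the shared vocabulary with tokens for the unseen script and resize the embedding matrix before continued training on target-language text; the paper's own embedding-initialization techniques are not reproduced here. A minimal sketch with placeholder tokens:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Generic vocabulary-extension baseline (not the paper's method): add tokens
# for a script mBERT has not seen, then resize the embedding matrix so the new
# rows can be trained on target-language text.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

new_tokens = ["<new_script_token_1>", "<new_script_token_2>"]  # placeholders
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; embedding matrix now has {len(tokenizer)} rows")
```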
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Tackling the Low-resource Challenge for Canonical Segmentation [23.17111619633273]
Canonical morphological segmentation consists of dividing words into their standardized morphemes.
We explore two new models for the task, borrowing from the closely related area of morphological generation.
We find that, in the low-resource setting, the novel approaches outperform existing ones on all languages by up to 11.4% accuracy.
arXiv Detail & Related papers (2020-10-06T15:15:05Z)
- Improving Candidate Generation for Low-resource Cross-lingual Entity Linking [81.41804263432684]
Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts.
In this paper, we propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios.
arXiv Detail & Related papers (2020-03-03T05:32:09Z)