Exploring News Summarization and Enrichment in a Highly Resource-Scarce Indian Language: A Case Study of Mizo
- URL: http://arxiv.org/abs/2405.00717v1
- Date: Thu, 25 Apr 2024 17:23:04 GMT
- Title: Exploring News Summarization and Enrichment in a Highly Resource-Scarce Indian Language: A Case Study of Mizo
- Authors: Abhinaba Bala, Ashok Urlana, Rahul Mishra, Parameswari Krishnamurthy,
- Abstract summary: We conduct a study to investigate the effectiveness of a simple methodology designed to generate a holistic summary for Mizo news articles.
We make available 500 Mizo news articles and corresponding enriched holistic summaries.
Human evaluation confirms that our approach significantly enhances the information coverage of Mizo news articles.
- Score: 7.393476206148905
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Obtaining sufficient information in one's mother tongue is crucial for satisfying the information needs of the users. While high-resource languages have abundant online resources, the situation is less than ideal for very low-resource languages. Moreover, the insufficient reporting of vital national and international events continues to be a worry, especially in languages with scarce resources, like \textbf{Mizo}. In this paper, we conduct a study to investigate the effectiveness of a simple methodology designed to generate a holistic summary for Mizo news articles, which leverages English-language news to supplement and enhance the information related to the corresponding news events. Furthermore, we make available 500 Mizo news articles and corresponding enriched holistic summaries. Human evaluation confirms that our approach significantly enhances the information coverage of Mizo news articles. The mizo dataset and code can be accessed at \url{https://github.com/barvin04/mizo_enrichment
Related papers
- SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods [1.2091341579150698]
We release datasets of sentences containing polysemous words across ten low-resource languages.<n>To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method.<n>Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation.
arXiv Detail & Related papers (2025-05-29T17:48:08Z) - Detection of Somali-written Fake News and Toxic Messages on the Social Media Using Transformer-based Language Models [0.0]
This paper presents part of our ongoing research work to bridge some of these gaps for the Somali language.
We developed a transformer-based monolingual Somali language model (named SomBERTa)
SomBERTa is fine-tuned and evaluated on toxic content, fake news and news topic classification datasets.
arXiv Detail & Related papers (2025-03-23T15:45:31Z) - LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages.
By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z) - Monolingual and Multilingual Misinformation Detection for Low-Resource Languages: A Comprehensive Survey [2.5459710368096586]
This survey provides a comprehensive overview of the current research on low-resource language misinformation detection.
We review the existing datasets, methodologies, and tools used in these domains, identifying key challenges related to: data resources, model development, cultural and linguistic context, real-world applications, and research efforts.
Our findings underscore the need for robust, inclusive systems capable of addressing misinformation across diverse linguistic and cultural contexts.
arXiv Detail & Related papers (2024-10-24T03:02:03Z) - MMCFND: Multimodal Multilingual Caption-aware Fake News Detection for Low-resource Indic Languages [0.4062349563818079]
We introduce the Multimodal Multilingual dataset for Indic Fake News Detection (MMIFND)
This meticulously curated dataset consists of 28,085 instances distributed across Hindi, Bengali, Marathi, Malayalam, Tamil, Gujarati and Punjabi.
We propose the Multimodal Caption-aware framework for Fake News Detection (MMCFND)
arXiv Detail & Related papers (2024-10-14T11:59:33Z) - A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide.
It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
arXiv Detail & Related papers (2024-03-28T12:08:39Z) - Zero-shot Sentiment Analysis in Low-Resource Languages Using a
Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Identifying Informational Sources in News Articles [109.70475599552523]
We build the largest and widest-ranging annotated dataset of informational sources used in news writing.
We introduce a novel task, source prediction, to study the compositionality of sources in news articles.
arXiv Detail & Related papers (2023-05-24T08:56:35Z) - Resources for Turkish Natural Language Processing: A critical survey [0.0]
We review a broad range of resources, focusing on the ones that are publicly available.
We present a set of recommendations, and identify gaps in the data available for conducting research and building applications in Turkish Linguistics and Natural Language Processing.
arXiv Detail & Related papers (2022-04-11T12:23:07Z) - Toward More Meaningful Resources for Lower-resourced Languages [2.3513645401551333]
We examine the contents of the names stored in Wikidata for a few lower-resourced languages.
We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand annotated data.
We conclude with recommended guidelines for resource development.
arXiv Detail & Related papers (2022-02-24T18:39:57Z) - MetaXL: Meta Representation Transformation for Low-resource
Cross-lingual Learning [91.5426763812547]
Cross-lingual transfer learning is one of the most effective methods for building functional NLP systems for low-resource languages.
We propose MetaXL, a meta-learning based framework that learns to transform representations judiciously from auxiliary languages to a target one.
arXiv Detail & Related papers (2021-04-16T06:15:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.