Exploring News Summarization and Enrichment in a Highly Resource-Scarce Indian Language: A Case Study of Mizo
- URL: http://arxiv.org/abs/2405.00717v1
- Date: Thu, 25 Apr 2024 17:23:04 GMT
- Title: Exploring News Summarization and Enrichment in a Highly Resource-Scarce Indian Language: A Case Study of Mizo
- Authors: Abhinaba Bala, Ashok Urlana, Rahul Mishra, Parameswari Krishnamurthy,
- Abstract summary: We conduct a study to investigate the effectiveness of a simple methodology designed to generate a holistic summary for Mizo news articles.
We make available 500 Mizo news articles and corresponding enriched holistic summaries.
Human evaluation confirms that our approach significantly enhances the information coverage of Mizo news articles.
- Score: 7.393476206148905
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Obtaining sufficient information in one's mother tongue is crucial for satisfying the information needs of the users. While high-resource languages have abundant online resources, the situation is less than ideal for very low-resource languages. Moreover, the insufficient reporting of vital national and international events continues to be a worry, especially in languages with scarce resources, like \textbf{Mizo}. In this paper, we conduct a study to investigate the effectiveness of a simple methodology designed to generate a holistic summary for Mizo news articles, which leverages English-language news to supplement and enhance the information related to the corresponding news events. Furthermore, we make available 500 Mizo news articles and corresponding enriched holistic summaries. Human evaluation confirms that our approach significantly enhances the information coverage of Mizo news articles. The mizo dataset and code can be accessed at \url{https://github.com/barvin04/mizo_enrichment
Related papers
- Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z) - A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide.
It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
arXiv Detail & Related papers (2024-03-28T12:08:39Z) - Bangla AI: A Framework for Machine Translation Utilizing Large Language
Models for Ethnic Media [0.0]
Ethnic media caters to diaspora communities in host nations.
Rather than utilizing the language of the host nation, ethnic media delivers news in the language of the immigrant community.
This research delves into the prospective integration of large language models (LLM) and multi-lingual machine translations (MMT) within the ethnic media industry.
arXiv Detail & Related papers (2024-02-21T23:43:04Z) - Zero-shot Sentiment Analysis in Low-Resource Languages Using a
Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Identifying Informational Sources in News Articles [109.70475599552523]
We build the largest and widest-ranging annotated dataset of informational sources used in news writing.
We introduce a novel task, source prediction, to study the compositionality of sources in news articles.
arXiv Detail & Related papers (2023-05-24T08:56:35Z) - Resources for Turkish Natural Language Processing: A critical survey [0.0]
We review a broad range of resources, focusing on the ones that are publicly available.
We present a set of recommendations, and identify gaps in the data available for conducting research and building applications in Turkish Linguistics and Natural Language Processing.
arXiv Detail & Related papers (2022-04-11T12:23:07Z) - Toward More Meaningful Resources for Lower-resourced Languages [2.3513645401551333]
We examine the contents of the names stored in Wikidata for a few lower-resourced languages.
We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand annotated data.
We conclude with recommended guidelines for resource development.
arXiv Detail & Related papers (2022-02-24T18:39:57Z) - MetaXL: Meta Representation Transformation for Low-resource
Cross-lingual Learning [91.5426763812547]
Cross-lingual transfer learning is one of the most effective methods for building functional NLP systems for low-resource languages.
We propose MetaXL, a meta-learning based framework that learns to transform representations judiciously from auxiliary languages to a target one.
arXiv Detail & Related papers (2021-04-16T06:15:52Z) - BanFakeNews: A Dataset for Detecting Fake News in Bangla [1.4170999534105675]
We propose an annotated dataset of 50K news that can be used for building automated fake news detection systems.
We develop a benchmark system with state of the art NLP techniques to identify Bangla fake news.
arXiv Detail & Related papers (2020-04-19T07:42:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.