BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion
- URL: http://arxiv.org/abs/2501.01069v1
- Date: Thu, 02 Jan 2025 05:34:21 GMT
- Title: BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion
- Authors: Md Osama, Ashim Dey, Kawsar Ahmed, Muhammad Ashad Kabir
- Abstract summary: Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect.
This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News), comprising religious news articles from prominent Bangladeshi online newspapers, and MultiGen, a contextual multi-input feature fusion headline generation approach.
- Score: 1.2416206871977309
- Abstract: Automatic text summarization, particularly headline generation, remains a critical yet underexplored area for Bengali religious news. Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect. This limitation significantly hinders their effectiveness and overall performance. This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News) - comprising religious news articles from prominent Bangladeshi online newspapers, and MultiGen - a contextual multi-input feature fusion headline generation approach. Leveraging transformer-based pre-trained language models such as BanglaT5, mBART, mT5, and mT0, MultiGen integrates additional contextual features - including category, aspect, and sentiment - with the news content. This fusion enables the model to capture critical contextual information often overlooked by traditional methods. Experimental results demonstrate the superiority of MultiGen over the baseline approach that uses only news content, achieving a BLEU score of 18.61 and ROUGE-L score of 24.19, compared to baseline approach scores of 16.08 and 23.08, respectively. These findings underscore the importance of incorporating contextual features in headline generation for low-resource languages. By bridging linguistic and cultural gaps, this research advances natural language processing for Bengali and other underrepresented languages. To promote reproducibility and further exploration, the dataset and implementation code are publicly accessible at https://github.com/akabircs/BeliN.
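The abstract says that MultiGen integrates category, aspect, and sentiment with the news content, but it does not spell out the fusion mechanism. The sketch below shows one simple possibility, assuming the features are prepended to the article as labeled text before being passed to a sequence-to-sequence model. The names `NewsItem` and `fuse_inputs`, the field labels, and the separator are hypothetical and are not taken from the authors' code; the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class NewsItem:
    """One article with its contextual features (hypothetical schema)."""
    content: str
    category: str
    aspect: str
    sentiment: str


def fuse_inputs(item: NewsItem, sep: str = " | ") -> str:
    """Build a single model input by prepending labeled features to the content.

    Assumption: fusion is done at the text level. The actual MultiGen
    approach may combine the features differently.
    """
    parts = [
        f"category: {item.category}",
        f"aspect: {item.aspect}",
        f"sentiment: {item.sentiment}",
        f"content: {item.content}",
    ]
    return sep.join(parts)


def baseline_input(item: NewsItem) -> str:
    """Content-only input, mirroring the baseline described in the abstract."""
    return item.content


if __name__ == "__main__":
    # Illustrative, made-up example values.
    item = NewsItem(
        content="Body text of the article",
        category="festival",
        aspect="event",
        sentiment="positive",
    )
    print(fuse_inputs(item))
    print(baseline_input(item))
```

The fused string could then be tokenized and fed to any of the pre-trained models named in the abstract, while `baseline_input` corresponds to the content-only setting that the reported scores are compared against.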
Related papers
- Headline-Guided Extractive Summarization for Thai News Articles [0.0]
We propose CHIMA, an extractive summarization model that incorporates the contextual information of the headline for Thai news articles.
Our model utilizes a pre-trained language model to capture complex language semantics and assigns a probability to each sentence to be included in the summary.
Experiments on publicly available Thai news datasets demonstrate that CHIMA outperforms baseline models across ROUGE, BLEU, and F1 scores.
arXiv Detail & Related papers (2024-12-02T15:43:10Z)
- TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu [4.272315504476224]
Relevance-based headline classification can greatly aid the task of generating relevant headlines.
We present TeClass, the first-ever human-annotated Telugu news headline classification dataset.
Headlines generated by models fine-tuned on highly relevant article-headline pairs showed an increase of about 5 points in ROUGE-L scores.
arXiv Detail & Related papers (2024-04-17T13:07:56Z)
- A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide.
It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
arXiv Detail & Related papers (2024-03-28T12:08:39Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text [82.5469544192645]
We propose a novel training-free detection strategy called Divergent N-Gram Analysis (DNA-GPT).
By analyzing the differences between the original and new remaining parts through N-gram analysis, we unveil significant discrepancies between the distribution of machine-generated text and human-written text.
Results show that our zero-shot approach exhibits state-of-the-art performance in distinguishing between human and GPT-generated text.
arXiv Detail & Related papers (2023-05-27T03:58:29Z)
- BenCoref: A Multi-Domain Dataset of Nominal Phrases and Pronominal Reference Annotations [0.0]
We introduce a new dataset, BenCoref, comprising coreference annotations for Bengali texts gathered from four distinct domains.
This relatively small dataset contains 5200 mention annotations forming 502 mention clusters within 48,569 tokens.
arXiv Detail & Related papers (2023-04-07T15:08:46Z)
- A Survey on Retrieval-Augmented Text Generation [53.04991859796971]
Retrieval-augmented text generation has remarkable advantages and has achieved state-of-the-art performance in many NLP tasks.
This survey first highlights the generic paradigm of retrieval-augmented generation, then reviews notable approaches according to different tasks.
arXiv Detail & Related papers (2022-02-02T16:18:41Z)
- A Framework for Neural Topic Modeling of Text Corpora [6.340447411058068]
We introduce FAME, an open-source framework enabling an efficient mechanism of extracting and incorporating textual features.
To demonstrate the effectiveness of this library, we conducted experiments on the well-known News-Group dataset.
arXiv Detail & Related papers (2021-08-19T23:32:38Z)
- HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text [1.6675267471157407]
We present a corpus (HinGE) for Hinglish, a widely popular code-mixed language (code-mixing of Hindi and English).
HinGE has Hinglish sentences generated by humans as well as two rule-based algorithms corresponding to the parallel Hindi-English sentences.
In addition, we demonstrate the inefficacy of widely-used evaluation metrics on the code-mixed data.
arXiv Detail & Related papers (2021-07-08T11:11:37Z)
- Deep Learning for Text Style Transfer: A Survey [71.8870854396927]
Text style transfer is an important task in natural language generation, which aims to control certain attributes in the generated text.
We present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017.
We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data.
arXiv Detail & Related papers (2020-11-01T04:04:43Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)