TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu
- URL: http://arxiv.org/abs/2404.11349v1
- Date: Wed, 17 Apr 2024 13:07:56 GMT
- Authors: Gopichand Kanumolu, Lokesh Madasu, Nirmal Surange, Manish Shrivastava
- Abstract summary: Relevance-based headline classification can greatly aid the task of generating relevant headlines.
We present TeClass, the first-ever human-annotated Telugu news headline classification dataset.
The headlines generated by models fine-tuned on highly relevant article-headline pairs showed an improvement of about 5 points in ROUGE-L scores.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: News headline generation is a crucial task in increasing productivity for both the readers and producers of news. This task can easily be aided by automated news headline generation models. However, the presence of irrelevant headlines in scraped news articles results in sub-optimal performance of generation models. We propose that relevance-based headline classification can greatly aid the task of generating relevant headlines. Relevance-based headline classification involves categorizing news headlines based on their relevance to the corresponding news articles. While this task is well-established in English, it remains under-explored in low-resource languages like Telugu due to a lack of annotated data. To address this gap, we present TeClass, the first-ever human-annotated Telugu news headline classification dataset, containing 78,534 annotations across 26,178 article-headline pairs. We experiment with various baseline models and provide a comprehensive analysis of their results. We further demonstrate the impact of this work by fine-tuning various headline generation models using the TeClass dataset. The headlines generated by models fine-tuned on highly relevant article-headline pairs showed an improvement of about 5 points in ROUGE-L scores. To encourage future research, the annotated dataset as well as the annotation guidelines will be made publicly available.
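The abstract reports gains in ROUGE-L, which scores a generated headline by the longest common subsequence (LCS) it shares with the reference. A minimal token-level sketch of ROUGE-L F1 follows; this is illustrative only, and the paper's evaluation presumably uses a standard ROUGE toolkit with Telugu-appropriate tokenization rather than whitespace splitting.

```python
# Minimal ROUGE-L F1 sketch (token-level longest common subsequence).
# Illustrative only; not the authors' evaluation pipeline.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if tok_a == tok_b else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between whitespace-tokenized candidate and reference."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
```

A "5 point increment" in this metric corresponds to a 0.05 increase in the F1 value (when reported on a 0-100 scale).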
Related papers
- BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion [1.2416206871977309]
Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect.
This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News), comprising religious news articles from prominent Bangladeshi online newspapers, along with MultiGen, a contextual multi-input feature-fusion headline generation approach.
arXiv Detail & Related papers (2025-01-02T05:34:21Z) - Headline-Guided Extractive Summarization for Thai News Articles [0.0]
We propose CHIMA, an extractive summarization model that incorporates the contextual information of the headline for Thai news articles.
Our model utilizes a pre-trained language model to capture complex language semantics and assigns a probability to each sentence to be included in the summary.
Experiments on publicly available Thai news datasets demonstrate that CHIMA outperforms baseline models across ROUGE, BLEU, and F1 scores.
arXiv Detail & Related papers (2024-12-02T15:43:10Z) - LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification [4.450536872346658]
We propose a teacher-student framework for developing multilingual news classification models of reasonable size.
The framework employs a Generative Pretrained Transformer (GPT) model as the teacher model to develop an IPTC Media Topic training dataset.
Student models achieve high performance comparable to the teacher model.
We publish the best-performing news topic classification model, enabling multilingual classification with the top-level categories of the IPTC Media Topic schema.
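The teacher-student pipeline described above can be sketched as: an LLM teacher assigns silver labels to unlabeled headlines, and the resulting pairs become the student's training set. Here `llm_label` is a hypothetical keyword placeholder standing in for a GPT call; it is not the paper's actual prompt or IPTC label schema.

```python
# Hedged sketch of an LLM teacher-student labeling pipeline.
# `llm_label` is a toy stand-in for prompting a GPT teacher model.

def llm_label(headline):
    """Placeholder teacher: a real pipeline would prompt an LLM to
    assign an IPTC Media Topic category to the headline."""
    return "sport" if "match" in headline else "politics"

def build_training_set(unlabeled_headlines):
    """Teacher phase: label raw headlines to produce silver training data."""
    return [(h, llm_label(h)) for h in unlabeled_headlines]

train = build_training_set([
    "parliament passes new budget",
    "city team wins the cup match",
])
print(train)
```

A smaller supervised classifier (the student) would then be fine-tuned on these silver-labeled pairs in place of manually annotated data.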
arXiv Detail & Related papers (2024-11-29T11:42:58Z) - Attention Sorting Combats Recency Bias In Long Context Language Models [69.06809365227504]
Current language models often fail to incorporate long contexts efficiently during generation.
We show that a major contributor to this issue is a set of attention priors likely learned during pre-training.
We leverage this fact to introduce "attention sorting": perform one step of decoding, sort documents by the attention they receive, repeat the process, and generate the answer with the newly sorted context.
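The loop described above can be sketched as follows. The model interface (`decode_one_step_attention`, `generate`) and the `ToyModel` stand-in are hypothetical placeholders, not the paper's implementation; a real version would read per-document attention mass from the language model's attention maps.

```python
# Hedged sketch of the "attention sorting" loop.
# The model API below is a hypothetical placeholder.

def attention_sort(model, documents, query, rounds=2):
    """Reorder documents by attention mass, then generate the answer.

    Each round: run one decoding step, measure how much attention each
    document receives, and re-sort so that high-attention documents move
    to the end of the context (countering recency bias).
    """
    order = list(documents)
    for _ in range(rounds):
        scores = model.decode_one_step_attention(order, query)
        # Ascending sort: most-attended documents placed last (most recent).
        order = [d for _, d in sorted(zip(scores, order), key=lambda p: p[0])]
    return model.generate(order, query)

class ToyModel:
    """Toy stand-in: 'attention' is keyword overlap with the query."""
    def decode_one_step_attention(self, docs, query):
        q = set(query.split())
        return [len(q & set(d.split())) for d in docs]
    def generate(self, docs, query):
        return docs[-1]  # pretend the model answers from the last document

docs = ["weather report for monday",
        "telugu news headline dataset",
        "stock prices today"]
print(attention_sort(ToyModel(), docs, "telugu headline classification"))
```

With the toy scorer, the most query-relevant document is moved to the end of the context before the final generation step.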
arXiv Detail & Related papers (2023-09-28T05:19:06Z) - WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions
from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - An Amharic News Text classification Dataset [0.0]
We introduce an Amharic text classification dataset consisting of more than 50k news articles categorized into 6 classes.
The dataset is released along with simple baseline results to encourage further studies and better-performing experiments.
arXiv Detail & Related papers (2021-03-10T16:36:39Z) - Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve Micro F1-score by 7% over current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z) - Be More with Less: Hypergraph Attention Networks for Inductive Text
Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated promising results on the canonical task of text classification.
Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interaction between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z) - Hooks in the Headline: Learning to Generate Headlines with Controlled
Styles [69.30101340243375]
We propose a new task, Stylistic Headline Generation (SHG), to enrich the headlines with three style options.
TitleStylist generates style-specific headlines by combining the summarization and reconstruction tasks into a multitasking framework.
The attraction score of our model's generated headlines surpasses that of the state-of-the-art summarization model by 9.68%, and even outperforms human-written references.
arXiv Detail & Related papers (2020-04-04T17:24:47Z) - Low resource language dataset creation, curation and classification:
Setswana and Sepedi -- Extended Abstract [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We propose baselines for classification, and investigate an approach on data augmentation better suited to low-resourced languages.
arXiv Detail & Related papers (2020-03-30T18:03:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.