SeLeRoSa: Sentence-Level Romanian Satire Detection Dataset
- URL: http://arxiv.org/abs/2509.00893v2
- Date: Wed, 15 Oct 2025 13:56:03 GMT
- Title: SeLeRoSa: Sentence-Level Romanian Satire Detection Dataset
- Authors: Răzvan-Alexandru Smădu, Andreea Iuga, Dumitru-Clementin Cercel, Florin Pop
- Abstract summary: We introduce the first sentence-level dataset for Romanian satire detection for news articles, called SeLeRoSa. The dataset comprises 13,873 manually annotated sentences spanning various domains, including social issues, IT, science, and movies.
- Score: 2.709981170021896
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Satire, irony, and sarcasm are techniques typically used to express humor and critique, rather than deceive; however, they can occasionally be mistaken for factual reporting, akin to fake news. These techniques can be applied at a more granular level, allowing satirical information to be incorporated into news articles. In this paper, we introduce the first sentence-level dataset for Romanian satire detection for news articles, called SeLeRoSa. The dataset comprises 13,873 manually annotated sentences spanning various domains, including social issues, IT, science, and movies. With the rise and recent progress of large language models (LLMs) in the natural language processing literature, LLMs have demonstrated enhanced capabilities to tackle various tasks in zero-shot settings. We evaluate multiple baseline models based on LLMs in both zero-shot and fine-tuning settings, as well as baseline transformer-based models. Our findings reveal the current limitations of these models in the sentence-level satire detection task, paving the way for new research directions.
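The zero-shot LLM baselines described above can be illustrated with a minimal sketch. The prompt wording, label names, and the `call_llm` client below are assumptions for illustration, not the paper's exact setup; `call_llm` stands in for any chat-completion API.

```python
# Illustrative sketch of zero-shot, sentence-level satire classification.
# Prompt wording and label parsing are hypothetical, not the paper's setup.

def build_prompt(sentence: str) -> str:
    """Wrap a single Romanian sentence in a zero-shot classification prompt."""
    return (
        "You are given one sentence from a Romanian news article.\n"
        "Answer with exactly one word: SATIRE if the sentence is satirical, "
        "REGULAR otherwise.\n\n"
        f"Sentence: {sentence}\nAnswer:"
    )

def parse_label(response: str) -> int:
    """Map the model's free-form answer to a binary label (1 = satire)."""
    first_word = response.strip().upper().split()[0]
    return 1 if "SATIRE" in first_word else 0

def classify(sentence: str, call_llm) -> int:
    """Classify one sentence via an injected LLM client (no fine-tuning)."""
    return parse_label(call_llm(build_prompt(sentence)))

# Example with a stubbed model that always answers "REGULAR":
label = classify("Guvernul a anunțat un nou buget.", lambda prompt: "REGULAR")
```

Because the model is injected as a callable, the same harness can compare several LLMs on the sentence-level task without changing the classification logic.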
Related papers
- SaRoHead: Detecting Satire in a Multi-Domain Romanian News Headline Dataset [3.1208433686641666]
Even the headline must reflect the tone of the satirical main content.
Current approaches for the Romanian language detect the tone by combining the main article and the headline.
arXiv Detail & Related papers (2025-04-10T10:03:29Z)
- Make Satire Boring Again: Reducing Stylistic Bias of Satirical Corpus by Utilizing Generative LLMs [0.0]
This study proposes a debiasing approach for satire detection, focusing on reducing biases in training data by utilizing generative large language models.
Results show that the debiasing method enhances the robustness and generalizability of the models for satire and irony detection tasks in Turkish and English.
This work curates and presents the Turkish Satirical News dataset with detailed human annotations, with case studies on classification, debiasing, and explainability.
arXiv Detail & Related papers (2024-12-12T12:57:55Z)
- NewsEdits 2.0: Learning the Intentions Behind Updating News [74.84017890548259]
As events progress, news articles often update with new information: if we are not cautious, we risk propagating outdated facts.
In this work, we hypothesize that linguistic features indicate factual fluidity, and that we can predict which facts in a news article will update using solely the text of a news article.
arXiv Detail & Related papers (2024-11-27T23:35:23Z)
- AI "News" Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian [18.410994374810105]
Large Language Models (LLMs) are increasingly used as "content farm" models (CFMs) to generate synthetic text that could pass for real news articles.
We show that fine-tuning Llama (v1), mostly trained on English, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic.
arXiv Detail & Related papers (2024-06-17T22:19:00Z)
- TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data [50.40191599304911]
This paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI).
TransMI can create strong baselines for data that is transliterated into a common script by exploiting an existing mPLM and its tokenizer without any training.
Our experiments demonstrate that TransMI not only preserves the mPLM's ability to handle non-transliterated data, but also enables it to effectively process transliterated data, thereby facilitating crosslingual transfer across scripts.
arXiv Detail & Related papers (2024-05-16T09:08:09Z)
- MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer).
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines.
arXiv Detail & Related papers (2024-01-09T21:09:07Z)
- Fake News in Sheep's Clothing: Robust Fake News Detection Against LLM-Empowered Style Attacks [60.14025705964573]
SheepDog is a style-robust fake news detector that prioritizes content over style in determining news veracity.
SheepDog achieves this resilience through (1) LLM-empowered news reframings that inject style diversity into the training process by customizing articles to match different styles; (2) a style-agnostic training scheme that ensures consistent veracity predictions across style-diverse reframings; and (3) content-focused attributions that distill content-centric guidelines from LLMs for debunking fake news.
arXiv Detail & Related papers (2023-10-16T21:05:12Z)
- "According to ...": Prompting Language Models Improves Quoting from Pre-Training Data [52.03853726206584]
Large Language Models (LLMs) may hallucinate and generate fake information, despite pre-training on factual data.
We propose according-to prompting: directing LLMs to ground responses against previously observed text.
To quantify this grounding, we propose a novel evaluation metric (QUIP-Score) that measures the extent to which model-produced answers are directly found in underlying text corpora.
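A grounding metric of this kind can be approximated with n-gram overlap against a reference corpus. The sketch below is a simplified word-level approximation, not the paper's exact QUIP-Score formulation; the function names and the choice of n are assumptions.

```python
# Simplified, word-level approximation of a grounding metric in the spirit
# of QUIP-Score: the fraction of an answer's n-grams found verbatim in a
# reference corpus. Function names and n=3 are illustrative assumptions.
from typing import Iterable, List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> Iterable[Tuple[str, ...]]:
    """Yield all contiguous n-grams of a token list."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_ngram_set(corpus: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Precompute the set of n-grams attested anywhere in the corpus."""
    seen: Set[Tuple[str, ...]] = set()
    for doc in corpus:
        seen.update(ngrams(doc.lower().split(), n))
    return seen

def grounding_precision(answer: str, corpus_set: Set[Tuple[str, ...]],
                        n: int = 3) -> float:
    """Fraction of the answer's n-grams found verbatim in the corpus
    (0.0 when the answer is shorter than n tokens)."""
    grams = list(ngrams(answer.lower().split(), n))
    if not grams:
        return 0.0
    return sum(g in corpus_set for g in grams) / len(grams)
```

In practice a corpus-scale version would use a compact membership structure (e.g. a Bloom filter) rather than an in-memory set, but the precision computation stays the same.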
arXiv Detail & Related papers (2023-05-22T17:25:24Z)
- DPIC: Decoupling Prompt and Intrinsic Characteristics for LLM Generated Text Detection [56.513637720967566]
Large language models (LLMs) can generate texts that pose risks of misuse, such as plagiarism, planting fake reviews on e-commerce platforms, or creating inflammatory false tweets.
Existing high-quality detection methods usually require access to the interior of the model to extract the intrinsic characteristics.
We propose to extract deep intrinsic characteristics of the black-box model generated texts.
arXiv Detail & Related papers (2023-05-21T17:26:16Z)
- NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge [122.37011526554403]
NewsEdits is the first publicly available dataset of news revision histories.
It contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources.
arXiv Detail & Related papers (2022-06-14T18:47:13Z)
- A Multi-Modal Method for Satire Detection using Textual and Visual Cues [5.147194328754225]
Satire is a form of humorous critique, but it is sometimes misinterpreted by readers as legitimate news.
We observe that the images used in satirical news articles often contain absurd or ridiculous content.
We propose a multi-modal approach based on the state-of-the-art visiolinguistic model ViLBERT.
arXiv Detail & Related papers (2020-10-13T20:08:29Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this being integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Satirical News Detection with Semantic Feature Extraction and Game-theoretic Rough Sets [5.326582776477692]
We propose a semantic-feature-based approach to detect satirical news tweets.
Features are extracted by exploring inconsistencies in phrases, entities, and between main and relative clauses.
We apply a game-theoretic rough set model to detect satirical news, in which probabilistic thresholds are derived by game equilibrium and a repetition learning mechanism.
arXiv Detail & Related papers (2020-04-08T03:22:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.