SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations
- URL: http://arxiv.org/abs/2511.07405v1
- Date: Mon, 10 Nov 2025 18:54:40 GMT
- Title: SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations
- Authors: Manon Berriche, Célia Nouri, Chloé Clavel, Jean-Philippe Cointet
- Abstract summary: SPOT is the first annotated corpus translating the sociological concept of stopping point into a reproducible NLP task. The corpus contains 43,305 manually annotated French Facebook comments linked to URLs flagged as false information by social media users. We benchmark fine-tuned encoder models (CamemBERT) and instruction-tuned LLMs under various prompting strategies.
- Score: 10.409447852574907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce SPOT (Stopping Points in Online Threads), the first annotated corpus translating the sociological concept of stopping point into a reproducible NLP task. Stopping points are ordinary critical interventions that pause or redirect online discussions through a range of forms (irony, subtle doubt, or fragmentary arguments) that frameworks like counterspeech or social correction often overlook. We operationalize this concept as a binary classification task and provide reliable annotation guidelines. The corpus contains 43,305 manually annotated French Facebook comments linked to URLs flagged as false information by social media users, enriched with contextual metadata (article, post, parent comment, page or group, and source). We benchmark fine-tuned encoder models (CamemBERT) and instruction-tuned LLMs under various prompting strategies. Results show that fine-tuned encoders outperform prompted LLMs in F1 score by more than 10 percentage points, confirming the importance of supervised learning for emerging non-English social media tasks. Incorporating contextual metadata further improves encoder models' F1 scores from 0.75 to 0.78. We release the anonymized dataset, along with the annotation guidelines and code in our code repository, to foster transparency and reproducible research.
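The contextual enrichment described in the abstract (article, post, parent comment, page or group, source) can be pictured as simple input concatenation before encoder fine-tuning. The sketch below is illustrative only: the field names, the ordering, and the `</s>` separator are assumptions for exposition, not taken from the paper's released code.

```python
# Hypothetical sketch: concatenating contextual metadata with a comment
# to form the input string for a fine-tuned encoder (e.g. CamemBERT).
# Field names, ordering, and the separator are illustrative assumptions.

SEP = " </s> "  # separator written as plain text between context fields


def build_input(comment: str, context: dict) -> str:
    """Prepend whichever context fields are available (source,
    page or group, article, post, parent comment) to the comment,
    joined by the separator, so the encoder sees context first."""
    field_order = ["source", "page_or_group", "article", "post", "parent_comment"]
    parts = [context[f] for f in field_order if context.get(f)]
    parts.append(comment)
    return SEP.join(parts)


# Example: a comment enriched with a subset of its available context.
example = build_input(
    "Vraiment ? La source dit le contraire...",  # comment to classify
    {
        "source": "Le Monde",
        "post": "Un vaccin provoque des effets graves, selon un rapport.",
        "parent_comment": "C'est la preuve qu'on nous ment !",
    },
)
```

The resulting string would then be tokenized and passed to a binary sequence classifier; fields absent for a given comment are simply skipped, so the same function covers the comment-only and context-enriched settings.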
Related papers
- SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space [11.534994345027362]
Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation. We introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. We introduce SPARTA, a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder.
arXiv Detail & Related papers (2025-10-28T14:09:05Z)
- Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization [6.057968525653529]
We propose a rubric-guided, pseudo-labeled, and prompt-driven zero-shot video summarization framework. A small subset of human annotations is converted into high-confidence pseudo labels. During inference, boundary scenes are scored independently based on their own descriptions.
arXiv Detail & Related papers (2025-10-20T12:54:32Z)
- Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions [81.33113485830711]
We introduce a vision-free, single-encoder retrieval pipeline for vision-language models. We migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. Our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks.
arXiv Detail & Related papers (2025-09-23T16:22:27Z)
- Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems [0.0]
We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate annotations: a generative language model used for spell correction and an encoder-based token classifier trained to identify and mark argumentative elements.
arXiv Detail & Related papers (2025-05-28T18:39:56Z)
- Adapting Pretrained Language Models for Citation Classification via Self-Supervised Contrastive Learning [13.725832389453911]
Citation classification is pivotal for scholarly analysis. Previous works suggest fine-tuning pretrained language models (PLMs) on citation classification. We present a novel framework, Citss, that adapts the PLMs to overcome these challenges.
arXiv Detail & Related papers (2025-05-20T15:05:27Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- KLUE: Korean Language Understanding Evaluation [43.94952771238633]
We introduce Korean Language Understanding Evaluation (KLUE) benchmark.
KLUE is a collection of 8 Korean natural language understanding (NLU) tasks.
We build all of the tasks from scratch from diverse source corpora while respecting copyrights.
arXiv Detail & Related papers (2021-05-20T11:40:30Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.