Hatevolution: What Static Benchmarks Don't Tell Us
- URL: http://arxiv.org/abs/2506.12148v1
- Date: Fri, 13 Jun 2025 18:08:19 GMT
- Title: Hatevolution: What Static Benchmarks Don't Tell Us
- Authors: Chiara Di Bonaventura, Barbara McGillivray, Yulan He, Albert Meroño-Peñuela
- Abstract summary: We empirically evaluate the robustness of 20 language models across two evolving hate speech experiments. Our findings call for time-sensitive linguistic benchmarks in order to correctly and reliably evaluate language models in the hate speech domain.
- Score: 14.8862493303907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language changes over time, including in the hate speech domain, which evolves quickly following social dynamics and cultural shifts. While NLP research has investigated the impact of language evolution on model training and has proposed several solutions for it, its impact on model benchmarking remains under-explored. Yet, hate speech benchmarks play a crucial role to ensure model safety. In this paper, we empirically evaluate the robustness of 20 language models across two evolving hate speech experiments, and we show the temporal misalignment between static and time-sensitive evaluations. Our findings call for time-sensitive linguistic benchmarks in order to correctly and reliably evaluate language models in the hate speech domain.
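The temporal misalignment the abstract describes can be illustrated with a minimal sketch: a single pooled benchmark score can mask steady degradation on newer data. The data, years, and scores below are entirely synthetic and do not reproduce the paper's models, datasets, or metrics.

```python
# Synthetic illustration of static vs. time-sensitive evaluation.
# A pooled score averages away the drop on newer temporal slices.

def accuracy(preds, labels):
    """Fraction of predictions matching gold labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Hypothetical gold labels and predictions, grouped by collection year.
# The imagined classifier handles older language well but misses newer usage.
labels_by_year = {
    2018: [1, 1, 0, 0, 1, 0],
    2020: [1, 0, 1, 0, 1, 1],
    2022: [1, 1, 1, 0, 0, 1],
}
preds_by_year = {
    2018: [1, 1, 0, 0, 1, 0],  # 6/6 correct
    2020: [1, 0, 1, 0, 0, 1],  # 5/6 correct
    2022: [0, 1, 0, 0, 1, 1],  # 3/6 correct
}

# Static evaluation: pool all years into one benchmark score.
all_labels = [y for ys in labels_by_year.values() for y in ys]
all_preds = [p for ps in preds_by_year.values() for p in ps]
print(f"static accuracy: {accuracy(all_preds, all_labels):.2f}")

# Time-sensitive evaluation: score each temporal slice separately,
# exposing the degradation the pooled score hides.
for year in sorted(labels_by_year):
    acc = accuracy(preds_by_year[year], labels_by_year[year])
    print(f"{year}: accuracy {acc:.2f}")
```

The pooled score (about 0.78 here) looks acceptable, while the per-year view shows accuracy falling from 1.00 to 0.50 across slices, which is the kind of gap a static benchmark cannot reveal.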
Related papers
- CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment [13.74065648648307]
We propose CogBench, the first benchmark designed to evaluate the cross-lingual and cross-site generalizability of large language models for speech-based cognitive impairment assessment. Our results show that conventional deep learning models degrade substantially when transferred across domains. Our findings offer a critical step toward building clinically useful and linguistically robust speech-based cognitive assessment tools.
arXiv Detail & Related papers (2025-08-05T12:06:16Z) - SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models [60.72029578488467]
SpeechR is a unified benchmark for evaluating reasoning over speech in large audio-language models. It evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities.
arXiv Detail & Related papers (2025-08-04T03:28:04Z) - Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models [49.1574468325115]
We introduce Speech-IFEval, an evaluation framework designed to assess instruction-following capabilities. Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs.
arXiv Detail & Related papers (2025-05-25T08:37:55Z) - A Systematic Analysis on the Temporal Generalization of Language Models in Social Media [12.035331011654078]
This paper focuses on temporal shifts in social media and, in particular, Twitter.
We propose a unified evaluation scheme to assess the performance of language models (LMs) under temporal shift.
arXiv Detail & Related papers (2024-05-15T05:41:06Z) - Holmes: A Benchmark to Assess the Linguistic Competence of Language Models [59.627729608055006]
We introduce Holmes, a new benchmark designed to assess the linguistic competence of language models (LMs).
We use computation-based probing to examine LMs' internal representations regarding distinct linguistic phenomena.
As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities.
arXiv Detail & Related papers (2024-04-29T17:58:36Z) - Examining Temporal Bias in Abusive Language Detection [3.465144840147315]
Machine learning models have been developed to automatically detect abusive language.
These models can suffer from temporal bias, the phenomenon in which topics, language use or social norms change over time.
This study investigates the nature and impact of temporal bias in abusive language detection across various languages.
arXiv Detail & Related papers (2023-09-25T13:59:39Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [56.93604813379634]
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels.
We propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels.
We highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
arXiv Detail & Related papers (2023-06-02T12:54:38Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - The Rediscovery Hypothesis: Language Models Need to Meet Linguistics [8.293055016429863]
We study whether linguistic knowledge is a necessary condition for good performance of modern language models.
We show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures.
This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates language modeling objective with linguistic information.
arXiv Detail & Related papers (2021-03-02T15:57:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.