The SMeL Test: A simple benchmark for media literacy in language models
- URL: http://arxiv.org/abs/2508.02074v2
- Date: Thu, 07 Aug 2025 03:54:11 GMT
- Title: The SMeL Test: A simple benchmark for media literacy in language models
- Authors: Gustaf Ahdritz, Anat Kleiman,
- Abstract summary: We introduce the Synthetic Media Literacy Test (SMeL Test), a minimal benchmark that tests the ability of language models to actively filter out untrustworthy information in context. We benchmark a variety of commonly used instruction-tuned LLMs, including reasoning models, and find that no model consistently succeeds. We hope our work sheds more light on this important form of hallucination and guides the development of new methods to combat it.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The internet is rife with unattributed, deliberately misleading, or otherwise untrustworthy content. Though large language models (LLMs) are often tasked with autonomous web browsing, the extent to which they have learned the simple heuristics human researchers use to navigate this noisy environment is not currently known. In this paper, we introduce the Synthetic Media Literacy Test (SMeL Test), a minimal benchmark that tests the ability of language models to actively filter out untrustworthy information in context. We benchmark a variety of commonly used instruction-tuned LLMs, including reasoning models, and find that no model consistently succeeds; while reasoning in particular is associated with higher scores, even the best API model we test hallucinates up to 70% of the time. Remarkably, larger and more capable models do not necessarily outperform their smaller counterparts. We hope our work sheds more light on this important form of hallucination and guides the development of new methods to combat it.
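The paper's own evaluation code is not reproduced in this listing. As a purely illustrative sketch (all names and the string-matching criterion are assumptions, not the authors' method), a context-filtering check of the kind the abstract describes, where a model sees both trustworthy and untrustworthy sources and is penalized for repeating the untrusted one, might look like:

```python
# Hypothetical sketch: count a "hallucination" whenever the model's answer
# echoes the untrusted claim that was planted in its context instead of the
# claim from the trustworthy source.
from dataclasses import dataclass


@dataclass
class Item:
    question: str
    trusted_claim: str    # from an attributed, reliable source
    untrusted_claim: str  # from an unattributed or misleading source


def is_hallucination(item: Item, model_answer: str) -> bool:
    """True if the answer repeats the untrusted claim (naive substring check)."""
    return item.untrusted_claim.lower() in model_answer.lower()


def hallucination_rate(items: list[Item], answers: list[str]) -> float:
    flags = [is_hallucination(i, a) for i, a in zip(items, answers)]
    return sum(flags) / len(flags)


item = Item(
    question="What year was the dam completed?",
    trusted_claim="completed in 1936",
    untrusted_claim="completed in 1942",
)
rate = hallucination_rate([item], ["The dam was completed in 1942."])
```

A real benchmark would of course need semantic matching rather than substring checks, but the scoring shape (fraction of answers that adopt the untrusted source) is the quantity the abstract's "hallucinates up to 70% of the time" refers to.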
Related papers
- Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking [21.23826888841565]
We present a novel approach for training small language models for reasoning-intensive document ranking. We use web data and a teacher LLM to automatically generate high-quality training examples with relevance explanations. Our model ranks third on the leaderboard while using substantially fewer parameters than other approaches.
arXiv Detail & Related papers (2025-04-04T21:27:48Z)
- Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models? [5.412335160966597]
A common assumption in Computational Linguistics is that text representations learnt by multimodal models are richer and more human-like than those by language-only models. We compare word representations from contrastive multimodal models vs. language-only ones in the extent to which they capture information. Our results indicate that, surprisingly, language-only models are superior to multimodal ones in both respects.
arXiv Detail & Related papers (2025-04-01T16:28:38Z)
- Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence [3.4049215220521933]
We introduce Dynamic Intelligence Assessment (DIA), a novel methodology for testing AI models.
The framework introduces four new metrics to assess a model's reliability and confidence across multiple attempts.
The accompanying dataset, DIA-Bench, contains a diverse collection of challenge templates with mutable parameters presented in various formats.
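DIA-Bench's actual templates are not shown in this listing. A rough sketch of the underlying idea, a challenge template whose parameters are re-sampled on every instantiation so a model cannot rely on one memorized instance (template and parameter ranges are hypothetical), could be:

```python
# Hypothetical sketch: instantiate a challenge template with mutable
# parameters. The same seed reproduces the same challenge; different seeds
# yield fresh instances of the same underlying task.
import random

TEMPLATE = "What is ({a} * {b}) % {m}?"


def instantiate(seed: int) -> tuple[str, int]:
    """Return a (prompt, ground-truth answer) pair for one template instance."""
    rng = random.Random(seed)
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    m = rng.randint(2, 17)
    prompt = TEMPLATE.format(a=a, b=b, m=m)
    return prompt, (a * b) % m
```

Because the ground truth is computed alongside the prompt, such templates can also support the repeated-attempt reliability metrics the summary mentions.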
arXiv Detail & Related papers (2024-10-20T20:07:36Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation [49.3814117521631]
Standard benchmarks of bias and fairness in large language models (LLMs) measure the association between the user attributes stated or implied by a prompt and the model's responses. We develop analogous RUTEd evaluations from three contexts of real-world use: children's bedtime stories, user personas, and English language learning exercises. We find that standard bias metrics have no significant correlation with the more realistic bias metrics.
arXiv Detail & Related papers (2024-02-20T01:49:15Z)
- Towards Auditing Large Language Models: Improving Text-based Stereotype Detection [5.3634450268516565]
This work introduces the Multi-Grain Stereotype dataset, which includes 52,751 instances of gender, race, profession, and religion stereotypical text.
We design several experiments to rigorously test the proposed model trained on the novel dataset.
Experiments show that training the model in a multi-class setting can outperform the one-vs-all binary counterpart.
arXiv Detail & Related papers (2023-11-23T17:47:14Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) have come into widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- Scalable Performance Analysis for Vision-Language Models [26.45624201546282]
Joint vision-language models have shown great performance over a diverse set of tasks.
Our paper introduces a more scalable solution that relies on already annotated benchmarks.
We confirm previous findings that CLIP behaves like a bag-of-words model and performs better with nouns and verbs.
arXiv Detail & Related papers (2023-05-30T06:40:08Z)
- Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark [14.922083834969323]
Large language models (LLMs) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks.
We introduce a new theory-driven benchmark, SocKET, that contains 58 NLP tasks testing social knowledge.
arXiv Detail & Related papers (2023-05-24T09:21:06Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
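The 14 attack methods used to build AdvGLUE are not detailed in this summary. As a toy illustration only (not one of AdvGLUE's actual attacks), one of the simplest families of text perturbations, a character-level "typo" attack that stays readable to humans but can flip model predictions, might be sketched as:

```python
# Toy character-level perturbation: swap two adjacent characters inside one
# sufficiently long word. The output preserves length and word count, so the
# sentence remains human-readable while differing from the original input.
import random


def typo_attack(sentence: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = sentence.split()
    # Only perturb words long enough that a swap stays legible.
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return sentence
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(len(w) - 1)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)
```

Benchmarks like AdvGLUE additionally filter such perturbations with human validation, precisely because automated attacks can accidentally change the label of the example.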
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.