Related papers: Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation

Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation

URL: http://arxiv.org/abs/2601.06600v1
Date: Sat, 10 Jan 2026 15:43:30 GMT
Title: Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation
Authors: Jen-tse Huang, Chang Chen, Shiyang Lai, Wenxuan Wang, Michelle R. Kaufman, Mark Dredze,
Abstract summary: Short-video platforms have become major channels for misinformation, where deceptive claims leverage visual experiments and social cues.<n>We introduce a comprehensive evaluation framework using a high-quality, manually annotated dataset of 200 short videos spanning four health domains.<n>This dataset provides fine-grained annotations for three deceptive patterns, experimental errors, logical fallacies, and fabricated claims.
Score: 34.28647703173823
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Short-video platforms have become major channels for misinformation, where deceptive claims frequently leverage visual experiments and social cues. While Multimodal Large Language Models (MLLMs) have demonstrated impressive reasoning capabilities, their robustness against misinformation entangled with cognitive biases remains under-explored. In this paper, we introduce a comprehensive evaluation framework using a high-quality, manually annotated dataset of 200 short videos spanning four health domains. This dataset provides fine-grained annotations for three deceptive patterns, experimental errors, logical fallacies, and fabricated claims, each verified by evidence such as national standards and academic literature. We evaluate eight frontier MLLMs across five modality settings. Experimental results demonstrate that Gemini-2.5-Pro achieves the highest performance in the multimodal setting with a belief score of 71.5/100, while o3 performs the worst at 35.2. Furthermore, we investigate social cues that induce false beliefs in videos and find that models are susceptible to biases like authoritative channel IDs.

Related papers

Addressing Stereotypes in Large Language Models: A Critical Examination and Mitigation [0.0]
Large Language models (LLMs) have gained popularity in recent years with the advancement of Natural Language Processing (NLP)<n>This study inspects and highlights the need to address biases in LLMs amid growing generative Artificial Intelligence (AI)<n>We utilize bias-specific benchmarks such StereoSet and CrowSPairs to evaluate the existence of various biases in many different generative models such as BERT, GPT 3.5, and ADA.
arXiv Detail & Related papers (2025-11-18T05:43:34Z)
Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models [51.67019924750931]
Video-LevelGauge is a benchmark designed to assess positional bias in large video language models (LVLMs)<n>We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types.<n>Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions.
arXiv Detail & Related papers (2025-08-27T07:58:16Z)
When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification [14.187153195380668]
Large language models have remarkable capabilities across many NLP tasks, but their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied.<n>We evaluate five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories.<n>Surprisingly, we find that XLM-R substantially outperforms all tested LLMs, achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%.
arXiv Detail & Related papers (2025-07-28T10:49:04Z)
Debunk and Infer: Multimodal Fake News Detection via Diffusion-Generated Evidence and LLM Reasoning [34.75988591416631]
We propose a Debunk-and-Infer framework for Fake News Detection.<n>DIFND integrates the generative strength of conditional diffusion models with the collaborative reasoning capabilities of multimodal large language models.<n>Experiments on the FakeSV and FVC datasets show that DIFND not only outperforms existing approaches but also delivers trustworthy decisions.
arXiv Detail & Related papers (2025-06-11T09:08:43Z)
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories.<n>These types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives.<n>Most models score below 30% accuracy-only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions [6.19084217044276]
Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing.<n>We introduce the Sensitivity Testing on Offensive Progressions dataset, which includes 450 offensive progressions containing 2,700 unique sentences.<n>Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%.
arXiv Detail & Related papers (2024-09-20T18:34:38Z)
Towards Event-oriented Long Video Understanding [101.48089908037888]
Event-Bench is an event-oriented long video understanding benchmark built on existing datasets and human annotations. VIM is a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions.
arXiv Detail & Related papers (2024-06-20T09:14:19Z)
WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning [49.72868038180909]
We present WorldQA, a video dataset designed to push the boundaries of multimodal world models. We identify five essential types of world knowledge for question formulation. We introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain.
arXiv Detail & Related papers (2024-05-06T08:42:34Z)
Multimodal Large Language Models to Support Real-World Fact-Checking [80.41047725487645]
Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information. While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied. We propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking.
arXiv Detail & Related papers (2024-03-06T11:32:41Z)
GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community. The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability. We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.