The Paradigm Shift: A Comprehensive Survey on Large Vision Language Models for Multimodal Fake News Detection
- URL: http://arxiv.org/abs/2601.15316v1
- Date: Fri, 16 Jan 2026 02:40:16 GMT
- Title: The Paradigm Shift: A Comprehensive Survey on Large Vision Language Models for Multimodal Fake News Detection
- Authors: Wei Ai, Yilong Tan, Yuntao Shou, Tao Meng, Haowen Chen, Zhixiong He, Keqin Li
- Abstract summary: In recent years, the rapid evolution of large vision-language models (LVLMs) has driven a paradigm shift in multimodal fake news detection (MFND). We present a historical perspective, mapping this evolution to foundation model paradigms, and discuss the remaining technical challenges, including interpretability, temporal reasoning, and domain generalization. We outline future research directions to guide the next stage of this paradigm shift.
- Score: 35.503099074709006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, the rapid evolution of large vision-language models (LVLMs) has driven a paradigm shift in multimodal fake news detection (MFND), transforming it from traditional feature-engineering approaches to unified, end-to-end multimodal reasoning frameworks. Early methods primarily relied on shallow fusion techniques to capture correlations between text and images, but they struggled with high-level semantic understanding and complex cross-modal interactions. The emergence of LVLMs has fundamentally changed this landscape by enabling joint modeling of vision and language with powerful representation learning, thereby enhancing the ability to detect misinformation that leverages both textual narratives and visual content. Despite these advances, the field lacks a systematic survey that traces this transition and consolidates recent developments. To address this gap, this paper provides a comprehensive review of MFND through the lens of LVLMs. We first present a historical perspective, mapping the evolution from conventional multimodal detection pipelines to foundation model-driven paradigms. Next, we establish a structured taxonomy covering model architectures, datasets, and performance benchmarks. Furthermore, we analyze the remaining technical challenges, including interpretability, temporal reasoning, and domain generalization. Finally, we outline future research directions to guide the next stage of this paradigm shift. To the best of our knowledge, this is the first comprehensive survey to systematically document and analyze the transformative role of LVLMs in combating multimodal fake news. A summary of the surveyed methods is available on our GitHub: https://github.com/Tan-YiLong/Overview-of-Fake-News-Detection
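To make the contrast concrete, the sketch below illustrates the "shallow fusion" baseline the abstract describes: text and image features are extracted independently, concatenated, and fed to a small classifier, with no cross-modal reasoning. This is a minimal illustrative example, not a method from the survey; the class name, feature dimensions, and encoder choices are all assumptions.

```python
# Minimal sketch of a shallow-fusion fake-news baseline (illustrative only).
# Assumes features are pre-extracted, e.g. text_dim=768 from a BERT-style
# encoder and image_dim=2048 from a ResNet-style encoder.
import torch
import torch.nn as nn

class ShallowFusionDetector(nn.Module):
    def __init__(self, text_dim: int = 768, image_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        # Shallow fusion: plain concatenation of unimodal features,
        # followed by a small MLP; no cross-modal attention.
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # real vs. fake logits
        )

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.classifier(fused)

# Usage with dummy pre-extracted features:
model = ShallowFusionDetector()
text_feat = torch.randn(4, 768)    # batch of 4 text embeddings
image_feat = torch.randn(4, 2048)  # matching image embeddings
logits = model(text_feat, image_feat)  # shape: (4, 2)
```

The limitation the survey points to is visible in the design: the two modalities only meet at the concatenation step, so the model cannot reason about whether the image actually supports the textual claim, which is exactly what LVLM-based approaches address.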
Related papers
- Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval [67.73095846666583]
Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured, visually rich data and precise information acquisition. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era.
arXiv Detail & Related papers (2026-02-23T15:27:41Z) - Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement [25.08967298618286]
Multimodal Large Language Models (MLLMs) are transforming chart information fusion. This survey aims to equip researchers and practitioners with a structured understanding of that transformation.
arXiv Detail & Related papers (2026-02-08T12:59:50Z) - Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention [7.511262066889113]
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding. We perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. We introduce a training-free contrastive attention framework that models the transformation between early fusion and final layers to highlight meaningful attention shifts.
arXiv Detail & Related papers (2026-01-13T02:26:21Z) - Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey [40.20905051575087]
In AI for Science, multimodal emotion recognition and reasoning has become a rapidly growing frontier. This paper is the first attempt to comprehensively survey the intersection of MLLMs with multimodal emotion recognition and reasoning.
arXiv Detail & Related papers (2025-09-29T06:13:14Z) - Multi-Modal Interpretability for Enhanced Localization in Vision-Language Models [2.984679075401059]
This paper presents the Multi-Modal Explainable Learning (MMEL) framework, designed to enhance the interpretability of vision-language models. Our approach processes features at multiple semantic levels to capture relationships between image regions at different granularities. We show that by incorporating semantic relationship information into gradient-based attribution maps, MMEL produces more focused and contextually aware visualizations.
arXiv Detail & Related papers (2025-09-17T18:18:59Z) - Generalizing vision-language models to novel domains: A comprehensive survey [55.97518817219619]
Vision-language pretraining has emerged as a transformative technique that integrates the strengths of both visual and textual modalities. This survey aims to comprehensively summarize the generalization settings, methodologies, benchmarks, and results in the VLM literature.
arXiv Detail & Related papers (2025-06-23T10:56:37Z) - Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving [61.992824291296444]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with the text embeddings of large language models (LLMs); a minimal sketch of this connector pattern appears after this list. This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z) - Analyzing Finetuning Representation Shift for Multimodal LLMs Steering [56.710375516257876]
We propose to map hidden states to interpretable visual and textual concepts. This enables us to more efficiently compare certain semantic dynamics, such as the shift from an original to a fine-tuned model. We also demonstrate the use of shift vectors to capture these concept changes.
arXiv Detail & Related papers (2025-01-06T13:37:13Z) - Remote Sensing SpatioTemporal Vision-Language Models: A Comprehensive Survey [35.600870905903996]
We present the first comprehensive review of RS-STVLMs. We discuss progress in representative tasks, such as change captioning, change question answering, and change grounding. We aim to illuminate current achievements and promising directions for future research in vision-language understanding for remote sensing.
arXiv Detail & Related papers (2024-12-03T16:56:10Z) - From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models [56.9134620424985]
Cross-modal reasoning (CMR) is increasingly recognized as a crucial capability in the progression toward more sophisticated artificial intelligence systems.
The recent trend of deploying Large Language Models (LLMs) to tackle CMR tasks has marked a new mainstream line of approaches for enhancing their effectiveness.
This survey offers a nuanced exposition of current methodologies applied in CMR using LLMs, classifying these into a detailed three-tiered taxonomy.
arXiv Detail & Related papers (2024-09-19T02:51:54Z) - Evolving from Single-modal to Multi-modal Facial Deepfake Detection: Progress and Challenges [40.11614155244292]
This survey traces the evolution of deepfake detection from early single-modal methods to sophisticated multi-modal approaches. We present a structured taxonomy of detection techniques and analyze the transition from GAN-based to diffusion model-driven deepfakes.
arXiv Detail & Related papers (2024-06-11T05:48:04Z) - Recent Advances in Hate Speech Moderation: Multimodality and the Role of Large Models [52.24001776263608]
This comprehensive survey delves into the recent strides in HS moderation.
We highlight the burgeoning role of large language models (LLMs) and large multimodal models (LMMs).
We identify existing gaps in research, particularly in the context of underrepresented languages and cultures.
arXiv Detail & Related papers (2024-01-30T03:51:44Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
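As referenced in the math-problem-solving entry above, the connector module is the standard bridge between a vision encoder and an LLM in current LVLMs. The sketch below shows the common pattern of projecting visual patch features into the LLM's token-embedding space and prepending them to the text sequence. The class name, projector depth, and dimensions are illustrative assumptions, not taken from any specific paper listed here.

```python
# Minimal sketch of an LVLM connector module (illustrative assumptions only):
# a frozen vision encoder's patch features are projected into the LLM's
# embedding space and prepended to the text embeddings.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A two-layer MLP projector, a design used by several open LVLMs.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the vision encoder
        # text_embeds:    (batch, seq_len, llm_dim) from the LLM's embedding table
        visual_tokens = self.proj(patch_features)
        # The LLM then attends jointly over visual and text tokens.
        return torch.cat([visual_tokens, text_embeds], dim=1)

# Usage with dummy tensors:
connector = VisionLanguageConnector()
patches = torch.randn(2, 256, 1024)        # e.g., ViT patch features
text = torch.randn(2, 32, 4096)            # LLM text embeddings
fused_sequence = connector(patches, text)  # shape: (2, 288, 4096)
```

Because the fused sequence passes through the LLM's full attention stack, every text token can condition on every visual token, which is the joint vision-language modeling capability the main survey credits for the paradigm shift in MFND.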
This list is automatically generated from the titles and abstracts of the papers on this site.