Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection
- URL: http://arxiv.org/abs/2510.24816v1
- Date: Tue, 28 Oct 2025 10:04:13 GMT
- Title: Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection
- Authors: Cui Yakun, Fushuo Huo, Weijie Shi, Juntao Dai, Hang Du, Zhenghao Zhu, Sirui Han, Yike Guo
- Abstract summary: We introduce the MVFNDB (Multi-modal Video Fake News Detection Benchmark) based on empirical analysis. The benchmark comprises 10 tasks and is meticulously crafted to probe MLLMs' perception, understanding, and reasoning capacities during detection. To validate the impact of combining multiple features on the final results, we design a novel framework named MVFND-CoT.
- Score: 32.26866389632305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of multi-modal large language models (MLLMs) has greatly advanced research into applications for video fake news detection (VFND) tasks. Traditional video-based FND benchmarks typically focus on the accuracy of the final decision, often failing to provide fine-grained assessments of the entire detection process, which leaves that process a black box. Therefore, we introduce the MVFNDB (Multi-modal Video Fake News Detection Benchmark), grounded in empirical analysis that provides the foundation for task definition. The benchmark comprises 10 tasks and is meticulously crafted to probe MLLMs' perception, understanding, and reasoning capacities during detection, featuring 9730 human-annotated video-related questions based on a carefully constructed VFND ability taxonomy. To validate the impact of combining multiple features on the final results, we design a novel framework named MVFND-CoT, which incorporates reasoning over both creator-added content and original shooting footage. Building upon the benchmark, we conduct an in-depth analysis of the deeper factors influencing accuracy, including video processing strategies and the alignment between video features and model capabilities. We believe this benchmark will lay a solid foundation for future evaluations and advancements of MLLMs in the domain of video fake news detection.
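To make the evaluation setup concrete, below is a minimal, hypothetical sketch of how a multiple-choice benchmark of this kind could be scored: iterate over human-annotated video questions, query an MLLM, and report per-task accuracy. The JSON schema, field names, and the `query_mllm` helper are illustrative assumptions, not the authors' released data format or code.

```python
# Hypothetical sketch: scoring an MLLM on multiple-choice video questions,
# grouped by task, in the spirit of a benchmark such as MVFNDB.
# The item schema and query_mllm() are assumptions for illustration only.
import json
from collections import defaultdict

def query_mllm(video_path: str, question: str, options: list[str]) -> str:
    """Placeholder for a call to a multimodal LLM; returns the chosen option."""
    raise NotImplementedError("Wire this to the MLLM under evaluation.")

def evaluate(benchmark_file: str) -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    with open(benchmark_file) as f:
        items = json.load(f)  # assumed: list of {video, task, question, options, answer}
    for item in items:
        pred = query_mllm(item["video"], item["question"], item["options"])
        total[item["task"]] += 1
        if pred == item["answer"]:
            correct[item["task"]] += 1
    # Per-task accuracy mirrors the fine-grained, process-level assessment
    # the benchmark aims for, rather than a single end-to-end score.
    return {task: correct[task] / total[task] for task in total}
```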
Related papers
- Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method [96.63801368613177]
We present a new task that elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning. We also present a new dataset with 8,641 videos, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly understanding. Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making.
arXiv Detail & Related papers (2026-01-15T08:09:04Z)
- EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning [58.42596067220998]
Deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection methods face issues such as a lack of transparency in their principles and insufficient capabilities to cope with forgery techniques. This paper proposes the explainable deepfake video detection (EDVD) task and designs the EDVD-LLaMA multimodal reasoning framework.
arXiv Detail & Related papers (2025-10-18T10:34:05Z)
- MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in the QA task on our proposed AVHaystacks.
arXiv Detail & Related papers (2025-06-08T06:34:29Z)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
- VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos. Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models. We evaluate nine existing Video-LMMs, both open and closed source, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z)
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs). We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework based on synthetic video generation. We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
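As a concrete illustration of MVBench's idea of automatically converting public video annotations into multiple-choice QA, here is a hedged sketch of one way such a conversion could work. The question template, label pool, and option count are illustrative assumptions, not MVBench's actual pipeline.

```python
# Hypothetical sketch: turning an existing video annotation (e.g. an action
# label) into a multiple-choice QA item. Template and label pool are assumed.
import random

def annotation_to_mcq(video_id: str, label: str, label_pool: list[str],
                      n_options: int = 4, seed: int = 0) -> dict:
    rng = random.Random(seed)
    # Sample distractor options from the remaining labels, then shuffle.
    distractors = rng.sample([l for l in label_pool if l != label], n_options - 1)
    options = distractors + [label]
    rng.shuffle(options)
    return {
        "video": video_id,
        "question": "Which action is performed in the video?",
        "options": options,
        "answer": options.index(label),  # index of the correct choice
    }

# Example usage (hypothetical labels):
# annotation_to_mcq("vid_001", "open door",
#                   ["open door", "close door", "pick up object", "sit down"])
```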
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.