METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark
- URL: http://arxiv.org/abs/2507.16206v1
- Date: Tue, 22 Jul 2025 03:42:51 GMT
- Title: METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark
- Authors: Xu Yang, Qi Zhang, Shuming Jiang, Yaowen Xu, Zhaofan Zou, Hao Sun, Xuelong Li,
- Abstract summary: We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content.<n>Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
- Score: 48.78602579128459
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the rapid advancement of generative AI, synthetic content across images, videos, and audio has become increasingly realistic, amplifying the risk of misinformation. Existing detection approaches predominantly focus on binary classification while lacking detailed and interpretable explanations of forgeries, which limits their applicability in safety-critical scenarios. Moreover, current methods often treat each modality separately, without a unified benchmark for cross-modal forgery detection and interpretation. To address these challenges, we introduce METER, a unified, multi-modal benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations, including spatio-temporal localization, textual rationales, and forgery type tracing. Compared to prior benchmarks, METER offers broader modality coverage and richer interpretability metrics such as spatial/temporal IoU, multi-class tracing, and evidence consistency. We further propose a human-aligned, three-stage Chain-of-Thought (CoT) training strategy combining SFT, DPO, and a novel GRPO stage that integrates a human-aligned evaluator with CoT reasoning. We hope METER will serve as a standardized foundation for advancing generalizable and interpretable forgery detection in the era of generative media.
Related papers
- Reasoning-Driven Amodal Completion: Collaborative Agents and Perceptual Evaluation [17.405818788700234]
We present a Collaborative Multi-Agent Reasoning Framework that explicitly decouples Semantic Planning from Visual Synthesis.<n>Our method generates a structured, explicit plan before pixel generation, enabling visually and semantically coherent single-pass synthesis.<n>Addressing the limitations of traditional metrics in assessing inferred invisible content, we introduce the MAC-Score, a novel human-aligned evaluation metric.
arXiv Detail & Related papers (2025-12-24T04:39:45Z) - Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline [56.790045049514326]
Two major forms of deception dominate: human-crafted misinformation and AI-generated content.<n>We propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception.<n>UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines.
arXiv Detail & Related papers (2025-09-30T09:26:32Z) - Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood.<n>We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z) - Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations [56.816929931908824]
We pioneer the detection of semantically-coordinated manipulations in multimodal data.<n>We propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework.<n>Our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-09-16T04:18:48Z) - KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge [1.5833270109954136]
We propose KnowDR-REC, built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image.<n>We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, with experimental results showing that existing MLLMs still struggle with knowledge-driven visual grounding tasks.
arXiv Detail & Related papers (2025-08-12T19:43:44Z) - Consistency-aware Fake Videos Detection on Short Video Platforms [4.291448222735821]
This paper focuses on detecting fake news on the short video platforms.<n>Existing approaches typically combine raw video data with metadata inputs before applying a classification layer.<n>Motivated by this insight, we propose a novel detection paradigm that explicitly identifies and leverages cross-modal contradictions.
arXiv Detail & Related papers (2025-04-30T10:26:04Z) - SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs [70.79124435220695]
We propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE)<n>We first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation.<n>We then introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination.
arXiv Detail & Related papers (2025-04-17T17:59:27Z) - Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints [15.541287957548771]
We propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture.<n>It integrates implicit and explicit modeling approaches within a two-stage framework.<n>It significantly outperforms state-of-the-art REC and RIS methods by a substantial margin.
arXiv Detail & Related papers (2025-01-12T04:30:13Z) - Sequential Visual and Semantic Consistency for Semi-supervised Text
Recognition [56.968108142307976]
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models.
This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
arXiv Detail & Related papers (2024-02-24T13:00:54Z) - Prototype-based Aleatoric Uncertainty Quantification for Cross-modal
Retrieval [139.21955930418815]
Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space.
However, the predictions are often unreliable due to the Aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts.
We propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arisen from the inherent data ambiguity.
arXiv Detail & Related papers (2023-09-29T09:41:19Z) - Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual
Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z) - Enhancing Multimodal Entity and Relation Extraction with Variational
Information Bottleneck [12.957002659910456]
We study the multimodal named entity recognition (MNER) and multimodal relation extraction (MRE)
The core of MNER and MRE lies in incorporating evident visual information to enhance textual semantics.
We propose a novel method for MNER and MRE by Multi-Modal representation learning with Information Bottleneck (MMIB)
arXiv Detail & Related papers (2023-04-05T09:32:25Z) - TOT: Topology-Aware Optimal Transport For Multimodal Hate Detection [18.015012133043093]
We propose TOT: a topology-aware optimal transport framework to decipher the implicit harm in memes scenario.
Specifically, we leverage an optimal transport kernel method to capture complementary information from multiple modalities.
The newly achieved state-of-the-art performance on two publicly available benchmark datasets, together with further visual analysis, demonstrate the superiority of TOT.
arXiv Detail & Related papers (2023-02-27T06:58:19Z) - Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [52.86058031919856]
We propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition.
GSRM is introduced to capture global semantic context through multi-way parallel transmission.
Results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method.
arXiv Detail & Related papers (2020-03-27T09:19:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.