Related papers: AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs

AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs

URL: http://arxiv.org/abs/2511.21251v2
Date: Mon, 01 Dec 2025 07:07:08 GMT
Title: AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs
Authors: Shuhan Xia, Peipei Li, Xuannan Liu, Dongsen Zhang, Xinyu Guo, Zekun Li,
Abstract summary: We introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark.<n> AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations.<n>We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench.
Score: 13.950397580491666
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human subject and general subject. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery types classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.

Related papers

Leveraging large multimodal models for audio-video deepfake detection: a pilot study [20.17103408581687]
We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification - "Is this video real or fake?"<n>Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by audio-visual encoder full fine-tuning.<n>On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on Mavos-DD datasets.
arXiv Detail & Related papers (2026-02-25T04:39:08Z)
DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning [58.70446237944036]
DAVID-X is the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales.<n>We present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning.<n>Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.
arXiv Detail & Related papers (2025-06-13T13:39:53Z)
MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer.<n> AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding task.<n>We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in QA task on our proposed AVHaystack
arXiv Detail & Related papers (2025-06-08T06:34:29Z)
Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models [53.55128042938329]
Forensics-Bench is a new forgery detection evaluation benchmark suite.<n>It comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types.<n>We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet.
arXiv Detail & Related papers (2025-03-19T09:21:44Z)
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [65.49972312524724]
multimodal large language models (MLLMs) have expanded their capabilities to include vision and audio modalities.<n>Our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial.<n>We introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information.
arXiv Detail & Related papers (2024-12-03T17:41:23Z)
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning [9.786907179872815]
The potential of vision and language remains underexplored in face forgery detection. There is a need for a methodology that converts face forgery detection to a Visual Question Answering (VQA) task. We propose a multi-staged approach that diverges from the traditional binary decision paradigm to address this gap.
arXiv Detail & Related papers (2024-10-01T08:16:40Z)
AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection [49.81915942821647]
This study introduces the audio-visual transformer-based ensemble network (AVTENet) to detect deepfake videos.<n>For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset.<n>For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual (AVS) aims to precisely outline audible objects in a visual scene at the pixel level. Existing AVS methods require fine-grained annotations of audio-mask pairs in supervised learning fashion. We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
A Multi-View Approach To Audio-Visual Speaker Verification [38.9710777250597]
In this study, we explore audio-visual approaches to speaker verification. We report the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset. This new approach achieves 28% EER on VoxCeleb1 in the challenging testing condition of cross-modal verification.
arXiv Detail & Related papers (2021-02-11T22:29:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.