BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM
- URL: http://arxiv.org/abs/2507.14632v2
- Date: Thu, 31 Jul 2025 12:03:49 GMT
- Title: BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM
- Authors: Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, Guangliang Cheng,
- Abstract summary: We introduce \textbf{BusterX++}, a novel framework for cross-modal detection and explanation of synthetic media. Our approach incorporates an advanced reinforcement learning (RL) post-training strategy that eliminates cold-start. We also present \textbf{GenBuster++}, a cross-modal benchmark leveraging state-of-the-art image and video generation techniques.
- Score: 12.349038994581415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in generative AI have dramatically improved image and video synthesis capabilities, significantly increasing the risk of misinformation through sophisticated fake content. In response, detection methods have evolved from traditional approaches to multimodal large language models (MLLMs), offering enhanced transparency and interpretability in identifying synthetic media. However, current detection systems remain fundamentally limited by their single-modality design. These approaches analyze images or videos separately, making them ineffective against synthetic content that combines multiple media formats. To address these challenges, we introduce \textbf{BusterX++}, a novel framework designed specifically for cross-modal detection and explanation of synthetic media. Our approach incorporates an advanced reinforcement learning (RL) post-training strategy that eliminates cold-start. Through Multi-stage Training, Thinking Reward, and Hybrid Reasoning, BusterX++ achieves stable and substantial performance improvements. To enable comprehensive evaluation, we also present \textbf{GenBuster++}, a cross-modal benchmark leveraging state-of-the-art image and video generation techniques. This benchmark comprises 4,000 images and video clips, meticulously curated by human experts using a novel filtering methodology to ensure high quality, diversity, and real-world applicability. Extensive experiments demonstrate the effectiveness and generalizability of our approach.
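The abstract mentions a "Thinking Reward" as part of the RL post-training strategy. The paper's exact formulation is not given here, but the general idea can be sketched as a reward that scores both the final real/fake verdict and the presence of a substantive reasoning trace. The tags, weights, and word-count threshold below are illustrative assumptions, not the authors' implementation:

```python
import re

def thinking_reward(response: str, label: str, w_answer: float = 1.0,
                    w_think: float = 0.5) -> float:
    """Score a response expected to contain a <think>...</think> reasoning
    trace followed by a final <answer>real|fake</answer> verdict."""
    # Reward a correct final verdict (label is "real" or "fake").
    m = re.search(r"<answer>(real|fake)</answer>", response)
    answer_score = 1.0 if (m and m.group(1) == label) else 0.0

    # Format-style bonus for a non-trivial reasoning trace.
    t = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    think_score = 1.0 if (t and len(t.group(1).split()) >= 10) else 0.0

    return w_answer * answer_score + w_think * think_score

resp = ("<think>The specular highlights on the face are inconsistent with "
        "the scene lighting, and the hair boundary shows warping.</think>"
        "<answer>fake</answer>")
print(thinking_reward(resp, "fake"))  # → 1.5
```

In an RL loop (e.g. a GRPO-style trainer), a scalar reward of this shape would be computed per sampled response; the verdict term drives accuracy while the reasoning term discourages degenerate answers with no explanation.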
Related papers
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection [54.88903878778194]
The field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. We propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. Our method achieves high detection accuracy, above 90% on average, underscoring its effectiveness.
arXiv Detail & Related papers (2025-07-17T15:36:39Z) - Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning [28.15997901023315]
Recall is a novel adversarial framework designed to compromise the robustness of unlearned IGMs. It consistently outperforms existing baselines in terms of adversarial effectiveness, computational efficiency, and semantic fidelity with the original prompt. These findings reveal critical vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions.
arXiv Detail & Related papers (2025-07-09T02:59:01Z) - MAGREF: Masked Guidance for Any-Reference Video Generation [33.35245169242822]
MAGREF is a unified framework for any-reference video generation. We propose a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject references. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios.
arXiv Detail & Related papers (2025-05-29T17:58:15Z) - FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities [76.46448367752944]
Multimodal large language models (MLLMs) unify visual understanding and image generation within a single framework. Most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development. We introduce FUDOKI, a unified multimodal model purely based on discrete flow matching.
arXiv Detail & Related papers (2025-05-26T15:46:53Z) - Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance. Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z) - Advance Fake Video Detection via Vision Transformers [0.9035379689195373]
We build on Vision Transformer (ViT)-based fake image detection and extend this idea to video. We propose an original framework that effectively integrates ViT embeddings over time to enhance detection performance. Our method shows promising accuracy, generalization, and few-shot learning capabilities across a new, large and diverse dataset of videos.
arXiv Detail & Related papers (2025-04-29T11:51:07Z) - SFLD: Reducing the content bias for AI-generated Image Detection [23.152346805893373]
A novel approach, SFLD, incorporates PatchShuffle to integrate high-level semantic and low-level textural information. Current benchmarks face challenges such as low image quality, insufficient content preservation, and limited class diversity. In response, we introduce Twin Synths, a new benchmark generation methodology that constructs visually near-identical pairs of real and synthetic images.
arXiv Detail & Related papers (2025-02-24T12:38:34Z) - Weak Supervision Dynamic KL-Weighted Diffusion Models Guided by Large Language Models [0.0]
We present a novel method for improving text-to-image generation by combining Large Language Models with diffusion models. Our approach incorporates semantic understanding from pre-trained LLMs to guide the generation process. Our method significantly improves both the visual quality and alignment of generated images with text descriptions.
arXiv Detail & Related papers (2025-02-02T15:43:13Z) - MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss in an efficient way. We also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding task goals.
arXiv Detail & Related papers (2024-10-14T17:57:18Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning. We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence locally by integrating temporal layers into the U-Net and VAE-Decoder, maintaining consistency within short sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z) - CapST: Leveraging Capsule Networks and Temporal Attention for Accurate Model Attribution in Deep-fake Videos [9.209808258321559]
Attributing a deep-fake to its specific generation model or encoder is vital for forensic analysis, enabling source identification and tailored countermeasures. We investigate the model attribution problem for deep-fake videos using two datasets: Deepfakes from Different Models (DFDM) and GANGen-Detection. We introduce a novel Capsule-Spatial-Temporal (CapST) model that integrates a truncated VGG19 network for feature extraction with capsule networks and temporal attention.
arXiv Detail & Related papers (2023-11-07T08:05:09Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.