Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling
- URL: http://arxiv.org/abs/2508.02000v1
- Date: Mon, 04 Aug 2025 02:41:09 GMT
- Title: Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling
- Authors: Xuanjun Chen, Shih-Peng Cheng, Jiawei Du, Lin Zhang, Xiaoxiao Miao, Chung-Che Wang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
- Abstract summary: We propose a Hierarchical Boundary Modeling Network (HBMNet), which includes three modules: an Audio-Visual Feature Encoder, a Coarse Proposal Generator, and a Fine-grained Probabilities Generator. From the modality perspective, we enhance audio-visual encoding and fusion, reinforced by frame-level supervision. Experiments show that encoding and fusion primarily improve precision, while frame-level supervision boosts recall.
- Score: 50.8215545241128
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Audio-visual temporal deepfake localization under content-driven partial manipulation remains a highly challenging task. In this scenario, the deepfake regions usually span only a few frames, with most of the rest remaining identical to the original. To tackle this, we propose a Hierarchical Boundary Modeling Network (HBMNet), which includes three modules: an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, a Coarse Proposal Generator that predicts candidate boundary regions, and a Fine-grained Probabilities Generator that refines these proposals using bidirectional boundary-content probabilities. From the modality perspective, we enhance audio-visual learning through dedicated encoding and fusion, reinforced by frame-level supervision to boost discriminability. From the temporal perspective, HBMNet integrates multi-scale cues and bidirectional boundary-content relationships. Experiments show that encoding and fusion primarily improve precision, while frame-level supervision boosts recall. Each module (audio-visual fusion, temporal scales, bi-directionality) contributes complementary benefits, collectively enhancing localization performance. HBMNet outperforms BA-TFD and UMMAFormer and shows promising scalability with more training data.
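Below is a minimal, illustrative PyTorch sketch of the three-module layout named in the abstract (Audio-Visual Feature Encoder, Coarse Proposal Generator, Fine-grained Probabilities Generator). All dimensions, layer choices, the concat-then-project fusion, and the BiGRU refinement are assumptions for illustration; the paper's actual architecture and losses are not specified here.

```python
# Hypothetical sketch only: module internals and dimensions are assumptions,
# not HBMNet's published design.
import torch
import torch.nn as nn


class AudioVisualFeatureEncoder(nn.Module):
    """Fuses per-frame audio and visual features; a frame head gives frame-level supervision."""

    def __init__(self, a_dim=128, v_dim=256, d=256):
        super().__init__()
        self.proj = nn.Linear(a_dim + v_dim, d)
        self.frame_head = nn.Linear(d, 1)  # per-frame real/fake logit

    def forward(self, audio, video):
        # audio: (B, T, a_dim), video: (B, T, v_dim), aligned over T frames
        h = torch.relu(self.proj(torch.cat([audio, video], dim=-1)))
        return h, self.frame_head(h).squeeze(-1)  # (B, T, d), (B, T)


class CoarseProposalGenerator(nn.Module):
    """Predicts coarse start/end boundary scores at every frame."""

    def __init__(self, d=256):
        super().__init__()
        self.conv = nn.Conv1d(d, 2, kernel_size=3, padding=1)

    def forward(self, h):
        return self.conv(h.transpose(1, 2)).transpose(1, 2)  # (B, T, 2)


class FineGrainedProbabilitiesGenerator(nn.Module):
    """Refines proposals with bidirectional boundary-content probabilities (a BiGRU stands in here)."""

    def __init__(self, d=256):
        super().__init__()
        self.rnn = nn.GRU(d, d, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * d, 3)  # start / content / end probabilities

    def forward(self, h):
        out, _ = self.rnn(h)
        return torch.sigmoid(self.head(out))  # (B, T, 3)


if __name__ == "__main__":
    enc = AudioVisualFeatureEncoder()
    coarse = CoarseProposalGenerator()
    fine = FineGrainedProbabilitiesGenerator()
    a, v = torch.randn(2, 100, 128), torch.randn(2, 100, 256)
    h, frame_logits = enc(a, v)
    print(coarse(h).shape, fine(h).shape)  # (2, 100, 2) and (2, 100, 3)
```

The frame-level head mirrors the abstract's frame-level supervision (reported to mainly boost recall), and the bidirectional pass stands in for the bidirectional boundary-content modeling.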
Related papers
- GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval [12.483734449829235]
GAID is a framework that integrates audio and visual features under textual guidance. DASP injects structure-aware perturbations into text embeddings, enhancing robustness and discrimination without incurring multi-pass inference. Experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX show consistent state-of-the-art results with notable efficiency gains.
arXiv Detail & Related papers (2025-08-03T10:44:24Z) - Hierarchical Fusion and Joint Aggregation: A Multi-Level Feature Representation Method for AIGC Image Quality Assessment [0.9821874476902972]
The quality assessment of AI-generated content (AIGC) faces multi-dimensional challenges that span from low-level visual perception to high-level semantic understanding. To address this limitation, a multi-level visual representation paradigm is proposed with three stages, namely multi-level feature extraction, hierarchical fusion, and joint aggregation. Experiments on benchmarks demonstrate outstanding performance on both tasks, validating the effectiveness of the proposed multi-level visual assessment paradigm.
arXiv Detail & Related papers (2025-07-23T04:12:32Z) - FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation [47.8417810406568]
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling. We introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework.
arXiv Detail & Related papers (2025-06-20T07:46:40Z) - RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement [59.364418120895]
Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications. We develop a novel relation-driven Mamba framework for effective UIE (RD-UIE). Experiments on underwater enhancement benchmarks demonstrate that RD-UIE outperforms the state-of-the-art approach WMamba.
arXiv Detail & Related papers (2025-05-02T12:21:44Z) - Revisiting the Integration of Convolution and Attention for Vision Backbone [59.50256661158862]
Convolutions and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones.
We propose in this work to use MHSAs and Convs in parallel at different granularity levels instead.
We empirically verify the potential of the proposed integration scheme, named GLMix: by offloading the burden of fine-grained features to lightweight Convs, it is sufficient to use MHSAs in a few semantic slots (a toy sketch of this parallel scheme follows below).
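A hedged sketch of the parallel global-local idea described in this abstract: a lightweight depthwise Conv branch keeps fine-grained detail at full resolution, while multi-head self-attention runs only over a few pooled "semantic slots". Slot count, dimensions, and the additive merge are assumptions, not GLMix's exact design.

```python
# Toy illustration of parallel Conv (fine) + MHSA over few slots (coarse).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalBlock(nn.Module):
    def __init__(self, dim: int = 64, slot_side: int = 4, heads: int = 4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise conv
        self.pool = nn.AdaptiveAvgPool2d(slot_side)  # slot_side**2 semantic slots
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.slot_side = slot_side

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        B, C, H, W = x.shape
        fine = self.local(x)  # fine-grained branch at full resolution
        slots = self.pool(x).flatten(2).transpose(1, 2)  # (B, S, C)
        slots, _ = self.attn(slots, slots, slots)  # global mixing over a few slots
        coarse = slots.transpose(1, 2).reshape(B, C, self.slot_side, self.slot_side)
        coarse = F.interpolate(coarse, size=(H, W), mode="nearest")
        return fine + coarse  # merge the two granularities


print(GlobalLocalBlock()(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```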
arXiv Detail & Related papers (2024-11-21T18:59:08Z) - Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - Faster Learning of Temporal Action Proposal via Sparse Multilevel Boundary Generator [9.038216757761955]
Temporal action localization in videos presents significant challenges in the field of computer vision.
We propose a novel framework, Sparse Multilevel Boundary Generator (SMBG), which enhances the boundary-sensitive method with boundary classification and action completeness regression.
Our method is evaluated on two popular benchmarks, ActivityNet-1.3 and THUMOS14, and is shown to achieve state-of-the-art performance with better inference speed (2.47x the speed of BSN++, 2.12x the speed of DBG).
arXiv Detail & Related papers (2023-03-06T14:26:56Z) - Boundary-semantic collaborative guidance network with dual-stream feedback mechanism for salient object detection in optical remote sensing imagery [22.21644705244091]
We propose a boundary-semantic collaborative guidance network (BSCGNet) with a dual-stream feedback mechanism.
BSCGNet exhibits distinct advantages in challenging scenarios and outperforms 17 state-of-the-art (SOTA) approaches proposed in recent years.
arXiv Detail & Related papers (2023-03-06T03:36:06Z) - Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z) - The Devil is in the Boundary: Exploiting Boundary Representation for Basis-based Instance Segmentation [85.153426159438]
We propose Basis-based Instance segmentation (B2Inst) to learn a global boundary representation that can complement existing global-mask-based methods.
Our B2Inst leads to consistent improvements and accurately parses out the instance boundaries in a scene.
arXiv Detail & Related papers (2020-11-26T11:26:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.