Unified Frequency-Assisted Transformer Framework for Detecting and
Grounding Multi-Modal Manipulation
- URL: http://arxiv.org/abs/2309.09667v1
- Date: Mon, 18 Sep 2023 11:06:42 GMT
- Title: Unified Frequency-Assisted Transformer Framework for Detecting and
Grounding Multi-Modal Manipulation
- Authors: Huan Liu, Zichang Tan, Qiang Chen, Yunchao Wei, Yao Zhao, Jingdong
Wang
- Abstract summary: We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM^4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
- Score: 109.1912721224697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting and grounding multi-modal media manipulation (DGM^4) has become
increasingly crucial due to the widespread dissemination of face forgery and
text misinformation. In this paper, we present the Unified Frequency-Assisted
transFormer framework, named UFAFormer, to address the DGM^4 problem. Unlike
previous state-of-the-art methods that solely focus on the image (RGB) domain
to describe visual forgery features, we additionally introduce the frequency
domain as a complementary viewpoint. By leveraging the discrete wavelet
transform, we decompose images into several frequency sub-bands, capturing rich
face forgery artifacts. Then, our proposed frequency encoder, incorporating
intra-band and inter-band self-attentions, explicitly aggregates forgery
features within and across diverse sub-bands. Moreover, to address the semantic
conflicts between image and frequency domains, the forgery-aware mutual module
is developed to further enable the effective interaction of disparate image and
frequency features, resulting in aligned and comprehensive visual forgery
representations. Finally, based on visual and textual forgery features, we
propose a unified decoder that comprises two symmetric cross-modal interaction
modules responsible for gathering modality-specific forgery information, along
with a fusing interaction module for aggregation of both modalities. The
proposed unified decoder formulates our UFAFormer as a unified framework,
ultimately simplifying the overall architecture and facilitating the
optimization process. Experimental results on the DGM^4 dataset, containing
several perturbations, demonstrate the superior performance of our framework
compared to previous methods, setting a new benchmark in the field.
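The frequency pathway described in the abstract (a discrete wavelet transform that splits an image into sub-bands, followed by intra-band and inter-band self-attention) can be illustrated with a minimal sketch. The snippet below is an illustrative assumption rather than the authors' code: the FrequencyEncoder name, the single-level Haar DWT via PyWavelets, the embedding size, and the per-band mean pooling are choices made here for brevity.

```python
# Minimal sketch (not the UFAFormer implementation): wavelet sub-band
# decomposition plus intra-band and inter-band self-attention.
import numpy as np
import pywt
import torch
import torch.nn as nn


def wavelet_subbands(image: np.ndarray, wavelet: str = "haar"):
    """Single-level 2D DWT -> four sub-bands (LL, LH, HL, HH)."""
    ll, (lh, hl, hh) = pywt.dwt2(image, wavelet)
    return [ll, lh, hl, hh]


class FrequencyEncoder(nn.Module):
    """Intra-band attention mixes coefficients within each sub-band;
    inter-band attention then mixes pooled tokens across sub-bands."""

    def __init__(self, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(1, embed_dim)  # lift scalar coefficients to tokens
        self.intra = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, bands):  # bands: list of (H, W) numpy arrays
        tokens = []
        for band in bands:
            t = torch.from_numpy(band).float().reshape(1, -1, 1)  # (1, H*W, 1)
            t = self.proj(t)                                      # (1, H*W, D)
            t, _ = self.intra(t, t, t)                            # intra-band self-attention
            tokens.append(t.mean(dim=1))                          # pool each band to one token
        x = torch.stack(tokens, dim=1)                            # (1, num_bands, D)
        x, _ = self.inter(x, x, x)                                # inter-band self-attention
        return x                                                  # aggregated frequency features


if __name__ == "__main__":
    img = np.random.rand(64, 64)                     # stand-in for a grayscale face crop
    feats = FrequencyEncoder()(wavelet_subbands(img))
    print(feats.shape)                               # torch.Size([1, 4, 64])
```

In practice the paper's encoder operates on richer patch tokens per sub-band; the mean pooling here only keeps the example short.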
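The unified decoder described above, with two symmetric cross-modal interaction modules and a fusing interaction module, can likewise be approximated with standard cross- and self-attention. The sketch below uses assumed names (UnifiedDecoderSketch) and token shapes and omits the detection and grounding heads; it is not the authors' decoder.

```python
# Rough sketch (an assumption, not the authors' decoder): symmetric
# cross-modal interaction followed by a fusing interaction step.
import torch
import torch.nn as nn


class UnifiedDecoderSketch(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Each modality queries the other to gather modality-specific forgery cues.
        self.vis_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fusing interaction: self-attention over the concatenated token streams.
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens, text_tokens):
        v, _ = self.vis_from_txt(visual_tokens, text_tokens, text_tokens)
        t, _ = self.txt_from_vis(text_tokens, visual_tokens, visual_tokens)
        joint = torch.cat([v, t], dim=1)          # aggregate both modalities
        fused, _ = self.fuse(joint, joint, joint)
        return v, t, fused


if __name__ == "__main__":
    vis = torch.randn(2, 49, 256)   # e.g. image patch tokens for a batch of 2
    txt = torch.randn(2, 16, 256)   # e.g. word tokens for the paired captions
    v, t, fused = UnifiedDecoderSketch()(vis, txt)
    print(v.shape, t.shape, fused.shape)  # (2, 49, 256) (2, 16, 256) (2, 65, 256)
```

Downstream heads for manipulation detection and grounding would then read from these modality-specific and fused representations.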
Related papers
- Cross Group Attention and Group-wise Rolling for Multimodal Medical Image Synthesis [22.589087990596887]
Multimodal MR image synthesis aims to generate missing modality images by fusing and mapping a few available MRI data.
We propose an Adaptive Group-wise Interaction Network (AGI-Net) that explores both inter-modality and intra-modality relationships for multimodal MR image synthesis.
arXiv Detail & Related papers (2024-11-22T02:29:37Z)
- A Hybrid Transformer-Mamba Network for Single Image Deraining [70.64069487982916]
Existing deraining Transformers employ self-attention mechanisms with fixed-range windows or along channel dimensions.
We introduce a novel dual-branch hybrid Transformer-Mamba network, denoted as TransMamba, aimed at effectively capturing long-range rain-related dependencies.
arXiv Detail & Related papers (2024-08-31T10:03:19Z) - Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment [20.902935570581207]
We introduce a Multimodal Alignment and Reconstruction Network (MARNet) to enhance the model's resistance to visual noise.
MARNet includes a cross-modal diffusion reconstruction module for smoothly and stably blending information across different domains.
Experiments conducted on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of image information extracted by the model.
arXiv Detail & Related papers (2024-07-26T16:30:18Z)
- Frequency Domain Modality-invariant Feature Learning for Visible-infrared Person Re-Identification [79.9402521412239]
We propose a novel Frequency Domain modality-invariant feature learning framework (FDMNet) to reduce modality discrepancy from the frequency domain perspective.
Our framework introduces two novel modules, namely the Instance-Adaptive Amplitude Filter (IAF) and the Phase-Preserving Normalization (PPNorm).
arXiv Detail & Related papers (2024-01-03T17:11:27Z)
- A Dual Domain Multi-exposure Image Fusion Network based on the Spatial-Frequency Integration [57.14745782076976]
Multi-exposure image fusion aims to generate a single high-dynamic-range image by integrating images with different exposures.
We propose a novel perspective on multi-exposure image fusion via the Spatial-Frequency Integration Framework, named MEF-SFI.
Our method achieves visually appealing fusion results compared with state-of-the-art multi-exposure image fusion approaches.
arXiv Detail & Related papers (2023-12-17T04:45:15Z)
- Multimodal Transformer Using Cross-Channel attention for Object Detection in Remote Sensing Images [1.662438436885552]
Multi-modal fusion has been shown to improve accuracy by combining data from multiple modalities.
We propose a novel multi-modal fusion strategy for mapping relationships between different channels at the early stage.
By addressing fusion in the early stage, as opposed to mid or late-stage methods, our method achieves competitive and even superior performance compared to existing techniques.
arXiv Detail & Related papers (2023-10-21T00:56:11Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that is highly effective for both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- TransVG: End-to-End Visual Grounding with Transformers [102.11922622103613]
We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image.
We show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers while achieving higher performance.
arXiv Detail & Related papers (2021-04-17T13:35:24Z)