Denoising Bottleneck with Mutual Information Maximization for Video
Multimodal Fusion
- URL: http://arxiv.org/abs/2305.14652v3
- Date: Wed, 31 May 2023 08:20:33 GMT
- Title: Denoising Bottleneck with Mutual Information Maximization for Video
Multimodal Fusion
- Authors: Shaoxiang Wu, Damai Dai, Ziwei Qin, Tianyu Liu, Binghuai Lin, Yunbo
Cao, Zhifang Sui
- Abstract summary: Video multimodal fusion aims to integrate multimodal signals in videos.
Video has longer multimodal sequences with more redundancy and noise in visual and audio modalities.
We propose a denoising bottleneck fusion model for fine-grained video fusion.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video multimodal fusion aims to integrate multimodal signals in videos, such
as visual, audio, and text, to make a complementary prediction based on the
contents of multiple modalities. However, unlike other image-text multimodal tasks, video
has longer multimodal sequences with more redundancy and noise in both visual
and audio modalities. Prior denoising methods, such as the forget gate, filter
noise at a coarse granularity: they often suppress redundant and noisy
information at the risk of losing critical information. Therefore, we propose a
denoising bottleneck fusion (DBF) model for fine-grained video multimodal
fusion. On the one hand, we employ a bottleneck mechanism to filter out noise
and redundancy with a restrained receptive field. On the other hand, we use a
mutual information maximization module that regulates the filtering module so
that it preserves key information within the different modalities. Our DBF model achieves
significant improvement over current state-of-the-art baselines on multiple
benchmarks covering multimodal sentiment analysis and multimodal summarization
tasks. This demonstrates that our model can effectively capture salient features from
noisy and redundant video, audio, and text inputs. The code for this paper is
publicly available at https://github.com/WSXRHFG/DBF.
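As a rough illustration of the two components described in the abstract, the sketch below pairs a bottleneck-token fusion layer (cross-modal interaction forced through a small set of shared tokens, i.e. a restrained receptive field) with an InfoNCE-style mutual information lower bound used as a regularizer. All module names, dimensions, and the exact loss form are illustrative assumptions, not the authors' released implementation; see the repository linked above for the actual code.

```python
# Minimal sketch (assumed design, not the official DBF code):
# (1) modalities exchange information only through a few bottleneck tokens,
# (2) an InfoNCE-style loss keeps the bottleneck informative about each modality.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckFusionLayer(nn.Module):
    """One fusion layer: every modality attends only to shared bottleneck
    tokens, which restrains the receptive field and filters redundant frames
    (hypothetical simplification of the paper's bottleneck mechanism)."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_bottleneck: int = 8):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottleneck, dim) * 0.02)
        self.collect = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio, text):
        batch = visual.size(0)
        # Collect step: bottleneck tokens read from all modalities at once.
        tokens = self.bottleneck.expand(batch, -1, -1)
        context = torch.cat([visual, audio, text], dim=1)
        tokens, _ = self.collect(tokens, context, context)
        # Broadcast step: each modality reads back only through the bottleneck.
        fused = []
        for x in (visual, audio, text):
            out, _ = self.broadcast(x, tokens, tokens)
            fused.append(self.norm(x + out))
        return fused, tokens


def infonce_mi_lower_bound(fused: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
    """InfoNCE-style lower bound on the mutual information between the pooled
    bottleneck representation and a pooled unimodal representation; minimizing
    this loss maximizes the bound (assumed formulation, not the paper's exact loss)."""
    z_f = F.normalize(fused.mean(dim=1), dim=-1)    # (B, D)
    z_m = F.normalize(modality.mean(dim=1), dim=-1)  # (B, D)
    logits = z_f @ z_m.t() / 0.07                    # pairwise similarities
    labels = torch.arange(z_f.size(0), device=z_f.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    layer = BottleneckFusionLayer()
    v = torch.randn(2, 50, 256)   # e.g. 50 video frames
    a = torch.randn(2, 120, 256)  # e.g. 120 audio segments
    t = torch.randn(2, 30, 256)   # e.g. 30 text tokens
    (v_f, a_f, t_f), bottleneck = layer(v, a, t)
    mi_loss = sum(infonce_mi_lower_bound(bottleneck, m) for m in (v, a, t))
    print(v_f.shape, bottleneck.shape, float(mi_loss))
```

The intent of pairing the two terms: the bottleneck tokens cap how much cross-modal information can flow per layer (suppressing noise and redundancy), while the contrastive term penalizes the bottleneck for discarding content that identifies each sample's unimodal input.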
Related papers
- On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection [44.55891118519547]
We propose an innovative algorithm named Multi-Modal Detection (MM-Det) for detecting diffusion-generated content.
MM-Det utilizes the profound and comprehensive abilities of Large Multi-modal Models (LMMs) by generating a Multi-Modal Forgery Representation (MMFR).
MM-Det achieves state-of-the-art performance on Diffusion Video Forensics (DVF).
arXiv Detail & Related papers (2024-10-31T04:20:47Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection [9.145305176998447]
Weakly supervised multimodal violence detection aims to learn a violence detection model by leveraging multiple modalities.
We propose a new weakly supervised MVD method that explicitly addresses the challenges of information redundancy, modality imbalance, and modality asynchrony.
Experiments on the largest-scale XD-Violence dataset demonstrate that the proposed method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-05-08T15:27:08Z) - A Study of Dropout-Induced Modality Bias on Robustness to Missing Video
Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z) - Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z) - VideoFusion: Decomposed Diffusion Models for High-Quality Video
Generation [88.49030739715701]
This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z) - MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and
Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z) - Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z) - Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory speech.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)