Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
- URL: http://arxiv.org/abs/2307.13236v1
- Date: Tue, 25 Jul 2023 03:59:04 GMT
- Title: Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
- Authors: Jinxiang Liu, Chen Ju, Chaofan Ma, Yanfeng Wang, Yu Wang, Ya Zhang
- Abstract summary: We propose a novel Audio-aware query-enhanced TRansformer (AuTR) to tackle the task.
Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features.
- Score: 22.28510611697998
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of the audio-visual segmentation (AVS) task is to segment the
sounding objects in the video frames using audio cues. However, current
fusion-based methods face performance limitations due to the small
receptive fields of convolutions and inadequate fusion of audio-visual features.
To overcome these issues, we propose a novel Audio-aware
query-enhanced TRansformer (AuTR) to tackle the task. Unlike existing
methods, our approach introduces a multimodal transformer architecture that
enables deep fusion and aggregation of audio-visual features. Furthermore, we
devise an audio-aware query-enhanced transformer decoder that explicitly helps
the model focus on the segmentation of the pinpointed sounding objects based on
audio signals, while disregarding silent yet salient objects. Experimental
results show that our method outperforms previous methods and demonstrates
better generalization ability in multi-sound and open-set scenarios.
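The abstract describes an audio-aware query-enhanced decoder but gives no pseudocode. As a rough illustration only, one decoder step can be sketched as biasing learnable object queries with a pooled audio embedding before cross-attending to visual features; the function names, shapes, and the simple additive fusion below are assumptions for exposition, not the authors' exact design:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_enhanced_decoder_step(queries, audio_emb, visual_feats):
    """One hypothetical decoder step: bias learnable queries with the
    audio embedding, then cross-attend to flattened visual features.

    queries:      (Q, D)  learnable object queries
    audio_emb:    (D,)    pooled audio feature for the current frame
    visual_feats: (HW, D) flattened per-pixel visual features
    """
    # Audio-aware query enhancement (assumed additive fusion): inject the
    # audio cue into every query so attention can favor sounding objects.
    aq = queries + audio_emb[None, :]
    # Scaled dot-product cross-attention from audio-aware queries to pixels.
    d = queries.shape[-1]
    attn = softmax(aq @ visual_feats.T / np.sqrt(d), axis=-1)  # (Q, HW)
    return attn @ visual_feats, attn  # updated queries, per-pixel attention

# Toy usage: 3 queries, 4-dim features, a 2x2 frame flattened to 4 pixels.
rng = np.random.default_rng(0)
out, attn = audio_enhanced_decoder_step(
    rng.standard_normal((3, 4)),
    rng.standard_normal(4),
    rng.standard_normal((4, 4)),
)
```

In a full model the per-pixel attention maps (or the updated queries, via a mask head) would be turned into segmentation masks; this sketch only shows how an audio cue can steer which pixels each query attends to.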
Related papers
- Audio-Visual Talker Localization in Video for Spatial Sound Reproduction [3.2472293599354596]
In this research, we detect and locate the active speaker in the video.
We found that the two modalities complement each other.
Future investigations will assess the robustness of the model in noisy and highly reverberant environments.
arXiv Detail & Related papers (2024-06-01T16:47:07Z) - Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z) - Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or the audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z) - Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to building an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multimodal clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - AVSegFormer: Audio-Visual Segmentation with Transformer [42.24135756439358]
A new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video.
This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges.
We propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture.
arXiv Detail & Related papers (2023-07-03T16:37:10Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - TransAVS: End-To-End Audio-Visual Segmentation With Transformer [33.56539999875508]
We propose TransAVS, the first Transformer-based end-to-end framework for the Audio-Visual Segmentation task.
TransAVS disentangles the audio stream into audio queries, which interact with images and are decoded into segmentation masks.
Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset.
arXiv Detail & Related papers (2023-05-12T03:31:04Z) - LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z) - Audio-Visual Segmentation [47.10873917119006]
We propose to explore a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame.
We construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos.
arXiv Detail & Related papers (2022-07-11T17:50:36Z)