Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature
Alignment
- URL: http://arxiv.org/abs/2307.12964v2
- Date: Wed, 18 Oct 2023 18:15:58 GMT
- Title: Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature
Alignment
- Authors: Sarah Ibrahimi, Xiaohang Sun, Pichao Wang, Amanmeet Garg, Ashutosh
Sanan, Mohamed Omar
- Abstract summary: TEFAL is a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query.
Our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately.
- Score: 16.304894187743013
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text-to-video retrieval systems have recently made significant progress by
utilizing pre-trained models trained on large-scale image-text pairs. However,
most of the latest methods primarily focus on the video modality while
disregarding the audio signal for this task. A recent advancement,
ECLIPSE, has improved long-range text-to-video retrieval by developing an
audiovisual video representation. Nonetheless, the objective of the
text-to-video retrieval task is to capture the complementary audio and video
information that is pertinent to the text query rather than simply achieving
better audio and video alignment. To address this issue, we introduce TEFAL, a
TExt-conditioned Feature ALignment method that produces both audio and video
representations conditioned on the text query. Instead of using only an
audiovisual attention block, which could suppress the audio information
relevant to the text query, our approach employs two independent cross-modal
attention blocks that enable the text to attend to the audio and video
representations separately. Our proposed method's efficacy is demonstrated on
four benchmark datasets that include audio: MSR-VTT, LSMDC, VATEX, and
Charades, where it consistently outperforms the state of the art
across all four datasets. This is attributed to the additional
text-query-conditioned audio representation and the complementary information
it adds to the text-query-conditioned video representation.
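The mechanism described above can be illustrated with a short sketch: text tokens act as queries in two independent cross-attention blocks, one over audio features and one over video features, and the resulting text-conditioned embeddings are scored against the text embedding for retrieval. This is a minimal, hypothetical sketch assuming PyTorch; the module name, mean pooling, and equal-weight score fusion are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumption: PyTorch; not the authors' released code).
# Two independent cross-modal attention blocks let the text attend to the
# audio and video representations separately, so neither modality's
# text-relevant information is suppressed by the other.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextConditionedAlignment(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Separate text-to-audio and text-to-video cross-attention blocks.
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens, audio_feats, video_feats):
        # text_tokens:  (B, Lt, D) text token embeddings (used as queries)
        # audio_feats:  (B, La, D) audio clip/frame embeddings
        # video_feats:  (B, Lv, D) video frame embeddings
        audio_cond, _ = self.text_to_audio(text_tokens, audio_feats, audio_feats)
        video_cond, _ = self.text_to_video(text_tokens, video_feats, video_feats)

        # Pool over the text-token axis to get one embedding per modality
        # (mean pooling is an assumption made for this sketch).
        audio_emb = F.normalize(audio_cond.mean(dim=1), dim=-1)
        video_emb = F.normalize(video_cond.mean(dim=1), dim=-1)
        text_emb = F.normalize(text_tokens.mean(dim=1), dim=-1)

        # Retrieval score: similarity of the text to the complementary
        # text-conditioned audio and video embeddings, averaged (assumption).
        score = 0.5 * (text_emb * audio_emb).sum(-1) + 0.5 * (text_emb * video_emb).sum(-1)
        return score
```

For example, calling `TextConditionedAlignment()(text, audio, video)` with tensors of shape (B, L, 512) returns one relevance score per text-video pair; ranking these scores over a gallery of candidate videos yields the retrieval result.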
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval [6.656989511639513]
The key to the text-to-video retrieval (TVR) task lies in learning the unique similarity between each pair of text (consisting of words) and video (consisting of audio and image frames) representations.
We propose a novel multi-granularity feature interaction module called MGFI, consisting of text-frame and word-frame interactions.
We also introduce a cross-modal feature interaction module of audio and text called CMFI to solve the problem of insufficient expression of frames in the video.
arXiv Detail & Related papers (2024-06-21T02:28:06Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Zero-Shot Audio Captioning via Audibility Guidance [57.70351255180495]
We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility.
Our method is a zero-shot method, i.e., we do not learn to perform captioning.
We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline.
arXiv Detail & Related papers (2023-09-07T17:45:58Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment [30.38594416942543]
We propose a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, namely DiffAVA.
Our DiffAVA leverages a multi-head attention transformer to aggregate temporal information from video features, and a dual multi-modal residual network to fuse temporal visual representations with text embeddings.
Experimental results on the AudioCaps dataset demonstrate that the proposed DiffAVA can achieve competitive performance on visual-aligned text-to-audio generation.
arXiv Detail & Related papers (2023-05-22T10:37:27Z)
- Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? [131.300931102986]
In real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles.
We propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning.
We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-12-31T11:50:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.