An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment
- URL: http://arxiv.org/abs/2410.05997v1
- Date: Tue, 8 Oct 2024 12:52:48 GMT
- Title: An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment
- Authors: Hugo Malard, Michel Olvera, Stéphane Lathuilière, Slim Essid
- Abstract summary: Multimodal large language models have fueled progress in image captioning.
In this work, we show that this ability can be re-purposed for audio captioning.
We introduce a novel methodology for bridging the audiovisual modality gap.
- Score: 6.977241620071544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associated with image sequences within videos featuring audiovisual content. This can be achieved via multimodal alignment. Yet, this multimodal alignment task is non-trivial due to the inherent disparity between audible and visible elements in real-world videos. Moreover, multimodal representation learning often relies on contrastive learning, facing the challenge of the so-called modality gap which hinders smooth integration between modalities. In this work, we introduce a novel methodology for bridging the audiovisual modality gap by matching the distributions of tokens produced by an audio backbone and those of an image captioner. Our approach aligns the audio token distribution with that of the image tokens, enabling the model to perform zero-shot audio captioning in an unsupervised fashion while keeping the initial image captioning component unaltered. This alignment allows for the use of either audio or audiovisual input by combining or substituting the image encoder with the aligned audio encoder. Our method achieves significantly improved performance in zero-shot audio captioning, compared to existing approaches.
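To make the distribution-matching idea concrete, the sketch below shows one way audio tokens could be aligned to an image captioner's token space with a Maximum Mean Discrepancy (MMD) objective; the adapter module, the choice of MMD, and all tensor shapes are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch only: align the distribution of audio tokens with that of
# image tokens via an MMD objective, keeping the image captioner frozen.
import torch
import torch.nn as nn


class AudioToImageAdapter(nn.Module):
    """Hypothetical adapter mapping audio-backbone tokens into the image-token space."""

    def __init__(self, audio_dim: int, image_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, image_dim),
            nn.GELU(),
            nn.Linear(image_dim, image_dim),
        )

    def forward(self, audio_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (batch, n_audio_tokens, audio_dim)
        return self.proj(audio_tokens)


def mmd_loss(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """RBF-kernel Maximum Mean Discrepancy between two sets of token embeddings."""
    x = x.reshape(-1, x.shape[-1])  # (batch, tokens, dim) -> (N, dim)
    y = y.reshape(-1, y.shape[-1])

    def rbf(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))

    return rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean()


# Toy usage: random tensors stand in for real backbone outputs.
adapter = AudioToImageAdapter(audio_dim=768, image_dim=1024)
audio_tokens = torch.randn(4, 32, 768)   # frozen audio backbone tokens (assumed shape)
image_tokens = torch.randn(4, 32, 1024)  # frozen image captioner's visual tokens

loss = mmd_loss(adapter(audio_tokens), image_tokens)
loss.backward()  # only the adapter receives gradients; the captioner stays unaltered
```

At inference time, the aligned audio tokens could then be fed to the unchanged image-language decoder in place of, or alongside, the image tokens, which is what would enable zero-shot audio or audiovisual captioning under this sketch.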
Related papers
- Translating speech with just images [23.104041372055466]
We extend this connection by linking images to text via an existing image captioning system.
This approach can be used for speech translation with just images by having the audio in a different language from the generated captions.
We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model.
arXiv Detail & Related papers (2024-06-11T10:29:24Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the film industry and for professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) to generate the caption text, with the generation guided by a pre-trained audio-language model.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
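As a rough illustration of the audio-language-model guidance idea, the snippet below reranks LLM-proposed captions by their similarity to the audio in a shared audio-text embedding space; ZerAuCap guides generation more directly during decoding, and the encoders here are stand-in placeholders, not the paper's models.

```python
# Simplified illustration: select the LLM-proposed caption that best matches the
# audio according to an audio-text embedding model (e.g., a CLAP-style encoder).
import torch
import torch.nn.functional as F


def select_caption(candidates, audio_embedding, text_encoder):
    """Pick the candidate caption whose text embedding best matches the audio embedding."""
    text_embeddings = torch.stack([text_encoder(c) for c in candidates])  # (K, D)
    scores = F.cosine_similarity(text_embeddings, audio_embedding.unsqueeze(0), dim=-1)
    return candidates[scores.argmax().item()], scores


# Toy usage: random vectors stand in for a real audio-language model.
def fake_text_encoder(caption: str) -> torch.Tensor:
    return torch.randn(512)  # placeholder; a real audio-language text encoder goes here


fake_audio_embedding = torch.randn(512)  # placeholder audio embedding

candidates = [
    "a dog barks while cars pass by",
    "a crowd applauds in a large hall",
    "rain falls steadily on a metal roof",
]
best, scores = select_caption(candidates, fake_audio_embedding, fake_text_encoder)
print(best, scores.tolist())
```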
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- Can CLIP Help Sound Source Localization? [19.370071553914954]
We introduce a framework that translates audio signals into tokens compatible with CLIP's text encoder.
By directly using these embeddings, our method generates audio-grounded masks for the provided audio.
Our findings suggest that utilizing pre-trained image-text models enables our model to generate more complete and compact localization maps for the sounding objects.
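A minimal sketch of the underlying idea follows, assuming a learned projection from audio features into CLIP's joint embedding space and dense CLIP-like patch features; the shapes, the projection head, and the final heatmap step are illustrative assumptions, not the paper's exact pipeline.

```python
# Rough sketch: project audio features toward CLIP's embedding space, then score
# dense image patch features against that query to obtain an audio-grounded mask.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 512                              # CLIP joint embedding size (assumed)
audio_to_clip = nn.Linear(768, embed_dim)    # hypothetical audio-to-CLIP projection

audio_feat = torch.randn(1, 768)             # from some pretrained audio encoder
patch_feats = torch.randn(1, 196, embed_dim)  # 14x14 CLIP-like patch embeddings

query = F.normalize(audio_to_clip(audio_feat), dim=-1)   # (1, 512)
patches = F.normalize(patch_feats, dim=-1)               # (1, 196, 512)

similarity = torch.einsum("bd,bpd->bp", query, patches)  # (1, 196)
mask = similarity.reshape(1, 14, 14).sigmoid()           # audio-grounded heatmap
print(mask.shape)
```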
arXiv Detail & Related papers (2023-11-07T15:26:57Z)
- Zero-Shot Audio Captioning via Audibility Guidance [57.70351255180495]
We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility, i.e., describing only what can actually be heard in the audio.
Ours is a zero-shot method, i.e., we do not learn to perform captioning.
We present our results on the AudioCaps dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline.
arXiv Detail & Related papers (2023-09-07T17:45:58Z)
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts the input sound into a sound token, like an ordinary word, which can be used plug-and-play with existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text- and sound-guided state-of-the-art methods.
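The sound-token idea can be sketched as follows, assuming a simple linear adapter that maps an audio embedding into a text-to-image model's prompt-embedding space; the dimensions, the adapter, and the splicing strategy are assumptions for illustration, not the AAI implementation.

```python
# Illustrative sketch: turn an audio embedding into a "sound token" in a T2I
# model's prompt-embedding space and splice it into the encoded prompt.
import torch
import torch.nn as nn

text_embed_dim = 768                             # e.g., CLIP text encoder width (assumed)
sound_adapter = nn.Linear(1024, text_embed_dim)  # hypothetical audio -> sound-token adapter

audio_embedding = torch.randn(1, 1024)               # from a pretrained audio encoder
prompt_tokens = torch.randn(1, 77, text_embed_dim)   # encoded text prompt (assumed shape)

sound_token = sound_adapter(audio_embedding).unsqueeze(1)        # (1, 1, 768)
conditioned = torch.cat([prompt_tokens[:, :1], sound_token,      # keep BOS, insert sound token
                         prompt_tokens[:, 1:-1]], dim=1)         # (1, 77, 768)
# `conditioned` would then feed the T2I model's cross-attention as a normal prompt would.
print(conditioned.shape)
```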
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model.
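A toy sketch of this training setup is given below: a denoiser learns to predict the noise added to a spectrogram, conditioned on a CLIP embedding of a video frame. The tiny MLP denoiser, the simplistic corruption, and all shapes are placeholders rather than CLIPSonic itself.

```python
# Toy conditional-diffusion training step: predict the noise added to a
# mel-spectrogram, conditioned on a CLIP image embedding of a video frame.
import torch
import torch.nn as nn


class TinyDenoiser(nn.Module):
    def __init__(self, spec_dim=80 * 64, cond_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(spec_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, spec_dim),
        )

    def forward(self, noisy_spec, t, frame_embedding):
        x = torch.cat([noisy_spec.flatten(1), frame_embedding, t[:, None]], dim=-1)
        return self.net(x).view_as(noisy_spec)


denoiser = TinyDenoiser()
spec = torch.randn(2, 80, 64)          # target mel-spectrogram (assumed shape)
frame_emb = torch.randn(2, 512)        # CLIP embedding of a sampled video frame
t = torch.rand(2)                      # diffusion timestep in [0, 1]
noise = torch.randn_like(spec)
noisy = (1 - t)[:, None, None] * spec + t[:, None, None] * noise  # simplistic corruption

loss = ((denoiser(noisy, t, frame_emb) - noise) ** 2).mean()
loss.backward()
```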
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Towards Generating Diverse Audio Captions via Adversarial Training [33.76154801580643]
We propose a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems.
A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model.
The results show that our proposed model can generate captions with better diversity as compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-12-05T05:06:19Z)
- Controllable Image Captioning [0.0]
We introduce a novel framework for image captioning which can generate diverse descriptions by capturing the co-dependence between Part-Of-Speech tags and semantics.
We propose a method to generate captions through a Transformer network, which predicts words based on the input Part-Of-Speech tag sequences.
arXiv Detail & Related papers (2022-04-28T07:47:49Z)
- Sound-Guided Semantic Image Manipulation [19.01823634838526]
We propose a framework that directly encodes sound into the multi-modal (image-text) embedding space and manipulates an image from the space.
Our method can mix different modalities, i.e., text and audio, which enriches the variety of possible image modifications.
Experiments on zero-shot audio classification and semantic-level image classification show that our proposed model outperforms other text- and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2021-11-30T13:30:12Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)