Translating speech with just images
- URL: http://arxiv.org/abs/2406.07133v1
- Date: Tue, 11 Jun 2024 10:29:24 GMT
- Title: Translating speech with just images
- Authors: Dan Oneata, Herman Kamper,
- Abstract summary: We extend this connection by linking images to text via an existing image captioning system.
This approach can be used for speech translation with just images by having the audio in a different language from the generated captions.
We investigate such a system on a real low-resource language, Yorub'a, and propose a Yorub'a-to-English speech translation model.
- Score: 23.104041372055466
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yor\`ub\'a, and propose a Yor\`ub\'a-to-English speech translation model that leverages pretrained components in order to be able to learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.
Related papers
- Classifier-Guided Captioning Across Modalities [69.75111271002137]
We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning.
Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system.
Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
arXiv Detail & Related papers (2025-01-03T18:09:26Z) - Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set new state-of-the-art multilingual VSR performances by achieving comparable performances to the previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z) - Zero-shot audio captioning with audio-language model guidance and audio
context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) for generating the text which is guided by a pre-trained audio-language model to produce captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z) - CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained
Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z) - VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model)
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z) - SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language
Model [56.49878599920353]
SpeechCLIP is a novel framework bridging speech and text through images to enhance speech models without transcriptions.
We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning.
arXiv Detail & Related papers (2022-10-03T04:15:36Z) - Text-Free Image-to-Speech Synthesis Using Learned Segmental Units [24.657722909094662]
We present the first model for directly fluent, natural-sounding spoken audio captions for images.
We connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units.
We conduct experiments on the Flickr8k spoken caption dataset and a novel corpus of spoken audio captions collected for the popular MSCOCO dataset.
arXiv Detail & Related papers (2020-12-31T05:28:38Z) - Audio Captioning using Pre-Trained Large-Scale Language Model Guided by
Audio-based Similar Caption Retrieval [28.57294189207084]
The goal of audio captioning is to translate input audio into its description using natural language.
The proposed method has succeeded to use a pre-trained language model for audio captioning.
The oracle performance of the pre-trained model-based caption generator was clearly better than that of the conventional method trained from scratch.
arXiv Detail & Related papers (2020-12-14T08:27:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.