Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
- URL: http://arxiv.org/abs/2012.15454v1
- Date: Thu, 31 Dec 2020 05:28:38 GMT
- Title: Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
- Authors: Wei-Ning Hsu, David Harwath, Christopher Song, James Glass
- Abstract summary: We present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images.
We connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units.
We conduct experiments on the Flickr8k spoken caption dataset and a novel corpus of spoken audio captions collected for the popular MSCOCO dataset.
- Score: 24.657722909094662
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   In this paper we present the first model for directly synthesizing fluent,
natural-sounding spoken audio captions for images that does not require natural
language text as an intermediate representation or source of supervision.
Instead, we connect the image captioning module and the speech synthesis module
with a set of discrete, sub-word speech units that are discovered with a
self-supervised visual grounding task. We conduct experiments on the Flickr8k
spoken caption dataset in addition to a novel corpus of spoken audio captions
collected for the popular MSCOCO dataset, demonstrating that our generated
captions also capture diverse visual semantics of the images they describe. We
investigate several different intermediate speech representations, and
empirically find that the representation must satisfy several important
properties to serve as drop-in replacements for text.
 
      
        Related papers
        - Vision-Speech Models: Teaching Speech Models to Converse about Images [67.62394024470528]
 We introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules.
An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics.
We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis.
 arXiv (2025-03-19T18:40:45Z)
- Classifier-Guided Captioning Across Modalities [69.75111271002137]
 We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning.
Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system.
 Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
 arXiv (2025-01-03T18:09:26Z)
- Translating speech with just images [23.104041372055466]
 We extend this connection by linking images to text via an existing image captioning system.
This approach can be used for speech translation with just images by having the audio in a different language from the generated captions.
We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model.
 arXiv (2024-06-11T10:29:24Z)
- CapText: Large Language Model-based Caption Generation From Image
  Context and Description [0.0]
 We propose and evaluate a new approach to generate captions from textual descriptions and context alone.
Our approach outperforms current state-of-the-art image-text alignment models like OSCAR-VinVL on this task on the CIDEr metric.
 arXiv (2023-06-01T02:40:44Z)
- ANNA: Abstractive Text-to-Image Synthesis with Filtered News Captions [6.066100464517522]
 Real-world image-caption pairs present in domains such as news data do not use simple and directly descriptive captions.
We launch ANNA, an Abstractive News captioNs dAtaset extracted from online news articles in a variety of different contexts.
We show that techniques such as transfer learning achieve limited success in understanding abstractive captions but still fail to consistently learn the relationships between content and context features.
 arXiv (2023-01-05T17:19:01Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
  Speech Representation Learning [119.49605266839053]
 We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
 arXiv (2022-11-21T09:10:10Z)
- SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language
  Model [56.49878599920353]
 SpeechCLIP is a novel framework bridging speech and text through images to enhance speech models without transcriptions.
We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning.
 arXiv (2022-10-03T04:15:36Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image
  Paragraph Captioning [50.08729005865331]
 This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
 arXiv (2021-05-10T06:55:39Z)
- Structural and Functional Decomposition for Personality Image Captioning
  in a Communication Game [53.74847926974122]
 Personality image captioning (PIC) aims to describe an image with a natural language caption given a personality trait.
We introduce a novel formulation for PIC based on a communication game between a speaker and a listener.
 arXiv (2020-11-17T10:19:27Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language
  Understanding [61.02342238771685]
 Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
 arXiv (2020-10-05T19:29:49Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
 We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
 arXiv (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     