Vision-Speech Models: Teaching Speech Models to Converse about Images
- URL: http://arxiv.org/abs/2503.15633v1
- Date: Wed, 19 Mar 2025 18:40:45 GMT
- Title: Vision-Speech Models: Teaching Speech Models to Converse about Images
- Authors: Amélie Royer, Moritz Böhle, Gabriel de Marmiesse, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez, Patrick Pérez
- Abstract summary: We introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis.
- Score: 67.62394024470528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent successes of Vision-Language models raise the question of how to equivalently imbue a pretrained speech model with vision understanding, an important milestone towards building a multimodal speech model able to freely converse about images. Building such a conversational Vision-Speech model brings unique challenges: (i) paired image-speech datasets are much scarcer than their image-text counterparts, (ii) ensuring real-time latency at inference is crucial, which imposes compute and memory constraints, and (iii) the model should preserve prosodic features (e.g., speaker tone) which cannot be inferred from text alone. In this work, we introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. To reduce training costs, we design a simple one-stage, parameter-efficient fine-tuning pipeline in which we leverage a mixture of image-text (i.e., "speechless") and image-speech samples. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis. Our inference code will be made available, as well as the image-speech data used for audio evaluation.
Related papers
- NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training [6.34265125858783]
We propose a noise-robust framework for efficient vision-language pre-training that requires less pre-training data.
Specifically, we bridge the modality gap between a frozen image encoder and a large language model with a transformer.
We introduce two innovative learning strategies: noise-adaptive learning and concept-enhanced learning.
arXiv Detail & Related papers (2024-09-15T01:54:17Z)
- Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation [46.93969003104427]
This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM).
USDM is designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech.
Our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines.
arXiv Detail & Related papers (2024-02-08T14:35:09Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set new state-of-the-art multilingual VSR performances by achieving comparable performances to the previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Text-Free Image-to-Speech Synthesis Using Learned Segmental Units [24.657722909094662]
We present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images.
We connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units.
We conduct experiments on the Flickr8k spoken caption dataset and a novel corpus of spoken audio captions collected for the popular MSCOCO dataset.
arXiv Detail & Related papers (2020-12-31T05:28:38Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.
We show that previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history.
We present methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)