Towards Practical and Efficient Image-to-Speech Captioning with
Vision-Language Pre-training and Multi-modal Tokens
- URL: http://arxiv.org/abs/2309.08531v1
- Date: Fri, 15 Sep 2023 16:48:34 GMT
- Title: Towards Practical and Efficient Image-to-Speech Captioning with
Vision-Language Pre-training and Multi-modal Tokens
- Authors: Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe,
Yong Man Ro
- Abstract summary: We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model.
With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performance on two widely used benchmark databases.
- Score: 87.52235889917223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose methods to build a powerful and efficient
Image-to-Speech captioning (Im2Sp) model. To this end, we begin by importing
the rich knowledge related to image comprehension and language modeling from a
large-scale pre-trained vision-language model into Im2Sp. We set the output of
the proposed Im2Sp as discretized speech units, i.e., the quantized speech
features of a self-supervised speech model. The speech units mainly contain
linguistic information while suppressing other characteristics of speech. This
allows us to incorporate the language modeling capability of the pre-trained
vision-language model into the spoken language modeling of Im2Sp. With the
vision-language pre-training strategy, we set new state-of-the-art Im2Sp
performance on two widely used benchmark databases, COCO and Flickr8k. Then,
we further improve the efficiency of the Im2Sp model. Similar to the speech
unit case, we convert the original image into image units, which are derived
through vector quantization of the raw image. With these image units, the
storage required for image data drops to just 0.8% of the original image data,
measured in bits. Demo page:
https://ms-dot-k.github.io/Image-to-Speech-Captioning.
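For readers unfamiliar with the recipe, here is a minimal sketch of how discretized speech units are typically obtained: frame-level features from a self-supervised speech model (HuBERT here) are assigned to their nearest centroid in a pre-fitted k-means codebook, and consecutive repeats are collapsed. The checkpoint name, layer index, codebook size, and the `kmeans_200.bin` file are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: quantize self-supervised speech features into speech units.
# Checkpoint, layer index (6), and codebook size (200) are assumptions.
import torch
import joblib
from transformers import HubertModel, Wav2Vec2FeatureExtractor

model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
kmeans = joblib.load("kmeans_200.bin")  # hypothetical pre-fitted sklearn KMeans

def speech_to_units(waveform_16khz):
    """Map a raw 16 kHz waveform to a sequence of discrete speech-unit IDs."""
    inputs = extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values, output_hidden_states=True).hidden_states[6]
    feats = hidden.squeeze(0).numpy()  # (T, D) frame-level features
    units = kmeans.predict(feats)      # (T,) nearest-centroid IDs
    # Collapsing consecutive repeats keeps mainly linguistic content,
    # suppressing duration and other paralinguistic detail.
    return [int(units[0])] + [int(u) for prev, u in zip(units, units[1:]) if u != prev]
```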
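The image units can be sketched similarly, assuming a VQGAN-style tokenizer: each spatial latent from an image encoder is replaced by the index of its nearest codebook entry, and only those indices need to be stored. The encoder is omitted here, and the grid and codebook sizes in the arithmetic are assumed values, not the paper's exact tokenizer.

```python
# Minimal sketch: VQ-style nearest-codebook lookup producing "image units".
import torch

def image_to_units(latents: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """latents: (N, D) encoder outputs for N spatial positions;
    codebook: (K, D) learned entries. Returns (N,) integer unit IDs."""
    dists = torch.cdist(latents, codebook)  # (N, K) pairwise Euclidean distances
    return dists.argmin(dim=1)              # nearest-entry index per position

# Illustrative storage arithmetic (assumed sizes, not the paper's numbers):
# a 256x256 RGB image costs 256*256*3*8 = 1,572,864 bits, while a 16x16 grid
# of units from a 1024-entry codebook costs 256*10 = 2,560 bits -- the same
# order of magnitude as the ~0.8% figure reported in the abstract.
```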
Related papers
- Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2 [0.0]
An image-to-speech framework, CLIP-KNN-Fastspeech2, was constructed for the Chinese context.
The framework integrates multiple base models and adopts a strategy of independent pre-training followed by joint fine-tuning.
Experimental results on multiple public datasets show that the model improves objective metrics such as BLEU4, FAD (Fréchet Audio Distance), and WER (Word Error Rate), as well as inference speed.
arXiv Detail & Related papers (2024-07-19T11:18:44Z) - Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing translations in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves performance competitive with cascaded models using only 70.9% of their parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z) - Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach to image captioning models that use an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z) - Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy based on discretized visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z) - User-Aware Prefix-Tuning is a Good Learner for Personalized Image
Captioning [35.211749514733846]
Traditional image captioning methods often overlook the preferences and characteristics of users.
Most existing methods emphasize the user-context fusion process via memory networks or transformers.
We propose a novel personalized image captioning framework that leverages user context to account for personality factors.
arXiv Detail & Related papers (2023-12-08T02:08:00Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - Bidirectional Representations for Low Resource Spoken Language
Understanding [39.208462511430554]
We propose a representation model that encodes speech into rich bidirectional encodings.
The approach uses a masked language modelling objective to learn the representations.
We show that the resulting encodings outperform comparable models on multiple datasets.
arXiv Detail & Related papers (2022-11-24T17:05:16Z) - M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for
Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)