Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval
- URL: http://arxiv.org/abs/2104.01894v2
- Date: Thu, 8 Apr 2021 10:16:17 GMT
- Title: Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval
- Authors: Ramon Sanabria, Austin Waters, Jason Baldridge
- Abstract summary: Speech-based image retrieval has been studied as a proxy for joint representation learning.
It is unclear how well speech-based retrieval can work in practice.
We show our best speech-based models can match or exceed cascaded ASR-to-text encoding when speech is spontaneous, accented, or otherwise hard to automatically transcribe.
- Score: 13.40010612226968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-based image retrieval has been studied as a proxy for joint
representation learning, usually without emphasis on retrieval itself. As such,
it is unclear how well speech-based retrieval can work in practice -- both in
an absolute sense and versus alternative strategies that combine automatic
speech recognition (ASR) with strong text encoders. In this work, we
extensively study and expand choices of encoder architectures, training
methodology (including unimodal and multimodal pretraining), and other factors.
Our experiments cover different types of speech in three datasets: Flickr
Audio, Places Audio, and Localized Narratives. Our best model configuration
achieves large gains over state of the art, e.g., pushing recall-at-one from
21.8% to 33.2% for Flickr Audio and 27.6% to 53.4% for Places Audio. We also
show our best speech-based models can match or exceed cascaded ASR-to-text
encoding when speech is spontaneous, accented, or otherwise hard to
automatically transcribe.
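To make the retrieval setup concrete, below is a minimal sketch (not the authors' code) of the dual-encoder pattern the abstract describes: a speech encoder and an image encoder are trained so that matched speech-image pairs score higher than mismatched ones, and retrieval quality is reported as recall-at-one. The loss form, embedding size, and batch setup are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings
    (an assumed training objective; the paper studies several configurations)."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = speech_emb @ image_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))              # i-th utterance matches i-th image
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def recall_at_k(speech_emb, image_emb, k=1):
    """Fraction of speech queries whose paired image appears in the top-k results."""
    sims = F.normalize(speech_emb, dim=-1) @ F.normalize(image_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                 # (N, k) retrieved image indices
    gold = torch.arange(sims.size(0)).unsqueeze(-1)     # correct image index per query
    return (topk == gold).any(dim=-1).float().mean().item()

# Toy usage: random vectors stand in for pooled speech/image encoder outputs.
speech = torch.randn(128, 512)
images = torch.randn(128, 512)
print("loss:", contrastive_loss(speech, images).item())
print("recall@1:", recall_at_k(speech, images, k=1))
```

In the paper's terms, the interesting choices are which speech and image encoders produce these embeddings and how they are pretrained; the sketch only fixes the retrieval and evaluation side.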
Related papers
- PALM: Few-Shot Prompt Learning for Audio Language Models [1.6177972328875514]
Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks.
We propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch.
We demonstrate the effectiveness of our approach on 11 audio recognition datasets, and compare the results with three baselines in a few-shot learning setup.
arXiv Detail & Related papers (2024-09-29T22:06:07Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving a word error rate (WER) of 26%.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Understanding Shared Speech-Text Representations [34.45772613231558]
Maestro is one of several recent approaches that train speech models by incorporating text into end-to-end models.
We find that a corpus-specific duration model for speech-text alignment is the most important component for learning a shared speech-text representation.
We find that the shared encoder learns a more compact and overlapping speech-text representation than the unimodal encoders.
arXiv Detail & Related papers (2023-04-27T20:05:36Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets (a toy sketch of this swap follows at the end of this list).
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
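As referenced in the Wav2vec-Switch entry above, here is a toy sketch of the target-swapping idea: the same network encodes an original/noisy pair, and the quantized targets of the two views are exchanged so each view must predict the other's targets. The encoder and quantizer below are simplified stand-ins, not the wav2vec 2.0 model, and the masking and loss details of the actual method are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for the wav2vec 2.0 context network and quantizer (illustrative only)."""
    def __init__(self, dim=64):
        super().__init__()
        self.context = nn.GRU(input_size=1, hidden_size=dim, batch_first=True)
        self.quantize = nn.Linear(dim, dim)  # placeholder for the real quantization module

    def forward(self, wav):
        ctx, _ = self.context(wav.unsqueeze(-1))   # (B, T, dim) contextual representations
        return ctx, self.quantize(ctx).detach()    # "quantized" targets, used without gradients

def switched_loss(encoder, clean, noisy, temperature=0.1):
    """Clean context predicts the noisy view's targets, and vice versa."""
    ctx_c, tgt_c = encoder(clean)
    ctx_n, tgt_n = encoder(noisy)

    def nce(ctx, tgt):
        ctx = F.normalize(ctx.flatten(0, 1), dim=-1)
        tgt = F.normalize(tgt.flatten(0, 1), dim=-1)
        logits = ctx @ tgt.t() / temperature       # every other frame acts as a negative
        return F.cross_entropy(logits, torch.arange(logits.size(0)))

    # The swap: each view predicts the *other* view's quantized targets.
    return (nce(ctx_c, tgt_n) + nce(ctx_n, tgt_c)) / 2

encoder = ToyEncoder()
clean = torch.randn(2, 100)                    # toy "waveforms" (batch of 2, 100 samples)
noisy = clean + 0.1 * torch.randn_like(clean)  # synthetic noisy views of the same speech
print(switched_loss(encoder, clean, noisy).item())
```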