Improving Joint Speech-Text Representations Without Alignment
- URL: http://arxiv.org/abs/2308.06125v1
- Date: Fri, 11 Aug 2023 13:28:48 GMT
- Title: Improving Joint Speech-Text Representations Without Alignment
- Authors: Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg,
Tara N. Sainath, Michael Picheny, Kyunghyun Cho
- Abstract summary: We show that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length.
We argue that consistency losses could forgive length differences and simply assume the best alignment.
- Score: 92.60384956736536
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The last year has seen astonishing progress in text-prompted image generation
premised on the idea of a cross-modal representation space in which the text
and image domains are represented jointly. In ASR, this idea has found
application as joint speech-text encoders that can scale to the capacities of
very large parameter models by being trained on both unpaired speech and text.
While these methods show promise, they have required special treatment of the
sequence-length mismatch inherent in speech and text, either by up-sampling
heuristics or an explicit alignment model. In this work, we offer evidence that
joint speech-text encoders naturally achieve consistent representations across
modalities by disregarding sequence length, and argue that consistency losses
could forgive length differences and simply assume the best alignment. We show
that such a loss improves downstream WER in both a large-parameter monolingual
and multilingual system.
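As a rough illustration of a consistency loss that "forgives length differences and simply assumes the best alignment", the sketch below scores a speech embedding sequence against a shorter text embedding sequence under the cheapest monotonic alignment, found by dynamic programming. The abstract does not specify the paper's exact formulation, so everything here is an assumption for illustration: the function names `sq_dist` and `best_alignment_loss`, the squared-Euclidean frame cost, the constraint that every text token covers at least one speech frame, and the normalization by the number of speech frames.

```python
from math import inf


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_alignment_loss(speech, text):
    """Mean per-frame cost of the cheapest monotonic alignment.

    speech: list of T frame embeddings (each a list of floats)
    text:   list of U token embeddings, with 1 <= U <= T
    Hypothetical helper: each speech frame is assigned to exactly one text
    token, assignments never move backwards, and every token is used.
    """
    T, U = len(speech), len(text)
    if U == 0 or T < U:
        raise ValueError("need at least one text token and T >= U")

    # prev[u] = cheapest cost of aligning frames 0..t with frame t on token u
    prev = [inf] * U
    prev[0] = sq_dist(speech[0], text[0])

    for t in range(1, T):
        cur = [inf] * U
        # Frame t can reach at most token t, since each step advances by <= 1.
        for u in range(min(t + 1, U)):
            stay = prev[u]                            # same token as frame t-1
            advance = prev[u - 1] if u > 0 else inf   # move to the next token
            cur[u] = sq_dist(speech[t], text[u]) + min(stay, advance)
        prev = cur

    # The alignment must end on the last token; normalize by frame count.
    return prev[U - 1] / T


if __name__ == "__main__":
    speech = [[0.0], [0.0], [1.0], [1.0]]  # four speech frames
    text = [[0.0], [1.0]]                  # two text tokens
    print(best_alignment_loss(speech, text))
```

Because the loss minimizes over alignments, no up-sampling heuristic or separate alignment model is needed to compare the two sequences; in a real training setup the embeddings would come from the shared encoder and the minimization would run on differentiable tensors rather than Python lists.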
Related papers
- SSR: Alignment-Aware Modality Connector for Speech Language Models [23.859649312290447]
Fusing speech into a pre-trained language model (SpeechLM) usually suffers from inefficient encoding of long-form speech and catastrophic forgetting of the pre-trained text modality.
We propose SSR-Connector (Segmented Speech Representation Connector) for better modality fusion.
arXiv Detail & Related papers (2024-09-30T19:17:46Z)
- Soft Alignment of Modality Space for End-to-end Speech Translation [49.29045524083467]
End-to-end Speech Translation aims to convert speech into target text within a unified model.
The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer.
We introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities.
arXiv Detail & Related papers (2023-12-18T06:08:51Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter efficient inference time faithful decoding algorithm that enables smaller audio captioning models with performance equivalent to larger models trained with more data.
arXiv Detail & Related papers (2023-09-06T19:42:52Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
- Long Text Generation by Modeling Sentence-Level and Discourse-Level Coherence [59.51720326054546]
We propose a long text generation model, which can represent the prefix sentences at sentence level and discourse level in the decoding process.
Our model can generate more coherent texts than state-of-the-art baselines.
arXiv Detail & Related papers (2021-05-19T07:29:08Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.