Related papers: Fast-Slow Transformer for Visually Grounding Speech

Fast-Slow Transformer for Visually Grounding Speech

URL: http://arxiv.org/abs/2109.08186v1
Date: Thu, 16 Sep 2021 18:45:45 GMT
Title: Fast-Slow Transformer for Visually Grounding Speech
Authors: Puyuan Peng and David Harwath
Abstract summary: We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images.
Score: 15.68151998164009
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The model unifies dual-encoder and cross-attention architectures into a single model, reaping the superior retrieval speed of the former along with the accuracy of the latter. FaST-VGS achieves state-of-the-art speech-image retrieval accuracy on benchmark datasets, and its learned representations exhibit strong performance on the ZeroSpeech 2021 phonetic and semantic tasks.

Related papers

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow [57.51550409392103]
We introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. To address these challenges, we decompose the speech signal into manageable subspaces, each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture.
arXiv Detail & Related papers (2024-11-29T05:55:20Z)
VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space [0.49109372384514843]
VQalAttent is a lightweight model designed to generate fake speech with tunable performance and interpretability. Our results demonstrate VQalAttent's capacity to generate intelligible speech samples with limited computational resources.
arXiv Detail & Related papers (2024-11-22T00:21:39Z)
DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech [43.45691362372739]
We propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS) DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties. Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.
arXiv Detail & Related papers (2024-09-18T09:36:55Z)
SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2. SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods. We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation [36.126810842258706]
Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. Due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution. We propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST.
arXiv Detail & Related papers (2023-10-11T11:39:36Z)
ViTs for SITS: Vision Transformers for Satellite Image Time Series [52.012084080257544]
We introduce a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT) TSViT splits a SITS record into non-overlapping patches in space and time which are tokenized and subsequently processed by a factorized temporo-spatial encoder.
arXiv Detail & Related papers (2023-01-12T11:33:07Z)
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units. We enhance the model performance by subword prediction in the first-pass decoder. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation. We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices. TranSpeech shows a significant improvement in inference latency, enabling speedup up to 21.4x than autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer. As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture. We focus on the scene context provided by the visual information, to ground the ASR. Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.