Exploring Speech Recognition, Translation, and Understanding with
Discrete Speech Units: A Comparative Study
- URL: http://arxiv.org/abs/2309.15800v1
- Date: Wed, 27 Sep 2023 17:21:13 GMT
- Title: Exploring Speech Recognition, Translation, and Understanding with
Discrete Speech Units: A Comparative Study
- Authors: Xuankai Chang and Brian Yan and Kwanghee Choi and Jee-weon Jung and
Yichen Lu and Soumi Maiti and Roshan Sharma and Jiatong Shi and Jinchuan Tian
and Shinji Watanabe and Yuya Fujita and Takashi Maekaku and Pengcheng Guo and
Yao-Fei Cheng and Pavel Denisov and Kohei Saijo and Hsiu-Hsuan Wang
- Abstract summary: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
- Score: 68.88536866933038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech signals, typically sampled at rates in the tens of thousands per
second, contain redundancies, leading to inefficiencies in sequence modeling.
High-dimensional speech features such as spectrograms are often used as the
input for the subsequent model. However, they can still be redundant. Recent
investigations proposed the use of discrete speech units derived from
self-supervised learning representations, which significantly compresses the
size of speech data. Applying various methods, such as de-duplication and
subword modeling, can further compress the speech sequence length. Hence,
training time is significantly reduced while retaining competitive performance. In
this study, we undertake a comprehensive and systematic exploration into the
application of discrete units within end-to-end speech processing models.
Experiments on 12 automatic speech recognition, 3 speech translation, and 1
spoken language understanding corpora demonstrate that discrete units achieve
reasonably good results in almost all settings. We intend to release our
configurations and trained models to foster future research efforts.
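
To make the abstract's pipeline concrete, below is a minimal Python sketch of the discrete-unit idea: frame-level self-supervised features are quantized into unit IDs with k-means, and runs of identical consecutive units are collapsed (de-duplication); subword modeling (e.g. BPE over the unit strings) would follow as a further compression step. The random features, the scikit-learn KMeans quantizer, and the cluster count are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal discrete-speech-unit sketch (illustrative assumptions: random
# features stand in for SSL frame outputs such as HuBERT features;
# scikit-learn KMeans is the quantizer; 100 clusters is arbitrary).
import numpy as np
from sklearn.cluster import KMeans

def features_to_units(features: np.ndarray, n_clusters: int = 100) -> list:
    """Quantize frame-level features into discrete unit IDs via k-means."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    return km.labels_.tolist()

def deduplicate(units: list) -> list:
    """Collapse runs of identical consecutive units (de-duplication)."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

# 2,000 frames of 768-dim features -> discrete units -> shorter sequence.
feats = np.random.randn(2000, 768).astype(np.float32)
units = features_to_units(feats)
short = deduplicate(units)
print(len(units), "->", len(short))  # sequence shrinks before subword modeling
```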
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at rates as low as 5 Hz and 60 bps (see the rate arithmetic sketched below) and achieves state-of-the-art segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
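
As a back-of-envelope check of those rate figures (our arithmetic, not the paper's): bitrate equals unit rate times bits per unit, so 60 bps at 5 Hz implies 12 bits per unit, i.e. an implied codebook of roughly 2^12 = 4096 units.

```python
# Back-of-envelope check of the 5 Hz / 60 bps figures quoted above.
# Assumption (ours, not the paper's): bitrate = unit_rate * log2(vocab_size).
unit_rate_hz = 5.0                            # units per second
bitrate_bps = 60.0                            # bits per second
bits_per_unit = bitrate_bps / unit_rate_hz    # 12.0 bits per unit
implied_vocab = 2 ** bits_per_unit            # 4096.0 possible distinct units
print(bits_per_unit, int(implied_vocab))      # 12.0 4096
```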
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- Sample-Efficient Diffusion for Text-To-Speech Synthesis [31.372486998377966]
The proposed SESD model is based on a novel diffusion architecture that we call the U-Audio Transformer (U-AT).
SESD achieves impressive results despite training on less than 1k hours of speech.
It synthesizes more intelligible speech than the state-of-the-art auto-regressive model, VALL-E, while using less than 2% of the training data.
arXiv Detail & Related papers (2024-09-01T20:34:36Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy: pre-training with discretized visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- Robust Speech Recognition via Large-Scale Weak Supervision [69.63329359286419]
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.
When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks.
We are releasing models and inference code to serve as a foundation for further work on robust speech processing (a brief usage sketch follows below).
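
As an illustration of using the released models, here is a minimal sketch assuming the openai-whisper Python package and a local file named audio.mp3 (both assumptions on our part; the abstract does not name a package):

```python
# Minimal transcription sketch (assumes: pip install openai-whisper, plus a
# local "audio.mp3"; the "base" checkpoint choice is arbitrary).
import whisper

model = whisper.load_model("base")       # load a pretrained multilingual model
result = model.transcribe("audio.mp3")   # run speech recognition on the file
print(result["text"])                    # decoded transcript as plain text
```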
arXiv Detail & Related papers (2022-12-06T18:46:04Z)
- Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings [4.582129557845177]
This study tackles the unsupervised learning of semantic representations for spoken utterances.
We propose WavEmbed, a sequential autoencoder that predicts hidden units from a dense representation of speech.
We also propose S-HuBERT to induce meaning through knowledge distillation.
arXiv Detail & Related papers (2022-10-23T21:16:09Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique (a mask-predict-style decoding sketch follows below).
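
The iterative masking-and-prediction described above matches the general mask-predict decoding pattern; the sketch below shows that generic loop with a placeholder scorer (the model, mask schedule, vocabulary, and sequence length are illustrative assumptions, not TranSpeech's implementation):

```python
# Generic mask-predict decoding sketch (illustrative; not TranSpeech's code).
# A placeholder "model" returns, per position, a predicted unit and a
# confidence score; each iteration re-masks the least confident positions.
import random

VOCAB, MASK, LENGTH, ITERS = 100, -1, 20, 4

def model_predict(units):
    """Placeholder: random units/confidences where a real model would score."""
    return ([random.randrange(VOCAB) for _ in units],
            [random.random() for _ in units])

units = [MASK] * LENGTH                             # start fully masked
for t in range(ITERS):
    pred, conf = model_predict(units)
    units = pred                                    # accept all predictions
    n_mask = int(LENGTH * (ITERS - t - 1) / ITERS)  # linear mask schedule
    if n_mask > 0:
        worst = sorted(range(LENGTH), key=lambda i: conf[i])[:n_mask]
        for i in worst:
            units[i] = MASK                         # re-mask low-confidence slots
print(units)                                        # final discrete unit sequence
```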
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Unsupervised Pattern Discovery from Thematic Speech Archives Based on Multilingual Bottleneck Features [41.951988293049205]
We propose a two-stage approach, which comprises unsupervised acoustic modeling and decoding, followed by pattern mining in acoustic unit sequences.
The proposed system is able to effectively extract topic-related words and phrases from the lecture recordings on MIT OpenCourseWare.
arXiv Detail & Related papers (2020-11-03T20:06:48Z)