What does it take to get state of the art in simultaneous speech-to-speech translation?
- URL: http://arxiv.org/abs/2409.00965v2
- Date: Sat, 14 Sep 2024 04:16:43 GMT
- Title: What does it take to get state of the art in simultaneous speech-to-speech translation?
- Authors: Vincent Wilmet, Johnson Du,
- Abstract summary: We study the latency characteristics observed in simultaneous speech-to-speech model's performance.
We propose methods to minimize latency spikes and improve overall performance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an in-depth analysis of the latency characteristics observed in simultaneous speech-to-speech model's performance, particularly focusing on hallucination-induced latency spikes. By systematically experimenting with various input parameters and conditions, we propose methods to minimize latency spikes and improve overall performance. The findings suggest that a combination of careful input management and strategic parameter adjustments can significantly enhance speech-to-speech model's latency behavior.
Related papers
- DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization [12.310318928818546]
We propose a novel method of distilling TTS diffusion models with direct end-to-end evaluation metric optimization.
We show DMDSpeech consistently surpasses prior state-of-the-art models in both naturalness and speaker similarity.
This work highlights the potential of direct metric optimization in speech synthesis, allowing models to better align with human auditory preferences.
arXiv Detail & Related papers (2024-10-14T21:17:58Z) - Dynamic Behaviour of Connectionist Speech Recognition with Strong
Latency Constraints [6.5458610824731664]
This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints.
The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal.
arXiv Detail & Related papers (2024-01-12T14:10:28Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech
Enhancement [41.872384434583466]
We propose a learning objective that formalizes differences in perceptual quality.
We identify temporal acoustic parameters that are non-differentiable.
We develop a neural network estimator that can accurately predict their time-series values.
arXiv Detail & Related papers (2023-02-16T05:17:06Z) - TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement [41.872384434583466]
We provide a differentiable estimator for four categories of low-level acoustic descriptors involving: frequency-related parameters, energy or amplitude-related parameters, spectral balance parameters, and temporal features.
We show that adding TAP as an auxiliary objective in speech enhancement produces speech with improved perceptual quality and intelligibility.
arXiv Detail & Related papers (2023-02-16T04:57:11Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedup up to 21.4x than autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Exploring Continuous Integrate-and-Fire for Adaptive Simultaneous Speech
Translation [75.86581380817464]
A SimulST system generally includes two components: the pre-decision that aggregates the speech information and the policy that decides to read or write.
This paper proposes to model the adaptive policy by adapting the Continuous Integrate-and-Fire (CIF)
Compared with monotonic multihead attention (MMA), our method has the advantage of simpler computation, superior quality at low latency, and better generalization to long utterances.
arXiv Detail & Related papers (2022-03-22T23:33:18Z) - Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z) - Learning robust speech representation with an articulatory-regularized
variational autoencoder [13.541055956177937]
We develop an articulatory model able to associate articulatory parameters describing the jaw, tongue, lips and velum configurations with vocal tract shapes and spectral features.
We show that this articulatory constraint improves model training by decreasing time to convergence and reconstruction loss at convergence, and yields better performance in a speech denoising task.
arXiv Detail & Related papers (2021-04-07T15:47:04Z) - Incremental Text to Speech for Neural Sequence-to-Sequence Models using
Reinforcement Learning [60.20205278845412]
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement learning based framework to train an agent to make this decision.
arXiv Detail & Related papers (2020-08-07T11:48:05Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.