Visualization: the missing factor in Simultaneous Speech Translation
- URL: http://arxiv.org/abs/2111.00514v1
- Date: Sun, 31 Oct 2021 14:44:01 GMT
- Title: Visualization: the missing factor in Simultaneous Speech Translation
- Authors: Sara Papi, Matteo Negri, Marco Turchi
- Abstract summary: Simultaneous speech translation (SimulST) is a task in which output generation has to be performed on partial, incremental speech input.
SimulST has become popular due to the spread of cross-lingual application scenarios.
- Score: 14.454116027072335
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Simultaneous speech translation (SimulST) is the task in which output
generation has to be performed on partial, incremental speech input. In recent
years, SimulST has become popular due to the spread of cross-lingual
application scenarios, like international live conferences and streaming
lectures, in which on-the-fly speech translation can facilitate users' access
to audio-visual content. In this paper, we analyze the characteristics of the
SimulST systems developed so far, discussing their strengths and weaknesses. We
then concentrate on the evaluation framework required to properly assess
systems' effectiveness. To this end, we raise the need for a broader
performance analysis, also including the user experience standpoint. SimulST
systems, indeed, should be evaluated not only in terms of quality/latency
measures, but also via task-oriented metrics accounting, for instance, for the
visualization strategy adopted. In light of this, we highlight the goals
the community has achieved and what is still missing.
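The abstract names quality/latency measures as the standard evaluation axes for SimulST. As a concrete illustration of the latency side, here is a minimal sketch of Average Lagging (AL), a latency metric widely used in simultaneous translation evaluation (toolkits such as SimulEval implement speech-aware variants); the function name and the toy wait-k example are illustrative, not taken from the paper.

```python
def average_lagging(g, src_len, tgt_len):
    """Minimal sketch of Average Lagging (AL) for one sentence.

    g[i] is the amount of source (e.g., words or speech frames) consumed
    when target token i is emitted; src_len and tgt_len are the total
    source and target lengths.
    """
    r = tgt_len / src_len  # target-to-source length ratio
    # tau: index of the first target token emitted after the full source
    tau = next((i for i, gi in enumerate(g) if gi >= src_len), len(g) - 1)
    lags = [g[i] - i / r for i in range(tau + 1)]
    return sum(lags) / (tau + 1)

# Toy example: a wait-3 policy on a 6-word source and 6-token target.
g = [3, 4, 5, 6, 6, 6]
print(average_lagging(g, src_len=6, tgt_len=6))  # 3.0: trails by ~3 words
```

For a wait-k policy with equal source and target lengths, AL recovers k, which is why it is a popular one-number summary of how far a system trails the speaker.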
Related papers
- STAB: Speech Tokenizer Assessment Benchmark [57.45234921100835]
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text.
We present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively.
We evaluate the STAB metrics and correlate them with downstream task performance across a range of speech tasks and tokenizer choices.
arXiv Detail & Related papers (2024-09-04T02:20:59Z)
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Analysis of Visual Features for Continuous Lipreading in Spanish [0.0]
Lipreading is a complex task whose objective is to interpret speech when audio is not available.
We analyze different visual speech features to identify which of them best captures the nature of lip movements for natural Spanish.
arXiv Detail & Related papers (2023-11-21T09:28:00Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming speech translation (ST) and speaker diarization (SD) solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- End-to-End Evaluation for Low-Latency Simultaneous Speech Translation [55.525125193856084]
We propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions.
This includes the segmentation of the audio as well as the run-time of the different components.
We also compare different approaches to low-latency speech translation using this framework.
arXiv Detail & Related papers (2023-08-07T09:06:20Z)
- KIT's Multilingual Speech Translation System for IWSLT 2023 [58.5152569458259]
We describe our speech translation system for the multilingual track of IWSLT 2023.
The task requires translation into 10 languages with varying amounts of resources.
Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation.
arXiv Detail & Related papers (2023-06-08T16:13:20Z)
- Attention as a Guide for Simultaneous Speech Translation [15.860792612311277]
We propose an attention-based policy (EDAtt) for simultaneous speech translation (SimulST).
Its goal is to leverage the encoder-decoder attention scores to guide inference in real time (a minimal sketch of this idea follows the entry).
Results on en->{de, es} show that the EDAtt policy achieves overall better results than the SimulST state of the art.
arXiv Detail & Related papers (2022-12-15T14:18:53Z)
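Below is a minimal sketch of the kind of decision rule EDAtt describes: emit the next token only when little encoder-decoder attention mass falls on the most recent speech frames, since concentrated attention there suggests the needed context has not fully arrived. The threshold, window size, and toy attention vectors are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def edatt_like_decision(attn_row, last_frames=2, alpha=0.1):
    """Decide whether to emit the next token or wait for more speech.

    attn_row: encoder-decoder attention of the candidate target token
              over the encoder states received so far (sums to 1).
    """
    recent_mass = attn_row[-last_frames:].sum()
    return recent_mass < alpha  # True -> emit now, False -> wait

# Attention concentrated on earlier frames: safe to emit.
print(edatt_like_decision(np.array([0.5, 0.3, 0.15, 0.04, 0.01])))  # True

# Attention concentrated on the newest frames: wait for more input.
print(edatt_like_decision(np.array([0.05, 0.05, 0.1, 0.3, 0.5])))   # False
```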
- Towards the evaluation of simultaneous speech translation from a communicative perspective [0.0]
We present the results of an experiment aimed at evaluating the quality of a simultaneous speech translation engine.
We found better performance for the human interpreters in terms of intelligibility, while the machine performed slightly better in terms of informativeness.
arXiv Detail & Related papers (2021-03-15T13:09:00Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community to consider more carefully how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)