DualVoice: Speech Interaction that Discriminates between Normal and
Whispered Voice Input
- URL: http://arxiv.org/abs/2208.10499v1
- Date: Mon, 22 Aug 2022 13:01:28 GMT
- Title: DualVoice: Speech Interaction that Discriminates between Normal and
Whispered Voice Input
- Authors: Jun Rekimoto
- Abstract summary: There is no easy way to distinguish between commands being issued and text required to be input in speech.
The input of symbols and commands is also challenging because these may be misrecognized as text letters.
This study proposes a speech interaction method called DualVoice, by which commands can be input in a whispered voice and letters in a normal voice.
- Score: 16.82591185507251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interactions based on automatic speech recognition (ASR) have become widely
used, with speech input being increasingly utilized to create documents.
However, as there is no easy way to distinguish between commands being issued
and text required to be input in speech, misrecognitions are difficult to
identify and correct, meaning that documents need to be manually edited and
corrected. The input of symbols and commands is also challenging because these
may be misrecognized as text letters. To address these problems, this study
proposes a speech interaction method called DualVoice, by which commands can be
input in a whispered voice and letters in a normal voice. The proposed method
does not require any specialized hardware other than a regular microphone,
enabling a complete hands-free interaction. The method can be used in a wide
range of situations where speech recognition is already available, ranging from
text input to mobile/wearable computing. Two neural networks were designed in
this study, one for discriminating normal speech from whispered speech, and the
second for recognizing whispered speech. A prototype of a text input system was
then developed to show how normal and whispered voice can be used in speech
text input. Other potential applications using DualVoice are also discussed.
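The abstract states that two neural networks were designed, one that discriminates normal from whispered speech and one that recognizes whispered speech, but their architectures are not given here. The sketch below is therefore only a minimal, assumed illustration of the first component: a small CNN that classifies the log-mel spectrogram of a clip as normal or whispered. The `WhisperDiscriminator` name, feature settings, and layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (NOT the authors' architecture): a binary classifier that
# labels a short audio clip as normal speech vs. whispered speech.
# Feature settings, layer sizes, and names are illustrative assumptions.
import torch
import torch.nn as nn
import torchaudio

class WhisperDiscriminator(nn.Module):
    def __init__(self, n_mels: int = 64):
        super().__init__()
        # Log-mel features: whispered speech lacks voicing, which shows up
        # as a different spectral energy distribution than normal speech.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=400, hop_length=160, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Linear(32 * 4 * 4, 2)  # logits: [normal, whispered]

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        x = self.to_db(self.melspec(waveform)).unsqueeze(1)  # (B, 1, mels, frames)
        x = self.conv(x).flatten(1)
        return self.head(x)

if __name__ == "__main__":
    model = WhisperDiscriminator()
    clips = torch.randn(2, 16000)   # two 1-second dummy clips
    print(model(clips).shape)       # torch.Size([2, 2])
```

In a DualVoice-style prototype, the predicted class would route the audio either to a standard ASR engine (text entered in a normal voice) or to the whisper recognizer (commands issued in a whispered voice).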
Related papers
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection, speech recognition, and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - Morse Code-Enabled Speech Recognition for Individuals with Visual and Hearing Impairments [0.0]
In the proposed model, the speech from the user is transmitted to the speech recognition layer, where it is converted into text.
The accuracy of the model is completely dependent on the speech recognition step, as the Morse code conversion is a deterministic process (a minimal sketch of that stage appears after this list).
The proposed model's WER and accuracy are determined to be 10.18% and 89.82%, respectively.
arXiv Detail & Related papers (2024-07-07T09:54:29Z) - Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z) - A Novel Scheme to classify Read and Spontaneous Speech [15.542726069501231]
We propose a novel scheme for identifying read and spontaneous speech.
Our approach uses a pre-trained DeepSpeech audio-to-alphabet recognition engine.
arXiv Detail & Related papers (2023-06-13T11:16:52Z) - MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech
Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z) - Conversion of Acoustic Signal (Speech) Into Text By Digital Filter using
Natural Language Processing [0.0]
We create an interface that transforms speech and other auditory inputs into text using a digital filter.
Occasional linguistic faults may also appear, and speech recognition (the voice cannot be recognized) or gender recognition may fail.
Because such technical problems are involved, we developed a program that acts as a mediator to prevent software issues (a sketch of a stand-in band-pass pre-filter appears after this list).
arXiv Detail & Related papers (2022-09-09T08:55:34Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - Learning Explicit Prosody Models and Deep Speaker Embeddings for
Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech (a minimal sketch of this two-module interface appears after this list).
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
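The Morse-code entry above describes a pipeline in which the recognized text is subsequently converted to Morse code, which is why the overall accuracy hinges on the ASR step. The paper's own conversion code is not given here; the following is a minimal illustrative sketch of that deterministic text-to-Morse stage, with the `text_to_morse` helper and its formatting conventions as assumptions.

```python
# Minimal sketch of the deterministic text-to-Morse stage that would follow
# ASR; the ASR call itself is omitted and assumed to produce `text`.
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-.",
    "G": "--.", "H": "....", "I": "..", "J": ".---", "K": "-.-", "L": ".-..",
    "M": "--", "N": "-.", "O": "---", "P": ".--.", "Q": "--.-", "R": ".-.",
    "S": "...", "T": "-", "U": "..-", "V": "...-", "W": ".--", "X": "-..-",
    "Y": "-.--", "Z": "--..", "0": "-----", "1": ".----", "2": "..---",
    "3": "...--", "4": "....-", "5": ".....", "6": "-....", "7": "--...",
    "8": "---..", "9": "----.",
}

def text_to_morse(text: str) -> str:
    """Map recognized text to Morse: letters separated by spaces,
    words by ' / '; characters without a Morse entry are skipped."""
    words = text.upper().split()
    return " / ".join(
        " ".join(MORSE[ch] for ch in word if ch in MORSE) for word in words
    )

print(text_to_morse("hello world"))
# .... . .-.. .-.. --- / .-- --- .-. .-.. -..
```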
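The digital-filter entry above says the interface filters the acoustic signal before converting it to text, but the summary does not specify the filter. The sketch below uses a standard Butterworth band-pass over the 300-3400 Hz speech band as a stand-in preprocessing step ahead of any ASR engine; the cut-off frequencies, filter order, and SciPy-based implementation are assumptions rather than the paper's design.

```python
# Hedged sketch: a band-pass pre-filter applied to speech before ASR.
# The 300-3400 Hz band and 4th-order Butterworth design are illustrative
# choices, not the filter described in the paper.
import numpy as np
from scipy.signal import butter, sosfilt

def bandpass_speech(samples: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Attenuate energy outside the typical telephone speech band."""
    sos = butter(4, [300, 3400], btype="bandpass", fs=sample_rate, output="sos")
    return sosfilt(sos, samples)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.standard_normal(16000)   # 1 s of white noise at 16 kHz
    filtered = bandpass_speech(noisy)
    print(noisy.std(), filtered.std())   # band-limiting reduces the variance
```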
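The atypical voice conversion entry describes two modules: a prosody corrector that infers typical phoneme duration and pitch from phoneme embeddings, and a conversion model that combines the phoneme embeddings with those prosody features to generate speech. The PyTorch sketch below only illustrates that interface; the dimensions, GRU layers, and module names are assumptions, not the paper's architecture.

```python
# Minimal sketch (assumed shapes and layers) of the two-module interface:
# phoneme embeddings -> typical prosody, then embeddings + prosody -> mels.
import torch
import torch.nn as nn

class ProsodyCorrector(nn.Module):
    """Predicts typical per-phoneme duration and pitch from phoneme embeddings."""
    def __init__(self, emb_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 2)   # per phoneme: (duration, pitch)

    def forward(self, phoneme_emb: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(phoneme_emb)          # (B, T, 2*hidden)
        return self.out(h)                    # (B, T, 2)

class ConversionModel(nn.Module):
    """Maps phoneme embeddings plus typical prosody to acoustic frames."""
    def __init__(self, emb_dim: int = 256, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(emb_dim + 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, phoneme_emb: torch.Tensor, prosody: torch.Tensor) -> torch.Tensor:
        x = torch.cat([phoneme_emb, prosody], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                    # (B, T, n_mels)

if __name__ == "__main__":
    emb = torch.randn(1, 20, 256)             # one utterance of 20 phonemes
    prosody = ProsodyCorrector()(emb)
    mels = ConversionModel()(emb, prosody)
    print(prosody.shape, mels.shape)          # (1, 20, 2) (1, 20, 80)
```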
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.