A Novel Scheme to classify Read and Spontaneous Speech
- URL: http://arxiv.org/abs/2306.08012v1
- Date: Tue, 13 Jun 2023 11:16:52 GMT
- Title: A Novel Scheme to classify Read and Spontaneous Speech
- Authors: Sunil Kumar Kopparapu
- Abstract summary: We propose a novel scheme for identifying read and spontaneous speech.
Our approach uses a pre-trained DeepSpeech audio-to-alphabet recognition engine.
- Score: 15.542726069501231
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The COVID-19 pandemic has led to an increased use of remote telephonic
interviews, making it important to distinguish between scripted and spontaneous
speech in audio recordings. In this paper, we propose a novel scheme for
identifying read and spontaneous speech. Our approach uses a pre-trained
DeepSpeech audio-to-alphabet recognition engine to generate a sequence of
alphabets from the audio. From these alphabets, we derive features that allow
us to discriminate between read and spontaneous speech. Our experimental
results show that even a small set of self-explanatory features can effectively
classify the two types of speech.
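The abstract names the pipeline stages (pre-trained DeepSpeech character output, hand-crafted features over that character sequence, a lightweight classifier) but not the exact features. The sketch below is a minimal illustration under stated assumptions: the feature set and the logistic-regression classifier are hypothetical stand-ins, not the ones reported in the paper, and the model/file paths are placeholders.

```python
# Sketch of the described pipeline: DeepSpeech character transcription ->
# simple features over the character sequence -> small classifier.
import wave
import numpy as np
from deepspeech import Model                      # pip install deepspeech
from sklearn.linear_model import LogisticRegression

def transcribe(wav_path: str, ds: Model) -> str:
    """Decode a 16 kHz mono WAV file into a character sequence.
    Without an external scorer, the output stays close to the raw
    acoustic character predictions (the 'audio-to-alphabet' output)."""
    with wave.open(wav_path, "rb") as w:
        assert w.getframerate() == ds.sampleRate(), "resample to the model rate first"
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return ds.stt(audio)

def char_features(text: str, duration_s: float) -> np.ndarray:
    """Hypothetical features; the paper's actual feature set is not given in the abstract."""
    words = text.split()
    n_chars = max(len(text.replace(" ", "")), 1)
    return np.array([
        n_chars / duration_s,                                        # character rate
        len(words) / duration_s,                                     # word rate
        np.mean([len(w) for w in words]) if words else 0.0,          # mean word length
        sum(1 for a, b in zip(text, text[1:]) if a == b) / n_chars,  # repeated-character ratio
    ])

# Usage (placeholder data): fit on labelled clips, 0 = read, 1 = spontaneous.
# ds = Model("deepspeech-0.9.3-models.pbmm")
# X = np.stack([char_features(transcribe(path, ds), dur) for path, dur in train_clips])
# clf = LogisticRegression().fit(X, y_train)
```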
Related papers
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
arXiv Detail & Related papers (2024-03-02T04:07:24Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- Zero-shot personalized lip-to-speech synthesis with face image based voice control [41.17483247506426]
Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies.
We propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities.
arXiv Detail & Related papers (2023-05-09T02:37:29Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of a self-supervised learning (SSL) model to obtain embedding vectors from speech representations trained on a large amount of data.
The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned by different speeches.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- DualVoice: Speech Interaction that Discriminates between Normal and Whispered Voice Input [16.82591185507251]
In speech input, there is no easy way to distinguish between commands being issued and text intended to be entered.
The input of symbols and commands is also challenging because these may be misrecognized as text letters.
This study proposes a speech interaction method called DualVoice, by which commands can be input in a whispered voice and letters in a normal voice.
arXiv Detail & Related papers (2022-08-22T13:01:28Z)
- Speaker Extraction with Co-Speech Gestures Cue [79.91394239104908]
We explore the use of co-speech gestures sequence, e.g. hand and body movements, as the speaker cue for speaker extraction.
We propose two networks using the co-speech gestures cue to perform attentive listening on the target speaker.
The experimental results show that the co-speech gestures cue is informative in associating the target speaker, and the quality of the extracted speech shows significant improvements over the unprocessed mixture speech.
arXiv Detail & Related papers (2022-03-31T06:48:52Z)
- Automatic Speech recognition for Speech Assessment of Preschool Children [4.554894288663752]
The acoustic and linguistic features of preschool speech are investigated in this study.
Wav2Vec 2.0 is a paradigm that could be used to build a robust end-to-end speech recognition system.
arXiv Detail & Related papers (2022-03-24T07:15:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.