Accented Speech Recognition under the Indian context
- URL: http://arxiv.org/abs/2209.03787v2
- Date: Sun, 11 Sep 2022 11:41:35 GMT
- Title: Accented Speech Recognition under the Indian context
- Authors: Ankit Grover
- Abstract summary: Accent forms an integral part of identifying cultures, emotions, behaviors, etc.
People often perceive each other differently because of their accents.
The accent itself can convey status, pride, and other emotional information, all of which can be captured through speech.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Accent forms an integral part of identifying cultures, emotions,
behaviors, etc. People often perceive each other differently because of their
accents. The accent itself can convey status, pride, and other emotional
information, all of which can be captured through speech. An accent can be
defined as: "the way in which people in a particular area, country, or social
group pronounce words" or "a special emphasis given to a syllable in a word,
word in a sentence, or note in a set of musical notes".
Accented Speech Recognition is one of the most important problems in the
domain of Speech Recognition. Speech recognition is an interdisciplinary
sub-field of Computer Science and Linguistics whose main aim is to develop
technologies that convert speech into text. The speech can take many forms,
such as read, spontaneous, or conversational speech. Because speech
encompasses all instances of language utterances, it is highly diverse and
exhibits many sources of variability. This diversity stems from environmental
conditions, speaker-to-speaker variation, channel noise, differences in speech
production due to disabilities, and the presence of disfluencies. Speech is
therefore a rich source of information waiting to be exploited.
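To make the speech-to-text task concrete, here is a minimal inference sketch using a pretrained wav2vec 2.0 checkpoint from the Hugging Face transformers library; the checkpoint name and audio path are illustrative assumptions, not part of this paper:

```python
# Minimal speech-to-text sketch with a pretrained wav2vec 2.0 model.
# Illustrative only; the checkpoint and audio path are assumptions.
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load a 16 kHz mono recording (read, spontaneous, or conversational speech).
speech, sample_rate = sf.read("utterance.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # per-frame character logits

# Greedy CTC decoding: pick the most likely token per frame, collapse repeats.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```

Accent, noise, and disfluencies all surface as shifts in these per-frame distributions, which is why the variability described above matters in practice.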
Related papers
- Artificial Neural Networks to Recognize Speakers Division from Continuous Bengali Speech [0.5330251011543498]
We used our dataset of more than 45 hours of audio data from 633 individual male and female speakers.
We recorded the highest accuracy of 85.44%.
arXiv Detail & Related papers (2024-04-18T10:17:20Z)
- Joint Audio and Speech Understanding [81.34673662385774]
We build a machine learning model, called LTU-AS, that combines universal audio perception with advanced reasoning ability.
By integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events.
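As a loose sketch of how such a perception-plus-reasoning pipeline can be wired together, the snippet below transcribes audio with the openai-whisper package and hands the transcript to a language model; `answer_with_llm` is a hypothetical placeholder, and the real LTU-AS feeds Whisper representations to LLaMA directly rather than going through plain text:

```python
# Sketch of a perception-then-reasoning pipeline in the spirit of LTU-AS.
# answer_with_llm is a hypothetical stand-in for the LLaMA reasoning module.
import whisper

def answer_with_llm(prompt: str) -> str:
    """Placeholder: route the prompt to any instruction-tuned LLM."""
    return f"[LLM answer would go here, given]\n{prompt}"

model = whisper.load_model("base")
result = model.transcribe("clip.wav")        # perception: audio -> transcript
question = "What is the speaker's likely accent or emotional tone?"
print(answer_with_llm(f"Transcript: {result['text']}\n\n{question}"))
```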
arXiv Detail & Related papers (2023-09-25T17:59:05Z)
- Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents [0.0]
We have used a transfer learning approach to develop an end-to-end speech recognition system for Indian-English accents.
Indic TTS data of Indian-English accents is used for transfer learning and fine-tuning the pre-trained Deep Speech model.
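A generic transfer-learning recipe of this kind can be sketched as follows; the tiny GRU encoder and random batch stand in for a real Deep Speech checkpoint and the Indic TTS data, so this illustrates the idea rather than the authors' code:

```python
# Transfer-learning sketch: freeze a "pretrained" encoder and fine-tune
# only the output head with a CTC loss on accented speech. The model and
# data here are stand-ins, not the paper's Deep Speech setup.
import torch
import torch.nn as nn

encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
head = nn.Linear(256, 29)  # 28 characters + 1 CTC blank (assumed vocabulary)

for p in encoder.parameters():          # keep general acoustic features fixed
    p.requires_grad = False

optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
ctc = nn.CTCLoss(blank=28, zero_infinity=True)

feats = torch.randn(4, 100, 80)              # fake batch of log-mel frames
targets = torch.randint(0, 28, (4, 20))      # fake character targets
in_lens = torch.full((4,), 100, dtype=torch.long)
tgt_lens = torch.full((4,), 20, dtype=torch.long)

hidden, _ = encoder(feats)
log_probs = head(hidden).log_softmax(-1).transpose(0, 1)  # (T, B, C) for CTC
loss = ctc(log_probs, targets, in_lens, tgt_lens)
loss.backward()
optimizer.step()
print(f"one fine-tuning step, loss = {loss.item():.3f}")
```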
arXiv Detail & Related papers (2022-04-03T03:11:21Z)
- Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech [26.076534338576234]
Learning to understand grounded language, which connects natural language to percepts, is a critical research area.
In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs.
arXiv Detail & Related papers (2021-12-27T16:12:30Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
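The dataflow of that decompose-translate-resynthesize pipeline can be sketched with stand-in functions; every stage below is a dummy (the paper uses learned discrete units, an F0 predictor, and a neural vocoder), so only the structure is meant to be informative:

```python
# Structural sketch of textless emotion conversion. All stages are dummy
# stand-ins for the learned components described in the summary above.
import numpy as np

def decompose(wav):
    """Stand-in: split speech into content units, F0, speaker, emotion."""
    units = np.zeros(50, dtype=int)          # discrete content units
    f0 = np.full(50, 120.0)                  # pitch contour in Hz
    return units, f0, "speaker-id", "neutral"

def translate_units(units, target_emotion):
    """Stand-in for unit-to-unit translation toward the target emotion."""
    return units

def predict_prosody(units, target_emotion):
    """Stand-in for the prosody predictor conditioned on the new units."""
    return np.full(len(units), 150.0 if target_emotion == "happy" else 110.0)

def vocode(units, f0, speaker):
    """Stand-in for the neural vocoder that renders the waveform."""
    return np.zeros(16000)

units, f0, speaker, _ = decompose(np.zeros(16000))
units = translate_units(units, "happy")      # 1) translate content units
f0 = predict_prosody(units, "happy")         # 2) predict prosody from units
wav_out = vocode(units, f0, speaker)         # 3) render with a vocoder
print(wav_out.shape)
```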
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- Emotional Prosody Control for Speech Generation [7.66200737962746]
We propose a text-to-speech (TTS) system where a user can choose the emotion of the generated speech from a continuous and meaningful emotion space.
The proposed TTS system can generate speech from text in any speaker's style, with fine control of emotion.
arXiv Detail & Related papers (2021-11-07T08:52:04Z)
- E-ffective: A Visual Analytic System for Exploring the Emotion and Effectiveness of Inspirational Speeches [57.279044079196105]
E-ffective is a visual analytic system allowing speaking experts and novices to analyze both the role of speech factors and their contribution to effective speeches.
Two novel visualizations include E-spiral (which shows the emotional shifts in speeches in a visually compact way) and E-script (which connects speech content with key speech delivery information).
arXiv Detail & Related papers (2021-10-28T06:14:27Z)
- Analysis of French Phonetic Idiosyncrasies for Accent Recognition [0.8602553195689513]
Differences in pronunciation, accent, and intonation of speech in general create one of the most common problems in speech recognition. In this paper, we focus our attention on the French accent. We use traditional machine learning techniques and convolutional neural networks, and show that the classical techniques are not sufficiently efficient to solve this problem. We also identify the approach's limitations by examining the impact of French idiosyncrasies on its spectrograms.
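The spectrogram-plus-CNN recipe mentioned above looks roughly like this in code; the mel-spectrogram settings, network size, and two-class setup are assumptions rather than the paper's exact configuration:

```python
# Sketch of accent classification from spectrograms with a small CNN.
# Feature settings and the 2-class head are illustrative assumptions.
import torch
import torch.nn as nn
import torchaudio

to_spec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)

classifier = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),        # e.g., French-accented vs. other
)

waveform = torch.randn(1, 16000)        # one second of fake 16 kHz audio
spec = to_spec(waveform).unsqueeze(0)   # (batch, channel, n_mels, frames)
logits = classifier(spec)
print(logits.shape)                     # torch.Size([1, 2])
```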
arXiv Detail & Related papers (2021-10-18T10:50:50Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
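A generic two-stage attention sketch conveys the flavor of view-temporal attention: attend across camera views at each time step, then across time to weight visemically important frames. The dimensions and the two-stage arrangement are assumptions, not the paper's architecture:

```python
# Generic sketch of attention over views, then over time. Shapes and the
# two-stage arrangement are assumptions, not the paper's exact model.
import torch
import torch.nn as nn

B, V, T, D = 2, 4, 30, 64               # batch, views, frames, feature dim
feats = torch.randn(B, V, T, D)

view_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
time_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

# 1) Attend across views independently at each time step.
x = feats.permute(0, 2, 1, 3).reshape(B * T, V, D)
x, _ = view_attn(x, x, x)
x = x.mean(dim=1).reshape(B, T, D)      # pool views into one feature stream

# 2) Attend across time to weight visemically important frames.
out, weights = time_attn(x, x, x)
print(out.shape, weights.shape)         # (B, T, D) and (B, T, T)
```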
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.