Artificial Neural Networks to Recognize Speakers Division from Continuous Bengali Speech
- URL: http://arxiv.org/abs/2404.15168v1
- Date: Thu, 18 Apr 2024 10:17:20 GMT
- Title: Artificial Neural Networks to Recognize Speakers Division from Continuous Bengali Speech
- Authors: Hasmot Ali, Md. Fahad Hossain, Md. Mehedi Hasan, Sheikh Abujar, Sheak Rashed Haider Noori
- Abstract summary: We used our dataset of more than 45 hours of audio data from 633 individual male and female speakers.
We recorded the highest accuracy of 85.44%.
- Score: 0.5330251011543498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice-based applications are ruling over the era of automation because speech carries a lot of factors that determine a speaker's information as well as speech. Modern Automatic Speech Recognition (ASR) is a blessing in the field of Human-Computer Interaction (HCI) for efficient communication between humans and devices using Artificial Intelligence technology. Speech is one of the easiest mediums of communication because it has a lot of identifying features for different speakers. Nowadays it is possible to determine speakers and their identity from their speech in terms of speaker recognition. In this paper, we present a method that provides a speaker's geographical identity within a certain region using continuous Bengali speech. We consider the eight divisions of Bangladesh as the geographical regions. We applied Mel Frequency Cepstral Coefficient (MFCC) and Delta features to an Artificial Neural Network to classify a speaker's division. We performed preprocessing tasks such as noise reduction and 8-10 second segmentation of the raw audio before feature extraction. We used our own dataset of more than 45 hours of audio data from 633 individual male and female speakers. We recorded the highest accuracy of 85.44%.
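The pipeline described in the abstract (8-10 second segments, MFCC plus Delta features, a feed-forward ANN over eight division labels) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes a librosa/scikit-learn stack, 13 MFCCs with mean pooling over frames, and hypothetical training arrays (X_train, y_train) built from the segmented corpus.

```python
# Minimal sketch (not the authors' released code) of the described pipeline:
# MFCC + Delta features from an 8-10 second speech segment, classified by a
# small feed-forward ANN into one of the eight divisions of Bangladesh.
# Assumed/hypothetical: librosa + scikit-learn, 13 MFCCs, mean pooling over
# frames, and X_train / y_train arrays built from the segmented corpus.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

DIVISIONS = ["Barishal", "Chattogram", "Dhaka", "Khulna",
             "Mymensingh", "Rajshahi", "Rangpur", "Sylhet"]

def extract_features(wav_path, sr=16000, n_mfcc=13):
    """Return a fixed-length MFCC + Delta vector for one segmented clip."""
    y, _ = librosa.load(wav_path, sr=sr)        # noise reduction assumed done upstream
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    delta = librosa.feature.delta(mfcc)                      # first-order Delta features
    feats = np.vstack([mfcc, delta])                         # (2 * n_mfcc, frames)
    return feats.mean(axis=1)                                # mean-pool over frames

# Hypothetical training data: one feature vector per 8-10 s clip,
# labels are indices into DIVISIONS.
# X_train = np.stack([extract_features(p) for p in train_clips])
# clf = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500)
# clf.fit(X_train, y_train)
# print(DIVISIONS[clf.predict([extract_features("clip.wav")])[0]])
```

Mean pooling over frames is only one way to turn variable-length segments into fixed-length ANN inputs, and the paper does not specify the exact network shape, so the hidden-layer sizes above are placeholders.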
Related papers
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into a native accent to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- A Deep Dive into the Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos [4.809236881780707]
We describe the curation of a massive speech dataset of 8740 hours consisting of ~9.8K technical lectures in the English language along with their transcripts delivered by instructors representing various parts of Indian demography.
We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India.
arXiv Detail & Related papers (2023-07-20T05:03:00Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Speaker Identification using Speech Recognition [0.0]
This research provides a mechanism for identifying a speaker in an audio file, based on human voice biometric features such as pitch, amplitude, and frequency.
We proposed an unsupervised learning model where the model can learn speech representation with limited dataset.
arXiv Detail & Related papers (2022-05-29T13:03:42Z)
- Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents [0.0]
We have used a transfer learning approach to develop an end-to-end speech recognition system for Indian-English accents.
Indic TTS data of Indian-English accents is used for transfer learning and fine-tuning the pre-trained Deep Speech model.
arXiv Detail & Related papers (2022-04-03T03:11:21Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- NeuraGen-A Low-Resource Neural Network based approach for Gender Classification [0.0]
We have used speech recordings collected from the ELSDSR and limited TIMIT datasets.
We extracted 8 speech features, which were pre-processed and then fed into NeuraGen to identify the gender.
NeuraGen achieved an accuracy of 90.7407% and an F1 score of 91.227% on the training set and under 20-fold cross-validation.
arXiv Detail & Related papers (2022-03-29T05:57:24Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on LibriSpeechMix, a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- HarperValleyBank: A Domain-Specific Spoken Dialog Corpus [7.331287001215395]
HarperValleyBank is a free, public domain spoken dialog corpus.
The data simulate simple consumer banking interactions, containing about 23 hours of audio from 1,446 human-human conversations.
arXiv Detail & Related papers (2020-10-26T22:16:52Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)