Automatic Speech Recognition for Hindi
- URL: http://arxiv.org/abs/2406.18135v1
- Date: Wed, 26 Jun 2024 07:39:20 GMT
- Title: Automatic Speech Recognition for Hindi
- Authors: Anish Saha, A. G. Ramakrishnan
- Abstract summary: The research involved developing a web application and designing a web interface for speech recognition.
The web application manages large volumes of audio files and their transcriptions, facilitating human correction of ASR transcripts.
The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech recognition (ASR) is a key area of computational linguistics that develops technologies enabling computers to convert spoken language into text, combining linguistics and machine learning. ASR models, which map speech audio to transcripts through supervised learning, must handle real, unrestricted text. Text-to-speech systems work directly with real text, while ASR systems rely on language models trained on large text corpora, so high-quality transcribed data is essential for training predictive models. The research involved two main components: developing a web application and designing a web interface for speech recognition. The web application, built with JavaScript and Node.js, manages large volumes of audio files and their transcriptions, enabling collaborative human correction of ASR transcripts; it operates in real time using a client-server architecture. The web interface records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine. VAD detects the presence of human speech, avoiding unnecessary processing during non-speech intervals and thus saving computation and network bandwidth in applications such as VoIP. The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states, including a novel backpropagation method that uses prior statistics of node co-activations.
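To make the VAD step concrete, the sketch below shows a minimal frame-energy voice activity detector over 16 kHz mono PCM. The paper does not specify which VAD algorithm the web interface uses, so the frame length, threshold, and function name here are illustrative assumptions, not the authors' implementation.

```python
# Minimal energy-threshold VAD sketch for 16 kHz mono int16 PCM.
# Illustrative only: the paper does not disclose its VAD algorithm.
import numpy as np

SAMPLE_RATE = 16_000               # 16 kHz mono, as captured by the web app
FRAME_MS = 30                      # common VAD frame length (assumption)
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000

def speech_frames(pcm: np.ndarray, threshold_db: float = -40.0) -> list:
    """Return a speech/non-speech flag for each 30 ms frame.

    pcm: int16 samples; threshold_db is RMS level relative to full scale.
    """
    flags = []
    for start in range(0, len(pcm) - FRAME_LEN + 1, FRAME_LEN):
        frame = pcm[start:start + FRAME_LEN].astype(np.float64) / 32768.0
        rms = np.sqrt(np.mean(frame ** 2))
        flags.append(20.0 * np.log10(rms + 1e-12) > threshold_db)
    return flags
```

Only frames flagged as speech would be forwarded to the recognition engine, which is how VAD saves computation and network bandwidth during non-speech intervals.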
Related papers
- Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition [110.8431434620642]
We introduce the generative speech transcription error correction (GenSEC) challenge.
This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition.
We discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
arXiv Detail & Related papers (2024-09-15T16:32:49Z) - Learning Speech Representation From Contrastive Token-Acoustic
Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - On decoder-only architecture for speech-to-text and large language model
integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - An Adaptive Learning based Generative Adversarial Network for One-To-One
Voice Conversion [9.703390665821463]
We propose an adaptive learning-based GAN model called ALGAN-VC for an efficient one-to-one VC of speakers.
The model is tested on Voice Conversion Challenge (VCC) 2016, 2018, and 2020 datasets as well as on our self-prepared speech dataset.
Subjective and objective evaluations of the generated speech samples indicate that the proposed model performs the voice conversion task well.
arXiv Detail & Related papers (2021-04-25T13:44:32Z) - A.I. based Embedded Speech to Text Using Deepspeech [3.2221306786493065]
This paper shows the implementation process of speech recognition on a low-end computational device.
DeepSpeech is an open-source speech recognition engine that uses a neural network to convert a speech spectrogram into a text transcript.
The paper presents experiments with DeepSpeech versions 0.1.0, 0.1.1, and 0.6.0; version 0.6.0 shows some improvement, processing speech-to-text faster on old hardware (a Raspberry Pi 3 B+).
arXiv Detail & Related papers (2020-02-25T08:27:41Z)