A.I. based Embedded Speech to Text Using Deepspeech
- URL: http://arxiv.org/abs/2002.12830v1
- Date: Tue, 25 Feb 2020 08:27:41 GMT
- Title: A.I. based Embedded Speech to Text Using Deepspeech
- Authors: Muhammad Hafidh Firmansyah, Anand Paul, Deblina Bhattacharya, Gul
Malik Urfa
- Abstract summary: This paper shows the implementation process of speech recognition on a low-end computational device.
Deepspeech is an open-source voice recognition engine that uses a neural network to convert a speech spectrogram into a text transcript.
This paper presents experiments with Deepspeech versions 0.1.0, 0.1.1, and 0.6.0; version 0.6.0 shows an improvement, processing speech-to-text faster on the older Raspberry Pi 3 B+ hardware.
- Score: 3.2221306786493065
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deepspeech is very useful for developing IoT devices that need voice
recognition. One such voice recognition system is Deepspeech from Mozilla, an
open-source engine that uses a neural network to convert a speech spectrogram
into a text transcript. This paper shows the implementation process of speech
recognition on a low-end computational device. English-language speech
recognition, for which many datasets are available, is a good starting point.
The models used are the pre-trained models provided with each release of
Deepspeech, without any changes to the released models. A further benefit of
using a Raspberry Pi as an end-to-end speech recognition device is that the
user can change and modify the speech recognition setup, and Deepspeech can
run standalone, without needing a continuous internet connection to process
speech recognition. This paper also shows that TensorFlow Lite can make a
significant difference in Deepspeech inference compared to non-Lite
TensorFlow. The experiments use Deepspeech versions 0.1.0, 0.1.1, and 0.6.0;
version 0.6.0 shows an improvement, processing speech-to-text faster on the
older Raspberry Pi 3 B+ hardware.
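For context, a basic offline transcription call with the deepspeech Python package looks roughly like the minimal sketch below. It assumes the v0.6.0-era API; the model, language-model, and audio file paths are placeholders rather than the authors' actual files, and the timing wrapper only illustrates how a TensorFlow vs. TensorFlow Lite comparison could be measured, not the paper's exact setup.

```python
import time
import wave

import numpy as np
import deepspeech  # pip install deepspeech==0.6.0

MODEL_PATH = "deepspeech-0.6.0-models/output_graph.tflite"  # placeholder path; .tflite graphs target the ARM/TFLite builds
BEAM_WIDTH = 500  # beam width used in the 0.6.0 example clients

# Load a released pre-trained model unchanged, as the paper describes.
model = deepspeech.Model(MODEL_PATH, BEAM_WIDTH)

# Optionally attach the released external language model (KenLM binary + trie).
# model.enableDecoderWithLM("lm.binary", "trie", 0.75, 1.85)

# DeepSpeech expects 16 kHz, 16-bit, mono PCM audio.
with wave.open("utterance.wav", "rb") as wav:
    assert wav.getframerate() == 16000 and wav.getnchannels() == 1
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

# Timing this call on a Raspberry Pi 3 B+ is enough to compare
# TensorFlow and TensorFlow Lite inference speeds.
start = time.perf_counter()
text = model.stt(audio)
print(f"{text!r} ({time.perf_counter() - start:.2f}s)")
```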
Related papers
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z)
- Automatic Speech Recognition for Hindi [0.6292138336765964]
The research involved developing a web application and designing a web interface for speech recognition.
The web application manages large volumes of audio files and their transcriptions, facilitating human correction of ASR transcripts.
The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine (a minimal VAD sketch follows below).
arXiv Detail & Related papers (2024-06-26T07:39:20Z)
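As an aside, a VAD step like the one described above is commonly implemented with the py-webrtcvad package on 16 kHz mono PCM. The sketch below is a hypothetical illustration under that assumption (the file name and aggressiveness level are invented), not code from that paper:

```python
import wave

import webrtcvad  # pip install webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least) to 3 (most)

SAMPLE_RATE = 16000  # webrtcvad accepts 8/16/32/48 kHz input
FRAME_MS = 30        # frames must be 10, 20, or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono samples

with wave.open("mic_capture.wav", "rb") as wav:
    pcm = wav.readframes(wav.getnframes())

# Keep only the frames that contain speech before sending audio onward.
voiced = [
    pcm[i:i + FRAME_BYTES]
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)
    if vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE)
]
print(f"{len(voiced)} voiced frames of {FRAME_MS} ms")
```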
- SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models [58.996653700982556]
Existing speech tokens are not specifically designed for speech language modeling.
We propose SpeechTokenizer, a unified speech tokenizer for speech large language models.
Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark.
arXiv Detail & Related papers (2023-08-31T12:53:09Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- EfficientSpeech: An On-Device Text to Speech Model [15.118059441365343]
State of the art (SOTA) neural text to speech (TTS) models can generate natural-sounding synthetic voices.
This work proposes EfficientSpeech, an efficient neural TTS model that synthesizes speech in real time on an ARM CPU.
arXiv Detail & Related papers (2023-05-23T10:28:41Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors (a toy sketch of residual vector quantization follows below).
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, synthesis robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
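For intuition about the residual vector quantizers mentioned above: each quantizer stage picks the nearest codeword to the previous stage's residual, and the quantized latent is the sum of the chosen codewords. The toy numpy sketch below uses random codebooks purely for illustration; it is in no way NaturalSpeech 2's actual codec:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CODEBOOK_SIZE, STAGES = 8, 64, 4

# One codebook per quantizer stage (a real codec learns these).
codebooks = rng.normal(size=(STAGES, CODEBOOK_SIZE, DIM))

def rvq_encode(x, codebooks):
    """Return per-stage code indices; each stage quantizes the residual."""
    residual, codes = x, []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]  # the next stage sees what is left over
    return codes

def rvq_decode(codes, codebooks):
    """The quantized latent is the sum of the chosen codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

x = rng.normal(size=DIM)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
# With learned codebooks, the reconstruction error shrinks as STAGES grows.
print(codes, np.linalg.norm(x - x_hat))
```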
- Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents [0.0]
We have used a transfer learning approach to develop an end-to-end speech recognition system for Indian-English accents.
Indic TTS data of Indian-English accents is used for transfer learning and fine-tuning the pre-trained Deep Speech model.
arXiv Detail & Related papers (2022-04-03T03:11:21Z)
- SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing [77.4527868307914]
We propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.
The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets.
To align the textual and speech information into a unified semantic space, we propose a cross-modal vector quantization method with random mixing-up to bridge speech and text.
arXiv Detail & Related papers (2021-10-14T07:59:27Z)
- A review of on-device fully neural end-to-end automatic speech recognition algorithms [20.469868150587075]
Fully neural end-to-end speech recognition algorithms have been proposed in recent years.
We review these algorithms and their optimization techniques for on-device applications, and extensively discuss their structures, performance, and advantages compared to conventional algorithms.
arXiv Detail & Related papers (2020-12-14T22:18:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.