Digits micro-model for accurate and secure transactions
- URL: http://arxiv.org/abs/2402.01931v1
- Date: Fri, 2 Feb 2024 22:01:27 GMT
- Title: Digits micro-model for accurate and secure transactions
- Authors: Chirag Chhablani, Nikhita Sharma, Jordan Hosier, and Vijay K. Gurbani
- Abstract summary: We highlight the potential of smaller, specialized "micro" speech recognition models.
Unlike larger speech recognition models, micro-models are trained on carefully selected and curated datasets.
Our work contributes to domain-specific ASR models, improving digit recognition accuracy and data privacy.
- Score: 0.5999777817331317
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic Speech Recognition (ASR) systems are used in the financial domain
to enhance the caller experience by enabling natural language understanding and
facilitating efficient and intuitive interactions. Increasing use of ASR
systems requires that such systems exhibit very low error rates. The
predominant ASR models used to collect numeric data are large, general-purpose
models, either commercial -- Google Speech-to-Text (STT), Amazon Transcribe --
or open source (OpenAI's Whisper). Such ASR models are trained on hundreds of
thousands of hours of audio data and require considerable resources to run.
Despite recent progress in large speech recognition models, we highlight the
potential of smaller, specialized "micro" models. Such lightweight models can
be trained to perform well on number-recognition tasks, competing with
general models like Whisper or Google STT while using less than 80 minutes of
training time and occupying at least an order of magnitude less memory. Also,
unlike larger speech recognition models, micro-models are trained on carefully
selected and curated datasets, which makes them highly accurate, agile, and
easy to retrain, while using low compute resources. We present our work on
creating micro models for multi-digit number recognition that handle diverse
speaking styles reflecting real-world pronunciation patterns. Our work
contributes to domain-specific ASR models, improving digit recognition
accuracy and data privacy. As an added advantage, their low resource
consumption allows them to be hosted on-premises, keeping private data local
instead of uploading it to an external cloud. Our results indicate that our
micro-model makes fewer errors than the best-of-breed commercial or open-source
ASRs in recognizing digits (1.8% error rate of our best micro-model versus 5.8%
error rate of Whisper), and has a low memory footprint (0.66 GB VRAM for our
model versus 11 GB VRAM for Whisper).
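
The abstract reports digit error rates and memory figures but no code. As a minimal, illustrative sketch (not the authors' evaluation pipeline), the digit error rate can be computed as token-level edit distance over spoken-digit sequences:

```python
# Minimal sketch of digit-sequence error-rate evaluation (not the authors' code).
# Error rate here is Levenshtein edit distance over digit tokens, summed across
# utterances and divided by the total number of reference digits.

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Token-level Levenshtein distance via single-array dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dp[j] + 1,          # deletion
                      dp[j - 1] + 1,      # insertion
                      prev + (r != h))    # substitution (0 if tokens match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def digit_error_rate(pairs: list[tuple[str, str]]) -> float:
    """pairs: (reference, hypothesis) strings of spoken digits, e.g. '4 1 5'."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    total = sum(len(r.split()) for r, _ in pairs)
    return errors / total

# Toy usage: two utterances, one substitution error out of 7 reference digits.
pairs = [("4 1 5 8", "4 1 5 8"), ("9 0 2", "9 0 5")]
print(f"digit error rate: {digit_error_rate(pairs):.3f}")  # ~0.143
```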
Related papers
- Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
Our results show spoken language identification accuracy improvements of 8.7% and 6.1%, respectively, and word error rates which are 3.3% and 2.0% lower on these benchmarks.
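
The summary does not specify the re-ranking features; below is a minimal sketch of generic N-best re-ranking, assuming interpolated ASR, language-model, and language-ID scores with made-up weights:

```python
# Hypothetical sketch of N-best re-ranking: combine the ASR decoder's score
# with external scores (a language-model log-probability and a spoken-language-
# identification confidence) and pick the best hypothesis. Weights are made up.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    asr_logprob: float   # score from the ASR decoder
    lm_logprob: float    # score from an external language model
    lid_logprob: float   # language-identification confidence

def rerank(nbest: list[Hypothesis], lm_w: float = 0.3, lid_w: float = 0.2) -> Hypothesis:
    """Return the hypothesis with the highest interpolated score."""
    return max(nbest, key=lambda h: h.asr_logprob
                                    + lm_w * h.lm_logprob
                                    + lid_w * h.lid_logprob)

nbest = [
    Hypothesis("set a timer", -3.1, -8.0, -0.2),
    Hypothesis("set a time",  -2.9, -12.5, -0.2),
]
print(rerank(nbest).text)  # "set a timer": the LM term outweighs the small ASR gap
```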
arXiv Detail & Related papers (2024-09-27T03:31:32Z)
- Exploring the limits of decoder-only models trained on public speech recognition corpora [36.446905777292066]
The Decoder-Only Transformer for ASR (DOTA) model comprehensively outperforms the encoder-decoder open-source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets.
arXiv Detail & Related papers (2024-01-31T23:29:42Z)
- Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models [2.4654745083407175]
We propose a new multi-round adaptation process that uses uncertainty to automate the annotation process.
This novel method streamlines data annotation and strategically selects data samples contributing most to model uncertainty.
Our results show that our approach leads to a 27% WER relative average improvement while requiring on average 45% less data than established baselines.
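
A minimal sketch of uncertainty-driven sample selection, assuming disagreement across stochastic decoding passes as the uncertainty proxy (the paper's exact estimator is not given in the summary):

```python
# Illustrative sketch, not the paper's exact method: epistemic uncertainty is
# approximated by disagreement among stochastic forward passes (e.g., Monte
# Carlo dropout); the most uncertain utterances are sent for annotation in the
# next adaptation round.
from collections import Counter
from typing import Callable

def disagreement(transcripts: list[str]) -> float:
    """Fraction of passes that differ from the majority transcript."""
    counts = Counter(transcripts)
    return 1.0 - counts.most_common(1)[0][1] / len(transcripts)

def select_for_annotation(
    utterances: list[str],
    transcribe: Callable[[str], str],  # stochastic ASR pass (dropout enabled)
    passes: int = 5,
    budget: int = 100,
) -> list[str]:
    """Return the `budget` utterances the model is least certain about."""
    scored = [(disagreement([transcribe(utt) for _ in range(passes)]), utt)
              for utt in utterances]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [utt for _, utt in scored[:budget]]
```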
arXiv Detail & Related papers (2023-06-03T13:11:37Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
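
A minimal sketch of the general reprogramming idea, assuming a frozen pre-trained encoder with a small trainable input transform (module shapes and names are illustrative, not the paper's architecture):

```python
# Hedged sketch of input reprogramming for a frozen pre-trained ASR encoder
# (PyTorch). Only the small "reprogramming" network is trained; the backbone
# stays frozen, which is what makes the approach parameter-efficient.
import torch
import torch.nn as nn

class ReprogrammedASR(nn.Module):
    def __init__(self, frozen_encoder: nn.Module, feat_dim: int = 80):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():      # freeze the pre-trained model
            p.requires_grad = False
        # Learnable feature enhancement: a residual transform on input features.
        self.reprogram = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim, time); add a learned perturbation, then
        # run the otherwise-unchanged frozen encoder.
        return self.encoder(feats + self.reprogram(feats))
```

Only `model.reprogram.parameters()` would be passed to the optimizer, so the per-language trainable footprint stays small.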
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Incremental Online Learning Algorithms Comparison for Gesture and Visual Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z)
- Data Augmentation for Low-Resource Quechua ASR Improvement [2.260916274164351]
Deep learning methods have made it possible to deploy systems with word error rates below 5% for ASR of English.
For so-called low-resource languages, methods for creating new resources from existing ones are being investigated.
We describe our data augmentation approach to improve the results of ASR models for low-resource and agglutinative languages.
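
The summary does not name the specific augmentations; here is a minimal sketch of two standard ones (speed perturbation and additive noise) under that assumption:

```python
# Illustrative sketch of two common audio augmentations for low-resource ASR;
# the paper's exact augmentation recipe is not specified in the summary.
import numpy as np

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample by linear interpolation; factor > 1 speeds the audio up."""
    n_out = int(len(wave) / factor)
    idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(idx, np.arange(len(wave)), wave)

def add_noise(wave: np.ndarray, snr_db: float,
              rng=np.random.default_rng()) -> np.ndarray:
    """Add white noise at the requested signal-to-noise ratio."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return wave + rng.normal(0.0, np.sqrt(noise_power), len(wave))

# Each original utterance yields several new training examples.
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in audio
augmented = [speed_perturb(wave, f) for f in (0.9, 1.1)] + [add_noise(wave, 20)]
```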
arXiv Detail & Related papers (2022-07-14T12:49:15Z)
- Visual Speech Recognition for Multiple Languages in the Wild [64.52593130370757]
We show that designing better VSR models is equally important to using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
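
A minimal sketch of how prediction-based auxiliary tasks typically enter training, assuming a weighted multi-task objective (the paper's actual auxiliary targets and weights are not given in the summary):

```python
# Illustrative multi-task objective: the main recognition loss plus weighted
# auxiliary prediction losses from hypothetical extra heads on the VSR model.
import torch

def vsr_loss(main_loss: torch.Tensor,
             aux_losses: dict[str, torch.Tensor],
             weights: dict[str, float]) -> torch.Tensor:
    """Combine the main loss with each auxiliary loss at its assigned weight."""
    total = main_loss
    for name, loss in aux_losses.items():
        total = total + weights.get(name, 0.1) * loss
    return total
```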
arXiv Detail & Related papers (2022-02-26T07:21:00Z)
- CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR).
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z)
- Scaling ASR Improves Zero and Few Shot Learning [23.896440724468246]
We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets.
By training 1-10B parameter universal English ASR models, we push the limits of speech recognition performance across many domains.
For speakers with disorders due to brain damage, our best zero-shot and few-shot models achieve 22% and 60% relative improvement on the AphasiaBank test set, respectively.
arXiv Detail & Related papers (2021-11-10T21:18:59Z)
- Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data [44.48235209327319]
Streaming end-to-end automatic speech recognition models are widely used on smart speakers and on-device applications.
We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher.
We scale the training of streaming models to up to 3 million hours of YouTube audio.
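
A minimal sketch of the teacher-student recipe, assuming the non-streaming teacher pseudo-labels unlabeled audio for the streaming student (function names are hypothetical stand-ins):

```python
# Hedged sketch of one distillation step: a non-streaming (full-context) ASR
# model pseudo-labels unlabeled audio, and the streaming model trains on those
# labels. `teacher`, `student`, and `asr_loss` are hypothetical stand-ins.
def distill_step(unlabeled_batch, teacher, student, asr_loss, optimizer):
    # 1) Teacher decodes the raw audio with full (non-streaming) context.
    pseudo_labels = [teacher.transcribe(wave) for wave in unlabeled_batch]
    # 2) Student trains as if the pseudo-labels were ground truth.
    loss = asr_loss(student(unlabeled_batch), pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```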
arXiv Detail & Related papers (2020-10-22T22:41:33Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system that supports rare languages at low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)