Can Visual Context Improve Automatic Speech Recognition for an Embodied
Agent?
- URL: http://arxiv.org/abs/2210.13189v1
- Date: Fri, 21 Oct 2022 11:16:05 GMT
- Title: Can Visual Context Improve Automatic Speech Recognition for an Embodied
Agent?
- Authors: Pradip Pramanick, Chayan Sarkar
- Abstract summary: We propose a new decoder biasing technique to incorporate the visual context while ensuring the ASR output does not degrade for incorrect context.
We achieve a 59% relative reduction in WER from an unmodified ASR system.
- Score: 3.7311680121118345
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The usage of automatic speech recognition (ASR) systems is becoming
omnipresent, ranging from personal assistants and chatbots to home and industrial
automation systems. Modern robots are also equipped with ASR capabilities
for interacting with humans as speech is the most natural interaction modality.
However, ASR in robots faces additional challenges as compared to a personal
assistant. Being an embodied agent, a robot must recognize the physical
entities around it and therefore reliably recognize the speech containing the
description of such entities. However, current ASR systems are often unable to
do so due to limitations in ASR training, such as generic datasets and
open-vocabulary modeling. Moreover, adverse conditions during inference, such as
noise, accented speech, and far-field speech, make the transcription inaccurate. In
this work, we present a method to incorporate a robot's visual information into
an ASR system and improve the recognition of a spoken utterance containing a
visible entity. Specifically, we propose a new decoder biasing technique to
incorporate the visual context while ensuring the ASR output does not degrade
for incorrect context. We achieve a 59% relative reduction in WER from an
unmodified ASR system.
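
As a rough illustration of how visual-context biasing of an ASR decoder can work, the sketch below adds a non-negative bonus to beam-search hypotheses that mention entities detected by the robot's camera, so an incorrect context can never push a hypothesis below its unbiased score. All names, scores, and the bonus value are invented for illustration; this is not the paper's exact biasing scheme.

```python
# Hypothetical sketch of contextual decoder biasing for an embodied agent.
# The bonus is strictly non-negative, so a wrong visual context cannot make
# the output worse than the unbiased ASR decoder's ranking.
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str        # words decoded so far by the beam search
    log_prob: float  # score assigned by the unbiased ASR decoder


def bias_score(hyp: Hypothesis, visual_entities: set[str], bonus: float = 2.0) -> float:
    """Return the biased score used to rank beam-search hypotheses."""
    matched = sum(1 for w in hyp.text.lower().split() if w in visual_entities)
    return hyp.log_prob + bonus * matched  # bonus >= 0, so no degradation


# Toy usage: the camera sees a "mug", so the matching hypothesis is preferred.
visual_entities = {"mug", "table"}
beams = [
    Hypothesis("bring me the mag", -4.1),
    Hypothesis("bring me the mug", -4.3),
]
best = max(beams, key=lambda h: bias_score(h, visual_entities))
print(best.text)  # -> "bring me the mug"
```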
Related papers
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- A Deep Learning System for Domain-specific Speech Recognition [0.0]
The author works with pre-trained DeepSpeech2 and Wav2Vec2 acoustic models to develop benefit-specific ASR systems.
The best performance comes from a fine-tuned Wav2Vec2-Large-LV60 acoustic model with an external KenLM.
The viability of using error-prone ASR transcriptions as part of spoken language understanding (SLU) is also investigated.
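
The entry above mentions a fine-tuned Wav2Vec2 model combined with an external KenLM. Below is a minimal sketch of that general recipe using the pyctcdecode library for LM-fused CTC decoding; the checkpoint name, the "domain_lm.arpa" path, the dummy audio, and the decoding weights are placeholders rather than the paper's actual configuration.

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from pyctcdecode import build_ctcdecoder

model_id = "facebook/wav2vec2-large-960h-lv60-self"   # public checkpoint used as a stand-in
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).eval()

# pyctcdecode expects the vocabulary ordered by token id; Wav2Vec2 uses "|" as
# its word delimiter, so map it to a space for word-level LM scoring.
vocab = [tok for tok, _ in sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])]
vocab = [" " if tok == "|" else tok for tok in vocab]
decoder = build_ctcdecoder(vocab, kenlm_model_path="domain_lm.arpa",  # hypothetical n-gram LM
                           alpha=0.5, beta=1.0)

speech = np.zeros(16000, dtype=np.float32)             # stand-in for 1 s of 16 kHz audio
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    log_probs = model(inputs.input_values).logits.log_softmax(-1)[0].numpy()
print(decoder.decode(log_probs))                        # LM-rescored transcription
```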
arXiv Detail & Related papers (2023-03-18T22:19:09Z)
- Hey ASR System! Why Aren't You More Inclusive? Automatic Speech Recognition Systems' Bias and Proposed Bias Mitigation Techniques. A Literature Review [0.0]
We present research that addresses ASR biases related to gender, race, illness, and disability.
We also discuss techniques for designing a more accessible and inclusive ASR technology.
arXiv Detail & Related papers (2022-11-17T13:15:58Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Automatic Speech Recognition using limited vocabulary: A survey [0.0]
An approach to design an ASR system targeting under-resourced languages is to start with a limited vocabulary.
This paper aims to provide a comprehensive view of mechanisms behind ASR systems as well as techniques, tools, projects, recent contributions, and possibly future directions in ASR using a limited vocabulary.
arXiv Detail & Related papers (2021-08-23T15:51:41Z)
- On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
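
As a small sketch of the acoustic-feature side of the pipeline above, the snippet below extracts utterance-level functionals with the openSMILE Python package, one of the toolkits the entry lists. The audio path is hypothetical, and the emotion classifier trained on top of these features is not shown.

```python
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,       # 6 373 functionals per utterance
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("utterance.wav")            # hypothetical IEMOCAP clip
print(features.shape)                                      # (1, 6373) pandas DataFrame
```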
arXiv Detail & Related papers (2021-04-20T17:10:01Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
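
The core idea behind the BPE-dropout technique used above is to randomly skip merges during byte-pair encoding so the same word yields different sub-word segmentations across training epochs. Below is a toy, self-contained illustration of that idea with an invented merge table; it is not the paper's implementation.

```python
import random

# Toy merge table: ranked BPE merges (highest priority first). Illustrative only.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
RANK = {pair: i for i, pair in enumerate(MERGES)}

def bpe_dropout_encode(word: str, dropout: float = 0.1) -> list[str]:
    """Encode one word with BPE, randomly skipping merges with probability `dropout`."""
    pieces = list(word)
    while True:
        # Highest-ranked adjacent pair that survives dropout this step.
        candidates = [
            (RANK[(a, b)], i)
            for i, (a, b) in enumerate(zip(pieces, pieces[1:]))
            if (a, b) in RANK and random.random() >= dropout
        ]
        if not candidates:
            return pieces
        _, i = min(candidates)
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]

random.seed(0)
for _ in range(3):
    print(bpe_dropout_encode("lower", dropout=0.3))  # segmentation varies across calls
```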
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Self-supervised reinforcement learning for speaker localisation with the iCub humanoid robot [58.2026611111328]
Looking at a person's face is one of the mechanisms that humans rely on when it comes to filtering speech in noisy environments.
Having a robot that can look toward a speaker could benefit ASR performance in challenging environments.
We propose a self-supervised reinforcement learning-based framework inspired by the early development of humans.
arXiv Detail & Related papers (2020-11-12T18:02:15Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
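
APR can be framed as text-to-text rewriting, which the sketch below illustrates with a generic pretrained seq2seq model. The "t5-small" checkpoint is a stand-in rather than one of the paper's models, and the example pair is invented; a real system would be fine-tuned on many (ASR hypothesis, human-readable reference) pairs.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

asr_hypothesis = "um so the meeting is uh moved to three pm on friday"
reference = "The meeting is moved to 3 p.m. on Friday."

# One training example: the noisy hypothesis is the input, the readable text the target.
batch = tokenizer(asr_hypothesis, text_target=reference, return_tensors="pt")
loss = model(**batch).loss  # fine-tuning would minimise this over many pairs
print(float(loss))
```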
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
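
The streaming entry above relies on time-restricted self-attention, where each frame may only attend to a bounded window of past and future frames. The sketch below builds such a window mask for a standard PyTorch attention layer; the window sizes and dimensions are illustrative and this is not the paper's exact configuration (its triggered encoder-decoder attention is not shown).

```python
import torch

def time_restricted_mask(seq_len: int, left: int, right: int) -> torch.Tensor:
    """Boolean mask where True marks positions a frame may NOT attend to."""
    idx = torch.arange(seq_len)
    rel = idx[None, :] - idx[:, None]        # rel[i, j] = j - i
    return (rel < -left) | (rel > right)     # outside the allowed window

mask = time_restricted_mask(seq_len=6, left=2, right=1)
attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.randn(1, 6, 16)                    # (batch, frames, features)
out, _ = attn(x, x, x, attn_mask=mask)       # each frame sees 2 past + 1 future frames
print(out.shape)                             # torch.Size([1, 6, 16])
```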