Server-side Rescoring of Spoken Entity-centric Knowledge Queries for
Virtual Assistants
- URL: http://arxiv.org/abs/2311.01398v1
- Date: Thu, 2 Nov 2023 17:07:23 GMT
- Title: Server-side Rescoring of Spoken Entity-centric Knowledge Queries for
Virtual Assistants
- Authors: Youyuan Zhang, Sashank Gondala, Thiago Fraga-Silva, Christophe Van
Gysel
- Abstract summary: We conduct an empirical study of modeling strategies for server-side rescoring of spoken information domain queries.
We demonstrate significant WER improvements of 23%-35% on various entity-centric query subpopulations.
We also show that model fusion of multiple server-side LMs trained from scratch most effectively combines complementary strengths of each model.
- Score: 5.996525771249284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On-device Virtual Assistants (VAs) powered by Automatic Speech
Recognition (ASR) require effective knowledge integration for the challenging
task of recognizing entity-rich queries. In this paper, we conduct an empirical
study of modeling strategies for server-side rescoring of spoken information
domain queries using various categories of Language Models (LMs) (N-gram word
LMs, sub-word neural LMs). We investigate the combination of on-device and
server-side signals, and demonstrate significant WER improvements of 23%-35% on
various entity-centric query subpopulations by integrating various server-side
LMs, compared to performing ASR on-device only. We also compare LMs trained on
domain data against a GPT-3 variant offered by OpenAI as a baseline.
Furthermore, we show that model fusion of multiple server-side LMs trained from
scratch most effectively combines the complementary strengths of each model and
integrates knowledge learned from domain-specific data into a VA ASR system.
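The abstract describes combining on-device ASR signals with one or more server-side LM scores when rescoring n-best hypotheses, and fusing multiple LMs. Below is a minimal Python sketch of such log-linear n-best rescoring; all names (Hypothesis, rescore_nbest, the toy LM scorers) and the weights are hypothetical illustrations under my own assumptions, not the paper's actual implementation.

```python
# Minimal sketch: fuse on-device scores with server-side LM scores over an
# n-best list and pick the top hypothesis. Names and weights are hypothetical.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Hypothesis:
    text: str
    am_score: float         # on-device acoustic model log-score
    device_lm_score: float  # on-device first-pass LM log-score

def rescore_nbest(
    hypotheses: Sequence[Hypothesis],
    server_lms: Sequence[Callable[[str], float]],  # each returns a log-score
    weights: Sequence[float],                      # one interpolation weight per server LM
    am_weight: float = 1.0,
    device_lm_weight: float = 1.0,
) -> Hypothesis:
    """Return the hypothesis with the highest fused score.

    The fused score is a weighted sum of log-scores from on-device signals
    and server-side LMs; in practice the weights would be tuned on a dev set.
    """
    def fused_score(h: Hypothesis) -> float:
        score = am_weight * h.am_score + device_lm_weight * h.device_lm_score
        for w, lm in zip(weights, server_lms):
            score += w * lm(h.text)
        return score

    return max(hypotheses, key=fused_score)

if __name__ == "__main__":
    # Toy entity-centric example: a server-side LM helps pick the correct entity.
    nbest = [
        Hypothesis("play songs by the beetles", am_score=-12.1, device_lm_score=-9.0),
        Hypothesis("play songs by the beatles", am_score=-12.4, device_lm_score=-9.3),
    ]
    ngram_lm = lambda text: -0.5 * len(text.split())             # stand-in for an N-gram word LM
    neural_lm = lambda text: -1.0 if "beetles" in text else 0.0  # stand-in for a sub-word neural LM
    best = rescore_nbest(nbest, [ngram_lm, neural_lm], weights=[0.3, 0.7])
    print(best.text)  # -> "play songs by the beatles"
```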
Related papers
- ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark [28.28891500803133]
We propose ContextASR-Bench to assess the linguistic competence of Automatic Speech Recognition systems. It encompasses up to 40,000 data entries with more than 300,000 named entities across over 10 domains. Extensive evaluation shows LALMs outperform conventional ASR models by a large margin thanks to the strong world knowledge and context modeling of LLMs.
arXiv Detail & Related papers (2025-07-08T07:21:20Z)
- Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval [9.230429417848393]
Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval.
arXiv Detail & Related papers (2025-06-11T21:37:54Z)
- Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z)
- Reinforcement Learning for Long-Horizon Interactive LLM Agents [56.9860859585028]
Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests.
We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments.
We derive LOOP, a data- and memory-efficient variant of proximal policy optimization.
arXiv Detail & Related papers (2025-02-03T18:35:42Z)
- SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions [48.02083833667388]
We present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions.
We employ low-rank adaptation modules for parameter-efficient training of both the audio encoder and the Large Language Model.
arXiv Detail & Related papers (2025-01-31T18:30:36Z)
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets [22.29915616018026]
Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks.
Our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules.
We introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information.
arXiv Detail & Related papers (2024-05-03T14:35:58Z)
- A Reference-less Quality Metric for Automatic Speech Recognition via Contrastive-Learning of a Multi-Language Model with Self-Supervision [0.20999222360659603]
This work proposes a referenceless quality metric, which allows comparing the performance of different ASR models on a speech dataset without ground truth transcriptions.
To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised learning manner.
The proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from the state-of-the-art multi-lingual LM in all experiments.
arXiv Detail & Related papers (2023-06-21T21:33:39Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- Integrating Categorical Features in End-to-End ASR [1.332560004325655]
All-neural, end-to-end ASR systems convert speech input to text units using a single trainable neural network model.
E2E models require large amounts of paired speech text data that is expensive to obtain.
We propose a simple yet effective way to integrate categorical features into E2E models.
arXiv Detail & Related papers (2021-10-06T20:07:53Z)
- Multimodal Federated Learning [9.081857621783811]
In many applications, such as smart homes with IoT devices, local data on clients are generated from different modalities.
Existing federated learning systems only work on local data from a single modality, which limits the scalability of the systems.
We propose a multimodal and semi-supervised federated learning framework that trains autoencoders to extract shared or correlated representations from different local data modalities on clients.
arXiv Detail & Related papers (2021-09-10T12:32:46Z)
- Arabic Code-Switching Speech Recognition using Monolingual Data [13.513655231184261]
Code-switching in automatic speech recognition (ASR) is an important challenge due to globalization.
Recent research in multilingual ASR shows potential improvement over monolingual systems.
We study key issues related to multilingual modeling for ASR through a series of large-scale ASR experiments.
arXiv Detail & Related papers (2021-07-04T08:40:49Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.