Pattern-aware Data Augmentation for Query Rewriting in Voice Assistant
Systems
- URL: http://arxiv.org/abs/2012.11468v1
- Date: Mon, 21 Dec 2020 16:36:32 GMT
- Title: Pattern-aware Data Augmentation for Query Rewriting in Voice Assistant
Systems
- Authors: Yunmo Chen, Sixing Lu, Fan Yang, Xiaojiang Huang, Xing Fan, Chenlei
Guo
- Abstract summary: We propose an augmentation framework that learns patterns from existing training pairs and generates rewrite candidates from rewrite labels inversely to compensate for insufficient QR training data.
Our experimental results show its effectiveness compared with a fully trained QR baseline and demonstrate its potential application in boosting the QR performance on low-resource domains or locales.
- Score: 10.332550622090718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Query rewriting (QR) systems are widely used to reduce the friction caused by
errors in a spoken language understanding pipeline. However, the underlying
supervised models require a large number of labeled pairs, and these pairs are
hard and costly to collect. Therefore, we propose an augmentation
framework that learns patterns from existing training pairs and generates
rewrite candidates from rewrite labels inversely to compensate for insufficient
QR training data. The proposed framework casts the augmentation problem as a
sequence-to-sequence generation task and enforces the optimization process with
a policy gradient technique for controllable rewarding. This approach goes
beyond the traditional heuristics or rule-based augmentation methods and is not
constrained to generate predefined patterns of swapping/replacing words. Our
experimental results show its effectiveness compared with a fully trained QR
baseline and demonstrate its potential application in boosting the QR
performance on low-resource domains or locales.
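The abstract describes the framework only at a high level: generation of defective-query candidates from rewrite labels, optimized with a policy gradient. The following is a minimal sketch of that general idea and not the authors' implementation. It swaps the neural sequence-to-sequence generator for a tabular per-token policy so that it runs with the Python standard library alone. The vocabulary, the `SOUNDS_LIKE` table, the `toy_reward` function, the moving-average baseline, and all hyperparameters are assumptions made purely for illustration.

```python
import math
import random

# Toy vocabulary and a hypothetical "sounds alike" table; neither comes from the paper.
VOCAB = ["play", "pay", "lights", "likes", "the"]
SOUNDS_LIKE = {("play", "pay"), ("lights", "likes")}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(label, candidate):
    """Hypothetical reward: 1.0 if exactly one token differs from the label
    and that substitution is a sounds-alike pair, otherwise 0.0."""
    diffs = [(l, c) for l, c in zip(label, candidate) if l != c]
    return 1.0 if len(diffs) == 1 and diffs[0] in SOUNDS_LIKE else 0.0


def sample_candidate(policy, label, rng):
    """Sample one output token per label token; return chosen indices and tokens."""
    indices = [
        rng.choices(range(len(VOCAB)), weights=softmax(policy[tok]))[0]
        for tok in label
    ]
    return indices, [VOCAB[i] for i in indices]


def reinforce_update(policy, label, indices, advantage, lr):
    """REINFORCE step: logits += lr * advantage * grad(log pi(chosen token))."""
    for tok, chosen in zip(label, indices):
        probs = softmax(policy[tok])
        for k in range(len(VOCAB)):
            # Gradient of log-softmax w.r.t. logit k: one-hot(chosen) - probs.
            grad_log_prob = (1.0 if k == chosen else 0.0) - probs[k]
            policy[tok][k] += lr * advantage * grad_log_prob


def train(label, episodes=5000, lr=0.5, seed=0):
    """Train the tabular policy on a single rewrite label (toy setting)."""
    rng = random.Random(seed)
    policy = {tok: [0.0] * len(VOCAB) for tok in label}
    baseline = 0.0  # moving-average reward, used to reduce gradient variance
    for _ in range(episodes):
        indices, candidate = sample_candidate(policy, label, rng)
        r = toy_reward(label, candidate)
        reinforce_update(policy, label, indices, r - baseline, lr)
        baseline = 0.9 * baseline + 0.1 * r
    return policy


if __name__ == "__main__":
    label = ["play", "the", "lights"]
    policy = train(label)
    rng = random.Random(1)
    for _ in range(3):
        candidate = sample_candidate(policy, label, rng)[1]
        print(" ".join(candidate), "->", " ".join(label))
```

Each sampled candidate paired with its label could serve as a synthetic (query, rewrite) training pair. In this sketch the reward function is the only place where the desired error pattern is specified, which is one plausible reading of "controllable rewarding"; the reward actually used in the paper is not given in the abstract.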
Related papers
- Gumbel Reranking: Differentiable End-to-End Reranker Optimization [61.16471123356738]
RAG systems rely on rerankers to identify relevant documents.
Fine-tuning these models remains challenging due to the scarcity of annotated query-document pairs.
We propose Gumbel Reranking, an end-to-end training framework for rerankers aimed at minimizing the training-inference gap.
arXiv Detail & Related papers (2025-02-16T13:23:39Z)
- Chain-of-Retrieval Augmented Generation [72.06205327186069]
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer.
Our proposed method, CoRAG, allows the model to dynamically reformulate the query based on the evolving state.
arXiv Detail & Related papers (2025-01-24T09:12:52Z)
- GEC-RAG: Improving Generative Error Correction via Retrieval-Augmented Generation for Automatic Speech Recognition Systems [8.669397145785942]
We propose Generative Error Correction via Retrieval-Augmented Generation (GEC-RAG) to improve ASR accuracy for low-resource domains, like Persian.
GEC-RAG retrieves lexically similar examples to the ASR transcription using the Term Frequency-Inverse Document Frequency (TF-IDF) measure.
arXiv Detail & Related papers (2025-01-18T11:53:22Z)
- RaFe: Ranking Feedback Improves Query Rewriting for RAG [83.24385658573198]
We propose a framework for training query rewriting models free of annotations.
By leveraging a publicly available reranker, RaFe provides feedback aligned well with the rewriting objectives.
arXiv Detail & Related papers (2024-05-23T11:00:19Z)
- RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation [42.82192656794179]
Large Language Models (LLMs) exhibit remarkable capabilities but are prone to generating inaccurate or hallucinatory responses.
This limitation stems from their reliance on vast pretraining datasets, making them susceptible to errors in unseen scenarios.
Retrieval-Augmented Generation (RAG) addresses this by incorporating external, relevant documents into the response generation process.
arXiv Detail & Related papers (2024-03-31T08:58:54Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Noise-Robust Dense Retrieval via Contrastive Alignment Post Training [89.29256833403167]
Contrastive Alignment POst Training (CAPOT) is a highly efficient finetuning method that improves model robustness without requiring index regeneration.
CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered root.
We evaluate CAPOT on noisy variants of MSMARCO, Natural Questions, and Trivia QA passage retrieval, finding that CAPOT has a similar impact to data augmentation with none of its overhead.
arXiv Detail & Related papers (2023-04-06T22:16:53Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
- Pre-Training for Query Rewriting in A Spoken Language Understanding System [14.902583546933563]
We first propose a neural-retrieval based approach for query rewriting.
Then, inspired by the wide success of pre-trained contextual language embeddings, we propose a language-modeling (LM) based approach.
arXiv Detail & Related papers (2020-02-13T16:31:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.