Related papers: QAEA-DR: A Unified Text Augmentation Framework for Dense Retrieval

QAEA-DR: A Unified Text Augmentation Framework for Dense Retrieval

URL: http://arxiv.org/abs/2407.20207v1
Date: Mon, 29 Jul 2024 17:39:08 GMT
Title: QAEA-DR: A Unified Text Augmentation Framework for Dense Retrieval
Authors: Hongming Tan, Shaoxiong Zhan, Hai Lin, Hai-Tao Zheng, Wai Kin, Chan,
Abstract summary: In dense retrieval, embedding long texts into dense vectors can result in information loss, leading to inaccurate query-text matching. Recent studies mainly focus on improving the sentence embedding model or retrieval process. We introduce a novel text augmentation framework for dense retrieval, which transforms raw documents into information-dense text formats.
Score: 12.225881591629815
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In dense retrieval, embedding long texts into dense vectors can result in information loss, leading to inaccurate query-text matching. Additionally, low-quality texts with excessive noise or sparse key information are unlikely to align well with relevant queries. Recent studies mainly focus on improving the sentence embedding model or retrieval process. In this work, we introduce a novel text augmentation framework for dense retrieval. This framework transforms raw documents into information-dense text formats, which supplement the original texts to effectively address the aforementioned issues without modifying embedding or retrieval methodologies. Two text representations are generated via large language models (LLMs) zero-shot prompting: question-answer pairs and element-driven events. We term this approach QAEA-DR: unifying question-answer generation and event extraction in a text augmentation framework for dense retrieval. To further enhance the quality of generated texts, a scoring-based evaluation and regeneration mechanism is introduced in LLM prompting. Our QAEA-DR model has a positive impact on dense retrieval, supported by both theoretical analysis and empirical experiments.

Related papers

QuOTE: Question-Oriented Text Embeddings [8.377715521597292]
QuOTE (Question-Oriented Text Embeddings) is a novel enhancement to retrieval-augmented generation (RAG) systems. Unlike traditional RAG pipelines, QuOTE augments chunks with hypothetical questions that the chunk can potentially answer. We demonstrate that QuOTE significantly enhances retrieval accuracy, including in multi-hop question-answering tasks.
arXiv Detail & Related papers (2025-02-16T03:37:13Z)
GeAR: Generation Augmented Retrieval [82.20696567697016]
Document retrieval techniques form the foundation for the development of large-scale information systems. The prevailing methodology is to construct a bi-encoder and compute the semantic similarity. We propose a new method called $textbfGe$neration that incorporates well-designed fusion and decoding modules.
arXiv Detail & Related papers (2025-01-06T05:29:00Z)
Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs. We propose a more realistic setting in which only noisy text and its NER labels are available. We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
Context-augmented Retrieval: A Novel Framework for Fast Information Retrieval based Response Generation using Large Language Model [0.0]
As the corpus of contextual information grows, the answer/inference quality of Retrieval Augmented Generation (RAG) based Question Answering (QA) systems declines. This work solves this problem by combining classical text classification with the Large Language Model (LLM) New approach Context Augmented retrieval (CAR) demonstrates good quality answer generation along with significant reduction in information retrieval and answer generation time.
arXiv Detail & Related papers (2024-06-24T07:52:05Z)
Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model. A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances. Tests demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z)
RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE) It encodes the text corpus into a latent space, capturing current and future information from both source and target text. Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
Optimizing Factual Accuracy in Text Generation through Dynamic Knowledge Selection [71.20871905457174]
Language models (LMs) have revolutionized the way we interact with information, but they often generate nonfactual text. Previous methods use external knowledge as references for text generation to enhance factuality but often struggle with the knowledge mix-up of irrelevant references. We present DKGen, which divide the text generation process into an iterative process.
arXiv Detail & Related papers (2023-08-30T02:22:40Z)
Boosting Punctuation Restoration with Data Generation and Reinforcement Learning [70.26450819702728]
Punctuation restoration is an important task in automatic speech recognition (ASR) The discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts. This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap.
arXiv Detail & Related papers (2023-07-24T17:22:04Z)
Boosting Event Extraction with Denoised Structure-to-Text Augmentation [52.21703002404442]
Event extraction aims to recognize pre-defined event triggers and arguments from texts. Recent data augmentation methods often neglect the problem of grammatical incorrectness. We propose a denoised structure-to-text augmentation framework for event extraction DAEE.
arXiv Detail & Related papers (2023-05-16T16:52:07Z)
A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications [0.02578242050187029]
This paper presents two datasets comprised of artificially generated research content. In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers. The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model. We evaluate the quality of the datasets comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2022-02-04T08:16:56Z)
Exploring Fluent Query Reformulations with Text-to-Text Transformers and Reinforcement Learning [11.205077315939644]
We explore methods to generate query reformulations by training reformulators using text-to-text transformers. We apply policy-based reinforcement learning algorithms to further encourage reward learning. Our framework is demonstrated to be flexible, allowing reward signals to be sourced from different downstream environments.
arXiv Detail & Related papers (2020-12-18T03:16:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.