Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation
- URL: http://arxiv.org/abs/2406.13372v2
- Date: Thu, 10 Oct 2024 08:04:20 GMT
- Title: Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation
- Authors: Kaikai An, Fangkai Yang, Liqun Li, Junting Lu, Sitao Cheng, Shuzheng Si, Lu Wang, Pu Zhao, Lele Cao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang, Baobao Chang
- Abstract summary: How-to questions are integral to decision-making processes and require dynamic, step-by-step answers.
We propose Thread, a novel data organization paradigm aimed at enabling current systems to handle how-to questions more effectively.
- Score: 49.36436704082436
- Abstract: Recent advances in retrieval-augmented generation have significantly improved the performance of question-answering systems, particularly on factoid '5Ws' questions. However, these systems still face substantial challenges when addressing '1H' questions, specifically how-to questions, which are integral to decision-making processes and require dynamic, step-by-step answers. The key limitation lies in the prevalent data organization paradigm, chunk, which divides documents into fixed-size segments and disrupts the logical coherence and connections within the context. To overcome this, we propose Thread, a novel data organization paradigm aimed at enabling current systems to handle how-to questions more effectively. Specifically, we introduce a new knowledge granularity, termed 'logic unit', whereby documents are transformed into more structured and loosely interconnected logic units with large language models. Extensive experiments conducted across both open-domain and industrial settings demonstrate that Thread significantly outperforms existing paradigms, improving the success rate of handling how-to questions by 21% to 33%. Moreover, Thread adapts well to various document formats, drastically reducing the number of candidates in the knowledge base and cutting the required information to one-fourth of what chunk-based organization needs, optimizing both efficiency and effectiveness.
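For concreteness, below is a minimal sketch of what a logic-unit transformation could look like. The `LogicUnit` fields, the prompt wording, and the `call_llm` helper are hypothetical illustrations, not the schema or code from the paper.

```python
# Hypothetical sketch of a Thread-style "logic unit" (field names, prompt
# wording, and the `call_llm` helper are assumptions, not the authors' schema).
# Each unit captures one step of a how-to procedure plus loose links to the
# units it depends on, so a retriever can follow the logic instead of
# stitching together fixed-size chunks.
import json
from dataclasses import dataclass, field


@dataclass
class LogicUnit:
    unit_id: str
    prerequisite: str                 # condition under which this step applies
    action: str                       # the step itself
    expected_outcome: str             # what success looks like
    linked_units: list[str] = field(default_factory=list)  # ids of related units


def document_to_logic_units(document: str, call_llm) -> list[LogicUnit]:
    """Ask an LLM to restructure a how-to document into logic units.

    `call_llm` is an assumed callable mapping a prompt string to a JSON
    string; any LLM client can be dropped in.
    """
    prompt = (
        "Split this how-to document into steps. Return a JSON list of objects "
        "with keys: unit_id, prerequisite, action, expected_outcome, "
        "linked_units (ids of steps this one depends on).\n\n" + document
    )
    return [LogicUnit(**record) for record in json.loads(call_llm(prompt))]
```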
Related papers
- QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism [46.441032033076034]
Memory mechanisms offer a flexible solution for managing long contexts.
We introduce a novel strategy, Question then Reflection Memory Mechanism (QRMeM), incorporating a dual-structured memory pool.
Our evaluation across multiple-choice question (MCQ) and multi-document question answering (Multi-doc QA) benchmarks shows that QRMeM improves performance over existing approaches.
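The summary leaves the structure of the memory pool open; the sketch below is one guess at a dual-structured pool (a static chunk pool plus a dynamic reflection pool), with all names invented for illustration.

```python
# Illustrative dual-structured memory pool in the spirit of QRMeM; the class
# and method names are guesses, not the paper's implementation.
class DualMemoryPool:
    def __init__(self, chunks: list[str]):
        self.static_pool = chunks          # fixed document chunks, loaded once
        self.dynamic_pool: list[str] = []  # reflections accumulated per question

    def reflect(self, question: str, note: str) -> None:
        # After a first-pass answer, store a distilled note keyed to the question.
        self.dynamic_pool.append(f"Q: {question} | note: {note}")

    def retrieve(self, question: str, k: int = 3) -> list[str]:
        # Naive lexical scoring over both pools; a stand-in for a real retriever.
        def score(text: str) -> int:
            return len(set(question.lower().split()) & set(text.lower().split()))

        return sorted(self.static_pool + self.dynamic_pool, key=score, reverse=True)[:k]
```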
arXiv Detail & Related papers (2024-06-19T02:46:18Z) - Learning to Filter Context for Retrieval-Augmented Generation [75.18946584853316]
Generation models must produce outputs even when the retrieved passages are partially or entirely irrelevant.
FILCO identifies useful context based on lexical and information-theoretic approaches.
It trains context filtering models that can filter retrieved contexts at test time.
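A minimal stand-in for the lexical side of such filtering, assuming simple unigram overlap and an arbitrary threshold (the paper trains dedicated filtering models):

```python
# Minimal lexical filter in the spirit of FILCO's lexical measure: keep only
# retrieved passages whose unigram overlap with the query clears a threshold.
# The paper's trained filtering models are more sophisticated; this is a
# stand-in for illustration only.
def lexical_filter(query: str, passages: list[str], threshold: float = 0.2) -> list[str]:
    query_tokens = set(query.lower().split())
    kept = []
    for passage in passages:
        tokens = set(passage.lower().split())
        overlap = len(query_tokens & tokens) / max(len(query_tokens), 1)
        if overlap >= threshold:
            kept.append(passage)
    return kept


# Example: only the passage sharing terms with the query survives.
print(lexical_filter(
    "how to reset a router",
    ["Hold the reset button on the router for ten seconds.",
     "The weather today is sunny."],
))
```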
arXiv Detail & Related papers (2023-11-14T18:41:54Z) - Reranking Passages with Coarse-to-Fine Neural Retriever Enhanced by List-Context Information [0.9463895540925061]
This paper presents a list-context attention mechanism to augment the passage representation by incorporating the list-context information from other candidates.
The proposed coarse-to-fine (C2F) neural retriever addresses the out-of-memory limitation of the passage attention mechanism.
It integrates the coarse and fine rankers into the joint optimization process, allowing for feedback between the two layers to update the model simultaneously.
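As a rough, non-neural illustration of the coarse-to-fine idea, the sketch below scores all passages cheaply, then reranks a shortlist with a score that also looks at the other shortlisted candidates, mimicking list-context information. The scoring functions are stand-ins, not the paper's architecture.

```python
# Two-stage sketch of a coarse-to-fine reranker (illustrative only; the paper
# uses jointly optimized neural rankers with a list-context attention mechanism).
def coarse_score(query: str, passage: str) -> int:
    # Cheap stage: unigram overlap, applied to every candidate.
    return len(set(query.lower().split()) & set(passage.lower().split()))


def fine_rerank(query: str, shortlist: list[str]) -> list[str]:
    # Fine stage: score each candidate relative to the whole shortlist,
    # a crude stand-in for attending over list-context information.
    def list_aware_score(passage: str) -> float:
        own = coarse_score(query, passage)
        others = sum(coarse_score(passage, p) for p in shortlist if p is not passage)
        return own - 0.1 * others  # penalize redundancy with other candidates

    return sorted(shortlist, key=list_aware_score, reverse=True)


def retrieve(query: str, passages: list[str], k: int = 10) -> list[str]:
    shortlist = sorted(passages, key=lambda p: coarse_score(query, p), reverse=True)[:k]
    return fine_rerank(query, shortlist)
```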
arXiv Detail & Related papers (2023-08-23T09:29:29Z) - Information Association for Language Model Updating by Mitigating LM-Logical Discrepancy [68.31760483418901]
Large Language Models (LLMs) struggle to provide current information because their pre-training data becomes outdated.
Existing methods for updating LLMs, such as knowledge editing and continual fine-tuning, have significant drawbacks in generalizability of new information.
We identify the core challenge behind these drawbacks: the LM-logical discrepancy featuring the difference between language modeling probabilities and logical probabilities.
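A toy example of such a discrepancy (my construction, not the paper's formalism): after a real-world update, the logically correct answer distribution shifts, but the LM's learned probabilities lag behind, and the gap can be read off as a cross-entropy.

```python
# Toy illustration of an LM-logical discrepancy (my construction, not the
# paper's formalism). After new information arrives, the logically correct
# distribution over completions shifts, but the LM's learned probabilities
# lag behind; the gap shows up as a cross-entropy.
import math

# LM probabilities for "The CEO of X is ___", learned from stale data.
lm_probs = {"Alice": 0.85, "Bob": 0.15}

# Logical probabilities after the update "Bob replaced Alice as CEO".
logical_probs = {"Alice": 0.0, "Bob": 1.0}

# Cross-entropy of the logical target under the LM: high value = big discrepancy.
discrepancy = -sum(
    p * math.log(lm_probs[answer])
    for answer, p in logical_probs.items()
    if p > 0
)
print(f"discrepancy (nats): {discrepancy:.3f}")  # -log(0.15) ~= 1.897
```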
arXiv Detail & Related papers (2023-05-29T19:48:37Z) - Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering [49.85790367128085]
We pre-train a generic multi-document model with a novel cross-document question answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
arXiv Detail & Related papers (2023-05-24T17:48:40Z) - How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales.
We scale generative retrieval to millions of passages, using a corpus of 8.8M passages and evaluating model sizes up to 11B parameters.
While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z) - Towards Personalized and Human-in-the-Loop Document Summarization [0.0]
This thesis focuses on alleviating information overload using novel summarisation techniques.
It addresses four main challenges: (i) feature engineering in document summarisation, (ii) traditional static and inflexible summaries, (iii) traditional generic summarisation approaches, and (iv) the need for reference summaries.
arXiv Detail & Related papers (2021-08-21T05:34:46Z) - Recurrent Coupled Topic Modeling over Sequential Documents [33.35324412209806]
We show that each current topic evolves from all prior topics with corresponding coupling weights, forming a multi-topic-thread evolution.
A new solution with a set of novel data augmentation techniques is proposed, which successfully decomposes the multi-couplings between evolving topics.
A novel Gibbs sampler with a backward-forward filter algorithm efficiently learns latent time-evolving parameters in closed form.
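The evolution step itself can be sketched compactly; the sketch below uses assumed notation (coupling weights mixing all prior topics into each current topic) and omits the Gibbs sampler and backward-forward filter the paper uses to learn these quantities.

```python
# Assumed-notation sketch of multi-topic-thread evolution: each topic at time t
# is a coupling-weighted mixture of all topics at time t-1. The inference
# machinery (Gibbs sampling, backward-forward filtering) is not shown.
def evolve_topics(prev_topics: list[list[float]], coupling: list[list[float]]) -> list[list[float]]:
    """prev_topics[k][v]: word distribution of topic k at time t-1.
    coupling[j][k]: weight with which prior topic k feeds current topic j."""
    vocab = len(prev_topics[0])
    new_topics = []
    for weights in coupling:
        mixed = [
            sum(w * topic[v] for w, topic in zip(weights, prev_topics))
            for v in range(vocab)
        ]
        total = sum(mixed)
        new_topics.append([x / total for x in mixed])  # renormalize to a distribution
    return new_topics
```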
arXiv Detail & Related papers (2021-06-23T08:58:13Z) - ClarQ: A large-scale and diverse dataset for Clarification Question Generation [67.1162903046619]
We devise a novel bootstrapping framework that assists in the creation of a diverse, large-scale dataset of clarification questions based on post comments extracted from StackExchange.
We quantitatively demonstrate the utility of the newly created dataset by applying it to the downstream task of question-answering.
We release this dataset in order to foster research into the field of clarification question generation with the larger goal of enhancing dialog and question answering systems.
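The summary only names the bootstrapping framework; the loop below is a guessed outline, filtering post comments with a seed heuristic that a trained classifier would replace in later rounds, and is not the authors' pipeline.

```python
# Guessed outline of bootstrapping clarification questions from post comments
# (not the authors' pipeline; the seed heuristic is a placeholder).
def looks_like_clarification(comment: str) -> bool:
    # Seed heuristic: ends with a question mark and contains a detail-seeking cue.
    cues = ("which", "what version", "can you clarify", "do you mean")
    return comment.strip().endswith("?") and any(c in comment.lower() for c in cues)


def bootstrap_dataset(post_comments: list[tuple[str, str]]) -> list[dict]:
    """post_comments: (post, comment) pairs scraped from a QA forum."""
    dataset = []
    for post, comment in post_comments:
        if looks_like_clarification(comment):
            dataset.append({"post": post, "clarification_question": comment})
    # A real bootstrap would retrain a classifier on `dataset`, re-filter
    # with it, and repeat until the collected pairs stabilize.
    return dataset
```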
arXiv Detail & Related papers (2020-06-10T17:56:50Z) - When Deep Learning Meets Data Alignment: A Review on Deep Registration Networks (DRNs) [4.616914111718527]
Recent advancements in machine learning could be a turning point in the field of computer vision.
arXiv Detail & Related papers (2020-03-06T12:56:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.