Towards leveraging LLMs for Conditional QA
- URL: http://arxiv.org/abs/2312.01143v1
- Date: Sat, 2 Dec 2023 14:02:52 GMT
- Title: Towards leveraging LLMs for Conditional QA
- Authors: Syed-Amad Hussain, Parag Pravin Dakle, SaiKrishna Rallabandi and
Preethi Raghavan
- Abstract summary: This study delves into the capabilities and limitations of Large Language Models (LLMs) in the challenging domain of conditional question-answering.
Our findings reveal that fine-tuned LLMs can surpass the state-of-the-art (SOTA) performance in some cases, even without fully encoding all input context.
These models encounter challenges in extractive question answering, where they lag behind the SOTA by over 10 points, and in mitigating the risk of injecting false information.
- Score: 1.9649272351760063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study delves into the capabilities and limitations of Large Language
Models (LLMs) in the challenging domain of conditional question-answering.
Utilizing the Conditional Question Answering (CQA) dataset and focusing on
generative models like T5 and UL2, we assess the performance of LLMs across
diverse question types. Our findings reveal that fine-tuned LLMs can surpass
the state-of-the-art (SOTA) performance in some cases, even without fully
encoding all input context, with an increase of 7-8 points in Exact Match (EM)
and F1 scores for Yes/No questions. However, these models encounter challenges
in extractive question answering, where they lag behind the SOTA by over 10
points, and in mitigating the risk of injecting false information. A study with
oracle-retrievers emphasizes the critical role of effective evidence retrieval,
underscoring the necessity for advanced solutions in this area. Furthermore, we
highlight the significant influence of evaluation metrics on performance
assessments and advocate for a more comprehensive evaluation framework. The
complexity of the task, the observed performance discrepancies, and the need
for effective evidence retrieval underline the ongoing challenges in this field
and underscore the need for future work focusing on refining training tasks and
exploring prompt-based techniques to enhance LLM performance in conditional
question-answering tasks.
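The Exact Match (EM) and F1 scores referenced in the abstract can be sketched as follows. This is a minimal SQuAD-style implementation for illustration; the normalization rules (lowercasing, stripping punctuation and articles) are common conventions and may differ in detail from the evaluation scripts actually used in the paper:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only verbatim agreement after normalization, while token-level F1 gives partial credit for overlapping tokens, which is one reason the choice of metric can shift performance assessments as the abstract notes.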
Related papers
- Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models [4.377568983107492]
Abstention Ability (AA) is the ability of Large Language Models (LLMs) to refrain from answering questions when they are uncertain or when a definitive answer is not possible.
We propose a black-box evaluation methodology to examine and understand the AA of LLMs across a variety of multiple-choice QA tasks.
Our findings reveal that while even state-of-the-art LLMs like GPT-4 struggle with abstention, strategic prompting can significantly enhance this ability.
arXiv Detail & Related papers (2024-07-23T06:56:54Z) - KaPQA: Knowledge-Augmented Product Question-Answering [59.096607961704656]
We introduce two product question-answering (QA) datasets focused on Adobe Acrobat and Photoshop products.
We also propose a novel knowledge-driven RAG-QA framework to enhance the performance of the models in the product QA task.
arXiv Detail & Related papers (2024-07-22T22:14:56Z) - DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs [3.24692739098077]
Open-domain complex Question Answering (QA) is a difficult task with challenges in evidence retrieval and reasoning.
We evaluate state-of-the-art pre-trained dense and sparse retrieval models in an open-domain setting.
We observe that late-interaction models and, surprisingly, lexical models like BM25 perform well compared to other pre-trained dense retrieval models.
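The strong showing of lexical retrieval noted above can be illustrated with a minimal Okapi BM25 sketch. Whitespace tokenization and the default k1/b values are simplifying assumptions here; this is not the retrieval setup used in the paper:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each whitespace-tokenized document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in tokenized:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b).
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores
```

Despite relying only on term overlap and corpus statistics, with no learned representations, this kind of scoring remains a competitive baseline against dense retrievers in open-domain settings.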
arXiv Detail & Related papers (2024-06-24T22:09:50Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - Enhancing Large Language Model Performance To Answer Questions and
Extract Information More Accurately [2.1715455600756646]
Large Language Models (LLMs) generate responses to questions.
Their effectiveness is often hindered by the sub-optimal quality of answers and occasional failures to provide accurate responses.
To address these challenges, a fine-tuning process is employed, involving feedback and examples to refine models.
arXiv Detail & Related papers (2024-01-27T00:18:07Z) - Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as the problems' release time, difficulty, and the types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and problem types.
arXiv Detail & Related papers (2023-12-04T18:58:57Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - Revisit Input Perturbation Problems for LLMs: A Unified Robustness
Evaluation Framework for Noisy Slot Filling Task [18.623619585980688]
We propose a unified robustness evaluation framework based on the slot-filling task to evaluate the dialogue understanding capability of large language models.
Specifically, we construct an input perturbation evaluation dataset, Noise-LLM, which contains five types of single-perturbation and four types of mixed-perturbation data.
Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios.
arXiv Detail & Related papers (2023-10-10T10:22:05Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z) - Investigating the Factual Knowledge Boundary of Large Language Models
with Retrieval Augmentation [91.30946119104111]
We show that large language models (LLMs) possess unwavering confidence in their capabilities to respond to questions.
Retrieval augmentation proves to be an effective approach in enhancing LLMs' awareness of knowledge boundaries.
We also find that LLMs have a propensity to rely on the provided retrieval results when formulating answers.
arXiv Detail & Related papers (2023-07-20T16:46:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.