Question: How do Large Language Models perform on the Question Answering tasks? Answer:
- URL: http://arxiv.org/abs/2412.12893v1
- Date: Tue, 17 Dec 2024 13:19:38 GMT
- Title: Question: How do Large Language Models perform on the Question Answering tasks? Answer:
- Authors: Kevin Fischer, Darren Fürst, Sebastian Steindl, Jakob Lindner, Ulrich Schäfer
- Abstract summary: Large Language Models (LLMs) have been showing promising results for various NLP-tasks without the explicit need to be trained for these tasks by using few-shot or zero-shot prompting techniques.
We propose a comprehensive performance comparison between smaller fine-tuned models and out-of-the-box instruction-following LLMs on the Stanford Question Answering Dataset 2.0 (SQuAD2).
Our results show that smaller, fine-tuned models outperform current State-Of-The-Art (SOTA) LLMs on the fine-tuned task, but recent SOTA models are able to close this gap on the out-of-distribution test and even outperform the fine-tuned models on 3 of the 5 tested QA datasets.
- Abstract: Large Language Models (LLMs) have been showing promising results for various NLP-tasks without the explicit need to be trained for these tasks by using few-shot or zero-shot prompting techniques. A common NLP-task is question-answering (QA). In this study, we propose a comprehensive performance comparison between smaller fine-tuned models and out-of-the-box instruction-following LLMs on the Stanford Question Answering Dataset 2.0 (SQuAD2), specifically when using a single-inference prompting technique. Since the dataset contains unanswerable questions, previous work used a double inference method. We propose a prompting style which aims to elicit the same ability without the need for double inference, saving compute time and resources. Furthermore, we investigate their generalization capabilities by comparing their performance on similar but different QA datasets, without fine-tuning either model, emulating real-world uses where the context and questions asked may differ from the original training distribution, for example swapping Wikipedia for news articles. Our results show that smaller, fine-tuned models outperform current State-Of-The-Art (SOTA) LLMs on the fine-tuned task, but recent SOTA models are able to close this gap on the out-of-distribution test and even outperform the fine-tuned models on 3 of the 5 tested QA datasets.
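The prompt itself is not reproduced in this listing, so the following is only a rough sketch of what a single-inference prompting style for SQuAD2 could look like: one instruction covers both the answerable and the unanswerable case, so no second "is this question answerable?" call is needed. The prompt wording, the `generate` placeholder, and the sentinel string are illustrative assumptions, not the paper's actual setup.

```python
# Sketch of a single-inference prompt for SQuAD2-style extractive QA.
# One instruction handles both answerable and unanswerable questions,
# avoiding the double-inference pattern used in previous work.
# `generate` is a placeholder for any instruction-following LLM call.

def build_prompt(context: str, question: str) -> str:
    return (
        "Answer the question using only the context below. "
        "Reply with the shortest exact span from the context that answers the question. "
        "If the context does not contain the answer, reply with exactly: unanswerable\n\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

def answer(generate, context: str, question: str) -> str:
    """Single LLM call; maps the sentinel to SQuAD2's empty no-answer string."""
    prediction = generate(build_prompt(context, question)).strip()
    return "" if prediction.lower() == "unanswerable" else prediction

if __name__ == "__main__":
    # Stub generator so the sketch runs without a model or API key.
    stub = lambda prompt: "unanswerable"
    print(repr(answer(stub, "The Eiffel Tower is in Paris.", "Who designed the tower?")))
```

SQuAD2 scores unanswerable questions against an empty gold answer, which is why the sentinel is mapped to an empty prediction here.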
Related papers
- One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering [31.025439143093585]
Vision-Language Models (VLMs) have shown significant promise in Visual Question Answering (VQA) tasks by leveraging web-scale multimodal datasets.
These models often struggle with continual learning due to catastrophic forgetting when adapting to new tasks.
We propose the first data-free method that leverages the language generation capability of a VLM, instead of relying on external models.
arXiv Detail & Related papers (2024-11-04T16:04:59Z)
- GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks [3.9638110494107095]
In-context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts.
We propose Example Gisting, a novel approach for training example encoders through supervised fine-tuning.
We show that our fine-tuned models get state-of-the-art ICL performance with over 20% absolute gain over off-the-shelf retrievers.
arXiv Detail & Related papers (2023-11-16T06:28:05Z)
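As a rough illustration of retrieval-based in-context example selection (the setting addressed above), the sketch below embeds a pool of training examples and prepends the nearest neighbours of a new query as demonstrations. The off-the-shelf `all-MiniLM-L6-v2` encoder and the toy example pool are assumptions standing in for the paper's fine-tuned gist encoder and real training data.

```python
# Sketch: pick in-context demonstrations by embedding similarity.
# An off-the-shelf sentence encoder stands in for a fine-tuned gist encoder.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder, not the paper's

train_pool = [  # toy candidate pool
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "When did World War II end?", "answer": "1945"},
]

def select_examples(query: str, k: int = 2):
    """Return the k pool items whose questions are most similar to the query."""
    pool_emb = encoder.encode([ex["question"] for ex in train_pool], convert_to_tensor=True)
    query_emb = encoder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, pool_emb)[0]
    return [train_pool[int(i)] for i in scores.argsort(descending=True)[:k]]

def build_icl_prompt(query: str) -> str:
    shots = "\n".join(f"Q: {ex['question']}\nA: {ex['answer']}" for ex in select_examples(query))
    return f"{shots}\nQ: {query}\nA:"

print(build_icl_prompt("Which playwright wrote Macbeth?"))
```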
- A Lightweight Method to Generate Unanswerable Questions in English [18.323248259867356]
We examine a simpler data augmentation method for unanswerable question generation in English.
We perform antonym and entity swaps on answerable questions.
Compared to the prior state-of-the-art, data generated with our training-free and lightweight strategy results in better models.
arXiv Detail & Related papers (2023-10-30T10:14:52Z)
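A minimal sketch of the swap idea, assuming a toy entity table and WordNet antonyms via NLTK (the paper's actual swap vocabularies and NER pipeline are not reproduced here): an answerable question is edited so it no longer matches its context and can then be labelled unanswerable.

```python
# Sketch: turn an answerable question into a (likely) unanswerable one
# by swapping an entity or replacing a word with its antonym.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# Toy entity table (assumption, for illustration): entity -> same-type replacement.
ENTITY_SWAPS = {"Paris": "Berlin", "Einstein": "Newton", "Amazon": "Nile"}

def antonym(word: str):
    """Return a WordNet antonym of `word`, or None."""
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            if lemma.antonyms():
                return lemma.antonyms()[0].name().replace("_", " ")
    return None

def make_unanswerable(question: str):
    """Apply the first entity or antonym swap that fires; None if nothing applies."""
    tokens = question.split()
    for i, tok in enumerate(tokens):
        word = tok.rstrip("?,.")
        swap = ENTITY_SWAPS.get(word) or antonym(word)
        if swap:
            tokens[i] = tok.replace(word, swap)
            return " ".join(tokens)
    return None

print(make_unanswerable("What is the largest river near Paris?"))
```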
- Chain-of-Skills: A Configurable Model for Open-domain Question Answering [79.8644260578301]
The retrieval model is an indispensable component for real-world knowledge-intensive tasks.
Recent work focuses on customized methods, limiting model transferability and scalability.
We propose a modular retriever where individual modules correspond to key skills that can be reused across datasets.
arXiv Detail & Related papers (2023-05-04T20:19:39Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Self-Prompting Large Language Models for Zero-Shot Open-Domain QA [67.08732962244301]
Open-Domain Question Answering (ODQA) aims to answer questions without explicitly providing background documents.
This task becomes notably challenging in a zero-shot setting where no data is available to train tailored retrieval-reader models.
We propose a Self-Prompting framework to explicitly utilize the massive knowledge encoded in the parameters of Large Language Models.
arXiv Detail & Related papers (2022-12-16T18:23:43Z)
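A minimal sketch of the two-step idea, with a placeholder `generate` callable and illustrative prompt wording rather than the paper's: the LLM first writes its own pseudo QA demonstrations from parametric knowledge, then answers the real question with those demonstrations prepended.

```python
# Sketch of self-prompting for zero-shot ODQA: the model generates its own
# pseudo demonstrations, then answers with them as in-context examples.
# `generate` is a placeholder for any LLM call.

def generate_pseudo_demos(generate, topic: str, n: int = 2) -> str:
    prompt = (
        f"Write {n} short factual question-answer pairs about {topic}, "
        "one per line, in the form 'Q: ... A: ...'."
    )
    return generate(prompt)

def self_prompted_answer(generate, question: str, topic: str) -> str:
    demos = generate_pseudo_demos(generate, topic)
    return generate(f"{demos}\nQ: {question}\nA:").strip()

if __name__ == "__main__":
    def stub(prompt: str) -> str:
        # Stand-in LLM so the sketch runs without a model or API key.
        if prompt.startswith("Write"):
            return "Q: Who painted the Mona Lisa? A: Leonardo da Vinci"
        return "Leonardo da Vinci"

    print(self_prompted_answer(stub, "Who painted the Mona Lisa?", "Renaissance art"))
```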
- Learning to Perturb Word Embeddings for Out-of-distribution QA [55.103586220757464]
We propose a simple yet effective DA method based on a noise generator, which learns to perturb the word embedding of the input questions and context without changing their semantics.
We validate the performance of QA models trained with our word embedding perturbation on a single source dataset across five different target domains.
Notably, the model trained with ours outperforms the model trained with more than 240K artificially generated QA pairs.
arXiv Detail & Related papers (2021-05-06T14:12:26Z)
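A minimal PyTorch sketch of the idea, with illustrative dimensions and norm bound rather than the paper's architecture or training objective: a small generator network produces a perturbation that is added to the input word embeddings, with its magnitude capped so the semantics stay roughly intact.

```python
# Sketch: embedding-level augmentation via a learned, norm-bounded perturbation.
import torch
import torch.nn as nn

class NoiseGenerator(nn.Module):
    def __init__(self, hidden: int, epsilon: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, hidden)
        )
        self.epsilon = epsilon  # illustrative bound on the perturbation size

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        noise = self.net(embeddings)
        # Cap the per-token perturbation norm so meaning is roughly preserved.
        noise = self.epsilon * noise / (noise.norm(dim=-1, keepdim=True) + 1e-8)
        return embeddings + noise

if __name__ == "__main__":
    emb = torch.randn(2, 16, 768)          # (batch, seq_len, hidden) toy embeddings
    augmented = NoiseGenerator(768)(emb)   # perturbed views for QA training
    print(augmented.shape)
```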
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [133.93803565077337]
Retrieval-Augmented Generation (RAG) models combine pre-trained parametric and non-parametric memory for language generation.
We show that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
arXiv Detail & Related papers (2020-05-22T21:34:34Z)
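A toy sketch of the retrieve-then-generate loop behind RAG, using TF-IDF retrieval and a stub generator in place of the dense retriever and pre-trained seq2seq generator used in the paper:

```python
# Sketch: retrieve relevant passages, then condition generation on them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [  # toy non-parametric memory
    "The Eiffel Tower was completed in 1889 in Paris.",
    "The Great Wall of China is thousands of kilometres long.",
    "Marie Curie won Nobel Prizes in physics and chemistry.",
]

vectorizer = TfidfVectorizer().fit(passages)
passage_matrix = vectorizer.transform(passages)

def retrieve(question: str, k: int = 1):
    """Return the k passages most similar to the question (TF-IDF for brevity)."""
    scores = cosine_similarity(vectorizer.transform([question]), passage_matrix)[0]
    return [passages[i] for i in scores.argsort()[::-1][:k]]

def rag_answer(generate, question: str) -> str:
    context = " ".join(retrieve(question))
    return generate(f"Context: {context}\nQuestion: {question}\nAnswer:")

stub = lambda prompt: "1889"  # placeholder generator so the sketch runs standalone
print(rag_answer(stub, "When was the Eiffel Tower completed?"))
```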
- ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.