Test-Time Self-Adaptive Small Language Models for Question Answering
- URL: http://arxiv.org/abs/2310.13307v1
- Date: Fri, 20 Oct 2023 06:49:32 GMT
- Title: Test-Time Self-Adaptive Small Language Models for Question Answering
- Authors: Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, Jong C. Park
- Abstract summary: We investigate the capabilities of smaller self-adaptive LMs using only unlabeled test data.
Our proposed self-adaptation strategy demonstrates significant performance improvements on benchmark QA datasets.
- Score: 63.91013329169796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent instruction-finetuned large language models (LMs) have achieved
notable performance on various tasks, such as question answering (QA).
However, despite their ability to memorize a vast amount of general knowledge
across diverse tasks, they might be suboptimal on specific tasks due to their
limited capacity to transfer and adapt knowledge to target tasks. Moreover,
further finetuning LMs on labeled target datasets is often infeasible because such
labels are unavailable, and it is also questionable whether smaller LMs, which hold
more limited knowledge, can be adapted with unlabeled test data alone. In this work,
we show and investigate the capabilities of smaller self-adaptive LMs that rely only
on unlabeled test data. In particular, we first stochastically generate multiple answers,
and then ensemble them while filtering out low-quality samples to mitigate
noise from inaccurate labels. Our proposed self-adaptation strategy demonstrates
significant performance improvements on benchmark QA datasets, along with higher
robustness that keeps the LMs stable across diverse prompts. Code is
available at: https://github.com/starsuzi/T-SAS.
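To make the described procedure concrete, below is a minimal sketch of test-time self-adaptation: sample several answers with stochastic decoding, keep only questions whose samples agree strongly enough (a simple proxy for filtering out low-quality pseudo-labels), and take the majority answer as the pseudo-label. The model name, temperature, sample count, and agreement threshold are illustrative assumptions, not the authors' exact configuration (see the linked repository for that).

```python
# Minimal sketch of test-time self-adaptation for a small QA model:
# stochastic answer sampling, agreement-based filtering, and majority-vote
# ensembling into a pseudo-label. Model name and hyperparameters are
# illustrative assumptions, not the authors' exact configuration.
from collections import Counter

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "google/flan-t5-base"  # assumed small instruction-tuned LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)


def sample_answers(question: str, n_samples: int = 8, temperature: float = 0.7):
    """Stochastically generate multiple candidate answers for one question."""
    inputs = tokenizer(question, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        num_return_sequences=n_samples,
        max_new_tokens=32,
    )
    return [tokenizer.decode(o, skip_special_tokens=True).strip() for o in outputs]


def pseudo_label(question: str, min_agreement: float = 0.5):
    """Ensemble sampled answers by majority vote; drop low-agreement questions."""
    answers = sample_answers(question)
    answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) < min_agreement:
        return None  # too noisy to trust as a pseudo-label
    return answer


print(pseudo_label("Question: What is the capital of France? Answer:"))
```

Pseudo-labels produced this way could then be used to further finetune the small LM on the unlabeled test set before producing final predictions.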
Related papers
- One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering [31.025439143093585]
Vision-Language Models (VLMs) have shown significant promise in Visual Question Answering (VQA) tasks by leveraging web-scale multimodal datasets.
These models often struggle with continual learning due to catastrophic forgetting when adapting to new tasks.
We propose the first data-free method that leverages the language generation capability of a VLM, instead of relying on external models.
arXiv Detail & Related papers (2024-11-04T16:04:59Z)
- Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels [75.77877889764073]
Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels.
This study explores whether solely utilizing unlabeled data can elicit strong model capabilities.
We propose a new paradigm termed zero-to-strong generalization.
arXiv Detail & Related papers (2024-09-19T02:59:44Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion: adaptively setting the label-smoothing value during training according to the uncertainty of individual samples (see the sketch after this list).
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Elephants Never Forget: Testing Language Models for Memorization of Tabular Data [21.912611415307644]
Large Language Models (LLMs) can be applied to a diverse set of tasks, but the critical issues of data contamination and memorization are often glossed over.
We introduce a variety of techniques to assess the degree of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization.
arXiv Detail & Related papers (2024-03-11T12:07:13Z)
- PEDANTS: Cheap but Effective and Interpretable Answer Equivalence [10.367359022491181]
We provide rubrics and datasets for evaluating machine QA adopted from the Trivia community.
We also propose an efficient and interpretable QA evaluation method that is more stable than exact match and neural methods (BERTScore).
arXiv Detail & Related papers (2024-02-17T01:56:19Z)
- QActor: On-line Active Learning for Noisy Labeled Stream Data [10.814099534254922]
We propose QActor, which combines the selection of supposedly clean samples via quality models with active querying of an oracle for the most informative true labels.
QActor swiftly combines the merits of quality models for data filtering and oracle queries for cleaning the most informative data.
A central feature of QActor is to dynamically adjust the query limit according to the learning loss for each data batch.
arXiv Detail & Related papers (2020-01-28T15:13:21Z)
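For the uncertainty-aware learning entry above, the idea of setting the label-smoothing value per sample from model uncertainty can be sketched as follows. The entropy-based uncertainty measure and the linear mapping from uncertainty to a smoothing value are assumptions made here for illustration, not the UAL paper's exact formulation.

```python
# Minimal sketch of uncertainty-aware label smoothing: each training sample's
# smoothing strength grows with the model's predictive uncertainty (here,
# normalized entropy). The entropy-based uncertainty and the linear mapping
# to a smoothing value are illustrative assumptions, not UAL's exact rule.
import torch
import torch.nn.functional as F


def uncertainty_aware_loss(logits: torch.Tensor,
                           targets: torch.Tensor,
                           max_smoothing: float = 0.2) -> torch.Tensor:
    """Cross-entropy with per-sample label smoothing set from predictive entropy."""
    num_classes = logits.size(-1)
    probs = logits.softmax(dim=-1)
    # Normalized entropy in [0, 1] as a simple per-sample uncertainty proxy.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    uncertainty = entropy / torch.log(torch.tensor(float(num_classes)))
    smoothing = max_smoothing * uncertainty  # more uncertain -> more smoothing

    one_hot = F.one_hot(targets, num_classes).float()
    soft_targets = (one_hot * (1.0 - smoothing.unsqueeze(-1))
                    + smoothing.unsqueeze(-1) / num_classes)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()


# Example: a batch of 4 samples over 10 classes.
loss = uncertainty_aware_loss(torch.randn(4, 10), torch.randint(0, 10, (4,)))
print(loss.item())
```

Normalizing the entropy keeps the per-sample uncertainty in [0, 1], so the smoothing strength stays bounded by max_smoothing regardless of the number of classes.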