A Simple and Effective Framework for Strict Zero-Shot Hierarchical
Classification
- URL: http://arxiv.org/abs/2305.15282v2
- Date: Fri, 26 May 2023 13:57:52 GMT
- Title: A Simple and Effective Framework for Strict Zero-Shot Hierarchical
Classification
- Authors: Rohan Bhambhoria, Lei Chen, Xiaodan Zhu
- Abstract summary: Large language models (LLMs) have achieved strong performance on benchmark tasks, especially in zero or few-shot settings.
We propose a more indicative long-tail prediction task for hierarchical datasets.
Our method does not require any parameter updates (a resource-intensive process) and achieves strong performance across multiple datasets.
- Score: 23.109264015761873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, large language models (LLMs) have achieved strong
performance on benchmark tasks, especially in zero or few-shot settings.
However, these benchmarks often do not adequately address the challenges posed
in the real world, such as that of hierarchical classification. In order to
address this challenge, we propose refactoring conventional tasks on
hierarchical datasets into a more indicative long-tail prediction task. We
observe LLMs are more prone to failure in these cases. To address these
limitations, we propose the use of entailment-contradiction prediction in
conjunction with LLMs, which allows for strong performance in a strict
zero-shot setting. Importantly, our method does not require any parameter
updates (a resource-intensive process) and achieves strong performance across
multiple datasets.
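For concreteness, below is a minimal sketch of entailment-based strict zero-shot classification over a label hierarchy, using an off-the-shelf NLI model via the Hugging Face zero-shot-classification pipeline. The model choice, the toy hierarchy, and the two-stage top-down traversal are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# A minimal sketch of entailment-based strict zero-shot hierarchical
# classification. The NLI model, the toy hierarchy, and the top-down
# traversal are illustrative assumptions, not the paper's exact setup.
from transformers import pipeline

# An off-the-shelf NLI model scores each candidate label as an entailment
# hypothesis against the input text; no parameter updates are needed.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

# Hypothetical two-level hierarchy: parent class -> child classes.
HIERARCHY = {
    "electronics": ["laptops", "cameras", "headphones"],
    "clothing": ["shoes", "jackets", "dresses"],
}

def classify(text: str) -> tuple[str, str]:
    """Predict a (parent, child) pair by traversing the hierarchy top-down."""
    # Stage 1: pick the most entailed parent class.
    parent = classifier(text, candidate_labels=list(HIERARCHY))["labels"][0]
    # Stage 2: restrict candidates to that parent's children, so rare
    # (long-tail) leaf labels only compete against their siblings.
    child = classifier(text, candidate_labels=HIERARCHY[parent])["labels"][0]
    return parent, child

print(classify("Lightweight mirrorless body with a 24-megapixel sensor."))
```

One plausible reason a traversal like this helps on long-tail labels is that every entailment query involves only a handful of sibling candidates, rather than the full, heavily imbalanced label set.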
Related papers
- HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly [34.205934899868346]
We present HELMET, a comprehensive benchmark encompassing seven diverse, application-centric categories.
We find that synthetic tasks like NIAH (needle-in-a-haystack) are not good predictors of downstream performance.
While most LCLMs (long-context language models) achieve perfect NIAH scores, open-source models significantly lag behind closed ones when the task requires full-context reasoning.
arXiv Detail & Related papers (2024-10-03T17:20:11Z)
- Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels [75.77877889764073]
Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels.
This study explores whether solely utilizing unlabeled data can elicit strong model capabilities.
We propose a new paradigm termed zero-to-strong generalization.
arXiv Detail & Related papers (2024-09-19T02:59:44Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance (a minimal worst-case scoring sketch appears after this list).
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
- Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
- Task Contamination: Language Models May Not Be Few-Shot Anymore [9.696290050028237]
Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks.
However, their success in zero-shot and few-shot settings may be affected by task contamination.
This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over time.
arXiv Detail & Related papers (2023-12-26T21:17:46Z)
- FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a multi-level, fine-grained constraint-following benchmark for large language models.
We introduce a multi-level mechanism that incrementally adds a single constraint to the initial instruction at each level (a small construction sketch appears after this list).
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Aligning Instruction Tasks Unlocks Large Language Models as Zero-Shot Relation Extractors [11.28397947587596]
Fine-tuning large language models (LLMs) on large-scale instruction-following datasets substantially improves their performance on a wide range of NLP tasks.
However, even advanced instruction-tuned LLMs still fail to outperform small LMs on relation extraction (RE).
We propose QA4RE, a framework that aligns RE with question answering (QA), a predominant task in instruction-tuning datasets (a prompt-construction sketch appears after this list).
arXiv Detail & Related papers (2023-05-18T17:48:03Z)
- Towards Practical Few-Shot Query Sets: Transductive Minimum Description Length Inference [0.0]
We introduce a PrimAl Dual Minimum Description LEngth (PADDLE) formulation, which balances data-fitting accuracy and model complexity for a given few-shot task.
Our constrained MDL-like objective promotes competition among a large set of possible classes, preserving only the effective classes that best fit the data of a few-shot task (a schematic form of such an objective appears after this list).
arXiv Detail & Related papers (2022-10-26T08:06:57Z)
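As noted in the RobustAlpacaEval entry above, here is a hedged sketch of worst-case scoring over paraphrase sets; the function name and data layout are hypothetical, not the benchmark's actual interface.

```python
# Hypothetical worst-case scoring over paraphrase sets, in the spirit of
# RobustAlpacaEval: each query is judged by its weakest phrasing.
def worst_prompt_score(scores_per_query: dict[str, list[float]]) -> float:
    """Average, over queries, of the minimum score across paraphrases."""
    worst = [min(scores) for scores in scores_per_query.values()]
    return sum(worst) / len(worst)

# Toy usage: two queries, each with three semantically equivalent phrasings.
print(worst_prompt_score({
    "q1": [0.9, 0.7, 0.4],  # weakest paraphrase scores 0.4
    "q2": [0.8, 0.8, 0.6],  # weakest paraphrase scores 0.6
}))  # -> 0.5
```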
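For the FollowBench entry, the sketch below illustrates the multi-level idea: the level-k variant of an instruction is the initial instruction plus its first k constraints. The example instruction and constraints are invented for illustration.

```python
# Illustrative multi-level instruction construction in the spirit of
# FollowBench: level k appends the first k constraints to the instruction.
def build_levels(instruction: str, constraints: list[str]) -> list[str]:
    return [
        " ".join([instruction] + constraints[:k])
        for k in range(len(constraints) + 1)
    ]

levels = build_levels(
    "Write a product description for a laptop.",
    ["Keep it under 50 words.",
     "Mention battery life.",
     "Use a formal tone."],
)
for k, prompt in enumerate(levels):
    print(f"Level {k}: {prompt}")
```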
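For the QA4RE entry, a sketch of the core reformulation: a relation-extraction instance becomes a multiple-choice question whose options verbalize the candidate relations. The templates and relation labels here are hypothetical, not the paper's actual inventory.

```python
# Hedged sketch of the QA4RE idea: recast relation extraction (RE) as
# multiple-choice question answering. Templates and labels are hypothetical.
RELATION_TEMPLATES = {
    "founded_by": "{head} was founded by {tail}.",
    "employee_of": "{head} is an employee of {tail}.",
    "no_relation": "{head} has no known relation to {tail}.",
}

def build_qa4re_prompt(sentence: str, head: str, tail: str) -> str:
    """Turn an RE instance into a multiple-choice QA prompt for an LLM."""
    options = [
        f"{chr(ord('A') + i)}. {tpl.format(head=head, tail=tail)}"
        for i, tpl in enumerate(RELATION_TEMPLATES.values())
    ]
    return (f"Sentence: {sentence}\n"
            "Which of the following is implied by the sentence?\n"
            + "\n".join(options) + "\nAnswer:")

print(build_qa4re_prompt(
    "SpaceX was started by Elon Musk in 2002.", "SpaceX", "Elon Musk"))
```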
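Finally, for the PADDLE entry, the summary describes a trade-off between data fit and model complexity; a schematic MDL-style objective of that general shape (not the paper's exact primal-dual formulation) is:

```latex
% Schematic MDL-style objective: fit the few-shot support data while
% penalizing model complexity; \lambda controls the trade-off. This is a
% generic form, not PADDLE's exact primal-dual formulation.
\min_{\theta}\;
  \underbrace{\mathcal{L}_{\mathrm{fit}}(\theta;\mathcal{D})}_{\text{data-fitting accuracy}}
  \;+\;
  \lambda\,\underbrace{\mathcal{C}(\theta)}_{\text{model complexity}}
```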