In Search of the Long-Tail: Systematic Generation of Long-Tail
Inferential Knowledge via Logical Rule Guided Search
- URL: http://arxiv.org/abs/2311.07237v2
- Date: Tue, 27 Feb 2024 22:28:52 GMT
- Title: In Search of the Long-Tail: Systematic Generation of Long-Tail
Inferential Knowledge via Logical Rule Guided Search
- Authors: Huihan Li, Yuting Ning, Zeyi Liao, Siyuan Wang, Xiang Lorraine Li,
Ximing Lu, Wenting Zhao, Faeze Brahman, Yejin Choi, Xiang Ren
- Abstract summary: State-of-the-art LLMs outperform humans on reasoning tasks such as Natural Language Inference.
Recent works evaluating LLMs note a marked performance drop on input data from the low-probability distribution, i.e., the long-tail.
We propose a novel framework that generates factually correct and long-tail knowledge statements grounded on symbolic rule templates.
- Score: 69.59343233016517
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art LLMs outperform humans on reasoning tasks such as Natural
Language Inference. Recent works evaluating LLMs note a marked performance drop
on input data from the low-probability distribution, i.e., the long-tail.
Therefore, we focus on systematically generating statements involving long-tail
inferential knowledge for more effective evaluation of LLMs in the reasoning
space. We first propose a novel framework, Logic-Induced-Knowledge-Search
(LINK) that generates factually correct and long-tail knowledge statements
grounded on symbolic rule templates; LINK effectively generates data in the
long-tail distribution that zero-shot prompted LLMs are unable to reach, and
outperforms zero-shot GPT-4 on factual correctness by 5%. We further use the
data generated by LINK to construct a dataset Logic-Induced-Long-Tail (LINT)
that can be used to evaluate downstream models on the long-tail distribution;
LINT contains 108K knowledge statements spanning four domains. We use LINT to
test LLMs on an entailment classification task and find that model performance
drops by as much as 5% on the long-tail distribution compared to the head
distribution. Our work shows the utility of evaluating models in the long-tail
distribution, and calls for more research on generating evaluation data in the
long-tail distribution.
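As a rough, hypothetical sketch of the rule-guided search idea, the snippet below instantiates a symbolic rule template with candidate slot fillers and keeps only low-frequency (long-tail) combinations. The template, fillers, and frequency proxies are invented for illustration and are not part of LINK.

```python
# Minimal, hypothetical sketch of rule-guided statement search (not the
# LINK implementation). Template, fillers, and frequencies are invented.
from itertools import product

TEMPLATE = "If a person owns a {instrument}, then the person can produce {output}."

# Candidate slot fillers with corpus-frequency proxies; lower frequency
# means further into the long tail.
CANDIDATES = {
    "instrument": {"piano": 0.90, "theremin": 0.05, "glass armonica": 0.01},
    "output": {"music": 0.95, "sound": 0.60},
}

def search_long_tail(template, candidates, tail_threshold=0.1):
    """Instantiate the template and keep only long-tail combinations."""
    slots = list(candidates)
    for combo in product(*(candidates[s].items() for s in slots)):
        fillers = {slot: value for slot, (value, _) in zip(slots, combo)}
        rarity = min(freq for _, freq in combo)  # rarest filler dominates
        if rarity <= tail_threshold:
            # A verifier model would check factual correctness here.
            yield template.format(**fillers), rarity

for statement, rarity in search_long_tail(TEMPLATE, CANDIDATES):
    print(f"[freq~{rarity:.2f}] {statement}")
```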
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
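A hypothetical sketch of the self-synthesis loop described above: the student model is prompted for new inputs, then for matching outputs, and the resulting pairs become finetuning data. `complete` is a placeholder, not the paper's API.

```python
# Hypothetical sketch of self-synthetic pair generation (not the SELF-GUIDE
# code). `complete` is a placeholder for a call to the student LLM.
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in a call to the student LLM here")

def synthesize_pairs(task_description, seed_examples, n=100):
    """Ask the student model itself for new (input, output) pairs."""
    demos = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in seed_examples)
    pairs = []
    for _ in range(n):
        new_input = complete(f"{task_description}\n{demos}\nInput:").strip()
        new_output = complete(f"{task_description}\nInput: {new_input}\nOutput:").strip()
        if new_input and new_output:  # crude quality filter; the paper does more
            pairs.append((new_input, new_output))
    return pairs  # finetune the same student model on these pairs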
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
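The toy sketch below illustrates the perturbation idea on a drastically simplified "reasoning graph" (a chain of additions): resampling node values and lengthening the chain yields new test items of controlled complexity. This is an invented simplification, not the DARG method.

```python
# Toy sketch of perturbing an extracted reasoning graph to mint harder test
# items (not the DARG code). The "graph" is a simple chain of additions.
import random

def build_item(values):
    """Linearize a chain-of-additions reasoning graph into a test item."""
    steps, acc = [], values[0]
    for v in values[1:]:
        steps.append(f"{acc} + {v} = {acc + v}")
        acc += v
    question = " and ".join(str(v) for v in values)
    return {"question": f"What is the sum of {question}?",
            "steps": steps, "answer": str(acc)}

def perturb(values, rng, extra_nodes=1):
    """Raise complexity: resample node values and lengthen the chain."""
    return build_item([rng.randint(2, 99) for _ in range(len(values) + extra_nodes)])

rng = random.Random(0)
original = build_item([3, 4])                 # one reasoning step
harder = perturb([3, 4], rng, extra_nodes=2)  # same structure, three steps
```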
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- GOLD: Generalized Knowledge Distillation via Out-of-Distribution-Guided Language Data Generation [21.56082253577229]
GOLD is a task-agnostic data generation and knowledge distillation framework.
It employs an iterative out-of-distribution-guided feedback mechanism for the LLM.
An energy-based OOD evaluation approach is also introduced to deal with noisy generated data.
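For intuition on energy-based OOD filtering, the sketch below scores samples with the standard energy function over classifier logits and drops high-energy (likely OOD or noisy) ones. The logits, threshold, and filtering rule are illustrative assumptions, not GOLD's exact procedure.

```python
# Sketch of an energy-based OOD score for filtering noisy generated samples,
# in the spirit of GOLD's OOD evaluation (details differ from the paper).
import numpy as np

def energy_score(logits, temperature=1.0):
    """E(x) = -T * logsumexp(logits / T); higher energy suggests OOD."""
    z = np.asarray(logits, dtype=float) / temperature
    m = z.max()  # subtract the max for numerical stability
    return float(-temperature * (m + np.log(np.exp(z - m).sum())))

samples = {
    "confident": [4.1, 0.2, -1.3],    # peaked logits -> low energy, keep
    "uncertain": [-0.5, -0.4, -0.6],  # flat logits -> high energy, drop
}
kept = {k: v for k, v in samples.items() if energy_score(v) < -1.0}
print(kept)  # only the "confident" sample survives the filter
```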
arXiv Detail & Related papers (2024-03-28T18:08:22Z)
- Evaluating Large Language Models for Health-Related Text Classification Tasks with Public Social Media Data [3.9459077974367833]
Large language models (LLMs) have demonstrated remarkable success in NLP tasks.
We benchmarked one classic supervised machine learning model based on Support Vector Machines (SVMs), three supervised pretrained language models (PLMs) based on RoBERTa, BERTweet, and SocBERT, and two LLM-based classifiers (GPT-3.5 and GPT-4) across six text classification tasks.
Our comprehensive experiments demonstrate that employing data augmentation using LLMs (GPT-4) with relatively small human-annotated data to train lightweight supervised classification models achieves superior results compared to training with human-annotated data alone.
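A minimal sketch of that lightweight setup: TF-IDF features plus a linear SVM, trained on human-annotated texts mixed with LLM-generated augmentations. The texts and labels below are invented placeholders.

```python
# Minimal sketch of the "lightweight supervised classifier" setup: TF-IDF
# features plus a linear SVM, with hypothetical LLM-augmented training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

human_texts = ["I got my flu shot today", "my back pain is worse"]
human_labels = ["vaccine", "symptom"]
llm_texts = ["received the influenza vaccine this morning",
             "lower back aches badly"]  # e.g. GPT-4 paraphrases (invented)
llm_labels = ["vaccine", "symptom"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(human_texts + llm_texts, human_labels + llm_labels)
print(clf.predict(["just had my annual flu jab"]))
```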
arXiv Detail & Related papers (2024-03-27T22:05:10Z)
- A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models [65.86149763739141]
We introduce LogicAsker, an automatic approach that comprehensively evaluates and improves the logical reasoning abilities of LLMs.
We evaluate LogicAsker on six widely deployed LLMs, including GPT-3, ChatGPT, GPT-4, Bard, Vicuna, and Guanaco.
The results show that test cases from LogicAsker can find logical reasoning failures in different LLMs at rates ranging from 25% to 94%.
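The sketch below illustrates the kind of equivalence probe the title alludes to: a logically sound model should answer identically regardless of premise order. This is an invented probe in the same spirit, not LogicAsker itself; `ask` is a hypothetical placeholder for an LLM call.

```python
# Illustrative order-invariance probe in the spirit of "A & B == B & A"
# failures (not the LogicAsker code). `ask` is a hypothetical LLM call.
from itertools import permutations

def ask(question: str) -> str:
    raise NotImplementedError("plug in an LLM call here")

def order_invariance_test(premises, conclusion):
    """Answers should not change when the premise order is swapped."""
    answers = set()
    for order in permutations(premises):
        prompt = (f"Premises: {' and '.join(order)}. "
                  f"Does it follow that {conclusion}? Answer yes or no.")
        answers.add(ask(prompt).strip().lower())
    return len(answers) == 1  # False flags an order-sensitivity failure

# e.g. order_invariance_test(["it rains", "the ground is wet"], "it rains")
```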
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
- The Devil is in the Tails: How Long-Tailed Code Distributions Impact Large Language Models [15.462819541662752]
Learning-based models, including popular Large Language Models for code, heavily rely on data.
The long-tailed distribution of code data has a substantial impact on the effectiveness of LLMs for code.
Our study provides a better understanding of the effects of long-tailed distributions on popular LLMs for code.
arXiv Detail & Related papers (2023-09-07T08:53:16Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its intrinsic generation capability.
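In essence, IFD compares the model's loss on an answer with and without the instruction as context; a ratio near 1 means the instruction barely helps, flagging a hard, valuable "cherry" sample. A minimal sketch of that ratio, with `answer_loss` as a hypothetical helper around a causal LM:

```python
# Sketch of the Instruction-Following Difficulty (IFD) ratio. `answer_loss`
# is a hypothetical helper; plug in a real causal-LM scoring routine.
def answer_loss(context: str, answer: str) -> float:
    """Average per-token negative log-likelihood of `answer` given `context`."""
    raise NotImplementedError("score with your causal LM here")

def ifd(instruction: str, answer: str) -> float:
    conditioned = answer_loss(instruction, answer)  # loss with instruction
    direct = answer_loss("", answer)                # loss on answer alone
    return conditioned / direct

# Keep samples whose IFD exceeds a chosen threshold, e.g.:
# cherries = [ex for ex in data if ifd(ex["instruction"], ex["answer"]) > 0.9]
```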
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
- Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)