Defining Boundaries: A Spectrum of Task Feasibility for Large Language Models
- URL: http://arxiv.org/abs/2408.05873v2
- Date: Mon, 14 Oct 2024 20:31:45 GMT
- Title: Defining Boundaries: A Spectrum of Task Feasibility for Large Language Models
- Authors: Wenbo Zhang, Zihang Xu, Hengrui Cai
- Abstract summary: Large language models (LLMs) have shown remarkable performance in various tasks but often fail to handle queries that exceed their knowledge and capabilities.
This paper addresses the need for LLMs to recognize and refuse infeasible tasks whose required skills exceed their capabilities.
- Score: 6.008311204104302
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown remarkable performance in various tasks but often fail to handle queries that exceed their knowledge and capabilities, leading to incorrect or fabricated responses. This paper addresses the need for LLMs to recognize and refuse infeasible tasks whose required skills surpass their capabilities. We first conceptualize infeasible tasks for LLMs and provide categorizations that cover a spectrum of related hallucinations identified in the existing literature. We then develop and benchmark a new dataset comprising diverse infeasible and feasible tasks to evaluate multiple LLMs' abilities to reject infeasible tasks. Furthermore, we explore the potential of increasing LLMs' refusal capabilities through fine-tuning. Experiments validate the effectiveness of our trained models, offering promising directions for refining the operational boundaries of LLMs in real applications.
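The abstract gives no implementation details; as a minimal illustration of what refusal-oriented fine-tuning data could look like, the sketch below assembles chat-style SFT records that pair infeasible prompts with refusals and feasible prompts with answers. The task examples, refusal template, and file name are hypothetical, not the paper's actual dataset.

```python
# Minimal sketch: supervised fine-tuning examples that teach a model to
# refuse infeasible tasks while still answering feasible ones.
# Categories and refusal template are illustrative assumptions.

import json

REFUSAL = "I'm sorry, but this task exceeds my capabilities as a text-based model."

examples = [
    # Infeasible: requires real-time, physical-world access.
    {"prompt": "What is the current temperature outside my window?",
     "response": REFUSAL},
    # Feasible: answerable from parametric knowledge.
    {"prompt": "Explain the difference between supervised and unsupervised learning.",
     "response": "Supervised learning uses labeled data..."},
]

with open("refusal_sft.jsonl", "w") as f:
    for ex in examples:
        # Standard chat-style SFT record: user turn -> assistant turn.
        record = {"messages": [
            {"role": "user", "content": ex["prompt"]},
            {"role": "assistant", "content": ex["response"]},
        ]}
        f.write(json.dumps(record) + "\n")
```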
Related papers
- The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities [51.594836904623534]
We investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples.
We show that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts.
We extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve.
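As a rough sketch of the correlation analysis this summary describes, one could compare per-task scores of a base model prompted with in-context examples against those of its instruction-tuned counterpart. The scores below are placeholders, not results from the paper.

```python
# Correlate an instruction-tuned model's per-task scores with the
# in-context (few-shot) scores of its base model.

from scipy.stats import spearmanr

base_icl_scores = [0.62, 0.48, 0.71, 0.55, 0.80]  # base model, few-shot prompted
instruct_scores = [0.68, 0.50, 0.75, 0.53, 0.84]  # instruction-tuned counterpart

rho, p_value = spearmanr(base_icl_scores, instruct_scores)
print(f"Spearman rho={rho:.3f}, p={p_value:.3f}")
```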
arXiv Detail & Related papers (2025-01-15T10:57:55Z)
- Multi-Task Learning with LLMs for Implicit Sentiment Analysis: Data-level and Task-level Automatic Weight Learning [18.836998294161834]
Implicit sentiment analysis presents significant challenges due to the absence of salient cue words.
We introduce MT-ISA, a novel multi-task learning (MTL) framework that enhances implicit sentiment analysis (ISA) by leveraging the generation and reasoning capabilities of large language models.
We introduce data-level and task-level automatic weight learning (AWL), which dynamically identifies relationships and prioritizes more reliable data and critical tasks.
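The summary does not specify MT-ISA's exact AWL formulation; one common way to realize task-level automatic weight learning is homoscedastic-uncertainty weighting, sketched below under that assumption.

```python
# A common instantiation of task-level automatic weight learning
# (homoscedastic-uncertainty weighting); MT-ISA's formulation may differ.

import torch
import torch.nn as nn

class AutoWeightedLoss(nn.Module):
    """Combines per-task losses with learnable weights.

    total = sum_i( exp(-s_i) * loss_i + s_i ), where s_i = log(sigma_i^2)
    is learned jointly with the model, so noisier tasks are down-weighted.
    """
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses: list[torch.Tensor]) -> torch.Tensor:
        total = torch.zeros((), device=losses[0].device)
        for i, loss in enumerate(losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

# Usage: optimize the model parameters and the loss weights together.
awl = AutoWeightedLoss(num_tasks=2)
l1, l2 = torch.tensor(0.9), torch.tensor(0.4)  # placeholder task losses
print(awl([l1, l2]))
```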
arXiv Detail & Related papers (2024-12-12T08:15:16Z)
- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making [85.24399869971236]
We aim to evaluate Large Language Models (LLMs) for embodied decision making.
Existing evaluations tend to rely solely on a final success rate.
We propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks.
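As an illustration of what such a generalized interface might look like in code, the sketch below types each capability as a separate method so that per-module errors can be scored rather than only end-to-end success. The method names and dummy agent are assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of a task-formalization interface for embodied agents.

from abc import ABC, abstractmethod

class EmbodiedAgentInterface(ABC):
    @abstractmethod
    def interpret_goal(self, instruction: str) -> list[str]:
        """Map a natural-language instruction to formal goal conditions."""

    @abstractmethod
    def sequence_actions(self, goals: list[str], state: dict) -> list[str]:
        """Produce an executable action plan for the given goals."""

class DummyAgent(EmbodiedAgentInterface):
    def interpret_goal(self, instruction):
        return [f"holds(agent, {instruction.split()[-1]})"]

    def sequence_actions(self, goals, state):
        return ["walk_to(object)", "grasp(object)"]

agent = DummyAgent()
goals = agent.interpret_goal("pick up the mug")
print(goals, agent.sequence_actions(goals, state={}))
```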
arXiv Detail & Related papers (2024-10-09T17:59:00Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
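The summary gives no algorithmic detail; the sketch below only illustrates the general idea of deliberative decoding as best-first search over partial reasoning paths ranked by a heuristic value. The stub functions stand in for the LLM proposal step and the learned value model, and are not the paper's actual Q* algorithm.

```python
# Best-first search over partial reasoning paths, guided by a heuristic
# Q-value. All components below are illustrative stubs.

import heapq

def q_value(path: tuple[str, ...]) -> float:
    """Stub utility estimate; Q* would use a learned value model."""
    return -len(path)  # placeholder: prefer shorter paths

def propose_steps(path):
    """Stub generator of candidate next steps (an LLM in practice)."""
    return [f"step{len(path) + 1}a", f"step{len(path) + 1}b"]

def is_terminal(path):
    return len(path) >= 3

def best_first_decode(max_expansions: int = 50):
    frontier = [(-q_value(()), ())]  # max-heap via negated scores
    for _ in range(max_expansions):
        if not frontier:
            break
        _, path = heapq.heappop(frontier)
        if is_terminal(path):
            return path
        for step in propose_steps(path):
            new_path = path + (step,)
            heapq.heappush(frontier, (-q_value(new_path), new_path))
    return None

print(best_first_decode())
```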
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context of up to millions of tokens, designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
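As a minimal sketch of in-context retrieval in the spirit of this benchmark: place the whole (small) corpus in the prompt and ask the model to cite the passage that answers the question. `call_llm` is a hypothetical stand-in for any long-context model API.

```python
# Retrieval without a retriever: the corpus lives entirely in the prompt.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a long-context model here")

corpus = {
    "doc1": "The Eiffel Tower is located in Paris.",
    "doc2": "Mount Fuji is the highest mountain in Japan.",
}

def in_context_retrieve(question: str) -> str:
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in corpus.items())
    prompt = (
        f"Corpus:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer with the id of the passage that contains the answer."
    )
    return call_llm(prompt)

# in_context_retrieve("Where is the Eiffel Tower?")  # expected: "doc1"
```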
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capabilities on various tasks, and integrating these capabilities into Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
In this study, we propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting.
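The summary does not include system code; the sketch below shows one common way to query a locally hosted open-source LLM through an OpenAI-compatible endpoint (as served by, e.g., vLLM or Ollama). The host, port, and model name are assumptions, not the paper's actual GIoT architecture.

```python
# Querying a locally served open-source LLM via an OpenAI-compatible API.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3-8b-instruct",  # hypothetical local model name
    messages=[
        {"role": "system", "content": "You analyze IoT sensor readings."},
        {"role": "user", "content": "Summarize: temp=72F, humidity=40%."},
    ],
)
print(response.choices[0].message.content)
```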
arXiv Detail & Related papers (2024-06-14T19:24:00Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
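As a minimal sketch of the kind of evaluation a continual-learning benchmark implies: train on tasks sequentially and re-score all earlier tasks after each stage, so that score drops on earlier tasks reveal forgetting. `train_on` and `evaluate` are hypothetical placeholders for actual fine-tuning and benchmark scoring.

```python
# Sequential training with re-evaluation of all previously seen tasks.

def train_on(model, task):            # placeholder fine-tuning step
    return model

def evaluate(model, task) -> float:   # placeholder benchmark score
    return 0.0

def continual_eval(model, tasks):
    history = []
    for i, task in enumerate(tasks):
        model = train_on(model, task)
        # Score every task seen so far; drops on earlier tasks indicate forgetting.
        history.append({t: evaluate(model, t) for t in tasks[: i + 1]})
    return history

print(continual_eval(model=None, tasks=["code", "math", "safety"]))
```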
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.