The Shifted and The Overlooked: A Task-oriented Investigation of
User-GPT Interactions
- URL: http://arxiv.org/abs/2310.12418v1
- Date: Thu, 19 Oct 2023 02:12:17 GMT
- Title: The Shifted and The Overlooked: A Task-oriented Investigation of
User-GPT Interactions
- Authors: Siru Ouyang, Shuohang Wang, Yang Liu, Ming Zhong, Yizhu Jiao, Dan
Iter, Reid Pryzant, Chenguang Zhu, Heng Ji, Jiawei Han
- Abstract summary: We analyze a large-scale collection of real user queries to GPT.
We find that tasks such as ``design'' and ``planning'' are prevalent in user interactions but are largely neglected by, or differ from, traditional NLP benchmarks.
- Score: 114.67699010359637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in Large Language Models (LLMs) has produced models that
exhibit remarkable performance across a variety of NLP tasks. However, it
remains unclear whether the existing focus of NLP research accurately captures
the genuine requirements of human users. This paper provides a comprehensive
analysis of the divergence between current NLP research and the needs of
real-world NLP applications via a large-scale collection of user-GPT
conversations. We analyze these real user queries to GPT, compare them
against existing NLP benchmark tasks, and identify a
significant gap between the tasks that users frequently request from LLMs and
the tasks that are commonly studied in academic research. For example, we find
that tasks such as ``design'' and ``planning'' are prevalent in user
interactions but are largely neglected by, or differ from, traditional NLP
benchmarks. We investigate these overlooked tasks, dissect the practical
challenges they pose, and provide insights toward a roadmap to make LLMs better
aligned with user needs.
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Hidden Question Representations Tell Non-Factuality Within and Across Large Language Models [34.985758097434946]
This work studies non-factuality prediction (NFP).
NFP predicts whether an LLM will generate non-factual responses to a question before the generation process.
We propose a question-aligned strategy to ensure the efficacy of mini-batch based training.
arXiv Detail & Related papers (2024-06-08T02:59:52Z)
- Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications [37.047117782796064]
AgentEval is a framework designed to simplify the utility verification process.
We present a comprehensive analysis of the robustness of the quantifier's assessments.
arXiv Detail & Related papers (2024-02-14T08:46:15Z)
- CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks [29.35269979211728]
We present CRoW, a benchmark that evaluates the ability of models to apply commonsense reasoning in the context of six real-world NLP tasks.
We use CRoW to study how NLP systems perform across different dimensions of commonsense knowledge, such as physical, temporal, and social reasoning.
We find a significant performance gap when NLP systems are evaluated on CRoW compared to humans, showcasing that commonsense reasoning is far from being solved in real-world task settings.
arXiv Detail & Related papers (2023-10-23T18:00:23Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Pushing the Limits of ChatGPT on NLP Tasks [79.17291002710517]
Despite the success of ChatGPT, its performance on most NLP tasks is still well below that of supervised baselines.
In this work, we investigate the causes of this subpar performance and identify several contributing factors.
We propose a collection of general modules to address these issues, in an attempt to push the limits of ChatGPT on NLP tasks.
arXiv Detail & Related papers (2023-06-16T09:40:05Z)
- TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks [2.822851601000061]
The paper proposes a general taxonomy that can be used to design prompts with specific properties in order to perform a wide range of complex tasks.
This taxonomy will allow future benchmarking studies to report the specific categories of prompts used as part of the study.
arXiv Detail & Related papers (2023-05-19T04:59:34Z)
- Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond [48.70557995528463]
This guide aims to provide researchers and practitioners with valuable insights and best practices for working with Large Language Models.
We present various use cases and non-use cases to illustrate the practical applications and limitations of LLMs in real-world scenarios.
arXiv Detail & Related papers (2023-04-26T17:52:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.