The Shifted and The Overlooked: A Task-oriented Investigation of
User-GPT Interactions
- URL: http://arxiv.org/abs/2310.12418v1
- Date: Thu, 19 Oct 2023 02:12:17 GMT
- Title: The Shifted and The Overlooked: A Task-oriented Investigation of
User-GPT Interactions
- Authors: Siru Ouyang, Shuohang Wang, Yang Liu, Ming Zhong, Yizhu Jiao, Dan
Iter, Reid Pryzant, Chenguang Zhu, Heng Ji, Jiawei Han
- Abstract summary: We analyze a large-scale collection of real user queries to GPT.
We find that tasks such as "design" and "planning" are prevalent in user interactions but are largely neglected by, or differ from, traditional NLP benchmarks.
- Score: 114.67699010359637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in Large Language Models (LLMs) has produced models that
exhibit remarkable performance across a variety of NLP tasks. However, it
remains unclear whether the existing focus of NLP research accurately captures
the genuine requirements of human users. This paper provides a comprehensive
analysis of the divergence between current NLP research and the needs of
real-world NLP applications, based on a large-scale collection of real user
queries to GPT. We compare these queries against existing NLP benchmark tasks
and identify a
significant gap between the tasks that users frequently request from LLMs and
the tasks that are commonly studied in academic research. For example, we find
that tasks such as "design" and "planning" are prevalent in user
interactions but are largely neglected by, or differ from, traditional NLP
benchmarks. We investigate these overlooked tasks, dissect the practical
challenges they pose, and provide insights toward a roadmap to make LLMs better
aligned with user needs.
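As a concrete illustration of the kind of analysis the abstract describes, the sketch below labels each user query with a task type and compares the resulting distribution against the tasks covered by common benchmarks. This is a minimal sketch, not the authors' pipeline: the task labels, the keyword-based classifier, and the benchmark set are all illustrative assumptions (the paper's actual annotation is far more sophisticated).

```python
# Minimal sketch (not the authors' code): label user queries with a task
# type, then check which task types common NLP benchmarks cover.
from collections import Counter

# Illustrative assumption: a toy set of benchmark-covered task types.
BENCHMARK_TASKS = {"summarization", "question answering", "translation"}

def label_task(query: str) -> str:
    """Toy keyword classifier; in practice an LLM or trained classifier
    would assign task labels to queries."""
    q = query.lower()
    if "design" in q:
        return "design"
    if "plan" in q:
        return "planning"
    if "summarize" in q:
        return "summarization"
    return "other"

queries = [
    "Design a logo for my bakery",
    "Plan a 3-day trip to Kyoto",
    "Summarize this article for me",
]

for task, count in Counter(map(label_task, queries)).most_common():
    status = "covered" if task in BENCHMARK_TASKS else "NOT covered"
    print(f"{task}: {count} queries ({status} by benchmarks)")
```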
Related papers
- WILT: A Multi-Turn, Memorization-Robust Inductive Logic Benchmark for LLMs [0.8883751685905831]
We introduce the Wason Inductive Logic Test (WILT), a simple yet challenging multi-turn reasoning benchmark designed to resist memorization.
Our findings reveal that LLMs struggle with this task, with different models exhibiting distinct strengths and weaknesses.
Despite these variations, the best-performing model achieves only 28% accuracy, highlighting a significant gap in LLM performance on complex multi-turn reasoning tasks.
arXiv Detail & Related papers (2024-10-14T18:29:13Z)
- Towards Boosting LLMs-driven Relevance Modeling with Progressive Retrieved Behavior-augmented Prompting [23.61061000692023]
This study proposes leveraging user interactions recorded in search logs to yield insights into users' implicit search intentions.
We propose ProRBP, a novel Progressive Retrieved Behavior-augmented Prompting framework for integrating search scenario-oriented knowledge with Large Language Models.
arXiv Detail & Related papers (2024-08-18T11:07:38Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM and finetune on them (a rough sketch of this idea appears after this list).
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context of up to millions of tokens, designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks [29.35269979211728]
We present CRoW, a benchmark that evaluates the ability of models to apply commonsense reasoning in the context of six real-world NLP tasks.
We use CRoW to study how NLP systems perform across different dimensions of commonsense knowledge, such as physical, temporal, and social reasoning.
We find a significant performance gap when NLP systems are evaluated on CRoW compared to humans, showcasing that commonsense reasoning is far from being solved in real-world task settings.
arXiv Detail & Related papers (2023-10-23T18:00:23Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose Auto-J, a generative judge with 13B parameters designed to address the challenges of evaluating alignment.
Our model is trained on user queries and LLM-generated responses drawn from a large set of real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Pushing the Limits of ChatGPT on NLP Tasks [79.17291002710517]
Despite the success of ChatGPT, its performance on most NLP tasks is still well below supervised baselines.
In this work, we look into the causes and find that the subpar performance stems from several identifiable factors.
We propose a collection of general modules to address these issues, in an attempt to push the limits of ChatGPT on NLP tasks.
arXiv Detail & Related papers (2023-06-16T09:40:05Z)
- TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks [2.822851601000061]
The paper proposes a general taxonomy that can be used to design prompts with specific properties in order to perform a wide range of complex tasks.
This taxonomy will allow future benchmarking studies to report the specific categories of prompts used as part of the study.
arXiv Detail & Related papers (2023-05-19T04:59:34Z)
- Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond [48.70557995528463]
This guide aims to provide researchers and practitioners with valuable insights and best practices for working with Large Language Models.
We present various use cases and non-use cases to illustrate the practical applications and limitations of LLMs in real-world scenarios.
arXiv Detail & Related papers (2023-04-26T17:52:30Z)
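As referenced in the SELF-GUIDE entry above, the following is a rough sketch of the self-synthetic finetuning idea: the student LLM generates its own task-specific input-output pairs, which then serve as finetuning data. Everything here is an assumption-level stand-in (call_llm, finetune, and the prompts are hypothetical, not the paper's API), and the quality filtering such pipelines rely on is omitted.

```python
# Rough sketch of self-synthetic finetuning (SELF-GUIDE-style); all
# helpers below are hypothetical stand-ins, not the paper's actual code.
def call_llm(prompt: str) -> str:
    """Stand-in for a call to the student LLM."""
    return f"<model output for: {prompt[:40]}...>"

def finetune(model_name: str, pairs: list[tuple[str, str]]) -> str:
    """Stand-in for finetuning the student on the synthetic pairs."""
    print(f"finetuning {model_name} on {len(pairs)} synthetic pairs")
    return model_name + "-selfguided"

def self_guide(model_name: str, task: str, n: int = 3) -> str:
    pairs = []
    for _ in range(n):
        # The student synthesizes a plausible input for the task...
        x = call_llm(f"Write one example input for this task: {task}")
        # ...and then answers its own synthesized input.
        y = call_llm(f"{task}\nInput: {x}\nOutput:")
        pairs.append((x, y))
    # The paper filters synthesized pairs for quality; omitted here.
    return finetune(model_name, pairs)

print(self_guide("student-llm", "Classify the sentiment of a review."))
```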