TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks
- URL: http://arxiv.org/abs/2305.11430v2
- Date: Tue, 24 Oct 2023 22:50:02 GMT
- Title: TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks
- Authors: Shubhra Kanti Karmaker Santu and Dongji Feng
- Abstract summary: The paper proposes a general taxonomy that can be used to design prompts with specific properties in order to perform a wide range of complex tasks.
This taxonomy will allow future benchmarking studies to report the specific categories of prompts used as part of the study.
- Score: 2.822851601000061
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While LLMs have shown great success in understanding and generating text in
traditional conversational settings, their potential for performing ill-defined
complex tasks is largely under-studied. Indeed, comprehensive benchmarking
studies with multiple LLMs that focus exclusively on complex tasks have yet to
be conducted. However, conducting such studies is
challenging because of the large variations in LLMs' performance when different
prompt types/styles are used and different degrees of detail are provided in
the prompts. To address this issue, the paper proposes a general taxonomy that
can be used to design prompts with specific properties in order to perform a
wide range of complex tasks. This taxonomy will allow future benchmarking
studies to report the specific categories of prompts used as part of the study,
enabling meaningful comparisons across different studies. Also, by establishing
a common standard through this taxonomy, researchers will be able to draw more
accurate conclusions about LLMs' performance on a specific complex task.
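The abstract does not enumerate the taxonomy's categories, but as a rough illustration of how a benchmarking study might "report the specific categories of prompts used," a record like the following could accompany each run. This is a minimal sketch; the fields shown (detail level, turn count, system role) are illustrative assumptions, not the taxonomy's official dimensions.

    # Illustrative sketch only (not the paper's reference implementation):
    # one way a benchmarking study could log which prompt category each run
    # used, so results stay comparable across studies. The specific fields
    # below are assumptions for illustration.
    from dataclasses import dataclass

    @dataclass
    class PromptRecord:
        task: str            # the complex task being benchmarked
        prompt_text: str     # the exact prompt sent to the LLM
        detail_level: int    # 0 = bare directive, higher = more sub-tasks/criteria spelled out
        turns: str           # "single" or "multi"
        system_role: bool    # whether a system role/persona was specified

    record = PromptRecord(
        task="meeting summarization",
        prompt_text="Summarize the key decisions and action items from the transcript below.",
        detail_level=2,
        turns="single",
        system_role=False,
    )
    print(record)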
Related papers
- Large Language Models are Pattern Matchers: Editing Semi-Structured and Structured Documents with ChatGPT [0.0]
This paper investigates whether Large Language Models (LLMs) can be applied to edit structured and semi-structured documents with minimal effort.
ChatGPT demonstrates a strong ability to recognize and process the structure of annotated documents.
arXiv Detail & Related papers (2024-09-12T03:41:39Z)
- Assessing SPARQL capabilities of Large Language Models [0.0]
We focus on measuring the out-of-the-box capabilities of Large Language Models to work with SPARQL.
We implement benchmarking tasks in the LLM-KG-Bench framework for automated execution and evaluation.
Our findings indicate that working with SPARQL SELECT queries is still challenging for LLMs.
arXiv Detail & Related papers (2024-09-09T08:29:39Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models [1.9799349044359313]
We introduce a taxonomy that employs a Hierarchical Prompt Framework (HPF) composed of five unique prompting strategies to assess large language models (LLMs) more precisely.
We also introduce the Adaptive Hierarchical Prompt framework, which automates the selection of appropriate prompting strategies for each task.
This study compares manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets.
arXiv Detail & Related papers (2024-06-18T14:12:27Z)
- Benchmarking LLMs on the Semantic Overlap Summarization Task [9.656095701778975]
This paper comprehensively evaluates Large Language Models (LLMs) on the Semantic Overlap Summarization (SOS) task.
We report well-established metrics such as ROUGE, BERTScore, and SEM-F1 on two different datasets of alternative narratives.
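For illustration only, the two standard reference-based metrics could be computed with the open-source rouge-score and bert-score packages as sketched below; SEM-F1 is the semantic metric introduced in the SOS line of work and is not reproduced here.

    # Hedged sketch: computing two of the reported metrics with widely used
    # packages (pip install rouge-score bert-score). SEM-F1 is omitted.
    from rouge_score import rouge_scorer
    from bert_score import score as bert_score

    reference = "Both narratives agree that the protest remained peaceful."
    candidate = "The two accounts concur that the demonstration was peaceful."

    # ROUGE-1 and ROUGE-L F-measures
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, candidate)
    print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

    # BERTScore F1 (returns tensors of precision, recall, F1)
    P, R, F1 = bert_score([candidate], [reference], lang="en")
    print("BERTScore F1:", round(F1.item(), 3))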
arXiv Detail & Related papers (2024-02-26T20:33:50Z)
- Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations [50.81844184210381]
We propose DOKE, a general paradigm that augments large language models with DOmain-specific KnowledgE to enhance their performance on practical applications.
This paradigm relies on a domain knowledge extractor, working in three steps: 1) preparing effective knowledge for the task; 2) selecting the knowledge for each specific sample; and 3) expressing the knowledge in an LLM-understandable way.
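As a rough illustration only, the three-step flow described above might be sketched as follows; the function names and selection heuristic are hypothetical assumptions, not DOKE's actual implementation.

    # Hypothetical sketch of the three-step knowledge-augmentation flow;
    # names and logic are illustrative, not DOKE's API.
    def prepare_knowledge(task_description, knowledge_base):
        # Step 1: gather candidate domain knowledge relevant to the task.
        return [fact for fact in knowledge_base if task_description.lower() in fact.lower()]

    def select_knowledge(sample, candidates, k=3):
        # Step 2: pick the pieces most relevant to this specific sample
        # (a naive word-overlap heuristic stands in for a real selector).
        overlap = lambda fact: len(set(fact.lower().split()) & set(sample.lower().split()))
        return sorted(candidates, key=overlap, reverse=True)[:k]

    def express_knowledge(sample, selected):
        # Step 3: serialize the knowledge into a prompt the LLM can consume.
        facts = "\n".join(f"- {fact}" for fact in selected)
        return f"Relevant domain knowledge:\n{facts}\n\nQuestion: {sample}"

    knowledge_base = [
        "Recommendation quality improves when item attributes are included.",
        "Users who buy hiking boots often also buy wool socks.",
    ]
    question = "What should I suggest to a hiking-boot buyer?"
    candidates = prepare_knowledge("recommendation", knowledge_base)
    print(express_knowledge(question, select_knowledge(question, candidates)))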
arXiv Detail & Related papers (2023-11-16T07:09:38Z)
- The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions [114.67699010359637]
We analyze a large-scale collection of real user queries to GPT.
We find that tasks such as "design" and "planning" are prevalent in user interactions but are largely neglected by, or differ from, traditional NLP benchmarks.
arXiv Detail & Related papers (2023-10-19T02:12:17Z)
- Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
- Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z)
- Multi-Task Instruction Tuning of LLaMa for Specific Scenarios: A Preliminary Study on Writing Assistance [60.40541387785977]
Small foundational models can display remarkable proficiency in tackling diverse tasks when fine-tuned using instruction-driven data.
In this work, we investigate a practical problem setting where the primary focus is on one or a few particular tasks rather than general-purpose instruction following.
Experimental results show that fine-tuning LLaMA on writing instruction data significantly improves its ability on writing tasks.
arXiv Detail & Related papers (2023-05-22T16:56:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.