TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks
- URL: http://arxiv.org/abs/2305.11430v2
- Date: Tue, 24 Oct 2023 22:50:02 GMT
- Title: TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks
- Authors: Shubhra Kanti Karmaker Santu and Dongji Feng
- Abstract summary: The paper proposes a general taxonomy that can be used to design prompts with specific properties in order to perform a wide range of complex tasks.
This taxonomy will allow future benchmarking studies to report the specific categories of prompts used as part of the study.
- Score: 2.822851601000061
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While LLMs have shown great success in understanding and generating text in traditional conversational settings, their potential for performing ill-defined complex tasks remains largely under-studied. Indeed, we have yet to see comprehensive benchmarking studies with multiple LLMs that focus exclusively on a complex task. However, conducting such benchmarking studies is challenging because LLM performance varies widely with the prompt type/style used and the degree of detail provided in the prompt. To address this issue, the paper proposes a general taxonomy that can be used to design prompts with specific properties in order to perform a wide range of complex tasks. This taxonomy will allow future benchmarking studies to report the specific categories of prompts used, enabling meaningful comparisons across studies. Moreover, by establishing a common standard through this taxonomy, researchers will be able to draw more accurate conclusions about LLMs' performance on a specific complex task.
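To make the proposal concrete, here is a minimal sketch of what "prompts with specific properties" can look like in practice: the same task rendered at increasing degrees of detail. The level semantics, function name, and example task below are illustrative assumptions, not the paper's exact TELeR specification.

```python
# Illustrative sketch: building prompts whose degree of detail grows with a
# "level" parameter, in the spirit of the taxonomy. The level semantics below
# are simplified assumptions, not the paper's exact definitions.

def build_prompt(task, level, subtasks=None, criteria=None):
    """Return a prompt for `task` whose detail is controlled by `level` (0-3)."""
    if level == 0:
        return task  # no directive, just the raw task description
    parts = [f"Perform the following task: {task}"]
    if level >= 2 and subtasks:
        parts.append("Address each of these sub-tasks:")
        parts.extend(f"- {s}" for s in subtasks)
    if level >= 3 and criteria:
        parts.append("Your answer will be evaluated on:")
        parts.extend(f"- {c}" for c in criteria)
    return "\n".join(parts)

for lvl in range(4):
    print(f"--- Level {lvl} ---")
    print(build_prompt(
        "Summarize the key findings of the attached meeting transcript.",
        lvl,
        subtasks=["list the decisions made", "list the action items"],
        criteria=["coverage", "factual accuracy", "brevity"],
    ))
```

Under such a scheme, a benchmarking study would report which level each evaluated prompt belongs to, which is exactly the kind of standardized disclosure the taxonomy argues for.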
Related papers
- Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey [39.82566660592583]
Large Language Models (LLMs) have demonstrated remarkable success in various tasks such as natural language understanding, text summarization, and machine translation.
However, their general-purpose nature often limits their effectiveness in domain-specific applications that require specialized knowledge, such as healthcare, chemistry, or legal analysis.
To address this, researchers have explored diverse methods to enhance LLMs by integrating domain-specific knowledge.
arXiv Detail & Related papers (2025-02-15T07:43:43Z)
- Large Language Models are Pattern Matchers: Editing Semi-Structured and Structured Documents with ChatGPT [0.0]
This paper investigates if Large Language Models (LLMs) can be applied for editing structured and semi-structured documents with minimal effort.
ChatGPT demonstrates a strong ability to recognize and process the structure of annotated documents.
arXiv Detail & Related papers (2024-09-12T03:41:39Z)
- Assessing SPARQL capabilities of Large Language Models [0.0]
We focus on measuring the out-of-the-box capabilities of Large Language Models to work with SPARQL.
We implement benchmarking tasks in the LLM-KG-Bench framework for automated execution and evaluation.
Our findings indicate that working with SPARQL SELECT queries is still challenging for LLMs.
arXiv Detail & Related papers (2024-09-09T08:29:39Z)
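A minimal sketch of what such an automated SPARQL check might look like, assuming rdflib for parsing; the `ask_llm` helper is a hypothetical stand-in for any chat-completion client (mocked here so the script runs), and this is not the LLM-KG-Bench implementation itself.

```python
# Hedged sketch of an automated SPARQL-capability check: ask a model for a
# SELECT query, then verify that the output at least parses as SPARQL.
from rdflib.plugins.sparql import prepareQuery

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in: a real implementation would call a model API here.
    return "SELECT DISTINCT ?cls WHERE { ?s a ?cls }"

question = "Which classes are used in the graph?"
reply = ask_llm(
    "Write one SPARQL SELECT query that answers the question below. "
    f"Return only the query, no prose.\nQuestion: {question}"
)
try:
    # Syntax check only; a full benchmark would also execute the query
    # against a knowledge graph and compare results.
    prepareQuery(reply)
    print("query parses")
except Exception as err:
    print(f"query rejected: {err}")
```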
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - Benchmarking LLMs on the Semantic Overlap Summarization Task [9.656095701778975]
This paper comprehensively evaluates Large Language Models (LLMs) on the Semantic Overlap Summarization (SOS) task.
We report well-established metrics like ROUGE, BERTScore, and SEM-F1 on two different datasets of alternative narratives.
arXiv Detail & Related papers (2024-02-26T20:33:50Z)
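For readers unfamiliar with these metrics, the sketch below shows how such reference-based scores are typically computed. It uses the `rouge-score` package and invented example texts; it is not the paper's evaluation code. BERTScore (via the `bert-score` package) would follow the same pattern, and SEM-F1 is omitted here.

```python
# Minimal sketch of reference-based summary scoring with the `rouge-score`
# package. The reference/candidate pair is invented for illustration.
from rouge_score import rouge_scorer

reference = "Both narratives agree that the storm forced the coastal highway to close."
candidate = "The two stories overlap in reporting that the storm closed the highway."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for name, s in scorer.score(reference, candidate).items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```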
- Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations [50.81844184210381]
We propose DOKE, a general paradigm that augments large language models with DOmain-specific KnowledgE to enhance their performance on practical applications.
This paradigm relies on a domain knowledge extractor, working in three steps: 1) preparing effective knowledge for the task; 2) selecting the knowledge for each specific sample; and 3) expressing the knowledge in an LLM-understandable way.
arXiv Detail & Related papers (2023-11-16T07:09:38Z)
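As a toy illustration of that three-step flow, the sketch below wires placeholder implementations together. The function bodies and the recommendation example are invented assumptions, not the DOKE codebase.

```python
# Hedged sketch of a three-step knowledge-extractor pipeline: prepare domain
# facts, select the relevant ones per sample, express them for the prompt.

def prepare_knowledge(domain_corpus):
    """Step 1: prepare candidate domain facts useful for the task (placeholder)."""
    return ["customers who buy hiking boots often also buy wool socks"]

def select_knowledge(facts, sample):
    """Step 2: keep only the facts relevant to this specific sample (toy overlap test)."""
    words = set(sample.lower().split())
    return [f for f in facts if words & set(f.split())]

def express_knowledge(facts):
    """Step 3: verbalize the selected facts so an LLM can consume them."""
    return "Known domain facts:\n" + "\n".join(f"- {f}" for f in facts)

sample = "recommend an add-on item for a customer buying hiking boots"
facts = select_knowledge(prepare_knowledge(domain_corpus=None), sample)
prompt = express_knowledge(facts) + "\n\nTask: " + sample
print(prompt)
```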
- The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions [114.67699010359637]
We analyze a large-scale collection of real user queries to GPT.
We find that tasks such as "design" and "planning" are prevalent in user interactions but are largely neglected by, or differ from, traditional NLP benchmarks.
arXiv Detail & Related papers (2023-10-19T02:12:17Z)
- Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
- Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z)
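A minimal sketch of what one cell of such an evaluation grid might look like, assuming a zero-shot classification prompt; `ask_llm` and the two examples are invented for illustration, and this is not the paper's evaluation harness.

```python
# Illustrative sketch: prompt a model for one sentiment label per example and
# compute accuracy. `ask_llm` is a hypothetical stand-in for a real model
# client, mocked here so the script runs end to end.

DATASET = [
    ("the battery dies within an hour", "negative"),
    ("absolutely love the new keyboard", "positive"),
]

def ask_llm(prompt: str) -> str:
    # A real implementation would call an LLM (or an SLM baseline) here.
    return "negative" if "dies" in prompt else "positive"

correct = 0
for text, gold in DATASET:
    pred = ask_llm(
        "Classify the sentiment of this review as positive or negative.\n"
        f"Review: {text}\nLabel:"
    ).strip().lower()
    correct += pred == gold
print(f"accuracy: {correct / len(DATASET):.2f}")
```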
- Multi-Task Instruction Tuning of LLaMa for Specific Scenarios: A Preliminary Study on Writing Assistance [60.40541387785977]
Small foundational models can display remarkable proficiency in tackling diverse tasks when fine-tuned using instruction-driven data.
In this work, we investigate a practical problem setting where the primary focus is on one or a few particular tasks rather than general-purpose instruction following.
Experimental results show that fine-tuning LLaMA on writing instruction data significantly improves its ability on writing tasks.
arXiv Detail & Related papers (2023-05-22T16:56:44Z)