RoCar: A Relationship Network-based Evaluation Method for Large Language Models
- URL: http://arxiv.org/abs/2307.15997v2
- Date: Mon, 11 Nov 2024 07:27:03 GMT
- Title: RoCar: A Relationship Network-based Evaluation Method for Large Language Models
- Authors: Ming Wang, Wenfang Wu, Chongyun Gao, Daling Wang, Shi Feng, Yifei Zhang
- Abstract summary: How to rationally evaluate the capabilities of large language models (LLMs) remains an open problem.
We propose the RoCar method, which uses predefined basic schemas to randomly construct a task graph.
Because the construction process is highly randomized, none of the LLMs under test can have directly learned the evaluation tasks, which guarantees the fairness of the evaluation method.
- Score: 20.954826722195847
- License:
- Abstract: Large language models (LLMs) have received increasing attention. However, because their capabilities are complex, how to evaluate LLMs rationally remains an open problem. We propose the RoCar method, which uses predefined basic schemas to randomly construct a task graph and generates natural language evaluation tasks from that graph to evaluate the reasoning and memory abilities of LLMs, respectively. Because the task construction process is highly randomized, it is possible to ensure that none of the LLMs under test has directly learned the evaluation tasks, guaranteeing the fairness of the evaluation method.
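To make the pipeline concrete, here is a minimal sketch of how such a randomized relationship graph could be built and verbalized into an evaluation task. The schema names, name pool, and question template are illustrative assumptions, not the basic schemas defined in the paper.

```python
import random

# Illustrative relationship schemas and entities; the paper's actual
# basic schemas are not reproduced here.
SCHEMAS = ["friend", "colleague", "parent", "neighbor"]
NAMES = ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"]

def build_task_graph(num_edges=5, seed=None):
    """Randomly connect entities with relationship edges."""
    rng = random.Random(seed)
    edges = []
    while len(edges) < num_edges:
        a, b = rng.sample(NAMES, 2)
        edge = (a, rng.choice(SCHEMAS), b)
        if edge not in edges:
            edges.append(edge)
    return edges, rng

def graph_to_task(edges, rng):
    """Verbalize the graph as stated facts plus a memory-style question."""
    facts = " ".join(f"{a} is a {rel} of {b}." for a, rel, b in edges)
    a, rel, b = rng.choice(edges)
    question = f"What is the relationship of {a} to {b}?"
    return facts, question, rel

edges, rng = build_task_graph(seed=42)
facts, question, answer = graph_to_task(edges, rng)
print(facts)
print(question, "->", answer)
```

Because every graph is freshly sampled, the verbalized tasks are overwhelmingly unlikely to occur verbatim in any model's training data, which is where the fairness guarantee comes from.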
Related papers
- Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance [1.6114012813668932]
This paper investigates the performance of large language models (LLMs) on language complexity measurement tasks.
Using Swedish high school and university-level essays, we evaluate the models' abilities to compute LIX scores and perform dependency parsing.
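For reference, LIX is a standard readability index: average sentence length plus the percentage of long words (more than six characters). A minimal implementation, assuming this standard definition is the variant the models are asked to compute:

```python
import re

def lix(text: str) -> float:
    """LIX = words/sentences + 100 * long_words/words,
    where a long word has more than six characters."""
    words = re.findall(r"[^\W\d_]+", text)
    sentences = [s for s in re.split(r"[.:!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / len(sentences) + 100 * long_words / len(words)

# Swedish example (illustrative): 10 words, 2 sentences, 2 long words -> 25.0
print(lix("Det här är en enkel mening. Läsbarhetsindex beräknas så här."))
```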
arXiv Detail & Related papers (2025-02-17T09:09:58Z)
- Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning.
We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads.
We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
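The summary is abstract, so the following is a loose toy illustration only, not the paper's formulation: a task representation read off as a weighted combination of attention-head outputs and injected into the hidden state at query time. All activations and weights below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_model = 8, 16

# Placeholder activations: per-head outputs at the last prompt token.
head_outputs = rng.standard_normal((n_heads, d_model))
# Placeholder importance weights over heads (learned in the paper's method).
head_weights = rng.random(n_heads)
head_weights /= head_weights.sum()

task_vector = head_weights @ head_outputs      # (d_model,) task representation
query_hidden = rng.standard_normal(d_model)
steered_hidden = query_hidden + task_vector    # condition the query on the task
```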
arXiv Detail & Related papers (2025-02-08T00:16:44Z)
- Large Language Models as Common-Sense Heuristics [0.9093413254392775]
Large Language Models (LLMs) possess parametrised knowledge across a wide range of topics, enabling them to leverage the natural language descriptions of planning tasks in their solutions.
We introduce a novel planning method that leverages the parametrised knowledge of LLMs by using their output as a heuristic for Hill-Climbing Search.
Our method outperforms similar systems on task success rate in a common household environment by 22 percentage points, with consistently executable plans.
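A minimal sketch of using model output as the hill-climbing heuristic; here `score` is a toy stand-in for what would, in the paper's setting, be an LLM call rating how promising a candidate state is.

```python
def hill_climb(start, neighbors, score, max_steps=50):
    """Greedy hill climbing; in the LLM-as-heuristic setting, score(state)
    would prompt a model to rate how close `state` is to task completion."""
    current, current_score = start, score(start)
    for _ in range(max_steps):
        scored = [(score(s), s) for s in neighbors(current)]
        if not scored:
            break
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score <= current_score:
            break  # local optimum under the heuristic
        current, current_score = best, best_score
    return current

# Toy numeric check: climbs from 0 toward the peak at 7.
print(hill_climb(0, lambda s: [s - 1, s + 1], lambda s: -(s - 7) ** 2))  # 7
```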
arXiv Detail & Related papers (2025-01-31T00:26:38Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
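In spirit, this is best-first search over partial reasoning traces, with a learned value estimator as the heuristic. A schematic sketch, where `propose_steps` and `q_value` are hypothetical stand-ins for the LLM and the value model:

```python
import heapq

def deliberative_decode(question, propose_steps, q_value, is_complete,
                        max_expansions=100):
    """Expand the partial reasoning trace with the highest estimated value
    until a complete answer is found or the budget runs out."""
    counter = 0  # tie-breaker so traces themselves are never compared
    frontier = [(-q_value(question, []), counter, [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, trace = heapq.heappop(frontier)
        if is_complete(trace):
            return trace
        for step in propose_steps(question, trace):
            counter += 1
            new_trace = trace + [step]
            heapq.heappush(frontier,
                           (-q_value(question, new_trace), counter, new_trace))
    return None
```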
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving the planning capabilities of large language models (LLMs).
We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios.
We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
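Mechanically, many-shot in-context learning means packing a large number of solved examples into the prompt. A minimal sketch of such prompt construction (the field names are illustrative):

```python
def build_many_shot_prompt(examples, query, k=50):
    """Concatenate k solved planning examples ahead of the new problem."""
    shots = "\n\n".join(
        f"Problem: {ex['problem']}\nPlan: {ex['plan']}" for ex in examples[:k]
    )
    return f"{shots}\n\nProblem: {query}\nPlan:"
```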
arXiv Detail & Related papers (2024-06-18T22:57:06Z)
- Sub-goal Distillation: A Method to Improve Small Language Agents [21.815417165548187]
Large Language Models (LLMs) have demonstrated significant promise as agents in interactive tasks.
We propose a method for transferring the performance of an LLM with billions of parameters to a much smaller language model.
In ScienceWorld, a challenging and multi-task interactive text environment, our method surpasses standard imitation learning based solely on elementary actions by 16.7%.
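One plausible reading of the recipe, sketched below with hypothetical function names: a large teacher decomposes each task into sub-goals, and the small student is trained to imitate at the sub-goal level rather than only on elementary actions.

```python
def build_distillation_data(tasks, teacher_decompose, teacher_solve):
    """Collect (task -> sub-goals) and (sub-goal -> actions) pairs for
    fine-tuning a small student agent."""
    planner_data, executor_data = [], []
    for task in tasks:
        subgoals = teacher_decompose(task)      # e.g. by prompting a large LLM
        planner_data.append({"input": task, "target": subgoals})
        for sg in subgoals:
            actions = teacher_solve(task, sg)   # elementary actions for one sub-goal
            executor_data.append({"input": (task, sg), "target": actions})
    return planner_data, executor_data
```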
arXiv Detail & Related papers (2024-05-04T20:34:06Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
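The summary leaves the metric's details open; one rough sketch of an SES-style score is to prompt an evaluator LLM to rate whether the modified text preserves the required logical relation. The prompt wording and 1-5 scale below are assumptions, and `judge_llm` is a hypothetical callable.

```python
def self_evaluation_score(premise, modified_argument, relation, judge_llm):
    """Ask an evaluator model whether the modified argument still stands in
    the required logical relation to the premise; returns a 1-5 rating."""
    prompt = (
        f"Premise: {premise}\n"
        f"Modified argument: {modified_argument}\n"
        f"On a scale of 1 to 5, how well does the modified argument preserve "
        f"the logical relation '{relation}' with the premise? Reply with one digit."
    )
    return int(judge_llm(prompt).strip())
```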
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
The Open Information Extraction (OIE) task aims to extract structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as general task solvers, they lag behind state-of-the-art (supervised) methods on OIE tasks.
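OIE output is typically (subject; relation; object) triples. A minimal few-shot prompting sketch in that spirit, with the demonstration format and output parsing as assumptions and `llm` as a hypothetical callable:

```python
def extract_triples(sentence, demonstrations, llm):
    """Few-shot prompt an LLM to emit one (subject; relation; object)
    triple per line, then parse its output."""
    shots = "\n".join(f"Sentence: {s}\nTriples: {t}" for s, t in demonstrations)
    prompt = f"{shots}\nSentence: {sentence}\nTriples:"
    triples = []
    for line in llm(prompt).strip().splitlines():
        parts = [p.strip(" ()") for p in line.split(";")]
        if len(parts) == 3:
            triples.append(tuple(parts))
    return triples
```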
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
- ISR-LLM: Iterative Self-Refined Large Language Model for Long-Horizon Sequential Task Planning [7.701407633867452]
Large Language Models (LLMs) offer the potential to serve as generalizable, task-agnostic planners.
We introduce ISR-LLM, a novel framework that improves LLM-based planning through an iterative self-refinement process.
We show that ISR-LLM achieves markedly higher task success rates than state-of-the-art LLM-based planners.
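The core loop is generate, validate, refine. A schematic sketch with `llm_plan`, `validate`, and `llm_refine` as hypothetical stand-ins for the framework's components:

```python
def iterative_self_refine(task, llm_plan, validate, llm_refine, max_rounds=5):
    """Draft a plan, check it with a validator, and feed the feedback back
    to the LLM for revision until the plan passes or the budget runs out."""
    plan = llm_plan(task)
    for _ in range(max_rounds):
        ok, feedback = validate(task, plan)   # e.g. a simulator or rule checker
        if ok:
            return plan
        plan = llm_refine(task, plan, feedback)
    return plan  # best effort after max_rounds refinements
```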
arXiv Detail & Related papers (2023-08-26T01:31:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.