RELIC: Evaluating Compositional Instruction Following via Language Recognition
- URL: http://arxiv.org/abs/2506.05205v1
- Date: Thu, 05 Jun 2025 16:17:24 GMT
- Title: RELIC: Evaluating Compositional Instruction Following via Language Recognition
- Authors: Jackson Petty, Michael Y. Hu, Wentao Wang, Shauli Ravfogel, William Merrill, Tal Linzen
- Abstract summary: Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context. We introduce the Recognition of Languages In-Context (RELIC) framework to evaluate instruction following using language recognition.
- Score: 37.49115450182637
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context, without examples of inputs and outputs; this ability is referred to as instruction following. We introduce the Recognition of Languages In-Context (RELIC) framework to evaluate instruction following using language recognition: the task of determining whether a string is generated by a formal grammar. Unlike many standard evaluations of LLMs' ability to use their context, this task requires composing a large number of instructions (grammar productions) retrieved from the context. Because the languages are synthetic, the task can be increased in complexity as LLMs' skills improve, and new instances can be automatically generated, mitigating data contamination. We evaluate state-of-the-art LLMs on RELIC and find that their accuracy can be reliably predicted from the complexity of the grammar and the individual example strings, and that even the most advanced LLMs currently available show near-chance performance on more complex grammars and samples, in line with theoretical expectations. We also use RELIC to diagnose how LLMs attempt to solve increasingly difficult reasoning tasks, finding that as the complexity of the language recognition task increases, models switch to relying on shallow heuristics instead of following complex instructions.
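To make the language recognition task concrete, here is a minimal sketch of a recognizer based on the classic CYK algorithm. It is an illustration only, not the RELIC implementation: the function name `recognizes`, the dictionary encoding of the grammar, and the toy grammar for the language a^n b^n are all assumptions made for this example, and the grammar is assumed to be in Chomsky normal form.

```python
from itertools import product


def recognizes(grammar, start, tokens):
    """Return True if `tokens` can be derived from `start`.

    `grammar` maps each nonterminal to a list of right-hand sides.
    Each right-hand side is a tuple holding either one terminal or
    two nonterminals (Chomsky normal form, assumed for this sketch).
    """
    n = len(tokens)
    if n == 0:
        return False  # this sketch does not model the empty string

    # Index the productions by their right-hand side for fast lookup.
    unary = {}   # terminal -> nonterminals that produce it
    binary = {}  # (B, C) -> nonterminals A with a production A -> B C
    for lhs, rhss in grammar.items():
        for rhs in rhss:
            if len(rhs) == 1:
                unary.setdefault(rhs[0], set()).add(lhs)
            else:
                binary.setdefault(rhs, set()).add(lhs)

    # table[i, j] holds the nonterminals that derive tokens[i:j].
    table = {}
    for i, tok in enumerate(tokens):
        table[i, i + 1] = set(unary.get(tok, ()))

    # Build longer spans by composing two shorter, adjacent spans.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = set()
            for k in range(i + 1, j):
                for pair in product(table[i, k], table[k, j]):
                    cell |= binary.get(pair, set())
            table[i, j] = cell

    return start in table[0, n]


# Hypothetical toy grammar for the language a^n b^n (n >= 1).
TOY_GRAMMAR = {
    "S": [("A", "B"), ("A", "X")],
    "X": [("S", "B")],
    "A": [("a",)],
    "B": [("b",)],
}

print(recognizes(TOY_GRAMMAR, "S", list("aabb")))  # True
print(recognizes(TOY_GRAMMAR, "S", list("abab")))  # False
```

Deciding membership for even a four-token string requires chaining several productions, which is the compositional demand the abstract describes; a model answering from shallow cues such as string length would accept both strings above.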
Related papers
- Linguistic Blind Spots of Large Language Models [14.755831733659699]
We study the performance of recent large language models (LLMs) on linguistic annotation tasks. We find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. Our results provide insights to inform future advancements in LLM design and development.
arXiv Detail & Related papers (2025-03-25T01:47:13Z)
- Enhancing LLM Character-Level Manipulation via Divide and Conquer [74.55804812450164]
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. They exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. We propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation.
arXiv Detail & Related papers (2025-02-12T07:37:39Z)
- $\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks [21.12437562185667]
This paper presents a new approach for scaling LLM assessment in translating formal syntax to natural language.
We use context-free grammars (CFGs) to generate out-of-distribution datasets on the fly.
We also conduct an assessment of several SOTA closed and open-source LLMs to showcase the feasibility and scalability of this paradigm.
arXiv Detail & Related papers (2024-03-27T08:08:00Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- kNN-ICL: Compositional Task-Oriented Parsing Generalization with Nearest Neighbor In-Context Learning [50.40636157214161]
Task-Oriented Parsing (TOP) enables conversational assistants to interpret user commands expressed in natural language.
LLMs have achieved impressive performance in synthesizing computer programs based on a natural language prompt.
This paper focuses on harnessing the capabilities of LLMs for semantic parsing tasks.
arXiv Detail & Related papers (2023-12-17T17:26:50Z)
- Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
- Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study [0.0]
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems.
Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems.
arXiv Detail & Related papers (2023-07-13T02:31:55Z)
- IERL: Interpretable Ensemble Representation Learning -- Combining CrowdSourced Knowledge and Distributed Semantic Representations [11.008412414253662]
Large Language Models (LLMs) encode meanings of words in the form of distributed semantics.
Recent studies have shown that LLMs tend to generate unintended, inconsistent, or wrong texts as outputs.
We propose a novel ensemble learning method, Interpretable Ensemble Representation Learning (IERL), that systematically combines LLM and crowdsourced knowledge representations.
arXiv Detail & Related papers (2023-06-24T05:02:34Z)
- Interpretable Unified Language Checking [42.816372695828306]
We present an interpretable, unified, language checking (UniLC) method for both human and machine-generated language.
We find that LLMs can achieve high performance on a combination of fact-checking, stereotype detection, and hate speech detection tasks.
arXiv Detail & Related papers (2023-04-07T16:47:49Z)
- Augmented Language Models: a Survey [55.965967655575454]
This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools.
We refer to them as Augmented Language Models (ALMs).
The missing token objective allows ALMs to learn to reason, use tools, and even act, while still performing standard natural language tasks.
arXiv Detail & Related papers (2023-02-15T18:25:52Z)
- Prompting Language Models for Linguistic Structure [73.11488464916668]
We present a structured prompting approach for linguistic structured prediction tasks.
We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking.
We find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels.
arXiv Detail & Related papers (2022-11-15T01:13:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.