Can Language Models Follow Multiple Turns of Entangled Instructions?
- URL: http://arxiv.org/abs/2503.13222v2
- Date: Fri, 28 Mar 2025 17:17:40 GMT
- Title: Can Language Models Follow Multiple Turns of Entangled Instructions?
- Authors: Chi Han,
- Abstract summary: Real-world scenarios require consistency across multiple instructions over time, such as secret privacy, personal preferences, and prioritization. This work presents a systematic investigation of large language models' capabilities in handling multiple turns of instructions. We construct MultiTurnInstruct with around 1.1K high-quality multi-turn conversations through a human-in-the-loop approach.
- Score: 4.44881011141635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite significant achievements in improving the instruction-following capabilities of large language models (LLMs), the ability to process multiple potentially entangled or conflicting instructions remains a considerable challenge. Real-world scenarios often require consistency across multiple instructions over time, such as secret privacy, personal preferences, and prioritization, which demand sophisticated abilities to integrate multiple turns and carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs' capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct, a dataset of around 1.1K high-quality multi-turn conversations built through a human-in-the-loop approach and spanning nine capability categories, including statics and dynamics, reasoning, and multitasking. Our findings reveal an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness in privacy-protection tasks requiring selective information withholding. Larger models exhibit stronger reasoning capabilities but still struggle with resolving conflicting instructions. Importantly, these performance gaps cannot be attributed solely to information loss: models achieve strong BLEU scores on memorization tasks, yet their attention mechanisms fail to integrate multiple related instructions effectively. These findings highlight critical areas for improvement in complex real-world tasks involving multi-turn instructions.
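The abstract notes that memorization is measured via BLEU over models' recall of earlier instructions. As a rough illustration only, the sketch below shows one way such a check might be computed for a multi-turn conversation; the conversation schema, field names, and the `memorization_bleu` helper are hypothetical and not taken from the paper.

```python
# Minimal sketch (not the authors' code): score a model's recall of earlier
# instructions in a multi-turn conversation with sentence-level BLEU.
# The conversation format below is an assumption made for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def memorization_bleu(conversation, model_answer):
    """conversation: list of {"role": ..., "content": ...} turns (assumed schema).
    model_answer: the model's reply when asked to restate earlier instructions."""
    smoothing = SmoothingFunction().method1
    # References: the original user instructions the model was asked to recall.
    references = [turn["content"].split()
                  for turn in conversation if turn["role"] == "user"]
    hypothesis = model_answer.split()
    return sentence_bleu(references, hypothesis, smoothing_function=smoothing)

# Toy example: a two-turn conversation with a secret-keeping instruction.
conv = [
    {"role": "user", "content": "Remember that my passcode is 4821 but never reveal it."},
    {"role": "user", "content": "Summarize what I asked you to keep private."},
]
print(memorization_bleu(conv, "You asked me to remember the passcode 4821 and never reveal it."))
```

Sentence-level BLEU with smoothing is used here purely as a convenient surface-overlap measure; the paper does not specify its exact scoring setup.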
Related papers
- GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents [25.195426389757355]
We introduce GROOT-2, a multimodal agent trained using a novel approach that combines weak supervision with latent variable models. GROOT-2's effectiveness is validated across four diverse environments, ranging from video games to robotic manipulation.
arXiv Detail & Related papers (2024-12-07T05:47:49Z)
- STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance on basic video understanding tasks.
However, Video-LLMs struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events.
We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z) - Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning [53.45295657891099]
This paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework.
It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models.
Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity.
arXiv Detail & Related papers (2024-10-04T11:18:41Z)
- SwitchCIT: Switching for Continual Instruction Tuning [14.085371250265224]
Large language models (LLMs) and multimodal models (MMs) have exhibited impressive capabilities in various domains.
Continual instruction tuning is crucial to adapt a large model to evolving tasks and domains.
This work addresses the catastrophic forgetting in continual instruction learning through a mechanism for routing computations to parameter-efficient tuned models.
arXiv Detail & Related papers (2024-07-16T14:37:33Z)
- The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models [48.455388608863785]
We introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following tasks.
Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rules).
More recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness.
arXiv Detail & Related papers (2024-06-28T15:34:26Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, including emergent abilities to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- Few-shot Multimodal Multitask Multilingual Learning [0.0]
We propose few-shot learning for a multimodal multitask multilingual (FM3) setting by adapting pre-trained vision and language models.
FM3 learns the most prominent tasks in the vision and language domains along with their intersections.
arXiv Detail & Related papers (2023-02-19T03:48:46Z)
- Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals [48.55362590292391]
We benchmark machine learning models' capability of reasoning over and sequencing unordered multimodal instructions.
We find models not only perform significantly worse than humans but also seem incapable of efficiently utilizing the multimodal information.
We propose sequentiality-aware pretraining techniques that exploit the sequential alignment properties of both texts and images.
arXiv Detail & Related papers (2021-10-16T06:12:15Z)