Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs
- URL: http://arxiv.org/abs/2502.11525v2
- Date: Mon, 19 May 2025 13:48:45 GMT
- Title: Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs
- Authors: Yi Hu, Shijia Kang, Haotong Yang, Haotian Xu, Muhan Zhang
- Abstract summary: We study length generalization in multi-task settings and propose Meta Rule-Following Fine-Tuning (Meta-RFFT) as the first framework enabling robust cross-task length generalization. After training on a large number of tasks and instances, our models achieve remarkable length generalization on unseen tasks with minimal fine-tuning or one-shot prompting.
- Score: 23.958458849973248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Length generalization, the ability to solve problems longer than those seen during training, remains a critical challenge for large language models (LLMs). Previous work modifies positional encodings (PEs) and data formats to improve length generalization on specific symbolic tasks such as addition and sorting. However, these approaches are fundamentally limited to specialized tasks and often degrade general language performance. Furthermore, they are typically evaluated on small transformers trained from scratch on single tasks, and they can cause a performance drop when applied during the post-training stage of practical LLMs with general capabilities. Hu et al. (2024) proposed Rule-Following Fine-Tuning (RFFT) to improve length generalization in the post-training stage of LLMs. Despite its compatibility with practical models and strong performance, RFFT is also designed for single tasks, requiring re-training for each individual task with extensive examples. In this paper, we study length generalization in multi-task settings and propose Meta Rule-Following Fine-Tuning (Meta-RFFT), the first framework enabling robust cross-task length generalization. As our first contribution, we construct a large length generalization dataset containing 86 tasks spanning code execution, number processing, and symbolic and logical reasoning, going beyond the common addition or multiplication tasks. Secondly, we show that cross-task length generalization is possible with Meta-RFFT. After training on a large number of tasks and instances, the models achieve remarkable length generalization on unseen tasks with minimal fine-tuning or one-shot prompting. For example, after fine-tuning on 1 to 5 digit addition, our 32B model achieves 95% accuracy on 30 digit addition, significantly outperforming state-of-the-art reasoning models (DeepSeek-R1-671B: 72%), despite never seeing this task during RF-pretraining.
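The rule-following recipe is concrete enough to sketch in code. The snippet below is a minimal, hypothetical construction of one RFFT-style training example for addition: the rule is stated in the input, and the target executes it digit by digit with explicit intermediate state, which is what makes the format transferable to longer instances. The trace format and field names are assumptions for illustration, not the paper's released data format.

```python
# Minimal sketch of an RFFT-style training example for n-digit addition.
# The exact prompt/trace format is an assumption; the paper's released
# data may differ. The idea: state the rule in the input, and make the
# target follow it step by step with explicit intermediate state.

RULE = (
    "Add two numbers digit by digit from the least significant digit, "
    "tracking a carry. At each step, output the digit sum, the written "
    "digit, and the new carry."
)

def rule_following_trace(a: int, b: int) -> dict:
    """Build one (input, target) pair whose target executes RULE explicitly."""
    xs, ys = str(a)[::-1], str(b)[::-1]
    steps, carry, result = [], 0, []
    for i in range(max(len(xs), len(ys))):
        dx = int(xs[i]) if i < len(xs) else 0
        dy = int(ys[i]) if i < len(ys) else 0
        s = dx + dy + carry
        carry, digit = divmod(s, 10)
        result.append(str(digit))
        steps.append(f"step {i}: {dx}+{dy}+carry={s}, write {digit}, carry {carry}")
    if carry:
        result.append(str(carry))
        steps.append(f"final: write carry {carry}")
    answer = "".join(reversed(result))
    return {
        "input": f"Rule: {RULE}\nQuestion: {a} + {b} = ?",
        "target": "\n".join(steps) + f"\nanswer: {answer}",
    }

print(rule_following_trace(57, 968)["target"])  # traces 57 + 968 = 1025
```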
Related papers
- The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities [51.594836904623534]
We investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. We show that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve.
arXiv Detail & Related papers (2025-01-15T10:57:55Z) - RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios [58.90106984375913]
RuleArena is a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning.
Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions.
arXiv Detail & Related papers (2024-12-12T06:08:46Z) - LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging [80.17238673443127]
LiNeS is a post-training editing technique designed to preserve pre-trained generalization while enhancing fine-tuned task performance. LiNeS demonstrates significant improvements in both single-task and multi-task settings across various benchmarks in vision and natural language processing.
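The summary suggests a simple mental model: scale fine-tuning edits by layer depth, shrinking changes to shallow layers (which carry general features) while keeping deeper, task-specific ones. The sketch below is a rough, hypothetical rendering of that idea with a linear schedule; the coefficients, schedule, and function names are assumptions based on the summary, not the authors' released code.

```python
# Hypothetical sketch of post-training layer scaling in the spirit of LiNeS.
# The linear schedule and the alpha/beta coefficients are assumptions.
from typing import Dict
import numpy as np

def layer_scaled_merge(
    pretrained: Dict[str, np.ndarray],
    finetuned: Dict[str, np.ndarray],
    layer_of: Dict[str, int],   # parameter name -> layer index 0..L-1
    num_layers: int,
    alpha: float = 0.1,         # scale applied at the shallowest layer
    beta: float = 1.0,          # scale applied at the deepest layer
) -> Dict[str, np.ndarray]:
    merged = {}
    for name, w0 in pretrained.items():
        tau = finetuned[name] - w0                      # task vector (edit)
        frac = layer_of[name] / max(num_layers - 1, 1)  # depth in [0, 1]
        lam = alpha + (beta - alpha) * frac             # linear schedule
        merged[name] = w0 + lam * tau                   # scaled-back edit
    return merged
```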
arXiv Detail & Related papers (2024-10-22T16:26:05Z) - RNR: Teaching Large Language Models to Follow Roles and Rules [153.6596303205894]
We propose RNR, an automated data generation pipeline that generates diverse roles and rules from existing IFT instructions.
This data can then be used to train models that follow complex system prompts.
Our framework significantly improves role and rule following capability in large language models.
arXiv Detail & Related papers (2024-09-10T06:07:32Z) - Symbolic Working Memory Enhances Language Models for Complex Rule Application [87.34281749422756]
Large Language Models (LLMs) have shown remarkable reasoning performance but struggle with multi-step deductive reasoning.
We propose augmenting LLMs with external working memory and introduce a neurosymbolic framework for rule application.
Our framework iteratively performs symbolic rule grounding and LLM-based rule implementation.
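A hedged sketch of that iterative loop: a symbolic store holds derived facts, rule grounding is done symbolically, and rule application is delegated to an LLM call (stubbed out here). The matching logic is deliberately naive string substitution; the paper's grounding procedure is more sophisticated.

```python
# Schematic loop for rule application with an external symbolic working
# memory. `llm_apply` stands in for an LLM call and is hypothetical.

def ground(rules, facts):
    """Return (rule, fact) pairs where the fact matches the rule's premise."""
    return [(r, f) for r in rules for f in facts if r["premise"] in f]

def llm_apply(rule, fact):
    # Placeholder for an LLM-based rule implementation step.
    return fact.replace(rule["premise"], rule["conclusion"])

def solve(rules, initial_facts, max_rounds=10):
    memory = set(initial_facts)            # symbolic working memory
    for _ in range(max_rounds):
        candidates = ground(rules, memory) # symbolic rule grounding
        new = {llm_apply(r, f) for r, f in candidates} - memory
        if not new:                        # fixed point: nothing new derived
            break
        memory |= new
    return memory
```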
arXiv Detail & Related papers (2024-08-24T19:11:54Z) - Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
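As a rough illustration, distributional memorization can be read as a rank correlation between what the model assigns probability to and how often related text occurs in the pretraining corpus. The sketch below assumes both arrays are already available; obtaining real corpus frequencies would require an n-gram index over the pretraining data, which is out of scope here.

```python
# Illustrative sketch of "distributional memorization" as a rank
# correlation between model output log-probs and pretraining frequency.
from scipy.stats import spearmanr

def distributional_memorization(output_logprobs, pretrain_counts):
    """Rank correlation between output log-probs and corpus frequency.

    A high positive correlation suggests the model leans on memorized
    distributional statistics; near-zero suggests generalization.
    """
    rho, pvalue = spearmanr(output_logprobs, pretrain_counts)
    return rho, pvalue
```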
arXiv Detail & Related papers (2024-07-20T21:24:40Z) - Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models [25.337295202341608]
Large Language Models (LLMs) are supposed to be controlled and guided by rules in real-world scenarios in order to be safe, accurate, and intelligent.
Previous studies that try to evaluate the inferential rule-following capability of LLMs fail to distinguish the inferential rule-following scenarios from the instruction-following scenarios.
This paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities.
arXiv Detail & Related papers (2024-07-11T12:26:55Z) - From Instance Training to Instruction Learning: Task Adapters Generation from Instructions [29.452006810725184]
This paper focuses on simulating human learning to address the shortcomings of instance training. We introduce Task Adapters Generation from Instructions (TAGI), which automatically constructs the task-specific model. We evaluate TAGI on the Super-Natural Instructions and P3 datasets.
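One plausible reading of "automatically constructs the task-specific model" is a hypernetwork that maps an instruction embedding to adapter weights. The toy sketch below takes that reading; the architecture, LoRA-style factorization, and all dimensions are illustrative assumptions rather than TAGI's actual design.

```python
# Toy sketch: generate a low-rank task adapter from an instruction
# embedding. All shapes and the factorization are assumptions.
import torch
import torch.nn as nn

class AdapterGenerator(nn.Module):
    def __init__(self, instr_dim=768, hidden=512, d_model=1024, rank=8):
        super().__init__()
        self.rank, self.d_model = rank, d_model
        self.net = nn.Sequential(
            nn.Linear(instr_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * d_model * rank),  # A and B factors
        )

    def forward(self, instr_emb):  # (instr_dim,) -> LoRA-style factors A, B
        flat = self.net(instr_emb)
        a, b = flat.split(self.d_model * self.rank)
        return a.view(self.rank, self.d_model), b.view(self.d_model, self.rank)

gen = AdapterGenerator()
A, B = gen(torch.randn(768))
delta_w = B @ A  # low-rank weight update injected into the target model
```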
arXiv Detail & Related papers (2024-06-18T08:14:28Z) - Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs [87.34281749422756]
Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks.
However, their mastery of underlying inferential rules still falls short of human capabilities.
We propose a logic scaffolding inferential rule generation framework, to construct an inferential rule base, ULogic.
arXiv Detail & Related papers (2024-02-18T03:38:51Z) - Can LLMs Follow Simple Rules? [28.73820874333199]
Rule-following Language Evaluation Scenarios (RuLES) is a framework for measuring rule-following ability in Large Language Models.
RuLES consists of 14 simple text scenarios in which the model is instructed to obey various rules while interacting with the user.
We show that almost all current models struggle to follow scenario rules, even on straightforward test cases.
arXiv Detail & Related papers (2023-11-06T08:50:29Z) - Specialist or Generalist? Instruction Tuning for Specific NLP Tasks [58.422495509760154]
We investigate whether incorporating broad-coverage generalist instruction tuning can contribute to building a specialist model.
Our experiments assess four target tasks with distinct coverage levels.
The effect is particularly pronounced when the amount of task-specific training data is limited.
arXiv Detail & Related papers (2023-10-23T19:46:48Z) - Improving Length-Generalization in Transformers via Task Hinting [42.95479331339189]
The performance of a transformer model trained on tasks up to a certain length drops sharply when it is applied to longer instances of the same problem.
This work proposes an approach based on task hinting towards addressing length generalization.
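A hedged guess at what such an approach looks like in practice: mix simpler auxiliary "hint" tasks into the training batches for the main task, so long instances of the main task decompose into practiced subskills. The task names and mixing ratio below are assumptions, not the paper's recipe.

```python
# Hypothetical task-hinting data mix: the main task plus simpler
# auxiliary tasks that expose needed subskills.
import random

def make_batch(main_examples, hint_examples, hint_ratio=0.3, batch_size=32):
    n_hint = int(batch_size * hint_ratio)
    batch = random.sample(hint_examples, n_hint)          # e.g. "find the max"
    batch += random.sample(main_examples, batch_size - n_hint)  # e.g. "sort"
    random.shuffle(batch)
    return batch
```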
arXiv Detail & Related papers (2023-10-01T16:57:40Z) - ChatRule: Mining Logical Rules with Large Language Models for Knowledge Graph Reasoning [107.61997887260056]
We propose a novel framework, ChatRule, unleashing the power of large language models for mining logical rules over knowledge graphs.
Specifically, the framework is initiated with an LLM-based rule generator, leveraging both the semantic and structural information of KGs.
To refine the generated rules, a rule ranking module estimates the rule quality by incorporating facts from existing KGs.
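The ranking step lends itself to a classic sketch: estimate a rule's quality from how often KG facts that satisfy its body also satisfy its head. The rule representation and scoring below are simplified assumptions; ChatRule's rules and quality measures are richer.

```python
# Minimal sketch of ranking mined rules by confidence over KG facts.
# Rules here are simple "body relation implies head relation" patterns.

def rule_confidence(body_rel, head_rel, triples):
    """Confidence = P(head holds | body holds), estimated from KG facts."""
    body_pairs = {(h, t) for h, r, t in triples if r == body_rel}
    head_pairs = {(h, t) for h, r, t in triples if r == head_rel}
    if not body_pairs:
        return 0.0
    return len(body_pairs & head_pairs) / len(body_pairs)

kg = [("alice", "mother_of", "bob"), ("alice", "parent_of", "bob"),
      ("carol", "mother_of", "dan")]
print(rule_confidence("mother_of", "parent_of", kg))  # 0.5
```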
arXiv Detail & Related papers (2023-09-04T11:38:02Z) - CrossCodeBench: Benchmarking Cross-Task Generalization of Source Code Models [33.78307982736911]
Cross-task generalization has strong research and application value.
We propose a large-scale benchmark that includes 216 existing code-related tasks.
arXiv Detail & Related papers (2023-02-08T13:04:52Z) - Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks [86.66733026149892]
We propose Uni-Perceiver v2, the first generalist model capable of handling major large-scale vision and vision-language tasks.
Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model.
Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
arXiv Detail & Related papers (2022-11-17T18:59:52Z) - Two-stage LLM Fine-tuning with Less Specialization and More Generalization [93.12197594813378]
We propose Prompt Tuning with MOdel Tuning (ProMoT) to reduce format specialization and improve generalization.
ProMoT offloads task-specific format learning into additional and removable parameters by first doing prompt tuning and then fine-tuning the model itself with this soft prompt.
ProMoT can even enhance generalization on in-context learning tasks that are semantically related to the fine-tuned task.
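A schematic rendering of the two-stage recipe, with the model, training loop, and hyperparameters as placeholders: first learn a soft prompt with the model frozen, then fine-tune the model with that (now frozen) prompt attached, so format learning stays in the removable prefix.

```python
# Schematic two-stage ProMoT-style recipe. `model`, `train_step`, and all
# hyperparameters are placeholders/assumptions.
import torch

def promot(model, soft_prompt_len, d_model, train_step, data):
    # Stage 1: prompt tuning - freeze the model, learn a soft prompt that
    # absorbs task-specific formatting.
    soft_prompt = torch.nn.Parameter(torch.randn(soft_prompt_len, d_model) * 0.02)
    for p in model.parameters():
        p.requires_grad_(False)
    for batch in data:
        train_step(model, batch, prefix=soft_prompt, params=[soft_prompt])

    # Stage 2: fine-tune the model itself with the (now frozen) soft
    # prompt attached; it remains removable after training.
    soft_prompt.requires_grad_(False)
    for p in model.parameters():
        p.requires_grad_(True)
    for batch in data:
        train_step(model, batch, prefix=soft_prompt,
                   params=list(model.parameters()))
    return model, soft_prompt
```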
arXiv Detail & Related papers (2022-11-01T17:56:57Z) - Unsupervised Cross-Task Generalization via Retrieval Augmentation [27.47782160720298]
We propose a retrieval-augmentation method named ReCross that takes a few unlabelled examples as queries to retrieve a small subset of upstream data.
Our empirical results show that the proposed ReCross consistently outperforms non-retrieval baselines by a significant margin.
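The retrieval step can be sketched as nearest-neighbor search in an embedding space: embed the few unlabeled target-task examples and pull the closest upstream training examples. The encoder is assumed to be given, and ReCross's reranking stage is omitted here.

```python
# Rough sketch of retrieval-augmented cross-task generalization:
# unlabeled target examples act as queries over upstream training data.
import numpy as np

def retrieve_upstream(query_embs, upstream_embs, k=100):
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    u = upstream_embs / np.linalg.norm(upstream_embs, axis=1, keepdims=True)
    sims = q @ u.T                        # (num_queries, num_upstream)
    best = sims.max(axis=0)               # best match per upstream example
    return np.argsort(-best)[:k]          # indices of top-k upstream examples
```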
arXiv Detail & Related papers (2022-04-17T06:05:13Z) - Task Adaptive Parameter Sharing for Multi-Task Learning [114.80350786535952]
Task Adaptive Parameter Sharing (TAPS) is a method for tuning a base model to a new task by adaptively modifying a small, task-specific subset of layers.
Compared to other methods, TAPS retains high accuracy on downstream tasks while introducing few task-specific parameters.
We evaluate our method on a suite of fine-tuning tasks and architectures (ResNet, DenseNet, ViT) and show that it achieves state-of-the-art performance while being simple to implement.
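A toy rendering of the layer-selection idea: give each layer a learnable score and a hard gate (with a straight-through gradient) that decides whether the layer gets task-specific weights or reuses the shared base. This is a simplification for illustration, not the authors' implementation.

```python
# Toy sketch of adaptive per-layer task specialization.
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    def __init__(self, base_layer: nn.Linear):
        super().__init__()
        self.base = base_layer                      # shared, frozen weights
        self.task = nn.Linear(base_layer.in_features,
                              base_layer.out_features)
        self.task.load_state_dict(base_layer.state_dict())
        self.score = nn.Parameter(torch.tensor(0.1))  # layer selection score

    def forward(self, x):
        soft = torch.sigmoid(self.score)
        hard = (soft > 0.5).float()
        gate = hard + soft - soft.detach()          # straight-through estimator
        return gate * self.task(x) + (1 - gate) * self.base(x)
```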
arXiv Detail & Related papers (2022-03-30T23:16:07Z) - RuleBert: Teaching Soft Rules to Pre-trained Language Models [21.69870624809201]
We introduce a classification task where, given facts and soft rules, the PLM should return a prediction with a probability for a given hypothesis.
We propose a revised loss function that enables the PLM to learn how to predict precise probabilities for the task.
Our evaluation results show that the resulting fine-tuned models achieve very high performance, even on logical rules that were unseen at training.
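The "revised loss" can be sketched as binary cross-entropy against soft targets: instead of hard 0/1 labels, the model's predicted probability is pushed toward the rule-derived probability of the hypothesis. The head and target values below are illustrative assumptions.

```python
# Sketch of training a PLM head to predict *probabilities* for hypotheses
# given facts and soft rules, via BCE against soft (non-0/1) targets.
import torch
import torch.nn.functional as F

logits = torch.randn(4, requires_grad=True)         # PLM head outputs
soft_targets = torch.tensor([0.9, 0.7, 0.2, 0.55])  # rule-derived probabilities
loss = F.binary_cross_entropy_with_logits(logits, soft_targets)
loss.backward()
```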
arXiv Detail & Related papers (2021-09-24T16:19:25Z)