Training with Pseudo-Code for Instruction Following
- URL: http://arxiv.org/abs/2505.18011v1
- Date: Fri, 23 May 2025 15:14:29 GMT
- Title: Training with Pseudo-Code for Instruction Following
- Authors: Prince Kumar, Rudra Murthy, Riyaz Bhat, Danish Contractor
- Abstract summary: We take inspiration from recent work suggesting that models may follow instructions better when they are expressed in pseudo-code. We propose fine-tuning Large Language Models with instruction-tuning data that additionally includes instructions re-expressed in pseudo-code. We conduct rigorous experiments with 5 different models and find that not only do models follow instructions better when trained with pseudo-code, they also retain their capabilities on other tasks related to mathematical and common-sense reasoning.
- Score: 4.7188893422904
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite the rapid progress in the capabilities of Large Language Models (LLMs), they continue to have difficulty following relatively simple, unambiguous instructions, especially when compositions are involved. In this paper, we take inspiration from recent work suggesting that models may follow instructions better when they are expressed in pseudo-code. However, writing pseudo-code programs can be tedious, and using few-shot demonstrations to craft code representations for use at inference time can be unnatural for non-expert users of LLMs. To overcome these limitations, we propose fine-tuning LLMs with instruction-tuning data that additionally includes instructions re-expressed in pseudo-code along with the final response. We evaluate models trained using our method on 11 publicly available benchmarks comprising tasks related to instruction-following, mathematics, and common-sense reasoning. We conduct rigorous experiments with 5 different models and find that not only do models follow instructions better when trained with pseudo-code, they also retain their capabilities on other tasks related to mathematical and common-sense reasoning. Specifically, we observe a relative gain of 3-19% on instruction-following benchmarks and an average gain of up to 14% across all tasks.
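To make the data recipe concrete, here is a minimal sketch of how a pseudo-code-augmented training example might be assembled. The schema and the `instruction_to_pseudocode` helper are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
# Sketch: building an instruction-tuning example whose target contains
# both a pseudo-code re-expression of the instruction and the final
# response, so no pseudo-code is needed from the user at inference time.
# Field names and the helper below are hypothetical.

def instruction_to_pseudocode(instruction: str) -> str:
    """Placeholder for the step that re-expresses a natural-language
    instruction as pseudo-code (e.g., via a model or annotation pass)."""
    # Hard-coded for the single demo instruction below.
    return (
        "words = find_rhymes('cat', n=3)\n"
        "for w in words:\n"
        "    print(w.upper())"
    )

def build_training_example(instruction: str, response: str) -> dict:
    return {
        "prompt": instruction,
        "target": (
            "Pseudo-code:\n" + instruction_to_pseudocode(instruction)
            + "\n\nResponse:\n" + response
        ),
    }

example = build_training_example(
    "List three rhyming words for 'cat', in uppercase.",
    "BAT\nHAT\nMAT",
)
print(example["target"])
```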
Related papers
- CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL [28.43882967593511]
Code large language models (LLMs) have become indispensable tools for building efficient and automated coding pipelines. Existing models are typically post-trained using reinforcement learning (RL) from general-purpose LLMs using "human instruction-final answer" pairs. We propose CodeBoost, a framework that enhances code LLMs purely from code snippets, without relying on human-annotated instructions.
arXiv Detail & Related papers (2025-08-07T10:31:24Z) - How Many Instructions Can LLMs Follow at Once? [0.16874375111244325]
We introduce IFScale, a simple benchmark of 500 keyword-inclusion instructions for a business-report writing task, to measure how instruction-following performance degrades as instruction density increases. We evaluate 20 state-of-the-art models across seven major providers and find that even the best frontier models achieve only 68% accuracy at the maximum density of 500 instructions. Our insights can help inform the design of instruction-dense prompts in real-world applications and highlight important performance-latency tradeoffs.
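As a rough illustration of the scoring a keyword-inclusion benchmark like this implies (the exact matching rules of IFScale may differ), accuracy can be computed as the fraction of required keywords that appear in the generated report:

```python
# Hedged sketch of keyword-inclusion scoring; case-insensitive substring
# matching is an assumption, not necessarily the benchmark's rule.

def keyword_inclusion_accuracy(report: str, keywords: list[str]) -> float:
    text = report.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)

report = "Q3 revenue grew, driven by cloud adoption and strong retention."
keywords = ["revenue", "cloud", "retention", "churn"]
print(f"{keyword_inclusion_accuracy(report, keywords):.0%}")  # 75%
```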
arXiv Detail & Related papers (2025-07-15T17:59:42Z) - Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [68.6543680065379]
Large language models (LLMs) are vulnerable to prompt injection attacks. We propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-04-29T07:13:53Z) - LLMs can be easily Confused by Instructional Distractions [16.060402139507644]
Large language models show exceptional skill in instruction-following tasks. This strength can turn into a vulnerability when the models are required to disregard certain instructions. We introduce a novel benchmark, named DIM-Bench, specifically designed to assess LLMs' performance under instructional distraction.
arXiv Detail & Related papers (2025-02-05T04:52:57Z) - The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities [51.594836904623534]
We investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. We show that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve.
arXiv Detail & Related papers (2025-01-15T10:57:55Z) - Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks [4.945902994386117]
We study the interaction between knowledge and instruction following, and observe that LLMs struggle to follow simple answer-modifying instructions. We apply a set of simple instructions that involve manipulating text and numeric quantities, operating on lists, and handling distractor instructions.
arXiv Detail & Related papers (2024-10-16T19:07:37Z) - From Symbolic Tasks to Code Generation: Diversification Yields Better Task Performers [1.6958018695660049]
We show that a more diverse instruction set, extending beyond code-related tasks, improves the performance of code generation.
Our observations suggest that a more diverse semantic space for instruction-tuning sets greatly improves the model's ability to follow instructions and perform tasks.
arXiv Detail & Related papers (2024-05-30T07:54:07Z) - Dual Instruction Tuning with Large Language Models for Mathematical Reasoning [26.00472810721806]
We propose a dual instruction tuning strategy to model mathematical reasoning from both forward and reverse directions.
This involves introducing the Intermediate Reasoning State Prediction task (forward reasoning) and the Instruction Reconstruction task (reverse reasoning) to enhance the LLMs' understanding and execution of instructions.
Comprehensive experiments validate the effectiveness and domain generalization of the dual instruction tuning strategy across various mathematical reasoning tasks.
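As a rough illustration of the two auxiliary tasks named above, the sketch below builds forward and reverse training pairs from a (question, reasoning steps, answer) triple; the templates are a guess at the idea, not the paper's actual prompts.

```python
# Hypothetical construction of dual instruction tuning examples.

def forward_example(question: str, steps: list[str]) -> dict:
    # Intermediate Reasoning State Prediction: given the question and a
    # partial chain of reasoning, predict the next intermediate step.
    return {
        "input": question + "\nSteps so far:\n" + "\n".join(steps[:-1]),
        "target": steps[-1],
    }

def reverse_example(question: str, steps: list[str], answer: str) -> dict:
    # Instruction Reconstruction: given the reasoning and the answer,
    # recover the original instruction/question.
    return {
        "input": "Reasoning:\n" + "\n".join(steps) + "\nAnswer: " + answer,
        "target": question,
    }

q = "A shirt costs $20 and is discounted 25%. What is the sale price?"
steps = ["discount = 20 * 0.25 = 5", "price = 20 - 5 = 15"]
print(forward_example(q, steps))
print(reverse_example(q, steps, "15"))
```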
arXiv Detail & Related papers (2024-03-27T06:43:58Z) - CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model [121.23360004498893]
We present a benchmark, namely Continual Instruction tuNing (CoIN), to assess existing MLLMs in the sequential instruction tuning paradigm.
Experiments on CoIN demonstrate that current powerful MLLMs still suffer catastrophic forgetting.
We introduce MoELoRA to MLLMs, which is effective in retaining previous instruction alignment.
arXiv Detail & Related papers (2024-03-13T08:54:31Z) - From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes.
The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models.
Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z) - Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [92.85265959892115]
This paper introduces the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction.
Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers.
To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts.
arXiv Detail & Related papers (2023-06-26T10:26:33Z) - Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting [55.15697111170836]
This paper reveals the behaviors of large language models (LLMs) towards inductive instructions and enhances their truthfulness and helpfulness accordingly.
After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions.
We identify that different inductive styles affect the models' ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model's performance.
arXiv Detail & Related papers (2023-05-23T06:38:20Z)