Related papers: Efficient Single-Pass Training for Multi-Turn Reasoning

Related papers

Behavior-Equivalent Token: Single-Token Replacement for Long Prompts in LLMs [55.827877498548965]
We propose a lightweight training framework that learns a single prompt-specific Behavior-Equivalent token ([BE])<n>The framework first trains [BE] to encode the natural-language content of the original system prompt via reconstruction, and then distills the prompt's downstream behavior into this single token.<n> Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000x reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts.
arXiv Detail & Related papers (2025-11-28T15:22:52Z)
No Compute Left Behind: Rethinking Reasoning and Sampling with Masked Diffusion Models [42.158430350703505]
Masked diffusion language models are trained to in-fill positions in randomly masked sequences.<n>We propose reasoning-as-infilling and multi-token entropy decoding.<n>Our work demonstrates that the training and compute used by MDLMs unlock many new inference and post-training methods.
arXiv Detail & Related papers (2025-10-22T19:41:27Z)
H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers [124.11648300910444]
We present a hierarchical plug-and-play pruning-and-$-recovering framework, called Hierarchical Hourglass Tokenizer (H$_2$OT)<n>Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines.
arXiv Detail & Related papers (2025-09-08T17:59:59Z)
DTRNet: Dynamic Token Routing Network to Reduce Quadratic Costs in Transformers [28.595962720945348]
Transformers achieve state-of-the-art results across many tasks, but their uniform application of quadratic self-attention makes them computationally expensive.<n>We introduce Dynamic Token Routing Network, an improved Transformer architecture that allows tokens to dynamically skip the quadratic cost of cross-token mixing.
arXiv Detail & Related papers (2025-08-31T16:21:21Z)
R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances problem-solving ability of large language models.<n>CoT incurs substantial inference cost due to long autoregressive trajectories.<n>We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z)
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation [50.001816497407475]
We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer.<n>MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking.<n>We also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint.
arXiv Detail & Related papers (2025-07-14T17:49:00Z)
Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks.<n>LRMs need to generate long chain-of-thought reasoning in order to think before answering.<n>We introduce Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens.
arXiv Detail & Related papers (2025-06-16T03:00:40Z)
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraint on every token can be prohibitively expensive.<n> LCD can distort the global distribution over strings, sampling tokens based only on local information.<n>We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence [4.044294432478243]
We leverage the inherent parallelizability of certain tasks to accelerate the reasoning process.<n>Our method achieves up to nearly 100% speedup in decoding while basically maintaining the answer quality.
arXiv Detail & Related papers (2025-03-26T13:28:57Z)
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [48.28847964704554]
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks. We propose a novel approach for continuous-space reasoning that does not require modifying the underlying LLM.
arXiv Detail & Related papers (2025-02-17T18:52:29Z)
Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement [43.548042892597536]
We propose the Zero Token Transformer (ZTT), which features a head-tail decoupled parameter cycling method.<n>We disentangle the first (head) and last (tail) layers from parameter cycling and iteratively refine only the intermediate layers.<n>Our approach achieves superior performance under tight parameter budgets, effectively reduces computational overhead via early exits, and can be readily applied to fine-tune existing pre-trained models.
arXiv Detail & Related papers (2025-02-17T04:37:22Z)
ST$^3$: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming [14.937905258757635]
$textbfST3$ is a framework designed to accelerate MLLM inference without retraining.<n>$textbfST3$ can be seamlessly integrated into existing pre-trained MLLMs.
arXiv Detail & Related papers (2024-12-28T10:17:29Z)
QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tune a small pretrained language model to generate optimal prompts tailored to the input queries. We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks. Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-08-20T03:06:48Z)
Prompt Recursive Search: A Living Framework with Adaptive Growth in LLM Auto-Prompting [22.025533583703126]
We propose a novel Prompt Recursive Search (PRS) framework for large language models (LLMs) PRS framework incorporates an assessment of problem complexity and an adjustable structure, ensuring a reduction in the likelihood of errors. Compared to the Chain of Thought (CoT) method, the PRS method has increased the accuracy on the BBH dataset by 8% using Llama3-7B model, achieving a 22% improvement.
arXiv Detail & Related papers (2024-08-02T17:59:42Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs) We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z)
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
PaSS: Parallel Speculative Sampling [29.23180061749074]
Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and thus reading the full set of parameters from memory. We show promising performance (up to $30%$ speed-up) while requiring only as few as $O(d_emb)$ additional parameters.
arXiv Detail & Related papers (2023-11-22T18:37:27Z)
Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions. We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks. In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs.
arXiv Detail & Related papers (2023-05-22T15:36:06Z)
SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs) We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer. We evaluate SATLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
arXiv Detail & Related papers (2023-05-16T17:55:51Z)
Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer. It accurately predicts the number of output tokens and extract hidden variables. It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
Learning to Recover from Multi-Modality Errors for Non-Autoregressive Neural Machine Translation [38.123025955523836]
Non-autoregressive neural machine translation (NAT) predicts the entire target sequence simultaneously and significantly accelerates inference process. We propose a novel semi-autoregressive model RecoverSAT, which generates a translation as a sequence of segments. By dynamically determining segment length and repetitive deleting segments, RecoverSAT is capable of recovering from repetitive and missing token errors. Experimental results on three widely-used benchmark datasets show that our proposed model achieves more than 4$times$ speedup while maintaining comparable performance compared with the corresponding autoregressive model.
arXiv Detail & Related papers (2020-06-09T10:12:16Z)
Masking as an Efficient Alternative to Finetuning for Pretrained Language Models [49.64561153284428]
We learn selective binary masks for pretrained weights in lieu of modifying them through finetuning. In intrinsic evaluations, we show that representations computed by masked language models encode information necessary for solving downstream tasks.
arXiv Detail & Related papers (2020-04-26T15:03:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.