Related papers: Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning

URL: http://arxiv.org/abs/2412.14780v1
Date: Thu, 19 Dec 2024 12:06:24 GMT
Title: Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning
Authors: Ziang Ye, Zhenru Zhang, Yang Zhang, Jianxin Ma, Junyang Lin, Fuli Feng
Abstract summary: We argue that tokens serving different roles - specifically, reasoning tokens versus boilerplate tokens - differ significantly in importance and learning complexity. We propose a novel Shuffle-Aware Discriminator (SHAD) for adaptive token discrimination. Using SHAD, we propose the Reasoning-highlighted Fine-Tuning (RFT) method, which adaptively emphasizes reasoning tokens during fine-tuning.
Score: 46.43130011147807
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: When using agent-task datasets to enhance agent capabilities for Large Language Models (LLMs), current methodologies often treat all tokens within a sample equally. However, we argue that tokens serving different roles - specifically, reasoning tokens versus boilerplate tokens (e.g., those governing output format) - differ significantly in importance and learning complexity, necessitating their disentanglement and distinct treatment. To address this, we propose a novel Shuffle-Aware Discriminator (SHAD) for adaptive token discrimination. SHAD classifies tokens by exploiting predictability differences observed after shuffling input-output combinations across samples: boilerplate tokens, due to their repetitive nature among samples, maintain predictability, whereas reasoning tokens do not. Using SHAD, we propose the Reasoning-highlighted Fine-Tuning (RFT) method, which adaptively emphasizes reasoning tokens during fine-tuning, yielding notable performance gains over common Supervised Fine-Tuning (SFT).
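
The idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that per-token losses are already available from a base model and from a model tuned on shuffled input-output pairs, and the names `shuffle_pairs`, `classify_tokens`, `reasoning_weights`, and `weighted_loss`, the zero threshold, and the softmax weighting are all hypothetical choices made for illustration.

```python
import math


def shuffle_pairs(inputs, outputs, offset=1):
    """Pair each input with another sample's output by rotating the outputs.

    Assumption: rotation stands in for whatever shuffling scheme is used.
    """
    n = len(outputs)
    return [(inputs[i], outputs[(i + offset) % n]) for i in range(n)]


def classify_tokens(base_losses, shuffled_losses, threshold=0.0):
    """Label each token by how its loss changes after tuning on shuffled pairs.

    A token whose loss drops (it stays predictable without the matching
    input) is labeled boilerplate; otherwise it is labeled reasoning.
    The threshold of 0.0 is an illustrative assumption.
    """
    labels = []
    for base, shuffled in zip(base_losses, shuffled_losses):
        labels.append("boilerplate" if shuffled - base < threshold else "reasoning")
    return labels


def reasoning_weights(base_losses, shuffled_losses, temperature=1.0):
    """Softmax over loss differences, so tokens that stay hard get more weight.

    Assumption: a softmax is one plausible way to emphasize reasoning tokens.
    """
    diffs = [(s - b) / temperature for b, s in zip(base_losses, shuffled_losses)]
    peak = max(diffs)  # subtract the maximum for numerical stability
    exps = [math.exp(d - peak) for d in diffs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_loss(token_losses, weights):
    """Weighted sum of per-token losses, replacing the uniform mean of plain SFT."""
    return sum(w * loss for w, loss in zip(weights, token_losses))


# Hypothetical per-token losses for a four-token output.
base = [2.0, 2.0, 2.0, 2.0]
shuffled = [0.5, 2.1, 0.4, 2.3]
labels = classify_tokens(base, shuffled)
weights = reasoning_weights(base, shuffled)
loss = weighted_loss(base, weights)
```

In this toy input the first and third tokens become easier after shuffled tuning and are labeled boilerplate, while the second and fourth receive the larger weights in the loss.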

Related papers

Decomposing Reasoning Efficiency in Large Language Models [2.4149105714758545]
We decompose token efficiency into interpretable factors: completion under a fixed token budget, conditional correctness given completion, and verbosity. When reasoning traces are available, we add deterministic trace-quality measures to separate looping from verbose-but-engaged reasoning. Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.
arXiv Detail & Related papers (2026-02-10T14:09:18Z)
Frontend Token Enhancement for Token-Based Speech Recognition [50.35062963870211]
Discretized representations of speech signals are efficient alternatives to continuous features for speech recognition applications. In this work, we introduce a system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/token domains: wave-to-wave, token-to-output, continuous SSL features-to-token, and wave-to-token.
arXiv Detail & Related papers (2026-02-04T05:02:15Z)
From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature [38.46122853450324]
Existing algorithms apply uniform optimization to all tokens, ignoring their different roles in the reasoning process. We introduce Heterogeneous Adaptive Policy Optimization (HAPO), a token-aware algorithm that dynamically adapts optimization based on token entropy.
arXiv Detail & Related papers (2025-09-20T09:30:25Z)
Benchmarking Prosody Encoding in Discrete Speech Tokens [13.60092490447892]
This study focuses on prosodic encoding based on their sensitivity to the artificially modified prosody, aiming to provide practical guidelines for designing discrete tokens. In particular, speech language models are expected to understand and generate responses that reflect not only the semantic content but also prosodic features.
arXiv Detail & Related papers (2025-08-15T05:11:16Z)
A Variational Framework for Improving Naturalness in Generative Spoken Language Models [52.673912922590866]
We propose an end-to-end variational approach that automatically learns to encode continuous speech attributes to enhance semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. It produces preferred speech continuations according to human raters.
arXiv Detail & Related papers (2025-06-17T17:58:17Z)
IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation [70.2753541780788]
We introduce an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that integrates token decisiveness into both tuning and decoding. IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines.
arXiv Detail & Related papers (2025-06-16T08:28:19Z)
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [44.84219266082269]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z)
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator [65.62084602011596]
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. We have identified a key pattern: certain seemingly meaningless separator tokens (i.e., punctuations) contribute disproportionately to attention scores compared to semantically meaningful tokens. We introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens.
arXiv Detail & Related papers (2024-12-16T18:58:57Z)
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [55.29624228206882]
Large Language Models (LLMs) have exhibited remarkable performance on reasoning tasks. LLMs tend to produce positive outcomes when forced to decode other tokens instead of critical tokens. We propose a novel approach - cDPO - designed to automatically recognize and conduct token-level rewards for the critical tokens.
arXiv Detail & Related papers (2024-11-29T18:58:22Z)
Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs). This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts. We introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution.
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks [13.600674179059238]
We propose a flexible objective termed SparsePO to automatically learn to weight the KL divergence and reward corresponding to each token during Preference Optimization training. We show that our approach assigns meaningful weights to tokens according to the target task, generates more responses with the desired preference and improves reasoning tasks by up to 2 percentage points compared to other token- and response-level PO methods.
arXiv Detail & Related papers (2024-10-07T15:01:29Z)
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP). SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings. Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models [30.972412126012884]
Chain-of-thought responses from language models improve performance across most benchmarks. We show that transformers can use meaningless filler tokens in place of a chain of thought to solve two hard algorithmic tasks. We find that learning to use filler tokens is difficult and requires specific, dense supervision to converge.
arXiv Detail & Related papers (2024-04-24T09:30:00Z)
Unlocking the Transferability of Tokens in Deep Models for Tabular Data [67.11727608815636]
Fine-tuning a pre-trained deep neural network has become a successful paradigm in various machine learning tasks. In this paper, we propose TabToken, a method that aims to enhance the quality of feature tokens. We introduce a contrastive objective that regularizes the tokens, capturing the semantics within and across features.
arXiv Detail & Related papers (2023-10-23T17:53:09Z)
Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.