Fugu-MT paper translation (abstract): TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs

Paper overview: TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs

arxiv url: http://arxiv.org/abs/2511.13223v1
Date: Mon, 17 Nov 2025 10:38:56 GMT
Status: Translation complete
System update date: 2025-11-18 14:36:25.13156
Title: TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs
Title (reference translation): TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs
Authors: Yuxiang Zhang, Zhengxu Yu, Weihang Pan, Zhongming Jin, Qiang Fu, Deng Cai, Binbin Lin, Jieping Ye
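Abstract summary: TokenSqueeze is a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. The authors show that TokenSqueeze reduces token usage while maintaining accuracy on the MATH500 benchmark.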
Reference score (independently computed attention score): 57.217593337454026
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Emerging reasoning LLMs such as OpenAI-o1 and DeepSeek-R1 have achieved strong performance on complex reasoning tasks by generating long chain-of-thought (CoT) traces. However, these long CoTs result in increased token usage, leading to higher inference latency and memory consumption. As a result, balancing accuracy and reasoning efficiency has become essential for deploying reasoning LLMs in practical applications. Existing long-to-short (Long2Short) methods aim to reduce inference length but often sacrifice accuracy, revealing a need for an approach that maintains performance while lowering token costs. To address this efficiency-accuracy tradeoff, we propose TokenSqueeze, a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. First, to prevent performance degradation caused by excessive compression of reasoning depth, we propose to select self-generated samples whose reasoning depth is adaptively matched to the complexity of the problem. To further optimize the linguistic expression without altering the underlying reasoning paths, we introduce a distribution-aligned linguistic refinement method that enhances the clarity and conciseness of the reasoning path while preserving its logical integrity. Comprehensive experimental results demonstrate the effectiveness of TokenSqueeze in reducing token usage while maintaining accuracy. Notably, DeepSeek-R1-Distill-Qwen-7B fine-tuned using our proposed method achieved a 50% average token reduction while preserving accuracy on the MATH500 benchmark. TokenSqueeze exclusively utilizes the model's self-generated data, enabling efficient and high-fidelity reasoning without relying on manually curated short-answer datasets across diverse applications. Our code is available at https://github.com/zhangyx1122/TokenSqueeze.
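Abstract (reference translation): New reasoning LLMs such as OpenAI-o1 and DeepSeek-R1 achieve strong performance on complex reasoning tasks by generating long chain-of-thought (CoT) traces. However, these long CoTs increase token usage, which raises inference latency and memory consumption. As a result, balancing accuracy and reasoning efficiency has become essential for putting reasoning LLMs into practical use. Existing long-to-short (Long2Short) methods aim to reduce inference length but often sacrifice accuracy, revealing the need for an approach that maintains performance while lowering token costs. To address this efficiency-accuracy tradeoff, we propose TokenSqueeze, a novel Long2Short method that condenses reasoning paths while preserving performance and relying only on self-generated data. First, to prevent the performance degradation caused by excessive compression of reasoning depth, we propose selecting self-generated samples whose reasoning depth is adaptively matched to the complexity of the problem. To further optimize the linguistic expression without altering the underlying reasoning paths, we introduce a distribution-aligned linguistic refinement method that improves the clarity and conciseness of the reasoning path while preserving its logical integrity. Comprehensive experimental results demonstrate that TokenSqueeze reduces token usage while maintaining accuracy. In particular, DeepSeek-R1-Distill-Qwen-7B fine-tuned with the proposed method achieved a 50% average token reduction while preserving accuracy on the MATH500 benchmark. TokenSqueeze uses only the model's self-generated data, enabling efficient, high-fidelity reasoning across diverse applications without relying on manually curated short-answer datasets. Our code is available at https://github.com/zhangyx1122/TokenSqueeze.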
