Fugu-MT Paper Translation (Abstract): TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?

Paper overview: TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?

arxiv url: http://arxiv.org/abs/2603.19558v1
Date: Fri, 20 Mar 2026 01:53:12 GMT
Status: Translation complete
Last updated in system: 2026-03-23 19:48:38.935744
Title: TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?
Authors: Xinyu Guo, Yazhou Zhang, Jing Qin
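Abstract summary: Eliciting explicit, step-by-step reasoning traces from large language models has emerged as a dominant paradigm for enhancing model capabilities. The paper introduces TextReasoningBench to evaluate the effectiveness and efficiency of reasoning strategies for text classification.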
Reference score (independently computed attention score): 14.53953450023902
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Eliciting explicit, step-by-step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi-step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification tasks remains largely underexplored, especially considering their substantial token and time costs. To fill this gap, we introduce TextReasoningBench, a systematic benchmark designed to evaluate the effectiveness and efficiency of reasoning strategies for text classification with LLMs. We compare seven reasoning strategies, namely IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT across ten LLMs on five text classification datasets. Beyond traditional metrics such as accuracy and macro-F1, we introduce two cost-aware evaluation metrics that quantify the performance gain per reasoning token and the efficiency of performance improvement relative to token cost growth. Experimental results reveal three notable findings: (1) Reasoning does not universally improve classification performance: while moderate strategies such as CoT and SC-CoT yield consistent but limited gains (typically +1% to +3% on big models), more complex methods (e.g., ToT and GoT) often fail to outperform simpler baselines and can even degrade performance, especially on small models; (2) Reasoning is often inefficient: many reasoning strategies increase token consumption by 10$\times$ to 100$\times$ (e.g., SC-CoT and ToT) while providing only marginal performance improvements.
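The abstract names two cost-aware metrics but does not give their formulas. The sketch below is one plausible reading, not the paper's definitions: `gain_per_token` divides the accuracy gain over a baseline by the extra generated tokens, and `relative_efficiency` divides the relative accuracy change by the relative token growth. The `RunResult` container, both function names, and the sample numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Aggregate result of one strategy on one dataset (hypothetical container)."""
    accuracy: float    # fraction in [0, 1]
    avg_tokens: float  # mean generated tokens per example


def gain_per_token(baseline: RunResult, strategy: RunResult) -> float:
    """Accuracy gain over the baseline per extra generated token (assumed definition)."""
    extra_tokens = strategy.avg_tokens - baseline.avg_tokens
    if extra_tokens <= 0:
        raise ValueError("strategy must use more tokens than the baseline")
    return (strategy.accuracy - baseline.accuracy) / extra_tokens


def relative_efficiency(baseline: RunResult, strategy: RunResult) -> float:
    """Relative accuracy change divided by relative token growth (assumed definition)."""
    rel_cost = (strategy.avg_tokens - baseline.avg_tokens) / baseline.avg_tokens
    if rel_cost <= 0:
        raise ValueError("strategy must use more tokens than the baseline")
    rel_gain = (strategy.accuracy - baseline.accuracy) / baseline.accuracy
    return rel_gain / rel_cost


# Illustrative numbers only, not results from the paper.
io = RunResult(accuracy=0.80, avg_tokens=10.0)    # direct input-output prompting
cot = RunResult(accuracy=0.82, avg_tokens=200.0)  # chain-of-thought prompting

print(f"gain per extra token: {gain_per_token(io, cot):.6f}")
print(f"relative efficiency:  {relative_efficiency(io, cot):.6f}")
```

Under these made-up numbers, a 2-point accuracy gain costs 190 extra tokens per example, so both ratios come out small. That is the kind of trade-off the abstract describes when it says reasoning is often inefficient.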
