Fugu-MT 論文翻訳(概要): Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

論文の概要: Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

arxiv url: http://arxiv.org/abs/2604.02315v2
Date: Fri, 03 Apr 2026 01:55:25 GMT
ステータス: 翻訳完了
システム内更新日: 2026-04-06 12:42:34.360332
Title: Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
Title（参考訳）: アシスタントターンを超えて:言語モデルにおけるインタラクション認識のプローブとしてのユーザターン生成
Authors: Sarath Shekkizhar, Romain Cosentino, Adam Earle,
Abstract要約: ユーザ・ターン・ジェネレーションはLLMの振る舞いやインタラクション・アウェアネスの次元を捉えており、現在のアシスタント・オンリー・ベンチマークでは探索されていない。この結果から,ユーザターン生成はLLMの振る舞いやインタラクションの認識の次元を捉えていることがわかった。
参考スコア（独自算出の注目度）: 3.9351446512514947
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context of user query and assistant response, we let a model generate under the user role. If the model's weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context. Through experiments across $11$ open-weight LLMs (Qwen3.5, gpt-oss, GLM) and $5$ datasets (math reasoning, instruction following, conversation), we show that interaction awareness is decoupled from task accuracy. In particular, within the Qwen3.5 family, GSM8K accuracy scales from $41\%$ ($0.8$B) to $96.8\%$ ($397$B-A$17$B), yet genuine follow-up rates under deterministic generation remain near zero. In contrast, higher temperature sampling reveals interaction awareness is latent with follow up rates reaching $22\%$. Controlled perturbations validate that the proposed probe measures a real property of the model, and collaboration-oriented post-training on Qwen3.5-2B demonstrates an increase in follow-up rates. Our results show that user-turn generation captures a dimension of LLM behavior, interaction awareness, that is unexplored and invisible with current assistant-only benchmarks.
Abstract（参考訳）: 標準LCMベンチマークでは、モデルが入力に対する応答を生成し、検証者が正確性をスコアし、分析が終了する。このパラダイムは、LLMがアシスタント応答に続くものに対する認識を符号化するかどうかを未測定のまま残している。このギャップの探索として,ユーザ・ターン生成を提案する。ユーザ・クエリとアシスタント・レスポンスの会話コンテキストを考慮すれば,ユーザ・ロールの下でモデルを生成することができる。モデルの重みが相互作用の認識を符号化するならば、生成されたユーザターンは、前回のコンテキストに反応する根拠付きフォローアップになります。オープンウェイト LLM (Qwen3.5, gpt-oss, GLM) と 5$ データセット (仮推論, 後続命令, 会話) を対象とした実験により, インタラクションの認識がタスクの正確性から切り離されていることを示す。特にQwen3.5ファミリーでは、GSM8Kの精度は$41\%(0.8$B)から$96.8\%(397$B-A-17$B)までスケールするが、決定論的生成の下での真のフォローアップレートはゼロに近いままである。対照的に、高温サンプリングでは、相互作用の意識が潜んでいることが示され、フォローアップレートは22\%ドルに達した。制御された摂動は,提案したプローブがモデルの実際の特性を測定し,Qwen3.5-2B上での協調指向のポストトレーニングは追従率の増加を示す。この結果から,ユーザターン生成はLLMの振る舞いやインタラクションの認識の次元を捉えていることがわかった。

論文の概要: Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

関連論文リスト