Fugu-MT paper translation (abstract): Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

Paper overview: Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

arxiv url: http://arxiv.org/abs/2604.09368v1
Date: Fri, 10 Apr 2026 14:38:45 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.910396
Title: Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation
Title (reference translation): Through Their Eyes: Fixation-Aligned Tuning for Personalized User Emulation
Authors: Lingfeng Huang, Huizhong Guo, Tianjun Wei, Yingpeng Du, Zhu Sun,
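Abstract summary: We investigate whether aligning a vision-language model's (VLM's) visual attention with user-specific gaze patterns improves simulation fidelity. The proposed method first probes the VLM's internal visual attention via interpretability operators to obtain a slot-level relevance distribution. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones show consistent improvements in both attention alignment and click prediction accuracy.
Reference score (independently computed attention metric): 11.879346281714453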
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than the visual interfaces real users browse. This is a critical gap, since attention over recommendation layouts is both visually driven and highly personalized. We investigate whether aligning a vision-language model's (VLM's) visual attention with user-specific gaze patterns can improve simulation fidelity. Analysis of a real-world eye-tracking dataset collected in a carousel-based recommendation setting reveals that users exhibit stable individual gaze patterns strongly predictive of click behavior. Building on this finding, we propose Fixation-Aligned Tuning for user Emulation (FixATE). Our approach first probes the VLM's internal visual attention via interpretability operators to obtain a slot-level relevance distribution comparable with human fixation, and then learns personalized soft prompts to steer the model's attention toward each user's characteristic fixation pattern. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones demonstrate consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model "see like the user" is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.
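Abstract (reference translation): Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. However, existing simulators perceive recommendations through text or structured metadata rather than through visual interfaces. We investigate whether aligning a vision-language model's (VLM's) visual attention with user-specific gaze patterns improves simulation fidelity. Analysis of a real-world eye-tracking dataset collected in a carousel-based recommendation setting shows that users exhibit stable individual gaze patterns that strongly predict click behavior. Building on this finding, we propose Fixation-Aligned Tuning for user Emulation (FixATE). The proposed method first probes the VLM's internal visual attention via interpretability operators to obtain a slot-level relevance distribution comparable with human fixation, and then learns personalized soft prompts that steer the model's attention toward each user's characteristic fixation pattern. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones show consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model "see like the user" is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.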
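
Illustrative sketch (not from the paper): the abstract describes pooling a VLM's visual attention into a slot-level relevance distribution that can be compared with human fixation. The Python below shows one way such a comparison might look. Everything in it is an assumption made for illustration: the function names, the patch-to-slot mapping, the use of dwell time as the fixation signal, and the choice of KL divergence as the alignment measure. The abstract does not state which loss FixATE uses, and the soft-prompt tuning step is not shown.

```python
import math


def slot_relevance(patch_scores, slot_of_patch, num_slots):
    """Pool patch-level relevance into a slot-level probability distribution.

    patch_scores: hypothetical relevance values, one per image patch.
    slot_of_patch: carousel slot index for each patch, or None for background.
    """
    totals = [0.0] * num_slots
    for score, slot in zip(patch_scores, slot_of_patch):
        if slot is not None:
            totals[slot] += max(score, 0.0)  # ignore negative relevance
    mass = sum(totals)
    if mass == 0.0:
        return [1.0 / num_slots] * num_slots  # fall back to uniform
    return [t / mass for t in totals]


def normalize(values):
    """Turn non-negative values (e.g. dwell times) into a distribution."""
    total = sum(values)
    if total == 0.0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) with additive smoothing; an assumed alignment measure."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Toy data: five patches, three slots, one background patch.
patch_scores = [0.2, 0.1, 0.4, 0.3, 0.5]
slot_of_patch = [0, 0, 1, 2, None]
model_dist = slot_relevance(patch_scores, slot_of_patch, num_slots=3)

# Hypothetical per-slot dwell times in milliseconds for one user.
human_dist = normalize([300.0, 500.0, 200.0])

gap = kl_divergence(human_dist, model_dist)
print("model slot distribution:", model_dist)
print("human fixation distribution:", human_dist)
print("alignment gap (KL):", gap)
```

In a setup like the one the abstract outlines, a quantity of this kind could serve as the signal that personalized soft prompts are tuned to reduce, with one prompt learned per user.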

Related papers