Fugu-MT paper translation (abstract): GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding

Paper overview: GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding

arxiv url: http://arxiv.org/abs/2603.25841v1
Date: Thu, 26 Mar 2026 19:03:49 GMT
Status: translation complete
Last updated in system: 2026-03-30 21:49:48.243304
Title: GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding
Title (reference translation): GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding
Authors: Trong Thang Pham, Hien Nguyen, Ngan Le
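Abstract summary: Current multimodal large language models (MLLMs) cannot effectively use eye-gaze information for video understanding. This paper introduces GazeQwen, a parameter-efficient approach that equips an open-source MLLM with gaze awareness through hidden-state modulation.
Reference score (independently computed attention score): 11.055155778097033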
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Current multimodal large language models (MLLMs) cannot effectively utilize eye-gaze information for video understanding, even when gaze cues are supplied via visual overlays or text descriptions. We introduce GazeQwen, a parameter-efficient approach that equips an open-source MLLM with gaze awareness through hidden-state modulation. At its core is a compact gaze resampler (~1-5M trainable parameters) that encodes V-JEPA 2.1 video features together with fixation-derived positional encodings and produces additive residuals injected into selected LLM decoder layers via forward hooks. An optional second training stage adds low-rank adapters (LoRA) to the LLM for tighter integration. Evaluated on all 10 tasks of the StreamGaze benchmark, GazeQwen reaches 63.9% accuracy, a +16.1 point gain over the same Qwen2.5-VL-7B backbone with gaze as visual prompts and +10.5 points over GPT-4o, the highest score among all open-source and proprietary models tested. These results suggest that learning where to inject gaze within an LLM is more effective than scaling model size or engineering better prompts. All code and checkpoints are available at https://github.com/phamtrongthang123/gazeqwen.
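Abstract (reference translation): Current multimodal large language models (MLLMs) cannot effectively use eye-gaze information for video understanding, even when gaze cues are supplied through visual overlays or text descriptions. This paper introduces GazeQwen, a parameter-efficient method that gives an open-source MLLM gaze awareness through hidden-state modulation. Its core is a compact gaze resampler (about 1-5M trainable parameters) that encodes V-JEPA 2.1 video features together with fixation-derived positional encodings and produces additive residuals, which are injected into selected LLM decoder layers via forward hooks. An optional second training stage adds low-rank adapters (LoRA) to the LLM for tighter integration. Evaluated on all 10 tasks of the StreamGaze benchmark, GazeQwen reaches 63.9% accuracy, +16.1 points over the same Qwen2.5-VL-7B backbone with gaze supplied as visual prompts and +10.5 points over GPT-4o. These results suggest that learning where to inject gaze within an LLM is more effective than scaling up model size or engineering better prompts. All code and checkpoints are available at https://github.com/phamtrongthang123/gazeqwen.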
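
Illustrative sketch (not the authors' code): the abstract describes additive residuals injected into selected decoder layers via forward hooks. The PyTorch example below shows only that general mechanism on a toy model. ToyDecoderLayer, ResidualInjector, attach_injectors, gaze_to_residual, the layer indices, and all tensor sizes are hypothetical stand-ins and do not come from the GazeQwen repository. It assumes each layer returns a plain tensor; decoder layers in real LLM implementations may return tuples, in which case the hook would have to modify the hidden-state element instead.

```python
import torch
import torch.nn as nn


class ToyDecoderLayer(nn.Module):
    """Hypothetical stand-in for one frozen LLM decoder layer."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + torch.tanh(self.proj(hidden))


class ResidualInjector:
    """Forward hook that adds a stored residual to a layer's output."""

    def __init__(self) -> None:
        self.residual = None  # set by the caller before a forward pass

    def __call__(self, module, inputs, output):
        if self.residual is None:
            return None  # returning None keeps the original output
        return output + self.residual  # a returned value replaces the output


def attach_injectors(layers, selected):
    """Register one injector on each selected layer index."""
    injectors, handles = {}, []
    for idx in selected:
        injector = ResidualInjector()
        handles.append(layers[idx].register_forward_hook(injector))
        injectors[idx] = injector
    return injectors, handles


def run(layers, hidden):
    """Pass hidden states through every layer in order."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden


torch.manual_seed(0)
dim, seq_len, gaze_dim = 16, 4, 8  # toy sizes, chosen arbitrarily
layers = nn.ModuleList([ToyDecoderLayer(dim) for _ in range(4)])
for param in layers.parameters():
    param.requires_grad_(False)  # the backbone stays frozen

# Hypothetical tiny module standing in for the gaze resampler.
gaze_to_residual = nn.Linear(gaze_dim, dim)

hidden_in = torch.randn(1, seq_len, dim)
baseline = run(layers, hidden_in)

# Inject the same residual into layers 1 and 3 (indices are arbitrary here).
injectors, handles = attach_injectors(layers, selected=[1, 3])
gaze_features = torch.randn(1, seq_len, gaze_dim)
for injector in injectors.values():
    injector.residual = gaze_to_residual(gaze_features)
modulated = run(layers, hidden_in)

# Removing the hooks restores the unmodified backbone behavior.
for handle in handles:
    handle.remove()
restored = run(layers, hidden_in)
```

In this sketch only gaze_to_residual has trainable parameters, which mirrors the parameter-efficient setup the abstract describes; the backbone weights are untouched and the hooks can be detached at any time.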


Related papers