Fugu-MT Paper Translation (Abstract): Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

Paper overview: Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

arxiv url: http://arxiv.org/abs/2605.22823v1
Date: Thu, 21 May 2026 17:59:56 GMT
Status: Translation complete
Last updated in system: 2026-05-22 20:14:18.617384
Title: Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs
Title (reference translation): Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs
Authors: Jongseo Lee, Hyuntak Lee, Sunghun Kim, Sooa Kim, Jihoon Chung, Jinwoo Choi
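Abstract summary: Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding. Many Video-LLMs fail at a basic perceptual primitive: signed image-plane motion direction. The failure is localized by tracing motion direction information through the Video-LLM pipeline.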
Reference score (independently computed attention score): 7.541877677953269
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. Although synthetic motion direction instruction tuning reduces this gap on the source domain, motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: https://github.com/KHU-VLL/DeltaDirect
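Abstract (reference translation): Video Large Language Models (Video-LLMs) are progressing rapidly on temporal video understanding, yet many of them fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos in which a single object moves left, right, up, or down, most Video-LLMs perform near chance, and the above-chance cases stem largely from prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, which reveals a direction binding gap. Synthetic motion direction instruction tuning reduces this gap on the source domain, but motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: https://github.com/KHU-VLL/DeltaDirect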
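The abstract describes DeltaDirect only as an objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. The following is a minimal toy sketch of that general idea, not the authors' implementation: the linear head, the cosine-based loss, the hand-set weights, and every function name here are assumptions made for illustration, and the real method operates on projector features inside a Video-LLM rather than on three-number vectors.

```python
import math

def feature_deltas(frames):
    """Differences between adjacent per-frame feature vectors."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(frames, frames[1:])]

def normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def predict_motion(delta, weights):
    """Hypothetical linear head: maps a feature delta to a unit 2-D vector."""
    raw = [sum(w * d for w, d in zip(row, delta)) for row in weights]
    return normalize(raw)

def direction_loss(pred, target):
    """1 - cosine similarity to the target direction (an assumed loss choice)."""
    unit_target = normalize(target)
    return 1.0 - sum(p * t for p, t in zip(pred, unit_target))

def to_label(vec):
    """Map a 2-D vector to a direction in image coordinates (y grows downward)."""
    dx, dy = vec
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"

# Toy features: dims 0 and 1 stand in for object position, dim 2 for appearance.
frames = [[0.0, 5.0, 1.0], [1.0, 5.0, 1.0], [2.0, 5.0, 1.0]]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hand-set for the demo, not learned

deltas = feature_deltas(frames)
preds = [predict_motion(d, weights) for d in deltas]
losses = [direction_loss(p, (1.0, 0.0)) for p in preds]  # ground truth: rightward
labels = [to_label(p) for p in preds]
print(labels, losses)
```

In this toy setup the constant appearance dimension cancels in the delta, so only the change in position reaches the head. That is one plausible reading of why frame-to-frame differences would suit a direction objective, not a claim taken from the paper.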


Related papers