Fugu-MT Paper Translation (Abstract): Token Warping Helps MLLMs Look from Nearby Viewpoints

Paper overview: Token Warping Helps MLLMs Look from Nearby Viewpoints

arxiv url: http://arxiv.org/abs/2604.02870v1
Date: Fri, 03 Apr 2026 08:37:08 GMT
Status: Translation complete
Last updated in system: 2026-04-06 17:20:24.404304
Title: Token Warping Helps MLLMs Look from Nearby Viewpoints
Title (reference translation): Token Warping Helps MLLMs Look from Nearby Viewpoints
Authors: Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung
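Abstract summary: Warping tokens rather than pixels helps multimodal large language models (MLLMs) understand how a scene appears from nearby viewpoints. Backward token warping is shown to be more stable and to better preserve semantic coherence under viewpoint shifts. Experiments on the proposed ViewBench benchmark show that token-level warping enables MLLMs to reason reliably from nearby viewpoints.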
Reference score (independently computed attention score): 32.97807608835125
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
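Abstract (reference translation): Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? MLLMs perform well on visual reasoning, but because pixel-wise warping is highly sensitive to small depth errors and introduces geometric distortions, they remain fragile to viewpoint changes. Building on theories of mental imagery that posit part-level structural representations as the basis of human perspective transformation, we examine whether image tokens in ViT-based MLLMs are an effective substrate for viewpoint changes. We compare forward and backward warping and find that backward token warping, which defines a dense grid on the target view and retrieves the corresponding source-view token for each grid point, is more stable and better preserves semantic coherence under viewpoint shifts. Experiments on the proposed ViewBench benchmark show that token-level warping lets MLLMs reason reliably from nearby viewpoints, consistently outperforming all baselines, including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.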
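
The backward warping step described in the abstract (define a grid on the target view, then retrieve one source-view token per grid point) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a pinhole camera, a known depth value for every target grid point (the abstract does not say how this is obtained), and nearest-cell retrieval. The function name `backward_token_warp` and all of its parameters are hypothetical.

```python
import math
from typing import Any, List, Optional, Sequence, Tuple


def backward_token_warp(
    src_tokens: Sequence[Sequence[Any]],
    tgt_depth: Sequence[Sequence[float]],
    intrinsics: Tuple[float, float, float, float],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    fill: Optional[Any] = None,
) -> List[List[Any]]:
    """Illustrative backward warp over an H x W token grid.

    Assumptions (not taken from the paper):
      - intrinsics = (fx, fy, cx, cy), expressed in token-grid units;
      - tgt_depth[v][u] is the depth seen by target grid cell (u, v);
      - rotation / translation map target-camera coordinates to
        source-camera coordinates: p_src = rotation @ p_tgt + translation;
      - the nearest source cell is retrieved (no interpolation).
    """
    h, w = len(src_tokens), len(src_tokens[0])
    fx, fy, cx, cy = intrinsics
    warped = [[fill] * w for _ in range(h)]

    for v in range(h):
        for u in range(w):
            z = tgt_depth[v][u]
            if z <= 0:
                continue  # no valid depth for this grid point

            # Back-project the centre of target cell (u, v) into 3D.
            x = (u + 0.5 - cx) / fx * z
            y = (v + 0.5 - cy) / fy * z

            # Express the 3D point in the source camera frame.
            r = rotation
            xs = r[0][0] * x + r[0][1] * y + r[0][2] * z + translation[0]
            ys = r[1][0] * x + r[1][1] * y + r[1][2] * z + translation[1]
            zs = r[2][0] * x + r[2][1] * y + r[2][2] * z + translation[2]
            if zs <= 0:
                continue  # point lies behind the source camera

            # Project into the source grid and pick the containing cell.
            iu = math.floor(fx * xs / zs + cx)
            iv = math.floor(fy * ys / zs + cy)
            if 0 <= iu < w and 0 <= iv < h:
                warped[v][u] = src_tokens[iv][iu]

    return warped
```

Because every target cell performs its own lookup, the output grid has no holes caused by scattering. Cells whose lookup falls outside the source view keep the `fill` value. With an identity pose the function returns the source grid unchanged.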


Related papers