Fugu-MT Paper Translation (Abstract): LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models

Paper overview: LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models

arxiv url: http://arxiv.org/abs/2511.19261v1
Date: Mon, 24 Nov 2025 16:13:26 GMT
Status: Translation complete
Last updated in system: 2025-11-25 18:34:25.300445
Title: LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models
Title (reference translation): LAST: Learning to Think in Space and Time for General Vision-Language Models
Authors: Shuai Wang, Daoan Zhang, Tianyi Bai, Shitong Shao, Jiebo Luo, Jiaheng Wei
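Abstract summary: We propose LAST to improve 3D spatial and long video understanding in general vision-language models. We show that LAST brings substantial gains on various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks.
Reference score (independently computed attention score): 50.50563228383038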
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans can perceive and understand 3D space and long videos from sequential visual observations. But can vision-language models (VLMs)? Recent work demonstrates that even state-of-the-art VLMs still struggle to understand 3D space and long videos, although they are powerful in typical vision-language tasks. Current methods often rely on specialized architectural designs to improve performance for 3D tasks and video understanding tasks separately. In contrast, we propose LAST, short for LeArn to Think in Space and Time, to jointly improve 3D spatial and long video understanding for general VLMs with only a set of 2D images as inputs. LAST makes VLMs think in space and time rather than only with text before giving the final answer, building visual thinking trajectories in 3D space and the temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we directly prompt proprietary models; and 2) fine-tuning general VLMs with data that include thinking trajectories in 3D space and time. We show that LAST brings substantial gains on various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks. Notably, LAST yields 15.8% gains on EgoSchema with GPT-4o in a zero-shot manner and 8.3 gains on VSI-Bench compared with Qwen2.5-VL-7B.
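Abstract (reference translation): Humans can perceive and understand 3D space and long videos from sequential visual observations. But can vision-language models (VLMs)? Recent work shows that even state-of-the-art VLMs struggle to understand 3D space and long videos. Current methods often rely on specialized architectural designs to improve 3D tasks and video understanding tasks separately. In contrast, LAST (Learn to Think in Space and Time) jointly improves 3D spatial and long video understanding for general VLMs using only 2D images as input. LAST makes VLMs think in space and time, rather than only in text, before giving the final answer, by building visual thinking trajectories in 3D space and the temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we directly prompt proprietary models; and 2) fine-tuning general VLMs with data that include thinking trajectories in 3D space and time. We show that LAST brings substantial gains on various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks. In particular, it gains 15.8% on EgoSchema with GPT-4o in the zero-shot setting and 8.3 on VSI-Bench compared with Qwen2.5-VL-7B.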

Related papers