Fugu-MT Paper Translation (Abstract): Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Paper overview: Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

arxiv url: http://arxiv.org/abs/2605.21625v1
Date: Wed, 20 May 2026 18:36:57 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:41.957684
Title: Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Authors: Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan
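Abstract summary: We introduce Flat-Pack Bench, a new benchmark centered on furniture assembly tasks. The benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking. Our experiments show that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in effectively leveraging temporal information from videos.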
Reference score (independently computed attention score): 49.02458752761219
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, such as household objects, animals, and human subjects, which limits their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step, fine-grained spatio-temporal understanding of the video, which current benchmarks do not sufficiently evaluate. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts that highlight relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in effectively leveraging temporal information from videos, in tracking, and in understanding spatial interactions such as physical contact.
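The abstract describes evaluation through multiple-choice questions grouped by task (temporal ordering, temporal localization, part mating, tracking). As a minimal sketch of how such answers might be scored per task, the snippet below computes accuracy for each task category. The record layout (`task`, `answer`, `prediction`), the function name, and the toy records are assumptions made for illustration; they are not the benchmark's actual data format, and the numbers are not results from the paper.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return accuracy per task for multiple-choice predictions.

    `records` is an iterable of dicts with the hypothetical keys
    'task', 'answer' (ground-truth option), and 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records for illustration only (not data from the benchmark).
records = [
    {"task": "temporal_ordering", "answer": "B", "prediction": "B"},
    {"task": "temporal_ordering", "answer": "A", "prediction": "C"},
    {"task": "part_mating", "answer": "D", "prediction": "D"},
    {"task": "tracking", "answer": "A", "prediction": "B"},
]
scores = per_task_accuracy(records)
for task, accuracy in scores.items():
    print(f"{task}: {accuracy:.2f}")
```

Reporting accuracy per task rather than one pooled number keeps a weakness in one skill, such as tracking, from being masked by strength in another.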
