Fugu-MT Paper Translation (Abstract): 4DP-QA: Scalable QA for 4D Perception in Vision Language Models

Paper overview: 4DP-QA: Scalable QA for 4D Perception in Vision Language Models

arxiv url: http://arxiv.org/abs/2606.11568v1
Date: Wed, 10 Jun 2026 01:49:55 GMT
Status: Translation complete
Last updated in system: 2026-06-11 16:42:38.243511
Title: 4DP-QA: Scalable QA for 4D Perception in Vision Language Models
Title (reference translation): 4DP-QA: Scalable QA for 4D Perception in Vision Language Models
Authors: Seokju Cho, Abhishek Badki, Hang Su, Jindong Jiang, Ziyao Zeng, Seungryong Kim, Sifei Liu, Orazio Gallo
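Abstract summary: This paper proposes a QA generation pipeline focused on motion-related scene understanding. It takes particular care of the entanglement of camera and object motion by casting tracking both in the traditional way and in a novel fixed reference system called True-Motion Tracking. From this pipeline, the authors generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench.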
Reference score (independently computed attention metric): 68.67551474392373
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We take particular care of the entanglement of camera and object motion by casting tracking in both the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.
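Abstract (reference translation): Despite recent progress, Vision Language Models (VLMs) still struggle to understand the dynamics of the world. Reasoning about a 4D scene is difficult in itself and is further complicated by two factors. First, VLMs observe motion only indirectly, through its projection onto 2D images. Second, existing datasets do not disentangle object motion from camera motion. To address these challenges, the paper proposes a QA generation pipeline focused on motion-related scene understanding. It pays particular attention to the entanglement of camera and object motion by casting tracking both in the traditional way and in a novel fixed reference system called True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, the authors generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on the dataset improves performance on an external benchmark, validating the effectiveness of the method.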


Related papers