Fugu-MT Paper Translation (Abstract): APT: Atomic Physical Transitions for Causal Video-Language Understanding

Paper overview: APT: Atomic Physical Transitions for Causal Video-Language Understanding

arxiv url: http://arxiv.org/abs/2606.18586v1
Date: Wed, 17 Jun 2026 01:26:50 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:50.952837
Title: APT: Atomic Physical Transitions for Causal Video-Language Understanding
Authors: Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu
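Abstract summary: Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid. The paper introduces Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism.
Reference score (independently computed attention score): 41.08551060473405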
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11 M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.
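The abstract describes an APT as a timed state change tied to a visible cue, a physical mechanism, and before/after regimes, and reports transition-level recall. The following is a minimal Python sketch of that idea, not the authors' data schema or evaluation protocol. The field names, the type labels (`support_loss`, `contact_onset`, `rebound`, `settling`), the example timestamps, and the greedy matching with a 0.5-second tolerance are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class APT:
    """One atomic physical transition (hypothetical schema)."""
    t: float       # time of the transition, in seconds (assumed unit)
    domain: str    # contact, gravity, friction, or rotation/stability
    type: str      # transition type label (hypothetical names)
    cue: str       # visible cue that marks the transition
    before: str    # dynamical regime before the transition
    after: str     # dynamical regime after the transition


def make_chain(apts: List[APT]) -> List[APT]:
    """An APT chain is the transitions ordered by time."""
    return sorted(apts, key=lambda a: a.t)


def recall(pred: List[APT], gold: List[APT], tol: float = 0.5) -> float:
    """Fraction of gold transitions matched by a prediction of the same
    type within `tol` seconds. Greedy one-to-one matching (assumed)."""
    used = set()
    hits = 0
    for g in gold:
        for i, p in enumerate(pred):
            if i not in used and p.type == g.type and abs(p.t - g.t) <= tol:
                used.add(i)
                hits += 1
                break
    return hits / len(gold) if gold else 0.0


# A "bounce" clip expanded into a chain (illustrative values only).
gold = make_chain([
    APT(1.0, "contact", "rebound", "ball leaves floor", "in contact", "rising"),
    APT(0.2, "gravity", "support_loss", "hand opens", "held", "free fall"),
    APT(0.9, "contact", "contact_onset", "ball touches floor", "free fall", "in contact"),
    APT(2.5, "contact", "settling", "ball stops moving", "bouncing", "at rest"),
])

# A prediction that finds one transition and mistimes another.
pred = [
    APT(1.0, "contact", "contact_onset", "ball touches floor", "free fall", "in contact"),
    APT(2.0, "contact", "rebound", "ball leaves floor", "in contact", "rising"),
]

print([a.type for a in gold])
print(recall(pred, gold))  # 0.25: only contact_onset is matched
```

In this toy case the clip-level label "bounce" could still be answered correctly, while three of the four underlying transitions are missed, which is the kind of gap the abstract attributes to current VLMs.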

Related papers