Fugu-MT Paper Translation (Abstract): Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos

Paper overview: Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos

arxiv url: http://arxiv.org/abs/2604.17749v1
Date: Mon, 20 Apr 2026 03:07:22 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.6723
Title: Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
Title (reference translation): Ego-InBetween: Generation of Object State Transitions in Ego-Centric Video
Authors: Mengmeng Ge, Takashi Isobe, Xu Jia, Yanan Sun, Zetong Yang, Weinong Wang, Dong Zhou, Dong Li, Huchuan Lu, Emad Barsoum
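Abstract summary: EgoIn is a framework that uses TransitionVLM to infer the multi-step transition process between two given states. It generates a sequence of frames based on the transition conditions produced by the proposed Transition Conditioning module. Experiments on human-object and robot-object interaction datasets demonstrate EgoIn's superior performance in generating semantically meaningful and visually coherent transformation sequences.
Reference score (independently computed attention score): 56.20829168540647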
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding physical transformation processes is crucial for both human cognition and artificial intelligence systems, particularly from an egocentric perspective, which serves as a key bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST), which involves generating intermediate frames that depict object transformations between initial and target states under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It first infers the multi-step transition process between two given states using TransitionVLM, fine-tuned on our curated dataset to better adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on transition conditions produced by the proposed Transition Conditioning module. Additionally, we introduce Object-aware Auxiliary Supervision to preserve consistent object appearance throughout the transition. Extensive experiments on human-object and robot-object interaction datasets demonstrate EgoIn's superior performance in generating semantically meaningful and visually coherent transformation sequences.
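Abstract (reference translation): Understanding physical transformation processes is essential for both human cognition and artificial intelligence systems, especially from the egocentric perspective, which acts as an important bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST): generating intermediate frames that depict the object transformation between an initial state and a target state under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about the transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It first infers the multi-step transition process between the two given states using TransitionVLM, which is fine-tuned on our curated dataset to adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on the transition conditions produced by the proposed Transition Conditioning module. In addition, we introduce Object-aware Auxiliary Supervision to keep object appearance consistent throughout the transition. Extensive experiments on human-object and robot-object interaction datasets show EgoIn's superior performance in generating semantically meaningful and visually coherent transformation sequences.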

Related papers