Fugu-MT Paper Translation (Summary): IMAGIN-4D: Image-Guided Controllable Interaction Generation

Paper summary: IMAGIN-4D: Image-Guided Controllable Interaction Generation

arxiv url: http://arxiv.org/abs/2606.23675v1
Date: Mon, 22 Jun 2026 17:58:03 GMT
Status: translation complete
System update date: 2026-06-24 17:13:28.971966
Title: IMAGIN-4D: Image-Guided Controllable Interaction Generation
Title (reference translation): IMAGIN-4D: Image-Guided Controllable Interaction Generation
Authors: Sai Kumar Dwivedi, Federica Bogo, Buğra Tekin, Chenhongyi Yang, Nadine Bertsch, Tomas Hodan, Michael J. Black, Dimitrios Tzionas, Shreyas Hampali
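Abstract summary: Generating human-object interactions (HOI) is central to character animation, robotics, AR/VR, and embodied AI. The paper introduces IMAGIN-4D, a diffusion-based HOI generator that decomposes image conditioning spatio-temporally. IMAGIN-4D improves fine-grained interaction control over single-token and uniformly image-conditioned baselines.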
Reference score (independently calculated attention level): 53.66890088390554
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generating human-object interactions (HOI) is central to character animation, robotics, AR/VR, and embodied AI. Recent HOI generation methods synthesize motion from text, object geometry, and sparse waypoints, controlling action semantics and object trajectories. However, these signals underspecify interaction: the same prompt and trajectory can produce different grasps, approach directions, body poses, object poses, contacts, and body-object layouts. We address this ambiguity with a reference image as a visual specification of the desired interaction snapshot. However, a single global image representation conflates distinct cues and conditions all frames on identical visual evidence. We therefore introduce IMAGIN-4D, a diffusion-based HOI generator that decomposes image conditioning spatio-temporally. For spatial conditioning, IMAGIN-4D extracts supervised interaction-state tokens for body pose, object pose, body-object contact, and spatial relationships at the depicted frame. For temporal conditioning, it computes frame-aware tokens by querying image patches per generated frame, allowing sequence segments to attend to different visual cues from the same image. To balance image, text, and waypoint cues, IMAGIN-4D uses role-aware conditioning: text, waypoints, and interaction-state tokens use separate AdaLN streams, while frame-aware visual tokens cross-attend with motion tokens. Since HOI motion datasets lack paired images, we build a synthetic motion-to-image rendering pipeline from FullBodyManipulation (FBM) and introduce an image-adherence metric to evaluate whether generated motions match the reference snapshot. Experiments on FBM and BEHAVE show that IMAGIN-4D improves fine-grained interaction control over single-token and uniformly image-conditioned baselines while preserving waypoint-following and motion quality. Code and models will be released at https://imagin4d.github.io.
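Abstract (reference translation): Generating human-object interactions (HOI) is central to character animation, robotics, AR/VR, and embodied AI. Recent HOI generation methods synthesize motion from text, object geometry, and sparse waypoints, controlling action semantics and object trajectories. These signals underspecify the interaction, however: the same prompt and trajectory can produce different grasps, approach directions, body poses, object poses, contacts, and body-object layouts. The authors address this ambiguity with a reference image that serves as a visual specification of the desired interaction snapshot. A single global image representation, however, conflates distinct cues and conditions all frames on identical visual evidence. They therefore introduce IMAGIN-4D, a diffusion-based HOI generator that decomposes image conditioning spatio-temporally. For spatial conditioning, IMAGIN-4D extracts supervised interaction-state tokens for body pose, object pose, body-object contact, and spatial relationships at the depicted frame. For temporal conditioning, it computes frame-aware tokens by querying image patches per generated frame, which lets sequence segments attend to different visual cues from the same image. To balance image, text, and waypoint cues, IMAGIN-4D uses role-aware conditioning: text, waypoints, and interaction-state tokens use separate AdaLN streams, while frame-aware visual tokens cross-attend with motion tokens. Because HOI motion datasets lack paired images, the authors build a synthetic motion-to-image rendering pipeline from FullBodyManipulation (FBM) and introduce an image-adherence metric to evaluate whether generated motions match the reference snapshot. Experiments on FBM and BEHAVE show that IMAGIN-4D improves fine-grained interaction control over single-token and uniformly image-conditioned baselines while preserving waypoint-following and motion quality. Code and models will be released at https://imagin4d.github.io.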
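
The temporal conditioning described in the abstract (one visual token per generated frame, obtained by querying image patches) can be illustrated with a toy attention computation. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names `frame_aware_tokens` and `uniform_token`, the two-dimensional patch features, and the hand-picked frame queries are all hypothetical, and the abstract does not say how the real model produces its queries or patch features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_aware_tokens(frame_queries, patches):
    """Return one visual token per frame.

    Each token is an attention-weighted mix of the image patch features,
    using scaled dot-product scores between that frame's query and each patch.
    (Hypothetical helper for illustration only.)
    """
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for q in frame_queries:
        scores = [scale * sum(qi * pi for qi, pi in zip(q, p)) for p in patches]
        weights = softmax(scores)
        token = [sum(w * p[k] for w, p in zip(weights, patches)) for k in range(dim)]
        tokens.append(token)
    return tokens


def uniform_token(patches):
    """Baseline for comparison: mean-pool all patches into a single global
    token that every frame would share. (Hypothetical helper.)"""
    n = len(patches)
    return [sum(p[k] for p in patches) / n for k in range(len(patches[0]))]


# Toy data (assumed): two image patches and two frames with different queries.
patches = [[1.0, 0.0], [0.0, 1.0]]
frame_queries = [[5.0, 0.0], [0.0, 5.0]]

tokens = frame_aware_tokens(frame_queries, patches)
shared = uniform_token(patches)
```

In this toy setup the first frame's token is dominated by the first patch and the second frame's token by the second patch, whereas the mean-pooled baseline hands both frames the same vector. That contrast is the intuition behind letting different sequence segments attend to different cues from one image.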

