Fugu-MT Paper Translation (Abstract): CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images

Paper abstract: CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images

arxiv url: http://arxiv.org/abs/2308.12288v1
Date: Wed, 23 Aug 2023 17:59:11 GMT
Status: Translation complete
Last updated in system: 2023-08-24 13:19:04.879576
Title: CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images
Title (reference translation): CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images
Authors: Sookwan Han and Hanbyul Joo
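Abstract summary: We present a method for understanding and modeling the spatial common sense of diverse human-object interactions in 3D. One way to learn these relations is to show multiple 2D images, captured from different viewpoints, of humans interacting with the same type of object. Although the synthesized images are lower in quality than real images, we demonstrate that they are sufficient for learning 3D human-object spatial relations.
Reference score (independently computed attention score): 10.4286198282079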
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a method for teaching machines to understand and model the underlying spatial common sense of diverse human-object interactions in 3D in a self-supervised way. This is a challenging task, as there exist specific manifolds of the interactions that can be considered human-like and natural, but the human pose and the geometry of objects can vary even for similar interactions. Such diversity makes the annotating task of 3D interactions difficult and hard to scale, which limits the potential to reason about that in a supervised way. One way of learning the 3D spatial relationship between humans and objects during interaction is by showing multiple 2D images captured from different viewpoints when humans interact with the same type of objects. The core idea of our method is to leverage a generative model that produces high-quality 2D images from an arbitrary text prompt input as an "unbounded" data generator with effective controllability and view diversity. Despite its imperfection of the image quality over real images, we demonstrate that the synthesized images are sufficient to learn the 3D human-object spatial relations. We present multiple strategies to leverage the synthesized images, including (1) the first method to leverage a generative image model for 3D human-object spatial relation learning; (2) a framework to reason about the 3D spatial relations from inconsistent 2D cues in a self-supervised manner via 3D occupancy reasoning with pose canonicalization; (3) semantic clustering to disambiguate different types of interactions with the same object types; and (4) a novel metric to assess the quality of 3D spatial learning of interaction. Project Page: https://jellyheadandrew.github.io/projects/chorus
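Abstract (reference translation): We propose a method for understanding and modeling, in 3D, the spatial common sense of human-object interactions. This is a difficult task: there exist specific manifolds of interactions that can be considered human-like and natural, yet human pose and object geometry vary even for similar interactions. This diversity makes annotating 3D interactions difficult and hard to scale, which limits the potential for reasoning about them in a supervised way. One way to learn the 3D spatial relationship between humans and objects is to show multiple 2D images, captured from different viewpoints, of humans interacting with the same type of object. The core idea of our method is to leverage a generative model that produces high-quality 2D images from an arbitrary text prompt input. Although the synthesized images are lower in quality than real images, we demonstrate that they are sufficient for learning 3D human-object spatial relations. We present multiple strategies to leverage the synthesized images, including (1) the first method to leverage a generative image model for 3D human-object spatial relation learning; (2) a framework to reason about the 3D spatial relations from inconsistent 2D cues in a self-supervised manner via 3D occupancy reasoning with pose canonicalization; (3) semantic clustering to disambiguate different types of interactions with the same object types; and (4) a novel metric to assess the quality of 3D spatial learning of interaction. Project Page: https://jellyheadandrew.github.io/projects/chorus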
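Illustrative sketch (not the authors' implementation): the abstract mentions reasoning about 3D spatial relations from inconsistent 2D cues via 3D occupancy reasoning with pose canonicalization. One simple way to picture this is visual-hull style voting: voxels defined in a canonical, human-centred frame are projected into each image, and every image whose 2D object mask covers the projection casts a vote. Everything named below (`aggregate_occupancy`, `toy_camera`, the grid bounds, the pinhole camera model) is a hypothetical assumption for illustration. The sketch also assumes the projection matrices already account for each image's pose canonicalization, which is the hard part the paper addresses and is not shown here.

```python
import numpy as np


def aggregate_occupancy(masks, projections, grid_min=-1.0, grid_max=1.0, res=16):
    """Vote 2D object masks into a voxel grid in a canonical (human-centred) frame.

    masks:       list of (H, W) boolean arrays, one 2D object mask per image
    projections: list of (3, 4) matrices mapping canonical-frame homogeneous
                 points to homogeneous pixel coordinates (hypothetical inputs,
                 assumed to already include the pose canonicalization)
    Returns a (res, res, res) array: the fraction of views voting "occupied".
    """
    axis = np.linspace(grid_min, grid_max, res)
    xs, ys, zs = np.meshgrid(axis, axis, axis, indexing="ij")
    points = np.stack([xs, ys, zs, np.ones_like(xs)], axis=-1).reshape(-1, 4)

    votes = np.zeros(len(points))
    for mask, proj in zip(masks, projections):
        height, width = mask.shape
        uvw = points @ np.asarray(proj, dtype=float).T
        # Only voxels in front of the camera can be seen in this view.
        in_front = uvw[:, 2] > 1e-8
        safe_depth = np.where(in_front, uvw[:, 2], 1.0)
        u = np.round(uvw[:, 0] / safe_depth).astype(int)
        v = np.round(uvw[:, 1] / safe_depth).astype(int)
        visible = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
        # A visible voxel receives one vote if its pixel lies inside the mask.
        votes[visible] += mask[v[visible], u[visible]]
    return (votes / len(masks)).reshape(res, res, res)


def toy_camera(focal=20.0, center=16.0, distance=3.0):
    """Hypothetical pinhole camera on the -z side of the grid, looking along +z."""
    intrinsics = np.array([[focal, 0.0, center],
                           [0.0, focal, center],
                           [0.0, 0.0, 1.0]])
    extrinsics = np.hstack([np.eye(3), np.array([[0.0], [0.0], [distance]])])
    return intrinsics @ extrinsics
```

Because the result is a vote fraction rather than a hard intersection, a single inconsistent image lowers a voxel's score without zeroing it, which is the property that makes this kind of aggregation tolerant of noisy 2D cues.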

Related papers

InteractVLM: 3D Interaction Reasoning from 2D Foundational Models [85.76211596755151]
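InteractVLM is a novel method for estimating 3D contact points on human bodies and objects from single in-the-wild images. Existing methods rely on 3D contact annotations collected through expensive motion-capture systems or tedious manual labeling. We propose a task called Semantic Human Contact estimation, in which human contact predictions are explicitly conditioned on object semantics.
Paper reference translation (metadata) (2025-04-07T17:59:33Z)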
SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories [124.24041272390954]
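Modeling hand-object interactions holds great potential for advancing robotics and embodied AI systems. SIGHT is a novel task focused on generating realistic, physically plausible 3D hand-object interaction trajectories from a single image. We propose SIGHT-Fusion, a novel diffusion-based, image-text conditioned generative model that addresses this task by retrieving the most similar 3D object mesh from a database.
Paper reference translation (metadata) (2025-03-28T20:53:20Z)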
Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors [31.277540988829976]
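We propose a novel zero-shot HOI synthesis framework that does not rely on end-to-end training on the currently limited 3D HOI datasets. We use a pre-trained human pose estimation model to extract human poses, and introduce a generalizable category-level 6-DoF estimation method to obtain object poses from 2D HOI images.
Paper reference translation (metadata) (2025-03-25T23:55:47Z)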
TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions [33.58559068016724]
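We propose the first unified model for 3D human-object interaction (HOI). A novel three-way diffusion process generates the human, object, and interaction modalities simultaneously. We apply TriDi to scene population, to generating objects for human-contact datasets, and to generalizing over object geometry.
Paper reference translation (metadata) (2024-12-09T09:35:05Z)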
Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding [46.05283810364663]
We introduce the Image Guided Invariant-Feature-Aware 3D Affordance Grounding framework. It infers the affordance regions of 3D objects by identifying interaction patterns that are common across multiple human-object interaction images.
Paper reference translation (metadata) (2024-08-23T12:27:33Z)
Monocular Human-Object Reconstruction in the Wild [11.261465071559163]
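We propose a 2D-supervised method for learning 3D human-object spatial relations from in-the-wild 2D images. The method uses a flow-based neural network to learn the prior distribution of 2D human-object keypoint layouts and viewports for each image in the dataset.
Paper reference translation (metadata) (2024-07-30T05:45:06Z)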
AG3D: Learning to Generate 3D Avatars from 2D Image Collections [96.28021214088746]
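We propose a new adversarial generative model of realistic 3D people learned from 2D images. The method captures body shape and deformation using a holistic 3D generator. It outperforms previous 3D- and articulation-aware methods in both geometry and appearance.
Paper reference translation (metadata) (2023-05-03T17:56:24Z)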
Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
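Grounding 3D object affordance seeks to locate the "action possibility" regions of objects in 3D space. Humans can perceive the affordances of objects in the physical world through demonstration images or videos. We devise an Interaction-driven 3D Affordance Grounding Network (IAG) that aligns the region features of objects from different sources.
Paper reference translation (metadata) (2023-03-18T15:37:35Z)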
Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors [42.17542596399014]
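We propose a method for inferring diverse 3D models of human-object interactions from images. The method extracts high-level commonsense knowledge from large language models. We quantitatively evaluate the inferred 3D models on a large human-object interaction dataset.
Paper reference translation (metadata) (2022-09-06T13:32:55Z)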
Neural Novel Actor: Learning a Generalized Animatable Neural Representation for Human Actors [98.24047528960406]
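We propose a new method for learning a generalized animatable neural representation from a sparse set of multi-view images of multiple people. The learned representation can synthesize novel-view images of an arbitrary person from a sparse set of cameras, and can further be animated under user pose control.
Paper reference translation (metadata) (2022-08-25T07:36:46Z)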
Grasping Field: Learning Implicit Representations for Human Grasps [16.841780141055505]
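We propose an expressive representation for human grasp modeling that is easy to integrate with deep neural networks. We call this 3D-to-2D mapping the Grasping Field, parameterize it with a deep neural network, and learn it from data. Our generative model can synthesize high-quality human grasps given only a 3D object point cloud.
Paper reference translation (metadata) (2020-08-10T23:08:26Z)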
Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild [96.08358373137438]
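We propose a method for inferring the spatial arrangement and shape of humans and objects in a globally consistent 3D scene. The method runs on datasets without requiring any scene-level or object-level 3D supervision.
Paper reference translation (metadata) (2020-07-30T17:59:50Z)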
Detailed 2D-3D Joint Representation for Human-Object Interaction [45.71407935014447]
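We propose a joint 2D-3D representation learning method for HOI learning. First, we use a single-view human body capture method to obtain detailed 3D body, face, and hand shapes. Next, we estimate the 3D location and size of the object from the 2D human-object spatial configuration and object category priors.
Paper reference translation (metadata) (2020-04-17T10:22:12Z)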
Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
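We propose a self-supervised learning framework that disentangles variations from unlabeled video frames. A differentiable formalization that bridges the representation gap between the 3D pose and spatial part maps allows it to operate on videos with diverse camera movements.
Paper reference translation (metadata) (2020-04-09T07:55:01Z)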
Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations [73.11883464562895]
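We propose a new architecture that facilitates unsupervised, or lightly supervised, learning. We demonstrate the approach by learning 3D human pose and shape from unpaired and unannotated images. While we show results for modeling humans, our formulation is general and can be applied to other vision problems.
Paper reference translation (metadata) (2020-01-06T14:54:00Z)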

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

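This is the information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).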