Fugu-MT Paper Translation (Abstract): Looking Outside the Box to Ground Language in 3D Scenes

Paper abstract: Looking Outside the Box to Ground Language in 3D Scenes

arxiv url: http://arxiv.org/abs/2112.08879v2
Date: Sun, 19 Dec 2021 12:15:30 GMT
Status: Translation complete
System update date: 2021-12-21 11:21:07.897620
Title: Looking Outside the Box to Ground Language in 3D Scenes
Title (reference translation): 3D Scenes Seen from Outside the Box
Authors: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki
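Abstract summary: This paper proposes a model for grounding language in 3D scenes with three main innovations, including iterative attention across the language stream, the point cloud feature stream, and 3D box proposals, and joint supervision from 3D object annotations and language grounding annotations. When applied to language grounding on 2D images with minor changes, the model performs on par with the state of the art while converging in half the GPU time.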
Reference score (independently computed attention score): 27.126171549887232
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing language grounding models often use object proposal bottlenecks: a pre-trained detector proposes objects in the scene, and the model learns to select the answer from these box proposals without attending to the original image or 3D point cloud. Object detectors are typically trained on a fixed vocabulary of objects and attributes that is often too restrictive for open-domain language grounding, where an utterance may refer to visual entities at various levels of abstraction, such as a chair, the leg of a chair, or the tip of the front leg of a chair. We propose a model for grounding language in 3D scenes that bypasses box proposal bottlenecks with three main innovations: (i) iterative attention across the language stream, the point cloud feature stream, and 3D box proposals; (ii) transformer decoders with non-parametric entity queries that decode 3D boxes for object and part referentials; (iii) joint supervision from 3D object annotations and language grounding annotations, obtained by treating object detection as grounding of referential utterances composed of a list of candidate category labels. These innovations result in significant quantitative gains (up to +9% absolute improvement on the SR3D benchmark) over previous approaches on popular 3D language grounding benchmarks. We ablate each of our innovations to show its contribution to the performance of the model. When applied to language grounding on 2D images with minor changes, the model performs on par with the state of the art while converging in half the GPU time. The code and checkpoints will be made available at https://github.com/nickgkan/beauty_detr
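Abstract (reference translation): Existing language grounding models often rely on an object proposal bottleneck: a pre-trained detector proposes objects in the scene, and the model learns to select the answer from these box proposals without attending to the original image or 3D point cloud. Object detectors are usually trained on a fixed vocabulary of objects and attributes, which is too restrictive for open-domain language grounding, where an utterance may refer to visual entities at various levels of abstraction, such as a chair, the leg of a chair, or the tip of the front leg of a chair. We propose a model for grounding language in 3D scenes that avoids the box proposal bottleneck through: (i) iterative attention across the language stream, the point cloud feature stream, and 3D box proposals; (ii) transformer decoders with non-parametric entity queries that decode 3D boxes for object and part referentials; (iii) joint supervision from 3D object annotations and language grounding annotations, obtained by treating object detection as grounding of referential utterances made up of a list of candidate category labels. These innovations bring significant quantitative gains (up to +9% absolute improvement on the SR3D benchmark) over previous approaches on popular 3D language grounding benchmarks. We ablate each innovation to show its contribution to the model's performance. When applied to language grounding on 2D images with minor changes, the model performs on par with the state of the art while converging in half the GPU time. The code and checkpoints will be released at https://github.com/nickgkan/beauty_detr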
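
The abstract describes treating object detection as grounding of an utterance built from a list of candidate category labels. The following is a minimal sketch of that idea only, not the authors' implementation: the function names `build_detection_utterance` and `labels_for_boxes`, the ". " separator, and the use of character spans are all assumptions made for illustration, and labels are assumed to be unique.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]


def build_detection_utterance(labels: List[str], sep: str = ". ") -> Tuple[str, Dict[str, Span]]:
    """Join candidate category labels into one utterance.

    Returns the utterance and, for each label, the (start, end) character
    span it occupies, so a box can be supervised against a text span.
    Hypothetical helper; assumes labels are unique.
    """
    spans: Dict[str, Span] = {}
    cursor = 0
    for i, label in enumerate(labels):
        if i > 0:
            cursor += len(sep)  # skip over the separator before this label
        spans[label] = (cursor, cursor + len(label))
        cursor += len(label)
    return sep.join(labels), spans


def labels_for_boxes(box_labels: List[str], spans: Dict[str, Span]) -> List[Optional[Span]]:
    """Map each annotated box to the span of its category in the utterance.

    Boxes whose category is not among the candidates map to None.
    """
    return [spans.get(label) for label in box_labels]


# Example: three candidate categories, three annotated boxes in a scene.
utterance, spans = build_detection_utterance(["chair", "table", "sofa"])
targets = labels_for_boxes(["sofa", "chair", "lamp"], spans)
```

Under these assumptions, a detection annotation takes the same form as a grounding annotation (a box paired with a span of text), which is what would allow both kinds of data to supervise one model.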
