Fugu-MT Paper Translation (Abstract): Towards Visual Query Localization in the 3D World

Paper overview: Towards Visual Query Localization in the 3D World

arxiv url: http://arxiv.org/abs/2605.01498v1
Date: Sat, 02 May 2026 15:41:54 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.805615
Title: Towards Visual Query Localization in the 3D World
Title (reference translation): Towards Visual Query Localization in the 3D World
Authors: Liang Peng, Bohan Tan, Zhipeng Zhang, Haobo Li, Yifan Jiao, Xingping Dong, Libo Zhang
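Abstract summary: The authors make the first attempt to address visual query localization in the 3D world by introducing a new benchmark called 3DVQL. 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. To the best of the authors' knowledge, 3DVQL is the first benchmark for 3D multimodal visual query localization.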
Reference score (independently computed attention score): 44.67762623020334
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual query localization (VQL) aims to predict the spatio-temporal response of the most recent occurrence in a sequence given a query. Currently, most research focuses on visual query localization in 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to address visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities, including point clouds, RGB images, and depth images, to support flexible research. To ensure high-quality annotations, each sequence is manually annotated with multiple rounds of verification and refinement. To the best of our knowledge, 3DVQL is the first benchmark for 3D multimodal visual query localization. To facilitate comparison in subsequent research, we implement a series of representative 3D multimodal VQL baselines using point clouds and RGB images. The experimental results show that existing methods exhibit significant performance variations across different fusion modules. To encourage future research, we propose a lift-and-attention fusion algorithm named LaF, which significantly outperforms existing baseline models. Our benchmark and model will be publicly released at https://github.com/wuhengliangliang/3DVQL.
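Abstract (reference translation): Visual query localization (VQL) aims to predict the spatio-temporal response of the most recent occurrence in a sequence, given a query. Currently, most research focuses on visual query localization in 2D videos. In this paper, we make a first attempt to address visual query localization in the 3D world by introducing a new benchmark called 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL comes with multiple modalities, including point clouds, RGB images, and depth images, to support flexible research. To ensure high-quality annotations, each sequence is manually annotated through multiple rounds of verification and refinement. To the best of our knowledge, 3DVQL is the first benchmark for 3D multimodal visual query localization. To facilitate comparison in subsequent research, we implement a series of 3D multimodal VQL baselines using point clouds and RGB images. The experimental results show that the performance of existing methods varies significantly across different fusion modules. To encourage future research, we propose a lift-and-attention fusion algorithm called LaF, which significantly outperforms the existing baseline models. The benchmark and model will be released at https://github.com/wuhengliangliang/3DVQL.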

Related papers