Fugu-MT paper translation (abstract): Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

Paper summary: Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

arxiv url: http://arxiv.org/abs/2605.19528v1
Date: Tue, 19 May 2026 08:30:21 GMT
Status: translation complete
Last updated in system: 2026-05-20 15:03:09.206517
Title: Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs
Authors: Xueying Jiang, Wenhao Li, Quanhao Qian, Deli Zhao, Shijian Lu, Gongjie Zhang, Ran Xu
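Abstract summary: 3D localization in Multimodal Large Language Models (MLLMs) is limited by camera intrinsic ambiguity. This paper proposes an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The proposed method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale.
Reference score (attention score computed independently by the site): 72.8641426724502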
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical training intrinsic, or retrieve depth and 3D cues from external tools but treat the returned values as reference cues (numerical hints that the model is free to interpret implicitly), both preventing camera information from being deterministically propagated into the prediction. We propose an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The proposed framework proactively retrieves camera intrinsics and samples multi-point metric depths, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box. On both 3D object detection and 3D visual grounding tasks under rescaled camera intrinsics from $0.5\times$ to $1.5\times$, our method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale. Code and data will be released.


Related papers