Fugu-MT Paper Translation (Abstract): PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models

Paper overview: PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models

arxiv url: http://arxiv.org/abs/2603.16958v1
Date: Tue, 17 Mar 2026 02:35:36 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.300521
Title: PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models
Title (reference translation): PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models
Authors: Hisayuki Yokomizo, Taiki Miyanishi, Yan Gang, Shuhei Kurita, Nakamasa Inoue, Yusuke Iwasawa
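Abstract summary: We propose PhysQuantAgent, a framework for real-world object mass estimation using vision-language models (VLMs). We introduce three visual prompting methods that enhance the input image through object detection, scale estimation, and cross-sectional image generation, helping the model understand the size and internal structure of the target object. Experiments show that visual prompting significantly improves mass estimation accuracy on real-world data, suggesting the effectiveness of integrating spatial reasoning with VLM knowledge.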
Reference score (independently computed attention score): 38.21000830724267
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer physical properties required for manipulation remains limited. In particular, estimating the mass of real-world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and most existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing conditions. In this work, we propose PhysQuantAgent, a framework for real-world object mass estimation using VLMs, together with VisPhysQuant, a new benchmark dataset for evaluation. VisPhysQuant consists of RGB-D videos of real objects captured from multiple viewpoints, annotated with precise mass measurements. To improve estimation accuracy, we introduce three visual prompting methods that enhance the input image with object detection, scale estimation, and cross-sectional image generation to help the model comprehend the size and internal structure of the target object. Experiments show that visual prompting significantly improves mass estimation accuracy on real-world data, suggesting the efficacy of integrating spatial reasoning with VLM knowledge for physical inference.
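Abstract (reference translation): Vision-language models (VLMs) are increasingly applied to robotic perception and manipulation, but their ability to infer the physical properties required for manipulation is limited. In particular, estimating the mass of real-world objects is essential for determining an appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing conditions. In this work, we propose PhysQuantAgent, a framework for real-world object mass estimation using VLMs, and VisPhysQuant, a new benchmark dataset for evaluation. VisPhysQuant consists of RGB-D videos of real objects captured from multiple viewpoints, annotated with precise mass measurements. To improve estimation accuracy, we introduce three visual prompting methods that enhance the input image through object detection, scale estimation, and cross-sectional image generation, helping the model understand the size and internal structure of the target object. Experiments show that visual prompting significantly improves mass estimation accuracy on real-world data, suggesting the effectiveness of integrating spatial reasoning with VLM knowledge.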
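The abstract notes that object size and internal structure both matter for mass estimation. As a rough illustration of why, the sketch below combines a size estimate, a solid-material fraction, and a material density into a mass, and scores it against a measured value. This is not the paper's method: the `ObjectEstimate` fields, the `estimate_mass_kg` and `relative_error` helpers, and the mug numbers are all hypothetical, chosen only to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass
class ObjectEstimate:
    """Hypothetical per-object cues; none of these fields come from the paper."""
    width_m: float        # bounding-box extents, e.g. recovered from depth and scale cues
    height_m: float
    depth_m: float
    fill_fraction: float  # share of the box volume that is solid material (0 to 1)
    density_kg_m3: float  # assumed material density


def estimate_mass_kg(obj: ObjectEstimate) -> float:
    """Approximate mass as density x bounding-box volume x fill fraction."""
    if not 0.0 <= obj.fill_fraction <= 1.0:
        raise ValueError("fill_fraction must lie in [0, 1]")
    volume_m3 = obj.width_m * obj.height_m * obj.depth_m
    return obj.density_kg_m3 * volume_m3 * obj.fill_fraction


def relative_error(predicted_kg: float, measured_kg: float) -> float:
    """Absolute relative error against a measured ground-truth mass."""
    if measured_kg <= 0:
        raise ValueError("measured_kg must be positive")
    return abs(predicted_kg - measured_kg) / measured_kg


# Made-up example: a hollow mug, roughly 8 x 10 x 8 cm, assumed 20% solid,
# with an assumed ceramic density of 2400 kg/m^3 and a made-up measured mass.
mug = ObjectEstimate(0.08, 0.10, 0.08, 0.20, 2400.0)
predicted = estimate_mass_kg(mug)
print(f"predicted mass: {predicted:.3f} kg")
print(f"relative error vs. 0.35 kg: {relative_error(predicted, 0.35):.2f}")
```

Treating the same mug as fully solid (`fill_fraction=1.0`) would multiply the estimate by five, which is one way to see why a cue about internal structure could help alongside a size cue.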

Related papers