Fugu-MT Paper Translation (Summary): InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models

Paper summary: InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models

arxiv url: http://arxiv.org/abs/2506.18385v1
Date: Mon, 23 Jun 2025 08:17:22 GMT
Status: Translation complete
Last updated in system: 2025-06-24 19:06:36.903741
Title: InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models
Authors: Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tianshuo Yang, Min Dou, Tong He, Wenqi Shao, Kaipeng Zhang, Yi Wang, Botian Shi, Yanting Zhang, Jifeng Dai, Yu Qiao, Hongjie Zhang, Wenhai Wang
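Abstract summary: InternSpatial is the largest open-source dataset for spatial reasoning in vision-language models (VLMs). It comprises 12 million QA pairs spanning both single-view and multi-view settings. InternSpatial-Bench is an evaluation benchmark designed to assess spatial understanding under diverse instruction formats.
Reference score (independently computed attention score): 59.7084864920244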
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we introduce InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, along with InternSpatial-Bench, a corresponding evaluation benchmark designed to assess spatial understanding under diverse instruction formats. InternSpatial comprises 12 million QA pairs spanning both single-view and multi-view settings, drawn from diverse visual environments and supporting 19 instruction formats that reflect varied query styles. For evaluation, we propose InternSpatial-Bench for single-view tasks and expand multi-view reasoning by introducing a novel rotation angle prediction task that has not been explored in prior work. Experimental results show that models trained on InternSpatial achieve 12.1% improvement on InternSpatial-Bench and 10.7% on VSI-Bench, while maintaining strong performance on general-purpose benchmarks. We hope these resources will support the development of spatially capable VLMs in practical applications such as robotics and embodied AI.

Related papers

SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization [57.484274282231226]
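We propose SVQA-R1, the first framework to extend R1-style training to spatial VQA. In particular, we introduce Spatial-GRPO, a novel group-wise RL strategy that constructs view-consistent rewards by perturbing the spatial relations between objects. Our model, SVQA-R1, not only dramatically improves accuracy on spatial VQA benchmarks but also exhibits interpretable reasoning paths, even without supervised fine-tuning data.
Paper reference translation (metadata) (2025-06-02T06:58:43Z)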
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [47.237216851265316]
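Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for evaluating multi-viewpoint spatial localization.
Paper reference translation (metadata) (2025-05-27T17:59:26Z)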
SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
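Multimodal large language models (MLLMs) have achieved remarkable success in question-answering tasks, but their spatial understanding capabilities remain limited. Do existing MLLMs possess 3D spatial perception and understanding abilities?
Paper reference translation (metadata) (2025-05-22T17:59:03Z)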
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
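We introduce a comprehensive fine-grained evaluation benchmark, called FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives. We reveal key findings on how training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning affect task performance.
Paper reference translation (metadata) (2025-04-21T09:30:41Z)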
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
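We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface for evaluating various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
Paper reference translation (metadata) (2024-06-24T17:59:42Z)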
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models [45.040292339670096]
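Large vision-language models (LVLMs) have shown promise across a wide range of vision-language tasks thanks to their strong reasoning and generalization capabilities. In this work, we aim to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by using high-quality training data.
Paper reference translation (metadata) (2024-02-18T19:26:49Z)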

The related-papers list is generated automatically from the titles and abstracts of the papers on this site.

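This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).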