Fugu-MT paper translation (abstract): GeoWorld-VLM: Geometry from World Models for Vision-Language Models

Paper summary: GeoWorld-VLM: Geometry from World Models for Vision-Language Models

arxiv url: http://arxiv.org/abs/2605.16713v1
Date: Fri, 15 May 2026 23:52:11 GMT
Status: translation complete
System update date: 2026-05-19 17:57:46.919484
Title: GeoWorld-VLM: Geometry from World Models for Vision-Language Models
Title (reference translation): GeoWorld-VLM: Geometry from World Models for Vision-Language Models
Authors: Renjie Gu, Kaichen Zhou, Yan Luo, Mengyu Wang
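Abstract summary: Modern vision-language models (VLMs) achieve strong semantic recognition but remain brittle on elementary spatial relations. The paper introduces GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen.
Reference score (independently computed attention score): 10.86505613923278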
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.
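Abstract (reference translation): Modern vision-language models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. Because the visual pathway may compress or discard critical 3D structural cues during feature extraction, the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Because the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.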
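The abstract names three training terms (spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM) but does not give their form. The following is only a minimal sketch of how such a combined objective might look, under assumptions that are not from the paper: cross-entropy for the answer term, cosine distance for alignment, mean squared error for the anchor, features already mapped to a shared dimension, and hypothetical weights `align_weight` and `anchor_weight`. All function names are illustrative.

```python
import math


def cross_entropy(logits, target_index):
    """Answer supervision (assumed form): negative log-probability of the correct answer."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target_index]


def cosine_distance(a, b):
    """Alignment term (assumed form): 1 - cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-8)


def mean_squared_error(a, b):
    """Preservation anchor (assumed form): squared drift from the original VLM features."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(answer_logits, answer_index,
                  student_feat, teacher_feat, original_feat,
                  align_weight=1.0, anchor_weight=0.1):
    """Weighted sum of the three terms; the weights are hypothetical placeholders.

    student_feat:  post-projector image feature from the VLM being fine-tuned
    teacher_feat:  intermediate feature from the frozen world-model teacher,
                   assumed here to be already mapped to the same dimension
    original_feat: the same image's feature from the unmodified VLM
    """
    terms = {
        "answer": cross_entropy(answer_logits, answer_index),
        "align": cosine_distance(student_feat, teacher_feat),
        "anchor": mean_squared_error(student_feat, original_feat),
    }
    total = (terms["answer"]
             + align_weight * terms["align"]
             + anchor_weight * terms["anchor"])
    return total, terms


if __name__ == "__main__":
    total, terms = combined_loss(
        answer_logits=[2.0, 0.5],      # e.g. scores for "left of" vs "right of"
        answer_index=0,
        student_feat=[0.9, 0.1, 0.0],
        teacher_feat=[1.0, 0.0, 0.0],
        original_feat=[0.8, 0.2, 0.0],
    )
    print(total, terms)
```

In this sketch only the student features would carry gradients, which matches the abstract's statement that the image encoder and multimodal projector are fine-tuned while the teacher and the language backbone stay frozen.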

Related papers