Fugu-MT paper translation (abstract): Multimodal Latent Reasoning via Predictive Embeddings

Paper overview: Multimodal Latent Reasoning via Predictive Embeddings

arxiv url: http://arxiv.org/abs/2604.08065v1
Date: Thu, 09 Apr 2026 10:27:32 GMT
Status: translation complete
Last updated in system: 2026-04-10 18:34:05.863325
Title: Multimodal Latent Reasoning via Predictive Embeddings
Title (reference translation): Multimodal Latent Reasoning via Predictive Embeddings
Authors: Ashutosh Adhikari, Mirella Lapata
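Abstract summary: Pearl is a framework that learns from expert tool-use trajectories. Pearl is model-agnostic, simple to train, and naturally supports trajectories with multiple tool calls.
Reference score (independently computed attention score): 43.40267514669565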
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tool-augmented multimodal reasoning enables visual language models (VLMs) to improve perception by interacting with external tools (e.g., cropping, depth estimation). However, such approaches incur substantial inference overhead, require specialized supervision, and are prone to erroneous tool calls. We propose Pearl (Predictive Embedding Alignment for Reasoning in Latent space), a JEPA-inspired framework that learns from expert tool-use trajectories entirely in the latent space, eliminating the need for explicit tool invocation at inference time. Unlike reconstruction-based latent reasoning methods, which autoregressively generate latent tokens and suffer from training-inference mismatch and limited support for multi-step tool use, Pearl directly learns predictive embeddings from multimodal trajectories while preserving the standard vision-language generation pipeline: it is model-agnostic, simple to train, and naturally supports trajectories with multiple tool calls. Experiments across multiple perception benchmarks show that Pearl matches or outperforms standard supervised fine-tuning and reconstruction-based latent reasoning approaches. Furthermore, we provide empirical evidence that reconstruction-based methods primarily learn embeddings rather than image edits in latent space, motivating predictive embedding learning as a more principled alternative.
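Abstract (reference translation): Tool-augmented multimodal reasoning lets visual language models (VLMs) improve perception by interacting with external tools (e.g., cropping, depth estimation). However, such approaches incur substantial inference overhead, require specialized supervision, and are prone to erroneous tool calls. The proposed Pearl (Predictive Embedding Alignment for Reasoning in Latent space) is a JEPA-inspired framework that learns from expert tool-use trajectories in latent space, eliminating the need for explicit tool invocation at inference time. Unlike reconstruction-based latent reasoning methods, which autoregressively generate latent tokens and suffer from training-inference mismatch and limited support for multi-step tool use, Pearl learns predictive embeddings directly from multimodal trajectories while preserving the standard vision-language generation pipeline. Experiments on multiple perception benchmarks show that Pearl matches or outperforms standard supervised fine-tuning and reconstruction-based latent reasoning approaches. Furthermore, the paper provides empirical evidence that reconstruction-based methods primarily learn embeddings rather than image edits in latent space, motivating predictive embedding learning as a more principled alternative.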
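
The abstract describes learning predictive embeddings from tool-use trajectories rather than calling tools at inference time. The following is a minimal toy sketch of that general idea, not the authors' implementation: the linear predictor, the mean-squared-error objective, the hand-written gradient step, and the example vectors are all assumptions made for illustration. Each pair holds a hypothetical context embedding (what the model sees) and a hypothetical target embedding (what a frozen encoder might produce for a tool output such as a cropped image); targets are treated as constants.

```python
# Toy sketch of predictive embedding alignment (illustrative assumptions only).

def predict(W, context):
    """Linear predictor: map a context embedding to a predicted target embedding."""
    return [sum(w * c for w, c in zip(row, context)) for row in W]


def mse(pred, target):
    """Mean squared error between a predicted and a target embedding."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def trajectory_loss(W, pairs):
    """Average alignment loss over all steps of a trajectory."""
    return sum(mse(predict(W, c), t) for c, t in pairs) / len(pairs)


def gradient_step(W, pairs, lr):
    """One full-batch gradient descent step on the predictor weights.

    Targets are plain constants here, so no gradient flows into them.
    """
    rows, cols, n = len(W), len(W[0]), len(pairs)
    grad = [[0.0] * cols for _ in range(rows)]
    for context, target in pairs:
        pred = predict(W, context)
        for i in range(rows):
            residual = pred[i] - target[i]
            for j in range(cols):
                grad[i][j] += 2.0 * residual * context[j] / (rows * n)
    return [[W[i][j] - lr * grad[i][j] for j in range(cols)] for i in range(rows)]


# Hypothetical two-step trajectory: (context embedding, target embedding) per tool call.
pairs = [
    ([1.0, 0.5, -0.5], [0.2, -0.1, 0.4]),
    ([0.0, 1.0, 1.0], [-0.3, 0.5, 0.1]),
]

W = [[0.0] * 3 for _ in range(3)]  # predictor weights, initialised to zero
initial_loss = trajectory_loss(W, pairs)
for _ in range(50):
    W = gradient_step(W, pairs, lr=0.5)
final_loss = trajectory_loss(W, pairs)

print(f"loss before: {initial_loss:.6f}, after: {final_loss:.6f}")
```

In this sketch, inference would only call `predict` on the context embedding, so no tool is executed; how the paper actually integrates predicted embeddings into the generation pipeline is not specified in the abstract.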

Related papers