Fugu-MT paper translation (abstract): GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

Paper summary: GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

arxiv url: http://arxiv.org/abs/2605.13352v1
Date: Wed, 13 May 2026 11:12:18 GMT
Status: translation complete
Last updated in system: 2026-05-14 23:30:28.002558
Title: GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding
Title (reference translation): GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding
Authors: Mayank Nautiyal, Li Ju, Andreas Hellander, Ekta Vats, Prashant Singh
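Abstract summary: We propose GeoFlowVLM as a post-hoc adapter that learns the joint distribution of paired $\ell_2$-normalised dual-encoder VLM embeddings. A consistency result shows that, in the population limit, the trained network exposes the joint flow and both cross-modal conditional flows. We derive two quantities from this single model: a conditional retrieval entropy that quantifies aleatoric ambiguity with a decision-theoretic interpretation via a Fano-type bound, and a marginal-typicality score justified by an exact chain-rule decomposition of the joint NLL.
Reference score (independently computed attention score): 3.0708725114491293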
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Standard dual-encoder vision-language models that map images and text to deterministic points on a shared unit hypersphere through $\ell_2$ normalization typically expose neither \emph{aleatoric} uncertainty (cross-modal ambiguity) nor \emph{epistemic} uncertainty (lack of training-distribution support). Existing post-hoc methods either recover at most one of the two uncertainty components, or ignore the hyperspherical geometry of these models' embeddings. We propose \textbf{GeoFlowVLM} as a post-hoc adapter that learns the joint distribution of paired $\ell_2$-normalised dual-encoder VLM embeddings on the product hypersphere $\mathbb{S}^{d-1} \times \mathbb{S}^{d-1}$ via Riemannian flow matching with a single masked velocity field. A consistency result shows that, in the population limit, the trained network exposes the joint flow and both cross-modal conditional flows as valid Riemannian flow-matching velocity fields on their respective domains. We derive two quantities from this single model: a conditional retrieval entropy that quantifies aleatoric ambiguity with a decision-theoretic interpretation via a Fano-type bound, and a marginal-typicality epistemic score justified by an exact chain-rule decomposition of the joint NLL. This decomposition isolates a cross-modal pointwise-mutual-information term that is structurally discriminative rather than epistemic, and is empirically the only consistently uninformative standalone component. Empirically, the entropy tracks Recall@1 with near-ideal monotonic calibration across three retrieval benchmarks in both directions, and the marginal-typicality sum yields consistently calibrated selective accuracy across four zero-shot classification benchmarks.
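The chain-rule decomposition mentioned in the abstract can be illustrated with the standard identity below. This is a sketch under assumed notation, not the paper's own statement: $x$ and $y$ denote a paired image and text embedding, $p(x, y)$ the joint density, and $p(x)$, $p(y)$ its marginals. None of these symbols appear in the abstract, and the paper's exact definition of the epistemic score may differ.

```latex
\begin{align}
-\log p(x, y)
  &= \underbrace{-\log p(x) - \log p(y)}_{\text{marginal-typicality sum}}
     \;-\; \underbrace{\log \frac{p(x, y)}{p(x)\,p(y)}}_{\text{pointwise mutual information}}
\end{align}
```

Read this way, the first two terms measure how typical each embedding is under its own marginal, while the last term depends only on how well the two embeddings match each other, which is consistent with the abstract's description of it as discriminative rather than epistemic.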
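Abstract (reference translation): Standard dual-encoder vision-language models, which map images and text to deterministic points on a shared unit hypersphere via $\ell_2$ normalization, typically expose neither aleatoric uncertainty (cross-modal ambiguity) nor epistemic uncertainty (lack of training-distribution support). Existing post-hoc methods either recover at most one of the two uncertainty components or ignore the hyperspherical geometry of these models' embeddings. We propose GeoFlowVLM, a post-hoc adapter that learns the joint distribution of paired $\ell_2$-normalised dual-encoder VLM embeddings on the product hypersphere $\mathbb{S}^{d-1} \times \mathbb{S}^{d-1}$ via Riemannian flow matching with a single masked velocity field. A consistency result shows that, in the population limit, the trained network exposes the joint flow and both cross-modal conditional flows as valid Riemannian flow-matching velocity fields on their respective domains. From this single model we derive two quantities: a conditional retrieval entropy that quantifies aleatoric ambiguity with a decision-theoretic interpretation via a Fano-type bound, and a marginal-typicality epistemic score justified by an exact chain-rule decomposition of the joint NLL. This decomposition isolates a cross-modal pointwise-mutual-information term that is structurally discriminative rather than epistemic, and that is empirically the only consistently uninformative standalone component. Empirically, the entropy tracks Recall@1 with near-ideal monotonic calibration across three retrieval benchmarks in both directions, and the marginal-typicality sum yields consistently calibrated selective accuracy across four zero-shot classification benchmarks.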
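As a rough illustration of the geometry behind Riemannian flow matching on a unit hypersphere, the sketch below computes a point on the great-circle path between two unit vectors together with its time derivative, which stays tangent to the sphere. It is a minimal example under stated assumptions, not the authors' implementation: the function names are hypothetical, the masked velocity network is not shown, and on the product hypersphere the same construction would be applied to each factor separately.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm (l2 normalization)."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def geodesic_point_and_velocity(x0, x1, t):
    """Point x_t and velocity dx_t/dt on the great circle from x0 to x1.

    Hypothetical helper: assumes x0 and x1 are unit vectors that are
    neither identical nor antipodal, and 0 <= t <= 1.
    """
    cos_theta = max(-1.0, min(1.0, dot(x0, x1)))
    theta = math.acos(cos_theta)  # geodesic distance between x0 and x1
    s = math.sin(theta)
    if s < 1e-8:
        raise ValueError("x0 and x1 must not be identical or antipodal")
    # Spherical linear interpolation weights.
    a = math.sin((1.0 - t) * theta) / s
    b = math.sin(t * theta) / s
    x_t = [a * p + b * q for p, q in zip(x0, x1)]
    # Time derivatives of the weights give the velocity along the path.
    da = -theta * math.cos((1.0 - t) * theta) / s
    db = theta * math.cos(t * theta) / s
    u_t = [da * p + db * q for p, q in zip(x0, x1)]
    return x_t, u_t


# Usage: halfway between two orthogonal unit vectors in R^3.
x0 = normalize([1.0, 0.0, 0.0])
x1 = normalize([0.0, 2.0, 0.0])
x_half, u_half = geodesic_point_and_velocity(x0, x1, 0.5)
```

In a flow-matching setup, a network would typically be regressed onto velocities such as `u_half` at sampled times; the point of the sketch is only that both the interpolated point and its velocity respect the spherical constraint.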


Related papers