Fugu-MT Paper Translation (Abstract): HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning

Paper overview: HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning

arxiv url: http://arxiv.org/abs/2606.06100v1
Date: Thu, 04 Jun 2026 12:40:15 GMT
Status: Translation complete
Last updated in system: 2026-06-06 06:55:34.666239
Title: HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning
Title (reference translation): HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning
Authors: Moshiur Farazi, Sameera Ramasinghe, Mahbub Ahmed Turza, Shafin Rahman,
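Abstract summary: We propose HyperVis, which bypasses the SGG semantic bottleneck entirely. We compute a dense $O(N^2)$ visual relation tensor via spatially-biased cross-attention, project it onto a Lorentz hyperboloid, and enforce hierarchy through spatial physics, namely IoA-driven entailment cones and exterior-angle repulsion.
Reference score (independently computed attention score): 19.982012555038573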
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) struggle with compositional reasoning that requires understanding inter-object relationships. A natural remedy is to inject explicit scene graph triplets $\langle s, p, o \rangle$ from an off-the-shelf scene graph generator (SGG), but we show this backfires: discrete text labels collide with the continuous visual modality, degrading GQA accuracy from 60.38% to 58.86%. We propose HyperVis, which bypasses the SGG semantic bottleneck entirely. From $N$ class-agnostic region proposals, we compute a dense $O(N^2)$ visual relation tensor via spatially-biased cross-attention, project it onto a Lorentz hyperboloid, and enforce hierarchy through spatial physics, namely IoA-driven entailment cones and exterior-angle repulsion. We discover that HyperVis contributes in two complementary ways: (1) as a training-time regularizer, the hyperbolic relational losses shape LoRA representations that improve generative VQA (GQA 61.03% vs. 57.21% for LoRA fine-tuning without relational losses, recovering and surpassing the baseline); and (2) as an inference-time relational encoder, hyperbolic prefix tokens boost discriminative compositional scoring (SugarCrepe 79.94%, +6.25 pp over baseline). The learned curvature stabilises at $\kappa = 4.0$, an order of magnitude above prior hyperbolic VLMs where $\kappa$ typically collapses toward zero, indicating that continuous visual features genuinely require the exponential volume of strongly curved space. A controlled Euclidean ablation confirms this decomposition: the relational pipeline regularises LoRA comparably in flat space (GQA 60.81%), but the compositionality gain is specifically hyperbolic (SugarCrepe +4.58 pp over Euclidean), with entailment loss ${\sim}6\times$ higher in Euclidean training. Code is available at TBA.
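Abstract (reference translation): Vision-Language Models (VLMs) struggle with compositional reasoning that requires understanding the relationships between objects. A natural remedy is to inject explicit scene graph triplets $\langle s, p, o \rangle$ from an off-the-shelf scene graph generator (SGG), but we show that this backfires: discrete text labels collide with the continuous visual modality, lowering GQA accuracy from 60.38% to 58.86%. We propose HyperVis, which bypasses the SGG semantic bottleneck entirely. From $N$ class-agnostic region proposals, we compute a dense $O(N^2)$ visual relation tensor via spatially-biased cross-attention, project it onto a Lorentz hyperboloid, and enforce hierarchy through spatial physics, namely IoA-driven entailment cones and exterior-angle repulsion. We find that HyperVis contributes in two complementary ways: (1) as a training-time regularizer, the hyperbolic relational losses shape LoRA representations that improve generative VQA (GQA 61.03% vs. 57.21% for LoRA fine-tuning without relational losses, recovering and surpassing the baseline); and (2) as an inference-time relational encoder, hyperbolic prefix tokens boost discriminative compositional scoring (SugarCrepe 79.94%, +6.25 pp over baseline). The learned curvature stabilises at $\kappa = 4.0$, an order of magnitude above prior hyperbolic VLMs, where $\kappa$ typically collapses toward zero, indicating that continuous visual features genuinely require the exponential volume of strongly curved space. The relational pipeline regularises LoRA comparably in flat space (GQA 60.81%), but the compositionality gain is specifically hyperbolic (SugarCrepe +4.58 pp over Euclidean). Code is available at TBA.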
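
The abstract mentions projecting relation features onto a Lorentz hyperboloid with curvature $\kappa = 4.0$. As a rough illustration only, the sketch below uses the standard exponential map at the hyperboloid origin, which is one common way to perform such a projection; the abstract does not say which map HyperVis uses. The function names (`expmap0`, `lorentz_inner`, `lorentz_distance`) and the sample vector are hypothetical and are not taken from the paper.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of products of spatial parts."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def expmap0(v, kappa):
    """Map a Euclidean (tangent) vector v onto the hyperboloid
    {x : <x, x>_L = -1/kappa, x0 > 0} via the exponential map at the origin.
    Assumption: kappa > 0 denotes the magnitude of the negative curvature."""
    sqrt_k = math.sqrt(kappa)
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        # The zero vector maps to the hyperboloid origin (1/sqrt(kappa), 0, ..., 0).
        return [1.0 / sqrt_k] + [0.0] * len(v)
    time = math.cosh(sqrt_k * norm) / sqrt_k
    scale = math.sinh(sqrt_k * norm) / (sqrt_k * norm)
    return [time] + [scale * c for c in v]

def lorentz_distance(x, y, kappa):
    """Geodesic distance: arccosh(-kappa * <x, y>_L) / sqrt(kappa)."""
    arg = max(1.0, -kappa * lorentz_inner(x, y))  # clamp against rounding error
    return math.acosh(arg) / math.sqrt(kappa)

if __name__ == "__main__":
    kappa = 4.0                      # value reported in the abstract
    x = expmap0([0.3, -0.4], kappa)  # hypothetical 2-D relation feature
    origin = expmap0([0.0, 0.0], kappa)
    print(lorentz_inner(x, x))       # close to -1/kappa
    print(lorentz_distance(origin, x, kappa))
```

With this map, the geodesic distance from the origin to the projected point equals the Euclidean norm of the input vector, so points far from the origin gain exponentially more surrounding volume as $\kappa$ grows. The IoA-driven entailment cones and exterior-angle repulsion named in the abstract are not sketched here, since the abstract does not define them.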


Related papers