Fugu-MT Paper Translation (Abstract): Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

Paper overview: Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

arxiv url: http://arxiv.org/abs/2604.24642v1
Date: Mon, 27 Apr 2026 16:10:00 GMT
Status: Translation complete
System update date: 2026-04-28 17:12:08.139503
Title: Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics
Title (reference translation): CLIP's Comprehension of 360-Degree Textual and Visual Semantics
Authors: Hai Wang, Xiaochen Yang, Mingzhi Dong, Jing-Hao Xue
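Abstract summary: Contrastive Language-Image Pre-training (CLIP) models, the standard AI evaluators, face an open question regarding their understanding of 360-degree panoramic image-text pairs. This paper addresses the gap by first introducing two concepts: 360-degree textual semantics, the semantic information conveyed by explicit format identifiers, and 360-degree visual semantics, the semantics that remain invariant under horizontal circular shifts.
Reference score (independently computed attention score): 34.9343777313078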
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The dream of instantly creating rich 360-degree panoramic worlds from text is rapidly becoming a reality, yet a crucial gap exists in our ability to reliably evaluate their semantic alignment. Contrastive Language-Image Pre-training (CLIP) models, standard AI evaluators, predominantly trained on perspective image-text pairs, face an open question regarding their understanding of the unique characteristics of 360-degree panoramic image-text pairs. This paper addresses this gap by first introducing two concepts: \emph{360-degree textual semantics}, semantic information conveyed by explicit format identifiers, and \emph{360-degree visual semantics}, invariant semantics under horizontal circular shifts. To probe CLIP's comprehension of these semantics, we then propose novel evaluation methodologies using keyword manipulation and horizontal circular shifts of varying magnitudes. Rigorous statistical analyses across popular CLIP configurations reveal that: (1) CLIP models effectively leverage explicit textual identifiers, demonstrating an understanding of 360-degree textual semantics; and (2) CLIP models fail to robustly preserve semantic alignment under horizontal circular shifts, indicating limited comprehension of 360-degree visual semantics. To address this limitation, we propose a LoRA-based fine-tuning framework that explicitly instills invariance to circular shifts. Our fine-tuned models exhibit improved comprehension of 360-degree visual semantics, though with a slight degradation in original semantic evaluation performance, highlighting a fundamental trade-off in adapting CLIP to 360-degree panoramic images. Code is available at https://github.com/littlewhitesea/360Semantics.
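Abstract (reference translation): The dream of creating rich 360-degree panoramic worlds from text is rapidly becoming a reality, yet an important gap remains in our ability to reliably evaluate their semantic alignment. Contrastive Language-Image Pre-training (CLIP) models, the standard AI evaluators, are trained predominantly on perspective image-text pairs, so it is an open question whether they understand the unique characteristics of 360-degree panoramic image-text pairs. This paper addresses the gap by first introducing two concepts: 360-degree textual semantics, the semantic information conveyed by explicit format identifiers, and 360-degree visual semantics, the semantics that remain invariant under horizontal circular shifts. To probe CLIP's comprehension of these semantics, we then propose new evaluation methods that use keyword manipulation and horizontal circular shifts of varying magnitudes. Rigorous statistical analyses across popular CLIP configurations show that (1) CLIP models make effective use of explicit textual identifiers, demonstrating an understanding of 360-degree textual semantics, and (2) CLIP models fail to robustly preserve semantic alignment under horizontal circular shifts, indicating limited comprehension of 360-degree visual semantics. To address this limitation, we propose a LoRA-based fine-tuning framework that explicitly instills invariance to circular shifts. Our fine-tuned models show improved comprehension of 360-degree visual semantics, though with a slight degradation in original semantic evaluation performance, which highlights a fundamental trade-off in adapting CLIP to 360-degree panoramic images. Code is available at https://github.com/littlewhitesea/360Semantics.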
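The horizontal circular shift mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes an equirectangular panorama stored as an H x W x C NumPy array, and the names `circular_shift`, `shift_score_spread`, `invariant_embed`, and `position_sensitive_embed` are hypothetical. The two toy encoders stand in for a real CLIP image encoder purely so the example runs without model weights.

```python
import numpy as np


def circular_shift(image: np.ndarray, fraction: float) -> np.ndarray:
    """Roll an H x W x C panorama horizontally by a fraction of its width.

    Columns pushed past the right edge wrap around to the left edge, so the
    shifted image depicts the same 360-degree scene from a different heading.
    """
    width = image.shape[1]
    pixels = int(round(fraction * width)) % width
    return np.roll(image, pixels, axis=1)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def shift_score_spread(embed_image, image, text_vec, fractions) -> float:
    """Max minus min image-text similarity over a set of circular shifts.

    A spread near zero means the (hypothetical) encoder scores the panorama
    consistently regardless of where the seam falls.
    """
    scores = [
        cosine(embed_image(circular_shift(image, f)), text_vec)
        for f in fractions
    ]
    return max(scores) - min(scores)


def invariant_embed(image: np.ndarray) -> np.ndarray:
    # Toy encoder: mean colour of the whole image, unchanged by any roll.
    return image.mean(axis=(0, 1))


def position_sensitive_embed(image: np.ndarray) -> np.ndarray:
    # Toy encoder: mean colour of the left half only, so wrapping matters.
    return image[:, : image.shape[1] // 2].mean(axis=(0, 1))


if __name__ == "__main__":
    # Synthetic panorama: left half red, right half blue.
    pano = np.zeros((4, 8, 3))
    pano[:, :4, 0] = 1.0
    pano[:, 4:, 2] = 1.0
    text_vec = np.array([1.0, 0.0, 0.0])  # stand-in for a text embedding
    fractions = [0.0, 0.25, 0.5, 0.75]
    print(shift_score_spread(invariant_embed, pano, text_vec, fractions))
    print(shift_score_spread(position_sensitive_embed, pano, text_vec, fractions))
```

To probe an actual model, the toy encoders would be replaced by the image encoder of a CLIP implementation and `text_vec` by the corresponding text embedding. The statistical analyses described in the abstract go beyond this simple max-minus-min spread.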


Related papers