Fugu-MT Paper Translation (Summary): HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models

Paper overview: HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models

arxiv url: http://arxiv.org/abs/2606.23843v1
Date: Mon, 22 Jun 2026 18:25:37 GMT
Status: Translation complete
Last updated in system: 2026-06-24 22:16:48.624616
Title: HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models
Title (reference translation): HANCLIP: A Family of Hyperbolic Angular Negation Vision-Language Models
Authors: Hoang-Bao Le, Aiden Durrant, Thai Son Mai, Binh T. Nguyen, Liting Zhou, Cathal Gurrin
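Abstract summary: HANCLIP (Hyperbolic + Angular + Negation) is a family of vision-language models that explicitly restructures the embedding space to encode "what an image is not" alongside "what it is." HANCLIP is trained on a compact set of 20,000 image-text quadruplets and combines a hyperbolic formulation, which models hierarchical semantic relations and asymmetries, with an angular triplet objective that drives systematic separation between negated descriptions and their corresponding positives. Experiments show that HANCLIP delivers consistent gains on the negation-focused NegBench benchmark while maintaining competitive or improved performance on standard classification.
Reference score (independently computed attention score): 3.4657033095341845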
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) are typically pre-trained on large-scale image-text datasets to capture semantic correspondences between visual content and natural language. However, they remain surprisingly brittle to negation: models often rely on shallow word co-occurrence and are easily distracted by misleading or irrelevant textual cues, even when their overall retrieval or classification performance is strong. Moreover, directly finetuning on negation data can interfere with previously acquired knowledge, causing noticeable degradation on standard vision-language benchmarks. To tackle these issues, this work introduces HANCLIP (Hyperbolic + Angular + Negation), a family of VLMs that explicitly restructures the embedding space to encode "what an image is not" alongside "what it is." HANCLIP is trained on a compact set of 20,000 image-text quadruplets and combines a hyperbolic formulation, which models hierarchical semantic relations and asymmetries, with an angular triplet objective that drives systematic separation between negated descriptions and their corresponding positives. This geometry-aware design strengthens negation sensitivity while preserving the global structure of pretrained representations, rather than overwriting them. Extensive experiments across multiple vision-language tasks show that HANCLIP delivers consistent gains on the negation-focused NegBench benchmark, while maintaining competitive or improved performance on standard classification and image-text retrieval benchmarks. The framework is model-agnostic and can be plugged into CLIP, LongCLIP, SmartCLIP, and HiMo-CLIP without large-scale retraining, demonstrating that a carefully designed geometric objective can substantially extend the reasoning capabilities of existing VLMs using only modest additional data.
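Abstract (reference translation): Vision-Language Models (VLMs) are usually pre-trained on large-scale image-text datasets to capture semantic correspondences between visual content and natural language. They are nevertheless brittle to negation: models often rely on shallow word co-occurrence and are distracted by misleading or irrelevant textual cues, even when overall retrieval or classification performance is strong. In addition, finetuning directly on negation data interferes with previously acquired knowledge and causes noticeable degradation on standard vision-language benchmarks. To address these problems, this work introduces HANCLIP (Hyperbolic + Angular + Negation). HANCLIP is trained on a compact set of 20,000 image-text quadruplets and combines a hyperbolic formulation, which models hierarchical semantic relations and asymmetries, with an angular triplet objective that promotes systematic separation between negated descriptions and their corresponding positives. This geometry-aware design increases negation sensitivity while preserving, rather than overwriting, the global structure of the pretrained representations. Extensive experiments across multiple vision-language tasks show that HANCLIP delivers consistent gains on the negation-focused NegBench benchmark while remaining competitive on standard classification and image-text retrieval benchmarks. The framework is model-agnostic and can be plugged into CLIP, LongCLIP, SmartCLIP, and HiMo-CLIP without large-scale retraining.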

Related papers