Contrastive Learning with Enhanced Abstract Representations using Grouped Loss of Abstract Semantic Supervision
- URL: http://arxiv.org/abs/2509.12771v1
- Date: Tue, 16 Sep 2025 07:36:44 GMT
- Title: Contrastive Learning with Enhanced Abstract Representations using Grouped Loss of Abstract Semantic Supervision
- Authors: Omri Suissa, Muhiim Ali, Shengmai Chen, Yinuo Cai, Shekhar Pradhan
- Abstract summary: We introduce a grouped image-caption dataset (MAGIC). We use a novel contrastive loss technique to induce the model to encode in the representation of each image (caption) in a group the information that is common to all members of the image-caption group. Our training methodology results in the CLEAR GLASS model having the concept abstraction capacity as an emergent capacity.
- Score: 0.5219568203653523
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Humans can recognize an image as an instance of a general concept, beyond simply identifying its objects and their relationships. In this paper, we investigate (1) the extent to which VLMs have this concept abstraction capacity, and (2) strategies for encoding the sort of higher-concept information in images that would enable the resulting VLM (the CLEAR GLASS model) to have this capability to a greater degree. To this end, we introduce a grouped image-caption dataset (MAGIC), which consists of several groups of image captions and, for each group, a set of associated images and higher-level conceptual labels. We use a novel contrastive loss technique to induce the model to encode in the representation of each image (caption) in a group the information that is common to all members of the image-caption group. Our main contribution is a grouped contrastive loss function based on text-image contrastive groups (outer contrastive loss) as well as an inner loss which measures the distances between image-caption instances in the group. Our training methodology results in the CLEAR GLASS model having the concept abstraction capacity as an emergent capacity because the model is not exposed to the higher-level concepts associated with each group. Instead, the training forces the model to create for each image-caption group a semantic representation that brings it closer to the semantic representation of the higher-level concepts in the latent semantic space. Our experiments show that this training methodology results in a model that shows improvement in abstract concept recognition compared to SOTA models.
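The abstract describes the loss only at a high level: an outer contrastive term over text-image groups plus an inner term over distances between image-caption instances within a group. The following is a minimal NumPy sketch of one way such a loss could be assembled, not the authors' formulation. Everything specific here is an assumption for illustration: the function name `grouped_contrastive_loss`, the use of normalized group-mean embeddings, the symmetric InfoNCE form of the outer term, the centroid-based inner term, the `temperature` default, and the `inner_weight` parameter.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def grouped_contrastive_loss(img_emb, txt_emb, temperature=0.07, inner_weight=1.0):
    """Hypothetical grouped loss; returns (total, outer, inner).

    img_emb, txt_emb: arrays of shape (G, K, D) holding G groups,
    each with K image-caption pairs embedded in D dimensions.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)

    # Assumed group representation: normalized mean of the member embeddings.
    img_group = l2_normalize(img.mean(axis=1))  # (G, D)
    txt_group = l2_normalize(txt.mean(axis=1))  # (G, D)

    # Outer term: symmetric InfoNCE across groups. Matching image and
    # caption groups sit on the diagonal of the similarity matrix.
    logits = img_group @ txt_group.T / temperature  # (G, G)
    diag = np.arange(logits.shape[0])
    img_to_txt = log_softmax(logits, axis=1)[diag, diag].mean()
    txt_to_img = log_softmax(logits, axis=0)[diag, diag].mean()
    outer = -0.5 * (img_to_txt + txt_to_img)

    # Inner term: mean squared distance of every image and caption
    # embedding in a group to that group's joint centroid.
    members = np.concatenate([img, txt], axis=1)  # (G, 2K, D)
    centroid = members.mean(axis=1, keepdims=True)
    inner = ((members - centroid) ** 2).sum(axis=-1).mean()

    return outer + inner_weight * inner, outer, inner
```

In this sketch the outer term pushes different groups apart while the inner term pulls the members of one group toward a shared point, which is one plausible reading of "the information that is common to all members of the image-caption group". Note that the higher-level concept labels are not used anywhere in the computation, in line with the abstract's statement that the model is not exposed to them during training.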
Related papers
- GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation [51.95701097588426]
We introduce a Global Perspective Tokenizer (GloTok) to model a more uniform semantic distribution of tokenized features. A residual learning module is proposed to recover the fine-grained details to minimize the reconstruction error caused by quantization. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.
arXiv Detail & Related papers (2025-11-18T06:40:26Z)
- ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution [71.69364653858447]
Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. We propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying complexities using different numbers of vision tokens. Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities.
arXiv Detail & Related papers (2025-10-14T17:58:10Z)
- Data-Efficient Generalization for Zero-shot Composed Image Retrieval [67.46975191141928]
ZS-CIR aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. We propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely, Textual Supplement (TS) module and Semantic-Set (S-Set)
arXiv Detail & Related papers (2025-03-07T07:49:31Z)
- Image Embedding Sampling Method for Diverse Captioning [2.705107928847026]
We introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions. We demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity.
arXiv Detail & Related papers (2025-02-14T12:33:19Z)
- Discriminative Image Generation with Diffusion Models for Zero-Shot Learning [53.44301001173801]
We present DIG-ZSL, a novel Discriminative Image Generation framework for Zero-Shot Learning. We learn a discriminative class token (DCT) for each unseen class under the guidance of a pre-trained category discrimination model (CDM). In this paper, the extensive experiments and visualizations on four datasets show that our DIG-ZSL: (1) generates diverse and high-quality images, (2) outperforms previous state-of-the-art nonhuman-annotated semantic prototype-based methods by a large margin, and (3) achieves comparable or better performance than baselines that leverage human-annot
arXiv Detail & Related papers (2024-12-23T02:18:54Z)
- Grounding Descriptions in Images informs Zero-Shot Visual Recognition [47.66166611138081]
We propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-12-05T18:52:00Z)
- CEIR: Concept-based Explainable Image Representation Learning [0.4198865250277024]
We introduce Concept-based Explainable Image Representation (CEIR) to derive high-quality representations without label dependency.
Our method exhibits state-of-the-art unsupervised clustering performance on benchmarks such as CIFAR10, CIFAR100, and STL10.
CEIR can seamlessly extract the related concept from open-world images without fine-tuning.
arXiv Detail & Related papers (2023-12-17T15:37:41Z)
- Cross-Modal Concept Learning and Inference for Vision-Language Models [31.463771883036607]
In existing fine-tuning methods, the class-specific text description is matched against the whole image.
We develop a new method called cross-modal concept learning and inference (CCLI).
Our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts.
arXiv Detail & Related papers (2023-07-28T10:26:28Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation [17.019848796027485]
Self-supervised visual pre-training models have shown great promise in representing pixel-level semantic relationships.
In this work, we investigate the pixel-level semantic aggregation in self-trained models as image encodes and design concepts.
We propose the Adaptive Concept Generator (ACG) which adaptively maps these prototypes to informative concepts for each image.
arXiv Detail & Related papers (2022-10-12T06:16:34Z)