Heterogeneous Contrastive Learning: Encoding Spatial Information for
Compact Visual Representations
- URL: http://arxiv.org/abs/2011.09941v1
- Date: Thu, 19 Nov 2020 16:26:25 GMT
- Title: Heterogeneous Contrastive Learning: Encoding Spatial Information for
Compact Visual Representations
- Authors: Xinyue Huo, Lingxi Xie, Longhui Wei, Xiaopeng Zhang, Hao Li, Zijie
Yang, Wengang Zhou, Houqiang Li, Qi Tian
- Abstract summary: This paper presents an effective approach that adds spatial information to the encoding stage to alleviate the learning inconsistency between the contrastive objective and strong data augmentation operations.
We show that our approach achieves higher efficiency in visual representations and thus delivers a key message to inspire future research on self-supervised visual representation learning.
- Score: 183.03278932562438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning has achieved great success in self-supervised visual
representation learning, but existing approaches mostly ignore spatial
information, which is often crucial for visual representation. This paper
presents heterogeneous contrastive learning (HCL), an effective approach that
adds spatial information to the encoding stage to alleviate the learning
inconsistency between the contrastive objective and strong data augmentation
operations. We demonstrate the effectiveness of HCL by showing that (i) it
achieves higher accuracy in instance discrimination and (ii) it surpasses
existing pre-training methods in a series of downstream tasks while shrinking
the pre-training costs by half. More importantly, we show that our approach
achieves higher efficiency in visual representations, and thus delivers a key
message to inspire future research on self-supervised visual representation
learning.
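To make the abstract's idea more concrete, below is a minimal sketch of a contrastive objective that keeps a spatial (per-location) branch alongside the usual pooled branch. The toy backbone, the two projection heads, and the loss weighting `alpha` are illustrative assumptions, not the authors' HCL architecture.

```python
# A minimal sketch of a contrastive objective that also uses spatial (per-location)
# features, in the spirit of HCL as described in the abstract. The toy backbone, the
# two heads, and the weighting "alpha" are assumptions, not the authors' exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeterogeneousEncoder(nn.Module):
    """Encodes an image into a global vector plus per-location spatial embeddings."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(                 # keeps a spatial feature map
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.global_head = nn.Linear(128, dim)         # pooled (spatial-agnostic) branch
        self.spatial_head = nn.Conv2d(128, dim, 1)     # per-location (spatial) branch

    def forward(self, x):
        fmap = self.backbone(x)                                          # B x C x H x W
        g = F.normalize(self.global_head(fmap.mean(dim=(2, 3))), dim=1)  # B x dim
        s = self.spatial_head(fmap).flatten(2).permute(0, 2, 1)          # B x HW x dim
        s = F.normalize(s.reshape(-1, s.size(-1)), dim=1)                # (B*HW) x dim
        return g, s

def info_nce(q, k, temperature=0.2):
    """Standard InfoNCE: row i of q and row i of k are positives, all others negatives."""
    logits = q @ k.t() / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def hcl_style_loss(encoder, view1, view2, alpha=0.5):
    """Global contrast plus a location-wise contrast over the spatial branch."""
    g1, s1 = encoder(view1)
    g2, s2 = encoder(view2)
    # The location-wise term assumes the two views keep spatial correspondence
    # (e.g., photometric augmentations); strong crops would break it.
    return info_nce(g1, g2) + alpha * info_nce(s1, s2)

# Usage with random tensors standing in for two augmented views of a batch.
enc = HeterogeneousEncoder()
loss = hcl_style_loss(enc, torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32))
loss.backward()
```

The fragility of the location-wise term under strong crops is exactly the kind of tension between the contrastive objective and strong augmentations that the abstract alludes to; the sketch only illustrates the general shape of a heterogeneous (global + spatial) objective.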
Related papers
- LAC: Graph Contrastive Learning with Learnable Augmentation in Continuous Space [16.26882307454389]
We introduce LAC, a graph contrastive learning framework with learnable data augmentation in an orthogonal continuous space.
To capture the representative information in the graph data during augmentation, we introduce a continuous view augmenter.
We propose an information-theoretic principle named InfoBal and introduce corresponding pretext tasks.
Our experimental results show that LAC significantly outperforms the state-of-the-art frameworks.
arXiv Detail & Related papers (2024-10-20T10:47:15Z)
- Multi-Grained Contrast for Data-Efficient Unsupervised Representation Learning [10.630297877530614]
We propose a novel Multi-Grained Contrast method (MGC) for unsupervised representation learning.
Specifically, we construct delicate multi-grained correspondences between positive views and then conduct multi-grained contrast by the correspondences to learn more general unsupervised representations.
Our method significantly outperforms the existing state-of-the-art methods on extensive downstream tasks, including object detection, instance segmentation, scene parsing, semantic segmentation and keypoint detection.
arXiv Detail & Related papers (2024-07-02T07:35:21Z)
- Visual In-Context Learning for Large Vision-Language Models [62.5507897575317]
In Large Vision-Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities.
We introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition.
Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarises images with task intent and task-specific visual parsing, and composes language-based demonstrations.
arXiv Detail & Related papers (2024-02-18T12:43:38Z)
- A Probabilistic Model Behind Self-Supervised Learning [53.64989127914936]
In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels.
We present a generative latent variable model for self-supervised learning.
We show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations.
arXiv Detail & Related papers (2024-02-02T13:31:17Z)
- PerceptionGPT: Effectively Fusing Visual Perception into LLM [31.34127196055722]
The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to visual large language models (VLLMs).
We present a novel end-to-end framework named PerceptionGPT, which efficiently equips the VLLMs with visual perception abilities.
Our approach significantly alleviates the training difficulty suffered by previous approaches that formulate the visual outputs as discrete tokens.
arXiv Detail & Related papers (2023-11-11T16:59:20Z)
- Focalized Contrastive View-invariant Learning for Self-supervised Skeleton-based Action Recognition [16.412306012741354]
We propose a self-supervised framework called Focalized Contrastive View-invariant Learning (FoCoViL).
FoCoViL significantly suppresses the view-specific information on the representation space where the viewpoints are coarsely aligned.
It associates actions with common view-invariant properties and simultaneously separates the dissimilar ones.
arXiv Detail & Related papers (2023-04-03T10:12:30Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Self-supervised Co-training for Video Representation Learning [103.69904379356413]
We investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training.
We propose a novel self-supervised co-training scheme to improve the popular InfoNCE loss (a minimal sketch of InfoNCE with class positives appears after this list).
We evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval.
arXiv Detail & Related papers (2020-10-19T17:59:01Z)
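Several of the entries above, and the HCL abstract itself, build on the InfoNCE loss; the last entry adds semantic-class positives to instance-based InfoNCE. Below is a minimal sketch of that idea, assuming a SupCon-style averaging over positives; the function name and the source of the (pseudo-)labels are hypothetical and not that paper's exact co-training scheme.

```python
# A minimal sketch of instance-based InfoNCE extended with semantic-class positives,
# as hinted at by the co-training entry above. Treating all same-(pseudo-)class samples
# as extra positives is an illustrative assumption, not the paper's exact scheme.
import torch
import torch.nn.functional as F

def infonce_with_class_positives(q, k, labels, temperature=0.1):
    """q, k: L2-normalized embeddings of two views (B x D); labels: (pseudo-)class ids (B,)."""
    logits = q @ k.t() / temperature                          # B x B cross-view similarities
    # Positive mask: same class; the diagonal (same instance) is included
    # automatically because labels[i] == labels[i].
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average the log-likelihood over every positive of each anchor (SupCon-style pooling).
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1)
    return loss.mean()

# Usage with random embeddings and pseudo-labels; with all-unique labels this
# reduces to the standard instance-discrimination InfoNCE loss.
q = F.normalize(torch.randn(16, 128), dim=1)
k = F.normalize(torch.randn(16, 128), dim=1)
labels = torch.randint(0, 4, (16,))
print(infonce_with_class_positives(q, k, labels))
```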
This list is automatically generated from the titles and abstracts of the papers in this site.