Self-supervised Learning of Contextualized Local Visual Embeddings
- URL: http://arxiv.org/abs/2310.00527v3
- Date: Wed, 4 Oct 2023 09:05:17 GMT
- Title: Self-supervised Learning of Contextualized Local Visual Embeddings
- Authors: Thalles Santos Silva, Helio Pedrini and Adín Ramírez Rivera
- Abstract summary: Contextualized Local Visual Embeddings (CLoVE) is a self-supervised convolution-based method that learns representations suited for dense prediction tasks.
We benchmark CLoVE's pre-trained representations on multiple datasets.
CLoVE reaches state-of-the-art performance for CNN-based architectures in 4 dense prediction downstream tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Contextualized Local Visual Embeddings (CLoVE), a self-supervised
convolution-based method that learns representations suited for dense
prediction tasks. CLoVE deviates from current methods and optimizes a single
loss function that operates at the level of contextualized local embeddings
learned from the output feature maps of convolutional neural network (CNN) encoders.
To learn contextualized embeddings, CLoVE proposes a normalized multi-head
self-attention layer that combines local features from different parts of an
image based on similarity. We extensively benchmark CLoVE's pre-trained
representations on multiple datasets. CLoVE reaches state-of-the-art
performance for CNN-based architectures in 4 dense prediction downstream tasks,
including object detection, instance segmentation, keypoint detection, and
dense pose estimation.
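The abstract's key mechanism is a normalized multi-head self-attention layer that mixes local CNN features by similarity. As a rough, hypothetical sketch of that idea (not the authors' implementation: the module name, the 1x1-convolution projections, and the choice to unit-normalize queries and keys so attention logits become cosine similarities are all assumptions), it might look like:

```python
import torch
import torch.nn.functional as F
from torch import nn

class NormalizedMHSA(nn.Module):
    """Hypothetical normalized multi-head self-attention over the spatial
    locations of a CNN feature map (illustration only, not the paper's code)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Conv2d(dim, 3 * dim, kernel_size=1)   # joint Q/K/V projection
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) local features from a CNN encoder.
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)

        def heads(t: torch.Tensor) -> torch.Tensor:
            # (B, C, H, W) -> (B, num_heads, H*W, head_dim)
            return t.reshape(b, self.num_heads, self.head_dim, h * w).transpose(2, 3)

        q, k, v = heads(q), heads(k), heads(v)
        # Assumed normalization: unit-norm queries/keys turn the attention
        # logits into cosine similarities between local features.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)    # (B, heads, HW, HW)
        out = attn @ v                  # similarity-weighted mix of all locations
        out = out.transpose(2, 3).reshape(b, c, h, w)
        return self.proj(out)

# e.g. contextualized = NormalizedMHSA(dim=256)(torch.randn(2, 256, 14, 14))
```

Each output location is then a similarity-weighted combination of all locations, matching the abstract's description of combining local features from different parts of an image based on similarity.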
Related papers
- Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits [59.66134971408414]
We aim to adapt CLIP for fine-grained and structured understanding of museum exhibits.
Our dataset is the first of its kind in the public domain.
The proposed method (MUZE) learns to map CLIP's image embeddings to the tabular structure with a transformer-based parsing network (parseNet).
arXiv Detail & Related papers (2024-09-03T08:13:06Z)
- Refining Skewed Perceptions in Vision-Language Models through Visual Representations [0.033483662989441935]
Large vision-language models (VLMs) have become foundational, demonstrating remarkable success across a variety of downstream tasks.
Despite their advantages, these models inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment.
This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications.
arXiv Detail & Related papers (2024-05-22T22:03:11Z)
- Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with the current representatives (an illustrative loop in this spirit appears after this list).
arXiv Detail & Related papers (2024-03-26T06:04:50Z)
- Deciphering 'What' and 'Where' Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations [15.59251297818324]
We present an approach for analyzing grouping information contained within a neural network's activations.
We exploit features from all layers, obviating the need to guess which part of the model contains relevant information.
arXiv Detail & Related papers (2023-12-11T01:20:34Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to find objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions (a minimal sketch of this heuristic appears after this list).
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models (a minimal sketch of this score-map idea appears after this list).
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Exploiting the relationship between visual and textual features in social networks for image classification with zero-shot deep learning [0.0]
In this work, we propose a classifier ensemble based on the transferable learning capabilities of the CLIP neural network architecture.
Our experiments, based on image classification according to the labels of the Places dataset, first consider only the visual part.
Considering the texts associated with the images can help to improve accuracy, depending on the goal.
arXiv Detail & Related papers (2021-07-08T10:54:59Z)
- PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning [109.84770951839289]
We present PredRNN, a new recurrent network for learning visual dynamics from historical context.
We show that our approach obtains highly competitive results on three standard datasets.
arXiv Detail & Related papers (2021-03-17T08:28:30Z)
- Spatially Consistent Representation Learning [12.120041613482558]
We propose a spatially consistent representation learning algorithm (SCRL) for multi-object and location-specific tasks.
We devise a novel self-supervised objective that tries to produce coherent spatial representations of a randomly cropped local region.
On various downstream localization tasks with benchmark datasets, the proposed SCRL shows significant performance improvements.
arXiv Detail & Related papers (2021-03-10T15:23:45Z)
- Image Matching across Wide Baselines: From Paper to Practice [80.9424750998559]
We introduce a comprehensive benchmark for local features and robust estimation algorithms.
Our pipeline's modular structure allows easy integration, configuration, and combination of different methods.
We show that with proper settings, classical solutions may still outperform the perceived state of the art.
arXiv Detail & Related papers (2020-03-03T15:20:57Z)
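For the FEC entry above, the grouping/updating alternation can be pictured as a k-means-style loop over per-pixel features. This is an illustrative reading of the summary, not the paper's algorithm; the function name and all defaults are assumptions.

```python
import torch

def cluster_and_update(pixel_feats: torch.Tensor, k: int = 8, iters: int = 5):
    """Illustrative k-means-style alternation in the spirit of the FEC
    summary: group pixels into clusters ("representatives"), then refresh
    each pixel's feature with its representative.

    pixel_feats: (N, C) flattened per-pixel features
    returns:     updated (N, C) features and (k, C) representatives
    """
    # Initialize representatives from randomly chosen pixels.
    reps = pixel_feats[torch.randperm(pixel_feats.size(0))[:k]].clone()
    for _ in range(iters):
        # Grouping step: assign each pixel to its nearest representative.
        assign = torch.cdist(pixel_feats, reps).argmin(dim=1)       # (N,)
        # Abstraction step: recompute each representative as the mean of
        # its assigned pixels (keep the old one if a cluster empties).
        for j in range(k):
            mask = assign == j
            if mask.any():
                reps[j] = pixel_feats[mask].mean(dim=0)
    # Update step: replace each pixel feature with its representative.
    return reps[assign], reps
```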
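For the unsupervised object discovery entry, projecting dense features onto their first principal component is a common localization heuristic; the sketch below is an assumption about the mechanics, not the paper's exact procedure.

```python
import torch

def pca_foreground_map(feats: torch.Tensor) -> torch.Tensor:
    """Project per-pixel features onto their first principal component,
    a common heuristic for separating object regions from background
    (illustrative only).

    feats:   (C, H, W) dense features for one image
    returns: (H, W) map whose sign/magnitude indicates object regions
    """
    c, h, w = feats.shape
    x = feats.reshape(c, h * w).T           # (H*W, C): one row per pixel
    x = x - x.mean(dim=0, keepdim=True)     # center the features
    # First right-singular vector = first principal direction.
    _, _, vh = torch.linalg.svd(x, full_matrices=False)
    return (x @ vh[0]).reshape(h, w)
```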
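For the DenseCLIP entry, a pixel-text score map reduces to a normalized inner product between per-pixel visual embeddings and per-class text embeddings. A minimal sketch under that assumption (the function name and temperature value are hypothetical):

```python
import torch
import torch.nn.functional as F

def pixel_text_score_maps(pixel_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Sketch of pixel-text matching: cosine similarity between every
    spatial location and every class-prompt embedding.

    pixel_feats: (B, C, H, W) dense visual embeddings
    text_feats:  (K, C) one embedding per class prompt
    returns:     (B, K, H, W) score maps
    """
    p = F.normalize(pixel_feats, dim=1)     # unit norm per pixel
    t = F.normalize(text_feats, dim=1)      # unit norm per class
    return torch.einsum('bchw,kc->bkhw', p, t) / temperature

# e.g. scores = pixel_text_score_maps(torch.randn(2, 512, 32, 32),
#                                     torch.randn(20, 512))
```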