ViCE: Self-Supervised Visual Concept Embeddings as Contextual and Pixel
Appearance Invariant Semantic Representations
- URL: http://arxiv.org/abs/2111.12460v1
- Date: Wed, 24 Nov 2021 12:27:30 GMT
- Title: ViCE: Self-Supervised Visual Concept Embeddings as Contextual and Pixel
Appearance Invariant Semantic Representations
- Authors: Robin Karlsson, Tomoki Hayashi, Keisuke Fujii, Alexander Carballo,
Kento Ohtani, Kazuya Takeda
- Abstract summary: This work presents a self-supervised method to learn dense, semantically rich visual concept embeddings for images, inspired by methods for learning word embeddings in NLP.
- Score: 77.3590853897664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents a self-supervised method to learn dense semantically rich
visual concept embeddings for images inspired by methods for learning word
embeddings in NLP. Our method improves on prior work by generating more
expressive embeddings and by being applicable to high-resolution images.
Viewing the generation of natural images as a stochastic process in which a set
of latent visual concepts gives rise to observable pixel appearances, our method
is formulated to learn the inverse mapping from pixels to concepts. Our method
greatly improves the effectiveness of self-supervised learning for dense
embedding maps by introducing superpixelization as a natural hierarchical step
up from pixels to a small set of visually coherent regions. Additional
contributions are regional contextual masking with nonuniform shapes matching
visually coherent patches and complexity-based view sampling inspired by masked
language models. The enhanced expressiveness of our dense embeddings is
demonstrated by significantly improving the state of the art on representation
quality benchmarks on COCO (+12.94 mIoU, +87.6%) and Cityscapes (+16.52 mIoU,
+134.2%). Results show favorable scaling and domain generalization properties
not demonstrated by prior work.
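To make the superpixelization step concrete, below is a minimal sketch of pooling dense per-pixel embeddings over visually coherent regions, assuming SLIC from scikit-image and a toy resolution-preserving encoder; all names and sizes are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch: pool dense per-pixel embeddings over SLIC superpixels, the
# hierarchical step from pixels to visually coherent regions described above.
# (Illustrative only; encoder and sizes are assumptions, not the paper's model.)
import numpy as np
import torch
import torch.nn as nn
from skimage.segmentation import slic

def superpixel_pooled_embeddings(image_np, encoder, n_segments=256):
    """image_np: HxWx3 float array in [0, 1] -> one embedding per superpixel."""
    # 1) Superpixelization: group pixels into visually coherent regions.
    segments = slic(image_np, n_segments=n_segments, start_label=0)  # HxW labels
    # 2) Dense per-pixel embeddings from a fully convolutional encoder.
    x = torch.from_numpy(image_np).float().permute(2, 0, 1).unsqueeze(0)  # 1x3xHxW
    emb = encoder(x).squeeze(0).permute(1, 2, 0)  # HxWxD (resolution preserved)
    # 3) Average-pool embeddings within each superpixel region.
    seg = torch.from_numpy(segments).reshape(-1).long()
    flat = emb.reshape(-1, emb.shape[-1])
    n_sp = int(seg.max()) + 1
    sums = torch.zeros(n_sp, flat.shape[-1]).index_add_(0, seg, flat)
    counts = torch.bincount(seg, minlength=n_sp).clamp(min=1).unsqueeze(1)
    return sums / counts

# Toy encoder that keeps spatial resolution.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 64, 3, padding=1))
region_emb = superpixel_pooled_embeddings(np.random.rand(128, 128, 3), encoder)
```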
Related papers
- Image inpainting enhancement by replacing the original mask with a self-attended region from the input image [44.8450669068833]
We introduce a deep learning-based pre-processing methodology for image inpainting that utilizes the Vision Transformer (ViT).
Our approach involves replacing masked pixel values with those generated by the ViT, leveraging diverse visual patches within the attention matrix to capture discriminative spatial features.
arXiv Detail & Related papers (2024-11-08T17:04:05Z)
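As a rough illustration of the pre-processing idea above (replacing masked pixel values with attention-derived ones), here is a toy single-head attention fill over patch embeddings; this is a hypothetical sketch, not the paper's ViT pipeline.
```python
# Toy sketch: masked patch embeddings are replaced by attention-weighted
# mixtures of the visible patches. (Hypothetical; not the paper's ViT.)
import torch
import torch.nn.functional as F

def attention_fill(patches, mask):
    """patches: NxD patch embeddings; mask: N bools, True = masked."""
    k = v = patches[~mask]                    # attend only to visible patches
    attn = F.softmax(patches @ k.T / patches.shape[-1] ** 0.5, dim=-1)  # NxM
    filled = patches.clone()
    filled[mask] = (attn @ v)[mask]           # fill masked patches
    return filled

patches = torch.randn(196, 768)               # e.g. a 14x14 grid of tokens
filled = attention_fill(patches, torch.rand(196) < 0.3)
```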
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z)
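The cross-image correlation idea above can be sketched as self-attention across the global descriptors of a batch; the dimensions and single attention layer are assumptions, not CricaVPR's actual architecture.
```python
# Sketch: treat a batch of image descriptors as a sequence and let
# self-attention correlate them. (Toy sizes; not CricaVPR's architecture.)
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
descriptors = torch.randn(1, 8, 256)   # one "sequence" of 8 images in a batch
refined, weights = attn(descriptors, descriptors, descriptors)
# weights (1x8x8) holds the cross-image correlations used to refine each image.
```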
- Patch-Wise Self-Supervised Visual Representation Learning: A Fine-Grained Approach [4.9204263448542465]
This study adds a fine-grained dimension to self-supervised visual representation learning by integrating patch-level discrimination.
We employ a distinctive photometric patch-level augmentation, where each patch is augmented independently of the other patches within the same view.
We present a simple yet effective patch-matching algorithm to find the corresponding patches across the augmented views.
arXiv Detail & Related papers (2023-10-28T09:35:30Z)
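A minimal sketch of the patch-level photometric augmentation described above, where each patch is jittered independently; the patch size and jitter strengths are assumptions.
```python
# Sketch: apply photometric augmentation to each patch independently of the
# other patches in the same view. (Parameters are illustrative assumptions.)
import torch
from torchvision.transforms import ColorJitter

jitter = ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)

def augment_patches(image, patch=32):
    """image: 3xHxW float tensor in [0, 1], H and W divisible by `patch`."""
    out = image.clone()
    for y in range(0, image.shape[1], patch):
        for x in range(0, image.shape[2], patch):
            out[:, y:y + patch, x:x + patch] = jitter(image[:, y:y + patch, x:x + patch])
    return out

view = augment_patches(torch.rand(3, 224, 224))  # each 32x32 patch jittered separately
```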
- Pixel-Inconsistency Modeling for Image Manipulation Localization [59.968362815126326]
Digital image forensics plays a crucial role in image authentication and manipulation localization.
This paper presents a generalized and robust manipulation localization model through the analysis of pixel inconsistency artifacts.
Experiments show that our method successfully extracts inherent pixel-inconsistency forgery fingerprints.
arXiv Detail & Related papers (2023-09-30T02:54:51Z)
- Saliency-based Video Summarization for Face Anti-spoofing [4.730428911461921]
We present a video summarization method for face anti-spoofing detection that aims to enhance the performance of deep learning models by leveraging visual saliency.
In particular, saliency information is extracted from the differences between the Laplacian and Wiener filter outputs of the source images.
Weighting maps are then computed based on the saliency information, indicating the importance of each pixel in the image.
arXiv Detail & Related papers (2023-08-23T18:08:32Z)
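Following the description above, here is a minimal sketch of the saliency cue: the difference between Laplacian- and Wiener-filtered frames, normalized into a per-pixel weighting map. The Wiener window size is an assumption, not the paper's setting.
```python
# Sketch: saliency from the difference between Laplacian and Wiener filter
# outputs, normalized into a per-pixel weighting map. (Window size assumed.)
import numpy as np
from scipy.ndimage import laplace
from scipy.signal import wiener

def saliency_weight_map(gray_frame):
    """gray_frame: 2D float array in [0, 1]."""
    diff = np.abs(laplace(gray_frame) - wiener(gray_frame, mysize=5))
    return diff / (diff.max() + 1e-8)  # importance of each pixel

weights = saliency_weight_map(np.random.rand(256, 256))
```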
- Face Anti-Spoofing Via Disentangled Representation Learning [90.90512800361742]
Face anti-spoofing is crucial to the security of face recognition systems.
We propose a novel perspective of face anti-spoofing that disentangles the liveness features and content features from images.
arXiv Detail & Related papers (2020-08-19T03:54:23Z)
- PerceptionGAN: Real-world Image Construction from Provided Text through Perceptual Understanding [11.985768957782641]
We propose a method to generate higher-quality images by incorporating perceptual understanding into the discriminator module.
We show that the perceptual information captured in the initial image improves as the image distribution is modeled over multiple stages.
More importantly, the proposed method can be integrated into the pipeline of other state-of-the-art text-based-image-generation models.
arXiv Detail & Related papers (2020-07-02T09:23:08Z)
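One plausible way to incorporate perceptual understanding into a discriminator, as described above, is to feed it frozen pretrained features alongside the image. The sketch below assumes a VGG16 feature block and toy layer sizes, and may differ from the paper's actual design.
```python
# Hypothetical sketch: a discriminator that concatenates frozen VGG16 features
# with its own image features before scoring real/fake. (Not the paper's model.)
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

perceptual = vgg16(weights="IMAGENET1K_V1").features[:16].eval()  # 256-ch block
for p in perceptual.parameters():
    p.requires_grad_(False)

class PerceptualDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_branch = nn.Conv2d(3, 64, 4, stride=8, padding=1)
        self.head = nn.Sequential(nn.Conv2d(64 + 256, 64, 1), nn.ReLU(),
                                  nn.Conv2d(64, 1, 1))  # real/fake score map

    def forward(self, img):
        f_img = self.img_branch(img)                  # learned image features
        f_per = F.interpolate(perceptual(img), size=f_img.shape[-2:])
        return self.head(torch.cat([f_img, f_per], dim=1))

scores = PerceptualDiscriminator()(torch.rand(1, 3, 224, 224))
```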
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture that better exploits the semantics available in captions to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
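The multi-task generation step described above (jointly predicting word and object/predicate tag sequences) can be sketched as a decoder cell with two output heads; the sizes are toy assumptions, not the paper's model.
```python
# Sketch: one decoding step with a shared state feeding two heads, one for the
# next word and one for an object/predicate tag. (Toy sizes; hypothetical.)
import torch
import torch.nn as nn

class MultiTaskDecoderStep(nn.Module):
    def __init__(self, hidden=512, vocab=10000, tags=100):
        super().__init__()
        self.rnn = nn.LSTMCell(hidden, hidden)
        self.word_head = nn.Linear(hidden, vocab)  # next-word logits
        self.tag_head = nn.Linear(hidden, tags)    # object/predicate tag logits

    def forward(self, x, state):
        h, c = self.rnn(x, state)
        return self.word_head(h), self.tag_head(h), (h, c)

step = MultiTaskDecoderStep()
state = (torch.zeros(2, 512), torch.zeros(2, 512))   # batch of 2
word_logits, tag_logits, state = step(torch.randn(2, 512), state)
```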
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)
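To illustrate the bag-of-visual-words target, below is a minimal sketch: local descriptors are quantized against a K-means vocabulary and counted into a normalized histogram for the network to predict. Vocabulary and descriptor sizes are assumptions.
```python
# Sketch: build a K-means visual-word vocabulary, then turn an image's local
# descriptors into a normalized bag-of-words histogram to use as a prediction
# target. (Vocabulary/descriptor sizes are illustrative assumptions.)
import numpy as np
from sklearn.cluster import KMeans

def bow_target(dense_features, kmeans):
    """dense_features: NxD local descriptors from one image."""
    words = kmeans.predict(dense_features)                     # discrete words
    hist = np.bincount(words, minlength=kmeans.n_clusters)
    return hist.astype(np.float32) / max(hist.sum(), 1)       # BoW histogram

vocab = KMeans(n_clusters=512, n_init=10).fit(np.random.randn(5000, 128))
target = bow_target(np.random.randn(300, 128), vocab)          # shape (512,)
```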
This list is automatically generated from the titles and abstracts of the papers on this site.