ViCE: Self-Supervised Visual Concept Embeddings as Contextual and Pixel
Appearance Invariant Semantic Representations
- URL: http://arxiv.org/abs/2111.12460v1
- Date: Wed, 24 Nov 2021 12:27:30 GMT
- Title: ViCE: Self-Supervised Visual Concept Embeddings as Contextual and Pixel
Appearance Invariant Semantic Representations
- Authors: Robin Karlsson, Tomoki Hayashi, Keisuke Fujii, Alexander Carballo,
Kento Ohtani, Kazuya Takeda
- Abstract summary: This work presents a self-supervised method to learn dense, semantically rich visual concept embeddings for images, inspired by methods for learning word embeddings in NLP.
- Score: 77.3590853897664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents a self-supervised method to learn dense semantically rich
visual concept embeddings for images inspired by methods for learning word
embeddings in NLP. Our method improves on prior work by generating more
expressive embeddings and by being applicable to high-resolution images.
Viewing the generation of natural images as a stochastic process in which a set
of latent visual concepts gives rise to observable pixel appearances, our method
is formulated to learn the inverse mapping from pixels to concepts. Our method
greatly improves the effectiveness of self-supervised learning for dense
embedding maps by introducing superpixelization as a natural hierarchical step
up from pixels to a small set of visually coherent regions. Additional
contributions are regional contextual masking with nonuniform shapes matching
visually coherent patches and complexity-based view sampling inspired by masked
language models. The enhanced expressiveness of our dense embeddings is
demonstrated by significantly improving the state of the art on representation
quality benchmarks on COCO (+12.94 mIoU, +87.6%) and Cityscapes (+16.52 mIoU,
+134.2%). Results show favorable scaling and domain generalization properties
not demonstrated by prior work.
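To make the superpixelization step concrete, below is a minimal sketch of pooling dense per-pixel embeddings over visually coherent regions, assuming SLIC from scikit-image and a toy resolution-preserving encoder; all names and sizes are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch: pool dense per-pixel embeddings over SLIC superpixels, the
# hierarchical step from pixels to visually coherent regions described above.
# (Illustrative only; encoder and sizes are assumptions, not the paper's model.)
import numpy as np
import torch
import torch.nn as nn
from skimage.segmentation import slic

def superpixel_pooled_embeddings(image_np, encoder, n_segments=256):
    """image_np: HxWx3 float array in [0, 1] -> one embedding per superpixel."""
    # 1) Superpixelization: group pixels into visually coherent regions.
    segments = slic(image_np, n_segments=n_segments, start_label=0)  # HxW labels
    # 2) Dense per-pixel embeddings from a fully convolutional encoder.
    x = torch.from_numpy(image_np).float().permute(2, 0, 1).unsqueeze(0)  # 1x3xHxW
    emb = encoder(x).squeeze(0).permute(1, 2, 0)  # HxWxD (resolution preserved)
    # 3) Average-pool embeddings within each superpixel region.
    seg = torch.from_numpy(segments).reshape(-1).long()
    flat = emb.reshape(-1, emb.shape[-1])
    n_sp = int(seg.max()) + 1
    sums = torch.zeros(n_sp, flat.shape[-1]).index_add_(0, seg, flat)
    counts = torch.bincount(seg, minlength=n_sp).clamp(min=1).unsqueeze(1)
    return sums / counts

# Toy encoder that keeps spatial resolution.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 64, 3, padding=1))
region_emb = superpixel_pooled_embeddings(np.random.rand(128, 128, 3), encoder)
```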
Related papers
- Image inpainting enhancement by replacing the original mask with a self-attended region from the input image [44.8450669068833]
We introduce a deep learning-based pre-processing methodology for image inpainting that utilizes the Vision Transformer (ViT).
Our approach involves replacing masked pixel values with those generated by the ViT, leveraging diverse visual patches within the attention matrix to capture discriminative spatial features.
arXiv Detail & Related papers (2024-11-08T17:04:05Z)
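As a rough illustration of the pre-processing idea above (replacing masked pixel values with attention-derived ones), here is a toy single-head attention fill over patch embeddings; this is a hypothetical sketch, not the paper's ViT pipeline.
```python
# Toy sketch: masked patch embeddings are replaced by attention-weighted
# mixtures of the visible patches. (Hypothetical; not the paper's ViT.)
import torch
import torch.nn.functional as F

def attention_fill(patches, mask):
    """patches: NxD patch embeddings; mask: N bools, True = masked."""
    k = v = patches[~mask]                    # attend only to visible patches
    attn = F.softmax(patches @ k.T / patches.shape[-1] ** 0.5, dim=-1)  # NxM
    filled = patches.clone()
    filled[mask] = (attn @ v)[mask]           # fill masked patches
    return filled

patches = torch.randn(196, 768)               # e.g. a 14x14 grid of tokens
filled = attention_fill(patches, torch.rand(196) < 0.3)
```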
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z)
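The cross-image correlation idea above can be sketched as self-attention across the global descriptors of a batch; the dimensions and single attention layer are assumptions, not CricaVPR's actual architecture.
```python
# Sketch: treat a batch of image descriptors as a sequence and let
# self-attention correlate them. (Toy sizes; not CricaVPR's architecture.)
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
descriptors = torch.randn(1, 8, 256)   # one "sequence" of 8 images in a batch
refined, weights = attn(descriptors, descriptors, descriptors)
# weights (1x8x8) holds the cross-image correlations used to refine each image.
```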
- Patch-Wise Self-Supervised Visual Representation Learning: A Fine-Grained Approach [4.9204263448542465]
This study adds a fine-grained dimension to self-supervised visual representation learning by integrating patch-level discrimination.
We employ a distinctive photometric patch-level augmentation, where each patch is augmented independently of the other patches within the same view.
We present a simple yet effective patch-matching algorithm to find the corresponding patches across the augmented views.
arXiv Detail & Related papers (2023-10-28T09:35:30Z)
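A minimal sketch of the patch-level photometric augmentation described above, where each patch is jittered independently; the patch size and jitter strengths are assumptions.
```python
# Sketch: apply photometric augmentation to each patch independently of the
# other patches in the same view. (Parameters are illustrative assumptions.)
import torch
from torchvision.transforms import ColorJitter

jitter = ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)

def augment_patches(image, patch=32):
    """image: 3xHxW float tensor in [0, 1], H and W divisible by `patch`."""
    out = image.clone()
    for y in range(0, image.shape[1], patch):
        for x in range(0, image.shape[2], patch):
            out[:, y:y + patch, x:x + patch] = jitter(image[:, y:y + patch, x:x + patch])
    return out

view = augment_patches(torch.rand(3, 224, 224))  # each 32x32 patch jittered separately
```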
- Pixel-Inconsistency Modeling for Image Manipulation Localization [59.968362815126326]
Digital image forensics plays a crucial role in image authentication and manipulation localization.
This paper presents a generalized and robust manipulation localization model through the analysis of pixel inconsistency artifacts.
Experiments show that our method successfully extracts inherent pixel-inconsistency forgery fingerprints.
arXiv Detail & Related papers (2023-09-30T02:54:51Z)
- Saliency-based Video Summarization for Face Anti-spoofing [4.730428911461921]
We present a video summarization method for face anti-spoofing detection that aims to enhance the performance of deep learning models by leveraging visual saliency.
In particular, saliency information is extracted from the differences between the Laplacian and Wiener filter outputs of the source images.
Weighting maps are then computed based on the saliency information, indicating the importance of each pixel in the image.
arXiv Detail & Related papers (2023-08-23T18:08:32Z)
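Following the description above, here is a minimal sketch of the saliency cue: the difference between Laplacian- and Wiener-filtered frames, normalized into a per-pixel weighting map. The Wiener window size is an assumption, not the paper's setting.
```python
# Sketch: saliency from the difference between Laplacian and Wiener filter
# outputs, normalized into a per-pixel weighting map. (Window size assumed.)
import numpy as np
from scipy.ndimage import laplace
from scipy.signal import wiener

def saliency_weight_map(gray_frame):
    """gray_frame: 2D float array in [0, 1]."""
    diff = np.abs(laplace(gray_frame) - wiener(gray_frame, mysize=5))
    return diff / (diff.max() + 1e-8)  # importance of each pixel

weights = saliency_weight_map(np.random.rand(256, 256))
```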
- Face Anti-Spoofing Via Disentangled Representation Learning [90.90512800361742]
Face anti-spoofing is crucial to the security of face recognition systems.
We propose a novel perspective of face anti-spoofing that disentangles the liveness features and content features from images.
arXiv Detail & Related papers (2020-08-19T03:54:23Z)
- PerceptionGAN: Real-world Image Construction from Provided Text through Perceptual Understanding [11.985768957782641]
We propose a method to generate higher-quality images by incorporating perceptual understanding into the discriminator module.
We show that the perceptual information captured in the initial image improves as the image distribution is modeled over multiple stages.
More importantly, the proposed method can be integrated into the pipeline of other state-of-the-art text-based-image-generation models.
arXiv Detail & Related papers (2020-07-02T09:23:08Z)
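One plausible way to incorporate perceptual understanding into a discriminator, as described above, is to feed it frozen pretrained features alongside the image. The sketch below assumes a VGG16 feature block and toy layer sizes, and may differ from the paper's actual design.
```python
# Hypothetical sketch: a discriminator that concatenates frozen VGG16 features
# with its own image features before scoring real/fake. (Not the paper's model.)
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

perceptual = vgg16(weights="IMAGENET1K_V1").features[:16].eval()  # 256-ch block
for p in perceptual.parameters():
    p.requires_grad_(False)

class PerceptualDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_branch = nn.Conv2d(3, 64, 4, stride=8, padding=1)
        self.head = nn.Sequential(nn.Conv2d(64 + 256, 64, 1), nn.ReLU(),
                                  nn.Conv2d(64, 1, 1))  # real/fake score map

    def forward(self, img):
        f_img = self.img_branch(img)                  # learned image features
        f_per = F.interpolate(perceptual(img), size=f_img.shape[-2:])
        return self.head(torch.cat([f_img, f_per], dim=1))

scores = PerceptualDiscriminator()(torch.rand(1, 3, 224, 224))
```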
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture that better exploits the semantics available in captions to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
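The multi-task generation step described above (jointly predicting word and object/predicate tag sequences) can be sketched as a decoder cell with two output heads; the sizes are toy assumptions, not the paper's model.
```python
# Sketch: one decoding step with a shared state feeding two heads, one for the
# next word and one for an object/predicate tag. (Toy sizes; hypothetical.)
import torch
import torch.nn as nn

class MultiTaskDecoderStep(nn.Module):
    def __init__(self, hidden=512, vocab=10000, tags=100):
        super().__init__()
        self.rnn = nn.LSTMCell(hidden, hidden)
        self.word_head = nn.Linear(hidden, vocab)  # next-word logits
        self.tag_head = nn.Linear(hidden, tags)    # object/predicate tag logits

    def forward(self, x, state):
        h, c = self.rnn(x, state)
        return self.word_head(h), self.tag_head(h), (h, c)

step = MultiTaskDecoderStep()
state = (torch.zeros(2, 512), torch.zeros(2, 512))   # batch of 2
word_logits, tag_logits, state = step(torch.randn(2, 512), state)
```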
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)
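To illustrate the bag-of-visual-words target, below is a minimal sketch: local descriptors are quantized against a K-means vocabulary and counted into a normalized histogram for the network to predict. Vocabulary and descriptor sizes are assumptions.
```python
# Sketch: build a K-means visual-word vocabulary, then turn an image's local
# descriptors into a normalized bag-of-words histogram to use as a prediction
# target. (Vocabulary/descriptor sizes are illustrative assumptions.)
import numpy as np
from sklearn.cluster import KMeans

def bow_target(dense_features, kmeans):
    """dense_features: NxD local descriptors from one image."""
    words = kmeans.predict(dense_features)                     # discrete words
    hist = np.bincount(words, minlength=kmeans.n_clusters)
    return hist.astype(np.float32) / max(hist.sum(), 1)       # BoW histogram

vocab = KMeans(n_clusters=512, n_init=10).fit(np.random.randn(5000, 128))
target = bow_target(np.random.randn(300, 128), vocab)          # shape (512,)
```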
This list is automatically generated from the titles and abstracts of the papers on this site.