Local Manifold Augmentation for Multiview Semantic Consistency
- URL: http://arxiv.org/abs/2211.02798v1
- Date: Sat, 5 Nov 2022 02:00:13 GMT
- Title: Local Manifold Augmentation for Multiview Semantic Consistency
- Authors: Yu Yang, Wing Yin Cheung, Chang Liu, Xiangyang Ji
- Abstract summary: We propose to extract the underlying data variation from datasets and construct a novel augmentation operator, named local manifold augmentation (LMA).
LMA can create an infinite number of data views, preserve semantics, and simulate complicated variations in object pose, viewpoint, lighting condition, background, etc.
- Score: 40.28906509638541
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multiview self-supervised representation learning is rooted in exploring semantic consistency across data with complex intra-class variation. Such variation is not directly accessible and is therefore simulated by data augmentations. However, commonly adopted augmentations are handcrafted and limited to simple geometric and color changes, which cannot cover the abundant intra-class variation. In this paper, we propose to extract the underlying data variation from datasets and construct a novel augmentation operator, named local manifold augmentation (LMA). LMA is achieved by training an instance-conditioned generator to fit the distribution on the local manifold of data and sampling multiview data with it. LMA can create an infinite number of data views, preserve semantics, and simulate complicated variations in object pose, viewpoint, lighting condition, background, etc. Experiments show that with LMA integrated, self-supervised learning methods such as MoCov2 and SimSiam gain consistent improvements on prevalent benchmarks including CIFAR10, CIFAR100, STL10, ImageNet100, and ImageNet. Furthermore, LMA leads to representations with greater invariance to viewpoint, object pose, and illumination changes, and stronger robustness to real distribution shifts reflected by ImageNet-V2, ImageNet-R, ImageNet-Sketch, etc.
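The abstract describes LMA as training an instance-conditioned generator that fits the distribution on the local manifold around each image and then sampling multiple views from it, which are fed to a standard two-branch method such as MoCov2 or SimSiam. The sketch below only illustrates that interface under assumptions of our own: the class names, the placeholder convolutional decoder, and the noise-code conditioning are hypothetical and not taken from the paper, which should be consulted for the actual generator architecture and training objective.

```python
import torch
import torch.nn as nn

class InstanceConditionedGenerator(nn.Module):
    """Hypothetical LMA-style generator: maps an image plus a noise code to a
    nearby sample on the data manifold. The decoder here is a placeholder."""
    def __init__(self, noise_dim: int = 128):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Conv2d(3 + noise_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Condition on the instance by concatenating the image with the noise
        # code broadcast over spatial positions.
        z_map = z[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
        return self.net(torch.cat([x, z_map], dim=1))

def lma_two_views(x: torch.Tensor, generator: InstanceConditionedGenerator):
    """Sample two views of each image from its local manifold neighborhood."""
    z1 = torch.randn(x.size(0), generator.noise_dim, device=x.device)
    z2 = torch.randn(x.size(0), generator.noise_dim, device=x.device)
    return generator(x, z1), generator(x, z2)

if __name__ == "__main__":
    gen = InstanceConditionedGenerator()
    images = torch.randn(8, 3, 32, 32)      # dummy CIFAR-sized batch
    view1, view2 = lma_two_views(images, gen)
    # The two views would feed the two branches of a MoCov2/SimSiam-style model.
    print(view1.shape, view2.shape)
```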
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [32.57246173437492]
This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs.
By analyzing object differences between similar images, we challenge models to identify both matching and distinct components.
We utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding.
CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches.
With the simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
arXiv Detail & Related papers (2023-09-28T18:04:43Z)
- Style-Hallucinated Dual Consistency Learning: A Unified Framework for Visual Domain Generalization [113.03189252044773]
We propose a unified framework, Style-HAllucinated Dual consistEncy learning (SHADE), to handle domain shift in various visual tasks.
Our versatile SHADE can significantly enhance the generalization in various visual recognition tasks, including image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2022-12-18T11:42:51Z)
- Multi-Spectral Image Classification with Ultra-Lean Complex-Valued Models [28.798100220715686]
Multi-spectral imagery is invaluable for remote sensing due to different spectral signatures exhibited by materials.
We apply complex-valued co-domain symmetric models to classify real-valued MSI images.
Our work is the first to demonstrate the value of complex-valued deep learning on real-valued MSI data.
arXiv Detail & Related papers (2022-11-21T19:01:53Z)
- Sketched Multi-view Subspace Learning for Hyperspectral Anomalous Change Detection [12.719327447589345]
A sketched multi-view subspace learning model is proposed for anomalous change detection.
The proposed model preserves the major information of the image pairs and reduces computational complexity.
Experiments are conducted on a benchmark hyperspectral remote sensing dataset and a natural hyperspectral dataset.
arXiv Detail & Related papers (2022-10-09T14:08:17Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.