Progressive Transformation Learning for Leveraging Virtual Images in Training
- URL: http://arxiv.org/abs/2211.01778v2
- Date: Mon, 27 Mar 2023 19:21:27 GMT
- Title: Progressive Transformation Learning for Leveraging Virtual Images in Training
- Authors: Yi-Ting Shen, Hyungtae Lee, Heesung Kwon, Shuvra Shikhar Bhattacharyya
- Abstract summary: We introduce Progressive Transformation Learning (PTL) to augment a training dataset by adding transformed virtual images with enhanced realism.
PTL takes a novel approach that progressively iterates the following three steps: 1) select a subset from a pool of virtual images according to the domain gap, 2) transform the selected virtual images to enhance realism, and 3) add the transformed virtual images to the training set while removing them from the pool.
Experiments show that PTL results in a substantial performance increase over the baseline, especially in the small-data and cross-domain regimes.
- Score: 21.590496842692744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To effectively interrogate UAV-based images for detecting objects of
interest, such as humans, it is essential to acquire large-scale UAV-based
datasets that include human instances with various poses captured from widely
varying viewing angles. As a viable alternative to laborious and costly data
curation, we introduce Progressive Transformation Learning (PTL), which
gradually augments a training dataset by adding transformed virtual images with
enhanced realism. Generally, a virtual2real transformation generator in the
conditional GAN framework suffers from quality degradation when a large domain
gap exists between real and virtual images. To deal with the domain gap, PTL
takes a novel approach that progressively iterates the following three steps:
1) select a subset from a pool of virtual images according to the domain gap,
2) transform the selected virtual images to enhance realism, and 3) add the
transformed virtual images to the training set while removing them from the
pool. In PTL, accurately quantifying the domain gap is critical. To do that, we
theoretically demonstrate that the feature representation space of a given
object detector can be modeled as a multivariate Gaussian distribution from
which the Mahalanobis distance between a virtual object and the Gaussian
distribution of each object category in the representation space can be readily
computed. Experiments show that PTL results in a substantial performance
increase over the baseline, especially in the small-data and cross-domain
regimes.
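The abstract specifies both the domain-gap measure (the Mahalanobis distance from a virtual object's detector feature x to the Gaussian N(mu, Sigma) fitted over a real object category, d(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu))) and the three-step select-transform-augment loop. The sketch below shows how these pieces could fit together; the function names (extract_feature, transform, ptl_iteration) and the per-iteration selection size k are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the PTL loop described in the abstract. All names here
# (extract_feature, transform, ptl_iteration, ...) are illustrative
# placeholders, not the paper's released code.
import numpy as np


def fit_gaussian(features: np.ndarray):
    """Fit a multivariate Gaussian (mean, inverse covariance) to real-image features."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))  # regularized inverse
    return mu, cov_inv


def mahalanobis(x: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of one feature vector to the category Gaussian."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))


def ptl_iteration(real_features, virtual_pool, extract_feature, transform, train_set, k):
    """One PTL iteration: 1) select by domain gap, 2) transform, 3) move pool -> training set."""
    mu, cov_inv = fit_gaussian(real_features)
    # 1) rank virtual images by their Mahalanobis distance to the real-image Gaussian
    gaps = np.array([mahalanobis(extract_feature(v), mu, cov_inv) for v in virtual_pool])
    chosen = np.argsort(gaps)[:k]  # smallest domain gap first
    # 2) enhance realism of the selected virtual images (e.g. with a virtual2real generator)
    transformed = [transform(virtual_pool[i]) for i in chosen]
    # 3) add the transformed images to the training set and drop the originals from the pool
    train_set = train_set + transformed
    remaining = [v for i, v in enumerate(virtual_pool) if i not in set(chosen.tolist())]
    return train_set, remaining
```

One would presumably repeat ptl_iteration, retraining the virtual2real transformation generator on the growing training set, so that virtual images farther from the real distribution are transformed only after the effective domain gap has narrowed.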
Related papers
- Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
- Cross-domain and Cross-dimension Learning for Image-to-Graph Transformers [50.576354045312115]
Direct image-to-graph transformation is a challenging task that solves object detection and relationship prediction in a single model.
We introduce a set of methods enabling cross-domain and cross-dimension transfer learning for image-to-graph transformers.
We demonstrate our method's utility in cross-domain and cross-dimension experiments, where we pretrain our models on 2D satellite images before applying them to vastly different target domains in 2D and 3D.
arXiv Detail & Related papers (2024-03-11T10:48:56Z) - Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- Local Manifold Augmentation for Multiview Semantic Consistency [40.28906509638541]
We propose to extract the underlying data variation from datasets and construct a novel augmentation operator, named local manifold augmentation (LMA).
LMA shows the ability to create an infinite number of data views, preserve semantics, and simulate complicated variations in object pose, viewpoint, lighting condition, background, etc.
arXiv Detail & Related papers (2022-11-05T02:00:13Z)
- Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- PIT: Position-Invariant Transform for Cross-FoV Domain Adaptation [53.428312630479816]
We observe that the Field of View (FoV) gap induces noticeable instance appearance differences between the source and target domains.
Motivated by the observations, we propose the Position-Invariant Transform (PIT) to better align images in different domains.
arXiv Detail & Related papers (2021-08-16T15:16:47Z)
- Domain Adaptation with Morphologic Segmentation [8.0698976170854]
We present a novel domain adaptation framework that uses morphologic segmentation to translate images from arbitrary input domains (real and synthetic) into a uniform output domain.
Our goal is to establish a preprocessing step that unifies data from multiple sources into a common representation.
We showcase the effectiveness of our approach by qualitatively and quantitatively evaluating our method on four data sets of simulated and real data of urban scenes.
arXiv Detail & Related papers (2020-06-16T17:06:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.