Heuristic Vision Pre-Training with Self-Supervised and Supervised
Multi-Task Learning
- URL: http://arxiv.org/abs/2310.07510v1
- Date: Wed, 11 Oct 2023 14:06:04 GMT
- Title: Heuristic Vision Pre-Training with Self-Supervised and Supervised
Multi-Task Learning
- Authors: Zhiming Qian
- Abstract summary: We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To mimic the way human vision recognizes the diverse and open world,
foundation vision models are critical. While recent self-supervised learning
techniques show promising potential for this mission, we argue that signals
from labelled data are also important for common-sense recognition, and that
properly chosen pretext tasks can improve the efficiency of vision
representation learning. To this end, we propose a novel pre-training
framework that adopts both self-supervised and supervised visual pretext tasks
in a multi-task manner. Specifically, given an image, we take a heuristic
approach, considering its intrinsic style properties, the objects it contains
with their locations and correlations, and how it appears in 3D space for
basic visual understanding. However, large-scale object bounding boxes and
correlations are usually hard to obtain. Instead, we develop a hybrid method by
leveraging both multi-label classification and self-supervised learning. On the
one hand, under the multi-label supervision, the pre-trained model can explore
the detailed information of an image, e.g., image types, objects, and some
semantic relations. On the other hand, self-supervised learning tasks, namely
Masked Image Modeling (MIM) and contrastive learning, help the
model learn pixel details and patch correlations. Results show that our
pre-trained models can deliver results on par with or better than
state-of-the-art (SOTA) results on multiple visual tasks. For example, with a
vanilla Swin-B backbone, we achieve 85.3% top-1 accuracy on ImageNet-1K
classification, 47.9 box AP on COCO object detection with Mask R-CNN, and 50.6
mIoU on ADE-20K semantic segmentation with UperNet. This performance
demonstrates the ability of our vision foundation model to serve
general-purpose vision tasks.
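The paper does not publish its training code here; as an illustration only, the three pre-training signals described above (multi-label supervision, MIM reconstruction, and contrastive learning) can be combined into a single multi-task objective along these lines. All function and parameter names below (`multi_task_pretrain_loss`, the loss weights `w_cls`/`w_mim`/`w_con`, the temperature) are hypothetical, and the specific loss forms (binary cross-entropy, masked L1, InfoNCE) are common choices, not necessarily the authors' exact ones:

```python
import torch
import torch.nn.functional as F

def multi_task_pretrain_loss(cls_logits, cls_targets,
                             pred_pixels, target_pixels, mask,
                             z1, z2,
                             w_cls=1.0, w_mim=1.0, w_con=1.0,
                             temperature=0.1):
    """Combine three pre-training signals into one scalar loss (sketch).

    cls_logits / cls_targets: (B, C) multi-label logits and 0/1 targets.
    pred_pixels / target_pixels: (B, N, D) reconstructed and original
        patch pixels; mask is (B, N), 1 on masked patches.
    z1 / z2: (B, E) projected embeddings of two augmented views.
    """
    # Multi-label supervision: binary cross-entropy over the label set.
    loss_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)

    # Masked Image Modeling: L1 reconstruction on masked patches only.
    per_patch = (pred_pixels - target_pixels).abs().mean(dim=-1)  # (B, N)
    loss_mim = (per_patch * mask).sum() / mask.sum().clamp(min=1)

    # Contrastive learning (InfoNCE): matching views are positives.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                   # (B, B)
    labels = torch.arange(z1.size(0), device=z1.device)
    loss_con = F.cross_entropy(logits, labels)

    # Weighted sum; in practice the weights would be tuned per task.
    return w_cls * loss_cls + w_mim * loss_mim + w_con * loss_con
```

In a multi-task setup like this, all three terms are backpropagated through a shared backbone (e.g., Swin-B) with separate lightweight heads per task.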
Related papers
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the data-hungry nature of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - Learning Transferable Pedestrian Representation from Multimodal
Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z) - CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View
Completion [20.121597331207276]
Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm.
In this paper we seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks.
Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks.
arXiv Detail & Related papers (2022-10-19T16:50:36Z) - The Devil is in the Frequency: Geminated Gestalt Autoencoder for
Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM) method, termed Geminated Autoencoder (Ge2-AE), for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z) - UniVIP: A Unified Framework for Self-Supervised Visual Pre-training [50.87603616476038]
We propose a novel self-supervised framework to learn versatile visual representations on either single-centric-object or non-iconic datasets.
Extensive experiments show that UniVIP pre-trained on non-iconic COCO achieves state-of-the-art transfer performance.
Our method can also exploit single-centric-object datasets such as ImageNet, and outperforms BYOL by 2.5% with the same pre-training epochs in linear probing.
arXiv Detail & Related papers (2022-03-14T10:04:04Z) - Multimodal Contrastive Training for Visual Representation Learning [45.94662252627284]
We develop an approach to learning visual representations that embraces multimodal data.
Our method exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously.
By including multimodal training in a unified framework, our method can learn more powerful and generic visual features.
arXiv Detail & Related papers (2021-04-26T19:23:36Z) - Learning View-Disentangled Human Pose Representation by Contrastive
Cross-View Mutual Information Maximization [33.36330493757669]
We introduce a novel representation learning method to disentangle pose-dependent as well as view-dependent factors from 2D human poses.
The method trains a network using cross-view mutual information (CV-MIM) which maximizes mutual information of the same pose performed from different viewpoints.
CV-MIM outperforms other competing methods by a large margin in the single-shot cross-view setting.
arXiv Detail & Related papers (2020-12-02T18:55:35Z) - What Can You Learn from Your Muscles? Learning Visual Representation
from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms a visual-only state-of-the-art method MoCo.
arXiv Detail & Related papers (2020-10-16T17:46:53Z) - Self-Supervised Viewpoint Learning From Image Collections [116.56304441362994]
We propose a novel learning framework which incorporates an analysis-by-synthesis paradigm to reconstruct images in a viewpoint aware manner.
We show that our approach performs competitively with fully-supervised approaches for several object categories like human faces, cars, buses, and trains.
arXiv Detail & Related papers (2020-04-03T22:01:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.