Scale-Aware Pre-Training for Human-Centric Visual Perception: Enabling Lightweight and Generalizable Models
- URL: http://arxiv.org/abs/2503.08201v2
- Date: Fri, 27 Jun 2025 11:01:48 GMT
- Title: Scale-Aware Pre-Training for Human-Centric Visual Perception: Enabling Lightweight and Generalizable Models
- Authors: Xuanhan Wang, Huimin Deng, Lianli Gao, Jingkuan Song
- Abstract summary: We introduce Scale-Aware Image Pretraining (SAIP) to acquire general patterns for human-centric visual perception. SAIP incorporates three learning objectives based on the principle of cross-scale consistency. Experiments conducted across 12 HVP datasets demonstrate that SAIP exhibits remarkable capabilities across 9 human-centric vision tasks.
- Score: 88.3233363693087
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human-centric visual perception (HVP) has recently achieved remarkable progress due to advancements in large-scale self-supervised pretraining (SSP). However, existing HVP models face limitations in adapting to real-world applications, which require general visual patterns for downstream tasks while maintaining computationally sustainable costs to ensure compatibility with edge devices. These limitations primarily arise from two issues: 1) the pretraining objectives focus solely on specific visual patterns, limiting the generalizability of the learned patterns for diverse downstream tasks; and 2) HVP models often exhibit excessively large model sizes, making them incompatible with real-world applications. To address these limitations, we introduce Scale-Aware Image Pretraining (SAIP), a novel SSP framework pretraining lightweight vision models to acquire general patterns for HVP. Specifically, SAIP incorporates three learning objectives based on the principle of cross-scale consistency: 1) Cross-scale Matching (CSM), which contrastively learns image-level invariant patterns from multi-scale single-person images; 2) Cross-scale Reconstruction (CSR), which learns pixel-level consistent visual structures from multi-scale masked single-person images; and 3) Cross-scale Search (CSS), which learns to capture diverse patterns from multi-scale multi-person images. The three objectives complement one another, enabling lightweight models to learn the multi-scale generalizable patterns essential for HVP downstream tasks. Extensive experiments conducted across 12 HVP datasets demonstrate that SAIP exhibits remarkable generalization capabilities across 9 human-centric vision tasks. Moreover, it achieves significant performance improvements over existing methods, with gains of 3%-13% in single-person discrimination tasks, 1%-11% in dense prediction tasks, and 1%-6% in multi-person visual understanding tasks.
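The abstract does not give the exact loss formulation, but the Cross-scale Matching (CSM) objective described above is a contrastive consistency loss between embeddings of the same single-person image at different scales. A minimal sketch of that idea, assuming a symmetric InfoNCE-style loss over a batch (the function name, temperature value, and NumPy formulation are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def cross_scale_matching_loss(z_small, z_large, temperature=0.1):
    """Illustrative CSM-style loss (not the paper's exact formulation).

    z_small, z_large: (B, D) embeddings of the same B single-person
    images rendered at two different scales.
    """
    # L2-normalize so similarities are cosine similarities.
    z_small = z_small / np.linalg.norm(z_small, axis=1, keepdims=True)
    z_large = z_large / np.linalg.norm(z_large, axis=1, keepdims=True)
    logits = z_small @ z_large.T / temperature  # (B, B) similarity matrix
    B = logits.shape[0]

    def nll_of_diagonal(l):
        # Cross-entropy with the matching (diagonal) pair as the target,
        # computed with a max-shift for numerical stability.
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[np.arange(B), np.arange(B)].mean()

    # Symmetric InfoNCE: each small-scale view should match its own
    # large-scale view against all other images in the batch, and vice versa.
    return 0.5 * (nll_of_diagonal(logits) + nll_of_diagonal(logits.T))
```

Under this sketch, perfectly scale-consistent embeddings (`z_small == z_large`) drive the loss toward zero, which is the "image-level invariant patterns" behavior the CSM objective targets.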
Related papers
- Visual Bridge: Universal Visual Perception Representations Generating [27.034175361589572]
We propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks. Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations. Our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models.
arXiv Detail & Related papers (2025-11-11T06:25:30Z) - HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models [60.028070589466445]
We propose HERO, a framework that integrates content-adaptive token budget allocation with function-aware token selection. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
arXiv Detail & Related papers (2025-09-16T13:22:08Z) - Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models [84.30626369903221]
We propose Dynamic Pattern Alignment Learning (DPAL) to efficiently train lightweight human-centric vision models. DPAL guides lightweight HVMs to learn all typical human visual patterns from large HVMs, which can generalize to various human-centric vision tasks. Extensive experiments conducted on 15 challenging datasets demonstrate the effectiveness of DPAL.
arXiv Detail & Related papers (2025-08-10T02:27:06Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - AimTS: Augmented Series and Image Contrastive Learning for Time Series Classification [19.7216139977931]
Time series classification (TSC) is an important task in time series analysis. AimTS is a pre-training framework that learns generalizable representations from multi-source time series data.
arXiv Detail & Related papers (2025-04-14T08:55:16Z) - Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention [59.19580789952102]
This paper proposes a novel semi-supervised Multi-Scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation tasks. MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi-scale uncertainty consistency regularization. MUCA utilizes a Cross-Teacher-Student attention mechanism to guide the student network to construct more discriminative feature representations.
arXiv Detail & Related papers (2025-01-18T11:57:20Z) - Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning, that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z) - PAD: Self-Supervised Pre-Training with Patchwise-Scale Adapter for Infrared Images [45.507517332100804]
Self-supervised learning (SSL) for RGB images has achieved significant success, yet there is still limited research on SSL for infrared images.
Non-iconic infrared images render common pre-training tasks less effective.
The scarcity of fine-grained textures makes it particularly challenging to learn general image features.
arXiv Detail & Related papers (2023-12-13T14:57:28Z) - Beyond Random Augmentations: Pretraining with Hard Views [40.88518237601708]
Self-Supervised Learning (SSL) methods rely on random image augmentations, or views, to make models invariant to different transformations. We propose Hard View Pretraining (HVP), a learning-free strategy that extends random view generation by exposing models to more challenging samples during SSL pretraining. HVP sets a new state-of-the-art on DINO ViT-B/16, reaching 78.8% linear evaluation accuracy (a 0.6% improvement) and consistent gains of 1% for both 100 and 300 epochs of pretraining.
arXiv Detail & Related papers (2023-10-05T23:09:19Z) - An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z) - Progressive Multi-view Human Mesh Recovery with Self-Supervision [68.60019434498703]
Existing solutions typically suffer from poor generalization performance to new settings.
We propose a novel simulation-based training pipeline for multi-view human mesh recovery.
arXiv Detail & Related papers (2022-12-10T06:28:29Z) - One-Time Model Adaptation to Heterogeneous Clients: An Intra-Client and Inter-Image Attention Design [40.97593636235116]
We propose plugging a new intra-client and inter-image attention (ICIIA) module into existing backbone recognition models.
In particular, given a target image from a certain client, ICIIA introduces multi-head self-attention to retrieve relevant images from the client's historical unlabeled images.
We evaluate ICIIA using 3 different recognition tasks with 9 backbone models over 5 representative datasets.
arXiv Detail & Related papers (2022-11-11T15:33:21Z) - Scale Attention for Learning Deep Face Representation: A Study Against Visual Scale Variation [69.45176408639483]
We reform the conv layer by resorting to the scale-space theory.
We build a novel style named SCale AttentioN Conv Neural Network (SCAN-CNN).
As a single-shot scheme, the inference is more efficient than multi-shot fusion.
arXiv Detail & Related papers (2022-09-19T06:35:04Z) - X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation [71.51719469058666]
We propose a representation learning framework called X-Learner.
X-Learner learns the universal feature of multiple vision tasks supervised by various sources.
X-Learner achieves strong performance on different tasks without extra annotations, modalities and computational costs.
arXiv Detail & Related papers (2022-03-16T17:23:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.