Scale-Aware Pre-Training for Human-Centric Visual Perception: Enabling Lightweight and Generalizable Models
- URL: http://arxiv.org/abs/2503.08201v2
- Date: Fri, 27 Jun 2025 11:01:48 GMT
- Title: Scale-Aware Pre-Training for Human-Centric Visual Perception: Enabling Lightweight and Generalizable Models
- Authors: Xuanhan Wang, Huimin Deng, Lianli Gao, Jingkuan Song
- Abstract summary: We introduce Scale-Aware Image Pretraining (SAIP) to acquire general patterns for human-centric visual perception. SAIP incorporates three learning objectives based on the principle of cross-scale consistency. Experiments conducted across 12 HVP datasets demonstrate that SAIP exhibits remarkable capabilities across 9 human-centric vision tasks.
- Score: 88.3233363693087
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human-centric visual perception (HVP) has recently achieved remarkable progress due to advancements in large-scale self-supervised pretraining (SSP). However, existing HVP models face limitations in adapting to real-world applications, which require general visual patterns for downstream tasks while maintaining computationally sustainable costs to ensure compatibility with edge devices. These limitations primarily arise from two issues: 1) the pretraining objectives focus solely on specific visual patterns, limiting the generalizability of the learned patterns for diverse downstream tasks; and 2) HVP models often exhibit excessively large model sizes, making them incompatible with real-world applications. To address these limitations, we introduce Scale-Aware Image Pretraining (SAIP), a novel SSP framework pretraining lightweight vision models to acquire general patterns for HVP. Specifically, SAIP incorporates three learning objectives based on the principle of cross-scale consistency: 1) Cross-scale Matching (CSM), which contrastively learns image-level invariant patterns from multi-scale single-person images; 2) Cross-scale Reconstruction (CSR), which learns pixel-level consistent visual structures from multi-scale masked single-person images; and 3) Cross-scale Search (CSS), which learns to capture diverse patterns from multi-scale multi-person images. The three objectives complement one another, enabling lightweight models to learn the multi-scale generalizable patterns essential for HVP downstream tasks. Extensive experiments conducted across 12 HVP datasets demonstrate that SAIP exhibits remarkable generalization capabilities across 9 human-centric vision tasks. Moreover, it achieves significant performance improvements over existing methods, with gains of 3%-13% in single-person discrimination tasks, 1%-11% in dense prediction tasks, and 1%-6% in multi-person visual understanding tasks.
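The abstract does not give the exact loss formulation, but the Cross-scale Matching (CSM) objective described above is a contrastive consistency loss between embeddings of the same single-person image at different scales. A minimal sketch of that idea, assuming a symmetric InfoNCE-style loss over a batch (the function name, temperature value, and NumPy formulation are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def cross_scale_matching_loss(z_small, z_large, temperature=0.1):
    """Illustrative CSM-style loss (not the paper's exact formulation).

    z_small, z_large: (B, D) embeddings of the same B single-person
    images rendered at two different scales.
    """
    # L2-normalize so similarities are cosine similarities.
    z_small = z_small / np.linalg.norm(z_small, axis=1, keepdims=True)
    z_large = z_large / np.linalg.norm(z_large, axis=1, keepdims=True)
    logits = z_small @ z_large.T / temperature  # (B, B) similarity matrix
    B = logits.shape[0]

    def nll_of_diagonal(l):
        # Cross-entropy with the matching (diagonal) pair as the target,
        # computed with a max-shift for numerical stability.
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[np.arange(B), np.arange(B)].mean()

    # Symmetric InfoNCE: each small-scale view should match its own
    # large-scale view against all other images in the batch, and vice versa.
    return 0.5 * (nll_of_diagonal(logits) + nll_of_diagonal(logits.T))
```

Under this sketch, perfectly scale-consistent embeddings (`z_small == z_large`) drive the loss toward zero, which is the "image-level invariant patterns" behavior the CSM objective targets.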
Related papers
- Visual Bridge: Universal Visual Perception Representations Generating [27.034175361589572]
We propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks. Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations. Our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models.
arXiv Detail & Related papers (2025-11-11T06:25:30Z) - HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models [60.028070589466445]
We propose HERO, a framework that integrates content-adaptive token budget allocation with function-aware token selection. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
arXiv Detail & Related papers (2025-09-16T13:22:08Z) - Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models [84.30626369903221]
We propose Dynamic Pattern Alignment Learning (DPAL) to efficiently train lightweight human-centric vision models. DPAL guides lightweight HVMs to learn all typical human visual patterns from large HVMs, which can generalize to various human-centric vision tasks. Extensive experiments conducted on 15 challenging datasets demonstrate the effectiveness of DPAL.
arXiv Detail & Related papers (2025-08-10T02:27:06Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - AimTS: Augmented Series and Image Contrastive Learning for Time Series Classification [19.7216139977931]
Time series classification (TSC) is an important task in time series analysis. AimTS is a pre-training framework that learns generalizable representations from multi-source time series data.
arXiv Detail & Related papers (2025-04-14T08:55:16Z) - Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention [59.19580789952102]
This paper proposes a novel semi-supervised Multi-Scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation tasks. MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi-scale uncertainty consistency regularization. MUCA utilizes a Cross-Teacher-Student attention mechanism to guide the student network to construct more discriminative feature representations.
arXiv Detail & Related papers (2025-01-18T11:57:20Z) - Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning, that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z) - PAD: Self-Supervised Pre-Training with Patchwise-Scale Adapter for Infrared Images [45.507517332100804]
Self-supervised learning (SSL) for RGB images has achieved significant success, yet there is still limited research on SSL for infrared images.
Non-iconic infrared images render common pre-training tasks less effective.
The scarcity of fine-grained textures makes it particularly challenging to learn general image features.
arXiv Detail & Related papers (2023-12-13T14:57:28Z) - Beyond Random Augmentations: Pretraining with Hard Views [40.88518237601708]
Self-Supervised Learning (SSL) methods rely on random image augmentations, or views, to make models invariant to different transformations. We propose Hard View Pretraining (HVP), a learning-free strategy that extends random view generation by exposing models to more challenging samples during SSL pretraining. HVP sets a new state-of-the-art on DINO ViT-B/16, reaching 78.8% linear evaluation accuracy (a 0.6% improvement) and consistent gains of 1% for both 100 and 300 epochs of pretraining.
arXiv Detail & Related papers (2023-10-05T23:09:19Z) - An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z) - Progressive Multi-view Human Mesh Recovery with Self-Supervision [68.60019434498703]
Existing solutions typically suffer from poor generalization performance to new settings.
We propose a novel simulation-based training pipeline for multi-view human mesh recovery.
arXiv Detail & Related papers (2022-12-10T06:28:29Z) - One-Time Model Adaptation to Heterogeneous Clients: An Intra-Client and Inter-Image Attention Design [40.97593636235116]
We propose plugging a new intra-client and inter-image attention (ICIIA) module into existing backbone recognition models.
In particular, given a target image from a certain client, ICIIA introduces multi-head self-attention to retrieve relevant images from the client's historical unlabeled images.
We evaluate ICIIA using 3 different recognition tasks with 9 backbone models over 5 representative datasets.
arXiv Detail & Related papers (2022-11-11T15:33:21Z) - Scale Attention for Learning Deep Face Representation: A Study Against Visual Scale Variation [69.45176408639483]
We reform the conv layer by resorting to the scale-space theory.
We build a novel style named SCale AttentioN Conv Neural Network (SCAN-CNN).
As a single-shot scheme, the inference is more efficient than multi-shot fusion.
arXiv Detail & Related papers (2022-09-19T06:35:04Z) - X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation [71.51719469058666]
We propose a representation learning framework called X-Learner.
X-Learner learns the universal feature of multiple vision tasks supervised by various sources.
X-Learner achieves strong performance on different tasks without extra annotations, modalities and computational costs.
arXiv Detail & Related papers (2022-03-16T17:23:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.