Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models
- URL: http://arxiv.org/abs/2508.07144v1
- Date: Sun, 10 Aug 2025 02:27:06 GMT
- Title: Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models
- Authors: Xuanhan Wang, Huimin Deng, Ke Liu, Jun Wang, Lianli Gao, Jingkuan Song
- Abstract summary: We propose Dynamic Pattern Alignment Learning (DPAL) to efficiently train lightweight human-centric vision models.
DPAL guides lightweight HVMs to learn all typical human visual patterns from large HVMs, enabling generalization to various human-centric vision tasks.
Extensive experiments conducted on 15 challenging datasets demonstrate the effectiveness of DPAL.
- Score: 84.30626369903221
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Human-centric vision models (HVMs) have achieved remarkable generalization due to large-scale pretraining on massive person images. However, their dependence on large neural architectures and the restricted accessibility of pretraining data significantly limit their practicality in real-world applications. To address this limitation, we propose Dynamic Pattern Alignment Learning (DPAL), a novel distillation-based pretraining framework that efficiently trains lightweight HVMs to acquire strong generalization from large HVMs. In particular, human-centric visual perception is highly dependent on three typical visual patterns: a global identity pattern, a local shape pattern, and a multi-person interaction pattern. To achieve generalizable lightweight HVMs, we first design a dynamic pattern decoder (D-PaDe) that acts as a dynamic Mixture-of-Experts (MoE) model. It incorporates three specialized experts dedicated to adaptively extracting these typical visual patterns, conditioned on both the input image and pattern queries. We then present three levels of alignment objectives, which aim to minimize the generalization gap between lightweight HVMs and large HVMs at the global image level, the local pixel level, and the instance relation level. With these two deliberate designs, DPAL effectively guides the lightweight model to learn all typical human visual patterns from the large HVM, enabling generalization to various human-centric vision tasks. Extensive experiments conducted on 15 challenging datasets demonstrate the effectiveness of DPAL. Remarkably, when employing PATH-B as the teacher, DPAL-ViT/Ti (5M parameters) achieves generalizability comparable to existing large HVMs such as PATH-B (84M) and Sapiens-L (307M), and outperforms previous distillation-based pretraining methods, including Proteus-ViT/Ti (5M) and TinyMiM-ViT/Ti (5M), by a large margin.
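To make the abstract's two designs concrete, below is a minimal PyTorch sketch of how a D-PaDe-style mixture of three pattern experts and the three alignment levels could be wired together. Every module name, dimension, gating choice, and loss form here is an illustrative assumption; the paper's actual decoder, routing, and objectives are not published on this page and may differ.

```python
# Illustrative sketch only: names, shapes, and routing are assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicPatternDecoder(nn.Module):
    """Toy D-PaDe-style decoder: three experts (global identity, local
    shape, multi-person interaction) gated on image context + a learned
    pattern query per expert."""
    def __init__(self, dim=192, n_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(n_experts)]
        )
        self.pattern_queries = nn.Parameter(torch.randn(n_experts, dim))
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        ctx = tokens.mean(dim=1)                     # (B, dim) image context
        outs = []
        for q, expert in zip(self.pattern_queries, self.experts):
            q_b = q.expand(ctx.size(0), -1)          # broadcast query to batch
            w = torch.sigmoid(self.gate(torch.cat([ctx, q_b], dim=-1)))  # (B, 1)
            outs.append(w.unsqueeze(1) * expert(tokens))  # weighted expert output
        return torch.stack(outs).sum(0)              # (B, N, dim)

def alignment_losses(student_tok, teacher_tok):
    """Three assumed alignment levels: global (pooled embedding), local
    (per-token), and relation (token-token similarity matrices)."""
    g = F.mse_loss(student_tok.mean(1), teacher_tok.mean(1))   # global image level
    l = F.mse_loss(student_tok, teacher_tok)                   # local pixel/token level
    s_rel = F.normalize(student_tok, dim=-1) @ F.normalize(student_tok, dim=-1).transpose(1, 2)
    t_rel = F.normalize(teacher_tok, dim=-1) @ F.normalize(teacher_tok, dim=-1).transpose(1, 2)
    r = F.mse_loss(s_rel, t_rel)                               # relation level
    return g + l + r

# Usage with random features standing in for student/teacher ViT tokens
# (the teacher is assumed already projected to the student's width).
decoder = DynamicPatternDecoder(dim=192)
student = decoder(torch.randn(2, 196, 192))
teacher = torch.randn(2, 196, 192)
loss = alignment_losses(student, teacher)
loss.backward()
```

Note that the relation term above compares token-token similarity matrices as a stand-in for the paper's instance-level relations, which would require person-instance features rather than raw tokens.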
Related papers
- Modeling Cross-vision Synergy for Unified Large Vision Model [130.37489011094036]
PolyV is a unified large vision model that achieves cross-vision synergy at both the architectural and training levels.
PolyV consistently outperforms existing models, achieving over 10% average improvement over its backbone.
arXiv Detail & Related papers (2026-03-03T22:44:43Z)
- Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation [0.0]
Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic.
We propose a dual-teacher contrastive distillation framework for multispectral imagery.
Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning.
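As a rough illustration of the dual-teacher idea (a hedged sketch; the embeddings, dimensions, and loss below are assumptions, not this paper's code), a student embedding can be contrastively matched to both teachers with an InfoNCE-style objective:

```python
# Hedged sketch of dual-teacher contrastive distillation; encoders are
# replaced by random embeddings and all dims are placeholders.
import torch
import torch.nn.functional as F

def info_nce(student, teacher, tau=0.07):
    """Match each student embedding to its paired teacher embedding
    against the rest of the batch (contrastive distillation)."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    logits = s @ t.t() / tau                   # (B, B) similarity matrix
    labels = torch.arange(s.size(0))           # positives on the diagonal
    return F.cross_entropy(logits, labels)

B, D = 32, 256
student_emb = torch.randn(B, D, requires_grad=True)
ms_teacher_emb = torch.randn(B, D)             # multispectral teacher output
vfm_teacher_emb = torch.randn(B, D)            # optical VFM teacher output
loss = info_nce(student_emb, ms_teacher_emb) + info_nce(student_emb, vfm_teacher_emb)
loss.backward()
```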
arXiv Detail & Related papers (2026-02-23T14:09:01Z)
- LVLM-Aided Alignment of Task-Specific Vision Models [49.96265491629163]
Small task-specific vision models are crucial in high-stakes domains.
We introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge.
Our method demonstrates substantial improvement in aligning model behavior with human specifications.
arXiv Detail & Related papers (2025-12-26T11:11:25Z)
- Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding [8.090058633054852]
We introduce a plug-and-play module that implicitly injects 3D geometry features into Vision-Language-Action (VLA) models.
Our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
arXiv Detail & Related papers (2025-07-01T04:05:47Z)
- One RL to See Them All: Visual Triple Unified Reinforcement Learning [92.90120580989839]
We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables visual reasoning and perception tasks within a single training pipeline.
V-Triune comprises three complementary components, including a Sample-Level Datashelf (to unify diverse task inputs) and a Verifier-Level Reward (to deliver custom rewards via specialized verifiers).
We introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune.
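The summary does not specify the reward's mechanics; one plausible reading, sketched below under that assumption, is a binary IoU reward whose passing threshold tightens as training progresses (the schedule and reward values are illustrative, not V-Triune's):

```python
# Hedged sketch of a "dynamic IoU"-style reward with an annealed threshold.
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred_box, gt_box, step, total_steps,
                       start_thr=0.5, end_thr=0.95):
    """Binary reward against a threshold that anneals from loose to strict,
    giving progressive feedback early and definite feedback late."""
    thr = start_thr + (end_thr - start_thr) * (step / total_steps)
    return 1.0 if box_iou(pred_box, gt_box) >= thr else 0.0

# Early in training a rough box still earns reward (IoU 0.81 >= 0.5).
print(dynamic_iou_reward((0, 0, 10, 10), (1, 1, 10, 10), step=0, total_steps=1000))
```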
arXiv Detail & Related papers (2025-05-23T17:41:14Z)
- Scale-Aware Pre-Training for Human-Centric Visual Perception: Enabling Lightweight and Generalizable Models [88.3233363693087]
We introduce Scale-Aware Image Pretraining (SAIP) to acquire general patterns for human-centric visual perception.
SAIP incorporates three learning objectives based on the principle of cross-scale consistency.
Experiments conducted across 12 HVP datasets demonstrate that SAIP exhibits remarkable capabilities across 9 human-centric vision tasks.
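A minimal sketch of what a cross-scale consistency objective could look like, assuming a shared encoder and cosine agreement between two resolutions of the same image (SAIP's actual three objectives are not detailed on this page):

```python
# Hedged sketch: the toy conv encoder stands in for a real backbone, and
# the scale factor and cosine loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1),
                        nn.GELU(),
                        nn.AdaptiveAvgPool2d(1),
                        nn.Flatten())                # stand-in for a ViT

def cross_scale_consistency(img, scale=0.5):
    """Encourage features of the same image at two scales to agree."""
    small = F.interpolate(img, scale_factor=scale, mode="bilinear",
                          align_corners=False)
    f_full = F.normalize(encoder(img), dim=-1)
    f_small = F.normalize(encoder(small), dim=-1)
    return 1.0 - (f_full * f_small).sum(-1).mean()   # cosine distance

loss = cross_scale_consistency(torch.randn(4, 3, 224, 224))
loss.backward()
```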
arXiv Detail & Related papers (2025-03-11T09:12:51Z)
- Building 6G Radio Foundation Models with Transformer Architectures [6.70088826174291]
Foundation deep learning (DL) models are designed to learn general, robust and adaptable representations of their target modality.
These models are pretrained on large, unlabeled datasets using self-supervised learning (SSL).
We propose and demonstrate the effectiveness of a Vision Transformer (ViT) as a radio foundation model for spectrogram learning.
arXiv Detail & Related papers (2024-11-15T07:01:44Z)
- DMT: Comprehensive Distillation with Multiple Self-supervised Teachers [27.037140667247208]
We introduce Comprehensive Distillation with Multiple Self-supervised Teachers (DMT) for pretrained model compression.
Our experimental results on prominent benchmark datasets show that the proposed method significantly surpasses state-of-the-art competitors.
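For intuition, here is a hedged sketch of multi-teacher feature distillation, with per-teacher projection heads regressing student features onto each teacher (names, dimensions, and the loss form are assumptions, not DMT's released code):

```python
# Hedged sketch: random features stand in for real student/teacher encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

student_dim, teacher_dims = 192, [768, 1024]   # e.g. ViT-Ti vs. two SSL teachers
heads = nn.ModuleList([nn.Linear(student_dim, d) for d in teacher_dims])

def multi_teacher_loss(student_feat, teacher_feats):
    """Sum of per-teacher regression losses through separate heads."""
    return sum(F.smooth_l1_loss(h(student_feat), t)
               for h, t in zip(heads, teacher_feats))

s = torch.randn(8, student_dim, requires_grad=True)
ts = [torch.randn(8, d) for d in teacher_dims]
loss = multi_teacher_loss(s, ts)
loss.backward()
```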
arXiv Detail & Related papers (2023-12-19T08:31:30Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.