Human-inspired Global-to-Parallel Multi-scale Encoding for Lightweight Vision Models
- URL: http://arxiv.org/abs/2601.08190v2
- Date: Thu, 15 Jan 2026 03:38:46 GMT
- Title: Human-inspired Global-to-Parallel Multi-scale Encoding for Lightweight Vision Models
- Authors: Wei Xu,
- Abstract summary: We propose a lightweight vision network based on a Global-to-Parallel Multi-scale encoding scheme. Experiments on image classification, object detection, and semantic segmentation show that H-GPE achieves strong performance while maintaining a balanced footprint in both FLOPs and parameters, delivering a more favorable accuracy-efficiency trade-off compared with recent state-of-the-art lightweight models.
- Score: 8.505472042732375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lightweight vision networks have witnessed remarkable progress in recent years, yet achieving a satisfactory balance among parameter scale, computational overhead, and task performance remains difficult. Although many existing lightweight models manage to reduce computation considerably, they often do so at the expense of a substantial increase in parameter count (e.g., LSNet, MobileMamba), which still poses obstacles for deployment on resource-limited devices. In parallel, some studies attempt to draw inspiration from human visual perception, but their modeling tends to oversimplify the visual process, making it hard to reflect how perception truly operates. Revisiting the cooperative mechanism of the human visual system, we propose GPM (Global-to-Parallel Multi-scale Encoding). GPM first employs a Global Insight Generator (GIG) to extract holistic cues, and subsequently processes features of different scales through parallel branches: LSAE emphasizes mid-/large-scale semantic relations, while IRB (Inverted Residual Block) preserves fine-grained texture information, jointly enabling coherent representation of global and local features. As such, GPM conforms to two characteristic behaviors of human vision: perceiving the whole before focusing on details, and maintaining broad contextual awareness even during local attention. Built upon GPM, we further develop the lightweight H-GPE network. Experiments on image classification, object detection, and semantic segmentation show that H-GPE achieves strong performance while maintaining a balanced footprint in both FLOPs and parameters, delivering a more favorable accuracy-efficiency trade-off compared with recent state-of-the-art lightweight models.
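The global-then-parallel flow the abstract describes can be sketched at a shape level. Everything below is a hypothetical illustration, not the paper's implementation: global average pooling stands in for the GIG, large-window average pooling stands in for the LSAE branch, the window size `k=7` and the identity pointwise weights are placeholders, and the branch outputs are fused by simple addition.

```python
import numpy as np

def global_insight(x):
    # Hypothetical stand-in for the Global Insight Generator (GIG):
    # a global average over spatial dims, broadcast back as a holistic cue.
    g = x.mean(axis=(1, 2), keepdims=True)  # (C, 1, 1) global descriptor
    return x + g                            # inject global context everywhere

def large_scale_branch(x, k=7):
    # Stand-in for the LSAE branch: large-window average pooling as a
    # crude proxy for mid-/large-scale semantic aggregation.
    C, H, W = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[:, i, j] = xp[:, i:i + k, j:j + k].mean(axis=(1, 2))
    return out

def fine_branch(x):
    # Stand-in for the IRB path: identity-plus-pointwise channel mixing
    # keeps per-pixel, fine-grained texture information.
    C = x.shape[0]
    W1 = np.eye(C)  # placeholder pointwise weights
    return x + np.einsum("oc,chw->ohw", W1, x)

def gpm_block(x):
    x = global_insight(x)                          # perceive the whole first
    return large_scale_branch(x) + fine_branch(x)  # then parallel detail paths

x = np.random.rand(8, 16, 16)  # (channels, height, width)
y = gpm_block(x)
print(y.shape)  # (8, 16, 16): spatial resolution is preserved
```

Note how the sketch mirrors the two stated behaviors: the global cue is injected before either branch runs (whole before details), and both branches operate on globally-conditioned features (context maintained during local attention).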
Related papers
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
- DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model [22.28268642142352]
DiG (Differential Grounding) is a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of their number. Our results highlight differential grounding as a scalable and robust approach for advancing fine-grained visual reasoning in MLLMs.
arXiv Detail & Related papers (2025-12-14T10:40:27Z)
- GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images [68.33481681452675]
We propose a graph-enhanced contextual and regional perception network (GCRPNet). It builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information.
arXiv Detail & Related papers (2025-08-14T11:31:43Z)
- LGM-Pose: A Lightweight Global Modeling Network for Real-time Human Pose Estimation [14.20517652381698]
A single-branch lightweight global modeling network (LGM-Pose) is proposed to address these challenges. In the network, a lightweight MobileViM Block is designed with a proposed Lightweight Attentional Representation Module (LARM).
arXiv Detail & Related papers (2025-06-05T02:29:04Z)
- LeMoRe: Learn More Details for Lightweight Semantic Segmentation [48.81126061219231]
We introduce an efficient paradigm by synergizing explicit and implicit modeling to balance computational efficiency with representational fidelity. Our method combines well-defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, efficiently capturing global dependencies.
arXiv Detail & Related papers (2025-05-29T04:55:10Z)
- LSNet: See Large, Focus Small [67.05569159984691]
We introduce LS (Large-Small) convolution, which combines large-kernel perception and small-kernel aggregation. LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks.
arXiv Detail & Related papers (2025-03-29T16:00:54Z)
- vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition [0.0]
State-space models (SSMs) offer an alternative, but their application in vision remains underexplored. This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness. Tests on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.
arXiv Detail & Related papers (2025-03-27T08:39:58Z)
- Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models [50.98559225639266]
Sub-images with higher semantic relevance to the entire image encapsulate richer visual information for preserving the model's visual understanding ability. The Global Semantic-guided Weight Allocator (GSWA) module allocates weights to sub-images based on their relative information density. SleighVL, a lightweight yet high-performing model, outperforms models with comparable parameters and remains competitive with larger models.
arXiv Detail & Related papers (2025-01-24T06:42:06Z)
- Brain-Inspired Stepwise Patch Merging for Vision Transformers [6.108377966393714]
We propose Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better. The code has been released at https://github.com/Yonghao-Yu/StepwisePatchMerging.
arXiv Detail & Related papers (2024-09-11T03:04:46Z)
- Learning a Mini-batch Graph Transformer via Two-stage Interaction Augmentation [34.969019293698885]
Mini-batch Graph Transformer (MGT) has demonstrated significant advantages in semi-supervised node prediction tasks.
The limited number of nodes in each mini-batch restricts the model's capacity to capture the global characteristic of the graph.
We propose LGMformer, a novel MGT model that employs a two-stage augmented interaction strategy.
arXiv Detail & Related papers (2024-07-13T14:42:22Z)
- Lightweight Vision Transformer with Bidirectional Interaction [59.39874544410419]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information. Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.