MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
- URL: http://arxiv.org/abs/2510.15026v1
- Date: Thu, 16 Oct 2025 18:00:00 GMT
- Title: MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
- Authors: Mattia Segu, Marta Tintore Gazulla, Yongqin Xian, Luc Van Gool, Federico Tombari,
- Abstract summary: Scaling up model size and training data has advanced foundation models for instance-level perception.<n>High computational cost limits adoption on resource-constrained platforms.<n>We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
- Score: 91.90342432541138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
Related papers
- Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing [8.705453442427585]
Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks.<n>Their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding.<n>This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices.
arXiv Detail & Related papers (2025-11-06T02:55:07Z) - CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks.<n>In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks.<n>To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z) - OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation [91.45421429922506]
OneCAT is a unified multimodal model that seamlessly integrates understanding, generation, and editing.<n>Our framework eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference.
arXiv Detail & Related papers (2025-09-03T17:29:50Z) - CFMD: Dynamic Cross-layer Feature Fusion for Salient Object Detection [7.262250906929891]
Cross-layer feature pyramid networks (CFPNs) have achieved notable progress in multi-scale feature fusion and boundary detail preservation for salient object detection.<n>To address these challenges, we propose CFMD, a novel cross-layer feature pyramid network that introduces two key innovations.<n>First, we design a context-aware feature aggregation module (CFLMA), which incorporates the state-of-the-art Mamba architecture to construct a dynamic weight distribution mechanism.<n>Second, we introduce an adaptive dynamic upsampling unit (CFLMD) that preserves spatial details during resolution recovery.
arXiv Detail & Related papers (2025-04-02T03:22:36Z) - ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships.<n>Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands.<n>We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z) - Resource Management for Low-latency Cooperative Fine-tuning of Foundation Models at the Network Edge [35.40849522296486]
Large-scale foundation models (FoMos) can perform human-like intelligence.
FoMos need to be adapted to specialized downstream tasks through fine-tuning techniques.
We advocate multi-device cooperation within the device-edge cooperative fine-tuning paradigm.
arXiv Detail & Related papers (2024-07-13T12:47:14Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the emphexit point, emphexit point, and emphcompressing bits by soft policy iterations.
Based on the latency and accuracy aware reward design, such an computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks.
specially trained CNNs that employ parametrised early exits along their depth to save during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.