I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation
- URL: http://arxiv.org/abs/2509.10334v1
- Date: Fri, 12 Sep 2025 15:14:19 GMT
- Title: I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation
- Authors: Jordan Sassoon, Michal Szczepanski, Martyna Poreba
- Abstract summary: Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision. We introduce I-Segmenter, the first fully integer-only ViT segmentation framework. I-Segmenter achieves competitive accuracy even in one-shot PTQ with a single calibration image.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter, the first fully integer-only ViT segmentation framework. Building on the Segmenter architecture, I-Segmenter systematically replaces floating-point operations with integer-only counterparts. To further stabilize both training and inference, we propose $\lambda$-ShiftGELU, a novel activation function that mitigates the limitations of uniform quantization in handling long-tailed activation distributions. In addition, we remove the L2 normalization layer and replace bilinear interpolation in the decoder with nearest-neighbor upsampling, ensuring integer-only execution throughout the computational graph. Extensive experiments show that I-Segmenter achieves accuracy within a reasonable margin of its FP32 baseline (5.1% on average), while reducing model size by up to 3.8x and enabling up to 1.2x faster inference with optimized runtimes. Notably, even in one-shot PTQ with a single calibration image, I-Segmenter delivers competitive accuracy, underscoring its practicality for real-world deployment.
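As a minimal illustration of two ideas from the abstract, the sketch below quantizes a feature map with a one-image calibration and then upsamples it entirely in the integer domain. Nearest-neighbor upsampling needs only integer index arithmetic, whereas bilinear interpolation requires fractional weights. The function names and the symmetric max-calibration scheme are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def calibrate_scale(activations: np.ndarray, n_bits: int = 8) -> float:
    # One-shot PTQ in its simplest form: derive a uniform quantization
    # scale from the activation range of a single calibration input
    # (symmetric max calibration, used here purely as an illustration).
    qmax = 2 ** (n_bits - 1) - 1  # 127 for int8
    return float(np.max(np.abs(activations))) / qmax

def nn_upsample_int(x: np.ndarray, scale: int) -> np.ndarray:
    # Nearest-neighbor upsampling with integer index math only: each
    # output pixel copies input pixel (i // scale, j // scale), so the
    # quantized int8 values pass through unchanged -- no floating-point
    # interpolation weights, unlike bilinear upsampling.
    h, w, _ = x.shape
    rows = np.arange(h * scale) // scale
    cols = np.arange(w * scale) // scale
    return x[rows[:, None], cols, :]

# Quantize a toy FP32 feature map (H=2, W=2, C=1) with a one-image
# calibration, then upsample it in the integer domain.
feat = np.array([[[-1.27], [0.4]], [[0.8], [1.27]]], dtype=np.float32)
s = calibrate_scale(feat)                # ~0.01 for this input
q = np.round(feat / s).astype(np.int8)   # int8 feature map
up = nn_upsample_int(q, 2)               # shape (4, 4, 1), still int8
```

Note how the upsampled tensor keeps its int8 dtype end to end, which is the property the paper exploits when it swaps bilinear interpolation out of the decoder.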
Related papers
- Fast SAM2 with Text-Driven Token Pruning [52.8350457627401]
Segment Anything Model 2 (SAM2), a vision computation model, has significantly advanced prompt-driven video object segmentation. SAM2 pipelines propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object. We introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation.
arXiv Detail & Related papers (2025-12-24T18:59:05Z)
- Binary-Gaussian: Compact and Progressive Representation for 3D Gaussian Segmentation [83.90109373769614]
3D Gaussian Splatting (3D-GS) has emerged as an efficient 3D representation and a promising foundation for semantic tasks like segmentation. We propose a coarse-to-fine binary encoding scheme for per-Gaussian category representation, which compresses each feature into a single integer via a binary-to-decimal mapping. We further design a progressive training strategy that decomposes panoptic segmentation into a series of independent sub-tasks, reducing inter-class conflicts and thereby enhancing fine-grained segmentation capability.
arXiv Detail & Related papers (2025-11-30T15:51:30Z)
- GSPN-2: Efficient Parallel Sequence Modeling [101.33780567131716]
Generalized Spatial Propagation Network (GSPN) replaces quadratic self-attention with a line-scan propagation scheme. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications.
arXiv Detail & Related papers (2025-11-28T07:26:45Z)
- Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing [8.705453442427585]
Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks. Their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding. This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices.
arXiv Detail & Related papers (2025-11-06T02:55:07Z)
- MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception, but high computational cost limits adoption on resource-constrained platforms. We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z)
- An Efficient Dual-Line Decoder Network with Multi-Scale Convolutional Attention for Multi-organ Segmentation [5.6873464177873245]
This paper introduces an efficient dual-line decoder segmentation network (EDLDNet). The proposed method features a noisy decoder, which learns to incorporate structured perturbation at training time for better model robustness. By leveraging multi-scale segmentation masks from both decoders, we also utilize a mutation-based loss function to enhance the model's generalization.
arXiv Detail & Related papers (2025-08-23T12:34:27Z)
- FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation [1.4525238046020867]
Open-vocabulary semantic segmentation aims to segment objects from arbitrary text categories without requiring densely annotated datasets. We present FA-Seg, a training-free framework for open-vocabulary segmentation based on diffusion models.
arXiv Detail & Related papers (2025-06-29T16:41:41Z)
- Transformers with Joint Tokens and Local-Global Attention for Efficient Human Pose Estimation [34.99437411281915]
This paper proposes two ViT-based models for accurate, efficient, and robust 2D pose estimation. Experiments on six benchmarks demonstrate that the proposed methods significantly outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-02-28T22:34:22Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions. With efficiency being a high priority for scaling such models, we observed that the state-of-the-art method Mask2Former spends 50% of its compute on the transformer encoder alone. This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers [76.13755422671822]
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework.
We introduce a novel Attention-to-Mask (atm) module to design a lightweight decoder effective for plain ViT.
Our decoder outperforms the popular UPerNet decoder across various ViT backbones while consuming only about 5% of the computational cost.
arXiv Detail & Related papers (2023-06-09T22:29:56Z)
- Lightweight and Progressively-Scalable Networks for Semantic Segmentation [100.63114424262234]
Multi-scale learning frameworks have been regarded as a capable class of models to boost semantic segmentation.
In this paper, we thoroughly analyze the design of convolutional blocks and the ways of interactions across multiple scales.
We devise Lightweight and Progressively-Scalable Networks (LPS-Net), which expand network complexity in a greedy manner.
arXiv Detail & Related papers (2022-07-27T16:00:28Z)
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves a global receptive field together with multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z)
- InverseForm: A Loss Function for Structured Boundary-Aware Segmentation [80.39674800972182]
We present a novel boundary-aware loss term for semantic segmentation using an inverse-transformation network.
This plug-in loss term complements the cross-entropy loss in capturing boundary transformations.
We analyze the quantitative and qualitative effects of our loss function on three indoor and outdoor segmentation benchmarks.
arXiv Detail & Related papers (2021-04-06T18:52:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.