SVAC: Scaling Is All You Need For Referring Video Object Segmentation
- URL: http://arxiv.org/abs/2509.24109v1
- Date: Sun, 28 Sep 2025 23:02:09 GMT
- Title: SVAC: Scaling Is All You Need For Referring Video Object Segmentation
- Authors: Li Zhang, Haoxiang Gao, Zhihao Zhang, Luoxiao Huang, Tao Zhang,
- Abstract summary: Referring Video Object Segmentation (RVOS) aims to segment target objects in video sequences based on natural language descriptions. Recent advances in Multi-modal Large Language Models (MLLMs) have improved RVOS performance through enhanced text-video understanding. We propose SVAC, a unified model that improves RVOS by scaling input frames and segmentation tokens.
- Score: 6.940369414261821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects in video sequences based on natural language descriptions. While recent advances in Multi-modal Large Language Models (MLLMs) have improved RVOS performance through enhanced text-video understanding, several challenges remain, including insufficient exploitation of MLLMs' prior knowledge, prohibitive computational and memory costs for long-duration videos, and inadequate handling of complex temporal dynamics. In this work, we propose SVAC, a unified model that improves RVOS by scaling up input frames and segmentation tokens to enhance video-language interaction and segmentation precision. To address the resulting computational challenges, SVAC incorporates the Anchor-Based Spatio-Temporal Compression (ASTC) module to compress visual tokens while preserving essential spatio-temporal structure. Moreover, the Clip-Specific Allocation (CSA) strategy is introduced to better handle dynamic object behaviors across video clips. Experimental results demonstrate that SVAC achieves state-of-the-art performance on multiple RVOS benchmarks with competitive efficiency. Our code is available at https://github.com/lizhang1998/SVAC.
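The abstract names the ASTC module but gives no implementation details. The sketch below is a hypothetical illustration, not the authors' code: it assumes per-frame visual tokens from a ViT-style encoder arranged in a square grid, and compresses non-anchor frames by simple spatial pooling. The function and parameter names (`astc_compress`, `anchor_stride`, `pool`) are invented for this example.

```python
import torch
import torch.nn.functional as F


def astc_compress(frame_tokens: torch.Tensor, anchor_stride: int = 4, pool: int = 2) -> torch.Tensor:
    """Toy anchor-based spatio-temporal compression (hypothetical, not the SVAC implementation).

    frame_tokens: (T, N, D) visual tokens for T frames with N tokens each,
    where N is a square spatial grid (e.g. 16x16 = 256).
    Every `anchor_stride`-th frame is an anchor and keeps all of its tokens;
    the remaining frames are spatially average-pooled, shrinking the total token count.
    """
    T, N, D = frame_tokens.shape
    side = int(N ** 0.5)  # assume a square token grid
    compressed = []
    for t in range(T):
        tok = frame_tokens[t]                                             # (N, D)
        if t % anchor_stride == 0:                                        # anchor frame: keep full resolution
            compressed.append(tok)
        else:                                                             # non-anchor frame: pool the spatial grid
            grid = tok.view(side, side, D).permute(2, 0, 1).unsqueeze(0)  # (1, D, side, side)
            pooled = F.avg_pool2d(grid, pool)                             # (1, D, side/pool, side/pool)
            compressed.append(pooled.flatten(2).squeeze(0).t())           # (N/pool^2, D)
    return torch.cat(compressed, dim=0)                                   # concatenated compressed token sequence


# Example: 32 frames x 256 tokens x 1024 dims -> far fewer tokens
tokens = torch.randn(32, 256, 1024)
out = astc_compress(tokens)
print(tokens.shape[0] * tokens.shape[1], "tokens ->", out.shape[0])  # 8192 -> 3584
```

This toy version only captures the intuition that anchor frames retain full spatial detail while intermediate frames contribute coarser tokens, so the token budget grows sub-linearly with the number of input frames; the paper's actual ASTC design and the Clip-Specific Allocation strategy are not reproduced here.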
Related papers
- VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs [28.026438743789907]
VideoScaffold is a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. The framework is modular and plug-and-play, seamlessly extending existing image-based MLLMs to continuous video comprehension.
arXiv Detail & Related papers (2025-12-23T03:33:45Z) - SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction [65.15449703659772]
Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. We propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state of the art in concept-aware video object segmentation.
arXiv Detail & Related papers (2025-07-21T17:59:02Z) - AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding [73.60257070465377]
AdaVideoRAG is a novel framework that adapts retrieval based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.
arXiv Detail & Related papers (2025-06-16T15:18:15Z) - CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos [59.391265901911005]
We propose CoT-RVS, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address complex challenges by temporal-semantic reasoning. CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses a corresponding frame for each object in which it can be observed effortlessly among all frames (temporal). Our framework's training-free nature further allows its extension to online video streams, where the CoT is used at test time to update the object of interest when a better target starts to emerge.
arXiv Detail & Related papers (2025-05-24T07:01:31Z) - One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos [41.34787907803329]
VideoLISA is a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos.
VideoLISA generates temporally consistent segmentation masks in videos based on language instructions.
arXiv Detail & Related papers (2024-09-29T07:47:15Z) - VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z) - Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse point and box tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z) - Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for Referring Video Object Segmentation [44.952526831843386]
We propose a correlation-driven inter-frame interaction Transformer, dubbed BIFIT, to address these issues in RVOS.
Specifically, we design a lightweight plug-and-play inter-frame interaction module in the decoder.
A vision-language interaction is implemented before the Transformer to facilitate the correlation between the visual and linguistic features.
arXiv Detail & Related papers (2023-07-02T10:29:35Z) - SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z) - Learning Video Object Segmentation from Unlabeled Videos [158.18207922363783]
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos.
We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures properties of VOS at multiple granularities.
arXiv Detail & Related papers (2020-03-10T22:12:15Z)