Referring Video Object Segmentation via Language-aligned Track Selection
- URL: http://arxiv.org/abs/2412.01136v1
- Date: Mon, 02 Dec 2024 05:20:35 GMT
- Title: Referring Video Object Segmentation via Language-aligned Track Selection
- Authors: Seongchan Kim, Woojeong Jin, Sangbeom Lim, Heeji Yoon, Hyunwook Choi, Seungryong Kim,
- Abstract summary: Referring Video Object (RVOS) seeks to segment objects throughout a video based on natural language expressions.
inconsistent mask tracks can disrupt vision-language alignment, leading to suboptimal performance.
We present Selection by Object Language Alignment (SOLA), a novel framework that reformulates RVOS into two sub-problems, track generation and track selection.
- Score: 30.226373787454833
- License:
- Abstract: Referring Video Object Segmentation (RVOS) seeks to segment objects throughout a video based on natural language expressions. While existing methods have made strides in vision-language alignment, they often overlook the importance of robust video object tracking, where inconsistent mask tracks can disrupt vision-language alignment, leading to suboptimal performance. In this work, we present Selection by Object Language Alignment (SOLA), a novel framework that reformulates RVOS into two sub-problems, track generation and track selection. In track generation, we leverage a vision foundation model, Segment Anything Model 2 (SAM2), which generates consistent mask tracks across frames, producing reliable candidates for both foreground and background objects. For track selection, we propose a light yet effective selection module that aligns visual and textual features while modeling object appearance and motion within video sequences. This design enables precise motion modeling and alignment of the vision language. Our approach achieves state-of-the-art performance on the challenging MeViS dataset and demonstrates superior results in zero-shot settings on the Ref-Youtube-VOS and Ref-DAVIS datasets. Furthermore, SOLA exhibits strong generalization and robustness in corrupted settings, such as those with added Gaussian noise or motion blur. Our project page is available at https://cvlab-kaist.github.io/SOLA
Related papers
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation [52.337472185022136]
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description.
We propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation.
We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art consistency.
arXiv Detail & Related papers (2025-01-06T14:49:26Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - Exploring the Design Space of Visual Context Representation in Video MLLMs [102.11582556690388]
Video Multimodal Large Language Models (MLLMs) have shown remarkable capability of understanding the video semantics on various downstream tasks.
Visual context representation refers to the scheme to select frames from a video and further select the tokens from a frame.
In this paper, we explore the design space for visual context representation, and aim to improve the performance of video MLLMs by finding more effective representation schemes.
arXiv Detail & Related papers (2024-10-17T15:59:52Z) - Unifying Visual and Vision-Language Tracking via Contrastive Learning [34.49865598433915]
Single object tracking aims to locate the target object in a video sequence according to different modal references.
Due to the gap between different modalities, most existing trackers are designed for single or partial of these reference settings.
We present a unified tracker called UVLTrack, which can simultaneously handle all three reference settings.
arXiv Detail & Related papers (2024-01-20T13:20:54Z) - MeViS: A Large-scale Benchmark for Video Segmentation with Motion
Expressions [93.35942025232943]
We propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments.
The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms.
arXiv Detail & Related papers (2023-08-16T17:58:34Z) - Tracking Anything in High Quality [63.63653185865726]
HQTrack is a framework for High Quality Tracking anything in videos.
It consists of a video multi-object segmenter (VMOS) and a mask refiner (MR)
arXiv Detail & Related papers (2023-07-26T06:19:46Z) - STOA-VLP: Spatial-Temporal Modeling of Object and Action for
Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z) - Rethinking Cross-modal Interaction from a Top-down Perspective for
Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z) - Siamese Tracking with Lingual Object Constraints [28.04334832366449]
This paper explores, tracking visual objects subjected to additional lingual constraints.
Differently from Li et al., we impose additional lingual constraints upon tracking, which enables new applications of tracking.
Our method enables the selective compression of videos, based on the validity of the constraint.
arXiv Detail & Related papers (2020-11-23T20:55:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.