Temporal Collection and Distribution for Referring Video Object Segmentation
- URL: http://arxiv.org/abs/2309.03473v1
- Date: Thu, 7 Sep 2023 04:22:02 GMT
- Title: Temporal Collection and Distribution for Referring Video Object Segmentation
- Authors: Jiajin Tang, Ge Zheng, Sibei Yang
- Abstract summary: Referring video object segmentation aims to segment a referent throughout a video sequence according to a natural language expression.
We propose to simultaneously maintain a global referent token and a sequence of object queries.
We show that our method outperforms state-of-the-art methods on all benchmarks consistently and significantly.
- Score: 14.886278504056063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video object segmentation aims to segment a referent throughout a
video sequence according to a natural language expression. It requires aligning
the natural language expression with the objects' motions and their dynamic
associations at the global video level but segmenting objects at the frame
level. To achieve this goal, we propose to simultaneously maintain a global
referent token and a sequence of object queries, where the former is
responsible for capturing the video-level referent according to the language
expression, while the latter serves to better locate and segment objects
within each frame. Furthermore, to explicitly capture object motions and
perform spatial-temporal cross-modal reasoning over objects, we propose a
novel temporal collection-distribution mechanism for interaction between the
global referent token and the object queries. Specifically, the temporal
collection mechanism collects global information for the referent token from
the object queries, relating objects' temporal motions to the language
expression. In turn, the
temporal distribution first distributes the referent token to the referent
sequence across all frames and then performs efficient cross-frame reasoning
between the referent sequence and object queries in every frame. Experimental
results show that our method outperforms state-of-the-art methods on all
benchmarks consistently and significantly.
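To make the collection-distribution interaction concrete, here is a minimal
sketch of how such a mechanism could be wired from standard attention blocks.
All module names, tensor shapes, and the single-block layout are illustrative
assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TemporalCollectionDistribution(nn.Module):
    """Illustrative sketch of one collection-distribution step between a
    global referent token and per-frame object queries (assumed shapes)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Collection: the referent token gathers video-level evidence.
        self.collect = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Distribution: object queries reason with a per-frame referent copy.
        self.distribute = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, referent, queries, lang):
        # referent: (B, 1, C)    global referent token
        # queries:  (B, T, N, C) N object queries for each of T frames
        # lang:     (B, L, C)    language expression features
        B, T, N, C = queries.shape

        # Temporal collection: the referent token attends jointly over all
        # frames' object queries and the language tokens.
        context = torch.cat([queries.reshape(B, T * N, C), lang], dim=1)
        collected, _ = self.collect(referent, context, context)
        referent = self.norm1(referent + collected)

        # Temporal distribution: broadcast the referent token into a
        # per-frame referent sequence, then let each frame's object
        # queries attend to it for cross-frame reasoning.
        ref_seq = referent.expand(B, T, C).reshape(B * T, 1, C)
        frame_q = queries.reshape(B * T, N, C)
        distributed, _ = self.distribute(frame_q, ref_seq, ref_seq)
        queries = self.norm2(frame_q + distributed).reshape(B, T, N, C)
        return referent, queries
```

A real model would presumably stack several such blocks and interleave them
with per-frame self-attention, but the collect-then-distribute order above
mirrors the mechanism described in the abstract.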
Related papers
- Object-Centric Framework for Video Moment Retrieval [15.916994168542345]
Most existing moment retrieval methods rely on temporal sequences of frame-level features that primarily encode global visual and semantic information.
In particular, temporal dynamics at the object level have been largely overlooked, limiting existing approaches in scenarios requiring object-level reasoning.
Our method first extracts query-relevant objects and then constructs scene graphs from video frames to represent these objects and their relationships.
Based on the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information.
These sequences are processed by a video tracklet transformer, which models relational and temporal dynamics among objects over time.
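As a rough illustration of the object-level sequence idea, the sketch below encodes hypothetical tracklet feature sequences with a temporal transformer and scores them against a query embedding. The shapes, the pooling of scene-graph nodes into tracklet features, and the max-over-objects aggregation are all assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TrackletTransformer(nn.Module):
    """Illustrative sketch: encode each object tracklet's temporal feature
    sequence, then score frames against the query (assumed shapes)."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tracklets, query):
        # tracklets: (K, T, C) K object tracklets over T frames, e.g.
        #            pooled node features from per-frame scene graphs
        # query:     (C,)      sentence embedding of the text query
        encoded = self.temporal(tracklets)            # (K, T, C)
        # Per-frame relevance of each tracklet to the query.
        scores = torch.einsum("ktc,c->kt", encoded, query)
        # Max over objects gives a frame-level relevance curve, from
        # which a moment could be read off by thresholding.
        return scores.max(dim=0).values               # (T,)
```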
arXiv Detail & Related papers (2025-12-20T17:44:53Z)
- SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding [48.64661382961745]
We introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos.
To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs.
Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes.
arXiv Detail & Related papers (2025-10-14T22:10:49Z)
- LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation [14.277537679679101]
We propose an effective long-range temporal context attention (LTCA) mechanism to aggregate global context information into object features.
We show our method achieves new state-of-the-art on four referring video segmentation benchmarks.
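The summary gives only the mechanism's name, so the following is a speculative sketch of one way long-range temporal context could be aggregated into object features: each frame's object features cross-attend over a strided sample of all frames' global descriptors. The striding scheme, shapes, and module layout are assumptions.

```python
import torch
import torch.nn as nn

class LongRangeTemporalAttention(nn.Module):
    """Illustrative sketch: per-frame object features attend over a
    strided, long-range sample of frame descriptors (assumed layout)."""

    def __init__(self, dim=256, heads=8, stride=4):
        super().__init__()
        self.stride = stride
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_feats, frame_feats):
        # obj_feats:   (B, T, N, C) object features per frame
        # frame_feats: (B, T, C)    one global descriptor per frame
        B, T, N, C = obj_feats.shape
        # Long-range context: every `stride`-th frame across the video.
        context = frame_feats[:, ::self.stride]              # (B, T', C)
        context = context.unsqueeze(1).expand(B, T, -1, -1)  # (B, T, T', C)
        q = obj_feats.reshape(B * T, N, C)
        kv = context.reshape(B * T, -1, C)
        out, _ = self.attn(q, kv, kv)
        return self.norm(obj_feats + out.reshape(B, T, N, C))
```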
arXiv Detail & Related papers (2025-10-09T14:55:52Z)
- Temporal Prompting Matters: Rethinking Referring Video Object Segmentation [64.82333675385802]
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video.
Most existing methods require end-to-end training with dense mask annotations.
We propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors.
arXiv Detail & Related papers (2025-10-08T17:59:57Z)
- Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation [17.238084264485988]
Referring Video Object Segmentation (RVOS) aims to segment an object of interest throughout a video based on a language description.
PARSE-VOS is a training-free framework powered by Large Language Models (LLMs).
PARSE-VOS achieved state-of-the-art performance on three major benchmarks: Ref-YouTube-VOS, Ref-DAVIS17, and MeViS.
arXiv Detail & Related papers (2025-09-06T15:46:23Z)
- Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting [60.58915701973593]
We present CAT-V (Caption AnyThing in Video), a training-free framework for fine-grained object-centric video captioning.
CAT-V integrates three key components: a Segmenter based on SAMI for precise object segmentation across frames, a Temporal Analyzer powered by TRACE-UniVL, and a Captioner using Intern-2.5.
Our framework generates detailed, temporally-aware descriptions of objects' attributes, actions, statuses, interactions, and environmental contexts without requiring additional training data.
arXiv Detail & Related papers (2025-04-07T22:35:36Z)
- Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation [19.190651264839065]
Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt.
We introduce FindTrack, a novel decoupled framework that separates target identification from mask propagation.
We demonstrate that FindTrack outperforms existing methods on public benchmarks.
arXiv Detail & Related papers (2025-03-05T13:32:49Z)
- Instance-Aware Generalized Referring Expression Segmentation [32.96760407482406]
InstAlign is a method that incorporates object-level reasoning into the segmentation process.
Our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.
arXiv Detail & Related papers (2024-11-22T17:28:43Z)
- One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos [41.34787907803329]
VideoLISA is a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos.
VideoLISA generates temporally consistent segmentation masks in videos based on language instructions.
arXiv Detail & Related papers (2024-09-29T07:47:15Z)
- OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning [95.6696714640357]
We propose a new task, 'open-world video instance segmentation and captioning'.
It requires detecting, segmenting, tracking, and describing never-before-seen objects with rich captions.
We develop an object abstractor and an object-to-text abstractor.
arXiv Detail & Related papers (2024-04-04T17:59:58Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
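As a loose illustration of sequence-level exemplar selection, the sketch below scores each flow-predicted mask by how well its masked appearance descriptor agrees with the sequence-wide average, and keeps the top-k as exemplars. The cosine-consistency criterion and all shapes are assumptions, not the paper's actual selection mechanism.

```python
import torch
import torch.nn.functional as F

def select_exemplars(masks, feats, k=5):
    """Pick the k flow-predicted masks whose masked appearance agrees
    best with the sequence-wide average (illustrative criterion)."""
    # masks: (T, H, W) binary flow-based proposals
    # feats: (T, C, H, W) per-frame appearance features
    m = masks.unsqueeze(1).float()                      # (T, 1, H, W)
    # Average appearance descriptor of the object in each frame.
    desc = (feats * m).flatten(2).sum(-1) / m.flatten(2).sum(-1).clamp(min=1)
    desc = F.normalize(desc, dim=1)                     # (T, C)
    proto = F.normalize(desc.mean(0), dim=0)            # sequence prototype
    scores = desc @ proto                               # cosine consistency
    return scores.topk(min(k, len(scores))).indices     # exemplar frame ids
```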
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment object instances referred to by a language expression across all frames of a given video.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference.
The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
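The two-stage pipeline can be summarized in a short sketch. The `detector`, `propagator`, and `grounder` callables below are hypothetical placeholders standing in for the per-frame mask detector, the mask-propagation model, and the tracklet-language grounding module; the frame-sampling rate is likewise an assumption.

```python
import torch

def top_down_rvos(video, expression, detector, propagator, grounder):
    """Two-stage sketch: build exhaustive object tracklets, then ground
    the expression against whole tracklets (hypothetical callables)."""
    # Stage 1: detect object masks on a few sampled frames and propagate
    # each detection through the whole video to form candidate tracklets.
    step = max(1, len(video) // 4)          # sample ~4 frames (assumption)
    tracklets = []
    for frame in video[::step]:
        for mask in detector(frame):        # per-frame instance masks
            tracklets.append(propagator(video, mask))  # (T, H, W) masks

    # Stage 2: the tracklet-language grounding module scores each
    # tracklet against the expression; the top one is the referent.
    scores = grounder(tracklets, expression)            # (num_tracklets,)
    return tracklets[int(torch.argmax(scores))]
```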
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
- Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS17 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)
- DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
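As a minimal illustration of language-conditioned message passing, the sketch below computes edge weights from sender, receiver, and query features, then aggregates query-conditioned messages into each node. The gating form and shapes are assumptions, not DORi's actual formulation.

```python
import torch
import torch.nn as nn

class LanguageConditionedMessagePassing(nn.Module):
    """Graph nodes exchange messages whose edge weights are conditioned
    on the language query (illustrative formulation)."""

    def __init__(self, dim=256):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)   # message from (sender, query)
        self.edge = nn.Linear(3 * dim, 1)    # weight from (recv, send, query)

    def forward(self, nodes, query):
        # nodes: (N, C) object/activity node features
        # query: (C,)   sentence embedding of the language query
        N, C = nodes.shape
        send = nodes.unsqueeze(0).expand(N, N, C)   # sender j for each i
        recv = nodes.unsqueeze(1).expand(N, N, C)   # receiver i
        q = query.expand(N, N, C)                   # broadcast query
        # Language-conditioned attention over incoming edges.
        w = torch.softmax(
            self.edge(torch.cat([recv, send, q], dim=-1)).squeeze(-1),
            dim=-1)                                 # (N, N)
        messages = self.msg(torch.cat([send, q], dim=-1))  # (N, N, C)
        return nodes + (w.unsqueeze(-1) * messages).sum(dim=1)
```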
arXiv Detail & Related papers (2020-10-13T09:50:29Z)
- Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
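A hedged sketch of the local-to-global idea: phrase-level features first attend over a small local window of the video, then over the full video, before a linear head regresses a normalized (start, end) interval. The window heuristic, shapes, and single-pass layout are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class LocalGlobalGrounding(nn.Module):
    """Phrase features gather local then global video context and
    regress a normalized time interval (illustrative layout)."""

    def __init__(self, dim=256, heads=8, window=16):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.regress = nn.Linear(dim, 2)  # normalized (start, end)

    def forward(self, phrases, video):
        # phrases: (B, P, C) mid-level features, one per semantic phrase
        # video:   (B, T, C) frame or clip features
        # "Local" context: a leading window of frames (crude stand-in for
        # whatever locality scheme the actual model uses).
        local_ctx = video[:, : self.window]
        x, _ = self.local_attn(phrases, local_ctx, local_ctx)
        x, _ = self.global_attn(x, video, video)      # full-video context
        return self.regress(x.mean(dim=1)).sigmoid()  # (B, 2) in [0, 1]
```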
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.