OnlineRefer: A Simple Online Baseline for Referring Video Object
Segmentation
- URL: http://arxiv.org/abs/2307.09356v1
- Date: Tue, 18 Jul 2023 15:43:35 GMT
- Title: OnlineRefer: A Simple Online Baseline for Referring Video Object
Segmentation
- Authors: Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, Jianbing Shen
- Abstract summary: Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction.
Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with the text embedding.
We propose a simple yet effective online model using explicit query propagation, named OnlineRefer.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring video object segmentation (RVOS) aims at segmenting an object in a
video following human instruction. Current state-of-the-art methods fall into
an offline pattern, in which each clip independently interacts with the text
embedding for cross-modal understanding. They usually argue that the offline
pattern is necessary for RVOS, yet such models capture only limited temporal
association within each clip. In this work, we break with this offline belief
and propose a simple yet effective online model using explicit query
propagation, named OnlineRefer. Specifically, our approach leverages target
cues that carry semantic information and positional priors to improve the
accuracy and ease of referring predictions for the current frame. Furthermore,
we generalize our
online model into a semi-online framework to be compatible with video-based
backbones. To show the effectiveness of our method, we evaluate it on four
benchmarks, i.e., Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and
JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L
backbone achieves 63.5 J&F and 64.8 J&F on Refer-Youtube-VOS and Refer-DAVIS17,
outperforming all other offline methods.
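As a concrete illustration of the core idea, below is a minimal, hypothetical sketch of explicit query propagation: object queries refined on one frame are reused as the initial queries for the next, so target cues carry semantic and positional information forward through the video. The `FrameDecoder` module, feature shapes, and dimensions are placeholder assumptions, not OnlineRefer's actual implementation.

```python
# Minimal sketch of explicit query propagation for online RVOS, assuming a
# DETR-style decoder. All module names and sizes here are hypothetical.
import torch
import torch.nn as nn

class FrameDecoder(nn.Module):
    """Cross-attends object queries to fused frame-plus-text features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)

    def forward(self, queries, frame_feats, text_embed):
        # Condition the visual memory on the language reference by concatenation.
        memory = torch.cat([frame_feats, text_embed], dim=1)
        return self.layer(queries, memory)

def online_inference(frames, text_embed, decoder, init_queries):
    """Process a video frame by frame, propagating queries across time."""
    queries = init_queries                # (1, num_queries, dim), learned init
    outputs = []
    for frame_feats in frames:            # each: (1, num_tokens, dim)
        queries = decoder(queries, frame_feats, text_embed)
        outputs.append(queries)           # per-frame queries drive box/mask heads
    return outputs

# Toy usage with random tensors standing in for backbone and text features.
decoder = FrameDecoder()
frames = [torch.randn(1, 100, 256) for _ in range(4)]
text_embed = torch.randn(1, 10, 256)
init_queries = torch.randn(1, 5, 256)
outs = online_inference(frames, text_embed, decoder, init_queries)
print(len(outs), outs[0].shape)  # 4 torch.Size([1, 5, 256])
```

In a full model the per-frame queries would also feed box and mask heads; only the propagation loop, which is what distinguishes the online pattern from per-clip offline inference, is shown here.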
Related papers
- Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams [78.72965584414368] (arXiv 2024-06-12)
We present Flash-VStream, a video-language model that simulates the memory mechanism of humans.
Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption.
We propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding.
- X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization [56.75782714530429] (arXiv 2024-03-28)
We propose a cross-modal adaptation framework, which we call X-MIC.
Our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space.
This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization.
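A rough sketch of the general mechanism, conditioning frozen text embeddings on a video embedding inside a shared space, might look as follows; the adapter architecture, dimensions, and names are invented for illustration and are not X-MIC's actual design.

```python
# Hypothetical sketch: shift frozen class-text embeddings with a per-video
# conditioning vector, then classify by cosine similarity in the shared space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoConditionedAdapter(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, text_embeds, video_embed):
        # text_embeds: (num_classes, dim), kept frozen; video_embed: (dim,)
        shift = self.adapter(video_embed)                # video-specific offset
        return F.normalize(text_embeds + shift, dim=-1)  # stay on unit sphere

adapter = VideoConditionedAdapter()
text = F.normalize(torch.randn(10, 512), dim=-1)  # frozen CLIP-style class embeds
video = torch.randn(512)                          # pooled egocentric video feature
logits = F.normalize(video, dim=0) @ adapter(text, video).t()
print(logits.argmax().item())                     # predicted class index
```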
- TCOVIS: Temporally Consistent Online Video Instance Segmentation [98.29026693059444] (arXiv 2023-09-21)
We propose a novel online method for video instance segmentation called TCOVIS.
The core of our method consists of a global instance assignment strategy and a video-temporal enhancement module.
We evaluate our method on four VIS benchmarks and achieve state-of-the-art performance on all benchmarks without bells-and-whistles.
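The global instance assignment strategy is only named in this summary; as a hedged sketch of the general idea, the snippet below performs a single Hungarian matching per video over costs accumulated across frames, instead of re-matching each frame independently. The cost construction is a toy stand-in.

```python
# Sketch of video-level ("global") instance assignment: sum per-frame matching
# costs, then solve one assignment problem for the whole clip.
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_assignment(per_frame_costs):
    """per_frame_costs: list of (num_queries, num_gt) cost arrays, one per frame."""
    total_cost = np.sum(per_frame_costs, axis=0)    # accumulate over the video
    rows, cols = linear_sum_assignment(total_cost)  # one consistent matching
    return list(zip(rows.tolist(), cols.tolist()))

# Toy example: 3 frames, 4 predicted queries, 2 ground-truth instances.
rng = np.random.default_rng(0)
costs = [rng.random((4, 2)) for _ in range(3)]
print(global_assignment(costs))  # e.g. [(0, 1), (2, 0)]
```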
arXiv Detail & Related papers (2023-09-21T07:59:15Z) - NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation [22.200700685751826]
The video instance segmentation (VIS) community has operated under the common belief that offline methods are generally superior to frame-by-frame online processing.
We present a detailed analysis of different processing paradigms and a new end-to-end video instance segmentation method.
Our NOVIS represents the first near-online VIS approach that avoids any handcrafted tracking heuristics.
arXiv Detail & Related papers (2023-08-29T12:51:04Z) - Two-Level Temporal Relation Model for Online Video Instance Segmentation [3.9349485816629888]
We propose an online method that is on par with the performance of the offline counterparts.
We introduce a message-passing graph neural network that encodes objects and relates them through time.
Our model is trained end-to-end and achieves state-of-the-art performance on the YouTube-VIS dataset.
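As a hedged sketch of how a message-passing graph network can relate objects through time, the snippet below runs one round of messages from the previous frame's object embeddings to the current frame's; the layer choices and update rule are assumptions, not the paper's exact model.

```python
# One hypothetical message-passing step between object nodes of adjacent frames.
import torch
import torch.nn as nn

class TemporalMessagePassing(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.message = nn.Linear(2 * dim, dim)  # message from (sender, receiver)
        self.update = nn.GRUCell(dim, dim)      # node update from aggregated msgs

    def forward(self, prev_nodes, curr_nodes):
        # prev_nodes: (P, dim) objects at t-1; curr_nodes: (C, dim) objects at t
        P, C, d = prev_nodes.size(0), curr_nodes.size(0), curr_nodes.size(1)
        pairs = torch.cat([prev_nodes.unsqueeze(1).expand(P, C, d),
                           curr_nodes.unsqueeze(0).expand(P, C, d)], dim=-1)
        msgs = self.message(pairs).mean(dim=0)  # aggregate over senders: (C, dim)
        return self.update(msgs, curr_nodes)    # refreshed current-frame embeddings

mp = TemporalMessagePassing()
print(mp(torch.randn(3, 128), torch.randn(4, 128)).shape)  # torch.Size([4, 128])
```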
arXiv Detail & Related papers (2022-10-30T10:01:01Z) - In Defense of Online Models for Video Instance Segmentation [70.16915119724757]
We propose an online framework based on contrastive learning that is able to learn more discriminative instance embeddings for association.
Despite its simplicity, our method outperforms all online and offline methods on three benchmarks.
The proposed method won first place in the video instance segmentation track of the 4th Large-scale Video Object Segmentation Challenge.
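The snippet below illustrates the association step that such discriminative embeddings enable: cosine similarity between per-instance embeddings of adjacent frames, followed by a greedy match. This is a generic sketch; the paper's contrastive training objective and its exact matching rule are not reproduced here.

```python
# Generic embedding-based association across frames (greedy, thresholded).
import torch
import torch.nn.functional as F

def associate(prev_embeds, curr_embeds, threshold=0.5):
    """Match current detections to previous tracks by cosine similarity."""
    sim = F.normalize(curr_embeds, dim=1) @ F.normalize(prev_embeds, dim=1).t()
    matches = {}
    for i, row in enumerate(sim):
        score, j = row.max(dim=0)
        if score.item() >= threshold:   # below threshold: start a new track
            matches[i] = j.item()
    return matches                      # {detection index: track index}

prev = torch.randn(3, 64)   # embeddings of 3 existing tracks
curr = torch.randn(4, 64)   # embeddings of 4 new detections
print(associate(prev, curr))
```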
arXiv Detail & Related papers (2022-07-21T17:56:54Z) - Online Video Instance Segmentation via Robust Context Fusion [36.376900904288966]
Video instance segmentation (VIS) aims at classifying, segmenting and tracking object instances in video sequences.
Recent transformer-based neural networks have demonstrated powerful modeling capability for the VIS task.
We propose a robust context fusion network to tackle VIS in an online fashion, which predicts instance segmentation frame-by-frame with a few preceding frames.
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062] (arXiv 2021-06-02)
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
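As a drastically simplified stand-in for the second stage, the sketch below scores candidate tracklet embeddings against a sentence embedding and selects the best match; the encoders producing these embeddings, and the paper's Transformer grounding module itself, are assumed rather than shown.

```python
# Toy tracklet-language grounding: pick the tracklet most similar to the text.
import torch
import torch.nn.functional as F

def ground_tracklets(tracklet_embeds, sentence_embed):
    """tracklet_embeds: (T, dim); sentence_embed: (dim,). Returns best index."""
    scores = F.normalize(tracklet_embeds, dim=1) @ F.normalize(sentence_embed, dim=0)
    return scores.argmax().item()

tracklets = torch.randn(5, 256)  # one embedding per candidate object tracklet
sentence = torch.randn(256)      # pooled embedding of the language reference
print(ground_tracklets(tracklets, sentence))  # index of the referred tracklet
```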