Related papers: ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

URL: http://arxiv.org/abs/2501.14607v1
Date: Fri, 24 Jan 2025 16:24:15 GMT
Title: ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
Authors: Tianming Liang, Kun-Yu Lin, Chaolei Tan, Jianguo Zhang, Wei-Shi Zheng, Jian-Fang Hu,
Abstract summary: Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description.<n>We present textbfReferDINO, an end-to-end RVOS model that inherits strong vision-language understanding from the pretrained visual grounding foundation models.
Score: 33.74746234704817
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. Despite notable progress in recent years, current RVOS models remain struggle to handle complicated object descriptions due to their limited video-language understanding. To address this limitation, we present \textbf{ReferDINO}, an end-to-end RVOS model that inherits strong vision-language understanding from the pretrained visual grounding foundation models, and is further endowed with effective temporal understanding and object segmentation capabilities. In ReferDINO, we contribute three technical innovations for effectively adapting the foundation models to RVOS: 1) an object-consistent temporal enhancer that capitalizes on the pretrained object-text representations to enhance temporal understanding and object consistency; 2) a grounding-guided deformable mask decoder that integrates text and grounding conditions to generate accurate object masks; 3) a confidence-aware query pruning strategy that significantly improves the object decoding efficiency without compromising performance. We conduct extensive experiments on five public RVOS benchmarks to demonstrate that our proposed ReferDINO outperforms state-of-the-art methods significantly. Project page: \url{https://isee-laboratory.github.io/ReferDINO}

Related papers

4th PVUW MeViS 3rd Place Report: Sa2VA [105.88675577642204]
We show that with a simple modification to the test time inference method on stronger MLLMs, we can lead to stronger results on MeVIS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos.
arXiv Detail & Related papers (2025-04-01T07:06:47Z)
OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors. This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training. Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z)
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos. We show that the model can act as a drop-in replacement for an ego-awareness video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
Learning Referring Video Object Segmentation from Weak Annotation [78.45828085350936]
Referring video object segmentation (RVOS) is a task that aims to segment the target object in all video frames based on a sentence describing the object. We propose a new annotation scheme that reduces the annotation effort by 8 times, while providing sufficient supervision for RVOS. Our scheme only requires a mask for the frame where the object first appears and bounding boxes for the rest of the frames.
arXiv Detail & Related papers (2023-08-04T06:50:52Z)
Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning. This task unifies spatial and temporal localization in video. We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic in the context of vision for autonomous vehicles. Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment object instances in a given video referred by a language expression in all video frames. This work proposes several tricks to boost further, including cyclical learning rates, semi-supervised approach, and test-time augmentation inference. The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference. Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice. In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.