Local-Global Context Aware Transformer for Language-Guided Video
Segmentation
- URL: http://arxiv.org/abs/2203.09773v2
- Date: Fri, 19 Jan 2024 13:01:44 GMT
- Authors: Chen Liang, Wenguan Wang, Tianfei Zhou, Jiaxu Miao, Yawei Luo and Yi
Yang
- Abstract summary: We explore the task of language-guided video segmentation (LVS).
We present Locater, which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner.
To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset.
- Score: 103.35509224722097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the task of language-guided video segmentation (LVS). Previous
algorithms mostly adopt 3D CNNs to learn video representation, struggling to
capture long-term context and easily suffering from visual-linguistic
misalignment. In light of this, we present Locater (local-global context aware
Transformer), which augments the Transformer architecture with a finite memory
so as to query the entire video with the language expression in an efficient
manner. The memory is designed to involve two components -- one for
persistently preserving global video content, and one for dynamically gathering
local temporal context and segmentation history. Based on the memorized
local-global context and the particular content of each frame, Locater
holistically and flexibly comprehends the expression as an adaptive query
vector for each frame. The vector is used to query the corresponding frame for
mask generation. The memory also allows Locater to process videos with linear
time complexity and constant size memory, while Transformer-style
self-attention computation scales quadratically with sequence length. To
thoroughly examine the visual grounding capability of LVS models, we contribute
a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses
increased challenges in disambiguating among similar objects. Experiments on
three LVS datasets and our A2D-S+ show that Locater outperforms previous
state-of-the-art methods. Further, we won 1st place in the Referring Video Object
Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge,
where Locater served as the foundation for the winning solution. Our code and
dataset are available at: https://github.com/leonnnop/Locater
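The abstract's efficiency claim (linear time and constant-size memory, versus the quadratic cost of full self-attention over all frames) can be illustrated with a minimal sketch. All names, shapes, and the memory update rule below are hypothetical simplifications for illustration, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # Scaled dot-product attention of a single query over a set of key/value vectors.
    scores = keys @ query / np.sqrt(query.shape[-1])
    return softmax(scores) @ values

def segment_video(frames, text_query, mem_size=8, alpha=0.5):
    """Process T frames against a fixed-size memory: O(T) time, O(1) memory.

    frames:     (T, d) per-frame feature vectors (stand-ins for real features)
    text_query: (d,) pooled language embedding
    """
    d = frames.shape[1]
    memory = np.zeros((mem_size, d))  # constant-size context memory
    outputs = []
    for t, frame in enumerate(frames):
        # Read: form an adaptive query from the expression plus memorized context,
        # rather than attending over all previous frames.
        context = attend(text_query, memory, memory) if t > 0 else np.zeros(d)
        adaptive_query = text_query + context
        # Use the per-frame adaptive query against the current frame only
        # (a scalar relevance score here; a real model would produce a mask).
        outputs.append(float(adaptive_query @ frame))
        # Write: fold the current frame into one memory slot via a simple
        # exponential moving average (a placeholder for a learned update).
        slot = t % mem_size
        memory[slot] = alpha * memory[slot] + (1 - alpha) * frame
    return outputs
```

Each frame is read once and attends only to a fixed number of memory slots, so cost grows linearly with video length while memory stays constant, in contrast to full self-attention whose cost scales quadratically with the number of frames.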
Related papers
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z) - VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z) - Encoding and Controlling Global Semantics for Long-form Video Question Answering [40.129800076300434]
We introduce a state space layer (SSL) into a multi-modal Transformer to efficiently integrate the global semantics of the video.
Our SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations.
To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks, Ego-QA and MAD-QA, featuring videos of considerable length.
arXiv Detail & Related papers (2024-05-30T06:10:10Z) - Fully Transformer-Equipped Architecture for End-to-End Referring Video
Object Segmentation [24.814534011440877]
We propose an end-to-end RVOS framework which treats the RVOS task as a mask sequence learning problem.
To capture the object-level spatial context, we have developed the Stacked Transformer.
The model finds the best matching between mask sequence and text query.
arXiv Detail & Related papers (2023-09-21T09:47:47Z) - Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA).
It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z) - Video Referring Expression Comprehension via Transformer with
Content-aware Query [60.89442448993627]
Video Referring Expression Comprehension (REC) aims to localize a target object in video frames referred to by a natural language expression.
We argue that the current query design is suboptimal and suffers from two drawbacks.
We set up a fixed number of learnable bounding boxes across the frame, and the aligned region features are employed to provide rich clues.
arXiv Detail & Related papers (2022-10-06T14:45:41Z) - Multi-Attention Network for Compressed Video Referring Object
Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred to by a given language expression.
Existing works typically require the compressed video bitstream to be decoded into RGB frames before segmentation.
This may hamper their application in real-world, resource-limited scenarios such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z) - The Second Place Solution for The 4th Large-scale Video Object
Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment, in all video frames, the object instances referred to by a language expression.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference.
The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z) - HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.