Local-Global Context Aware Transformer for Language-Guided Video
Segmentation
- URL: http://arxiv.org/abs/2203.09773v2
- Date: Fri, 19 Jan 2024 13:01:44 GMT
- Title: Local-Global Context Aware Transformer for Language-Guided Video
Segmentation
- Authors: Chen Liang, Wenguan Wang, Tianfei Zhou, Jiaxu Miao, Yawei Luo and Yi
Yang
- Abstract summary: We explore the task of language-guided video segmentation (LVS).
We present Locater, which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner.
To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset.
- Score: 103.35509224722097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the task of language-guided video segmentation (LVS). Previous
algorithms mostly adopt 3D CNNs to learn video representations, which struggle to
capture long-term context and are prone to visual-linguistic misalignment. In light
of this, we present Locater (local-global context aware
Transformer), which augments the Transformer architecture with a finite memory
so as to query the entire video with the language expression in an efficient
manner. The memory is designed to involve two components -- one for
persistently preserving global video content, and one for dynamically gathering
local temporal context and segmentation history. Based on the memorized
local-global context and the particular content of each frame, Locater
holistically and flexibly comprehends the expression as an adaptive query
vector for each frame. The vector is used to query the corresponding frame for
mask generation. The memory also allows Locater to process videos with linear
time complexity and constant-size memory, whereas Transformer-style
self-attention scales quadratically with sequence length. To
thoroughly examine the visual grounding capability of LVS models, we contribute
a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses
increased challenges in disambiguating among similar objects. Experiments on
three LVS datasets and our A2D-S+ show that Locater outperforms previous
state-of-the-art methods. Furthermore, Locater served as the foundation of our
solution that won 1st place in the Referring Video Object Segmentation Track of
the 3rd Large-scale Video Object Segmentation Challenge. Our code and
dataset are available at: https://github.com/leonnnop/Locater
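To make the memory-and-query mechanism described in the abstract more concrete, below is a minimal, self-contained PyTorch sketch of one plausible reading of it: a fixed-size global memory, a small rolling local memory, an adaptive per-frame query obtained by attending the language feature over the memorized context, and a dot-product readout of the frame tokens for mask logits. All names and design details here (MemoryQuerySketch, GLOBAL_SLOTS, LOCAL_SLOTS, the GRU-based memory update, the pooled sentence feature) are assumptions made for illustration, not the authors' implementation; see the repository linked above for the actual Locater code.

```python
# Illustrative sketch only: a memory-augmented per-frame querying loop in the
# spirit of the abstract. None of these names or design choices come from the
# Locater codebase; they are assumptions for this example.
import torch
import torch.nn as nn

D = 256            # feature dimension (assumed)
GLOBAL_SLOTS = 8   # fixed number of global memory slots (assumed)
LOCAL_SLOTS = 4    # fixed number of local memory slots (assumed)


class MemoryQuerySketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Persistent global memory: fixed size, so per-frame cost stays constant.
        self.global_mem = nn.Parameter(torch.randn(GLOBAL_SLOTS, D) * 0.02)
        self.global_update = nn.GRUCell(D, D)
        # Cross-attention that adapts the sentence feature to the memorized context.
        self.ctx_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.mask_proj = nn.Linear(D, D)

    def forward(self, frames, sentence):
        """
        frames:   (T, HW, D) per-frame visual tokens
        sentence: (D,)       pooled language expression feature
        returns:  list of T mask-logit tensors, each of shape (HW,)
        """
        local_mem = []                      # rolling buffer of recent frame summaries
        global_mem = self.global_mem
        masks = []
        for t in range(frames.size(0)):
            frame = frames[t]                       # (HW, D)
            frame_summary = frame.mean(dim=0)       # (D,) cheap frame descriptor
            # 1) Assemble the memorized local-global context for this time step.
            context = torch.cat([global_mem] + local_mem, dim=0)        # (S, D)
            # 2) Form an adaptive query vector for this frame.
            q, _ = self.ctx_attn(sentence.view(1, 1, -1),
                                 context.unsqueeze(0),
                                 context.unsqueeze(0))
            query = q.squeeze(0).squeeze(0)         # (D,)
            # 3) Query the frame tokens to produce mask logits.
            logits = frame @ self.mask_proj(query)  # (HW,)
            masks.append(logits)
            # 4) Update both memories with constant-size state, so the whole
            #    video is processed in time linear in T.
            global_mem = self.global_update(
                frame_summary.expand(GLOBAL_SLOTS, -1).contiguous(), global_mem)
            local_mem.append(frame_summary.detach().unsqueeze(0))
            local_mem = local_mem[-LOCAL_SLOTS:]
        return masks


if __name__ == "__main__":
    model = MemoryQuerySketch()
    video = torch.randn(12, 14 * 14, D)   # 12 frames of 14x14 tokens (toy input)
    expr = torch.randn(D)                  # toy pooled language feature
    out = model(video, expr)
    print(len(out), out[0].shape)          # 12 masks, each of shape (196,)
```

Because both memory components in this sketch have a fixed number of slots, the per-frame cost does not grow with video length, mirroring the linear-time, constant-memory property the abstract contrasts with quadratic full-sequence self-attention.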
Related papers
- VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos [58.765796160750504]
VideoGLaMM is a new model for fine-grained pixel-level grounding in videos based on user-provided textual inputs.
The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions.
Experimental results show that our model consistently outperforms existing approaches across all three tasks.
arXiv Detail & Related papers (2024-11-07T17:59:27Z)
- One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos [41.34787907803329]
VideoLISA is a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos.
VideoLISA generates temporally consistent segmentation masks in videos based on language instructions.
arXiv Detail & Related papers (2024-09-29T07:47:15Z)
- VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z)
- Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation [24.814534011440877]
We propose an end-to-end RVOS framework which treats the RVOS task as a mask sequence learning problem.
To capture the object-level spatial context, we have developed the Stacked Transformer.
The model finds the best matching between mask sequence and text query.
arXiv Detail & Related papers (2023-09-21T09:47:47Z)
- Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA).
It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z)
- Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require the compressed video bitstream to be decoded into RGB frames before segmentation.
This may hamper their application in real-world, resource-limited scenarios such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z)
- The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment, across all video frames, the object instances in a given video that are referred to by a language expression.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference.
The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)