Local-Global Context Aware Transformer for Language-Guided Video
Segmentation
- URL: http://arxiv.org/abs/2203.09773v2
- Date: Fri, 19 Jan 2024 13:01:44 GMT
- Authors: Chen Liang, Wenguan Wang, Tianfei Zhou, Jiaxu Miao, Yawei Luo and Yi
Yang
- Abstract summary: We explore the task of language-guided video segmentation (LVS).
We present Locater, which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner.
To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset.
- Score: 103.35509224722097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the task of language-guided video segmentation (LVS). Previous
algorithms mostly adopt 3D CNNs to learn video representation, struggling to
capture long-term context and easily suffering from visual-linguistic
misalignment. In light of this, we present Locater (local-global context aware
Transformer), which augments the Transformer architecture with a finite memory
so as to query the entire video with the language expression in an efficient
manner. The memory is designed to involve two components -- one for
persistently preserving global video content, and one for dynamically gathering
local temporal context and segmentation history. Based on the memorized
local-global context and the particular content of each frame, Locater
holistically and flexibly comprehends the expression as an adaptive query
vector for each frame. The vector is used to query the corresponding frame for
mask generation. The memory also allows Locater to process videos with linear
time complexity and constant size memory, while Transformer-style
self-attention computation scales quadratically with sequence length. To
thoroughly examine the visual grounding capability of LVS models, we contribute
a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses
increased challenges in disambiguating among similar objects. Experiments on
three LVS datasets and our A2D-S+ show that Locater outperforms previous
state-of-the-art methods. Further, we won 1st place in the Referring Video Object
Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge,
where Locater served as the foundation for the winning solution. Our code and
dataset are available at: https://github.com/leonnnop/Locater
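The abstract's efficiency claim (linear time and constant-size memory, versus the quadratic cost of full self-attention over all frames) can be illustrated with a minimal sketch. All names, shapes, and the memory update rule below are hypothetical simplifications for illustration, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # Scaled dot-product attention of a single query over a set of key/value vectors.
    scores = keys @ query / np.sqrt(query.shape[-1])
    return softmax(scores) @ values

def segment_video(frames, text_query, mem_size=8, alpha=0.5):
    """Process T frames against a fixed-size memory: O(T) time, O(1) memory.

    frames:     (T, d) per-frame feature vectors (stand-ins for real features)
    text_query: (d,) pooled language embedding
    """
    d = frames.shape[1]
    memory = np.zeros((mem_size, d))  # constant-size context memory
    outputs = []
    for t, frame in enumerate(frames):
        # Read: form an adaptive query from the expression plus memorized context,
        # rather than attending over all previous frames.
        context = attend(text_query, memory, memory) if t > 0 else np.zeros(d)
        adaptive_query = text_query + context
        # Use the per-frame adaptive query against the current frame only
        # (a scalar relevance score here; a real model would produce a mask).
        outputs.append(float(adaptive_query @ frame))
        # Write: fold the current frame into one memory slot via a simple
        # exponential moving average (a placeholder for a learned update).
        slot = t % mem_size
        memory[slot] = alpha * memory[slot] + (1 - alpha) * frame
    return outputs
```

Each frame is read once and attends only to a fixed number of memory slots, so cost grows linearly with video length while memory stays constant, in contrast to full self-attention whose cost scales quadratically with the number of frames.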
Related papers
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z) - VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z) - Encoding and Controlling Global Semantics for Long-form Video Question Answering [40.129800076300434]
We introduce a state space layer (SSL) into a multi-modal Transformer to efficiently integrate the global semantics of the video.
Our SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations.
To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks, Ego-QA and MAD-QA, featuring videos of considerable length.
arXiv Detail & Related papers (2024-05-30T06:10:10Z) - Fully Transformer-Equipped Architecture for End-to-End Referring Video
Object Segmentation [24.814534011440877]
We propose an end-to-end RVOS framework which treats the RVOS task as a mask sequence learning problem.
To capture the object-level spatial context, we have developed the Stacked Transformer.
The model finds the best matching between mask sequence and text query.
arXiv Detail & Related papers (2023-09-21T09:47:47Z) - Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA).
It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z) - Video Referring Expression Comprehension via Transformer with
Content-aware Query [60.89442448993627]
Video Referring Expression Comprehension (REC) aims to localize a target object in video frames referred to by a natural language expression.
We argue that the current query design is suboptimal and suffers from two drawbacks.
We set up a fixed number of learnable bounding boxes across the frame, and the aligned region features are employed to provide rich clues.
arXiv Detail & Related papers (2022-10-06T14:45:41Z) - Multi-Attention Network for Compressed Video Referring Object
Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred to by a given language expression.
Existing works typically require the compressed video bitstream to be decoded into RGB frames before segmentation.
This may hamper their application in real-world, resource-limited scenarios such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z) - The Second Place Solution for The 4th Large-scale Video Object
Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment, in all video frames, the object instances referred to by a language expression.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference.
The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z) - HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.