Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor
Segmentation
- URL: http://arxiv.org/abs/2105.06818v1
- Date: Fri, 14 May 2021 13:27:53 GMT
- Title: Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor
Segmentation
- Authors: Tianrui Hui, Shaofei Huang, Si Liu, Zihan Ding, Guanbin Li, Wenguan
Wang, Jizhong Han, Fei Wang
- Abstract summary: Language-queried video actor segmentation aims to predict the pixel-mask of the actor which performs the actions described by a natural language query in the target frames.
We propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors.
- Score: 90.74732705236336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language-queried video actor segmentation aims to predict the pixel-level
mask of the actor which performs the actions described by a natural language
query in the target frames. Existing methods adopt 3D CNNs over the video clip
as a general encoder to extract a mixed spatio-temporal feature for the target
frame. Though 3D convolutions are amenable to recognizing which actor is
performing the queried actions, they also inevitably introduce misaligned
spatial information from adjacent frames, which confuses features of the target
frame and yields inaccurate segmentation. Therefore, we propose a collaborative
spatial-temporal encoder-decoder framework which contains a 3D temporal encoder
over the video clip to recognize the queried actions, and a 2D spatial encoder
over the target frame to accurately segment the queried actors. In the decoder,
a Language-Guided Feature Selection (LGFS) module is proposed to flexibly
integrate spatial and temporal features from the two encoders. We also propose
a Cross-Modal Adaptive Modulation (CMAM) module to dynamically recombine
spatial- and temporal-relevant linguistic features for multimodal feature
interaction in each stage of the two encoders. Our method achieves new
state-of-the-art performance on two popular benchmarks with less computational
overhead than previous approaches.
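The abstract describes the architecture only at a high level, so below is a minimal PyTorch-style sketch of how such a collaborative framework could be wired together. The module names (LGFS, CMAM) follow the abstract, but every layer choice, feature dimension, and fusion detail is an assumption for illustration, not the authors' released implementation.
```python
import torch
import torch.nn as nn


class CMAM(nn.Module):
    """Cross-Modal Adaptive Modulation (assumed form): recombine word features
    with visually conditioned weights, then modulate the visual feature map."""
    def __init__(self, vis_dim, lang_dim):
        super().__init__()
        self.attn = nn.Linear(vis_dim + lang_dim, 1)
        self.gamma = nn.Linear(lang_dim, vis_dim)
        self.beta = nn.Linear(lang_dim, vis_dim)

    def forward(self, vis, words):
        # vis: (B, C, H, W) visual features, words: (B, L, D) word features
        b, c, _, _ = vis.shape
        ctx = vis.mean(dim=(2, 3))                                   # (B, C) global visual context
        scores = self.attn(torch.cat(
            [ctx.unsqueeze(1).expand(-1, words.size(1), -1), words], dim=-1))
        weights = torch.softmax(scores, dim=1)                       # (B, L, 1)
        sent = (weights * words).sum(dim=1)                          # recombined sentence feature
        gamma = torch.sigmoid(self.gamma(sent)).view(b, c, 1, 1)
        beta = self.beta(sent).view(b, c, 1, 1)
        return vis * gamma + beta                                    # modulated visual feature


class LGFS(nn.Module):
    """Language-Guided Feature Selection (assumed form): a sentence-conditioned
    channel gate that mixes spatial and temporal features."""
    def __init__(self, dim, lang_dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(lang_dim, dim), nn.Sigmoid())

    def forward(self, spat, temp, sent):
        g = self.gate(sent).unsqueeze(-1).unsqueeze(-1)              # (B, C, 1, 1)
        return g * spat + (1.0 - g) * temp


class CollaborativeSegmenter(nn.Module):
    def __init__(self, dim=256, lang_dim=300):
        super().__init__()
        self.enc2d = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.ReLU())   # 2D spatial encoder (placeholder)
        self.enc3d = nn.Sequential(nn.Conv3d(3, dim, 3, padding=1), nn.ReLU())   # 3D temporal encoder (placeholder)
        self.cmam_spat = CMAM(dim, lang_dim)
        self.cmam_temp = CMAM(dim, lang_dim)
        self.lgfs = LGFS(dim, lang_dim)
        self.head = nn.Conv2d(dim, 1, 1)                                         # pixel-level mask logits

    def forward(self, clip, frame, words):
        # clip: (B, 3, T, H, W), frame: (B, 3, H, W), words: (B, L, lang_dim)
        sent = words.mean(dim=1)
        spat = self.cmam_spat(self.enc2d(frame), words)
        temp = self.cmam_temp(self.enc3d(clip).mean(dim=2), words)   # pool time for this sketch
        return self.head(self.lgfs(spat, temp, sent))                # (B, 1, H, W)
```
In this sketch the language-conditioned gate in LGFS decides, per channel, whether to rely on the clean 2D spatial features or the action-aware 3D temporal features, which mirrors the paper's motivation of keeping target-frame features free of misaligned information from adjacent frames.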
Related papers
- GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts [48.28000728061778]
We propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene.
Our methodology achieves up to a 30% reduction in the goal object distance metric compared to the prior state-of-the-art baseline model.
arXiv Detail & Related papers (2024-04-08T18:24:12Z)
- Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation [28.472006665544033]
Referring video object segmentation aims to predict foreground labels for objects referred by natural language expressions in videos.
Previous methods either depend on 3D ConvNets or incorporate additional 2D ConvNets as encoders to extract mixed spatial-temporal features.
We propose a Language-Bridged Duplex Transfer (LBDT) module which utilizes language as an intermediary bridge to accomplish explicit and adaptive spatial-temporal interaction earlier in the encoding phase.
arXiv Detail & Related papers (2022-06-08T10:12:53Z)
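The summary above names the mechanism but not its form, so here is a hedged sketch of the "language as an intermediary bridge" idea: each visual stream first writes into the word features via attention, and the other stream then reads the language-enriched words back. The class name, token layout, and attention choices are assumptions, not the paper's actual LBDT module.
```python
import torch.nn as nn


class LanguageBridge(nn.Module):
    """Duplex spatial-temporal transfer routed through language (assumed form)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def transfer(self, src_tokens, dst_tokens, words):
        # 1) the word features gather information from the source visual stream
        bridged, _ = self.write(words, src_tokens, src_tokens)
        # 2) the destination stream reads the language-enriched words back in
        update, _ = self.read(dst_tokens, bridged, bridged)
        return dst_tokens + update

    def forward(self, spat_tokens, temp_tokens, words):
        # duplex transfer: temporal -> language -> spatial, and the reverse
        new_spat = self.transfer(temp_tokens, spat_tokens, words)
        new_temp = self.transfer(spat_tokens, temp_tokens, words)
        return new_spat, new_temp
```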
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z)
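As a companion to the entry above, here is a rough sketch of the language-guided dynamic filtering idea behind the LMDF module: a sentence embedding is turned into per-channel convolution kernels that are applied to frame features, and the same operation can be repeated at several feature scales. The single-scale depthwise form, names, and shapes below are illustrative assumptions, not the paper's exact module.
```python
import torch.nn as nn
import torch.nn.functional as F


class LanguageGuidedDynamicFilter(nn.Module):
    """Sentence-conditioned depthwise filtering of frame features (assumed form)."""
    def __init__(self, dim, lang_dim, k=3):
        super().__init__()
        self.k = k
        # predict one k x k depthwise kernel per channel from the sentence feature
        self.kernel_gen = nn.Linear(lang_dim, dim * k * k)

    def forward(self, feat, sent):
        # feat: (B, C, H, W) frame features, sent: (B, lang_dim) sentence feature
        b, c, h, w = feat.shape
        kernels = self.kernel_gen(sent).view(b * c, 1, self.k, self.k)
        # grouped conv: every sample/channel gets its own language-conditioned filter
        out = F.conv2d(feat.reshape(1, b * c, h, w), kernels,
                       padding=self.k // 2, groups=b * c)
        return out.view(b, c, h, w)
```
Applying the same filter generator to features at strides 8, 16, and 32 would give the multi-scale behaviour the entry refers to.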
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
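To make the three cues in the entry above concrete, here is a minimal sketch of fusing optical-flow-guided motion, detection-based appearance, and 3D-aware features at the object level. The concatenate-and-project fusion is an illustrative assumption, not MA3SRN's actual reasoning network.
```python
import torch
import torch.nn as nn


class ObjectCueFusion(nn.Module):
    """Fuse motion, appearance, and 3D object-level cues (assumed form)."""
    def __init__(self, motion_dim, app_dim, c3d_dim, out_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(motion_dim + app_dim + c3d_dim, out_dim), nn.ReLU())

    def forward(self, motion_feat, app_feat, c3d_feat):
        # each input: (B, N_objects, dim) object-level features from one cue
        fused = torch.cat([motion_feat, app_feat, c3d_feat], dim=-1)
        return self.proj(fused)   # (B, N_objects, out_dim), ready to match against the query
```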
- Siamese Network with Interactive Transformer for Video Object Segmentation [34.202137199782804]
We propose a network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical to current frames.
We employ a shared backbone to extract features of both past and current frames, which enables feature reuse and is more efficient than existing methods.
arXiv Detail & Related papers (2021-12-28T03:38:17Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
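Finally, a rough sketch of an asymmetric attention in the spirit of the MAT block described in the last entry: motion features produce a spatial attention map that re-weights the appearance features, but not the other way around, which is what makes the block asymmetric. Layer choices and shapes are assumptions for illustration only.
```python
import torch.nn as nn


class MotionAttentiveTransition(nn.Module):
    """Asymmetric motion-to-appearance attention (assumed form)."""
    def __init__(self, dim):
        super().__init__()
        self.motion_gate = nn.Sequential(
            nn.Conv2d(dim, dim // 4, 1), nn.ReLU(),
            nn.Conv2d(dim // 4, 1, 1), nn.Sigmoid())

    def forward(self, appearance, motion):
        # appearance, motion: (B, C, H, W) features from the two encoder streams
        attn = self.motion_gate(motion)                 # (B, 1, H, W) motion-derived attention
        return appearance + appearance * attn           # moving regions are emphasised
```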