Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
- URL: http://arxiv.org/abs/2206.03789v1
- Date: Wed, 8 Jun 2022 10:12:53 GMT
- Title: Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
- Authors: Zihan Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Jizhong Han, Si Liu
- Abstract summary: Referring video object segmentation aims to predict foreground labels for objects referred to by natural language expressions in videos.
Previous methods either depend on 3D ConvNets or incorporate additional 2D ConvNets as encoders to extract mixed spatial-temporal features.
We propose a Language-Bridged Duplex Transfer (LBDT) module which utilizes language as an intermediary bridge to accomplish explicit and adaptive spatial-temporal interaction earlier in the encoding phase.
- Score: 28.472006665544033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video object segmentation aims to predict foreground labels for
objects referred to by natural language expressions in videos. Previous methods
either depend on 3D ConvNets or incorporate additional 2D ConvNets as encoders
to extract mixed spatial-temporal features. However, these methods suffer from
spatial misalignment or false distractors due to delayed and implicit
spatial-temporal interaction occurring in the decoding phase. To tackle these
limitations, we propose a Language-Bridged Duplex Transfer (LBDT) module which
utilizes language as an intermediary bridge to accomplish explicit and adaptive
spatial-temporal interaction earlier in the encoding phase. Concretely,
cross-modal attention is performed among the temporal encoder, referring words
and the spatial encoder to aggregate and transfer language-relevant motion and
appearance information. In addition, we propose a Bilateral Channel Activation
(BCA) module in the decoding phase to further denoise and highlight
spatial-temporally consistent features via channel-wise
activation. Extensive experiments show that our method achieves new
state-of-the-art performance on four popular benchmarks, with 6.8% and 6.9%
absolute AP gains on A2D Sentences and J-HMDB Sentences respectively, while
incurring around 7x less computational overhead.
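The PyTorch sketch below illustrates one plausible reading of the abstract's description: language tokens first aggregate motion cues from the temporal encoder and appearance cues from the spatial encoder, then each stream is enriched with the language-filtered summary of the other (the "duplex transfer"), followed by a bilateral channel gate standing in for the BCA-style channel-wise activation. The class names, tensor shapes, and exact attention wiring are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a language-bridged duplex transfer and a bilateral
# channel-activation gate, assuming token-sequence features of shape (B, N, C).
# Module names (LBDTSketch, BCASketch) and the wiring are assumptions.
import torch
import torch.nn as nn


class LBDTSketch(nn.Module):
    """Language as a bridge between spatial and temporal encoder features."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # language queries aggregate stream-specific information
        self.lang_from_temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lang_from_spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # each stream receives the language-filtered summary of the other stream
        self.spatial_from_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_from_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, spatial_feat, temporal_feat, lang_feat):
        # spatial_feat: (B, HW, C), temporal_feat: (B, N_t, C), lang_feat: (B, L, C)
        # 1) language collects motion cues from the temporal stream
        lang_motion, _ = self.lang_from_temporal(lang_feat, temporal_feat, temporal_feat)
        # 2) language collects appearance cues from the spatial stream
        lang_appear, _ = self.lang_from_spatial(lang_feat, spatial_feat, spatial_feat)
        # 3) duplex transfer: each stream attends to the language-relevant
        #    summary of the other stream, with a residual connection
        spatial_out, _ = self.spatial_from_lang(spatial_feat, lang_motion, lang_motion)
        temporal_out, _ = self.temporal_from_lang(temporal_feat, lang_appear, lang_appear)
        return spatial_feat + spatial_out, temporal_feat + temporal_out


class BCASketch(nn.Module):
    """Bilateral channel gating: each stream is re-weighted channel-wise using
    the other stream's global descriptor (one assumed reading of BCA)."""

    def __init__(self, dim=256):
        super().__init__()
        self.gate_spatial = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_temporal = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, spatial_feat, temporal_feat):
        # global average over tokens -> per-channel descriptors (B, C)
        s_desc = spatial_feat.mean(dim=1)
        t_desc = temporal_feat.mean(dim=1)
        # cross-gate: temporal descriptor activates spatial channels and vice versa
        spatial_out = spatial_feat * self.gate_spatial(t_desc).unsqueeze(1)
        temporal_out = temporal_feat * self.gate_temporal(s_desc).unsqueeze(1)
        return spatial_out, temporal_out


if __name__ == "__main__":
    B, HW, L, C = 2, 196, 10, 256
    spa, tem, lang = torch.randn(B, HW, C), torch.randn(B, HW, C), torch.randn(B, L, C)
    spa, tem = LBDTSketch(C)(spa, tem, lang)
    spa, tem = BCASketch(C)(spa, tem)
    print(spa.shape, tem.shape)  # torch.Size([2, 196, 256]) for both streams
```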
Related papers
- A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection [7.202931445597171]
We present a novel network that detects actions in untrimmed videos.
The network encodes the locations of action semantics in video frames utilizing motion-aware 2D positional encoding.
The approach outperforms the state-of-the-art solutions on four proposed datasets.
arXiv Detail & Related papers (2024-05-13T21:47:35Z) - Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for
Referring Video Object Segmentation [44.952526831843386]
We propose a correlation-driven inter-frame interaction Transformer, dubbed BIFIT, to address these issues in RVOS.
Specifically, we design a lightweight plug-and-play inter-frame interaction module in the decoder.
A vision-language interaction is implemented before the Transformer to facilitate the correlation between the visual and linguistic features.
arXiv Detail & Related papers (2023-07-02T10:29:35Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal
Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z) - Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z) - Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor
Segmentation [90.74732705236336]
Language-queried video actor segmentation aims to predict the pixel-mask of the actor which performs the actions described by a natural language query in the target frames.
We propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors.
arXiv Detail & Related papers (2021-05-14T13:27:53Z) - BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded
Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z) - Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.