Actor and Action Modular Network for Text-based Video Segmentation
- URL: http://arxiv.org/abs/2011.00786v2
- Date: Mon, 22 Aug 2022 01:49:29 GMT
- Title: Actor and Action Modular Network for Text-based Video Segmentation
- Authors: Jianhua Yang, Yan Huang, Kai Niu, Linjiang Huang, Zhanyu Ma, Liang Wang
- Abstract summary: Text-based video segmentation aims to segment an actor in video sequences by specifying the actor and its performing action with a textual query.
Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner according to the actor and its action.
We propose a novel actor and action modular network that individually localizes the actor and its action in two separate modules.
- Score: 28.104884795973177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based video segmentation aims to segment an actor in video sequences by
specifying the actor and its performing action with a textual query. Previous
methods fail to explicitly align the video content with the textual query in a
fine-grained manner according to the actor and its action, due to the problem
of \emph{semantic asymmetry}. The \emph{semantic asymmetry} implies that the two
modalities contain different amounts of semantic information during the
multi-modal fusion process. To alleviate this problem, we propose a novel actor
and action modular network that individually localizes the actor and its action
in two separate modules. Specifically, we first learn the actor-/action-related
content from the video and textual query, and then match them in a symmetrical
manner to localize the target tube. The target tube contains the desired actor
and action and is then fed into a fully convolutional network to predict
segmentation masks of the actor. Our method also establishes the association of
objects across multiple frames with the proposed temporal proposal aggregation
mechanism. This enables our method to segment the video effectively and to keep
the predictions temporally consistent. The whole model allows joint learning of
actor-action matching and segmentation, and achieves state-of-the-art
performance for both single-frame segmentation and full-video segmentation on
the A2D Sentences and J-HMDB Sentences datasets.
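The abstract describes the architecture only at a high level. The following PyTorch sketch illustrates the symmetric actor/action matching and a greedy form of temporal proposal linking; the module names, feature dimensions, proposal pooling, and the IoU-based linking rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal, illustrative PyTorch sketch of symmetric actor/action matching plus a
# greedy form of temporal proposal linking. Module names, feature dimensions, the
# pooling of proposals, and the linking rule are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MatchingModule(nn.Module):
    """Scores proposals against one textual cue (actor OR action) in a shared space."""

    def __init__(self, visual_dim, text_dim, joint_dim=256):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, proposal_feats, text_feat):
        # proposal_feats: (N, visual_dim); text_feat: (text_dim,)
        v = F.normalize(self.visual_proj(proposal_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v @ t  # (N,) cosine matching scores


class ActorActionMatcher(nn.Module):
    """Actor module matches appearance vs. actor words; action module matches motion vs. action words."""

    def __init__(self, app_dim=1024, mot_dim=1024, text_dim=300):
        super().__init__()
        self.actor_module = MatchingModule(app_dim, text_dim)
        self.action_module = MatchingModule(mot_dim, text_dim)

    def forward(self, appearance, motion, actor_text, action_text):
        scores = self.actor_module(appearance, actor_text) + \
                 self.action_module(motion, action_text)
        return int(scores.argmax()), scores  # index of the selected target proposal


def link_tube(per_frame_boxes, iou_fn):
    """Toy temporal aggregation: greedily link each frame's chosen proposal to the
    best-overlapping proposal in the next frame to form a tube."""
    tube = [0]  # assume the matched proposal has index 0 in the first frame
    for t in range(len(per_frame_boxes) - 1):
        prev = per_frame_boxes[t][tube[-1]]
        ious = [iou_fn(prev, box) for box in per_frame_boxes[t + 1]]
        tube.append(max(range(len(ious)), key=ious.__getitem__))
    return tube
```

The selected tube would then be cropped and passed to a fully convolutional mask head, which is omitted here.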
Related papers
- Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning [12.066823214932345]
Weakly-Supervised Dense Video Captioning aims to localize and describe all events of interest in a video without requiring annotations of event boundaries.
Existing methods rely on explicit alignment constraints between event locations and captions.
We propose a novel implicit location-caption alignment paradigm by complementary masking.
arXiv Detail & Related papers (2024-12-17T10:52:50Z)
- Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation [36.7177155799825]
The two tasks, text- and audio-referring video object segmentation, aim to segment specific objects from video sequences according to expression prompts.
EPCFormer exploits the fact that audio and text prompts referring to the same objects are semantically equivalent.
The knowledge of video object segmentation in terms of the expression prompts can seamlessly transfer between the two tasks.
arXiv Detail & Related papers (2023-08-08T09:48:00Z)
- Boosting Weakly-Supervised Temporal Action Localization with Text Information [94.48602948837664]
We propose a Text-Segment Mining (TSM) mechanism, which constructs a text description based on the action class label, and regards the text as the query to mine all class-related segments.
We also introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantically related segments in the video to complete the text sentence.
Surprisingly, our proposed method can also be seamlessly applied to existing methods and improves their performance by a clear margin.
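As a rough illustration of the text-as-query mining idea described for TSM above, the snippet below scores video segments against a query built from the action class name; the prompt template, the text_encoder callable, and the threshold are assumptions for demonstration, not the paper's mechanism.

```python
# Rough sketch of mining class-related segments with a text query built from the
# action label. The prompt template, the text_encoder callable, and the threshold
# are assumptions for illustration, not the TSM mechanism from the paper.
import torch.nn.functional as F

def mine_segments(segment_feats, text_encoder, class_name, thr=0.5):
    # segment_feats: (T, D) per-segment video features projected into a shared space
    query = text_encoder(f"a video of a person {class_name}")  # assumed prompt, returns (D,)
    sims = F.normalize(segment_feats, dim=-1) @ F.normalize(query, dim=-1)
    mined = (sims > thr).nonzero(as_tuple=True)[0]             # indices of class-related segments
    return mined, sims
```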
arXiv Detail & Related papers (2023-05-01T00:07:09Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extended to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation [90.74732705236336]
Language-queried video actor segmentation aims to predict the pixel-mask of the actor which performs the actions described by a natural language query in the target frames.
We propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors.
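A minimal sketch of such a collaborative design follows, assuming a generic 3D convolutional encoder over the clip, a 2D encoder over the target frame, and a simple text-conditioned fusion; the layer choices and dimensions are illustrative and not taken from the paper.

```python
# Rough sketch of the collaborative design described above: a 3D encoder over the
# clip for the queried action and a 2D encoder over the target frame for the actor
# mask. Layer choices and dimensions are illustrative assumptions only.
import torch
import torch.nn as nn

class CollaborativeSegmenter(nn.Module):
    def __init__(self, text_dim=300, ch=64):
        super().__init__()
        self.temporal_enc = nn.Conv3d(3, ch, kernel_size=3, padding=1)  # clip -> action cues
        self.spatial_enc = nn.Conv2d(3, ch, kernel_size=3, padding=1)   # frame -> actor cues
        self.text_proj = nn.Linear(text_dim, ch)
        self.decoder = nn.Conv2d(ch, 1, kernel_size=1)                  # pixel-wise mask logits

    def forward(self, clip, frame, query):
        # clip: (B, 3, T, H, W), frame: (B, 3, H, W), query: (B, text_dim)
        action = self.temporal_enc(clip).mean(dim=2)   # pool over time: (B, ch, H, W)
        actor = self.spatial_enc(frame)                # (B, ch, H, W)
        q = self.text_proj(query)[:, :, None, None]    # broadcast the text cue over space
        fused = actor * q + action                     # toy collaboration of the two encoders
        return self.decoder(fused)                     # (B, 1, H, W) mask logits
```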
arXiv Detail & Related papers (2021-05-14T13:27:53Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines that adopt cross-modal interaction learning.
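For intuition, contrastive learning over separately encoded text and video embeddings can be sketched with a standard symmetric InfoNCE objective; the temperature and in-batch negative scheme are common defaults, not details taken from the ReLoCLNet paper.

```python
# Minimal sketch of contrastive retrieval with separately encoded text and video,
# in the spirit of the description above. The temperature and in-batch negatives
# are standard assumptions, not details from the ReLoCLNet paper.
import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(text_emb, video_emb, temperature=0.07):
    # text_emb, video_emb: (B, D); matched pairs share the same batch index.
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = t @ v.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    # symmetric InfoNCE: text-to-video and video-to-text
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```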
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network [27.792054915363106]
A cross-modal self-attention (CMSA) module utilizes fine details of individual words and the input image or video.
A gated multi-level fusion (GMLF) module selectively integrates self-attentive cross-modal features.
A cross-frame self-attention (CFSA) module effectively integrates temporal information across consecutive frames.
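A cross-modal self-attention step of this flavor can be sketched by attending jointly over word and visual position features; the use of nn.MultiheadAttention and the projection sizes are assumptions rather than the CMSA module itself.

```python
# Illustrative sketch of a cross-modal self-attention step: word and pixel features
# are concatenated into one sequence and attended jointly. The projection sizes and
# the use of nn.MultiheadAttention are assumptions, not the CMSA implementation.
import torch
import torch.nn as nn

class CrossModalSelfAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, word_feats, visual_feats):
        # word_feats: (B, L, dim) per-word features; visual_feats: (B, HW, dim) per-position features
        tokens = torch.cat([word_feats, visual_feats], dim=1)  # joint multimodal sequence
        fused, _ = self.attn(tokens, tokens, tokens)           # every token attends to both modalities
        return fused[:, word_feats.size(1):]                   # return the refined visual positions
```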
arXiv Detail & Related papers (2021-02-09T11:27:59Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
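A toy version of bilateral query-video attention, iterated a fixed number of times, is sketched below; the layer types and iteration count are illustrative assumptions, not FIAN's actual design.

```python
# Toy sketch of iterative bilateral attention between a sentence query and video
# features, loosely following the description above. The attention layers and the
# number of iterations are illustrative assumptions rather than FIAN's design.
import torch
import torch.nn as nn

class IterativeBilateralAttention(nn.Module):
    def __init__(self, dim=256, heads=4, iters=2):
        super().__init__()
        self.video_from_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_from_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.iters = iters

    def forward(self, text, video):
        # text: (B, L, dim) word features; video: (B, T, dim) clip features
        for _ in range(self.iters):
            video = video + self.video_from_text(video, text, text)[0]  # video attends to words
            text = text + self.text_from_video(text, video, video)[0]   # words attend to video
        return text, video
```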
arXiv Detail & Related papers (2020-08-06T04:09:03Z)