ClawCraneNet: Leveraging Object-level Relation for Text-based Video
Segmentation
- URL: http://arxiv.org/abs/2103.10702v4
- Date: Fri, 19 Jan 2024 14:43:57 GMT
- Title: ClawCraneNet: Leveraging Object-level Relation for Text-based Video
Segmentation
- Authors: Chen Liang, Yu Wu, Yawei Luo and Yi Yang
- Abstract summary: Text-based video segmentation is a challenging task that segments out the natural language referred objects in videos.
We introduce a novel top-down approach that imitates how humans segment an object under language guidance.
Our method outperforms state-of-the-art methods by a large margin.
- Score: 47.7867284770227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based video segmentation is a challenging task that segments out the
natural language referred objects in videos. It essentially requires semantic
comprehension and fine-grained video understanding. Existing methods introduce
language representation into segmentation models in a bottom-up manner, which
merely conducts vision-language interaction within local receptive fields of
ConvNets. We argue that such interaction is insufficient, since the model can
barely construct region-level relationships from partial observations, which
runs contrary to the description logic of natural language referring expressions.
In fact, people usually describe a target object using relations with other
objects, which may not be easily understood without seeing the whole video. To
address this issue, we introduce a novel top-down approach that imitates how
humans segment an object under language guidance. We first identify all
candidate objects in the video and then choose the referred one by parsing
relations among those high-level objects. Three kinds of object-level relations
are investigated for precise relationship understanding, i.e., positional
relation, text-guided semantic relation, and temporal relation. Extensive
experiments on A2D Sentences and J-HMDB Sentences show our method outperforms
state-of-the-art methods by a large margin. Qualitative results also show that
our predictions are more explainable.
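For concreteness, here is a minimal sketch of how such a top-down, select-then-parse pipeline could be organized: candidate objects are first enumerated, each candidate is scored against the referring expression using positional, text-guided semantic, and temporal cues, and the mask of the highest-scoring candidate is returned. The module choices, tensor shapes, and fusion scheme below are illustrative assumptions, not the authors' ClawCraneNet architecture.

```python
# Illustrative sketch of a top-down "enumerate candidates, then pick the referred one"
# pipeline. All design choices here are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class ObjectLevelRelationScorer(nn.Module):
    """Scores candidate objects against a referring expression using three
    object-level relation cues: positional, text-guided semantic, and temporal."""

    def __init__(self, obj_dim: int = 256, text_dim: int = 256):
        super().__init__()
        # Positional relation: compares normalized box geometry between object pairs.
        self.pos_mlp = nn.Sequential(
            nn.Linear(8, obj_dim), nn.ReLU(), nn.Linear(obj_dim, obj_dim)
        )
        # Text-guided semantic relation: each object attends to the sentence tokens.
        self.text_proj = nn.Linear(text_dim, obj_dim)
        self.sem_attn = nn.MultiheadAttention(obj_dim, num_heads=4, batch_first=True)
        # Temporal relation: summarizes each object's feature trajectory across frames.
        self.temporal = nn.GRU(obj_dim, obj_dim, batch_first=True)
        self.classifier = nn.Linear(obj_dim, 1)

    def forward(self, obj_feats, obj_boxes, text_feat):
        """
        obj_feats: (N, T, D)   per-object features over T frames
        obj_boxes: (N, T, 4)   normalized xyxy boxes per frame
        text_feat: (L, D_t)    token features of the referring expression
        returns:   (N,)        one score per candidate object
        """
        n = obj_feats.size(0)

        # Temporal relation: GRU summary of each candidate's trajectory.
        _, h = self.temporal(obj_feats)                  # h: (1, N, D)
        obj_summary = h.squeeze(0)                       # (N, D)

        # Positional relation: pairwise box geometry, averaged over partners.
        boxes = obj_boxes.mean(dim=1)                    # (N, 4) coarse per-object box
        pair_geo = torch.cat(
            [boxes.unsqueeze(1).expand(n, n, 4),
             boxes.unsqueeze(0).expand(n, n, 4)], dim=-1
        )                                                # (N, N, 8)
        pos_ctx = self.pos_mlp(pair_geo).mean(dim=1)     # (N, D)

        # Text-guided semantic relation: each object queries the sentence tokens.
        text = self.text_proj(text_feat).unsqueeze(0).expand(n, -1, -1)   # (N, L, D)
        sem_ctx, _ = self.sem_attn(obj_summary.unsqueeze(1), text, text)  # (N, 1, D)
        sem_ctx = sem_ctx.squeeze(1)

        fused = obj_summary + pos_ctx + sem_ctx
        return self.classifier(fused).squeeze(-1)        # higher = more likely referred


def select_referred_mask(candidate_masks, scores):
    """Top-down selection: keep the mask of the highest-scoring candidate."""
    return candidate_masks[scores.argmax()]
```

In practice the candidate objects, their per-frame features, and their masks would come from an off-the-shelf instance segmentation model run on each frame; the scorer above only decides which candidate the expression refers to.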
Related papers
- NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality [52.08735848128973]
We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions and their relations.
We propose a training method called NAVERO which utilizes video-text data augmented with negative texts to enhance composition understanding.
arXiv Detail & Related papers (2024-08-18T15:27:06Z) - Context Propagation from Proposals for Semantic Video Object Segmentation [1.223779595809275]
We propose a novel approach to learning semantic contextual relationships in videos for semantic object segmentation.
Our approach derives semantic contexts from video object proposals, which encode the key evolution of objects and the relationships among objects over the semantic-temporal domain.
arXiv Detail & Related papers (2024-07-08T14:44:18Z) - Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z) - Contrastive Video-Language Segmentation [41.1635597261304]
We focus on the problem of segmenting a particular object referred to by a natural language sentence in video content.
We propose to intertwine the visual and linguistic modalities explicitly via a contrastive learning objective (a generic sketch of such an objective appears after this list).
arXiv Detail & Related papers (2021-09-29T01:40:58Z) - Object Priors for Classifying and Localizing Unseen Actions [45.91275361696107]
We propose three spatial object priors, which encode local person and object detectors along with their spatial relations.
On top of these, we introduce three semantic object priors, which extend semantic matching through word embeddings.
A video embedding combines the spatial and semantic object priors.
arXiv Detail & Related papers (2021-04-10T08:56:58Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - DORi: Discovering Object Relationship for Moment Localization of a
Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z) - Visual Relation Grounding in Videos [86.06874453626347]
We explore a novel task named visual Relation Grounding in Videos (RGV).
This task aims at providing supportive visual facts for other video-language tasks (e.g., video grounding and video question answering).
We tackle the challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical-temporal region graph.
Experimental results demonstrate that our model can not only outperform baseline approaches significantly, but also produce visually meaningful facts.
arXiv Detail & Related papers (2020-07-17T08:20:39Z)
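The Contrastive Video-Language Segmentation entry above refers to intertwining the visual and linguistic modalities via a contrastive learning objective. The sketch below shows a generic InfoNCE-style video-language contrastive loss of that kind; the pooling, batch construction, and temperature are illustrative assumptions, not that paper's exact formulation.

```python
# Generic InfoNCE-style contrastive objective between pooled video-region features
# and pooled sentence features. Illustrative only; not the cited paper's loss.
import torch
import torch.nn.functional as F


def video_language_contrastive_loss(region_feats: torch.Tensor,
                                    text_feats: torch.Tensor,
                                    temperature: float = 0.07) -> torch.Tensor:
    """
    region_feats: (B, D) pooled features of the referred region in each video
    text_feats:   (B, D) pooled features of the paired referring sentence
    Matched (video, sentence) pairs lie on the diagonal of the similarity
    matrix; all other pairs in the batch act as negatives.
    """
    v = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy over video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


if __name__ == "__main__":
    # Toy usage with random 256-dim embeddings for a batch of 4 pairs.
    loss = video_language_contrastive_loss(torch.randn(4, 256), torch.randn(4, 256))
    print(float(loss))
```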