Visual Relation Grounding in Videos
- URL: http://arxiv.org/abs/2007.08814v2
- Date: Tue, 21 Jul 2020 07:20:32 GMT
- Title: Visual Relation Grounding in Videos
- Authors: Junbin Xiao, Xindi Shang, Xun Yang, Sheng Tang, Tat-Seng Chua
- Abstract summary: We explore a novel task named visual Relation Grounding in Videos (vRGV).
This task aims at providing supportive visual facts for other high-level video-language tasks (e.g., video-language grounding and video question answering).
We tackle the challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical spatio-temporal region graph.
Experimental results demonstrate that our model not only outperforms baseline approaches significantly, but also produces visually meaningful facts.
- Score: 86.06874453626347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore a novel task named visual Relation Grounding in
Videos (vRGV). The task aims at spatio-temporally localizing the given
relations in the form of subject-predicate-object in the videos, so as to
provide supportive visual facts for other high-level video-language tasks
(e.g., video-language grounding and video question answering). The challenges
in this task include, but are not limited to: (1) both the subject and object are
required to be spatio-temporally localized to ground a query relation; (2) the
temporal dynamic nature of visual relations in videos is difficult to capture;
and (3) the grounding should be achieved without any direct supervision in
space and time. To ground the relations, we tackle the challenges by
collaboratively optimizing two sequences of regions over a constructed
hierarchical spatio-temporal region graph through relation attending and
reconstruction, in which we further propose a message passing mechanism by
spatial attention shifting between visual entities. Experimental results
demonstrate that our model can not only outperform baseline approaches
significantly, but also produce visually meaningful facts to support visual
grounding. (Code is available at https://github.com/doc-doc/vRGV).
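To make the attention-shifting idea concrete, below is a minimal PyTorch-style sketch of message passing between the subject and object attentions over the candidate regions of a single frame. The module name, tensor shapes, and the single shift step are simplifying assumptions for illustration, not the authors' implementation (see the linked repository for the official code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionShiftSketch(nn.Module):
    """Hypothetical sketch: spatial attention over the candidate regions of one
    frame, with the attended subject context passed as a message to steer the
    object attention (and vice versa). Not the official vRGV implementation."""

    def __init__(self, region_dim, query_dim, hidden_dim=256):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        self.msg_proj = nn.Linear(hidden_dim, hidden_dim)

    def attend(self, regions, query, message=None):
        # regions: (num_regions, hidden_dim); query/message: (hidden_dim,)
        key = query if message is None else query + self.msg_proj(message)
        alpha = F.softmax(regions @ key, dim=0)   # spatial attention weights
        context = alpha @ regions                 # attended region feature
        return alpha, context

    def forward(self, frame_regions, subj_query, obj_query):
        # frame_regions: (num_regions, region_dim) -- region proposals of one frame
        # subj_query / obj_query: (query_dim,) -- embeddings of the relation's
        # subject and object (e.g., "person" and "bicycle" in person-ride-bicycle)
        regions = self.region_proj(frame_regions)
        subj_q, obj_q = self.query_proj(subj_query), self.query_proj(obj_query)
        # attend to the subject, shift its context to the object as a message,
        # then shift the object context back to refine the subject attention
        _, subj_ctx = self.attend(regions, subj_q)
        obj_alpha, obj_ctx = self.attend(regions, obj_q, message=subj_ctx)
        subj_alpha, _ = self.attend(regions, subj_q, message=obj_ctx)
        return subj_alpha, obj_alpha  # per-region weights used to pick the grounded boxes
```

In the paper, such per-frame attentions are linked across time through the hierarchical spatio-temporal region graph and optimized jointly with the relation attending and reconstruction objectives; the sketch above only illustrates the single-frame shifting step.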
Related papers
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Visual Spatio-temporal Relation-enhanced Network for Cross-modal
Text-Video Retrieval [17.443195531553474]
Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual Spatio-temporal Relation-enhanced semantic network, a cross-modal text-video retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
arXiv Detail & Related papers (2021-10-29T08:23:40Z) - ClawCraneNet: Leveraging Object-level Relation for Text-based Video
Segmentation [47.7867284770227]
Text-based video segmentation is a challenging task that segments out the natural language referred objects in videos.
We introduce a novel top-down approach that imitates how we humans segment an object with language guidance.
Our method outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-03-19T09:31:08Z) - Decoupled Spatial Temporal Graphs for Generic Visual Grounding [120.66884671951237]
This work investigates a more general setting, generic visual grounding, aiming to mine all the objects satisfying the given expression.
We propose a simple yet effective approach, named DSTG, which decomposes the spatial and temporal representations to collect all-sided cues for precise grounding.
We further elaborate a new video dataset, GVG, that consists of challenging referring cases with far-ranging videos.
arXiv Detail & Related papers (2021-03-18T11:56:29Z) - Fine-grained Iterative Attention Network for Temporal Language
Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction (a generic sketch of such bilateral attention appears after this list).
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z) - Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z) - Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form
Sentences [107.0776836117313]
Given an untrimmed video and a declarative/interrogative sentence, STVG aims to localize the spatio-temporal tube of the queried object.
Existing methods cannot tackle the STVG task due to the ineffective tube pre-generation and the lack of novel object relationship modeling.
We present a Spatio-Temporal Graph Reasoning Network (STGRN) for this task.
arXiv Detail & Related papers (2020-01-19T19:53:22Z)
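For the bilateral query-video attention mentioned in the FIAN entry above, the following generic PyTorch sketch shows one way such iterative, two-way attention can be wired up; the module layout, head count, and number of iterations are assumptions for illustration, not FIAN's actual architecture.

```python
import torch.nn as nn

class BilateralIterativeAttention(nn.Module):
    """Generic sketch of iterative two-way attention between sentence-query
    features and video features; names and hyper-parameters are assumptions,
    not FIAN's actual design."""

    def __init__(self, dim, num_heads=4, num_iters=2):
        super().__init__()
        # dim must be divisible by num_heads
        self.num_iters = num_iters
        self.video_from_query = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.query_from_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_feats, video_feats):
        # query_feats: (batch, num_words, dim); video_feats: (batch, num_clips, dim)
        q, v = query_feats, video_feats
        for _ in range(self.num_iters):
            # video attends to the query, then the query attends to the
            # updated video features, so information flows in both directions
            v = v + self.video_from_query(v, q, q)[0]
            q = q + self.query_from_video(q, v, v)[0]
        return q, v  # enriched features for downstream moment localization
```

The enriched query and video features would then feed a localization head that predicts the start and end of the grounded segment.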