Decoupled Spatial Temporal Graphs for Generic Visual Grounding
- URL: http://arxiv.org/abs/2103.10191v1
- Date: Thu, 18 Mar 2021 11:56:29 GMT
- Title: Decoupled Spatial Temporal Graphs for Generic Visual Grounding
- Authors: Qianyu Feng, Yunchao Wei, Mingming Cheng, Yi Yang
- Abstract summary: This work investigates a more general setting, generic visual grounding, aiming to mine all the objects satisfying the given expression.
We propose a simple yet effective approach, named DSTG, which commits to 1) decomposing the spatial and temporal representations to collect all-sided cues for precise grounding and 2) enhancing discriminability against distractors with a contrastive learning routing strategy.
We further construct a new video dataset, GVG, consisting of challenging referring cases in far-ranging videos.
- Score: 120.66884671951237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual grounding is a long-standing problem in vision-language
understanding due to its diversity and complexity. Current practice
concentrates mostly on performing visual grounding in still images or
well-trimmed video clips. This work, on the other hand, investigates a more
general setting, generic visual grounding, which aims to mine all the objects
satisfying a given expression and is more challenging yet more practical in
real-world scenarios. Importantly, grounding results are expected to localize
targets accurately in both space and time. However, it is tricky to trade off
appearance and motion features, and in real scenarios models tend to fail to
distinguish distractors with similar attributes. Motivated by these
considerations, we propose a simple yet effective approach, named DSTG, which
commits to 1) decomposing the spatial and temporal representations to collect
all-sided cues for precise grounding; and 2) enhancing discriminability against
distractors and temporal consistency with a contrastive learning routing
strategy. We further construct a new video dataset, GVG, that consists of
challenging referring cases with far-ranging videos. Empirical experiments
demonstrate the superiority of DSTG over state-of-the-art methods on the
Charades-STA, ActivityNet-Caption and GVG datasets. Code and dataset will be
made available.
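The abstract only names the two ingredients of DSTG. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration, under assumed shapes and module names, of how a decoupled spatial/temporal encoding and an InfoNCE-style contrastive term for suppressing distractors might be wired together.

```python
# Illustrative sketch only: module names, dimensions, and the loss form are
# assumptions, not the released DSTG code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledEncoder(nn.Module):
    """Encodes a tube of region features into separate spatial and temporal cues."""
    def __init__(self, dim=256):
        super().__init__()
        self.spatial_proj = nn.Linear(dim, dim)                   # appearance/attribute cues per frame
        self.temporal_rnn = nn.GRU(dim, dim, batch_first=True)    # motion/ordering cues across frames

    def forward(self, tube_feats):                                # tube_feats: (B, T, dim)
        spatial = self.spatial_proj(tube_feats).mean(dim=1)       # (B, dim)
        _, temporal = self.temporal_rnn(tube_feats)               # h_n: (1, B, dim)
        return spatial, temporal.squeeze(0)

def contrastive_routing_loss(query, target, distractors, tau=0.07):
    """InfoNCE-style loss: pull the referred target toward the language query,
    push away visually similar distractors (assumed formulation)."""
    query = F.normalize(query, dim=-1)                            # (B, dim)
    target = F.normalize(target, dim=-1)                          # (B, dim)
    distractors = F.normalize(distractors, dim=-1)                # (B, K, dim)
    pos = (query * target).sum(-1, keepdim=True) / tau            # (B, 1)
    neg = torch.einsum("bd,bkd->bk", query, distractors) / tau    # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

In this reading, the spatial branch carries appearance and attribute cues while the recurrent temporal branch carries motion cues; the actual routing strategy in DSTG may combine them differently.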
Related papers
- AffordanceLLM: Grounding Affordance from Vision Language Models [36.97072698640563]
Affordance grounding refers to the task of finding the area of an object with which one can interact.
Much of the required knowledge is hidden and lies beyond the image content when supervised labels come from a limited training set.
We make an attempt to improve the generalization capability of current affordance grounding by taking advantage of the rich world, abstract, and human-object-interaction knowledge.
arXiv Detail & Related papers (2024-01-12T03:21:02Z)
- UniVTG: Towards Unified Video-Language Temporal Grounding [52.56732639951834]
Video Temporal Grounding (VTG) aims to ground target clips from videos according to custom language queries.
We propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions.
Thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels.
arXiv Detail & Related papers (2023-07-31T14:34:49Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos [23.64091569954785]
We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation.
By using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset.
arXiv Detail & Related papers (2022-06-15T18:57:07Z)
- Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution [0.0]
We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description.
Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains.
We introduce a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations.
arXiv Detail & Related papers (2022-05-24T14:12:32Z)
- Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z)
- Visual Relation Grounding in Videos [86.06874453626347]
We explore a novel task named visual Relation Grounding in Videos (RGV).
This task aims at providing supportive visual facts for other video-language tasks (e.g., video grounding and video question answering).
We tackle the challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical spatio-temporal region graph.
Experimental results demonstrate that our model not only outperforms baseline approaches significantly, but also produces visually meaningful facts.
arXiv Detail & Related papers (2020-07-17T08:20:39Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid unstable performance caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism in which local object information is used to regularize global scene features.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
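The object-aware knowledge distillation in the entry above is only named in the summary. Below is a hypothetical PyTorch sketch of one plausible form, where a masked pool over per-object features regularizes the global scene feature so that a variable object count does not destabilize training; the function and tensor names are assumptions, not the paper's released code.

```python
# Illustrative sketch only: one plausible object-aware distillation term.
import torch
import torch.nn.functional as F

def object_aware_distillation(scene_feat, object_feats, object_mask):
    """scene_feat: (B, D) global scene feature.
    object_feats: (B, N, D) per-object features, zero-padded to N slots.
    object_mask: (B, N) with 1 for real objects, 0 for padding, so the loss
    stays well-defined regardless of how many objects each clip contains."""
    object_mask = object_mask.float()
    weights = object_mask / object_mask.sum(dim=1, keepdim=True).clamp(min=1)
    pooled = torch.einsum("bn,bnd->bd", weights, object_feats)   # masked mean over objects
    # Align the object branch with the scene branch via a cosine-distance penalty.
    return 1.0 - F.cosine_similarity(scene_feat, pooled, dim=-1).mean()
```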
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.