Co-Grounding Networks with Semantic Attention for Referring Expression
Comprehension in Videos
- URL: http://arxiv.org/abs/2103.12346v1
- Date: Tue, 23 Mar 2021 06:42:49 GMT
- Title: Co-Grounding Networks with Semantic Attention for Referring Expression
Comprehension in Videos
- Authors: Sijie Song, Xudong Lin, Jiaying Liu, Zongming Guo and Shih-Fu Chang
- Abstract summary: We tackle the problem of referring expression comprehension in videos with an elegant one-stage framework.
We enhance the single-frame grounding accuracy with semantic attention learning and improve the cross-frame grounding consistency with co-grounding feature learning.
Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset.
- Score: 96.85840365678649
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address the problem of referring expression comprehension
in videos, which is challenging due to complex expression and scene dynamics.
Unlike previous methods which solve the problem in multiple stages (i.e.,
tracking, proposal-based matching), we tackle the problem from a novel
perspective, co-grounding, with an elegant one-stage framework. We
enhance the single-frame grounding accuracy by semantic attention learning and
improve the cross-frame grounding consistency with co-grounding feature
learning. Semantic attention learning explicitly parses referring cues in
different attributes to reduce the ambiguity in the complex expression.
Co-grounding feature learning boosts visual feature representations by
integrating temporal correlation to reduce the ambiguity caused by scene
dynamics. Experimental results demonstrate the superiority of our framework on
the video grounding datasets VID and LiOTB in generating accurate and stable
results across frames. Our model is also applicable to referring expression
comprehension in images, illustrated by the improved performance on the RefCOCO
dataset. Our project is available at https://sijiesong.github.io/co-grounding.
Related papers
- Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval [0.17590081165362778]
Video moment retrieval is a challenging task requiring fine-grained interactions between video and text modalities.
Recent work in image-text pretraining has demonstrated that most existing pretrained models suffer from information asymmetry due to the difference in length between visual and textual sequences.
We question whether the same problem also exists in the video-text domain with an auxiliary need to preserve both spatial and temporal information.
arXiv Detail & Related papers (2023-12-12T17:00:46Z) - Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLMs to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z) - Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal
Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z) - Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances the dense associations in both inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z) - Relation-aware Instance Refinement for Weakly Supervised Visual
Grounding [44.33411132188231]
Visual grounding aims to build a correspondence between visual objects and their language entities.
We propose a novel weakly-supervised learning method that incorporates coarse-to-fine object refinement and entity relation modeling.
Experiments on two public benchmarks demonstrate the efficacy of our framework.
arXiv Detail & Related papers (2021-03-24T05:03:54Z) - Image Captioning with Visual Object Representations Grounded in the
Textual Modality [14.797241131469486]
We explore the possibilities of a shared embedding space between textual and visual modality.
We propose an approach opposite to the current trend: grounding the representations in the word embedding space of the captioning system.
arXiv Detail & Related papers (2020-10-19T12:21:38Z) - Visual Relation Grounding in Videos [86.06874453626347]
We explore a novel task named visual Relation Grounding in Videos (RGV).
This task aims at providing supportive visual facts for other video-language tasks (e.g., video grounding and video question answering).
We tackle challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical-temporal region.
Experimental results demonstrate that our model can not only outperform baseline approaches significantly, but also produce visually meaningful facts.
arXiv Detail & Related papers (2020-07-17T08:20:39Z) - Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid unstable performance caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)