Correspondence Matters for Video Referring Expression Comprehension
- URL: http://arxiv.org/abs/2207.10400v1
- Date: Thu, 21 Jul 2022 10:31:39 GMT
- Title: Correspondence Matters for Video Referring Expression Comprehension
- Authors: Meng Cao, Ji Jiang, Long Chen, Yuexian Zou
- Abstract summary: Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed as DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
- Score: 64.60046797561455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the problem of video Referring Expression Comprehension (REC),
which aims to localize the referent objects described in the sentence to visual
regions in the video frames. Despite the recent progress, existing methods
suffer from two problems: 1) inconsistent localization results across video
frames; 2) confusion between the referent and contextual objects. To this end,
we propose a novel Dual Correspondence Network (dubbed as DCNet) which
explicitly enhances the dense associations in both the inter-frame and
cross-modal manners. Firstly, we aim to build the inter-frame correlations for
all existing instances within the frames. Specifically, we compute the
inter-frame patch-wise cosine similarity to estimate the dense alignment and
then perform the inter-frame contrastive learning to map them close in feature
space. Secondly, we propose to build the fine-grained patch-word alignment to
associate each patch with certain words. Due to the lack of such detailed annotations, we also predict the patch-word correspondence through the
cosine similarity. Extensive experiments demonstrate that our DCNet achieves
state-of-the-art performance on both video and image REC benchmarks.
Furthermore, we conduct comprehensive ablation studies and thorough analyses to
explore the optimal model designs. Notably, our inter-frame and cross-modal
contrastive losses are plug-and-play functions and are applicable to any video
REC architectures. For example, by building on top of Co-grounding, we boost the performance by an absolute 1.48% on Accu.@0.5 on the VID-Sentence dataset.
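To make the two correspondence signals in the abstract concrete, below is a minimal PyTorch sketch of (i) inter-frame patch-wise cosine similarity driving an inter-frame contrastive loss and (ii) patch-word cosine similarity used as a predicted soft alignment. The tensor shapes, the temperature value, the argmax pseudo-positive assignment, and the InfoNCE-style cross-entropy are illustrative assumptions, not the exact losses defined in DCNet.

```python
# Illustrative sketch only; function names, shapes, and the temperature are assumptions.
import torch
import torch.nn.functional as F

def interframe_contrastive(patches_a, patches_b, temperature=0.07):
    """patches_a, patches_b: (N, D) patch features from two frames of the same video."""
    a = F.normalize(patches_a, dim=-1)
    b = F.normalize(patches_b, dim=-1)
    sim = a @ b.t() / temperature                  # (N, N) patch-wise cosine similarity
    # Use the most similar patch in the other frame as a pseudo-positive
    # (a stand-in for the estimated dense alignment); no gradient flows
    # through the assignment itself.
    targets = sim.detach().argmax(dim=-1)
    return F.cross_entropy(sim, targets)           # InfoNCE-style contrastive loss

def patch_word_alignment(patches, words):
    """patches: (N, D) visual patch features, words: (L, D) word features."""
    p = F.normalize(patches, dim=-1)
    w = F.normalize(words, dim=-1)
    sim = p @ w.t()                                # (N, L) cosine similarity
    # Without ground-truth patch-word labels, treat the softmax over words
    # as a predicted soft correspondence for each patch.
    return sim.softmax(dim=-1)

# Usage with random features standing in for backbone outputs.
f1, f2 = torch.randn(49, 256), torch.randn(49, 256)
loss = interframe_contrastive(f1, f2)
alignment = patch_word_alignment(torch.randn(49, 256), torch.randn(12, 256))
```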
Related papers
- Video Referring Expression Comprehension via Transformer with Content-conditioned Query [68.06199031102526]
Video Referring Expression Comprehension (REC) aims to localize a target object in videos based on the queried natural language.
Recent improvements in video REC have been made using Transformer-based methods with learnable queries.
arXiv Detail & Related papers (2023-10-25T06:38:42Z)
- Unified Coarse-to-Fine Alignment for Video-Text Retrieval [71.85966033484597]
We propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA.
Our model captures the cross-modal similarity information at different granularity levels.
We apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them (a short sketch of this normalization appears after this list).
arXiv Detail & Related papers (2023-09-18T19:04:37Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
- Video Referring Expression Comprehension via Transformer with Content-aware Query [60.89442448993627]
Video Referring Expression Comprehension (REC) aims to localize a target object in video frames referred to by the natural language expression.
We argue that the current query design is suboptimal and suffers from two drawbacks.
We set up a fixed number of learnable bounding boxes across the frame and the aligned region features are employed to provide fruitful clues.
arXiv Detail & Related papers (2022-10-06T14:45:41Z)
- HunYuan_tvr for Text-Video Retrivial [23.650824732136158]
HunYuan_tvr explores hierarchical cross-modal interactions by simultaneously modeling video-sentence, clip-phrase, and frame-word relationships.
HunYuan_tvr obtains new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 57.8%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet respectively.
arXiv Detail & Related papers (2022-04-07T11:59:36Z)
- Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval [17.443195531553474]
Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual Spatio-temporal Relation-enhanced semantic network (CNN-SRNet), a cross-modal retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
arXiv Detail & Related papers (2021-10-29T08:23:40Z)
- Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos [96.85840365678649]
We tackle the problem of referring expression comprehension in videos with an elegant one-stage framework.
We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency.
Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset.
arXiv Detail & Related papers (2021-03-23T06:42:49Z)
- Exploiting Visual Semantic Reasoning for Video-Text Retrieval [14.466809435818984]
We propose a Visual Semantic Enhanced Reasoning Network (ViSERN) to exploit reasoning between frame regions.
We perform reasoning by novel random walk rule-based graph convolutional networks to generate region features involved with semantic relations.
With the benefit of reasoning, semantic interactions between regions are considered, while the impact of redundancy is suppressed.
arXiv Detail & Related papers (2020-06-16T02:56:46Z)
- Near-duplicate video detection featuring coupled temporal and perceptual visual structures and logical inference based matching [0.0]
We propose an architecture for near-duplicate video detection based on index and query signature structures that integrate temporal and perceptual visual features.
For matching, we propose to instantiate a retrieval model based on logical inference through the coupling of an N-gram sliding window process and theoretically-sound lattice-based structures.
arXiv Detail & Related papers (2020-05-15T04:45:52Z)
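The Sinkhorn-Knopp normalization mentioned in the UCoFiA entry above rescales a similarity matrix so that its rows and columns are alternately normalized before the per-level similarities are combined. Below is a minimal sketch of that normalization; the iteration count, the exponential scaling factor, and the variable names are assumptions for illustration, not UCoFiA's implementation.

```python
# Illustrative Sinkhorn-Knopp normalization of a similarity matrix; eps and
# n_iters are assumed values, not taken from the UCoFiA paper.
import torch

def sinkhorn_normalize(sim, n_iters=10, eps=0.05):
    """sim: (M, N) similarity matrix; returns an alternately row/column-normalized kernel."""
    K = torch.exp(sim / eps)                  # positive kernel derived from similarities
    for _ in range(n_iters):
        K = K / K.sum(dim=1, keepdim=True)    # normalize rows
        K = K / K.sum(dim=0, keepdim=True)    # normalize columns
    return K

balanced = sinkhorn_normalize(torch.rand(8, 8))
```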