Video Referring Expression Comprehension via Transformer with
Content-aware Query
- URL: http://arxiv.org/abs/2210.02953v1
- Date: Thu, 6 Oct 2022 14:45:41 GMT
- Title: Video Referring Expression Comprehension via Transformer with
Content-aware Query
- Authors: Ji Jiang, Meng Cao, Tengtao Song, Yuexian Zou
- Abstract summary: Video Referring Expression Comprehension (REC) aims to localize a target object in video frames referred to by a natural language expression.
We argue that the current query design is suboptimal and suffers from two drawbacks.
We set up a fixed number of learnable bounding boxes across the frame, and the aligned region features are employed to provide fruitful clues.
- Score: 60.89442448993627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Referring Expression Comprehension (REC) aims to localize a target
object in video frames referred to by a natural language expression. Recently,
Transformer-based methods have greatly boosted performance.
However, we argue that the current query design is suboptimal and suffers from
two drawbacks: 1) slow training convergence; 2) the lack of
fine-grained alignment. To alleviate this, we aim to couple the pure learnable
queries with the content information. Specifically, we set up a fixed number of
learnable bounding boxes across the frame and the aligned region features are
employed to provide fruitful clues. Besides, we explicitly link certain phrases
in the sentence to the semantically relevant visual areas. To this end, we
introduce two new datasets (i.e., VID-Entity and VidSTG-Entity) by augmenting
the VID-Sentence and VidSTG datasets with the explicitly referred words in the
whole sentence, respectively. Benefiting from this, we conduct the fine-grained
cross-modal alignment at the region-phrase level, which ensures more detailed
feature representations. Incorporating these two designs, our proposed model
(dubbed ContFormer) achieves state-of-the-art performance on widely
benchmarked datasets. For example, on the VID-Entity dataset, ContFormer achieves an
8.75% absolute improvement on Accu.@0.6 over the previous SOTA.
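To make the two designs above concrete, the following is a minimal PyTorch-style sketch of (a) coupling pure learnable queries with region content pooled inside a fixed set of learnable boxes, and (b) a region-phrase alignment loss over annotated region-word pairs. All module and variable names, the RoIAlign pooling choice, and the InfoNCE-style loss formulation are illustrative assumptions inferred from the abstract, not the authors' released implementation.

```python
# Hedged sketch: content-aware queries and region-phrase alignment.
# Names (ContentAwareQuery, region_phrase_alignment_loss, ...) are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align


class ContentAwareQuery(nn.Module):
    """Couple pure learnable queries with content pooled from the current frame."""

    def __init__(self, num_queries: int = 20, d_model: int = 256):
        super().__init__()
        # Learnable boxes (cx, cy, w, h), shared across all frames.
        self.ref_boxes = nn.Parameter(torch.rand(num_queries, 4))
        # Pure learnable part of the query, as in DETR-style decoders.
        self.query_embed = nn.Embedding(num_queries, d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
        """feat_map: (1, d_model, H, W) feature of one frame from the visual backbone."""
        _, _, h, w = feat_map.shape
        # Convert normalized (cx, cy, w, h) boxes to absolute (x1, y1, x2, y2).
        cx, cy, bw, bh = self.ref_boxes.sigmoid().unbind(-1)
        boxes = torch.stack(
            [(cx - bw / 2) * w, (cy - bh / 2) * h,
             (cx + bw / 2) * w, (cy + bh / 2) * h], dim=-1)
        batch_idx = torch.zeros(len(boxes), 1, device=feat_map.device)
        rois = torch.cat([batch_idx, boxes], dim=-1)                   # (N, 5)
        # Pool the region feature inside each learnable box.
        region = roi_align(feat_map, rois, output_size=1).flatten(1)   # (N, d_model)
        # Content-aware query = learnable query + aligned region content.
        return self.query_embed.weight + self.proj(region)


def region_phrase_alignment_loss(region_feats, phrase_feats, match, tau=0.07):
    """Fine-grained cross-modal alignment at the region-phrase level.

    region_feats: (R, d) decoder outputs; phrase_feats: (P, d) phrase embeddings;
    match: (R, P) binary matrix of annotated region-phrase pairs (in the spirit of
    the VID-Entity / VidSTG-Entity word annotations); assumes every row and column
    contains at least one positive.
    """
    sim = F.normalize(region_feats, dim=-1) @ F.normalize(phrase_feats, dim=-1).T
    logits = sim / tau
    # Symmetric InfoNCE-style objective over matched pairs (one common choice).
    loss_r2p = F.cross_entropy(logits, match.float().argmax(dim=1))
    loss_p2r = F.cross_entropy(logits.T, match.float().argmax(dim=0))
    return 0.5 * (loss_r2p + loss_p2r)
```

The intent of the sketch is that each decoder query carries frame-specific content rather than being purely learnable, which is what the abstract credits for faster convergence and finer alignment; how ContFormer actually fuses the two signals is not specified here.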
Related papers
- Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval [31.42856682276394]
Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query.
Existing strategies are often sub-optimal since they ignore the modality imbalance problem.
We introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment.
arXiv Detail & Related papers (2023-12-19T13:38:48Z)
- Video Referring Expression Comprehension via Transformer with Content-conditioned Query [68.06199031102526]
Video Referring Expression Comprehension (REC) aims to localize a target object in videos based on a natural language query.
Recent improvements in video REC have been made using Transformer-based methods with learnable queries.
arXiv Detail & Related papers (2023-10-25T06:38:42Z)
- Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN)
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z)
- Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample.
arXiv Detail & Related papers (2022-08-19T11:21:59Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed as DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- Local-Global Context Aware Transformer for Language-Guided Video Segmentation [103.35509224722097]
We explore the task of language-guided video segmentation (LVS)
We present Locater, which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner.
To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon A2D-S dataset.
arXiv Detail & Related papers (2022-03-18T07:35:26Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)