Video Referring Expression Comprehension via Transformer with
Content-conditioned Query
- URL: http://arxiv.org/abs/2310.16402v1
- Date: Wed, 25 Oct 2023 06:38:42 GMT
- Title: Video Referring Expression Comprehension via Transformer with
Content-conditioned Query
- Authors: Ji Jiang, Meng Cao, Tengtao Song, Long Chen, Yi Wang, Yuexian Zou
- Abstract summary: Video Referring Expression Comprehension (REC) aims to localize a target object in videos based on the queried natural language.
Recent improvements in video REC have been made using Transformer-based methods with learnable queries.
- Score: 68.06199031102526
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Video Referring Expression Comprehension (REC) aims to localize a target
object in videos based on the queried natural language. Recent improvements in
video REC have been made using Transformer-based methods with learnable
queries. However, we contend that this naive query design is not ideal given
the open-world nature of video REC brought by text supervision. With numerous
potential semantic categories, relying on only a few slowly updated queries is
insufficient to characterize them. Our solution to this problem is to create
dynamic queries that are conditioned on both the input video and language to
model the diverse objects referred to. Specifically, we place a fixed number of
learnable bounding boxes throughout the frame and use corresponding region
features to provide prior information. Also, we noticed that current query
features overlook the importance of cross-modal alignment. To address this, we
align specific phrases in the sentence with semantically relevant visual areas,
annotating them in existing video datasets (VID-Sentence and VidSTG). By
incorporating these two designs, our proposed model (called ConFormer)
outperforms other models on widely benchmarked datasets. For example, on the
testing split of the VID-Sentence dataset, ConFormer achieves an 8.75% absolute
improvement in Accu.@0.6 over the previous state-of-the-art model.
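The core idea of the abstract (queries conditioned on both video and language, seeded from a fixed set of learnable boxes whose pooled region features supply visual priors) can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch-style illustration, not the authors' implementation: the module name ContentConditionedQuery, the use of RoIAlign pooling, and the single cross-attention fusion step are all assumptions made for clarity.

```python
# Hypothetical sketch (not the authors' code): learnable boxes are pooled into
# region features, which are fused with language features to form per-frame
# content-conditioned queries for a DETR-style decoder.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ContentConditionedQuery(nn.Module):
    def __init__(self, num_queries: int = 16, dim: int = 256):
        super().__init__()
        # A fixed number of learnable boxes in normalized (cx, cy, w, h) form.
        self.boxes = nn.Parameter(torch.rand(num_queries, 4))
        self.query_embed = nn.Embedding(num_queries, dim)
        # Cross-attention that lets each query attend to the language tokens.
        self.lang_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # frame_feat: (B, C, H, W) visual feature map of one frame
        # lang_feat:  (B, L, C) token-level language features
        B, C, H, W = frame_feat.shape
        # Convert normalized (cx, cy, w, h) boxes to absolute (x1, y1, x2, y2).
        cx, cy, w, h = self.boxes.unbind(-1)
        boxes = torch.stack([(cx - w / 2) * W, (cy - h / 2) * H,
                             (cx + w / 2) * W, (cy + h / 2) * H], dim=-1)
        rois = [boxes for _ in range(B)]  # same learnable boxes for every sample
        # Pool a region feature for each learnable box (visual prior).
        region = roi_align(frame_feat, rois, output_size=1).flatten(1)  # (B*Q, C)
        region = region.view(B, -1, C)                                  # (B, Q, C)
        # Condition the static query embedding on the visual priors ...
        q = self.query_embed.weight.unsqueeze(0) + self.proj(region)
        # ... and on the language, via cross-attention.
        q, _ = self.lang_attn(q, lang_feat, lang_feat)
        return q  # (B, Q, C) content-conditioned queries
```

Under these assumptions, the returned queries would replace the purely learnable queries of a standard Transformer decoder; the cross-modal alignment supervision between phrases and visual regions mentioned in the abstract would enter as an auxiliary training loss and is omitted from this sketch.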
Related papers
- Localizing Events in Videos with Multimodal Queries [71.40602125623668]
We introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries.
We include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains.
arXiv Detail & Related papers (2024-06-14T14:35:58Z) - Query-Dependent Video Representation for Moment Retrieval and Highlight
Detection [8.74967598360817]
The key objective of MR/HD is to localize the moment and estimate the clip-wise accordance level, i.e., the saliency score, for a given text query.
Recent transformer-based models do not fully exploit the information of a given query.
We introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD.
arXiv Detail & Related papers (2023-03-24T09:32:50Z) - Video Referring Expression Comprehension via Transformer with
Content-aware Query [60.89442448993627]
Video Referring Expression (REC) aims to localize a target object in video frames referred by the natural language expression.
We argue that the current query design is suboptimal and suffers from two drawbacks.
We set up a fixed number of learnable bounding boxes across the frame and employ the aligned region features to provide fruitful clues.
arXiv Detail & Related papers (2022-10-06T14:45:41Z) - Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed as DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z) - Object-aware Video-language Pre-training for Retrieval [24.543719616308945]
We present Object-aware Transformers, an object-centric approach that extends video-language transformer to incorporate object representations.
We show clear improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a video-language architecture.
arXiv Detail & Related papers (2021-12-01T17:06:39Z) - CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z) - Fine-grained Iterative Attention Network for Temporal Language
Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.