Knowing Where to Focus: Event-aware Transformer for Video Grounding
- URL: http://arxiv.org/abs/2308.06947v1
- Date: Mon, 14 Aug 2023 05:54:32 GMT
- Title: Knowing Where to Focus: Event-aware Transformer for Video Grounding
- Authors: Jinhyun Jang, Jungin Park, Jin Kim, Hyeongjun Kwon, Kwanghoon Sohn
- Abstract summary: We formulate an event-aware dynamic moment query to enable the model to take the input-specific content and positional information of the video into account.
Experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, outperforming state-of-the-art approaches on several video grounding benchmarks.
- Score: 40.526461893854226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent DETR-based video grounding models have made the model directly predict
moment timestamps without any hand-crafted components, such as a pre-defined
proposal or non-maximum suppression, by learning moment queries. However, their
input-agnostic moment queries inevitably overlook an intrinsic temporal
structure of a video, providing limited positional information. In this paper,
we formulate an event-aware dynamic moment query to enable the model to take
the input-specific content and positional information of the video into
account. To this end, we present two levels of reasoning: 1) Event reasoning
that captures distinctive event units constituting a given video using a slot
attention mechanism; and 2) moment reasoning that fuses the moment queries with
a given sentence through a gated fusion transformer layer and learns
interactions between the moment queries and video-sentence representations to
predict moment timestamps. Extensive experiments demonstrate the effectiveness
and efficiency of the event-aware dynamic moment queries, outperforming
state-of-the-art approaches on several video grounding benchmarks.
Related papers
- On the Consistency of Video Large Language Models in Temporal Comprehension [57.985769348320616]
Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments.
We conduct a study on prediction consistency -- a key indicator for robustness and trustworthiness of temporal grounding.
arXiv Detail & Related papers (2024-11-20T00:47:17Z) - Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
arXiv Detail & Related papers (2024-02-21T06:55:20Z) - Background-aware Moment Detection for Video Moment Retrieval [19.11524416308641]
Video moment retrieval (VMR) identifies a specific moment in an untrimmed video for a given natural language query.
Due to the ambiguity, a query does not fully cover the relevant details of the corresponding moment.
We propose a background-aware moment detection transformer (BM-DETR)
Our model learns to predict the target moment from the joint probability of each frame given the positive query and the complement of negative queries.
arXiv Detail & Related papers (2023-06-05T09:26:33Z) - Query-Dependent Video Representation for Moment Retrieval and Highlight
Detection [8.74967598360817]
Key objective of MR/HD is to localize the moment and estimate clip-wise accordance level, i.e., saliency score, to a given text query.
Recent transformer-based models do not fully exploit the information of a given query.
We introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD.
arXiv Detail & Related papers (2023-03-24T09:32:50Z) - QVHighlights: Detecting Moments and Highlights in Videos via Natural
Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z) - Video Imprint [107.1365846180187]
A new unified video analytics framework (ER3) is proposed for complex event retrieval, recognition and recounting.
The proposed video imprint representation exploits temporal correlations among image features across video frames.
The video imprint is fed into a reasoning network and a feature aggregation module, for event recognition/recounting and event retrieval tasks, respectively.
arXiv Detail & Related papers (2021-06-07T00:32:47Z) - Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z) - Activity Graph Transformer for Temporal Action Localization [41.69734359113706]
We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization.
In this work, we capture this non-linear temporal structure by reasoning over the videos as non-sequential entities in the form of graphs.
Our results show that our proposed model outperforms the state-of-the-art by a considerable margin.
arXiv Detail & Related papers (2021-01-21T10:42:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.