Knowing Where to Focus: Event-aware Transformer for Video Grounding
- URL: http://arxiv.org/abs/2308.06947v1
- Date: Mon, 14 Aug 2023 05:54:32 GMT
- Title: Knowing Where to Focus: Event-aware Transformer for Video Grounding
- Authors: Jinhyun Jang, Jungin Park, Jin Kim, Hyeongjun Kwon, Kwanghoon Sohn
- Abstract summary: We formulate an event-aware dynamic moment query to enable the model to take the input-specific content and positional information of the video into account.
Experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, outperforming state-of-the-art approaches on several video grounding benchmarks.
- Score: 40.526461893854226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent DETR-based video grounding models learn moment queries to predict moment timestamps directly, without hand-crafted components such as pre-defined proposals or non-maximum suppression.
However, their input-agnostic moment queries inevitably overlook the intrinsic temporal structure of a video and provide only limited positional information.
In this paper, we formulate an event-aware dynamic moment query that enables the model to take the input-specific content and positional information of the video into account.
To this end, we present two levels of reasoning: 1) event reasoning, which captures the distinctive event units constituting a given video using a slot attention mechanism; and 2) moment reasoning, which fuses the moment queries with a given sentence through a gated fusion transformer layer and learns interactions between the moment queries and the video-sentence representations to predict moment timestamps.
Extensive experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, outperforming state-of-the-art approaches on several video grounding benchmarks.
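The abstract describes the two reasoning levels only at a high level. Below is a minimal sketch, in PyTorch, of how slot-attention event reasoning and gated query-sentence fusion could be wired together; all module names, shapes, and hyperparameters (e.g., the number of slots and iterations) are our own illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the two reasoning levels described in
# the abstract: slot attention groups frames into event slots, and a gated fusion
# step conditions the resulting moment queries on the sentence representation.
import torch
import torch.nn as nn

class EventSlotAttention(nn.Module):
    """Iteratively binds video frame features to a fixed set of event slots."""
    def __init__(self, dim: int, num_slots: int = 10, iters: int = 3):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim))  # initial slots
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.iters = iters
        self.scale = dim ** -0.5

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, dim) video features; returns (num_slots, dim) event slots.
        slots = self.slots
        k, v = self.to_k(frames), self.to_v(frames)
        for _ in range(self.iters):
            attn = torch.softmax(self.to_q(slots) @ k.t() * self.scale, dim=0)  # slots compete per frame
            attn = attn / attn.sum(dim=1, keepdim=True)   # weighted mean over frames
            updates = attn @ v                            # (num_slots, dim)
            slots = self.gru(updates, slots)              # slot refinement
        return slots

class GatedMomentFusion(nn.Module):
    """Fuses event-derived moment queries with a pooled sentence representation."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, queries: torch.Tensor, sentence: torch.Tensor) -> torch.Tensor:
        # queries: (num_slots, dim), sentence: (dim,)
        s = sentence.expand_as(queries)
        g = torch.sigmoid(self.gate(torch.cat([queries, s], dim=-1)))
        return g * queries + (1 - g) * s                  # gated mixing of query and sentence

# Usage sketch: event slots act as input-specific moment queries,
# conditioned on the sentence before timestamp prediction (decoder omitted).
frames = torch.randn(128, 256)    # 128 frames, 256-d features (illustrative)
sentence = torch.randn(256)       # pooled sentence feature (illustrative)
queries = GatedMomentFusion(256)(EventSlotAttention(256)(frames), sentence)
print(queries.shape)              # torch.Size([10, 256])
```

The intent of the sketch is only to show how event slots can provide input-specific initial queries that are then conditioned on the sentence; the actual decoder that turns these queries into timestamp predictions is omitted.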
Related papers
- Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
arXiv Detail & Related papers (2024-02-21T06:55:20Z)
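As a contrast with the frame-aware retrieval described above, here is a small illustrative sketch (our own, not EventFormer) that scores a video by its best-matching event segment instead of by individual frames; the event boundaries are assumed to be given.

```python
# Illustrative contrast with frame-aware retrieval (not EventFormer's code):
# score a video by its best-matching event segment rather than frame by frame.
import torch
import torch.nn.functional as F

def event_level_score(query: torch.Tensor,                    # (dim,) text query embedding
                      frames: torch.Tensor,                   # (T, dim) frame embeddings
                      event_bounds: list[tuple[int, int]]) -> torch.Tensor:
    """Rank a video by the best cosine similarity over its (given) event segments."""
    events = torch.stack([frames[s:e].mean(dim=0) for s, e in event_bounds])  # (E, dim)
    sims = F.cosine_similarity(events, query.unsqueeze(0), dim=-1)            # (E,)
    return sims.max()  # video-level score used to rank the corpus
```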
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- Background-aware Moment Detection for Video Moment Retrieval [19.11524416308641]
Video moment retrieval (VMR) identifies a specific moment in an untrimmed video for a given natural language query.
Due to the ambiguity of natural language, a query does not fully cover the relevant details of the corresponding moment.
We propose a background-aware moment detection transformer (BM-DETR).
Our model learns to predict the target moment from the joint probability of each frame given the positive query and the complement of negative queries.
arXiv Detail & Related papers (2023-06-05T09:26:33Z)
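One hedged reading of the joint-probability sentence above, written as a sketch rather than BM-DETR's actual formulation: each frame should score high under the positive query and low under every negative query.

```python
# A sketch of the joint-probability idea (our reading, not BM-DETR's code).
import torch

def joint_frame_probability(p_pos: torch.Tensor,                 # (T,)   P(frame | positive query)
                            p_negs: torch.Tensor) -> torch.Tensor:  # (N, T) P(frame | each negative query)
    """Combine the positive probability with the complements of the negatives."""
    return p_pos * torch.prod(1.0 - p_negs, dim=0)                # (T,) joint probability per frame
```

The target moment would then be read off the frames with a high joint probability.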
- Query-Dependent Video Representation for Moment Retrieval and Highlight Detection [8.74967598360817]
The key objective of MR/HD is to localize the moment and estimate the clip-wise accordance level, i.e., the saliency score, for a given text query.
Recent transformer-based models do not fully exploit the information of a given query.
We introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD.
arXiv Detail & Related papers (2023-03-24T09:32:50Z)
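To make the MR/HD outputs above concrete, here is an illustrative query-conditioned saliency head (our sketch, not QD-DETR itself): clip features attend over the text tokens before a linear layer produces clip-wise saliency scores.

```python
# Illustrative query-conditioned saliency head (a sketch, not QD-DETR itself).
import torch
import torch.nn as nn

class QueryConditionedSaliency(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.saliency_head = nn.Linear(dim, 1)

    def forward(self, clips: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # clips: (B, T, dim) clip features, text: (B, L, dim) query token features
        fused, _ = self.cross_attn(query=clips, key=text, value=text)  # query-dependent clip features
        return self.saliency_head(fused).squeeze(-1)                   # (B, T) clip-wise saliency scores
```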
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
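As an illustration of those three annotation types, a single entry might look like the following; field names and values are hypothetical, not the dataset's exact schema.

```python
# Hypothetical QVHighlights-style annotation entry (field names and values are
# illustrative, not the dataset's exact schema).
annotation = {
    "query": "A chef shows how to plate the finished dish",  # (1) free-form NL query
    "relevant_windows": [[30.0, 72.0], [90.0, 108.0]],       # (2) query-relevant moments, in seconds
    "saliency_scores": [3, 4, 4, 2],                         # (3) five-point scores for the relevant clips
}
```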
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- Video Imprint [107.1365846180187]
A new unified video analytics framework (ER3) is proposed for complex event retrieval, recognition and recounting.
The proposed video imprint representation exploits temporal correlations among image features across video frames.
The video imprint is fed into a reasoning network and a feature aggregation module, for event recognition/recounting and event retrieval tasks, respectively.
arXiv Detail & Related papers (2021-06-07T00:32:47Z)
- Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of the query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)
- Activity Graph Transformer for Temporal Action Localization [41.69734359113706]
We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization.
In this work, we capture the non-linear temporal structure of videos by reasoning over them as non-sequential entities in the form of graphs.
Our results show that our proposed model outperforms the state-of-the-art by a considerable margin.
arXiv Detail & Related papers (2021-01-21T10:42:48Z)
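A minimal sketch of the graph-style reasoning mentioned in the last entry (our illustration, not the authors' model): video segments are treated as nodes of a fully connected graph and updated by self-attention without any imposed temporal order.

```python
# Minimal sketch of reasoning over video segments as graph nodes
# (our illustration, not Activity Graph Transformer itself).
import torch
import torch.nn as nn

class SegmentGraphAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (B, N, dim) node features; self-attention over all pairs acts
        # as message passing on a fully connected segment graph, and with no
        # positional encoding the segment order does not matter.
        out, _ = self.attn(segments, segments, segments)
        return out
```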