EVOQUER: Enhancing Temporal Grounding with Video-Pivoted BackQuery Generation
- URL: http://arxiv.org/abs/2109.04600v1
- Date: Fri, 10 Sep 2021 00:30:36 GMT
- Title: EVOQUER: Enhancing Temporal Grounding with Video-Pivoted BackQuery Generation
- Authors: Yanjun Gao, Lulu Liu, Jason Wang, Xin Chen, Huayan Wang, Rui Zhang
- Abstract summary: Temporal grounding aims to predict a time interval of a video clip corresponding to a natural language query input.
We present EVOQUER, a temporal grounding framework incorporating an existing text-to-video grounding model and a video-assisted query generation network.
- Score: 10.799980374791316
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Temporal grounding aims to predict the time interval in a video that
corresponds to a natural language query. In this work, we present EVOQUER, a
temporal grounding framework that combines an existing text-to-video grounding
model with a video-assisted query generation network. Given a query and an
untrimmed video, the temporal grounding model predicts the target interval, and
the predicted clip is then passed to a video translation task that generates a
simplified version of the input query. EVOQUER forms a closed loop by using the
loss functions of both temporal grounding and query generation as feedback. Our
experiments on two widely used datasets, Charades-STA and ActivityNet, show that
EVOQUER achieves promising improvements of 1.05 and 1.31 at R@0.7. We also
discuss how the query generation task can facilitate error analysis by
explaining the behavior of the temporal grounding model.
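To make the reported numbers and the closed-loop objective concrete, the sketch below shows the standard temporal IoU and R@0.7 computation, plus one way the two task losses could be summed during training. It is a minimal illustration under assumptions, not the authors' implementation: the function names and the `query_loss_weight` knob are hypothetical, while tIoU and R@0.7 follow their standard definitions.

```python
# Illustrative sketch only: the closed-loop loss combination and loss weight are
# assumptions, not code from the EVOQUER paper; tIoU / R@0.7 are standard metrics.
import torch


def closed_loop_loss(grounding_loss: torch.Tensor,
                     query_generation_loss: torch.Tensor,
                     query_loss_weight: float = 1.0) -> torch.Tensor:
    """Combine the temporal-grounding loss with the query-generation loss so the
    quality of the query reconstructed from the predicted clip acts as feedback."""
    return grounding_loss + query_loss_weight * query_generation_loss


def temporal_iou(pred: tuple, gt: tuple) -> float:
    """Temporal IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(preds: list, gts: list, threshold: float = 0.7) -> float:
    """R@threshold: fraction of queries whose prediction reaches the tIoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Toy example: one of two predictions clears tIoU >= 0.7, so R@0.7 = 0.5.
    print(recall_at([(2.0, 9.0), (0.0, 4.0)], [(3.0, 10.0), (6.0, 12.0)]))
    # Toy stand-ins for the two task losses during one training step.
    print(closed_loop_loss(torch.tensor(0.8), torch.tensor(0.3), query_loss_weight=0.5))
```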
Related papers
- Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval [56.05621657583251]
Cross-modal retrieval (e.g., image-text, video-text) is an important task in information retrieval and multimodal vision-language understanding.
We introduce RTime, a novel temporal-emphasized video-text retrieval dataset.
Our RTime dataset currently consists of 21k videos with 10 captions per video, totalling about 122 hours.
arXiv Detail & Related papers (2024-12-26T11:32:00Z)
- TimeRefine: Temporal Grounding with Time Refining Video LLM [75.99665302872901]
Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt.
We reformulate the temporal grounding task as a temporal refining task.
We incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth.
arXiv Detail & Related papers (2024-12-12T18:59:11Z)
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models [53.235170710385006]
We introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner.
We sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge.
In experiments, Grounded-VideoLLM excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA.
arXiv Detail & Related papers (2024-10-04T10:04:37Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", built with dedicated components on top of a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Knowing Where to Focus: Event-aware Transformer for Video Grounding [40.526461893854226]
We formulate an event-aware dynamic moment query to enable the model to take the input-specific content and positional information of the video into account.
Experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, outperforming state-of-the-art approaches on several video grounding benchmarks.
arXiv Detail & Related papers (2023-08-14T05:54:32Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- End-to-End Dense Video Grounding via Parallel Regression [30.984657885692553]
Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query.
We present an end-to-end parallel decoding paradigm by re-purposing a Transformer-like architecture (PRVG).
Thanks to its simplicity in design, our PRVG framework can be applied in different testing schemes.
arXiv Detail & Related papers (2021-09-23T10:03:32Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.