UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection
- URL: http://arxiv.org/abs/2203.12745v1
- Date: Wed, 23 Mar 2022 22:11:43 GMT
- Title: UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection
- Authors: Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, Xiaohu Qie
- Abstract summary: We present the first unified framework, named Unified Multi-modal Transformers (UMT).
UMT is capable of realizing such joint optimization while it can also be easily degenerated to solve the individual problems.
As far as we are aware, this is the first scheme to integrate multi-modal (visual-audio) learning for either joint optimization or the individual moment retrieval task.
- Score: 46.25856560381347
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Finding relevant moments and highlights in videos according to natural
language queries is a natural and highly valuable common need in the current
video content explosion era. Nevertheless, jointly conducting moment retrieval
and highlight detection is an emerging research topic, even though its
component problems and some related tasks have already been studied for a
while. In this paper, we present the first unified framework, named Unified
Multi-modal Transformers (UMT), capable of realizing such joint optimization
while it can also be easily degenerated for solving individual problems. As far as
we are aware, this is the first scheme to integrate multi-modal (visual-audio)
learning for either joint optimization or the individual moment retrieval task,
and it tackles moment retrieval as a keypoint detection problem using a novel
query generator and query decoder. Extensive comparisons with existing methods
and ablation studies on QVHighlights, Charades-STA, YouTube Highlights, and
TVSum datasets demonstrate the effectiveness, superiority, and flexibility of
the proposed method under various settings. Source code and pre-trained models
are available at https://github.com/TencentARC/UMT.
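To make the described design more concrete, below is a minimal, hypothetical PyTorch sketch of a UMT-style model: uni-modal visual/audio encoders, a cross-modal encoder, and a query decoder whose outputs feed a per-clip saliency head (highlight detection) and a keypoint-style (center, width) head (moment retrieval). The class name UMTSketch, the layer sizes, and the use of learned query embeddings are illustrative assumptions, not the released TencentARC/UMT implementation; in the paper, the query generator conditions the queries on the text query, which is omitted here.

```python
# Hypothetical sketch of a UMT-style joint model; shapes and names are assumptions.
import torch
import torch.nn as nn


class UMTSketch(nn.Module):
    def __init__(self, visual_dim=2048, audio_dim=128, d_model=256,
                 num_queries=10, n_heads=8, n_layers=4):
        super().__init__()
        # Project per-clip visual and audio features into a shared space.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)

        # Uni-modal encoders, then a cross-modal encoder over the
        # concatenated visual/audio token sequence.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.visual_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.audio_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.cross_encoder = nn.TransformerEncoder(enc_layer, n_layers)

        # Query generator stand-in: learned embeddings (the paper derives
        # queries from the natural language query instead).
        self.query_embed = nn.Embedding(num_queries, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.query_decoder = nn.TransformerDecoder(dec_layer, n_layers)

        # Heads: per-clip saliency for highlight detection, per-query
        # (center, width) keypoints for moment retrieval.
        self.saliency_head = nn.Linear(d_model, 1)
        self.moment_head = nn.Linear(d_model, 2)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (B, T, visual_dim); audio_feats: (B, T, audio_dim)
        v = self.visual_encoder(self.visual_proj(visual_feats))
        a = self.audio_encoder(self.audio_proj(audio_feats))
        memory = self.cross_encoder(torch.cat([v, a], dim=1))  # (B, 2T, d)

        # Highlight detection: clip-level saliency scores over the visual tokens.
        saliency = self.saliency_head(memory[:, : v.size(1)]).squeeze(-1)

        # Moment retrieval: decode the queries against the fused memory and
        # regress normalized (center, width) pairs.
        batch = visual_feats.size(0)
        queries = self.query_embed.weight.unsqueeze(0).expand(batch, -1, -1)
        decoded = self.query_decoder(queries, memory)
        moments = self.moment_head(decoded).sigmoid()  # (B, num_queries, 2)
        return saliency, moments


if __name__ == "__main__":
    model = UMTSketch()
    sal, mom = model(torch.randn(2, 32, 2048), torch.randn(2, 32, 128))
    print(sal.shape, mom.shape)  # torch.Size([2, 32]) torch.Size([2, 10, 2])
```

Dropping the audio branch (or the text conditioning) recovers the "degenerated" single-task variants mentioned in the abstract; the released code should be consulted for the actual query generator and training losses.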
Related papers
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z) - Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z) - Joint Moment Retrieval and Highlight Detection Via Natural Language
Queries [0.0]
We propose a new method for natural language query based joint video summarization and highlight detection.
This approach will use both visual and audio cues to match a user's natural language query to retrieve the most relevant and interesting moments from a video.
Our approach employs multiple recent techniques used in Vision Transformers (ViTs) to create a transformer-like encoder-decoder model.
arXiv Detail & Related papers (2023-05-08T18:00:33Z) - TubeDETR: Spatio-Temporal Video Grounding with Transformers [89.71617065426146]
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection.
arXiv Detail & Related papers (2022-03-30T16:31:49Z) - CONQUER: Contextual Query-aware Ranking for Video Corpus Moment
Retrieval [24.649068267308913]
Video retrieval applications should enable users to retrieve a precise moment from a large video corpus.
We propose a novel model for effective moment localization and ranking.
We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos.
arXiv Detail & Related papers (2021-09-21T08:07:27Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims at retrieving a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z) - Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z) - Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video
Parsing [48.87278703876147]
A new problem, named audio-visual video parsing, aims to parse a video into temporal event segments and label them as audible, visible, or both.
We propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously.
Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels.
arXiv Detail & Related papers (2020-07-21T01:53:31Z)