Joint Moment Retrieval and Highlight Detection Via Natural Language Queries
- URL: http://arxiv.org/abs/2305.04961v1
- Date: Mon, 8 May 2023 18:00:33 GMT
- Title: Joint Moment Retrieval and Highlight Detection Via Natural Language Queries
- Authors: Richard Luo, Austin Peng, Heidi Yap and Koby Beard
- Abstract summary: We propose a new method for natural language query-based joint video summarization and highlight detection.
This approach uses both visual and audio cues to match a user's natural language query and retrieve the most relevant and interesting moments from a video.
Our approach employs multiple recent techniques used in Vision Transformers (ViTs) to create a transformer-like encoder-decoder model.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video summarization has become an increasingly important task in the field of
computer vision due to the vast amount of video content available on the
internet. In this project, we propose a new method for natural language
query-based joint video summarization and highlight detection using
multi-modal transformers. This approach uses both visual and audio cues to
match a user's natural language query and retrieve the most relevant and
interesting
moments from a video. Our approach employs multiple recent techniques used in
Vision Transformers (ViTs) to create a transformer-like encoder-decoder model.
We evaluated our approach on multiple datasets such as YouTube Highlights and
TVSum to demonstrate the flexibility of our proposed method.
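As a rough illustration of the kind of architecture the abstract describes, the PyTorch sketch below fuses per-clip visual and audio features, conditions on a text query, and emits per-clip highlight scores plus a moment span. All dimensions, layer counts, and head designs here are assumptions for illustration, not the authors' actual implementation.
```python
import torch
import torch.nn as nn

class QueryMomentModel(nn.Module):
    """Illustrative encoder-decoder: fuses per-clip visual and audio
    features, conditions on a text query, and predicts a per-clip
    highlight (saliency) score plus moment span offsets."""

    def __init__(self, vis_dim=768, aud_dim=128, txt_dim=512, d_model=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.saliency_head = nn.Linear(d_model, 1)   # highlight detection
        self.span_head = nn.Linear(d_model, 2)       # moment (center, width)

    def forward(self, vis, aud, txt):
        # vis: (B, T, vis_dim), aud: (B, T, aud_dim), txt: (B, L, txt_dim)
        clip = self.vis_proj(vis) + self.aud_proj(aud)  # fuse modalities per clip
        memory = self.encoder(clip)
        query = self.decoder(self.txt_proj(txt), memory)
        pooled = query.mean(dim=1, keepdim=True)
        saliency = self.saliency_head(memory).squeeze(-1)   # (B, T)
        span = self.span_head(pooled).sigmoid().squeeze(1)  # (B, 2), normalized
        return saliency, span
```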
Related papers
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
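A toy sketch of the MoE-style combination described above: hypothetical expert summary embeddings are weighted by a learned gate against a video-level embedding. The bilinear gate and all dimensions are placeholders, not the paper's components.
```python
import torch
import torch.nn as nn

class SummaryGate(nn.Module):
    """Toy MoE-style gate: scores each expert's summary embedding
    against a video-level embedding and mixes them softly."""

    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Bilinear(dim, dim, 1)

    def forward(self, video_emb, expert_embs):
        # video_emb: (dim,), expert_embs: (num_experts, dim)
        v = video_emb.expand_as(expert_embs)  # broadcast video context
        weights = torch.softmax(self.score(v, expert_embs).squeeze(-1), dim=0)
        return weights  # e.g. keep expert_embs[weights.argmax()] or mix softly
```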
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
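The CLIP-score-guided sampling mentioned above might look like the sketch below, which ranks candidate frames against the text with a public CLIP checkpoint and keeps the top k in temporal order; it is illustrative, not VaQuitA's exact pipeline.
```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Public checkpoint used for illustration only.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def top_k_frames(frames, text, k=8):
    """Rank candidate frames by CLIP image-text similarity and keep
    the k best, instead of sampling uniformly over time.
    frames: list of PIL.Image, text: the query/caption string."""
    inputs = processor(text=[text], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # (num_frames,)
    keep = scores.topk(min(k, len(frames))).indices.sort().values
    return [frames[i] for i in keep]  # preserve temporal order
```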
- AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries.
Most previous work on detecting AI-generated fake videos utilizes only the visual or audio modality.
We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
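A minimal sketch of an audio-visual ensemble in the spirit of AVTENet: separate audio, visual, and fused experts vote on a real-vs-fake score. The expert internals here are linear placeholders standing in for the paper's transformer experts.
```python
import torch
import torch.nn as nn

class AVEnsemble(nn.Module):
    """Toy audio-visual ensemble: independent audio, visual, and
    fused experts each emit a real/fake logit; the final decision
    averages them."""

    def __init__(self, aud_dim=128, vis_dim=768):
        super().__init__()
        self.audio_expert = nn.Linear(aud_dim, 1)
        self.visual_expert = nn.Linear(vis_dim, 1)
        self.av_expert = nn.Linear(aud_dim + vis_dim, 1)

    def forward(self, aud, vis):
        # aud: (B, aud_dim) and vis: (B, vis_dim) are clip-level features
        logits = torch.stack([
            self.audio_expert(aud),
            self.visual_expert(vis),
            self.av_expert(torch.cat([aud, vis], dim=-1)),
        ], dim=0)
        return logits.mean(dim=0).squeeze(-1)  # fake-vs-real score per clip
```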
- UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection [46.25856560381347]
We present the first unified framework, named Unified Multi-modal Transformers (UMT).
UMT is capable of realizing such joint optimization, while it can also be easily degenerated for solving individual problems.
As far as we are aware, this is the first scheme to integrate multi-modal (visual-audio) learning for either joint optimization or the individual moment retrieval task.
arXiv Detail & Related papers (2022-03-23T22:11:43Z)
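UMT's joint optimization can be pictured as a weighted sum of a highlight detection term and a moment retrieval term; the sketch below is a generic stand-in for the paper's actual losses, and zeroing one weight "degenerates" the objective to a single task.
```python
import torch
import torch.nn.functional as F

def joint_loss(saliency_pred, saliency_gt, span_pred, span_gt,
               w_hd=1.0, w_mr=1.0):
    """Illustrative joint objective: a highlight detection (saliency)
    term plus a moment retrieval (span) term. Setting w_mr=0 or
    w_hd=0 reduces it to a single-task objective."""
    hd = F.binary_cross_entropy_with_logits(saliency_pred, saliency_gt)
    mr = F.l1_loss(span_pred, span_gt)
    return w_hd * hd + w_mr * mr
```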
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released at https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformers and the natural structure of video (frame-shot-video), a hierarchical multimodal transformer (HMT) is developed for video summarization.
The two kinds of information are encoded in a two-stream scheme, and a multimodal fusion mechanism based on the hierarchical transformer is developed to integrate them.
Extensive experiments show that HMT surpasses most traditional, RNN-based, and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
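A toy version of the frame-shot-video hierarchy: one transformer pools the frames within each shot, a second attends across shots. Dimensions and mean-pooling are illustrative assumptions, not HMT's exact design.
```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Toy frame-shot-video hierarchy: one transformer summarizes the
    frames inside each shot, a second operates across shots."""

    def __init__(self, dim=256, nhead=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.frame_enc = nn.TransformerEncoder(layer(), num_layers=2)
        self.shot_enc = nn.TransformerEncoder(layer(), num_layers=2)

    def forward(self, frames):
        # frames: (B, num_shots, frames_per_shot, dim)
        b, s, f, d = frames.shape
        shot_tokens = self.frame_enc(frames.reshape(b * s, f, d)).mean(dim=1)
        return self.shot_enc(shot_tokens.reshape(b, s, d))  # (B, num_shots, dim)
```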
- GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization [18.543372365239673]
The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator.
Results show that the proposed model is effective, with increases of +5.88% in accuracy and +4.06% in F1-score compared with the state-of-the-art method.
arXiv Detail & Related papers (2021-04-26T10:50:37Z)
- Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z)
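Joint video-text embeddings for retrieval are commonly trained with a symmetric contrastive loss like the sketch below; this is a standard recipe, not necessarily the paper's exact objective.
```python
import torch
import torch.nn.functional as F

def retrieval_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive objective for joint video-text
    embeddings; matching pairs share the same batch index."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T / temperature                  # (B, B) similarity matrix
    target = torch.arange(sim.size(0), device=sim.device)
    return (F.cross_entropy(sim, target) +       # video -> text
            F.cross_entropy(sim.T, target)) / 2  # text -> video
```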
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
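For a three-stream setup like the one above, a common pattern is to project each modality into a shared space and apply a pairwise contrastive loss (such as retrieval_loss above) to the visual-audio and visual-text pairs; the projector below is a minimal assumed sketch, not the paper's architecture.
```python
import torch.nn as nn

class TriModalProjector(nn.Module):
    """Toy three-stream setup: project visual, audio, and text
    features into one shared space; training would pair these
    outputs with pairwise contrastive losses."""

    def __init__(self, vis_dim=2048, aud_dim=512, txt_dim=300, dim=256):
        super().__init__()
        self.vis = nn.Linear(vis_dim, dim)
        self.aud = nn.Linear(aud_dim, dim)
        self.txt = nn.Linear(txt_dim, dim)

    def forward(self, vis, aud, txt):
        return self.vis(vis), self.aud(aud), self.txt(txt)
```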
This list is automatically generated from the titles and abstracts of the papers on this site.