A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus
- URL: http://arxiv.org/abs/2011.09046v2
- Date: Tue, 24 Nov 2020 04:11:13 GMT
- Title: A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus
- Authors: Bowen Zhang, Hexiang Hu, Joonseok Lee, Ming Zhao, Sheide Chammas,
Vihan Jain, Eugene Ie, Fei Sha
- Abstract summary: We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus using the ActivityNet Captions and TVR datasets.
- Score: 31.387948069111893
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Identifying a short segment in a long video that semantically matches a text
query is a challenging task that has important application potential in
language-based video search, browsing, and navigation. Typical retrieval
systems respond to a query with either a whole video or a pre-defined video
segment, but it is challenging to localize undefined segments in untrimmed and
unsegmented videos where exhaustively searching over all possible segments is
intractable. The outstanding challenge is that the representation of a video
must account for different levels of granularity in the temporal domain. To
tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER)
that encodes a video at both the coarse-grained clip level and the fine-grained
frame level to extract information at different scales based on multiple
subtasks, namely, video retrieval, segment temporal localization, and masked
language modeling. We conduct extensive experiments to evaluate our model on
moment localization in video corpus using the ActivityNet Captions and TVR datasets.
Our approach outperforms the previous methods as well as strong baselines,
establishing new state-of-the-art for this task.
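As a concrete illustration of the hierarchical idea, the sketch below shows a two-level frame-then-clip encoder with separate heads for the three subtasks named in the abstract (video retrieval, segment temporal localization, masked language modeling). It is a minimal sketch based only on the abstract: the module names, dimensions, mean-pooling into clip tokens, and the exact form of each head are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a two-level (frame -> clip) hierarchical video encoder
# with multi-task heads, loosely following the ideas in the abstract above.
# Module names, dimensions, and the mean-pooling scheme are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class HierarchicalVideoEncoder(nn.Module):
    def __init__(self, frame_dim=1024, hidden_dim=512, vocab_size=30522,
                 frames_per_clip=16, num_layers=2, num_heads=8):
        super().__init__()
        self.frames_per_clip = frames_per_clip
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        # Fine-grained encoder: contextualizes frames within each clip.
        self.frame_encoder = nn.TransformerEncoder(layer, num_layers)
        # Coarse-grained encoder: contextualizes the sequence of clip tokens.
        self.clip_encoder = nn.TransformerEncoder(layer, num_layers)
        # Heads for the three subtasks named in the abstract (forms assumed).
        self.retrieval_head = nn.Linear(hidden_dim, hidden_dim)  # video-level embedding
        self.span_head = nn.Linear(hidden_dim, 2)                # start/end logits per clip
        self.mlm_head = nn.Linear(hidden_dim, vocab_size)        # would score masked query tokens

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, frame_dim), num_frames divisible by frames_per_clip
        b, t, _ = frame_feats.shape
        c = t // self.frames_per_clip
        x = self.frame_proj(frame_feats).view(b * c, self.frames_per_clip, -1)
        frame_ctx = self.frame_encoder(x)                    # fine-grained frame states
        clip_tokens = frame_ctx.mean(dim=1).view(b, c, -1)   # pool frames into clip tokens
        clip_ctx = self.clip_encoder(clip_tokens)            # coarse-grained clip states
        video_emb = self.retrieval_head(clip_ctx.mean(dim=1))  # for video retrieval
        span_logits = self.span_head(clip_ctx)               # for segment temporal localization
        return video_emb, span_logits, frame_ctx.view(b, t, -1)


# Usage with random features standing in for pre-extracted frame features.
encoder = HierarchicalVideoEncoder()
video_emb, span_logits, frame_states = encoder(torch.randn(2, 64, 1024))
print(video_emb.shape, span_logits.shape, frame_states.shape)  # (2, 512) (2, 4, 2) (2, 64, 512)
```

In this layout the clip-level states support retrieval and localization at coarse granularity while the frame-level states retain fine-grained detail; the actual HAMMER training objectives and cross-modal fusion with the text query are not reproduced here.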
Related papers
- One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos [41.34787907803329]
VideoLISA is a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos.
VideoLISA generates temporally consistent segmentation masks in videos based on language instructions.
arXiv Detail & Related papers (2024-09-29T07:47:15Z)
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
- Localizing Events in Videos with Multimodal Queries [71.40602125623668]
We introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries.
We include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains.
arXiv Detail & Related papers (2024-06-14T14:35:58Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- Text-based Localization of Moments in a Video Corpus [38.393877654679414]
We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose Hierarchical Moment Alignment Network (HMAN) which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.
arXiv Detail & Related papers (2020-08-20T00:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.