Video Moment Localization using Object Evidence and Reverse Captioning
- URL: http://arxiv.org/abs/2006.10260v1
- Date: Thu, 18 Jun 2020 03:45:49 GMT
- Title: Video Moment Localization using Object Evidence and Reverse Captioning
- Authors: Madhawa Vidanapathirana, Supriya Pandhre, Sonia Raychaudhuri, Anjali
Khurana
- Abstract summary: We address the problem of language-based temporal localization of moments in untrimmed videos.
Current state-of-the-art model MAC addresses it by mining activity concepts from both video and language modalities.
We propose "Multi-faceted VideoMoment Localizer" (MML), an extension of MAC model by the introduction of visual object evidence.
- Score: 1.1549572298362785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the problem of language-based temporal localization of moments in
untrimmed videos. Compared to temporal localization with fixed categories, this
problem is more challenging as the language-based queries have no predefined
activity classes and may also contain complex descriptions. Current
state-of-the-art model MAC addresses it by mining activity concepts from both
video and language modalities. This method encodes the semantic activity
concepts from the verb/object pair in a language query and leverages visual
activity concepts from video activity classification prediction scores. We
propose "Multi-faceted VideoMoment Localizer" (MML), an extension of MAC model
by the introduction of visual object evidence via object segmentation masks and
video understanding features via video captioning. Furthermore, we improve
language modelling in sentence embedding. We experimented on Charades-STA
dataset and identified that MML outperforms MAC baseline by 4.93% and 1.70% on
R@1 and R@5metrics respectively. Our code and pre-trained model are publicly
available at https://github.com/madhawav/MML.
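The reported gains refer to the standard Recall@k evaluation for temporal moment localization: a query counts as localized if any of the top-k predicted segments overlaps the ground-truth moment above a temporal IoU threshold. The sketch below is a minimal illustration of that metric, not the released MML evaluation code; the 0.5 tIoU threshold and all function names are assumptions.

```python
# Minimal sketch of Recall@k with temporal IoU (tIoU) for moment localization.
# Illustrative only: the 0.5 tIoU threshold and all names are assumptions,
# not the released MML/MAC evaluation code.
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_preds: List[List[Segment]],
                ground_truths: List[Segment],
                k: int = 1,
                tiou_threshold: float = 0.5) -> float:
    """Fraction of queries whose top-k predictions contain at least one
    segment with tIoU >= threshold against the ground-truth moment."""
    hits = sum(
        any(temporal_iou(p, gt) >= tiou_threshold for p in preds[:k])
        for preds, gt in zip(ranked_preds, ground_truths)
    )
    return hits / len(ground_truths)

if __name__ == "__main__":
    # One query with two ranked candidate segments (hypothetical numbers).
    preds = [[(10.0, 22.0), (3.0, 8.0)]]
    gts = [(12.0, 24.0)]
    print(recall_at_k(preds, gts, k=1))  # 1.0: the top-1 segment has tIoU ~0.71
```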
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos [41.34787907803329]
VideoLISA is a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos.
VideoLISA generates temporally consistent segmentation masks in videos based on language instructions.
arXiv Detail & Related papers (2024-09-29T07:47:15Z)
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
- OSCaR: Object State Captioning and State Change Representation [52.13461424520107]
This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
arXiv Detail & Related papers (2024-02-27T01:48:19Z)
- Meta-Personalizing Vision-Language Models to Find Named Instances in Video [30.63415402318075]
Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications.
They currently struggle with personalized searches for moments in a video where a specific object instance such as "My dog Biscuit" appears.
We present a method to meta-personalize a pre-trained VLM, i.e., learning how to learn to personalize a VLM at test time to search in video.
arXiv Detail & Related papers (2023-06-16T20:12:11Z)
- Self-Chained Image-Language Model for Video Localization and Question Answering [66.86740990630433]
We propose the Self-Chained Video Localization-Answering (SeViLA) framework to tackle both temporal localization and question answering (QA) on videos.
SeViLA framework consists of two modules: Localizer and Answerer, where both are parameter-efficiently fine-tuned from BLIP-2.
arXiv Detail & Related papers (2023-05-11T17:23:00Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, "ApartmenTour", that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)