Learning to Locate Visual Answer in Video Corpus Using Question
- URL: http://arxiv.org/abs/2210.05423v2
- Date: Thu, 13 Oct 2022 15:48:51 GMT
- Title: Learning to Locate Visual Answer in Video Corpus Using Question
- Authors: Bin Li, Yixuan Weng, Bin Sun and Shutao Li
- Abstract summary: We introduce a new task, named video corpus visual answer localization (VCVAL), which aims to locate the visual answer in instructional videos.
We propose a cross-modal contrastive global-span (CCGS) method for the VCVAL task, jointly training the video corpus retrieval and visual answer localization subtasks.
Experimental results show that the proposed method outperforms other competitive methods in both the video corpus retrieval and visual answer localization subtasks.
- Score: 21.88924465126168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new task, named video corpus visual answer localization
(VCVAL), which aims to locate the visual answer in a large collection of
untrimmed, unsegmented instructional videos using a natural language question.
This task requires a range of skills: vision-language interaction, video
retrieval, passage comprehension, and visual answer localization. In this
paper, we propose a cross-modal contrastive global-span (CCGS) method for the
VCVAL task, jointly training the video corpus retrieval and visual answer
localization subtasks. More precisely, we first enhance the video
question-answer semantics by adding element-wise visual information into the
pre-trained language model, and then design a novel global-span predictor over
the fused information to locate the visual answer points. Global-span
contrastive learning is adopted to rank the span points of positive and
negative samples using the global-span matrix. We have reconstructed a dataset
named MedVidCQA, on which the VCVAL task is benchmarked. Experimental results
show that the proposed method outperforms other competitive methods in both the
video corpus retrieval and visual answer localization subtasks. Most
importantly, we perform detailed analyses of extensive experiments, paving a
new path for understanding instructional videos and opening avenues for further
research.
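To make the approach concrete, below is a minimal PyTorch sketch of the global-span idea described in the abstract: a span head predicts start and end logits over the fused question-video sequence, an outer sum of those logits forms a global-span matrix, and a margin-based contrastive term ranks the answer-bearing video's best span above the best spans of negative videos, alongside a standard span cross-entropy. The class and function names (GlobalSpanHead, span_score, ccgs_style_loss), the margin formulation, and the batch convention are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch only: a global-span head plus a contrastive ranking loss,
# loosely following the description in the abstract (not the official CCGS code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalSpanHead(nn.Module):
    """Predicts start/end logits over the fused question-video token sequence."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.start = nn.Linear(hidden_dim, 1)
        self.end = nn.Linear(hidden_dim, 1)

    def forward(self, fused: torch.Tensor):
        # fused: (batch, seq_len, hidden_dim), e.g. PLM outputs with
        # element-wise visual information already added in.
        start_logits = self.start(fused).squeeze(-1)  # (batch, seq_len)
        end_logits = self.end(fused).squeeze(-1)      # (batch, seq_len)
        return start_logits, end_logits


def span_score(start_logits: torch.Tensor, end_logits: torch.Tensor) -> torch.Tensor:
    """Best start+end score per video, used here as a global-span ranking score."""
    # Outer sum of start/end logits gives a (seq, seq) span matrix; keep only
    # spans with end >= start, then take the maximum entry per video.
    matrix = start_logits.unsqueeze(2) + end_logits.unsqueeze(1)  # (batch, seq, seq)
    mask = torch.triu(torch.ones_like(matrix[0], dtype=torch.bool))
    matrix = matrix.masked_fill(~mask, float("-inf"))
    return matrix.flatten(1).max(dim=1).values  # (batch,)


def ccgs_style_loss(start_logits, end_logits, gt_start, gt_end, margin: float = 1.0):
    """Span cross-entropy on the positive video plus margin ranking against negatives.

    Assumed convention: index 0 in the batch is the positive (answer-bearing)
    video; the remaining indices are negatives retrieved from the corpus.
    """
    span_ce = F.cross_entropy(start_logits[:1], gt_start) + \
              F.cross_entropy(end_logits[:1], gt_end)
    scores = span_score(start_logits, end_logits)          # (batch,)
    ranking = F.relu(margin + scores[1:] - scores[0]).mean()
    return span_ce + ranking


if __name__ == "__main__":
    torch.manual_seed(0)
    batch, seq_len, hidden = 4, 32, 768          # 1 positive + 3 negative videos
    fused = torch.randn(batch, seq_len, hidden)  # stand-in for fused PLM features
    head = GlobalSpanHead(hidden)
    start_logits, end_logits = head(fused)
    loss = ccgs_style_loss(start_logits, end_logits,
                           gt_start=torch.tensor([5]), gt_end=torch.tensor([12]))
    print(f"toy loss: {loss.item():.4f}")
```

In this sketch, random tensors stand in for the fused features; in the paper's setting they would come from a pre-trained language model whose token representations carry element-wise visual information.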
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
Current plain and simple text descriptions, combined with a visual-only focus, limit the capacity of language-video models in real-world natural language video retrieval tasks.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z) - ViLCo-Bench: VIdeo Language COntinual learning Benchmark [8.660555226687098]
We present ViLCo-Bench, designed to evaluate continual learning models across a range of video-text tasks.
The dataset comprises ten-minute-long videos and corresponding language queries collected from publicly available datasets.
We introduce a novel memory-efficient framework that incorporates self-supervised learning and mimics long-term and short-term memory effects.
arXiv Detail & Related papers (2024-06-19T00:38:19Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (see the sketch after this list).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z) - Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN) for video captioning.
Our method reaches state-of-the-art performance, indicating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z) - Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z) - Text-based Localization of Moments in a Video Corpus [38.393877654679414]
We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose Hierarchical Moment Alignment Network (HMAN) which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.
arXiv Detail & Related papers (2020-08-20T00:05:45Z) - Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of a successful external language model (ELM), integrating its abundant linguistic knowledge into the captioning model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
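The VaQuitA entry above mentions replacing uniform frame sampling with CLIP-score-guided selection. The snippet below is a small, hypothetical sketch of that general idea using the Hugging Face CLIP checkpoint openai/clip-vit-base-patch32: score every candidate frame against the text query and keep the top-k frames. The helper name select_frames, the checkpoint choice, and the dummy frames are assumptions for illustration, not VaQuitA's actual pipeline.

```python
# Hypothetical sketch of CLIP-score-guided frame selection (not VaQuitA's code):
# score each candidate frame against the text query with CLIP, keep the top-k.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # any CLIP checkpoint would do
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)


def select_frames(frames: list, query: str, top_k: int = 8) -> list:
    """Return indices of the top_k frames most similar to the query under CLIP."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: (num_frames, 1) image-text similarity scores
    scores = outputs.logits_per_image.squeeze(-1)
    top_k = min(top_k, len(frames))
    return scores.topk(top_k).indices.tolist()


if __name__ == "__main__":
    # Stand-in "frames"; in practice these would be decoded from the video.
    dummy_frames = [Image.new("RGB", (224, 224), color=(i * 10 % 255, 0, 0))
                    for i in range(16)]
    keep = select_frames(dummy_frames, "how do I apply a tourniquet?", top_k=4)
    print("selected frame indices:", keep)
```

In practice the candidate frames would first be decoded at a coarse stride, so the CLIP scoring pass stays cheap relative to the downstream video-language model.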