Visual Answer Localization with Cross-modal Mutual Knowledge Transfer
- URL: http://arxiv.org/abs/2210.14823v3
- Date: Fri, 28 Oct 2022 08:42:01 GMT
- Title: Visual Answer Localization with Cross-modal Mutual Knowledge Transfer
- Authors: Yixuan Weng and Bin Li
- Abstract summary: We propose a cross-modal mutual knowledge transfer span localization (MutualSL) method to reduce the knowledge deviation.
On this basis, we design a one-way dynamic loss function to dynamically adjust the proportion of knowledge transfer.
Our method outperforms other competitive state-of-the-art (SOTA) methods, demonstrating its effectiveness.
- Score: 6.895321502252051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of visual answering localization (VAL) in the video is to obtain a
relevant and concise time clip from a video as the answer to the given natural
language question. Early methods are based on the interaction modelling between
video and text to predict the visual answer by the visual predictor. Later,
using the textual predictor with subtitles for the VAL proves to be more
precise. However, these existing methods still have cross-modal knowledge
deviations from visual frames or textual subtitles. In this paper, we propose a
cross-modal mutual knowledge transfer span localization (MutualSL) method to
reduce the knowledge deviation. MutualSL has both visual predictor and textual
predictor, where we expect the prediction results of these both to be
consistent, so as to promote semantic knowledge understanding between
cross-modalities. On this basis, we design a one-way dynamic loss function to
dynamically adjust the proportion of knowledge transfer. We have conducted
extensive experiments on three public datasets for evaluation. The experimental
results show that our method outperforms other competitive state-of-the-art
(SOTA) methods, demonstrating its effectiveness.
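To make the one-way dynamic transfer concrete, the following is a minimal PyTorch-style sketch of such a consistency objective, assuming both predictors emit start/end logits over the same aligned time axis. The function name `one_way_dynamic_loss`, the sigmoid weighting schedule, and the KL-based transfer term are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumed, not the paper's released code) of a one-way dynamic
# consistency loss between a visual and a textual span predictor.
import torch
import torch.nn.functional as F

def one_way_dynamic_loss(visual_logits, textual_logits, start_idx, end_idx):
    """visual_logits / textual_logits: (2, T) start/end logits over T aligned steps.
    start_idx / end_idx: ground-truth span boundaries as 0-dim long tensors."""
    targets = torch.stack([start_idx, end_idx])               # shape (2,)

    # Supervised span losses for each predictor.
    loss_v = F.cross_entropy(visual_logits, targets)
    loss_t = F.cross_entropy(textual_logits, targets)

    # Dynamic weight: assumed to grow with the gap between the two predictors.
    with torch.no_grad():
        weight = torch.sigmoid((loss_v - loss_t).abs())

    # One-way transfer: the weaker predictor is pulled toward the stronger one,
    # whose distribution is detached so knowledge flows in one direction only.
    if loss_v > loss_t:    # textual predictor teaches the visual predictor
        kd = F.kl_div(F.log_softmax(visual_logits, dim=-1),
                      F.softmax(textual_logits.detach(), dim=-1),
                      reduction="batchmean")
    else:                  # visual predictor teaches the textual predictor
        kd = F.kl_div(F.log_softmax(textual_logits, dim=-1),
                      F.softmax(visual_logits.detach(), dim=-1),
                      reduction="batchmean")

    return loss_v + loss_t + weight * kd

# Example call with random logits over a 200-step video.
if __name__ == "__main__":
    v = torch.randn(2, 200, requires_grad=True)
    t = torch.randn(2, 200, requires_grad=True)
    loss = one_way_dynamic_loss(v, t, torch.tensor(40), torch.tensor(95))
    loss.backward()
```

The design choice sketched here is that knowledge is transferred only from the currently stronger predictor to the weaker one, with the transfer weight growing as their performance gap widens; the paper's actual loss may differ in form.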
Related papers
- Multi-Modal interpretable automatic video captioning [1.9874264019909988]
We introduce a novel video captioning method trained with a multi-modal contrastive loss.
Our approach is designed to capture the dependencies between modalities, resulting in more accurate and pertinent captions.
arXiv Detail & Related papers (2024-11-11T11:12:23Z) - On the Role of Context in Reading Time Prediction [50.87306355705826]
We present a new perspective on how readers integrate context during real-time language comprehension.
Our proposals build on surprisal theory, which posits that the processing effort of a linguistic unit is an affine function of its in-context information content.
arXiv Detail & Related papers (2024-09-12T15:52:22Z) - Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z) - A Multi-Modal Context Reasoning Approach for Conditional Inference on
Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
We conduct extensive experiments on two corresponding datasets, and the results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z) - CLOP: Video-and-Language Pre-Training with Knowledge Regularizations [43.09248976105326]
Video-and-language pre-training has shown promising results for learning generalizable representations.
We denote this form of representation as structural knowledge, which expresses rich semantics at multiple granularities.
We propose a Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with Knowledge Regularizations.
arXiv Detail & Related papers (2022-11-07T05:32:12Z) - Learning to Locate Visual Answer in Video Corpus Using Question [21.88924465126168]
We introduce a new task, named video corpus visual answer localization (VCVAL), which aims to locate the visual answer in instructional videos.
We propose a cross-modal contrastive global-span (CCGS) method for the VCVAL, jointly training the video corpus retrieval and visual answer localization subtasks.
Experimental results show that the proposed method outperforms other competitive methods in both the video corpus retrieval and visual answer localization subtasks.
arXiv Detail & Related papers (2022-10-11T13:04:59Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric, "dR@n,IoU@m", that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - Interventional Video Grounding with Dual Contrastive Learning [16.0734337895897]
Video grounding aims to localize a moment from an untrimmed video for a given textual query.
We propose a novel paradigm from the perspective of causal inference to uncover the causality behind the model and data.
We also introduce a dual contrastive learning approach to better align the text and video.
arXiv Detail & Related papers (2021-06-21T12:11:28Z) - Neuro-Symbolic Representations for Video Captioning: A Case for
Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z) - Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z) - Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.