Towards Visual-Prompt Temporal Answering Grounding in Medical
Instructional Video
- URL: http://arxiv.org/abs/2203.06667v2
- Date: Tue, 15 Mar 2022 07:46:41 GMT
- Title: Towards Visual-Prompt Temporal Answering Grounding in Medical
Instructional Video
- Authors: Bin Li, Yixuan Weng, Bin Sun and Shutao Li
- Abstract summary: The temporal answering grounding in the video (TAGV) is a new task deriving from temporal sentence grounding in the video (TSGV).
Existing methods tend to formulate the TAGV task with a visual span-based question answering (QA) approach by matching the visual frame span queried by the text question.
We propose a visual-prompt text span localizing (VPTSL) method, which enhances the text span localization in the pre-trained language model (PLM) with the visual highlight features.
- Score: 21.88924465126168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The temporal answering grounding in the video (TAGV) is a new task naturally
deriving from temporal sentence grounding in the video (TSGV). Given an
untrimmed video and a text question, this task aims at locating the matching
span from the video that can semantically answer the question. Existing methods
tend to formulate the TAGV task with a visual span-based question answering
(QA) approach by matching the visual frame span queried by the text question.
However, due to the weak correlations and large semantic gaps between the features
of the textual question and the visual answer, existing methods that adopt a
visual span predictor perform poorly on the TAGV task. In this work, we
propose a visual-prompt text span localizing (VPTSL) method, which enhances the
text span localization in the pre-trained language model (PLM) with the visual
highlight features. Specifically, the context query attention is utilized to
perform cross-modal modeling between the textual and visual features. Then, the
highlight features are obtained through the highlight module with a linear
layer to provide the visual prompt. To alleviate the semantic differences and
weak correlations between textual and visual features, we design the text span
predictor by encoding the question, the subtitles, and the visual prompt in the
PLM. As a result, the TAGV task is formulated to predict the span of subtitles
matching the answering frame timeline. Extensive experiments on the medical
instructional dataset, namely MedVidQA, show the proposed VPTSL outperforms
other state-of-the-art methods, which demonstrates the effectiveness of the
visual prompt and the text span predictor.
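The abstract describes a three-part flow: context-query attention fuses question and frame features, a linear highlight module turns the fused features into per-frame highlight scores, and a text span predictor over the PLM-encoded question and subtitles is conditioned on the resulting visual prompt. The following is a minimal PyTorch sketch of how such components could connect; the module names, dimensions, the generic Transformer encoder standing in for the actual pre-trained language model, and the projection of highlight scores into a single prompt embedding are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a VPTSL-style text span predictor with a visual prompt.
# All module names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn


class ContextQueryAttention(nn.Module):
    """Cross-modal attention between video frame features and question features."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, D) frame features; text_feats: (B, L, D) question tokens
        fused, _ = self.attn(query=video_feats, key=text_feats, value=text_feats)
        return fused  # (B, T, D) question-conditioned frame features


class HighlightModule(nn.Module):
    """Linear layer scoring how relevant each frame is to the question."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, fused_video):
        return torch.sigmoid(self.scorer(fused_video)).squeeze(-1)  # (B, T) in [0, 1]


class TextSpanPredictor(nn.Module):
    """Predicts start/end positions over subtitle tokens, given a visual prompt."""
    def __init__(self, dim, num_frames):
        super().__init__()
        self.prompt_proj = nn.Linear(num_frames, dim)   # highlight scores -> prompt embedding
        self.encoder = nn.TransformerEncoder(           # stand-in for a pre-trained LM encoder
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.span_head = nn.Linear(dim, 2)              # start / end logits per token

    def forward(self, token_embeds, highlight_scores):
        # token_embeds: (B, L, D) embeddings of the question + subtitle tokens
        prompt = self.prompt_proj(highlight_scores).unsqueeze(1)     # (B, 1, D) visual prompt
        hidden = self.encoder(torch.cat([prompt, token_embeds], dim=1))
        start_logits, end_logits = self.span_head(hidden[:, 1:]).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)      # (B, L) each
```

In the actual method the prompt is built from highlight features rather than raw scores and the encoder is a pre-trained language model; the sketch only shows how the cross-modal attention, highlight scoring, and subtitle span prediction fit together, with the predicted span mapped back to the answering frame timeline via subtitle timestamps.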
Related papers
- VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search [51.9899504535878]
We propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search.
In VGSG, a vision-guided attention mechanism is employed to extract visual-related textual features.
With the help of relational knowledge transfer, the vision-guided knowledge transfer (VGKT) module is capable of aligning semantic-group textual features with corresponding visual features.
arXiv Detail & Related papers (2023-11-13T17:56:54Z) - Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z) - Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - Video Question Answering Using CLIP-Guided Visual-Text Attention [17.43377106246301]
Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA).
We propose a visual-text attention mechanism that utilizes Contrastive Language-Image Pre-training (CLIP), which is trained on a large number of general-domain language-image pairs.
The proposed method is evaluated on MSVD-QA and MSRVTT-QA datasets, and outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-03-06T13:49:15Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection
to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z) - Learning to Locate Visual Answer in Video Corpus Using Question [21.88924465126168]
We introduce a new task, named video corpus visual answer localization (VCVAL), which aims to locate the visual answer in instructional videos.
We propose a cross-modal contrastive global-span (CCGS) method for the VCVAL, jointly training the video corpus retrieval and visual answer localization subtasks.
Experimental results show that the proposed method outperforms other competitive methods both in the video corpus retrieval and visual answer localization subtasks.
arXiv Detail & Related papers (2022-10-11T13:04:59Z) - VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts [2.0434814235659555]
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning.
We propose to enhance CLIP via Visual-guided Texts, named VT-CLIP.
In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
arXiv Detail & Related papers (2021-12-04T18:34:24Z) - Visual Spatio-temporal Relation-enhanced Network for Cross-modal
Text-Video Retrieval [17.443195531553474]
Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual Spatio-temporal Relation-enhanced semantic network (CNN-SRNet), a cross-modal retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
arXiv Detail & Related papers (2021-10-29T08:23:40Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.