LGDN: Language-Guided Denoising Network for Video-Language Modeling
- URL: http://arxiv.org/abs/2209.11388v1
- Date: Fri, 23 Sep 2022 03:35:59 GMT
- Title: LGDN: Language-Guided Denoising Network for Video-Language Modeling
- Authors: Haoyu Lu and Mingyu Ding and Nanyi Fei and Yuqi Huo and Zhiwu Lu
- Abstract summary: We propose an efficient and effective model, termed Language-Guided Denoising Network (LGDN) for video-language modeling.
Our LGDN dynamically filters out the misaligned or redundant frames under the language supervision and obtains only 2--4 salient frames per video for cross-modal token-level alignment.
- Score: 30.99646752913056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-language modeling has attracted much attention with the rapid growth of
web videos. Most existing methods assume that the video frames and text
description are semantically correlated, and focus on video-language modeling
at the video level. However, this hypothesis often fails for two reasons: (1) With
the rich semantics of video content, it is difficult to cover all frames with
a single video-level description; (2) A raw video typically contains
noisy/meaningless information (e.g., scenery shots, transitions, or teasers).
Although a number of recent works deploy attention mechanisms to alleviate this
problem, the irrelevant/noisy information still makes the problem very difficult to
address. To overcome this challenge, we propose an efficient and effective
model, termed Language-Guided Denoising Network (LGDN), for video-language
modeling. Different from most existing methods that utilize all extracted video
frames, LGDN dynamically filters out the misaligned or redundant frames under
the language supervision and obtains only 2--4 salient frames per video for
cross-modal token-level alignment. Extensive experiments on five public
datasets show that our LGDN outperforms the state-of-the-art methods by large
margins. We also provide a detailed ablation study to reveal the critical
importance of solving the noise issue, in the hope of inspiring future
video-language work.
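As a rough illustration of the frame-filtering idea described in the abstract, the sketch below scores each extracted frame against the sentence embedding and keeps only the top few; this is a minimal sketch of the general idea, not the authors' implementation, and the encoder outputs, feature shapes, and `keep_k` parameter are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def select_salient_frames(frame_feats: torch.Tensor,
                          text_feat: torch.Tensor,
                          keep_k: int = 4) -> torch.Tensor:
    """Keep the k frames most relevant to the sentence embedding.

    frame_feats: (num_frames, dim) per-frame embeddings from a visual encoder.
    text_feat:   (dim,) sentence embedding from a language encoder.
    Returns the indices of the keep_k highest-scoring frames.
    """
    # Cosine similarity between every frame and the sentence.
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    relevance = frame_feats @ text_feat          # (num_frames,)

    # Language-guided denoising: drop misaligned/redundant frames and
    # keep only the most salient ones (2--4 per video in the paper).
    keep_k = min(keep_k, frame_feats.size(0))
    return relevance.topk(keep_k).indices

# Example: 32 extracted frames, keep the 4 most text-relevant ones.
frames = torch.randn(32, 256)
sentence = torch.randn(256)
salient_idx = select_salient_frames(frames, sentence, keep_k=4)
```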
Related papers
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FMs) that achieve state-of-the-art results in video recognition, video-speech tasks, and other video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address this limitation in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video
Localization [85.85582751254785]
We present a novel approach to NLVL that aims to address this issue.
Our method involves the direct generation of a global 2D temporal map via a conditional denoising diffusion process.
Our approach effectively encapsulates the interaction between the query and video data across various time scales.
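To make the conditional denoising idea concrete, here is a minimal DDPM-style sampling sketch that generates a 2D (start-time x end-time) map conditioned on a query embedding; the `denoiser` callable, noise schedule, step count, and map size are illustrative assumptions, not the paper's actual model.

```python
import torch

@torch.no_grad()
def sample_temporal_map(denoiser, query_emb, map_size=64, steps=50):
    """Reverse diffusion over a 2D temporal map, conditioned on the query.

    denoiser(x_t, t, query_emb) is assumed to predict the noise in x_t.
    A linear beta schedule is used here purely for illustration.
    """
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure Gaussian noise over the (start, end) grid.
    x = torch.randn(1, 1, map_size, map_size)
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), query_emb)          # predicted noise
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # higher values ~ more likely (start, end) spans

# Example with a dummy denoiser (a real model would be a conditional U-Net).
dummy = lambda x, t, q: torch.zeros_like(x)
temporal_map = sample_temporal_map(dummy, torch.randn(1, 256))
```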
arXiv Detail & Related papers (2024-01-16T09:33:29Z) - VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
The InsOVER algorithm locates the corresponding video events via efficient Hungarian matching between the decompositions of linguistic instructions and video events.
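As a hedged illustration of the Hungarian-matching step, the sketch below assigns decomposed instruction parts to candidate video events by maximizing cosine similarity with SciPy's `linear_sum_assignment`; the embeddings and function name are assumptions for illustration, not the InsOVER implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instructions_to_events(instr_embs: np.ndarray,
                                 event_embs: np.ndarray):
    """Hungarian matching between decomposed instruction parts and
    detected video events, using negative cosine similarity as the cost.

    instr_embs: (num_parts, dim); event_embs: (num_events, dim).
    Returns (part_indices, event_indices) of the optimal assignment.
    """
    a = instr_embs / np.linalg.norm(instr_embs, axis=1, keepdims=True)
    b = event_embs / np.linalg.norm(event_embs, axis=1, keepdims=True)
    cost = -(a @ b.T)                      # negate to maximize similarity
    rows, cols = linear_sum_assignment(cost)
    return rows, cols

# Example: match 3 instruction parts to 5 candidate video events.
parts = np.random.randn(3, 128)
events = np.random.randn(5, 128)
part_idx, event_idx = match_instructions_to_events(parts, events)
```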
arXiv Detail & Related papers (2023-10-16T17:05:56Z) - Revealing Single Frame Bias for Video-and-Language Learning [115.01000652123882]
We show that a single-frame trained model can achieve better performance than existing methods that use multiple frames for training.
This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets.
We propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling.
arXiv Detail & Related papers (2022-06-07T16:28:30Z) - Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
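For context, the sketch below shows the kind of symmetric video-text InfoNCE objective that contrastive multimodal pre-training commonly relies on; the temperature and embedding sizes are placeholder choices, not VICTOR's actual configuration.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    video_emb, text_emb: (batch, dim); the i-th video and i-th text are a pair.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                     # pairwise similarities
    labels = torch.arange(v.size(0), device=v.device)  # matched pairs lie on the diagonal
    loss_v2t = F.cross_entropy(logits, labels)
    loss_t2v = F.cross_entropy(logits.T, labels)
    return 0.5 * (loss_v2t + loss_t2v)

# Example: a batch of 8 video/text pairs with 256-d embeddings.
loss = video_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```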
arXiv Detail & Related papers (2021-04-19T15:58:45Z) - Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
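As a rough sketch of bilateral iterative attention (not FIAN's actual architecture), the module below alternates query-to-video and video-to-query cross-attention for a fixed number of rounds; the dimensions, head count, and iteration count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BilateralIterativeAttention(nn.Module):
    """Alternating cross-attention in both directions with residual updates."""
    def __init__(self, dim: int = 256, heads: int = 4, iterations: int = 2):
        super().__init__()
        self.iterations = iterations
        self.q2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2q = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_feats: torch.Tensor, video_feats: torch.Tensor):
        # query_feats: (batch, num_words, dim); video_feats: (batch, num_clips, dim)
        for _ in range(self.iterations):
            # Sentence tokens gather evidence from video clips ...
            attn_q, _ = self.q2v(query_feats, video_feats, video_feats)
            # ... and video clips gather evidence from sentence tokens.
            attn_v, _ = self.v2q(video_feats, query_feats, query_feats)
            query_feats = query_feats + attn_q   # residual updates
            video_feats = video_feats + attn_v
        return query_feats, video_feats

# Example: 20 query words and 64 video clips, batch of 2.
model = BilateralIterativeAttention()
q, v = model(torch.randn(2, 20, 256), torch.randn(2, 64, 256))
```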
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.