Related papers: MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning

URL: http://arxiv.org/abs/2407.03788v4
Date: Thu, 10 Oct 2024 02:10:16 GMT
Title: MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Authors: Thong Nguyen, Yi Bin, Xiaobao Wu, Xinshuai Dong, Zhiyuan Hu, Khoi Le, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan
Abstract summary: MAMA is a new approach to learning video-language representations by utilizing a contrastive objective with a subtractive angular margin. MAMA improves video-language representations and achieves superior performance on commonly used video question answering and text-video retrieval datasets.
Score: 34.259833094575285
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data quality stands at the forefront of deciding the effectiveness of video-language representation learning. However, video-text pairs in previous data typically do not align perfectly with each other, which might lead to video-language representations that do not accurately reflect cross-modal semantics. Moreover, previous data also possess an uneven distribution of concepts, thereby hampering the downstream performance across unpopular subjects. To address these problems, we propose MAMA, a new approach to learning video-language representations by utilizing a contrastive objective with a subtractive angular margin to regularize cross-modal representations in their effort to reach perfect similarity. Furthermore, to adapt to the non-uniform concept distribution, MAMA utilizes a multi-layer perceptron (MLP)-parameterized weighting function that maps loss values to sample weights, which enable dynamic adjustment of the model's focus throughout training. With the training guided by a small amount of unbiased meta-data and augmented by video-text data generated by a large vision-language model, MAMA improves video-language representations and achieves superior performance on commonly used video question answering and text-video retrieval datasets. The code, model, and data have been made available at https://nguyentthong.github.io/MAMA.
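
Illustrative sketch (not taken from the paper): the abstract names two mechanisms, a contrastive loss with a subtractive angular margin and an MLP that maps loss values to sample weights. The pure-Python code below shows one plausible reading of each. It assumes the margin is subtracted from the angle of each matched video-text pair before the cosine is taken, and that the weighting MLP has one hidden layer with a sigmoid output. The function names, the margin and temperature values, and the fixed MLP parameters are all hypothetical; in MAMA the weighting function is learned by meta-optimization, which this sketch does not attempt, and the exact formulation in the paper may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_contrastive_losses(video_embs, text_embs, margin=0.1, temperature=0.07):
    """Per-sample video-to-text InfoNCE losses.

    Assumption: for the matched pair (i, i) the angle theta is replaced by
    max(0, theta - margin), so the loss flattens out before the pair
    reaches perfect similarity.
    """
    losses = []
    for i, v in enumerate(video_embs):
        logits = []
        for j, t in enumerate(text_embs):
            cos_sim = max(-1.0, min(1.0, cosine(v, t)))
            if i == j:
                theta = math.acos(cos_sim)
                cos_sim = math.cos(max(0.0, theta - margin))
            logits.append(cos_sim / temperature)
        # Numerically stable log-sum-exp over all candidate texts.
        max_logit = max(logits)
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum - logits[i])
    return losses


def mlp_weight(loss, w1=(1.0, -1.0), b1=(0.0, 1.0), w2=(0.5, -0.5), b2=0.0):
    """Hypothetical one-hidden-layer MLP mapping a loss value to a weight in (0, 1).

    The parameters are fixed placeholders; they are not learned here.
    """
    hidden = [max(0.0, w * loss + b) for w, b in zip(w1, b1)]  # ReLU layer
    out = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-out))  # sigmoid


def weighted_loss(losses):
    """Average of the per-sample losses, each scaled by its MLP weight."""
    weights = [mlp_weight(l) for l in losses]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# Toy batch of two video-text pairs with 2-D embeddings.
videos = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.2, 0.8]]
per_sample = margin_contrastive_losses(videos, texts)
total = weighted_loss(per_sample)
```

Under this reading, a positive margin raises the logit of each matched pair, so its per-sample loss is never larger than the loss without a margin. The pull toward perfect similarity is therefore weaker for pairs that are already close.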

Related papers

Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models [21.966865098520277]
Video Large Language Models (Video-LLMs) are predominantly trained on questions generated directly from video content. In real-world scenarios, users often pose questions that extend beyond the informational scope of the video. We propose alignment for answerability, a framework that equips Video-LLMs with the ability to evaluate the relevance of a question based on the input video.
arXiv Detail & Related papers (2025-07-07T13:19:43Z)
Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm. Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
Training-free Video Temporal Grounding using Large-scale Pre-trained Models [41.71055776623368]
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query. Existing video temporal localization models rely on specific datasets for training and have high data collection costs. We propose a Training-Free Video Temporal Grounding approach that leverages the ability of pre-trained large models.
arXiv Detail & Related papers (2024-08-29T02:25:12Z)
Video In-context Learning [46.40277880351059]
In this paper, we study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences. To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets. We design various evaluation metrics, including both objective and subjective measures, to demonstrate the visual quality and semantic accuracy of generation results.
arXiv Detail & Related papers (2024-07-10T04:27:06Z)
VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions. Our work identifies a broad spectrum of contrast misalignments, such as replacing entities, actions, and flipping event order. Our model sets new state of the art zero-shot performance in temporally-extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z)
Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text. We build a new model that can better learn video-span correlation without manually designed features. Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z)
Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects. We tackle this problem from two different angles: algorithm and dataset. We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels. Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text. There is not a clear way to quickly adapt these two lines to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks. We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment. Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework. We report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.