Verbs in Action: Improving verb understanding in video-language models
- URL: http://arxiv.org/abs/2304.06708v1
- Date: Thu, 13 Apr 2023 17:57:01 GMT
- Title: Verbs in Action: Improving verb understanding in video-language models
- Authors: Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman,
Cordelia Schmid
- Abstract summary: State-of-the-art video-language models based on CLIP have been shown to have limited verb understanding.
We improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive framework.
- Score: 128.87443209118726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding verbs is crucial to modelling how people and objects interact
with each other and the environment through space and time. Recently,
state-of-the-art video-language models based on CLIP have been shown to have
limited verb understanding and to rely extensively on nouns, restricting their
performance in real-world video applications that require action and temporal
understanding. In this work, we improve verb understanding for CLIP-based
video-language models by proposing a new Verb-Focused Contrastive (VFC)
framework. This consists of two main components: (1) leveraging pretrained
large language models (LLMs) to create hard negatives for cross-modal
contrastive learning, together with a calibration strategy to balance the
occurrence of concepts in positive and negative pairs; and (2) enforcing a
fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art
results for zero-shot performance on three downstream tasks that focus on verb
understanding: video-text matching, video question-answering and video
classification. To the best of our knowledge, this is the first work that
proposes a method to alleviate the verb understanding problem rather than
simply highlighting it.
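As a rough illustration of the two components described in the abstract, the sketch below shows a cross-modal contrastive loss whose candidate set includes LLM-generated verb hard negatives, plus a separate verb-phrase alignment term. This is a minimal, hypothetical PyTorch sketch, not the authors' released code: the function name, the embedding inputs, and the loss weighting are assumptions, and the calibration strategy for balancing concept occurrence is omitted.

```python
# Minimal sketch (assumption: PyTorch, generic pre-normalised video/text
# embeddings) of a verb-focused contrastive objective in the spirit of VFC:
# InfoNCE over in-batch captions plus LLM-generated verb hard negatives,
# combined with a symmetric video <-> verb-phrase alignment loss.
import torch
import torch.nn.functional as F


def verb_focused_contrastive_loss(
    video_emb,        # (B, D) video embeddings, L2-normalised
    text_emb,         # (B, D) embeddings of the ground-truth captions
    hard_neg_emb,     # (B, K, D) embeddings of K verb hard negatives per caption
    verb_phrase_emb,  # (B, D) embeddings of the verb phrase of each caption
    temperature=0.07,
    alignment_weight=1.0,  # assumed weighting between the two terms
):
    B, K, D = hard_neg_emb.shape
    targets = torch.arange(B, device=video_emb.device)  # diagonal = positive pair

    # (1) Cross-modal contrastive loss with verb hard negatives:
    # candidates for each video are all in-batch captions plus its own
    # K hard negatives (same nouns, different verb).
    batch_sims = video_emb @ text_emb.t()                        # (B, B)
    hard_sims = torch.einsum("bd,bkd->bk", video_emb, hard_neg_emb)  # (B, K)
    logits = torch.cat([batch_sims, hard_sims], dim=1) / temperature
    nce_loss = F.cross_entropy(logits, targets)

    # (2) Fine-grained verb-phrase alignment loss:
    # symmetric InfoNCE between videos and their extracted verb phrases.
    vp_logits = video_emb @ verb_phrase_emb.t() / temperature    # (B, B)
    align_loss = 0.5 * (
        F.cross_entropy(vp_logits, targets)
        + F.cross_entropy(vp_logits.t(), targets)
    )

    return nce_loss + alignment_weight * align_loss


if __name__ == "__main__":
    # Toy usage with random embeddings, just to show the expected shapes.
    B, K, D = 4, 3, 16
    norm = lambda x: F.normalize(x, dim=-1)
    loss = verb_focused_contrastive_loss(
        norm(torch.randn(B, D)),
        norm(torch.randn(B, D)),
        norm(torch.randn(B, K, D)),
        norm(torch.randn(B, D)),
    )
    print(float(loss))
```

In this reading, the hard negatives only enlarge the denominator of the contrastive loss for their own video, which is what makes them "hard": the model must distinguish the correct verb from a caption that differs only in its verb phrase.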
Related papers
- NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality [52.08735848128973]
We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions and their relations.
We propose a training method called NAVERO which utilizes video-text data augmented with negative texts to enhance composition understanding.
arXiv Detail & Related papers (2024-08-18T15:27:06Z) - Language-based Action Concept Spaces Improve Video Self-Supervised
Learning [8.746806973828738]
We introduce language tied self-supervised learning to adapt an image CLIP model to the video domain.
A backbone modified for temporal modeling is trained under self-distillation settings with train objectives operating in an action concept space.
Our approach improves zero-shot and linear probing performance on three action recognition benchmarks.
arXiv Detail & Related papers (2023-07-20T14:47:50Z) - EC^2: Emergent Communication for Embodied Control [72.99894347257268]
Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments.
We propose Emergent Communication for Embodied Control (EC2), a novel scheme to pre-train video-language representations for few-shot embodied control.
EC2 is shown to consistently outperform previous contrastive learning methods for both videos and texts as task inputs.
arXiv Detail & Related papers (2023-04-19T06:36:02Z) - CLOP: Video-and-Language Pre-Training with Knowledge Regularizations [43.09248976105326]
Video-and-language pre-training has shown promising results for learning generalizable representations.
We denote this form of representation as structural knowledge, which expresses rich semantics at multiple granularities.
We propose a Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with Knowledge Regularizations.
arXiv Detail & Related papers (2022-11-07T05:32:12Z) - Contrastive Video-Language Segmentation [41.1635597261304]
We focus on the problem of segmenting a certain object referred by a natural language sentence in video content.
We propose to intertwine the visual and linguistic modalities in an explicit way via a contrastive learning objective.
arXiv Detail & Related papers (2021-09-29T01:40:58Z) - Understanding Chinese Video and Language via Contrastive Multimodal
Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - Object Relational Graph with Teacher-Recommended Learning for Video
Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)