CLIP-Event: Connecting Text and Images with Event Structures
- URL: http://arxiv.org/abs/2201.05078v1
- Date: Thu, 13 Jan 2022 17:03:57 GMT
- Title: CLIP-Event: Connecting Text and Images with Event Structures
- Authors: Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin,
Chenguang Zhu, Michael Zeng, Heng Ji, Shih-Fu Chang
- Abstract summary: We propose a contrastive learning framework to enforce vision-language pretraining models.
We take advantage of text information extraction technologies to obtain event structural knowledge.
Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction.
- Score: 123.31452120399827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language (V+L) pretraining models have achieved great success in
supporting multimedia applications by understanding the alignments between
images and text. While existing vision-language pretraining models primarily
focus on understanding objects in images or entities in text, they often ignore
the alignment at the level of events and their argument structures. In this
work, we propose a contrastive learning framework to enforce vision-language
pretraining models to comprehend events and associated argument (participant)
roles. To achieve this, we take advantage of text information extraction
technologies to obtain event structural knowledge, and utilize multiple prompt
functions to contrast difficult negative descriptions by manipulating event
structures. We also design an event graph alignment loss based on optimal
transport to capture event argument structures. In addition, we collect a large
event-rich dataset (106,875 images) for pretraining, which provides a more
challenging image retrieval benchmark to assess the understanding of
complicated lengthy sentences. Experiments show that our zero-shot CLIP-Event
outperforms the state-of-the-art supervised model in argument extraction on
Multimedia Event Extraction, achieving more than 5% absolute F-score gain in
event extraction, as well as significant improvements on a variety of
downstream tasks under zero-shot settings.
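
The abstract describes two learning signals that are easy to illustrate. First, a prompt function verbalizes an extracted event and its argument roles, and manipulating that structure (for example, swapping two role fillers) yields a hard negative caption for a CLIP-style contrastive loss. The sketch below is a minimal, hedged illustration; the `event_prompt` and `swapped_role_negative` helpers and the random features standing in for CLIP embeddings are assumptions, not the released CLIP-Event code.

```python
# Minimal sketch (not the authors' released code) of contrastive learning
# with event-structure hard negatives. The prompt template, the role-swap
# perturbation, and the random features standing in for CLIP image/text
# embeddings are all illustrative assumptions.
import torch
import torch.nn.functional as F

def event_prompt(event_type, roles):
    """One possible prompt function: verbalize an event and its argument roles."""
    args = ", ".join(f"{entity} as the {role}" for role, entity in roles.items())
    return f"a photo of a {event_type} event with {args}"

def swapped_role_negative(event_type, roles):
    """Hard negative built by manipulating the event structure: swap the
    fillers of two argument roles (wrong-event-type captions are another
    perturbation the abstract suggests)."""
    keys = list(roles)
    corrupted = dict(roles)
    corrupted[keys[0]], corrupted[keys[1]] = roles[keys[1]], roles[keys[0]]
    return event_prompt(event_type, corrupted)

def contrastive_with_hard_negatives(img_emb, pos_txt_emb, neg_txt_emb, tau=0.07):
    """Image-to-text InfoNCE: each image is contrasted against its own caption,
    the other captions in the batch, and its structure-corrupted hard negative."""
    img_emb = F.normalize(img_emb, dim=-1)
    pos_txt_emb = F.normalize(pos_txt_emb, dim=-1)
    neg_txt_emb = F.normalize(neg_txt_emb, dim=-1)
    logits_pos = img_emb @ pos_txt_emb.t() / tau                      # (B, B)
    logits_neg = (img_emb * neg_txt_emb).sum(-1, keepdim=True) / tau  # (B, 1)
    logits = torch.cat([logits_pos, logits_neg], dim=1)               # (B, B + 1)
    targets = torch.arange(img_emb.size(0))                           # positives on the diagonal
    return F.cross_entropy(logits, targets)

roles = {"attacker": "soldiers", "target": "a convoy"}
print(event_prompt("attack", roles))           # positive description
print(swapped_role_negative("attack", roles))  # hard negative description
loss = contrastive_with_hard_negatives(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
print(float(loss))
```

Second, the event graph alignment loss is based on optimal transport. The following sketch softly matches image-side and text-side event/argument node embeddings with entropic Sinkhorn iterations and uses the transport cost as a loss; the uniform node weights and cosine cost are illustrative choices, not the paper's exact formulation. In the full model, such a term would complement the image-text contrastive objective during pretraining.

```python
# Sketch of an optimal-transport style event graph alignment loss: node
# embeddings of the image-side and text-side event graphs are softly matched
# with Sinkhorn iterations, and the transport cost is the loss. Uniform node
# weights and a cosine cost are assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def sinkhorn_alignment_loss(img_nodes, txt_nodes, epsilon=0.1, iters=50):
    """img_nodes: (m, D) image event/argument embeddings; txt_nodes: (n, D)."""
    img_nodes = F.normalize(img_nodes, dim=-1)
    txt_nodes = F.normalize(txt_nodes, dim=-1)
    cost = 1.0 - img_nodes @ txt_nodes.t()      # (m, n) cosine distance
    m, n = cost.shape
    a = torch.full((m,), 1.0 / m)               # uniform marginal over image nodes
    b = torch.full((n,), 1.0 / n)               # uniform marginal over text nodes
    K = torch.exp(-cost / epsilon)              # entropic (Gibbs) kernel
    u = torch.ones(m)
    for _ in range(iters):                      # Sinkhorn-Knopp scaling
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)  # soft node-to-node matching (m, n)
    return (plan * cost).sum()                  # expected transport cost

print(float(sinkhorn_alignment_loss(torch.randn(5, 512), torch.randn(3, 512))))
```
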
Related papers
- Debiasing Vision-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
- PromptCL: Improving Event Representation via Prompt Template and Contrastive Learning [3.481567499804089]
We present PromptCL, a novel framework for event representation learning.
PromptCL elicits the capabilities of PLMs to comprehensively capture the semantics of short event texts.
Our experimental results demonstrate that PromptCL outperforms state-of-the-art baselines on event related tasks.
arXiv Detail & Related papers (2024-04-27T12:22:43Z) - EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding [7.797154022794006]
EventBind is a novel framework that unleashes the potential of vision-language models (VLMs) for event-based recognition.
We first introduce a novel event encoder that subtly models the temporal information from events.
We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance EventBind's generalization ability.
arXiv Detail & Related papers (2023-08-06T15:05:42Z) - COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z) - FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism (see the sketch of token-level late interaction after this list).
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
This list is automatically generated from the titles and abstracts of the papers indexed on this site.
The site does not guarantee the accuracy of the generated summaries and is not responsible for any consequences of their use.