Few-shot Domain-Adaptive Visually-fused Event Detection from Text
- URL: http://arxiv.org/abs/2305.03517v2
- Date: Mon, 5 Jun 2023 00:58:41 GMT
- Title: Few-shot Domain-Adaptive Visually-fused Event Detection from Text
- Authors: Farhad Moghimifar, Fatemeh Shiri, Van Nguyen, Reza Haffari, Yuan-Fang Li
- Abstract summary: We present a novel domain-adaptive visually-fused event detection approach that can be trained with only a few labelled image-text pairs.
Specifically, we introduce a visual imaginator method that synthesises images from text in the absence of visual context.
Our model can leverage the capabilities of pre-trained vision-language models and can be trained in a few-shot setting.
- Score: 13.189886554546929
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Incorporating auxiliary modalities such as images into event detection models
has attracted increasing interest over the last few years. The complexity of
natural language in describing situations has motivated researchers to leverage
the related visual context to improve event detection performance. However,
current approaches in this area suffer from data scarcity, as a large number
of labelled text-image pairs is required for model training. Furthermore,
limited access to the visual context at inference time negatively impacts the
performance of such models, which makes them practically ineffective in
real-world scenarios. In this paper, we present a novel domain-adaptive
visually-fused event detection approach that can be trained with only a few
labelled image-text pairs. Specifically, we introduce a visual imaginator
method that synthesises images from text in the absence of visual context.
Moreover, the imaginator can be customised to a specific domain. In doing so,
our model can leverage the capabilities of pre-trained vision-language models
and can be trained in a few-shot setting. This also allows for effective
inference where only single-modality data (i.e. text) is available. The
experimental evaluation on the benchmark M2E2 dataset shows that our model
outperforms existing state-of-the-art models by up to 11 points.
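To make the idea concrete, below is a minimal, hypothetical sketch of a visual-imaginator pipeline: a pre-trained text-to-image model synthesises a visual context for an input sentence, a frozen vision-language model (CLIP here) encodes both modalities, and a small classification head maps the fused features to event types. The model names, concatenation-based fusion, and classifier head are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not the paper's code): a text-to-image "imaginator"
# synthesises a visual context, and a frozen CLIP encoder produces fused
# text+image features for a small event classifier trained few-shot.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Any pre-trained text-to-image generator could play the imaginator role.
imaginator = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def fused_features(sentence: str) -> torch.Tensor:
    """Synthesise an image for `sentence`, return concatenated CLIP features."""
    image = imaginator(sentence).images[0]  # the "imagined" visual context
    inputs = processor(text=[sentence], images=image,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        text_emb = clip.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        )
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.cat([text_emb, img_emb], dim=-1)  # shape: (1, 1024)

# Few-shot training would update only this head; both backbones stay frozen.
NUM_EVENT_TYPES = 8  # illustrative; the benchmark defines its own ontology
classifier = torch.nn.Linear(1024, NUM_EVENT_TYPES).to(device)
logits = classifier(fused_features("Protesters clashed with police downtown."))
```

Because the image is synthesised from the text itself, the same pipeline runs at inference time on text alone, which is how an approach of this kind can remain usable when no real visual context is available.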
Related papers
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - Visual Analytics for Efficient Image Exploration and User-Guided Image
Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z) - Learning the Visualness of Text Using Large Vision-Language Models [42.75864384249245]
Visual text evokes an image in a person's mind, while non-visual text fails to do so.
A method to automatically detect visualness in text will enable text-to-image retrieval and generation models to augment text with relevant images.
We curate a dataset of 3,620 English sentences and their visualness scores provided by multiple human annotators.
arXiv Detail & Related papers (2023-05-11T17:45:16Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It uses a retrieval-then-optimization procedure to synthesize pseudo text features.
It is beneficial in a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z) - Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z) - Detection and Captioning with Unseen Object Classes [12.894104422808242]
Test images may contain visual objects with no corresponding visual or textual training examples.
We propose a detection-driven approach based on a generalized zero-shot detection model and a template-based sentence generation model.
Our experiments show that the proposed zero-shot detection model obtains state-of-the-art performance on the MS-COCO dataset.
arXiv Detail & Related papers (2021-08-13T10:43:20Z) - Probing Contextual Language Models for Common Ground with Visual
Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.