Hypernymization of named entity-rich captions for grounding-based
multi-modal pretraining
- URL: http://arxiv.org/abs/2304.13130v1
- Date: Tue, 25 Apr 2023 20:17:40 GMT
- Title: Hypernymization of named entity-rich captions for grounding-based
multi-modal pretraining
- Authors: Giacomo Nebbia, Adriana Kovashka
- Abstract summary: We investigate hypernymization as a way to deal with named entities for pretraining grounding-based multi-modal models.
We report improved pretraining performance on objects of interest following hypernymization.
We show the promise of hypernymization on open-vocabulary detection, specifically on classes not seen during training.
- Score: 36.75629570208193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Named entities are ubiquitous in text that naturally accompanies images,
especially in domains such as news or Wikipedia articles. In previous work,
named entities have been identified as a likely reason for low performance of
image-text retrieval models pretrained on Wikipedia and evaluated on
named-entity-free benchmark datasets. Because they are rarely mentioned, named
entities could be challenging to model. They also represent missed learning
opportunities for self-supervised models: the link between named entity and
object in the image may be missed by the model, but it would not be if the
object were mentioned using a more common term. In this work, we investigate
hypernymization as a way to deal with named entities for pretraining
grounding-based multi-modal models and for fine-tuning on open-vocabulary
detection. We propose two ways to perform hypernymization: (1) a ``manual''
pipeline relying on a comprehensive ontology of concepts, and (2) a ``learned''
approach where we train a language model to learn to perform hypernymization.
We run experiments on data from Wikipedia and from The New York Times. We
report improved pretraining performance on objects of interest following
hypernymization, and we show the promise of hypernymization on open-vocabulary
detection, specifically on classes not seen during training.
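The "manual" pipeline described in the abstract replaces named entities in a caption with a more common hypernym drawn from an ontology. A minimal sketch of that idea, where the tiny ONTOLOGY dict and the caption are hypothetical stand-ins for a comprehensive ontology of concepts and real Wikipedia/NYT data:

```python
# Sketch of a "manual" hypernymization pipeline: named entities in a
# caption are rewritten as more common hypernyms, so a grounding-based
# model can link the object in the image to a frequent term.
import re

# Hypothetical entity -> hypernym mapping; a real pipeline would query
# a full ontology rather than a hand-written dict.
ONTOLOGY = {
    "Eiffel Tower": "tower",
    "Golden Gate Bridge": "bridge",
    "Boeing 747": "airplane",
}

def hypernymize(caption: str, ontology: dict) -> str:
    """Replace each known named entity in the caption with its hypernym."""
    for entity, hypernym in ontology.items():
        caption = re.sub(re.escape(entity), hypernym, caption)
    return caption

print(hypernymize("The Eiffel Tower at night.", ONTOLOGY))
# -> "The tower at night."
```

The "learned" variant in the paper instead trains a language model to produce such rewrites, which avoids maintaining the ontology by hand.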
Related papers
- Premonition: Using Generative Models to Preempt Future Data Changes in
Continual Learning [63.850451635362425]
Continual learning requires a model to adapt to ongoing changes in the data distribution.
We show that the combination of a large language model and an image generation model can similarly provide useful premonitions.
We find that the backbone of our pre-trained networks can learn representations useful for the downstream continual learning problem.
arXiv Detail & Related papers (2024-03-12T06:29:54Z) - Multicultural Name Recognition For Previously Unseen Names [65.268245109828]
This paper attempts to improve recognition of person names, a diverse category that can grow any time someone is born or changes their name.
I look at names from 103 countries to compare how well the model performs on names from different cultures.
I find that a model with combined character and word input outperforms word-only models and may improve on accuracy compared to classical NER models.
arXiv Detail & Related papers (2024-01-23T17:58:38Z) - What's in a Name? Beyond Class Indices for Image Recognition [31.68225941659493]
We propose a vision-language model to assign class names to images given only a large and essentially unconstrained vocabulary of categories as prior information.
Specifically, we propose iteratively clustering the data and voting on class names within each cluster, showing that this enables a roughly 50% improvement over the baseline on ImageNet.
arXiv Detail & Related papers (2023-04-05T11:01:23Z) - Focus! Relevant and Sufficient Context Selection for News Image
Captioning [69.36678144800936]
News Image Captioning requires describing an image by leveraging additional context from a news article.
We propose to use the pre-trained vision and language retrieval model CLIP to localize the visually grounded entities in the news article.
Our experiments demonstrate that by simply selecting a better context from the article, we can significantly improve the performance of existing models.
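The context-selection step above can be sketched as ranking each article sentence by its similarity to the image and keeping only the top-scoring ones as input to the captioner. The toy embeddings below are a hypothetical stand-in for CLIP's text and image encoders:

```python
# Illustrative context selection for news image captioning: score each
# sentence against the image with an image-text similarity model and
# keep the k best. Embeddings here are toy 2-D vectors, not real CLIP
# outputs.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_context(image_emb, sentences, embed, k=2):
    """Return the k sentences most similar to the image embedding."""
    ranked = sorted(sentences, key=lambda s: cosine(image_emb, embed(s)),
                    reverse=True)
    return ranked[:k]

# Hypothetical sentence embeddings (CLIP's text encoder would supply these).
TOY = {
    "The mayor opened the new bridge on Sunday.": [0.9, 0.1],
    "Stock markets fell sharply last week.": [0.1, 0.9],
}
image_emb = [1.0, 0.0]  # embedding of a photo of the bridge, say
print(select_context(image_emb, list(TOY), TOY.get, k=1))
```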
arXiv Detail & Related papers (2022-12-01T20:00:27Z) - Exploiting Unlabeled Data with Vision and Language Models for Object
Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z) - Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z) - HyperBox: A Supervised Approach for Hypernym Discovery using Box
Embeddings [0.0]
We present a novel model HyperBox to learn box embeddings for hypernym discovery.
Given an input term, HyperBox retrieves its suitable hypernym from a target corpus.
We show that our model outperforms existing methods on the majority of the evaluation metrics.
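The intuition behind box embeddings for hypernym discovery is that each concept is an axis-aligned box and a hypernym's box should (approximately) contain its hyponym's box, so a containment score can rank candidate hypernyms. A hedged sketch with hypothetical 2-D boxes (the actual HyperBox model learns high-dimensional boxes from data):

```python
# Box-embedding containment score: P(parent | child) is the fraction of
# the child's box volume that lies inside the parent's box. Boxes are
# given as (lower_corner, upper_corner) coordinate lists.

def volume(lo, hi):
    v = 1.0
    for l, h in zip(lo, hi):
        v *= max(h - l, 0.0)  # empty in any dimension -> zero volume
    return v

def containment(child, parent):
    """Fraction of the child box contained in the parent box."""
    (clo, chi), (plo, phi) = child, parent
    ilo = [max(a, b) for a, b in zip(clo, plo)]  # intersection lower corner
    ihi = [min(a, b) for a, b in zip(chi, phi)]  # intersection upper corner
    return volume(ilo, ihi) / volume(clo, chi)

# Hypothetical learned boxes for three concepts.
dog = ([2.0, 2.0], [3.0, 3.0])
animal = ([1.0, 1.0], [4.0, 4.0])
vehicle = ([5.0, 5.0], [6.0, 6.0])

print(containment(dog, animal))   # -> 1.0: "animal" fully contains "dog"
print(containment(dog, vehicle))  # -> 0.0: disjoint boxes
```

Ranking candidate hypernyms by this score would then retrieve "animal" over "vehicle" for the input term "dog".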
arXiv Detail & Related papers (2022-04-05T08:46:50Z) - A Realistic Study of Auto-regressive Language Models for Named Entity
Typing and Recognition [7.345578385749421]
We study pre-trained language models for named entity recognition in a meta-learning setup.
First, we test named entity typing (NET) in a zero-shot transfer scenario. Then, we perform NER by giving few examples at inference.
We propose a method to select seen and rare / unseen names when having access only to the pre-trained model and report results on these groups.
arXiv Detail & Related papers (2021-08-26T15:29:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.