Hypernymization of named entity-rich captions for grounding-based
multi-modal pretraining
- URL: http://arxiv.org/abs/2304.13130v1
- Date: Tue, 25 Apr 2023 20:17:40 GMT
- Title: Hypernymization of named entity-rich captions for grounding-based
multi-modal pretraining
- Authors: Giacomo Nebbia, Adriana Kovashka
- Abstract summary: We investigate hypernymization as a way to deal with named entities for pretraining grounding-based multi-modal models.
We report improved pretraining performance on objects of interest following hypernymization.
We show the promise of hypernymization on open-vocabulary detection, specifically on classes not seen during training.
- Score: 36.75629570208193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Named entities are ubiquitous in text that naturally accompanies images,
especially in domains such as news or Wikipedia articles. In previous work,
named entities have been identified as a likely reason for low performance of
image-text retrieval models pretrained on Wikipedia and evaluated on
named-entity-free benchmark datasets. Because they are rarely mentioned, named
entities could be challenging to model. They also represent missed learning
opportunities for self-supervised models: the link between named entity and
object in the image may be missed by the model, but it would not be if the
object were mentioned using a more common term. In this work, we investigate
hypernymization as a way to deal with named entities for pretraining
grounding-based multi-modal models and for fine-tuning on open-vocabulary
detection. We propose two ways to perform hypernymization: (1) a ``manual''
pipeline relying on a comprehensive ontology of concepts, and (2) a ``learned''
approach where we train a language model to learn to perform hypernymization.
We run experiments on data from Wikipedia and from The New York Times. We
report improved pretraining performance on objects of interest following
hypernymization, and we show the promise of hypernymization on open-vocabulary
detection, specifically on classes not seen during training.
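The "manual" pipeline described in the abstract replaces named entities in a caption with a more common hypernym drawn from an ontology. A minimal sketch of that idea, where the tiny ONTOLOGY dict and the caption are hypothetical stand-ins for a comprehensive ontology of concepts and real Wikipedia/NYT data:

```python
# Sketch of a "manual" hypernymization pipeline: named entities in a
# caption are rewritten as more common hypernyms, so a grounding-based
# model can link the object in the image to a frequent term.
import re

# Hypothetical entity -> hypernym mapping; a real pipeline would query
# a full ontology rather than a hand-written dict.
ONTOLOGY = {
    "Eiffel Tower": "tower",
    "Golden Gate Bridge": "bridge",
    "Boeing 747": "airplane",
}

def hypernymize(caption: str, ontology: dict) -> str:
    """Replace each known named entity in the caption with its hypernym."""
    for entity, hypernym in ontology.items():
        caption = re.sub(re.escape(entity), hypernym, caption)
    return caption

print(hypernymize("The Eiffel Tower at night.", ONTOLOGY))
# -> "The tower at night."
```

The "learned" variant in the paper instead trains a language model to produce such rewrites, which avoids maintaining the ontology by hand.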
Related papers
- Premonition: Using Generative Models to Preempt Future Data Changes in
Continual Learning [63.850451635362425]
Continual learning requires a model to adapt to ongoing changes in the data distribution.
We show that the combination of a large language model and an image generation model can similarly provide useful premonitions.
We find that the backbone of our pre-trained networks can learn representations useful for the downstream continual learning problem.
arXiv Detail & Related papers (2024-03-12T06:29:54Z) - Multicultural Name Recognition For Previously Unseen Names [65.268245109828]
This paper attempts to improve recognition of person names, a diverse category that can grow any time someone is born or changes their name.
I look at names from 103 countries to compare how well the model performs on names from different cultures.
I find that a model with combined character and word input outperforms word-only models and may improve on accuracy compared to classical NER models.
arXiv Detail & Related papers (2024-01-23T17:58:38Z) - What's in a Name? Beyond Class Indices for Image Recognition [31.68225941659493]
We propose a vision-language model to assign class names to images given only a large and essentially unconstrained vocabulary of categories as prior information.
Specifically, we propose iteratively clustering the data and voting on class names within each cluster, showing that this enables a roughly 50% improvement over the baseline on ImageNet.
arXiv Detail & Related papers (2023-04-05T11:01:23Z) - Focus! Relevant and Sufficient Context Selection for News Image
Captioning [69.36678144800936]
News Image Captioning requires describing an image by leveraging additional context from a news article.
We propose to use the pre-trained vision and language retrieval model CLIP to localize the visually grounded entities in the news article.
Our experiments demonstrate that by simply selecting a better context from the article, we can significantly improve the performance of existing models.
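The context-selection step above can be sketched as ranking each article sentence by its similarity to the image and keeping only the top-scoring ones as input to the captioner. The toy embeddings below are a hypothetical stand-in for CLIP's text and image encoders:

```python
# Illustrative context selection for news image captioning: score each
# sentence against the image with an image-text similarity model and
# keep the k best. Embeddings here are toy 2-D vectors, not real CLIP
# outputs.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_context(image_emb, sentences, embed, k=2):
    """Return the k sentences most similar to the image embedding."""
    ranked = sorted(sentences, key=lambda s: cosine(image_emb, embed(s)),
                    reverse=True)
    return ranked[:k]

# Hypothetical sentence embeddings (CLIP's text encoder would supply these).
TOY = {
    "The mayor opened the new bridge on Sunday.": [0.9, 0.1],
    "Stock markets fell sharply last week.": [0.1, 0.9],
}
image_emb = [1.0, 0.0]  # embedding of a photo of the bridge, say
print(select_context(image_emb, list(TOY), TOY.get, k=1))
```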
arXiv Detail & Related papers (2022-12-01T20:00:27Z) - Exploiting Unlabeled Data with Vision and Language Models for Object
Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z) - Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z) - HyperBox: A Supervised Approach for Hypernym Discovery using Box
Embeddings [0.0]
We present a novel model HyperBox to learn box embeddings for hypernym discovery.
Given an input term, HyperBox retrieves its suitable hypernym from a target corpus.
We show that our model outperforms existing methods on the majority of the evaluation metrics.
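The intuition behind box embeddings for hypernym discovery is that each concept is an axis-aligned box and a hypernym's box should (approximately) contain its hyponym's box, so a containment score can rank candidate hypernyms. A hedged sketch with hypothetical 2-D boxes (the actual HyperBox model learns high-dimensional boxes from data):

```python
# Box-embedding containment score: P(parent | child) is the fraction of
# the child's box volume that lies inside the parent's box. Boxes are
# given as (lower_corner, upper_corner) coordinate lists.

def volume(lo, hi):
    v = 1.0
    for l, h in zip(lo, hi):
        v *= max(h - l, 0.0)  # empty in any dimension -> zero volume
    return v

def containment(child, parent):
    """Fraction of the child box contained in the parent box."""
    (clo, chi), (plo, phi) = child, parent
    ilo = [max(a, b) for a, b in zip(clo, plo)]  # intersection lower corner
    ihi = [min(a, b) for a, b in zip(chi, phi)]  # intersection upper corner
    return volume(ilo, ihi) / volume(clo, chi)

# Hypothetical learned boxes for three concepts.
dog = ([2.0, 2.0], [3.0, 3.0])
animal = ([1.0, 1.0], [4.0, 4.0])
vehicle = ([5.0, 5.0], [6.0, 6.0])

print(containment(dog, animal))   # -> 1.0: "animal" fully contains "dog"
print(containment(dog, vehicle))  # -> 0.0: disjoint boxes
```

Ranking candidate hypernyms by this score would then retrieve "animal" over "vehicle" for the input term "dog".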
arXiv Detail & Related papers (2022-04-05T08:46:50Z) - A Realistic Study of Auto-regressive Language Models for Named Entity
Typing and Recognition [7.345578385749421]
We study pre-trained language models for named entity recognition in a meta-learning setup.
First, we test named entity typing (NET) in a zero-shot transfer scenario. Then, we perform NER by giving few examples at inference.
We propose a method to select seen and rare / unseen names when having access only to the pre-trained model and report results on these groups.
arXiv Detail & Related papers (2021-08-26T15:29:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.