A Better Loss for Visual-Textual Grounding
- URL: http://arxiv.org/abs/2108.05308v1
- Date: Wed, 11 Aug 2021 16:26:54 GMT
- Title: A Better Loss for Visual-Textual Grounding
- Authors: Davide Rigoni, Luciano Serafini, Alessandro Sperduti
- Abstract summary: Given a textual phrase and an image, the visual grounding problem is defined as the task of locating the content of the image referenced by the sentence.
It is a challenging task that has several real-world applications in human-computer interaction, image-text reference resolution, and video-text reference resolution.
We propose a model that is able to achieve a higher accuracy than state-of-the-art models thanks to the adoption of a more effective loss function.
- Score: 74.81353762517979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a textual phrase and an image, the visual grounding problem is defined
as the task of locating the content of the image referenced by the sentence. It
is a challenging task that has several real-world applications in
human-computer interaction, image-text reference resolution, and video-text
reference resolution. In recent years, several works have addressed this
problem with heavy and complex models that try to capture visual-textual
dependencies better than before. These models are typically composed of two
main components that focus on how to learn useful multi-modal features for
grounding and how to improve the predicted bounding box of the visual mention,
respectively. Finding the right learning balance between these two sub-tasks is
not easy, and the current models are not necessarily optimal with respect to
this issue. In this work, we propose a model that, although using a simple
multi-modal feature fusion component, is able to achieve a higher accuracy than
state-of-the-art models thanks to the adoption of a more effective loss
function based on the class probabilities, which reaches, on the considered
datasets, a better learning balance between the two sub-tasks mentioned above.
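To make the balancing idea concrete, below is a minimal PyTorch sketch of a two-term grounding loss in which the regression term is weighted by the predicted probability of the correct proposal. The function name, tensor layout, and weighting rule are illustrative assumptions, not the paper's exact formulation.
```python
# Hedged sketch: a composite grounding loss balancing proposal selection
# (classification) against bounding-box refinement (regression).
# The probability-based weighting below is an illustrative assumption.
import torch
import torch.nn.functional as F

def grounding_loss(proposal_logits, target_idx, pred_boxes, target_box):
    """proposal_logits: (N,) phrase-proposal scores; pred_boxes: (N, 4);
    target_idx: index of the best-matching proposal; target_box: (4,)."""
    # Sub-task 1: ground the phrase by classifying the matching proposal.
    cls_loss = F.cross_entropy(proposal_logits.unsqueeze(0),
                               torch.tensor([target_idx]))
    # Sub-task 2: refine the coordinates of that proposal.
    reg_loss = F.smooth_l1_loss(pred_boxes[target_idx], target_box)
    # Balance the sub-tasks with the model's (detached) class probability
    # for the correct proposal: regression matters more once grounding is
    # already confident. An illustrative choice, not the paper's rule.
    p_correct = torch.softmax(proposal_logits, dim=0)[target_idx].detach()
    return cls_loss + p_correct * reg_loss
```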
Related papers
- ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding [42.10086029931937]
Visual grounding aims to localize the object referred to in an image based on a natural language query.
Existing methods suffer a significant performance drop when there are multiple distracting objects in an image.
We propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue.
arXiv Detail & Related papers (2024-08-29T07:32:01Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence [5.500735640045456]
Category-level pose estimation is a challenging task with many potential applications in computer vision and robotics.
We propose to utilize both geometric and semantic features obtained from a pre-trained foundation model.
This requires significantly less data to train than prior methods since the semantic features are robust to object texture and appearance.
arXiv Detail & Related papers (2023-11-23T02:35:38Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization (a generic merging sketch follows this entry).
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
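As flagged in the entry above, here is a minimal sketch of model merging by linear weight interpolation, a common instantiation of multimodal model merging. The summary does not spell out UnIVAL's "weight generalization" procedure, so the function and its defaults are assumptions.
```python
# Hedged sketch: merge fine-tuned checkpoints of one shared architecture
# by linearly interpolating their weights. Not UnIVAL's exact procedure.
import torch

def merge_state_dicts(state_dicts, coeffs=None):
    """Linearly combine parameter tensors key by key."""
    if coeffs is None:
        coeffs = [1.0 / len(state_dicts)] * len(state_dicts)  # uniform average
    return {
        key: sum(c * sd[key].float() for c, sd in zip(coeffs, state_dicts))
        for key in state_dicts[0]
    }

# Hypothetical usage: blend a captioning and a VQA fine-tune of the same
# pretrained multimodal backbone, then load the merged weights.
# merged = merge_state_dicts([caption_sd, vqa_sd], coeffs=[0.5, 0.5])
# model.load_state_dict(merged)
```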
- Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement [52.80968034977751]
Using only image-sentence pairs, weakly-supervised visual-textual grounding aims to learn region-phrase correspondences of the respective entity mentions.
We propose the Semantic Prior Refinement Model (SPRM), whose predictions are obtained by combining the output of two main modules.
Our approach shows state-of-the-art results on two popular datasets, Flickr30k Entities and ReferIt, with a 9.6% absolute improvement.
arXiv Detail & Related papers (2023-05-18T12:25:07Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs belong to two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- Fully Context-Aware Image Inpainting with a Learned Semantic Pyramid [102.24539566851809]
Restoring reasonable and realistic content for arbitrary missing regions in images is an important yet challenging task.
Recent image inpainting models have made significant progress in generating vivid visual details, but they can still lead to texture blurring or structural distortions.
We propose the Semantic Pyramid Network (SPN), motivated by the idea that learning multi-scale semantic priors can greatly benefit the recovery of locally missing content in images (a generic multi-scale fusion sketch follows this entry).
arXiv Detail & Related papers (2021-12-08T04:33:33Z)
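As noted in the entry above, the following is a generic sketch of fusing multi-scale semantic priors into a single guidance map, the broad idea behind a semantic pyramid. Module names and shapes are hypothetical; this is not the SPN architecture.
```python
# Hedged sketch: fuse coarse-to-fine semantic priors into one guidance map
# for an inpainting decoder. Illustrative of the multi-scale-prior idea only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePriorFusion(nn.Module):
    def __init__(self, prior_channels=(256, 128, 64), out_channels=64):
        super().__init__()
        # One 1x1 projection per pyramid level, coarsest first.
        self.proj = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in prior_channels
        )

    def forward(self, priors):
        """priors: list of feature maps, coarsest first (e.g. 8x8 ... 32x32)."""
        fused = None
        for prior, proj in zip(priors, self.proj):
            level = proj(prior)
            if fused is None:
                fused = level
            else:
                # Upsample the coarser result and add the finer-level prior.
                fused = F.interpolate(fused, size=level.shape[-2:],
                                      mode="bilinear", align_corners=False) + level
        return fused  # guidance map consumed by the decoder

# Hypothetical usage with random stand-ins for encoder pyramid features:
priors = [torch.randn(1, 256, 8, 8), torch.randn(1, 128, 16, 16),
          torch.randn(1, 64, 32, 32)]
guidance = MultiScalePriorFusion()(priors)  # -> shape (1, 64, 32, 32)
```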
- Dependent Multi-Task Learning with Causal Intervention for Image Captioning [10.6405791176668]
In this paper, we propose a dependent multi-task learning framework with causal intervention (DMTCI).
Firstly, we introduce an intermediate task, bag-of-categories generation, before the final task, image captioning.
Secondly, we apply Pearl's do-calculus to the model, cutting off the link between the visual features and possible confounders.
Finally, we use a multi-agent reinforcement learning strategy to enable end-to-end training and reduce the inter-task error accumulations.
arXiv Detail & Related papers (2021-05-18T14:57:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.