AlignVE: Visual Entailment Recognition Based on Alignment Relations
- URL: http://arxiv.org/abs/2211.08736v1
- Date: Wed, 16 Nov 2022 07:52:24 GMT
- Title: AlignVE: Visual Entailment Recognition Based on Alignment Relations
- Authors: Biwei Cao, Jiuxin Cao, Jie Gui, Jiayun Shen, Bo Liu, Lei He, Yuan Yan Tang and James Tin-Yau Kwok
- Abstract summary: Visual entailment (VE) is the task of recognizing whether the semantics of a hypothesis text can be inferred from a given premise image.
A new architecture, AlignVE, is proposed to solve the visual entailment problem with a relation interaction method.
Our architecture reaches 72.45% accuracy on the SNLI-VE dataset, outperforming previous content-based models under the same settings.
- Score: 32.190603887676666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual entailment (VE) is the task of recognizing whether the semantics of a hypothesis
text can be inferred from a given premise image; it is a distinctive task among the recently
emerged vision-and-language understanding tasks. Currently, most existing VE approaches are
derived from visual question answering methods: they recognize visual entailment by quantifying
the similarity between the hypothesis and the premise in terms of content semantic features from
multiple modalities. Such approaches, however, ignore VE's unique nature of relation inference
between the premise and the hypothesis. Therefore, in this paper, a new architecture called
AlignVE is proposed to solve the visual entailment problem with a relation interaction method.
It models the relation between the premise and the hypothesis as an alignment matrix, then
introduces a pooling operation to obtain fixed-size feature vectors. Finally, these vectors pass
through a fully-connected layer and a normalization layer to complete the classification.
Experiments show that our alignment-based architecture reaches 72.45% accuracy on the SNLI-VE
dataset, outperforming previous content-based models under the same settings.
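The abstract only outlines the alignment-then-pooling pipeline. For illustration, below is a minimal PyTorch sketch of an alignment-matrix-based visual entailment classifier. The feature dimensions, the max-then-softmax pooling, and the LayerNorm placement are assumptions chosen for this sketch, not the published AlignVE design.

```python
import torch
import torch.nn as nn


class AlignmentVEClassifier(nn.Module):
    """Minimal sketch of an alignment-based visual entailment classifier.

    Assumes pre-extracted premise-image region features and hypothesis
    token features; the dimensions, pooling, and normalization below are
    illustrative choices, not the exact AlignVE architecture.
    """

    def __init__(self, img_dim=2048, txt_dim=768, hidden_dim=512, num_classes=3):
        super().__init__()
        # Project both modalities into a shared space before computing alignment.
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.txt_proj = nn.Linear(txt_dim, hidden_dim)
        # Classification head: normalization followed by a fully-connected layer.
        self.norm = nn.LayerNorm(2 * hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, R, img_dim) region features of the premise image
        # txt_feats: (B, T, txt_dim) token features of the hypothesis text
        v = self.img_proj(img_feats)              # (B, R, H)
        t = self.txt_proj(txt_feats)              # (B, T, H)

        # Alignment matrix: pairwise relation scores between regions and tokens.
        align = torch.bmm(v, t.transpose(1, 2))   # (B, R, T)

        # Pool the variable-size alignment matrix down to fixed-size vectors:
        # best-matching token score per region, and best-matching region score per token.
        img_side = align.max(dim=2).values        # (B, R)
        txt_side = align.max(dim=1).values        # (B, T)

        # Weight the projected features by their alignment scores, then sum.
        v_pooled = (v * img_side.softmax(dim=1).unsqueeze(-1)).sum(dim=1)  # (B, H)
        t_pooled = (t * txt_side.softmax(dim=1).unsqueeze(-1)).sum(dim=1)  # (B, H)

        fused = torch.cat([v_pooled, t_pooled], dim=1)   # (B, 2H)
        return self.classifier(self.norm(fused))         # (B, num_classes)


# Usage example with random tensors standing in for extracted features.
model = AlignmentVEClassifier()
img = torch.randn(4, 36, 2048)   # e.g. 36 detector regions per premise image
txt = torch.randn(4, 20, 768)    # e.g. 20 hypothesis tokens from a text encoder
print(model(img, txt).shape)     # torch.Size([4, 3])
```

In this sketch the premise is represented by pre-extracted region features and the hypothesis by token embeddings, and the three output classes correspond to entailment, neutral and contradiction as in SNLI-VE.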
Related papers
- Learning from Semi-Factuals: A Debiased and Semantic-Aware Framework for Generalized Relation Discovery [12.716874398564482]
Generalized Relation Discovery (GRD) aims to identify unlabeled instances in existing pre-defined relations or discover novel relations.
We propose a novel framework, SFGRD, for this task by learning from semi-factuals in two stages.
SFGRD surpasses state-of-the-art models in accuracy by 2.36% to 5.78% and in cosine similarity by 32.19% to 84.45%.
arXiv Detail & Related papers (2024-01-12T02:38:55Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
- Learnable Pillar-based Re-ranking for Image-Text Retrieval [119.9979224297237]
Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities.
Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks.
We propose a novel learnable pillar-based re-ranking paradigm for image-text retrieval.
arXiv Detail & Related papers (2023-04-25T04:33:27Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs are within two feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering [29.96818189046649]
We propose to apply structured alignments, which work with graph representations of visual and textual content.
As demonstrated in our experimental results, such structured alignment improves reasoning performance.
The proposed model, without any pretraining, outperforms the state-of-the-art methods on the GQA dataset, and beats the non-pretrained state-of-the-art methods on the VQA-v2 dataset.
arXiv Detail & Related papers (2022-01-25T22:26:09Z)
- Instance-Level Relative Saliency Ranking with Graph Reasoning [126.09138829920627]
We present a novel unified model to segment salient instances and infer relative saliency rank order.
A novel loss function is also proposed to effectively train the saliency ranking branch.
Experimental results demonstrate that our proposed model is more effective than previous methods.
arXiv Detail & Related papers (2021-07-08T13:10:42Z)
- Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by Graph-based Read, Update, and Control.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
arXiv Detail & Related papers (2020-08-31T23:25:01Z)
- Visual Question Answering with Prior Class Semantics [50.845003775809836]
We show how to exploit additional information pertaining to the semantics of candidate answers.
We extend the answer prediction process with a regression objective in a semantic space.
Our method brings improvements in consistency and accuracy over a range of question types.
arXiv Detail & Related papers (2020-05-04T02:46:31Z)