Chunk-aware Alignment and Lexical Constraint for Visual Entailment with
Natural Language Explanations
- URL: http://arxiv.org/abs/2207.11401v1
- Date: Sat, 23 Jul 2022 03:19:50 GMT
- Title: Chunk-aware Alignment and Lexical Constraint for Visual Entailment with
Natural Language Explanations
- Authors: Qian Yang and Yunxin Li and Baotian Hu and Lin Ma and Yuxing Ding and
Min Zhang
- Abstract summary: Visual Entailment with natural language explanations aims to infer the relationship between a text-image pair and generate a sentence to explain the decision-making process.
Previous methods rely mainly on a pre-trained vision-language model to perform the relation inference and a language model to generate the corresponding explanation.
We propose a unified Chunk-aware Alignment and Lexical Constraint based method, dubbed CALeC.
- Score: 38.50987889221086
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual Entailment with natural language explanations aims to infer the
relationship between a text-image pair and generate a sentence to explain the
decision-making process. Previous methods rely mainly on a pre-trained
vision-language model to perform the relation inference and a language model to
generate the corresponding explanation. However, the pre-trained
vision-language models mainly build token-level alignment between text and
image yet ignore the high-level semantic alignment between the phrases (chunks)
and visual contents, which is critical for vision-language reasoning. Moreover,
the explanation generator based only on the encoded joint representation does
not explicitly consider the critical decision-making points of relation
inference. Thus, the generated explanations are less faithful to vision-language
reasoning. To mitigate these problems, we propose a unified Chunk-aware
Alignment and Lexical Constraint based method, dubbed CALeC. It contains a
Chunk-aware Semantic Interactor (abbr. CSI), a relation inferrer, and a Lexical
Constraint-aware Generator (abbr. LeCG). Specifically, CSI exploits the sentence
structure inherent in language and various image regions to build chunk-aware
semantic alignment. The relation inferrer uses an attention-based reasoning
network to incorporate the token-level and chunk-level vision-language
representations. LeCG utilizes lexical constraints to explicitly incorporate the
words or chunks attended to by the relation inferrer into explanation generation,
improving the faithfulness and informativeness of the explanations. We conduct
extensive experiments on three datasets, and the results indicate that CALeC
significantly outperforms competing models in both inference accuracy and the
quality of the generated explanations.
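To make the two key ideas above concrete, the sketch below illustrates (a) cross-attention that aligns phrase (chunk) embeddings with image-region features and (b) picking the most strongly attended chunks as lexical constraints for explanation generation. This is a minimal, hypothetical illustration only, not the authors' CALeC implementation; the class and function names, dimensions, and the constraint-selection heuristic are assumptions made for exposition.

```python
# Illustrative sketch only -- NOT the released CALeC code. Shapes, names, and the
# constraint-selection heuristic are assumptions made for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChunkRegionAttention(nn.Module):
    """Cross-attention from text chunks (phrases) to image regions."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # project chunk embeddings to queries
        self.k = nn.Linear(dim, dim)  # project region features to keys
        self.v = nn.Linear(dim, dim)  # project region features to values

    def forward(self, chunk_emb: torch.Tensor, region_emb: torch.Tensor):
        # chunk_emb:  (num_chunks, dim)  -- one vector per phrase/chunk
        # region_emb: (num_regions, dim) -- one vector per detected image region
        q, k, v = self.q(chunk_emb), self.k(region_emb), self.v(region_emb)
        attn = F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)  # (chunks, regions)
        aligned = attn @ v  # region-grounded chunk representations
        return aligned, attn


def select_lexical_constraints(chunks, chunk_scores, top_k=2):
    """Pick the chunks attended to most strongly.

    The returned strings would then be handed to the explanation generator
    as lexical constraints (e.g. enforced with constrained beam search).
    """
    order = torch.argsort(chunk_scores, descending=True)
    return [chunks[i] for i in order[:top_k].tolist()]


if __name__ == "__main__":
    torch.manual_seed(0)
    chunks = ["a man in a red shirt", "riding a horse", "on the beach"]
    chunk_emb = torch.randn(len(chunks), 768)   # stand-in chunk encodings
    region_emb = torch.randn(36, 768)           # stand-in object-detector region features

    aligner = ChunkRegionAttention()
    aligned, attn = aligner(chunk_emb, region_emb)

    # Stand-in for the relation inferrer's per-chunk importance: here we simply
    # take the maximum attention weight a chunk places on any region.
    chunk_scores = attn.max(dim=-1).values
    print(select_lexical_constraints(chunks, chunk_scores, top_k=2))
```

In a full pipeline, the selected chunk strings would be passed to a decoder that supports constrained decoding, so that the explanation is forced to mention the evidence the relation inferrer relied on; the scoring heuristic shown here is only a placeholder for that attention signal.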
Related papers
- Natural Language Inference Improves Compositionality in Vision-Language Models [35.71815423077561]
We present a principled approach that generates entailments and contradictions from a given premise.
CECE produces lexically diverse sentences while maintaining their core meaning.
We achieve significant improvements over previous methods without requiring additional fine-tuning.
arXiv Detail & Related papers (2024-10-29T17:54:17Z)
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on the ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- 3VL: using Trees to teach Vision & Language models compositional concepts [45.718319397947056]
We introduce the Tree-augmented Vision-Language (3VL) model architecture and training technique.
We show how Anchor, a simple technique for text unification, can be employed to filter nuisance factors.
We also exhibit how DiRe, which performs a differential relevancy comparison between VLM maps, enables us to generate compelling visualizations of a model's success or failure.
arXiv Detail & Related papers (2023-12-28T20:26:03Z)
- Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and greatly boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
- Natural Language Decompositions of Implicit Content Enable Better Text Representations [56.85319224208865]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account.
We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed.
Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
- SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for Scene Graph Generation [12.977857322594206]
One-stage scene graph generation approaches infer the effective relation between entity pairs using sparse proposal sets and a few queries.
A Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to add flexible self-reasoning ability to the model.
Inspired by large-scale pre-trained image-text foundation models, visual-linguistic prior knowledge is introduced.
arXiv Detail & Related papers (2022-12-19T09:47:27Z)
- Lexically-constrained Text Generation through Commonsense Knowledge Extraction and Injection [62.071938098215085]
We focus on the CommonGen benchmark, wherein the aim is to generate a plausible sentence for a given set of input concepts.
We propose strategies for enhancing the semantic correctness of the generated text.
arXiv Detail & Related papers (2020-12-19T23:23:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.