Chunk-aware Alignment and Lexical Constraint for Visual Entailment with
Natural Language Explanations
- URL: http://arxiv.org/abs/2207.11401v1
- Date: Sat, 23 Jul 2022 03:19:50 GMT
- Title: Chunk-aware Alignment and Lexical Constraint for Visual Entailment with
Natural Language Explanations
- Authors: Qian Yang and Yunxin Li and Baotian Hu and Lin Ma and Yuxing Ding and
Min Zhang
- Abstract summary: Visual Entailment with natural language explanations aims to infer the relationship between a text-image pair and generate a sentence to explain the decision-making process.
Previous methods rely mainly on a pre-trained vision-language model to perform the relation inference and a language model to generate the corresponding explanation.
We propose a unified Chunk-aware Alignment and Lexical Constraint based method, dubbed CALeC.
- Score: 38.50987889221086
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual Entailment with natural language explanations aims to infer the
relationship between a text-image pair and generate a sentence to explain the
decision-making process. Previous methods rely mainly on a pre-trained
vision-language model to perform the relation inference and a language model to
generate the corresponding explanation. However, the pre-trained
vision-language models mainly build token-level alignment between text and
image yet ignore the high-level semantic alignment between the phrases (chunks)
and visual contents, which is critical for vision-language reasoning. Moreover,
the explanation generator based only on the encoded joint representation does
not explicitly consider the critical decision-making points of relation
inference, so the generated explanations are less faithful to the vision-language
reasoning. To mitigate these problems, we propose a unified Chunk-aware
Alignment and Lexical Constraint based method, dubbed CALeC. It contains a
Chunk-aware Semantic Interactor (abbr. CSI), a relation inferrer, and a Lexical
Constraint-aware Generator (abbr. LeCG). Specifically, CSI exploits the sentence
structure inherent in language, together with the various image regions, to build
chunk-aware semantic alignment. The relation inferrer uses an attention-based
reasoning network to combine the token-level and chunk-level vision-language
representations. LeCG utilizes lexical constraints to explicitly incorporate the
words or chunks attended to by the relation inferrer into explanation generation,
improving the faithfulness and informativeness of the explanations. We conduct
extensive experiments on three datasets, and the results indicate that CALeC
significantly outperforms competing models in both inference accuracy and the
quality of the generated explanations.
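The abstract only names the components, so the following minimal PyTorch sketch illustrates the general idea of chunk-aware alignment followed by relation inference. It is a sketch under assumptions, not the authors' implementation: the class name `ChunkAwareRelationInferrer`, the mean-pooling of tokens into chunks, the single cross-attention layer, and all dimensions are illustrative choices; CALeC's actual CSI and relation inferrer are specified in the paper.

```python
# Illustrative sketch only (not the authors' code). Assumes token features,
# image-region features, and chunk spans come from an upstream vision-language
# encoder and a chunker, which are outside this snippet.
import torch
import torch.nn as nn


class ChunkAwareRelationInferrer(nn.Module):
    """Mean-pools tokens into chunks, aligns chunks with image regions via
    cross-attention, and classifies entailment / neutral / contradiction."""

    def __init__(self, dim: int = 768, num_heads: int = 8, num_labels: int = 3):
        super().__init__()
        self.chunk_to_region_attn = nn.MultiheadAttention(
            dim, num_heads, batch_first=True
        )
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, num_labels),
        )

    def forward(self, token_feats, region_feats, chunk_mask):
        # token_feats:  (B, T, D) token-level text features
        # region_feats: (B, R, D) image-region features
        # chunk_mask:   (B, C, T) binary mask, 1 where token t belongs to chunk c
        denom = chunk_mask.sum(dim=-1, keepdim=True).clamp(min=1.0)
        chunk_feats = chunk_mask @ token_feats / denom          # (B, C, D) chunk features
        aligned, attn = self.chunk_to_region_attn(
            chunk_feats, region_feats, region_feats
        )                                                       # chunk-to-region alignment
        fused = torch.cat([chunk_feats.mean(dim=1), aligned.mean(dim=1)], dim=-1)
        return self.classifier(fused), attn                     # (B, 3) logits, (B, C, R) weights


# Toy usage with random features: 2 samples, 12 tokens, 4 chunks, 36 regions.
model = ChunkAwareRelationInferrer()
tokens = torch.randn(2, 12, 768)
regions = torch.randn(2, 36, 768)
mask = torch.zeros(2, 4, 12)
mask[:, 0, 0:3] = 1
mask[:, 1, 3:6] = 1
mask[:, 2, 6:9] = 1
mask[:, 3, 9:12] = 1
logits, attn = model(tokens, regions, mask)
print(logits.shape, attn.shape)  # torch.Size([2, 3]) torch.Size([2, 4, 36])
```

In the paper, the relation inferrer also incorporates the token-level representations, and the chunks it attends to are what LeCG later turns into lexical constraints; the sketch above covers only chunk-to-region alignment and the three-way entailment/neutral/contradiction decision.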
Related papers
- Causal Graphical Models for Vision-Language Compositional Understanding [36.24185263818946]
We show that our method outperforms all state-of-the-art compositional approaches by a large margin.
It also improves over methods trained using much larger datasets.
arXiv Detail & Related papers (2024-12-12T15:22:03Z)
- Natural Language Inference Improves Compositionality in Vision-Language Models [35.71815423077561]
We present a principled approach that generates entailments and contradictions from a given premise.
CECE produces lexically diverse sentences while maintaining their core meaning.
We achieve significant improvements over previous methods without requiring additional fine-tuning.
arXiv Detail & Related papers (2024-10-29T17:54:17Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- 3VL: Using Trees to Improve Vision-Language Models' Interpretability [40.678288227161936]
Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks.
These representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects.
In this work, we introduce the architecture and training technique of Tree-augmented Vision-Language (3VL) model accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool.
arXiv Detail & Related papers (2023-12-28T20:26:03Z)
- Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
- Natural Language Decompositions of Implicit Content Enable Better Text Representations [56.85319224208865]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account.
We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed.
Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
- SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for Scene Graph Generation [12.977857322594206]
One-stage scene graph generation approaches infer the effective relation between entity pairs using sparse proposal sets and a few queries.
A Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to add flexible self-reasoning ability to the model.
Inspired by the large-scale pre-training image-text foundation models, visual-linguistic prior knowledge is introduced.
arXiv Detail & Related papers (2022-12-19T09:47:27Z)
- Lexically-constrained Text Generation through Commonsense Knowledge Extraction and Injection [62.071938098215085]
We focus on the Commongen benchmark, wherein the aim is to generate a plausible sentence for a given set of input concepts.
We propose strategies for enhancing the semantic correctness of the generated text; a generic constrained-decoding sketch in the same spirit follows this list.
arXiv Detail & Related papers (2020-12-19T23:23:40Z)
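Both LeCG in the abstract above and the lexically-constrained generation entry just listed rest on the same mechanism: forcing selected words to appear in the decoded sentence. The snippet below is a hedged, generic illustration using constrained beam search in Hugging Face Transformers (the `force_words_ids` argument of `generate`); the model `t5-small`, the prompt, and the concept words are arbitrary stand-ins, and neither paper necessarily uses this library or decoding strategy.

```python
# Hedged illustration of lexically constrained decoding, not either paper's
# exact method. Requires transformers >= 4.17 (constrained beam search).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # arbitrary small seq2seq model
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Words that must appear in the output (e.g. chunks attended to by a relation
# inferrer, or CommonGen-style input concepts).
constraints = ["dog", "frisbee", "catch"]
force_words_ids = [
    tokenizer(word, add_special_tokens=False).input_ids for word in constraints
]

inputs = tokenizer(
    "generate a sentence with: dog frisbee catch", return_tensors="pt"
)
outputs = model.generate(
    **inputs,
    force_words_ids=force_words_ids,  # each listed word must be generated
    num_beams=5,                      # constrained decoding requires beam search
    max_new_tokens=32,
    no_repeat_ngram_size=2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Constrained beam search only guarantees that the listed words appear in the output; fluency and relevance still depend on the underlying generator.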
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the generated summaries (or of any other information) and is not responsible for any consequences of their use.