Grounded Image Text Matching with Mismatched Relation Reasoning
- URL: http://arxiv.org/abs/2308.01236v2
- Date: Fri, 4 Aug 2023 17:51:57 GMT
- Title: Grounded Image Text Matching with Mismatched Relation Reasoning
- Authors: Yu Wu, Yana Wei, Haozhe Wang, Yongfei Liu, Sibei Yang, Xuming He
- Abstract summary: Grounded Image Text Matching with Mismatched Relation (GITM-MR) is a novel visual-linguistic joint task.
GITM-MR requires a model to first determine if an expression describes an image, then localize referred objects or ground the mismatched parts of the text.
We propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates relation-aware reasoning via bi-directional message propagation.
- Score: 39.524420144738684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces Grounded Image Text Matching with Mismatched Relation
(GITM-MR), a novel visual-linguistic joint task that evaluates the relation
understanding capabilities of transformer-based pre-trained models. GITM-MR
requires a model to first determine if an expression describes an image, then
localize referred objects or ground the mismatched parts of the text. We
provide a benchmark for evaluating pre-trained models on this task, with a
focus on the challenging settings of limited data and out-of-distribution
sentence lengths. Our evaluation demonstrates that pre-trained models lack data
efficiency and length generalization ability. To address this, we propose the
Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates
relation-aware reasoning via bi-directional message propagation guided by
language structure. RCRN can be interpreted as a modular program and delivers
strong performance in both length generalization and data efficiency.
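The abstract describes RCRN only at a high level. As a rough illustration (a minimal sketch, not the authors' released implementation), the snippet below shows one way "bi-directional message propagation guided by language structure" could be realized: phrase nodes from a parsed referring expression exchange messages along parse edges, first bottom-up and then top-down. The module names, feature dimension, edge list, and GRU-based node update are assumptions made for clarity.

```python
# Minimal sketch, not the authors' released code: relation-aware reasoning via
# bi-directional message propagation over a language-parse graph, in the spirit
# of RCRN as summarized above. All design choices here are illustrative assumptions.
import torch
import torch.nn as nn


class RelationMessagePassing(nn.Module):
    """Propagates phrase-region matching evidence along parse edges,
    first from leaves to root (bottom-up), then back (top-down)."""

    def __init__(self, dim: int):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.update = nn.GRUCell(dim, dim)

    def propagate(self, node_feats, edges):
        # edges: (src, dst) pairs taken from the parsed expression; messages flow src -> dst.
        h = node_feats.clone()
        for src, dst in edges:
            msg = self.edge_mlp(torch.cat([h[src], h[dst]], dim=-1))
            h[dst] = self.update(msg.unsqueeze(0), h[dst].unsqueeze(0)).squeeze(0)
        return h

    def forward(self, node_feats, bottom_up_edges):
        top_down_edges = [(dst, src) for src, dst in reversed(bottom_up_edges)]
        h = self.propagate(node_feats, bottom_up_edges)  # leaves -> root
        return self.propagate(h, top_down_edges)         # root -> leaves


# Toy usage: 4 phrase nodes from a parsed referring expression, 256-d fused features.
if __name__ == "__main__":
    mp = RelationMessagePassing(dim=256)
    feats = torch.randn(4, 256)
    out = mp(feats, bottom_up_edges=[(0, 1), (2, 1), (1, 3)])
    print(out.shape)  # torch.Size([4, 256])
```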
Related papers
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z) - Exploiting Contextual Target Attributes for Target Sentiment Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z) - RDR: the Recap, Deliberate, and Respond Method for Enhanced Language Understanding [6.738409533239947]
The Recap, Deliberate, and Respond (RDR) paradigm addresses the risk of models gaming language understanding benchmarks by incorporating three distinct objectives within the neural network pipeline.
By cascading these three models, we mitigate the potential for gaming the benchmark and establish a robust method for capturing the underlying semantic patterns.
Our results demonstrate improved performance compared to competitive baselines, with an enhancement of up to 2% on standard metrics.
arXiv Detail & Related papers (2023-12-15T16:41:48Z) - Prompt-based Logical Semantics Enhancement for Implicit Discourse Relation Recognition [4.7938839332508945]
We propose a Prompt-based Logical Semantics Enhancement (PLSE) method for Implicit Discourse Relation Recognition (IDRR).
Our method seamlessly injects knowledge relevant to discourse relation into pre-trained language models through prompt-based connective prediction.
Experimental results on the PDTB 2.0 and CoNLL16 datasets demonstrate that our method achieves strong and consistent performance compared to current state-of-the-art models.
arXiv Detail & Related papers (2023-11-01T08:38:08Z) - Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR).
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's ability to express what they are looking for.
arXiv Detail & Related papers (2023-06-12T17:56:01Z) - Enhancing Pre-trained Models with Text Structure Knowledge for Question Generation [2.526624977753083]
We model text structure as answer position and syntactic dependency, and propose answer localness modeling and syntactic mask attention to address pre-trained models' limited use of such structure.
Experiments on SQuAD dataset show that our proposed two modules improve performance over the strong pre-trained model ProphetNet.
arXiv Detail & Related papers (2022-09-09T08:33:47Z) - Improving Distantly Supervised Relation Extraction by Natural Language Inference [9.181270251524866]
We propose a novel DSRE-NLI framework, which considers both distant supervision from existing knowledge bases and indirect supervision from pretrained language models for other tasks.
DSRE-NLI energizes an off-the-shelf natural language inference (NLI) engine with a semi-automatic relation verbalization (SARV) mechanism to provide indirect supervision.
With two simple and effective data consolidation strategies, the quality of training data is substantially improved.
arXiv Detail & Related papers (2022-07-31T02:48:34Z) - Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - Syntax-Enhanced Pre-trained Model [49.1659635460369]
We study the problem of leveraging the syntactic structure of text to enhance pre-trained models such as BERT and RoBERTa.
Existing methods utilize syntax of text either in the pre-training stage or in the fine-tuning stage, so that they suffer from discrepancy between the two stages.
We present a model that utilizes the syntax of text in both pre-training and fine-tuning stages.
arXiv Detail & Related papers (2020-12-28T06:48:04Z) - GPT-too: A language-model-first approach for AMR-to-text generation [22.65728041544785]
We propose an approach that combines a strong pre-trained language model with cycle consistency-based re-scoring.
Despite the simplicity of the approach, our experimental results show these models outperform all previous techniques.
arXiv Detail & Related papers (2020-05-18T22:50:26Z)
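The GPT-too entry above mentions cycle consistency-based re-scoring without further detail. One plausible reading, sketched below, is to generate several candidate sentences from the input AMR graph, parse each candidate back to AMR, and keep the sentence whose re-parsed graph best matches the source. The helpers `generate_candidates`, `parse_to_amr`, and `amr_match` are hypothetical stand-ins (e.g., a beam of language-model generations, an off-the-shelf AMR parser, and a Smatch-style score), not a real library API.

```python
# Illustrative sketch only: cycle consistency-based re-scoring for AMR-to-text,
# under the assumptions stated in the lead-in. No real parser or LM API is used.
from typing import Callable, Sequence


def rescore_by_cycle_consistency(
    amr_graph: str,
    generate_candidates: Callable[[str], Sequence[str]],
    parse_to_amr: Callable[[str], str],
    amr_match: Callable[[str, str], float],
) -> str:
    """Return the candidate sentence whose re-parsed AMR best matches the input graph."""
    candidates = generate_candidates(amr_graph)
    scored = [(amr_match(amr_graph, parse_to_amr(text)), text) for text in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```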