How to Understand "Support"? An Implicit-enhanced Causal Inference
Approach for Weakly-supervised Phrase Grounding
- URL: http://arxiv.org/abs/2402.19116v2
- Date: Mon, 4 Mar 2024 08:42:56 GMT
- Title: How to Understand "Support"? An Implicit-enhanced Causal Inference
Approach for Weakly-supervised Phrase Grounding
- Authors: Jiamin Luo, Jianing Zhao, Jingjing Wang, Guodong Zhou
- Abstract summary: Weakly-supervised Phrase Grounding (WPG) is an emerging task of inferring the fine-grained phrase-region matching.
This paper proposes an Implicit-Enhanced Causal Inference approach to address the challenges of modeling the implicit relations.
- Score: 18.97081348819219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised Phrase Grounding (WPG) is an emerging task of inferring the
fine-grained phrase-region matching, while merely leveraging the coarse-grained
sentence-image pairs for training. However, existing studies on WPG largely
ignore the implicit phrase-region matching relations, which are crucial for
evaluating the capability of models in understanding the deep multimodal
semantics. To this end, this paper proposes an Implicit-Enhanced Causal
Inference (IECI) approach to address the challenges of modeling the implicit
relations and highlighting them beyond the explicit. Specifically, this
approach leverages both the intervention and counterfactual techniques to
tackle the above two challenges respectively. Furthermore, a high-quality
implicit-enhanced dataset is annotated to evaluate IECI and detailed
evaluations show the great advantages of IECI over the state-of-the-art
baselines. Particularly, we observe an interesting finding that IECI
outperforms the advanced multimodal LLMs by a large margin on this
implicit-enhanced dataset, which may facilitate more research to evaluate the
multimodal LLMs in this direction.
Related papers
- Cross-target Stance Detection by Exploiting Target Analytical
Perspectives [22.320628580895164]
Cross-target stance detection (CTSD) is an important task, which infers the attitude of the destination target by utilizing annotated data from the source target.
One important approach in CTSD is to extract domain-invariant features to bridge the knowledge gap between multiple targets.
We propose a Multi-Perspective Prompt-Tuning (MPPT) model for CTSD that uses the analysis perspective as a bridge to transfer knowledge.
arXiv Detail & Related papers (2024-01-03T14:28:55Z) - Igniting Language Intelligence: The Hitchhiker's Guide From
Chain-of-Thought Reasoning to Language Agents [80.5213198675411]
Large language models (LLMs) have dramatically enhanced the field of language intelligence.
LLMs leverage the intriguing chain-of-thought (CoT) reasoning techniques, obliging them to formulate intermediate steps en route to deriving an answer.
Recent research endeavors have extended CoT reasoning methodologies to nurture the development of autonomous language agents.
arXiv Detail & Related papers (2023-11-20T14:30:55Z) - Prompt-based Logical Semantics Enhancement for Implicit Discourse
Relation Recognition [4.7938839332508945]
We propose a Prompt-based Logical Semantics Enhancement (PLSE) method for Implicit Discourse Relation Recognition (IDRR)
Our method seamlessly injects knowledge relevant to discourse relation into pre-trained language models through prompt-based connective prediction.
Experimental results on PDTB 2.0 and CoNLL16 datasets demonstrate that our method achieves outstanding and consistent performance against the current state-of-the-art models.
arXiv Detail & Related papers (2023-11-01T08:38:08Z) - Topic-driven Distant Supervision Framework for Macro-level Discourse
Parsing [72.14449502499535]
The task of analyzing the internal rhetorical structure of texts is a challenging problem in natural language processing.
Despite the recent advances in neural models, the lack of large-scale, high-quality corpora for training remains a major obstacle.
Recent studies have attempted to overcome this limitation by using distant supervision.
arXiv Detail & Related papers (2023-05-23T07:13:51Z) - Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual
Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z) - SAIS: Supervising and Augmenting Intermediate Steps for Document-Level
Relation Extraction [51.27558374091491]
We propose to explicitly teach the model to capture relevant contexts and entity types by supervising and augmenting intermediate steps (SAIS) for relation extraction.
Based on a broad spectrum of carefully designed tasks, our proposed SAIS method not only extracts relations of better quality due to more effective supervision, but also retrieves the corresponding supporting evidence more accurately.
arXiv Detail & Related papers (2021-09-24T17:37:35Z) - Deep Context- and Relation-Aware Learning for Aspect-based Sentiment
Analysis [3.7175198778996483]
We propose Deep Contextualized Relation-Aware Network (DCRAN), which allows interactive relations among subtasks with deep contextual information.
DCRAN significantly outperforms previous state-of-the-art methods by large margins on three widely used benchmarks.
arXiv Detail & Related papers (2021-06-07T17:16:15Z) - Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z) - Dive into Ambiguity: Latent Distribution Mining and Pairwise Uncertainty
Estimation for Facial Expression Recognition [59.52434325897716]
We propose a solution, named DMUE, to address the problem of annotation ambiguity from two perspectives.
For the former, an auxiliary multi-branch learning framework is introduced to better mine and describe the latent distribution in the label space.
For the latter, the pairwise relationship of semantic feature between instances are fully exploited to estimate the ambiguity extent in the instance space.
arXiv Detail & Related papers (2021-04-01T03:21:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.