Learning to Agree on Vision Attention for Visual Commonsense Reasoning
- URL: http://arxiv.org/abs/2302.02117v1
- Date: Sat, 4 Feb 2023 07:02:29 GMT
- Title: Learning to Agree on Vision Attention for Visual Commonsense Reasoning
- Authors: Zhenyang Li, Yangyang Guo, Fan Liu, Liqiang Nie, Mohan Kankanhalli
- Abstract summary: A VCR model aims at answering a question regarding an image, followed by the rationale prediction for the preceding answering process.
Existing methods ignore the pivotal relationship between the two processes, leading to sub-optimal model performance.
This paper presents a novel visual attention alignment method to efficaciously handle these two processes in a unified framework.
- Score: 50.904275811951614
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Commonsense Reasoning (VCR) remains a significant yet challenging
research problem in the realm of visual reasoning. A VCR model generally aims
at answering a textual question regarding an image, followed by the rationale
prediction for the preceding answering process. Though these two processes are
sequential and intertwined, existing methods always consider them as two
independent matching-based instances. They, therefore, ignore the pivotal
relationship between the two processes, leading to sub-optimal model
performance. This paper presents a novel visual attention alignment method to
efficaciously handle these two processes in a unified framework. To achieve
this, we first design a re-attention module for aggregating the vision
attention map produced in each process. Thereafter, the resultant two sets of
attention maps are carefully aligned to guide the two processes to make
decisions based on the same image regions. We apply this method to both
conventional attention and the recent Transformer models and carry out
extensive experiments on the VCR benchmark dataset. The results demonstrate
that with the attention alignment module, our method achieves a considerable
improvement over the baseline methods, evidently revealing the feasibility of
the coupling of the two processes as well as the effectiveness of the proposed
method.
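As a rough illustration of the alignment idea described above, the sketch below assumes a simple soft attention over region features for each process (answering and rationale inference) and a symmetric KL term that pulls the two attention distributions toward the same image regions. The paper's actual re-attention module and alignment objective are not reproduced here; all function and variable names are hypothetical.

```python
# Minimal sketch only: assumes soft attention over region features and a
# symmetric KL alignment term; the paper's re-attention module may differ.
import torch
import torch.nn.functional as F


def region_attention(query: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
    """Soft attention distribution over image regions.

    query:   (batch, dim)            pooled text representation (Q->A or QA->R)
    regions: (batch, n_regions, dim) visual region features
    returns: (batch, n_regions)      attention weights summing to 1
    """
    scores = torch.einsum("bd,bnd->bn", query, regions) / regions.size(-1) ** 0.5
    return F.softmax(scores, dim=-1)


def attention_alignment_loss(attn_qa: torch.Tensor, attn_qar: torch.Tensor) -> torch.Tensor:
    """Encourage the answering and rationale processes to attend to the same
    regions, here via a symmetric KL divergence (an assumed stand-in for the
    paper's alignment objective)."""
    kl_1 = F.kl_div(attn_qa.clamp_min(1e-8).log(), attn_qar, reduction="batchmean")
    kl_2 = F.kl_div(attn_qar.clamp_min(1e-8).log(), attn_qa, reduction="batchmean")
    return 0.5 * (kl_1 + kl_2)


# Hypothetical usage: attn_qa from the question-answering branch, attn_qar from
# the rationale branch; the alignment term is added to the two matching losses.
# total_loss = loss_qa + loss_qar + lambda_align * attention_alignment_loss(attn_qa, attn_qar)
```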
Related papers
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Single Stage Virtual Try-on via Deformable Attention Flows [51.70606454288168]
Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image.
We develop a novel Deformable Attention Flow (DAFlow) which applies the deformable attention scheme to multi-flow estimation.
Our proposed method achieves state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-07-19T10:01:31Z)
- Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations is cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z)
- S2-Net: Self-supervision Guided Feature Representation Learning for Cross-Modality Images [0.0]
Cross-modality image pairs often fail to make the feature representations of correspondences as close as possible.
In this letter, we design a cross-modality feature representation learning network, S2-Net, which is based on the recently successful detect-and-describe pipeline.
We introduce self-supervised learning with a well-designed loss function to guide the training without discarding the original advantages.
arXiv Detail & Related papers (2022-03-28T08:47:49Z)
- Joint Answering and Explanation for Visual Commonsense Reasoning [46.44588492897933]
Visual Commonsense Reasoning endeavors to pursue a more high-level visual comprehension.
It is composed of two indispensable processes: question answering over a given image and rationale inference for answer explanation.
We present a plug-and-play knowledge distillation enhanced framework to couple the question answering and rationale inference processes.
arXiv Detail & Related papers (2022-02-25T11:26:52Z)
- Light Field Saliency Detection with Dual Local Graph Learning and Reciprocative Guidance [148.9832328803202]
We model the information fusion within the focal stack via graph networks.
We build a novel dual graph model to guide the focal stack fusion process using all-focus patterns.
arXiv Detail & Related papers (2021-10-02T00:54:39Z)
- Learning Gaussian Graphical Models with Latent Confounders [74.72998362041088]
We compare and contrast two strategies for inference in graphical models with latent confounders.
While these two approaches have similar goals, they are motivated by different assumptions about confounding.
We propose a new method, which combines the strengths of these two approaches.
arXiv Detail & Related papers (2021-05-14T00:53:03Z)