Visual Dependency Transformers: Dependency Tree Emerges from Reversed
Attention
- URL: http://arxiv.org/abs/2304.03282v1
- Date: Thu, 6 Apr 2023 17:59:26 GMT
- Title: Visual Dependency Transformers: Dependency Tree Emerges from Reversed
Attention
- Authors: Mingyu Ding, Yikang Shen, Lijie Fan, Zhenfang Chen, Zitian Chen, Ping
Luo, Joshua B. Tenenbaum, Chuang Gan
- Abstract summary: We propose Visual Dependency Transformers (DependencyViT) that can induce visual dependencies without any labels.
We formulate it as a dependency graph where a child token in reversed attention is trained to attend to its parent tokens and send information.
DependencyViT works well on both self- and weakly-supervised pretraining paradigms on ImageNet.
- Score: 106.67741967871969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans possess a versatile mechanism for extracting structured
representations of our visual world. When looking at an image, we can decompose
the scene into entities and their parts as well as obtain the dependencies
between them. To mimic such capability, we propose Visual Dependency
Transformers (DependencyViT) that can induce visual dependencies without any
labels. We achieve that with a novel neural operator called \emph{reversed
attention} that can naturally capture long-range visual dependencies between
image patches. Specifically, we formulate it as a dependency graph where a
child token in reversed attention is trained to attend to its parent tokens and
send information following a normalized probability distribution, rather than
gathering information as in conventional self-attention. With such a design,
hierarchies naturally emerge from the reversed attention layers, and a dependency
tree is progressively induced from the leaf nodes to the root node in an unsupervised manner.
DependencyViT offers several appealing benefits. (i) Entities and their parts
in an image are represented by different subtrees, enabling part partitioning
from dependencies; (ii) Dynamic visual pooling is made possible. The leaf nodes
that rarely send messages can be pruned without hurting model
performance; based on this observation, we propose the lightweight DependencyViT-Lite to
reduce the computational and memory footprints; (iii) DependencyViT works well
on both self- and weakly-supervised pretraining paradigms on ImageNet, and
demonstrates its effectiveness on 8 datasets and 5 tasks, such as unsupervised
part and saliency segmentation, recognition, and detection.
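To make the mechanism concrete, here is a minimal sketch of the reversed-attention idea as described in the abstract: each child token forms a normalized distribution over candidate parent tokens and sends (rather than gathers) its value along that distribution. The function names, the `w_m` send gate, and the pruning criterion are illustrative assumptions; the full DependencyViT contains additional components that are not reproduced here.

```python
# A minimal sketch of the reversed-attention idea from the abstract.
# All names are illustrative; the full DependencyViT has additional
# components (e.g., per-head routing) that are not reproduced here.
import torch
import torch.nn.functional as F

def reversed_attention(x, w_q, w_k, w_v, w_m):
    """x: (n_tokens, d_model) patch embeddings.

    Each child token forms a normalized distribution over candidate parent
    tokens and *sends* its value along that distribution, instead of
    gathering values as in conventional self-attention.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    # Row i = child i's probability distribution over candidate parent tokens.
    child_to_parent = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    # A per-token gate scaling how strongly each child sends messages
    # (a stand-in for a learned message controller; an assumption here).
    send_gate = torch.sigmoid(x @ w_m)                 # (n_tokens, 1)
    # Transposing routes messages child -> parent: parent j aggregates the
    # values of the children that selected it, weighted by their probabilities.
    parent_out = (child_to_parent * send_gate).transpose(-2, -1) @ v
    # Children whose gate stays small rarely send messages; pruning such leaf
    # tokens is the intuition behind the lightweight DependencyViT-Lite.
    return parent_out, send_gate.squeeze(-1)

# Toy usage with random patch embeddings.
n_tokens, d_model = 8, 16
x = torch.randn(n_tokens, d_model)
w_q, w_k, w_v, w_m = (torch.randn(d_model, c) * d_model ** -0.5
                      for c in (d_model, d_model, d_model, 1))
out, send_strength = reversed_attention(x, w_q, w_k, w_v, w_m)
print(out.shape, send_strength.shape)  # torch.Size([8, 16]) torch.Size([8])
```

Reading off each token's highest-probability parent from `child_to_parent` yields a tree over image patches, which matches the abstract's description of a dependency tree being induced from leaf nodes to the root.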
Related papers
- VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning [22.0082111649259]
Correspondence pruning aims to find correct matches (inliers) from an initial set of putative correspondences.
We propose a Visual-Spatial Fusion Transformer (VSFormer) to identify inliers and recover camera poses accurately.
arXiv Detail & Related papers (2023-12-14T09:50:09Z)
- Pose-Aided Video-based Person Re-Identification via Recurrent Graph Convolutional Network [41.861537712563816]
We propose to learn the discriminative pose feature beyond the appearance feature for video retrieval.
To learn the pose feature, we first detect the pedestrian pose in each frame through an off-the-shelf pose detector.
We then exploit a recurrent graph convolutional network (RGCN) to learn the node embeddings of the temporal pose graph.
arXiv Detail & Related papers (2022-09-23T13:20:33Z)
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces this computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- Graph Reasoning Transformer for Image Parsing [67.76633142645284]
We propose a novel Graph Reasoning Transformer (GReaT) for image parsing to enable image patches to interact following a relation reasoning pattern.
Compared to the conventional transformer, GReaT has higher interaction efficiency and a more purposeful interaction pattern.
Results show that GReaT achieves consistent performance gains over state-of-the-art transformer baselines with only slight computational overhead.
arXiv Detail & Related papers (2022-09-20T08:21:37Z)
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction [88.6585431949086]
We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive prediction.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-05-07T02:10:55Z)
- One-shot Scene Graph Generation [130.57405850346836]
We propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task.
Our method significantly outperforms existing state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-02-22T11:32:59Z)