Related papers: Auto-Parsing Network for Image Captioning and Visual Question Answering

URL: http://arxiv.org/abs/2108.10568v1
Date: Tue, 24 Aug 2021 08:14:35 GMT
Title: Auto-Parsing Network for Image Captioning and Visual Question Answering
Authors: Xu Yang and Chongyang Gao and Hanwang Zhang and Jianfei Cai
Abstract summary: We propose an Auto-Parsing Network (APN) to discover and exploit the input data's hidden tree structures. Specifically, we impose a Probabilistic Graphical Model (PGM) parameterized by the attention operations on each self-attention layer to incorporate a sparsity assumption.
Score: 101.77688388554097
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose an Auto-Parsing Network (APN) to discover and exploit the input data's hidden tree structures for improving the effectiveness of Transformer-based vision-language systems. Specifically, we impose a Probabilistic Graphical Model (PGM) parameterized by the attention operations on each self-attention layer to incorporate a sparsity assumption. We use this PGM to softly segment an input sequence into a few clusters, where each cluster can be treated as the parent of the entities inside it. By stacking these PGM-constrained self-attention layers, the clusters in a lower layer compose into a new sequence, and the PGM in a higher layer further segments this sequence. Iteratively, a sparse tree can be implicitly parsed, and this tree's hierarchical knowledge is incorporated into the transformed embeddings, which can be used for solving the target vision-language tasks. Specifically, we showcase that our APN can strengthen Transformer-based networks in two major vision-language tasks: Captioning and Visual Question Answering. Also, a PGM probability-based parsing algorithm is developed, by which we can discover the hidden structure of the input during inference.
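Illustration: the abstract describes softly segmenting a sequence into clusters by constraining self-attention. The sketch below shows one generic way such a constraint could look, and it is not taken from the paper. It assumes a hypothetical vector `link`, where `link[k]` is the probability that adjacent tokens k and k+1 belong to the same cluster, and it assumes that two tokens share a cluster with probability equal to the product of the links between them. The function names `same_cluster_prob` and `constrained_attention` are invented for this example.

```python
import numpy as np


def same_cluster_prob(link):
    """Return C where C[i, j] is the product of link[k] over all k between i and j.

    link[k] is an ASSUMED probability that tokens k and k+1 share a cluster.
    """
    link = np.clip(np.asarray(link, dtype=float), 1e-9, 1.0)
    # Cumulative log-probabilities turn a product over any span into a subtraction.
    cum = np.concatenate([[0.0], np.cumsum(np.log(link))])
    return np.exp(-np.abs(cum[:, None] - cum[None, :]))


def constrained_attention(scores, link):
    """Softmax attention reweighted by the same-cluster probabilities (a sketch)."""
    scores = np.asarray(scores, dtype=float)
    # Numerically stable row-wise softmax.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Down-weight attention that crosses a likely cluster boundary, then renormalize.
    constrained = weights * same_cluster_prob(link)
    return constrained / constrained.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    # Four tokens with a weak link between tokens 1 and 2, i.e. two soft clusters.
    link = [0.9, 0.1, 0.9]
    attn = constrained_attention(np.zeros((4, 4)), link)
    print(np.round(attn, 3))
```

With uniform scores, each token in this toy input attends mostly to tokens on its own side of the weak link, which is the soft-segmentation behavior the abstract refers to. How APN actually parameterizes and trains its PGM is described in the paper itself.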

Related papers

Hierarchical Graph Interaction Transformer with Dynamic Token Clustering for Camouflaged Object Detection [57.883265488038134]
We propose a hierarchical graph interaction network termed HGINet for camouflaged object detection. The network is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features. Our experiments demonstrate the superior performance of HGINet compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2024-08-27T12:53:25Z)
DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency [12.881617910150688]
We propose a transformer framework for self-supervised learning called DenseDINO to learn dense visual representations. Specifically, DenseDINO introduces some extra input tokens called reference tokens to match the point-level features with the position prior. Compared with the vanilla DINO, our approach obtains competitive performance when evaluated on classification in ImageNet.
arXiv Detail & Related papers (2023-06-06T15:04:45Z)
Adaptively Clustering Neighbor Elements for Image-Text Generation [78.82346492527425]
We propose a novel Transformer-based image-to-text generation model termed ACF. ACF adaptively clusters vision patches into object regions and language words into phrases to implicitly learn object-phrase alignments. Experimental results demonstrate the effectiveness of ACF, which outperforms most SOTA captioning and VQA models.
arXiv Detail & Related papers (2023-01-05T08:37:36Z)
Exploring and Exploiting Multi-Granularity Representations for Machine Reading Comprehension [13.191437539419681]
We propose a novel approach called Adaptive Bidirectional Attention-Capsule Network (ABA-Net). ABA-Net adaptively exploits the source representations of different levels for the predictor. We set a new state-of-the-art performance on the SQuAD 1.0 dataset.
arXiv Detail & Related papers (2022-08-18T10:14:32Z)
Graph Ordering Attention Networks [22.468776559433614]
Graph Neural Networks (GNNs) have been successfully used in many problems involving graph-structured data. We introduce the Graph Ordering Attention (GOAT) layer, a novel GNN component that captures interactions between nodes in a neighborhood. The GOAT layer demonstrates improved performance in modeling graph metrics that capture complex information.
arXiv Detail & Related papers (2022-04-11T18:13:19Z)
Augmenting Convolutional networks with attention-based aggregation [55.97184767391253]
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We plug this learned aggregation layer with a simplistic patch-based convolutional network parametrized by 2 parameters (width and depth). It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption.
arXiv Detail & Related papers (2021-12-27T14:05:41Z)
An evidential classifier based on Dempster-Shafer theory and deep learning [6.230751621285322]
We propose a new classification system based on Dempster-Shafer (DS) theory and a convolutional neural network (CNN) architecture for set-valued classification. Experiments on image recognition, signal processing, and semantic-relationship classification tasks demonstrate that the proposed combination of deep CNN, DS layer, and expected utility layer makes it possible to improve classification accuracy.
arXiv Detail & Related papers (2021-03-25T01:29:05Z)
Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences. The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer. Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)
DHP: Differentiable Meta Pruning via HyperNetworks [158.69345612783198]
This paper introduces a differentiable pruning method via hypernetworks for automatic network pruning. Latent vectors control the output channels of the convolutional layers in the backbone network and act as a handle for the pruning of the layers. Experiments are conducted on various networks for image classification, single image super-resolution, and denoising.
arXiv Detail & Related papers (2020-03-30T17:59:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.