Efficient Representation Learning via Adaptive Context Pooling
- URL: http://arxiv.org/abs/2207.01844v1
- Date: Tue, 5 Jul 2022 07:10:31 GMT
- Title: Efficient Representation Learning via Adaptive Context Pooling
- Authors: Chen Huang, Walter Talbott, Navdeep Jaitly, Josh Susskind
- Abstract summary: Self-attention mechanisms assume a fixed attention granularity defined by the individual tokens, which may not be optimal for modeling complex dependencies at higher levels.
We propose ContextPool to address this problem by adapting the attention granularity for each token.
We show that ContextPool makes attention models more expressive, achieving strong performance often with fewer layers and thus significantly reduced cost.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention mechanisms model long-range context by using pairwise
attention between all input tokens. In doing so, they assume a fixed attention
granularity defined by the individual tokens (e.g., text characters or image
pixels), which may not be optimal for modeling complex dependencies at higher
levels. In this paper, we propose ContextPool to address this problem by
adapting the attention granularity for each token. Inspired by the success of
ConvNets that are combined with pooling to capture long-range dependencies, we
learn to pool neighboring features for each token before computing attention in
a given attention layer. The pooling weights and support size are adaptively
determined, allowing the pooled features to encode meaningful context with
varying scale. We show that ContextPool makes attention models more expressive,
achieving strong performance often with fewer layers and thus significantly
reduced cost. Experiments validate that our ContextPool module, when plugged
into transformer models, matches or surpasses state-of-the-art performance
using less compute on several language and image benchmarks, outperforms recent
works with learned context sizes or sparse attention patterns, and is also
applicable to ConvNets for efficient feature learning.
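The per-token pooling described in the abstract can be illustrated roughly as follows. This is a minimal sketch, not the authors' implementation: the Gaussian-shaped pooling window, the use of its width as the "support size", and the fact that sizes are given directly rather than predicted by a small network are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_pool(features, sizes):
    """Pool neighboring features for each token with an adaptive window.

    features: (T, D) token features
    sizes:    (T,) per-token support sizes (here: width of a Gaussian window)
    returns:  (T, D) pooled features encoding context at varying scale
    """
    T, _ = features.shape
    positions = np.arange(T)
    # (T, T) signed distance of every token j from query token i.
    dist = positions[None, :] - positions[:, None]
    # Gaussian pooling weights with a per-token width, normalized over j,
    # so each pooled feature is a convex combination of its neighbors.
    logits = -0.5 * (dist / sizes[:, None]) ** 2
    weights = softmax(logits, axis=1)
    return weights @ features

def attention(q, k, v):
    # Standard scaled dot-product attention.
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ k.T * scale, axis=-1) @ v

# ContextPool is applied before attention in a given layer: pool, then attend.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))            # 8 tokens, 16 dims
sizes = rng.uniform(0.5, 3.0, size=8)   # in the paper these are learned per token
pooled = context_pool(x, sizes)
out = attention(pooled, pooled, pooled)
print(out.shape)  # (8, 16)
```

In the actual method both the pooling weights and the support size are predicted per token, so tokens in homogeneous regions can pool broadly while tokens near boundaries pool narrowly.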
Related papers
- Core Context Aware Attention for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling.
Our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
arXiv Detail & Related papers (2024-12-17T01:54:08Z)
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention, which enables Llama2 70B to support context windows of more than 100k tokens without continual training.
By decomposing the attention for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens.
arXiv Detail & Related papers (2024-02-27T12:39:23Z)
- Gramian Attention Heads are Strong yet Efficient Vision Learners [26.79263390835444]
We introduce a novel architecture design that enhances expressiveness by incorporating multiple head classifiers (i.e., classification heads).
Our approach employs attention-based aggregation, utilizing pairwise feature similarity to enhance multiple lightweight heads with minimal resource overhead.
Our models eventually surpass state-of-the-art CNNs and ViTs regarding the accuracy-efficiency trade-off on ImageNet-1K.
arXiv Detail & Related papers (2023-10-25T09:08:58Z)
- TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval [30.429340065755436]
We devise a new model-agnostic formulation for fine-grained cross-modal alignment.
Inspired by optimal transport theory, we introduce TokenFlow, an instantiation of the proposed scheme.
arXiv Detail & Related papers (2022-09-28T04:11:05Z)
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network, that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling [82.08631594071656]
Pooling layers are essential building blocks of Convolutional Neural Networks (CNNs).
We propose an adaptive and exponentially weighted pooling method named adaPool.
We demonstrate how adaPool improves the preservation of detail through a range of tasks including image and video classification and object detection.
arXiv Detail & Related papers (2021-11-01T08:50:37Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.