Efficient Representation Learning via Adaptive Context Pooling
- URL: http://arxiv.org/abs/2207.01844v1
- Date: Tue, 5 Jul 2022 07:10:31 GMT
- Title: Efficient Representation Learning via Adaptive Context Pooling
- Authors: Chen Huang, Walter Talbott, Navdeep Jaitly, Josh Susskind
- Abstract summary: Self-attention mechanisms assume a fixed attention granularity defined by the individual tokens, which may not be optimal for modeling complex dependencies at higher levels.
We propose ContextPool to address this problem by adapting the attention granularity for each token.
We show that ContextPool makes attention models more expressive, achieving strong performance often with fewer layers and thus significantly reduced cost.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention mechanisms model long-range context by using pairwise
attention between all input tokens. In doing so, they assume a fixed attention
granularity defined by the individual tokens (e.g., text characters or image
pixels), which may not be optimal for modeling complex dependencies at higher
levels. In this paper, we propose ContextPool to address this problem by
adapting the attention granularity for each token. Inspired by the success of
ConvNets that are combined with pooling to capture long-range dependencies, we
learn to pool neighboring features for each token before computing attention in
a given attention layer. The pooling weights and support size are adaptively
determined, allowing the pooled features to encode meaningful context with
varying scale. We show that ContextPool makes attention models more expressive,
achieving strong performance often with fewer layers and thus significantly
reduced cost. Experiments validate that our ContextPool module, when plugged
into transformer models, matches or surpasses state-of-the-art performance
using less compute on several language and image benchmarks, outperforms recent
works with learned context sizes or sparse attention patterns, and is also
applicable to ConvNets for efficient feature learning.
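The per-token pooling described in the abstract can be illustrated roughly as follows. This is a minimal sketch, not the authors' implementation: the Gaussian-shaped pooling window, the use of its width as the "support size", and the fact that sizes are given directly rather than predicted by a small network are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_pool(features, sizes):
    """Pool neighboring features for each token with an adaptive window.

    features: (T, D) token features
    sizes:    (T,) per-token support sizes (here: width of a Gaussian window)
    returns:  (T, D) pooled features encoding context at varying scale
    """
    T, _ = features.shape
    positions = np.arange(T)
    # (T, T) signed distance of every token j from query token i.
    dist = positions[None, :] - positions[:, None]
    # Gaussian pooling weights with a per-token width, normalized over j,
    # so each pooled feature is a convex combination of its neighbors.
    logits = -0.5 * (dist / sizes[:, None]) ** 2
    weights = softmax(logits, axis=1)
    return weights @ features

def attention(q, k, v):
    # Standard scaled dot-product attention.
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ k.T * scale, axis=-1) @ v

# ContextPool is applied before attention in a given layer: pool, then attend.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))            # 8 tokens, 16 dims
sizes = rng.uniform(0.5, 3.0, size=8)   # in the paper these are learned per token
pooled = context_pool(x, sizes)
out = attention(pooled, pooled, pooled)
print(out.shape)  # (8, 16)
```

In the actual method both the pooling weights and the support size are predicted per token, so tokens in homogeneous regions can pool broadly while tokens near boundaries pool narrowly.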
Related papers
- Core Context Aware Attention for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling.
Our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
arXiv Detail & Related papers (2024-12-17T01:54:08Z)
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention, which enables Llama2 70B to support context windows of more than 100k tokens without continual training.
By decomposing the attention for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens.
arXiv Detail & Related papers (2024-02-27T12:39:23Z)
- Gramian Attention Heads are Strong yet Efficient Vision Learners [26.79263390835444]
We introduce a novel architecture design that enhances expressiveness by incorporating multiple head classifiers (i.e., classification heads).
Our approach employs attention-based aggregation, utilizing pairwise feature similarity to enhance multiple lightweight heads with minimal resource overhead.
Our models eventually surpass state-of-the-art CNNs and ViTs regarding the accuracy-efficiency trade-off on ImageNet-1K.
arXiv Detail & Related papers (2023-10-25T09:08:58Z)
- TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval [30.429340065755436]
We devise a new model-agnostic formulation for fine-grained cross-modal alignment.
Inspired by optimal transport theory, we introduce TokenFlow, an instantiation of the proposed scheme.
arXiv Detail & Related papers (2022-09-28T04:11:05Z)
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network, that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling [82.08631594071656]
Pooling layers are essential building blocks of Convolutional Neural Networks (CNNs).
We propose an adaptive and exponentially weighted pooling method named adaPool.
We demonstrate how adaPool improves the preservation of detail through a range of tasks including image and video classification and object detection.
arXiv Detail & Related papers (2021-11-01T08:50:37Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.