Related papers: Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models

Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models

URL: http://arxiv.org/abs/2510.03339v1
Date: Thu, 02 Oct 2025 11:17:24 GMT
Title: Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models
Authors: Sofiane Ennadir, Levente Zólyomi, Oleg Smirnov, Tianze Wang, John Pertoft, Filip Cornell, Lele Cao,
Abstract summary: We introduce a theoretical framework that characterizes the expressivity of Transformer-based models equipped with widely used pooling methods.<n>We empirically evaluate pooling strategies across tasks requiring both global and local contextual understanding.<n>Our findings unify theoretical and empirical perspectives, providing practical guidance for selecting or designing pooling mechanisms suited to specific tasks.
Score: 7.244206185339429
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformer models have become the dominant backbone for sequence modeling, leveraging self-attention to produce contextualized token representations. These are typically aggregated into fixed-size vectors via pooling operations for downstream tasks. While much of the literature has focused on attention mechanisms, the role of pooling remains underexplored despite its critical impact on model behavior. In this paper, we introduce a theoretical framework that rigorously characterizes the expressivity of Transformer-based models equipped with widely used pooling methods by deriving closed-form bounds on their representational capacity and the ability to distinguish similar inputs. Our analysis extends to different variations of attention formulations, demonstrating that these bounds hold across diverse architectural variants. We empirically evaluate pooling strategies across tasks requiring both global and local contextual understanding, spanning three major modalities: computer vision, natural language processing, and time-series analysis. Results reveal consistent trends in how pooling choices affect accuracy, sensitivity, and optimization behavior. Our findings unify theoretical and empirical perspectives, providing practical guidance for selecting or designing pooling mechanisms suited to specific tasks. This work positions pooling as a key architectural component in Transformer models and lays the foundation for more principled model design beyond attention alone.

Related papers

Be Wary of Your Time Series Preprocessing [8.040528928994556]
We present the first formal analysis of how different normalization strategies, specifically instance-based and global scaling, impact the expressivity of Transformer-based architectures.<n>We derive theoretical bounds for two widely used normalization methods: Standard and Min-Max scaling.<n>Our results show that no single normalization method consistently outperforms others, and in some cases, omitting normalization entirely leads to superior performance.
arXiv Detail & Related papers (2026-02-19T17:23:56Z)
An Integrated Fusion Framework for Ensemble Learning Leveraging Gradient Boosting and Fuzzy Rule-Based Models [59.13182819190547]
Fuzzy rule-based models excel in interpretability and have seen widespread application across diverse fields.<n>They face challenges such as complex design specifications and scalability issues with large datasets.<n>This paper proposes an Integrated Fusion Framework that merges the strengths of both paradigms to enhance model performance and interpretability.
arXiv Detail & Related papers (2025-11-11T10:28:23Z)
Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction [55.914891182214475]
We introduce neural network reprogrammability as a unifying framework for model adaptation.<n>We present a taxonomy that categorizes such information manipulation approaches across four key dimensions.<n>We also analyze remaining technical challenges and ethical considerations.
arXiv Detail & Related papers (2025-06-05T05:42:27Z)
The Coverage Principle: A Framework for Understanding Compositional Generalization [31.762330857169914]
We show that models relying primarily on pattern matching for compositional tasks cannot reliably generalize beyond substituting fragments that yield identical results when used in the same contexts.<n>We demonstrate that this framework has a strong predictive power for the generalization capabilities of Transformers.
arXiv Detail & Related papers (2025-05-26T17:55:15Z)
Comparative Analysis of Pooling Mechanisms in LLMs: A Sentiment Analysis Perspective [0.0]
Transformer-based models like BERT and GPT rely on pooling layers to aggregate token-level embeddings into sentence-level representations.<n>Common pooling mechanisms such as Mean, Max, and Weighted Sum play a pivotal role in this aggregation process.<n>This paper investigates the effects of these pooling mechanisms on two prominent LLM families -- BERT and GPT, in the context of sentence-level sentiment analysis.
arXiv Detail & Related papers (2024-11-22T00:59:25Z)
Interpreting token compositionality in LLMs: A robustness analysis [10.777646083061395]
Constituent-Aware Pooling (CAP) is a methodology designed to analyse how large language models process linguistic structures.<n>CAP intervenes in model activations through constituent-based pooling at various model levels.<n>Our findings highlight fundamental limitations in current transformer architectures regarding compositional semantics processing and model interpretability.
arXiv Detail & Related papers (2024-10-16T18:10:50Z)
Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks [50.75902473813379]
This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models. The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes.
arXiv Detail & Related papers (2024-07-04T14:36:49Z)
Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models. Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking [53.66999416757543]
We study how fine-tuning affects the internal mechanisms implemented in language models. Fine-tuning enhances, rather than alters, the mechanistic operation of the model.
arXiv Detail & Related papers (2024-02-22T18:59:24Z)
Slimmable Domain Adaptation [112.19652651687402]
We introduce a simple framework, Slimmable Domain Adaptation, to improve cross-domain generalization with a weight-sharing model bank. Our framework surpasses other competing approaches by a very large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-06-14T06:28:04Z)
Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective [72.55093886515824]
We introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables. We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph. Experiment results on synthetic and real datasets show that our three proposed components significantly improve the robustness and reusability of the learned motion representations.
arXiv Detail & Related papers (2021-11-29T18:59:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.