MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation
- URL: http://arxiv.org/abs/2505.15696v1
- Date: Wed, 21 May 2025 16:10:02 GMT
- Title: MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation
- Authors: Maike Behrendt, Stefan Sylvius Wagner, Stefan Harmeling
- Abstract summary: MaxPoolBERT refines the [CLS] representation by aggregating information across layers and tokens. Our approach enhances BERT's classification accuracy without requiring pre-training or significantly increasing model size.
- Score: 1.2699007098398807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The [CLS] token in BERT is commonly used as a fixed-length representation for classification tasks, yet prior work has shown that both other tokens and intermediate layers encode valuable contextual information. In this work, we propose MaxPoolBERT, a lightweight extension to BERT that refines the [CLS] representation by aggregating information across layers and tokens. Specifically, we explore three modifications: (i) max-pooling the [CLS] token across multiple layers, (ii) enabling the [CLS] token to attend over the entire final layer using an additional multi-head attention (MHA) layer, and (iii) combining max-pooling across the full sequence with MHA. Our approach enhances BERT's classification accuracy (especially on low-resource tasks) without requiring pre-training or significantly increasing model size. Experiments on the GLUE benchmark show that MaxPoolBERT consistently outperforms the standard BERT-base model.
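The following PyTorch sketch illustrates the three aggregation variants described in the abstract. It is a minimal illustration based only on that description, not the authors' implementation: the class name MaxPoolBERTHead, the number of pooled layers k_layers, the number of attention heads, and the exact way variant (iii) combines sequence-level max-pooling with the MHA layer are all assumptions.

```python
# Minimal sketch of the three MaxPoolBERT-style aggregation variants (assumed, not the authors' code).
import torch
import torch.nn as nn
from transformers import AutoModel


class MaxPoolBERTHead(nn.Module):
    """Classification head that aggregates BERT hidden states (illustrative)."""

    def __init__(self, model_name="bert-base-uncased", num_labels=2,
                 variant="layer_maxpool", k_layers=4, num_heads=8):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.variant = variant
        self.k_layers = k_layers
        # Extra multi-head attention layer used by variants (ii) and (iii).
        self.mha = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids, attention_mask=attention_mask,
                        output_hidden_states=True)
        last = out.last_hidden_state                      # [B, T, H]
        pad_mask = attention_mask == 0                    # True at padding positions

        if self.variant == "layer_maxpool":
            # (i) Max-pool the [CLS] token over the last k_layers encoder layers.
            cls_per_layer = [h[:, 0] for h in out.hidden_states[-self.k_layers:]]
            pooled = torch.stack(cls_per_layer, dim=1).max(dim=1).values   # [B, H]
        elif self.variant == "cls_mha":
            # (ii) Let [CLS] attend over all final-layer tokens via the extra MHA layer.
            query = last[:, :1]                                            # [B, 1, H]
            attn_out, _ = self.mha(query, last, last, key_padding_mask=pad_mask)
            pooled = attn_out[:, 0]                                        # [B, H]
        else:
            # (iii) Combine max-pooling over the full sequence with the MHA layer
            #       (the exact combination order is an assumption here).
            masked = last.masked_fill(pad_mask.unsqueeze(-1), float("-inf"))
            query = masked.max(dim=1).values.unsqueeze(1)                  # [B, 1, H]
            attn_out, _ = self.mha(query, last, last, key_padding_mask=pad_mask)
            pooled = attn_out[:, 0]                                        # [B, H]
        return self.classifier(pooled)                                     # [B, num_labels]
```

For example, with variant="layer_maxpool" and k_layers=4 the head takes the element-wise maximum of the [CLS] vectors from the last four encoder layers before classification; no new pre-training is needed, only the extra MHA layer and the linear classifier are trained alongside fine-tuning.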
Related papers
- RSQ: Learning from Important Tokens Leads to Better Quantized LLMs [65.5558181902098]
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. We propose RSQ (Rotate, Scale, then Quantize), which applies rotations to the model to mitigate outliers. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families.
arXiv Detail & Related papers (2025-03-03T18:46:33Z) - Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models [81.74999702045339]
Multi-Level Optimal Transport (MultiLevelOT) is a novel approach that advances optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness.
arXiv Detail & Related papers (2024-12-19T04:51:06Z) - Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models [79.70436109672599]
We derive non-vacuous generalization bounds for large language models as large as LLaMA2-70B.
Our work achieves the first non-vacuous bounds for models that are deployed in practice and generate high-quality text.
arXiv Detail & Related papers (2024-07-25T16:13:58Z) - A Novel Two-Step Fine-Tuning Pipeline for Cold-Start Active Learning in Text Classification Tasks [7.72751543977484]
This work investigates the effectiveness of BERT-based contextual embeddings in active learning (AL) tasks on cold-start scenarios.
Our primary contribution is the proposal of a more robust fine-tuning pipeline - DoTCAL.
Our evaluation contrasts BERT-based embeddings with other prevalent text representation paradigms, including Bag of Words (BoW), Latent Semantic Indexing (LSI) and FastText.
arXiv Detail & Related papers (2024-07-24T13:50:21Z) - Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z) - Breaking the Token Barrier: Chunking and Convolution for Efficient Long
Text Classification with BERT [0.0]
Transformer-based models, specifically BERT, have propelled research in various NLP tasks.
BERT models are limited to a maximum of 512 tokens, which makes them non-trivial to apply in practical settings with long inputs.
We propose a relatively simple extension to the vanilla BERT architecture, called ChunkBERT, that allows fine-tuning of any pretrained model to perform inference on arbitrarily long text.
arXiv Detail & Related papers (2023-10-31T15:41:08Z) - Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling [34.88128747535637]
Ensembling BERT models often significantly improves accuracy, but at the cost of computation and memory footprint.
We propose Multi-CLS BERT, a novel ensembling method for CLS-based prediction tasks.
In experiments on GLUE and SuperGLUE we show that our Multi-CLS BERT reliably improves both overall accuracy and confidence estimation.
arXiv Detail & Related papers (2022-10-10T23:15:17Z) - Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask
Training [55.43088293183165]
Recent studies show that pre-trained language models (PLMs) like BERT contain matching subnetworks that have transfer learning performance similar to the original PLM.
In this paper, we find that the BERT subnetworks have even more potential than these studies have shown.
We train binary masks over model weights on the pre-training tasks, with the aim of preserving the universal transferability of the subnetwork.
arXiv Detail & Related papers (2022-04-24T08:42:47Z) - Pyramid-BERT: Reducing Complexity via Successive Core-set based Token
Selection [23.39962989492527]
Transformer-based language models such as BERT have achieved the state-of-the-art on various NLP tasks, but are computationally prohibitive.
We present Pyramid-BERT, where we replace previously used heuristics with a core-set based token selection method justified by theoretical results.
The core-set based token selection technique allows us to avoid expensive pre-training, enables space-efficient fine-tuning, and thus makes it suitable for handling longer sequence lengths.
arXiv Detail & Related papers (2022-03-27T19:52:01Z) - Layer-wise Guided Training for BERT: Learning Incrementally Refined
Document Representations [11.46458298316499]
We propose a novel approach to fine-tune BERT in a structured manner.
Specifically, we focus on Large Scale Multilabel Text Classification (LMTC)
Our approach guides specific BERT layers to predict labels from specific hierarchy levels.
arXiv Detail & Related papers (2020-10-12T14:56:22Z) - BERT's output layer recognizes all hidden layers? Some Intriguing
Phenomena and a simple way to boost BERT [53.63288887672302]
Bidirectional Encoder Representations from Transformers (BERT) have achieved tremendous success in many natural language processing (NLP) tasks.
We find that, surprisingly, the output layer of BERT can reconstruct the input sentence when directly given each hidden layer of BERT as input.
We propose a quite simple method to boost the performance of BERT.
arXiv Detail & Related papers (2020-01-25T13:35:34Z)