Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages
- URL: http://arxiv.org/abs/2505.20496v2
- Date: Fri, 19 Sep 2025 22:29:42 GMT
- Title: Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages
- Authors: Asif Shahriar, Rifat Shahriyar, M Saifur Rahman
- Abstract summary: Encoder transformer models compress information from all tokens in a sequence into a single [CLS] token to represent global context. This approach risks diluting fine-grained or hierarchical features, leading to information loss in downstream tasks where local patterns are important. We propose an inception-style 1-D convolution module that sits on top of the transformer layer and augments token representations with multi-scale local features.
- Score: 3.294155819837931
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Encoder transformer models compress information from all tokens in a sequence into a single [CLS] token to represent global context. This approach risks diluting fine-grained or hierarchical features, leading to information loss in downstream tasks where local patterns are important. To remedy this, we propose a lightweight architectural enhancement: an inception-style 1-D convolution module that sits on top of the transformer layer and augments token representations with multi-scale local features. This enriched feature space is then processed by a self-attention layer that dynamically weights tokens based on their task relevance. Experiments on five diverse tasks show that our framework consistently improves general-purpose, domain-specific, and multilingual models, outperforming baselines by 1% to 14% while maintaining efficiency. Ablation studies show that multi-scale convolution performs better than any single kernel and that the self-attention layer is critical for performance.
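The mechanism the abstract describes, parallel 1-D convolutions of several kernel sizes over the token sequence followed by an attention-style weighting of tokens, can be sketched in plain Python. Everything below (function names, the toy kernels, scalar per-token features, the sum-of-channels scoring rule) is an illustrative assumption, not the paper's actual implementation:

```python
import math

def conv1d(tokens, kernel):
    """'Same'-padded 1-D convolution over a sequence of scalar token features."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + tokens + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(tokens))]

def inception_block(tokens, kernels):
    """Run every kernel in parallel and concatenate the outputs per token,
    giving each token one channel per receptive-field size."""
    branches = [conv1d(tokens, k) for k in kernels]
    return [[b[i] for b in branches] for i in range(len(tokens))]

def attention_pool(features):
    """Softmax-weight tokens by a scalar score (here: sum of channels),
    then pool the weighted features into a single vector."""
    scores = [sum(f) for f in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

tokens = [0.1, 0.5, -0.2, 0.8, 0.3]                              # toy per-token features
kernels = [[1.0], [0.25, 0.5, 0.25], [0.1, 0.2, 0.4, 0.2, 0.1]]  # sizes 1, 3, 5
feats = inception_block(tokens, kernels)
pooled = attention_pool(feats)
print(len(feats), len(feats[0]), len(pooled))  # 5 tokens, 3 channels, 3-dim pooled vector
```

In the real model the per-token features would be d-dimensional transformer hidden states and the kernels learned convolution filters; the sketch only shows how the multi-scale branches and the token-weighting layer compose.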
Related papers
- Revisiting [CLS] and Patch Token Interaction in Vision Transformers [16.71411137558127]
Vision Transformers have emerged as powerful, scalable and versatile representation learners. We investigate the friction between global and local feature learning under different pre-training strategies. We propose specialized processing paths that selectively disentangle the computational flow of class and patch tokens.
arXiv Detail & Related papers (2026-02-09T13:16:01Z) - Beyond the final layer: Attentive multilayer fusion for vision transformers [45.627646781613386]
We show that task-relevant information is distributed across the network hierarchy rather than solely encoded in any of the last layers. We apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. This mechanism learns to identify the most relevant layers for a target task and combines low-level structural cues with high-level semantic abstractions.
arXiv Detail & Related papers (2026-01-14T09:50:09Z) - What matters for Representation Alignment: Global Information or Spatial Structure? [64.67092609921816]
Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information or its spatial structure? We replace the standard projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation.
arXiv Detail & Related papers (2025-12-11T16:39:53Z) - MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models [4.828668077793944]
Multi-Modal Representation Learning generates space tokens projected into both text and image encoders as representation tokens. MMRL++ is a parameter-efficient and interaction-aware extension that significantly reduces trainable parameters. Experiments on 15 datasets demonstrate that MMRL and MMRL++ consistently outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-05-15T08:43:53Z) - Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z) - Layer-Wise Evolution of Representations in Fine-Tuned Transformers: Insights from Sparse AutoEncoders [0.0]
Fine-tuning pre-trained transformers is a powerful technique for enhancing the performance of base models on specific tasks. This paper explores the underlying mechanisms of fine-tuning, specifically in the BERT transformer.
arXiv Detail & Related papers (2025-02-23T21:29:50Z) - Unified Local and Global Attention Interaction Modeling for Vision Transformers [1.9571946424055506]
We present a novel method that extends the self-attention mechanism of a vision transformer (ViT) for more accurate object detection across diverse datasets. ViTs show strong capability for image understanding tasks such as object detection, segmentation, and classification. We introduce two modifications to the traditional self-attention framework: a novel aggressive convolution pooling strategy for local feature mixing, and a new conceptual attention transformation.
arXiv Detail & Related papers (2024-12-25T04:53:19Z) - Core Context Aware Transformers for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling. Our method automatically focuses on and strengthens the core context while diminishing redundancy during the learning process. Our method can replace the self-attention module in existing Large Language Models with minimal fine-tuning cost.
arXiv Detail & Related papers (2024-12-17T01:54:08Z) - Sharing Key Semantics in Transformer Makes Efficient Image Restoration [148.22790334216117]
The self-attention mechanism, a cornerstone of Vision Transformers (ViTs), tends to encompass all global cues. Small segments of a degraded image, particularly those closely aligned semantically, provide especially relevant information to aid in the restoration process. We propose boosting IR's performance by sharing the key semantics via Transformer for IR (i.e., SemanIR) in this paper.
arXiv Detail & Related papers (2024-05-30T12:45:34Z) - One-for-All: Towards Universal Domain Translation with a Single StyleGAN [86.33216867136639]
We propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains. The proposed UniTranslator is versatile and capable of performing various tasks, including style mixing, stylization, and translations. UniTranslator surpasses the performance of existing general-purpose models and performs well against specialized models in representative tasks.
arXiv Detail & Related papers (2023-10-22T08:02:55Z) - FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer [29.95553680263075]
We propose Feature Matching with Reconciliatory Transformer (FMRT), a detector-free method that reconciles different features with multiple receptive fields adaptively.
FMRT yields extraordinary performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.
arXiv Detail & Related papers (2023-10-20T15:54:18Z) - MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic
Segmentation [90.73815426893034]
We propose a transformer-based framework that aims to enhance weakly supervised semantic segmentation.
We introduce a Multi-Class Token transformer, which incorporates multiple class tokens to enable class-aware interactions with the patch tokens.
A Contrastive-Class-Token (CCT) module is proposed to enhance the learning of discriminative class tokens.
arXiv Detail & Related papers (2023-08-06T03:30:20Z) - Framework-agnostic Semantically-aware Global Reasoning for Segmentation [29.69187816377079]
We propose a component that learns to project image features into latent representations and reason between them.
Our design encourages the latent regions to represent semantic concepts by ensuring that the activated regions are spatially disjoint.
Our latent tokens are semantically interpretable and diverse and provide a rich set of features that can be transferred to downstream tasks.
arXiv Detail & Related papers (2022-12-06T21:42:05Z) - Compositional Generalization in Grounded Language Learning via Induced Model Sparsity [81.38804205212425]
We consider simple language-conditioned navigation problems in a grid world environment with disentangled observations.
We design an agent that encourages sparse correlations between words in the instruction and attributes of objects, composing them together to find the goal.
Our agent maintains a high level of performance on goals containing novel combinations of properties even when learning from a handful of demonstrations.
arXiv Detail & Related papers (2022-07-06T08:46:27Z) - Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks [88.77951448313486]
We present a new approach for model acceleration by exploiting spatial sparsity in visual data.
We propose a dynamic token sparsification framework to prune redundant tokens.
We extend our method to hierarchical models including CNNs and hierarchical vision Transformers.
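The dynamic token sparsification idea above, pruning redundant tokens so later layers process fewer of them, can be sketched as keeping only the highest-scoring tokens. This is a minimal stand-in under stated assumptions: the function name, the fixed keep ratio, and the externally supplied scores are illustrative, not the paper's learned keep/drop predictor:

```python
def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top keep_ratio fraction of tokens by score,
    preserving their original sequence order."""
    k = max(1, int(len(tokens) * keep_ratio))          # always keep at least one token
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])                          # restore sequence order
    return [tokens[i] for i in kept]

# Toy usage: tokens "a" and "d" have the highest scores, so they survive.
print(prune_tokens(["a", "b", "c", "d"], [0.9, 0.1, 0.4, 0.7], 0.5))  # ['a', 'd']
```

In the actual framework the scores would come from a small learned prediction module and pruning would be applied progressively across layers; the sketch only shows the select-and-reorder step.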
arXiv Detail & Related papers (2022-07-04T17:00:51Z) - Multi-class Token Transformer for Weakly Supervised Semantic Segmentation [94.78965643354285]
We propose a new transformer-based framework to learn class-specific object localization maps as pseudo labels for weakly supervised semantic segmentation (WSSS).
Inspired by the fact that the attended regions of the one-class token in the standard vision transformer can be leveraged to form a class-agnostic localization map, we investigate if the transformer model can also effectively capture class-specific attention for more discriminative object localization.
The proposed framework is shown to fully complement the Class Activation Mapping (CAM) method, leading to remarkably superior WSSS results on the PASCAL VOC and MS COCO datasets.
arXiv Detail & Related papers (2022-03-06T07:18:23Z) - Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [22.91753200323264]
We propose a novel pure transformer-based framework, Feature Fusion Vision Transformer (FFVT).
We aggregate the important tokens from each transformer layer to compensate for the local, low-level and middle-level information.
We design a novel token selection module called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens.
arXiv Detail & Related papers (2021-07-06T01:48:43Z) - X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented Compositional Semantic Parsing [51.81533991497547]
Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries.
We present X2Parser, a transferable Cross-lingual and Cross-domain Parser for TCSP.
We propose to predict flattened intents and slots representations separately and cast both prediction tasks into sequence labeling problems.
arXiv Detail & Related papers (2021-06-07T16:40:05Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification, we add a context token to the input sequence, to facilitate information exchange with tokens corresponding to image patches.
arXiv Detail & Related papers (2021-05-23T12:44:27Z) - Point Cloud Learning with Transformer [2.3204178451683264]
We introduce a novel framework called Multi-level Multi-scale Point Transformer (MLMSPT).
Specifically, a point pyramid transformer is investigated to model features with diverse resolutions or scales.
A multi-level transformer module is designed to aggregate contextual information from different levels of each scale and enhance their interactions.
arXiv Detail & Related papers (2021-04-28T08:39:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.