Related papers: Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages

Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages

URL: http://arxiv.org/abs/2505.20496v1
Date: Mon, 26 May 2025 19:59:22 GMT
Title: Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages
Authors: Asif Shahriar, Rifat Shahriyar, M Saifur Rahman,
Abstract summary: We introduce textitInceptive Transformer, a modular and lightweight architecture that enriches transformer-based token representations.<n>Our model is designed to balance local and global dependencies by dynamically weighting tokens based on their relevance to a particular task.
Score: 2.98683507969764
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Conventional transformer models typically compress the information from all tokens in a sequence into a single \texttt{[CLS]} token to represent global context-- an approach that can lead to information loss in tasks requiring localized or hierarchical cues. In this work, we introduce \textit{Inceptive Transformer}, a modular and lightweight architecture that enriches transformer-based token representations by integrating a multi-scale feature extraction module inspired by inception networks. Our model is designed to balance local and global dependencies by dynamically weighting tokens based on their relevance to a particular task. Evaluation across a diverse range of tasks including emotion recognition (both English and Bangla), irony detection, disease identification, and anti-COVID vaccine tweets classification shows that our models consistently outperform the baselines by 1\% to 14\% while maintaining efficiency. These findings highlight the versatility and cross-lingual applicability of our method for enriching transformer-based representations across diverse domains.

Related papers

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences.<n>Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations.<n>Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
Unified Local and Global Attention Interaction Modeling for Vision Transformers [1.9571946424055506]
We present a novel method that extends the self-attention mechanism of a vision transformer (ViT) for more accurate object detection across diverse datasets.<n>ViTs show strong capability for image understanding tasks such as object detection, segmentation, and classification.<n>We introduce two modifications to the traditional self-attention framework; a novel aggressive convolution pooling strategy for local feature mixing, and a new conceptual attention transformation.
arXiv Detail & Related papers (2024-12-25T04:53:19Z)
One-for-All: Towards Universal Domain Translation with a Single StyleGAN [86.33216867136639]
We propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains.<n>The proposed UniTranslator is versatile and capable of performing various tasks, including style mixing, stylization, and translations.<n>UniTranslator surpasses the performance of existing general-purpose models and performs well against specialized models in representative tasks.
arXiv Detail & Related papers (2023-10-22T08:02:55Z)
FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer [29.95553680263075]
We propose Feature Matching with Reconciliatory Transformer (FMRT), a detector-free method that reconciles different features with multiple receptive fields adaptively. FMRT yields extraordinary performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.
arXiv Detail & Related papers (2023-10-20T15:54:18Z)
MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation [90.73815426893034]
We propose a transformer-based framework that aims to enhance weakly supervised semantic segmentation. We introduce a Multi-Class Token transformer, which incorporates multiple class tokens to enable class-aware interactions with the patch tokens. A Contrastive-Class-Token (CCT) module is proposed to enhance the learning of discriminative class tokens.
arXiv Detail & Related papers (2023-08-06T03:30:20Z)
Framework-agnostic Semantically-aware Global Reasoning for Segmentation [29.69187816377079]
We propose a component that learns to project image features into latent representations and reason between them. Our design encourages the latent regions to represent semantic concepts by ensuring that the activated regions are spatially disjoint. Our latent tokens are semantically interpretable and diverse and provide a rich set of features that can be transferred to downstream tasks.
arXiv Detail & Related papers (2022-12-06T21:42:05Z)
Compositional Generalization in Grounded Language Learning via Induced Model Sparsity [81.38804205212425]
We consider simple language-conditioned navigation problems in a grid world environment with disentangled observations. We design an agent that encourages sparse correlations between words in the instruction and attributes of objects, composing them together to find the goal. Our agent maintains a high level of performance on goals containing novel combinations of properties even when learning from a handful of demonstrations.
arXiv Detail & Related papers (2022-07-06T08:46:27Z)
Multi-class Token Transformer for Weakly Supervised Semantic Segmentation [94.78965643354285]
We propose a new transformer-based framework to learn class-specific object localization maps as pseudo labels for weakly supervised semantic segmentation (WSSS) Inspired by the fact that the attended regions of the one-class token in the standard vision transformer can be leveraged to form a class-agnostic localization map, we investigate if the transformer model can also effectively capture class-specific attention for more discriminative object localization. The proposed framework is shown to fully complement the Class Activation Mapping (CAM) method, leading to remarkably superior WSSS results on the PASCAL VOC and MS COCO datasets.
arXiv Detail & Related papers (2022-03-06T07:18:23Z)
X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented Compositional Semantic Parsing [51.81533991497547]
Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries. We present X2 compared a transferable Cross-lingual and Cross-domain for TCSP. We propose to predict flattened intents and slots representations separately and cast both prediction tasks into sequence labeling problems.
arXiv Detail & Related papers (2021-06-07T16:40:05Z)
Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches. Inspired by classification, we add a context token to the input sequence, to facilitate information exchange with tokens corresponding to image patches.
arXiv Detail & Related papers (2021-05-23T12:44:27Z)
Point Cloud Learning with Transformer [2.3204178451683264]
We introduce a novel framework, called Multi-level Multi-scale Point Transformer (MLMSPT) Specifically, a point pyramid transformer is investigated to model features with diverse resolutions or scales. A multi-level transformer module is designed to aggregate contextual information from different levels of each scale and enhance their interactions.
arXiv Detail & Related papers (2021-04-28T08:39:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.