Related papers: GFT: Gradient Focal Transformer

GFT: Gradient Focal Transformer

URL: http://arxiv.org/abs/2504.09852v1
Date: Mon, 14 Apr 2025 03:49:06 GMT
Title: GFT: Gradient Focal Transformer
Authors: Boris Kriuk, Simranjit Kaur Gill, Shoaib Aslam, Amir Fakhrutdinov,
Abstract summary: This paper introduces GFT (Gradient Focal Transformer), a new ViT-derived framework created for Fine-Grained Image Classification tasks.<n>GFT integrates the Gradient Attention Learning Alignment (GALA) mechanism to dynamically prioritize class-discriminative features.<n>GFT achieves SOTA accuracy on FGVC Aircraft, Food-101, and datasets with 93M parameters, outperforming ViT-based advanced FGIC models in efficiency.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Fine-Grained Image Classification (FGIC) remains a complex task in computer vision, as it requires models to distinguish between categories with subtle localized visual differences. Well-studied CNN-based models, while strong in local feature extraction, often fail to capture the global context required for fine-grained recognition, while more recent ViT-backboned models address FGIC with attention-driven mechanisms but lack the ability to adaptively focus on truly discriminative regions. TransFG and other ViT-based extensions introduced part-aware token selection to enhance attention localization, yet they still struggle with computational efficiency, attention region selection flexibility, and detail-focus narrative in complex environments. This paper introduces GFT (Gradient Focal Transformer), a new ViT-derived framework created for FGIC tasks. GFT integrates the Gradient Attention Learning Alignment (GALA) mechanism to dynamically prioritize class-discriminative features by analyzing attention gradient flow. Coupled with a Progressive Patch Selection (PPS) strategy, the model progressively filters out less informative regions, reducing computational overhead while enhancing sensitivity to fine details. GFT achieves SOTA accuracy on FGVC Aircraft, Food-101, and COCO datasets with 93M parameters, outperforming ViT-based advanced FGIC models in efficiency. By bridging global context and localized detail extraction, GFT sets a new benchmark in fine-grained recognition, offering interpretable solutions for real-world deployment scenarios.

Related papers

Point Cloud Understanding via Attention-Driven Contrastive Learning [64.65145700121442]
Transformer-based models have advanced point cloud understanding by leveraging self-attention mechanisms. PointACL is an attention-driven contrastive learning framework designed to address these limitations. Our method employs an attention-driven dynamic masking strategy that guides the model to focus on under-attended regions.
arXiv Detail & Related papers (2024-11-22T05:41:00Z)
ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions. Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks. We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z)
GLC++: Source-Free Universal Domain Adaptation through Global-Local Clustering and Contrastive Affinity Learning [84.54244771470012]
Source-Free Universal Domain Adaptation (SF-UniDA) aims to accurately classify "known" data belonging to common categories. We propose a novel Global and Local Clustering (GLC) technique, which comprises an adaptive one-vs-all global clustering algorithm. We evolve GLC to GLC++, integrating a contrastive affinity learning strategy.
arXiv Detail & Related papers (2024-03-21T13:57:45Z)
Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation [12.103012959947055]
This work explores the use of Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs. SWTformer-V1 achieves a 0.98% mAP higher localization accuracy, outperforming state-of-the-art models. SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information.
arXiv Detail & Related papers (2024-01-31T13:41:17Z)
ASWT-SGNN: Adaptive Spectral Wavelet Transform-based Self-Supervised Graph Neural Network [20.924559944655392]
This paper proposes an Adaptive Spectral Wavelet Transform-based Self-Supervised Graph Neural Network (ASWT-SGNN) ASWT-SGNN accurately approximates the filter function in high-density spectral regions, avoiding costly eigen-decomposition. It achieves comparable performance to state-of-the-art models in node classification tasks.
arXiv Detail & Related papers (2023-12-10T03:07:42Z)
Hybrid Focal and Full-Range Attention Based Graph Transformers [0.0]
We present a purely attention-based architecture, namely Focal and Full-Range Graph Transformer (FFGT) FFGT combines the conventional full-range attention with K-hop focal attention on ego-nets to aggregate both global and local information. Our approach enhances the performance of existing Graph Transformers on various open datasets.
arXiv Detail & Related papers (2023-11-08T12:53:07Z)
Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning [49.72857433721424]
Vision Transformers (ViT) and Visual Prompt Tuning (VPT) achieve state-of-the-art performance with improved efficiency in various computer vision tasks. We present a novel algorithm, SGPT, that integrates Generalized FL (GFL) and Personalized FL (PFL) approaches by employing a unique combination of both shared and group-specific prompts.
arXiv Detail & Related papers (2023-10-27T17:22:09Z)
Salient Mask-Guided Vision Transformer for Fine-Grained Classification [48.1425692047256]
Fine-grained visual classification (FGVC) is a challenging computer vision problem. One of its main difficulties is capturing the most discriminative inter-class variances. We introduce a simple yet effective Salient Mask-Guided Vision Transformer (SM-ViT)
arXiv Detail & Related papers (2023-05-11T19:24:33Z)
Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [22.91753200323264]
We propose a novel pure transformer-based framework Feature Fusion Vision Transformer (FFVT) We aggregate the important tokens from each transformer layer to compensate the local, low-level and middle-level information. We design a novel token selection mod-ule called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens.
arXiv Detail & Related papers (2021-07-06T01:48:43Z)
Cross-Domain Facial Expression Recognition: A Unified Evaluation Benchmark and Adversarial Graph Learning [85.6386289476598]
We develop a novel adversarial graph representation adaptation (AGRA) framework for cross-domain holistic-local feature co-adaptation. We conduct extensive and fair evaluations on several popular benchmarks and show that the proposed AGRA framework outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2020-08-03T15:00:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.