GazeFormer-MoE: Context-Aware Gaze Estimation via CLIP and MoE Transformer
- URL: http://arxiv.org/abs/2601.12316v1
- Date: Sun, 18 Jan 2026 08:54:02 GMT
- Title: GazeFormer-MoE: Context-Aware Gaze Estimation via CLIP and MoE Transformer
- Authors: Xinyuan Zhao, Xianrui Chen, Ahmad Chaddad
- Abstract summary: We present a semantics-modulated, multi-scale Transformer for 3D gaze estimation. Our model achieves new state-of-the-art angular errors of 2.49°, 3.22°, 10.16°, and 1.44°, demonstrating up to a 64% relative improvement over previously reported results.
- Score: 7.153682966455712
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present a semantics-modulated, multi-scale Transformer for 3D gaze estimation. Our model conditions CLIP global features with learnable prototype banks (illumination, head pose, background, direction), fuses these prototype-enriched global vectors with CLIP patch tokens and high-resolution CNN tokens in a unified attention space, and replaces several FFN blocks with routed/shared Mixture of Experts (MoE) to increase conditional capacity. Evaluated on MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze, our model achieves new state-of-the-art angular errors of 2.49°, 3.22°, 10.16°, and 1.44°, respectively, demonstrating up to a 64% relative improvement over previously reported results. Ablations attribute the gains to prototype conditioning, cross-scale fusion, MoE, and hyperparameter choices. Our code is publicly available at https://github.com/AIPMLab/Gazeformer.
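The routed/shared Mixture of Experts idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the gating function, the number of experts, or the top-k value, so everything below is an assumption for illustration: the class name `SharedRoutedMoE`, the softmax gate, the top-k renormalization, and the tiny random ReLU FFNs are all hypothetical and are not taken from the authors' code. The sketch only shows the general pattern in which one shared expert processes every token while a gate selects a few routed experts per token.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def make_ffn(dim, hidden, rng):
    """Build a tiny two-layer ReLU FFN with random weights (illustrative only)."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]

    def ffn(x):
        h = [max(0.0, v) for v in matvec(w1, x)]
        return matvec(w2, h)

    return ffn


class SharedRoutedMoE:
    """Hypothetical FFN replacement: one shared expert plus top-k routed experts."""

    def __init__(self, dim, hidden, num_routed, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.shared = make_ffn(dim, hidden, rng)
        self.routed = [make_ffn(dim, hidden, rng) for _ in range(num_routed)]
        # Assumed linear gate: one score per routed expert.
        self.gate = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                     for _ in range(num_routed)]

    def route(self, token):
        """Return {expert_index: weight} for the top-k experts, weights summing to 1."""
        probs = softmax(matvec(self.gate, token))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        top = top[:self.top_k]
        norm = sum(probs[i] for i in top)
        return {i: probs[i] / norm for i in top}

    def forward(self, token):
        """Shared expert output plus the weighted outputs of the selected experts."""
        out = self.shared(token)
        for idx, weight in self.route(token).items():
            expert_out = self.routed[idx](token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


# Usage with made-up sizes: a 4-dimensional token, 4 routed experts, top-2 routing.
moe = SharedRoutedMoE(dim=4, hidden=8, num_routed=4, top_k=2)
token = [0.1, -0.3, 0.7, 0.2]
weights = moe.route(token)
output = moe.forward(token)
```

Only the selected routed experts run for a given token, which is how this kind of layer adds capacity without a proportional increase in per-token compute. The shared expert gives every token a common path regardless of the routing decision.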
Related papers
- Pre-training Point Cloud Compact Model with Partial-aware Reconstruction [51.403810709250024]
We present a pre-trained Point cloud Compact Model with Partial-aware Reconstruction, named Point-CPR.
Our model exhibits strong performance across various tasks, especially surpassing the leading MPM-based model PointGPT-B with only 2% of its parameters.
arXiv Detail & Related papers (2024-07-12T15:18:14Z) - Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed mixture-of-modules (MoM).
MoM is motivated by an intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy.
arXiv Detail & Related papers (2024-07-09T08:50:18Z) - SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation [53.675725490807615]
We introduce SDPose, a new self-distillation method for improving the performance of small transformer-based models.
SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs, while SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset.
arXiv Detail & Related papers (2024-04-04T15:23:14Z) - ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE).
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38× higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z) - MaskConver: Revisiting Pure Convolution Model for Panoptic Segmentation [17.627376199097185]
We revisit pure convolution model and propose a novel panoptic architecture named MaskConver.
MaskConver proposes to fully unify things and stuff representation by predicting their centers.
We introduce a powerful ConvNeXt-UNet decoder that closes the performance gap between convolution- and transformer-based models.
arXiv Detail & Related papers (2023-12-11T00:52:26Z) - MotionAGFormer: Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network [2.7268855969580166]
We present a novel Attention-GCNFormer block that divides the number of channels by using two parallel transformer and GCNFormer streams.
Our proposed GCNFormer module exploits the local relationship between adjacent joints, outputting a new representation that is complementary to the transformer output.
We evaluate our model on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP.
arXiv Detail & Related papers (2023-10-25T01:46:35Z) - Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours increases speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - Focal-UNet: UNet-like Focal Modulation for Medical Image Segmentation [8.75217589103206]
We propose a new U-shaped architecture for medical image segmentation with the help of the newly introduced focal modulation mechanism.
Due to the ability of the focal module to aggregate local and global features, our model could simultaneously benefit from the wide receptive field of transformers.
arXiv Detail & Related papers (2022-12-19T06:17:22Z) - Fcaformer: Forward Cross Attention in Hybrid Vision Transformer [29.09883780571206]
We propose forward cross attention for hybrid vision transformer (FcaFormer).
Our FcaFormer achieves 83.1% top-1 accuracy on ImageNet with only 16.3 million parameters and about 3.6 billion MACs.
This saves almost half of the parameters and some computational cost while achieving 0.7% higher accuracy compared to distilled EfficientFormer.
arXiv Detail & Related papers (2022-11-14T08:43:44Z) - EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm [111.17100512647619]
This paper explains the rationality of Vision Transformer by analogy with the proven practical evolutionary algorithm (EA).
We propose a novel pyramid EATFormer backbone that only contains the proposed EA-based transformer (EAT) block.
Extensive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach.
arXiv Detail & Related papers (2022-06-19T04:49:35Z) - TPC: Transformation-Specific Smoothing for Point Cloud Models [9.289813586197882]
We propose a transformation-specific smoothing framework TPC, which provides robustness guarantees for point cloud models against semantic transformation attacks.
Experiments on several common 3D transformations show that TPC significantly outperforms the state of the art.
arXiv Detail & Related papers (2022-01-30T05:41:50Z) - Pyramid Fusion Transformer for Semantic Segmentation [44.57867861592341]
We propose a transformer-based Pyramid Fusion Transformer (PFT) for per-mask approach semantic segmentation with multi-scale features.
We achieve competitive performance on three widely used semantic segmentation datasets.
arXiv Detail & Related papers (2022-01-11T16:09:25Z) - Focal Self-attention for Local-Global Interactions in Vision Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.