GeoDeformer: Geometric Deformable Transformer for Action Recognition
- URL: http://arxiv.org/abs/2311.17975v1
- Date: Wed, 29 Nov 2023 16:55:55 GMT
- Title: GeoDeformer: Geometric Deformable Transformer for Action Recognition
- Authors: Jinhui Ye, Jiaming Zhou, Hui Xiong, Junwei Liang
- Abstract summary: Vision transformers have recently emerged as an effective alternative to convolutional networks for action recognition.
This paper proposes a novel approach, GeoDeformer, designed to capture the variations inherent in action video by integrating geometric comprehension directly into the ViT architecture.
- Score: 22.536307401874105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have recently emerged as an effective alternative to
convolutional networks for action recognition. However, vision transformers
still struggle with geometric variations prevalent in video data. This paper
proposes a novel approach, GeoDeformer, designed to capture the variations
inherent in action video by integrating geometric comprehension directly into
the ViT architecture. Specifically, at the core of GeoDeformer is the Geometric
Deformation Predictor, a module designed to identify and quantify potential
spatial and temporal geometric deformations within the given video. Spatial
deformations adjust the geometry within individual frames, while temporal
deformations capture the cross-frame geometric dynamics, reflecting motion and
temporal progression. To demonstrate the effectiveness of our approach, we
incorporate it into the established MViTv2 framework, replacing the standard
self-attention blocks with GeoDeformer blocks. Our experiments on UCF101,
HMDB51, and Mini-K200 show significant gains in both Top-1 and Top-5
accuracy, establishing new state-of-the-art results with only a marginal
increase in computational cost. Additionally, visualizations affirm that
GeoDeformer effectively manifests explicit geometric deformations and minimizes
geometric variations. Code and checkpoints will be released.
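As a rough illustration of the spatial-deformation idea described in the abstract (this is not the paper's released code), the sketch below resamples a frame's feature map at grid positions shifted by per-location offsets, which is the step a module like the Geometric Deformation Predictor would drive with learned offsets. The function names `bilinear_sample` and `deform_frame` are our own; here the offsets are supplied directly rather than predicted.

```python
import numpy as np

def bilinear_sample(feat, ys, xs):
    """Sample feat (H, W) at fractional coordinates via bilinear interpolation."""
    H, W = feat.shape
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    wy = np.clip(ys - y0, 0.0, 1.0)  # vertical interpolation weight
    wx = np.clip(xs - x0, 0.0, 1.0)  # horizontal interpolation weight
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x0 + 1]
            + wy * (1 - wx) * feat[y0 + 1, x0]
            + wy * wx * feat[y0 + 1, x0 + 1])

def deform_frame(feat, offsets):
    """Resample a frame at grid positions shifted by predicted offsets.

    feat:    (H, W) feature map for one frame
    offsets: (H, W, 2) per-location (dy, dx) geometric deformations
    """
    H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H, dtype=float),
                         np.arange(W, dtype=float), indexing="ij")
    return bilinear_sample(feat, ys + offsets[..., 0], xs + offsets[..., 1])

# Zero offsets reproduce the input exactly.
feat = np.arange(16, dtype=float).reshape(4, 4)
out = deform_frame(feat, np.zeros((4, 4, 2)))
assert np.allclose(out, feat)

# A uniform shift of +1 row samples the row below at interior positions.
shift = np.zeros((4, 4, 2))
shift[..., 0] = 1.0
shifted = deform_frame(feat, shift)
assert np.allclose(shifted[:3], feat[1:])
```

In a learned setting the `offsets` tensor would come from a small prediction head conditioned on the token features, and a temporal variant would additionally shift the sampling coordinates across frames; this sketch only shows the within-frame resampling step.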
Related papers
- ArGEnT: Arbitrary Geometry-encoded Transformer for Operator Learning [2.757490632589873]
We propose Arbitrary Geometry-encoded Transformer (ArGEnT), a geometry-aware attention-based architecture for operator learning on arbitrary domains.
By combining flexible geometry encoding with operator-learning capabilities, ArGEnT provides a scalable surrogate modeling framework for optimization, uncertainty, and data-driven modeling of complex physical systems.
arXiv Detail & Related papers (2026-02-12T06:22:59Z)
- Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video [76.32954467706581]
We propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams.
We use a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision.
Experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks.
arXiv Detail & Related papers (2026-02-08T09:53:21Z)
- Geo-Code: A Code Framework for Reverse Code Generation from Geometric Images Based on Two-Stage Multi-Agent Evolution [22.312869477454864]
We propose Geo-coder -- the first inverse programming framework for geometric images based on a multi-agent system.
Our method innovatively decouples the process into geometric modeling via pixel-wise anchoring and metric-driven code evolution.
Experiments demonstrate that Geo-coder achieves a substantial lead in both geometric reconstruction accuracy and visual consistency.
arXiv Detail & Related papers (2026-02-08T00:48:49Z)
- Rectifying Geometry-Induced Similarity Distortions for Real-World Aerial-Ground Person Re-Identification [4.039576422478934]
Aerial-ground person re-identification (AG-ReID) is fundamentally challenged by extreme viewpoint and distance discrepancies.
Existing methods rely on geometry-aware feature learning or appearance-conditioned prompting.
We introduce Geometry-Induced Query-Key Transformation (GIQT), a lightweight low-rank module that rectifies the similarity space by conditioning query-key interactions on camera geometry.
arXiv Detail & Related papers (2026-01-29T08:41:42Z)
- GeoVideo: Introducing Geometric Regularization into Video Generation Model [46.38507581500745]
We introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction.
Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved structural coherence, temporal shape consistency, and physical plausibility.
arXiv Detail & Related papers (2025-12-03T05:11:57Z)
- GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation [68.02988074681427]
Previous works leveraging video models for image-to-3D scene generation tend to suffer from geometric distortions and blurry content.
In this paper, we renovate the pipeline of image-to-3D scene generation by unlocking the potential of geometry models.
Our GeoWorld can generate high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2025-11-28T13:55:45Z)
- Epipolar Geometry Improves Video Generation Models [73.44978239787501]
3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks.
We explore how epipolar geometry constraints improve modern video diffusion models.
By bridging data-driven deep learning with classical geometric computer vision, we present a practical method for generating spatially consistent videos.
arXiv Detail & Related papers (2025-10-24T16:21:37Z)
- GeoAda: Efficiently Finetune Geometric Diffusion Models with Equivariant Adapters [61.51810815162003]
We propose an SE(3)-equivariant adapter framework (GeoAda) that enables flexible and parameter-efficient fine-tuning for controlled generative tasks.
GeoAda preserves the model's geometric consistency while mitigating overfitting and catastrophic forgetting.
We demonstrate the wide applicability of GeoAda across diverse geometric control types, including frame control, global control, subgraph control, and a broad range of application domains.
arXiv Detail & Related papers (2025-07-02T18:44:03Z)
- UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation [63.90470530428842]
In this work, we demonstrate that, through appropriate design and fine-tuning, the intrinsic consistency of video generation models can be effectively harnessed for consistent geometric estimation.
Our results achieve superior performance in predicting global geometric attributes in videos and can be directly applied to reconstruction tasks.
arXiv Detail & Related papers (2025-05-30T12:31:59Z)
- Geometry-Editable and Appearance-Preserving Object Composition [67.98806888489385]
General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties.
Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation.
We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion model that first leverages semantic embeddings to implicitly capture desired geometric transformations.
arXiv Detail & Related papers (2025-05-27T09:05:28Z)
- AdS-GNN -- a Conformally Equivariant Graph Neural Network [9.96018310438305]
We build a neural network that is equivariant under general conformal transformations.
We validate our model on tasks from computer vision and statistical physics.
arXiv Detail & Related papers (2025-05-19T09:08:52Z)
- Geometry-Informed Neural Operator Transformer [0.8906214436849201]
This work introduces the Geometry-Informed Neural Operator Transformer (GINOT), which integrates the transformer architecture with the neural operator framework to enable forward predictions for arbitrary geometries.
The performance of GINOT is validated on multiple challenging datasets, showcasing its high accuracy and strong generalization capabilities for complex and arbitrary 2D and 3D geometries.
arXiv Detail & Related papers (2025-04-28T03:39:27Z)
- GERD: Geometric event response data generation [1.5269221584932013]
Event-based vision sensors are appealing because of their time resolution, high dynamic range, and low power consumption.
They also provide data that is fundamentally different from conventional frame-based cameras: events are sparse, discrete, and require integration in time.
We introduce a method to generate event-based data under controlled transformations.
arXiv Detail & Related papers (2024-12-04T11:59:36Z)
- Bridging Geometric States via Geometric Diffusion Bridge [79.60212414973002]
We introduce the Geometric Diffusion Bridge (GDB), a novel generative modeling framework that accurately bridges initial and target geometric states.
GDB employs an equivariant diffusion bridge derived from a modified version of Doob's $h$-transform for connecting geometric states.
We show that GDB surpasses existing state-of-the-art approaches, opening up a new pathway for accurately bridging geometric states.
arXiv Detail & Related papers (2024-10-31T17:59:53Z)
- Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract priors from the well-trained transformers on massive images.
Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves the state-of-the-art performance of UDA for point cloud classification.
arXiv Detail & Related papers (2024-07-26T06:29:09Z)
- GGAvatar: Geometric Adjustment of Gaussian Head Avatar [6.58321368492053]
GGAvatar is a novel 3D avatar representation designed to robustly model dynamic head avatars with complex identities.
GGAvatar can produce high-fidelity renderings, outperforming state-of-the-art methods in visual quality and quantitative metrics.
arXiv Detail & Related papers (2024-05-20T12:54:57Z)
- SGFormer: Spherical Geometry Transformer for 360 Depth Estimation [54.13459226728249]
Panoramic distortion poses a significant challenge in 360 depth estimation.
We propose a spherical geometry transformer, named SGFormer, to address the above issues.
We also present a query-based global conditional position embedding to compensate for spatial structure at varying resolutions.
arXiv Detail & Related papers (2024-04-23T12:36:24Z)
- DragD3D: Realistic Mesh Editing with Rigidity Control Driven by 2D Diffusion Priors [10.355568895429588]
Direct mesh editing and deformation are key components in the geometric modeling and animation pipeline.
Existing regularizers are not aware of the global context and semantics of the object.
We show that our deformations can be controlled to yield realistic shape deformations aware of the global context.
arXiv Detail & Related papers (2023-10-06T19:55:40Z)
- Learning Modulated Transformation in GANs [69.95217723100413]
We equip the generator in generative adversarial networks (GANs) with a plug-and-play module, termed modulated transformation module (MTM).
MTM predicts spatial offsets under the control of latent codes, based on which the convolution operation can be applied at variable locations.
It is noteworthy that towards human generation on the challenging TaiChi dataset, we improve the FID of StyleGAN3 from 21.36 to 13.60, demonstrating the efficacy of learning modulated geometry transformation.
arXiv Detail & Related papers (2023-08-29T17:51:22Z)
- Learning Transformations To Reduce the Geometric Shift in Object Detection [60.20931827772482]
We tackle geometric shifts emerging from variations in the image capture process.
We introduce a self-training approach that learns a set of geometric transformations to minimize these shifts.
We evaluate our method on two different shifts, i.e., a camera's field of view (FoV) change and a viewpoint change.
arXiv Detail & Related papers (2023-01-13T11:55:30Z)
- Surface Vision Transformers: Attention-Based Modelling applied to Cortical Analysis [8.20832544370228]
We introduce a domain-agnostic architecture to study any surface data projected onto a spherical manifold.
A vision transformer model encodes the sequence of patches via successive multi-head self-attention layers.
Experiments show that the SiT generally outperforms surface CNNs, while performing comparably on registered and unregistered data.
arXiv Detail & Related papers (2022-03-30T15:56:11Z)
- 3D Unsupervised Region-Aware Registration Transformer [13.137287695912633]
Learning robust point cloud registration models with deep neural networks has emerged as a powerful paradigm.
We propose a new design of 3D region partition module that is able to divide the input shape to different regions with a self-supervised 3D shape reconstruction loss.
Our experiments show that our 3D-URRT achieves superior registration performance over various benchmark datasets.
arXiv Detail & Related papers (2021-10-07T15:06:52Z)
- DSG-Net: Learning Disentangled Structure and Geometry for 3D Shape Generation [98.96086261213578]
We introduce DSG-Net, a deep neural network that learns a disentangled structured and geometric mesh representation for 3D shapes.
This supports a range of novel shape generation applications with disentangled control, such as varying structure (geometry) while keeping geometry (structure) unchanged.
Our method not only supports controllable generation applications but also produces high-quality synthesized shapes, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2020-08-12T17:06:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.