Related papers: Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection

URL: http://arxiv.org/abs/2507.10977v1
Date: Tue, 15 Jul 2025 04:44:54 GMT
Title: Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection
Authors: Quan Bi Pay, Vishnu Monn Baskaran, Junn Yong Loo, KokSheik Wong, Simon See
Abstract summary: We propose a wavelet attention-like backbone and a ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from convolutional filters. Our decoder aligns query embeddings with emphasized regions of interest for accurate predictions.
Score: 15.125734989910429
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human-object interaction (HOI) detection is essential for accurately localizing and characterizing interactions between humans and objects, providing a comprehensive understanding of complex visual scenes across various domains. However, existing HOI detectors often struggle to deliver reliable predictions efficiently, relying on resource-intensive training methods and inefficient architectures. To address these challenges, we conceptualize a wavelet attention-like backbone and a novel ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from diverse convolutional filters. Concurrently, the ray-based encoder facilitates multi-scale attention by optimizing the focus of the decoder on relevant regions of interest and mitigating computational overhead. As a result of harnessing the attenuated intensity of learnable ray origins, our decoder aligns query embeddings with emphasized regions of interest for accurate predictions. Experimental results on benchmark datasets, including ImageNet and HICO-DET, showcase the potential of our proposed architecture. The code is publicly available at [https://github.com/henry-pay/RayEncoder].

Related papers

Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition [71.5328300638085]
Zero-shot human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. Existing methods, including two-stage methods, tightly couple interaction recognition (IR) with a specific detector. We propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR.
arXiv Detail & Related papers (2026-02-16T19:01:31Z)
The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding [82.53463660564933]
Semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders retain high-frequency information that conveys fine-grained detail. We propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator.
arXiv Detail & Related papers (2025-12-22T18:59:57Z)
Generative MIMO Beam Map Construction for Location Recovery and Beam Tracking [67.65578956523403]
This paper proposes a generative framework to recover location labels directly from sparse channel state information (CSI) measurements. Instead of directly storing raw CSI, we learn a compact low-dimensional radio map embedding and leverage a generative model to reconstruct the high-dimensional CSI. Numerical experiments demonstrate that the proposed model can improve localization accuracy by over 30% and achieve a 20% capacity gain in non-line-of-sight (NLOS) scenarios.
arXiv Detail & Related papers (2025-11-21T07:25:49Z)
UESA-Net: U-Shaped Embedded Multidirectional Shrinkage Attention Network for Ultrasound Nodule Segmentation [12.967178888045728]
Existing networks struggle to reconcile high-level semantics with low-level spatial details. We propose UESA-Net, a U-shaped network with multidirectional shrinkage attention. On two public datasets, UESA-Net achieved state-of-the-art performance with intersection-over-union (IoU) scores of 0.8487 and 0.6495, respectively.
arXiv Detail & Related papers (2025-09-26T14:54:38Z)
Rethinking the Nested U-Net Approach: Enhancing Biomarker Segmentation with Attention Mechanisms and Multiscale Feature Fusion [2.0799865428691393]
We introduce a nested UNet architecture that captures both local and global context through Multiscale Feature Fusion and Attention Mechanisms. This design improves feature integration from encoders, highlights key channels and regions, and restores spatial details to enhance segmentation performance.
arXiv Detail & Related papers (2025-04-08T15:53:46Z)
CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information [61.1904164368732]
We propose CognitionCapturer, a unified framework that fully leverages multimodal data to represent EEG signals. Specifically, CognitionCapturer trains Modality Experts for each modality to extract cross-modal information from the EEG modality. The framework does not require any fine-tuning of the generative models and can be extended to incorporate more modalities.
arXiv Detail & Related papers (2024-12-13T16:27:54Z)
Leveraging Mixture of Experts for Improved Speech Deepfake Detection [53.69740463004446]
Speech deepfakes pose a significant threat to personal security and content authenticity. We introduce a novel approach for enhancing speech deepfake detection performance using a Mixture of Experts architecture.
arXiv Detail & Related papers (2024-09-24T13:24:03Z)
Deep Homography Estimation for Visual Place Recognition [49.235432979736395]
We propose a transformer-based deep homography estimation (DHE) network. It takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification. Experiments on benchmark datasets show that our method can outperform several state-of-the-art methods.
arXiv Detail & Related papers (2024-02-25T13:22:17Z)
Fus-MAE: A cross-attention-based data fusion approach for Masked Autoencoders in remote sensing [5.070981175240306]
Fus-MAE is a self-supervised learning framework based on masked autoencoders. Our empirical findings demonstrate that Fus-MAE can effectively compete with contrastive learning strategies tailored for SAR-optical data fusion.
arXiv Detail & Related papers (2024-01-05T11:36:21Z)
Correlation Pyramid Network for 3D Single Object Tracking [16.694809791177263]
We propose a novel Correlation Pyramid Network (CorpNet) with a unified encoder and a motion-factorized decoder. CorpNet achieves state-of-the-art results while running in real-time.
arXiv Detail & Related papers (2023-05-16T06:07:20Z)
DeepSTEP -- Deep Learning-Based Spatio-Temporal End-To-End Perception for Autonomous Vehicles [0.0]
We present our concept for an end-to-end perception architecture, named DeepSTEP. DeepSTEP processes raw sensor data from the camera, LiDAR, and RaDAR, and combines the extracted data in a deep fusion network. The architecture's end-to-end design, time-aware attention mechanism, and integration of multiple perception tasks make it a promising solution for real-world deployment.
arXiv Detail & Related papers (2023-05-11T14:13:37Z)
Multimodal Graph Learning for Deepfake Detection [10.077496841634135]
Existing deepfake detectors face several challenges in achieving robustness and generalization. We propose a novel framework, namely Multimodal Graph Learning (MGL), that leverages information from multiple modalities. Our proposed method aims to effectively identify and utilize distinguishing features for deepfake detection.
arXiv Detail & Related papers (2022-09-12T17:17:49Z)
Deep Equilibrium Assisted Block Sparse Coding of Inter-dependent Signals: Application to Hyperspectral Imaging [71.57324258813675]
A dataset of inter-dependent signals is defined as a matrix whose columns demonstrate strong dependencies. A neural network is employed to act as structure prior and reveal the underlying signal interdependencies. Deep unrolling and Deep equilibrium based algorithms are developed, forming highly interpretable and concise deep-learning-based architectures.
arXiv Detail & Related papers (2022-03-29T21:00:39Z)
Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds [155.388487263872]
We propose a new infrared small-dim target detection method with the transformer. We adopt the self-attention mechanism of the transformer to learn the interaction information of image features in a larger range. We also design a feature enhancement module to learn more features of small-dim targets.
arXiv Detail & Related papers (2021-09-29T12:23:41Z)
Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation. CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body. It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
Depthwise Non-local Module for Fast Salient Object Detection Using a Single Thread [136.2224792151324]
We propose a new deep learning algorithm for fast salient object detection. The proposed algorithm achieves competitive accuracy and high inference efficiency simultaneously with a single CPU thread.
arXiv Detail & Related papers (2020-01-22T15:23:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.