MI-DETR: A Strong Baseline for Moving Infrared Small Target Detection with Bio-Inspired Motion Integration
- URL: http://arxiv.org/abs/2603.05071v1
- Date: Thu, 05 Mar 2026 11:39:31 GMT
- Title: MI-DETR: A Strong Baseline for Moving Infrared Small Target Detection with Bio-Inspired Motion Integration
- Authors: Nian Liu, Jin Gao, Shubo Lin, Yutong Kou, Sikui Zhang, Fudong Ge, Zhiqiang Pu, Liang Li, Gang Wang, Yizheng Wang, Weiming Hu
- Abstract summary: We propose Motion Integration DETR (MI-DETR), a bio-inspired dual-pathway detector for infrared small target detection. First, a retina-inspired cellular automaton (RCA) converts raw frame sequences into a motion map defined on the same pixel grid as the appearance image. Second, a Parvocellular-Magnocellular Interconnection (PMI) Block facilitates bidirectional feature interaction between the two pathways.
- Score: 63.87179575890912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Infrared small target detection (ISTD) is challenging because tiny, low-contrast targets are easily obscured by complex and dynamic backgrounds. Conventional multi-frame approaches typically learn motion implicitly through deep neural networks, often requiring additional motion supervision or explicit alignment modules. We propose Motion Integration DETR (MI-DETR), a bio-inspired dual-pathway detector that processes one infrared frame per time step while explicitly modeling motion. First, a retina-inspired cellular automaton (RCA) converts raw frame sequences into a motion map defined on the same pixel grid as the appearance image, enabling parvocellular-like appearance and magnocellular-like motion pathways to be supervised by a single set of bounding boxes without extra motion labels or alignment operations. Second, a Parvocellular-Magnocellular Interconnection (PMI) Block facilitates bidirectional feature interaction between the two pathways, providing a biologically motivated intermediate interconnection mechanism. Finally, an RT-DETR decoder operates on features from the two pathways to produce detection results. This simple yet effective approach yields surprisingly strong performance on three commonly used ISTD benchmarks. MI-DETR achieves 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 over the best multi-frame baseline), 98.0% mAP@50 on DAUB-R, and 88.3% mAP@50 on ITSDT-15K, demonstrating the effectiveness of biologically inspired motion-appearance integration. Code is available at https://github.com/nliu-25/MI-DETR.
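The abstract does not give the RCA's actual update rule, but the motion-map idea (a per-pixel cell state that accumulates evidence of temporal change and decays otherwise, on the same grid as the appearance image) can be sketched with a simple frame-differencing automaton. The decay/threshold rule below is an assumption for illustration, not the paper's RCA:

```python
import numpy as np

def rca_motion_map(frames, decay=0.8, thresh=0.1):
    """Toy retina-inspired motion map: each pixel cell is excited when
    its intensity changes between frames and decays otherwise.
    Illustrative only; the paper's actual RCA rule differs."""
    state = np.zeros_like(frames[0], dtype=np.float64)
    prev = frames[0].astype(np.float64)
    for frame in frames[1:]:
        cur = frame.astype(np.float64)
        diff = np.abs(cur - prev)                 # per-pixel temporal change
        state = decay * state + (diff > thresh)   # excite on change, decay otherwise
        prev = cur
    return state / state.max() if state.max() > 0 else state

# A small bright dot moving over a static background leaves a high
# response along its trajectory, while static pixels stay at zero.
frames = [np.zeros((8, 8)) for _ in range(4)]
for t, f in enumerate(frames):
    f[2, 2 + t] = 1.0          # dot moves one pixel per frame
m = rca_motion_map(frames)
print(m.shape)                 # same pixel grid as the appearance image
```

The output map could then feed a magnocellular-like motion pathway alongside the raw frame's parvocellular-like appearance pathway, with both supervised by the same bounding boxes.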
Related papers
- UNIV: Unified Foundation Model for Infrared and Visible Modalities [12.0490466425884]
We propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV).
PCCL is an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition.
Our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing.
arXiv Detail & Related papers (2025-09-19T06:07:53Z)
- Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves the state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z)
- Representation Learning for Compressed Video Action Recognition via Attentive Cross-modal Interaction with Motion Enhancement [28.570085937225976]
This paper proposes a novel framework, namely Attentive Cross-modal Interaction Network with Motion Enhancement.
It follows the two-stream architecture, i.e. one stream for the RGB modality and the other for the motion modality.
Experiments on the UCF-101, HMDB-51 and Kinetics-400 benchmarks demonstrate the effectiveness and efficiency of MEACI-Net.
arXiv Detail & Related papers (2022-05-07T06:26:49Z)
- MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking [72.65494220685525]
We propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data.
We generate dynamic modality-aware filters with two independent networks. The visible and thermal filters are then used to conduct dynamic convolutional operations on their corresponding input feature maps.
To address issues caused by heavy occlusion, fast motion, and out-of-view, we propose to conduct a joint local and global search by exploiting a new direction-aware target-driven attention mechanism.
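The core dynamic-convolution idea here, convolution weights generated from the input itself rather than fixed, can be sketched as follows. The pooling-based filter generator is a stand-in assumption for illustration; MFGNet uses learned generator networks, one per modality:

```python
import numpy as np

def dynamic_conv(feat, k=3):
    """Sketch of dynamic filtering: the kernel is derived from the
    input (here by average-pooling it to k x k), then applied as a
    plain valid correlation. Illustrative; MFGNet's generators are
    learned networks, not pooling."""
    h, w = feat.shape
    # Input-conditioned kernel: different inputs yield different filters.
    kernel = feat.reshape(k, h // k, k, w // k).mean(axis=(1, 3))
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feat[i:i + k, j:j + k] * kernel)
    return out

vis = np.random.default_rng(0).random((6, 6))   # stand-in visible feature map
print(dynamic_conv(vis).shape)                  # (4, 4)
```

In the paper's setting, one such input-conditioned filtering step would run on the visible features and another, independently generated, on the thermal features.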
arXiv Detail & Related papers (2021-07-22T03:10:51Z)
- Distractor-Aware Fast Tracking via Dynamic Convolutions and MOT Philosophy [63.91005999481061]
A practical long-term tracker typically has three key properties: an efficient model design, an effective global re-detection strategy, and a robust distractor-awareness mechanism.
We propose a two-task tracking framework (named DMTrack) to achieve distractor-aware fast tracking via dynamic convolutions (d-convs) and multiple-object-tracking (MOT) philosophy.
Our tracker achieves state-of-the-art performance on the LaSOT, OxUvA, TLP, VOT2018LT and VOT2019LT benchmarks and runs in real time (3x faster).
arXiv Detail & Related papers (2021-04-25T00:59:53Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
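The channel-wise gating idea behind CME can be sketched with a hand-crafted gate. This toy version derives the gate directly from inter-frame feature differences, whereas CME learns its gate vector; the sigmoid choice and difference-based statistic are assumptions for illustration:

```python
import numpy as np

def channel_motion_gate(feat_t, feat_t1):
    """Toy channel-wise motion gate: channels whose content changes
    more between adjacent frames get larger weights. Illustrative;
    CME learns the gate rather than computing it from differences."""
    diff = np.abs(feat_t1 - feat_t).mean(axis=(1, 2))   # per-channel change, shape (C,)
    gate = 1.0 / (1.0 + np.exp(-diff))                  # sigmoid gate in (0, 1)
    return feat_t1 * gate[:, None, None]                # re-weight channels

rng = np.random.default_rng(0)
static = rng.random((2, 4, 4))   # stand-in feature maps, C=2 channels
moving = static.copy()
moving[1] += 1.0                 # channel 1 changes between frames
gated = channel_motion_gate(static, moving)
print(gated.shape)               # (2, 4, 4)
```

Here the unchanged channel is damped (gate 0.5) while the changed channel is emphasized, which is the qualitative behavior the CME module aims for.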
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
- A Bioinspired Approach-Sensitive Neural Network for Collision Detection in Cluttered and Dynamic Backgrounds [19.93930316898735]
Rapid, accurate, and robust detection of looming objects in moving backgrounds is a significant and challenging problem for robotic visual systems.
Inspired by the neural circuits of elementary vision in the mammalian retina, this paper proposes a bioinspired approach-sensitive neural network (AS).
The proposed model not only detects collisions accurately and robustly in cluttered and dynamic backgrounds, but also extracts richer collision information, such as position and direction, to guide rapid decision making.
arXiv Detail & Related papers (2021-03-01T09:16:18Z)
- Actions as Moving Points [66.21507857877756]
We present a conceptually simple, efficient, and more precise action tubelet detection framework, termed MovingCenter Detector (MOC-detector).
Based on the insight that movement information could simplify and assist action tubelet detection, our MOC-detector is composed of three crucial head branches.
Our MOC-detector outperforms the existing state-of-the-art methods for both metrics of frame-mAP and video-mAP on the JHMDB and UCF101-24 datasets.
arXiv Detail & Related papers (2020-01-14T03:29:44Z)
- A Time-Delay Feedback Neural Network for Discriminating Small, Fast-Moving Targets in Complex Dynamic Environments [8.645725394832969]
Discriminating small moving objects within complex visual environments is a significant challenge for autonomous micro robots.
We propose an STMD-based neural network with feedback connection (Feedback STMD), where the network output is temporally delayed, then fed back to the lower layers to mediate neural responses.
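The delayed-feedback mechanism can be sketched as a single unit whose own output, delayed and negatively weighted, is fed back to its input. This scalar toy (the delay and gain values are assumptions) only illustrates why sustained responses get suppressed while brief transients pass:

```python
import numpy as np

def feedback_stmd(signal, delay=2, gain=-0.5):
    """Toy time-delay feedback loop: the output, delayed by `delay`
    steps, is fed back with negative gain. Illustrative only; the
    paper's Feedback STMD feeds delayed output back to the lower
    layers of a full STMD network, not a scalar unit."""
    out = np.zeros(len(signal))
    for t in range(len(signal)):
        fb = gain * out[t - delay] if t >= delay else 0.0
        out[t] = max(signal[t] + fb, 0.0)   # rectified response
    return out

sustained = np.ones(8)                  # long-lasting stimulus
brief = np.zeros(8); brief[3] = 1.0     # transient stimulus
print(feedback_stmd(sustained))
print(feedback_stmd(brief))
```

With these values the sustained input's response drops once the feedback arrives, while the single-step transient passes through at full strength, matching the paper's motivation of favoring small, fast-moving targets over slowly varying backgrounds.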
arXiv Detail & Related papers (2019-12-29T03:10:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.