MD-BERT: Action Recognition in Dark Videos via Dynamic Multi-Stream Fusion and Temporal Modeling
- URL: http://arxiv.org/abs/2502.03724v1
- Date: Thu, 06 Feb 2025 02:26:47 GMT
- Title: MD-BERT: Action Recognition in Dark Videos via Dynamic Multi-Stream Fusion and Temporal Modeling
- Authors: Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C.-W. Phan
- Abstract summary: This paper proposes a novel multi-stream approach that integrates complementary pre-processing techniques such as gamma correction and histogram equalization alongside raw dark frames. Extensive experiments on the ARID V1.0 and ARID V1.5 dark video datasets show that MD-BERT outperforms existing methods.
- Score: 4.736059095502584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action recognition in dark, low-light (under-exposed) or noisy videos is a challenging task due to visibility degradation, which can hinder critical spatiotemporal details. This paper proposes MD-BERT, a novel multi-stream approach that integrates complementary pre-processing techniques such as gamma correction and histogram equalization alongside raw dark frames to address these challenges. We introduce the Dynamic Feature Fusion (DFF) module, extending existing attentional fusion methods to a three-stream setting, thereby capturing fine-grained and global contextual information across different brightness and contrast enhancements. The fused spatiotemporal features are then processed by a BERT-based temporal model, which leverages its bidirectional self-attention to effectively capture long-range dependencies and contextual relationships across frames. Extensive experiments on the ARID V1.0 and ARID V1.5 dark video datasets show that MD-BERT outperforms existing methods, establishing a new state-of-the-art performance. Ablation studies further highlight the individual contributions of each input stream and the effectiveness of the proposed DFF and BERT modules. The official website of this work is available at: https://github.com/HrishavBakulBarua/DarkBERT
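The abstract names gamma correction and histogram equalization as the two enhancement streams fed alongside raw dark frames. Below is a minimal sketch of that pre-processing step; the gamma value (2.2) and luminance-only equalization are illustrative assumptions, not the authors' reported settings.

```python
# Minimal sketch of three-stream pre-processing for dark frames.
# Gamma value and per-channel strategy are assumptions for illustration.
import cv2
import numpy as np

def gamma_correct(frame: np.ndarray, gamma: float = 2.2) -> np.ndarray:
    """Brighten a dark uint8 BGR frame via power-law (gamma) correction."""
    lut = ((np.arange(256) / 255.0) ** (1.0 / gamma) * 255).astype(np.uint8)
    return cv2.LUT(frame, lut)

def hist_equalize(frame: np.ndarray) -> np.ndarray:
    """Equalize the luminance channel only, to avoid color shifts."""
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    ycrcb[..., 0] = cv2.equalizeHist(ycrcb[..., 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

def three_streams(frame: np.ndarray):
    """Return (raw, gamma-corrected, histogram-equalized) views of one frame."""
    return frame, gamma_correct(frame), hist_equalize(frame)
```

Per the abstract, each stream is then encoded separately, combined by the DFF module, and passed to the BERT-based temporal model.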
Related papers
- TiS-TSL: Image-Label Supervised Surgical Video Stereo Matching via Time-Switchable Teacher-Student Learning [43.682537904776616]
TiS-TSL is a time-switchable teacher-student learning framework for video stereo matching under minimal supervision. A unified model operates in three distinct modes: Image-Prediction (IP), Forward Video-Prediction (FVP), and Backward Video-Prediction (BVP).
arXiv Detail & Related papers (2025-11-10T08:01:26Z) - Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models [31.138079872368532]
Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges. Recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models. We introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external features from neighboring frames.
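As a rough illustration of the idea in this summary (not CREPA's actual objective), one can pull each frame's projected hidden state toward a frozen encoder's features of its neighboring frames; the projection head and cosine form below are assumptions.

```python
# Illustrative cross-frame alignment regularizer (assumed form).
import torch
import torch.nn.functional as F

def cross_frame_alignment_loss(hidden, external, proj):
    """
    hidden:   (B, T, D_model) diffusion-model hidden states per frame
    external: (B, T, D_feat)  frozen external-encoder features per frame
    proj:     nn.Module mapping D_model -> D_feat
    """
    z = F.normalize(proj(hidden), dim=-1)
    e = F.normalize(external, dim=-1)
    # Align frame t with its neighbors t-1 and t+1 via cosine distance.
    loss = (1 - (z[:, 1:] * e[:, :-1]).sum(-1)).mean() \
         + (1 - (z[:, :-1] * e[:, 1:]).sum(-1)).mean()
    return loss
```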
arXiv Detail & Related papers (2025-06-10T20:34:47Z) - REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training [58.33728862521732]
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy, representation alignment (REPA), which matches DiT hidden features to those of a non-generative teacher (e.g. DINO), dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We introduce HASTE, an early-stopped, holistic alignment scheme that addresses this.
arXiv Detail & Related papers (2025-05-22T15:34:33Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so as to distill from task-relevant representations only. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - VideoFusion: A Spatio-Temporal Collaborative Network for Mutli-modal Video Fusion and Restoration [26.59510171451438]
Existing multi-sensor fusion research predominantly integrates complementary information from multiple images rather than videos.
VideoFusion exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos.
Extensive experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequential scenarios.
arXiv Detail & Related papers (2025-03-30T08:27:18Z) - FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability.
We propose Self-supervised Transfer (PST) and a Frequency-Decoupled Fusion module (FreDF).
PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models.
FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
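The frequency decoupling described here can be pictured with a simple Fourier-domain split; the cutoff radius and FFT-based formulation below are illustrative assumptions, not FreDF's actual design.

```python
# Assumed low-pass/high-pass split of feature maps in the Fourier domain.
import torch

def frequency_split(x: torch.Tensor, cutoff: float = 0.1):
    """Split (B, C, H, W) features into low- and high-frequency parts."""
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    B, C, H, W = x.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-0.5, 0.5, H), torch.linspace(-0.5, 0.5, W),
        indexing="ij")
    mask = ((xx**2 + yy**2).sqrt() <= cutoff).float().to(x.device)
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real
    return low, x - low  # low-freq structure, high-freq edges
```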
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition [3.271109623410664]
We propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations.
Our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications.
arXiv Detail & Related papers (2025-03-10T12:43:51Z) - Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z) - VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval [8.908777234657046]
Large language and vision-language models (LLMs/LVLMs) have gained prominence across various domains. Here we propose VideoLights, a novel framework for joint highlight detection and moment retrieval (HD/MR) that addresses these limitations through (i) Convolutional Projection and Feature Refinement modules. Comprehensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2024-12-02T14:45:53Z) - DAE-Fuse: An Adaptive Discriminative Autoencoder for Multi-Modality Image Fusion [10.713089596405053]
We propose DAE-Fuse, a novel two-phase discriminative autoencoder framework that generates sharp and natural fused images. We pioneer the extension of image fusion techniques from static images to the video domain. DAE-Fuse achieves state-of-the-art performance on multiple benchmarks, with superior generalizability to tasks like medical image fusion.
arXiv Detail & Related papers (2024-09-16T08:37:09Z) - DemMamba: Alignment-free Raw Video Demoireing with Frequency-assisted Spatio-Temporal Mamba [18.06907326360215]
Moiré patterns, resulting from the interference of two similar repetitive patterns, are frequently observed during the capture of images or videos on screens.
This paper introduces a novel alignment-free raw video demoireing network with a frequency-assisted spatio-temporal Mamba.
Our proposed DemMamba surpasses state-of-the-art methods by 1.3 dB in PSNR, and also provides a satisfactory visual experience.
arXiv Detail & Related papers (2024-08-20T09:31:03Z) - DL-KDD: Dual-Light Knowledge Distillation for Action Recognition in the Dark [2.941253902145271]
We propose a teacher-student video classification framework, named Dual-Light KnowleDge Distillation for Action Recognition in the Dark (DL-KDD).
This framework enables the model to learn from both original and enhanced video without introducing additional computational cost during inference.
In our experiments, the proposed DL-KDD framework outperforms state-of-the-art methods on the ARID, ARID V1.5, and Dark-48 datasets.
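The dual-light idea can be sketched as standard logit distillation: the teacher classifies light-enhanced frames while the student sees only raw dark frames and mimics the teacher, so inference needs no enhancement step. The temperature and loss weights below are illustrative assumptions, not the paper's settings.

```python
# Assumed form of the distillation objective: hard-label CE on the dark
# student plus soft-label KL against a teacher fed enhanced frames.
import torch.nn.functional as F

def dl_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    return alpha * hard + (1 - alpha) * soft
```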
arXiv Detail & Related papers (2024-06-04T16:38:06Z) - MaeFuse: Transferring Omni Features with Pretrained Masked Autoencoders for Infrared and Visible Image Fusion via Guided Training [57.18758272617101]
MaeFuse is a novel autoencoder model designed for infrared and visible image fusion (IVIF).
Our model utilizes a pretrained encoder from Masked Autoencoders (MAE), which facilitates omni feature extraction for low-level reconstruction and high-level vision tasks.
MaeFuse not only introduces a novel perspective in the realm of fusion techniques but also stands out with impressive performance across various public datasets.
arXiv Detail & Related papers (2024-04-17T02:47:39Z) - ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge [63.00793292863]
ToddlerDiffusion is a novel approach to decomposing the complex task of RGB image generation into simpler, interpretable stages.
Our method, termed ToddlerDiffusion, cascades modality-specific models, each responsible for generating an intermediate representation.
ToddlerDiffusion consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-11-24T15:20:01Z) - NORM: Knowledge Distillation via N-to-One Representation Matching [18.973254404242507]
We present a new two-stage knowledge distillation method, which relies on a simple Feature Transform (FT) module consisting of two linear layers.
To preserve the intact information learnt by the teacher network, our FT module is inserted only after the last convolutional layer of the student network.
By sequentially splitting the expanded student representation into N non-overlapping feature segments having the same number of feature channels as the teacher's, they can be readily forced to approximate the intact teacher representation simultaneously.
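A minimal sketch of this N-to-one matching follows: a two-linear-layer FT module expands the student's pooled last-layer features to N times the teacher's channel count, splits them into N segments, and regresses each segment onto the intact teacher representation. The pooled (B, C) feature shape and MSE objective are assumptions for illustration.

```python
# Assumed sketch of N-to-one representation matching with a two-layer FT.
import torch
import torch.nn as nn

class FeatureTransform(nn.Module):
    def __init__(self, c_student: int, c_teacher: int, n: int):
        super().__init__()
        self.n, self.c_teacher = n, c_teacher
        self.ft = nn.Sequential(
            nn.Linear(c_student, c_student),
            nn.Linear(c_student, n * c_teacher),
        )

    def forward(self, feat_s: torch.Tensor, feat_t: torch.Tensor):
        # feat_s: (B, C_s), feat_t: (B, C_t) pooled last-layer features.
        segments = self.ft(feat_s).view(-1, self.n, self.c_teacher)
        # Force every segment to approximate the intact teacher features.
        return ((segments - feat_t.detach().unsqueeze(1)) ** 2).mean()
```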
arXiv Detail & Related papers (2023-05-23T08:15:45Z) - Searching a Compact Architecture for Robust Multi-Exposure Image Fusion [55.37210629454589]
Two major stumbling blocks hinder development: pixel misalignment and inefficient inference.
This study introduces an architecture search-based paradigm incorporating self-alignment and detail repletion modules for robust multi-exposure image fusion.
The proposed method outperforms various competitive schemes, achieving a noteworthy 3.19% improvement in PSNR for general scenarios and an impressive 23.5% enhancement in misaligned scenarios.
arXiv Detail & Related papers (2023-05-20T17:01:52Z) - TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition [68.53072549422775]
We propose a student-teacher semi-supervised learning framework, TimeBalance.
We distill the knowledge from a temporally-invariant and a temporally-distinctive teacher.
Our method achieves state-of-the-art performance on three action recognition benchmarks.
arXiv Detail & Related papers (2023-03-28T19:28:54Z) - A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition [24.02488085447691]
We introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp and provides additional temporal regularization for motion recognition.
In addition, a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, is proposed for video representation learning.
arXiv Detail & Related papers (2022-11-16T19:00:23Z) - Aligning Logits Generatively for Principled Black-Box Knowledge Distillation [49.43567344782207]
Black-Box Knowledge Distillation (B2KD) is a formulated problem for cloud-to-edge model compression with invisible data and models hosted on the server.
We formalize a two-step workflow consisting of deprivatization and distillation.
We propose a new method Mapping-Emulation KD (MEKD) that distills a black-box cumbersome model into a lightweight one.
arXiv Detail & Related papers (2022-05-21T02:38:16Z) - Delta Distillation for Efficient Video Processing [68.81730245303591]
We propose a novel knowledge distillation scheme coined Delta Distillation.
We demonstrate that these temporal variations can be effectively distilled due to the temporal redundancies within video frames.
As a by-product, delta distillation improves the temporal consistency of the teacher model.
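One illustrative reading of this idea (an assumption, not the paper's exact objective): supervise the student on the teacher's frame-to-frame feature changes rather than on raw per-frame features, exploiting temporal redundancy.

```python
# Assumed delta-matching objective over per-frame features.
import torch

def delta_distillation_loss(feat_s: torch.Tensor, feat_t: torch.Tensor):
    """feat_s, feat_t: (B, T, D) per-frame features of student/teacher."""
    delta_s = feat_s[:, 1:] - feat_s[:, :-1]
    delta_t = feat_t[:, 1:] - feat_t[:, :-1]
    return ((delta_s - delta_t.detach()) ** 2).mean()
```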
arXiv Detail & Related papers (2022-03-17T20:13:30Z) - End-to-end Multi-modal Video Temporal Grounding [105.36814858748285]
We propose a multi-modal framework to extract complementary information from videos.
We adopt RGB images for appearance, optical flow for motion, and depth maps for image structure.
We conduct experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-12T17:58:10Z) - EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation for many applications, such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
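A minimal residual-based fusion block in the spirit of this summary might look as follows; the layer layout and the camera/LiDAR naming are assumptions for illustration.

```python
# Assumed residual fusion of two modality streams' feature maps.
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, cam: torch.Tensor, lidar: torch.Tensor):
        # cam, lidar: (B, C, H, W) per-modality feature maps.
        # Refine the camera stream with a residual computed from both.
        return cam + self.mix(torch.cat([cam, lidar], dim=1))
```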
arXiv Detail & Related papers (2021-06-21T10:47:26Z) - TCLR: Temporal Contrastive Learning for Video Representation [49.6637562402604]
We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods.
With the commonly used 3D-ResNet-18 architecture, we achieve 82.4% (+5.1% increase over the previous best) top-1 accuracy on UCF101 and 52.9% (+5.4% increase) on HMDB51 action classification.
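For intuition, a generic temporal contrastive term can be written as InfoNCE where clips from the same timestep under two augmentations are positives and other timesteps of the same video are negatives; this is an assumed form, not the paper's exact pair of losses.

```python
# Generic InfoNCE over temporally indexed clip embeddings (assumed form).
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau=0.1):
    """z1, z2: (T, D) embeddings of T clips under two augmentations."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau  # (T, T) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)
```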
arXiv Detail & Related papers (2021-01-20T05:38:16Z) - ALP-KD: Attention-Based Layer Projection for Knowledge Distillation [30.896957367331137]
Two neural networks, namely a teacher and a student, are coupled together during training.
The teacher network is supposed to be a trustworthy predictor and the student tries to mimic its predictions.
In such a setting, distillation only happens at the final predictions, whereas the student could also benefit from the teacher's supervision for internal components.
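One way to picture attention-based layer projection (a sketch under assumed equal hidden sizes, not ALP-KD's exact formulation): each student layer's target is an attention-weighted combination of all teacher layers rather than a fixed one-to-one mapping.

```python
# Assumed sketch: attention-weighted teacher-layer targets per student layer.
import torch
import torch.nn.functional as F

def alp_targets(student_layers: torch.Tensor, teacher_layers: torch.Tensor):
    """
    student_layers: (Ls, B, D) hidden states of the student's Ls layers
    teacher_layers: (Lt, B, D) hidden states of the teacher's Lt layers
    Returns one projected teacher target per student layer, shape (Ls, B, D).
    """
    # Per-sample similarity of each student layer to each teacher layer.
    sim = torch.einsum("sbd,tbd->bst", student_layers, teacher_layers)
    attn = F.softmax(sim / student_layers.size(-1) ** 0.5, dim=-1)
    return torch.einsum("bst,tbd->sbd", attn, teacher_layers.detach())
```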
arXiv Detail & Related papers (2020-12-27T22:30:13Z)