Related papers: ModalFormer: Multimodal Transformer for Low-Light Image Enhancement

ModalFormer: Multimodal Transformer for Low-Light Image Enhancement

URL: http://arxiv.org/abs/2507.20388v1
Date: Sun, 27 Jul 2025 19:07:22 GMT
Title: ModalFormer: Multimodal Transformer for Low-Light Image Enhancement
Authors: Alexandru Brateanu, Raul Balmez, Ciprian Orhei, Codruta Ancuti, Cosmin Ancuti,
Abstract summary: Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions.<n>Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities.<n>We present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance.
Score: 42.56657385578874
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features--including deep feature embeddings, segmentation information, geometric cues, and color information--to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer's state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer.

Related papers

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models [30.494968865008513]
Recent text-to-image models struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex image generation.<n>We propose MENTOR, a novel framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation.<n>Our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods.
arXiv Detail & Related papers (2025-07-13T10:52:59Z)
MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching [54.740256498985026]
Keypoint detection and description methods often struggle with multimodal data.<n>We propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching.
arXiv Detail & Related papers (2025-01-20T06:56:30Z)
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models [79.59567114769513]
We introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images.<n>Our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models.
arXiv Detail & Related papers (2025-01-10T07:56:23Z)
Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter. We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another. Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
Generalizing Supervised Deep Learning MRI Reconstruction to Multiple and Unseen Contrasts using Meta-Learning Hypernetworks [1.376408511310322]
This work aims to develop a multimodal meta-learning model for image reconstruction. Our proposed model has hypernetworks that evolve to generate mode-specific weights. Experiments on MRI reconstruction show that our model exhibits superior reconstruction performance over joint training.
arXiv Detail & Related papers (2023-07-13T14:22:59Z)
Dynamic Enhancement Network for Partial Multi-modality Person Re-identification [52.70235136651996]
We design a novel dynamic enhancement network (DENet), which allows missing arbitrary modalities while maintaining the representation ability of multiple modalities. Since the missing state might be changeable, we design a dynamic enhancement module, which dynamically enhances modality features according to the missing state in an adaptive manner.
arXiv Detail & Related papers (2023-05-25T06:22:01Z)
Multi-modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion [59.19469551774703]
Infrared and visible image fusion aims to integrate comprehensive information from multiple sources to achieve superior performances on various practical tasks. We propose a dynamic image fusion framework with a multi-modal gated mixture of local-to-global experts. Our model consists of a Mixture of Local Experts (MoLE) and a Mixture of Global Experts (MoGE) guided by a multi-modal gate.
arXiv Detail & Related papers (2023-02-02T20:06:58Z)
Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones. Existing (supervised learning) methods often require a large number of paired multi-modal data to train an effective synthesis model. We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
MAFNet: A Multi-Attention Fusion Network for RGB-T Crowd Counting [40.4816930622052]
We propose a two-stream RGB-T crowd counting network called Multi-Attention Fusion Network (MAFNet) In the encoder part, a Multi-Attention Fusion (MAF) module is embedded into different stages of the two modality-specific branches for cross-modal fusion. Extensive experiments on two popular datasets show that the proposed MAFNet is effective for RGB-T crowd counting.
arXiv Detail & Related papers (2022-08-14T02:42:09Z)
Towards Reliable Image Outpainting: Learning Structure-Aware Multimodal Fusion with Depth Guidance [49.94504248096527]
We propose a Depth-Guided Outpainting Network (DGONet) to model the feature representations of different modalities. Two components are designed to implement: 1) The Multimodal Learning Module produces unique depth and RGB feature representations from perspectives of different modal characteristics. We specially design an additional constraint strategy consisting of Cross-modal Loss and Edge Loss to enhance ambiguous contours and expedite reliable content generation.
arXiv Detail & Related papers (2022-04-12T06:06:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.