LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models
- URL: http://arxiv.org/abs/2507.06140v1
- Date: Tue, 08 Jul 2025 16:22:05 GMT
- Title: LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models
- Authors: Zhihao Chen, Tao Chen, Chenhui Wang, Qi Gao, Huidong Xie, Chuang Niu, Ge Wang, Hongming Shan
- Abstract summary: Low-dose computed tomography (LDCT) reduces radiation exposure but often degrades image quality. Recent advances in vision-language models (VLMs) suggest that language can serve as a powerful tool for capturing structured semantic information. We introduce LangMamba, a Language-driven Mamba framework for LDCT denoising.
- Score: 20.753316100847986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Low-dose computed tomography (LDCT) reduces radiation exposure but often degrades image quality, potentially compromising diagnostic accuracy. Existing deep learning-based denoising methods focus primarily on pixel-level mappings, overlooking the potential benefits of high-level semantic guidance. Recent advances in vision-language models (VLMs) suggest that language can serve as a powerful tool for capturing structured semantic information, offering new opportunities to improve LDCT reconstruction. In this paper, we introduce LangMamba, a Language-driven Mamba framework for LDCT denoising that leverages VLM-derived representations to enhance supervision from normal-dose CT (NDCT). LangMamba follows a two-stage learning strategy. First, we pre-train a Language-guided AutoEncoder (LangAE) that leverages frozen VLMs to map NDCT images into a semantic space enriched with anatomical information. Second, we synergize LangAE with two key components to guide LDCT denoising: a Semantic-Enhanced Efficient Denoiser (SEED), which enhances NDCT-relevant local semantics while capturing global features with an efficient Mamba mechanism, and a Language-engaged Dual-space Alignment (LangDA) loss, which ensures that denoised images align with NDCT in both the perceptual and semantic spaces. Extensive experiments on two public datasets demonstrate that LangMamba outperforms conventional state-of-the-art methods, significantly improving detail preservation and visual fidelity. Remarkably, LangAE exhibits strong generalizability to unseen datasets, thereby reducing training costs. Furthermore, the LangDA loss improves explainability by integrating language-guided insights into image reconstruction and can be applied in a plug-and-play fashion. Our findings shed new light on the potential of language as a supervisory signal to advance LDCT denoising. The code is publicly available at https://github.com/hao1635/LangMamba.
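To make the dual-space alignment idea concrete, the snippet below is a minimal PyTorch-style sketch of how a LangDA-like loss could be computed: the denoised image and its NDCT target are both encoded by a frozen language-guided autoencoder and compared in a continuous perceptual space and a discrete semantic-token space. The `langae.encode` interface, tensor shapes, and loss weights are illustrative assumptions, not the released implementation (see the repository above for the actual code).

```python
# Hypothetical sketch of a LangDA-style dual-space alignment loss.
# Assumes `langae.encode(x)` returns (features, token_logits):
#   features:     (B, C, H, W) continuous perceptual features
#   token_logits: (B, N, V)    logits over a codebook of V semantic tokens
import torch
import torch.nn.functional as F


def langda_loss(denoised, ndct, langae, w_percep=1.0, w_sem=1.0):
    """Align a denoised LDCT image with its NDCT target in two spaces."""
    with torch.no_grad():  # the language-guided autoencoder stays frozen
        feat_ndct, logits_ndct = langae.encode(ndct)
    # Gradients flow back to the denoiser through the frozen LangAE.
    feat_dn, logits_dn = langae.encode(denoised)

    # Perceptual-space term: match the continuous LangAE features of NDCT.
    loss_percep = F.l1_loss(feat_dn, feat_ndct)

    # Semantic-space term: push the denoised image toward the semantic tokens
    # that the frozen LangAE assigns to the NDCT image.
    target_tokens = logits_ndct.argmax(dim=-1)            # (B, N)
    loss_sem = F.cross_entropy(
        logits_dn.reshape(-1, logits_dn.size(-1)),        # (B*N, V)
        target_tokens.reshape(-1),                        # (B*N,)
    )
    return w_percep * loss_percep + w_sem * loss_sem
```

In practice such an alignment term would be added to a standard pixel-level reconstruction loss when training the denoiser; the weights above are placeholders.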
Related papers
- LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models [44.578308186225826]
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. We show that co-training an open-vocabulary detector with a large language model that generates detailed image-level captions for each image can further improve performance.
arXiv Detail & Related papers (2025-01-31T08:27:31Z)
- Decoding fMRI Data into Captions using Prefix Language Modeling [3.4328283704703866]
We present an alternative method for decoding brain signals into image captions by predicting a DINOv2 model's embedding of an image from the corresponding fMRI signal. We also explore mapping fMRI signals to the image embedding space with a 3D Convolutional Neural Network to better account for the positional information of voxels.
arXiv Detail & Related papers (2025-01-05T15:06:25Z)
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. However, LVLMs still suffer from hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
- Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG).
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
- DenoMamba: A fused state-space model for low-dose CT denoising [6.468495781611433]
Low-dose computed tomography (LDCT) lowers potential risks linked to radiation exposure. LDCT denoising is based on neural network models that learn data-driven image priors to separate noise evoked by dose reduction from underlying tissue signals. DenoMamba is a novel denoising method based on state-space modeling (SSM) that efficiently captures short- and long-range context in medical images.
arXiv Detail & Related papers (2024-09-19T21:32:07Z)
- Low-dose CT Denoising with Language-engaged Dual-space Alignment [21.172319554618497]
We propose a plug-and-play Language-Engaged Dual-space Alignment loss (LEDA) to optimize low-dose CT denoising models.
Our idea is to leverage large language models (LLMs) to align denoised CT and normal dose CT images in both the continuous perceptual space and discrete semantic space.
LEDA involves two steps: the first is to pretrain an LLM-guided CT autoencoder, which can encode a CT image into continuous high-level features and quantize them into a token space to produce semantic tokens.
arXiv Detail & Related papers (2024-03-10T08:21:50Z)
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- Self-supervised Noise2noise Method Utilizing Corrupted Images with a Modular Network for LDCT Denoising [9.794579903055668]
Deep learning is a promising technique for low-dose computed tomography (LDCT) image denoising.
Traditional deep learning methods require paired noisy and clean datasets.
This paper proposes a new method for performing LDCT image denoising with only LDCT data.
arXiv Detail & Related papers (2023-08-13T11:26:56Z)
- Planting a SEED of Vision in Large Language Model [73.17530130368053]
We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the ability to SEE and Draw at the same time.
This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs.
arXiv Detail & Related papers (2023-07-16T13:41:39Z)
- Cross-Modal Causal Intervention for Medical Report Generation [107.76649943399168]
Radiology Report Generation (RRG) is essential for computer-aided diagnosis and medication guidance. Generating accurate lesion descriptions remains challenging due to spurious correlations from visual-linguistic biases. We propose a two-stage framework named CrossModal Causal Representation Learning (CMCRL). Experiments on IU-Xray and MIMIC-CXR show that our CMCRL pipeline significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-03-16T07:23:55Z)
- PCRLv2: A Unified Visual Information Preservation Framework for Self-supervised Pre-training in Medical Image Analysis [56.63327669853693]
We propose to incorporate the task of pixel restoration for explicitly encoding more pixel-level information into high-level semantics.
We also address the preservation of scale information, a powerful tool in aiding image understanding.
The proposed unified SSL framework surpasses its self-supervised counterparts on various tasks.
arXiv Detail & Related papers (2023-01-02T17:47:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.