U-VLM: Hierarchical Vision Language Modeling for Report Generation
- URL: http://arxiv.org/abs/2603.00479v1
- Date: Sat, 28 Feb 2026 05:43:11 GMT
- Title: U-VLM: Hierarchical Vision Language Modeling for Report Generation
- Authors: Pengcheng Shi, Minghui Zhang, Kehan Song, Jiaqi Liu, Yun Gu, Xinglin Zhang,
- Abstract summary: We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture. U-VLM achieves state-of-the-art performance on CT-RATE and AbdomenAtlas 3.0 using only a 0.1B decoder trained from scratch.
- Score: 20.09433657986766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated radiology report generation is key for reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale information. We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture: (1) progressive training from segmentation to classification to report generation, and (2) multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each training stage can leverage different datasets without unified annotations. U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and AbdomenAtlas 3.0 (F1: 0.624 vs 0.518 for segmentation-based detection) using only a 0.1B decoder trained from scratch, demonstrating that well-designed vision encoder pretraining outweighs the benefits of 7B+ pre-trained language models. Ablation studies show that progressive pretraining significantly improves F1, while multi-layer injection improves BLEU-mean. Code is available at https://github.com/yinghemedical/U-VLM.
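The multi-layer injection idea described in the abstract can be sketched roughly as follows. This is not the authors' code: the encoder stand-in, the hidden size, the depth, and the coarsest-scale-to-deepest-layer routing rule are all illustrative assumptions, shown only to make the "route U-Net encoder features to corresponding language model layers" idea concrete.

```python
# Hedged sketch (not the authors' implementation) of multi-layer visual
# injection: a U-Net-style encoder yields features at several scales; each
# scale is projected to the decoder's hidden size and added to the hidden
# state at an assigned decoder layer, instead of injecting visual tokens
# only at the input layer. All names and dimensions are illustrative.
import numpy as np

HIDDEN = 64      # decoder hidden size (assumed for illustration)
N_LAYERS = 6     # decoder depth (assumed for illustration)

def unet_encoder_features(volume, scales=(32, 16, 8)):
    """Stand-in for a segmentation-pretrained U-Net encoder:
    returns one pooled feature vector per scale."""
    rng = np.random.default_rng(0)
    return [rng.standard_normal(c) for c in scales]

def route_features(features, n_layers):
    """Map each encoder scale to a decoder layer index
    (here: coarser scales go to deeper layers -- an assumption)."""
    step = max(1, n_layers // len(features))
    return {min(i * step + step - 1, n_layers - 1): f
            for i, f in enumerate(features)}

def decoder_with_injection(tokens, routed, projections):
    """Toy decoder: each 'layer' is a nonlinearity standing in for a
    transformer block; routed visual features are projected and added
    at their assigned layer."""
    h = tokens
    for layer in range(N_LAYERS):
        h = np.tanh(h)  # stand-in for a transformer layer
        if layer in routed:
            h = h + projections[layer] @ routed[layer]
    return h

feats = unet_encoder_features(None)
routed = route_features(feats, N_LAYERS)
projections = {l: np.full((HIDDEN, f.shape[0]), 0.01)
               for l, f in routed.items()}
out = decoder_with_injection(np.zeros(HIDDEN), routed, projections)
print(sorted(routed))  # decoder layers that receive visual features
print(out.shape)
```

Under this routing rule, three encoder scales land on layers 1, 3, and 5 of a 6-layer decoder, so visual information is refreshed throughout the depth of the language model rather than diluted from the input layer onward.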
Related papers
- DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning [94.62097655403683]
We propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action framework. Our method jointly performs spatial understanding, 3D perception, prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel. With only a 0.5B Qwen2.5 model as the MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models.
arXiv Detail & Related papers (2025-12-14T18:45:54Z) - Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging [19.44554736205812]
We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement. It improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation. It reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis.
arXiv Detail & Related papers (2025-10-23T15:13:13Z) - Comprehensive language-image pre-training for 3D medical image understanding [40.12276593119101]
Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders. We develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification.
arXiv Detail & Related papers (2025-10-16T18:01:31Z) - More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era [7.5669441185108015]
Large Language Models (LLMs) can facilitate large-scale supervised pre-training. LLMs can extract diagnostic labels from radiology reports with remarkable precision. We show that supervised pre-training fundamentally improves contrastive vision-language alignment.
arXiv Detail & Related papers (2025-09-16T15:27:14Z) - Evaluating Cell Type Inference in Vision Language Models Under Varying Visual Context [0.16385815610837165]
Vision-Language Models (VLMs) have rapidly advanced alongside Large Language Models (LLMs). This study evaluates the capabilities of prominent generative VLMs, such as GPT-4.1 and Gemini 2.5 Pro, for histopathology image classification tasks.
arXiv Detail & Related papers (2025-06-15T01:50:16Z) - Perception Encoder: The best visual embeddings are not at the output of the network [70.86738083862099]
We introduce Perception Encoder (PE), a vision encoder for image and video understanding trained via simple vision-language learning. We find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. Together, our PE family of models achieves best-in-class results on a wide variety of tasks.
arXiv Detail & Related papers (2025-04-17T17:59:57Z) - TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z) - CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural languages.
arXiv Detail & Related papers (2024-04-23T17:59:01Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding [96.95120198412395]
We introduce tri-modal pre-training framework that automatically generates holistic language descriptions for 3D shapes.
It only needs 3D data as input, eliminating the need for any manual 3D annotations, and is therefore scalable to large datasets.
We conduct experiments on two large-scale 3D datasets, NN and ShapeNet, and augment them with tri-modal datasets of 3D point clouds, captioning, and language for training.
Experiments show that ULIP-2 demonstrates substantial benefits in three downstream tasks: zero-shot 3D classification, standard 3D classification with finetuning, and 3D (3D
arXiv Detail & Related papers (2023-05-14T23:14:09Z) - Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.