Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection
- URL: http://arxiv.org/abs/2511.18385v1
- Date: Sun, 23 Nov 2025 10:25:24 GMT
- Title: Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection
- Authors: Chuang Peng, Renshuai Tao, Zhongwei Ren, Xianglong Liu, Yunchao Wei,
- Abstract summary: We introduce DualXrayBench, the first comprehensive benchmark for X-ray inspection, together with a caption corpus of 45,613 dual-view image pairs across 12 categories with corresponding captions. Building on these data, we propose the Geometric (cross-view)-Semantic (cross-modality) Reasoner (GSR), a multimodal model that learns correspondences between cross-view geometry and cross-modal semantics.
- Score: 55.44671451998018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic X-ray prohibited item detection is vital for security inspection and has been widely studied. Traditional methods rely on the visual modality alone and often struggle with complex threats. While recent studies incorporate language to guide detection on single-view images, human inspectors typically use dual-view images in practice. This raises the question: can the second view provide constraints similar to a language modality? In this work, we introduce DualXrayBench, the first comprehensive benchmark for X-ray inspection that includes multiple views and modalities. It supports eight tasks designed to test cross-view reasoning. In DualXrayBench, we introduce a caption corpus consisting of 45,613 dual-view image pairs across 12 categories with corresponding captions. Building upon these data, we propose the Geometric (cross-view)-Semantic (cross-modality) Reasoner (GSR), a multimodal model that jointly learns correspondences between cross-view geometry and cross-modal semantics, treating the second-view image as a "language-like modality". To enable this, we construct the GSXray dataset with structured Chain-of-Thought sequences: <top>, <side>, <conclusion> (see the sketch below). Comprehensive evaluations on DualXrayBench demonstrate that GSR achieves significant improvements across all X-ray tasks, offering a new perspective for real-world X-ray inspection.
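The abstract specifies only the tag structure of the GSXray reasoning sequences. As a minimal sketch, the snippet below shows how a <top>/<side>/<conclusion> training target might be serialized for one dual-view pair; the field names, file paths, and caption text are illustrative assumptions, not the dataset's actual schema.

```python
# A minimal sketch of a GSXray-style training record. Only the
# <top>, <side>, <conclusion> structure comes from the paper; every
# field name and value below is an illustrative assumption.

def build_cot_target(top_caption: str, side_caption: str, conclusion: str) -> str:
    """Serialize a structured Chain-of-Thought sequence for one dual-view pair."""
    return (
        f"<top>{top_caption}</top>"
        f"<side>{side_caption}</side>"
        f"<conclusion>{conclusion}</conclusion>"
    )

record = {
    "top_image": "images/top/000001.png",    # hypothetical paths
    "side_image": "images/side/000001.png",
    "target": build_cot_target(
        "A folding knife lies flat near the bag's zipper.",
        "The side view shows the blade edge-on, partially occluded.",
        "Prohibited item: folding knife, confirmed by both views.",
    ),
}
print(record["target"])
```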
Related papers
- GLEAM: Learning to Match and Explain in Cross-View Geo-Localization [66.11208984986813]
Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location.
We present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities, including UAV imagery, street maps, panoramic views, and ground photographs, by aligning them exclusively with satellite imagery.
To address the lack of interpretability in traditional CVGL methods, we propose GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning.
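As a rough illustration of aligning query views against a shared satellite reference space (not GLEAM's actual code), the sketch below ranks a satellite gallery for one query-view embedding; the embedding size and retrieval function are assumptions.

```python
# Hypothetical cross-view retrieval against a satellite gallery.
import torch
import torch.nn.functional as F

def retrieve_satellite(query_emb: torch.Tensor, sat_gallery: torch.Tensor) -> torch.Tensor:
    """Rank satellite gallery images by cosine similarity to one query embedding."""
    sims = F.normalize(sat_gallery, dim=-1) @ F.normalize(query_emb, dim=0)
    return sims.argsort(descending=True)  # indices of best-matching satellite images

query = torch.randn(256)          # e.g. a UAV or street-view embedding (assumed dim)
gallery = torch.randn(1000, 256)  # satellite reference embeddings
print(retrieve_satellite(query, gallery)[:5])
```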
arXiv Detail & Related papers (2025-09-09T07:14:31Z)
- Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training [53.77904429789069]
We present Attention-TNet, a novel Dual-View Co-Training network for accurate dental caries detection.
Attention-TNet starts by employing automated tooth detection to establish two complementary views: a global view from panoramic X-ray images and a local view from cropped tooth images.
To effectively integrate information from both views, we introduce a Gated Cross-View module.
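A minimal sketch of what a gated cross-view fusion step could look like, in the spirit of the Gated Cross-View module named above; the module's real design is not described here, so the gating form and feature shapes are assumptions.

```python
# Hypothetical gated fusion of global (panoramic) and local (tooth-crop) features.
import torch
import torch.nn as nn

class GatedCrossView(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, global_feat: torch.Tensor, local_feat: torch.Tensor):
        # Learn per-channel weights deciding how much panoramic context
        # to mix into each cropped-tooth feature.
        g = self.gate(torch.cat([global_feat, local_feat], dim=-1))
        return g * global_feat + (1 - g) * local_feat

fused = GatedCrossView()(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```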
arXiv Detail & Related papers (2025-08-28T14:13:26Z)
- Self-Supervised Multiview Xray Matching [4.033064933995391]
Current methods often struggle to establish robust correspondences between different X-ray views.
We present a novel self-supervised pipeline that eliminates the need for manual annotation.
Our approach incorporates a transformer-based training phase to accurately predict correspondences across two or more X-ray views.
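The summary describes transformer-based correspondence prediction between views. Below is a minimal sketch of the generic technique, soft patch-to-patch matching between two views; the token shapes and temperature are assumptions, not the pipeline's actual settings.

```python
# Hypothetical soft correspondence matrix between patch tokens of two views.
import torch
import torch.nn.functional as F

def soft_correspondences(tokens_a: torch.Tensor, tokens_b: torch.Tensor, tau: float = 0.05):
    """Return a (Na, Nb) soft-assignment matrix between patch tokens of two views."""
    a = F.normalize(tokens_a, dim=-1)
    b = F.normalize(tokens_b, dim=-1)
    return torch.softmax(a @ b.t() / tau, dim=-1)  # rows: view-A patches over view-B patches

view_a, view_b = torch.randn(196, 128), torch.randn(196, 128)  # dummy patch tokens
matches = soft_correspondences(view_a, view_b)
print(matches.shape, matches[0].sum().item())  # each row sums to 1
```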
arXiv Detail & Related papers (2025-06-30T21:56:14Z)
- Dual-view X-ray Detection: Can AI Detect Prohibited Items from Dual-view X-ray Images like Humans? [78.26435264182763]
We introduce the Large-scale Dual-view X-ray (LDXray) dataset, which consists of 353,646 instances across 12 categories.
To emulate human intelligence in dual-view detection, we propose the Auxiliary-view Enhanced Network (AENet).
Experiments on the LDXray dataset demonstrate that the dual-view mechanism significantly enhances detection performance.
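As a hedged sketch of the general auxiliary-view idea (not AENet's published architecture), the module below lets main-view tokens query the auxiliary view through cross-attention for complementary evidence; all design choices are assumptions.

```python
# Hypothetical auxiliary-view enhancement via cross-attention.
import torch
import torch.nn as nn

class AuxViewEnhancer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, main_feat: torch.Tensor, aux_feat: torch.Tensor):
        # Main-view tokens attend to the auxiliary view for complementary evidence.
        enhanced, _ = self.attn(query=main_feat, key=aux_feat, value=aux_feat)
        return main_feat + enhanced  # residual enhancement of the main view

main, aux = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
print(AuxViewEnhancer()(main, aux).shape)  # torch.Size([2, 100, 256])
```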
arXiv Detail & Related papers (2024-11-27T06:36:20Z)
- Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation [48.723504098917324]
We propose an Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments.
We introduce three novel modules: Latent Space Unifier, Cross-modal Representation Aligner and Text-to-Image Refiner.
Experiments and analyses on IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR against varied state-of-the-art methods.
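Only the module names are given above, so the sketch below is a loose illustration of a unify-then-align step in that spirit; the projection layers, dimensions, and cosine alignment loss are assumptions rather than UAR's actual design.

```python
# Hypothetical unify-then-align step: project image and text features
# into one latent space, then align paired embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifyAlign(nn.Module):
    def __init__(self, img_dim: int = 2048, txt_dim: int = 768, dim: int = 512):
        super().__init__()
        self.unify_img = nn.Linear(img_dim, dim)  # map both modalities into
        self.unify_txt = nn.Linear(txt_dim, dim)  # one shared latent space

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor):
        zi = F.normalize(self.unify_img(img_feat), dim=-1)
        zt = F.normalize(self.unify_txt(txt_feat), dim=-1)
        align_loss = (1 - (zi * zt).sum(-1)).mean()  # cosine alignment of pairs
        return zi, zt, align_loss

_, _, loss = UnifyAlign()(torch.randn(4, 2048), torch.randn(4, 768))
print(loss.item())
```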
arXiv Detail & Related papers (2023-03-28T12:42:12Z)
- Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment [9.265044250068554]
This paper proposes a Joint Image Text Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray images and their radiology reports.
The model was pre-trained on both the global image-sentence level and the local image region-word level for visual-textual matching.
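A minimal sketch of two-level visual-textual matching, global image-sentence plus local region-word, which is the general scheme described above; the scoring functions are simplified assumptions, not JoImTeRNet's exact objectives.

```python
# Hypothetical global and local matching scores for image-report pairs.
import torch
import torch.nn.functional as F

def global_score(img_emb: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
    """Image-sentence similarity at the global level."""
    return F.cosine_similarity(img_emb, sent_emb, dim=-1)

def local_score(region_embs: torch.Tensor, word_embs: torch.Tensor) -> torch.Tensor:
    """Each word attends to its best-matching image region; average over words."""
    sims = F.normalize(word_embs, dim=-1) @ F.normalize(region_embs, dim=-1).t()
    return sims.max(dim=1).values.mean()

img, sent = torch.randn(512), torch.randn(512)
regions, words = torch.randn(36, 512), torch.randn(12, 512)
print(global_score(img, sent).item(), local_score(regions, words).item())
```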
arXiv Detail & Related papers (2021-09-04T22:58:35Z)
- Cross-Modal Contrastive Learning for Abnormality Classification and Localization in Chest X-rays with Radiomics using a Feedback Loop [63.81818077092879]
We propose an end-to-end semi-supervised cross-modal contrastive learning framework for medical images.
We first apply an image encoder to classify the chest X-rays and to generate the image features.
The radiomic features are then passed through another dedicated encoder to act as the positive sample for the image features generated from the same chest X-ray.
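A compact sketch of the contrastive idea described above, where radiomic features serve as positives for image features of the same chest X-ray; the InfoNCE form, dimensions, and temperature are assumptions, and the feedback loop is omitted.

```python
# Hypothetical cross-modal contrastive loss with radiomic positives.
import torch
import torch.nn.functional as F

def radiomics_contrastive(img_feat: torch.Tensor, radiomic_feat: torch.Tensor, tau: float = 0.1):
    """Same-study (image, radiomics) pairs are positives; other studies in the batch are negatives."""
    zi = F.normalize(img_feat, dim=-1)
    zr = F.normalize(radiomic_feat, dim=-1)
    logits = zi @ zr.t() / tau
    labels = torch.arange(zi.size(0))  # diagonal = matching study pairs
    return F.cross_entropy(logits, labels)

print(radiomics_contrastive(torch.randn(16, 128), torch.randn(16, 128)).item())
```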
arXiv Detail & Related papers (2021-04-11T09:16:29Z)
- Image Separation with Side Information: A Connected Auto-Encoders Based Approach [18.18248997032482]
We deal with the problem of separating mixed X-ray images originating from the radiography of double-sided paintings.
We propose a new Neural Network architecture, based upon 'connected' auto-encoders, designed to separate the mixed X-ray image into two simulated X-ray images corresponding to each side.
Experiments show that the proposed approach outperforms other state-of-the-art X-ray image separation methods for art investigation applications.
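A minimal sketch of the "connected" auto-encoder idea under a mixture-consistency assumption, namely that the two separated outputs should sum back to the observed mixed X-ray; the layer sizes and loss form are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical connected auto-encoder: one encoder, two decoders, one
# per painting side, with a mixture-consistency reconstruction loss.
import torch
import torch.nn as nn

class ConnectedAE(nn.Module):
    def __init__(self, n: int = 64 * 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n, 512), nn.ReLU())
        self.dec_side1 = nn.Linear(512, n)  # simulated X-ray of side 1
        self.dec_side2 = nn.Linear(512, n)  # simulated X-ray of side 2

    def forward(self, mixed: torch.Tensor):
        z = self.encoder(mixed)
        x1, x2 = self.dec_side1(z), self.dec_side2(z)
        recon_loss = ((x1 + x2 - mixed) ** 2).mean()  # mixture consistency
        return x1, x2, recon_loss

x1, x2, loss = ConnectedAE()(torch.randn(4, 64 * 64))
print(x1.shape, loss.item())
```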
arXiv Detail & Related papers (2020-09-16T18:39:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.