LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network
using Transformers for Cross-Modal Information Retrieval in Histopathology
Archives
- URL: http://arxiv.org/abs/2203.01445v2
- Date: Fri, 4 Mar 2022 06:08:09 GMT
- Title: LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network
using Transformers for Cross-Modal Information Retrieval in Histopathology
Archives
- Authors: Danial Maleki, H.R. Tizhoosh
- Abstract summary: Cross-modality data retrieval has become a requirement for many domains and disciplines of research.
This study proposes a novel architecture with a new loss term to help represent images and texts in the joint latent space.
- Score: 0.7614628596146599
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The volume of available data has grown dramatically in recent years in many
applications. Furthermore, the era of networks that handled each modality
separately has practically ended, so enabling bidirectional cross-modality
data retrieval has become a requirement for many domains and disciplines of
research. This is especially true in the medical field, where data come in a
multitude of types, including various kinds of images and reports as well as
molecular data. Most contemporary works apply cross attention to highlight the
essential elements of an image or text in relation to the other modality and
try to match them together. However, these approaches usually weight the
features of each modality equally, regardless of their importance within their
own modality. This study proposes self-attention as an additional loss term to
enrich the internal representations fed into the cross-attention module,
yielding a novel architecture with a new loss term that helps represent images
and texts in the joint latent space. Experimental results on two benchmark
datasets, i.e. MS-COCO and ARCH, show the effectiveness of the proposed method.
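As a rough sketch of this dual-attention idea in PyTorch (the module layout, mean pooling, and the hinge-based matching loss are illustrative assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionMatcher(nn.Module):
    """Self-attention enriches each modality before cross-attention matching."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # Intra-modal self-attention ("look in-depth"): each modality
        # re-weights its own features by their internal importance.
        img_s, _ = self.img_self(img_tokens, img_tokens, img_tokens)
        txt_s, _ = self.txt_self(txt_tokens, txt_tokens, txt_tokens)
        # Cross attention ("look elsewhere"): text queries attend to image keys.
        txt_c, _ = self.cross(txt_s, img_s, img_s)
        # Mean-pool token sequences into single embeddings for retrieval.
        return img_s.mean(1), txt_s.mean(1), txt_c.mean(1)

def matching_loss(a, b, margin=0.2):
    """Hinge-based triplet loss over in-batch negatives (common in retrieval)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    sim = a @ b.t()                        # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)          # positives on the diagonal
    cost = (margin + sim - pos).clamp(min=0)
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    return (cost * mask).mean()            # exclude positives from the hinge

def total_loss(img_s, txt_s, txt_c, lam=0.5):
    # Cross-attention matching loss plus a self-attention term that rewards
    # modality-internal representations that already align -- one reading of
    # "self-attention as an additional loss term".
    return matching_loss(img_s, txt_c) + lam * matching_loss(img_s, txt_s)
```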
Related papers
- Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR)
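A loose sketch of such a bidirectional cycle-consistency objective (the linear mappings, in-batch classification target, and temperature are assumptions; it follows the FMC/RMR description above only approximately):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CycleConsistency(nn.Module):
    """Forward mapping classification (FMC) + reverse mapping regression (RMR),
    sketched as in-batch classification one way and regression back."""
    def __init__(self, dim=256):
        super().__init__()
        self.fwd = nn.Linear(dim, dim)   # image -> text space
        self.rev = nn.Linear(dim, dim)   # text space -> image space

    def forward(self, img_feat, txt_feat):
        # FMC: a forward-mapped image feature should classify (match)
        # its paired report among all reports in the batch.
        logits = F.normalize(self.fwd(img_feat), dim=-1) @ \
                 F.normalize(txt_feat, dim=-1).t()
        target = torch.arange(img_feat.size(0), device=img_feat.device)
        fmc = F.cross_entropy(logits / 0.07, target)
        # RMR: mapping back to image space should reconstruct the
        # original feature, closing the cycle.
        rmr = F.mse_loss(self.rev(self.fwd(img_feat)), img_feat)
        return fmc + rmr
```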
arXiv Detail & Related papers (2024-05-30T03:15:09Z)
- MultiADE: A Multi-domain Benchmark for Adverse Drug Event Extraction [11.458594744457521]
Active adverse event surveillance monitors Adverse Drug Events (ADE) from different data sources.
One unanswered question is how far we are from having a single ADE extraction model that is effective on various types of text.
We contribute to answering this question by building a multi-domain benchmark for adverse drug event extraction, which we named MultiADE.
arXiv Detail & Related papers (2024-05-28T09:57:28Z)
- Multimodal Information Interaction for Medical Image Segmentation [24.024848382458767]
We introduce an innovative Multimodal Information Cross Transformer (MicFormer)
It queries features from one modality and retrieves corresponding responses from another, facilitating effective communication between bimodal features.
Compared to other multimodal segmentation techniques, our method outperforms them by margins of 2.83 and 4.23, respectively.
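A minimal sketch of this query/retrieve interaction between two modalities (a hypothetical module; PyTorch's built-in multi-head attention stands in for the paper's cross transformer):

```python
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Each modality queries the other and retrieves matching responses."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a_from_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (batch, tokens, dim) features from two modalities,
        # e.g. CT and MR slices flattened into token sequences.
        a2, _ = self.a_from_b(feat_a, feat_b, feat_b)  # A queries B
        b2, _ = self.b_from_a(feat_b, feat_a, feat_a)  # B queries A
        return feat_a + a2, feat_b + b2                # residual fusion
```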
arXiv Detail & Related papers (2024-04-25T07:21:14Z)
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z)
- Few Shot Medical Image Segmentation with Cross Attention Transformer [30.54965157877615]
We propose a novel framework for few-shot medical image segmentation, termed CAT-Net.
Our proposed network mines the correlations between the support image and query image, limiting them to focus only on useful foreground information.
We validated the proposed method on three public datasets: Abd-CT, Abd-MRI, and Card-MRI.
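A hypothetical sketch of query pixels attending only to foreground support features (zeroing masked support features is a simplification; CAT-Net's actual design may differ):

```python
import torch.nn as nn

class SupportQueryAttention(nn.Module):
    """Query-image features attend to the foreground of the support image."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_feat, support_feat, support_mask):
        # query_feat/support_feat: (B, C, H, W); support_mask: (B, 1, H, W)
        b, c, h, w = query_feat.shape
        q = query_feat.flatten(2).transpose(1, 2)               # (B, HW, C)
        # Zero out background support features so attention focuses
        # on useful foreground information.
        kv = (support_feat * support_mask).flatten(2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(b, c, h, w)
```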
arXiv Detail & Related papers (2023-03-24T09:10:14Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
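A minimal sketch of the semantic-point idea: one head locates entity centers as a heatmap while a parallel head attaches a semantic embedding to each location (layer names and shapes are assumptions):

```python
import torch
import torch.nn as nn

class SemanticPointHead(nn.Module):
    """Entities as semantic points: center heatmap + per-point semantics."""
    def __init__(self, in_ch=256, emb_dim=128):
        super().__init__()
        self.center = nn.Conv2d(in_ch, 1, 1)           # center-point heatmap
        self.semantic = nn.Conv2d(in_ch, emb_dim, 1)   # attribute/relation embedding

    def forward(self, feat):
        heat = torch.sigmoid(self.center(feat))        # (B, 1, H, W)
        emb = self.semantic(feat)                      # (B, emb_dim, H, W)
        # Peaks in `heat` locate entity centers; the embedding at each peak
        # carries the semantic information describing that entity.
        return heat, emb
```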
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
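A rough sketch of a CNN-Transformer hybrid that optionally fuses a prior image with the current one (the stand-in backbone and temporal embedding are assumptions, not BioViL-T's actual architecture):

```python
import torch
import torch.nn as nn

class MultiImageEncoder(nn.Module):
    """Per-image CNN features, fused across time by a transformer."""
    def __init__(self, dim=512):
        super().__init__()
        self.cnn = nn.Sequential(                       # stand-in backbone
            nn.Conv2d(1, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.time_emb = nn.Embedding(2, dim)            # current vs prior study

    def forward(self, current, prior=None):
        def tokens(img, t):
            f = self.cnn(img).flatten(2).transpose(1, 2)   # (B, 64, dim)
            return f + self.time_emb.weight[t]
        seq = tokens(current, 0)
        if prior is not None:                           # prior study is optional
            seq = torch.cat([seq, tokens(prior, 1)], dim=1)
        return self.temporal(seq)
```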
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Cross-Modality Sub-Image Retrieval using Contrastive Multimodal Image Representations [3.3754780158324564]
Cross-modality image retrieval is challenging, since images of similar (or even the same) content captured by different modalities might share few common structures.
We propose a new application-independent content-based image retrieval system for reverse (sub-)image search across modalities.
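Contrastive objectives of this kind are often implemented as a symmetric InfoNCE loss over co-registered pairs; a generic sketch (assuming paired features z_a, z_b from the two modalities):

```python
import torch
import torch.nn.functional as F

def infonce(z_a, z_b, tau=0.1):
    """Symmetric InfoNCE over a batch of co-registered image pairs from two
    modalities: pulls representations of the same content together so that
    ordinary mono-modal retrieval then works across modalities."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau            # (B, B) scaled similarities
    target = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))
```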
arXiv Detail & Related papers (2022-01-10T19:04:28Z)
- MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake Detection [80.83725644958633]
Current deepfake generation methods leave discriminative artifacts in the frequency spectrum of fake images and videos.
We present a novel approach, termed as MD-CSDNetwork, for combining the features in the spatial and frequency domains to mine a shared discriminative representation.
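Cross-stitch units (Misra et al., 2016) are a known way to mix two feature streams; a minimal sketch applied, hypothetically, to spatial- and frequency-domain branches:

```python
import torch
import torch.nn as nn

class CrossStitch(nn.Module):
    """Learned linear mixing of two feature streams (cross-stitch unit)."""
    def __init__(self):
        super().__init__()
        # 2x2 mixing matrix, initialised near identity so each stream
        # starts out mostly keeping its own features.
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1], [0.1, 0.9]]))

    def forward(self, spatial, frequency):
        mixed_s = self.alpha[0, 0] * spatial + self.alpha[0, 1] * frequency
        mixed_f = self.alpha[1, 0] * spatial + self.alpha[1, 1] * frequency
        return mixed_s, mixed_f            # shared representation emerges in training
```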
arXiv Detail & Related papers (2021-09-15T14:11:53Z)
- Domain Siamese CNNs for Sparse Multispectral Disparity Estimation [15.065764374430783]
We propose a new CNN architecture able to perform disparity estimation between images from different spectral bands.
Our method was tested using the publicly available LITIV 2014 and LITIV 2018 datasets.
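A generic sketch of siamese-style multispectral disparity scoring via feature correlation over horizontal shifts (domain-specific branches; not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class SiameseDisparity(nn.Module):
    """One CNN branch per spectrum, then correlation over candidate shifts."""
    def __init__(self, max_disp=64):
        super().__init__()
        self.max_disp = max_disp
        def branch():
            return nn.Sequential(
                nn.Conv2d(1, 32, 5, padding=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, padding=2), nn.ReLU())
        # Separate weights per spectrum ("domain siamese" rather than
        # strictly shared weights).
        self.rgb_branch, self.ir_branch = branch(), branch()

    def forward(self, rgb, ir):
        f_l, f_r = self.rgb_branch(rgb), self.ir_branch(ir)
        costs = []
        for d in range(self.max_disp):
            shifted = torch.roll(f_r, shifts=d, dims=-1)   # candidate disparity d
            costs.append((f_l * shifted).sum(1))           # correlation score
        return torch.stack(costs, dim=1).argmax(1)         # per-pixel disparity
```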
arXiv Detail & Related papers (2020-04-30T20:29:59Z)