LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network
using Transformers for Cross-Modal Information Retrieval in Histopathology
Archives
- URL: http://arxiv.org/abs/2203.01445v2
- Date: Fri, 4 Mar 2022 06:08:09 GMT
- Authors: Danial Maleki, H.R. Tizhoosh
- Abstract summary: Cross-modality data retrieval has become a requirement for many domains and disciplines of research.
This study proposes a novel architecture with a new loss term to help represent images and texts in the joint latent space.
- Score: 0.7614628596146599
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The volume of available data has grown dramatically in recent years in many
applications. Furthermore, the age of networks that used multiple modalities
separately has practically ended. Therefore, bidirectional cross-modality data
retrieval has become a requirement for many domains and disciplines of
research. This is especially true in the medical field, as data comes in a
multitude of types, including various types of images and reports as well as
molecular data. Most contemporary works apply cross attention to highlight the
essential elements of an image or text in relation to the other modality and
try to match them together. However, these approaches usually weight the
features of each modality equally, regardless of their importance within their
own modality. In this study, self-attention is proposed as an additional loss
term to enrich the internal representation fed into the cross attention
module. This work suggests a novel architecture with a new loss term to help
represent images and texts in the joint latent space. Experimental results on
two benchmark datasets, i.e. MS-COCO and ARCH, show the effectiveness of the
proposed method.
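The two attention steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's architecture or its loss term: the function names (`attention`, `match_score`), the mean pooling, the cosine score, and the random feature matrices are all assumptions made for illustration. It only shows self-attention applied within each modality before text tokens attend to image regions through cross attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (output, weights)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

def l2_normalize(v):
    return v / np.linalg.norm(v)

def match_score(image_regions, text_tokens):
    """Hypothetical image-text matching score (illustrative only)."""
    # Self-attention: each modality first re-weights its own features.
    img_self, _ = attention(image_regions, image_regions, image_regions)
    txt_self, _ = attention(text_tokens, text_tokens, text_tokens)
    # Cross attention: every text token attends over the image regions.
    txt_to_img, weights = attention(txt_self, img_self, img_self)
    # Compare mean-pooled vectors with cosine similarity.
    score = float(l2_normalize(txt_self.mean(axis=0))
                  @ l2_normalize(txt_to_img.mean(axis=0)))
    return score, weights

# Assumed toy inputs: 5 image regions and 3 text tokens, 8 features each.
rng = np.random.default_rng(0)
image_regions = rng.normal(size=(5, 8))
text_tokens = rng.normal(size=(3, 8))
score, weights = match_score(image_regions, text_tokens)
print(weights.shape)  # one row of region weights per text token
```

In a retrieval setting, a score of this kind would be computed for every candidate image-text pair and the candidates ranked by it, in either direction.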
Related papers
- Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR).
arXiv Detail & Related papers (2024-05-30T03:15:09Z)
- Multimodal Information Interaction for Medical Image Segmentation [24.024848382458767]
We introduce an innovative Multimodal Information Cross Transformer (MicFormer).
It queries features from one modality and retrieves corresponding responses from another, facilitating effective communication between bimodal features.
Compared to other multimodal segmentation techniques, our method outperforms by margins of 2.83 and 4.23, respectively.
arXiv Detail & Related papers (2024-04-25T07:21:14Z)
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z)
- Few-shot Medical Image Segmentation via Cross-Reference Transformer [3.2634122554914]
Few-shot segmentation (FSS) has the potential to address these challenges by learning new categories from a small number of labeled samples.
We propose a novel self-supervised few-shot medical image segmentation network with Cross-Reference Transformer.
Experimental results show that the proposed model achieves good results on both CT dataset and MRI dataset.
arXiv Detail & Related papers (2023-04-19T13:05:18Z)
- Few Shot Medical Image Segmentation with Cross Attention Transformer [30.54965157877615]
We propose a novel framework for few-shot medical image segmentation, termed CAT-Net.
Our proposed network mines the correlations between the support image and query image, limiting them to focus only on useful foreground information.
We validated the proposed method on three public datasets: Abd-CT, Abd-MRI, and Card-MRI.
arXiv Detail & Related papers (2023-03-24T09:10:14Z)
- Modeling Entities as Semantic Points for Visual Information Extraction
in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language
Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Cross-Modality Sub-Image Retrieval using Contrastive Multimodal Image
Representations [3.3754780158324564]
Cross-modality image retrieval is challenging, since images of similar (or even the same) content captured by different modalities might share few common structures.
We propose a new application-independent content-based image retrieval system for reverse (sub-)image search across modalities.
arXiv Detail & Related papers (2022-01-10T19:04:28Z)
- MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake
Detection [80.83725644958633]
Current deepfake generation methods leave discriminative artifacts in the frequency spectrum of fake images and videos.
We present a novel approach, termed MD-CSDNetwork, for combining the features in the spatial and frequency domains to mine a shared discriminative representation.
arXiv Detail & Related papers (2021-09-15T14:11:53Z)
- Domain Siamese CNNs for Sparse Multispectral Disparity Estimation [15.065764374430783]
We propose a new CNN architecture able to do disparity estimation between images from different spectra.
Our method was tested using the publicly available LITIV 2014 and LITIV 2018 datasets.
arXiv Detail & Related papers (2020-04-30T20:29:59Z)