METransformer: Radiology Report Generation by Transformer with Multiple
Learnable Expert Tokens
- URL: http://arxiv.org/abs/2304.02211v1
- Date: Wed, 5 Apr 2023 03:54:28 GMT
- Title: METransformer: Radiology Report Generation by Transformer with Multiple
Learnable Expert Tokens
- Authors: Zhanyu Wang, Lingqiao Liu, Lei Wang, Luping Zhou
- Abstract summary: We propose a method to upgrade the existing "single expert" framework with a transformer-based backbone.
The key design of our method is the introduction of multiple learnable "expert" tokens into both the transformer encoder and decoder.
A metrics-based expert voting strategy is further developed to generate the final report.
- Score: 43.817603311027646
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In clinical scenarios, multi-specialist consultation could significantly
benefit the diagnosis, especially for intricate cases. This inspires us to
explore a "multi-expert joint diagnosis" mechanism to upgrade the existing
"single expert" framework commonly seen in the current literature. To this end,
we propose METransformer, a method to realize this idea with a
transformer-based backbone. The key design of our method is the introduction of
multiple learnable "expert" tokens into both the transformer encoder and
decoder. In the encoder, each expert token interacts with both vision tokens
and other expert tokens to learn to attend to different image regions for image
representation. These expert tokens are encouraged to capture complementary
information by an orthogonal loss that minimizes their overlap. In the decoder,
each attended expert token guides the cross-attention between input words and
visual tokens, thus influencing the generated report. A metrics-based expert
voting strategy is further developed to generate the final report. Through the
multi-expert concept, our model enjoys the merits of an ensemble-based
approach in a manner that is computationally more efficient and
supports more sophisticated interactions among experts. Experimental results
demonstrate the promising performance of our proposed model on two widely used
benchmarks. Last but not least, the framework-level innovation makes our work
ready to incorporate advances in existing "single-expert" models to further
improve its performance.
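For readers who want a concrete picture of the mechanism sketched in the abstract, below is a minimal, hypothetical PyTorch illustration (not the authors' code): learnable expert tokens are concatenated with vision tokens in a standard transformer encoder, and an orthogonality penalty discourages overlap between experts. All module names, dimensions, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertTokenEncoder(nn.Module):
    """Sketch: M learnable "expert" tokens interact with vision tokens
    (and with each other) through ordinary self-attention."""
    def __init__(self, num_experts=6, dim=512, depth=3, heads=8):
        super().__init__()
        # One learnable embedding per expert (initialization is an assumption).
        self.expert_tokens = nn.Parameter(torch.randn(1, num_experts, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_experts = num_experts

    def forward(self, vision_tokens):                       # (B, N, dim)
        experts = self.expert_tokens.expand(vision_tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([experts, vision_tokens], dim=1))
        # Split attended expert tokens from the encoded vision tokens.
        return out[:, :self.num_experts], out[:, self.num_experts:]

def orthogonal_loss(expert_out):
    """Encourage complementary experts: push the pairwise cosine-similarity
    (Gram) matrix of the expert tokens toward the identity."""
    e = F.normalize(expert_out, dim=-1)                     # (B, M, dim)
    gram = e @ e.transpose(1, 2)                            # (B, M, M)
    eye = torch.eye(e.size(1), device=e.device)
    return ((gram - eye) ** 2).mean()
```

In the same spirit, the metrics-based voting could be mimicked by decoding one report per expert token and keeping the candidate that scores highest against the others under an automatic metric; the exact metric and voting rule are not specified in this summary, so that part remains an assumption.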
Related papers
- A Shared Encoder Approach to Multimodal Representation Learning [17.863705872504]
We present a shared encoder framework for multimodal representation learning tailored to the medical domain.
Our approach employs a single set of encoder parameters shared across modalities, augmented with learnable modality features.
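As an illustrative aside (not the paper's released code), one way to realize a shared encoder with learnable modality features might look like the sketch below; the embedding scheme and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SharedModalityEncoder(nn.Module):
    """Sketch: a single set of encoder weights serves every modality;
    a learnable per-modality embedding is added to the input tokens."""
    def __init__(self, num_modalities=2, dim=512, depth=4, heads=8):
        super().__init__()
        self.modality_embed = nn.Embedding(num_modalities, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens, modality_id):                 # tokens: (B, N, dim)
        idx = torch.tensor([modality_id], device=tokens.device)
        mod = self.modality_embed(idx)                      # (1, dim), broadcast below
        return self.encoder(tokens + mod)
```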
arXiv Detail & Related papers (2025-03-03T15:29:26Z)
- Leveraging Mixture of Experts for Improved Speech Deepfake Detection [53.69740463004446]
Speech deepfakes pose a significant threat to personal security and content authenticity.
We introduce a novel approach for enhancing speech deepfake detection performance using a Mixture of Experts architecture.
arXiv Detail & Related papers (2024-09-24T13:24:03Z)
- Unity in Diversity: Multi-expert Knowledge Confrontation and Collaboration for Generalizable Vehicle Re-identification [32.80872775195836]
Generalizable vehicle re-identification (ReID) aims to enable a model well-trained on diverse source domains to adapt broadly to unknown target domains.
Existing methods still face the domain shift problem and have difficulty generalizing accurately to unseen target domains.
This paper proposes a two-stage Multi-expert Knowledge Confrontation and Collaboration (MiKeCoCo) method.
arXiv Detail & Related papers (2024-07-10T04:06:39Z)
- MoVA: Adapting Mixture of Vision Experts to Multimodal Context [38.8308841469793]
We propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism.
In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts.
In the fine-grained stage, we design a mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge.
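A hypothetical sketch of such coarse-to-fine routing (the router, adapter, and top-k choice are all assumptions here, not MoVA's actual design) could be:

```python
import torch
import torch.nn as nn

class CoarseToFineVisionExperts(nn.Module):
    """Sketch: a coarse routing step scores the vision experts from a
    context summary and keeps the top-k; a fine-grained adapter then
    fuses the selected experts' features."""
    def __init__(self, experts, dim=512, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(experts)  # each maps (B, N, dim) -> (B, N, dim)
        self.router = nn.Linear(dim, len(experts))
        self.adapter = nn.Linear(dim, dim)     # stand-in for the MoV-Adapter
        self.top_k = top_k

    def forward(self, x):                                   # (B, N, dim)
        scores = self.router(x.mean(dim=1)).softmax(-1)     # (B, E) routing weights
        weights, idx = scores.topk(self.top_k, dim=-1)
        fused = torch.zeros_like(x)
        for b in range(x.size(0)):
            for w, i in zip(weights[b], idx[b]):
                # Weighted fusion of the selected experts' features.
                fused[b] += w * self.experts[int(i)](x[b:b + 1]).squeeze(0)
        return self.adapter(fused)
```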
arXiv Detail & Related papers (2024-04-19T17:59:48Z)
- MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy of models like SAM, from a substantial 4096 to a more efficient and manageable 64, or even down to 1.
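One plausible reading of that reduction (purely a sketch; the fusion network's real architecture is not given in this summary) is a perceiver-style compressor in which a small set of learnable queries cross-attends to the long expert token sequence:

```python
import torch
import torch.nn as nn

class ExpertTokenCompressor(nn.Module):
    """Speculative sketch: learnable query tokens cross-attend to a long
    sequence of visual-expert tokens (e.g. 4096 from SAM), yielding a
    short fused sequence (e.g. 64)."""
    def __init__(self, dim=512, num_queries=64, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, expert_tokens):                       # (B, 4096, dim)
        q = self.queries.expand(expert_tokens.size(0), -1, -1)
        fused, _ = self.attn(q, expert_tokens, expert_tokens)
        return fused                                        # (B, 64, dim)
```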
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
- LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [80.54244087314025]
We show that better cross-modal alignment can be achieved through the early fusion of linguistic and visual features in a vision Transformer encoder network.
Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
arXiv Detail & Related papers (2021-12-04T04:53:35Z)
- HydraSum -- Disentangling Stylistic Features in Text Summarization using Multi-Decoder Models [12.070474521259776]
We introduce HydraSum, a new summarization architecture that extends the single decoder framework of current models.
Our proposed model encourages each expert, i.e., each decoder, to learn and generate stylistically distinct summaries.
A guided version of the training process can explicitly govern how summary styles are partitioned between the decoders.
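A toy version of the multi-decoder idea (names, gating, and sizes are assumptions, not HydraSum's implementation) might be:

```python
import torch
import torch.nn as nn

class MultiDecoderSummarizer(nn.Module):
    """Sketch: one shared encoder feeds several decoder "experts"; a gate
    mixes their next-token distributions, letting each decoder specialize
    in a distinct summary style."""
    def __init__(self, vocab, dim=512, num_decoders=2, heads=8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # nn.TransformerDecoder deep-copies the layer, so decoders do not share weights.
        self.decoders = nn.ModuleList(
            nn.TransformerDecoder(dec_layer, num_layers=2) for _ in range(num_decoders)
        )
        self.gate = nn.Linear(dim, num_decoders)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src_emb, tgt_emb):        # (B, S, dim), (B, T, dim)
        mem = self.encoder(src_emb)
        g = self.gate(mem.mean(dim=1)).softmax(-1)                   # (B, D)
        logits = [self.out(dec(tgt_emb, mem)) for dec in self.decoders]
        probs = torch.stack([l.softmax(-1) for l in logits], dim=1)  # (B, D, T, V)
        return (g[:, :, None, None] * probs).sum(dim=1)              # mixture over decoders
```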
arXiv Detail & Related papers (2021-10-08T22:49:49Z)
- BERTERS: Multimodal Representation Learning for Expert Recommendation System with Transformer [2.131521514043068]
We introduce a multimodal classification approach for an expert recommendation system (BERTERS).
BERTERS converts text into a vector using Bidirectional Encoder Representations from Transformers (BERT).
In addition, a graph representation technique called ExEm is used to extract candidate features from the co-author network.
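Schematically (a guess at the combination step; feature dimensions and the classifier head are assumptions), the multimodal fusion could be as simple as concatenation:

```python
import torch
import torch.nn as nn

class BertersStyleClassifier(nn.Module):
    """Sketch: concatenate a BERT text vector with an ExEm graph vector
    for each candidate, then classify expertise."""
    def __init__(self, text_dim=768, graph_dim=128, num_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + graph_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_vec, graph_vec):     # (B, 768), (B, 128)
        return self.head(torch.cat([text_vec, graph_vec], dim=-1))
```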
arXiv Detail & Related papers (2020-06-30T12:30:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.