METransformer: Radiology Report Generation by Transformer with Multiple
Learnable Expert Tokens
- URL: http://arxiv.org/abs/2304.02211v1
- Date: Wed, 5 Apr 2023 03:54:28 GMT
- Title: METransformer: Radiology Report Generation by Transformer with Multiple
Learnable Expert Tokens
- Authors: Zhanyu Wang, Lingqiao Liu, Lei Wang, Luping Zhou
- Abstract summary: We propose a method to upgrade the existing "single expert" framework with a transformer-based backbone.
The key design of our method is the introduction of multiple learnable "expert" tokens into both the transformer encoder and decoder.
A metrics-based expert voting strategy is further developed to generate the final report.
- Score: 43.817603311027646
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In clinical scenarios, multi-specialist consultation could significantly
benefit the diagnosis, especially for intricate cases. This inspires us to
explore a "multi-expert joint diagnosis" mechanism to upgrade the existing
"single expert" framework commonly seen in the current literature. To this end,
we propose METransformer, a method to realize this idea with a
transformer-based backbone. The key design of our method is the introduction of
multiple learnable "expert" tokens into both the transformer encoder and
decoder. In the encoder, each expert token interacts with both vision tokens
and other expert tokens to learn to attend to different image regions for image
representation. These expert tokens are encouraged to capture complementary
information by an orthogonal loss that minimizes their overlap. In the decoder,
each attended expert token guides the cross-attention between input words and
visual tokens, thus influencing the generated report. A metrics-based expert
voting strategy is further developed to generate the final report. Through the
multi-expert concept, our model enjoys the merits of an ensemble-based
approach in a manner that is computationally more efficient and
supports more sophisticated interactions among experts. Experimental results
demonstrate the promising performance of our proposed model on two widely used
benchmarks. Last but not least, the framework-level innovation makes our work
ready to incorporate advances in existing "single-expert" models to further
improve its performance.
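For readers who want a concrete picture of the mechanism sketched in the abstract, below is a minimal, hypothetical PyTorch illustration (not the authors' code): learnable expert tokens are concatenated with vision tokens in a standard transformer encoder, and an orthogonality penalty discourages overlap between experts. All module names, dimensions, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertTokenEncoder(nn.Module):
    """Sketch: M learnable "expert" tokens interact with vision tokens
    (and with each other) through ordinary self-attention."""
    def __init__(self, num_experts=6, dim=512, depth=3, heads=8):
        super().__init__()
        # One learnable embedding per expert (initialization is an assumption).
        self.expert_tokens = nn.Parameter(torch.randn(1, num_experts, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_experts = num_experts

    def forward(self, vision_tokens):                       # (B, N, dim)
        experts = self.expert_tokens.expand(vision_tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([experts, vision_tokens], dim=1))
        # Split attended expert tokens from the encoded vision tokens.
        return out[:, :self.num_experts], out[:, self.num_experts:]

def orthogonal_loss(expert_out):
    """Encourage complementary experts: push the pairwise cosine-similarity
    (Gram) matrix of the expert tokens toward the identity."""
    e = F.normalize(expert_out, dim=-1)                     # (B, M, dim)
    gram = e @ e.transpose(1, 2)                            # (B, M, M)
    eye = torch.eye(e.size(1), device=e.device)
    return ((gram - eye) ** 2).mean()
```

In the same spirit, the metrics-based voting could be mimicked by decoding one report per expert token and keeping the candidate that scores highest against the others under an automatic metric; the exact metric and voting rule are not specified in this summary, so that part remains an assumption.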
Related papers
- A Shared Encoder Approach to Multimodal Representation Learning [17.863705872504]
We present a shared encoder framework for multimodal representation learning tailored to the medical domain.
Our approach employs a single set of encoder parameters shared across modalities, augmented with learnable modality features.
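As an illustrative aside (not the paper's released code), one way to realize a shared encoder with learnable modality features might look like the sketch below; the embedding scheme and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SharedModalityEncoder(nn.Module):
    """Sketch: a single set of encoder weights serves every modality;
    a learnable per-modality embedding is added to the input tokens."""
    def __init__(self, num_modalities=2, dim=512, depth=4, heads=8):
        super().__init__()
        self.modality_embed = nn.Embedding(num_modalities, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens, modality_id):                 # tokens: (B, N, dim)
        idx = torch.tensor([modality_id], device=tokens.device)
        mod = self.modality_embed(idx)                      # (1, dim), broadcast below
        return self.encoder(tokens + mod)
```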
arXiv Detail & Related papers (2025-03-03T15:29:26Z)
- Leveraging Mixture of Experts for Improved Speech Deepfake Detection [53.69740463004446]
Speech deepfakes pose a significant threat to personal security and content authenticity.
We introduce a novel approach for enhancing speech deepfake detection performance using a Mixture of Experts architecture.
arXiv Detail & Related papers (2024-09-24T13:24:03Z)
- Unity in Diversity: Multi-expert Knowledge Confrontation and Collaboration for Generalizable Vehicle Re-identification [32.80872775195836]
Generalizable vehicle re-identification (ReID) aims to enable a model well-trained on diverse source domains to adapt broadly to unknown target domains.
Existing methods still face the domain shift problem and have difficulty generalizing accurately to unseen target domains.
This paper proposes a two-stage Multi-expert Knowledge Confrontation and Collaboration (MiKeCoCo) method.
arXiv Detail & Related papers (2024-07-10T04:06:39Z)
- MoVA: Adapting Mixture of Vision Experts to Multimodal Context [38.8308841469793]
We propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism.
In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts.
In the fine-grained stage, we design a mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge.
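A hypothetical sketch of such coarse-to-fine routing (the router, adapter, and top-k choice are all assumptions here, not MoVA's actual design) could be:

```python
import torch
import torch.nn as nn

class CoarseToFineVisionExperts(nn.Module):
    """Sketch: a coarse routing step scores the vision experts from a
    context summary and keeps the top-k; a fine-grained adapter then
    fuses the selected experts' features."""
    def __init__(self, experts, dim=512, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(experts)  # each maps (B, N, dim) -> (B, N, dim)
        self.router = nn.Linear(dim, len(experts))
        self.adapter = nn.Linear(dim, dim)     # stand-in for the MoV-Adapter
        self.top_k = top_k

    def forward(self, x):                                   # (B, N, dim)
        scores = self.router(x.mean(dim=1)).softmax(-1)     # (B, E) routing weights
        weights, idx = scores.topk(self.top_k, dim=-1)
        fused = torch.zeros_like(x)
        for b in range(x.size(0)):
            for w, i in zip(weights[b], idx[b]):
                # Weighted fusion of the selected experts' features.
                fused[b] += w * self.experts[int(i)](x[b:b + 1]).squeeze(0)
        return self.adapter(fused)
```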
arXiv Detail & Related papers (2024-04-19T17:59:48Z)
- MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy of models like SAM, from a substantial 4096 to a more efficient and manageable 64, or even down to 1.
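One plausible reading of that reduction (purely a sketch; the fusion network's real architecture is not given in this summary) is a perceiver-style compressor in which a small set of learnable queries cross-attends to the long expert token sequence:

```python
import torch
import torch.nn as nn

class ExpertTokenCompressor(nn.Module):
    """Speculative sketch: learnable query tokens cross-attend to a long
    sequence of visual-expert tokens (e.g. 4096 from SAM), yielding a
    short fused sequence (e.g. 64)."""
    def __init__(self, dim=512, num_queries=64, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, expert_tokens):                       # (B, 4096, dim)
        q = self.queries.expand(expert_tokens.size(0), -1, -1)
        fused, _ = self.attn(q, expert_tokens, expert_tokens)
        return fused                                        # (B, 64, dim)
```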
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
- LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [80.54244087314025]
We show that better cross-modal alignment can be achieved through the early fusion of linguistic and visual features in a vision Transformer encoder network.
Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
arXiv Detail & Related papers (2021-12-04T04:53:35Z)
- HydraSum -- Disentangling Stylistic Features in Text Summarization using Multi-Decoder Models [12.070474521259776]
We introduce HydraSum, a new summarization architecture that extends the single decoder framework of current models.
Our proposed model encourages each expert, i.e., each decoder, to learn and generate stylistically distinct summaries.
A guided version of the training process can explicitly govern how summary styles are partitioned between the decoders.
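A toy version of the multi-decoder idea (names, gating, and sizes are assumptions, not HydraSum's implementation) might be:

```python
import torch
import torch.nn as nn

class MultiDecoderSummarizer(nn.Module):
    """Sketch: one shared encoder feeds several decoder "experts"; a gate
    mixes their next-token distributions, letting each decoder specialize
    in a distinct summary style."""
    def __init__(self, vocab, dim=512, num_decoders=2, heads=8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # nn.TransformerDecoder deep-copies the layer, so decoders do not share weights.
        self.decoders = nn.ModuleList(
            nn.TransformerDecoder(dec_layer, num_layers=2) for _ in range(num_decoders)
        )
        self.gate = nn.Linear(dim, num_decoders)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src_emb, tgt_emb):        # (B, S, dim), (B, T, dim)
        mem = self.encoder(src_emb)
        g = self.gate(mem.mean(dim=1)).softmax(-1)                   # (B, D)
        logits = [self.out(dec(tgt_emb, mem)) for dec in self.decoders]
        probs = torch.stack([l.softmax(-1) for l in logits], dim=1)  # (B, D, T, V)
        return (g[:, :, None, None] * probs).sum(dim=1)              # mixture over decoders
```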
arXiv Detail & Related papers (2021-10-08T22:49:49Z)
- BERTERS: Multimodal Representation Learning for Expert Recommendation System with Transformer [2.131521514043068]
We introduce a multimodal classification approach for an expert recommendation system (BERTERS).
BERTERS converts text into a vector using Bidirectional Encoder Representations from Transformers (BERT).
In addition, a graph representation technique called ExEm is used to extract candidate features from the co-author network.
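Schematically (a guess at the combination step; feature dimensions and the classifier head are assumptions), the multimodal fusion could be as simple as concatenation:

```python
import torch
import torch.nn as nn

class BertersStyleClassifier(nn.Module):
    """Sketch: concatenate a BERT text vector with an ExEm graph vector
    for each candidate, then classify expertise."""
    def __init__(self, text_dim=768, graph_dim=128, num_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + graph_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_vec, graph_vec):     # (B, 768), (B, 128)
        return self.head(torch.cat([text_vec, graph_vec], dim=-1))
```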
arXiv Detail & Related papers (2020-06-30T12:30:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.